
oats-center / isoblue


ISOBlue Hardware, Avena software, and Deployment Files

License: MIT License

Python 48.42% Shell 12.57% Dockerfile 7.08% JavaScript 0.38% TypeScript 8.95% Jinja 2.13% Rust 20.47%

isoblue's People

Contributors

aaron97neu, abalmos, aultac, dependabot[bot], ericpokeefe, facastiblancor, ksoontho, torin1914, wang701


isoblue's Issues

What containers to build and when?

What containers to build and based on what?

When to build

Current

  • All containers are built on all pushes
  • All containers are built on all PRs

NB: No need to deal with GH action rules

Option 1

  • All containers are built on every push that touches a file in services/*
  • All containers are built for any PR that touches a file in services/*

NB: This requires only one GH workflow file (and only one line needs to be added to enable a new service). DockerHub "does the right thing" when you push exactly the same container to it multiple times.

Option 2

  • A container is only built on pushes that touch a file under services/<service-name>/*
  • A container is only built if a PR touches a file under services/<service-name>/*

NB: This is more "correct" but would need a workflow for each service (I think), and those workflows would largely be exact duplicates of each other. However, this enables services that need a non-standard workflow. We could factor a lot of our custom logic into a bash file that all workflows call.

When to push

Current

  • All built containers are pushed to DockerHub (isoblue) right away.

Option 1

  • Push all "release" containers (push to master, v* branches, and/or v* tags) to DockerHub `isoblue right away.
  • Push all "non-release" containers to DockerHub isobluedev right away for easier PR / branch testing

Option 2

  • Push all "release" containers to DockerHub isoblue right away
  • Push all "non-release" containers to a private Docker server for easier PR / branch testing

Option 3

  • Only push "release" containers to DockerHub isoblue

Automated testing for services and integration

Automated testing of things like CA auth and wireguard will require further work, but to get started we need a way to automatically deploy a new "virtual" ISOBlue to run on something like GitHub Actions or Travis CI.

Wireguard Ansible check if bounce server config updates are needed

In the wireguard portion of Ansible we pause and request that the user copy the wireguard public key into their bounce server. This is great for the initial deploy but does not need to be done every time Ansible runs (i.e., for updating the device), and it can cause unneeded delays if the user switches to another task while waiting for Ansible to finish.

https://github.com/OATS-Group/isoblue-avena/blob/793fe0f2f3c0a002bfbcb078ca2a283e6f7e1b15/ansible/avena/roles/wireguard/tasks/main.yml#L33-L48

One solution would be some sort of ping to the bounce server. If it is reachable, then the server clearly has our public key. However, if the bounce server is unreachable, that does not necessarily mean the config is the issue.

Create a Contributing.md

A contributing.md file serves to define procedures for PRs, releases, etc. It is important for keeping things consistent, as well as for helping newcomers contribute to the project.

Automated testing and `docker-compose.yml` overhaul

Previous discussion: #45 (comment)

Currently our docker-compose.yml files are chained together with a docker-compose -f ... -f ... -f ... command to bring up a series of containers for manual testing. We would like each docker-compose.yml file to be self-contained and (eventually) to run automated tests when brought up (maybe a separate docker-compose-testing.yml?).

For service xxx the docker-compose.yml should have the following properties:

  • When in folder isoblue-avena/services/xxx, docker-compose {build up down etc} should work
  • Any dependent containers are not built from adjacent folders, but pulled as the latest good version from DockerHub (which is automatically updated via GitHub Actions)
  • Ability to run automated tests on this container

@abalmos This is your vision. Any other comments?

auto detection of reboot required fails

After installing a new kernel, there should be a /var/run/reboot-required file. For some reason that file does not exist and the reboot does not happen. For the time being, the avena role always reboots.

admin/controller container? make base read-only?

This is not well thought out; however, could we somehow make the core Avena install read-only? Only docker containers could be downloaded and started. We ought to be able to run SSH in a container and give that container enough access to the system that you can admin the entire thing from that container.

The idea here is the base Avena doesn't change very often and deals with the minimal set of hardware issues it needs to. The rest is easily upgradeable from remote via docker deploys.

This SSH container would need to deal with CA auth

GPSd socket connection to host

Hard-coding the host IP in kafka-gps-log.py is not a good solution, as the host IP seen from inside docker can change.

The optimal choice is to use host.docker.internal and have Docker resolve it to the host. However, as mentioned here, Linux won't have this feature until Docker version 20.10.

Maybe the current workaround is to use network_mode: host in docker-compose.yml and hard-code localhost in kafka-gps-log.py.
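If we go the network_mode: host route, here is a minimal sketch of what the gpsd side of kafka-gps-log.py could look like, using gpsd's JSON protocol on its default port 2947 (in practice the host/port would presumably come from an env var):

# Sketch only: with network_mode: host, gpsd on the host is reachable at localhost:2947.
import json
import socket

GPSD_HOST = "localhost"  # only valid under network_mode: host
GPSD_PORT = 2947

with socket.create_connection((GPSD_HOST, GPSD_PORT)) as sock:
    # Ask gpsd to stream JSON reports
    sock.sendall(b'?WATCH={"enable":true,"json":true}\n')
    for line in sock.makefile("r"):
        report = json.loads(line)
        if report.get("class") == "TPV":
            print(report.get("time"), report.get("lat"), report.get("lon"))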

CAN Watchdog "Stay Online" Flag

From meeting with @aultac on 8/7/2020

There should be some sort of flag or mechanism to ask the ISOBlue to stay online even if no CAN is detected. This would be very helpful to keep demos and deploys from being interrupted.

There are several ways we could do this; however, containerizing can-watchdog may add some complications. Would using DBUS be feasible?

This likely should be done in conjunction with #51

Force one partition per topic for Kafka

Current docker containers start up with default settings, which assign each topic an unknown number of partitions. This will cause consumers to consume from multiple partitions and produce out-of-order messages if not handled correctly.
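One possible way to pin this down (not necessarily what we end up doing) is to create the topics explicitly with a single partition instead of relying on broker auto-creation. A sketch with kafka-python's admin client, assuming a broker at localhost:9092 and illustrative topic names:

# Sketch: create topics explicitly with exactly one partition each.
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Topic names here are illustrative, not the actual ISOBlue topic list.
topics = [
    NewTopic(name=name, num_partitions=1, replication_factor=1)
    for name in ("gps", "can0", "can1")
]
admin.create_topics(new_topics=topics, validate_only=False)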

Control LED status lights

In the bitbake days of ISOBlue we had a C program run as a systemd service that controlled the LED lights on the board, allowing us to indicate internet connectivity and general device health, among other things. This was helpful for troubleshooting in the field, as the operator telling us the LED status was an easy and informative first step in debugging.

We have discussed containerizing things (like NetworkManager) that would have to break some of the containerization principles, and if we choose to go down that path, this might be a good first one to get our feet wet. On the old kernel, the LED status was controlled by writing a value to a file, as sketched below.
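For reference, a minimal sketch of what the file-writing side could look like on a current kernel through the standard sysfs LED class; the LED name is hypothetical and depends on the board's device tree, and the timer trigger assumes ledtrig-timer is available:

# Sketch of driving a board LED through the kernel LED class in sysfs.
from pathlib import Path

LED = Path("/sys/class/leds/user-led1")  # hypothetical LED name

def set_led(on: bool) -> None:
    # 0 turns the LED off; any value up to max_brightness turns it on
    (LED / "brightness").write_text("1" if on else "0")

def blink(delay_ms: int = 500) -> None:
    # The 'timer' trigger makes the kernel blink the LED without userspace help
    (LED / "trigger").write_text("timer")
    (LED / "delay_on").write_text(str(delay_ms))
    (LED / "delay_off").write_text(str(delay_ms))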

Thingsboard overload

The current engine-rpm-pub.py sends seemingly high-rate engine rpm data that overloads Thingsboard. It is unclear whether the engine rpm messages from the bus are too much or something is up with Kafka.

Need to investigate.

Better design needed: TASK [ca : generate host certificate]

This one is very important and does not follow Ansible best practices.

Issues:

  1. Not idempotent: Always re-generates the certificate.
    - Should we only do this when near expiration?
    - Should Ansible be generating these at all? Maybe something like Vault (can it do it?) should generate / store them, and Ansible just grabs the stored one and ensures that it is on the node. It seems like a business decision to regenerate/extend the node's trust.

  2. Major hack:
    - We use with_items with one item to enable a double template expansion ... this is needed because we fetch the host key from the remote directly (keys are generated on the node and private keys never leave it for security reasons)

Better container logging

Current containers spit out everything they can onto the container logs.

It would be helpful, and more storage-friendly, if the containers (kafka-canX-log, kafka-gps-log, engine-rpm-log) had log levels so that we could filter the container output. A minimal sketch of one approach follows.
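A minimal sketch, assuming we stick with Python's standard logging and set a LOG_LEVEL environment variable per container:

# Sketch: pick the log level from an environment variable (e.g. LOG_LEVEL=INFO)
# so each container's verbosity can be set in docker-compose without code changes.
import logging
import os

level_name = os.environ.get("LOG_LEVEL", "INFO").upper()
logging.basicConfig(
    level=getattr(logging, level_name, logging.INFO),
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)
log = logging.getLogger("kafka-gps-log")

log.debug("raw TPV report received")   # hidden unless LOG_LEVEL=DEBUG
log.info("connected to gpsd")          # shown at the default level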

gpsd to container

It is likely possible to move gpsd into a container.

This should be done at the same time as chrony container.

Implementation idea

  1. base debian image with apt install gpsd or something like: https://hub.docker.com/r/skyhuborg/gpsd
  2. bind mount (through a volume?) /run/chrony.gps0.sock so gpsd can discipline chrony

Gotchas

  1. Pass the USB gps device into the docker container ... this is possible; however, you need to restart the container with the right /dev/ttyUSBX file every time the USB device is plugged in ....
    • That means gpsd-related things still leak into Avena.
    • I have seen setups where udev runs in a container to catch events like this. Seems messy.
    • Can we listen to system udev events through the container with something like pyudev, in the same way we use the system dbus? (See the sketch below.)
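A rough sketch of what the pyudev approach might look like from inside the container; whether the container can actually receive the host's uevents is exactly the open question:

# Rough sketch: watch for USB serial hotplug events with pyudev and note the new
# device node so gpsd can be pointed at it.
import pyudev

context = pyudev.Context()
monitor = pyudev.Monitor.from_netlink(context)
monitor.filter_by(subsystem="tty")

for device in iter(monitor.poll, None):
    if device.action == "add" and "ttyUSB" in (device.device_node or ""):
        print(f"GPS candidate appeared at {device.device_node}")
        # e.g. tell gpsd about it (gpsdctl add /dev/ttyUSBX) -- not shown here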

Goals

  1. Remove extra dependencies from Avena -- not all deploys will want gps.

toradex-apalis installer must be run locally

Doing something like:

$ sudo ./installers/toradex-apalis/make-install-disk.sh /dev/sda

fails because it tries to copy files at the end but the paths are wrong. The script must be run from inside its own directory.

output:

All paritions will be deleted. Do you wish to continue? [y/N]y   
umount: /dev/sda: not mounted.

The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.
Checking that no-one is using this disk right now ... OK

Disk /dev/sda: 14.6 GiB, 15664676864 bytes, 30595072 sectors
Disk model: Cruzer Dial     
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x522718ac

Old situation:

>>> Created a new DOS disklabel with disk identifier 0x8e9f2343.
/dev/sda1: Created a new partition 1 of type 'Linux' and of size 14.6 GiB.
Partition #1 contains a vfat signature.
/dev/sda2: Done.

New situation:
Disklabel type: dos
Disk identifier: 0x8e9f2343

Device     Boot Start      End  Sectors  Size Id Type
/dev/sda1        2048 30595071 30593024 14.6G 83 Linux

The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.
mkfs.fat 4.1 (2017-01-24)
--2020-04-08 14:04:32--  https://cdimage.debian.org/debian-cd/current/armhf/iso-cd/debian-10.3.0-armhf-netinst.iso
Resolving cdimage.debian.org (cdimage.debian.org)... 2001:6b0:19::165, 2001:6b0:19::173, 194.71.11.173, ...
Connecting to cdimage.debian.org (cdimage.debian.org)|2001:6b0:19::165|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://caesar.ftp.acc.umu.se/debian-cd/current/armhf/iso-cd/debian-10.3.0-armhf-netinst.iso [following]
--2020-04-08 14:04:33--  https://caesar.ftp.acc.umu.se/debian-cd/current/armhf/iso-cd/debian-10.3.0-armhf-netinst.iso
Resolving caesar.ftp.acc.umu.se (caesar.ftp.acc.umu.se)... 2001:6b0:19::142, 194.71.11.142
Connecting to caesar.ftp.acc.umu.se (caesar.ftp.acc.umu.se)|2001:6b0:19::142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 488943616 (466M) [application/x-iso9660-image]
Saving to: ‘/tmp/tmp.5p4DCoDpNh/debian-10.3.0-armhf-netinst.iso’

debian-10.3.0-armhf-netinst.iso        100%[============================================================================>] 466.29M  9.08MB/s    in 53s     

2020-04-08 14:05:27 (8.75 MB/s) - ‘/tmp/tmp.5p4DCoDpNh/debian-10.3.0-armhf-netinst.iso’ saved [488943616/488943616]

--2020-04-08 14:05:27--  http://http.us.debian.org/debian/dists/buster/main/installer-armhf/current/images/hd-media/hd-media.tar.gz
Resolving http.us.debian.org (http.us.debian.org)... 2600:3404:200:237::2, 2600:3402:200:227::2, 2620:0:861:1:208:80:154:15, ...
Connecting to http.us.debian.org (http.us.debian.org)|2600:3404:200:237::2|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 24306376 (23M) [application/x-gzip]
Saving to: ‘STDOUT’

-                                      100%[============================================================================>]  23.18M  10.2MB/s    in 2.3s    

2020-04-08 14:05:30 (10.2 MB/s) - written to stdout [24306376/24306376]

Making Avena install u-boot script
mkimage: Can't stat install-avena: No such file or directory
Copying Debian preseed file
cp: cannot stat 'preseed.cfg': No such file or directory
Copying device tree update script
cp: cannot stat 'update-device-tree': No such file or directory
Unmounting installer

ansible min version

Right now 2.7 is the min because of

include_tasks:
   file:
   apply:

can we do it another way?

Python ENV based logger

Node.js has a very nice ENV based logger called debug. This is especially helpful in our containers for writing logs to disk instead of having to chase them in stdout/err. We should find and use a similar library for our Python based containers. A rough sketch of the desired behavior is below.
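Until we pick a library, a rough sketch of the debug-style behavior we want, with namespaces enabled via a DEBUG environment variable (namespace names here are illustrative):

# Sketch of a debug-style, ENV-driven logger: namespaces listed in the DEBUG
# variable (e.g. DEBUG=gps,kafka) get verbose output, everything else stays quiet.
import logging
import os
import sys

_enabled = {ns.strip() for ns in os.environ.get("DEBUG", "").split(",") if ns.strip()}

def get_debug_logger(namespace: str) -> logging.Logger:
    log = logging.getLogger(namespace)
    handler = logging.StreamHandler(sys.stderr)
    handler.setFormatter(logging.Formatter(f"{namespace} %(message)s"))
    log.addHandler(handler)
    log.setLevel(logging.DEBUG if namespace in _enabled else logging.WARNING)
    return log

# Usage: DEBUG=gps python kafka-gps-log.py
log = get_debug_logger("gps")
log.debug("this only shows when the 'gps' namespace is enabled")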

Deploying and Managing Containers

Currently we have no way to manage and deploy docker containers to our Avena devices in the field. In the few instances where I have needed to test containers in the field, I have manually git cloned the repo and used docker-compose to start it by hand. This is not elegant or sustainable.

Possible solutions:

  • Ansible based: Copy a docker-compose file over, possibly using a jinja2 template to configure it. Not elegant, and we are leaning away from using Ansible for anything other than the initial deploy.
  • k8s/k3s/other Kubernetes based approach: The Kubernetes model does not map to our model perfectly and would require us to use Kubernetes in a way it was not originally built for.
  • BalenaEngine: We don't know much about this or whether it is applicable; it appears to be a souped-up docker-compose wrapper. Seems more aimed at being a docker replacement than a container management tool.
  • Homebrew custom solution: needs development effort that could be better spent elsewhere.

@abalmos Are there any other approaches we discussed that I am missing?

Bring Prometheus/Grafana back

Restore the Prometheus/Grafana dashboard functionality. It was lost during the move to oats2 and the new docker-compose strategy.

High CPU usage (possibly) due to Kafka

With all docker containers running, it seems like the 4 cores on apalis-imx6 are nearly maxed out.

It is probable that Kafka causes this, and we need to find a way to start the Kafka and ZooKeeper containers with the addition or removal of arguments like this.

Dealing with "supervisor" requests, .e.g, TSDB user creation

Postgres/timescaledb currently uses "password" as the password for the database. As the ISOBlue lies behind a VPN, no one outside of the VPN should be able to access it regardless of the password. However, this should still be investigated, especially if we want to expose it on the wireguard interface in the future.

Original comment by @abalmos in #24 (comment)

wgAvena will not connect at first

Network traffic FROM the isoblue is needed to create the NAT tunnel, but by default there is none. How can we get the tunnel up from the get-go?

Ansible Wireguard bounce server config does not work with simultaneous deploys

TASK [wireguard : gather the wireguard public key] **********************************************
Tuesday 15 September 2020  13:27:33 -0400 (0:00:00.582)       0:02:45.574 ***** 
ok: [avena-apalis-3]
ok: [avena-apalis-6]

TASK [wireguard : pause] **********************************************
Tuesday 15 September 2020  13:27:34 -0400 (0:00:00.391)       0:02:45.966 ***** 
[wireguard : pause]
Please add:

[Peer]
# Name = avena-apalis-3
PublicKey = XXXXXXXXXXXXXXXXXXXXXXXXXXXXX
AllowedIPs = XXX.XXX.XXX

to your bounce server wiregaurd configuration and then run

$ sudo su
# wg addconf <wg-interface> <(wg-quick strip <wg-interface>)

before hitting <enter> to continue.
:
ok: [avena-apalis-3]

TASK [wireguard : ensure wireguard auto restarts on failure] **********************************************
Tuesday 15 September 2020  13:30:31 -0400 (0:02:56.825)       0:05:42.791 ***** 
ok: [avena-apalis-3]
ok: [avena-apalis-6]

We should be seeing two prompts (one per host) with wireguard public keys in this case.

Chrony to container

It is likely possible to move chrony into a container.

This should be done at the same time as gpsd container.

Implementation idea

  1. base debian image with apt install chrony or something like: https://hub.docker.com/r/geoffh1977/chrony
  2. bind mount (through a volume?) /run/chrony.gps0.sock so gpsd can discipline chrony

Gotchas

  1. Might need to manually ensure systemd-timesyncd is disabled (the chrony apt package does that right now)

Goals

  1. Remove extra dependencies from Avena -- not all deploys will want time sync.

wireguard to container

This might require host networking mode (but I think that is still better than a local install)

Implementation idea

  1. base debian image with apt install wireguard or something like: https://hub.docker.com/r/cmulk/wireguard-docker
  2. container runs with host networking mode, so any interface the container creates is also at the system level.
  3. key is created on container first start ... what is the "backdoor" if key is regenerated somehow?
  4. Can use the standard wg-quick script, we just call it ourselves.

Gotchas

  1. Host kernel still needs wireguard support ... I don't think there is a great way around this.
  2. Can we support wireguard as a gateway with this setup?

Ansible does not install wireguard.

Task "ensure wireguard key exists" from role "avena-wireguard" failed with the following error:

TASK [avena-wireguard : ensure wireguard key exists] ******************************************************************************************************************** fatal: [avena-apalis-dev04]: FAILED! => {"changed": false, "msg": "Unable to change directory before execution: [Errno 2] No such file or directory: '/etc/wireguard'"}
Wireguard was not installed and deployment was halted.

Postgres sync

Postgres/timescaledb is becoming our primary storage layer for data logs. We should consider some schemes for getting this data back to a remote postgres for permanent storage.

This is an interesting project that we may be able to use and/or learn from: https://bucardo.org/Bucardo/

sshd container?

We could have a standalone OpenSSH (sshd) container. By mounting root, using host networking mode, etc., that would give you close to complete control.

You can mount the docker socket, so you could control the host docker daemon from within the container. Docker-in-docker (dind) is a somewhat understood practice.

Then we could have a CA version of that container for CA-based fleets like ours.

Make issues for service migration

We should make issues to track migrating as many of the "core" services into containers as possible. Each issue should indicate one thing to migrate and a short plan of attack.

License needed

We need a license for this repo. Not something urgent though.

GPS Data logging completeness

We are currently only logging time/lat/lng data from the gps module. According to man 8 gpsd the following additional data is also available:

  • Horizontal uncertainty (meters)
  • Altitude (meters)
  • Altitude uncertainty (meters)
  • Course (degrees from true north)
  • Course uncertainty (degrees)
  • Speed (meters per second)
  • Speed uncertainty (meters per second)
  • Climb (meters per second)
  • Climb uncertainty (meters per second)
  • Device name

Likely all of these should be logged, even if they will not be uploaded to OADA or wherever. However, I noticed that with the gps2tsdb container, some of them are null. I am unsure whether our GPS module does not support them or it is an issue with the Python library, but we should discuss how to record values that are not available (see the sketch below).
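A minimal sketch of reading the optional TPV fields defensively so missing values land as None/NULL instead of breaking the insert; field names follow gpsd's JSON TPV report, while the row layout is hypothetical:

# Sketch: pull optional TPV fields with .get() so anything the receiver does not
# report simply becomes None (NULL in the database).
def tpv_to_row(report: dict) -> tuple:
    return (
        report.get("time"),
        report.get("lat"),
        report.get("lon"),
        report.get("alt"),     # altitude (m)
        report.get("epx"),     # longitude uncertainty (m)
        report.get("epy"),     # latitude uncertainty (m)
        report.get("epv"),     # altitude uncertainty (m)
        report.get("track"),   # course over ground (deg, true north)
        report.get("speed"),   # speed (m/s)
        report.get("climb"),   # climb rate (m/s)
        report.get("device"),  # device name
    )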

Ansible Pull for periodic updates

As the ISOBlue devices in the field tend to be offline as much as they are online, updating the ISOBlues with the typical 'push' Ansible mentality involves waiting for each device to come online and pushing the update before it goes back offline, which is quite inconvenient and time consuming. At some point we should equip the ISOBlues with an Ansible Pull style updating system, where each ISOBlue periodically applies the most recent stable version of the code automatically.

avena-wireguard: MTU when default route interface changes

Wireguard sets the wg interface with an MTU based off of the default route's MTU when first started. However, when the default route interface changes, the MTU is not updated. This can cause connectivity issues.

Should we enable tcp_mtu_probing? This seems to have its own issues...
An /etc/network/if-up.d script to re-compute and set the MTU?

Punting on the issue for now ... we will only have cell-based ones at first.

Apalis Installer does not work when run from another folder

Step 3 of the Apalis installer states:

Make installer disk:
$ ./installers/toradex-apalis/make-install-disk.sh DEVICE_FILE

Running this from the root directory of the project, as the installer guide states, yields the following errors:

╭─aaron@TRANSLTR ~/code/OATS/isoblue-avena ‹field-fixes*› 
╰─$ sudo ./installers/toradex-apalis/make-install-disk.sh /dev/sdb
...
Making Avena install u-boot script
mkimage: Can't stat install-avena: No such file or directory
Copying Debian preseed file
cp: cannot stat 'preseed.cfg': No such file or directory
Copying device tree update script
cp: cannot stat 'update-device-tree': No such file or directory
Unmounting installer

However, after cd'ing into the toradex-apalis folder and running the script again, it proceeds without errors:

╭─aaron@TRANSLTR ~/code/OATS/isoblue-avena/installers/toradex-apalis ‹field-fixes*› 
╰─$ sudo ./make-install-disk.sh /dev/sdb
...
Making Avena install u-boot script
Image Name:   Install Avena on Apalis
Created:      Tue Jun 23 17:26:02 2020
Image Type:   PowerPC Linux Script (uncompressed)
Data Size:    519 Bytes = 0.51 KiB = 0.00 MiB
Load Address: 00000000
Entry Point:  00000000
Contents:
   Image 0: 511 Bytes = 0.50 KiB = 0.00 MiB
Copying Debian preseed file
Copying device tree update script
Unmounting installer

Three things should be changed

  1. The documentation should be changed to reflect that this script should be run from the toradex-apalis folder (if the next two fixes cannot be done immediately)
  2. The script should be modified to handle being invoked from directories other than its own
  3. If there are errors while running the script it should report clearly and/or return an error state

Move networking into service

... or at least some of it.

Implementation idea

Talk to both ModemManager and NetworkManager via DBUS to configure, start, and stop the various networking interfaces. A minimal sketch of a first step follows.
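A minimal sketch of a first step, assuming the same python-dbus library we already use elsewhere and NetworkManager's documented DBUS API; it only enumerates devices, and configuring connections would build on this:

# Sketch: enumerate NetworkManager's devices over the system DBUS.
import dbus

bus = dbus.SystemBus()
nm = bus.get_object("org.freedesktop.NetworkManager", "/org/freedesktop/NetworkManager")
props = dbus.Interface(nm, "org.freedesktop.DBus.Properties")

for dev_path in props.Get("org.freedesktop.NetworkManager", "Devices"):
    dev = bus.get_object("org.freedesktop.NetworkManager", dev_path)
    dev_props = dbus.Interface(dev, "org.freedesktop.DBus.Properties")
    iface = dev_props.Get("org.freedesktop.NetworkManager.Device", "Interface")
    dev_type = dev_props.Get("org.freedesktop.NetworkManager.Device", "DeviceType")
    print(iface, dev_type)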

Goals

  1. Establish the cell connection (move all cell configuration away from core?)
  2. Scan and connect to known WiFi
  3. Broadcast a WiFi network with preconfigured password (at least ssh should be available over this interface)
  4. (Maybe) Provide a webservice to configure the above -- so we can get Chromecast-style connecting to local WiFi.

Permission for make-install-disk.sh

Right now if you run ./make-install-disk.sh /dev/diskname, the script will output:

All paritions will be deleted. Do you wish to continue? [y/N]y
umount: /dev/sdb: not mounted.
umount: /dev/sdb1: not mounted.
sfdisk: cannot open /dev/sdb: Permission denied
sfdisk: cannot open /dev/sdb: Permission denied
mkfs.fat 4.1 (2017-01-24)
mkfs.vfat: unable to open /dev/sdb1: Permission denied
mount: only root can do that
--2020-04-04 17:39:55--  https://cdimage.debian.org/debian-cd/current/armhf/iso-cd/debian-10.3.0-armhf-netinst.iso
Resolving cdimage.debian.org (cdimage.debian.org)... 194.71.11.173, 194.71.11.165, 2001:6b0:19::165, ...
Connecting to cdimage.debian.org (cdimage.debian.org)|194.71.11.173|:443... connected.
...

The script ignores the permission denied messages and keeps doing the rest. It should prompt the user to run it with sudo and exit.

Refactor docker-compose.yml

The current docker-compose.yml and .env are very platform-specific. We should come up with a way to improve this.

Investigate new DBUS Python Libs

The current pyGObject based DBUS lib used with gps2tsdb requires hundreds of MB of unrelated dependencies. If we continue to use Python for most of our containers and DBUS as our main communication bus, this will become unsustainable soon.

Options:

  • Find out what files within pyGObject/pyCairo are actually dependencies needed by python3-dbus and only download those. Will require significant effort to find these.
  • Use another library. Possible alternatives:
    • Jeepney: a new, actively developed option. Very unstable and appears to have some weird issues (see the sketch after this list).
    • python-dbus-next: a redux of python-dbus, which had fallen out of development; also very new.
    • Others?
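For comparison, a rough, untested sketch of what a Jeepney-based call could look like using its pure-Python blocking I/O, here just listing the names on the system bus:

# Untested sketch based on Jeepney's documented blocking API.
from jeepney import DBusAddress, new_method_call
from jeepney.io.blocking import open_dbus_connection

dbus_daemon = DBusAddress(
    "/org/freedesktop/DBus",
    bus_name="org.freedesktop.DBus",
    interface="org.freedesktop.DBus",
)

connection = open_dbus_connection(bus="SYSTEM")
reply = connection.send_and_get_reply(new_method_call(dbus_daemon, "ListNames"))
print(reply.body[0])  # tuple containing the list of bus names
connection.close()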

canwatchdog to container

Implementation idea

  1. base debian docker image with the watchdog script
  2. use dbus to issue a systemd sleep call (see the sketch below)
  3. use normal can-utils to bring up the can interfaces --- web service to configure that ... maybe this should be part of the "controller" service.
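A sketch of step 2, assuming python-dbus and access to the host's system bus from inside the container (e.g. a bind-mounted /run/dbus socket):

# Sketch: ask systemd-logind to suspend the machine over the system DBUS.
# Suspend(False) means "do not bypass inhibitors".
import dbus

bus = dbus.SystemBus()
logind = bus.get_object("org.freedesktop.login1", "/org/freedesktop/login1")
manager = dbus.Interface(logind, "org.freedesktop.login1.Manager")
manager.Suspend(False)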

Gotchas

  1. Need to use host networking mode to access the can interfaces

Containerized CAN to timescale

A canbus to postgres container.

A few ideas:

  1. host networking mode and a simple python script to read from SocketCAN in chunks and write to timescale (a rough sketch is at the end of this issue)
  2. socketcand (on host or in a host-mode container?) + a python container to read from socketcand and write to timescale
  3. can-to-nanomsg (on host or in a host-mode container?) + a python container to read from nanomsg and write to timescale

Option 3 has the advantage of letting us try out nanomsg where we thought DBUS might work well for us.
Option 2 has the advantage of using existing socketcan code.
Option 1 has the advantage of fewer moving parts.

Current thinking: Try out option 3 as a learning experience around nanomsg. Most of the core code is needed for all options, so we can always "downgrade" if we experience issues.

Current thinking 2: We can keep running our candump loggers for a while too ... as a backup until we trust our new setup.
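A minimal sketch of option 1, under some assumptions: python-can and psycopg2 are available, the container runs with host networking so can0 is visible, and the table name/columns and connection string are hypothetical:

# Sketch of option 1: read frames from SocketCAN with python-can and batch them
# into TimescaleDB with psycopg2.
import can
import psycopg2

conn = psycopg2.connect("dbname=avena user=postgres host=localhost")  # hypothetical DSN
cur = conn.cursor()
bus = can.interface.Bus(channel="can0", bustype="socketcan")

batch = []
while True:
    msg = bus.recv(timeout=1.0)  # returns None on timeout
    if msg is not None:
        batch.append((msg.timestamp, msg.arbitration_id, psycopg2.Binary(bytes(msg.data))))
    # flush in chunks, or when the bus goes quiet, so inserts are batched
    if len(batch) >= 100 or (msg is None and batch):
        cur.executemany(
            "INSERT INTO can_frames (time, arbitration_id, data)"
            " VALUES (to_timestamp(%s), %s, %s)",
            batch,
        )
        conn.commit()
        batch.clear()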
