canonical / testflinger

Home Page: https://testflinger.readthedocs.io/en/latest/

License: GNU General Public License v3.0


testflinger's Introduction

What is Testflinger?

Testflinger is a system for orchestrating the time-sharing of access to a pool of target machines.

Each Testflinger system consists of:

  • a web service (called just Testflinger) that provides an API to request jobs by placing them on a queue
  • per-machine agents that wait for jobs to be placed on queues they can service, and then process them

Jobs can be either fully automated scripts that can attempt to complete within the allocated time or interactive shell sessions.

The Testflinger system is particularly useful for sharing finite machine resources between different consumers in a predictable fashion.

Typically, this has been used for managing a test lab where both CI/CD test runs and exploratory testing by human operators are desired.

Documentation

You can find more information and documentation on the Testflinger Documentation Page.

Content of this repository

A full deployment of testflinger consists of the following components:

  • Testflinger Server: The API server, and Web UI
  • Testflinger Agent: Requests and processes jobs from associated queues on the server on behalf of a device
  • Testflinger Device Connectors: Handles provisioning and other device-specific details for each type of device
  • Testflinger CLI: The command-line tool for submitting jobs, checking the status of jobs, and retrieving results

This monorepo is organized in a way that is consistent with the components described above:

└── providers
    ├── server
    ├── agent
    ├── device-connectors
    ├── cli
    └── docs

Github actions

If you need to submit a job to a Testflinger server through a Github action (instead of, for example, using the command-line tool), you can use the submit action in a CI workflow.

The corresponding step in the workflow would look like this:

    - name: Submit job
      uses: canonical/testflinger/.github/actions/submit@v1
      with:
        poll: true
        job-path: ${{ steps.create-job.outputs.job-path }}

This assumes that there is a previous create-job step in the workflow that creates the job file and outputs the path to it, so that it can be used as input to the submit action. Alternatively, you can use the job argument (instead of job-path) to provide the contents of the job inline:

    - name: Submit job
      uses: canonical/testflinger/.github/actions/submit@v1
      with:
        poll: true
        job: |
            ...  # inline YAML for Testflinger job

In the latter case, do remember to use escapes for environment variables in the inline text, e.g. \$DEVICE_IP.

testflinger's People

Contributors

boukeas, cypresslin, gavin-lin, hanhsuan, hum4n0id, jocave, kevinyehk, kissiel, kiya956, mthaddon, mz2, nadzyah, nancyc12, os369510, p-gentili, plars, renovate[bot], rmartin013, simondeziel, tai271828, tang-mm, thp-canonical, xnox, zongminl, zyga


testflinger's Issues

Unhandled stacks of python tracebacks when the agent is unable to resolve testflinger.c.c via DNS

I ran a bunch of tests and ONE of the runs resulted in a ton of cascaded tracebacks. These aren't immediately helpful, and there are some here that could be caught with some exception handling.

For some reason, the agent on this run was unable to resolve testflinger.canonical.com, and that led to all the tracebacks... Just for debugging purposes, perhaps these could be caught and handled with some friendlier messages.

This is the only run out of 30 that had this issue, all using the same agent, so I don't know what the actual problem was. This bug, as noted above, is just about hopefully making those traces a bit more friendly.

bladernr@weavile:~$ testflinger submit --poll 6md.yaml                                                                               [61/61]
Job submitted successfully!                                                                                                                 
job_id: b933b67f-a71c-4917-bc5c-ee846659be62                          
This job is waiting on a node to become available.                                                                                          
Jobs ahead in queue: 14                                               
Jobs ahead in queue: 13                                                                                                                     
Jobs ahead in queue: 12                                                                                                                     
Jobs ahead in queue: 11
Jobs ahead in queue: 10
Jobs ahead in queue: 9
Jobs ahead in queue: 8                                                
Jobs ahead in queue: 7                                                                                                                      
ERROR: 2023-09-29 19:24:01 client.py:61 -- Timeout while trying to communicate with the server.                                             
ERROR: 2023-09-29 19:25:16 client.py:61 -- Timeout while trying to communicate with the server.                                             
Jobs ahead in queue: 6                                                                                                                      
Jobs ahead in queue: 5                                                
Jobs ahead in queue: 4                                                                                                                      
Jobs ahead in queue: 3                                                                                                                      
Jobs ahead in queue: 2                                                                                                                      
ERROR: 2023-09-29 22:10:53 client.py:61 -- Timeout while trying to communicate with the server.                                             
Jobs ahead in queue: 1                                                                                                                      
Jobs ahead in queue: 0                                                                                                                      
ERROR: 2023-09-29 22:46:28 client.py:61 -- Timeout while trying to communicate with the server.                                             
***********************************************                                                                                             
                                                                                                                                            
* Starting testflinger setup phase on multi-3 *                                                                                             
                                                                                                                                            
***********************************************                                                                                             
                                                                                                                                            
Setup                                                                                                                                       
***************************************************
* Starting testflinger provision phase on multi-3 *                                                                                         
                                                                                                                                            
***************************************************                                                                                         
2023-09-30 02:52:12,569 multi-3 INFO: DEVICE AGENT: BEGIN provision                                                                         
2023-09-30 02:52:12,569 multi-3 INFO: DEVICE AGENT: Provisioning device                                                                     
2023-09-30 02:52:12,569 multi-3 INFO: DEVICE AGENT: Creating test jobs                                                                      
2023-09-30 02:52:16,845 multi-3 INFO: DEVICE AGENT: Created job d0f6945c-6903-4522-b26d-a872bcdd72b5                                        
2023-09-30 02:52:21,316 multi-3 INFO: DEVICE AGENT: Created job 9114248e-1d0d-407a-84b0-7580349ba535                                        
2023-09-30 02:52:26,187 multi-3 INFO: DEVICE AGENT: Created job 6a5f97b5-4952-47c1-8b90-21c77aab9fa0                                        
2023-09-30 02:52:30,651 multi-3 INFO: DEVICE AGENT: Created job 32880554-e1c8-4a8f-87a8-e882c8602b8d                                        
2023-09-30 02:52:35,490 multi-3 INFO: DEVICE AGENT: Created job 39cd04f3-2938-42c5-9db1-a026af67d083                                        
2023-09-30 02:52:42,702 multi-3 INFO: DEVICE AGENT: Created job 7789a4a3-c9e3-4fe2-a685-f33b13943514                                        
2023-09-30 02:54:16,828 multi-3 ERROR: DEVICE AGENT: Unable to communicate with specified server.                                           
2023-09-30 02:54:16,828 multi-3 ERROR: DEVICE AGENT: Unable to get status for job 7789a4a3-c9e3-4fe2-a685-f33b13943514                      
2023-09-30 02:54:53,280 multi-3 ERROR: DEVICE AGENT: Job 39cd04f3-2938-42c5-9db1-a026af67d083 failed to allocate, cancelling remaining jobs
2023-09-30 02:55:19,313 multi-3 ERROR: DEVICE AGENT: Unable to communicate with specified server.                                           
2023-09-30 02:55:19,313 multi-3 ERROR: DEVICE AGENT: Unable to cancel job 32880554-e1c8-4a8f-87a8-e882c8602b8d                              
2023-09-30 02:55:19,313 multi-3 ERROR: DEVICE AGENT: Unable to cancel job: 32880554-e1c8-4a8f-87a8-e882c8602b8d                             
Traceback (most recent call last):                                    
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connection.py", line 203, in _new_conn                                               
    sock = connection.create_connection(                              
  File "/usr/local/lib/python3.8/dist-packages/urllib3/util/connection.py", line 60, in create_connection                                   
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):                                                                  
  File "/usr/lib/python3.8/socket.py", line 918, in getaddrinfo                                                                             
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):                                                                 
socket.gaierror: [Errno -2] Name or service not known                 
                                                                                                                                            
The above exception was the direct cause of the following exception:                                                                        
                                                                      
Traceback (most recent call last):                                                                                                          
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 790, in urlopen                                             
    response = self._make_request(                                                                                                          
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 491, in _make_request                                       
    raise new_e                                                                                                                             
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 467, in _make_request                                       
    self._validate_conn(conn)                                                                                                               
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 1092, in _validate_conn                                     
    conn.connect()                                                                                                                          
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connection.py", line 611, in connect                                                 
    self.sock = sock = self._new_conn()                                                                                                     
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connection.py", line 210, in _new_conn                                               
    raise NameResolutionError(self.host, self, e) from e                                                                                    
urllib3.exceptions.NameResolutionError: <urllib3.connection.HTTPSConnection object at 0x7f8913fcc1c0>: Failed to resolve 'testflinger.canonical.com' ([Errno -2] Name or service not known)

The above exception was the direct cause of the following exception:                                                                        

Traceback (most recent call last):                                    
  File "/usr/local/lib/python3.8/dist-packages/requests/adapters.py", line 486, in send                                                     
    resp = conn.urlopen(                                              
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 844, in urlopen                                             
    retries = retries.increment(                                      
  File "/usr/local/lib/python3.8/dist-packages/urllib3/util/retry.py", line 515, in increment                                               
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]                                                           
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='testflinger.canonical.com', port=443): Max retries exceeded with url: /v1/job/32880554-e1c8-4a8f-87a8-e882c8602b8d/action (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f8913fcc1c0>: Failed to resolve 'testflinger.canonical.com' ([Errno -2] Name or service not known)"))

During handling of the above exception, another exception occurred:                                                                         

Traceback (most recent call last):                                    
  File "/usr/local/lib/python3.8/dist-packages/snappy_device_agents/devices/multi/multi.py", line 182, in cancel_jobs
    self.client.cancel_job(job)                                       
  File "/usr/local/lib/python3.8/dist-packages/snappy_device_agents/devices/multi/tfclient.py", line 149, in cancel_job
    self.post(f"/v1/job/{job_id}/action", {"action": "cancel"})                                                                             
  File "/usr/local/lib/python3.8/dist-packages/snappy_device_agents/devices/multi/tfclient.py", line 79, in post
    req = requests.post(uri, json=data, timeout=timeout)                                                                                    
  File "/usr/local/lib/python3.8/dist-packages/requests/api.py", line 115, in post                                                          
    return request("post", url, data=data, json=json, **kwargs)                                                                             
  File "/usr/local/lib/python3.8/dist-packages/requests/api.py", line 59, in request                                                        
    return session.request(method=method, url=url, **kwargs)                                                                                
  File "/usr/local/lib/python3.8/dist-packages/requests/sessions.py", line 589, in request                                                  
    resp = self.send(prep, **send_kwargs)                             
  File "/usr/local/lib/python3.8/dist-packages/requests/sessions.py", line 703, in send                                                     
    r = adapter.send(request, **kwargs)                               
  File "/usr/local/lib/python3.8/dist-packages/requests/adapters.py", line 519, in send                                                     
    raise ConnectionError(e, request=request)                         
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='testflinger.canonical.com', port=443): Max retries exceeded with url: /v1/job/32880554-e1c8-4a8f-87a8-e882c8602b8d/action (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f8913fcc1c0>: Failed to resolve 'testflinger.canonical.com' ([Errno -2] Name or service not known)"))

2023-09-30 02:55:27,110 multi-3 ERROR: DEVICE AGENT: Received status code 400 from server.                                                  
Traceback (most recent call last):                                    
  File "/usr/local/bin/snappy-device-agent", line 8, in <module>                                                                            
    sys.exit(main())                                                  
  File "/usr/local/lib/python3.8/dist-packages/snappy_device_agents/cmd.py", line 59, in main                                               
    raise SystemExit(args.func(args))                                 
  File "/usr/local/lib/python3.8/dist-packages/snappy_device_agents/devices/multi/__init__.py", line 55, in provision
    self.device.provision()                                           
  File "/usr/local/lib/python3.8/dist-packages/snappy_device_agents/devices/multi/multi.py", line 72, in provision
    raise ProvisioningError("Unable to allocate all devices")                                                                               
snappy_device_agents.devices.ProvisioningError: Unable to allocate all devices                                                              

*************************************************                     

* Starting testflinger cleanup phase on multi-3 *                     

*************************************************                     

2023-09-30 02:55:32,868 multi-3 ERROR: DEVICE AGENT: Unable to find multi-job data file, job_list.json not found
complete        
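
As a rough illustration of the friendlier handling this asks for, here is a minimal sketch in Python (the wrapper name and messages are hypothetical; the real request code lives in the agent and device-connector client modules):

    import sys
    import requests

    def post_with_friendly_errors(uri, data, timeout=30):
        """Hypothetical wrapper around requests.post: turn DNS/connection
        failures into one short, readable error instead of a traceback cascade."""
        try:
            return requests.post(uri, json=data, timeout=timeout)
        except requests.exceptions.ConnectionError as exc:
            # NameResolutionError and other connection failures land here
            sys.exit(f"ERROR: unable to reach the Testflinger server at {uri} "
                     f"(check DNS/network connectivity): {exc}")
        except requests.exceptions.Timeout:
            sys.exit(f"ERROR: timed out while trying to communicate with {uri}")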

CLI: UnboundLocalError: local variable 'queues' referenced before assignment

Alex reported this earlier; it looks like we need to handle not just the 404 case, but others as well, now that it's possible we could get a 403, etc.

  testflinger list-queues
  shell: /usr/bin/bash -e {0}
  
Advertised queues on this server:
Traceback (most recent call last):
  File "/snap/testflinger-cli/161/bin/testflinger-cli", line 8, in <module>
    sys.exit(cli())
  File "/snap/testflinger-cli/161/lib/python3.8/site-packages/testflinger_cli/__init__.py", line 47, in cli
    tfcli.run()
  File "/snap/testflinger-cli/161/lib/python3.8/site-packages/testflinger_cli/__init__.py", line 142, in run
    raise SystemExit(self.args.func())
  File "/snap/testflinger-cli/161/lib/python3.8/site-packages/testflinger_cli/__init__.py", line 522, in list_queues
    for name, description in sorted(queues.items()):
UnboundLocalError: local variable 'queues' referenced before assignment
Error: Process completed with exit code 1.
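
A minimal sketch of one way to handle this, assuming the endpoint and CLI structure roughly match what the traceback shows (the function name and endpoint path here are illustrative, not the actual CLI code):

    import sys
    import requests

    def list_queues(server):
        """Ensure `queues` is always bound and report unexpected status codes
        (403, 500, ...) instead of crashing with UnboundLocalError."""
        queues = {}
        try:
            response = requests.get(f"{server}/v1/agents/queues", timeout=30)
        except requests.exceptions.RequestException as exc:
            sys.exit(f"ERROR: could not reach {server}: {exc}")
        if response.status_code == 200:
            queues = response.json()
        elif response.status_code == 404:
            print("Server does not support listing advertised queues")
        else:
            sys.exit(f"ERROR: server returned status {response.status_code}")
        print("Advertised queues on this server:")
        for name, description in sorted(queues.items()):
            print(f" {name} - {description}")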

It would be nice to be able to extend a reservation if needed

Example: I reserve a machine for 48 hours... at 36 hours I realize I need it for another 48 hours to finish my work. It would be nice to be able to tell testflinger to extend my reservation (I have the reservation job id, right?) ... perhaps:

testflinger-cli extend-reservation --timeout=

and the agent would extend the timeout by the specified amount.

Add oemscript for HP oem devices

I did some trial runs using --ubr on an HP laptop and it worked well, so I think we can add another oemscript device connector to support HP devices.

Server Lab - Testflinger is unable to SSH until the machine is pinged...

This is just odd and I'm not sure what is going on here, but most or all of our testflinger deployments are timing out because the agent cannot SSH to the node once MAAS marks it as "deployed". It looks like a strange network issue.

To recreate this, I started a quick 30 second reservation using Noble on the node Yakkey.

As you can see, once MAAS marks the node as deployed, TF begins trying to SSH to yakkey to verify it's operational:

2024-06-13 13:42:12,492 yakkey INFO: DEVICE CONNECTOR: MAAS: 9 minutes passed since deployment.
2024-06-13 13:43:13,487 yakkey INFO: DEVICE CONNECTOR: MAAS: 10 minutes passed since deployment.
2024-06-13 13:44:14,496 yakkey INFO: DEVICE CONNECTOR: MAAS: 11 minutes passed since deployment.
2024-06-13 13:44:15,477 yakkey INFO: DEVICE CONNECTOR: MAAS: Checking if test image booted.
2024-06-13 13:46:15,543 yakkey INFO: DEVICE CONNECTOR: MAAS: 12 minutes passed since deployment.
2024-06-13 13:46:16,484 yakkey INFO: DEVICE CONNECTOR: MAAS: Checking if test image booted.
2024-06-13 13:48:16,519 yakkey INFO: DEVICE CONNECTOR: MAAS: 13 minutes passed since deployment.
2024-06-13 13:48:17,441 yakkey INFO: DEVICE CONNECTOR: MAAS: Checking if test image booted.

Looking at the agent, I first thought perhaps it wasn't able to see the node at all, indicating a failure in the network path, BUT from seeing updates to the ARP table on the agent at that time, it DOES pick up the MAC and IP for the node's interface:

Address                  HWtype  HWaddress           Flags Mask            Iface
10.245.128.1             ether   24:8a:07:72:ca:00   C                     eth0
10.245.128.4             ether   5c:b9:01:9b:59:ac   C                     eth0
10.245.128.14            ether   02:42:0a:f5:80:0e   C                     eth0
10.245.128.1             ether   24:8a:07:72:ca:00   C                     eth0
10.245.128.4             ether   5c:b9:01:9b:59:ac   C                     eth0
Address                  HWtype  HWaddress           Flags Mask            Iface
10.245.128.117           ether   02:42:0a:f5:80:75   C                     eth0
10.245.128.14            ether   02:42:0a:f5:80:0e   C                     eth0
10.245.128.1             ether   24:8a:07:72:ca:00   C                     eth0
10.245.128.4             ether   5c:b9:01:9b:59:ac   C                     eth0
Address                  HWtype  HWaddress           Flags Mask            Iface
10.245.128.117           ether   02:42:0a:f5:80:75   C                     eth0
10.245.128.14            ether   02:42:0a:f5:80:0e   C                     eth0
10.245.130.106           ether   ec:e7:a7:00:2e:e0   C                     eth0
10.245.128.1             ether   24:8a:07:72:ca:00   C                     eth0
10.245.128.4             ether   5c:b9:01:9b:59:ac   C                     eth0
Address                  HWtype  HWaddress           Flags Mask            Iface
10.245.128.117           ether   02:42:0a:f5:80:75   C                     eth0
10.245.128.14            ether   02:42:0a:f5:80:0e   C                     eth0
10.245.130.106           ether   ec:e7:a7:00:2e:e0   C                     eth0
10.245.128.1             ether   24:8a:07:72:ca:00   C                     eth0
10.245.128.4             ether   5c:b9:01:9b:59:ac   C                     eth0

So, seeing that the ARP table has the node's info while TF has still been unable to verify deployment via SSH, I next try SSH directly from the agent to the node:

root@yakkey:/data/testflinger/device-connectors# ssh 10.245.130.106
ssh: connect to host 10.245.130.106 port 22: Connection timed out

Seeing that fail, to check the connection directly I then ping yakkey's public IP from the agent container:

root@yakkey:/data/testflinger/device-connectors# ping -c 50 10.245.130.106
PING 10.245.130.106 (10.245.130.106) 56(84) bytes of data.
64 bytes from 10.245.130.106: icmp_seq=2 ttl=64 time=0.378 ms
64 bytes from 10.245.130.106: icmp_seq=3 ttl=64 time=0.270 ms
64 bytes from 10.245.130.106: icmp_seq=4 ttl=64 time=0.410 ms
64 bytes from 10.245.130.106: icmp_seq=5 ttl=64 time=0.409 ms
64 bytes from 10.245.130.106: icmp_seq=6 ttl=64 time=0.317 ms
^C
--- 10.245.130.106 ping statistics ---
6 packets transmitted, 5 received, 16.6667% packet loss, time 5119ms
rtt min/avg/max/mdev = 0.270/0.356/0.410/0.055 ms

Note that the very first packet is lost, but then ping starts working.

AND after that, SSH now works:

root@yakkey:/data/testflinger/device-connectors# ssh 10.245.130.106
The authenticity of host '10.245.130.106 (10.245.130.106)' can't be established.
ECDSA key fingerprint is SHA256:3e5eaA2rjFqApRNO//ziCxx/2qTSvNI8qbcBL9+7jug.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added '10.245.130.106' (ECDSA) to the list of known hosts.
Please login as the user "ubuntu" rather than the user "root".

Connection to 10.245.130.106 closed.

I tried a second deployment with the same result: TF was not able to SSH successfully and determine that the node was deployed until after I pinged the machine from the agent.

This time I also ran an mtr report to see what the path appears to be:

root@yakkey:/data/testflinger/device-connectors# mtr -r 10.245.130.106
Start: 2024-06-13T14:16:16+0000
HOST: yakkey                      Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- yakkey.maas               10.0%    10    0.4   0.4   0.3   0.5   0.0

and that also triggered whatever was stuck: SSH finally worked from the agent, and the node deployment was marked successful.

2024-06-13 14:14:29,092 yakkey INFO: DEVICE CONNECTOR: MAAS: 15 minutes passed since deployment.
2024-06-13 14:14:30,019 yakkey INFO: DEVICE CONNECTOR: MAAS: Checking if test image booted.
2024-06-13 14:16:30,083 yakkey INFO: DEVICE CONNECTOR: MAAS: 16 minutes passed since deployment.
2024-06-13 14:16:31,024 yakkey INFO: DEVICE CONNECTOR: MAAS: Checking if test image booted.
Warning: Permanently added '10.245.130.106' (ECDSA) to the list of known hosts.
2024-06-13 14:16:32,295 yakkey INFO: DEVICE CONNECTOR: MAAS: Deployed and booted.
2024-06-13 14:16:32,296 yakkey INFO: DEVICE CONNECTOR: END provision

************************************************

* Starting testflinger reserve phase on yakkey *

************************************************
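
For what it's worth, a possible workaround sketch (not the existing device-connector code) would be to warm the network path with a ping before the SSH check, which is effectively what fixed it by hand above:

    import subprocess

    def test_image_booted(ip, ssh_timeout=10):
        """Workaround sketch: ping first to populate ARP / wake the path,
        then check whether SSH answers, as the deployment check does."""
        subprocess.run(
            ["ping", "-c", "3", "-W", "2", ip],
            check=False,  # ping failures are ignored; this is only a warm-up
            capture_output=True,
        )
        result = subprocess.run(
            ["ssh", "-o", "StrictHostKeyChecking=no",
             "-o", f"ConnectTimeout={ssh_timeout}", f"ubuntu@{ip}", "true"],
            capture_output=True,
        )
        return result.returncode == 0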

Need an easy way for people to add and spin up agents

Spinning up an agent requires the same bits of data every time, but there's no documented and easy way to do this.

Adrian came up with one approach where you can add a config file to a certain monitored directory, and a periodic polling job picks up the new config file, creates an agent container, and launches it.

I would like to see that, or something similar, normalized.

In the grand picture, it would be AWESOME to have a web frontend for testflinger where one could upload a config file, or just fill out a form that then creates the config file on the back end, and starts up the new agent.

A slightly more CLI-oriented approach would be a new option for testflinger-cli that submits the new config in much the same way we submit job yaml files:

testflinger create-agent /path/to/agent-config.yaml

Issue warnings on reserved node at given intervals leading up to reservation expiry

Using some mechanism (wall via ssh?), an agent should be able to send messages to all consoles on a node at intervals leading up to expiry... perhaps at every 1/4 of the reserve period, then every 10 minutes in the last 2 hours, and every 5 minutes in the last hour?

That way folks will have warnings that their reservation is about to expire. Whether or not they pay attention is on them.
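
A rough sketch of what that could look like, assuming the agent can reach the node over SSH and that wall is acceptable (the intervals and commands here are illustrative only):

    import subprocess
    import time

    def warn_before_expiry(ip, reserve_seconds):
        """Send wall(1) messages to the reserved node as expiry approaches."""
        start = time.monotonic()
        deadline = start + reserve_seconds
        # quarter marks of the reservation, plus 10 and 5 minutes before the end
        marks = [reserve_seconds * q / 4 for q in (1, 2, 3)]
        marks += [reserve_seconds - m * 60 for m in (10, 5) if reserve_seconds > m * 60]
        for mark in sorted(set(marks)):
            time.sleep(max(0, (start + mark) - time.monotonic()))
            remaining = int(deadline - time.monotonic()) // 60
            subprocess.run(
                ["ssh", f"ubuntu@{ip}", "sudo", "wall",
                 f"Testflinger reservation expires in about {remaining} minutes"],
                check=False,
            )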

[Feature] Prioritize queues when agents poll for new jobs

Not sure if this is a function of connectors, or the api server, or both, but it would be very useful to specify queues by priority.

Currently an agent config can have this:

job_queues:
- doubletusk
- egx
- baremetal
- hpe
- 202101-28672
- charmed-ceph
- micro-ceph

and it's equally likely that a job pending for ANY of those queues will be picked up "next" after a current job is run.

But there are cases where we want to share machines but still need them for a primary purpose. In the above case, that machine is primarily a contractual obligation for NVIDIA testing. So it would be quite useful if we could do something like this:

priority_queues:
- egx
- doubletusk
low_priority_queues:
- 202101-28672
- baremetal
- hpe
- charmed-ceph
- micro-ceph

And then the behaviour would be the agent would poll first for jobs in the egx or doubletusk queues, and if none exist, move on to the rest in sequence.

The possible scenario is:
Person 1 has a job running on it now
Person 2 has submitted a job for baremetal
Person 3 has submitted a job for hpe
Person 4 has submitted a job for baremetal
I have submitted a job for egx (or doubletusk) to run a regression test on a new update to linux-nvidia

With priority queues, as soon as Person 1's currently executing job completes, my egx/doubletusk job which is already in queue alongside the three 'baremetal' and 'hpe' jobs will get picked up first.

And I'd imagine that the polling would need to bounce between the queues to better ensure priority... so:

Poll priority_queues/egx
Poll priority_queues/doubletusk
Poll low_priority_queues/202101-28672
Poll priority_queues/egx
Poll priority_queues/doubletusk
Poll low_priority_queues/baremetal
Poll priority_queues/egx
Poll priority_queues/doubletusk
Poll low_priority_queues/hpe
...

Perhaps that already quietly happens simply by the order in which the queues are listed in the agent config; if so, this would just make that more obvious, but if not, it would implement a way to jump the line for priority jobs.
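
To make the suggested polling pattern concrete, here is a small sketch of a generator that yields queues in that order (the priority_queues/low_priority_queues config keys are part of the proposal above, not an existing agent option):

    import itertools

    def polling_order(priority_queues, low_priority_queues):
        """Yield queue names so every priority queue is re-checked before
        each low-priority queue, matching the pattern listed above."""
        while True:
            for low_queue in low_priority_queues:
                yield from priority_queues
                yield low_queue

    order = polling_order(["egx", "doubletusk"],
                          ["202101-28672", "baremetal", "hpe"])
    print(list(itertools.islice(order, 9)))
    # ['egx', 'doubletusk', '202101-28672', 'egx', 'doubletusk', 'baremetal',
    #  'egx', 'doubletusk', 'hpe']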

output_timeout not working properly

After some changes in the way the per-phase execution happens, the output_timeout option wasn't functioning properly. It was unfortunately timing out far too soon, because it doesn't have an update mechanism that can be triggered to reset the timer whenever there's new output.

We've disabled it at the moment, because it could cause jobs to stop way too soon, and this is a much bigger problem than not having an output timeout (which may or may not actually be useful anyway).

If we want to continue having a timeout after long periods of no output in the future, then this needs to be reworked before we turn it back on.
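
A minimal sketch of the rework idea: a watchdog that is reset on every new chunk of output rather than counting from a fixed start (names here are hypothetical, not the agent's actual classes):

    import time

    class OutputWatchdog:
        """Output timeout that restarts whenever new output is observed."""

        def __init__(self, output_timeout):
            self.output_timeout = output_timeout
            self.last_output = time.monotonic()

        def notify_output(self):
            # call this each time the running phase produces output
            self.last_output = time.monotonic()

        def expired(self):
            return time.monotonic() - self.last_output > self.output_timeout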

reservation timeout should be possible in Minutes, Hours, or Days

I may be wrong, but my understanding is that currently the reservation timeout has to be set in seconds. This can be unwieldy when setting a reservation for several days, or even when having to do the math for multi-hour reservations. It would be nice to use the same conventions as sleep uses:

30m (30 minutes)
5h (5 hours)
4d (4 days)
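
Parsing this is straightforward; a sketch of a hypothetical helper (plain integers would keep meaning seconds for backward compatibility):

    def parse_timeout(value):
        """Convert sleep-style durations (30m, 5h, 4d) to seconds."""
        units = {"s": 1, "m": 60, "h": 3600, "d": 86400}
        text = str(value).strip()
        if text and text[-1].lower() in units:
            return int(float(text[:-1]) * units[text[-1].lower()])
        return int(text)  # plain number keeps meaning seconds

    assert parse_timeout("30m") == 1800
    assert parse_timeout("5h") == 18000
    assert parse_timeout("4d") == 345600
    assert parse_timeout(3600) == 3600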

Support device name in addition to MAAS ID for MAAS device agents

Currently, we need to specify MAAS-deployed servers via their six-character MAAS device IDs (e.g., gwmhd6) in Testflinger configuration files. This fact can create problems if those device IDs change, as happens if a node must be re-enlisted. After re-enlisting a node, the relevant configuration files must be tracked down and changed to match.

It would be much better if Testflinger could start with a machine name and then extract the MAAS ID from a call to "maas admin machines read", for use in subsequent maas calls. The current method of directly using MAAS IDs would have to be maintained for backward compatibility, of course.
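
A sketch of that lookup (assuming the agent host already has a logged-in maas CLI profile and that machines read accepts a hostname filter; the helper name is hypothetical):

    import json
    import subprocess

    def maas_id_from_hostname(maas_profile, hostname):
        """Resolve a machine hostname to its MAAS system_id via the maas CLI."""
        proc = subprocess.run(
            ["maas", maas_profile, "machines", "read", f"hostname={hostname}"],
            capture_output=True, check=True,
        )
        machines = json.loads(proc.stdout)
        if not machines:
            raise ValueError(f"No MAAS machine named {hostname!r}")
        return machines[0]["system_id"]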

Add additional output when MAAS hosted machines are reserved

When reserving a system using a YAML file as we're going to be suggesting, this is the current output from testflinger-cli:

*** TESTFLINGER SYSTEM RESERVED ***
You can now connect to [[email protected]](mailto:[email protected])
Current time:           [2023-06-21T15:04:51.451849]
Reservation expires at: [2023-06-21T15:31:31.451873]
Reservation will automatically timeout in 1600 seconds
To end the reservation sooner use: testflinger-cli cancel cab62369-d0fc-484c-8ba7-0084c3edb649

First, I'd like to dump the machine name as well as the IP, so that first line would look like:

You can now connect to tadrock by ssh to [email protected]

Next, I'd like also a blurb that includes system make/model and the OS (and kernel, if specified in the YAML):

System: Dell PowerEdge R340
OS: Ubuntu 22.04 LTS GA Kernel

This is all information that the agent is aware of, as it does a maas <login> machine read <machine ID> when it begins provisioning. So it has already collected the hostname and whatever make/model data, and it knows from the YAML if it's installing 22.04, 20.04, or whatever (and the kernel, if we've explicitly specified one in the YAML).

Finally, I'm not sure why there's a mailto: link in the output...

The output would then look something like:

*** TESTFLINGER SYSTEM RESERVED ***
You can now connect to tadrock by ssh to [email protected]

System: Dell PowerEdge R340
Installed OS: Ubuntu 22.04 LTS GA Kernel

Current time:           [2023-06-21T15:04:51.451849]
Reservation expires at: [2023-06-21T15:31:31.451873]

Reservation will automatically timeout in 1600 seconds
To end the reservation sooner use: testflinger-cli cancel cab62369-d0fc-484c-8ba7-0084c3edb649

Build testflinger agent docker image failed

When I try to build the docker image for the TF agent:

$ cd agent/extra/docker
$  docker build -t local/testflinger-agent:latest .

I got this error:

the --chmod option requires BuildKit. Refer to https://docs.docker.com/go/buildkit/ to learn how to build images with BuildKit enabled

I have to use BuildKit to build the image successfully:

DOCKER_BUILDKIT=1 docker build -t local/testflinger-agent:latest .

My docker version:

Docker version 20.10.25, build 20.10.25-0ubuntu1~23.04.1

Need a way to mark agents as on and offline so that multi-device jobs don't fail by grabbing machines that are not currently available to testflinger

While we want folks to use testflinger to access servers in the lab, there are occasions where folks need to actually own the machine fully, and manage it directly via MAAS, not testflinger. Some examples:

  • maintenance such as adding or removing hardware, debugging hardware or boot failures, MAAS regressions, etc. (Cert)
  • Debugging customer issues such as needing to reconfigure storage on a machine and recommission multiple times while doing test deployments. (STS, Field, Cert)
  • having access to experiment with commissioning scripts, scriptlets, and debugging their use for things like improving deployment automation in the field (Field Engineering specifically)
  • possibly others?

In these cases we need to be able to mark those machines as UNAVAILABLE to testflinger. There are ways to do this now... such as setting a reservation and then powering the machine off, or simply removing the agent from testflinger completely.

But it would be nice to simply be able to say something like testflinger offline <agent name> to mark it as unavailable and testflinger online <agent name> to make it available again.

This would let us easily and dynamically turn agents on and off without a lot of fuss, and will help prevent multi-jobs from failing simply because they happened to try grabbing a machine that was not usable.

Dependency Dashboard

This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.


Detected dependencies

docker-compose
server/devel/docker-compose.override.yml
server/docker-compose.yml
  • mongo 5
dockerfile
server/Dockerfile
  • ubuntu 22.04
github-actions
.github/workflows/agent-tox.yml
  • actions/checkout v3
  • actions/setup-python v4
.github/workflows/cli-publish-snap.yml
  • actions/checkout v4
  • snapcore/action-build v1
  • snapcore/action-publish v1
.github/workflows/cli-tox.yml
  • actions/checkout v4
  • actions/setup-python v4
.github/workflows/device-tox.yml
  • actions/checkout v4@3df4ab11eba7bda6032a0b82a6bb43b11571feac
  • actions/setup-python v4
.github/workflows/documentation_checks.yml
  • actions/checkout v4
  • actions/checkout v4
  • actions/checkout v4
.github/workflows/server-charm-check-libs.yml
  • actions/checkout v2
  • canonical/charming-actions 2.4.0
.github/workflows/server-charm-release-edge.yml
  • actions/checkout v2
  • canonical/charming-actions 2.4.0
.github/workflows/server-publish-oci-image.yml
  • docker/setup-buildx-action v3
  • actions/checkout v3
  • docker/login-action 65b78e6e13532edd9afa3aa52ac7964289d1a9c1
  • docker/metadata-action 9ec57ed1fcdbf14dcef7dfbe97b2010124a938b7
  • docker/build-push-action f2a1d5e99d037542a71f64918e516c093c6f3fc4
.github/workflows/server-tox.yml
  • actions/checkout v3
  • actions/setup-python v4
pep621
agent/pyproject.toml
cli/pyproject.toml
  • xdg <5.2
device-connectors/pyproject.toml
  • PyYAML >=3.11
pip_requirements
agent/charms/testflinger-agent-charm/requirements-dev.txt
agent/charms/testflinger-agent-charm/requirements.txt
  • ops >= 1.4.0
  • Jinja2 ==3.1.2
  • GitPython ==3.1.37
agent/charms/testflinger-agent-host-charm/requirements-dev.txt
agent/charms/testflinger-agent-host-charm/requirements.txt
  • ops >= 1.4.0
docs/.sphinx/requirements.txt
server/charm/requirements.txt
  • ops >= 2.2.0
terraform
server/terraform/versions.tf
  • juju ~> 0.7.0


Disk configuration for an empty disk requires at least a partition

When defining an empty disk in the disks section like this:

        - id: disk1
          disk: 1
          type: disk
          ptable: gpt
          name: disk1

it fails with

2023-08-22 11:00:23,104 aitken INFO: DEVICE AGENT: MAAS: Clearing existing storage configuration
2023-08-22 11:00:24,943 aitken INFO: DEVICE AGENT: MAAS: Applying storage layout
Traceback (most recent call last):
  File "/usr/local/bin/snappy-device-agent", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/snappy_device_agents/cmd.py", line 59, in main
    raise SystemExit(args.func(args))
  File "/usr/local/lib/python3.8/dist-packages/snappy_device_agents/devices/__init__.py", line 261, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/snappy_device_agents/devices/maas2/__init__.py", line 60, in provision
    raise e
  File "/usr/local/lib/python3.8/dist-packages/snappy_device_agents/devices/maas2/__init__.py", line 55, in provision
    device.provision()
  File "/usr/local/lib/python3.8/dist-packages/snappy_device_agents/devices/maas2/maas2.py", line 83, in provision
    self.deploy_node(distro, kernel, user_data, storage_data)
  File "/usr/local/lib/python3.8/dist-packages/snappy_device_agents/devices/maas2/maas2.py", line 243, in deploy_node
    self.maas_storage.configure_node_storage(storage_data)
  File "/usr/local/lib/python3.8/dist-packages/snappy_device_agents/devices/maas2/maas_storage.py", line 139, in configure_node_storage
    self.process_by_dev_type(devs_by_type)
  File "/usr/local/lib/python3.8/dist-packages/snappy_device_agents/devices/maas2/maas_storage.py", line 341, in process_by_dev_type
    dev_type_to_method[dev_type](dev)
  File "/usr/local/lib/python3.8/dist-packages/snappy_device_agents/devices/maas2/maas_storage.py", line 389, in process_disk
    "block-id": device["parent_disk_blkid"],
KeyError: 'parent_disk_blkid'

When a partition is defined as well, deployment works fine

        - id: disk1
          disk: 1
          type: disk
          ptable: gpt
          name: disk1
        - id: disk1-part1
          device: disk1
          type: partition
          number: 1
          size: 500M

Unhandled exception when MAAS takes too long to respond

ERROR: 2024-02-09 11:11:55 client.py:61 -- Timeout while trying to communicate with the server.
Traceback (most recent call last):
  File "/usr/local/bin/snappy-device-agent", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/testflinger_device_connectors/cmd.py", line 60, in main
    raise SystemExit(args.func(args))
  File "/usr/local/lib/python3.8/dist-packages/testflinger_device_connectors/devices/__init__.py", line 323, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/testflinger_device_connectors/devices/maas2/__init__.py", line 45, in provision
    device = Maas2(args.config, args.job_data)
  File "/usr/local/lib/python3.8/dist-packages/testflinger_device_connectors/devices/maas2/maas2.py", line 52, in __init__
    self.maas_storage = MaasStorage(self.maas_user, self.node_id)
  File "/usr/local/lib/python3.8/dist-packages/testflinger_device_connectors/devices/maas2/maas_storage.py", line 39, in __init__
    self.node_info = self._node_read()
  File "/usr/local/lib/python3.8/dist-packages/testflinger_device_connectors/devices/maas2/maas_storage.py", line 55, in _node_read
    return self.call_cmd(cmd, output_json=True)
  File "/usr/local/lib/python3.8/dist-packages/testflinger_device_connectors/devices/maas2/maas_storage.py", line 74, in call_cmd
    raise MaasStorageError(proc.stdout.decode())
testflinger_device_connectors.devices.maas2.maas_storage.MaasStorageError: Authorization Error: 'Expired timestamp: given 1707494526 and now 1707495426 has a greater difference than threshold 300'

This is pretty cut and dry, it feels like this could have some error handling to provide a more user-friendly error message.

"Error: the MAAS device connector reports that it took too long to get a response from the MAAS server. Marking deployment failed."

or something helpful and easier to understand. The raw message about timestamps being expired is creating a lot of confusion for people who see it.
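
A sketch of the kind of translation layer being asked for (the detection heuristic and wording are assumptions, not existing behaviour):

    def friendly_maas_error(exc):
        """Map the raw MAAS 'Expired timestamp' authorization error to a
        message users can act on; fall back to the original text otherwise."""
        message = str(exc)
        if "Expired timestamp" in message:
            return (
                "Error: the MAAS device connector took too long to get a "
                "response from the MAAS server (or the clocks are out of "
                "sync). Marking the provision phase as failed."
            )
        return message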

Provide the serial console log as an artifact

This is a feature request for testflinger to save a copy of the device's serial console log as an artifact, or provide some means to retrieve it.

One possible implementation is to add an option to the provision yaml to indicate this is wanted:

  log_console: True

The log could then be retrieved via:

testflinger-cli console_log <job_id>

As each device is different, the agent would need different methods for collecting the console logs. It's understandable that some devices may not be able to provide a console log.

TypeError: expected str, bytes or os.PathLike object, not NoneType when trying to reserve using yaml with disk definition

Trying to even reserve a machine with a disk layout specified is failing with a traceback about something being a NoneType:

Yaml being used
reserve-sample.yaml.txt

This is the actual traceback:

2023-08-25 16:31:01,521 doubletusk INFO: DEVICE AGENT: MAAS: Configuring node storage
2023-08-25 16:31:01,522 doubletusk INFO: DEVICE AGENT: MAAS: Clearing existing storage configuration
2023-08-25 16:31:03,501 doubletusk INFO: DEVICE AGENT: MAAS: Applying storage layout
Traceback (most recent call last):
  File "/usr/local/bin/snappy-device-agent", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/snappy_device_agents/cmd.py", line 59, in main
    raise SystemExit(args.func(args))
  File "/usr/local/lib/python3.8/dist-packages/snappy_device_agents/devices/__init__.py", line 261, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/snappy_device_agents/devices/maas2/__init__.py", line 60, in provision
    raise e
  File "/usr/local/lib/python3.8/dist-packages/snappy_device_agents/devices/maas2/__init__.py", line 55, in provision
    device.provision()
  File "/usr/local/lib/python3.8/dist-packages/snappy_device_agents/devices/maas2/maas2.py", line 83, in provision
    self.deploy_node(distro, kernel, user_data, storage_data)
  File "/usr/local/lib/python3.8/dist-packages/snappy_device_agents/devices/maas2/maas2.py", line 243, in deploy_node
    self.maas_storage.configure_node_storage(storage_data)
  File "/usr/local/lib/python3.8/dist-packages/snappy_device_agents/devices/maas2/maas_storage.py", line 139, in configure_node_storage
    self.process_by_dev_type(devs_by_type)
  File "/usr/local/lib/python3.8/dist-packages/snappy_device_agents/devices/maas2/maas_storage.py", line 341, in process_by_dev_type
    dev_type_to_method[dev_type](dev)
  File "/usr/local/lib/python3.8/dist-packages/snappy_device_agents/devices/maas2/maas_storage.py", line 480, in process_format
    self.call_cmd(
  File "/usr/local/lib/python3.8/dist-packages/snappy_device_agents/devices/maas2/maas_storage.py", line 66, in call_cmd
    proc = subprocess.run(
  File "/usr/lib/python3.8/subprocess.py", line 493, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/usr/lib/python3.8/subprocess.py", line 858, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib/python3.8/subprocess.py", line 1639, in _execute_child
    self.pid = _posixsubprocess.fork_exec(
TypeError: expected str, bytes or os.PathLike object, not NoneType
****************************************************
* Starting testflinger cleanup phase on doubletusk *
****************************************************

[Feature] testflinger cli could report status of a queue (summary of incomplete jobs)

CLI is the primary interface to testflinger, and even if we implement some way to launch jobs from a web page, often CLI is the best way to interact with testflinger.

I find that I sometimes launch up to 80-100 jobs in sequence to pressure test systems. This obviously takes a very long time to complete, and currently my only real way of knowing whether the queue 'fleep' is still running a job or is completely finished is polling each job, or running a loop against every known job ID I've created to get a status. It is a bit painful to check all the fleep job-ids and have to wade through 70 "Completed" just to get to 71 "Provisioning" and 72-80 "Waiting".

It would be nice to get a summary of jobs (since we CAN look here: https://testflinger.canonical.com/jobs and see a long list of job IDs and the associated queues) via CLI for just the single requested queue.

Something like this:

$ testflinger-cli queue-status fleep
70 complete, 1 Running, 9 Waiting

$ testflinger-cli queue-status --verbose fleep
75 complete, 1 Running, 4 Waiting
de153d8f-7d32-47d7-9a05-a20f2ef6bb35	waiting	        2023-10-13 15:22:46
ba73620d-6d1a-45ab-bb68-a640e4e4c489	waiting	        2023-10-13 15:22:45
8b0bb52f-08d8-4671-b275-55d84a965f7c	waiting	        2023-10-13 15:22:43
a23b7217-c78a-42e0-9636-55ef3fe37bc2	waiting	        2023-10-13 15:22:41
b1077c76-a336-47a4-97a0-c86d3f1c007d	provisioning	2023-10-13 15:22:40

$ testflinger-cli queue-status --verbose --show-completed fleep
de153d8f-7d32-47d7-9a05-a20f2ef6bb35	waiting	        2023-10-13 15:22:46
ba73620d-6d1a-45ab-bb68-a640e4e4c489	waiting	        2023-10-13 15:22:45
8b0bb52f-08d8-4671-b275-55d84a965f7c	waiting	        2023-10-13 15:22:43
a23b7217-c78a-42e0-9636-55ef3fe37bc2	waiting	        2023-10-13 15:22:41
b1077c76-a336-47a4-97a0-c86d3f1c007d	provisioning	2023-10-13 15:22:40
b46e12c9-e37d-4ded-a8d7-85d3c3dc6e8c	complete        2023-10-13 15:22:39
5d2640d1-b830-4a50-bf26-dde41f4ef2c9	complete        2023-10-13 15:22:34
e264a1cc-4bbb-4b59-a06b-a79e1ec2c10f	complete        2023-10-13 15:22:33
7634c826-1bf2-482f-ac00-81b62a89dfe3	complete        2023-10-13 15:22:32
6509f367-961f-4a39-a06a-db4dbc967d3c	complete        2023-10-13 15:22:30
d908e5cc-ee6d-4a3f-9a8e-00ec4f325e93	complete        2023-10-13 15:22:29
...

--verbose --show-completed would list all the completed jobs that TF still has info for (e.g. the ones that haven't aged out and results/artifacts are still available, with the assumption that anything older you really don't care about at all)

JSONDecodeError when provisioning with maas storage specified

A user reported getting a JSONDecodeError in the output when trying to provision a MAAS system. It sounds like the user had either specified a storage configuration, or else the system had been configured with a default storage configuration, and it failed to get the block device data associated with it.

This needs investigation.
The JSONDecodeError happened because the call was expected to return JSON but didn't. It is not MAAS that was supposed to generate the JSON; it was a method in the custom storage module for the MAAS device connector, which runs a command to retrieve disk information about the MAAS node and either raises an exception if the command fails, or constructs JSON from the output and returns it. The command succeeded, but there was no output with which to construct the JSON, resulting in this unexpected corner case.

I would like to better understand what happened here, to figure out whether this is a normal situation that can be recovered from, or whether it's an error condition and we should raise an exception. Reproducing this in the server lab might help, if possible.

Log output from the failed provision step:

*****************************************************

* Starting testflinger provision phase on jasperoid *

*****************************************************

Traceback (most recent call last):
  File "/usr/local/bin/snappy-device-agent", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/testflinger_device_connectors/cmd.py", line 83, in main
    sys.exit(func(args))
  File "/usr/local/lib/python3.8/dist-packages/testflinger_device_connectors/devices/__init__.py", line 321, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/testflinger_device_connectors/devices/maas2/__init__.py", line 41, in provision
    device = Maas2(args.config, args.job_data)
  File "/usr/local/lib/python3.8/dist-packages/testflinger_device_connectors/devices/maas2/maas2.py", line 51, in __init__
    self.maas_storage = MaasStorage(self.maas_user, self.node_id)
  File "/usr/local/lib/python3.8/dist-packages/testflinger_device_connectors/devices/maas2/maas_storage.py", line 40, in __init__
    self.node_info = self._node_read()
  File "/usr/local/lib/python3.8/dist-packages/testflinger_device_connectors/devices/maas2/maas_storage.py", line 56, in _node_read
    return self.call_cmd(cmd, output_json=True)
  File "/usr/local/lib/python3.8/dist-packages/testflinger_device_connectors/devices/maas2/maas_storage.py", line 84, in call_cmd
    return json.loads(output)
  File "/usr/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.8/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
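
Whatever the answer to the recoverability question, a guard along these lines (a sketch, not the actual call_cmd implementation) would at least turn the corner case into an explicit error:

    import json

    def parse_maas_output(output, output_json=False):
        """Treat empty output from the maas CLI as an explicit error instead
        of letting json.loads() raise a bare JSONDecodeError."""
        if not output_json:
            return output
        if not output.strip():
            raise RuntimeError(
                "maas command succeeded but produced no output to parse as JSON"
            )
        return json.loads(output)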

Usability issue on launchpad username when using "lp:" as prefix

Main problem:

The way the prefix lp: is written tends to lead the user to think that it's okay to put a space between the prefix and the Launchpad username.

This is because the use of : as part of the prefix tends to make one assume that it's another key-value parameter. If some other character were used, like lp/locnnil, this confusion wouldn't happen.

This is the error that occurs when you put a space in the ssh_keys value, like lp: locnnil:

2023-12-15 18:45:25,270 sru20210629207 INFO: DEVICE AGENT: BEGIN reservation
Traceback (most recent call last):
  File "/srv/testflinger-agent/sru20210629207/env/bin/snappy-device-agent", line 8, in <module>
    sys.exit(main())
  File "/srv/testflinger-agent/sru20210629207/env/lib/python3.8/site-packages/snappy_device_agents/cmd.py", line 59, in main
    raise SystemExit(args.func(args))
  File "/srv/testflinger-agent/sru20210629207/env/lib/python3.8/site-packages/snappy_device_agents/devices/__init__.py", line 172, in reserve
    proc = subprocess.run(cmd)
  File "/usr/lib/python3.8/subprocess.py", line 493, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/usr/lib/python3.8/subprocess.py", line 858, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib/python3.8/subprocess.py", line 1639, in _execute_child
    self.pid = _posixsubprocess.fork_exec(
TypeError: expected str, bytes or os.PathLike object, not dict

So, this problem wouldn't happen if, for example, instead of using : as a separator, the / character could be used. It could be configured like this:

job_queue: 202106-29207
provision_data:
  distro: jammy
reserve_data:
  ssh_keys:
    - lp/locnnil
  timeout: 21600 # 6 hours

Alternatively, an error could be raised on the client side if the person specifies the lp username with a space after the :.
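
A sketch of that client-side check (a hypothetical helper; when YAML sees "- lp: locnnil" it parses the entry as a mapping rather than a string, which is exactly what this would catch):

    def validate_ssh_keys(ssh_keys):
        """Reject ssh_keys entries that YAML parsed as mappings or that lack
        a provider prefix, with a hint about the stray space."""
        for entry in ssh_keys:
            if not isinstance(entry, str):
                raise SystemExit(
                    "ssh_keys entries must be strings like 'lp:username'; "
                    "did you put a space after 'lp:'?"
                )
            if ":" not in entry:
                raise SystemExit(
                    f"ssh_keys entry {entry!r} is missing a prefix like 'lp:'"
                )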

Provide CLI to retrieve status of a job or reservation

Testflinger allows a reservation to be created for a device. However, it is then necessary to poll the job and watch the output to know when the device is ready and how to connect to it. This request is to provide that information via a status command:

$ testflinger status <job_id>
{
  "queue": "rpi3bp",
  "device": "rpi3bp-001",
  "ip": "1.2.3.4",
  "serial": 1.2.3.5:3000",
  "status": "READY",
  "reservation_remaining": 500
}

status could be used to indicate if the device is still "WAITING", is "PROVISIONING", "READY" for use, or the reservation has "TERMINATED" (or other values). Other values might be useful, but these are the ones that come to mind currently.

Annotated tag driven versioning of testflinger CLI builds onto stable channel

The build pipeline introduced in #151 makes testflinger CLI builds available on edge channel on main branch commits.

To follow this up with some further maturing of the build practices of the project, it would be helpful to make stable versions of testflinger available, with predictable version strings in use, specifically by following semantic versioning practices. For instance:

  • A major version change used to communicate breaking changes, with the expectation that the CLI tool used against a testflinger server should match with the version string.
  • A major version would correspond to a channel for the snap (and for the charms).
  • Minor versions would communicate new features, API endpoints etc non-breaking changes.
  • Patch versions would be used for bug fixes.

Version incrementing could be implemented in a number of ways, including setuptools-scm as used in canonical/checkbox (hopefully a lot simpler than in checkbox, where about 99% of the complexity is for the complex build matrix and essentially templating of the snapcraft.yml) or poetry-dynamic-versioning if you wanted to be more hipster about it.

For MAAS systems, power off machines after reservation expires

Once a reservation expires (perhaps add a 30 minute grace period?) have maas release the machine and power it off rather than leaving it running and accessible.

(Maybe this is mean? But it does enforce the need to either renew the reservation or get your stuff done on time.)

Note, this should only be done in cases where a MAAS node has been reserved, not for things run using a testflinger job normally.

[multi-device] send status updates for allocation progress as devices are allocated

Request from alexb:
"... it would be nice to know that each job has found a machine in the queue and started just like, a status for the jobs started by the multijob, not all the logs, just the status that it is progressing"

I agree this would be a good feature and probably pretty straightforward to do. Putting it here so that we don't forget.

Environment variable couldn't be all digits for SECURE_ID

This might be a limitation of subprocess; the error log is below. The SECURE_ID used was a valid hex string, 123456.

Traceback (most recent call last):
  File "/usr/local/bin/testflinger-device-connector", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/testflinger_device_connectors/cmd.py", line 60, in main
    raise SystemExit(args.func(args))
  File "/usr/local/lib/python3.10/dist-packages/testflinger_device_connectors/devices/__init__.py", line 194, in runtest
    raise e
  File "/usr/local/lib/python3.10/dist-packages/testflinger_device_connectors/devices/__init__.py", line 190, in runtest
    exitcode = testflinger_device_connectors.run_test_cmds(
  File "/usr/local/lib/python3.10/dist-packages/testflinger_device_connectors/__init__.py", line 373, in run_test_cmds
    return _run_test_cmds_str(cmds, config, env)
  File "/usr/local/lib/python3.10/dist-packages/testflinger_device_connectors/__init__.py", line 487, in _run_test_cmds_str
    result = runcmd("./tf_cmd_script", env)
  File "/usr/local/lib/python3.10/dist-packages/testflinger_device_connectors/__init__.py", line 332, in runcmd
    with subprocess.Popen(
  File "/usr/lib/python3.10/subprocess.py", line 971, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib/python3.10/subprocess.py", line 1783, in _execute_child
    env_list.append(k + b'=' + os.fsencode(v))
  File "/usr/lib/python3.10/os.py", line 811, in fsencode
    filename = fspath(filename)  # Does type-checking of `filename`.
TypeError: expected str, bytes or os.PathLike object, not int

The reason is that env_list.append(k + b'=' + os.fsencode(v)) uses os.fsencode to encode the environment variable values, and os.fsencode expects each value to be a str, bytes, or os.PathLike:

def fsencode(filename):
        """Encode filename (an os.PathLike, bytes, or str) to the filesystem
        encoding with 'surrogateescape' error handler, return bytes unchanged.
        On Windows, use 'strict' error handler if the file system encoding is
        'mbcs' (which is the default encoding).
        """
        filename = fspath(filename)  # Does type-checking of `filename`.
        if isinstance(filename, str):
            return filename.encode(encoding, errors)
        else:
            return filename

The env construction could be changed to

env = {x: str(y) for x, y in env.items() if y }

to solve this problem.
