frzb / coinboot
A framework for diskless computing
Home Page: https://coinboot.io
License: GNU General Public License v3.0
When a plugin contains directories that are not owned by root, these directories wrongly become owned by root when the plugin is applied to a Coinboot node.
SSH host keys are an essential security feature against man-in-the-middle attacks.
On each start of a Coinboot node a new host key is generated by the OpenSSH daemon. It either needs to be acknowledged when initiating an SSH connection or is ignored altogether by the SSH client configuration.
In a controlled cluster environment where access only happens in the local network, with minimal risk of man-in-the-middle attacks, sharing host keys is acceptable. So we have to:
Find a way to create a persistent shared host key.
Recent releases of debootstrap have support for caching packages.
From the manpage:
--cache-dir=DIR
Cache .deb files under directory. It should be an absolute path.
To improve the developer experience we should integrate this into debirf to speed up repeated local builds.
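A minimal sketch of how a build helper could assemble a debootstrap call with the cache option. The function name, the example paths, and the suite are hypothetical; only --cache-dir and its absolute-path requirement come from the manpage excerpt above. How debirf itself would pass the option through is not shown here.

```python
import os
from typing import List, Optional


def debootstrap_command(
    suite: str,
    target: str,
    cache_dir: str,
    mirror: Optional[str] = None,
) -> List[str]:
    """Build the argument list for a debootstrap run that caches .deb files.

    The helper and all example values are assumptions for illustration.
    """
    # The manpage states the cache directory should be an absolute path.
    if not os.path.isabs(cache_dir):
        raise ValueError(f"cache_dir must be an absolute path: {cache_dir!r}")

    cmd = ["debootstrap", f"--cache-dir={cache_dir}", suite, target]
    if mirror:
        cmd.append(mirror)
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(debootstrap_command("focal", "/tmp/rootfs", "/var/cache/debootstrap")))
```

The returned list can be handed to subprocess.run() once the cache directory has been created.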
If we can bring Linux 5.9 to Coinboot we can also use Zstd compression for the kernel, ramdisk, and initramfs, and benefit from higher compression ratios in addition to the best-in-class decompression performance of Zstd.
Coinboot should support not only AMD GPUs but NVIDIA GPUs as well.
So we have to come up with a plugin providing the proprietary NVIDIA GPU driver.
The reference GPU for this PoC is an NVIDIA P106-100.
From the dnsmasq manpage:
--tftp-max=
Set the maximum number of concurrent TFTP connections allowed. This defaults to 50. When serving a large number of TFTP connections, per-process file descriptor limits may be encountered. Dnsmasq needs one file descriptor for each concurrent TFTP connection and one file descriptor per unique file (plus a few others). So serving the same file simultaneously to n clients will use require about n + 10 file descriptors, serving different files simultaneously to n clients will require about (2*n) + 10 descriptors. If --tftp-port-range is given, that can affect the number of concurrent connections.
The default tftp-max value of 50 is obviously too low for scenarios where hundreds of nodes boot at the same time.
In such scenarios, congestion-like situations have been observed with the default value of tftp-max, with lots of nodes that seem to be stuck in the boot process.
An ad-hoc adjustment to tftp-max=4096 in conf/dnsmasq/coinboot.conf resolved the situation.
We need to raise the default value of tftp-max to a sensible value.
The configuration should be verified by some load testing in the CI pipeline with software like fbender which can also benchmark TFTP servers.
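To get a feeling for the file descriptor budget behind a given tftp-max, the two estimates from the manpage excerpt can be written down directly. This is only a sketch of that arithmetic; the function names are made up for illustration, and the numbers are approximations as the manpage itself says.

```python
def fds_same_file(clients: int) -> int:
    """Approximate descriptors when all clients fetch the same file (n + 10)."""
    return clients + 10


def fds_distinct_files(clients: int) -> int:
    """Approximate descriptors when every client fetches a different file (2*n + 10)."""
    return 2 * clients + 10


def fits_in_limit(required_fds: int, fd_limit: int) -> bool:
    """Check an estimate against a per-process file descriptor limit."""
    return required_fds <= fd_limit


if __name__ == "__main__":
    for tftp_max in (50, 4096):
        print(tftp_max, fds_same_file(tftp_max), fds_distinct_files(tftp_max))
    # prints "50 60 110" and "4096 4106 8202"
```

With tftp-max=4096 the worst case is about 8202 descriptors, so the per-process limit of the dnsmasq process has to be checked alongside the new default.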
While I was thinking about the rework of the release scheme I recognized that the build process has to be reproducible to not end up with a moving target. The main target of this effort is the rootfs build with debirf, which is based on debootstrap.
The people at Debian already addressed this topic:
https://wiki.debian.org/ReproducibleInstalls
So we should find out which software we should use for creating a reproducible rootfs build.
Repository: https://github.com/upx/upx
It is highly unlikely that the BC-150 APUs are running together with other GPUs in the same system.
So it makes sense to separate this driver plugin to minimize file size overhead.
https://www.amd.com/en/support/kb/release-notes/rn-amdgpu-unified-linux-21-40-1
Release 21.40.1 is the first Radeon™ Software for Linux that uses unified ROCm™ and graphics drivers. This release has not been fully validated for Machine Learning use cases. Users are recommended to use https://rocmdocs.amd.com/en/latest/Current_Release_Notes/Current-Release-Notes.html for ROCm™ use cases.
It is a known upstream issue:
The driver keeps all logs in memory and will drop log entries if Loki is not reachable and if the quantity of max_retries has been exceeded. To avoid the dropping of log entries, setting max_retries to zero allows unlimited retries; the driver will continue trying forever until Loki is again reachable. Trying forever may have undesired consequences, because the Docker daemon will wait for the Loki driver to process all logs of a container, until the container is removed. Thus, the Docker daemon might wait forever if the container is stuck.
Feeding the Docker container logs into Loki via Promtail or Fluentd, rather than through the Loki Docker logging driver, should be sufficient as a workaround.
AMD BC-250 APUs have the code name cyan-skillfish.
We no longer support Coinboot images based on Ubuntu Xenial.
We have to remove the support for Xenial from our build scripts and pipelines.
A centralized network share on the Coinboot server that is mounted in the filesystem of the Coinboot nodes would be an improvement for operations.
That network share would make it easier to store persistent data and to access data that is needed during operations, like firmware images or kernel dumps.
WebDAV is preferable because it requires no privileges from the Docker host the Coinboot server container is running on.
Currently there is no swap enabled.
This leads to constraints when applications allocate a lot of memory and the overall memory is on the lower end, such as 4 GB. In the worst case the OOM killer stops the process that is allocating the memory.
This currently happens with Teamred Miner in a multi-GPU setup with only 4 GB of system memory.
We should enable swap backed by a zram device covering the full memory capacity or 8 GB, whichever is smaller.
This is similar to how it was done successfully for Fedora 34.
https://fedoraproject.org/wiki/Changes/Scale_ZRAM_to_full_memory_size
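The sizing rule proposed above (full memory capacity or 8 GB, whichever is smaller) can be sketched as follows. The function names are hypothetical, and the sketch assumes the total memory is taken from the MemTotal line of /proc/meminfo, which reports its value in kB. It only computes the size and does not configure a zram device.

```python
GIB = 1024 ** 3


def mem_total_bytes(meminfo_text: str) -> int:
    """Extract MemTotal from /proc/meminfo content and convert kB to bytes."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) * 1024
    raise ValueError("MemTotal not found in meminfo text")


def zram_size_bytes(total_memory_bytes: int, cap_bytes: int = 8 * GIB) -> int:
    """Return the zram device size: total memory, capped at 8 GiB by default."""
    return min(total_memory_bytes, cap_bytes)


if __name__ == "__main__":
    # Reading /proc/meminfo only works on Linux.
    with open("/proc/meminfo") as handle:
        total = mem_total_bytes(handle.read())
    print(zram_size_bytes(total))
```

On a 4 GB node the device would cover all of the memory, while a 16 GB node would be capped at 8 GB.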
By using a custom configuration for dpkg we can control which files to drop during the installation of a package.
For example: /etc/dpkg/dpkg.cfg.d/01_coinboot
# block documentation
path-exclude /usr/share/doc/*
# keep copyright files for legal reasons
path-include /usr/share/doc/*/copyright
path-exclude /usr/share/man/*
path-exclude /usr/share/groff/*
path-exclude /usr/share/info/*
# lintian stuff is small, but really unnecessary
path-exclude /usr/share/lintian/*
path-exclude /usr/share/linda/*
# block non-us locales
path-exclude /usr/share/locale/*
path-include /usr/share/locale/en*
Inspired by: https://wiki.ubuntu.com/ReducingDiskFootprint#Drop_unnecessary_files
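To reason about which files survive such a filter, the rule evaluation can be simulated. The sketch below assumes the behaviour documented for dpkg's path filters: rules are processed in the given order, the last matching rule decides, and '*' also matches '/'. It is a simplified model with a hypothetical function name, not dpkg's implementation, and it ignores how dpkg treats directories.

```python
from fnmatch import fnmatchcase

# The rules from the example configuration above, in file order.
RULES = [
    ("exclude", "/usr/share/doc/*"),
    ("include", "/usr/share/doc/*/copyright"),
    ("exclude", "/usr/share/man/*"),
    ("exclude", "/usr/share/groff/*"),
    ("exclude", "/usr/share/info/*"),
    ("exclude", "/usr/share/lintian/*"),
    ("exclude", "/usr/share/linda/*"),
    ("exclude", "/usr/share/locale/*"),
    ("include", "/usr/share/locale/en*"),
]


def is_installed(path: str, rules=RULES) -> bool:
    """Return True if the path would be kept under the simplified model."""
    keep = True  # paths matching no rule are installed as usual
    for action, pattern in rules:
        # fnmatchcase lets '*' match '/' as well, like the dpkg globs.
        if fnmatchcase(path, pattern):
            keep = action == "include"  # the last matching rule wins
    return keep


if __name__ == "__main__":
    for candidate in (
        "/usr/share/doc/bash/copyright",
        "/usr/share/doc/bash/README",
        "/usr/share/locale/de/LC_MESSAGES/bash.mo",
    ):
        print(candidate, is_installed(candidate))
```

The order of the rules matters: the path-include lines only take effect because they come after the broader path-exclude lines they refine.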
Despite the error message the CI/CD pipeline succeeds.
We have to identify whether this is a real problem or just a side effect of installing dnsmasq for running a preflight check with dnsmasq against our dnsmasq configuration file.
https://github.com/frzb/coinboot/runs/4657577120?check_suite_focus=true#step:5:375
[snip]
Job for dnsmasq.service failed because the control process exited with error code.
See "systemctl status dnsmasq.service" and "journalctl -xe" for details.
invoke-rc.d: initscript dnsmasq, action "start" failed.
● dnsmasq.service - dnsmasq - A lightweight DHCP and caching DNS server
Loaded: loaded (/lib/systemd/system/dnsmasq.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Wed 2021-12-29 10:07:33 UTC; 6ms ago
Process: 2975 ExecStartPre=/usr/sbin/dnsmasq --test (code=exited, status=0/SUCCESS)
Process: 2976 ExecStart=/etc/init.d/dnsmasq systemd-exec (code=exited, status=2)
Dec 29 10:07:33 fv-az246-793 systemd[1]: Starting dnsmasq - A lightweight DHCP and caching DNS server...
Dec 29 10:07:33 fv-az246-793 dnsmasq[2975]: dnsmasq: syntax check OK.
Dec 29 10:07:33 fv-az246-793 dnsmasq[2976]: dnsmasq: failed to create listening socket for port 53: Address already in use
Dec 29 10:07:33 fv-az246-793 systemd[1]: dnsmasq.service: Control process exited, code=exited, status=2/INVALIDARGUMENT
Dec 29 10:07:33 fv-az246-793 dnsmasq[2976]: failed to create listening socket for port 53: Address already in use
Dec 29 10:07:33 fv-az246-793 systemd[1]: dnsmasq.service: Failed with result 'exit-code'.
Dec 29 10:07:33 fv-az246-793 dnsmasq[2976]: FAILED to start up
Dec 29 10:07:33 fv-az246-793 systemd[1]: Failed to start dnsmasq - A lightweight DHCP and caching DNS server.
[snap]
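The log above reports "Address already in use" for port 53. One way to investigate this on a runner is to test whether a port is already bound before dnsmasq gets installed. The following is a minimal sketch with a hypothetical function name. It only checks TCP on one address, and binding to ports below 1024 usually needs elevated privileges, so other errors are re-raised instead of being counted as "in use".

```python
import errno
import socket


def tcp_port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if binding host:port fails with EADDRINUSE."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError as error:
            if error.errno == errno.EADDRINUSE:
                return True
            # For example a permission error for privileged ports.
            raise
    return False


if __name__ == "__main__":
    try:
        print("port 53 in use:", tcp_port_in_use(53))
    except OSError as error:
        print("could not probe port 53:", error)
```

If the port turns out to be occupied before dnsmasq is installed, the failed service start would be a side effect of the runner environment rather than a problem in the configuration file that the preflight check validates.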
The current metadata structure of the plugin creation files looks like this:
plugin: AMDGPU-Pro Polaris
archive_name: amdgpupro_polaris
version: 20.50-1234664
description: AMD Polaris GPU (RX500/RX400 family) firmware and driver with support for OpenCL 1.2
maintainer: Gunter Miegel <[email protected]>
source: https://www.amd.com/en/support/kb/release-notes/rn-amdgpu-unified-linux-20-50
run: |
<plugin creation code>
As Coinboot progresses we might bring up multiple kernel versions like 5.11.0-46-generic or 5.13.0-25-generic.
For the plugins that have a kernel dependency we have to reflect that in the plugin metadata by adding the mandatory key kernel.
For the plugins that don't have a kernel dependency I propose to set kernel to the value all.
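A sketch of how a consumer of the metadata could evaluate the proposed kernel key. It assumes the metadata has already been parsed into a dictionary. The function name and the example entries are hypothetical, and exact string comparison against the running kernel release is an assumption, not something the proposal specifies.

```python
def plugin_supports_kernel(metadata: dict, running_kernel: str) -> bool:
    """Decide whether a plugin can be applied on the given kernel release."""
    # The proposal makes 'kernel' mandatory, so a missing key is an error.
    if "kernel" not in metadata:
        raise KeyError("mandatory key 'kernel' is missing in plugin metadata")
    kernel = metadata["kernel"]
    # 'all' marks plugins without a kernel dependency.
    return kernel == "all" or kernel == running_kernel


if __name__ == "__main__":
    driver_plugin = {"plugin": "AMDGPU-Pro Polaris", "kernel": "5.11.0-46-generic"}
    generic_plugin = {"plugin": "example-tool", "kernel": "all"}  # hypothetical plugin
    print(plugin_supports_kernel(driver_plugin, "5.13.0-25-generic"))
    print(plugin_supports_kernel(generic_plugin, "5.13.0-25-generic"))
```

On a node, the running kernel release could be taken from platform.release() or uname -r.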
Currently in debirf, this happens synchronously: download, decompress, install.
change this to be:
I used a freshly cloned, vanilla version of the coinboot master branch and tried to run docker-compose up, and it failed with the following error message:
Building coinboot
Traceback (most recent call last):
File "/usr/bin/docker-compose", line 11, in <module>
load_entry_point('docker-compose==1.17.1', 'console_scripts', 'docker-compose')()
File "/usr/lib/python2.7/dist-packages/compose/cli/main.py", line 68, in main
command()
File "/usr/lib/python2.7/dist-packages/compose/cli/main.py", line 121, in perform_command
handler(command, command_options)
File "/usr/lib/python2.7/dist-packages/compose/cli/main.py", line 952, in up
start=not no_start
File "/usr/lib/python2.7/dist-packages/compose/project.py", line 431, in up
svc.ensure_image_exists(do_build=do_build)
File "/usr/lib/python2.7/dist-packages/compose/service.py", line 318, in ensure_image_exists
self.build()
File "/usr/lib/python2.7/dist-packages/compose/service.py", line 923, in build
shmsize=parse_bytes(build_opts.get('shm_size')) if build_opts.get('shm_size') else None,
TypeError: build() got an unexpected keyword argument 'stream'
Booting a focal suite kernel 5.15.0-48-generic with the zram boot flag ends up in a kernel panic:
$ qemu-system-x86_64 -kernel coinboot-vmlinuz-5.15.0-48-generic -initrd coinboot-initramfs-5.15.0-48-generic -m 4096 -smp 2 -nographic -serial mon:stdio -append "console=ttyS0 net.ifnames=0 biosdevname=0 break=skip_loading_plugins zram"
[...]
Honoring zram kernel arg
[ 4.269253] zram: Added device: zram0
[ 4.284727] zstd: Unknown symbol ZSTD_initCCtx (err -2)
[ 4.285098] zstd: Unknown symbol ZSTD_getParams (err -2)
[ 4.285335] zstd: Unknown symbol ZSTD_CCtxWorkspaceBound (err -2)
[ 4.285609] zstd: Unknown symbol ZSTD_compressCCtx (err -2)
modprobe: can't load module zstd (kernel/crypto/zstd.ko): unknown symbol in module, or unknown parameter
[ 4.360702] Can't allocate a compression stream
[ 4.361164] zram: Cannot initialise zstd compressing backend
sh: write error: Cannot allocate memory
coinboot-initramfs-5.4.0-58-generic seems to have an issue with the locale.
During login/start of the image the following error message gets displayed:
warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
As described at: https://jamespotz.github.io/blog/eos-zswap-zstd-z3fold
CernVM-FS could be useful to distribute software. In contrast to a conventional network filesystem it caches aggressively.
Currently we use https://github.com/frzb/coinboot/blob/master/debirf/profiles/coinboot/modules/z0_remove-locales to get rid of locales and man pages.
https://packages.debian.org/buster/localepurge seems to cover all this and seems to have some integration in dpkg.
We need to evaluate if a switch to localepurge makes sense.