status-im / infra-nimbus Goto Github PK

View Code? Open in Web Editor NEW

9.0 43.0 6.0 10.73 MB

Infrastructure for Nimbus cluster

Home Page: https://nimbus.team

Makefile 3.26% HCL 31.33% Python 53.55% Shell 10.16% Jinja 1.69%

infra eth2 fleet nimbus

infra-nimbus's People

Contributors

Stargazers

Watchers

Forkers

sbc64 narimiran isabella232 pinkdiamond1 william3johnson inferno-inc

infra-nimbus's Issues

Deploy Beacon Node on Windows

We need a Windows host for Prater testnet nodes. The minimum hardware requirements:

At least 4-core CPU
At least 8 GB of RAM
NVMe SSD Disk

It will run 3 instances of infra-role-beacon-node connected to the Prater testnet. Each instance will run a build from a different branch (unstable, testing, stable). The nodes will take over validators of the current Prater testnet nodes with 03 index (e.g. stable-03, testing-03, etc).

It should also build the newest version of respective branch daily.

Full Details: #58

Create a node for the Kiln testnet

We need to prepare a new node for the long-lived Kiln testnet (testing the upcoming eth1-eth2 merge).

My suggestion is to reuse the current host metal-07.he-eu-hel1.nimbus.prater for this purpose by moving all existing Prater validators on it to metal-06.he-eu-hel1.nimbus.prater.

The freed host needs to run 4 beacon nodes paired with 4 Eth1 instances through the --web3-url parameter. Initially, all Eth1 instances will be based on Geth, but in the future we'll run Geth, Nethermind, Besu and Nimbus-eth1. The beacon nodes will be compiled from the kiln-dev-auth branch and started with the --network=kiln parameter. The Geth instances will be compiled from the kiln-merge-v2 branch and started with the --network=kiln parameter.

Missing logs from Nimbus ELK stack

Some logs are missing when querying them in our Nimbus Kibana instance, specifically:

{"lvl":"DBG","ts":"2022-01-28 14:05:54.724+00:00","msg":"Received request","topics":"beacnde","peer":"82.139.21.242:49814","meth":"GET","uri :"/totally_a_test"}
{"lvl":"DBG","ts":"2022-01-28 14:05:54.724+00:00","msg":"Request is not part of API","topics":"beacnde","peer":"82.139.21.242:49814","meth":"GET","uri":"/totally_a_test"}

Querying for these returns nothing:
https://nimbus-logs.infra.status.im/goto/05738788ae13e81a579cbcadc06e4cbb

Port fleet to AWS

Nimbus team has received some perks from the Ethereum Foundation. To make use of these resources I want to port the whole fleet from DigitalOcean to AWS.
Details:

You will be eligible for:
- $5,000 in AWS Promotional Credit valid for 2 years
- 1 year of AWS Business Support (up to $5,000).
- 80 credits for self-paced labs

Deploy Beacon Node on MacOS

We need a MacOS host for Prater testnet nodes. The minimum hardware requirements:

At least 4-core CPU
At least 8 GB of RAM
NVMe SSD Disk
Preferably a new Mac Mini with M1

It should also build the newest version of respective branch daily.

Full Details: #58

Create a new fleet for testing nbc and libp2p

We need a new fleet of 5 hosts to test nbc and libp2p. Using t3a.medium should be a good start.

Deploy new layout of Pyrmont fleet

I've merged #34, so now it's time to deploy the validators to the new hosts.

Plan:

Remove all validators and secrets from old fleet
Destroy all containers on new fleet
Deploy validators to the new fleet without starting containers
Destroy all containers on old fleet
Recreate all containers on new fleet

Windows Prater host ran out of disk space due to logs

Today at night the windows-01.gc-us-central1-a.nimbus.prater ran out of disk space which cause the nodes to go into a restart loop for about 2 hours:

This was caused by logs being rotated but not being deleted, which can be seen here:

admin@windows-01 MINGW64 .../nimbus/beacon-node-prater-unstable                                                                                          
$ ls -l *log
-rw-r--r-- 1 admin 197121          0 Sep  1 01:00 beacon-node-prater-unstable.20210901-010001.#0001.err.log
-rw-r--r-- 1 admin 197121 1723503744 Sep  2 01:00 beacon-node-prater-unstable.20210901-010001.#0001.out.log
-rw-r--r-- 1 admin 197121          0 Sep  2 01:00 beacon-node-prater-unstable.20210902-010000.#0001.err.log
-rw-r--r-- 1 admin 197121 1913530105 Sep  3 01:00 beacon-node-prater-unstable.20210902-010000.#0001.out.log
...(omitted)...
-rw-r--r-- 1 admin 197121          0 Oct 10 01:00 beacon-node-prater-unstable.20211010-010000.#0001.err.log                                   
-rw-r--r-- 1 admin 197121 1930078876 Oct 11 01:00 beacon-node-prater-unstable.20211010-010000.#0001.out.log                                   
-rw-r--r-- 1 admin 197121          0 Oct 11 01:00 beacon-node-prater-unstable.20211011-010000.#0001.err.log                                   
-rw-r--r-- 1 admin 197121 1926189824 Oct 12 00:59 beacon-node-prater-unstable.20211011-010000.#0001.out.log

All these logs took up ~111 GB of disk space, which after clearing restored the nodes.

Metrics / Grafana setup for the Fluffy fleet.

Since the launch of the Fluffy testnet/fleet (#87) quite some metrics have been added to Fluffy.
We are at the stage where it would be useful to be able to track these for our fleet.

I assume this means a setup of Prometheus and Grafana, similar to how it is done for nimbus-eth2.
I'd also like to have a read only public access to the Grafana dashboard, as exists for nimbus-eth2.

I have no idea how far these setups scale for a lot of nodes, so if doing this for 32 nodes in such dashboard is too much, it is fine to just select a sample of the nodes. Having a sample would be already sufficient to understand what is going on in the network.

I think we can use the same system with instance selector in the grafana dashboard?

There exists an Grafana dashboard with useful metrics for Fluffy here:
https://github.com/status-im/nimbus-eth1/blob/master/fluffy/grafana/fluffy_grafana_dashboard.json

It does not contain yet the instance variable. I can add that later.

Repurpose the Steklo node for the Nocturne testnet

The Steklo testnet has been discontinued and a new longer-term Rayonism testnet is planned to replace it (codename Nocturne with genesis time 2021-05-12 12:00:00 UTC). We can use the existing Steklo server to run 2400 validators for this new testnet. The validator keys can be found here:

https://github.com/status-im/nimbus-private/tree/master/nocturne_deposits

Just like in Steklo, the client should be built from the quick-merge-v1 branch. When launching it, the only change from the Steklo setup is that the --network parameter should be set to nocturne.

Deploy initial Fluffy testnet fleet

In the near future (exact dates undefined) cross client public testnets will be started for the Portal Network.
This issue is about having already our "own testnet fleet" running to get ready on the infrastructure side for when we need to join the public testnet(s). It would allow us developers to test more on a real life scenario also.

At least for the initial tests a Fluffy node will be very light on the resources. We do need however a good amount of nodes running for the testing to be relevant.
I expect maximum ~2 GB of storage per node for now (this will change for some nodes later on). CPU / memory is a bit more difficult to estimate at the moment, but it should be rather low in comparison with an eth2 or eth1 node.

Because of this it might be cheaper/better to run several nodes on the same machine. Especially if that makes it easier to launch more of them as we see fit.
With the low resources they use, different IPs might be the more costly part? For this reason there is the hidden development option in Fluffy to increase the amount of nodes with the same IP address that are accepted in the routing table (see options further down). This allows us to run many nodes on the same machine with 1 interface/ip.

I was thinking of initially starting with 32 to 64 nodes. Perhaps split over 2 to 4 machines. Then with the possibility of increasing the number of nodes per machine in the future. It depends on the cost, of which I have no good view. But this shouldn't be expensive I would expect. Feel free to suggest different setups.
Something like 4 of these nodes should be the bootstrap nodes.

Fluffy build instructions

https://github.com/status-im/nimbus-eth1/tree/master/fluffy#how-to-build--run

General fluffy options:

 ./build/fluffy --help
Usage: 

fluffy [OPTIONS]... command

The following options are available:

 --log-level              Sets the log level [=DEBUG].
 --udp-port               UDP listening port [=9009].
 --listen-address         Listening address for the Discovery v5 traffic [=0.0.0.0].
 --bootstrap-node         ENR URI of node to bootstrap Discovery v5 and the Portal networks from. Argument
                              may be repeated.
 --bootstrap-file         Specifies a line-delimited file of ENR URIs to bootstrap Discovery v5 and Portal
                              networks from.
 --nat                    Specify method to use for determining public address. Must be one of: any, none,
                              upnp, pmp, extip:<IP> [=any].
 --enr-auto-update        Discovery can automatically update its ENR with the IP address and UDP port as
                              seen by other nodes it communicates with. This option allows to enable/disable
                              this functionality [=false].
 --data-dir               The directory where fluffy will store the content data
                              [=~/.cache/fluffy].
 --netkey-file            Source of network (secp256k1) private key file [=config.dataDir / "netkey"].
 --metrics                Enable the metrics server [=false].
 --metrics-address        Listening address of the metrics server [=127.0.0.1].
 --metrics-port           Listening HTTP port of the metrics server [=8008].
 --rpc                    Enable the JSON-RPC server [=false].
 --rpc-port               HTTP port for the JSON-RPC server [=8545].
 --rpc-address            Listening address of the RPC server [=127.0.0.1].
 --bridge-client-uri      if provided, enables getting data from bridge node.
 --proxy-uri              URI of eth client where to proxy unimplemented rpc methods to
                              [=http://127.0.0.1:8546].
 --radius                 Hardcoded (logarithmic) radius for each Portal network. This is a temporary
                              development option which will be replaced in the future by e.g. a storage size
                              limit [=256].

Options that should be configurable in the role

--log-level -> default to debug for now
--udp-port -> e.g. when we run several nodes on the same machine
--nat:extip -> in case we need to set this
--data-dir -> database etc location
--netkey-file -> Differently from eth2 nodes, the netkey file should default be stored for all nodes, not just bootstrap nodes. As the data it stores in the database is based on this key and the radius value. However, there is no huge issue with losing it, it will just redownload an other range of data. Except for bootstrap nodes once we would be running in a public testnet. Also, the file is just plaintext at the moment and not yet an encrypted keystore.
--bootstrap-file -> We should create a file with the ENRs of the designated bootstrap nodes
--bootstrap-node -> or, instead of the file passed as individual arguments
--table-ip-limit -> Hidden parameter, can set to 1024 to avoid any ip limiting issues for now
--bucket-ip-limit -> Hidden parameter, can set to 24 to avoid any ip limiting issues for now
--bits-per-hop=1 -> set to 1 default now, later we might play with this value
--radius=253 -> Should become adjustable, can set to 253 default for now

Metrics and rpc will be definitely be useful to enable later on. And have some dashboard, but it is not a priority for now (we don't really have good metrics yet anyhow).

It would be nice to see some of these nodes running in say two weeks (week of 9 March), if that is feasible.

cc @jakubgs

Nimbus ElasticSearch cluster queries are noticeably slow

When trying to view longer periods of time on https://nimbus-logs.status.im/ there queries are quite slow and sometimes even fail. This is most probably due to use using sc1 EBS volumes.

infra-nimbus/logs.tf

Lines 13 to 14 in 4c10d02

 data_vol_size = 500 /* We'll be storing TRACE logs */ 

 data_vol_type = "sc1" /* Change to gp2 for SSD */

Wipe database of 2 mainnet nodes

Every weekend, we should wipe the database of 2 mainnet nodes and restart them, so as to stress the syncing algorithm, forwards and backwards.

On one of them, we should run trusted node sync without backfill before restartin:

build/nimbus_beacon_node trustedNodeSync --network:mainnet \
 --data-dir=build/data/shared_mainnet_0 \
 --trusted-node-url=http://localhost:XXXX \
 --backfill:false

New `prater` testnet launching

There's a new 200k validator testnet being discussed at the eth2 dev call, named prater - we will need a new infra setup for this akin to the one for pyrmont - this setup will eventually replace pyrmont but the timeline is not yet certain.

We will receive a total of 40k validators to manage (vs the 20k we have on Pyrmont today).

This might be a good time to investigate cheaper alternatives to AWS or alternative node types on AWS.

Add a new Hertzner host

The server should have similar specs to the existing servers:

It will host:

Our RocketPool prater nodes
Our Blox staking network nodes
A nimbus-eth1 instance connected to Goerli

Have nimbus bootstrap nodes run with subscribe-all-subnets

The next release (current unstable branch) will introduce a feature to subscribe the node to all subnets, which means it will practically subscribe to all eth2 gossipsub topics.

We could/should make our bootnodes turn this feature on, so that they are nodes which are most interesting to connect to also after syncing or for discovery reasons.

To enable you need to pass --subscribe-all-subnets:true on cli.

Deploy an eth2 mainnet node to be used as bootstrap node

Deployment of a first eth2 mainnet node, to be used as bootstrap node.

Similar like the pyrmont nodes, but now for the mainnet. To be started as soon as nimbus-eth2 has mainnet support (included the genesis).

It should:

be build from master
have no validators attached
not change ip / ports
not change network key (which is default arranged already through https://github.com/status-im/infra-role-beacon-node/blob/master/tasks/compose.yml#L42 )

The two last points are I believe already the case for all our pyrmont nodes, I just repeat it here again for clarity as for a bootstrap node it is important these stay the same over time, else the ENR is no longer useful.

SSD Firmware upgrade required on Hetzner

I got an email about and important Samsung PM9A1 NVMe SSD firmware upgrade from Hetzner:

Dear Sir or Madam

We're writing to you today because you have one or more dedicated root servers mounted with a Samsung PM9A1 NVMe SSD, and the manufacturer recommends that you make a firmware update.

With the currently installed firmware, there may be increased RTBB (runtime bad blocks), and UECC (uncorrectable error correction codes/media errors) occur in some cases.

That's why we recommend that you check on the firmware update on all of your servers that are mounted with one or more Samsung PM9A1 NVMe SSDs. Please use the Hetzner Rescue System and enter the command "update_samsung" to check on the firmware update, and update the firmware if necessary.

Even though data loss is not likely in this process, we recommend that you perform a full backup of your data before updating the firmware.

Your following servers are affected:

AX41-NVMe #1517896 (65.21.230.244) - metal-01.he-eu-hel1.nimbus.eth1
AX41-NVMe #1628143 (65.108.4.68) - windows-02.he-eu-hel1.slave.ci
If you have any questions or need any support, please contact our team by writing a support request. Log onto your Robot account and then go to "Support" in the menu. We will be happy to help!

Kind regards

Deploy a Goerli fully synced node for Nimbus

Right now we use Infura wss:// for accessing the Goerli testnet, for example:

infra-nimbus/ansible/group_vars/nimbus-pyrmont-master.yml

Line 4 in 7901c37

 beacon_node_web3_url: 'wss://goerli.infura.io/ws/v3/675db4626923473591cf6418e4dae175' 

Running our own full node with --ws enabled would be preferable.

Add eth2stats client to each node

eth2stats allows us to track the nodes on https://pyrmont.eth2.wtf/#/

The instructions for starting the eth2stats client are here:

https://github.com/protolambda/pyrmont#eth2stats

with --eth2stats.node-name="example node" and --beacon.type=nimbus changed, as well as the ports for metrics and RPC.

A way to collect metrics from Nimbus beacon nodes

I've talked with @Swader about how to collect and publicly publish metrics available in beacon node logs. Currently the logs are being collected in our ES cluster and can be queried to collect information we need.

What he proposed is publishing JSON files with the data on a public HTTPS endpoint.
Example:
https://gist.github.com/Swader/5aecd54c27b46e175c8c1954c48cd30a

Introduce Windows and macOS hosts to the Nimbus fleet

Nimbus strives to officially support Linux, Windows and macOS in both 32-bit and 64-bit builds for x86 and ARM. Unfortunately, at the moment, we are testing adequately only one specific build configuration on our Pyrmon and Prater testnet fleets (Linux, 64-bit, x86 aka AMD64). The goal of this task is to introduce additional Windows and macOS hosts to the fleet that will allow us to detect potential problems on these operating systems.

We need to select a cloud provider for a single Windows server and a single macOS server. The minimum hardware requirements for both hosts are as follows:

at least 4-core CPU
at least 8 GB of RAM
NVMe SSD Disk

Preferably, the macOS host should be the new mac mini M1.

Each host will run 3 instances of infra-role-beacon-node connected to the Prater testnet. Each instance will run a build from a different branch (unstable, testing, stable). The Windows servers will take over the validators of the current Prater testnet assigned to index 03 (e.g. stable-03, testing-03, etc). The macOS will take the validators currently assigned to index 04.

On the Windows server, the nimbus_beacon_node.exe executable will be launched as a service through the
https://github.com/winsw/winsw utility.

Both hosts will feature an update script that builds the binaries from source from the latest revision of the respective git branch. The update script should be executable on-demand and also automatically once per day through a scheduled task. The node should be restarted only if the latest git revision has changed. The nimbus-private repo should be updated with some minimal instructions for executing the update script manually and for inspecting the log output of the automated runs.

Tasks

#59 - Deploy Beacon Node on Windows
#60 - Deploy Beacon Node on MacOS

Host RPM/DEB in yum/apt repo

In status-im/nimbus-eth2#3034 we create RPM:s/DEB:s as part of the release process - the next step would be for these to be uploaded to a repo that users can add to their package manager and thus have the ability to upgrade using package manager commands.

This would require figuring out how such repos are created and where from they can be served.

Nodes running out of space due to log growth spikes

Some nodes stopped rotating logs using the logrotate script in cron.hourly due to lack of space:

master-01.aws-eu-central-1a.nimbus.test
node-01.aws-eu-central-1a.nimbus.test
node-02.aws-eu-central-1a.nimbus.test
node-03.aws-eu-central-1a.nimbus.test

Example logs from node-01:

[email protected]:~ % grep cron.hourly /var/log/syslog
Mar 12 01:17:01 node-01.aws-eu-central-1a.nimbus.test CRON[7574]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Mar 12 02:17:01 node-01.aws-eu-central-1a.nimbus.test CRON[7859]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)

They stop after 02:17:01 because of lack of space:

[email protected]:~ % grep -i "no space" /var/log/syslog      
Mar 12 02:35:44 node-01.aws-eu-central-1a.nimbus.test kernel: [1251094.964501] systemd-journald[411]: Failed to create new system journal: No space left on device
Mar 12 02:35:44 node-01.aws-eu-central-1a.nimbus.test kernel: [1251094.974423] systemd-journald[411]: Failed to create new user journal: No space left on device
Mar 12 02:35:44 node-01.aws-eu-central-1a.nimbus.test kernel: [1251094.990822] systemd-journald[411]: Failed to create new user journal: No space left on device
Mar 12 02:35:44 node-01.aws-eu-central-1a.nimbus.test rsyslogd: file '13' write error: No space left on device [v8.32.0 try http://www.rsyslog.com/e/2027 ]
Mar 12 02:35:44 node-01.aws-eu-central-1a.nimbus.test rsyslogd: file '13' write error: No space left on device [v8.32.0 try http://www.rsyslog.com/e/2027 ]

New layout of Nimbus nodes

We will need a new layout of Nimbus nodes in the fleet. All hosts will be in the madella network:

4 hosts running master branch
4 hosts running devel branch
2 hosts running nim-libp2p-auto-bump branch

Each node should have its Docker image updated every 24 hours, but with an offset so as to not synchronize restarts across hosts.

There are 1024 validator keys we posses. Those are currently stored in master-01 and slave-01 to 05. Those need to be spread out so all 10 of madella network nodes have their own.

They are stored under secrets and validators subfolders of the node data.dir:

[email protected]:~ % sudo ls -l /data/beacon-node-testnet3/data/shared_medalla_0 | grep -E '(secrets|validators)'
drwxr-xr-x   2 dockremap dockremap 28672 Aug  3 17:26 secrets
drwxr-xr-x 172 dockremap dockremap 32768 Sep 16 13:10 validators

They need to be transferred in matching pairs.

Details: https://status-im.github.io/nimbus-eth2/medalla.html?highlight=validators#key-management

Automatically purge old logs from ES cluster

The TRACE log level in the ElasticSearch cluster will cause very fast growth of stored data. To avoid issues with running out of space we need a periodic job that will remove logs older than N months.

Server to test libp2p `unstable` branch

libp2p needs to move forward with new features and potentially larger and riskier changes/refactors. In order to prevent disrupting production code that relies on a stable master branch, all major changes are now being worked in the unstable branch. For further background, take a look at vacp2p/nim-libp2p#542

We want to replicate the same flow of libp2p master for testing with libp2p unstable. Essentially, we want a new box where we can deploy the unstable branch in the same way we do with master, which (AFAIK) is currently being deployed to unstable-libp2p-small-01.aws-eu-central-1a.nimbus.pyrmont with the nim-libp2p-auto-bump branch. Just like with master, there will be an auto-bump (nim-libp2p-auto-bump-unstable) branch in nimbus-eth for the nim-libp2p unstable branch that will be updated every time a new commit lands there. This will need to eventually be migrated to the prater network (#41).

To summarize:

We need an additional box exactly the same as unstable-libp2p-small-01.aws-eu-central-1a.nimbus.pyrmont
A new autobump job will be generated to create a branch to deploy to this new machine
- Please let me know if this needs to happen first

Research Hetzner as a hosting provider

Our current testnet servers for pyrmont (and soon prater) are running on AWS where the cost is too high, it's going to be close to 10k USD for both testnets. For the new prater testnet we can look into a cheaper hosting provider.

For the prater testnet we're using the same instance type as for the large pyrmont instances. These are AWS z1d.large instances with 2 CPU, 16GB RAM and a 150GB nvme drive.

In #41 we've already discussed this and Hetzner came up as a possible provider.

Monitoring

To estimate which server resources we need, we can look at the infra metrics we've collected from the pyrmont testnet. They can be seen in Grafana.. The data I'm looking at is for z1d.large instances for the stable branch.

The instances are maxing out the CPU (near 100% CPU usage) and the nimbus_beacon_node process only uses a single core. The instances are running on a 4 GHz core freq. The RAM usage is only 1.7GB:

The disk space used is around 25G and disk operations around 60 IOPS write and below 10 IOPS for read (the spikes are probably due to log-rotation since debug logging is enabled which produces a lot of data):

(-66 is the write iops, 1 is the read iops)

Overall I think that a machine with 2 CPU (but with a high core frequency), 4GB RAM and an SSD drive with 50GB is enough.

Hetzner Instances

There's Hetzner Cloud: https://www.hetzner.com/cloud which has interesting machines such as CCX11 with 2 (dedicated) cores, 8 gb ram, 80gb drive for €24.88/mo. But the CPU is Intel Xeon Skylake with 2,1GHz.

Taking a look at geekbench scores:

AWS z1d.large https://browser.geekbench.com/v5/cpu/6917113 has 1084 Single-Core Score
The Hetzner Cloud CPUs (Intel Xeon Skylake) have half the score ~700 https://browser.geekbench.com/v5/cpu/7008330. The EPYC CPUs have worse single-core score but higher multi-core score.
The dedicated Hetzner Ryzen machines have around 1300 Single-Core Score https://browser.geekbench.com/v5/cpu/7013293 (comparison https://browser.geekbench.com/v5/cpu/compare/7013293?baseline=6917113)

The cheapest dedicated machine I could find is AX41 with AMD Ryzen 5 3600 6-core, 64 GB RAM and 4 TB HDD for 40.46 € monthly + 46.41 € setup fee. Can be customized on https://www.hetzner.com/dedicated-rootserver/ax41/configurator. With ECC RAM and SSD it's around 50 EUR (60 USD).

We pay around 380 USD/month for an AWS instance. We could get better performance using a Ryzen dedicated machine on hetzner for 60 USD/month. Since the nimbus process is single-core we'd probably pay for 6 cores and not use 5 of them. Same for RAM and Disk. We'd have 60GB unused ram and near 4TB of empty HDD space but maybe it can be used for something else

Set prometheus and grafana to collect at 12s intervals

In eth2, things happen at a 12s cadence, so it makes sense that monitoring tools poll once every 12s, instead of the default 15s - this should make the graphs line up better with fewer spikes / irregularities that are caused by lack of timing sympathy.

Provision a new large node for the Steklo (Rayonism) testnet

This node will participate in the first testnet aiming to experiment with the upcoming Eth1+Eth2 merge functionality. The node needs to be provisioned with the following roles:

infra-role-beacon-node
infra-role-geth

The Geth node will need to use a recent version of Geth (compiled from the master branch). It needs to be launched with a new special flag (--catalyst) that enables the merge spec. It should also allow overriding the the --miner.etherbase flag. Since the merge spec requires a fresh network, the storage requirements will be low and there will be no syncing.

The beacon node will use a new docker tag, called rayonism. It will be initially built from the quick-merge-v1 branch in nimbus-eth2. The --network parameter will be set to steklo. The node will run 2048 validators that will be pushed later today or tomorrow to nimbus-private/steklo_validators.

Deploy validator nodes on Hetzner server

We bought a dedicated server from Hetzner (ax41-nvme) which we want to use for running validator nodes (see #45 for the research).

To do this we need to setup supporting infrastructure for the new provider first. This infra will run in Hetzner Cloud and the instances will be in the same data center as the dedicated server (Finland).

Specifically we need the following servers in hetzner cloud (https://www.hetzner.com/cloud):

3x CX11 for Consul
1x CPX11 for Prometheus
1x CPX11 for Logstash

The tasks are:

Create a terraform module for hetzner cloud (https://github.com/status-im/infra-tf-hetzner-cloud) that follows the same patterns as our other modules (https://github.com/status-im/infra-tf-digital-ocean, https://github.com/status-im/infra-tf-google-cloud/, ...)
Use the module in our existing setup to provision servers/applications mentioned above
Create a terraform module for the dedicated server and bootstrap it with the existing ansible bootstrap playbooks
Use the dedicated server tf module in this repo and a few deploy validators

Migrate ELK Stack to Hetzner

Based on investigation from #81 it makes no sense to keep our ElasticSearch cluster, or other elements of ELK stack, in AWS.

Moving them to Hetzner will make them both cheaper, and perform better, and also be able to handle more traffic.

Create a nim-waku fleet

We need a test fleet for Waku implementation in Nim: https://github.com/status-im/nim-waku

The fleet should be separate from Nimbus test fleet and once tested will probably replace eth.prod.

Investigate large volume of connections in FIN_WAIT2 state

We're seeing a lot of connections stuck in FIN_WAIT2 state on Pyrmont fleet hosts.

Windows nodes not receiving or sending attestation

I just noticed that the windows host has not been handling attestations since 2021-08-06 at 14:00:00:

But I can clearly see validators getting loaded at startup:

admin@windows-01 MINGW64 .../nimbus/beacon-node-prater-stable
$ grep 'Local validator attached' beacon-node-prater-stable.out.log | wc -l
2500

Create separate ElasticSearch cluster

@tersec has complained that:

Currently we have logging set to DEBUG level, not TRACE, which would have made debugging some issues a bit easier without waiting to reproduce things. This has become an increasing issue as sometimes reproduction has needed multiple days.
Just because it's a function of how long it takes for some event to occur. And my understanding had been that it was due to storage space concerns.

And that is kinda correct. If we flooded our main ES cluster with TRACE logs from Nimbus fleet it would make querying for anything else extremely slow. It might make sense to create a separate ElasticSearch/Logstash/Kibana stack to Nimbus.

Collect host resource metrics from Windows

The nodes deployed in #59 expose metrics for Prometheus and are being collected, but there is no collection of host resource metrics, and as far as I can tell there is no Windows version of Netdata.

Diagnose Nimbus hosts going down randomly

We've been having Nimbus nodes going down randomly without a clear culprit. We need to debug this in detail.

Create a new role for nimbus-eth1 and use it on the nimbus.eth1 host.

Date-stamp log files

Currently, log files are called docker.1 etc - this is a bit annoying because every hour, all logfiles are renamed and it's a bit messy to correlate an event in metrics with the corresponding log file.

Date-stamped files instead would be nice, UTC ideally so that it's easy to cross-reference.

Deploy new mainnet instances

With the hetzner servers, we can run a few more mainnet nodes giving us additional insight into mainnet behaviors - ideal would be to deploy 4 servers with 6 instances on each server, split between stable and testing, totalling 24 instances.

The instances should run with --subscribe-all-subnets similar to the existing mainnet bootnodes on AWS (which must remain)

Segmentation faults on Prater metal-01 host

We saw multiple nodes from different branches fail due to segfaults on metal-01.he-eu-hel1.nimbus.prater

[email protected]:~ % sudo dmesg | grep segfault | wc -l
76

[7689909.942090] nimbus_beacon_n[890804]: segfault at 20 ip 00005607b4eb8f9e sp 00007ffddee17270 error 4 in nimbus_beacon_node_d878948e[5607b4e8d000+a10000]
[7689909.942096] Code: 04 25 38 c9 ff ff 64 49 8b 4d 00 48 8b 4c c8 f8 48 89 08 64 49 8b 45 00 48 ff c8 64 49 89 45 00 49 83 3e 07 77 b6 4d 8b 7e 08 <49> 83 7f 20 00 74 1a 64 48 ff 42 60 49 8d 7e 10 48 89 55 c8 41 ff
[7695548.662046] nimbus_beacon_n[963140]: segfault at 0 ip 0000558ac6123184 sp 00007ffe11ab91b0 error 4 in nimbus_beacon_node_bef13b6c[558ac60f7000+a10000]
[7695548.662052] Code: 83 fa 07 0f 86 64 01 00 00 48 83 e2 fb 48 89 16 48 89 01 e9 e9 fc ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 49 8b 47 10 <48> 8b 00 48 83 c0 10 f0 48 29 06 e9 6e fe ff ff 66 66 2e 0f 1f 84
[8289560.838367] nimbus_beacon_n[2608030]: segfault at 20 ip 0000564b43a8bf9e sp 00007fff7f3136c0 error 4 in nimbus_beacon_node_bef13b6c[564b43a60000+a10000]
[8289560.838378] Code: 04 25 38 c9 ff ff 64 49 8b 4d 00 48 8b 4c c8 f8 48 89 08 64 49 8b 45 00 48 ff c8 64 49 89 45 00 49 83 3e 07 77 b6 4d 8b 7e 08 <49> 83 7f 20 00 74 1a 64 48 ff 42 60 49 8d 7e 10 48 89 55 c8 41 ff
[8677743.469070] nimbus_beacon_n[3587633]: segfault at 563c88c80c28 ip 0000563c8511fe11 sp 00007fffc71d2060 error 7 in nimbus_beacon_node_bef13b6c[563c850f4000+a10000]
[8677743.469076] Code: 64 48 39 04 25 10 f2 ff ff 7d 7e 64 48 8b 04 25 00 00 00 00 4c 89 e6 48 8d b8 78 c9 ff ff 4c 8d a8 10 c9 ff ff e8 cf c6 84 00 <48> 89 58 08 48 c7 00 04 00 00 00 64 48 8b 14 25 28 c9 ff ff 64 48
[8677796.904202] nimbus_beacon_n[3588949]: segfault at 20 ip 0000561a3ca34f9e sp 00007ffc56cb0960 error 4 in nimbus_beacon_node_bef13b6c[561a3ca09000+a10000]
[8677796.904208] Code: 04 25 38 c9 ff ff 64 49 8b 4d 00 48 8b 4c c8 f8 48 89 08 64 49 8b 45 00 48 ff c8 64 49 89 45 00 49 83 3e 07 77 b6 4d 8b 7e 08 <49> 83 7f 20 00 74 1a 64 48 ff 42 60 49 8d 7e 10 48 89 55 c8 41 ff

unexpected and asymetric Docker image age change

stefan$ ansible all -i ansible/inventory/test -u stefan -o -m shell -a "docker ps" | sed 's/\\n/\n/g'
[WARNING]: Invalid characters were found in group names but not replaced, use -vvvv to see details
[WARNING]: Unhandled error in Python interpreter discovery for host node-09.aws-eu-central-1a.nimbus.test: Failed to connect to the host via ssh:
[email protected]: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
node-09.aws-eu-central-1a.nimbus.test | UNREACHABLE!: Data could not be sent to remote host "18.197.66.222". Make sure this host can be reached over ssh: [email protected]: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
node-06.aws-eu-central-1a.nimbus.test | CHANGED | rc=0 | (stdout) CONTAINER ID        IMAGE                                    COMMAND                  CREATED             STATUS              PORTS                                                                    NAMES
153a4e0b19e1        statusteam/nimbus_beacon_node:testnet0   "/usr/bin/run_beacon…"   2 hours ago         Up 2 hours          0.0.0.0:9000->9000/tcp, 0.0.0.0:9200->9200/tcp, 0.0.0.0:9000->9000/udp   beacon-node-testnet0-1
5d5dade8d4ad        statusteam/nimbus_beacon_node:testnet1   "/usr/bin/run_beacon…"   18 hours ago        Up 18 hours         0.0.0.0:9100->9100/tcp, 0.0.0.0:9300->9300/tcp, 0.0.0.0:9100->9100/udp   beacon-node-testnet1-1
f45f83537612        statusteam/nimbus_beacon_node:testnet1   "/usr/bin/run_beacon…"   18 hours ago        Up 18 hours         0.0.0.0:9101->9101/tcp, 0.0.0.0:9301->9301/tcp, 0.0.0.0:9101->9101/udp   beacon-node-testnet1-2
f8ff407899db        statusteam/nimbus_beacon_node:testnet0   "/usr/bin/run_beacon…"   47 hours ago        Up 46 hours         0.0.0.0:9001->9001/tcp, 0.0.0.0:9201->9201/tcp, 0.0.0.0:9001->9001/udp   beacon-node-testnet0-2
node-07.aws-eu-central-1a.nimbus.test | CHANGED | rc=0 | (stdout) CONTAINER ID        IMAGE                                    COMMAND                  CREATED             STATUS              PORTS                                                                    NAMES
60a4779539ad        statusteam/nimbus_beacon_node:testnet0   "/usr/bin/run_beacon…"   2 hours ago         Up 2 hours          0.0.0.0:9000->9000/tcp, 0.0.0.0:9200->9200/tcp, 0.0.0.0:9000->9000/udp   beacon-node-testnet0-1
e72edae9d9b1        statusteam/nimbus_beacon_node:testnet1   "/usr/bin/run_beacon…"   18 hours ago        Up 18 hours         0.0.0.0:9100->9100/tcp, 0.0.0.0:9300->9300/tcp, 0.0.0.0:9100->9100/udp   beacon-node-testnet1-1
c58de58f068b        statusteam/nimbus_beacon_node:testnet1   "/usr/bin/run_beacon…"   18 hours ago        Up 18 hours         0.0.0.0:9101->9101/tcp, 0.0.0.0:9301->9301/tcp, 0.0.0.0:9101->9101/udp   beacon-node-testnet1-2
d745c18c4e42        statusteam/nimbus_beacon_node:testnet0   "/usr/bin/run_beacon…"   47 hours ago        Up 46 hours         0.0.0.0:9001->9001/tcp, 0.0.0.0:9201->9201/tcp, 0.0.0.0:9001->9001/udp   beacon-node-testnet0-2
node-04.aws-eu-central-1a.nimbus.test | CHANGED | rc=0 | (stdout) CONTAINER ID        IMAGE                                    COMMAND                  CREATED             STATUS              PORTS                                                                    NAMES
c6e5d7419567        statusteam/nimbus_beacon_node:testnet0   "/usr/bin/run_beacon…"   2 hours ago         Up 2 hours          0.0.0.0:9000->9000/tcp, 0.0.0.0:9200->9200/tcp, 0.0.0.0:9000->9000/udp   beacon-node-testnet0-1
5a0aad9ff4fa        statusteam/nimbus_beacon_node:testnet1   "/usr/bin/run_beacon…"   18 hours ago        Up 18 hours         0.0.0.0:9100->9100/tcp, 0.0.0.0:9300->9300/tcp, 0.0.0.0:9100->9100/udp   beacon-node-testnet1-1
51db543b7245        statusteam/nimbus_beacon_node:testnet1   "/usr/bin/run_beacon…"   18 hours ago        Up 18 hours         0.0.0.0:9101->9101/tcp, 0.0.0.0:9301->9301/tcp, 0.0.0.0:9101->9101/udp   beacon-node-testnet1-2
5a8b822316c2        statusteam/nimbus_beacon_node:testnet0   "/usr/bin/run_beacon…"   47 hours ago        Up 46 hours         0.0.0.0:9001->9001/tcp, 0.0.0.0:9201->9201/tcp, 0.0.0.0:9001->9001/udp   beacon-node-testnet0-2
node-03.aws-eu-central-1a.nimbus.test | CHANGED | rc=0 | (stdout) CONTAINER ID        IMAGE                                    COMMAND                  CREATED             STATUS              PORTS                                                                    NAMES
ecbca4995034        statusteam/nimbus_beacon_node:testnet0   "/usr/bin/run_beacon…"   2 hours ago         Up 2 hours          0.0.0.0:9000->9000/tcp, 0.0.0.0:9200->9200/tcp, 0.0.0.0:9000->9000/udp   beacon-node-testnet0-1
50b45d6d7c7e        statusteam/nimbus_beacon_node:testnet1   "/usr/bin/run_beacon…"   18 hours ago        Up 18 hours         0.0.0.0:9100->9100/tcp, 0.0.0.0:9300->9300/tcp, 0.0.0.0:9100->9100/udp   beacon-node-testnet1-1
cc7521ad84c9        statusteam/nimbus_beacon_node:testnet1   "/usr/bin/run_beacon…"   18 hours ago        Up 18 hours         0.0.0.0:9101->9101/tcp, 0.0.0.0:9301->9301/tcp, 0.0.0.0:9101->9101/udp   beacon-node-testnet1-2
7ba8047ff571        statusteam/nimbus_beacon_node:testnet0   "/usr/bin/run_beacon…"   47 hours ago        Up 46 hours         0.0.0.0:9001->9001/tcp, 0.0.0.0:9201->9201/tcp, 0.0.0.0:9001->9001/udp   beacon-node-testnet0-2
node-05.aws-eu-central-1a.nimbus.test | CHANGED | rc=0 | (stdout) CONTAINER ID        IMAGE                                    COMMAND                  CREATED             STATUS              PORTS                                                                    NAMES
37090c5006c0        statusteam/nimbus_beacon_node:testnet0   "/usr/bin/run_beacon…"   2 hours ago         Up 2 hours          0.0.0.0:9000->9000/tcp, 0.0.0.0:9200->9200/tcp, 0.0.0.0:9000->9000/udp   beacon-node-testnet0-1
5afd8c1d9ed2        statusteam/nimbus_beacon_node:testnet1   "/usr/bin/run_beacon…"   18 hours ago        Up 18 hours         0.0.0.0:9100->9100/tcp, 0.0.0.0:9300->9300/tcp, 0.0.0.0:9100->9100/udp   beacon-node-testnet1-1
fbf7199f1f9b        statusteam/nimbus_beacon_node:testnet1   "/usr/bin/run_beacon…"   18 hours ago        Up 18 hours         0.0.0.0:9101->9101/tcp, 0.0.0.0:9301->9301/tcp, 0.0.0.0:9101->9101/udp   beacon-node-testnet1-2
e1fb0a073318        statusteam/nimbus_beacon_node:testnet0   "/usr/bin/run_beacon…"   47 hours ago        Up 46 hours         0.0.0.0:9001->9001/tcp, 0.0.0.0:9201->9201/tcp, 0.0.0.0:9001->9001/udp   beacon-node-testnet0-2
node-08.aws-eu-central-1a.nimbus.test | CHANGED | rc=0 | (stdout) CONTAINER ID        IMAGE                                    COMMAND                  CREATED             STATUS              PORTS                                                                    NAMES
efa53f069180        statusteam/nimbus_beacon_node:testnet0   "/usr/bin/run_beacon…"   2 hours ago         Up 2 hours          0.0.0.0:9000->9000/tcp, 0.0.0.0:9200->9200/tcp, 0.0.0.0:9000->9000/udp   beacon-node-testnet0-1
9a13b135d61f        statusteam/nimbus_beacon_node:testnet1   "/usr/bin/run_beacon…"   18 hours ago        Up 18 hours         0.0.0.0:9100->9100/tcp, 0.0.0.0:9300->9300/tcp, 0.0.0.0:9100->9100/udp   beacon-node-testnet1-1
07cfe461de92        statusteam/nimbus_beacon_node:testnet1   "/usr/bin/run_beacon…"   18 hours ago        Up 18 hours         0.0.0.0:9101->9101/tcp, 0.0.0.0:9301->9301/tcp, 0.0.0.0:9101->9101/udp   beacon-node-testnet1-2
5b919ff93b8e        statusteam/nimbus_beacon_node:testnet0   "/usr/bin/run_beacon…"   47 hours ago        Up 46 hours         0.0.0.0:9001->9001/tcp, 0.0.0.0:9201->9201/tcp, 0.0.0.0:9001->9001/udp   beacon-node-testnet0-2
node-01.aws-eu-central-1a.nimbus.test | CHANGED | rc=0 | (stdout) CONTAINER ID        IMAGE                                    COMMAND                  CREATED             STATUS              PORTS                                                                    NAMES
73ae218df3bc        statusteam/nimbus_beacon_node:testnet0   "/usr/bin/run_beacon…"   2 hours ago         Up 27 minutes       0.0.0.0:9000->9000/tcp, 0.0.0.0:9200->9200/tcp, 0.0.0.0:9000->9000/udp   beacon-node-testnet0-1
e228a4ecda3c        statusteam/nimbus_beacon_node:testnet1   "/usr/bin/run_beacon…"   18 hours ago        Up 18 hours         0.0.0.0:9100->9100/tcp, 0.0.0.0:9300->9300/tcp, 0.0.0.0:9100->9100/udp   beacon-node-testnet1-1
c65a1ad33606        statusteam/nimbus_beacon_node:testnet1   "/usr/bin/run_beacon…"   18 hours ago        Up 18 hours         0.0.0.0:9101->9101/tcp, 0.0.0.0:9301->9301/tcp, 0.0.0.0:9101->9101/udp   beacon-node-testnet1-2
166f983f8077        statusteam/nimbus_beacon_node:testnet0   "/usr/bin/run_beacon…"   47 hours ago        Up 46 hours         0.0.0.0:9001->9001/tcp, 0.0.0.0:9201->9201/tcp, 0.0.0.0:9001->9001/udp   beacon-node-testnet0-2
node-02.aws-eu-central-1a.nimbus.test | CHANGED | rc=0 | (stdout) CONTAINER ID        IMAGE                                    COMMAND                  CREATED             STATUS              PORTS                                                                    NAMES
d1cfb0c4e200        statusteam/nimbus_beacon_node:testnet0   "/usr/bin/run_beacon…"   2 hours ago         Up 2 hours          0.0.0.0:9000->9000/tcp, 0.0.0.0:9200->9200/tcp, 0.0.0.0:9000->9000/udp   beacon-node-testnet0-1
282e36994eb9        statusteam/nimbus_beacon_node:testnet1   "/usr/bin/run_beacon…"   18 hours ago        Up 18 hours         0.0.0.0:9100->9100/tcp, 0.0.0.0:9300->9300/tcp, 0.0.0.0:9100->9100/udp   beacon-node-testnet1-1
c17c7a278bce        statusteam/nimbus_beacon_node:testnet1   "/usr/bin/run_beacon…"   18 hours ago        Up 18 hours         0.0.0.0:9101->9101/tcp, 0.0.0.0:9301->9301/tcp, 0.0.0.0:9101->9101/udp   beacon-node-testnet1-2
755a7c26e463        statusteam/nimbus_beacon_node:testnet0   "/usr/bin/run_beacon…"   47 hours ago        Up 46 hours         0.0.0.0:9001->9001/tcp, 0.0.0.0:9201->9201/tcp, 0.0.0.0:9001->9001/udp   beacon-node-testnet0-2
master-01.aws-eu-central-1a.nimbus.test | CHANGED | rc=0 | (stdout) CONTAINER ID        IMAGE                                    COMMAND                  CREATED             STATUS              PORTS                                                                    NAMES
10170f398ef7        statusteam/nimbus_beacon_node:testnet0   "/usr/bin/run_beacon…"   2 hours ago         Up 27 minutes       0.0.0.0:9000->9000/tcp, 0.0.0.0:9200->9200/tcp, 0.0.0.0:9000->9000/udp   beacon-node-testnet0-1
6807c2cdebe8        statusteam/nimbus_beacon_node:testnet1   "/usr/bin/run_beacon…"   18 hours ago        Up 18 hours         0.0.0.0:9100->9100/tcp, 0.0.0.0:9300->9300/tcp, 0.0.0.0:9100->9100/udp   beacon-node-testnet1-1
8c3fcffa5507        statusteam/nimbus_beacon_node:testnet1   "/usr/bin/run_beacon…"   18 hours ago        Up 18 hours         0.0.0.0:9101->9101/tcp, 0.0.0.0:9301->9301/tcp, 0.0.0.0:9101->9101/udp   beacon-node-testnet1-2
4cd87e925075        statusteam/nimbus_beacon_node:testnet0   "/usr/bin/run_beacon…"   47 hours ago        Up 46 hours         0.0.0.0:9001->9001/tcp, 0.0.0.0:9201->9201/tcp, 0.0.0.0:9001->9001/udp   beacon-node-testnet0-2

Investigate hosts running out of memory due to Beacon nodes

We've been seeing Nimbus hosts failing due to lack of memory, like yesterday:

https://grafana.status.im/d/QCTZ8-Vmk/single-host?orgId=1&var-host=master-01.aws-eu-central-1a.nimbus.test&from=1585134265851&to=1585216675067

Deploy a Mainnet fully synced node for Nimbus

Eth2 launches on 8th of December and before that happens we'll need a mainnet full node.

It will be used in the future to provide data on state of Eth1 to the Nimbus fleet.

Beacon chain kubernetes based multinet cluster

Based on the work done here: https://github.com/eth2-clients/multinet it would be nice to open a discussion about setting up our own multinet cluster in order to execute a variety of tests in the presence of other clients.
The requirements are simple in terms of tech, just a standard kubernetes (+helm) cluster and access to it.
I personally have experience with gcloud but aws should be ok too (or azure).

The key question is how much can we spend in terms of CPU cores and memory (also storage but that's the cheap bit)?

Assuming that each node should receive at least 0.5 core and 2GB memory.
If we could have 64 cores we could run 128 nodes of 64 (could try 128 even) validators each.
That would make 4096 (8192) validators. Should we run more validators per node btw?
Any input is welcome as I don't have too much knowledge in that ratio.

cc @arnetheduck @dryajov @mratsim @stefantalpalaru

Logrotate hourly and keep 72 hours of logs

Currently running on daily rotation, 5 days:

[email protected]:/data/log/beacon-node-pyrmont-devel$ ls -lah
total 23G
drwxr-xr-x 2 syslog syslog 4.0K Nov 20 00:03 .
drwxr-xr-x 3 syslog adm    4.0K Nov 17 15:20 ..
-rw-r--r-- 1 syslog adm    5.4G Nov 20 13:15 docker.log
-rw-r--r-- 1 syslog adm     17G Nov 20 00:08 docker.log.1
-rw-r--r-- 1 syslog adm    569M Nov 19 00:02 docker.log.2.gz

Ideally, also get rid of prefix (Nov 20 10:11:30 pyrmont-06.aws-eu-central-1a.nimbus.test docker/beacon-node-pyrmont-devel[600]: ), it makes the logs non-json, hard to read and unnecessarily large

Disable duplicate log message detection

See:

Deploy additional synced Eth1 nodes to not abuse Infura

One of my infura id:s gets swamped that happens to be included in this repo gets overrun with requests on a regular basis preventing me from running tests - it's happening on a daily basis now, likely because of the increased number of nodes we run.

We should migrate more of the fleet to internal instances (leaving a few on infura for testing), preferably a mix of viable eth1 clients.

Migrate whole Pyrmont fleet to Hetzner

The Pyrmont fleet has little utility but has to exist until Ethereum Foundation decides it doesn't need it anymore.

Considering the costs - ~6k EUR per month - we need to move out validators to more cost effective AX41-NVMe Hetzner hosts.
This should reduce the cost of the fleet 30x fold easily. The layout suggested by @zah looks like this:

metal-01 - 4000 validators per instance
metal-02 - 989 validators per instance
metal-03- 10 validators per instance
metal-04 - 1 validator per instance

This assumes 4 beacon node instances per host: stable, testing, unstable, and nim-libp2p-auto-bump-unstable

Depending on whether a single host can handle 4 instances running 4000 validators each I might have to redesign this layout.

	data_vol_size = 500 /* We'll be storing TRACE logs */
	data_vol_type = "sc1" /* Change to gp2 for SSD */