paritytech / helm-charts

Parity Helm charts collection

License: GNU General Public License v3.0

Smarty 63.50% Mustache 6.35% Shell 3.29% Makefile 1.36% Dockerfile 0.92% Go 24.59%

helm-charts's People

Contributors

alvicsam, andreieres, arshamteymouri, asiniscalchi, bakhtin, bulatsaif, ccubu, dblane-digicatapult, dependabot[bot], fevo1971, ironoa, kogeler, lazam, michalziobro, mutantcornholio, okalenyk, parutger, pierrebesson, radupopa2010, sudo-whodo


helm-charts's Issues

Create umbrella charts for "polkadot-stack" and "polkadot-parachain-stack"

We should create dedicated charts to provide standardized deployment of polkadot and polkadot-parachain (cumulus). Those would be wrappers around the generic "node" helm-chart with values correctly set for their respective latest image tag as well as more comprehensive docs on how to correctly configure them.

  • For simplicity, the chart version could be equal to the polkadot/cumulus release version.
  • We need to find a way to semi-automate version upgrades.
  • This is an opportunity to add end-to-end chart tests to validate that the node can connect to the network.

Update:
As agreed in this comment, this will now be two charts: polkadot-stack and polkadot-parachain-stack.

Docker images in use, to be wrapped with the umbrella chart:

Allow fixing the p2p IP

Automatic node port discovery was introduced in #28; however, in some cases the operator will not want, or be able, to open a large range of ports (30000-32767) on their Kubernetes nodes.

An option should be added to fix the attributed nodePort; however, in this case it might be impossible to support more than one replica for the StatefulSet, as fixing the port would result in a port conflict for the second replica.

Similarly, it should be possible to deploy a node which uses a fixed p2p IP by setting loadBalancerIP for LoadBalancer services, using a pre-reserved IP at the cloud provider (e.g. for GCP).
However, in this case the p2p service would no longer be of type NodePort but of type LoadBalancer.
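For illustration, a minimal sketch of such a p2p Service (name, selector and IP are placeholder values, not the chart's actual output):

apiVersion: v1
kind: Service
metadata:
  name: mynode-p2p
spec:
  type: LoadBalancer
  loadBalancerIP: 35.200.100.10   # pre-reserved static IP at the cloud provider
  ports:
    - name: p2p
      port: 30333
      targetPort: 30333
      protocol: TCP
  selector:
    app.kubernetes.io/instance: mynode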

Allow annotating created p2p services

It would be useful to be able to independently annotate the auto-created p2p Services.
This would enable using kubernetes-sigs/external-dns to manage Bootnodes / RPC nodes DNS entries.

For example:

apiVersion: v1
kind: Service
metadata:
  annotations:
    cloud.google.com/load-balancer-type: Internal
    external-dns.alpha.kubernetes.io/hostname: bootnode.testnet.parity.io.

Implement PodDisruptionBudget in node helm-chart

PodDisruptionBudget support would be very useful for staying resilient to Kubernetes node pool upgrades.

It should be configured as such:

node:
  disruptionBudget:
    # only one of minAvailable and maxUnavailable can be set
    minAvailable:
    maxUnavailable: 

The template logic must check that minAvailable and maxUnavailable are not set at the same time.
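A possible shape for that template logic, assuming the values layout above (a sketch only, not the final template):

{{- if and .Values.node.disruptionBudget.minAvailable .Values.node.disruptionBudget.maxUnavailable }}
{{- fail "node.disruptionBudget: set only one of minAvailable and maxUnavailable" }}
{{- end }}
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: {{ .Release.Name }}
spec:
  {{- with .Values.node.disruptionBudget.minAvailable }}
  minAvailable: {{ . }}
  {{- end }}
  {{- with .Values.node.disruptionBudget.maxUnavailable }}
  maxUnavailable: {{ . }}
  {{- end }}
  selector:
    matchLabels:
      app.kubernetes.io/instance: {{ .Release.Name }}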

Use static version instead of `tag: latest` for the node helm chart

Currently, when a new polkadot release is available, people have to change their values files to point to the new version.
I think the best approach is to release a new version of the node helm chart for each polkadot release and use the chart's appVersion as the default image tag. People who use our chart for their deployments could then update their nodes simply by bumping the chart version, instead of editing the values.yaml file manually or via a pipeline.
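For example, the image tag could default to the chart's appVersion in the StatefulSet template (value names here are illustrative, not necessarily the chart's current ones):

image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"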

Allow mounting a Kubernetes emptyDir at the keystore path

When inserting keys into a substrate node, they end up in the /data/.../chains/chain_name/keystore folder, which in our setup is stored in the "data" Kubernetes volume. In a secure setup we don't want this data volume to contain our private keys, so we should mount a tmpfs (i.e. a Kubernetes emptyDir volume) on this path.
This will prevent keys from being persisted on the data disk after having been sourced from HashiCorp Vault or other secure places.
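A rough sketch of what this could look like in the pod spec (the exact keystore path depends on the chain and is shown here only as a placeholder):

volumes:
  - name: keystore
    emptyDir:
      medium: Memory            # tmpfs, never written to the data disk
containers:
  - name: node
    volumeMounts:
      - name: keystore
        mountPath: /data/chains/<chain-name>/keystore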

[node] Bug: empty vault key injection

When we use .extraDerivation in .Values.node.vault.keys, the node helm chart will still inject the vault key, even if the vault agent failed to mount it:

cat: /vault/secrets/name: No such file or directory
Inserted key aura (type=aura, scheme=sr25519) into Keystore

The key insert command will not fail, since it can derive from a well-known key (for example //Alice, //extraDerivation).

We need to add a check in the inject-vault-keys init container: if the file does not exist, the init container should fail.
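A rough sketch of the intended check at the top of the init container script (the secret path and container details are placeholders):

initContainers:
  - name: inject-vault-keys
    image: parity/polkadot:latest        # placeholder image
    command: ["/bin/sh", "-c"]
    args:
      - |
        # fail fast if the vault agent did not mount the key file
        if [ ! -s /vault/secrets/name ]; then
          echo "vault key file /vault/secrets/name is missing or empty" >&2
          exit 1
        fi
        # ... key insertion continues only when the file exists ...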

Add an option to run the polkadot-introspector kvdb exporter as a sidecar

The polkadot-introspector kvdb tool can be used to monitor the database continuously. We should add support for running this exporter as a sidecar in the node helm chart.

We have successfully set it up with this configuration, but it feels like a lot of boilerplate for people who would like to set this up.


extraContainers:
  - name: relaychain-kvdb-introspector
    image: paritytech/polkadot-introspector:438d3406
    command: [
      "polkadot-introspector",
      "kvdb",
      "--db",
      "/data/chains/versi_v1_9/db/full",
      "--db-type",
      "rocksdb",
      "prometheus",
      "--port",
      "9620"
    ]
    resources:
      limits:
        memory: "1Gi"
    ports:
      - containerPort: 9620
        name: relay-kvdb-prom
    volumeMounts:
      - mountPath: /data
        name: chain-data
  - name: parachain-kvdb-introspector
    image: paritytech/polkadot-introspector:438d3406
    command: [
      "polkadot-introspector",
      "kvdb",
      "--db",
      "/data/chains/versi_v1_9/db/full/parachains/db",
      "--db-type",
      "rocksdb",
      "prometheus",
      "--port",
      "9621"
    ]
    resources:
      limits:
        memory: "1Gi"
    ports:
      - containerPort: 9621
        name: para-kvdb-prom
    volumeMounts:
      - mountPath: /data
        name: chain-data

Note there should be an option to run one or two sidecars: one to monitor the main DB and one for the parachain DB (as an option for relay chains).
We also need to create the appropriate ServiceMonitor for loading data in Prometheus.
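A minimal ServiceMonitor sketch for the two ports above (the name and selector are assumptions about how the chart labels its services):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: node-kvdb-introspector
spec:
  selector:
    matchLabels:
      app.kubernetes.io/instance: mynode
  endpoints:
    - port: relay-kvdb-prom
      interval: 30s
    - port: para-kvdb-prom
      interval: 30s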

Missing backslash escaping causes flags to be ignored

The following issue has been reported to me by @kogeler :

Containers:
  kusama:
    Container ID:  containerd://8a9995292c78499072b0abaf828c67bfbf27024548bf144cb9e2dd511c5d7eb1
    Image:         parity/polkadot:v0.9.16
    Image ID:      docker.io/parity/polkadot@sha256:46ec2899a865ff7640ea3eaaf7306ecfc3128609fc40d7fe587f486c7ff9eba9
    Ports:         9933/TCP, 9944/TCP, 9615/TCP, 30333/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP, 0/TCP
    Command:
      /bin/sh
    Args:
      -c
      RELAY_CHAIN_P2P_PORT="$(cat /data/relay_chain_p2p_port)"
      echo "RELAY_CHAIN_P2P_PORT=${RELAY_CHAIN_P2P_PORT}"
      exec polkadot \
        --name=${POD_NAME} \
        --base-path=/data/ \
        --chain=${CHAIN} \
        --pruning=archive --rpc-external --ws-external --rpc-methods=safe --rpc-cors=all --prometheus-external --telemetry-url='wss://submit.telemetry.parity-stg.parity.io/submit/ 1'
        --listen-addr=/ip4/0.0.0.0/tcp/${RELAY_CHAIN_P2P_PORT} \
        --listen-addr=/ip4/0.0.0.0/tcp/30333 \
The newline is not escaped after the flags' unwrapping (https://github.com/paritytech/helm-charts/blob/main/charts/node/templates/statefulset.yaml#L283).

If you check the pod you can see:

polkadot@kusama-public-sidecar-node-0:/$ ps -p 1 -o args
COMMAND
polkadot --name=kusama-public-sidecar-node-0 --base-path=/data/ --chain=kusama --pruning=archive --rpc-external --ws-external --rpc-methods=safe --rpc-cors=all --prometheus-external --telemetry-url=wss://submit.telemetry.parity-stg.parity.io/submit/ 1
The --listen-addr flags are, in fact, absent.

End-to-end tests for the node Helm chart

Create end-to-end tests that would cover some of the most common scenarios for deploying a node with the Helm chart: a full node, an RPC node, a bootnode, a validator, and a collator.

The tests should be included in the CI pipeline. All the tests should pass before a PR can be merged.

Implement S3 support for backups

This could, for example, be achieved with one tool for both clouds: s5cmd, which potentially also downloads backups faster than gsutil. It would be interesting to do some tests.
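As an untested sketch, a restore init container using s5cmd could look roughly like this (the image tag, bucket and the GCS interoperability endpoint are assumptions to be verified):

initContainers:
  - name: restore-backup
    image: peakcom/s5cmd:v2.2.2          # placeholder tag
    command: ["/s5cmd"]
    # for GCS, the S3-compatible endpoint https://storage.googleapis.com with HMAC
    # credentials could be passed via --endpoint-url instead
    args: ["cp", "s3://my-backups/kusama/*", "/data/"]
    volumeMounts:
      - mountPath: /data
        name: chain-data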

Add node pruning option

We should be able to set the pruning option (pruned/archive) in the chart values to set the --pruning flag and add a useful pod/service label.
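For example, a values-level switch along these lines (the value name is hypothetical):

node:
  pruning: archive          # or a block count for a pruned node; rendered as --pruning=<value>
                            # and also attached as a pod/service label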

Revamp node helm-chart YAML templates

The node Helm chart templates are missing some of the configurable parameters available in Kubernetes (like loading environment variables from a ConfigMap using envFrom).
It is time-consuming (though rewarding at larger scale) to maintain our own standardized library of Helm templates, so instead I reused the great template library from Bitnami during the development of the staking-miner Helm chart.

The node Helm chart templates should be refactored in the same way to be consistent with the available Kubernetes features.

Use helm-readme-generator for documenting helm chart values

The Bitnami Helm Readme Generator is very useful for maintaining up-to-date README files. We should generalize the use of the tool:

  • All charts' values.yaml comments need to be rewritten to follow the proper syntax (see the sketch after this list)
  • Apply readme-generator on all charts and document how to use it in CONTRIBUTING.md
  • Add a CI job that validates that the readme-generator has been correctly applied on PRs and blocks the merge if it is not the case
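For reference, a sketch of the comment format the readme-generator expects (the parameter names here are only illustrative):

## @section Node parameters

## @param node.chain Chain spec to use
## @param node.image.repository Node image repository
## @param node.image.tag Node image tag (defaults to the chart appVersion)
node:
  chain: polkadot
  image:
    repository: parity/polkadot
    tag: ""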

helm-charts/charts/substrate-faucet/

Hello @PierreBesson and thank you for creating this helm chart.

I am trying to use it to deploy a faucet for Picasso Rococo

Question: How can I test that the deployment is working, and how do I use it? Via the Matrix bot?

Steps I took before installing the helm chart:

  1. create a matrix bot account composablefi_faucet
  2. create a matrix access token for the bot

These are the values I used (secret data removed):

helm install substrate-faucet parity/substrate-faucet \
    --set faucet.secret.SMF_BACKEND_FAUCET_ACCOUNT_MNEMONIC="removed" \
    --set faucet.secret.SMF_BOT_MATRIX_ACCESS_TOKEN="removed" \
    --set faucet.config.SMF_BACKEND_RPC_ENDPOINT="https://picasso-rococo-rpc-lb.composablenodes.tech/" \
    --set faucet.config.SMF_BACKEND_INJECTED_TYPES='{}' \
    --set faucet.config.SMF_BACKEND_NETWORK_DECIMALS='12' \
    --set faucet.config.SMF_BOT_MATRIX_SERVER="https://matrix.org" \
    --set faucet.config.SMF_BOT_MATRIX_BOT_USER_ID="@composablefi_faucet:matrix.org" \
    --set faucet.config.SMF_BOT_NETWORK_UNIT="PICA" \
    --set faucet.config.SMF_BOT_DRIP_AMOUNT="1"

Testing the endpoint picasso-rpc-lb.composablenodes.tech seems to be fine

curl -H "Content-Type: application/json" -d '{"id":1, "jsonrpc":"2.0", "method": "rpc_methods"}'  https://westend-rpc.polkadot.io
{"jsonrpc":"2.0","result":{"methods":["account_nextIndex","author_hasKey","author_hasSessionKeys","author_insertKey
... (content removed)


echo '{"id":1, "jsonrpc":"2.0", "method": "rpc_methods"}' | websocat wss://picasso-rpc-lb.composablenodes.tech
{"jsonrpc":"2.0","result":{"methods":["account_nextIndex","assets_balanceOf","assets_listAssets","author_hasKey","author_hasSessionKeys","author_insertKey","author_pendingExtrinsics","author_removeExtrinsic","author_rotateKeys","author_submitAndWatchExtrinsic","author_sub
...  (content removed)

By the way https://rococo-rpc.polkadot.io seems to be down

curl -H "Content-Type: application/json" -d '{"id":1, "jsonrpc":"2.0", "method": "rpc_methods"}'  https://rococo-rpc.polkadot.io
Service Unavailable

echo '{"id":1, "jsonrpc":"2.0", "method": "rpc_methods"}' | websocat wss://rococo-rpc.polkadot.io
websocat: WebSocketError: WebSocketError: Received unexpected status code (503 Service Unavailable)
websocat: error running

while https://westend-rpc.polkadot.io/ works

curl -H "Content-Type: application/json" -d '{"id":1, "jsonrpc":"2.0", "method": "rpc_methods"}'  https://westend-rpc.polkadot.io
{"jsonrpc":"2.0","result":{"methods":["account_nextIndex","author_hasKey","author_hasSessionKeys","author_insertKey","author_pendingExtrinsics","author_removeExtrinsic","author_rotateKeys","author_submitAndWatchExtrinsic","author_submitExtrinsic","author_unwatchExtrinsi
... (content removed)

echo '{"id":1, "jsonrpc":"2.0", "method": "rpc_methods"}' | websocat wss://westend-rpc.polkadot.io
{"jsonrpc":"2.0","result":{"methods":["account_nextIndex","author_hasKey","author_hasSessionKeys","author_insertKey
... (content removed)

Here is what I have in the logs

kubectl logs substrate-faucet-85b7c64b7f-vcqsn
yarn run v1.22.5
$ node ./build/src/start.js
2023-03-30 12:46:26        API/INIT: Api will be available in a limited mode since the provider does not support subscriptions
[2023-03-30T12:46:26.928] [INFO] default - 🚰 Plip plop - Creating the faucets's account
[2023-03-30T12:46:26.929] [INFO] default - Ignore list: (1 entries)
[2023-03-30T12:46:26.929] [INFO] default -  ''
SMF:
  📦 BOT:
     ✅ BACKEND_URL: "http://localhost:5555"
     ✅ DRIP_AMOUNT: 1
     ✅ MATRIX_ACCESS_TOKEN: *****
     ✅ MATRIX_BOT_USER_ID: "@composablefi_faucet:matrix.org"
     ✅ MATRIX_SERVER: "https://matrix.org"
     ✅ NETWORK_DECIMALS: 12
     ✅ NETWORK_UNIT: "PICA"
     ✅ FAUCET_IGNORE_LIST: ""
     ✅ DEPLOYED_REF: "unset"
     ✅ DEPLOYED_TIME: "unset"
[2023-03-30T12:46:26.965] [INFO] default - ✅ BOT config validated
SMF:
  📦 BACKEND:
     ✅ FAUCET_ACCOUNT_MNEMONIC: *****
     ✅ FAUCET_BALANCE_CAP: 100
     ✅ INJECTED_TYPES: "[]"
     ✅ NETWORK_DECIMALS: 12
     ✅ PORT: 5555
     ✅ RPC_ENDPOINT: "https://picasso-rococo-rpc-lb.composablenodes.tech/"
     ✅ DEPLOYED_REF: "paritytech/faucet:latest"
     ✅ DEPLOYED_TIME: "2023-03-30T15:45:28"
     ✅ EXTERNAL_ACCESS: false
     ✅ DRIP_AMOUNT: "0.5"
     ✅ RECAPTCHA_SECRET: *****
[2023-03-30T12:46:26.979] [INFO] default - ✅ BACKEND config validated
[2023-03-30T12:46:26.995] [INFO] default - Starting faucet v1.1.2
[2023-03-30T12:46:26.995] [INFO] default - Faucet backend listening on port 5555.
[2023-03-30T12:46:26.995] [INFO] default - Using @polkadot/api 10.0.1
Connected to the in-memory SQlite database.
Getting saved sync token...
Getting push rules...
Attempting to send queued to-device messages
Got saved sync token
Got reply from saved sync, exists? false
All queued to-device messages sent
2023-03-30 12:46:31        API/INIT: RPC methods not decorated: assets_balanceOf, assets_listAssets, crowdloanRewards_amountAvailableToClaimFor, ibc_clientUpdateTimeAndHeight, ibc_generateConnectionHandshakeProof, ibc_queryBalanceWithAddress, ibc_queryChannel, ibc_queryChannelClient, ibc_queryChannels, ibc_queryClientConsensusState, ibc_queryClientState, ibc_queryClients, ibc_queryConnection, ibc_queryConnectionChannels, ibc_queryConnectionUsingClient, ibc_queryConnections, ibc_queryDenomTrace, ibc_queryDenomTraces, ibc_queryEvents, ibc_queryLatestHeight, ibc_queryNewlyCreatedClient, ibc_queryNextSeqRecv, ibc_queryPacketAcknowledgement, ibc_queryPacketAcknowledgements, ibc_queryPacketCommitment, ibc_queryPacketCommitments, ibc_queryPacketReceipt, ibc_queryProof, ibc_queryRecvPackets, ibc_querySendPackets, ibc_queryUnreceivedAcknowledgement, ibc_queryUnreceivedPackets, ibc_queryUpgradedClient, ibc_queryUpgradedConnectionState, pablo_pricesFor, pablo_simulateAddLiquidity, pablo_simulateRemoveLiquidity
2023-03-30 12:46:32        API/INIT: picasso/10011: Not decorating unknown runtime apis: 0x9c53906fa888fe7c/1, 0x5c497be959ff24ab/1, 0xf60c4a6e7ca253cc/1, 0xa74824145d05c12a/1
Got push rules
Adding default global override for .org.matrix.msc3786.rule.room.server_acl
Checking lazy load status...
Checking whether lazy loading has changed in store...
Storing client options...
Stored client options
Getting filter...
[2023-03-30T12:46:36.615] [INFO] default - Fetched faucet balance 💰
Sending initial sync request...
Waiting for saved sync before starting sync processing...
Adding default global override for .org.matrix.msc3786.rule.room.server_acl
Caught /sync error TypeError: Cannot read properties of undefined (reading 'cryptoStore')
    at /faucet/node_modules/matrix-js-sdk/lib/sync.js:1191:49
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at async Object.promiseMapSeries (/faucet/node_modules/matrix-js-sdk/lib/utils.js:445:5)
    at async SyncApi.processSyncResponse (/faucet/node_modules/matrix-js-sdk/lib/sync.js:1184:5)
    at async SyncApi.doSync (/faucet/node_modules/matrix-js-sdk/lib/sync.js:843:9)
[2023-03-30T13:03:00.027] [INFO] default - Auto-joined !JTMeQUcNDfTSdeIIvP:matrix.org.
EventTimelineSet.addLiveEvent: ignoring duplicate event $yZzI-R0yRHKSxnsfHpGveuj3QMgs7T3Zugk7NB__mmI
[2023-03-30T13:03:06.314] [INFO] default - Auto-joined !JTMeQUcNDfTSdeIIvP:matrix.org.
2023-03-30 20:41:56        RPC-CORE: queryStorageAt(keys: Vec<StorageKey>, at?: BlockHash): Vec<StorageChangeSet>:: [502]: Bad Gateway
[2023-03-30T20:41:56.989] [ERROR] default - Error: [502]: Bad Gateway
    at HttpProvider._HttpProvider_send (/faucet/node_modules/@polkadot/rpc-provider/cjs/http/index.js:162:19)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at async callWithRegistry (/faucet/node_modules/@polkadot/rpc-core/cjs/bundle.js:172:28)
2023-03-30 20:44:06        RPC-CORE: queryStorageAt(keys: Vec<StorageKey>, at?: BlockHash): Vec<StorageChangeSet>:: [502]: Bad Gateway
[2023-03-30T20:44:06.311] [ERROR] default - Error: [502]: Bad Gateway
    at HttpProvider._HttpProvider_send (/faucet/node_modules/@polkadot/rpc-provider/cjs/http/index.js:162:19)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at async callWithRegistry (/faucet/node_modules/@polkadot/rpc-core/cjs/bundle.js:172:28)

Make it easy to deploy a substrate-connect enabled bootnode to Kubernetes

Substrate-connect light clients have been officially announced as ready for the general public (https://www.youtube.com/watch?v=TDbTCrDDO2U). However, from an operational point of view, an obstacle to adoption is that, to work properly, the light client needs to access a bootnode which exposes its p2p port over WebSocket (typically --listen-addr /ip4/0.0.0.0/tcp/30444/ws --listen-addr /ip6/::/tcp/30444/ws). Moreover, as browsers will only allow connecting to a secure WebSocket, you need a reverse proxy in front, such as nginx, to add a Let's Encrypt certificate.

I believe it would be valuable to offer an easy way to deploy such a bootnode to kubernetes with the following:

  • Option to auto-generate the relevant p2p-ws Ingress and Service on the helm-chart
  • Example config showing how to set up an ingress controller and external-dns / cert-manager for automatic certificate management.

ping @tomaka

Standardize default exposed ports in the node chart

We want to make it easier to reason about the exposed ports for substrate nodes (especially collators, which run 2 nodes in 1). From internal discussions at Parity we brainstormed the following table. The logic is to reuse conventions that arose organically while minimizing confusion (e.g. it is very hard to differentiate 30334 and 30344 at a glance).

To achieve this we propose to shift port numbers by -1000 for the secondary chain (i.e. the relay-chain for the collator). Note that most of the time those ports don't need to be exposed.

Type     Primary  Secondary
p2p_tcp  30333    29333
p2p_ws   30444    29444
prom     9615     8615
rpc      9933     8933
rpc_ws   9944     8944

[node] Fix readinessProbe

Starting from polkadot v0.9.28, the logs are full of the following lines:

2022-10-03 20:29:32 Accepting new connection 1/100
2022-10-03 20:29:33 Rejected connection: Transport(i/o error: unexpected end of file
Caused by:
     unexpected end of file)

It is caused by the readinessProbe, which uses tcpSocket to check whether the port is open.
Previously the port stayed closed until the node was synced, but the substrate networking was refactored several times and this is no longer the case.

Rename helper templates in node chart

Since helm v3.7.0 we have the option to refer to a subchart's templated helper functions. However, if the parent chart uses identically named helper templates, the child ones will be overwritten.

Additionally, chart.name is fairly close to Chart.Name, which is a helm built-in object, so we should move away from it to avoid confusion.
I suggest we change the naming of these functions from chart.blah to node.blah or some other nomenclature to avoid helper templates being overwritten when called from a parent chart.

statefulSet inject keys doesn't work

When iterating over {{- range $index, $key := .Values.node.keys }}, the helm chart fails to correctly insert .Values.node.command in the subsequent initContainer command. This is because when you use a range you change the scope, and .Values is no longer directly reachable.

Consider the following values.yaml

node:
    keys: 
    - type: "gran"
      scheme: "ed25519"
      seed: "//Blah"

This results in the following error: nil pointer evaluating interface {}.node as node.command is no longer in scope.

The following works but is obviously undesirable:

node:
  keys:
    - type: "gran"
      scheme: "ed25519"
      seed: "//Blah"
      Values:
        node:
          command: "command"

An easy fix is evaluating {{ .Values.node.command }} as $COMMAND and setting that as an env var, similar to how we set {{ .Values.node.chain }} in the same init container. Alternatively, we can set {{ .Values.node.command }} as a Helm variable before the loop.
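For example, a rough sketch of the second approach (the surrounding container spec and the exact key-insert invocation are simplified):

{{- /* capture the value before entering the range, where the scope changes */}}
{{- $command := .Values.node.command }}
{{- range $index, $key := .Values.node.keys }}
{{ $command }} key insert \
  --key-type {{ $key.type }} \
  --scheme {{ $key.scheme }} \
  --suri {{ $key.seed | quote }}
{{- end }}

Using the root context ($.Values.node.command) inside the range would also work.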

Happy to put up a fix for this if you let me know your preferred method.

Method for adding custom node-keys

In our use case we use the pallet-node-authorization and we need to be able to insert the node-key.

Currently the only methods for achieving this are:

  • Bring up a node with node.persistGeneratedNodeKey true and then kubectl exec into the container and put our key in /data/node-key OR
  • Bring up a node with node.persistGeneratedNodeKey true and take the generated node-key and recreate this in our node-authorization pallet, recreate the chainspec then download it using the node.customChainspecUrl

A better solution would be to give us the option to provide our own node-key, mounted as a read-only Secret, instead of reading it from the RW data volume.

We can provide more detail if requested.
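For illustration, one possible shape for such an option (the value names and the use of --node-key-file are assumptions, not an agreed design):

node:
  customNodeKey:
    secretName: my-node-key        # existing Secret holding the key
    secretKey: node-key            # mounted read-only and passed via --node-key-file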

inject-keys init container displays keys

When inserting keys via the values.node.keys method the init container displays them, meaning that anyone who has read access to the statefulSet can read them.

It would probably be better to mount these as a Secret and inject them using a file redirect rather than an echo.
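A rough sketch of what that could look like, with the seeds mounted from a Secret (image, paths and key parameters are placeholders):

  - name: inject-keys
    image: parity/polkadot:latest
    command: ["/bin/sh", "-c"]
    args:
      - |
        # read the seed from the mounted Secret instead of templating it into the
        # command line, so it never appears in the StatefulSet spec
        polkadot key insert \
          --base-path /data \
          --key-type gran \
          --scheme ed25519 \
          --suri "$(cat /keys/gran-seed)"
    volumeMounts:
      - name: keys
        mountPath: /keys
        readOnly: true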

Revamp node flags settings using charts values

When using the node helm chart, it is really hard to figure out whether flags should be set in the chart values or inside node.flags. The situation can be improved:

  • List common substrate flag options and allow configuring them with chart values (--unsafe-rpc-external, --telemetry-url, --log). We shouldn't include more esoteric (polkadot/cumulus-specific) flags.
  • The pruning and database options should be under chainData and relayChainData.
  • Preferentially use default values (i.e. not setting the flags when value=null).
  • relayChainFlags should have --name=${POD_NAME} by default and the same telemetry URLs as the main chain.
  • Add a script on node startup that checks whether any passed flag is in the list of flags handled by chart values, and fails if that is the case.
  • Also forbid setting flags which don't make sense for the chart: --ws-port and --rpc-port.

Support externalized relay-chain node for collators

The parachain collator 0.9.300 supports collation via an RPC relay-chain node. In this mode the collator doesn't need to run a local relay-chain node and simply needs to point to a relay-chain RPC URL.

Add support for this mode:

  • add node.collatorRelayChain.rpcUrl which, when set, will add --relay-chain-rpc-url ws://rpc-node-url and disable the relay-chain data and keystore volumes
  • remove the collator flags when this mode is on

[Question] p2p networking between pods (across nodes)

Hi team 👋

I'm using the node chart and having some trouble with p2p networking between pods.

It looks like I created a fork: pods on the aks-testnet-23183882-vmss000001 node are isolated from those on the other node (peer discovery is working locally within each node, thanks to the --allow-private-ipv4 flag).

> k get pods -o wide
NAME                  READY   STATUS    RESTARTS   AGE     IP            NODE                              NOMINATED NODE   READINESS GATES
ajuna-node-0          1/1     Running   0          5m51s   10.244.1.10   aks-testnet-23183882-vmss000000   <none>           <none>
ajuna-node-1          1/1     Running   0          5m5s    10.244.2.13   aks-testnet-23183882-vmss000001   <none>           <none>
ajuna-validator-0-0   1/1     Running   0          5m51s   10.244.2.11   aks-testnet-23183882-vmss000001   <none>           <none>
ajuna-validator-1-0   1/1     Running   0          5m51s   10.244.2.12   aks-testnet-23183882-vmss000001   <none>           <none>
ajuna-validator-2-0   1/1     Running   0          5m51s   10.244.1.11   aks-testnet-23183882-vmss000000   <none>           <none>

Commands I'm using:

--chain=testnet
--name=$(POD_NAME)
--base-path=/data
--rpc-cors=all
--ws-external
--rpc-methods=safe
--allow-private-ipv4
--listen-addr=/ip4/0.0.0.0/tcp/30333

Any pointers you could share to debug this?

HPA scaling event does not handle creation of p2p services

The Horizontal Pod Autoscaler added in #120, when enabled, conditionally removes the replicas field from the StatefulSet. Creation of p2p services relies on the presence of that field. Thus, when the replica count is scaled up or down by the HPA, the additional p2p services are not created/removed and the Pods cannot work.

We need to check whether Helm has any mechanism to rely on the current replica count set by the HPA. If not, we should implement some custom handler that monitors K8s scaling events and creates/removes p2p services accordingly.

Default http startup probes failing

Startup probes appear to be consistently failing in our testnets:
polkadot version: 0.9.36

probe config:

    Startup:    http-get http://:http-rpc/health delay=0s timeout=1s period=10s #success=1 #failure=30
  Warning  Unhealthy  18m (x123 over 27h)  kubelet  Startup probe failed: Get "http://10.20.142.31:9933/health": dial tcp 10.20.142.31:9933: connect: connection refused

Remove `node.serviceAccountName` property and fully replace with `serviceAccount.name`

The service account name is already defined in serviceAccount.name, so it's strange that it's possible to redefine it in node.serviceAccountName and unclear what purpose it has.

Note that apparently it is mandatory to set this to the right value for Vault auth to work:

        vault.hashicorp.com/role: {{ .Values.node.vault.authRole | default (include "node.serviceAccountName" .) | squote }}

Using HashiCorp Vault in Helm Charts

We would like to use HashiCorp Vault in the Substrate/Polkadot helm charts for secret management and for protecting sensitive data, in this case the node keys, for instance.

Replace curl for startup probe

Currently an exec to curl is used in the startup probe to work around the issue of the RPC endpoint only allowing local network access by default.

Possible solutions:

  • Disable the startup probe by default, and use a regular httpGet probe. However, it will be the user's responsibility to set the correct flags to allow Kubernetes probe traffic to the RPC endpoint (see the sketch after this list)
  • Enable the probes by default, assuming the default rpc flag has not been overridden.
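For the first option, the probe could become a plain httpGet (the port name is taken from the probe config shown in the issue above; the flag requirements still apply):

startupProbe:
  httpGet:
    path: /health
    port: http-rpc
  periodSeconds: 10
  failureThreshold: 30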

Provide examples of deploying a node with Ingress

As mentioned in #108 (comment) it would be nice to have examples of provisioning a Substrate node with an Ingress object (possibly for different popular Ingress Controllers). Ingress is usually required to have more control over proxying the traffic to the node. We at Parity are using Ingress to proxy traffic to boot nodes and RPC nodes. We can put examples of using it into a separate examples directory so as not to pollute the original Helm chart.

Run all containers as read-only and prevent privilege escalation by default

If I recall correctly, the polkadot images are pretty friendly with running as read-only as long as you use a volume for the chain data.

containers:
  - name: mynode
    image: <the_image>
    securityContext:
      runAsUser: 1000
      readOnlyRootFilesystem: true
      allowPrivilegeEscalation: false

For temporary containers where we do not want to mount a volume, the solution to keep running the container as read-only is to mount whatever folder the node needs to write to as tmpfs.
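A small sketch of that pattern (the volume name and mount path are arbitrary examples):

containers:
  - name: mynode
    volumeMounts:
      - name: tmp
        mountPath: /tmp
volumes:
  - name: tmp
    emptyDir:
      medium: Memory      # tmpfs, so the read-only root filesystem still has a writable /tmp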

[node] fix `backup-chain-gcs` init container

The node chart has an init container, backup-chain-gcs, that dumps the DB to GCS on startup.

If the pod is in CrashLoopBackOff, the same dump will be uploaded to GCS on every restart.

We need to create a "lock file" or "status of the last backup" file, Which will be checked before db is uploaded to GCP.
if the last backup is younger than 1h (1h - should be configurable in values.yml) - skip the backup.
if the last backup failed (we have a lock file) - fail with an error message.

Since the pod is in CrashLoopBackOff it will be hard to exec into the pod and clean the lock file, so we also need to add an option to remove it.
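A rough sketch of the intended logic in the init container (the file names, image and 1h threshold are placeholders):

initContainers:
  - name: backup-chain-gcs
    image: google/cloud-sdk:slim
    command: ["/bin/sh", "-c"]
    args:
      - |
        # refuse to run if a previous backup left its lock file behind
        if [ -f /data/backup.lock ]; then
          echo "previous backup did not finish cleanly; remove /data/backup.lock to retry" >&2
          exit 1
        fi
        # skip if the last successful backup is younger than 1h
        if [ -f /data/backup.last ] && [ -n "$(find /data/backup.last -mmin -60)" ]; then
          echo "last backup is younger than 1h, skipping"
          exit 0
        fi
        touch /data/backup.lock
        # ... upload the db to GCS here ...
        date > /data/backup.last
        rm /data/backup.lock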

Telemetry chart does not install "by default"

Here is my test:

NS=testing
kubectl create ns $NS
helm install substrate-telemetry parity/substrate-telemetry -n $NS

The 3 deployments fail:

(screenshots of the three failing deployments)

NOTE: In the meantime, I did install a LoadBalancer in my cluster but the lack of it does not seem to be the issue.
