relaycorp / cloud-gateway Goto Github PK

View Code? Open in Web Editor NEW

0.0 2.0 1.0 5.88 MB

Infrastructure as Code and configuration for all Awala-Internet Gateways run by Relaycorp

License: MIT License

HCL 100.00%

cloud awala-gateway awala

cloud-gateway's People

Contributors

Watchers

Forkers

rajesh2k3

cloud-gateway's Issues

Run CogRPC service 24/7

We're currently running it on demand, which means Cloud Run will kill the service when it goes unused for a few minutes.

We're doing this to save costs whilst we're beta testing Awala/Letro, but we should undo this when we go live because it can take ~15 seconds for the service to start, which would be problematic with couriers.

Parcel delivery often times out after 30 seconds

And it shouldn't take more than a second to complete successfully.

This started happening today. All backing services look healthy.

What we see on the Android Gateway logs:

2020-11-25 11:20:51.616 6001-6026/tech.relaycorp.gateway I/DeliverParcelsToGateway: Could not deliver parcels due to server error
    tech.relaycorp.relaynet.bindings.pdc.ServerConnectionException: Failed to connect to https://poweb-test.relaycorp.tech:443/v1/parcels
     Caused by: io.ktor.network.sockets.SocketTimeoutException: Socket timeout has been expired [url=https://poweb-test.relaycorp.tech:443/v1/parcels, socket_timeout=unknown] ms
     Caused by: java.net.SocketTimeoutException: timeout

What we see on the LB logs:

TODO

Undo custom debugging image
Undo 588d85e

Get GCB to use a limited, read-only service account when previewing PRs

I don't think that's possible but I'm checking with GCP: https://console.cloud.google.com/support/cases/detail/26077434?project=relaycorp-cloud-gateway

Spread pods evenly across AZs using Pod Topology Spread Constraints

Node/pod affinity is overly complex, error-prone and brittle. Especially on the gateway, given that it has 5 components.

Pod Topology Spread Constraints solves this issue, but it requires Kubernetes 1.19, which isn't yet supported by GKE.

Replace MongoDB with GCP Firestore as backend for public keys and certificates

Describe the problem

A HA MongoDB cluster is too complicated and expensive for what it's used for today, which is mostly to store public keys and certificates.

Describe the solution you'd like

Use GCP Firestore instead, which is fully managed by GCP and doesn't require any capacity planning/monitoring.

TODO

Implement FirestorePublicKeyStore.
Implement FirestoreCertificateStore.
Integrate FirestorePublicKeyStore and FirestoreCertificateStore in the gateway.
Provision GCP resources (including Google Service Accounts using Workload Identity).
Configure Firestore in the gateway.

Additional context

Write tests with Terraform Compliance

https://terraform-compliance.com/

Require connection to NATS instances on the same node or availability zone

Blocked by nats-io/k8s#174

Create helmfile builder to replace the one created by the GCB community

The community builder for helmfile isn't great:

It's massive. 1.4 GB.
It seems abandoned and PRs aren't followed up; e.g. GoogleCloudPlatform/cloud-builders-community#463 and GoogleCloudPlatform/cloud-builders-community#462
You can't use kubectl from hooks because gcloud isn't where the kube context says it is.
Its install of helm diff doesn't actually work.

I had to create https://github.com/relaycorp/cloud-gateway/blob/d18b95cd2590fea7bedfa4744a12eeb81e51a7e2/charts/scripts/helmfile.sh and patch the builder to work around most of the issues above.

Set up continuous deployment of Relaynet-Internet Gateway to a production environment

Output: https://github.com/relaycorp/cloud-gateway

TODO

Create GitHub issue when GCB fails to deploy changes

https://cloud.google.com/cloud-build/docs/configuring-notifications/configure-http

Identity public key can't be exported

Both the CogRPC server and the certificate rotator job are failing because PKI.js is trying to export a public key from a private key using the private key's parameters (instead of actually export()ing the private key to SPKI).

 GCPKeystoreError: Private key cannot be exported (requested format: jwk)
at .GcpKmsRsaPssProvider.onExportKey ( /opt/gw/node_modules/@relaycorp/awala-keystore-cloud/src/lib/gcp/GcpKmsRsaPssProvider.ts:36 )
at .GcpKmsRsaPssProvider.exportKey ( /opt/gw/node_modules/webcrypto-core/build/webcrypto-core.js:220 )
at .SubtleCrypto.exportKey ( /opt/gw/node_modules/webcrypto-core/build/webcrypto-core.js:1465 )
at .CryptoEngine.exportKey ( /opt/gw/node_modules/pkijs/build/index.js:5555 )
at .derSerializePublicKey ( /opt/gw/node_modules/@relaycorp/relaynet-core/src/lib/crypto/keys/serialisation.ts:17 )
at .getRSAPublicKeyFromPrivate ( /opt/gw/node_modules/@relaycorp/relaynet-core/src/lib/crypto/keys/generation.ts:55 )
at .InternetGatewayManager.get ( /opt/gw/node_modules/@relaycorp/relaynet-core/src/lib/nodes/managers/NodeManager.ts:46 )
at process.processTicksAndRejections ( node:internal/process/task_queues:95 )
at .InternetGatewayManager.getCurrent ( /opt/gw/src/node/InternetGatewayManager.ts:37 )
at .<anonymous> ( /opt/gw/src/queueWorkers/crcIncoming.ts:48 )

https://console.cloud.google.com/errors/detail/CMyRtJ_Opc7mHg?project=gw-frankfurt-4065

Regenerate key pairs in Frankfurt and extend expiry date to 6 months

See https://github.com/relaycorp/relaynet-internet-gateway/blob/b7aafc931f013c7ebb9e1599434d9cd464866e8b/src/bin/generate-keypairs.ts#L21-L22
Update Android Gateway to use new certificate as fallback.

Replace Minio with GCS

And whilst I'm at it, configure the bucket to have objects expire after 48 hours (for the alpha testing).

Depends on relaycorp/awala-gateway-internet#200

Revoke unnecessary permissions granted to GCB service account by default

For example, we should:

Limit access to the specified GKE clusters.
Limit access to the few GCS buckets it accesses.

See: https://cloud.google.com/cloud-build/docs/cloud-build-service-account#default_permissions_of_service_account

Prove that our cloud infrastructure matches the code in our open source repositories

Executive summary

We must prove that our cloud infrastructure matches the Docker images, Kubernetes resources and Terraform resources in our open source repositories. Simply asking people to trust us is not an option: They must have the certainty that we're not spying on them, selling their data, running a mass surveillance programme for the Five Eyes or censoring people.

Another (equally important) reason to do this is to protect the Relaycorp SRE team from powerful adversaries who might secretly try to force us (collectively or individually) to give away certain metadata or censor certain users/services. If every change or external access to the infrastructure is independently verified, this threat should be avoided. (Unfortunately, an attacker unaware of this measure may still target us, but we can mitigate this by advertising very prominently the fact that our cloud infrastructure is independently verified.)

We basically have to prove two things: That our cloud infrastructure is exactly what people can find on GitHub, and that we don't have any backdoors.

Why? Relaynet is end-to-end encrypted and doesn't leak PII

Indeed, those two properties make Relaynet apps immune to a wide range of privacy threats you'd tend to find in Internet apps. However, we could theoretically still infer the following:

Who's talking to whom: Each device running Relaynet will have a globally unique address derived from a public key (analogous to Bitcoin addresses), so, theoretically, we -- as the operator of a Relaynet-Internet Gateway -- could infer who's talking to whom because we'd have the address of the two private gateways in any communication. Even after AwalaNetwork/specs#27 has been implemented, the operator of the public gateway could still infer who's talking to whom if both peers are served by the same public gateway.
When interacting with certain centralised services, we could identify the service the user is using. That's because messages bound for the service' servers may have a URL like https://relaynet.twitter.com. This doesn't apply to decentralised services, or centralised services where the provider chooses to run their server-side app behind a Relaynet-Internet Gateway.

Additionally, we need to log the IP addresses from end users and couriers so that we ensure our systems are being used fairly and to triage production issues.

Finally, it's likely we'll eventually have to block certain centralised services to comply with UK/US legislation, so in this case we'd have to prove that we're only blocking the services listed publicly. (This doesn't apply to decentralised services, which we could never block)

How? Cloud provenance is not a thing yet

Option A: Google Trillian Logs

In an ideal world, our cloud providers (Terraform Cloud, Kubernetes, GCP, Mongo Atlas and Cloudflare) would use a tool like Google Trillian to log provisioning, deprovisioning and access events. This would allow us to broadcast logs so that anyone anywhere can verify the integrity of our cloud infrastructure.

We'd essentially be moving the provenance issue up in the chain, and it'd be up to cloud providers to honour their contractual obligations with Relaycorp and comply with applicable legislation. They'd have a lot to lose if they don't.

But this option isn't really an option in the foreseeable future.

Option B: Ask a reputable, independent third-party to audit our infrastructure in real time

They'd basically get read-only access to the configuration of our cloud resources (but no access to the data inside), as well as their (de)provisioning and access logs. With this level of access, they could operate a system 24/7 to monitor our cloud resources and make sure they match the public Docker images, Kubernetes resources and Terraform resources.

I don't think a software tool like this exists yet, so we'll have to build it and make it open source. This tool has to be trivial to deploy, run and upgrade.

This tool should effectively make sure that provisioning and deprovisioning events match changes to cloud resources on GitHub. Additionally, the tool could also consume access logs so this independent party can be alerted to any direct access to the DB (for example) -- If we need to access the DB, we should justify that access to them (e.g., investigating a security vulnerability).

This would make offsite backups tricky, because we'd need a secret key to decrypt the backups if we need to restore them. One way to address this is by splitting the key, and having their part of the key available on demand in the tool. But this would introduce two additional challenges:

We'll have to do an event similar to a root CA key signing ceremony when generating and splitting the key.
The tool would have to be highly available: If we need to restore a backup, we have to be able to do it almost instantly -- with no advance notice or request. (Of course, they'd still be alerted if we retrieve it and we'd have to justify why we did it)

Option C: Deploy an independent tool that tracks our infrastructure in real time

We'd leverage the tool described in Option B, but we'd deploy it ourselves to a separate GCP project whose audit logs are publicly available.

Publishing audit logs is a bit risky, since they might (occasionally) contain sensitive information or PII about Relaycorp staff, which is why we're not making audit logs publicly available in the GCP projects hosting the services.

This option has no dependencies on third parties so it seems like the most likely approach to begin with.

Provenance is necessary but not sufficient to gain trust

There are many more things we have to do to gain people's trust, including non-technical measures such as transparency when dealing with law enforcement (I think Signal is an example to follow in this regard).

Require TLS to connect NATS Streaming to PostgreSQL

Blocked by nats-io/k8s#169

Integrate service mesh

For security (e.g., mTLS) and observability reasons.

We'll probably just use istio as GKE supports it out of the box.

Note: Knative comes with service mesh built-in (see relaycorp/awala-gateway-internet#761).

Enable GKE database encryption

See database_encryption in https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_cluster

PostgreSQL user password is stored in clear text in Stan

Because of nats-io/k8s#167

It also means that the password is included in the GCB logs, so we must reset the password when this is fixed.

Require connections to Vault instances on the same node or availability zone

Blocked by hashicorp/vault-helm#424

Disable TLS 1.0 and 1.1

https://cloud.google.com/load-balancing/docs/use-ssl-policies

Replace NATS Streaming with Redis PubSub

The problem

NATS Streaming is a delight to use in development, but less so to deploy and operate (especially when using Kubernetes + GitOps):

It requires two backing services: NATS server and a fault-tolerant, persistence backend. That means: Increased costs (labour + hosting), and more components that could fail.
I'm having to deploy a highly-available PostreSQL instance just for NATS Streaming, which is insane. We already use MongoDB, but it isn't supported. (MySQL and a distributed file storage are also supported, but neither offers any advantage over PostgreSQL)
I'm uncomfortable using the official Helm charts in production as I don't think they're production-ready, based on the issued I've found (and submitted PRs for), the workarounds I've had to implement, the overall brittleness of the resources in the chart (e.g., provisioning DB tables via the initdb script) and the overall insecurity of the chart (e.g., storing sensitive values in the clear instead of using secrets).

This is what the systems architecture looks like today:

The solution

relaycorp/awala-gateway-internet#929
relaycorp/awala-gateway-internet#930
Provision and integrate GCP resources (e.g., PubSub, Redis).
Deprovision GCP and Kubernetes resources used by NAT Streaming (e.g., PostgreSQL).

This is what it'll look like afterwards:

Alternatives considered

Kafka is too expensive at +$1.2k/month for the simplest possible HA cluster. Kafka also feels like a massive overkill for what it'd be used for.

Connect to DB using a static IP address

Same as relaycorp/terraform-google-awala-endpoint#2

Connect NATS Streaming to PostgreSQL using a service account

Instead of a password managed by Terraform.

Here's how to do this with a sidecar: https://cloud.google.com/sql/docs/postgres/connect-kubernetes-engine

Load balancer terminates keep-alive parcel collections after 30 seconds

Which is consistent with their documentation: https://cloud.google.com/kubernetes-engine/docs/concepts/ingress-xlb#support_for_websocket

The only "solution" I could find involves setting a global limit on the LB backend, which we can't use as that'd affect the regular HTTP endpoints in the PoWeb service. I'm checking with GCP if there's an alternative to this, but I doubt it: https://console.cloud.google.com/support/cases/detail/26133079?project=relaycorp-cloud-gateway.

I think the long-term solution should be to move the parcel collection endpoint to a separate service (relaycorp/awala-gateway-internet#324). But for now, we'll have to get private gateways to reconnect.

GCB can't install Vault due to RBAC permission issues

failed to create resource: clusterroles.rbac.authorization.k8s.io "vault-agent-injector-clusterrole" is forbidden: user "<redacted>@cloudbuild.gserviceaccount.com" (groups=["system:authenticated"]) is attempting to grant RBAC permissions not currently held:
  {APIGroups:["admissionregistration.k8s.io"], Resources:["mutatingwebhookconfigurations"], Verbs:["get" "list" "watch" "patch"]}

I've just raised a P2 case with GCP: https://console.cloud.google.com/support/cases/detail/26015995?orgonly=true&project=relaycorp-cloud-gateway

As a workaround, I'm going to grant GCB the GKE cluster admin role.

Replace MongoDB with GCP Firestore as backend for key/value configuration

Describe the problem

Once #86 is done, a whole HA MongoDB cluster would only be used to store 2-3 entries in a key/value table... Which is a bit of an overkill.

Describe the solution you'd like

Use a Kubernetes ConfigMap for this key/value configuration. It'd be good to implement the core of it as a separate library so it can also be used in relaycorp/awala-pong#26

TODO:

Integrate keyev + keyev-firestore, replacing MongoDB.
Write migration to save MongoDB values in keyev.

Additional context

Move environments into dedicated GCP projects

For example, the Frankfurt environment should live in its own dedicated GCP project.

This is needed because Datastore/Firestore ties the whole project to a region 🤦🏾‍♂️ : https://cloud.google.com/datastore/docs/locations#selecting_a_location

Parcel collection in keep-alive mode may be missing messages received after the connection started

That would explain relaycorp/relaynet-gateway-android#724

Replace HashiCorp Vault with GCP KMS

Describe the problem

Vault is a complex system to maintain, especially in production, as it uses many Kubernetes resources, a GCS bucket and a KMS keyring.

Describe the solution you'd like

Use GCP KMS in production, whilst keeping Vault in development.

This is what we'd need to do:

relaycorp/relayverse#37
Integrate the new private key store in the gateway
Release new gateway version to production.

Additional context

Increase TTL of stored messages to 3-6 months

Describe the problem

The Relaynet spec allows messages to be kept for up to 6 months, but I'm configuring GCS in #65 to delete messages after a small number of days whilst we're in an alpha phase.

Describe the solution you'd like

Keep the messages for a few months.

Load balancer can't be created with plaintext HTTP disabled

This annotation:

cloud-gateway/charts/gateway/values.yml.gotmpl

Line 16 in 5422ecf

kubernetes.io/ingress.allow-http: "false"

Will cause the following provisioning issue:

Error during sync: error running load balancer syncing routine: loadbalancer <whatever> does not exist: invalid ingress frontend configuration, please check your usage of the 'kubernetes.io/ingress.allow-http' annotation.

GCP support issue: https://console.cloud.google.com/support/cases/detail/26077051?project=relaycorp-cloud-gateway

Set up monitoring and alerting in the Relaynet-Internet Gateway and backing services

GKE metrics.
- Drop monitoring admin role
Logging. relaycorp/awala-gateway-internet#391
- Let folks in googleapis/nodejs-logging#875 know about https://github.com/relaycorp/pino-cloud-js
- Log uncaught exception: relaycorp/awala-gateway-internet#399
Error reporting. relaycorp/awala-gateway-internet#391
Tracing
Set up alerts.
- Billing.

Automate Vault upgrades

See: https://www.vaultproject.io/docs/platform/k8s/helm/run#upgrading-vault-servers

Implement automated and periodic backups of Vault's GCS bucket

Describe the problem

I've just emptied the bucket by mistake, and it would've been kind of nice to have a backup to restore. But that isn't the case, so I had to start afresh, which means that Android Gateway users must now reset their app's data.

Describe the solution you'd like

Vault's GCS bucket to be backed up every 8-24 hours (TBD). And a way to restore the backup when needed.

Automate deprovisioning of environments

I didn't do it as part of the provisioning because deprovisioning is a risky endeavour and has to be considered properly.

I'm inclined to solve this in two steps:

Prepare the Terraform-managed cloud resources for removal. I.e.:
- Disable prevent_destroy and resource-specific equivalents.
- Disable the GCB trigger for k8s deployments.
- Create (or enable) a new GCB trigger to be run before terraform destroy (see next step).
Run a GCB trigger to:
- Deprovision k8s resources, including cloud resources managed by k8s (e.g., the LB).
- Empty GCS buckets by creating a lifecycle rule that sets their objects' age to 0 days. This is much quicker and cheaper than deleting objects, especially in versioned buckets.

That way, deprovisioning an environment will involve the following steps:

Turn on the flag to allow destruction of the environment in Terraform.
Run the deprovisioning GCB trigger.
Delete the Terraform-managed resources.

relaycorp / cloud-gateway Goto Github PK

cloud-gateway's People

Contributors

Watchers

Forkers

cloud-gateway's Issues

TODO

Describe the problem

Describe the solution you'd like

Additional context

TODO

Executive summary

Why? Relaynet is end-to-end encrypted and doesn't leak PII

How? Cloud provenance is not a thing yet

Option A: Google Trillian Logs

Option B: Ask a reputable, independent third-party to audit our infrastructure in real time

Option C: Deploy an independent tool that tracks our infrastructure in real time

Provenance is necessary but not sufficient to gain trust

The problem

The solution

Alternatives considered

Describe the problem

Describe the solution you'd like

Additional context

Describe the problem

Describe the solution you'd like

Additional context

Describe the problem

Describe the solution you'd like

Describe the problem

Describe the solution you'd like

Recommend Projects

Recommend Topics

Recommend Org

Jobs