relaycorp / cloud-gateway

Infrastructure as Code and configuration for all Awala-Internet Gateways run by Relaycorp

License: MIT License

Languages: HCL 100.00%
Topics: cloud, awala-gateway, awala

cloud-gateway's People

Contributors: dependabot[bot], gnarea, relaybot-admin

Forkers: rajesh2k3

cloud-gateway's Issues

Run CogRPC service 24/7

We're currently running it on demand, which means Cloud Run will kill the service when it goes unused for a few minutes.

We're doing this to save costs whilst we're beta testing Awala/Letro, but we should undo this when we go live because it can take ~15 seconds for the service to start, which would be problematic with couriers.

See also #64.
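
A minimal Terraform sketch of the always-on setup, assuming the CogRPC service is deployed as a google_cloud_run_service resource (the name, region and image below are hypothetical):

    resource "google_cloud_run_service" "cogrpc" {
      name     = "cogrpc"        # hypothetical
      location = "europe-west3"  # hypothetical

      template {
        metadata {
          annotations = {
            # Keep at least one instance warm so couriers never hit the ~15-second cold start.
            "autoscaling.knative.dev/minScale" = "1"
          }
        }

        spec {
          containers {
            image = "gcr.io/example-project/cogrpc"  # hypothetical
          }
        }
      }
    }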

Parcel delivery often times out after 30 seconds

And it shouldn't take more than a second to complete successfully.

This started happening today. All backing services look healthy.

What we see on the Android Gateway logs:

2020-11-25 11:20:51.616 6001-6026/tech.relaycorp.gateway I/DeliverParcelsToGateway: Could not deliver parcels due to server error
    tech.relaycorp.relaynet.bindings.pdc.ServerConnectionException: Failed to connect to https://poweb-test.relaycorp.tech:443/v1/parcels
     Caused by: io.ktor.network.sockets.SocketTimeoutException: Socket timeout has been expired [url=https://poweb-test.relaycorp.tech:443/v1/parcels, socket_timeout=unknown] ms
     Caused by: java.net.SocketTimeoutException: timeout

What we see on the LB logs:

[Screenshot: load balancer logs]

TODO

  • Undo custom debugging image
  • Undo 588d85e

Replace MongoDB with GCP Firestore as backend for public keys and certificates

Describe the problem

An HA MongoDB cluster is too complicated and expensive for what it's used for today, which is mostly to store public keys and certificates.

Describe the solution you'd like

Use GCP Firestore instead, which is fully managed by GCP and doesn't require any capacity planning/monitoring.

TODO

  • Implement FirestorePublicKeyStore.
  • Implement FirestoreCertificateStore.
  • Integrate FirestorePublicKeyStore and FirestoreCertificateStore in the gateway.
  • Provision GCP resources, including Google Service Accounts using Workload Identity (see the sketch after this list).
  • Configure Firestore in the gateway.
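
A minimal Terraform sketch of the Workload Identity wiring for the provisioning item above, assuming the gateway pods run under a Kubernetes service account called gateway in the default namespace (all names are hypothetical):

    resource "google_service_account" "gateway" {
      account_id = "gateway"  # hypothetical
    }

    # Let the gateway read and write Firestore documents.
    resource "google_project_iam_member" "gateway_firestore" {
      project = var.project_id
      role    = "roles/datastore.user"
      member  = "serviceAccount:${google_service_account.gateway.email}"
    }

    # Bind the Kubernetes service account to the Google Service Account via Workload Identity.
    resource "google_service_account_iam_member" "gateway_workload_identity" {
      service_account_id = google_service_account.gateway.name
      role               = "roles/iam.workloadIdentityUser"
      member             = "serviceAccount:${var.project_id}.svc.id.goog[default/gateway]"
    }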

Additional context

See also:

Create helmfile builder to replace the one created by the GCB community

The community builder for helmfile isn't great:

I had to create https://github.com/relaycorp/cloud-gateway/blob/d18b95cd2590fea7bedfa4744a12eeb81e51a7e2/charts/scripts/helmfile.sh and patch the builder to work around most of the issues above.

Set up continuous deployment of Relaynet-Internet Gateway to a production environment

Output: https://github.com/relaycorp/cloud-gateway

TODO

  • Merge *-chart repos into respective app repos.
  • Recreate test env to ensure everything works.
  • Authenticate with Vault using Kubernetes service accounts:
  • Reinstate TLS in Stan-Postgres connections
  • Make GKE cluster regional (requires destroying the whole environment).
  • Move secrets shared with CF from k8s to GCP Secret Manager (see the sketch after this list): https://cloud.google.com/secret-manager
  • helmfile: use apply instead of sync. Require helm-diff. See: roboll/helmfile#1134
  • Stan pipeline: Don't leak DB password
  • Right-size cluster
  • Set requests and limits on each pod
  • Set replicas for each component
  • Split up the monolithic Terraform module into one module per environment.
  • NATS: Configure TTL for each channel.
  • Replace BackendConfig CRD with Terraform resource
  • Replace managed certificate CRD with TF resource
  • Integrate helmfile diff on PRs.
  • Enable clustering in NATS Streaming
  • Set up anti-affinity for each service
  • Gateway: checksum of CMs/secrets.
  • Set gateway key in values file
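
As a sketch of the Secret Manager item above, a secret shared with CF could be declared in Terraform roughly like this (the secret name is hypothetical, and the value would be supplied out of band):

    resource "google_secret_manager_secret" "cf_shared_secret" {
      secret_id = "cf-shared-secret"  # hypothetical

      replication {
        automatic = true
      }
    }

    resource "google_secret_manager_secret_version" "cf_shared_secret" {
      secret      = google_secret_manager_secret.cf_shared_secret.id
      secret_data = var.cf_shared_secret  # never committed to the repo
    }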

Identity public key can't be exported

Both the CogRPC server and the certificate rotator job are failing because PKI.js is trying to derive the public key from the private key's parameters (by export()ing the private key as a JWK, which the KMS-backed private key doesn't support) instead of export()ing the public key itself to SPKI.

GCPKeystoreError: Private key cannot be exported (requested format: jwk)
    at GcpKmsRsaPssProvider.onExportKey (/opt/gw/node_modules/@relaycorp/awala-keystore-cloud/src/lib/gcp/GcpKmsRsaPssProvider.ts:36)
    at GcpKmsRsaPssProvider.exportKey (/opt/gw/node_modules/webcrypto-core/build/webcrypto-core.js:220)
    at SubtleCrypto.exportKey (/opt/gw/node_modules/webcrypto-core/build/webcrypto-core.js:1465)
    at CryptoEngine.exportKey (/opt/gw/node_modules/pkijs/build/index.js:5555)
    at derSerializePublicKey (/opt/gw/node_modules/@relaycorp/relaynet-core/src/lib/crypto/keys/serialisation.ts:17)
    at getRSAPublicKeyFromPrivate (/opt/gw/node_modules/@relaycorp/relaynet-core/src/lib/crypto/keys/generation.ts:55)
    at InternetGatewayManager.get (/opt/gw/node_modules/@relaycorp/relaynet-core/src/lib/nodes/managers/NodeManager.ts:46)
    at process.processTicksAndRejections (node:internal/process/task_queues:95)
    at InternetGatewayManager.getCurrent (/opt/gw/src/node/InternetGatewayManager.ts:37)
    at <anonymous> (/opt/gw/src/queueWorkers/crcIncoming.ts:48)

https://console.cloud.google.com/errors/detail/CMyRtJ_Opc7mHg?project=gw-frankfurt-4065

Prove that our cloud infrastructure matches the code in our open source repositories

Executive summary

We must prove that our cloud infrastructure matches the Docker images, Kubernetes resources and Terraform resources in our open source repositories. Simply asking people to trust us is not an option: They must have the certainty that we're not spying on them, selling their data, running a mass surveillance programme for the Five Eyes or censoring people.

Another (equally important) reason to do this is to protect the Relaycorp SRE team from powerful adversaries who might secretly try to force us (collectively or individually) to give away certain metadata or censor certain users/services. If every change or external access to the infrastructure is independently verified, this threat should be avoided. (Unfortunately, an attacker unaware of this measure may still target us, but we can mitigate this by advertising very prominently the fact that our cloud infrastructure is independently verified.)

We basically have to prove two things: That our cloud infrastructure is exactly what people can find on GitHub, and that we don't have any backdoors.

Why? Relaynet is end-to-end encrypted and doesn't leak PII

Indeed, those two properties make Relaynet apps immune to a wide range of privacy threats you'd tend to find in Internet apps. However, we could theoretically still infer the following:

  • Who's talking to whom: Each device running Relaynet will have a globally unique address derived from a public key (analogous to Bitcoin addresses), so, theoretically, we -- as the operator of a Relaynet-Internet Gateway -- could infer who's talking to whom because we'd have the address of the two private gateways in any communication. Even after AwalaNetwork/specs#27 has been implemented, the operator of the public gateway could still infer who's talking to whom if both peers are served by the same public gateway.
  • When interacting with certain centralised services, we could identify which service the user is using. That's because messages bound for the service's servers may have a URL like https://relaynet.twitter.com. This doesn't apply to decentralised services, or to centralised services where the provider chooses to run their server-side app behind a Relaynet-Internet Gateway.

Additionally, we need to log the IP addresses of end users and couriers to ensure our systems are being used fairly and to triage production issues.

Finally, it's likely we'll eventually have to block certain centralised services to comply with UK/US legislation, so in this case we'd have to prove that we're only blocking the services listed publicly. (This doesn't apply to decentralised services, which we could never block)

How? Cloud provenance is not a thing yet

Option A: Google Trillian Logs

In an ideal world, our cloud providers (Terraform Cloud, Kubernetes, GCP, Mongo Atlas and Cloudflare) would use a tool like Google Trillian to log provisioning, deprovisioning and access events. This would allow us to broadcast logs so that anyone anywhere can verify the integrity of our cloud infrastructure.

We'd essentially be moving the provenance issue up in the chain, and it'd be up to cloud providers to honour their contractual obligations with Relaycorp and comply with applicable legislation. They'd have a lot to lose if they don't.

But this option isn't really an option in the foreseeable future.

Option B: Ask a reputable, independent third-party to audit our infrastructure in real time

They'd basically get read-only access to the configuration of our cloud resources (but no access to the data inside), as well as their (de)provisioning and access logs. With this level of access, they could operate a system 24/7 to monitor our cloud resources and make sure they match the public Docker images, Kubernetes resources and Terraform resources.

I don't think a software tool like this exists yet, so we'll have to build it and make it open source. This tool has to be trivial to deploy, run and upgrade.

This tool should effectively make sure that provisioning and deprovisioning events match changes to cloud resources on GitHub. Additionally, the tool could also consume access logs so this independent party can be alerted to any direct access to the DB (for example) -- If we need to access the DB, we should justify that access to them (e.g., investigating a security vulnerability).

This would make offsite backups tricky, because we'd need a secret key to decrypt the backups if we need to restore them. One way to address this is by splitting the key, and having their part of the key available on demand in the tool. But this would introduce two additional challenges:

  • We'll have to do an event similar to a root CA key signing ceremony when generating and splitting the key.
  • The tool would have to be highly available: If we need to restore a backup, we have to be able to do it almost instantly -- with no advance notice or request. (Of course, they'd still be alerted if we retrieve it and we'd have to justify why we did it)

Option C: Deploy an independent tool that tracks our infrastructure in real time

We'd leverage the tool described in Option B, but we'd deploy it ourselves to a separate GCP project whose audit logs are publicly available.

Publishing audit logs is a bit risky, since they might (occasionally) contain sensitive information or PII about Relaycorp staff, which is why we're not making audit logs publicly available in the GCP projects hosting the services.

This option has no dependencies on third parties so it seems like the most likely approach to begin with.
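
A Terraform sketch of the publicly available audit logs mentioned above, assuming they're exported to a dedicated GCS bucket (all names are hypothetical):

    resource "google_storage_bucket" "audit_logs" {
      name     = "watchtower-audit-logs"  # hypothetical
      location = "EU"                     # hypothetical
    }

    # Anyone can read the exported audit logs.
    resource "google_storage_bucket_iam_member" "public_read" {
      bucket = google_storage_bucket.audit_logs.name
      role   = "roles/storage.objectViewer"
      member = "allUsers"
    }

    # Export the project's audit logs to the bucket.
    resource "google_logging_project_sink" "audit" {
      name                   = "audit-logs-to-gcs"
      destination            = "storage.googleapis.com/${google_storage_bucket.audit_logs.name}"
      filter                 = "logName:\"cloudaudit.googleapis.com\""
      unique_writer_identity = true
    }

    # The sink's writer identity needs permission to write the log objects.
    resource "google_storage_bucket_iam_member" "sink_writer" {
      bucket = google_storage_bucket.audit_logs.name
      role   = "roles/storage.objectCreator"
      member = google_logging_project_sink.audit.writer_identity
    }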

Provenance is necessary but not sufficient to gain trust

There are many more things we have to do to gain people's trust, including non-technical measures such as transparency when dealing with law enforcement (I think Signal is an example to follow in this regard).

Replace NATS Streaming with Redis PubSub

The problem

NATS Streaming is a delight to use in development, but less so to deploy and operate (especially when using Kubernetes + GitOps):

  • It requires two backing services: a NATS server and a fault-tolerant persistence backend. That means increased costs (labour + hosting) and more components that could fail.
  • I'm having to deploy a highly-available PostgreSQL instance just for NATS Streaming, which is insane. We already use MongoDB, but it isn't supported. (MySQL and distributed file storage are also supported, but neither offers any advantage over PostgreSQL.)
  • I'm uncomfortable using the official Helm charts in production as I don't think they're production-ready, based on the issues I've found (and submitted PRs for), the workarounds I've had to implement, the overall brittleness of the resources in the chart (e.g., provisioning DB tables via the initdb script) and the overall insecurity of the chart (e.g., storing sensitive values in the clear instead of using secrets).

This is what the systems architecture looks like today:

[Diagram: current architecture]

The solution

This is what it'll look like afterwards:

[Diagram: architecture after the change]
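
A Terraform sketch of the Redis backing service this design would need, assuming GCP Memorystore is used (the name, size and region are hypothetical):

    resource "google_redis_instance" "queue" {
      name           = "gateway-queue"  # hypothetical
      tier           = "STANDARD_HA"    # replicated across zones
      memory_size_gb = 1                # hypothetical
      region         = "europe-west3"   # hypothetical
    }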

Alternatives considered

Kafka is too expensive at over $1.2k/month for the simplest possible HA cluster, and it feels like massive overkill for what it'd be used for.

Load balancer terminates keep-alive parcel collections after 30 seconds

Which is consistent with their documentation: https://cloud.google.com/kubernetes-engine/docs/concepts/ingress-xlb#support_for_websocket

The only "solution" I could find involves setting a global limit on the LB backend, which we can't use as that'd affect the regular HTTP endpoints in the PoWeb service. I'm checking with GCP if there's an alternative to this, but I doubt it: https://console.cloud.google.com/support/cases/detail/26133079?project=relaycorp-cloud-gateway.

I think the long-term solution should be to move the parcel collection endpoint to a separate service (relaycorp/awala-gateway-internet#324). But for now, we'll have to get private gateways to reconnect.

GCB can't install Vault due to RBAC permission issues

failed to create resource: clusterroles.rbac.authorization.k8s.io "vault-agent-injector-clusterrole" is forbidden: user "<redacted>@cloudbuild.gserviceaccount.com" (groups=["system:authenticated"]) is attempting to grant RBAC permissions not currently held:
  {APIGroups:["admissionregistration.k8s.io"], Resources:["mutatingwebhookconfigurations"], Verbs:["get" "list" "watch" "patch"]}

I've just raised a P2 case with GCP: https://console.cloud.google.com/support/cases/detail/26015995?orgonly=true&project=relaycorp-cloud-gateway

As a workaround, I'm going to grant GCB the GKE cluster admin role.
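
A Terraform sketch of that workaround, assuming the default Cloud Build service account:

    data "google_project" "current" {}

    # Workaround: let Cloud Build administer the GKE cluster until the RBAC issue is resolved.
    resource "google_project_iam_member" "cloudbuild_gke_admin" {
      project = data.google_project.current.project_id
      role    = "roles/container.clusterAdmin"
      member  = "serviceAccount:${data.google_project.current.number}@cloudbuild.gserviceaccount.com"
    }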

Replace MongoDB with GCP Firestore as backend for key/value configuration

Describe the problem

Once #86 is done, a whole HA MongoDB cluster would only be used to store 2-3 entries in a key/value table, which is overkill.

Describe the solution you'd like

Use a Kubernetes ConfigMap for this key/value configuration. It'd be good to implement the core of it as a separate library so it can also be used in relaycorp/awala-pong#26

TODO:

  • Integrate keyv + keyv-firestore, replacing MongoDB.
  • Write a migration to save the MongoDB values in keyv.

Additional context

See also:

Increase TTL of stored messages to 3-6 months

Describe the problem

The Relaynet spec allows messages to be kept for up to 6 months, but I'm configuring GCS in #65 to delete messages after a small number of days whilst we're in an alpha phase.

Describe the solution you'd like

Keep the messages for a few months.
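
A Terraform sketch of the corresponding GCS lifecycle rule, assuming a 6-month retention and a dedicated messages bucket (the bucket name and location are hypothetical):

    resource "google_storage_bucket" "messages" {
      name     = "gw-messages"  # hypothetical
      location = "EU"           # hypothetical

      lifecycle_rule {
        action {
          type = "Delete"
        }
        condition {
          age = 180  # days; the spec allows messages to be kept for up to 6 months
        }
      }
    }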

Load balancer can't be created with plaintext HTTP disabled

This annotation:

kubernetes.io/ingress.allow-http: "false"

will cause the following provisioning error:

Error during sync: error running load balancer syncing routine: loadbalancer <whatever> does not exist: invalid ingress frontend configuration, please check your usage of the 'kubernetes.io/ingress.allow-http' annotation.

GCP support issue: https://console.cloud.google.com/support/cases/detail/26077051?project=relaycorp-cloud-gateway

Set up monitoring and alerting in the Relaynet-Internet Gateway and backing services

Implement automated and periodic backups of Vault's GCS bucket

Describe the problem

I've just emptied the bucket by mistake, and it would've been kind of nice to have a backup to restore. But that isn't the case, so I had to start afresh, which means that Android Gateway users must now reset their app's data.

Describe the solution you'd like

Vault's GCS bucket should be backed up every 8-24 hours (TBD), and there should be a way to restore a backup when needed.
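
One possible sketch in Terraform is a Storage Transfer Service job that copies the Vault bucket into a separate backup bucket every night (bucket names and the schedule are hypothetical; the Storage Transfer service account would also need read access to the source bucket and write access to the sink):

    resource "google_storage_bucket" "vault_backups" {
      name     = "vault-backups"  # hypothetical
      location = "EU"             # hypothetical
    }

    resource "google_storage_transfer_job" "vault_backup" {
      description = "Nightly copy of the Vault bucket"

      transfer_spec {
        gcs_data_source {
          bucket_name = "vault-data"  # hypothetical: the bucket Vault writes to
        }
        gcs_data_sink {
          bucket_name = google_storage_bucket.vault_backups.name
        }
      }

      schedule {
        schedule_start_date {
          year  = 2021
          month = 1
          day   = 1
        }
        start_time_of_day {
          hours   = 3
          minutes = 0
          seconds = 0
          nanos   = 0
        }
      }
    }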

Automate deprovisioning of environments

I didn't do it as part of the provisioning because deprovisioning is a risky endeavour and has to be considered properly.

I'm inclined to solve this in two steps:

  • Prepare the Terraform-managed cloud resources for removal. I.e.:
    • Disable prevent_destroy and resource-specific equivalents.
    • Disable the GCB trigger for k8s deployments.
    • Create (or enable) a new GCB trigger to be run before terraform destroy (see next step).
  • Run a GCB trigger to:
    • Deprovision k8s resources, including cloud resources managed by k8s (e.g., the LB).
    • Empty GCS buckets by creating a lifecycle rule that sets their objects' age to 0 days (sketched at the end of this issue). This is much quicker and cheaper than deleting objects, especially in versioned buckets.

That way, deprovisioning an environment will involve the following steps:

  1. Turn on the flag to allow destruction of the environment in Terraform.
  2. Run the deprovisioning GCB trigger.
  3. Delete the Terraform-managed resources.
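
A Terraform sketch of the bucket-emptying lifecycle rule mentioned above, applied to a hypothetical bucket at deprovisioning time:

    resource "google_storage_bucket" "parcels" {
      name     = "gw-parcels"  # hypothetical
      location = "EU"          # hypothetical

      # Applied only when tearing the environment down: GCS purges the objects in the
      # background, which is quicker and cheaper than deleting them one by one.
      lifecycle_rule {
        action {
          type = "Delete"
        }
        condition {
          age = 0
        }
      }
    }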
