relaycorp / awala-gateway-internet

The Awala-Internet Gateway

Home Page: https://docs.relaycorp.tech/awala-gateway-internet/

License: GNU Affero General Public License v3.0

Languages: Dockerfile 0.11%, JavaScript 1.73%, TypeScript 97.15%, Mustache 1.01%
Topics: awala-gateway, awala

awala-gateway-internet's People

Contributors

dependabot-preview[bot], dependabot[bot], gnarea, snyk-bot


awala-gateway-internet's Issues

Delivered parcels can be redelivered after they've been collected

When a parcel is redelivered before it's been collected, the new parcel supersedes its predecessor per the Relaynet specs. However, the parcel is deleted once collected, so it's possible for an attacker to redeliver a previously collected parcel.

This could lead to replay attacks if private gateways fail to ignore previously-processed, incoming parcels.

App can crash during upgrade if there are private gateways connected

Due to the limit of one NATS subscriber per channel:

Error: too many subscriptions per channel
    at Object.callback (/opt/gw/node_modules/node-nats-streaming/lib/stan.js:708:28)
    at Object.callback (/opt/gw/node_modules/nats/lib/nats.js:2045:16)
    at Client.processMsg (/opt/gw/node_modules/nats/lib/nats.js:1437:13)
    at Client.processInbound (/opt/gw/node_modules/nats/lib/nats.js:1310:14)
    at Socket.<anonymous> (/opt/gw/node_modules/nats/lib/nats.js:820:10)
    at Socket.emit (events.js:314:20)
    at Socket.EventEmitter.emit (domain.js:483:12)
    at addChunk (_stream_readable.js:297:12)
    at readableAddChunk (_stream_readable.js:272:9)
    at Socket.Readable.push (_stream_readable.js:213:10)
    at TCP.onStreamRead (internal/stream_base_commons.js:188:23)

https://console.cloud.google.com/errors/CJK7mvCxyYuMBw?time=P30D&project=relaycorp-cloud-gateway

Implement messaging-level rate limiting

To prevent abuse, we should rate-limit the operations from private gateways across underlying networks (i.e., regardless of whether the message was received via the Internet or couriers).

The following limits should be implemented, with the concrete parameters defined via configuration by the operator:

  • Number of parcels to or from any given private gateway.
    • X parcels per gateway per minute.
    • Y parcels per gateway per hour.
    • Z parcels per gateway per day.
  • Ingress and egress network bandwidth per private gateway.
    • X MiB per gateway per hour.
    • Y MiB per gateway per day.
    • Z MiB per gateway per week.

System and software stack

Use rate-limiter-flexible with a Redis backend.
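Something along these lines could work (a minimal sketch assuming ioredis as the Redis client; the concrete limits and the isParcelAllowed helper are placeholders, not the actual implementation):

  import Redis from 'ioredis';
  import { RateLimiterRedis } from 'rate-limiter-flexible';

  // Connection details would come from operator configuration.
  const redisClient = new Redis();

  // One limiter per window, all keyed by the private gateway's address.
  const parcelsPerMinute = new RateLimiterRedis({
    storeClient: redisClient,
    keyPrefix: 'parcels-per-minute',
    points: 100, // Placeholder for "X parcels per gateway per minute".
    duration: 60,
  });
  const parcelsPerDay = new RateLimiterRedis({
    storeClient: redisClient,
    keyPrefix: 'parcels-per-day',
    points: 5_000, // Placeholder for "Z parcels per gateway per day".
    duration: 86_400,
  });

  export async function isParcelAllowed(privateGatewayAddress: string): Promise<boolean> {
    try {
      // consume() rejects once the gateway exhausts its points for the window.
      await parcelsPerMinute.consume(privateGatewayAddress);
      await parcelsPerDay.consume(privateGatewayAddress);
      return true;
    } catch {
      return false;
    }
  }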

Related issues

Create separate Kubernetes service for parcel collection endpoint

Describe the problem

GCP LBs close WebSocket connections after 30 seconds by default and we can't increase this limit on a per-endpoint basis. See relaycorp/cloud-gateway#53

Describe the solution you'd like

Serve the parcel collection endpoint via a dedicated Kubernetes service, so that we can set a high service-level timeout (e.g., 30 minutes).

This may have unknown implications, though, because we'd be doing something quite unusual: having one deployment proxied through two different services, with no overlap in the endpoints they serve (besides the health check).

This will also make functional testing harder because we'd probably have to create an ingress to route requests to the right service.

#325 is a prerequisite for this. Otherwise, the issue of dangling broken connections will be exacerbated by a very long timeout.

Proactively delete expired parcels

Instead of relying on lifecycle policies that will remove the parcels after 6 months (the maximum TTL), we should have a background mechanism that automatically deletes expired parcels. This only applies to parcels whose delivery (if any) hasn't been acknowledged by the recipient, since we still delete parcels upon receiving their ACK.

NATS Streaming doesn't support delayed messages so we'll have to use a different backing service. For example, we could index parcels in MongoDB and run a query periodically to filter expired parcels.
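A rough sketch of what the periodic clean-up could look like, assuming parcels are indexed in a hypothetical parcels collection with an expiryDate field (the names are illustrative, not the actual schema):

  import { MongoClient } from 'mongodb';

  export async function deleteExpiredParcels(mongoUri: string): Promise<void> {
    const client = await MongoClient.connect(mongoUri);
    try {
      const parcels = client.db().collection('parcels');
      // Only parcels past their expiry date; ACKed parcels were already
      // removed upon collection, so they never reach this point.
      const cursor = parcels.find({ expiryDate: { $lte: new Date() } });
      for await (const parcel of cursor) {
        // The corresponding object in the parcel store would be deleted here too.
        await parcels.deleteOne({ _id: parcel._id });
      }
    } finally {
      await client.close();
    }
  }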

Certification path of parcel sender is incomplete sometimes

That in turn causes the sender authorisation check to fail even though the sender is authorised.

I realised this was the case in #14 and spent an awful lot of time debugging it, but I haven't been able to replicate it outside this repo. I even created an integration test in relaynet-core-js to replicate the issue but the test passes: relaycorp/relaynet-core-js@8ee7be4

I decided to pause this and merge #14 with a failing test (which is skipped) so I can take a fresh look in a few days.

Implement PoWeb interface for private gateways

This is what the desktop/Android gateways will connect to when the Internet is available.

TODO:

  • Implement private gateway registration.
  • Implement parcel delivery.
  • Implement parcel collection (#205)
  • Write functional tests
  • Deploy to test environment

Set up development environment

  • Set up Docker, Docker Hub and Docker Compose (or Minikube).
  • Configure local services (e.g., DB).
  • Configure linter.
  • Configure build automation.
  • Configure test runner.
  • Configure CI.
  • Etc.

Gracefully close long-lived connections to backing services when processes end

Fastify helps here with its onClose hook, which the Mongoose plugin uses, but we're not currently calling fastify.close() after a SIGINT/SIGTERM. https://github.com/hemerajs/fastify-graceful-shutdown could do that for us, but it's a bit of overkill and has some caveats (as mentioned in its README).

All other processes (i.e., the gRPC server and background queues) don't currently offer a way to release resources upon process termination, so we'll have to implement something custom.
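For the Fastify-based services, something as small as this might be enough (a minimal sketch; the gRPC server and queue workers would need an equivalent of their own, e.g. the gRPC server's tryShutdown()):

  import fastify from 'fastify';

  const server = fastify({ logger: true });

  for (const signal of ['SIGINT', 'SIGTERM'] as const) {
    process.once(signal, async () => {
      // close() runs the onClose hooks, letting the Mongoose plugin (and any
      // other backing-service client) disconnect cleanly before we exit.
      await server.close();
      process.exit(0);
    });
  }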

Convert project to monorepo

We should eventually convert this repo to a monorepo to make it easier to maintain and manage its artefacts.

We could arrange the components based on the artefact they output, instead of having a single Docker image as the only output artefact:

  • crc-cogrpc, the server for the CogRPC binding as a Docker image.
  • pdc-pohttp, the server for the PoHTTP binding as a Docker image.
  • crc-incoming-queue, the background job that processes incoming cargoes, as a Docker image.
  • key-management, a script to generate/rotate/revoke identity (RSA) and session (ECDH) keys.
  • etc...

We shouldn't be too granular to begin with to minimise complexity. For example, the two crc-* components could be one at the start.

Consider using https://rushjs.io/ to manage the monorepo.

Parcels with a backing service as a destination aren't filtered out

Someone could craft a parcel in such a way that it'd result in a POST request from a background queue. It's unlikely that such requests would work, given that the payload would be a parcel, but to guard against the unlikely case that one of those requests has undesired effects, we should refuse destinations whose host name matches any of the following (see the sketch after this list):

  • Its TLD is .local.
  • It's got no TLD (e.g., vault).
  • It's a private subnet IP address.
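A simplified sketch of that check (the private-range regular expression below only covers the common IPv4 ranges and a real implementation would need to be more thorough):

  const PRIVATE_IPV4_RANGES = /^(10\.|127\.|169\.254\.|172\.(1[6-9]|2\d|3[01])\.|192\.168\.)/;

  export function isForbiddenParcelDestination(hostName: string): boolean {
    const host = hostName.toLowerCase();
    if (host.endsWith('.local')) {
      return true; // mDNS/.local hosts
    }
    if (!host.includes('.')) {
      return true; // No TLD (e.g., "vault")
    }
    return PRIVATE_IPV4_RANGES.test(host); // Private subnet IP addresses
  }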

PoWeb crashes when trying to send parcel to client after WS connection is closed

https://console.cloud.google.com/errors/CPyauKi4uIy8YQ?project=relaycorp-cloud-gateway

Unfortunately, the traceback is useless:

Error: WebSocket is not open: readyState 3 (CLOSED)
    at sendAfterClose (/opt/gw/node_modules/ws/lib/websocket.js:754:17)
    at WebSocket.send (/opt/gw/node_modules/ws/lib/websocket.js:345:7)
    at Duplex.duplex._write (/opt/gw/node_modules/ws/lib/stream.js:157:8)
    at doWrite (_stream_writable.js:403:12)
    at writeOrBuffer (_stream_writable.js:387:5)
    at Duplex.Writable.write (_stream_writable.js:318:11)
    at Object.sink (/opt/gw/node_modules/stream-to-it/sink.js:71:20)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (internal/process/task_queues.js:97:5)

But the logs do show that an attempt to send a parcel was made 48 ms after the connection was closed.
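A possible guard, assuming the ws library's WebSocket instance for the connection: skip the write (and end the stream) once the connection is no longer open. A minimal sketch:

  import WebSocket from 'ws';

  function trySendParcel(ws: WebSocket, parcelSerialized: Buffer): boolean {
    if (ws.readyState !== WebSocket.OPEN) {
      // The client already closed (readyState 3 in the traceback above), so
      // sending would throw; drop the write instead.
      return false;
    }
    ws.send(parcelSerialized);
    return true;
  }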


Parcel collection: Server ends connection with 1011 but client gets 1005 (with keep alive off)

This may actually be two separate issues.

Given a collection with keep alive off and one outgoing parcel:

  • The server is ending the connection with a 1011 due to a "Client failure". That happens after sending the parcel to the client but before an ACK is received.
  • Ktor on the client is ending the connection because either the PoWeb client or the server sent a 1005 to its peer.

Server logs:

[
  {
    "insertId": "dbjtxng48z3dw7",
    "jsonPayload": {
      "hostname": "gw-test-relaynet-internet-gateway-poweb-c754f748f-5zx7x",
      "peerGatewayAddress": "<redacted>",
      "msg": "Sending parcel",
      "pid": 6,
      "level": 30,
      "parcelObjectKey": "parcels/gateway-bound/<redacted>/<redacted>/<redacted>/39ef5ad3753dc7d0b25dfd734d85ba1cd91d6541cfadf17ad4393f7c7a09cf29",
      "reqId": "fa4b8734aac99f0aeb2bb8fe066af2d5/12532018683185456542"
    },
    "resource": {
      "type": "k8s_container",
      "labels": {
        "pod_name": "gw-test-relaynet-internet-gateway-poweb-c754f748f-5zx7x",
        "container_name": "poweb",
        "project_id": "public-gw",
        "namespace_name": "default",
        "cluster_name": "gateway-example",
        "location": "europe-west2-a"
      }
    },
    "timestamp": "2020-11-24T15:03:14.546Z",
    "severity": "INFO",
    "labels": {
      "compute.googleapis.com/resource_name": "gke-gateway-example-relaynet-gateway--e9096053-9qs9",
      "k8s-pod/app_kubernetes_io/instance": "gw-test",
      "k8s-pod/app_kubernetes_io/component": "poweb",
      "k8s-pod/app_kubernetes_io/name": "relaynet-internet-gateway",
      "k8s-pod/pod-template-hash": "c754f748f"
    },
    "logName": "projects/public-gw/logs/stdout",
    "receiveTimestamp": "2020-11-24T15:03:19.016243138Z"
  },
  {
    "insertId": "dbjtxng48z3dw8",
    "jsonPayload": {
      "msg": "Closing connection",
      "pid": 6,
      "closeCode": 1011,
      "level": 30,
      "reqId": "fa4b8734aac99f0aeb2bb8fe066af2d5/12532018683185456542",
      "closeReason": "Client failure",
      "hostname": "gw-test-relaynet-internet-gateway-poweb-c754f748f-5zx7x"
    },
    "resource": {
      "type": "k8s_container",
      "labels": {
        "location": "europe-west2-a",
        "project_id": "public-gw",
        "cluster_name": "gateway-example",
        "namespace_name": "default",
        "pod_name": "gw-test-relaynet-internet-gateway-poweb-c754f748f-5zx7x",
        "container_name": "poweb"
      }
    },
    "timestamp": "2020-11-24T15:03:14.753Z",
    "severity": "INFO",
    "labels": {
      "k8s-pod/app_kubernetes_io/instance": "gw-test",
      "k8s-pod/app_kubernetes_io/component": "poweb",
      "compute.googleapis.com/resource_name": "gke-gateway-example-relaynet-gateway--e9096053-9qs9",
      "k8s-pod/app_kubernetes_io/name": "relaynet-internet-gateway",
      "k8s-pod/pod-template-hash": "c754f748f"
    },
    "logName": "projects/public-gw/logs/stdout",
    "receiveTimestamp": "2020-11-24T15:03:19.016243138Z"
  }
]

Client traceback:

    java.util.concurrent.CancellationException: ActorCoroutine was cancelled
        at kotlinx.coroutines.ExceptionsKt.CancellationException(Exceptions.kt:22)
        at kotlinx.coroutines.channels.ActorCoroutine.onCancelling(Actor.kt:134)
        at kotlinx.coroutines.JobSupport.notifyCancelling(JobSupport.kt:332)
        at kotlinx.coroutines.JobSupport.tryMakeCompletingSlowPath(JobSupport.kt:916)
        at kotlinx.coroutines.JobSupport.tryMakeCompleting(JobSupport.kt:875)
        at kotlinx.coroutines.JobSupport.makeCompletingOnce$kotlinx_coroutines_core(JobSupport.kt:840)
        at kotlinx.coroutines.AbstractCoroutine.resumeWith(AbstractCoroutine.kt:111)
        at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:46)
        at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
        at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:571)
        at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:738)
        at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:678)
        at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:665)
     Caused by: java.lang.IllegalArgumentException: Code 1005 is reserved and may not be used.
        at okhttp3.internal.ws.WebSocketProtocol.validateCloseCode(WebSocketProtocol.kt:134)
        at okhttp3.internal.ws.RealWebSocket.close(RealWebSocket.kt:435)
        at okhttp3.internal.ws.RealWebSocket.close(RealWebSocket.kt:427)
        at io.ktor.client.engine.okhttp.OkHttpWebsocketSession$outgoing$1.invokeSuspend(OkHttpWebsocketSession.kt:62)
        at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
        at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106) 
        at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:571) 
        at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:738) 
        at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:678) 
        at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:665) 

Functional tests in CI stopped working at some point between 31st August and 11th September

And it's still broken:

FAIL src/functionalTests/internet_e2e.test.ts (16.645 s)
  ● Sending pings via PoWeb and receiving pongs via PoHTTP

    : Timeout - Async callback was not invoked within the 10000 ms timeout specified by jest.setTimeout.Timeout - Async callback was not invoked within the 10000 ms timeout specified by jest.setTimeout.Error:

      36 | import { generatePdaChain, IS_GITHUB, sleep } from './utils';
      37 | 
    > 38 | test('Sending pings via PoWeb and receiving pongs via PoHTTP', async () => {
         | ^
      39 |   const powebClient = PoWebClient.initLocal(GW_POWEB_LOCAL_PORT);
      40 |   const gwPDAChain = await generatePdaChain();
      41 |   const privateGatewaySigner = new Signer(

      at new Spec (../../node_modules/jest-jasmine2/build/jasmine/Spec.js:116:22)
      at Object.<anonymous> (internet_e2e.test.ts:38:1)

  ● Sending pings via CogRPC and receiving pongs via PoHTTP

    expect(received).toHaveLength(expected)

    Expected length: 2
    Received length: 1
    Received array:  [[]]

      113 |     expect(collectedCargoes).toHaveLength(1);
      114 |     const collectedMessages = await extractMessagesFromCargo(collectedCargoes[0], pdaChain);
    > 115 |     expect(collectedMessages).toHaveLength(2);
          |                               ^
      116 |     expect(ParcelCollectionAck.deserialize(collectedMessages[0])).toHaveProperty(
      117 |       'parcelId',
      118 |       pingParcelData.parcelId,

      at Object.<anonymous> (internet_e2e.test.ts:115:31)

Implement binding-level rate limiting

To prevent abuse, we should rate-limit the operations from Internet clients. Most of the limits will be tied to the IP address of the client, and it is assumed that multiple private gateways can be behind the same IP address.

The following limits should be implemented, with the concrete parameters defined via configuration by the operator:

  • Number of HTTP requests per IP address across all bindings.
    • X requests per IP address per second.
    • Y requests per IP address per minute.
    • Z requests per IP address per hour.
  • PoWeb binding:
    • Number of private gateway registrations (global).
      • Y registrations per minute.
      • Z registrations per hour.
    • Number of private gateway registrations per IP address.
      • X registrations per IP address per minute.
      • Y registrations per IP address per hour.
      • Z registrations per IP address per day.
    • Number of parcel deliveries per IP address.
      • X deliveries per IP address per second.
      • Y deliveries per IP address per minute.
      • Z deliveries per IP address per hour.
    • Number of parcel collection requests per IP address. This boils down to: How many private gateways do we want to allow per IP address?
      • X requests per IP address per minute.
  • PoHTTP binding:
    • Number of parcel deliveries per IP address.
      • X deliveries per IP address per second.
      • Y deliveries per IP address per minute.
      • Z deliveries per IP address per hour.
  • CogRPC binding:
    • Number of cargo collection calls per IP address.
      • X calls per IP address per second.
      • Y calls per IP address per minute.
      • Z calls per IP address per hour.
    • Number of cargo delivery calls per IP address.
      • X calls per IP address per minute.

System and software stack

Use rate-limiter-flexible with a Redis backend.
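A sketch of how the per-IP-address limit could be wired into the Fastify-based services (the 100-requests-per-minute figure and the helper name stand in for operator-defined values):

  import { FastifyInstance } from 'fastify';
  import Redis from 'ioredis';
  import { RateLimiterRedis } from 'rate-limiter-flexible';

  const requestsPerIp = new RateLimiterRedis({
    storeClient: new Redis(), // Connection details would come from configuration.
    keyPrefix: 'http-requests-per-ip',
    points: 100, // Placeholder for "X requests per IP address per minute".
    duration: 60,
  });

  export function registerIpRateLimiting(server: FastifyInstance): void {
    server.addHook('onRequest', async (request, reply) => {
      try {
        await requestsPerIp.consume(request.ip);
      } catch {
        return reply.code(429).send({ message: 'Too many requests' });
      }
    });
  }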

Related issues

Make parcel store compatible with GCS

GCS is supposed to be S3-compatible, but at least the listObjectsV2 method is not: The ContinuationToken parameter has a different name. There could be more compatibility issues though.
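One possible workaround, sketched under the assumption that we keep the AWS SDK (v2) S3 client: fall back to marker-based listing (ListObjects v1), which predates listObjectsV2 and should also be accepted by GCS's XML API. The bucket and prefix here are placeholders:

  import { S3 } from 'aws-sdk';

  // List keys using marker-based pagination instead of listObjectsV2's
  // ContinuationToken, which GCS names differently.
  export async function listParcelKeys(client: S3, bucket: string, prefix: string): Promise<string[]> {
    const keys: string[] = [];
    let marker: string | undefined;
    do {
      const page = await client.listObjects({ Bucket: bucket, Prefix: prefix, Marker: marker }).promise();
      const pageKeys = (page.Contents ?? []).map((object) => object.Key!);
      keys.push(...pageKeys);
      marker = page.IsTruncated ? pageKeys[pageKeys.length - 1] : undefined;
    } while (marker !== undefined);
    return keys;
  }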

Annotate logs with load balancer traces

For example, in Fastify, we'd have to do something like this on GCP per fastify/help#360:

  // Assuming request.id carries the X-Cloud-Trace-Context value sent by the LB
  // (formatted as "TRACE_ID/SPAN_ID;o=1"), extract the trace and span IDs and
  // bind them so Cloud Logging correlates the entries with the LB's trace.
  server.addHook('onRequest', (request, reply, done) => {
    if (typeof request.id === 'string') {
      const [trace, spanId] = request.id.split(';')[0].split('/');
      const bindings = {
        'logging.googleapis.com/spanId': spanId,
        'logging.googleapis.com/trace': `projects/<GCP_PROJECT_ID>/traces/${trace}`,
      };
      // Replace the request/reply loggers with children carrying the bindings.
      request.log = request.log.child(bindings);
      reply.log = reply.log.child(bindings);
    }
    done();
  });

I've just tested it in relaycorp/cloud-gateway@ad8c8a5:

[Screenshot: tracing]

TODO

  • Implement @relaycorp/pino-cloud-tracing library to generate child logger from a trace id like X-Amzn-Trace-Id or X-Cloud-Trace-Context (GCP).
  • Implement @relaycorp/fastify-lb-tracing Fastify plugin using the library above.
  • Integrate @relaycorp/fastify-lb-tracing in PoWeb service.
  • Integrate @relaycorp/fastify-lb-tracing in PoHTTP service.
  • Integrate @relaycorp/pino-cloud-tracing in PoWeb's WebSocket endpoint.
  • Integrate @relaycorp/pino-cloud-tracing in CogRPC service.

Redeliver parcels using exponential backoff

The implementation in #33 will retry failed attempts to deliver parcels as frequently as possible until the parcel is accepted or it expires. Besides being inefficient on our end, this may prolong the downtime of the target server.
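For example, the delay between attempts could grow exponentially with full jitter (a sketch; the base delay and cap are placeholders):

  function computeRetryDelayMs(attempt: number): number {
    const BASE_DELAY_MS = 5_000; // First retry after ~5 seconds.
    const MAX_DELAY_MS = 6 * 60 * 60 * 1_000; // Never wait more than 6 hours.
    const exponentialDelay = Math.min(BASE_DELAY_MS * 2 ** attempt, MAX_DELAY_MS);
    // Full jitter avoids retry storms when the target server recovers.
    return Math.random() * exponentialDelay;
  }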
