relaycorp/awala-gateway-internet
The Awala-Internet Gateway
Home Page: https://docs.relaycorp.tech/awala-gateway-internet/
License: GNU Affero General Public License v3.0
We need to call cca.validate() when collecting cargo via CogRPC.
We need to use WebSocket pings to detect broken connections, and then terminate them.
See client-side counterparts:
When a parcel is redelivered before it's been collected, the new parcel supersedes its predecessor per the Relaynet specs. However, the parcel will be deleted once collected, so it's possible for an attacker to redeliver a parcel.
This could lead to replay attacks if private gateways fail to ignore previously-processed, incoming parcels.
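One way to ignore previously-processed parcels is to record each parcel's id until the parcel itself would have expired. This is only a sketch of that bookkeeping; in production the records would live in a shared store (e.g., MongoDB), keyed per peer gateway:

```typescript
// Sketch only: remember each processed parcel id until the parcel expires,
// so a redelivered copy can be recognised and ignored.
class ProcessedParcels {
  private readonly expiries = new Map<string, number>(); // parcel id -> expiry (ms epoch).

  wasProcessed(parcelId: string, nowMs: number): boolean {
    const expiry = this.expiries.get(parcelId);
    return expiry !== undefined && expiry > nowMs;
  }

  markProcessed(parcelId: string, expiryMs: number): void {
    this.expiries.set(parcelId, expiryMs);
  }

  // Drop records for parcels that have expired anyway, since an expired
  // redelivery would be refused on its own.
  prune(nowMs: number): void {
    for (const [id, expiry] of this.expiries) {
      if (expiry <= nowMs) this.expiries.delete(id);
    }
  }
}
```

The expiry-based pruning keeps the store bounded: a record only needs to outlive the parcel it protects against.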
Due to the max limit of 1 NATS subscriber per channel:
Error: too many subscriptions per channel
at Object.callback (/opt/gw/node_modules/node-nats-streaming/lib/stan.js:708:28)
at Object.callback (/opt/gw/node_modules/nats/lib/nats.js:2045:16)
at Client.processMsg (/opt/gw/node_modules/nats/lib/nats.js:1437:13)
at Client.processInbound (/opt/gw/node_modules/nats/lib/nats.js:1310:14)
at Socket.<anonymous> (/opt/gw/node_modules/nats/lib/nats.js:820:10)
at Socket.emit (events.js:314:20)
at Socket.EventEmitter.emit (domain.js:483:12)
at addChunk (_stream_readable.js:297:12)
at readableAddChunk (_stream_readable.js:272:9)
at Socket.Readable.push (_stream_readable.js:213:10)
at TCP.onStreamRead (internal/stream_base_commons.js:188:23)
https://console.cloud.google.com/errors/CJK7mvCxyYuMBw?time=P30D&project=relaycorp-cloud-gateway
To prevent abuse, we should rate-limit the operations from private gateways across underlying networks (i.e., regardless of whether the message was received via the Internet or couriers).
The following limits should be implemented, with the concrete parameters defined via configuration by the operator:
- X parcels per gateway per minute.
- Y parcels per gateway per hour.
- Z parcels per gateway per day.
- X MiB per gateway per hour.
- Y MiB per gateway per day.
- Z MiB per gateway per week.

Use rate-limiter-flexible with a Redis backend.
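The semantics we need from rate-limiter-flexible can be sketched as a fixed-window counter. This is illustrative only: in production the counters would live in Redis so all replicas share them, and the X/Y/Z parameters would come from the operator's configuration.

```typescript
// Illustrative fixed-window rate limiter; one instance per limit above,
// keyed by the private gateway's address. Parameters are placeholders.
class FixedWindowLimiter {
  private readonly counters = new Map<string, { windowStart: number; count: number }>();

  constructor(
    private readonly points: number, // Max operations per window.
    private readonly durationMs: number, // Window length in milliseconds.
  ) {}

  // Returns true if the operation is allowed, false if the quota is exhausted.
  consume(key: string, now = Date.now()): boolean {
    const entry = this.counters.get(key);
    if (entry === undefined || now - entry.windowStart >= this.durationMs) {
      this.counters.set(key, { windowStart: now, count: 1 }); // New window.
      return true;
    }
    if (entry.count >= this.points) {
      return false; // Limit exceeded: refuse the operation.
    }
    entry.count += 1;
    return true;
  }
}
```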
We should dedupe them like we dedupe CCAs.
GCP LBs close WebSocket connections after 30 seconds by default and we can't increase this limit on a per-endpoint basis. See relaycorp/cloud-gateway#53
Serve the parcel collection endpoint via a dedicated Kubernetes service, so that we can set a high service-level timeout (e.g., 30 minutes).
This may have some unknown implications, though, because we'd be doing something very unusual: having one deployment proxied through two different services, without any overlap in the endpoints they serve (besides the health check).
This will also make functional testing harder because we'd probably have to create an ingress to route requests to the right service.
#325 is a prerequisite for this. Otherwise, the issue of dangling broken connections will be exacerbated with a very long timeout.
Instead of relying on lifecycle policies that will remove the parcels after 6 months (the maximum TTL), we should have a background mechanism that automatically deletes expired parcels. This only applies to parcels whose delivery (if any) hasn't been acknowledged by the recipient, since we still delete parcels upon receiving their ACK.
NATS Streaming doesn't support delayed messages so we'll have to use a different backing service. For example, we could index parcels in MongoDB and run a query periodically to filter expired parcels.
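Assuming parcels get indexed in MongoDB with an expiry date and an acknowledgement flag (field names here are hypothetical), the periodic query's predicate amounts to:

```typescript
// Hypothetical record shape for the MongoDB index of stored parcels.
interface ParcelRecord {
  objectKey: string; // Location of the parcel in object storage.
  expiryDate: Date;
  acknowledged: boolean;
}

// Select the parcels the background job should delete: already expired and
// never acknowledged (acknowledged parcels are deleted upon receiving the ACK).
function selectExpiredParcels(parcels: ParcelRecord[], now: Date): string[] {
  return parcels
    .filter((parcel) => !parcel.acknowledged && parcel.expiryDate <= now)
    .map((parcel) => parcel.objectKey);
}
```

In MongoDB itself this would be a `deleteMany`-style query with the same predicate, followed by deleting the corresponding objects from object storage.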
The receiving end of relaycorp/awala-pong#2 and the counterpart to relaycorp/awala-gateway-desktop#11 in the user gateway.
The incoming parcel must be added to the queue, so it can be picked up in #2.
See deliverPndParcel and related code in the PoC: https://github.com/relaynet/poc/blob/master/bin/relayer-gateway
Prerequisites:
That in turn causes the sender authorisation check to fail even though the sender is authorised.
I realised this was the case in #14 and spent an awful lot of time debugging it, but I haven't been able to replicate it outside this repo. I even created an integration test in relaynet-core-js to replicate the issue but the test passes: relaycorp/relaynet-core-js@8ee7be4
I decided to pause this and merge #14 with a failing test (which is skipped) so I can take a fresh look in a few days.
See also #11.
TODO:
This is on the receiving end of relaycorp/relaynet-courier#12 and is equivalent to relaycorp/awala-gateway-desktop#14 in the user gateway.
This task must queue the cargo for decryption, and the encapsulated parcels must be queued so they can be picked up in #7.
Fastify helps with this via its onClose hook, which the Mongoose plugin uses, but we're not currently calling fastify.close() following a SIGINT/SIGTERM. https://github.com/hemerajs/fastify-graceful-shutdown can do that for us, but it's a bit of overkill and has some caveats (as mentioned in its README).
All other processes (i.e., the gRPC server and background queues) don't currently offer a way to release resources upon process termination, so we'll have to implement something custom.
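A custom version could be as small as a helper that runs a cleanup callback once on either signal. This is a sketch: with Fastify the closer would be `() => fastify.close()` (which runs the onClose hooks, e.g. the Mongoose plugin's), and the gRPC server and background queues would register their own closers.

```typescript
// Sketch: run the given cleanup exactly once when the process receives any
// of the termination signals, then exit.
function exitOn(
  closer: () => Promise<void>,
  signals: NodeJS.Signals[] = ['SIGINT', 'SIGTERM'],
): void {
  let shuttingDown = false;
  for (const signal of signals) {
    process.once(signal, () => {
      if (shuttingDown) return; // A second signal arrived while cleaning up.
      shuttingDown = true;
      closer().then(
        () => process.exit(0),
        () => process.exit(1), // Cleanup failed; exit anyway.
      );
    });
  }
}
```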
Due to PeculiarVentures/PKI.js#287
I missed this whilst working on #1.
E.g., max message size.
https://grpc.github.io/grpc/core/group__grpc__arg__keys.html
Due to websockets/ws#1811. Potential solution: alanshaw/stream-to-it#10
This is breaking the JVM/Android implementation due to relaycorp/awala-poweb-jvm#68
We should eventually convert this repo to a monorepo to make it easier to maintain and manage its artefacts.
We could arrange the components based on the artefact they output, instead of having a single Docker image as the only output artefact:
- crc-cogrpc, the server for the CogRPC binding, as a Docker image.
- pdc-pohttp, the server for the PoHTTP binding, as a Docker image.
- crc-incoming-queue, the background job that processes incoming cargoes, as a Docker image.
- key-management, a script to generate/rotate/revoke identity (RSA) and session (ECDH) keys.

We shouldn't be too granular to begin with, to minimise complexity. For example, the two crc-* components could be one at the start.
Consider using https://rushjs.io/ to manage the monorepo.
ws doesn't do it automatically: https://github.com/websockets/ws#how-to-detect-and-close-broken-connections
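The pattern from the ws README boils down to the following bookkeeping, written here against a minimal interface so it can be unit-tested. With ws, `isAlive` is set back to true in each socket's 'pong' listener and the round runs on a setInterval:

```typescript
// Minimal heartbeat sketch: on each round, terminate peers that never
// answered the previous ping with a pong, and ping the survivors.
interface Peer {
  isAlive: boolean; // Reset to true by the peer's 'pong' handler.
  ping(): void;
  terminate(): void;
}

function heartbeatRound(peers: Set<Peer>): void {
  for (const peer of peers) {
    if (!peer.isAlive) {
      peer.terminate(); // Broken connection: no pong since the last ping.
      peers.delete(peer);
      continue;
    }
    peer.isAlive = false; // Expect a pong before the next round.
    peer.ping();
  }
}
```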
The counterpart to relaycorp/relaynet-courier#12
Equivalent in user gateway: relaycorp/awala-gateway-desktop#14
Someone could craft a parcel in such a way that it'd result in a POST request from a background queue. It's unlikely that such requests would work, since the payload would be a parcel, but to guard against the unlikely case that one of those requests could have undesired effects, we should refuse destinations whose host name matches one of the following:
- .local host names.
- Internal service names (e.g., vault).
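A sketch of the guard; the exact blocklist is still to be decided, and `vault` here is just illustrative:

```typescript
// Illustrative blocklist for PoHTTP delivery targets.
const BLOCKED_HOSTNAME_PATTERNS = [
  /\.local$/i, // mDNS / cluster-internal names.
  /^vault$/i, // Internal service names (illustrative).
];

function isDeliveryHostAllowed(targetUrl: string): boolean {
  let hostname: string;
  try {
    hostname = new URL(targetUrl).hostname;
  } catch {
    return false; // Malformed URL: refuse.
  }
  return !BLOCKED_HOSTNAME_PATTERNS.some((pattern) => pattern.test(hostname));
}
```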
https://console.cloud.google.com/errors/CPyauKi4uIy8YQ?project=relaycorp-cloud-gateway
Unfortunately, the traceback is useless:
Error: WebSocket is not open: readyState 3 (CLOSED)
at sendAfterClose (/opt/gw/node_modules/ws/lib/websocket.js:754:17)
at WebSocket.send (/opt/gw/node_modules/ws/lib/websocket.js:345:7)
at Duplex.duplex._write (/opt/gw/node_modules/ws/lib/stream.js:157:8)
at doWrite (_stream_writable.js:403:12)
at writeOrBuffer (_stream_writable.js:387:5)
at Duplex.Writable.write (_stream_writable.js:318:11)
at Object.sink (/opt/gw/node_modules/stream-to-it/sink.js:71:20)
at runMicrotasks (<anonymous>)
at processTicksAndRejections (internal/process/task_queues.js:97:5)
But the logs do show that an attempt was made to send a parcel 48ms after the connection was closed.
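A likely mitigation is to check the socket's readyState immediately before writing, rather than letting ws throw. This is a sketch against a minimal socket interface; with ws, the constant would be WebSocket.OPEN:

```typescript
const OPEN = 1; // Same value as WebSocket.OPEN in ws.

interface SocketLike {
  readyState: number;
  send(data: Buffer): void;
}

// Returns false (instead of throwing) when the connection is already
// closing/closed, so the parcel can stay queued for a later collection.
function trySendParcel(ws: SocketLike, parcelSerialized: Buffer): boolean {
  if (ws.readyState !== OPEN) {
    return false;
  }
  ws.send(parcelSerialized);
  return true;
}
```

This only narrows the race (the connection can still close between the check and the write), so the send should also be wrapped in error handling.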
Otherwise we won't be able to test the CoSocket interface implemented in the relaying gateway.
This may actually be two separate issues.
Given a collection with keep alive off and one outgoing parcel:
- The connection is closed with code 1011 due to a "Client failure". That happens after sending the parcel to the client but before an ACK is received.
- A close frame with code 1005 is then sent to its peer.

Server logs:
[
{
"insertId": "dbjtxng48z3dw7",
"jsonPayload": {
"hostname": "gw-test-relaynet-internet-gateway-poweb-c754f748f-5zx7x",
"peerGatewayAddress": "<redacted>",
"msg": "Sending parcel",
"pid": 6,
"level": 30,
"parcelObjectKey": "parcels/gateway-bound/<redacted>/<redacted>/<redacted>/39ef5ad3753dc7d0b25dfd734d85ba1cd91d6541cfadf17ad4393f7c7a09cf29",
"reqId": "fa4b8734aac99f0aeb2bb8fe066af2d5/12532018683185456542"
},
"resource": {
"type": "k8s_container",
"labels": {
"pod_name": "gw-test-relaynet-internet-gateway-poweb-c754f748f-5zx7x",
"container_name": "poweb",
"project_id": "public-gw",
"namespace_name": "default",
"cluster_name": "gateway-example",
"location": "europe-west2-a"
}
},
"timestamp": "2020-11-24T15:03:14.546Z",
"severity": "INFO",
"labels": {
"compute.googleapis.com/resource_name": "gke-gateway-example-relaynet-gateway--e9096053-9qs9",
"k8s-pod/app_kubernetes_io/instance": "gw-test",
"k8s-pod/app_kubernetes_io/component": "poweb",
"k8s-pod/app_kubernetes_io/name": "relaynet-internet-gateway",
"k8s-pod/pod-template-hash": "c754f748f"
},
"logName": "projects/public-gw/logs/stdout",
"receiveTimestamp": "2020-11-24T15:03:19.016243138Z"
},
{
"insertId": "dbjtxng48z3dw8",
"jsonPayload": {
"msg": "Closing connection",
"pid": 6,
"closeCode": 1011,
"level": 30,
"reqId": "fa4b8734aac99f0aeb2bb8fe066af2d5/12532018683185456542",
"closeReason": "Client failure",
"hostname": "gw-test-relaynet-internet-gateway-poweb-c754f748f-5zx7x"
},
"resource": {
"type": "k8s_container",
"labels": {
"location": "europe-west2-a",
"project_id": "public-gw",
"cluster_name": "gateway-example",
"namespace_name": "default",
"pod_name": "gw-test-relaynet-internet-gateway-poweb-c754f748f-5zx7x",
"container_name": "poweb"
}
},
"timestamp": "2020-11-24T15:03:14.753Z",
"severity": "INFO",
"labels": {
"k8s-pod/app_kubernetes_io/instance": "gw-test",
"k8s-pod/app_kubernetes_io/component": "poweb",
"compute.googleapis.com/resource_name": "gke-gateway-example-relaynet-gateway--e9096053-9qs9",
"k8s-pod/app_kubernetes_io/name": "relaynet-internet-gateway",
"k8s-pod/pod-template-hash": "c754f748f"
},
"logName": "projects/public-gw/logs/stdout",
"receiveTimestamp": "2020-11-24T15:03:19.016243138Z"
}
]
Client traceback:
java.util.concurrent.CancellationException: ActorCoroutine was cancelled
at kotlinx.coroutines.ExceptionsKt.CancellationException(Exceptions.kt:22)
at kotlinx.coroutines.channels.ActorCoroutine.onCancelling(Actor.kt:134)
at kotlinx.coroutines.JobSupport.notifyCancelling(JobSupport.kt:332)
at kotlinx.coroutines.JobSupport.tryMakeCompletingSlowPath(JobSupport.kt:916)
at kotlinx.coroutines.JobSupport.tryMakeCompleting(JobSupport.kt:875)
at kotlinx.coroutines.JobSupport.makeCompletingOnce$kotlinx_coroutines_core(JobSupport.kt:840)
at kotlinx.coroutines.AbstractCoroutine.resumeWith(AbstractCoroutine.kt:111)
at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:46)
at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:571)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:738)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:678)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:665)
Caused by: java.lang.IllegalArgumentException: Code 1005 is reserved and may not be used.
at okhttp3.internal.ws.WebSocketProtocol.validateCloseCode(WebSocketProtocol.kt:134)
at okhttp3.internal.ws.RealWebSocket.close(RealWebSocket.kt:435)
at okhttp3.internal.ws.RealWebSocket.close(RealWebSocket.kt:427)
at io.ktor.client.engine.okhttp.OkHttpWebsocketSession$outgoing$1.invokeSuspend(OkHttpWebsocketSession.kt:62)
at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:571)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:738)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:678)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:665)
And configure Docker Hub to create a tag for each release.
And it's still broken:
FAIL src/functionalTests/internet_e2e.test.ts (16.645 s)
● Sending pings via PoWeb and receiving pongs via PoHTTP

  Timeout - Async callback was not invoked within the 10000 ms timeout specified by jest.setTimeout.
36 | import { generatePdaChain, IS_GITHUB, sleep } from './utils';
37 |
> 38 | test('Sending pings via PoWeb and receiving pongs via PoHTTP', async () => {
| ^
39 | const powebClient = PoWebClient.initLocal(GW_POWEB_LOCAL_PORT);
40 | const gwPDAChain = await generatePdaChain();
41 | const privateGatewaySigner = new Signer(
at new Spec (../../node_modules/jest-jasmine2/build/jasmine/Spec.js:116:22)
at Object.<anonymous> (internet_e2e.test.ts:38:1)
● Sending pings via CogRPC and receiving pongs via PoHTTP
expect(received).toHaveLength(expected)
Expected length: 2
Received length: 1
Received array: [[]]
113 | expect(collectedCargoes).toHaveLength(1);
114 | const collectedMessages = await extractMessagesFromCargo(collectedCargoes[0], pdaChain);
> 115 | expect(collectedMessages).toHaveLength(2);
| ^
116 | expect(ParcelCollectionAck.deserialize(collectedMessages[0])).toHaveProperty(
117 | 'parcelId',
118 | pingParcelData.parcelId,
at Object.<anonymous> (internet_e2e.test.ts:115:31)
Per https://specs.relaynet.link/RS-004#handshake-packets
This wasn't implemented in the PoC but it's similar to the handshake in PoWebSocket: https://github.com/relaynet/poc/blob/master/PoWebSocket/_handshake.js
To prevent abuse, we should rate-limit the operations from Internet clients. Most of the limits will be tied to the IP address of the client, and it is assumed that multiple private gateways can be behind the same IP address.
The following limits should be implemented, with the concrete parameters defined via configuration by the operator:
- X requests per IP address per second.
- Y requests per IP address per minute.
- Z requests per IP address per hour.
- Y registrations per minute.
- Z registrations per hour.
- X registrations per IP address per minute.
- Y registrations per IP address per hour.
- Z registrations per IP address per day.
- X deliveries per IP address per second.
- Y deliveries per IP address per minute.
- X deliveries per IP address per second.
- X requests per IP address per minute.
- X deliveries per IP address per second.
- Y deliveries per IP address per minute.
- Z deliveries per IP address per hour.
- X calls per IP address per second.
- Y calls per IP address per minute.
- Z calls per IP address per hour.
- X calls per IP address per minute.

Use rate-limiter-flexible with a Redis backend.
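Several of these limits apply to the same request, so a handler would consume from each relevant limiter and refuse the request (HTTP 429) if any is exhausted. A sketch, with `Limiter` standing in for rate-limiter-flexible's limiters, whose consume() rejects once the limit is hit:

```typescript
// Stand-in for rate-limiter-flexible's RateLimiterRedis interface.
interface Limiter {
  consume(key: string): Promise<void>; // Rejects when the limit is exceeded.
}

// Consume one point from every applicable limiter for this client IP;
// returns false if any limit is exhausted.
async function checkAllLimits(ip: string, limiters: Limiter[]): Promise<boolean> {
  try {
    await Promise.all(limiters.map((limiter) => limiter.consume(ip)));
    return true;
  } catch {
    return false; // At least one limit exceeded: respond with HTTP 429.
  }
}
```

Note that consuming from all limiters in parallel means a refused request may still have consumed points from the other limiters; whether to refund them is a design choice.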
We should just create a pool at startup and then reuse it in RPCs.
Deploy to GCP so that we can easily deploy/test the gateway.
TODO:
It should be X-Relaynet-Streaming-Mode with keep-alive and close-upon-completion as possible values, as in the JVM implementation.
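Parsing could be as simple as the following sketch; the header and value names come from this issue, but the default when the header is absent is an assumption to confirm against the JVM implementation:

```typescript
type StreamingMode = 'keep-alive' | 'close-upon-completion';

function parseStreamingMode(headerValue: string | undefined): StreamingMode {
  // Assumption: default to keep-alive when the header is absent/unrecognised.
  return headerValue?.trim().toLowerCase() === 'close-upon-completion'
    ? 'close-upon-completion'
    : 'keep-alive';
}
```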
Using relaycorp/relaynet-pohttp-js#3
This is analogous to deliverCrcParcel in the PoC: https://github.com/relaynet/poc/blob/master/bin/relayer-gateway
Counterpart in user gateway: relaycorp/awala-gateway-desktop#10
GCS is supposed to be S3-compatible, but at least the listObjectsV2 method is not: the ContinuationToken parameter has a different name. There could be more compatibility issues, though.
Specifying the current key id in the GATEWAY_KEY_ID environment variable makes it extremely cumbersome to bootstrap a new environment or do key rotation, so we should store that pointer to the current key id in a persistent data store like etcd or MongoDB.
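With the pointer in a data store, bootstrapping and rotation both reduce to generating a key and repointing. A sketch; the store interface and key name are illustrative:

```typescript
// Stand-in for a small key-value config store backed by MongoDB or etcd,
// shared by all instances of the gateway.
interface ConfigStore {
  get(key: string): Promise<string | undefined>;
  set(key: string, value: string): Promise<void>;
}

const CURRENT_KEY_ID = 'current-identity-key-id'; // Illustrative key name.

// Generate a new identity key pair and repoint the gateway at it.
async function rotateIdentityKey(
  store: ConfigStore,
  generateKey: () => Promise<string>, // Returns the new key pair's id.
): Promise<string> {
  const newKeyId = await generateKey();
  await store.set(CURRENT_KEY_ID, newKeyId);
  return newKeyId;
}

async function getCurrentKeyId(store: ConfigStore): Promise<string> {
  const keyId = await store.get(CURRENT_KEY_ID);
  if (keyId === undefined) {
    throw new Error('No identity key yet: bootstrap the environment first');
  }
  return keyId;
}
```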
Equivalent issue in pong service: relaycorp/awala-pong#26
Due to AwalaNetwork/specs#32
Fixing this requires relaycorp/relaynet-core-js#129
Since grpc is now deprecated: https://grpc.io/blog/grpc-js-1.0/
For example, in Fastify, we'd have to do something like this on GCP per fastify/help#360:
server.addHook('onRequest', (request, reply, done) => {
if (typeof request.id === 'string') {
const [trace, spanId] = request.id.split(';')[0].split('/');
const bindings = {
'logging.googleapis.com/spanId': spanId,
'logging.googleapis.com/trace': `projects/<GCP_PROJECT_ID>/traces/${trace}`,
};
request.log = request.log.child(bindings);
reply.log = reply.log.child(bindings);
}
done();
});
I've just tested it in relaycorp/cloud-gateway@ad8c8a5:
TODO:
- @relaycorp/pino-cloud-tracing: library to generate a child logger from a trace id like X-Amzn-Trace-Id (AWS) or X-Cloud-Trace-Context (GCP).
- @relaycorp/fastify-lb-tracing: Fastify plugin using the library above.
- Use @relaycorp/fastify-lb-tracing in the PoWeb service.
- Use @relaycorp/fastify-lb-tracing in the PoHTTP service.
- Use @relaycorp/pino-cloud-tracing in PoWeb's WebSocket endpoint.
- Use @relaycorp/pino-cloud-tracing in the CogRPC service.

The counterpart to relaycorp/relaynet-courier#13 and roughly equivalent to relaycorp/awala-gateway-desktop#15 in the user gateway.
See fetchCargoesPayloads and related code in the PoC implementation: https://github.com/relaynet/poc/blob/master/bin/relayer-gateway
The implementation in #33 will retry failed parcel deliveries as frequently as possible until the parcel is accepted or it expires. Apart from being inefficient on our end, this may prolong the downtime of the target server.
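A capped exponential backoff would space the retries out; the base and cap here are illustrative placeholders:

```typescript
// Delay before the next delivery attempt, doubling on each failure up to a
// cap, instead of retrying as frequently as possible.
function nextRetryDelayMs(
  attempt: number, // 0-based count of failed attempts so far.
  baseMs = 5_000,
  maxMs = 60 * 60 * 1_000, // Cap at one hour between attempts.
): number {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}
```

Retries would stop altogether once the parcel expires, regardless of the schedule.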
I missed this whilst working on #1.