divviup / prio-server
A Prio server implementation.
License: Mozilla Public License 2.0
The AWS IAM role assumption policy we define for an ingestion server in Google Cloud looks like this:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "accounts.google.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "accounts.google.com:sub": "${var.ingestor_google_service_account_id}"
        }
      }
    }
  ]
}
So this federates identity with accounts.google.com and lets service account var.ingestor_google_service_account_id assume the role. @yuriks points out in #51 that we could include an accounts.google.com:oaud condition in the policy. To do that, we'd first have to agree with ingestion server authors on what aud value they would specify when they request auth tokens from Google, making this a protocol issue.
As a workaround for a mismatch between connection pool timeouts in Hyper and AWS S3, #36 constructs HTTP clients with a carefully chosen timeout value. For more robustness, we should implement retries.
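As a sketch of the retry idea (a generic helper, assuming we retry whole transport operations; nothing here is existing facilitator API):

use std::{thread, time::Duration};

// A minimal retry helper with linear backoff; a real implementation would
// distinguish retryable errors (timeouts, 5xx) from permanent ones and use
// exponential backoff with jitter.
fn with_retries<T, E>(mut op: impl FnMut() -> Result<T, E>, max_attempts: u32) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) if attempt + 1 >= max_attempts => return Err(e),
            Err(_) => {
                attempt += 1;
                thread::sleep(Duration::from_millis(500 * u64::from(attempt)));
            }
        }
    }
}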
Our GKE cluster is currently configured to use public networks managed by Google/GKE. This means each worker node gets a public IP, and (I think) communication between worker nodes goes over the public internet. While this makes it trivial for our jobs to perform the egress they need (e.g. to AWS or GCP APIs or to fetch peer manifests), this is wasteful (ISRG is committed to environmentally responsible practices, which means reducing, reusing, recycling IPv4 addresses) and could be more secure. We should configure a private network for the GKE cluster and then narrowly control what egress and ingress we permit to worker nodes.
Running a test job on a PR today I saw GitHub throwing some warnings:
https://github.com/abetterinternet/prio-server/actions/runs/296499641
Apparently some GH Actions features are being deprecated to address a vulnerability: https://github.blog/changelog/2020-10-01-github-actions-deprecating-set-env-and-add-path-commands/
I don't see where we directly use this, so it might be the Terraform actions we depend on.
We could use avro_rs' support for compression codecs. However, the bulk of the data in the ingestion batches is encrypted, so we might not gain much from compression, and it's not impossible that we'll actually lose time doing the deflate/inflate.
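If we try it, enabling compression is a one-line change at Writer construction time, since avro_rs selects the codec there; a minimal sketch:

use avro_rs::{Codec, Schema, Writer};

// avro_rs applies block compression via the Codec passed when the Writer
// is constructed; Codec::Null is the current (uncompressed) behavior.
fn make_writer(schema: &Schema, compress: bool) -> Writer<'_, Vec<u8>> {
    let codec = if compress { Codec::Deflate } else { Codec::Null };
    Writer::with_codec(schema, Vec::new(), codec)
}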
It may not be practical to decrypt incoming packets with keys held in a product like Amazon KMS, but we might want to do it for the less frequently used Avro message signing key. This would require teaching the facilitator to use remote reference keys alongside the ring::signature stuff it does now.
We will use GitHub Actions to build and test the Rust code in prio-server/facilitator and libprio-rs. This will at least do build and test, emitting code coverage. If it's easy to do, we will emit x86_64 Linux binaries, but anyone else will have to cargo build for their own platform.
Currently, as implemented in #40, a failure in cargo fmt or clippy will immediately abort the job without running the actual cargo build or tests. This isn't ideal: while fmt or clippy failures should be addressed before a PR is merged, in most cases they wouldn't actually cause compilation errors. Executing the build and test steps regardless would save an edit-push-CI cycle, since any compilation or test errors would surface during the first build attempt rather than only after the fmt/lint problems are fixed.
I see two approaches to fixing this; one is to add if: conditions on every single later build step to instruct them to execute regardless.
Get a security assessment, review its findings and address the scary ones. Individual issues will be filed for remediation of individual findings.
For Narnia and the foreseeable future, the workflow manager won't know how to discover peer manifests or validate batch signatures, instead delegating that work to individual facilitator (data plane) jobs. That will make it harder to restrict network egress for those jobs. We can move this work into the workflow manager and have it hand static parameters to facilitator jobs it runs.
Besides the application-level signature scheme over Avro batches, Amazon S3 PUT requests can sign over the content being uploaded, which gives us an integrity check before the workflow manager evaluates an object. We should configure the bucket policies on buckets to require this so Amazon can do the signature verification for us.
https://docs.aws.amazon.com/AmazonS3/latest/API/sig-v4-header-based-auth.html
#51 makes the assumption that all peer data share processors are operated by a single organization (i.e., ISRG operates all facilitators, NIH/NCI operates all PHA servers). In particular, the Terraform modules assume a single global manifest is associated with all peer data share processors. This won't hold forever, so we should refactor the representation of peer data share processors to allow for multiple operators and multiple global manifests. This would mean restructuring the peer_share_processor_names in the top-level .tfvars to either be a list of (pha-name, global manifest) pairs, or perhaps a map structure like
{
  operator-name-1 => {
    global-manifest-url => <url>
    pha-names => [pha-name1, pha-name2, ...]
  }
  operator-name-2 => {
    global-manifest-url => <url>
    pha-names => [...]
  }
}
...and then create and configure data share processors appropriately.
For Narnia and the immediate future past that, we create a single GCP service account and corresponding Kubernetes service account for each data share processor, and then have both the workflow manager and the individual facilitator jobs it dispatches run as that account. We could create service accounts for each individual workflow step and use them to construct more restrictive policies. This would let us deny the workflow manager the ability to delete certain objects, and deny facilitator jobs access to the Kubernetes API.
In order to de-bias the final data, we need to include the total number of shares that are included in the sum part. We will have to extend the PrioSumPart message to include a field for this and have the reduce step(s) record the value.
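Sketched as a Rust struct (the field name total_individual_clients is an assumption, not a settled schema change):

// Sketch only: the one new field needed for de-biasing, alongside the
// existing encoded sum. Field names are assumptions, not settled schema.
struct PrioSumPart {
    value_sum: Vec<u8>,            // existing encoded sum
    total_individual_clients: i64, // new: number of shares included in the sum
}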
Once we have a Terraformized application, we still need tooling around it that can do things like issue keys, construct keyfiles, post them to S3 buckets, get certificates, etc. This tool would fit into a PHA onboarding workflow in which we obtain a minimal set of parameters from new PHAs (e.g., S3 bucket URLs, keyfile location) and let the tool do the rest.
Right now we have some statically deployed manifest files - we should deploy these using GCP instead.
From #4
- I think we should probably split each subcommand into a separate binary, or at least a separate file. That huge main.rs command parser looks very unwieldy.
- I think we probably don't want to do all configuration through flags, because there's a lot of knobs already and it seems we'll only have more, but we'll need to figure out a config file schema or what to do instead first.
- Maybe consider using clap v3 (no stable release yet) or https://github.com/TeXitoi/structopt (which clap v3 is based on) to do the flag parsing. The canonical API being a struct also makes it a bit easier to load those values in from other sources too.
Annotated structs would definitely cut down on the volume of argument handling code. We also have to figure out how parameters will be provided when this is run by the execution manager, which could be done with command line arguments passed to the Docker container, environment variables, or a config file placed into the container.
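To illustrate the annotated-struct style, a minimal structopt sketch (the subcommand, flag, and environment variable names here are hypothetical):

use structopt::StructOpt;

// Hypothetical subcommand, shown only to illustrate the annotated-struct
// style; env = "..." shows how the same field could be fed from the
// environment when run by the execution manager.
#[derive(Debug, StructOpt)]
enum Facilitator {
    /// Generate a sample ingestion batch.
    GenerateIngestionSample {
        #[structopt(long, env = "AGGREGATION_ID")]
        aggregation_id: String,
        #[structopt(long, default_value = "100")]
        packet_count: u32,
    },
}

fn main() {
    let command = Facilitator::from_args();
    println!("{:?}", command);
}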
We need to move our entire deployment, including all the buckets we use as mailboxes, to GCP for budget reasons. This will entail some protocol changes because we will have to figure out how Apple will authenticate, and what parameters we need to discover from them to be able to configure the ingestion buckets (probably a GCP service account).
Data share processors must provide their ECIES public keys to the mobile device OS owners, and then the ingestors, facilitators and PHAs must exchange public keys. We need to work out an automation-friendly means of doing these key exchanges each time a new facilitator-PHA pair is brought online.
In the original protobuf schema defined in the IDL document, PrioSumPart contained a single bytes value_sum field and a repeated batch_uuid list. The batch_uuid list represents the UUIDs of all of the batches (tracing back to the ingestor) that participated in this aggregation.
At some point in the conversion to Avro, this got flipped, and the current schema now has only a single batch_uuid but an array of sums.
Not only does this not let us properly represent the batch_uuids when we do the final multi-batch reduction sum, the sums array will also always contain only one value (because all aggregations we'll do produce a single output sum per file, both the per-batch sum and the overall sum for a given time range). This seems like an oversight and should be fixed to match the intent of the original schema.
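Sketched as Rust types to make the flip concrete (illustrative only, not the actual schema definitions):

// Intent of the original IDL: many contributing batches, one sum.
struct SumPartIntended {
    batch_uuids: Vec<String>, // every ingestion batch in this aggregation
    value_sum: Vec<u8>,       // the single output sum
}

// What the current Avro schema expresses: flipped.
struct SumPartCurrent {
    batch_uuid: String, // cannot represent a multi-batch reduction
    sums: Vec<Vec<u8>>, // in practice always exactly one element
}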
In order to read and write messages from S3, the facilitator needs to implement an S3Transport alongside its existing FileTransport. Rusoto appears to be the crate of choice for working with AWS APIs.
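One way to slot S3 in beside local files is a shared trait that both transports implement; a sketch with illustrative names (this is not the crate's existing API):

use std::io::{Read, Write};

// Illustrative transport abstraction: FileTransport and an S3Transport
// built on Rusoto would both implement it, keyed by a /-separated path.
trait Transport {
    fn get(&self, key: &str) -> Result<Box<dyn Read>, std::io::Error>;
    fn put(&self, key: &str) -> Result<Box<dyn Write>, std::io::Error>;
}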
We can use this GitHub action to get code coverage reports from test runs in CI.
For simplicity we chose to serve manifest files over HTTPS instead of from live config endpoints. The drawback there is that stale manifests could be cached by, I dunno, proxies or CDNs or ISPs. We should investigate ways to mitigate this.
This eventually will be a Kubernetes service, so the facilitator needs to be dockerized. We will provide a Dockerfile, build docker images in CI, and store them in Dockerhub (https://hub.docker.com/u/letsencrypt).
Persistent data (ingestion batches, validation batches and sum parts) should be purged after some delay has passed to keep a lid on storage costs. Yuri suggested implementing this in the execution manager, since it already is a cronjob that periodically scans the various buckets to dispatch work. We should also carefully devise a retention policy.
Per https://docs.google.com/document/d/1eKIXOVK6W8AsSnoisw26R1rpKtPBf0rnjn9kSSEC3PE/edit?disco=AAAAHJj4NAg, Apple needs the whole cert chain given to them, not just the leaf. The ACME protocol exposes an endpoint for this, so this should be possible, but we have to figure out how to get it from CertMagic.
Complete https://docs.google.com/document/d/1MdfM3QT63ISU70l63bwzTrxr93Z7Tv7EDjLfammzo6Q/edit# and have it approved by stakeholders
We concluded in the design doc that the ingestion servers will use a single batch signing key for all messages, regardless of which facilitator-PHA pair is the recipient. I need to update the design doc and make any corresponding code changes.
The IDL document describes an "invalid UUID" file alongside the sum part emitted by the facilitator. We need to resolve open questions about how to handle these packets at different pipeline stages and how to represent these packets in the intermediate and final product. Final decisions to be recorded in the design document.
Write the tool that uses libprio_rs to construct, validate and aggregate Prio data batches. It should be possible to exercise the end to end pipeline from the command line, with realistic Avro encoded data being emitted at each step.
@yuriks points out that the crate we call facilitator could just as easily be the basis of a PHA server. We should rename it to something like "prio-data-share-processor".
Eventually, both Apple and Google servers will be populating the encryption_key_id field in the PrioDataSharePacket messages they send. For data share processors to handle them generically, we need to agree on what value goes in there, and make sure it's a value that lets the data share processor look up the appropriate packet decryption key. We would have to make sure that whatever value we use is available to both Apple and Google ingestion servers and mobile devices, as necessary.
One proposal is the serial number of the X.509 certificate used to transmit the packet encryption key to Apple.
Enable CI and testing/code cov for the deploy tool
Once we have all participants in the system advertising public keys and other params from keyfiles or manifests, we can automatically rotate keys used to sign messages written into S3 buckets.
@winstrom informs me that Apple's ingestion server will not be populating the encryption_key_id field in PrioDataSharePacket messages at first. This means:
We store the JSON Avro schema files alongside the Rust implementation here in prio-server. @yuriks points out that this means we have to rev the implementation in lockstep with the schema. This would be easier if the schema was pulled in as a versioned dependency, and it would also be better for other projects wanting to consume the schemas.
For example, the permalinks don't work and there are no annotations in the diff: https://github.com/abetterinternet/prio-server/pull/53/checks?check_run_id=1282426037
We should investigate Kubernetes or GCP level resource limits, to mitigate the risk of runaway jobs causing problems.
Reviewers of #2 pointed out we should include an explicit key identifier in headers so that message recipients can gracefully handle key rotations instead of having to try all the keys listed in a peer's manifest.
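Concretely, the header would name the key that produced its signature; a sketch with hypothetical field names:

// Illustrative header with an explicit key identifier, so recipients can
// look up the matching public key in the sender's manifest directly.
struct SignedBatchHeader {
    batch_uuid: String,
    key_identifier: String, // names the batch signing key that was used
    signature: Vec<u8>,     // signature over the rest of the header
}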
I think using Path here is actually not the right call: Path inherently deals with OS-native paths. While that's the natural key for FileTransport, it doesn't really apply for something like S3. If I run this on Windows, for example, I actually still want to keep using / as the separator when I upload to S3, not \. So I think the right way of doing this is to have the key be a generic path value (either just use str with / as the separator, or create your own newtype over it) and then inside FileTransport you can parse it and re-convert to a Path when accessing the filesystem.
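A minimal sketch of the suggested newtype (the name TransportKey is made up here):

use std::path::{Path, PathBuf};

// Transport keys are always /-separated, regardless of host OS; only
// FileTransport converts them into OS-native paths.
struct TransportKey(String);

impl TransportKey {
    fn to_os_path(&self, base: &Path) -> PathBuf {
        let mut path = base.to_path_buf();
        for segment in self.0.split('/') {
            path.push(segment);
        }
        path
    }
}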
Figure out what kind of metrics we want to emit from the server, what conditions to alert on, and where alerts should go. For instance, the facilitator could encounter failures because of bad data emitted by an ingestion server, so perhaps we should figure out how to route such alerts to the other organisation.
While the ingestion share packets are encrypted with the ECIES keys anyway, we should turn on server-side encryption of all bucket contents, ideally with a KMS key, to also protect metadata, validation shares and sum parts. This should be easy to do in Terraform.
https://docs.aws.amazon.com/AmazonS3/latest/dev/bucket-encryption.html
Right now we're using debug builds because they're faster to build. For prod deployments we'll want to do release builds. Docker has passthrough environment variables that could be good for this. Follow-up from #97.
Eventually share processors have to evaluate a sequence of (own validation, peer validation, data share) triples, and must ensure that all three packets have the same UUIDs. Can the share processor assume that packets will appear in the same order in all three sequences of packets? If not, then I have to sort each packet sequence lexicographically by UUID in O(n log n) before I even begin processing triples. If instead we require that both share processors maintain the order of packets emitted by the ingestor, we can skip that step. Further, if we replace the packet UUID with a sequence number, then share processors can easily defend themselves against malformed peer validation files by verifying that sequence numbers increase monotonically as they process validation packets.
The pair (batch_uuid, packet sequence number) remains unique globally.
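With sequence numbers, the defensive check becomes a single linear pass; a sketch (packet parsing elided):

// Returns true if packet sequence numbers strictly increase, which is all
// a share processor needs to verify in order to reject reordered or
// malformed peer validation files without an O(n log n) sort.
fn sequence_numbers_monotonic(seq: impl IntoIterator<Item = u64>) -> bool {
    let mut prev: Option<u64> = None;
    for n in seq {
        if prev.map_or(false, |p| n <= p) {
            return false;
        }
        prev = Some(n);
    }
    true
}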
The design doc describes an execution manager responsible for coordinating the map/reduce steps as Kubernetes jobs. We should write that.
The IDL document contains a semi-formal spec of the ingestion, validation and sum part batches, but it is no longer authoritative, especially since we have changed some things about the signature format in the Avro schema. We should formally document the batch file layouts, including the signatures, in the server design doc.
The implementation of aggregation in #4 performs peer validation share verification and per-batch summing in the final reduce/aggregation step, but those steps could be implemented as a map step run in parallel across the batches. We should revisit the implementation in lib/aggregation.rs and break the aggregate_share method into a separate step.
Once we have settled on a public cloud to use, write a Terraform module that can spin up a facilitator instance.
@yuriks made a great point about handling fewer credentials, and since our environment is mainly built around GCP it makes sense to reduce the dependence on CF as the only DNS provider.