
divviup / prio-server


A Prio server implementation.

License: Mozilla Public License 2.0

Rust 53.36% Dockerfile 0.26% Makefile 0.57% HCL 19.94% Go 25.39% Shell 0.49%

prio-server's People

Contributors

aaomidi, bdaehlie, bmw, branlwyd, dependabot[bot], divergentdave, ezekiel, gsquire, hostirosti, jmhodges, jsha, tgeoghegan, winstrom, yuriks

prio-server's Issues

Have ingestion servers specify an `aud` in their OIDC auth tokens when authenticating to S3

The AWS IAM role assumption policy we define for an ingestion server in Google cloud looks like this:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "accounts.google.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "accounts.google.com:sub": "${var.ingestor_google_service_account_id}"
        }
      }
    }
  ]
}

So this federates identity with accounts.google.com and lets service account var.ingestor_google_service_account_id assume the role. @yuriks points out in #51 that we could include an accounts.google.com:oaud condition in the policy. To do that, we'd first have to agree with ingestion server authors on what aud value they would specify when they request auth tokens from Google, making this a protocol issue.
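
As an illustration, the condition block could be extended like this, where the audience value is a placeholder to be agreed upon with the ingestion server authors:

      "Condition": {
        "StringEquals": {
          "accounts.google.com:sub": "${var.ingestor_google_service_account_id}",
          "accounts.google.com:oaud": "<audience value agreed with the ingestor>"
        }
      }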

Implement retries in S3 requests

As a workaround for a mismatch between connection pool timeouts in Hyper and AWS S3, #36 constructs HTTP clients with a carefully chosen timeout value. For more robustness, we should implement retries.
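
A minimal sketch of the kind of wrapper we could put around individual S3 requests; the put_object helper in the usage comment is hypothetical, and a real implementation would likely want exponential backoff and to distinguish retryable from fatal errors:

use std::{thread, time::Duration};

/// Retry `op` up to `max_attempts` times, sleeping `backoff` between attempts.
fn with_retries<T, E, F>(max_attempts: usize, backoff: Duration, mut op: F) -> Result<T, E>
where
    F: FnMut() -> Result<T, E>,
{
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(err) if attempt < max_attempts => {
                // A real implementation would log `err` and back off exponentially.
                let _ = err;
                attempt += 1;
                thread::sleep(backoff);
            }
            Err(err) => return Err(err),
        }
    }
}

// Usage, with `put_object` standing in for a single S3 request:
// let result = with_retries(3, Duration::from_secs(1), || put_object(&client, &key, &body));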

Use private network for GKE cluster

Our GKE cluster is currently configured to use public networks managed by Google/GKE. This means each worker node gets a public IP, and (I think) communication between worker nodes goes over the public internet. While this makes it trivial for our jobs to perform the egress they need (e.g. to AWS or GCP APIs or to fetch peer manifests), this is wasteful (ISRG is committed to environmentally responsible practices, which means reducing, reusing, recycling IPv4 addresses) and could be more secure. We should configure a private network for the GKE cluster and then narrowly control what egress and ingress we permit to worker nodes.
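
A rough sketch of what the Terraform change might look like, assuming the cluster is defined with google_container_cluster (the CIDR below is a placeholder):

resource "google_container_cluster" "cluster" {
  # ... existing cluster configuration ...

  private_cluster_config {
    enable_private_nodes    = true            # worker nodes get internal IPs only
    enable_private_endpoint = false           # keep the control plane reachable for operators
    master_ipv4_cidr_block  = "172.16.0.0/28" # placeholder range
  }
}

Egress to AWS, GCP APIs and peer manifests would then have to go through something like Cloud NAT, where we can control it narrowly.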

Consider adapting facilitator crypto to use reference keys

It may not be practical to decrypt incoming packets with keys held in a product like Amazon KMS, but we might want to do it for the less frequently used Avro message signing key. This would require teaching the facilitator to use remote reference keys in addition to the ring::signature handling it does now.
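
A hypothetical shape for that abstraction, just to illustrate the idea (these names do not exist in the facilitator today):

/// Abstraction over "sign these bytes": the key may be an in-memory
/// ring::signature key pair (as today) or a remote reference key in a KMS.
trait BatchSigner {
    fn sign(&self, message: &[u8]) -> Result<Vec<u8>, String>;
}

/// Remote variant: only a key identifier is held locally; each signature
/// is produced by an API call to the key management service.
struct KmsSigner {
    key_id: String,
}

impl BatchSigner for KmsSigner {
    fn sign(&self, message: &[u8]) -> Result<Vec<u8>, String> {
        // Placeholder: a real implementation would call the KMS signing API
        // with `self.key_id` and `message` and return the signature bytes.
        let _ = (&self.key_id, message);
        Err("KMS signing not implemented in this sketch".to_string())
    }
}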

CI for facilitator

We will use GitHub Actions to build and test the Rust code in prio-server/facilitator and libprio-rs. At minimum this will build and test, emitting code coverage. If it's easy to do, we will also emit x86_64 Linux binaries, but anyone on another platform will have to cargo build for themselves.

Run CI build even when lints return failure

Currently, as implemented in #40, a failure in cargo fmt or clippy immediately aborts the job without running the actual cargo build or tests. This isn't ideal: while fmt or clippy failures should be addressed before a PR is merged, in most cases they wouldn't actually cause compilation errors. Running the build and test steps regardless would save an edit-push-CI cycle, because any compilation or test errors would be surfaced during the first build attempt rather than only after the fmt/lint problems are fixed.

I see two approaches to fixing this:

  • Move clippy (and probably also fmt) to its own dedicated CI job, which would run in parallel with the build job. This uses more CI resources since it's essentially compiling things twice, but makes the two jobs run/fail independently.
  • Modify the workflow definition file so that it continues executing later steps even if the clippy step fails. This complicates the workflow definition, since GitHub doesn't seem to support this use case well: you need to insert explicit if: conditions on every single later build step to instruct them to execute regardless (see the sketch below).
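
A sketch of the second approach, assuming an ordinary GitHub Actions job; the if: success() || failure() expression makes a step run even when an earlier step failed, while still skipping it if the job is cancelled:

steps:
  - uses: actions/checkout@v2
  - name: Lint
    run: cargo fmt --all -- --check && cargo clippy --all-targets
  - name: Build
    if: success() || failure()
    run: cargo build
  - name: Test
    if: success() || failure()
    run: cargo test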

Security assessment

Get a security assessment, review its findings and address the scary ones. Individual issues will be filed for remediation of individual findings.

Discover parameters from peer manifests in workflow manager

For Narnia and the foreseeable future, the workflow manager won't know how to discover peer manifests or validate batch signatures, instead delegating that work to individual facilitator (data plane) jobs. That will make it harder to restrict network egress for those jobs. We can move this work into the workflow manager and have it hand static parameters to facilitator jobs it runs.

Terraform: refactor peer data share processor variables to account for multiple operators

#51 makes the assumption that all peer data share processors are operated by a single organization (i.e., ISRG operates all facilitators, NIH/NCI operates all PHA servers). In particular, the Terraform modules assume a single global manifest is associated with all peer data share processors. This won't hold forever, so we should refactor the representation of peer data share processors to allow for multiple operators and multiple global manifests. This would mean restructuring peer_share_processor_names in the top-level .tfvars to be either a list of (PHA name, global manifest) pairs, or perhaps a map structure like

{
  "operator-name-1" = {
    global_manifest_url = "<url>"
    pha_names           = ["pha-name-1", "pha-name-2", ...]
  }
  "operator-name-2" = {
    global_manifest_url = "<url>"
    pha_names           = [...]
  }
}

...and then create and configure data share processors appropriately.

Finer grained access control for individual jobs

For Narnia and the immediate future past that, we create a single GCP service account and corresponding Kubernetes service account for each data share processor, and then have both the workflow manager and the individual facilitator jobs it dispatches run as that account. We could create service accounts for each individual workflow step and use them to construct more restrictive policies. This would let us deny the workflow manager the ability to delete certain objects, and deny facilitator jobs access to the Kubernetes API.

Deploy tool(s)

Once we have a Terraformized application, we still need tooling around it that can issue keys, construct keyfiles, post them to S3 buckets, get certificates, etc. This tool would fit into a PHA onboarding workflow in which we obtain a minimal set of parameters from new PHAs (e.g., S3 bucket URLs, keyfile location) and let the tool do the rest.

facilitator: revisit argument handling

From #4

  • I think we should probably split each subcommand into a separate binary or at least a separate file. That huge main.rs command parser looks very unwieldy.
  • I think we probably don't want to do all configuration through flags, because there are a lot of knobs already and it seems we'll only have more, but we'll need to figure out a config file schema or what to do instead first.
  • Maybe consider using clap v3 (no stable release yet) or https://github.com/TeXitoi/structopt (which clap v3 is based on) to do the flag parsing. The canonical API being a struct also makes it a bit easier to load those values in from other sources too.

Annotated structs would definitely cut down on the volume of argument handling code. We also have to figure out how parameters will be provided when this is run by the execution manager, which could be done with command line arguments passed to the Docker container, environment variables or a config file placed into the container.
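
For illustration, a structopt-based definition might look like the following; the subcommands and flags here are made up, not the facilitator's actual interface:

use structopt::StructOpt;

#[derive(StructOpt)]
#[structopt(name = "facilitator")]
enum Command {
    /// Generate a sample ingestion batch.
    GenerateIngestionSample {
        #[structopt(long)]
        aggregation_id: String,
        #[structopt(long, default_value = "100")]
        packet_count: u32,
    },
    /// Validate and aggregate batches.
    Aggregate {
        #[structopt(long)]
        instance_name: String,
    },
}

fn main() {
    match Command::from_args() {
        Command::GenerateIngestionSample { aggregation_id, packet_count } => {
            // Dispatch to the corresponding subcommand implementation.
            println!("generate {} packets for {}", packet_count, aggregation_id);
        }
        Command::Aggregate { instance_name } => {
            println!("aggregate for instance {}", instance_name);
        }
    }
}

The doc comments become help text, and the same structs could later be populated from environment variables or a config file layered on top.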

Migrate all ISRG owned storage to GCP cloud storage

We need to move our entire deployment, including all the buckets we use as mailboxes, to GCP for budget reasons. This will entail some protocol changes because we will have to figure out how Apple will authenticate, and what parameters we need to discover from them to be able to configure the ingestion buckets (probably a GCP service account).

Protocol for key exchanges

Data share processors must provide their ECIES public keys to the mobile device OS owners, and then the ingestors, facilitators and PHA must exchange public keys. We need to work out an automation friendly means of doing these key exchanges each time a new facilitator-PHA pair is brought online.

[schema] aggregation-share/sum-part has incorrect cardinality of sum and batch_uuid

In the original protobuf schema defined in the IDL document, PrioSumPart contained a single bytes value_sum field, and a repeated batch_uuid list. The batch_uuid list represents the uuids of all of the batches (tracing back to the ingestor) that participated in this aggregation.

At some point in the conversion to Avro, this got flipped and the current schema now has only a single batch_uuid, but an array of sums.

Not only does this prevent us from properly representing the batch_uuids when we do the final multi-batch reduction sum, but the sums array will also always contain only one value (because every aggregation we do produces a single output sum per file, both the per-batch sum and the overall sum for a given time range). This seems like an oversight and should be fixed to match the intent of the original schema.
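
For illustration, fields matching the original protobuf intent might look like this in the Avro record (names and types here are a sketch and should be checked against the IDL):

{"name": "batch_uuids", "type": {"type": "array", "items": "string"}},
{"name": "value_sum", "type": "bytes"}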

Investigate HTTP caching vs. manifest files

For simplicity we chose to serve server manifest files over HTTPS instead of exposing live config endpoints. The drawback is that stale manifests could be cached by intermediaries such as proxies, CDNs, or ISPs. We should investigate ways to mitigate this.

Purge processed data from storage buckets

Persistent data (ingestion batches, validation batches and sum parts) should be purged after some delay has passed to keep a lid on storage costs. Yuri suggested implementing this in the execution manager, since it already is a cronjob that periodically scans the various buckets to dispatch work. We should also carefully devise a retention policy.
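
If we lean on the storage layer instead of (or in addition to) the execution manager, a bucket lifecycle rule is one option; a sketch in Terraform, with the 14-day age as a placeholder rather than a decided retention policy:

resource "google_storage_bucket" "ingestion" {
  name     = "example-ingestion-bucket"
  location = "US"

  lifecycle_rule {
    action {
      type = "Delete"
    }
    condition {
      age = 14 # days; placeholder pending an agreed retention policy
    }
  }
}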

Protocol for invalid packet handling

The IDL document describes an "invalid UUID" file alongside the sum part emitted by the facilitator. We need to resolve open questions about how to handle these packets at different pipeline stages and how to represent these packets in the intermediate and final product. Final decisions to be recorded in the design document.

Initial facilitator implementation

Write the tool that uses libprio_rs to construct, validate and aggregate Prio data batches. It should be possible to exercise the end to end pipeline from the command line, with realistic Avro encoded data being emitted at each step.

Rename facilitator crate

@yuriks points out that the crate we call facilitator could just as easily be the basis of a PHA server. We should rename it to something like "prio-data-share-processor".

Decide what goes into `encryption_key_id` field on `PrioDataSharePacket`

Eventually, both Apple and Google servers will be populating the encryption_key_id field in the PrioDataSharePacket messages they send. For data share processors to handle them generically, we need to agree on what value goes in there, and make sure it's a value that lets the data share processor look up the appropriate packet decryption key. We would have to make sure that whatever value we use is available to both Apple and Google ingestion servers and mobile devices, as necessary.

One proposal is the serial number of the X.509 certificate used to transmit the packet encryption key to Apple.

Make `encryption_key_id` optional in `PrioDataSharePacket`

@winstrom informs me that Apple's ingestion server will not be populating the encryption_key_id field in PrioDataSharePacket messages at first. This means:

  • We must make that field optional in the Avro schema so it can be safely omitted by Apple (see the sketch after this list)
  • Data share processors will not be able to rely on its presence, meaning that for each packet, they may have to try all the packet decryption keys they have available to them in the k8s secret store
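
On the first point, making a field optional in Avro means declaring it as a union with null and giving it a null default; the field definition would look roughly like:

{"name": "encryption_key_id", "type": ["null", "string"], "default": null}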

Move Avro schema into own repository

We store the JSON Avro schema files alongside the Rust implementation here in prio-server. @yuriks points out that this means we have to rev the implementation in lockstep with the schema. This would be easier if the schema was pulled in as a versioned dependency, and it would also be better for other projects wanting to consume the schemas.

Include key identifier in signed headers

Reviewers of #2 pointed out we should include an explicit key identifier in headers so that message recipients can gracefully handle key rotations instead of having to try all the keys listed in a peer's manifest.

facilitator: std::path::Path is wrong type for Transport::{get, put}

In #4, @yuriks notes:

I think using Path here is actually not the right call: Path inherently deals with OS-native paths. While that's the natural key for FileTransport, it doesn't really apply for something like S3. If I run this on Windows, for example, I actually still want to keep using / as the separator when I upload to S3, not \. So I think the right way of doing this is to have key be a generic path value (can either just use str with / as the separator, or create your own newtype over it) and then inside FileTransport you can parse it and re-convert to a Path when accessing the filesystem.
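
A sketch of what that suggestion could look like, with illustrative method signatures rather than the trait's actual ones:

use std::fs::File;
use std::io::{Read, Write};
use std::path::PathBuf;

/// Keys are portable, `/`-separated strings; each transport maps them
/// onto its own addressing scheme (filesystem path, S3 object key, ...).
trait Transport {
    fn get(&self, key: &str) -> Result<Box<dyn Read>, std::io::Error>;
    fn put(&self, key: &str) -> Result<Box<dyn Write>, std::io::Error>;
}

struct FileTransport {
    base: PathBuf,
}

impl Transport for FileTransport {
    fn get(&self, key: &str) -> Result<Box<dyn Read>, std::io::Error> {
        // Re-interpret the portable key as OS-native path components.
        let path = key.split('/').fold(self.base.clone(), |p, part| p.join(part));
        Ok(Box::new(File::open(path)?))
    }

    fn put(&self, key: &str) -> Result<Box<dyn Write>, std::io::Error> {
        let path = key.split('/').fold(self.base.clone(), |p, part| p.join(part));
        Ok(Box::new(File::create(path)?))
    }
}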

Metrics and alerting

Figure out what kind of metrics we want to emit from the server, what conditions to alert on, and where alerts should go. For instance, the facilitator could encounter failures because of bad data emitted by an ingestion server, so perhaps we should figure out how to route such alerts to the other organisation.

Set up release variants of Dockerfiles

Right now we're using debug builds because they're faster. For prod deployments we'll want to do release builds. Docker has passthrough environment variables that could be good for this. Follow-up from #97.
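
One way to do this with a Docker build argument, as a sketch (the argument name is made up):

# Sketch: switch between debug and release builds at image build time.
FROM rust:latest AS builder
WORKDIR /src
COPY . .
# Pass --build-arg CARGO_BUILD_FLAGS=--release for production images.
ARG CARGO_BUILD_FLAGS=""
RUN cargo build ${CARGO_BUILD_FLAGS}
# Note: output lands in target/debug or target/release depending on the flag,
# which any later COPY stage would need to account for.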

Replace packet UUID with monotonic packet sequence number

Eventually, share processors have to evaluate a sequence of (own validation, peer validation, data share) triples, and must ensure that all three packets in a triple have the same UUID. Can the share processor assume that packets will appear in the same order in all three sequences of packets? If not, then I have to sort each packet sequence lexicographically by UUID in O(n log n) before I even begin processing triples. If instead we require that both share processors maintain the order of packets emitted by the ingestor, we can skip that step. Further, if we replace the packet UUID with a sequence number, then share processors can easily defend themselves against malformed peer validation files by verifying that sequence numbers increase monotonically as they process validation packets.

The pair (batch_uuid, packet sequence number) remains unique globally.

Execution manager

The design doc describes an execution manager responsible for coordinating the map/reduce steps as Kubernetes jobs. We should write that.

Document batch formats in design doc

The IDL document contains a semi-formal spec of the ingestion, validation and sum part batches, but it is no longer authoritative, especially since we have changed some things about the signature format in the Avro schema. We should formally document the batch file layouts, including the signatures, in the server design doc.

Break current aggregation reduce step into validation map step and final sum reduce

The implementation of aggregation in #4 performs peer validation share verification and per-batch summing in the final reduce/aggregation step, but those steps could instead be implemented as a map step run in parallel across the batches. We should revisit the implementation in lib/aggregation.rs and break the aggregate_share method into a separate step.
