Comments (6)
The IDL document names files like `{aggregation_prefix}/YYmmddHHMM-YYmmddHHMM.invalid_uuid_{n}.avro` to be emitted alongside the sum parts by the facilitator. The format of this file is TBD, but I took a stab at specifying it in #2 (short version: it's a list of UUIDs). In that PR, @tlepoint suggested "to add a reason for rejection, for example `INVALID_CIPHERTEXT` or `INVALID_PROOF`." That seems very reasonable to me, but there are a number of places in the pipeline that could fail, all the way from ingestion through sum part construction. I think we need to keep track of invalid packets the whole way through, from ingestor to final sum part construction.
I'm going to run through the pipeline stages I see and try to enumerate error cases so we can agree what to do about them. For each failure case I identify, I have noted how I think it should be handled. The heuristic I'm using for classification is that errors that can be resolved with a software fix to a single component should halt batch processing so we can deploy a fix and try again (e.g., the facilitator is rejecting valid Avro messages because of a typo in facilitator code). Errors caused by an individual packet being malformed for any reason, on the other hand, should not block processing of the rest of the batch, and will be recorded in an "invalid packets" list that moves through the pipeline with the good data.
I will probably end up using "validation" and "verification" interchangeably below, for which I apologize. If someone can make an argument for using one or the other word consistently everywhere, I am all ears.
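The halt-versus-record heuristic can be sketched roughly as follows (an illustration only; all names and structures here are mine, not the actual facilitator code):

```python
# Sketch of the halt-vs-record heuristic; names are illustrative,
# not taken from the actual facilitator implementation.
from enum import Enum, auto

class RejectionReason(Enum):
    """Per-packet failures: record the packet and keep processing."""
    INVALID_PACKET = auto()
    INVALID_CIPHERTEXT = auto()
    INVALID_PARAMETERS = auto()
    MISSING_PEER_VALIDATION = auto()
    INVALID_PROOF = auto()

class BatchError(Exception):
    """Batch-level failures (I/O error, malformed header, bad signature):
    halt processing so a fix can be deployed and the batch retried."""

def process_batch(packets, validate):
    """validate(packet) returns None on success or a RejectionReason.
    Returns the list of (uuid, reason) pairs for rejected packets."""
    invalid = []
    for packet in packets:
        reason = validate(packet)
        if reason is not None:
            # Malformed individual packet: record it and move on.
            invalid.append((packet["uuid"], reason))
    return invalid
```

Batch-level problems would instead raise `BatchError` out of `validate`, aborting the whole run rather than landing in the invalid-packets list.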
Ingestion
Per Apple, if anything goes wrong during ingestion, the relevant packet or batch will be discarded, so there's nothing for data share processors to do.
PHA/Facilitator intake (i.e. generation of validation share)
- I/O errors (file not found, short reads, network failures, etc.): stop processing the batch, retry later.
- Malformed ingestion header (including bad signature): stop processing the batch, alert humans.
- Malformed individual packet (bad encoding): record bad packet with `INVALID_PACKET`, move on.
- Individual packet cannot be decrypted: record bad packet with `INVALID_CIPHERTEXT`, move on.
- Individual packet with bad value of a parameter like `r_pit`: record bad packet with `INVALID_PARAMETERS`, move on.
To enable the intake step to indicate failures to subsequent steps, the `PrioValidityPacket` Avro structure would be changed to contain a union over the triple (`f_r`, `g_r`, `h_r`) and a rejection reason. Keeping the invalid packets inline with the list of valid ones makes it easier to resolve the (ingestion packet, own validation packet, peer validation packet) triples during the aggregation step.
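The changed schema might look something like this (a sketch only; the field names and types beyond those mentioned above are assumptions, not the actual prio-server schema):

```json
{
  "type": "record",
  "name": "PrioValidityPacket",
  "fields": [
    {"name": "uuid", "type": {"type": "string", "logicalType": "uuid"}},
    {
      "name": "contents",
      "type": [
        {
          "type": "record",
          "name": "ValidationShare",
          "fields": [
            {"name": "f_r", "type": "long"},
            {"name": "g_r", "type": "long"},
            {"name": "h_r", "type": "long"}
          ]
        },
        {
          "type": "enum",
          "name": "RejectionReason",
          "symbols": ["INVALID_PACKET", "INVALID_CIPHERTEXT", "INVALID_PARAMETERS"]
        }
      ]
    }
  ]
}
```

The `contents` union carries either a validation share or a rejection reason, so invalid packets stay inline with valid ones in the same file.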
PHA/Facilitator aggregation
- I/O errors (file not found, short reads, network failures, etc.): stop processing the batch, retry later.
- Malformed ingestion header (including bad signature): stop processing the batch, alert humans.
- Mismatch between parameters in validation or ingestion headers (i.e., inconsistent batch ID, name, bins, epsilon, prime, number of servers, or Hamming weight): stop processing the batch, alert humans.
- Mismatch in packet count between validation batches (e.g., facilitator ingestion batch is 100 packets, facilitator emits 100 validation packets, but PHA only emits 50 validation packets): validate packets present in both validation batches, record missing ones as bad packets with `MISSING_PEER_VALIDATION`.
- Verification of individual ingestion packet against verification shares fails: record bad packet with `INVALID_PROOF`, move on.
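Resolving the triples and recording missing peer validations might look like this (a sketch under assumed data structures, not the actual aggregation code):

```python
# Sketch of matching ingestion packets against own and peer validation
# batches during aggregation; names and shapes are illustrative only.
def match_validations(ingestion_uuids, own, peer):
    """own/peer map packet UUID -> validation share.
    Returns (verifiable_uuids, invalid), where invalid pairs each
    missing packet's UUID with a rejection reason."""
    invalid = []
    verifiable = []
    for uuid in ingestion_uuids:
        if uuid not in own or uuid not in peer:
            # A validation batch is missing this packet: record it
            # and keep aggregating the rest of the batch.
            invalid.append((uuid, "MISSING_PEER_VALIDATION"))
        else:
            verifiable.append(uuid)
    return verifiable, invalid
```

Packets in `verifiable` then go through proof verification, which can still reject them individually with `INVALID_PROOF`.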
The `invalid_packet` Avro structure would be augmented to contain a `rejection_reason` field. It also needs a `batch_uuid` field: since the aggregation step sums over multiple batches, the packet's UUID is not sufficient (so the list of UUIDs I used in #2 is already wrong).
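A sketch of the augmented record (the record name and field types are assumptions; only `rejection_reason`, `batch_uuid`, and the reason symbols come from the discussion above):

```json
{
  "type": "record",
  "name": "InvalidPacket",
  "fields": [
    {"name": "uuid", "type": {"type": "string", "logicalType": "uuid"}},
    {"name": "batch_uuid", "type": {"type": "string", "logicalType": "uuid"}},
    {
      "name": "rejection_reason",
      "type": {
        "type": "enum",
        "name": "RejectionReason",
        "symbols": ["INVALID_PARAMETERS", "INVALID_CIPHERTEXT", "INVALID_PACKET",
                    "MISSING_PEER_VALIDATION", "INVALID_PROOF"]
      }
    }
  ]
}
```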
The file emitted during the aggregation step will be named `{aggregation_prefix}/YYmmddHHMM-YYmmddHHMM.invalid_packets_{n}.avro`, since it contains more than just UUIDs now.
We end up with this enumeration of packet rejection reasons, which may appear alongside the sum part sent by the facilitator to the PHA:

- `INVALID_PARAMETERS`
- `INVALID_CIPHERTEXT`
- `INVALID_PACKET`
- `MISSING_PEER_VALIDATION`
- `INVALID_PROOF`
from prio-server.
One other question: if we encounter zero invalid packets going through the whole pipeline, what should the facilitator emit in the `invalid_packets_{n}.avro` file? An empty file? An Avro file containing an empty list?
Cross-posting helpful insights from a colleague from email:
> I'm not sure what the protocol between devices and ingestion servers look like, but are there any failure cases where an individual packet could be rejected but the overall batch can continue? If so, should those failures be reported to the next stage (PHA and facilitator servers) to be rolled forward into invalid packet files?
There are reasons for rejecting, but those rejected packets will not be added to the batch, so there is no need to forward them to the list of invalid packets.
> One other question: if we encounter zero invalid packets going through the whole pipeline, what should the facilitator emit in the invalid_packets_{n}.avro file? An empty file? An Avro file containing an empty list?
An Avro file with an empty list feels like the right answer. The other options mean extra code to distinguish the empty case and do something special with it.
I think our colleague is right on both counts and plan to adopt these recommendations as part of closing this ticket.
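The empty-list choice means consumers need no special case at all, as a small sketch illustrates (names are illustrative, not from prio-server):

```python
# With an Avro file containing an empty list, a consumer's loop simply
# runs zero times; no code is needed to detect "no invalid packets".
def summarize_invalid(invalid_packets):
    """Count rejected packets by reason; illustrative only."""
    counts = {}
    for _uuid, reason in invalid_packets:
        counts[reason] = counts.get(reason, 0) + 1
    return counts
```

An empty file, by contrast, would force every reader to handle a "file exists but is not valid Avro" branch.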
This won't make it for the integration test, punt.
We decided we would onboard the first PHA without this.
The system has been in operation for a year-ish and we haven't ever felt a need to gather and expose this kind of information, so I am closing this as not to be fixed.