tf-encrypted / moose

Secure distributed dataflow framework for encrypted machine learning and data processing

License: Apache License 2.0

Python 16.99% Makefile 0.15% Rust 78.09% Jupyter Notebook 4.15% Dockerfile 0.01% Shell 0.57% Batchfile 0.04%
secure-computation machine-learning privacy cryptography data-science distributed-computing

moose's Issues

Apply isort recursively

isort should be applied with the --recursive flag in the makefile. A first attempt of simply adding the flag caused it to misrecognize certain subdirectories, including compiler. Fixing this might actually be that easy once #25 is completed.

Investigate python version bug in CI

In #59, we found that CI would fail when the new patch of python (3.8.6) was installed. There were two specific problems encountered, and it's unclear what triggered each of them:

  1. the cached venv is invalid for the newer python version (we do not currently include a hash of python version in the cache key)
  2. venv/bin/python was not found, suggesting that there's a broken symlink as a result of the difference in python versions

It was not straightforward to fix, so I left it alone. We can consider fixing this if it becomes enough of a problem.

Runtime and Worker Hardening

  • Assess and mitigate side-channel (timing) attacks on the runtime. Our tensorized approach mitigates this to a significant extent, but we should engage additional expertise on the topic.
  • Implement cryptographic approval of jobs/computations (to verify public keys/identities and prevent unwanted computations from running)

Ben: It is a little vague what we actually want to do here. We know CPP usually has to restart their worker at the beginning of working sessions, but not much more. In the absence of strong deliverables I would suggest we timebox this to a few days and do whatever we can to make the workers more resilient.

Original Clickup: https://app.clickup.com/t/f90mp2

Storage query parameter

Hello again gang. It looks like the storage trait/base class lost its query parameter during the Rust migration of Moose. From the Python code, we can see that there is a query parameter which is a JSON-encoded string.

import abc


class DataStore:
    @abc.abstractmethod
    async def load(self, session_id, key, query):
        pass

    @abc.abstractmethod
    async def save(self, session_id, key, value):
        pass

What we have in Rust looks like this. We no longer have a parameter for the query.

#[async_trait]
pub trait AsyncStorage {
    async fn save(&self, key: &str, val: &Value) -> Result<()>;
    async fn load(&self, key: &str, type_hint: Option<Ty>) -> Result<Value>;
}

My question is, do we want this query to be part of the Rust trait? If so, do we want to continue using a JSON string, or should we leverage Rust's expressive type system to encode a query? My vote would be to use Rust types, but I can see the appeal either way. I also think that, when saving a tensor, we should include a way to optionally specify its column names; that way, a loaded tensor will be more consistent with any other CSV, and we can select specific columns to load.
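To make the trade-off concrete, here is a minimal sketch of what a typed query could look like for the column-selection use case. `Query` and its variants are hypothetical names for discussion, not existing Moose API; the `to_json` helper just shows how such a type could still round-trip to the JSON string the Python storage layer used.

```rust
/// Hypothetical typed storage query, replacing the JSON-encoded string.
#[derive(Debug, Clone, PartialEq)]
pub enum Query {
    /// Load the full value stored under the key.
    All,
    /// Load only the named columns of a tabular value.
    Columns(Vec<String>),
}

impl Query {
    /// Render the query as the JSON-ish string the Python storage
    /// layer used, e.g. for backwards compatibility at the boundary.
    pub fn to_json(&self) -> String {
        match self {
            Query::All => "{}".to_string(),
            Query::Columns(cols) => {
                let quoted: Vec<String> =
                    cols.iter().map(|c| format!("\"{}\"", c)).collect();
                format!("{{\"select\": [{}]}}", quoted.join(", "))
            }
        }
    }
}
```

An enum like this would also make invalid queries unrepresentable, which a free-form JSON string cannot.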

Make Elk available in the docker image

We need access to Elk one way or another. For now we might simply assume that it is available on the system. Concretely, this issue is about making that assumption hold for the docker image.

replicated::mean operation

This issue is about implementing a mean operation for replicated placements. The suggested strategy is to introduce std::mean, fixed::mean, replicated::mean, and ring::mean. Mid-term we might need more operations in place that could allow the unrolling to happen earlier in the layers (not even sure why that would be relevant?), but waiting with this means we don't have to introduce public/clear replicated tensors etc. Note that ring::mean is essentially a fused kernel.

  • std::mean op
  • fixed::mean op
  • replicated::mean op
  • lowering from std::mean -> fixed::mean -> replicated::mean -> fixed::ring_mean
  • fixed::ring_mean kernel immediately calling into rust (fixed_ring_mean)
  • rust code for fixed_ring_mean, calling ring_sum and ring_mul but with custom code for computing the (fixedpoint) denominator
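As a rough illustration of the fused kernel, here is a plaintext sketch of fixed_ring_mean on plain u64 ring values: sum, multiply by the public fixed-point encoding of 1/n, then truncate. There is no secret sharing here, the scaling factor and all names are assumptions, and it ignores the wrap-around handling a real ring implementation must do.

```rust
// Assumed fixed-point scaling factor: values are scaled by 2^F.
const F: u32 = 16;

// Encode/decode non-negative f64 values as fixed-point ring elements.
fn encode(x: f64) -> u64 {
    (x * (1u64 << F) as f64).round() as u64
}

fn decode(x: u64) -> f64 {
    x as f64 / (1u64 << F) as f64
}

/// Plaintext sketch of fixed_ring_mean: ring_sum, then ring_mul with
/// the fixed-point denominator 1/n, then truncation by F bits.
fn fixed_ring_mean(xs: &[u64]) -> u64 {
    let sum: u64 = xs.iter().copied().fold(0u64, u64::wrapping_add); // ring_sum
    let inv_n = encode(1.0 / xs.len() as f64); // public fixed-point 1/n
    sum.wrapping_mul(inv_n) >> F // ring_mul + truncation
}
```

In the replicated setting the truncation step would of course be a protocol of its own rather than a plain shift.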

Reorganize source files under subdirectory

I suggest that we move all source files under a subdirectory, to more clearly separate them from examples and docker files. If we had a name for the runtime it would be natural to use it for the subdirectory, but lacking one I suggest the codename moose. This would leave us with the following directory structure:

  • /docker: everything related to building docker images, currently only for the dev worker
  • /examples: self-contained examples which assume that Moose has already been installed on the system
  • /moose: everything source code related, including setup.py and requirements-dev.txt
  • /main.py: moves into examples, and either becomes an example or works as a template for creating a new example
  • /README.md: stays put

wdyt @jvmncs @yanndupis @rdragos?

Discuss how we distribute Moose

Instead of packaging the runtime independently it might make more sense to essentially consider it as a library from which runtimes can easily be built (kind of like LLVM/MLIR). In particular, projects like TFE and Cape Federated can build small wrappers around the library to each have their own specific runtime with for instance custom operations and kernels. If we indeed go for this then it may make less sense to distribute a pip package, but rather offer only a source code release or a rust crate.

Interface for objects that can be reused across different Placements

Runtime computations currently only reference Python native types (e.g. float/string/int/etc.) across different Placements in the EDSL. We are already seeing the need for ndarrays/tensors and Keras models to be used this way as well, and it's likely that we will continue to add types in the future. For clarity, it would be helpful to have a dedicated interface/abstraction for these, so that there are a well-established set of "traits" that such objects must implement in order to be recognizable & useable by our @computation decorator.

Serialization tests

We don't use gRPC for tests, so this isn't covered by just testing ops for correctness in the TestRuntime.

Discovery on adding support for name scopes to computations

Original Clickup ticket: https://app.clickup.com/t/dx2y33

For graph visualization (and debugging) we would like to have support for name scoping. Name scopes should have no effect at runtime but simply add a bit of superficial meta-structure to computations that can be very useful in certain situations. See for instance https://www.tensorflow.org/api_docs/python/tf/name_scope.
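A minimal sketch of what compile-time name scoping could look like, mirroring tf.name_scope: a stack of scope names that merely prefixes operation names, with no runtime effect. `NameScope` and its methods are illustrative, not existing API.

```rust
/// Hypothetical compile-time name scoping: a stack of scope names
/// that qualifies operation names for visualization/debugging only.
struct NameScope {
    stack: Vec<String>,
}

impl NameScope {
    fn new() -> Self {
        NameScope { stack: Vec::new() }
    }

    /// Enter a scope (e.g. at the start of a layer).
    fn push(&mut self, name: &str) {
        self.stack.push(name.to_string());
    }

    /// Leave the innermost scope.
    fn pop(&mut self) {
        self.stack.pop();
    }

    /// Qualify an op name with the current scope path.
    fn qualify(&self, op_name: &str) -> String {
        if self.stack.is_empty() {
            op_name.to_string()
        } else {
            format!("{}/{}", self.stack.join("/"), op_name)
        }
    }
}
```

Because the scope only touches names, pruning or merging nodes during optimization never invalidates it, unlike the layered-graph alternative discussed below.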

An alternative to name scopes would be layered graphs, where hints are added to the computation to allow upper layers to function as an interpretation of lower layers, including interpretations of higher-level tensors based on lower-level tensors. The tricky part seems to be what to do when we optimize a graph by merging or pruning nodes: we would then have to deal with composite higher-level nodes (e.g. add+add+add) and partial higher-level edges (with some component values not being computed).

Another related concept is that of sub-graphs and/or sub-computations, although these do have an effect at runtime. We should also figure out how relevant name scopes are given support for sub-graphs and sub-computations.

Set up scanning for Rust runtime code

Set up some combination of tools like Github Advanced Security, dependabot, or similar to scan for vulnerabilities in the tf-encrypted code base.

Copying note from other ticket: maybe there is some inspiration via libsodium

Morten:

Here's what I set up already, but waiting for us to move out of the rust root directory: https://github.com/tf-encrypted/runtime/blob/main/.github/workflows/audit.yml

Been wanting to look further to what's done for libsodium as well https://libsodium.gitbook.io/doc/internals#static-analysis

Thor:

That's great! I'm not familiar with any projects that do static analysis on rust code. I've heard of a few thesis projects to implement something but haven't even heard about any runnable results being made available.

Regarding cargo-audit: I'm pretty sure that's also what dependabot uses, so you might be able to get PRs that run tests automatically when there are updates available to remediate problems.

The audit action which is the most reasonable near-term solution can't find the rust code to scan because it's in a nested directory. (Specifically, Cargo.toml is under rust/ rather than at the root of the repo.)

Once the rust worker is live in prod, the rust dir will be promoted to replace the python code at the root of the repository, and this action can be enabled.

Lex:

My previous company open sourced this: https://github.com/sonatype-nexus-community/cargo-pants It accepts a path to Cargo file.

Original Clickup: https://app.clickup.com/t/pd6qgu

Rust Performance Measurements

We should rerun the benchmarks as we migrate to Rust, and also test on a variety of worker setups (different worker sizes, different amounts of memory per worker).

KJ: do you think this is trivial with the benchmarking script you made? As in, will it work automatically with the new runtime (I imagine it will)? If so (or not!), can you add a time estimate here?

Yann: Correct. Having the new runtime won't impact the benchmarking script since it interacts only with pycape (create datasets with different sizes, add dataviews, run a job, etc.). So it will be trivial to re-benchmark. What could be more time-consuming is if we want to benchmark with different machines (we'd need to launch new workers with new configs etc.), but that's not a problem.

Original Clickup: https://app.clickup.com/t/f908fa

Improve error handling

Right now we only have very basic handling of errors during async execution, so efforts (still to be scoped) should be made to improve this for both development and applications.

All expressions on placements should go through the placement's compiler

Per #29, currently only ApplyFunctionExpressions go through the placement's compiler, yet the placement should have a say in any type of expression assigned to it, including potentially rejecting certain expressions that it doesn't support (say, non-arithmetic expressions on the MpspdzPlacement).

Implement ACLs based on identities

The runtime currently neither checks nor uses the identities of connecting peers; it only uses mTLS to ensure that peers have valid certificates. This issue is about figuring out where we want to employ ACLs and implementing them.

The key parts for extracting identity are context.peer() and context.peer_identities(), using the context passed into servicer methods.
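As a starting point for discussion, here is a hypothetical sketch of a deny-by-default ACL keyed on the mTLS-verified peer identity, checked before dispatching a request. `Acl` and its methods are illustrative names, not existing runtime API.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical identity-based ACL layered on top of mTLS:
/// identities come from verified peer certificates, and each identity
/// is explicitly permitted into a set of sessions.
#[derive(Default)]
pub struct Acl {
    allowed: HashMap<String, HashSet<String>>, // identity -> session ids
}

impl Acl {
    /// Grant an identity access to a session.
    pub fn permit(&mut self, identity: &str, session_id: &str) {
        self.allowed
            .entry(identity.to_string())
            .or_default()
            .insert(session_id.to_string());
    }

    /// Deny-by-default check run before handling a request.
    pub fn check(&self, identity: &str, session_id: &str) -> bool {
        self.allowed
            .get(identity)
            .map_or(false, |sessions| sessions.contains(session_id))
    }
}
```

Whether the check belongs in a gRPC interceptor or in each servicer method is part of the "where to employ ACLs" question above.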

Reorganize examples

I suggest we move each of the three current examples into their own subdirectory, and move docker/dev/docker-compose.yaml into one or more of them (adapted to each example). We should also add a small README.md to the examples directory containing the relevant parts of docker/dev/README.md (i.e. including the parts about Docker Compose and excluding the parts about how to build the worker dev image).

Storage: Borrow instead of own

Hi gang,

If we take a look at the following line: https://github.com/tf-encrypted/runtime/blob/7254b6230fdf0af1061e3848e419b2514a5875cc/rust/moose/src/storage.rs#L13

The save and load methods on the storage traits take exclusive ownership of the data passed to them. I propose that instead of taking ownership, we instead borrow the values in these methods.

What do you think?

Problem

Currently, if you write something like this, you get a compiler error because the variable key is owned by the first function that it gets passed to, storage.save.

let expected = Float64Tensor::from(
    array![[1.0, 2.0], [3.0, 4.0]]
        .into_dimensionality::<IxDyn>()
        .unwrap(),
);
let key = "my-object".to_string();
let storage = S3SyncStorage::default();
storage.save(key, Value::from(expected));
let loaded = storage.load(key);

Compiler error message:

error[E0382]: use of moved value: `key`
   --> src/cape/storage.rs:205:35
    |
202 |         let key = "my-object".to_string();
    |             --- move occurs because `key` has type `std::string::String`, which does not implement the `Copy` trait
203 |         let storage = S3SyncStorage::default();
204 |         storage.save(key, Value::from(expected));
    |                      --- value moved here
205 |         let loaded = storage.load(key);
    |                                   ^^^ value used here after move

Solution

pub trait SyncStorage {
    fn save(&self, key: &str, val: &Value) -> Result<()>;
    fn load(&self, key: &str) -> Result<Value>;
}

#[async_trait]
pub trait AsyncStorage {
    async fn save(&self, key: &str, val: &Value) -> Result<()>;
    async fn load(&self, key: &str) -> Result<Value>;
}

Improve installation experience

It could be easier to install. Here are a few suggestions:

  • Have setup.py install from e.g. requirements-dev.txt when a dev install is detected (what could we do there?)
  • Hide protoc instruction behind e.g. make build

Avoid using placements as context managers in tests

A failing test_run_program will currently return the latest result from test_op, perhaps due to the global state used to support placements as context managers.

To avoid global state, all eDSL operations could be improved to take a placement as an optional argument, so that

with plc:
  z = add(x, y)

is really just syntactic sugar for

z = add(x, y, placement=plc)

where the former can be used for convenience and the latter must be used in tests.

Improve resilience

This issue is about going through the runtime code base and gathering a list of tasks that could improve the runtime's resilience, including recovering from errors during computation evaluation.

This includes making sure Moose doesn't silently error when an argument is missing.

Update CI to use dev image

CI is currently using ubuntu-latest, yet as soon as we add support for e.g. MP-SPDZ it would be nice to instead use either the worker docker image or a new dev docker image (which the worker image could potentially be derived from) where all dependencies are already installed.
