tf-encrypted / moose

Secure distributed dataflow framework for encrypted machine learning and data processing

License: Apache License 2.0

Python 16.99% Makefile 0.15% Rust 78.09% Jupyter Notebook 4.15% Dockerfile 0.01% Shell 0.57% Batchfile 0.04%
secure-computation machine-learning privacy cryptography data-science distributed-computing

moose's Issues

Apply isort recursively

isort should be applied with the --recursive flag in the makefile. A first attempt of simply adding the flag caused it to misrecognize certain subdirectories, including compiler. Fixing this might actually be that easy once #25 is completed.

Investigate python version bug in CI

In #59, we found that CI would fail when the new patch of python (3.8.6) was installed. There were two specific problems encountered, and it's unclear what triggered each of them:

  1. the cached venv is invalid for the newer python version (we do not currently include a hash of python version in the cache key)
  2. venv/bin/python was not found, suggesting that there's a broken symlink as a result of the difference in python versions

It was not straightforward to fix, so I left it alone. We can consider fixing this if it becomes enough of a problem.

Runtime and Worker Hardening

  • Assess and mitigate side-channel (timing) attacks on the runtime. Our tensorized approach mitigates this to a significant extent, but we should engage additional expertise on the topic.
  • Implement cryptographic approval of jobs/computations (to verify public keys/identities and prevent unwanted computations from running)

Ben: It is a little vague what we actually want to do here. We know CPP usually has to restart their worker at the beginning of working sessions, but not much more. In the absence of strong deliverables I would suggest we timebox this to a few days and do whatever we can to make the workers more resilient.

Original Clickup: https://app.clickup.com/t/f90mp2

Storage query parameter

Hello again gang. It looks like the storage trait/base class lost its query parameter during the Rust migration of Moose. From the Python code, we can see that there is a query parameter which is a JSON-encoded string.

import abc


class DataStore:
    @abc.abstractmethod
    async def load(self, session_id, key, query):
        pass

    @abc.abstractmethod
    async def save(self, session_id, key, value):
        pass

What we have in Rust looks like this. We no longer have a parameter for the query.

#[async_trait]
pub trait AsyncStorage {
    async fn save(&self, key: &str, val: &Value) -> Result<()>;
    async fn load(&self, key: &str, type_hint: Option<Ty>) -> Result<Value>;
}

My question is, do we want this query to be part of the Rust trait? If so, do we want to continue using a JSON string, or should we leverage Rust's expressive type system to encode a query? My vote would be to use Rust types, but I can see the appeal either way. I also think that, when saving a tensor, we should include a way to optionally specify its column names; that way, a loaded tensor will be more consistent with any other CSV, and we can select specific columns to load.
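To make the trade-off concrete, here is a minimal sketch of what a typed query could look like for the column-selection use case. `Query` and its variants are hypothetical names for discussion, not existing Moose API; the `to_json` helper just shows how such a type could still round-trip to the JSON string the Python storage layer used.

```rust
/// Hypothetical typed storage query, replacing the JSON-encoded string.
#[derive(Debug, Clone, PartialEq)]
pub enum Query {
    /// Load the full value stored under the key.
    All,
    /// Load only the named columns of a tabular value.
    Columns(Vec<String>),
}

impl Query {
    /// Render the query as the JSON-ish string the Python storage
    /// layer used, e.g. for backwards compatibility at the boundary.
    pub fn to_json(&self) -> String {
        match self {
            Query::All => "{}".to_string(),
            Query::Columns(cols) => {
                let quoted: Vec<String> =
                    cols.iter().map(|c| format!("\"{}\"", c)).collect();
                format!("{{\"select\": [{}]}}", quoted.join(", "))
            }
        }
    }
}
```

An enum like this would also make invalid queries unrepresentable, which a free-form JSON string cannot.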

Make Elk available in the docker image

We need access to Elk one way or another. For now we might simply assume that it is available on the system. Concretely, this issue is about making that assumption hold for the docker image.

replicated::mean operation

This issue is about implementing a mean operation for replicated placements. The suggested strategy is to introduce std::mean, fixed::mean, replicated::mean, and ring::mean. Mid-term we might need more operations in place that could allow the unrolling to happen earlier in the layers (not even sure why that would be relevant?), but waiting with this means we don't have to introduce public/clear replicated tensors etc. Note that ring::mean is essentially a fused kernel.

  • std::mean op
  • fixed::mean op
  • replicated::mean op
  • lowering from std::mean -> fixed::mean -> replicated::mean -> fixed::ring_mean
  • fixed::ring_mean kernel immediately calling into rust (fixed_ring_mean)
  • rust code for fixed_ring_mean, calling ring_sum and ring_mul but with custom code for computing the (fixedpoint) denominator
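As a rough illustration of the fused kernel, here is a plaintext sketch of fixed_ring_mean on plain u64 ring values: sum, multiply by the public fixed-point encoding of 1/n, then truncate. There is no secret sharing here, the scaling factor and all names are assumptions, and it ignores the wrap-around handling a real ring implementation must do.

```rust
// Assumed fixed-point scaling factor: values are scaled by 2^F.
const F: u32 = 16;

// Encode/decode non-negative f64 values as fixed-point ring elements.
fn encode(x: f64) -> u64 {
    (x * (1u64 << F) as f64).round() as u64
}

fn decode(x: u64) -> f64 {
    x as f64 / (1u64 << F) as f64
}

/// Plaintext sketch of fixed_ring_mean: ring_sum, then ring_mul with
/// the fixed-point denominator 1/n, then truncation by F bits.
fn fixed_ring_mean(xs: &[u64]) -> u64 {
    let sum: u64 = xs.iter().copied().fold(0u64, u64::wrapping_add); // ring_sum
    let inv_n = encode(1.0 / xs.len() as f64); // public fixed-point 1/n
    sum.wrapping_mul(inv_n) >> F // ring_mul + truncation
}
```

In the replicated setting the truncation step would of course be a protocol of its own rather than a plain shift.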

Reorganize source files under subdirectory

I suggest that we move all source files under a subdirectory, to more clearly separate them from examples and docker files. If we had a name for the runtime it would be natural to use it for the subdirectory, but lacking one I suggest the codename moose. This would leave us with the following directory structure:

  • /docker: everything related to building docker images, currently only for the dev worker
  • /examples: self-contained examples which assume that Moose has already been installed on the system
  • /moose: everything source code related, including setup.py and requirements-dev.txt
  • /main.py: moves into examples, and either becomes an example or works as a template for creating a new example
  • /README.md: stays put

wdyt @jvmncs @yanndupis @rdragos?

Discuss how we distribute Moose

Instead of packaging the runtime independently it might make more sense to essentially consider it as a library from which runtimes can easily be built (kind of like LLVM/MLIR). In particular, projects like TFE and Cape Federated can build small wrappers around the library to each have their own specific runtime with for instance custom operations and kernels. If we indeed go for this then it may make less sense to distribute a pip package, but rather offer only a source code release or a rust crate.

Interface for objects that can be reused across different Placements

Runtime computations currently only reference Python native types (e.g. float/string/int/etc.) across different Placements in the EDSL. We are already seeing the need for ndarrays/tensors and Keras models to be used this way as well, and it's likely that we will continue to add types in the future. For clarity, it would be helpful to have a dedicated interface/abstraction for these, so that there are a well-established set of "traits" that such objects must implement in order to be recognizable & useable by our @computation decorator.

Serialization tests

We don't use gRPC for tests, so this isn't covered by just testing ops for correctness in the TestRuntime.

Discovery on adding support for name scopes to computations

Original Clickup ticket: https://app.clickup.com/t/dx2y33

For graph visualization (and debugging) we would like to have support for name scoping. Name scopes should have no effect at runtime but simply add a bit of superficial meta-structure to computations that can be very useful in certain situations. See for instance https://www.tensorflow.org/api_docs/python/tf/name_scope.
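A minimal sketch of what compile-time name scoping could look like, mirroring tf.name_scope: a stack of scope names that merely prefixes operation names, with no runtime effect. `NameScope` and its methods are illustrative, not existing API.

```rust
/// Hypothetical compile-time name scoping: a stack of scope names
/// that qualifies operation names for visualization/debugging only.
struct NameScope {
    stack: Vec<String>,
}

impl NameScope {
    fn new() -> Self {
        NameScope { stack: Vec::new() }
    }

    /// Enter a scope (e.g. at the start of a layer).
    fn push(&mut self, name: &str) {
        self.stack.push(name.to_string());
    }

    /// Leave the innermost scope.
    fn pop(&mut self) {
        self.stack.pop();
    }

    /// Qualify an op name with the current scope path.
    fn qualify(&self, op_name: &str) -> String {
        if self.stack.is_empty() {
            op_name.to_string()
        } else {
            format!("{}/{}", self.stack.join("/"), op_name)
        }
    }
}
```

Because the scope only touches names, pruning or merging nodes during optimization never invalidates it, unlike the layered-graph alternative discussed below.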

An alternative to name scopes would be layered graphs, where hints are added to the computation to allow upper layers to function as an interpretation of lower layers, including interpretations of higher-level tensors based on lower-level tensors. The tricky part seems to be what to do when we optimize a graph by merging or pruning nodes: we would then have to deal with composite higher-level nodes (e.g. add+add+add) and partial higher-level edges (with some component values not being computed).

Another related concept is that of sub-graphs and/or sub-computations, although these do have an effect at runtime. We should also figure out how relevant name scopes are given support for sub-graphs and sub-computations.

Set up scanning for Rust runtime code

Set up some combination of tools like Github Advanced Security, dependabot, or similar to scan for vulnerabilities in the tf-encrypted code base.

Copying note from other ticket: maybe there is some inspiration via libsodium

Morten:

Here's what I set up already, but waiting for us to move out of the rust root directory: https://github.com/tf-encrypted/runtime/blob/main/.github/workflows/audit.yml

Been wanting to look further to what's done for libsodium as well https://libsodium.gitbook.io/doc/internals#static-analysis

Thor:

That's great! I'm not familiar with any projects that do static analysis on rust code. I've heard of a few thesis projects to implement something but haven't even heard about any runnable results being made available.

Regarding cargo-audit: I'm pretty sure that's also what dependabot uses, so you might be able to get PRs that run tests automatically when there are updates available to remediate problems.

The audit action which is the most reasonable near-term solution can't find the rust code to scan because it's in a nested directory. (Specifically, Cargo.toml is under rust/ rather than at the root of the repo.)

Once the rust worker is live in prod, the rust dir will be promoted to replace the python code at the root of the repository, and this action can be enabled.

Lex:

My previous company open sourced this: https://github.com/sonatype-nexus-community/cargo-pants It accepts a path to Cargo file.

Original Clickup: https://app.clickup.com/t/pd6qgu

Rust Performance Measurements

We should rerun the benchmarks as we migrate to Rust, and also test on a variety of worker setups (different worker sizes, different amounts of memory per worker).

KJ: do you think this is trivial with the benchmarking script you made? As in, will it work automatically with the new runtime (I imagine it will)? If so (or not!), can you add a time estimate here?

Yann: Correct. Having the new runtime won't impact the benchmarking script since it interacts only with pycape (create datasets with different sizes, add dataviews, run a job, etc.). So it will be trivial to re-benchmark. What could be more time-consuming is if we want to benchmark with different machines (we'd need to launch new workers with new configs etc.), but that's not a problem.

Original Clickup: https://app.clickup.com/t/f908fa

Improve error handling

Right now we only have very basic handling of errors during async execution, so efforts (still to be scoped) should be made to improve this for both development and applications.

All expressions on placements should go through the placement's compiler

Per #29, currently only ApplyFunctionExpressions go through the placement's compiler, yet the placement should have a say in any type of expression assigned to it, including potentially rejecting certain expressions that it doesn't support (say, non-arithmetic expressions on the MpspdzPlacement).

Implement ACLs based on identities

The runtime currently neither checks nor uses the identities of connecting peers; it only uses mTLS to ensure that peers have valid certificates. This issue is about figuring out where we want to employ ACLs and implementing them.

The key parts for extracting identity are context.peer() and context.peer_identities(), using the context passed into servicer methods.
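As a starting point for discussion, here is a hypothetical sketch of a deny-by-default ACL keyed on the mTLS-verified peer identity, checked before dispatching a request. `Acl` and its methods are illustrative names, not existing runtime API.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical identity-based ACL layered on top of mTLS:
/// identities come from verified peer certificates, and each identity
/// is explicitly permitted into a set of sessions.
#[derive(Default)]
pub struct Acl {
    allowed: HashMap<String, HashSet<String>>, // identity -> session ids
}

impl Acl {
    /// Grant an identity access to a session.
    pub fn permit(&mut self, identity: &str, session_id: &str) {
        self.allowed
            .entry(identity.to_string())
            .or_default()
            .insert(session_id.to_string());
    }

    /// Deny-by-default check run before handling a request.
    pub fn check(&self, identity: &str, session_id: &str) -> bool {
        self.allowed
            .get(identity)
            .map_or(false, |sessions| sessions.contains(session_id))
    }
}
```

Whether the check belongs in a gRPC interceptor or in each servicer method is part of the "where to employ ACLs" question above.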

Reorganize examples

I suggest we move each of the three current examples into their own subdirectory, and move docker/dev/docker-compose.yaml into one or more of them (adapted to each example). We should also add a small README.md to the examples directory containing the relevant parts of docker/dev/README.md (i.e. including the parts about Docker Compose and excluding the parts about how to build the worker dev image).

Storage: Borrow instead of own

Hi gang,

If we take a look at the following line: https://github.com/tf-encrypted/runtime/blob/7254b6230fdf0af1061e3848e419b2514a5875cc/rust/moose/src/storage.rs#L13

The save and load methods on the storage traits take exclusive ownership of the data passed to them. I propose that instead of taking ownership, we instead borrow the values in these methods.

What do you think?

Problem

Currently, if you write something like this, you get a compiler error because the variable key is owned by the first function that it gets passed to, storage.save.

let expected = Float64Tensor::from(
    array![[1.0, 2.0], [3.0, 4.0]]
        .into_dimensionality::<IxDyn>()
        .unwrap(),
);
let key = "my-object".to_string();
let storage = S3SyncStorage::default();
storage.save(key, Value::from(expected));
let loaded = storage.load(key);

Compiler error message:

error[E0382]: use of moved value: `key`
   --> src/cape/storage.rs:205:35
    |
202 |         let key = "my-object".to_string();
    |             --- move occurs because `key` has type `std::string::String`, which does not implement the `Copy` trait
203 |         let storage = S3SyncStorage::default();
204 |         storage.save(key, Value::from(expected));
    |                      --- value moved here
205 |         let loaded = storage.load(key);
    |                                   ^^^ value used here after move

Solution

pub trait SyncStorage {
    fn save(&self, key: &str, val: &Value) -> Result<()>;
    fn load(&self, key: &str) -> Result<Value>;
}

#[async_trait]
pub trait AsyncStorage {
    async fn save(&self, key: &str, val: &Value) -> Result<()>;
    async fn load(&self, key: &str) -> Result<Value>;
}

Improve installation experience

It could be easier to install. Here are a few suggestions:

  • Have setup.py install from e.g. requirements-dev.txt when a dev install is detected (what could we do there?)
  • Hide protoc instruction behind e.g. make build

Avoid using placements as context managers in tests

A failing test_run_program will currently return the latest result from test_op, perhaps due to the global state used to support placements as context managers.

To avoid global state, all eDSL operations could be improved to take a placement as an optional argument, so that

with plc:
  z = add(x, y)

is really just syntactic sugar for

z = add(x, y, placement=plc)

where the former can be used for convenience and the latter must be used in tests.

Improve resilience

This issue is about going through the runtime code base and gathering a list of tasks that could improve the runtime's resilience, including recovering from errors during computation evaluation.

This includes making sure Moose doesn't silently error when an argument is missing.

Update CI to use dev image

CI is currently using ubuntu-latest, yet as soon as we add support for e.g. MP-SPDZ it would be nice to instead use either the worker docker image or a new dev docker image (which the worker image could potentially be derived from) where all dependencies are already installed.
