statdp's Introduction

StatDP

Statistical Counterexample Detector for Differential Privacy.

Usage

We assume your algorithm implementation has the following signature: (prng, queries, epsilon, ...), i.e., a pseudo-random generator, a list of queries, the privacy budget, and any extra arguments.

Throughout your algorithm, any random number must be generated through the provided generator (i.e., prng) for better scalability with multiple cores. It is an instance of numpy.random.Generator which supports a collection of standard distributions.
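As a concrete illustration of this signature, here is a minimal sketch of a noisy-max style mechanism. The function name noisy_max and the Laplace scale of 2 / epsilon are illustrative assumptions, not something this README prescribes; the only point is that every random draw goes through prng.

```python
import numpy as np


def noisy_max(prng, queries, epsilon):
    """Illustrative mechanism: index of the largest query answer after adding Laplace noise."""
    answers = np.asarray(queries, dtype=np.float64)
    # All randomness comes from the supplied numpy.random.Generator, never from np.random.
    noise = prng.laplace(loc=0.0, scale=2.0 / epsilon, size=len(answers))
    return int(np.argmax(answers + noise))


if __name__ == '__main__':
    # np.random.default_rng returns a numpy.random.Generator, so it can stand in for prng.
    prng = np.random.default_rng(seed=0)
    print(noisy_max(prng, [1, 2, 3, 4, 5], epsilon=0.5))
```

Calling the function directly with a seeded generator, as in the last lines, is a convenient way to debug an algorithm before handing it to the detector.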

Then you can simply call the detection tool with automatic database generation and event selection:

from statdp import detect_counterexample

def your_algorithm(prng, queries, epsilon):
    # your algorithm implementation here; add any extra arguments after epsilon
    # prng must be used instead of np.random
    return prng.laplace(loc=0, scale=1 / epsilon)

if __name__ == '__main__':
    # the algorithm's privacy budget argument (`epsilon`) must be supplied through default_kwargs,
    # otherwise the detector won't work properly since it will try to generate a privacy budget
    # (privacy_budget and test_epsilon are placeholders for your own values)
    result = detect_counterexample(your_algorithm, test_epsilon, default_kwargs={'epsilon': privacy_budget})

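The return value is stored in result as a list of tuples, [(epsilon, p, d1, d2, kwargs, event), (...)], where p is the p-value obtained for the tested epsilon, d1 and d2 are the two databases, kwargs are the algorithm arguments, and event is the selected event.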
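To make the tuple layout concrete, the following sketch unpacks each entry and flags those whose p-value falls below a threshold (in the paper's framing, a small p-value is evidence that the algorithm violates the tested epsilon). The helper name summarize, the 0.05 threshold, and the sample tuples are hypothetical placeholders, not output from a real run; the string used for event is also a stand-in, since the event format is not specified here.

```python
def summarize(result, alpha=0.05):
    """Return (epsilon, p, rejected) triples sorted by epsilon.

    `alpha` is a hypothetical significance threshold chosen for illustration.
    """
    rows = []
    for epsilon, p, d1, d2, kwargs, event in result:
        rows.append((epsilon, p, p < alpha))
    return sorted(rows)


if __name__ == '__main__':
    # Made-up entries in the documented (epsilon, p, d1, d2, kwargs, event) layout.
    fake_result = [
        (0.7, 0.98, [1, 1, 1], [2, 1, 1], {'epsilon': 0.5}, 'placeholder-event'),
        (0.2, 0.001, [1, 1, 1], [2, 1, 1], {'epsilon': 0.5}, 'placeholder-event'),
    ]
    for epsilon, p, rejected in summarize(fake_result):
        print(f'test_epsilon={epsilon}: p={p} rejected={rejected}')
```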

detect_counterexample accepts several extra arguments to customize the process. Check its signature and docstring to see how to use them:

def detect_counterexample(algorithm, test_epsilon, default_kwargs=None, databases=None, num_input=(5, 10),
                          event_iterations=100000, detect_iterations=500000, cores=None, sensitivity=ALL_DIFFER,
                          quiet=False, loglevel=logging.INFO):
    """
    :param algorithm: The algorithm to test for.
    :param test_epsilon: The privacy budget to test for, can either be a number or a tuple/list.
    :param default_kwargs: The default arguments the algorithm needs except the first Queries argument.
    :param databases: The databases to run for detection, optional.
    :param num_input: The length of input to generate, not used if the databases param is specified.
    :param event_iterations: The iterations for event selector to run.
    :param detect_iterations: The iterations for detector to run.
    :param cores: The number of max processes to set for multiprocessing.Pool(), os.cpu_count() is used if None.
    :param sensitivity: The sensitivity setting, all queries can differ by one or just one query can differ by one.
    :param quiet: Do not print progress bar or messages, logs are not affected.
    :param loglevel: The loglevel for logging package.
    :return: [(epsilon, p, d1, d2, kwargs, event)] The epsilon-p pairs along with databases/arguments/selected event.
    """

Install

We recommend installing statdp in a conda virtual environment (or venv if you prefer, the setup is similar):

# we use python 3.8, but 3.6 and above should work fine
conda create -n statdp anaconda python=3.8
conda activate statdp
# install dependencies from conda for best performance
conda install numpy numba matplotlib sympy tqdm coloredlogs pip
# install icc_rt compiler for best performance with numba, this requires using intel's channel
conda install -c intel icc_rt
# install the remaining non-conda dependencies and statdp 
pip install .

Then you can run examples/benchmark.py to reproduce the experiments we conducted in the paper.

Visualizing the results

We recommend the Python library matplotlib for visualizing your results.

The plot_result method in /examples/benchmark.py shows an example of plotting the results.

With it you can generate a figure like the one for iSVT 4 in our paper.

[Figure: iSVT4]
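For readers who want a starting point without opening the benchmark script, here is a minimal plotting sketch. It is not the plot_result method from examples/benchmark.py; the helper names extract_curve and plot_curve and the output file name are assumptions, and the only thing taken from this README is the (epsilon, p, d1, d2, kwargs, event) layout of the result list.

```python
def extract_curve(result):
    """Return test epsilons and their p-values, sorted by epsilon."""
    pairs = sorted((epsilon, p) for epsilon, p, *_ in result)
    epsilons = [epsilon for epsilon, _ in pairs]
    p_values = [p for _, p in pairs]
    return epsilons, p_values


def plot_curve(result, output_file='result.pdf'):
    """Plot p-value against test epsilon and save the figure to `output_file`."""
    # Imported here so that extract_curve works without matplotlib installed.
    import matplotlib
    matplotlib.use('Agg')  # render to a file, no display needed
    import matplotlib.pyplot as plt

    epsilons, p_values = extract_curve(result)
    fig, ax = plt.subplots()
    ax.plot(epsilons, p_values, marker='o')
    ax.set_xlabel('Test epsilon')
    ax.set_ylabel('p-value')
    ax.set_ylim(0.0, 1.0)
    fig.savefig(output_file)
    plt.close(fig)
```

Passing the list returned by detect_counterexample to plot_curve should give one point per tested epsilon.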

Customizing the detection

Our tool is designed to be modular, and its components are fully decoupled. You can write your own input generator or event selector and apply them to the hypothesis test.

In general, the detection process is:

test_epsilon --> generate_databases --((d1, d2, kwargs), ...), epsilon--> select_event --(d1, d2, kwargs, event), epsilon--> hypothesis_test --> (d1, d2, kwargs, event, p-value), epsilon

You can check out the definitions and docstrings of the respective functions to define your own generator or selector. The detect_counterexample function in the statdp.core module is just a shortcut that takes care of the above process for you.

The test_statistics function in the hypotest module can be used universally by all algorithms (it calculates the p-value from the observed statistics). However, you may need to design your own generator or selector for your own algorithm, since our input generator and event selector are designed to work with numerical queries on databases.
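The data flow described in this section can be sketched as a plain function that chains three callables. This is only an illustration of how the stages hand data to each other: the name run_pipeline and the call signatures of the three stand-ins are assumptions and do not match the real signatures in statdp, so consult the docstrings before writing an actual generator or selector.

```python
def run_pipeline(test_epsilons, generate_databases, select_event, hypothesis_test):
    """Chain the three stages; each callable is a user-supplied stand-in."""
    results = []
    for epsilon in test_epsilons:
        # Stage 1: candidate inputs, each a (d1, d2, kwargs) triple.
        candidates = generate_databases()
        # Stage 2: pick one input and an event for this epsilon.
        d1, d2, kwargs, event = select_event(candidates, epsilon)
        # Stage 3: p-value for the chosen input and event.
        p_value = hypothesis_test(d1, d2, kwargs, event, epsilon)
        results.append((epsilon, p_value, d1, d2, kwargs, event))
    return results


if __name__ == '__main__':
    # Toy stand-ins that only demonstrate the data flow; they perform no real testing.
    def toy_generate():
        return [([1, 1, 1], [2, 1, 1], {'epsilon': 0.5})]

    def toy_select(candidates, epsilon):
        d1, d2, kwargs = candidates[0]
        return d1, d2, kwargs, 'placeholder-event'

    def toy_test(d1, d2, kwargs, event, epsilon):
        return 1.0  # constant placeholder, not a real p-value

    print(run_pipeline([0.2, 0.5], toy_generate, toy_select, toy_test))
```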

Citing this work

You are encouraged to cite the following paper if you use this tool for academic research:

@inproceedings{ding2018detecting,
  title={Detecting Violations of Differential Privacy},
  author={Ding, Zeyu and Wang, Yuxin and Wang, Guanhong and Zhang, Danfeng and Kifer, Daniel},
  booktitle={Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security},
  pages={475--489},
  year={2018},
  organization={ACM}
}

License

MIT.

statdp's People

Contributors

yuxincs

statdp's Issues

documentation of event format

I'd like to use the results of StatDP for some downstream task. Unfortunately, the format of the events returned by detect_counterexample is not documented.

How are the 5 different kinds of events according to section 4.3 in the publication represented in this implementation?

Pair dataset generation for histograms is invalid

I may be misunderstanding something, but it seems to me that the candidate data pair generation for histogram queries is not valid.

As mentioned in the paper, only the "one-below" and "one-above" inputs correspond to histograms coming from L_1-adjacent databases. In the code

def generate_databases(algorithm, num_input, default_kwargs):

however, all the other simple input vectors are generated, which do not correspond to histograms from adjacent databases. This might lead to false positives for detecting privacy violations.

Anyways, thanks for this -- the paper is neat.

where did you implement the event_space generation feature?

In your paper, you mentioned different kinds of methods for generating event space. But in the implementation of algorithms, it seems you already know which event is going to violate the differential privacy.

Take histogram for example, your implementation of the algorithm is:
def histogram(queries, epsilon):
    noisy_array = np.asarray(queries, dtype=np.float64) + np.random.laplace(scale=1.0 / epsilon, size=len(queries))
    return noisy_array[0]

How do you know in advance that the quantization of the first element of the noisy vector would cause violation of differential privacy?

Thank you!
