ai-sdc / acro


Tools for the Automatic Checking of Research Outputs. These are tools for researchers to use as drop-in replacements for commands that produce outputs in Stata, Python, and R.

License: MIT License

Python 94.17% R 1.74% Stata 4.09%
data-privacy data-protection privacy privacy-tools statistical-disclosure-control

acro's Introduction


SACRO-ML

A collection of tools and resources for managing the statistical disclosure control of trained machine learning models. For a brief introduction, see Smith et al. (2022).

The sacroml package provides:

  • A variety of privacy attacks for assessing machine learning models.
  • The safemodel package: a suite of open-source wrappers for common machine learning frameworks, including scikit-learn and Keras. It is designed for use by researchers in Trusted Research Environments (TREs) where disclosure control methods must be implemented, and aims to give researchers greater confidence that their models are compliant with disclosure control requirements.

Installation

PyPI package

Install sacroml and manually copy the examples.

To install only the base package, which includes the attacks used for assessing privacy:

$ pip install sacroml

To additionally install the safemodel package:

$ pip install sacroml[safemodel]

Note: macOS users may need to install libomp due to a dependency on XGBoost:

$ brew install libomp

Running

See the examples.

Acknowledgement

This work was funded by UK Research and Innovation under Grant Numbers MC_PC_21033 and MC_PC_23006 as part of Phase 1 of the DARE UK (Data and Analytics Research Environments UK) programme, delivered in partnership with Health Data Research UK (HDR UK) and Administrative Data Research UK (ADR UK). The specific projects were Semi-Automatic checking of Research Outputs (SACRO; MC_PC_23006) and Guidelines and Resources for AI Model Access from TrusTEd Research environments (GRAIMATTER; MC_PC_21033). This project has also been supported by MRC and EPSRC [grant number MR/S010351/1]: PICTURES.

acro's People

Contributors

bloodearnest, dependabot[bot], jim-smith, mahaalbashir, pre-commit-ci[bot], rpreen


acro's Issues

write to excel fails if output is not a table

error message


AttributeError Traceback (most recent call last)
Cell In[26], line 1
----> 1 output = acro.finalise("test_results.xlsx")

File ~/GitHub/AI-SDC/ACRO/acro/acro.py:86, in ACRO.finalise(self, filename)
84 utils.finalise_json(filename, self.results)
85 elif extension == ".xlsx":
---> 86 utils.finalise_excel(filename, self.results)
87 else:
88 raise ValueError("Invalid file extension. Options: {.json, .xlsx}")

File ~/GitHub/AI-SDC/ACRO/acro/utils.py:147, in finalise_excel(filename, results)
145 for table in output["output"]:
146 start = 1 + writer.sheets[output_id].max_row
--> 147 table.to_excel(writer, sheet_name=output_id, startrow=start)

AttributeError: 'str' object has no attribute 'to_excel'

to reproduce

run test.ipynb (the one that uses the charity data) and add a cell at the end with the command:

output = acro.finalise("test_results.xlsx")

cause

in utils.py around line 147:

for table in output["output"]:
    start = 1 + writer.sheets[output_id].max_row
    table.to_excel(writer, sheet_name=output_id, startrow=start)

Fix

The assumption is that every output is a table, which is not always true (e.g. an unsupported output is stored as a string). We need to test the type and then write it out appropriately when it is just a string; see the sketch below.
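
A minimal sketch of the fix for the loop in finalise_excel (assuming, as the traceback suggests, that output["output"] can hold either DataFrames or plain strings):

import pandas as pd

for table in output["output"]:
    start = 1 + writer.sheets[output_id].max_row
    if isinstance(table, pd.DataFrame):
        table.to_excel(writer, sheet_name=output_id, startrow=start)
    else:
        # unsupported outputs arrive as plain strings; wrap one so it can still be written
        pd.DataFrame([str(table)]).to_excel(
            writer, sheet_name=output_id, startrow=start, header=False, index=False
        )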

crosstab function does not behave correctly when passed a list of aggfuncs

e.g. passing aggfunc=['mean', 'std'] works in pd.crosstab but not in acro.crosstab.

However the functionality is there to fix it, just needs:

  1. change the call from get_aggfunc to get_aggfuncs (acro.py, line 164)
  2. Because pandas doesn't write all the aggregates into one cell, but produces separate columns for each, it produces a table with len(aggfuncs) times the normal number of columns.
    So, if aggfuncs is a list, then after you make the change above, it throws an error when trying to apply a mask, because the masks assume the table has only one aggfunc. I think the answer is to allow 'freq' as an aggregation function and then enlarge the masks that way.
    I.e. if a user asks for mean and std, then when the table values are created for masking, use ['freq', x] (for, I think, any valid statistic x) and only look at the first table.shape[1]/2 columns. A sketch of the pandas behaviour is below.
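
A sketch of the pandas behaviour described in point 2 (the data and column names are illustrative):

import pandas as pd

df = pd.DataFrame({
    "year": [2010, 2010, 2011, 2011],
    "grant_type": ["A", "B", "A", "B"],
    "inc_grants": [100.0, 200.0, 150.0, 250.0],
})

# one aggfunc: one column per grant_type
single = pd.crosstab(df["year"], df["grant_type"],
                     values=df["inc_grants"], aggfunc="mean")

# a list of aggfuncs: len(aggfuncs) times the normal number of columns,
# under a column MultiIndex such as ("mean", "A"), ("mean", "B"), ("std", "A"), ("std", "B")
multi = pd.crosstab(df["year"], df["grant_type"],
                    values=df["inc_grants"], aggfunc=["mean", "std"])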

Adding SDC guidance for researchers and checkers

Assuming the availability of SDC guidance such as in this handbook or as is being produced in WP1 (e.g., statistic type, description, requirements, disclosure issues, mitigation, possibly TRE specific rules, etc.) this information needs to be made available to both researchers and TRE checkers. So the question is how best to make this available for ease of access, maintenance, and including ACRO examples.

Should this information be made available on an independent website or included directly within ACRO documentation? Or should ACRO documentation simply link to the relevant part of an external website? How tightly coupled should the guidance be to the ACRO implementation?

Should the checker GUI be able to directly search/display guidance or just provide links?

Currently we attempt to include the whole ACRO command within the JSON output, but we should also include a simplified form: either (a) the function name (e.g., "function": "ols") or (b) a type (e.g., "type": "regression") so that the GUI can use this to decide which SDC guidance to reference for more information.
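
For example, a record might carry both forms (the field names here are only a suggestion):

{
    "command": "results = acro.ols(y, x)",
    "function": "ols",
    "type": "regression"
}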

In JSON output, outcomes are encoded as embedded strings, rather than as JSON objects

Hi there from your friends over at OpenSAFELY

While getting familiar with and testing out ACRO, we noticed the outcome field is encoded as a string representing a JSON object, as opposed to just a JSON object as we would have expected, e.g.:

{
    "output_0_2023-05-17-15422890": {
        "command": "safe_table = acro.crosstab(df.recommend, df.parents)",
        "summary": "fail; threshold: 4 cells suppressed; ",
        "outcome": "{\"great_pret\":{\"not_recom\":\"ok\",\"priority\":\"ok\",\"recommend\":\"threshold; \",\"spec_prior\":\"ok\",\"very_recom\":\"threshold; \"},\"pretentious\":{\"not_recom\":\"ok\",\"priority\":\"ok\",\"recommend\":\"threshold; \",\"spec_prior\":\"ok\",\"very_recom\":\"ok\"},\"usual\":{\"not_recom\":\"ok\",\"priority\":\"ok\",\"recommend\":\"threshold; \",\"spec_prior\":\"ok\",\"very_recom\":\"ok\"}}",
        "output": "/home/wavy/bennett/test-oxcef/outputs/output_0_2023-05-17-15422890.csv",
        "timestamp": "2023-05-17-15422890",
        "comments": "Please let me have this data., 6 cells were supressed in this table"
    }
}

This means that to inspect the JSON outcome field programmatically, we need to do something like:

outcome = json.loads(data["output_0_2023-05-17-15422890"]["outcome"])
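
A sketch of the change on the writer side (the function and variable names here are illustrative, assuming the outcome is available as a dict before serialisation):

import json

def finalise_record(results: dict, output_id: str, outcome_dict: dict) -> str:
    # store the dict itself rather than json.dumps(outcome_dict), so the
    # enclosing serialisation nests it as a real JSON object
    results[output_id]["outcome"] = outcome_dict
    return json.dumps(results, indent=4)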

Is this intended behaviour?

If not, I'd be happy to create a PR to fix, with tests.

digitally sign acro outputs

Very much a 'simple thing that could be done' if TREs are worried that some users might edit acro outputs after we have checked them and said they are OK.

Falls under 'Won't have' in the MoSCoW categorisation for SACRO.
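
A minimal sketch of one approach using the standard library (the function name and key handling are assumptions, not a design):

import hashlib
import hmac
from pathlib import Path

def sign_output(path: str, key: bytes) -> str:
    """Return an HMAC-SHA256 signature over a checked output file."""
    data = Path(path).read_bytes()
    return hmac.new(key, data, hashlib.sha256).hexdigest()

# the TRE records the signature when the output is approved;
# any later edit to the file will no longer match it
signature = sign_output("test_results.xlsx", key=b"tre-held-secret")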

Agree aggregation functions to implement

SPSS list: https://www.ibm.com/support/pages/how-do-i-aggregate-spss

Aggregate functions include:

  • SUM(varlist) Sum across cases.

  • MEAN(varlist) Mean across cases.

  • MEDIAN(varlist) Median across cases.

  • SD(varlist) Standard deviation across cases.

  • MAX(varlist) Maximum value across cases.

  • MIN(varlist) Minimum value across cases.

  • PGT(varlist,value) Percentage of cases greater than the specified value.

  • PLT(varlist,value) Percentage of cases less than the specified value.

  • PIN(varlist,value1,value2) Percentage of cases between value1 and value2, inclusive.

  • POUT(varlist,value1,value2) Percentage of cases not between value1 and value2. Cases where the source variable equals value1 or value2 are not counted.

  • FGT(varlist,value) Fraction of cases greater than the specified value.

  • FLT(varlist,value) Fraction of cases less than the specified value.

  • FIN(varlist,value1,value2) Fraction of cases between value1 and value2, inclusive.

  • FOUT(varlist,value1,value2) Fraction of cases not between value1 and value2. Cases where the source variable equals value1 or value2 are not counted.

  • N(varlist) Weighted number of cases in break group.

  • NU(varlist) Unweighted number of cases in break group.

  • NMISS(varlist) Weighted number of missing cases.

  • NUMISS(varlist) Unweighted number of missing cases.

  • FIRST(varlist) First nonmissing observed value in break group.

  • LAST(varlist) Last nonmissing observed value in break group.
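
A sketch of how some of these might map onto pandas aggregation, for discussion (the mapping choices are assumptions):

import pandas as pd

# direct equivalents that pandas accepts by name
AGGFUNC_MAP = {"SUM": "sum", "MEAN": "mean", "MEDIAN": "median",
               "SD": "std", "MAX": "max", "MIN": "min", "N": "count"}

def nmiss(series: pd.Series) -> int:
    """NMISS: number of missing cases."""
    return int(series.isna().sum())

def fgt(series: pd.Series, value: float) -> float:
    """FGT: fraction of cases greater than the specified value."""
    return float((series > value).mean())

def pgt(series: pd.Series, value: float) -> float:
    """PGT: percentage of cases greater than the specified value."""
    return 100.0 * fgt(series, value)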

Create function to "add text to output"

When a user creates an output, they should be able to add a text description that gets stored and lets them communicate with the output checkers to provide some context for the output.

Maybe easiest to assume that they can call 'list outputs' first to decide which output they are adding to.
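
A hypothetical usage sketch (the method names are assumptions about the eventual API):

# researcher lists outputs first to decide which one to annotate
acro.list_outputs()

# then attaches free-text context for the output checkers
acro.add_comments("output_0", "Weighted by survey year; small cells already suppressed.")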

Providing additional information in ACRO outputs

In addition to providing automated rules-based checking, should ACRO include/highlight further information for the checker (information for which there may not be any rules)? For example, in this handbook, the regressions section includes a "minimum requirement" covering: number of observations, degrees of freedom, variable labels, omitted parameters, and cohort specification.

The ACRO output for regressions currently includes the degrees of freedom in the summary because that's what it checks. But we could easily include in the summary something like "additional information: variable labels = ["bla", "blah"]". Ideally we would automatically extract as much of this information as possible, but we could also just prompt the user to supply it after they run an acro.ols()?

For situations where we don't have/know an appropriate rule but we do know the variables/requirements that matter, if we structured that information in the JSON output (e.g., "dof": 5) and then subsequently the result of the TRE checker was also recorded in a structured way, this might enable the rules to be learnt. In this example, if we knew that dof was important but didn't know that TRE checkers deem a value above/below 10 to be important, we could learn that information?
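
For instance, structured fields plus a recorded checker decision might look like this (the field names are illustrative):

{
    "output_1": {
        "type": "regression",
        "dof": 5,
        "checker_decision": "reject",
        "checker_reason": "degrees of freedom below TRE threshold"
    }
}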

Set up repo

Repo needs setting up with a development branch as default and all the relevant hooks/CI stuff for pre-commit and linting

Create a function that converts the outputs to csv data (to_csv)

  • This function will be used by researchers to save outputs in CSV format
  • Instead of waiting until the end to use the finalise function, this function will give the researcher the option to save outputs in CSV format during the analysis process
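
A hypothetical sketch of the method (the name, signature, and results layout are assumptions):

from pathlib import Path

def to_csv(self, output_id: str, directory: str = ".") -> None:
    """Save a single output's tables to CSV during the analysis session."""
    for i, table in enumerate(self.results[output_id]["output"]):
        table.to_csv(Path(directory) / f"{output_id}_{i}.csv")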

Support for different Python versions

ACRO is currently written for Python 3.10 and 3.11, but it should be possible to support previous versions if we do some testing to check for and modify any >=3.10-only syntax.
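
For example, one >=3.10-only construct that would need rewriting (a representative case, not an audit of the codebase):

# 3.10+ union syntax, which raises a TypeError at definition time on 3.9 and earlier:
def check(value: int | None) -> str:
    return "missing" if value is None else "ok"

# equivalent that also runs on 3.8/3.9:
from typing import Optional

def check(value: Optional[int]) -> str:
    return "missing" if value is None else "ok"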
