ai-sdc / acro


Tools for the Automatic Checking of Research Outputs. These are tools for researchers to use as drop-in replacements for commands that produce outputs in Stata, Python, and R.

License: MIT License

Python 94.17% R 1.74% Stata 4.09%
data-privacy data-protection privacy privacy-tools statistical-disclosure-control

acro's Introduction


SACRO-ML

A collection of tools and resources for managing the statistical disclosure control of trained machine learning models. For a brief introduction, see Smith et al. (2022).

The sacroml package provides:

  • A variety of privacy attacks for assessing machine learning models.
  • The safemodel package: a suite of open-source wrappers for common machine learning frameworks, including scikit-learn and Keras. It is designed for use by researchers in Trusted Research Environments (TREs) where disclosure control methods must be implemented, and aims to give researchers greater confidence that their models are compliant with disclosure control requirements.

Installation

PyPI package

Install sacroml and manually copy the examples.

To install only the base package, which includes the attacks used for assessing privacy:

$ pip install sacroml

To additionally install the safemodel package:

$ pip install sacroml[safemodel]

Note: macOS users may need to install libomp due to a dependency on XGBoost:

$ brew install libomp

Running

See the examples.

Acknowledgement

This work was funded by UK Research and Innovation under Grant Numbers MC_PC_21033 and MC_PC_23006 as part of Phase 1 of the DARE UK (Data and Analytics Research Environments UK) programme, delivered in partnership with Health Data Research UK (HDR UK) and Administrative Data Research UK (ADR UK). The specific projects were Semi-Automatic checking of Research Outputs (SACRO; MC_PC_23006) and Guidelines and Resources for AI Model Access from TrusTEd Research environments (GRAIMATTER; MC_PC_21033). This project has also been supported by MRC and EPSRC [grant number MR/S010351/1]: PICTURES.

acro's People

Contributors

bloodearnest, dependabot[bot], jim-smith, mahaalbashir, pre-commit-ci[bot], rpreen


acro's Issues

write to excel fails if output is not a table

error message


AttributeError Traceback (most recent call last)
Cell In[26], line 1
----> 1 output = acro.finalise("test_results.xlsx")

File ~/GitHub/AI-SDC/ACRO/acro/acro.py:86, in ACRO.finalise(self, filename)
84 utils.finalise_json(filename, self.results)
85 elif extension == ".xlsx":
---> 86 utils.finalise_excel(filename, self.results)
87 else:
88 raise ValueError("Invalid file extension. Options: {.json, .xlsx}")

File ~/GitHub/AI-SDC/ACRO/acro/utils.py:147, in finalise_excel(filename, results)
145 for table in output["output"]:
146 start = 1 + writer.sheets[output_id].max_row
--> 147 table.to_excel(writer, sheet_name=output_id, startrow=start)

AttributeError: 'str' object has no attribute 'to_excel'

to reproduce

run test.ipynb (the one that uses the charity data) and add a cell at the end with the command:

output = acro.finalise("test_results.xlsx")

cause

in utils.py around line 147:

for table in output["output"]:
    start = 1 + writer.sheets[output_id].max_row
    table.to_excel(writer, sheet_name=output_id, startrow=start)

Fix

The assumption is that every output is a table, which is not always true (e.g. an unsupported output is stored as a string). We need to test the type and then write it out appropriately when it is just a string; see the sketch below.
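
A minimal sketch of the fix for the loop in finalise_excel (assuming, as the traceback suggests, that output["output"] can hold either DataFrames or plain strings):

import pandas as pd

for table in output["output"]:
    start = 1 + writer.sheets[output_id].max_row
    if isinstance(table, pd.DataFrame):
        table.to_excel(writer, sheet_name=output_id, startrow=start)
    else:
        # unsupported outputs arrive as plain strings; wrap one so it can still be written
        pd.DataFrame([str(table)]).to_excel(
            writer, sheet_name=output_id, startrow=start, header=False, index=False
        )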

crosstab function does not behave correctly when passed a list of aggfuncs

e.g. passing aggfunc=['mean', 'std'] works in pd.crosstab but not in acro.crosstab.

However the functionality is there to fix it, just needs:

  1. change the call from get_aggfunc to get_aggfuncs (acro.py, line 164)
  2. Because pandas doesn't write all the aggregates into one cell, but produces separate columns for each, it produces a table with len(aggfuncs) times the normal number of columns.
    So, if aggfuncs is a list, then after you make the change above, it throws an error when trying to apply a mask, because the masks assume the table has only one aggfunc. I think the answer is to allow 'freq' as an aggregation function and then enlarge the masks that way.
    I.e. if a user asks for mean and std, then when the table values are created for masking, use ['freq', x] (for, I think, any valid statistic x) and only look at the first table.shape[1]/2 columns. A sketch of the pandas behaviour is below.
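
A sketch of the pandas behaviour described in point 2 (the data and column names are illustrative):

import pandas as pd

df = pd.DataFrame({
    "year": [2010, 2010, 2011, 2011],
    "grant_type": ["A", "B", "A", "B"],
    "inc_grants": [100.0, 200.0, 150.0, 250.0],
})

# one aggfunc: one column per grant_type
single = pd.crosstab(df["year"], df["grant_type"],
                     values=df["inc_grants"], aggfunc="mean")

# a list of aggfuncs: len(aggfuncs) times the normal number of columns,
# under a column MultiIndex such as ("mean", "A"), ("mean", "B"), ("std", "A"), ("std", "B")
multi = pd.crosstab(df["year"], df["grant_type"],
                    values=df["inc_grants"], aggfunc=["mean", "std"])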

Adding SDC guidance for researchers and checkers

Assuming the availability of SDC guidance such as in this handbook or as is being produced in WP1 (e.g., statistic type, description, requirements, disclosure issues, mitigation, possibly TRE specific rules, etc.) this information needs to be made available to both researchers and TRE checkers. So the question is how best to make this available for ease of access, maintenance, and including ACRO examples.

Should this information be made available on an independent website or included directly within ACRO documentation? Or should ACRO documentation simply link to the relevant part of an external website? How tightly coupled should the guidance be to the ACRO implementation?

Should the checker GUI be able to directly search/display guidance or just provide links?

Currently we attempt to include the whole ACRO command within the JSON output, but we should also include a simplified form: either (a) the function name (e.g., "function": "ols") or (b) a type (e.g., "type": "regression") so that the GUI can use this to decide which SDC guidance to reference for more information.
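
For example, a record might carry both forms (the field names here are only a suggestion):

{
    "command": "results = acro.ols(y, x)",
    "function": "ols",
    "type": "regression"
}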

In JSON output, outcomes are encoded as embedded strings, rather than as JSON objects

Hi there from your friends over at OpenSAFELY

While getting familiar with and testing out ACRO, we noticed the outcome field is encoded as a string representing a JSON object, as opposed to just a JSON object as we would have expected, e.g.:

{
    "output_0_2023-05-17-15422890": {
        "command": "safe_table = acro.crosstab(df.recommend, df.parents)",
        "summary": "fail; threshold: 4 cells suppressed; ",
        "outcome": "{\"great_pret\":{\"not_recom\":\"ok\",\"priority\":\"ok\",\"recommend\":\"threshold; \",\"spec_prior\":\"ok\",\"very_recom\":\"threshold; \"},\"pretentious\":{\"not_recom\":\"ok\",\"priority\":\"ok\",\"recommend\":\"threshold; \",\"spec_prior\":\"ok\",\"very_recom\":\"ok\"},\"usual\":{\"not_recom\":\"ok\",\"priority\":\"ok\",\"recommend\":\"threshold; \",\"spec_prior\":\"ok\",\"very_recom\":\"ok\"}}",
        "output": "/home/wavy/bennett/test-oxcef/outputs/output_0_2023-05-17-15422890.csv",
        "timestamp": "2023-05-17-15422890",
        "comments": "Please let me have this data., 6 cells were supressed in this table"
    }
}

This means that to inspect the JSON outcome field programmatically, we need to do something like:

outcome = json.loads(data["output_0_2023-05-17-15422890"]["outcome"])
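
A sketch of the change on the writer side (the function and variable names here are illustrative, assuming the outcome is available as a dict before serialisation):

import json

def finalise_record(results: dict, output_id: str, outcome_dict: dict) -> str:
    # store the dict itself rather than json.dumps(outcome_dict), so the
    # enclosing serialisation nests it as a real JSON object
    results[output_id]["outcome"] = outcome_dict
    return json.dumps(results, indent=4)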

Is this intended behaviour?

If not, I'd be happy to create a PR to fix, with tests.

digitally sign acro outputs

Very much a 'simple thing that could be done' if TREs are worried that some users might edit acro outputs after we have checked them and said they are OK.

Falls under 'Won't have' in the MoSCoW categorisation for SACRO.
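
A minimal sketch of one approach using the standard library (the function name and key handling are assumptions, not a design):

import hashlib
import hmac
from pathlib import Path

def sign_output(path: str, key: bytes) -> str:
    """Return an HMAC-SHA256 signature over a checked output file."""
    data = Path(path).read_bytes()
    return hmac.new(key, data, hashlib.sha256).hexdigest()

# the TRE records the signature when the output is approved;
# any later edit to the file will no longer match it
signature = sign_output("test_results.xlsx", key=b"tre-held-secret")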

Agree aggregation functions to implement

SPSS list: https://www.ibm.com/support/pages/how-do-i-aggregate-spss

Aggregate functions include:

  • SUM(varlist) Sum across cases.

  • MEAN(varlist) Mean across cases.

  • MEDIAN(varlist) Median across cases.

  • SD(varlist) Standard deviation across cases.

  • MAX(varlist) Maximum value across cases.

  • MIN(varlist) Minimum value across cases.

  • PGT(varlist,value) Percentage of cases greater than the specified value.

  • PLT(varlist,value) Percentage of cases less than the specified value.

  • PIN(varlist,value1,value2) Percentage of cases between value1 and value2, inclusive.

  • POUT(varlist,value1,value2) Percentage of cases not between value1 and value2. Cases where the source variable equals value1 or value2 are not counted.

  • FGT(varlist,value) Fraction of cases greater than the specified value.

  • FLT(varlist,value) Fraction of cases less than the specified value.

  • FIN(varlist,value1,value2) Fraction of cases between value1 and value2, inclusive.

  • FOUT(varlist,value1,value2) Fraction of cases not between value1 and value2. Cases where the source variable equals value1 or value2 are not counted.

  • N(varlist) Weighted number of cases in break group.

  • NU(varlist) Unweighted number of cases in break group.

  • NMISS(varlist) Weighted number of missing cases.

  • NUMISS(varlist) Unweighted number of missing cases.

  • FIRST(varlist) First nonmissing observed value in break group.

  • LAST(varlist) Last nonmissing observed value in break group.
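
A sketch of how some of these might map onto pandas aggregation, for discussion (the mapping choices are assumptions):

import pandas as pd

# direct equivalents that pandas accepts by name
AGGFUNC_MAP = {"SUM": "sum", "MEAN": "mean", "MEDIAN": "median",
               "SD": "std", "MAX": "max", "MIN": "min", "N": "count"}

def nmiss(series: pd.Series) -> int:
    """NMISS: number of missing cases."""
    return int(series.isna().sum())

def fgt(series: pd.Series, value: float) -> float:
    """FGT: fraction of cases greater than the specified value."""
    return float((series > value).mean())

def pgt(series: pd.Series, value: float) -> float:
    """PGT: percentage of cases greater than the specified value."""
    return 100.0 * fgt(series, value)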

Create function to "add text to output"

When a user creates an output, they should be able to add a text description that gets stored and lets them communicate with the output checkers to provide some context for the output.

Maybe easiest to assume that they can call 'list outputs' first to decide which output they are adding to.
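
A hypothetical usage sketch (the method names are assumptions about the eventual API):

# researcher lists outputs first to decide which one to annotate
acro.list_outputs()

# then attaches free-text context for the output checkers
acro.add_comments("output_0", "Weighted by survey year; small cells already suppressed.")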

Providing additional information in ACRO outputs

In addition to providing automated rules-based checking, should ACRO include/highlight further information for the checker (information for which there may not be any rules)? For example, in this handbook, the regressions section includes a "minimum requirement" covering: number of observations, degrees of freedom, variable labels, omitted parameters, and cohort specification.

The ACRO output for regressions currently includes the degrees of freedom in the summary because that's what it checks. But we could easily include in the summary something like "additional information: variable labels = ["bla", "blah"]". Ideally we would automatically extract as much of this information as possible, but we could also just prompt the user to supply it after they run an acro.ols()?

For situations where we don't have/know an appropriate rule but we do know the variables/requirements that matter, if we structured that information in the JSON output (e.g., "dof": 5) and then subsequently the result of the TRE checker was also recorded in a structured way, this might enable the rules to be learnt. In this example, if we knew that dof was important but didn't know that TRE checkers deem a value above/below 10 to be important, we could learn that information?
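
For instance, structured fields plus a recorded checker decision might look like this (the field names are illustrative):

{
    "output_1": {
        "type": "regression",
        "dof": 5,
        "checker_decision": "reject",
        "checker_reason": "degrees of freedom below TRE threshold"
    }
}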

Set up repo

Repo needs setting up with a development branch as default and all the relevant hooks/CI stuff for pre-commit and linting

Create a function that converts the outputs to csv data (to_csv)

  • This function will be used by researchers to save outputs in CSV format
  • Instead of waiting until the end to use the finalise function, this function will give the researcher the option to save outputs in CSV format during the analysis process
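
A hypothetical sketch of the method (the name, signature, and results layout are assumptions):

from pathlib import Path

def to_csv(self, output_id: str, directory: str = ".") -> None:
    """Save a single output's tables to CSV during the analysis session."""
    for i, table in enumerate(self.results[output_id]["output"]):
        table.to_csv(Path(directory) / f"{output_id}_{i}.csv")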

Support for different Python versions

ACRO is currently written for Python 3.10 and 3.11, but it should be possible to support previous versions if we do some testing to check for and modify any >=3.10-only syntax.
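
For example, one >=3.10-only construct that would need rewriting (a representative case, not an audit of the codebase):

# 3.10+ union syntax, which raises a TypeError at definition time on 3.9 and earlier:
def check(value: int | None) -> str:
    return "missing" if value is None else "ok"

# equivalent that also runs on 3.8/3.9:
from typing import Optional

def check(value: Optional[int]) -> str:
    return "missing" if value is None else "ok"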
