
synthesized-io / fairlens

Identify bias and measure fairness of your data

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%
Topics: bias, data, data-analysis, data-science, fairness, pandas, python, statistics

fairlens's People

Contributors

bogdansurdu, dependabot[bot], hilly12, pre-commit-ci[bot], rob-tay, simonhkswan, tonbadal


fairlens's Issues

Simple statistical comparisons of sensitive groups

Is your feature request related to a problem? Please describe.

Biases are currently identified by comparing the distributions of a target variable across sensitive groups, and this comparison is achieved by calculating a statistical distance metric between the distributions, e.g. the earth mover's distance. Although this method can identify differences between target distributions, the numeric result is hard to interpret and it isn't clear what the difference actually is.

To provide further interpretation of a potential bias, I think it would be useful to be able to calculate simpler statistical measures (e.g. central moments such as the variance) of the target variable distributions and compare them across sensitive groups. Statistical tests can then be performed to determine the significance of these differences.

Additionally, it may be useful to perform and report different hypothesis tests that compare the distributions. For example, the Brunner-Munzel test may be appropriate here and more powerful at detecting differences between the distributions.

Describe the solution you'd like
New metrics that can be used to calculate and show simpler statistical properties of target variable distributions, plus corresponding hypothesis tests. Implementations of non-parametric hypothesis tests for comparing distributions and discovering significant biases (e.g. Brunner-Munzel). These metrics and test p-values can then be reported in the FairnessScorer.
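A minimal sketch of such a comparison, assuming a DataFrame with a sensitive column and a numeric target column; scipy.stats.brunnermunzel is SciPy's implementation of the Brunner-Munzel test, while the function name and column arguments here are placeholders:

import pandas as pd
from scipy.stats import brunnermunzel

def compare_groups(df: pd.DataFrame, sensitive_attr: str, target_attr: str) -> pd.DataFrame:
    """Compare simple statistics of the target across sensitive groups and
    test each group against the rest of the population."""
    rows = []
    for group, values in df.groupby(sensitive_attr)[target_attr]:
        rest = df.loc[df[sensitive_attr] != group, target_attr]
        # Non-parametric test for stochastic equality of the two samples
        _, p_value = brunnermunzel(values, rest)
        rows.append({
            "group": group,
            "mean": values.mean(),
            "variance": values.var(),
            "skew": values.skew(),
            "p_value": p_value,
        })
    return pd.DataFrame(rows)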

Describe alternatives you've considered

Additional context

Add version number to docs

Is your feature request related to a problem? Please describe.
There's no indication of which fairlens version the docs were built from. Read the Docs only shows "latest", which isn't very informative.

Describe the solution you'd like
Add the fairlens version to the docs index
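A common Sphinx pattern for this, sketched under the assumption that fairlens exposes a __version__ attribute:

# docs/conf.py
import fairlens

release = fairlens.__version__                 # full version, e.g. "0.1.0"
version = ".".join(release.split(".")[:2])     # short X.Y shown in the docs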

Describe alternatives you've considered

Additional context

Mitigate biases that are detected in datasets.

Is your feature request related to a problem? Please describe.
Once a bias has been measured in a dataset, it would be nice to still be able to use the dataset without having to worry about the bias.

Describe the solution you'd like

  • For a given metric, provide a way to improve the measured bias in a dataset (see the sketch below).
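One hypothetical mitigation along these lines, purely as a sketch: resample the data so that no sensitive group dominates, here by downsampling every group to the size of the smallest one.

import pandas as pd

def balance_groups(df: pd.DataFrame, sensitive_attr: str, seed: int = 0) -> pd.DataFrame:
    # Downsample each sensitive group to the size of the smallest group
    min_count = df[sensitive_attr].value_counts().min()
    return (
        df.groupby(sensitive_attr, group_keys=False)
          .apply(lambda g: g.sample(min_count, random_state=seed))
          .reset_index(drop=True)
    )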

Describe alternatives you've considered

Additional context

Bottleneck in deep search

The fine-grained deep search in sensitive.detection._deep_search performs poorly on large datasets and bottlenecks sensitive attribute detection in the fairness scorer. Currently, we compute the str_distance between every value in the dataset and the expected values for each sensitive attribute. Only considering the unique values would speed this up considerably. Additionally, we may want to limit the deep search to a sample of the unique values in large datasets, since we don't necessarily need to check all of them.
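A sketch of the proposed speed-up; str_similarity stands in for the actual string distance used in _deep_search (difflib's SequenceMatcher happens to implement the same Ratcliff-Obershelp approach), and the alias list and thresholds are placeholders:

import difflib
import pandas as pd

def str_similarity(a: str, b: str) -> float:
    # difflib's SequenceMatcher implements the Ratcliff-Obershelp algorithm
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def matches_sensitive(series: pd.Series, aliases: list, threshold: float = 0.75,
                      max_unique: int = 1000) -> bool:
    # Compare only the unique values rather than every row in the column ...
    unique_values = pd.Series(series.dropna().unique())
    # ... and only a sample of them for very high-cardinality columns.
    if len(unique_values) > max_unique:
        unique_values = unique_values.sample(max_unique, random_state=0)
    return any(str_similarity(str(v), alias) > threshold
               for v in unique_values for alias in aliases)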

Roadmap

I've made a start on creating a roadmap in the projects page.

Publishing

I believe the aim here is to publish the package onto pypi. @rob-tay do you know much about how to do that?

Brand Assets

Visual assets needed for fairlens:

  • documentation logo
  • readme logo
  • example screenshots

Add user guides

The docs need detailed user guides for bias measurement, sensitive attribute detection and visualization. Worth looking into using the IPython extension for Sphinx.
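For reference, enabling the IPython directive is a small conf.py change; the extension names below are the standard ones shipped with IPython, though the exact setup for our docs is still to be decided:

# docs/conf.py
extensions = [
    "sphinx.ext.autodoc",
    "IPython.sphinxext.ipython_console_highlighting",
    "IPython.sphinxext.ipython_directive",
]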

COMPAS dataset needs to be separated

ProPublica's website indicates that the repeats in the dataset are due to people receiving different COMPAS assessments, i.e. one for Risk of Violence, one for Risk of Recidivism, and one for Risk of Failure to Appear. It seems that different algorithms are used by COMPAS for each of these cases. It therefore makes sense to split the dataset into separate parts, i.e. one for each type of assessment.
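A sketch of the split, assuming the dataset has a column identifying the assessment type (the column name "assessment_type" here is an assumption, not the actual COMPAS schema):

import pandas as pd

df = pd.read_csv("compas.csv")
# One DataFrame per assessment type, e.g. "Risk of Recidivism", "Risk of Violence"
assessments = {
    name: group.reset_index(drop=True)
    for name, group in df.groupby("assessment_type")
}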

[ISSUE] Templates for Pull Requests

Is there an existing issue for this?

  • I have searched the existing issues

Issue Type

Improvement

Describe the issue

At the moment, Fairlens does not have a general template for pull requests and would benefit from a standardized format for them.

Describe a solution you would like

Add a subfolder PULL_REQUEST_TEMPLATE in .github containing markdown files with the different sections and required information or context for a new Pull Request.

Describe alternatives you have considered

It might be possible to use YAML files as with the issue forms, but other repositories do not seem to use them and this approach isn't documented by GitHub either.

Additional context

No response

Test Issue

Test Description

import numpy as np

x = np.array([0, 1, 1, 3])

Review current fairness measurement packages.

We'd like to provide a comprehensive solution that stacks up well against similar packages, so it's important to know what the current solutions in this space are. This issue will be resolved once we have a comprehensive Confluence page that gives a good idea of the current space of fairness packages. In particular, we're interested in what measures of fairness they use and the motivation behind those choices.

Report generation in fairness scorer

The fairness scorer should be able to produce a report which aggregates the demographic score, hidden correlations, and any other metrics. Would be useful to have a single value representing the bias of the dataset.
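One possible aggregation into a single value, purely as a sketch: a count-weighted mean of the per-demographic distances (the "Distance" and "Counts" column names are assumptions about the score table's layout).

import pandas as pd

def aggregate_score(score_df: pd.DataFrame) -> float:
    # Weight each demographic's distance by its sample count
    return (score_df["Distance"] * score_df["Counts"]).sum() / score_df["Counts"].sum()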

Implement fairness scorer

The fairness scorer is a module which combines the features of the bias and sensitive packages to generate a report, figures, etc., analysing potential biases in the dataset.

Add p value testing

Need to add methods to compute the p-value for each metric using bootstrapping, permutation tests, etc. These should live in a separate module and use a wrapper.
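A sketch of what a permutation-test wrapper could look like; the metric signature is an assumption:

import numpy as np

def permutation_p_value(metric, x: np.ndarray, y: np.ndarray,
                        n_perm: int = 1000, seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    observed = metric(x, y)
    pooled = np.concatenate([x, y])
    count = 0
    for _ in range(n_perm):
        # Shuffle the pooled samples and re-split into two groups
        rng.shuffle(pooled)
        if metric(pooled[:len(x)], pooled[len(x):]) >= observed:
            count += 1
    # Add-one smoothing so the p-value is never exactly zero
    return (count + 1) / (n_perm + 1)

For example, permutation_p_value(lambda a, b: abs(a.mean() - b.mean()), x, y) would test a difference in means.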

Refactor bias metrics

Decide on whether to use abstract classes or functions for distance metrics and refactor accordingly.

Updated README and documentation

What things do we have left to do here?

Updated:

  • Write 2-3 short tutorials based on the COMPAS, German Credit, Adult, or LSAC datasets.
  • Include a fairness scorer use case in the README.
  • Polish the overview and quickstart.
  • Include contribution guides in the docs.

Plot scaling

Is your feature request related to a problem? Please describe.
I think the y-axis of distribution plots needs to be scaled similarly for data from the same column.

[screenshots: two distribution plots of the same column with inconsistent y-axis scales]

Describe the solution you'd like
We could set the upper y limit of the plot to some constant times the maximum value in the target column.

plt.ylim(0, df[target_attr].max() * 1.2)

Integrating insight into fairlens

Discussed in #108

We have to sync fairlens and insight and start using insight's methods in fairlens. The desired structure for metrics in fairlens is the following:
[diagram: desired structure for metrics in fairlens]

Pairwise distance computation in the fairness scorer

Is there an existing issue for this?

  • I have searched the existing issues

Is your feature request related to a problem?

At the moment, the fairness scorer compares the distribution of a variable in a sensitive sub-group to the overall distribution. This works well for symmetric statistical distance metrics, but it would be useful to have a way of using asymmetric metrics, such as disparate impact, to produce a similar table that instead compares distributions of different pairings of subgroups.

Describe the solution you would like

A method similar to fairlens.FairnessScorer.distribution_score, which iterates through pairs of the subgroups instead.
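A sketch of the pairwise iteration; metric stands in for any (possibly asymmetric) statistical distance such as disparate impact:

import itertools
import pandas as pd

def pairwise_scores(df: pd.DataFrame, sensitive_attr: str, target_attr: str,
                    metric) -> pd.DataFrame:
    rows = []
    groups = df[sensitive_attr].dropna().unique()
    # Use permutations rather than combinations since the metric may be
    # asymmetric, i.e. metric(a, b) != metric(b, a)
    for a, b in itertools.permutations(groups, 2):
        dist = metric(df.loc[df[sensitive_attr] == a, target_attr],
                      df.loc[df[sensitive_attr] == b, target_attr])
        rows.append({"group_a": a, "group_b": b, "distance": dist})
    return pd.DataFrame(rows)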

Setting up CI + packaging

We need to automate the building and packaging of fairlens so that v1.0 can be built when we open source. I suppose we'll want to publish on PyPI too. We should also have a think about future release goals.

  • Linting
  • Packaging
  • #64
  • Build Docs on main #74 #83

Heatmap of interactions between sensitive and non-sensitive columns

Datasets often have proxies for sensitive attributes, i.e. non-sensitive columns highly correlated with sensitive columns. The fairness scorer should be able to detect these and plot 2D correlation heat maps for all pairs of columns. This would require integrating some of the work on correlations done in the sensitive package with the fairness scorer, and adding additional correlation metrics to account for all types of columns.
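A sketch of the plotting side for numeric columns only (categorical columns would need a different association measure, which is the additional work mentioned above):

import matplotlib.pyplot as plt
import seaborn as sns

def correlation_heatmap(df, sensitive_cols, other_cols):
    # Pearson correlations between each sensitive and non-sensitive column
    corr = df[sensitive_cols + other_cols].corr().loc[sensitive_cols, other_cols]
    sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
    plt.show()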

Script to walk files and generate docs using the automodule directive

Some popular repositories such as pandas seem to have custom scripts for API doc generation. It could be useful to have a script that walks through all the files in src/fairlens and builds rst files for them. Nice use cases include having a separate html file for each metric, method, etc. Alternatively, we could use templates for sphinx-apidoc.
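A sketch of such a script; the paths and rst layout are assumptions:

from pathlib import Path

SRC = Path("src/fairlens")
OUT = Path("docs/reference")

for py_file in SRC.rglob("*.py"):
    if py_file.name.startswith("_"):
        continue  # skip private modules and __init__.py
    # e.g. src/fairlens/bias/metrics.py -> "fairlens.bias.metrics"
    module = ".".join(py_file.relative_to(SRC.parent).with_suffix("").parts)
    rst = OUT / f"{module}.rst"
    rst.parent.mkdir(parents=True, exist_ok=True)
    rst.write_text(
        f"{module}\n{'=' * len(module)}\n\n"
        f".. automodule:: {module}\n"
        "   :members:\n"
    )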

Cross reference API reference in docs

Is your feature request related to a problem? Please describe.
The documentation references object names in the user guides and tutorials, but they are not cross-referenced to the respective entries in the API reference. It would be nicer to be able to click on a name and be taken to the detailed API reference for that object.

Describe the solution you'd like
Links to the respective API reference entries for any named objects in the documentation.

Describe alternatives you've considered

Additional context

Detecting sensitive attributes using word vectors

Deep search currently uses the Ratcliff-Obershelp algorithm to match strings in a column against potential aliases to determine whether the attribute corresponds to a sensitive attribute. Using word vectors would remove the need to match against aliases and might be faster.
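A hedged sketch of the idea using spaCy word vectors; the model name and any threshold are assumptions, and in practice a column would likely be scored on a sample of its values:

import spacy

nlp = spacy.load("en_core_web_md")  # a model that ships with word vectors

def similarity_to_category(value: str, category: str) -> float:
    # Cosine similarity between word vectors, instead of string matching
    # against a fixed list of aliases
    return nlp(value).similarity(nlp(category))

# e.g. compare similarity_to_category("female", "gender") against a threshold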

Continuous data is handled incorrectly in emd metric

The emd metric function builds a metric space from all the unique values in the data, and uses pd.Series.value_counts to create the histogram.

if counts is None:
    if group1 is None:
        raise InsufficientParamError()

    # Find the predicates for the two groups
    pred1, pred2 = utils.get_predicates(df, group1, group2)

    # Compute the histogram / counts for each group
    g1_counts = df[pred1][target_attr].value_counts().to_dict()
    g2_counts = df[pred2][target_attr].value_counts().to_dict()
    counts = g1_counts, g2_counts

space = df[target_attr].unique()

For categorical attributes this is fine, but for continuous data it won't work as intended: the metric space blows up and the distance isn't calculated between meaningful distributions. Instead, the continuous data either needs to be binned before calling pd.Series.value_counts, or the raw samples should be passed to pyemd.emd_samples, which does the binning itself.
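A sketch of the binned approach, using SciPy's one-dimensional Wasserstein (earth mover's) distance as a stand-in for the pyemd-based metric:

import numpy as np
from scipy.stats import wasserstein_distance

def continuous_distance(x: np.ndarray, y: np.ndarray, bins: int = 50) -> float:
    # Build a shared metric space from bin centres, not raw unique values
    edges = np.histogram_bin_edges(np.concatenate([x, y]), bins=bins)
    centres = (edges[:-1] + edges[1:]) / 2
    x_hist, _ = np.histogram(x, bins=edges, density=True)
    y_hist, _ = np.histogram(y, bins=edges, density=True)
    return wasserstein_distance(centres, centres, x_hist, y_hist)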

Create a summary of fairness metrics.

For the first stage of this project, we'd like to have a solution that measures fairness in a few common ways. To start with it would be helpful to create some documentation summarising the current common ways to measure fairness and bias.

Distance computation for all demographics in the fairness scorer

The fairness scorer needs a fast, efficient (multi-threaded) method to compute the statistical distance between the distribution of a demographic and the distribution of the population without that demographic, for all combinations of demographics in the selected (sensitive) attributes. Efficiently mapping distance metrics across a groupby object might involve handling some cases differently; for instance, with categorical distance metrics it would be much faster to compute the bin edges for the entire column beforehand rather than calling np.histogram_bin_edges once for each group.
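A sketch of the bin-edge optimisation: compute the edges once for the whole column, then histogram each demographic against them.

import numpy as np
import pandas as pd

def group_histograms(df: pd.DataFrame, target_attr: str, sensitive_attr: str) -> dict:
    # One call to histogram_bin_edges for the whole column ...
    edges = np.histogram_bin_edges(df[target_attr].dropna(), bins="auto")
    # ... reused for every demographic instead of recomputed per group.
    return {
        group: np.histogram(values.dropna(), bins=edges)[0]
        for group, values in df.groupby(sensitive_attr)[target_attr]
    }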

Sample architecture:

def distribution_score(
    self,
    mode: str = "auto",
    alpha: float = 0.05,
    min_dist: Optional[float] = None,
    min_count: Optional[int] = 50,
    weighted: bool = True,
    max_comb: Optional[int] = 3
) -> pd.DataFrame:
    """Returns the biases and fairness score by analyzing the distribution difference between sensitive
    variables and the target variable.
    Args:
        mode (str, optional):
            Choose a different metric to use. Defaults to automatically chosen metric depending on
            the distribution of the target variable.
        alpha (float, optional):
            Maximum p-value to accept a bias. Defaults to 0.05.
        min_dist (Optional[float], optional):
            If set, any bias with smaller distance than min_dist will be ignored. Defaults to None.
        min_count (Optional[int], optional):
            If set, any bias with fewer samples than min_count will be ignored. Defaults to 50.
        weighted (bool, optional):
            Whether to weight the overall score by the size of each subgroup. Defaults to True.
        max_comb (Optional[int], optional):
            Max number of combinations of sensitive attributes to be considered. Defaults to 3.
    """

Sample output on COMPAS:

Demographic              Distance    P-Value    Counts
African-American Male    ...         ...        ...
Caucasian Male           ...         ...        ...
...

Add public datasets

Would be good to have a collection of relevant public datasets referenced somewhere in the repo. Could be just in the README or also in a dedicated data folder, e.g. COMPAS.

Binary distribution plots are difficult to interpret

Is your feature request related to a problem? Please describe.
Distribution plots of binary target variables are arguably difficult to interpret, since the overlaid distributions often cover up relevant information.

[screenshot: overlaid distribution plot of a binary target]

Describe the solution you'd like
Ideally by using distr_plot the user would be able to clearly see the disparity between positive and negative classes for each pair of values.

Describe alternatives you've considered
A simple solution would be to make the bars appear side by side, as in the sketch below. Open to suggestions for alternative plots.
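A sketch of the side-by-side alternative as a plain grouped bar chart, assuming a binary target (the actual change to distr_plot would live in the plotting module):

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def binary_distr_plot(df: pd.DataFrame, target_attr: str, sensitive_attr: str):
    # Proportion of each target class within each sensitive group
    props = pd.crosstab(df[sensitive_attr], df[target_attr], normalize="index")
    x = np.arange(len(props))
    width = 0.35
    plt.bar(x - width / 2, props.iloc[:, 0], width, label=str(props.columns[0]))
    plt.bar(x + width / 2, props.iloc[:, 1], width, label=str(props.columns[1]))
    plt.xticks(x, props.index)
    plt.legend(title=target_attr)
    plt.show()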
