mozilla / presc Goto Github PK

Performance Robustness Evaluation for Statistical Classifiers

License: Mozilla Public License 2.0

Python 15.83% Dockerfile 0.05% Makefile 0.05% Shell 0.02% Jupyter Notebook 84.04%

presc's Introduction

PRESC: Performance and Robustness Evaluation for Statistical Classifiers

PRESC is a toolkit for the evaluation of machine learning classification models. Its goal is to provide insights into model performance which extend beyond standard scalar accuracy-based measures and into areas which tend to be underexplored in application, including:

Generalizability of the model to unseen data for which the training set may not be representative
Sensitivity to statistical error and methodological choices
Performance evaluation localized to meaningful subsets of the feature space
In-depth analysis of misclassifications and their distribution in the feature space

More details about the specific features we are considering are presented in the project roadmap. We believe that these evaluations are essential for developing confidence in the selection and tuning of machine learning models intended to address user needs, and are important prerequisites towards building trustworthy AI.

It also includes a package to carry out copies of machine learning classifiers.

As a tool, PRESC is intended for use by ML engineers to assist in the development and updating of models. It is usable in the following ways:

As a standalone tool which produces a graphical report evaluating a given model and dataset
As a Python package/API which can be integrated into an existing pipeline

A further goal is to use PRESC:

As a step in a Continuous Integration workflow: evaluations run as a part of CI, for example, on regular model updates, and fail if metrics produce unacceptable values.

For the time being, the following are considered out of scope:

User-facing evaluations, eg. explanations
Evaluations which depend explicitly on domain context or value judgements of features, eg. protected demographic attributes. A domain expert could use PRESC to study misclassifications across such protected groups, say, but the PRESC evaluations themselves should be agnostic to such determinations.
Analyses which do not involve the model, eg. class imbalance in the training data

There is a considerable body of recent academic research addressing these topics, as well as a number of open-source projects solving related problems. Where possible, we plan to offer integration with existing tools which align with our vision and goals.

Documentation

Project documentation is available here and provides much more detail, including:

Getting set up
Running a report
Computing evaluations
Configuration
Package API

Examples

An example script demonstrating how to run a report is available here.

There are a number of notebooks and explorations in the examples/ dir, but they are not guaranteed to run or be up-to-date as the package has undergone major changes recently and we have not yet finished updating these.

Some well-known datasets are provided in CSV format in the datasets/ dir for exploration purposes.

Notes for contributors

Contributions are welcome. We are using the repo issues to manage project tasks in alignment with the roadmap, as well as hosting discussions. You can also reach out on Gitter.

We recommend that submissions for new feature implementations include a Juypter notebook demonstrating their application to a real-world dataset and model.

This repo adheres to Python black formatting, which is enforced by a pre-commit hook (see below).

Along with code contributions, we welcome general feedback:

Testing out the package functionality. Try running the report on a classification model and dataset. You can also try running individual evaluations in a Jupyter notebook.
- If you don't have a dataset or classification model to work with, you can use one of the datasets in the repo, and create a classifier using scikit-learn. Some examples are given in the examples/ dir.
- If you can apply PRESC to a classification problem you have already been working on, we'd be very excited to hear your feedback. If your data & model can be considered public, you are welcome to submit any artifacts to our examples/ dir.
Please open issues for any bugs you encounter (including things that don't work as you expect or aren't well explained).
- If you want to offer a PR for a fix, that is welcome too.
We would welcome any feedback on the general approach, the evaluations described in the roadmap, the results you get from running PRESC, etc, including similar projects you're familiar with. You can start a discussion by opening an issue.

The development of the ML Classifier Copies package is being carried out in the branch model-copying.

Setting up a dev environment

Make sure you have conda (eg. Miniconda) installed. conda init should be run during installation to set the PATH properly.

Set up and activate the environment. This will also enable a pre-commit hook to verify that code conforms to flake8 and black formatting rules. On Windows, these commands should be run from the Anaconda command prompt.

$ conda env create -f environment.yml
$ conda activate presc
$ python -m pip install -e .
$ pre-commit install

To run tests:

$ pytest

Acknowledgements

This project is maintained by Mozilla's Data Science team. We have also received code contributions by participants in the following programs, and we are grateful for their support:

The ML Classifier Copying package is being funded through the NGI0 Discovery Fund, a fund established by NLnet with financial support from the European Commission's Next Generation Internet programme, under the aegis of DG Communications Networks, Content and Technology under grant agreement No 825322.

presc's People

Contributors

Stargazers

Watchers

presc's Issues

[Outreachy applications] Visualization for misclassifications

Misclassifications can reveal a lot about the boundaries of performance of a classifier. Develop a visualization that helps dig into misclassified datapoints in the test set. A simple approach for a binary classifier would be to plot a histogram of the predicted class probabilities across the misclassified test samples in each class.

Add more datasets to the repo

We should make a wider variety of datasets available for examples and testing. These should include some requiring more complex models or data transformation.

Prototyping for class fit

Test out the class fit in a notebook using the implementation from #200. These should be run on two different dataset/model pairs.

[Outreachy applications] Importance score for dataset training samples

Implement a way to assess the importance of an inidividual training datapoint to the performance of the model. This could be done by training the same model with a particular point included and then excluded from the training set, and computing the difference in performance scores on the test set.

Investigate using conditional feature distributions in an automated setting

Does it make sense to use the conditional feature distributions in an automated setting? This should be explored in a notebook using the results of #193 and #194.

[Outreachy applications] Calibration plot

A calibration plot assesses how well the predicted class probabilities match the actual rate of occurrence in the dataset.

Write a function to create a calibration plot given one or more binary classifiers and a test set. For each model it would compute predicted probabilities for each test datapoint. These probability values should then be binned according to a scheme (eg. intervals of 10%), and the observed occurrence rate can be computed as the proportion of true positive class samples out of all samples in each bin. The plot then displays the observed occurrence rates vs the bin midpoints for each model.

Rework evaluation visualizations

The existing visualizations could stand to be tidied up and made more uniform in appearance. Together with this, we would like to move to a visualization library that offers clean, maintainable syntax and interactivity features out-of-the-box (currently considering Altair).

The goal is to review and refactor the visualizations for the currently implemented evaluations (under presc.evaluations).

Implementation for spatial distributions

Implement the computation for spatial distributions. This may be doable using scikit-learn's NearestNeighbors. The data will likely need to be scaled first.

Prototyping for feature-based splitting

Test out the feature-based splitting approach in a notebook using the implementation from #211 . These should be run on two different dataset/model pairs.

Prototyping for conditional feature distributions

Test out the conditional feature distributions in a notebook using the implementation from #193. These should be run on two different dataset/model pairs.

Investigate baseline for spatial distributions

The spatial distributions comparison is likely only meaningful when compared to a relevant baseline, such as the distribution of distances between correctly classified points. Develop a baseline computation based on the results from #203 and #204 and test it out in a notebook.

Add a unified entrypoint to run the report

We need a better interface to run the report from, and to remove any dependency on the example dataset/model we are currently using in the report.

As a part of this, we need a way to share the required inputs (data/model/config) between the notebooks. Looks like the simplest approach would be to serialize and reload separately in each notebook.

Implementation for feature-based splitting

Implement the computation for feature-based splitting. This can probably use scikit-learn's GroupKFold. Feature splitting should reuse existing logic as appropriate.

[Outreachy applications] Comparing test sample classifications between models

To delve further into model performance comparisons, it would be interesting to compare the classification results (ie. predicted probabilities) for individual datapoints. This would also help to understand misclassifications.

Develop a visualization to compare predicted class probabilities across models for binary classifiers. For example, this could present a histogram of the difference in predicted probabilities between the two models across all training samples. Misclassifications under either model could also be split out.

Refactor feature binning logic to separate utility function

We expect to use the binning methodology from presc.misclassifications.misclass_rate.misclass_rate_feature in other places. This should be moved out into a separate tested utility function.

Work out implementation details for label flipping

Prior to implementing label flipping, we need to work out the methodology for traversing the space of possible label flips.

Work out implementation details for entropy-based splitting

Prior to implementing entropy-based splitting, we need to work out the methodology for splitting the data.

Prototyping for conditional metrics

Test out applications for the conditional metrics in a notebook using the implemention from #188. These should be run on two different dataset/model pairs.

Make all PRs Black adherent

As mentioned in the Contribution guidelines :
Code formatting guidelines should strinctly adhere to Python Black formatting guidelines.

Now, sometimes the contributors forget to ensure that the above is followed. To reduce the efforts while making sure the required code formatting is followed, adding Python Black as a pre-commit hook would make things easier.

Investigate baseline for class fit

The class fit metric is likely only meaningful when compared to a relevant baseline. Develop a baseline computation based on the results from #200 and #201 and test it out in a notebook.

Test out a calibration plot using the conditional metrics tool

Create a calibration plot using the conditional metrics implementation #188. These can be appended to the notebook from #190.

Prototyping for novelty

Test out the novelty metric for datasets in a notebook using the implementation from #219. These should be run on two different dataset/model pairs.

Prototyping for entropy-based splitting

Test out the entropy-based splitting approach in a notebook using the implementation from #214. These should be run on two different dataset/model pairs.

Implementation for label flipping

Implement the computation for label flipping.

As the details are not fully fleshed out, this will require first developing a plan for how the evaluation would work in the context of PRESC, ie. any limitations or assumptions on the inputs and the computational approach.

[Outreachy applications] Learning from misclassifications

When training a classification model, it is common to look at accuracy and the confusion matrix, which give a summary view of misclassifications. By itself, these metrics are informative but not very actionable.

Develop a metric or visualization that reveals something more about each misclassified point (beyond just the fact that it was misclassified) that can be used to improve the model.

Some examples of a metric might be the classification probability scores for the different classes, which can indicate whether the misclassified points were close to the decision boundary or not, or the distance from the class mean in feature space, indicating whether the misclassified points are outliers.

A good place to start is to study the misclassifications you got from your model for task #2. What do they tell you about how to improve your model?

Set up automatic API documentation

Get set up with a tool, eg. Sphinx, to auto-generate API documentation for the library.

Implementation for conditional metrics

Implement the computation for conditional metrics.

This can probably be done by refactoring presc.misclassifications.misclass_rate.misclass_rate_feature to accept a metric function.

Work out implementation details for counterfactual accuracy

The description of counterfactual accuracy in the paper makes certain assumptions about the model and dataset. We need to decide how these will affect the implementation.

Management for config options

Many components of the PRESC computations should be customizable by a user running the tools in their own environment. Some options are currently passed around as function args, but we need a more unified way of setting config options that doesn't require code changes, eg. via a config file.

Some of the components that will be configured via options are:

The evaluation computations, eg. the binning scheme to use for grouping test samples
Plot components, eg. element sizing or plot arrangement
How an evaluation applies to a dataset, eg. some would be run per-feature, and we'd want to include or exclude certain features. Also, some of the per-evaluation options may be allowed to vary per-feature in this case.
The report generation, eg. which evaluations to include

Test/example case wrapper API

It would be great to have a collection of preset dataset/model pairs with pretrained models that can be easily loaded and used in examples or test cases without needing to copy around per-instance boilerplate code for train/test splitting, model training, etc.

We want to develop a wrapper class that offers this functionality. These example instances will likely reference Dataset instances (#208), or it might make more sense to intregate the two.

[Outreachy applications] Visualization of an evaluation metric

A common theme in classifier tuning and evaluation is to plot a metric against the values of some parameter with repeated runs to assess variability. We would like to have a general utility for producing such plots.

It should take as input at table with columns (x, y1, y2, ..., yk), ie multiple y values for each x, and plot the average y value vs x with the spread of the y-values represented in some way, eg. a band.

Implementation of entropy-based splitting

Implement the computation for entropy-based splitting. This could probably be an extension of scikit-learn's BaseCrossValidator class.

[Outreachy applications] Multi-Class Imbalanced Data

Recent studies point to the fact that it is not the imbalanced data itself, but rather other data difficulty factors, amplified by the data imbalance, that pose a challenge during the learning process. Such factors include small sample size, presence of disjoint and overlapping data distributions, and presence of outliers and noisy observations. Multi-Class nature of classification can further amplify the problem if the data is imbalanced.

Earlier suggestions for dealing with multi-class imbalanced data involves decomposing the given problem to binary. This leads to quite a significant loss of information and loss of relationship among the classes being decomposed. For instance, it is possible that class A can be a part of the majority for class B but be a minority for class C. Real-world scenarios have more complex problems.

Propose a solution to deal with imbalanced data problem with embedded data-level difficulties, i.e., atypical data distributions, overlapping classes, and small disjuncts, in the multi-class setting.

Curate a collection of dataset/model pairs

Add a collection of dataset/model pairs to the repo that can be loaded into notebooks for running examples or used in test cases. At a minimum, a modelling problem for a given dataset could be set up by running a notebook.

Prototyping for counterfactual accuracy

Test out the counterfactual accuracy in a notebook using the implementation from #197 . These should be run on two different dataset/model pairs.

Implementation of statistical distance measures

Tracking ongoing work on statistical distance measures.

Prototyping for spatial distributions

Test out the spatial distributions in a notebook using the implementation from #203. These should be run on two different dataset/model pairs.

[Outreachy applications] Startup task: Train and test a classification model

This is a good way to get started with the environment and the problem domain. It will also provide the basis for a test case for future work. At a minimum, you should:

load a dataset from the repo
train a classification model from scikit-learn
compute an evaluation metric on a held-out test set

Feel free to include any additional steps you feel are relevant or you are interested in trying out, such as:

basic exploratory analysis of the dataset
data preprocessing
hyperparameter tuning

When you start work on this task, please post a comment here indicating which dataset and model you will be working with so that other contributors can avoid duplicating your work.

Dataset wrapper API

For convenience we should wrap test datasets in an object that offers a common API, eg. accessing feature columns and label column, accessing pre-split train/test sets, etc. That way we don't need to pass around information about column names or indexing separately. There should also be methods for reading and writing from files.

We can probably create instances of these objects directly as a part of the dataset setup script #207 and work with these from then on.

Implementation for novelty

Implement the computation for the novelty metric.

Investigate baselines for conditional metrics assessment

How can we make a decision about performance based on conditional metrics? This may either be a general rule or specific to each specific metric. The goal is to propose some approaches using the results of #188 and #190 and test them out in a notebook. This will help to look for patterns that we can use to guide intuition for selecting a baseline.

[Outreachy applications] Traversal of the space of cross-validation folds

Similarly to #3, we want to investigate how much the performance score computed using cross-validation depends on the number of folds. Eg. how would our performance estimate change if we used 10-fold rather than 5-fold?

Write a function that takes a scikit-learn estimator and a dataset, and computes an evaluation metric using repeated K-fold cross-validation over a grid of K values from 1 to n. It should output a table of K with the average metric value across the folds, one for each repeat.

Implementation for counterfactual accuracy

Implement the computation for counterfactual accuracy based on results of #196.

As some of the details have not been fully fleshed out in the original paper, and certain assumptions are made about the model and dataset, this will require first developing a plan for how the evaluation would work in the context of PRESC, ie. any limitations or assumptions on the inputs and the computational approach.

Prototyping for label flipping

Test out the label flipping in a notebook using the implementation from #217. These should be run on two different dataset/model pairs.

[Outreachy applications] PLEASE READ Updates to PR submission deadlines and review expectations!

Some of you have reached out to us regarding the closure of the repo to new contributors and the long delays in getting outstanding PRs reviewed. We want to be fair and considerate of all of your excellent work, but we also have to be realistic. We can not keep up with the volume of PRs and maintain a meaningful level of feedback.

Regarding new contributors with existing work: In order to allow contributors who started working on a PR before we announced the project closed, we will allow a 2 day extension (until Wednesday, March 25th 20:00 GMT (UTC+0)). If you have work that was already started before today you can submit a [WIP] PR before that time and it will get reviewed. We will be verifying commit timestamps to treat this as a strict deadline.
Between Wednesday March 25th and Monday March 30th We ask that NO ONE SUBMIT PRs to the repo during this time. This will allow @dzeber and I time to clear the current queue of outstanding PRs, ensuring that every applicant gets a chance for feedback before the final week of applications.
Between Monday March 30th and April 7th Any PRs submitted in this period can be considered as part of candidates' final application (to be submitted via the process specified on the outreachy project page), however, we can not guarantee that any of these PRs will be reviewed before the application deadline! I have created the post-closure-pr label (see this issue for example) that will help us triage contributions that are submitted between 30-Mar and 07-Apr.

Thank you all once again for your enthusiasm, excellent communication, great work, and for contributing to open source. Apologies for the bumps along the way with this project, we are doing our absolute best.

[Outreachy applications] Covariate Shift

As said in issue #3, a function has been written to check the evaluation metric for a model when train-test split ratio is varied from 0.0- 1.0. But there are cases where we cannot just randomly decide the split ratio on the basis on evaluation metric as the metric does not improve beyond a certain point. Example: data set based on predicting the probability that an auto insurance policy holder files a claim by Porto Seguro has data which has covariate shift. For this data using the regular methods does not yield a good evaluation metric score.

Propose a solution to detect and handle covariates in the data. The above mentioned data or any other can be used for implementation which may or may not have covariates.
Note: This issue has been created to provide a solution to issue #3 which has an exception of not dealing with data that has covariate shift.

Covariate Shift: It refers to a situation where predictor variables have different characteristics (distribution) in train and test data. So our model is unable to predict well on the test data as it has been trained on a certain set of feature characteristics which are different from test data.

Replace dataset files with reproducible code

Rather than maintaining actual data files in the repo, we should use a script to download each of them from their source location and apply any preprocessing. Together with this, we should add an API function that loads a dataset by running its script.

[Outreachy applications] Traversal of the space of train/test splits

Given a classification model, we want to investigate how much the performance score computed on the test set depends on the choice of train/test split proportion. Eg. how would our performance estimate change if we used a 60/40 split rather than 80/20?

Write a function that takes a scikit-learn estimator and a dataset, and computes an evaluation metric over a grid of train/test split proportions from 0 to 100%. To assess variability, for each split proportion it should resplit and recompute the metric multiple times. It should output a table of splits with multiple metric values per split.

Implementation for class fit metric

Implement the computation for class fit. This will likely use #199

Implementation for conditional feature distributions

Implement the computation for conditional feature distributions.