
amiratag / datashapley

Data Shapley: Equitable Valuation of Data for Machine Learning

License: MIT License

Python 53.63% Jupyter Notebook 46.37%

datashapley's Introduction

Code for implementation of "Data Shapley: Equitable Valuation of Data for Machine Learning".

Please cite the following work if you use this benchmark or the provided tools or implementations:

@inproceedings{ghorbani2019data,
  title={Data Shapley: Equitable Valuation of Data for Machine Learning},
  author={Ghorbani, Amirata and Zou, James},
  booktitle={International Conference on Machine Learning},
  pages={2242--2251},
  year={2019}
}

Prerequisites

  • Python, NumPy, Tensorflow 1.12, Scikit-learn, Matplotlib

Basic Usage

Data Shapley divides value fairly among individual training data points or sources, given the learning algorithm and a measure of performance for the trained model (test accuracy, etc.).
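As a rough illustration of what "dividing value fairly" means, here is a minimal sketch of an exact Shapley computation on three toy data points. It does not use the DShap class; `exact_shapley` and `toy_value` are hypothetical names, and the toy performance measure is invented for illustration. Enumerating all orderings is only feasible for a handful of points.

```python
from itertools import permutations


def exact_shapley(points, value):
    """Average each point's marginal contribution over all orderings."""
    totals = {p: 0.0 for p in points}
    orderings = list(permutations(points))
    for order in orderings:
        subset = []
        previous = value(subset)
        for p in order:
            subset.append(p)
            current = value(subset)
            totals[p] += current - previous  # marginal contribution of p
            previous = current
    return {p: t / len(orderings) for p, t in totals.items()}


def toy_value(subset):
    """Hypothetical performance measure: "a" and "b" are redundant, "c" is unique."""
    s = set(subset)
    score = 0.0
    if "a" in s or "b" in s:
        score += 0.6
    if "c" in s:
        score += 0.3
    return score


shapley = exact_shapley(["a", "b", "c"], toy_value)
for point, v in sorted(shapley.items()):
    print(point, round(v, 3))
```

The redundant points "a" and "b" split their shared contribution equally, and the values sum to the performance obtained with all three points.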

Authors

License

This project is licensed under the MIT License. See the LICENSE.md file for details.

datashapley's People

Contributors

amiratag

datashapley's Issues

Running TMC-Shapley for CNN

Hi there,

I am trying to run TMC-Shapley on the CheXpert dataset with a DenseNet-121. I only want to compute TMC-Shapley values for a small number of training points.
The code seems to run without any errors, but I have a problem with the following piece of code in the _one_iteration method:

X_batch = np.zeros((0,) + tuple(self.X.shape[1:]))
y_batch = np.zeros(0, int)
truncation_counter = 0
for n, idx in enumerate(idxs):
    old_score = new_score
    X_batch = np.concatenate((X_batch, self.X[sources[idx]]))
    y_batch = np.concatenate((y_batch, self.y[sources[idx]]))
    if (self.is_regression or len(set(y_batch)) == len(set(self.y_test))): ##FIXIT
        self.restart_model()
        self.model.fit(X_batch, y_batch)
        new_score = self.value(self.model, metric=self.metric)

When the if-condition is false, the TMC-Shapley calculation terminates, but every data point gets a marginal contribution of zero. I guess this is not the intention of the calculation.
When the if-condition is true, the TMC-Shapley calculation does not terminate.
My X data has the shape number x 3 x 224 x 224 and my y data has the shape number x 1. Thus, the only thing I changed from the original source code is that I replaced y_batch = np.zeros(0, int) with y_batch = np.zeros((0,) + tuple(self.y.shape[1:])).

Is there in general a problem with TMC-Shapley for CNNs or did I make a mistake?

Thanks for the help in advance.

Best regards,
Fabian

Python version in this project

Hello, I'm looking into your project and was wondering if you could specify the Python version that it's built to work with.

Thank you for your help.

Wrong Shape in DShap.py/init_score

The following lines (DShap.py: 172/173) intend to append the f1 score of the true output (y_test) and a random classification (rnd_y).

rnd_y = np.random.permutation(self.y)
rnd_f1s.append(f1_score(self.y_test, rnd_y))  

Since train and test set can differ in length, it must be:

rnd_y = np.random.permutation(self.y_test)
rnd_f1s.append(f1_score(self.y_test, rnd_y))  

Otherwise the code raises an error because the two arrays have different shapes.
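The fix above can be illustrated with a small self-contained sketch. This is not the DShap code: `binary_f1` is a plain-Python stand-in for sklearn.metrics.f1_score on 0/1 labels, `random_baseline_f1` is a hypothetical helper, and the label lists are made up. The point is only that a permutation of the test labels always has the same length as the test labels, whereas a permutation of the training labels generally does not.

```python
import random


def binary_f1(y_true, y_pred):
    """Plain-Python F1 for 0/1 labels (stand-in for sklearn.metrics.f1_score)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def random_baseline_f1(y_test, trials=100, seed=0):
    """Mean F1 of random permutations of the test labels against themselves."""
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        rnd_y = list(y_test)
        rng.shuffle(rnd_y)  # same length as y_test by construction
        scores.append(binary_f1(y_test, rnd_y))
    return sum(scores) / len(scores)


y_train = [0, 1, 1, 0, 1, 0, 1, 1]  # 8 training labels (made up)
y_test = [1, 0, 1, 0, 0]            # 5 test labels (made up)
baseline = random_baseline_f1(y_test)
print(round(baseline, 3))
```

Scoring `y_test` against a permutation of `y_train` here would raise the length error, which mirrors the shape problem described in this issue.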

Question in TMC

if self.is_regression or len(set(y_batch)) == len(set(self.y_test)): ##FIXIT

Hi,

Why do we restart the model when the second condition is met?

Thanks

Show performance graph by gradually removing worst values instead of best values

The example notebook generates a performance graph by removing the best data points gradually. I'm trying to figure out how the accuracy changes when the worst performing data points are removed gradually. I can't seem to figure out where the values of the worst data points are being gradually removed for the chart and then how to do the opposite.

Thanks,

Bob
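For reference, the general idea of a removal curve can be sketched independently of the notebook. This is not the repository's plotting code: `removal_curve` and `toy_evaluate` are hypothetical names, and `toy_evaluate` merely stands in for "retrain on the kept points and score on the test set". Switching between removing the best and the worst points comes down to the sort direction of the values.

```python
def removal_curve(values, evaluate, step=1, remove_high_value_first=True):
    """Score the data after removing 0, step, 2*step, ... points by value rank."""
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=remove_high_value_first)
    scores = []
    for n_removed in range(0, len(order), step):
        kept = sorted(order[n_removed:])  # indices that remain in the training set
        scores.append(evaluate(kept))
    return scores


# Made-up data values for five training points.
values = [0.4, -0.2, 0.1, -0.5, 0.3]


def toy_evaluate(kept):
    """Toy stand-in for retraining and scoring: sum of the kept points' values."""
    return sum(values[i] for i in kept)


best_first = removal_curve(values, toy_evaluate)
worst_first = removal_curve(values, toy_evaluate, remove_high_value_first=False)
print(best_first)
print(worst_first)
```

In this toy setup the score drops when the highest-valued points go first and rises at the start when the lowest-valued points go first.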

It's super slow.

Great idea. However, I tried it on data with 10k rows and 80 columns on my 32-core machine, and it has been running for 5 days.
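A back-of-the-envelope estimate may help explain the runtime. The sketch below assumes the model is refit once for every point added within each Monte Carlo permutation. The permutation count and the truncation fraction are assumptions chosen for illustration, not values taken from the repository, and `estimated_model_fits` is a hypothetical helper.

```python
def estimated_model_fits(n_points, n_permutations, truncation_fraction=1.0):
    """Rough count of model fits: one fit per added point, per permutation.

    truncation_fraction is the assumed share of each permutation that is
    actually evaluated before truncation stops it early.
    """
    fits_per_permutation = int(n_points * truncation_fraction)
    return fits_per_permutation * n_permutations


fits = estimated_model_fits(n_points=10_000, n_permutations=100)
truncated_fits = estimated_model_fits(10_000, 100, truncation_fraction=0.25)
print(fits, truncated_fits)
```

Under these assumptions the total number of fits grows with the product of dataset size and permutation count, so even fast models add up quickly at 10k rows.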

Is g_shap only built to run for fully connected networks?

I see that when running g_shap, model = ShapNN() as opposed to self.model.

Does this mean that no matter what model family we choose, gshap scores will come from a NN fit on the data? What if I wanted to find the gshap scores for other gradient-based models?

Performance graph phenomenon

Hi,

Just had a quick inquiry about the results of my performance graph. I am a little confused as to why the accuracy drastically increases at around 75% removed. I saw a similar increase when using logistic regression as well. Any ideas?
[Image: mygraph]

Request for implementation of data shapley for image processing

Page 10 of the paper mentions the application of Data Shapley to veterinary data modelling using DeepTag (citation 25). Another application in the paper concerns fairness in gender detection from image data.
Could you share the text and image processing techniques that were used?

Also, does Data Shapley work well for skewed data?

Example notebook's bug

In the example notebook, the cell

dshap = DShap(X, y, X_test, y_test, num_test, sources=sources, model_family=model, metric='accuracy',
              directory=directory, seed=0)

can't run because sources is not defined.

Using DataShapley with Regression

I would love to use this with regression but the documentation indicates it is not yet implemented. Is there an intent to add regression support in the near future by chance?

Plots in Example.ipynb

Hello,

The plots at the bottom of the Jupyter notebook Example.ipynb are not labeled properly.

Also I could not find similar plots in the paper.

Could you please explain what those plots mean, specifically the ones produced by calling

convergence_plots(dshap.marginals_tmc)
and
convergence_plots(dshap.marginals_g)

Thanks,

Notebook Error Ideas

Hi,
Unfortunately I keep getting this error:

Traceback (most recent call last):
  File "C:/Users/----------------------------y/main.py", line 47, in <module>
    directory=director, seed=0)
  File "C:\Users-------------------------------\DShap.py", line 69, in __init__
    os.makedirs(directory)
  File "C:\Users--------------------------------os.py", line 220, in makedirs
    mkdir(name, mode)
FileNotFoundError: [WinError 2] The system cannot find the file specified: './temp'

Seems to be this chunk of code...

directory = './temp'
dshap = DShap(X, X_raw[:train_size], X_test, y_test, num_test,
              sources=None,
              sample_weight=None,
              model_family=model,
              metric='accuracy',
              overwrite=True,
              directory=directory, seed=0)

Is this to do with Windows?
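The cause of the error is not clear from the traceback alone. One thing worth ruling out is a working-directory or relative-path problem, for example by creating the directory up front from an absolute path and passing that path on. The sketch below uses only the standard library; `ensure_directory` is a hypothetical helper and a temporary base folder stands in for the project folder.

```python
import os
import tempfile


def ensure_directory(path):
    """Create `path` (and any missing parents) and return its absolute form."""
    absolute = os.path.abspath(path)
    os.makedirs(absolute, exist_ok=True)  # no error if it already exists
    return absolute


# A temporary base folder stands in for the project folder here.
base = tempfile.mkdtemp()
directory = ensure_directory(os.path.join(base, "temp"))
print(os.path.isdir(directory))
```

Whether this resolves the WinError depends on how DShap handles an existing directory when overwrite=True, which this sketch does not address.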

AttributeError: 'DShap' object has no attribute 'marginals_tmc'

When I run Example.ipynb and execute convergence_plots(dshap.marginals_tmc), I get the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
 in 
----> 1 convergence_plots(dshap.marginals_tmc)

AttributeError: 'DShap' object has no attribute 'marginals_tmc'

Could you please tell me whether a function is missing from the DShap class?

Is the model training wrong?

The TMC-Shapley algorithm in the paper says that the model is trained for each new datum that is added to the batch. However, in the line below, the model is trained only when the batch size is the same as the test size. Why is that so?

or len(set(y_batch)) == len(set(self.y_test))): ##FIXIT

Thanks.
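Read literally, the quoted expression compares the number of distinct labels in the batch with the number of distinct labels in the test set, not the number of samples. A minimal sketch of that reading, with made-up label lists and a hypothetical helper name:

```python
def same_number_of_distinct_labels(y_batch, y_test):
    """Mirror of the quoted check: compare counts of distinct labels, not sizes."""
    return len(set(y_batch)) == len(set(y_test))


y_test = [0, 1, 1, 0]  # two distinct labels, four samples

only_one_class = same_number_of_distinct_labels([1, 1, 1], y_test)
both_classes = same_number_of_distinct_labels([1, 0], y_test)
print(only_one_class, both_classes)
```

Under this reading, a batch of two points can pass the check against a test set of four points as long as both classes are present, while a larger single-class batch does not. Why the repository gates training on this check is a question for the authors.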
