amiratag / datashapley

Data Shapley: Equitable Valuation of Data for Machine Learning

License: MIT License

Python 53.63% Jupyter Notebook 46.37%

datashapley's Introduction

Code implementing "Data Shapley: Equitable Valuation of Data for Machine Learning".

Please cite the following work if you use this benchmark or the provided tools or implementations:

@inproceedings{ghorbani2019data,
  title={Data Shapley: Equitable Valuation of Data for Machine Learning},
  author={Ghorbani, Amirata and Zou, James},
  booktitle={International Conference on Machine Learning},
  pages={2242--2251},
  year={2019}
}

Prerequisites

Python, NumPy, TensorFlow 1.12, scikit-learn, Matplotlib

Basic Usage

Data Shapley divides value fairly among individual training data points or sources, given the learning algorithm and a measure of performance for the trained model (test accuracy, etc.).
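
The idea can be illustrated with a minimal Monte Carlo sketch. It is not the repository's DShap class: the function name monte_carlo_shapley and the toy utility are hypothetical, and the "performance" here is just the number of distinct labels covered, standing in for a trained model's test score. Each point's value is its average marginal contribution over random orderings of the training points.

```python
import random

def monte_carlo_shapley(points, utility, num_permutations=200, seed=0):
    """Estimate each point's Shapley value by averaging its marginal
    contribution over random orderings of the points."""
    rng = random.Random(seed)
    n = len(points)
    totals = [0.0] * n
    for _ in range(num_permutations):
        order = list(range(n))
        rng.shuffle(order)
        subset = []
        old_score = utility(subset)
        for idx in order:
            subset.append(points[idx])
            new_score = utility(subset)
            # Marginal contribution of this point given the points before it.
            totals[idx] += new_score - old_score
            old_score = new_score
    return [total / num_permutations for total in totals]

def toy_utility(subset):
    """Hypothetical performance measure: number of distinct labels seen."""
    return len(set(subset))

points = ["a", "a", "b", "c"]
values = monte_carlo_shapley(points, toy_utility)
print(values)
```

The two duplicate "a" points split the credit for their label, while the unique "b" and "c" points each keep a full unit. The values also sum to the utility of the full set minus that of the empty set, which is the fairness (efficiency) property the method relies on.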

Authors

Amirata Ghorbani - Website
James Zou - Website

License

This project is licensed under the MIT License. See the LICENSE.md file for details.

datashapley's Issues

Can you help me figure out where to find the data that is being dropped during each iteration?

Does anyone have an example of running the example notebook with the "conv" model type?

I'm trying to work out the changes necessary to run the example notebook with the model type of conv. If anyone has any examples of this already created it would be super helpful.

Thanks,

Bob

Running TMC-Shapley for CNN

Hi there,

I am trying to run TMC-Shapley on the CheXpert dataset with a DenseNet-121. I only want to calculate TMC-Shapley values for a small number of training points.
The code seems to run without any errors. Still, I have a problem with the following piece of code in the _one_iteration method:

X_batch = np.zeros((0,) + tuple(self.X.shape[1:]))
y_batch = np.zeros(0, int)
truncation_counter = 0
for n, idx in enumerate(idxs):
    old_score = new_score
    X_batch = np.concatenate((X_batch, self.X[sources[idx]]))
    y_batch = np.concatenate((y_batch, self.y[sources[idx]]))
    if (self.is_regression or len(set(y_batch)) == len(set(self.y_test))): ##FIXIT
        self.restart_model()
        self.model.fit(X_batch, y_batch)
        new_score = self.value(self.model, metric=self.metric)

If the if-condition is false, the TMC-Shapley calculation terminates, but every data point has a marginal contribution of zero. I guess this is not the intention of the calculation.
If the if-condition is true, the TMC-Shapley calculation does not terminate.
My X data has shape number x 3 x 224 x 224 and my y data has shape number x 1. Thus, the only thing I changed from the original source code is that I replaced y_batch = np.zeros(0, int) with y_batch = np.zeros((0,) + tuple(self.y.shape[1:])).

Is there in general a problem with TMC-Shapley for CNNs or did I make a mistake?

Thanks for the help in advance.

Best regards,
Fabian

Python version in this project

Hello, I'm looking into your project and was wondering if you could specify the Python version that it's built to work with.

Thank you for your help.

Wrong Shape in DShap.py/init_score

The following lines (DShap.py: 172/173) are meant to compute the F1 score between the true test labels (y_test) and a random classification (rnd_y) and append it to rnd_f1s.

rnd_y = np.random.permutation(self.y)
rnd_f1s.append(f1_score(self.y_test, rnd_y))

Since the train and test sets can differ in length, it should be:

rnd_y = np.random.permutation(self.y_test)
rnd_f1s.append(f1_score(self.y_test, rnd_y))

Otherwise the code raises an error because the two arrays have different shapes.
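
The length mismatch can be reproduced without the library. The sketch below uses a hand-written binary F1 (binary_f1 and random_f1_baseline are hypothetical names, not part of DShap.py or scikit-learn) to show that a random baseline built by permuting y_test always matches y_test in length, while scoring against training labels of a different length fails.

```python
import random

def binary_f1(y_true, y_pred):
    """F1 score for binary labels in {0, 1}; raises if the lengths differ."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def random_f1_baseline(y_test, num_draws=100, seed=0):
    """Average F1 of random permutations of the test labels."""
    rng = random.Random(seed)
    scores = []
    for _ in range(num_draws):
        rnd_y = rng.sample(y_test, len(y_test))  # permutation of y_test
        scores.append(binary_f1(y_test, rnd_y))
    return sum(scores) / len(scores)

y_train = [0, 1, 1, 0, 1, 0, 1]  # 7 training labels
y_test = [1, 0, 1, 1]            # 4 test labels
baseline = random_f1_baseline(y_test)
print(baseline)
```

Calling binary_f1(y_test, y_train) with these lists raises ValueError, which mirrors the shape error reported above.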

Question in TMC

DataShapley/DShap.py

Line 255 in 7d64ad6

if self.is_regression or len(set(y_batch)) == len(set(self.y_test)): ##FIXIT

Hi,

Why is the model restarted when the second condition is met?

Thanks
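
Read literally, the second condition compares the number of distinct labels in the current batch with the number of distinct labels in the test set. The sketch below (retrain_steps is a hypothetical helper, not part of DShap.py) walks through one ordering and reports at which steps the guard would allow a restart and refit. A plausible reason for the guard, which is an assumption and not confirmed by the authors, is that many classifiers cannot be fit on a batch containing a single class.

```python
def retrain_steps(y_in_order, y_test, is_regression=False):
    """For each step of one permutation, report whether the guard
    `is_regression or len(set(y_batch)) == len(set(y_test))` holds."""
    num_test_classes = len(set(y_test))
    batch_labels = []
    flags = []
    for label in y_in_order:
        batch_labels.append(label)
        same_class_count = len(set(batch_labels)) == num_test_classes
        flags.append(is_regression or same_class_count)
    return flags

# Labels of the training points in the order they are added to the batch.
flags = retrain_steps([0, 0, 1, 0], y_test=[0, 1, 1])
print(flags)  # [False, False, True, True]
```

While the guard is false, the snippet quoted on this page leaves new_score unchanged, so the points added at those steps would receive a zero marginal contribution for that permutation.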

Show performance graph by gradually removing worst values instead of best values

The example notebook generates a performance graph by gradually removing the best data points. I'm trying to figure out how the accuracy changes when the worst-performing data points are removed gradually instead. I can't work out where in the code the data points are removed for the chart, or how to reverse the order.

Thanks,

Bob
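
Independently of how the notebook builds its chart, the removal order comes down to sorting point indices by their estimated values. A minimal sketch, where removal_order and the example values are hypothetical:

```python
def removal_order(values, remove_best_first=True):
    """Return point indices in the order they would be removed:
    highest value first, or lowest value first when the flag is False."""
    return sorted(range(len(values)),
                  key=lambda i: values[i],
                  reverse=remove_best_first)

values = [0.30, -0.10, 0.05, 0.20]  # hypothetical per-point values
best_first = removal_order(values)
worst_first = removal_order(values, remove_best_first=False)
print(best_first)   # [0, 3, 2, 1]
print(worst_first)  # [1, 2, 3, 0]
```

Retraining on the points that remain after removing a growing prefix of worst_first, and recording the score each time, would give the curve for removing the lowest-valued points first.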

It's super slow.

Great idea. However, I tried it on a dataset with 10k rows and 80 columns on my 32-core machine, and it has been running for 5 days.
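
A rough count of model fits helps explain the runtime. The sketch assumes one refit per added point in every permutation, as in the loop quoted in another issue on this page; the permutation count of 100 and the truncation fraction are illustrative assumptions, not values taken from the repository.

```python
def estimated_model_fits(num_points, num_permutations, truncation_fraction=1.0):
    """Approximate number of model fits: one per added point, for the
    fraction of each permutation that is processed before truncation."""
    steps_per_permutation = int(num_points * truncation_fraction)
    return num_permutations * steps_per_permutation

full = estimated_model_fits(10_000, 100)
truncated = estimated_model_fits(10_000, 100, truncation_fraction=0.25)
print(full)       # 1000000
print(truncated)  # 250000
```

Under these assumptions, even a fit that takes one second would add up to more than eleven days of sequential compute without truncation.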

Is g_shap only built to run for fully connected networks?

I see that when running g_shap, model = ShapNN() as opposed to self.model.

Does this mean that no matter what model family we choose, gshap scores will come from a NN fit on the data? What if I wanted to find the gshap scores for other gradient-based models?

Performance graph phenomenon

Hi,

Just had a quick inquiry about the results of my performance graph. I am a little confused as to why the accuracy drastically increases at around 75% removed. I saw a similar increase when using logistic regression as well. Any ideas?

Request for implementation of data shapley for image processing

Page 10 of the paper mentions an application of Data Shapley to veterinary data modelling using DeepTag (citation 25). Another application in the paper concerns fairness in gender detection from image data.
Could you share the text and image processing techniques that were used?

Also, does Data Shapley hold up for skewed data?

Example notebook's bug

In the example notebook, the cell

dshap = DShap(X, y, X_test, y_test, num_test, sources=sources, model_family=model, metric='accuracy',
              directory=directory, seed=0)

can't run because sources is not defined.
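
The loop quoted in another issue on this page indexes the data with self.X[sources[idx]], which suggests that sources maps a source id to the indices of the points it owns. Under that assumption, a definition could be sketched as follows; build_sources is a hypothetical helper and not part of the repository.

```python
def build_sources(num_points, group_labels=None):
    """Map each source id to the list of data indices it owns.
    Without group labels, every point is treated as its own source."""
    if group_labels is None:
        return {i: [i] for i in range(num_points)}
    sources = {}
    for index, group in enumerate(group_labels):
        sources.setdefault(group, []).append(index)
    return sources

per_point = build_sources(3)
per_group = build_sources(4, group_labels=["a", "b", "a", "b"])
print(per_point)  # {0: [0], 1: [1], 2: [2]}
print(per_group)  # {'a': [0, 2], 'b': [1, 3]}
```

Another snippet on this page passes sources=None to the constructor, which may be the simplest way to get the cell running if per-point values are what is wanted.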

Using DataShapley with Regression

I would love to use this with regression but the documentation indicates it is not yet implemented. Is there an intent to add regression support in the near future by chance?

Plots in Example.ipynb

Hello,

The plots at the bottom of the Jupyter notebook Example.ipynb are not labeled properly.

Also, I could not find similar plots in the paper.

Could you please explain what the plots mean that appear after calling

convergence_plots(dshap.marginals_tmc)
and
convergence_plots(dshap.marginals_g)

Thanks,

Notebook Error Ideas

Hi,
Unfortunately I keep getting this error:

Traceback (most recent call last):
  File "C:/Users/----------------------------y/main.py", line 47, in <module>
    directory=director, seed=0)
  File "C:\Users-------------------------------\DShap.py", line 69, in __init__
    os.makedirs(directory)
  File "C:\Users--------------------------------os.py", line 220, in makedirs
    mkdir(name, mode)
FileNotFoundError: [WinError 2] The system cannot find the file specified: './temp'

Seems to be this chunk of code...

directory = './temp'
dshap = DShap(X,X_raw[:train_size], X_test, y_test, num_test,
              sources=None,
              sample_weight=None,
              model_family=model,
              metric='accuracy',
              overwrite=True,
              directory=directory, seed=0)

Is this to do with Windows?
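
One thing to try, without any claim that it resolves this particular error, is to create the directory up front from an absolute path before constructing DShap. The helper name ensure_directory is hypothetical; os.path.abspath and os.makedirs with exist_ok=True are standard library calls.

```python
import os
import tempfile

def ensure_directory(path):
    """Resolve the path to an absolute one and create it, including any
    missing parents; do nothing if it already exists."""
    absolute = os.path.abspath(path)
    os.makedirs(absolute, exist_ok=True)
    return absolute

# Demonstration inside a throwaway base directory.
base = tempfile.mkdtemp()
directory = ensure_directory(os.path.join(base, "temp"))
print(directory)
```

The returned absolute path could then be passed as the directory argument, which removes any dependence on the current working directory.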

AttributeError: 'DShap' object has no attribute 'marginals_tmc'

When I run Example.ipynb and execute convergence_plots(dshap.marginals_tmc), I get this error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
 in 
----> 1 convergence_plots(dshap.marginals_tmc)

AttributeError: 'DShap' object has no attribute 'marginals_tmc'

Could you please tell me whether something is missing from the DShap class?

Is the model training wrong?

The TMC-Shapley algorithm in the paper says that the model is trained for each new datum that is added to the batch. However, in the line below, the model is trained only when the batch size is the same as the test size. Why is that so?

DataShapley/DShap.py

Line 365 in 96e8ecb

or len(set(y_batch)) == len(set(self.y_test))): ##FIXIT

Thanks.
