stfwn / ats-privacy-replication

License: MIT License
There seems to be a bug in the transforms used to normalize CIFAR-100: the mean and std from CIFAR-10 are used. This is where the bug occurs:
The test set is normalized correctly, so it seems likely that this mismatch harms test accuracy.
At first the transforms are initialized correctly by a _build_cifar100 function in the inversefed module here:
But these are then overridden by the ones containing the bug in their own build_transform function here:
Notably (but not necessarily a bug), the first set of transformations contains a random crop and random horizontal flip while the second does not.
# First transforms
Compose(
RandomCrop(size=(32, 32), padding=4)
RandomHorizontalFlip(p=0.5)
Compose(
ToTensor()
Normalize(mean=[0.5071598291397095, 0.4866936206817627, 0.44120192527770996], std=[0.2673342823982239, 0.2564384639263153, 0.2761504650115967])
)
)
# Bugged transforms
Compose(
ToTensor()
Normalize(mean=[0.4914672374725342, 0.4822617471218109, 0.4467701315879822], std=[0.24703224003314972, 0.24348513782024384, 0.26158785820007324])
)
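A minimal sketch of a fix, assuming build_transform should simply reuse the CIFAR-100 statistics that _build_cifar100 already computes; the constant names and function signature below are illustrative, not the repository's actual API.

```python
import torchvision.transforms as transforms

# CIFAR-100 statistics, matching the values printed in the first set of transforms above.
CIFAR100_MEAN = (0.5071598291397095, 0.4866936206817627, 0.44120192527770996)
CIFAR100_STD = (0.2673342823982239, 0.2564384639263153, 0.2761504650115967)

def build_transform(train: bool = True):
    """Hypothetical corrected version: always normalize with CIFAR-100 stats."""
    ops = []
    if train:
        # Keep the crop/flip from the first set of transforms on the train split.
        ops += [transforms.RandomCrop(32, padding=4),
                transforms.RandomHorizontalFlip()]
    ops += [transforms.ToTensor(),
            transforms.Normalize(CIFAR100_MEAN, CIFAR100_STD)]
    return transforms.Compose(ops)
```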
Here's the paragraph above Algorithm 1, which outlines the top policy search process:
original/benchmark/search_transform_attack.py
search_transform_attack.py
, so I think not. Thoughts?
Edit: this might not actually be the relevant part.
They are using the following code to draw a random transformation from the list of subpolicies:
random.randint(-1, 50)
The problem is that there are 50 policies (valid indices 0 through 49), but random.randint is inclusive on both ends: the upper bound of 50 occasionally causes an IndexError: list index out of range exception, and the lower bound of -1 silently wraps around to the last policy.
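A minimal fix sketch: random.randrange is exclusive on the upper bound and starts at 0, so it can never produce -1 or an out-of-range index. The policies list is a stand-in for the actual subpolicy list.

```python
import random

policies = list(range(50))  # stand-in for the 50 subpolicies

idx = random.randrange(len(policies))  # uniform over 0..49
policy = policies[idx]
```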
# Same as:
list(range(200, 700, 5))
This line creates sample indices to be used in computing S_pri and seems to refer to this part of appendix A:
Later on they have these sequential sets of 100 samples to compute S_acc:
Here they don't claim any randomness; a sketch of both index schemes follows.
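To make the two schemes concrete: the S_pri indices follow the snippet above, while the block starts for S_acc are an assumption for illustration, since the actual offsets aren't quoted here.

```python
# S_pri: 100 strided indices, same as list(range(200, 700, 5)) above.
s_pri_indices = [200 + 5 * i for i in range(100)]

# S_acc: sequential blocks of 100 samples each; these start offsets are
# illustrative, not taken from the repository.
s_acc_blocks = [list(range(start, start + 100)) for start in range(0, 1000, 100)]
```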
I refactored the ConvNet model into our codebase and did some trial runs to reverse-engineer which width they could have used. I think we can safely go with width=16 and come close enough to reproduce: that one levels out at about 94% test accuracy on F-MNIST, and they report 94.25.
Widths tried:
- width=16
- width=8
- width=32
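For reference, a minimal sketch of how a width parameter typically scales this kind of ConvNet; the channel counts growing as multiples of width follow the inversefed style, but the exact layer layout here is an assumption, not the repository's definition.

```python
import torch.nn as nn

class ConvNetSketch(nn.Module):
    """Illustrative only: channel counts scale with `width` (16 -> 16/32/64)."""

    def __init__(self, width=16, num_channels=1, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(num_channels, 1 * width, 3, padding=1),
            nn.BatchNorm2d(1 * width), nn.ReLU(),
            nn.Conv2d(1 * width, 2 * width, 3, padding=1),
            nn.BatchNorm2d(2 * width), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(2 * width, 4 * width, 3, padding=1),
            nn.BatchNorm2d(4 * width), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(4 * width, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))
```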
In cifar100_train.py
, as soon as you add any augmentation, you also get a random crop and flip for free. This should be documented somewhere, but I don't think it is.
When printing the transforms in the actual train function with trainloader.dataset.transform (a sketch of the printing follows), here's what you get in various situations.
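The printing itself is just this, added inside the train function; the validloader name is an assumption about what the test-split loader is called.

```python
print(trainloader.dataset.transform)  # train-split transforms
print(validloader.dataset.transform)  # test-split transforms; loader name assumed
```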
Run with:
python benchmark/cifar100_train.py --arch ResNet20-4 --data cifar100 --epochs 200 --aug_list '' --mode aug
And get augmentations:
# Train
Compose(
ToTensor()
Normalize(mean=[0.4914672374725342, 0.4822617471218109, 0.4467701315879822], std=[0.24703224003314972, 0.24348513782024384, 0.26158785820007324])
)
# Test
Compose(
ToTensor()
Normalize(mean=[0.5071598291397095, 0.4866936206817627, 0.44120192527770996], std=[0.2673342823982239, 0.2564384639263153, 0.2761504650115967])
)
Run with:
python benchmark/cifar100_train.py --arch ResNet20-4 --data cifar100 --epochs 200 --aug_list '' --mode crop
And get:
# Train
Compose(
RandomCrop(size=(32, 32), padding=4)
RandomHorizontalFlip(p=0.5)
ToTensor()
Normalize(mean=[0.4914672374725342, 0.4822617471218109, 0.4467701315879822], std=[0.24703224003314972, 0.24348513782024384, 0.26158785820007324])
)
# Test
Compose(
ToTensor()
Normalize(mean=[0.5071598291397095, 0.4866936206817627, 0.44120192527770996], std=[0.2673342823982239, 0.2564384639263153, 0.2761504650115967])
)
Run with:
python benchmark/cifar100_train.py --arch ResNet20-4 --data cifar100 --epochs 200 --aug_list '43-18-18' --mode aug
And get these transforms:
# Train
Compose(
RandomCrop(size=(32, 32), padding=4)
RandomHorizontalFlip(p=0.5)
<benchmark.comm.sub_transform object at 0x7f40fd0a41c0>
ToTensor()
Normalize(mean=[0.4914672374725342, 0.4822617471218109, 0.4467701315879822], std=[0.24703224003314972, 0.24348513782024384, 0.26158785820007324])
)
# Test
Compose(
ToTensor()
Normalize(mean=[0.5071598291397095, 0.4866936206817627, 0.44120192527770996], std=[0.2673342823982239, 0.2564384639263153, 0.2761504650115967])
)
A bit about the experimental setup and code: Tiny-ImageNet, computational requirements (hours, epochs).
That is about the first model from section 4.4, page 5: "Ms is used for privacy quantification. It is trained only with 10% of the original training set for 50 epochs. This overhead is equivalent to the training with the entire set for 5 epochs, which is very small." (10% of the data for 50 epochs is the same compute as 5 epochs on the full set.)
Done
Done (with possible extra runs for enhancement if there is time)
TODO: list this
Running this tonight (commented-out lines are part of the table but have already been run):
# Table 4
## a) Training
# python3.9 main.py train --model resnet20 --dataset fmnist -e 50 --bugged-loss
python3.9 main.py train --model resnet20 --dataset fmnist -e 50 --bugged-loss --aug-list 3-1-7
python3.9 main.py train --model resnet20 --dataset fmnist -e 50 --bugged-loss --aug-list 43-18-18
python3.9 main.py train --model resnet20 --dataset fmnist -e 50 --bugged-loss --aug-list 3-1-7+43-18-18
## b) Training
# python3.9 main.py train --model convnet --dataset fmnist -e 60 --bugged-loss
python3.9 main.py train --model convnet --dataset fmnist -e 50 --bugged-loss --aug-list 21-13-3
python3.9 main.py train --model convnet --dataset fmnist -e 50 --bugged-loss --aug-list 7-4-15
python3.9 main.py train --model convnet --dataset fmnist -e 50 --bugged-loss --aug-list 7-4-15+21-13-3
## a) attacks
for img_idx in {0..5}
do
# python3.9 main.py attack --model resnet20 --dataset fmnist --optimizer inversed --image-index $img_idx
python3.9 main.py attack --model resnet20 --dataset fmnist --optimizer inversed --image-index $img_idx --aug-list 3-1-7
python3.9 main.py attack --model resnet20 --dataset fmnist --optimizer inversed --image-index $img_idx --aug-list 43-18-18
python3.9 main.py attack --model resnet20 --dataset fmnist --optimizer inversed --image-index $img_idx --aug-list 3-1-7+43-18-18
# python3.9 main.py attack --model convnet --dataset fmnist -e 50 --optimizer inversed --image-index $img_idx
python3.9 main.py attack --model convnet --dataset fmnist -e 50 --optimizer inversed --image-index $img_idx --aug-list 21-13-3
python3.9 main.py attack --model convnet --dataset fmnist -e 50 --optimizer inversed --image-index $img_idx --aug-list 7-4-15
python3.9 main.py attack --model convnet --dataset fmnist -e 50 --optimizer inversed --image-index $img_idx --aug-list 7-4-15+21-13-3
done
Accuracy is computed per batch and averaged over the batch dimension, resulting in about 0.2% error in practice versus computing it per sample.
Accuracy is computed per batch here:
All the batch accuracies are summed up:
And averaged to produce the epoch metric:
But since the test set is 10k samples, the batch size is 128, and drop_last=False in the dataloader, the last batch has (10000 / 128 - 10000 // 128) * 128 = 16 samples, each of which counts 128 / 16 = 8 times more than a sample in a full batch.
In the worst case those 16 samples are all correct while the 112 hypothetical samples that would complete the batch would all be wrong: the tally records one 100% batch instead of the 16/128 = 12.5% a full batch would contribute. If all other batches have 70% accuracy, this inflates the score by about 1.1 percentage points: ((78*0.7+1)/79 - (78*0.7+0.125)/79)*100 ≈ 1.1.
In practice the discrepancy is only about 0.2% on average; a sketch comparing the two ways of computing accuracy follows.
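A minimal sketch of the difference, assuming standard PyTorch-style evaluation; the function and variable names are illustrative.

```python
import torch

def eval_accuracy(model, testloader):
    """Returns (per_batch_acc, per_sample_acc) to show the averaging bias."""
    model.eval()
    batch_accs, correct, total = [], 0, 0
    with torch.no_grad():
        for images, labels in testloader:
            preds = model(images).argmax(dim=1)
            hits = (preds == labels).sum().item()
            batch_accs.append(hits / len(labels))  # equal weight per batch
            correct += hits                        # equal weight per sample
            total += len(labels)
    per_batch_acc = sum(batch_accs) / len(batch_accs)  # overweights the 16-sample batch 8x
    per_sample_acc = correct / total                   # the usual, unbiased definition
    return per_batch_acc, per_sample_acc
```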
Overleaf?
Candidates:
- width=64 instead of width=16: unlikely.
cnt = 0
...
total_cost = cost / cnt
This leads to division by zero and therefore an inf loss, which cripples optimization schemes that don't use the sim cost function (where this is fixed).
The original inversefed repo doesn't have cnt, so this was added in ATSPrivacy.
One solution could be to take only the angle into account; another is to guard the division, as in the sketch below.
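A guard sketch, assuming an inversefed-style accumulation loop; the function and variable names are illustrative, not the repository's actual code.

```python
import torch

def total_cost(trial_grads, target_grads, cost_fn="l2"):
    """Illustrative cost accumulation with a guard against cnt == 0."""
    cost, cnt = 0.0, 0
    for trial, target in zip(trial_grads, target_grads):
        if cost_fn == "l2":
            cost = cost + ((trial - target) ** 2).sum()
            cnt += 1
        # other cost functions may skip entries, potentially leaving cnt at 0
    if cnt == 0:
        # fail loudly instead of dividing by zero and poisoning the loss with inf
        raise ValueError("no gradient terms contributed to the cost")
    return cost / cnt
```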
The random policy generator can occasionally sample the same policy more than once. In that case the number of evaluated policies is lower than expected, i.e. for our setup it's 1592 instead of 1600. The same issue exists in the original implementation; a sketch of sampling unique policies follows.
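A minimal sketch of drawing unique policies, assuming each policy is a tuple of subpolicy indices; the counts match our setup, but the policy representation is an assumption.

```python
import random

def sample_unique_policies(n=1600, num_subpolicies=50, policy_length=3):
    """Keep drawing until n distinct policies have been collected."""
    seen = set()
    while len(seen) < n:
        seen.add(tuple(random.randrange(num_subpolicies)
                       for _ in range(policy_length)))
    return list(seen)

assert len(sample_unique_policies()) == 1600
```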
I noticed these lines:
In both cases, the loss is halved before being returned.
The line in the Classification class was acknowledged as a bug here and subsequently fixed. The author posted an update for table 1 and said they would try to update the paper on arXiv, which hasn't happened yet.
The PSNR class contains the same halving and was not discussed; find out whether this is a bug or intended. The acknowledged bug is present in the reference implementation we're working with because it depends on this work, so it seems we must assume that the experiments were run with it present. An illustration of the halving pattern follows.
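For concreteness, a hypothetical illustration of the pattern in question; these are not the actual class bodies, only the shape of the halving.

```python
import torch

class Classification:
    """Hypothetical sketch: the loss is halved before being returned."""
    def __call__(self, outputs, labels):
        loss = torch.nn.functional.cross_entropy(outputs, labels)
        return 0.5 * loss  # the halving in question

class PSNR:
    """Hypothetical sketch: the same 0.5 factor applied to the metric."""
    def __call__(self, img, ref, max_val=1.0):
        mse = ((img - ref) ** 2).mean()
        return 0.5 * 10 * torch.log10(max_val ** 2 / mse)  # halved PSNR
```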
It instead uses models initialized in inversefed/nn/models.py, which in turn must be trained with benchmark/cifar100_train.py.
There is a minus sign in the accuracy metric computation function that is not in equation (10).