claws-lab / petgen

A PyTorch implementation of the ACM SIGKDD 2021 paper titled "PETGEN: Personalized Text Generation Attack on Deep Sequence Embedding-based Classification Models"

License: MIT License

PETGEN: Personalized Text Generation Attack on Deep Sequence Embedding-based Classification Models (ACM SIGKDD 2021)

One-Line Description

We conduct an adversarial attack on popular deep learning-based malicious user detection models.

Introduction

What should a malicious user write next to fool a detection model? Identifying malicious users is critical to ensure the safety and integrity of internet platforms. Several deep learning-based detection models have been created. However, malicious users can evade deep detection models by manipulating their behavior, rendering these models of little use. The vulnerability of such deep detection models against adversarial attacks is unknown. Here we create a novel adversarial attack model against deep user sequence embedding-based classification models, which use the sequence of user posts to generate user embeddings and detect malicious users. In the attack, the adversary generates a new post to fool the classifier. We propose a novel end-to-end Personalized Text Generation Attack model, called PETGEN, that simultaneously reduces the efficacy of the detection model and generates posts that have several key desirable properties.

If you make use of this code, the PETGEN algorithm, or the datasets in your work, please cite the following paper:

@inproceedings{he2021petgen,
  title={PETGEN: Personalized Text Generation Attack on Deep Sequence Embedding-based Classification Models},
  author={He, Bing and Ahamad, Mustaque and Kumar, Srijan},
  booktitle={Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery \& Data Mining},
  pages={575--584},
  year={2021}
}

Data

The data is structured as follows (here we take a sequence with 3 posts as an example):

  • Sequence = (post1, post2, post3)
  • Context = (context1, context2, context3)
  • Label = 0 or 1 (0: benign, 1: malicious)

We then save the data as dictionaries in pickle files:

  • Seq2context: {(post1, post2, post3): (context1, context2, context3)}
  • Seq2label: {(post1, post2, post3):label}
  • Here is the link for the large Yelp dataset. The small Wikipedia data is already included in the repository.
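
As a minimal sketch of the layout described above, the following Python snippet builds the two dictionaries for a single sequence and round-trips them through pickle. The post and context strings, the file names, and the helpers save_pickle and load_pickle are placeholders chosen for illustration; they are not taken from the repository.

```python
import os
import pickle
import tempfile

# Hypothetical example data: one user sequence of three posts,
# with one context entry per post.
sequence = ("post1", "post2", "post3")
context = ("context1", "context2", "context3")
label = 1  # 0: benign, 1: malicious

# Tuples are hashable, so a whole sequence can serve as a dictionary key.
seq2context = {sequence: context}
seq2label = {sequence: label}


def save_pickle(obj, path):
    """Serialize obj to path with pickle (illustrative helper)."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_pickle(path):
    """Load a pickled object from path (illustrative helper)."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Round trip through a temporary directory; the file names are assumptions.
tmp_dir = tempfile.mkdtemp()
context_path = os.path.join(tmp_dir, "seq2context.pkl")
label_path = os.path.join(tmp_dir, "seq2label.pkl")
save_pickle(seq2context, context_path)
save_pickle(seq2label, label_path)

loaded_context = load_pickle(context_path)
loaded_label = load_pickle(label_path)
```

Because both dictionaries share the same key, the context and the label of a sequence can be looked up with the same tuple of posts.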

In the dataset directory, following the Text-GAN repository, we use the Wikipedia data as an example to show where to put the input data. wiki.txt is the training data; iw.txt and wi.txt are the generated word dictionaries. Under the testdata directory, wiki.pkl is the Seq2context file, context.txt is the context file, label.txt holds the label for each sequence, and test.txt is the testing data used during training, as in Text-GAN. To reuse the repository with your own data, create files with the corresponding names. We also provide code (/dataset/data_creation.py) to process the pickle file and generate the text files needed by the code.
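
To illustrate what a pair of word dictionaries such as iw.txt and wi.txt typically contains, here is a rough sketch that builds a word-to-index and an index-to-word mapping from whitespace-tokenized text. The function build_word_dicts and the sample corpus are hypothetical, and the on-disk format actually used by the repository is not shown here. Real vocabularies usually also reserve indices for special tokens such as padding, which this sketch omits.

```python
def build_word_dicts(lines):
    """Return (word2idx, idx2word) for whitespace-tokenized lines.

    Indices are assigned in order of first appearance.
    """
    word2idx = {}
    for line in lines:
        for token in line.split():
            if token not in word2idx:
                word2idx[token] = len(word2idx)
    # Invert the mapping so that indices can be decoded back to words.
    idx2word = {idx: word for word, idx in word2idx.items()}
    return word2idx, idx2word


# Hypothetical two-line corpus standing in for a training file.
corpus = ["this edit is fine", "this edit is vandalism"]
word2idx, idx2word = build_word_dicts(corpus)

# Encode and decode one line to check that the two mappings agree.
encoded = [word2idx[token] for token in corpus[1].split()]
decoded = " ".join(idx2word[idx] for idx in encoded)
```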

Code

To run the code, change to the run directory with cd run and use the following command (here we use the wiki data as an example; more details are in the Instructions section):

bash petgen.sh

To install the required packages, run:

pip install -r requirements.txt

Instructions

  1. Instructor & Model

For PETGEN, the entire running process is defined in instructor/real_data/instructor.py. Some basic functions like init_model() and optimize() are defined in the base class BasicInstructor in instructor.py. GAN-based frameworks have two components: 1) a generator and 2) a discriminator. Here, /models/generator.py is the code for the generator, while /models/discriminator.py is the code for the discriminator.

  2. Logging & Saving

A log directory records all logs. PETGEN uses Python's logging module to record the running process, such as the generator's loss and metric scores. For convenient visualization, two identical log files are saved, in log/log_****_****.txt and save/**/log.txt respectively. Additionally, a save directory stores the results and generated text. The code automatically saves the state dict of the models and one batch of the generator's samples in ./save/**/models and ./save/**/samples at every log step, where ** depends on your hyper-parameters. For instance, you can choose to save the pretrained generator by changing if_sav_pretrain in config.py. If you trained a generator in the past and want to reuse it, change if_use_saved_gen in config.py.
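
Writing the same records to two log files can be done with two file handlers on a single logger. The following is a minimal sketch of that pattern using Python's standard logging module; the function create_dual_logger, the logger name, the paths, and the log message are assumptions for illustration and do not reproduce the repository's code.

```python
import logging
import os
import tempfile


def create_dual_logger(name, log_path, save_log_path):
    """Return a logger that writes every record to both files."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records out of the root logger
    formatter = logging.Formatter("%(message)s")
    for path in (log_path, save_log_path):
        os.makedirs(os.path.dirname(path), exist_ok=True)
        handler = logging.FileHandler(path, mode="w")
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


# Hypothetical locations mirroring the log/ and save/ layout described above.
root = tempfile.mkdtemp()
log_file = os.path.join(root, "log", "log_example.txt")
save_file = os.path.join(root, "save", "example_run", "log.txt")

logger = create_dual_logger("petgen_sketch", log_file, save_file)
logger.info("[ADV] epoch 0: g_loss = 0.1234")  # made-up example record

with open(log_file) as f:
    log_text = f.read()
with open(save_file) as f:
    save_text = f.read()
```

Attaching both handlers to one logger keeps the two files identical without duplicating logging calls in the training loop.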

  3. Running Signal

You can control the training process with the class Signal (please refer to utils/helpers.py), which reads the dictionary file run_signal.txt. To use it, edit the local file run_signal.txt: for example, if you set pre_sig to False, the program stops the pre-training process and moves on to the next training phase. This makes it convenient to stop a phase early once you think the current training is sufficient.
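
The idea of a file-based run signal can be sketched as follows. This is not the repository's Signal class: the class name RunSignal, the use of ast.literal_eval to parse the file, and the simulated training loop are assumptions made for illustration. Only the file name run_signal.txt and the key pre_sig come from the description above.

```python
import ast
import os
import tempfile


class RunSignal:
    """Re-reads a small dictionary file to decide whether to keep training."""

    def __init__(self, signal_file):
        self.signal_file = signal_file
        self.pre_sig = True
        self.update()

    def update(self):
        # The file is assumed to hold a Python dict literal,
        # e.g. {'pre_sig': True}.
        with open(self.signal_file, "r") as f:
            signals = ast.literal_eval(f.read())
        self.pre_sig = bool(signals.get("pre_sig", True))


# Simulated pre-training loop in a temporary directory.
signal_path = os.path.join(tempfile.mkdtemp(), "run_signal.txt")
with open(signal_path, "w") as f:
    f.write("{'pre_sig': True}")

signal = RunSignal(signal_path)
epochs_run = 0
for epoch in range(5):
    if epoch == 2:
        # Stand-in for a user editing run_signal.txt while training runs.
        with open(signal_path, "w") as f:
            f.write("{'pre_sig': False}")
    signal.update()
    if not signal.pre_sig:
        break  # leave pre-training and continue with the next phase
    epochs_run += 1
```

Re-reading the file once per epoch keeps the overhead negligible while still letting a manual edit take effect at the next epoch boundary.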

  4. Automatically select GPU (Use GPU by default)

In config.py, the program automatically selects the GPU device with the lowest GPU-Util reported by nvidia-smi. This feature is enabled by default. To select a GPU device manually, uncomment the --device args in run_[run_model].py and specify a GPU device on the command line.
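
One way to pick the least-utilized GPU is to query nvidia-smi in CSV form and take the minimum. The sketch below assumes that approach; it is not necessarily how config.py does it, and the function names pick_least_utilized and query_gpu_utilization are hypothetical. The parsing step is kept separate so it can be tried on sample text without a GPU.

```python
import subprocess


def pick_least_utilized(csv_text):
    """Return the GPU index with the lowest utilization.

    csv_text has one "index, utilization" pair per line, as printed by
    nvidia-smi with csv,noheader,nounits formatting. Returns None if the
    text contains no lines. Non-numeric utilization values are not handled.
    """
    best_index, best_util = None, None
    for line in csv_text.strip().splitlines():
        index_str, util_str = [field.strip() for field in line.split(",")]
        util = int(util_str)
        if best_util is None or util < best_util:
            best_index, best_util = int(index_str), util
    return best_index


def query_gpu_utilization():
    """Ask nvidia-smi for per-GPU utilization (requires an NVIDIA driver)."""
    return subprocess.check_output(
        [
            "nvidia-smi",
            "--query-gpu=index,utilization.gpu",
            "--format=csv,noheader,nounits",
        ],
        text=True,
    )


# Sample output for three GPUs, used here instead of a live query.
sample_output = "0, 87\n1, 3\n2, 45\n"
chosen_device = pick_least_utilized(sample_output)
```

On a machine with GPUs, pick_least_utilized(query_gpu_utilization()) would return the index to pass to the framework's device selection.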

  5. Parameter

First, choose which dataset to use. In config.py, assign the target dataset (e.g., "wiki") to the variable dataset. Next, specify the hyperparameters used in training, such as the learning rate and the number of epochs. Following the Text-GAN repository, change the corresponding values in config.py and run_relgan.py. For example, to switch between training and testing mode, change if_test in run_relgan.py; to change the batch size, change batch_size in config.py. The same applies to other deep learning-related parameters.
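
The general pattern of defaults that can be overridden at launch time can be sketched with argparse. The names dataset, batch_size, and if_test come from the description above, but the default values, the DEFAULTS dictionary, and the function parse_config are hypothetical; the repository keeps its settings in config.py and run_relgan.py, whose exact contents are not reproduced here.

```python
import argparse

# Hypothetical defaults; the real values live in config.py and run_relgan.py.
DEFAULTS = {"dataset": "wiki", "batch_size": 64, "if_test": False}


def parse_config(argv):
    """Merge command-line overrides into the default settings."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset", type=str, default=DEFAULTS["dataset"])
    parser.add_argument("--batch_size", type=int, default=DEFAULTS["batch_size"])
    # 0/1 flag: 1 switches from training mode to testing mode.
    parser.add_argument("--if_test", type=int, default=int(DEFAULTS["if_test"]))
    args = parser.parse_args(argv)
    return {
        "dataset": args.dataset,
        "batch_size": args.batch_size,
        "if_test": bool(args.if_test),
    }


# Override the batch size and switch to testing mode; keep the default dataset.
cfg = parse_config(["--batch_size", "32", "--if_test", "1"])
```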

  • If you have any questions, please feel free to contact Bing He ([email protected]).
  • If you have any suggestions to make the release better, please feel free to send a message.
  • Our code is based on the Text-GAN repository (many thanks). If possible, please make sure that Text-GAN runs in your environment first.

petgen's Issues

About Multi-Stage Multi-Task Learning

Hi author!
Thank you for your interesting and great work.
I am confused about the multi-stage multi-task learning described in the paper, based on the corresponding figure there.
When doing multi-stage multi-task learning, should we train each task until convergence before moving on to the next task, or is it enough to train for some epochs (steps) and then switch to the next task?

Thank you.
