GithubHelp home page GithubHelp logo

neural_lda_approximation's Introduction

Neural LDA Approximation

Approximation of the latent dirichlet allocation via deep neural networks.

Preliminaries

Download NLTK corpora which are used for defining stopwords and lemmatizing stuff:

python -m nltk.downloader stopwords
python -m nltk.downloader reuters
python -m nltk.downloader wordnet

Download the english wikipedia dump with:

mkdir ./data/wikipedia_dump/
cd ./data/wikipedia_dump/
wget -nc https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

Preprocess the files by calling the gensim script in the same directory as the dump:

python -m gensim.scripts.make_wiki <PATH_TO_WIKI_DUMP> ./wiki

(This will take about 7-8 hours depending on your CPU) Then decompress the id->word mapping by using bzip2 e.g.:

bunzip2 wiki_wordids.txt.bz2

Training the LDA will also take several hours. So make sure to use the --num_workers flag to speed up the computation accordingly to your computers specs.

How to fix ImportError: cannot import name 'zero_gradients' from 'torch.autograd.gradcheck':

Usage:

1. Attack

python main.py [-h] [--gpu | -g GPU] [--num_workers | -w WORKERS] [--num_topics | -t TOPICS] [--from_scratch | -s]

e.g.

python main.py --attack_id 33 --l2_attack --advs_eps 100

Arguments

Argument Type Description
-h, --help None shows argument help message
-g, --gpu INT specifies which GPU should be used [0, 1] (default=0)
-b, --batch_size INT specifies the batch size (default=128)
-e, --epochs INT specifies the training epochs (default=100)
-l, --learning_rate FLOAT specifies the learning rate (default=0.01)
-w, --num_workers INT number of workers to compute the LDA model (default=4)
-t, --num_topics INT number of topics which the LDA and the DNN tries to assign the text into
-a, --attack_id INT specifies the word id for the target of the adversarial attack
-ae, --advs_eps FLOAT specifies the epsilon for the adversarial attack
-ai, --prob_attack BOOL flag to activate whole probability distribution target attack
-r, --random_test BOOL enables random test documents for evaluation
-s, --from_scratch BOOL flag to ignore every pretrained model and datasets and create everything from scratch
-ts, --topic_stacking BOOL flat to use the topic stacking attack to evaluate the performance for more than one target topic at the same time
-l2, --l2_attack BOOL flag to activate the L2 norm attack (rounded floats) instead of the LINF (whole integers)
-f, --full_attack BOOL flag to attack every topic
-v, --verbose BOOL flag to set Gensim to verbose mode to print the LDA information during it's training

2. LDA Matching

python lda_matching.py [--benchmark | -b]

3. LDA Stability Test

python test_lda_stability.py [--corpus_sizes | -cs]

e.g.

python test_lda_stability.py --corpus_sizes 0.01 0.1 0.5 1.0

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.