quynhneo / detm-arxiv

Implementation of Dynamic Embedded Topic Modeling on arxiv.org articles

License: GNU General Public License v3.0

Python 100.00%

detm-arxiv's Introduction

Extracting topic trends from paper abstracts with DETM

Quynh M. Nguyen a, b and Kyle Cranmer a, c

a Physics Department, New York University, New York 10003

b Applied Math Lab, Courant Institute, New York University, New York 10012

c Center for Data Science, New York University, New York 10011

Project description

This project runs dynamic embedded topic modeling (DETM) on abstracts of arXiv articles to discover how topics in STEM change over time. It is an implementation of Dynamic Embedded Topic Modeling by Adji B. Dieng, Francisco J. R. Ruiz, and David M. Blei of Columbia University.

Get the abstracts

Visit https://www.kaggle.com/Cornell-University/arxiv to get arxiv-metadata-oai-snapshot.json, which contains about 2 million records. Each record has about a dozen fields; we are interested in abstract, categories, and update_date.
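For orientation, here is a minimal sketch of how the three fields could be pulled out of the snapshot. It is not the code in this repository: the function name iter_abstracts is hypothetical, and it assumes the file holds one JSON object per line, that categories is a space-separated string, and that update_date looks like YYYY-MM-DD.

```python
import json
from typing import Iterator, Optional


def iter_abstracts(path: str, category: Optional[str] = None) -> Iterator[dict]:
    """Yield the abstract, category list, and year of each record.

    Assumptions (not taken from the repository code): one JSON object per
    line, a space-separated `categories` string, and an `update_date`
    of the form YYYY-MM-DD.
    """
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            categories = record.get("categories", "").split()
            # Keep only records tagged with the requested category, if any.
            if category is not None and category not in categories:
                continue
            yield {
                "abstract": record.get("abstract", "").strip(),
                "categories": categories,
                "year": int(record["update_date"][:4]),
            }
```

Reading line by line keeps memory use flat, which matters for a file with about 2 million records. For example, iter_abstracts(path, category="hep-ph") would stream only the hep-ph abstracts.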

Generate embeddings with word2vec

Modify the path to arxiv-metadata-oai-snapshot.json in arxivtools/word2vec.py and run:

python arxivtools/word2vec.py

This reads in the abstracts, removes punctuation, removes the stop words listed in arxivtools/stops.txt, removes rare words that appear in fewer than 30 abstracts as well as words that appear in more than 70% of abstracts, and produces vector representations of all the remaining words (default embedding dimension = 300) using the original settings from the Mikolov et al. 2013 NIPS paper. The results are saved as embeddings.txt, where each line is a word followed by 300 numbers. The process takes about an hour per 150,000 abstracts on a laptop.
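The filtering rules above can be sketched in a few lines of plain Python. This is an illustration only, not the code in arxivtools/word2vec.py: the helper names tokenize and filter_vocabulary are hypothetical, and the tokenizer (lowercase, keep runs of letters and digits) is an assumption about how punctuation is removed.

```python
import re
from collections import Counter
from typing import Iterable, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep runs of letters/digits (assumed rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def filter_vocabulary(
    docs: Iterable[List[str]],
    stop_words: Set[str],
    min_df: int = 30,      # drop words found in fewer than 30 abstracts
    max_df: float = 0.7,   # drop words found in more than 70% of abstracts
) -> Set[str]:
    """Return the words that survive the stop-word and frequency filters."""
    doc_freq: Counter = Counter()
    n_docs = 0
    for tokens in docs:
        n_docs += 1
        doc_freq.update(set(tokens))  # count each word once per abstract
    return {
        word
        for word, df in doc_freq.items()
        if word not in stop_words and df >= min_df and df / n_docs <= max_df
    }
```

Counting each word once per abstract makes the thresholds document frequencies rather than raw counts, which is what "appear in fewer than 30 abstracts" describes.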

Clone our fork of the original DETM repository

This is the main repo for DETM. We have made some changes to fix runtime errors, match the settings in the paper, and adapt to the arXiv metadata file, but have not changed the model:

git clone https://github.com/quynhneo/DETM

The environment can be set up with pip or conda. For example, using conda:

conda create --name detm --file requirements.txt 
conda activate detm

Preprocess text data

This step converts each abstract to a bag of words (a bag of integer tokens, to be exact) with a timestamp, and splits the data into train, validation, and test sets. These are stored in .mat files. It also creates the vocabulary of all the abstracts, stored in vocab.txt. This is just a list of words, without vectors; the vectors are taken from embeddings.txt. Ideally the two lists contain the same words, or the vocabulary is a large subset of the words in embeddings.txt. Modify the path to arxiv-metadata-oai-snapshot.json in scripts/data_undebates.py and run:

python scripts/data_undebates.py

This takes about 5 minutes per 150,000 abstracts on a laptop. With the default settings, the output is saved in script/split_paragraph_False/min_df_30
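Since the vectors come from embeddings.txt, it can be worth checking how much of vocab.txt is actually covered before training. The sketch below is one possible check, not part of either repository. The function names are hypothetical, and it assumes vocab.txt holds one word per line and that each line of embeddings.txt starts with the word, followed by whitespace-separated numbers.

```python
from typing import List, Set, Tuple


def load_vocab(path: str) -> List[str]:
    """Read the vocabulary, assuming one word per line."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def load_embedding_words(path: str) -> Set[str]:
    """Collect the first token of every line, i.e. the embedded words."""
    words = set()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.split()
            if parts:
                words.add(parts[0])
    return words


def embedding_coverage(
    vocab: List[str], embedding_words: Set[str]
) -> Tuple[float, List[str]]:
    """Return the fraction of vocabulary words with a vector, and the rest."""
    missing = [word for word in vocab if word not in embedding_words]
    coverage = 1 - len(missing) / len(vocab) if vocab else 0.0
    return coverage, missing
```

A coverage close to 1.0 means nearly every vocabulary word has a prefit vector; a long list of missing words suggests the two preprocessing steps used different filters.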

Run Dynamic Embedded Topic Modeling

To run with the default settings, change two lines in main.py. Line 34 (https://github.com/quynhneo/DETM/blob/master/main.py#L34) sets the parent folder of the preprocessed data folder min_df_30. Line 35 (https://github.com/quynhneo/DETM/blob/master/main.py#L35) sets the path to the prefit embeddings file embeddings.txt. Then run:

python main.py

This stage takes much longer and should be run on a GPU (CPU mode is too slow even on a 16-core node).

More instructions for running on a cluster using CUDA are here

The output is three .mat files in the results folder.

Plot the results

Edit beta_file in plot_word_evolution.py to be the path to the file ending in _beta in results and run:

python plot_word_evolution.py 
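To make clear what such a plot reads off the topic-word probabilities, here is a small sketch of extracting one word's curve. It is not taken from plot_word_evolution.py. The function names are hypothetical, and the layout is an assumption: beta is indexed as beta[topic][time][word_id], with word_id being the word's position in the vocabulary list.

```python
from typing import List, Sequence


def word_trajectory(
    beta: Sequence, vocab: List[str], topic: int, word: str
) -> List[float]:
    """Return the probability of `word` in `topic` at every time step.

    Assumes beta[topic][time][word_id], where word_id is the index of
    the word in `vocab` (an assumption about the file layout).
    """
    word_id = vocab.index(word)  # raises ValueError if the word is absent
    return [time_slice[word_id] for time_slice in beta[topic]]


def peak_year(trajectory: Sequence[float], first_year: int) -> int:
    """Return the year at which the trajectory is largest.

    Assumes one time step per year, starting at `first_year`.
    """
    peak_index = max(range(len(trajectory)), key=lambda i: trajectory[i])
    return first_year + peak_index
```

Plotting the list returned by word_trajectory against the years gives one curve of the kind produced for each selected word. If the saved array uses a different axis order, the indexing in word_trajectory has to be adapted.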

Results

The plot below shows results for DETM trained on the hep-ph (high energy physics phenomenology) category, which contains 150,000 abstracts. Six of the 50 topics are shown. For each topic, the probabilities of some selected words (in most cases, words with high probability) are plotted against time (2007-2020).

Figure: result (word probabilities over time for six of the 50 topics)

In topics #33 and #34, the peak probability of the word 750 coincides with the flurry of papers around 2015-2016 on a possible discovery of new physics at 750 GeV, which turned out to be just a statistical fluke. Topic #38 shows the rise of the word higgs around the time of the discovery of the Higgs boson in 2012.

The plots above come from training for 400 epochs on the 150,000 hep-ph abstracts. We used one Nvidia RTX8000 GPU, and the runtime was 13 hours.
