rajaswa / drift

DRIFT is a tool for Diachronic Analysis of Scientific Literature.

Home Page: https://aclanthology.org/2021.emnlp-demo.40/

License: MIT License

Makefile 0.55% Python 99.45%

diachronic-embeddings scientific-visualization nlp hacktoberfest

drift's Introduction

Hi there, I'm Rajaswa - aka rajaswa 👋

I'm a senior year undergraduate student at BITS Goa.

🌱 I’m currently exploring Computational Psycholinguistics
👯 I’m looking to collaborate with other researchers and linguists
🥅 2020 Goals: Building computational tools for researchers working at the intersection of linguistic theory and NLP!
😄 Pronouns: He/Him
⚡ Fun fact: Most Indic languages have grammatical gender!

drift's People

gchhablani abheesht17 harsh4799 trendingtechnology darkknight2223 jayantb1019 ag027592 lexuss-d tanay0nspark azizighani doubianimehdi ainlp123

drift's Issues

Data Scraping

Scrape data from here
Make a separate Excel file for every conference
The data can also be stored as JSON
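The per-conference storage step could be sketched as follows, assuming the scraped records are already available as a list of dictionaries with a `conference` key. The record fields and the `save_by_conference` helper are hypothetical, and the sketch covers only the JSON option, since Excel output would need an extra dependency.

```python
import json
from collections import defaultdict
from pathlib import Path


def save_by_conference(records, out_dir):
    """Write one JSON file per conference and return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Group the scraped records by their (assumed) "conference" field.
    grouped = defaultdict(list)
    for record in records:
        grouped[record["conference"]].append(record)

    paths = []
    for conference, items in sorted(grouped.items()):
        path = out_dir / f"{conference}.json"
        with path.open("w", encoding="utf-8") as handle:
            json.dump(items, handle, ensure_ascii=False, indent=2)
        paths.append(path)
    return paths
```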

Add POS/TF-IDF to WordCloud and Keyword Extraction

Background Color change in WordCloud throws error

Alternatives to top-k selection

Top-k word selection is not very useful at the moment.
Need to implement the following options:

POS-based selection for keywords.
TF/IDF-based selection.
YAKE-based selection.
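As an illustration of the TF/IDF option, here is a minimal pure-Python sketch that scores words across tokenised documents and keeps the top k. The `tfidf_top_k` name, the unsmoothed `log(N / df)` weighting, and the summing of scores across documents are all assumptions made for the example, not the project's implementation.

```python
import math
from collections import Counter


def tfidf_top_k(documents, k=5):
    """Return the k highest-scoring (word, score) pairs over tokenised documents."""
    n_docs = len(documents)
    # Document frequency: the number of documents each word appears in.
    doc_freq = Counter(word for doc in documents for word in set(doc))

    scores = Counter()
    for doc in documents:
        term_freq = Counter(doc)
        for word, count in term_freq.items():
            idf = math.log(n_docs / doc_freq[word])
            scores[word] += (count / len(doc)) * idf

    # Highest score first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
```

Words that occur in every document get a score of zero, which is what makes this more selective than a plain frequency cut-off.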

Unclear app issue during app start

An unclear issue occurs while the app starts. It needs to be fixed before making things public.

Initialize Code Structure

We want a code structure flexible enough to work with ReactJS/Flask.

Add statistics to analysis methods

We need to display data frames of statistical information: year-wise word frequency, the number of articles/words/tokens/POS tags in each year, etc.
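A rough sketch of how such year-wise statistics might be computed before being shown as data frames. It assumes the corpus is a mapping from year to a list of article texts and uses simple whitespace tokenisation; the `yearly_statistics` helper and its output keys are hypothetical, and POS counts are left out because they would need a tagger.

```python
from collections import Counter


def yearly_statistics(articles_by_year):
    """Map each year to its article count, token count, vocabulary size and word frequencies."""
    stats = {}
    for year, articles in sorted(articles_by_year.items()):
        # Whitespace tokenisation is a simplifying assumption.
        tokens = [tok for text in articles for tok in text.lower().split()]
        stats[year] = {
            "articles": len(articles),
            "tokens": len(tokens),
            "unique_words": len(set(tokens)),
            "word_frequency": Counter(tokens),
        }
    return stats
```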

Add option for excluding common academic words

We need to find a list of common academic words and add a checkbox in the Streamlit app to remove them, as they may not provide much insight.

When choosing the LDA Topic Modelling section, the following message appears:
ValueError: list.remove(x): x not in list
Traceback:
File "c:\users\doub2420.virtualenvs\drift-qengzvvy\lib\site-packages\streamlit\script_runner.py", line 337, in run_script
exec(code, module.__dict__)
File "C:\drift\app.py", line 1726, in
year_paths.remove(os.path.join(vars["data_path"], "compass.txt"))

What does this mean?

Thanks!
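For context: Python's `list.remove(x)` raises `ValueError` when `x` is not in the list, so the traceback suggests that the `compass.txt` path was not among `year_paths` when that line ran. One defensive way to write the step is sketched below. The `drop_compass` helper is hypothetical and is not the project's actual fix; it only shows the removal tolerating a missing file.

```python
import os


def drop_compass(year_paths, data_path, compass_name="compass.txt"):
    """Return year_paths without the compass file, whether or not it is present."""
    compass_path = os.path.join(data_path, compass_name)
    # Filtering never raises, unlike list.remove on a missing element.
    return [path for path in year_paths if path != compass_path]
```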

Changes to Tracking Clusters Method

We need to either:

Add multiple graphs to show clusters changing
Show clusters moving within the same graph (harder, and might be messy)

Custom stop-word list, and an option to remove digits rather than transform them into English numerals

Hi,

I would like an option to use custom stop words, and also an option not to have digits transformed into English number words (thousand, hundred, and so on), because that doesn't help with tracking trends or analyzing the abstracts.

Thank you!
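A minimal sketch of what the requested options could look like in a preprocessing step. The `preprocess` function, its `stop_words` and `digits` parameters, and the regex tokeniser are all hypothetical, not part of the tool's current API.

```python
import re


def preprocess(text, stop_words=(), digits="keep"):
    """Tokenise text, drop custom stop words, and keep or remove digit tokens.

    digits="keep" leaves numbers such as "2021" untouched;
    digits="remove" drops purely numeric tokens.
    """
    if digits not in {"keep", "remove"}:
        raise ValueError("digits must be 'keep' or 'remove'")
    stop_words = {word.lower() for word in stop_words}
    # Lowercase alphanumeric runs; punctuation is discarded.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    kept = []
    for token in tokens:
        if token in stop_words:
            continue
        if digits == "remove" and token.isdigit():
            continue
        kept.append(token)
    return kept
```

Neither mode converts digits into number words, which is the behaviour the request asks to avoid.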

Changes to Productivity Plot

Check whether the cluster labels are more or less correct; otherwise we will remove or change the cluster table.
Formatting changes for cluster table might be required
Labels in dataframe should be named

Add frequency plots in productivity analysis method

We need to add a frequency plot alongside the productivity plots in order to help with the analysis.

Improve UI documentation/explanations

We need better "About" and "Summary" sections.
We also need "How to Infer" sections.

Improve DataFrame in Productivity/Frequency Plot

Switch to valedica's Gensim fork

Add Part-of-Speech download option from NLTK

This should probably be done in app.py unless it can be done in setup.py.

Predict keywords for the next few years

References:
https://www.aclweb.org/anthology/L16-1052/

Add Similarity Matrices and Track Acceleration

Make similarity matrices for keywords. Track their acceleration (https://sci-hub.se/10.1109/ijcnn.2019.8852140).
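A small sketch of the idea, assuming each time step provides a mapping from keyword to embedding vector. Here "acceleration" is taken to mean the change in pairwise cosine similarity between two consecutive time steps; this is an assumption for illustration and may differ from the definition in the linked paper. All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(vectors, keywords):
    """Pairwise cosine similarities, with rows and columns ordered as keywords."""
    return [[cosine(vectors[a], vectors[b]) for b in keywords] for a in keywords]


def acceleration(previous, current):
    """Element-wise change in similarity between two consecutive time steps."""
    return [
        [cur - prev for prev, cur in zip(prev_row, cur_row)]
        for prev_row, cur_row in zip(previous, current)
    ]
```

A large positive entry marks a keyword pair whose embeddings moved closer together between the two time steps.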

Track Trends with Similarity error while selecting next column

Is that a bug?

Improve Multi-word token finding in Productivity Plot

Top-K in Yake is unused

Top-K should either be removed or used in the plot. If top-K is used, there should be multiple bars in the bar plot.

Tracking trends with Similarity - Dropdown Refresh Bug

After selecting a word, the next set of words does not refresh the dataframe.

Script for Clustering Word Embeddings

Use K-Means for clustering the diachronic word embeddings.
Rough Sketch:
- The class can have multiple functions: train, predict, store, visualisation, etc.
- The function(s) will take as input word vectors from a particular timestamp. They will also take as input parameters of K-Means like number of clusters, etc.
- Add functionality for visualisation.
- Return the centroids and the cluster to which the words belong.
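The rough sketch above could look something like the following. This is a dependency-free illustration using plain Lloyd's algorithm with a deterministic initialisation (the first `n_clusters` vectors); a real implementation would more likely use an existing library. The `WordKMeans` class and its method names are hypothetical, and the store and visualisation functions are omitted.

```python
import math


class WordKMeans:
    """Minimal k-means (Lloyd's algorithm) over word vectors from one timestamp."""

    def __init__(self, n_clusters=2, max_iter=100):
        self.n_clusters = n_clusters
        self.max_iter = max_iter
        self.centroids = []

    def train(self, word_vectors):
        """Fit centroids and return a {word: cluster index} mapping."""
        vectors = [list(vec) for vec in word_vectors.values()]
        if len(vectors) < self.n_clusters:
            raise ValueError("need at least n_clusters word vectors")
        # Deterministic initialisation: the first n_clusters vectors.
        self.centroids = [vec[:] for vec in vectors[: self.n_clusters]]
        for _ in range(self.max_iter):
            labels = [self._nearest(vec) for vec in vectors]
            updated = []
            for index, old in enumerate(self.centroids):
                members = [v for v, lab in zip(vectors, labels) if lab == index]
                if members:
                    # New centroid is the per-dimension mean of its members.
                    updated.append([sum(dim) / len(members) for dim in zip(*members)])
                else:
                    updated.append(old)  # an empty cluster keeps its centroid
            if updated == self.centroids:
                break  # converged
            self.centroids = updated
        return self.predict(word_vectors)

    def predict(self, word_vectors):
        """Assign each word to the index of its nearest centroid."""
        return {word: self._nearest(vec) for word, vec in word_vectors.items()}

    def _nearest(self, vector):
        distances = [math.dist(vector, centroid) for centroid in self.centroids]
        return distances.index(min(distances))
```

After `train`, the centroids are available as the `centroids` attribute and the returned mapping gives the cluster of each word.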

Keyword Extraction from every timespan

For every timespan, identify keywords (and make diagrams like those in https://arxiv.org/pdf/2006.01131.pdf). Frequency is generally used as a proxy for keyword identification, but we can also explore methods like RAKE. Beyond identifying the words with the highest frequency in each timespan, we can look for the words with the largest jump in frequency between two consecutive timespans. We can also analyse n-grams, not just words.
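The "highest jump in frequency" idea can be sketched as below, assuming each timespan is available as a flat list of tokens. The `ngrams` and `biggest_jumps` helpers are hypothetical, and using the difference in relative frequency as the jump measure is one possible choice among several.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the space-joined n-grams of a token list."""
    return [" ".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


def biggest_jumps(previous_tokens, current_tokens, n=1, k=5):
    """Rank n-grams by their increase in relative frequency between two timespans."""
    prev = Counter(ngrams(previous_tokens, n))
    curr = Counter(ngrams(current_tokens, n))
    # Guard against empty timespans when normalising.
    prev_total = sum(prev.values()) or 1
    curr_total = sum(curr.values()) or 1
    jumps = {
        gram: curr[gram] / curr_total - prev[gram] / prev_total
        for gram in set(prev) | set(curr)
    }
    # Largest increase first; ties are broken alphabetically.
    return sorted(jumps.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
```

Setting `n=2` or higher applies the same ranking to n-grams instead of single words.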