When choosing the LDA Topic Modelling section, the following error appears:
ValueError: list.remove(x): x not in list
Traceback:
File "c:\users\doub2420.virtualenvs\drift-qengzvvy\lib\site-packages\streamlit\script_runner.py", line 337, in run_script
exec(code, module.__dict__)
File "C:\drift\app.py", line 1726, in <module>
year_paths.remove(os.path.join(vars["data_path"], "compass.txt"))
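The error means that the compass.txt path was not in year_paths when list.remove was called, since list.remove raises ValueError for a missing item. One possible guard is sketched below, assuming year_paths is a plain list of path strings. The helper name remove_if_present and the sample paths are hypothetical and not taken from app.py.

```python
import os


def remove_if_present(paths, target):
    """Return a copy of `paths` without `target`; no error if it is absent."""
    return [p for p in paths if p != target]


# Hypothetical stand-ins for the values used in app.py.
data_path = os.path.join("data", "years")
year_paths = [
    os.path.join(data_path, "2019.txt"),
    os.path.join(data_path, "2020.txt"),
]

# compass.txt is not in the list here, which is the situation that makes
# list.remove raise ValueError; the helper simply leaves the list unchanged.
compass_path = os.path.join(data_path, "compass.txt")
year_paths = remove_if_present(year_paths, compass_path)
```

This only hides the symptom. Whether compass.txt is expected to exist at that point is a separate question for the app logic.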
I would like an option to use custom stop words, and another to keep digits as they are instead of converting them to English number words (thousand, hundred, and so on), because the conversion does not help with tracking trends or analyzing the abstracts.
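A minimal sketch of what such options could look like, using only the standard library. The function name preprocess, the parameters custom_stop_words and keep_digits, and the tiny default stop-word set are all assumptions made for illustration; they are not the project's existing API.

```python
import re

# Small illustrative default list; a real list would be much longer.
DEFAULT_STOP_WORDS = frozenset({"the", "a", "an", "of", "and", "in", "to", "is"})

# A token is either a run of letters or a number (optionally with decimals).
TOKEN_RE = re.compile(r"[A-Za-z]+|\d+(?:\.\d+)?")


def preprocess(text, custom_stop_words=(), keep_digits=True):
    """Tokenize `text`, drop stop words, and keep or drop numeric tokens.

    Digits are never spelled out as words: they are either kept verbatim
    (keep_digits=True) or removed (keep_digits=False).
    """
    stop_words = DEFAULT_STOP_WORDS | {w.lower() for w in custom_stop_words}
    kept = []
    for token in TOKEN_RE.findall(text.lower()):
        if token in stop_words:
            continue
        if token[0].isdigit() and not keep_digits:
            continue
        kept.append(token)
    return kept


tokens = preprocess(
    "The model uses 300 dimensions in 2019",
    custom_stop_words=["model"],
)
```

With keep_digits=True the years and counts in an abstract survive as tokens, so they remain available for trend tracking.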
Use K-Means for clustering the diachronic word embeddings.
Rough Sketch:
The K-Means class can expose several methods: train, predict, store, visualise, etc.
The method(s) will take as input the word vectors from a particular timestamp, along with K-Means parameters such as the number of clusters.
Add functionality for visualisation.
Return the centroids and the cluster to which the words belong.
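The rough sketch above could start from something like the following dependency-free implementation of plain K-Means (Lloyd's algorithm). The class name KMeansClusterer, its method names, and the toy vectors are assumptions for illustration. Storage and visualisation are left out, and in practice scikit-learn's KMeans would probably be the more sensible backend.

```python
import math
import random


class KMeansClusterer:
    """Plain K-Means (Lloyd's algorithm) over word vectors from one timestamp."""

    def __init__(self, n_clusters=2, max_iter=100, seed=0):
        self.n_clusters = n_clusters
        self.max_iter = max_iter
        self.seed = seed
        self.centroids = []

    @staticmethod
    def _mean(vectors):
        # Component-wise mean of a non-empty list of equal-length vectors.
        return [sum(column) / len(vectors) for column in zip(*vectors)]

    def _nearest(self, vector):
        # Index of the centroid with the smallest Euclidean distance.
        distances = [math.dist(vector, c) for c in self.centroids]
        return distances.index(min(distances))

    def train(self, word_vectors):
        """Fit on a dict mapping word -> list of floats.

        Returns (centroids, labels), where labels maps word -> cluster index.
        """
        vectors = list(word_vectors.values())
        rng = random.Random(self.seed)
        # Initialise centroids from randomly chosen input vectors.
        self.centroids = [list(v) for v in rng.sample(vectors, self.n_clusters)]
        for _ in range(self.max_iter):
            groups = [[] for _ in self.centroids]
            for vector in vectors:
                groups[self._nearest(vector)].append(vector)
            # An empty cluster keeps its previous centroid.
            updated = [
                self._mean(group) if group else centroid
                for group, centroid in zip(groups, self.centroids)
            ]
            if updated == self.centroids:  # assignments stopped changing
                break
            self.centroids = updated
        return self.centroids, self.predict(word_vectors)

    def predict(self, word_vectors):
        """Assign each word to the index of its nearest centroid."""
        return {word: self._nearest(vec) for word, vec in word_vectors.items()}


# Toy vectors for a single timestamp (made up for the example).
vectors_2019 = {
    "king": [0.0, 0.0],
    "queen": [1.0, 0.0],
    "apple": [10.0, 0.0],
    "pear": [11.0, 0.0],
}
model = KMeansClusterer(n_clusters=2)
centroids, labels = model.train(vectors_2019)
```

One instance per timestamp keeps the clusterings independent, which makes it straightforward to compare how a word's cluster changes over time.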
For every timespan, identify keywords (make diagrams, see https://arxiv.org/pdf/2006.01131.pdf). Frequency is generally used as a proxy for keyword identification, but we can also explore methods like RAKE. Besides picking the words with the highest frequency in each timespan, we can look for the words with the largest jump in frequency between two consecutive timespans. The same analysis can be applied to n-grams, not just single words.
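The frequency-jump idea can be sketched with the standard library alone. The function names and the toy documents are hypothetical, and the score used here, the difference in relative frequency between two consecutive timespans, is one assumed definition of a "jump" among several possible ones. RAKE is not covered.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def term_counts(documents, n=1):
    """Count n-grams over all tokenized documents of one timespan."""
    counts = Counter()
    for tokens in documents:
        counts.update(ngrams(tokens, n))
    return counts


def top_frequency_jumps(prev_docs, curr_docs, n=1, top_k=5):
    """Rank terms of the current timespan by their rise in relative frequency."""
    prev = term_counts(prev_docs, n)
    curr = term_counts(curr_docs, n)
    prev_total = sum(prev.values()) or 1  # avoid division by zero
    curr_total = sum(curr.values()) or 1
    jumps = {
        term: curr[term] / curr_total - prev[term] / prev_total
        for term in curr
    }
    return sorted(jumps.items(), key=lambda item: item[1], reverse=True)[:top_k]


# Toy tokenized documents for two consecutive timespans.
docs_2019 = [["neural", "network", "model"], ["neural", "network"]]
docs_2020 = [["transformer", "model"], ["transformer", "network", "transformer"]]

most_frequent_2020 = term_counts(docs_2020).most_common(1)
rising_terms = top_frequency_jumps(docs_2019, docs_2020, top_k=1)
bigram_counts_2019 = term_counts(docs_2019, n=2)
```

Using relative rather than raw frequency keeps timespans with different corpus sizes comparable.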