bcipolli / mls-aihack-mother-repo

Hackathon project for Machine Learning Society in San Diego

Python 0.32% Jupyter Notebook 4.73% HTML 94.95%

mls-aihack-mother-repo's Introduction

Goal:

Match news articles that cover the same event
Find the text common to those articles, and remove it
Cluster the remaining "residual words" via topic modeling
Apply the "topics" to all residuals from a news source, to get that source's "bias"
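
The second step can be sketched with the standard library alone. This is a minimal illustration, not the code in main.py: the function name `split_truth_and_residual`, the `>=` comparison, and the sample headlines are all assumptions made for the example.

```python
from collections import Counter


def split_truth_and_residual(articles, truth_thresh=0.5):
    """Split an event's articles into shared "truth" words and residuals.

    articles: list of token lists, all covering the same news event.
    truth_thresh: fraction of articles that must contain a word for it
        to count as common text (hypothetical name and comparison).
    """
    # Document frequency: in how many articles does each word appear?
    doc_freq = Counter()
    for tokens in articles:
        doc_freq.update(set(tokens))

    n_articles = len(articles)
    truth = {w for w, n in doc_freq.items() if n / n_articles >= truth_thresh}

    # Whatever is left in each article is its "residual" wording.
    residuals = [[w for w in tokens if w not in truth] for tokens in articles]
    return truth, residuals


# Made-up headlines standing in for three articles on one event.
event = [
    "senate passes budget bill after heated debate".split(),
    "senate passes disastrous budget bill".split(),
    "historic budget bill clears senate".split(),
]
truth, residuals = split_truth_and_residual(event)
```

With these inputs, the words shared by at least half of the articles are dropped, and each article keeps only the wording that is particular to it.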

Setup & Installation:

pip install -r requirements.txt
python -c "import nltk; nltk.download()"

In the NLTK downloader that opens, choose the "popular" collection.

NLP Demo

This demo shows stemming, lemmatizing, and word counting (including tf-idf)

python nlp_demo.py
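
For readers unfamiliar with tf-idf, here is a small standard-library sketch of the word-counting part. It is not the contents of nlp_demo.py, and it skips stemming and lemmatizing (those need the NLTK data installed above). The weighting used, term frequency times log(N / document frequency), is one common variant and is an assumption here.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: weight} dict per document.

    tf  = raw count / document length
    idf = log(number of documents / documents containing the word)
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        weights.append({
            w: (c / len(tokens)) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        })
    return weights


# Made-up sentences: only the last word differs between them.
docs = [
    "the vote was a triumph".split(),
    "the vote was a disaster".split(),
]
weights = tf_idf(docs)
```

Words that appear in every document get a weight of zero, while the words that distinguish one document from the other keep a positive weight.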

Downloading data

Run

python registry_data.py

You can tweak parameters, such as the minimum number of articles per event or the API key, within the script.

Modeling

python main.py

Viewing Results

At the end of the modeling process, a 3D graph is generated to visualize the results.

Results

Found common words across news articles within an event.
When clustering “residual” words via LDA, a lot of emotion words appear
Sources did not separate by topic
- MAYBE: sources use emotional words to describe the news; not consistent by event.

Future Directions

Model news source bias within a particular topic
Boost / attenuate emotion words via sentiment analysis
See if there’s bias by author
Include & apply fake news dataset.

mls-aihack-mother-repo's Issues

Create 3D plot focused on news sources, not LDA topics

From @frogman141 :

I think I get what you're asking: you not only want a scatter of news bias per news article, you also want an even larger dot that encapsulates each source's range of bias.
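
One possible way to size that larger dot is a centroid plus a radius per source. The sketch below assumes each article already has a 3D coordinate; the function name `summarize_sources`, the choice of the maximum distance as the radius, and the sample points are all hypothetical.

```python
import math
from collections import defaultdict


def summarize_sources(points):
    """points: list of (source, (x, y, z)) tuples, one per article.

    Returns {source: (centroid, radius)}, where radius is the largest
    distance from the centroid to any of that source's articles.
    """
    by_source = defaultdict(list)
    for source, xyz in points:
        by_source[source].append(xyz)

    summary = {}
    for source, coords in by_source.items():
        # Average each axis separately to get the centroid.
        centroid = tuple(sum(axis) / len(coords) for axis in zip(*coords))
        radius = max(math.dist(centroid, c) for c in coords)
        summary[source] = (centroid, radius)
    return summary


# Made-up article coordinates for two sources.
points = [
    ("source_a", (0.0, 0.0, 0.0)),
    ("source_a", (2.0, 0.0, 0.0)),
    ("source_b", (1.0, 1.0, 1.0)),
]
summary = summarize_sources(points)
```

The maximum distance is sensitive to outliers; a standard deviation or a percentile of the distances might give a more stable dot size.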

Limit # points per category in 3d plot

The plot gets overwhelming when there are tons of points per category. Better to set a max value and subsample when needed; perhaps 100 per category?
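
A rough sketch of that cap, assuming the plot data is available as (category, point) pairs. The function name `cap_per_category` and the fixed seed are illustrative choices, not part of the existing code.

```python
import random
from collections import defaultdict


def cap_per_category(rows, max_per_category=100, seed=0):
    """rows: list of (category, point) tuples. Returns a capped list."""
    rng = random.Random(seed)  # fixed seed so the plot is reproducible
    by_category = defaultdict(list)
    for category, point in rows:
        by_category[category].append(point)

    capped = []
    for category, pts in by_category.items():
        # Only subsample the categories that exceed the cap.
        if len(pts) > max_per_category:
            pts = rng.sample(pts, max_per_category)
        capped.extend((category, p) for p in pts)
    return capped


# Made-up data: one oversized category and one small one.
rows = [("a", i) for i in range(250)] + [("b", i) for i in range(40)]
capped = cap_per_category(rows)
```

Category "a" is cut down to 100 points, while category "b" is left untouched because it is already under the cap.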

Add arg to group articles by source

Currently, LDA runs on a per-article basis. It would be good to add a flag to run it on a per-source basis as well.

In this case, we'd need to apply the LDA model to each article.
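
The grouping itself could look something like the following. This is only a sketch of how the "documents" might be built before fitting; `build_documents` is a hypothetical helper, and the LDA fitting and per-article application are left out.

```python
from collections import defaultdict


def build_documents(articles, groupby="article"):
    """articles: list of (source, tokens) tuples.

    Returns (labels, documents): one document per article, or one
    merged document per news source.
    """
    if groupby == "article":
        labels = [source for source, _ in articles]
        documents = [list(tokens) for _, tokens in articles]
        return labels, documents
    if groupby == "source":
        merged = defaultdict(list)
        for source, tokens in articles:
            merged[source].extend(tokens)
        return list(merged), list(merged.values())
    raise ValueError(f"unknown groupby: {groupby!r}")


# Made-up residual words for three articles from two sources.
articles = [
    ("source_a", ["shocking", "disaster"]),
    ("source_b", ["historic"]),
    ("source_a", ["outrage"]),
]
labels, documents = build_documents(articles, groupby="source")
```

After fitting on the per-source documents, the per-article token lists would still be needed so the fitted model can be applied to each article for plotting.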

Apply LDA model to Kaggle fake news

Not sure exactly how to do this, as the "truth" words have been removed from our article text.

Could start by getting all non-stopwords, applying our vocabulary, and plotting in the same scatter plot.
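
As a starting point, projecting a new article onto a fixed vocabulary might look like this. The helper name, the tiny vocabulary, and the stopword set are made up for illustration; the real vocabulary would come from the fitted vectorizer.

```python
from collections import Counter


def vectorize_with_vocabulary(tokens, vocabulary, stopwords=frozenset()):
    """Count only the words that are in the fixed vocabulary.

    Returns a list of counts aligned with the order of `vocabulary`;
    out-of-vocabulary words and stopwords are ignored.
    """
    counts = Counter(t for t in tokens if t not in stopwords)
    return [counts[w] for w in vocabulary]


# Hypothetical vocabulary learned from the residual words.
vocabulary = ["shocking", "disaster", "triumph"]
stopwords = frozenset({"the", "a", "was"})
tokens = "the shocking vote was a disaster a total disaster".split()
vector = vectorize_with_vocabulary(tokens, vocabulary, stopwords)
```

The resulting count vector has the same columns as the training data, so in principle it could be passed to the already-fitted LDA model and plotted alongside the other articles.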

Explore the parameter space to try and improve results.

To play:

python main.py --csv-file raw_dataframe.csv

Then add flags to explore the parameter space:

--source-thresh SOURCE_THRESH
    Min % of events a news source must cover, to be included.
    Default 0.5; lowering this would include a broader set of news sources.

--min-article-length MIN_ARTICLE_LENGTH
    Min # words in an article (pre-parsing).
    Set to 250. Are longer articles more biased?

--min-vocab-length MIN_VOCAB_LENGTH
    Min # words in an article (post-lemmatizing, vectorizing).
    Set to 100. Are longer articles more biased?

--lda-min-appearances LDA_MIN_APPEARANCES
    Min # appearances of a word, to be included in the vocabulary.
    Set to 2. Could raise this, to focus on the most common words.

--lda-vectorization-type {count,tfidf}
    Type of vectorization used to turn an article into word counts.
    Set to count. Not 100% sure tfidf is working, but if it is, we should use it.

--lda-groupby {source,article}
    Run LDA on text separated by article, or by news source?
    Set to article right now. This just means: what are the "documents" (sets of words) sent into LDA? Could be by article, or could aggregate over source.

--lda-topics LDA_TOPICS
    # of LDA topics.
    Set to 10. Clusters indicate that maybe a higher number could be helpful.

--lda-iters LDA_ITERS
    # of LDA iterations.
    Set to 1500. Probably could be lowered for larger datasets.

--truth-frequency-thresh TRUTH_FREQUENCY_THRESH
    % of articles in a news event that must mention a word, for it to be "truth" / removed.
    Set to 0.5. Could be higher (e.g. 1.1, to force no words to be removed) or lower (e.g. 0.1, to remove most words and leave only infrequent words for bias). Could also be implemented as a range, to say: bias words appear often, but not as often as truth words, and not as infrequently as random garbage.
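
The range idea could be prototyped roughly as follows. Nothing here exists in main.py: the function name `bias_band` and the `low` / `high` bounds are hypothetical, and the half-open interval is just one reasonable choice.

```python
from collections import Counter


def bias_band(articles, low=0.1, high=0.5):
    """Return words whose document frequency falls in [low, high).

    articles: list of token lists for one news event.
    Words at or above `high` are treated as "truth"; words below `low`
    are treated as noise. Both bounds are hypothetical parameters.
    """
    n_articles = len(articles)
    doc_freq = Counter()
    for tokens in articles:
        doc_freq.update(set(tokens))
    return {w for w, n in doc_freq.items() if low <= n / n_articles < high}


# Made-up event: "bill" is everywhere, "xyzzy" appears only once.
articles = [
    ["bill", "shocking"],
    ["bill", "shocking", "xyzzy"],
    ["bill"],
    ["bill"],
]
band = bias_band(articles, low=0.3, high=0.75)
```

Here "bill" is excluded as a truth word, "xyzzy" is excluded as too rare, and only "shocking" lands in the band.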
