mls-aihack-mother-repo's Introduction

Goal:

  • Match news articles that cover the same event
  • Find the text common to those articles and remove it
  • Cluster the remaining "residual words" via topic modeling
  • Apply the resulting "topics" to all residuals from a news source to estimate that source's "bias"
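
As a rough illustration of the second step, the sketch below removes words that appear in at least a given fraction of an event's articles. The function name `remove_common_words`, the whitespace tokenization, and the sample event are assumptions made for illustration; this is not the code in main.py.

```python
from collections import Counter


def remove_common_words(articles, thresh=0.5):
    """Drop words found in at least `thresh` of the articles in one event.

    `articles` is a list of token lists, one per article. Returns the
    "residual" token lists in the same order.
    """
    n_articles = len(articles)
    # Count, for each word, how many articles mention it at least once.
    doc_freq = Counter(word for tokens in articles for word in set(tokens))
    common = {w for w, n in doc_freq.items() if n / n_articles >= thresh}
    return [[w for w in tokens if w not in common] for tokens in articles]


# Hypothetical event covered by three sources.
event = [
    "the senate passed the bill".split(),
    "the senate rammed through the controversial bill".split(),
    "the senate passed a landmark bill".split(),
]
residuals = remove_common_words(event, thresh=0.5)
```

With a threshold above 1.0 no word can qualify as common, so every article is returned unchanged.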

Setup & Installation:

pip install -r requirements.txt
python -c "import nltk; nltk.download()"

The second command opens the NLTK downloader. Choose the "popular" collection and install it.

NLP Demo

This demo shows stemming, lemmatizing, and word counting (including tf-idf).

python nlp_demo.py
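
For readers unfamiliar with tf-idf, here is a minimal standard-library sketch of one common variant (raw term count times log(N / df)). It is not taken from nlp_demo.py, which may use a library implementation with different smoothing; the function name `tf_idf` and the sample documents are made up.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: weight} dict per document.

    Uses tf = raw count in the document and idf = log(N / df), where df is
    the number of documents containing the word.
    """
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append(
            {w: c * math.log(n_docs / doc_freq[w]) for w, c in counts.items()}
        )
    return weights


docs = [
    "markets fell sharply".split(),
    "markets rose".split(),
]
weights = tf_idf(docs)
```

A word that appears in every document ("markets" here) gets weight zero, which is why tf-idf emphasizes words that distinguish one document from the others.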

Downloading data

Run

python registry_data.py

You can tweak parameters, such as the minimum number of articles per event or the API key, within the script.

Modeling

python main.py

Viewing Results

At the end of the modeling process, a 3D graph is generated to visualize the results.

Results

  • Found common words across news articles within an event.
  • When clustering "residual" words via LDA, many emotion words appear.
  • Sources did not separate by topic.
    • Possible explanation: sources use emotional words to describe the news, but not consistently across events.

Future Directions

  • Model news source bias within a particular topic
  • Boost / attenuate emotion words via sentiment analysis
  • See if there’s bias by author
  • Include and apply a fake news dataset

mls-aihack-mother-repo's People

Contributors

avicennax, frogman141, jwilber

mls-aihack-mother-repo's Issues

Add arg to group articles by source

Currently, LDA runs on a per-article basis. It would be good to add a flag to run it on a per-source basis as well.

In this case, we'd need to apply the LDA model to each article.
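
One way to read this: the "documents" passed to LDA would become one concatenated token list per source instead of one per article. A minimal sketch, assuming articles are available as (source, tokens) pairs; the helper name `group_tokens` and the sample data are hypothetical and not part of the repository.

```python
from collections import defaultdict


def group_tokens(articles, groupby="article"):
    """Build the documents passed to topic modeling.

    `articles` is a list of (source, tokens) pairs. With groupby="article"
    each article is its own document; with groupby="source" all tokens from
    one source are concatenated into a single document.
    """
    if groupby == "article":
        return [list(tokens) for _, tokens in articles]
    if groupby == "source":
        by_source = defaultdict(list)
        for source, tokens in articles:
            by_source[source].extend(tokens)
        # Dicts preserve insertion order, so sources appear as first seen.
        return list(by_source.values())
    raise ValueError(f"unknown groupby: {groupby!r}")


articles = [
    ("source_a", ["outrage", "slams"]),
    ("source_b", ["praises"]),
    ("source_a", ["chaos"]),
]
per_article = group_tokens(articles, "article")
per_source = group_tokens(articles, "source")
```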

Apply LDA model to Kaggle fake news

Not sure exactly how to do this, as the "truth" words have been removed from our article text.

Could start by getting all non-stopwords, applying our vocabulary, and plotting in the same scatter plot.
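
A sketch of the "applying our vocabulary" step, assuming the fitted vocabulary is a word-to-column mapping and that words outside it are simply ignored. The stopword set, the vocabulary, and the function name `vectorize_with_vocab` are placeholders, not names from the repository.

```python
def vectorize_with_vocab(tokens, vocab, stopwords=frozenset()):
    """Count tokens against a fixed vocabulary (word -> column index)."""
    counts = [0] * len(vocab)
    for word in tokens:
        if word in stopwords:
            continue
        col = vocab.get(word)
        if col is not None:
            counts[col] += 1
    return counts


vocab = {"shocking": 0, "claims": 1, "hoax": 2}  # placeholder vocabulary
stopwords = frozenset({"the", "a", "is"})        # placeholder stopword list
vector = vectorize_with_vocab(
    "the shocking hoax is a shocking story".split(), vocab, stopwords
)
```

Vectors built this way have the same columns as the training data, so they can be passed to the already-fitted topic model and plotted alongside the original articles.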

Explore the parameter space to try and improve results.

To play, run:

python main.py --csv-file raw_dataframe.csv

Then add flags to explore the parameter space:

  • --source-thresh SOURCE_THRESH Min % of events a news source must cover, to be included.
    Default 0.5; lowering this would include a broader set of news sources.

  • --min-article-length MIN_ARTICLE_LENGTH Min # words in an article (pre-parsing)
    Set to 250. Are longer articles more biased?

  • --min-vocab-length MIN_VOCAB_LENGTH Min # words in an article (post-lemmatizing, vectorizing)
    Set to 100. Are longer articles more biased?

  • --lda-min-appearances LDA_MIN_APPEARANCES Min # appearances of a word, to be included in the vocabulary
    Set to 2. Could raise this, to focus on the most common words.

  • --lda-vectorization-type {count,tfidf} Type of vectorization of article to word counts, to do.
    Set to count. Not 100% sure tfidf is working, but if it is, we should use it.

  • --lda-groupby {source,article} Run LDA on text separated by article, or by news source?
    Set to article right now. This controls what the "documents" (sets of words) sent into LDA are: one per article, or aggregated over each source.

  • --lda-topics LDA_TOPICS # of LDA topics
    Set to 10. Clusters indicate that maybe a higher number could be helpful.

  • --lda-iters LDA_ITERS # of LDA iterations
    Set to 1500. Could probably be lowered for larger datasets.

  • --truth-frequency-thresh TRUTH_FREQUENCY_THRESH % of articles in a news event that must mention a word, for it to be "truth" / removed.
    Set to 0.5. Could be higher (e.g. 1.1, which forces no words to be removed) or lower (e.g. 0.1, which removes most words and leaves only infrequent words for bias). Could also be implemented as a range, to say: bias words appear often, but not as often as truth words, and not as infrequently as random garbage.
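
For reference, the flags above could be declared with argparse roughly as follows. This is a reconstruction from the flag descriptions and defaults listed here, not the actual parser in main.py; the argument types are guesses and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Declare the flags listed above with the defaults stated there."""
    p = argparse.ArgumentParser(description="Topic-model residual words.")
    p.add_argument("--csv-file", default=None)
    p.add_argument("--source-thresh", type=float, default=0.5)
    p.add_argument("--min-article-length", type=int, default=250)
    p.add_argument("--min-vocab-length", type=int, default=100)
    p.add_argument("--lda-min-appearances", type=int, default=2)
    p.add_argument("--lda-vectorization-type",
                   choices=["count", "tfidf"], default="count")
    p.add_argument("--lda-groupby",
                   choices=["source", "article"], default="article")
    p.add_argument("--lda-topics", type=int, default=10)
    p.add_argument("--lda-iters", type=int, default=1500)
    p.add_argument("--truth-frequency-thresh", type=float, default=0.5)
    return p


# Example: override the topic count and keep every other default.
args = build_parser().parse_args(
    ["--csv-file", "raw_dataframe.csv", "--lda-topics", "20"]
)
```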
