notnews: predict soft news using story text and the url structure

The package provides classifiers for soft news based on the story text and the URL structure for both the US and UK news media. We also provide a way to infer the 'kind' of news---Arts, Books, Science, Sports, Travel, etc.---for the US news media.

Streamlit App: https://notnews-notnews-streamlitstreamlit-app-u8j3a6.streamlit.app/

Quick Start

>>> import pandas as pd
>>> from notnews import *

>>> # Get help
>>> help(soft_news_url_cat_us)

Help on method soft_news_url_cat in module notnews.soft_news_url_cat:

soft_news_url_cat(df, col='url') method of builtins.type instance
    Soft News Categorize by URL pattern.

    Using the URL pattern to categorize the soft/hard news of the input
    DataFrame.

    Args:
        df (:obj:`DataFrame`): Pandas DataFrame containing the URL
            column.
        col (str or int): Column's name or location of the URL in
            DataFrame (default: url).

    Returns:
        DataFrame: Pandas DataFrame with additional columns:
            - `soft_lab` set to 1 if URL match with soft news URL pattern.
            - `hard_lab` set to 1 if URL match with hard news URL pattern.

>>> # Load data
>>> df = pd.read_csv('./notnews/tests/sample_us.csv')
>>> df
            src                                                url                                               text
0             nyt  http://www.nytimes.com/2017/02/11/us/politics/...  Mr. Kushner on something of a crash course in ...
1  huffingtonpost  http://grvrdr.huffingtonpost.com/302/redirect?...  Authorities are still searching for a man susp...
2             nyt  http://www.nytimes.com/2016/09/19/us/politics/...  Photo  WASHINGTON — In releasing a far more so...
3          google  http://www.foxnews.com/world/2016/07/17/turkey...  The Turkish government on Sunday ratcheted up ...
4             nyt  http://www.nytimes.com/interactive/2016/08/29/...  NYTimes.com no longer supports Internet Explor...
5           yahoo  https://www.yahoo.com/news/pittsburgh-symphony...  PITTSBURGH AP — Pittsburgh Symphony Orchestra ...
6         foxnews  http://www.foxnews.com/politics/2016/08/13/cli...  Hillary Clintons campaign is questioning a rep...
7         foxnews  http://www.foxnews.com/us/2017/04/15/april-gir...  April the giraffe has given birth at a New Yor...
8         foxnews  http://www.foxnews.com/politics/2017/05/03/hil...  Want FOX News Halftime Report in your inbox ev...
9             nyt  http://www.nytimes.com/2016/09/06/obituaries/p...  Shes an extremely liberated woman Ms. DeCrow s...
>>>
>>> # Get the Soft News URL category
>>> df_soft_news_url_cat_us  = soft_news_url_cat_us(df, col='url')
>>> df_soft_news_url_cat_us
            src                                                url                                               text  soft_lab  hard_lab
0             nyt  http://www.nytimes.com/2017/02/11/us/politics/...  Mr. Kushner on something of a crash course in ...       NaN       1.0
1  huffingtonpost  http://grvrdr.huffingtonpost.com/302/redirect?...  Authorities are still searching for a man susp...       NaN       NaN
2             nyt  http://www.nytimes.com/2016/09/19/us/politics/...  Photo  WASHINGTON — In releasing a far more so...       NaN       1.0
3          google  http://www.foxnews.com/world/2016/07/17/turkey...  The Turkish government on Sunday ratcheted up ...       NaN       1.0
4             nyt  http://www.nytimes.com/interactive/2016/08/29/...  NYTimes.com no longer supports Internet Explor...       NaN       1.0
5           yahoo  https://www.yahoo.com/news/pittsburgh-symphony...  PITTSBURGH AP — Pittsburgh Symphony Orchestra ...       1.0       NaN
6         foxnews  http://www.foxnews.com/politics/2016/08/13/cli...  Hillary Clintons campaign is questioning a rep...       NaN       1.0
7         foxnews  http://www.foxnews.com/us/2017/04/15/april-gir...  April the giraffe has given birth at a New Yor...       NaN       NaN
8         foxnews  http://www.foxnews.com/politics/2017/05/03/hil...  Want FOX News Halftime Report in your inbox ev...       NaN       1.0
9             nyt  http://www.nytimes.com/2016/09/06/obituaries/p...  Shes an extremely liberated woman Ms. DeCrow s...       NaN       NaN
>>>

Installation

Installation is as easy as typing in:

pip install notnews

API

  1. soft_news_url_cat_us: Uses URL patterns from prominent outlets to classify the type of news. It is based on a slightly amended version of the regular expression used to separate news from non-news in "Exposure to ideologically diverse news and opinion on Facebook" by Bakshy, Messing, and Adamic (Science, 2015). Our only amendment: sport rather than sports. The classifier's accuracy is liable to vary over time and across outlets.
  • Arguments:

    • df: pandas dataframe. No default.
    • col: name or location of the column with the domain names/URLs. Default is url
  • What it does:

    • converts url to lower case
    • regex
    URL containing any of the following words is classified as soft news:
    sport|entertainment|arts|fashion|style|lifestyle|leisure|celeb|movie|music|gossip|food|travel|horoscope|weather|gadget
    
    URL containing any of the following words is classified as hard news:
    politi|usnews|world|national|state|elect|vote|govern|campaign|war|polic|econ|unemploy|racis|energy|abortion|educa|healthcare|immigration
  • Output:

    • Because both regexes can match the same URL, a URL can be labeled soft, hard, both soft and hard, or neither.
    • By default it creates two columns, `hard_lab` and `soft_lab`
  • Examples:

    >>> import pandas as pd
    >>> from notnews import soft_news_url_cat_us
    >>>
    >>> df = pd.DataFrame([{'url': 'http://nytimes.com/sports/'}])
    >>> df
                            url
    0  http://nytimes.com/sports/
    >>>
    >>> soft_news_url_cat_us(df)
                            url  soft_lab hard_lab
    0  http://nytimes.com/sports/         1     None
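
The matching described under "What it does" can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: the function name `label_url` and the use of `None` for "no match" are assumptions, and only the two word lists are taken from the text above.

```python
import re

# Word lists copied from the description above.
SOFT_PATTERN = re.compile(
    r"sport|entertainment|arts|fashion|style|lifestyle|leisure|celeb|movie|"
    r"music|gossip|food|travel|horoscope|weather|gadget"
)
HARD_PATTERN = re.compile(
    r"politi|usnews|world|national|state|elect|vote|govern|campaign|war|"
    r"polic|econ|unemploy|racis|energy|abortion|educa|healthcare|immigration"
)


def label_url(url):
    """Return (soft, hard); each is 1 on a match and None otherwise.

    Hypothetical helper for illustration only.
    """
    lowered = url.lower()  # the description says URLs are lower-cased first
    soft = 1 if SOFT_PATTERN.search(lowered) else None
    hard = 1 if HARD_PATTERN.search(lowered) else None
    return soft, hard


print(label_url("http://nytimes.com/sports/"))
print(label_url("http://example.com/world/sport/"))
print(label_url("https://example.com/awards/"))
```

Because the patterns match substrings anywhere in the URL, a path such as `/awards/` is labeled hard news through `war`. This is one reason the classifier's accuracy can vary across outlets.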
  2. pred_soft_news_us: We use data from the NY Times to train a model. The function uses the trained model to predict soft news.
  • Arguments:

    • df: pandas dataframe. No default.
    • text: column with the story text.
  • Functionality:

    • Normalizes the text and gets the bi-grams and tri-grams
    • Outputs calibrated probability of soft news using the trained model
  • Output

    • Appends a column with probability of soft news (prob_soft_news_us)
  • Examples:

    >>> import pandas as pd
    >>> from notnews import pred_soft_news_us
    >>>
    >>> df = pd.read_csv('notnews/tests/sample_us.csv')
    >>> df
                src                                                url                                               text
    0             nyt  http://www.nytimes.com/2017/02/11/us/politics/...  Mr. Kushner on something of a crash course in ...
    1  huffingtonpost  http://grvrdr.huffingtonpost.com/302/redirect?...  Authorities are still searching for a man susp...
    2             nyt  http://www.nytimes.com/2016/09/19/us/politics/...  Photo  WASHINGTON — In releasing a far more so...
    3          google  http://www.foxnews.com/world/2016/07/17/turkey...  The Turkish government on Sunday ratcheted up ...
    4             nyt  http://www.nytimes.com/interactive/2016/08/29/...  NYTimes.com no longer supports Internet Explor...
    5           yahoo  https://www.yahoo.com/news/pittsburgh-symphony...  PITTSBURGH AP — Pittsburgh Symphony Orchestra ...
    6         foxnews  http://www.foxnews.com/politics/2016/08/13/cli...  Hillary Clintons campaign is questioning a rep...
    7         foxnews  http://www.foxnews.com/us/2017/04/15/april-gir...  April the giraffe has given birth at a New Yor...
    8         foxnews  http://www.foxnews.com/politics/2017/05/03/hil...  Want FOX News Halftime Report in your inbox ev...
    9             nyt  http://www.nytimes.com/2016/09/06/obituaries/p...  Shes an extremely liberated woman Ms. DeCrow s...
    >>>
    >>> pred_soft_news_us(df)
    Using model data from /opt/notebooks/not_news/notnews_pub/notnews/data/us_model/nyt_us_soft_news_classifier.joblib...
    Using vectorizer data from /opt/notebooks/not_news/notnews_pub/notnews/data/us_model/nyt_us_soft_news_vectorizer.joblib...
    Loading the model and vectorizer data file...
                src                                                url                                               text  prob_soft_news_us
    0             nyt  http://www.nytimes.com/2017/02/11/us/politics/...  Mr. Kushner on something of a crash course in ...           0.175099
    1  huffingtonpost  http://grvrdr.huffingtonpost.com/302/redirect?...  Authorities are still searching for a man susp...           0.044617
    2             nyt  http://www.nytimes.com/2016/09/19/us/politics/...  Photo  WASHINGTON — In releasing a far more so...           0.010398
    3          google  http://www.foxnews.com/world/2016/07/17/turkey...  The Turkish government on Sunday ratcheted up ...           0.011246
    4             nyt  http://www.nytimes.com/interactive/2016/08/29/...  NYTimes.com no longer supports Internet Explor...           0.021861
    5           yahoo  https://www.yahoo.com/news/pittsburgh-symphony...  PITTSBURGH AP — Pittsburgh Symphony Orchestra ...           0.372437
    6         foxnews  http://www.foxnews.com/politics/2016/08/13/cli...  Hillary Clintons campaign is questioning a rep...           0.077207
    7         foxnews  http://www.foxnews.com/us/2017/04/15/april-gir...  April the giraffe has given birth at a New Yor...           0.481287
    8         foxnews  http://www.foxnews.com/politics/2017/05/03/hil...  Want FOX News Halftime Report in your inbox ev...           0.004383
    9             nyt  http://www.nytimes.com/2016/09/06/obituaries/p...  Shes an extremely liberated woman Ms. DeCrow s...           0.694037
    >>>
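
The normalization and n-gram step listed under Functionality can be sketched as follows. The normalization rule used here (lower-casing and dropping non-alphanumeric characters) and the helper names are assumptions for illustration; the package's own preprocessing and trained vectorizer may differ.

```python
import re


def normalize(text):
    # Assumed normalization: lower-case and replace runs of
    # non-alphanumeric characters with a single space.
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def ngrams(tokens, n):
    # All contiguous runs of n tokens, joined with spaces.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bi_and_tri_grams(text):
    # Hypothetical helper: bi-grams followed by tri-grams of the normalized text.
    tokens = normalize(text).split()
    return ngrams(tokens, 2) + ngrams(tokens, 3)


print(bi_and_tri_grams("April the giraffe has"))
```

A trained model would then map counts or weights of these n-grams to a probability of soft news.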
  3. pred_what_news_us: We use a model trained on the annotated NY Times corpus to predict the type of news: Arts, Books, Business Finance, Classifieds, Dining, Editorial, Foreign News, Health, Leisure, Local, National, Obits, Other, Real Estate, Science, Sports, Style, and Travel.

  • Arguments:

    • df: pandas dataframe. No default.
    • text: column with the story text.
  • Functionality:

    • Normalizes the text and gets the bi-grams and tri-grams
    • Outputs calibrated probability of the type of news using the trained model
  • Output

    • Appends a column with the predicted category (pred_what_news_us) and one column with the probability of each category (prob_*).
  • Examples:

    >>> import pandas as pd
    >>> from notnews import pred_what_news_us
    >>>
    >>> df = pd.read_csv('notnews/tests/sample_us.csv')
    >>> df
                src                                                url                                               text
    0             nyt  http://www.nytimes.com/2017/02/11/us/politics/...  Mr. Kushner on something of a crash course in ...
    1  huffingtonpost  http://grvrdr.huffingtonpost.com/302/redirect?...  Authorities are still searching for a man susp...
    2             nyt  http://www.nytimes.com/2016/09/19/us/politics/...  Photo  WASHINGTON — In releasing a far more so...
    3          google  http://www.foxnews.com/world/2016/07/17/turkey...  The Turkish government on Sunday ratcheted up ...
    4             nyt  http://www.nytimes.com/interactive/2016/08/29/...  NYTimes.com no longer supports Internet Explor...
    5           yahoo  https://www.yahoo.com/news/pittsburgh-symphony...  PITTSBURGH AP — Pittsburgh Symphony Orchestra ...
    6         foxnews  http://www.foxnews.com/politics/2016/08/13/cli...  Hillary Clintons campaign is questioning a rep...
    7         foxnews  http://www.foxnews.com/us/2017/04/15/april-gir...  April the giraffe has given birth at a New Yor...
    8         foxnews  http://www.foxnews.com/politics/2017/05/03/hil...  Want FOX News Halftime Report in your inbox ev...
    9             nyt  http://www.nytimes.com/2016/09/06/obituaries/p...  Shes an extremely liberated woman Ms. DeCrow s...
    >>>
    >>> pred_what_news_us(df)
    
    Using model data from /opt/notebooks/not_news/notnews_pub/notnews/data/us_model/nyt_us_classifier.joblib...
    Using vectorizer data from /opt/notebooks/not_news/notnews_pub/notnews/data/us_model/nyt_us_vectorizer.joblib...
    Loading the model and vectorizer data file...
                src                                                url                                               text  ... prob_sports  prob_style  prob_travel
    0             nyt  http://www.nytimes.com/2017/02/11/us/politics/...  Mr. Kushner on something of a crash course in ...  ...    0.000000    0.037708     0.000000
    1  huffingtonpost  http://grvrdr.huffingtonpost.com/302/redirect?...  Authorities are still searching for a man susp...  ...    0.000505    0.000243     0.000416
    2             nyt  http://www.nytimes.com/2016/09/19/us/politics/...  Photo  WASHINGTON — In releasing a far more so...  ...    0.000000    0.051815     0.000000
    3          google  http://www.foxnews.com/world/2016/07/17/turkey...  The Turkish government on Sunday ratcheted up ...  ...    0.001302    0.001378     0.000040
    4             nyt  http://www.nytimes.com/interactive/2016/08/29/...  NYTimes.com no longer supports Internet Explor...  ...    0.003500    0.010600     0.000973
    5           yahoo  https://www.yahoo.com/news/pittsburgh-symphony...  PITTSBURGH AP — Pittsburgh Symphony Orchestra ...  ...    0.161347    0.009316     0.000476
    6         foxnews  http://www.foxnews.com/politics/2016/08/13/cli...  Hillary Clintons campaign is questioning a rep...  ...    0.006366    0.003844     0.005973
    7         foxnews  http://www.foxnews.com/us/2017/04/15/april-gir...  April the giraffe has given birth at a New Yor...  ...    0.000808    0.047357     0.015018
    8         foxnews  http://www.foxnews.com/politics/2017/05/03/hil...  Want FOX News Halftime Report in your inbox ev...  ...    0.000626    0.000459     0.000000
    9             nyt  http://www.nytimes.com/2016/09/06/obituaries/p...  Shes an extremely liberated woman Ms. DeCrow s...  ...    0.000000    0.019162     0.000000
    
    [10 rows x 22 columns]
    >>>
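
One way to read the prob_* columns is to pick the category with the largest probability in a row. The sketch below does that for a plain dict. It assumes the predicted category corresponds to the highest probability, which the text above does not state, and `top_category` is a hypothetical helper rather than part of the package.

```python
def top_category(row):
    """Return the category with the largest prob_* value in a row (a dict)."""
    probs = {
        key[len("prob_"):]: value
        for key, value in row.items()
        if key.startswith("prob_")
    }
    return max(probs, key=probs.get)


# Only the three probability columns visible in the truncated output above
# (row 7); the full output has one prob_* column per category.
row = {
    "src": "foxnews",
    "prob_sports": 0.000808,
    "prob_style": 0.047357,
    "prob_travel": 0.015018,
}
print(top_category(row))  # prints "style"
```

With all category columns present, the result may differ from the one shown for these three columns.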
  4. soft_news_url_cat_uk: Uses URL patterns from prominent outlets to classify the type of news. It is based on a slightly amended version of the regular expression used to separate news from non-news in "Exposure to ideologically diverse news and opinion on Facebook" by Bakshy, Messing, and Adamic (Science, 2015). Our only amendment: sport rather than sports. The classifier's accuracy is liable to vary over time and across outlets.
  • Arguments:

    • df: pandas dataframe. No default.
    • url: column with the domain names/URLs. Default is url
  • What it does:

    • converts url to lower case
    • regex
    URL containing any of the following words is classified as soft news:
    sport|entertainment|arts|fashion|style|lifestyle|leisure|celeb|movie|music|gossip|food|travel|horoscope|weather|gadget
    
    URL containing any of the following words is classified as hard news:
    politi|usnews|world|national|state|elect|vote|govern|campaign|war|polic|econ|unemploy|racis|energy|abortion|educa|healthcare|immigration
  • Output:

    • Because both regexes can match the same URL, a URL can be labeled soft, hard, both soft and hard, or neither.
    • By default it creates two columns, `hard_lab` and `soft_lab`
  • Examples:

    >>> import pandas as pd
    >>> from notnews import soft_news_url_cat_uk
    >>>
    >>> df = pd.DataFrame([{'url': 'https://www.theguardian.com/us/sport'}])
    >>> df
                                        url
    0  https://www.theguardian.com/us/sport
    >>>
    >>> soft_news_url_cat_uk(df)
                                        url  soft_lab hard_lab
    0  https://www.theguardian.com/us/sport         1     None
    >>>
  5. pred_soft_news_uk: We use the trained UK model to predict soft news from the story text for UK news media.

Command Line

We also provide command-line scripts that process an input file in CSV format:

  1. soft_news_url_cat_us

    usage: soft_news_url_cat_us [-h] [-o OUTPUT] [-u URL] input
    
    US Soft News Category by URL pattern
    
    positional arguments:
    input                 Input file
    
    optional arguments:
    -h, --help            show this help message and exit
    -o OUTPUT, --output OUTPUT
                            Output file with category data
    -u URL, --url URL     Name or index location of column contains the domain
                            or URL (default: url)
  2. pred_soft_news_us

    usage: pred_soft_news_us [-h] [-o OUTPUT] [-t TEXT] input
    
    Predict Soft News by text using NYT Soft News model
    
    positional arguments:
    input                 Input file
    
    optional arguments:
    -h, --help            show this help message and exit
    -o OUTPUT, --output OUTPUT
                            Output file with prediction data
    -t TEXT, --text TEXT  Name or index location of column contains the text
                            (default: text)
  3. pred_what_news_us

    usage: pred_what_news_us [-h] [-o OUTPUT] [-t TEXT] input
    
    Predict What News by text using NYT What News model
    
    positional arguments:
    input                 Input file
    
    optional arguments:
    -h, --help            show this help message and exit
    -o OUTPUT, --output OUTPUT
                            Output file with prediction data
    -t TEXT, --text TEXT  Name or index location of column contains the text
                            (default: text)
  4. soft_news_url_cat_uk

    usage: soft_news_url_cat_uk [-h] [-o OUTPUT] [-u URL] input
    
    UK Soft News Category by URL pattern
    
    positional arguments:
    input                 Input file
    
    optional arguments:
    -h, --help            show this help message and exit
    -o OUTPUT, --output OUTPUT
                            Output file with category data
    -u URL, --url URL     Name or index location of column contains the domain
                            or URL (default: url)
  5. pred_soft_news_uk

    usage: pred_soft_news_uk [-h] [-o OUTPUT] [-t TEXT] input
    
    Predict Soft News by text using UK URL Soft News model
    
    positional arguments:
    input                 Input file
    
    optional arguments:
    -h, --help            show this help message and exit
    -o OUTPUT, --output OUTPUT
                            Output file with prediction data
    -t TEXT, --text TEXT  Name or index location of column contains the text
                            (default: text)
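
To show how the flags in the usage messages above are read, here is a small `argparse` sketch that mirrors the `soft_news_url_cat_us` interface. It is an illustration only, not the package's script: the file names are made up, and the default for `--output` is a placeholder because the usage text does not state one.

```python
import argparse


def build_parser():
    # Mirrors the usage text for soft_news_url_cat_us shown above.
    parser = argparse.ArgumentParser(
        prog="soft_news_url_cat_us",
        description="US Soft News Category by URL pattern",
    )
    parser.add_argument("input", help="Input file")
    # Assumption: the real default output path is not given in the usage text.
    parser.add_argument("-o", "--output", default=None,
                        help="Output file with category data")
    parser.add_argument("-u", "--url", default="url",
                        help="Name or index location of the URL column")
    return parser


# Hypothetical file names, equivalent to running:
#   soft_news_url_cat_us -o labeled.csv -u url sample_us.csv
args = build_parser().parse_args(["-o", "labeled.csv", "-u", "url", "sample_us.csv"])
print(args.input, args.output, args.url)
```

The text-based scripts take `-t/--text` in place of `-u/--url`, with `text` as the default column.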

Underlying Data

  • For more information about how to get the underlying data for UK model, see here. For information about the data underlying the US model, see here

Applications

We use the model to estimate the supply of not news in the US and the UK.

Documentation

For more information, please see project documentation.

Authors

Suriyan Laohaprapanon and Gaurav Sood

Contributor Code of Conduct

The project welcomes contributions from everyone! In fact, it depends on it. To maintain this welcoming atmosphere, and to collaborate in a fun and productive way, we expect contributors to the project to abide by the Contributor Code of Conduct.

License

The package is released under the MIT License.
