opportunitylivetv / article-tagging

This project forked from pandey/article-tagging

Natural Language Processing of Chicago news articles

article-tagging's Introduction

Installation and Usage

Requirements

To install this library, you will need at least the Python packages nltk, numpy, scikit-learn, and pandas. We recommend using Anaconda to manage Python environments:

$ # create a new anaconda environment with required packages
$ conda create -n cjp-ap nltk numpy scikit-learn pandas pytest
$ source activate cjp-ap
(cjp-ap) $ ...

Installation

Clone the repository with git, cd into the directory, and run the setup.py file. If you created an Anaconda environment, make sure it is active before running the setup file.

$ git clone git@github.com:chicago-justice-project/article-tagging.git
$ cd article-tagging
$ python setup.py install

To keep the git repo from growing too large, we do not package the models. (This also avoids problems with different pickling protocols being used by different Python versions.) This means you will have to build the model from scratch, which takes roughly 4 GB of RAM.

$ python -m tagnews.crimetype.models.binary_stemmed_logistic.save_model

All together

$ # clone the repo
$ git clone git@github.com:chicago-justice-project/article-tagging.git
$ cd article-tagging
$ # create a new anaconda environment with required packages
$ conda create -n cjp-ap nltk numpy scikit-learn pandas pytest
$ source activate cjp-ap
(cjp-ap) $ cd lib
(cjp-ap) $ # make/save the model, this may take a while...
(cjp-ap) $ python -m tagnews.crimetype.models.binary_stemmed_logistic.save_model
(cjp-ap) $ cd ..
(cjp-ap) $ python setup.py install

nltk

As long as the nltk package is already installed, running the setup.py file should automatically download the required nltk corpora. If that does not work for some reason, then you will need to download the corpora manually. See the list required_nltk_packages in setup.py. Each corpus can be downloaded by running nltk.download(corpus_name).
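
If the automatic download fails, the manual step can be scripted. The following is a minimal sketch, not part of tagnews: download_corpora is a hypothetical helper, and the corpus names in the usage comment are placeholders. Substitute the actual entries from required_nltk_packages in setup.py.

```python
def download_corpora(corpus_names, download=None):
    """Download each named nltk corpus and return the names that failed."""
    if download is None:
        # Imported lazily so nltk is only needed when actually downloading.
        import nltk
        download = nltk.download
    failed = []
    for corpus_name in corpus_names:
        # nltk.download returns a truthy value on success.
        if not download(corpus_name):
            failed.append(corpus_name)
    return failed


# Usage (placeholder names; copy the real list from setup.py):
# failed = download_corpora(["punkt", "wordnet"])
# if failed:
#     print("Could not download:", ", ".join(failed))
```

Returning the failed names rather than raising lets you retry only the corpora that are still missing.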

Testing

You will additionally need pytest installed to run the tests.

To test an installation, run the following from a Python session:

import tagnews
tagnews.test()

During development, you can instead run py.test from the top level of this repo. Either way, you should see a couple of tests pass.

Usage

From Python

The main class is tagnews.crimetype.tag.Tagger:

>>> import tagnews
>>> tagger = tagnews.crimetype.tag.Tagger()
>>> article_text = 'A short article. About drugs and police.'
>>> tagger.relevant(article_text, prob_thresh=0.1)
True
>>> tagger.tagtext(article_text, prob_thresh=0.5)
['DRUG', 'CPD']
>>> tagger.tagtext_proba(article_text)
DRUG     0.747944
CPD      0.617198
VIOL     0.183003
UNSPC    0.145019
ILSP     0.114254
POLM     0.059985
...
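
To tag many articles at once, the calls above can be wrapped in a loop. This is a minimal sketch: tag_articles is a hypothetical helper, not part of tagnews, and it assumes only that the relevant and tagtext methods behave as in the session above.

```python
def tag_articles(tagger, articles, relevant_thresh=0.1, tag_thresh=0.5):
    """Map each article name to its list of tags.

    `articles` is a dict of {name: article_text}. Articles that
    tagger.relevant() rejects at `relevant_thresh` are left out.
    """
    tagged = {}
    for name, text in articles.items():
        if not tagger.relevant(text, prob_thresh=relevant_thresh):
            continue
        tagged[name] = list(tagger.tagtext(text, prob_thresh=tag_thresh))
    return tagged


# Usage, assuming tagnews is installed and the model has been built:
# import tagnews
# tagger = tagnews.crimetype.tag.Tagger()
# tag_articles(tagger, {"a1": "A short article. About drugs and police."})
```

Passing the tagger in as an argument means the model is loaded once and reused for every article.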

From the command line

The installation comes with a very rudimentary command line interface, which reads from stdin when run without arguments.

$ python -m tagnews.crimetype.cli
Go ahead and start typing. Hit ctrl-d when done.
<type here>

Or you can provide a list of articles to tag. For each article, a CSV of the probability of each tag is written to <article name>.tagged.

$ python -m tagnews.crimetype.cli sample-article-1.txt sample-article-2.txt
$ cat sample-article-1.txt.tagged
  CPD, 0.912382307
UNSPC, 0.051873838
 SEXA, 0.031065436
 BEAT, 0.023119570
 DRUG, 0.017140532
...

Note that the -m flag is required.
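
The .tagged files can be read back with the standard csv module. This is a minimal sketch, assuming each line holds a tag and a probability separated by a comma, as in the sample output above. read_tagged is a hypothetical helper, not part of tagnews.

```python
import csv
import io


def read_tagged(lines):
    """Parse 'TAG, probability' lines into a {tag: probability} dict."""
    probabilities = {}
    for row in csv.reader(lines):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        tag, prob = row
        # Tags are padded with spaces in the file, so strip them.
        probabilities[tag.strip()] = float(prob)
    return probabilities


sample = io.StringIO("  CPD, 0.912382307\nUNSPC, 0.051873838\n SEXA, 0.031065436\n")
print(read_tagged(sample))
```

To read a real file, pass an open file object instead, for example open('sample-article-1.txt.tagged', newline='').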

Background

We want to compare how often different types of crime are reported in certain areas with how often they actually occur in those areas. Are some crimes under-represented in certain areas but over-represented in others? To accomplish this, we'll need to be able to extract a type-of-crime tag and geospatial data from news articles.

We meet every Tuesday at Chi Hack Night, and you can find out more about this specific project here.

The Chicago Justice Project has been scraping RSS feeds of articles written by Chicago area news outlets for several years, allowing them to collect almost 300,000 articles. At the same time, an amazing group of volunteers has helped them tag these articles. The tags include crime categories such as "Gun Violence", "Drugs", and "Sexual Assault", organizations such as "Cook County State's Attorney's Office", "Illinois State Police", and "Chicago Police Department", and other miscellaneous categories such as "LGBTQ" and "Immigration". The volunteer UI was also recently updated to allow highlighting of geographic information.

Contributing

You want to contribute? Great! Check out the CONTRIBUTING.md file for more info.

Areas of research

Type-of-Crime Article Tagging

This part of the project aims to automate the category tagging using Natural Language Processing, a branch of Machine Learning.

Possible models to use (some of which we have tried!) include

It might be useful to have an additional corpus of news articles that we can use for unsupervised feature learning without having to worry about over-fitting.

Automated Geolocation

We also need to automatically find the geographic area of the crime the article is talking about. We recently updated the tagging interface to allow highlighting geospatial information inside articles, and we are collecting ground truth data. Once we have collected this data, we need to automate the process of detecting location information inside articles. An important note: we rely on existing geocoders to turn unstructured location information into a latitude/longitude pair.

One possible path forward appeared to involve an approach developed by Everyblock. They got funding from the Knight Foundation to geolocate news articles and were required to open source their code. A brief investigation suggests that their geolocation is essentially one giant regular expression, and further investigation showed that it was not accurate enough on its own for our purposes.
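
To make the limitation concrete, here is a small illustration of regex-based location extraction. It is not Everyblock's code. The pattern and the find_block_addresses helper are hypothetical and match only one phrasing common in crime reporting, such as "1200 block of North State Street".

```python
import re

# Matches phrases like "1200 block of North State Street" or
# "300 block of W. Madison St" (illustrative pattern only).
BLOCK_ADDRESS = re.compile(
    r"\b(?P<number>\d{1,5}) block of "
    r"(?P<direction>North|South|East|West|[NSEW]\.?) "
    r"(?P<street>[A-Z][a-z]+(?: [A-Z][a-z]+)*)"
)


def find_block_addresses(text):
    """Return (number, direction, street) tuples found in the text."""
    return [
        (m.group("number"), m.group("direction"), m.group("street"))
        for m in BLOCK_ADDRESS.finditer(text)
    ]


print(find_block_addresses(
    "A man was shot in the 1200 block of North State Street on Tuesday."
))
```

A pattern like this finds nothing in text that names an intersection ("State and Madison"), a landmark, or a neighborhood, so every additional phrasing needs its own rule.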

Things to check out:

Things to consider

Some articles may discuss multiple crimes. Some crimes may occur in multiple areas, whereas others may not be associated with any geographic information (e.g. some kinds of fraud).

See Also

article-tagging's People

Contributors

alabavery, kbrose, mattesweeney
