To install this library, you will need at least the Python packages nltk, numpy, scikit-learn, and pandas. We recommend using Anaconda to manage Python environments:
$ # create a new anaconda environment with required packages
$ conda create -n cjp-ap nltk numpy scikit-learn pandas pytest
$ source activate cjp-ap
(cjp-ap) $ ...
Download the code from git, cd into the directory, and run the setup.py file. If you created an Anaconda environment, make sure that environment is active before running the setup file.
$ git clone [email protected]:chicago-justice-project/article-tagging.git
$ cd article-tagging
$ python setup.py install
To keep the git repo from ballooning in size, we do not package the trained models. (This also avoids problems with different pickling protocols being used by different Python versions.) This means you will have to build the model from scratch, which takes roughly 4 GB of RAM.
$ python -m tagnews.crimetype.models.binary_stemmed_logistic.save_model
Altogether, installation might look like this:
$ # clone the repo
$ git clone [email protected]:chicago-justice-project/article-tagging.git
$ cd article-tagging
$ # create a new anaconda environment with required packages
$ conda create -n cjp-ap nltk numpy scikit-learn pandas pytest
$ source activate cjp-ap
(cjp-ap) $ cd lib
(cjp-ap) $ # make/save the model, this may take a while...
(cjp-ap) $ python -m tagnews.crimetype.models.binary_stemmed_logistic.save_model
(cjp-ap) $ cd ..
(cjp-ap) $ python setup.py install
As long as the nltk package is already installed, running the setup.py file should automatically download the required nltk corpora. If that does not work for some reason, you will need to download the corpora manually. See the list required_nltk_packages in setup.py; each corpus can be downloaded by running nltk.download(corpus_name).
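If the automatic download fails, a manual download might look like the sketch below. The corpus names here are placeholders invented for illustration; check required_nltk_packages in setup.py for the actual list.

```python
# Hypothetical corpus names -- consult required_nltk_packages in setup.py
# for the list this project actually needs.
required_nltk_packages = ["punkt", "stopwords", "wordnet"]

def download_corpora(corpora):
    import nltk  # imported lazily so the list above can be inspected without nltk
    for name in corpora:
        # nltk.download fetches the corpus into the user's nltk_data directory
        nltk.download(name)
```

Calling download_corpora(required_nltk_packages) from a Python prompt should then fetch each corpus in turn.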
You will additionally need pytest installed to run the tests.
To test an installation, run

import tagnews
tagnews.test()

During development, you can instead run py.test from the top level of this repo. Either way, you should see a couple of tests pass.
The main class is tagnews.crimetype.tag.Tagger:
>>> import tagnews
>>> tagger = tagnews.crimetype.tag.Tagger()
>>> article_text = 'A short article. About drugs and police.'
>>> tagger.relevant(article_text, prob_thresh=0.1)
True
>>> tagger.tagtext(article_text, prob_thresh=0.5)
['DRUG', 'CPD']
>>> tagger.tagtext_proba(article_text)
DRUG 0.747944
CPD 0.617198
VIOL 0.183003
UNSPC 0.145019
ILSP 0.114254
POLM 0.059985
...
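Judging from the example above, tagtext appears to apply a simple probability threshold to the per-tag scores that tagtext_proba reports. A rough plain-Python sketch of that thresholding step (the probabilities below are copied from the example output, and the internals of the real Tagger may differ):

```python
# Per-tag probabilities, copied from the tagtext_proba example output
probs = {"DRUG": 0.747944, "CPD": 0.617198, "VIOL": 0.183003,
         "UNSPC": 0.145019, "ILSP": 0.114254, "POLM": 0.059985}

prob_thresh = 0.5
# Keep tags whose probability clears the threshold, highest first
tags = [tag for tag, p in sorted(probs.items(), key=lambda kv: -kv[1])
        if p >= prob_thresh]
# tags == ['DRUG', 'CPD'], matching tagger.tagtext(article_text, prob_thresh=0.5)
```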
The installation comes with a very rudimentary command-line interface which, when given no arguments, reads from stdin.
$ python -m tagnews.crimetype.cli
Go ahead and start typing. Hit ctrl-d when done.
<type here>
Alternatively, you can provide a list of articles to tag; for each article, a CSV of tag probabilities is written to <article name>.tagged.
$ python -m tagnews.crimetype.cli sample-article-1.txt sample-article-2.txt
$ cat sample-article-1.txt.tagged
CPD, 0.912382307
UNSPC, 0.051873838
SEXA, 0.031065436
BEAT, 0.023119570
DRUG, 0.017140532
...
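The .tagged output above is a simple two-column CSV (tag, probability), so it can be read back with the standard library. A minimal sketch, using sample lines in the same shape as the output shown (load_tagged is a hypothetical helper, not part of tagnews):

```python
import csv
import io

# Sample lines in the same shape as the .tagged output above
sample = "CPD, 0.912382307\nUNSPC, 0.051873838\nSEXA, 0.031065436\n"

def load_tagged(f):
    # Each row is "<TAG>, <probability>"; build a tag -> float mapping
    return {tag.strip(): float(p) for tag, p in csv.reader(f)}

probs = load_tagged(io.StringIO(sample))
# With a real file: probs = load_tagged(open("sample-article-1.txt.tagged"))
```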
Note that the -m flag is required.
We want to compare how often different types of crimes are reported in certain areas with how often they actually occur there. Are some crimes under-represented in certain areas but over-represented in others? To answer this, we need to extract both a type-of-crime tag and geospatial data from news articles.
We meet every Tuesday at Chi Hack Night, and you can find out more about this specific project here.
The Chicago Justice Project has been scraping RSS feeds of articles written by Chicago area news outlets for several years, allowing them to collect almost 300,000 articles. At the same time, an amazing group of volunteers has helped them tag these articles. The tags include crime categories like "Gun Violence", "Drugs", and "Sexual Assault", but also organizations such as "Cook County State's Attorney's Office", "Illinois State Police", and "Chicago Police Department", and other miscellaneous categories such as "LGBTQ" and "Immigration". The volunteer UI was also recently updated to allow highlighting of geographic information.
You want to contribute? Great! Check out the CONTRIBUTING.md file for more info.
This part of the project aims to automate the category tagging using a branch of machine learning known as Natural Language Processing.
Possible models to use (some of which we have tried!) include
- Bag of words
- n-gram models
- A combination of bag-of-words and n-gram models
- Word Vectorization as a pre-processing step
- Convolutional Neural Networks
- Recurrent Neural Networks
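As a toy illustration of the first idea, a bag-of-words representation simply counts word occurrences and ignores word order. A minimal plain-Python sketch (a real pipeline would use something like scikit-learn's CountVectorizer with proper tokenization):

```python
from collections import Counter

def bag_of_words(text):
    # Lowercase and strip trailing punctuation before counting occurrences
    tokens = (word.strip(".,!?;:").lower() for word in text.split())
    return Counter(t for t in tokens if t)

counts = bag_of_words("Police seized the drugs. Drugs were found nearby.")
# counts["drugs"] == 2, counts["police"] == 1
```

These counts (or vectors of them over a fixed vocabulary) then become the features fed to a classifier such as logistic regression.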
It might be useful to have an additional corpus of news articles that we can use for unsupervised feature learning without having to worry about over-fitting.
We also need to automatically find the geographic area of the crime the article is talking about. We have just recently updated the tagging interface to allow highlighting geospatial information inside articles and are collecting ground truth data. Once we have collected this data, we need to automate the process of detecting location information inside articles. An important note: we are relying on the power of current geocoders to take unstructured location information and output a latitude/longitude pair.
One possible path forward appeared to involve an approach developed by Everyblock, who received funding from the Knight Foundation to geolocate news articles and were required to open source their code. A brief investigation suggests their geolocating is essentially a giant regular expression, and it proved not accurate enough on its own for our purposes.
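To give a flavor of the regex approach, the snippet below matches simple street addresses in free text. The pattern is a toy invented purely for illustration and bears no resemblance to Everyblock's actual expression:

```python
import re

# Toy street-address pattern: house number, optional direction letter,
# one capitalized street name, and a common suffix. Invented for
# illustration only; a production pattern would be far larger.
address = re.compile(
    r"\b\d{1,5}\s+(?:[NSEW]\.?\s+)?[A-Z][a-z]+\s+(?:St|Ave|Blvd|Rd)\b"
)

text = "A robbery was reported near 3400 W Douglas Blvd on Tuesday."
matches = address.findall(text)
# A geocoder would then turn the matched string into a latitude/longitude pair.
```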
Things to check out:
Some articles may discuss multiple crimes. Some crimes may occur in multiple areas, whereas others may not be associated with any geographic information (e.g. some kinds of fraud).