propublica / cookcountyjail2

A new version of the cook county jail scraper, inspired by the Supreme Chi-Town Coding Crew

License: MIT License

Python 33.03% HTML 66.76% Shell 0.21%

cookcountyjail2's Issues

endpoints error out

All scraping endpoints error out. Please excuse me if I configured this wrong. I was wondering if this is still running for anyone.

There's a jail # that looks like `00:00:00`

Running from seed created yesterday:

2017-06-14 08:40:14 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://www2.cookcountysheriff.org/search2/details.asp?jailnumber=00:00:00> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
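
One way to keep a value like `00:00:00` out of the request queue is to validate jail numbers before building URLs. The sketch below assumes the format seen elsewhere in this tracker (`2017-0612061`, read here as a four-digit year, a hyphen, and seven digits); the function names are hypothetical and not part of the project.

```python
import re

# Assumed format, inferred from the example 2017-0612061:
# four-digit year, a hyphen, then seven digits.
JAIL_NUMBER_RE = re.compile(r"\d{4}-\d{7}")


def is_valid_jail_number(value):
    """Return True if value matches the assumed jail number format."""
    return JAIL_NUMBER_RE.fullmatch(value) is not None


def filter_jail_numbers(values):
    """Drop anything that does not look like a jail number."""
    return [v for v in values if is_valid_jail_number(v)]
```

Filtering at seed-load time would also show whether the bad value comes from the seed file or from URL generation.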

LOTS of missing data

Something seems very wrong with the scraper's ability to pick up new inmates. This is disconcerting.

You should consider retrying missing IDs from previous days

In the previous version, I found that the Cook County website would sometimes fail to generate a page for an inmate. So I added logic to retry missing IDs from previous days, which included checking on existing inmates who appear to have been released.

Do you think that check should be added?
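
A minimal sketch of the retry idea, assuming jail numbers follow a `YYYY-MMDDNNN` pattern (inferred from `2017-0612061`) and that a set of already-scraped IDs is available. The names `candidate_ids` and `missing_ids`, and the `days_back` and `max_per_day` parameters, are hypothetical.

```python
from datetime import date, timedelta


def candidate_ids(day, max_per_day):
    """Build every possible ID for one day under the assumed YYYY-MMDDNNN scheme."""
    return ["{:%Y-%m%d}{:03d}".format(day, n) for n in range(1, max_per_day + 1)]


def missing_ids(seen_ids, today, days_back=3, max_per_day=5):
    """Return IDs from the previous days_back days that were never scraped."""
    seen = set(seen_ids)
    missing = []
    for offset in range(1, days_back + 1):
        day = today - timedelta(days=offset)
        missing.extend(i for i in candidate_ids(day, max_per_day) if i not in seen)
    return missing
```

The returned IDs could be appended to the day's URL list so that pages the site failed to generate earlier get another attempt.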

more tests

Things to test:

  • Cleanup scripts
  • Spider use of config
  • Mirroring / saving bits
  • URL generation
  • Seed file bits

Moto looks like it could be highly useful for our purposes.

Seed data should be optional

Since I don't have any of the old daily inmate summaries, I can't run the scraper locally. Seed data should be optional for the first run.

2017-06-10 14:53:47 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-06-10 14:53:47 [scrapy.core.engine] INFO: Spider opened
2017-06-10 14:53:47 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-06-10 14:53:47 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-06-10 14:53:47 [scrapy.core.engine] ERROR: Error while obtaining start requests
Traceback (most recent call last):
  File "/home/wil/.virtualenvs/ccj-cookcountyjail2/lib/python3.6/site-packages/scrapy/core/engine.py", line 127, in _next_request
    request = next(slot.start_requests)
  File "/home/wil/code/ccj/cookcountyjail2/jailscraper/spiders/inmate_spider.py", line 32, in start_requests
    for url in self._generate_urls():
  File "/home/wil/code/ccj/cookcountyjail2/jailscraper/spiders/inmate_spider.py", line 59, in _generate_urls
    last_file = keys[-1].get()
IndexError: list index out of range
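
The traceback points at `keys[-1]` being evaluated on an empty list. A guard along these lines would let a first run proceed without a seed. This is only a sketch: `latest_seed` is a hypothetical helper, and it assumes `keys` is ordered oldest to newest.

```python
def latest_seed(keys):
    """Return the newest seed key, or None when no seed file exists yet."""
    if not keys:
        return None
    return keys[-1]
```

The caller would then need a fallback for `None`, for example starting URL generation from a configured date instead of the last seed file.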

offer local mirroring option in addition to s3

It would be nice for the scraper to decide automatically between S3 and local storage, based on the configured location.

e.g., in project_config.py:

MIRROR_HTML = True
MIRROR_HTML_LOCATION = 's3://mybucket.mydomain.tld'

or

MIRROR_HTML_LOCATION = 'data/scrape'

Would probably want to default to local, too...
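
A rough sketch of how the choice could be made from the setting alone, using the URL scheme of `MIRROR_HTML_LOCATION`. The function name and the returned tuple shape are assumptions for illustration only.

```python
from urllib.parse import urlparse


def parse_mirror_location(location):
    """Classify a mirror location as S3 or local based on its URL scheme.

    Returns (kind, root, prefix): for S3, root is the bucket and prefix is
    the key prefix; for local paths, root is the directory and prefix is ''.
    """
    parsed = urlparse(location)
    if parsed.scheme == "s3":
        return ("s3", parsed.netloc, parsed.path.lstrip("/"))
    # Anything without an s3:// scheme is treated as a local directory.
    return ("local", location, "")
```

Treating every non-`s3://` value as a local path also gives the local default for free.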

Empty next court dates cause error

Need a test. Related to #26 ... we should test this case :)

2017-06-13 11:14:12 [scrapy.core.scraper] ERROR: Spider error processing <GET http://www2.cookcountysheriff.org/search2/details.asp?jailnumber=2017-0612061> (referer: None)
Traceback (most recent call last):
  File "/Users/DE-Admin/.virtualenvs/cookcountyjail2/lib/python3.6/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/Users/DE-Admin/.virtualenvs/cookcountyjail2/lib/python3.6/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/Users/DE-Admin/.virtualenvs/cookcountyjail2/lib/python3.6/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/Users/DE-Admin/.virtualenvs/cookcountyjail2/lib/python3.6/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Users/DE-Admin/.virtualenvs/cookcountyjail2/lib/python3.6/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Users/DE-Admin/Code/cookcountyjail2/jailscraper/spiders/inmate_spider.py", line 51, in parse
    'Court_Date': inmate.court_date,
  File "/Users/DE-Admin/Code/cookcountyjail2/jailscraper/models.py", line 71, in court_date
    return self._makedate(value)
  File "/Users/DE-Admin/Code/cookcountyjail2/jailscraper/models.py", line 33, in _makedate
    return datetime.strptime(x, '%m/%d/%Y').strftime('%Y-%m-%d')
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/_strptime.py", line 565, in _strptime_datetime
    tt, fraction = _strptime(data_string, format)
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/_strptime.py", line 362, in _strptime
    (data_string, format))
ValueError: time data '' does not match format '%m/%d/%Y'
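
The failure comes from passing an empty string to `datetime.strptime`. A minimal sketch of a tolerant date helper follows; `make_date` is a stand-in for the project's `_makedate`, and returning `None` for a missing date is an assumption about the desired behavior.

```python
from datetime import datetime


def make_date(value):
    """Convert MM/DD/YYYY to YYYY-MM-DD, or return None for an empty value."""
    value = (value or "").strip()
    if not value:
        return None
    return datetime.strptime(value, "%m/%d/%Y").strftime("%Y-%m-%d")
```

A test for this case could assert that an inmate page with no next court date yields `None` rather than raising.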

better testing mode

I'm thinking there could be a configurable target percentage of URLs, plus some way of selecting that share, either with modulo math or by taking a random sample of URLs.
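
One possible take on the modulo approach: hash each URL into one of 100 buckets and keep the URLs whose bucket falls below the target percentage. This is a sketch, not the project's implementation; `in_sample`, `sample_urls`, and `target_percent` are hypothetical names.

```python
import zlib


def in_sample(url, target_percent):
    """Deterministically place a URL in one of 100 buckets and test membership."""
    bucket = zlib.crc32(url.encode("utf-8")) % 100
    return bucket < target_percent


def sample_urls(urls, target_percent):
    """Keep roughly target_percent of the URLs, preserving their order."""
    return [u for u in urls if in_sample(u, target_percent)]
```

Unlike a random sample, the hash-based selection picks the same URLs on every run, which keeps test runs comparable. The share actually kept is only approximately the target.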

handle empty "target"

"target" in the config essentially specifies which directory to use when storing to S3 ('dev', 'prod', etc.). But maybe you don't want that! Turns out, we don't. So we need to handle an empty string there.
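
A small sketch of key construction that tolerates an empty "target", so no leading slash or empty path segment ends up in the S3 key. The helper name `s3_key` is hypothetical.

```python
def s3_key(target, filename):
    """Join target and filename into an S3 key, skipping an empty target."""
    parts = [p.strip("/") for p in (target, filename) if p and p.strip("/")]
    return "/".join(parts)
```

With this, `s3_key('dev', 'page.html')` gives `dev/page.html`, while `s3_key('', 'page.html')` gives `page.html`.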
