propublica / cookcountyjail2

23.0 23.0 6.0 232 KB

A new version of the cook county jail scraper, inspired by the Supreme Chi-Town Coding Crew

License: MIT License

Python 33.03% HTML 66.76% Shell 0.21%

cookcountyjail2's People

wilbertom eads nwinklareth lorarjohns forestofthings

cookcountyjail2's Issues

endpoints error out

All scraping endpoints error out. Please excuse me if I configured this wrong. I was wondering if this is still running for anyone.

scrape.sh should respect target

consider factoring parsing into an object as was done before

make testing easier for sure

develop and document new data model

There's a jail # that looks like `00:00:00`

Running from seed created yesterday:

2017-06-14 08:40:14 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://www2.cookcountysheriff.org/search2/details.asp?jailnumber=00:00:00> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
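
A minimal sketch of how a malformed jail number such as `00:00:00` could be filtered out before a URL is generated. The pattern is an assumption inferred from the one well-formed number that appears in these issues (`2017-0612061`); `is_valid_jail_number` is a hypothetical helper, not something in the repository.

```python
import re

# Assumed shape: four-digit year, a dash, then seven digits (e.g. 2017-0612061).
JAIL_NUMBER_PATTERN = re.compile(r'\d{4}-\d{7}')


def is_valid_jail_number(value):
    """Return True when the value matches the assumed jail number shape."""
    return bool(JAIL_NUMBER_PATTERN.fullmatch(value))


seed_values = ['2017-0612061', '00:00:00']
# Keep only the values that look like real jail numbers.
clean_values = [v for v in seed_values if is_valid_jail_number(v)]
```

Filtering at seed-loading time would keep bad values from ever reaching the spider's request queue.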

decouple s3 storage for scraped pages and csv snapshots

deleting tempdir in spider bombs when using local storage

LOTS of missing data

Something seems very wrong with the scraper's ability to pick up new inmates. This is disconcerting.

automate seeding current scrape from previous scrapes

calculate age at booking in parser

set daily csv files to public-read

configurable soft maximum jail number

does it make sense to move project_config to a module?

mark all seeded records incomplete if seed file is more than one day old

need to mock environment variables, not actually set them, in tests
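
One possible approach, sketched with the standard library: `unittest.mock.patch.dict` can override `os.environ` for the duration of a test and restore it afterwards. The variable name `CCJ_TARGET` and the `get_target` function are assumptions made for illustration, not the project's actual configuration.

```python
import os
from unittest import mock


def get_target():
    """Hypothetical config reader; the CCJ_TARGET name is an assumption."""
    return os.environ.get('CCJ_TARGET', 'dev')


def test_target_from_environment():
    # The override only applies inside the with block.
    with mock.patch.dict(os.environ, {'CCJ_TARGET': 'prod'}):
        assert get_target() == 'prod'
    # On exit, os.environ is restored to its previous contents.
```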

does "target" even make sense as configuration option?

make spider smart about finding jail #s beyond soft max

You should consider retrying missing IDs from previous days

In the previous version, I found that the Cook County website would sometimes fail to generate a page for an inmate. So I added logic to check for missing IDs from previous days, including checking for existing inmates who appear to have been released.

Do you think that check should be added?
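
A rough sketch of the check described above, assuming the previous and current scrapes are each available as a collection of jail numbers. `ids_to_retry` is a hypothetical name; how the previous version actually implemented this is not shown here.

```python
def ids_to_retry(previous_ids, current_ids):
    """Return jail numbers seen in a previous scrape but missing today, sorted."""
    return sorted(set(previous_ids) - set(current_ids))


yesterday = ['2017-0612001', '2017-0612002', '2017-0612003']
today = ['2017-0612002']
# These IDs would be requested again before being treated as released.
retry = ids_to_retry(yesterday, today)
```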

handle missing days

seed file day-before-today logic doesn't seem to be working right

document contributing

publish csv snapshot manifest

more tests

Things to test:

Moto looks like it could be highly useful for our purposes.

rewrite readme

add tests for url generation

Seed data should be optional

Since I don't have any of the old daily inmate summaries, I can't run the scraper locally. This should be optional for the first run.

2017-06-10 14:53:47 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-06-10 14:53:47 [scrapy.core.engine] INFO: Spider opened
2017-06-10 14:53:47 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-06-10 14:53:47 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-06-10 14:53:47 [scrapy.core.engine] ERROR: Error while obtaining start requests
Traceback (most recent call last):
  File "/home/wil/.virtualenvs/ccj-cookcountyjail2/lib/python3.6/site-packages/scrapy/core/engine.py", line 127, in _next_request
    request = next(slot.start_requests)
  File "/home/wil/code/ccj/cookcountyjail2/jailscraper/spiders/inmate_spider.py", line 32, in start_requests
    for url in self._generate_urls():
  File "/home/wil/code/ccj/cookcountyjail2/jailscraper/spiders/inmate_spider.py", line 59, in _generate_urls
    last_file = keys[-1].get()
IndexError: list index out of range
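
The traceback shows `keys[-1]` failing on an empty list. A minimal sketch of a guard, assuming `keys` is a list ordered oldest to newest; `latest_seed_key` is a hypothetical helper, not the repository's code.

```python
def latest_seed_key(keys):
    """Return the most recent seed key, or None when no seed files exist."""
    if not keys:
        return None
    return keys[-1]


# With no previous snapshots the caller can skip seeding instead of crashing.
first_run = latest_seed_key([])
later_run = latest_seed_key(['2017-06-08.csv', '2017-06-09.csv'])
```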

wire up to use amazon s3 for storage

develop basic ingest script from CSV files

depends on #32

remove scrapy cruft

read in environment variables for parts of configuration

git push -u advice in readme is wrong

offer local mirroring option in addition to s3

would be nice to decide magically..

e.g., in project_config.py:

MIRROR_HTML = True
MIRROR_HTML_LOCATION = 's3://mybucket.mydomain.tld'

# or, for a local directory:
MIRROR_HTML_LOCATION = 'data/scrape'

Would probably want to default to local, too...
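
One way the "decide magically" idea might work, as a sketch only: inspect the scheme of `MIRROR_HTML_LOCATION` and treat anything that is not an `s3://` URL as a local path. `parse_mirror_location` and its return shape are assumptions for illustration.

```python
from urllib.parse import urlparse


def parse_mirror_location(location):
    """Return ('s3', bucket, prefix) for s3:// URLs, else ('local', path, None)."""
    parsed = urlparse(location)
    if parsed.scheme == 's3':
        return ('s3', parsed.netloc, parsed.path.lstrip('/'))
    # Anything without an s3 scheme is treated as a local directory.
    return ('local', location, None)


s3_target = parse_mirror_location('s3://mybucket.mydomain.tld')
local_target = parse_mirror_location('data/scrape')
```

Defaulting to the local branch matches the suggestion that local storage should be the default.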

parse hash the same way SC3 did

Here it is: https://github.com/sc3/cookcountyjail/blob/9631ee1978f74ca74804f6564575cc62acebce09/scraper/inmate_details.py#L69-L77

Since the original code is GPL, will need to recreate and not copy.

clean up requirements

config system may not be working right

Empty next court dates cause error

Need a test. Related to #26 ... we should test this case :)

2017-06-13 11:14:12 [scrapy.core.scraper] ERROR: Spider error processing <GET http://www2.cookcountysheriff.org/search2/details.asp?jailnumber=2017-0612061> (referer: None)
Traceback (most recent call last):
  File "/Users/DE-Admin/.virtualenvs/cookcountyjail2/lib/python3.6/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/Users/DE-Admin/.virtualenvs/cookcountyjail2/lib/python3.6/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/Users/DE-Admin/.virtualenvs/cookcountyjail2/lib/python3.6/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/Users/DE-Admin/.virtualenvs/cookcountyjail2/lib/python3.6/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Users/DE-Admin/.virtualenvs/cookcountyjail2/lib/python3.6/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Users/DE-Admin/Code/cookcountyjail2/jailscraper/spiders/inmate_spider.py", line 51, in parse
    'Court_Date': inmate.court_date,
  File "/Users/DE-Admin/Code/cookcountyjail2/jailscraper/models.py", line 71, in court_date
    return self._makedate(value)
  File "/Users/DE-Admin/Code/cookcountyjail2/jailscraper/models.py", line 33, in _makedate
    return datetime.strptime(x, '%m/%d/%Y').strftime('%Y-%m-%d')
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/_strptime.py", line 565, in _strptime_datetime
    tt, fraction = _strptime(data_string, format)
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/_strptime.py", line 362, in _strptime
    (data_string, format))
ValueError: time data '' does not match format '%m/%d/%Y'

string replacement might be faster than actually coercing to a real date.
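
A sketch of both ideas, under assumptions: the traceback shows `strptime` failing on an empty string, so the first function returns `None` for blank input, and the second rearranges the string without building a date object. The function names are hypothetical, and whether the string version is actually faster has not been measured here.

```python
from datetime import datetime


def make_date(value):
    """Convert 'MM/DD/YYYY' to 'YYYY-MM-DD'; return None for empty input."""
    if not value or not value.strip():
        return None
    return datetime.strptime(value.strip(), '%m/%d/%Y').strftime('%Y-%m-%d')


def make_date_by_split(value):
    """Same conversion by rearranging the string; does not validate the date."""
    if not value or not value.strip():
        return None
    month, day, year = value.strip().split('/')
    return f'{year}-{month.zfill(2)}-{day.zfill(2)}'
```

The trade-off is that the string version would pass through an impossible date such as `13/45/2017` unchanged in meaning, while `strptime` would raise.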
