
freelawproject / juriscraper

340 stars, 41 watchers, 98 forks, 54.23 MB

An API to scrape American court websites for metadata.

Home Page: https://free.law/juriscraper/

License: BSD 2-Clause "Simplified" License

Languages: Python 1.73%, HTML 98.26%, Makefile 0.01%, Jinja 0.01%
Topics: scraping, government, courts, pacer

juriscraper's People

Contributors

agarzaarvizu, albertisfu, andr3ic, arderyp, audiodude, cweider, dependabot[bot], divergentdave, drewsilcock, erosendo, flooie, grossir, honeykjoule, ikeboy, janderse, johnhawkinson, kfinity, lezh1k, lucioric2000, m4h7, mlissner, mmantel, mseflek, pre-commit-ci[bot], quevon24, tewen, ttys0dev, umeboshi2, varun-magesh, voutilad


juriscraper's Issues

Add timeouts to every request

An anti-feature of the super-popular requests library is that it doesn't have timeouts by default. This means that if a connection hangs for whatever reason, the program itself will hang...forever.

I filed https://github.com/kennethreitz/requests/issues/3070 about changing this in the next major version of requests, but until then, we need to add timeout values to every request. I think two of the Texas scrapers in #160 are currently frozen for just this reason.

I'll be pushing a release with this fix shortly.
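
For illustration, a minimal sketch of the idea, assuming the scrapers funnel their HTTP calls through one helper (the helper name and default value are mine, not the actual code):

    import requests

    DEFAULT_TIMEOUT = 30  # seconds; an arbitrary illustrative default

    def fetch(url, **kwargs):
        # Always pass an explicit timeout so a hung connection raises
        # requests.exceptions.Timeout instead of blocking forever.
        kwargs.setdefault('timeout', DEFAULT_TIMEOUT)
        return requests.get(url, **kwargs)

Any place that calls requests.get() or requests.post() directly would need the same treatment.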

License

The README says "Juriscraper is licensed under the permissive BSD license.", but there is no license text in the repository. Could you please add an explicit license?
I'd propose a text myself, but there are several BSD licenses, so I don't know which one you intended.

Change the API to be Sites > Opinions > Meta data

Hi,

I just found out about your project and read through the code. I like the organization a lot. I see that the scraping generally returns parallel arrays of data, such as lists of all case names or case dates. I think it'd be very useful to be able to retrieve a single list of opinion objects, each of which would have attributes such as name, date, etc. Have you considered this kind of API?
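
To make the idea concrete, here's a rough sketch using the parallel attributes the scrapers already expose (case_names, case_dates, download_urls); the Opinion tuple and iter_opinions helper are hypothetical, not existing code:

    import collections

    Opinion = collections.namedtuple('Opinion', ['name', 'date', 'url'])

    def iter_opinions(site):
        # Zip the parallel lists the scraper already builds into one
        # object per opinion.
        for name, date, url in zip(site.case_names, site.case_dates,
                                   site.download_urls):
            yield Opinion(name, date, url)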


Help wanted checking the following scrapers

The following courts generated errors on last run and are likely broken:
(Compare the court website's most recent opinion with https://courtlistener.com 's latest opinion from that court.)

  • colo Last opinion in CL is 2/17/15.
  • nh Last opinion in CL is 1/22/15. _FIXED BY PR #51_
  • ri Missing lots of recent opinions.
  • ohio Last opinion in CL is 2/10/15.
  • ohioctapp Something amiss (all 12 districts) since 2/9/15.
  • ala See issue #46. It may be that we can stop using alanic.
  • vt _FIXED AS OF 6/4/15_
  • idahoctapp Last criminal opinion in CL is 2/26/15.
  • sd Last opinion in CL is 1/28/15. _FIXED AS OF 6/4/15_
  • fladistctapp Last opinion in CL is 4/1/15.
  • nev Last opinion in CL is 2/24/15.
  • mo Last opinion in CL is 3/14/15.
  • moctapp Last opinion in CL is 3/13/15. (Link is to Eastern only; check all districts).
  • connappct Last opinion in CL is 3/31/15.
  • Idaho Civil
  • Kentucky

9th Circuit Oral Argument Audio Scraper Stopped?

Last oral argument audio files from 9th Circuit on CL are from May 6th, 2015. However, there are more than 100 more recent audio files on their website. So, it appears it's been broken a while and we have some catching up to do.

Logging location should be configurable

While working on issue #59, I noticed that log locations appear to be hardcoded in a few places. These should probably be set in a config file or an environment setting. Ideally the default would also be a path relative to the current working directory, not a hardcoded path that most likely doesn't exist on most machines (especially non-Un*x ones).

Warning: No such file or directory: /var/log/juriscraper/debug.log. Have you created the directory for the log?
Juriscraper will continue to run, but will not create logs.
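
A minimal sketch of one way to do it, with an environment-variable override (the JURISCRAPER_LOG name is just a suggestion, not an existing setting):

    import logging
    import os

    # Prefer an explicit override; otherwise log relative to the current
    # working directory instead of a hardcoded /var/log path.
    log_path = os.environ.get('JURISCRAPER_LOG',
                              os.path.join(os.getcwd(), 'juriscraper.log'))
    logger = logging.getLogger('juriscraper')
    logger.addHandler(logging.FileHandler(log_path))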

Custom _download() methods are breaking tests.

This will take some finesse, but the problem is that the tests call parse(), and parse() calls _download(). Since we've overridden _download() in many cases, that code runs instead of the code with the check for self.method == 'LOCAL'.

Some solution will need to be found, especially for the cases where Selenium is being run in the _download methods.
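
One possible shape for a fix, purely as a sketch: keep the LOCAL check in the base class and have scrapers override a narrower hook instead of _download() itself (the _download_live and _get_tree_from_local_file names are made up for illustration):

    def _download(self, request_dict=None):
        # The base class handles the test path once, so overriding
        # scrapers can't bypass it.
        if self.method == 'LOCAL':
            return self._get_tree_from_local_file()
        return self._download_live(request_dict or {})

    def _download_live(self, request_dict):
        # Scrapers (including the Selenium-based ones) override this
        # instead of _download().
        raise NotImplementedError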

Make Juriscraper installable with pip

I think this would aid people in getting it into their workflow and that it would help us update/release versions. I think there's something about Python "wheels" being used these days for this.

There will also likely be a problem with logging, since Juriscraper is currently configured to use /var/log/juriscraper/debug.log by default, which requires root access.
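
As a starting point, a minimal setup.py sketch (the version and dependency list are placeholders, not a claim about what they should actually be):

    from setuptools import setup, find_packages

    setup(
        name='juriscraper',
        version='0.1.0',  # placeholder
        packages=find_packages(),
        install_requires=['requests', 'lxml', 'python-dateutil'],  # illustrative subset
    )

With something like that in place, "pip install ." works and building a wheel is straightforward, but the default log path mentioned above would need to change so installation doesn't assume /var/log/juriscraper/ exists.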

Investigate blocking all network requests during tests

I'm pretty sure some of the tests are hitting the network. This isn't a great situation but there seems to be a decent solution to this over on stackoverflow.

Monkey patching socket ought to do it:

import socket

def guard(*args, **kwargs):
    raise Exception("I told you not to use the Internet!")

# Replace the socket constructor so nothing can open a network connection.
socket.socket = guard

I believe this just breaks socket so that no library can connect using it.

Naming of get_reporter_citation

We currently have _get_neutral_citation and _get_west_citation but "west" is not totally accurate. I think we really mean "bound_volume" or "reporter" citation just to distinguish this from neutral citations. Maryland is a good example where we can get their official reporter in a backscraper but the "West" volume would be the Atlantic Reporter cite, which isn't posted on the Md. website.

A while back we changed the CL database to include a bunch of citation fields. Do we need juriscraper to better align to these different types of cites? For instance, if there were a page that provided both the state official reporter citation and the west regional reporter citation, then we would only have _get_west_citation available, but we'd also need something like _get_reporter_citation for the official reporter citations.

I suppose we can use get_west_citation for the non-West Maryland official reporter in the Maryland backscraper I'd like to make, and wait for the day when we actually encounter a site with both official reporter cites and West regional reporter cites, but I thought I'd raise the issue in case you have a preference for tidying this up now. If having one Maryland backscraper use a slightly inaccurate field doesn't bother you, then just close this.


Move Phantomjs to a single import and single instance

I'm pretty sure most of the reason that PhantomJS is slow is that we import and instantiate it every single time. A better solution would keep it around after it's first started. An even better solution would keep it around, but only instantiate it when it's first needed (so calling a single scraper that doesn't use PhantomJS isn't negatively impacted).

There's an answer on StackOverflow that has some tips about this. This also might help with the Webdriver connection errors we seem to have on longer-running processes.
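
A rough sketch of the lazy-singleton idea (the module-level cache and helper name are assumptions, not existing code):

    from selenium import webdriver

    _driver = None

    def get_driver():
        # Start PhantomJS only the first time a scraper actually needs it,
        # then reuse the same instance on subsequent calls.
        global _driver
        if _driver is None:
            _driver = webdriver.PhantomJS()
        return _driver

A teardown step that calls quit() on the shared instance at the end of a run would still be needed so the process exits cleanly.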

Abbreviations should follow Bluebook format

It seems that the state subdirectory currently has placeholder files that use the postal code abbreviations of the states. Besides the fact that no one but the postmaster general can keep these straight, they are not the abbreviations used in legal citation, per the commandments from on high from our masters at the Bluebook.

With the federal courts we have thus far had the, I think, commendable practice of naming the file after the courtID that will eventually be used in courtlistener, and hence in some search queries made by humans that have to remember these abbreviations. Our legal audience will expect the Bluebook abbreviations.

I'd ask we change all our state placeholder filenames to correspond to the names found in this table: http://www.law.cornell.edu/citation/4-500.htm because I can remember Mass. Minn. Mich. Miss. Mo. but there's no hope for me (or other humans) with the two-letter codes.

Oh, and I'm willing to do this and create a pull request if there is consensus that it's OK to do.


Return meta-data to caller as plain Python objects

The current way of returning opinions from scrapers complicates the callers. While the meta-data may be scraped column-by-column, it generally isn't used that way.

Possible simplification:

    # Returns all the scraped cases.  Call after parse().
    # Returns a list of dictionaries, one for each case.
    def get_cases(self)

This could also be used to implement to_csv(), to_html(), to_xml(), to_json() generically.

Returns data like:

[OrderedDict([
    ('case_dates', '2014-06-30'), 
    ('case_names', u'93-06 994'), 
    ('download_urls', 'http://www.va.gov/vetapp14/Files4/1429545.txt'), 
    ('precedential_statuses', u'Unpublished'), 
    ('docket_numbers', u'93-06 994'), 
    ('neutral_citations', u'1429545')]),
 OrderedDict([
    ('case_dates', '2014-06-30'), 
    ('case_names', u'13-34 313'), 
        ...
    ('neutral_citations', u'1429590')]),
]

For existing scrapers this would suffice:

    # Requires "import collections" and "import datetime" at module level.
    def get_cases(self):
        cases = []
        for attr in self._all_attrs:
            values = getattr(self, attr)
            if values is None:
                continue
            for i, value in enumerate(values):
                if len(cases) <= i:
                    cases.append(collections.OrderedDict())
                if isinstance(value, datetime.date):
                    value = value.isoformat()
                cases[i][attr] = value
        return cases

string_utils should replace windows-1252 encoded special characters

string_utils.py currently converts a number of nasty curly apostrophes and the like to something sensible. It doesn't seem to work on this cp1252 (aka windows-1252) encoded page:

http://www.sconet.state.oh.us/ROD/docs/default.asp?Page=1&Sort=docdecided%20DESC&PageSize=25&Source=0&iaFilter=2012&ColumnMask=669

The summaries contain both emdashes and curly apostrophes that don't get converted right. Example:

u'Attorney misconduct, including failing to act with reasonable
diligence in representing a client, failing to promptly refund any
unearned fee upon the lawyer\x92s withdrawal from employment, and
knowingly failing to respond to a demand for information by a
disciplinary authority during an investigation\x97Indefinite suspension.'

The \x97 appears as an em dash in the original HTML and the \x92 appears as an apostrophe in the original HTML.
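
For illustration, the kind of mapping that would handle the example above (a sketch, not the existing string_utils code):

    # Windows-1252 control-range characters that leak into scraped text,
    # mapped to sensible Unicode equivalents.
    CP1252_MAP = {
        u'\x92': u'\u2019',  # right single quotation mark
        u'\x97': u'\u2014',  # em dash
    }

    def fix_cp1252(text):
        for bad, good in CP1252_MAP.items():
            text = text.replace(bad, good)
        return text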


Develop District Court Scrapers

The District Court of New Mexico has a document retrieval system OUTSIDE of PACER that provides opinions:

http://www.nmcourt.fed.us/Drs-Web/input

It has over 20,000 opinions going back many years, and will display
1,000 at a time (quickly!). If one enters date restrictions then it's
possible to walk through the archive at a rate of less than 1k per page.

Filing this so we don't miss it, since district courts aren't part of the current road map.


Update dependencies

In particular, python-dateutil is crashing in tests when the latest version is used.

Missing Cases from Texas in 2015

An API user has reported missing citations in Texas from 2015. I investigated and found that cases were indeed missing. Seems the scraper might have been broken or that they've retroactively created some documents -- not sure.

I'm re-running all 16 of Texas's scrapers for the entire year to make sure to catch anything that's missing. Most have finished. The only scrapers that still need to be run are:

  • texapp_1
  • texapp_2
  • texapp_5
  • texapp_14

The rest finished before I made this ticket.

Tests should be analyzed and sped up. Slowness should be a failure.

This is a super easy one to do, but one that I haven't been able to get to. When running tests, there are a number of scrapers that are slow and that are reported by the testing system as "slow scrapers". These scrapers should be tuned to be faster (whether in tests or in the actual code, if it's actually slow), and once all scrapers are fast (or at least tests are), the warnings for slow scrapers should be made into full-on failures.

Basic goals here are:

  1. Identify any scrapers that are actually CPU intensive and fix them.
  2. Make tests faster so we don't have to wait as long.
  3. Make warnings into failures so we won't have regressions.

Should be a fun and fairly easy one.

Command-line disable duplicate matching

Maybe this is already possible, but it would be nice if there were a command-line flag one could pass to Juriscraper to tell it not to worry about the duplicate detection it does. In those weird instances where the court's placement of new material seems to trick the duplicate detector, it would be good to be able to manually tell it: "No, really. Check every URL just to be sure."
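
Something like this in the caller, as a sketch only (the flag name is a suggestion, and this assumes an optparse-style caller like sample_caller.py):

    from optparse import OptionParser

    parser = OptionParser()
    parser.add_option(
        '--no-dup-check',
        action='store_true',
        default=False,
        help='Ignore the duplicate detector and check every URL.',
    )
    (options, args) = parser.parse_args()
    # ...then pass options.no_dup_check down to the scraping loop.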


Support Oral Argument Video

We've got oral argument audio and it's great. But it's already clear that many courts are moving to video, or started there to begin with. When you look at the states, you'll find a lot of video. (See our wiki for a preliminary list: https://github.com/freelawproject/juriscraper/wiki/Court-Websites ) Also of great interest: the 9th Circuit, which previously only did video for en banc oral arguments, has started providing video of most oral arguments and appears to be moving towards video of all of them. See http://www.ca9.uscourts.gov/media/

I suppose the issue on juriscraper is resolved fairly easily, since we're already set up to download most any filetype and its associated metadata, but it seems like some sort of filetype flag might be needed, with options like "audio", "video", "opinion", "docket", and "whatever-else-we-decide-is-a-standalone-type". This same issue should be filed on CourtListener, where integrating a video player would likely be a larger project than any changes needed in juriscraper.

Simplify backscraper API

The backscraper API is currently a private API and varies from scraper to scraper. So CourtListener or other callers of Juriscraper currently need slightly different logic for each backscraper.

One way to simplify it would be to add to the public AbstractSite API:

# Returns an iterator to pass to backscrape(). 
# Returns None if this scraper is not also a backscraper.
# Covers at least from startdate to enddate inclusive. 
# May return slightly more than that.
# Usage:
# for i in get_backscrape_iterator(startdate, enddate):
#     site.backscrape(i)
#     site.parse()
#     ...
def get_backscrape_iterator(startdate, enddate)

What do you think?
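
For concreteness, a sketch of what this might look like for a scraper whose archive is paged by year (the implementation details are illustrative only):

    def get_backscrape_iterator(self, startdate, enddate):
        # One item per year in the range; backscrape() then fetches that
        # year's page. Returns None if this scraper has no backscraper.
        if not hasattr(self, 'backscrape'):
            return None
        return iter(range(startdate.year, enddate.year + 1))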

Consider switching to html5parser for all parsing

lxml has an html5parser that can handle some of the inanities that bad HTML pages present.

For example, this page:

http://media.ca11.uscourts.gov/opinions/unpub/logname.php?begin=9720&num=485&numBegin=1

has stray less-than signs in some of the docket numbers, which results in a terrible HTML tree. I was able to solve that in ca11_u in a way that fixes the problem and even preserves the API for the scrapers.

The hard part of this is that the html5parser's fromstring function returns _Element objects while the html.fromstring function returns HtmlElements. I was able to get around this in ca11_u with something like:

from lxml.html import fromstring, tostring
from lxml.html import html5parser

e = html5parser.fromstring(text)
html_element = fromstring(tostring(e))

Though that involves an extra parse and an extra serialization, which is wasteful and obscures what's going on.

Add scraper for nevada court of appeals

Looks like an easy subclass of nev_p.py, but it's actually a court that I don't think we have in CourtListener yet. Been a long time since that happened, so:

  • Need to decide what to call this jurisdiction. Probably nevctapp? I can't find any references to it in the Bluebook (19th ed.).
  • Need to find notes on how to add this to CourtListener, and it's possible we'll need to re-jigger the order of the courts in the modal (this, alas, is done manually).

Anyway, worth doing, since it's not every day we have a new jurisdiction.

Texas Supreme Court

Last opinion is from Dec. 23, 2014. Something changed on their site, because there are January 2015 opinions available.

Issues with federal_special backscrapers

I tried using the backscrapers for the first time today, and ran into some trouble.

Can you pull the latest Juriscraper code and use the included sample_caller to make sure all your imports are working?

Here's what I get right now:

$ python sample_caller.py -c opinions/united_states_backscrapers
Usage: sample_caller.py -c COURTID [-d|--daemon] [-b|--binaries]

To test ca1, downloading binaries, use: 
    python sample_caller.py -c opinions.united_states.federal_appellate.ca1 -b

To test all federal courts, omitting binaries, use: 
    python sample_caller.py -c opinions.united_states.federal_appellate

sample_caller.py: error: Unable to import module or package. Aborting.

I've got these courts added to the front end, so it'd be great to get the back corpus finally. FINALLY!!!


Syllabus vs. Summary: Only One Should Reign Supreme

In CourtListener, we have syllabus in our data model, and it's described as:

"A summary of the issues presented in the case and the outcome."

In Juriscraper, we collect summaries. I could put a hard-coded mapping in CourtListener that says basically:

syllabus=summary

But that's lame and unfortunate. Is there a legal distinction here that's important? The solution I'd like to put in place is to choose syllabus as our preferred word and update Juriscraper to match.
