
freelawproject / juriscraper

340 stars, 41 watchers, 98 forks, 54.23 MB

An API to scrape American court websites for metadata.

Home Page: https://free.law/juriscraper/

License: BSD 2-Clause "Simplified" License

Languages: Python 1.73%, HTML 98.26%, Makefile 0.01%, Jinja 0.01%
Topics: scraping, government, courts, pacer

juriscraper's People

Contributors

agarzaarvizu, albertisfu, andr3ic, arderyp, audiodude, cweider, dependabot[bot], divergentdave, drewsilcock, erosendo, flooie, grossir, honeykjoule, ikeboy, janderse, johnhawkinson, kfinity, lezh1k, lucioric2000, m4h7, mlissner, mmantel, mseflek, pre-commit-ci[bot], quevon24, tewen, ttys0dev, umeboshi2, varun-magesh, voutilad


juriscraper's Issues

Add timeouts to every request

An anti-feature of the super-popular requests library is that it doesn't have timeouts by default. This means that if a connection hangs for whatever reason, the program itself will hang...forever.

I filed https://github.com/kennethreitz/requests/issues/3070 about changing this in the next major version of requests, but until then, we need to add timeout values to every request. I think two of the Texas scrapers in #160 are currently frozen for just this reason.

I'll be pushing a release with this fix shortly.
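
For illustration, a minimal sketch of the idea, assuming the scrapers funnel their HTTP calls through one helper (the helper name and default value are mine, not the actual code):

    import requests

    DEFAULT_TIMEOUT = 30  # seconds; an arbitrary illustrative default

    def fetch(url, **kwargs):
        # Always pass an explicit timeout so a hung connection raises
        # requests.exceptions.Timeout instead of blocking forever.
        kwargs.setdefault('timeout', DEFAULT_TIMEOUT)
        return requests.get(url, **kwargs)

Any place that calls requests.get() or requests.post() directly would need the same treatment.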

License

The README says "Juriscraper is licensed under the permissive BSD license.", but there is no license text in the repository. Could you please add an explicit license?
I'd propose a text myself, but there are several BSD licenses, so I don't know which one you intended.

Change the API to be Sites > Opinions > Meta data

Hi,

I just found out about your project and read through the code. I like the organization a lot. I see that the scraping generally returns parallel arrays of data, such as lists of all case names or case dates. I think it'd be very useful to be able to retrieve a single list of opinion objects, each of which would have attributes such as name, date, etc. Have you considered this kind of API?
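
To make the idea concrete, here's a rough sketch using the parallel attributes the scrapers already expose (case_names, case_dates, download_urls); the Opinion tuple and iter_opinions helper are hypothetical, not existing code:

    import collections

    Opinion = collections.namedtuple('Opinion', ['name', 'date', 'url'])

    def iter_opinions(site):
        # Zip the parallel lists the scraper already builds into one
        # object per opinion.
        for name, date, url in zip(site.case_names, site.case_dates,
                                   site.download_urls):
            yield Opinion(name, date, url)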


Help wanted checking the following scrapers

The following courts generated errors on last run and are likely broken:
(Compare the court website's most recent opinion with https://courtlistener.com 's latest opinion from that court.)

  • colo Last opinion in CL is 2/17/15.
  • nh Last opinion in CL is 1/22/15. _FIXED BY PR #51_
  • ri Missing lots of recent opinions.
  • ohio Last opinion in CL is 2/10/15.
  • ohioctapp Something amiss (all 12 districts) since 2/9/15.
  • ala See issue #46. It may be that we can stop using alanic.
  • vt _FIXED AS OF 6/4/15_
  • idahoctapp Last criminal opinion in CL is 2/26/15.
  • sd Last opinion in CL is 1/28/15. _FIXED AS OF 6/4/15_
  • fladistctapp Last opinion in CL is 4/1/15.
  • nev Last opinion in CL is 2/24/15.
  • mo Last opinion in CL is 3/14/15.
  • moctapp Last opinion in CL is 3/13/15. (Link is to Eastern only; check all districts).
  • connappct Last opinion in CL is 3/31/15.
  • Idaho Civil
  • Kentucky

9th Circuit Oral Argument Audio Scraper Stopped?

Last oral argument audio files from 9th Circuit on CL are from May 6th, 2015. However, there are more than 100 more recent audio files on their website. So, it appears it's been broken a while and we have some catching up to do.

Logging location should be configurable

While working on issue #59, I noticed that log locations appear to be hardcoded in a few places. These should probably be set in a config file or an environment setting. Ideally the default would also be a path relative to the current working directory, not a hardcoded path that most likely doesn't exist on most machines (especially non-Un*x ones).

Warning: No such file or directory: /var/log/juriscraper/debug.log. Have you created the directory for the log?
Juriscraper will continue to run, but will not create logs.
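
A minimal sketch of one way to do it, with an environment-variable override (the JURISCRAPER_LOG name is just a suggestion, not an existing setting):

    import logging
    import os

    # Prefer an explicit override; otherwise log relative to the current
    # working directory instead of a hardcoded /var/log path.
    log_path = os.environ.get('JURISCRAPER_LOG',
                              os.path.join(os.getcwd(), 'juriscraper.log'))
    logger = logging.getLogger('juriscraper')
    logger.addHandler(logging.FileHandler(log_path))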

Custom _download() methods are breaking tests.

This will take some finesse, but the problem is that the tests call parse(), and parse() calls _download(). Since we've overridden _download() in many cases, that code runs instead of the code with the check for self.method == 'LOCAL'.

Some solution will need to be found, especially for the cases where Selenium is being run in the _download methods.
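
One possible shape for a fix, purely as a sketch: keep the LOCAL check in the base class and have scrapers override a narrower hook instead of _download() itself (the _download_live and _get_tree_from_local_file names are made up for illustration):

    def _download(self, request_dict=None):
        # The base class handles the test path once, so overriding
        # scrapers can't bypass it.
        if self.method == 'LOCAL':
            return self._get_tree_from_local_file()
        return self._download_live(request_dict or {})

    def _download_live(self, request_dict):
        # Scrapers (including the Selenium-based ones) override this
        # instead of _download().
        raise NotImplementedError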

Make Juriscraper installable with pip

I think this would aid people in getting it into their workflow and that it would help us update/release versions. I think there's something about Python "wheels" being used these days for this.

There will also likely be a problem with logging, since Juriscraper is currently configured to use /var/log/juriscraper/debug.log by default, which requires root access.
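
As a starting point, a minimal setup.py sketch (the version and dependency list are placeholders, not a claim about what they should actually be):

    from setuptools import setup, find_packages

    setup(
        name='juriscraper',
        version='0.1.0',  # placeholder
        packages=find_packages(),
        install_requires=['requests', 'lxml', 'python-dateutil'],  # illustrative subset
    )

With something like that in place, "pip install ." works and building a wheel is straightforward, but the default log path mentioned above would need to change so installation doesn't assume /var/log/juriscraper/ exists.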

Investigate blocking all network requests during tests

I'm pretty sure some of the tests are hitting the network. This isn't a great situation but there seems to be a decent solution to this over on stackoverflow.

Monkey patching socket ought to do it:

import socket

def guard(*args, **kwargs):
    raise Exception("I told you not to use the Internet!")

# Replace the socket constructor so nothing can open a network connection.
socket.socket = guard

I believe this just breaks socket so that no library can connect using it.

Naming of get_reporter_citation

We currently have _get_neutral_citation and _get_west_citation but "west" is not totally accurate. I think we really mean "bound_volume" or "reporter" citation just to distinguish this from neutral citations. Maryland is a good example where we can get their official reporter in a backscraper but the "West" volume would be the Atlantic Reporter cite, which isn't posted on the Md. website.

A while back we changed the CL database to include a bunch of citation fields. Do we need juriscraper to better align to these different types of cites? For instance, if there were a page that provided both the state official reporter citation and the west regional reporter citation, then we would only have _get_west_citation available, but we'd also need something like _get_reporter_citation for the official reporter citations.

I suppose we can use get_west_citation for the non-West Maryland official reporter in the Maryland backscraper I'd like to make, and wait for the day when we actually encounter a site with both official reporter cites and West regional reporter cites, but I thought I'd raise the issue in case you have a preference for tidying this up now. If having one Maryland backscraper use a slightly inaccurate field doesn't bother you, then just close this.


Move Phantomjs to a single import and single instance

I'm pretty sure most of the reason that PhantomJS is slow is that we import and instantiate it every single time. A better solution would keep it around after it's first started. An even better solution would keep it around, but only instantiate it when it's first needed (so calling a single scraper that doesn't use PhantomJS isn't negatively impacted).

There's an answer on StackOverflow that has some tips about this. This also might help with the Webdriver connection errors we seem to have on longer-running processes.
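
A rough sketch of the lazy-singleton idea (the module-level cache and helper name are assumptions, not existing code):

    from selenium import webdriver

    _driver = None

    def get_driver():
        # Start PhantomJS only the first time a scraper actually needs it,
        # then reuse the same instance on subsequent calls.
        global _driver
        if _driver is None:
            _driver = webdriver.PhantomJS()
        return _driver

A teardown step that calls quit() on the shared instance at the end of a run would still be needed so the process exits cleanly.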

Abbreviations should follow Bluebook format

It seems that the state subdirectory currently has placeholder files that use the postal code abbreviations of the states. Besides the fact that no one but the postmaster general can keep these straight, they are not the abbreviations used in legal citation, per the commandments from on high from our masters at the Bluebook.

With the federal courts we have thus far had the, I think, commendable practice of naming the file after the courtID that will eventually be used in courtlistener, and hence in some search queries made by humans that have to remember these abbreviations. Our legal audience will expect the Bluebook abbreviations.

I'd ask we change all our state placeholder filenames to correspond to the names found in this table: http://www.law.cornell.edu/citation/4-500.htm because I can remember Mass. Minn. Mich. Miss. Mo. but there's no hope for me (or other humans) with the two-letter codes.

Oh, and I'm willing to do this and create a pull request if there is consensus that it's OK to do.


Return meta-data to caller as plain Python objects

The current way of returning opinions from scrapers complicates the callers. While the meta-data may be scraped column-by-column, it generally isn't used that way.

Possible simplification:

    # Returns all the scraped cases.  Call after parse().
    # Returns a list of dictionaries, one for each case.
    def get_cases(self)

This could also be used to implement to_csv(), to_html(), to_xml(), to_json() generically.

Returns data like:

[OrderedDict([
    ('case_dates', '2014-06-30'), 
    ('case_names', u'93-06 994'), 
    ('download_urls', 'http://www.va.gov/vetapp14/Files4/1429545.txt'), 
    ('precedential_statuses', u'Unpublished'), 
    ('docket_numbers', u'93-06 994'), 
    ('neutral_citations', u'1429545')]),
 OrderedDict([
    ('case_dates', '2014-06-30'), 
    ('case_names', u'13-34 313'), 
        ...
    ('neutral_citations', u'1429590')]),
]

For existing scrapers this would suffice:

    # Requires "import collections" and "import datetime" at module level.
    def get_cases(self):
        cases = []
        for attr in self._all_attrs:
            values = getattr(self, attr)
            if values is None:
                continue
            for i, value in enumerate(values):
                if len(cases) <= i:
                    cases.append(collections.OrderedDict())
                if isinstance(value, datetime.date):
                    value = value.isoformat()
                cases[i][attr] = value
        return cases

string_utils should replace windows-1252 encoded special characters

string_utils.py currently converts a number of nasty curly apostrophes and the like to something sensible. It doesn't seem to work on this cp1252 (aka windows-1252) encoded page:

http://www.sconet.state.oh.us/ROD/docs/default.asp?Page=1&Sort=docdecided%20DESC&PageSize=25&Source=0&iaFilter=2012&ColumnMask=669

The summaries contain both emdashes and curly apostrophes that don't get converted right. Example:

u'Attorney misconduct, including failing to act with reasonable
diligence in representing a client, failing to promptly refund any
unearned fee upon the lawyer\x92s withdrawal from employment, and
knowingly failing to respond to a demand for information by a
disciplinary authority during an investigation\x97Indefinite suspension.'

The \x97 appears as an em dash in the original HTML and the \x92 appears as an apostrophe in the original HTML.
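
For illustration, the kind of mapping that would handle the example above (a sketch, not the existing string_utils code):

    # Windows-1252 control-range characters that leak into scraped text,
    # mapped to sensible Unicode equivalents.
    CP1252_MAP = {
        u'\x92': u'\u2019',  # right single quotation mark
        u'\x97': u'\u2014',  # em dash
    }

    def fix_cp1252(text):
        for bad, good in CP1252_MAP.items():
            text = text.replace(bad, good)
        return text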


Develop District Court Scrapers

The District Court of New Mexico has a document retrieval system OUTSIDE of PACER that provides opinions:

http://www.nmcourt.fed.us/Drs-Web/input

It has over 20,000 opinions going back many years, and will display
1,000 at a time (quickly!). If one enters date restrictions then it's
possible to walk through the archive at a rate of less than 1k per page.

Filing this so we don't miss it, since district courts aren't part of the current road map.


Update dependencies

In particular, python-dateutil is crashing in tests when the latest version is used.

Missing Cases from Texas in 2015

An API user has reported missing citations in Texas from 2015. I investigated and found that cases were indeed missing. Seems the scraper might have been broken or that they've retroactively created some documents -- not sure.

I'm re-running all 16 of Texas's scrapers for the entire year to make sure to catch anything that's missing. Most have finished. The only scrapers that still need to be run are:

  • texapp_1
  • texapp_2
  • texapp_5
  • texapp_14

The rest finished before I made this ticket.

Tests should be analyzed and sped up. Slowness should be a failure.

This is a super easy one to do, but one that I haven't been able to get to. When running tests, there are a number of scrapers that are slow and that are reported by the testing system as "slow scrapers". These scrapers should be tuned to be faster (whether in tests or in the actual code, if it's actually slow), and once all scrapers are fast (or at least tests are), the warnings for slow scrapers should be made into full-on failures.

Basic goals here are:

  1. Identify any scrapers that are actually CPU intensive and fix them.
  2. Make tests faster so we don't have to wait as long.
  3. Make warnings into failures so we won't have regressions.

Should be a fun and fairly easy one.

Command-line disable duplicate matching

Maybe this is already possible, but it would be nice if there were a command-line flag one could pass to Juriscraper to tell it not to worry about the duplicate detection it does. In those weird instances where the court's placement of new material seems to trick the duplicate detector, it would be good to be able to manually tell it: "No, really. Check every URL just to be sure."
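
Something like this in the caller, as a sketch only (the flag name is a suggestion, and this assumes an optparse-style caller like sample_caller.py):

    from optparse import OptionParser

    parser = OptionParser()
    parser.add_option(
        '--no-dup-check',
        action='store_true',
        default=False,
        help='Ignore the duplicate detector and check every URL.',
    )
    (options, args) = parser.parse_args()
    # ...then pass options.no_dup_check down to the scraping loop.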


Support Oral Argument Video

We've got oral argument audio and it's great. But it's already clear that many courts are moving to video, or started there to begin with. When you look at the states, you'll find a lot of video. (See our wiki for a preliminary list: https://github.com/freelawproject/juriscraper/wiki/Court-Websites ) Also of great interest: the 9th Circuit, which previously only did video for en banc oral arguments, has started providing video of most oral arguments and appears to be moving towards video of all of them. See http://www.ca9.uscourts.gov/media/

I suppose the issue on juriscraper is resolved fairly easily, since we're already set up to download most any filetype and its associated metadata, but it seems like some sort of filetype flag might be needed, with options like "audio", "video", "opinion", "docket", and "whatever-else-we-decide-is-a-standalone-type". This same issue should be filed on CourtListener, where integrating a video player would likely be a larger project than any changes needed in juriscraper.

Simplify backscraper API

The backscraper API is currently a private API and varies from scraper to scraper. So CourtListener or other callers of Juriscraper currently need slightly different logic for each backscraper.

One way to simplify it would be to add to the public AbstractSite API:

# Returns an iterator to pass to backscrape(). 
# Returns None if this scraper is not also a backscraper.
# Covers at least from startdate to enddate inclusive. 
# May return slightly more than that.
# Usage:
# for i in get_backscrape_iterator(startdate, enddate):
#     site.backscrape(i)
#     site.parse()
#     ...
def get_backscrape_iterator(startdate, enddate)

What do you think?
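
For concreteness, a sketch of what this might look like for a scraper whose archive is paged by year (the implementation details are illustrative only):

    def get_backscrape_iterator(self, startdate, enddate):
        # One item per year in the range; backscrape() then fetches that
        # year's page. Returns None if this scraper has no backscraper.
        if not hasattr(self, 'backscrape'):
            return None
        return iter(range(startdate.year, enddate.year + 1))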

Consider switching to html5parser for all parsing

lxml has an html5parser that can handle some of the inanities that bad HTML pages present.

For example, this page:

http://media.ca11.uscourts.gov/opinions/unpub/logname.php?begin=9720&num=485&numBegin=1

has stray less-than signs in some of the docket numbers, which results in a terrible HTML tree. I was able to solve that in ca11_u in a way that fixes the problem and even preserves the API for the scrapers.

The hard part of this is that the html5parser's fromstring function returns _Element objects while the html.fromstring function returns HtmlElements. I was able to get around this in ca11_u with something like:

from lxml.html import fromstring, tostring
from lxml.html import html5parser

e = html5parser.fromstring(text)
html_element = fromstring(tostring(e))

Though that involves an extra parse and an extra serialization, which is wasteful and obscures what's going on.

Add scraper for nevada court of appeals

Looks like an easy subclass of nev_p.py, but it's actually a court that I don't think we have in CourtListener yet. Been a long time since that happened, so:

  • Need to decide what to call this jurisdiction. Probably nevctapp? I can't find any references to it in the Bluebook (19th ed.).
  • Need to find notes on how to add this to CourtListener, and it's possible we'll need to re-jigger the order of the courts in the modal (this, alas, is done manually).

Anyway, worth doing, since it's not every day we have a new jurisdiction.

Texas Supreme Court

Last opinion is from Dec. 23, 2014. Something changed on their site, because there are January 2015 opinions available.

Issues with federal_special backscrapers

I tried using the backscrapers for the first time today, and ran into some trouble.

Can you pull the latest Juriscraper code and use the included sample_caller to make sure all your imports are working?

Here's what I get right now:

$ python sample_caller.py -c opinions/united_states_backscrapers
Usage: sample_caller.py -c COURTID [-d|--daemon] [-b|--binaries]

To test ca1, downloading binaries, use: 
    python sample_caller.py -c opinions.united_states.federal_appellate.ca1 -b

To test all federal courts, omitting binaries, use: 
    python sample_caller.py -c opinions.united_states.federal_appellate

sample_caller.py: error: Unable to import module or package. Aborting.

I've got these courts added to the front end, so it'd be great to get the back corpus finally. FINALLY!!!


Syllabus vs. Summary: Only One Should Reign Supreme

In CourtListener, we have syllabus in our data model, and it's described as:

"A summary of the issues presented in the case and the outcome."

In Juriscraper, we collect summaries. I could put a hard-coded mapping in CourtListener that says basically:

syllabus=summary

But that's lame and unfortunate. Is there a legal distinction here that's important? The solution I'd like to put in place is to choose syllabus as our preferred word and update Juriscraper to match.
