freelawproject / juriscraper
An API to scrape American court websites for metadata.
Home Page: https://free.law/juriscraper/
License: BSD 2-Clause "Simplified" License
An anti-feature of the super-popular requests library is that it doesn't have timeouts by default. This means that if a connection hangs for whatever reason, the program itself will hang...forever.
I filed https://github.com/kennethreitz/requests/issues/3070 about changing this in the next major version of requests, but until then, we need to add timeout values to every request. I think two of the Texas scrapers in #160 are currently frozen for just this reason.
I'll be pushing a release with this fix shortly.
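Until requests changes its default, every call needs an explicit timeout. A minimal sketch of one workaround, a wrapper that injects a default (the helper name and the 10-second value are assumptions, not part of juriscraper):

```python
import requests

DEFAULT_TIMEOUT = 10  # seconds; an arbitrary assumed default

def get_with_timeout(url, **kwargs):
    # Apply a timeout to every call so a hung connection raises
    # requests.exceptions.Timeout instead of blocking forever.
    kwargs.setdefault('timeout', DEFAULT_TIMEOUT)
    return requests.get(url, **kwargs)
```

Callers can still override the timeout per request, since setdefault() only fills it in when the caller didn't pass one.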
The README says "Juriscraper is licensed under the permissive BSD license.", but there is no license text. Could you please add an explicit license?
I'd propose a text, but there are multiple BSD licenses, so I don't know which one you intended.
Hi,
I just found out about your project, and read through the code. I like the organization a lot. I see that the scraping generally returns arrays of data, such as lists of all case name or case dates. I think it'd be very useful to be able to retrieve a single list of opinion objects, each of which would have attributes such as name, date, etc. Have you considered this kind of API?
The ca2 scraper needs to call titlecase() to fix some uppercase case names.
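To illustrate what the fix needs to do, here is a minimal stand-in for a titlecase() helper (juriscraper's real helper is far more robust, handling acronyms, punctuation, and names; this sketch only covers the basic pattern):

```python
# Small legal/connective words that stay lowercase mid-title.
SMALL_WORDS = {'v.', 'of', 'the', 'and', 'in', 're'}

def titlecase(text):
    # Lowercase an all-caps case name, then capitalize each word
    # except small words that aren't in the leading position.
    words = text.lower().split()
    fixed = []
    for i, word in enumerate(words):
        if i > 0 and word in SMALL_WORDS:
            fixed.append(word)
        else:
            fixed.append(word.capitalize())
    return ' '.join(fixed)
```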
The following courts generated errors on last run and are likely broken:
(Compare the court website's most recent opinion with https://courtlistener.com 's latest opinion from that court.)
This is an urgent issue, but I have too many other things on my plate. Anybody able to take a look?
See cases from Oct. 23 such as Maxon Alco Holdings, LLC v STS Steel, Inc. (N.Y. App. Div. 2014) https://www.courtlistener.com/?q=&stat_Precedential=on&order_by=dateFiled+desc&court=nyappdiv
The ca1 scraper does not parse out all of the available metadata.
lower_court_numbers and lower_courts are available.
Last oral argument audio files from 9th Circuit on CL are from May 6th, 2015. However, there are more than 100 more recent audio files on their website. So, it appears it's been broken a while and we have some catching up to do.
Noticed while working on issue #59, it seems there's some hardcoding of log locations. These (if more than one) should probably be set in a config file or environment setting. Ideally the default would also be a relative location to the current working directory and not a hardcoded path most likely not existing on most machines (especially non-Un*x).
Warning: No such file or directory: /var/log/juriscraper/debug.log. Have you created the directory for the log?
Juriscraper will continue to run, but will not create logs.
This will take some finesse, but the problem is that the tests call parse(), and parse() calls _download(). Since we've overridden download in many cases, that code is run instead of the code with the catch for if self.method == 'LOCAL'.
Some solution will need to be found, especially for the cases where Selenium is being run in the _download methods.
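One possible shape for the fix, sketched with stand-in classes (these are simplified illustrations, not juriscraper's real classes): each overriding scraper checks self.method first and defers to the base class in LOCAL mode, so tests never reach the live (e.g. Selenium) code path.

```python
class AbstractSite(object):
    """Stand-in for the base site class used in tests."""
    method = 'LOCAL'

    def _download(self, request_dict=None):
        # In the real base class, LOCAL mode reads a saved test file.
        return 'parsed-local-file'

class SeleniumSite(AbstractSite):
    """Stand-in for a scraper that overrides _download."""

    def _download(self, request_dict=None):
        # Respect LOCAL mode before doing any live download work.
        if self.method == 'LOCAL':
            return super(SeleniumSite, self)._download(request_dict)
        return 'parsed-live-page'
```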
I think this would aid people in getting it into their workflow and that it would help us update/release versions. I think there's something about Python "wheels" being used these days for this.
There will also likely be a problem with logging, since Juriscraper is currently configured to use /var/log/juriscraper/debug.log by default, which requires root access.
Looking at the coverage graphs on CL, it's clear that we need a backscraper for this year.
Would be great to get one, since it would round out the court.
Once freelawproject/courtlistener supports "briefs" as a document type, we should build a scraper for this impressive collection of briefs: http://www.justice.gov/osg/supreme-court-briefs
This probably goes in a new directory /briefs as opposed to /opinions or /oral_args
I'm pretty sure some of the tests are hitting the network. This isn't a great situation, but there seems to be a decent solution to this over on Stack Overflow.
Monkey patching socket ought to do it:
import socket

def guard(*args, **kwargs):
    raise Exception("I told you not to use the Internet!")

socket.socket = guard
I believe this just breaks socket so that no library can connect using it, suckers.
Issue by thinkcomp
Tuesday Apr 02, 2013 at 04:04 GMT
Originally opened as https://github.com/freelawproject/recap-server/issues/13
RECAP should also capture the termination dates of docket line item entries on district-level PACER history pages. They're very useful. Right now it only gets the Filed date and Entered date. Sometimes there's a Terminated date, as well.
We currently have _get_neutral_citation and _get_west_citation but "west" is not totally accurate. I think we really mean "bound_volume" or "reporter" citation just to distinguish this from neutral citations. Maryland is a good example where we can get their official reporter in a backscraper but the "West" volume would be the Atlantic Reporter cite, which isn't posted on the Md. website.
A while back we changed the CL database to include a bunch of citation fields. Do we need juriscraper to better align to these different types of cites? For instance, if there were a page that provided both the state official reporter citation and the west regional reporter citation, then we would only have _get_west_citation available, but we'd also need something like _get_reporter_citation for the official reporter citations.
I suppose we can use get_west_citation for the non-West Maryland official reporter in the Maryland backscraper I'd like to make, and wait for the day when we actually encounter a site with both official reporter cites and West regional reporter cites, but I thought I'd just raise the issue in case you have a preference for tidying this up in some way now. If having one Maryland backscraper use a slightly inaccurate field doesn't bother you, then just close this.
We should have a mailing list called "juriscraper" where we send the Juriscraper notifications and other related stuff.
So:
[juriscraper-notifications]
perhaps?

I'm pretty sure most of the reason that PhantomJS is slow is that we import and instantiate it every single time. A better solution would keep it around after it was started. An even better solution would keep it around, but only instantiate it when it was first needed (so calling just one scraper that doesn't use PhantomJS isn't negatively impacted).
There's an answer on StackOverflow that has some tips about this. This also might help with the Webdriver connection errors we seem to have on longer-running processes.
They're kind of a PITA, and they don't belong in with the code, despite how convenient that may be for the tests.
Should be an easy change. Just need to make the tests aware that they need to look somewhere else.
http://www.ca6.uscourts.gov/internet/court_audio/aud1.php
High priority.
It seems that the state subdirectory currently has placeholder files that use the postal code abbreviations of the states. Besides the fact that no one but the postmaster general can keep these straight, they are not the abbreviations used in legal citation, per the commandments from on high from our masters at the Bluebook.
With the federal courts we have thus far had the, I think, commendable practice of naming the file after the courtID that will eventually be used in courtlistener, and hence in some search queries made by humans that have to remember these abbreviations. Our legal audience will expect the Bluebook abbreviations.
I'd ask we change all our state placeholder filenames to correspond to the names found in this table: http://www.law.cornell.edu/citation/4-500.htm because I can remember Mass. Minn. Mich. Miss. Mo. but there's no hope for me (or other humans) with the two-letter codes.
Oh, and I'm willing to do this and create a pull request if there is consensus that it's OK to do.
Need to update the readme with this information.
This is a fairly easy one, but if you look at the page here you'll see that NY App Div is using neutral citations. Our scraper is recognizing these as docket numbers instead and needs to be updated.
At the same time we need to write a cleanup script to fix the content in CourtListener.
Can you mark "help wanted" on any scraper tickets that might be easy for someone to pick up?
This is for us to provide challenges during PyCon: https://us.pycon.org/2015/schedule/presentation/318/
The current way of returning opinions from scrapers complicates the callers. While the metadata may be scraped column-by-column, it generally isn't used that way.
Possible simplification:
# Returns all the scraped cases. Call after parse().
# Returns a list of dictionaries, one for each case.
def get_cases(self):
This could also be used to implement to_csv(), to_html(), to_xml(), to_json() generically.
Returns data like:
[OrderedDict([
('case_dates', '2014-06-30'),
('case_names', u'93-06 994'),
('download_urls', 'http://www.va.gov/vetapp14/Files4/1429545.txt'),
('precedential_statuses', u'Unpublished'),
('docket_numbers', u'93-06 994'),
('neutral_citations', u'1429545')]),
OrderedDict([
('case_dates', '2014-06-30'),
('case_names', u'13-34 313'),
...
('neutral_citations', u'1429590')]),
]
For existing scrapers this would suffice:
import collections
import datetime

def get_cases(self):
    cases = []
    for attr in self._all_attrs:
        values = getattr(self, attr)
        if values is None:
            continue
        for i, value in enumerate(values):
            if len(cases) <= i:
                cases.append(collections.OrderedDict())
            if isinstance(value, datetime.date):
                value = value.isoformat()
            cases[i][attr] = value
    return cases
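As a sketch of the "generic serializer" idea mentioned above, to_csv() could then be built on get_cases() alone; nothing here is a proposed final API, just an illustration of the layering:

```python
import csv
import io

def to_csv(self):
    # Serialize the per-case dictionaries from get_cases().
    # Because it only depends on get_cases(), every scraper would
    # get CSV output for free; to_json()/to_xml() would work alike.
    cases = self.get_cases()
    buffer = io.StringIO()
    if cases:
        writer = csv.DictWriter(buffer, fieldnames=list(cases[0].keys()))
        writer.writeheader()
        writer.writerows(cases)
    return buffer.getvalue()
```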
Requested by a user to add the opinions located here:
https://www.jagcnet.army.mil/ACCA#
As the user explained, we already have coverage of the Air Force Court of Criminal Appeals and the Navy-Marine Corps Court of Criminal Appeals, so this feels like a gap in our coverage.
string_utils.py currently converts a number of nasty curly apostrophes and such to something sensible. It doesn't seem to work on this cp1252 aka windows-1252 encoded page:
The summaries contain both em dashes and curly apostrophes that don't get converted correctly. Example:
u'Attorney misconduct, including failing to act with reasonable
diligence in representing a client, failing to promptly refund any
unearned fee upon the lawyer\x92s withdrawal from employment, and
knowingly failing to respond to a demand for information by a
disciplinary authority during an investigation\x97Indefinite suspension.'
In the original HTML, the \x97 appears as an em dash and the \x92 appears as an apostrophe.
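This is the classic symptom of cp1252 bytes being decoded as latin-1: Windows punctuation survives as C1 control characters. A sketch of the usual recovery trick (this is a general technique, not string_utils.py's current behavior):

```python
# When cp1252 bytes are mis-decoded as latin-1, punctuation like the
# curly apostrophe (0x92) and em dash (0x97) ends up as control
# characters. Round-tripping through latin-1 and re-decoding as
# windows-1252 recovers the intended characters.
broken = u'the lawyer\x92s withdrawal\x97Indefinite suspension.'
fixed = broken.encode('latin-1').decode('windows-1252')
```

After this, fixed contains a proper U+2019 apostrophe and U+2014 em dash, which the existing curly-quote normalization can then handle.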
There's a pretty great collection of decisions going all the way back to 1960, many including summaries here:
www.justice.gov/eoir/ag-bia-decisions
A user just suggested that we should add this, and they're absolutely right. It'd be killer to get these included.
Information on all the court web sites that juriscraper should ultimately cover is collected at http://bit.ly/CourtDocs
Use that site to "claim" a court web site if you are working on its scraper so that there will not be duplication of efforts.
The District Court of New Mexico has a document retrieval system OUTSIDE of PACER that provides opinions:
http://www.nmcourt.fed.us/Drs-Web/input
It has over 20,000 opinions going back many years, and will display
1,000 at a time (quickly!). If one enters date restrictions then it's
possible to walk through the archive at a rate of less than 1k per page.
Filing as a bug, since district courts aren't part of the current road map, and we don't want to miss it.
In particular, python-dateutil is crashing in tests when the latest version is used.
An API user has reported missing citations in Texas from 2015. I investigated and found that cases were indeed missing. Seems the scraper might have been broken or that they've retroactively created some documents -- not sure.
I'm re-running all 16 of Texas's scrapers for the entire year to make sure to catch anything that's missing. Most have finished. The only scrapers that still need to be run are:
The rest finished before I made this ticket.
The nd.py scraper breaks dumping the parsed data from a local test because it uses a DeferringList even when a LOCAL request is being parsed.
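For context, a DeferringList postpones its fetcher call until an item is first accessed, which is why dumping parsed data from a LOCAL test can still trigger network requests. A minimal stand-in (not juriscraper's actual class) behaves like this:

```python
class DeferringList(object):
    """Simplified sketch: stores seed values and runs the fetcher
    lazily, only when an item is first accessed."""

    def __init__(self, seed, fetcher):
        self._data = list(seed)
        self._fetcher = fetcher
        self._fetched = [False] * len(self._data)

    def __getitem__(self, i):
        if not self._fetched[i]:
            # The network hit happens here, at access time --
            # even if self.method was 'LOCAL' at parse time.
            self._data[i] = self._fetcher(self._data[i])
            self._fetched[i] = True
        return self._data[i]

    def __len__(self):
        return len(self._data)
```

The fix is presumably for nd.py to return a plain list of values when self.method == 'LOCAL', reserving the DeferringList for live scrapes.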
This is a super easy one to do, but one that I haven't been able to get to. When running tests, there are a number of scrapers that are slow and that are reported by the testing system as "slow scrapers". These scrapers should be tuned to be faster (whether in tests or in the actual code, if it's actually slow), and once all scrapers are fast (or at least tests are), the warnings for slow scrapers should be made into full-on failures.
Basic goals here are:
Should be a fun and fairly easy one.
Maybe this is already possible, but would be nice if there were a command-line flag one could send juriscraper to tell it not to worry about the duplicate-detection it does. In those weird instances where the court's placement of new material seems to trick the duplicate-detector, would be good to be able to manually tell it: "No, really. Check every url just to be sure."
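A sketch of the proposed switch using argparse; the flag name --fullcrawl is an assumption for illustration, not an existing juriscraper option:

```python
import argparse

parser = argparse.ArgumentParser(description='Run a scraper.')
parser.add_argument(
    '--fullcrawl',
    action='store_true',
    help='Skip duplicate detection and check every URL, just to be sure.',
)

# The crawl loop would consult args.fullcrawl before short-circuiting
# on content that looks like a duplicate of a prior crawl.
args = parser.parse_args(['--fullcrawl'])
```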
Including:
On February 29th, at least three 9th Circuit cases were added to CourtListener with citations including "USA".
Not sure why this happened:
The following backscrapers are ready, but are awaiting the enhancements that I've made to the CL caller that mitigates the impact of large scrapes:
Oral Argument:
Opinions:
We've got oral argument audio and it's great. But it's already clear that many courts are moving to video or started there to begin with. When you look at the states, you'll find a lot of video. (See our wiki for a preliminary list: https://github.com/freelawproject/juriscraper/wiki/Court-Websites ) Also of great interest is that the 9th Circuit which previously only did video for en banc oral arguments has started providing video of most oral arguments and appears to be moving towards having video of all oral arguments. See http://www.ca9.uscourts.gov/media/
I suppose the issue on juriscraper is resolved fairly easily as we're already all set to download most any filetype and its associated metadata, but it seems like some sort of filetype flag might be needed with options "audio" "video" "opinion" "docket" "whatever-else-we-decide-is-a-standalone-type". This same issue should be filed on CourtListener, where integrating a video player would likely be a larger project than any changes necessary to juriscraper.
Parsing the text of the newly scraped opinions is much more difficult than those taken from other sources, since it doesn't include the markup annotating new paragraphs.
I wrote a long description and @github mobile erased it when I slid my finger the wrong way. Ugh. Little warning?
See:
http://judicial.alabama.gov/supreme_opinions.cfm
Opinions are missing. Important one today.
The backscraper API is currently a private API and varies from scraper to scraper. So CourtListener or other callers of Juriscraper currently need slightly different logic for each backscraper.
One way to simplify it would be to add to the public AbstractSite API:
# Returns an iterator to pass to backscrape().
# Returns None if this scraper is not also a backscraper.
# Covers at least from startdate to enddate inclusive;
# may return slightly more than that.
#
# Usage:
#     for i in site.get_backscrape_iterator(startdate, enddate):
#         site.backscrape(i)
#         site.parse()
#         ...
def get_backscrape_iterator(self, startdate, enddate):
What do you think?
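One hypothetical implementation, sketched as a free function yielding date windows (the 30-day step is an arbitrary assumption; real scrapers would pick whatever unit their backscrape() accepts):

```python
import datetime

def get_backscrape_iterator(startdate, enddate, step_days=30):
    # Yield (start, end) windows that together cover at least
    # [startdate, enddate] inclusive; each window is one item to
    # feed to a backscrape() that accepts a date range.
    step = datetime.timedelta(days=step_days)
    one_day = datetime.timedelta(days=1)
    current = startdate
    while current <= enddate:
        yield (current, min(current + step, enddate))
        current = current + step + one_day
```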
lxml has an html5parser that can handle some of the inanities that bad HTML pages present.
For example, this page:
http://media.ca11.uscourts.gov/opinions/unpub/logname.php?begin=9720&num=485&numBegin=1
Has random less than signs in some of the docket numbers, which results in a terrible HTML tree. I was able to solve that in ca11_u, which fixes the problem and even preserves the API for the scrapers.
The hard part of this is that the html5parser's fromstring function returns _Element objects while the html.fromstring function returns HtmlElements. I was able to get around this in ca11_u with something like:
from lxml.html import fromstring, tostring
from lxml.html import html5parser
e = html5parser.fromstring(text)
html_element = fromstring(tostring(e))
Though that involves an extra parse and an extra serialization, all of which sucks and obscures.
Looks like an easy subclass of nev_p.py, but it's actually a court that I don't think we have in CourtListener yet. Been a long time since that happened, so:
Anyway, worth doing, since it's not every day we have a new jurisdiction.
Last opinion is from Dec. 23, 2014. Something changed on their site, because there are January 2015 opinions available.
That's all that's left before we hit a massive, incredible milestone.
I wonder if there's a way for @FISACourt and Juriscraper to team up on this one? Relevant issue on GitHub: konklone/fisacourt#13
Either way, I didn't see any commits about it since the change on 4/30.
I tried using the backscrapers for the first time today, and ran into some trouble.
Can you pull the latest Juriscraper code and use the included sample_caller to make sure all your imports are working?
Here's what I get right now:
$ python sample_caller.py -c opinions/united_states_backscrapers
Usage: sample_caller.py -c COURTID [-d|--daemon] [-b|--binaries]
To test ca1, downloading binaries, use:
python sample_caller.py -c opinions.united_states.federal_appellate.ca1 -b
To test all federal courts, omitting binaries, use:
python sample_caller.py -c opinions.united_states.federal_appellate
sample_caller.py: error: Unable to import module or package. Aborting.
I've got these courts added to the front end, so it'd be great to get the back corpus finally. FINALLY!!!
In CourtListener, we have syllabus in our data model, and it's described as:
"A summary of the issues presented in the case and the outcome."
In Juriscraper, we collect summaries. I could put a hard-coded mapping in CourtListener that says basically:
syllabus=summary
But that's lame and unfortunate. Is there a legal distinction here that's important? The solution I'd like to put in place is to choose syllabus as our preferred word and to update Juriscraper to do the same.