
freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.

Home Page: https://www.courtlistener.com

License: Other

Languages: Python 59.90%, CSS 0.76%, JavaScript 15.43%, HTML 14.80%, Shell 0.19%, Dockerfile 0.03%, Makefile 0.01%, TypeScript 0.57%, PLpgSQL 8.30%
Topics: courts, government, government-data, legaltech

courtlistener's Introduction

CourtListener

Started in 2009, CourtListener.com is the main initiative of Free Law Project. Its goal is to provide high-quality legal data and services.

What's Here

This repository is organized in the following way:

  • cl: the Django code for this project. 99% of everything is in this directory.
  • docker: Where to find compose files and docker files for various components.
  • scripts: logrotate, systemd, etc., plus init scripts for our various configurations and daemons.

Getting Involved

If you want to get involved, send us an email with your contact info or take a look through the issues list. There are innumerable things we need help with, but we are especially looking for help with:

  • legal research to fix data errors or other problems (check out the data-quality label for some starting points)
  • fixing bugs and building features (most things are written in Python)
  • machine learning or natural language processing problems
  • test writing -- we always need more and better tests

In general, we're looking for all kinds of help. Get in touch if you think you have skills we could use or if you have skills you want to learn by improving CourtListener.

Contributing code

See the developer guide to get started.

Copyright

All materials in this repository are copyright Free Law Project under the Affero GPL. See LICENSE.txt for details.

Contact

To contact Free Law Project, see here:

https://free.law/contact/

                                   g@@D
                                  "l@@B!
                                   "@@"
                                    @@
                                    @@
                            _P '@.  @@
                            71__@   @@
                              @@    @@    __
                              @@    @@  ;F  @
                              @@    @@  'h__@
                              @@    @@    @g
                              @@    @@    @g
                              @@    @@    @g                     _~~_
                              @@    @@    @g   @@@@@@@@@@@@@@@@@@F  |!
                              @@    @@    @g   @@         @T     TmmP
   _gg_                       @@    @@    @g   @@         @'
   @   @gggggggggggggggggg    @@    @@    @g   @@         @\
   '@WP      .@         @@    @@    @@    @g   @@        J "_
             !@         @@    @@    @@    @g   @@       ,'  T
             ;@         @@    @@    @@    @g   @@       8    %
             W @        @@    @@    @@    @g   @@   ___d______@_-_
            @   q       @@    @@    @@    @g   @@   ______________
           ;"    g      @@    @@    @@    @g   @@   0@@@@@@@@@@@@"
       ____E_____]L___  @@    @@    @@    @g   @@
       ,_____________   @@    @@    @@    @g   @@
       '@@@@@@@@@@@@D   @@    @@    @@    @g   @@
                        @@    @@    @@    @g   @@
                        @@    @@    @@    @g   @@
                  _~ggg~_.       __g@g~_.      gg        ggg   gggggggg_,   ;gggggggggggg
                g@@P"""<@@g    _@@P"""<B@g     @@        @@@   @@@"""""Q@g  """""9@g"""""
              .@@F       "    @@F       "@@,   @@        @@@   @@@      @@g      [@g
              g@@            |@@         (@@   @@        @@@   @@@     ,@@/      [@g
              [@@            [@@         j@@   @@        @@g   @@@@@@@@@B        [@g
               @@L       ,    @@L       ,@@'   @@1       @@'   @@@   '@@L        [@g
                T@@_____g@@    T@@_____g@@      @@g____+@@?    @@@     @@a       [@g
                  '4B@BP"        '=B@BP"          <8B@B+"      BBB      0BB      "BN
              g        ;;   _~mma_  mmmmqmmmmm  mmmmmmms  _        ;   gmmmmmmm  gmmmmm__
              g        [|  F            |]      |         g\_      [   @         @       q
              g        [|  1.           |]      |         g  q     [   @         @       [
              g        [|    "+m__      |]      P""""""   g   "_   [   @""""""   @_______'
              g        [|         \,    |]      |         g     \, [   @         @    `a
              g        [| ,       /'    |]      |         g       q[   @         @      0
              """""""" ''   ""==""      '"      """"""""  "        "   """"""""  "       "
                        @@    @@    @@    @g   @@
                        @@    @@    @@    @g   @@
                        @@    @@    @@    @g   @@

courtlistener's People

Contributors

albertisfu, colinstarger, cweider, davidxia, dependabot[bot], drewsilcock, dschnelldavis, elliottash, erosendo, flooie, grossir, ikeboy, jeffgortmaker, johnhawkinson, johnludwigm, jon-ashley, jraller, krist-jin, litewarp, malteos, mattdahl, mlissner, pre-commit-ci[bot], probablyfaiz, quevon24, rowyn, ss108, troglodite2, ttys0dev, voutilad


courtlistener's Issues

Atom feeds should be advertised on the alerts page

This will help promote the use of the atom feeds over the emailer, which I'm guessing will place less demand on the server (though I haven't entirely thought it out yet).

This ought to be easy once the feeds work properly, but not a priority right now.

One consideration is that people subscribed this way will eventually want to turn off (or delete) a feed, but there's no way to turn off a feed at present.


Sitemap.xml lacks priority for items created through the flatpages framework

If you pull up sitemap.xml, you will see that it has entries for the pages that were created via the flatpages module, but that they lack priority settings.

Since they are more important to the meaning of the site than many of the other items in the sitemap, this needs to be set.
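The intended weighting can be sketched independent of the flatpages framework (in Django this would become a Sitemap subclass with a raised priority attribute; the URLs and values below are hypothetical):

```python
# Hypothetical priorities: flatpages (About, Coverage, etc.) matter more
# to the meaning of the site than most other sitemap entries.
FLATPAGE_PRIORITY = 0.8
DEFAULT_PRIORITY = 0.5  # the sitemap protocol's default

def sitemap_priority(url, flatpage_urls):
    """Return the <priority> value to advertise for a sitemap entry."""
    return FLATPAGE_PRIORITY if url in flatpage_urls else DEFAULT_PRIORITY
```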


Give Users the Ability to Create TopicTags

Users should be able to tag a document with a keyword or "TopicTag". By default such tags should be public, but optionally, users could create private tags. Users should be able to select whether (and which) TopicTags are part of their bulk downloads. Users should be able to form groups that work together to create TopicTags of mutual interest, and then they need the ability to see/download 1) No TopicTags, 2) Just Their Own TopicTags, 3) Just TopicTags from selected groups or 4) All public TopicTags.

This is a major enhancement, quite desirable, but somewhat complex to implement.


Search and alert creation need to be combined

After using the site for a while now, I've noticed a few times when I wished a search were an alert, and I couldn't figure out how to make it happen. I've never had the opposite feeling of wishing an alert were a search, and I can't think of a reason why I would.

Since the search system is simply powering the alert system, we should just show the alert creation form for every search. It'll simplify things significantly.

Making the change is fairly simple, so I plan to do it before beta release.


Spaces in URLs again

All of the April 2, 2010 cases in the 11th Circuit have URLs that include a space at the beginning and a space at the end. I didn't check other circuits.

Also, I tried to erase the spaces and see if the case showed up there too and got the error page. (So, I don't think there are dupes.) However, I then found that the "file a bug" link does not work.


Need to do some Email Anti-spam work... ugh.

Atwood has a very good post on this today, and I was noticing some strange fields in the emails yesterday.

http://www.codinghorror.com/blog/2010/04/so-youd-like-to-send-some-email-through-code.html

This looks like a pain, though probably one we need to endure to make the alerts go through consistently.


Conversion to HTML can include lots of whitespace

I've only seen this once, so far, here:

http://courtlistener.com/ca4/Riley%20v.%20Dozier%20Internet%20Law,%20PC/

but the HTML version has A LOT of extra spaces sprinkled throughout.

Perhaps post-processing could be done that would essentially amount to a looping find and replace (find two spaces and replace them with one space until no more instances of two spaces together are found.)
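That looping find-and-replace collapses to a single regex substitution; a minimal sketch:

```python
import re

def collapse_spaces(html_text):
    """Collapse runs of two or more spaces into a single space."""
    # One substitution has the same effect as repeating a
    # "two spaces -> one space" find-and-replace until none remain.
    return re.sub(r" {2,}", " ", html_text)
```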

To which "component" does this apply: "backend" or "web"?


Case Name capitalization issues

See: http://courtlistener.com/ca9/Crs%20Recovery,%20Inc.%20V.%20John%20Laxton/

The plaintiff's name is CRS Recovery (all-caps CRS), but the database contains "Crs". The database also has a capital "V." rather than a lowercase "v." between the parties. I know the 9th Cir. provides the names in ALLCAPS, so I assume we're doing some case conversion that capitalizes just the first letter of each word -- usually a good assumption -- which means solving the first problem may be harder than solving the second.
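One way to handle both problems is an acronym-aware conversion; a sketch (the whitelist is hypothetical, and hyphenated names like "Mid-Continent" would need further care):

```python
ACRONYMS = {"CRS", "LLC", "USA"}  # hypothetical whitelist of known acronyms

def fix_case_name(all_caps_name):
    """Convert an ALLCAPS case name, keeping acronyms and lowercasing 'v.'."""
    words = []
    for word in all_caps_name.split():
        if word.rstrip(".,").upper() in ACRONYMS:
            words.append(word.upper())          # keep known acronyms as-is
        elif word.lower() in ("v.", "vs."):
            words.append(word.lower())          # the party separator is lowercase
        else:
            words.append(word.capitalize())     # default: first letter only
    return " ".join(words)
```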


Updating Sphinx indexes should be done with a delta index

As our index grows (it's at 300MB now), we're going to need to start using the delta index system that Sphinx allows.

This will allow us to update the Sphinx index with much less downtime.

It's complicated and not necessary yet, so I haven't set it up, but we're going to need to eventually.


Add Links to Alternative Document Sources

In addition to providing the hyperlink to the Court website where we retrieved a document, a drop-down box should provide alternative sources for the same document, such as resource.org, Justia, Google Scholar, Findlaw, Cornell LII, Fastcase, LexisNexis, and Westlaw, etc. This option should also appear on any result page created for documents we are missing, as in the case of "red links" citations that we might ask people to sponsor for scanning. Sites would preferably have a consistent/predictable URL structure to make this possible.


Add Fed. Cir. Motion Orders

At this page:

http://www.cafc.uscourts.gov/motions/search.asp

The Fed. Cir. provides no-cost access to orders resolving certain precalendar motions that are acted on by the clerk of the court from this page. These are not listed on the usual Opinions & Orders page. These are not of major importance right away, but since they are available it would be nice to add them.

However, they should not be added until the case-number duplicate vs SHA-1 duplicate issue is resolved, because adding these under the current dupe-checker would mean that we only get the first pre-calendar motion in the database, and not the ultimate opinion (or any subsequent motion orders). That would be bad. That these are available gives us another reason to want to check dupes by comparing SHA-1 of documents.


Fed Cir scraper stopped working

There have been three Fed. Cir. opinions released over the last few days, but none of them have made it onto the site. I ran the scraper manually with scrape/13 and it said "It worked. Duplicate found at 4." and so in some sense it even knows that the first duplicate is the fourth one down, but it doesn't seem to be putting the prior three into the database.


Format when browsing the case lists

Currently when one browses all the cases, either from all the courts or a specific Circuit, a given entry looks like this:

Mid-Continent Casualty Co. v. American Pride Bldg., 09-11238
Monday, March 29th, 2010
Status: Precedential/Published.
Download PDF: From the court | Our backup

The first line is all italicized and is a link to the site's text/html version of the opinion, and Status and Download PDF are bold.

When I'm browsing opinions/all I find myself really wishing that it indicated which Circuit each case was coming from, but I have to mouse-over the links to get that info. I think even on the Circuit-specific lists, it would be fine to list it. People used to looking at court citations would not be surprised by that and might even expect it.

I'd suggest: italicize only the case name and make it alone the link to the opinion page. Then, after the case number, add:

(1st Cir.)
(2d Cir.)
(3rd Cir.)
(4th Cir.)
(5th Cir.)
(6th Cir.)
(7th Cir.)
(8th Cir.)
(9th Cir.)
(10th Cir.)
(11th Cir.)
(D.C. Cir.)
(Fed. Cir.)

Note that the Second Circuit really is abbreviated (2d Cir.), without the 'n'. There seems to be no good reason that I know of for why the folks who decide this stuff decided it, but it is the convention people will expect.

By not making the case number and the Circuit be linked-text it leaves open the possibility that when we get multiple documents for the same case number, we might enable people to click on the case number and go to an overview page for all documents related to that case.
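The parenthetical could come from a simple lookup keyed on the site's court codes (codes like ca2 appear in search queries; cadc and cafc are my guesses for the last two):

```python
# Court codes to conventional citation abbreviations.
CIRCUIT_ABBREVIATIONS = {
    "ca1": "1st Cir.",  "ca2": "2d Cir.",   "ca3": "3rd Cir.",
    "ca4": "4th Cir.",  "ca5": "5th Cir.",  "ca6": "6th Cir.",
    "ca7": "7th Cir.",  "ca8": "8th Cir.",  "ca9": "9th Cir.",
    "ca10": "10th Cir.", "ca11": "11th Cir.",
    "cadc": "D.C. Cir.", "cafc": "Fed. Cir.",
}

def citation_suffix(court_code):
    """Format the parenthetical, e.g. '(2d Cir.)', for a case-list entry."""
    return "(%s)" % CIRCUIT_ABBREVIATIONS[court_code]
```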


Once per day scraping is too slow

We should move towards real-time scraping as much as possible.

This could be done pretty easily with a daemon that checks each site every 20 minutes, comparing a SHA1 of the site's HTML against the SHA1 from the previous visit.

If different, run the full scraper. If same, wait 20 minutes, repeat.

The catch here is that issue #29 is likely a blocker for this, since updating the search index is rather compute intensive. This would also enable real-time email alerts (which would be awesome).
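The polling loop described above might look like this (run_full_scraper is a hypothetical hook into the existing scraper, not real project code):

```python
import hashlib
import time
import urllib.request

def page_hash(html):
    """SHA1 of a court site's listing page, used to detect changes cheaply."""
    return hashlib.sha1(html).hexdigest()

def watch(url, run_full_scraper, interval=20 * 60):
    """Poll `url` every `interval` seconds; run the scraper only on change."""
    last = None
    while True:
        current = page_hash(urllib.request.urlopen(url).read())
        if current != last:
            run_full_scraper()
            last = current
        time.sleep(interval)
```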


Duplicate case in database

I did a search for "copyright" today to see if it would list the 1st Circuit's Raytheon opinion. (It did--nice!) but then noticed that
Mid-Continent Casualty Co. v. American Pride Bldg., 09-11238 (11th Cir.) shows up twice, once with a SPACE before " Mid-Continent" in the URL.

Not sure what's going on there.


Add older data to corpus

I think we have opinions starting on March 13, 2010. That's a weird start date. Is it possible/easy to have the scraper do a one-time run to retrieve all the older opinions that happen to still exist on the various Circuit sites? This would add a couple of years of older opinions for most circuits. Alternatively, is it possible for the scraper to just do a one-time run where it goes back to Jan. 1, 2010 and then be able to say that our coverage begins with 2010 and if you want older stuff, tough luck?


Multiple-field search operator not working?

@(caseName,docText) Strickland

according to our Advanced search page should give results that contain Strickland in BOTH the caseName and the document text. Sometimes it appears to give results that contain EITHER Strickland in the caseName or the docText and with some other queries it doesn't seem to work at all. In either case, it also throws up this yellow error:

We completed your search, but @ is not a valid attribute.
Valid attributes are @court, @casename, @docStatus and @doctext.

and at a minimum that error can't be right, because the advanced search page suggests such searches are valid. (Or our instructions on the advanced search page are wrong -- but then how is one supposed to do a multiple-field search, if not like that?)


Duplicate alerts need to be avoided if they are common

Currently, when an alert is created by a user, it can be an exact duplicate of an alert that is already in the system. This is problematic because it takes up space in our database and because the emailer will have to check both alerts each day/week/month.

If there are many duplicate alert queries in the system, we could optimize things by checking for dups at creation or editing time. Then, at deletion time, if only one user is associated with the alert, we can delete the alert entirely; if more than one, we simply delete the association between the user and the alert, leaving the alert in the DB for the other users associated with it.

If a user is editing an alert that is shared by another user, the association would need to be torn down, and the new alert created.

Whether to do this will be a balance between the added complication in the codebase and the needs of our server.
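The creation/deletion logic above reduces to reference counting on shared queries; a minimal in-memory sketch (the real version would live in the Django models, and all names here are made up):

```python
class AlertPool:
    """Reference-counted sharing of identical alert queries (sketch)."""

    def __init__(self):
        self._users_by_query = {}  # query string -> set of user ids

    def subscribe(self, user_id, query):
        # A duplicate alert becomes one more association, not a new row.
        self._users_by_query.setdefault(query, set()).add(user_id)

    def unsubscribe(self, user_id, query):
        users = self._users_by_query.get(query)
        if users is None:
            return
        users.discard(user_id)
        if not users:
            # Last associated user is gone: delete the alert itself.
            del self._users_by_query[query]
```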


Scraper doesn't get Amended opinions from 1st Circuit

Here is an amendment to an opinion (hence the "A" at the end of the pdf) that was released on Feb. 28, 2010 by the 1st Cir:

http://www.ca1.uscourts.gov/pdf.opinions/09-1020E-01A.pdf

The site has opinions released just before and after this date, but not this one.

From views.py

        # next: docType
        docType = docTypes[i].text.strip()
        if "unpublished" in docType.lower():
            doc.documentType = "U"
        elif "published" in docType.lower():
            doc.documentType = "P"
        else:
            # it's an errata, or something else we don't care about
            i += 1
            continue

Is that the code that is making us skip this document or just fail to classify it?

The above pdf is an example of "Errata" and they are sometimes very important. If the scraper is currently discarding them, then that's not ideal.

It goes back to the issue of how to check for duplicate documents. If the scraper relies on case name and number, then these amendments will also be missed, but if the scraper relies on SHA1 comparisons, then one has to download a document before one knows if it is a duplicate, probably resulting in lots of unnecessary downloads.

The ideal solution will likely be to configure each Circuit so that it downloads documents and runs SHA1 on them UNTIL it finds a duplicate and then stops downloading docs from that Circuit. If done right, this should only result in one unnecessary download per Circuit per day and would hopefully guarantee that no documents are missed.
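That stop-at-first-duplicate strategy is simple to express (fetch and known_hashes are hypothetical hooks into the scraper and database):

```python
import hashlib

def fetch_until_duplicate(candidate_urls, fetch, known_hashes):
    """Download documents newest-first, stopping at the first known SHA1.

    Costs at most one redundant download per court per run, while catching
    everything released since the last run -- amendments included.
    """
    fresh = []
    for url in candidate_urls:
        content = fetch(url)
        if hashlib.sha1(content).hexdigest() in known_hashes:
            break  # everything older is assumed to be in the database
        fresh.append((url, content))
    return fresh
```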


No Unpublished/Non-Precedential opinions from 2nd, 5th, or 11th Circuits.


Try the following searches:

@docStatus U @court ca2

@docStatus U @court ca5

@docStatus U @court ca11

None yield any results.

2nd Circuit opinions whose file name ends in "so" are "Summary Orders" and should be classified as Unpublished/Non-Precedential. I'm not sure if we're not scraping them or not classifying them right.

5th Circuit opinions that are Unpublished/Non-Precedential are listed on the right-hand side of their opinions web page, and so again, I don't know if we're not scraping them or mis-classifying them.

11th Circuit opinions that are Unpublished/Non-Precedential come from a separate page on their site: http://www.ca11.uscourts.gov/opinions/indexunpub.php and I'm also unsure if we're not scraping that page or misclassifying what we gather there.
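If classification is the problem for the 2nd Circuit, the rule is nearly a one-liner (the file names below are made up for illustration):

```python
def ca2_status(file_name):
    """Classify a 2nd Circuit opinion by file name: summary orders end in 'so'."""
    stem = file_name.rsplit(".", 1)[0]  # drop the .pdf extension
    return "U" if stem.endswith("so") else "P"
```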


Erroneously using file name as case number for 6th Cir.

Right now it looks like we're using the file name to deduce the case number, but the 6th Circuit has some goofy file-naming convention that makes this produce the wrong result.

If we instead pulled the case numbers from the tables produced on this page:

http://www.ca6.uscourts.gov/cgi-bin/newopn.pl

instead of from this one, which views.py uses right now:

http://www.ca6.uscourts.gov/cgi-bin/opinions.pl

then I think the scraper could still retrieve the proper case numbers without parsing the PDFs.


Add Oral Argument Audio

Oral Arguments Audio:

Seven of the thirteen Circuit courts currently provide oral argument audio online as follows:

1st Circuit:
http://www.ca1.uscourts.gov/files/audio/audiorss.php (RSS)
files are in form: http://www.ca1.uscourts.gov/files/audio/##-####.mp3

2d Circuit:
None that I can find.

3rd Circuit:
Last 7 days listed here:
http://www.ca3.uscourts.gov/oralargument/ListArguments7.aspx
files are in form: http://www.ca3.uscourts.gov/oralargument/audio/##-####PlaintiffvDefendant.wma

Entire archive listed here:
http://www.ca3.uscourts.gov/oralargument/ListArgumentsAll.aspx

4th Circuit:
None that I can find.

5th Circuit:
http://www.ca5.uscourts.gov/OralArgumentRecordings.aspx
files are in form: http://www.ca5.uscourts.gov/OralArgRecordings/09/##-#####_M-D-YYYY.wma

6th Circuit:
None that I can find.

7th Circuit:
http://www.ca7.uscourts.gov/fdocs/docs.fwx (past week)
files are in form: http://www.ca7.uscourts.gov/fdocs/docs.fwx?submit=showbr&shofile=##-####_001.mp3

8th Circuit:
http://8cc-www.ca8.uscourts.gov/circ8rss.xml
files are in form: http://8cc-www.ca8.uscourts.gov/OAaudio/2010/2/######.MP3 (case number w/o hyphen)

9th Circuit:
http://www.ca9.uscourts.gov/media/
files are linked to in form: http://www.ca9.uscourts.gov/media/view_subpage.php?pk_id=0000005305 (random #?)
and then on a subsequent page: http://www.ca9.uscourts.gov/datastore/media/2010/04/09/##-#####.wma

10th Circuit:
None that I can find.

11th Circuit:
None that I can find.

D.C. Circuit:
Policy against providing the tapes to the public until after the case has been completely closed, and even then not online.

Federal Circuit:
http://oralarguments.cafc.uscourts.gov/ but this only provides a search box in which you must enter the date.
THEN files are in form: http://oralarguments.cafc.uscourts.gov/mp3/####-####.mp3 (case # uses full yr: 2009-####)


Need to do some URL shortening...

I am working on getting the emails going out, and I've noticed that there is a real problem with the length of the URLs for cases.

Currently, URLs are of the form:

  • courtlistener.com/court/caseNameShort
  • courtlistener.com/court/caseNumber

I'm thinking it would be pretty cool, and rather easy, to add a URL shortening service for the purpose of emails and unique document locations.

I poked around, and found that .li ccTLDs can be purchased here:
http://www.switch.ch/

And I found that crt.li is available for 17 CHF (15 USD). I'd prefer ctl.nr, but the .nr ending costs $500 via wire transfer.

With that and the SHA1 sum (or the case number), pretty short URLs could be made:

  • .crt.li/e370a40765e5e6705b8578787b70dd20ed69cdf1
  • .crt.li/caseNumber

Something to think about.
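With a domain like crt.li, the shortener could be as simple as taking a prefix of the document's SHA1 (the prefix length and URL scheme are arbitrary choices here, not a settled design):

```python
def short_url(sha1_hex, length=8, domain="crt.li"):
    """Build a short, stable URL from a document's SHA1 sum (sketch)."""
    # A short prefix of the SHA1 is almost always enough to be unique,
    # and collisions could fall back to a longer prefix.
    return "http://%s/%s" % (domain, sha1_hex[:length])
```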

