eu-judgement-analyse's Introduction

Quantitative analysis of judgments by the European Court of Justice

Requirements

Usage

Acquire an EU account & get access to the EUR-Lex API

Create an official European Union account here via 'Create an account'. Afterwards, head to https://eur-lex.europa.eu/homepage.html and log in with your EU account. Then navigate to https://eur-lex.europa.eu/protected/web-service-registration.html and apply for API access. Your request will then be approved or rejected by an EU official. If your access is granted, you will receive an e-mail containing your API password. To get your API username, go to https://eur-lex.europa.eu/protected/user-account.html and use the name listed under the section "User name". Use this username & password combination in the next step.

EUR-LEX access

Create a file called eur_lex.ini in the root directory of the project with your EUR-LEX username and password specified as follows:

[eur-lex]
username=APIUsername
password=APIPassword
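
For reference, these credentials can be read with Python's standard configparser. The following is a minimal sketch only; the variable names are illustrative, not the project's actual code:

# Minimal sketch, using Python's standard configparser.
# Variable names are illustrative, not the project's actual code.
from configparser import ConfigParser

config = ConfigParser()
config.read("eur_lex.ini")

api_username = config["eur-lex"]["username"]
api_password = config["eur-lex"]["password"]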

Corpus Acquisition

Run setup.py and follow the prompts on the CLI to create a corpus of judgments for all specified languages. If you want to use the API and the web application, make sure an instance of MongoDB is running; select Export to corpus.csv instead if you only wish to use the dataset.

Analysis

Run server.py to start the server on localhost:5000. Once running, you can send corpus and analysis requests via HTTP requests with a JSON body.

Progressive web application

  • Install nodejs, which comes with npm pre-packaged
  • Open a command line and navigate to the webapp folder
  • Install the required node modules with npm install (this only needs to be done once, on first install)
  • Start the node server with npm start

After starting the server, the webapp can be reached at localhost:3000 in your preferred browser by default. Make sure the Python server is running as well, since it handles the queries sent through the webapp (see Analysis). For a detailed explanation with screenshots, please refer to the separate web app documentation inside the webapp folder.

Server API

The API accepts a JSON body when requesting data and returns results as JSON. Path: /eu-judgments/api/data, method: GET

JSON format

The JSON requires 3 mandatory keys to be specified:

| key | data type | description |
| --- | --- | --- |
| language | en, de | language of corpus to use |
| corpus | all, Dictionary[ ] | (sub-)corpus query. See the schema and query example. |
| analysis | Dictionary[ ] | Definition of the analysis to perform. See Analysis and query example. |

The keys of the JSON returned from the server match the types specified for analysis.

Analysis metrics: whole corpus

Unless specified otherwise, we always use the pre-trained Blackstone model by The Incorporated Council of Law Reporting for England and Wales for English and the standard medium German model of spaCy for all metrics. Furthermore, every text gets preprocessed and normalized, which enhances the quality of word & sentence separation significantly. Specifically, we remove paragraph numbers, white spaces before punctuation & parentheses, certain recurring legal headlines, as well as an ever-present header in older texts.

| metric | arguments | type (return value) | description |
| --- | --- | --- | --- |
| tokens | remove_punctuation, remove_stopwords, include_pos, exclude_pos, min_freq_per_doc, limit | List[str] | A customizable list of all tokens in the corpus. |
| unique_tokens | remove_punctuation, remove_stopwords, include_pos, exclude_pos, min_freq_per_doc | Set[str] | A customizable set of all unique tokens in the corpus. |
| token_count | remove_punctuation, remove_stopwords, include_pos, exclude_pos, min_freq_per_doc | int | # of all tokens. |
| average_token_length | remove_punctuation, remove_stopwords, include_pos, exclude_pos, min_freq | float | Mean token length in the corpus, based on different filter options. |
| average_word_length | remove_stopwords, include_pos, exclude_pos, min_freq | float | Mean word length in the corpus. |
| most_frequent_words | remove_stopwords, lemmatise, n | List[Tuple[str, int]] | Most frequently used words. Can be lemmatised and have stop words removed. |
| sentences | | List[str] | All sentences in the corpus. |
| sentence_count | | int | # of sentences. |
| lemmata | remove_stopwords, include_pos, exclude_pos | List[Tuple[str, str]] | A list of all words and their lemmata (we use [3] for German lemmatisation). |
| pos_tags | include_pos, exclude_pos | List[Tuple[str, str]] | A list of all tokens and their universal part-of-speech tags. |
| named_entities | | List[Tuple[str, List[str]]] | Calculates all named entities in the corpus and groups them by their label. |
| readability | | float | The average readability score of the corpus (Flesch Reading Ease). Identical to Defensiveness used by [5]. |
| n-grams | n, filter_stopwords, filter_nums, min_freq | List[str] | The most common n-grams (collocations) with length n (default 2). Can optionally be filtered. Similar to work done by [6]. |
| sentiment | | int | A normalized sentiment value for the whole corpus (0 - negative, 1 - neutral, 2 - positive sentiment) by [2]. An almost identical method is used by [5] (called Friendliness in their work), which inspired this feature. |
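
To illustrate how the filter arguments above map onto spaCy token attributes, here is a minimal sketch; the function, model name and defaults are illustrative, not the project's actual implementation:

# Illustrative sketch only: shows how arguments such as remove_punctuation,
# remove_stopwords, include_pos and exclude_pos can map onto spaCy token
# attributes. Not the project's actual implementation.
import spacy

def filter_tokens(doc, remove_punctuation=True, remove_stopwords=True,
                  include_pos=None, exclude_pos=None):
    tokens = []
    for token in doc:
        if remove_punctuation and token.is_punct:
            continue
        if remove_stopwords and token.is_stop:
            continue
        if include_pos and token.pos_ not in include_pos:
            continue
        if exclude_pos and token.pos_ in exclude_pos:
            continue
        tokens.append(token.text)
    return tokens

nlp = spacy.load("en_core_web_md")  # or the Blackstone model for legal English
doc = nlp("The Court of Justice dismissed the application in its judgment.")
print(filter_tokens(doc, include_pos=["NOUN", "PROPN"]))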

Specific metrics: specific sub-corpora

| type-value | arguments | type (return value) | description |
| --- | --- | --- | --- |
| keywords | top_n | List[Tuple[str, int]] | List of key terms computed with the PositionRank algorithm [1], with their corresponding weight in the document. Only available on single documents or on a per-document basis. |
| similarity | | float | Calculates the vector similarity (0 - 1) of two documents based on their word embeddings (similar to [4]). Only available when comparing two documents; returns -1 instead if more or fewer than two documents are specified. |
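
As an illustration, this kind of embedding-based similarity can be computed with spaCy as follows; this is a sketch only, using a generic model, and not necessarily the project's exact code:

# Sketch: vector similarity of two documents via spaCy word embeddings.
# Model choice is an example; the project may use a different model.
import spacy

nlp = spacy.load("en_core_web_md")
doc1 = nlp("The Court dismissed the application.")
doc2 = nlp("The application was rejected by the Court.")
print(doc1.similarity(doc2))  # similarity of averaged word vectors, roughly 0 - 1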

Specific metrics: sub-corpus

The following metrics specify per-document analysis and return a list with the respective metric described above for each document:

  • tokens_per_doc
  • token_count_per_doc
  • unique_tokens_per_doc
  • most_frequent_words_per_doc
  • sentences_per_doc
  • sentence_count_per_doc
  • pos_tags_per_doc
  • lemmata_per_doc
  • named_entities_per_doc
  • readability_per_doc
  • sentiment_per_doc
  • keywords_per_doc
  • n-grams_per_doc

Note: keywords is only available per document and cannot be computed on a corpus at once, because PositionRank is not suited for more than one document.

Analysis of single document

Example:

{
    "language": "en",
    "corpus": 
        {
            "column" : "celex",
            "value": "61955CJ0008"
        },
    "analysis": [
        {
            "type": "n-grams",
            "n": 2,
            "limit": 10
        },
        {
            "type": "readability"
        },
        {
            "type": "tokens",
            "limit": 50
        }
    ]
}
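
The example above can be sent to the running server like this; a minimal sketch, assuming the Python requests package is installed:

# Minimal sketch, assuming the "requests" package is installed.
# Sends the example request above to the locally running server.
import requests

payload = {
    "language": "en",
    "corpus": {"column": "celex", "value": "61955CJ0008"},
    "analysis": [
        {"type": "n-grams", "n": 2, "limit": 10},
        {"type": "readability"},
        {"type": "tokens", "limit": 50},
    ],
}

response = requests.get("http://localhost:5000/eu-judgments/api/data", json=payload)
print(response.json())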

Creating custom sub-corpora

Sub-corpora can be created using values that must or must not be included in a document for it to be added. Use column to determine the column according to the database schema and value to determine its value. (Exception: date, which takes a start date and an end date.)
Set the search identifier flag to true if you want to search for abbreviations (ids) instead of verbose descriptions (labels).
Set operator to NOT if you want to exclude all documents containing this value from your sub-corpus.
When using an array of values, all documents matching any of the values in the array will be in- or excluded.

Example custom subcorpus:

{
    "language": "en",
    "corpus": [
        {
            "column" : "date",
            "start date" : "1958-07-17",
            "end date" : "1975-02-25"
        },
        {
            "column" : "author",
            "value" : "Court of Justice"
        },
        {
            "operator" : "NOT",
            "search identifier" : true,
            "column" : "case_law_directory",
            "value" : ["F", "C"]
        }
   ]
}
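
For orientation, a sub-corpus definition like the one above corresponds roughly to a MongoDB filter along these lines. This is a hedged sketch only; database/collection names and field paths are assumptions, and the project's actual query construction may differ:

# Rough sketch of how the sub-corpus example above could translate into a
# pymongo filter. Database/collection names and field paths are assumptions.
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
collection = client["eu_judgments"]["judgment_corpus"]  # database name is hypothetical

query = {
    "date": {"$gte": "1958-07-17", "$lte": "1975-02-25"},
    "author.labels": "Court of Justice",
    "case_law_directory.ids": {"$nin": ["F", "C"]},  # operator NOT with search identifier
}
documents = list(collection.find(query))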

Customisation

By default, the API is configured to run on a single system without any scaling options enabled except multi-threading. The configuration file config.ini then looks something like this:

[execution_mode]
server_mode=False 
[mongo_db]
host=localhost
port=27017
collection=judgment_corpus
[celery]
broker=redis
host=localhost
port=6379
[analysis]
threads=-1

Besides the configuration of the database in the section mongo_db, it is also possible to limit the analysis to a certain number of threads by changing threads. By default it uses all threads (specified by -1). In this configuration, however, it is not possible to execute multiple requests at the same time, as there is only one analysis instance at a time. If you want to change this, set server_mode to True and make sure celery & redis are installed and the latter is running on your setup. If necessary, change the redis port in the configuration file. Celery then acts as a load distribution system (task queue) while redis acts as the task broker. We provide two different task queues: one for small analysis tasks with fewer than ten documents in a corpus and one for bigger corpora. These are defined in tasks.py. To enable both queues, open two separate terminals before starting your server and execute:

celery -A tasks.celery worker -Q celery -c2

This command starts a task queue with two processes (i.e. analysis instances) for time-efficient analysis tasks. Note the -c2 parameter here: it specifies the number of sub-processes spawned for this queue, so if you want more than two processes, use another number here. Each process accepts multiple tasks (we use the default value of four here) before a new process is used. In total, this configuration offers eight slots (two processes × four tasks each) of time-efficient calculation.

celery -A tasks.celery worker -Q huge_corpus -c10 -Ofair

This command starts a task queue with ten processes (i.e. analysis instances). Here, however, each process can only accept one task at a time, so whenever a new request comes in, a whole new process is used to ensure proper parallelisation, up to the limit specified with -c10 (in this case ten processes). After this limit has been reached, each new task must wait for another task to finish before being processed. To save memory, we decided to end each process after execution if it exceeds a memory limit of 6 GB. The parameter -Ofair ensures each worker process only takes one task at a time. Unfortunately, this configuration parameter is currently ignored when set via Python, so we need to specify it via the command line (https://stackoverflow.com/questions/42433770/celery-multiple-workers-but-one-queue).
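
For orientation, the Celery setup behind these commands (in tasks.py) can be expected to look roughly like this. This is a hedged sketch only, assuming the Redis broker from config.ini; task names and bodies are illustrative:

# Hedged sketch of a Celery app with the two queues described above.
# Task names and bodies are illustrative, not the project's actual code.
from celery import Celery

celery = Celery("tasks", broker="redis://localhost:6379/0")

celery.conf.task_routes = {
    "tasks.analyse_small": {"queue": "celery"},      # default queue for small corpora
    "tasks.analyse_huge": {"queue": "huge_corpus"},  # queue for large corpora
}

@celery.task(name="tasks.analyse_small")
def analyse_small(request_body):
    ...  # run the analysis for a corpus with fewer than ten documents

@celery.task(name="tasks.analyse_huge")
def analyse_huge(request_body):
    ...  # run the analysis for a larger corpus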

Database schema

| key | value-type | description |
| --- | --- | --- |
| _id | string | MongoDB UID |
| reference | string | Cellar API reference number |
| title | string | Document title |
| text | string | Full text of the judgment |
| keywords | string | |
| parties | string | Parties involved in the judgment |
| subject | string | Subject of the case |
| endorsements | string | |
| grounds | string | Legal grounds |
| decisions_on_costs | string | |
| operative_part | string | |
| celex | string | CELEX number of the judgment |
| ecli | string | European 5-part unique document identifier |
| date | string | Adoption, signature or publication date (varies) |
| case_affecting | string[ ] | CELEX numbers of acts quoted in the operative part |
| affected_by_case | string[ ] | CELEX numbers of decisions affecting the act |
| author | { ids : string[ ], labels : string[ ] } | |
| subject_matter | { ids : string[ ], labels : string[ ] } | Subject matter descriptors |
| case_law_directory | { ids : string[ ], labels : string[ ] } | Assigned case-law directory code |
| applicant | { ids : string[ ], labels : string[ ] } | Entity who submitted the application |
| defendant | { ids : string[ ], labels : string[ ] } | Entity defending |
| procedure_type | { ids : string[ ], labels : string[ ] } | Nature and outcome (where possible) of the proceedings |
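
For illustration, a single stored judgment following this schema has roughly the following shape (all values are placeholders, not real data):

# Illustrative shape of one stored judgment following the schema above.
# All values are placeholders, not real data.
example_document = {
    "celex": "61955CJ0008",
    "title": "...",
    "text": "...",
    "date": "...",                      # adoption, signature or publication date
    "case_affecting": ["..."],          # CELEX numbers as strings
    "author": {"ids": ["..."], "labels": ["Court of Justice"]},
    "case_law_directory": {"ids": ["C"], "labels": ["..."]},
}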

References

[1] Florescu, C., & Caragea, C. (2017, July). PositionRank: An unsupervised approach to keyphrase extraction from scholarly documents. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 1105-1115).

[2] Kim, Y. (2014). Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

[3] Liebeck, M., & Conrad, S. (2015, July). IWNLP: Inverse Wiktionary for natural language processing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing.

[4] Ash, E., Chen, D. L., & Ornaghi, A. (2018). Implicit bias in the judiciary: Evidence from judicial language associations. Technical report.

[5] Carlson, K., Livermore, M. A., & Rockmore, D. (2015). A quantitative analysis of writing style on the US Supreme Court. Wash. UL Rev., 93, 1461.

[6] Abegg, A., & Bubenhofer, N. (2016). Empirische Linguistik im Recht: Am Beispiel des Wandels des Staatsverständnisses im Sicherheitsrecht, öffentlichen Wirtschaftsrecht und Sozialrecht der Schweiz. Ancilla Iuris, (1), 1-41.


eu-judgement-analyse's Issues

Bug: CorpusAnalysis get_n_grams not working

get_n_grams does not work for CorpusAnalysis objects. However, it does work for Analysis.
Error:

line 276, in get_n_grams
    return list(textacy.extract.ngrams(self.corpus, n, filter_stops=filter_stop_words, filter_punct=True, filter_nums=filter_nums, min_freq=min_freq))
  File "textacy\extract.py", line 155, in ngrams
    ngrams_ = list(ngrams_)
  File "textacy\extract.py", line 141, in <genexpr>
    ngrams_ = (ngram for ngram in ngrams_ if not any(w.like_num for w in ngram))
  File "textacy\extract.py", line 139, in <genexpr>
    ngrams_ = (ngram for ngram in ngrams_ if not any(w.is_punct for w in ngram))
  File "textacy\extract.py", line 136, in <genexpr>
    ngram for ngram in ngrams_ if not ngram[0].is_stop and not ngram[-1].is_stop
  File "textacy\extract.py", line 133, in <genexpr>
    ngrams_ = (ngram for ngram in ngrams_ if not any(w.is_space for w in ngram))
  File "textacy\extract.py", line 133, in <genexpr>
AttributeError: 'spacy.tokens.doc.Doc' object has no attribute 'is_space'

Fix dependency issues on analysis

Dependencies added by 46a0f66 have not been listed in the requirements.txt file, and manual installation led to a compile error on the blackstone package, since it requires spaCy 2.1.8, which fails on compilation (up-to-date versions of spaCy compile properly).
Add the dependencies to the requirements file to ensure all contributors are using the correct version of each package.
If any external dependencies are required, state them in README.md.

CorpusAnalysis: per_doc analysis not assigned to document IDs

When performing a per-doc analysis on a corpus, like get_tokens_per_doc, the result is a list of lists of all tokens, without any indication of which document each result list belongs to.
This should probably be changed to a list of dicts containing a celex and a result key, to ensure each result is linked to its original doc.

Text complexity

How hard is it to understand a certain text? Is there a difference between languages?

Replace beautifulSoup data parse approach

Benchmarks show that the current approach of parsing data via XML, BeautifulSoup and find(), when fully implemented, has an average execution time of 22 ms for a single document tag (sample size = 100, author tag used), timed from the received response until all authors of the document are inside a list in the final JSON for the Mongo database.
Using OrderedDicts, performance improves roughly threefold, with an average execution time of <7 ms (sample size = 100).

Since this code has to be executed 20 * 6000 times (n(document tags) * n(documents)), this improvement is significant.
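
One way to obtain such an OrderedDict-based representation is xmltodict, named here purely as an illustration; the issue does not specify a library:

# Hedged illustration only: one way to get an (Ordered)Dict view of an XML
# response. The issue above does not name a specific library.
import xmltodict

sample = "<NOTICE><AUTHOR>Court of Justice</AUTHOR></NOTICE>"  # toy XML, not real EUR-Lex data
notice = xmltodict.parse(sample)

# Nested tags become dict keys, so lookups replace repeated find() calls.
print(notice["NOTICE"]["AUTHOR"])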

Dynamic data visualisation

Implement dynamic visualisation for the data to be queried from the server.
Eventually:

  • different graphs
  • word clouds
  • etc

Lookup table for stored identifiers

Currently only the IDs of people/institutions/etc. are stored, since this is fully sufficient for data analysis.
However, in the web application, being able to display a verbose equivalent would be advantageous.
Preferably create a dedicated document in our database for that purpose.

| ID | Label |
| --- | --- |
| COMM | Commission |
| B | European Community (EEC/EC) |

Handle API requests async

Currently, a single request locks the whole server/website. We should provide an asynchronous query queue to circumvent this problem and enable multiple users at a time.

Judgement classification

Is it possible to classify judgments into certain clusters, e.g. by using the already defined directory codes of the EU in combination with supervised learning algorithms?

Implement "x per doc" visualization

Some analysis metrics like total token count can be returned per document in a corpus (e.g. a corpus with 10 documents returns 10 token counts, one for each document). This could be visualized in a bar chart per document, as long as the document count is reasonable.

Disable unused pipeline components

Currently, some pipeline components are more demanding than others. However, we calculate all metrics at once, independent of the requested metrics. We should disable unnecessary pipeline components before actually calculating anything, to speed up calculation and reduce the memory footprint. This needs to be done in the analysis as well as in the API backend.
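
A minimal sketch of what this can look like with spaCy; model name and component choices are examples only:

# Illustrative sketch: skip pipeline components that the requested metrics do
# not need. Model name and component choices are examples only.
import spacy

# Option 1: never load the unneeded components in the first place.
nlp_light = spacy.load("en_core_web_md", disable=["ner", "parser"])

# Option 2: disable components temporarily for one processing run (spaCy 2.x API).
nlp = spacy.load("en_core_web_md")
with nlp.disable_pipes("ner"):
    doc = nlp("The Court dismissed the application.")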

Transform database query to query profile

It seems like we get random parts of the corpus after each query instead of a sorted corpus. We must check whether there are duplicates across multiple pages or whether the query "seed" stays consistent. If it doesn't, try to fix it with a query profile instead of a query.

Bug: API response ignores limit for tokens

Sending a request with the following body returns 638 (all?) tokens instead of the set limit of 30:

{
    "language": "en",
    "corpus": {
        "column": "celex",
        "value": "61955CJ0008"
    },
    "analysis": [
        {
            "type": "tokens",
            "limit": "30"
        }
    ]
}

Edit: this bug only appeared after I pulled the new version which was merged yesterday; the token limit worked prior to pull request #47.

Web client start page

Create a basic web page for the client.
The page could include:

  • search bar
  • selection box for type of analysis
  • option to select time frame for analysis
  • placeholder for dynamic graphs

Keep corpus up-to-date automatically

Write a script that checks the current status of the corpus documents and only fetches new documents from the EUR-Lex servers.
Preferably save it in its own update.py file, to allow managing update periods with CronJobs.

Fix high memory allocation

When requesting, parsing and inserting every available document into the database (24 languages * 5,555 docs), Python steadily increases its memory allocation up to 3.5 GB shortly before finishing.
The source of this might be some large variables not getting properly overwritten or deleted by the garbage collector.
This problem should be fixed so the code can be run on lower-performance hardware, since CPU usage is already pretty low.

Dialog for specifying corpus creation criteria

Create a dialog (either CLI or inside the webapp) to ask the user which criteria to create the corpus on (for now, only a list of language options, possibly other criteria).

Server REST-API

Implement a REST API that sends analysis data to a requesting client based on parameters.

Manually update blackstone to spaCy 2.2

Currently, Blackstone requires spaCy version 2.1.8, but this version is known for memory leaks. We should either switch to 2.1.9 (which fixed those leaks) or upgrade the Blackstone models to spaCy 2.2 for multi-core support and memory leak fixes.
