eu-judgement-analyse's Introduction

Quantitative analysis of judgments by the European Court of Justice

Requirements

Usage

Acquire an EU account & get access to the EUR-Lex API

Create an official European Union account here via 'Create an account'. Afterwards, head to https://eur-lex.europa.eu/homepage.html and log in with your EU account. Then navigate to https://eur-lex.europa.eu/protected/web-service-registration.html and apply for API access. Your request will then be approved or rejected by an EU official. If your access is granted, you will receive an e-mail containing your API password. To get your API username, go to https://eur-lex.europa.eu/protected/user-account.html and use the name listed under the section "User name". Use this username & password combination in the next step.

EUR-LEX access

Create a file called eur_lex.ini in the root directory of the project with your EUR-LEX username and password specified as follows:

[eur-lex]
username=APIUsername
password=APIPassword
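
For reference, these credentials can be read with Python's standard configparser. The following is a minimal sketch only; the variable names are illustrative, not the project's actual code:

# Minimal sketch, using Python's standard configparser.
# Variable names are illustrative, not the project's actual code.
from configparser import ConfigParser

config = ConfigParser()
config.read("eur_lex.ini")

api_username = config["eur-lex"]["username"]
api_password = config["eur-lex"]["password"]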

Corpus Acquisition

Run setup.py and follow the prompts on the CLI to create a corpus of judgments for all specified languages. If you want to use the API and the web application, make sure an instance of MongoDB is running; select Export to corpus.csv instead if you only wish to use the dataset.

Analysis

Run server.py to start the server on localhost:5000. Once running, you can send corpus and analysis requests via HTTP requests with a JSON body.

Progressive web application

  • Install nodejs, which comes with npm pre-packaged
  • Open a command line and navigate to the webapp folder
  • Install the required node modules with npm install (this only needs to be done once, on first install)
  • Start the node server with npm start

After starting the server, the webapp can be reached at localhost:3000 in your preferred browser by default. Make sure the Python server is running as well, since it handles the queries sent through the webapp (see Analysis). For a detailed explanation with screenshots, please refer to the separate web app documentation inside the webapp folder.

Server API

The API accepts a JSON body when requesting data and returns results as JSON. Path: /eu-judgments/api/data, method: GET

JSON format

The JSON requires 3 mandatory keys to be specified:

| key | data type | description |
| --- | --- | --- |
| language | en, de | language of corpus to use |
| corpus | all, Dictionary[ ] | (sub-)corpus query. See the schema and query example. |
| analysis | Dictionary[ ] | Definition of the analysis to perform. See Analysis and query example. |

The keys of the JSON returned from the server match the types specified for analysis.

Analysis metrics: whole corpus

Unless specified otherwise, we always use the pre-trained Blackstone model by The Incorporated Council of Law Reporting for England and Wales for English and the standard medium German model of spaCy for all metrics. Furthermore, every text gets preprocessed and normalized, which enhances the quality of word & sentence separation significantly. Specifically, we remove paragraph numbers, white spaces before punctuation & parentheses, certain recurring legal headlines, as well as an ever-present header in older texts.

| metric | arguments | type (return value) | description |
| --- | --- | --- | --- |
| tokens | remove_punctuation, remove_stopwords, include_pos, exclude_pos, min_freq_per_doc, limit | List[str] | A customizable list of all tokens in the corpus. |
| unique_tokens | remove_punctuation, remove_stopwords, include_pos, exclude_pos, min_freq_per_doc | Set[str] | A customizable set of all unique tokens in the corpus. |
| token_count | remove_punctuation, remove_stopwords, include_pos, exclude_pos, min_freq_per_doc | int | # of all tokens. |
| average_token_length | remove_punctuation, remove_stopwords, include_pos, exclude_pos, min_freq | float | Mean token length in the corpus, based on different filter options. |
| average_word_length | remove_stopwords, include_pos, exclude_pos, min_freq | float | Mean word length in the corpus. |
| most_frequent_words | remove_stopwords, lemmatise, n | List[Tuple[str, int]] | Most frequently used words. Can be lemmatised and have stop words removed. |
| sentences | | List[str] | All sentences in the corpus. |
| sentence_count | | int | # of sentences. |
| lemmata | remove_stopwords, include_pos, exclude_pos | List[Tuple[str, str]] | A list of all words and their lemmata (we use [3] for German lemmatisation). |
| pos_tags | include_pos, exclude_pos | List[Tuple[str, str]] | A list of all tokens and their universal part-of-speech tags. |
| named_entities | | List[Tuple[str, List[str]]] | Calculates all named entities in the corpus and groups them by their label. |
| readability | | float | The average readability score of the corpus (Flesch Reading Ease). Identical to Defensiveness used by [5]. |
| n-grams | n, filter_stopwords, filter_nums, min_freq | List[str] | The most common n-grams (collocations) with length n (default 2). Can optionally be filtered. Similar to work done by [6]. |
| sentiment | | int | A normalized sentiment value for the whole corpus (0 - negative, 1 - neutral, 2 - positive sentiment) by [2]. An almost identical method is used by [5] (called Friendliness in their work), which inspired this feature. |
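
To illustrate how the filter arguments above map onto spaCy token attributes, here is a minimal sketch; the function, model name and defaults are illustrative, not the project's actual implementation:

# Illustrative sketch only: shows how arguments such as remove_punctuation,
# remove_stopwords, include_pos and exclude_pos can map onto spaCy token
# attributes. Not the project's actual implementation.
import spacy

def filter_tokens(doc, remove_punctuation=True, remove_stopwords=True,
                  include_pos=None, exclude_pos=None):
    tokens = []
    for token in doc:
        if remove_punctuation and token.is_punct:
            continue
        if remove_stopwords and token.is_stop:
            continue
        if include_pos and token.pos_ not in include_pos:
            continue
        if exclude_pos and token.pos_ in exclude_pos:
            continue
        tokens.append(token.text)
    return tokens

nlp = spacy.load("en_core_web_md")  # or the Blackstone model for legal English
doc = nlp("The Court of Justice dismissed the application in its judgment.")
print(filter_tokens(doc, include_pos=["NOUN", "PROPN"]))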

Specific metrics: specific sub-corpora

| type-value | arguments | type (return value) | description |
| --- | --- | --- | --- |
| keywords | top_n | List[Tuple[str, int]] | List of key terms computed with the PositionRank algorithm [1], with their corresponding weight in the document. Only available on single documents or on a per-document basis. |
| similarity | | float | Calculates the vector similarity (0 - 1) of two documents based on their word embeddings (similar to [4]). Only available when comparing two documents; returns -1 instead if more or fewer than two documents are specified. |
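
As an illustration, this kind of embedding-based similarity can be computed with spaCy as follows; this is a sketch only, using a generic model, and not necessarily the project's exact code:

# Sketch: vector similarity of two documents via spaCy word embeddings.
# Model choice is an example; the project may use a different model.
import spacy

nlp = spacy.load("en_core_web_md")
doc1 = nlp("The Court dismissed the application.")
doc2 = nlp("The application was rejected by the Court.")
print(doc1.similarity(doc2))  # similarity of averaged word vectors, roughly 0 - 1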

Specific metrics: sub-corpus

The following metrics specify per-document analysis and return a list with the respective metric described above for each document:

  • tokens_per_doc
  • token_count_per_doc
  • unique_tokens_per_doc
  • most_frequent_words_per_doc
  • sentences_per_doc
  • sentence_count_per_doc
  • pos_tags_per_doc
  • lemmata_per_doc
  • named_entities_per_doc
  • readability_per_doc
  • sentiment_per_doc
  • keywords_per_doc
  • n-grams_per_doc

Note: keywords is only available per document and cannot be computed on a corpus at once, because PositionRank is not suited for more than one document.

Analysis of single document

Example:

{
    "language": "en",
    "corpus": 
        {
            "column" : "celex",
            "value": "61955CJ0008"
        },
    "analysis": [
        {
            "type": "n-grams",
            "n": 2,
            "limit": 10
        },
        {
            "type": "readability"
        },
        {
            "type": "tokens",
            "limit": 50
        }
    ]
}
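
The example above can be sent to the running server like this; a minimal sketch, assuming the Python requests package is installed:

# Minimal sketch, assuming the "requests" package is installed.
# Sends the example request above to the locally running server.
import requests

payload = {
    "language": "en",
    "corpus": {"column": "celex", "value": "61955CJ0008"},
    "analysis": [
        {"type": "n-grams", "n": 2, "limit": 10},
        {"type": "readability"},
        {"type": "tokens", "limit": 50},
    ],
}

response = requests.get("http://localhost:5000/eu-judgments/api/data", json=payload)
print(response.json())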

Creating custom sub-corpora

Sub-corpora can be created using values that must or must not be included in a document for it to be added. Use column to determine the column according to the database schema and value to determine its value. (Exception: date, which takes a start date and an end date.)
Set the search identifier flag to true if you want to search for abbreviations (ids) instead of verbose descriptions (labels).
Set operator to NOT if you want to exclude all documents containing this value from your sub-corpus.
When using an array of values, all documents matching any of the values in the array will be in- or excluded.

Example custom subcorpus:

{
    "language": "en",
    "corpus": [
        {
            "column" : "date",
            "start date" : "1958-07-17",
            "end date" : "1975-02-25"
        },
        {
            "column" : "author",
            "value" : "Court of Justice"
        },
        {
            "operator" : "NOT",
            "search identifier" : true,
            "column" : "case_law_directory",
            "value" : ["F", "C"]
        }
   ]
}
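
For orientation, a sub-corpus definition like the one above corresponds roughly to a MongoDB filter along these lines. This is a hedged sketch only; database/collection names and field paths are assumptions, and the project's actual query construction may differ:

# Rough sketch of how the sub-corpus example above could translate into a
# pymongo filter. Database/collection names and field paths are assumptions.
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
collection = client["eu_judgments"]["judgment_corpus"]  # database name is hypothetical

query = {
    "date": {"$gte": "1958-07-17", "$lte": "1975-02-25"},
    "author.labels": "Court of Justice",
    "case_law_directory.ids": {"$nin": ["F", "C"]},  # operator NOT with search identifier
}
documents = list(collection.find(query))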

Customisation

By default, the API is configured to run on a single system without any scaling options enabled except multi-threading. The configuration file config.ini then looks something like this:

[execution_mode]
server_mode=False 
[mongo_db]
host=localhost
port=27017
collection=judgment_corpus
[celery]
broker=redis
host=localhost
port=6379
[analysis]
threads=-1

Besides the configuration of the database in the section mongo_db, it is also possible to limit the analysis to a certain number of threads by changing threads. By default it uses all threads (specified by -1). In this configuration, however, it is not possible to execute multiple requests at the same time, as there is only one analysis instance at a time. If you want to change this, set server_mode to True and make sure celery & redis are installed and the latter is running on your setup. If necessary, change the redis port in the configuration file. Celery then acts as a load distribution system (task queue) while redis acts as the task broker. We provide two different task queues: one for small analysis tasks with fewer than ten documents in a corpus and one for bigger corpora. These are defined in tasks.py. To enable both queues, open two separate terminals before starting your server and execute:

celery -A tasks.celery worker -Q celery -c2

This command starts a task queue with two processes (i.e. analysis instances) for time-efficient analysis tasks. Note the -c2 parameter here: it specifies the number of sub-processes spawned for this queue, so if you want more than two processes, use another number here. Each process accepts multiple tasks (we use the default value of four here) before a new process is used. In total, this configuration offers eight slots (two processes × four tasks each) of time-efficient calculation.

celery -A tasks.celery worker -Q huge_corpus -c10 -Ofair

This command starts a task queue with ten processes (i.e. analysis instances). Here, however, each process can only accept one task at a time, so whenever a new request comes in, a whole new process is used to ensure proper parallelisation, up to the limit specified with -c10 (in this case ten processes). After this limit has been reached, each new task must wait for another task to finish before being processed. To save memory, we decided to end each process after execution if it exceeds a memory limit of 6 GB. The parameter -Ofair ensures each worker process only takes one task at a time. Unfortunately, this configuration parameter is currently ignored when set via Python, so we need to specify it via the command line (https://stackoverflow.com/questions/42433770/celery-multiple-workers-but-one-queue).
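
For orientation, the Celery setup behind these commands (in tasks.py) can be expected to look roughly like this. This is a hedged sketch only, assuming the Redis broker from config.ini; task names and bodies are illustrative:

# Hedged sketch of a Celery app with the two queues described above.
# Task names and bodies are illustrative, not the project's actual code.
from celery import Celery

celery = Celery("tasks", broker="redis://localhost:6379/0")

celery.conf.task_routes = {
    "tasks.analyse_small": {"queue": "celery"},      # default queue for small corpora
    "tasks.analyse_huge": {"queue": "huge_corpus"},  # queue for large corpora
}

@celery.task(name="tasks.analyse_small")
def analyse_small(request_body):
    ...  # run the analysis for a corpus with fewer than ten documents

@celery.task(name="tasks.analyse_huge")
def analyse_huge(request_body):
    ...  # run the analysis for a larger corpus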

Database schema

| key | value-type | description |
| --- | --- | --- |
| _id | string | MongoDB UID |
| reference | string | Cellar API reference number |
| title | string | Document title |
| text | string | Full text of the judgment |
| keywords | string | |
| parties | string | Parties involved in the judgment |
| subject | string | Subject of the case |
| endorsements | string | |
| grounds | string | Legal grounds |
| decisions_on_costs | string | |
| operative_part | string | |
| celex | string | CELEX number of the judgment |
| ecli | string | European 5-part unique document identifier |
| date | string | Adoption, signature or publication date (varies) |
| case_affecting | string[ ] | CELEX numbers of acts quoted in the operative part |
| affected_by_case | string[ ] | CELEX numbers of decisions affecting the act |
| author | { ids : string[ ], labels : string[ ] } | |
| subject_matter | { ids : string[ ], labels : string[ ] } | Subject matter descriptors |
| case_law_directory | { ids : string[ ], labels : string[ ] } | Assigned case-law directory code |
| applicant | { ids : string[ ], labels : string[ ] } | Entity who submitted the application |
| defendant | { ids : string[ ], labels : string[ ] } | Entity defending |
| procedure_type | { ids : string[ ], labels : string[ ] } | Nature and outcome (where possible) of the proceedings |
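
For illustration, a single stored judgment following this schema has roughly the following shape (all values are placeholders, not real data):

# Illustrative shape of one stored judgment following the schema above.
# All values are placeholders, not real data.
example_document = {
    "celex": "61955CJ0008",
    "title": "...",
    "text": "...",
    "date": "...",                      # adoption, signature or publication date
    "case_affecting": ["..."],          # CELEX numbers as strings
    "author": {"ids": ["..."], "labels": ["Court of Justice"]},
    "case_law_directory": {"ids": ["C"], "labels": ["..."]},
}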

References

[1] Florescu, C., & Caragea, C. (2017, July). PositionRank: An unsupervised approach to keyphrase extraction from scholarly documents. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 1105-1115).

[2] Kim, Y. (2014). Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

[3] Liebeck, M., & Conrad, S. (2015, July). IWNLP: Inverse Wiktionary for natural language processing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing.

[4] Ash, E., Chen, D. L., & Ornaghi, A. (2018). Implicit bias in the judiciary: Evidence from judicial language associations. Technical report.

[5] Carlson, K., Livermore, M. A., & Rockmore, D. (2015). A quantitative analysis of writing style on the US Supreme Court. Wash. UL Rev., 93, 1461.

[6] Abegg, A., & Bubenhofer, N. (2016). Empirische Linguistik im Recht: Am Beispiel des Wandels des Staatsverständnisses im Sicherheitsrecht, öffentlichen Wirtschaftsrecht und Sozialrecht der Schweiz. Ancilla Iuris, (1), 1-41.


eu-judgement-analyse's Issues

Bug: CorpusAnalysis get_n_grams not working

get_n_grams does not work for CorpusAnalysis objects. However, it does work for Analysis.
Error:

line 276, in get_n_grams
    return list(textacy.extract.ngrams(self.corpus, n, filter_stops=filter_stop_words, filter_punct=True, filter_nums=filter_nums, min_freq=min_freq))
  File "textacy\extract.py", line 155, in ngrams
    ngrams_ = list(ngrams_)
  File "textacy\extract.py", line 141, in <genexpr>
    ngrams_ = (ngram for ngram in ngrams_ if not any(w.like_num for w in ngram))
  File "textacy\extract.py", line 139, in <genexpr>
    ngrams_ = (ngram for ngram in ngrams_ if not any(w.is_punct for w in ngram))
  File "textacy\extract.py", line 136, in <genexpr>
    ngram for ngram in ngrams_ if not ngram[0].is_stop and not ngram[-1].is_stop
  File "textacy\extract.py", line 133, in <genexpr>
    ngrams_ = (ngram for ngram in ngrams_ if not any(w.is_space for w in ngram))
  File "textacy\extract.py", line 133, in <genexpr>
AttributeError: 'spacy.tokens.doc.Doc' object has no attribute 'is_space'

Fix dependency issues on analysis

Dependencies added by 46a0f66 have not been listed in the requirements.txt file, and manual installation led to a compile error on the blackstone package, since it requires spaCy 2.1.8, which fails on compilation (up-to-date versions of spaCy compile properly).
Add the dependencies to the requirements file to ensure all contributors are using the correct version of each package.
If any external dependencies are required, state them in README.md.

CorpusAnalysis: per_doc analysis not assigned to document IDs

When performing a per-doc analysis on a corpus, like get_tokens_per_doc, the result is a list of lists of all tokens, without any indication of which document each result list belongs to.
This should probably be changed to a list of dicts containing a celex and a result key, to ensure each result is linked to its original doc.

Text complexity

How hard is it to understand a certain text? Is there a difference between languages?

Replace beautifulSoup data parse approach

Benchmarks show that the current approach of parsing data via XML, BeautifulSoup and find(), when fully implemented, has an average execution time of 22 ms for a single document tag (sample size = 100, author tag used), timed from the received response until all authors of the document are inside a list in the final JSON for the Mongo database.
Using OrderedDicts, performance improves roughly threefold, with an average execution time of <7 ms (sample size = 100).

Since this code has to be executed 20 * 6000 times (n(document tags) * n(documents)), this improvement is significant.
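
One way to obtain such an OrderedDict-based representation is xmltodict, named here purely as an illustration; the issue does not specify a library:

# Hedged illustration only: one way to get an (Ordered)Dict view of an XML
# response. The issue above does not name a specific library.
import xmltodict

sample = "<NOTICE><AUTHOR>Court of Justice</AUTHOR></NOTICE>"  # toy XML, not real EUR-Lex data
notice = xmltodict.parse(sample)

# Nested tags become dict keys, so lookups replace repeated find() calls.
print(notice["NOTICE"]["AUTHOR"])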

Dynamic data visualisation

Implement dynamic visualisation for the data to be queried from the server.
Eventually:

  • different graphs
  • word clouds
  • etc

Lookup table for stored identifiers

Currently only the IDs of people/institutions/etc. are stored, since this is fully sufficient for data analysis.
However, in the web application, being able to display a verbose equivalent would be advantageous.
Preferably create a dedicated document in our database for that purpose.

| ID | Label |
| --- | --- |
| COMM | Commission |
| B | European Community (EEC/EC) |

Handle API requests async

Currently, a single request locks the whole server/website. We should provide an asynchronous query queue to circumvent this problem and enable multiple users at a time.

Judgement classification

Is it possible to classify judgments into certain clusters, e.g. by using the already defined directory codes of the EU in combination with supervised learning algorithms?

Implement "x per doc" visualization

Some analysis metrics like total token count can be returned per document in a corpus (e.g. a corpus with 10 documents returns 10 token counts, one for each document). This could be visualized in a bar chart per document, as long as the document count is reasonable.

Disable unused pipeline components

Currently, some pipeline components are more demanding than others. However, we calculate all metrics at once, independent of the requested metrics. We should disable unnecessary pipeline components before actually calculating anything, to speed up calculation and reduce the memory footprint. This needs to be done in the analysis as well as in the API backend.
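
A minimal sketch of what this can look like with spaCy; model name and component choices are examples only:

# Illustrative sketch: skip pipeline components that the requested metrics do
# not need. Model name and component choices are examples only.
import spacy

# Option 1: never load the unneeded components in the first place.
nlp_light = spacy.load("en_core_web_md", disable=["ner", "parser"])

# Option 2: disable components temporarily for one processing run (spaCy 2.x API).
nlp = spacy.load("en_core_web_md")
with nlp.disable_pipes("ner"):
    doc = nlp("The Court dismissed the application.")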

Transform database query to query profile

It seems like we get random parts of the corpus after each query instead of a sorted corpus. We must check whether there are duplicates across multiple pages or whether the query "seed" stays consistent. If it doesn't, try to fix it with a query profile instead of a query.

Bug: API response ignores limit for tokens

Sending a request with the following body returns 638 (all?) tokens instead of the set limit of 30:

{
    "language": "en",
    "corpus": {
        "column": "celex",
        "value": "61955CJ0008"
    },
    "analysis": [
        {
            "type": "tokens",
            "limit": "30"
        }
    ]
}

Edit: this bug only appeared after I pulled the new version which was merged yesterday; the token limit worked prior to pull request #47.

Web client start page

Create a basic web page for the client.
The page could include:

  • search bar
  • selection box for type of analysis
  • option to select time frame for analysis
  • placeholder for dynamic graphs

Keep corpus up-to-date automatically

Write a script that checks the current status of the corpus documents and only fetches new documents from the EUR-Lex servers.
Preferably save it in its own update.py file, to allow managing update periods with CronJobs.

Fix high memory allocation

When requesting, parsing and inserting every available document into the database (24 languages * 5,555 docs), Python steadily increases its memory allocation up to 3.5 GB shortly before finishing.
The source of this might be some large variables not getting properly overwritten or deleted by the garbage collector.
This problem should be fixed so the code can be run on lower-performance hardware, since CPU usage is already pretty low.

Dialog for specifying corpus creation criteria

Create a dialog (either CLI or inside the webapp) to ask the user which criteria to create the corpus on (for now, only a list of language options, possibly other criteria).

Server REST-API

Implement a REST API that sends analysis data to a requesting client based on parameters.

Manually update blackstone to spaCy 2.2

Currently, Blackstone requires spaCy version 2.1.8, but this version is known for memory leaks. We should either switch to 2.1.9 (which fixed those leaks) or upgrade the Blackstone models to spaCy 2.2 for multi-core support and memory leak fixes.
