EUROPA: A Legal Multilingual Keyphrase Generation Dataset

This repo contains the evaluation tools for the keyphrase generation task of the EUROPA benchmark. The dataset can be downloaded from the Hugging Face datasets hub: https://huggingface.co/datasets/NCube/europa

Keyphrase evaluation

In most, if not all, keyphrase extraction and generation tasks, evaluation follows the same protocol: candidate keyphrases are stemmed and compared with reference keyphrases through exact matching. This evaluation script is based on https://github.com/memray/OpenNMT-kpg-release, which only covers English. To properly evaluate models on our multilingual corpus, this repo includes resources that address the morphological diversity of all official languages of the European Union.
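The protocol described above can be illustrated with a minimal sketch. This is not the code shipped in ./eval/: the function names are hypothetical, `naive_stem` is a deliberately crude placeholder that stands in for a real language-specific stemmer, and the sketch assumes that both candidate and reference phrases are lowercased and stemmed before comparison.

```python
from typing import Callable, Dict, Iterable, Tuple


def naive_stem(word: str) -> str:
    """Placeholder stemmer (assumption): strips one trailing 's' from longer words.

    A real multilingual evaluation would plug a language-specific stemmer in here.
    """
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def normalize(phrase: str, stem: Callable[[str], str] = naive_stem) -> Tuple[str, ...]:
    """Lowercase a phrase, split it on whitespace, and stem every token."""
    return tuple(stem(token) for token in phrase.lower().split())


def exact_match_scores(
    candidates: Iterable[str],
    references: Iterable[str],
    stem: Callable[[str], str] = naive_stem,
) -> Dict[str, float]:
    """Precision, recall and F1 from exact matches between stemmed phrases."""
    cand = {normalize(p, stem) for p in candidates if p.strip()}
    ref = {normalize(p, stem) for p in references if p.strip()}
    matched = len(cand & ref)
    precision = matched / len(cand) if cand else 0.0
    recall = matched / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up phrases, for illustration only.
scores = exact_match_scores(
    candidates=["State aids", "free movement of goods", "trade mark"],
    references=["state aid", "Trade marks", "competition"],
)
print(scores)
```

Here "State aids" matches "state aid" and "trade mark" matches "Trade marks" once stemmed, so two of the three candidates and two of the three references are matched. Passing a different `stem` callable per language is one way such a sketch could be extended to a multilingual setting.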

Quickstart

Content:

  • ./eval/ contains scripts for multilingual keyphrase evaluation
  • ./output_kps/ contains the candidate keyphrases generated by the models used in our paper presenting the EUROPA dataset. Feel free to use these outputs for comparing performance with other models.

An output keyphrase TSV file must contain the following columns:

  • language: ISO 639-1 language code (e.g. da);
  • celex_id: CELEX identifier inherited from CJEU. Different translated versions of the same judgment share the same celex_id (e.g. 61999CJ0414);
  • language_celex_id: concatenation of the two previous columns. This becomes the unique identifier for each instance (e.g. da61999CJ0414);
  • generated_keyphrases: model output. Phrases should be separated by semicolons: " ; ".
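As a rough illustration of the format above, the following sketch writes and re-reads such a TSV file with Python's standard csv module. The helper names are hypothetical, the keyphrases are placeholders, and the presence of a header row and the column order are assumptions; only the four column names, the identifier example and the " ; " separator come from the description above.

```python
import csv
import io
from typing import Dict, List, TextIO

REQUIRED_COLUMNS = ["language", "celex_id", "language_celex_id", "generated_keyphrases"]
SEPARATOR = " ; "


def make_row(language: str, celex_id: str, keyphrases: List[str]) -> Dict[str, str]:
    """Build one TSV row; language_celex_id concatenates the first two columns."""
    return {
        "language": language,
        "celex_id": celex_id,
        "language_celex_id": f"{language}{celex_id}",
        "generated_keyphrases": SEPARATOR.join(keyphrases),
    }


def write_tsv(rows: List[Dict[str, str]], handle: TextIO) -> None:
    """Write rows as tab-separated values with a header row (assumed layout)."""
    writer = csv.DictWriter(
        handle, fieldnames=REQUIRED_COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)


def read_tsv(handle: TextIO) -> List[dict]:
    """Read rows back, check the required columns, and split the keyphrases."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for row in reader:
        if row["language_celex_id"] != row["language"] + row["celex_id"]:
            raise ValueError(f"inconsistent identifier: {row['language_celex_id']}")
        phrases = row["generated_keyphrases"].split(";")
        row["generated_keyphrases"] = [p.strip() for p in phrases if p.strip()]
        rows.append(row)
    return rows


# Round trip through an in-memory buffer; the keyphrases are placeholders.
buffer = io.StringIO()
write_tsv([make_row("da", "61999CJ0414", ["keyphrase one", "keyphrase two"])], buffer)
buffer.seek(0)
rows = read_tsv(buffer)
print(rows[0]["language_celex_id"], rows[0]["generated_keyphrases"])
```

To write to disk instead of a buffer, the same functions can be given a file opened with `open(path, "w", newline="", encoding="utf-8")`.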

Environment:

If you have conda, you can set up the environment with the following commands:

conda env create -n europa -f requirements.yml
conda activate europa

For running the evaluation, you can use the following commands:

time python main.py -toy 1 -output_filename_prefix allMetricsWithMultilingualStemming -output_kps_filename output_kps_YAKE.tsv -path_to_folder_with_candidate_kps ./output_kps/
time python main.py -toy 1 -output_filename_prefix allMetricsWithMultilingualStemming -output_kps_filename output_kps_mt5-base.tsv -path_to_folder_with_candidate_kps ./output_kps/
time python main.py -toy 1 -output_filename_prefix allMetricsWithMultilingualStemming -output_kps_filename output_kps_mBART50.tsv -path_to_folder_with_candidate_kps ./output_kps/

Set the toy argument to 1 when you run the script for the first time: it will use only 100 instances. Set it to 0 to run the evaluation on all 90.5k instances. Because keyphrase evaluation requires processing three elements (target keyphrases, generated keyphrases and input text), the full evaluation can be time-consuming (e.g. 1h20 on an AMD Ryzen 7 7700X 8-Core Processor).

The output_filename_prefix argument can be any string for conveniently naming output files.
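When several candidate files need to be evaluated, the three commands above can be generated rather than typed by hand. The sketch below only assembles the command lines from the flags shown above; `build_eval_command` is a hypothetical helper, not part of the repo, and it assumes the commands are launched from the directory that contains main.py.

```python
import shlex
from typing import List


def build_eval_command(
    output_kps_filename: str,
    toy: int = 1,
    prefix: str = "allMetricsWithMultilingualStemming",
    folder: str = "./output_kps/",
) -> List[str]:
    """Assemble the argument list for one evaluation run (hypothetical helper)."""
    return [
        "python", "main.py",
        "-toy", str(toy),
        "-output_filename_prefix", prefix,
        "-output_kps_filename", output_kps_filename,
        "-path_to_folder_with_candidate_kps", folder,
    ]


# Candidate files named in the commands above.
MODEL_FILES = ["output_kps_YAKE.tsv", "output_kps_mt5-base.tsv", "output_kps_mBART50.tsv"]

commands = [build_eval_command(name) for name in MODEL_FILES]
for command in commands:
    # Print the command line; nothing is executed here.
    print(shlex.join(command))
```

Each argument list can then be passed to `subprocess.run(command, check=True)` to launch the evaluation; switching to the full run only requires `toy=0`.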

Each command produces two output files in the folder passed as path_to_folder_with_candidate_kps, next to the candidate keyphrase files. For instance, the last command produces:

  • allMetricsWithMultilingualStemming_preprocessStemmingMu_languageScores_output_kps_mBART50.tsv : average scores across languages and scores for each language;
  • allMetricsWithMultilingualStemming_preprocessStemmingMu_indiviualScores_output_kps_mBART50.tsv : scores for each instance.

Citation

If you use our dataset and/or our multilingual evaluation script, please cite:

@article{salaun2024europa,
  title={EUROPA: A Legal Multilingual Keyphrase Generation Dataset},
  author={Sala{\"u}n, Olivier and Piedboeuf, Fr{\'e}d{\'e}ric and Berre, Guillaume Le and Hermelo, David Alfonso and Langlais, Philippe},
  journal={arXiv preprint arXiv:2403.00252},
  year={2024}
}

KPEval

The KPEval evaluation framework used in Section 7 of our paper is not covered in this repo, but can be found at: https://github.com/uclanlp/KPEval
