LeNER-Br: a Dataset for Named Entity Recognition in Brazilian Legal Text

This repo holds the dataset and source code described in the paper below, which resulted from a collaboration between two units of the University of Brasília: NEXT (Núcleo de P&D para Excelência e Transformação do Setor Público) and CiC (Departamento de Ciência da Computação).

@InProceedings{luz_etal_propor2018,
          author = {Pedro H. {Luz de Araujo} and Te\'{o}filo E. {de Campos} and
                    Renato R. R. {de Oliveira} and Matheus Stauffer and
                    Samuel Couto and Paulo Bermejo},
          title = {{LeNER-Br}: a Dataset for Named Entity Recognition in {Brazilian} Legal Text},
          booktitle = {International Conference on the Computational Processing of Portuguese
                       ({PROPOR})},
          publisher = {Springer},
          series = {Lecture Notes in Computer Science ({LNCS})},
          pages = {313--323},
          year = {2018},
          month = {September 24-26},
          address = {Canela, RS, Brazil},
          doi = {10.1007/978-3-319-99722-3_32},
          url = {https://teodecampos.github.io/LeNER-Br/},
}

We also provide the LSTM-CRF model described in the paper, which achieved average F1 scores of 92.53% (per token) and 86.61% (per entity) on the test set.

The sections below describe the requirements and the dataset and model files.

We kindly request that users cite our paper in any publication that is generated as a result of the use of our source code, our dataset or our pre-trained models.

Note: although this GitHub repository was created in May 2020 to increase the visibility of this project, the dataset and source code have been available from the authors' site since September 2018.

Requirements

  1. Python 3.6
  2. pip

LeNER-Br Dataset

The directory structure is as follows:

  • the train, test and dev folders hold space-separated text files in which the first column contains the words and the second column contains the corresponding named entity tags. Sentences are separated by empty lines. In addition, each folder has a file that concatenates all the other conll files in that folder (train.conll, dev.conll and test.conll).
  • metadata holds json files with additional information about each annotated document.
  • raw_text holds the source txt files from which the conll files were generated.
  • scripts holds an abbreviation list used for sentence segmentation and the script that generated the conll files. To use the script:
python textToConll.py path/to/txtfile
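
As an illustration of the file layout described above (one word and one tag per line, separated by a space, with empty lines between sentences), the following is a minimal sketch of a reader. The function name read_conll and the tag names in the sample are hypothetical and are not part of this repo.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Parse 'word tag' lines into sentences; an empty line ends a sentence."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in lines:
        parts = line.split()
        if not parts:
            # Empty line: close the sentence collected so far, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        # First column is the word, last column is the tag.
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


# Tiny in-memory sample; the tag names are illustrative assumptions.
sample = """Supremo B-ORG
Tribunal I-ORG
decidiu O

Brasília B-LOC
"""
sentences = read_conll(sample.splitlines())
print(len(sentences), sentences[0])
```

To read one of the concatenated files instead, the same function can be given an open file handle, for example open("train/train.conll", encoding="utf-8"), assuming the command is run from the dataset folder.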

Model

The model code is adapted from this repo and implements a NER model using TensorFlow (LSTM + CRF + character embeddings). All modified code files are marked as such at the beginning. The section below summarizes how to use the model. For more in-depth explanations of how to use the model and change its configuration, refer to the README of the original implementation.

Evaluation

  • To install the required Python packages, run from the model folder:
pip install -r requirements.txt
  • To obtain the F1 scores (per token) for each class on each part of the dataset:
python classScores.py train
python classScores.py dev
python classScores.py test
  • To obtain the F1 scores (per entity) for each class on each part of the dataset:
python evaluate.py train
python evaluate.py dev
python evaluate.py test
  • To tag a raw text file:
python evaluateText.py path/to/txtfile
  • To tag sentences interactively:
python evaluate.py

or

python evaluateSentence.py
  • To retrain the model from scratch:
python train.py
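
The difference between per-token and per-entity F1 scores can be illustrated with a small sketch. It assumes IOB2-style tags (B-, I-, O) and micro-averaging over non-O tokens; the scripts in this repo may compute the scores differently, and all function names below are hypothetical.

```python
from typing import List, Set, Tuple


def extract_entities(tags: List[str]) -> Set[Tuple[int, int, str]]:
    """Return (start, end_exclusive, label) spans from IOB2 tags (assumed scheme)."""
    entities = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        inside = tag.startswith("I-") and label == tag[2:]
        if not inside:
            if label is not None:
                entities.add((start, i, label))
            # Only a B- tag opens a new span; a stray I- tag is ignored.
            start, label = (i, tag[2:]) if tag.startswith("B-") else (None, None)
    return entities


def f1(tp: int, n_pred: int, n_gold: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when nothing matches."""
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


def token_f1(gold: List[str], pred: List[str]) -> float:
    """Score each non-O token on its own."""
    tp = sum(1 for g, p in zip(gold, pred) if g == p and g != "O")
    return f1(tp, sum(p != "O" for p in pred), sum(g != "O" for g in gold))


def entity_f1(gold: List[str], pred: List[str]) -> float:
    """Score whole spans: boundaries and label must both match exactly."""
    g, p = extract_entities(gold), extract_entities(pred)
    return f1(len(g & p), len(p), len(g))


gold = ["B-ORG", "I-ORG", "I-ORG", "O", "B-LOC"]
pred = ["B-ORG", "I-ORG", "O", "O", "B-LOC"]
print(round(token_f1(gold, pred), 3), entity_f1(gold, pred))
```

In this example the prediction misses only the last token of the first entity, so most tokens still count as correct, while the whole entity counts as an error at the entity level. This is one reason per-entity scores tend to be lower than per-token scores.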
