
MULTI-CONER NER Baseline.

This code repository provides a baseline approach for Named Entity Recognition (NER). The repository provides the following functionality:

  • CoNLL data readers
  • Usage of any HuggingFace pre-trained transformer models
  • Training and testing through PyTorch Lightning
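As an illustration of the input such readers handle, CoNLL-style data puts one token and its tag per line, with blank lines separating sentences. A minimal reader sketch (hypothetical; this is not the repository's reader.py) could look like:

```python
def read_conll(lines):
    """Parse CoNLL-style lines ("token ... tag") into (tokens, tags) sentences."""
    sentences, tokens, tags = [], [], []
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            # A blank line ends the current sentence; comment lines are skipped.
            if not line and tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        parts = line.split()
        tokens.append(parts[0])   # first column: the token
        tags.append(parts[-1])    # last column: the NER tag
    if tokens:  # flush the final sentence if the file lacks a trailing blank line
        sentences.append((tokens, tags))
    return sentences

sample = ["uruguay _ _ B-LOC", "round _ _ O", "", "skai _ _ B-CORP", "tv _ _ I-CORP"]
print(read_conll(sample))
# [(['uruguay', 'round'], ['B-LOC', 'O']), (['skai', 'tv'], ['B-CORP', 'I-CORP'])]
```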

A more detailed description of how to use this code is provided below.

Running the Code

Arguments:

p.add_argument('--train', type=str, help='Path to the train data.', default=None)
p.add_argument('--test', type=str, help='Path to the test data.', default=None)
p.add_argument('--dev', type=str, help='Path to the dev data.', default=None)

p.add_argument('--out_dir', type=str, help='Output directory.', default='.')
p.add_argument('--iob_tagging', type=str, help='IOB tagging scheme', default='wnut')

p.add_argument('--max_instances', type=int, help='Maximum number of instances', default=-1)
p.add_argument('--max_length', type=int, help='Maximum number of tokens per instance.', default=50)

p.add_argument('--encoder_model', type=str, help='Pretrained encoder model to use', default='xlm-roberta-large')
p.add_argument('--model', type=str, help='Model path.', default=None)
p.add_argument('--model_name', type=str, help='Model name.', default=None)
p.add_argument('--stage', type=str, help='Training stage', default='fit')
p.add_argument('--prefix', type=str, help='Prefix for storing evaluation files.', default='test')

p.add_argument('--batch_size', type=int, help='Batch size.', default=128)
p.add_argument('--gpus', type=int, help='Number of GPUs.', default=1)
p.add_argument('--epochs', type=int, help='Number of epochs for training.', default=5)
p.add_argument('--lr', type=float, help='Learning rate', default=1e-5)
p.add_argument('--dropout', type=float, help='Dropout rate', default=0.1)
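These calls attach to a standard argparse.ArgumentParser. As a quick sanity check of how the defaults and overrides behave (a sketch with a subset of the arguments above, not the repository's entry point):

```python
import argparse

# Rebuild a few of the arguments listed above to show how parsing behaves.
p = argparse.ArgumentParser(description='MultiCoNER NER baseline (sketch)')
p.add_argument('--encoder_model', type=str, default='xlm-roberta-large')
p.add_argument('--batch_size', type=int, default=128)
p.add_argument('--lr', type=float, default=1e-5)

# Override the encoder and learning rate; batch_size keeps its default.
args = p.parse_args(['--encoder_model', 'xlm-roberta-base', '--lr', '0.0001'])
print(args.encoder_model, args.batch_size, args.lr)
# xlm-roberta-base 128 0.0001
```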

Running

Train an XLM-RoBERTa base model:
python -m ner_baseline.train_model --train train.txt --dev dev.txt --out_dir . --model_name xlmr_ner --gpus 1 \
                                   --epochs 2 --encoder_model xlm-roberta-base --batch_size 64 --lr 0.0001
Evaluate the trained model:
python -m ner_baseline.evaluate --test test.txt --out_dir . --gpus 1 --encoder_model xlm-roberta-base \
                                --model MODEL_FILE_PATH --prefix xlmr_ner_results

Predict the tags from a pretrained model:
python -m ner_baseline.predict_tags --test test.txt --out_dir . --gpus 1 --encoder_model xlm-roberta-base \
                                --model MODEL_FILE_PATH --prefix xlmr_ner_results --max_length 500

  • For this functionality we have implemented an efficient approach for predicting the output tags, independent of the tokenizer used.
    • While reading data in CoNLL format, the method parse_tokens_for_ner in reader.py tokenizes each token into its subwords and additionally generates a mask in which only the first subword of each token is marked True.
    • For example, for the token MultiCoNER, if we use XLM-RoBERTa, we get the subwords ['▁Multi', 'Co', 'NER'], which yield the token mask [True, False, False].
    • These token masks are part of the output returned by the provided reader.
    • Finally, when predicting the token tags, the model's predict_tags method picks only the tag of the first subword of each token. This process is efficient and implemented using native Python functionality, e.g. [compress(pred_tags_, mask_) for pred_tags_, mask_ in zip(pred_tags, token_mask)], which is executed over an entire batch.
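The mask-and-compress idea above can be sketched end to end. The subword splits and tags below are made-up illustrations (only the 'MultiCoNER' split is the one quoted above); the mask construction and the itertools.compress filtering mirror the described approach:

```python
from itertools import compress

# Suppose an XLM-RoBERTa-style tokenizer splits each token into subwords
# ('▁' marks the start of a new word):
tokens = ['MultiCoNER', 'is', 'fun']
subword_splits = [['▁Multi', 'Co', 'NER'], ['▁is'], ['▁fun']]

# Build the flat mask: True only for the first subword of each original token.
token_mask = [i == 0 for split in subword_splits for i, _ in enumerate(split)]
# token_mask == [True, False, False, True, True]

# Suppose the model predicted one tag per subword (hypothetical tags):
pred_tags = ['B-PROD', 'I-PROD', 'I-PROD', 'O', 'O']

# Keep only the tag of each token's first subword, as predict_tags does.
token_level = list(compress(pred_tags, token_mask))
print(token_level)  # ['B-PROD', 'O', 'O'] — one tag per original token
```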

Setting up the code environment

$ pip install -r requirements.txt

License

The code under this repository is licensed under the Apache 2.0 License.

multiconer-baseline's People

Contributors

bfetahu, lisaterumi, amazon-auto
