dwadden / dygiepp

Span-based system for named entity, relation, and event extraction.

License: MIT License


dygiepp's Introduction

DyGIE++

Implements the model described in the paper Entity, Relation, and Event Extraction with Contextualized Span Representations.

See the doc folder for documentation with more details on the data, model implementation and debugging, and model configuration.

Updates

October 2023: Unfortunately, AllenNLP (on which DyGIE++ is built) has been archived and is no longer actively maintained. Due to changes in various software packages and the unavailability of older versions, following the instructions under Dependencies now raises errors when trying to install DyGIE++. I don't have the bandwidth to get things updated, but I'd welcome a PR to update the relevant dependencies and get things working again! See the Dependencies section for more info.

December 2021: A couple of nice additions, thanks to PRs from contributors:

  • There is now a script to convert BRAT-formatted annotations to DyGIE. See here for more details. Thanks to @serenalotreck for this feature.
  • There are Spacy bindings for DyGIE entity and relation extraction; see the section on Spacy bindings. Thanks to @e3oroush for this feature.

April 2021: We've added data and models for the MECHANIC dataset, presented in the NAACL 2021 paper Extracting a Knowledge Base of Mechanisms from COVID-19 Papers.

You can also get the data by running bash scripts/data/get_mechanic.sh, which will put the data in data/mechanic.

After moving the models to the pretrained folder, you can make predictions like this:

allennlp predict \
  pretrained/mechanic-coarse.tar.gz \
  data/mechanic/coarse/test.json \
  --predictor dygie \
  --include-package dygie \
  --use-dataset-reader \
  --output-file predictions/covid-coarse.jsonl \
  --cuda-device 0 \
  --silent

Project status

This branch used to be named allennlp-v1, and it has been made the new master. It's compatible with the new version of AllenNLP, and the model configuration process has been simplified. I'd recommend using this branch for all future work. If for some reason you need the older version of the code, it's on the branch emnlp-2019.

Unfortunately, I don't have the bandwidth at this point to add additional features. But please create a new issue if you have problems with:

  • Reproducing the results reported in the README.
  • Making predictions on a new dataset using pre-trained models.
  • Training your own model on a new dataset.

See below for guidelines on creating an issue.

There are a number of ways this code could be improved, and I'd definitely welcome pull requests. If you're interested, see contributions.md for a list of ideas.

Submit a model!

If you have a DyGIE model that you've trained on a new dataset, feel free to upload it here and I'll add it to the collection of pre-trained models.

Issues

If you're unable to run the code, feel free to create an issue. Please do the following:

  • Confirm that you've set up a Conda environment exactly as in the Dependencies section below. I can only offer support if you're running code within this environment.
  • Specify any commands you used to download pretrained models or to download / preprocess data. Please enclose the code in code blocks, for instance:
    # Download pretrained models.
    
    bash scripts/pretrained/get_dygiepp_pretrained.sh
  • Share the command that you ran to cause the issue, for instance:
    allennlp evaluate \
    pretrained/scierc.tar.gz \
    data/scierc/normalized_data/json/test.json \
    --cuda-device 2 \
    --include-package dygie
    
  • If you're using your own dataset, attach a minimal example of the data which, when given as input, causes the error you're seeing. This could be, for instance, a single line from a .jsonl file.
  • Include the full error message that you're getting.

Dependencies

Update (October 2023): These directions no longer work. Python 3.7 is no longer available from conda, and AllenNLP is no longer actively maintained, causing some dependencies to break. I'd welcome a PR to get things working again.

Clone this repository and navigate to the root of the repo on your system. Then execute:

conda create --name dygiepp python=3.7
conda activate dygiepp
pip install -r requirements.txt
conda develop .   # Adds DyGIE to your PYTHONPATH
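
As a quick sanity check (not part of the official instructions), you can confirm that the environment is set up by importing the package from Python:

# Fails with ModuleNotFoundError if the repo isn't on your PYTHONPATH.
import allennlp
import dygie

print("AllenNLP version:", allennlp.__version__)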

This library relies on AllenNLP and uses AllenNLP shell commands to kick off training, evaluation, and testing.

If you run into an issue installing jsonnet, this issue may prove helpful.

Docker build

A Dockerfile is provided with the PyTorch + CUDA + cuDNN base image for a full-stack GPU install. It will create the conda environments dygiepp (for modeling) and ace-event-preprocess (for ACE05-Event preprocessing).

By default the build downloads datasets and dependencies for all tasks. This takes a long time and produces a large image, so you will want to comment out unneeded datasets/tasks in the Dockerfile.

  • Comment out unneeded task sections in Dockerfile.
  • Build container: docker build --tag dygiepp:dev <dygiepp-repo-dirpath>
  • Run the container interactively, mounting this project dir to /dygiepp/: docker run --gpus all -it --ipc=host -v <dygiepp-repo-dirpath>:/dygiepp/ --name dygiepp dygiepp:dev

NOTE: This Dockerfile was added in a PR from a contributor. I haven't tested it, so it's not "officially supported". More PRs are welcome, though.

Training a model

Warning about coreference resolution: The coreference code will break on sentences with only a single token. If you have these in your dataset, either get rid of them or deactivate the coreference resolution part of the model.
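
If you'd rather filter your data than modify the model, here's a minimal sketch (file paths are hypothetical) that drops documents containing single-token sentences. It assumes the standard DyGIE .jsonl format, where each line is a JSON object whose "sentences" field is a list of token lists:

import json

# Keep only documents in which every sentence has more than one token.
# Dropping individual sentences instead would break the alignment of the
# other fields (ner, relations, clusters), so we filter whole documents.
with open("data/my_dataset/train.jsonl") as f_in, \
        open("data/my_dataset/train_filtered.jsonl", "w") as f_out:
    for line in f_in:
        doc = json.loads(line)
        if all(len(sent) > 1 for sent in doc["sentences"]):
            f_out.write(line)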

We rely on the AllenNLP train command to handle model training. The train command takes a configuration file as an argument, initializes a model based on the configuration, and serializes the trained model. More details on the configuration process for DyGIE can be found in doc/config.md.

To train a model, enter bash scripts/train.sh [config_name] at the command line, where the config_name is the name of a file in the training_config directory. For instance, to train a model using the scierc.jsonnet config, you'd enter

bash scripts/train.sh scierc

The resulting model will go in models/scierc. For more information on how to modify training configs (e.g. to change the GPU used for training), see config.md.

Information on preparing specific training datasets is below. For more information on how to create training batches that utilize GPU resources efficiently, and on hyperparameter search with Optuna, see model.md.

SciERC

To train a model for named entity recognition, relation extraction, and coreference resolution on the SciERC dataset:

  • Download the data. From the top-level folder for this repo, enter bash ./scripts/data/get_scierc.sh. This will download the SciERC dataset into the folder ./data/scierc.
  • Train the model. Enter bash scripts/train.sh scierc.
  • To train a "lightweight" version of the model that doesn't do coreference propagation and uses a context width of 1, do bash scripts/train.sh scierc_lightweight instead. More info on why you'd want to do this in the section on making predictions.

GENIA

The steps are similar to SciERC.

  • Download the data. From the top-level folder for this repo, enter bash ./scripts/data/get_genia.sh.
  • Train the model. Enter bash scripts/train.sh genia.
  • As with SciERC, we also offer a "lightweight" version with a context width of 1 and no coreference propagation.

ChemProt

The ChemProt corpus contains entity and relation annotations for drug / protein interactions. The ChemProt preprocessing requires a separate environment:

conda deactivate
conda create --name chemprot-preprocess python=3.7
conda activate chemprot-preprocess
pip install -r scripts/data/chemprot/requirements.txt

Then, follow these steps:

  • Get the data.
    • Run bash ./scripts/data/get_chemprot.sh. This will download the data and process it into the DyGIE input format.
      • NOTE: This is a quick-and-dirty script that skips entities whose character offsets don't align exactly with the tokenization produced by SciSpacy. We lose about 10% of the named entities and 20% of the relations in the dataset as a result.
    • Switch back to your DyGIE environment.
    • Collate the data:
      mkdir -p data/chemprot/collated_data
      
      python scripts/data/shared/collate.py \
        data/chemprot/processed_data \
        data/chemprot/collated_data \
        --train_name=training \
        --dev_name=development
      
    • For a quick spot-check to see how much of the data was lost, you can run:
    python scripts/data/chemprot/03_spot_check.py
    
  • Train the model. Enter bash scripts/train.sh chemprot.

ACE05 (ACE for entities and relations)

Creating the dataset

For more information on ACE relation and event preprocessing, see doc/data.md and this issue.

We use preprocessing code adapted from the DyGIE repo, which is in turn adapted from the LSTM-ER repo. The following software is required:

  • Java, to run CoreNLP.
  • Perl.
  • zsh. If this isn't available on your system, you can create a conda environment and install zsh.

First, we need to download Stanford CoreNLP:

bash scripts/data/ace05/get_corenlp.sh

Then, run the driver script to preprocess the data:

bash scripts/data/get_ace05.sh [path-to-ACE-data]

The results will go in ./data/ace05/collated-data. The intermediate files will go in ./data/ace05/raw-data.

Training a model

Enter bash scripts/train.sh ace05_relation. A model trained this way will not reproduce the numbers in the paper. We're in the process of debugging and will update.

ACE05 Event

Creating the dataset

The preprocessing code I wrote breaks with the newest version of Spacy, so unfortunately we need to create a separate environment that uses an older version of Spacy and use it for preprocessing.

conda deactivate
conda create --name ace-event-preprocess python=3.7
conda activate ace-event-preprocess
pip install -r scripts/data/ace-event/requirements.txt
python -m spacy download en_core_web_sm

Then, collect the relevant files from the ACE data distribution with

bash ./scripts/data/ace-event/collect_ace_event.sh [path-to-ACE-data].

The results will go in ./data/ace-event/raw-data.

Now, run the script

python ./scripts/data/ace-event/parse_ace_event.py [output-name] [optional-flags]

You can see the available flags by calling parse_ace_event.py -h. For detailed descriptions, see data.md. The results will go in ./data/ace-event/processed-data/[output-name]. We require an output name because you may want to preprocess the ACE data multiple times using different flags. For default preprocessing settings, you could do:

python ./scripts/data/ace-event/parse_ace_event.py default-settings

Now conda deactivate the ace-event-preprocess environment and re-activate your modeling environment.

Finally, collate the version of the dataset you just created. For instance, continuing the example above,

mkdir -p data/ace-event/collated-data/default-settings/json

python scripts/data/shared/collate.py \
  data/ace-event/processed-data/default-settings/json \
  data/ace-event/collated-data/default-settings/json \
  --file_extension json

Training the model

To train on the data preprocessed with default settings, enter bash scripts/train.sh ace05_event. A model trained in this fashion will reproduce (within 0.1 F1 or so) the results in Table 4 of the paper. To train on a different version, modify training_config/ace05_event.jsonnet to point to the appropriate files.

Reproducing the results in Table 1 requires training an ensemble model of 4 trigger detectors. The basic process is as follows:

  • Merge the ACE event train + dev data, then create 4 new train / dev splits.
  • Train a separate trigger detection model on each split. To do this, modify training_config/ace05_event.jsonnet by setting
    model +: {
      modules +: {
        events +: {
          loss_weights: {
            trigger: 1.0,
            arguments: 0.5
          }
        }
      }
    }
  • Make trigger predictions using a majority vote of the 4 ensemble models; see the sketch after this list.
  • Use these predicted triggers when making event argument predictions based on the event argument scores output by the model saved at models/ace05_event.
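
As a rough illustration of the voting step, here's a minimal sketch. It assumes you've already converted each model's output into per-token trigger labels (one label sequence per sentence); the conversion from DyGIE prediction files is not shown:

from collections import Counter

def majority_vote(per_model_labels):
    """per_model_labels: one trigger-label sequence per ensemble model,
    all for the same sentence. Returns the voted label sequence."""
    voted = []
    for token_labels in zip(*per_model_labels):
        label, count = Counter(token_labels).most_common(1)[0]
        # Require a strict majority; otherwise predict no trigger ("").
        voted.append(label if count > len(per_model_labels) // 2 else "")
    return voted

# Example with 4 models and a 3-token sentence.
print(majority_vote([["", "Attack", ""], ["", "Attack", ""],
                     ["", "Attack", ""], ["", "", ""]]))  # ['', 'Attack', '']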

If you need more details, email me.

MECHANIC

You can get the dataset by running bash scripts/data/get_mechanic.sh. For detailed training instructions, see the DyGIE-COFIE repo.

Evaluating a model

To check the performance of one of your models or a pretrained model, you can use the allennlp evaluate command.

Note that allennlp commands will only be able to discover the code in this package if:

  • You run the commands from the root folder of this project, dygiepp, or:
  • You add the code to your Python path by running conda develop . from the root folder of this project.

Otherwise, you will get an error ModuleNotFoundError: No module named 'dygie'.

In general, you can evaluate a model like this:

allennlp evaluate \
  [model-file] \
  [data-path] \
  --cuda-device [cuda-device] \
  --include-package dygie \
  --output-file [output-file] # Optional; if not given, prints metrics to console.

For example, to evaluate the pretrained SciERC model, you could do

allennlp evaluate \
  pretrained/scierc.tar.gz \
  data/scierc/normalized_data/json/test.json \
  --cuda-device 2 \
  --include-package dygie

To evaluate a model you trained on the SciERC data, you could do

allennlp evaluate \
  models/scierc/model.tar.gz \
  data/scierc/normalized_data/json/test.json \
  --cuda-device 2  \
  --include-package dygie \
  --output-file models/scierc/metrics_test.json

Pretrained models

A number of models are available for download. They are named for the dataset they are trained on. "Lightweight" models were trained on datasets for which coreference resolution annotations were available, but they don't use them. They are "lightweight" because coreference resolution is expensive: it requires predicting cross-sentence relationships between spans.

If you want to use one of these pretrained models to make predictions on a new dataset, you need to set the dataset field for the instances in your new dataset to match the name of the dataset the model was trained on. For example, to make predictions using the pretrained SciERC model, set the dataset field in your new instances to scierc. For more information on the dataset field, see data.md.
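
For instance, here's a minimal sketch (file names are hypothetical) that tags every document in a new .jsonl file so the pretrained SciERC model will accept it:

import json

with open("my_docs.jsonl") as f_in, open("my_docs_scierc.jsonl", "w") as f_out:
    for line in f_in:
        doc = json.loads(line)
        # Must match the dataset the pretrained model was trained on.
        doc["dataset"] = "scierc"
        f_out.write(json.dumps(doc) + "\n")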

To download all available models, run bash scripts/pretrained/get_dygiepp_pretrained.sh. Or, click on the links below to download only a single model.

Available models

Below are links to the available models, followed by the name of the dataset the model was trained on.

Spacy bindings

DyGIE can now be called from Spacy! For example usage, see the demo notebook. This feature was added by a contributor; please tag @e3oroush on related issues.

Performance of pretrained models

  • SciERC

    "_scierc__ner_f1": 0.6846741045214326,
    "_scierc__relation_f1": 0.46236559139784944
    
  • SciERC lightweight

    "_scierc__ner_f1": 0.6717245404143566,
    "_scierc__relation_f1": 0.4670588235294118
    
  • GENIA

    "_genia__ner_f1": 0.7713070807912737
    
  • GENIA lightweight

    "_genia__ner_f1": 0.7690401296349251
    
  • ChemProt

    "_chemprot__ner_f1": 0.9059113300492612,
    "_chemprot__relation_f1": 0.5404867256637169
    

    Note that we're doing span-level evaluation using predicted entities. We're also evaluating on all ChemProt relation classes, while the official task only evaluates on a subset (see Liu et al. for details). Thus, our relation extraction performance is lower than, for instance, Verga et al., where they use gold entities as inputs for relation prediction.

  • ACE05-Relation

    "_ace05__ner_f1": 0.8634611855386309,
    "_ace05__relation_f1": 0.6484907497565725,
    
  • ACE05-Event

    "_ace-event__ner_f1": 0.8927209418006965,
    "_ace-event_trig_class_f1": 0.6998813760379595,
    "_ace-event_arg_class_f1": 0.5,
    "_ace-event__relation_f1": 0.5514950166112956
    

Making predictions on existing datasets

To make a prediction, you can use allennlp predict. For example, to make a prediction with the pretrained scierc model, you can do:

allennlp predict pretrained/scierc.tar.gz \
    data/scierc/normalized_data/json/test.json \
    --predictor dygie \
    --include-package dygie \
    --use-dataset-reader \
    --output-file predictions/scierc-test.jsonl \
    --cuda-device 0 \
    --silent

The predictions include the predicted labels, as well as logits and softmax scores. For more information, see docs/data.md.
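
As a rough sketch of consuming the output (the exact field layout is documented in docs/data.md; the entry format below is an assumption), you could pull out the predicted entities like this:

import json

# Assumes each document's "predicted_ner" holds one list per sentence, with
# entries like [start_tok, end_tok, label, raw_score, softmax_score].
with open("predictions/scierc-test.jsonl") as f:
    for line in f:
        doc = json.loads(line)
        for sentence in doc.get("predicted_ner", []):
            for start, end, label, logit, softmax in sentence:
                print(doc["doc_key"], start, end, label, round(softmax, 3))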

Caveat: Models trained to predict coreference clusters need to make predictions on a whole document at once. This can cause memory issues. To get around this there are two options:

  • Make predictions using a model that doesn't do coreference propagation. These models predict a sentence at a time, and shouldn't run into memory issues. Use the "lightweight" models to avoid this. To train your own coref-free model, set coref loss weight to 0 in the relevant training config.
  • Split documents up into smaller chunks (5 sentences should be safe), make predictions using a model with coref prop, and stitch things back together; see the sketch after this list.
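
Here's a minimal sketch of the chunking step for unlabeled prediction input (stitching the predictions back together is not shown, and the input is assumed to contain only "doc_key", "dataset", and "sentences" fields):

import json

def chunk_doc(doc, size=5):
    """Split one unlabeled DyGIE document into consecutive 5-sentence chunks."""
    sents = doc["sentences"]
    for i in range(0, len(sents), size):
        yield {"doc_key": f"{doc['doc_key']}_chunk{i // size}",
               "dataset": doc["dataset"],
               "sentences": sents[i:i + size]}

with open("input.jsonl") as f_in, open("input_chunked.jsonl", "w") as f_out:
    for line in f_in:
        for chunk in chunk_doc(json.loads(line)):
            f_out.write(json.dumps(chunk) + "\n")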

See the docs for more prediction options.

Relation extraction evaluation metric

Following Li and Ji (2014), we consider a predicted relation to be correct if "its relation type is correct, and the head offsets of two entity mention arguments are both correct".

In particular, we do not require the types of the entity mention arguments to be correct, as is done in some work (e.g. Zhang et al. (2017)). We welcome a pull request that implements this alternative evaluation metric. Please open an issue if you're interested in this.
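
As a concrete illustration, here's a minimal sketch of this scoring rule, with relations represented as (arg1_span, arg2_span, label) triples and spans given as token offsets (this representation is an assumption for the example, not the repo's actual code):

def relation_prf(predicted, gold):
    """predicted / gold: sets of (arg1_span, arg2_span, label) triples.
    Entity types of the arguments are deliberately ignored."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1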

Working with new datasets

Follow the instructions as described in Formatting a new dataset.

Making predictions on a new dataset

To make predictions on a new, unlabeled dataset:

  1. Download the pretrained model that most closely matches your text domain.
  2. Make sure that the dataset field for your new dataset matches the label namespaces for the pretrained model. See here for more on label namespaces. To view the available label namespaces for a pretrained model, use print_label_namespaces.py.
  3. Make predictions the same way as with the existing datasets:
allennlp predict pretrained/[name-of-pretrained-model].tar.gz \
    [input-path] \
    --predictor dygie \
    --include-package dygie \
    --use-dataset-reader \
    --output-file [output-path] \
    --cuda-device [cuda-device]

A couple tricks to make things run smoothly:

  1. If you're predicting on a big dataset, you probably want to load it lazily rather than loading the whole thing in before predicting. To accomplish this, add the following flag to the above command:
--overrides "{'dataset_reader' +: {'lazy': true}}"
  2. If the model runs out of GPU memory on a given prediction, it will warn you and continue with the next example rather than stopping entirely. This is less annoying than the alternative. Examples for which predictions failed will still be written to the specified jsonl output, but they will have an additional field {"_FAILED_PREDICTION": true} indicating that the model ran out of memory on this example.
  3. The dataset field in the dataset to be predicted must match one of the datasets on which the model was trained; otherwise, the model won't know which labels to apply to the predicted data.

Training a model on a new (labeled) dataset

Follow the process described in Training a model, but adjusting the input and output file paths as appropriate.

Contact

For questions or problems with the code, create a GitHub issue (preferred) or email [email protected].

dygiepp's Issues

How to make inferences from the pretrained model

I have not yet dived deep into the model code. I ran allennlp predict as mentioned in the docs, and it printed the inferences in the terminal. Is there a way to predict the relations, events, etc. from the shell?
A demo in a Jupyter Notebook would be very useful.

How to use my dataset

I want to use my own dataset; my data includes *.ann and *.txt files. How can I convert the data into the project's input format? Are there tools for this? I hope you can give me some advice.

How to use a locally downloaded "roberta-base" model?

When I use ace05-relation.tar.gz to predict on my own dataset, the following error message occurs:
Downloading: 4%|▍ | 20.7M/501M [30:35<108:46:45, 1.23kB/s]
[...]
Downloading: 4%|▍ | 20.8M/501M [34:59<106:28:06, 1.25kB/s]("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
2020-12-26 21:32:33,238 - INFO - allennlp.models.archival - removing temporary unarchived model dir at /tmp/tmporiove4f
Traceback (most recent call last):
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/urllib3/response.py", line 438, in _error_catcher
yield
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/urllib3/response.py", line 519, in read
data = self._fp.read(amt) if not fp_closed else b""
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/http/client.py", line 461, in read
n = self.readinto(b)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/http/client.py", line 505, in readinto
n = self.fp.readinto(b)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/socket.py", line 589, in readinto
return self._sock.recv_into(b)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/ssl.py", line 1071, in recv_into
return self.read(nbytes, buffer)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/ssl.py", line 929, in read
return self._sslobj.read(len, buffer)
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/requests/models.py", line 753, in generate
for chunk in self.raw.stream(chunk_size, decode_content=True):
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/urllib3/response.py", line 576, in stream
data = self.read(amt=amt, decode_content=decode_content)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/urllib3/response.py", line 541, in read
raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/contextlib.py", line 130, in exit
self.gen.throw(type, value, traceback)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/urllib3/response.py", line 455, in _error_catcher
raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/transformers/modeling_utils.py", line 926, in from_pretrained
local_files_only=local_files_only,
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/transformers/file_utils.py", line 1007, in cached_path
local_files_only=local_files_only,
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/transformers/file_utils.py", line 1216, in get_from_cache
http_get(url_to_download, temp_file, proxies=proxies, resume_size=resume_size, user_agent=user_agent)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/transformers/file_utils.py", line 1088, in http_get
for chunk in r.iter_content(chunk_size=1024):
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/requests/models.py", line 756, in generate
raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/wangpancheng/anaconda3/envs/dygie_ace/bin/allennlp", line 8, in
sys.exit(run())
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/main.py", line 34, in run
main(prog="allennlp")
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/commands/init.py", line 118, in main
args.func(args)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/commands/predict.py", line 205, in _predict
predictor = _get_predictor(args)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/commands/predict.py", line 110, in _get_predictor
overrides=args.overrides,
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/models/archival.py", line 208, in load_archive
model = _load_model(config.duplicate(), weights_path, serialization_dir, cuda_device)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/models/archival.py", line 246, in _load_model
cuda_device=cuda_device,
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/models/model.py", line 406, in load
return model_class._load(config, serialization_dir, weights_file, cuda_device)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/models/model.py", line 305, in _load
vocab=vocab, params=model_params, serialization_dir=serialization_dir
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/common/from_params.py", line 601, in from_params
**extras,
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/common/from_params.py", line 629, in from_params
kwargs = create_kwargs(constructor_to_inspect, cls, params, **extras)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/common/from_params.py", line 200, in create_kwargs
cls.__name__, param_name, annotation, param.default, params, **extras
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/common/from_params.py", line 307, in pop_and_construct_arg
return construct_arg(class_name, name, popped_params, annotation, default, **extras)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/common/from_params.py", line 341, in construct_arg
return annotation.from_params(params=popped_params, **subextras)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/common/from_params.py", line 601, in from_params
**extras,
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/common/from_params.py", line 629, in from_params
kwargs = create_kwargs(constructor_to_inspect, cls, params, **extras)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/common/from_params.py", line 200, in create_kwargs
cls.__name__, param_name, annotation, param.default, params, **extras
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/common/from_params.py", line 307, in pop_and_construct_arg
return construct_arg(class_name, name, popped_params, annotation, default, **extras)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/common/from_params.py", line 388, in construct_arg
**extras,
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/common/from_params.py", line 341, in construct_arg
return annotation.from_params(params=popped_params, **subextras)
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/common/from_params.py", line 601, in from_params
**extras,
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/common/from_params.py", line 631, in from_params
return constructor_to_call(**kwargs) # type: ignore
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/modules/token_embedders/pretrained_transformer_mismatched_embedder.py", line 65, in init
transformer_kwargs=transformer_kwargs,
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/modules/token_embedders/pretrained_transformer_embedder.py", line 79, in init
**(transformer_kwargs or {}),
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/allennlp/common/cached_transformers.py", line 86, in get
**kwargs,
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/transformers/models/auto/modeling_auto.py", line 656, in from_pretrained
pretrained_model_name_or_path, *model_args, config=config, **kwargs
File "/home/wangpancheng/anaconda3/envs/dygie_ace/lib/python3.7/site-packages/transformers/modeling_utils.py", line 935, in from_pretrained
raise EnvironmentError(msg)
OSError: Can't load weights for 'roberta-base'. Make sure that:

  • 'roberta-base' is a correct model identifier listed on 'https://huggingface.co/models'

  • or 'roberta-base' is the correct path to a directory containing a file named one of pytorch_model.bin, tf_model.h5, model.ckpt.

It seems the model tries to download the pretrained roberta-base model online, but because the Internet download speed is slow, the model cannot be downloaded successfully.
So I wonder: how can I use a locally downloaded roberta-base model? @dwadden Thanks.

License?

Thank you for making the code available! It would be very useful to us, but could you maybe add a license?

ValueError: The following unexpected fields should be prefixed with an underscore: sentence_start.

Hi,

this is the first time I use dygiepp, so I am quite inexperienced.

I used the processed ace-event dataset. When starting the training, I get the error

ValueError: The following unexpected fields should be prefixed with an underscore: sentence_start.

One of the fields in the .json files used is "sentence_start".

In document.py I find

    "Make sure we only have allowed fields."
    allowed_field_regex = ("doc_key|dataset|sentences|weight|.*ner$|"
                           ".*relations$|.*clusters$|.*events$|^_.*")

Prefixing sentence_start with an underscore yields another error message.

Thanks for helping me out

Questions about dataset preprocessing

In the documentation, there are two dataset preprocessing steps: one for entities and relations, and a second for events. In the first, Stanford CoreNLP is used, but in the second, Spacy is used. Can you please explain the difference? I see the relation labels differ between these preprocessing steps, such as "ORG-AFF.Membership" versus "GEN-AFF", and their offset values are different too. There are other differences as well. It would be helpful if you could provide some details.

Since ACE05 is a benchmark dataset, I assume the token/entity/relation/event annotations are already there. Then why do you need the CoreNLP or Spacy libraries?

Can not reproduce SciERC results reported in paper

Hello,

Thanks for great work and paper!

I'm trying to reproduce SciERC results, but I'm getting only:

"best_validation__rel_f1": 0.33248730964467005,
"best_validation_ner_f1": 0.6711409395972655,

The paper reports a 48.4 F1 score for relation extraction on SciERC.

My questions are:

  1. Are there any additional steps to take, besides downloading the SciERC dataset, to train the model and reproduce the F1 scores?
  2. I have noticed that the validation loss is (almost constantly) increasing every epoch (train loss is dropping). Is this expected behavior?
  3. What was the reason the trainer stopped training early with the message "INFO - allennlp.training.trainer - Ran out of patience. Stopping training."?

Below I'm placing the full step-by-step installation, log tail, system info, and pip list:

Installation:

conda create --name dygie python=3.7
conda activate dygie
conda install pytorch==1.2.0 torchvision==0.4.0 cudatoolkit=10.0 -c pytorch
pip install allennlp
pip install botocore
git clone https://github.com/dwadden/dygiepp.git
cd dygiepp/
pip install -r requirements.txt
bash ./scripts/data/get_scierc.sh 
bash ./scripts/train/train_scierc.sh 0

Log tail:

[...]
2019-11-13 16:08:07,639 - WARNING - root - NaN or Inf found in input tensor.                                                                                                                                                                                                       
2019-11-13 16:08:07,640 - WARNING - root - NaN or Inf found in input tensor.                                                                                                                                                                                                       
ner_precision: 0.9754, ner_recall: 0.9725, ner_f1: 0.9740, loss: 4.5208 ||: 100%|#########9| 499/500 [01:05<00:00,  8.05it/s]2019-11-13 16:08:20,732 - WARNING - root - NaN or Inf found in input tensor.                                                                          
2019-11-13 16:08:20,732 - WARNING - root - NaN or Inf found in input tensor.                                                                                                                                                                                                       
2019-11-13 16:08:20,734 - WARNING - root - NaN or Inf found in input tensor.                                                                                                                                                                                                       
2019-11-13 16:08:20,734 - WARNING - root - NaN or Inf found in input tensor.                                                                                                                                                                                                       
2019-11-13 16:08:20,740 - WARNING - root - NaN or Inf found in input tensor.                                                                                                                                                                                                       
2019-11-13 16:08:20,740 - WARNING - root - NaN or Inf found in input tensor.                                                                                                                                                                                                       
2019-11-13 16:08:20,742 - WARNING - root - NaN or Inf found in input tensor.                                                                                                                                                                                                       
2019-11-13 16:08:20,742 - WARNING - root - NaN or Inf found in input tensor.                                                                                                                                                                                                       
ner_precision: 0.9755, ner_recall: 0.9726, ner_f1: 0.9740, loss: 4.5118 ||: 100%|##########| 500/500 [01:05<00:00,  7.62it/s]                                                                                                                                                      
2019-11-13 16:08:20,746 - INFO - allennlp.training.trainer - Validating                                                                                                                                                                                                            
ner_precision: 0.6449, ner_recall: 0.6873, ner_f1: 0.6655, loss: 232.1718 ||: 100%|##########| 50/50 [00:02<00:00, 21.36it/s]                                                                                                                                                      
2019-11-13 16:08:23,090 - INFO - allennlp.training.trainer - Ran out of patience.  Stopping training.                                                                                                                                                                              
2019-11-13 16:08:23,090 - INFO - allennlp.training.checkpointer - loading best weights                                                                                                                                                                                             
2019-11-13 16:08:23,219 - INFO - allennlp.commands.train - To evaluate on the test set after training, pass the 'evaluate_on_test' flag, or use the 'allennlp evaluate' command.                                                                                                   
2019-11-13 16:08:23,219 - INFO - allennlp.models.archival - archiving weights and vocabulary to ./models/scierc/model.tar.gz                                                                                                                                                       
2019-11-13 16:08:40,237 - INFO - allennlp.common.util - Metrics: {
  "best_epoch": 23,                                                                                                                                                                                                                                                                
  "peak_cpu_memory_MB": 3783.14,                                                                                                                                                                                                                                                   
  "peak_gpu_0_memory_MB": 8654,                                                                                                                                                                                                                                                    
  "training_duration": "0:43:55.006665",                                                                                                                                                                                                                                           
  "training_start_epoch": 0,                                                                                                                                                                                                                                                       
  "training_epochs": 37,                                                                                                                                                                                                                                                           
  "epoch": 37,                                                                                                                                                                                                                                                                     
  "training__coref_precision": 0.8032079153669152,                                                                                                                                                                                                                                 
  "training__coref_recall": 0.52068901932887,                                                                                                                                                                                                                                      
  "training__coref_f1": 0.6030049906804981,                                                                                                                                                                                                                                        
  "training__coref_mention_recall": 0.9909547738693467,                                                                                                                                                                                                                            
  "training_ner_precision": 0.9815991970558715,                                                                                                                                                                                                                                    
  "training_ner_recall": 0.9786524349566378,                                                                                                                                                                                                                                       
  "training_ner_f1": 0.9801236011357441,
  "training__rel_precision": 0.8399269628727937,
  "training__rel_recall": 0.8051341890315052,
  "training__rel_f1": 0.8221626452189454,
  "training__rel_span_recall": 0.837222870478413,
  "training__trig_id_precision": 0,
  "training__trig_id_recall": 0,
  "training__trig_id_f1": 0,
  "training__trig_class_precision": 0,
  "training__trig_class_recall": 0,
  "training__trig_class_f1": 0,
  "training__arg_id_precision": 0,
  "training__arg_id_recall": 0,
  "training__arg_id_f1": 0,
  "training__arg_class_precision": 0,
  "training__arg_class_recall": 0,
  "training__arg_class_f1": 0,
  "training__args_multiple": 0,
  "training_loss": 4.156547112312106,
  "training_cpu_memory_MB": 3783.14,
  "training_gpu_0_memory_MB": 8654,
  "validation__coref_precision": 0.5785864145810381,
  "validation__coref_recall": 0.40000249926139114,
  "validation__coref_f1": 0.47185963848741963,
  "validation__coref_mention_recall": 0.9338235294117647,
  "validation_ner_precision": 0.6489988221436984,
  "validation_ner_recall": 0.6836228287841191,
  "validation_ner_f1": 0.6658610271902824,
  "validation__rel_precision": 0.3944954128440367,
  "validation__rel_recall": 0.378021978021978,
  "validation__rel_f1": 0.3860830527497194,
  "validation__rel_span_recall": 0.4175824175824176,
  "validation__trig_id_precision": 0,
  "validation__trig_id_recall": 0,
  "validation__trig_id_f1": 0,
  "validation__trig_class_precision": 0,
  "validation__trig_class_recall": 0,
  "validation__trig_class_f1": 0,
  "validation__arg_id_precision": 0,
  "validation__arg_id_recall": 0,
  "validation__arg_id_f1": 0,
  "validation__arg_class_precision": 0,
  "validation__arg_class_recall": 0,
  "validation__arg_class_f1": 0,
  "validation__args_multiple": 0,
  "validation_loss": 223.13119384765625,
  "best_validation__coref_precision": 0.5206815500675314,
  "best_validation__coref_recall": 0.39612618358360496,
  "best_validation__coref_f1": 0.44944354440112394,
  "best_validation__coref_mention_recall": 0.9375,
  "best_validation_ner_precision": 0.6602641056422568,
  "best_validation_ner_recall": 0.6823821339950371,
  "best_validation_ner_f1": 0.6711409395972655,
  "best_validation__rel_precision": 0.3933933933933934,
  "best_validation__rel_recall": 0.2879120879120879,
  "best_validation__rel_f1": 0.33248730964467005,
  "best_validation__rel_span_recall": 0.3208791208791209,
  "best_validation__trig_id_precision": 0,
  "best_validation__trig_id_recall": 0,
  "best_validation__trig_id_f1": 0,
  "best_validation__trig_class_precision": 0,
  "best_validation__trig_class_recall": 0,
  "best_validation__trig_class_f1": 0,
  "best_validation__arg_id_precision": 0,
  "best_validation__arg_id_recall": 0,
  "best_validation__arg_id_f1": 0,
  "best_validation__arg_class_precision": 0,
  "best_validation__arg_class_recall": 0,
  "best_validation__arg_class_f1": 0,
  "best_validation__args_multiple": 0,
  "best_validation_loss": 142.15479759216308
}

System:

Ubuntu 18.04.3 LTS
GPU 0: GeForce RTX 2080 Ti
Driver Version: 418.88
CUDA Version: 10.1

pip list:

$ pip list                                                                                                                                                                                                                                                                   
Package                       Version                                                                                                                                                                                                                                              
----------------------------- -------------------                                                                                                                                                                                                                                  
alabaster                     0.7.12                                                                                                                                                                                                                                               
allennlp                      0.9.0                                                                                                                                                                                                                                                
atomicwrites                  1.3.0                                                                                                                                                                                                                                                
attrs                         19.3.0                                                                                                                                                                                                                                               
Babel                         2.7.0                                                                                                                                                                                                                                                
beautifulsoup4                4.8.1                                                                                                                                                                                                                                                
blis                          0.2.4                                                                                                                                                                                                                                                
boto3                         1.10.16                                                                                                                                                                                                                                              
botocore                      1.13.16                                                                                                                                                                                                                                              
certifi                       2019.9.11                                                                                                                                                                                                                                            
cffi                          1.13.1                                                                                                                                                                                                                                               
chardet                       3.0.4                                                                                                                                                                                                                                                
Click                         7.0                                                                                                                                                                                                                                                  
conllu                        1.3.1                                                                                                                                                                                                                                                
cycler                        0.10.0                                                                                                                                                                                                                                               
cymem                         2.0.2                                                                                                                                                                                                                                                
docutils                      0.15.2                                                                                                                                                                                                                                               
editdistance                  0.5.3                                                                                                                                                                                                                                                
flaky                         3.6.1                                                                                                                                                                                                                                                
Flask                         1.1.1                                                                                                                                                                                                                                                
Flask-Cors                    3.0.8                                                                                                                                                                                                                                                
ftfy                          5.6                                                                                                                                                                                                                                                  
gevent                        1.4.0                                                                                                                                                                                                                                                
greenlet                      0.4.15                                                                                                                                                                                                                                               
h5py                          2.10.0                                                                                                                                                                                                                                               
idna                          2.8                                                                                                                                                                                                                                                  
imagesize                     1.1.0                                                                                                                                                                                                                                                
importlib-metadata            0.23                                                                                                                                                                                                                                                 
itsdangerous                  1.1.0                                                                                                                                                                                                                                                
Jinja2                        2.10.3
jmespath                      0.9.4
joblib                        0.14.0
jsonnet                       0.14.0
jsonpickle                    1.2
kiwisolver                    1.1.0
lxml                          4.4.1
MarkupSafe                    1.1.1
matplotlib                    3.1.1
mkl-fft                       1.0.15
mkl-random                    1.1.0
mkl-service                   2.3.0
more-itertools                7.2.0
murmurhash                    1.0.2
nltk                          3.4.5
numpy                         1.17.3
numpydoc                      0.9.1
olefile                       0.46
overrides                     2.5
packaging                     19.2
pandas                        0.25.3
parsimonious                  0.8.1
Pillow                        6.2.1
pip                           19.3.1
plac                          0.9.6
pluggy                        0.13.0
preshed                       2.0.1
protobuf                      3.10.0
py                            1.8.0
pycparser                     2.19
Pygments                      2.4.2
pyparsing                     2.4.5
pytest                        5.2.2
python-dateutil               2.8.0
python-Levenshtein            0.12.0
pytorch-pretrained-bert       0.6.2
pytorch-transformers          1.1.0
pytz                          2019.3
regex                         2019.11.1
requests                      2.22.0
responses                     0.10.6
s3transfer                    0.2.1
scikit-learn                  0.21.3
scipy                         1.3.2
sentencepiece                 0.1.83
setuptools                    41.6.0.post20191030
six                           1.13.0
snowballstemmer               2.0.0
soupsieve                     1.9.5
spacy                         2.1.9
Sphinx                        2.2.1
sphinxcontrib-applehelp       1.0.1
sphinxcontrib-devhelp         1.0.1
sphinxcontrib-htmlhelp        1.0.2
sphinxcontrib-jsmath          1.0.1
sphinxcontrib-qthelp          1.0.2
sphinxcontrib-serializinghtml 1.1.3
sqlparse                      0.3.0
srsly                         0.2.0
tensorboardX                  1.9
thinc                         7.0.8
torch                         1.2.0
torchvision                   0.4.0a0+6b959ee
tqdm                          4.38.0
Unidecode                     1.1.1
urllib3                       1.25.7
wasabi                        0.4.0
wcwidth                       0.1.7
Werkzeug                      0.16.0
wheel                         0.33.6
word2number                   1.1
zipp                          0.6.0

Model tests are failing because of missing `span_emb_dim` key

All model tests are failing with the message: allennlp.common.checks.ConfigurationError: 'key "span_emb_dim" is required at location "coref."'

(dygie) ~/ml/dygiepp/dygie(debugging) $ python -mpytest tests
================================================= test session starts ==================================================
platform linux -- Python 3.7.5, pytest-5.2.2, py-1.8.0, pluggy-0.13.0
rootdir: /home/konrad/ml/dygiepp/dygie, inifile: pytest.ini
plugins: flaky-3.6.1
collected 15 items                                                                                                     

tests/data/ie_json_test.py ........                                                                              [ 53%]
tests/models/coref_test.py F                                                                                     [ 60%]
tests/models/dygie_test.py F                                                                                     [ 66%]
tests/models/relation_test.py FFFFF                                                                              [100%]
[...]

Multi-token triggers for event extraction

I want to run the event extraction pipeline on my own dataset of ACE/ERE-like events.
The events have multi-token and discontinuous trigger spans.

I have not yet dived deep into the model code, but I was hoping for some ideas on how to adapt it to multi-token spans.

  • Is it possible with the current approach to model multi-token triggers?
  • Is it possible to model discontinuous triggers?
  • Which functions in the code are prime candidates for adaptation?

For now, I will parse out and use the head token of the trigger annotations, but given the discriminative and content-rich nature of my multi-token triggers, I expect this to hurt trigger classification.

Thank you for making this code available and supporting it. Having a SotA event extraction system available really helps my research.
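
For reference, a minimal sketch of the head-token fallback described above, using spaCy's Span.root to pick the syntactic head of a multi-token trigger (the model name, sentence, and span offsets are illustrative):

import spacy

nlp = spacy.load("en_core_web_sm")  # any English spaCy pipeline with a parser

doc = nlp("The rebels opened fire on the convoy.")
trigger = doc[2:4]        # multi-token trigger span: "opened fire"
head = trigger.root       # syntactic head of the span: "opened"
print(head.text, head.i)  # single head token to use as the trigger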

Pre-processing help needed!

Hello,

Thanks for the great work! I really like it and appreciate you posting your work here.

So my end goal is to predict the relationship between the dose group (drug) and adverse events using your model. E.g., the text will contain sentences like the following: group 3 showed decrease in food consumption. group 2 had inflammation and increase in alopecia. Group 4 exhibited sudden heart attack.

Here, the entities would be the dose groups and the adverse events (bolded), and the relations would be increase, decrease, or present (adverse event present).

The raw data that I have is semi-labeled: I have each sentence plus its entity and relation labels, but not the indices of where these entities & relations start and end within the sentence (like you mention here).
[screenshot]

Can you please provide some help on how to pre-process my raw data so that it can go into your model as input?

Also, with the preprocessed data, do you think it'll work if I train the ChemProt model on my new training dataset? I found that ChemProt is the closest domain/model to my dataset.

Thank you so much in advance and I hope you reply back :)
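
A minimal sketch of the kind of preprocessing being asked about here, under the assumption that entities can be located by naive surface-string matching (the labels, sentence, and doc_key are made up; see data.md for the exact DyGIE input schema):

import json

import spacy

nlp = spacy.load("en_core_web_sm")  # tokenizer choice is illustrative


def find_span(tokens, entity_tokens):
    """Return inclusive (start, end) token indices of the first match, else None."""
    n = len(entity_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == entity_tokens:
            return i, i + n - 1
    return None


text = "group 2 had inflammation and increase in alopecia."
entities = {"group 2": "DoseGroup", "inflammation": "AdverseEvent"}

tokens = [t.text for t in nlp(text)]
ner = []
for surface, label in entities.items():
    # NOTE: splitting on whitespace can disagree with the tokenizer;
    # real data may need fuzzier matching.
    span = find_span(tokens, surface.split())
    if span is not None:
        ner.append([span[0], span[1], label])

# For multi-sentence documents, span indices must be document-level,
# i.e. offsets into the concatenation of all sentences.
doc = {"doc_key": "example", "sentences": [tokens], "ner": [ner], "relations": [[]]}
print(json.dumps(doc))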

Using dygiepp for languages other than English

Hi,

I would like to use dygiepp on Dutch.
That means I need embeddings other than BERT; there is a Dutch version, 'Bertje'.
I would be grateful if you could give me a clue about where to start adapting your scripts, or allennlp, to use this Dutch version.

Thanks a lot
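
One possible starting point, reusing the JSON-overrides pattern that appears later in this document: swap the transformer name for Bertje's HuggingFace identifier when launching training. The override path and config file name below are assumptions; check the jsonnet config you start from for where the model name actually lives.

import json
import sys

from allennlp.commands import main

# "GroNLP/bert-base-dutch-cased" is Bertje's HuggingFace identifier.
# NOTE: the override path below is an assumption about the config layout.
overrides = json.dumps({
    "dataset_reader": {
        "token_indexers": {
            "bert": {"model_name": "GroNLP/bert-base-dutch-cased"}
        }
    }
})

sys.argv = [
    "allennlp", "train",
    "training_config/scierc.jsonnet",   # illustrative config file
    "-s", "models/dutch-test",
    "--include-package", "dygie",
    "-o", overrides,
]
main()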

Failed building wheel for jsonnet

run "pip install -r requirements.txt" failed

Downloading http://mirrors.tencentyun.com/pypi/packages/62/1e/a94a8d635fa3ce4cfc7f506003548d0a2447ae76fd5ca53932970fe3053f/pyasn1-0.4.8-py2.py3-none-any.whl (77 kB)
|████████████████████████████████| 77 kB 691 kB/s
Building wheels for collected packages: en-core-sci-sm, jsonnet
Building wheel for en-core-sci-sm (setup.py) ... done
Created wheel for en-core-sci-sm: filename=en_core_sci_sm-0.2.3-py3-none-any.whl size=16230213 sha256=a78b379ac4f2ce377645373d63a8b5419e2e63bf9b9f02db5e4f107a84930d27
Stored in directory: /root/.cache/pip/wheels/7c/c8/0d/0db35734344a895a1a329527f24538c21f1442878b8448d7d4
Building wheel for jsonnet (setup.py) ... error
ERROR: Command errored out with exit status 1:
command: /root/miniconda3/envs/dygiepp/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-fout_ucm/jsonnet/setup.py'"'"'; __file__='"'"'/tmp/pip-install-fout_ucm/jsonnet/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-py44whdj
cwd: /tmp/pip-install-fout_ucm/jsonnet/
Complete output (33 lines):
running bdist_wheel
running build
running build_ext
g++ -c -g -O3 -Wall -Wextra -Woverloaded-virtual -pedantic -std=c++0x -fPIC -Iinclude -Ithird_party/md5 -Ithird_party/json core/desugarer.cpp -o core/desugarer.o
make: g++: Command not found
make: *** [core/desugarer.o] Error 127
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-install-fout_ucm/jsonnet/setup.py", line 75, in <module>
test_suite="python._jsonnet_test",
File "/root/miniconda3/envs/dygiepp/lib/python3.7/site-packages/setuptools/__init__.py", line 144, in setup
return distutils.core.setup(**attrs)
File "/root/miniconda3/envs/dygiepp/lib/python3.7/distutils/core.py", line 148, in setup
dist.run_commands()
File "/root/miniconda3/envs/dygiepp/lib/python3.7/distutils/dist.py", line 966, in run_commands
self.run_command(cmd)
File "/root/miniconda3/envs/dygiepp/lib/python3.7/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/root/miniconda3/envs/dygiepp/lib/python3.7/site-packages/wheel/bdist_wheel.py", line 223, in run
self.run_command('build')
File "/root/miniconda3/envs/dygiepp/lib/python3.7/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/root/miniconda3/envs/dygiepp/lib/python3.7/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/root/miniconda3/envs/dygiepp/lib/python3.7/distutils/command/build.py", line 135, in run
self.run_command(cmd_name)
File "/root/miniconda3/envs/dygiepp/lib/python3.7/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/root/miniconda3/envs/dygiepp/lib/python3.7/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/tmp/pip-install-fout_ucm/jsonnet/setup.py", line 54, in run
raise Exception('Could not build %s' % (', '.join(LIB_OBJECTS)))
Exception: Could not build core/desugarer.o, core/formatter.o, core/libjsonnet.o, core/lexer.o, core/parser.o, core/pass.o, core/static_analysis.o, core/string_utils.o, core/vm.o, third_party/md5/md5.o

ERROR: Failed building wheel for jsonnet
Running setup.py clean for jsonnet
Successfully built en-core-sci-sm
Failed to build jsonnet
Installing collected packages: unidecode, itsdangerous, Werkzeug, flask, flask-cors, parsimonious, sqlparse, jmespath, botocore, s3transfer, boto3, pytorch-pretrained-bert, jsonnet, h5py, overrides, conllu, threadpoolctl, scipy, scikit-learn, editdistance, greenlet, gevent, protobuf, tensorboardX, wcwidth, ftfy, zipp, importlib-metadata, jsonpickle, blis, cymem, preshed, murmurhash, plac, srsly, wasabi, thinc, spacy, sentencepiece, pytorch-transformers, responses, attrs, py, more-itertools, pluggy, pytest, allennlp, pandas, soupsieve, beautifulsoup4, lxml, python-Levenshtein, PyYAML, pyasn1, rsa, colorama, awscli, pybind11, psutil, nmslib, scispacy, en-core-sci-sm
Running setup.py install for jsonnet ... error
ERROR: Command errored out with exit status 1:
command: /root/miniconda3/envs/dygiepp/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-fout_ucm/jsonnet/setup.py'"'"'; __file__='"'"'/tmp/pip-install-fout_ucm/jsonnet/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-nukhhlr0/install-record.txt --single-version-externally-managed --compile --install-headers /root/miniconda3/envs/dygiepp/include/python3.7m/jsonnet
cwd: /tmp/pip-install-fout_ucm/jsonnet/
Complete output (35 lines):
running install
running build
running build_ext
g++ -c -g -O3 -Wall -Wextra -Woverloaded-virtual -pedantic -std=c++0x -fPIC -Iinclude -Ithird_party/md5 -Ithird_party/json core/desugarer.cpp -o core/desugarer.o
make: g++: Command not found
make: *** [core/desugarer.o] Error 127
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-install-fout_ucm/jsonnet/setup.py", line 75, in <module>
test_suite="python._jsonnet_test",
File "/root/miniconda3/envs/dygiepp/lib/python3.7/site-packages/setuptools/__init__.py", line 144, in setup
return distutils.core.setup(**attrs)
File "/root/miniconda3/envs/dygiepp/lib/python3.7/distutils/core.py", line 148, in setup
dist.run_commands()
File "/root/miniconda3/envs/dygiepp/lib/python3.7/distutils/dist.py", line 966, in run_commands
self.run_command(cmd)
File "/root/miniconda3/envs/dygiepp/lib/python3.7/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/root/miniconda3/envs/dygiepp/lib/python3.7/site-packages/setuptools/command/install.py", line 61, in run
return orig.install.run(self)
File "/root/miniconda3/envs/dygiepp/lib/python3.7/distutils/command/install.py", line 545, in run
self.run_command('build')
File "/root/miniconda3/envs/dygiepp/lib/python3.7/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/root/miniconda3/envs/dygiepp/lib/python3.7/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/root/miniconda3/envs/dygiepp/lib/python3.7/distutils/command/build.py", line 135, in run
self.run_command(cmd_name)
File "/root/miniconda3/envs/dygiepp/lib/python3.7/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/root/miniconda3/envs/dygiepp/lib/python3.7/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/tmp/pip-install-fout_ucm/jsonnet/setup.py", line 54, in run
raise Exception('Could not build %s' % (', '.join(LIB_OBJECTS)))
Exception: Could not build core/desugarer.o, core/formatter.o, core/libjsonnet.o, core/lexer.o, core/parser.o, core/pass.o, core/static_analysis.o, core/string_utils.o, core/vm.o, third_party/md5/md5.o
----------------------------------------
ERROR: Command errored out with exit status 1: /root/miniconda3/envs/dygiepp/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-fout_ucm/jsonnet/setup.py'"'"'; __file__='"'"'/tmp/pip-install-fout_ucm/jsonnet/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-nukhhlr0/install-record.txt --single-version-externally-managed --compile --install-headers /root/miniconda3/envs/dygiepp/include/python3.7m/jsonnet Check the logs for full command output.

Training on WLP dataset?

Hi,
Are you planning to release the code for training a model on the WLP corpus as well?

Thanks!

Out of Memory

Hi,

I am training on a document of 40 sentences; some are long.
Do you have any advice on lowering GPU memory use, e.g. the batch size (I cannot find the parameter, as I am not familiar with allennlp) or the sentence length?

I ran out of memory on a 32GB Nvidia GPU.

Thanks
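
For reference, a sketch of the usual memory knobs expressed with the JSON-overrides pattern that appears elsewhere in this document; the key paths here are assumptions and depend on the allennlp version and config file in use:

import json

# Assumed key paths: older allennlp-0.9 configs put batch_size under
# "iterator"; the dygie reader also takes a max_span_width that bounds
# the number of candidate spans per sentence.
overrides = json.dumps({
    "iterator": {"batch_size": 1},
    "dataset_reader": {"max_span_width": 8},
})

# Pass via the CLI, e.g.:
#   allennlp train <config> -s <dir> --include-package dygie -o '<overrides>'
print(overrides)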

Applying to roles & triggers across sentences

Hi, I'd like to apply DyGIE++ on the Roles Across Multiple Sentences (RAMS) dataset.

In the RAMS dataset, the event triggers and arguments may be in separate sentences. For example, the trigger could be in sentence 3, but the victim and killer are in sentence 4.

But looking at data.md, it seems like the data format requires the trigger and arguments to be in the same sentence. Is DyGIE++ capable of processing event extraction across sentences?

No module named 'torch._C'

Hi
Thank you for your amazing work and for publishing the code!

While replicating your work, making predictions on an existing dataset, I encountered the following error. Can you please help me out?

allennlp predict ./scripts/pretrained/genia-lightweight.tar.gz \
  ./scripts/processed_data/json-coref-ident-only/test.json \
  --predictor dygie \
  --include-package dygie \
  --use-dataset-reader \
  --output-file predictions/genia-test.jsonl \
  --cuda-device 0

[screenshot of the error]

Thank you!

KeyError: 'NewsDNA__argument_labels'

Hi

I tried to train and test with a new dataset that I formatted similarly to the ACE event data.

However, I keep getting the following error:

File "/home/thierry/repos/dygiepp-dev/dygie/models/events.py", line 177, in forward
mention_pruner = self._mention_pruners[self._active_namespaces["argument"]]
File "/home/thierry/miniconda3/lib/python3.8/site-packages/torch/nn/modules/container.py", line 286, in getitem
return self._modules[key]

I assume I should specify the labels used somewhere?

Thanks for helping

Training with `ace05_best_ner_bert.jsonnet` returns error: `'str' object has no attribute 'get_lr'`

Using the following code:

import json
import sys

from allennlp.commands import main

# Force CPU training by overriding the trainer's cuda_device.
overrides = json.dumps({"trainer": {"cuda_device": -1}})
runs = [
    ("../ace05_best_relation_bert.jsonnet", "../data/relation")
]

for config_file, serialization_dir in runs:

    # Simulate the `allennlp train` CLI invocation.
    sys.argv = [
        "allennlp",
        "train",
        config_file,
        "-s", serialization_dir,
        "--include-package", "dygie",
        "--include-package", "ie_json",
        "-o", overrides,
    ]

    main()

Attempting to run ace05_best_ner_bert.jsonnet returns the error:
[screenshot of the error]

I'm using the versions of allennlp and torch suggested in requirements.txt:
[screenshots of the installed allennlp and torch versions]

SciERC dataset

Hi,
In the SciERC dataset (scierc.raw.tar.gz), each abstract doc comes with a .txt file (the raw text of the abstract) and a .ann file (which captures the annotations for entities, relations, and corefs). There is also a third file, .xml.txt, associated with each abstract; it seems to contain details including the POS tags and parse-tree/dependency information. What is this XML file, and how is it generated? What is the process for creating it if we want to bring in our own domain-specific documents to train dygiepp on?
Would appreciate a quick response, thanks,
Sundar

How to run my dataset

I am having a problem here.

2020-06-11 21:22:09,219 - INFO - allennlp.training.trainer - Beginning training.
2020-06-11 21:22:09,219 - INFO - allennlp.training.trainer - Epoch 0/249
2020-06-11 21:22:09,219 - INFO - allennlp.training.trainer - Peak CPU memory usage MB: 3337.832
2020-06-11 21:22:09,275 - INFO - allennlp.training.trainer - Training
0%| | 0/1990 [00:03<?, ?it/s]
Traceback (most recent call last):
File "/root/miniconda3/envs/dygiepp/bin/allennlp", line 8, in
sys.exit(run())
File "/root/miniconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/run.py", line 18, in run
main(prog="allennlp")
File "/root/miniconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/init.py", line 102, in main
args.func(args)
File "/root/miniconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/train.py", line 124, in train_model_from_args
args.cache_prefix)
File "/root/miniconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/train.py", line 168, in train_model_from_file
cache_directory, cache_prefix)
File "/root/miniconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/train.py", line 252, in train_model
metrics = trainer.train()
File "/root/miniconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/training/trainer.py", line 478, in train
train_metrics = self._train_epoch(epoch)
File "/root/miniconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/training/trainer.py", line 320, in _train_epoch
loss = self.batch_loss(batch_group, for_training=True)
File "/root/miniconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/training/trainer.py", line 261, in batch_loss
output_dict = self.model(**batch)
File "/root/miniconda3/envs/dygiepp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in call
result = self.forward(*input, **kwargs)
File "./dygie/models/dygie.py", line 280, in forward
output_relation = self._relation.predict_labels(relation_labels, output_relation, metadata)
File "./dygie/models/relation.py", line 193, in predict_labels
predictions = self.decode(output_dict)["decoded_relations_dict"]
File "./dygie/models/relation.py", line 218, in decode
top_spans, predicted_relations, num_spans_to_keep)
File "./dygie/models/relation.py", line 249, in _decode_sentence
label_name = self.vocab.get_token_from_index(label, namespace="relation_labels")
File "/root/miniconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/data/vocabulary.py", line 644, in get_token_from_index
return self._index_to_token[namespace][index]
KeyError: 0

KeyError: "'' not found in vocab namespace 'scierc__ner_labels

I'm going to close this for lack of activity, feel free to reopen if not resolved.

Originally posted by @dwadden in #59 (comment)

Hi, thanks for your response. I was able to bypass the error last time and extract entities on my custom dataset (though I wasn't able to run it on the pre-existing test.json). I am trying to run it again on a different dataset, but the error persists. I created a new environment and installed dependencies from scratch using the updated requirements.txt. The stack trace is the same as above.

Coreference propagation on ACE05

In the paper you mention that, since ACE does not have coreference annotations, you use OntoNotes for coreference propagation. I was wondering if it is possible to predict coreference on ACE using the current trained model?

_normalize_word method in IEJsonReader class

Hi dwadden,

thank you very much for uploading your code to GitHub.
I have succeeded in running and training your model in my environment by following your instructions.

I have just noticed that the _normalize_word method does NOT take a "self" argument.
Given that the model seems to be working, it is not a big deal, but let me report this for your reference.
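
If the method really is defined on the class without self, the usual fix is to mark it static; a minimal sketch (the method body here follows the common OntoNotes-style normalization and is illustrative only, not the actual DyGIE code):

class IEJsonReader:
    @staticmethod
    def _normalize_word(word):
        # @staticmethod stops Python from injecting `self`, so the
        # existing one-argument call sites keep working.
        return word[1:] if word in ("/.", "/?") else word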

English : no such file or directory

Hi,
Where does the "English" file or directory for ace05 come from?
After running get_corenlp.sh and then get_ace05.sh, I get this error:
cp: cannot stat ‘./scripts/data/ace05/common//English’: No such file or directory
run.zsh:4: no matches found: English/
/timex2norm/*.sgm
etc

Thanks for any help.

ScispaCy vs. Stanford NLP tokenization with SciERC model

Hi,

I'm trying to apply the SciERC pre-trained model to an unlabeled dataset of abstracts from plant science papers. I used the following command line to format my data:

python scripts/new-dataset/format_new_dataset.py ../knowledge-graph/data/first_manuscript_data/clustering_pipeline_output/JA_GA_chosen_abstracts/ ../knowledge-graph/data/first_manuscript_data/dygiepp/prepped_data/dygiepp_formatted_data_SciERC.jsonl scierc

where the directory JA_GA_chosen_abstracts contains a .txt file for each abstract.

I was then successfully able to run the pre-trained SciERC model on this data. However, when looking at the results, I noticed I was getting a lot of entities that were either a single round bracket, a single hyphen, or a word followed or preceded by a hyphen. When I looked more closely at the tokenized sentences in the preprocessed jsonl file, it was clear that this is because the spaCy tokenizer in ./scripts/new-dataset/format_new_dataset.py splits hyphenated words, and leaves parentheses/brackets as-is.

However, when I looked at the processed SciERC json files, it looks like they were tokenized with PTB3 token transforms ("(" becomes "-LRB-", etc.), and without splitting hyphenated words. A cursory google makes it seem like this tokenization may have been done with the Stanford NLP tokenizer, because it gives options to use PTB3 tranforms and to not split hyphenated words.

I checked out the webpage where the processed SciERC dataset is pulled from, and skimmed the paper and the repo, but didn't see anything that indicated how the dataset was tokenized. I was wondering if you knew what tokenizer had been used on the SciERC data, and whether you thought it would be better to use the same tokenization scheme on new datasets to get better performance with the pre-trained model. If it turns out it was done with the Stanford NLP tokenizer, I'd be more than happy to open a PR adding an option to use that tokenizer in format_new_dataset.py.

Thanks!

Train ChemProt model problem

I ran into a problem when training the ChemProt model,
when running "bash ./scripts/train/train_chemprot.sh -1":

2020-05-27 14:17:59,641 - INFO - allennlp.training.optimizers - Number of trainable parameters: 123007840
2020-05-27 14:17:59,642 - INFO - allennlp.common.params - trainer.optimizer.infer_type_and_cast = True
2020-05-27 14:17:59,642 - INFO - allennlp.common.params - Converting Params object to dict; logging of default values will not occur when dictionary parameters are used subsequently.
2020-05-27 14:17:59,642 - INFO - allennlp.common.params - CURRENTLY DEFINED PARAMETERS:
2020-05-27 14:17:59,642 - INFO - allennlp.common.params - trainer.optimizer.lr = 0.001
2020-05-27 14:17:59,642 - INFO - allennlp.common.params - trainer.optimizer.t_total = 10000
2020-05-27 14:17:59,642 - INFO - allennlp.common.params - trainer.optimizer.warmup = 0.1
2020-05-27 14:17:59,642 - INFO - allennlp.common.params - trainer.optimizer.weight_decay = 0
2020-05-27 14:17:59,642 - INFO - allennlp.common.registrable - instantiating registered subclass bert_adam of <class 'allennlp.training.optimizers.Optimizer'>
2020-05-27 14:17:59,642 - INFO - allennlp.common.params - trainer.learning_rate_scheduler.type = reduce_on_plateau
2020-05-27 14:17:59,642 - INFO - allennlp.common.registrable - instantiating registered subclass reduce_on_plateau of <class 'allennlp.training.learning_rate_schedulers.learning_rate_scheduler.LearningRateScheduler'>
2020-05-27 14:17:59,642 - INFO - allennlp.common.params - Converting Params object to dict; logging of default values will not occur when dictionary parameters are used subsequently.
2020-05-27 14:17:59,642 - INFO - allennlp.common.params - CURRENTLY DEFINED PARAMETERS:
2020-05-27 14:17:59,642 - INFO - allennlp.common.params - trainer.learning_rate_scheduler.factor = 0.5
2020-05-27 14:17:59,642 - INFO - allennlp.common.params - trainer.learning_rate_scheduler.mode = max
2020-05-27 14:17:59,642 - INFO - allennlp.common.params - trainer.learning_rate_scheduler.patience = 4
2020-05-27 14:17:59,642 - INFO - allennlp.common.params - trainer.num_serialized_models_to_keep = 3
2020-05-27 14:17:59,642 - INFO - allennlp.common.params - trainer.keep_serialized_model_every_num_seconds = None
2020-05-27 14:17:59,643 - INFO - allennlp.common.params - trainer.model_save_interval = None
2020-05-27 14:17:59,643 - INFO - allennlp.common.params - trainer.summary_interval = 100
2020-05-27 14:17:59,643 - INFO - allennlp.common.params - trainer.histogram_interval = None
2020-05-27 14:17:59,643 - INFO - allennlp.common.params - trainer.should_log_parameter_statistics = True
2020-05-27 14:17:59,643 - INFO - allennlp.common.params - trainer.should_log_learning_rate = False
2020-05-27 14:17:59,643 - INFO - allennlp.common.params - trainer.log_batch_size_period = None
2020-05-27 14:17:59,806 - INFO - allennlp.training.trainer - Beginning training.
2020-05-27 14:17:59,806 - INFO - allennlp.training.trainer - Epoch 0/249
2020-05-27 14:17:59,806 - INFO - allennlp.training.trainer - Peak CPU memory usage MB: 2505.212
2020-05-27 14:17:59,891 - INFO - allennlp.training.trainer - Training
0%| | 0/1299 [00:00<?, ?it/s]./scripts/train/train_chemprot.sh: line 18: 6835 Killed ie_train_data_path=$data_root/training.jsonl ie_dev_data_path=$data_root/development.jsonl ie_test_data_path=$data_root/test.jsonl cuda_device=$cuda_device allennlp train $config_file --cache-directory $data_root/cached --serialization-dir ./models/$experiment_name --include-package dygie


I think the reason for this problem is that the memory is too small. I would like to hear your advice.
my cloud server:
2vCPUs | 4GB

ace05_event.jsonnet embeddings shape size error

First, thanks for all your great work!

I've put my own labeled data in the DyGIE++ format, following the instructions in https://github.com/dwadden/dygiepp/blob/master/DATA.md. I was able to successfully train the relation extractor using the ace05_best_relation_bert.jsonnet config, with quite good performance.

When I next tried to train for Events using ace05_event.jsonnet, however, I ran into the following error:
[screenshot of the error]

Following the comments in https://github.com/dwadden/dygiepp/blob/master/training_config/template_dw.libsonnet#L88, I then made a copy of ace05_event.jsonnet, changing the n_trigger_labels and n_ner_labels values to reflect my own dataset counts, i.e.:

ace05_event_copy.jsonnet

  n_trigger_labels: 28,    // prev: 34
  n_ner_labels: 103,       // prev: 8

However, I'm still getting the error RuntimeError: shape '[-1, 2745]' is invalid for input of size 5601840 in https://github.com/dwadden/dygiepp/blob/master/dygie/models/events.py#L553.

Trying to apply the pre-trained ACE05-Event model to new data

Hi there! We're trying to use the pre-trained ACE05-Event model for a project on new data. However, we've been struggling to get it to run. To test it out, we tried the following on a line of data from the SciERC dataset and got the error below.

!echo '{"events": [[], [], [], [], []], "clusters": [[[6, 17], [32, 32]], [[4, 4], [55, 55], [91, 91]], [[58, 62], [64, 64], [79, 79]]], "sentences": [["This", "paper", "presents", "an", "algorithm", "for", "computing", "optical", "flow", ",", "shape", ",", "motion", ",", "lighting", ",", "and", "albedo", "from", "an", "image", "sequence", "of", "a", "rigidly-moving", "Lambertian", "object", "under", "distant", "illumination", "."], ["The", "problem", "is", "formulated", "in", "a", "manner", "that", "subsumes", "structure", "from", "motion", ",", "multi-view", "stereo", ",", "and", "photo-metric", "stereo", "as", "special", "cases", "."], ["The", "algorithm", "utilizes", "both", "spatial", "and", "temporal", "intensity", "variation", "as", "cues", ":", "the", "former", "constrains", "flow", "and", "the", "latter", "constrains", "surface", "orientation", ";", "combining", "both", "cues", "enables", "dense", "reconstruction", "of", "both", "textured", "and", "texture-less", "surfaces", "."], ["The", "algorithm", "works", "by", "iteratively", "estimating", "affine", "camera", "parameters", ",", "illumination", ",", "shape", ",", "and", "albedo", "in", "an", "alternating", "fashion", "."], ["Results", "are", "demonstrated", "on", "videos", "of", "hand-held", "objects", "moving", "in", "front", "of", "a", "fixed", "light", "and", "camera", "."]], "ner": [[[4, 4, "Generic"], [6, 17, "Task"], [20, 21, "Material"], [24, 26, "Material"], [28, 29, "OtherScientificTerm"]], [[32, 32, "Generic"], [42, 42, "Material"], [44, 45, "Material"], [48, 49, "Material"]], [[55, 55, "Generic"], [58, 62, "OtherScientificTerm"], [64, 64, "Generic"], [67, 67, "Generic"], [69, 69, "OtherScientificTerm"], [72, 72, "Generic"], [74, 75, "OtherScientificTerm"], [79, 79, "Generic"], [81, 88, "Task"]], [[91, 91, "Generic"], [95, 105, "Method"]], [[115, 118, "Material"]]], "relations": [[[4, 4, 6, 17, "USED-FOR"], [20, 21, 4, 4, "USED-FOR"], [24, 26, 20, 21, "FEATURE-OF"], [28, 29, 24, 26, "FEATURE-OF"]], [[42, 42, 44, 45, "CONJUNCTION"], [44, 45, 48, 49, "CONJUNCTION"]], [[58, 62, 55, 55, "USED-FOR"], [67, 67, 64, 64, "HYPONYM-OF"], [67, 67, 69, 69, "USED-FOR"], [67, 67, 72, 72, "CONJUNCTION"], [72, 72, 64, 64, "HYPONYM-OF"], [72, 72, 74, 75, "USED-FOR"], [79, 79, 81, 88, "USED-FOR"]], [[95, 105, 91, 91, "USED-FOR"]], []], "doc_key": "ICCV_2003_158_abs"}' > scierc-test.jsonl
!allennlp predict pretrained/ace05-event.tar.gz \
    scierc-test.jsonl \
    --predictor dygie \
    --include-package dygie \
    --use-dataset-reader \
    --output-file predictions/scierc-test.jsonl

This is the error:

2020-06-09 22:19:37,555 - ERROR - allennlp.data.vocabulary - Namespace: ner_labels
2020-06-09 22:19:37,555 - ERROR - allennlp.data.vocabulary - Token: Generic
Traceback (most recent call last):
  File "/usr/local/bin/allennlp", line 8, in <module>
    sys.exit(run())
  File "/usr/local/lib/python3.6/dist-packages/allennlp/run.py", line 18, in run
    main(prog="allennlp")
  File "/usr/local/lib/python3.6/dist-packages/allennlp/commands/__init__.py", line 102, in main
    args.func(args)
  File "/usr/local/lib/python3.6/dist-packages/allennlp/commands/predict.py", line 227, in _predict
    manager.run()
  File "/usr/local/lib/python3.6/dist-packages/allennlp/commands/predict.py", line 201, in run
    for model_input_instance, result in zip(batch, self._predict_instances(batch)):
  File "/usr/local/lib/python3.6/dist-packages/allennlp/commands/predict.py", line 159, in _predict_instances
    results = [self._predictor.predict_instance(batch_data[0])]
  File "./dygie/predictors/dygie.py", line 81, in predict_instance
    dataset.index_instances(model.vocab)
  File "/usr/local/lib/python3.6/dist-packages/allennlp/data/dataset.py", line 155, in index_instances
    instance.index_fields(vocab)
  File "/usr/local/lib/python3.6/dist-packages/allennlp/data/instance.py", line 72, in index_fields
    field.index(vocab)
  File "/usr/local/lib/python3.6/dist-packages/allennlp/data/fields/sequence_label_field.py", line 100, in index
    for label in self.labels]
  File "/usr/local/lib/python3.6/dist-packages/allennlp/data/fields/sequence_label_field.py", line 100, in <listcomp>
    for label in self.labels]
  File "/usr/local/lib/python3.6/dist-packages/allennlp/data/vocabulary.py", line 637, in get_token_index
    return self._token_to_index[namespace][self._oov_token]
KeyError: '@@UNKNOWN@@'

We aren't super familiar with allennlp, so we aren't sure how to diagnose this issue. We're not sure why there is a problem with the NER labels, since we're hoping to just predict those rather than train the model on them. Is there a way to override this?

Thanks a lot!

Keyword extraction from scientific abstracts

Hi! I'm interested in using dygiepp for automatically generating keywords from scientific abstracts, and testing how it performs compared to Textacy algorithms for keyword extraction (for context, see Mini-Conf/Mini-Conf#34).

I'm looking to preprocess sample abstracts in https://github.com/anaerobeth/dygiepp/tree/keyword-extraction/data/miniconf/raw_data to generate suitable .txt and .ann files for use in making predictions using the pretrained scierc lightweight model. Can you provide some guidance on how to get started with this task? Thanks!

Using a Fine Tuned SciBERT

I wanted to replace the pre-trained SciBERT model with a fine-tuned SciBERT model. I did the language-model fine-tuning via this notebook: https://github.com/Nikoschenk/language_model_finetuning/blob/master/scibert_fine_tuner.ipynb.

It uses the HuggingFace library to do the fine-tuning. The resulting HuggingFace model has these components:

  1. config.json
  2. pytorch_model.bin
  3. special_tokens_map.json
  4. tokenizer_config.json
  5. vocab.txt

I noticed that when I run DyGIE's get_scibert.py script, it downloads the PyTorch model as follows:

  1. scibert_scivocab_cased/weights.tar.gz
  2. scibert_scivocab_cased/vocab.txt

Further, weights.tar.gz is made up of pytorch_model.bin & bert_config.json.

I repackaged the HuggingFace outputs into a new weights.tar.gz (pytorch_model.bin, plus config.json renamed to bert_config.json).

It works fine.

I just wanted a second opinion about the above approach. Please advise.
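
For reference, a minimal sketch of the repackaging step described above (all paths are hypothetical; the layout follows this issue's description of weights.tar.gz containing pytorch_model.bin and bert_config.json):

import os
import shutil
import tarfile

SRC = "finetuned_scibert"        # output dir of the HuggingFace fine-tuning
DST = "scibert_scivocab_cased"   # where the DyGIE setup expects the model

os.makedirs(DST, exist_ok=True)

# The archive names the config bert_config.json, so copy it under that name.
shutil.copy(os.path.join(SRC, "config.json"),
            os.path.join(SRC, "bert_config.json"))

with tarfile.open(os.path.join(DST, "weights.tar.gz"), "w:gz") as tar:
    for name in ("pytorch_model.bin", "bert_config.json"):
        tar.add(os.path.join(SRC, name), arcname=name)

shutil.copy(os.path.join(SRC, "vocab.txt"), os.path.join(DST, "vocab.txt"))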

Question about the param _n_labels

Why is the output dimension in TimeDistributed(torch.nn.Linear(mention_feedforward.get_output_dim(), self._n_labels - 1)) set to self._n_labels - 1? Should it be self._n_labels?
And, on the other hand, why do you design dummy_scores in ner.py?
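
A sketch of the pattern the question is about (not the exact DyGIE code): the linear layer scores only the real labels, and a fixed zero "dummy" score is concatenated for the null label, so a span is predicted as an entity only when some real label outscores zero.

import torch

n_labels = 5                                  # includes the null label
scorer = torch.nn.Linear(128, n_labels - 1)   # scores real labels only

span_embeddings = torch.randn(2, 10, 128)     # (batch, n_spans, dim)
real_scores = scorer(span_embeddings)         # (batch, n_spans, n_labels - 1)

# Fixed zero score for the null ("no entity") label.
dummy_scores = real_scores.new_zeros(2, 10, 1)
ner_scores = torch.cat([dummy_scores, real_scores], dim=-1)

predictions = ner_scores.argmax(dim=-1)       # index 0 means "no entity"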

Cannot disable NER when training events

I tried setting loss_weights.ner to 0, but training errors out with:

2020-08-14 08:45:43,915 - INFO - allennlp.training.trainer - Training
  0%|          | 0/183 [00:04<?, ?it/s]
Traceback (most recent call last):
  File "/opt/conda/envs/dygiepp/bin/allennlp", line 8, in <module>
    sys.exit(run())
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/run.py", line 18, in run
    main(prog="allennlp")
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/__init__.py", line 102, in main
    args.func(args)
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/train.py", line 124, in train_model_from_args
    args.cache_prefix)
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/train.py", line 168, in train_model_from_file
    cache_directory, cache_prefix)
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/train.py", line 252, in train_model
    metrics = trainer.train()
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/training/trainer.py", line 478, in train
    train_metrics = self._train_epoch(epoch)
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/training/trainer.py", line 320, in _train_epoch
    loss = self.batch_loss(batch_group, for_training=True)
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/training/trainer.py", line 256, in batch_loss
    output_dict = training_util.data_parallel(batch_group, self.model, self._cuda_devices)
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/training/util.py", line 331, in data_parallel
    outputs = parallel_apply(replicas, inputs, moved, used_device_ids)
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/torch/_utils.py", line 369, in reraise
    raise self.exc_type(msg)
KeyError: Caught KeyError in replica 0 on device 2.
Original Traceback (most recent call last):
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "./dygie/models/dygie.py", line 298, in forward
    ner_labels, metadata)
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "./dygie/models/events.py", line 183, in forward
    ner_scores = output_ner["ner_scores"]
KeyError: 'ner_scores'

It seems that in events.py the ner_scores are not optional right now, as they are passed as a required argument to the _mention_pruner function.

For now I am testing the pipeline by producing NER predictions on my custom data from the pre-trained ACE05-event model and feeding those in as silver-standard reference NER labels.

I also set event_args_use_ner_labels to false. Is this required, or are the NER labels referred to here model predictions rather than gold-standard labels?
(I get higher F1 for TrigC and ArgC when setting this to false while training with NER labels produced by prediction from the pre-trained ACE05 model.)

Event extraction training without Named Entities or Relations

My custom dataset has ACE-like event annotations with triggers and arguments but no NER or relations.
I tried running the ACE training pipeline on my event data and left the relations and ner keys empty, e.g.:

{
"doc_key": "aal00",
"sentences": [
    ["American", "Airlines", "Up", "on", "Record", "April", "Traffic", ",", "Upbeat", "Q2", "View"],
    ["Premier", "passenger", "carrier", ",", "American", "Airlines", "Group", "Inc", ".", "AAL", "saw", "its", "shares", "rise", "4.76", "%", "to", "$", "47.08", "at", "the", "close", "of", "business", "on", "Apr", "9", ",", "following", "the", "release", "of", "its", "traffic", "report", "for", "the", "month", "of", "April", "."]
],
"events": [
    [ ],
    [
        [[24, "SecurityValue"], [25, 26, "IncreaseAmount"], [23, 23, "Security"], [30, 37, "TIME"], [28, 29, "Price"]],
        [[45, "FinancialReport"], [43, 43, "Reportee"], [46, 50, "TIME"]]
    ]
],
"ner": [
    [ ],
    [ ]
],
"relations": [
    [ ],
    [ ]
],
"clusters": [
    [ ],
    [ ]
]
}

I configured the .jsonnet file based on train_ace05_event.jsonnet, changing
n_trigger_labels to the number of event types and n_ner_labels to 0, because I have no NER annotations.
Those were all the relevant config keys I could identify from the file itself; I probably missed some, because training failed with the following error:

2020-08-06 14:18:05,035 - INFO - allennlp.training.trainer - Training
  0%|          | 0/365 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/opt/conda/envs/dygiepp/bin/allennlp", line 8, in <module>
    sys.exit(run())
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/run.py", line 18, in run
    main(prog="allennlp")
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/__init__.py", line 102, in main
    args.func(args)
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/train.py", line 124, in train_model_from_args
    args.cache_prefix)
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/train.py", line 168, in train_model_from_file
    cache_directory, cache_prefix)
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/train.py", line 252, in train_model
    metrics = trainer.train()
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/training/trainer.py", line 478, in train
    train_metrics = self._train_epoch(epoch)
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/training/trainer.py", line 320, in _train_epoch
    loss = self.batch_loss(batch_group, for_training=True)
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/training/trainer.py", line 261, in batch_loss
    output_dict = self.model(**batch)
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "./dygie/models/dygie.py", line 298, in forward
    ner_labels, metadata)
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "./dygie/models/events.py", line 267, in forward
    trig_arg_embeddings, top_trig_scores, top_arg_scores, top_arg_mask)
  File "./dygie/models/events.py", line 553, in _compute_argument_scores
    embeddings_flat = pairwise_embeddings.view(-1, feature_dim)
RuntimeError: shape '[-1, 2642]' is invalid for input of size 17840250

  • Is it possible to train event extraction with only events (triggers + arguments), without NER and relation annotations?
  • Am I missing a config key here? I suspect the number of argument types has to be set somewhere, but the config for this is not obvious.

Branch #allennlp-v1 not usable

I am testing the branch #allennlp-v1 to run the ACE-Event training code, which has some import-related code issues.

 (dygiepp) root@0a587743b213:/dygiepp# rm -rf ./models/ace05-event; bash ./scripts/train/train_ace05_event.sh 0
2020-08-03 08:58:29,323 - INFO - transformers.file_utils - PyTorch version 1.5.1 available.
Traceback (most recent call last):
  File "/opt/conda/envs/dygiepp/bin/allennlp", line 8, in <module>
    sys.exit(run())
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/__main__.py", line 19, in run
    main(prog="allennlp")
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/__init__.py", line 91, in main
    import_module_and_submodules(package_name)
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/common/util.py", line 351, in import_module_and_submodules
    import_module_and_submodules(subpackage)
  File "/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/common/util.py", line 340, in import_module_and_submodules
    module = importlib.import_module(package_name)
  File "/opt/conda/envs/dygiepp/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/dygiepp/dygie/data/__init__.py", line 3, in <module>
    from dygie.data.iterators.batch_iterator import BatchIterator
  File "/dygiepp/dygie/data/iterators/batch_iterator.py", line 10, in <module>
    from allennlp.data.dataloader import DataLoader, PyTorchDataLoader
ImportError: cannot import name 'PyTorchDataLoader' from 'allennlp.data.dataloader' (/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/data/dataloader.py)
(dygiepp) root@0a587743b213:/dygiepp# python -c "from allennlp.data.dataloader import PyTorchDataLoader"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: cannot import name 'PyTorchDataLoader' from 'allennlp.data.dataloader' (/opt/conda/envs/dygiepp/lib/python3.7/site-packages/allennlp/data/dataloader.py)

I also tested the branch two weeks ago at commit 07074fd and encountered multiple issues with sen_dict() missing in the dataloader code, which indicates refactoring is still ongoing.

Is the allennlp-v1 branch meant to be used now, or will it be merged to master when it is ready to use?
I am planning to make adjustments to dygiepp to run on my own ERE-like event dataset, and the upgraded dependencies would make Apex usable, which is a big plus for me.

In any case, thanks for making DYGIE++ source available and maintaining it!

ACE05 data preprocess problem: get_token_of

Hi, when I ran the preprocessing script parse_ace_event.py, I got Exception: Should not get here from the get_token_of function. I found that a character index couldn't be matched to a token index, which may be due to tokenization. Have you ever met such a problem, and how should I fix it?

Finetune pretrained model using labelled data

Hi,

I emailed you earlier regarding this. To make it more official: I was wondering if you had any suggestions on how to finetune a pretrained model.

I.e. I have a set of annotated articles. I would love to use this data to finetune the ace-event model.

Thanks!

KeyError: "'' not found in vocab namespace 'scierc__ner_labels

I am trying to run the pretrained SciERC model on the preprocessed SciERC dataset. I run the following command to do so:

allennlp predict pretrained/scierc.tar.gz \
  data/processed_data/json/test.json \
  --predictor dygie \
  --include-package dygie \
  --use-dataset-reader \
  --output-file predictions/scierc-test.jsonl \
  --cuda-device -1 \
  --silent

I run into the following error:

Traceback (most recent call last):
File "/home/yelman/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/data/vocabulary.py", line 724, in get_token_index
return self._token_to_index[namespace][token]
KeyError: ''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/yelman/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/data/vocabulary.py", line 727, in get_token_index
return self._token_to_index[namespace][self._oov_token]
KeyError: '@@unknown@@'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/yelman/anaconda3/envs/dygiepp/bin/allennlp", line 8, in
sys.exit(run())
File "/home/yelman/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/main.py", line 34, in run
main(prog="allennlp")
File "/home/yelman/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/init.py", line 119, in main
args.func(args)
File "/home/yelman/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/predict.py", line 224, in _predict
predictor = _get_predictor(args)
File "/home/yelman/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/predict.py", line 119, in _get_predictor
overrides=args.overrides,
File "/home/yelman/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/models/archival.py", line 208, in load_archive
model = _load_model(config.duplicate(), weights_path, serialization_dir, cuda_device)
File "/home/yelman/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/models/archival.py", line 246, in _load_model
cuda_device=cuda_device,
File "/home/yelman/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/models/model.py", line 406, in load
return model_class._load(config, serialization_dir, weights_file, cuda_device)
File "/home/yelman/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/models/model.py", line 305, in _load
vocab=vocab, params=model_params, serialization_dir=serialization_dir
File "/home/yelman/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/common/from_params.py", line 604, in from_params
**extras,
File "/home/yelman/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/common/from_params.py", line 634, in from_params
return constructor_to_call(**kwargs) # type: ignore
File "/home/yelman/Desktop/dygiepp-master/dygie/models/dygie.py", line 111, in init
params=modules.pop("ner"))
File "/home/yelman/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/common/from_params.py", line 634, in from_params
return constructor_to_call(**kwargs) # type: ignore
File "/home/yelman/Desktop/dygiepp-master/dygie/models/ner.py", line 50, in init
null_label = vocab.get_token_index("", namespace)
File "/home/yelman/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/data/vocabulary.py", line 732, in get_token_index
f"'{token}' not found in vocab namespace '{namespace}', and namespace "
KeyError: "'' not found in vocab namespace 'scierc__ner_labels', and namespace does not contain the default OOV token ('@@unknown@@')"

AssertionError: No super class method found for "_instances_from_cache_file"

Hi,

Thank you for the excellent work. I followed the steps, setting up the environment and installing all requirements exactly. I want to download the SciERC dataset but have the following problem.
I ran these commands exactly:

conda create --name dygiepp python=3.7
pip install -r requirements.txt
conda develop .
bash ./scripts/data/get_scierc.sh

This is the problem I have:

Traceback (most recent call last):
  File "scripts/data/shared/normalize.py", line 5, in <module>
    from dygie.data.dataset_readers.document import Document, Dataset
  File "/ldap_shared/home/v_yuchen_zeng/696ds/dygiepp-master/dygie/data/__init__.py", line 1, in <module>
    from dygie.data.dataset_readers.dygie import DyGIEReader
  File "/ldap_shared/home/v_yuchen_zeng/696ds/dygiepp-master/dygie/data/dataset_readers/dygie.py", line 29, in <module>
    class DyGIEReader(DatasetReader):
  File "/ldap_shared/home/v_yuchen_zeng/696ds/dygiepp-master/dygie/data/dataset_readers/dygie.py", line 202, in DyGIEReader
    @overrides
  File "/ldap_shared/home/v_yuchen_zeng/anaconda3/envs/dygiepp/lib/python3.7/site-packages/overrides/overrides.py", line 67, in overrides
    raise AssertionError('No super class method found for "%s"' % method.__name__)
AssertionError: No super class method found for "_instances_from_cache_file"
Traceback (most recent call last):
  File "scripts/data/shared/collate.py", line 4, in <module>
    from dygie.data.dataset_readers import document
  File "/ldap_shared/home/v_yuchen_zeng/696ds/dygiepp-master/dygie/data/__init__.py", line 1, in <module>
    from dygie.data.dataset_readers.dygie import DyGIEReader
  File "/ldap_shared/home/v_yuchen_zeng/696ds/dygiepp-master/dygie/data/dataset_readers/dygie.py", line 29, in <module>
    class DyGIEReader(DatasetReader):
  File "/ldap_shared/home/v_yuchen_zeng/696ds/dygiepp-master/dygie/data/dataset_readers/dygie.py", line 202, in DyGIEReader
    @overrides
  File "/ldap_shared/home/v_yuchen_zeng/anaconda3/envs/dygiepp/lib/python3.7/site-packages/overrides/overrides.py", line 67, in overrides
    raise AssertionError('No super class method found for "%s"' % method.__name__)
AssertionError: No super class method found for "_instances_from_cache_file"

I tried the emnlp-2019 branch, but I still got the same problem. Have you run into this problem before? Thanks!

Missing events config template

Hi,
thank you very much for providing your source code!
I know that training on ACE events is a work in progress, but if you would push your local template, we could try to figure out how to train in the meantime.

Question about the number of documents of ACE2005 event extraction

Thanks for making a comprehensive summary of the event extraction pre-processing.

I am confused by the number of documents in each split. The files dev.filelist, test.filelist, and train.filelist have 28, 40, and 529 lines respectively, which makes 597 documents in total, but Table 8 in the appendix says there are 599 documents.

Dockerfile expects script that was removed

It looks like scripts/pretrained/get_scibert.py was removed in this commit, but the Dockerfile still expects it:

$ docker build .

420100K .......... .......... .......... .......... .......... 99% 2.89M 0s
420150K .......... .......... .......... .......... .......... 99% 4.39M 0s
420200K .......... ...                                        100%  108M=4m20s

2021-05-13 18:44:49 (1.58 MB/s) - ‘./pretrained/mechanic-granular.tar.gz’ saved [430298612/430298612]


Removing intermediate container 2024d2013fad
 ---> b44f296f2ce3
Step 19/27 : COPY scripts/pretrained/get_scibert.py /tmp/get_scibert.py
COPY failed: stat /var/lib/docker/tmp/docker-builder787936846/scripts/pretrained/get_scibert.py: no such file or directory

I understand that the Dockerfile isn't officially supported; I just want to log this here in case someone encounters the same issue.

Docker Build Fail

I cloned this repository from the master (main) branch, but I saw the following error message:

[internal] load build definition from Dockerfile 0.2s
=> => transferring dockerfile: 3.33kB 0.1s
=> [internal] load .dockerignore 0.2s
=> => transferring context: 443B 0.1s
=> [internal] load metadata for docker.io/pytorch/pytorch:1.6.0-cuda10.1-cudnn7-devel 6.2s
=> [auth] pytorch/pytorch:pull token for registry-1.docker.io 0.0s
=> CANCELED [ 1/29] FROM docker.io/pytorch/pytorch:1.6.0-cuda10.1-cudnn7-devel@sha256:ccebb46f954b1d32a4700aaeae0e24bd68653f92c6f276a608bf592b660b63d7 122.3s
=> => resolve docker.io/pytorch/pytorch:1.6.0-cuda10.1-cudnn7-devel@sha256:ccebb46f954b1d32a4700aaeae0e24bd68653f92c6f276a608bf592b660b63d7 0.0s
=> => sha256:bb833e4d631feff31ab57559d64617ad895d3ae7f45fdb651f9ba2df50b183b7 10.06kB / 10.06kB 0.0s
=> => sha256:8c3b70e3904492c753652606df4726430426f42ea56e06ea924d6fea7ae162a1 845B / 845B 0.0s
=> => sha256:ee9b457b77d047ff322858e2de025e266ff5908aec569560e77e2e4451fc23f4 184B / 184B 0.0s
=> => sha256:ccebb46f954b1d32a4700aaeae0e24bd68653f92c6f276a608bf592b660b63d7 3.05kB / 3.05kB 0.0s
=> => sha256:c1bbdc448b7263673926b8fe2e88491e5083a8b4b06ddfabf311f2fc5f27e2ff 35.36kB / 35.36kB 0.0s
=> => sha256:7ddbc47eeb70dc7f08e410a6667948b87ff3883024eb41478b44ef9a81bf400c 26.69MB / 26.69MB 0.0s
=> => sha256:45d437916d5781043432f2d72608049dcf74ddbd27daa01a25fa63c8f1b9adc4 162B / 162B 0.0s
=> => sha256:d8f1569ddae616589c5a2dabf668fadd250ee9d89253ef16f0cb0c8a9459b322 7.22MB / 7.22MB 0.0s
=> => sha256:85386706b02069c58ffaea9de66c360f9d59890e56f58485d05c1a532ca30db1 8.45MB / 8.45MB 0.0s
=> => sha256:be4f3343ecd31ebf7ec8809f61b1d36c2c2f98fc4e63582401d9108575bc443a 77.59MB / 688.74MB 123.2s
=> => sha256:30b4effda4fdab95ec4eba8873f86e7574c2edddf4dc5df8212e3eda1545aafa 90.18MB / 820.84MB 123.2s
=> => sha256:b398e882f4149bf61faa8f2c1d47a4fe98b8fe1b2c9379da1d58ddc54fe67cf0 110.10MB / 532.41MB 123.2s
=> => extracting sha256:7ddbc47eeb70dc7f08e410a6667948b87ff3883024eb41478b44ef9a81bf400c 15.6s
=> => extracting sha256:c1bbdc448b7263673926b8fe2e88491e5083a8b4b06ddfabf311f2fc5f27e2ff 0.0s
=> => extracting sha256:8c3b70e3904492c753652606df4726430426f42ea56e06ea924d6fea7ae162a1 0.0s
=> => extracting sha256:45d437916d5781043432f2d72608049dcf74ddbd27daa01a25fa63c8f1b9adc4 0.0s
=> => extracting sha256:d8f1569ddae616589c5a2dabf668fadd250ee9d89253ef16f0cb0c8a9459b322 5.3s
=> => extracting sha256:85386706b02069c58ffaea9de66c360f9d59890e56f58485d05c1a532ca30db1 2.7s
=> => extracting sha256:ee9b457b77d047ff322858e2de025e266ff5908aec569560e77e2e4451fc23f4 0.0s
=> [internal] load build context 122.3s
=> => transferring context: 903.71MB 122.1s
=> CACHED [ 2/29] RUN mkdir /dygiepp 0.0s
=> CACHED [ 3/29] RUN apt-get update && apt-get -y install gcc make sqlite3 0.0s
=> CACHED [ 4/29] RUN conda create --name dygiepp python=3.7 -y 0.0s
=> CACHED [ 5/29] RUN conda install -c conda-forge jsonnet -y 0.0s
=> CACHED [ 6/29] COPY requirements.txt /tmp/requirements.txt 0.0s
=> CACHED [ 7/29] RUN pip install -r /tmp/requirements.txt 0.0s
=> CACHED [ 8/29] RUN conda create --name ace-event-preprocess python=3.7 -y 0.0s
=> CACHED [ 9/29] COPY scripts/data/ace-event/requirements.txt /tmp/ace-prep-requirements.txt 0.0s
=> CACHED [10/29] RUN pip install -r /tmp/ace-prep-requirements.txt 0.0s
=> CACHED [11/29] RUN python -m spacy download en 0.0s
=> CACHED [12/29] RUN apt-get install openjdk-8-jdk openjdk-8-jre wget unzip -y 0.0s
=> CACHED [13/29] COPY scripts/data/ace05/get_corenlp.sh /tmp/get_corenlp.sh 0.0s
=> CACHED [14/29] RUN cd /dygiepp/ && bash /tmp/get_corenlp.sh 0.0s
=> CACHED [15/29] RUN conda install -c conda-forge zsh -y 0.0s
=> CACHED [16/29] RUN apt-get install unzip wget -y 0.0s
=> CACHED [17/29] COPY scripts/data/shared /dygiepp/scripts/data/shared 0.0s
=> CACHED [18/29] COPY scripts/data/get_scierc.sh /tmp/get_scierc.sh 0.0s
=> CACHED [19/29] COPY dygie /dygiepp/dygie 0.0s
=> CACHED [20/29] RUN cd /dygiepp && bash /tmp/get_scierc.sh 0.0s
=> CACHED [21/29] RUN apt-get install wget -y 0.0s
=> CACHED [22/29] COPY scripts/pretrained/get_dygiepp_pretrained.sh /tmp/get_dygiepp_pretrained.sh 0.0s
=> CACHED [23/29] RUN cd /dygiepp && bash /tmp/get_dygiepp_pretrained.sh 0.0s
=> ERROR [24/29] COPY scripts/pretrained/get_scibert.py /tmp/get_scibert.py 0.0s


[24/29] COPY scripts/pretrained/get_scibert.py /tmp/get_scibert.py:


failed to compute cache key: "/scripts/pretrained/get_scibert.py" not found: not found


Looking forward to hearing from you soon!

KeyError: 'None__ner_labels' when predicting on a new dataset

I encounter the error KeyError: 'None__ner_labels' when I try to use dygiepp to predict on a new dataset. The details follow:

Traceback (most recent call last):
File "/home/wangpancheng/anaconda3/envs/dygiepp/bin/allennlp", line 8, in
sys.exit(run())
File "/home/wangpancheng/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/main.py", line 34, in run
main(prog="allennlp")
File "/home/wangpancheng/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/init.py", line 118, in main
args.func(args)
File "/home/wangpancheng/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/predict.py", line 220, in _predict
manager.run()
File "/home/wangpancheng/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/predict.py", line 187, in run
for model_input_instance, result in zip(batch, self._predict_instances(batch)):
File "/home/wangpancheng/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/commands/predict.py", line 146, in _predict_instances
results = [self._predictor.predict_instance(batch_data[0])]
File "/serverData2/wangpancheng/wpc/dygiepp/dygie/predictors/dygie.py", line 56, in predict_instance
prediction = model.make_output_human_readable(model(**model_input)).to_json()
File "/home/wangpancheng/anaconda3/envs/dygiepp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/serverData2/wangpancheng/wpc/dygiepp/dygie/models/dygie.py", line 239, in forward
spans, span_mask, span_embeddings, sentence_lengths, ner_labels, metadata)
File "/home/wangpancheng/anaconda3/envs/dygiepp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/serverData2/wangpancheng/wpc/dygiepp/dygie/models/ner.py", line 90, in forward
scorer = self._ner_scorers[self._active_namespace]
File "/home/wangpancheng/anaconda3/envs/dygiepp/lib/python3.7/site-packages/torch/nn/modules/container.py", line 286, in getitem
return self._modules[key]
KeyError: 'None__ner_labels'
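
This error typically indicates that the input documents are missing a "dataset" field: DyGIE++ namespaces its labels per dataset, and the NER module looks up a scorer under <dataset>__ner_labels, so an absent field yields the key None__ner_labels. A minimal sketch, assuming a JSON-lines input file (input.jsonl is a hypothetical name) and a model trained on SciERC, whose dataset name would be "scierc"; substitute the dataset name your model was trained with:

# Hedged sketch: add a `dataset` field to each document so the model can
# resolve the right label namespace (e.g. "scierc__ner_labels").
# Assumes input.jsonl exists; the output path is likewise hypothetical.
import json

with open("input.jsonl") as f_in, open("input_fixed.jsonl", "w") as f_out:
    for line in f_in:
        doc = json.loads(line)
        doc["dataset"] = "scierc"  # must match the training dataset's name
        f_out.write(json.dumps(doc) + "\n")

After rewriting the file, rerunning allennlp predict on the fixed file should get past this lookup.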

dataset format

I have used the code to preprocess the ACE05 dataset. I am wondering if you could explain the output format.

I got 351/80/80 lines in the train/dev/test JSON files. One line from train.json looks like this:

{"sentences": [["CNN_ENG_20030515_073019", ".7"], ["NEWS", "STORY"], ["2003-05-15", "09:52:27"], ["earlier", "we", "talk", "about", "a", "new", "book", "claiming", "that", "president", "john", "kennedy", "had", "an", "affair", "with", "a", "white", "house", "intern", "early", "1960s", "."], ["a", "kennedy", "biographer", "robert", "dallek", "came", "across", "the", "story", "while", "doing", "research", "."], ["the", "woman", "'s", "name", "has", "remain", "a", "mystery", "."], ["the", "60-year-old", "tells", "the", "new", "york", "daily", "news", "and", "others", "dpa", "she", "is", "glad", "to", "have", "the", "weight", "she", "'s", "been", "carrying", "for", "41", "years", "now", "off", "her", "shoulders", "."], ["she", "says", "she", "was", "19", "at", "the", "time", ",", "working", "in", "d.c", ".", "at", "the", "white", "house", ",", "1962", ",", "1963", ",", "she", "says", "today", "the", "allegations", "about", "her", "affair", "are", "the", "truth", "."], ["right", "now", "she", "lives", "on", "the", "upper", "east", "side", ",", "works", "at", "a", "presbyterian", "church", ",", "has", "two", "married", "daughters", "and", "after", "the", "news", "is", "out", ",", "after", "carrying", "it", "for", "41", "years", ",", "she", "feels", "better", "about", "it", "."], ["this", "news", "breaking", "just", "today", "."], ["robert", "dallek", "tried", "to", "do", "the", "research", "to", "track", "this", "woman", "down", "in", "the", "book", ",", "it", "did", "not", "happen", "but", "now", "she", "has", "indeed", "come", "forward", "."], ["2003-05-15", "09:53:14"]], "ner": [[], [], [], [[7, 7, "PER"], [16, 17, "PER"], [15, 15, "PER"], [25, 25, "PER"], [23, 24, "ORG"]], [[30, 30, "PER"], [31, 31, "PER"], [32, 33, "PER"]], [[43, 43, "PER"]], [[52, 52, "PER"], [62, 62, "PER"], [69, 69, "PER"], [78, 78, "PER"], [55, 58, "ORG"], [60, 60, "ORG"]], [[92, 92, "GPE"], [83, 83, "PER"], [103, 103, "PER"], [109, 109, "PER"], [81, 81, "PER"], [96, 97, "ORG"]], [[121, 123, "LOC"], [134, 134, "PER"], [129, 129, "ORG"], [117, 117, "PER"], [149, 149, "PER"]], [], [[171, 171, "PER"], [183, 183, "PER"], [161, 162, "PER"]], []], "relations": [[], [], [], [[16, 17, 25, 25, "PER-SOC"], [25, 25, 23, 24, "ORG-AFF"]], [], [], [], [[83, 83, 96, 97, "ORG-AFF"], [81, 81, 92, 92, "PHYS"]], [[117, 117, 121, 123, "GEN-AFF"], [117, 117, 129, 129, "ORG-AFF"], [117, 117, 134, 134, "PER-SOC"]], [], [], []], "clusters": [], "doc_key": "CNN_ENG_20030515_073019.7"}
