goodbai-nlp / amrbart

Code for our paper "Graph Pre-training for AMR Parsing and Generation" in ACL2022

License: MIT License

Shell 2.11% Python 93.27% Perl 4.61%
semantic pre-training amrparsing generation

amrbart's Introduction

AMRBART

The refactored implementation for the ACL 2022 paper "Graph Pre-training for AMR Parsing and Generation". You can find our paper here (arXiv). The original implementation is available here.


News🎈

  • (2022/12/10) Fixed max_length bugs in AMR parsing and updated the results.
  • (2022/10/16) Released the AMRBART-v2 models, which are simpler, faster, and stronger.

Requirements

  • python 3.8
  • pytorch 1.8
  • transformers 4.21.3
  • datasets 2.4.0
  • Tesla V100 or A100

We recommend using conda to manage virtual environments:

conda env update --name <env> --file requirements.yml

Data Processing

You may download the AMR corpora at LDC.

Please follow this repository to preprocess AMR graphs:

bash run-process-acl2022.sh

Usage

Our models are available on the Hugging Face hub. Here is how to initialize an AMR parsing model in PyTorch:

from transformers import BartForConditionalGeneration
from model_interface.tokenization_bart import AMRBartTokenizer      # We use our own tokenizer to process AMRs

model = BartForConditionalGeneration.from_pretrained("xfbai/AMRBART-large-finetuned-AMR3.0-AMRParsing-v2")
tokenizer = AMRBartTokenizer.from_pretrained("xfbai/AMRBART-large-finetuned-AMR3.0-AMRParsing-v2")
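
Continuing from the snippet above, a sentence can then be parsed by encoding it, calling generate, and decoding the output. This is only a minimal sketch using the plain encode/generate/decode calls; the official inference scripts (see inference_amr.sh below) apply additional pre- and post-processing, and the max_length and num_beams values here are illustrative rather than the settings used in the paper.

sentence = "The boy wants to go to New York."
input_ids = tokenizer.encode(sentence, return_tensors="pt")              # tokenize the input sentence
outputs = model.generate(input_ids, max_length=1024, num_beams=5)        # illustrative decoding settings
linearized_amr = tokenizer.decode(outputs[0], skip_special_tokens=True)  # linearized AMR with <pointer:X> variables
print(linearized_amr)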

Pre-training

bash run-posttrain-bart-textinf-joint-denoising-6task-large-unified-V100.sh "facebook/bart-large"

Fine-tuning

For AMR Parsing, run

bash train-AMRBART-large-AMRParsing.sh "xfbai/AMRBART-large-v2"

For AMR-to-text Generation, run

bash train-AMRBART-large-AMR2Text.sh "xfbai/AMRBART-large-v2"

Evaluation

cd evaluation

For AMR Parsing, run

bash eval_smatch.sh /path/to/gold-amr /path/to/predicted-amr

For better results, you can postprocess the predicted AMRs using the BLINK tool following SPRING.

For AMR-to-text Generation, run

bash eval_gen.sh /path/to/gold-text /path/to/predicted-text

Inference on your own data

If you want to run our code on your own data, first convert it into the format here (a sketch of the expected format follows the commands below), then run the corresponding script.

For AMR Parsing, run

bash inference_amr.sh "xfbai/AMRBART-large-finetuned-AMR3.0-AMRParsing-v2"

For AMR-to-text Generation, run

bash inference_text.sh "xfbai/AMRBART-large-finetuned-AMR3.0-AMR2Text-v2"
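
For reference, each line of the example data is a JSON object with a "sent" field and an "amr" field, as in the sample shown under "Question about AMR data format" in the issues below. The snippet is a minimal sketch for preparing parsing input; the output path is a placeholder, and leaving "amr" empty for parsing (and "sent" empty for generation) is an assumption based on that sample rather than documented behaviour.

import json

# Hypothetical helper: write raw sentences into the {"sent": ..., "amr": ...}
# JSON-lines layout; the file name is a placeholder, not one expected by the scripts.
sentences = ["The boy wants to go to New York.", "It is raining."]
with open("my-data/test.jsonl", "w", encoding="utf-8") as f:
    for sent in sentences:
        f.write(json.dumps({"sent": sent, "amr": ""}) + "\n")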

Pre-trained Models

Pre-trained AMRBART

Setting          Params   Checkpoint
AMRBART-large    409M     model

Fine-tuned models on AMR-to-Text Generation

Setting                   BLEU (JAMR_tok)   Sacre-BLEU   Checkpoint   Output
AMRBART-large (AMR2.0)    50.76             50.44        model        output
AMRBART-large (AMR3.0)    50.29             50.38        model        output

To get the tokenized BLEU score, you need to use the scorer we provide here; we use this script to ensure comparability with previous approaches.
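
The detokenized Sacre-BLEU column can be computed with the standard sacrebleu package. The snippet below is a minimal sketch, not the repository's own scorer; the file paths are placeholders with one prediction and one reference per line.

import sacrebleu

# Placeholder paths: one hypothesis / reference per line, detokenized text.
hyps = open("predictions.txt", encoding="utf-8").read().splitlines()
refs = open("references.txt", encoding="utf-8").read().splitlines()

# corpus_bleu takes the hypotheses and a list of reference streams.
score = sacrebleu.corpus_bleu(hyps, [refs])
print(round(score.score, 2))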

Fine-tuned models on AMR Parsing

Setting                   Smatch (amrlib)   Smatch (amr-evaluation)   Smatch++ (smatchpp)   Checkpoint   Output
AMRBART-large (AMR2.0)    85.5              85.3                      85.4                  model        output
AMRBART-large (AMR3.0)    84.4              84.2                      84.3                  model        output

Acknowledgements

We thank the authors of SPRING, amrlib, and BLINK for sharing the open-source scripts used in this project.

References

@inproceedings{bai-etal-2022-graph,
    title = "Graph Pre-training for {AMR} Parsing and Generation",
    author = "Bai, Xuefeng  and
      Chen, Yulong  and
      Zhang, Yue",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.415",
    pages = "6001--6015"
}

amrbart's People

Contributors

cylnlp, flipz357, goodbai-nlp, zoher15


amrbart's Issues

Generate random AMRs

Hi! I was trying to use the checkpoint to run inference on my own data, but I found that the model sometimes generates strange AMRs like this:

 ( <pointer:0> console-01 :ARG0 ( <pointer:1> bug :ARG1-of ( <pointer:2> chase-01 :ARG0 ( <pointer:3> child :ARG0-of ( <pointer:4> nervous-01 ) ) :time ( <pointer:5> month :mod ( <pointer:6> last ) ) ) :mod <pointer:3> ) :ARG1 <pointer:3> )</AMR> ( <pointer:7> kid ) )</AMR></AMR></AMR> )</AMR>-of ( <pointer:8> consol-01 :ARG0 <pointer:1> :ARG1 <pointer:7> )</AMR> :time ( <pointer:9> month :mod <pointer:6> ) :mod ( <pointer:10> last ))</AMR></AMR> :ARG0 ( <pointer:11> i ) :ARG1 ( <pointer:12> person :wiki - :name ( <pointer:13> name :op1 <lit> The </lit> :op2 <lit> Big </lit> :op3 <lit> Bad </lit> :op4 <lit> Old </lit> :op5 <lit> One </lit> ) ) :ARG2 <pointer:7> :mod ( <pointer:14> only ) ) :ARG1-of <pointer:8> )</AMR> :ARG0 <pointer:11></AMR></AMR> :ARG3 ( <pointer:15> comfort-01 :ARG0 <pointer:11> :ARG1 <pointer:7> :time ( <pointer:16> now ) ) :op2 ( <pointer:17> next ) ) :ARG3-of ( <pointer:18> console-01 :ARG1 <pointer:7></AMR> ) :time <pointer:9> )</AMR> <lit></AMR></AMR> :time <pointer:16></AMR></AMR> :mod ( <pointer:19> today ) ) :ARG0-of</AMR></AMR> :ARG1 (</AMR></AMR> :ARG2 ( <pointer:69></AMR> :ARG0</AMR> :ARG1</AMR> ) :ARG3 (</AMR> now )</AMR> ) ) :quant</AMR></AMR> :op2</AMR></AMR>lic</AMR></AMR> :medium</AMR></AMR> :op3</AMR></AMR>BN</AMR></AMR> </lit></AMR></AMR> :frequency</AMR></AMR> :quant</AMR> ) :ARG0</AMR></AMR> <lit> )</AMR> :mod</AMR></AMR>now</AMR></AMR>roid</AMR></AMR>icks</AMR></AMR>oner</AMR></AMR> :instrument</AMR></AMR>yan</AMR></AMR>ably</AMR></AMR> privately</AMR></AMR>throp</AMR></AMR> :domain</AMR></AMR> :duration (</AMR> :time</AMR></AMR>Now</AMR></AMR> seconds</AMR></AMR> flesh</AMR></AMR>chen</AMR></AMR>B</AMR></AMR>illion</AMR></AMR>');</AMR></AMR> collectively</AMR></AMR> weekday</AMR></AMR> :consist-of</AMR></AMR>olt</AMR></AMR>new</AMR></AMR>ankind</AMR></AMR>));</AMR></AMR>Gener</AMR></AMR> hardcore</AMR></AMR> Blackburn</AMR></AMR> November</AMR></AMR>enna</AMR></AMR>cester</AMR></AMR>Face</AMR></AMR>");</AMR></AMR> Nov</AMR></AMR>neck</AMR></AMR>'),</AMR></AMR>UE</AMR></AMR>ainer</AMR></AMR>min</AMR></AMR>athi</AMR></AMR>gas</AMR></AMR>BC</AMR></AMR>aman</AMR></AMR>Sing</AMR></AMR>be</AMR></AMR> coral</AMR></AMR>fer</AMR></AMR>lar</AMR></AMR> have</AMR></AMR>Beg</AMR></AMR> Bearing</AMR></AMR>Proof</AMR></AMR>can</AMR></AMR> now</AMR></AMR>Two</AMR></AMR>bin</AMR></AMR>Be</AMR></AMR>external</AMR></AMR>semb</AMR></AMR>among</AMR></AMR>christ</AMR></AMR>Having</AMR></AMR>bling</AMR></AMR> Weeks</AMR></AMR>other</AMR></AMR> having</AMR></AMR> Citizen</AMR></AMR>tex</AMR></AMR>liction</AMR></AMR> Kimmel</AMR></AMR>deg</AMR></AMR> Various</AMR></AMR> Liter</AMR></AMR>ening</AMR></AMR> whisk</AMR></AMR> counted</AMR></AMR> Nature</AMR></AMR> Parenthood</AMR></AMR>ched</AMR></AMR>bearing</AMR></AMR> Having</AMR></AMR>ching</AMR></AMR>gin</AMR></AMR>oys</AMR></AMR>raise</AMR></AMR>che</AMR></AMR>animate</AMR></AMR>having</AMR></AMR> The</AMR></AMR> denote</AMR></AMR> Memorial</AMR></AMR>anthrop</AMR></AMR> Licensed</AMR></AMR> differences</AMR></AMR> euphem</AMR></AMR> fung</AMR></AMR>licted</AMR></AMR>atars</AMR></AMR>becue</AMR></AMR> Mens</AMR></AMR>add</AMR></AMR> enlisted</AMR></AMR> of</AMR></AMR>asion</AMR></AMR>chester</AMR></AMR>equipped</AMR></AMR>ometime</AMR></AMR> being</AMR></AMR>gener</AMR></AMR>Building</AMR></AMR>World</AMR></AMR> Motorsport</AMR></AMR>some</AMR></AMR> Roads</AMR></AMR> Blood</AMR></AMR> Clockwork</AMR></AMR>gan</AMR></AMR>Central</AMR></AMR>raised</AMR></AMR> Summoner</AMR></AMR>of</AMR></AMR> 
Months</AMR></AMR>neys</AMR></AMR>iday</AMR></AMR>ueller</AMR></AMR> sometime</AMR></AMR>phalt</AMR></AMR>'d</AMR></AMR>starter</AMR></AMR> occasions</AMR></AMR> membership</AMR></AMR> be</AMR></AMR> City</AMR></AMR>uers</AMR></AMR> Animals</AMR></AMR>deen</AMR></AMR>match</AMR></AMR>world</AMR></AMR> Racing</AMR></AMR>idences</AMR></AMR>isc</AMR></AMR> starters</AMR></AMR>together</AMR></AMR>Several</AMR></AMR>Gen</AMR></AMR>earing</AMR></AMR> able</AMR></AMR> Scroll</AMR></AMR> Goo</AMR></AMR> buff</AMR></AMR> Wedding</AMR></AMR>�</AMR></AMR> must</AMR></AMR>Club</AMR></AMR> months</AMR></AMR> rubbing</AMR></AMR>go</AMR></AMR> accelerate</AMR></AMR>isson</AMR></AMR>acking</AMR></AMR>cellaneous</AMR></AMR> rake</AMR></AMR>Go</AMR></AMR> tongues</AMR></AMR>carry</AMR></AMR> donor</AMR></AMR> Were</AMR></AMR> Built</AMR></AMR> advertising</AMR></AMR>condition</AMR></AMR> lips</AMR></AMR>gal</AMR></AMR> Styles</AMR></AMR> existed</AMR></AMR> samples</AMR></AMR>Demon</AMR></AMR>rounded</AMR></AMR>g</AMR></AMR> entitled</AMR></AMR> hired</AMR></AMR> classify</AMR></AMR>i</AMR></AMR> were</AMR></AMR> face</AMR></AMR> Organ</AMR></AMR>casting</AMR></AMR>making</AMR></AMR> Insurance</AMR></AMR>raising</AMR></AMR> listener</AMR></AMR> Food</AMR></AMR> chromos</AMR></AMR> fictional</AMR></AMR> pouring</AMR></AMR> generating</AMR></AMR>soc</AMR></AMR>el</AMR></AMR> making</AMR></AMR>Mania</AMR></AMR>asons</AMR></AMR>cies</AMR></AMR> Soc</AMR></AMR> Bing</AMR></AMR> joining</AMR></AMR> fictitious</AMR></AMR>buy</AMR></AMR>chet</AMR></AMR> affect</AMR></AMR>M</AMR></AMR> innocuous</AMR></AMR>bert</AMR></AMR> stim</AMR></AMR> Roof</AMR></AMR>called</AMR></AMR>connect</AMR></AMR> classified</AMR></AMR> Alright</AMR></AMR> rabid</AMR></AMR> become</AMR></AMR>ass</AMR></AMR>eding</AMR></AMR> congregation</AMR></AMR> facial</AMR></AMR> catering</AMR></AMR>aha</AMR></AMR>giving</AMR></AMR> match</AMR></AMR>asses</AMR></AMR>make</AMR></AMR> Action</AMR></AMR> some</AMR></AMR>air</AMR></AMR> Club</AMR></AMR> Society</AMR></AMR>oir</AMR></AMR>onga</AMR></AMR> kickoff</AMR></AMR> forming</AMR></AMR>dat</AMR></AMR> January</AMR></AMR>inski</AMR></AMR>Names</AMR></AMR> signatures</AMR></AMR> bond</AMR></AMR>connection</AMR></AMR> ticking</AMR></AMR> December</AMR></AMR>ong</AMR></AMR>pie</AMR></AMR> cards</AMR></AMR> fundraising</AMR></AMR> organs</AMR></AMR> cannibal</AMR></AMR> cater</AMR></AMR> occur</AMR></AMR> Gen</AMR></AMR>red</AMR></AMR> Regulatory</AMR></AMR>rag</AMR>

Is this normal?
FYI, this is the code I used to generate the AMRs:

from transformers import BartForConditionalGeneration
from model_interface.tokenization_bart import AMRBartTokenizer
from pathlib import Path
import argparse
from tqdm import tqdm
# Load tokenizer and model

parser = argparse.ArgumentParser(description='Process some integers.')
parser.add_argument('--input', type=str,
                    help='input document')
parser.add_argument('--output', type=str, help='the document to save')

args = parser.parse_args()
model = BartForConditionalGeneration.from_pretrained("xfbai/AMRBART-large-finetuned-AMR3.0-AMRParsing-v2")
tokenizer = AMRBartTokenizer.from_pretrained("xfbai/AMRBART-large-finetuned-AMR3.0-AMRParsing-v2")
max_length = model.config.max_length
print(max_length)
input_sents =  Path(args.input).read_text().strip().split('\n')
with open(args.output, 'w') as pred:
    for sent in tqdm(input_sents):
        input_ids = tokenizer.encode(sent, return_tensors="pt")
        output = model.generate(input_ids, max_length=1024)
        amr_graph = tokenizer.decode(output[0], skip_special_tokens=True)
        pred.write(f'{amr_graph}\n\n')

I found that max_length in model.config is 20, so I manually set it to 1024.
Thanks!

HTTP Error while running evaluation script

Hi,
I am facing the error below when running the command: bash finetune_AMRbart_amrparsing.sh /path/to/pre-trained/AMRBART/ gpu_id
I am unable to find the cause and would really appreciate any help! Thanks a lot!
[screenshot of the error]

AMRtotxt Inference on own data

Hi, I'm trying to use the inference script to run inference on my own AMR graphs, but it keeps producing the same predictions as for the data in the example folder. After deleting the cache file inside the example folder, it seems to work... Could you check whether the cache file is affecting inference?
Also, I'm quite confused about why the inference script needs train and val datasets; could you explain this?
Thank you for your kind help!

Question about fine-tuned models for AMR parsing

Hi,
First of all, thank you for your great work!
I recently got interested in AMR and found your work among the accepted papers at ACL.

I have a question about the fine-tuned models that you shared for AMR parsing.
Can those models be used on plain text to generate AMR graphs?
Also, do you provide code for parsing plain text into AMR graphs, like the SPRING model does?

Thanks! :)
Hope you have a good one!

ValueError while running inference-amr.sh

Hi,
I'm trying to run inference on my own data for AMR parsing and followed the instructions (on the main branch) in README.md.
I'm getting the error below:
[screenshot of the error]

So I set fp16 to False in inference-amr.sh, and it gave me the error below.
[screenshot of the error]

Have you experienced this kind of situation before?

Thanks!

Best,
Paul

Tokenizer for AMRBART-large-finetuned-AMR3.0-AMRParsing

I noticed that for the fine-tuned AMRBART models there are no tokenizers on the Hugging Face hub, whereas the v2 models have tokenizers with a different vocab size (v1: 53844 vs. v2: 53228). My questions are:

  1. Where can I get the tokenizers for those fine-tuned models?
  2. Is there a description of the tokens used in the v2 models (I found that the newly added tokens in the v2 models differ from the tokens illustrated in the paper)?
  3. Is it OK to use BartTokenizer to load the pre-trained AMR tokenizers?

Thank you!

Using AMRBART for AMR parsing on my own data

Hi,
I am testing AMRBART on my own text data, but when I run the inference-amr.sh script, it says it needs to read the model from the cache. I would like to know whether I need to download the model locally and modify the BasePath in inference-amr.sh. Here is the screenshot:
[screenshot]

PenmanBART Tokenizer

Thank you for sharing your great work. I would like to ask if you could upload the weights of your trained PENMANBartTokenizer. Thank you for your help!

Excuse me, I meet with this problem

When I run bash eval_AMRbart_amrparsing.sh /path/to/fine-tuned/AMRBART/ gpu_id, I run into this problem:

Global seed set to 42
Traceback (most recent call last):
File "/home/jsj201-4/mount1/jym/AMR-Parser/AMRBART-main/fine-tune/run_amrparsing.py", line 157, in
main(args)
File "/home/jsj201-4/mount1/jym/AMR-Parser/AMRBART-main/fine-tune/run_amrparsing.py", line 91, in main
raw_graph=False,
File "/home/jsj201-4/mount1/jym/AMR-Parser/AMRBART-main/spring/spring_amr/tokenization_bart.py", line 44, in from_pretrained
inst = super().from_pretrained(pretrained_model_path, *args, **kwargs)
File "/home/jsj201-4/.conda/envs/kinyum/lib/python3.6/site-packages/transformers/tokenization_utils_base.py", line 1708, in from_pretrained
raise EnvironmentError(msg)
OSError: Can't load tokenizer for '../../../data/pretrained-model/bart-large'. Make sure that:

  • '../../../data/pretrained-model/bart-large' is a correct model identifier listed on 'https://huggingface.co/models'

  • or '../../../data/pretrained-model/bart-large' is the correct path to a directory containing relevant tokenizer files

So what should I do?

AMRBART models are not available on the Hugging Face hub

Thank you for sharing this excellent work. Could you re-upload the AMRBART-large and base models? I keep getting this error:

OSError: Can't load config for 'xfbai/AMRBART-base'. Make sure that:

  • 'xfbai/AMRBART-base' is a correct model identifier listed on 'https://huggingface.co/models'

  • or 'xfbai/AMRBART-base' is the correct path to a directory containing a config.json file

Thank you for your help!

Question about AMR data format

I want to run AMR-to-text inference on my own data. However, I do not understand how to convert my AMRs into this format. Could you explain? Also, what does <pointer:X> mean? Could you provide the code for translating AMRs into this format?

{"sent": "", "amr": "( <pointer:0> pledge-01 :mode imperative :ARG0 ( <pointer:1> you ) :ARG2 ( <pointer:2> fight-01 :ARG0 <pointer:1> :ARG2 ( <pointer:3> defend-01 :ARG0 <pointer:1> :ARG1 ( <pointer:4> and :op1 ( <pointer:5> island :wiki \"Senkaku_Islands\" :name ( <pointer:6> name :op1 \"Diaoyu\" :op2 \"Islands\" ) ) :op2 ( <pointer:7> island :ARG1-of ( <pointer:8> relate-01 :ARG2 <pointer:5> ) ) ) ) :manner ( <pointer:9> die-01 :ARG1 <pointer:1> ) ) )"}

Thanks!

Is there a way to convert the generated AMRs to the original format?

Hi, I was just wondering if there is a way to convert the generated AMRs (with pointer:1, 2, ...) into conventional AMRs (p1 / person ...). I checked the AMR process repository, but post-processing.py only provides a few functions without explaining how to use them. Should I do it myself? Or is there something I'm missing here? Thanks a lot!

'PENMANBartTokenizer' object has no attribute 'amr_bos_token_id'

Hello,
when using the script inference_amr.sh I receive the following error:

Please answer yes or no.
Global seed set to 42
Tokenizer: 53587 PreTrainedTokenizer(name_or_path='facebook/bart-large', vocab_size=53587, model_max_len=1024, is_fast=False, padding_side='right', special_tokens={'bos_token': 'Ġ<s>', 'eos_token': 'Ġ</s>', 'unk_token': 'Ġ<unk>', 'sep_token': AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'pad_token': 'Ġ<pad>', 'cls_token': AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'mask_token': AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=True)})
Traceback (most recent call last):
  File "/home/students/meier/MA/AMRBART/fine-tune/inference_amr.py", line 105, in <module>
    main(args)
  File "/home/students/meier/MA/AMRBART/fine-tune/inference_amr.py", line 65, in main
    data_module = AMRParsingDataModule(amr_tokenizer, **vars(args))
  File "/home/students/meier/MA/AMRBART/fine-tune/data_interface/dataset_pl.py", line 228, in __init__
    decoder_start_token_id=self.tokenizer.amr_bos_token_id,
AttributeError: 'PENMANBartTokenizer' object has no attribute 'amr_bos_token_id'

The facebook/bart-large tokenizer is used. This error is new; when I used the scripts six to eight weeks ago, everything worked fine.

A similar error occurs when using inference_text.sh:

Please answer yes or no.
Global seed set to 42
Tokenizer: 53587 PreTrainedTokenizer(name_or_path='facebook/bart-large', vocab_size=53587, model_max_len=1024, is_fast=False, padding_side='right', special_tokens={'bos_token': 'Ġ<s>', 'eos_token': 'Ġ</s>', 'unk_token': 'Ġ<unk>', 'sep_token': AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'pad_token': 'Ġ<pad>', 'cls_token': AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'mask_token': AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=True)})
Dataset cache dir: /home/students/meier/MA/AMRBART/fine-tune/../examples/.cache/
Using custom data configuration default-288dad464b8291c3
Downloading and preparing dataset amr_data/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /home/students/meier/MA/AMRBART/fine-tune/../examples/.cache/amr_data/default-288dad464b8291c3/1.0.0/f0dfbe4d826478b18bc1ef4db7270a419c69c4ea4c94fbf73515b13180f43059...
0 examples [00:00, ? examples/s] ... Dataset amr_data downloaded and prepared to /home/students/meier/MA/AMRBART/fine-tune/../examples/.cache/amr_data/default-288dad464b8291c3/1.0.0/f0dfbe4d826478b18bc1ef4db7270a419c69c4ea4c94fbf73515b13180f43059. Subsequent calls will reuse this data.
datasets: DatasetDict({
    train: Dataset({
        features: ['src', 'tgt'],
        num_rows: 10
    })
    validation: Dataset({
        features: ['src', 'tgt'],
        num_rows: 10
    })
    test: Dataset({
        features: ['src', 'tgt'],
        num_rows: 10
    })
})
colums: ['src', 'tgt']
Setting TOKENIZERS_PARALLELISM=false for forked processes.
Parameter 'function'=<function AMR2TextDataModule.setup.<locals>.tokenize_function at 0x154ba6915280> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
 #0:   0%|          | 0/1 [00:00<?, ?ba/s]
multiprocess.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/students/meier/amrbart_venv_new/lib/python3.8/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/students/meier/amrbart_venv_new/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 185, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/students/meier/amrbart_venv_new/lib/python3.8/site-packages/datasets/fingerprint.py", line 397, in wrapper
    out = func(self, *args, **kwargs)
  File "/home/students/meier/amrbart_venv_new/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2016, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/home/students/meier/amrbart_venv_new/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1906, in apply_function_on_filtered_inputs
    function(*fn_args, effective_indices, **fn_kwargs) if with_indices else function(*fn_args, **fn_kwargs)
  File "/home/students/meier/MA/AMRBART/fine-tune/data_interface/dataset_pl.py", line 72, in tokenize_function
    amr_tokens = [
  File "/home/students/meier/MA/AMRBART/fine-tune/data_interface/dataset_pl.py", line 74, in <listcomp>
    + [self.tokenizer.amr_bos_token]
AttributeError: 'PENMANBartTokenizer' object has no attribute 'amr_bos_token'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/students/meier/MA/AMRBART/fine-tune/run_amr2text.py", line 154, in <module>
    main(args)
  File "/home/students/meier/MA/AMRBART/fine-tune/run_amr2text.py", line 91, in main
    data_module.setup()
  File "/home/students/meier/amrbart_venv_new/lib/python3.8/site-packages/pytorch_lightning/core/datamodule.py", line 474, in wrapped_fn
    fn(*args, **kwargs)
  File "/home/students/meier/MA/AMRBART/fine-tune/data_interface/dataset_pl.py", line 117, in setup
    self.train_dataset = datasets["train"].map(
  File "/home/students/meier/amrbart_venv_new/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1744, in map
    transformed_shards = [r.get() for r in results]
  File "/home/students/meier/amrbart_venv_new/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1744, in <listcomp>
    transformed_shards = [r.get() for r in results]
  File "/home/students/meier/amrbart_venv_new/lib/python3.8/site-packages/multiprocess/pool.py", line 771, in get
    raise self._value
AttributeError: 'PENMANBartTokenizer' object has no attribute 'amr_bos_token'

0.3 lower Smatch score using amr-evaluation-enhanced

Dear authors,

Thank you for sharing your work, it's amazing. I just want to share a finding regarding the parsing evaluation. As far as I know, many existing works (like Cai & Lam, ACL 2020) use amr-evaluation-enhanced to compute the Smatch score. Running that script on your parsing output returns 84.0, which is slightly lower than the 84.3 you reported. I ran it multiple times and the result remained the same:

$ bash evaluation.sh data/model/amr3/bartamr/AMR3.0-test-pred-wiki.amr data/amr/amr_3.0/test.txt

Smatch -> P: 0.844, R: 0.836, F: 0.840
Unlabeled -> P: 0.867, R: 0.858, F: 0.862
No WSD -> P: 0.849, R: 0.841, F: 0.845
Non_sense_frames -> P: 0.918, R: 0.916, F: 0.917
Wikification -> P: 0.836, R: 0.817, F: 0.826
Named Ent. -> P: 0.893, R: 0.874, F: 0.884
Negations -> P: 0.716, R: 0.722, F: 0.719
IgnoreVars -> P: 0.746, R: 0.742, F: 0.744
Concepts -> P: 0.907, R: 0.900, F: 0.903
Frames -> P: 0.888, R: 0.885, F: 0.887
Reentrancies -> P: 0.721, R: 0.729, F: 0.725
SRL -> P: 0.801, R: 0.807, F: 0.804

I understand that Smatch uses a stochastic matching algorithm and that 0.3 is not significant at all; I just wanted to share this little finding with the community. Maybe we should migrate to the amrlib.evaluate.smatch_enhanced package you used, for better comparability.

Different decoding methods

Hi sir, thank you for your great work!

Have you run experiments with different decoding methods? I see in the code that you use beam search for decoding. Why did you choose beam search, and have you tried top-k sampling? Thank you.

Using HuggingFace AMRBartTokenizer

Hi @goodbai-nlp ,

This is great work! Thanks for making your model available on huggingface. Makes things easier.

However, I am not sure I follow the instructions for generating AMRs. I simply want to generate an AMR for a sentence. In your instructions, the easiest way to do so seems to be via Hugging Face:

from transformers import BartForConditionalGeneration
from model_interface.tokenization_bart import AMRBartTokenizer      # We use our own tokenizer to process AMRs

model = BartForConditionalGeneration.from_pretrained("xfbai/AMRBART-large-finetuned-AMR3.0-AMRParsing-v2")
tokenizer = AMRBartTokenizer.from_pretrained("xfbai/AMRBART-large-finetuned-AMR3.0-AMRParsing-v2")

Are you expecting us to install your repo as a Python package? If not, how do you expect us to import your tokenizer from model_interface in our own scripts?

InPlace Operation Error

Hello,

I'm trying to reproduce your pre-training script, run_multitask_unified_pretraining.py, and I have been struggling with the following error:

/home/jkeenan/.conda/envs/amrbart/lib/python3.8/site-packages/torch/autograd/__init__.py:145: UserWarning: Error detected in NllLossBackward. Traceback of forward call that caused the error:
  File "training.py", line 1320, in <module>
    main()
  File "training.py", line 1257, in main
    global_step, tr_loss = train(
  File "training.py", line 399, in train
    outputs = model(
  File "/home/jkeenan/.conda/envs/amrbart/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/jkeenan/amrbart/pre-train/model_interface/modeling_bart.py", line 1375, in forward
    masked_lm_loss = loss_fct(lm_logits.view(-1, self.config.vocab_size), labels.view(-1))
  File "/home/jkeenan/.conda/envs/amrbart/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/jkeenan/.conda/envs/amrbart/lib/python3.8/site-packages/torch/nn/modules/loss.py", line 1047, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "/home/jkeenan/.conda/envs/amrbart/lib/python3.8/site-packages/torch/nn/functional.py", line 2693, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/home/jkeenan/.conda/envs/amrbart/lib/python3.8/site-packages/torch/nn/functional.py", line 2388, in nll_loss
    ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
 (Triggered internally at  /opt/conda/conda-bld/pytorch_1616554793803/work/torch/csrc/autograd/python_anomaly_mode.cpp:104.)
  Variable._execution_engine.run_backward(
Iteration:   0%|          | 0/50 [00:01<?, ?it/s, lm_loss=16.9, lr=0]
Epoch:   0%|          | 0/4001 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "training.py", line 1320, in <module>
    main()
  File "training.py", line 1257, in main
    global_step, tr_loss = train(
  File "training.py", line 529, in train
    loss.backward()
  File "/home/jkeenan/.conda/envs/amrbart/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/jkeenan/.conda/envs/amrbart/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.LongTensor [42]] is at version 7; expected version 5 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

I've installed all of the packages into my own conda environment using your requirements.yml file.

The error seems to be tied to the labels and the fact that we are doing multitask training here, as I have been able to get the script to run without any issues if I only define one task. I have tried many different options to solve the issue but nothing seems to be working. Do you have any ideas?

KeyError 'source' when finetuning

Hello,
while testing fine-tuning on the example data in a conda environment, I encountered the following exception:

Traceback (most recent call last):
  File "/home/students/meier/AMRBART/fine-tune/run_amrparsing.py", line 154, in <module>
    main(args)
  File "/home/students/meier/AMRBART/fine-tune/run_amrparsing.py", line 129, in main
    trainer.fit(model, datamodule=data_module)
  File "/home/students/meier/anaconda3/envs/my_AMRBART_env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in fit
    self._call_and_handle_interrupt(
  File "/home/students/meier/anaconda3/envs/my_AMRBART_env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 682, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/students/meier/anaconda3/envs/my_AMRBART_env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/students/meier/anaconda3/envs/my_AMRBART_env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1193, in _run
    self._dispatch()
  File "/home/students/meier/anaconda3/envs/my_AMRBART_env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1272, in _dispatch
    self.training_type_plugin.start_training(self)
  File "/home/students/meier/anaconda3/envs/my_AMRBART_env/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
    self._results = trainer.run_stage()
  File "/home/students/meier/anaconda3/envs/my_AMRBART_env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1282, in run_stage
    return self._run_train()
  File "/home/students/meier/anaconda3/envs/my_AMRBART_env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1304, in _run_train
    self._run_sanity_check(self.lightning_module)
  File "/home/students/meier/anaconda3/envs/my_AMRBART_env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1368, in _run_sanity_check
    self._evaluation_loop.run()
  File "/home/students/meier/anaconda3/envs/my_AMRBART_env/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 151, in run
    output = self.on_run_end()
  File "/home/students/meier/anaconda3/envs/my_AMRBART_env/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 130, in on_run_end
    self._evaluation_epoch_end(outputs)
  File "/home/students/meier/anaconda3/envs/my_AMRBART_env/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 235, in _evaluation_epoch_end
    model.validation_epoch_end(outputs)
  File "/home/students/meier/AMRBART/fine-tune/model_interface/model_amrparsing.py", line 320, in validation_epoch_end
    source = flatten_list(x["source"] for x in ori_outputs)
  File "/home/students/meier/AMRBART/fine-tune/common/utils.py", line 109, in flatten_list
    return [x for x in itertools.chain.from_iterable(summary_ids)]
  File "/home/students/meier/AMRBART/fine-tune/common/utils.py", line 109, in <listcomp>
    return [x for x in itertools.chain.from_iterable(summary_ids)]
  File "/home/students/meier/AMRBART/fine-tune/model_interface/model_amrparsing.py", line 320, in <genexpr>
    source = flatten_list(x["source"] for x in ori_outputs)
KeyError: 'source'

Printing out "ori_outputs" shows this:
ori outputs [{'loss': tensor(0.8626, device='cuda:0'), 'gen_time': 8.689491331577301, 'gen_len': 1024.0, 'preds': [[53842, 36, 53069, 51012, 52944, 36, 53070, 171, 4839, 52945, 36, 53071, 14195, 4839, 4839, 53843, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
The key 'source' is missing.

My sh script looks like this:

#!/bin/bash

ROOT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"

GPUID=$2
MODEL=$1
eval_beam=5
modelcate=base
modelcate=large


lr=8e-6

datacate=/home/students/meier/AMRBART/examples/ #/home/students/meier/MA/data/ #AMR2.0
# datacate=AMR3.0


Tokenizer=facebook/bart-$modelcate  #../../../data/pretrained-model/bart-$modelcate
export OUTPUT_DIR_NAME=outputs/fine_tune_amrparse #${datacate}-AMRBart-${modelcate}-amrparsing-6taskPLM-5e-5-finetune-lr${lr}

export CURRENT_DIR=${ROOT_DIR}
export OUTPUT_DIR=${CURRENT_DIR}/${OUTPUT_DIR_NAME}
cache=~/.cache  #../../../data/.cache/

if [ ! -d $OUTPUT_DIR ];then
  mkdir -p $OUTPUT_DIR
else
  echo "${OUTPUT_DIR} already exists, change a new one or delete origin one"
  exit 0
fi

export OMP_NUM_THREADS=10
export CUDA_VISIBLE_DEVICES=${GPUID}
python -u ${ROOT_DIR}/run_amrparsing.py \
    --data_dir=$datacate \
    --train_data_file=$datacate/train.jsonl \
    --eval_data_file=$datacate/val.jsonl \
    --test_data_file=$datacate/test.jsonl \
    --model_type ${MODEL} \
    --model_name_or_path=${MODEL} \
    --tokenizer_name_or_path=${Tokenizer} \
    --val_metric "smatch" \
    --learning_rate=${lr} \
    --max_epochs 20 \
    --max_steps -1 \
    --per_gpu_train_batch_size=4 \
    --per_gpu_eval_batch_size=4 \
    --unified_input \
    --accumulate_grad_batches 2 \
    --early_stopping_patience 10 \
    --gpus 1 \
    --output_dir=${OUTPUT_DIR} \
    --cache_dir ${cache} \
    --num_sanity_val_steps 4 \
    --src_block_size=512 \
    --tgt_block_size=1024 \
    --eval_max_length=1024 \
    --train_num_workers 8 \
    --eval_num_workers 4 \
    --process_num_workers 8 \
    --do_train --do_predict \
    --seed 42 \
    --fp16 \
    --eval_beam ${eval_beam} 2>&1 | tee $OUTPUT_DIR/run.log

I call the script in the following way:
srun ~/AMRBART/fine-tune/finetune_AMRbart_amrparsing_large.sh /workspace/students/meier/AMR_Bart_models/AMR-BART-LARGE 0

What can I do to solve the problem?
Thanks for reading!

Difference in hyper-parameters

Hello,

thank you very much for your work and for providing the code!

While comparing the fine-tuning scripts with the hyper-parameters reported in your paper, I noticed some differences:

  • The sequence length for generation is 512 in the paper, but in both AMR-to-text scripts the parameter "src_block_size" is 1024.
  • Early stopping patience is 5 in the paper, while in the scripts it ranges from 10 to 15.
  • The learning rate in finetune_AMRbart_amr2text.sh differs from the reported 1e-5 for the base model.

I guess the parameters from the scripts are more recent?
