ltgoslo / definition_modeling

Interpretable Word Sense Representations via Definition Generation

License: GNU General Public License v3.0

Python 98.11% Shell 1.89%

definition_modeling's Introduction

This repository accompanies the paper Interpretable Word Sense Representations via Definition Generation: The Case of Semantic Change Analysis (ACL'2023) by Mario Giulianelli, Iris Luden, Raquel Fernández and Andrey Kutuzov.

The project is a collaboration between the Dialogue Modelling Group at the University of Amsterdam and the Language Technology Group at the University of Oslo.

Definition generation models for English are available on the Hugging Face Hub (e.g., ltg/flan-t5-definition-en-base and ltg/flan-t5-definition-en-xl).

Usage

Download datasets

*.txt files are TSV files containing the target words and their gold standard definitions.

*.eg files are TSV files containing the target words and their usage examples.
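A minimal sketch of parsing a *.eg file with the standard library. The sample row and the word/example header are assumptions for illustration, not the guaranteed schema of the released datasets:

```python
import csv
import io

# Hypothetical two-column *.eg content (tab-separated):
sample = "word\texample\nvigilance\tvigilance is especially susceptible to fatigue.\n"

# csv.DictReader with a tab delimiter handles the TSV layout directly.
rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
for row in rows:
    print(row["word"], "->", row["example"])
```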

Predict definitions

Gzip the test.txt and test.eg files, put them into the same folder, and run code/modeling/generate_t5.py, e.g.:

python3 code/modeling/generate_t5.py --model ltg/flan-t5-definition-en-base --testdata testdata
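The preparation step can be sketched as follows. The file contents here are placeholders for illustration; in practice, use the test.txt and test.eg files from the downloaded datasets:

```shell
# Create placeholder test files (illustration only; real files come
# from the downloaded datasets).
printf 'word\tdefinition\nvigilance\talert watchfulness\n' > test.txt
printf 'word\texample\nvigilance\tvigilance is especially susceptible to fatigue.\n' > test.eg

# Gzip both files and place them in the test data folder.
gzip -f test.txt test.eg
mkdir -p testdata
mv test.txt.gz test.eg.gz testdata/
```

generate_t5.py can then be pointed at the testdata folder with --testdata, as in the command above.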

Generate DistilRoBERTa sentence embeddings for the definitions

code/modeling/generate_t5.py outputs a TSV file whose name ends in _post_predicted.tsv. The gold standard definitions are in the Definition column, and the predicted ones are in the Definitions column. Then run code/embed_definitions.py, e.g.:

python3 code/embed_definitions.py --input_path "what_is_the_definition_of_<trg>?_post_predicted.tsv" --key_to_entry_id Sense

The --key_to_entry_id value depends on the dataset used; Sense is the key for the WordNet data.
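To sanity-check the generation output before embedding, the *_post_predicted.tsv columns can be inspected with the standard library. The sample row below is hypothetical; only the Sense/Definition/Definitions column names follow the README text above:

```python
import csv
import io

# Hypothetical excerpt of a *_post_predicted.tsv file; the row values
# are made up for illustration.
sample = (
    "Sense\tDefinition\tDefinitions\n"
    "bank%1\tsloping land beside a body of water\tthe land alongside a river\n"
)

rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
gold = rows[0]["Definition"]        # gold standard definition
predicted = rows[0]["Definitions"]  # model-generated definition
```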

Citation

@inproceedings{giulianelli-etal-2023-interpretable,
    title = "Interpretable Word Sense Representations via Definition Generation: The Case of Semantic Change Analysis",
    author = "Giulianelli, Mario  and
      Luden, Iris  and
      Fernandez, Raquel  and
      Kutuzov, Andrey",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.176",
    doi = "10.18653/v1/2023.acl-long.176",
    pages = "3130--3148",
    abstract = "We propose using automatically generated natural language definitions of contextualised word usages as interpretable word and word sense representations. Given a collection of usage examples for a target word, and the corresponding data-driven usage clusters (i.e., word senses), a definition is generated for each usage with a specialised Flan-T5 language model, and the most prototypical definition in a usage cluster is chosen as the sense label. We demonstrate how the resulting sense labels can make existing approaches to semantic change analysis more interpretable, and how they can allow users {---} historical linguists, lexicographers, or social scientists {---} to explore and intuitively explain diachronic trajectories of word meaning. Semantic change analysis is only one of many possible applications of the {`}definitions as representations{'} paradigm. Beyond being human-readable, contextualised definitions also outperform token or usage sentence embeddings in word-in-context semantic similarity judgements, making them a new promising type of lexical representation for NLP.",
}
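The abstract's sense-labelling step (choosing the most prototypical definition in a usage cluster) can be sketched as medoid selection over definition embeddings. This is a minimal illustration with toy 2-d vectors; the actual pipeline uses DistilRoBERTa sentence embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def prototypical_index(vectors):
    """Index of the medoid: the vector with the highest total
    cosine similarity to the other vectors in the cluster."""
    scores = [
        sum(cosine(v, w) for j, w in enumerate(vectors) if j != i)
        for i, v in enumerate(vectors)
    ]
    return max(range(len(vectors)), key=scores.__getitem__)

# Toy cluster: the middle vector is most similar to both others,
# so its definition would be chosen as the sense label.
cluster = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(prototypical_index(cluster))  # 1
```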

definition_modeling's People

Contributors: akutuzov, mariafjodorowa, glnmario


definition_modeling's Issues

[Question] Some Questions

Hey there, I am very interested in this terrific work, and I have some questions from trying to reproduce the results in the paper:

  • Q1: Do the checkpoints released on the Hugging Face Hub (the three Flan-T5 models) correspond to the soft domain shift models?
  • Q2: How can I compute the final evaluation results (BERTScore-F1, ROUGE-L, and BLEU) reported in the paper? After running python3 code/modeling/generate_t5.py --model ltg/flan-t5-definition-en-xl --testdata wordnet/, I obtained the predicted definition for each item of a word in the test data, and I then used python code/definition_pair_similarity.py --data_path predicted.tsv --output_path "result.tsv" to compute each word's metrics. Should I average each line of results in result.tsv to obtain the final mean values?

Thanks so much!
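If definition_pair_similarity.py does write one row of metrics per word, macro-averaging the columns is straightforward. A sketch with the standard library; the column names and values here are hypothetical, not the script's actual output format:

```python
import csv
import io
import statistics

# Hypothetical per-word metrics file (column names are assumptions):
sample = "word\tbleu\trougeL\nplay\t0.25\t0.40\nbank\t0.35\t0.50\n"

rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
mean_bleu = statistics.mean(float(r["bleu"]) for r in rows)
mean_rouge = statistics.mean(float(r["rougeL"]) for r in rows)
```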

Replication of finetuning code

Hello, I want to try fine-tuning your model with my own data, but I have two questions:

  1. I am trying to replicate your fine-tuning code, but when I try fine-tuning the larger versions of Flan-T5 I run into memory capacity issues. I am just using the WordNet dataset from Hugging Face, training for one epoch with a batch size of 1 and reduced sequence lengths. It also appears not to run on multiple nodes. How could I solve this?
  2. How should I format my data in order to use it for further fine-tuning?

Thank you for any assistance here.

FileNotFoundError when Running generate_t5.py (testdata/complete.tsv.gz ?)

Hello!
I am currently working with your project definition_modeling and encountered an issue when trying to run the generate_t5.py script. I followed the instructions in the README to prepare the test data, but I'm facing a FileNotFoundError.

Issue Description

While running the command:

python3 code/modeling/generate_t5.py --model ltg/flan-t5-definition-en-base --testdata testdata

I encountered the following error:

FileNotFoundError: [Errno 2] No such file or directory: 'testdata/complete.tsv.gz'

According to the README, I have compressed test.txt and test.eg using gzip and placed them in the ./testdata directory. However, it seems like the script is looking for a file named complete.tsv.gz.

My Understanding

Based on my understanding of the code, the content of the testdata file might be structured as follows:

word    example
vigilance    vigilance is especially susceptible to fatigue.
......

Questions

  1. Could you please provide the expected structure for the complete.tsv file?
  2. How and where are test.txt and test.eg used in the script?

Any guidance or clarification you can provide would be greatly appreciated, as I am eager to properly utilize your project.

Thank you for your time and assistance. Thank you again for such excellent research.

Best regards,

Reproduce Code

Hi there,

Following your Usage instructions, I've generated predicted.npz and predicted.tsv.gz, using the WordNet dataset. But I don't know what to do next. Could you provide the complete experimental procedure?

I tried to run cluster_definitions.py and sense_label.py, but both scripts need complete.tsv.gz, containing usages and cluster ids. Could you please tell me how to obtain this file?

Thanks!

embed_definitions.py missing

The DistilRoBERTa embeddings are supposed to be created with code/embed_definitions.py, but there is only code/embed_definitions_tfidf.py, which uses TF-IDF rather than DistilRoBERTa. Do you happen to have the original file somewhere?
