ltgoslo / definition_modeling

Interpretable Word Sense Representations via Definition Generation

License: GNU General Public License v3.0

Python 98.11% Shell 1.89%

definition_modeling's Introduction

This repository accompanies the paper Interpretable Word Sense Representations via Definition Generation: The Case of Semantic Change Analysis (ACL'2023) by Mario Giulianelli, Iris Luden, Raquel Fernández and Andrey Kutuzov.

The project is a collaboration between the Dialogue Modelling Group at the University of Amsterdam and the Language Technology Group at the University of Oslo.

Definition generation models for English are available on the Hugging Face Hub (e.g., ltg/flan-t5-definition-en-base and ltg/flan-t5-definition-en-xl).

Usage

Download datasets

*.txt files are TSV files containing the target words and their gold standard definitions.

*.eg files are TSV files containing the target words and their usage examples.
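A minimal sketch of parsing a *.eg file with the standard library. The sample row and the word/example header are assumptions for illustration, not the guaranteed schema of the released datasets:

```python
import csv
import io

# Hypothetical two-column *.eg content (tab-separated):
sample = "word\texample\nvigilance\tvigilance is especially susceptible to fatigue.\n"

# csv.DictReader with a tab delimiter handles the TSV layout directly.
rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
for row in rows:
    print(row["word"], "->", row["example"])
```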

Predict definitions

Gzip the test.txt and test.eg files, put them into the same folder, and run code/modeling/generate_t5.py, e.g.:

python3 code/modeling/generate_t5.py --model ltg/flan-t5-definition-en-base --testdata testdata
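The preparation step can be sketched as follows. The file contents here are placeholders for illustration; in practice, use the test.txt and test.eg files from the downloaded datasets:

```shell
# Create placeholder test files (illustration only; real files come
# from the downloaded datasets).
printf 'word\tdefinition\nvigilance\talert watchfulness\n' > test.txt
printf 'word\texample\nvigilance\tvigilance is especially susceptible to fatigue.\n' > test.eg

# Gzip both files and place them in the test data folder.
gzip -f test.txt test.eg
mkdir -p testdata
mv test.txt.gz test.eg.gz testdata/
```

generate_t5.py can then be pointed at the testdata folder with --testdata, as in the command above.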

Generate DistilRoBERTa sentence embeddings for the definitions

code/modeling/generate_t5.py outputs a TSV file whose name ends in _post_predicted.tsv. The gold standard definitions are in the Definition column, and the predicted ones are in the Definitions column. Then run code/embed_definitions.py, e.g.:

python3 code/embed_definitions.py --input_path "what_is_the_definition_of_<trg>?_post_predicted.tsv" --key_to_entry_id Sense

The --key_to_entry_id value depends on the dataset used; Sense is the key for the WordNet data.
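To sanity-check the generation output before embedding, the *_post_predicted.tsv columns can be inspected with the standard library. The sample row below is hypothetical; only the Sense/Definition/Definitions column names follow the README text above:

```python
import csv
import io

# Hypothetical excerpt of a *_post_predicted.tsv file; the row values
# are made up for illustration.
sample = (
    "Sense\tDefinition\tDefinitions\n"
    "bank%1\tsloping land beside a body of water\tthe land alongside a river\n"
)

rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
gold = rows[0]["Definition"]        # gold standard definition
predicted = rows[0]["Definitions"]  # model-generated definition
```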

Citation

@inproceedings{giulianelli-etal-2023-interpretable,
    title = "Interpretable Word Sense Representations via Definition Generation: The Case of Semantic Change Analysis",
    author = "Giulianelli, Mario  and
      Luden, Iris  and
      Fernandez, Raquel  and
      Kutuzov, Andrey",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.176",
    doi = "10.18653/v1/2023.acl-long.176",
    pages = "3130--3148",
    abstract = "We propose using automatically generated natural language definitions of contextualised word usages as interpretable word and word sense representations. Given a collection of usage examples for a target word, and the corresponding data-driven usage clusters (i.e., word senses), a definition is generated for each usage with a specialised Flan-T5 language model, and the most prototypical definition in a usage cluster is chosen as the sense label. We demonstrate how the resulting sense labels can make existing approaches to semantic change analysis more interpretable, and how they can allow users {---} historical linguists, lexicographers, or social scientists {---} to explore and intuitively explain diachronic trajectories of word meaning. Semantic change analysis is only one of many possible applications of the {`}definitions as representations{'} paradigm. Beyond being human-readable, contextualised definitions also outperform token or usage sentence embeddings in word-in-context semantic similarity judgements, making them a new promising type of lexical representation for NLP.",
}
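The abstract's sense-labelling step (choosing the most prototypical definition in a usage cluster) can be sketched as medoid selection over definition embeddings. This is a minimal illustration with toy 2-d vectors; the actual pipeline uses DistilRoBERTa sentence embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def prototypical_index(vectors):
    """Index of the medoid: the vector with the highest total
    cosine similarity to the other vectors in the cluster."""
    scores = [
        sum(cosine(v, w) for j, w in enumerate(vectors) if j != i)
        for i, v in enumerate(vectors)
    ]
    return max(range(len(vectors)), key=scores.__getitem__)

# Toy cluster: the middle vector is most similar to both others,
# so its definition would be chosen as the sense label.
cluster = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(prototypical_index(cluster))  # 1
```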

definition_modeling's People

Contributors: akutuzov, mariafjodorowa, glnmario


definition_modeling's Issues

[Question] Some Questions

Hey there, I am very interested in this terrific work, and I have some questions from trying to reproduce the results in the paper:

  • Q1: Do the checkpoints released on the Hugging Face Hub (the three Flan-T5 models) correspond to the soft domain shift models?
  • Q2: How can I compute the final evaluation results (BERTScore-F1, ROUGE-L, and BLEU) reported in the paper? After running python3 code/modeling/generate_t5.py --model ltg/flan-t5-definition-en-xl --testdata wordnet/, I obtained the predicted definition for each item of a word in the test data, and I then used python code/definition_pair_similarity.py --data_path predicted.tsv --output_path "result.tsv" to compute each word's metrics. Should I average each line of results in result.tsv to obtain the final mean values?

Thanks so much!
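If definition_pair_similarity.py does write one row of metrics per word, macro-averaging the columns is straightforward. A sketch with the standard library; the column names and values here are hypothetical, not the script's actual output format:

```python
import csv
import io
import statistics

# Hypothetical per-word metrics file (column names are assumptions):
sample = "word\tbleu\trougeL\nplay\t0.25\t0.40\nbank\t0.35\t0.50\n"

rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
mean_bleu = statistics.mean(float(r["bleu"]) for r in rows)
mean_rouge = statistics.mean(float(r["rougeL"]) for r in rows)
```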

Replication of finetuning code

Hello, I want to try fine-tuning your model with my own data, but I have two questions:

  1. I am trying to replicate your fine-tuning code, but when I try fine-tuning the larger versions of Flan-T5 I run into memory capacity issues. I am just using the WordNet dataset from Hugging Face, training for one epoch with a batch size of 1 and reduced sequence lengths. It also appears not to run on multiple nodes. How could I solve this?
  2. How should I format my data in order to use it for further fine-tuning?

Thank you for any assistance here.

FileNotFoundError when Running generate_t5.py (testdata/complete.tsv.gz ?)

Hello!
I am currently working with your project definition_modeling and encountered an issue when trying to run the generate_t5.py script. I followed the instructions in the README to prepare the test data, but I'm facing a FileNotFoundError.

Issue Description

While running the command:

python3 code/modeling/generate_t5.py --model ltg/flan-t5-definition-en-base --testdata testdata

I encountered the following error:

FileNotFoundError: [Errno 2] No such file or directory: 'testdata/complete.tsv.gz'

According to the README, I have compressed test.txt and test.eg using gzip and placed them in the ./testdata directory. However, it seems like the script is looking for a file named complete.tsv.gz.

My Understanding

Based on my understanding of the code, the content of the testdata file might be structured as follows:

word    example
vigilance    vigilance is especially susceptible to fatigue.
......

Questions

  1. Could you please provide the expected structure for the complete.tsv file?
  2. How and where are test.txt and test.eg used in the script?

Any guidance or clarification you can provide would be greatly appreciated, as I am eager to properly utilize your project.

Thank you for your time and assistance. Thank you again for such excellent research.

Best regards,

Reproduce Code

Hi there,

Following your Usage instructions, I've generated predicted.npz and predicted.tsv.gz, using the WordNet dataset. But I don't know what to do next. Could you provide the complete experimental procedure?

I tried to run cluster_definitions.py and sense_label.py, but both scripts need complete.tsv.gz, containing usages and cluster ids. Could you please tell me how to obtain this file?

Thanks!

embed_definitions.py missing

The DistilRoBERTa embeddings are supposed to be created with code/embed_definitions.py, but there is only code/embed_definitions_tfidf.py, which uses TF-IDF rather than DistilRoBERTa. Do you happen to have the original file somewhere?
