centre-for-humanities-computing / dfm-sentence-transformers

Code for curating data and training sentence transformers for the Danish Foundation Models project.

License: MIT License

Python 98.23% Makefile 1.77%

dfm-sentence-transformers's Introduction

dfm-sentence-transformers


Sentence transformers for the Danish Foundation Models Project.

Training

Install the package from PyPI:

pip install dfm-sentence-transformers

You have to specify basic model and training parameters, as well as all the tasks/datasets the model should be trained on.

Here is an example of a config:

[model]
name="dfm-sentence-encoder-small-v1"
base_model="chcaa/dfm-encoder-small-v1"
device="cpu"

[training]
epochs=50
steps_per_epoch=500
warmup_steps=100
batch_size=64
wandb_project="dfm-sentence-transformers"
checkpoint_repo="checkpoints-dfm-sentence-encoder-small-v1"

[tasks]

[tasks.bornholmsk]
@tasks="multiple_negatives_ranking"
sentence1="da_bornholm"
sentence2="da"

[tasks.bornholmsk.dataset]
@loaders="load_dataset"
path="strombergnlp/bornholmsk_parallel"

Then you can train a sentence transformer by using the finetune command.

python3 -m dfm_sentence_trf finetune training.cfg -o "model/"

You can push the finetuned model to HuggingFace Hub:

python3 -m dfm_sentence_trf push_to_hub training.cfg --model_path "model/"

(NEW) Curating datasets with models you've pretrained

Similarly to Microsoft's E5, we intend to train models on data curated by models that we have trained on heuristic-based sentence pairs. We provide a CLI for filtering a dataset based on consistency.

This uses a batch-based strategy: we take batches of sentence pairs and compute a similarity matrix between the left-side and right-side sentences. If a pair's similarity is above the 1-(1/(N*specificity))'th quantile of all similarities in the matrix, and the heuristics originally annotated it as a pair, we accept it as a positive pair. We also assign hard negatives: pairs whose similarity falls in the lower quantile and that were not originally annotated as a pair.
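
The filtering step described above can be sketched roughly as follows. This is only an illustration with NumPy, not the package's actual implementation: the function name `filter_batch` is hypothetical, cosine similarity is assumed as the similarity measure, and the threshold for "the lower quantile" is assumed to mirror the upper one (the text does not state it).

```python
import numpy as np


def filter_batch(left, right, specificity=1.2):
    """Sketch of consistency filtering for one batch (hypothetical helper).

    left, right: (N, d) embedding arrays; row i of each forms a heuristic pair.
    Returns index pairs (i, j) for accepted positives and for hard negatives.
    """
    n = left.shape[0]
    # Cosine similarity between every left-side and right-side sentence.
    left_n = left / np.linalg.norm(left, axis=1, keepdims=True)
    right_n = right / np.linalg.norm(right, axis=1, keepdims=True)
    sim = left_n @ right_n.T

    q = 1.0 / (n * specificity)
    upper = np.quantile(sim, 1.0 - q)
    # ASSUMPTION: the "lower quantile" mirrors the upper one.
    lower = np.quantile(sim, q)

    # Heuristic pairs sit on the diagonal of the similarity matrix.
    is_pair = np.eye(n, dtype=bool)
    positives = np.argwhere(is_pair & (sim > upper))
    negatives = np.argwhere(~is_pair & (sim < lower))
    return positives, negatives
```

A larger `specificity` moves the upper threshold closer to the maximum similarity, so fewer heuristic pairs are accepted as positives.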

The hard positive, hard negative scheme is employed so that we can use AnglE for finetuning the models on this curated data.

The config scheme for data cleaning is the following:

[cleaning]
batch_size=1000
specificity=1.2
name="kardosdrur/folketing-wiki-clean"

[cleaning.model]
...(same as everywhere else)

[data]

[data.folketinget]
sentence1="comment"
sentence2="response"

[data.folketinget.dataset]
@loaders="load_dataset"
path="kardosdrur/folketinget-discussions"

Then you can clean the dataset:

python3 -m dfm_sentence_trf clean_dataset "config.cfg"

This will produce the JSONL file <dataset_name>.jsonl containing all examples.

Datasets can then be shuffled, split and pushed to the hub with the push_dataset command.

python3 -m dfm_sentence_trf push_dataset "config.cfg"

(NEW) Finetuning with AnglE

You can finetune a model with AnglE on supervised tasks. AnglE models have a different config format, namely:

[model]
...

[training]
epochs=5
batch_size=32
warmup_steps=100

[angle]
sentence1="premise"
sentence2="hypothesis"
label="label"

[angle.dataset]
@loaders="load_dataset"
path="kardosdrur/nb-nli"

AnglE models can only be trained on one supervised task, where the label is correlated with semantic similarity.

Note that you have to manually install AnglE.

pip install angle_emb

Then you can finetune:

python3 -m dfm_sentence_trf angle_finetune "config.cfg" -o "model/"

Models can be pushed to the hub the same way as everything else. We recommend that you pretrain on sentence pair datasets and then finetune with AnglE on NLI or STS tasks.

Evaluation

You can evaluate trained models with the Scandinavian Embedding Benchmark.

pip install seb
python3 -m seb "model/" "da"

Tasks

You can add an arbitrary number of tasks to the model's config. All tasks must have a unique name, but the name is ignored in the actual training procedure. Datasets of tasks with the same loss function are mixed together so that the model can learn them simultaneously in mixed batches. The package comes with four default tasks you can use for different objectives:

1. Multiple Negatives Ranking

If you have a parallel corpus of sentences (paraphrase, translation, etc.) use this task. Batches consist of positive sentence pairs, and negative samples are constructed by taking all non-matching pairs in a batch.

Parameters:

| Param | Type | Description | Default |
|---|---|---|---|
| sentence1 | str | Name of the first sentence column in the dataset. | - |
| sentence2 | str | Name of the second sentence column in the dataset. | - |
| scale | float | Output of similarity function is multiplied by scale value. | 20.0 |

[tasks.faroese]
@tasks="multiple_negatives_ranking"
sentence1="fo"
sentence2="da"

[tasks.faroese.dataset]
@loaders="load_dataset"
path="strombergnlp/itu_faroese_danish"
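
To make the in-batch negatives idea concrete, here is a minimal NumPy sketch of a multiple negatives ranking objective. It is not the code the package runs (training presumably goes through the sentence-transformers loss classes); the function name is hypothetical and cosine similarity is assumed as the similarity function. The `scale` argument plays the role of the parameter in the table above.

```python
import numpy as np


def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Illustrative in-batch negatives loss (hypothetical helper).

    anchors, positives: (N, d) arrays; row i of each is a positive pair,
    and every (anchor_i, positive_j) with i != j is treated as a negative.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    # Scaled cosine similarities: row i scores anchor_i against all positives.
    scores = scale * (a @ p.T)
    # Numerically stable log-softmax over each row.
    scores = scores - scores.max(axis=1, keepdims=True)
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # The correct "class" for row i is column i.
    return float(-np.mean(np.diag(log_probs)))
```

Because every other pair in the batch acts as a negative, larger batches give each anchor more negatives to be ranked against.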

2. Cosine Similarity

Good for STS datasets. Minimizes the mean squared error between estimated and true sentence cosine similarities.

Parameters:

| Param | Type | Description | Default |
|---|---|---|---|
| sentence1 | str | Name of the first sentence column in the dataset. | - |
| sentence2 | str | Name of the second sentence column in the dataset. | - |
| similarity | str | Name of the gold standard similarity column. | - |

[tasks.sts]
@tasks="cosine_similarity"
sentence1="sent1"
sentence2="sent2"
similarity="label"

[tasks.sts.dataset]
...
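
As a rough illustration of this objective, the sketch below computes the mean squared error between the cosine similarity of two embeddings and a gold similarity score. The function name is hypothetical and the embeddings are plain NumPy arrays rather than model outputs.

```python
import numpy as np


def cosine_similarity_loss(emb1, emb2, gold):
    """Illustrative MSE between predicted and gold cosine similarities.

    emb1, emb2: (N, d) arrays of sentence embeddings.
    gold: N gold-standard similarity scores on the same scale as cosine.
    """
    norms = np.linalg.norm(emb1, axis=1) * np.linalg.norm(emb2, axis=1)
    cos = np.sum(emb1 * emb2, axis=1) / norms
    return float(np.mean((cos - np.asarray(gold, dtype=float)) ** 2))
```

Gold scores on another scale (for example 0 to 5) would need rescaling before they are comparable to cosine similarities; whether the package does this for you is not stated here.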

3. Softmax

Good for NLI datasets. Uses softmax classification loss based on concatenated embeddings and their difference. Beware that these tasks are never joined due to potentially different labeling schemes.

Parameters:

| Param | Type | Description | Default |
|---|---|---|---|
| sentence1 | str | Name of the first sentence column in the dataset. | - |
| sentence2 | str | Name of the second sentence column in the dataset. | - |
| label | str | Name of the label column in the dataset. | - |

[tasks.nli]
@tasks="softmax"
sentence1="premise"
sentence2="hypothesis"
label="label"

[tasks.nli.dataset]
...
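
The softmax objective can be sketched as a linear classifier on top of pair features. The example below is an assumption-laden illustration, not the package's code: the helper names are hypothetical, the "difference" is taken to be the element-wise absolute difference, and the classifier weights are passed in explicitly instead of being learned.

```python
import numpy as np


def pair_features(u, v):
    """Concatenate both embeddings with their absolute difference (assumed)."""
    return np.concatenate([u, v, np.abs(u - v)], axis=1)


def softmax_loss(u, v, labels, weights, bias):
    """Illustrative cross-entropy over classifier logits.

    u, v: (N, d) embeddings; labels: (N,) integer class ids;
    weights: (3 * d, num_labels); bias: (num_labels,).
    """
    logits = pair_features(u, v) @ weights + bias
    # Numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(log_probs[np.arange(len(labels)), labels]))
```

Since the classifier's output size depends on the number of labels, two NLI datasets with different label sets cannot share one classifier, which is consistent with these tasks never being joined.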

4. Contrastive (new in 0.3.6)

Contrastive loss for hard negative and hard positive pairs.

Parameters:

| Param | Type | Description | Default |
|---|---|---|---|
| sentence1 | str | Name of the first sentence column in the dataset. | - |
| sentence2 | str | Name of the second sentence column in the dataset. | - |
| label | str | Name of the label column in the dataset. | - |

[tasks.contrastive]
@tasks="contrastive"
sentence1="text1"
sentence2="text2"
label="label"

[tasks.contrastive.dataset]
...
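
For intuition, here is one common formulation of a contrastive loss on labeled pairs. The details are assumptions, since this README does not spell them out: cosine distance is used as the distance, the margin defaults to 0.5, label 1 marks a positive pair and label 0 a negative pair. The function name is hypothetical.

```python
import numpy as np


def contrastive_loss(emb1, emb2, labels, margin=0.5):
    """Illustrative contrastive loss with cosine distance (assumed form).

    Positive pairs (label 1) are pulled together; negative pairs (label 0)
    are pushed apart until their distance exceeds the margin.
    """
    norms = np.linalg.norm(emb1, axis=1) * np.linalg.norm(emb2, axis=1)
    dist = 1.0 - np.sum(emb1 * emb2, axis=1) / norms
    labels = np.asarray(labels, dtype=float)
    pull = labels * dist ** 2
    push = (1.0 - labels) * np.maximum(margin - dist, 0.0) ** 2
    return float(np.mean(0.5 * (pull + push)))
```

Negatives that are already further apart than the margin contribute nothing, which is why hard negatives (close in embedding space but not a true pair) are the informative ones.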

Datasets

Datasets for each task are loaded with the 🤗 load_dataset() function, but only the first argument and a name are accepted. You can use local or remote datasets, and they can be in any of the canonical file formats (JSON, JSONL, CSV, Parquet...).

...

[tasks.local.dataset]
@loaders="load_dataset"
path="local/dataset/file.jsonl"

...

[tasks.huggingface_hub.dataset]
@loaders="load_dataset"
path="username/dataset"
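
If you want to prepare a local JSONL file yourself, the sketch below shows the usual layout: one JSON object per line, with one key per column. It uses only the standard library; the helper names, the column names `sentence_a` and `sentence_b`, and the placeholder rows are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read the file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Placeholder rows; the column names are hypothetical.
records = [
    {"sentence_a": "first placeholder sentence", "sentence_b": "its pair"},
    {"sentence_a": "second placeholder sentence", "sentence_b": "its pair"},
]
path = Path(tempfile.mkdtemp()) / "file.jsonl"
write_jsonl(path, records)
```

A task using such a file would then set sentence1="sentence_a" and sentence2="sentence_b" and point path at the file.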

dfm-sentence-transformers's People

Contributors

x-tabdeveloping, kennethenevoldsen

dfm-sentence-transformers's Issues

What do we do with only positive sentence pairs?

@KennethEnevoldsen gave me ContrastiveTensionLoss as an example of how one could use in-batch negatives for sampling, but as you can see in this example, Contrastive Tension loss with in-batch negatives is used with an unsupervised training objective, so it is probably not what we're looking for.

I think MultipleNegativesRankingLoss is what we're looking for. As per the docs:

This loss expects as input a batch consisting of sentence pairs (a_1, p_1), (a_2, p_2)…, (a_n, p_n) where we assume that (a_i, p_i) are a positive pair and (a_i, p_j) for i!=j a negative pair.

For each a_i, it uses all other p_j as negative samples, i.e., for a_i, we have 1 positive example (p_i) and n-1 negative examples (p_j). It then minimizes the negative log-likelihood for softmax normalized scores.

This loss function works great to train embeddings for retrieval setups where you have positive pairs (e.g. (query, relevant_doc)) as it will sample in each batch n-1 negative docs randomly.

This essentially does the same thing as the ContrastiveParallel task that I wrote, but with in-batch negative examples and a fixed number of negative samples.

MultipleNegativesSymmetricRankingLoss could also work quite well, as per the documentation:

This loss is an adaptation of MultipleNegativesRankingLoss. MultipleNegativesRankingLoss computes the following loss: For a given anchor and a list of candidates, find the positive candidate.
In MultipleNegativesSymmetricRankingLoss, we add another loss term: Given the positive and a list of all anchors, find the correct (matching) anchor.
For the example of question-answering: You have (question, answer)-pairs. MultipleNegativesRankingLoss just computes the loss to find the answer for a given question. MultipleNegativesSymmetricRankingLoss additionally computes the loss to find the question for a given answer.
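
To illustrate the symmetric variant quoted above, here is a small NumPy sketch that averages the two directions: finding the positive for each anchor, and finding the anchor for each positive. The helper names are made up for this example, and cosine similarity with a scale of 20 is an assumption carried over from the task defaults.

```python
import numpy as np


def _diagonal_cross_entropy(scores):
    """Cross-entropy where the correct column for row i is column i."""
    scores = scores - scores.max(axis=1, keepdims=True)
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))


def symmetric_ranking_loss(anchors, positives, scale=20.0):
    """Illustrative symmetric in-batch negatives loss (hypothetical helper)."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    scores = scale * (a @ p.T)
    # Rows: anchor -> positive. Transposed rows: positive -> anchor.
    forward = _diagonal_cross_entropy(scores)
    backward = _diagonal_cross_entropy(scores.T)
    return float((forward + backward) / 2.0)
```

Swapping the two inputs leaves the value unchanged, which is the sense in which the loss is symmetric.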

I also quite like MegaBatchMarginLoss :

Given a large batch (like 500 or more examples) of (anchor_i, positive_i) pairs,
find for each pair in the batch the hardest negative, i.e. find j != i such that cos_sim(anchor_i, positive_j)
is maximal. Then create from this a triplet (anchor_i, positive_i, positive_j) where positive_j
serves as the negative for this triplet.

Then train as with the triplet loss.
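
The hard-negative mining step in that description could look roughly like this. It is a sketch of the idea only, not the sentence-transformers implementation; the function name is hypothetical and the embeddings are plain NumPy arrays.

```python
import numpy as np


def hardest_negatives(anchors, positives):
    """For each anchor_i, return the j != i maximizing cos_sim(anchor_i, positive_j)."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    sim = a @ p.T
    # Exclude the true positive on the diagonal from the search.
    np.fill_diagonal(sim, -np.inf)
    return sim.argmax(axis=1)
```

Each returned index j yields a triplet (anchor_i, positive_i, positive_j), with positive_j acting as the negative.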

Good Config Schema

I came up with the current config schema in a couple of hours, while experimenting and I think we should probably put more thought into it.

Challenges to be addressed here:

  1. How many sections do we want and what should they mean? (I have looked at spaCy for inspiration, but it's not the same thing, so some of their stuff might not work for us)
  2. How much tinkering do we want to allow? As in: Which hyperparameters do we want to control? Do we want to mess with the pooling layer at all, or do we just accept that pooling==mean?
  3. How do we describe tasks in a reproducible way? My vision is that we create a config system, where we can define the different tasks and the datasets in the config, but for that we will need some sort of system/process/schema. A couple of challenges concerning this:
  • Open and non-open datasets: Some of the data we want to use is open, so we can access it from Huggingface Hub or just put it up there, which is awesome. But we also need some way of loading non-open data. My first thought is that we could just add load_dataset from Datasets into a registry, like this:
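import catalogue
from confection import registry
from datasets import load_dataset

# Create the "loaders" registry first; confection does not define it by default.
registry.loaders = catalogue.create("confection", "loaders", entry_points=False)
# Register load_dataset so configs can refer to it as @loaders="load_dataset".
registry.loaders.register("load_dataset")(load_dataset)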

Then the config would be something like this:

[tasks]

[tasks.bornholmsk]

[tasks.bornholmsk.dataset]
@loaders="load_dataset"
path="strombergnlp/bornholmsk_parallel"

[tasks.bornholmsk.objective]
...

[tasks.nyheder]

[tasks.nyheder.dataset]
@loaders="load_dataset"
path="dat/nyheder.jsonl"

[tasks.nyheder.objective]
...
  • Describing training objectives and losses
