seanlee97 / angle Goto Github PK

Train and Infer Powerful Sentence Embeddings with AnglE | 🔥 SOTA on STS and MTEB Leaderboard

Home Page: https://arxiv.org/abs/2309.12871

License: MIT License

Python 100.00%

llama llama2 semantic-similarity semantic-textual-similarity sentence-embedding stsbenchmark text-embedding sentence-embeddings text-similarity retrieval-augmented-generation

angle's Introduction

nothing is here

angle's People

Contributors

Stargazers

Watchers

angle's Issues

Training data for UAE-Large-V1

Hi,

Awesome work! Can you share the details about what data was used for adapting WhereIsAI/UAE-Large-V1 from BGE-large? Can you share the data as well?

Thanks!

[Bug] Stuck before trainer.train()

Hello, I cannot run examples/Angle-ATEC.ipynb, angle.fit() output nothing and GPUs are not working.
Might be version issue, my environment: Successfully installed bitsandbytes-0.41.3.post2 boltons-23.1.1 peft-0.7.1 tokenizers-0.15.0 transformers-4.36.2

Questions about how to use the model

Dear author, thank you for your excellent work. I am now looking to measure the semantic similarity between multiple answers generated by a llm and the ground truth answer. Can I directly use your model to extract features from both the answers generated by the large model and the real answer, and then calculate their cosine similarity as the score for their semantic similarity match? Will the performance of STS be affected?

Gemma

Would you also be looking at other Llama based models, like Gemma?

Does WhereIsAI/UAE-Large-V1 support sentence transformers?

I am able to create an embedding using sentence transformers, but I was not sure it support it or not.

Where to find Github Issues dataset?

Error extracting angle-llama vector

Hi Sean,

Thanks for the amazing work! I notice that there might be a small bug in newer versions of the code resulting in a device error when using angle-llama to get embeddings. I downgrade the version to 0.3.0 and the problem disappears.

To reproduce the error, simply execute the code given in Angle-llama instructions

Could you take a quick look at the problem? Thanks !

A little doubt about the paper

From a code perspective, the paper concludes by adding up all values of the complex loss function

This is a normal complex division formula and transformation. The purpose of the paper is to obtain the content of the red box.
But you ultimately add up, as shown in the following figure:

Is this the desired result of the paper? May I ask if you can tell me?, thank you.

How is Re-ranking done?

On the leaderboard I see a result for re-ranking. How is this done with these embeddings?

UAE - explanation of Non-Retrieval vs Retrieval

Hello, could you please add a little explanation of the difference Non-Retrieval and Retrieval tasks for UAE? Why would one be used instead of another? I'm looking to create sentence embeddings to store in a database. Thank you!

AngIE motivation?

Hi author, I have a question: the so-called cosine similarity is actually a vector dot product, and there is no real cosine at all. When the gradient is calculated, there is only multiplication, there is no cos at all, and there is no so-called saturation region where the gradient disappears. Can you explain?

About angle-bert-base-uncased-nli-en-v1 evaluation issues

When I use angle-bert-base-uncased-nli-en-v1 to evaluate STS performance, I find that it is inconsistent with the original report.

The command line I use:

python eval_nli.py 
--model_name_or_path /home/whzhu_st/Model/angle-bert-base-uncased-nli-en-v1  
--task_set sts 
--pooling_strategy cls_avg

Enviroment:

torch 1.13.1
transformer 4.38.1
V100 GPU

So is this result acceptable within the error range or is there something wrong with my command?

How to use the encoding with tiktoken ?

Hey,

I am trying to get the encoding using tiktoken to initiate token counter:

import tiktoken
from llama_index.callbacks import CallbackManager, TokenCountingHandler
enc = tiktoken.get_encoding("WhereIsAI/UAE-Large-V1")
token_counter = TokenCountingHandler(tokenizer= enc.encode)

But i am getting following error:

_---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[20], line 3
1 import tiktoken
2 from llama_index.callbacks import CallbackManager, TokenCountingHandler
----> 3 enc = tiktoken.get_encoding("WhereIsAI/UAE-Large-V1")
4 token_counter = TokenCountingHandler(tokenizer= enc.encode)

File f:\pycharmprojects\llamaindex\venv\lib\site-packages\tiktoken\registry.py:68, in get_encoding(encoding_name)
65 assert ENCODING_CONSTRUCTORS is not None
67 if encoding_name not in ENCODING_CONSTRUCTORS:
---> 68 raise ValueError(
69 f"Unknown encoding {encoding_name}. Plugins found: {_available_plugin_modules()}"
70 )
72 constructor = ENCODING_CONSTRUCTORS[encoding_name]
73 enc = Encoding(**constructor())

ValueError: Unknown encoding WhereIsAI/UAE-Large-V1. Plugins found: ['tiktoken_ext.openai_public']_

Is there any way to use the encodings with tiktoken ?

Thanks

SimCSE-LLaMA2

Thank you for your awesome project!!

Can you provide the SimCSE-LLaMA2 code?

some mistake

in angle_emb/angle.py

line 767

labels = inputs.pop("labels", None) <-- may be error

# labels = inputs.pop("labels", None) <-- may be ok

You have already declared.

def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels", None) <-- like this

How to use accelerate to multi gpu train and encode?

It seems to be will rasie error..

Training script for the Bert-based model on the NLI dataset

Dear author, I want to use bert-base-uncased model to train on NLI dataset based on your method for some research. Could you provide relevant training scripts so that I can better reproduce your experimental results? This is my training script, using the same data as your training. I cannot reproduce the evaluation effect of your angle-bert-base-uncased-nli-en-v1 model.

CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 --master_port=1234 train_nli.py \
--task NLI-STS --output_dir ckpts/NLI-STS-bert-cls \
--model_name_or_path ../models/bert-base-uncased \
--learning_rate 5e-5 --maxlen 50 \
--epochs 1 \
--batch_size 10 \
--logging_steps 500 \
--warmup_steps 0 \
--save_steps 1000 --seed 42 --do_eval 0 --gradient_accumulation_steps 4 --fp16 1 --torch_dtype 'float32' \
--pooling_strategy 'cls'

This is my evalution result on STS

multi gpu use?

I am running out of memory on Tesla T4. I have 4 of them though and I usually use accelerator for multigpu setup. How can I use them for angle semantic similarity?

是否支持多卡推理或LORA？

您好，能否调整API使模型推理时存储在多张显卡上？我现在有多张24G显存的显卡并且我希望能够运行LLama-7B进行embedding

Fine-tune LLM for WhereIsAI/UAE-Large-V1 embeddings First ?

Hi,

To use the generated embeddings from WhereIsAI/UAE-Large-V1 in an LLM model , do I first need to fine tune a pre-trained LLM model with AnglE so that WhereIsAI/UAE-Large-V1 embeddings are compatible with an LLM? e.g.

angle = AnglE.from_pretrained('NousResearch/Llama-2-7b-hf', pretrained_lora_path='SeanLee97/angle-llama-7b-nli-v2')

Thank you !

How to choose distance in a retrieval system?

When we use the AnglE to build a (faiss) vector store for retrieval, do we need to customize an distance function which is in accord with the final objective function?
The default distance of faiss vector store is L2_distance, and it has an option to COSINE.
Will the retrieval system perform well just with L2 or COSINE?

Issue on Sagemaker

I am facing while deploying embedded model on AWS sagemaker. I ran the same script given in hugging face but got error:
ModelError Traceback (most recent call last)
Cell In[5], line 32
26 # deploy model to SageMaker Inference
27 predictor = huggingface_model.deploy(
28 initial_instance_count=1, # number of instances
29 instance_type='ml.r5d.12xlarge' # ec2 instance type
30 )
---> 32 predictor.predict({
33 "inputs": "Today is a sunny day and I will get some ice cream.",
34 })

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/sagemaker/base_predictor.py:167, in Predictor.predict(self, data, initial_args, target_model, target_variant, inference_id)
137 """Return the inference from the specified endpoint.
138
139 Args:
(...)
161 as is.
162 """
164 request_args = self._create_request_args(
165 data, initial_args, target_model, target_variant, inference_id
166 )
--> 167 response = self.sagemaker_session.sagemaker_runtime_client.invoke_endpoint(**request_args)
168 return self._handle_response(response)

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/botocore/client.py:553, in ClientCreator._create_api_method.._api_call(self, *args, **kwargs)
549 raise TypeError(
550 f"{py_operation_name}() only accepts keyword arguments."
551 )
552 # The "self" in this scope is referring to the BaseClient.
--> 553 return self._make_api_call(operation_name, kwargs)

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/botocore/client.py:1009, in BaseClient._make_api_call(self, operation_name, api_params)
1005 error_code = error_info.get("QueryErrorCode") or error_info.get(
1006 "Code"
1007 )
1008 error_class = self.exceptions.from_code(error_code)
-> 1009 raise error_class(parsed_response, operation_name)
1010 else:
1011 return parsed_response

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{
"code": 400,
"type": "InternalServerException",
"message": "Could not load model /.sagemaker/mms/models/WhereIsAI__UAE-Large-V1 with any of the following classes: (\u003cclass \u0027transformers.models.auto.modeling_auto.AutoModel\u0027\u003e, \u003cclass \u0027transformers.models.bert.modeling_bert.BertModel\u0027\u003e)."
}

Use of causal models for generation

This is an amazing work. I have been working on something that would require me to evaluate the generated outputs of models like Mistral, using a prompt like:
"Fill the [MASK] token in the sentence. Generate a single output."

Now earlier, I would simply instruction fine-tune a Mistral Model. But I would like to explore the possibility of using these models with a bi-directional attention.

I see that the library allows me to access the backbone model underneath. But it is not clear to me if this model has the bi-directional attention. Can you please clarify this? If it does, I could simply use the backbone.generate() function for my purpose.

Thanks in advance!

What is bilayer-index / BeLLM? Can you point to code or paper about this?

How to finetune NLI model

Hi, could you please give me an overview of how to fine-tune an NLI model? Namely:

Which file to fine-tune
Possible prompts and commands, values of w or any relevant details

What similarity metric should I use to measure semantic similarity between two sentences?

In the paper, at training time, it appears that you treat embeddings as a vectors of real numbers which is used to calculate cosine similarity and also as vectors of complex numbers which is used to calculate the angle between the two vectors to measure similarity. At inference time, what similarity metric should I use measure semantic similarity?

PreTrainedTokenizerBase.pad() got an unexpected keyword argument 'truncation'

In angle.py:

        if end_with_eos:
            features = self.tokenizer.pad(
                {'input_ids': [feature['input_ids'] for feature in new_features]},
                padding=False,
                max_length=self.max_length - 1,
                return_tensors=return_tensors,
                truncation=True,
            )
            features['input_ids'] = [input_ids + [self.tokenizer.eos_token_id] for input_ids in features['input_ids']]
            features = self.tokenizer.pad(features, padding=self.padding, return_tensors=return_tensors)

TypeError: PreTrainedTokenizerBase.pad() got an unexpected keyword argument 'truncation'

I'm using AngIE 0.3.1, tokenizers 0.15.1

How can I finetune a saved adapt model?

Here is my train_lora.py:

from datasets import load_dataset
from angle_emb import AnglE, AngleDataTokenizer

# 2. load dataset
# `text1`, `text2`, and `label` are three required columns.
def get_ds(path):
    ds = xxx
    return ds

# 3. transform data
rt = '../data/dataset/v02/'
data_files = {xxx}
ds = load_dataset(rt)
ds = ds.map(lambda obj: {"text1": str(obj["s1"]), "text2": str(obj['s2']), "label": obj['label']})
ds = ds.select_columns(["text1", "text2", "label"])

# 1. load pretrained model
# model_path = '../UAE-Large-V1' # first finetune based model
model_path = '../sts-b/2/ll10e1/best-checkpoint/' # second finetune based model
angle = AnglE.from_pretrained(model_path, max_length=50, pooling_strategy='cls', apply_lora=True, load_kbit=4, train_mode=True).cuda() 


# 3. transform data
train_ds = ds['train'].shuffle().map(AngleDataTokenizer(angle.tokenizer, angle.max_length), num_proc=8)
valid_ds = ds['validation'].map(AngleDataTokenizer(angle.tokenizer, angle.max_length), num_proc=8)

batch_size = 32
save_steps = len(train_ds) // batch_size
lrb = 10
epoch = 5
output_dir = f'../sts-b/7/ll{lrb}e{epoch}'

print('save_steps:', save_steps, output_dir)
# 4. fit
angle.fit(
    train_ds=train_ds,
    valid_ds=valid_ds,
    output_dir=output_dir,
    batch_size=batch_size,
    epochs=epoch,
    learning_rate=lrb * (10 ** -5),
    save_steps=save_steps,
    eval_steps=1000,
    warmup_steps=0,
    gradient_accumulation_steps=4,
    loss_kwargs={
        'w1': 1.0,
        'w2': 35,
        'w3': 1.0,
        'cosine_tau': 20,
        'ibn_tau': 20,
        'angle_tau': 1.0
    },
    fp16=True,
    logging_steps=100
)

When I run this code to finetune with the first finetune model, this error occurs:
INFO:AnglE:lora_config={'task_type': <TaskType.FEATURE_EXTRACTION: 'FEATURE_EXTRACTION'>, 'r': 32, 'lora_alpha': 32, 'lora_dropout': 0.1}
INFO:AnglE:lora target modules=['base_layer', 'default']
INFO:peft.tuners.tuners_utils:Already found a peft_config attribute in the model. This will lead to having multiple adapters in the model. Make sure to know what you are doing!
Traceback (most recent call last):
File "/mnt/bd/mlx-bytedrive-1378-622c9164/llm/uae/train_lora.py", line 22, in
angle = AnglE.from_pretrained(model_path, max_length=50, pooling_strategy='cls', apply_lora=True, load_kbit=4, train_mode=True).cuda() #
File "/mnt/bd/mlx-bytedrive-1378-622c9164/llm/venv/lib/python3.9/site-packages/angle_emb/angle.py", line 847, in from_pretrained
angle = AnglE(model_name_or_path,
File "/mnt/bd/mlx-bytedrive-1378-622c9164/llm/venv/lib/python3.9/site-packages/angle_emb/angle.py", line 772, in init
model = get_peft_model(model, peft_config)
File "/mnt/bd/mlx-bytedrive-1378-622c9164/llm/venv/lib/python3.9/site-packages/peft/mapping.py", line 133, in get_peft_model
return MODEL_TYPE_TO_PEFT_MODEL_MAPPING[peft_config.task_type](model, peft_config, adapter_name=adapter_name)
File "/mnt/bd/mlx-bytedrive-1378-622c9164/llm/venv/lib/python3.9/site-packages/peft/peft_model.py", line 1835, in init
super().init(model, peft_config, adapter_name)
File "/mnt/bd/mlx-bytedrive-1378-622c9164/llm/venv/lib/python3.9/site-packages/peft/peft_model.py", line 125, in init
self.base_model = cls(model, {adapter_name: peft_config}, adapter_name)
File "/mnt/bd/mlx-bytedrive-1378-622c9164/llm/venv/lib/python3.9/site-packages/peft/tuners/lora/model.py", line 111, in init
super().init(model, config, adapter_name)
File "/mnt/bd/mlx-bytedrive-1378-622c9164/llm/venv/lib/python3.9/site-packages/peft/tuners/tuners_utils.py", line 90, in init
self.inject_adapter(self.model, adapter_name)
File "/mnt/bd/mlx-bytedrive-1378-622c9164/llm/venv/lib/python3.9/site-packages/peft/tuners/tuners_utils.py", line 247, in inject_adapter
self._create_and_replace(peft_config, adapter_name, target, target_name, parent, **optional_kwargs)
File "/mnt/bd/mlx-bytedrive-1378-622c9164/llm/venv/lib/python3.9/site-packages/peft/tuners/lora/model.py", line 202, in _create_and_replace
new_module = self._create_new_module(lora_config, adapter_name, target, **kwargs)
File "/mnt/bd/mlx-bytedrive-1378-622c9164/llm/venv/lib/python3.9/site-packages/peft/tuners/lora/model.py", line 355, in _create_new_module
raise ValueError(
ValueError: Target module Dropout(p=0.1, inplace=False) is not supported. Currently, only the following modules are supported: torch.nn.Linear, torch.nn.Embedding, torch.nn.Conv2d, transformers.pytorch_utils.Conv1D.

If I load model with
angle = AnglE.from_pretrained(model_path, max_length=50, pooling_strategy='cls', train_mode=True).cuda()
Then this error occus:
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

What should I do to finetune my finetuned adapter(peft) model again?
Thanks!

Which feature to use?

Thank you for your works. I'm new to NLP, and I want to know which feature to use to cluster similar sentences?

After UAE(non retrieval), I'll get a (n, 1024) feature, should I use the starter token's feature same as E5?

And BTW, I found that using E5, "A red teddy bear wearing blue shirt" is very similar to "A blue teddy bear wearing red shirt". Similarly, "A man riding a horse" will be close to "A horse riding a man", is that a problem for all algorithms?

AttributeError: 'AnglE' object has no attribute 'set_prompt'

Mac mini
14.2.1 (23C71)
angle-emb==0.4.5

from angle_emb import AnglE, Prompts
print('All predefined prompts:', Prompts.list_prompts())
angle = AnglE.from_pretrained('WhereIsAI/UAE-Large-V1', pooling_strategy='cls')
print("angle:", angle)
angle.set_prompt(prompt=Prompts.C)

get error

angle: <angle_emb.angle.AnglE object at 0x152515d30>
Traceback (most recent call last):
  File "/Volumes/NBDATA/JobProjects/Tsinghua/Data-chat/text_splitter/article_partition_splitter.py", line 19, in <module>
    angle.set_prompt(prompt=Prompts.C)
AttributeError: 'AnglE' object has no attribute 'set_prompt'

Incorporating Matryoshka Representation Learning

I wanted to start by expressing my appreciation for your incredible model; its outstanding performance has significantly benefited my work, and for that, I am truly grateful.

I'm reaching out to inquire if you might consider incorporating Matryoshka Representation Learning into your model's training process. I believe that this technique could further amplify the model's capabilities and effectiveness, potentially boosting its performance even more.

Thank you for your time and for creating such a valuable tool.

Compare with M3E

Have you include M3E model as a comparison ?

m3e-base released in about June 2023

Angle with bert-multilingual-base

I wanna ask whether it is possible and How to combine Angle and bert-multilingual-base to obtain a model similar to angle-bert-multilingual-base-uncased-nli-en-v1?

Code embeddings

Is there any information if this is also recommended for extracting embeddings from code snippets? In particular Javascipt and Solidity?

MTEB STS performance and using as feature extraction model

Hi, very interesting approach, and impressive results.
Have you run the MTEB STS benchmark on your trained models?
Would be interesting to see their performance v/s existing models with similar model size.

Thanks

What values are you using for w1, w2 and w3 when defining loss

Hello, I am wondering what constant values you were using for fine-tuning, the loss is L = w1 ∗ Lcos + w2 ∗ Libn + w3 ∗ Langle, but I did not find the values of w1, w2 and w3 in your paper.

How to set labels for contradict pairs

Snli dataset contains contradict pairs, they define labels:
label: an integer whose value may be either 0, indicating that the hypothesis entails the premise, 1, indicating that the premise and hypothesis neither entail nor contradict each other, or 2, indicating that the hypothesis contradicts the premise. Dataset instances which don't have any gold label are marked with -1 label. Make sure you filter them before starting the training using datasets.Dataset.filter.
If I want to use AngLE to fine-tune on those kind of dataset, Should I set -1 for contradict pairs?

ValueError: operands could not be broadcast together with shapes (38,384) (37,384)

Got this error while using this library to train an embedding model:

File "/usr/local/lib/python3.8/dist-packages/angle_emb/angle.py", line 986, in on_epoch_end
corrcoef, accuracy = self.evaluate_fn(self.valid_ds)
File "/usr/local/lib/python3.8/dist-packages/angle_emb/angle.py", line 1470, in evaluate
pred = (x_vecs[::2] * x_vecs[1::2]).sum(1)
ValueError: operands could not be broadcast together with shapes (38,384) (37,384)

I confirmed that valid_ds and train_ds were of even length, so ultimately I just modified one line of the evaluate method of the AnglE class. After this line:
x_vecs = l2_normalize(x_vecs)
I added:

if len(x_vecs) % 2 != 0:
    x_vecs = x_vecs[:-1]

Hopefully that doesn't break anything/everything else? Any thougths on what else might be the source of the issue?

Also, I attempted to restart training by running the same angle.fit() as I did when I started it but adjusting the from_pretrained to point to the most recent checkpoint:
angle = AnglE.from_pretrained('/checkpoint-1100', max_length=512, pooling_strategy='cls').cuda()

I don't see a resume_from_checkpoint=True argument option anywhere... so it's not clear that it's aware of how many epochs have already been run etc.

Unversioned PEFT Dependency Prevents Model from Loading

A recent update to PEFT deprecates and removes references to prepare_model_for_8bit_training. Since AnglE's dependencies are unversioned and it references the deprecated method here, the model breaks when its constructor is invoked.

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

I am trying to train my model using LLAMA-v2-nli. I was able to do so with the bert-nli model but when I try to run with LLAMA I get the following error:

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

`from angle_emb import AnglE, AngleDataTokenizer
angle = AnglE.from_pretrained('NousResearch/Llama-2-7b-hf', pretrained_lora_path='SeanLee97/angle-llama-7b-nli-v2').cuda()
train_ds = ds['train'].shuffle().map(AngleDataTokenizer(angle.tokenizer, angle.max_length), num_proc=8)
valid_ds = ds['valid'].map(AngleDataTokenizer(angle.tokenizer, angle.max_length), num_proc=8)
test_ds = ds['test'].map(AngleDataTokenizer(angle.tokenizer, angle.max_length), num_proc=8)

angle.fit(
train_ds=train_ds,
valid_ds=test_ds,
output_dir='ckpts/sts-b',
batch_size=16,
epochs=5,
learning_rate=2e-5,
save_steps=100,
eval_steps=1000,
warmup_steps=0,
gradient_accumulation_steps=1,
loss_kwargs={
'w1': 1.0,
'w2': 1.0,
'w3': 1.0,
'cosine_tau': 20,
'ibn_tau': 20,
'angle_tau': 1.0
},
fp16=True,
logging_steps=100
)`

I used the same code for bert (loaded the bert model instead) and it works no issues

How to Train for UAE-Large-V1

Since I only saw the training example of angle-bert-base-uncased-nli-en-v1, I was wondering if the UAE-Large-V1 training is the same. Thank you very much for your replies.

2D Matryoshka Sentence Embeddings

Thanks for your great work. Could you please provide the code for 2D Matryoshka Sentence Embeddings or any checkpoints?

[QUESTION] How to use prompt C when using through HuggingFace embeddings loader

I am using Llamaindex to index documents into chromadb and for that I use the HuggingFaceEmbedding abstraction like that:

embed_model = HuggingFaceEmbedding(model_name="WhereIsAI/UAE-Large-V1")

However I read that one need to specify prompt C in order to optimize the embedding for retrieval.

is the prompt only used during retrieval? ie for the question embedding? or also for documents indexing?
any idea if that setting is supported through HuggingFace//Llamaindex abstractions, and how?
in the event that prompt C arg is not supported, would the resulting vector be significantly performing less in retrieval use cases?

cannot reproduce the results reported in the Espresso paper

Hi, this is a really good and useful codebase. I tried to reproduce the results reported in the paper but failed. I used the code in README_ESE.md:

WANDB_MODE=disabled CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 --master_port=1234 -m angle_emb.angle_trainer \
--model_name_or_path WhereIsAI/UAE-Large-V1 \
--train_name_or_path SeanLee97/nli_for_simcse --save_dir ckpts/UAE-Large-Espresso \
--ibn_w 10.0 --cosine_w 0. --angle_w 1.0 --angle_tau 20.0 --learning_rate 1e-6 --maxlen 75 \
--workers 16 \
--pooling_strategy cls \
--epochs 1 \
--batch_size 128 \
--logging_steps 100 \
--warmup_steps 200 \
--save_steps 1000 \
--fp16 1 \
--gradient_accumulation_steps 4 \
--apply_ese 1 \
--ese_compression_size 128 \
--ese_kl_temperature 1.0

However, it only gave the following results:

sts12	sts13	sts14	sts15	sts16	STSB	SICKR	Avg.
79.25	88.63	84.15	89.61	85.99	87.79	79.59	85.00

I also change --cosine_w 0. to --cosine_w 1.0 and --ibn_w 10.0 to --ibn_w 35.0, but the results were even worse.

The results reported in your paper are:

sts12	sts13	sts14	sts15	sts16	STSB	SICKR	Avg.
79.64	90.40	85.76	90.33	86.64	88.54	81.09	86.06

If I purely evaluate the WhereIsAI/UAE-Large-V1 model, the results are:

sts12	sts13	sts14	sts15	sts16	STSB	SICKR	Avg.
79.09	89.62	85.02	89.51	86.61	89.06	82.09	85.86

This means fine-tuning gave me worse performance. In addition, I noticed that the more epochs I train, the worse the performance gets.
Besides, I also tried the code in examples/NLI/README.md to train Qwen1.5-0.5B:

CUDA_VISIBLE_DEVICES=1,2,3,4 torchrun --nproc_per_node=4 --master_port=1234 train_angle.py \
--task NLI-STS --save_dir ckpts/NLI-STS-angle-Qwen1.5-0.5B \
--model_name Qwen/Qwen1.5-0.5B \
--w2 35 --learning_rate 1e-4 --maxlen 50 \
--lora_r 32 --lora_alpha 32 --lora_dropout 0.1 \
--save_steps 500 --batch_size 120 --seed 42 --do_eval 0 --load_kbit 4 --gradient_accumulation_steps 4 --epochs 1

It gave me an average score of 70.23, whereas the paper reports 82.82.

I wonder whether these scripts are the ones you used to train your model, especially regarding the parameter values. It would be really helpful if you could assist me in reproducing the results so I can use this codebase. I really appreciate your time and help! Thank you!

Difference in output when running via Trasformers.js and when hosting on Huggingface

I created an application that uses the UAE-large-V1 model inside Transformers.js and was able to embed sentences in a browser without issues. The model would return a single vector for a single input:

extractor = await pipeline("feature-extraction", "WhereIsAI/UAE-Large-V1", {
      quantized: true,
});

let result = await extractor(text, { pooling: "mean", normalize: true });

When I hosted the model on Huggingface using their inference endpoint solution, it no longer works as expected. Instead of returning a single vector, it returns a variable length of 1024 dimension vectors.

Sample input:

{
   "inputs":  "Where are you"
}

This returns a list of lists of lists of numbers.

Is there a way to make hosted model return a single vector? And why does the the model act differently based on where it's hosted?

Issue installing package

When I install angle-emb and then try to load the Llama model, I get an error because sentencepiece is not installed. Maybe sentencepiece needs to be added as a requirement of the package?

seanlee97 / angle Goto Github PK

angle's Introduction

angle's People

Contributors

Stargazers

Watchers

Forkers

angle's Issues

Recommend Projects

Recommend Topics

Recommend Org

Jobs