THUDM / P-tuning-v2
An optimized deep prompt tuning strategy comparable to fine-tuning across scales and tasks
License: Apache License 2.0
Hello, in Table 1 (BERT-large on the SuperGLUE datasets), why is there such a large gap on SuperGLUE between the PT method and the P-tuning method from your earlier paper? Is it because one freezes the parameters and the other does not? I couldn't find this in the code. Sorry to bother you!
Hi, with the p-tuning v2 code we cannot get convergence on the Ant Financial semantic similarity task: the model ends up predicting only the majority class for all samples. This happens with prompt lengths of 4, 8, and 12 and learning rates of 1e-3, 1e-2, and 1e-4.
If the BERT parameters are updated as well, training proceeds normally, which suggests the data and code are fine. Are there any other directions worth looking into?
Help wanted: how can I improve the model's text2sql ability through fine-tuning?
Currently, given a table-creation statement (e.g., a user login log table) and a question (e.g., what is today's DAU), chatglm produces an incorrect SQL statement, for example adding irrelevant columns, leaving no spaces between SQL keywords, or not deduplicating users.
I hope to improve the model's text2sql ability through fine-tuning.
What I would like to inject:
Question:
Output:
select count(distinct yourid) from your_table where date = today()
Query the DAU of XX from your_table, with the yourid column as the unique user identifier and the date column as the time filter. count(distinct yourid) deduplicates users, and date = today() restricts the query to today.
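A minimal sketch of how such a pair could be serialized as a fine-tuning example; the JSONL layout and the "prompt"/"response" field names are my assumptions and may need to be adapted to whatever data format your fine-tuning script expects:
import json

# Hypothetical training sample for injecting text2sql knowledge; field names are assumptions.
sample = {
    "prompt": ("Table: your_table(yourid, date, ...)\n"
               "Question: What is today's DAU?"),
    "response": "select count(distinct yourid) from your_table where date = today()",
}

with open("text2sql_train.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")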
In Lester et al. (2021), they use T5 as the pre-trained model and use the LM head to generate answers.
For models like BERT and RoBERTa explored in this work, we cannot use the LM head to extract context spans as answers, which means a linear QA head is essential.
Is the task-specific linear head fine-tuned together with the prompt embeddings in PT (Table 3)?
If so, this implementation is a little different from the original one.
If not, the randomly initialized QA head is not expected to produce meaningful outputs and would hinder PT training, which would make the PT results in Table 3 meaningless.
Or, do I have some misunderstandings about the LM head in QA tasks?
Hi, I notice that the BERT prompt model does not feed the [CLS] token to the linear head. I'll try to explain it with the following code and toy inputs: say input_ids has shape [8, 32] and pre_seq_len is 3, then inputs_embeds will have shape [8, 35, 768]. I'll comment the shapes of the key variables in the code and state my concern.
class BertPromptForSequenceClassification(BertPreTrainedModel):
    def forward(self, input_ids=None, attention_mask=None, token_type_ids=None,
                position_ids=None, head_mask=None, output_attentions=None,
                output_hidden_states=None, return_dict=None, **kwargs):
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
        batch_size = input_ids.shape[0]
        raw_embedding = self.embeddings(
            input_ids=input_ids,
            position_ids=position_ids,
            token_type_ids=token_type_ids,
        )
        prompts = self.get_prompt(batch_size=batch_size)            # [8, 3, 768]
        inputs_embeds = torch.cat((prompts, raw_embedding), dim=1)  # inputs_embeds's shape: [8, 35, 768]
        prefix_attention_mask = torch.ones(batch_size, self.pre_seq_len).to(self.bert.device)
        attention_mask = torch.cat((prefix_attention_mask, attention_mask), dim=1)  # [8, 35]
        outputs = self.bert(
            # input_ids,
            attention_mask=attention_mask,
            # token_type_ids=token_type_ids,
            # position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
            # past_key_values=past_key_values,
        )
        # Since the BERT encoder feeds the *first* token into bert_pooler,
        # the token actually used for classification here is the first soft prompt token!
        pooled_output = outputs[1]
I wonder, is p-tuning v2 being compared with soft prompt tuning here?
But the token used by the latter in the classification head is not the [CLS] token.
Is that expected?
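For reference, one possible workaround (a sketch based on the snippet above, not necessarily the intended behaviour of the repo) is to drop the prompt positions before pooling so the classifier sees the original [CLS] position:
# Sketch: pool on the first *real* token instead of the first soft prompt.
# Assumes the same outputs, self.pre_seq_len and self.bert as in the snippet above.
sequence_output = outputs[0]                                # [8, 35, 768]
sequence_output = sequence_output[:, self.pre_seq_len:, :]  # drop prompt positions -> [8, 32, 768]
first_token_tensor = sequence_output[:, 0]                  # the original [CLS] position
pooled_output = self.bert.pooler.dense(first_token_tensor)
pooled_output = self.bert.pooler.activation(pooled_output)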
Hi there,
Thanks for your great work! In Section 4.4, you did an ablation study on adding prompts to transformer layers in descending vs. ascending order. I don't quite understand this difference. Could you please elaborate on it?
The implementation of RobertaPromptForSequenceClassification roughly works by taking the RobertaEmbeddings module out of RobertaModel: the tokenized input is passed through RobertaEmbeddings to obtain the input embeddings, which are concatenated with the prompt embeddings and then fed through the RobertaEncoder.
However, when the input went through RobertaEmbeddings, the word, position, and type embeddings were already summed and LayerNorm and dropout were applied. When those embeddings are then passed to RobertaModel via inputs_embeds, it seems the same operations are applied a second time.
Am I misunderstanding something? Otherwise, as I understand it, this implementation seems problematic.
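A small check of this concern (my own sketch, not from the issue): in eval mode, compare the output computed from input_ids with the output obtained by feeding the already-processed embeddings back in as inputs_embeds; if the embedding pipeline is applied twice, the two will not match.
import torch
from transformers import RobertaModel

model = RobertaModel.from_pretrained('roberta-base').eval()
input_ids = torch.tensor([[0, 31414, 232, 2]])  # an arbitrary tokenized sentence

with torch.no_grad():
    full_embeds = model.embeddings(input_ids=input_ids)                    # word + pos + type, LayerNorm, dropout
    out_from_ids = model(input_ids=input_ids).last_hidden_state
    out_from_embeds = model(inputs_embeds=full_embeds).last_hidden_state   # embeddings pipeline applied again?

print(torch.allclose(out_from_ids, out_from_embeds, atol=1e-5))  # False would support the concern above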
Why do the fine-tuning results on the CB dataset with BERT-large in this paper exceed the SuperGLUE baseline by more than ten absolute points (83.6 -> 94.6), while the results on the other datasets are not much different?
Hello, I trained on a Chinese five-class classification dataset with both the p-tuning v1 and p-tuning v2 methods, using the same hyperparameters, but the results differ greatly.
With p-tuning v1 the accuracy reaches over 80%, but with p-tuning v2 it only reaches around 30%.
What could be the reason?
Hi! I noticed that all experiments in the P-Tuning v2 paper are run on the full training data. How does it perform in the few-shot setting? Can it match fine-tuning there?
Hi,
I've seen issues asking about the past_key_values implementation, and I tried a code snippet to confirm whether it is consistent with what's described in the paper. However, it doesn't seem to work: for different inputs, the first few tokens (i.e., the prompts) are not identical. Could you please take a look at the snippet to check whether it's correct and where the problem is?
import torch
from transformers import AutoTokenizer, RobertaConfig
# RobertaPrefixForSequenceClassification is this repo's model class; adjust the import path to where it lives.
from model.sequence_classification import RobertaPrefixForSequenceClassification

config = RobertaConfig.from_pretrained('roberta-base')
config.pre_seq_len = 2
config.prefix_projection = False
model = RobertaPrefixForSequenceClassification.from_pretrained('roberta-base', config=config)
tokenizer = AutoTokenizer.from_pretrained('roberta-base')

sentences = ['This is an example sentence', 'Each sentence is converted']
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded_input, output_hidden_states=True)
layer_hidden_states = model_output['hidden_states'][1:]  # discard the output of the embedding layer
for hidden_state in layer_hidden_states:
    deep_prompts = hidden_state[:, :model.pre_seq_len, :]  # [batch_size, pre_seq_len, hidden_size]
    assert deep_prompts[0].equal(deep_prompts[1])
Hello, I am a student at USTB. Building on previous work, I have an idea that combines contrastive learning with prompts and would like to verify it. I've read quite a few papers, and your method is relatively elegant, so I'd like to try it. But without the code I couldn't fully understand the paper. Could you provide even a rough, incomplete version first? In this fast-moving field, I'm sure you can understand how I feel. Email: [email protected]
Hi, "from torch import _softmax_backward_data" reports that it does not exist. What could be the reason?
File "run.py", line 7, in
import datasets
File "/home/appuser/miniconda3/envs/pt2/lib/python3.8/site-packages/datasets/init.py", line 37, in
from .builder import ArrowBasedBuilder, BeamBasedBuilder, BuilderConfig, DatasetBuilder, GeneratorBasedBuilder
File "/home/appuser/miniconda3/envs/pt2/lib/python3.8/site-packages/datasets/builder.py", line 44, in
from .data_files import DataFilesDict, _sanitize_patterns
File "/home/appuser/miniconda3/envs/pt2/lib/python3.8/site-packages/datasets/data_files.py", line 120, in
dataset_info: huggingface_hub.hf_api.DatasetInfo,
File "/home/appuser/miniconda3/envs/pt2/lib/python3.8/site-packages/huggingface_hub/init.py", line 290, in getattr
raise AttributeError(f"No {package_name} attribute {name}")
AttributeError: No huggingface_hub attribute hf_api
Hi, I have a question about deep prompts.
I understand that deep prompts are implemented through past_key_values in the model.
Then how can I see the actual prompt weights per layer?
I mean, the shape of the prompt is (prefix_len, config.num_hidden_layers * 2 * config.hidden_size) when the prefix projection (trans) is not used.
And the shape of past_key_values for the input is [2, batch_size, n_head, prefix_len, n_embd] for each layer; I believe the leading 2 corresponds to the key and value of the attention mechanism.
Here I want to obtain a [prefix_len, config.hidden_size] vector, just like the prompt embedding vector in prompt tuning v1.
Do you have any idea for this?
Thanks : )
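One way to recover a per-layer [prefix_len, hidden_size] view (a sketch based on the shapes described above; variable and attribute names are assumptions) is to merge the head dimensions of the key or value tensor back together:
# past_key_values: tuple of per-layer tensors shaped [2, batch_size, n_head, prefix_len, head_dim],
# where the leading 2 is key/value. Recover a [prefix_len, hidden_size] view for one layer:
layer_idx, batch_idx = 0, 0
key = past_key_values[layer_idx][0, batch_idx]   # [n_head, prefix_len, head_dim]
key = key.permute(1, 0, 2)                       # [prefix_len, n_head, head_dim]
prompt_vectors = key.reshape(key.shape[0], -1)   # [prefix_len, n_head * head_dim] == [prefix_len, hidden_size]

# Alternatively, without the trans MLP, the raw parameter could be viewed directly, e.g.
# model.prefix_encoder.embedding.weight.view(prefix_len, num_hidden_layers, 2, hidden_size)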
Hello, following the method in the paper, we reproduced the results on the People's Daily NER dataset, and our conclusions are basically consistent with the paper. However, on two classification datasets, RTE and the Ant Financial text similarity dataset, training does not converge. We tried reparameterizing the prefix embeddings with an MLP and with an LSTM, with little effect. Did the authors encounter anything similar on classification datasets?
Taking the Ant Financial dataset as an example, the loss does not decrease from the start, and on the dev set the model ends up predicting only the majority class.
We compared the gradients between NER and classification and found no obvious difference.
On the model side we tried RoBERTa-large, BERT-large, and BERT-base.
Would you be able to provide the gpt2 example for p-tuning v2?
Hello, when applying this method to T5 or BART, do both the encoder and the decoder need past_key_values initialization? After adding past_key_values, does the forward pass in the T5 or BART source code need to be modified when computing the attention scores? I hope to hear from you soon!
Thank you for your great work! What hyperparameters (number of epochs, lr, etc.) did you use for prompt tuning (v1) and fine tuning?
Hi. I have a question.
In the case of prefix tuning, I think there are clear advantages in training time.
However, I don't think there is a big advantage in inference time for the trained model. What do you think?
Hi, I found in the official Hugging Face documentation that the past_key_values parameter is only meant to speed up decoding by reusing precomputed key/value states. Can you please explain which one p-tuning v2 corresponds to, prefix or prompt?
Why do I get the impression that deep prompt tuning corresponds to --prompt rather than --prefix?
Thank you for your code contributions.
I've observed that training DeBERTaV2 with P-tuning v2 takes significantly more time to evaluate than with other methods. Have you observed such behaviour?
It even takes significantly more time than P-tuning v1, even though v1 has higher attention complexity to evaluate.
It seems the issue is the ad-hoc implementation of past_key_values for DeBERTa, which is the only difference in the backbone model code between v1 and v2, but I can't figure out the specific reason.
Hello!
I'm going to apply P-Tuning v2 to my task with a custom language model, but I failed to locate the part of the source code where deep prompt tuning is implemented. Could anyone please point out the file and lines where the additional parameters are added and used at each layer of the language model?
There are prefix models, but I can't see which parts of the code correspond to creating the Layer-N prompts from the illustration you provide. There is a prefix_encoder, which seems to create only one encoding layer. Does it create multiple independent layers, one for each layer of the LM, or did I misunderstand something in your technique? And if that is the part I'm looking for, where can I find a clearer explanation of what past_key_values does than the Hugging Face documentation?
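For reference, my reading of the repo (a simplified sketch; details may differ from the actual code) is that a single embedding table parameterizes the prompts of all layers at once, and get_prompt reshapes it into one (key, value) prompt pair per transformer layer, delivered through past_key_values:
import torch

class PrefixEncoder(torch.nn.Module):
    # One table of shape [pre_seq_len, num_hidden_layers * 2 * hidden_size]:
    # every layer's key and value prompts are slices of this single embedding.
    def __init__(self, config):
        super().__init__()
        self.embedding = torch.nn.Embedding(
            config.pre_seq_len, config.num_hidden_layers * 2 * config.hidden_size)

    def forward(self, prefix_tokens):
        return self.embedding(prefix_tokens)

def get_prompt(prefix_encoder, batch_size, pre_seq_len, n_layer, n_head, head_dim):
    prefix_tokens = torch.arange(pre_seq_len).unsqueeze(0).expand(batch_size, -1)
    past_key_values = prefix_encoder(prefix_tokens)                    # [b, L, n_layer * 2 * hidden]
    past_key_values = past_key_values.view(batch_size, pre_seq_len, n_layer * 2, n_head, head_dim)
    past_key_values = past_key_values.permute(2, 0, 3, 1, 4).split(2)  # n_layer tensors of [2, b, n_head, L, head_dim]
    return past_key_values  # passed to the backbone as past_key_values, one (key, value) pair per layer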
Hoping for the release of the code and weights.
Links for the PT-retrieval models in the README table appear to be switched around
When reproducing PT-2 on NER with CoNLL-2003 (data from Hugging Face), using the metric provided by the source code, the roberta-large model returned an f1_score of 95+ on the dev set. I then fully fine-tuned roberta-large and after only 3 epochs exceeded the fine-tuning f1 score baseline reported in the paper by 1%. I have some questions: is the result reported in the paper the overall_f1 returned directly by seqeval.metric, or is it obtained through additional computation?
Could you provide the original files of the conll2004 dataset used for PT-2 training? Thanks!
# pooled_output = outputs[1]
sequence_output = outputs[0]
sequence_output = sequence_output[:, self.pre_seq_len:, :].contiguous()
first_token_tensor = sequence_output[:, 0]
pooled_output = self.bert.pooler.dense(first_token_tensor)
pooled_output = self.bert.pooler.activation(pooled_output)
May I ask what the purpose of this modification is? The original approach tested fine in our experiments.
As the title says: running the example code raises an error: ConnectionError: Couldn't reach https://raw.githubusercontent.com/huggingface/datasets/1.15.1/metrics/super_glue/record_evaluation.py
Hi, thanks for your work :)
I've read your paper and tried to understand p-tuning-v2 from the implementation code in order to apply it to GPT2.
(Actually I've already written the code, but I'm not sure I did it correctly.)
I understand that p-tuning-v2 works via past_key_values, which contains the output of the prefix encoder.
Looking at the code, I found that the main difference between p-tuning and p-tuning v2 is also about the input and output shapes.
For v1, the input_ids and prompt embeddings are concatenated and injected directly into the model, so the output logits include the prompt positions, which are not used to compute the loss. For v2, on the other hand, the prefix encoder's output is injected via past_key_values, and the original input_ids together with past_key_values are fed to the model, which is different from the input for v1; so here the output logits do not include the first prompt-length positions.
I'd appreciate it if you could check if I understood correctly.
Thanks!
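A small runnable sketch of the shape difference described above, using GPT-2 with placeholder random prompts (the random tensors stand in for trained prompt parameters; this illustrates the two injection routes, it is not the repo's GPT-2 code):
import torch
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config()                      # default GPT-2 small sizes, no download needed
model = GPT2LMHeadModel(config).eval()
batch_size, seq_len, pre_seq_len = 2, 8, 4
input_ids = torch.randint(0, config.vocab_size, (batch_size, seq_len))

# v1-style: prompt embeddings are concatenated with the token embeddings,
# so the logits also cover the prompt positions (which are ignored in the loss).
prompt_embeds = torch.randn(batch_size, pre_seq_len, config.n_embd)
inputs_embeds = torch.cat([prompt_embeds, model.transformer.wte(input_ids)], dim=1)
out_v1 = model(inputs_embeds=inputs_embeds)
print(out_v1.logits.shape)                 # [2, 12, vocab_size]: includes the 4 prompt positions

# v2-style: per-layer prompts enter as past_key_values while the inputs stay input_ids,
# so the logits only cover the real tokens.
head_dim = config.n_embd // config.n_head
past_key_values = tuple(
    (torch.randn(batch_size, config.n_head, pre_seq_len, head_dim),
     torch.randn(batch_size, config.n_head, pre_seq_len, head_dim))
    for _ in range(config.n_layer))
attention_mask = torch.ones(batch_size, pre_seq_len + seq_len)
out_v2 = model(input_ids=input_ids, past_key_values=past_key_values, attention_mask=attention_mask)
print(out_v2.logits.shape)                 # [2, 8, vocab_size]: prompt positions are not in the logits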
Why is it that in GPT Understands, Too, with the bert-large model, P-tuning is nearly three points higher than P-tuning v2 on BoolQ, and likewise on CB? Why doesn't the paper compare against P-tuning?
Hello, what is the format of the NER datasets? I wasn't able to download them.
raise FileNotFoundError("Local file {} doesn't exist".format(url_or_filename))
FileNotFoundError: Local file tasks/ner/datasets/../../../data/ontoNotes/train.sd.conllx doesn't exist
I also can't quite tell from the NERDataset class what the data looks like.
Any help would be appreciated, thanks♪(・ω・)ノ
Hi, this repo contains nice code.
I wonder how to implement P-tuning-v2 with different prompt depths (e.g., the "Prompt Depth" experiment)?
The paper said "... we change their attention masks for disallowing their prefix prompts to involve in the computation."
May I ask how to change the attention masks in different layers? Is there any example code?
Thanks in advance!
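I haven't found example code for this either; one conceptual way to emulate it (an assumption on my part, not the authors' released implementation) is to keep past_key_values for all layers but give each layer its own attention mask, zeroing out the prefix positions in the layers that should not see prompts. Building those masks is straightforward; the backbone's layer loop then has to be modified to consume one mask per layer:
import torch

def per_layer_prefix_masks(attention_mask, pre_seq_len, n_layer, prompt_depth):
    # attention_mask: [batch_size, seq_len] over the real tokens (1 = attend).
    # Returns n_layer masks over [prefix + tokens]; layers at index >= prompt_depth
    # get zeros on the prefix positions, so their prompts are masked out of attention.
    batch_size = attention_mask.shape[0]
    masks = []
    for layer in range(n_layer):
        keep_prefix = 1.0 if layer < prompt_depth else 0.0
        prefix_mask = torch.full((batch_size, pre_seq_len), keep_prefix)
        masks.append(torch.cat([prefix_mask, attention_mask.float()], dim=1))
    return masks  # each [batch_size, pre_seq_len + seq_len]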
A question: what task is this linear head used for? How does it relate to the p-tuning task? Is this linear head frozen during p-tuning training?
I have some questions about reproducing the PT-2 results for RTE in Table 1.
My base model is RoBERTa-large, and I trained it for 10 epochs with the recommended hyperparameters (prompt length = 4, learning rate = 1e-2, as suggested in a previous issue).
However, I can only get roughly 58% accuracy on the RTE dev set.
I am not sure whether the factor below could cause this; I hope the authors can give me some hints, many thanks!
Nice work. But I have a question about FT.
Does FT in Table 3 mean FT or MFT (multi-task fine-tuning)?
Hello,
I'd like to know how you implement prompts with a depth smaller than the model's number of layers. Hugging Face requires the length of past_key_values to match the model's config.n_layers, so I don't think we can just pass prompts that don't cover all layers to past_key_values. Besides, it seems that layers cannot share the same attention_mask if some of them have prompts and some don't.
Thanks!
12/07/2021 19:30:19 - INFO - training.trainer_base - ***** Epoch 12: Best results *****
12/07/2021 19:30:19 - INFO - training.trainer_base - best_epoch = 0
12/07/2021 19:30:19 - INFO - training.trainer_base - best_eval_accuracy = 0.6217125382262997
12/07/2021 19:30:19 - INFO - training.trainer_base - epoch = 12.0
OrderedDict([('best_epoch', 0), ('best_eval_accuracy', 0.6217125382262997), ('epoch', 13.0)])
{'loss': 0.7488, 'learning_rate': 0.006054054054054054, 'epoch': 13.51}
How large a dataset is needed, at minimum, to fine-tune for a specific vertical domain?
Thank you for providing the source code for the nice work. I have some questions regarding Multi-task Learning (MPT-2) in Table 4. In the paper, the authors mention that
For the multi-task setting, we combine the training set of the three datasets for pre-training. We use different linear classifiers for each dataset while sharing the continuous prompts..
1. What do you mean by pre-training here? Do you first pre-train on all the datasets with labels and then continue training on a specific dataset?
2. I tried to find the pre-training code for multi-task learning in the repo but could not find it. Is it possible for you to make it public?
Many thanks for the clarification!
Thank you for your well-organized code. Since my major is computer vision, I am not familiar with NLP, but I'm very interested in P-Tuning.
Hello!
Thanks a lot for such a well-written article! The improvement over P-tuning is very impressive.
Could you please clarify which parameters are actually tuned during p-tuning? Is it an MLP producing input embeddings for the large model? Or is it a copy of some layers, e.g., the key-value projections for the tuned prefix tokens?
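For what it's worth, my reading of the repo's prefix models is that the backbone weights are frozen and only the prefix encoder (the per-layer key/value prompts) plus the task head receive gradients; a quick way to check on a constructed model (assuming a model class from this repo such as BertPrefixForSequenceClassification, which freezes bert.* in its constructor):
# Sketch: list which parameters would actually be updated by the optimizer.
trainable = [(name, p.numel()) for name, p in model.named_parameters() if p.requires_grad]
for name, numel in trainable:
    print(name, numel)
# Expectation under the assumption above: only prefix_encoder.* and classifier.* appear,
# i.e. the prompts are standalone per-layer key/value parameters, not an MLP producing
# input embeddings and not copies of existing transformer layers.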