THUDM / P-tuning-v2
An optimized deep prompt tuning strategy comparable to fine-tuning across scales and tasks
License: Apache License 2.0
Hello, in Table 1 (BERT-large on the SuperGLUE datasets), why is there such a large gap on SuperGLUE between the PT method and the P-tuning method from your earlier paper? Is it because one freezes the parameters and the other does not? I couldn't find this in the code. Sorry to bother you!
Hi, with the p-tuning v2 code we cannot get convergence on the Ant Financial semantic similarity task: the model ends up predicting only the majority class for all samples. This happens with prompt lengths of 4, 8, and 12 and learning rates of 1e-3, 1e-2, and 1e-4.
If the BERT parameters are updated as well, training proceeds normally, which suggests the data and code are fine. Are there any other directions worth looking into?
Help wanted: how can I improve the model's text2sql ability through fine-tuning?
Currently, given a table-creation statement (e.g., a user login log table) and a question (e.g., what is today's DAU), chatglm produces an incorrect SQL statement, for example adding irrelevant columns, leaving no spaces between SQL keywords, or not deduplicating users.
I hope to improve the model's text2sql ability through fine-tuning.
What I would like to inject:
Question:
Output:
select count(distinct yourid) from your_table where date = today()
Query the DAU of XX from your_table, with the yourid column as the unique user identifier and the date column as the time filter. count(distinct yourid) deduplicates users, and date = today() restricts the query to today.
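A minimal sketch of how such a pair could be serialized as a fine-tuning example; the JSONL layout and the "prompt"/"response" field names are my assumptions and may need to be adapted to whatever data format your fine-tuning script expects:
import json

# Hypothetical training sample for injecting text2sql knowledge; field names are assumptions.
sample = {
    "prompt": ("Table: your_table(yourid, date, ...)\n"
               "Question: What is today's DAU?"),
    "response": "select count(distinct yourid) from your_table where date = today()",
}

with open("text2sql_train.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")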
In Lester et al. (2021), they use T5 as the pre-trained model and use the LM head to generate answers.
For models like BERT and RoBERTa explored in this work, we cannot use the LM head to extract context spans as answers, which means a linear QA head is essential.
Is the task-specific linear head fine-tuned together with the prompt embeddings in PT (Table 3)?
If so, this implementation is a little different from the original one.
If not, the randomly initialized QA head is not expected to produce meaningful outputs and would hinder PT training, which would make the PT results in Table 3 meaningless.
Or, do I have some misunderstandings about the LM head in QA tasks?
Hi, I notice that the BERT prompt model does not feed the [CLS] token to the linear head. I'll try to explain it with the following code and toy inputs: say input_ids has shape [8, 32] and pre_seq_len is 3, then inputs_embeds will have shape [8, 35, 768]. I'll comment the shapes of the key variables in the code and state my concern.
class BertPromptForSequenceClassification(BertPreTrainedModel):
    def forward(self, input_ids=None, attention_mask=None, token_type_ids=None,
                position_ids=None, head_mask=None, output_attentions=None,
                output_hidden_states=None, return_dict=None, **kwargs):
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
        batch_size = input_ids.shape[0]
        raw_embedding = self.embeddings(
            input_ids=input_ids,
            position_ids=position_ids,
            token_type_ids=token_type_ids,
        )
        prompts = self.get_prompt(batch_size=batch_size)            # [8, 3, 768]
        inputs_embeds = torch.cat((prompts, raw_embedding), dim=1)  # inputs_embeds's shape: [8, 35, 768]
        prefix_attention_mask = torch.ones(batch_size, self.pre_seq_len).to(self.bert.device)
        attention_mask = torch.cat((prefix_attention_mask, attention_mask), dim=1)  # [8, 35]
        outputs = self.bert(
            # input_ids,
            attention_mask=attention_mask,
            # token_type_ids=token_type_ids,
            # position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
            # past_key_values=past_key_values,
        )
        # Since the BERT encoder feeds the *first* token into bert_pooler,
        # the token actually used for classification here is the first soft prompt token!
        pooled_output = outputs[1]
I wonder, is p-tuning v2 being compared with soft prompt tuning here?
But the token used by the latter in the classification head is not the [CLS] token.
Is that expected?
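For reference, one possible workaround (a sketch based on the snippet above, not necessarily the intended behaviour of the repo) is to drop the prompt positions before pooling so the classifier sees the original [CLS] position:
# Sketch: pool on the first *real* token instead of the first soft prompt.
# Assumes the same outputs, self.pre_seq_len and self.bert as in the snippet above.
sequence_output = outputs[0]                                # [8, 35, 768]
sequence_output = sequence_output[:, self.pre_seq_len:, :]  # drop prompt positions -> [8, 32, 768]
first_token_tensor = sequence_output[:, 0]                  # the original [CLS] position
pooled_output = self.bert.pooler.dense(first_token_tensor)
pooled_output = self.bert.pooler.activation(pooled_output)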
Hi there,
Thanks for your great work! In Section 4.4, you did an ablation study on adding prompts to transformer layers in descending vs. ascending order. I don't quite understand this difference. Could you please elaborate on it?
The implementation of RobertaPromptForSequenceClassification roughly works by taking the RobertaEmbeddings module out of RobertaModel: the tokenized input is passed through RobertaEmbeddings to obtain the input embeddings, which are concatenated with the prompt embeddings and then fed through the RobertaEncoder.
However, when the input went through RobertaEmbeddings, the word, position, and type embeddings were already summed and LayerNorm and dropout were applied. When those embeddings are then passed to RobertaModel via inputs_embeds, it seems the same operations are applied a second time.
Am I misunderstanding something? Otherwise, as I understand it, this implementation seems problematic.
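A small check of this concern (my own sketch, not from the issue): in eval mode, compare the output computed from input_ids with the output obtained by feeding the already-processed embeddings back in as inputs_embeds; if the embedding pipeline is applied twice, the two will not match.
import torch
from transformers import RobertaModel

model = RobertaModel.from_pretrained('roberta-base').eval()
input_ids = torch.tensor([[0, 31414, 232, 2]])  # an arbitrary tokenized sentence

with torch.no_grad():
    full_embeds = model.embeddings(input_ids=input_ids)                    # word + pos + type, LayerNorm, dropout
    out_from_ids = model(input_ids=input_ids).last_hidden_state
    out_from_embeds = model(inputs_embeds=full_embeds).last_hidden_state   # embeddings pipeline applied again?

print(torch.allclose(out_from_ids, out_from_embeds, atol=1e-5))  # False would support the concern above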
Why do the fine-tuning results on the CB dataset with BERT-large in this paper exceed the SuperGLUE baseline by more than ten absolute points (83.6 -> 94.6), while the results on the other datasets are not much different?
Hello, I trained on a Chinese five-class classification dataset with both the p-tuning v1 and p-tuning v2 methods, using the same hyperparameters, but the results differ greatly.
With p-tuning v1 the accuracy reaches over 80%, but with p-tuning v2 it only reaches around 30%.
What could be the reason?
Hi! I noticed that all experiments in the P-Tuning v2 paper are run on the full training data. How does it perform in the few-shot setting? Can it match fine-tuning there?
Hi,
I've seen issues asking about the past_key_values implementation, and I tried a code snippet to confirm whether it is consistent with what's described in the paper. However, it doesn't seem to work: for different inputs, the first few tokens (i.e., the prompts) are not identical. Could you please take a look at the snippet to check whether it's correct and where the problem is?
import torch
from transformers import AutoTokenizer, RobertaConfig
# RobertaPrefixForSequenceClassification is this repo's model class; adjust the import path to where it lives.
from model.sequence_classification import RobertaPrefixForSequenceClassification

config = RobertaConfig.from_pretrained('roberta-base')
config.pre_seq_len = 2
config.prefix_projection = False
model = RobertaPrefixForSequenceClassification.from_pretrained('roberta-base', config=config)
tokenizer = AutoTokenizer.from_pretrained('roberta-base')

sentences = ['This is an example sentence', 'Each sentence is converted']
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded_input, output_hidden_states=True)
layer_hidden_states = model_output['hidden_states'][1:]  # discard the output of the embedding layer
for hidden_state in layer_hidden_states:
    deep_prompts = hidden_state[:, :model.pre_seq_len, :]  # [batch_size, pre_seq_len, hidden_size]
    assert deep_prompts[0].equal(deep_prompts[1])
Hello, I am a student at USTB. Building on previous work, I have an idea that combines contrastive learning with prompts and would like to verify it. I've read quite a few papers, and your method is relatively elegant, so I'd like to try it. But without the code I couldn't fully understand the paper. Could you provide even a rough, incomplete version first? In this fast-moving field, I'm sure you can understand how I feel. Email: [email protected]
Hi, "from torch import _softmax_backward_data" reports that it does not exist. What could be the reason?
File "run.py", line 7, in
import datasets
File "/home/appuser/miniconda3/envs/pt2/lib/python3.8/site-packages/datasets/init.py", line 37, in
from .builder import ArrowBasedBuilder, BeamBasedBuilder, BuilderConfig, DatasetBuilder, GeneratorBasedBuilder
File "/home/appuser/miniconda3/envs/pt2/lib/python3.8/site-packages/datasets/builder.py", line 44, in
from .data_files import DataFilesDict, _sanitize_patterns
File "/home/appuser/miniconda3/envs/pt2/lib/python3.8/site-packages/datasets/data_files.py", line 120, in
dataset_info: huggingface_hub.hf_api.DatasetInfo,
File "/home/appuser/miniconda3/envs/pt2/lib/python3.8/site-packages/huggingface_hub/init.py", line 290, in getattr
raise AttributeError(f"No {package_name} attribute {name}")
AttributeError: No huggingface_hub attribute hf_api
Hi, I have a question about deep prompts.
I understand that deep prompts are implemented through past_key_values in the model.
Then how can I see the actual prompt weights per layer?
I mean, the shape of the prompt is (prefix_len, config.num_hidden_layers * 2 * config.hidden_size) when the prefix projection (trans) is not used.
And the shape of past_key_values for the input is [2, batch_size, n_head, prefix_len, n_embd] for each layer; I believe the leading 2 corresponds to the key and value of the attention mechanism.
Here I want to obtain a [prefix_len, config.hidden_size] vector, just like the prompt embedding vector in prompt tuning v1.
Do you have any idea for this?
Thanks : )
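One way to recover a per-layer [prefix_len, hidden_size] view (a sketch based on the shapes described above; variable and attribute names are assumptions) is to merge the head dimensions of the key or value tensor back together:
# past_key_values: tuple of per-layer tensors shaped [2, batch_size, n_head, prefix_len, head_dim],
# where the leading 2 is key/value. Recover a [prefix_len, hidden_size] view for one layer:
layer_idx, batch_idx = 0, 0
key = past_key_values[layer_idx][0, batch_idx]   # [n_head, prefix_len, head_dim]
key = key.permute(1, 0, 2)                       # [prefix_len, n_head, head_dim]
prompt_vectors = key.reshape(key.shape[0], -1)   # [prefix_len, n_head * head_dim] == [prefix_len, hidden_size]

# Alternatively, without the trans MLP, the raw parameter could be viewed directly, e.g.
# model.prefix_encoder.embedding.weight.view(prefix_len, num_hidden_layers, 2, hidden_size)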
Hello, following the method in the paper, we reproduced the results on the People's Daily NER dataset, and our conclusions are basically consistent with the paper. However, on two classification datasets, RTE and the Ant Financial text similarity dataset, training does not converge. We tried reparameterizing the prefix embeddings with an MLP and with an LSTM, with little effect. Did the authors encounter anything similar on classification datasets?
Taking the Ant Financial dataset as an example, the loss does not decrease from the start, and on the dev set the model ends up predicting only the majority class.
We compared the gradients between NER and classification and found no obvious difference.
On the model side we tried RoBERTa-large, BERT-large, and BERT-base.
Would you be able to provide the gpt2 example for p-tuning v2?
Hello, when applying this method to T5 or BART, do both the encoder and the decoder need past_key_values initialization? After adding past_key_values, does the forward pass in the T5 or BART source code need to be modified when computing the attention scores? I hope to hear from you soon!
Thank you for your great work! What hyperparameters (number of epochs, lr, etc.) did you use for prompt tuning (v1) and fine tuning?
Hi. I have a question.
In the case of prefix tuning, I think there are clear advantages in training time.
However, I don't think there is a big advantage in inference time for the trained model. What do you think?
Hi, I found in the official Hugging Face documentation that the past_key_values parameter is only meant to speed up decoding by reusing precomputed key/value states. Can you please explain which one p-tuning v2 corresponds to, prefix or prompt?
Why do I get the impression that deep prompt tuning corresponds to --prompt rather than --prefix?
Thank you for your code contributions.
I've observed that training DeBERTaV2 with P-tuning v2 takes significantly more time to evaluate than with other methods. Have you observed such behaviour?
It even takes significantly more time than P-tuning v1, even though v1 has higher attention complexity to evaluate.
It seems the issue is the ad-hoc implementation of past_key_values for DeBERTa, which is the only difference in the backbone model code between v1 and v2, but I can't figure out the specific reason.
Hello!
I'm going to apply P-Tuning v2 to my task with a custom language model, but I failed to locate the part of the source code where deep prompt tuning is implemented. Could anyone please point out the file and lines where the additional parameters are added and used at each layer of the language model?
There are prefix models, but I can't see which parts of the code correspond to creating the Layer-N prompts from the illustration you provide. There is a prefix_encoder, which seems to create only one encoding layer. Does it create multiple independent layers, one for each layer of the LM, or did I misunderstand something in your technique? And if that is the part I'm looking for, where can I find a clearer explanation of what past_key_values does than the Hugging Face documentation?
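For reference, my reading of the repo (a simplified sketch; details may differ from the actual code) is that a single embedding table parameterizes the prompts of all layers at once, and get_prompt reshapes it into one (key, value) prompt pair per transformer layer, delivered through past_key_values:
import torch

class PrefixEncoder(torch.nn.Module):
    # One table of shape [pre_seq_len, num_hidden_layers * 2 * hidden_size]:
    # every layer's key and value prompts are slices of this single embedding.
    def __init__(self, config):
        super().__init__()
        self.embedding = torch.nn.Embedding(
            config.pre_seq_len, config.num_hidden_layers * 2 * config.hidden_size)

    def forward(self, prefix_tokens):
        return self.embedding(prefix_tokens)

def get_prompt(prefix_encoder, batch_size, pre_seq_len, n_layer, n_head, head_dim):
    prefix_tokens = torch.arange(pre_seq_len).unsqueeze(0).expand(batch_size, -1)
    past_key_values = prefix_encoder(prefix_tokens)                    # [b, L, n_layer * 2 * hidden]
    past_key_values = past_key_values.view(batch_size, pre_seq_len, n_layer * 2, n_head, head_dim)
    past_key_values = past_key_values.permute(2, 0, 3, 1, 4).split(2)  # n_layer tensors of [2, b, n_head, L, head_dim]
    return past_key_values  # passed to the backbone as past_key_values, one (key, value) pair per layer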
Hoping for the release of the code and weights.
Links for the PT-retrieval models in the README table appear to be switched around
When reproducing PT-2 on NER with CoNLL-2003 (data from Hugging Face), using the metric provided by the source code, the roberta-large model returned an f1_score of 95+ on the dev set. I then fully fine-tuned roberta-large and after only 3 epochs exceeded the fine-tuning f1 score baseline reported in the paper by 1%. I have some questions: is the result reported in the paper the overall_f1 returned directly by seqeval.metric, or is it obtained through additional computation?
Could you provide the original files of the conll2004 dataset used for PT-2 training? Thanks!
# pooled_output = outputs[1]
sequence_output = outputs[0]
sequence_output = sequence_output[:, self.pre_seq_len:, :].contiguous()
first_token_tensor = sequence_output[:, 0]
pooled_output = self.bert.pooler.dense(first_token_tensor)
pooled_output = self.bert.pooler.activation(pooled_output)
May I ask what the purpose of this modification is? The original approach tested fine in our experiments.
As the title says: running the example code raises an error: ConnectionError: Couldn't reach https://raw.githubusercontent.com/huggingface/datasets/1.15.1/metrics/super_glue/record_evaluation.py
Hi, thanks for your work :)
I've read your paper and tried to understand p-tuning-v2 from the implementation code in order to apply it to GPT2.
(Actually I've already written the code, but I'm not sure I did it correctly.)
I understand that p-tuning-v2 works via past_key_values, which contains the output of the prefix encoder.
Looking at the code, I found that the main difference between p-tuning and p-tuning v2 is also about the input and output shapes.
For v1, the input_ids and prompt embeddings are concatenated and injected directly into the model, so the output logits include the prompt positions, which are not used to compute the loss. For v2, on the other hand, the prefix encoder's output is injected via past_key_values, and the original input_ids together with past_key_values are fed to the model, which is different from the input for v1; so here the output logits do not include the first prompt-length positions.
I'd appreciate it if you could check if I understood correctly.
Thanks!
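A small runnable sketch of the shape difference described above, using GPT-2 with placeholder random prompts (the random tensors stand in for trained prompt parameters; this illustrates the two injection routes, it is not the repo's GPT-2 code):
import torch
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config()                      # default GPT-2 small sizes, no download needed
model = GPT2LMHeadModel(config).eval()
batch_size, seq_len, pre_seq_len = 2, 8, 4
input_ids = torch.randint(0, config.vocab_size, (batch_size, seq_len))

# v1-style: prompt embeddings are concatenated with the token embeddings,
# so the logits also cover the prompt positions (which are ignored in the loss).
prompt_embeds = torch.randn(batch_size, pre_seq_len, config.n_embd)
inputs_embeds = torch.cat([prompt_embeds, model.transformer.wte(input_ids)], dim=1)
out_v1 = model(inputs_embeds=inputs_embeds)
print(out_v1.logits.shape)                 # [2, 12, vocab_size]: includes the 4 prompt positions

# v2-style: per-layer prompts enter as past_key_values while the inputs stay input_ids,
# so the logits only cover the real tokens.
head_dim = config.n_embd // config.n_head
past_key_values = tuple(
    (torch.randn(batch_size, config.n_head, pre_seq_len, head_dim),
     torch.randn(batch_size, config.n_head, pre_seq_len, head_dim))
    for _ in range(config.n_layer))
attention_mask = torch.ones(batch_size, pre_seq_len + seq_len)
out_v2 = model(input_ids=input_ids, past_key_values=past_key_values, attention_mask=attention_mask)
print(out_v2.logits.shape)                 # [2, 8, vocab_size]: prompt positions are not in the logits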
Why is it that in GPT Understands, Too, with the bert-large model, P-tuning is nearly three points higher than P-tuning v2 on BoolQ, and likewise on CB? Why doesn't the paper compare against P-tuning?
Hello, what is the format of the NER datasets? I wasn't able to download them.
raise FileNotFoundError("Local file {} doesn't exist".format(url_or_filename))
FileNotFoundError: Local file tasks/ner/datasets/../../../data/ontoNotes/train.sd.conllx doesn't exist
I also can't quite tell from the NERDataset class what the data looks like.
Any help would be appreciated, thanks♪(・ω・)ノ
Hi, this repo contains nice code.
I wonder how to implement P-tuning-v2 with different prompt depths (e.g., the "Prompt Depth" experiment)?
The paper said "... we change their attention masks for disallowing their prefix prompts to involve in the computation."
May I ask how to change the attention masks in different layers? Is there any example code?
Thanks in advance!
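I haven't found example code for this either; one conceptual way to emulate it (an assumption on my part, not the authors' released implementation) is to keep past_key_values for all layers but give each layer its own attention mask, zeroing out the prefix positions in the layers that should not see prompts. Building those masks is straightforward; the backbone's layer loop then has to be modified to consume one mask per layer:
import torch

def per_layer_prefix_masks(attention_mask, pre_seq_len, n_layer, prompt_depth):
    # attention_mask: [batch_size, seq_len] over the real tokens (1 = attend).
    # Returns n_layer masks over [prefix + tokens]; layers at index >= prompt_depth
    # get zeros on the prefix positions, so their prompts are masked out of attention.
    batch_size = attention_mask.shape[0]
    masks = []
    for layer in range(n_layer):
        keep_prefix = 1.0 if layer < prompt_depth else 0.0
        prefix_mask = torch.full((batch_size, pre_seq_len), keep_prefix)
        masks.append(torch.cat([prefix_mask, attention_mask.float()], dim=1))
    return masks  # each [batch_size, pre_seq_len + seq_len]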
A question: what task is this linear head used for? How does it relate to the p-tuning task? Is this linear head frozen during p-tuning training?
I have some questions about reproducing the PT-2 results for RTE in Table 1.
My base model is RoBERTa-large, and I trained it for 10 epochs with the recommended hyperparameters (prompt length = 4, learning rate = 1e-2, as suggested in a previous issue).
However, I can only get roughly 58% accuracy on the RTE dev set.
I am not sure whether the factor below could cause this; I hope the authors can give me some hints, many thanks!
Nice work. But I have a question about FT.
Does FT in Table 3 mean FT or MFT (multi-task fine-tuning)?
Hello,
I'd like to know how you implement prompts with a depth smaller than the model's number of layers. Hugging Face requires the length of past_key_values to match the model's config.n_layers, so I don't think we can just pass prompts that don't cover all layers to past_key_values. Besides, it seems that layers cannot share the same attention_mask if some of them have prompts and some don't.
Thanks!
12/07/2021 19:30:19 - INFO - training.trainer_base - ***** Epoch 12: Best results *****
12/07/2021 19:30:19 - INFO - training.trainer_base - best_epoch = 0
12/07/2021 19:30:19 - INFO - training.trainer_base - best_eval_accuracy = 0.6217125382262997
12/07/2021 19:30:19 - INFO - training.trainer_base - epoch = 12.0
OrderedDict([('best_epoch', 0), ('best_eval_accuracy', 0.6217125382262997), ('epoch', 13.0)])
{'loss': 0.7488, 'learning_rate': 0.006054054054054054, 'epoch': 13.51}
How large a dataset is needed, at minimum, to fine-tune for a specific vertical domain?
Thank you for providing the source code for the nice work. I have some questions regarding Multi-task Learning (MPT-2) in Table 4. In the paper, the authors mention that
For the multi-task setting, we combine the training set of the three datasets for pre-training. We use different linear classifiers for each dataset while sharing the continuous prompts..
1. What do you mean by pre-training here? Do you first pre-train on all the datasets with labels and then continue training on a specific dataset?
2. I tried to find the pre-training code for multi-task learning in the repo but could not find it. Is it possible for you to make it public?
Many thanks for the clarification!
Thank you for your well-organized code. Since my major is computer vision, I am not familiar with NLP, but I'm very interested in P-Tuning.
Hello!
Thanks a lot for such a well-written article! The improvement over P-tuning is very impressive.
Could you please clarify which parameters are actually tuned during p-tuning? Is it an MLP producing input embeddings for the large model? Or is it a copy of some layers, e.g., the key-value projections for the tuned prefix tokens?
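For what it's worth, my reading of the repo's prefix models is that the backbone weights are frozen and only the prefix encoder (the per-layer key/value prompts) plus the task head receive gradients; a quick way to check on a constructed model (assuming a model class from this repo such as BertPrefixForSequenceClassification, which freezes bert.* in its constructor):
# Sketch: list which parameters would actually be updated by the optimizer.
trainable = [(name, p.numel()) for name, p in model.named_parameters() if p.requires_grad]
for name, numel in trainable:
    print(name, numel)
# Expectation under the assumption above: only prefix_encoder.* and classifier.* appear,
# i.e. the prompts are standalone per-layer key/value parameters, not an MLP producing
# input embeddings and not copies of existing transformer layers.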