
Comments (16)

shibing624 commented on September 1, 2024

Setting add_special_tokens=True when tokenizing should fix it.
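
A minimal sketch of that flag in use, assuming the Hugging Face ChatGLM-6B tokenizer (the model path here is illustrative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
ids = tokenizer("你好,世界", add_special_tokens=True)["input_ids"]
print(ids)  # with the flag on, the tokenizer's special ids (including the 130004 discussed here) should appear at the end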


sexan commented on September 1, 2024

Setting add_special_tokens=True when tokenizing should fix it.

That doesn't seem to work; in my tests the output is identical with or without add_special_tokens=True, so the default is probably already True.
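
A quick way to test that guess, assuming the same ChatGLM-6B tokenizer (illustrative path):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
default_ids = tokenizer("你好")["input_ids"]
explicit_ids = tokenizer("你好", add_special_tokens=True)["input_ids"]
print(default_ids == explicit_ids)  # True would confirm the default is already add_special_tokens=True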


shibing624 commented on September 1, 2024

It does have 130004. For Q&A you concatenate the two sentences (q + a), append the answer after that, and feed it to the model.


sexan commented on September 1, 2024

It does have 130004. For Q&A you concatenate the two sentences (q + a), append the answer after that, and feed it to the model.

Right, when I tokenize by hand the 130004 is always there. But during pretraining with block_size=64, some of the printed outputs contain 130004 and some don't; with block_size=128 or higher the problem goes away. I suspect it's related to block_size; I'll keep digging.
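
A toy illustration of that suspicion (ids are made up): group_texts() concatenates every tokenized sample and slices the stream into fixed-size blocks, so with a small block_size a block boundary can easily cut out a chunk containing no 130004 at all.

# 130004 marks the start of each original sample in this made-up stream
stream = [130004, 5, 6, 7, 8, 9, 10, 11, 130004, 12, 13]
block_size = 4
blocks = [stream[i:i + block_size] for i in range(0, len(stream), block_size)]
print(blocks)  # [[130004, 5, 6, 7], [8, 9, 10, 11], [130004, 12, 13]]
# the middle block has no 130004 -- exactly the "130004 is not in list" case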


charryshi commented on September 1, 2024

I ran into the same problem; with block_size=512 it never happened again. I'm running with --use_peft True.


sexan commented on September 1, 2024

Confirmed, with block_size set to 512 or above it's fine. One more question: I can't get ChatGLM to train with ZeRO-3, and the open-source projects I've looked at, including the official ChatGLM repo, all use ZeRO-2. Is ZeRO-3 just not supported yet, or have you gotten ChatGLM running with ZeRO-3?


charryshi commented on September 1, 2024

Haven't tried ChatGLM; I'm running LLaMA-13B.


Stupid-Ai commented on September 1, 2024

I ran into this too, using my own dataset where each line is under 2048 tokens. GLM pretraining throws "130004 not in list". Debug-printing the samples that lack 130004 suggests the long texts are the cause: the slice that gets cut out just happens not to contain 130004. I haven't thought carefully about how to prevent this; for now I only added a check in group_texts() that skips any block missing 130004:

result = {
    k: [t[i: i + block_size] for i in range(0, total_length, block_size) if 130004 in t[i: i + block_size]]
    for k, t in concatenated_examples.items()
}

I haven't found a better way to avoid the error yet.


shibing624 commented on September 1, 2024

That works too.


sexan commented on September 1, 2024

The root cause is that GLM needs the [gMASK] marker (corresponding to 130004) at training time to identify the autoregressive training objective. So set add_special_tokens=False when tokenizing, and manually prepend 130004 when building each training sample. The problem isn't block_size:

def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder; we could pad instead if the model supported
    # it. Customize this part to your needs.
    block_size_ = block_size - 3  # reserve room for the 3 special tokens added below
    if total_length >= block_size_:
        total_length = (total_length // block_size_) * block_size_
    # Split into chunks of block_size_ and wrap each with the special tokens.
    prefix = [config.gmask_token_id, config.bos_token_id]
    suffix = [config.eos_token_id]
    result = {
        k: [prefix + t[i : i + block_size_] + suffix for i in range(0, total_length, block_size_)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
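
For completeness, the tokenization step this pairs with might look like the sketch below; tokenize_function is a hypothetical name, and the only point is that no special tokens are added at tokenize time, since group_texts() wraps each block itself:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)

def tokenize_function(examples):
    # add_special_tokens=False: group_texts() above prepends [gMASK] + bos
    # and appends eos per block, so nothing must be added here
    return tokenizer(examples["text"], add_special_tokens=False)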


daimazz1 commented on September 1, 2024

The root cause is that GLM needs the [gMASK] marker (corresponding to 130004) at training time … (see the group_texts fix above)

Hi, could you explain exactly how to make this change? I'm hitting the same "130004 not in list" problem. In these lines:

prefix = [config.gmask_token_id, config.bos_token_id]
suffix = [config.eos_token_id]

there's no config variable; how should it be defined?


sexan commented on September 1, 2024

With prefix = [config.gmask_token_id, config.bos_token_id] and suffix = [config.eos_token_id], there's no config variable; how should it be defined?

Just load the ChatGLM config:
config = AutoConfig.from_pretrained(model_args.model_name_or_path, trust_remote_code=True, **config_kwargs)
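
Once loaded, a quick sanity check that the config carries the ids group_texts() relies on (attribute names as used in the fix above; illustrative path):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
print(config.gmask_token_id, config.bos_token_id, config.eos_token_id)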


cxjtju commented on September 1, 2024

Hi, how exactly should group_texts be changed?


cxjtju commented on September 1, 2024

It seems related to text length? With the data in tianlongbabu.txt the problem doesn't occur.


daimazz1 commented on September 1, 2024

Just load the ChatGLM config: config = AutoConfig.from_pretrained(model_args.model_name_or_path, trust_remote_code=True, **config_kwargs)

Hi, I modified group_texts(examples) as you described, loading config like this:

config = AutoConfig.from_pretrained(
    model_args.model_name_or_path,
    torch_dtype=torch_dtype,
    trust_remote_code=model_args.trust_remote_code,
    cache_dir=model_args.cache_dir,
)

The "130004 not in list" error is gone, but after running on a little data it fails with:

File "/home/kyh/anaconda3/envs/MGPT/lib/python3.7/site-packages/torch/nn/functional.py", line 2515, in layer_norm
    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: expected scalar type Half but found Float
0%|▍ | 50/14231 [03:48<17:58:12, 4.56s/it]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 31634) of binary: /home/kyh/anaconda3/envs/MGPT/bin/python
Traceback (most recent call last):

Could you explain in more detail how to fix this?
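
Not a confirmed fix for this exact setup, but "expected scalar type Half but found Float" usually means fp16 weights are meeting fp32 tensors (or vice versa) inside layer_norm. A sketch of one common remedy, loading the model in a single consistent dtype (illustrative path):

import torch
from transformers import AutoModel

# load everything in fp16 so LayerNorm weights and inputs agree in dtype;
# alternatively, load in torch.float32 and disable fp16 in the training args
model = AutoModel.from_pretrained(
    "THUDM/chatglm-6b",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)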


DengNingyuan commented on September 1, 2024

How do you solve this?

