
chatglm3-finetune's Issues

Training time multiplies when wrapped with SFTTrainer

When I train with the sample script, 2000 steps take less than 15 minutes, but after wrapping the training with TRL's SFTTrainer and adding NEFT, the same 2000 steps take 70 minutes. Why is there such a large difference in training time?
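For context, a minimal sketch of the setup described above, assuming a TRL version where `neftune_noise_alpha` is accepted directly by SFTTrainer (in newer releases it lives in SFTConfig); the model path and dataset field name are placeholders:

```python
# Hedged sketch of SFTTrainer + NEFT; model path and dataset schema are assumptions.
from datasets import load_dataset
from transformers import AutoModelForCausalLM
from trl import SFTTrainer

model = AutoModelForCausalLM.from_pretrained(
    "THUDM/chatglm3-6b", trust_remote_code=True)
dataset = load_dataset("json", data_files="alpaca_data.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",   # assumption about the JSONL schema
    max_seq_length=512,
    neftune_noise_alpha=5,       # NEFT: uniform noise added to embeddings
)
trainer.train()
```

One way to isolate the cause is to rerun with `neftune_noise_alpha` removed: NEFT itself only adds a single noise op to the embedding forward pass, so a 4-5x slowdown more likely comes from differences in sequence length, padding, or collation between the two training loops.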

RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

Has the author ever hit this problem during inference with infer.py?
Traceback (most recent call last):
File "infer.py", line 48, in
out = model.generate(
File "/data/Wangkh/anaconda3/envs/langchain/lib/python3.8/site-packages/peft/peft_model.py", line 1130, in generate
outputs = self.base_model.generate(**kwargs)
File "/data/Wangkh/anaconda3/envs/langchain/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/data/Wangkh/anaconda3/envs/langchain/lib/python3.8/site-packages/transformers/generation/utils.py", line 1572, in generate
return self.sample(
File "/data/Wangkh/anaconda3/envs/langchain/lib/python3.8/site-packages/transformers/generation/utils.py", line 2655, in sample
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either inf, nan or element < 0
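Not the repo's official fix, but two mitigations that commonly clear this error, sketched under the assumption that the inf/nan comes from fp16 overflow in the logits (the model path is a placeholder):

```python
import torch
from transformers import AutoModel, AutoTokenizer

path = "THUDM/chatglm3-6b"  # assumption: substitute your local model path
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
# Mitigation 1: load in a wider dtype so overflow cannot poison the logits.
model = AutoModel.from_pretrained(
    path, trust_remote_code=True, torch_dtype=torch.float32).eval()

inputs = tokenizer("你好", return_tensors="pt")
# Mitigation 2: greedy decoding bypasses torch.multinomial entirely.
out = model.generate(**inputs, do_sample=False, max_new_tokens=128)
print(tokenizer.decode(out[0]))
```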

No module named 'model'

When running finetune.py, the error shown is:

Traceback (most recent call last):
File "D:\dev\chatglm3-finetune\finetune.py", line 8, in
from model.modeling_chatglm import ChatGLMForConditionalGeneration
ModuleNotFoundError: No module named 'model'

The import in finetune.py is
from model.modeling_chatglm import ChatGLMForConditionalGeneration

Should `model` here point to the ChatGLM model directory?
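A guess at the intended layout, since the script imports from a local package rather than an installed one: finetune.py appears to expect the ChatGLM3 model sources in a `model/` directory next to it. The sketch below shows that layout and a sys.path fallback; the paths are illustrative:

```python
# Expected layout (assumption, inferred from the import):
#
#   chatglm3-finetune/
#   |-- finetune.py
#   |-- model/
#       |-- modeling_chatglm.py      # copied from the THUDM/chatglm3-6b repo
#       |-- tokenization_chatglm.py
#       |-- config.json, weight shards, ...
#
# If the folder lives elsewhere, add its parent to sys.path before the import:
import sys
sys.path.insert(0, r"D:\dev\chatglm3-finetune")  # hypothetical location
from model.modeling_chatglm import ChatGLMForConditionalGeneration
```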

Problem saving LoRA checkpoints

Have you ever seen the case where, when training ChatGLM3 with LoRA, the saved checkpoint contains a 12 GB pytorch_model.bin instead of the few-dozen-MB adapter_model.bin?
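One common cause, offered as a hypothesis: transformers' Trainer checkpoints the full wrapped state dict for models it does not recognize as PEFT, producing the 12 GB pytorch_model.bin. A sketch of a callback that additionally writes just the adapter (the callback name is mine):

```python
from transformers import TrainerCallback

class SavePeftAdapterCallback(TrainerCallback):
    """Hypothetical helper: also write adapter-only weights at each checkpoint."""
    def on_save(self, args, state, control, **kwargs):
        ckpt_dir = f"{args.output_dir}/checkpoint-{state.global_step}"
        # save_pretrained on a PeftModel writes only adapter_model.bin (tens of MB)
        kwargs["model"].save_pretrained(ckpt_dir)
        return control

# trainer = Trainer(..., callbacks=[SavePeftAdapterCallback()])
```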

Inference results after fine-tuning are incomplete

The outputs are all very short fragments; where could the problem be?
During preprocessing I already raised max_seq_length to 2500, but the outputs almost never exceed 50 Chinese characters.
python tokenize_dataset_rows.py --jsonl_path ./alpaca_data.jsonl --save_path ./alpaca --max_seq_length 2500
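Worth ruling out first, as an assumption about this codebase: --max_seq_length only caps the length of tokenized training samples; the generation length is set separately at inference time. A sketch of raising the generation budget (the model path is a placeholder):

```python
from transformers import AutoModel, AutoTokenizer

path = "THUDM/chatglm3-6b"  # assumption: your merged/fine-tuned model path
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModel.from_pretrained(path, trust_remote_code=True).half().cuda()

inputs = tokenizer("写一篇关于春天的短文", return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=1024,  # raise this; a small default max_length truncates replies
    do_sample=True,
)
print(tokenizer.decode(out[0]))
```

If raising max_new_tokens does not help, check whether the training targets were truncated during preprocessing; training on truncated answers can teach the model to emit EOS early.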

UnicodeDecodeError: 'gbk' codec can't decode byte 0xa4 in position 64: illegal multibyte sequence

2023-11-03 20:10:25,978 - WARNING - Loading data...
Traceback (most recent call last):
File "D:\test\chatglm3-base-tuning-master\train.py", line 52, in
trainer.train()
File "D:\test\chatglm3-base-tuning-master\trainer.py", line 19, in train
self.data_module = ChatDataModule(
^^^^^^^^^^^^^^^
File "D:\test\chatglm3-base-tuning-master\chat_data_module.py", line 75, in init
self.train_dataset = ChatDataset(tokenizer=tokenizer, data_path=data_path_train, max_tokens=max_tokens)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\test\chatglm3-base-tuning-master\chat_data_module.py", line 37, in init
conversations = jload(data_path)
^^^^^^^^^^^^^^^^
File "D:\test\chatglm3-base-tuning-master\chat_data_module.py", line 28, in jload
jdict = json.load(f)
^^^^^^^^^^^^
File "D:\test\chatglm3-base-tuning-master\env\Lib\json_init_.py", line 293, in load
return loads(fp.read(),
^^^^^^^^^
UnicodeDecodeError: 'gbk' codec can't decode byte 0xa4 in position 64: illegal multibyte sequence

The data file in use is formatted_samples.json.
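The traceback points at json.load reading the file with Windows' default GBK codec. A likely fix, assuming the jload helper in chat_data_module.py wraps open():

```python
import json

def jload(path: str):
    # Pass encoding explicitly; on Windows, open() defaults to the locale
    # codec (GBK), which cannot decode UTF-8 data files.
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

conversations = jload("formatted_samples.json")
```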

Something wrong in the data preprocessing

This is a great repository, which provides fine-tuning for ChatGLM3. But when I followed the process in the README and ran the tokenize_dataset_rows.py script, it reported these errors:

```
python tokenize_dataset_rows.py --jsonl_path ./alpaca_data.jsonl --save_path ./alpaca --max_seq_length 200
Downloading and preparing dataset generator/default to C:/Users/yt758/.cache/huggingface/datasets/generator/default-10116cbfdb8a1e8b/0.0.0...
HF google storage unreachable. Downloading and preparing it from source
Generating train split: 0 examples [00:00, ? examples/s]'(MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /model/resolve/main/tokenizer_config.json (Caused by ProxyError('Unable to connect to proxy', SSLError(SSLZeroReturnError(6, 'TLS/SSL connection has been closed (EOF) (_ssl.c:1131)'))))"), '(Request ID: 09963958-bc38-4941-bfac-92e4491eae09)')' thrown while requesting HEAD https://huggingface.co/model/resolve/main/tokenizer_config.json
Generating train split: 0 examples [00:02, ? examples/s]urllib3.exceptions.SSLError: TLS/SSL connection has been closed (EOF) (_ssl.c:1131)

The above exception was the direct cause of the following exception:

urllib3.exceptions.ProxyError: ('Unable to connect to proxy', SSLError(SSLZeroReturnError(6, 'TLS/SSL connection has been closed (EOF) (_ssl.c:1131)')))

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "C:\software\anaconda3\envs\t001\lib\site-packages\requests\adapters.py", line 486, in send
resp = conn.urlopen(
File "C:\software\anaconda3\envs\t001\lib\site-packages\urllib3\connectionpool.py", line 845, in urlopen
retries = retries.increment(
File "C:\software\anaconda3\envs\t001\lib\site-packages\urllib3\util\retry.py", line 515, in increment
raise MaxRetryError(_pool, url, reason) from reason # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /model/resolve/main/tokenizer_config.json (Caused by ProxyError('Unable to connect to proxy', SSLError(SSLZeroReturnError(6, 'TLS/SSL connection has been closed (EOF) (_ssl.c:1131)'))))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\software\anaconda3\envs\t001\lib\site-packages\datasets\builder.py", line 1608, in _prepare_split_single
for key, record in generator:
File "C:\software\anaconda3\envs\t001\lib\site-packages\datasets\packaged_modules\generator\generator.py", line 30, in _generate_examples
for idx, ex in enumerate(self.config.generator(**gen_kwargs)):
File "tokenize_dataset_rows.py", line 21, in read_jsonl
tokenizer = transformers.AutoTokenizer.from_pretrained(
File "C:\software\anaconda3\envs\t001\lib\site-packages\transformers\models\auto\tokenization_auto.py", line 643, in from_pretrained
tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
File "C:\software\anaconda3\envs\t001\lib\site-packages\transformers\models\auto\tokenization_auto.py", line 487, in get_tokenizer_config
resolved_config_file = cached_file(
File "C:\software\anaconda3\envs\t001\lib\site-packages\transformers\utils\hub.py", line 417, in cached_file
resolved_file = hf_hub_download(
File "C:\software\anaconda3\envs\t001\lib\site-packages\huggingface_hub\utils_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File "C:\software\anaconda3\envs\t001\lib\site-packages\huggingface_hub\file_download.py", line 1233, in hf_hub_download
metadata = get_hf_file_metadata(
File "C:\software\anaconda3\envs\t001\lib\site-packages\huggingface_hub\utils_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File "C:\software\anaconda3\envs\t001\lib\site-packages\huggingface_hub\file_download.py", line 1613, in get_hf_file_metadata
r = _request_wrapper(
File "C:\software\anaconda3\envs\t001\lib\site-packages\huggingface_hub\file_download.py", line 418, in _request_wrapper
response = _request_wrapper(
File "C:\software\anaconda3\envs\t001\lib\site-packages\huggingface_hub\file_download.py", line 453, in _request_wrapper
return http_backoff(
File "C:\software\anaconda3\envs\t001\lib\site-packages\huggingface_hub\utils_http.py", line 274, in http_backoff
raise err
File "C:\software\anaconda3\envs\t001\lib\site-packages\huggingface_hub\utils_http.py", line 258, in http_backoff
response = session.request(method=method, url=url, **kwargs)
File "C:\software\anaconda3\envs\t001\lib\site-packages\requests\sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
File "C:\software\anaconda3\envs\t001\lib\site-packages\requests\sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
File "C:\software\anaconda3\envs\t001\lib\site-packages\huggingface_hub\utils_http.py", line 63, in send
return super().send(request, *args, **kwargs)
File "C:\software\anaconda3\envs\t001\lib\site-packages\requests\adapters.py", line 513, in send
raise ProxyError(e, request=request)
requests.exceptions.ProxyError: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /model/resolve/main/tokenizer_config.json (Caused by ProxyError('Unable to connect to proxy', SSLError(SSLZeroReturnError(6, 'TLS/SSL connection has been closed (EOF) (_ssl.c:1131)'))))"), '(Request ID: 09963958-bc38-4941-bfac-92e4491eae09)')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "tokenize_dataset_rows.py", line 48, in
main()
File "tokenize_dataset_rows.py", line 42, in main
dataset = datasets.Dataset.from_generator(
File "C:\software\anaconda3\envs\t001\lib\site-packages\datasets\arrow_dataset.py", line 1012, in from_generator
return GeneratorDatasetInputStream(
File "C:\software\anaconda3\envs\t001\lib\site-packages\datasets\io\generator.py", line 47, in read
self.builder.download_and_prepare(
File "C:\software\anaconda3\envs\t001\lib\site-packages\datasets\builder.py", line 872, in download_and_prepare
self._download_and_prepare(
File "C:\software\anaconda3\envs\t001\lib\site-packages\datasets\builder.py", line 1649, in _download_and_prepare
super()._download_and_prepare(
File "C:\software\anaconda3\envs\t001\lib\site-packages\datasets\builder.py", line 967, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "C:\software\anaconda3\envs\t001\lib\site-packages\datasets\builder.py", line 1488, in _prepare_split
for job_id, done, content in self._prepare_split_single(
File "C:\software\anaconda3\envs\t001\lib\site-packages\datasets\builder.py", line 1644, in _prepare_split_single
raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset
```
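My reading of the log, not confirmed by the repo: the script passes "model" as the pretrained name, no local model/ directory is found, so transformers falls back to fetching https://huggingface.co/model/... and dies on the proxy. Pointing from_pretrained at an existing local download avoids the network entirely:

```python
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained(
    r"C:\models\chatglm3-6b",  # assumption: a locally downloaded ChatGLM3 dir
    trust_remote_code=True,
    local_files_only=True,     # fail fast instead of retrying through the proxy
)
```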

It is recommended to support multiple GPU cards

Traceback (most recent call last):
File "finetune.py", line 70, in
main()
File "finetune.py", line 55, in main
model = get_peft_model(model, peft_config).to("cuda")
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 989, in to
return self._apply(convert)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 641, in _apply
module._apply(fn)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 641, in _apply
module._apply(fn)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 641, in _apply
module._apply(fn)
[Previous line repeated 5 more times]
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 664, in _apply
param_applied = fn(param)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 987, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 72.00 MiB (GPU 0; 22.03 GiB total capacity; 20.75 GiB already allocated; 56.88 MiB free; 21.26 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
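A hedged sketch of one way to shard the model across the available GPUs with accelerate's device_map, instead of moving the whole model to a single device as finetune.py's .to("cuda") does; the config values are illustrative:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "THUDM/chatglm3-6b",   # assumption; substitute the local model path
    trust_remote_code=True,
    device_map="auto",     # requires accelerate; shards layers across GPUs
)
peft_config = LoraConfig(r=4, lora_alpha=32, lora_dropout=0.1,
                         task_type="CAUSAL_LM")
# No .to("cuda") afterwards: the layers already sit on their assigned devices.
model = get_peft_model(model, peft_config)
```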

FineTune CUDA out of memory

(chatglm3-finetune) root@g101:/data/ChatGLM3/chatglm3-finetune# python finetune.py --dataset_path ./alpaca --lora_rank 4 --per_device_train_batch_size 1 --gradient_accumulation_steps 1 --max_steps 52000 --save_steps 1000 --save_total_limit 20 --learning_rate 1e-4 --remove_unused_columns false --logging_steps 50 --output_dir output
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
Loading checkpoint shards: 100%|████████████████████████████| 7/7 [00:08<00:00, 1.22s/it]
Traceback (most recent call last):
File "/data/ChatGLM3/chatglm3-finetune/finetune.py", line 70, in
main()
File "/data/ChatGLM3/chatglm3-finetune/finetune.py", line 55, in main
model = get_peft_model(model, peft_config).to("cuda:1")
File "/root/miniconda3/envs/chatglm3-finetune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 989, in to
return self._apply(convert)
File "/root/miniconda3/envs/chatglm3-finetune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 641, in _apply
module._apply(fn)
File "/root/miniconda3/envs/chatglm3-finetune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 641, in _apply
module._apply(fn)
File "/root/miniconda3/envs/chatglm3-finetune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 641, in _apply
module._apply(fn)
[Previous line repeated 1 more time]
File "/root/miniconda3/envs/chatglm3-finetune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 664, in _apply
param_applied = fn(param)
File "/root/miniconda3/envs/chatglm3-finetune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 987, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1016.00 MiB (GPU 1; 23.69 GiB total capacity; 22.27 GiB already allocated; 691.69 MiB free; 22.66 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
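For a single ~24 GB card, a hedged sketch of the usual memory levers (assumptions about this setup, not the repo's defaults): half-precision weights plus gradient checkpointing before attaching LoRA, and optionally the allocator hint from the error message:

```python
# Optionally, before launching: export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "THUDM/chatglm3-6b",        # assumption; substitute the local path
    trust_remote_code=True,
    torch_dtype=torch.float16,  # ~12 GB of weights instead of ~24 GB in fp32
).cuda()
model.gradient_checkpointing_enable()  # trade recompute for activation memory
model.enable_input_require_grads()     # needed when combining with LoRA

peft_config = LoraConfig(r=4, lora_alpha=32, lora_dropout=0.1,
                         task_type="CAUSAL_LM")
model = get_peft_model(model, peft_config)
```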

GPU memory explodes during evaluation

I added an eval_dataset to the Trainer and wrote a compute_metrics function to compute some evaluation metrics, such as the precision/recall of function calling and the BLEU score of the reply text.

The problem: GPU memory explodes during evaluation. Training uses 10+ GB of GPU memory, but at eval time usage suddenly jumps to 60+ GB and eventually OOMs.

Have you run into anything similar?
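A common cause, offered as a hypothesis for this report: with compute_metrics set, Trainer keeps the full [batch, seq_len, vocab]-sized logits of every eval batch on the GPU before concatenating them. Two standard knobs help:

```python
import torch
from transformers import TrainingArguments

def preprocess_logits_for_metrics(logits, labels):
    # Reduce logits to token ids immediately, so only [batch, seq_len] ints
    # are accumulated instead of vocab-sized float tensors.
    return torch.argmax(logits, dim=-1)

args = TrainingArguments(
    output_dir="output",
    eval_accumulation_steps=8,  # move accumulated tensors to CPU every 8 steps
)
# trainer = Trainer(..., args=args,
#                   compute_metrics=compute_metrics,
#                   preprocess_logits_for_metrics=preprocess_logits_for_metrics)
```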
