
chatglm3-finetune's Introduction

Hi there 👋

🚀 Deep in the Trenches with Big Models! 🧠

Hey there! If my code repositories haven't given it away yet, I'm currently deep-diving into the fascinating world of large-scale models. Think of me as a digital spelunker exploring the cavernous depths of artificial neural networks.

🤖 Current Mission

My days (and quite a few of my nights) are spent wrangling with datasets that have more dimensions than a sci-fi multiverse, and coaxing layers upon layers of neurons to play nice with each other. It's like hosting a galactic party and ensuring every star shines just right.

🧐 The Scoop

If you've stumbled upon me in search of some wisdom in the realm of machine learning, or perhaps to collaborate on creating the next sentient toaster – you've come to the right place.

📊 Algorithmic Alchemist: Converting data into gold since [the year you started].

⚙️ Mad Scientist in Training: I practice responsible conjuring of AI spirits without the need for arcane incantations (most of the time).

🕰️ Time Travel Consultant: Advising temporal adventurers on how not to disrupt the space-time continuum with TensorFlow.

📬 Get in Touch

Feel free to raise an issue, submit a pull request, or simply marvel at the sheer complexity of what we're trying to achieve here. But remember, if I'm a bit slow to respond, it's only because my current compile is taking its sweet time—or I might have accidentally created a small black hole on my desk again.

Keep on coding, stay curious, and remember: a day without a segmentation fault is like a day without sunshine!

👨‍💻 Happy coding, and may the force (of the closing curly brace) be with you! 👨‍💻

chatglm3-finetune's People

Contributors

xxw1995


chatglm3-finetune's Issues

FineTune CUDA out of memory

(chatglm3-finetune) root@g101:/data/ChatGLM3/chatglm3-finetune# python finetune.py --dataset_path ./alpaca --lora_rank 4 --per_device_train_batch_size 1 --gradient_accumulation_steps 1 --max_steps 52000 --save_steps 1000 --save_total_limit 20 --learning_rate 1e-4 --remove_unused_columns false --logging_steps 50 --output_dir output
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
Loading checkpoint shards: 100%|████████████████████████████| 7/7 [00:08<00:00, 1.22s/it]
Traceback (most recent call last):
File "/data/ChatGLM3/chatglm3-finetune/finetune.py", line 70, in
main()
File "/data/ChatGLM3/chatglm3-finetune/finetune.py", line 55, in main
model = get_peft_model(model, peft_config).to("cuda:1")
File "/root/miniconda3/envs/chatglm3-finetune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 989, in to
return self._apply(convert)
File "/root/miniconda3/envs/chatglm3-finetune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 641, in _apply
module._apply(fn)
File "/root/miniconda3/envs/chatglm3-finetune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 641, in _apply
module._apply(fn)
File "/root/miniconda3/envs/chatglm3-finetune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 641, in _apply
module._apply(fn)
[Previous line repeated 1 more time]
File "/root/miniconda3/envs/chatglm3-finetune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 664, in _apply
param_applied = fn(param)
File "/root/miniconda3/envs/chatglm3-finetune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 987, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1016.00 MiB (GPU 1; 23.69 GiB total capacity; 22.27 GiB already allocated; 691.69 MiB free; 22.66 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
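
For reference, one common mitigation (not confirmed as the root cause here): load the base weights in fp16 and set the allocator option the error message suggests, so the `.to("cuda:1")` call needs less headroom. A minimal sketch, assuming the standard transformers/peft APIs and a placeholder local checkpoint path:

```python
import os

# Assumption: this only reduces allocator fragmentation, as the error message suggests.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch
from transformers import AutoModel
from peft import LoraConfig, TaskType, get_peft_model

# Placeholder local checkpoint directory; substitute the actual ChatGLM3-6B path.
MODEL_PATH = "./model"

# Load the base weights directly in fp16 so that moving the model to the GPU
# does not materialise a full fp32 copy.
model = AutoModel.from_pretrained(
    MODEL_PATH, trust_remote_code=True, torch_dtype=torch.float16
)

peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM, r=4, lora_alpha=32, lora_dropout=0.1
)
model = get_peft_model(model, peft_config).to("cuda:1")
```

If GPU 1 is already partly occupied by another process, freeing it (or choosing an empty card) is still required; the sketch only shrinks this script's footprint.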

No module named 'model'

When running finetune.py, the following error appears:

Traceback (most recent call last):
File "D:\dev\chatglm3-finetune\finetune.py", line 8, in
from model.modeling_chatglm import ChatGLMForConditionalGeneration
ModuleNotFoundError: No module named 'model'

The import near the top of finetune.py is
from model.modeling_chatglm import ChatGLMForConditionalGeneration

Should this point at the ChatGLM model directory?
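
finetune.py imports from a local `model` package, so the ChatGLM3 checkpoint (code plus weights) is expected in a `./model` directory next to the script; cloning the Hugging Face `THUDM/chatglm3-6b` repository into that directory is one way to satisfy the import. Alternatively, the script can be changed to use the Auto classes. A hedged sketch of that alternative:

```python
# A sketch of the Auto-class alternative, assuming the ChatGLM3-6B checkpoint is
# available either locally or on the Hub; MODEL_PATH is a placeholder.
from transformers import AutoModel, AutoTokenizer

MODEL_PATH = "THUDM/chatglm3-6b"  # or a local checkpoint directory

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_PATH, trust_remote_code=True)
```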

It is recommended to support multiple GPU cards

Traceback (most recent call last):
File "finetune.py", line 70, in
main()
File "finetune.py", line 55, in main
model = get_peft_model(model, peft_config).to("cuda")
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 989, in to
return self._apply(convert)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 641, in _apply
module._apply(fn)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 641, in _apply
module._apply(fn)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 641, in _apply
module._apply(fn)
[Previous line repeated 5 more times]
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 664, in _apply
param_applied = fn(param)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 987, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 72.00 MiB (GPU 0; 22.03 GiB total capacity; 20.75 GiB already allocated; 56.88 MiB free; 21.26 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
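
One possible workaround until multi-GPU support lands, assuming `accelerate` is installed: let transformers shard the layers across all visible cards with `device_map="auto"` instead of moving the whole model onto a single GPU. Whether the rest of the training loop tolerates a sharded model has not been verified here; this is only a sketch:

```python
import torch
from transformers import AutoModel
from peft import LoraConfig, TaskType, get_peft_model

# Assumption: `accelerate` is installed. device_map="auto" spreads the layers
# across all visible GPUs instead of pushing everything onto one card.
model = AutoModel.from_pretrained(
    "./model",                     # placeholder for the local ChatGLM3-6B path
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto",
)

peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM, r=4, lora_alpha=32, lora_dropout=0.1
)
# No explicit .to("cuda") afterwards; each layer already lives on its assigned device.
model = get_peft_model(model, peft_config)
```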

Inference output after fine-tuning is incomplete

The outputs are all very short fragments; what could be going wrong?
During preprocessing I already increased max_seq_length to 2500, but the outputs almost never exceed 50 Chinese characters.
python tokenize_dataset_rows.py --jsonl_path ./alpaca_data.jsonl --save_path ./alpaca --max_seq_length 2500
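
Note that `max_seq_length` in the preprocessing step only bounds the length of the training samples; the length of the generated text is controlled by the arguments passed to `model.generate()` in infer.py. A sketch of generation with an explicit token budget, using standard transformers parameters and a placeholder checkpoint path (if the fine-tuning data itself contained very short answers, the model may also simply have learned to stop early):

```python
from transformers import AutoModel, AutoTokenizer

MODEL_PATH = "THUDM/chatglm3-6b"  # placeholder; use the fine-tuned checkpoint here

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_PATH, trust_remote_code=True).half().cuda()

inputs = tokenizer("写一篇关于春天的短文", return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=1024,   # generation length is controlled here,
    do_sample=True,        # not by the preprocessing max_seq_length
    top_p=0.9,
    temperature=0.8,
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```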

RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

Has the author ever hit this problem when running inference with infer.py?
Traceback (most recent call last):
File "infer.py", line 48, in
out = model.generate(
File "/data/Wangkh/anaconda3/envs/langchain/lib/python3.8/site-packages/peft/peft_model.py", line 1130, in generate
outputs = self.base_model.generate(**kwargs)
File "/data/Wangkh/anaconda3/envs/langchain/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/data/Wangkh/anaconda3/envs/langchain/lib/python3.8/site-packages/transformers/generation/utils.py", line 1572, in generate
return self.sample(
File "/data/Wangkh/anaconda3/envs/langchain/lib/python3.8/site-packages/transformers/generation/utils.py", line 2655, in sample
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either inf, nan or element < 0
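
This error is raised by `torch.multinomial` when the sampling distribution contains NaN or inf, often because fp16 logits overflowed or the base model and LoRA adapter were loaded with mismatched dtypes. Two common mitigations, sketched below with placeholder paths; neither is a root-cause fix:

```python
import torch
from transformers import AutoModel, AutoTokenizer
from peft import PeftModel

MODEL_PATH = "THUDM/chatglm3-6b"            # placeholder base checkpoint
ADAPTER_PATH = "./output/checkpoint-1000"   # placeholder LoRA adapter directory

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
base = AutoModel.from_pretrained(
    MODEL_PATH, trust_remote_code=True, torch_dtype=torch.float16
)
model = PeftModel.from_pretrained(base, ADAPTER_PATH).cuda()

inputs = tokenizer("你好", return_tensors="pt").to(model.device)

# Mitigation 1: greedy decoding never calls torch.multinomial, so a single bad
# probability does not abort generation.
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)

# Mitigation 2: if the NaNs come from fp16 overflow, run the forward pass in fp32.
# model = model.float()

print(tokenizer.decode(out[0], skip_special_tokens=True))
```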

Something wrong in the data preprocessing

This is a great repository that provides fine-tuning for ChatGLM3. But when I followed the process in the README and ran the tokenize_dataset_rows.py script, it reported these errors:

python tokenize_dataset_rows.py --jsonl_path ./alpaca_data.jsonl --save_path ./alpaca --max_seq_length 200
Downloading and preparing dataset generator/default to C:/Users/yt758/.cache/huggingface/datasets/generator/default-10116cbfdb8a1e8b/0.0.0...
HF google storage unreachable. Downloading and preparing it from source
Generating train split: 0 examples [00:00, ? examples/s]'(MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /model/resolve/main/tokenizer_config.json (Caused by ProxyError('Unable to connect to proxy', SSLError(SSLZeroReturnError(6, 'TLS/SSL connection has been closed (EOF) (_ssl.c:1131)'))))"), '(Request ID: 09963958-bc38-4941-bfac-92e4491eae09)')' thrown while requesting HEAD https://huggingface.co/model/resolve/main/tokenizer_config.json
Generating train split: 0 examples [00:02, ? examples/s]urllib3.exceptions.SSLError: TLS/SSL connection has been closed (EOF) (_ssl.c:1131)

The above exception was the direct cause of the following exception:

urllib3.exceptions.ProxyError: ('Unable to connect to proxy', SSLError(SSLZeroReturnError(6, 'TLS/SSL connection has been closed (EOF) (_ssl.c:1131)')))

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "C:\software\anaconda3\envs\t001\lib\site-packages\requests\adapters.py", line 486, in send
resp = conn.urlopen(
File "C:\software\anaconda3\envs\t001\lib\site-packages\urllib3\connectionpool.py", line 845, in urlopen
retries = retries.increment(
File "C:\software\anaconda3\envs\t001\lib\site-packages\urllib3\util\retry.py", line 515, in increment
raise MaxRetryError(_pool, url, reason) from reason # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /model/resolve/main/tokenizer_config.json (Caused by ProxyError('Unable to connect to proxy', SSLError(SSLZeroReturnError(6, 'TLS/SSL connection has been closed (EOF) (_ssl.c:1131)'))))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\software\anaconda3\envs\t001\lib\site-packages\datasets\builder.py", line 1608, in _prepare_split_single
for key, record in generator:
File "C:\software\anaconda3\envs\t001\lib\site-packages\datasets\packaged_modules\generator\generator.py", line 30, in _generate_examples
for idx, ex in enumerate(self.config.generator(**gen_kwargs)):
File "tokenize_dataset_rows.py", line 21, in read_jsonl
tokenizer = transformers.AutoTokenizer.from_pretrained(
File "C:\software\anaconda3\envs\t001\lib\site-packages\transformers\models\auto\tokenization_auto.py", line 643, in from_pretrained
tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
File "C:\software\anaconda3\envs\t001\lib\site-packages\transformers\models\auto\tokenization_auto.py", line 487, in get_tokenizer_config
resolved_config_file = cached_file(
File "C:\software\anaconda3\envs\t001\lib\site-packages\transformers\utils\hub.py", line 417, in cached_file
resolved_file = hf_hub_download(
File "C:\software\anaconda3\envs\t001\lib\site-packages\huggingface_hub\utils_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File "C:\software\anaconda3\envs\t001\lib\site-packages\huggingface_hub\file_download.py", line 1233, in hf_hub_download
metadata = get_hf_file_metadata(
File "C:\software\anaconda3\envs\t001\lib\site-packages\huggingface_hub\utils_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File "C:\software\anaconda3\envs\t001\lib\site-packages\huggingface_hub\file_download.py", line 1613, in get_hf_file_metadata
r = _request_wrapper(
File "C:\software\anaconda3\envs\t001\lib\site-packages\huggingface_hub\file_download.py", line 418, in _request_wrapper
response = _request_wrapper(
File "C:\software\anaconda3\envs\t001\lib\site-packages\huggingface_hub\file_download.py", line 453, in _request_wrapper
return http_backoff(
File "C:\software\anaconda3\envs\t001\lib\site-packages\huggingface_hub\utils_http.py", line 274, in http_backoff
raise err
File "C:\software\anaconda3\envs\t001\lib\site-packages\huggingface_hub\utils_http.py", line 258, in http_backoff
response = session.request(method=method, url=url, **kwargs)
File "C:\software\anaconda3\envs\t001\lib\site-packages\requests\sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
File "C:\software\anaconda3\envs\t001\lib\site-packages\requests\sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
File "C:\software\anaconda3\envs\t001\lib\site-packages\huggingface_hub\utils_http.py", line 63, in send
return super().send(request, *args, **kwargs)
File "C:\software\anaconda3\envs\t001\lib\site-packages\requests\adapters.py", line 513, in send
raise ProxyError(e, request=request)
requests.exceptions.ProxyError: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /model/resolve/main/tokenizer_config.json (Caused by ProxyError('Unable to connect to proxy', SSLError(SSLZeroReturnError(6, 'TLS/SSL connection has been closed (EOF) (_ssl.c:1131)'))))"), '(Request ID: 09963958-bc38-4941-bfac-92e4491eae09)')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "tokenize_dataset_rows.py", line 48, in
main()
File "tokenize_dataset_rows.py", line 42, in main
dataset = datasets.Dataset.from_generator(
File "C:\software\anaconda3\envs\t001\lib\site-packages\datasets\arrow_dataset.py", line 1012, in from_generator
return GeneratorDatasetInputStream(
File "C:\software\anaconda3\envs\t001\lib\site-packages\datasets\io\generator.py", line 47, in read
self.builder.download_and_prepare(
File "C:\software\anaconda3\envs\t001\lib\site-packages\datasets\builder.py", line 872, in download_and_prepare
self._download_and_prepare(
File "C:\software\anaconda3\envs\t001\lib\site-packages\datasets\builder.py", line 1649, in _download_and_prepare
super()._download_and_prepare(
File "C:\software\anaconda3\envs\t001\lib\site-packages\datasets\builder.py", line 967, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "C:\software\anaconda3\envs\t001\lib\site-packages\datasets\builder.py", line 1488, in _prepare_split
for job_id, done, content in self._prepare_split_single(
File "C:\software\anaconda3\envs\t001\lib\site-packages\datasets\builder.py", line 1644, in _prepare_split_single
raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset
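
The failing request targets https://huggingface.co/model/resolve/main/tokenizer_config.json, i.e. the literal string `model` is being resolved as a Hub repo id because no local `./model` directory was found, and the download then dies behind the proxy. Pointing the tokenizer at an existing local ChatGLM3 checkpoint avoids any network access. A minimal sketch with a placeholder local path:

```python
import transformers

# Placeholder path: any local directory that already contains the ChatGLM3-6B
# tokenizer files (tokenizer_config.json, tokenizer.model, ...).
LOCAL_CHATGLM3_DIR = r"D:\models\chatglm3-6b"

tokenizer = transformers.AutoTokenizer.from_pretrained(
    LOCAL_CHATGLM3_DIR, trust_remote_code=True
)
```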

Problem with saving LoRA checkpoints

Have you ever seen the case where, when training ChatGLM3 with LoRA, the saved checkpoint contains a 12 GB pytorch_model.bin instead of the expected adapter_model.bin of a few tens of MB?
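
For what it's worth, the intermediate checkpoints written by the Trainer can serialise the whole wrapped model as pytorch_model.bin, depending on the peft/transformers versions in use. Calling `save_pretrained()` on the PeftModel itself writes only the adapter. A sketch, assuming `model` is the object returned by `get_peft_model()` after training has finished:

```python
# Assumption: `model` is the PeftModel returned by get_peft_model() and training
# has already finished. save_pretrained() on a PeftModel writes only
# adapter_config.json + adapter_model.bin (tens of MB).
model.save_pretrained("./output/lora_adapter")

# Loading it back later (placeholder paths):
# from transformers import AutoModel
# from peft import PeftModel
# base = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)
# model = PeftModel.from_pretrained(base, "./output/lora_adapter")
```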

Training time multiplies when wrapped with SFTTrainer

Training 2000 steps with the example script takes less than 15 minutes, but after wrapping it with trl's SFTTrainer and adding NEFT, the same 2000 steps take 70 minutes. Why is there such a large difference in training time?

UnicodeDecodeError: 'gbk' codec can't decode byte 0xa4 in position 64: illegal multibyte sequence

2023-11-03 20:10:25,978 - WARNING - Loading data...
Traceback (most recent call last):
File "D:\test\chatglm3-base-tuning-master\train.py", line 52, in
trainer.train()
File "D:\test\chatglm3-base-tuning-master\trainer.py", line 19, in train
self.data_module = ChatDataModule(
^^^^^^^^^^^^^^^
File "D:\test\chatglm3-base-tuning-master\chat_data_module.py", line 75, in init
self.train_dataset = ChatDataset(tokenizer=tokenizer, data_path=data_path_train, max_tokens=max_tokens)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\test\chatglm3-base-tuning-master\chat_data_module.py", line 37, in init
conversations = jload(data_path)
^^^^^^^^^^^^^^^^
File "D:\test\chatglm3-base-tuning-master\chat_data_module.py", line 28, in jload
jdict = json.load(f)
^^^^^^^^^^^^
File "D:\test\chatglm3-base-tuning-master\env\Lib\json_init_.py", line 293, in load
return loads(fp.read(),
^^^^^^^^^
UnicodeDecodeError: 'gbk' codec can't decode byte 0xa4 in position 64: illegal multibyte sequence

The data file used is formatted_samples.json.
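
On Windows, `open()` defaults to the local code page (gbk here), while the training data is UTF-8; passing the encoding explicitly in the `jload` helper should resolve it. A sketch of the assumed fix:

```python
import json

def jload(path: str):
    # On Windows open() defaults to the local code page (gbk); the training data
    # is UTF-8, so the encoding has to be stated explicitly.
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

conversations = jload("formatted_samples.json")
```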

GPU memory explodes during evaluation

I added an eval_dataset to the Trainer and wrote a compute_metrics function to calculate some evaluation metrics, such as precision/recall for function calling and a BLEU score for the response text.

The problem: GPU memory explodes during evaluation. Training uses 10+ GB of GPU memory, but during eval it suddenly jumps to 60 GB+ and eventually hits OOM.

Have you run into a similar situation?
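
A likely cause: with `compute_metrics` set, the Trainer accumulates the full (batch, seq_len, vocab) logits for the whole eval set before computing metrics. Two standard mitigations, sketched below under the assumption that the metrics only need predicted token ids; the surrounding names (`model`, the datasets, `compute_metrics`) are placeholders for objects prepared elsewhere:

```python
from transformers import Trainer, TrainingArguments

def preprocess_logits_for_metrics(logits, labels):
    # Shrink the (batch, seq_len, vocab) logits down to predicted token ids before
    # the Trainer accumulates them, so eval memory no longer scales with vocab size.
    if isinstance(logits, tuple):
        logits = logits[0]
    return logits.argmax(dim=-1)

args = TrainingArguments(
    output_dir="output",
    per_device_eval_batch_size=1,
    eval_accumulation_steps=8,   # flush accumulated predictions to CPU every 8 steps
)

trainer = Trainer(
    model=model,                  # placeholder: the PeftModel prepared earlier
    args=args,
    train_dataset=train_dataset,  # placeholder: datasets prepared elsewhere
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    preprocess_logits_for_metrics=preprocess_logits_for_metrics,
)
```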
