
🎼 ChatMusician: Understanding and Generating Music Intrinsically with LLM

🌐 DemoPage | 🤗 Pretrain Dataset | 🤗 SFT Dataset | 🤗 Benchmark | 📖 arXiv | 💻 Code | 🤖 Chat Model | 🤖 Base Model

🔔 News

  • 🔥 [2023-12-10]: The release of ChatMusician's demo, code, model, data, and benchmark. 😆
  • [2023-11-30]: Check out another awesome project MMMU that includes multimodal music reasoning.

Introduction

While Large Language Models (LLMs) demonstrate impressive capabilities in text generation, we find that these abilities have yet to generalize to music, humanity’s creative language. We introduce ChatMusician, an open-source LLM that integrates intrinsic musical abilities.

It is built by continually pre-training and fine-tuning LLaMA2 on ABC notation, a text-compatible music representation, treating music as a second language. ChatMusician can understand and generate music with a pure text tokenizer, without any external multi-modal neural structures or tokenizers. Interestingly, endowing the model with musical abilities does not harm its language abilities; it even achieves a slightly higher MMLU score. ChatMusician is capable of composing well-structured, full-length music conditioned on texts, chords, melodies, motifs, musical forms, etc., surpassing the GPT-4 baseline. On MusicTheoryBench, our meticulously curated college-level music understanding benchmark, ChatMusician surpasses LLaMA2 and GPT-3.5 in the zero-shot setting by a noticeable margin. Our work reveals that LLMs can be an excellent compressor for music, but significant territory remains to be conquered. Code, data, model, and benchmark are open-sourced.
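
For readers unfamiliar with ABC notation, here is a small illustrative tune (an example written for this introduction, not drawn from the datasets) showing how a complete score is expressed in plain text:

X:1
T:Illustrative Reel
M:4/4
L:1/8
K:D
|: D2FA d2fd | e2ge c2ec | d2fa gfed | cABc d4 :|

The header fields (reference number, title, meter, default note length, key) and the tune body are all ordinary characters, which is what lets a pure text tokenizer handle music directly.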

Training Data

ChatMusician is pretrained on 🤗 MusicPile, the first pretraining corpus for developing musical abilities in large language models; check out the dataset card for more details. It is then supervised fine-tuned on 1.1M samples from MusicPile (a 2:1 ratio of music knowledge & music summary data to music scores). Check our paper for more details.

Training Procedure

We initialized fp16-precision ChatMusician-Base from the LLaMA2-7B-Base weights and applied a continual pre-training plus fine-tuning pipeline. LoRA adapters were integrated into the attention and MLP layers, with additional training on the embeddings and all linear layers. The maximum sequence length was 2048. We utilized 16 80GB-A800 GPUs for one epoch of pre-training and 8 32GB-V100 GPUs for two epochs of fine-tuning. DeepSpeed was employed for memory efficiency, and the AdamW optimizer was used with a 1e-4 learning rate and a cosine scheduler with 5% warmup. Gradient clipping was set to 1.0. The LoRA dimension, alpha, and dropout were set to 64, 16, and 0.1, respectively, with a batch size of 8.
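
In PEFT terms, these hyperparameters correspond roughly to the following configuration (a sketch; the target-module list matches the --lora_module_name argument visible in the training logs quoted in the issues below, and the modules_to_save line is an assumption based on the embedding training mentioned above):

from peft import LoraConfig

# LoRA settings from the training procedure: rank 64, alpha 16, dropout 0.1.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    # Attention and MLP projection layers (matches --lora_module_name in train.sh).
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "down_proj", "gate_proj", "up_proj"],
    # Assumption: embeddings are trained alongside the adapters, per the text above.
    modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
)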

Evaluation

  1. Music understanding abilities are evaluated on MusicTheoryBench; check out the eval folder for more details.
  2. General language abilities of ChatMusician are evaluated on the Massive Multitask Language Understanding (MMLU) dataset.

Requirements

  • Python 3.8 or above
  • PyTorch 2.0 or above is recommended
  • CUDA 11.4 or above is recommended
  • DeepSpeed 0.10 or above is recommended

Python dependency installation:

pip install -r requirements.txt 

Inference

Web demo (with audio)

To render audio in real-time, you must install abcmidi and MuseScore.

  1. Install abc2midi:

sudo apt-get update
sudo apt-get install abcmidi

  2. Install MuseScore (on Linux, on Mac, or on Windows).
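
You can sanity-check the abc2midi install by converting an ABC file to MIDI from the command line (file names here are illustrative):

abc2midi tune.abc -o tune.mid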

Then launch a gradio demo:

cd ChatMusician/
python model/infer/chatmusician_web_demo.py -c "m-a-p/ChatMusician" --server_port 8888

Prompt example:

Using ABC notation, recreate the given text as a musical score.
Meter C
Notes The parts are commonly interchanged.
Transcription 1997 by John Chambers
Key D
Note Length 1/8
Rhythm reel

(Screenshot: ChatMusician web demo)

Inference locally

cd ChatMusician/
python model/infer/predict.py --base_model {merged_model_path} --with_prompt --interactive

Note: with --with_prompt, the input text will be converted into the chat format.
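
Alternatively, you can script inference directly with Hugging Face transformers (a minimal sketch; the Human/Assistant wrapping follows the SFT data format described below, and the instruction and generation settings are illustrative):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "m-a-p/ChatMusician"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Wrap the instruction in the chat format used during SFT.
instruction = "Develop a melody in ABC notation using the key of D major."
prompt = f"Human: {instruction} </s> Assistant: "

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True,
                         temperature=0.9, top_p=0.9)
# Strip the prompt tokens and print only the generated continuation.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))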

Start an Experiment

SFT Data Format

Our SFT dataset comprises data points structured with three main features: instruction, input, and output. Each data point resembles a conversation between a human and an assistant, formatted as follows: Human: {...} </s> Assistant: {...} </s>.

For example,

{
"instruction": "Construct melodies by blending the designated musical pattern with the supplied motif.",
"input": "['Binary', 'Sectional: Verse/Chorus'];X:1 L:1/16 M:2/4 K:G ['G2BG A2cA B2dB', '(gf)(ge) (ed)(cB)' </s> ",
"output": "Assistant: X:1 L:1/16 M:2/4 K:G G2BG A2cA | B2dB G2B2 | c2ec B2dB | ABAG (GF)(ED) | G2BG A2cA | B2dB c2ec | cBAG D2f2 | g2d2B2G2 || (gf)(ge) (ed)(cB) | (gf)(ge) (ed)(cB) | ca2c Bg2B | ABAG GFED | G2BG A2cA | cBAG d2f2 | g2d2B2G2 || </s> "
}

You can explore more samples at MusicPile-sft. We recommend structuring your data in a similar format for fine-tuning based on ChatMusician-Base.
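
For instance, a small helper that renders one data point into that training string could look like this (a sketch, not the repo's actual implementation):

def to_training_text(example: dict) -> str:
    """Render an SFT data point as 'Human: ... </s> Assistant: ... </s>'."""
    # In MusicPile-sft the 'input' field already ends with '</s>' and the
    # 'output' field already starts with 'Assistant:' (see the sample above),
    # so plain concatenation reproduces the chat format.
    return f"Human: {example['instruction']} {example['input']}{example['output']}"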

Data Preprocessing

Data preprocessing converts texts into token IDs ahead of time, which saves GPU memory compared to tokenizing at runtime.

cd ChatMusician
## specify `--tokenize_fn pt` for preprocessing continual pretrain data
## specify `--tokenize_fn sft` for preprocessing sft data
python model/train/data_preprocess.py \
    -t $TOKENIZER_PATH \
    -i $DATA_FILE \
    -o $OUTPUT_DIR 

For example, if you're using m-a-p/ChatMusician-Base and the dataset m-a-p/MusicPile-sft for supervised fine-tuning, and want to save preprocessed data in the datasets directory:

python model/train/data_preprocess.py \
    -t m-a-p/ChatMusician-Base \
    -i m-a-p/MusicPile-sft \
    -o datasets \
    --tokenize_fn sft 
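
Conceptually, the preprocessing script boils down to tokenizing every sample once and saving the token IDs (a simplified sketch, not the repo's actual implementation; build_text mirrors the SFT chat format above):

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("m-a-p/ChatMusician-Base")
raw = load_dataset("m-a-p/MusicPile-sft", split="train")

def build_text(example):
    # Reassemble the Human/Assistant training string from the SFT fields.
    return {"text": f"Human: {example['instruction']} {example['input']}{example['output']}"}

def tokenize(example):
    # Map the text to token IDs once, ahead of training.
    return tokenizer(example["text"])

tokenized = raw.map(build_text).map(tokenize)
tokenized.save_to_disk("datasets/processed_tokens")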

Pretraining or Supervised Fine-tuning

Run:

model/train/scripts/train.sh $PREPROCESSED_DATASET_PATH $YOUR_MODEL_PATH

For example, if you're running supervised fine-tuning from m-a-p/ChatMusician-Base and your data has been preprocessed into the datasets directory:

./model/train/scripts/train.sh datasets m-a-p/ChatMusician-Base

You can then find the tensorboard log in the runs directory.

Merge Peft Model

After fine-tuning, you can merge the LoRA checkpoint with the original checkpoint using the following script:

cd ChatMusician/
python model/train/merge.py --ori_model_dir $BASE_MODEL --model_dir $LORA_CKPT_PATH --output_dir $OUTPUT_PATH
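
Under the hood, the merge is equivalent to the following PEFT calls (a sketch consistent with the merge.py excerpt quoted in the issues below; the paths are illustrative):

import torch
from peft import PeftModel
from transformers import LlamaForCausalLM

base_model_path = "m-a-p/ChatMusician-Base"        # $BASE_MODEL
lora_ckpt_path = "model/train/output_dir/epoch-2"  # $LORA_CKPT_PATH (illustrative)
output_path = "merged_model"                       # $OUTPUT_PATH (illustrative)

base_model = LlamaForCausalLM.from_pretrained(base_model_path, torch_dtype=torch.float16)
lora_model = PeftModel.from_pretrained(base_model, lora_ckpt_path, torch_dtype=torch.float16)

# Fold the low-rank updates back into the base weights and drop the adapters.
merged = lora_model.merge_and_unload()
merged.save_pretrained(output_path)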

Need Help?

If you find yourself confused or encountering any issues, feel free to create an issue on our repository for assistance.

Limitations

  • ChatMusician currently only supports strict-format, close-ended instructions for music tasks. With more funding, we plan to create more diverse multi-turn music instruction chat data for better generalization.
  • ChatMusician suffers from hallucinations and shouldn't be used for music education. This could be improved by training on more music textbooks, blogs, etc.; RLHF may help, too.
  • A large portion of the training data is in the style of Irish music. If possible, the community should develop a converter between performance MIDI and ABC scores, so that more established MIDI datasets can be included.
  • The MusicTheoryBench results reported in the paper are obtained in perplexity mode. Direct generation may result in worse performance.
  • We observe that with the current version of the training data, ChatMusician exhibits weak in-context learning and chain-of-thought abilities. The community should work on improving the quality of the music data.

Citation

If you find our work helpful, please consider citing it:

@misc{yuan2024chatmusician,
      title={ChatMusician: Understanding and Generating Music Intrinsically with LLM}, 
      author={Ruibin Yuan and Hanfeng Lin and Yi Wang and Zeyue Tian and Shangda Wu and Tianhao Shen and Ge Zhang and Yuhang Wu and Cong Liu and Ziya Zhou and Ziyang Ma and Liumeng Xue and Ziyu Wang and Qin Liu and Tianyu Zheng and Yizhi Li and Yinghao Ma and Yiming Liang and Xiaowei Chi and Ruibo Liu and Zili Wang and Pengfei Li and Jingcheng Wu and Chenghua Lin and Qifeng Liu and Tao Jiang and Wenhao Huang and Wenhu Chen and Emmanouil Benetos and Jie Fu and Gus Xia and Roger Dannenberg and Wei Xue and Shiyin Kang and Yike Guo},
      year={2024},
      eprint={2402.16153},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}


ChatMusician's Issues

How to integrate evaluation into a fresh version of open compass?

As I am having much more success with a fresh copy of OpenCompass, apparently something is missing. First I copied over the ChatMusician configs (at least in my environment):

cp ../ChatMusician/eval/configs/eval_chat_musician_7b.py configs/
cp -r ../ChatMusician/eval/configs/datasets/music_theory_bench/ configs/datasets/
cp -r ../ChatMusician/eval/configs/models/chat_musician/ configs/models/

But a few of the validations fail; notably:

KeyError: 'opencompass.datasets.MusicTheoryBenchDataset is not in the opencompass::load_dataset registry.'

Is there an easy way to fix this?

04/13 15:02:53 - OpenCompass - INFO - Task [Mozart-Transposed/reasoning_few_shot]

Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]
Loading checkpoint shards:  14%|█▍        | 1/7 [00:01<00:11,  1.99s/it]
Loading checkpoint shards:  29%|██▊       | 2/7 [00:03<00:09,  1.97s/it]
Loading checkpoint shards:  43%|████▎     | 3/7 [00:06<00:08,  2.01s/it]
Loading checkpoint shards:  57%|█████▋    | 4/7 [00:08<00:06,  2.04s/it]
Loading checkpoint shards:  71%|███████▏  | 5/7 [00:10<00:04,  2.03s/it]
Loading checkpoint shards:  86%|████████▌ | 6/7 [00:12<00:02,  2.09s/it]
Loading checkpoint shards: 100%|██████████| 7/7 [00:13<00:00,  1.95s/it]
Loading checkpoint shards: 100%|██████████| 7/7 [00:13<00:00,  2.00s/it]
Traceback (most recent call last):
  File "/content/drive/MyDrive/project/opencompass/opencompass/tasks/openicl_infer.py", line 156, in <module>
    inferencer.run()
  File "/content/drive/MyDrive/project/opencompass/opencompass/tasks/openicl_infer.py", line 74, in run
    self.dataset = build_dataset_from_cfg(self.dataset_cfg)
  File "/content/drive/MyDrive/project/opencompass/opencompass/utils/build.py", line 13, in build_dataset_from_cfg
    return LOAD_DATASET.build(dataset_cfg)
  File "/root/anaconda3/envs/opencompass/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/root/anaconda3/envs/opencompass/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 100, in build_from_cfg
    raise KeyError(
KeyError: 'opencompass.datasets.MusicTheoryBenchDataset is not in the opencompass::load_dataset registry. Please check whether the value of `opencompass.datasets.MusicTheoryBenchDataset` is correct or it was registered as expected. More details can be found at https://mmengine.readthedocs.io/en/latest/advanced_tutorials/config.html#import-the-custom-module'
[2024-04-13 15:03:15,430] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 82828) of binary: /root/anaconda3/envs/opencompass/bin/python
Traceback (most recent call last):
  File "/root/anaconda3/envs/opencompass/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.2.2', 'console_scripts', 'torchrun')())
  File "/root/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/root/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/root/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/root/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/content/drive/MyDrive/project/opencompass/opencompass/tasks/openicl_infer.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-04-13_15:03:15
  host      : fb7dea0d788a
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 82828)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

How to determine hallucinations?

Very nice work. You mentioned that there is a tendency to hallucinate. How is this determined? Can I get the probabilities per token out?

Many thanks
Peter

model/train/train.py is not macOS (gpu - mps) aware

Running

./model/train/scripts/train.sh datasets m-a-p/ChatMusician-Base as per the README

exits with:

(ml) petergreis@MacBook-Pro-M1-Max-2021 ChatMusician % ./model/train/scripts/train.sh datasets m-a-p/ChatMusician-Base
[2024-03-29 21:11:10,773] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to mps (auto detect)
[2024-03-29 21:11:12,870] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-03-29 21:11:14,403] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-03-29 21:11:14,404] [INFO] [runner.py:568:main] cmd = /Users/petergreis/anaconda3/envs/ml/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None model/train/train.py --train_path datasets --model_name_or_path m-a-p/ChatMusician-Base --per_device_train_batch_size 1 --max_len 2048 --max_src_len 1536 --num_train_epochs 2 --gradient_accumulation_steps 1 --learning_rate 1e-4 --weight_decay 0.1 --warmup_ratio 0.1 --mode llama --train_type lora --lora_dim 64 --lora_alpha 16 --lora_dropout 0.1 --lora_module_name q_proj,k_proj,v_proj,o_proj,down_proj,gate_proj,up_proj --seed 1234 --save_model_step 2000 --ds_file model/train/config/ds_zero2_no_offload.json --show_loss_step 50 --output_dir model/train/output_dir
[2024-03-29 21:11:15,592] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to mps (auto detect)
[2024-03-29 21:11:15,801] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-03-29 21:11:16,405] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0]}
[2024-03-29 21:11:16,405] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=1, node_rank=0
[2024-03-29 21:11:16,405] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2024-03-29 21:11:16,405] [INFO] [launch.py:163:main] dist_world_size=1
[2024-03-29 21:11:16,405] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
[2024-03-29 21:11:17,605] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to mps (auto detect)
[2024-03-29 21:11:17,821] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
Traceback (most recent call last):
  File "/Users/petergreis/Library/CloudStorage/Dropbox/Leeds/Project/ChatMusician/model/train/train.py", line 252, in <module>
    main()
  File "/Users/petergreis/Library/CloudStorage/Dropbox/Leeds/Project/ChatMusician/model/train/train.py", line 70, in main
    torch.cuda.set_device(args.local_rank)
  File "/Users/petergreis/anaconda3/envs/ml/lib/python3.11/site-packages/torch/cuda/__init__.py", line 404, in set_device
    torch._C._cuda_setDevice(device)
    ^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: module 'torch._C' has no attribute '_cuda_setDevice'
[2024-03-29 21:11:20,433] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 7216
[2024-03-29 21:11:20,436] [ERROR] [launch.py:321:sigkill_handler] ['/Users/petergreis/anaconda3/envs/ml/bin/python', '-u', 'model/train/train.py', '--local_rank=0', '--train_path', 'datasets', '--model_name_or_path', 'm-a-p/ChatMusician-Base', '--per_device_train_batch_size', '1', '--max_len', '2048', '--max_src_len', '1536', '--num_train_epochs', '2', '--gradient_accumulation_steps', '1', '--learning_rate', '1e-4', '--weight_decay', '0.1', '--warmup_ratio', '0.1', '--mode', 'llama', '--train_type', 'lora', '--lora_dim', '64', '--lora_alpha', '16', '--lora_dropout', '0.1', '--lora_module_name', 'q_proj,k_proj,v_proj,o_proj,down_proj,gate_proj,up_proj', '--seed', '1234', '--save_model_step', '2000', '--ds_file', 'model/train/config/ds_zero2_no_offload.json', '--show_loss_step', '50', '--output_dir', 'model/train/output_dir'] exits with return code = 1

How should the call to

mps_available = mps.is_available()

be incorporated into the code at line 70 to set the device to "mps"?
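
Perhaps something along these lines (an untested sketch of a device-selection fallback, not the repo's actual code):

import torch

local_rank = 0  # in train.py this comes from args.local_rank

# Pick the best available backend instead of assuming CUDA.
if torch.cuda.is_available():
    torch.cuda.set_device(local_rank)
    device = torch.device("cuda", local_rank)
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")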

Questions about the merge script - why not dtype 'auto'?

Just to report some success, this seems to work (subject to inference testing):

python model/train/merge.py --ori_model_dir ../chatmusician_model_tokenizer --model_dir model/train/output_dir/epoch-2-step-200 --output_dir new_out

Looking further at the merge.py script, I ask the following:

Why are the dtypes here:

base_model = LlamaForCausalLM.from_pretrained(args.ori_model_dir, torch_dtype=torch.float16)
lora_model = PeftModel.from_pretrained(base_model, args.model_dir, torch_dtype=torch.float16)

not set to torch_dtype='auto'?

Also, I find the input argument --ori_model_dir confusing; why not --orig_model_dir?

Model versus tokenizer mismatch

Greetings

Attempting to load a fine-tuned model into llama.cpp. As the error stems from my SFT ChatMusician model, I'm posting this here. I went back and tried this on a saved copy of the model.

Running this:
python3 convert.py /Users/petergreis/Dropbox/Leeds/Project/chatmusician_model_tokenizer

Yields this:

ValueError: Vocab size mismatch (model has 32000, but /Users/petergreis/Dropbox/Leeds/Project/chatmusician_model_tokenizer/tokenizer.model has 32001).

And in the model directory itself I see:

 % more added_tokens.json
{
  "<pad>": 32000
}

Which explains why the token count is off by one. Any idea how I can get the two to agree?
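
One possible way to get them to agree (a sketch, untested with llama.cpp's converter) would be to resize the model's embeddings to the tokenizer's vocabulary before converting:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "chatmusician_model_tokenizer"  # local model/tokenizer directory
model = AutoModelForCausalLM.from_pretrained(model_dir)
tokenizer = AutoTokenizer.from_pretrained(model_dir)

# Grow the embedding matrix (32000 rows) to match the tokenizer,
# which carries the extra "<pad>" token (32001 entries).
model.resize_token_embeddings(len(tokenizer))
model.save_pretrained(model_dir)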

macOS data_preprocess.py - encoding parameter on load_dataset crashes script

Running:
python model/train/data_preprocess.py -t m-a-p/ChatMusician-Base -i m-a-p/MusicPile-sft -o datasets --tokenize_fn sft

Results in:
Traceback (most recent call last):
  File "/Users/petergreis/Library/CloudStorage/Dropbox/Leeds/Project/ChatMusician/model/train/data_preprocess.py", line 90, in <module>
    main(args)
  File "/Users/petergreis/Library/CloudStorage/Dropbox/Leeds/Project/ChatMusician/model/train/data_preprocess.py", line 50, in main
    raw_dataset = load_dataset(args.input_file, cache_dir=tmp_cache_dir, keep_in_memory=False, encoding="utf8")
  File "/Users/petergreis/anaconda3/envs/ml/lib/python3.11/site-packages/datasets/load.py", line 2556, in load_dataset
    builder_instance = load_dataset_builder(
  File "/Users/petergreis/anaconda3/envs/ml/lib/python3.11/site-packages/datasets/load.py", line 2265, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
  File "/Users/petergreis/anaconda3/envs/ml/lib/python3.11/site-packages/datasets/builder.py", line 371, in __init__
    self.config, self.config_id = self._create_builder_config(
  File "/Users/petergreis/anaconda3/envs/ml/lib/python3.11/site-packages/datasets/builder.py", line 613, in _create_builder_config
    raise ValueError(f"BuilderConfig {builder_config} doesn't have a '{key}' key.")
ValueError: BuilderConfig ParquetConfig(name='default', version=0.0.0, data_dir=None, data_files={'train': ['data/train-*']}, description=None, batch_size=None, columns=None, features=None) doesn't have a 'encoding' key.

Modifying the load_dataset statement from

raw_dataset = load_dataset(args.input_file, cache_dir=tmp_cache_dir, keep_in_memory=False, encoding="utf8")

to

raw_dataset = load_dataset(args.input_file, cache_dir=tmp_cache_dir, keep_in_memory=False)

allows the script to complete. From my conda environment I am running:

datasets 2.18.0 pypi_0 pypi
transformers 4.37.2 py311hca03da5_0

I could not find the encoding parameter in the current documentation; perhaps it has been deprecated?

May I upload?

Good morning,
I have fine-tuned models of ChatMusician incorporating both Mozart and Scarlatti compositions. Any objections if I upload these to Hugging Face?

How much compute to evaluate?

I have just attempted to run the eval code on an A6000. While it starts, it looks like it just hangs. Hence my question: what did you run the eval code on? The same cluster mentioned in the paper for training?

How to format queries for predict.py?

Greetings

If, for example, I want to feed the following to predict.py as part of a data file:

Create a new 8 bar composition in the style of Mozart using the following ABC notation as a basis:
X:1
M:4/4
L:1/4
K:Dm
|| D F A d | c B A F ||

Do I simply remove the newlines and flatten it out to one line? Do you have other examples?

PS: "δΈΊδ»€δΉˆθ¦ε‡ε°‘ζ±‘ζŸ“οΌŒδΏζŠ€ηŽ―ε’ƒοΌŸ" - Nice ;)

issue with data_preprocess.py on macOS; filename not correct

Greetings

When trying to run the example from the README:

python model/train/data_preprocess.py -t m-a-p/ChatMusician-Base -i m-a-p/MusicPile-sft -o datasets --tokenize_fn sft

The script crashes out with the following:

Traceback (most recent call last):
  File "/Users/petergreis/Library/CloudStorage/Dropbox/Leeds/Project/ChatMusician/model/train/data_preprocess.py", line 110, in <module>
    main(args)
  File "/Users/petergreis/Library/CloudStorage/Dropbox/Leeds/Project/ChatMusician/model/train/data_preprocess.py", line 69, in main
    raw_dataset = load_dataset(args.input_file, cache_dir=tmp_cache_dir, keep_in_memory=False, encoding="utf8")
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/petergreis/anaconda3/envs/ml/lib/python3.11/site-packages/datasets/load.py", line 2556, in load_dataset
    builder_instance = load_dataset_builder(
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/petergreis/anaconda3/envs/ml/lib/python3.11/site-packages/datasets/load.py", line 2265, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
                                       ^^^^^^^^^^^^
  File "/Users/petergreis/anaconda3/envs/ml/lib/python3.11/site-packages/datasets/builder.py", line 371, in __init__
    self.config, self.config_id = self._create_builder_config(
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/petergreis/anaconda3/envs/ml/lib/python3.11/site-packages/datasets/builder.py", line 613, in _create_builder_config
    raise ValueError(f"BuilderConfig {builder_config} doesn't have a '{key}' key.")
ValueError: BuilderConfig ParquetConfig(name='default', version=0.0.0, data_dir=None, data_files={'train': ['data/train-*']}, description=None, batch_size=None, columns=None, features=None) doesn't have a 'encoding' key.

I have chased this down to this line:

filename = '.'.join(args.input_file.split("/")[-1].split(".")[:-1])

Under macOS this yields an empty string, which causes the problem. Given that the original call passes the argument "m-a-p/MusicPile-sft", which part of the argument is intended for use here? And, since this appears to be used to set the cache directory, shouldn't it respect the HF_DATASETS_CACHE environment variable if set?

How can I fine-tune on my own data?

Hi, thanks for the great work!

I wonder how I can fine-tune your chat checkpoint on my own dataset? Are there any resources I can refer to, such as dataset preparation and a fine-tuning script? Thanks!

After update to data_preprocess.py, where is the training data?

When attempting to train (as per the readme), the shell script file expands to:

model/train/train.py --train_path datasets --model_name_or_path m-a-p/ChatMusician-Base --per_device_train_batch_size 1 --max_len 2048 --max_src_len 1536 --num_train_epochs 2 --gradient_accumulation_steps 1 --learning_rate 1e-4 --weight_decay 0.1 --warmup_ratio 0.1 --mode llama --train_type lora --lora_dim 64 --lora_alpha 16 --lora_dropout 0.1 --lora_module_name q_proj,k_proj,v_proj,o_proj,down_proj,gate_proj,up_proj --seed 1234 --save_model_step 2000 --show_loss_step 50 --output_dir model/train/output_dir

Which then throws this trace:

Traceback (most recent call last):
  File "/Users/petergreis/Library/CloudStorage/Dropbox/Leeds/Project/ChatMusician/model/train/train.py", line 129, in <module>
    main()
  File "/Users/petergreis/Library/CloudStorage/Dropbox/Leeds/Project/ChatMusician/model/train/train.py", line 89, in main
    tokenizer = MODE[args.mode]["tokenizer"].from_pretrained(args.model_name_or_path, use_fast=False, trust_remote_code=True)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/petergreis/anaconda3/envs/ml/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2029, in from_pretrained
    return cls._from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/petergreis/anaconda3/envs/ml/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2261, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/petergreis/Library/CloudStorage/Dropbox/Leeds/Project/ChatMusician/model/train/llama/tokenization_llama.py", line 79, in __init__
    super().__init__(
  File "/Users/petergreis/anaconda3/envs/ml/lib/python3.11/site-packages/transformers/tokenization_utils.py", line 367, in __init__
    self._add_tokens(
  File "/Users/petergreis/anaconda3/envs/ml/lib/python3.11/site-packages/transformers/tokenization_utils.py", line 467, in _add_tokens
    current_vocab = self.get_vocab().copy()
                    ^^^^^^^^^^^^^^^^
  File "/Users/petergreis/Library/CloudStorage/Dropbox/Leeds/Project/ChatMusician/model/train/llama/tokenization_llama.py", line 113, in get_vocab
    vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
                                                             ^^^^^^^^^^^^^^^
  File "/Users/petergreis/Library/CloudStorage/Dropbox/Leeds/Project/ChatMusician/model/train/llama/tokenization_llama.py", line 109, in vocab_size
    return self.sp_model.get_piece_size()
           ^^^^^^^^^^^^^
AttributeError: 'LlamaTokenizer' object has no attribute 'sp_model'

...which would seem to indicate that the training data is not found. The recent changes to data_preprocess.py create datasets/processed_tokens and datasets/processed_tokens_text, but even after adjusting --train_path from datasets to either of those, the tokenized data still seems not to be found. Any suggestions?

Running out of GPU memory in Google colab; ideas?

Currently seeing this when attempting to train in Google colab on a T4 GPU. Any idea how I can further reduce the footprint to make this run?

[2024-04-03 14:35:22,511] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 3.85 GB, percent = 7.6%
Traceback (most recent call last):
  File "/content/drive/MyDrive/project/ChatMusician/model/train/train.py", line 254, in <module>
    main()
  File "/content/drive/MyDrive/project/ChatMusician/model/train/train.py", line 198, in main
    model, optimizer, _, lr_scheduler = deepspeed.initialize(model=model, args=args, config=ds_config,
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/__init__.py", line 176, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 307, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 1256, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 1513, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer(
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 535, in __init__
    self.initialize_optimizer_states()
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 670, in initialize_optimizer_states
    single_grad_partition = torch.zeros(int(self.partition_size[i]),
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 610.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 403.06 MiB is free. Process 129704 has 14.35 GiB memory in use. Of the allocated memory 13.48 GiB is allocated by PyTorch, and 197.48 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[2024-04-03 14:35:26,240] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 9908
[2024-04-03 14:35:26,240] [ERROR] [launch.py:322:sigkill_handler] ['/usr/bin/python3', '-u', 'model/train/train.py', '--local_rank=0', '--train_path', 'datasets/train_bias_sample_100', '--model_name_or_path', 'm-a-p/ChatMusician-Base', '--per_device_train_batch_size', '1', '--max_len', '2048', '--max_src_len', '1536', '--num_train_epochs', '2', '--gradient_accumulation_steps', '4', '--learning_rate', '1e-4', '--weight_decay', '0.1', '--warmup_ratio', '0.1', '--mode', 'llama', '--train_type', 'lora', '--lora_dim', '64', '--lora_alpha', '16', '--lora_dropout', '0.1', '--lora_module_name', 'q_proj,k_proj,v_proj,o_proj,down_proj,gate_proj,up_proj', '--seed', '1234', '--save_model_step', '2000', '--ds_file', 'model/train/config/ds_zero2_no_offload.json', '--show_loss_step', '50', '--output_dir', 'model/train/output_dir'] exits with return code = 1

Where are the results of MusicTheoryBench?

I have just run the "eval" script against my Mozart-trained datasets. Looking at the summary results .txt and .csv, where exactly are the MusicTheoryBench results reflected?

Autotrain or Llama Factory for fine-tuning?

Greetings

Enjoying working with this so far. But one question: as this is based on Llama 2, do you see it as practical to use either autotrain-advanced and/or Llama Factory for fine-tuning?

If you would like to reach me directly: peter dot greis at freethinker dot com
