
🎼 ChatMusician: Understanding and Generating Music Intrinsically with LLM

🌐 DemoPage | 🤗 Pretrain Dataset | 🤗 SFT Dataset | 🤗 Benchmark | 📖 arXiv | 💻 Code | 🤖 Chat Model | 🤖 Base Model

🔔 News

  • 🔥 [2023-12-10]: The release of ChatMusician's demo, code, model, data, and benchmark. 😆
  • [2023-11-30]: Check out another awesome project MMMU that includes multimodal music reasoning.

Introduction

While Large Language Models (LLMs) demonstrate impressive capabilities in text generation, we find that these abilities have yet to generalize to music, humanity’s creative language. We introduce ChatMusician, an open-source LLM that integrates intrinsic musical abilities.

It is built by continually pre-training and fine-tuning LLaMA2 on ABC notation, a text-compatible music representation, treating music as a second language. ChatMusician can understand and generate music with a pure text tokenizer, without any external multi-modal neural structures or tokenizers. Interestingly, endowing the model with musical abilities does not harm its language abilities; it even achieves a slightly higher MMLU score. ChatMusician is capable of composing well-structured, full-length music conditioned on texts, chords, melodies, motifs, musical forms, etc., surpassing the GPT-4 baseline. On MusicTheoryBench, our meticulously curated college-level music understanding benchmark, ChatMusician surpasses LLaMA2 and GPT-3.5 in the zero-shot setting by a noticeable margin. Our work reveals that LLMs can be an excellent compressor for music, but significant territory remains to be conquered. Code, data, model, and benchmark are open-sourced.
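
For readers unfamiliar with ABC notation, here is a small illustrative tune (an example written for this introduction, not drawn from the datasets) showing how a complete score is expressed in plain text:

X:1
T:Illustrative Reel
M:4/4
L:1/8
K:D
|: D2FA d2fd | e2ge c2ec | d2fa gfed | cABc d4 :|

The header fields (reference number, title, meter, default note length, key) and the tune body are all ordinary characters, which is what lets a pure text tokenizer handle music directly.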

Training Data

ChatMusician is pretrained on 🤗 MusicPile, the first pretraining corpus for developing musical abilities in large language models; check out the dataset card for more details. It is then supervised fine-tuned on 1.1M samples from MusicPile (a 2:1 ratio of music knowledge & music summary data to music scores). Check our paper for more details.

Training Procedure

We initialized fp16-precision ChatMusician-Base from the LLaMA2-7B-Base weights and applied a continual pre-training plus fine-tuning pipeline. LoRA adapters were integrated into the attention and MLP layers, with additional training on the embeddings and all linear layers. The maximum sequence length was 2048. We utilized 16 80GB-A800 GPUs for one epoch of pre-training and 8 32GB-V100 GPUs for two epochs of fine-tuning. DeepSpeed was employed for memory efficiency, and the AdamW optimizer was used with a 1e-4 learning rate and a cosine scheduler with 5% warmup. Gradient clipping was set to 1.0. The LoRA dimension, alpha, and dropout were set to 64, 16, and 0.1, respectively, with a batch size of 8.
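
In PEFT terms, these hyperparameters correspond roughly to the following configuration (a sketch; the target-module list matches the --lora_module_name argument visible in the training logs quoted in the issues below, and the modules_to_save line is an assumption based on the embedding training mentioned above):

from peft import LoraConfig

# LoRA settings from the training procedure: rank 64, alpha 16, dropout 0.1.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    # Attention and MLP projection layers (matches --lora_module_name in train.sh).
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "down_proj", "gate_proj", "up_proj"],
    # Assumption: embeddings are trained alongside the adapters, per the text above.
    modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
)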

Evaluation

  1. Music understanding abilities are evaluated on MusicTheoryBench; check out the eval folder for more details.
  2. General language abilities of ChatMusician are evaluated on the Massive Multitask Language Understanding (MMLU) dataset.

Requirements

  • Python 3.8 or above
  • PyTorch 2.0 or above is recommended
  • CUDA 11.4 or above is recommended
  • DeepSpeed 0.10 or above is recommended

Python dependency installation:

pip install -r requirements.txt 

Inference

Web demo (with audio)

To render audio in real-time, you must install abcmidi and MuseScore.

  1. Install abc2midi:

sudo apt-get update
sudo apt-get install abcmidi

  2. Install MuseScore (on Linux, on Mac, or on Windows).
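
You can sanity-check the abc2midi install by converting an ABC file to MIDI from the command line (file names here are illustrative):

abc2midi tune.abc -o tune.mid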

Then launch a gradio demo:

cd ChatMusician/
python model/infer/chatmusician_web_demo.py -c "m-a-p/ChatMusician" --server_port 8888

Prompt example:

Using ABC notation, recreate the given text as a musical score.
Meter C
Notes The parts are commonly interchanged.
Transcription 1997 by John Chambers
Key D
Note Length 1/8
Rhythm reel

(Screenshot: ChatMusician web demo)

Inference locally

cd ChatMusician/
python model/infer/predict.py --base_model {merged_model_path} --with_prompt --interactive

Note: with --with_prompt, the input text will be converted into the chat format.
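
Alternatively, you can script inference directly with Hugging Face transformers (a minimal sketch; the Human/Assistant wrapping follows the SFT data format described below, and the instruction and generation settings are illustrative):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "m-a-p/ChatMusician"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Wrap the instruction in the chat format used during SFT.
instruction = "Develop a melody in ABC notation using the key of D major."
prompt = f"Human: {instruction} </s> Assistant: "

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True,
                         temperature=0.9, top_p=0.9)
# Strip the prompt tokens and print only the generated continuation.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))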

Start an Experiment

SFT Data Format

Our SFT dataset comprises data points structured with three main features: instruction, input, and output. Each data point resembles a conversation between a human and an assistant, formatted as follows: Human: {...} </s> Assistant: {...} </s>.

For example,

{
"instruction": "Construct melodies by blending the designated musical pattern with the supplied motif.",
"input": "['Binary', 'Sectional: Verse/Chorus'];X:1 L:1/16 M:2/4 K:G ['G2BG A2cA B2dB', '(gf)(ge) (ed)(cB)' </s> ",
"output": "Assistant: X:1 L:1/16 M:2/4 K:G G2BG A2cA | B2dB G2B2 | c2ec B2dB | ABAG (GF)(ED) | G2BG A2cA | B2dB c2ec | cBAG D2f2 | g2d2B2G2 || (gf)(ge) (ed)(cB) | (gf)(ge) (ed)(cB) | ca2c Bg2B | ABAG GFED | G2BG A2cA | cBAG d2f2 | g2d2B2G2 || </s> "
}

You can explore more samples at MusicPile-sft. We recommend structuring your data in a similar format for fine-tuning based on ChatMusician-Base.
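
For instance, a small helper that renders one data point into that training string could look like this (a sketch, not the repo's actual implementation):

def to_training_text(example: dict) -> str:
    """Render an SFT data point as 'Human: ... </s> Assistant: ... </s>'."""
    # In MusicPile-sft the 'input' field already ends with '</s>' and the
    # 'output' field already starts with 'Assistant:' (see the sample above),
    # so plain concatenation reproduces the chat format.
    return f"Human: {example['instruction']} {example['input']}{example['output']}"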

Data Preprocessing

Data preprocessing converts texts into token IDs ahead of time, which saves GPU memory compared to tokenizing at runtime.

cd ChatMusician
## specify `--tokenize_fn pt` for preprocessing continual pretrain data
## specify `--tokenize_fn sft` for preprocessing sft data
python model/train/data_preprocess.py \
    -t $TOKENIZER_PATH \
    -i $DATA_FILE \
    -o $OUTPUT_DIR 

For example, if you're using m-a-p/ChatMusician-Base and the dataset m-a-p/MusicPile-sft for supervised fine-tuning, and want to save preprocessed data in the datasets directory:

python model/train/data_preprocess.py \
    -t m-a-p/ChatMusician-Base \
    -i m-a-p/MusicPile-sft \
    -o datasets \
    --tokenize_fn sft 
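
Conceptually, the preprocessing script boils down to tokenizing every sample once and saving the token IDs (a simplified sketch, not the repo's actual implementation; build_text mirrors the SFT chat format above):

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("m-a-p/ChatMusician-Base")
raw = load_dataset("m-a-p/MusicPile-sft", split="train")

def build_text(example):
    # Reassemble the Human/Assistant training string from the SFT fields.
    return {"text": f"Human: {example['instruction']} {example['input']}{example['output']}"}

def tokenize(example):
    # Map the text to token IDs once, ahead of training.
    return tokenizer(example["text"])

tokenized = raw.map(build_text).map(tokenize)
tokenized.save_to_disk("datasets/processed_tokens")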

Pretraining or Supervised Fine-tuning

Run:

model/train/scripts/train.sh $PREPROCESSED_DATASET_PATH $YOUR_MODEL_PATH

For example, if you're running supervised fine-tuning from m-a-p/ChatMusician-Base and your data has been preprocessed into the datasets directory:

./model/train/scripts/train.sh datasets m-a-p/ChatMusician-Base

You can then find the tensorboard log in the runs directory.

Merge Peft Model

After fine-tuning, you can merge the LoRA checkpoint with the original checkpoint using the following script:

cd ChatMusician/
python model/train/merge.py --ori_model_dir $BASE_MODEL --model_dir $LORA_CKPT_PATH --output_dir $OUTPUT_PATH
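
Under the hood, the merge is equivalent to the following PEFT calls (a sketch consistent with the merge.py excerpt quoted in the issues below; the paths are illustrative):

import torch
from peft import PeftModel
from transformers import LlamaForCausalLM

base_model_path = "m-a-p/ChatMusician-Base"        # $BASE_MODEL
lora_ckpt_path = "model/train/output_dir/epoch-2"  # $LORA_CKPT_PATH (illustrative)
output_path = "merged_model"                       # $OUTPUT_PATH (illustrative)

base_model = LlamaForCausalLM.from_pretrained(base_model_path, torch_dtype=torch.float16)
lora_model = PeftModel.from_pretrained(base_model, lora_ckpt_path, torch_dtype=torch.float16)

# Fold the low-rank updates back into the base weights and drop the adapters.
merged = lora_model.merge_and_unload()
merged.save_pretrained(output_path)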

Need Help?

If you find yourself confused or encountering any issues, feel free to create an issue on our repository for assistance.

Limitations

  • ChatMusician currently only supports strict-format, close-ended instructions for music tasks. With more funding, we plan to create more diverse multi-turn music instruction chat data for better generalization.
  • ChatMusician suffers from hallucinations and shouldn't be used for music education. This could be improved by training on more music textbooks, blogs, etc.; RLHF may help, too.
  • A large portion of the training data is in the style of Irish music. If possible, the community should develop a converter between performance MIDI and ABC scores, so that more established MIDI datasets can be included.
  • The MusicTheoryBench results reported in the paper are obtained in perplexity mode. Direct generation may result in worse performance.
  • We observe that with the current version of the training data, ChatMusician exhibits weak in-context learning and chain-of-thought abilities. The community should work on improving the quality of the music data.

Citation

If you find our work helpful, please consider citing it:

@misc{yuan2024chatmusician,
      title={ChatMusician: Understanding and Generating Music Intrinsically with LLM}, 
      author={Ruibin Yuan and Hanfeng Lin and Yi Wang and Zeyue Tian and Shangda Wu and Tianhao Shen and Ge Zhang and Yuhang Wu and Cong Liu and Ziya Zhou and Ziyang Ma and Liumeng Xue and Ziyu Wang and Qin Liu and Tianyu Zheng and Yizhi Li and Yinghao Ma and Yiming Liang and Xiaowei Chi and Ruibo Liu and Zili Wang and Pengfei Li and Jingcheng Wu and Chenghua Lin and Qifeng Liu and Tao Jiang and Wenhao Huang and Wenhu Chen and Emmanouil Benetos and Jie Fu and Gus Xia and Roger Dannenberg and Wei Xue and Shiyin Kang and Yike Guo},
      year={2024},
      eprint={2402.16153},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}


ChatMusician's Issues

How to integrate evaluation into a fresh version of open compass?

As I am having much more success with a fresh copy of OpenCompass, apparently something is missing. First I copied over the ChatMusician configs (at least in my environment):

cp ../ChatMusician/eval/configs/eval_chat_musician_7b.py configs/
cp -r ../ChatMusician/eval/configs/datasets/music_theory_bench/ configs/datasets/
cp -r ../ChatMusician/eval/configs/models/chat_musician/ configs/models/

But a few of the validations fail; notably:

KeyError: 'opencompass.datasets.MusicTheoryBenchDataset is not in the opencompass::load_dataset registry.'

Is there an easy way to fix this?

04/13 15:02:53 - OpenCompass - INFO - Task [Mozart-Transposed/reasoning_few_shot]

Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]
Loading checkpoint shards:  14%|█▍        | 1/7 [00:01<00:11,  1.99s/it]
Loading checkpoint shards:  29%|██▊       | 2/7 [00:03<00:09,  1.97s/it]
Loading checkpoint shards:  43%|████▎     | 3/7 [00:06<00:08,  2.01s/it]
Loading checkpoint shards:  57%|█████▋    | 4/7 [00:08<00:06,  2.04s/it]
Loading checkpoint shards:  71%|███████▏  | 5/7 [00:10<00:04,  2.03s/it]
Loading checkpoint shards:  86%|████████▌ | 6/7 [00:12<00:02,  2.09s/it]
Loading checkpoint shards: 100%|██████████| 7/7 [00:13<00:00,  1.95s/it]
Loading checkpoint shards: 100%|██████████| 7/7 [00:13<00:00,  2.00s/it]
Traceback (most recent call last):
  File "/content/drive/MyDrive/project/opencompass/opencompass/tasks/openicl_infer.py", line 156, in <module>
    inferencer.run()
  File "/content/drive/MyDrive/project/opencompass/opencompass/tasks/openicl_infer.py", line 74, in run
    self.dataset = build_dataset_from_cfg(self.dataset_cfg)
  File "/content/drive/MyDrive/project/opencompass/opencompass/utils/build.py", line 13, in build_dataset_from_cfg
    return LOAD_DATASET.build(dataset_cfg)
  File "/root/anaconda3/envs/opencompass/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/root/anaconda3/envs/opencompass/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 100, in build_from_cfg
    raise KeyError(
KeyError: 'opencompass.datasets.MusicTheoryBenchDataset is not in the opencompass::load_dataset registry. Please check whether the value of `opencompass.datasets.MusicTheoryBenchDataset` is correct or it was registered as expected. More details can be found at https://mmengine.readthedocs.io/en/latest/advanced_tutorials/config.html#import-the-custom-module'
[2024-04-13 15:03:15,430] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 82828) of binary: /root/anaconda3/envs/opencompass/bin/python
Traceback (most recent call last):
  File "/root/anaconda3/envs/opencompass/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.2.2', 'console_scripts', 'torchrun')())
  File "/root/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/root/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/root/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/root/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/content/drive/MyDrive/project/opencompass/opencompass/tasks/openicl_infer.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-04-13_15:03:15
  host      : fb7dea0d788a
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 82828)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

How to determine hallucinations?

Very nice work. You mentioned that there is a tendency to hallucinate. How is this determined? Can I get the probabilities per token out?

Many thanks
Peter

model/train/train.py is not macOS (gpu - mps) aware

Running

./model/train/scripts/train.sh datasets m-a-p/ChatMusician-Base as per the README

exits with:

(ml) petergreis@MacBook-Pro-M1-Max-2021 ChatMusician % ./model/train/scripts/train.sh datasets m-a-p/ChatMusician-Base
[2024-03-29 21:11:10,773] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to mps (auto detect)
[2024-03-29 21:11:12,870] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-03-29 21:11:14,403] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-03-29 21:11:14,404] [INFO] [runner.py:568:main] cmd = /Users/petergreis/anaconda3/envs/ml/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None model/train/train.py --train_path datasets --model_name_or_path m-a-p/ChatMusician-Base --per_device_train_batch_size 1 --max_len 2048 --max_src_len 1536 --num_train_epochs 2 --gradient_accumulation_steps 1 --learning_rate 1e-4 --weight_decay 0.1 --warmup_ratio 0.1 --mode llama --train_type lora --lora_dim 64 --lora_alpha 16 --lora_dropout 0.1 --lora_module_name q_proj,k_proj,v_proj,o_proj,down_proj,gate_proj,up_proj --seed 1234 --save_model_step 2000 --ds_file model/train/config/ds_zero2_no_offload.json --show_loss_step 50 --output_dir model/train/output_dir
[2024-03-29 21:11:15,592] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to mps (auto detect)
[2024-03-29 21:11:15,801] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-03-29 21:11:16,405] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0]}
[2024-03-29 21:11:16,405] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=1, node_rank=0
[2024-03-29 21:11:16,405] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2024-03-29 21:11:16,405] [INFO] [launch.py:163:main] dist_world_size=1
[2024-03-29 21:11:16,405] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
[2024-03-29 21:11:17,605] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to mps (auto detect)
[2024-03-29 21:11:17,821] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
Traceback (most recent call last):
  File "/Users/petergreis/Library/CloudStorage/Dropbox/Leeds/Project/ChatMusician/model/train/train.py", line 252, in <module>
    main()
  File "/Users/petergreis/Library/CloudStorage/Dropbox/Leeds/Project/ChatMusician/model/train/train.py", line 70, in main
    torch.cuda.set_device(args.local_rank)
  File "/Users/petergreis/anaconda3/envs/ml/lib/python3.11/site-packages/torch/cuda/__init__.py", line 404, in set_device
    torch._C._cuda_setDevice(device)
    ^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: module 'torch._C' has no attribute '_cuda_setDevice'
[2024-03-29 21:11:20,433] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 7216
[2024-03-29 21:11:20,436] [ERROR] [launch.py:321:sigkill_handler] ['/Users/petergreis/anaconda3/envs/ml/bin/python', '-u', 'model/train/train.py', '--local_rank=0', '--train_path', 'datasets', '--model_name_or_path', 'm-a-p/ChatMusician-Base', '--per_device_train_batch_size', '1', '--max_len', '2048', '--max_src_len', '1536', '--num_train_epochs', '2', '--gradient_accumulation_steps', '1', '--learning_rate', '1e-4', '--weight_decay', '0.1', '--warmup_ratio', '0.1', '--mode', 'llama', '--train_type', 'lora', '--lora_dim', '64', '--lora_alpha', '16', '--lora_dropout', '0.1', '--lora_module_name', 'q_proj,k_proj,v_proj,o_proj,down_proj,gate_proj,up_proj', '--seed', '1234', '--save_model_step', '2000', '--ds_file', 'model/train/config/ds_zero2_no_offload.json', '--show_loss_step', '50', '--output_dir', 'model/train/output_dir'] exits with return code = 1

How should the call to

mps_available = mps.is_available()

be incorporated into the code at line 70 to set the device to "mps"?
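
Perhaps something along these lines (an untested sketch of a device-selection fallback, not the repo's actual code):

import torch

local_rank = 0  # in train.py this comes from args.local_rank

# Pick the best available backend instead of assuming CUDA.
if torch.cuda.is_available():
    torch.cuda.set_device(local_rank)
    device = torch.device("cuda", local_rank)
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")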

Questions about the merge script - why not dtype 'auto'?

Just to report some success, this seems to work (subject to inference testing):

python model/train/merge.py --ori_model_dir ../chatmusician_model_tokenizer --model_dir model/train/output_dir/epoch-2-step-200 --output_dir new_out

Looking further at the merge.py script, I ask the following:

Why are the dtypes here:

base_model = LlamaForCausalLM.from_pretrained(args.ori_model_dir, torch_dtype=torch.float16)
lora_model = PeftModel.from_pretrained(base_model, args.model_dir, torch_dtype=torch.float16)

not set to torch_dtype='auto'?

Also, I find the input argument --ori_model_dir confusing; why not --orig_model_dir?

Model versus tokenizer mismatch

Greetings

Attempting to load a fine-tuned model into llama.cpp. As the error stems from my SFT ChatMusician model, I'm posting this here. I went back and tried this on a saved copy of the model.

Running this:
python3 convert.py /Users/petergreis/Dropbox/Leeds/Project/chatmusician_model_tokenizer

Yields this:

ValueError: Vocab size mismatch (model has 32000, but /Users/petergreis/Dropbox/Leeds/Project/chatmusician_model_tokenizer/tokenizer.model has 32001).

And in the model directory itself I see:

 % more added_tokens.json
{
  "<pad>": 32000
}

Which explains why the token count is off by one. Any idea how I can get the two to agree?
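
One possible way to get them to agree (a sketch, untested with llama.cpp's converter) would be to resize the model's embeddings to the tokenizer's vocabulary before converting:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "chatmusician_model_tokenizer"  # local model/tokenizer directory
model = AutoModelForCausalLM.from_pretrained(model_dir)
tokenizer = AutoTokenizer.from_pretrained(model_dir)

# Grow the embedding matrix (32000 rows) to match the tokenizer,
# which carries the extra "<pad>" token (32001 entries).
model.resize_token_embeddings(len(tokenizer))
model.save_pretrained(model_dir)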

macOS data_preprocess.py - encoding parameter on load_dataset crashes script

Running:
python model/train/data_preprocess.py -t m-a-p/ChatMusician-Base -i m-a-p/MusicPile-sft -o datasets --tokenize_fn sft

Results in:
Traceback (most recent call last):
  File "/Users/petergreis/Library/CloudStorage/Dropbox/Leeds/Project/ChatMusician/model/train/data_preprocess.py", line 90, in <module>
    main(args)
  File "/Users/petergreis/Library/CloudStorage/Dropbox/Leeds/Project/ChatMusician/model/train/data_preprocess.py", line 50, in main
    raw_dataset = load_dataset(args.input_file, cache_dir=tmp_cache_dir, keep_in_memory=False, encoding="utf8")
  File "/Users/petergreis/anaconda3/envs/ml/lib/python3.11/site-packages/datasets/load.py", line 2556, in load_dataset
    builder_instance = load_dataset_builder(
  File "/Users/petergreis/anaconda3/envs/ml/lib/python3.11/site-packages/datasets/load.py", line 2265, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
  File "/Users/petergreis/anaconda3/envs/ml/lib/python3.11/site-packages/datasets/builder.py", line 371, in __init__
    self.config, self.config_id = self._create_builder_config(
  File "/Users/petergreis/anaconda3/envs/ml/lib/python3.11/site-packages/datasets/builder.py", line 613, in _create_builder_config
    raise ValueError(f"BuilderConfig {builder_config} doesn't have a '{key}' key.")
ValueError: BuilderConfig ParquetConfig(name='default', version=0.0.0, data_dir=None, data_files={'train': ['data/train-*']}, description=None, batch_size=None, columns=None, features=None) doesn't have a 'encoding' key.

Modifying the load_dataset statement from

raw_dataset = load_dataset(args.input_file, cache_dir=tmp_cache_dir, keep_in_memory=False, encoding="utf8")

to

raw_dataset = load_dataset(args.input_file, cache_dir=tmp_cache_dir, keep_in_memory=False)

allows the script to complete. From my conda environment I am running:

datasets 2.18.0 pypi_0 pypi
transformers 4.37.2 py311hca03da5_0

I could not find the encoding parameter in the current documentation; perhaps it has been deprecated?

May I upload?

Good morning,
I have fine-tuned models of ChatMusician incorporating both Mozart and Scarlatti compositions. Any objections if I upload these to Hugging Face?

How much compute to evaluate?

I have just attempted to run the eval code on an A6000. While it starts, it looks like it just hangs. Hence my question: what did you run the eval code on? The same cluster mentioned in the paper for training?

How to format queries for predict.py?

Greetings

If, for example, I want to feed the following to predict.py as part of a data file:

Create a new 8 bar composition in the style of Mozart using the following ABC notation as a basis:
X:1
M:4/4
L:1/4
K:Dm
|| D F A d | c B A F ||

Do I simply remove the newlines and flatten it out to one line? Do you have other examples?

PS: "δΈΊδ»€δΉˆθ¦ε‡ε°‘ζ±‘ζŸ“οΌŒδΏζŠ€ηŽ―ε’ƒοΌŸ" - Nice ;)

issue with data_preprocess.py on macOS; filename not correct

Greetings

When trying to run the example from the README:

python model/train/data_preprocess.py -t m-a-p/ChatMusician-Base -i m-a-p/MusicPile-sft -o datasets --tokenize_fn sft

The script crashes out with the following:

Traceback (most recent call last):
  File "/Users/petergreis/Library/CloudStorage/Dropbox/Leeds/Project/ChatMusician/model/train/data_preprocess.py", line 110, in <module>
    main(args)
  File "/Users/petergreis/Library/CloudStorage/Dropbox/Leeds/Project/ChatMusician/model/train/data_preprocess.py", line 69, in main
    raw_dataset = load_dataset(args.input_file, cache_dir=tmp_cache_dir, keep_in_memory=False, encoding="utf8")
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/petergreis/anaconda3/envs/ml/lib/python3.11/site-packages/datasets/load.py", line 2556, in load_dataset
    builder_instance = load_dataset_builder(
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/petergreis/anaconda3/envs/ml/lib/python3.11/site-packages/datasets/load.py", line 2265, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
                                       ^^^^^^^^^^^^
  File "/Users/petergreis/anaconda3/envs/ml/lib/python3.11/site-packages/datasets/builder.py", line 371, in __init__
    self.config, self.config_id = self._create_builder_config(
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/petergreis/anaconda3/envs/ml/lib/python3.11/site-packages/datasets/builder.py", line 613, in _create_builder_config
    raise ValueError(f"BuilderConfig {builder_config} doesn't have a '{key}' key.")
ValueError: BuilderConfig ParquetConfig(name='default', version=0.0.0, data_dir=None, data_files={'train': ['data/train-*']}, description=None, batch_size=None, columns=None, features=None) doesn't have a 'encoding' key.

I have chased this down to this line:

filename = '.'.join(args.input_file.split("/")[-1].split(".")[:-1])

Under macOS this yields an empty string, which causes the problem. Given that the original call passes the argument "m-a-p/MusicPile-sft", which part of the argument is intended for use here? And, since this appears to be used to set the cache directory, shouldn't it respect the HF_DATASETS_CACHE environment variable if set?

How can I fine-tune on my own data?

Hi, thanks for the great work!

I wonder how I can fine-tune your chat checkpoint on my own dataset? Are there any resources I can refer to, such as dataset preparation and a fine-tuning script? Thanks!

After update to data_preprocess.py, where is the training data?

When attempting to train (as per the readme), the shell script file expands to:

model/train/train.py --train_path datasets --model_name_or_path m-a-p/ChatMusician-Base --per_device_train_batch_size 1 --max_len 2048 --max_src_len 1536 --num_train_epochs 2 --gradient_accumulation_steps 1 --learning_rate 1e-4 --weight_decay 0.1 --warmup_ratio 0.1 --mode llama --train_type lora --lora_dim 64 --lora_alpha 16 --lora_dropout 0.1 --lora_module_name q_proj,k_proj,v_proj,o_proj,down_proj,gate_proj,up_proj --seed 1234 --save_model_step 2000 --show_loss_step 50 --output_dir model/train/output_dir

Which then throws this trace:

Traceback (most recent call last):
  File "/Users/petergreis/Library/CloudStorage/Dropbox/Leeds/Project/ChatMusician/model/train/train.py", line 129, in <module>
    main()
  File "/Users/petergreis/Library/CloudStorage/Dropbox/Leeds/Project/ChatMusician/model/train/train.py", line 89, in main
    tokenizer = MODE[args.mode]["tokenizer"].from_pretrained(args.model_name_or_path, use_fast=False, trust_remote_code=True)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/petergreis/anaconda3/envs/ml/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2029, in from_pretrained
    return cls._from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/petergreis/anaconda3/envs/ml/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2261, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/petergreis/Library/CloudStorage/Dropbox/Leeds/Project/ChatMusician/model/train/llama/tokenization_llama.py", line 79, in __init__
    super().__init__(
  File "/Users/petergreis/anaconda3/envs/ml/lib/python3.11/site-packages/transformers/tokenization_utils.py", line 367, in __init__
    self._add_tokens(
  File "/Users/petergreis/anaconda3/envs/ml/lib/python3.11/site-packages/transformers/tokenization_utils.py", line 467, in _add_tokens
    current_vocab = self.get_vocab().copy()
                    ^^^^^^^^^^^^^^^^
  File "/Users/petergreis/Library/CloudStorage/Dropbox/Leeds/Project/ChatMusician/model/train/llama/tokenization_llama.py", line 113, in get_vocab
    vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
                                                             ^^^^^^^^^^^^^^^
  File "/Users/petergreis/Library/CloudStorage/Dropbox/Leeds/Project/ChatMusician/model/train/llama/tokenization_llama.py", line 109, in vocab_size
    return self.sp_model.get_piece_size()
           ^^^^^^^^^^^^^
AttributeError: 'LlamaTokenizer' object has no attribute 'sp_model'

...which would seem to indicate that the training data is not found. The recent changes to data_preprocess.py create datasets/processed_tokens and datasets/processed_tokens_text, but even after adjusting --train_path from datasets to either of those, the tokenized data still seems not to be found. Any suggestions?

Running out of GPU memory in Google colab; ideas?

Currently seeing this when attempting to train in Google colab on a T4 GPU. Any idea how I can further reduce the footprint to make this run?

[2024-04-03 14:35:22,511] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 3.85 GB, percent = 7.6%
Traceback (most recent call last):
  File "/content/drive/MyDrive/project/ChatMusician/model/train/train.py", line 254, in <module>
    main()
  File "/content/drive/MyDrive/project/ChatMusician/model/train/train.py", line 198, in main
    model, optimizer, _, lr_scheduler = deepspeed.initialize(model=model, args=args, config=ds_config,
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/__init__.py", line 176, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 307, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 1256, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 1513, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer(
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 535, in __init__
    self.initialize_optimizer_states()
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 670, in initialize_optimizer_states
    single_grad_partition = torch.zeros(int(self.partition_size[i]),
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 610.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 403.06 MiB is free. Process 129704 has 14.35 GiB memory in use. Of the allocated memory 13.48 GiB is allocated by PyTorch, and 197.48 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[2024-04-03 14:35:26,240] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 9908
[2024-04-03 14:35:26,240] [ERROR] [launch.py:322:sigkill_handler] ['/usr/bin/python3', '-u', 'model/train/train.py', '--local_rank=0', '--train_path', 'datasets/train_bias_sample_100', '--model_name_or_path', 'm-a-p/ChatMusician-Base', '--per_device_train_batch_size', '1', '--max_len', '2048', '--max_src_len', '1536', '--num_train_epochs', '2', '--gradient_accumulation_steps', '4', '--learning_rate', '1e-4', '--weight_decay', '0.1', '--warmup_ratio', '0.1', '--mode', 'llama', '--train_type', 'lora', '--lora_dim', '64', '--lora_alpha', '16', '--lora_dropout', '0.1', '--lora_module_name', 'q_proj,k_proj,v_proj,o_proj,down_proj,gate_proj,up_proj', '--seed', '1234', '--save_model_step', '2000', '--ds_file', 'model/train/config/ds_zero2_no_offload.json', '--show_loss_step', '50', '--output_dir', 'model/train/output_dir'] exits with return code = 1

Where are the results of MusicTheoryBench?

I have just run the "eval" script against my Mozart-trained datasets. Looking at the summary results .txt and .csv, where exactly are the MusicTheoryBench results reflected?

Autotrain or Llama Factory for fine-tuning?

Greetings

Enjoying working with this so far. But one question: as this is based on Llama 2, do you see it as practical to use either autotrain-advanced and/or Llama Factory for fine-tuning?

If you would like to reach me directly: peter dot greis at freethinker dot com
