finetune-gpt2xl's Introduction

Guide: Finetune GPT2-XL (1.5 Billion Parameters) and GPT-NEO (2.7 Billion Parameters) on a single GPU with Huggingface Transformers using DeepSpeed

  • Finetuning large language models like GPT2-xl is often difficult, as these models are too big to fit on a single GPU.
  • This guide explains how to finetune GPT2-xl and GPT-NEO (2.7B Parameters) with just one command of the Huggingface Transformers library on a single GPU.
  • This is made possible by using the DeepSpeed library and gradient checkpointing to lower the required GPU memory usage of the model.
  • I also explain how to set up a server on Google Cloud with a V100 GPU (16 GB VRAM) that you can use if you don't have a GPU with enough VRAM (16+ GB) or enough normal RAM (60 GB+).

1. (Optional) Setup VM with V100 in Google Compute Engine

Note: The GPT2-xl model runs on any server with a GPU with at least 16 GB VRAM and 60 GB RAM. The GPT-NEO model needs at least 70 GB RAM. If you use your own server instead of the setup described here, you will need to install CUDA and PyTorch on it.

Requirements

  1. Install the Google Cloud SDK: Click Here
  2. Register a Google Cloud account, create a project and set up billing (only once you have set up billing can you use the $300 sign-up credit for GPUs).
  3. Request a quota limit increase for "GPU All Regions" to 1. Here is a step by step guide. The UI changed a bit and now looks like this.
  4. Log in and initialize the cloud SDK with gcloud auth login and gcloud init and follow the steps until you are set up.

Create VM

  • Replace YOURPROJECTID in the command below with the project id from your GCE project.
  • You can remove the --preemptible flag from the command below, but keeping it reduces your cost to about 1/3 and allows Google to shut down your instance at any point. At the time of writing, this configuration costs only about $1.28 / hour in GCE when using preemptible. Depending on the size of your dataset, finetuning usually takes only a few hours.
  • You can change the zone if there are no resources available. Here is a list of all zones and whether they have V100 GPUs. Depending on the time of day, you might need to try out a few. Usually there are also more servers available if you keep the --preemptible flag.
  • We need a GPU server with at least 60 GB RAM, otherwise the run will crash whenever the script wants to save/pickle a model. The setup below gives us as much RAM as possible with 12 CPU cores in GCE (without paying for extended memory). You also can't use more than 12 CPU cores with a single V100 GPU in GCE.

Run this to create the instance:

gcloud compute instances create gpuserver \
   --project YOURPROJECTID \
   --zone us-west1-b \
   --custom-cpu 12 \
   --custom-memory 78 \
   --maintenance-policy TERMINATE \
   --image-family pytorch-1-7-cu110 \
   --image-project deeplearning-platform-release \
   --boot-disk-size 200GB \
   --metadata "install-nvidia-driver=True" \
   --accelerator="type=nvidia-tesla-v100,count=1" \
   --preemptible

After 5 minutes or so (the server needs to install nvidia drivers first), you can connect to your instance with the command below. If you changed the zone, you also will need to change it here.

  • Replace YOURSDKACCOUNT with your SDK account name.
gcloud compute ssh YOURSDKACCOUNT@gpuserver --zone=us-west1-b

Don't forget to shut down the server once you're done, otherwise you will keep getting billed for it. This can be done here.

The next time, you can restart the server from the same web UI here.

2. Download script and install libraries

Run this to download the script and to install all libraries:

git clone https://github.com/Xirider/finetune-gpt2xl.git
chmod -R 777 finetune-gpt2xl/
cd finetune-gpt2xl
pip install -r requirements.txt 

(Optional) If you want to use Wandb.ai for experiment tracking, you have to log in:

wandb login

3. Finetune GPT2-xl (1.5 Billion Parameters)

Then add your training data:

  • Replace the example train.txt and validation.txt files in the folder with your own training data (keeping the same file names) and then run python text2csv.py. This converts your .txt files into one-column CSV files with a "text" header and puts all the text into a single line. We need to use .csv files instead of .txt files, because Huggingface's dataloader removes line breaks when loading text from a .txt file, which does not happen with .csv files. A minimal sketch of this conversion is shown after this list.
  • If you want to feed the model separate examples instead of one continuous block of text, you need to put each of your examples on a separate line in the train and validation CSV files.
  • Be careful with the encoding of your text. If you don't clean your text files, or if you just copy text from the web into a text editor, the dataloader from the datasets library might not load them.
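
For reference, here is a minimal sketch of what this conversion does, assuming exactly the behavior described above (it is not necessarily the literal contents of text2csv.py): each output file gets a single "text" column, with the whole text in one row, so that line breaks survive the CSV loader.

import csv

def txt_to_csv(txt_path, csv_path):
    # read the raw training text
    with open(txt_path, encoding="utf-8") as f:
        text = f.read()
    # write a one-column CSV with a "text" header and the full text as a single row
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["text"])
        writer.writerow([text])

for split in ("train", "validation"):
    txt_to_csv(split + ".txt", split + ".csv")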

Run this:

deepspeed --num_gpus=1 run_clm.py \
--deepspeed ds_config.json \
--model_name_or_path gpt2-xl \
--train_file train.csv \
--validation_file validation.csv \
--do_train \
--do_eval \
--fp16 \
--overwrite_cache \
--evaluation_strategy="steps" \
--output_dir finetuned \
--eval_steps 200 \
--num_train_epochs 1 \
--gradient_accumulation_steps 2 \
--per_device_train_batch_size 8
  • This command runs the standard run_clm.py file from Huggingface's examples with DeepSpeed, with just 2 lines added to enable gradient checkpointing and reduce memory usage (a sketch of those two lines follows this list).
  • Training on the Shakespeare example should take about 17 minutes. With gradient accumulation 2 and batch size 8, one gradient step takes about 9 seconds, so the training speed should be almost 2 examples / second. You can go up to a batch size of 12 before running out of memory, but that doesn't provide any speedup.
  • Note that the default Huggingface optimizer hyperparameters and the hyperparameters given as flags override the hyperparameters in the ds_config.json file. Therefore, if you want to adjust learning rates, warmup and more, you need to set these as flags on the training command. For an example, see the GPT-NEO training command further below, which changes the learning rate.
  • You might want to try different hyperparameters like --learning_rate and --warmup_steps to improve the finetuning.
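
For reference, the gradient checkpointing change in run_clm.py is most likely along these lines (a sketch, assuming the transformers 4.x config API; check the repo's run_clm.py for the actual lines):

# after the model has been loaded in run_clm.py:
model.config.gradient_checkpointing = True  # recompute activations in the backward pass to save GPU memory
model.config.use_cache = False              # the key/value cache is incompatible with gradient checkpointing during training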

4. Generate text with your finetuned model

You can test your finetuned GPT2-xl model with this script from Huggingface Transformers (it is included in the folder):

python run_generation.py --model_type=gpt2 --model_name_or_path=finetuned --length 200

Or you can use it now in your own code like this to generate text in batches:

# credit to Niels Rogge - https://github.com/huggingface/transformers/issues/10704

from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

tokenizer = GPT2Tokenizer.from_pretrained('finetuned')
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained('finetuned').to(device)
print("model loaded")

# this is a single input batch with size 3
texts = ["From off a hill whose concave womb", "Another try", "A third test"]

encoding = tokenizer(texts, padding=True, return_tensors='pt').to(device)
with torch.no_grad():
    generated_ids = model.generate(**encoding, max_length=100)
generated_texts = tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True)

print(generated_texts)
  • Model inference runs even on small GPUs or on CPUs without any additional changes.

Finetune GPT-NEO (2.7 Billion Parameters)

This works now. I tested it with a server with one V100 GPU (16 GB VRAM) and 78 GB normal RAM, but it might not actually need that much RAM.

Add your training data like you would for GPT2-xl:

  • Replace the example train.txt and validation.txt files in the folder with your own training data (keeping the same file names) and then run python text2csv.py. This converts your .txt files into one-column CSV files with a "text" header and puts all the text into a single line. We need to use .csv files instead of .txt files, because Huggingface's dataloader removes line breaks when loading text from a .txt file, which does not happen with .csv files.

  • If you want to feed the model separate examples instead of one continuous block of text, you need to modify the function group_texts in run_clm.py (a sketch of one way to do this follows this list).

  • Be careful with the encoding of your text. If you don't clean your text files, or if you just copy text from the web into a text editor, the dataloader from the datasets library might not load them.

  • Be sure to either log in to wandb.ai with wandb login or uninstall wandb completely. Otherwise it might cause a memory error during the run.
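
As mentioned above, one way to modify group_texts so that each CSV line stays a separate example is sketched below. This is a hypothetical replacement (pad_or_truncate is not part of the repo): instead of concatenating everything and chunking it into block_size pieces, each tokenized example is truncated or padded to block_size on its own, and the padded positions are masked out of the loss with -100.

def pad_or_truncate(examples, block_size=1024, pad_token_id=50256):
    # examples["input_ids"] holds one tokenized text per CSV line;
    # 50256 is the <|endoftext|> id shared by GPT2 and GPT-NEO
    input_ids, attention_mask, labels = [], [], []
    for ids in examples["input_ids"]:
        ids = ids[:block_size]
        pad_len = block_size - len(ids)
        attention_mask.append([1] * len(ids) + [0] * pad_len)
        labels.append(ids + [-100] * pad_len)             # ignore padding in the loss
        input_ids.append(ids + [pad_token_id] * pad_len)
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

# used in run_clm.py in place of group_texts, for example:
# lm_datasets = tokenized_datasets.map(pad_or_truncate, batched=True)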

Then start the training with this command:

deepspeed --num_gpus=1 run_clm.py \
--deepspeed ds_config_gptneo.json \
--model_name_or_path EleutherAI/gpt-neo-2.7B \
--train_file train.csv \
--validation_file validation.csv \
--do_train \
--do_eval \
--fp16 \
--overwrite_cache \
--evaluation_strategy="steps" \
--output_dir finetuned \
--num_train_epochs 1 \
--eval_steps 15 \
--gradient_accumulation_steps 2 \
--per_device_train_batch_size 4 \
--use_fast_tokenizer False \
--learning_rate 5e-06 \
--warmup_steps 10
  • This uses a smaller "allgather_bucket_size" setting in the ds_config_gptneo.json file and a smaller batch size to further reduce GPU memory usage.
  • You might want to change and try hyperparameters closer to the original EleutherAI training config. You can find these here.
  • If you want to try training on a GPU with less VRAM, or your machine doesn't have 70 GB RAM, you could try setting --per_device_train_batch_size to 1 and --gradient_accumulation_steps to 8. You can then also try to reduce the values for "allgather_bucket_size" and "reduce_bucket_size" in the ds_config_gptneo.json file to 5e7.

Generate text with a GPT-NEO 2.7 Billion Parameters model

I provided a script that allows you to interactively prompt your GPT-NEO model. If you just want to sample from the pretrained model without finetuning it yourself, replace "finetuned" with "EleutherAI/gpt-neo-2.7B". Start it with this:

python run_generate_neo.py finetuned
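
For illustration only, an interactive prompting loop like the one in run_generate_neo.py might look roughly like this (assumed behavior, not the repo's actual script):

import sys
import torch
from transformers import GPTNeoForCausalLM, AutoTokenizer

# the model name or path is passed on the command line, e.g. "finetuned" or "EleutherAI/gpt-neo-2.7B"
model_name = sys.argv[1] if len(sys.argv) > 1 else "EleutherAI/gpt-neo-2.7B"
model = GPTNeoForCausalLM.from_pretrained(model_name).half().to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_name)

while True:
    prompt = input("Prompt> ")
    ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
    with torch.no_grad():
        out = model.generate(ids, do_sample=True, max_length=200, temperature=1.0, top_p=0.8)
    print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])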

Or use this snippet to generate text from your finetuned model within your code:

# credit to Suraj Patil - https://github.com/huggingface/transformers/pull/10848 - modified to create multiple texts and use deepspeed inference

from transformers import GPTNeoForCausalLM, AutoTokenizer
import deepspeed
import torch

# casting to fp16 "half" gives a large speedup during model loading
model = GPTNeoForCausalLM.from_pretrained("finetuned").half().to("cuda")
tokenizer = AutoTokenizer.from_pretrained("finetuned")
# GPT-NEO has no pad token by default; reuse the EOS token and pad on the left
# so that prompts of different lengths can be batched together
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

# using deepspeed inference is optional: it gives about a 2x speed up
deepspeed.init_inference(model, mp_size=1, dtype=torch.half, replace_method='auto')

texts = ["From off a hill whose concave", "Parallel text 2"]

ids = tokenizer(texts, padding=True, return_tensors="pt").input_ids.to("cuda")


gen_tokens = model.generate(
  ids,
  do_sample=True,
  min_length=0,
  max_length=200,
  temperature=1.0,
  top_p=0.8,
  use_cache=True
)
gen_text = tokenizer.batch_decode(gen_tokens, skip_special_tokens=True)
print(gen_text)

(Optional) Configuration

You can change the learning rate, weight decay and warmup by setting them as flags on the training command. The warmup and learning rate values in the config are ignored, as the script always uses the Huggingface optimizer/trainer default values. If you want to override them, you need to use flags. You can check all the explanations here:

https://huggingface.co/transformers/master/main_classes/trainer.html#deepspeed

The rest of the training arguments can be provided as flags and are all listed here:

https://huggingface.co/transformers/master/main_classes/trainer.html#trainingarguments

finetune-gpt2xl's People

Contributors

xirider

finetune-gpt2xl's Issues

Can't change BOS token or EOS token for GPT Neo

In order to better control the start and stop of generated text, I have added BOS tokens and EOS tokens for GPT2xl. This works well, and the generated text stops at an appropriate length and starts the way a normal sentence would. However, I want to do the same for GPT Neo, and it does not work. I have discovered that, for some reason, the arguments that normally set BOS and EOS are not applied when GPT Neo is run, even if I change the tokenizer from AutoTokenizer to GPT2Tokenizer. Below is some code that shows what I mean.

    tokenizer = GPT2Tokenizer.from_pretrained(
        model_args.model_name_or_path, bos_token='<|beginingtext|>', eos_token='<|endingtext|>',
        pad_token='<|pad|>', **tokenizer_kwargs)
    print(tokenizer.eos_token)
    print(tokenizer.bos_token)
    quit()

As I said, when I run this with GPT2xl, the tokens are appropriately changed. When I run this with GPT Neo, both the BOS and EOS tokens are <|endoftext|>.

TypeError: unsupported operand type(s) for -: 'float' and 'str' on AWS g4dn.12xlarge

Hi, thanks for making this repo. I'm on a g4dn.12xlarge (4 GPUs) Deep Learning AMI on AWS and trying to make this work. I keep running into this error. Anything I'm missing? TypeError: unsupported operand type(s) for -: 'float' and 'str'

Traceback (most recent call last):
  File "run_clm.py", line 478, in <module>
    main()
  File "run_clm.py", line 441, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/transformers/trainer.py", line 969, in train
    self, num_training_steps=max_steps, resume_from_checkpoint=resume_from_checkpoint
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/transformers/integrations.py", line 448, in init_deepspeed
    lr_scheduler=lr_scheduler,
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/deepspeed/__init__.py", line 125, in initialize
    config_params=config_params)
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 187, in __init__
    self._configure_lr_scheduler(lr_scheduler)
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 447, in _configure_lr_scheduler
    lr_scheduler = self._scheduler_from_config(self.optimizer)
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 489, in _scheduler_from_config
    instantiated_scheduler = scheduler(optimizer, **scheduler_params)
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/deepspeed/runtime/lr_schedules.py", line 708, in __init__
    self.delta_lrs = [big - small for big, small in zip(self.max_lrs, self.min_lrs)]
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/deepspeed/runtime/lr_schedules.py", line 708, in <listcomp>
    self.delta_lrs = [big - small for big, small in zip(self.max_lrs, self.min_lrs)]
TypeError: unsupported operand type(s) for -: 'float' and 'str'
Killing subprocess 35034
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/deepspeed/launcher/launch.py", line 171, in <module>
    main()
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/deepspeed/launcher/launch.py", line 161, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/deepspeed/launcher/launch.py", line 139, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/ubuntu/anaconda3/envs/pytorch_latest_p37/bin/python3.7', '-u', 'run_clm.py', '--local_rank=0', '--deepspeed', 'ds_config.json', '--model_name_or_path', 'gpt2-xl', '--train_file', 'train.csv', '--validation_file', 'validation.csv', '--do_train', '--do_eval', '--fp16', '--overwrite_cache', '--evaluation_strategy=steps', '--output_dir', 'finetuned', '--eval_steps', '200', '--num_train_epochs', '1', '--gradient_accumulation_steps', '2', '--per_device_train_batch_size', '8']' returned non-zero exit status 1.

Thanks!

Ideal number of epochs? Number of examples meaning?

Is there a recommended number of epochs to use? I was able to successfully train on a custom dataset with nearly 45k entries in the training set and nearly 11k in the validation set. In the example, only 1 epoch is set for the flag. However, I have found that training for 4 epochs leads to a lower loss than 1 epoch, and I imagine continuing to train the model would lead to an even better result. It is difficult to say at what point overfitting may start occurring, as the validation data is only evaluated at the end of the training.

Thus I ask, is there a rough ideal number of epochs for fine-tuning? If there is, I think it would be a good idea to add that to the README (which I can do if needed).

My second question is related to the Num examples part of training and evaluation. As I said, I have nearly 45k training texts and nearly 11k validation texts. However, the Num examples say 1472 and 365 respectively for training and validation. What does this mean? Is not all the data being used? Why does it not say the much larger numbers of 45k and 11k?

Thanks for the repo and for your help. This is very cool and relatively easy to work with after one gets experience with DeepSpeed.

AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'

I tried to use your script (gpt2-xl) but I get an error:
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'

pip list
Package Version


certifi 2021.5.30
charset-normalizer 2.0.4
click 8.0.1
configparser 5.0.2
datasets 1.8.0
deepspeed 0.4.0
dill 0.3.4
docker-pycreds 0.4.0
filelock 3.0.12
fsspec 2021.7.0
gitdb 4.0.7
GitPython 3.1.18
huggingface-hub 0.0.8
idna 3.2
importlib-metadata 4.7.0
joblib 1.0.1
multiprocess 0.70.12.2
ninja 1.10.2
numpy 1.21.2
packaging 21.0
pandas 1.3.2
pathtools 0.1.2
Pillow 8.3.1
pip 21.2.4
promise 2.3
protobuf 3.17.3
psutil 5.8.0
pyarrow 3.0.0
pyparsing 2.4.7
python-dateutil 2.8.2
pytz 2021.1
PyYAML 5.4.1
regex 2021.8.21
requests 2.26.0
sacremoses 0.0.45
sentry-sdk 1.3.1
setuptools 57.4.0
shortuuid 1.0.1
six 1.16.0
smmap 4.0.0
subprocess32 3.5.4
tensorboardX 1.8
tokenizers 0.10.3
torch 1.9.0
torchvision 0.10.0
tqdm 4.49.0
transformers 4.7.0
triton 1.0.0
typing-extensions 3.10.0.0
urllib3 1.26.6
wandb 0.12.0
wheel 0.37.0
xxhash 2.0.2
zipp 3.5.0

fine tuning GPT-J 6B?

Hi, this is not an issue but I was not sure where to post it.

How can this tool be adapted for finetuning GPT-J 6B?

Crashes with new Transformers version

Here's the error:

Traceback (most recent call last):
  File "run_clm.py", line 478, in <module>
    main()
  File "run_clm.py", line 422, in main
    trainer = Trainer(
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 295, in __init__
    logging.set_verbosity(log_level)
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/utils/logging.py", line 161, in set_verbosity
    _get_library_root_logger().setLevel(verbosity)
  File "/root/miniconda3/lib/python3.8/logging/__init__.py", line 1409, in setLevel
    self.level = _checkLevel(level)
  File "/root/miniconda3/lib/python3.8/logging/__init__.py", line 194, in _checkLevel
    raise ValueError("Unknown level: %r" % level)

The fix was to install transformers v4.6.0 from pip

Gpt-neo inference with Deepspeed: IndexError: Dimension out of range

Thanks for this useful repository. I was able to follow it to train a gpt-neo 2.7B model.

Inference on the model works well for me, using less than 8 GB of VRAM, so it fits on consumer-level GPUs; however, I'm not yet able to get inference working with DeepSpeed.

To be clear...

I am using the code from here:

https://github.com/Xirider/finetune-gpt2xl/blob/main/README.md#generate-text-with-a-gpt-neo-27-billion-parameters-model

And it works well, if I comment out this line:

deepspeed.init_inference(model, mp_size=1, dtype=torch.half, replace_method='auto')

If I retain the line, then the inference fails with this error message:

  File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 374, in forward
    output = DeepSpeedSelfAttentionFunction.apply(
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 312, in forward
    output, key_layer, value_layer, context_layer = selfAttention_fp()
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 270, in selfAttention_fp
    qkv_out = qkv_func(input,
IndexError: Dimension out of range (expected to be in range of [-2, 1], but got 2)

I'm actually a bit vague on whether DeepSpeed should actually be used for inference with GPT-NEO, as far as I can tell.

Huggingface says....

https://huggingface.co/transformers/main_classes/deepspeed.html

DeepSpeed ZeRO-2 is primarily used only for training, as its features are of no use to inference.

But, Microsoft has a guide which shows the usage of Deepspeed for inference with this model...

https://github.com/microsoft/DeepSpeed/blob/master/docs/_tutorials/inference-tutorial.md#end-to-end-gpt-neo-27b-inference

IndexError: index out of bounds

I'm getting an index out of bounds error from datasets, which makes me think there's something wrong with my training data. The full error is
Traceback (most recent call last):
  File "/home/ckg/github/finetune-gpt2xl/run_clm.py", line 478, in <module>
    main()
  File "/home/ckg/github/finetune-gpt2xl/run_clm.py", line 398, in main
    lm_datasets = tokenized_datasets.map(
  File "/home/ckg/anaconda3/envs/p39/lib/python3.9/site-packages/datasets/dataset_dict.py", line 471, in map
    {
  File "/home/ckg/anaconda3/envs/p39/lib/python3.9/site-packages/datasets/dataset_dict.py", line 472, in
    k: dataset.map(
  File "/home/ckg/anaconda3/envs/p39/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 1619, in map
    return self._map_single(
  File "/home/ckg/anaconda3/envs/p39/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 186, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/ckg/anaconda3/envs/p39/lib/python3.9/site-packages/datasets/fingerprint.py", line 397, in wrapper
    out = func(self, *args, **kwargs)
  File "/home/ckg/anaconda3/envs/p39/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 1977, in _map_single
    writer.write_batch(batch)
  File "/home/ckg/anaconda3/envs/p39/lib/python3.9/site-packages/datasets/arrow_writer.py", line 383, in write_batch
    pa_table = pa.Table.from_pydict(typed_sequence_examples)
  File "pyarrow/table.pxi", line 1559, in pyarrow.lib.Table.from_pydict
  File "pyarrow/array.pxi", line 331, in pyarrow.lib.asarray
  File "pyarrow/array.pxi", line 222, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 110, in pyarrow.lib._handle_arrow_array_protocol
  File "/home/ckg/anaconda3/envs/p39/lib/python3.9/site-packages/datasets/arrow_writer.py", line 100, in arrow_array
    if trying_type and out[0].as_py() != self.data[0]:
  File "pyarrow/array.pxi", line 1067, in pyarrow.lib.Array.__getitem__
  File "pyarrow/array.pxi", line 549, in pyarrow.lib._normalize_index
IndexError: index out of bounds
[2023-02-16 18:50:28,897] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 424
[2023-02-16 18:50:28,897] [ERROR] [launch.py:324:sigkill_handler] ['/home/ckg/anaconda3/envs/p39/bin/python', '-u', 'run_clm.py', '--local_rank=0', '--deepspeed', 'ds_config_gptneo.json', '--model_name_or_path', 'EleutherAI/gpt-neo-1.3B', '--train_file', 'train.csv', '--validation_file', 'validation.csv', '--do_train', '--do_eval', '--fp16', '--overwrite_cache', '--evaluation_strategy=steps', '--output_dir', 'finetuned', '--num_train_epochs', '1', '--eval_steps', '15', '--gradient_accumulation_steps', '2', '--per_device_train_batch_size', '4', '--use_fast_tokenizer', 'False', '--learning_rate', '5e-06', '--warmup_steps', '10'] exits with return code = 1

The training file and validation file were converted to CSV with the script in the repo. The original text files are, as far as I can tell, just normal text files, so I can't think of what could have gone wrong. I've included the training and validation text files and CSV files in the report:
train.csv
train.txt
validation.csv
validation.txt

It's just the Daodejing with a trigger word added and <|endoftext|> between each verse. I was able to train earlier with the sample training data, so what could be going on here?

Errors while trying to train with two GPUs

Hi,

When trying to train on two GPUs, I'm getting this error:

Traceback (most recent call last):
  File "run_clm.py", line 478, in <module>
    main()
  File "run_clm.py", line 441, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 1083, in train
    deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/integrations.py", line 520, in deepspeed_init
    model, optimizer, _, lr_scheduler = deepspeed.initialize(
  File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/__init__.py", line 116, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 148, in __init__
    self._configure_with_arguments(args, mpu)
  File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 517, in _configure_with_arguments
    self._config = DeepSpeedConfig(config_file, mpu, param_dict=self.config_params)
  File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/config.py", line 597, in __init__
    self._configure_train_batch_size()
  File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/config.py", line 732, in _configure_train_batch_size
    self._set_batch_related_parameters()
  File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/config.py", line 728, in _set_batch_related_parameters
    assert False,
AssertionError: Either train_batch_size or micro_batch_per_gpu needs to be provided

So I added the flag --train_batch_size 8 and got the following error:

Traceback (most recent call last):
  File "run_clm.py", line 478, in <module>
    main()
  File "run_clm.py", line 192, in main
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/hf_argparser.py", line 196, in parse_args_into_dataclasses
Traceback (most recent call last):
  File "run_clm.py", line 478, in <module>
    raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
ValueError: Some specified arguments are not used by the HfArgumentParser: ['--train_batch_size', '8']
    main()
  File "run_clm.py", line 192, in main
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/hf_argparser.py", line 196, in parse_args_into_dataclasses
    raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
ValueError: Some specified arguments are not used by the HfArgumentParser: ['--train_batch_size', '8']

Looks to me like a mismatch between deepspeed and transformers, do you have any suggestions on how to solve it?

This is my ds_report:

DeepSpeed general environment info:
torch install path ............... ['/root/miniconda3/lib/python3.8/site-packages/torch']
torch version .................... 1.7.1
torch cuda version ............... 11.0
nvcc version ..................... 11.0
deepspeed install path ........... ['/root/miniconda3/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.3.15, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.7, cuda 11.0

Out of memory with RTX3090

Hi,
I'm trying to train gpt2xl but keep getting OOM, even when I set batch size to 1, gradient_accumulation to 8/16/512, contiguous_gradients to false, and allgather_bucket_size / reduce_bucket_size to 2e2.
I can see in nvidia-smi that I'm only reaching half the memory capacity - around 12 GB.
My system is as stated - 3090 with 24 GB memory
80 GB RAM
5600X CPU, if that matters
running WSL2 on Windows 10
Thanks.

TypeError: __init__() got an unexpected keyword argument 'no_args_is_help'

(gh_finetune-gpt2xl) r730ub20@r730ub20-M0:~/llm_dev/finetune-gpt2xl$ deepspeed --num_gpus=1 run_clm.py --deepspeed ds_config.json --model_name_or_path gpt2-xl --train_file train.csv --validation_file validation.csv --do_train --do_eval --fp16 --overwrite_cache --evaluation_strategy="steps" --output_dir finetuned --eval_steps 200 --num_train_epochs 1 --gradient_accumulation_steps 2 --per_device_train_batch_size 1
[2023-05-22 22:00:31,576] [WARNING] [runner.py:191:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-05-22 22:00:31,600] [INFO] [runner.py:541:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None run_clm.py --deepspeed ds_config.json --model_name_or_path gpt2-xl --train_file train.csv --validation_file validation.csv --do_train --do_eval --fp16 --overwrite_cache --evaluation_strategy=steps --output_dir finetuned --eval_steps 200 --num_train_epochs 1 --gradient_accumulation_steps 2 --per_device_train_batch_size 1
[2023-05-22 22:00:33,028] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0]}
[2023-05-22 22:00:33,028] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-05-22 22:00:33,028] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-05-22 22:00:33,028] [INFO] [launch.py:247:main] dist_world_size=1
[2023-05-22 22:00:33,028] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-05-22 22:00:34,832] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
05/22/2023 22:00:34 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: True
05/22/2023 22:00:34 - INFO - __main__ - Training/evaluation parameters TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=ds_config.json,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_steps=200,
evaluation_strategy=IntervalStrategy.STEPS,
fp16=True,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=2,
greater_is_better=None,
group_by_length=False,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_on_each_node=True,
logging_dir=runs/May22_22-00-34_r730ub20-M0,
logging_first_step=False,
logging_steps=500,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=1.0,
output_dir=finetuned,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=1,
prediction_loss_only=False,
push_to_hub=False,
remove_unused_columns=True,
report_to=['wandb'],
resume_from_checkpoint=None,
run_name=finetuned,
save_steps=500,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=None,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_legacy_prediction_loop=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
)
05/22/2023 22:00:36 - WARNING - datasets.builder - Using custom data configuration default-3bfffae691dad1b0
05/22/2023 22:00:36 - WARNING - datasets.builder - Reusing dataset csv (/home/r730ub20/.cache/huggingface/datasets/csv/default-3bfffae691dad1b0/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0)
[INFO|configuration_utils.py:517] 2023-05-22 22:00:36,541 >> loading configuration file https://huggingface.co/gpt2-xl/resolve/main/config.json from cache at /home/r730ub20/.cache/huggingface/transformers/d2de8fec009fa9b9196047559bcac6c1f02a9c500718b4346bc516354965b1ca.d684cb2afa3f8c44c73bd67537d9aa5ff6044658793e077d7306ef2e37dd79bd
[INFO|configuration_utils.py:553] 2023-05-22 22:00:36,543 >> Model config GPT2Config {
"activation_function": "gelu_new",
"architectures": [
"GPT2LMHeadModel"
],
"attn_pdrop": 0.1,
"bos_token_id": 50256,
"embd_pdrop": 0.1,
"eos_token_id": 50256,
"gradient_checkpointing": false,
"initializer_range": 0.02,
"layer_norm_epsilon": 1e-05,
"model_type": "gpt2",
"n_ctx": 1024,
"n_embd": 1600,
"n_head": 25,
"n_inner": null,
"n_layer": 48,
"n_positions": 1024,
"output_past": true,
"resid_pdrop": 0.1,
"scale_attn_weights": true,
"summary_activation": null,
"summary_first_dropout": 0.1,
"summary_proj_to_labels": true,
"summary_type": "cls_index",
"summary_use_proj": true,
"task_specific_params": {
"text-generation": {
"do_sample": true,
"max_length": 50
}
},
"transformers_version": "4.7.0",
"use_cache": true,
"vocab_size": 50257
}

[INFO|configuration_utils.py:517] 2023-05-22 22:00:36,953 >> loading configuration file https://huggingface.co/gpt2-xl/resolve/main/config.json from cache at /home/r730ub20/.cache/huggingface/transformers/d2de8fec009fa9b9196047559bcac6c1f02a9c500718b4346bc516354965b1ca.d684cb2afa3f8c44c73bd67537d9aa5ff6044658793e077d7306ef2e37dd79bd
[INFO|configuration_utils.py:553] 2023-05-22 22:00:36,954 >> Model config GPT2Config {
"activation_function": "gelu_new",
"architectures": [
"GPT2LMHeadModel"
],
"attn_pdrop": 0.1,
"bos_token_id": 50256,
"embd_pdrop": 0.1,
"eos_token_id": 50256,
"gradient_checkpointing": false,
"initializer_range": 0.02,
"layer_norm_epsilon": 1e-05,
"model_type": "gpt2",
"n_ctx": 1024,
"n_embd": 1600,
"n_head": 25,
"n_inner": null,
"n_layer": 48,
"n_positions": 1024,
"output_past": true,
"resid_pdrop": 0.1,
"scale_attn_weights": true,
"summary_activation": null,
"summary_first_dropout": 0.1,
"summary_proj_to_labels": true,
"summary_type": "cls_index",
"summary_use_proj": true,
"task_specific_params": {
"text-generation": {
"do_sample": true,
"max_length": 50
}
},
"transformers_version": "4.7.0",
"use_cache": true,
"vocab_size": 50257
}

[INFO|tokenization_utils_base.py:1717] 2023-05-22 22:00:39,950 >> loading file https://huggingface.co/gpt2-xl/resolve/main/vocab.json from cache at /home/r730ub20/.cache/huggingface/transformers/8560a2df03f812b276794ae6935255d0590522553a4c8103155472b07591a21b.c7ed1f96aac49e745788faa77ba0a26a392643a50bb388b9c04ff469e555241f
[INFO|tokenization_utils_base.py:1717] 2023-05-22 22:00:39,950 >> loading file https://huggingface.co/gpt2-xl/resolve/main/merges.txt from cache at /home/r730ub20/.cache/huggingface/transformers/18fe27e0b70062b3e45fc4e827d5449d9fe85875937594da927e48cb657366d1.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c81157788b
[INFO|tokenization_utils_base.py:1717] 2023-05-22 22:00:39,950 >> loading file https://huggingface.co/gpt2-xl/resolve/main/tokenizer.json from cache at /home/r730ub20/.cache/huggingface/transformers/aabb8839163cd911f810ab23f5ae8c966b9b9ea60622c429020611caa389b04b.cf2d0ecb83b6df91b3dbb53f1d1e4c311578bfd3aa0e04934215a49bf9898df0
[INFO|tokenization_utils_base.py:1717] 2023-05-22 22:00:39,950 >> loading file https://huggingface.co/gpt2-xl/resolve/main/added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:1717] 2023-05-22 22:00:39,950 >> loading file https://huggingface.co/gpt2-xl/resolve/main/special_tokens_map.json from cache at None
[INFO|tokenization_utils_base.py:1717] 2023-05-22 22:00:39,950 >> loading file https://huggingface.co/gpt2-xl/resolve/main/tokenizer_config.json from cache at None
[INFO|modeling_utils.py:1152] 2023-05-22 22:00:40,482 >> loading weights file https://huggingface.co/gpt2-xl/resolve/main/pytorch_model.bin from cache at /home/r730ub20/.cache/huggingface/transformers/96569b907e56747ce3e593c6a13d8475b8c733a64aab8af8f602b90d94c4af71.8fbbcdf404c82c5967934d411f1462fa0574d639f2aa398aa3754fced1bb26c0
[INFO|modeling_utils.py:1336] 2023-05-22 22:00:58,095 >> All model checkpoint weights were used when initializing GPT2LMHeadModel.

[INFO|modeling_utils.py:1344] 2023-05-22 22:00:58,095 >> All the weights of GPT2LMHeadModel were initialized from the model checkpoint at gpt2-xl.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GPT2LMHeadModel for predictions without further training.
05/22/2023 22:00:58 - WARNING - datasets.fingerprint - Parameter 'function'=<function main.<locals>.tokenize_function at 0x7f2363e61af0> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
0%| | 0/1 [00:00<?, ?ba/s][WARNING|tokenization_utils_base.py:3171] 2023-05-22 22:01:02,910 >> Token indices sequence length is longer than the specified maximum sequence length for this model (1462828 > 1024). Running this sequence through the model will result in indexing errors
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:05<00:00, 5.10s/ba]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 61.16ba/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00, 1.46s/ba]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 194.43ba/s]
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[INFO|trainer.py:414] 2023-05-22 22:01:05,456 >> Using amp fp16 backend
[2023-05-22 22:01:05,461] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.9.2, git-hash=unknown, git-branch=unknown
[2023-05-22 22:01:05,462] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2023-05-22 22:01:10,928] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Installed CUDA version 11.7 does not match the version torch was compiled with 11.3 but since the APIs are compatible, accepting this combination
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Installed CUDA version 11.7 does not match the version torch was compiled with 11.3 but since the APIs are compatible, accepting this combination
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Using /home/r730ub20/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Detected CUDA files, patching ldflags
Emitting ninja build file /home/r730ub20/.cache/torch_extensions/py38_cu113/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.7361702919006348 seconds
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.000050, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
[2023-05-22 22:01:17,581] [INFO] [logging.py:96:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2023-05-22 22:01:17,630] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2023-05-22 22:01:17,630] [INFO] [utils.py:54:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2023-05-22 22:01:17,630] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 2 optimizer
[2023-05-22 22:01:17,630] [INFO] [stage_1_and_2.py:133:__init__] Reduce bucket size 200000000
[2023-05-22 22:01:17,630] [INFO] [stage_1_and_2.py:134:__init__] Allgather bucket size 200000000
[2023-05-22 22:01:17,630] [INFO] [stage_1_and_2.py:135:__init__] CPU Offload: True
[2023-05-22 22:01:17,630] [INFO] [stage_1_and_2.py:136:__init__] Round robin gradient partitioning: False
Using /home/r730ub20/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Emitting ninja build file /home/r730ub20/.cache/torch_extensions/py38_cu113/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.6310455799102783 seconds
Rank: 0 partition count [1] and sizes[(1557611200, False)]
[2023-05-22 22:01:24,621] [INFO] [utils.py:785:see_memory_usage] Before initializing optimizer states
[2023-05-22 22:01:24,622] [INFO] [utils.py:786:see_memory_usage] MA 3.1 GB Max_MA 3.1 GB CA 3.1 GB Max_CA 3 GB
[2023-05-22 22:01:24,623] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 18.32 GB, percent = 7.3%
[2023-05-22 22:01:31,310] [INFO] [utils.py:785:see_memory_usage] After initializing optimizer states
[2023-05-22 22:01:31,311] [INFO] [utils.py:786:see_memory_usage] MA 3.1 GB Max_MA 3.1 GB CA 3.1 GB Max_CA 3 GB
[2023-05-22 22:01:31,311] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 35.84 GB, percent = 14.2%
[2023-05-22 22:01:31,311] [INFO] [stage_1_and_2.py:489:__init__] optimizer state initialized
[2023-05-22 22:01:31,369] [INFO] [utils.py:785:see_memory_usage] After initializing ZeRO optimizer
[2023-05-22 22:01:31,370] [INFO] [utils.py:786:see_memory_usage] MA 3.1 GB Max_MA 3.1 GB CA 3.1 GB Max_CA 3 GB
[2023-05-22 22:01:31,370] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 35.84 GB, percent = 14.2%
[2023-05-22 22:01:31,386] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = adamw
[2023-05-22 22:01:31,386] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using configured LR scheduler = WarmupLR
[2023-05-22 22:01:31,386] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <deepspeed.runtime.lr_schedules.WarmupLR object at 0x7f22265c1040>
[2023-05-22 22:01:31,386] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[5e-05], mom=[[0.9, 0.999]]
[2023-05-22 22:01:31,387] [INFO] [config.py:955:print] DeepSpeedEngine configuration:
[2023-05-22 22:01:31,387] [INFO] [config.py:959:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2023-05-22 22:01:31,387] [INFO] [config.py:959:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-05-22 22:01:31,387] [INFO] [config.py:959:print] amp_enabled .................. False
[2023-05-22 22:01:31,387] [INFO] [config.py:959:print] amp_params ................... False
[2023-05-22 22:01:31,388] [INFO] [config.py:959:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2023-05-22 22:01:31,388] [INFO] [config.py:959:print] bfloat16_enabled ............. False
[2023-05-22 22:01:31,388] [INFO] [config.py:959:print] checkpoint_parallel_write_pipeline False
[2023-05-22 22:01:31,388] [INFO] [config.py:959:print] checkpoint_tag_validation_enabled True
[2023-05-22 22:01:31,388] [INFO] [config.py:959:print] checkpoint_tag_validation_fail False
[2023-05-22 22:01:31,388] [INFO] [config.py:959:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f2032c4a580>
[2023-05-22 22:01:31,388] [INFO] [config.py:959:print] communication_data_type ...... None
[2023-05-22 22:01:31,388] [INFO] [config.py:959:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-05-22 22:01:31,388] [INFO] [config.py:959:print] curriculum_enabled_legacy .... False
[2023-05-22 22:01:31,388] [INFO] [config.py:959:print] curriculum_params_legacy ..... False
[2023-05-22 22:01:31,388] [INFO] [config.py:959:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-05-22 22:01:31,388] [INFO] [config.py:959:print] data_efficiency_enabled ...... False
[2023-05-22 22:01:31,388] [INFO] [config.py:959:print] dataloader_drop_last ......... False
[2023-05-22 22:01:31,388] [INFO] [config.py:959:print] disable_allgather ............ False
[2023-05-22 22:01:31,388] [INFO] [config.py:959:print] dump_state ................... False
[2023-05-22 22:01:31,388] [INFO] [config.py:959:print] dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1}
[2023-05-22 22:01:31,388] [INFO] [config.py:959:print] eigenvalue_enabled ........... False
[2023-05-22 22:01:31,388] [INFO] [config.py:959:print] eigenvalue_gas_boundary_resolution 1
[2023-05-22 22:01:31,388] [INFO] [config.py:959:print] eigenvalue_layer_name ........ bert.encoder.layer
[2023-05-22 22:01:31,388] [INFO] [config.py:959:print] eigenvalue_layer_num ......... 0
[2023-05-22 22:01:31,388] [INFO] [config.py:959:print] eigenvalue_max_iter .......... 100
[2023-05-22 22:01:31,388] [INFO] [config.py:959:print] eigenvalue_stability ......... 1e-06
[2023-05-22 22:01:31,388] [INFO] [config.py:959:print] eigenvalue_tol ............... 0.01
[2023-05-22 22:01:31,388] [INFO] [config.py:959:print] eigenvalue_verbose ........... False
[2023-05-22 22:01:31,388] [INFO] [config.py:959:print] elasticity_enabled ........... False
[2023-05-22 22:01:31,389] [INFO] [config.py:959:print] flops_profiler_config ........ {
"enabled": false,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2023-05-22 22:01:31,389] [INFO] [config.py:959:print] fp16_auto_cast ............... False
[2023-05-22 22:01:31,389] [INFO] [config.py:959:print] fp16_enabled ................. True
[2023-05-22 22:01:31,389] [INFO] [config.py:959:print] fp16_master_weights_and_gradients False
[2023-05-22 22:01:31,389] [INFO] [config.py:959:print] global_rank .................. 0
[2023-05-22 22:01:31,389] [INFO] [config.py:959:print] grad_accum_dtype ............. None
[2023-05-22 22:01:31,389] [INFO] [config.py:959:print] gradient_accumulation_steps .. 2
[2023-05-22 22:01:31,389] [INFO] [config.py:959:print] gradient_clipping ............ 1.0
[2023-05-22 22:01:31,389] [INFO] [config.py:959:print] gradient_predivide_factor .... 1.0
[2023-05-22 22:01:31,389] [INFO] [config.py:959:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2023-05-22 22:01:31,389] [INFO] [config.py:959:print] initial_dynamic_scale ........ 65536
[2023-05-22 22:01:31,389] [INFO] [config.py:959:print] load_universal_checkpoint .... False
[2023-05-22 22:01:31,389] [INFO] [config.py:959:print] loss_scale ................... 0
[2023-05-22 22:01:31,389] [INFO] [config.py:959:print] memory_breakdown ............. False
[2023-05-22 22:01:31,389] [INFO] [config.py:959:print] mics_hierarchial_params_gather False
[2023-05-22 22:01:31,389] [INFO] [config.py:959:print] mics_shard_size .............. -1
[2023-05-22 22:01:31,389] [INFO] [config.py:959:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2023-05-22 22:01:31,389] [INFO] [config.py:959:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2023-05-22 22:01:31,389] [INFO] [config.py:959:print] optimizer_legacy_fusion ...... False
[2023-05-22 22:01:31,389] [INFO] [config.py:959:print] optimizer_name ............... adamw
[2023-05-22 22:01:31,389] [INFO] [config.py:959:print] optimizer_params ............. {'lr': 5e-05, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.0}
[2023-05-22 22:01:31,389] [INFO] [config.py:959:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-05-22 22:01:31,389] [INFO] [config.py:959:print] pld_enabled .................. False
[2023-05-22 22:01:31,390] [INFO] [config.py:959:print] pld_params ................... False
[2023-05-22 22:01:31,390] [INFO] [config.py:959:print] prescale_gradients ........... False
[2023-05-22 22:01:31,390] [INFO] [config.py:959:print] scheduler_name ............... WarmupLR
[2023-05-22 22:01:31,390] [INFO] [config.py:959:print] scheduler_params ............. {'warmup_min_lr': 0, 'warmup_max_lr': 5e-05, 'warmup_num_steps': 0}
[2023-05-22 22:01:31,390] [INFO] [config.py:959:print] sparse_attention ............. None
[2023-05-22 22:01:31,390] [INFO] [config.py:959:print] sparse_gradients_enabled ..... False
[2023-05-22 22:01:31,390] [INFO] [config.py:959:print] steps_per_print .............. 2000
[2023-05-22 22:01:31,390] [INFO] [config.py:959:print] train_batch_size ............. 2
[2023-05-22 22:01:31,390] [INFO] [config.py:959:print] train_micro_batch_size_per_gpu 1
[2023-05-22 22:01:31,390] [INFO] [config.py:959:print] use_node_local_storage ....... False
[2023-05-22 22:01:31,390] [INFO] [config.py:959:print] wall_clock_breakdown ......... False
[2023-05-22 22:01:31,390] [INFO] [config.py:959:print] world_size ................... 1
[2023-05-22 22:01:31,390] [INFO] [config.py:959:print] zero_allow_untested_optimizer False
[2023-05-22 22:01:31,390] [INFO] [config.py:959:print] zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=200000000 allgather_partitions=True allgather_bucket_size=200000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True
[2023-05-22 22:01:31,390] [INFO] [config.py:959:print] zero_enabled ................. True
[2023-05-22 22:01:31,390] [INFO] [config.py:959:print] zero_force_ds_cpu_optimizer .. True
[2023-05-22 22:01:31,390] [INFO] [config.py:959:print] zero_optimization_stage ...... 2
[2023-05-22 22:01:31,390] [INFO] [config.py:945:print_user_config] json = {
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": 5e-05,
"betas": [0.9, 0.999],
"eps": 1e-08,
"weight_decay": 0.0
}
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": 0,
"warmup_max_lr": 5e-05,
"warmup_num_steps": 0
}
},
"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"allgather_bucket_size": 2.000000e+08,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 2.000000e+08,
"contiguous_gradients": true,
"cpu_offload": true
},
"gradient_accumulation_steps": 2,
"gradient_clipping": 1.0,
"steps_per_print": 2.000000e+03,
"train_batch_size": 2,
"train_micro_batch_size_per_gpu": 1,
"wall_clock_breakdown": false
}
Using /home/r730ub20/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0004444122314453125 seconds
[INFO|trainer.py:1147] 2023-05-22 22:01:31,391 >> ***** Running training *****
[INFO|trainer.py:1148] 2023-05-22 22:01:31,391 >> Num examples = 11428
[INFO|trainer.py:1149] 2023-05-22 22:01:31,391 >> Num Epochs = 1
[INFO|trainer.py:1150] 2023-05-22 22:01:31,391 >> Instantaneous batch size per device = 1
[INFO|trainer.py:1151] 2023-05-22 22:01:31,391 >> Total train batch size (w. parallel, distributed & accumulation) = 2
[INFO|trainer.py:1152] 2023-05-22 22:01:31,391 >> Gradient Accumulation steps = 2
[INFO|trainer.py:1153] 2023-05-22 22:01:31,391 >> Total optimization steps = 5714
[INFO|integrations.py:402] 2023-05-22 22:01:31,393 >> Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/r730ub20/.local/lib/python3.8/site-packages/wandb/main.py", line 1, in
from wandb.cli import cli
File "/home/r730ub20/.local/lib/python3.8/site-packages/wandb/cli/cli.py", line 933, in
def launch_sweep(
File "/usr/lib/python3/dist-packages/click/core.py", line 1234, in decorator
cmd = command(*args, **kwargs)(f)
File "/usr/lib/python3/dist-packages/click/decorators.py", line 115, in decorator
cmd = _make_command(f, name, attrs, cls)
File "/usr/lib/python3/dist-packages/click/decorators.py", line 88, in make_command
return cls(name=name or f.name.lower().replace('
', '-'),
TypeError: init() got an unexpected keyword argument 'no_args_is_help'
Traceback (most recent call last):
File "run_clm.py", line 478, in
main()
File "run_clm.py", line 441, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/r730ub20/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1207, in train
self.control = self.callback_handler.on_train_begin(args, self.state, self.control)
File "/home/r730ub20/.local/lib/python3.8/site-packages/transformers/trainer_callback.py", line 340, in on_train_begin
return self.call_event("on_train_begin", args, state, control)
File "/home/r730ub20/.local/lib/python3.8/site-packages/transformers/trainer_callback.py", line 378, in call_event
result = getattr(callback, event)(
File "/home/r730ub20/.local/lib/python3.8/site-packages/transformers/integrations.py", line 446, in on_train_begin
self.setup(args, state, model, **kwargs)
File "/home/r730ub20/.local/lib/python3.8/site-packages/transformers/integrations.py", line 419, in setup
self._wandb.init(
File "/home/r730ub20/.local/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 1169, in init
raise e
File "/home/r730ub20/.local/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 1146, in init
wi.setup(kwargs)
File "/home/r730ub20/.local/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 172, in setup
self._wl = wandb_setup.setup(settings=setup_settings)
File "/home/r730ub20/.local/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 327, in setup
ret = _setup(settings=settings)
File "/home/r730ub20/.local/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 320, in _setup
wl = _WandbSetup(settings=settings)
File "/home/r730ub20/.local/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 303, in init
_WandbSetup._instance = _WandbSetup__WandbSetup(settings=settings, pid=pid)
File "/home/r730ub20/.local/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 114, in init
self._setup()
File "/home/r730ub20/.local/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 250, in _setup
self._setup_manager()
File "/home/r730ub20/.local/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 277, in _setup_manager
self._manager = wandb_manager._Manager(settings=self._settings)
File "/home/r730ub20/.local/lib/python3.8/site-packages/wandb/sdk/wandb_manager.py", line 145, in init
self._service.start()
File "/home/r730ub20/.local/lib/python3.8/site-packages/wandb/sdk/service/service.py", line 199, in start
self._launch_server()
File "/home/r730ub20/.local/lib/python3.8/site-packages/wandb/sdk/service/service.py", line 193, in _launch_server
_sentry.reraise(e)
File "/home/r730ub20/.local/lib/python3.8/site-packages/wandb/analytics/sentry.py", line 146, in reraise
raise exc.with_traceback(sys.exc_info()[2])
File "/home/r730ub20/.local/lib/python3.8/site-packages/wandb/sdk/service/service.py", line 191, in _launch_server
self._wait_for_ports(fname, proc=internal_proc)
File "/home/r730ub20/.local/lib/python3.8/site-packages/wandb/sdk/service/service.py", line 116, in _wait_for_ports
raise ServiceStartProcessError(
wandb.sdk.service.service.ServiceStartProcessError: The wandb service process exited with 1. Ensure that sys.executable is a valid python interpreter. You can override it with the _executable setting or with the WANDB__EXECUTABLE environment variable.
[2023-05-22 22:01:38,113] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 10431
[2023-05-22 22:01:38,114] [ERROR] [launch.py:434:sigkill_handler] ['/usr/bin/python3', '-u', 'run_clm.py', '--local_rank=0', '--deepspeed', 'ds_config.json', '--model_name_or_path', 'gpt2-xl', '--train_file', 'train.csv', '--validation_file', 'validation.csv', '--do_train', '--do_eval', '--fp16', '--overwrite_cache', '--evaluation_strategy=steps', '--output_dir', 'finetuned', '--eval_steps', '200', '--num_train_epochs', '1', '--gradient_accumulation_steps', '2', '--per_device_train_batch_size', '1'] exits with return code = 1
(gh_finetune-gpt2xl) r730ub20@r730ub20-M0:~/llm_dev/finetune-gpt2xl$
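One way to work around this crash, based on the hint transformers prints earlier in the log ('to disable set os.environ["WANDB_DISABLED"] = "true"'), is to switch off the W&B integration before training starts. A minimal sketch; placing it near the top of run_clm.py is an assumption, and exporting WANDB_DISABLED=true in the shell before launching works as well:

import os

# Disable the Weights & Biases callback so transformers never calls wandb.init(),
# which means the wandb service subprocess (the one hitting the click error above)
# is never launched.
os.environ["WANDB_DISABLED"] = "true"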

Multiple entries csv

Hi, I come from Upwork. Is this what you are looking for? It splits the dataset into a multi-row CSV:


start_token = "|<start of text>|"
end_token = "|<end of text>|"
with open('train.txt', encoding='utf-8') as txtfile:
    all_text = txtfile.read().replace(start_token,"").split(end_token)
    all_text = all_text[0:len(all_text)-1]
with open('train.csv', mode='w', encoding='utf-8') as csv_file:
    fieldnames = ['text']
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
    writer.writeheader()
    for row in all_text:
        writer.writerow({'text': all_text})


with open('validation.txt', encoding='utf-8') as txtfile:
    all_text = txtfile.read().replace(start_token,"").split(end_token)
    all_text = all_text[0:len(all_text)-1]
with open('validation.csv', mode='w', encoding='utf-8') as csv_file:
    fieldnames = ['text']
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
    writer.writeheader()
    for row in all_text:
        writer.writerow({'text': row})

print("created train.csv and validation.csv > files")```

Using [text,labels] instead of just [text] in Datasets

Hi, I'd like to start with a big thanks for your amazing work. I would like to use your library to fine-tune GPT-NEO for a Text2Text task instead of TextGeneration. I'm trying to adapt your script run_clm.py to handle a Dataset with a [text, label] structure rather than just [text].

So I'm now trying to create a train_dataset built from these two new tokenized datasets, constructed this way:

def tokenize_function_text(examples):
    return tokenizer(examples["text"])

tokenized_datasets_text = datasets.map(
    tokenize_function_text,
    batched=True,
    num_proc=data_args.preprocessing_num_workers,
    remove_columns=column_names,
    load_from_cache_file=not data_args.overwrite_cache,
)

def tokenize_function_label(examples):
    return tokenizer(examples["label"])

tokenized_datasets_label = datasets.map(
    tokenize_function_label,
    batched=True,
    num_proc=data_args.preprocessing_num_workers,
    remove_columns=column_names,
    load_from_cache_file=not data_args.overwrite_cache,
)

But I'm really struggling to merge them into a single "train_dataset" object that I can give to the trainer. Do you have any tips or suggestions?

thank you very much
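One possible direction (a minimal sketch, not from the original thread): for a causal LM you can tokenize both columns in a single map call, concatenate prompt and target, and mask the prompt tokens out of the loss with -100. The names tokenizer, datasets, column_names, and data_args follow the snippets above; using eos_token_id as a separator is an assumption.

def tokenize_pair(examples):
    input_ids, labels = [], []
    for text, label in zip(examples["text"], examples["label"]):
        text_ids = tokenizer(text)["input_ids"]
        label_ids = tokenizer(label)["input_ids"] + [tokenizer.eos_token_id]
        input_ids.append(text_ids + label_ids)
        labels.append([-100] * len(text_ids) + label_ids)  # loss only on the label part
    return {
        "input_ids": input_ids,
        "attention_mask": [[1] * len(ids) for ids in input_ids],
        "labels": labels,
    }

lm_datasets = datasets.map(
    tokenize_pair,
    batched=True,
    num_proc=data_args.preprocessing_num_workers,
    remove_columns=column_names,
    load_from_cache_file=not data_args.overwrite_cache,
)
train_dataset = lm_datasets["train"]

Since the examples have varying lengths, you would also need a data collator that pads input_ids/attention_mask and pads labels with -100, rather than the default grouping into fixed blocks.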

Resume from checkpoint

I have an RTX 3090 (24 GB), 64 GB RAM, and 50 GB of swap. Training works pretty nicely, but unfortunately resuming training from a checkpoint results in OOM:

[2021-05-07 19:18:39,962] [WARNING] [runner.py:122:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2021-05-07 19:18:39,973] [INFO] [runner.py:360:main] cmd = /opt/conda/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 run_clm.py --deepspeed ds_config_gptneo_new.json --model_name_or_path /datadrive/model/checkpoint-800/ --train_file merged_train.txt.csv --do_train --fp16 --overwrite_cache --output_dir /datadrive/model --num_train_epochs 1 --gradient_accumulation_steps 2 --per_device_train_batch_size 4 --use_fast_tokenizer False --learning_rate 5e-06 --save_steps 400
[2021-05-07 19:18:40,526] [INFO] [launch.py:73:main] 0 NCCL_VERSION 2.7.8
[2021-05-07 19:18:40,526] [INFO] [launch.py:80:main] WORLD INFO DICT: {'localhost': [0]}
[2021-05-07 19:18:40,526] [INFO] [launch.py:86:main] nnodes=1, num_local_procs=1, node_rank=0
[2021-05-07 19:18:40,526] [INFO] [launch.py:101:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2021-05-07 19:18:40,526] [INFO] [launch.py:102:main] dist_world_size=1
[2021-05-07 19:18:40,526] [INFO] [launch.py:104:main] Setting CUDA_VISIBLE_DEVICES=0
[2021-05-07 19:18:41,601] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
05/07/2021 19:18:41 - WARNING - __main__ -   Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: True
05/07/2021 19:18:41 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir=/datadrive/model, overwrite_output_dir=False, do_train=True, do_eval=False, do_predict=False, evaluation_strategy=IntervalStrategy.NO, prediction_loss_only=False, per_device_train_batch_size=4, per_device_eval_batch_size=8, gradient_accumulation_steps=2, eval_accumulation_steps=None, learning_rate=5e-06, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, lr_scheduler_type=SchedulerType.LINEAR, warmup_ratio=0.0, warmup_steps=0, logging_dir=runs/May07_19-18-41_9c3c6cac903e, logging_strategy=IntervalStrategy.STEPS, logging_first_step=False, logging_steps=500, save_strategy=IntervalStrategy.STEPS, save_steps=400, save_total_limit=None, no_cuda=False, seed=42, fp16=True, fp16_opt_level=O1, fp16_backend=auto, fp16_full_eval=False, local_rank=0, tpu_num_cores=None, tpu_metrics_debug=False, debug=[], dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=0, past_index=-1, run_name=/datadrive/model, disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, sharded_ddp=[], deepspeed=ds_config_gptneo_new.json, label_smoothing_factor=0.0, adafactor=False, group_by_length=False, length_column_name=length, report_to=['tensorboard'], ddp_find_unused_parameters=None, dataloader_pin_memory=True, skip_memory_metrics=False, use_legacy_prediction_loop=False, push_to_hub=False, resume_from_checkpoint=None, _n_gpu=1, mp_parameters=)
05/07/2021 19:18:42 - WARNING - datasets.builder -   Using custom data configuration default-b5898a6a80220f13
05/07/2021 19:18:42 - WARNING - datasets.builder -   Reusing dataset csv (/root/.cache/huggingface/datasets/csv/default-b5898a6a80220f13/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0)
[INFO|configuration_utils.py:515] 2021-05-07 19:18:42,390 >> loading configuration file /datadrive/model/checkpoint-800/config.json
[INFO|configuration_utils.py:553] 2021-05-07 19:18:42,390 >> Model config GPTNeoConfig {
  "_name_or_path": "EleutherAI/gpt-neo-2.7B",
  "activation_function": "gelu_new",
  "architectures": [
    "GPTNeoForCausalLM"
  ],
  "attention_dropout": 0,
  "attention_layers": [
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local"
  ],
  "attention_types": [
    [
      [
        "global",
        "local"
      ],
      16
    ]
  ],
  "bos_token_id": 50256,
  "embed_dropout": 0,
  "eos_token_id": 50256,
  "gradient_checkpointing": true,
  "hidden_size": 2560,
  "initializer_range": 0.02,
  "intermediate_size": null,
  "layer_norm_epsilon": 1e-05,
  "max_position_embeddings": 2048,
  "model_type": "gpt_neo",
  "num_heads": 20,
  "num_layers": 32,
  "resid_dropout": 0,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50,
      "temperature": 0.9
    }
  },
  "tokenizer_class": "GPT2Tokenizer",
  "transformers_version": "4.6.0.dev0",
  "use_cache": false,
  "vocab_size": 50257,
  "window_size": 256
}

[INFO|configuration_utils.py:517] 2021-05-07 19:18:42,765 >> loading configuration file https://huggingface.co/gpt2/resolve/main/config.json from cache at /models/transformers/fc674cd6907b4c9e933cb42d67662436b89fa9540a1f40d7c919d0109289ad01.7d2e0efa5ca20cef4fb199382111e9d3ad96fd77b849e1d4bed13a66e1336f51
[INFO|configuration_utils.py:553] 2021-05-07 19:18:42,765 >> Model config GPT2Config {
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "gradient_checkpointing": false,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "resid_pdrop": 0.1,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.6.0.dev0",
  "use_cache": true,
  "vocab_size": 50257
}

[INFO|tokenization_utils_base.py:1717] 2021-05-07 19:18:44,877 >> loading file https://huggingface.co/gpt2/resolve/main/vocab.json from cache at /models/transformers/684fe667923972fb57f6b4dcb61a3c92763ad89882f3da5da9866baf14f2d60f.c7ed1f96aac49e745788faa77ba0a26a392643a50bb388b9c04ff469e555241f
[INFO|tokenization_utils_base.py:1717] 2021-05-07 19:18:44,877 >> loading file https://huggingface.co/gpt2/resolve/main/merges.txt from cache at /models/transformers/c0c761a63004025aeadd530c4c27b860ec4ecbe8a00531233de21d865a402598.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c81157788b
[INFO|tokenization_utils_base.py:1717] 2021-05-07 19:18:44,877 >> loading file https://huggingface.co/gpt2/resolve/main/added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:1717] 2021-05-07 19:18:44,877 >> loading file https://huggingface.co/gpt2/resolve/main/special_tokens_map.json from cache at None
[INFO|tokenization_utils_base.py:1717] 2021-05-07 19:18:44,877 >> loading file https://huggingface.co/gpt2/resolve/main/tokenizer_config.json from cache at None
[INFO|tokenization_utils_base.py:1717] 2021-05-07 19:18:44,877 >> loading file https://huggingface.co/gpt2/resolve/main/tokenizer.json from cache at /models/transformers/16a2f78023c8dc511294f0c97b5e10fde3ef9889ad6d11ffaa2a00714e73926e.cf2d0ecb83b6df91b3dbb53f1d1e4c311578bfd3aa0e04934215a49bf9898df0
[INFO|modeling_utils.py:1147] 2021-05-07 19:18:44,955 >> loading weights file /datadrive/model/checkpoint-800/pytorch_model.bin
[INFO|modeling_utils.py:1328] 2021-05-07 19:18:59,255 >> All model checkpoint weights were used when initializing GPTNeoForCausalLM.

[INFO|modeling_utils.py:1336] 2021-05-07 19:18:59,255 >> All the weights of GPTNeoForCausalLM were initialized from the model checkpoint at /datadrive/model/checkpoint-800/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GPTNeoForCausalLM for predictions without further training.
  0%|                                                     | 0/1 [00:00<?, ?ba/s][WARNING|tokenization_utils_base.py:3170] 2021-05-07 19:19:40,807 >> Token indices sequence length is longer than the specified maximum sequence length for this model (14397149 > 1024). Running this sequence through the model will result in indexing errors
100%|█████████████████████████████████████████████| 1/1 [00:42<00:00, 42.00s/ba]
100%|█████████████████████████████████████████████| 1/1 [00:08<00:00,  8.47s/ba]
[INFO|trainer.py:414] 2021-05-07 19:19:50,812 >> Using amp fp16 backend
[INFO|trainer.py:1042] 2021-05-07 19:19:50,865 >> Loading model from /datadrive/model/checkpoint-800/).
[2021-05-07 19:19:50,867] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.16, git-hash=unknown, git-branch=unknown
[2021-05-07 19:19:50,867] [WARNING] [config.py:79:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer.
[2021-05-07 19:19:54,135] [INFO] [utils.py:11:_initialize_parameter_parallel_groups] data_parallel_size: 1, parameter_parallel_size: 1
Using /root/.cache/torch_extensions as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.1879847049713135 seconds
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.000005, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
[2021-05-07 19:19:58,240] [INFO] [engine.py:610:_configure_optimizer] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2021-05-07 19:19:58,240] [INFO] [engine.py:615:_configure_optimizer] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2021-05-07 19:19:58,240] [INFO] [logging.py:60:log_dist] [Rank 0] Creating fp16 ZeRO stage 2 optimizer
[2021-05-07 19:19:58,240] [INFO] [stage2.py:102:__init__] Reduce bucket size 200000000.0
[2021-05-07 19:19:58,240] [INFO] [stage2.py:103:__init__] Allgather bucket size 200000000.0
[2021-05-07 19:19:58,240] [INFO] [stage2.py:104:__init__] CPU Offload: True
Using /root/.cache/torch_extensions as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 1.4445114135742188 seconds
[2021-05-07 19:21:35,500] [INFO] [stage2.py:381:__init__] optimizer state initialized
[2021-05-07 19:21:35,709] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed Final Optimizer = adamw
[2021-05-07 19:21:35,760] [INFO] [engine.py:439:_configure_lr_scheduler] DeepSpeed using configured LR scheduler = WarmupLR
[2021-05-07 19:21:35,761] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed LR Scheduler = <deepspeed.runtime.lr_schedules.WarmupLR object at 0x7fe9d20fb5b0>
[2021-05-07 19:21:35,769] [INFO] [logging.py:60:log_dist] [Rank 0] step=0, skipped=0, lr=[5e-06], mom=[[0.9, 0.999]]
[2021-05-07 19:21:35,777] [INFO] [config.py:747:print] DeepSpeedEngine configuration:
[2021-05-07 19:21:35,925] [INFO] [config.py:751:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}
[2021-05-07 19:21:35,926] [INFO] [config.py:751:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2021-05-07 19:21:35,926] [INFO] [config.py:751:print]   allreduce_always_fp32 ........ False
[2021-05-07 19:21:35,927] [INFO] [config.py:751:print]   amp_enabled .................. False
[2021-05-07 19:21:35,927] [INFO] [config.py:751:print]   amp_params ................... False
[2021-05-07 19:21:35,927] [INFO] [config.py:751:print]   checkpoint_tag_validation_enabled  True
[2021-05-07 19:21:35,928] [INFO] [config.py:751:print]   checkpoint_tag_validation_fail  False
[2021-05-07 19:21:35,928] [INFO] [config.py:751:print]   disable_allgather ............ False
[2021-05-07 19:21:35,928] [INFO] [config.py:751:print]   dump_state ................... False
[2021-05-07 19:21:35,929] [INFO] [config.py:751:print]   dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1}
[2021-05-07 19:21:35,929] [INFO] [config.py:751:print]   elasticity_enabled ........... False
[2021-05-07 19:21:35,931] [INFO] [config.py:751:print]   flops_profiler_config ........ {
    "enabled": false, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 3, 
    "detailed": true
}
[2021-05-07 19:21:35,931] [INFO] [config.py:751:print]   fp16_enabled ................. True
[2021-05-07 19:21:35,931] [INFO] [config.py:751:print]   global_rank .................. 0
[2021-05-07 19:21:35,931] [INFO] [config.py:751:print]   gradient_accumulation_steps .. 2
[2021-05-07 19:21:35,931] [INFO] [config.py:751:print]   gradient_clipping ............ 1.0
[2021-05-07 19:21:35,931] [INFO] [config.py:751:print]   gradient_predivide_factor .... 1.0
[2021-05-07 19:21:35,931] [INFO] [config.py:751:print]   initial_dynamic_scale ........ 65536
[2021-05-07 19:21:35,931] [INFO] [config.py:751:print]   loss_scale ................... 0
[2021-05-07 19:21:35,931] [INFO] [config.py:751:print]   memory_breakdown ............. False
[2021-05-07 19:21:35,931] [INFO] [config.py:751:print]   optimizer_legacy_fusion ...... False
[2021-05-07 19:21:35,932] [INFO] [config.py:751:print]   optimizer_name ............... adamw
[2021-05-07 19:21:35,932] [INFO] [config.py:751:print]   optimizer_params ............. {'lr': 5e-06, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.0}
[2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   pld_enabled .................. False
[2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   pld_params ................... False
[2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   prescale_gradients ........... False
[2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   scheduler_name ............... WarmupLR
[2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   scheduler_params ............. {'warmup_min_lr': 0, 'warmup_max_lr': 5e-06, 'warmup_num_steps': 0}
[2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   sparse_attention ............. None
[2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   sparse_gradients_enabled ..... False
[2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   steps_per_print .............. 2000
[2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   tensorboard_enabled .......... False
[2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   tensorboard_job_name ......... DeepSpeedJobName
[2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   tensorboard_output_path ...... 
[2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   train_batch_size ............. 8
[2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   train_micro_batch_size_per_gpu  4
[2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   wall_clock_breakdown ......... False
[2021-05-07 19:21:35,934] [INFO] [config.py:751:print]   world_size ................... 1
[2021-05-07 19:21:35,934] [INFO] [config.py:751:print]   zero_allow_untested_optimizer  False
[2021-05-07 19:21:35,938] [INFO] [config.py:751:print]   zero_config .................. {
    "stage": 2, 
    "contiguous_gradients": true, 
    "reduce_scatter": true, 
    "reduce_bucket_size": 2.000000e+08, 
    "allgather_partitions": true, 
    "allgather_bucket_size": 2.000000e+08, 
    "overlap_comm": true, 
    "load_from_fp32_weights": true, 
    "elastic_checkpoint": true, 
    "offload_param": null, 
    "offload_optimizer": {
        "device": "cpu", 
        "nvme_path": null, 
        "buffer_count": 4, 
        "pin_memory": false, 
        "pipeline_read": false, 
        "pipeline_write": false, 
        "fast_init": false
    }, 
    "sub_group_size": 1.000000e+12, 
    "prefetch_bucket_size": 5.000000e+07, 
    "param_persistence_threshold": 1.000000e+05, 
    "max_live_parameters": 1.000000e+09, 
    "max_reuse_distance": 1.000000e+09, 
    "gather_fp16_weights_on_model_save": false, 
    "find_unused_parameters": false
}
[2021-05-07 19:21:35,938] [INFO] [config.py:751:print]   zero_enabled ................. True
[2021-05-07 19:21:35,938] [INFO] [config.py:751:print]   zero_optimization_stage ...... 2
[2021-05-07 19:21:35,942] [INFO] [config.py:753:print]   json = {
    "fp16": {
        "enabled": true, 
        "loss_scale": 0, 
        "loss_scale_window": 1000, 
        "initial_scale_power": 16, 
        "hysteresis": 2, 
        "min_loss_scale": 1
    }, 
    "optimizer": {
        "type": "AdamW", 
        "params": {
            "lr": 5e-06, 
            "betas": [0.9, 0.999], 
            "eps": 1e-08, 
            "weight_decay": 0.0
        }
    }, 
    "scheduler": {
        "type": "WarmupLR", 
        "params": {
            "warmup_min_lr": 0, 
            "warmup_max_lr": 5e-06, 
            "warmup_num_steps": 0
        }
    }, 
    "zero_optimization": {
        "stage": 2, 
        "allgather_partitions": true, 
        "allgather_bucket_size": 2.000000e+08, 
        "overlap_comm": true, 
        "reduce_scatter": true, 
        "reduce_bucket_size": 2.000000e+08, 
        "contiguous_gradients": true, 
        "cpu_offload": true
    }, 
    "gradient_accumulation_steps": 2, 
    "gradient_clipping": 1.0, 
    "steps_per_print": 2.000000e+03, 
    "train_batch_size": 8, 
    "train_micro_batch_size_per_gpu": 4, 
    "wall_clock_breakdown": false
}
Using /root/.cache/torch_extensions as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.09232521057128906 seconds
[INFO|integrations.py:536] 2021-05-07 19:21:36,160 >> Attempting to resume from /datadrive/model/checkpoint-800/
[2021-05-07 19:21:36,175] [INFO] [engine.py:1480:_load_checkpoint] rank: 0 loading checkpoint: /datadrive/model/checkpoint-800/global_step800/mp_rank_00_model_states.pt

New issue with Pandas

I got this error:

Traceback (most recent call last):
File "run_clm.py", line 478, in
main()
File "run_clm.py", line 271, in main
datasets = load_dataset(
File "/root/miniconda3/lib/python3.8/site-packages/datasets/load.py", line 742, in load_dataset
builder_instance.download_and_prepare(
File "/root/miniconda3/lib/python3.8/site-packages/datasets/builder.py", line 574, in download_and_prepare
self._download_and_prepare(
File "/root/miniconda3/lib/python3.8/site-packages/datasets/builder.py", line 652, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/datasets/builder.py", line 1041, in _prepare_split
for key, table in utils.tqdm(generator, unit=" tables", leave=False, disable=not_verbose):
File "/root/miniconda3/lib/python3.8/site-packages/tqdm/std.py", line 1133, in iter
for obj in iterable:
File "/root/miniconda3/lib/python3.8/site-packages/datasets/packaged_modules/csv/csv.py", line 92, in _generate_tables
csv_file_reader = pd.read_csv(
File "/root/miniconda3/lib/python3.8/site-packages/pandas/util/_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 571, in read_csv
kwds_defaults = _refine_defaults_read(
File "/root/miniconda3/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1306, in _refine_defaults_read
raise ValueError("Specified named and prefix; you can only specify one.")
ValueError: Specified named and prefix; you can only specify one.
Downloading and preparing dataset csv/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/csv/default-84d6151a5e4565ed/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0...

Apparently it's a known error with the latest Pandas: pandas-dev/pandas#42387

I solved it by downgrading to Pandas 1.2.5
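If you want to fail fast instead of hitting the cryptic error above, a small guard near the top of run_clm.py could check the installed pandas version. A sketch; the exact 1.3.0 cutoff is an assumption based on the linked pandas issue:

import pandas as pd
from packaging import version

# Some pandas 1.3.x releases reject the names/prefix combination that this version of
# datasets passes to read_csv, so pin pandas below 1.3 (e.g. pip install pandas==1.2.5).
if version.parse(pd.__version__) >= version.parse("1.3.0"):
    raise RuntimeError(
        f"pandas {pd.__version__} may break datasets' CSV loader; install pandas==1.2.5"
    )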

subprocess.CalledProcessError:

I got the following error:
[2022-01-13 14:47:32,154] [INFO] [launch.py:131:sigkill_handler] Killing subprocess 2273
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/gpt2_lm/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/ubuntu/anaconda3/envs/gpt2_lm/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/ubuntu/anaconda3/envs/gpt2_lm/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 167, in <module>
main()
File "/home/ubuntu/anaconda3/envs/gpt2_lm/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 156, in main
sigkill_handler(signal.SIGTERM, None) # not coming back
File "/home/ubuntu/anaconda3/envs/gpt2_lm/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 137, in sigkill_handler
raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/ubuntu/anaconda3/envs/gpt2_lm/bin/python', '-u', 'run_clm.py', '--local_rank=0', '--deepspeed', 'ds_config.json', '--model_name_or_path', 'gpt2-xl', '--train_file', '../../dataset/train.txt', '--validation_file', '../../dataset/test.txt', '--do_train', '--do_eval', '--fp16', '--overwrite_cache', '--evaluation_strategy=steps', '--output_dir', 'finetuned', '--eval_steps', '500', '--num_train_epochs', '1', '--gradient_accumulation_steps', '2', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1']' died with <Signals.SIGKILL: 9>.

Suspected optimizer issue causing crashes

I am running the code on a single box with a TITAN and a 2080 Ti. The trainer is running just on the TITAN. I have a problem where the system locks up the CPU and kills the local network. Very non-performant...

It seems to be related to microsoft/DeepSpeed#679
Changing the optimizer section of the JSON file seems to allow it to run. It's a bit slower, but it does run.

"optimizer": {
    "type": "Adam",
    "params": {
        "torch_adam":true,
        "lr": 0.00001,
        "betas": [
            0.9,
            0.95
        ],
        "eps": 1e-8,
        "weight_decay": 0.1
    }
}

The big thing being setting "torch_adam" to true.
Any ideas for regaining regular performance would be appreciated.
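If you'd rather not hand-edit the file, the same change can be applied programmatically. A sketch, assuming the config file is the ds_config.json used in the launch commands above and keeping the hyperparameters from the snippet in this issue:

import json

# Load the DeepSpeed config used by run_clm.py and swap in torch's Adam implementation.
with open("ds_config.json") as f:
    cfg = json.load(f)

cfg["optimizer"] = {
    "type": "Adam",
    "params": {
        "torch_adam": True,   # use torch.optim Adam instead of DeepSpeed's fused CPU Adam
        "lr": 1e-5,
        "betas": [0.9, 0.95],
        "eps": 1e-8,
        "weight_decay": 0.1,
    },
}

with open("ds_config.json", "w") as f:
    json.dump(cfg, f, indent=4)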

Training on a larger dataset fails due to memory issues on faster GPUs

Thanks so much for producing this repo, it's been really helpful in getting up and running on the biggest GPT-Neo model.

I'm having an issue training gpt-neo_2-7B though - my dataset is just over 200 MB, which leads to an out-of-memory error at the very last step of loading the model into memory before training.

[INFO|integrations.py:533] 2021-04-20 12:40:32,650 >> Attempting to resume from paragraphs/checkpoint-600
[2021-04-20 12:40:32,664] [INFO] [engine.py:1445:_load_checkpoint] rank: 0 loading checkpoint: paragraphs/checkpoint-600/global_step600/mp_rank_00_model_states.pt
Traceback (most recent call last):
File "run_clm.py", line 478, in <module>
main()
File "run_clm.py", line 441, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
[...]
RuntimeError: [enforce fail at CPUAllocator.cpp:65] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 10605230080 bytes. Error code 12 (Cannot allocate memory)

I've tried a number of GPUs on Google Cloud, and I can get it to run on the P100 since I can raise the RAM to 100 GB, but both the V100 and A100 fail (with 78 GB and 85 GB respectively).

Unfortunately Google puts a hard limit on RAM for these GPUs, and increasing the number of GPUs also doubles the number of processes run and so the RAM required - so unless I pay for 2 GPUs and let one sit idle I have to train on the much slower P100.

This is .. ok .. 😅 but I'd love to go faster if I can. So far I've tried:

  • Reducing per_device_train_batch_size to 2
  • Halving the dataset size

but neither has made a difference.

Do you have any other tips on how I might squeeze into the 85GB you get with an A100? It's so tantalizingly close - I wish Google would just let me add more RAM!
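One way to see exactly where host RAM peaks during checkpoint loading is to log resident memory from a small watcher run alongside training. A diagnostic sketch only, not part of the repo; it assumes psutil is installed:

import time
import psutil

# Print total memory use and the combined RSS of python processes once a second,
# so you can see which loading step hits the RAM ceiling.
def watch(interval=1.0):
    while True:
        vm = psutil.virtual_memory()
        pythons = [
            p for p in psutil.process_iter(["name", "memory_info"])
            if p.info["name"] and "python" in p.info["name"] and p.info["memory_info"]
        ]
        rss_gb = sum(p.info["memory_info"].rss for p in pythons) / 1e9
        print(f"used {vm.used / 1e9:.1f} GB / {vm.total / 1e9:.1f} GB, python RSS {rss_gb:.1f} GB")
        time.sleep(interval)

if __name__ == "__main__":
    watch()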
