Comments (17)

guoday avatar guoday commented on May 12, 2024

Do you mean gradient_accumulation_steps? The code already implements it. You can add the option --gradient_accumulation_steps n for incremental training.
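For context, a minimal sketch of what gradient accumulation does (the model, optimizer, and batch sizes here are illustrative, not the actual CodeXGLUE setup):

```python
import torch
import torch.nn as nn

# Illustrative toy model and data, not the CodeXGLUE code.
torch.manual_seed(0)
model = nn.Linear(4, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
loss_fn = nn.MSELoss()
batches = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(4)]

gradient_accumulation_steps = 2  # i.e. --gradient_accumulation_steps 2
updates = 0
for step, (x, y) in enumerate(batches):
    loss = loss_fn(model(x), y)
    # Scale the loss so the accumulated gradient matches one larger batch.
    (loss / gradient_accumulation_steps).backward()
    if (step + 1) % gradient_accumulation_steps == 0:
        optimizer.step()        # one parameter update per accumulated group
        optimizer.zero_grad()   # reset the accumulated gradients
        updates += 1
```

With 4 micro-batches and 2 accumulation steps, the optimizer updates the parameters twice, each time on gradients summed over 2 micro-batches.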

from codexglue.

Manas-Embold avatar Manas-Embold commented on May 12, 2024

Alright, thanks a lot!
I come from a TensorFlow background, so I am unaware of how this is done in PyTorch.
I would be thankful if you could tell me exactly what I need to do.
Say I run training for 2 epochs, save a checkpoint, and want to resume from the saved checkpoint.

guoday avatar guoday commented on May 12, 2024

Change "pretrained_model=microsoft/codebert-base" to "pretrained_model=saved_checkpoint_path"

Manas-Embold avatar Manas-Embold commented on May 12, 2024

Alright.
Thanks a lot!

Manas-Embold avatar Manas-Embold commented on May 12, 2024

One more question, just to be sure.
My calls should look like the following:

Do I need to use --gradient_accumulation_steps somewhere now, or is just --pretrained_model fine?

Call 1 for the first two epochs:
python run.py --do_train --do_eval --model_type roberta --model_name_or_path "microsoft/codebert-base" --train_filename "../dataset/java/valid.jsonl" --dev_filename "../dataset/java/valid.jsonl" --output_dir "model/java" --max_source_length 256 --max_target_length 128 --beam_size 10 --train_batch_size 8 --eval_batch_size 8 --learning_rate 5e-5 --num_train_epochs 2

Call 2 for the next two epochs:

python run.py --do_train --do_eval --model_type roberta --model_name_or_path "saved_checkpoint_path" --train_filename "../dataset/java/valid.jsonl" --dev_filename "../dataset/java/valid.jsonl" --output_dir "model/java" --max_source_length 256 --max_target_length 128 --beam_size 10 --train_batch_size 8 --eval_batch_size 8 --learning_rate 5e-5 --num_train_epochs 2

guoday avatar guoday commented on May 12, 2024

Just --pretrained_model is fine.

Manas-Embold avatar Manas-Embold commented on May 12, 2024

Thanks

guoday avatar guoday commented on May 12, 2024

python run.py --do_train --do_eval --model_type roberta --model_name_or_path "saved_checkpoint_path" --train_filename "../dataset/java/valid.jsonl" --dev_filename "../dataset/java/valid.jsonl" --output_dir "model/java" --max_source_length 256 --max_target_length 128 --beam_size 10 --train_batch_size 8 --eval_batch_size 8 --learning_rate 5e-5 --num_train_epochs 2

guoday avatar guoday commented on May 12, 2024

Sorry, the option should be --load_model_path.

python run.py --do_train --do_eval --model_type roberta --model_name_or_path microsoft/codebert-base --train_filename "../dataset/java/valid.jsonl" --dev_filename "../dataset/java/valid.jsonl" --output_dir "model/java" --max_source_length 256 --max_target_length 128 --beam_size 10 --train_batch_size 8 --eval_batch_size 8 --learning_rate 5e-5 --num_train_epochs 2 --load_model_path $output_dir/checkpoint-best-bleu/pytorch_model.bin

Manas-Embold avatar Manas-Embold commented on May 12, 2024

Alright,
Thanks once again.

Manas-Embold avatar Manas-Embold commented on May 12, 2024

Hi,
Just to test the flow, I started training for 1 epoch, and the model was saved.
python run.py --do_train --do_eval --model_type roberta --model_name_or_path "microsoft/codebert-base" --train_filename "../dataset/javascript/valid.jsonl" --dev_filename "../dataset/javascript/valid.jsonl" --output_dir "model/javascript" --max_source_length 256 --max_target_length 128 --beam_size 10 --train_batch_size 16 --eval_batch_size 16 --learning_rate 5e-5 --num_train_epochs 1

Then I started the training again from the trained model for the next 2 epochs:
python run.py --do_train --do_eval --model_type roberta --model_name_or_path "microsoft/codebert-base" --train_filename "../dataset/javascript/valid.jsonl" --dev_filename "../dataset/javascript/valid.jsonl" --output_dir "model/javascript" --max_source_length 256 --max_target_length 128 --beam_size 10 --train_batch_size 16 --eval_batch_size 16 --learning_rate 5e-5 --num_train_epochs 2 --load_model_path "/content/code/model/javascript/checkpoint-best-bleu/pytorch_model.bin"

Training has started again, but the console says "Epoch 0" again instead of "Epoch 1".
Is it normal for the script to say Epoch 0 again? Is it actually Epoch 1, since I am essentially training incrementally from the last checkpoint?

Log for the first run (epoch 1):
12/03/2020 08:34:08 - INFO - main - Num examples = 3885
12/03/2020 08:34:08 - INFO - main - Batch size = 16
12/03/2020 08:34:08 - INFO - main - Num epoch = 1
epoch 0 loss 6.5622: 100% 243/243 [08:22<00:00, 2.07s/it]
12/03/2020 08:42:34 - INFO - main -
***** Running evaluation *****
12/03/2020 08:42:34 - INFO - main - Num examples = 3885
12/03/2020 08:42:34 - INFO - main - Batch size = 16
12/03/2020 08:45:32 - INFO - main - eval_ppl = 306.69674
12/03/2020 08:45:32 - INFO - main - global_step = 244
12/03/2020 08:45:32 - INFO - main - train_loss = 6.5622
12/03/2020 08:45:32 - INFO - main - ********************
12/03/2020 08:45:34 - INFO - main - Best ppl:306.69674
12/03/2020 08:45:34 - INFO - main - ********************
Total: 1000
12/03/2020 08:53:21 - INFO - main - bleu-4 = 7.58
12/03/2020 08:53:21 - INFO - main - ********************
12/03/2020 08:53:21 - INFO - main - Best bleu:7.58
12/03/2020 08:53:21 - INFO - main - ********************


Log for the second run (epoch 2):

12/03/2020 08:58:29 - INFO - main - ***** Running training *****
12/03/2020 08:58:29 - INFO - main - Num examples = 3885
12/03/2020 08:58:29 - INFO - main - Batch size = 16
12/03/2020 08:58:29 - INFO - main - Num epoch = 2
epoch 0 loss 5.4316: 100% 243/243 [08:22<00:00, 2.07s/it]
12/03/2020 09:06:54 - INFO - main -
***** Running evaluation *****
12/03/2020 09:06:54 - INFO - main - Num examples = 3885
12/03/2020 09:06:54 - INFO - main - Batch size = 16
12/03/2020 09:09:50 - INFO - main - eval_ppl = 117.87884
12/03/2020 09:09:50 - INFO - main - global_step = 244
12/03/2020 09:09:50 - INFO - main - train_loss = 5.4316
12/03/2020 09:09:50 - INFO - main - ********************
12/03/2020 09:09:52 - INFO - main - Best ppl:117.87884
12/03/2020 09:09:52 - INFO - main - ********************

Manas-Embold avatar Manas-Embold commented on May 12, 2024

Since the loss decreased in the second run, shall I assume that it is actually epoch 1 and not epoch 0?
In simple terms, I want to be sure that it is not training from scratch again.
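For what it's worth, if run.py follows the usual PyTorch pattern, the epoch number in the progress bar is just a local loop counter that restarts at 0 on every run, independent of which weights were loaded. A schematic sketch (variable names are illustrative, not run.py's actual code):

```python
# The training loop counts epochs from zero for the current run only; the
# loaded checkpoint weights do not change where the counter starts.
num_train_epochs = 2  # i.e. --num_train_epochs 2
logged_epochs = []
for epoch in range(num_train_epochs):  # always starts at 0
    # In run.py's log this counter is what "epoch 0" refers to.
    logged_epochs.append(epoch)
```

So a log line saying "epoch 0" on the second run does not by itself mean training restarted from scratch; the loaded weights determine that.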

Manas-Embold avatar Manas-Embold commented on May 12, 2024

Note that I am training on valid.jsonl just to quickly test the flow.

guoday avatar guoday commented on May 12, 2024

--load_model_path only re-loads the model weights from the checkpoint; the optimizer state and logs are reset. To implement incremental training properly, we would also need to save the optimizer state and logs.
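A hedged sketch of what such a fuller checkpoint could look like (the file name and dictionary keys here are illustrative, not what run.py actually writes):

```python
import os
import tempfile

import torch
import torch.nn as nn

# Illustrative model/optimizer standing in for the real ones.
model = nn.Linear(4, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Save the weights *and* the optimizer state plus the epoch counter,
# rather than only pytorch_model.bin.
path = os.path.join(tempfile.mkdtemp(), "checkpoint-last.pt")
torch.save({
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    "epoch": 1,
}, path)

# On restart: restore everything, not just the weights.
ckpt = torch.load(path)
model.load_state_dict(ckpt["model_state_dict"])
optimizer.load_state_dict(ckpt["optimizer_state_dict"])
start_epoch = ckpt["epoch"] + 1  # continue counting where we left off
```

Restoring the optimizer state matters especially for Adam-family optimizers, whose per-parameter moment estimates would otherwise be re-initialized to zero.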

Manas-Embold avatar Manas-Embold commented on May 12, 2024

Alright.
Resetting the logger is fine, but not the optimizer, right?

guoday avatar guoday commented on May 12, 2024

Replace run.py with run.txt. Then you just need to re-run the following command, and the program will restore the last checkpoint for incremental training.

lang=ruby #programming language
lr=5e-5
batch_size=32
beam_size=10
source_length=256
target_length=128
data_dir=../dataset
output_dir=model/$lang
train_file=$data_dir/$lang/train.jsonl
dev_file=$data_dir/$lang/valid.jsonl
epochs=10 
pretrained_model=microsoft/codebert-base #Roberta: roberta-base

python run.py --do_train --do_eval --model_type roberta --model_name_or_path $pretrained_model --train_filename $train_file --dev_filename $dev_file --output_dir $output_dir --max_source_length $source_length --max_target_length $target_length --beam_size $beam_size --train_batch_size $batch_size --eval_batch_size $batch_size --learning_rate $lr --num_train_epochs $epochs

Manas-Embold avatar Manas-Embold commented on May 12, 2024

Many thanks for the prompt response!
