Comments (17)

guoday avatar guoday commented on May 12, 2024

Do you mean gradient_accumulation_steps? The code already implements it. You can add the option --gradient_accumulation_steps n for incremental training.
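For context, a minimal sketch of what gradient accumulation does (the model, optimizer, and batch sizes here are illustrative, not the actual CodeXGLUE setup):

```python
import torch
import torch.nn as nn

# Illustrative toy model and data, not the CodeXGLUE code.
torch.manual_seed(0)
model = nn.Linear(4, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
loss_fn = nn.MSELoss()
batches = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(4)]

gradient_accumulation_steps = 2  # i.e. --gradient_accumulation_steps 2
updates = 0
for step, (x, y) in enumerate(batches):
    loss = loss_fn(model(x), y)
    # Scale the loss so the accumulated gradient matches one larger batch.
    (loss / gradient_accumulation_steps).backward()
    if (step + 1) % gradient_accumulation_steps == 0:
        optimizer.step()        # one parameter update per accumulated group
        optimizer.zero_grad()   # reset the accumulated gradients
        updates += 1
```

With 4 micro-batches and 2 accumulation steps, the optimizer updates the parameters twice, each time on gradients summed over 2 micro-batches.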

from codexglue.

Manas-Embold avatar Manas-Embold commented on May 12, 2024

Alright, thanks a lot!
I come from a TensorFlow background, so I am unaware of how this is done in PyTorch.
I would be thankful if you could tell me exactly what I need to do.
Say I run training for 2 epochs, save a checkpoint, and want to resume from the saved checkpoint.

guoday avatar guoday commented on May 12, 2024

Change "pretrained_model=microsoft/codebert-base" to "pretrained_model=saved_checkpoint_path"

Manas-Embold avatar Manas-Embold commented on May 12, 2024

Alright.
Thanks a lot!

Manas-Embold avatar Manas-Embold commented on May 12, 2024

One more question, just to be sure.
My calls should look like the following:

Do I need to use --gradient_accumulation_steps somewhere now, or is just --pretrained_model fine?

Call 1 for the first two epochs:
python run.py --do_train --do_eval --model_type roberta --model_name_or_path "microsoft/codebert-base" --train_filename "../dataset/java/valid.jsonl" --dev_filename "../dataset/java/valid.jsonl" --output_dir "model/java" --max_source_length 256 --max_target_length 128 --beam_size 10 --train_batch_size 8 --eval_batch_size 8 --learning_rate 5e-5 --num_train_epochs 2

Call 2 for the next two epochs:

python run.py --do_train --do_eval --model_type roberta --model_name_or_path "saved_checkpoint_path" --train_filename "../dataset/java/valid.jsonl" --dev_filename "../dataset/java/valid.jsonl" --output_dir "model/java" --max_source_length 256 --max_target_length 128 --beam_size 10 --train_batch_size 8 --eval_batch_size 8 --learning_rate 5e-5 --num_train_epochs 2

guoday avatar guoday commented on May 12, 2024

Just --pretrained_model is fine.

Manas-Embold avatar Manas-Embold commented on May 12, 2024

Thanks

guoday avatar guoday commented on May 12, 2024

python run.py --do_train --do_eval --model_type roberta --model_name_or_path "saved_checkpoint_path" --train_filename "../dataset/java/valid.jsonl" --dev_filename "../dataset/java/valid.jsonl" --output_dir "model/java" --max_source_length 256 --max_target_length 128 --beam_size 10 --train_batch_size 8 --eval_batch_size 8 --learning_rate 5e-5 --num_train_epochs 2

guoday avatar guoday commented on May 12, 2024

Sorry, the option should be --load_model_path.

python run.py --do_train --do_eval --model_type roberta --model_name_or_path microsoft/codebert-base --train_filename "../dataset/java/valid.jsonl" --dev_filename "../dataset/java/valid.jsonl" --output_dir "model/java" --max_source_length 256 --max_target_length 128 --beam_size 10 --train_batch_size 8 --eval_batch_size 8 --learning_rate 5e-5 --num_train_epochs 2 --load_model_path $output_dir/checkpoint-best-bleu/pytorch_model.bin

Manas-Embold avatar Manas-Embold commented on May 12, 2024

Alright,
Thanks once again.

Manas-Embold avatar Manas-Embold commented on May 12, 2024

Hi,
Just to test the flow, I started training for 1 epoch, and the model was saved.
python run.py --do_train --do_eval --model_type roberta --model_name_or_path "microsoft/codebert-base" --train_filename "../dataset/javascript/valid.jsonl" --dev_filename "../dataset/javascript/valid.jsonl" --output_dir "model/javascript" --max_source_length 256 --max_target_length 128 --beam_size 10 --train_batch_size 16 --eval_batch_size 16 --learning_rate 5e-5 --num_train_epochs 1

Then I started the training again from the trained model for the next 2 epochs:
python run.py --do_train --do_eval --model_type roberta --model_name_or_path "microsoft/codebert-base" --train_filename "../dataset/javascript/valid.jsonl" --dev_filename "../dataset/javascript/valid.jsonl" --output_dir "model/javascript" --max_source_length 256 --max_target_length 128 --beam_size 10 --train_batch_size 16 --eval_batch_size 16 --learning_rate 5e-5 --num_train_epochs 2 --load_model_path "/content/code/model/javascript/checkpoint-best-bleu/pytorch_model.bin"

Training has started again, but the console says "Epoch 0" again instead of "Epoch 1".
Is it normal for the script to say Epoch 0 again? Is it actually Epoch 1, since I am essentially training incrementally from the last checkpoint?

Log for the first run (epoch 1):
12/03/2020 08:34:08 - INFO - main - Num examples = 3885
12/03/2020 08:34:08 - INFO - main - Batch size = 16
12/03/2020 08:34:08 - INFO - main - Num epoch = 1
epoch 0 loss 6.5622: 100% 243/243 [08:22<00:00, 2.07s/it]
12/03/2020 08:42:34 - INFO - main -
***** Running evaluation *****
12/03/2020 08:42:34 - INFO - main - Num examples = 3885
12/03/2020 08:42:34 - INFO - main - Batch size = 16
12/03/2020 08:45:32 - INFO - main - eval_ppl = 306.69674
12/03/2020 08:45:32 - INFO - main - global_step = 244
12/03/2020 08:45:32 - INFO - main - train_loss = 6.5622
12/03/2020 08:45:32 - INFO - main - ********************
12/03/2020 08:45:34 - INFO - main - Best ppl:306.69674
12/03/2020 08:45:34 - INFO - main - ********************
Total: 1000
12/03/2020 08:53:21 - INFO - main - bleu-4 = 7.58
12/03/2020 08:53:21 - INFO - main - ********************
12/03/2020 08:53:21 - INFO - main - Best bleu:7.58
12/03/2020 08:53:21 - INFO - main - ********************


Log for the second run (epoch 2):

12/03/2020 08:58:29 - INFO - main - ***** Running training *****
12/03/2020 08:58:29 - INFO - main - Num examples = 3885
12/03/2020 08:58:29 - INFO - main - Batch size = 16
12/03/2020 08:58:29 - INFO - main - Num epoch = 2
epoch 0 loss 5.4316: 100% 243/243 [08:22<00:00, 2.07s/it]
12/03/2020 09:06:54 - INFO - main -
***** Running evaluation *****
12/03/2020 09:06:54 - INFO - main - Num examples = 3885
12/03/2020 09:06:54 - INFO - main - Batch size = 16
12/03/2020 09:09:50 - INFO - main - eval_ppl = 117.87884
12/03/2020 09:09:50 - INFO - main - global_step = 244
12/03/2020 09:09:50 - INFO - main - train_loss = 5.4316
12/03/2020 09:09:50 - INFO - main - ********************
12/03/2020 09:09:52 - INFO - main - Best ppl:117.87884
12/03/2020 09:09:52 - INFO - main - ********************

Manas-Embold avatar Manas-Embold commented on May 12, 2024

Since the loss decreased in the second run, shall I assume that it is actually epoch 1 and not epoch 0?
In simple terms, I want to be sure that it is not training from scratch again.
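For what it's worth, if run.py follows the usual PyTorch pattern, the epoch number in the progress bar is just a local loop counter that restarts at 0 on every run, independent of which weights were loaded. A schematic sketch (variable names are illustrative, not run.py's actual code):

```python
# The training loop counts epochs from zero for the current run only; the
# loaded checkpoint weights do not change where the counter starts.
num_train_epochs = 2  # i.e. --num_train_epochs 2
logged_epochs = []
for epoch in range(num_train_epochs):  # always starts at 0
    # In run.py's log this counter is what "epoch 0" refers to.
    logged_epochs.append(epoch)
```

So a log line saying "epoch 0" on the second run does not by itself mean training restarted from scratch; the loaded weights determine that.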

Manas-Embold avatar Manas-Embold commented on May 12, 2024

Note that I am training on valid.jsonl just to quickly test the flow.

guoday avatar guoday commented on May 12, 2024

--load_model_path only re-loads the model weights from the checkpoint; the optimizer state and logs are reset. To implement incremental training properly, we would also need to save the optimizer state and logs.
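A hedged sketch of what such a fuller checkpoint could look like (the file name and dictionary keys here are illustrative, not what run.py actually writes):

```python
import os
import tempfile

import torch
import torch.nn as nn

# Illustrative model/optimizer standing in for the real ones.
model = nn.Linear(4, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Save the weights *and* the optimizer state plus the epoch counter,
# rather than only pytorch_model.bin.
path = os.path.join(tempfile.mkdtemp(), "checkpoint-last.pt")
torch.save({
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    "epoch": 1,
}, path)

# On restart: restore everything, not just the weights.
ckpt = torch.load(path)
model.load_state_dict(ckpt["model_state_dict"])
optimizer.load_state_dict(ckpt["optimizer_state_dict"])
start_epoch = ckpt["epoch"] + 1  # continue counting where we left off
```

Restoring the optimizer state matters especially for Adam-family optimizers, whose per-parameter moment estimates would otherwise be re-initialized to zero.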

Manas-Embold avatar Manas-Embold commented on May 12, 2024

Alright.
Resetting the logger is fine, but not the optimizer, right?

guoday avatar guoday commented on May 12, 2024

Replace run.py with run.txt. Then you just need to re-run the following command, and the program will restore the last checkpoint for incremental training.

lang=ruby #programming language
lr=5e-5
batch_size=32
beam_size=10
source_length=256
target_length=128
data_dir=../dataset
output_dir=model/$lang
train_file=$data_dir/$lang/train.jsonl
dev_file=$data_dir/$lang/valid.jsonl
epochs=10 
pretrained_model=microsoft/codebert-base #Roberta: roberta-base

python run.py --do_train --do_eval --model_type roberta --model_name_or_path $pretrained_model --train_filename $train_file --dev_filename $dev_file --output_dir $output_dir --max_source_length $source_length --max_target_length $target_length --beam_size $beam_size --train_batch_size $batch_size --eval_batch_size $batch_size --learning_rate $lr --num_train_epochs $epochs

Manas-Embold avatar Manas-Embold commented on May 12, 2024

Many thanks for the prompt response!
