microsoft / onnxruntime-training-examples

Examples for using ONNX Runtime for model training.

License: MIT License

Python 32.79% Dockerfile 0.34% Kotlin 6.85% CMake 0.19% C++ 1.25% Jupyter Notebook 9.69% C# 43.52% Swift 2.56% Ruby 0.04% CSS 0.49% Batchfile 0.01% HTML 0.11% TypeScript 1.99% JavaScript 0.16%

onnxruntime-training-examples's Introduction

New to ONNX?

What is ONNX Runtime for PyTorch?

ONNX Runtime for PyTorch gives you the ability to accelerate training of large transformer PyTorch models. Training time and cost are reduced with just a one-line code change.

  • One-line code change: ORT provides a one-line addition for existing PyTorch training scripts, allowing easier experimentation and greater agility (a fuller sketch follows this list):
    from torch_ort import ORTModule
    model = ORTModule(model)
  • Flexible and extensible hardware support: The same model and API works with NVIDIA and AMD GPUs; the extensible "execution provider" architecture allows you to plug in custom operators, optimizers, and hardware accelerators.

  • Faster training: Optimized kernels provide up to a 1.4X speedup in training time.

  • Larger models: Memory optimizations allow fitting larger models such as GPT-2 on a 16 GB GPU, which runs out of memory with stock PyTorch.

  • Composable with other acceleration libraries such as DeepSpeed, FairScale, and Megatron for even faster and more efficient training.

  • Part of the PyTorch ecosystem. It is available via the torch-ort Python package.

  • Built on top of the highly successful and proven technologies of ONNX Runtime and the ONNX format.
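To make the one-line change concrete, here is a minimal end-to-end sketch. Everything except the ORTModule wrap is ordinary PyTorch; the model and data are toy stand-ins, and torch-ort is assumed to be installed:

```python
import torch
from torch_ort import ORTModule

# Toy stand-in for a real transformer model.
model = torch.nn.Sequential(
    torch.nn.Linear(16, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 2),
)
model = ORTModule(model)  # the one-line change: ORT handles forward/backward

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

for step in range(10):
    x = torch.randn(8, 16)         # toy batch of inputs
    y = torch.randint(0, 2, (8,))  # toy labels
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
```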

ONNX Runtime Training Examples

This repo has examples for using ONNX Runtime (ORT) to accelerate training of Transformer models. These examples focus on large-scale model training and on achieving the best performance in the Azure Machine Learning service. ONNX Runtime can train existing PyTorch models (implemented using torch.nn.Module) through its optimized backend. The examples in this repo demonstrate how ORTModule can be used to switch the training backend.

Examples

The table below outlines the examples in the repository.

| Example                | Performance Comparison | Model Change             |
|------------------------|------------------------|--------------------------|
| HuggingFace BART       | See BART               | No model change required |
| HuggingFace BERT       | See BERT               | No model change required |
| HuggingFace DeBERTa    | See DeBERTa            | See this commit          |
| HuggingFace DistilBERT | See DistilBERT         | No model change required |
| HuggingFace GPT2       | See GPT2               | No model change required |
| HuggingFace RoBERTa    | See RoBERTa            | See this commit          |
| t5-large               | See T5                 | See this PR              |

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

onnxruntime-training-examples's People

Contributors

adamlouly, ajindal1, ashbhandare, baijumeswani, carzh, dependabot[bot], edgchen1, galran, gravityrail, harshithapv, jamiemagee, jingyanwangms, kshama-msft, manashgoswami, microsoft-github-operations[bot], microsoftopensource, natke, parinitarahi, prathikr, raviskolli, roksanah, savitamittal1, shalvamist, sherlocknomad, skaarthik, suffiank, thiagocrepaldi, vraspar, xadupre, ytaous


onnxruntime-training-examples's Issues

Would you please offer the base image Dockerfile?

What's the problem?
My environment is CUDA 10.1. I need a way to rebuild the base image "mcr.microsoft.com/azureml/onnxruntime-training:0.1-rc2-openmpi4.0-cuda10.2-cudnn7.6-nccl2.7.6" as "mcr.microsoft.com/azureml/onnxruntime-training:0.1-rc2-openmpi4.0-cuda10.1-cudnn7.6-nccl2.7.6" so it can work on my machine.

How to fix?
The project maintainers may need to provide the base Dockerfile.

[Help] [iOS example] Failed to launch the app w/ `Type Error` of `output arg`

Following #161, I am running the iOS example according to its README.

I opened MyVoice.xcworkspace in Xcode, made some slight modifications to make it compile, and ran it.

I am getting Type Error: Type (tensor(bool)) of output arg (onnx::adamw.updated_flag::5) of node (onnx::AdamWOptimizer::6) does not match expected type (tensor(int64)) and a blank screen.

Full relevant error (the console log at 13:14:29 repeats the same message verbatim, so it is shown once here):

```
MyVoice/TrainView.swift:25: Fatal error: 'try!' expression unexpectedly raised an error: Error Domain=onnxruntime Code=1 "/Users/runner/work/1/s/orttraining/orttraining/training_api/optimizer.cc:239 void onnxruntime::training::api::Optimizer::Initialize(const std::string &, const std::vector<std::shared_ptr<IExecutionProvider>> &, gsl::span<OrtCustomOpDomain *const>) [ONNXRuntimeError] : 1 : FAIL : Load model from /Users/sichanghe/Library/Developer/CoreSimulator/Devices/CB79BDAB-7D81-43FC-8440-FE4B95955A60/data/Containers/Bundle/Application/02A3121A-4522-497A-8677-0BD7A416F48D/MyVoice.app/optimizer_model.onnx failed:Type Error: Type (tensor(bool)) of output arg (onnx::adamw.updated_flag::5) of node (onnx::AdamWOptimizer::6) does not match expected type (tensor(int64)).
" UserInfo={NSLocalizedDescription=/Users/runner/work/1/s/orttraining/orttraining/training_api/optimizer.cc:239 void onnxruntime::training::api::Optimizer::Initialize(const std::string &, const std::vector<std::shared_ptr<IExecutionProvider>> &, gsl::span<OrtCustomOpDomain *const>) [ONNXRuntimeError] : 1 : FAIL : Load model from /Users/sichanghe/Library/Developer/CoreSimulator/Devices/CB79BDAB-7D81-43FC-8440-FE4B95955A60/data/Containers/Bundle/Application/02A3121A-4522-497A-8677-0BD7A416F48D/MyVoice.app/optimizer_model.onnx failed:Type Error: Type (tensor(bool)) of output arg (onnx::adamw.updated_flag::5) of node (onnx::AdamWOptimizer::6) does not match expected type (tensor(int64)).
}
```

[screenshot omitted]

@baijumeswani, does this have anything to do with the change in onnxruntime-training-cpu?
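A plausible direction, offered as an assumption rather than a confirmed fix: the bool-vs-int64 mismatch on the AdamWOptimizer output suggests the training artifacts were generated with a different onnxruntime-training version than the one the app links, so regenerating them with a matching Python package may resolve it. A sketch with placeholder paths and parameter names:

```python
# Hedged sketch: regenerate the training artifacts so the optimizer graph
# matches the runtime's AdamWOptimizer schema. Paths/names are placeholders.
import onnx
from onnxruntime.training import artifacts

model = onnx.load("base_model.onnx")  # hypothetical path
artifacts.generate_artifacts(
    model,
    requires_grad=["layer.weight"],  # placeholder parameter names
    loss=artifacts.LossType.CrossEntropyLoss,
    optimizer=artifacts.OptimType.AdamW,
    artifact_directory="artifacts",
)
```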

Dockerfile_clm is broken for gpt2

When running, it errors out with:

```
cannot import name 'send_example_telemetry' from 'transformers.utils
```

The current Dockerfile pins:

```
transformers==4.16.0 \
```

Workaround:

```bash
pip3 install git+https://github.com/huggingface/transformers
```

Another error when ORT is enabled:

```
Traceback (most recent call last):
File "run_clm.py", line 52, in
from optimum.onnxruntime import ORTTrainer, ORTTrainingArguments
ImportError: cannot import name 'ORTTrainingArguments' from 'optimum.onnxruntime' (/opt/conda/envs/ptca/lib/python3.8/site-packages/optimum/onnxruntime/init.py)
```

Workaround:

```bash
pip3 install git+https://github.com/huggingface/optimum.git
```

Last data_file seems to be skipped in the training loop (nvidia-bert)

In the nvidia-bert model, the loop starting from this line seems to skip the last data file: in the last step of the loop, the dataloader from the last-but-one data file is fed to training.

When the loop ends, the last dataloader is generated but not used. The last dataloader is used only when there is a single data file (link).

Thanks in advance for any explanation to correct my possible misunderstanding.
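To make the reported pattern concrete, here is a minimal runnable sketch with hypothetical helper names (not the repo's actual code) of a prefetch-style loop that builds the final dataloader but never trains on it:

```python
# Sketch of the suspected off-by-one: the loader for the last file is
# created after the final iteration and never consumed.
def make_dataloader(path):
    # stub standing in for building a DataLoader from an HDF5 shard
    return f"loader({path})"

def train_one_file(loader):
    print("training on", loader)

files = ["shard0.hdf5", "shard1.hdf5", "shard2.hdf5"]
loader = make_dataloader(files[0])
for i in range(1, len(files)):
    train_one_file(loader)              # consumes the loader for files[i-1]
    loader = make_dataloader(files[i])  # prefetch the next file
# Loop exits here: the loader for files[-1] exists but is never trained on.
```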

[QnA Fine-tuning] Nebula gets write shared memory failed

Repro steps.
As shown in readme.md at QnA Fine-tuning...

  1. Build a custom environment with "AzureML-ACPT-pytorch-1.13-py38-cuda11.7-gpu", adding "accelerate", "datasets", and "transformers"
  2. Run a job using the "Submit a training job" functionality of the AML Studio UI (the reason for not using "aml_submit.py" was an authentication error with a Microsoft non-production subscription).
    2-1. Select "Run a custom training script"
    2-2. Upload "finetune.py" and "ds_config_zero_1.json"
    2-3. Set the command "torchrun --nproc_per_node=4 finetune.py --deepspeed --ort --model_name distilbert-base-uncased"
    2-4. Select a compute cluster whose VM size is "Standard_ND40rs_v2" as shown in readme.md, with instance count 1
    2-5. Select the custom environment built in step 1

error_2_nodes.txt

fatal error

TypeError: export() got an unexpected keyword argument 'example_outputs'

[screenshot omitted]

Your own onnxruntime.training and torch.onnx simply don't match!
And torchtext.legacy and torchtext.data.Field have been wholly abandoned!

How dare you put out code with so many bugs?!
You are Microsoft!

```
/storage/ypd-19-7/anaconda3/envs/ONNX/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_validation.py:112: UserWarning: onnxruntime training package info: package_name: onnxruntime-training
warnings.warn('onnxruntime training package info: package_name: %s' % package_name)
/storage/ypd-19-7/anaconda3/envs/ONNX/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_validation.py:113: UserWarning: onnxruntime training package info: version: 1.11.1+cu111
warnings.warn('onnxruntime training package info: version: %s' % version)
/storage/ypd-19-7/anaconda3/envs/ONNX/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_validation.py:114: UserWarning: onnxruntime training package info: cuda_version: 11.1
warnings.warn('onnxruntime training package info: cuda_version: %s' % cuda_version)
/storage/ypd-19-7/anaconda3/envs/ONNX/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_validation.py:115: UserWarning: onnxruntime build info: cudart_version: 11010
warnings.warn('onnxruntime build info: cudart_version: %s' % cudart_version)
/storage/ypd-19-7/anaconda3/envs/ONNX/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_validation.py:122: UserWarning: WARNING: failed to find cudart version that matches onnxruntime build info
warnings.warn('WARNING: failed to find cudart version that matches onnxruntime build info')
/storage/ypd-19-7/anaconda3/envs/ONNX/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_validation.py:123: UserWarning: WARNING: found cudart versions: [10020]
warnings.warn('WARNING: found cudart versions: %s' % local_cudart_versions)
Traceback (most recent call last):
File "/storage/ypd-19-7/onnxruntime-training-examples/orttrainer/getting-started/train_ort.py", line 111, in
train()
File "/storage/ypd-19-7/onnxruntime-training-examples/orttrainer/getting-started/train_ort.py", line 68, in train
loss, output = trainer.train_step(data, targets)
File "/storage/ypd-19-7/anaconda3/envs/ONNX/lib/python3.9/site-packages/onnxruntime/training/orttrainer.py", line 348, in train_step
self._init_onnx_model(sample_input)
File "/storage/ypd-19-7/anaconda3/envs/ONNX/lib/python3.9/site-packages/onnxruntime/training/orttrainer.py", line 726, in _init_onnx_model
self._onnx_model = self._convert_torch_model_loss_fn_to_onnx(inputs, 'cpu')
File "/storage/ypd-19-7/anaconda3/envs/ONNX/lib/python3.9/site-packages/onnxruntime/training/orttrainer.py", line 543, in _convert_torch_model_loss_fn_to_onnx
torch.onnx.export(model, tuple(sample_inputs_copy), f,
TypeError: export() got an unexpected keyword argument 'example_outputs'
```
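A note on the likely cause, stated as an assumption rather than a confirmed diagnosis: newer PyTorch releases dropped the example_outputs keyword from torch.onnx.export, while the legacy ORTTrainer path shown in the traceback still passes it, so the installed torch and onnxruntime-training versions have to match. A small probe sketch that fails fast before training:

```python
# Probe sketch: check whether this torch still accepts `example_outputs`,
# which the legacy ORTTrainer export path relies on (assumption: the
# keyword was removed in newer torch releases).
import inspect
import torch

sig = inspect.signature(torch.onnx.export)
if "example_outputs" not in sig.parameters:
    raise RuntimeError(
        f"torch {torch.__version__}: torch.onnx.export no longer accepts "
        "'example_outputs'; use an older torch with ORTTrainer, or move to "
        "the ORTModule API."
    )
```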

Need some CNN training examples

Hello,
Part of my job is to deploy ONNX models, and I found this excellent project, which may make my work better. But I'm not an NLP practitioner, so the examples are hard for me to understand.
Maybe there need to be some CNN training examples so more of us can join the project.

DragGAN training_model.onnx generation failure on CPU

Hi there,

I am trying to run the DragGAN example in this repo on an x64 Intel i7 CPU.

When I ran the draggan_demo.py script, I hit the following error while the script was generating the training_model.onnx artifact.

```
RuntimeError: C:\a\_work\1\s\orttraining\orttraining\core\graph\gradient_builder_registry.cc:30 
onnxruntime::training::GetGradientForOp gradient_builder != nullptr was false. 
The gradient builder has not been registered: Reciprocal for node /generator/block_list.6/conv1/Reciprocal
```

Does this mean this gradient operation is not supported on the CPU? Can the DragGAN win32 app then only run with a GPU?
Thanks a lot for your help in advance.
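One hedged, untested idea (not from the maintainers): the error concerns the Reciprocal op specifically, and since Div does have a registered gradient builder, rewriting 1/x as Div(1, x) in the ONNX graph before generating training artifacts might sidestep it. A sketch with placeholder paths:

```python
# Untested workaround sketch: rewrite every Reciprocal(x) as Div(1, x)
# so gradient generation sees an op with a registered gradient builder.
import numpy as np
import onnx
from onnx import numpy_helper

model = onnx.load("model.onnx")  # placeholder path
one = numpy_helper.from_array(np.array(1.0, dtype=np.float32), name="const_one")
model.graph.initializer.append(one)  # assumes the name 'const_one' is unused

for node in model.graph.node:
    if node.op_type == "Reciprocal":
        node.op_type = "Div"               # Div computes input[0] / input[1]
        node.input.insert(0, "const_one")  # numerator 1.0, denominator x

onnx.save(model, "model_reciprocal_as_div.onnx")
```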

Benchmark in training time vs hugging face training time?

Hello,

I am interested in applying any optimization to pre-training, but I don't know the trade-off of modifying my code to add onnxruntime versus normal training time.

Do you have any benchmarks comparing normal PyTorch training with onnxruntime training?

Thanks!

Edit: a benchmark against DeepSpeed would also be amazing.

iOS on-device training example Python dependency installation clarification

The instructions don't quite work as-is on macOS.

In particular,

### Install Python dependencies
From this directory, run:
```bash
pip install -r requirements.txt
```

doesn't quite work with this requirement line:

See these installation instructions from onnxruntime.ai:

[screenshot omitted]

Training a BERT model is failing on android mobile device

I was trying to train a pretrained BERT model, but it is failing because the node below is not implemented.

```
RuntimeError: C:\a_work\1\s\orttraining\orttraining\training_api\module.cc:175 onnxruntime::training::api::Module::Module [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(19) node with name 'Reshape__265'
```

Are we planning to include this? Please let me know the schedule.

[Feature Request] Decentralized Distributed Training on Edge Cluster

Dear ONNX Community,

I'm a CS PhD student at UC Irvine. Currently, our team is working on decentralized distributed training on an edge cluster.

We have checked ONNX's on-device training, which is very helpful for our project.

To enable decentralized distributed training, we chose the model-parallel paradigm, which shards a model into multiple sub-modules and assigns each sub-module to an edge device in the cluster. During training, the devices have p2p connections with each other: device n+1 receives device n's output as input during the forward pass, and device n receives device n+1's gradient during the backward pass.

To achieve this, we need to allow the sub-module on each device to pass its forward output and backward gradient from one device to another.

I know that onnxruntime-training currently integrates forward and backward into one method, TrainStep. We are very new to ONNX, and wish to know whether there is a way to separate the TrainStep method so we can extract the model's forward output and the gradient for the backward pass using onnxruntime-training.

Is there a tutorial or a page in the GitHub repo that shows how to do this?

I have also posted a related issue on onnxruntime; please refer to microsoft/onnxruntime#16232

Please let me know. Thanks a lot!

@baijumeswani @pengwa

Script doesn't run

I cloned this repo and initialized all submodules. When running the example command for BERT, I got the following error.

```
id@dev:~/onnxruntime-training-examples/huggingface/script$ python hf-ort.py --hf_model bert-large --run_config ort --process_count 4 --local_run
Failure while loading azureml_run_type_providers. Failed to load entrypoint azureml.scriptrun = azureml.core.script_run:ScriptRun._from_run_dto with exception (pyOpenSSL 21.0.0 (python3.7/site-packages), Requirement.parse('pyopenssl<21.0.0')).
SDK version: 1.29.0
The arguments are: ['hf-ort.py', '--hf_model', 'bert-large', '--run_config', 'ort', '--process_count', '4', '--local_run']
Running model: bert-large, config: ort locally
Traceback (most recent call last):
File "hf-ort.py", line 157, in
shutil.copy(model_run_script_path, '.')
File "lib/python3.7/shutil.py", line 248, in copy
copyfile(src, dst, follow_symlinks=follow_symlinks)
File "lib/python3.7/shutil.py", line 120, in copyfile
with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: '../../huggingface-transformers/examples/pytorch/language-modeling/run_mlm.py'
```

After changing

TRAINER_DIR = '../../huggingface-transformers/examples/pytorch'

to

TRAINER_DIR = '../../huggingface-transformers/examples/'

the script starts running.

nvidia-bert not working using onnxruntime 1.8.0

It seems that the latest onnxruntime 1.8.0 has problems running nvidia-bert. I encountered the following warnings and errors when training with 8 V100s.

```
2021-06-07 06:26:24.957783707 [W:onnxruntime:, execution_frame.cc:721 VerifyOutputSizes] Expected shape from model of {} does not match actual shape of {1} for output Group_Accumulated_Gradients

... (left out)

Traceback (most recent call last):
  File "/home/tianzheng/projects/onnxruntime-training-examples/workspace/BERT/run_pretraining_ort.py", line 552, in <module>
    args, final_loss, train_time_raw = main()
  File "/home/tianzheng/projects/onnxruntime-training-examples/workspace/BERT/run_pretraining_ort.py", line 481, in main
    loss, global_step = ort_supplement.run_ort_training_step(args, global_step, training_steps, model, batch)
  File "/home/tianzheng/projects/onnxruntime-training-examples/workspace/BERT/ort_supplement/ort_supplement.py", line 124, in run_ort_training_step
    loss = trainer.train_step(*batch)
  File "/opt/conda/lib/python3.7/site-packages/onnxruntime/training/orttrainer.py", line 402, in train_step
    outputs_desc, run_options)
  File "/opt/conda/lib/python3.7/site-packages/onnxruntime/training/orttrainer.py", line 901, in _training_session_run_helper
    self._training_session.run_with_iobinding(iobinding, run_options)
  File "/opt/conda/lib/python3.7/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 229, in run_with_iobinding
    self._sess.run_with_iobinding(iobinding._iobinding, run_options)
RuntimeError: Error in execution: Contiguous memory checking failed on node NcclAllReduce: input #1 address is 0x7f0926454000 and #bytes = 2048, input #2 address is 0x7f06e4200000
```

incorrect input for ReluGrad layer

Hi, I have a simple linear model with 3 FC layers and 3 ReLU layers (original ONNX attached), and the corresponding training model generated by onnxruntime (training_model ONNX attached).

As highlighted in the red circle, instead of the output of the last ReLU layer, the output of the last Gemm layer goes into ReluGrad.

The behavior of the other two ReLU/grad pairs seems OK, i.e. the ReLU outputs become inputs of ReluGrad, but the last layer looks buggy.

Code for generating training artifacts:

```python
import onnx
from onnxruntime.training import artifacts

model = onnx.load("orig.onnx")
artifacts.generate_artifacts(
    model,
    requires_grad=["fc1.weight", "fc2.weight", "fc3.weight"],
    loss=artifacts.LossType.CrossEntropyLoss,
    optimizer=artifacts.OptimType.AdamW,
    artifact_directory="./",
)
```

Please advise how to fix this.

Can't download the corpus under the "nvidia-bert" example following the Readme

I followed the Readme under the "nvidia-bert" folder to run the example, but it failed when running this line:

```bash
(bert) root@user:/mnt/d/src/onnxruntime-training-examples# python ./workspace/BERT/data/bertPrep.py --action download --dataset wikicorpus_en
```

The download hung for a long time:

```
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
Working Directory: ./workspace/BERT/data/
Action: download
Dataset Name: wikicorpus_en

Directory Structure:
{ 'download': './workspace/BERT/data//download',
'extracted': './workspace/BERT/data//extracted',
'formatted': './workspace/BERT/data//formatted_one_article_per_line',
'hdf5': './workspace/BERT/data//hdf5_lower_case_1_seq_len_512_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5',
'sharded': './workspace/BERT/data//sharded_training_shards_256_test_shards_256_fraction_0.2',
'tfrecord': './workspace/BERT/data//tfrecord_lower_case_1_seq_len_512_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5'}

Downloading: https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
```

And then:

```
Traceback (most recent call last):
File "./workspace/BERT/data/bertPrep.py", line 362, in
main(args)
File "./workspace/BERT/data/bertPrep.py", line 60, in main
downloader.download()
File "/mnt/d/src/onnxruntime-training-examples/workspace/BERT/data/Downloader.py", line 33, in download
self.download_wikicorpus('en')
File "/mnt/d/src/onnxruntime-training-examples/workspace/BERT/data/Downloader.py", line 71, in download_wikicorpus
downloader.download()
File "/mnt/d/src/onnxruntime-training-examples/workspace/BERT/data/WikiDownloader.py", line 50, in download
handle.write(response.read())
File "/root/miniconda3/envs/bert/lib/python3.6/http/client.py", line 472, in read
s = self._safe_read(self.length)
File "/root/miniconda3/envs/bert/lib/python3.6/http/client.py", line 627, in _safe_read
return b"".join(s)
MemoryError
```

My environment is Windows 10, running the example in WSL2 with conda.
But as I checked, the temp file created during the download was wikicorpus_en.xml.bz2 instead of enwiki-latest-pages-articles.xml.bz2.
Do you have any suggestions to resolve this error?

Can't load checkpoint

I am using the latest onnxruntime.dll training win64 package from NuGet.
I created my artifacts, so I have a checkpoint folder called "checkpoint" which contains the file paramtrain_tensors.pbseq.

In C# I am writing:

```csharp
CheckpointState checkpoint = new CheckpointState(@"C:\Users\Shadow\Documents\TrainTest\checkpoint");
```

but I am getting the error:

```
OnnxRuntimeException: [ErrorCode:Fail] open file checkpoint fail, errcode = 5 - unknown error
Microsoft.ML.OnnxRuntime.NativeApiStatus.VerifySuccess (System.IntPtr nativeStatus) (at <7bc2b92dbec44d6e9102eec4b89e261c>:0)
Microsoft.ML.OnnxRuntime.CheckpointState.LoadCheckpoint (System.String checkpointPath) (at <7bc2b92dbec44d6e9102eec4b89e261c>:0)
Microsoft.ML.OnnxRuntime.CheckpointState..ctor (System.String checkpointPath) (at <7bc2b92dbec44d6e9102eec4b89e261c>:0)
TrainTest.Start () (at Assets/Scenes/TrainTest.cs:14)
```

It is definitely the correct path. Any ideas?

Never mind: I rolled back to 15.1 and it works now.

Running the huggingface example in Docker failed with "the input device is not a TTY"

I followed huggingface/Readme.md to try to run the examples locally. It seemed fine until I executed the command below.

```bash
sudo docker run -it -v /dev/shm:/dev/shm -v <onnxruntime-training-examples_path>:/onnxruntime-training-examples --gpus all hf-recipe-local-docker
```

And I got the error "the input device is not a TTY".

I searched for it, but the answers suggest changing -it to -i or just removing -it, which brings a new error:

```
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: exec: "--help": executable file not found in $PATH: unknown.
ERRO[0000] error waiting for container: context canceled
```

Could you please tell me how to solve this problem? Thank you!

The environment

OS: Ubuntu 18.04 64bits
CUDA: 11.1
Docker: version 20.10.12

[Help] Problem trying out the iOS example

I am trying to run on_device_training/mobile/ios on an M1 Mac. I could not install onnxruntime-training.

Inspecting pip's verbose output, it seems there is no .whl for ARM. I've tried Python 3.11 and 3.10.

Also, onnxruntime-training==1.16.0, specified in requirements.txt, is not released yet. Is this an internal version?

Alternatively, @vraspar, could you maybe compress the artifacts and upload them here so I can try out the iOS example?

Thanks!

no attribute 'deepspeed_plugin'

The T5 sample code doesn't work in my Azure ML. Any advice on this?

```
Traceback (most recent call last):
  File "src/Finetune/train_summarization_deepspeed_optum.py", line 684, in <module>
    main()
  File "src/Finetune/train_summarization_deepspeed_optum.py", line 586, in main
    trainer = ORTSeq2SeqTrainer(
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/optimum/onnxruntime/trainer.py", line 304, in __init__
    super().__init__(
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/transformers/trainer.py", line 345, in __init__
    self.create_accelerator_and_postprocess()
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/transformers/trainer.py", line 3825, in create_accelerator_and_postprocess
    deepspeed_plugin=self.args.deepspeed_plugin,
AttributeError: 'ORTSeq2SeqTrainingArguments' object has no attribute 'deepspeed_plugin'
```
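As a hedged first step (an assumption, not a confirmed diagnosis): the traceback shows transformers' Trainer expecting a deepspeed_plugin attribute that the installed optimum's ORTSeq2SeqTrainingArguments does not define, which usually points at a transformers/optimum version mismatch. A minimal probe:

```python
# Print the installed versions to check transformers/optimum compatibility.
from importlib.metadata import version

print("transformers:", version("transformers"))
print("optimum:", version("optimum"))
```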

java.lang.UnsatisfiedLinkError: dlopen failed: library "/Users/junchenzhao/Dist-CPU-Learn/android/distributed_inference_demo/test1/src/main/cpp/lib/libonnxruntime4j_jni.so" not found

Dear ONNX community,

Recently, I have been trying to build my app for on-device training. I followed the procedure in the example but made modifications to the CMakeLists.txt file.

Here is my CMakeLists.txt:

```cmake
# Sets the minimum version of CMake required to build the native library.

cmake_minimum_required(VERSION 3.18.1)
project("distributed_inference_demo")

add_library( # Sets the name of the library.
        distributed_inference_demo

        # Sets the library as a shared library.
        SHARED

        # Provides a relative path to your source file(s).
        native-lib.cpp
        utils.cpp
        inference.cpp
        )

add_library(onnxruntime SHARED IMPORTED)
set_target_properties(onnxruntime PROPERTIES IMPORTED_LOCATION ${CMAKE_SOURCE_DIR}/lib/libonnxruntime.so)
add_library(onnxruntime4j_jni SHARED IMPORTED)
set_target_properties(onnxruntime4j_jni PROPERTIES IMPORTED_LOCATION ${CMAKE_SOURCE_DIR}/lib/libonnxruntime4j_jni.so)
target_include_directories(distributed_inference_demo PRIVATE ${CMAKE_SOURCE_DIR}/include/)

# Searches for a specified prebuilt library and stores the path as a
# variable. Because CMake includes system libraries in the search path by
# default, you only need to specify the name of the public NDK library
# you want to add. CMake verifies that the library exists before
# completing its build.

find_library( # Sets the name of the path variable.
        log-lib

        # Specifies the name of the NDK library that
        # you want CMake to locate.
        log)

# Specifies libraries CMake should link to your target library. You
# can link multiple libraries, such as libraries you define in this
# build script, prebuilt third-party libraries, or system libraries.

target_link_libraries( # Specifies the target library.
        distributed_inference_demo

        # Links the target library to the log library
        # included in the NDK.
        ${log-lib}
        onnxruntime4j_jni
        onnxruntime)
```

As you can see, I added libonnxruntime4j_jni.so as an additional library in my app. The libonnxruntime4j_jni.so file is saved under my cpp/lib/ directory.

However, when I build the app, the error below constantly pops up:

[screenshot omitted]

If it's possible, could anyone help me with solving this issue?

Broken onnx graph while using onnxruntime-training 1.11.0

The issue continues the discussion here.

I tried onnxruntime-training 1.11.0, but I ran into some unexpected errors with a simple text classification task. It seems that even the examples we previously had no problem with are broken now. Here are the error messages that I got:

```
Traceback (most recent call last):
  File "run_glue.py", line 572, in <module>
    main()
  File "run_glue.py", line 491, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/workspace/optimum/onnxruntime/trainer.py", line 476, in train
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1984, in training_step
    loss = self.compute_loss(model, inputs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 2016, in compute_loss
    outputs = model(**inputs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/training/ortmodule/_utils.py", line 309, in _forward
    return ortmodule._torch_module.forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/training/ortmodule/_utils.py", line 288, in _forward
    return torch_module_ort._execution_manager(
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/training/ortmodule/_training_manager.py", line 295, in forward
    self._fallback_manager.handle_exception(exception=e,
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/training/ortmodule/_fallback.py", line 151, in handle_exception
    raise exception
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/training/ortmodule/_training_manager.py", line 231, in forward
    build_gradient_graph = self._export_model(*inputs, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/training/ortmodule/_graph_execution_manager.py", line 329, in _export_model
    self._onnx_models.exported_model = SymbolicShapeInference.infer_shapes(self._onnx_models.exported_model,
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/tools/symbolic_shape_infer.py", line 2061, in infer_shapes
    all_shapes_inferred = symbolic_shape_inference._infer_impl()
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/tools/symbolic_shape_infer.py", line 1928, in _infer_impl
    self._check_merged_dims(in_dims, allow_broadcast=True)
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/tools/symbolic_shape_infer.py", line 620, in _check_merged_dims
    self._add_suggested_merge(dims, apply=True)
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/tools/symbolic_shape_infer.py", line 218, in _add_suggested_merge
    assert all([(type(s) == str and s in self.symbolic_dims_) or is_literal(s) for s in symbols])
AssertionError
  0%|                                                                                   | 0/12630 [00:03<?, ?it/s]
```

Besides, I received a lot of warnings before that. It seems that the exported ONNX graph is broken:

```
WARNING: The shape inference of org.pytorch.aten::ATen type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.
WARNING: The shape inference of com.microsoft::SoftmaxCrossEntropyLossInternal type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.
Warning: Checker does not support models with experimental ops: ATen
```

Environment information
The tests were run with this Dockerfile, with:

  • OS: Ubuntu 20.04
  • CUDA/cuDNN version: 11.1/8
  • onnxruntime-training: 1.11.0+cu111
  • torch: 1.10.0+cu111
    (actually I tried every stable torch version > 1.9.0, like 1.9.0, 1.9.0+cu111, 1.10.0, 1.10.0+cu111, 1.11.0+cu113; none of them works)
  • torch-ort: 1.11.0
  • Python version:3.8.10
  • GPU: A100 / T4

To Reproduce

Dockerfile
Example scripts

Does the ORT team have any suggestions? @ytaous I am curious where the issue comes from. Thanks for helping!! 🙏

onnxruntime-training CPU nightly version error

I'm trying the nightly builds of onnxruntime-training from https://download.onnxruntime.ai/onnxruntime_nightly_cpu.html and got this error:

```
File "/home/users/user/.local/lib/python3.8/site-packages/onnxruntime/training/api/module.py", line 128, in Module
def export_model_for_inferencing(self, inference_model_uri: str, graph_output_names: list[str]) -> None:
TypeError: 'type' object is not subscriptable
```

It happened in multiple nightly builds. I've tried the following versions:
onnxruntime_training-1.15.0.dev20230201001+cpu-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
onnxruntime_training-1.15.0.dev20230207001+cpu-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

I'm using a CentOS 7 system, Python 3.8, with onnxruntime version 1.13.1.
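The error pattern matches built-in generics: subscripting list as list[str] only became legal in Python 3.9 (PEP 585), and the annotation here is evaluated when the module is defined, so it raises on the Python 3.8 shown in the trace. A minimal repro sketch (my reading, not an official diagnosis):

```python
# On Python 3.8 this raises "TypeError: 'type' object is not subscriptable"
# at definition time; on Python 3.9+ it runs fine (PEP 585). The library
# could avoid it with typing.List[str] or `from __future__ import annotations`.
def export(names: list[str]) -> None:
    pass
```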

converting checkpoints

Hi guys!

We were using the training recipe with the latest update, and convert_checkpoint.py was not working properly with the transformers library.

I uploaded an updated version of the script to my repo, which makes the checkpoint work with transformers.

Please check it.

Regards,
Robert

Using ONNX Runtime Training with GPT2 Text Generation

There are 2 parts to this question.

  1. How to use the OnnxRuntime C# API to do inference with the GPT-2 ONNX model
  2. How to use the API to do training (text generation) with the GPT-2 ONNX model

Question 1: How do I set inputMeta[inputName].Dimensions for GPT-2 ONNX using the OnnxRuntime C# API when the dimensions are int[3] consisting of {-1,-1,-1}?

The GPT-2 model provided has the following input metadata:

name: input
TensorKind: Int64
dimensions: int[3] consisting of {-1,-1,-1}, i.e. shape (-1,-1,-1)

OnnxRuntime C# API:

```csharp
Int64[] sourceData;  // assume your data is loaded into a flat Int64 array
int[] dimensions;    // and the dimensions of the input are stored here

// dimensions are int[3] consisting of {-1,-1,-1}?
Tensor<Int64> input = new DenseTensor<Int64>(sourceData, dimensions);
```
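As a hedged answer to Question 1: a -1 in the model's input metadata marks a dynamic axis, a placeholder rather than a size to pass; the caller supplies concrete dimensions at run time. The same idea in Python for brevity (model path, axis meanings, and sizes are assumptions):

```python
import numpy as np
import onnxruntime as ort

# The metadata reports shape (-1, -1, -1); each -1 is a dynamic axis.
# At run time you bind a tensor with concrete sizes, e.g. 1 x 1 x 8.
session = ort.InferenceSession("gpt2.onnx")  # hypothetical path
input_name = session.get_inputs()[0].name
tokens = np.zeros((1, 1, 8), dtype=np.int64)  # concrete dims replace the -1s
outputs = session.run(None, {input_name: tokens})
```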

The acceleration of ort-stable-diffusion is not significant compared to pytorch-stable-diffusion

I ran a stable-diffusion test and compared the speed of the ORT and PyTorch versions, hoping to get results similar to the example, but the speedup of ort-stable-diffusion over pytorch-stable-diffusion is not as significant as in the example.
This is my result:

[image omitted]

My commands:

ORT:
```bash
accelerate launch --config_file=accelerate_config.yaml --mixed_precision=fp16 train_text_to_image.py --ort --pretrained_model_name_or_path=CompVis/stable-diffusion-v1-4 --dataset_name=lambdalabs/pokemon-blip-captions --use_ema --resolution=512 --center_crop --random_flip --train_batch_size=1 --gradient_accumulation_steps=4 --gradient_checkpointing --max_train_steps=5000 --learning_rate=1e-05 --max_grad_norm=1 --lr_scheduler=constant --lr_warmup_steps=0 --output_dir=sd-pokemon-model
```

PyTorch:
```bash
accelerate launch --config_file=accelerate_config.yaml --mixed_precision=fp16 train_text_to_image.py --pretrained_model_name_or_path=CompVis/stable-diffusion-v1-4 --dataset_name=lambdalabs/pokemon-blip-captions --use_ema --resolution=512 --center_crop --random_flip --train_batch_size=1 --gradient_accumulation_steps=4 --gradient_checkpointing --max_train_steps=5000 --learning_rate=1e-05 --max_grad_norm=1 --lr_scheduler=constant --lr_warmup_steps=0 --output_dir=sd-pokemon-model
```

Can anyone help me, please?

ORTPersonalize App Building Fails

System

  1. Ubuntu 20.04.
  2. SDK 33.
  3. CMake 3.26.4.

Issue

Currently, I'm trying to reproduce the work shown in ORTPersonalize.

When I build the project, 2 errors always show up, like this:

Error 1

ninja: error: '/home/junchen/onnx-android/app/ORTPersonalize/app/src/main/cpp/libs/libonnxruntime.so', needed by '/home/junchen/onnx-android/app/ORTPersonalize/app/build/intermediates/cxx/Debug/3s452306/obj/arm64-v8a/libortpersonalize.so', missing and no known rule to make it

Error 2

Execution failed for task ':app:buildCMakeDebug[arm64-v8a]'.

com.android.ide.common.process.ProcessException: ninja: Entering directory `/home/junchen/onnx-android/app/ORTPersonalize/app/.cxx/Debug/3s452306/arm64-v8a'

  • Try:

Run with --info or --debug option to get more log output.
Run with --scan to get full insights.

  • Exception is:
    org.gradle.api.tasks.TaskExecutionException: Execution failed for task ':app:buildCMakeDebug[arm64-v8a]'.
    at org.gradle.api.internal.tasks.execution.ExecuteActionsTaskExecuter.lambda$executeIfValid$1(ExecuteActionsTaskExecuter.java:147)
    at org.gradle.internal.Try$Failure.ifSuccessfulOrElse(Try.java:282)
    at org.gradle.api.internal.tasks.execution.ExecuteActionsTaskExecuter.executeIfValid(ExecuteActionsTaskExecuter.java:145)
    at org.gradle.api.internal.tasks.execution.ExecuteActionsTaskExecuter.execute(ExecuteActionsTaskExecuter.java:133)
    at org.gradle.api.internal.tasks.execution.CleanupStaleOutputsExecuter.execute(CleanupStaleOutputsExecuter.java:77)
    at org.gradle.api.internal.tasks.execution.FinalizePropertiesTaskExecuter.execute(FinalizePropertiesTaskExecuter.java:46)
    at org.gradle.api.internal.tasks.execution.ResolveTaskExecutionModeExecuter.execute(ResolveTaskExecutionModeExecuter.java:51)
    at org.gradle.api.internal.tasks.execution.SkipTaskWithNoActionsExecuter.execute(SkipTaskWithNoActionsExecuter.java:57)
    at org.gradle.api.internal.tasks.execution.SkipOnlyIfTaskExecuter.execute(SkipOnlyIfTaskExecuter.java:56)
    at org.gradle.api.internal.tasks.execution.CatchExceptionTaskExecuter.execute(CatchExceptionTaskExecuter.java:36)
    at org.gradle.api.internal.tasks.execution.EventFiringTaskExecuter$1.executeTask(EventFiringTaskExecuter.java:77)
    at org.gradle.api.internal.tasks.execution.EventFiringTaskExecuter$1.call(EventFiringTaskExecuter.java:55)
    at org.gradle.api.internal.tasks.execution.EventFiringTaskExecuter$1.call(EventFiringTaskExecuter.java:52)
    at org.gradle.internal.operations.DefaultBuildOperationRunner$CallableBuildOperationWorker.execute(DefaultBuildOperationRunner.java:204)
    at org.gradle.internal.operations.DefaultBuildOperationRunner$CallableBuildOperationWorker.execute(DefaultBuildOperationRunner.java:199)
    at org.gradle.internal.operations.DefaultBuildOperationRunner$2.execute(DefaultBuildOperationRunner.java:66)
    at org.gradle.internal.operations.DefaultBuildOperationRunner$2.execute(DefaultBuildOperationRunner.java:59)
    at org.gradle.internal.operations.DefaultBuildOperationRunner.execute(DefaultBuildOperationRunner.java:157)
    at org.gradle.internal.operations.DefaultBuildOperationRunner.execute(DefaultBuildOperationRunner.java:59)
    at org.gradle.internal.operations.DefaultBuildOperationRunner.call(DefaultBuildOperationRunner.java:53)
    at org.gradle.internal.operations.DefaultBuildOperationExecutor.call(DefaultBuildOperationExecutor.java:73)
    at org.gradle.api.internal.tasks.execution.EventFiringTaskExecuter.execute(EventFiringTaskExecuter.java:52)
    at org.gradle.execution.plan.LocalTaskNodeExecutor.execute(LocalTaskNodeExecutor.java:74)
    at org.gradle.execution.taskgraph.DefaultTaskExecutionGraph$InvokeNodeExecutorsAction.execute(DefaultTaskExecutionGraph.java:333)
    at org.gradle.execution.taskgraph.DefaultTaskExecutionGraph$InvokeNodeExecutorsAction.execute(DefaultTaskExecutionGraph.java:320)
    at org.gradle.execution.taskgraph.DefaultTaskExecutionGraph$BuildOperationAwareExecutionAction.execute(DefaultTaskExecutionGraph.java:313)
    at org.gradle.execution.taskgraph.DefaultTaskExecutionGraph$BuildOperationAwareExecutionAction.execute(DefaultTaskExecutionGraph.java:299)
    at org.gradle.execution.plan.DefaultPlanExecutor$ExecutorWorker.lambda$run$0(DefaultPlanExecutor.java:143)
    at org.gradle.execution.plan.DefaultPlanExecutor$ExecutorWorker.execute(DefaultPlanExecutor.java:227)
    at org.gradle.execution.plan.DefaultPlanExecutor$ExecutorWorker.executeNextNode(DefaultPlanExecutor.java:218)
    at org.gradle.execution.plan.DefaultPlanExecutor$ExecutorWorker.run(DefaultPlanExecutor.java:140)
    at org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:64)
    at org.gradle.internal.concurrent.ManagedExecutorImpl$1.run(ManagedExecutorImpl.java:48)
    Caused by: org.gradle.internal.UncheckedException: com.android.ide.common.process.ProcessException: ninja: Entering directory `/home/junchen/onnx-android/app/ORTPersonalize/app/.cxx/Debug/3s452306/arm64-v8a'

C++ build system [build] failed while executing:
/home/junchen/Android/Sdk/cmake/3.18.1/bin/ninja
-C
/home/junchen/onnx-android/app/ORTPersonalize/app/.cxx/Debug/3s452306/arm64-v8a
ortpersonalize
from /home/junchen/onnx-android/app/ORTPersonalize/app
ninja: error: '/home/junchen/onnx-android/app/ORTPersonalize/app/src/main/cpp/libs/libonnxruntime.so', needed by '/home/junchen/onnx-android/app/ORTPersonalize/app/build/intermediates/cxx/Debug/3s452306/obj/arm64-v8a/libortpersonalize.so', missing and no known rule to make it
at org.gradle.internal.UncheckedException.throwAsUncheckedException(UncheckedException.java:68)
at org.gradle.internal.UncheckedException.throwAsUncheckedException(UncheckedException.java:41)
at org.gradle.internal.reflect.JavaMethod.invoke(JavaMethod.java:107)
at org.gradle.api.internal.project.taskfactory.StandardTaskAction.doExecute(StandardTaskAction.java:58)
at org.gradle.api.internal.project.taskfactory.StandardTaskAction.execute(StandardTaskAction.java:51)
at org.gradle.api.internal.project.taskfactory.StandardTaskAction.execute(StandardTaskAction.java:29)
at org.gradle.api.internal.tasks.execution.TaskExecution$3.run(TaskExecution.java:242)
at org.gradle.internal.operations.DefaultBuildOperationRunner$1.execute(DefaultBuildOperationRunner.java:29)
at org.gradle.internal.operations.DefaultBuildOperationRunner$1.execute(DefaultBuildOperationRunner.java:26)
at org.gradle.internal.operations.DefaultBuildOperationRunner$2.execute(DefaultBuildOperationRunner.java:66)
at org.gradle.internal.operations.DefaultBuildOperationRunner$2.execute(DefaultBuildOperationRunner.java:59)
at org.gradle.internal.operations.DefaultBuildOperationRunner.execute(DefaultBuildOperationRunner.java:157)
at org.gradle.internal.operations.DefaultBuildOperationRunner.execute(DefaultBuildOperationRunner.java:59)
at org.gradle.internal.operations.DefaultBuildOperationRunner.run(DefaultBuildOperationRunner.java:47)
at org.gradle.internal.operations.DefaultBuildOperationExecutor.run(DefaultBuildOperationExecutor.java:68)
at org.gradle.api.internal.tasks.execution.TaskExecution.executeAction(TaskExecution.java:227)
at org.gradle.api.internal.tasks.execution.TaskExecution.executeActions(TaskExecution.java:210)
at org.gradle.api.internal.tasks.execution.TaskExecution.executeWithPreviousOutputFiles(TaskExecution.java:193)
at org.gradle.api.internal.tasks.execution.TaskExecution.execute(TaskExecution.java:171)
at org.gradle.internal.execution.steps.ExecuteStep.executeInternal(ExecuteStep.java:89)
at org.gradle.internal.execution.steps.ExecuteStep.access$000(ExecuteStep.java:40)
at org.gradle.internal.execution.steps.ExecuteStep$1.call(ExecuteStep.java:53)
at org.gradle.internal.execution.steps.ExecuteStep$1.call(ExecuteStep.java:50)
at org.gradle.internal.operations.DefaultBuildOperationRunner$CallableBuildOperationWorker.execute(DefaultBuildOperationRunner.java:204)
at org.gradle.internal.operations.DefaultBuildOperationRunner$CallableBuildOperationWorker.execute(DefaultBuildOperationRunner.java:199)
at org.gradle.internal.operations.DefaultBuildOperationRunner$2.execute(DefaultBuildOperationRunner.java:66)
at org.gradle.internal.operations.DefaultBuildOperationRunner$2.execute(DefaultBuildOperationRunner.java:59)
at org.gradle.internal.operations.DefaultBuildOperationRunner.execute(DefaultBuildOperationRunner.java:157)
at org.gradle.internal.operations.DefaultBuildOperationRunner.execute(DefaultBuildOperationRunner.java:59)
at org.gradle.internal.operations.DefaultBuildOperationRunner.call(DefaultBuildOperationRunner.java:53)
at org.gradle.internal.operations.DefaultBuildOperationExecutor.call(DefaultBuildOperationExecutor.java:73)
at org.gradle.internal.execution.steps.ExecuteStep.execute(ExecuteStep.java:50)
at org.gradle.internal.execution.steps.ExecuteStep.execute(ExecuteStep.java:40)
at org.gradle.internal.execution.steps.RemovePreviousOutputsStep.execute(RemovePreviousOutputsStep.java:68)
at org.gradle.internal.execution.steps.RemovePreviousOutputsStep.execute(RemovePreviousOutputsStep.java:38)
at org.gradle.internal.execution.steps.ResolveInputChangesStep.execute(ResolveInputChangesStep.java:48)
at org.gradle.internal.execution.steps.ResolveInputChangesStep.execute(ResolveInputChangesStep.java:36)
at org.gradle.internal.execution.steps.CancelExecutionStep.execute(CancelExecutionStep.java:41)
at org.gradle.internal.execution.steps.TimeoutStep.executeWithoutTimeout(TimeoutStep.java:74)
at org.gradle.internal.execution.steps.TimeoutStep.execute(TimeoutStep.java:55)
at org.gradle.internal.execution.steps.CreateOutputsStep.execute(CreateOutputsStep.java:51)
at org.gradle.internal.execution.steps.CreateOutputsStep.execute(CreateOutputsStep.java:29)
at org.gradle.internal.execution.steps.CaptureStateAfterExecutionStep.execute(CaptureStateAfterExecutionStep.java:61)
at org.gradle.internal.execution.steps.CaptureStateAfterExecutionStep.execute(CaptureStateAfterExecutionStep.java:42)
at org.gradle.internal.execution.steps.BroadcastChangingOutputsStep.execute(BroadcastChangingOutputsStep.java:60)
at org.gradle.internal.execution.steps.BroadcastChangingOutputsStep.execute(BroadcastChangingOutputsStep.java:27)
at org.gradle.internal.execution.steps.BuildCacheStep.executeWithoutCache(BuildCacheStep.java:180)
at org.gradle.internal.execution.steps.BuildCacheStep.lambda$execute$1(BuildCacheStep.java:75)
at org.gradle.internal.Either$Right.fold(Either.java:175)
at org.gradle.internal.execution.caching.CachingState.fold(CachingState.java:59)
at org.gradle.internal.execution.steps.BuildCacheStep.execute(BuildCacheStep.java:73)
at org.gradle.internal.execution.steps.BuildCacheStep.execute(BuildCacheStep.java:48)
at org.gradle.internal.execution.steps.StoreExecutionStateStep.execute(StoreExecutionStateStep.java:36)
at org.gradle.internal.execution.steps.StoreExecutionStateStep.execute(StoreExecutionStateStep.java:25)
at org.gradle.internal.execution.steps.RecordOutputsStep.execute(RecordOutputsStep.java:36)
at org.gradle.internal.execution.steps.RecordOutputsStep.execute(RecordOutputsStep.java:22)
at org.gradle.internal.execution.steps.SkipUpToDateStep.executeBecause(SkipUpToDateStep.java:110)
at org.gradle.internal.execution.steps.SkipUpToDateStep.lambda$execute$2(SkipUpToDateStep.java:56)
at org.gradle.internal.execution.steps.SkipUpToDateStep.execute(SkipUpToDateStep.java:56)
at org.gradle.internal.execution.steps.SkipUpToDateStep.execute(SkipUpToDateStep.java:38)
at org.gradle.internal.execution.steps.ResolveChangesStep.execute(ResolveChangesStep.java:73)
at org.gradle.internal.execution.steps.ResolveChangesStep.execute(ResolveChangesStep.java:44)
at org.gradle.internal.execution.steps.legacy.MarkSnapshottingInputsFinishedStep.execute(MarkSnapshottingInputsFinishedStep.java:37)
at org.gradle.internal.execution.steps.legacy.MarkSnapshottingInputsFinishedStep.execute(MarkSnapshottingInputsFinishedStep.java:27)
at org.gradle.internal.execution.steps.ResolveCachingStateStep.execute(ResolveCachingStateStep.java:89)
at org.gradle.internal.execution.steps.ResolveCachingStateStep.execute(ResolveCachingStateStep.java:50)
at org.gradle.internal.execution.steps.ValidateStep.execute(ValidateStep.java:114)
at org.gradle.internal.execution.steps.ValidateStep.execute(ValidateStep.java:57)
at org.gradle.internal.execution.steps.CaptureStateBeforeExecutionStep.execute(CaptureStateBeforeExecutionStep.java:76)
at org.gradle.internal.execution.steps.CaptureStateBeforeExecutionStep.execute(CaptureStateBeforeExecutionStep.java:50)
at org.gradle.internal.execution.steps.SkipEmptyWorkStep.executeWithNoEmptySources(SkipEmptyWorkStep.java:249)
at org.gradle.internal.execution.steps.SkipEmptyWorkStep.execute(SkipEmptyWorkStep.java:86)
at org.gradle.internal.execution.steps.SkipEmptyWorkStep.execute(SkipEmptyWorkStep.java:54)
at org.gradle.internal.execution.steps.RemoveUntrackedExecutionStateStep.execute(RemoveUntrackedExecutionStateStep.java:32)
at org.gradle.internal.execution.steps.RemoveUntrackedExecutionStateStep.execute(RemoveUntrackedExecutionStateStep.java:21)
at org.gradle.internal.execution.steps.legacy.MarkSnapshottingInputsStartedStep.execute(MarkSnapshottingInputsStartedStep.java:38)
at org.gradle.internal.execution.steps.LoadPreviousExecutionStateStep.execute(LoadPreviousExecutionStateStep.java:43)
at org.gradle.internal.execution.steps.LoadPreviousExecutionStateStep.execute(LoadPreviousExecutionStateStep.java:31)
at org.gradle.internal.execution.steps.AssignWorkspaceStep.lambda$execute$0(AssignWorkspaceStep.java:40)
at org.gradle.api.internal.tasks.execution.TaskExecution$4.withWorkspace(TaskExecution.java:287)
at org.gradle.internal.execution.steps.AssignWorkspaceStep.execute(AssignWorkspaceStep.java:40)
at org.gradle.internal.execution.steps.AssignWorkspaceStep.execute(AssignWorkspaceStep.java:30)
at org.gradle.internal.execution.steps.IdentityCacheStep.execute(IdentityCacheStep.java:37)
at org.gradle.internal.execution.steps.IdentityCacheStep.execute(IdentityCacheStep.java:27)
at org.gradle.internal.execution.steps.IdentifyStep.execute(IdentifyStep.java:44)
at org.gradle.internal.execution.steps.IdentifyStep.execute(IdentifyStep.java:33)
at org.gradle.internal.execution.impl.DefaultExecutionEngine$1.execute(DefaultExecutionEngine.java:76)
at org.gradle.api.internal.tasks.execution.ExecuteActionsTaskExecuter.executeIfValid(ExecuteActionsTaskExecuter.java:144)
at org.gradle.api.internal.tasks.execution.ExecuteActionsTaskExecuter.execute(ExecuteActionsTaskExecuter.java:133)
at org.gradle.api.internal.tasks.execution.CleanupStaleOutputsExecuter.execute(CleanupStaleOutputsExecuter.java:77)
at org.gradle.api.internal.tasks.execution.FinalizePropertiesTaskExecuter.execute(FinalizePropertiesTaskExecuter.java:46)
at org.gradle.api.internal.tasks.execution.ResolveTaskExecutionModeExecuter.execute(ResolveTaskExecutionModeExecuter.java:51)
at org.gradle.api.internal.tasks.execution.SkipTaskWithNoActionsExecuter.execute(SkipTaskWithNoActionsExecuter.java:57)
at org.gradle.api.internal.tasks.execution.SkipOnlyIfTaskExecuter.execute(SkipOnlyIfTaskExecuter.java:56)
at org.gradle.api.internal.tasks.execution.CatchExceptionTaskExecuter.execute(CatchExceptionTaskExecuter.java:36)
at org.gradle.api.internal.tasks.execution.EventFiringTaskExecuter$1.executeTask(EventFiringTaskExecuter.java:77)
at org.gradle.api.internal.tasks.execution.EventFiringTaskExecuter$1.call(EventFiringTaskExecuter.java:55)
at org.gradle.api.internal.tasks.execution.EventFiringTaskExecuter$1.call(EventFiringTaskExecuter.java:52)
at org.gradle.internal.operations.DefaultBuildOperationRunner$CallableBuildOperationWorker.execute(DefaultBuildOperationRunner.java:204)
at org.gradle.internal.operations.DefaultBuildOperationRunner$CallableBuildOperationWorker.execute(DefaultBuildOperationRunner.java:199)
at org.gradle.internal.operations.DefaultBuildOperationRunner$2.execute(DefaultBuildOperationRunner.java:66)
at org.gradle.internal.operations.DefaultBuildOperationRunner$2.execute(DefaultBuildOperationRunner.java:59)
at org.gradle.internal.operations.DefaultBuildOperationRunner.execute(DefaultBuildOperationRunner.java:157)
at org.gradle.internal.operations.DefaultBuildOperationRunner.execute(DefaultBuildOperationRunner.java:59)
at org.gradle.internal.operations.DefaultBuildOperationRunner.call(DefaultBuildOperationRunner.java:53)
at org.gradle.internal.operations.DefaultBuildOperationExecutor.call(DefaultBuildOperationExecutor.java:73)
at org.gradle.api.internal.tasks.execution.EventFiringTaskExecuter.execute(EventFiringTaskExecuter.java:52)
at org.gradle.execution.plan.LocalTaskNodeExecutor.execute(LocalTaskNodeExecutor.java:74)
at org.gradle.execution.taskgraph.DefaultTaskExecutionGraph$InvokeNodeExecutorsAction.execute(DefaultTaskExecutionGraph.java:333)
at org.gradle.execution.taskgraph.DefaultTaskExecutionGraph$InvokeNodeExecutorsAction.execute(DefaultTaskExecutionGraph.java:320)
at org.gradle.execution.taskgraph.DefaultTaskExecutionGraph$BuildOperationAwareExecutionAction.execute(DefaultTaskExecutionGraph.java:313)
at org.gradle.execution.taskgraph.DefaultTaskExecutionGraph$BuildOperationAwareExecutionAction.execute(DefaultTaskExecutionGraph.java:299)
at org.gradle.execution.plan.DefaultPlanExecutor$ExecutorWorker.lambda$run$0(DefaultPlanExecutor.java:143)
at org.gradle.execution.plan.DefaultPlanExecutor$ExecutorWorker.execute(DefaultPlanExecutor.java:227)
at org.gradle.execution.plan.DefaultPlanExecutor$ExecutorWorker.executeNextNode(DefaultPlanExecutor.java:218)
at org.gradle.execution.plan.DefaultPlanExecutor$ExecutorWorker.run(DefaultPlanExecutor.java:140)
at org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:64)
at org.gradle.internal.concurrent.ManagedExecutorImpl$1.run(ManagedExecutorImpl.java:48)
Caused by: com.android.ide.common.process.ProcessException: ninja: Entering directory `/home/junchen/onnx-android/app/ORTPersonalize/app/.cxx/Debug/3s452306/arm64-v8a'

C++ build system [build] failed while executing:
/home/junchen/Android/Sdk/cmake/3.18.1/bin/ninja
-C
/home/junchen/onnx-android/app/ORTPersonalize/app/.cxx/Debug/3s452306/arm64-v8a
ortpersonalize
from /home/junchen/onnx-android/app/ORTPersonalize/app
ninja: error: '/home/junchen/onnx-android/app/ORTPersonalize/app/src/main/cpp/libs/libonnxruntime.so', needed by '/home/junchen/onnx-android/app/ORTPersonalize/app/build/intermediates/cxx/Debug/3s452306/obj/arm64-v8a/libortpersonalize.so', missing and no known rule to make it
at com.android.build.gradle.internal.cxx.process.ExecuteProcessKt.execute(ExecuteProcess.kt:274)
at com.android.build.gradle.internal.cxx.process.ExecuteProcessKt$executeProcess$1.invoke(ExecuteProcess.kt:106)
at com.android.build.gradle.internal.cxx.process.ExecuteProcessKt$executeProcess$1.invoke(ExecuteProcess.kt:104)
at com.android.build.gradle.internal.cxx.timing.TimingEnvironmentKt.time(TimingEnvironment.kt:32)
at com.android.build.gradle.internal.cxx.process.ExecuteProcessKt.executeProcess(ExecuteProcess.kt:104)
at com.android.build.gradle.internal.cxx.process.ExecuteProcessKt.executeProcess$default(ExecuteProcess.kt:84)
at com.android.build.gradle.internal.cxx.build.CxxRegularBuilder.executeProcessBatch(CxxRegularBuilder.kt:331)
at com.android.build.gradle.internal.cxx.build.CxxRegularBuilder.build(CxxRegularBuilder.kt:128)
at com.android.build.gradle.tasks.ExternalNativeBuildTask$doTaskAction$$inlined$recordTaskAction$1.invoke(BaseTask.kt:70)
at com.android.build.gradle.internal.tasks.Blocks.recordSpan(Blocks.java:51)
at com.android.build.gradle.tasks.ExternalNativeBuildTask.doTaskAction(ExternalNativeBuildTask.kt:136)
at com.android.build.gradle.internal.tasks.UnsafeOutputsTask$taskAction$$inlined$recordTaskAction$1.invoke(BaseTask.kt:65)
at com.android.build.gradle.internal.tasks.Blocks.recordSpan(Blocks.java:51)
at com.android.build.gradle.internal.tasks.UnsafeOutputsTask.taskAction(UnsafeOutputsTask.kt:61)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at org.gradle.internal.reflect.JavaMethod.invoke(JavaMethod.java:104)
... 115 more
Caused by: com.android.ide.common.process.ProcessException: Error while executing process /home/junchen/Android/Sdk/cmake/3.18.1/bin/ninja with arguments {-C /home/junchen/onnx-android/app/ORTPersonalize/app/.cxx/Debug/3s452306/arm64-v8a ortpersonalize}
at com.android.build.gradle.internal.process.GradleProcessResult.buildProcessException(GradleProcessResult.java:73)
at com.android.build.gradle.internal.process.GradleProcessResult.assertNormalExitValue(GradleProcessResult.java:48)
at com.android.build.gradle.internal.cxx.process.ExecuteProcessKt.execute(ExecuteProcess.kt:269)
... 132 more
Caused by: org.gradle.process.internal.ExecException: Process 'command '/home/junchen/Android/Sdk/cmake/3.18.1/bin/ninja'' finished with non-zero exit value 1
at org.gradle.process.internal.DefaultExecHandle$ExecResultImpl.assertNormalExitValue(DefaultExecHandle.java:414)
at com.android.build.gradle.internal.process.GradleProcessResult.assertNormalExitValue(GradleProcessResult.java:46)
... 133 more

For the first error message, it seems that there is no libonnxruntime.so file under the /ORTPersonalize/app/src/main/cpp/libs/ directory. I have tested this on both my Ubuntu machine and my Windows machine, and both show this error. The libonnxruntime.so file is probably missing.

For the second error, I'm not sure why it is happening, since my CMake version is higher than the required version.

Could anyone help me with solving this issue?

Thanks a lot!

Urgent

This issue is urgent.

Build failed because of ninja in the Android example app

Hi, I'm trying to run your Android example app ORTPersonalize and ran into the problem below. libortpersonalize.so is missing or fails during the build, so the CMake script does not create the corresponding .so file successfully. I tried on both my Linux and Windows systems; both fail with the same error message below. My CMake version is 3.26.4.

ninja: error: '/onnx/onnxruntime-training-examples/on_device_training/mobile/android/c-cpp/app/ORTPersonalize/app/src/main/cpp/libs/libonnxruntime.so', needed by '/onnx/onnxruntime-training-examples/on_device_training/mobile/android/c-cpp/app/ORTPersonalize/app/build/intermediates/cxx/Debug/5b237351/obj/arm64-v8a/libortpersonalize.so', missing and no known rule to make it
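For reference, the ninja message above says the build treats libonnxruntime.so as a prebuilt dependency at a fixed path under src/main/cpp/libs/ with "no known rule to make it", so the library has to be placed there before the Gradle build runs. A minimal pre-flight check along those lines (hypothetical, not part of the example) might look like:

    # Hypothetical pre-build check, not part of the ORTPersonalize example.
    # The ninja error means CMake expects libonnxruntime.so as a prebuilt
    # dependency it cannot generate, so it must already exist at this path.
    from pathlib import Path

    lib = Path("app/src/main/cpp/libs/libonnxruntime.so")  # path from the error
    if not lib.exists():
        raise SystemExit(
            f"Missing {lib}: copy the ONNX Runtime Android shared library "
            "into this directory before running the Gradle build."
        )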


Python 3.6 is now outdated

Attempting to run python hf-ort.py --gpu_cluster_name <gpu_cluster_name> --hf_model bert-large --run_config ort with my cluster results in a failed run due to a __future__ import error: the annotations future feature is only available from Python 3.7 onward. I think all this needs is a documentation update.

"User program failed with SyntaxError: future feature annotations is not defined (ssd_offload.py, line 6)"

device_training

RuntimeError: /onnxruntime_src/orttraining/orttraining/training_api/optimizer.cc:273 void onnxruntime::training::api::Optimizer::Initialize(const onnxruntime::training::api::ModelIdentifiers&, const std::vector<std::shared_ptr<onnxruntime::IExecutionProvider>>&, gsl::span<OrtCustomOpDomain* const>) [ONNXRuntimeError] : 1 : FAIL : Load model from data/optimizer_model.onnx failed:/onnxruntime_src/onnxruntime/core/graph/model_load_utils.h:46 void onnxruntime::model_load_utils::ValidateOpsetForDomain(const std::unordered_map<std::basic_string<char>, int>&, const onnxruntime::logging::Logger&, bool, const string&, int) ONNX Runtime only guarantees support for models stamped with official released onnx opset versions. Opset 20 is under development and support for this is limited. The operator schemas and or other functionality may change before next ONNX release and in this case ONNX Runtime will not guarantee backward compatibility. Current official support for domain ai.onnx is till opset 19
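The error says the generated optimizer_model.onnx is stamped with opset 20, while ONNX Runtime's official support for the ai.onnx domain stopped at opset 19. One plausible workaround, not taken from the report, is to pin the export to a released opset before generating the training artifacts; the model and paths below are placeholders:

    # Sketch of a workaround, assuming the onnxruntime-training artifacts API.
    # Pinning opset_version <= 19 at export time keeps the generated training,
    # eval, and optimizer graphs on an officially released opset.
    import io

    import onnx
    import torch
    from onnxruntime.training import artifacts

    net = torch.nn.Linear(10, 2)              # placeholder model
    dummy_input = torch.randn(1, 10)

    buf = io.BytesIO()
    torch.onnx.export(net, dummy_input, buf,
                      input_names=["x"], output_names=["y"],
                      opset_version=17)       # a released opset, not 20
    onnx_model = onnx.load_model_from_string(buf.getvalue())

    artifacts.generate_artifacts(
        onnx_model,
        requires_grad=[name for name, _ in net.named_parameters()],
        loss=artifacts.LossType.CrossEntropyLoss,
        optimizer=artifacts.OptimType.AdamW,
        artifact_directory="data",            # writes optimizer_model.onnx here
    )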

WikiExtractor.py was not found in the latest WikiExtractor repo

I am running the BERT example; I manually downloaded enwiki-latest-pages-articles.xml.bz2 and ran:
$ python ./workspace/BERT/data/bertPrep.py --action download --dataset wikicorpus_en

But WikiExtractor.py is not present in the latest WikiExtractor repo, and I got the following error:

(bert) root@user:/mnt/d/src/onnxruntime-training-examples# python ./workspace/BERT/data/bertPrep.py --action text_formatting --dataset wikicorpus_en
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
Working Directory: ./workspace/BERT/data/
Action: text_formatting
Dataset Name: wikicorpus_en
Directory Structure: {
    'download': './workspace/BERT/data//download',
    'extracted': './workspace/BERT/data//extracted',
    'formatted': './workspace/BERT/data//formatted_one_article_per_line',
    'hdf5': './workspace/BERT/data//hdf5_lower_case_1_seq_len_512_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5',
    'sharded': './workspace/BERT/data//sharded_training_shards_256_test_shards_256_fraction_0.2',
    'tfrecord': './workspace/BERT/data//tfrecord_lower_case_1_seq_len_512_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5'}
WikiExtractor Command: ./workspace/wikiextractor/WikiExtractor.py ./workspace/BERT/data//download/wikicorpus_en/wikicorpus_en.xml -b 100M --processes 4 -o ./workspace/BERT/data//extracted/wikicorpus_en
/bin/sh: 1: ./workspace/wikiextractor/WikiExtractor.py: not found
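A plausible workaround, not from the original report: newer releases of the attardi/wikiextractor project are distributed as a pip package that replaces the standalone WikiExtractor.py script with an importable module, so the same extraction command can be reproduced roughly as follows (paths mirror the log above):

    # Sketch of a workaround, assuming the pip-installed `wikiextractor`
    # package (pip install wikiextractor), which exposes the extractor as a
    # module instead of the old standalone WikiExtractor.py script.
    import subprocess
    import sys

    subprocess.run(
        [
            sys.executable, "-m", "wikiextractor.WikiExtractor",
            "./workspace/BERT/data/download/wikicorpus_en/wikicorpus_en.xml",
            "-b", "100M",          # shard size, as in the original command
            "--processes", "4",
            "-o", "./workspace/BERT/data/extracted/wikicorpus_en",
        ],
        check=True,                # raise if extraction exits non-zero
    )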

onnxruntime training example in C++

Hi, I'm wondering if there are any C++ examples showing how to train a model using ONNX Runtime. We are interested in integrating the training capabilities, but our code base is in C++. Could you provide some C++ examples?
