ki6an / fastT5

⚡ boost inference speed of T5 models by 5x & reduce the model size by 3x.

License: Apache License 2.0

Python 100.00%
python t5 onnx onnxruntime quantization fastt5 nlp fast quantized-onnx-models translation

fastt5's Introduction

fastt5 icon

Reduce T5 model size by 3X and increase the inference speed up to 5X.



T5 models can be used for several NLP tasks such as summarization, QA, QG, translation, text generation, and more. Sequential text generation is naturally slow, and for larger T5 models it gets even slower. fastT5 makes T5 model inference faster by running it on onnxruntime, and it also reduces the model size through quantization.

The fastT5 library lets you convert a pretrained T5 model to ONNX, quantize it, and get back a model that runs on onnxruntime, all in a single line of code. You can also customize each step of this process.


Install

You can install fastT5 from PyPI:

 pip install fastt5

If you want to build from source:

git clone https://github.com/Ki6an/fastT5
cd fastT5
pip3 install -e .

Usage

The export_and_get_onnx_model() method exports the given pretrained T5 model to ONNX, quantizes it, and runs it on onnxruntime with default settings. The returned model supports Hugging Face's generate() method.

If you don't wish to quantize the model, pass quantized=False to the method.
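For example, using the same model name as the full example below:

from fastT5 import export_and_get_onnx_model

# export to ONNX but skip the quantization step
model = export_and_get_onnx_model('t5-small', quantized=False)

The default (quantized) workflow looks like this: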

from fastT5 import export_and_get_onnx_model
from transformers import AutoTokenizer

model_name = 't5-small'
model = export_and_get_onnx_model(model_name)

tokenizer = AutoTokenizer.from_pretrained(model_name)
t_input = "translate English to French: The universe is a dark forest."
token = tokenizer(t_input, return_tensors='pt')

tokens = model.generate(input_ids=token['input_ids'],
               attention_mask=token['attention_mask'],
               num_beams=2)

output = tokenizer.decode(tokens.squeeze(), skip_special_tokens=True)
print(output)

To run an already exported model, use get_onnx_model().
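For example, a minimal sketch assuming the models were previously exported to the default models folder:

from fastT5 import get_onnx_model

model_name = 't5-small'

# loads the already exported (and, by default, quantized) ONNX models
model = get_onnx_model(model_name)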

You can also customize the whole pipeline, as shown in the code example below:

from fastT5 import (OnnxT5, get_onnx_runtime_sessions,
                    generate_onnx_representation, quantize)
from transformers import AutoTokenizer

model_or_model_path = 't5-small'

# Step 1. convert huggingfaces t5 model to onnx
onnx_model_paths = generate_onnx_representation(model_or_model_path)

# Step 2. (recommended) quantize the converted model for fast inference and to reduce model size.
quant_model_paths = quantize(onnx_model_paths)

# step 3. setup onnx runtime
model_sessions = get_onnx_runtime_sessions(quant_model_paths)

# step 4. get the onnx model
model = OnnxT5(model_or_model_path, model_sessions)

                      ...
Custom output paths

By default, fastT5 creates a models folder in the current directory and stores all the exported models there. You can provide a custom folder path to store the exported models instead. To run already exported models that are stored in a custom folder, use get_onnx_model(onnx_models_path="/path/to/custom/folder/").

from fastT5 import export_and_get_onnx_model, get_onnx_model

model_name = "t5-small"
custom_output_path = "/path/to/custom/folder/"

# 1. stores models to custom_output_path
model = export_and_get_onnx_model(model_name, custom_output_path)

# 2. run already exported models that are stored in custom path
# model = get_onnx_model(model_name, custom_output_path)

Details

T5 is a seq2seq (encoder-decoder) model. Because the decoder runs repeatedly during inference, we can't directly export the whole model to ONNX; the encoder and decoder need to be exported separately.

past_key_values contains the pre-computed hidden states (the keys and values of the self-attention and cross-attention blocks) that can be reused to speed up sequential decoding.

Models can only be exported with a fixed number of inputs, but the decoder at the first step does not take past_key_values while the decoders at all later steps do. To get around this, we create two decoders: one for the first step that does not take past_key_values, and another for the remaining steps that does. An illustrative sketch follows.
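As an illustration only (this is a sketch, not fastT5's exact code; the wrapper class names are hypothetical), the two decoder interfaces could look like this, assuming a Hugging Face T5 decoder stack and LM head:

import torch

class DecoderFirstStep(torch.nn.Module):
    # first decoding step: produces past_key_values but does not consume any
    def __init__(self, decoder, lm_head):
        super().__init__()
        self.decoder = decoder
        self.lm_head = lm_head

    def forward(self, input_ids, encoder_attention_mask, encoder_hidden_states):
        out = self.decoder(
            input_ids=input_ids,
            encoder_attention_mask=encoder_attention_mask,
            encoder_hidden_states=encoder_hidden_states,
            use_cache=True,
        )
        return self.lm_head(out.last_hidden_state), out.past_key_values

class DecoderWithPast(torch.nn.Module):
    # later decoding steps: the flattened past_key_values become extra inputs
    def __init__(self, decoder, lm_head):
        super().__init__()
        self.decoder = decoder
        self.lm_head = lm_head

    def forward(self, input_ids, encoder_attention_mask, encoder_hidden_states, *flat_past_key_values):
        # regroup the flat inputs into one (self-attn key, self-attn value,
        # cross-attn key, cross-attn value) tuple per decoder layer
        past_key_values = tuple(
            tuple(flat_past_key_values[i : i + 4])
            for i in range(0, len(flat_past_key_values), 4)
        )
        out = self.decoder(
            input_ids=input_ids,
            encoder_attention_mask=encoder_attention_mask,
            encoder_hidden_states=encoder_hidden_states,
            past_key_values=past_key_values,
            use_cache=True,
        )
        return self.lm_head(out.last_hidden_state), out.past_key_values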

Next, we export all three models (encoder, decoder, init_decoder) and then quantize them. Quantizing from 32-bit to 8-bit should give roughly a 4x memory reduction; since there is an extra decoder, the overall size reduction works out to about 3x.
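Under the hood, fastT5's quantize() step applies onnxruntime's dynamic quantization to each exported graph (this is visible in the tracebacks quoted in the issues below). A minimal sketch, with illustrative file paths:

from onnxruntime.quantization import quantize_dynamic, QuantType

# 32-bit float weights -> 8-bit unsigned-integer weights
# (fastT5 also passes activation_type=QuantType.QUInt8 and optimize_model=False)
quantize_dynamic(
    "models/t5-small-encoder.onnx",            # illustrative input path
    "models/t5-small-encoder-quantized.onnx",  # illustrative output path
    weight_type=QuantType.QUInt8,
)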

Finally, we'll run the quantized model on onnx runtime.

Inference is simple, since the model supports Hugging Face's generate() method.

Functionalities

  • Export any pretrained T5 model to ONNX easily (with past_key_values).
  • The exported model supports beam search, greedy search, and more via the generate() method.
  • Reduce the model size by 3X using quantization.
  • Up to 5X speedup compared to PyTorch execution for greedy search and 3-4X for beam search.

Benchmarks

The benchmarks were run with the T5-base model on English-to-French translation.

Onnx model

The following graph shows the latency of the quantized ONNX model vs. the PyTorch model for beam numbers varying from 1 to 9. The latencies shown are averaged over sequence lengths up to 130.

t5-base

The following heat map shows the speedup factor, i.e., the ratio of PyTorch latency to ONNX model latency. The ONNX model wins in most cases; however, its advantage shrinks at longer sequence lengths.

t5-base-hist

Quantized onnx model

As mentioned earlier, quantized models are lightweight and have almost the same accuracy as the original model (quantized model scores are listed in the next section). The quantized ONNX model has the lowest latency, lower than both the plain ONNX and PyTorch models.

t5-base-quant

The model outperforms the PyTorch model by 5.7X for greedy search on average and 3-4X for beam search.

t5-base-quant-hist

Note: The results were generated on an AMD EPYC 7B12 and may vary from device to device. The ONNX models usually perform well on high-end CPUs with more cores.

Quantized model scores

The scores were measured on English-to-French translation with a beam size of 3.

Model                 Bleu_4    METEOR    ROUGE_L
t5-small (quant)      0.240769  0.282342  0.468817
t5-small (pytorch)    0.254601  0.295172  0.492749
t5-base (quant)       0.267606  0.306019  0.499188
t5-base (pytorch)     0.268346  0.304969  0.503306
t5-large (quant)      0.286726  0.316845  0.503585
t5-large (pytorch)    0.294015  0.315774  0.508677

Private HuggingFace Model Hub Models

The Hugging Face model hub supports private models. To use a private, pretrained version of T5 with fastT5, you must first authenticate with the Hugging Face ecosystem via $ transformers-cli login. Then, when using fastT5, there is one extra import and call:

from fastT5 import (
    OnnxT5,
    get_onnx_runtime_sessions,
    generate_onnx_representation,
    quantize,
    set_auth_token)
from transformers import AutoTokenizer

set_auth_token(True)
# the rest of the code is the same as using a public model

If you are unable to call $ transformers-cli login or prefer to use your API key, found at https://huggingface.co/settings/token (or https://huggingface.co/organizations/ORG_NAME/settings/token for organizations), you can pass it as a string to set_auth_token. Avoid hard-coding your API key into code: set the environment variable HF_API_KEY=<redacted>, and then in code:

import os

from fastT5 import (
    OnnxT5,
    get_onnx_runtime_sessions,
    generate_onnx_representation,
    quantize,
    set_auth_token)
from transformers import AutoTokenizer

auth_token = os.environ.get("HF_API_KEY")
set_auth_token(auth_token)

# code proceeds as normal

Further improvements

  • Currently, fastT5 supports only the CPU version of onnxruntime; a GPU implementation still needs to be done.
  • Graph optimization of the ONNX model would further reduce latency (see the sketch after this list).
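For reference, onnxruntime exposes built-in graph optimization through SessionOptions; the sketch below is independent of fastT5's own get_onnx_runtime_sessions() setup and uses an illustrative model path:

import onnxruntime as ort

options = ort.SessionOptions()
# enable all graph-level optimizations when the session is created
options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

session = ort.InferenceSession("models/t5-small-encoder-quantized.onnx", options)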

Get Help

Acknowledgements

@article{2019t5,
  author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
  title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
  journal = {arXiv e-prints},
  year = {2019},
  archivePrefix = {arXiv},
  eprint = {1910.10683},
}

fastt5's People

Contributors

aseifert, kagrze, ki6an, sam-writer, warrierrajeev


fastt5's Issues

get_onnx_model fails

When a custom models folder is given in model_name_or_path, it fails to find the model files.

Reason:

get_model_paths(
        model_name_or_path, saved_models_path, quantized
    )

is used to get the model paths, but saved_models_path is imported (not derived from model_name_or_path) and is equal to ./models/, so it only checks whether the files exist inside the current directory's models folder.

Unable to retrieve hidden_states

I converted a locally saved T5 checkpoint to ONNX using FastT5:

>>> from fastT5 import export_and_get_onnx_model
>>> from transformers import AutoTokenizer

>>> model_checkpoint = "path/to/checkpoint"
>>> model = export_and_get_onnx_model(model_checkpoint)

I tested it for inference:

>>> tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

>>> token = tokenizer(input_terms, max_length=512 * 2, padding=True, truncation=True, return_tensors='pt')

>>> out = model.generate(input_ids=token['input_ids'].to('cpu'),
                            attention_mask=token['attention_mask'].to('cpu'),
                            return_dict_in_generate=True,
                            max_length=512 * 2,
                            num_beams=1,
                            output_scores=True,
                            output_hidden_states=True)

>>> out.encoder_hidden_states
>>> out.decoder_hidden_states
(None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
...

>>> out
GreedySearchEncoderDecoderOutput(sequences=tensor([[  0, 119, 114, 102, 108, 111, 108, 125, 120, 112, 100, 101,  35,  53, ...
...
), , encoder_attentions=None, encoder_hidden_states=None, decoder_attentions=None, cross_attentions=None, decoder_hidden_states=(None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None))

The hidden states are all None.

Is there any way that I can retrieve the hidden states for both encoder and decoder?

failed with output shape error

transformers 4.4.2 or 4.5.1

from fastT5 import export_and_get_onnx_model
from transformers import AutoTokenizer

model_name = "t5-small"
model = export_and_get_onnx_model(model_name, quantized=False)

error:

transformers/models/t5/modeling_t5.py", line 497, in forward
    scores += position_bias
RuntimeError: output with shape [5, 8, 1, 2] doesn't match the broadcast shape [5, 8, 2, 2]

mT5 and Neural Machine Translation

Hi there,

  • Is this repository compatible with mT5 (multilingual T5)?
  • How much RAM is required to run the final quantized model vs. the original?
  • Can the quantized model be run on a Raspberry Pi device?

offline install error

Hi, I am installing fastt5 on an offline server. I already installed onnxruntime, but it still shows

WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7feedf7f9c88>: Failed to establish a new connection: [Errno -2] Name or service not known')': /simple/onnxruntime/

Accuracy hit when using fastT5

I've been experimenting with this and the speed gains are definitely amazing. However, I'm experiencing some poorer accuracy when using the fastT5 quantized version of my fine-tuned model. Is this to be expected and do you have any ideas on how to mitigate this? Perhaps more fine-tuning with a larger dataset?

CPU-inference T5Encoder-XL slower than PyTorch

Hello,

my goal is to convert the T5-XL model with fastT5. To be precise, this model has a model size >2GB and those model types currently seem to not be supported.
Additionally, I would like to use the T5EncoderModel which consists of only the encoder-part. I have locally adapted some things (removing the decoder part and using external_data_format) which are unfortunately not really usable for others.
I would really appreciate it if this could be somehow integrated.

Going along with the problem with large models, the quantization does not work for large models. For that, an issue microsoft/onnxruntime#7974 was already created.

Now to the actual problem. I am not sure if I made something wrong during my adaption, therefore I would appreciate it if this would be supported. But in the ort_settings.py it is documented that

default : set this to true, ort will choose the best settings for your hardware. (you can test out different settings for better results.)

Does it really yield the best configuration? In my case, my PyTorch model had a faster inference time.
I tried setting the default bool to false to adapt some configurations, but even with n_threads = 2 I for some reason get huge memory consumption when inferring on a larger number of sequences (>200).

My system:
Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz limited to 10 Cores.
60GB RAM
transformers 4.5.1
tokenizers 0.10.1
onnx 1.9.0
onnxruntime 1.7.0

The T5EncoderModel-XL is used from huggingface.

Thank you

Problems with T5 & onnxruntime

Hello there.
My purpose is to speed up a T5-small (fine-tuned) both on CPU and GPU.
So, I am trying to transform the network through fastT5. However, using the quantized model on CPU, I get performance similar to the initial model (T5-small, on CPU), without any significant improvement. Am I missing something?

Moreover, I have problems with onnxruntime-gpu. I have read from the other issues that I can't use onnxruntime-gpu with quantization, is that correct?

Also, I am trying to transform the T5-small model into a non-quantized ONNX model, in order to be able to use it on the GPU with onnxruntime-gpu and obtain some improvements. In this case, I get the errors:

[ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running Add node. Name:'Add_98' Status Message: /onnxruntime_src/onnxruntime/core/providers/cpu/math/element_wise_ops.h:487 void onnxruntime::BroadcastIterator::Append(ptrdiff_t, ptrdiff_t) axis == 1 || axis == largest was false. Attempting to broadcast an axis by a dimension other than 1. 14 by 16

or

[ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running Add node. Name:'Add_98' Status Message: Add_98: right operand cannot broadcast on dim 2 LeftShape: {1,8,85,85}, RightShape: {1,8,14,85}

The code is reported below. It fails in the generate method when I try to run it with onnxruntime-gpu.

    t_input = 'translate {} to SQL: '.format(languages[lang]) + original_text
    tokenizer = AutoTokenizer.from_pretrained(model_directory)
    token = tokenizer(t_input, return_tensors='pt')
    tokens = model.generate(input_ids=token['input_ids'], attention_mask=token['attention_mask'], num_beams=3)

Thank you.

Is fastT5 deepspeed compatible?

From the research I’ve done the onnxruntime wraps PyTorch modules handling all the parallel execution and such giving the speed. It’s also possible to use graph optimizations such as quantization and provide further speed boost by integrating with deepspeed which uses ZeRO parallel processing to slice weights into multiple GPUs.

What I cannot find is how to integrate it all together or if that is even possible.

Does this wrapper support onnx runtime and by default allow deep speed integration or is that specific to how the export is done?

Is there a process for utilizing a t5 based model on deep speed / does that give more performance than this out of the box?

The readme / documentation mentions that more performance can be gained through graph optimization (e.g. quantization), but I couldn't find instructions for doing so. Do I have to re-train, or can I configure that at export?

My specific use case is optimizing inference for throughput. I would like to reach 4,000 generations per second with a reasonable hardware cost (less than $5 per 1M inferences).

It seemed feasible with Neuron/Inferentia but the T5 model is bigger than Bert and I could only fit a few into memory (and that was using t5-base). It also only gave 3x performance over CPU with 5 parallel instances (all I could fit before running out of memory) vs similar CPU parallel execution.

Thanks!

Failed to create CUDAExecutionProvider

Hi,

After having obtained ONNX models (not quantized), I would like to run inference on GPU devices with setting onnx runtime:

model_sessions = get_onnx_runtime_sessions(model_paths, default=False, provider=['CUDAExecutionProvider'])

However, I get the following error:

Failed to create CUDAExecutionProvider. Please reference https://onnxruntime.ai/docs/reference/execution-providers/CUDA-ExecutionProvider.html#requirements to ensure all dependencies are met.

I checked that all dependencies are installed.
How can I fix it? Thanks in advance for your answer.

Getting runtime error.

Hi @Ki6an, it's great work. But while executing the code below,

from fastT5 import export_and_get_onnx_model
from transformers import AutoTokenizer

model_name = 't5-small'
model = export_and_get_onnx_model(model_name)

tokenizer = AutoTokenizer.from_pretrained(model_name)
t_input = "translate English to French: The universe is a dark forest."
token = tokenizer(t_input, return_tensors='pt')

tokens = model.generate(input_ids=token['input_ids'],
               attention_mask=token['attention_mask'],
               num_beams=2)

output = tokenizer.decode(tokens.squeeze(), skip_special_tokens=True)
print(output)

I'm getting this error.

RuntimeError: output with shape [5, 12, 1, 2] doesn't match the broadcast shape [5, 12, 2, 2]


Observing difference in outputs from decoder with IO bindings.

Hi @Ki6an, I was trying to implement IO bindings for the decoder part of the model. I used the same code from your repo to convert the model to ONNX. After loading the model, predictions made by calling the decoder session directly look fine, but with the inputs bound the results come out different.

Below is the code for the IO bindings:

def dec_pred_with_io_bindings(input_ids, attention_mask, encoder_output, past_key_values_dict,dec_session):
  dec_io_binding = dec_session.io_binding()
  dec_io_binding.bind_input(name="input_ids",
                          device_type="cuda",
                          device_id=0,
                          element_type=np.longlong,
                          shape=list(input_ids.shape),
                          buffer_ptr=input_ids.data_ptr())
  dec_io_binding.bind_input(name="encoder_attention_mask",
                          device_type="cuda",
                          device_id=0,
                          element_type=np.longlong,
                          shape=list(attention_mask.shape),
                          buffer_ptr=attention_mask.data_ptr())
                        
  dec_io_binding.bind_input(name="encoder_hidden_states",
                          device_type="cuda",
                          device_id=0,
                          element_type=np.float32,
                          shape=list(encoder_output.shape),
                          buffer_ptr=encoder_output.data_ptr())
  

  for key,val in past_key_values_dict.items():
    dec_io_binding.bind_input(name=key,
                                      device_type="cuda",
                                      device_id=0,
                                      element_type=np.float32,
                                      shape=list(val.shape),
                                      buffer_ptr=val.data_ptr())
  
  #Bind outputs.
  for arg in dec_session.get_outputs():
    dec_io_binding.bind_output(arg.name, "cuda")
    
  dec_session.run_with_iobinding(dec_io_binding)
  ort_output = dec_io_binding.get_outputs()

  logits=ort_output[0]

  list_pkv = tuple(torch.from_numpy(x.numpy()).cuda() for x in ort_output[1:])

  # creates a tuple of tuples of shape 6x4 from the above tuple
  out_past_key_values = tuple(
      list_pkv[i : i + 4] for i in range(0, len(list_pkv), 4)
  )


  return torch.from_numpy(logits.numpy()).cuda(),out_past_key_values

Not all outputs have names

When inspecting the generated onnx models I saw that the decoder and init_decoder have a lot of unnamed outputs.

You can see the output names of an onnx InferenceSession like this: list(map(lambda x: x.name, session.get_outputs()))

We can see that the first self attention is called past_key_values while the other outputs have generated names.

I was just wondering if this can be a problem, if the dynamic axes are not set for the other outputs.

Also, could it be possible that the output_past_key_values dynamic axes for sequence should be 2 (as in dyn_pkv inputs) instead of 1?

Cheers!

Fails to convert T0-3B

T0-3B is just a finetune of T5v1.1_3B_LMadapt (according to their paper), and in HF it loads via just the standard T5 code.

I am able to successfully ONNX-convert other T5 models on my computer.

But when I try on T0:

❯ MODEL=~/Downloads/PT_T0_3B poetry run test1_onnx
Loading model from /home/user/Downloads/PT_T0_3B
In-place op on output of tensor.shape. See https://pytorch.org/docs/master/onnx.html#avoid-inplace-operations-when-using-tensor-shape-in-tracing-mode
In-place op on output of tensor.shape. See https://pytorch.org/docs/master/onnx.html#avoid-inplace-operations-when-using-tensor-shape-in-tracing-mode
Exporting to onnx... |################################| 3/3
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/user/Dev/Proj/codet5_tests/codet5_tests/test1_onnx.py", line 63, in main
    typer.run(bean_cli)
  File "/home/user/.cache/pypoetry/virtualenvs/codet5-tests-LNmARMSb-py3.9/lib/python3.9/site-packages/typer/main.py", line 864, in run
    app()
  File "/home/user/.cache/pypoetry/virtualenvs/codet5-tests-LNmARMSb-py3.9/lib/python3.9/site-packages/typer/main.py", line 214, in __call__
    return get_command(self)(*args, **kwargs)
  File "/home/user/.cache/pypoetry/virtualenvs/codet5-tests-LNmARMSb-py3.9/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/user/.cache/pypoetry/virtualenvs/codet5-tests-LNmARMSb-py3.9/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/user/.cache/pypoetry/virtualenvs/codet5-tests-LNmARMSb-py3.9/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/user/.cache/pypoetry/virtualenvs/codet5-tests-LNmARMSb-py3.9/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/user/.cache/pypoetry/virtualenvs/codet5-tests-LNmARMSb-py3.9/lib/python3.9/site-packages/typer/main.py", line 500, in wrapper
    return callback(**use_params)  # type: ignore
  File "/home/user/Dev/Proj/codet5_tests/codet5_tests/test1_onnx.py", line 45, in bean_cli
    models = load_torch_models(os.environ.get("MODEL"))
  File "/home/user/Dev/Proj/codet5_tests/codet5_tests/test1_onnx.py", line 17, in load_torch_models
    model = export_and_get_onnx_model(model_path, onnx_model_output_path)
  File "/home/user/.cache/pypoetry/virtualenvs/codet5-tests-LNmARMSb-py3.9/lib/python3.9/site-packages/fastT5/onnx_models.py", line 219, in export_and_get_onnx_model
    quant_model_paths = quantize(onnx_model_paths)
  File "/home/user/.cache/pypoetry/virtualenvs/codet5-tests-LNmARMSb-py3.9/lib/python3.9/site-packages/fastT5/onnx_exporter.py", line 280, in quantize
    quantize_dynamic(
  File "/home/user/.cache/pypoetry/virtualenvs/codet5-tests-LNmARMSb-py3.9/lib/python3.9/site-packages/onnxruntime/quantization/quantize.py", line 308, in quantize_dynamic
    model = load_model(Path(model_input), optimize_model)
  File "/home/user/.cache/pypoetry/virtualenvs/codet5-tests-LNmARMSb-py3.9/lib/python3.9/site-packages/onnxruntime/quantization/quantize.py", line 53, in load_model
    return onnx.load(Path(model_path))
  File "/home/user/.cache/pypoetry/virtualenvs/codet5-tests-LNmARMSb-py3.9/lib/python3.9/site-packages/onnx/__init__.py", line 127, in load_model
    load_external_data_for_model(model, base_dir)
  File "/home/user/.cache/pypoetry/virtualenvs/codet5-tests-LNmARMSb-py3.9/lib/python3.9/site-packages/onnx/external_data_helper.py", line 69, in load_external_data_for_model
    load_external_data_for_tensor(tensor, base_dir)
  File "/home/user/.cache/pypoetry/virtualenvs/codet5-tests-LNmARMSb-py3.9/lib/python3.9/site-packages/onnx/external_data_helper.py", line 48, in load_external_data_for_tensor
    with open(external_data_file_path, 'rb') as data_file:
FileNotFoundError: [Errno 2] No such file or directory: '/home/user/Dev/Proj/codet5_tests/encoder.embed_tokens.weight'

Different behaviour when extending this project to Bart

Hello there. This is a really fantastic project. I'm trying to extend your work to Bart but I've run into some strange behaviour.

I've made a Colab notebook to illustrate the problem. Specifically when converting Bart to ONNX, the encoder_hidden_states input does not get included in the ONNX model's graph. As you can see from the notebook though, it works perfectly for T5.

I realise this is out of scope for the fastT5 project but thought someone who comes across this issue might have experienced a similar problem and be able to help. This may also be useful to know in case you have plans to expand this project to include models like Bart in the future.

Error from export_and_get_onnx_model()

Hi,

I am getting the following error when I run the test snippet provided in the readme.md file. However, I can see that 3 ONNX files are created: one for the encoder, one for the decoder, and one for the init decoder.

Environment
fastT5 - 0.0.7
MacOS - Big Sur (11.2.3)
Python conda - 3.7.6

Error
Exporting to onnx... |################################| 3/3
[libprotobuf ERROR google/protobuf/descriptor_database.cc:394] Invalid file descriptor data passed to EncodedDescriptorDatabase::Add().
[libprotobuf FATAL google/protobuf/descriptor.cc:1356] CHECK failed: GeneratedDatabase()->Add(encoded_file_descriptor, size):
libc++abi.dylib: terminating with uncaught exception of type google::protobuf::FatalException: CHECK failed: GeneratedDatabase()->Add(encoded_file_descriptor, size):

Process finished with exit code 134 (interrupted by signal 6: SIGABRT)

Test code snippet

from fastT5 import export_and_get_onnx_model
from transformers import AutoTokenizer

model_name = 't5-small'
model = export_and_get_onnx_model(model_name)

tokenizer = AutoTokenizer.from_pretrained(model_name)
t_input = "translate English to French: The universe is a dark forest."
token = tokenizer(t_input, return_tensors='pt')

tokens = model.generate(input_ids=token['input_ids'],
                        attention_mask=token['attention_mask'],
                        num_beams=2)

output = tokenizer.decode(tokens.squeeze(), skip_special_tokens=True)
print(output)

OnnxT5 creates potentially unneccessary objects

Hey,

right now the init function of OnnxT5 calls the init function of T5ForConditionalGeneration, which instantiates objects that we might not need.

E.g.:

>>> onnx_model = OnnxT5(model_or_model_path, onnx_model_sessions)
>>> onnx_model.shared
Embedding(32128, 512)

Maybe it would be more efficient to call super(T5ForConditionalGeneration, self).__init__(config) (and set things from T5ForConditionalGeneration.__init__ that are actually required manually)?

Cheers!

t5-11b out of memory/FileNotFoundError

First of all, this seems like a great repo that I was super excited to find!

When testing with t5-small everything works correctly. But when trying with my custom t5-11b I get out of memory issues.

I was running this with a t5-11b as model:
onnx_model_paths = generate_onnx_representation("t5-11b",model=model)

And at first I got this error:

RuntimeError: Exporting model exceed maximum protobuf size of 2GB. Please call torch.onnx.export with use_external_data_format=True.

So I simply added use_external_data_format=True to all of the three torch.onnx.export in onnx_exporter.py in fastT5.

Then I can run onnx_model_paths = generate_onnx_representation(model_name,model=model), and get no error (First time I posted I got an error but it seems like I made an error and only had 100 GB disk memory, when trying 200 GB it worked).

Then when running quant_model_paths = quantize(onnx_model_paths) I get the error:

FileNotFoundError                         Traceback (most recent call last)
<ipython-input-7-3a782b6d5a25> in <module>
      8 
      9 # Step 2. (recommended) quantize the converted model for fast inference and to reduce model size.
---> 10 quant_model_paths = quantize(onnx_model_paths)
     11 
     12 # step 3. setup onnx runtime

~/fastT5/fastT5/onnx_exporter.py in quantize(models_name_or_path)
    273             activation_type=QuantType.QUInt8,
    274             weight_type=QuantType.QUInt8,
--> 275             optimize_model=False,
    276         )  # op_types_to_quantize=['MatMul', 'Relu', 'Add', 'Mul' ],
    277         quant_model_paths.append(output_model_name)

/opt/conda/lib/python3.7/site-packages/onnxruntime/quantization/quantize.py in quantize_dynamic(model_input, model_output, op_types_to_quantize, per_channel, reduce_range, activation_type, weight_type, nodes_to_quantize, nodes_to_exclude, optimize_model, use_external_data_format)
    266         op_types_to_quantize = list(IntegerOpsRegistry.keys())
    267 
--> 268     model = load_model(Path(model_input), optimize_model)
    269     quantizer = ONNXQuantizer(
    270         model,

/opt/conda/lib/python3.7/site-packages/onnxruntime/quantization/quantize.py in load_model(model_path, optimize)
     51         return onnx_model.model
     52 
---> 53     return onnx.load(Path(model_path))
     54 
     55 

/opt/conda/lib/python3.7/site-packages/onnx/__init__.py in load_model(f, format, load_external_data)
    125         if model_filepath:
    126             base_dir = os.path.dirname(model_filepath)
--> 127             load_external_data_for_model(model, base_dir)
    128 
    129     return model

/opt/conda/lib/python3.7/site-packages/onnx/external_data_helper.py in load_external_data_for_model(model, base_dir)
     69     for tensor in _get_all_tensors(model):
     70         if uses_external_data(tensor):
---> 71             load_external_data_for_tensor(tensor, base_dir)
     72             # After loading raw_data from external_data, change the state of tensors
     73             tensor.data_location = TensorProto.DEFAULT

/opt/conda/lib/python3.7/site-packages/onnx/external_data_helper.py in load_external_data_for_tensor(tensor, base_dir)
     48     external_data_file_path = os.path.join(base_dir, file_location)
     49 
---> 50     with open(external_data_file_path, 'rb') as data_file:
     51 
     52         if info.offset:

FileNotFoundError: [Errno 2] No such file or directory: '/home/jupyter/encoder.embed_tokens.weight'

Has anyone successfully exported the t5-11b version and knows how to solve this?

Update:

I tried changing the working directory to /home/jupyter/models instead of /home/jupyter/, which seems to solve the FileNotFoundError. But then again I get problems with the size:

ValueError                                Traceback (most recent call last)
<ipython-input-10-032d95bca1c8> in <module>
      1 os.chdir(r'/home/jupyter/models/')
----> 2 quant_model_paths = quantize(onnx_model_paths)

~/fastT5/fastT5/onnx_exporter.py in quantize(models_name_or_path)
    273             activation_type=QuantType.QUInt8,
    274             weight_type=QuantType.QUInt8,
--> 275             optimize_model=False,
    276         )  # op_types_to_quantize=['MatMul', 'Relu', 'Add', 'Mul' ],
    277         quant_model_paths.append(output_model_name)

/opt/conda/lib/python3.7/site-packages/onnxruntime/quantization/quantize.py in quantize_dynamic(model_input, model_output, op_types_to_quantize, per_channel, reduce_range, activation_type, weight_type, nodes_to_quantize, nodes_to_exclude, optimize_model, use_external_data_format)
    278         nodes_to_quantize,
    279         nodes_to_exclude,
--> 280         op_types_to_quantize)
    281 
    282     quantizer.quantize_model()

/opt/conda/lib/python3.7/site-packages/onnxruntime/quantization/onnx_quantizer.py in __init__(self, model, per_channel, reduce_range, mode, static, weight_qType, input_qType, tensors_range, nodes_to_quantize, nodes_to_exclude, op_types_to_quantize)
     30 
     31         # run shape inference on the model
---> 32         model = onnx.shape_inference.infer_shapes(model)
     33         self.value_infos = {vi.name: vi for vi in model.graph.value_info}
     34         self.value_infos.update({ot.name: ot for ot in model.graph.output})

/opt/conda/lib/python3.7/site-packages/onnx/shape_inference.py in infer_shapes(model, check_type, strict_mode)
     34 def infer_shapes(model, check_type=False, strict_mode=False):  # type: (ModelProto, bool, bool) -> ModelProto
     35     if isinstance(model, ModelProto):
---> 36         model_str = model.SerializeToString()
     37         inferred_model_str = C.infer_shapes(model_str, check_type, strict_mode)
     38         return onnx.load_from_string(inferred_model_str)

ValueError: Message onnx.ModelProto exceeds maximum protobuf size of 2GB: 19459248612

Implemented the code for BART

Hi,
this is my first time contributing to an open-source project. fast-Bart
I wanted to extend the functionality for BART models - and took inspiration from this notebook along with the fastT5 repo.

I have been able to successfully achieve the following:

  1. Extend the ONNX conversion functionality to BART models - similar to T5
  2. Implement quantization for converted ONNX models (compatible with onnxruntime version 1.7.0)
  3. Implement OnnxBart

I know this is an issues forum and not the best place to share progress, but please share your feedback on it.

Error with "kiri-ai/t5-base-qa-summary-emotion" model

Hey,

If I understand correctly, then this library should work with all t5 models including finetuned ones right? When I try to generate onnx representation of model "kiri-ai/t5-base-qa-summary-emotion", I get the following error:
File "C:\Users\Oren\Anaconda3\envs\emotion-qa\lib\site-packages\transformers\models\t5\modeling_t5.py", line 499, in forward scores += position_bias RuntimeError: output with shape [5, 12, 1, 2] doesn't match the broadcast shape [5, 12, 2, 2]

Quantisation step for canned t5-small model persistently fails (See details)

I executed the code below and it fails with the following error. Please advise.

Fail: [ONNXRuntimeError] : 1 : FAIL : Load model from /content/models/t5-small-decoder-quantized.onnx failed:This is an invalid model. Error: Duplicate definition of name (pkv_11).

from fastT5 import (OnnxT5, get_onnx_runtime_sessions,
generate_onnx_representation, quantize)
from transformers import AutoTokenizer

model_or_model_path = 't5-small'

# Step 1. convert huggingfaces t5 model to onnx
onnx_model_paths = generate_onnx_representation(model_or_model_path)

# Step 2. (recommended) quantize the converted model for fast inference and to reduce model size.
quant_model_paths = quantize(onnx_model_paths)

# step 3. setup onnx runtime
model_sessions = get_onnx_runtime_sessions(quant_model_paths)

# step 4. get the onnx model
model = OnnxT5(model_or_model_path, model_sessions)

GPU support for fastT5

You have used onnxruntime, which is CPU-compatible, but can we look forward to onnxruntime-gpu support?

GPU Optimization

Thanks for sharing the repo. It is really helpful.

I'm exploring ways to do the optimization on GPU. I know it's not presently supported. Could you share an approach or references for implementing the optimization on GPU (Nvidia)?

forward() got an unexpected keyword argument 'cross_attn_head_mask'

----> 1 paraphrase_t5("Kyle Lowry scored 33 points and Norman Powell added 23 to lift the Toronto Raptors to a 122-125 victory over the Boston Celtics on Wednesday night.")

4 frames
/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

TypeError: forward() got an unexpected keyword argument 'cross_attn_head_mask'

Issue with quantize()

I'm trying to use a quantized t5-base model for my summarization task. I get an AttributeError when running the line of code below:
quant_model_paths = quantize(onnx_model_paths)

Error message below

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-8-d0497e8bf343> in <module>
      8 onnx_model_paths = generate_onnx_representation(model_name)
      9 
---> 10 quant_model_paths = quantize(onnx_model_paths)
     11 
     12 model_sessions = get_onnx_runtime_sessions(onnx_model_paths, default=False, provider = onnxruntime.get_available_providers())

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/fastT5/onnx_exporter.py in quantize(models_name_or_path)
    270             activation_type=QuantType.QUInt8,
    271             weight_type=QuantType.QUInt8,
--> 272             optimize_model=False,
    273         )  # op_types_to_quantize=['MatMul', 'Relu', 'Add', 'Mul' ],
    274         quant_model_paths.append(output_model_name)

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/onnxruntime/quantization/quantize.py in quantize_dynamic(model_input, model_output, op_types_to_quantize, per_channel, reduce_range, activation_type, weight_type, nodes_to_quantize, nodes_to_exclude, optimize_model, use_external_data_format)
    278         nodes_to_quantize,
    279         nodes_to_exclude,
--> 280         op_types_to_quantize)
    281 
    282     quantizer.quantize_model()

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/onnxruntime/quantization/onnx_quantizer.py in __init__(self, model, per_channel, reduce_range, mode, static, weight_qType, input_qType, tensors_range, nodes_to_quantize, nodes_to_exclude, op_types_to_quantize)
     30 
     31         # run shape inference on the model
---> 32         model = onnx.shape_inference.infer_shapes(model)
     33         self.value_infos = {vi.name: vi for vi in model.graph.value_info}
     34         self.value_infos.update({ot.name: ot for ot in model.graph.output})

AttributeError: module 'onnx' has no attribute 'shape_inference'

I can successfully run the code below without the line that calls quantize(onnx_model_paths); I get the above error at the quantize step.

from fastT5 import (OnnxT5, get_onnx_runtime_sessions,
                    generate_onnx_representation, quantize)
import onnxruntime
from transformers import AutoTokenizer

model_name = 't5-small'

onnx_model_paths = generate_onnx_representation(model_name) 

quant_model_paths = quantize(onnx_model_paths)

model_sessions = get_onnx_runtime_sessions(onnx_model_paths, default=False, provider = onnxruntime.get_available_providers()) 

encoder_sess, _, _ = model_sessions

print(encoder_sess.get_providers())

model = OnnxT5(model_name, model_sessions)

My current set up below.

fastt5==0.0.5
onnx==1.5.0
onnxruntime==1.7.0
onnxruntime-gpu==1.7.0
onnxruntime-tools==1.7.0
onnxt5==0.1.8
transformers==4.6.0(tried with transformers 4.5.0 as well)

Question about implementation details of `past_key_values`

Hi,

First thank you for your contribution! I'm working on converting Transformers' PEGASUS to ONNX and your repo is a very good reference. I have a few questions about your implementation regarding past_key_values.

Here's the code starting at https://github.com/Ki6an/fastT5/blob/master/fastT5/onnx_exporter.py#L94.

 # dummy inputs
    batch_size = 5
    n_heads = model_config.num_heads
    seq_length_a, seq_length_b = input_ids.shape
    d_kv = model_config.d_kv

    input_ids_dec = torch.ones((5, 1), dtype=torch.int64)
    attention_mask_dec = torch.ones((5, seq_length_b), dtype=torch.int64)
    enc_out = torch.ones(
        (batch_size, seq_length_b, model_config.d_model), dtype=torch.float32
    )

    # self_attention_past_key_values = torch.ones(
    #     (model_config.num_decoder_layers, 2, batch_size, n_heads, seq_length_a, d_kv), dtype=torch.float32)
    # cross_attention_past_key_values = torch.ones(
    #     (model_config.num_decoder_layers, 2, batch_size, n_heads, seq_length_b, d_kv), dtype=torch.float32)

    sa = torch.ones(
        (batch_size, n_heads, seq_length_a, d_kv), dtype=torch.float32
    )  # 1, 8, 1, 64
    ca = torch.ones(
        (batch_size, n_heads, seq_length_b, d_kv), dtype=torch.float32
    )  # 1, 8, 30, 64
    t5_block = (sa, sa, ca, ca)
    past_key_values = (t5_block,) * model_config.num_decoder_layers

    flat_past_key_values = functools.reduce(operator.iconcat, past_key_values, [])

Here's the question:

What do seq_length_a and seq_length_b represent? I believe input_ids.shape should be [batch_size, length]. t5_block is the past_key_values for a single transformer layer, and the shapes of sa and ca should be (batch_size, n_heads, decoder_seq_length, d_kv) and (batch_size, n_heads, encoder_seq_length, d_kv). So I guess seq_length_a and seq_length_b are the lengths of the decoder and encoder inputs, but they are not the same values as input_ids.shape. Can you provide more explanation on this? Thank you!

Error: Duplicate definition of name (pkv_11) using fine-tuned T5

Exporting to onnx... |################################| 3/3
Quantizing... |################################| 3/3
---------------------------------------------------------------------------
Fail                                      Traceback (most recent call last)
<ipython-input-1-d6e090e531c0> in <module>()
     19 
     20 # step 3. setup onnx runtime
---> 21 model_sessions = get_onnx_runtime_sessions(quant_model_paths)
     22 
     23 # step 4. get the onnx model

2 frames
/usr/local/lib/python3.7/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py in _create_inference_session(self, providers, provider_options, disabled_optimizers)
    308         session_options = self._sess_options if self._sess_options else C.get_default_session_options()
    309         if self._model_path:
--> 310             sess = C.InferenceSession(session_options, self._model_path, True, self._read_config_from_model)
    311         else:
    312             sess = C.InferenceSession(session_options, self._model_bytes, False, self._read_config_from_model)

Fail: [ONNXRuntimeError] : 1 : FAIL : Load model from /content/models/t5-base-128_en_paraphrasing-decoder-quantized.onnx failed:This is an invalid model. Error: Duplicate definition of name (pkv_11).

the method get_onnx_model() should not need the path to original model

Hi,

After having obtained our ONNX model thanks to the export_and_get_onnx_model() method, we want to load it in order to use it in production.

There is a method called get_onnx_model() which requires the following arguments: model_name_or_path, onnx_models_path=saved_models_path and quantized=True

This is conceptually strange: why do I need to access the original model (via model_name_or_path) when I want to use the ONNX one?

I explored the code and figured out that model_name_or_path is needed for 2 things:

  1. get the model name that was used to create the ONNX files names
  2. get model name configuration

In order to avoid passing the model_name_or_path argument to get_onnx_model(), the export_and_get_onnx_model() method could save the model configuration in the ONNX folder and use a standard naming scheme for the ONNX file names, since (in the latest version of fastT5) the ONNX folder path can now be customized.

What do you think?

Errors when loading saved onnx files

We have an issue with saving and loading onnx files.
When passing the generated quant_model_paths to get_onnx_runtime_sessions everything works okay, but if I save the files and then run get_onnx_runtime_sessions on the loaded quantized files, the model throws an error:

File "/Users/itai/Code/email-cleaner/.venv/lib/python3.7/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 184, in run
    raise ValueError("Model requires {} inputs. Input Feed contains {}".format(num_required_inputs, num_inputs))
ValueError: Model requires 3 inputs. Input Feed contains 2

This doesn't seem to happen on SageMaker, but it happens on a Mac and also in a containerized Linux environment.

Electra model loading into fastT5

I am trying to use the fastT5 library to do some benchmarking. When I try to load the model with
onnx_model_paths = generate_onnx_representation(trained_model_path)
it gives the errors:
'ElectraConfig' object has no attribute 'num_heads' and 'ElectraConfig' object has no attribute 'd_kv'

Not working

Error with transformers; I tried transformers==4.4.2 and also 4.2.2.

/usr/local/lib/python3.7/dist-packages/transformers/models/t5/modeling_t5.py in forward(self, hidden_states, mask, key_value_states, position_bias, past_key_value, head_mask, query_length, use_cache, output_attentions)
    496                 position_bias = position_bias + mask  # (batch_size, n_heads, seq_length, key_length)
    497 
--> 498         scores += position_bias
    499         attn_weights = F.softmax(scores.float(), dim=-1).type_as(
    500             scores

RuntimeError: output with shape [5, 12, 1, 2] doesn't match the broadcast shape [5, 12, 2, 2]

[Question] Any recommendations on hosting the exported models in Hugging-face or other hubs ? (see details)

Hi Kiran. Since we get 3 models (encoder, decoder, and init-decoder) out of this process instead of one PyTorch model and tokenizer, how are we supposed to host them or share them with others? Do we just upload these 3 files and share instructions to load them as follows, or is there a trick to this?

model_sessions = get_onnx_runtime_sessions(quant_model_paths)
quantised_model = OnnxT5(model_name, model_sessions)

OnnxT5 slower than Pytorch

Hi. I have created an OnnxT5 model (non-quantized) as shown in the Readme, but OnnxT5 is 10-20% slower than the original Hugging Face T5. Could you share how the latency difference shown in the repo was obtained? Thanks.

Mt5 model loading fails

Hello, I have an mT5 pretrained model and I am using the fastT5 approach to convert it to ONNX. The conversion works fine, but creating the decoder session at
decoder_sess = InferenceSession(str(path_to_decoder)) fails, more specifically at

# initialize the C++ InferenceSession
sess.initialize_session(providers, provider_options, disabled_optimizers)

It fails without any error message, only:
Process finished with exit code 135 (interrupted by signal 7: SIGEMT)
Loading the encoder model works, but not the decoder model.

I am using the latest version, fastt5==0.1.4.
Any ideas on how to create the session?

Support for py3.10

Hello, I just wanted to ask if Python 3.10 is supported yet. I have an Ubuntu server with the latest LTS release and there are conflicts with the pip packages; a log file of the stack trace has been attached. This happens when I try to run pip install -q transformers fastT5.
Are the latest python versions supported?
trace.log

Updating fastT5?

Hello @Ki6an.

Thanks a lot for fastT5! One question: are you planning to update it to the latest versions of ONNX, ONNX Runtime and Transformers?

Thank you.

ONNX Runtime Session Defaults Discrepancy

In the definition of get_onnx_runtime_sessions, we have n_threads: int = 4,. In the docstring, it says n_threads (int) : Sets the number of threads used to parallelize the execution within nodes. Default is 0 to let onnxruntime choose

Is there a reason that n_threads defaults to 4?

How to fix this when converting an mT5 model with max_length = 512

2022-03-10 09:07:54.967587868 [W:onnxruntime:, execution_frame.cc:811 VerifyOutputSizes] Expected shape from model of {batch,sequence,2,64} does not match actual shape of {5,12,21,64} for output output_past_key_values
2022-03-10 09:07:54.969066346 [W:onnxruntime:, execution_frame.cc:811 VerifyOutputSizes] Expected shape from model of {1,12,2,64} does not match actual shape of {5,12,21,64} for output 566
2022-03-10 09:07:54.973768695 [W:onnxruntime:, execution_frame.cc:811 VerifyOutputSizes] Expected shape from model of {1,12,2,64} does not match actual shape of {5,12,21,64} for output 710
2022-03-10 09:07:54.978314803 [W:onnxruntime:, execution_frame.cc:811 VerifyOutputSizes] Expected shape from model of {1,12,2,64} does not match actual shape of {5,12,21,64} for output 854
2022-03-10 09:07:54.982609990 [W:onnxruntime:, execution_frame.cc:811 VerifyOutputSizes] Expected shape from model of {1,12,2,64} does not match actual shape of {5,12,21,64} for output 998
2022-03-10 09:07:54.986836355 [W:onnxruntime:, execution_frame.cc:811 VerifyOutputSizes] Expected shape from model of {1,12,2,64} does not match actual shape of {5,12,21,64} for output 1142
2022-03-10 09:07:54.991021618 [W:onnxruntime:, execution_frame.cc:811 VerifyOutputSizes] Expected shape from model of {1,12,2,64} does not match actual shape of {5,12,21,64} for output 1286
2022-03-10 09:07:54.995182350 [W:onnxruntime:, execution_frame.cc:811 VerifyOutputSizes] Expected shape from model of {1,12,2,64} does not match actual shape of {5,12,21,64} for output 1430
2022-03-10 09:07:54.999290875 [W:onnxruntime:, execution_frame.cc:811 VerifyOutputSizes] Expected shape from model of {1,12,2,64} does not match actual shape of {5,12,21,64} for output 1574
2022-03-10 09:07:55.003348281 [W:onnxruntime:, execution_frame.cc:811 VerifyOutputSizes] Expected shape from model of {1,12,2,64} does not match actual shape of {5,12,21,64} for output 1718
2022-03-10 09:07:55.007545242 [W:onnxruntime:, execution_frame.cc:811 VerifyOutputSizes] Expected shape from model of {1,12,2,64} does not match actual shape of {5,12,21,64} for output 1862
2022-03-10 09:07:55.011838101 [W:onnxruntime:, execution_frame.cc:811 VerifyOutputSizes] Expected shape from model of {1,12,2,64} does not match actual shape of {5,12,21,64} for output 2006

I have onnx==1.11.0 and onnxruntime==1.10.0

Conversion of decoder with past_key_values to float16.

Hi @Ki6an. With the converted ONNX model generated, I was trying to convert the decoder_init and decoder to float16. I did the quantization with onnxruntime's transformer optimizer. I was able to convert the decoder_init to fp16, but while converting the decoder with past_key_values I get the following issue:

AssertionError Traceback (most recent call last)
in ()
25 ,
26 )
---> 27 optimized_model.convert_float_to_float16() # FP32 -> FP16
28 optimized_model.save_model_to_file('/content/optimized_models/t5-base-qa-qg-hl-decoder.onnx')
29

6 frames
/usr/local/lib/python3.7/dist-packages/onnxruntime/transformers/../tools/symbolic_shape_infer.py in add_suggested_merge(self, symbols, apply)
216
217 def add_suggested_merge(self, symbols, apply=False):
--> 218     assert all([(type(s) == str and s in self.symbolic_dims_) or is_literal(s) for s in symbols])
    219     symbols = set(symbols)
    220     for k, v in self.suggested_merge_.items():

AssertionError:

Code used:
from onnxruntime.transformers import optimizer

#Decoder.
optimized_model =optimizer.optimize_model(
input='/content/models/t5-base-qa-qg-hl-decoder.onnx',
use_gpu=True,
opt_level=1, only_onnxruntime= True
,
)
optimized_model.convert_float_to_float16() # FP32 -> FP16
optimized_model.save_model_to_file('/content/optimized_models/t5-base-qa-qg-hl-decoder.onnx')

How to fix when run convert model mT5

When using the converted model to generate, this error appears. Can you help fix it?

File "/data/dodx/anaconda3/envs/T5/lib/python3.8/site-packages/gradio/routes.py", line 260, in predict output = await run_in_threadpool(app.launchable.process_api, body, username) File "/data/dodx/anaconda3/envs/T5/lib/python3.8/site-packages/starlette/concurrency.py", line 39, in run_in_threadpool return await anyio.to_thread.run_sync(func, *args) File "/data/dodx/anaconda3/envs/T5/lib/python3.8/site-packages/anyio/to_thread.py", line 28, in run_sync return await get_asynclib().run_sync_in_worker_thread(func, *args, cancellable=cancellable, File "/data/dodx/anaconda3/envs/T5/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 818, in run_sync_in_worker_thread return await future File "/data/dodx/anaconda3/envs/T5/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 754, in run result = context.run(func, *args) File "/data/dodx/anaconda3/envs/T5/lib/python3.8/site-packages/gradio/interface.py", line 574, in process_api prediction, durations = self.process(raw_input) File "/data/dodx/anaconda3/envs/T5/lib/python3.8/site-packages/gradio/interface.py", line 611, in process predictions, durations = self.run_prediction( File "/data/dodx/anaconda3/envs/T5/lib/python3.8/site-packages/gradio/interface.py", line 532, in run_prediction prediction = predict_fn(*processed_input) File "api.py", line 54, in predict outputs_text = model.generate(input_ids = inputs_text["input_ids"], File "/data/dodx/anaconda3/envs/T5/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context return func(*args, **kwargs) File "/data/dodx/anaconda3/envs/T5/lib/python3.8/site-packages/transformers/generation_utils.py", line 1083, in generate inputs_tensor, model_input_name, model_kwargs = self._prepare_model_inputs(inputs, bos_token_id, model_kwargs) File "/data/dodx/anaconda3/envs/T5/lib/python3.8/site-packages/transformers/generation_utils.py", line 397, in _prepare_model_inputs and self.encoder.main_input_name != self.main_input_name File "/data/dodx/anaconda3/envs/T5/lib/python3.8/site-packages/torch/nn/modules/module.py", line 778, in __getattr__ raise ModuleAttributeError("'{}' object has no attribute '{}'".format( torch.nn.modules.module.ModuleAttributeError: 'T5Encoder' object has no attribute 'main_input_name'
