Implement and benchmark ONNX Runtime for Inference about haystack HOT 16 CLOSED

deepset-ai commented on May 14, 2024

Implement and benchmark ONNX Runtime for Inference

from haystack.

Comments (16)

tanaysoni commented on May 14, 2024 2

Hi @ahotrod, we are testing ONNX Runtime Inference with FARM. We ran a preliminary benchmark to compare it with PyTorch Inference for the forward pass of a model and observed ~2x performance gain.

We plan to implement it in FARM(deepset-ai/FARM#276) and then do an end-to-end benchmark in Haystack.


tanaysoni commented on May 14, 2024 2

Hi @ahotrod

We used the tutorial notebook to run more benchmarks comparing the performance of ONNX and PyTorch Inference with different batch sizes.

Here's the code for benchmarks

# %env CUDA_LAUNCH_BLOCKING=1

# ONNX Runtime Inference

import os
import time

import onnxruntime as rt
from torch.utils.data import DataLoader

sess_options = rt.SessionOptions()

# Set graph optimization level to ORT_ENABLE_EXTENDED to enable bert optimization.
sess_options.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_EXTENDED

# To enable model serialization and store the optimized graph to desired location.
sess_options.optimized_model_filepath = os.path.join(output_dir, "optimized_model.onnx")
session = rt.InferenceSession(output_model_path, sess_options)

for batch_size in (1, 2, 4, 8, 16, 32, 64):
    runtimes = []
    for _ in range(5):
        dataloader = DataLoader(dataset=dataset, batch_size=batch_size)
        batch = next(iter(dataloader))
        batch = tuple(t.to("cpu") for t in batch)
        inputs = {
            'input_ids':      batch[0],                       
            'attention_mask': batch[1],
            'token_type_ids': batch[2],
        }

        # evaluate the model
        start = time.time()
        res = session.run(None, {
                    'input_ids': inputs['input_ids'].cpu().numpy(),
                    'input_mask': inputs['attention_mask'].cpu().numpy(),
                    'segment_ids': inputs['token_type_ids'].cpu().numpy()
                })
        end = time.time()
        runtimes.append(end-start)
    print(f"ONNX Runtime inference time for batch_size {batch_size}: {round(sum(runtimes)/len(runtimes), 4)}")


# PyTorch Inference
model.to("cuda")
for batch_size in (1, 2, 4, 8, 16, 32, 64):
    runtimes = []
    for _ in range(5):
        dataloader = DataLoader(dataset=dataset, batch_size=batch_size)
        batch = next(iter(dataloader))
        batch = tuple(t.to("cuda") for t in batch)
        inputs = {
            'input_ids':      batch[0],
            'attention_mask': batch[1],
            'token_type_ids': batch[2],
        }

        # evaluate the model
        start = time.time()
        outputs = model(**inputs)
        end = time.time()
        runtimes.append(end-start)
    print(f"PyTorch inference time for batch_size {batch_size}: {round(sum(runtimes)/len(runtimes), 4)}")

The benchmarks were run on an AWS EC2 p3.2xlarge (V100 GPU) instance with pytorch v1.4.0, transformers v2.4.0, onnx v1.6.0, and onnxruntime-gpu v1.2.0.

Here's the comparison of Inference times (in seconds)

Batch Size	ONNX	PyTorch	ONNX SpeedUp
1	0.0075	0.0307	4.09
2	0.0089	0.0329	3.70
4	0.0128	0.0364	2.84
8	0.0193	0.0482	2.50
16	0.0348	0.0660	1.90
32	0.0648	0.1068	1.65
64	0.1288	0.1621	1.26

ONNX inference appears faster than PyTorch at smaller batch sizes, but the gap narrows as the batch size grows: the speedup falls from 4.09x at batch size 1 to 1.26x at batch size 64. Wondering if there's any further optimization that could be done for ONNX Runtime with respect to batch sizing?
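The speedup column in the table above is just the PyTorch time divided by the ONNX time for each batch size. A minimal sketch that recomputes it from the reported numbers (the variable names are illustrative and not taken from the original notebook):

```python
# Inference times in seconds, copied from the table above:
# batch_size -> (ONNX Runtime, PyTorch)
timings = {
    1: (0.0075, 0.0307),
    2: (0.0089, 0.0329),
    4: (0.0128, 0.0364),
    8: (0.0193, 0.0482),
    16: (0.0348, 0.0660),
    32: (0.0648, 0.1068),
    64: (0.1288, 0.1621),
}

# Speedup = PyTorch time / ONNX time, rounded to two decimals as in the table.
speedups = {bs: round(pt / onnx, 2) for bs, (onnx, pt) in timings.items()}

for bs, speedup in speedups.items():
    print(f"batch_size {bs:>2}: {speedup:.2f}x")
```

Recomputing the column this way makes it easy to extend the comparison if more batch sizes are measured later.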


ahotrod commented on May 14, 2024 1

@tanaysoni There is on-going further speed optimization of Bert w/ONNX here:
Add Bert Optimization Notebooks #3204

Checks are still running as I type this, with 1 pending review. I have not had an opportunity to evaluate the changes, but after a cursory review, it appears there are significant changes to nine supporting code files, and the primary change to the forked notebook is:

# Use contiguous array as input could improve performance.
ort_inputs = {'input_ids': numpy.ascontiguousarray(inputs['input_ids'].cpu().numpy()),
              'input_mask': numpy.ascontiguousarray(inputs['attention_mask'].cpu().numpy()),
              'segment_ids': numpy.ascontiguousarray(inputs['token_type_ids'].cpu().numpy())
}

# Warm up with one run.
session.run(None, ort_inputs)

# Measure the latency.
start = time.time()
results = session.run(None, ort_inputs)
end = time.time()

PyTorch cuda Inference time = 30.92 ms
ONNX Runtime cuda inference time: 9.97 ms
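As a rough illustration of the two ideas in that snippet, the sketch below shows what numpy.ascontiguousarray does and how a warm-up run is kept out of the measurement. It uses a stand-in function instead of a real session.run call; fake_run and the array shape are assumptions made purely for illustration. Arrays obtained from a contiguous tensor are usually already contiguous, in which case ascontiguousarray returns them without copying.

```python
import time

import numpy as np

# A transposed view shares memory with its base array and is not C-contiguous.
ids = np.arange(12, dtype=np.int64).reshape(3, 4).T
print(ids.flags["C_CONTIGUOUS"])    # False

# ascontiguousarray copies into a C-ordered buffer
# (it returns the input unchanged if it is already contiguous).
ids_c = np.ascontiguousarray(ids)
print(ids_c.flags["C_CONTIGUOUS"])  # True


def fake_run(x):
    """Hypothetical stand-in for session.run(None, ort_inputs)."""
    return x.sum()


# Warm up with one run that is not timed.
fake_run(ids_c)

# Measure the latency of a second run.
start = time.perf_counter()
result = fake_run(ids_c)
latency_ms = (time.perf_counter() - start) * 1000
print(f"latency: {latency_ms:.3f} ms")
```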

Note that one of the commits is "Allow test multiple batch_size".

FYI ref: Graph Optimizations in ONNX Runtime
rt.GraphOptimizationLevel.ORT_ENABLE_ALL doesn't appear to add anything for GPU, only for CPU, and rt.GraphOptimizationLevel.ORT_ENABLE_EXTENDED already adds BERT Embedding Layer Fusion.


tanaysoni commented on May 14, 2024 1

@ahotrod thank you for all the pointers! I could reproduce the results in the newly updated notebook with a batch_size of 1. I'll now try with different sizes and also benchmark the integration with FARM.


ahotrod commented on May 14, 2024 1

@tanaysoni I have no experience/benchmark for BERT SQuAD with TensorRT or nGraph.

Historical early comments on "productionizing the (HF) models" from Oct2019: https://medium.com/huggingface/benchmarking-transformers-pytorch-and-tensorflow-e2917fb891c2

There's an informative intro into "The Serving Problem" at this 20 April 2020 blog post:
https://blog.einstein.ai/benchmarking-tensorrt-inference-server/

This blog post also benchmarks the newly-named NVIDIA Triton (formerly TensorRT) Inference Server. Early impression includes the positive that Triton hosts models from multiple frameworks (ONNX, PyTorch and TensorFlow) and multiple HF Transformer Language models, e.g. BERT, ALBERT, GPT2 and CTRL mentioned in the blog. Downside is that Triton links to a proprietary NVIDIA hardware solution, no surprise, maybe even requiring NVIDIA's DGX in their GPU Cloud, not sure.

Triton github: https://github.com/NVIDIA/triton-inference-server/tree/d7cc183b7611f7775e1808b0a9d25a36e3d6e055#roadmap

I have just begun looking for a reasonable cloud inferencing solution for large (ALBERT-xxlarge, RoBERTa-large, eventually Elastic-large, ... etc.) HF Transformer QA models, either in Tensorflow or Pytorch, compatible with Haystack. Reasonable in that each inferencing of a single 6K-word or less document can take low seconds, not tens of seconds.

Looks like I will be using an AWS cloud solution. Currently working on a domain vocabulary file for AWS Transcribe.


tholor commented on May 14, 2024 1

@ahotrod yep, I believe onnx-runtime can be a good alternative to PyTorch and is becoming increasingly popular. Great to see that Transformers is also implementing it! We will try to support this in Haystack.

On our end, we finished the implementation in FARM and recently added some benchmark scripts. Maybe the results are interesting to you: Google Spreadsheet

The speedup is particularly significant for smaller batches and when the ONNX optimizations for V100 (or similar devices) are applied.

We will work on getting a "FarmOnnxReader" into Haystack.


ahotrod commented on May 14, 2024 1

@tholor @tanaysoni

You may find this interesting: HF ONNX


ahotrod commented on May 14, 2024

@tanaysoni Wondering if you have taken the next step of importing the ONNX model into TensorRT for NVIDIA Cuda inferencing performance gains in a production environment?

Seen any ONNX support for RoBERTa?


tanaysoni commented on May 14, 2024

@ahotrod we want to explore different execution providers for Haystack in the coming days. Do you have any experience/benchmark for BERT SQuAD with TensorRT or nGraph?

I haven't yet seen an ONNX for RoBERTa but would be excited to test it out in Haystack!


tanaysoni commented on May 14, 2024

Hi @ahotrod, thank you for all the pointers!

I did a quick test running an ONNX model on TensorRT(V100 GPU) using this Dockerfile, but the benchmarks did not show performance gains. I'll have to investigate further before posting the results.

Meanwhile, we are also working on implementing an inference speed benchmarking pipeline in FARM-#321. This will help reproduce benchmarks for different models, execution providers, batch sizing, and other params.


ahotrod commented on May 14, 2024

@tanaysoni FYI ONNX Conversion Script just posted.
Will be following with interest/implications for my models.
Conformity/coordination with your work in Haystack-Farm?


tanaysoni commented on May 14, 2024

Hi @ahotrod, ONNX support is now added in Haystack with #157!


raphychek commented on May 14, 2024

Hi @tanaysoni! Measuring inference time on GPU (at least in PyTorch, not that sure about ONNX) doesn't work well like this, as executions on GPU are asynchronous. The results you have might be incorrect. You should check this link, which explains it: https://towardsdatascience.com/the-correct-way-to-measure-inference-time-of-deep-neural-networks-304a54e5187f


tholor commented on May 14, 2024

Hey @raphychek, not sure which of our code you are referring to here? You are totally right that GPU computations are asynchronous, and that's why we usually use torch.cuda.synchronize() between GPU operations that we measure OR measure on an outer scope where the GPU was forced to sync (e.g. when assigning back to CPU or aggregating results as in some of the above snippets).
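The first option (synchronizing before the clock is stopped) can be sketched roughly as below. This is a minimal illustration and not the FARM benchmark code: time_call and its parameters are hypothetical names, and the demo uses a CPU-only function so it runs without a GPU.

```python
import time


def time_call(fn, sync=None, warmup=1, repeats=5):
    """Return the mean wall-clock seconds per call of fn().

    sync is an optional zero-argument callable that blocks until queued
    device work has finished, e.g. torch.cuda.synchronize for a CUDA model.
    """
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()  # make sure warm-up work is not counted in the first sample

    runtimes = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for the device before stopping the clock
        runtimes.append(time.perf_counter() - start)
    return sum(runtimes) / len(runtimes)


# CPU-only demo so the sketch runs anywhere. On a GPU one would pass
# something like fn=lambda: model(**inputs), sync=torch.cuda.synchronize.
mean_s = time_call(lambda: sum(range(100_000)))
print(f"mean time: {mean_s:.6f} s")
```

Without the sync call, the timer may stop as soon as the GPU kernels have been queued, which can understate the real inference time.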


raphychek commented on May 14, 2024

Hi @tholor. I might be learning something new here, so thank you for that! What do you mean by "aggregating results" and how does it allow the GPU to be forced to sync?

From my own experiments, measuring the time of GPU-inferred operations by subtracting time.time() values gave different (and sometimes inconsistent) results compared with using torch.cuda.synchronize() and measuring time with torch.cuda.Event(). Hence the part that seems suspect to me is this one, especially knowing your model is on cuda (model.to("cuda") in your code):

        start = time.time()
        outputs = model(**inputs)
        end = time.time()
        runtimes.append(end-start)
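For comparison, the torch.cuda.Event approach mentioned above could look roughly like the sketch below. The helper name cuda_event_time_ms and the matrix-multiply workload are assumptions for illustration. The import guard and the None return exist only so the sketch still runs on machines without PyTorch or CUDA.

```python
try:
    import torch
except ImportError:  # keep the sketch runnable without PyTorch installed
    torch = None


def cuda_event_time_ms(fn):
    """Time fn() with CUDA events; return None when CUDA is unavailable."""
    if torch is None or not torch.cuda.is_available():
        return None
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    fn()
    end.record()
    torch.cuda.synchronize()  # wait until both events have completed
    return start.elapsed_time(end)  # elapsed time in milliseconds


if torch is not None and torch.cuda.is_available():
    x = torch.randn(1024, 1024, device="cuda")
    work = lambda: x @ x  # hypothetical stand-in for model(**inputs)
    work()  # warm-up run, not timed
else:
    work = lambda: None

elapsed = cuda_event_time_ms(work)
print("CUDA not available" if elapsed is None else f"{elapsed:.3f} ms")
```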


tholor commented on May 14, 2024

Ah, I see you are referring to this code snippet above. I believe the code that you quoted from there could indeed be problematic - depends on the implementation of the model forward pass though. Imagine an operation that sums all logits and prints the sum. Such an operation forces the GPU to sync. Unfortunately, the notebook linked in the comment seems to be deleted by now.

However, the script above was just one of our earlier benchmark runs. For ONNX we had a couple of other scripts later on that actually make explicit use of torch.cuda.synchronize(), see:
https://github.com/deepset-ai/FARM/blob/7305a17979b0a80dbe2dbebe5815450883f20627/farm/infer.py#L645

Hope this is helpful :)
