Triton Server YOLO Models

This repository serves as an example of deploying YOLO models on Triton Inference Server for performance and testing purposes. It includes support for applications developed with NVIDIA DeepStream.
Currently, only YOLOv7, YOLOv7 QAT, YOLOv9, and YOLOv9 QAT are supported; support for YOLOv8 is planned.

Triton Client Repository

For testing and evaluating YOLO models, you can use the repository triton-client-yolo.

Evaluation Test on TensorRT

The evaluation tests were performed using this client.

Model Details

| Model (ONNX > TensorRT) | Test Size | APval | AP50val | AP75val |
| --- | --- | --- | --- | --- |
| YOLOv9-C (FP16) | 640 | 52.9% | 70.1% | 57.7% |
| YOLOv9-C ReLU (FP16) | 640 | 51.7% | 68.8% | 56.3% |
| YOLOv9-E (FP16) | 640 | 55.4% | 72.6% | 60.3% |
| YOLOv9-C QAT | 640 | 52.7% | 69.8% | 57.5% |
| YOLOv9-C ReLU QAT | 640 | 51.6% | 69.7% | 56.3% |
| YOLOv9-E QAT | 640 | 55.3% | 72.4% | 60.2% |
| YOLOv7 (FP16) | 640 | 51.1% | 69.3% | 55.6% |
| YOLOv7x (FP16) | 640 | 52.9% | 70.8% | 57.4% |
| YOLOv7-QAT (INT8) | 640 | 50.9% | 69.2% | 55.5% |
| YOLOv7x-QAT (INT8) | 640 | 52.5% | 70.6% | 57.3% |

Evaluation Test, Original (PyTorch)

| Model | Test Size | APval | AP50val | AP75val |
| --- | --- | --- | --- | --- |
| YOLOv9-C | 640 | 53.0% | 70.2% | 57.8% |
| YOLOv9-E | 640 | 55.6% | 72.8% | 60.6% |
| YOLOv7 | 640 | 51.4% | 69.7% | 55.9% |
| YOLOv7-X | 640 | 53.1% | 71.2% | 57.8% |

Evaluation Comparison: TensorRT vs. PyTorch

| Model (ONNX > TensorRT) | Model (PyTorch) | Test Size | APval | AP50val | AP75val |
| --- | --- | --- | --- | --- | --- |
| YOLOv9-C (FP16) | YOLOv9-C | 640 | -0.2% | -0.1% | -0.1% |
| YOLOv9-E (FP16) | YOLOv9-E | 640 | -0.2% | -0.2% | -0.3% |
| YOLOv7 (FP16) | YOLOv7 | 640 | -0.3% | -0.4% | -0.3% |
| YOLOv7x (FP16) | YOLOv7-X | 640 | -0.2% | -0.4% | -0.4% |
| YOLOv7-QAT (INT8) | YOLOv7 | 640 | -0.5% | -0.5% | -0.4% |
| YOLOv7x-QAT (INT8) | YOLOv7-X | 640 | -0.2% | -0.4% | -0.4% |

Description

This repository uses models exported to ONNX. It offers two types of ONNX models:

  1. Model with End2End (Efficient NMS plugin) enabled.
    It comes in two variants:
  • A model optimized for evaluation (--topk-all 300 --iou-thres 0.7 --conf-thres 0.001).
  • A model optimized for inference (--topk-all 100 --iou-thres 0.45 --conf-thres 0.25).
  2. Model with Dynamic Batching only.
  • Non-Maximum Suppression must be handled by the client.

Detailed Models can be found here
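
To check which type a given ONNX export is, you can inspect its graph outputs, for example with Polygraphy (installable via pip if it is not already in your container). The filename below is purely illustrative:

# End2End exports list num_dets, det_boxes, det_scores and det_classes as outputs;
# exports without Efficient NMS expose the raw prediction tensor instead.
pip install polygraphy onnx
polygraphy inspect model yolov9-c-end2end.onnx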

Starting NVIDIA Triton Inference Server Container with Docker

Prerequisites

  • Docker
  • NVIDIA Container Toolkit
  • An NVIDIA GPU with a recent CUDA-capable driver

Usage

git clone https://github.com/levipereira/triton-server-yolo.git
cd triton-server-yolo
# Start Docker container
bash ./start-container-triton-server.sh
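
If you prefer to start the container manually, a typical invocation looks like the sketch below. The image tag, network settings, and mount point are assumptions (based on the 23.08 SDK container used later in this README and the /apps path used by the scripts); the helper script may differ:

docker run --gpus all -it --rm \
  --net=host --ipc=host \
  -v $(pwd):/apps \
  nvcr.io/nvidia/tritonserver:23.08-py3 /bin/bash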

Starting Triton Inference Server

Inside the Docker container, use bash ./start-triton-server.sh

Script Usage

This script builds TensorRT engines and starts Triton Server for the selected YOLO models.

cd /apps
bash ./start-triton-server.sh

Script Options

  • --models: Specify the YOLO model name(s). Choose one or more with comma separation. Available options: yolov9-c, yolov9-e, yolov7, yolov7x.
  • --model_mode: Use Model ONNX optimized for EVALUATION or INFERENCE. Choose from 'eval' or 'inference'.
  • --efficient_nms: Use the TRT Efficient NMS plugin. Options: 'enable' or 'disable'.
  • --opt_batch_size: Specify the optimal batch size for TensorRT engines.
  • --max_batch_size: Specify the maximum batch size for TensorRT engines.
  • --instance_group: Specify the number of TensorRT engine instances loaded per model in the Triton Server.
  • --force: Rebuild TensorRT engines even if they already exist.
  • --reset_all: Purge all existing TensorRT engines and their respective configurations.
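
The flags can be combined as needed; for example, to force a rebuild of an existing engine for a single model with two engine instances (values chosen purely for illustration):

cd /apps
bash ./start-triton-server.sh \
--models yolov9-c \
--model_mode eval \
--efficient_nms enable \
--opt_batch_size 4 \
--max_batch_size 8 \
--instance_group 2 \
--force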

Script Flow:

  1. Checks for the existence of YOLOv7/YOLOv9 ONNX model files.
  2. Downloads ONNX models if they do not exist.
  3. Converts YOLOv7/YOLOv9 ONNX model to TensorRT engine with FP16 precision.
  4. Updates configurations in the Triton Server config files.
  5. Starts Triton Inference Server.

Important Note: Building the TensorRT engine for each model can take more than 15 minutes. If TensorRT engines already exist, this script reuses them; use the --force flag to trigger a fresh rebuild.
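
For reference, step 3 of the flow is conceptually a trtexec build like the sketch below (file names and shape ranges are illustrative; the exact flags used by start-triton-server.sh may differ). trtexec ships inside the Triton container at /usr/src/tensorrt/bin:

# Illustrative FP16 build with a dynamic batch dimension (assumed file names)
/usr/src/tensorrt/bin/trtexec \
  --onnx=yolov9-c-end2end.onnx \
  --saveEngine=model.plan \
  --fp16 \
  --minShapes=images:1x3x640x640 \
  --optShapes=images:4x3x640x640 \
  --maxShapes=images:8x3x640x640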

Starting Triton Server using Models for Evaluation Purposes

example:

cd /apps
bash ./start-triton-server.sh  \
--models yolov9-c,yolov7 \
--model_mode eval \
--efficient_nms enable \
--opt_batch_size 4 \
--max_batch_size 4 \
--instance_group 1 

Starting Triton Server using Models for Inference Purposes

example:

cd /apps
bash ./start-triton-server.sh  \
--models yolov9-c,yolov7 \
--model_mode inference \
--efficient_nms enable \
--opt_batch_size 4 \
--max_batch_size 4 \
--instance_group 1 

Starting Triton Server using Models for Inference Purposes without Efficient NMS

example:

cd /apps
bash ./start-triton-server.sh  \
--models yolov9-c,yolov7 \
--model_mode inference \
--efficient_nms disable \
--opt_batch_size 4 \
--max_batch_size 4 \
--instance_group 1 


After running the script, you can verify the availability of the models by checking this output:

+----------+---------+--------+
| Model    | Version | Status |
+----------+---------+--------+
| yolov7   | 1       | READY  |
| yolov7x  | 1       | READY  |
| yolov9-c | 1       | READY  |
| yolov9-e | 1       | READY  |
+----------+---------+--------+
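
You can also confirm readiness over Triton's HTTP endpoint (port 8000 by default); the model name below is just one of the models started above:

# Server-wide readiness (returns HTTP 200 once all models are loaded)
curl -v localhost:8000/v2/health/ready
# Per-model readiness and metadata
curl -v localhost:8000/v2/models/yolov9-c/ready
curl localhost:8000/v2/models/yolov9-c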

Manual ONNX YOLOv7/v9 exports

Exporting YOLOv7 Series from PyTorch to ONNX With Efficient NMS plugin

This repository does not export PyTorch models to ONNX.
You can use the official YOLOv7 repository:

python export.py --weights yolov7.pt \
  --grid \
  --end2end \
  --dynamic-batch \
  --simplify \
  --topk-all 100 \
  --iou-thres 0.65 \
  --conf-thres 0.35 \
  --img-size 640 640

Exporting YOLOv9 Series from PyTorch to ONNX With Efficient NMS plugin

This repository does not export PyTorch models to ONNX. You can use the official YOLOv9 repository:

python3 export.py \
   --weights ./yolov9-c.pt \
   --imgsz 640 \
   --topk-all 100 \
   --iou-thres 0.65 \
   --conf-thres 0.35 \
   --include onnx_end2end

Additional Configurations

See Triton Model Configuration Documentation for more info.

Example of YOLO Configuration

Note:

  • The values of 100 in the det_boxes/det_scores/det_classes dimensions correspond to the --topk-all value.
  • The setting "max_queue_delay_microseconds: 30000" is optimized for a 30fps input rate.
name: "yolov9-c"
platform: "tensorrt_plan"
max_batch_size: 8
input [
  {
    name: "images"
    data_type: TYPE_FP32
    dims: [ 3, 640, 640 ]
  }
]
output [
  {
    name: "num_dets"
    data_type: TYPE_INT32
    dims: [ 1 ]
  },
  {
    name: "det_boxes"
    data_type: TYPE_FP32
    dims: [ 100, 4 ]
  },
  {
    name: "det_scores"
    data_type: TYPE_FP32
    dims: [ 100 ]
  },
  {
    name: "det_classes"
    data_type: TYPE_INT32
    dims: [ 100 ]
  }
]
instance_group [
    {
      count: 4
      kind: KIND_GPU
      gpus: [ 0 ]
    }
]
version_policy: { latest: { num_versions: 1}}
dynamic_batching {
  max_queue_delay_microseconds: 3000
}
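
To see the configuration Triton actually loaded for a model (including any values rewritten by start-triton-server.sh), you can query the config endpoint of a running server:

curl localhost:8000/v2/models/yolov9-c/config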

Running Performance Tests with Perf Analyzer

See the Triton Perf Analyzer documentation for more info.

On the Host Machine:

docker run -it --ipc=host --net=host nvcr.io/nvidia/tritonserver:23.08-py3-sdk /bin/bash

$ ./install/bin/perf_analyzer -m yolov7 -u 127.0.0.1:8001 -i grpc --shared-memory system --concurrency-range 1
*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Client:
    Request count: 7524
    Throughput: 417.972 infer/sec
    Avg latency: 2391 usec (standard deviation 1235 usec)
    p50 latency: 2362 usec
    p90 latency: 2460 usec
    p95 latency: 2484 usec
    p99 latency: 2669 usec
    Avg gRPC time: 2386 usec ((un)marshal request/response 4 usec + response wait 2382 usec)
  Server:
    Inference count: 7524
    Execution count: 7524
    Successful request count: 7524
    Avg request latency: 2280 usec (overhead 30 usec + queue 18 usec + compute input 972 usec + compute infer 1223 usec + compute output 36 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 417.972 infer/sec, latency 2391 usec
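
To explore how throughput scales beyond a single in-flight request, you can sweep concurrency and request batch size with the same tool (values below are illustrative):

./install/bin/perf_analyzer -m yolov9-c -u 127.0.0.1:8001 -i grpc \
  --shared-memory system \
  -b 4 \
  --concurrency-range 1:8:1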

