Comments (8)
Hello! Thanks for reaching out with your observations on batch size impacts during engine exports.
Indeed, the behavior you're seeing with the newer drivers and Ultralytics updates is expected. Recent optimizations and updates in both our software and the underlying drivers can lead to improved performance, even at varying batch sizes. The newer versions are designed to better utilize hardware capabilities, which might explain why you're seeing consistent or improved performance across different batch sizes.
It's great to hear that your models are performing more efficiently! If you have any more questions or need further clarification, feel free to ask. Happy modeling! 🚀
@CySlider I'm not sure how you're benchmarking, but I ran a small experiment, and you can find the code below if you want to try it out. Perhaps I misunderstood your concern, but it helps to remember that the batch inference time reported by Ultralytics is the model's throughput time for inference (how long the model took to return a result for the whole batch).
The results table below is just a quick demo of what I mean. There's no true per-image time when batching, because the model processes the images in parallel, so a per-image figure can only be estimated by dividing the batch time by the batch size.
I've also learned that warming up the GPU is definitely an important step for measurement stability. I don't know if your benchmark accounts for this, but it's something I'd recommend if not.
Export command
yolo export model=yolov8s.pt format=engine half=True dynamic=True batch=8 workspace=5
Measurement code used
from pathlib import Path

import cv2
import numpy as np

from ultralytics import YOLO

im_path = Path("coco128/images/train2017")  # COCO128 images
imgs = sorted(im_path.glob("*.jpg"))
model = YOLO("yolov8s.engine")

warmup = 15  # arbitrary number of warm-up iterations
batch_size = 1  # also ran with 2, 4, and 8

# Warm up the GPU once, before any timing
dummy = np.random.randint(0, 255, (640, 640, 3), np.uint8)
for _ in range(warmup):
    _ = model.predict([dummy] * batch_size, verbose=False)

speeds = []
for i in range(0, len(imgs), batch_size):
    # Read one batch of images
    b = [cv2.imread(str(im)) for im in imgs[i : i + batch_size]]
    # Inference; r.speed holds preprocess/inference/postprocess times in ms
    r = model.predict(b, batch=batch_size)[0]
    speeds.append({"total": sum(r.speed.values()), **r.speed})

inference = [s["inference"] for s in speeds]
print(f"Slowest batch: {max(inference):.2f} ms")
print(f"Slowest image (batch average): {max(inference) / batch_size:.2f} ms")
print(f"Fastest batch: {min(inference):.2f} ms")
print(f"Fastest image (batch average): {min(inference) / batch_size:.2f} ms")
print(f"Average per image: {sum(inference) / len(imgs):.2f} ms")
print(f"Average per batch: {sum(inference) / (len(imgs) / batch_size):.2f} ms")
# could be looped to run all batch sizes in a single go
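A rough sketch of that loop, reusing `imgs` and `model` from above (the dynamic engine export above supports batch sizes up to 8):

```python
# Sketch: run the same measurement for every batch size in one go
for batch_size in (1, 2, 4, 8):
    speeds = []
    for i in range(0, len(imgs), batch_size):
        b = [cv2.imread(str(im)) for im in imgs[i : i + batch_size]]
        r = model.predict(b, batch=batch_size, verbose=False)[0]
        speeds.append(r.speed["inference"])
    print(
        f"batch={batch_size}: min {min(speeds):.2f} / max {max(speeds):.2f} / "
        f"mean {sum(speeds) / len(speeds):.2f} ms reported inference time"
    )
```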
Results
| Measure | Batch | Slowest (ms) | Fastest (ms) | Average (ms) |
|---|---|---|---|---|
| per batch | 1 | 4.00 | 4.00 | 3.76 |
| per image | 1 | 4.00 | 4.00 | 3.76 |
| per batch | 2 | 2.74 | 2.60 | 2.62 |
| per image | 2 | 1.37 | 1.30 | 1.31 |
| per batch | 4 | 2.09 | 2.05 | 2.06 |
| per image | 4 | 0.52 | 0.51 | 0.52 |
| per batch | 8 | 1.78 | 1.76 | 1.77 |
| per image | 8 | 0.22 | 0.22 | 0.22 |
System info
Using tensorrt==8.6.1
Ultralytics YOLOv8.2.28 🚀 Python-3.10.12 torch-2.2.0+cu121 CUDA:0 (NVIDIA GeForce RTX 2060, 5924MiB)
Setup complete ✅ (12 CPUs, 15.6 GB RAM, 76.2/101.0 GB disk)
OS Linux-6.6.10-76060610-generic-x86_64-with-glibc2.35
Environment Linux
Python 3.10.12
Install git
RAM 15.56 GB
CPU AMD Ryzen 5 1600 Six-Core Processor
CUDA 12.1
matplotlib ✅ 3.8.1>=3.3.0
opencv-python ✅ 4.8.1.78>=4.6.0
pillow ✅ 10.1.0>=7.1.2
pyyaml ✅ 6.0.1>=5.3.1
requests ✅ 2.31.0>=2.23.0
scipy ✅ 1.11.3>=1.4.1
torch ✅ 2.2.0>=1.8.0
torchvision ✅ 0.17.0>=0.9.0
tqdm ✅ 4.66.1>=4.64.0
psutil ✅ 5.9.6
py-cpuinfo ✅ 9.0.0
thop ✅ 0.1.1-2209072238>=0.1.1
pandas ✅ 2.1.3>=1.1.4
seaborn ✅ 0.13.0>=0.11.0
Thanks a lot @Burhan-Q! This makes far more sense.
I will test your code. In my benchmark I do a warm-up and then time the calls myself, and of course I divide the resulting time by the batch size, at least I think so.
One issue I see in your test is that you use an engine exported for batch size 8 with smaller batches. I could imagine this improves speed the closer the batch gets to the export's optimal batch size. So you would need to compare an export for batch size 4 run with batches of 4 against an export for batch size 8 run with batches of 8.
Also, I use dynamic=False because, at least back then, my tests showed that half=True would only work with dynamic=False. This might have changed.
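As a sketch of what I mean, using the Python export API (each export writes yolov8s.engine, so the file would need renaming between runs):

```python
from ultralytics import YOLO

# Sketch: export one fixed-batch engine per batch size under test, so each
# batch size is benchmarked against an engine exported for exactly that size
for bs in (4, 8):
    model = YOLO("yolov8s.pt")
    model.export(format="engine", half=True, dynamic=False, batch=bs, workspace=5)
    # rename yolov8s.engine (e.g. to yolov8s_b4.engine / yolov8s_b8.engine),
    # then run the benchmark for each engine only at its own batch size
```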
Ok, I did test your code. I added timing around the predict call:

import time

time_before = time.time()
r = model.predict(b, batch=batch_size)[0]
time_after = time.time()
print(f"Total: {time_after - time_before}")
and for batch size 4 I get: Total: 0.0068817138671875
and for batch size 8 I get: Total: 0.012621164321899414
so roughly double the time for double the batch size, i.e. no real per-image improvement, which matches my own benchmark results.
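For steadier numbers than a single time.time() sample, a sketch with a warm-up pass and time.perf_counter() averaged over repeated calls (reusing `b`, `model`, and `batch_size` from above) could look like this:

```python
import time

# Warm up so the measurement doesn't include engine/CUDA initialization
for _ in range(15):
    model.predict(b, batch=batch_size, verbose=False)

reps = 100
t0 = time.perf_counter()
for _ in range(reps):
    model.predict(b, batch=batch_size, verbose=False)
dt = (time.perf_counter() - t0) / reps * 1000  # mean ms per batch
print(f"batch={batch_size}: {dt:.2f} ms per batch, {dt / batch_size:.3f} ms per image")
```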
Thanks for running those tests and sharing your results! It's interesting to see that the performance doesn't scale as expected with the increase in batch size. This could be due to several factors, including how the model handles memory and computational resources at different batch sizes, especially when dynamic=False is set.
Regarding your point about using dynamic=False and half=True, it's true that certain configurations might behave differently depending on the specific hardware and software environment. It might be worth experimenting with different settings for dynamic and observing how they impact performance on your specific setup.
If you continue to see no improvement with larger batch sizes, it could be beneficial to look into more detailed profiling of the model's execution on the GPU to identify any potential bottlenecks or inefficiencies. Tools like NVIDIA's Nsight Systems or Nsight Compute could provide deeper insights into what's happening under the hood.
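If you want something lighter-weight to start with, a rough sketch using PyTorch's built-in profiler (reusing `model`, `b`, and `batch_size` from the snippets above; note that TensorRT kernels may only show up as opaque launches) could be:

```python
from torch.profiler import ProfilerActivity, profile

# Profile a handful of predict() calls on both CPU and CUDA
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        model.predict(b, batch=batch_size, verbose=False)

# Show the ops with the highest total CUDA time
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
```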
Let's keep the discussion going if you have more updates or need further assistance! 🚀
Ok, I found a setup in its original form and could run some tests, and I think I now see what changed.
The first two rows are the old setup (8.1.4) and the last two are the new setup (8.2.22):
| Setup | Batch | Preprocess (ms) | Inference (ms) | Postprocess (ms) | Predict total (ms) | Predict per image (ms) | Own postprocess (ms) | Own postproc. per image (ms) | Total (ms) | Total per image (ms) |
|---|---|---|---|---|---|---|---|---|---|---|
| 8.1.4 | 4 | 1.4 | 4.3 | 0.4 | 45.6 | 11.4 | 18.5 | 4.6 | 77.1 | 19.3 |
| 8.1.4 | 48 | 1.8 | 3.6 | 0.4 | 312.5 | 6.5 | 78.7 | 1.6 | 405.7 | 8.4 |
| 8.2.22 | 4 | 1.4 | 4.3 | 0.4 | 26.0 | 6.5 | 6.0 | 1.5 | 32.0 | 8.0 |
| 8.2.22 | 48 | 1.8 | 3.6 | 0.4 | 292.0 | 6.0 | 69.9 | 1.5 | 361.9 | 7.5 |

(The first three timing columns are the model-internal preprocess/inference/postprocess speeds; "Total" includes my own postprocessing.)
The internal inference speed is unchanged between versions. And it seems that even back then there was no big speedup from bigger batch sizes: going from batch 4 to 48, the inference time went down a bit (4.3 → 3.6 ms), but the preprocess time went up (1.4 → 1.8 ms), so the net gain is only ~0.3 ms per image, or ~5%.
However, I did a fully integrated benchmark using predict with some extra features:
results = self.model.predict(
    source=frames,
    augment=False,  # whether data augmentation (hue, resize, flip, etc.) should be applied to the input
    visualize=False,
    save=False,
    iou=0.7,
    device=self.selected_device,
    classes=None,
    agnostic_nms=True,
    max_det=50,
    imgsz=get_image_inverted_img_size(self.image_size),  # h x w here
    batch=batch_size,
    half=True,
)
And here a lot happened for small batch sizes: time per image went down from 11.4 ms to 6.5 ms, nearly half.
My own postprocessing also went down from 4.6 to 1.5 ms per image, which is a bit puzzling to me, as the code did not change, it's still the same Python version, and the code is quite boring, mainly remapping some info:
# time_before / time_after / start_time are set around the predict() call above
time_predict = (time_after - time_before) * 1000
print(f"{log_prefix}Time for prediction of {batch_size} images: {time_predict} ms. {time_predict / batch_size} ms per image")

valid_detections_batch = []
for result in results:
    detections = []
    for b in result.boxes:
        # by convention we swap x/y to y/x, later calculations depend on it
        box = [float(b.xyxyn[0][1]), float(b.xyxyn[0][0]), float(b.xyxyn[0][3]), float(b.xyxyn[0][2])]
        tl_x = int(b.xyxy[0][0])
        tl_y = int(b.xyxy[0][1])
        left_top = (tl_x, tl_y)
        br_x = int(b.xyxy[0][2])
        br_y = int(b.xyxy[0][3])
        right_bottom = (br_x, br_y)
        score = float(b.conf[0])
        label_id = int(b.cls)
        label = self.category_index[label_id + 1]['name']
        detection = {'label_id': label_id,
                     'label': label,
                     'score': score,
                     'box': box,
                     'left_top': left_top,
                     'right_bottom': right_bottom}
        detections.append(detection)
    valid_detections_batch.append(detections)

time_postprocess = (time.time() - time_after) * 1000
total_time = (time.time() - start_time) * 1000
print(f"{log_prefix}Time for postprocessing of {batch_size} images: {time_postprocess} ms. {time_postprocess / batch_size} ms per image. Total time: {total_time} ms or {total_time / batch_size} ms per image")
I guess some packages got a major performance boost.
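If the per-box Python loop ever becomes the bottleneck again, I might try a vectorized variant; a rough sketch (same Results/Boxes API as above, untested) would be:

```python
def remap_result(result, category_index):
    """Sketch: vectorized version of the per-box remapping above."""
    boxes = result.boxes
    xyxyn = boxes.xyxyn.cpu().numpy()  # normalized x1, y1, x2, y2 per box
    xyxy = boxes.xyxy.cpu().numpy().astype(int)  # pixel coordinates
    conf = boxes.conf.cpu().numpy()
    cls = boxes.cls.cpu().numpy().astype(int)
    return [
        {
            "label_id": int(c),
            "label": category_index[int(c) + 1]["name"],
            "score": float(s),
            # by convention we swap x/y to y/x
            "box": [float(n[1]), float(n[0]), float(n[3]), float(n[2])],
            "left_top": (int(p[0]), int(p[1])),
            "right_bottom": (int(p[2]), int(p[3])),
        }
        for n, p, s, c in zip(xyxyn, xyxy, conf, cls)
    ]
```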
So my takeaway is that using batch sizes beyond 8 is not worth the memory now. I still wonder whether this is how it should be, or whether there is a deeper bug still to be uncovered (and a lot of performance with it).
PS: Full benchmark results with the new setup:
Final Benchmark Results for M09_YOLO_V8_LARGE_1280x1280_ARAMCO_ROUND_2 on NVIDIA_RTX_A4000 (export format: engine for all columns):

Incremental (Proc = 1):

| Batch | 832x480 I/s | 832x480 Memory | 1280x768 I/s | 1280x768 Memory | 1920x1088 I/s | 1920x1088 Memory |
|---|---|---|---|---|---|---|
| 1 | 100.6 | 1952.0 | 62.0 | 2020.0 | 37.9 | 2156.0 |
| 4 | 129.7 | 2098.0 | 72.7 | 2382.0 | 38.6 | 2952.0 |
| 8 | 137.0 | 2286.0 | 69.2 | 2882.0 | 38.1 | 3964.0 |
| 12 | 138.8 | 2472.0 | 72.3 | 3362.0 | 38.3 | 4988.0 |
| 16 | 133.4 | 2702.0 | 69.8 | 3840.0 | 37.9 | 6016.0 |
| 20 | 133.2 | 2890.0 | 69.6 | 4318.0 | 38.1 | 7040.0 |
| 24 | 136.0 | 3090.0 | 69.7 | 4798.0 | 85.7 | 7518.0 |
| 28 | 134.3 | 3282.0 | 69.7 | 5278.0 | 38.0 | 9088.0 |
| 32 | 136.0 | 3480.0 | 68.9 | 5760.0 | NaN | NaN |
| 40 | 133.2 | 3866.0 | 68.8 | 6728.0 | NaN | NaN |
| 48 | 133.0 | 4248.0 | 68.6 | 7690.0 | NaN | NaN |
| 56 | 133.1 | 4640.0 | NaN | NaN | NaN | NaN |
| 64 | 133.4 | 5036.0 | NaN | NaN | NaN | NaN |
| 80 | 133.0 | 5814.0 | NaN | NaN | NaN | NaN |
| 96 | 133.2 | 6596.0 | NaN | NaN | NaN | NaN |
| 112 | 132.7 | 7380.0 | NaN | NaN | NaN | NaN |
| 128 | 132.7 | 8168.0 | NaN | NaN | NaN | NaN |

Parallel Mixed (I/s shown with best batch size in parentheses and per-process range in brackets):

| Proc | 832x480 I/s | 832x480 Memory | 1280x768 I/s | 1280x768 Memory | 1920x1088 I/s | 1920x1088 Memory |
|---|---|---|---|---|---|---|
| 1 | 136.5 (b=8) [136.5-136.5] | 2288.0 | 73.4 (b=4) [73.4-73.4] | 2384.0 | 85.1 (b=24) [85.1-85.1] | 7520.0 |
| 2 | 121.7 (b=8) [60.9-60.9] | 4576.0 | 71.8 (b=4) [35.9-35.9] | 4766.0 | 36.9 (b=1) [18.4-18.4] | 4314.0 |
| 3 | 133.0 (b=8) [44.3-44.3] | 6862.0 | 76.9 (b=4) [25.6-25.7] | 7148.0 | 37.1 (b=1) [12.4-12.4] | 6470.0 |
| 4 | 135.9 (b=8) [34.0-34.0] | 9148.0 | 78.9 (b=4) [19.7-19.7] | 9530.0 | 38.0 (b=1) [9.5-9.5] | 8626.0 |
Hello @CySlider,
Thank you for sharing your detailed findings and the comprehensive benchmarks! It's great to see the depth of your analysis and the effort you've put into understanding the performance characteristics of different batch sizes and setups.
From your results, it appears that the internal inference speeds remain consistent, which is a good sign that the model's core computational efficiency is stable across versions. The significant reduction in time per image for smaller batch sizes in the new setup, as well as the decrease in your postprocessing times, indeed suggests that there have been improvements in the underlying libraries or in the way batch processing is handled in the newer version of the software.
The fact that larger batch sizes beyond 8 don't show proportional performance gains could be due to several factors, including GPU memory bandwidth saturation, inefficiencies in parallel processing at higher batch sizes, or the overhead of managing larger batches outweighing the computational benefits.
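To put rough numbers on that, using the batch timings you posted earlier:

```python
# Throughput implied by the timings above: (batch size, seconds per batch)
for bs, t in [(4, 0.0068817138671875), (8, 0.012621164321899414)]:
    print(f"batch={bs}: {bs / t:.0f} images/s")
# batch=4 -> ~581 images/s, batch=8 -> ~634 images/s: only ~9% more
# throughput for double the batch size, far from proportional scaling
```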
It's also worth noting that improvements in the underlying software or CUDA libraries could lead to better utilization of the GPU, which might explain why newer versions perform better even without significant changes in your code.
Your approach to benchmarking and analysis is quite thorough, and it provides valuable insights into how different factors influence performance. If you suspect there might still be underlying issues or potential for further optimization, it could be beneficial to profile the GPU during execution to identify any bottlenecks or inefficiencies.
Again, thank you for your detailed feedback and for using Ultralytics YOLO. If you have any more questions or need further assistance, feel free to reach out. Happy coding! 🚀