Comments (8)

glenn-jocher commented on July 1, 2024

Hello! Thanks for reaching out with your observations on batch size impacts during engine exports.

Indeed, the behavior you're seeing with the newer drivers and Ultralytics updates is expected. Recent optimizations and updates in both our software and the underlying drivers can lead to improved performance, even at varying batch sizes. The newer versions are designed to better utilize hardware capabilities, which might explain why you're seeing consistent or improved performance across different batch sizes.

It's great to hear that your models are performing more efficiently! If you have any more questions or need further clarification, feel free to ask. Happy modeling! 🚀

Burhan-Q commented on July 1, 2024

@CySlider I'm not sure how you're benchmarking, but I ran a small experiment and you can find the code below if you want to try it out. Perhaps I misunderstood your concern, but it helps to remember that the batch inference time (from Ultralytics) is the model's throughput time for inference (how long it took the model to return a result). When $8$ images are passed to the model, the reported time covers all images processed in parallel, and the per-image inference time would be, on average, $\text{process time} / \text{batch size}$.

The results table is just a quick demo of what I mean. There's no true measure of per-image time when batching, because the model processes the images in parallel, so the per-image cost can only be estimated by dividing the batch time by the batch size.
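
For example, using the averages in the table below: a batch of $8$ takes about $1.77$ ms, which works out to roughly $1.77 / 8 \approx 0.22$ ms per image.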

I've also learned that warming up the GPU is definitely an important step for measurement stability. I don't know if your benchmark accounts for this, but something I'd recommend if not.

Export command

yolo export model=yolov8s.pt format=engine half=True dynamic=True batch=8 workspace=5
Measurement code used

from pathlib import Path

import cv2
import numpy as np

from ultralytics import YOLO

im_path = Path("coco128/images/train2017")  # COCO 128 images
imgs = sorted(im_path.glob("*.jpg"))

model = YOLO("yolov8s.engine")

warmup = 15  # arbitrary warm up iterations
batch_size = 1
# batch_size = 2
# batch_size = 4
# batch_size = 8

dummy = np.random.randint(0, 255, (640, 640, 3), np.uint8)

speeds = []
for i in range(0, len(imgs), batch_size):
    # Read images
    b = [cv2.imread(str(im)) for im in imgs[i: i + batch_size]]
    # Warm up
    _ = [model.predict([dummy] * batch_size, verbose=False) for _ in range(warmup)][0]
    # Inference
    r = model.predict(b, batch=batch_size)[0]
    speeds.append({"total": sum(r.speed.values()), **r.speed})


# Overall Slowest batch
max([s["inference"] for s in speeds])
# Overall Slowest image (average)
max([s["inference"] for s in speeds]) / batch_size

# Overall Fastest batch
min([s["inference"] for s in speeds])
# Overall Fastest image (average)
min([s["inference"] for s in speeds]) / batch_size

# Average per image
sum([s["inference"] for s in speeds]) / len(imgs)
# Average per batch
sum([s["inference"] for s in speeds]) / (len(imgs) / batch_size)

# could be looped to run all batch sizes in a single go
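
For reference, a rough sketch of that loop (reusing the model, imgs, dummy, and warmup variables from the script above, and doing the warm-up once per batch size instead of once per image batch):

# Sketch: run the measurement for several batch sizes in one go
all_speeds = {}
for batch_size in (1, 2, 4, 8):
    # Warm up once per batch size
    for _ in range(warmup):
        model.predict([dummy] * batch_size, verbose=False)
    speeds = []
    for i in range(0, len(imgs), batch_size):
        b = [cv2.imread(str(im)) for im in imgs[i : i + batch_size]]
        r = model.predict(b, batch=batch_size, verbose=False)[0]
        speeds.append(r.speed)
    all_speeds[batch_size] = speeds

# Summarize the reported inference times per batch size
for bs, batch_speeds in all_speeds.items():
    inference = [s["inference"] for s in batch_speeds]
    print(
        f"batch={bs}: slowest={max(inference):.2f} ms, "
        f"fastest={min(inference):.2f} ms, "
        f"average={sum(inference) / len(inference):.2f} ms"
    )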

Results

Measure    Batch  Slowest (ms)  Fastest (ms)  Average (ms)
per batch  1      4.00          4.00          3.76
per image  1      4.00          4.00          3.76
per batch  2      2.74          2.60          2.62
per image  2      1.37          1.30          1.31
per batch  4      2.09          2.05          2.06
per image  4      0.52          0.51          0.52
per batch  8      1.78          1.76          1.77
per image  8      0.22          0.22          0.22
System info

Using tensorrt==8.6.1

Ultralytics YOLOv8.2.28 🚀 Python-3.10.12 torch-2.2.0+cu121 CUDA:0 (NVIDIA GeForce RTX 2060, 5924MiB)
Setup complete ✅ (12 CPUs, 15.6 GB RAM, 76.2/101.0 GB disk)

OS                  Linux-6.6.10-76060610-generic-x86_64-with-glibc2.35
Environment         Linux
Python              3.10.12
Install             git
RAM                 15.56 GB
CPU                 AMD Ryzen 5 1600 Six-Core Processor
CUDA                12.1

matplotlib          ✅ 3.8.1>=3.3.0
opencv-python       ✅ 4.8.1.78>=4.6.0
pillow              ✅ 10.1.0>=7.1.2
pyyaml              ✅ 6.0.1>=5.3.1
requests            ✅ 2.31.0>=2.23.0
scipy               ✅ 1.11.3>=1.4.1
torch               ✅ 2.2.0>=1.8.0
torchvision         ✅ 0.17.0>=0.9.0
tqdm                ✅ 4.66.1>=4.64.0
psutil              ✅ 5.9.6
py-cpuinfo          ✅ 9.0.0
thop                ✅ 0.1.1-2209072238>=0.1.1
pandas              ✅ 2.1.3>=1.1.4
seaborn             ✅ 0.13.0>=0.11.0

CySlider commented on July 1, 2024

Thanks a lot @Burhan-Q! This makes far more sense.

I will test your code. In my benchmark I do a warm-up and then measure the time myself, and of course I divide the resulting time by the batch size, at least I think so.

CySlider commented on July 1, 2024

One issue I see in your test is that you use an engine exported for batch size 8 with smaller batches. I could imagine that this improves speed more and more as you approach the batch size it was exported for. So you would need to compare an export for batch size 4 run with batches of 4 against an export for batch size 8 run with batches of 8.

Also, I use dynamic=False because, at least back then, my tests showed that half=True only works with dynamic=False. This might have changed.
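
A sketch of what that matched comparison could look like (using the Python export API with the same arguments as the CLI command quoted earlier, and assuming export() returns the path of the exported engine, which can then be loaded directly):

from ultralytics import YOLO

# Sketch: build one static-shape engine per batch size, then benchmark each
# engine only at the batch size it was exported for
for bs in (4, 8):
    engine_path = YOLO("yolov8s.pt").export(
        format="engine", half=True, dynamic=False, batch=bs, workspace=5
    )
    model = YOLO(engine_path)
    # ... run the measurement loop from above with batch_size=bs before the next
    # export, since each export writes to the same yolov8s.engine file ...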

CySlider commented on July 1, 2024

OK, I did test your code. I added my own timing around the predict call:

import time

time_before = time.time()
r = model.predict(b, batch=batch_size)[0]
time_after = time.time()
print(f"Total: {time_after - time_before}")

and for batch size 4 I get: Total: 0.0068817138671875
and for batch size 8 I get: Total: 0.012621164321899414

so no real improvement, which matches my own benchmark results.
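
(Per image, that works out to roughly 0.0069 s / 4 ≈ 1.7 ms versus 0.0126 s / 8 ≈ 1.6 ms, so only a marginal gain from doubling the batch size.)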

glenn-jocher commented on July 1, 2024

Thanks for running those tests and sharing your results! It's interesting to see that the performance doesn't scale as expected with the increase in batch size. This could be due to several factors, including how the model handles memory and computational resources at different batch sizes, especially when dynamic=False is set.

Regarding your point about using dynamic=False and half=True, it's true that certain configurations might behave differently depending on the specific hardware and software environment. It might be worth experimenting with different settings for dynamic and observing how they impact performance on your specific setup.

If you continue to see no improvement with larger batch sizes, it could be beneficial to look into more detailed profiling of the model's execution on the GPU to identify any potential bottlenecks or inefficiencies. Tools like NVIDIA's Nsight Systems or Nsight Compute could provide deeper insights into what's happening under the hood.

Let's keep the discussion going if you have more updates or need further assistance! 🚀

CySlider commented on July 1, 2024

OK, I found a setup in its original form and could run some tests, and I think I now see what changed:

The first two rows are the old setup (8.1.4) and the last two are the new setup (8.2.22); all times are in ms:

                 Internal inference                  predict            Own postprocessing       Total
Version  Batch   preprocess  inference  postprocess  total   per image  postprocess  per image   with postproc.  per image
8.1.4    4       1.4         4.3        0.4          45.6    11.4       18.5         4.6         77.1            19.3
8.1.4    48      1.8         3.6        0.4          312.5   6.5        78.7         1.6         405.7           8.4
8.2.22   4       1.4         4.3        0.4          26.0    6.5        6.0          1.5         32.0            8.0
8.2.22   48      1.8         3.6        0.4          292.0   6.0        69.9         1.5         361.9           7.5

The internal inference speed is unchanged between versions. And it seems even back then there was not a big speedup from bigger batch sizes: the inference time went down a bit (4.3 → 3.6 ms), but the preprocess time went up (1.4 → 1.8 ms), so only about a 0.3 ms speedup per image, or ~5%.

However, I did a fully integrated benchmark using predict with some extra features:

results = self.model.predict(
    source=frames,
    augment=False,  # whether data augmentation should be performed on input (hue, resize, flip, etc.)
    visualize=False,
    save=False,
    iou=0.7,
    device=self.selected_device,
    classes=None,
    agnostic_nms=True,
    max_det=50,
    imgsz=get_image_inverted_img_size(self.image_size),  # h x w here
    batch=batch_size,
    half=True,
)

And here a lot happened for small batch sizes: the time per image went down from 11.4 to 6.4 ms, nearly half.

Also, my own postprocessing went down from 4.6 to 1.5 ms, which is a bit puzzling to me, as the code did not change, it is still the same Python version, and the code is also quite boring, mainly remapping some info.

        time_predict = (time_after - time_before) * 1000
        print(f"{log_prefix}Time for prediction of {batch_size} images: {time_predict} ms. {time_predict/batch_size} ms per Image")

        valid_detections_batch = []
        for result in results:
            detections = []
            for b in result.boxes:
                # by convention we swap x/y to y/x, later calculations depend on it
                box          = [float( b.xyxyn[0][1] ), float( b.xyxyn[0][0] ), float( b.xyxyn[0][3] ), float( b.xyxyn[0][2] )]
                tl_x         = int( b.xyxy[0][0] )
                tl_y         = int( b.xyxy[0][1] )
                left_top     = ( tl_x, tl_y )
                br_x         = int( b.xyxy[0][2] )
                br_y         = int( b.xyxy[0][3] )
                right_bottom = ( br_x, br_y  )
                score        = float(b.conf[0])
                label_id     = int( b.cls )
                label        = self.category_index[label_id+1]['name']
                detection = {'label_id': label_id,
                             'label': label,
                             'score':score,
                             'box': box,
                             'left_top': left_top,
                             'right_bottom':right_bottom}
                detections.append( detection )
            valid_detections_batch.append(detections)


        time_postprocess = (time.time() - time_after) * 1000
        total_time = (time.time() - start_time) * 1000
        print(f"{log_prefix}Time for postprocessing of {batch_size} images: {time_postprocess} ms. {time_postprocess/batch_size} ms per Image. Total time: {total_time} ms or {total_time/batch_size} ms per image")

I guess some packages got a major performance boost.

So my takeaway is that using batch sizes beyond 8 is not worth the memory now. I still wonder if this is how it should be, or if there is a deeper bug still to be uncovered (and a lot of performance with it).

PS: Full benchmark results with new setting:

Final Benchmark Results for M09_YOLO_V8_LARGE_1280x1280_ARAMCO_ROUND_2 on NVIDIA_RTX_A4000:
Resolution                             1: 832x480                       2: 1280x768                       3: 1920x1088          
                                              I/s    Memory                     I/s    Memory                      I/s    Memory
Export Format                           1: engine 1: engine               1: engine 1: engine                1: engine 1: engine
Test Type   Batch Proc                                                                                                          
Incremental 1     1                         100.6    1952.0                    62.0    2020.0                     37.9    2156.0
            4     1                         129.7    2098.0                    72.7    2382.0                     38.6    2952.0
            8     1                         137.0    2286.0                    69.2    2882.0                     38.1    3964.0
            12    1                         138.8    2472.0                    72.3    3362.0                     38.3    4988.0
            16    1                         133.4    2702.0                    69.8    3840.0                     37.9    6016.0
            20    1                         133.2    2890.0                    69.6    4318.0                     38.1    7040.0
            24    1                         136.0    3090.0                    69.7    4798.0                     85.7    7518.0
            28    1                         134.3    3282.0                    69.7    5278.0                     38.0    9088.0
            32    1                         136.0    3480.0                    68.9    5760.0                      NaN       NaN
            40    1                         133.2    3866.0                    68.8    6728.0                      NaN       NaN
            48    1                         133.0    4248.0                    68.6    7690.0                      NaN       NaN
            56    1                         133.1    4640.0                     NaN       NaN                      NaN       NaN
            64    1                         133.4    5036.0                     NaN       NaN                      NaN       NaN
            80    1                         133.0    5814.0                     NaN       NaN                      NaN       NaN
            96    1                         133.2    6596.0                     NaN       NaN                      NaN       NaN
            112   1                         132.7    7380.0                     NaN       NaN                      NaN       NaN
            128   1                         132.7    8168.0                     NaN       NaN                      NaN       NaN
Parallel    Mixed 1     136.5 (b=8) [136.5-136.5]    2288.0  73.4 (b=4) [73.4-73.4]    2384.0  85.1 (b=24) [85.1-85.1]    7520.0
                  2       121.7 (b=8) [60.9-60.9]    4576.0  71.8 (b=4) [35.9-35.9]    4766.0   36.9 (b=1) [18.4-18.4]    4314.0
                  3       133.0 (b=8) [44.3-44.3]    6862.0  76.9 (b=4) [25.6-25.7]    7148.0   37.1 (b=1) [12.4-12.4]    6470.0
                  4       135.9 (b=8) [34.0-34.0]    9148.0  78.9 (b=4) [19.7-19.7]    9530.0     38.0 (b=1) [9.5-9.5]    8626.0

glenn-jocher commented on July 1, 2024

Hello @CySlider,

Thank you for sharing your detailed findings and the comprehensive benchmarks! It's great to see the depth of your analysis and the effort you've put into understanding the performance characteristics of different batch sizes and setups.

From your results, it appears that the internal inference speeds remain consistent, which is a good sign that the model's core computational efficiency is stable across versions. The significant reduction in time per image for smaller batch sizes in the new setup, as well as the decrease in your postprocessing times, indeed suggest that there might have been improvements in the underlying libraries or the way batch processing is handled in the newer version of the software.

The fact that larger batch sizes beyond 8 don't show proportional performance gains could be due to several factors, including GPU memory bandwidth saturation, inefficiencies in parallel processing at higher batch sizes, or the overhead of managing larger batches outweighing the computational benefits.

It's also worth noting that improvements in the underlying software or CUDA libraries could lead to better utilization of the GPU, which might explain why newer versions perform better even without significant changes in your code.

Your approach to benchmarking and analysis is quite thorough, and it provides valuable insights into how different factors influence performance. If you suspect there might still be underlying issues or potential for further optimization, it could be beneficial to profile the GPU during execution to identify any bottlenecks or inefficiencies.

Again, thank you for your detailed feedback and for using Ultralytics YOLO. If you have any more questions or need further assistance, feel free to reach out. Happy coding! 🚀
