Slowness seen using PiecewiseAffineTransform compared to scikit-image version

JHancox commented on August 25, 2024

Slowness seen using PiecewiseAffineTransform compared to scikit-image version


Comments (2)

grlee77 commented on August 25, 2024

Hi @JHancox, thanks for reporting this. Can you specify the shape and dtype of imgrid?

Unfortunately, PiecewiseAffineTransform is an outlier in cuCIM in that it currently does not have a proper GPU implementation and will be faster on the CPU. We should consider printing a warning at runtime and adding a Note to this effect in the docstring, or removing it from the library. It currently has to copy data to the CPU to run scipy.spatial.Delaunay, for which CuPy does not have a GPU implementation.

warp should be faster on the GPU if the image is sufficiently large, but in this case inverse_map is a PiecewiseAffineTransform callable rather than a cupy.ndarray, so it will be slow.
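To illustrate what a coordinate-array inverse_map looks like, here is a minimal CPU-side sketch using NumPy. The helper name `coords_from_callable` and the `shift` function are hypothetical, made up for this example. It assumes the scikit-image convention, which cuCIM is expected to mirror: a callable receives an (M, 2) array of (col, row) pairs, and an array inverse_map for a 2-D image has shape (2, rows, cols) holding (row, col) input coordinates.

```python
import numpy as np


def coords_from_callable(inverse_fn, output_shape):
    """Evaluate a (col, row) -> (col, row) inverse map on every output pixel.

    Returns an array of shape (2, rows, cols) holding the (row, col) input
    coordinate for each output pixel. Hypothetical helper, not a library API.
    """
    rows, cols = output_shape
    # Row and column index of every output pixel, each of shape (rows, cols).
    rr, cc = np.indices((rows, cols), dtype=np.float32)
    # Callables passed to warp receive an (M, 2) array of (col, row) pairs.
    out_xy = np.stack([cc.ravel(), rr.ravel()], axis=1)
    in_xy = np.asarray(inverse_fn(out_xy), dtype=np.float32)
    # Reorder to (row, col) and restore the image grid layout.
    return np.stack([in_xy[:, 1], in_xy[:, 0]]).reshape(2, rows, cols)


def shift(xy):
    # Stand-in for a real inverse map: sample 20 px to the right, 10 px down.
    return xy + np.array([20.0, 10.0], dtype=np.float32)


coords = coords_from_callable(shift, (255, 255))
print(coords.shape, coords.dtype)
```

Under these assumptions, the map is evaluated once on the CPU and the result could then be moved to the GPU with `cp.asarray(coords)` and passed as inverse_map, so that warp itself no longer has to call back into a Python callable.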

In general, for warp, if you are able to supply inverse_map as a cupy.ndarray instead of a callable and the image is not too small, the GPU should be faster. A quick rule of thumb is that the CPU is expected to be faster if an image is very small, like (256, 256) (especially if it fits in the L1 cache of the CPU). For medium sizes such as (512, 512) or (1024, 1024), the GPU should start to become faster. Above several MB in size, the GPU should be much faster. For the GPU, it is also beneficial to ensure that the input is single precision, to avoid relatively slow double-precision arithmetic.
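To put the sizes in that rule of thumb into bytes, a small arithmetic sketch follows. The helper `image_bytes` is hypothetical, and the only assumption is 4 bytes per float32 element and 8 bytes per float64 element.

```python
import math


def image_bytes(shape, itemsize):
    """Memory footprint in bytes of a dense array with the given shape."""
    return math.prod(shape) * itemsize


for shape in [(256, 256), (512, 512), (1024, 1024)]:
    f32 = image_bytes(shape, 4)  # single precision
    f64 = image_bytes(shape, 8)  # double precision
    print(f"{shape}: float32 = {f32 / 2**20:.2f} MiB, "
          f"float64 = {f64 / 2**20:.2f} MiB")
```

A (256, 256) float32 image is 0.25 MiB, while (1024, 1024) is 4 MiB in float32 and 8 MiB in float64, so converting to single precision also halves the amount of data the GPU has to move.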

I don't doubt that the GPU is slower here, but I wanted to mention that using timer for the comparison has a couple of potential pitfalls to be aware of:

GPU times will be much slower the first time a function is called because any kernels get compiled and cached (fortunately this .cubin cache is persistent on disk across program runs so this is a one time cost).
GPU times can be misleadingly short in some cases where synchronization may not have been performed, so it is best to explicitly call cupy.cuda.Device().synchronize() before checking the final time to make sure the kernels have completed.
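Both pitfalls can also be handled by hand. The sketch below is a hypothetical helper (`time_call` is not part of any library) that discards warm-up calls and invokes an optional `sync` callback before stopping the clock. On the GPU, `sync` would be something like `cupy.cuda.Device().synchronize`; the demonstration here runs on the CPU only, so it passes no callback.

```python
import time


def time_call(fn, *args, n_warmup=3, n_repeat=10, sync=None, **kwargs):
    """Return per-call wall times in seconds, excluding warm-up calls."""

    def run():
        fn(*args, **kwargs)
        if sync is not None:
            sync()  # wait for any asynchronous work to finish

    # Warm-up calls absorb one-time costs such as kernel compilation.
    for _ in range(n_warmup):
        run()

    times = []
    for _ in range(n_repeat):
        start = time.perf_counter()
        run()  # the synchronization happens before the clock is read
        times.append(time.perf_counter() - start)
    return times


# CPU-only demonstration with a built-in function.
times = time_call(sum, range(100_000), n_warmup=2, n_repeat=5)
print(f"mean = {sum(times) / len(times):.6f} s over {len(times)} runs")
```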

To handle the above issues automatically, CuPy provides a benchmark timing utility that can be used like this:

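import cupy as cp
from cupyx.profiler import benchmark
from skimage.transform import warp
from cucim.skimage.transform import warp as cu_warp

# imgrid, tform and cu_tform are assumed to be defined earlier,
# as in the reporter's script.

# Time the scikit-image (CPU) version.
perf_cpu = benchmark(
    warp,
    args=(imgrid, tform),
    kwargs=dict(output_shape=(255, 255)),  # the shape must be a tuple
    n_warmup=10,
    n_repeat=10000,
    max_duration=5)  # cap at 5 seconds duration
print(f"warp: avg CPU time = {perf_cpu.cpu_times.mean()}")

# Copy the image to the GPU once, outside the timed region.
cu_imgrid = cp.array(imgrid)

# Time the cuCIM (GPU) version.
perf_gpu = benchmark(
    cu_warp,
    args=(cu_imgrid, cu_tform),
    kwargs=dict(output_shape=(255, 255)),
    n_warmup=10,
    n_repeat=10000,
    max_duration=5)  # cap at 5 seconds duration
print(f"warp: avg GPU time = {perf_gpu.gpu_times.mean()}")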


JHancox commented on August 25, 2024

Thanks for the details @grlee77. In this case the image was 256 x 256 but I will try larger images and see what happens. Thanks for the tip on the timeit - you are quite right. Often there is some implicit mem synch operation involved anyhow, but I should be explicit about it.
