
segment-anything-fast's People

Contributors

cpuhrsch, hdcharles, jbschlosser, mawanda-jun, yanbing-j


segment-anything-fast's Issues

Not enough SMs on RTX 2080

Hey,

I know you optimized this project for the A100, and I read that people got it running on the 4090 and the 3090. I am only able to work with 2080s (university hardware).

When I try to run your code (amg_example.py), I get the following errors:

torch._inductor.utils: [WARNING] not enough SMs to use max_autotune_gemm mode

followed by a bunch of "code" and then:
BackendCompilerFailed: backend='inductor' raised:
RuntimeError: Internal Triton PTX codegen error:
ptxas /tmp/compile-ptx-src-76618e, line 149; error : Feature '.bf16' requires .target sm_80 or higher
ptxas /tmp/compile-ptx-src-76618e, line 149; error : Feature 'cvt with .f32.bf16' requires .target sm_80 or higher
(.....)
ptxas /tmp/compile-ptx-src-76618e, line 200; error : Feature '.bf16' requires .target sm_80 or higher
ptxas /tmp/compile-ptx-src-76618e, line 200; error : Feature 'cvt with .f32.bf16' requires .target sm_80 or higher
ptxas fatal : Ptx assembly aborted due to error

Is this just a shortcoming of my hardware, or am I doing something wrong?

PS: the original model runs fine, and your project also runs if I use "sam_model_registry" (I guess that is just the Meta implementation).

Thank you.
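
For what it's worth, the ptxas messages point at a hardware limit rather than a usage error: bfloat16 PTX instructions require sm_80 (Ampere) or newer, while the RTX 2080 is sm_75 (Turing). A minimal check, plus one possible (untested) workaround sketch:

import torch

# The RTX 2080 reports (7, 5); the bf16 kernels Triton generates need (8, 0)+.
print(torch.cuda.get_device_capability())
print(torch.cuda.is_bf16_supported())

# Hypothetical workaround: keep the model in float16 (or float32) on
# pre-Ampere GPUs instead of the bfloat16 the fast path uses.
dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16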

The results on CPU and GPU are very different

Here is a short test for debugging. As you can see, ImageEncoderViT produces different results on CPU and GPU; this level of disagreement is not acceptable.

Weirdly, when I print x before passing it to the self.neck layer, the x on CPU and GPU are the same. self.neck is just a Conv2d and a LayerNorm, so I have no idea why it would produce different results on CPU and GPU. Can anyone help? (A quick numeric comparison is sketched after the transcript below.)

>>> from segment_anything_fast.modeling.image_encoder import ImageEncoderViT
>>> import torch
>>> from functools import partial
>>> model = ImageEncoderViT(depth=32,embed_dim=1280,img_size=1024,mlp_ratio=4,norm_layer=partial(torch.nn.LayerNorm, eps=1e-6),num_heads=16,patch_size=16,qkv_bias=True,use_rel_pos=True,global_attn_indexes=[7, 15, 23, 31],window_size=14,out_chans=256,)
>>> input = torch.randn((1,3,1024,1024))
>>> a = model(input)
>>> input = input.to('cuda')
>>> model = model.to('cuda')
>>> b = model(input)
>>> a
tensor([[[[ 1.1279e+00, -3.9965e-01,  1.3351e+00,  ...,  6.2123e-01,
           -2.6559e-01,  8.9881e-01],
          [ 7.6766e-01,  1.1112e+00,  9.9850e-01,  ...,  8.1880e-01,
            4.4002e-01, -1.1236e+00],
          [ 4.8820e-01,  1.3826e+00,  1.6246e+00,  ...,  5.2780e-01,
           -1.2887e+00, -1.8294e-01],
          ...,
          [ 5.3505e-01,  6.5830e-01,  8.6872e-01,  ..., -5.3345e-02,
           -2.3688e-01, -5.8527e-01],
          [ 8.2007e-01,  1.1339e+00,  6.3609e-01,  ...,  1.2077e+00,
           -1.3440e+00,  1.3187e-02],
          [-1.8615e-01,  9.4821e-01,  5.8698e-01,  ..., -2.2219e-02,
            2.2753e-01, -1.0425e+00]],

         [[-5.4213e-01,  6.6123e-02,  7.4238e-01,  ..., -8.3169e-01,
           -5.6192e-01, -1.2276e+00],
          [ 3.6463e-01,  3.3585e-01,  1.0749e+00,  ...,  1.3571e+00,
           -3.0437e-01,  6.2709e-01],
          [-3.5779e-03, -3.8753e-03,  1.0144e+00,  ...,  1.5758e-01,
           -3.4319e-01,  1.0236e+00],
          ...,
          [ 5.7304e-01,  1.6791e+00,  7.4203e-01,  ...,  2.3011e+00,
            1.2866e+00,  1.0487e+00],
          [ 1.0945e+00,  1.3373e+00,  6.1515e-01,  ...,  1.9827e+00,
            1.4084e+00,  1.2886e+00],
          [ 2.2072e-01,  9.9274e-01, -4.1084e-02,  ...,  4.0965e-01,
            1.0271e+00,  7.0089e-01]],

         [[ 9.2292e-01,  1.2548e+00,  1.4542e+00,  ...,  1.3106e+00,
           -4.0816e-01,  1.3618e+00],
          [ 1.5234e+00,  5.9021e-01,  1.1496e+00,  ...,  1.1651e+00,
            1.6763e+00,  4.5780e-01],
          [ 9.5586e-01,  1.3678e+00,  1.6812e+00,  ...,  1.8888e-01,
            5.9193e-01,  2.3960e+00],
          ...,
          [ 1.3847e+00,  1.3367e+00, -2.9702e-02,  ...,  7.4956e-01,
            9.3687e-01,  1.1981e+00],
          [ 1.0065e+00,  9.2275e-01,  1.3572e+00,  ...,  2.0780e+00,
            8.4323e-01,  3.0510e-01],
          [ 1.4033e+00,  1.1534e+00, -5.7349e-01,  ..., -1.2417e-03,
            1.1398e+00,  1.4818e+00]],

         ...,

         [[-1.0704e+00, -1.4728e-02, -2.9317e-01,  ...,  1.2236e+00,
           -1.2223e-01, -1.0641e+00],
          [-2.4054e-01, -1.3002e-01, -3.1265e-02,  ...,  7.6334e-01,
           -1.5491e-01, -6.8875e-01],
          [ 4.5375e-01,  5.3694e-01, -7.1474e-01,  ...,  9.6021e-01,
            8.3933e-03,  7.1659e-01],
          ...,
          [-4.5589e-01,  1.0988e+00, -7.2541e-02,  ...,  6.7271e-01,
            1.7812e+00, -1.5954e+00],
          [ 4.6227e-01, -4.6240e-01, -1.4692e-02,  ...,  2.4024e+00,
           -1.4458e+00, -8.7657e-01],
          [-9.7614e-01, -6.4876e-01,  6.0639e-01,  ..., -1.2481e+00,
           -3.0824e-01,  1.1255e+00]],

         [[ 9.7512e-01,  6.5605e-01,  9.7539e-01,  ...,  1.1063e+00,
           -5.2691e-03, -1.5871e-01],
          [ 1.7015e-01,  4.2257e-01,  9.6964e-01,  ..., -2.5970e-01,
            3.9704e-01, -5.1129e-01],
          [ 7.7848e-01,  4.4700e-01,  4.3270e-01,  ...,  3.3544e-01,
            1.4105e+00,  6.6389e-01],
          ...,
          [-2.5888e-01,  1.2881e-01,  2.1639e-01,  ..., -1.2198e+00,
            1.0314e+00, -7.4611e-02],
          [ 7.9196e-01,  4.3873e-01,  8.0761e-01,  ...,  9.0232e-01,
            1.5996e+00,  8.0240e-01],
          [ 3.8602e-01,  5.4603e-01,  3.3816e-01,  ..., -6.9882e-01,
           -1.4301e+00, -4.7366e-01]],

         [[-2.9780e-01, -1.3218e+00, -2.6823e-01,  ..., -1.1618e+00,
           -2.9855e-01,  3.9646e-01],
          [ 3.1010e-01,  4.9516e-01,  1.7241e-01,  ..., -5.7428e-01,
           -1.1108e+00, -1.3312e+00],
          [-3.1486e-01,  7.0213e-01,  8.5001e-01,  ..., -1.0412e+00,
           -1.0719e+00, -5.6568e-01],
          ...,
          [-1.6898e-01, -7.2126e-01,  4.5796e-02,  ...,  7.2761e-01,
           -2.3792e-01, -7.9595e-01],
          [-1.7269e+00, -1.7630e+00, -9.0964e-01,  ..., -9.1442e-01,
            4.1963e-01,  1.3568e+00],
          [-4.8538e-01,  8.0793e-02,  3.3132e-01,  ..., -1.9279e-01,
           -4.4357e-01,  7.3314e-01]]]], grad_fn=<AddBackward0>)
>>> b
tensor([[[[ 1.1276e+00, -3.9896e-01,  1.3352e+00,  ...,  6.2086e-01,
           -2.6533e-01,  8.9948e-01],
          [ 7.6792e-01,  1.1115e+00,  9.9893e-01,  ...,  8.1904e-01,
            4.4017e-01, -1.1244e+00],
          [ 4.8875e-01,  1.3828e+00,  1.6243e+00,  ...,  5.2787e-01,
           -1.2887e+00, -1.8287e-01],
          ...,
          [ 5.3505e-01,  6.5837e-01,  8.6899e-01,  ..., -5.2909e-02,
           -2.3689e-01, -5.8535e-01],
          [ 8.2054e-01,  1.1339e+00,  6.3626e-01,  ...,  1.2077e+00,
           -1.3434e+00,  1.4043e-02],
          [-1.8604e-01,  9.4874e-01,  5.8726e-01,  ..., -2.1825e-02,
            2.2782e-01, -1.0425e+00]],

         [[-5.4216e-01,  6.5893e-02,  7.4183e-01,  ..., -8.3199e-01,
           -5.6210e-01, -1.2276e+00],
          [ 3.6446e-01,  3.3610e-01,  1.0743e+00,  ...,  1.3566e+00,
           -3.0510e-01,  6.2653e-01],
          [-3.5442e-03, -4.2382e-03,  1.0142e+00,  ...,  1.5751e-01,
           -3.4393e-01,  1.0233e+00],
          ...,
          [ 5.7268e-01,  1.6793e+00,  7.4196e-01,  ...,  2.3008e+00,
            1.2864e+00,  1.0479e+00],
          [ 1.0941e+00,  1.3378e+00,  6.1509e-01,  ...,  1.9829e+00,
            1.4082e+00,  1.2889e+00],
          [ 2.2067e-01,  9.9273e-01, -4.0986e-02,  ...,  4.1015e-01,
            1.0265e+00,  7.0073e-01]],

         [[ 9.2357e-01,  1.2548e+00,  1.4542e+00,  ...,  1.3107e+00,
           -4.0825e-01,  1.3620e+00],
          [ 1.5236e+00,  5.9024e-01,  1.1501e+00,  ...,  1.1650e+00,
            1.6763e+00,  4.5799e-01],
          [ 9.5630e-01,  1.3678e+00,  1.6815e+00,  ...,  1.8899e-01,
            5.9160e-01,  2.3954e+00],
          ...,
          [ 1.3852e+00,  1.3368e+00, -2.9295e-02,  ...,  7.4981e-01,
            9.3762e-01,  1.1983e+00],
          [ 1.0066e+00,  9.2232e-01,  1.3569e+00,  ...,  2.0778e+00,
            8.4387e-01,  3.0434e-01],
          [ 1.4036e+00,  1.1533e+00, -5.7359e-01,  ..., -1.0455e-03,
            1.1398e+00,  1.4820e+00]],

         ...,

         [[-1.0698e+00, -1.4772e-02, -2.9359e-01,  ...,  1.2239e+00,
           -1.2180e-01, -1.0637e+00],
          [-2.4025e-01, -1.2997e-01, -3.1515e-02,  ...,  7.6373e-01,
           -1.5430e-01, -6.8904e-01],
          [ 4.5413e-01,  5.3748e-01, -7.1436e-01,  ...,  9.6014e-01,
            9.0057e-03,  7.1679e-01],
          ...,
          [-4.5574e-01,  1.0993e+00, -7.2000e-02,  ...,  6.7207e-01,
            1.7814e+00, -1.5952e+00],
          [ 4.6145e-01, -4.6165e-01, -1.4222e-02,  ...,  2.4023e+00,
           -1.4463e+00, -8.7620e-01],
          [-9.7515e-01, -6.4810e-01,  6.0649e-01,  ..., -1.2477e+00,
           -3.0792e-01,  1.1256e+00]],

         [[ 9.7454e-01,  6.5678e-01,  9.7528e-01,  ...,  1.1065e+00,
           -5.0216e-03, -1.5847e-01],
          [ 1.7006e-01,  4.2306e-01,  9.6959e-01,  ..., -2.5992e-01,
            3.9714e-01, -5.1084e-01],
          [ 7.7854e-01,  4.4686e-01,  4.3235e-01,  ...,  3.3460e-01,
            1.4102e+00,  6.6323e-01],
          ...,
          [-2.5823e-01,  1.2859e-01,  2.1663e-01,  ..., -1.2196e+00,
            1.0320e+00, -7.4387e-02],
          [ 7.9208e-01,  4.3808e-01,  8.0768e-01,  ...,  9.0169e-01,
            1.5992e+00,  8.0307e-01],
          [ 3.8559e-01,  5.4567e-01,  3.3847e-01,  ..., -6.9920e-01,
           -1.4303e+00, -4.7335e-01]],

         [[-2.9799e-01, -1.3227e+00, -2.6864e-01,  ..., -1.1625e+00,
           -2.9797e-01,  3.9606e-01],
          [ 3.0952e-01,  4.9551e-01,  1.7227e-01,  ..., -5.7379e-01,
           -1.1105e+00, -1.3312e+00],
          [-3.1462e-01,  7.0210e-01,  8.5023e-01,  ..., -1.0416e+00,
           -1.0716e+00, -5.6509e-01],
          ...,
          [-1.6905e-01, -7.2131e-01,  4.6613e-02,  ...,  7.2798e-01,
           -2.3790e-01, -7.9638e-01],
          [-1.7270e+00, -1.7633e+00, -9.0924e-01,  ..., -9.1347e-01,
            4.2046e-01,  1.3563e+00],
          [-4.8477e-01,  8.0591e-02,  3.3117e-01,  ..., -1.9234e-01,
           -4.4338e-01,  7.3338e-01]]]], device='cuda:0',
       grad_fn=<AddBackward0>)
>>>
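
To quantify the mismatch instead of eyeballing the printed tensors, one could continue the session above and compare the two outputs directly (a sketch reusing a and b from the transcript; torch is already imported):

>>> diff = (a - b.cpu()).abs()
>>> diff.max()
>>> torch.allclose(a, b.cpu(), atol=1e-2)

Elementwise differences on the order of 1e-3, as in the printouts above, are typical float32 CPU-vs-CUDA kernel noise (convolutions and reductions sum in different orders on each backend), amplified here by a 32-block ViT-H forward pass.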

sam_model_fast_registry, speed problem

Hello,

I used segment-anything-fast but saw no significant acceleration compared to segment-anything.
Is my usage incorrect?

code:
import cv2
import numpy as np
from segment_anything_fast import sam_model_registry, sam_model_fast_registry, SamPredictor

video_path = 'video.avi'
sam_checkpoint = 'sam_vit_h_4b8939.pth'

input_point = np.array([[305, 235]])
input_label = np.array([1])

video_cap = cv2.VideoCapture(video_path)
nframes = int(video_cap.get(cv2.CAP_PROP_FRAME_COUNT))  # assumed: nframes was not defined in the snippet
ret, image = video_cap.read()
sam = sam_model_fast_registry["vit_h"](checkpoint=sam_checkpoint)
sam.to(device="cuda:0")
predictor = SamPredictor(sam)
predictor.set_image(image)
masks, scores, logits = predictor.predict(point_coords=input_point, point_labels=input_label, multimask_output=True)
mask_input = logits[np.argmax(scores), :, :]

i_fn = 0
while video_cap.isOpened() and i_fn < nframes:
    ret, image = video_cap.read()
    if ret:
        predictor.set_image(image)
        masks, _, _ = predictor.predict(point_coords=input_point, point_labels=input_label, mask_input=mask_input[None, :, :], multimask_output=False)
        i_fn += 1

How do I set up batched input images (i.e., set a batch size)?

Should I use onnxruntime?

Thanks
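
One thing that may explain the timing (a guess, not from the repo docs): the fast registry relies on torch.compile, so the first predictor.set_image call pays a large one-time compilation cost. Measuring only after a warm-up call gives a fairer per-frame number, along these lines:

import time
import torch

predictor.set_image(image)  # warm-up: the first call triggers torch.compile

t0 = time.time()
for _ in range(10):
    predictor.set_image(image)  # reuses 'predictor' and 'image' from the snippet above
torch.cuda.synchronize()
print((time.time() - t0) / 10, "s per set_image after warm-up")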

Running on CPU.

Hello,

Is there a way to run this code on CPU instead of CUDA? I get the following error when I change device='cpu' in the example code.

torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
LoweringException: AttributeError: 'PermuteView' object has no attribute 'freeze_layout'
target: aten.convolution.default
args[0]: TensorBox(
  PermuteView(data=StorageBox(
    ComputedBuffer(name='buf943', layout=FlexibleLayout('cpu', torch.bfloat16, size=[1, 64, 64, 1280], stride=[5242880, 81920, 1280, 1]), data=Pointwise(
      'cpu',
      torch.bfloat16,
      def inner_fn(index):
          _, i1, i2, i3 = index
          tmp0 = ops.load(buf935, i3 + 1280 * i2 + 81920 * i1)
          tmp1 = ops.load(buf942, i3 + 1280 * i2 + 81920 * i1)
          tmp2 = tmp0 + tmp1
          return tmp2
      ,
      ranges=[1, 64, 64, 1280],
      origin_node=add_352,
      origins={add_352}
    ))
  ), dims=[0, 3, 1, 2])
)
args[1]: TensorBox(StorageBox(
  InputBuffer(name='arg455_1', layout=FixedLayout('cpu', torch.bfloat16, size=[256, 1280, 1, 1], stride=[1280, 1, 1, 1]))
))
args[2]: None
args[3]: [1, 1]
args[4]: [0, 0]
args[5]: [1, 1]
args[6]: False
args[7]: [0, 0]
args[8]: 1

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

You can suppress this exception and fall back to eager by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True
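
A sketch of the eager fallback the message itself suggests (untested; it skips inductor entirely, so it will be slow, but it may get past the CPU lowering bug):

import torch._dynamo
torch._dynamo.config.suppress_errors = True  # fall back to eager when inductor fails

from segment_anything_fast import sam_model_fast_registry, SamPredictor

# 'sam_vit_h_4b8939.pth' is the usual ViT-H checkpoint name; adjust the path as needed.
sam = sam_model_fast_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to(device="cpu")
predictor = SamPredictor(sam)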

Issue with opencv-python and SamAutomaticMaskGenerator

I have installed my environment exactly as described in the README.md in the experiments folder, but I have two issues when trying to run the code in amg_example.py.

  1. Can't import cv2
    When trying to import cv2 I get the following error:
    "ImportError: libGL.so.1: cannot open shared object file: No such file or directory"

I have tried uninstalling and reinstalling opencv-python, which didn't work.
I have also tried uninstalling opencv-python and installing opencv-python-headless, but that didn't work either.

  2. SamAutomaticMaskGenerator throwing a compiler error
    I read in an image with torchvision.io.read_image and tried generating masks for it, but I got the following error:

BackendCompilerFailed Traceback (most recent call last)
Cell In[18], line 1
----> 1 masks = mask_generator.generate(image)
You can suppress this exception and fall back to eager by setting: import torch._dynamo; torch._dynamo.config.suppress_errors = True

After trying the suggested fix of setting suppress_errors to True, I get the following error:
NotImplementedError PythonTLSSnapshot: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:161 [backend fallback] FuncTorchDynamicLayerFrontMode: registered at ../aten/src/ATen/functorch/DynamicLayer.cpp:493 [backend fallback] PreDispatch: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:165 [backend fallback] PythonDispatcher: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:157 [backend fallback]

These are not the full error messages, since posting the full error log would exceed the character limit.

I have no idea what this means and I can't find any fixes anywhere.
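
On the second point, one thing worth double-checking (my guess, not a confirmed fix): the mask generator expects an HxWxC uint8 numpy array, while torchvision.io.read_image returns a CxHxW uint8 tensor, so a conversion may be needed before calling generate:

import torchvision

image = torchvision.io.read_image("dog.jpg")         # uint8 tensor, shape (C, H, W)
image = image.permute(1, 2, 0).contiguous().numpy()  # -> (H, W, C) uint8 numpy array
masks = mask_generator.generate(image)               # 'mask_generator' as set up in amg_example.py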

Error when running mask_generator.generate(image) on NVIDIA 3090 and A10

/tmp/tmpeilktith/main.c: In function ‘list_to_cuuint64_array’:
/tmp/tmpeilktith/main.c:354:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
for (Py_ssize_t i = 0; i < len; i++) {
^
/tmp/tmpeilktith/main.c:354:3: note: use option -std=c99 or -std=gnu99 to compile your code
/tmp/tmpeilktith/main.c: In function ‘list_to_cuuint32_array’:
/tmp/tmpeilktith/main.c:365:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
for (Py_ssize_t i = 0; i < len; i++) {
^

(the same pair of errors repeats verbatim for several other temporary build directories: /tmp/tmptf4cg5_c, /tmp/tmp9utu3rwt, /tmp/tmpfwwxs7d8, /tmp/tmpsomyu47p, /tmp/tmpul3l8_vy, /tmp/tmp_25p7zmi, /tmp/tmpaqkjsyxw)

BackendCompilerFailed Traceback (most recent call last)
Cell In[5], line 2
1 start_time = time.time()
----> 2 masks = mask_generator.generate(image)
3 print(time.time()-start_time)

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/utils/_contextlib.py:115, in context_decorator..decorate_context(*args, **kwargs)
112 @functools.wraps(func)
113 def decorate_context(*args, **kwargs):
114 with ctx_factory():
--> 115 return func(*args, **kwargs)

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/pytorch_labs_segment_anything_fast-0.2-py3.11.egg/segment_anything_fast/automatic_mask_generator.py:170, in SamAutomaticMaskGenerator.generate(self, image)
145 """
146 Generates masks for the given image.
147
(...)
166 the mask, given in XYWH format.
167 """
169 # Generate masks
--> 170 mask_data = self._generate_masks(image)
172 # Filter small disconnected regions and holes in masks
173 if self.min_mask_region_area > 0:

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/pytorch_labs_segment_anything_fast-0.2-py3.11.egg/segment_anything_fast/automatic_mask_generator.py:213, in SamAutomaticMaskGenerator._generate_masks(self, image)
211 data = MaskData()
212 for crop_box, layer_idx in zip(crop_boxes, layer_idxs):
--> 213 crop_data = self._process_crop(image, crop_box, layer_idx, orig_size)
214 data.cat(crop_data)
216 # Remove duplicate masks between crops

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/pytorch_labs_segment_anything_fast-0.2-py3.11.egg/segment_anything_fast/automatic_mask_generator.py:243, in SamAutomaticMaskGenerator._process_crop(self, image, crop_box, crop_layer_idx, orig_size)
241 cropped_im = image[y0:y1, x0:x1, :]
242 cropped_im_size = cropped_im.shape[:2]
--> 243 self.predictor.set_image(cropped_im)
245 # Get points for this crop
246 points_scale = np.array(cropped_im_size)[None, ::-1]

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/pytorch_labs_segment_anything_fast-0.2-py3.11.egg/segment_anything_fast/predictor.py:60, in SamPredictor.set_image(self, image, image_format)
57 input_image_torch = torch.as_tensor(input_image, device=self.device)
58 input_image_torch = input_image_torch.permute(2, 0, 1).contiguous()[None, :, :, :]
---> 60 self.set_torch_image(input_image_torch, image.shape[:2])

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/utils/_contextlib.py:115, in context_decorator..decorate_context(*args, **kwargs)
112 @functools.wraps(func)
113 def decorate_context(*args, **kwargs):
114 with ctx_factory():
--> 115 return func(*args, **kwargs)

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/pytorch_labs_segment_anything_fast-0.2-py3.11.egg/segment_anything_fast/predictor.py:90, in SamPredictor.set_torch_image(self, transformed_image, original_image_size)
88 input_image = self.model.preprocess(transformed_image)
89 model_dtype = self.model.mask_decoder.iou_prediction_head.layers[0].weight.dtype
---> 90 self.features = self.model.image_encoder(input_image.to(model_dtype))
91 self.is_image_set = True

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
-> 1511 return self._call_impl(*args, **kwargs)

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
1515 # If we don't have any hooks, we want to skip the rest of the logic in
1516 # this function, and just call forward.
1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1518 or _global_backward_pre_hooks or _global_backward_hooks
1519 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520 return forward_call(*args, **kwargs)
1522 try:
1523 result = None

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:489, in _TorchDynamoContext.call.._fn(*args, **kwargs)
487 dynamo_config_ctx.enter()
488 try:
--> 489 return fn(*args, **kwargs)
490 finally:
491 set_eval_frame(prior)

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
-> 1511 return self._call_impl(*args, **kwargs)

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
1515 # If we don't have any hooks, we want to skip the rest of the logic in
1516 # this function, and just call forward.
1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1518 or _global_backward_pre_hooks or _global_backward_hooks
1519 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520 return forward_call(*args, **kwargs)
1522 try:
1523 result = None

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:655, in catch_errors_wrapper..catch_errors(frame, cache_entry, frame_state)
652 return hijacked_callback(frame, cache_entry, hooks, frame_state)
654 with compile_lock, _disable_current_modes():
--> 655 return callback(frame, cache_entry, hooks, frame_state)

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py:727, in convert_frame.._convert_frame(frame, cache_entry, hooks, frame_state)
725 counters["frames"]["total"] += 1
726 try:
--> 727 result = inner_convert(frame, cache_entry, hooks, frame_state)
728 counters["frames"]["ok"] += 1
729 return result

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py:383, in convert_frame_assert.._convert_frame_assert(frame, cache_entry, hooks, frame_state)
370 signpost_event(
371 "dynamo",
372 "_convert_frame_assert._compile",
(...)
379 },
380 )
382 with config.patch(_patch_config_if_changed()):
--> 383 compiled_product = _compile(
384 frame.f_code,
385 frame.f_globals,
386 frame.f_locals,
387 frame.f_builtins,
388 compiler_fn,
389 one_graph,
390 export,
391 export_constraints,
392 hooks,
393 cache_size,
394 frame,
395 frame_state=frame_state,
396 compile_id=compile_id,
397 )
398 return compiled_product

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py:646, in _compile(code, globals, locals, builtins, compiler_fn, one_graph, export, export_constraints, hooks, cache_size, frame, frame_state, compile_id)
644 with compile_context(CompileContext(compile_id)):
645 try:
--> 646 guarded_code = compile_inner(code, one_graph, hooks, transform)
647 return guarded_code
648 except (
649 Unsupported,
650 TorchRuntimeError,
(...)
657 BisectValidationException,
658 ) as e:

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_dynamo/utils.py:244, in dynamo_timed..dynamo_timed_inner..time_wrapper(*args, **kwargs)
242 with torch.profiler.record_function(f"{key} (dynamo_timed)"):
243 t0 = time.time()
--> 244 r = func(*args, **kwargs)
245 time_spent = time.time() - t0
246 compilation_time_metrics[key].append(time_spent)

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py:562, in _compile..compile_inner(code, one_graph, hooks, transform)
560 CompileContext.get().attempt = attempt
561 try:
--> 562 out_code = transform_code_object(code, transform)
563 break
564 except exc.RestartAnalysis as e:

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_dynamo/bytecode_transformation.py:1033, in transform_code_object(code, transformations, safe)
1030 instructions = cleaned_instructions(code, safe)
1031 propagate_line_nums(instructions)
-> 1033 transformations(instructions, code_options)
1034 return clean_and_assemble_instructions(instructions, keys, code_options)[1]

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py:151, in preserve_global_state.._fn(*args, **kwargs)
149 cleanup = setup_compile_debug()
150 try:
--> 151 return fn(*args, **kwargs)
152 finally:
153 cleanup.close()

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py:527, in _compile..transform(instructions, code_options)
525 try:
526 with tracing(tracer.output.tracing_context), tracer.set_current_tx():
--> 527 tracer.run()
528 except exc.UnspecializeRestartAnalysis:
529 speculation_log.clear()

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py:2128, in InstructionTranslator.run(self)
2127 def run(self):
-> 2128 super().run()

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py:818, in InstructionTranslatorBase.run(self)
813 try:
814 self.output.push_tx(self)
815 while (
816 self.instruction_pointer is not None
817 and not self.output.should_exit
--> 818 and self.step()
819 ):
820 pass
821 except BackendCompilerFailed:

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py:781, in InstructionTranslatorBase.step(self)
777 unimplemented(f"missing: {inst.opname}")
778 TracingContext.set_current_loc(
779 self.f_code.co_filename, self.lineno, self.f_code.co_name
780 )
--> 781 getattr(self, inst.opname)(inst)
783 return inst.opname != "RETURN_VALUE"
784 except Unsupported:

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py:2243, in InstructionTranslator.RETURN_VALUE(self, inst)
2238 _step_logger()(
2239 logging.INFO,
2240 f"torchdynamo done tracing {self.f_code.co_name} (RETURN_VALUE)",
2241 )
2242 log.debug("RETURN_VALUE triggered compile")
-> 2243 self.output.compile_subgraph(
2244 self,
2245 reason=GraphCompileReason(
2246 "return_value", [self.frame_summary()], graph_break=False
2247 ),
2248 compile_return_value=True,
2249 )
2250 self.output.add_output_instructions([create_instruction("RETURN_VALUE")])

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_dynamo/output_graph.py:919, in OutputGraph.compile_subgraph(self, tx, partial_convert, reason, compile_return_value)
916 append_prefix_insts()
917 # optimization to generate better code in a common case
918 self.add_output_instructions(
--> 919 self.compile_and_call_fx_graph(tx, list(reversed(stack_values)), root)
920 + [create_instruction("UNPACK_SEQUENCE", arg=len(stack_values))]
921 )
922 else:
923 graph_output_var = self.new_var("graph_out")

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/contextlib.py:81, in ContextDecorator.call..inner(*args, **kwds)
78 @wraps(func)
79 def inner(*args, **kwds):
80 with self._recreate_cm():
---> 81 return func(*args, **kwds)

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_dynamo/output_graph.py:1087, in OutputGraph.compile_and_call_fx_graph(self, tx, rv, root)
1084 self.tracing_context.fake_mode = backend_fake_mode
1086 with self.restore_global_state():
-> 1087 compiled_fn = self.call_user_compiler(gm)
1088 compiled_fn = disable(compiled_fn)
1090 counters["stats"]["unique_graphs"] += 1

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_dynamo/utils.py:244, in dynamo_timed..dynamo_timed_inner..time_wrapper(*args, **kwargs)
242 with torch.profiler.record_function(f"{key} (dynamo_timed)"):
243 t0 = time.time()
--> 244 r = func(*args, **kwargs)
245 time_spent = time.time() - t0
246 compilation_time_metrics[key].append(time_spent)

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_dynamo/output_graph.py:1159, in OutputGraph.call_user_compiler(self, gm)
1157 raise e
1158 except Exception as e:
-> 1159 raise BackendCompilerFailed(self.compiler_fn, e).with_traceback(
1160 e.traceback
1161 ) from None
1163 signpost_event(
1164 "dynamo",
1165 "OutputGraph.call_user_compiler",
(...)
1171 },
1172 )
1174 return compiled_fn

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_dynamo/output_graph.py:1140, in OutputGraph.call_user_compiler(self, gm)
1138 if config.verify_correctness:
1139 compiler_fn = WrapperBackend(compiler_fn)
-> 1140 compiled_fn = compiler_fn(gm, self.example_inputs())
1141 _step_logger()(logging.INFO, f"done compiler function {name}")
1142 assert callable(compiled_fn), "compiler_fn did not return callable"

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_dynamo/repro/after_dynamo.py:117, in wrap_backend_debug..debug_wrapper(gm, example_inputs, **kwargs)
115 raise
116 else:
--> 117 compiled_gm = compiler_fn(gm, example_inputs)
119 return compiled_gm

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/__init__.py:1662, in TorchCompileInductorWrapper.__call__(self, model_, inputs_)
1659 def __call__(self, model_, inputs_):
1660 from torch._inductor.compile_fx import compile_fx
-> 1662 return compile_fx(model_, inputs_, config_patches=self.config)

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_inductor/compile_fx.py:952, in compile_fx(model_, example_inputs_, inner_compile, config_patches, decompositions)
950 if config_patches:
951 with config.patch(config_patches):
--> 952 return compile_fx(
953 model_,
954 example_inputs_,
955 # need extra layer of patching as backwards is compiled out of scope
956 inner_compile=config.patch(config_patches)(inner_compile),
957 decompositions=decompositions,
958 )
960 if config.cpp_wrapper:
961 with config.patch(
962 {
963 "cpp_wrapper": False,
(...)
967 }
968 ), V.set_real_inputs(example_inputs_):

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_inductor/compile_fx.py:1168, in compile_fx(model_, example_inputs_, inner_compile, config_patches, decompositions)
1163 return inference_compiler(unlifted_gm, example_inputs_)
1165 with V.set_fake_mode(fake_mode), torch.guards.tracing(
1166 tracing_context
1167 ), compiled_autograd.disable():
-> 1168 return aot_autograd(
1169 fw_compiler=fw_compiler,
1170 bw_compiler=bw_compiler,
1171 inference_compiler=inference_compiler,
1172 decompositions=decompositions,
1173 partition_fn=partition_fn,
1174 keep_inference_input_mutations=True,
1175 )(model_, example_inputs_)

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_dynamo/backends/common.py:55, in aot_autograd..compiler_fn(gm, example_inputs)
52 try:
53 # NB: NOT cloned!
54 with enable_aot_logging(), patch_config:
---> 55 cg = aot_module_simplified(gm, example_inputs, **kwargs)
56 counters["aot_autograd"]["ok"] += 1
57 return disable(cg)

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_functorch/aot_autograd.py:887, in aot_module_simplified(mod, args, fw_compiler, bw_compiler, partition_fn, decompositions, keep_inference_input_mutations, inference_compiler)
871 aot_config = AOTConfig(
872 fw_compiler=fw_compiler,
873 bw_compiler=bw_compiler,
(...)
883 no_tangents=False,
884 )
886 with compiled_autograd.disable():
--> 887 compiled_fn = create_aot_dispatcher_function(
888 functional_call,
889 full_args,
890 aot_config,
891 )
893 # TODO: There is something deeply wrong here; compiled_fn running with
894 # the boxed calling convention, but aot_module_simplified somehow
895 # historically returned a function that was not the boxed calling
896 # convention. This should get fixed...
897 def forward(*runtime_args):

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_dynamo/utils.py:244, in dynamo_timed..dynamo_timed_inner..time_wrapper(*args, **kwargs)
242 with torch.profiler.record_function(f"{key} (dynamo_timed)"):
243 t0 = time.time()
--> 244 r = func(*args, **kwargs)
245 time_spent = time.time() - t0
246 compilation_time_metrics[key].append(time_spent)

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_functorch/aot_autograd.py:600, in create_aot_dispatcher_function(flat_fn, flat_args, aot_config)
597 compiler_fn = partial(aot_wrapper_dedupe, compiler_fn=compiler_fn)
598 # You can put more passes here
--> 600 compiled_fn = compiler_fn(flat_fn, fake_flat_args, aot_config, fw_metadata=fw_metadata)
601 if aot_config.is_export:
602 mutated_user_inp_locs = [
603 idx - aot_config.num_params_buffers
604 for idx in fw_metadata.mutated_inp_runtime_indices
605 if idx >= aot_config.num_params_buffers
606 ]

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py:425, in aot_wrapper_dedupe(flat_fn, flat_args, aot_config, compiler_fn, fw_metadata)
422 break
424 if ok:
--> 425 return compiler_fn(flat_fn, leaf_flat_args, aot_config, fw_metadata=fw_metadata)
427 if requires_subclass_dispatch(leaf_flat_args, fw_metadata):
428 raise RuntimeError(
429 """
430 Encountered duplicate inputs that are mutated in the graph, but at least one input/output
431 to the graph is a tensor subclass. This is not supported today. You can try to
432 remove the aliasing yourself as a workaround, or otherwise file an issue on github."""
433 )

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py:630, in aot_wrapper_synthetic_base(flat_fn, flat_args, aot_config, fw_metadata, needs_autograd, compiler_fn)
628 # Happy path: we don't need synthetic bases
629 if synthetic_base_info is None:
--> 630 return compiler_fn(flat_fn, flat_args, aot_config, fw_metadata=fw_metadata)
632 # export path: ban synthetic bases for now, add later if requested.
633 if requires_subclass_dispatch(flat_args, fw_metadata):

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py:97, in aot_dispatch_base(flat_fn, flat_args, aot_config, fw_metadata)
91 if tracing_context := torch._guards.TracingContext.try_get():
92 tracing_context.fw_metadata = (
93 fw_metadata
94 if maybe_subclass_meta is None
95 else maybe_subclass_meta.fw_metadata
96 )
---> 97 compiled_fw = compiler(fw_module, updated_flat_args)
99 # This boxed_call handling happens inside create_runtime_wrapper as well.
100 # However, create_runtime_wrapper does not expect the rng offsets in the
101 # output. So, we have to create another wrapper and take out the offset. As
102 # a result, we have to account for not boxed_call compilers as well.
103 if not hasattr(compiled_fw, "_boxed_call"):

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_dynamo/utils.py:244, in dynamo_timed..dynamo_timed_inner..time_wrapper(*args, **kwargs)
242 with torch.profiler.record_function(f"{key} (dynamo_timed)"):
243 t0 = time.time()
--> 244 r = func(*args, **kwargs)
245 time_spent = time.time() - t0
246 compilation_time_metrics[key].append(time_spent)

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_inductor/compile_fx.py:1100, in compile_fx..fw_compiler_base(model, example_inputs, is_inference)
1092 assert orig_output_end_idx <= num_model_outputs
1094 user_visible_outputs = {
1095 n.name
1096 for n in model_outputs[original_output_start_index:orig_output_end_idx]
1097 if isinstance(n, torch.fx.Node)
1098 }
-> 1100 return inner_compile(
1101 model,
1102 example_inputs,
1103 num_fixed=fixed,
1104 cudagraphs=cudagraphs,
1105 graph_id=graph_id,
1106 is_inference=is_inference,
1107 boxed_forward_device_index=forward_device,
1108 user_visible_outputs=user_visible_outputs,
1109 )

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/contextlib.py:81, in ContextDecorator.call..inner(*args, **kwds)
78 @wraps(func)
79 def inner(*args, **kwds):
80 with self._recreate_cm():
---> 81 return func(*args, **kwds)

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_dynamo/repro/after_aot.py:83, in wrap_compiler_debug..debug_wrapper(gm, example_inputs, **kwargs)
78 assert config.repro_after in ("dynamo", "aot", None)
80 try:
81 # Call the compiler_fn - which is either aot_autograd or inductor
82 # with fake inputs
---> 83 inner_compiled_fn = compiler_fn(gm, example_inputs)
84 except Exception as e:
85 # TODO: Failures here are troublesome because no real inputs,
86 # need a different serialization strategy
87 if config.repro_after == "aot":

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_inductor/debug.py:305, in DebugContext.wrap..inner(*args, **kwargs)
302 @functools.wraps(fn)
303 def inner(*args, **kwargs):
304 with DebugContext():
--> 305 return fn(*args, **kwargs)

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/contextlib.py:81, in ContextDecorator.call..inner(*args, **kwds)
78 @wraps(func)
79 def inner(*args, **kwds):
80 with self._recreate_cm():
---> 81 return func(*args, **kwds)

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_inductor/compile_fx.py:320, in compile_fx_inner(gm, example_inputs, cudagraphs, num_fixed, is_backward, graph_id, cpp_wrapper, aot_mode, is_inference, boxed_forward_device_index, user_visible_outputs, layout_opt, extern_node_serializer)
316 compiled_graph = FxGraphCache.load(
317 fx_codegen_and_compile, gm, example_inputs, graph_kwargs
318 )
319 else:
--> 320 compiled_graph = fx_codegen_and_compile(
321 gm, example_inputs, **graph_kwargs # type: ignore[arg-type]
322 )
324 log.debug("FX codegen and compilation took %.3fs", time.time() - start)
326 # Return the output strides to the caller via TracingContext

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_inductor/compile_fx.py:535, in fx_codegen_and_compile(gm, example_inputs, cudagraphs, num_fixed, is_backward, graph_id, cpp_wrapper, aot_mode, is_inference, user_visible_outputs, layout_opt, extern_node_serializer)
519 graph = GraphLowering(
520 gm,
521 # example_inputs will be used by AOTInductor to dry-run the generated code for Triton kernel tuning.
(...)
532 is_inference=is_inference,
533 )
534 with V.set_graph_handler(graph):
--> 535 graph.run(*example_inputs)
536 output_strides: List[Optional[Tuple[int, ...]]] = []
537 if graph.graph_outputs is not None:
538 # We'll put the output strides in the compiled graph so we
539 # can later return them to the caller via TracingContext

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_dynamo/utils.py:244, in dynamo_timed..dynamo_timed_inner..time_wrapper(*args, **kwargs)
242 with torch.profiler.record_function(f"{key} (dynamo_timed)"):
243 t0 = time.time()
--> 244 r = func(*args, **kwargs)
245 time_spent = time.time() - t0
246 compilation_time_metrics[key].append(time_spent)

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_inductor/graph.py:519, in GraphLowering.run(self, *args)
517 @dynamo_timed
518 def run(self, *args):
--> 519 return super().run(*args)

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/fx/interpreter.py:138, in Interpreter.run(self, initial_env, enable_io_processing, *args)
135 continue
137 try:
--> 138 self.env[node] = self.run_node(node)
139 except Exception as e:
140 if self.extra_traceback:

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_inductor/graph.py:814, in GraphLowering.run_node(self, n)
812 debug("layout_constraints")
813 args, kwargs = layout_constraints[n.target](n, *args, **kwargs)
--> 814 result = self.call_function(n.target, args, kwargs)
815 elif is_magic_method(n.target):
816 # TODO: this is sus, it probably should be handled in the
817 # lowerings themselves similarly to sym_size/sym-stride
818 debug("is_magic_method")

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_inductor/graph.py:694, in GraphLowering.call_function(self, target, args, kwargs)
692 return out
693 except Exception as e:
--> 694 raise LoweringException(e, target, args, kwargs).with_traceback(
695 e.traceback
696 ) from None

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_inductor/graph.py:691, in GraphLowering.call_function(self, target, args, kwargs)
689 try:
690 log.debug(" via %s", lowerings[target])
--> 691 out = lowerings[target](*args, **kwargs)
692 return out
693 except Exception as e:

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_inductor/lowering.py:291, in _register_lowering..wrapped(*args, **kwargs)
288 if unpacked:
289 args = [args]
--> 291 out = decomp_fn(*args, **kwargs)
292 validate_ir(out)
294 return out

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_inductor/kernel/conv.py:367, in convolution(x, weight, bias, stride, padding, dilation, transposed, output_padding, groups)
363 return convert_1x1_conv_to_mm(x, weight, bias)
365 if bias is not None and ir.get_device_type(x) != "cpu":
366 # peel off the bias, cudnn is slower with it
--> 367 result = convolution(x, weight, None, **kwargs)
368 return L[aten.add](
369 result, L[aten.view](bias, [result.get_size()[1]] + ndim * [1])
370 )
372 x.realize()

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_inductor/lowering.py:291, in _register_lowering..wrapped(*args, **kwargs)
288 if unpacked:
289 args = [args]
--> 291 out = decomp_fn(*args, **kwargs)
292 validate_ir(out)
294 return out

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_inductor/kernel/conv.py:457, in convolution(x, weight, bias, stride, padding, dilation, transposed, output_padding, groups)
432 for cfg in conv_configs(
433 sympy_product([x.get_size()[0], *x.get_size()[2:]]),
434 out_chan,
435 in_chan,
436 ):
437 conv2d_template.maybe_append_choice(
438 choices,
439 input_nodes=(x, weight),
(...)
454 **cfg.kwargs,
455 )
--> 457 return autotune_select_algorithm("convolution", choices, args, layout)

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_inductor/select_algorithm.py:991, in autotune_select_algorithm(*args, **kwargs)
989 if _ALGORITHM_SELECTOR_CACHE is None:
990 _ALGORITHM_SELECTOR_CACHE = AlgorithmSelectorCache()
--> 991 return _ALGORITHM_SELECTOR_CACHE(*args, **kwargs)

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_inductor/select_algorithm.py:748, in AlgorithmSelectorCache.call(self, name, choices, input_nodes, layout, input_gen_fns)
745 tuning_pool.initialize()
747 autotune_start_ts = time.time()
--> 748 timings = self.lookup(
749 choices,
750 name,
751 repr([self.key_of(x) for x in input_nodes]),
752 autotune,
753 )
754 autotune_elapse = time.time() - autotune_start_ts
755 if timings == {} or choices[0] not in timings:

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_inductor/codecache.py:291, in PersistentCache.lookup(self, choices, name, inputs, benchmark)
285 if not check_cache(local_cache) and not (
286 use_global_cache()
287 and check_cache(self.get_global_cache(), callback=log_stats)
288 ):
289 try:
290 # re-benchmark everything to try to get consistent numbers from the same machine
--> 291 timings = benchmark(choices)
292 assert all(choice in timings for choice in choices)
294 local_cache.setdefault(name, {})

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_inductor/select_algorithm.py:739, in AlgorithmSelectorCache.call..autotune(choices)
738 def autotune(choices):
--> 739 return make_benchmark_fn()(choices)

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_inductor/select_algorithm.py:848, in AlgorithmSelectorCache.make_benchmark_fn..benchmark_in_current_process(choices)
846 for choice in choices:
847 try:
--> 848 timing = benchmark_choice_in_current_process(choice)
849 except CUDACompileError as e:
850 log.warning(
851 "CUDA compilation error: \n%s. \nIgnore this choice.", str(e)
852 )

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_inductor/select_algorithm.py:838, in AlgorithmSelectorCache.make_benchmark_fn..benchmark_choice_in_current_process(choice)
835 result = choice.benchmark(*example_inputs_extern, out=out_extern)
836 else:
837 # triton templates want the base pointer for sliced tensors
--> 838 result = choice.benchmark(*example_inputs, out=out)
839 if VERIFY:
840 torch.testing.assert_close(out_extern, expected, **VERIFY)

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_inductor/select_algorithm.py:604, in TritonTemplateCaller.benchmark(self, out, *args)
602 def benchmark(self, *args, out):
603 assert self.bmreq is not None
--> 604 return self.bmreq.benchmark(*args, output_tensor=out)

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_inductor/autotune_process.py:452, in BenchmarkRequest.benchmark(self, output_tensor, *input_tensors)
449 load_elapse = time.time() - start_ts
450 start_ts = time.time()
--> 452 out = do_bench(fn)
453 torch.cuda.synchronize() # shake out any CUDA errors
455 if debug:

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/torch/_inductor/utils.py:167, in do_bench(*args, **kwargs)
165 if quantile_field_name not in kwargs:
166 kwargs[quantile_field_name] = (0.5, 0.2, 0.8)
--> 167 return triton_do_bench(*args, **kwargs)[0]

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/triton/testing.py:102, in do_bench(fn, warmup, rep, grad_to_none, quantiles, fast_flush, return_mode)
83 import torch
84 """
85 Benchmark the runtime of the provided function. By default, return the median runtime of :code:fn along with
86 the 20-th and 80-th performance percentile.
(...)
99 :type fast_flush: bool
100 """
--> 102 fn()
103 torch.cuda.synchronize()
105 # We maintain a buffer of 256 MB that we clear
106 # before each kernel call to make sure that the L2
107 # doesn't contain any input data before the run

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/triton/runtime/jit.py:550, in JITFunction.run(self, *args, **kwargs)
548 bin = self.cache[device][key]
549 if not warmup:
--> 550 bin.c_wrapper(
551 grid_0,
552 grid_1,
553 grid_2,
554 bin.num_warps,
555 bin.num_ctas,
556 bin.clusterDims[0],
557 bin.clusterDims[1],
558 bin.clusterDims[2],
559 bin.shared,
560 stream,
561 bin.cu_function,
562 CompiledKernel.launch_enter_hook,
563 CompiledKernel.launch_exit_hook,
564 bin,
565 *bin.assemble_tensormap_to_arg(non_constexpr_arg_values),
566 )
567 return bin

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/triton/compiler/compiler.py:692, in CompiledKernel.__getattribute__(self, name)
690 def __getattribute__(self, name):
691 if name == 'c_wrapper':
--> 692 self._init_handles()
693 return super().__getattribute__(name)

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/triton/compiler/compiler.py:670, in CompiledKernel._init_handles(self)
668 if self.device_type in ["cuda"]:
669 device = get_current_device()
--> 670 bin_path = {driver.HIP: "hsaco_path", driver.CUDA: "cubin"}[driver.backend]
671 max_shared = driver.utils.get_device_properties(device)["max_shared_mem"]
672 fn_load_binary = driver.utils.load_binary

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/triton/runtime/driver.py:157, in LazyProxy.__getattr__(self, name)
156 def __getattr__(self, name):
--> 157 self._initialize_obj()
158 return getattr(self._obj, name)

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/triton/runtime/driver.py:154, in LazyProxy._initialize_obj(self)
152 def _initialize_obj(self):
153 if self._obj is None:
--> 154 self._obj = self._init_fn()

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/triton/runtime/driver.py:187, in initialize_driver()
185 return HIPDriver()
186 elif torch.cuda.is_available():
--> 187 return CudaDriver()
188 else:
189 return UnsupportedDriver()

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/triton/runtime/driver.py:77, in CudaDriver.__init__(self)
76 def __init__(self):
---> 77 self.utils = CudaUtils()
78 self.backend = self.CUDA

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/triton/runtime/driver.py:47, in CudaUtils.__init__(self)
45 with open(src_path, "w") as f:
46 f.write(src)
---> 47 so = _build("cuda_utils", src_path, tmpdir)
48 with open(so, "rb") as f:
49 cache_path = cache.put(f.read(), fname, binary=True)

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/triton/common/build.py:103, in _build(name, src, srcdir)
98 cc_cmd = [
99 cc, src, "-O3", f"-I{cu_include_dir}", f"-I{py_include_dir}", f"-I{srcdir}", "-shared", "-fPIC", "-lcuda",
100 "-o", so
101 ]
102 cc_cmd += [f"-L{dir}" for dir in cuda_lib_dirs]
--> 103 ret = subprocess.check_call(cc_cmd)
105 if ret == 0:
106 return so

File ~/anaconda3/envs/segment_fast_env/lib/python3.11/subprocess.py:413, in check_call(*popenargs, **kwargs)
411 if cmd is None:
412 cmd = popenargs[0]
--> 413 raise CalledProcessError(retcode, cmd)
414 return 0

BackendCompilerFailed: backend='inductor' raised:
LoweringException: CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpaqkjsyxw/main.c', '-O3', '-I/root/anaconda3/envs/segment_fast_env/lib/python3.11/site-packages/triton/common/../third_party/cuda/include', '-I/root/anaconda3/envs/segment_fast_env/include/python3.11', '-I/tmp/tmpaqkjsyxw', '-shared', '-fPIC', '-lcuda', '-o', '/tmp/tmpaqkjsyxw/cuda_utils.cpython-311-x86_64-linux-gnu.so', '-L/lib64', '-L/lib', '-L/lib64', '-L/lib']' returned non-zero exit status 1.
target: aten.convolution.default
args[0]: TensorBox(StorageBox(
InputBuffer(name='arg457_1', layout=FixedLayout('cuda', torch.bfloat16, size=[1, 3, 1024, 1024], stride=[3145728, 1048576, 1024, 1]))
))
args[1]: TensorBox(StorageBox(
InputBuffer(name='arg69_1', layout=FixedLayout('cuda', torch.bfloat16, size=[1280, 3, 16, 16], stride=[768, 256, 16, 1]))
))
args[2]: TensorBox(StorageBox(
InputBuffer(name='arg70_1', layout=FixedLayout('cuda', torch.bfloat16, size=[1280], stride=[1]))
))
args[3]: [16, 16]
args[4]: [0, 0]
args[5]: [1, 1]
args[6]: False
args[7]: [0, 0]
args[8]: 1

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

You can suppress this exception and fall back to eager by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True
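
The gcc messages above mean Triton's CUDA utility shim is being compiled by a gcc that defaults to a pre-C99 standard (gcc older than 5 defaults to gnu89). A possible workaround, assuming Triton honors the standard CC environment variable when building, is to point it at a newer compiler before anything imports triton:

import os

# Assumption: Triton's build step reads $CC if set. The path below is
# hypothetical; use whatever modern gcc (>= 5) exists on your system.
os.environ["CC"] = "/usr/bin/gcc-9"

import torch  # import after CC is set so the Triton build sees it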

Error: SamAutomaticMaskGenerator has a large memory footprint

GPU: 4090 (24 GB)
System: Ubuntu on WSL2
Model: sam_vit_h
Image size: [1024, 1024]
Parameter settings:
model=sam,
points_per_side=128,
points_per_batch = 64,
pred_iou_thresh=0.86,
stability_score_thresh=0.92,
crop_n_layers=3,
crop_n_points_downscale_factor=2,
min_mask_region_area=100,
process_batch_size=4
Issue:
When I use SamAutomaticMaskGenerator, GPU memory usage goes up to 55 GB.
[screenshot: GPU memory usage]
And there is an error:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.63 GiB. GPU 0 has a total capacity of 23.99 GiB of which 0 bytes is free. Including non-PyTorch memory, this process has 17179869184.00 GiB memory in use. Of the allocated memory 34.86 GiB is allocated by PyTorch, and 5.63 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[screenshot: the error]
However, when using the original SAM code, this problem does not exist, and GPU memory does not exceed 24 GB.
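
Not an answer, just an observation about the settings above: points_per_side=128 with crop_n_layers=3 prompts a very large number of points, and the fast path batches more of that work onto the GPU at once than original SAM does. Dialing back the batching knobs (the same parameters the issue already uses) trades speed for memory, e.g.:

from segment_anything_fast import SamAutomaticMaskGenerator

mask_generator = SamAutomaticMaskGenerator(
    model=sam,
    points_per_side=128,
    points_per_batch=32,   # halved from 64: fewer points decoded per forward pass
    pred_iou_thresh=0.86,
    stability_score_thresh=0.92,
    crop_n_layers=3,
    crop_n_points_downscale_factor=2,
    min_mask_region_area=100,
    process_batch_size=1,  # reduced from 4: fewer mask batches held on the GPU at once
)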

How to get a visualization of SAM memory consumption

Hi,
After reading the documentation, I found this picture very interesting, but I did not find a way to draw or generate it. How can I obtain such an image?
[memory-usage figure from the docs]
Looking forward to your answer :)
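
If the picture in question is the GPU memory timeline, it looks like the output of PyTorch's built-in memory snapshot visualizer. A sketch of one way to produce such a plot (note these are private torch.cuda.memory APIs and may change between versions):

import torch

torch.cuda.memory._record_memory_history()  # start recording allocation history

masks = mask_generator.generate(image)      # run the workload to profile

torch.cuda.memory._dump_snapshot("sam_memory.pickle")
# Open https://pytorch.org/memory_viz and drop sam_memory.pickle in to render the timeline.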

Inference on an A10 GPU is slower than original SAM

Hi, I run inference on an NVIDIA A10 GPU using sam_model_fast_registry to initialize the model and SamAutomaticMaskGenerator to generate masks, but inference is slower than the original SAM. Why?

original SAM elapsed time: 6.36s per image.
SAM-fast elapsed time: 10.92s per image.

ValueError: Expected query, key, and value to all be be jagged at dimension 2, but got query._ragged_idx: 1, key._ragged_idx: 1 and value._ragged_idx: 1 instead.

Reproduce:

import cv2
from segment_anything_fast import sam_model_registry, SamPredictor, SamAutomaticMaskGenerator

img_path = 'amg_example/dog.jpg'
model_type = 'vit_b'
checkpoint = 'checkpoints/sam_vit_b_01ec64.pth'

sam = sam_model_registry[model_type](checkpoint=checkpoint)

img = cv2.imread(img_path)
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

mask_generator = SamAutomaticMaskGenerator(sam)
masks = mask_generator.generate(img)

Error stack:

Traceback (most recent call last):
  File "/home/gxw/workspace/sam/segment-anything-fast/scripts/gen_mask.py", line 21, in <module>
    masks = mask_generator.generate(img)
  File "/home/gxw/miniconda3/envs/triton/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/gxw/workspace/sam/segment-anything-fast/segment_anything_fast/automatic_mask_generator.py", line 170, in generate
    mask_data = self._generate_masks(image)
  File "/home/gxw/workspace/sam/segment-anything-fast/segment_anything_fast/automatic_mask_generator.py", line 213, in _generate_masks
    crop_data = self._process_crop(image, crop_box, layer_idx, orig_size)
  File "/home/gxw/workspace/sam/segment-anything-fast/segment_anything_fast/automatic_mask_generator.py", line 255, in _process_crop
    batch_data = self._process_batch(some_points, cropped_im_size, crop_box, orig_size)
  File "/home/gxw/workspace/sam/segment-anything-fast/segment_anything_fast/automatic_mask_generator.py", line 298, in _process_batch
    nt_masks, nt_iou_preds, _ = self.predictor.predict_torch(
  File "/home/gxw/miniconda3/envs/triton/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/gxw/workspace/sam/segment-anything-fast/segment_anything_fast/predictor.py", line 230, in predict_torch
    low_res_masks, iou_predictions = self.model.mask_decoder(
  File "/home/gxw/miniconda3/envs/triton/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/gxw/miniconda3/envs/triton/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/gxw/workspace/sam/segment-anything-fast/segment_anything_fast/modeling/mask_decoder.py", line 99, in forward
    masks, iou_pred = self.predict_masks_nested(
  File "/home/gxw/workspace/sam/segment-anything-fast/segment_anything_fast/modeling/mask_decoder.py", line 188, in predict_masks_nested
    hs, src = self.transformer(src, pos_src, tokens)
  File "/home/gxw/miniconda3/envs/triton/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/gxw/miniconda3/envs/triton/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/gxw/workspace/sam/segment-anything-fast/segment_anything_fast/modeling/transformer.py", line 91, in forward
    queries, keys = layer(
  File "/home/gxw/miniconda3/envs/triton/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/gxw/miniconda3/envs/triton/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/gxw/workspace/sam/segment-anything-fast/segment_anything_fast/modeling/transformer.py", line 155, in forward
    queries = self.self_attn(q=queries, k=queries, v=queries)
  File "/home/gxw/miniconda3/envs/triton/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/gxw/miniconda3/envs/triton/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/gxw/workspace/sam/segment-anything-fast/segment_anything_fast/modeling/transformer.py", line 227, in forward
    out = torch.nn.functional.scaled_dot_product_attention(q, k, v)
  File "/home/gxw/miniconda3/envs/triton/lib/python3.10/site-packages/torch/nested/_internal/nested_tensor.py", line 229, in __torch_function__
    return jagged_torch_function(func, *args, **kwargs)
  File "/home/gxw/miniconda3/envs/triton/lib/python3.10/site-packages/torch/nested/_internal/ops.py", line 265, in jagged_torch_function
    return jagged_scaled_dot_product_attention(*args, **kwargs)
  File "/home/gxw/miniconda3/envs/triton/lib/python3.10/site-packages/torch/nested/_internal/sdpa.py", line 640, in jagged_scaled_dot_product_attention
    _validate_sdpa_input(query, key, value, attn_mask, dropout_p, is_causal, scale)
  File "/home/gxw/miniconda3/envs/triton/lib/python3.10/site-packages/torch/nested/_internal/sdpa.py", line 59, in _validate_sdpa_input
    raise ValueError(
ValueError: Expected query, key, and value to all be be jagged at dimension 2, but got query._ragged_idx: 1, key._ragged_idx: 1 and value._ragged_idx: 1 instead.

collect_env:

Collecting environment information...
PyTorch version: 2.2.0.dev20231206+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-6.2.0-37-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 3090

Nvidia driver version: 530.30.02
cuDNN version: Probably one of the following:
/usr/local/cuda-11.7/targets/x86_64-linux/lib/libcudnn.so.8.9.5
/usr/local/cuda-11.7/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.9.5
/usr/local/cuda-11.7/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.9.5
/usr/local/cuda-11.7/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.9.5
/usr/local/cuda-11.7/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.9.5
/usr/local/cuda-11.7/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.9.5
/usr/local/cuda-11.7/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.9.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      46 bits physical, 57 bits virtual
Byte Order:                         Little Endian
CPU(s):                             48
On-line CPU(s) list:                0-47
Vendor ID:                          GenuineIntel
Model name:                         Intel(R) Xeon(R) Silver 4310 CPU @ 2.10GHz
CPU family:                         6
Model:                              106
Thread(s) per core:                 2
Core(s) per socket:                 12
Socket(s):                          2
Stepping:                           6
BogoMIPS:                           4200.00
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect wbnoinvd dtherm ida arat pln pts avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid fsrm md_clear pconfig flush_l1d arch_capabilities
Virtualization:                     VT-x
L1d cache:                          1.1 MiB (24 instances)
L1i cache:                          768 KiB (24 instances)
L2 cache:                           30 MiB (24 instances)
L3 cache:                           36 MiB (2 instances)
NUMA node(s):                       2
NUMA node0 CPU(s):                  0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46
NUMA node1 CPU(s):                  1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47
Vulnerability Gather data sampling: Mitigation; Microcode
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.2
[pip3] onnx==1.15.0
[pip3] onnxruntime==1.16.3
[pip3] pytorch-labs-segment-anything-fast==0.2
[pip3] pytorch-triton==2.1.0+bcad9dabe1
[pip3] torch==2.2.0.dev20231206+cu121
[pip3] torchao==0.0.1
[pip3] torchaudio==2.2.0.dev20231206+cu121
[pip3] torchvision==0.17.0.dev20231206+cu121
[pip3] triton==2.1.0
[conda] numpy                     1.26.2                   pypi_0    pypi
[conda] pytorch-labs-segment-anything-fast 0.2                       dev_0    <develop>
[conda] pytorch-triton            2.1.0+bcad9dabe1          pypi_0    pypi
[conda] torch                     2.2.0.dev20231206+cu121          pypi_0    pypi
[conda] torchao                   0.0.1                    pypi_0    pypi
[conda] torchaudio                2.2.0.dev20231206+cu121          pypi_0    pypi
[conda] torchvision               0.17.0.dev20231206+cu121          pypi_0    pypi
[conda] triton                    2.1.0                    pypi_0    pypi

It seems to be a problem with nested_tensor, but I don't know what causes it or how to fix it.

error with flash attention

hi,
I'm trying to run amg_example.py, but I get a UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:253.) at return torch.nn.functional.scaled_dot_product_attention(q_, k_, v_, attn_mask=attn_bias) in flash_4.py, line 369. How can I solve it?

Shared memory out of resource: 135200 bytes required!

Reminder for testers of this project: the demo's Triton kernel requests 135200 bytes (~132 KB) of shared memory per block, which exceeds the hardware limit on many GPUs.

'triton.runtime.autotuner.OutOfResources: out of resource: shared memory, Required: 135200, Hardware limit: 101376. Reducing block sizes or num_stages may help.'

RTX 4090: RuntimeError: CUDA error: an illegal memory access was encountered

Hi, I am using a script modeled after the amg example, running on an RTX 4090 with PyTorch nightly from yesterday, and I am getting the following error when I call mask_generator.generate:

Warning: Custom flash attention kernels were written specifically for A100.
We will try to read previously created kernel configurations from /home/scassidy/sam-model/flash_4_configs.p.
You can disable this kernel by setting SEGMENT_ANYTHING_FAST_USE_FLASH_4=0
key  (torch.Size([1, 16, 4096, 128]), torch.Size([1, 16, 4096, 128]), torch.Size([1, 16, 4096, 128]), torch.Size([1, 16, 4096, 128]), torch.Size([1, 16, 4096, 128]), (8388608, 524288, 128, 1), (8388608, 524288, 128, 1), (8388608, 524288, 128, 1), (8388608, 524288, 128, 1), (8388608, 524288, 128, 1))  not found. Running autotune. This might take a while.
all configs len:  60
(64, 64, 1, 1)  : None
(64, 64, 2, 1)  : None
(64, 64, 2, 2)  : None
(64, 64, 4, 1)  : None
(64, 64, 4, 2)  : None
(64, 64, 4, 3)  : None
(64, 64, 4, 4)  : None
(64, 64, 8, 1)  : None
(64, 64, 8, 2)  : None
(64, 64, 8, 3)  : None
(64, 64, 8, 4)  : None
(64, 64, 8, 5)  : None
(64, 64, 8, 6)  : None
(64, 64, 8, 7)  : None
(64, 64, 8, 8)  : None
(64, 128, 1, 1)  : None
(64, 128, 2, 1)  : None
(64, 128, 2, 2)  : None
(64, 128, 4, 1)  : None
(64, 128, 4, 2)  : None
(64, 128, 4, 3)  : None
(64, 128, 4, 4)  : None
(64, 128, 8, 1)  : None
(64, 128, 8, 2)  : None
(64, 128, 8, 3)  : None
(64, 128, 8, 4)  : None
(64, 128, 8, 5)  : None
(64, 128, 8, 6)  : None
(64, 128, 8, 7)  : None
(64, 128, 8, 8)  : None
(128, 64, 1, 1)  : None
(128, 64, 2, 1)  : None
(128, 64, 2, 2)  : None
(128, 64, 4, 1)  : None
(128, 64, 4, 2)  : None
(128, 64, 4, 3)  : None
(128, 64, 4, 4)  : None
(128, 64, 8, 1)  : None
(128, 64, 8, 2)  : None
(128, 64, 8, 3)  : None
(128, 64, 8, 4)  : None
(128, 64, 8, 5)  : None
(128, 64, 8, 6)  : None
(128, 64, 8, 7)  : None
(128, 64, 8, 8)  : None
(128, 128, 1, 1)  : None
(128, 128, 2, 1)  : None
(128, 128, 2, 2)  : None
(128, 128, 4, 1)  : None
(128, 128, 4, 2)  : None
(128, 128, 4, 3)  : None
(128, 128, 4, 4)  : None
(128, 128, 8, 1)  : None
(128, 128, 8, 2)  : None
(128, 128, 8, 3)  : None
(128, 128, 8, 4)  : None
(128, 128, 8, 5)  : None
(128, 128, 8, 6)  : None
(128, 128, 8, 7)  : None
(128, 128, 8, 8)  : None
Found best_config  None  with time  None  for key  (torch.Size([1, 16, 4096, 128]), torch.Size([1, 16, 4096, 128]), torch.Size([1, 16, 4096, 128]), torch.Size([1, 16, 4096, 128]), torch.Size([1, 16, 4096, 128]), (8388608, 524288, 128, 1), (8388608, 524288, 128, 1), (8388608, 524288, 128, 1), (8388608, 524288, 128, 1), (8388608, 524288, 128, 1))
Warning: Custom flash attention kernels were written specifically for A100.
Storing configs for NVIDIA GeForce RTX 4090 locally under /home/scassidy/sam-model/flash_4_configs.p
Saving best configs to file /home/scassidy/sam-model/flash_4_configs.p
Traceback (most recent call last):
  File "/home/scassidy/sam-model/__main__.py", line 47, in <module>
    masks = mask_generator.generate(image)
  File "/home/scassidy/miniconda3/envs/sam-model/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/scassidy/miniconda3/envs/sam-model/lib/python3.10/site-packages/segment_anything_fast/automatic_mask_generator.py", line 170, in generate
    mask_data = self._generate_masks(image)
  File "/home/scassidy/miniconda3/envs/sam-model/lib/python3.10/site-packages/segment_anything_fast/automatic_mask_generator.py", line 213, in _generate_masks
    crop_data = self._process_crop(image, crop_box, layer_idx, orig_size)
  File "/home/scassidy/miniconda3/envs/sam-model/lib/python3.10/site-packages/segment_anything_fast/automatic_mask_generator.py", line 243, in _process_crop
    self.predictor.set_image(cropped_im)
  File "/home/scassidy/miniconda3/envs/sam-model/lib/python3.10/site-packages/segment_anything_fast/predictor.py", line 60, in set_image
    self.set_torch_image(input_image_torch, image.shape[:2])
  File "/home/scassidy/miniconda3/envs/sam-model/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/scassidy/miniconda3/envs/sam-model/lib/python3.10/site-packages/segment_anything_fast/predictor.py", line 90, in set_torch_image
    self.features = self.model.image_encoder(input_image.to(model_dtype))
  File "/home/scassidy/miniconda3/envs/sam-model/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/scassidy/miniconda3/envs/sam-model/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/scassidy/miniconda3/envs/sam-model/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 489, in _fn
    return fn(*args, **kwargs)
  File "/home/scassidy/miniconda3/envs/sam-model/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/scassidy/miniconda3/envs/sam-model/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/scassidy/miniconda3/envs/sam-model/lib/python3.10/site-packages/segment_anything_fast/modeling/image_encoder.py", line 107, in forward
    def forward(self, x: torch.Tensor) -> torch.Tensor:
  File "/home/scassidy/miniconda3/envs/sam-model/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 489, in _fn
    return fn(*args, **kwargs)
  File "/home/scassidy/miniconda3/envs/sam-model/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 17, in inner
    return fn(*args, **kwargs)
  File "/home/scassidy/miniconda3/envs/sam-model/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 899, in forward
    return compiled_fn(full_args)
  File "/home/scassidy/miniconda3/envs/sam-model/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/utils.py", line 81, in g
    return f(*args)
  File "/home/scassidy/miniconda3/envs/sam-model/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 94, in runtime_wrapper
    all_outs = call_func_at_runtime_with_args(
  File "/home/scassidy/miniconda3/envs/sam-model/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/utils.py", line 105, in call_func_at_runtime_with_args
    out = normalize_as_list(f(args))
  File "/home/scassidy/miniconda3/envs/sam-model/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 118, in rng_functionalization_wrapper
    return compiled_fw(args)
  File "/home/scassidy/miniconda3/envs/sam-model/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 864, in __call__
    return self.get_current_callable()(inputs)
  File "/home/scassidy/miniconda3/envs/sam-model/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 665, in run
    return compiled_fn(new_inputs)
  File "/home/scassidy/miniconda3/envs/sam-model/lib/python3.10/site-packages/torch/_inductor/cudagraph_trees.py", line 380, in deferred_cudagraphify
    fn, out = cudagraphify(model, inputs, new_static_input_idxs, *args, **kwargs)
  File "/home/scassidy/miniconda3/envs/sam-model/lib/python3.10/site-packages/torch/_inductor/cudagraph_trees.py", line 408, in cudagraphify
    return manager.add_function(
  File "/home/scassidy/miniconda3/envs/sam-model/lib/python3.10/site-packages/torch/_inductor/cudagraph_trees.py", line 1941, in add_function
    return fn, fn(inputs)
  File "/home/scassidy/miniconda3/envs/sam-model/lib/python3.10/site-packages/torch/_inductor/cudagraph_trees.py", line 1755, in run
    out = self._run(new_inputs, function_id)
  File "/home/scassidy/miniconda3/envs/sam-model/lib/python3.10/site-packages/torch/_inductor/cudagraph_trees.py", line 1796, in _run
    return self.run_eager(new_inputs, function_id)
  File "/home/scassidy/miniconda3/envs/sam-model/lib/python3.10/site-packages/torch/_inductor/cudagraph_trees.py", line 1911, in run_eager
    return node.run(new_inputs)
  File "/home/scassidy/miniconda3/envs/sam-model/lib/python3.10/site-packages/torch/_inductor/cudagraph_trees.py", line 611, in run
    out = self.wrapped_function.model(new_inputs)
  File "/home/scassidy/miniconda3/envs/sam-model/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 892, in _run_from_cache
    return compiled_graph.compiled_artifact(inputs)
  File "/tmp/torchinductor_scassidy/54/c54lurvlewsemc6k2ro7hglxwd7zdr2m74jrga6yu6mtylrbmuyw.py", line 3350, in call
    triton_poi_fused_clone_28.run(buf204, buf205, 5242880, grid=grid(5242880), stream=stream0)
  File "/home/scassidy/miniconda3/envs/sam-model/lib/python3.10/site-packages/torch/_inductor/triton_heuristics.py", line 533, in run
    self.autotune_to_one_config(*args, grid=grid, **kwargs)
  File "/home/scassidy/miniconda3/envs/sam-model/lib/python3.10/site-packages/torch/_inductor/triton_heuristics.py", line 437, in autotune_to_one_config
    timings = self.benchmark_all_configs(*args, **kwargs)
  File "/home/scassidy/miniconda3/envs/sam-model/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 244, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/scassidy/miniconda3/envs/sam-model/lib/python3.10/site-packages/torch/_inductor/triton_heuristics.py", line 413, in benchmark_all_configs
    timings = {
  File "/home/scassidy/miniconda3/envs/sam-model/lib/python3.10/site-packages/torch/_inductor/triton_heuristics.py", line 414, in <dictcomp>
    launcher: self.bench(launcher, *args, **kwargs)
  File "/home/scassidy/miniconda3/envs/sam-model/lib/python3.10/site-packages/torch/_inductor/triton_heuristics.py", line 385, in bench
    return do_bench(kernel_call, rep=40, fast_flush=True)
  File "/home/scassidy/miniconda3/envs/sam-model/lib/python3.10/site-packages/torch/_inductor/utils.py", line 167, in do_bench
    return triton_do_bench(*args, **kwargs)[0]
  File "/home/scassidy/miniconda3/envs/sam-model/lib/python3.10/site-packages/triton/testing.py", line 103, in do_bench
    torch.cuda.synchronize()
  File "/home/scassidy/miniconda3/envs/sam-model/lib/python3.10/site-packages/torch/cuda/__init__.py", line 801, in synchronize
    return torch._C._cuda_synchronize()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


import numpy as np
import torch
import matplotlib.pyplot as plt
import cv2
from segment_anything_fast import (
    sam_model_registry,
    sam_model_fast_registry,
    SamAutomaticMaskGenerator,
)


def show_anns(anns):
    if len(anns) == 0:
        return
    sorted_anns = sorted(anns, key=(lambda x: x["area"]), reverse=True)
    ax = plt.gca()
    ax.set_autoscale_on(False)

    img = np.ones(
        (
            sorted_anns[0]["segmentation"].shape[0],
            sorted_anns[0]["segmentation"].shape[1],
            4,
        )
    )
    img[:, :, 3] = 0
    for ann in sorted_anns:
        m = ann["segmentation"]
        color_mask = np.concatenate([np.random.random(3), [0.35]])
        img[m] = color_mask
    ax.imshow(img)


image = cv2.imread("<some_image>")
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)


sam_checkpoint = "checkpoints/sam_vit_h_4b8939.pth"
model_type = "vit_h"
device = "cuda:0"

sam = sam_model_fast_registry[model_type](checkpoint=sam_checkpoint)
sam.to(device=device)
mask_generator = SamAutomaticMaskGenerator(sam, process_batch_size=8)

masks = mask_generator.generate(image)

# Save an example
plt.figure(figsize=(image.shape[1] / 100.0, image.shape[0] / 100.0), dpi=100)
plt.imshow(image)
show_anns(masks)
plt.axis("off")
plt.tight_layout()
plt.savefig("<some_image>_1_fast.png", format="png")
plt.close()

image2 = cv2.imread("<some_image>")
image2 = cv2.cvtColor(image2, cv2.COLOR_BGR2RGB)

masks2 = mask_generator.generate(image2)

# Save an example
plt.figure(figsize=(image2.shape[1] / 100.0, image2.shape[0] / 100.0), dpi=100)
plt.imshow(image2)
show_anns(masks2)
plt.axis("off")
plt.tight_layout()
plt.savefig("<some_image>_2_fast.png", format="png")
plt.close()
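As the warning at the top of the log suggests, the custom flash attention kernel targets the A100, and here every autotuned config failed (best_config None) before the illegal memory access. A possible workaround sketch: delete the cached flash_4_configs.p file and disable the custom kernel before the package is imported, since the flag appears to be read at import time:

import os

# Set before importing segment_anything_fast; also remove the stale
# /home/scassidy/sam-model/flash_4_configs.p so the failed configs
# aren't reused.
os.environ["SEGMENT_ANYTHING_FAST_USE_FLASH_4"] = "0"

from segment_anything_fast import sam_model_fast_registry, SamAutomaticMaskGenerator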

error for int8 inference on Nvidia 3090

python run_experiments.py 8 vit_b ../ ../../segment-anything experiments_data --run-experiments --num-workers 8 --capture_output True

There wasn't an error traceback, just the results for each method:

int8,0.4473809043566386,local-fork,2.2.0.dev20231117+cu121,ERROR

technique time sam_commit_name pytorch_version sam_model_type batch_size memory(MiB) memory(%) img_s(avg) batch_ms(avg)/batch_size mIoU use_compile use_half compress epilogue_fusion_first use_compile_decoder use_nested_tensor use_rel_pos pad_input_image_batch num_workers num_batches num_images profile_path memory_path
fp32 9.687259 default 2.2.0.dev20231117+cu121 vit_b 8 19934 82 10.270197 97.369114 0.5335680786450683 False None None False False False True True 8 619 4952 None None
bf16 4.469203 codesign 2.2.0.dev20231117+cu121 vit_b 8 10003 41 23.441896 42.658666 0.5420768795834657 False torch.bfloat16 None False False False True True 8 619 4952 None None
compile 5.114826 codesign 2.2.0.dev20231117+cu121 vit_b 8 8159 33 30.676929 32.597788 0.5425282997311265 max-autotune torch.bfloat16 None False False False True True 8 619 4952 None None
SDPA 3.309412 sdpa-decoder 2.2.0.dev20231117+cu121 vit_b 8 4858 20 38.438522 26.015569 0.5363043800669344 max-autotune torch.bfloat16 None False False False True True 8 619 4952 None None
Triton 3.219514 local-fork 2.2.0.dev20231117+cu121 vit_b 8 4671 19 38.492990 25.978756 0.5363043800669344 max-autotune torch.bfloat16 None False False False True True 8 619 4952 None None
NT 3.210428 local-fork 2.2.0.dev20231117+cu121 vit_b 8 4671 19 39.489851 25.322962 0.5355440758154442 max-autotune torch.bfloat16 None False False True True True 8 619 4952 None None
int8 0.447381 local-fork 2.2.0.dev20231117+cu121 ERROR
sparse 3.139057 local-fork 2.2.0.dev20231117+cu121 vit_b 8 4969 20 46.041364 21.719600 0.4862656356911103 max-autotune torch.bfloat16 sparse False False False True True 8 619 4952 None None

Performance on RTX 4070

Hello,

I ran vanilla SAM (original and Fast) on an RTX 4070 running Ubuntu. Here are the numbers I get for a 1770x1180 image:

Original FPS: 0.219003
Fast FPS: 0.236836

Is the original SAM still faster than SAM-Fast, or am I doing something wrong?

Thanks

Question about the behavior of storing custom autotune configurations

As I mentioned in the title, I understand what

def _autotune(configs, function):

does. However, when not running on an A100 GPU instance,

def _load_best_configs():

is intended to use the saved config. Instead of the current implementation, shouldn't it be:

    if not device_name.startswith('NVIDIA A100'):
        cwd = pathlib.Path.cwd()
        saved_configs = cwd / "flash_4_configs.p"
        print(f"We will try to read previously created kernel configurations from {saved_configs}.")
        print("You can disable this kernel by setting SEGMENT_ANYTHING_FAST_USE_FLASH_4=0")
    if saved_configs.is_file():
        import pickle
        with open(saved_configs, 'rb') as f:
            print(f"Loading best configs from file {saved_configs}")
            return pickle.load(f)
    return None

support for aot compilation with torch.export?

Thanks for this amazing work and the accompanying blog post, very educational!

Would it be possible to export this fast SAM variant with torch.export or tracing? I suspect the answer may be no, because this torch.compile variation supports jagged arrays and torch.export doesn't, but I thought I would check.

In any case, any information on the ability to export this model for inference would be valuable!

I'd like to use this variation of SAM in torchserve; we currently experience long encode times for each image request. I noticed this comment in #67, which makes me wonder whether the JIT compile times will be too long for a user to interactively use this variation of SAM, since it will need to be compiled for each REST request.

Please also note that the first time you're running this model you'll likely need to wait a bit for it to compile.
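torch.compile artifacts are cached per process, not per request, so a common pattern is to pay the compile cost once when the long-lived worker starts, rather than per REST call. A sketch under that assumption (names and the checkpoint path are illustrative, not torchserve API):

# Inside a long-lived server worker's initialization (sketch):
import numpy as np
from segment_anything_fast import sam_model_fast_registry, SamAutomaticMaskGenerator

sam = sam_model_fast_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to(device="cuda")
mask_generator = SamAutomaticMaskGenerator(sam)

# One dummy call triggers torch.compile / autotune; subsequent requests
# served by the same process reuse the compiled code.
dummy = np.zeros((1024, 1024, 3), dtype=np.uint8)
mask_generator.generate(dummy)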

Windows not yet supported

Why is it that when I run amg_example.py, the result shows RuntimeError: Windows not yet supported for torch.compile?

Requires large GPU memory

Hi,
This model takes up ~23 GB of VRAM on my RTX 3090. I cannot run the model on more than two images before I hit OOM. Vanilla SAM only takes up ~5 GB of VRAM. Is this expected behavior? Thanks!

Using sam_model_fast_registry on NVIDIA GeForce RTX 4090 raises an error

/opt/conda/lib/python3.8/site-packages/torch/_dynamo/utils.py:1570: UserWarning: Memory Efficient Attention requires the attn_mask to be aligned to, 8 elements. Prior to calling SDPA, pad the last dimension of the attn_mask to be at least a multiple of 8 and then slice the attn_mask to the original size. (Triggered internally at ../aten/src/ATen/native/transformers/attention.cpp:551.)
  return node.target(*args, **kwargs)
AUTOTUNE convolution(1x3x1024x1024, 1280x3x16x16)
  convolution 0.2431 ms 100.0%
  triton_convolution_3 0.5962 ms 40.8%
  triton_convolution_1 0.6139 ms 39.6%
  triton_convolution_6 0.6553 ms 37.1%
  triton_convolution_4 0.8272 ms 29.4%
  triton_convolution_5 0.9010 ms 27.0%
  triton_convolution_0 1.0113 ms 24.0%
  triton_convolution_2 3.7757 ms 6.4%
SingleProcess AUTOTUNE takes 8.4253 seconds
AUTOTUNE mm(4900x1280, 1280x3840)
  triton_mm_10 0.3042 ms 100.0%
  triton_mm_9 0.3044 ms 100.0%
  triton_mm_15 0.3175 ms 95.8%
  mm 0.3243 ms 93.8%
  triton_mm_8 0.3281 ms 92.7%
  triton_mm_11 0.3322 ms 91.6%
  triton_mm_7 0.3356 ms 90.6%
  triton_mm_14 0.3555 ms 85.6%
  triton_mm_13 0.4628 ms 65.7%
  triton_mm_12 0.4674 ms 65.1%
SingleProcess AUTOTUNE takes 9.9830 seconds
AUTOTUNE bmm(14x5600x80, 14x80x14)
  bmm 0.0276 ms 100.0%
  triton_bmm_26 0.0531 ms 51.9%
  triton_bmm_21 0.0539 ms 51.1%
  triton_bmm_20 0.0546 ms 50.5%
  triton_bmm_23 0.0552 ms 49.9%
  triton_bmm_24 0.0559 ms 49.3%
  triton_bmm_30 0.0561 ms 49.1%
  triton_bmm_28 0.0570 ms 48.3%
  triton_bmm_29 0.0576 ms 47.9%
  triton_bmm_22 0.0577 ms 47.8%
SingleProcess AUTOTUNE takes 9.0480 seconds
AUTOTUNE bmm(14x5600x80, 14x80x14)
  bmm 0.0295 ms 100.0%
  triton_bmm_36 0.0546 ms 54.1%
  triton_bmm_39 0.0550 ms 53.7%
  triton_bmm_35 0.0558 ms 52.9%
  triton_bmm_33 0.0558 ms 52.9%
  triton_bmm_32 0.0559 ms 52.8%
  triton_bmm_41 0.0563 ms 52.4%
  triton_bmm_31 0.0563 ms 52.4%
  triton_bmm_42 0.0564 ms 52.3%
  triton_bmm_40 0.0567 ms 52.1%
SingleProcess AUTOTUNE takes 6.5209 seconds
AUTOTUNE mm(4900x1280, 1280x1280)
  triton_mm_44 0.1199 ms 100.0%
  triton_mm_51 0.1242 ms 96.5%
  triton_mm_46 0.1272 ms 94.2%
  triton_mm_47 0.1283 ms 93.4%
  triton_mm_45 0.1284 ms 93.4%
  triton_mm_43 0.1305 ms 91.9%
  mm 0.1467 ms 81.8%
  triton_mm_50 0.1582 ms 75.8%
  triton_mm_49 0.1715 ms 69.9%
  triton_mm_48 0.1749 ms 68.5%
SingleProcess AUTOTUNE takes 7.7492 seconds
AUTOTUNE mm(4096x1280, 1280x5120)
  triton_mm_56 0.3222 ms 100.0%
  triton_mm_58 0.3230 ms 99.8%
  triton_mm_57 0.3249 ms 99.2%
  triton_mm_59 0.3261 ms 98.8%
  mm 0.3281 ms 98.2%
  triton_mm_63 0.3495 ms 92.2%
  triton_mm_55 0.3971 ms 81.1%
  triton_mm_62 0.4037 ms 79.8%
  triton_mm_61 0.5124 ms 62.9%
  triton_mm_60 0.5172 ms 62.3%
SingleProcess AUTOTUNE takes 10.1354 seconds
AUTOTUNE mm(4096x5120, 5120x1280)
  triton_mm_70 0.3214 ms 100.0%
  triton_mm_71 0.3223 ms 99.7%
  mm 0.3253 ms 98.8%
  triton_mm_69 0.3438 ms 93.5%
  triton_mm_68 0.3445 ms 93.3%
  triton_mm_75 0.3832 ms 83.9%
  triton_mm_67 0.3880 ms 82.8%
  triton_mm_74 0.4686 ms 68.6%
  triton_mm_73 0.5219 ms 61.6%
  triton_mm_72 0.5222 ms 61.5%
SingleProcess AUTOTUNE takes 9.7927 seconds
AUTOTUNE addmm(4096x3840, 4096x1280, 1280x3840)
  triton_mm_514 0.2487 ms 100.0%
  triton_mm_513 0.2495 ms 99.7%
  triton_mm_515 0.2504 ms 99.3%
  triton_mm_512 0.2662 ms 93.4%
  triton_mm_519 0.2712 ms 91.7%
  bias_addmm 0.2847 ms 87.4%
  triton_mm_518 0.2966 ms 83.9%
  triton_mm_511 0.2989 ms 83.2%
  addmm 0.3052 ms 81.5%
  triton_mm_516 0.3908 ms 63.7%
SingleProcess AUTOTUNE takes 9.8966 seconds
AUTOTUNE bmm(64x1024x80, 64x80x64)
  triton_bmm_531 0.0304 ms 100.0%
  triton_bmm_532 0.0323 ms 94.2%
  triton_bmm_529 0.0324 ms 93.7%
  triton_bmm_528 0.0327 ms 93.0%
  bmm 0.0328 ms 92.7%
  triton_bmm_524 0.0329 ms 92.4%
  triton_bmm_523 0.0332 ms 91.7%
  triton_bmm_525 0.0335 ms 90.8%
  triton_bmm_530 0.0336 ms 90.5%
  triton_bmm_526 0.0347 ms 87.6%
SingleProcess AUTOTUNE takes 7.9944 seconds
AUTOTUNE mm(4096x1280, 1280x1280)
  triton_mm_550 0.0950 ms 100.0%
  triton_mm_551 0.0951 ms 99.8%
  triton_mm_549 0.0982 ms 96.7%
  triton_mm_548 0.0994 ms 95.6%
  triton_mm_555 0.1012 ms 93.9%
  triton_mm_547 0.1047 ms 90.7%
  mm 0.1159 ms 81.9%
  triton_mm_554 0.1332 ms 71.3%
  triton_mm_553 0.1449 ms 65.6%
  triton_mm_552 0.1482 ms 64.1%
SingleProcess AUTOTUNE takes 9.5205 seconds
AUTOTUNE convolution(1x1280x64x64, 256x1280x1x1)
  convolution 0.0409 ms 100.0%
  triton_convolution_2315 0.0632 ms 64.7%
  triton_convolution_2317 0.0963 ms 42.5%
  triton_convolution_2314 0.0977 ms 41.8%
  triton_convolution_2312 0.1126 ms 36.3%
  conv1x1_via_mm 0.1306 ms 31.3%
  triton_convolution_2316 0.1394 ms 29.3%
  triton_convolution_2313 0.1460 ms 28.0%
  triton_convolution_2311 0.1762 ms 23.2%
SingleProcess AUTOTUNE takes 6.8541 seconds
AUTOTUNE convolution(1x256x64x64, 256x256x3x3)
  convolution 0.0861 ms 100.0%
  triton_convolution_2324 0.1452 ms 59.3%
  triton_convolution_2319 0.1805 ms 47.7%
  triton_convolution_2322 0.1813 ms 47.5%
  triton_convolution_2321 0.1835 ms 46.9%
  triton_convolution_2323 0.2455 ms 35.1%
  triton_convolution_2320 0.2628 ms 32.8%
  triton_convolution_2318 0.3300 ms 26.1%
SingleProcess AUTOTUNE takes 8.1397 seconds
Traceback (most recent call last):
  File "server/main.py", line 36, in sam
    ans = pth_process(info, image)
  File "/sam/./scripts/pth_model.py", line 143, in process
    masks, scores, logits = predictor.infer(image,input_point,input_label,input_box,pmask)
  File "/sam/./scripts/pth_model.py", line 20, in infer
    return self.predictor.infer(input_image, input_point, input_label,input_box,input_mask)
  File "/sam/./sam_fast/segment_anything_fast/predictor.py", line 282, in infer
    return self.predict(
  File "/sam/./sam_fast/segment_anything_fast/predictor.py", line 165, in predict
    iou_predictions_np = iou_predictions[0].detach().cpu().numpy()
TypeError: Got unsupported ScalarType BFloat16

I use vit_h and the code below; sam_fast is the segment-anything-fast main branch:

from sam_fast.segment_anything_fast import sam_model_registry, sam_model_fast_registry, SamPredictor

sam = sam_model_fast_registry[model_type](checkpoint=checkpoint)
sam.to(device=config["device"])
predictor = SamPredictor(sam)
masks, scores, logits = predictor.infer(image, input_point, input_label, input_box, pmask)

# predictor.infer:
def infer(self, input_image, input_point, input_label, input_box, input_mask):
    self.set_image(input_image)
    return self.predict(
        point_coords=input_point,
        point_labels=input_label,
        box=input_box,
        multimask_output=False,
        mask_input=input_mask,
    )
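The traceback ends in .numpy() on a bfloat16 tensor, which NumPy cannot represent. A minimal workaround sketch is to upcast to float32 first; the patched line mirrors predictor.py, line 165 from the traceback:

# In predict(), upcast before converting to NumPy:
iou_predictions_np = iou_predictions[0].detach().float().cpu().numpy()
# Apply the same .float() before .numpy() to the masks and low-res logits.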

No module named triton on 2080 Ti

Hi all,

I was trying to run amg_example.py on a 2080 Ti. I know the Triton kernel was written specifically for the A100,

so according to the README I need to set the environment variable SEGMENT_ANYTHING_FAST_USE_FLASH_4=0.

Here is my code:

import os
os.environ['SEGMENT_ANYTHING_FAST_USE_FLASH_4'] = '0'

But I still get the missing triton module error.

Did I do something wrong? Or do you have any suggestions?

Thank you
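One thing worth checking, noted as an observation to verify against your installed version: flash_4.py appears to import triton unconditionally when segment_anything_fast loads, so the environment variable only disables use of the custom kernel; it does not remove the import. Installing triton (pip install triton) should clear the ModuleNotFoundError, and the variable must be set before the first segment_anything_fast import, as in the corrected snippet above.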

Worse performance on SamAutomaticMaskGenerator

hi,
I run on an A100 and use the SamAutomaticMaskGenerator method to segment images; the parameters are as follows:

    points_per_side: Optional[int] = 32,
    points_per_batch: int = 64,
    pred_iou_thresh: float = 0.85,
    stability_score_thresh: float = 0.85,
    stability_score_offset: float = 1.0,
    box_nms_thresh: float = 0.7,
    crop_n_layers: int = 0,
    crop_nms_thresh: float = 0.7,
    crop_overlap_ratio: float = 0,
    crop_n_points_downscale_factor: int = 1,
    point_grids: Optional[List[np.ndarray]] = None,
    min_mask_region_area: int = 0,
    output_mode: str = "binary_mask",

Also, I use from segment_anything import sam_model_fast_registry instead of from segment_anything import sam_model_registry. However, the segmentation speed is slower:

 sam: 8.35s
 sam-faster: 48s

Loaded logs: (screenshot omitted)

May I ask why this problem happened? Thank you!

How did the Automatic Mask Generator support batching before nested tensors were applied?

In the torch.compile and CUDA-sync benchmarks, the model is benchmarked at different batch sizes.

Meanwhile, batching was only introduced when nested tensors were added, and it isn't available in the original Segment Anything repo,

so how are we benchmarking at various batch sizes when batching wasn't implemented yet?

Exceed INT_MAX

When I use the "vit_h" model for "segment anything" inference on images, once the image is too large, such as 1500x2000, I get an error saying it exceeds INT_MAX. However, when I reduce the image size, the error no longer occurs. How can I solve this issue?

Looking forward to the author's response, thank you!

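Until the overflow itself is fixed, one workaround sketch is to downscale large images before generation. The 1024 cap below is an assumption; SAM's own preprocessing resizes the longest side to 1024, so the encoder sees a similar input either way:

import cv2

def generate_downscaled(mask_generator, image, max_side=1024):
    h, w = image.shape[:2]
    scale = max_side / max(h, w)
    if scale < 1.0:
        # Cap the longer side; the returned masks are then at the
        # downscaled resolution and need upsampling back if required.
        image = cv2.resize(image, (int(w * scale), int(h * scale)))
    return mask_generator.generate(image)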

Windows system

Can't segment-anything-fast run on Windows? Must it be a Linux system?

Error running SamAutomaticMaskGenerator

Hello, I got this error running sam_model_fast_registry with sam_vit_b_01ec64:

torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
LoweringException: AttributeError: 'PermuteView' object has no attribute 'freeze_layout'
  target: aten.convolution.default
  args[0]: TensorBox(
    PermuteView(data=StorageBox(
      ComputedBuffer(name='buf353', layout=FlexibleLayout('cpu', torch.bfloat16, size=[1, 64, 64, 768], stride=[3145728, 49152, 768, 1]), data=Pointwise(
        'cpu',
        torch.bfloat16,
        def inner_fn(index):
            _, i1, i2, i3 = index
            tmp0 = ops.load(buf345, i3 + 768 * i2 + 49152 * i1)
            tmp1 = ops.load(buf352, i3 + 768 * i2 + 49152 * i1)
            tmp2 = tmp0 + tmp1
            return tmp2
        ,
        ranges=[1, 64, 64, 768],
        origin_node=add_132,
        origins={add_132}
      ))
    ), dims=[0, 3, 1, 2])
  )
  args[1]: TensorBox(StorageBox(
    InputBuffer(name='arg175_1', layout=FixedLayout('cpu', torch.bfloat16, size=[256, 768, 1, 1], stride=[768, 1, 1, 1]))
  ))
  args[2]: None
  args[3]: [1, 1]
  args[4]: [0, 0]
  args[5]: [1, 1]
  args[6]: False
  args[7]: [0, 0]
  args[8]: 1

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information


You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True


amg_example.py

Why is it that when I run amg_example.py, the result shows RuntimeError: Windows not yet supported for torch.compile? Can it run on Ubuntu?

Can't torch.export the model with a dynamic batch size

The model can be exported without any dynamic shapes:

import torch
import torchvision
from torch.export import export, Dim
from segment_anything_fast import sam_model_fast_registry, SamPredictor

sam_checkpoint = "../segment-anything-fast/experiments/sam_vit_h_4b8939.pth"
model_type = "vit_h"
device = "cuda"

sam = sam_model_fast_registry[model_type](checkpoint=sam_checkpoint)
# sam.to(device=device)
predictor = SamPredictor(sam)
encoder = predictor.model.image_encoder

example_args = (torch.randn(2,3,1024, 1024, dtype=torch.bfloat16),)
# Create a dynamic batch size
batch = Dim("batch")
h = Dim("h")
w = Dim("w")
# # Specify that the batch and height and width dimensions are dynamic
# dynamic_shapes=(({0: Dim("batch"), 2: Dim("h"), 3: Dim("w")},),)
dynamic_shapes=()
exported_program = export(predictor.model.image_encoder, args=example_args, dynamic_shapes=dynamic_shapes)

But setting the batch size to be dynamic causes a strange error:

dynamic_shapes=(({0: Dim("batch")},),)
UserError: Expecting `args` to be a tuple of example positional inputs, got <class 'torch.Tensor'>

And setting h or w to dynamic triggers guard errors. Does segment-anything-fast not support dynamic input shapes? Supporting them would be great, so that inputs don't need to be resized.

The full thread where I'm trying to debug torch.export is here; I tried various options for nesting the dynamic shape specs but can't find a combination that works for segment-anything-fast.

https://pytorch.slack.com/archives/C3PDTEV8E/p1702780972665469

Below is a minimal working example of dynamic-shape export provided by Angela Yi in the Slack channel:

import torch
from torch.export import export, Dim
def g(x):
    return x + x

def f(*args):
    return g(*args)

example_args = (torch.randn(2,3,1024, 1024),)
dynamic_shapes=(({0: Dim("batch")},),)
export(f, example_args, dynamic_shapes=dynamic_shapes)

I would expect dynamic_shapes=(({0: Dim("batch")},),) to work for the segment-anything-fast encoder as well, given that the encoder takes an input with the same dimensions.

_attention_rel_h_rel_w_kernel_aligned_device not receiving correct dtype

I am trying to use set_image from SamPredictor:

import numpy as np
from PIL import Image
from segment_anything_fast import SamPredictor, sam_model_registry
sam = sam_model_registry["vit_h"](checkpoint="<path/to/checkpoint>").cuda()

predictor = SamPredictor(sam)
img = np.array(Image.open("path/to/img"))
predictor.set_image(img)

which returns the error:

Loading best configs from file /home/gtorres/anaconda3/envs/sam-fast/lib/python3.9/site-packages/segment_anything_fast/configs/flash_4_configs_a100.p
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/gtorres/anaconda3/envs/sam-fast/lib/python3.9/site-packages/segment_anything_fast/predictor.py", line 60, in set_image
    self.set_torch_image(input_image_torch, image.shape[:2])
  File "/home/gtorres/anaconda3/envs/sam-fast/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/gtorres/anaconda3/envs/sam-fast/lib/python3.9/site-packages/segment_anything_fast/predictor.py", line 90, in set_torch_image
    self.features = self.model.image_encoder(input_image.to(model_dtype))
  File "/home/gtorres/anaconda3/envs/sam-fast/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/gtorres/anaconda3/envs/sam-fast/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/gtorres/anaconda3/envs/sam-fast/lib/python3.9/site-packages/segment_anything_fast/modeling/image_encoder.py", line 113, in forward
    x = blk(x)
  File "/home/gtorres/anaconda3/envs/sam-fast/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/gtorres/anaconda3/envs/sam-fast/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/gtorres/anaconda3/envs/sam-fast/lib/python3.9/site-packages/segment_anything_fast/modeling/image_encoder.py", line 175, in forward
    x = self.attn(x)
  File "/home/gtorres/anaconda3/envs/sam-fast/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/gtorres/anaconda3/envs/sam-fast/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/gtorres/anaconda3/envs/sam-fast/lib/python3.9/site-packages/segment_anything_fast/modeling/image_encoder.py", line 245, in forward
    x = _attention_rel_h_rel_w(q, k, v, rel_h, rel_w)
  File "/home/gtorres/anaconda3/envs/sam-fast/lib/python3.9/site-packages/segment_anything_fast/flash_4.py", line 337, in _attention_rel_h_rel_w
    o = torch.ops.customflash.custom_flash_aligned(
  File "/home/gtorres/anaconda3/envs/sam-fast/lib/python3.9/site-packages/torch/_ops.py", line 760, in __call__
    return self._op(*args, **kwargs or {})
  File "/home/gtorres/anaconda3/envs/sam-fast/lib/python3.9/site-packages/segment_anything_fast/flash_4.py", line 298, in _attention_rel_h_rel_w_kernel_aligned
    _attention_rel_h_rel_w_kernel_aligned_device(q,
  File "/home/gtorres/anaconda3/envs/sam-fast/lib/python3.9/site-packages/segment_anything_fast/flash_4.py", line 185, in _attention_rel_h_rel_w_kernel_aligned_device
    assert (q.dtype == torch.bfloat16 or q.dtype == torch.float16)
AssertionError

I have installed following the instructions in the README in a fresh environment.
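The assertion at flash_4.py, line 185 fires because the tensors reach the custom kernel in float32. Since that kernel expects half precision, a likely fix (an assumption based on the two registries this package exposes) is to build the model through sam_model_fast_registry, which applies the half-precision setup, rather than sam_model_registry:

import numpy as np
from PIL import Image
from segment_anything_fast import SamPredictor, sam_model_fast_registry

# sam_model_fast_registry sets the model up for the bfloat16 path the
# custom flash kernel asserts on; sam_model_registry leaves it in float32.
sam = sam_model_fast_registry["vit_h"](checkpoint="<path/to/checkpoint>").cuda()
predictor = SamPredictor(sam)
predictor.set_image(np.array(Image.open("path/to/img")))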

Min Memory requirement?

Hello,

I am unable to run the code on an RTX 4070 with 12 GB of memory. I get the following error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB. GPU 0 has a total capacity of 11.73 GiB of which 285.94 MiB is free. Process 4646 has 298.61 MiB memory in use. Including non-PyTorch memory, this process has 10.81 GiB memory in use. Of the allocated memory 8.91 GiB is allocated by PyTorch, and 1.67 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Is there a way to reduce the memory requirement?

Thanks.
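Two knobs that usually help on a 12 GB card, sketched below: the allocator setting is the one the error message itself recommends, and switching to the smaller vit_b checkpoint is an assumption about acceptable accuracy:

import os

# Recommended by the OOM message; set before torch allocates on the GPU.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

from segment_anything_fast import sam_model_fast_registry, SamAutomaticMaskGenerator

# vit_b's image encoder is far smaller than vit_h's.
sam = sam_model_fast_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
sam.to(device="cuda")
mask_generator = SamAutomaticMaskGenerator(sam, points_per_batch=32, process_batch_size=1)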
