facebookincubator / aitemplate Goto Github PK

AITemplate is a Python framework which renders neural network into high performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference.

License: Apache License 2.0

Shell 0.02% Python 77.19% Cuda 3.86% C++ 18.85% CMake 0.01% Nix 0.01% C 0.07%

aitemplate's People

Contributors

Stargazers

Watchers

Forkers

kerrmudgeon antinucleon hlu1 ktg1 patrickgalbraith lly-zero-one terrychenism seastar105 zhiqwang rosenrodt techthiyanes josephrocca zivzone kartset zumbalamambo jiawenliu64 kyrie-zhao xiaomingde marcus-arcadius chengruichang snapbuy anminhhung rsh4d0w yunda-si galic1987 xkey- nitin-mane nanderoo bill-kalog davidhuji chomolungma hephaex michaelwnau alejandrosuarez pruthvistony rocm americanpresidentjimmycarter ssahgal kennethgoodman voctory xiezhq-hermann richardsonjf cli99 sun2011yao zjing14 teora lixipin johnpertoft hirajanwin chunjieshan doomedramen aralche stbere b1sounours draa jojogithub02 a2393439531 shaojiewang chengscott lyrl juanpa04 rajusheik syed-ahmed blackxin zedysproject1 jaybdub baekms flychen50 gyanachand1 the-intelligence-of-information ankitrana2709 bhaskarnallani nhtlongcs gavinljj mbrukman dearowen trannhiem syncbytes ilovespectra onenewworld wqn1 neiltian-tencent lygztq deep-learning-plan freshy969 oms1996 joskid dlml xumicoder test00dezwebsite kimsoohwan devrabbiz ericxsun bokyliu sirflickka zhangjun namld sergeykadiyevskiy ayicherry dmillner

aitemplate's Issues

compile_model throws error about profiling (cutlass_f16_s884fprop_fixed_channels_f16_256x128_32x3_nhwc_align_4_8 is not executable)

When I compile a model that had nn.Conv2dBiasFewChannels I get this error:

<aitemplate.compiler.ops.conv.conv2d> Profile: conv2d_bias_few_channels_1: NI == 1 && HI == 384 && WI == 384 && CI == 4
<aitemplate.backend.profiler_runner> Using 1 GPU for profiling conv2d_bias_few_channels_1

RuntimeError: Profiler ./output/profiler/conv2d_bias_few_channels/cutlass_f16_s884fprop_fixed_channels_f16_256x128_32x3_nhwc_align_4_8 is not executable

This was confusing because I did not start any profiling, just did compile_model. Until I read that for some ops codegen require profiling. Is it to find the most optimized path? It is unusual for a compiler to run part of the code, conceptually.

My first question is this, which ops require profiling for codegen?

Here are some that seem to create a profiler during compile: conv2d_bias_few_channels, conv2d_bias_add_identity, gemm_rcr_bias, bmm_crr, bmm_rcr, bmm_ccr_add, gemm_rcr_bias_fast_gelu, gemm_rcr_bias_add, bmm_rrr, conv2d_bias,

And my second question is why did this fail? Could it be that my V100 machine does not support that op? All other ops mentioned above have the executable built but this one. When I use nn.Conv2dBias instead, I get another issue "a/b is not aligned" which I am now looking into

add example wav2vec/hubert?

automatically parsing Pytorch module to AIT module, not define AIT module layer by layer

I referred official tutorial but I can't find the automatic conversion from PyTorch module to AIT module like ONNX->TensorRT.

https://facebookincubator.github.io/AITemplate/tutorial/how_to_infer_pt.html#define-a-pytorch-module

Is there any parser or converter building AIT module from PyTorch module directly?

Example does not run due to missing cutlass lib

To reproduce the error, I started a fresh install with these commands, following the README guides:

cd python
python setup.py bdist_wheel
pip install dist/*.whl
cd ..
python3 examples/05_stable_diffusion/compile.py

│ModuleNotFoundError: No module named 'cutlass_lib'

ops.conv2d(group=256) outputs NaN and Inf

This is the UnitTest

import unittest

import torch

from aitemplate.compiler import compile_model, ops
from aitemplate.frontend import Tensor
from aitemplate.testing import detect_target


class ConvGroupTestCase(unittest.TestCase):
    def test_fp16(self):
        groups = 256   # if changed to 1 this passes
        size = (12,12)
        target = detect_target()
        X = Tensor(
            shape=[1, *size, 256],
            dtype="float16",
            name="input_0",
            is_input=True,
        )
        W = Tensor(
            shape=[256, 3, 3, 256//groups], dtype="float16", name="input_1", is_input=True
        )
        OP = ops.conv2d(stride=1, pad=1, dilate=1, group=groups)
        Y = OP(X, W)
        Y._attrs["name"] = "output_0"
        Y._attrs["is_output"] = True
        module = compile_model(Y, target, "./output", "conv2dgroup")

        X_pt = torch.randn(1, 256, *size).cuda().half()
        W_pt = torch.randn(256, 256//groups, 3, 3).cuda().half()
        Y_pt = torch.nn.functional.conv2d(X_pt, W_pt, padding=1, groups=groups)
        x = X_pt.permute((0, 2, 3, 1)).contiguous()
        w = W_pt.permute((0, 2, 3, 1)).contiguous()
        y = torch.empty([1, *size, 256]).cuda().half()
        module.run_with_tensors({"input_0": x, "input_1": w}, [y])
        y_transpose = y.permute((0, 3, 1, 2))
        self.assertFalse(y_transpose.isnan().any())
        self.assertFalse(y_transpose.isinf().any())
        if target.name() == "cuda":
            self.assertTrue(torch.allclose(Y_pt, y_transpose, atol=1e-2, rtol=1e-2))
        else:
            self.assertTrue(torch.allclose(Y_pt, y_transpose, atol=1.25e-1, rtol=1e-1))

if __name__ == "__main__":
    torch.manual_seed(0)
    unittest.main()

The AIT output contains many NaN and zeros

I noticed there was no tests for ops.conv2d testing group. Also, PyTorch conv2d calls this groups while it is called group here

Is there C++ API provided?

Thank for your project! When we want to deploy my model in c++ project, is there C++ API provided to deploy my model?
We don't find any c++ api to use. if you could provide c++ api, we will appreciate it.

does diffuer example support batch size option?

In plain diffuser, if we make prompt a list, it will batch the input, but got following error in AITemplate. I make prompt a list of size 2.

{'trained_betas'} was not found in config. Values will be initialized to default values.
[18:28:36] ./tmp/CLIPTextModel/model-generated.h:275: Init AITemplate Runtime.
[18:28:37] ./tmp/UNet2DConditionModel/model-generated.h:3262: Init AITemplate Runtime.
[18:28:37] ./tmp/AutoencoderKL/model-generated.h:678: Init AITemplate Runtime.
[18:28:40] ./tmp/CLIPTextModel/model_interface.cu:92: Error: [SetValue] Dimension got value out of bounds; expected value to be in [1, 1], but got 2
Traceback (most recent call last):
  File "examples/05_stable_diffusion/demo.py", line 46, in <module>
    run()
  File "/home/root/miniconda3/envs/ldm/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/root/miniconda3/envs/ldm/lib/python3.8/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/root/miniconda3/envs/ldm/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/root/miniconda3/envs/ldm/lib/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "examples/05_stable_diffusion/demo.py", line 37, in run
    image = pipe(prompt).images[0]
  File "/home/root/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/root/repos/AITemplate/examples/05_stable_diffusion/pipeline_stable_diffusion_ait.py", line 247, in __call__
    text_embeddings = self.clip_inference(text_input.input_ids.to(self.device))
  File "/home/root/repos/AITemplate/examples/05_stable_diffusion/pipeline_stable_diffusion_ait.py", line 139, in clip_inference
    exe_module.run_with_tensors(inputs, ys, graph_mode=True)
  File "/home/root/miniconda3/envs/ldm/lib/python3.8/site-packages/aitemplate/compiler/model.py", line 483, in run_with_tensors
    outputs_ait = self.run(
  File "/home/root/miniconda3/envs/ldm/lib/python3.8/site-packages/aitemplate/compiler/model.py", line 438, in run
    return self._run_impl(
  File "/home/root/miniconda3/envs/ldm/lib/python3.8/site-packages/aitemplate/compiler/model.py", line 377, in _run_impl
    self.DLL.AITemplateModelContainerRun(
  File "/home/root/miniconda3/envs/ldm/lib/python3.8/site-packages/aitemplate/compiler/model.py", line 192, in _wrapped_func
    raise RuntimeError(f"Error in function: {method.__name__}")
RuntimeError: Error in function: AITemplateModelContainerRun

nn.Conv2d output mismatch with torch.nn.Conv2d

This is a full repro:

import torch
import numpy as np

from collections import OrderedDict

from aitemplate.testing import detect_target
from aitemplate.frontend import nn, Tensor
from aitemplate.compiler import compile_model

def map_pt_params(ait_model, pt_model):
  ait_model.name_parameter_tensor()
  pt_params = dict(pt_model.named_parameters())
  mapped_pt_params = OrderedDict()
  # names should be valid C++ variables
  for name, _ in ait_model.named_parameters():
    ait_name = name.replace(".", "_")
    assert name in pt_params
    params = pt_params[name]
    if len(params.shape) == 4:
        # NCHW->NHWC
        params = params.permute(0,2,3,1).contiguous()
        # Pad for few channels
        if params.shape[-1] == 3:
            print(f"pad {name}")
            params = torch.nn.functional.pad(params, (0,1))
    mapped_pt_params[ait_name] = params
  return mapped_pt_params

def mark_output(Y):
    Y._attrs["is_output"] = True
    Y._attrs["name"] = "Y"


def get_input(shape=None):
    X = Tensor(
        shape=shape,
        name="X",
        dtype="float16",
        is_input=True,
    )
    return X

EPS = 1e-1

class ConvEmbed(nn.Module):
    """ Image to Patch Embedding
    """

    def __init__(
        self,
        in_chans=3,
        embed_dim=64,
        patch_size=7,
        stride=4,
        padding=2,
    ):
        super().__init__()
        self.patch_size = patch_size

        self.proj = nn.Conv2dBias(
            in_chans, embed_dim, kernel_size=patch_size, stride=stride, padding=padding
        )


    def forward(self, x):
        x = self.proj(x)
        return x

class ConvEmbedPt(torch.nn.Module):
    """ Image to Patch Embedding
    """

    def __init__(
        self,
        in_chans=3,
        embed_dim=64,
        patch_size=7,
        stride=4,
        padding=2,
    ):
        super().__init__()
        self.patch_size = patch_size

        self.proj = torch.nn.Conv2d(
            in_chans, embed_dim, kernel_size=patch_size, stride=stride, padding=padding
        )


    def forward(self, x):
        x = self.proj(x)
        return x

def build_convembed0():
    ait_model = ConvEmbed(in_chans=4, embed_dim=256, patch_size=7, stride=4, padding=3)
    ait_model.name_parameter_tensor()
    X = get_input(shape=[1,384,384,4])
    Y = ait_model(X)
    mark_output(Y)
    return ait_model, Y

x = torch.rand(1,4,384,384).cuda().half()
m = ConvEmbedPt(4,256,patch_size=7,stride=4,padding=3).cuda().half()
with torch.no_grad():
    y_pt = m(x)
ait_model, Y = build_convembed0()
weights = map_pt_params(ait_model, m)
target = detect_target()
module = compile_model(Y, target, "./output", "repro", constants=weights)

inputs = [x.permute((0, 2, 3, 1)).contiguous()]
ys = []
num_ouputs = len(module.get_output_name_to_index_map())
for i in range(num_ouputs):
    shape = module.get_output_maximum_shape(i)
    ys.append(torch.empty(shape).cuda().half())

module.run_with_tensors(inputs, ys)
print((y_pt.permute(0,2,3,1) - ys[0]).abs().max())
np.testing.assert_allclose(
    y_pt.transpose(1,-1).detach().cpu().numpy(),
    ys[0].cpu().numpy(),
    atol=0.1,
    rtol=0.1,
)

I tried using nn.Conv2dBias and nn.Conv2d which did not help (Conv has bias but it was not clear if nn.Conv2dBias is same as nn.Conv2d).
Also tries not transposing the weights, also not helpful

can we get stable diffusion example work with xformers?

Hi, I was wondering ... is there any easy way to use xformers with the AITemplate in stable diffusion? Since it lowers memory consumption, we can infer with resolution north of 1024x1024. Can we pair it with AITemplate too? reference ... huggingface/diffusers#532

Easter egg..?

https://github.com/facebookincubator/AITemplate/blob/main/python/aitemplate/backend/cuda/gemm_universal/common.py#L216
https://github.com/facebookincubator/AITemplate/blob/main/python/aitemplate/backend/cuda/gemm_universal/group_common.py#L250

RuntimeError: Error in function: AITemplateModelContainerRun

I have been following installation steps and trying to run the stable diffusion template.
But running all the commands, I face issues after this.

CMD:
python3 examples/05_stable_diffusion/demo.py --token ACCESS_TOKEN
Error:
./tmp/UNet2DConditionModel/model-generated.h:3334: Init AITemplate Runtime.
pt output: torch.Size([2, 4, 64, 64])
[09:26:03] ./tmp/UNet2DConditionModel/model_interface.cu:92: Error: [SetValue] Dimension got value out of bounds; expected value to be in [96, 96], but got 64
Traceback (most recent call last):
File "examples/05_stable_diffusion/demo.py", line 46, in
run()
File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1130, in call
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "examples/05_stable_diffusion/demo.py", line 37, in run
image = pipe(prompt).images[0]
File "/usr/local/lib/python3.8/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/AITemplate/examples/05_stable_diffusion/pipeline_stable_diffusion_ait.py", line 328, in call
noise_pred = self.unet_inference(
File "/AITemplate/examples/05_stable_diffusion/pipeline_stable_diffusion_ait.py", line 123, in unet_inference
exe_module.run_with_tensors(inputs, ys, graph_mode=False)
File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/model.py", line 483, in run_with_tensors
outputs_ait = self.run(
File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/model.py", line 438, in run
return self._run_impl(
File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/model.py", line 377, in _run_impl
self.DLL.AITemplateModelContainerRun(
File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/model.py", line 192, in _wrapped_func
raise RuntimeError(f"Error in function: {method.name}")
RuntimeError: Error in function: AITemplateModelContainerRun

I cannot find any way to solve this.

Do not gate V100 support

The README.md says NVIDIA: AIT is only tested on SM80+ GPUs (Ampere etc). Not all kernels work with old SM75/SM70 (T4/V100) GPUs.

Which I interpreted as it may work but we won't guarantee it. However in https://github.com/facebookincubator/AITemplate/blob/main/python/aitemplate/testing/detect_target.py#L41 there's an explicit gate on V100 which if I fixed the example works and is also 2x faster

If this was not intended, please let me know I can make the PR to fix this. V100 and T4 are by far the most popular GPUs I see among enterprises.

if "V100" in stdout or "RTX 20" in stdout:
  return "75"

Performance on V100

AITemplate time: 0.11990207433700562 ms/iter
PyTorch eager time: 0.20665957641601562 ms/iter

Repro

from collections import OrderedDict

import torch

from aitemplate.compiler import compile_model
from aitemplate.frontend import nn, Tensor
from aitemplate.testing import detect_target
from aitemplate.testing.benchmark_pt import benchmark_torch_function
from aitemplate.utils.graph_utils import sorted_graph_pseudo_code

class PTSimpleModel(torch.nn.Module):
  def __init__(self, hidden, eps: float = 1e-5):
    super().__init__()
    self.dense1 = torch.nn.Linear(hidden, 4 * hidden)
    self.act1 = torch.nn.functional.gelu
    self.dense2 = torch.nn.Linear(4 * hidden, hidden)
    self.layernorm = torch.nn.LayerNorm(hidden, eps=eps)

  def forward(self, input):
    hidden_states = self.dense1(input)
    hidden_states = self.act1(hidden_states)
    hidden_states = self.dense2(hidden_states)
    hidden_states = hidden_states + input
    hidden_states = self.layernorm(hidden_states)
    return hidden_states

class AITSimpleModel(nn.Module):
  def __init__(self, hidden, eps: float = 1e-5):
    super().__init__()
    self.dense1 = nn.Linear(hidden, 4 * hidden, specialization="fast_gelu")
    self.dense2 = nn.Linear(4 * hidden, hidden)
    self.layernorm = nn.LayerNorm(hidden, eps=eps)

  def forward(self, input):
    hidden_states = self.dense1(input)
    hidden_states = self.dense2(hidden_states)
    hidden_states = hidden_states + input
    hidden_states = self.layernorm(hidden_states)
    return hidden_states

def map_pt_params(ait_model, pt_model):
  ait_model.name_parameter_tensor()
  pt_params = dict(pt_model.named_parameters())
  mapped_pt_params = OrderedDict()
  for name, _ in ait_model.named_parameters():
    ait_name = name.replace(".", "_")
    assert name in pt_params
    mapped_pt_params[ait_name] = pt_params[name]
  return mapped_pt_params

batch_size=1024
hidden=512
# create pt model
pt_model = PTSimpleModel(hidden).cuda().half()

# create pt input
x = torch.randn([batch_size, hidden]).cuda().half()

# run pt model
pt_model.eval()
y_pt = pt_model(x)

batch_size=1024
hidden=512
# create AIT model
ait_model = AITSimpleModel(hidden)
# create AIT input Tensor
X = Tensor(
      shape=[batch_size, hidden],
      name="X",
      dtype="float16",
      is_input=True,
)
# run AIT module to generate output tensor
Y = ait_model(X)
# mark the output tensor
Y._attrs["is_output"] = True
Y._attrs["name"] = "Y"

# map pt weights to ait
weights = map_pt_params(ait_model, pt_model)

# codegen
target = detect_target()
with compile_model(
    Y, target, "./tmp", "simple_model_demo", constants=weights
) as module:
  # create storage for output tensor
  y = torch.empty([batch_size, hidden]).cuda().half()

  # inputs and outputs dict
  inputs = {"X": x}
  outputs = {"Y": y}

  # run
  module.run_with_tensors(inputs, outputs, graph_mode=True)

  # verify output is correct
  print(torch.allclose(y, y_pt, atol=1e-2, rtol=1e-2))

  # benchmark ait and pt
  count = 1000
  ait_t, _, _ = module.benchmark_with_tensors(
      inputs, outputs, graph_mode=True, count=count
  )
  print(f"AITemplate time: {ait_t} ms/iter")

  pt_t = benchmark_torch_function(count, pt_model.forward, x)
  print(f"PyTorch eager time: {pt_t} ms/iter")

Index Tensor with Tensor

I want to ask how to index tensor with tensor just like torch.tensor:

# pytorch
x=x[x.argmax(dim=-1)]

I only find the ops.argmax() in AIT, but how could I index another Tensor with the argmax results whose type would also be Tensor?

Benchmark and multithread

I have a question over the computation of the latency in the cpp benchmark function.

Below accumulate applies a max over each thread output:

AITemplate/static/csrc/model_container.cpp

Line 224 in 44026ba

auto max_time = std::accumulate(

  auto max_time = std::accumulate(
      futures.begin(), futures.end(), 0.f, [](float cur_val, auto& future) {
        return std::max(future.get(), cur_val);
      });

So, my understanding is that max_time is the longest time of all threads and not the total time taken.

Later on, max_time value is divided by the total nb of inferences performed by all threads.

AITemplate/static/csrc/model_container.cpp

Line 256 in 44026ba

return max_time / total_num_iters;

  auto total_num_iters = num_threads * count;
  return max_time / total_num_iters;

I don't understand why accumulate performs a max instead of a sum (of total time for each thread) if we divide it by num_threads * count.

v0.1.1 Stable Diffusion conv2d_bias crush on RTX 3080/3090

Under investigation...

Conv2dBiasFewChannels issue with map_pt_params

When using Conv2dBiasFewChannels with a C=3 input, special_conv2d_bias_activation calls nhwc3to4 to pad the tensor, so the convolution weights shape is recorded with 4 channels instead of 3.

When using map_pt_params in compile_model results in this error:

ValueError: ConstantTensor's maximum size is not equal to len(data)! Got len(data)=75264, but expected at least 100352 bytes. Check that the ConstantTensor's size and dtype are correct.

The error message could be a little more helpful to print out the name and size of the parameter tensor that does not match.
It took me some time, and the workaround is to pad the weights.
This should be documented

future plans for backpropagation?

stable diffusion example in docker error

running in docker using downloaded diffuser checkpoint. See snippet below:

  pipe = StableDiffusionAITPipeline.from_pretrained(
      "./stable-diffusion-v1-4",
      revision="fp16",
      torch_dtype=torch.float16,
      # use_auth_token=token,
  ).to("cuda")

Error outputs:

root@3c9c833e62c7:/AITemplate# CUUDA_VISIBLE_DEVICES=7 python3 my_exampels/05_stable_diffusion/demo.py
INFO:aitemplate.testing.detect_target:Set target to CUDA
ftfy or spacy is not installed using BERT BasicTokenizer instead of ftfy.
{'trained_betas'} was not found in config. Values will be initialized to default values.
Traceback (most recent call last):
  File "my_exampels/05_stable_diffusion/demo.py", line 46, in <module>
    run()
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "my_exampels/05_stable_diffusion/demo.py", line 29, in run
    pipe = StableDiffusionAITPipeline.from_pretrained(
  File "/usr/local/lib/python3.8/dist-packages/diffusers/pipeline_utils.py", line 391, in from_pretrained
    model = pipeline_class(**init_kwargs)
  File "/AITemplate/my_exampels/05_stable_diffusion/pipeline_stable_diffusion_ait.py", line 89, in __init__
    self.clip_ait_exe = self.init_ait_module(
  File "/AITemplate/my_exampels/05_stable_diffusion/pipeline_stable_diffusion_ait.py", line 104, in init_ait_module
    mod = Model(os.path.join(workdir, model_name, "test.so"))
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/model.py", line 212, in __init__
    self.DLL = self._DLLWrapper(lib_path, num_runtimes)
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/model.py", line 169, in __init__
    self.DLL = ctypes.cdll.LoadLibrary(lib_path)
  File "/usr/lib/python3.8/ctypes/__init__.py", line 451, in LoadLibrary
    return self._dlltype(name)
  File "/usr/lib/python3.8/ctypes/__init__.py", line 373, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: tmp/CLIPTextModel/test.so: cannot open shared object file: No such file or directory
Exception ignored in: <function Model.__del__ at 0x7fcbb1d8b940>
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/model.py", line 255, in __del__
    self.close()
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/model.py", line 259, in close
    for ptr in list(self._allocated_ait_data):
AttributeError: 'Model' object has no attribute '_allocated_ait_data'

work with torchserve: ./tmp/CLIPTextModel/model-generated.h:3327: Pending model run did not finish successfully. Error: an illegal memory access was encountered

I built a torchserve docker image on top of AITemplate docker. The demo code works fine in my docker. However, when I pack the AITemplate SD model to torchserve archiver and invoke an inference request, It outputs below errors.

The first highlighted logs indicates the AITemplate SD model is being loaded. Looks OK, but the second highlighted logs says "./tmp/CLIPTextModel/model-generated.h:3327: Pending model run did not finish successfully. Error: an illegal memory access was encountered" when I tried to invoke an inference task from torchserve. Any ideas on how to debug this? Or is it possible for AITemplate models to work with torchserve at this moment? Thanks!

2022-10-29T00:31:01,400 [INFO ] W-9000-sd-v1-5_2.2-stdout MODEL_LOG - Set target to CUDA
2022-10-29T00:31:03,385 [WARN ] W-9000-sd-v1-5_2.2-stderr MODEL_LOG - [00:31:03] ./tmp/CLIPTextModel/model-generated.h:275: Init AITemplate Runtime.
2022-10-29T00:31:03,634 [WARN ] W-9000-sd-v1-5_2.2-stderr MODEL_LOG - [00:31:03] ./tmp/UNet2DConditionModel/model-generated.h:3262: Init AITemplate Runtime.
2022-10-29T00:31:03,662 [WARN ] W-9000-sd-v1-5_2.2-stderr MODEL_LOG - [00:31:03] ./tmp/AutoencoderKL/model-generated.h:678: Init AITemplate Runtime.
2022-10-29T00:31:06,315 [INFO ] W-9000-sd-v1-5_2.2 org.pytorch.serve.wlm.WorkerThread - Backend response time: 6164
2022-10-29T00:31:06,316 [DEBUG] W-9000-sd-v1-5_2.2 org.pytorch.serve.wlm.WorkerThread - W-9000-sd-v1-5_2.2 State change WORKER_STARTED -> WORKER_MODEL_LOADED
2022-10-29T00:31:06,316 [INFO ] W-9000-sd-v1-5_2.2 TS_METRICS - W-9000-sd-v1-5_2.2.ms:7131|#Level:Host|#hostname:7435eee333e7,timestamp:1667003466
2022-10-29T00:31:06,316 [INFO ] epollEventLoopGroup-3-2 ACCESS_LOG - /172.17.0.1:45166 "PUT /models/sd-v1-5?min_worker=1&synchronous=true HTTP/1.1" 200 7136
2022-10-29T00:31:06,317 [INFO ] W-9000-sd-v1-5_2.2 TS_METRICS - WorkerThreadTime.ms:14|#Level:Host|#hostname:7435eee333e7,timestamp:1667003466
2022-10-29T00:31:06,317 [INFO ] epollEventLoopGroup-3-2 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:7435eee333e7,timestamp:1667003409
2022-10-29T00:31:48,406 [INFO ] W-9000-sd-v1-5_2.2 org.pytorch.serve.wlm.WorkerThread - Flushing req. to backend at: 1667003508406
2022-10-29T00:31:48,409 [INFO ] W-9000-sd-v1-5_2.2-stdout MODEL_LOG - Backend received inference at: 1667003508
2022-10-29T00:31:49,477 [WARN ] W-9000-sd-v1-5_2.2-stderr MODEL_LOG - [00:31:49] ./tmp/CLIPTextModel/model-generated.h:3327: Pending model run did not finish successfully. Error: an illegal memory access was encountered
2022-10-29T00:31:49,477 [WARN ] W-9000-sd-v1-5_2.2-stderr MODEL_LOG - [00:31:49] ./tmp/CLIPTextModel/model-generated.h:248: Got error: no error enum: 700 at ./tmp/CLIPTextModel/model-generated.h: 617
2022-10-29T00:31:49,477 [WARN ] W-9000-sd-v1-5_2.2-stderr MODEL_LOG - [00:31:49] ./tmp/CLIPTextModel/model_interface.cu:92: Error: Got error: no error enum: 700 at ./tmp/CLIPTextModel/model-generated.h: 617
2022-10-29T00:31:49,479 [INFO ] W-9000-sd-v1-5_2.2-stdout MODEL_LOG - Invoking custom service failed.
2022-10-29T00:31:49,479 [INFO ] W-9000-sd-v1-5_2.2-stdout MODEL_LOG - Traceback (most recent call last):
2022-10-29T00:31:49,479 [INFO ] W-9000-sd-v1-5_2.2-stdout MODEL_LOG - File "/home/venv/lib/python3.8/site-packages/ts/service.py", line 102, in predict
2022-10-29T00:31:49,479 [INFO ] W-9000-sd-v1-5_2.2-stdout MODEL_LOG - ret = self._entry_point(input_batch, self.context)
2022-10-29T00:31:49,479 [INFO ] W-9000-sd-v1-5_2.2-stdout MODEL_LOG - File "/home/venv/lib/python3.8/site-packages/ts/torch_handler/base_handler.py", line 232, in handle
2022-10-29T00:31:49,480 [INFO ] W-9000-sd-v1-5_2.2-stdout MODEL_LOG - output = self.inference(data_preprocess)
2022-10-29T00:31:49,480 [INFO ] W-9000-sd-v1-5_2.2 org.pytorch.serve.wlm.WorkerThread - Backend response time: 1071
2022-10-29T00:31:49,480 [INFO ] W-9000-sd-v1-5_2.2-stdout MODEL_LOG - File "/home/model-server/tmp/models/de091f5ad82f4a5dafed8e5d35304dfe/handler.py", line 62, in inference
2022-10-29T00:31:49,480 [INFO ] W-9000-sd-v1-5_2.2-stdout MODEL_LOG - pil_imgs = self.model(prompt, random_seed, bs, disable_nsfw)
2022-10-29T00:31:49,480 [INFO ] W-9000-sd-v1-5_2.2-stdout MODEL_LOG - File "/home/model-server/tmp/models/de091f5ad82f4a5dafed8e5d35304dfe/model.py", line 23, in call
2022-10-29T00:31:49,481 [INFO ] W-9000-sd-v1-5_2.2-stdout MODEL_LOG - images = self.model(prompt=[prompt]*4, generator=generator, num_images_per_prompt=4 if bs is None else bs).images
2022-10-29T00:31:49,481 [INFO ] W-9000-sd-v1-5_2.2-stdout MODEL_LOG - File "/home/venv/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
2022-10-29T00:31:49,481 [INFO ] W-9000-sd-v1-5_2.2-stdout MODEL_LOG - return func(*args, **kwargs)
2022-10-29T00:31:49,481 [INFO ] W-9000-sd-v1-5_2.2-stdout MODEL_LOG - File "/home/model-server/AITemplate/examples/05_stable_diffusion/pipeline_stable_diffusion_ait.py", line 262, in call
2022-10-29T00:31:49,482 [INFO ] W-9000-sd-v1-5_2.2-stdout MODEL_LOG - uncond_embeddings = self.clip_inference(
2022-10-29T00:31:49,482 [INFO ] W-9000-sd-v1-5_2.2-stdout MODEL_LOG - File "/home/model-server/AITemplate/examples/05_stable_diffusion/pipeline_stable_diffusion_ait.py", line 139, in clip_inference
2022-10-29T00:31:49,482 [INFO ] W-9000-sd-v1-5_2.2 ACCESS_LOG - /172.17.0.1:54318 "POST /predictions/sd-v1-5 HTTP/1.1" 503 1089
2022-10-29T00:31:49,482 [INFO ] W-9000-sd-v1-5_2.2-stdout MODEL_LOG - exe_module.run_with_tensors(inputs, ys, graph_mode=False)
2022-10-29T00:31:49,482 [INFO ] W-9000-sd-v1-5_2.2 TS_METRICS - Requests5XX.Count:1|#Level:Host|#hostname:7435eee333e7,timestamp:1667003409
2022-10-29T00:31:49,482 [INFO ] W-9000-sd-v1-5_2.2-stdout MODEL_LOG - File "/home/venv/lib/python3.8/site-packages/aitemplate/compiler/model.py", line 483, in run_with_tensors
2022-10-29T00:31:49,483 [INFO ] W-9000-sd-v1-5_2.2-stdout MODEL_LOG - outputs_ait = self.run(
2022-10-29T00:31:49,483 [DEBUG] W-9000-sd-v1-5_2.2 org.pytorch.serve.job.Job - Waiting time ns: 295810, Inference time ns: 1077028904
2022-10-29T00:31:49,483 [INFO ] W-9000-sd-v1-5_2.2-stdout MODEL_LOG - File "/home/venv/lib/python3.8/site-packages/aitemplate/compiler/model.py", line 438, in run
2022-10-29T00:31:49,483 [INFO ] W-9000-sd-v1-5_2.2 TS_METRICS - WorkerThreadTime.ms:6|#Level:Host|#hostname:7435eee333e7,timestamp:1667003509
2022-10-29T00:31:49,483 [INFO ] W-9000-sd-v1-5_2.2-stdout MODEL_LOG - return self._run_impl(
2022-10-29T00:31:49,483 [INFO ] W-9000-sd-v1-5_2.2-stdout MODEL_LOG - File "/home/venv/lib/python3.8/site-packages/aitemplate/compiler/model.py", line 377, in _run_impl
2022-10-29T00:31:49,483 [INFO ] W-9000-sd-v1-5_2.2-stdout MODEL_LOG - self.DLL.AITemplateModelContainerRun(
2022-10-29T00:31:49,484 [INFO ] W-9000-sd-v1-5_2.2-stdout MODEL_LOG - File "/home/venv/lib/python3.8/site-packages/aitemplate/compiler/model.py", line 192, in _wrapped_func
2022-10-29T00:31:49,484 [INFO ] W-9000-sd-v1-5_2.2-stdout MODEL_LOG - raise RuntimeError(f"Error in function: {method.name}")
2022-10-29T00:31:49,485 [INFO ] W-9000-sd-v1-5_2.2-stdout MODEL_LOG - RuntimeError: Error in function: AITemplateModelContainerRun

Does the Stable Diffusion AI Template work with T4 and V100?

Simple nn.Conv2dBias throws RuntimeError: a/b is not aligned

cutlass_f16_s884fprop_optimized_f16_256x128_32x2_nhwc_align_8_8

RuntimeError: a/b is not aligned

Repro:

import torch

from aitemplate.frontend import nn, Tensor
from aitemplate.testing import detect_target
from aitemplate.compiler import compile_model


class ConvEmbed(nn.Module):
    """ Image to Patch Embedding
    """

    def __init__(
        self,
        patch_size=7,
        in_chans=3,
        embed_dim=64,
        stride=4,
        padding=2,
    ):
        super().__init__()
        self.patch_size = patch_size

        self.proj = nn.Conv2dBias(
            in_chans, embed_dim, kernel_size=patch_size, stride=stride, padding=padding
        )

    def forward(self, x):
        x = self.proj(x)
        return x

target = detect_target()

ait_model = ConvEmbed(in_chans=3, embed_dim=256, patch_size=7, stride=4, padding=3)
X = Tensor(
      shape=[1,384,384,3],
      name="X",
      dtype="float16",
      is_input=True,
)
Y = ait_model(X)
# mark the output tensor
Y._attrs["is_output"] = True
Y._attrs["name"] = "Y"

# map pt weights to ait
ait_model.name_parameter_tensor()

module = compile_model(Y, target, "./output", "conv_embed")#, constants=weights)

The input has only 3 channels, so this function throws error.

If I change the channels to 4 then it works, but I wonder if there is a better way

ait_model = ConvEmbed(in_chans=4, embed_dim=256, patch_size=7, stride=4, padding=3)
X = Tensor(
      shape=[1,384,384,4],
      name="X",
      dtype="float16",
      is_input=True,
)

Support arbitrary width & height in stable diffusion example

It would be nice to be able to use different heights and widths up to 1024x1024.

Example Models Wishlist

We keep a wishlist of examples may appear in v0.2 or later release. Any contributions are welcomed.

[bug] layer norm refine on rocm backend

refine pass is currently broken on rocm

What is the best way to accept uint8 input

float16 is not CPU-friendly and float32 input is unnecessarily large (if we are to add data marshaling).
I usually pass the input as bytes (uint8), then convert to float16 inside the model (a GPU node e.g. in ONNX). Currently I am thinking of adding a ops.castfp16 but wanted to ask if there is already a better solution.

03-bert example failing to tune

Failed to complete Bert example.
Using ROCm, MI250

AITemplate/examples/03_bert# python3 benchmark_ait.py

...

2022-10-06 18:20:10,019 INFO <aitemplate.backend.builder> Using 256 CPU for building
2022-10-06 18:20:10,019 INFO <aitemplate.compiler.ops.gemm_universal.gemm_common> Profile: gemm_rcr_bias_permute_m2n3_11233: M == 524288 && N == 2304 && K == 768
2022-10-06 18:20:10,019 INFO <aitemplate.backend.profiler_runner> Using 1 GPU for profiling gemm_rcr_bias_permute_m2n3_11233
Traceback (most recent call last):
  File "benchmark_ait.py", line 298, in <module>
    compile_and_benchmark()
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "benchmark_ait.py", line 284, in compile_and_benchmark
    mod = compile_module(
  File "benchmark_ait.py", line 210, in compile_module
    mod = compile_model(y, target, "./tmp", model_name)
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/compiler.py", line 176, in compile_model
    compiler.transform.profile(
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/transform/profile.py", line 67, in profile
    func.profile(
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/ops/gemm_universal/gemm_common.py", line 675, in profile
    best_algo, workspace, split_k = self._profile_single_workload(
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/ops/gemm_universal/gemm_common.py", line 593, in _profile_single_workload
    raise RuntimeError(
RuntimeError: Profile workload: failed. Results: [].

Failing to complete tuning for 03-bert example KeyError: 'exec_path'

AITemplate/examples/03_bert# python3 benchmark_ait.py
Using ROCM on MI250

2022-10-05 22:15:06,791 INFO <aitemplate.backend.builder> Using 256 CPU for building
2022-10-05 22:15:06,791 INFO <aitemplate.compiler.ops.gemm_universal.gemm_common> Profile: gemm_rcr_bias_permute_m2n3_1: M == 64 && N == 2304 && K == 768
2022-10-05 22:15:06,791 INFO <aitemplate.compiler.ops.gemm_universal.gemm_common> Profile: bmm_softmax_bmm_permute_7: B == 12 && M == 64 && N == 64 && K == 64 && O == 64
2022-10-05 22:15:06,791 INFO <aitemplate.compiler.ops.gemm_universal.gemm_common> Profile: gemm_rcr_bias_add_10: M == 64 && N == 768 && K == 768
2022-10-05 22:15:06,791 INFO <aitemplate.compiler.ops.layernorm.layernorm> Profile: layernorm_13: M == 64 && N == 768
2022-10-05 22:15:06,791 INFO <aitemplate.compiler.ops.layernorm.layernorm> Load profiling result from cache.
2022-10-05 22:15:06,791 INFO <aitemplate.compiler.ops.gemm_universal.gemm_common> Profile: gemm_rcr_bias_fast_gelu_14: M == 64 && N == 3072 && K == 768
2022-10-05 22:15:06,791 INFO <aitemplate.compiler.ops.gemm_universal.gemm_common> Profile: gemm_rcr_bias_add_15: M == 64 && N == 768 && K == 3072
Traceback (most recent call last):
  File "benchmark_ait.py", line 298, in <module>
    compile_and_benchmark()
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "benchmark_ait.py", line 284, in compile_and_benchmark
    mod = compile_module(
  File "benchmark_ait.py", line 210, in compile_module
    mod = compile_model(y, target, "./tmp", model_name)
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/compiler.py", line 176, in compile_model
    compiler.transform.profile(
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/transform/profile.py", line 60, in profile
    paths = func._attrs["exec_path"].keys()
KeyError: 'exec_path'

Seamless images in stable diffusion

Hi! There's a neat hack to create seamless images in stable diffusion - by replacing padding_mode in all conv2d to "circular", for vae and for unet: Sygil-Dev/sygil-webui#224 (comment)
Is it possible to do the same with AITemplate?

Compile issue: Tensor conv2d_bias_64_1 not in outputs for op avg_pool2d_53

summary

I use AITemplate to re-construct a diffusion model which is slightly different than the one in examples, but error occurs while call compile_model().
Since it says some conv2d_bias tensor is not in the outputs of some AvgPooling op, so I just show the related code as below, where 'AvgPooling' is only used by the 'Resample' module.
I review the forward implementation of 'ResidualBlock' several times, but no clue can be found.

code1

the nn.AvgPool2d is only defined in Resample Module, and only used in ResidualBlock forward()

class Resample(nn.Module):
    def __init__(self, in_dim, out_dim, scale_factor, use_conv=False):
        assert scale_factor in [0.5, 1.0, 2.0]
        super(Resample, self).__init__()
        self.in_dim = in_dim
        self.out_dim = out_dim
        self.scale_factor = scale_factor
        self.use_conv = use_conv

        # layers
        if scale_factor == 2.0:
            self.resample = nn.Sequential(
                nn.Upsampling2d(scale_factor=scale_factor, mode='nearest'),
                nn.Conv2dBias(in_dim, out_dim, 3, 1, padding=1) if use_conv else nn.Identity())
        elif scale_factor == 0.5:
            if use_conv:
                self.resample = nn.Conv2dBias(in_dim, out_dim, 3, stride=2, padding=1) 
            else:
                self.resample = nn.AvgPool2d(kernel_size=2, stride=2, padding=0)
        else:
            self.resample = nn.Identity()
    
    def forward(self, x):
        return self.resample(x)

class SiLU(nn.Module):
    def __init__(self) -> None:
        super(SiLU, self).__init__()
        self.silu = ops.silu
    
    def forward(self, x):
        out = self.silu(x)
        return out

class ResidualBlock(nn.Module):

    def __init__(self, in_dim, embed_dim, out_dim, use_scale_shift_norm=True,
                 scale_factor=1.0, dropout=0.0):
        super(ResidualBlock, self).__init__()
        self.in_dim = in_dim
        self.embed_dim = embed_dim
        self.out_dim = out_dim
        self.use_scale_shift_norm = use_scale_shift_norm
        self.scale_factor = scale_factor

        # layers
        self.layer1 = nn.ModuleList([
            nn.GroupNorm(32, in_dim),
            SiLU(),
            nn.Conv2dBias(in_dim, out_dim, 3, 1, padding=1)])
        self.resample = Resample(in_dim, in_dim, scale_factor, use_conv=False)
        self.embedding = nn.Sequential(
            SiLU(),
            nn.Linear(embed_dim, out_dim * 2 if use_scale_shift_norm else out_dim))
        self.layer2 = nn.ModuleList([
            nn.GroupNorm(32, out_dim),
            SiLU(),
            nn.Dropout(dropout),
            nn.Conv2dBias(out_dim, out_dim, 3, 1, padding=1)])
        self.shortcut = nn.Identity() if in_dim == out_dim else nn.Conv2dBias(in_dim, out_dim, 1, 1)

    
    def forward(self, x, e):
        hidden_states = x

        hidden_states = self.layer1[0](hidden_states)
        hidden_states_0 = self.layer1[1](hidden_states)
        x = self.resample(x)  
        hidden_states_1 = self.resample(hidden_states_0) # error may occur here ?
        hidden_states_2 = self.layer1[2](hidden_states_1) 
        e = self.embedding(e)
        bs, dim = get_shape(e)
        e = ops.reshape()(e, [bs, 1, 1, dim])

        hidden_states = hidden_states_2 + e
        hidden_states = self.layer2[0](hidden_states)
        hidden_states = self.layer2[1](hidden_states)
        hidden_states = self.layer2[2](hidden_states)
        hidden_states = self.layer2[3](hidden_states)

        x = self.shortcut(x)
        out = hidden_states + x
        return out

code2

code in convert2ait_upsampler.py

def rebuild_net(use_fp16_acc=False, convert_conv_to_gemm=False):

    ...
    net = pytorch_model().cuda().half() # use fp16
    net.eval()
    ait_net = AITUpsampler()
    ait_net.name_parameter_tensor()
    mapped_params = map_pt_params(ait_net, net)

    batch_size = 4
    hh = 256
    ww = 256
    cc = 3
    x0 = Tensor(
        [batch_size, hh, ww, cc], name="input0", is_input=True
    )
    t = Tensor([batch_size, upsampler256_config['dim']], name="input1", is_input=True)
    y = Tensor([batch_size, upsampler256_config['y_dim']], name="input2", is_input=True)
    concat = Tensor(
        [batch_size, hh, ww, cc], name="input3", is_input=True
    )

    Y_out = ait_net(x0, t, y, concat)
    target = detect_target(
        use_fp16_acc=use_fp16_acc, convert_conv_to_gemm=convert_conv_to_gemm
    )

    compile_model(Y_out, target, "./tmp", "AIT_UPSAMPLER256", constants=mapped_params)

Error

Traceback (most recent call last):
  File "convert2ait_upsampler.py", line 106, in <module>
    compile_net(True, True)
  File "convert2ait_upsampler.py", line 103, in compile_net
    compile_model(Y_out, target, "./tmp", "AIT_UPSAMPLER", constants=mapped_params)
  File "/home/envs/zero/lib/python3.8/site-packages/aitemplate/compiler/compiler.py", line 152, in compile_model
    compiler.transform.remove_no_ops(graph)
  File "/home/envs/zero/lib/python3.8/site-packages/aitemplate/compiler/transform/remove_no_ops.py", line 167, in remove_no_ops
    sorted_graph = f_pass(sorted_graph)
  File "/home/envs/zero/lib/python3.8/site-packages/aitemplate/compiler/transform/remove_no_ops.py", line 82, in _remove_no_op_expands
    return transform_utils.sanitize_sorted_graph(sorted_graph)
  File "/home/envs/zero/lib/python3.8/site-packages/aitemplate/compiler/transform/transform_utils.py", line 272, in sanitize_sorted_graph
    check_graph_validity(new_sorted_graph, raiseError=True)
  File "/home/envs/zero/lib/python3.8/site-packages/aitemplate/compiler/transform/transform_utils.py", line 69, in check_graph_validity
    valid = handleError(
  File "/home/envs/zero/lib/python3.8/site-packages/aitemplate/compiler/transform/transform_utils.py", line 40, in handleError
    raise RuntimeError(msg)
RuntimeError: Tensor conv2d_bias_64_1 not in outputs for op avg_pool2d_53

Con2d: a/b is not aligned for C_in

I occurred a problem when compiling a vision transformer model. It used conv2d, but I noticed in:

AITemplate/python/aitemplate/backend/cuda/conv2d/common.py

Lines 207 to 216 in 44026ba

 def cal_align_ab(x_shape: List[int]) -> int: 

 """Returns input alignment.""" 

 k = x_shape[3] # CI 

 if k % 8 == 0: 

 return 8 

 if k % 4 == 0: 

 return 4 

 if k % 2 == 0: 

 return 2 

 raise RuntimeError("a/b is not aligned")

where x is the Channels of input tensor (which comes from [N, H, W, C]) and it's usually equal to 3 (rgb), but here why doing this? It will always raise error.

2022-10-22 15:09:37,833 INFO <aitemplate.compiler.ops.conv.conv2d> Profile: conv2d_0: NI == 1 && HI == 224 && WI == 224 && CI == 3
2022-10-22 15:09:37,897 INFO <aitemplate.backend.profiler_runner> Using 1 GPU for profiling conv2d_0

Above is my error log. You can see that NI == 1 && HI == 224 && WI == 224 && CI == 3

[FEATURE] Support for NVIDIA T4 (Turing Architecture)

Hello 🙋🏻‍♂️

It is very cool to see MetaAI going into inference optimization! This will help the community and companies to much long term speaking!
While reading through the announcement blog post i noticed that

AITemplate is currently enabled on NVIDIA's A100 and AMD’s MI200 GPU systems, both of which are widely used today in data centers from technology companies, research labs, and cloud computing service providers.

Which awesome but might be a big limitation for many since A100 is still not very accessible.
Having support for NVIDIA T4 (Turing), which is most widely available GPU in public clouds would be very helpful.

compile error when run demo for stable diffusion

2022-10-27 18:39:28,064 INFO <aitemplate.backend.builder> Using 128 CPU for building
2022-10-27 18:39:28,064 INFO <aitemplate.compiler.ops.gemm_universal.gemm_common> Profile: gemm_rcr_bias_8: M == 64 && N == 2304 && K == 768
2022-10-27 18:39:28,064 INFO <aitemplate.backend.profiler_runner> Using 1 GPU for profiling gemm_rcr_bias_8
Traceback (most recent call last):
File "examples/05_stable_diffusion/compile.py", line 359, in
compile_diffusers()
File "/ssd2/panzheng01/miniconda3/envs/aitemp/lib/python3.8/site-packages/click/core.py", line 1130, in call
return self.main(*args, **kwargs)
File "/ssd2/panzheng01/miniconda3/envs/aitemp/lib/python3.8/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/ssd2/panzheng01/miniconda3/envs/aitemp/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/ssd2/panzheng01/miniconda3/envs/aitemp/lib/python3.8/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "examples/05_stable_diffusion/compile.py", line 346, in compile_diffusers
compile_clip(batch_size=batch_size, use_fp16_acc=use_fp16_acc, convert_conv_to_gemm=convert_conv_to_gemm)
File "examples/05_stable_diffusion/compile.py", line 252, in compile_clip
compile_model(Y, target, "./tmp", "CLIPTextModel", constants=params_ait)
File "/ssd2/panzheng01/miniconda3/envs/aitemp/lib/python3.8/site-packages/aitemplate/compiler/compiler.py", line 176, in compile_model
compiler.transform.profile(
File "/ssd2/panzheng01/miniconda3/envs/aitemp/lib/python3.8/site-packages/aitemplate/compiler/transform/profile.py", line 67, in profile
func.profile(
File "/ssd2/panzheng01/miniconda3/envs/aitemp/lib/python3.8/site-packages/aitemplate/compiler/ops/gemm_universal/gemm_common.py", line 675, in profile
best_algo, workspace, split_k = self._profile_single_workload(
File "/ssd2/panzheng01/miniconda3/envs/aitemp/lib/python3.8/site-packages/aitemplate/compiler/ops/gemm_universal/gemm_common.py", line 593, in _profile_single_workload
raise RuntimeError(
RuntimeError: Profile workload: failed. Results: [].

all input default, just run "python3 examples/05_stable_diffusion/compile.py --token ********* "
and clear tmp by "rm -rf ~/.aitemplate/* rm -rf ./tmp/profiler/*"

GPU: A100-SXM-80GB

env pkg:
tokenizers 0.12.1
torch 1.12.1+cu113
torchaudio 0.12.1+cu113
torchvision 0.13.1+cu113
diffusers 0.4.0
transformers 4.22.0

Memory usage increases when tableDiffusionAITPipeline is run repeatedly.

summary

After loading the model with from_pretrained, memory usage increases as the pipeline is used repeatedly.
Each execution consumes 50 MB to 100 MB of memory.
Eventually, the process stops, eating up memory.
Oddly enough, the first few times the memory usage does not seem to increase significantly.

code

import torch
from pipeline_stable_diffusion_ait import StableDiffusionAITPipeline
pipe = StableDiffusionAITPipeline.from_pretrained(
	"CompVis/stable-diffusion-v1-4",
	revision="fp16",
	torch_dtype=torch.float16,
	use_auth_token=token,
	).to("cuda")
while True:
        with torch.autocast("cuda"):
            image = pipe(prompt).images[0]

environment

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_Mar__8_18:18:20_PST_2022
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0

$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+

GPU:Nvidia-A100-40GB

$ sudo pip list
Package                  Version
------------------------ --------------------
aiohttp                  3.8.3
aiosignal                1.2.0
aitemplate               0.1.dev0
async-timeout            4.0.2
attrs                    19.3.0
Automat                  0.8.0
blinker                  1.4
CacheControl             0.12.11
cachetools               5.2.0
certifi                  2019.11.28
chardet                  3.0.4
charset-normalizer       2.1.1
Click                    7.0
cloud-init               22.2
colorama                 0.4.3
command-not-found        0.3
configobj                5.0.6
constantly               15.1.0
cryptography             2.8
cupshelpers              1.0
dbus-python              1.2.16
defer                    1.0.6
diffusers                0.3.0
distro                   1.4.0
distro-info              0.23ubuntu1
entrypoints              0.3
filelock                 3.8.0
firebase-admin           5.4.0
frozenlist               1.3.1
ftfy                     6.1.1
future                   0.18.2
google-api-core          2.10.1
google-api-python-client 2.64.0
google-auth              2.12.0
google-auth-httplib2     0.1.0
google-cloud-core        2.3.2
google-cloud-firestore   2.7.1
google-cloud-storage     2.5.0
google-crc32c            1.5.0
google-resumable-media   2.4.0
googleapis-common-protos 1.56.4
grpcio                   1.49.1
grpcio-status            1.49.1
httplib2                 0.20.4
huggingface-hub          0.10.0
hyperlink                19.0.0
idna                     2.8
importlib-metadata       1.5.0
incremental              16.10.1
install                  1.3.5
Jinja2                   3.1.2
jsonpatch                1.22
jsonpointer              2.0
jsonschema               3.2.0
keyring                  18.0.1
language-selector        0.1
launchpadlib             1.10.13
lazr.restfulclient       0.14.2
lazr.uri                 1.0.3
line-bot-sdk             2.3.0
macaroonbakery           1.3.1
MarkupSafe               2.1.1
more-itertools           4.2.0
msgpack                  1.0.4
multidict                6.0.2
netifaces                0.10.4
numpy                    1.23.3
oauthlib                 3.1.0
packaging                21.3
pexpect                  4.6.0
pika                     1.3.0
Pillow                   9.2.0
pip                      22.2.2
proto-plus               1.22.1
protobuf                 4.21.7
pyasn1                   0.4.2
pyasn1-modules           0.2.1
pycairo                  1.16.2
pycups                   1.9.73
PyGObject                3.36.0
PyHamcrest               1.9.0
PyJWT                    1.7.1
pymacaroons              0.13.0
PyNaCl                   1.3.0
pyOpenSSL                19.0.0
pyparsing                3.0.9
pyRFC3339                1.1
pyrsistent               0.15.5
pyserial                 3.4
python-apt               2.0.0+ubuntu0.20.4.8
python-debian            0.1.36ubuntu1
python-dotenv            0.21.0
pytz                     2019.3
PyYAML                   5.3.1
regex                    2022.9.13
requests                 2.22.0
requests-unixsocket      0.2.0
rsa                      4.9
scipy                    1.9.1
screen-resolution-extra  0.0.0
SecretStorage            2.3.1
service-identity         18.1.0
setuptools               45.2.0
simplejson               3.16.0
six                      1.14.0
sos                      4.3
ssh-import-id            5.10
systemd-python           234
tokenizers               0.12.1
torch                    1.12.1+cu116
torchaudio               0.12.1+cu116
torchvision              0.13.1+cu116
tqdm                     4.64.1
transformers             4.22.2
Twisted                  18.9.0
typing_extensions        4.3.0
ubuntu-advantage-tools   27.10
ufw                      0.36
unattended-upgrades      0.1
uritemplate              4.1.1
urllib3                  1.25.8
wadllib                  1.3.3
wcwidth                  0.2.5
wheel                    0.34.2
xkit                     0.0.0
yarl                     1.8.1
zipp                     1.0.0
zope.interface           4.7.1

difference with torchserve?

will this replace torchserve? which will be better?

recompile stable diffusion for 1024x1024

How might one do this? Currently it only works for 512x512.

Future plans to support FP32？

Any plans for official releases?

See title, would be nice to pip install the package

Document for different specializations and how they affect shapes

After reading some examples, e.g. here:

class Mlp(nn.Module):
    """MLP as used in Vision Transformer, MLP-Mixer and related networks"""

    def __init__(
        self,
        in_features,
        hidden_features=None,
        out_features=None,
        act_layer="GELU",
        drop=0,
    ):
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features

        self.fc1 = nn.Linear(
            in_features,
            hidden_features,
            specialization="fast_gelu" if act_layer == "GELU" else "relu",
        )
        self.fc2 = nn.Linear(hidden_features, out_features, specialization="add")

    def forward(self, x, res):
        shape = get_shape(x)
        x = self.fc1(x)
        x = self.fc2(x, res)
        return ops.reshape()(x, shape)

I was wondering what is the reason for the ops.reshape() at the end? Does the specialization change the shapes to some canonical form? What other functions need a resape?

CompVis/Stable Diffusion/LDM - Diffusion Models Implementation Support

Hi!

Have anyone worked or planned to integrate diffusion models from CompVis/stable-diffusion/ldm? Would be of great help as of my experience that implementation is working quite good on characters category. As they have mentioned, used one (PLMS) from Katherine Crowson's Implementation.

Thanks

Attention mask in Bert

Hi,

I try to use an attention mask in Bert demo script but when I add the tensor to the input dict it crashes.
How can I provide this mask?

Reproduction script (run on the docker image):

#  Copyright (c) Meta Platforms, Inc. and affiliates.
#
#  Licensed under the Apache License, Version 2.0 (the "License");
#  you may not use this file except in compliance with the License.
#  You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.
#
import time

import click
import torch
from benchmark_ait import compile_module
from modeling.torch_model import BertBaseUncased as BertPt


def run_model(activation: str, graph_mode: bool, use_fp16_acc: bool, verify: bool):
    f = open("measures.txt", mode="w")
    shape = (1, 128)
    inputs_pt = {
        "input_ids": torch.randint(2, 1000, size=shape, dtype=torch.int64, device="cuda"),
        "position_ids": torch.arange(shape[1], dtype=torch.int64).expand(shape).contiguous().cuda(),
        "attention_mask": torch.ones(shape, dtype=torch.int64, device="cuda"),
        "token_type_ids": torch.ones(size=shape, dtype=torch.int64, device="cuda"),
    }

    batch_size, seq_len = inputs_pt["input_ids"].size()

    pt_model = BertPt(pretrained=True)._model
    pt_model.eval()
    hidden_size = pt_model.config.hidden_size

    mod = compile_module(batch_size, seq_len, hidden_size, activation, use_fp16_acc, False, pt_model)

    outputs = [torch.empty(mod.get_output_maximum_shape(0)).half().cuda()]

    # warmup
    for _ in range(10):
        mod.run_with_tensors(inputs_pt, outputs, graph_mode=graph_mode)

    torch.cuda.synchronize()
    timings = list()
    for _ in range(10):
        start = time.time()
        mod.run_with_tensors(inputs_pt, outputs, graph_mode=graph_mode)
        torch.cuda.synchronize()
        timings.append(time.time() - start)

    f.write(f"{shape}: {torch.median(torch.tensor(timings)):.4f}\n")
    f.flush()
    print(f"Logits: {outputs[0]}")
    if verify:
        pt_outputs = pt_model.bert(**inputs_pt)
        torch.allclose(outputs[0], pt_outputs.last_hidden_state, 1e-1, 1e-1)
        print("Verification done!")
    f.close()


@click.command()
@click.option(
    "--activation",
    type=str,
    default="gelu",
    help="Activation function applied on BERT, currently only support gelu and fast_gelu",
)
@click.option(
    "--graph_mode",
    type=bool,
    default=True,
    help="Use CUDA graph or not. (hipGraph is not supported yet)",
)
@click.option(
    "--use_fp16_acc",
    type=bool,
    default=False,
    help="Use fp16 accumulation or not (TensorRT is using fp16_acc)",
)
@click.option(
    "--verify",
    type=bool,
    default=True,
    help="Verify AIT outputs against PT",
)
def run_demo(
    activation: str,
    graph_mode: bool,
    use_fp16_acc: bool,
    verify: bool,
):
    run_model(activation, graph_mode, use_fp16_acc, verify)


if __name__ == "__main__":
    torch.manual_seed(4896)
    run_demo()

Produces:

...
2022-10-16 12:51:44,784 INFO <aitemplate.backend.builder> Building ./tmp/BERT_gelu_1_128/model_interface.obj
2022-10-16 12:52:03,348 INFO <aitemplate.backend.builder> Building ./tmp/BERT_gelu_1_128/test.so
[12:52:03] ./tmp/BERT_gelu_1_128/model-generated.h:225: Init AITemplate Runtime.
Traceback (most recent call last):
  File "./examples/03_bert/demo_new.py", line 101, in <module>
    run_demo()
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "./examples/03_bert/demo_new.py", line 96, in run_demo
    run_model(activation, graph_mode, use_fp16_acc, verify)
  File "./examples/03_bert/demo_new.py", line 45, in run_model
    mod.run_with_tensors(inputs_pt, outputs, graph_mode=graph_mode)
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/model.py", line 483, in run_with_tensors
    outputs_ait = self.run(
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/model.py", line 438, in run
    return self._run_impl(
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/model.py", line 367, in _run_impl
    inputs = self._dict_to_ordered_list(inputs, is_inputs=True)
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/model.py", line 327, in _dict_to_ordered_list
    raise ValueError(
ValueError: Did not get correct number of inputs expected 3, got 4

If I replace position_ids by attention_mask I get:

    inputs_pt = {
        "input_ids": torch.randint(2, 1000, size=shape, dtype=torch.int64, device="cuda"),
        # "position_ids": torch.arange(shape[1], dtype=torch.int64).expand(shape).contiguous().cuda(),
        "attention_mask": torch.ones(shape, dtype=torch.int64, device="cuda"),
        "token_type_ids": torch.ones(size=shape, dtype=torch.int64, device="cuda"),
    }

[12:54:38] ./tmp/BERT_gelu_1_128/model-generated.h:225: Init AITemplate Runtime.
Traceback (most recent call last):
  File "./examples/03_bert/demo_new.py", line 101, in <module>
    run_demo()
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "./examples/03_bert/demo_new.py", line 96, in run_demo
    run_model(activation, graph_mode, use_fp16_acc, verify)
  File "./examples/03_bert/demo_new.py", line 45, in run_model
    mod.run_with_tensors(inputs_pt, outputs, graph_mode=graph_mode)
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/model.py", line 483, in run_with_tensors
    outputs_ait = self.run(
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/model.py", line 438, in run
    return self._run_impl(
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/model.py", line 367, in _run_impl
    inputs = self._dict_to_ordered_list(inputs, is_inputs=True)
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/model.py", line 334, in _dict_to_ordered_list
    raise ValueError(
ValueError: Got unexpected input: attention_mask

[FEATURE] Support for NVidia's A10g

Hello, thanks for this great project.

Following this request it would be amazing to have support for NVidia's latest generation of inference GPUs: the A10g.

They are roughly 2-3x faster than the T4 and very cheap w.r.t A100s.

On another topic, if we wanted to add this support ourselves for this GPU type or any future GPU from NVidia what would be the process ?

RuntimeError: Unsupported platform

Hi, I am trying to run the examples in the provided docker image, I get this error:

user@user-G3-3500:/workspace/code/AITemplate/examples$ python3.8 01_resnet-50/benchmark_pt.py Traceback (most recent call last): File "01_resnet-50/benchmark_pt.py", line 20, in <module> from aitemplate.testing.benchmark_pt import benchmark_torch_function File "/usr/local/lib/python3.8/dist-packages/aitemplate/__init__.py", line 19, in <module> from . import backend, compiler, frontend, testing, utils File "/usr/local/lib/python3.8/dist-packages/aitemplate/frontend/__init__.py", line 16, in <module> from . import nn File "/usr/local/lib/python3.8/dist-packages/aitemplate/frontend/nn/__init__.py", line 17, in <module> from .embedding import BertEmbeddings, Embedding File "/usr/local/lib/python3.8/dist-packages/aitemplate/frontend/nn/embedding.py", line 38, in <module> USE_CUDA = detect_target().name() == "cuda" File "/usr/local/lib/python3.8/dist-packages/aitemplate/testing/detect_target.py", line 97, in detect_target raise RuntimeError("Unsupported platform") RuntimeError: Unsupported platform

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1700 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 2352 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------+
`

nvcc in host machine:
Cuda compilation tools, release 11.8, V11.8.89 Build cuda_11.8.r11.8/compiler.31833905_0

nvcc in docker machine:
Cuda compilation tools, release 11.6, V11.6.124 Build cuda_11.6.r11.6/compiler.31057947_0

AttributeError: 'Model' object has no attribute '_allocated_ait_data'

I use the latest cuda docker with A100, when I run python3 examples/05_stable_diffusion/compile.py --token xxx， the main error code as follow:

57 errors detected in the compilation of "flash_attention_10.cu".
make: *** [Makefile:9: flash_attention_10.obj] Error 1
make: *** Waiting for unfinished jobs....

2022-11-11 03:11:49,781 INFO <aitemplate.compiler.compiler> compiled the final .so file elapsed time: 0:00:08.439418
Traceback (most recent call last):
File "examples/05_stable_diffusion/compile.py", line 373, in
compile_diffusers()
File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1130, in call
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "examples/05_stable_diffusion/compile.py", line 349, in compile_diffusers
compile_clip(
File "examples/05_stable_diffusion/compile.py", line 252, in compile_clip
compile_model(Y, target, "./tmp", "CLIPTextModel", constants=params_ait)
File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/compiler.py", line 260, in compile_model
module = Model(
File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/model.py", line 227, in init
self.DLL = self._DLLWrapper(lib_path, num_runtimes, allocator_kind)
File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/model.py", line 166, in init
self.DLL = ctypes.cdll.LoadLibrary(lib_path)
File "/usr/lib/python3.8/ctypes/init.py", line 451, in LoadLibrary
return self._dlltype(name)
File "/usr/lib/python3.8/ctypes/init.py", line 373, in init
self._handle = _dlopen(self._name, mode)
OSError: ./tmp/CLIPTextModel/test.so: cannot open shared object file: No such file or directory
Exception ignored in: <function Model.del at 0x7ff88f4ef3a0>
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/model.py", line 257, in del
self.close()
File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/model.py", line 261, in close
for ptr in list(self._allocated_ait_data):
AttributeError: 'Model' object has no attribute '_allocated_ait_data'

`head_size` only be [8, 16, 32, 64, 128] in flash_attention

If head_size not in [8, 16, 32, 64, 128], it will fail to forwarod in flash_attention:

AITemplate/python/aitemplate/compiler/ops/attention/flash_attention.py

Line 144 in d0ee901

assert head_size in [8, 16, 32, 64, 128]

I want to know why assert this here? This limitation is because of something related to cuda? or else? And furthermore, how could we support other kinds of head_size?

how to debug `"Error in function: AITemplateModelContainerRun` for dimension other than 512x512 in the SD pipeline

So we go into this: Error in function: AITemplateModelContainerRun, which looks to be something to do with the DLL:

self.DLL.AITemplateModelContainerRun(
                self.handle,
                c_inputs,
                ctypes.c_size_t(len(inputs)),
                c_outputs,
                ctypes.c_size_t(len(outputs)),
                c_stream,
                ctypes.c_bool(sync),
                ctypes.c_bool(graph_mode),
                c_output_shapes_out,
            )

We tried all different demensions that are multiple of 64 and it is still giving the same rror.

What is the best way to debug this? Or could this caused by a bug at a lower level?

docker error

(base) chenxin@chenxin-Nitro-AN515-52:~/disk1/github/AITemplate/docker$ ./build.sh cuda
Building CUDA Docker Image with tag ait:latest
unable to prepare context: unable to evaluate symlinks in Dockerfile path: lstat /home/chenxin/disk1/github/AITemplate/docker/docker: no such file or directory

Stable Diffusion Example Installation Not Working - SyntaxError

reproduce on google colab:

!git clone /content/AITemplate https://github.com/facebookincubator/AITemplate.git
%cd /content/AITemplate/python
!python setup.py bdist_wheel
!pip install dist/*.whl
%cd /content/AITemplate/examples/05_stable_diffusion
from pipeline_stable_diffusion_ait import StableDiffusionAITPipeline

Error:

Traceback (most recent call last):

  File "/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py", line 3326, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)

  File ["<ipython-input-46-f3821b1f7a5a>"](https://localhost:8080/#), line 2, in <module>
    from pipeline_stable_diffusion_ait import StableDiffusionAITPipeline

  File "/content/AITemplate/examples/05_stable_diffusion/pipeline_stable_diffusion_ait.py", line 22, in <module>
    from aitemplate.compiler import Model

  File "/usr/local/lib/python3.7/dist-packages/aitemplate/__init__.py", line 19, in <module>
    from . import backend, compiler, frontend, testing, utils

  File "/usr/local/lib/python3.7/dist-packages/aitemplate/backend/__init__.py", line 18, in <module>
    from . import (  # noqa

  File "/usr/local/lib/python3.7/dist-packages/aitemplate/backend/backend_spec.py", line 25, in <module>
    from ..compiler.ops.common.epilogue import FuncEnum

  File "/usr/local/lib/python3.7/dist-packages/aitemplate/compiler/__init__.py", line 15, in <module>
    from . import base, ops, tensor_accessor, transform

  File "<fstring>", line 1
    (len(self.src_ops())=)
                        ^
SyntaxError: invalid syntax

Can't finish running compile.py on Stable diffusion example(same issue on others)

File "\aitemplate\examples\07_how_to_run_pt_model\how_to_run_pt_model.py", line 131, in
verify_simple_model()
File "\aitemplate\examples\07_how_to_run_pt_model\how_to_run_pt_model.py", line 97, in verify_simple_model
with compile_model(
File "\AppData\Local\Programs\Python\Python310\lib\site-packages\aitemplate\compiler\compiler.py", line 200, in compile_model
compiler.transform.profile(
File "\AppData\Local\Programs\Python\Python310\lib\site-packages\aitemplate\compiler\transform\profile.py", line 88, in profile
compile_engine.make_profilers(generated_profilers, profiler_dir)
File "\AppData\Local\Programs\Python\Python310\lib\site-packages\aitemplate\backend\builder.py", line 364, in make_profilers
self._gen_makefile_for_profilers(file_pairs, build_dir)
File "\AppData\Local\Programs\Python\Python310\lib\site-packages\aitemplate\backend\builder.py", line 355, in _gen_makefile_for_profilers
with open(dumpfile, "w+") as f:
FileNotFoundError: [Errno 2] No such file or directory: "'./tmp\profiler'\Makefile"

Could I get some help with this? I apologize if this isn't the right place for this.

Support non-square sizes for stable diffusion like 640x384 don't seem to work

From @terrychenism "the group norm problem size is not supported yet."

My diff:

diff --git a/examples/05_stable_diffusion/compile.py b/examples/05_stable_diffusion/compile.py
index 513df5b..790f3c0 100644
--- a/examples/05_stable_diffusion/compile.py
+++ b/examples/05_stable_diffusion/compile.py
@@ -177,8 +177,8 @@ def map_clip_params(pt_mod, batch_size, seqlen, depth):

 def compile_unet(
     batch_size=2,
-    hh=64,
-    ww=64,
+    hh=48,
+    ww=80,
     dim=320,
     use_fp16_acc=False,
     convert_conv_to_gemm=False,
@@ -339,7 +339,8 @@ def compile_diffusers(token, batch_size, img2img=False, use_fp16_acc=True, conve
         use_auth_token=access_token,
     ).to("cuda")

-    width = 96 if img2img else 64
+    width = 80
+    height = 48

     # CLIP
     compile_clip(batch_size=batch_size, use_fp16_acc=use_fp16_acc, convert_conv_to_gemm=convert_conv_to_gemm)
@@ -347,11 +348,12 @@ def compile_diffusers(token, batch_size, img2img=False, use_fp16_acc=True, conve
     compile_unet(
         batch_size=batch_size * 2,
         ww=width,
+        hh=height,
         use_fp16_acc=use_fp16_acc,
         convert_conv_to_gemm=convert_conv_to_gemm,
     )
     # VAE
-    compile_vae(batch_size=batch_size, width=width, use_fp16_acc=use_fp16_acc, convert_conv_to_gemm=convert_conv_to_gemm)
+    compile_vae(batch_size=batch_size, width=width, height=height, use_fp16_acc=use_fp16_acc, convert_conv_to_gemm=convert_conv_to_gemm)


 if __name__ == "__main__":

Error:

/usr/include/cub/block/specializations/block_reduce_warp_reductions.cuh(75): here
            instantiation of class "cub::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH> [with T=float, BLOCK_DIM_X=0, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, PTX_ARCH=800]"
/usr/include/cub/block/block_reduce.cuh(249): here
            instantiation of class "cub::BlockReduce<T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH> [with T=float, BLOCK_DIM_X=0, ALGORITHM=cub::BLOCK_REDUCE_WARP_REDUCTIONS, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, PTX_ARCH=800]"
./tmp/UNet2DConditionModel/groupnorm_swish_603.cu(336): here
            instantiation of "T <unnamed>::BlockAllReduce<ReductionOp,T,block_size>(T) [with ReductionOp=<unnamed>::SumOp, T=float, block_size=0]"
./tmp/UNet2DConditionModel/groupnorm_swish_603.cu(406): here
            instantiation of "void <unnamed>::group_norm_smem<FuseSwish,H,W,C,C_G,ILP,BANK_CONFLICT,NUM_THREADS>(const half *, half *, half *, half *, int, float) [with FuseSwish=true, H=6, W=10, C=1280, C_G=40, ILP=8, BANK_CONFLICT=0, NUM_THREADS=0]"
./tmp/UNet2DConditionModel/groupnorm_swish_603.cu(566): here
            instantiation of "cudaError_t <unnamed>::invokeGroupNorm<FuseSwish,H,W,C,G>(half *, half *, half *, half *, int, float, int, cudaStream_t) [with FuseSwish=true, H=6, W=10, C=1280, G=32]"
./tmp/UNet2DConditionModel/groupnorm_swish_603.cu(593): here

/usr/include/cub/warp/specializations/warp_reduce_shfl.cuh(73): error: division by zero
          detected during:
            instantiation of class "cub::WarpReduceShfl<T, LOGICAL_WARP_THREADS, PTX_ARCH> [with T=float, LOGICAL_WARP_THREADS=0, PTX_ARCH=800]"
/usr/include/cub/warp/warp_reduce.cuh(168): here
            instantiation of class "cub::WarpReduce<T, LOGICAL_WARP_THREADS, PTX_ARCH> [with T=float, LOGICAL_WARP_THREADS=0, PTX_ARCH=800]"
/usr/include/cub/block/specializations/block_reduce_warp_reductions.cuh(75): here
            instantiation of class "cub::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH> [with T=float, BLOCK_DIM_X=0, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, PTX_ARCH=800]"
/usr/include/cub/block/block_reduce.cuh(249): here
            instantiation of class "cub::BlockReduce<T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH> [with T=float, BLOCK_DIM_X=0, ALGORITHM=cub::BLOCK_REDUCE_WARP_REDUCTIONS, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, PTX_ARCH=800]"
./tmp/UNet2DConditionModel/groupnorm_swish_603.cu(336): here
            instantiation of "T <unnamed>::BlockAllReduce<ReductionOp,T,block_size>(T) [with ReductionOp=<unnamed>::SumOp, T=float, block_size=0]"
./tmp/UNet2DConditionModel/groupnorm_swish_603.cu(406): here
            instantiation of "void <unnamed>::group_norm_smem<FuseSwish,H,W,C,C_G,ILP,BANK_CONFLICT,NUM_THREADS>(const half *, half *, half *, half *, int, float) [with FuseSwish=true, H=6, W=10, C=1280, C_G=40, ILP=8, BANK_CONFLICT=0, NUM_THREADS=0]"
./tmp/UNet2DConditionModel/groupnorm_swish_603.cu(566): here
            instantiation of "cudaError_t <unnamed>::invokeGroupNorm<FuseSwish,H,W,C,G>(half *, half *, half *, half *, int, float, int, cudaStream_t) [with FuseSwish=true, H=6, W=10, C=1280, G=32]"
./tmp/UNet2DConditionModel/groupnorm_swish_603.cu(593): here

/usr/include/cub/block/specializations/block_reduce_warp_reductions.cuh(120): error: excessive recursion at instantiation of function "cub::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH>::ApplyWarpAggregates<FULL_TILE,ReductionOp,SUCCESSOR_WARP>(ReductionOp, T, int, cub::Int2Type<SUCCESSOR_WARP>) [with T=float, BLOCK_DIM_X=0, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, PTX_ARCH=800, FULL_TILE=true, ReductionOp=<unnamed>::SumOp<float>, SUCCESSOR_WARP=201]"
          detected during:
            instantiation of "T cub::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH>::ApplyWarpAggregates<FULL_TILE,ReductionOp,SUCCESSOR_WARP>(ReductionOp, T, int, cub::Int2Type<SUCCESSOR_WARP>) [with T=float, BLOCK_DIM_X=0, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, PTX_ARCH=800, FULL_TILE=true, ReductionOp=<unnamed>::SumOp<float>, SUCCESSOR_WARP=200]"
(120): here
            instantiation of "T cub::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH>::ApplyWarpAggregates<FULL_TILE,ReductionOp,SUCCESSOR_WARP>(ReductionOp, T, int, cub::Int2Type<SUCCESSOR_WARP>) [with T=float, BLOCK_DIM_X=0, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, PTX_ARCH=800, FULL_TILE=true, ReductionOp=<unnamed>::SumOp<float>, SUCCESSOR_WARP=199]"
(120): here
            instantiation of "T cub::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH>::ApplyWarpAggregates<FULL_TILE,ReductionOp,SUCCESSOR_WARP>(ReductionOp, T, int, cub::Int2Type<SUCCESSOR_WARP>) [with T=float, BLOCK_DIM_X=0, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, PTX_ARCH=800, FULL_TILE=true, ReductionOp=<unnamed>::SumOp<float>, SUCCESSOR_WARP=198]"
(120): here
            instantiation of "T cub::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH>::ApplyWarpAggregates<FULL_TILE,ReductionOp,SUCCESSOR_WARP>(ReductionOp, T, int, cub::Int2Type<SUCCESSOR_WARP>) [with T=float, BLOCK_DIM_X=0, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, PTX_ARCH=800, FULL_TILE=true, ReductionOp=<unnamed>::SumOp<float>, SUCCESSOR_WARP=197]"
(120): here
            instantiation of "T cub::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH>::ApplyWarpAggregates<FULL_TILE,ReductionOp,SUCCESSOR_WARP>(ReductionOp, T, int, cub::Int2Type<SUCCESSOR_WARP>) [with T=float, BLOCK_DIM_X=0, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, PTX_ARCH=800, FULL_TILE=true, ReductionOp=<unnamed>::SumOp<float>, SUCCESSOR_WARP=196]"
(120): here
            [ 196 instantiation contexts not shown ]
            instantiation of "T cub::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH>::Reduce<FULL_TILE,ReductionOp>(T, int, ReductionOp) [with T=float, BLOCK_DIM_X=0, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, PTX_ARCH=800, FULL_TILE=true, ReductionOp=<unnamed>::SumOp<float>]"
/usr/include/cub/block/block_reduce.cuh(353): here
            instantiation of "T cub::BlockReduce<T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH>::Reduce(T, ReductionOp) [with T=float, BLOCK_DIM_X=0, ALGORITHM=cub::BLOCK_REDUCE_WARP_REDUCTIONS, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, PTX_ARCH=800, ReductionOp=<unnamed>::SumOp<float>]"
./tmp/UNet2DConditionModel/groupnorm_swish_603.cu(338): here
            instantiation of "T <unnamed>::BlockAllReduce<ReductionOp,T,block_size>(T) [with ReductionOp=<unnamed>::SumOp, T=float, block_size=0]"
./tmp/UNet2DConditionModel/groupnorm_swish_603.cu(406): here
            instantiation of "void <unnamed>::group_norm_smem<FuseSwish,H,W,C,C_G,ILP,BANK_CONFLICT,NUM_THREADS>(const half *, half *, half *, half *, int, float) [with FuseSwish=true, H=6, W=10, C=1280, C_G=40, ILP=8, BANK_CONFLICT=0, NUM_THREADS=0]"
./tmp/UNet2DConditionModel/groupnorm_swish_603.cu(566): here
            instantiation of "cudaError_t <unnamed>::invokeGroupNorm<FuseSwish,H,W,C,G>(half *, half *, half *, half *, int, float, int, cudaStream_t) [with FuseSwish=true, H=6, W=10, C=1280, G=32]"
./tmp/UNet2DConditionModel/groupnorm_swish_603.cu(593): here

4 errors detected in the compilation of "./tmp/UNet2DConditionModel/groupnorm_swish_603.cu".

Done

AIT's performance compared with PT's

I have a question, have you tested the ait-model's performance compared with pt-model? Given the same input (fp32 to pt-model while fp16 to ait-model), how would the two outputs be? Differences exist I think, and these differences are generated by what? Could you give us some detailed official descriptions?

	def cal_align_ab(x_shape: List[int]) -> int:
	"""Returns input alignment."""
	k = x_shape[3] # CI
	if k % 8 == 0:
	return 8
	if k % 4 == 0:
	return 4
	if k % 2 == 0:
	return 2
	raise RuntimeError("a/b is not aligned")

facebookincubator / aitemplate Goto Github PK

aitemplate's People

Contributors

Stargazers

Watchers

Forkers

aitemplate's Issues

Performance on V100

Repro

summary

code1

code2

Error

summary

code

environment

Recommend Projects

Recommend Topics

Recommend Org

Jobs