cutlass's Introduction


CUTLASS 3.5

CUTLASS 3.5 - April 2024

CUTLASS is a collection of CUDA C++ template abstractions for implementing high-performance matrix-matrix multiplication (GEMM) and related computations at all levels and scales within CUDA. It incorporates strategies for hierarchical decomposition and data movement similar to those used to implement cuBLAS and cuDNN. CUTLASS decomposes these "moving parts" into reusable, modular software components abstracted by C++ template classes. Primitives for different levels of a conceptual parallelization hierarchy can be specialized and tuned via custom tiling sizes, data types, and other algorithmic policies. The resulting flexibility simplifies their use as building blocks within custom kernels and applications.
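As a small illustration of the device-level API (a minimal sketch using the 2.x-style single-precision GEMM template; the types and layouts here are chosen for the example, not mandated by the library):

#include "cutlass/gemm/device/gemm.h"

// Compute C = alpha * A * B + beta * C for column-major float matrices.
cudaError_t cutlass_sgemm(int M, int N, int K,
                          float alpha, float const *A, int lda,
                          float const *B, int ldb,
                          float beta, float *C, int ldc) {
  using ColumnMajor = cutlass::layout::ColumnMajor;
  using Gemm = cutlass::gemm::device::Gemm<float, ColumnMajor,   // A
                                           float, ColumnMajor,   // B
                                           float, ColumnMajor>;  // C
  Gemm gemm_op;
  cutlass::Status status = gemm_op({{M, N, K},        // problem size
                                    {A, lda},         // ref to A
                                    {B, ldb},         // ref to B
                                    {C, ldc},         // source C
                                    {C, ldc},         // destination D
                                    {alpha, beta}});  // epilogue scalars
  return status == cutlass::Status::kSuccess ? cudaSuccess : cudaErrorUnknown;
}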

To support a wide variety of applications, CUTLASS provides extensive support for mixed-precision computations, with specialized data-movement and multiply-accumulate abstractions for half-precision floating point (FP16), BFloat16 (BF16), Tensor Float 32 (TF32), single-precision floating point (FP32), FP32 emulation via tensor core instructions, double-precision floating point (FP64) types, integer data types (4b and 8b), and binary data types (1b). CUTLASS demonstrates warp-synchronous matrix multiply operations targeting the programmable, high-throughput Tensor Cores implemented by NVIDIA's Volta, Turing, Ampere, and Hopper architectures.

See the Quick Start Guide to get started quickly.

See the functionality listing for the list of operations supported at each level of the execution model hierarchy.

CUTLASS 3.0 introduced a new core library, CuTe, to describe and manipulate tensors of threads and data. CuTe is a collection of C++ CUDA template abstractions for defining and operating on hierarchically multidimensional layouts of threads and data. CuTe provides Layout and Tensor objects that compactly package the type, shape, memory space, and layout of data, while performing the complicated indexing for the user. This lets programmers focus on the logical descriptions of their algorithms while CuTe does the mechanical bookkeeping for them. With these tools, we can quickly design, implement, and modify all dense linear algebra operations.

The core abstractions of CuTe are hierarchically multidimensional layouts which can be composed with data arrays to represent tensors. The representation of layouts is powerful enough to represent nearly everything we need to implement efficient dense linear algebra. Layouts can also be combined and manipulated via functional composition, on which we build a large set of common operations such as tiling and partitioning.
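As a small, self-contained sketch of these ideas (the 8x4 shape and strides below are arbitrary choices for illustration):

#include <cute/tensor.hpp>

void cute_layout_demo() {
  using namespace cute;

  // An 8x4 column-major layout: shape (8,4) with stride (1,8).
  auto layout = make_layout(make_shape(Int<8>{}, Int<4>{}),
                            make_stride(Int<1>{}, Int<8>{}));

  // A layout is a function from logical coordinates to linear offsets.
  int offset = layout(3, 2);            // 3*1 + 2*8 == 19

  // Composing the layout with a data array yields a Tensor.
  float data[32];                       // 8 * 4 elements
  auto tensor = make_tensor(&data[0], layout);
  tensor(3, 2) = 1.0f;                  // writes data[19]

  // Layouts can be tiled and partitioned via functional composition.
  auto tiled = zipped_divide(layout, make_shape(Int<4>{}, Int<2>{}));
  (void)offset; (void)tiled;
}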

CUTLASS 3.0 and beyond adopts CuTe throughout the GEMM hierarchy in its templates. This greatly simplifies the design and improves code composability and readability. More documentation specific to CuTe can be found in its dedicated documentation directory.

In addition to GEMMs, CUTLASS implements high-performance convolution via the implicit GEMM algorithm. Implicit GEMM is the formulation of a convolution operation as a GEMM, thereby taking advantage of CUTLASS's modular GEMM pipeline. This allows CUTLASS to build convolutions by reusing highly-optimized GEMM components.
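As a sketch of that mapping for forward propagation (the struct below is illustrative, not a CUTLASS type; it follows the dimension correspondence described in the convolution documentation):

// How implicit GEMM maps a 2D forward convolution (input N,H,W,C;
// filter K,R,S; output N,P,Q) onto GEMM problem extents.
struct Conv2dShape { int N, H, W, C, K, R, S, P, Q; };

inline void implicit_gemm_extents(Conv2dShape const &p,
                                  int &gemm_m, int &gemm_n, int &gemm_k) {
  gemm_m = p.N * p.P * p.Q;  // one GEMM row per output activation
  gemm_n = p.K;              // one GEMM column per output channel
  gemm_k = p.C * p.R * p.S;  // reduction over input channels x filter taps
}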

What's New in CUTLASS 3.5

CUTLASS 3.5 is an update to CUTLASS adding:

  • Implicit GEMM Convolutions targeting Hopper SM90A via WGMMA + TMA im2col.
    • Native implementation in CUTLASS 3.x using CuTe, mirroring the same design hierarchy as that of GEMMs.
    • Support for 1D, 2D, and 3D convolutions in a rank-agnostic fashion.
    • Support for Fprop, Dgrad, and Wgrad algorithms.
    • CUTLASS profiler support for 2D and 3D convolutions implemented via the 3.x API.
    • NOTE: this is a beta release. Further updates to CUTLASS will include major performance improvements, feature enablement, and possible breaking changes to the API until the 3.7 release. Your feedback on the design is welcome!
  • Support for Ada (SM89) FP8 tensor cores via the 2.x API. Requires CUDA 12.4 or newer.
  • Ampere gather/scatter convolution example in CuTe and CUTLASS 3.x.
    • Showcasing how custom kernels can be written and optimized using CUTLASS 3.x and CuTe and the general strategy for implementing convolutions as specializations of GETTs.
    • Implementation of a coarse grained sparse gather/scatter kernel achieving peak performance on Ampere class tensor cores.
  • 32x and 16x tile sizes are added to CUTLASS 2.x to improve the performance of narrow-tall and wide-short matrices.
  • Updates to CuTe documentation for cute::Tensor<>, MMA atoms, and an overhauled CuTe GEMM tutorial series.
  • Extensions to CuTe to support L2 prefetching and TMA store+reductions.
  • Removal of the C++11 requirement on a few CUTLASS 2.x API header files; all CUTLASS files now require C++17.
  • Fixes to greatly reduce build warnings.
  • Updates and bugfixes from the community (thanks!).

Minimum requirements:

  • Architecture: Volta
  • Compiler: Must support at least C++17
  • CUDA Toolkit version: 11.4

Starting from CUTLASS 3.0, CUTLASS removed support for the following:

  • Maxwell and Pascal GPU architectures
  • Ubuntu 16.04
  • CUDA 10.2
  • C++ language versions less than 17.

See the CHANGELOG for a detailed listing of releases and updates.

Performance

CUTLASS primitives are very efficient. When used to construct device-wide GEMM kernels, they exhibit peak performance comparable to cuBLAS for scalar GEMM computations. The performance figure in the repository shows CUTLASS performance relative to cuBLAS for large matrix dimensions on an NVIDIA H100 (NVIDIA Hopper architecture), an NVIDIA L40 (NVIDIA Ada architecture), an NVIDIA A100 (NVIDIA Ampere architecture), and an NVIDIA A40 (NVIDIA Ampere architecture). CUTLASS 3.0 was compiled with the CUDA 12.0 Toolkit. Tensor Core operations are implemented using CUDA's mma and wgmma instructions.

When CUTLASS building blocks are used to construct device-wide implicit GEMM (Fprop, Dgrad, and Wgrad) kernels, CUTLASS performance is also comparable to cuDNN when running ResNet-50 layers on an NVIDIA A100, as shown in the repository's convolution performance figure. Tensor Core operations are implemented using CUDA's mma instruction.

Compatibility

CUTLASS requires a C++17 host compiler and performs best when built with the CUDA 12.4 Toolkit. It is also compatible with CUDA 11.4, CUDA 11.5, CUDA 11.6, CUDA 11.7, CUDA 11.8, CUDA 12.0, CUDA 12.1, CUDA 12.2.2, CUDA 12.3.1 and CUDA 12.3.2.

Operating Systems

We have tested the following environments.

Operating System    Compiler
Ubuntu 18.04        GCC 7.5.0
Ubuntu 20.04        GCC 10.3.0
Ubuntu 22.04        GCC 11.2.0
Ubuntu 22.04        Clang 10.0.0
Ubuntu 22.04        Clang 14.0.6
Ubuntu 22.04        Clang 17.0.6
Windows 10.0        Visual Studio 2019 v16.11.27

Note: GCC 8.5.0 has known regressions regarding fold expressions and overloaded operators. Using GCC 7.5.0 or (preferred) GCC >= 9 is recommended.

Hardware

CUTLASS runs successfully on the following NVIDIA GPUs, and it is expected to be efficient on Volta, Turing, Ampere, Ada, and Hopper architecture based NVIDIA GPUs.

GPU                                       CUDA Compute Capability    Minimum CUDA Toolkit Required by CUTLASS-3
NVIDIA V100 Tensor Core GPU               7.0                        11.4
NVIDIA TitanV                             7.0                        11.4
NVIDIA GeForce RTX 2080 TI, 2080, 2070    7.5                        11.4
NVIDIA T4                                 7.5                        11.4
NVIDIA A100 Tensor Core GPU               8.0                        11.4
NVIDIA A10                                8.6                        11.4
NVIDIA GeForce RTX 3090                   8.6                        11.4
NVIDIA GeForce RTX 4090                   8.9                        11.8
NVIDIA L40                                8.9                        11.8
NVIDIA H100 Tensor Core GPU               9.0                        11.8

Target Architecture

In general, PTX code generated for one target architecture can be run on future architectures (i.e., it is forward compatible). However, CUDA 12.0 introduced the concept of "architecture-accelerated features" whose PTX does not have forward compatibility guarantees. Several Hopper PTX instructions fall under this category of architecture-accelerated features, and thus require a sm_90a target architecture (note the "a" appended). For more details on this and other architecture-accelerated instructions, please refer to the CUDA Documentation.

The target architecture information is passed on to CUTLASS via the CMake flag CUTLASS_NVCC_ARCHS. To maximize performance on Hopper GH100, users are required to build CUTLASS with 90a as the target architecture. If a user accidentally builds a kernel that uses SM90a features (e.g., Hopper Tensor Core instructions) with the SM90 target (note the lack of "a"), using either CUDA Toolkit 12 or 11.8, the kernel is expected to fail with a runtime error.

cmake .. -DCUTLASS_NVCC_ARCHS="90a" 

Please refer to the functionality documentation for details on which kernels require which target architectures.

Documentation

CUTLASS is described in the following documents and the accompanying Doxygen documentation.

Resources

We have also described the structure of an efficient GEMM in our talk at the GPU Technology Conference 2018.

Building CUTLASS

CUTLASS is a header-only template library and does not need to be built to be used by other projects. Client applications should target CUTLASS's include/ directory in their include paths.

CUTLASS unit tests, examples, and utilities can be built with CMake. The minimum version of CMake is given in the Quickstart guide. Make sure the CUDACXX environment variable points to NVCC in the CUDA Toolkit installed on your system.

$ export CUDACXX=${CUDA_INSTALL_PATH}/bin/nvcc

Create a build directory within the CUTLASS project, then run CMake. By default CUTLASS will build kernels for CUDA architecture versions 5.0, 6.0, 6.1, 7.0, 7.5, 8.0, 8.6, 8.9, and 9.0. To reduce compile time you can specify the architectures to build CUTLASS for by changing the CMake configuration setting CUTLASS_NVCC_ARCHS.

$ mkdir build && cd build

$ cmake .. -DCUTLASS_NVCC_ARCHS=80               # compiles for NVIDIA's Ampere Architecture
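Multiple architectures may be listed; for example (values illustrative):

$ cmake .. -DCUTLASS_NVCC_ARCHS='75;80;90a'      # Turing, Ampere, and Hopper (with architecture-accelerated features)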

From the build/ directory, compile and run the CUTLASS unit tests by building the target test_unit with make.

The unit tests are organized as several binaries mirroring the top-level namespaces of CUTLASS, and they may be executed in parallel via make's -j command line argument.

$ make test_unit -j
...
...
...
[----------] Global test environment tear-down
[==========] 946 tests from 57 test cases ran. (10812 ms total)
[  PASSED  ] 946 tests.

All tests should pass on supported platforms, though the exact number of tests may vary over time.

Project Structure

CUTLASS is arranged as a header-only library along with Utilities, Tools, Examples, and unit tests. Doxygen documentation provides a complete list of files, classes, and template concepts defined in the CUTLASS project.

A detailed explanation of the source code organization may be found in the CUTLASS documentation, but several main components are summarized below.

CUTLASS Template Library

include/                     # client applications should target this directory in their build's include paths

  cutlass/                   # CUDA Templates for Linear Algebra Subroutines and Solvers - headers only

    arch/                    # direct exposure of architecture features (including instruction-level GEMMs)

    conv/                    # code specialized for convolution

    epilogue/                # code specialized for the epilogue of gemm/convolution

    gemm/                    # code specialized for general matrix product computations

    layout/                  # layout definitions for matrices, tensors, and other mathematical objects in memory

    platform/                # CUDA-capable Standard Library components

    reduction/               # bandwidth-limited reduction kernels that do not fit the "gemm" model

    thread/                  # simt code that can be performed within a CUDA thread
    
    transform/               # code specialized for layout, type, and domain transformations

    *                        # core vocabulary types, containers, and basic numeric operations

  cute/                      # CuTe Layout, layout algebra, MMA/Copy atoms, tiled MMA/Copy

    algorithm/               # Definitions of core operations such as copy, gemm, and operations on cute::tuples

    arch/                    # Bare bones PTX wrapper structs for copy and math instructions

    atom/                    # Meta-information either link to or built from arch/ operators

      mma_atom.hpp           # cute::Mma_Atom and cute::TiledMma

      copy_atom.hpp          # cute::Copy_Atom and cute::TiledCopy

      *sm*.hpp               # Arch specific meta-information for copy and math operations

    *                        # Core library types such as Shape, Stride, Layout, Tensor, and associated operations

CUTLASS SDK Examples

CUTLASS SDK examples apply CUTLASS templates to implement basic computations.

Tools

tools/
  library/                   # CUTLASS Instance Library - contains instantiations of all supported CUTLASS templates
    include/
      cutlass/
        library/

  profiler/                  # CUTLASS Profiler         - command-line utility for executing operations in the
                             #                            CUTLASS Library
  
  util/                      # CUTLASS Utilities        - contains numerous helper classes for
    include/                 #                            managing tensors in device memory, reference
      cutlass/               #                            implementations for GEMM, random initialization
        util/                #                            of tensors, and I/O.
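For example, the utilities include a HostTensor that pairs host and device allocations, plus reference routines for initialization; a minimal sketch (seed and bounds chosen arbitrarily):

#include "cutlass/util/host_tensor.h"
#include "cutlass/util/reference/host/tensor_fill.h"

void make_random_matrix(int M, int N) {
  // Allocates both host and device storage for an MxN column-major matrix.
  cutlass::HostTensor<float, cutlass::layout::ColumnMajor> A({M, N});

  // Fill the host allocation with uniform random values in [-4, 4].
  cutlass::reference::host::TensorFillRandomUniform(
      A.host_view(), /*seed=*/2024, /*max=*/4.f, /*min=*/-4.f);

  // Copy host contents to the device allocation.
  A.sync_device();
}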

Test

The test/unit/ directory consists of unit tests implemented with Google Test that demonstrate basic usage of Core API components and complete tests of the CUTLASS GEMM computations.

Instructions for building and running the Unit tests are described in the Quickstart guide.

Performance Profiling

The tools/profiler/ directory contains a command-line utility for launching each of the GEMM kernels. It can be built as follows:

$ make cutlass_profiler -j16

Building all GEMM and Convolution kernels (long build times)

By default, only one tile size is instantiated for each data type, math instruction, and layout. To instantiate all of them, set the following CMake variable when running CMake from an empty build/ directory. Beware: this results in tens of thousands of kernels and long build times. It also produces a large binary and, on some platforms, causes the linker to fail when building the library. It is therefore highly recommended to generate only a subset of kernels, as demonstrated in the sub-sections below.

$ cmake .. -DCUTLASS_NVCC_ARCHS=90a -DCUTLASS_LIBRARY_KERNELS=all
...
$ make cutlass_profiler -j16

Building a subset of GEMM and Convolution kernels (reduced build times)

To compile strictly one kernel or a small set of kernels, a comma-delimited list of kernel names with wildcard characters may be used to reduce the set of kernels. The following examples show building exactly one kernel or a subset of kernels for the NVIDIA Ampere and Turing architectures:

Building a subset of Tensor Core GEMM kernels

To compile a subset of Tensor Core GEMM kernels with FP32 accumulation and FP16 input targeting the NVIDIA Ampere and Turing architectures, use the below cmake command line:

$ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_tensorop_s*gemm_f16_*_nt_align8
...
$ make cutlass_profiler -j16

Example command line for profiling a subset of Tensor Core GEMM kernels is as follows:

$ ./tools/profiler/cutlass_profiler --kernels=cutlass_tensorop_s*gemm_f16_*_nt_align8 --m=3456 --n=4096 --k=4096

...
=============================
  Problem ID: 1

        Provider: CUTLASS
   OperationKind: gemm
       Operation: cutlass_tensorop_s1688gemm_f16_256x128_32x2_nt_align8

          Status: Success
    Verification: ON
     Disposition: Passed

reference_device: Passed
          cuBLAS: Passed

       Arguments: --gemm_kind=universal --m=3456 --n=4096 --k=4096 --A=f16:column --B=f16:row --C=f32:column --alpha=1  \
                  --beta=0 --split_k_slices=1 --batch_count=1 --op_class=tensorop --accum=f32 --cta_m=256 --cta_n=128  \
                  --cta_k=32 --stages=2 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=8 --min_cc=75  \
                  --max_cc=1024

           Bytes: 118489088  bytes
           FLOPs: 115992428544  flops

         Runtime: 1.55948  ms
          Memory: 70.7616 GiB/s

            Math: 74378.8 GFLOP/s



=============================
...

Building one CUDA Core GEMM kernel

To compile one SGEMM kernel targeting the NVIDIA Ampere and Turing architectures, use the below cmake command line:

$ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_simt_sgemm_128x128_8x2_nn_align1
...
$ make cutlass_profiler -j16

Example command line for profiling a single SGEMM CUDA kernel is as follows:

$ ./tools/profiler/cutlass_profiler --kernels=sgemm --m=3456 --n=4096 --k=4096

=============================
  Problem ID: 1

        Provider: CUTLASS
   OperationKind: gemm
       Operation: cutlass_simt_sgemm_128x128_8x2_nn_align1

          Status: Success
    Verification: ON
     Disposition: Passed

          cuBLAS: Passed

       Arguments: --m=3456 --n=4096 --k=4096 --A=f32:column --B=f32:column --C=f32:column --alpha=1 --beta=0 --split_k_slices=1  \
                  --batch_count=1 --op_class=simt --accum=f32 --cta_m=128 --cta_n=128 --cta_k=8 --stages=2 --warps_m=4  \
                  --warps_n=2 --warps_k=1 --inst_m=1 --inst_n=1 --inst_k=1 --min_cc=50 --max_cc=1024

           Bytes: 180355072  bytes
           FLOPs: 115992428544  flops

         Runtime: 6.73655  ms
          Memory: 24.934 GiB/s

            Math: 17218.4 GFLOP/s

=============================

Building a subset of Tensor Core Convolution kernels

To compile a subset of Tensor Core convolution kernels implementing forward propagation (fprop) with FP32 accumulation and FP16 input targeting the NVIDIA Ampere and Turing architectures, use the below cmake command line:

$ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_tensorop_s*fprop_optimized_f16
...
$ make cutlass_profiler -j16

Example command line for profiling a subset of Tensor Core convolution kernels is as follows:

$ ./tools/profiler/cutlass_profiler --kernels=cutlass_tensorop_s*fprop_optimized_f16 --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3

...
=============================
  Problem ID: 1

        Provider: CUTLASS
   OperationKind: conv2d
       Operation: cutlass_tensorop_s16816fprop_optimized_f16_128x128_32x5_nhwc

          Status: Success
    Verification: ON
     Disposition: Passed

reference_device: Passed

       Arguments: --conv_kind=fprop --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3 --p=224 --q=224 --pad_h=1 --pad_w=1  \
                  --stride_h=1 --stride_w=1 --dilation_h=1 --dilation_w=1 --Activation=f16:nhwc --Filter=f16:nhwc --Output=f32:nhwc  \
                  --conv_mode=cross --iterator_algorithm=optimized --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1  \
                  --eq_gemm_provider=none --op_class=tensorop --accum=f32 --cta_m=128 --cta_n=128 --cta_k=32 --stages=5  \
                  --warps_m=2 --warps_n=2 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=16 --min_cc=80 --max_cc=1024

           Bytes: 1130659840  bytes
           FLOPs: 118482796544  flops

         Runtime: 0.711496  ms
          Memory: 1479.99 GiB/s

            Math: 166526 GFLOP/s

=============================
...

Building one Convolution CUDA kernel

To compile and run one CUDA Core convolution kernel implementing forward propagation (fprop) with FP32 accumulation and FP32 input targeting the NVIDIA Ampere and Turing architectures, use the below cmake command line:

$ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_simt_sfprop_optimized_128x128_8x2_nhwc
...
$ make cutlass_profiler -j16

Example command line for profiling one CUDA Core convolution kernel:

$ ./tools/profiler/cutlass_profiler --kernels=cutlass_simt_sfprop_optimized_128x128_8x2_nhwc --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3


=============================
  Problem ID: 1

        Provider: CUTLASS
   OperationKind: conv2d
       Operation: cutlass_simt_sfprop_optimized_128x128_8x2_nhwc

          Status: Success
    Verification: ON
     Disposition: Passed

reference_device: Passed

       Arguments: --conv_kind=fprop --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3 --p=224 --q=224 --pad_h=1 --pad_w=1  \
                  --stride_h=1 --stride_w=1 --dilation_h=1 --dilation_w=1 --Activation=f32:nhwc --Filter=f32:nhwc --Output=f32:nhwc  \
                  --conv_mode=cross --iterator_algorithm=optimized --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1  \
                  --eq_gemm_provider=none --op_class=simt --accum=f32 --cta_m=128 --cta_n=128 --cta_k=8 --stages=2 --warps_m=4  \
                  --warps_n=2 --warps_k=1 --inst_m=1 --inst_n=1 --inst_k=1 --min_cc=50 --max_cc=1024

           Bytes: 2055798784  bytes
           FLOPs: 118482796544  flops

         Runtime: 7.34266  ms
          Memory: 260.752 GiB/s

            Math: 16136.2 GFLOP/s


=============================

More Details on Compiling CUTLASS Kernels and CUTLASS Profiler

About

CUTLASS is released by NVIDIA Corporation as Open Source software under the 3-clause "New" BSD license.

Contributors

The official list of CUTLASS developers and contributors is available here: CONTRIBUTORS.

Copyright

Copyright (c) 2017 - 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. SPDX-License-Identifier: BSD-3-Clause

  Redistribution and use in source and binary forms, with or without
  modification, are permitted provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.

  3. Neither the name of the copyright holder nor the names of its
  contributors may be used to endorse or promote products derived from
  this software without specific prior written permission.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
  AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
  DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
  FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
  SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
  CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
  OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


cutlass's Issues

includes for coord.h platform.h

Hi! I added cutlass as a git submodule of a bigger repository. When trying to include the gemm.h header file, the following error occurred:

In file included from /home/aaron/seagate/saber-cpp/cutlass/cutlass/coord.h:32:0,  
from /home/aaron/seagate/saber-cpp/cutlass/cutlass/gemm/gemm.h:34,
from /home/aaron/seagate/saber-cpp/src/matmul.cu:82,
from /home/aaron/seagate/saber-cpp/src/main.cc:6:
/home/aaron/seagate/saber-cpp/cutlass/cutlass/util/platform.h:601:21: error: 'int4' was not declared in this scope
 struct alignment_of<int4> {

It looks like coord.h only includes platform.h, which does not include vector_types.h from CUDA.

May I know if I'm missing something? Thanks.

Questions about the distribution of the threads over the tiles

Hello,

I have questions about the cuda core sgemm.

  1. Each thread block loads per iteration a 128x8 A-tile and an 8x128 B-tile from global into shared memory. With 256 threads per thread block, would each thread then compute an 8x8 matrix multiplication?

  2. How are the threads distributed over the A-tile * B-tile?

Greetings,
James.
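For reference, a quick arithmetic check of the numbers in this question:

// Output tile per thread block: 128 x 128 = 16384 accumulators.
// Divided across 256 threads: 16384 / 256 = 64 values per thread,
// i.e. an 8x8 fragment per thread, as the question suggests.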

(question) MatMin instead of MatMul?

Hi,

I'm relatively new to CUDA / Parallel programming and wondering if cutlass is the right library for me.

I want to create an algorithm which does a simple operation on a matrix A and its transpose B.

It's exactly the same as matrix-matrix multiplication, but using the min() operation on elements instead of the multiply operation.

Is cutlass the right library for this? Can someone point me in the right direction?
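For concreteness, the operation described here is C[i][j] = sum over k of min(A[i][k], B[k][j]); a naive CUDA sketch of it (not a CUTLASS kernel, just an illustration of the semantics) looks like this:

// Naive "min instead of multiply" GEMM-like kernel, row-major inputs.
__global__ void matmin_kernel(int M, int N, int K,
                              const float *A, const float *B, float *C) {
  int i = blockIdx.y * blockDim.y + threadIdx.y;
  int j = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < M && j < N) {
    float acc = 0.f;
    for (int k = 0; k < K; ++k)
      acc += fminf(A[i * K + k], B[k * N + j]);
    C[i * N + j] = acc;
  }
}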

Multiply add batches of matrices of shape [n, 4, 4], [n, 4, 4] and [n,4,4] directly on tensor cores

Will it be possible in the near future to directly use the tensor cores, with batches of matrices of shape [n, 4, 4]?
I imagine a function like this one https://github.com/NVIDIA/cutlass/blob/master/cutlass/gemm/fp16_sgemm_multiply_add.h#L69
now capable of reading three matrices:

a is of type float16 and has shape [n, 4, 4]
b is of type float16 and has shape [n, 4, 4]
c is of type float32 and has shape [n, 4, 4]

Is there some devil in the details I am missing that makes this non-trivial to create? Your documentation states it cannot be done.

It would be really awesome to have this function for this (relatively) new type of neural network that can reverse engineer scene trees, which operates on 4x4 matrices: https://openreview.net/pdf?id=HJWLfGWRb

I also imagine this very useful for many other applications.

I really hope you will consider making this if it is not too much work.

Aha! Link: https://nvaiinfa.aha.io/features/CUTLASS-65
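For reference, the strided-batched pattern described here is expressible in later CUTLASS releases via the device-level GemmBatched template; a minimal FP32 sketch (an FP16 tensor-core variant would need additional template arguments, and 4x4 problems are far too small to approach peak tensor-core throughput):

#include "cutlass/gemm/device/gemm_batched.h"

// n independent 4x4 GEMMs, C_i = A_i * B_i, with matrices packed
// contiguously so each batch stride is 16 elements.
cudaError_t batched_gemm_4x4(int n, float const *A, float const *B, float *C) {
  using ColumnMajor = cutlass::layout::ColumnMajor;
  using Gemm = cutlass::gemm::device::GemmBatched<float, ColumnMajor,
                                                  float, ColumnMajor,
                                                  float, ColumnMajor>;
  Gemm gemm_op;
  cutlass::Status status = gemm_op({{4, 4, 4},   // per-batch problem size
                                    {A, 4}, 16,  // A ref and batch stride
                                    {B, 4}, 16,  // B ref and batch stride
                                    {C, 4}, 16,  // C ref and batch stride
                                    {C, 4}, 16,  // D ref and batch stride
                                    {1.f, 0.f},  // alpha, beta
                                    n});         // batch count
  return status == cutlass::Status::kSuccess ? cudaSuccess : cudaErrorUnknown;
}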

How to cite Cutlass

Quick issue/request. I'm working on some research that utilizes CUTLASS SGEMM (and batched) and want to reference CUTLASS if we publish.

I couldn't find a BibTeX entry anywhere to include, and some light googling didn't yield much. Do you have an open paper to reference or a BibTeX entry you could supply in your readme?

Matrix Vector Multiplication Performance

I'm seeing very poor performance compared to a simple implementation I wrote for matrix-vector multiplication. I suspect I'm either using CUTLASS incorrectly or the dimensions of my problem are non-ideal for CUTLASS. Is there something better than the default Shape template argument for SgemmTraits for this problem? Are there level-2 (matrix-vector) functions that I'm missing?

Here is the code I'm using for the comparison:

#include "cutlass/gemm/gemm.h"
#include "cutlass/gemm/sgemm_traits.h"

#include <iostream>
#include <cstdint>
#include <string>   // std::string is used in main
#include <vector>   // std::vector is used for the host buffers

template <int BLOCK_SIZE>
__global__ void feed_forward_kernel(
  const float* inputs,
  const float* weights,
  float* outputs)
{
  __shared__ float input_block[BLOCK_SIZE];

  int cell_idx = blockIdx.x * blockDim.x + threadIdx.x;
  int cell_cnt = gridDim.x * blockDim.x;

  float a = 0.f; //biases[cell_idx];
  for (int i = 0; i < gridDim.x; ++i)
  {
    int input_idx = i * blockDim.x + threadIdx.x;
    input_block[threadIdx.x] = inputs[input_idx];
    __syncthreads();

    for (int j = 0; j < blockDim.x; ++j)
    {
      float in = input_block[j];
      a += in * weights[(cell_cnt * cell_idx) + (i * blockDim.x + j)];
    }
    __syncthreads();
  }

  outputs[cell_idx] = a; //1.f / (1.f + __expf(-a));
}

const std::size_t neurons_per_layer = 2048;
const std::size_t block_size = 32;

void run_my_gemm(float* weights,
  float* inputs,
  float* activations)
{
  dim3 block_dims(block_size, 1);
  dim3 num_blocks(neurons_per_layer / block_size, 1);

  feed_forward_kernel<block_size><<<num_blocks, block_dims>>>(
    inputs,
    weights,
    activations);
}

int main(int argc, char** argv)
{
  bool use_cutlass = argc > 1 && std::string(argv[1]) == "cutlass";

  int m = neurons_per_layer;
  int n = 1;
  int k = neurons_per_layer;
  float alpha = 1.f;
  float beta = 0.f;

  float* weights;
  int lda = k;
  float* inputs;
  int ldb = n;
  float* activations;
  int ldc = n;

  cudaMalloc(&inputs, sizeof(float) * neurons_per_layer);
  cudaMalloc(&weights, sizeof(float) * neurons_per_layer * neurons_per_layer);
  cudaMalloc(&activations, sizeof(float) * neurons_per_layer);

  std::vector<float> buf;
  buf.resize(k, 4.f);
  cudaMemcpy(inputs, buf.data(), buf.size() * sizeof(float), cudaMemcpyHostToDevice);
  
  buf.resize(0);
  buf.resize(m * k, 0.3);
  cudaMemcpy(weights, buf.data(), buf.size() * sizeof(float), cudaMemcpyHostToDevice);

  typedef cutlass::gemm::Gemm<cutlass::gemm::SgemmTraits<
    cutlass::MatrixLayout::kColumnMajor,
    cutlass::MatrixLayout::kColumnMajor,
    cutlass::Shape<8, 128, 128>>> Gemm;
  
  Gemm::Params params;
  params.initialize(m, n, k, alpha, weights, lda, inputs, ldb, beta, nullptr, 0, activations, ldc);

  for (std::size_t i = 0; i < 1000; ++i)
  {
    cudaError_t res;
    if (use_cutlass)
      res = Gemm::launch(params);
    else
      run_my_gemm(weights, inputs, activations);
    
    buf.resize(0);
    buf.resize(m * n);
    res = cudaMemcpy(buf.data(), activations, buf.size() * sizeof(float), cudaMemcpyDeviceToHost);
  }

  bool print_output = false;
  if (print_output)
  {
    for (float c : buf)
      std::cout << c << "\t";
    std::cout << std::endl;
  }

  cudaFree(inputs);       // cudaFree takes the device pointer itself, not its address
  cudaFree(weights);
  cudaFree(activations);

  return 0;
}

Here are the results I'm getting on a Tesla K80:

$ /usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Tue_Jun_12_23:07:04_CDT_2018
Cuda compilation tools, release 9.2, V9.2.148
$ time build/gemm-test cutlass

real	0m11.882s
user	0m8.200s
sys	0m3.650s
$ time build/gemm-test simple

real	0m1.541s
user	0m1.048s
sys	0m0.486s

Issues building cutlass 1.2 on linux-centos

error: identifier "cublasGemmStridedBatchedEx" is undefined

1 error detected in the compilation of "/tmp/tmpxft_0000597f_00000000-6_partitionedK_sgemm_128x128x8.cpp1.ii".
CMake Error at cutlass_unit_test_generated_partitionedK_sgemm_128x128x8.cu.o.Release.cmake:279 (message):

cutlass performance

Hi,
I profiled CUTLASS using nvvp on a TX2 with JetPack 3.2, using "./cutlass_perf_test --m=10240 --k=4096 --n=4096 --kernels=sgemm_tt". CUTLASS took 727 ms while cuBLAS took 545 ms, making cuBLAS about 1.33x faster than CUTLASS. Why? I had understood sgemm_tt should reach about 95% of cuBLAS performance.

And I profiled again on a Titan Xp with CUDA 9.0, sgemm_nn: cuBLAS costs 33 ms and CUTLASS costs 46 ms, which is 71% of the cuBLAS performance...

Performance using matrices with small K

Performance of CUTLASS with large K is great.
I get 35702.8 GFLOP/s on an NVIDIA 2070 Super running the performance test with the following command:
./cutlass_perf_test --m=2048 --n=2048 --k=2048 --kernels=wmma_gemm_f16_tn

However, when K is considerably smaller, I get worse performance, as expected (10182.5 GFLOP/s), executing the following command:
./cutlass_perf_test --m=2048 --n=2048 --k=64 --kernels=wmma_gemm_f16_tn

My application relies on lots of multiplications of matrices with small K and large values for M and N.
Is there a way to use CUTLASS that allows me to get close to the performance I see with large K?
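One general remedy for small-K shapes is split-K parallelism, which partitions the reduction dimension across additional thread blocks; the modern cutlass_profiler exposes this through the --split_k_slices flag visible in the argument listings earlier on this page. For example (values illustrative):

$ ./tools/profiler/cutlass_profiler --kernels=sgemm --m=2048 --n=2048 --k=64 --split_k_slices=8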

Eigenvector from device code

Is there a library function in CUTLASS that can be invoked from device code to calculate the eigenvectors of a covariance matrix?

Misaligned address when doing hgemm with odd M or K

The following program fails with "misaligned address" if either M or K is odd, while cuBLAS allows such inputs:

#include <cstdio>

#include <cuda_fp16.h>
#include <cuda_runtime_api.h>

#include <cublas_v2.h>

#include <cutlass/cutlass.h>
#include <cutlass/gemm/gemm.h>
#include <cutlass/gemm/hgemm_traits.h>
#include <cutlass/gemm/wmma_gemm_traits.h>

using namespace std;

int main(int argc, char* argv[]) {
  int M = 2, K = 3, N = 4;    // fails if either M or K is odd
  half *A, *B, *C;
  half *lA = new half[M * K], *lB = new half[K * N], *lC = new half[M * N];
  cudaMalloc(&A, sizeof(half) * M * K);
  cudaMalloc(&B, sizeof(half) * K * N);
  cudaMalloc(&C, sizeof(half) * M * N);
  float v = 0;
  for (int i = 0; i < M * K; i++) {
    lA[i] = __float2half(v++);
  }
  v = 0;
  for (int i = 0; i < K * N; i++) {
    lB[i] = __float2half(v++);
  }
  cudaMemcpy(A, lA, sizeof(half) * M * K, cudaMemcpyHostToDevice);
  cudaMemcpy(B, lB, sizeof(half) * K * N, cudaMemcpyHostToDevice);
  cublasHandle_t handle;
  cublasCreate(&handle);
  half a = __float2half(1), b = __float2half(0);
  typedef cutlass::gemm::HgemmTraits<
    cutlass::MatrixLayout::kColumnMajor,
    cutlass::MatrixLayout::kColumnMajor> GemmTraits;
  typedef cutlass::gemm::Gemm<GemmTraits> Gemm;
  typename Gemm::Params params;
  params.initialize(M, N, K, a, A, M, B, K, b, C, M, C, M);
  Gemm::launch(params);
  // cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
  //             &a, A, M, B, K, &b, C, M);
  auto ret  = cudaDeviceSynchronize();
  printf(">> %s\n", cudaGetErrorString(cudaGetLastError()));
  cudaMemcpy(lC, C, sizeof(half) * M * N, cudaMemcpyDeviceToHost);
  for (int j = 0; j < N; j++) {
    for (int i = 0; i < M; i++) {
      printf("%.f\n", __half2float(lC[j * M + i]));
    }
  }
  cublasDestroy(handle);
  cudaFree(A);
  cudaFree(B);
  cudaFree(C);
  delete[] lA;  // arrays allocated with new[] must be released with delete[]
  delete[] lB;
  delete[] lC;
  return 0;
}

Is it by-design behavior or a bug?

strange compilation error "error: ThreadMultiplyAdd is not a template"

In the current master branch of commit 877bdca

When compiling Igemm, it seems that the macro #if (!defined(__CUDA_ARCH__) || (__CUDA_ARCH__ >= 610)) has some kind of impact on compilation.

For example, if you delete the #if (!defined(__CUDA_ARCH__) || (__CUDA_ARCH__ >= 610)) and #endif in cutlass/tools/test/unit/gemm/igemm_128x128x32.cu and then compile the unit test, the compiler will report an error:

cutlass/cutlass/gemm/igemm_traits.h(69): error: ThreadMultiplyAdd is not a template

I have tested it on two of my platforms, and the error occurs on both.

Could someone give me some advice on how this happens?

CUTLASS failed when built with /permissive- + MSVC on Windows.

CUTLASS failed with error C3861 when built with /permissive- on Windows; I am using the latest source on the master branch. Could you please help take a look at this?
You can repro this issue with the steps below:

  1. git clone https://github.com/NVIDIA/cutlass d:\Cutclass\src
  2. open a clean x64 prompt and browse to D:\CutClass
  3. mkdir build_x64 && pushd build_x64
  4. set CL=/permissive-
  5. cmake -G "Visual Studio 15 2017 Win64" -DCMAKE_SYSTEM_VERSION=10.0.17134.0 -DCUTLASS_NVCC_ARCHS=70 ..\src\
  6. msbuild /p:Configuration=Release;Platform=x64 CUTLASS.sln /t:Rebuild

Error info:
D:\cutclass\src\cutlass/gemm/wmma_gemm_global_tile.h(108): error C3861: 'stride_d': identifier not found [D:\cutclass\build_x64\tools\test\unit\cutlass_unit_test.vcxproj]
D:\cutclass\src\cutlass/gemm/wmma_gemm_global_tile.h(162): note: see reference to class template instantiation 'cutlass::gemm::WmmaGemmGlobalIteratorCd<TileTraits_,Index_>' being compiled
D:\cutclass\src\cutlass/gemm/wmma_gemm_global_tile.h(110): error C3861: 'stride_h': identifier not found [D:\cutclass\build_x64\tools\test\unit\cutlass_unit_test.vcxproj]
D:\cutclass\src\cutlass/gemm/wmma_gemm_global_tile.h(112): error C3861: 'inc_h': identifier not found [D:\cutclass\build_x64\tools\test\unit\cutlass_unit_test.vcxproj]
D:\cutclass\src\cutlass/gemm/wmma_gemm_global_tile.h(113): error C3861: 'inc_advance': identifier not found [D:\cutclass\build_x64\tools\test\unit\cutlass_unit_test.vcxproj]
D:\cutclass\src\cutlass/gemm/wmma_gemm_global_tile.h(113): error C3861: 'inc_h': identifier not found [D:\cutclass\build_x64\tools\test\unit\cutlass_unit_test.vcxproj]
D:\cutclass\src\cutlass/gemm/wmma_gemm_global_tile.h(115): error C3861: 'predicate_offset': identifier not found [D:\cutclass\build_x64\tools\test\unit\cutlass_unit_test.vcxproj]
D:\cutclass\src\cutlass/gemm/wmma_gemm_global_tile.h(116): error C3861: 'predicate_inc_h': identifier not found [D:\cutclass\build_x64\tools\test\unit\cutlass_unit_test.vcxproj]
D:\cutclass\src\cutlass/gemm/wmma_gemm_global_tile.h(117): error C3861: 'predicate_inc_advance': identifier not found [D:\cutclass\build_x64\tools\test\unit\cutlass_unit_test.vcxproj]
D:\cutclass\src\cutlass/gemm/wmma_gemm_global_tile.h(117): error C3861: 'predicate_inc_h': identifier not found [D:\cutclass\build_x64\tools\test\unit\cutlass_unit_test.vcxproj]

Could there be a direct way to do type conversion when storing to global

The default output type of IGEMM is int32_t. In my situation, I need to store the result as a fixed-point number in a uint8_t.

It's easy to do the calculation, as issue #17 mentioned. But when I want to store the data, I have to make some intrusive modifications to struct GemmGlobalIteratorCd and struct Store. Though the modification is simple, I did a lot of tests to figure out how the global store works.

So, I hope there could be a better solution in the future.
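For reference, later CUTLASS releases expose this conversion through the epilogue functor rather than the global iterators; a sketch in the 2.x API (the template values here are illustrative):

#include <cstdint>
#include "cutlass/epilogue/thread/linear_combination_clamp.h"

// Epilogue that scales int32 accumulators and clamps/converts to uint8_t:
//   D = clamp(alpha * accumulator + beta * C)
using EpilogueOp = cutlass::epilogue::thread::LinearCombinationClamp<
    uint8_t,   // output element type stored to global memory
    8,         // elements per vectorized memory access
    int32_t,   // accumulator type produced by the int8 MMA
    float>;    // compute type for alpha/beta scaling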

CUTLASS failed to build due to error: Could not detect active GPU device ID with MSVC on Windows.

CUTLASS failed to build due to the error "Could not detect active GPU device ID" with MSVC on Windows. It first reproduces at revision fb335f6. Could you please help take a look at this?

Repro Steps:

  1. git clone https://github.com/NVIDIA/cutlass d:\Cutclass\src
  2. open a clean x64 prompt and browse to D:\CutClass
  3. mkdir build_x64 && pushd build_x64
  4. cmake -G "Visual Studio 15 2017 Win64" -DCMAKE_SYSTEM_VERSION=10.0.17134.0 -DCUTLASS_NVCC_ARCHS=70 ..\src\
  5. msbuild /p:Configuration=Release;Platform=x64 CUTLASS.sln /t:Rebuild

log_x64_build.log

Error info:
CUSTOMBUILD : *** error : Could not detect active GPU device ID [CUDA driver version is insufficient for CUDA runtime version] [D:\cutclass\build_x64\test\unit\core\test_unit_core.vcxproj]
C:\Program Files (x86)\Microsoft Visual Studio\2017\Enterprise\Common7\IDE\VC\VCTargets\Microsoft.CppCommon.targets(209,5): error MSB6006: "cmd.exe" exited with code -1073741515. [D:\cutclass\build_x64\test\unit\nvrtc\thread\test_unit_nvrtc_thread.vcxproj]

terminate called after throwing an instance of 'std::runtime_error'

Hello, all
My GPU is GTX1080 Ti, NVCC 10.0
When I run ./cutlass_profiler --kernels=sgemm --m=4352 --n=4096 --k=4096, it shows

terminate called after throwing an instance of 'std::runtime_error'
what(): Failed to query occupancy.
Aborted (core dumped)

I checked previous issues, but this issue has not been posted before.

CMAKE fails if directories have spaces

Many thanks for cutlass--I am just getting started but it looks ideal.

I had trouble with cmake when the installation directory had spaces in the path. (In my case, "OneDrive - Company Name"). I believe the problem is here:

https://github.com/NVIDIA/cutlass/blob/master/CUDA.cmake#L209

It seems that cmake interprets the spaces in the path as delimiters. In my case, it failed when it couldn't find "-" above.

When I moved to a path without spaces, it was fine (Visual Studio 2019, community edition).

Thanks!

error: more than one conversion function from "half" to a built-in type applies

Hi all,
I encountered an error thrown by the nvcc compiler when I used the CUTLASS library to implement HGEMM (FP16 GEMM). The following log shows the error message:

error: more than one conversion function from "half" to a built-in type applies:
            function "__half::operator float() const"
            function "__half::operator short() const"
            function "__half::operator unsigned short() const"
            function "__half::operator int() const"
            function "__half::operator unsigned int() const"
            function "__half::operator long long() const"
            function "__half::operator unsigned long long() const"
            function "__half::operator __nv_bool() const"

This error was caused at https://github.com/NVIDIA/cutlass/blob/master/cutlass/gemm/epilogue_function.h line 88:

class blas_scaled_epilogue
{
   // ...ignore...

        inline __device__
        bool must_init_addend()
        {
            return (beta != scalar_t(0)); // <-- here, "scalar_t" was defined as half (FP16) type, beta was a member variable of this class
        }
   // ...ignore...
};

Do you know how to solve this error?

Convolution/cross-correlation example

I think it would be extremely useful to have a 2D convolution or cross-correlation example.

Those have a lot of applications, in particular:

  • Deep Learning
  • Image and video processing (blur, edge enhancement, embossing, sharpening, denoising ...)

Multiplication of int8 type in large-scale asymmetric matrices performs poorly.

I tried using Tensor Cores with IWMMA on an RTX 2080, but it did not seem to work properly when the int8 matrix is M = 1,000,000 (100W), N = 16~512, K = 384; it reaches only up to 28 TFLOP/s, which is a severe decline.

When the matrix dimensions are M = N = K = 4096, it reaches up to 64 TFLOP/s. But I think there is still a certain distance from the peak using Tensor Cores.

The 05_wmma example I use is from CUTLASS.

How can I reach 90 TFLOP/s?
I am very eager to find a solution. Thanks.

Aha! Link: https://nvaiinfa.aha.io/features/CUTLASS-66

Issues building CUTLASS 1.2 on MacOS..

Hi,
just downloaded CUDA 10 SDK for MacOS and tried to build new CUTLASS 1.2..
CUDA apps like CUDA-Z work fine..
I'm on 10.13.6 with 410.130 driver and Xcode 9.4.1..

clang -v

Apple LLVM version 9.1.0 (clang-902.0.39.2)
Target: x86_64-apple-darwin17.7.0
Thread model: posix
InstalledDir: /Volumes/datosx/sep/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
Found CUDA installation: /usr/local/cuda, version unknown

building fails with error:

/Volumes/datosx/sep/cutlass/tools/external/googletest/googletest/include/gtest/gtest-message.h:131:10: error: invalid operands to binary expression ('std::__1::basic_stringstream<char, std::__1::char_traits<char>, std::__1::allocator<char> >' and 'const std::__1::vector<int, std::__1::allocator<int> >')
(*(ss_)) << val; 

full log:

 make
[  2%] Built target gtest
[  4%] Built target gtest_main
[  5%] Building NVCC (Device) object tools/test/unit/CMakeFiles/cutlass_unit_test.dir/core/cutlass_unit_test_generated_layout_verification.cu.o
/usr/local/cuda/include/cuda_fp16.hpp:466:32: warning: implicit conversion changes signedness: 'int' to 'unsigned int' [-Wsign-conversion]
unsigned mantissa = (h & 1023) << 13; 
         ~~~~~~~~   ~~~~~~~~~~~^~~~~
/Volumes/datosx/sep/cutlass/tools/util/half.h:292:15: warning: implicit conversion changes signedness: 'uint16_t' (aka 'unsigned short') to 'int16_t' (aka 'short') [-Wsign-conversion]
int16_t exp = (uint16_t)(((s >> 23) & (255)) - (127)); 
        ~~~   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/Volumes/datosx/sep/cutlass/tools/util/half.h:316:8: warning: implicit conversion changes signedness: 'uint16_t' (aka 'unsigned short') to 'int16_t' (aka 'short') [-Wsign-conversion]
exp = ((uint16_t)(exp + ((uint16_t)15))); 
    ~  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/Volumes/datosx/sep/cutlass/tools/util/half.h:357:35: warning: implicit conversion changes signedness: 'int' to 'unsigned int' [-Wsign-conversion]
f = (((sign << 31) | (exp << 23)) | (mantissa << 13)); 
  ~  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~
/Volumes/datosx/sep/cutlass/tools/util/half.h:367:35: warning: implicit conversion changes signedness: 'int' to 'unsigned int' [-Wsign-conversion]
f = (((sign << 31) | (exp << 23)) | (mantissa << 13)); 
  ~  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~
/Volumes/datosx/sep/cutlass/tools/util/half.h:370:11: warning: implicit conversion changes signedness: 'int' to 'unsigned int' [-Wsign-conversion]
f = (sign << 31); 
  ~  ~~~~~^~~~~
/Volumes/datosx/sep/cutlass/tools/util/half.h:376:18: warning: implicit conversion changes signedness: 'int' to 'unsigned int' [-Wsign-conversion]
f = ((255 << 23) | (sign << 31)); 
  ~  ~~~~~~~~~~~~^~~~~~~~~~~~~~
/Volumes/datosx/sep/cutlass/tools/util/half.h:639:68: warning: implicit conversion changes signedness: 'int' to 'unsigned int' [-Wsign-conversion]
inline unsigned cutlass::half2_t::packed() const { return ((lo).x) | (((hi).x) << 16); } 
                                                   ~~~~~~ ~~~~~~~~~^~~~~~~~~~~~~~~~~~
/Volumes/datosx/sep/cutlass/tools/util/type_traits.h:144:11: warning: comparison of constant 32768 with expression of type 'cutlass::TypeTraits<__half>::integer_type' (aka 'short') is always false [-Wtautological-constant-out-of-range-compare]
if (h_int == 32768) { 
    ~~~~~ ^  ~~~~~
/Volumes/datosx/sep/cutlass/cutlass/core_io.h:92:30: warning: implicit conversion changes signedness: 'int' to 'uint32_t' (aka 'unsigned int') [-Wsign-conversion]
(out << ((int)(scalar.value)[i])); 
              ~              ^
/Volumes/datosx/sep/cutlass/cutlass/core_io.h:105:30: warning: implicit conversion changes signedness: 'int' to 'uint32_t' (aka 'unsigned int') [-Wsign-conversion]
(out << ((int)(scalar.value)[i])); 
              ~              ^
/Volumes/datosx/sep/cutlass/cutlass/core_io.h:118:35: warning: implicit conversion changes signedness: 'int' to 'uint32_t' (aka 'unsigned int') [-Wsign-conversion]
(out << ((unsigned)(scalar.value)[i])); 
                   ~              ^
/Volumes/datosx/sep/cutlass/tools/test/unit/core/layout_verification.cu:56:22: warning: implicit conversion changes signedness: 'int' to 'std::__1::vector<std::__1::vector<int, std::__1::allocator<int> >, std::__1::allocator<std::__1::vector<int, std::__1::allocator<int> > > >::size_type' (aka 'unsigned long') [-Wsign-conversion]
(dim_extent_).resize(_rank, extent_); 
              ~~~~~~ ^~~~~
/Volumes/datosx/sep/cutlass/tools/test/unit/core/layout_verification.cu:94:66: warning: implicit conversion changes signedness: 'int' to 'std::__1::vector<int, std::__1::allocator<int> >::size_type' (aka 'unsigned long') [-Wsign-conversion]
(((std::cout << ((i) ? ", " : ("")))) << ((dim_extent_).at(r).at(i))); 
                                                              ~~ ^
/Volumes/datosx/sep/cutlass/tools/test/unit/core/layout_verification.cu:104:24: warning: implicit conversion changes signedness: 'int' to 'std::__1::vector<int, std::__1::allocator<int> >::size_type' (aka 'unsigned long') [-Wsign-conversion]
Coordinate coord(this->rank(), 0); 
           ~~~~~ ~~~~~~^~~~~~
/Volumes/datosx/sep/cutlass/tools/test/unit/core/layout_verification.cu:111:26: warning: implicit conversion changes signedness: 'const int' to 'std::__1::vector<int, std::__1::allocator<int> >::size_type' (aka 'unsigned long') [-Wsign-conversion]
coord.at((layout_).at(i).dim) += (quotient * (dim_extent_).at((layout_).at(i).dim).at(i + (1))); 
      ~~ ~~~~~~~~~~~~~~~~^~~
/Volumes/datosx/sep/cutlass/tools/test/unit/core/layout_verification.cu:111:79: warning: implicit conversion changes signedness: 'const int' to 'std::__1::vector<std::__1::vector<int, std::__1::allocator<int> >, std::__1::allocator<std::__1::vector<int, std::__1::allocator<int> > > >::size_type' (aka 'unsigned long') [-Wsign-conversion]
coord.at((layout_).at(i).dim) += (quotient * (dim_extent_).at((layout_).at(i).dim).at(i + (1))); 
                                                           ~~ ~~~~~~~~~~~~~~~~^~~
/Volumes/datosx/sep/cutlass/tools/test/unit/core/layout_verification.cu:114:27: warning: implicit conversion changes signedness: 'const int' to 'std::__1::vector<int, std::__1::allocator<int> >::size_type' (aka 'unsigned long') [-Wsign-conversion]
coord.at((layout_).back().dim) += index; 
      ~~ ~~~~~~~~~~~~~~~~~^~~
/Volumes/datosx/sep/cutlass/tools/test/unit/core/layout_verification.cu:131:22: warning: implicit conversion changes signedness: 'int' to 'std::__1::vector<int, std::__1::allocator<int> >::size_type' (aka 'unsigned long') [-Wsign-conversion]
int items = coord.at(dim); 
                  ~~ ^~~
/Volumes/datosx/sep/cutlass/tools/test/unit/core/layout_verification.cu:137:10: warning: implicit conversion changes signedness: 'int' to 'std::__1::vector<int, std::__1::allocator<int> >::size_type' (aka 'unsigned long') [-Wsign-conversion]
coord.at(dim) = quotient; 
      ~~ ^~~
/Volumes/datosx/sep/cutlass/tools/test/unit/core/layout_verification.cu:147:46: warning: implicit conversion changes signedness: 'int' to 'std::__1::vector<int, std::__1::allocator<int> >::size_type' (aka 'unsigned long') [-Wsign-conversion]
(((out << ((i) ? ", " : ("")))) << (coord.at(i))); 
                                          ~~ ^
/Volumes/datosx/sep/cutlass/tools/external/googletest/googletest/include/gtest/gtest-message.h:131:10: error: invalid operands to binary expression ('std::__1::basic_stringstream<char, std::__1::char_traits<char>, std::__1::allocator<char> >' and 'const std::__1::vector<int, std::__1::allocator<int> >')
(*(ss_)) << val; 
~~~~~~~~ ^  ~~~
/Volumes/datosx/sep/cutlass/tools/test/unit/core/layout_verification.cu:168:417: note: in instantiation of function template specialization 'testing::Message::operator<<<std::__1::vector<int, std::__1::allocator<int> > >' requested here
switch (0) { case 0:  default:  if (const testing::AssertionResult gtest_ar = ::testing::internal::EqHelper< false> ::Compare("i", "index", i, index)) { ; } else { (testing::internal::AssertHelper(::testing::TestPartResult::kNonFatalFailure, "/Volumes/datosx/sep/cutlass/tools/test/unit/core/layout_verification.cu", 168, gtest_ar.failure_message()) = (((((((((((((testing::Message() << ("[")) << i)) << ("] - ("))) << (layout(i)))) << (") => "))) << (layout(layout(i))))) << (std::endl)))); }  }  
                                                                                                                                                                                                                                                                                                                                                                                                                                ^
/Volumes/datosx/sep/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/ostream:218:16: note: candidate function not viable: no known conversion from 'const std::__1::vector<int, std::__1::allocator<int> >' to 'const void *' for 1st argument; take the address of the argument with &
basic_ostream &operator<<(const void * __p); 
               ^
/Volumes/datosx/sep/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/ostream:755:1: note: candidate function not viable: no known conversion from 'const std::__1::vector<int, std::__1::allocator<int> >' to 'char' for 2nd argument
operator<<(basic_ostream< _CharT, _Traits>  &__os, char __cn) 
^
/Volumes/datosx/sep/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/ostream:788:1: note: candidate function not viable: no known conversion from 'const std::__1::vector<int, std::__1::allocator<int> >' to 'char' for 2nd argument
operator<<(basic_ostream< char, _Traits>  &__os, char __c) 
^
/Volumes/datosx/sep/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/ostream:795:1: note: candidate function not viable: no known conversion from 'const std::__1::vector<int, std::__1::allocator<int> >' to 'signed char' for 2nd argument
operator<<(basic_ostream< char, _Traits>  &__os, signed char __c) 
^
/Volumes/datosx/sep/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/ostream:802:1: note: candidate function not viable: no known conversion from 'const std::__1::vector<int, std::__1::allocator<int> >' to 'unsigned char' for 2nd argument
operator<<(basic_ostream< char, _Traits>  &__os, unsigned char __c) 
^
/Volumes/datosx/sep/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/ostream:816:1: note: candidate function not viable: no known conversion from 'const std::__1::vector<int, std::__1::allocator<int> >' to 'const char *' for 2nd argument
operator<<(basic_ostream< _CharT, _Traits>  &__os, const char *__strn) 
^
/Volumes/datosx/sep/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/ostream:862:1: note: candidate function not viable: no known conversion from 'const std::__1::vector<int, std::__1::allocator<int> >' to 'const char *' for 2nd argument
operator<<(basic_ostream< char, _Traits>  &__os, const char *__str) 
^
/Volumes/datosx/sep/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/ostream:869:1: note: candidate function not viable: no known conversion from 'const std::__1::vector<int, std::__1::allocator<int> >' to 'const signed char *' for 2nd argument
operator<<(basic_ostream< char, _Traits>  &__os, const signed char *__str) 
^
/Volumes/datosx/sep/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/ostream:877:1: note: candidate function not viable: no known conversion from 'const std::__1::vector<int, std::__1::allocator<int> >' to 'const unsigned char *' for 2nd argument
operator<<(basic_ostream< char, _Traits>  &__os, const unsigned char *__str) 
^
/Volumes/datosx/sep/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/ostream:1061:1: note: candidate function not viable: no known conversion from 'const std::__1::vector<int, std::__1::allocator<int> >' to 'const std::__1::error_code' for 2nd argument
operator<<(basic_ostream< _CharT, _Traits>  &__os, const error_code &__ec) 
^
/Volumes/datosx/sep/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/ostream:194:1: note: candidate function not viable: no known conversion from 'const std::__1::vector<int, std::__1::allocator<int> >' to 'std::__1::basic_ostream<char, char_traits<char> > &(*)(std::__1::basic_ostream<char, char_traits<char> > &)' for 1st argument
operator<<(basic_ostream &(*__pf)(basic_ostream &)) 
^
/Volumes/datosx/sep/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/ostream:198:1: note: candidate function not viable: no known conversion from 'const std::__1::vector<int, std::__1::allocator<int> >' to '__1::basic_ios<char, char_traits<char> > &(*)(__1::basic_ios<char, char_traits<char> > &)' for 1st argument
operator<<(__1::basic_ios< _CharT, _Traits>  &(*
^
/Volumes/datosx/sep/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/ostream:203:1: note: candidate function not viable: no known conversion from 'const std::__1::vector<int, std::__1::allocator<int> >' to '__1::ios_base &(*)(__1::ios_base &)' for 1st argument
operator<<(__1::ios_base &(*__pf)(__1::ios_base &)) 
^
/Volumes/datosx/sep/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/ostream:206:16: note: candidate function not viable: no known conversion from 'const std::__1::vector<int, std::__1::allocator<int> >' to 'bool' for 1st argument
basic_ostream &operator<<(bool __n); 
               ^
/Volumes/datosx/sep/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/ostream:207:16: note: candidate function not viable: no known conversion from 'const std::__1::vector<int, std::__1::allocator<int> >' to 'short' for 1st argument
basic_ostream &operator<<(short __n); 
               ^
/Volumes/datosx/sep/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/ostream:208:16: note: candidate function not viable: no known conversion from 'const std::__1::vector<int, std::__1::allocator<int> >' to 'unsigned short' for 1st argument
basic_ostream &operator<<(unsigned short __n); 
               ^
/Volumes/datosx/sep/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/ostream:209:16: note: candidate function not viable: no known conversion from 'const std::__1::vector<int, std::__1::allocator<int> >' to 'int' for 1st argument
basic_ostream &operator<<(int __n); 
               ^
/Volumes/datosx/sep/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/ostream:210:16: note: candidate function not viable: no known conversion from 'const std::__1::vector<int, std::__1::allocator<int> >' to 'unsigned int' for 1st argument
basic_ostream &operator<<(unsigned __n); 
               ^
/Volumes/datosx/sep/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/ostream:211:16: note: candidate function not viable: no known conversion from 'const std::__1::vector<int, std::__1::allocator<int> >' to 'long' for 1st argument
basic_ostream &operator<<(long __n); 
               ^
/Volumes/datosx/sep/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/ostream:212:16: note: candidate function not viable: no known conversion from 'const std::__1::vector<int, std::__1::allocator<int> >' to 'unsigned long' for 1st argument
basic_ostream &operator<<(unsigned long __n); 
               ^
/Volumes/datosx/sep/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/ostream:213:16: note: candidate function not viable: no known conversion from 'const std::__1::vector<int, std::__1::allocator<int> >' to 'long long' for 1st argument
basic_ostream &operator<<(long long __n); 
               ^
/Volumes/datosx/sep/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/ostream:214:16: note: candidate function not viable: no known conversion from 'const std::__1::vector<int, std::__1::allocator<int> >' to 'unsigned long long' for 1st argument
basic_ostream &operator<<(unsigned long long __n); 
               ^
/Volumes/datosx/sep/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/ostream:215:16: note: candidate function not viable: no known conversion from 'const std::__1::vector<int, std::__1::allocator<int> >' to 'float' for 1st argument
basic_ostream &operator<<(float __f); 
               ^
/Volumes/datosx/sep/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/ostream:216:16: note: candidate function not viable: no known conversion from 'const std::__1::vector<int, std::__1::allocator<int> >' to 'double' for 1st argument
basic_ostream &operator<<(double __f); 
               ^
/Volumes/datosx/sep/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/ostream:217:16: note: candidate function not viable: no known conversion from 'const std::__1::vector<int, std::__1::allocator<int> >' to 'long double' for 1st argument
basic_ostream &operator<<(long double __f); 
               ^
/Volumes/datosx/sep/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/ostream:219:16: note: candidate function not viable: no known conversion from 'const std::__1::vector<int, std::__1::allocator<int> >' to 'basic_streambuf<char, std::__1::char_traits<char> > *' for 1st argument
basic_ostream &operator<<(basic_streambuf< _CharT, _Traits>  * __sb); 
               ^
/Volumes/datosx/sep/cutlass/tools/external/googletest/googletest/include/gtest/gtest-message.h:55:6: note: candidate function not viable: no known conversion from 'std::__1::basic_stringstream<char, std::__1::char_traits<char>, std::__1::allocator<char> >' to 'const testing::internal::Secret' for 1st argument
void operator<<(const testing::internal::Secret &, int); 
     ^
/Volumes/datosx/sep/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/ostream:748:1: note: candidate template ignored: deduced conflicting types for parameter '_CharT' ('char' vs. 'std::__1::vector<int, std::__1::allocator<int> >')
operator<<(basic_ostream< _CharT, _Traits>  &__os, _CharT __c) 
^
/Volumes/datosx/sep/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/ostream:809:1: note: candidate template ignored: could not match 'const _CharT *' against 'std::__1::vector<int, std::__1::allocator<int> >'
operator<<(basic_ostream< _CharT, _Traits>  &__os, const _CharT *__str) 
^
/Volumes/datosx/sep/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/ostream:1044:1: note: candidate template ignored: could not match 'basic_string' against 'vector'
operator<<(basic_ostream< _CharT, _Traits>  &__os, const basic_string< _CharT, _Traits, _Allocator>  &
^
/Volumes/datosx/sep/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/ostream:1052:1: note: candidate template ignored: could not match 'basic_string_view' against 'vector'
operator<<(basic_ostream< _CharT, _Traits>  &__os, const basic_string_view< _CharT, _Traits>  
^
/Volumes/datosx/sep/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/ostream:1069:1: note: candidate template ignored: could not match 'shared_ptr' against 'vector'
operator<<(basic_ostream< _CharT, _Traits>  &__os, const shared_ptr< _Yp>  &__p) 
^
/Volumes/datosx/sep/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/ostream:1076:1: note: candidate template ignored: could not match 'bitset' against 'vector'
operator<<(basic_ostream< _CharT, _Traits>  &__os, const bitset< _Size>  &__x) 
^
/Volumes/datosx/sep/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/iomanip:362:1: note: candidate template ignored: could not match '__iom_t8' against 'vector'
operator<<(basic_ostream< _CharT, _Traits>  &__os, const __iom_t8< _MoneyT>  &__x) 
^
/Volumes/datosx/sep/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/iomanip:474:75: note: candidate template ignored: could not match '__iom_t10' against 'vector'
template< class _Cp, class _Traits> friend basic_ostream< _Cp, _Traits>  &operator<<(basic_ostream< _Cp, _Traits>  & __os, const __1::__iom_t10< _Cp>  & __x); 
                                                                          ^
/Volumes/datosx/sep/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/iomanip:572:1: note: candidate template ignored: could not match '__quoted_output_proxy' against 'vector'
operator<<(basic_ostream< _CharT, _Traits>  &
^
/Volumes/datosx/sep/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/iomanip:592:1: note: candidate template ignored: could not match '__quoted_proxy' against 'vector'
operator<<(basic_ostream< _CharT, _Traits>  &
^
21 warnings and 1 error generated.
CMake Error at cutlass_unit_test_generated_layout_verification.cu.o.Release.cmake:279 (message):
  Error generating file
  /Volumes/datosx/sep/cutlass/build/tools/test/unit/CMakeFiles/cutlass_unit_test.dir/core/./cutlass_unit_test_generated_layout_verification.cu.o


make[2]: *** [tools/test/unit/CMakeFiles/cutlass_unit_test.dir/core/cutlass_unit_test_generated_layout_verification.cu.o] Error 1
make[1]: *** [tools/test/unit/CMakeFiles/cutlass_unit_test.dir/all] Error 2
make: *** [all] Error 2
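
For what it's worth, the root cause here is that the failing line streams a std::vector<int> into a std::ostream, and neither libc++ nor CUTLASS provides such an operator<<, which is why every candidate overload above is rejected. A minimal local workaround, assuming the test only needs the vector's contents in the failure message (this overload is a sketch of mine, not part of CUTLASS or the standard library):

#include <cstddef>
#include <ostream>
#include <vector>

// Illustrative only: stream a vector's elements as "{a, b, c}".
// Must be declared before the code that tries to stream the vector.
std::ostream &operator<<(std::ostream &os, std::vector<int> const &v) {
  os << "{";
  for (std::size_t i = 0; i < v.size(); ++i) {
    os << (i ? ", " : "") << v[i];  // comma-separate the elements
  }
  return os << "}";
}

An alternative is simply to loop over the vector at the call site instead of defining an overload.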

Custom Epilogue Example

Are there any examples for v1.0+ showing how to implement bias plus ReLU activation using a custom epilogue? The corresponding section in the v0.1 blog post no longer seems relevant.
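
In case it is useful: later CUTLASS releases (2.x) ship a ready-made fused epilogue, cutlass::epilogue::thread::LinearCombinationRelu, and a dedicated example under examples/12_gemm_bias_relu. A rough sketch of wiring it into the device-level GEMM follows; the tile shapes, data types, and layouts below are placeholder choices of mine, not a recommendation:

#include "cutlass/gemm/device/gemm.h"
#include "cutlass/epilogue/thread/linear_combination_relu.h"

// Fused epilogue computing D = max(alpha * accumulator + beta * C, 0).
using EpilogueOp = cutlass::epilogue::thread::LinearCombinationRelu<
    float,   // element type of the output matrix
    1,       // elements per vectorized epilogue access
    float,   // accumulator type
    float>;  // compute type for alpha and beta

using GemmBiasRelu = cutlass::gemm::device::Gemm<
    float, cutlass::layout::ColumnMajor,    // A
    float, cutlass::layout::ColumnMajor,    // B
    float, cutlass::layout::ColumnMajor,    // C and D
    float,                                  // accumulator
    cutlass::arch::OpClassSimt,             // CUDA cores
    cutlass::arch::Sm50,
    cutlass::gemm::GemmShape<128, 128, 8>,  // threadblock tile
    cutlass::gemm::GemmShape<32, 64, 8>,    // warp tile
    cutlass::gemm::GemmShape<1, 1, 1>,      // SIMT instruction shape
    EpilogueOp>;

The bias term can then be fused by binding operand C to the bias vector with a leading dimension of 0, so a single column is broadcast across the output; as far as I can tell this is the approach taken in examples/12_gemm_bias_relu. In CUTLASS 1.x, the epilogue functor supplied through the GEMM traits is the analogous customization point, though I am not aware of a shipped bias + ReLU example there.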

cutlass 1.1 build fails on Jetson TX2

Environment:
CUDA: CUDA 9.0
nvcc: release 9.0, V9.0.252
gcc: 5.4.0 20160609
Operating system: Linux tegra-ubuntu 4.4.38-tegra #1 SMP PREEMPT Thu Mar 1 20:49:20 PST 2018 aarch64 aarch64 aarch64 GNU/Linux

Operations:
mkdir build
cd build
cmake -DCUTLASS_NVCC_ARCHS=62 ..
make -j4

The error messages are listed as follows:
make[2]: ***
[ 1%] Building NVCC (Device) object examples/01_tensor_view/CMakeFiles/01_tensor_view.dir/01_tensor_view_generated_tensor_view.cu.o
[ 4%] Built target gtest
[ 5%] Building NVCC (Device) object examples/00_basic_gemm/CMakeFiles/00_basic_gemm.dir/00_basic_gemm_generated_basic_gemm.cu.o
[ 6%] Building NVCC (Device) object tools/test/perf/CMakeFiles/cutlass_perf_test.dir/gemm/cutlass_perf_test_generated_wmma_integer_gemm.cu.o
[ 8%] Building NVCC (Device) object examples/02_cutlass_utilities/CMakeFiles/02_cutlass_utilities.dir/02_cutlass_utilities_generated_cutlass_utilities.cu.o
[ 9%] Linking CXX executable 01_tensor_view
[ 9%] Built target 01_tensor_view
[ 10%] Building NVCC (Device) object examples/03_strided_batched_gemm/CMakeFiles/03_strided_batched_gemm.dir/03_strided_batched_gemm_generated_strided_batched_gemm.cu.o
[ 13%] Built target 00_basic_gemm
[ 14%] Building NVCC (Device) object examples/04_tile_iterator/CMakeFiles/04_tile_iterator.dir/04_tile_iterator_generated_tile_iterator.cu.o
/home/nvidia/dongxiao/cutlass-1.1.0/examples/02_cutlass_utilities/cutlass_utilities.cu(150): error: calling a device function("__half") from a host function("Cutlass_FP16_SgemmNN") is not allowed

/home/nvidia/dongxiao/cutlass-1.1.0/examples/02_cutlass_utilities/cutlass_utilities.cu(155): error: calling a device function("__half") from a host function("Cutlass_FP16_SgemmNN") is not allowed

2 errors detected in the compilation of "/tmp/tmpxft_00004c2c_00000000-4_cutlass_utilities.cpp4.ii".
CMake Error at 02_cutlass_utilities_generated_cutlass_utilities.cu.o.Release.cmake:279 (message):
Error generating file
/home/nvidia/dongxiao/cutlass-1.1.0/build/examples/02_cutlass_utilities/CMakeFiles/02_cutlass_utilities.dir//./02_cutlass_utilities_generated_cutlass_utilities.cu.o

examples/02_cutlass_utilities/CMakeFiles/02_cutlass_utilities.dir/build.make:368: recipe for target 'examples/02_cutlass_utilities/CMakeFiles/02_cutlass_utilities.dir/02_cutlass_utilities_generated_cutlass_utilities.cu.o' failed

make[2]: *** [examples/02_cutlass_utilities/CMakeFiles/02_cutlass_utilities.dir/02_cutlass_utilities_generated_cutlass_utilities.cu.o] Error 1
CMakeFiles/Makefile2:527: recipe for target 'examples/02_cutlass_utilities/CMakeFiles/02_cutlass_utilities.dir/all' failed
make[1]: *** [examples/02_cutlass_utilities/CMakeFiles/02_cutlass_utilities.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
[ 16%] Building NVCC (Device) object tools/test/perf/CMakeFiles/cutlass_perf_test.dir/gemm/cutlass_perf_test_generated_sgemm.cu.o
[ 17%] Linking CXX executable 03_strided_batched_gemm
[ 17%] Built target 03_strided_batched_gemm
[ 18%] Building NVCC (Device) object tools/test/perf/CMakeFiles/cutlass_perf_test.dir/gemm/cutlass_perf_test_generated_dgemm.cu.o
[ 20%] Linking CXX executable 04_tile_iterator
[ 20%] Built target 04_tile_iterator
[ 21%] Building NVCC (Device) object tools/test/perf/CMakeFiles/cutlass_perf_test.dir/gemm/cutlass_perf_test_generated_hgemm.cu.o
[ 22%] Building NVCC (Device) object tools/test/perf/CMakeFiles/cutlass_perf_test.dir/gemm/cutlass_perf_test_generated_igemm.cu.o
[ 24%] Building NVCC (Device) object tools/test/perf/CMakeFiles/cutlass_perf_test.dir/gemm/cutlass_perf_test_generated_wmma_gemm.cu.o
[ 25%] Building NVCC (Device) object tools/test/perf/CMakeFiles/cutlass_perf_test.dir/gemm/cutlass_perf_test_generated_wmma_binary_gemm.cu.o
^Ctools/test/perf/CMakeFiles/cutlass_perf_test.dir/build.make:1968: recipe for target 'tools/test/perf/CMakeFiles/cutlass_perf_test.dir/gemm/cutlass_perf_test_generated_igemm.cu.o' failed
make[2]: *** [tools/test/perf/CMakeFiles/cutlass_perf_test.dir/gemm/cutlass_perf_test_generated_igemm.cu.o] Interrupt
CMakeFiles/Makefile2:311: recipe for target 'tools/test/perf/CMakeFiles/cutlass_perf_test.dir/all' failed
make[1]: *** [tools/test/perf/CMakeFiles/cutlass_perf_test.dir/all] Interrupt
Makefile:83: recipe for target 'all' failed
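
A note on the two errors above: they come from converting float to __half in host code, and the cuda_fp16.h shipped with this CUDA 9.0 toolchain declares that conversion constructor __device__-only, exactly as the diagnostic says. One possible workaround, sketched below and untested on a TX2 (host_float_to_half is an invented name, and the conversion is deliberately simplified: it truncates the mantissa and flushes subnormals to zero), is to build the bit pattern in software and go through __half_raw, assuming its converting constructor is host-visible in that header as it is in later CUDA releases:

#include <cuda_fp16.h>
#include <cstdint>
#include <cstring>

// Host-side float -> IEEE binary16 conversion that avoids the
// __device__-only __half(float) constructor in older CUDA headers.
inline __half host_float_to_half(float x) {
  std::uint32_t f;
  std::memcpy(&f, &x, sizeof(f));  // bit pattern of x
  std::uint16_t sign = static_cast<std::uint16_t>((f >> 16) & 0x8000u);
  std::int32_t exp = static_cast<std::int32_t>((f >> 23) & 0xFF) - 127 + 15;
  std::uint32_t mant = f & 0x007FFFFFu;
  std::uint16_t bits;
  if (exp <= 0) {
    bits = sign;                                      // flush to signed zero
  } else if (exp >= 31) {
    bits = static_cast<std::uint16_t>(sign | 0x7C00u); // overflow/NaN -> inf
  } else {
    bits = static_cast<std::uint16_t>(sign | (exp << 10) | (mant >> 13));
  }
  __half_raw raw;
  raw.x = bits;
  return __half(raw);  // the __half_raw constructor is host-callable
}

Alternatively, moving to a newer JetPack whose cuda_fp16.h declares the float conversions __host__ __device__ should let the example build unmodified.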

test failed on jetson tx2

Hi,
I ran CUTLASS on a Jetson TX2 (JetPack 3.2), but some tests failed. Here is the output:

nvidia@tegra-ubuntu:~/Documents/cutlass/build$ ./tools/test/unit/cutlass_unit_test
Note: Google Test filter = -mma
[==========] Running 684 tests from 33 test cases.
[----------] Global test environment set-up.
[----------] 1 test from HostTensor
[ RUN ] HostTensor.gemm
[ OK ] HostTensor.gemm (1 ms)
[----------] 1 test from HostTensor (1 ms total)

[----------] 2 tests from Layout
[ RUN ] Layout.igemm
[ OK ] Layout.igemm (0 ms)
[ RUN ] Layout.sgemm_accum
[ OK ] Layout.sgemm_accum (0 ms)
[----------] 2 tests from Layout (0 ms total)

[----------] 1 test from PredicateVector
[ RUN ] PredicateVector.Basic
[ OK ] PredicateVector.Basic (94 ms)
[----------] 1 test from PredicateVector (94 ms total)

[----------] 2 tests from TileIterator
[ RUN ] TileIterator.tile_128x8_contiguous
[ OK ] TileIterator.tile_128x8_contiguous (1 ms)
[ RUN ] TileIterator.tile_128x8_rake
[ OK ] TileIterator.tile_128x8_rake (1 ms)
[----------] 2 tests from TileIterator (3 ms total)

[----------] 8 tests from Dgemm_64x32x8
[ RUN ] Dgemm_64x32x8.dgemm_64x32x8_nt
[ OK ] Dgemm_64x32x8.dgemm_64x32x8_nt (497 ms)
[ RUN ] Dgemm_64x32x8.dgemm_256x128x64_nt
[ OK ] Dgemm_64x32x8.dgemm_256x128x64_nt (29 ms)
[ RUN ] Dgemm_64x32x8.dgemm_64x32x8_nn
[ OK ] Dgemm_64x32x8.dgemm_64x32x8_nn (5 ms)
[ RUN ] Dgemm_64x32x8.dgemm_256x128x64_nn
[ OK ] Dgemm_64x32x8.dgemm_256x128x64_nn (22 ms)
[ RUN ] Dgemm_64x32x8.dgemm_64x32x8_tn
[ OK ] Dgemm_64x32x8.dgemm_64x32x8_tn (4 ms)
[ RUN ] Dgemm_64x32x8.dgemm_256x128x64_tn
[ OK ] Dgemm_64x32x8.dgemm_256x128x64_tn (21 ms)
[ RUN ] Dgemm_64x32x8.dgemm_64x32x8_tt
[ OK ] Dgemm_64x32x8.dgemm_64x32x8_tt (3 ms)
[ RUN ] Dgemm_64x32x8.dgemm_256x128x64_tt
[ OK ] Dgemm_64x32x8.dgemm_256x128x64_tt (20 ms)
[----------] 8 tests from Dgemm_64x32x8 (601 ms total)

[----------] 8 tests from Dgemm_64x64x8
[ RUN ] Dgemm_64x64x8.dgemm_64x64x8_nt
[ OK ] Dgemm_64x64x8.dgemm_64x64x8_nt (3 ms)
[ RUN ] Dgemm_64x64x8.dgemm_256x128x64_nt
[ OK ] Dgemm_64x64x8.dgemm_256x128x64_nt (21 ms)
[ RUN ] Dgemm_64x64x8.dgemm_64x64x8_nn
[ OK ] Dgemm_64x64x8.dgemm_64x64x8_nn (4 ms)
[ RUN ] Dgemm_64x64x8.dgemm_256x128x64_nn
[ OK ] Dgemm_64x64x8.dgemm_256x128x64_nn (20 ms)
[ RUN ] Dgemm_64x64x8.dgemm_64x64x8_tn
[ OK ] Dgemm_64x64x8.dgemm_64x64x8_tn (4 ms)
[ RUN ] Dgemm_64x64x8.dgemm_256x128x64_tn
[ OK ] Dgemm_64x64x8.dgemm_256x128x64_tn (20 ms)
[ RUN ] Dgemm_64x64x8.dgemm_64x64x8_tt
[ OK ] Dgemm_64x64x8.dgemm_64x64x8_tt (3 ms)
[ RUN ] Dgemm_64x64x8.dgemm_256x128x64_tt
[ OK ] Dgemm_64x64x8.dgemm_256x128x64_tt (20 ms)
[----------] 8 tests from Dgemm_64x64x8 (95 ms total)

[----------] 8 tests from Dgemm_128x32x8
[ RUN ] Dgemm_128x32x8.dgemm_128x32x8_nt
[ OK ] Dgemm_128x32x8.dgemm_128x32x8_nt (4 ms)
[ RUN ] Dgemm_128x32x8.dgemm_256x64x64_nt
[ OK ] Dgemm_128x32x8.dgemm_256x64x64_nt (12 ms)
[ RUN ] Dgemm_128x32x8.dgemm_128x32x8_nn
[ OK ] Dgemm_128x32x8.dgemm_128x32x8_nn (4 ms)
[ RUN ] Dgemm_128x32x8.dgemm_256x64x64_nn
[ OK ] Dgemm_128x32x8.dgemm_256x64x64_nn (12 ms)
[ RUN ] Dgemm_128x32x8.dgemm_128x32x8_tn
[ OK ] Dgemm_128x32x8.dgemm_128x32x8_tn (3 ms)
[ RUN ] Dgemm_128x32x8.dgemm_256x64x64_tn
[ OK ] Dgemm_128x32x8.dgemm_256x64x64_tn (12 ms)
[ RUN ] Dgemm_128x32x8.dgemm_128x32x8_tt
[ OK ] Dgemm_128x32x8.dgemm_128x32x8_tt (3 ms)
[ RUN ] Dgemm_128x32x8.dgemm_256x64x64_tt
[ OK ] Dgemm_128x32x8.dgemm_256x64x64_tt (11 ms)
[----------] 8 tests from Dgemm_128x32x8 (62 ms total)

[----------] 8 tests from Dgemm_128x128x8
[ RUN ] Dgemm_128x128x8.dgemm_128x128x8_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Dgemm_128x128x8.dgemm_128x128x8_nt (162 ms)
[ RUN ] Dgemm_128x128x8.dgemm_512x256x64_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Dgemm_128x128x8.dgemm_512x256x64_nt (1094 ms)
[ RUN ] Dgemm_128x128x8.dgemm_128x128x8_nn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Dgemm_128x128x8.dgemm_128x128x8_nn (118 ms)
[ RUN ] Dgemm_128x128x8.dgemm_512x256x64_nn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Dgemm_128x128x8.dgemm_512x256x64_nn (1062 ms)
[ RUN ] Dgemm_128x128x8.dgemm_128x128x8_tn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Dgemm_128x128x8.dgemm_128x128x8_tn (115 ms)
[ RUN ] Dgemm_128x128x8.dgemm_512x256x64_tn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Dgemm_128x128x8.dgemm_512x256x64_tn (993 ms)
[ RUN ] Dgemm_128x128x8.dgemm_128x128x8_tt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Dgemm_128x128x8.dgemm_128x128x8_tt (109 ms)
[ RUN ] Dgemm_128x128x8.dgemm_512x256x64_tt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Dgemm_128x128x8.dgemm_512x256x64_tt (956 ms)
[----------] 8 tests from Dgemm_128x128x8 (4610 ms total)

[----------] 8 tests from Dgemm_64x32x16
[ RUN ] Dgemm_64x32x16.dgemm_64x32x16_nt
[ OK ] Dgemm_64x32x16.dgemm_64x32x16_nt (3 ms)
[ RUN ] Dgemm_64x32x16.dgemm_256x128x64_nt
[ OK ] Dgemm_64x32x16.dgemm_256x128x64_nt (20 ms)
[ RUN ] Dgemm_64x32x16.dgemm_64x32x16_nn
[ OK ] Dgemm_64x32x16.dgemm_64x32x16_nn (3 ms)
[ RUN ] Dgemm_64x32x16.dgemm_256x128x64_nn
[ OK ] Dgemm_64x32x16.dgemm_256x128x64_nn (20 ms)
[ RUN ] Dgemm_64x32x16.dgemm_64x32x16_tn
[ OK ] Dgemm_64x32x16.dgemm_64x32x16_tn (3 ms)
[ RUN ] Dgemm_64x32x16.dgemm_256x128x64_tn
[ OK ] Dgemm_64x32x16.dgemm_256x128x64_tn (20 ms)
[ RUN ] Dgemm_64x32x16.dgemm_64x32x16_tt
[ OK ] Dgemm_64x32x16.dgemm_64x32x16_tt (4 ms)
[ RUN ] Dgemm_64x32x16.dgemm_256x128x64_tt
[ OK ] Dgemm_64x32x16.dgemm_256x128x64_tt (20 ms)
[----------] 8 tests from Dgemm_64x32x16 (94 ms total)

[----------] 8 tests from Dgemm_64x64x16
[ RUN ] Dgemm_64x64x16.dgemm_64x64x16_nt
[ OK ] Dgemm_64x64x16.dgemm_64x64x16_nt (3 ms)
[ RUN ] Dgemm_64x64x16.dgemm_256x128x64_nt
[ OK ] Dgemm_64x64x16.dgemm_256x128x64_nt (20 ms)
[ RUN ] Dgemm_64x64x16.dgemm_64x64x16_nn
[ OK ] Dgemm_64x64x16.dgemm_64x64x16_nn (3 ms)
[ RUN ] Dgemm_64x64x16.dgemm_256x128x64_nn
[ OK ] Dgemm_64x64x16.dgemm_256x128x64_nn (20 ms)
[ RUN ] Dgemm_64x64x16.dgemm_64x64x16_tn
[ OK ] Dgemm_64x64x16.dgemm_64x64x16_tn (4 ms)
[ RUN ] Dgemm_64x64x16.dgemm_256x128x64_tn
[ OK ] Dgemm_64x64x16.dgemm_256x128x64_tn (19 ms)
[ RUN ] Dgemm_64x64x16.dgemm_64x64x16_tt
[ OK ] Dgemm_64x64x16.dgemm_64x64x16_tt (4 ms)
[ RUN ] Dgemm_64x64x16.dgemm_256x128x64_tt
[ OK ] Dgemm_64x64x16.dgemm_256x128x64_tt (20 ms)
[----------] 8 tests from Dgemm_64x64x16 (94 ms total)

[----------] 8 tests from Dgemm_128x32x16
[ RUN ] Dgemm_128x32x16.dgemm_128x32x8_nt
[ OK ] Dgemm_128x32x16.dgemm_128x32x8_nt (4 ms)
[ RUN ] Dgemm_128x32x16.dgemm_256x64x64_nt
[ OK ] Dgemm_128x32x16.dgemm_256x64x64_nt (12 ms)
[ RUN ] Dgemm_128x32x16.dgemm_128x32x16_nn
[ OK ] Dgemm_128x32x16.dgemm_128x32x16_nn (4 ms)
[ RUN ] Dgemm_128x32x16.dgemm_256x64x64_nn
[ OK ] Dgemm_128x32x16.dgemm_256x64x64_nn (12 ms)
[ RUN ] Dgemm_128x32x16.dgemm_128x32x8_tn
[ OK ] Dgemm_128x32x16.dgemm_128x32x8_tn (4 ms)
[ RUN ] Dgemm_128x32x16.dgemm_256x64x64_tn
[ OK ] Dgemm_128x32x16.dgemm_256x64x64_tn (11 ms)
[ RUN ] Dgemm_128x32x16.dgemm_128x32x8_tt
[ OK ] Dgemm_128x32x16.dgemm_128x32x8_tt (4 ms)
[ RUN ] Dgemm_128x32x16.dgemm_256x64x64_tt
[ OK ] Dgemm_128x32x16.dgemm_256x64x64_tt (12 ms)
[----------] 8 tests from Dgemm_128x32x16 (63 ms total)

[----------] 37 tests from Hgemm_128x128x8
[ RUN ] Hgemm_128x128x8.hgemm_128x128x1_nt
[ OK ] Hgemm_128x128x8.hgemm_128x128x1_nt (12 ms)
[ RUN ] Hgemm_128x128x8.hgemm_128x128x8_nt
[ OK ] Hgemm_128x128x8.hgemm_128x128x8_nt (16 ms)
[ RUN ] Hgemm_128x128x8.hgemm_128x128x9_nt
[ OK ] Hgemm_128x128x8.hgemm_128x128x9_nt (16 ms)
[ RUN ] Hgemm_128x128x8.hgemm_128x128x16_nt
[ OK ] Hgemm_128x128x8.hgemm_128x128x16_nt (24 ms)
[ RUN ] Hgemm_128x128x8.hgemm_128x128x64_nt
[ OK ] Hgemm_128x128x8.hgemm_128x128x64_nt (65 ms)
[ RUN ] Hgemm_128x128x8.hgemm_256x128x16_nt
[ OK ] Hgemm_128x128x8.hgemm_256x128x16_nt (42 ms)
[ RUN ] Hgemm_128x128x8.hgemm_128x256x16_nt
[ OK ] Hgemm_128x128x8.hgemm_128x256x16_nt (40 ms)
[ RUN ] Hgemm_128x128x8.hgemm_256x256x16_nt
[ OK ] Hgemm_128x128x8.hgemm_256x256x16_nt (74 ms)
[ RUN ] Hgemm_128x128x8.hgemm_128x128x2_nn
[ OK ] Hgemm_128x128x8.hgemm_128x128x2_nn (8 ms)
[ RUN ] Hgemm_128x128x8.hgemm_128x128x8_nn
[ OK ] Hgemm_128x128x8.hgemm_128x128x8_nn (13 ms)
[ RUN ] Hgemm_128x128x8.hgemm_128x128x10_nn
[ OK ] Hgemm_128x128x8.hgemm_128x128x10_nn (14 ms)
[ RUN ] Hgemm_128x128x8.hgemm_128x128x16_nn
[ OK ] Hgemm_128x128x8.hgemm_128x128x16_nn (19 ms)
[ RUN ] Hgemm_128x128x8.hgemm_128x128x64_nn
[ OK ] Hgemm_128x128x8.hgemm_128x128x64_nn (50 ms)
[ RUN ] Hgemm_128x128x8.hgemm_256x128x16_nn
[ OK ] Hgemm_128x128x8.hgemm_256x128x16_nn (32 ms)
[ RUN ] Hgemm_128x128x8.hgemm_128x256x16_nn
[ OK ] Hgemm_128x128x8.hgemm_128x256x16_nn (32 ms)
[ RUN ] Hgemm_128x128x8.hgemm_256x256x16_nn
[ OK ] Hgemm_128x128x8.hgemm_256x256x16_nn (61 ms)
[ RUN ] Hgemm_128x128x8.hgemm_128x128x8_tn
[ OK ] Hgemm_128x128x8.hgemm_128x128x8_tn (12 ms)
[ RUN ] Hgemm_128x128x8.hgemm_128x128x10_tn
[ OK ] Hgemm_128x128x8.hgemm_128x128x10_tn (14 ms)
[ RUN ] Hgemm_128x128x8.hgemm_128x128x16_tn
[ OK ] Hgemm_128x128x8.hgemm_128x128x16_tn (17 ms)
[ RUN ] Hgemm_128x128x8.hgemm_128x128x64_tn
[ OK ] Hgemm_128x128x8.hgemm_128x128x64_tn (48 ms)
[ RUN ] Hgemm_128x128x8.hgemm_256x128x16_tn
[ OK ] Hgemm_128x128x8.hgemm_256x128x16_tn (32 ms)
[ RUN ] Hgemm_128x128x8.hgemm_128x256x16_tn
[ OK ] Hgemm_128x128x8.hgemm_128x256x16_tn (31 ms)
[ RUN ] Hgemm_128x128x8.hgemm_256x256x16_tn
[ OK ] Hgemm_128x128x8.hgemm_256x256x16_tn (61 ms)
[ RUN ] Hgemm_128x128x8.hgemm_128x128x8_tt
[ OK ] Hgemm_128x128x8.hgemm_128x128x8_tt (12 ms)
[ RUN ] Hgemm_128x128x8.hgemm_128x128x10_tt
[ OK ] Hgemm_128x128x8.hgemm_128x128x10_tt (12 ms)
[ RUN ] Hgemm_128x128x8.hgemm_128x128x16_tt
[ OK ] Hgemm_128x128x8.hgemm_128x128x16_tt (16 ms)
[ RUN ] Hgemm_128x128x8.hgemm_128x128x64_tt
[ OK ] Hgemm_128x128x8.hgemm_128x128x64_tt (47 ms)
[ RUN ] Hgemm_128x128x8.hgemm_256x128x16_tt
[ OK ] Hgemm_128x128x8.hgemm_256x128x16_tt (31 ms)
[ RUN ] Hgemm_128x128x8.hgemm_128x256x16_tt
[ OK ] Hgemm_128x128x8.hgemm_128x256x16_tt (30 ms)
[ RUN ] Hgemm_128x128x8.hgemm_256x256x16_tt
[ OK ] Hgemm_128x128x8.hgemm_256x256x16_tt (57 ms)
[ RUN ] Hgemm_128x128x8.hgemm_128x128x16_alpha2_nt
[ OK ] Hgemm_128x128x8.hgemm_128x128x16_alpha2_nt (15 ms)
[ RUN ] Hgemm_128x128x8.hgemm_128x128x16_beta1_nt
[ OK ] Hgemm_128x128x8.hgemm_128x128x16_beta1_nt (15 ms)
[ RUN ] Hgemm_128x128x8.hgemm_128x128x16_alpha2_beta1_nt
[ OK ] Hgemm_128x128x8.hgemm_128x128x16_alpha2_beta1_nt (15 ms)
[ RUN ] Hgemm_128x128x8.hgemm_120x112x64_ldg8_nt
[ OK ] Hgemm_128x128x8.hgemm_120x112x64_ldg8_nt (38 ms)
[ RUN ] Hgemm_128x128x8.hgemm_508x252x120_ragged_nt
[ OK ] Hgemm_128x128x8.hgemm_508x252x120_ragged_nt (565 ms)
[ RUN ] Hgemm_128x128x8.hgemm_124x126x32_ragged_nt
[ OK ] Hgemm_128x128x8.hgemm_124x126x32_ragged_nt (23 ms)
[ RUN ] Hgemm_128x128x8.hgemm_124x126x32_ragged_alpha2_beta1_nt
[ OK ] Hgemm_128x128x8.hgemm_124x126x32_ragged_alpha2_beta1_nt (24 ms)
[----------] 37 tests from Hgemm_128x128x8 (1637 ms total)

[----------] 33 tests from Hgemm_128x128x16
[ RUN ] Hgemm_128x128x16.hgemm_2x2x2_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Hgemm_128x128x16.hgemm_2x2x2_nt (2 ms)
[ RUN ] Hgemm_128x128x16.hgemm_128x128x8_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Hgemm_128x128x16.hgemm_128x128x8_nt (110 ms)
[ RUN ] Hgemm_128x128x16.hgemm_128x128x16_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Hgemm_128x128x16.hgemm_128x128x16_nt (114 ms)
[ RUN ] Hgemm_128x128x16.hgemm_128x128x17_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Hgemm_128x128x16.hgemm_128x128x17_nt (115 ms)
[ RUN ] Hgemm_128x128x16.hgemm_128x128x64_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Hgemm_128x128x16.hgemm_128x128x64_nt (151 ms)
[ RUN ] Hgemm_128x128x16.hgemm_256x128x16_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Hgemm_128x128x16.hgemm_256x128x16_nt (222 ms)
[ RUN ] Hgemm_128x128x16.hgemm_128x256x16_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Hgemm_128x128x16.hgemm_128x256x16_nt (225 ms)
[ RUN ] Hgemm_128x128x16.hgemm_256x256x16_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Hgemm_128x128x16.hgemm_256x256x16_nt (446 ms)
[ RUN ] Hgemm_128x128x16.hgemm_128x128x16_nn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Hgemm_128x128x16.hgemm_128x128x16_nn (114 ms)
[ RUN ] Hgemm_128x128x16.hgemm_128x128x18_nn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Hgemm_128x128x16.hgemm_128x128x18_nn (116 ms)
[ RUN ] Hgemm_128x128x16.hgemm_128x128x64_nn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Hgemm_128x128x16.hgemm_128x128x64_nn (151 ms)
[ RUN ] Hgemm_128x128x16.hgemm_256x128x16_nn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Hgemm_128x128x16.hgemm_256x128x16_nn (227 ms)
[ RUN ] Hgemm_128x128x16.hgemm_128x256x16_nn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Hgemm_128x128x16.hgemm_128x256x16_nn (224 ms)
[ RUN ] Hgemm_128x128x16.hgemm_256x256x16_nn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Hgemm_128x128x16.hgemm_256x256x16_nn (446 ms)
[ RUN ] Hgemm_128x128x16.hgemm_128x128x16_tn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Hgemm_128x128x16.hgemm_128x128x16_tn (114 ms)
[ RUN ] Hgemm_128x128x16.hgemm_128x128x18_tn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Hgemm_128x128x16.hgemm_128x128x18_tn (116 ms)
[ RUN ] Hgemm_128x128x16.hgemm_128x128x64_tn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Hgemm_128x128x16.hgemm_128x128x64_tn (150 ms)
[ RUN ] Hgemm_128x128x16.hgemm_256x128x16_tn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Hgemm_128x128x16.hgemm_256x128x16_tn (224 ms)
[ RUN ] Hgemm_128x128x16.hgemm_128x256x16_tn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Hgemm_128x128x16.hgemm_128x256x16_tn (224 ms)
[ RUN ] Hgemm_128x128x16.hgemm_256x256x16_tn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Hgemm_128x128x16.hgemm_256x256x16_tn (448 ms)
[ RUN ] Hgemm_128x128x16.hgemm_128x128x16_tt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Hgemm_128x128x16.hgemm_128x128x16_tt (112 ms)
[ RUN ] Hgemm_128x128x16.hgemm_128x128x18_tt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Hgemm_128x128x16.hgemm_128x128x18_tt (113 ms)
[ RUN ] Hgemm_128x128x16.hgemm_128x128x64_tt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Hgemm_128x128x16.hgemm_128x128x64_tt (149 ms)
[ RUN ] Hgemm_128x128x16.hgemm_256x128x16_tt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Hgemm_128x128x16.hgemm_256x128x16_tt (221 ms)
[ RUN ] Hgemm_128x128x16.hgemm_128x256x16_tt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Hgemm_128x128x16.hgemm_128x256x16_tt (222 ms)
[ RUN ] Hgemm_128x128x16.hgemm_256x256x16_tt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Hgemm_128x128x16.hgemm_256x256x16_tt (436 ms)
[ RUN ] Hgemm_128x128x16.hgemm_128x128x16_alpha2_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Hgemm_128x128x16.hgemm_128x128x16_alpha2_nt (116 ms)
[ RUN ] Hgemm_128x128x16.hgemm_128x128x16_beta1_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Hgemm_128x128x16.hgemm_128x128x16_beta1_nt (115 ms)
[ RUN ] Hgemm_128x128x16.hgemm_128x128x16_alpha2_beta1_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Hgemm_128x128x16.hgemm_128x128x16_alpha2_beta1_nt (123 ms)
[ RUN ] Hgemm_128x128x16.hgemm_120x112x64_ldg8_nt
[ OK ] Hgemm_128x128x16.hgemm_120x112x64_ldg8_nt (36 ms)
[ RUN ] Hgemm_128x128x16.hgemm_508x252x120_ragged_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Hgemm_128x128x16.hgemm_508x252x120_ragged_nt (1362 ms)
[ RUN ] Hgemm_128x128x16.hgemm_124x126x32_ragged_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Hgemm_128x128x16.hgemm_124x126x32_ragged_nt (119 ms)
[ RUN ] Hgemm_128x128x16.hgemm_124x126x32_ragged_alpha2_beta1_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Hgemm_128x128x16.hgemm_124x126x32_ragged_alpha2_beta1_nt (122 ms)
[----------] 33 tests from Hgemm_128x128x16 (7187 ms total)

[----------] 30 tests from Hgemm_128x32x8
[ RUN ] Hgemm_128x32x8.hgemm_128x32x1_nt
[ OK ] Hgemm_128x32x8.hgemm_128x32x1_nt (3 ms)
[ RUN ] Hgemm_128x32x8.hgemm_128x32x8_nt
[ OK ] Hgemm_128x32x8.hgemm_128x32x8_nt (4 ms)
[ RUN ] Hgemm_128x32x8.hgemm_128x32x9_nt
[ OK ] Hgemm_128x32x8.hgemm_128x32x9_nt (5 ms)
[ RUN ] Hgemm_128x32x8.hgemm_128x32x16_nt
[ OK ] Hgemm_128x32x8.hgemm_128x32x16_nt (6 ms)
[ RUN ] Hgemm_128x32x8.hgemm_128x32x32_nt
[ OK ] Hgemm_128x32x8.hgemm_128x32x32_nt (8 ms)
[ RUN ] Hgemm_128x32x8.hgemm_256x32x16_nt
[ OK ] Hgemm_128x32x8.hgemm_256x32x16_nt (9 ms)
[ RUN ] Hgemm_128x32x8.hgemm_128x64x16_nt
[ OK ] Hgemm_128x32x8.hgemm_128x64x16_nt (9 ms)
[ RUN ] Hgemm_128x32x8.hgemm_256x64x16_nt
[ OK ] Hgemm_128x32x8.hgemm_256x64x16_nt (16 ms)
[ RUN ] Hgemm_128x32x8.hgemm_128x32x2_nn
[ OK ] Hgemm_128x32x8.hgemm_128x32x2_nn (3 ms)
[ RUN ] Hgemm_128x32x8.hgemm_128x32x8_nn
[ OK ] Hgemm_128x32x8.hgemm_128x32x8_nn (5 ms)
[ RUN ] Hgemm_128x32x8.hgemm_128x32x10_nn
[ OK ] Hgemm_128x32x8.hgemm_128x32x10_nn (5 ms)
[ RUN ] Hgemm_128x32x8.hgemm_128x32x16_nn
[ OK ] Hgemm_128x32x8.hgemm_128x32x16_nn (6 ms)
[ RUN ] Hgemm_128x32x8.hgemm_128x32x32_nn
[ OK ] Hgemm_128x32x8.hgemm_128x32x32_nn (8 ms)
[ RUN ] Hgemm_128x32x8.hgemm_256x32x16_nn
[ OK ] Hgemm_128x32x8.hgemm_256x32x16_nn (9 ms)
[ RUN ] Hgemm_128x32x8.hgemm_128x64x16_nn
[ OK ] Hgemm_128x32x8.hgemm_128x64x16_nn (9 ms)
[ RUN ] Hgemm_128x32x8.hgemm_256x64x16_nn
[ OK ] Hgemm_128x32x8.hgemm_256x64x16_nn (16 ms)
[ RUN ] Hgemm_128x32x8.hgemm_128x32x8_tn
[ OK ] Hgemm_128x32x8.hgemm_128x32x8_tn (4 ms)
[ RUN ] Hgemm_128x32x8.hgemm_128x32x10_tn
[ OK ] Hgemm_128x32x8.hgemm_128x32x10_tn (5 ms)
[ RUN ] Hgemm_128x32x8.hgemm_128x32x16_tn
[ OK ] Hgemm_128x32x8.hgemm_128x32x16_tn (6 ms)
[ RUN ] Hgemm_128x32x8.hgemm_128x32x32_tn
[ OK ] Hgemm_128x32x8.hgemm_128x32x32_tn (8 ms)
[ RUN ] Hgemm_128x32x8.hgemm_256x32x16_tn
[ OK ] Hgemm_128x32x8.hgemm_256x32x16_tn (9 ms)
[ RUN ] Hgemm_128x32x8.hgemm_128x64x16_tn
[ OK ] Hgemm_128x32x8.hgemm_128x64x16_tn (9 ms)
[ RUN ] Hgemm_128x32x8.hgemm_256x64x16_tn
[ OK ] Hgemm_128x32x8.hgemm_256x64x16_tn (16 ms)
[ RUN ] Hgemm_128x32x8.hgemm_128x32x8_tt
[ OK ] Hgemm_128x32x8.hgemm_128x32x8_tt (5 ms)
[ RUN ] Hgemm_128x32x8.hgemm_128x32x10_tt
[ OK ] Hgemm_128x32x8.hgemm_128x32x10_tt (5 ms)
[ RUN ] Hgemm_128x32x8.hgemm_128x32x16_tt
[ OK ] Hgemm_128x32x8.hgemm_128x32x16_tt (6 ms)
[ RUN ] Hgemm_128x32x8.hgemm_128x32x32_tt
[ OK ] Hgemm_128x32x8.hgemm_128x32x32_tt (8 ms)
[ RUN ] Hgemm_128x32x8.hgemm_256x32x16_tt
[ OK ] Hgemm_128x32x8.hgemm_256x32x16_tt (9 ms)
[ RUN ] Hgemm_128x32x8.hgemm_128x64x16_tt
[ OK ] Hgemm_128x32x8.hgemm_128x64x16_tt (9 ms)
[ RUN ] Hgemm_128x32x8.hgemm_256x64x16_tt
[ OK ] Hgemm_128x32x8.hgemm_256x64x16_tt (16 ms)
[----------] 30 tests from Hgemm_128x32x8 (236 ms total)

[----------] 30 tests from Hgemm_128x64x8
[ RUN ] Hgemm_128x64x8.hgemm_128x64x1_nt
[ OK ] Hgemm_128x64x8.hgemm_128x64x1_nt (4 ms)
[ RUN ] Hgemm_128x64x8.hgemm_128x64x8_nt
[ OK ] Hgemm_128x64x8.hgemm_128x64x8_nt (7 ms)
[ RUN ] Hgemm_128x64x8.hgemm_128x64x9_nt
[ OK ] Hgemm_128x64x8.hgemm_128x64x9_nt (7 ms)
[ RUN ] Hgemm_128x64x8.hgemm_128x64x16_nt
[ OK ] Hgemm_128x64x8.hgemm_128x64x16_nt (9 ms)
[ RUN ] Hgemm_128x64x8.hgemm_128x64x64_nt
[ OK ] Hgemm_128x64x8.hgemm_128x64x64_nt (23 ms)
[ RUN ] Hgemm_128x64x8.hgemm_256x64x16_nt
[ OK ] Hgemm_128x64x8.hgemm_256x64x16_nt (16 ms)
[ RUN ] Hgemm_128x64x8.hgemm_128x128x16_nt
[ OK ] Hgemm_128x64x8.hgemm_128x128x16_nt (15 ms)
[ RUN ] Hgemm_128x64x8.hgemm_256x128x16_nt
[ OK ] Hgemm_128x64x8.hgemm_256x128x16_nt (29 ms)
[ RUN ] Hgemm_128x64x8.hgemm_128x64x2_nn
[ OK ] Hgemm_128x64x8.hgemm_128x64x2_nn (5 ms)
[ RUN ] Hgemm_128x64x8.hgemm_128x64x8_nn
[ OK ] Hgemm_128x64x8.hgemm_128x64x8_nn (6 ms)
[ RUN ] Hgemm_128x64x8.hgemm_128x64x10_nn
[ OK ] Hgemm_128x64x8.hgemm_128x64x10_nn (7 ms)
[ RUN ] Hgemm_128x64x8.hgemm_128x64x16_nn
[ OK ] Hgemm_128x64x8.hgemm_128x64x16_nn (8 ms)
[ RUN ] Hgemm_128x64x8.hgemm_128x64x64_nn
[ OK ] Hgemm_128x64x8.hgemm_128x64x64_nn (23 ms)
[ RUN ] Hgemm_128x64x8.hgemm_256x64x16_nn
[ OK ] Hgemm_128x64x8.hgemm_256x64x16_nn (15 ms)
[ RUN ] Hgemm_128x64x8.hgemm_128x128x16_nn
[ OK ] Hgemm_128x64x8.hgemm_128x128x16_nn (15 ms)
[ RUN ] Hgemm_128x64x8.hgemm_256x128x16_nn
[ OK ] Hgemm_128x64x8.hgemm_256x128x16_nn (28 ms)
[ RUN ] Hgemm_128x64x8.hgemm_128x64x8_tn
[ OK ] Hgemm_128x64x8.hgemm_128x64x8_tn (7 ms)
[ RUN ] Hgemm_128x64x8.hgemm_128x64x10_tn
[ OK ] Hgemm_128x64x8.hgemm_128x64x10_tn (7 ms)
[ RUN ] Hgemm_128x64x8.hgemm_128x64x16_tn
[ OK ] Hgemm_128x64x8.hgemm_128x64x16_tn (9 ms)
[ RUN ] Hgemm_128x64x8.hgemm_128x64x64_tn
[ OK ] Hgemm_128x64x8.hgemm_128x64x64_tn (21 ms)
[ RUN ] Hgemm_128x64x8.hgemm_256x64x16_tn
[ OK ] Hgemm_128x64x8.hgemm_256x64x16_tn (15 ms)
[ RUN ] Hgemm_128x64x8.hgemm_128x128x16_tn
[ OK ] Hgemm_128x64x8.hgemm_128x128x16_tn (15 ms)
[ RUN ] Hgemm_128x64x8.hgemm_256x128x16_tn
[ OK ] Hgemm_128x64x8.hgemm_256x128x16_tn (28 ms)
[ RUN ] Hgemm_128x64x8.hgemm_128x64x8_tt
[ OK ] Hgemm_128x64x8.hgemm_128x64x8_tt (7 ms)
[ RUN ] Hgemm_128x64x8.hgemm_128x64x10_tt
[ OK ] Hgemm_128x64x8.hgemm_128x64x10_tt (7 ms)
[ RUN ] Hgemm_128x64x8.hgemm_128x64x16_tt
[ OK ] Hgemm_128x64x8.hgemm_128x64x16_tt (9 ms)
[ RUN ] Hgemm_128x64x8.hgemm_128x64x64_tt
[ OK ] Hgemm_128x64x8.hgemm_128x64x64_tt (21 ms)
[ RUN ] Hgemm_128x64x8.hgemm_256x64x16_tt
[ OK ] Hgemm_128x64x8.hgemm_256x64x16_tt (15 ms)
[ RUN ] Hgemm_128x64x8.hgemm_128x128x16_tt
[ OK ] Hgemm_128x64x8.hgemm_128x128x16_tt (15 ms)
[ RUN ] Hgemm_128x64x8.hgemm_256x128x16_tt
[ OK ] Hgemm_128x64x8.hgemm_256x128x16_tt (28 ms)
[----------] 30 tests from Hgemm_128x64x8 (422 ms total)

[----------] 32 tests from Igemm_128x128x32
[ RUN ] Igemm_128x128x32.igemm_128x128x4_nt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x128x32.igemm_128x128x4_nt (7 ms)
[ RUN ] Igemm_128x128x32.igemm_128x128x32_nt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x128x32.igemm_128x128x32_nt (10 ms)
[ RUN ] Igemm_128x128x32.igemm_128x128x36_nt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x128x32.igemm_128x128x36_nt (7 ms)
[ RUN ] Igemm_128x128x32.igemm_128x128x64_nt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x128x32.igemm_128x128x64_nt (9 ms)
[ RUN ] Igemm_128x128x32.igemm_128x128x256_nt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x128x32.igemm_128x128x256_nt (21 ms)
[ RUN ] Igemm_128x128x32.igemm_256x128x64_nt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x128x32.igemm_256x128x64_nt (16 ms)
[ RUN ] Igemm_128x128x32.igemm_128x256x64_nt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x128x32.igemm_128x256x64_nt (14 ms)
[ RUN ] Igemm_128x128x32.igemm_256x256x64_nt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x128x32.igemm_256x256x64_nt (30 ms)
[ RUN ] Igemm_128x128x32.igemm_128x128x4_nn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x128x32.igemm_128x128x4_nn (3 ms)
[ RUN ] Igemm_128x128x32.igemm_128x128x32_nn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x128x32.igemm_128x128x32_nn (6 ms)
[ RUN ] Igemm_128x128x32.igemm_128x128x36_nn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x128x32.igemm_128x128x36_nn (7 ms)
[ RUN ] Igemm_128x128x32.igemm_128x128x64_nn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x128x32.igemm_128x128x64_nn (8 ms)
[ RUN ] Igemm_128x128x32.igemm_128x128x256_nn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x128x32.igemm_128x128x256_nn (21 ms)
[ RUN ] Igemm_128x128x32.igemm_256x128x64_nn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x128x32.igemm_256x128x64_nn (14 ms)
[ RUN ] Igemm_128x128x32.igemm_128x256x64_nn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x128x32.igemm_128x256x64_nn (14 ms)
[ RUN ] Igemm_128x128x32.igemm_256x256x64_nn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x128x32.igemm_256x256x64_nn (29 ms)
[ RUN ] Igemm_128x128x32.igemm_128x128x4_tn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x128x32.igemm_128x128x4_tn (3 ms)
[ RUN ] Igemm_128x128x32.igemm_128x128x32_tn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x128x32.igemm_128x128x32_tn (7 ms)
[ RUN ] Igemm_128x128x32.igemm_128x128x36_tn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x128x32.igemm_128x128x36_tn (6 ms)
[ RUN ] Igemm_128x128x32.igemm_128x128x64_tn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x128x32.igemm_128x128x64_tn (8 ms)
[ RUN ] Igemm_128x128x32.igemm_128x128x256_tn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x128x32.igemm_128x128x256_tn (21 ms)
[ RUN ] Igemm_128x128x32.igemm_256x128x64_tn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x128x32.igemm_256x128x64_tn (15 ms)
[ RUN ] Igemm_128x128x32.igemm_128x256x64_tn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x128x32.igemm_128x256x64_tn (13 ms)
[ RUN ] Igemm_128x128x32.igemm_256x256x64_tn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x128x32.igemm_256x256x64_tn (28 ms)
[ RUN ] Igemm_128x128x32.igemm_128x128x4_tt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x128x32.igemm_128x128x4_tt (3 ms)
[ RUN ] Igemm_128x128x32.igemm_128x128x32_tt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x128x32.igemm_128x128x32_tt (5 ms)
[ RUN ] Igemm_128x128x32.igemm_128x128x36_tt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x128x32.igemm_128x128x36_tt (6 ms)
[ RUN ] Igemm_128x128x32.igemm_128x128x64_tt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x128x32.igemm_128x128x64_tt (8 ms)
[ RUN ] Igemm_128x128x32.igemm_128x128x256_tt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x128x32.igemm_128x128x256_tt (21 ms)
[ RUN ] Igemm_128x128x32.igemm_256x128x64_tt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x128x32.igemm_256x128x64_tt (14 ms)
[ RUN ] Igemm_128x128x32.igemm_128x256x64_tt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x128x32.igemm_128x256x64_tt (14 ms)
[ RUN ] Igemm_128x128x32.igemm_256x256x64_tt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x128x32.igemm_256x256x64_tt (32 ms)
[----------] 32 tests from Igemm_128x128x32 (423 ms total)

[----------] 32 tests from Igemm_128x64x32
[ RUN ] Igemm_128x64x32.Igemm_128x64x4_nt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x64x32.Igemm_128x64x4_nt (2 ms)
[ RUN ] Igemm_128x64x32.igemm_128x64x32_nt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x64x32.igemm_128x64x32_nt (4 ms)
[ RUN ] Igemm_128x64x32.igemm_128x64x36_nt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x64x32.igemm_128x64x36_nt (4 ms)
[ RUN ] Igemm_128x64x32.igemm_128x64x64_nt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x64x32.igemm_128x64x64_nt (5 ms)
[ RUN ] Igemm_128x64x32.igemm_128x64x256_nt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x64x32.igemm_128x64x256_nt (12 ms)
[ RUN ] Igemm_128x64x32.igemm_256x64x64_nt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x64x32.igemm_256x64x64_nt (9 ms)
[ RUN ] Igemm_128x64x32.igemm_128x128x64_nt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x64x32.igemm_128x128x64_nt (8 ms)
[ RUN ] Igemm_128x64x32.igemm_256x128x64_nt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x64x32.igemm_256x128x64_nt (14 ms)
[ RUN ] Igemm_128x64x32.igemm_128x64x4_nn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x64x32.igemm_128x64x4_nn (2 ms)
[ RUN ] Igemm_128x64x32.igemm_128x64x32_nn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x64x32.igemm_128x64x32_nn (4 ms)
[ RUN ] Igemm_128x64x32.igemm_128x64x36_nn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x64x32.igemm_128x64x36_nn (3 ms)
[ RUN ] Igemm_128x64x32.igemm_128x64x64_nn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x64x32.igemm_128x64x64_nn (6 ms)
[ RUN ] Igemm_128x64x32.igemm_128x64x256_nn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x64x32.igemm_128x64x256_nn (12 ms)
[ RUN ] Igemm_128x64x32.igemm_256x64x64_nn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x64x32.igemm_256x64x64_nn (8 ms)
[ RUN ] Igemm_128x64x32.igemm_128x128x64_nn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x64x32.igemm_128x128x64_nn (8 ms)
[ RUN ] Igemm_128x64x32.igemm_256x128x64_nn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x64x32.igemm_256x128x64_nn (14 ms)
[ RUN ] Igemm_128x64x32.igemm_128x64x4_tn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x64x32.igemm_128x64x4_tn (2 ms)
[ RUN ] Igemm_128x64x32.igemm_128x64x32_tn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x64x32.igemm_128x64x32_tn (4 ms)
[ RUN ] Igemm_128x64x32.igemm_128x64x36_tn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x64x32.igemm_128x64x36_tn (4 ms)
[ RUN ] Igemm_128x64x32.igemm_128x64x64_tn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x64x32.igemm_128x64x64_tn (5 ms)
[ RUN ] Igemm_128x64x32.igemm_128x64x256_tn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x64x32.igemm_128x64x256_tn (12 ms)
[ RUN ] Igemm_128x64x32.igemm_256x64x64_tn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x64x32.igemm_256x64x64_tn (8 ms)
[ RUN ] Igemm_128x64x32.igemm_128x128x64_tn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x64x32.igemm_128x128x64_tn (8 ms)
[ RUN ] Igemm_128x64x32.igemm_256x128x64_tn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x64x32.igemm_256x128x64_tn (14 ms)
[ RUN ] Igemm_128x64x32.igemm_128x64x4_tt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x64x32.igemm_128x64x4_tt (3 ms)
[ RUN ] Igemm_128x64x32.igemm_128x64x32_tt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x64x32.igemm_128x64x32_tt (3 ms)
[ RUN ] Igemm_128x64x32.igemm_128x64x36_tt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x64x32.igemm_128x64x36_tt (4 ms)
[ RUN ] Igemm_128x64x32.igemm_128x64x64_tt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x64x32.igemm_128x64x64_tt (5 ms)
[ RUN ] Igemm_128x64x32.igemm_128x64x256_tt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x64x32.igemm_128x64x256_tt (12 ms)
[ RUN ] Igemm_128x64x32.igemm_256x64x64_tt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x64x32.igemm_256x64x64_tt (9 ms)
[ RUN ] Igemm_128x64x32.igemm_128x128x64_tt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x64x32.igemm_128x128x64_tt (8 ms)
[ RUN ] Igemm_128x64x32.igemm_256x128x64_tt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x64x32.igemm_256x128x64_tt (14 ms)
[----------] 32 tests from Igemm_128x64x32 (231 ms total)

[----------] 32 tests from Igemm_128x32x32
[ RUN ] Igemm_128x32x32.igemm_128x32x32x4_nt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x32x32.igemm_128x32x32x4_nt (2 ms)
[ RUN ] Igemm_128x32x32.igemm_128x32x32_nt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x32x32.igemm_128x32x32_nt (2 ms)
[ RUN ] Igemm_128x32x32.igemm_128x32x36_nt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x32x32.igemm_128x32x36_nt (3 ms)
[ RUN ] Igemm_128x32x32.igemm_128x32x64_nt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x32x32.igemm_128x32x64_nt (4 ms)
[ RUN ] Igemm_128x32x32.igemm_128x32x256_nt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x32x32.igemm_128x32x256_nt (7 ms)
[ RUN ] Igemm_128x32x32.igemm_256x32x64_nt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x32x32.igemm_256x32x64_nt (5 ms)
[ RUN ] Igemm_128x32x32.igemm_128x128x32_nt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x32x32.igemm_128x128x32_nt (6 ms)
[ RUN ] Igemm_128x32x32.igemm_256x128x32_nt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x32x32.igemm_256x128x32_nt (10 ms)
[ RUN ] Igemm_128x32x32.igemm_128x32x4_nn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x32x32.igemm_128x32x4_nn (3 ms)
[ RUN ] Igemm_128x32x32.igemm_128x32x32_nn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x32x32.igemm_128x32x32_nn (3 ms)
[ RUN ] Igemm_128x32x32.igemm_128x32x36_nn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x32x32.igemm_128x32x36_nn (3 ms)
[ RUN ] Igemm_128x32x32.igemm_128x32x64_nn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x32x32.igemm_128x32x64_nn (3 ms)
[ RUN ] Igemm_128x32x32.igemm_128x32x256_nn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x32x32.igemm_128x32x256_nn (8 ms)
[ RUN ] Igemm_128x32x32.igemm_256x32x64_nn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x32x32.igemm_256x32x64_nn (5 ms)
[ RUN ] Igemm_128x32x32.igemm_128x128x32_nn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x32x32.igemm_128x128x32_nn (6 ms)
[ RUN ] Igemm_128x32x32.igemm_256x128x32_nn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x32x32.igemm_256x128x32_nn (10 ms)
[ RUN ] Igemm_128x32x32.igemm_128x32x4_tn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x32x32.igemm_128x32x4_tn (2 ms)
[ RUN ] Igemm_128x32x32.igemm_128x32x32_tn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x32x32.igemm_128x32x32_tn (3 ms)
[ RUN ] Igemm_128x32x32.igemm_128x32x36_tn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x32x32.igemm_128x32x36_tn (3 ms)
[ RUN ] Igemm_128x32x32.igemm_128x32x64_tn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x32x32.igemm_128x32x64_tn (4 ms)
[ RUN ] Igemm_128x32x32.igemm_128x32x256_tn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x32x32.igemm_128x32x256_tn (7 ms)
[ RUN ] Igemm_128x32x32.igemm_256x32x64_tn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x32x32.igemm_256x32x64_tn (5 ms)
[ RUN ] Igemm_128x32x32.igemm_128x128x32_tn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x32x32.igemm_128x128x32_tn (6 ms)
[ RUN ] Igemm_128x32x32.igemm_256x128x32_tn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x32x32.igemm_256x128x32_tn (10 ms)
[ RUN ] Igemm_128x32x32.igemm_128x32x4_tt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x32x32.igemm_128x32x4_tt (2 ms)
[ RUN ] Igemm_128x32x32.igemm_128x32x32_tt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x32x32.igemm_128x32x32_tt (3 ms)
[ RUN ] Igemm_128x32x32.igemm_128x32x36_tt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x32x32.igemm_128x32x36_tt (3 ms)
[ RUN ] Igemm_128x32x32.igemm_128x32x64_tt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x32x32.igemm_128x32x64_tt (4 ms)
[ RUN ] Igemm_128x32x32.igemm_128x32x256_tt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x32x32.igemm_128x32x256_tt (7 ms)
[ RUN ] Igemm_128x32x32.igemm_256x32x64_tt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x32x32.igemm_256x32x64_tt (5 ms)
[ RUN ] Igemm_128x32x32.igemm_128x128x32_tt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x32x32.igemm_128x128x32_tt (6 ms)
[ RUN ] Igemm_128x32x32.igemm_256x128x32_tt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_128x32x32.igemm_256x128x32_tt (10 ms)
[----------] 32 tests from Igemm_128x32x32 (160 ms total)
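
A note on the failure mode above: every test shown in this suite threw `compute_cublas() failed`, meaning the testbed's cuBLAS *reference* path returned an error before any comparison with the CUTLASS kernel could take place. Since even well-aligned shapes (k = 32, 64, 256) fail, int8 GEMM may simply be unsupported on this device/toolkit combination, or the shapes may violate cuBLAS's integer-GEMM restrictions. A minimal sketch of how such a reference call and status check could look; `check()` and `reference_igemm()` are hypothetical names, not the testbed's actual API:

```cpp
#include <cublas_v2.h>
#include <cstdio>
#include <stdexcept>

// Hypothetical helper: throw, as the testbed appears to, whenever
// cuBLAS reports anything other than CUBLAS_STATUS_SUCCESS.
static void check(cublasStatus_t status, char const *what) {
  if (status != CUBLAS_STATUS_SUCCESS) {
    std::fprintf(stderr, "%s failed with cuBLAS status %d\n", what, int(status));
    throw std::runtime_error("compute_cublas() failed");
  }
}

// Reference int8 GEMM: C (int32) = alpha * op(A) (int8) * op(B) (int8) + beta * C.
// If the device or shape is unsupported, cublasGemmEx returns an error status
// immediately and the check above throws, matching the failures in this log.
void reference_igemm(cublasHandle_t handle,
                     cublasOperation_t op_a, cublasOperation_t op_b,
                     int m, int n, int k,
                     int32_t const *alpha,
                     int8_t const *A, int lda,
                     int8_t const *B, int ldb,
                     int32_t const *beta,
                     int32_t *C, int ldc) {
  check(cublasGemmEx(handle, op_a, op_b, m, n, k,
                     alpha,
                     A, CUDA_R_8I, lda,
                     B, CUDA_R_8I, ldb,
                     beta,
                     C, CUDA_R_32I, ldc,
                     CUBLAS_COMPUTE_32I,  // pre-CUDA 11 toolkits pass CUDA_R_32I here
                     CUBLAS_GEMM_DEFAULT),
        "cublasGemmEx");
}
```

The consistently tiny runtimes above (2-10 ms) are consistent with this reading: the exception fires on the status check, before any reference kernel runs.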

[----------] 32 tests from Igemm_128x128x32_float
[ RUN ] Igemm_128x128x32_float.igemm_128x128x4_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_float.igemm_128x128x4_nt (119 ms)
[ RUN ] Igemm_128x128x32_float.igemm_128x128x32_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_float.igemm_128x128x32_nt (115 ms)
[ RUN ] Igemm_128x128x32_float.igemm_128x128x36_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_float.igemm_128x128x36_nt (113 ms)
[ RUN ] Igemm_128x128x32_float.igemm_128x128x64_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_float.igemm_128x128x64_nt (116 ms)
[ RUN ] Igemm_128x128x32_float.igemm_128x128x256_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_float.igemm_128x128x256_nt (137 ms)
[ RUN ] Igemm_128x128x32_float.igemm_256x128x64_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_float.igemm_256x128x64_nt (228 ms)
[ RUN ] Igemm_128x128x32_float.igemm_128x256x64_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_float.igemm_128x256x64_nt (225 ms)
[ RUN ] Igemm_128x128x32_float.igemm_256x256x64_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_float.igemm_256x256x64_nt (450 ms)
[ RUN ] Igemm_128x128x32_float.igemm_128x128x4_nn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_float.igemm_128x128x4_nn (105 ms)
[ RUN ] Igemm_128x128x32_float.igemm_128x128x32_nn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_float.igemm_128x128x32_nn (110 ms)
[ RUN ] Igemm_128x128x32_float.igemm_128x128x36_nn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_float.igemm_128x128x36_nn (111 ms)
[ RUN ] Igemm_128x128x32_float.igemm_128x128x64_nn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_float.igemm_128x128x64_nn (114 ms)
[ RUN ] Igemm_128x128x32_float.igemm_128x128x256_nn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_float.igemm_128x128x256_nn (138 ms)
[ RUN ] Igemm_128x128x32_float.igemm_256x128x64_nn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_float.igemm_256x128x64_nn (226 ms)
[ RUN ] Igemm_128x128x32_float.igemm_128x256x64_nn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_float.igemm_128x256x64_nn (224 ms)
[ RUN ] Igemm_128x128x32_float.igemm_256x256x64_nn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_float.igemm_256x256x64_nn (446 ms)
[ RUN ] Igemm_128x128x32_float.igemm_128x128x4_tn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_float.igemm_128x128x4_tn (104 ms)
[ RUN ] Igemm_128x128x32_float.igemm_128x128x32_tn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_float.igemm_128x128x32_tn (109 ms)
[ RUN ] Igemm_128x128x32_float.igemm_128x128x36_tn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_float.igemm_128x128x36_tn (110 ms)
[ RUN ] Igemm_128x128x32_float.igemm_128x128x64_tn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_float.igemm_128x128x64_tn (113 ms)
[ RUN ] Igemm_128x128x32_float.igemm_128x128x256_tn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_float.igemm_128x128x256_tn (135 ms)
[ RUN ] Igemm_128x128x32_float.igemm_256x128x64_tn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_float.igemm_256x128x64_tn (224 ms)
[ RUN ] Igemm_128x128x32_float.igemm_128x256x64_tn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_float.igemm_128x256x64_tn (222 ms)
[ RUN ] Igemm_128x128x32_float.igemm_256x256x64_tn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_float.igemm_256x256x64_tn (446 ms)
[ RUN ] Igemm_128x128x32_float.igemm_128x128x4_tt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_float.igemm_128x128x4_tt (104 ms)
[ RUN ] Igemm_128x128x32_float.igemm_128x128x32_tt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_float.igemm_128x128x32_tt (109 ms)
[ RUN ] Igemm_128x128x32_float.igemm_128x128x36_tt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_float.igemm_128x128x36_tt (110 ms)
[ RUN ] Igemm_128x128x32_float.igemm_128x128x64_tt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_float.igemm_128x128x64_tt (113 ms)
[ RUN ] Igemm_128x128x32_float.igemm_128x128x256_tt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_float.igemm_128x128x256_tt (135 ms)
[ RUN ] Igemm_128x128x32_float.igemm_256x128x64_tt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_float.igemm_256x128x64_tt (226 ms)
[ RUN ] Igemm_128x128x32_float.igemm_128x256x64_tt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_float.igemm_128x256x64_tt (222 ms)
[ RUN ] Igemm_128x128x32_float.igemm_256x256x64_tt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_float.igemm_256x256x64_tt (444 ms)
[----------] 32 tests from Igemm_128x128x32_float (5904 ms total)

[----------] 32 tests from Igemm_128x128x32_int8
[ RUN ] Igemm_128x128x32_int8.igemm_128x128x4_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_int8.igemm_128x128x4_nt (56 ms)
[ RUN ] Igemm_128x128x32_int8.igemm_128x128x32_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_int8.igemm_128x128x32_nt (58 ms)
[ RUN ] Igemm_128x128x32_int8.igemm_128x128x36_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_int8.igemm_128x128x36_nt (54 ms)
[ RUN ] Igemm_128x128x32_int8.igemm_128x128x64_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_int8.igemm_128x128x64_nt (58 ms)
[ RUN ] Igemm_128x128x32_int8.igemm_128x128x256_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_int8.igemm_128x128x256_nt (78 ms)
[ RUN ] Igemm_128x128x32_int8.igemm_256x128x64_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_int8.igemm_256x128x64_nt (110 ms)
[ RUN ] Igemm_128x128x32_int8.igemm_128x256x64_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_int8.igemm_128x256x64_nt (110 ms)
[ RUN ] Igemm_128x128x32_int8.igemm_256x256x64_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_int8.igemm_256x256x64_nt (214 ms)
[ RUN ] Igemm_128x128x32_int8.igemm_128x128x4_nn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_int8.igemm_128x128x4_nn (50 ms)
[ RUN ] Igemm_128x128x32_int8.igemm_128x128x32_nn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_int8.igemm_128x128x32_nn (54 ms)
[ RUN ] Igemm_128x128x32_int8.igemm_128x128x36_nn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_int8.igemm_128x128x36_nn (54 ms)
[ RUN ] Igemm_128x128x32_int8.igemm_128x128x64_nn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_int8.igemm_128x128x64_nn (57 ms)
[ RUN ] Igemm_128x128x32_int8.igemm_128x128x256_nn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_int8.igemm_128x128x256_nn (79 ms)
[ RUN ] Igemm_128x128x32_int8.igemm_256x128x64_nn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_int8.igemm_256x128x64_nn (110 ms)
[ RUN ] Igemm_128x128x32_int8.igemm_128x256x64_nn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_int8.igemm_128x256x64_nn (110 ms)
[ RUN ] Igemm_128x128x32_int8.igemm_256x256x64_nn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_int8.igemm_256x256x64_nn (214 ms)
[ RUN ] Igemm_128x128x32_int8.igemm_128x128x4_tn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_int8.igemm_128x128x4_tn (50 ms)
[ RUN ] Igemm_128x128x32_int8.igemm_128x128x32_tn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_int8.igemm_128x128x32_tn (55 ms)
[ RUN ] Igemm_128x128x32_int8.igemm_128x128x36_tn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_int8.igemm_128x128x36_tn (53 ms)
[ RUN ] Igemm_128x128x32_int8.igemm_128x128x64_tn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_int8.igemm_128x128x64_tn (56 ms)
[ RUN ] Igemm_128x128x32_int8.igemm_128x128x256_tn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_int8.igemm_128x128x256_tn (77 ms)
[ RUN ] Igemm_128x128x32_int8.igemm_256x128x64_tn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_int8.igemm_256x128x64_tn (108 ms)
[ RUN ] Igemm_128x128x32_int8.igemm_128x256x64_tn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_int8.igemm_128x256x64_tn (108 ms)
[ RUN ] Igemm_128x128x32_int8.igemm_256x256x64_tn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_int8.igemm_256x256x64_tn (209 ms)
[ RUN ] Igemm_128x128x32_int8.igemm_128x128x4_tt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_int8.igemm_128x128x4_tt (51 ms)
[ RUN ] Igemm_128x128x32_int8.igemm_128x128x32_tt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_int8.igemm_128x128x32_tt (56 ms)
[ RUN ] Igemm_128x128x32_int8.igemm_128x128x36_tt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_int8.igemm_128x128x36_tt (53 ms)
[ RUN ] Igemm_128x128x32_int8.igemm_128x128x64_tt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_int8.igemm_128x128x64_tt (55 ms)
[ RUN ] Igemm_128x128x32_int8.igemm_128x128x256_tt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_int8.igemm_128x128x256_tt (76 ms)
[ RUN ] Igemm_128x128x32_int8.igemm_256x128x64_tt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_int8.igemm_256x128x64_tt (108 ms)
[ RUN ] Igemm_128x128x32_int8.igemm_128x256x64_tt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_int8.igemm_128x256x64_tt (107 ms)
[ RUN ] Igemm_128x128x32_int8.igemm_256x256x64_tt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:97: Failure
Value of: testbed.verify_with_host()
Actual: false
Expected: true
[ FAILED ] Igemm_128x128x32_int8.igemm_256x256x64_tt (209 ms)
[----------] 32 tests from Igemm_128x128x32_int8 (2898 ms total)
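
The two `Igemm_128x128x32_*` suites above fail differently: no exception is thrown, so both the CUTLASS kernel and the reference completed, but `verify_with_host()` (gemm.h:97) found mismatching output. A host-side integer GEMM reference is exact, since int32 accumulation involves no rounding, so any element-wise difference is a hard failure. A minimal sketch of such a reference for the column-major nn case; `host_igemm_nn` is a hypothetical name, and the testbed's actual host path in gemm.h may differ:

```cpp
#include <cstdint>
#include <vector>

// Minimal host reference for C = alpha * A * B + beta * C, column-major,
// int8 operands accumulated exactly in int32 (nn case; the transposed
// cases only change how A and B are indexed).
void host_igemm_nn(int m, int n, int k,
                   int32_t alpha,
                   std::vector<int8_t> const &A, int lda,
                   std::vector<int8_t> const &B, int ldb,
                   int32_t beta,
                   std::vector<int32_t> &C, int ldc) {
  for (int j = 0; j < n; ++j) {
    for (int i = 0; i < m; ++i) {
      int32_t acc = 0;
      for (int p = 0; p < k; ++p) {
        acc += int32_t(A[i + p * lda]) * int32_t(B[p + j * ldb]);
      }
      C[i + j * ldc] = alpha * acc + beta * C[i + j * ldc];
    }
  }
}
```

Because the accumulation is exact, a bit-for-bit comparison is appropriate for the int8-output variant; the `_float` variant presumably applies float alpha/beta scaling to the int32 accumulators, where a small tolerance would be the safer comparison.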

[----------] 16 tests from Igemm_32x32x128
[ RUN ] Igemm_32x32x128.igemm_32x32x4_nt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_32x32x128.igemm_32x32x4_nt (2 ms)
[ RUN ] Igemm_32x32x128.igemm_32x32x8_nt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_32x32x128.igemm_32x32x8_nt (2 ms)
[ RUN ] Igemm_32x32x128.igemm_32x32x32_nt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_32x32x128.igemm_32x32x32_nt (2 ms)
[ RUN ] Igemm_32x32x128.igemm_32x32x128_nt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_32x32x128.igemm_32x32x128_nt (3 ms)
[ RUN ] Igemm_32x32x128.igemm_32x32x4_nn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_32x32x128.igemm_32x32x4_nn (1 ms)
[ RUN ] Igemm_32x32x128.igemm_32x32x8_nn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_32x32x128.igemm_32x32x8_nn (2 ms)
[ RUN ] Igemm_32x32x128.igemm_32x32x32_nn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_32x32x128.igemm_32x32x32_nn (2 ms)
[ RUN ] Igemm_32x32x128.igemm_32x32x128_nn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_32x32x128.igemm_32x32x128_nn (2 ms)
[ RUN ] Igemm_32x32x128.igemm_32x32x4_tn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_32x32x128.igemm_32x32x4_tn (2 ms)
[ RUN ] Igemm_32x32x128.igemm_32x32x8_tn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_32x32x128.igemm_32x32x8_tn (2 ms)
[ RUN ] Igemm_32x32x128.igemm_32x32x15_tn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_32x32x128.igemm_32x32x15_tn (1 ms)
[ RUN ] Igemm_32x32x128.igemm_32x32x32_tn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_32x32x128.igemm_32x32x32_tn (2 ms)
[ RUN ] Igemm_32x32x128.igemm_32x32x128_tn
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_32x32x128.igemm_32x32x128_tn (3 ms)
[ RUN ] Igemm_32x32x128.igemm_32x32x8_tt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_32x32x128.igemm_32x32x8_tt (2 ms)
[ RUN ] Igemm_32x32x128.igemm_32x32x32_tt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_32x32x128.igemm_32x32x32_tt (2 ms)
[ RUN ] Igemm_32x32x128.igemm_32x32x128_tt
unknown file: Failure
C++ exception with description "compute_cublas() failed" thrown in the test body.
[ FAILED ] Igemm_32x32x128.igemm_32x32x128_tt (4 ms)
[----------] 16 tests from Igemm_32x32x128 (34 ms total)

[----------] 36 tests from Sgemm_128x128x8
[ RUN ] Sgemm_128x128x8.sgemm_128x81x1_nt
[ OK ] Sgemm_128x128x8.sgemm_128x81x1_nt (4 ms)
[ RUN ] Sgemm_128x128x8.sgemm_128x112x8_nt
[ OK ] Sgemm_128x128x8.sgemm_128x112x8_nt (7 ms)
[ RUN ] Sgemm_128x128x8.sgemm_128x112x9_nt
[ OK ] Sgemm_128x128x8.sgemm_128x112x9_nt (7 ms)
[ RUN ] Sgemm_128x128x8.sgemm_128x73x16_nt
[ OK ] Sgemm_128x128x8.sgemm_128x73x16_nt (6 ms)
[ RUN ] Sgemm_128x128x8.sgemm_97x112x64_nt
[ OK ] Sgemm_128x128x8.sgemm_97x112x64_nt (12 ms)
[ RUN ] Sgemm_128x128x8.sgemm_256x112x16_nt
[ OK ] Sgemm_128x128x8.sgemm_256x112x16_nt (13 ms)
[ RUN ] Sgemm_128x128x8.sgemm_128x240x16_nt
[ OK ] Sgemm_128x128x8.sgemm_128x240x16_nt (11 ms)
[ RUN ] Sgemm_128x128x8.sgemm_256x240x16_nt
[ OK ] Sgemm_128x128x8.sgemm_256x240x16_nt (24 ms)
[ RUN ] Sgemm_128x128x8.sgemm_128x112x1_nn
[ OK ] Sgemm_128x128x8.sgemm_128x112x1_nn (4 ms)
[ RUN ] Sgemm_128x128x8.sgemm_79x112x8_nn
[ OK ] Sgemm_128x128x8.sgemm_79x112x8_nn (4 ms)
[ RUN ] Sgemm_128x128x8.sgemm_128x81x9_nn
[ OK ] Sgemm_128x128x8.sgemm_128x81x9_nn (5 ms)
[ RUN ] Sgemm_128x128x8.sgemm_128x112x16_nn
[ OK ] Sgemm_128x128x8.sgemm_128x112x16_nn (6 ms)
[ RUN ] Sgemm_128x128x8.sgemm_128x73x64_nn
[ OK ] Sgemm_128x128x8.sgemm_128x73x64_nn (9 ms)
[ RUN ] Sgemm_128x128x8.sgemm_256x112x16_nn
[ OK ] Sgemm_128x128x8.sgemm_256x112x16_nn (12 ms)
[ RUN ] Sgemm_128x128x8.sgemm_128x256x16_nn
[ OK ] Sgemm_128x128x8.sgemm_128x256x16_nn (11 ms)
[ RUN ] Sgemm_128x128x8.sgemm_256x256x16_nn
[ OK ] Sgemm_128x128x8.sgemm_256x256x16_nn (24 ms)
[ RUN ] Sgemm_128x128x8.sgemm_128x128x1_tn
[ OK ] Sgemm_128x128x8.sgemm_128x128x1_tn (5 ms)
[ RUN ] Sgemm_128x128x8.sgemm_127x112x8_tn
[ OK ] Sgemm_128x128x8.sgemm_127x112x8_tn (5 ms)
[ RUN ] Sgemm_128x128x8.sgemm_21x112x9_tn
[ OK ] Sgemm_128x128x8.sgemm_21x112x9_tn (3 ms)
[ RUN ] Sgemm_128x128x8.sgemm_128x73x16_tn
[ OK ] Sgemm_128x128x8.sgemm_128x73x16_tn (5 ms)
[ RUN ] Sgemm_128x128x8.sgemm_128x81x64_tn
[ OK ] Sgemm_128x128x8.sgemm_128x81x64_tn (9 ms)
[ RUN ] Sgemm_128x128x8.sgemm_256x112x16_tn
[ OK ] Sgemm_128x128x8.sgemm_256x112x16_tn (12 ms)
[ RUN ] Sgemm_128x128x8.sgemm_47x256x16_tn
[ OK ] Sgemm_128x128x8.sgemm_47x256x16_tn (6 ms)
[ RUN ] Sgemm_128x128x8.sgemm_211x256x16_tn
[ OK ] Sgemm_128x128x8.sgemm_211x256x16_tn (16 ms)
[ RUN ] Sgemm_128x128x8.sgemm_128x128x1_tt
[ OK ] Sgemm_128x128x8.sgemm_128x128x1_tt (5 ms)
[ RUN ] Sgemm_128x128x8.sgemm_109x112x8_tt
[ OK ] Sgemm_128x128x8.sgemm_109x112x8_tt (5 ms)
[ RUN ] Sgemm_128x128x8.sgemm_128x112x9_tt
[ OK ] Sgemm_128x128x8.sgemm_128x112x9_tt (5 ms)
[ RUN ] Sgemm_128x128x8.sgemm_128x112x16_tt
[ OK ] Sgemm_128x128x8.sgemm_128x112x16_tt (6 ms)
[ RUN ] Sgemm_128x128x8.sgemm_123x112x64_tt
[ OK ] Sgemm_128x128x8.sgemm_123x112x64_tt (9 ms)
[ RUN ] Sgemm_128x128x8.sgemm_256x112x16_tt
[ OK ] Sgemm_128x128x8.sgemm_256x112x16_tt (11 ms)
[ RUN ] Sgemm_128x128x8.sgemm_128x256x16_tt
[ OK ] Sgemm_128x128x8.sgemm_128x256x16_tt (10 ms)
[ RUN ] Sgemm_128x128x8.sgemm_256x256x16_tt
[ OK ] Sgemm_128x128x8.sgemm_256x256x16_tt (23 ms)
[ RUN ] Sgemm_128x128x8.sgemm_120x112x64_ldg4_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x128x8.sgemm_120x112x64_ldg4_nt (99 ms)
[ RUN ] Sgemm_128x128x8.sgemm_128x128x16_alpha2_nt
[ OK ] Sgemm_128x128x8.sgemm_128x128x16_alpha2_nt (6 ms)
[ RUN ] Sgemm_128x128x8.sgemm_128x112x16_beta1_nt
[ OK ] Sgemm_128x128x8.sgemm_128x112x16_beta1_nt (6 ms)
[ RUN ] Sgemm_128x128x8.sgemm_128x112x16_alpha2_beta1_nt
[ OK ] Sgemm_128x128x8.sgemm_128x112x16_alpha2_beta1_nt (6 ms)
[----------] 36 tests from Sgemm_128x128x8 (414 ms total)
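
`Sgemm_128x128x8` is almost clean: 35 of 36 tests pass, and the lone failure, `sgemm_120x112x64_ldg4_nt`, trips `verify_with_cublas()` (gemm.h:95) rather than an exception, i.e. the kernel ran but its output disagreed with cuBLAS. The `ldg4` suffix suggests a vectorized global-load variant, so an isolated failure here while the scalar-load shapes pass is consistent with the load path rather than the math. A sketch of what a cuBLAS-backed check can look like, with hypothetical names and an nt operand configuration (the testbed's actual comparison and tolerance live in gemm.h):

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <algorithm>
#include <cmath>
#include <vector>

// Sketch: recompute C with cuBLAS into a separate device buffer, copy it
// back, and compare against the CUTLASS result with a relative tolerance.
bool verify_with_cublas_sketch(cublasHandle_t handle,
                               int m, int n, int k, float alpha, float beta,
                               float const *d_A, int lda,  // device operands
                               float const *d_B, int ldb,
                               float *d_ref, int ldc,      // starts as a copy of C
                               float const *h_cutlass,     // CUTLASS result (host)
                               float rel_tol = 1e-5f) {
  // nt case: A non-transposed, B transposed, column-major throughout.
  if (cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, m, n, k,
                  &alpha, d_A, lda, d_B, ldb, &beta, d_ref, ldc)
      != CUBLAS_STATUS_SUCCESS) {
    return false;
  }
  std::vector<float> h_ref(size_t(ldc) * n);
  cudaMemcpy(h_ref.data(), d_ref, sizeof(float) * h_ref.size(),
             cudaMemcpyDeviceToHost);
  for (int j = 0; j < n; ++j) {
    for (int i = 0; i < m; ++i) {
      size_t idx = size_t(j) * ldc + i;
      float err = std::fabs(h_ref[idx] - h_cutlass[idx]);
      if (err > rel_tol * std::max(std::fabs(h_ref[idx]), 1.0f)) {
        return false;
      }
    }
  }
  return true;
}
```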

[----------] 40 tests from Sgemm_128x128x16
[ RUN ] Sgemm_128x128x16.sgemm_128x128x16_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x128x16.sgemm_128x128x16_nt (102 ms)
[ RUN ] Sgemm_128x128x16.sgemm_128x81x1_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x128x16.sgemm_128x81x1_nt (58 ms)
[ RUN ] Sgemm_128x128x16.sgemm_128x112x16_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x128x16.sgemm_128x112x16_nt (91 ms)
[ RUN ] Sgemm_128x128x16.sgemm_128x112x17_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x128x16.sgemm_128x112x17_nt (90 ms)
[ RUN ] Sgemm_128x128x16.sgemm_128x73x16_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x128x16.sgemm_128x73x16_nt (60 ms)
[ RUN ] Sgemm_128x128x16.sgemm_97x112x64_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x128x16.sgemm_97x112x64_nt (81 ms)
[ RUN ] Sgemm_128x128x16.sgemm_256x112x16_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x128x16.sgemm_256x112x16_nt (181 ms)
[ RUN ] Sgemm_128x128x16.sgemm_128x240x16_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x128x16.sgemm_128x240x16_nt (189 ms)
[ RUN ] Sgemm_128x128x16.sgemm_256x240x16_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x128x16.sgemm_256x240x16_nt (380 ms)
[ RUN ] Sgemm_128x128x16.sgemm_128x128x16_nn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x128x16.sgemm_128x128x16_nn (101 ms)
[ RUN ] Sgemm_128x128x16.sgemm_128x112x1_nn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x128x16.sgemm_128x112x1_nn (79 ms)
[ RUN ] Sgemm_128x128x16.sgemm_79x112x16_nn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x128x16.sgemm_79x112x16_nn (57 ms)
[ RUN ] Sgemm_128x128x16.sgemm_128x81x17_nn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x128x16.sgemm_128x81x17_nn (66 ms)
[ RUN ] Sgemm_128x128x16.sgemm_128x112x16_nn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x128x16.sgemm_128x112x16_nn (90 ms)
[ RUN ] Sgemm_128x128x16.sgemm_128x73x64_nn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x128x16.sgemm_128x73x64_nn (69 ms)
[ RUN ] Sgemm_128x128x16.sgemm_256x112x16_nn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x128x16.sgemm_256x112x16_nn (178 ms)
[ RUN ] Sgemm_128x128x16.sgemm_128x256x16_nn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x128x16.sgemm_128x256x16_nn (201 ms)
[ RUN ] Sgemm_128x128x16.sgemm_256x256x16_nn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x128x16.sgemm_256x256x16_nn (402 ms)
[ RUN ] Sgemm_128x128x16.sgemm_128x128x16_tn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x128x16.sgemm_128x128x16_tn (102 ms)
[ RUN ] Sgemm_128x128x16.sgemm_128x128x1_tn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x128x16.sgemm_128x128x1_tn (90 ms)
[ RUN ] Sgemm_128x128x16.sgemm_127x112x16_tn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x128x16.sgemm_127x112x16_tn (88 ms)
[ RUN ] Sgemm_128x128x16.sgemm_21x112x17_tn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x128x16.sgemm_21x112x17_tn (18 ms)
[ RUN ] Sgemm_128x128x16.sgemm_128x73x16_tn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x128x16.sgemm_128x73x16_tn (60 ms)
[ RUN ] Sgemm_128x128x16.sgemm_128x81x64_tn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x128x16.sgemm_128x81x64_tn (76 ms)
[ RUN ] Sgemm_128x128x16.sgemm_256x112x16_tn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x128x16.sgemm_256x112x16_tn (179 ms)
[ RUN ] Sgemm_128x128x16.sgemm_47x256x16_tn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x128x16.sgemm_47x256x16_tn (76 ms)
[ RUN ] Sgemm_128x128x16.sgemm_211x256x16_tn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x128x16.sgemm_211x256x16_tn (326 ms)
[ RUN ] Sgemm_128x128x16.sgemm_128x128x16_tt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x128x16.sgemm_128x128x16_tt (102 ms)
[ RUN ] Sgemm_128x128x16.sgemm_128x128x1_tt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x128x16.sgemm_128x128x1_tt (89 ms)
[ RUN ] Sgemm_128x128x16.sgemm_109x112x16_tt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x128x16.sgemm_109x112x16_tt (77 ms)
[ RUN ] Sgemm_128x128x16.sgemm_128x112x17_tt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x128x16.sgemm_128x112x17_tt (90 ms)
[ RUN ] Sgemm_128x128x16.sgemm_128x112x16_tt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x128x16.sgemm_128x112x16_tt (90 ms)
[ RUN ] Sgemm_128x128x16.sgemm_123x112x64_tt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x128x16.sgemm_123x112x64_tt (97 ms)
[ RUN ] Sgemm_128x128x16.sgemm_256x112x16_tt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x128x16.sgemm_256x112x16_tt (179 ms)
[ RUN ] Sgemm_128x128x16.sgemm_128x256x16_tt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x128x16.sgemm_128x256x16_tt (199 ms)
[ RUN ] Sgemm_128x128x16.sgemm_256x256x16_tt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x128x16.sgemm_256x256x16_tt (401 ms)
[ RUN ] Sgemm_128x128x16.sgemm_120x112x64_ldg4_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x128x16.sgemm_120x112x64_ldg4_nt (96 ms)
[ RUN ] Sgemm_128x128x16.sgemm_128x128x16_alpha2_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x128x16.sgemm_128x128x16_alpha2_nt (105 ms)
[ RUN ] Sgemm_128x128x16.sgemm_128x112x16_beta1_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x128x16.sgemm_128x112x16_beta1_nt (91 ms)
[ RUN ] Sgemm_128x128x16.sgemm_128x112x16_alpha2_beta1_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x128x16.sgemm_128x112x16_alpha2_beta1_nt (110 ms)
[----------] 40 tests from Sgemm_128x128x16 (5219 ms total)
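
The pattern in this suite is worth calling out: all 40 `Sgemm_128x128x16` tests fail `verify_with_cublas()`, while the same problem shapes pass under `Sgemm_128x128x8` above, so the regression tracks the k = 16 tile depth itself, not any particular problem size or layout. Two FP32 GEMMs may legitimately differ by accumulation order, which is why the comparison needs a tolerance; a common model bounds the element-wise error in proportion to k (a hypothetical helper, not the testbed's actual check):

```cpp
#include <limits>

// Hypothetical tolerance model for FP32 GEMM verification: each output
// element sums k products, so reordering noise scales roughly with
// k * machine epsilon * the magnitude of the data.
inline float gemm_tolerance(int k, float magnitude = 1.0f) {
  return float(k) * std::numeric_limits<float>::epsilon() * magnitude;
}
// Example: gemm_tolerance(16) is 16 * 2^-23, about 1.9e-6.
```

Whatever tolerance gemm.h uses, rounding cannot explain this suite: the k = 1 tests (e.g. `sgemm_128x81x1_nt`) fail too, and with a single product per output element there is no accumulation order to differ.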

[----------] 34 tests from Sgemm_128x64x8
[ RUN ] Sgemm_128x64x8.sgemm_128x64x1_nt
[ OK ] Sgemm_128x64x8.sgemm_128x64x1_nt (4 ms)
[ RUN ] Sgemm_128x64x8.sgemm_128x64x8_nt
[ OK ] Sgemm_128x64x8.sgemm_128x64x8_nt (3 ms)
[ RUN ] Sgemm_128x64x8.sgemm_128x64x9_nt
[ OK ] Sgemm_128x64x8.sgemm_128x64x9_nt (4 ms)
[ RUN ] Sgemm_128x64x8.sgemm_128x64x16_nt
[ OK ] Sgemm_128x64x8.sgemm_128x64x16_nt (4 ms)
[ RUN ] Sgemm_128x64x8.sgemm_128x64x64_nt
[ OK ] Sgemm_128x64x8.sgemm_128x64x64_nt (7 ms)
[ RUN ] Sgemm_128x64x8.sgemm_256x64x16_nt
[ OK ] Sgemm_128x64x8.sgemm_256x64x16_nt (7 ms)
[ RUN ] Sgemm_128x64x8.sgemm_128x128x16_nt
[ OK ] Sgemm_128x64x8.sgemm_128x128x16_nt (7 ms)
[ RUN ] Sgemm_128x64x8.sgemm_256x128x16_nt
[ OK ] Sgemm_128x64x8.sgemm_256x128x16_nt (14 ms)
[ RUN ] Sgemm_128x64x8.sgemm_128x64x1_nn
[ OK ] Sgemm_128x64x8.sgemm_128x64x1_nn (3 ms)
[ RUN ] Sgemm_128x64x8.sgemm_128x64x8_nn
[ OK ] Sgemm_128x64x8.sgemm_128x64x8_nn (4 ms)
[ RUN ] Sgemm_128x64x8.sgemm_128x64x9_nn
[ OK ] Sgemm_128x64x8.sgemm_128x64x9_nn (4 ms)
[ RUN ] Sgemm_128x64x8.sgemm_128x64x16_nn
[ OK ] Sgemm_128x64x8.sgemm_128x64x16_nn (4 ms)
[ RUN ] Sgemm_128x64x8.sgemm_128x64x64_nn
[ OK ] Sgemm_128x64x8.sgemm_128x64x64_nn (7 ms)
[ RUN ] Sgemm_128x64x8.sgemm_256x64x16_nn
[ OK ] Sgemm_128x64x8.sgemm_256x64x16_nn (7 ms)
[ RUN ] Sgemm_128x64x8.sgemm_128x128x16_nn
[ OK ] Sgemm_128x64x8.sgemm_128x128x16_nn (7 ms)
[ RUN ] Sgemm_128x64x8.sgemm_256x128x16_nn
[ OK ] Sgemm_128x64x8.sgemm_256x128x16_nn (14 ms)
[ RUN ] Sgemm_128x64x8.sgemm_128x64x1_tn
[ OK ] Sgemm_128x64x8.sgemm_128x64x1_tn (5 ms)
[ RUN ] Sgemm_128x64x8.sgemm_128x64x8_tn
[ OK ] Sgemm_128x64x8.sgemm_128x64x8_tn (4 ms)
[ RUN ] Sgemm_128x64x8.sgemm_128x64x9_tn
[ OK ] Sgemm_128x64x8.sgemm_128x64x9_tn (4 ms)
[ RUN ] Sgemm_128x64x8.sgemm_128x64x16_tn
[ OK ] Sgemm_128x64x8.sgemm_128x64x16_tn (4 ms)
[ RUN ] Sgemm_128x64x8.sgemm_128x64x64_tn
[ OK ] Sgemm_128x64x8.sgemm_128x64x64_tn (7 ms)
[ RUN ] Sgemm_128x64x8.sgemm_256x64x16_tn
[ OK ] Sgemm_128x64x8.sgemm_256x64x16_tn (7 ms)
[ RUN ] Sgemm_128x64x8.sgemm_128x128x16_tn
[ OK ] Sgemm_128x64x8.sgemm_128x128x16_tn (7 ms)
[ RUN ] Sgemm_128x64x8.sgemm_256x128x16_tn
[ OK ] Sgemm_128x64x8.sgemm_256x128x16_tn (14 ms)
[ RUN ] Sgemm_128x64x8.sgemm_128x64x1_tt
[ OK ] Sgemm_128x64x8.sgemm_128x64x1_tt (4 ms)
[ RUN ] Sgemm_128x64x8.sgemm_128x64x8_tt
[ OK ] Sgemm_128x64x8.sgemm_128x64x8_tt (3 ms)
[ RUN ] Sgemm_128x64x8.sgemm_128x64x9_tt
[ OK ] Sgemm_128x64x8.sgemm_128x64x9_tt (4 ms)
[ RUN ] Sgemm_128x64x8.sgemm_128x64x16_tt
[ OK ] Sgemm_128x64x8.sgemm_128x64x16_tt (5 ms)
[ RUN ] Sgemm_128x64x8.sgemm_128x64x64_tt
[ OK ] Sgemm_128x64x8.sgemm_128x64x64_tt (7 ms)
[ RUN ] Sgemm_128x64x8.sgemm_256x64x16_tt
[ OK ] Sgemm_128x64x8.sgemm_256x64x16_tt (7 ms)
[ RUN ] Sgemm_128x64x8.sgemm_128x128x16_tt
[ OK ] Sgemm_128x64x8.sgemm_128x128x16_tt (7 ms)
[ RUN ] Sgemm_128x64x8.sgemm_256x128x16_tt
[ OK ] Sgemm_128x64x8.sgemm_256x128x16_tt (14 ms)
[ RUN ] Sgemm_128x64x8.sgemm_128x64x64_8x4_accumulators_nt
[ OK ] Sgemm_128x64x8.sgemm_128x64x64_8x4_accumulators_nt (7 ms)
[ RUN ] Sgemm_128x64x8.sgemm_128x64x64_4x8_accumulators_nt
[ OK ] Sgemm_128x64x8.sgemm_128x64x64_4x8_accumulators_nt (7 ms)
[----------] 34 tests from Sgemm_128x64x8 (219 ms total)

[----------] 27 tests from Sgemm_128x64x16
[ RUN ] Sgemm_128x64x16.sgemm_128x64x1_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x64x16.sgemm_128x64x1_nt (45 ms)
[ RUN ] Sgemm_128x64x16.sgemm_128x64x16_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x64x16.sgemm_128x64x16_nt (53 ms)
[ RUN ] Sgemm_128x64x16.sgemm_128x64x17_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x64x16.sgemm_128x64x17_nt (53 ms)
[ RUN ] Sgemm_128x64x16.sgemm_128x64x64_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x64x16.sgemm_128x64x64_nt (61 ms)
[ RUN ] Sgemm_128x64x16.sgemm_256x64x16_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x64x16.sgemm_256x64x16_nt (104 ms)
[ RUN ] Sgemm_128x64x16.sgemm_128x128x16_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x64x16.sgemm_128x128x16_nt (108 ms)
[ RUN ] Sgemm_128x64x16.sgemm_256x128x16_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x64x16.sgemm_256x128x16_nt (203 ms)
[ RUN ] Sgemm_128x64x16.sgemm_128x64x1_nn
[ OK ] Sgemm_128x64x16.sgemm_128x64x1_nn (3 ms)
[ RUN ] Sgemm_128x64x16.sgemm_128x64x8_nn
[ OK ] Sgemm_128x64x16.sgemm_128x64x8_nn (5 ms)
[ RUN ] Sgemm_128x64x16.sgemm_128x64x17_nn
[ OK ] Sgemm_128x64x16.sgemm_128x64x17_nn (4 ms)
[ RUN ] Sgemm_128x64x16.sgemm_128x64x64_nn
[ OK ] Sgemm_128x64x16.sgemm_128x64x64_nn (6 ms)
[ RUN ] Sgemm_128x64x16.sgemm_256x64x16_nn
[ OK ] Sgemm_128x64x16.sgemm_256x64x16_nn (7 ms)
[ RUN ] Sgemm_128x64x16.sgemm_128x128x16_nn
[ OK ] Sgemm_128x64x16.sgemm_128x128x16_nn (6 ms)
[ RUN ] Sgemm_128x64x16.sgemm_256x128x16_nn
[ OK ] Sgemm_128x64x16.sgemm_256x128x16_nn (14 ms)
[ RUN ] Sgemm_128x64x16.sgemm_128x64x1_tn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x64x16.sgemm_128x64x1_tn (91 ms)
[ RUN ] Sgemm_128x64x16.sgemm_128x64x16_tn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x64x16.sgemm_128x64x16_tn (52 ms)
[ RUN ] Sgemm_128x64x16.sgemm_128x64x17_tn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x64x16.sgemm_128x64x17_tn (53 ms)
[ RUN ] Sgemm_128x64x16.sgemm_128x64x64_tn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x64x16.sgemm_128x64x64_tn (60 ms)
[ RUN ] Sgemm_128x64x16.sgemm_256x64x16_tn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x64x16.sgemm_256x64x16_tn (102 ms)
[ RUN ] Sgemm_128x64x16.sgemm_128x128x16_tn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x64x16.sgemm_128x128x16_tn (103 ms)
[ RUN ] Sgemm_128x64x16.sgemm_256x128x16_tn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x64x16.sgemm_256x128x16_tn (203 ms)
[ RUN ] Sgemm_128x64x16.sgemm_128x64x1_tt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x64x16.sgemm_128x64x1_tt (90 ms)
[ RUN ] Sgemm_128x64x16.sgemm_128x64x16_tt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x64x16.sgemm_128x64x16_tt (53 ms)
[ RUN ] Sgemm_128x64x16.sgemm_128x64x17_tt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x64x16.sgemm_128x64x17_tt (53 ms)
[ RUN ] Sgemm_128x64x16.sgemm_128x64x64_tt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x64x16.sgemm_128x64x64_tt (60 ms)
[ RUN ] Sgemm_128x64x16.sgemm_128x128x16_tt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x64x16.sgemm_128x128x16_tt (103 ms)
[ RUN ] Sgemm_128x64x16.sgemm_256x128x16_tt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x64x16.sgemm_256x128x16_tt (203 ms)
[----------] 27 tests from Sgemm_128x64x16 (1902 ms total)

[----------] 32 tests from Sgemm_128x32x8
[ RUN ] Sgemm_128x32x8.sgemm_128x32x1_nt
[ OK ] Sgemm_128x32x8.sgemm_128x32x1_nt (3 ms)
[ RUN ] Sgemm_128x32x8.sgemm_128x32x8_nt
[ OK ] Sgemm_128x32x8.sgemm_128x32x8_nt (3 ms)
[ RUN ] Sgemm_128x32x8.sgemm_128x32x9_nt
[ OK ] Sgemm_128x32x8.sgemm_128x32x9_nt (3 ms)
[ RUN ] Sgemm_128x32x8.sgemm_128x32x16_nt
[ OK ] Sgemm_128x32x8.sgemm_128x32x16_nt (4 ms)
[ RUN ] Sgemm_128x32x8.sgemm_128x32x32_nt
[ OK ] Sgemm_128x32x8.sgemm_128x32x32_nt (4 ms)
[ RUN ] Sgemm_128x32x8.sgemm_256x32x16_nt
[ OK ] Sgemm_128x32x8.sgemm_256x32x16_nt (4 ms)
[ RUN ] Sgemm_128x32x8.sgemm_128x64x16_nt
[ OK ] Sgemm_128x32x8.sgemm_128x64x16_nt (5 ms)
[ RUN ] Sgemm_128x32x8.sgemm_256x64x16_nt
[ OK ] Sgemm_128x32x8.sgemm_256x64x16_nt (6 ms)
[ RUN ] Sgemm_128x32x8.sgemm_128x32x1_nn
[ OK ] Sgemm_128x32x8.sgemm_128x32x1_nn (3 ms)
[ RUN ] Sgemm_128x32x8.sgemm_128x32x8_nn
[ OK ] Sgemm_128x32x8.sgemm_128x32x8_nn (3 ms)
[ RUN ] Sgemm_128x32x8.sgemm_128x32x9_nn
[ OK ] Sgemm_128x32x8.sgemm_128x32x9_nn (4 ms)
[ RUN ] Sgemm_128x32x8.sgemm_128x32x16_nn
[ OK ] Sgemm_128x32x8.sgemm_128x32x16_nn (3 ms)
[ RUN ] Sgemm_128x32x8.sgemm_128x32x32_nn
[ OK ] Sgemm_128x32x8.sgemm_128x32x32_nn (4 ms)
[ RUN ] Sgemm_128x32x8.sgemm_256x32x16_nn
[ OK ] Sgemm_128x32x8.sgemm_256x32x16_nn (4 ms)
[ RUN ] Sgemm_128x32x8.sgemm_128x64x16_nn
[ OK ] Sgemm_128x32x8.sgemm_128x64x16_nn (5 ms)
[ RUN ] Sgemm_128x32x8.sgemm_256x64x16_nn
[ OK ] Sgemm_128x32x8.sgemm_256x64x16_nn (7 ms)
[ RUN ] Sgemm_128x32x8.sgemm_128x32x1_tn
[ OK ] Sgemm_128x32x8.sgemm_128x32x1_tn (4 ms)
[ RUN ] Sgemm_128x32x8.sgemm_128x32x8_tn
[ OK ] Sgemm_128x32x8.sgemm_128x32x8_tn (3 ms)
[ RUN ] Sgemm_128x32x8.sgemm_128x32x9_tn
[ OK ] Sgemm_128x32x8.sgemm_128x32x9_tn (4 ms)
[ RUN ] Sgemm_128x32x8.sgemm_128x32x16_tn
[ OK ] Sgemm_128x32x8.sgemm_128x32x16_tn (3 ms)
[ RUN ] Sgemm_128x32x8.sgemm_128x32x32_tn
[ OK ] Sgemm_128x32x8.sgemm_128x32x32_tn (4 ms)
[ RUN ] Sgemm_128x32x8.sgemm_256x32x16_tn
[ OK ] Sgemm_128x32x8.sgemm_256x32x16_tn (4 ms)
[ RUN ] Sgemm_128x32x8.sgemm_128x64x16_tn
[ OK ] Sgemm_128x32x8.sgemm_128x64x16_tn (5 ms)
[ RUN ] Sgemm_128x32x8.sgemm_256x64x16_tn
[ OK ] Sgemm_128x32x8.sgemm_256x64x16_tn (7 ms)
[ RUN ] Sgemm_128x32x8.sgemm_128x32x1_tt
[ OK ] Sgemm_128x32x8.sgemm_128x32x1_tt (4 ms)
[ RUN ] Sgemm_128x32x8.sgemm_128x32x8_tt
[ OK ] Sgemm_128x32x8.sgemm_128x32x8_tt (3 ms)
[ RUN ] Sgemm_128x32x8.sgemm_128x32x9_tt
[ OK ] Sgemm_128x32x8.sgemm_128x32x9_tt (3 ms)
[ RUN ] Sgemm_128x32x8.sgemm_128x32x16_tt
[ OK ] Sgemm_128x32x8.sgemm_128x32x16_tt (4 ms)
[ RUN ] Sgemm_128x32x8.sgemm_128x32x32_tt
[ OK ] Sgemm_128x32x8.sgemm_128x32x32_tt (4 ms)
[ RUN ] Sgemm_128x32x8.sgemm_256x32x16_tt
[ OK ] Sgemm_128x32x8.sgemm_256x32x16_tt (4 ms)
[ RUN ] Sgemm_128x32x8.sgemm_128x64x16_tt
[ OK ] Sgemm_128x32x8.sgemm_128x64x16_tt (5 ms)
[ RUN ] Sgemm_128x32x8.sgemm_256x64x16_tt
[ OK ] Sgemm_128x32x8.sgemm_256x64x16_tt (7 ms)
[----------] 32 tests from Sgemm_128x32x8 (133 ms total)

[----------] 28 tests from Sgemm_128x32x16
[ RUN ] Sgemm_128x32x16.sgemm_128x32x1_nt
[ OK ] Sgemm_128x32x16.sgemm_128x32x1_nt (2 ms)
[ RUN ] Sgemm_128x32x16.sgemm_128x32x16_nt
[ OK ] Sgemm_128x32x16.sgemm_128x32x16_nt (3 ms)
[ RUN ] Sgemm_128x32x16.sgemm_128x32x17_nt
[ OK ] Sgemm_128x32x16.sgemm_128x32x17_nt (3 ms)
[ RUN ] Sgemm_128x32x16.sgemm_128x32x32_nt
[ OK ] Sgemm_128x32x16.sgemm_128x32x32_nt (4 ms)
[ RUN ] Sgemm_128x32x16.sgemm_256x32x16_nt
[ OK ] Sgemm_128x32x16.sgemm_256x32x16_nt (5 ms)
[ RUN ] Sgemm_128x32x16.sgemm_128x64x16_nt
[ OK ] Sgemm_128x32x16.sgemm_128x64x16_nt (4 ms)
[ RUN ] Sgemm_128x32x16.sgemm_256x64x16_nt
[ OK ] Sgemm_128x32x16.sgemm_256x64x16_nt (7 ms)
[ RUN ] Sgemm_128x32x16.sgemm_128x32x1_nn
[ OK ] Sgemm_128x32x16.sgemm_128x32x1_nn (3 ms)
[ RUN ] Sgemm_128x32x16.sgemm_128x32x16_nn
[ OK ] Sgemm_128x32x16.sgemm_128x32x16_nn (3 ms)
[ RUN ] Sgemm_128x32x16.sgemm_128x32x17_nn
[ OK ] Sgemm_128x32x16.sgemm_128x32x17_nn (3 ms)
[ RUN ] Sgemm_128x32x16.sgemm_128x32x32_nn
[ OK ] Sgemm_128x32x16.sgemm_128x32x32_nn (4 ms)
[ RUN ] Sgemm_128x32x16.sgemm_256x32x16_nn
[ OK ] Sgemm_128x32x16.sgemm_256x32x16_nn (5 ms)
[ RUN ] Sgemm_128x32x16.sgemm_128x64x16_nn
[ OK ] Sgemm_128x32x16.sgemm_128x64x16_nn (4 ms)
[ RUN ] Sgemm_128x32x16.sgemm_256x64x16_nn
[ OK ] Sgemm_128x32x16.sgemm_256x64x16_nn (7 ms)
[ RUN ] Sgemm_128x32x16.sgemm_128x32x1_tn
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x32x16.sgemm_128x32x1_tn (90 ms)
[ RUN ] Sgemm_128x32x16.sgemm_128x32x16_tn
[ OK ] Sgemm_128x32x16.sgemm_128x32x16_tn (3 ms)
[ RUN ] Sgemm_128x32x16.sgemm_128x32x17_tn
[ OK ] Sgemm_128x32x16.sgemm_128x32x17_tn (3 ms)
[ RUN ] Sgemm_128x32x16.sgemm_128x32x32_tn
[ OK ] Sgemm_128x32x16.sgemm_128x32x32_tn (4 ms)
[ RUN ] Sgemm_128x32x16.sgemm_256x32x16_tn
[ OK ] Sgemm_128x32x16.sgemm_256x32x16_tn (4 ms)
[ RUN ] Sgemm_128x32x16.sgemm_128x64x16_tn
[ OK ] Sgemm_128x32x16.sgemm_128x64x16_tn (5 ms)
[ RUN ] Sgemm_128x32x16.sgemm_256x64x16_tn
[ OK ] Sgemm_128x32x16.sgemm_256x64x16_tn (6 ms)
[ RUN ] Sgemm_128x32x16.sgemm_128x32x1_tt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_128x32x16.sgemm_128x32x1_tt (90 ms)
[ RUN ] Sgemm_128x32x16.sgemm_128x32x16_tt
[ OK ] Sgemm_128x32x16.sgemm_128x32x16_tt (4 ms)
[ RUN ] Sgemm_128x32x16.sgemm_128x32x17_tt
[ OK ] Sgemm_128x32x16.sgemm_128x32x17_tt (3 ms)
[ RUN ] Sgemm_128x32x16.sgemm_128x32x32_tt
[ OK ] Sgemm_128x32x16.sgemm_128x32x32_tt (4 ms)
[ RUN ] Sgemm_128x32x16.sgemm_256x32x16_tt
[ OK ] Sgemm_128x32x16.sgemm_256x32x16_tt (4 ms)
[ RUN ] Sgemm_128x32x16.sgemm_128x64x16_tt
[ OK ] Sgemm_128x32x16.sgemm_128x64x16_tt (4 ms)
[ RUN ] Sgemm_128x32x16.sgemm_256x64x16_tt
[ OK ] Sgemm_128x32x16.sgemm_256x64x16_tt (7 ms)
[----------] 28 tests from Sgemm_128x32x16 (290 ms total)

[----------] 1 test from Sgemm_64x128x8
[ RUN ] Sgemm_64x128x8.sgemm_64x128x64_4x8_accumulators_nt
[ OK ] Sgemm_64x128x8.sgemm_64x128x64_4x8_accumulators_nt (10 ms)
[----------] 1 test from Sgemm_64x128x8 (10 ms total)

[----------] 1 test from Sgemm_64x128x16
[ RUN ] Sgemm_64x128x16.sgemm_64x128x64_4x8_accumulators_nt
/home/nvidia/Documents/cutlass/tools/test/unit/gemm/gemm.h:95: Failure
Value of: testbed.verify_with_cublas()
Actual: false
Expected: true
[ FAILED ] Sgemm_64x128x16.sgemm_64x128x64_4x8_accumulators_nt (64 ms)
[----------] 1 test from Sgemm_64x128x16 (64 ms total)

[----------] 32 tests from Sgemm_64x64x8
[ RUN ] Sgemm_64x64x8.sgemm_64x64x1_nt
[ OK ] Sgemm_64x64x8.sgemm_64x64x1_nt (3 ms)
[ RUN ] Sgemm_64x64x8.sgemm_64x64x8_nt
[ OK ] Sgemm_64x64x8.sgemm_64x64x8_nt (3 ms)
[ RUN ] Sgemm_64x64x8.sgemm_64x64x9_nt
[ OK ] Sgemm_64x64x8.sgemm_64x64x9_nt (3 ms)
[ RUN ] Sgemm_64x64x8.sgemm_64x64x16_nt
[ OK ] Sgemm_64x64x8.sgemm_64x64x16_nt (3 ms)
[ RUN ] Sgemm_64x64x8.sgemm_64x64x64_nt
[ OK ] Sgemm_64x64x8.sgemm_64x64x64_nt (4 ms)
[ RUN ] Sgemm_64x64x8.sgemm_128x64x16_nt
[ OK ] Sgemm_64x64x8.sgemm_128x64x16_nt (5 ms)
[ RUN ] Sgemm_64x64x8.sgemm_64x128x16_nt
[ OK ] Sgemm_64x64x8.sgemm_64x128x16_nt (4 ms)
[ RUN ] Sgemm_64x64x8.sgemm_128x128x16_nt
[ OK ] Sgemm_64x64x8.sgemm_128x128x16_nt (6 ms)
[ RUN ] Sgemm_64x64x8.sgemm_64x64x1_nn
[ OK ] Sgemm_64x64x8.sgemm_64x64x1_nn (3 ms)
[ RUN ] Sgemm_64x64x8.sgemm_64x64x8_nn
[ OK ] Sgemm_64x64x8.sgemm_64x64x8_nn (3 ms)
[ RUN ] Sgemm_64x64x8.sgemm_64x64x9_nn
[ OK ] Sgemm_64x64x8.sgemm_64x64x9_nn (3 ms)
[ RUN ] Sgemm_64x64x8.sgemm_64x64x16_nn
[ OK ] Sgemm_64x64x8.sgemm_64x64x16_nn (3 ms)
[ RUN ] Sgemm_64x64x8.sgemm_64x64x64_nn
[ OK ] Sgemm_64x64x8.sgemm_64x64x64_nn (5 ms)
[ RUN ] Sgemm_64x64x8.sgemm_128x64x16_nn
[ OK ] Sgemm_64x64x8.sgemm_128x64x16_nn (4 ms)
[ RUN ] Sgemm_64x64x8.sgemm_64x128x16_nn
[ OK ] Sgemm_64x64x8.sgemm_64x128x16_nn (4 ms)
[ RUN ] Sgemm_64x64x8.sgemm_128x128x16_nn
[ OK ] Sgemm_64x64x8.sgemm_128x128x16_nn (6 ms)
[ RUN ] Sgemm_64x64x8.sgemm_64x64x1_tn
[ OK ] Sgemm_64x64x8.sgemm_64x64x1_tn (3 ms)
[ RUN ] Sgemm_64x64x8.sgemm_64x64x8_tn
[ OK ] Sgemm_64x64x8.sgemm_64x64x8_tn (3 ms)
[ RUN ] Sgemm_64x64x8.sgemm_64x64x9_tn
[ OK ] Sgemm_64x64x8.sgemm_64x64x9_tn (3 ms)
[ RUN ] Sgemm_64x64x8.sgemm_64x64x16_tn
[ OK ] Sgemm_64x64x8.sgemm_64x64x16_tn (4 ms)
[ RUN ] Sgemm_64x64x8.sgemm_64x64x64_tn
[ OK ] Sgemm_64x64x8.sgemm_64x64x64_tn (4 ms)
[ RUN ] Sgemm_64x64x8.sgemm_128x64x16_tn
[ OK ] Sgemm_64x64x8.sgemm_128x64x16_tn (4 ms)
[ RUN ] Sgemm_64x64x8.sgemm_64x128x16_tn
[ OK ] Sgemm_64x64x8.sgemm_64x128x16_tn (5 ms)
[ RUN ] Sgemm_64x64x8.sgemm_128x128x16_tn
[ OK ] Sgemm_64x64x8.sgemm_128x128x16_tn (6 ms)
[ RUN ] Sgemm_64x64x8.sgemm_64x64x1_tt
[ OK ] Sgemm_64x64x8.sgemm_64x64x1_tt (3 ms)
[ RUN ] Sgemm_64x64x8.sgemm_64x64x8_tt
[ OK ] Sgemm_64x64x8.sgemm_64x64x8_tt (2 ms)
[ RUN ] Sgemm_64x64x8.sgemm_64x64x9_tt
[ OK ] Sgemm_64x64x8.sgemm_64x64x9_tt (4 ms)
[ RUN ] Sgemm_64x64x8.sgemm_64x64x16_tt
[ OK ] Sgemm_64x64x8.sgemm_64x64x16_tt (3 ms)
[ RUN ] Sgemm_64x64x8.sgemm_64x64x64_tt
[ OK ] Sgemm_64x64x8.sgemm_64x64x64_tt (4 ms)
[ RUN ] Sgemm_64x64x8.sgemm_128x64x16_tt
[ OK ] Sgemm_64x64x8.sgemm_128x64x16_tt (4 ms)
[ RUN ] Sgemm_64x64x8.sgemm_64x128x16_tt
[ OK ] Sgemm_64x64x8.sgemm_64x128x16_tt (4 ms)
[ RUN ] Sgemm_64x64x8.sgemm_128x128x16_tt
[ OK ] Sgemm_64x64x8.sgemm_128x128x16_tt (6 ms)
[----------] 32 tests from Sgemm_64x64x8 (125 ms total)

[----------] 28 tests from Sgemm_64x64x16
[ RUN ] Sgemm_64x64x16.sgemm_64x64x1_nt
[ OK ] Sgemm_64x64x16.sgemm_64x64x1_nt (3 ms)
[ RUN ] Sgemm_64x64x16.sgemm_64x64x16_nt
[ OK ] Sgemm_64x64x16.sgemm_64x64x16_nt (3 ms)
[ RUN ] Sgemm_64x64x16.sgemm_64x64x17_nt
[ OK ] Sgemm_64x64x16.sgemm_64x64x17_nt (3 ms)
[ RUN ] Sgemm_64x64x16.sgemm_64x64x64_nt
[ OK ] Sgemm_64x64x16.sgemm_64x64x64_nt (4 ms)
[ RUN ] Sgemm_64x64x16.sgemm_128x64x16_nt
[ OK ] Sgemm_64x64x16.sgemm_128x64x16_nt (4 ms)
[ RUN ] Sgemm_64x64x16.sgemm_64x128x16_nt
[ OK ] Sgemm_64x64x16.sgemm_64x128x16_nt (5 ms)
[ RUN ] Sgemm_64x64x16.sgemm_128x128x16_nt
[ OK ] Sgemm_64x64x16.sgemm_128x128x16_nt (6 ms)
[ RUN ] Sgemm_64x64x16.sgemm_64x64x1_nn
[ OK ] Sgemm_64x64x16.sgemm_64x64x1_nn (3 ms)
[ RUN ] Sgemm_64x64x16.sgemm_64x64x16_nn
[ OK ] Sgemm_64x64x16.sgemm_64x64x16_nn (3 ms)
[ RUN ] Sgemm_64x64x16.sgemm_64x64x17_nn
[ OK ] Sgemm_64x64x16.sgemm_64x64x17_nn (3 ms)
[ RUN ] Sgemm_64x64x16.sgemm_64x64x64_nn
[ OK ] Sgemm_64x64x16.sgemm_64x64x64_nn (4 ms)
[ RUN ] Sgemm_64x64x16.sgemm_128x64x16_nn
[ OK ] Sgemm_64x64x16.sgemm_128x64x16_nn (4 ms)
[ RUN ] Sgemm_64x64x16.sgemm_64x128x16_nn
[ OK ] Sgemm_64x64x16.sgemm_64x128x16_nn (4 ms)
[ RUN ] Sgemm_64x64x16.sgemm_128x128x16_nn
[ OK ] Sgemm_64x64x16.sgemm_128x128x16_nn (6 ms)
[ RUN ] Sgemm_64x64x16.sgemm_64x64x1_tn
[ OK ] Sgemm_64x64x16.sgemm_64x64x1_tn (3 ms)
[ RUN ] Sgemm_64x64x16.sgemm_64x64x16_tn
[ OK ] Sgemm_64x64x16.sgemm_64x64x16_tn (3 ms)
[ RUN ] Sgemm_64x64x16.sgemm_64x64x17_tn
[ OK ] Sgemm_64x64x16.sgemm_64x64x17_tn (3 ms)
[ RUN ] Sgemm_64x64x16.sgemm_64x64x64_tn
[ OK ] Sgemm_64x64x16.sgemm_64x64x64_tn (4 ms)
[ RUN ] Sgemm_64x64x16.sgemm_128x64x16_tn
[ OK ] Sgemm_64x64x16.sgemm_128x64x16_tn (5 ms)
[ RUN ] Sgemm_64x64x16.sgemm_64x128x16_tn
[ OK ] Sgemm_64x64x16.sgemm_64x128x16_tn (4 ms)
[ RUN ] Sgemm_64x64x16.sgemm_128x128x16_tn
[ OK ] Sgemm_64x64x16.sgemm_128x128x16_tn (7 ms)
[ RUN ] Sgemm_64x64x16.sgemm_64x64x1_tt
[ OK ] Sgemm_64x64x16.sgemm_64x64x1_tt (3 ms)
[ RUN ] Sgemm_64x64x16.sgemm_64x64x16_tt
[ OK ] Sgemm_64x64x16.sgemm_64x64x16_tt (4 ms)
[ RUN ] Sgemm_64x64x16.sgemm_64x64x17_tt
[ OK ] Sgemm_64x64x16.sgemm_64x64x17_tt (3 ms)
[ RUN ] Sgemm_64x64x16.sgemm_64x64x64_tt
[ OK ] Sgemm_64x64x16.sgemm_64x64x64_tt (6 ms)
[ RUN ] Sgemm_64x64x16.sgemm_128x64x16_tt
[ OK ] Sgemm_64x64x16.sgemm_128x64x16_tt (5 ms)
[ RUN ] Sgemm_64x64x16.sgemm_64x128x16_tt
[ OK ] Sgemm_64x64x16.sgemm_64x128x16_tt (5 ms)
[ RUN ] Sgemm_64x64x16.sgemm_128x128x16_tt
[ OK ] Sgemm_64x64x16.sgemm_128x128x16_tt (5 ms)
[----------] 28 tests from Sgemm_64x64x16 (117 ms total)

[----------] 31 tests from Sgemm_64x32x8
[ RUN ] Sgemm_64x32x8.sgemm_64x32x1_nt
[ OK ] Sgemm_64x32x8.sgemm_64x32x1_nt (2 ms)
[ RUN ] Sgemm_64x32x8.sgemm_64x32x8_nt
[ OK ] Sgemm_64x32x8.sgemm_64x32x8_nt (3 ms)
[ RUN ] Sgemm_64x32x8.sgemm_64x32x9_nt
[ OK ] Sgemm_64x32x8.sgemm_64x32x9_nt (2 ms)
[ RUN ] Sgemm_64x32x8.sgemm_64x32x16_nt
[ OK ] Sgemm_64x32x8.sgemm_64x32x16_nt (2 ms)
[ RUN ] Sgemm_64x32x8.sgemm_64x32x64_nt
[ OK ] Sgemm_64x32x8.sgemm_64x32x64_nt (4 ms)
[ RUN ] Sgemm_64x32x8.sgemm_128x32x16_nt
[ OK ] Sgemm_64x32x8.sgemm_128x32x16_nt (3 ms)
[ RUN ] Sgemm_64x32x8.sgemm_64x64x16_nt
[ OK ] Sgemm_64x32x8.sgemm_64x64x16_nt (3 ms)
[ RUN ] Sgemm_64x32x8.sgemm_128x64x16_nt
[ OK ] Sgemm_64x32x8.sgemm_128x64x16_nt (4 ms)
[ RUN ] Sgemm_64x32x8.sgemm_64x32x1_nn
[ OK ] Sgemm_64x32x8.sgemm_64x32x1_nn (3 ms)
[ RUN ] Sgemm_64x32x8.sgemm_64x32x8_nn
[ OK ] Sgemm_64x32x8.sgemm_64x32x8_nn (3 ms)
[ RUN ] Sgemm_64x32x8.sgemm_64x32x9_nn
[ OK ] Sgemm_64x32x8.sgemm_64x32x9_nn (2 ms)
[ RUN ] Sgemm_64x32x8.sgemm_64x32x16_nn
[ OK ] Sgemm_64x32x8.sgemm_64x32x16_nn (3 ms)
[ RUN ] Sgemm_64x32x8.sgemm_64x32x64_nn
[ OK ] Sgemm_64x32x8.sgemm_64x32x64_nn (3 ms)
[ RUN ] Sgemm_64x32x8.sgemm_128x32x16_nn
[ OK ] Sgemm_64x32x8.sgemm_128x32x16_nn (3 ms)
[ RUN ] Sgemm_64x32x8.sgemm_64x64x16_nn
[ OK ] Sgemm_64x32x8.sgemm_64x64x16_nn (4 ms)
[ RUN ] Sgemm_64x32x8.sgemm_128x64x16_nn
[ OK ] Sgemm_64x32x8.sgemm_128x64x16_nn (4 ms)
[ RUN ] Sgemm_64x32x8.sgemm_64x32x8_tn
[ OK ] Sgemm_64x32x8.sgemm_64x32x8_tn (3 ms)
[ RUN ] Sgemm_64x32x8.sgemm_64x32x9_tn
[ OK ] Sgemm_64x32x8.sgemm_64x32x9_tn (2 ms)
[ RUN ] Sgemm_64x32x8.sgemm_64x32x16_tn
[ OK ] Sgemm_64x32x8.sgemm_64x32x16_tn (3 ms)
[ RUN ] Sgemm_64x32x8.sgemm_64x32x64_tn
[ OK ] Sgemm_64x32x8.sgemm_64x32x64_tn (3 ms)
[ RUN ] Sgemm_64x32x8.sgemm_128x32x16_tn
[ OK ] Sgemm_64x32x8.sgemm_128x32x16_tn (3 ms)
[ RUN ] Sgemm_64x32x8.sgemm_64x64x16_tn
[ OK ] Sgemm_64x32x8.sgemm_64x64x16_tn (3 ms)
[ RUN ] Sgemm_64x32x8.sgemm_128x64x16_tn
[ OK ] Sgemm_64x32x8.sgemm_128x64x16_tn (4 ms)
[ RUN ] Sgemm_64x32x8.sgemm_64x64x1_tt
[ OK ] Sgemm_64x32x8.sgemm_64x64x1_tt (3 ms)
[ RUN ] Sgemm_64x32x8.sgemm_64x32x8_tt
[ OK ] Sgemm_64x32x8.sgemm_64x32x8_tt (3 ms)
[ RUN ] Sgemm_64x32x8.sgemm_64x32x9_tt
[ OK ] Sgemm_64x32x8.sgemm_64x32x9_tt (2 ms)
[ RUN ] Sgemm_64x32x8.sgemm_64x32x16_tt
[ OK ] Sgemm_64x32x8.sgemm_64x32x16_tt (3 ms)
[ RUN ] Sgemm_64x32x8.sgemm_64x32x64_tt
[ OK ] Sgemm_64x32x8.sgemm_64x32x64_tt (3 ms)
[ RUN ] Sgemm_64x32x8.sgemm_128x32x16_tt
[ OK ] Sgemm_64x32x8.sgemm_128x32x16_tt (4 ms)
[ RUN ] Sgemm_64x32x8.sgemm_64x64x16_tt
[ OK ] Sgemm_64x32x8.sgemm_64x64x16_tt (3 ms)
[ RUN ] Sgemm_64x32x8.sgemm_128x64x16_tt
[ OK ] Sgemm_64x32x8.sgemm_128x64x16_tt (4 ms)
[----------] 31 tests from Sgemm_64x32x8 (96 ms total)

[----------] 26 tests from Sgemm_64x32x16
[ RUN ] Sgemm_64x32x16.sgemm_64x32x1_nt
[ OK ] Sgemm_64x32x16.sgemm_64x32x1_nt (3 ms)
[ RUN ] Sgemm_64x32x16.sgemm_64x32x16_nt
[ OK ] Sgemm_64x32x16.sgemm_64x32x16_nt (2 ms)
[ RUN ] Sgemm_64x32x16.sgemm_64x32x17_nt
[ OK ] Sgemm_64x32x16.sgemm_64x32x17_nt (2 ms)
[ RUN ] Sgemm_64x32x16.sgemm_64x32x64_nt
[ OK ] Sgemm_64x32x16.sgemm_64x32x64_nt (3 ms)
[ RUN ] Sgemm_64x32x16.sgemm_128x32x16_nt
[ OK ] Sgemm_64x32x16.sgemm_128x32x16_nt (4 ms)
[ RUN ] Sgemm_64x32x16.sgemm_64x64x16_nt
[ OK ] Sgemm_64x32x16.sgemm_64x64x16_nt (3 ms)
[ RUN ] Sgemm_64x32x16.sgemm_128x64x16_nt
[ OK ] Sgemm_64x32x16.sgemm_128x64x16_nt (4 ms)
[ RUN ] Sgemm_64x32x16.sgemm_64x32x1_nn
[ OK ] Sgemm_64x32x16.sgemm_64x32x1_nn (3 ms)
[ RUN ] Sgemm_64x32x16.sgemm_64x32x16_nn
[ OK ] Sgemm_64x32x16.sgemm_64x32x16_nn (3 ms)
[ RUN ] Sgemm_64x32x16.sgemm_64x32x17_nn
[ OK ] Sgemm_64x32x16.sgemm_64x32x17_nn (3 ms)
[ RUN ] Sgemm_64x32x16.sgemm_64x32x64_nn
[ OK ] Sgemm_64x32x16.sgemm_64x32x64_nn (3 ms)
[ RUN ] Sgemm_64x32x16.sgemm_128x32x16_nn
[ OK ] Sgemm_64x32x16.sgemm_128x32x16_nn (3 ms)
[ RUN ] Sgemm_64x32x16.sgemm_64x64x16_nn
[ OK ] Sgemm_64x32x16.sgemm_64x64x16_nn (3 ms)
[ RUN ] Sgemm_64x32x16.sgemm_128x64x16_nn
[ OK ] Sgemm_64x32x16.sgemm_128x64x16_nn (4 ms)
[ RUN ] Sgemm_64x32x16.sgemm_64x32x16_tn
[ OK ] Sgemm_64x32x16.sgemm_64x32x16_tn (3 ms)
[ RUN ] Sgemm_64x32x16.sgemm_64x32x17_tn
[ OK ] Sgemm_64x32x16.sgemm_64x32x17_tn (3 ms)
[ RUN ] Sgemm_64x32x16.sgemm_64x32x64_tn
[ OK ] Sgemm_64x32x16.sgemm_64x32x64_tn (3 ms)
[ RUN ] Sgemm_64x32x16.sgemm_128x32x16_tn
[ OK ] Sgemm_64x32x16.sgemm_128x32x16_tn (3 ms)
[ RUN ] Sgemm_64x32x16.sgemm_64x64x16_tn
[ OK ] Sgemm_64x32x16.sgemm_64x64x16_tn (3 ms)
[ RUN ] Sgemm_64x32x16.sgemm_128x64x16_tn
[ OK ] Sgemm_64x32x16.sgemm_128x64x16_tn (4 ms)
[ RUN ] Sgemm_64x32x16.sgemm_64x64x1_tt
[ OK ] Sgemm_64x32x16.sgemm_64x64x1_tt (3 ms)
[ RUN ] Sgemm_64x32x16.sgemm_64x32x16_tt
[ OK ] Sgemm_64x32x16.sgemm_64x32x16_tt (2 ms)
[ RUN ] Sgemm_64x32x16.sgemm_64x32x17_tt
[ OK ] Sgemm_64x32x16.sgemm_64x32x17_tt (3 ms)
[ RUN ] Sgemm_64x32x16.sgemm_128x32x16_tt
[ OK ] Sgemm_64x32x16.sgemm_128x32x16_tt (3 ms)
[ RUN ] Sgemm_64x32x16.sgemm_64x64x16_tt
[ OK ] Sgemm_64x32x16.sgemm_64x64x16_tt (3 ms)
[ RUN ] Sgemm_64x32x16.sgemm_128x64x16_tt
[ OK ] Sgemm_64x32x16.sgemm_128x64x16_tt (5 ms)
[----------] 26 tests from Sgemm_64x32x16 (83 ms total)

[----------] Global test environment tear-down
[==========] 684 tests from 33 test cases ran. (33523 ms total)
[ PASSED ] 404 tests.
[ FAILED ] 280 tests, listed below:
[ FAILED ] Dgemm_128x128x8.dgemm_128x128x8_nt
[ FAILED ] Dgemm_128x128x8.dgemm_512x256x64_nt
[ FAILED ] Dgemm_128x128x8.dgemm_128x128x8_nn
[ FAILED ] Dgemm_128x128x8.dgemm_512x256x64_nn
[ FAILED ] Dgemm_128x128x8.dgemm_128x128x8_tn
[ FAILED ] Dgemm_128x128x8.dgemm_512x256x64_tn
[ FAILED ] Dgemm_128x128x8.dgemm_128x128x8_tt
[ FAILED ] Dgemm_128x128x8.dgemm_512x256x64_tt
[ FAILED ] Hgemm_128x128x16.hgemm_2x2x2_nt
[ FAILED ] Hgemm_128x128x16.hgemm_128x128x8_nt
[ FAILED ] Hgemm_128x128x16.hgemm_128x128x16_nt
[ FAILED ] Hgemm_128x128x16.hgemm_128x128x17_nt
[ FAILED ] Hgemm_128x128x16.hgemm_128x128x64_nt
[ FAILED ] Hgemm_128x128x16.hgemm_256x128x16_nt
[ FAILED ] Hgemm_128x128x16.hgemm_128x256x16_nt
[ FAILED ] Hgemm_128x128x16.hgemm_256x256x16_nt
[ FAILED ] Hgemm_128x128x16.hgemm_128x128x16_nn
[ FAILED ] Hgemm_128x128x16.hgemm_128x128x18_nn
[ FAILED ] Hgemm_128x128x16.hgemm_128x128x64_nn
[ FAILED ] Hgemm_128x128x16.hgemm_256x128x16_nn
[ FAILED ] Hgemm_128x128x16.hgemm_128x256x16_nn
[ FAILED ] Hgemm_128x128x16.hgemm_256x256x16_nn
[ FAILED ] Hgemm_128x128x16.hgemm_128x128x16_tn
[ FAILED ] Hgemm_128x128x16.hgemm_128x128x18_tn
[ FAILED ] Hgemm_128x128x16.hgemm_128x128x64_tn
[ FAILED ] Hgemm_128x128x16.hgemm_256x128x16_tn
[ FAILED ] Hgemm_128x128x16.hgemm_128x256x16_tn
[ FAILED ] Hgemm_128x128x16.hgemm_256x256x16_tn
[ FAILED ] Hgemm_128x128x16.hgemm_128x128x16_tt
[ FAILED ] Hgemm_128x128x16.hgemm_128x128x18_tt
[ FAILED ] Hgemm_128x128x16.hgemm_128x128x64_tt
[ FAILED ] Hgemm_128x128x16.hgemm_256x128x16_tt
[ FAILED ] Hgemm_128x128x16.hgemm_128x256x16_tt
[ FAILED ] Hgemm_128x128x16.hgemm_256x256x16_tt
[ FAILED ] Hgemm_128x128x16.hgemm_128x128x16_alpha2_nt
[ FAILED ] Hgemm_128x128x16.hgemm_128x128x16_beta1_nt
[ FAILED ] Hgemm_128x128x16.hgemm_128x128x16_alpha2_beta1_nt
[ FAILED ] Hgemm_128x128x16.hgemm_508x252x120_ragged_nt
[ FAILED ] Hgemm_128x128x16.hgemm_124x126x32_ragged_nt
[ FAILED ] Hgemm_128x128x16.hgemm_124x126x32_ragged_alpha2_beta1_nt
[ FAILED ] Igemm_128x128x32.igemm_128x128x4_nt
[ FAILED ] Igemm_128x128x32.igemm_128x128x32_nt
[ FAILED ] Igemm_128x128x32.igemm_128x128x36_nt
[ FAILED ] Igemm_128x128x32.igemm_128x128x64_nt
[ FAILED ] Igemm_128x128x32.igemm_128x128x256_nt
[ FAILED ] Igemm_128x128x32.igemm_256x128x64_nt
[ FAILED ] Igemm_128x128x32.igemm_128x256x64_nt
[ FAILED ] Igemm_128x128x32.igemm_256x256x64_nt
[ FAILED ] Igemm_128x128x32.igemm_128x128x4_nn
[ FAILED ] Igemm_128x128x32.igemm_128x128x32_nn
[ FAILED ] Igemm_128x128x32.igemm_128x128x36_nn
[ FAILED ] Igemm_128x128x32.igemm_128x128x64_nn
[ FAILED ] Igemm_128x128x32.igemm_128x128x256_nn
[ FAILED ] Igemm_128x128x32.igemm_256x128x64_nn
[ FAILED ] Igemm_128x128x32.igemm_128x256x64_nn
[ FAILED ] Igemm_128x128x32.igemm_256x256x64_nn
[ FAILED ] Igemm_128x128x32.igemm_128x128x4_tn
[ FAILED ] Igemm_128x128x32.igemm_128x128x32_tn
[ FAILED ] Igemm_128x128x32.igemm_128x128x36_tn
[ FAILED ] Igemm_128x128x32.igemm_128x128x64_tn
[ FAILED ] Igemm_128x128x32.igemm_128x128x256_tn
[ FAILED ] Igemm_128x128x32.igemm_256x128x64_tn
[ FAILED ] Igemm_128x128x32.igemm_128x256x64_tn
[ FAILED ] Igemm_128x128x32.igemm_256x256x64_tn
[ FAILED ] Igemm_128x128x32.igemm_128x128x4_tt
[ FAILED ] Igemm_128x128x32.igemm_128x128x32_tt
[ FAILED ] Igemm_128x128x32.igemm_128x128x36_tt
[ FAILED ] Igemm_128x128x32.igemm_128x128x64_tt
[ FAILED ] Igemm_128x128x32.igemm_128x128x256_tt
[ FAILED ] Igemm_128x128x32.igemm_256x128x64_tt
[ FAILED ] Igemm_128x128x32.igemm_128x256x64_tt
[ FAILED ] Igemm_128x128x32.igemm_256x256x64_tt
[ FAILED ] Igemm_128x64x32.Igemm_128x64x4_nt
[ FAILED ] Igemm_128x64x32.igemm_128x64x32_nt
[ FAILED ] Igemm_128x64x32.igemm_128x64x36_nt
[ FAILED ] Igemm_128x64x32.igemm_128x64x64_nt
[ FAILED ] Igemm_128x64x32.igemm_128x64x256_nt
[ FAILED ] Igemm_128x64x32.igemm_256x64x64_nt
[ FAILED ] Igemm_128x64x32.igemm_128x128x64_nt
[ FAILED ] Igemm_128x64x32.igemm_256x128x64_nt
[ FAILED ] Igemm_128x64x32.igemm_128x64x4_nn
[ FAILED ] Igemm_128x64x32.igemm_128x64x32_nn
[ FAILED ] Igemm_128x64x32.igemm_128x64x36_nn
[ FAILED ] Igemm_128x64x32.igemm_128x64x64_nn
[ FAILED ] Igemm_128x64x32.igemm_128x64x256_nn
[ FAILED ] Igemm_128x64x32.igemm_256x64x64_nn
[ FAILED ] Igemm_128x64x32.igemm_128x128x64_nn
[ FAILED ] Igemm_128x64x32.igemm_256x128x64_nn
[ FAILED ] Igemm_128x64x32.igemm_128x64x4_tn
[ FAILED ] Igemm_128x64x32.igemm_128x64x32_tn
[ FAILED ] Igemm_128x64x32.igemm_128x64x36_tn
[ FAILED ] Igemm_128x64x32.igemm_128x64x64_tn
[ FAILED ] Igemm_128x64x32.igemm_128x64x256_tn
[ FAILED ] Igemm_128x64x32.igemm_256x64x64_tn
[ FAILED ] Igemm_128x64x32.igemm_128x128x64_tn
[ FAILED ] Igemm_128x64x32.igemm_256x128x64_tn
[ FAILED ] Igemm_128x64x32.igemm_128x64x4_tt
[ FAILED ] Igemm_128x64x32.igemm_128x64x32_tt
[ FAILED ] Igemm_128x64x32.igemm_128x64x36_tt
[ FAILED ] Igemm_128x64x32.igemm_128x64x64_tt
[ FAILED ] Igemm_128x64x32.igemm_128x64x256_tt
[ FAILED ] Igemm_128x64x32.igemm_256x64x64_tt
[ FAILED ] Igemm_128x64x32.igemm_128x128x64_tt
[ FAILED ] Igemm_128x64x32.igemm_256x128x64_tt
[ FAILED ] Igemm_128x32x32.igemm_128x32x32x4_nt
[ FAILED ] Igemm_128x32x32.igemm_128x32x32_nt
[ FAILED ] Igemm_128x32x32.igemm_128x32x36_nt
[ FAILED ] Igemm_128x32x32.igemm_128x32x64_nt
[ FAILED ] Igemm_128x32x32.igemm_128x32x256_nt
[ FAILED ] Igemm_128x32x32.igemm_256x32x64_nt
[ FAILED ] Igemm_128x32x32.igemm_128x128x32_nt
[ FAILED ] Igemm_128x32x32.igemm_256x128x32_nt
[ FAILED ] Igemm_128x32x32.igemm_128x32x4_nn
[ FAILED ] Igemm_128x32x32.igemm_128x32x32_nn
[ FAILED ] Igemm_128x32x32.igemm_128x32x36_nn
[ FAILED ] Igemm_128x32x32.igemm_128x32x64_nn
[ FAILED ] Igemm_128x32x32.igemm_128x32x256_nn
[ FAILED ] Igemm_128x32x32.igemm_256x32x64_nn
[ FAILED ] Igemm_128x32x32.igemm_128x128x32_nn
[ FAILED ] Igemm_128x32x32.igemm_256x128x32_nn
[ FAILED ] Igemm_128x32x32.igemm_128x32x4_tn
[ FAILED ] Igemm_128x32x32.igemm_128x32x32_tn
[ FAILED ] Igemm_128x32x32.igemm_128x32x36_tn
[ FAILED ] Igemm_128x32x32.igemm_128x32x64_tn
[ FAILED ] Igemm_128x32x32.igemm_128x32x256_tn
[ FAILED ] Igemm_128x32x32.igemm_256x32x64_tn
[ FAILED ] Igemm_128x32x32.igemm_128x128x32_tn
[ FAILED ] Igemm_128x32x32.igemm_256x128x32_tn
[ FAILED ] Igemm_128x32x32.igemm_128x32x4_tt
[ FAILED ] Igemm_128x32x32.igemm_128x32x32_tt
[ FAILED ] Igemm_128x32x32.igemm_128x32x36_tt
[ FAILED ] Igemm_128x32x32.igemm_128x32x64_tt
[ FAILED ] Igemm_128x32x32.igemm_128x32x256_tt
[ FAILED ] Igemm_128x32x32.igemm_256x32x64_tt
[ FAILED ] Igemm_128x32x32.igemm_128x128x32_tt
[ FAILED ] Igemm_128x32x32.igemm_256x128x32_tt
[ FAILED ] Igemm_128x128x32_float.igemm_128x128x4_nt
[ FAILED ] Igemm_128x128x32_float.igemm_128x128x32_nt
[ FAILED ] Igemm_128x128x32_float.igemm_128x128x36_nt
[ FAILED ] Igemm_128x128x32_float.igemm_128x128x64_nt
[ FAILED ] Igemm_128x128x32_float.igemm_128x128x256_nt
[ FAILED ] Igemm_128x128x32_float.igemm_256x128x64_nt
[ FAILED ] Igemm_128x128x32_float.igemm_128x256x64_nt
[ FAILED ] Igemm_128x128x32_float.igemm_256x256x64_nt
[ FAILED ] Igemm_128x128x32_float.igemm_128x128x4_nn
[ FAILED ] Igemm_128x128x32_float.igemm_128x128x32_nn
[ FAILED ] Igemm_128x128x32_float.igemm_128x128x36_nn
[ FAILED ] Igemm_128x128x32_float.igemm_128x128x64_nn
[ FAILED ] Igemm_128x128x32_float.igemm_128x128x256_nn
[ FAILED ] Igemm_128x128x32_float.igemm_256x128x64_nn
[ FAILED ] Igemm_128x128x32_float.igemm_128x256x64_nn
[ FAILED ] Igemm_128x128x32_float.igemm_256x256x64_nn
[ FAILED ] Igemm_128x128x32_float.igemm_128x128x4_tn
[ FAILED ] Igemm_128x128x32_float.igemm_128x128x32_tn
[ FAILED ] Igemm_128x128x32_float.igemm_128x128x36_tn
[ FAILED ] Igemm_128x128x32_float.igemm_128x128x64_tn
[ FAILED ] Igemm_128x128x32_float.igemm_128x128x256_tn
[ FAILED ] Igemm_128x128x32_float.igemm_256x128x64_tn
[ FAILED ] Igemm_128x128x32_float.igemm_128x256x64_tn
[ FAILED ] Igemm_128x128x32_float.igemm_256x256x64_tn
[ FAILED ] Igemm_128x128x32_float.igemm_128x128x4_tt
[ FAILED ] Igemm_128x128x32_float.igemm_128x128x32_tt
[ FAILED ] Igemm_128x128x32_float.igemm_128x128x36_tt
[ FAILED ] Igemm_128x128x32_float.igemm_128x128x64_tt
[ FAILED ] Igemm_128x128x32_float.igemm_128x128x256_tt
[ FAILED ] Igemm_128x128x32_float.igemm_256x128x64_tt
[ FAILED ] Igemm_128x128x32_float.igemm_128x256x64_tt
[ FAILED ] Igemm_128x128x32_float.igemm_256x256x64_tt
[ FAILED ] Igemm_128x128x32_int8.igemm_128x128x4_nt
[ FAILED ] Igemm_128x128x32_int8.igemm_128x128x32_nt
[ FAILED ] Igemm_128x128x32_int8.igemm_128x128x36_nt
[ FAILED ] Igemm_128x128x32_int8.igemm_128x128x64_nt
[ FAILED ] Igemm_128x128x32_int8.igemm_128x128x256_nt
[ FAILED ] Igemm_128x128x32_int8.igemm_256x128x64_nt
[ FAILED ] Igemm_128x128x32_int8.igemm_128x256x64_nt
[ FAILED ] Igemm_128x128x32_int8.igemm_256x256x64_nt
[ FAILED ] Igemm_128x128x32_int8.igemm_128x128x4_nn
[ FAILED ] Igemm_128x128x32_int8.igemm_128x128x32_nn
[ FAILED ] Igemm_128x128x32_int8.igemm_128x128x36_nn
[ FAILED ] Igemm_128x128x32_int8.igemm_128x128x64_nn
[ FAILED ] Igemm_128x128x32_int8.igemm_128x128x256_nn
[ FAILED ] Igemm_128x128x32_int8.igemm_256x128x64_nn
[ FAILED ] Igemm_128x128x32_int8.igemm_128x256x64_nn
[ FAILED ] Igemm_128x128x32_int8.igemm_256x256x64_nn
[ FAILED ] Igemm_128x128x32_int8.igemm_128x128x4_tn
[ FAILED ] Igemm_128x128x32_int8.igemm_128x128x32_tn
[ FAILED ] Igemm_128x128x32_int8.igemm_128x128x36_tn
[ FAILED ] Igemm_128x128x32_int8.igemm_128x128x64_tn
[ FAILED ] Igemm_128x128x32_int8.igemm_128x128x256_tn
[ FAILED ] Igemm_128x128x32_int8.igemm_256x128x64_tn
[ FAILED ] Igemm_128x128x32_int8.igemm_128x256x64_tn
[ FAILED ] Igemm_128x128x32_int8.igemm_256x256x64_tn
[ FAILED ] Igemm_128x128x32_int8.igemm_128x128x4_tt
[ FAILED ] Igemm_128x128x32_int8.igemm_128x128x32_tt
[ FAILED ] Igemm_128x128x32_int8.igemm_128x128x36_tt
[ FAILED ] Igemm_128x128x32_int8.igemm_128x128x64_tt
[ FAILED ] Igemm_128x128x32_int8.igemm_128x128x256_tt
[ FAILED ] Igemm_128x128x32_int8.igemm_256x128x64_tt
[ FAILED ] Igemm_128x128x32_int8.igemm_128x256x64_tt
[ FAILED ] Igemm_128x128x32_int8.igemm_256x256x64_tt
[ FAILED ] Igemm_32x32x128.igemm_32x32x4_nt
[ FAILED ] Igemm_32x32x128.igemm_32x32x8_nt
[ FAILED ] Igemm_32x32x128.igemm_32x32x32_nt
[ FAILED ] Igemm_32x32x128.igemm_32x32x128_nt
[ FAILED ] Igemm_32x32x128.igemm_32x32x4_nn
[ FAILED ] Igemm_32x32x128.igemm_32x32x8_nn
[ FAILED ] Igemm_32x32x128.igemm_32x32x32_nn
[ FAILED ] Igemm_32x32x128.igemm_32x32x128_nn
[ FAILED ] Igemm_32x32x128.igemm_32x32x4_tn
[ FAILED ] Igemm_32x32x128.igemm_32x32x8_tn
[ FAILED ] Igemm_32x32x128.igemm_32x32x15_tn
[ FAILED ] Igemm_32x32x128.igemm_32x32x32_tn
[ FAILED ] Igemm_32x32x128.igemm_32x32x128_tn
[ FAILED ] Igemm_32x32x128.igemm_32x32x8_tt
[ FAILED ] Igemm_32x32x128.igemm_32x32x32_tt
[ FAILED ] Igemm_32x32x128.igemm_32x32x128_tt
[ FAILED ] Sgemm_128x128x8.sgemm_120x112x64_ldg4_nt
[ FAILED ] Sgemm_128x128x16.sgemm_128x128x16_nt
[ FAILED ] Sgemm_128x128x16.sgemm_128x81x1_nt
[ FAILED ] Sgemm_128x128x16.sgemm_128x112x16_nt
[ FAILED ] Sgemm_128x128x16.sgemm_128x112x17_nt
[ FAILED ] Sgemm_128x128x16.sgemm_128x73x16_nt
[ FAILED ] Sgemm_128x128x16.sgemm_97x112x64_nt
[ FAILED ] Sgemm_128x128x16.sgemm_256x112x16_nt
[ FAILED ] Sgemm_128x128x16.sgemm_128x240x16_nt
[ FAILED ] Sgemm_128x128x16.sgemm_256x240x16_nt
[ FAILED ] Sgemm_128x128x16.sgemm_128x128x16_nn
[ FAILED ] Sgemm_128x128x16.sgemm_128x112x1_nn
[ FAILED ] Sgemm_128x128x16.sgemm_79x112x16_nn
[ FAILED ] Sgemm_128x128x16.sgemm_128x81x17_nn
[ FAILED ] Sgemm_128x128x16.sgemm_128x112x16_nn
[ FAILED ] Sgemm_128x128x16.sgemm_128x73x64_nn
[ FAILED ] Sgemm_128x128x16.sgemm_256x112x16_nn
[ FAILED ] Sgemm_128x128x16.sgemm_128x256x16_nn
[ FAILED ] Sgemm_128x128x16.sgemm_256x256x16_nn
[ FAILED ] Sgemm_128x128x16.sgemm_128x128x16_tn
[ FAILED ] Sgemm_128x128x16.sgemm_128x128x1_tn
[ FAILED ] Sgemm_128x128x16.sgemm_127x112x16_tn
[ FAILED ] Sgemm_128x128x16.sgemm_21x112x17_tn
[ FAILED ] Sgemm_128x128x16.sgemm_128x73x16_tn
[ FAILED ] Sgemm_128x128x16.sgemm_128x81x64_tn
[ FAILED ] Sgemm_128x128x16.sgemm_256x112x16_tn
[ FAILED ] Sgemm_128x128x16.sgemm_47x256x16_tn
[ FAILED ] Sgemm_128x128x16.sgemm_211x256x16_tn
[ FAILED ] Sgemm_128x128x16.sgemm_128x128x16_tt
[ FAILED ] Sgemm_128x128x16.sgemm_128x128x1_tt
[ FAILED ] Sgemm_128x128x16.sgemm_109x112x16_tt
[ FAILED ] Sgemm_128x128x16.sgemm_128x112x17_tt
[ FAILED ] Sgemm_128x128x16.sgemm_128x112x16_tt
[ FAILED ] Sgemm_128x128x16.sgemm_123x112x64_tt
[ FAILED ] Sgemm_128x128x16.sgemm_256x112x16_tt
[ FAILED ] Sgemm_128x128x16.sgemm_128x256x16_tt
[ FAILED ] Sgemm_128x128x16.sgemm_256x256x16_tt
[ FAILED ] Sgemm_128x128x16.sgemm_120x112x64_ldg4_nt
[ FAILED ] Sgemm_128x128x16.sgemm_128x128x16_alpha2_nt
[ FAILED ] Sgemm_128x128x16.sgemm_128x112x16_beta1_nt
[ FAILED ] Sgemm_128x128x16.sgemm_128x112x16_alpha2_beta1_nt
[ FAILED ] Sgemm_128x64x16.sgemm_128x64x1_nt
[ FAILED ] Sgemm_128x64x16.sgemm_128x64x16_nt
[ FAILED ] Sgemm_128x64x16.sgemm_128x64x17_nt
[ FAILED ] Sgemm_128x64x16.sgemm_128x64x64_nt
[ FAILED ] Sgemm_128x64x16.sgemm_256x64x16_nt
[ FAILED ] Sgemm_128x64x16.sgemm_128x128x16_nt
[ FAILED ] Sgemm_128x64x16.sgemm_256x128x16_nt
[ FAILED ] Sgemm_128x64x16.sgemm_128x64x1_tn
[ FAILED ] Sgemm_128x64x16.sgemm_128x64x16_tn
[ FAILED ] Sgemm_128x64x16.sgemm_128x64x17_tn
[ FAILED ] Sgemm_128x64x16.sgemm_128x64x64_tn
[ FAILED ] Sgemm_128x64x16.sgemm_256x64x16_tn
[ FAILED ] Sgemm_128x64x16.sgemm_128x128x16_tn
[ FAILED ] Sgemm_128x64x16.sgemm_256x128x16_tn
[ FAILED ] Sgemm_128x64x16.sgemm_128x64x1_tt
[ FAILED ] Sgemm_128x64x16.sgemm_128x64x16_tt
[ FAILED ] Sgemm_128x64x16.sgemm_128x64x17_tt
[ FAILED ] Sgemm_128x64x16.sgemm_128x64x64_tt
[ FAILED ] Sgemm_128x64x16.sgemm_128x128x16_tt
[ FAILED ] Sgemm_128x64x16.sgemm_256x128x16_tt
[ FAILED ] Sgemm_128x32x16.sgemm_128x32x1_tn
[ FAILED ] Sgemm_128x32x16.sgemm_128x32x1_tt
[ FAILED ] Sgemm_64x128x16.sgemm_64x128x64_4x8_accumulators_nt

280 FAILED TESTS

Support CUDA 8?

I am using CUDA 8.0, and some errors are reported as follows:

make sgemm sm=60
mkdir -p bin
"/usr/local/cuda-8.0/bin/nvcc" -DTEST_SGEMM  -gencode=arch=compute_60,code=\"sm_60,compute_60\" -o bin/sgemm_nn_sm60_8.0 gemm.cu -O3 -Xptxas -v -std=c++11  -I./ -I../  -lcublas
../cutlass/gemm/k_split_control.h(154): error: identifier "__syncwarp" is undefined

cublas_dispatch.h(78): error: identifier "CUDA_R_32I" is undefined
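For the first error, a possible local workaround (a sketch, not an official fix) is a no-op __syncwarp for pre-CUDA-9 toolkits. This is only plausible for pre-Volta targets such as the sm_60 build above, where the threads of a warp execute in lock-step; it is not correct for Volta and newer.

// Hedged compatibility shim: __syncwarp() was introduced in CUDA 9, so
// provide a no-op stand-in when compiling with CUDA 8. Safe only for
// pre-Volta architectures, where a warp executes in lock-step.
#if defined(__CUDACC_VER_MAJOR__) && (__CUDACC_VER_MAJOR__ < 9)
__device__ __forceinline__ void __syncwarp(unsigned /*mask*/ = 0xffffffff) {}
#endif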

performance in small blocks

Hi,
From my experiments with cutlass_perf_test on sgemm_nn, it seems that for small problem sizes (e.g., m=128, n=64, k=512) CUTLASS is 3-10x slower in terms of FLOPS.
How do I get performance comparable to the large-block cases? Changing only the output tile shape doesn't seem to work.

cutlass fails to build w/ nvcc unless -std=c++11 is set in NVCC_FLAGS

CUTLASS fails to build with default compiler options under CUDA 9.2, where NVCC defaults to -std=c++14. The build works if compiled with -std=c++11.

/usr/local/google/home/tra/work/cuda/cutlass-1.0/cutlass/util/platform.h(310): error: namespace "std" has no member "bool_constant"

std::bool_constant is only standard since C++17, so it's somewhat puzzling that the build works in C++11 mode but breaks in C++14.
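A local shim makes the version dependence concrete (a sketch only; the clean fix is guarding the definition in cutlass/util/platform.h on __cplusplus, and adding names to namespace std like this is formally undefined behavior):

#include <type_traits>

// Sketch: bool_constant is standard only from C++17, but C++14 provides
// integral_constant, so the missing alias can be spelled in terms of it
// for C++14 builds.
#if __cplusplus >= 201402L && __cplusplus < 201703L
namespace std {
template <bool B>
using bool_constant = integral_constant<bool, B>;
}
#endif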

Reproduction (on a system running a recent Debian distribution):
$ git clone https://github.com/NVIDIA/cutlass.git cutlass-1.0
$ mkdir cutlass-1.0/build
$ cd cutlass-1.0/build
$ cmake -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-9.2 ..
$ make VERBOSE=1

Full build log: https://pastebin.com/i9VmjZGV
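A possible workaround matching the issue title, assuming the build goes through CMake's FindCUDA module (whose standard CUDA_NVCC_FLAGS variable is appended to nvcc invocations):

$ cmake -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-9.2 -DCUDA_NVCC_FLAGS=-std=c++11 ..
$ make VERBOSE=1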

Accumulate in different type than input/output

I am writing a GEMM for my own types. The input/output type is 32-bit, but the accumulation needs to be 64-bit.
I have all the +, *, and += operators implemented so I can work with multiply-add, but I cannot get the code to compile.
Is there an example of how to do this?
Also, for the epilogue, I need no operation except that I have implemented the code
outputType = AccumulatorType
to do the proper math (not just casting). Is this adequate, or do I need something else?

It seems that a CUB-like approach to operators could simplify the API a lot: operators could be passed along with their parameters to make a very simple host-side interface.
Thanks!
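For reference, a minimal sketch in the CUTLASS 1.x style (the same GemmConfig / ThreadMultiplyAdd pattern appears in the complex GEMM code later in this document): the accumulator scalar type is a separate template argument from the A and B scalar types, which is the hook for accumulating in a wider type. my32_t and my64_t are hypothetical stand-ins for user-defined types.

#include "cutlass/gemm/thread_multiply_add.h"

// Hypothetical user-defined types; they must provide the +, *, and =
// operators (and zero initialization) used by multiply-add.
struct my32_t { /* 32-bit payload */ };
struct my64_t { /* 64-bit payload */ };

// Hedged sketch, not a tested configuration: the last template argument of
// ThreadMultiplyAdd is the accumulator type, so 32-bit operands can be
// accumulated into a 64-bit type before the epilogue converts them back.
using MultiplyAdd = cutlass::gemm::ThreadMultiplyAdd<
    cutlass::Shape<8, 8, 8>,  // thread-level GEMM tile (K-by-N-by-M)
    cutlass::Shape<1, 4, 8>,  // arrangement of threads within a warp
    my32_t,                   // scalar type of A
    my32_t,                   // scalar type of B
    my64_t>;                  // accumulator type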

Support basic gemm for unsigned char

I tried to use the basic GEMM example with other types: it works if I change float to unsigned int, but it produces a misaligned address error if I try to use unsigned char. Is this my error, or is there no way to use basic GEMM with the unsigned char type?
The exact changes can be seen here: branch unsigned_char of my fork.

Poor Performance with Int1

Doing a performance comparison between INT1 (binary GEMM) and SGEMM, I'm seeing only about an 11x speedup.

Here's what I ran:

./cutlass_perf_test --m=1024 --n=32 --k=1024 --kernels=sgemm_nn --iterations=10000
===============================================
[Passed]: sgemm_nn with disposition: passed
Kernel: sgemm_nn
    provider: Cutlass
    problem: 1024-by-32-by-1024, A: column-major, B: column-major, beta: 0, batch: 1
    disposition: passed
    runtime:     0.193089 ms

    performance: 347.553 GFLOPs

and

./cutlass_perf_test --m=1024 --n=32 --k=32 --kernels=wmma_binary_gemm_tn --iterations=10000
=================================================
[NotVerified]: wmma_binary_gemm_tn with disposition: not_verified
Kernel: wmma_binary_gemm_tn
    provider: Cutlass
    problem: 1024-by-32-by-32, A: row-major, B: column-major, beta: 0, batch: 1
    disposition: not_verified
    runtime:     0.0172656 ms

    performance: 3886.85 GFLOPs

In both runs, N=32 and M=K=1024 (1024 bits for the binary GEMM, 1024 floats for SGEMM).

Comparing the runtimes, we see that binary GEMM is only 0.193089 / 0.0172656 ≈ 11x faster than SGEMM. I'd expect better performance, somewhere in the 24x-30x range, as operating on bits using popcount should be about 32x faster than doing multiply-accumulates.

Any idea why this is the case? Appreciate any input here.

Change threadblock tile size of Gemm kernel

Hello! I'm reading the basic_gemm.cu example, and I'm trying to change the threadblock tile size for the kernel, but I'm not sure where the right place to change it is, since the kernel is so tightly encapsulated. I tried changing the cutlass::Shape<8, 128, 128> in the GemmTraits, but it doesn't compile.

Does anybody have any suggestions on changing the threadblock tile size of the GEMM kernel? Thanks!
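For reference, a sketch of the knob in question, using the same CUTLASS 1.x traits pattern that appears in the complex GEMM code later in this document; the third SgemmTraits template argument is the threadblock tile, expressed as Shape<K, N, M>:

// Hedged sketch, assuming the CUTLASS 1.x SgemmTraits header. Not every
// shape compiles: the thread-level tile and warp arrangement must evenly
// cover the threadblock tile. Shape<8, 32, 64> (K-by-N-by-M) matches the
// 64x32x8 configuration exercised by the unit tests earlier in this log.
#include "cutlass/gemm/sgemm_traits.h"

typedef cutlass::gemm::SgemmTraits<
    cutlass::MatrixLayout::kColumnMajor,  // layout of A matrix
    cutlass::MatrixLayout::kColumnMajor,  // layout of B matrix
    cutlass::Shape<8, 32, 64>             // threadblock tile (K-by-N-by-M)
> SmallTileGemmTraits;

typedef cutlass::gemm::Gemm<SmallTileGemmTraits> SmallTileGemm;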

CUTLASS results incorrect

I tried building the 00_basic_gemm project, and when I run it, it says "CUTLASS results incorrect". I'm also getting some strange results with my own code.

Details:
CUTLASS 1.1.0
CUDA release 9.2, V9.2.88
Tesla M40 GPU
CMake Flags: -DCUTLASS_NVCC_ARCHS=35;50;60;70

Compiling and running the same code on a Tesla P4 passes without any issues.

Question about nvprof result on cutlass_perf_test

I tested SGEMM performance with cutlass_perf_test; the reported shared_efficiency on Pascal cards is around 50%.

How can shared_efficiency not be 100% when shared_st_bank_conflict and shared_ld_bank_conflict are both 0?

Unit tests fail on GTX 1060

I am using a GTX 1060 GPU, which I believe is the Pascal architecture, with CUDA V9.2. I tried to run the unit tests, but they all failed.

CUTLASS matrix size problem

I wrote a function using CUTLASS to test the performance of a CUTLASS computation (int8 x int8 -> int32), but I have found a problem: M, N, and K cannot be chosen freely; N and K must be multiples of 16, and choosing anything else causes an error. Is there something wrong with how I wrote this function? (A pre-flight check sketch follows the code.)

int Int8Operator::cutlass_gemm32I_tensorop(const CBLAS_TRANSPOSE TransA,
    const CBLAS_TRANSPOSE TransB, const int M, const int N, const int K,
    const void *alpha, const void *A, const void *B, const void *beta,
    void *C, cublasGemmAlgo_t algo /*not used*/)
{
  using A_Major = cutlass::layout::ColumnMajor;
  using B_Major = cutlass::layout::ColumnMajor;
  using ElementOutput = int32_t;
  using ElementAccumulator = int32_t;

  int lda = (TransA == CblasNoTrans) ? K : M;
  int ldb = (TransB == CblasNoTrans) ? N : K;
  int ldc = N;

  using Gemm = cutlass::gemm::device::Gemm<
      int8_t, A_Major,                           // element type and layout of A
      int8_t, B_Major,                           // element type and layout of B
      ElementOutput, cutlass::layout::RowMajor,  // element type and layout of C/D
      ElementAccumulator,
      cutlass::arch::OpClassWmmaTensorOp,
      cutlass::arch::Sm75,
      cutlass::gemm::GemmShape<128, 128, 32>,    // threadblock tile
      cutlass::gemm::GemmShape<64, 64, 32>,      // warp tile
      cutlass::gemm::GemmShape<16, 16, 16>,      // WMMA instruction shape
      cutlass::epilogue::thread::LinearCombination<
          ElementOutput,
          128 / cutlass::sizeof_bits<ElementOutput>::value,
          ElementAccumulator,
          ElementAccumulator>,
      cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle,
      2>;

  Gemm gemm_op;
  int alpha_ = *(static_cast<const int *>(alpha));
  int beta_ = *(static_cast<const int *>(beta));

  cutlass::Status status = gemm_op({
      {M, N, K},
      {static_cast<const int8_t *>(A), lda},
      {static_cast<const int8_t *>(B), ldb},
      {static_cast<int *>(C), ldc},
      {static_cast<int *>(C), ldc},
      {alpha_, beta_}});

  if (status != cutlass::Status::kSuccess) {
    return cudaErrorUnknown;
  }
  return cudaSuccess;
}
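One way to surface that constraint before launching (a sketch, reusing the Gemm type and arguments from the function above) is the kernel's own pre-flight check:

// Hedged pre-flight sketch: device-level CUTLASS GEMMs expose can_implement(),
// which reports whether this instantiation supports the requested problem
// size and alignment, so unsupported M/N/K combinations (e.g., N or K not a
// multiple of 16 here) can be rejected cleanly instead of failing at runtime.
Gemm::Arguments args{
    {M, N, K},
    {static_cast<const int8_t *>(A), lda},
    {static_cast<const int8_t *>(B), ldb},
    {static_cast<int *>(C), ldc},
    {static_cast<int *>(C), ldc},
    {alpha_, beta_}};
if (Gemm::can_implement(args) != cutlass::Status::kSuccess) {
  return cudaErrorInvalidValue;
}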
      

Where can I see examples of WMMA GEMM usage for INT1 (1-bit)?

  • Does the CUTLASS 1.2 library really support INT1 (1 bit) GEMM by using Tensor Cores, so can we use it for XNOR neural networks?

  • Does it perform XNOR !(a^b) operations instead of Multiply?

  • Does it perform C[j][i] = popcnt( A_i_row[x] XNOR B_j_col[x] ) ?

  • Should we pack each 32 bits into a uint32_t (A along rows, B along columns), in a similar manner to cuDNN, where we use CUDNN_DATA_INT8x32 and CUDNN_TENSOR_NCHW_VECT_C to run INT8 on Tensor Cores with CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM? (See the packing sketch at the end of this question.) https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#tensor-ops-speedup-tips

  • Where can I read more about this and where can I see examples of Warp-Level Matrix Operations (WMMA) GEMM usage for INT1 (1 bit)?

I can see only tests for INT8 and INT4: https://github.com/NVIDIA/cutlass/blob/master/tools/test/unit/gemm/wmma_integer_gemm.cu


As written here, we can achieve 2088 TOPS for INT1 (1-bit) on a GeForce RTX 2080 Ti (TU102): http://on-demand.gputechconf.com/gtc-il/2018/pdf/sil8140-optimizing-cuda-applications-for-the-volta-turing-gpu-architecture.pdf

https://github.com/NVIDIA/cutlass#whats-new-in-cutlass-11

WMMA GEMM targeting TensorCores - INT8, INT4, 1-bit https://github.com/NVIDIA/cutlass/blob/master/tools/test/unit/gemm/wmma_integer_gemm.cu

From the last newsletter:

CUTLASS 1.2, the latest version of the CUDA template library for linear algebra subroutines, includes the following key updates:

  • Support for Turing Tensor Cores that significantly speedup matrix computations for deep learning inference
  • Tensor Core optimized WMMA GEMMs for the new INT8, INT4, and INT1 precision modes introduced in Turing
  • Support for batched strided GEMMs, parallelized GEMM-K reductions, enhanced utilities, and samples
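Returning to the packing question above, a host-side sketch of the convention it describes (my own illustration, not CUTLASS API): each group of 32 0/1 values becomes one uint32_t word, A packed along rows and B along columns, so K counts 32-bit words rather than individual bits. The bit order within a word is an assumption.

#include <cstdint>

// Hedged illustration: bit k of the output word holds the k-th of 32
// consecutive 0/1 values (consecutive along a row of A, or along a
// column of B).
uint32_t pack32(const uint8_t *bits /* 32 values, each 0 or 1 */) {
  uint32_t word = 0;
  for (int k = 0; k < 32; ++k) {
    word |= uint32_t(bits[k] & 1u) << k;
  }
  return word;
}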

cutlass performance evaluation

Hi cutlass developers,

I tested CUTLASS against cuBLAS as provided in the CUDA 9.2 samples. The m, n, and k are set to 640, 480, and 320, respectively.

Following are cublas results on GTX 1080:

/usr/local/cuda-9.2/samples/bin/x86_64/linux/release/matrixMulCUBLAS
GPU Device 0: "GeForce GTX 1080" with compute capability 6.1

MatrixA(640,480), MatrixB(480,320), MatrixC(640,320)
Computing result using CUBLAS...done.
Performance= 3065.46 GFlop/s, Time= 0.064 msec, Size= 196608000 Ops
Computing result using host CPU...done.
Comparing CUBLAS Matrix Multiply with CPU results: PASS

Following are cutlass results on GTX 1080:

./build/tools/test/perf/cutlass_perf_test --m=640 --n=480 --k=320 --kernels=dgemm

============================================================================
Kernel: dgemm_nt
problem: 640-by-480-by-320, A: column-major, B: row-major, beta: 0
disposition: passed
runtime: 0.872562 ms

performance: 225.323 GFLOPs

============================================================================
Kernel: dgemm_nn
problem: 640-by-480-by-320, A: column-major, B: column-major, beta: 0
disposition: passed
runtime: 0.873718 ms

performance: 225.025 GFLOPs

============================================================================
Kernel: dgemm_tn
problem: 640-by-480-by-320, A: row-major, B: column-major, beta: 0
disposition: passed
runtime: 0.874086 ms

performance: 224.93 GFLOPs

============================================================================
Kernel: dgemm_tt
problem: 640-by-480-by-320, A: row-major, B: row-major, beta: 0
disposition: passed
runtime: 0.874783 ms

performance: 224.751 GFLOPs

It seems that in this case CUTLASS is slower than cuBLAS. Would you mind providing your benchmark script for comparing CUTLASS with cuBLAS? Also, what do the flags in the kernel names mean, such as s/d/h/i and nn/nt/...?

Thanks!
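(For context on the numbers: both tools report 2·M·N·K / runtime, so 2·640·480·320 = 196,608,000 operations, which is the "Size" the CUBLAS sample prints, and 196,608,000 ops / 0.872562 ms ≈ 225.3 GFLOP/s for the dgemm_nt run, matching the output above. Note also that matrixMulCUBLAS calls cublasSgemm, i.e. single precision, while the CUTLASS runs used dgemm, i.e. double precision, which runs at a small fraction of the single-precision rate on a GTX 1080, so the two measurements are not directly comparable.)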

Run Multiple Gemm Kernels Concurrently

Hello! I'm reading the basic_gemm.cu example. I'm trying to run multiple GEMM kernels concurrently by calling Gemm::launch(params) multiple times, but it does not seem to work.

I think I have to allocate a stream for each kernel, but I couldn't find how to do that. Does anybody have suggestions on how to run multiple kernels concurrently? Thanks!
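For reference, a sketch of the stream-per-GEMM idea (a hedged assumption: the 1.x Gemm::launch in cutlass/gemm/gemm.h takes an optional cudaStream_t; if your version lacks that overload, the same effect comes from launching the underlying kernel on a per-GEMM stream yourself):

// Hedged sketch: give each GEMM its own CUDA stream so the kernels may
// overlap when the GPU has idle SMs. params0 and params1 are two fully
// initialized Gemm::Params objects (hypothetical names).
cudaStream_t streams[2];
for (int s = 0; s < 2; ++s) {
  cudaStreamCreate(&streams[s]);
}
Gemm::launch(params0, streams[0]);  // assumed stream-taking overload
Gemm::launch(params1, streams[1]);
for (int s = 0; s < 2; ++s) {
  cudaStreamSynchronize(streams[s]);
  cudaStreamDestroy(streams[s]);
}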

Fix the license

We are going to use the BSD License. There's a typo in the README.

Is it possible to do operations other than element-wise operations in the epilogue?

Hi all,

I want to transform the output of the matrix multiplication, but the transformation is not element-wise. It is based on a small region of the matrix multiplication result: for example, I want to calculate the average of every 2x2 tile of the result and write that average back to the corresponding tile region.

Also, if I want to do something like this before the matrix multiplication, how can I solve it with CUTLASS (and is CUTLASS suitable for that)?

Thanks!
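One workable pattern, pending a CUTLASS-native answer, is a small follow-up kernel over the GEMM output. A sketch for the 2x2 averaging example (my own illustration, using the column-major indexing of the reference kernels elsewhere in this document):

// Hedged sketch, not a CUTLASS epilogue: one thread per 2x2 tile of the
// GEMM output D (column-major with leading dimension ldd); compute the
// tile average and broadcast it back to all four elements.
__global__ void Average2x2_kernel(float *D, int ldd, int M, int N) {
  int i = 2 * (threadIdx.x + blockIdx.x * blockDim.x);
  int j = 2 * (threadIdx.y + blockIdx.y * blockDim.y);
  if (i + 1 < M && j + 1 < N) {
    float avg = 0.25f * (D[i + j * ldd] + D[(i + 1) + j * ldd] +
                         D[i + (j + 1) * ldd] + D[(i + 1) + (j + 1) * ldd]);
    D[i + j * ldd] = avg;
    D[(i + 1) + j * ldd] = avg;
    D[i + (j + 1) * ldd] = avg;
    D[(i + 1) + (j + 1) * ldd] = avg;
  }
}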

Complex GEMM bug: the result matrix (D) is not updated

Here is my code, using CUTLASS 1.2:

cgemm_traits.h

#pragma once

#include "cutlass/gemm/gemm.h"
#include "cutlass/gemm/gemm_epilogue.h"
#include "cutlass/gemm/gemm_epilogue_traits.h"
#include "cutlass/gemm/gemm_global_tile.h"
#include "cutlass/gemm/gemm_shared_tile.h"
#include "cutlass/gemm/gemm_traits.h"
#include "cutlass/gemm/thread_multiply_add.h"

#include "cutlass/util/complex.h"

using cutlass::platform::complex;

namespace cutlass {
namespace gemm {

////////////////////////////////////////////////////////////////////////////////////////////////////

template <
    /// The tile size for threadblock-level GEMM (K-by-N-by-M).
    typename OutputTile_,
    /// Tile size for thread-level GEMM (K-by-N-by-M)
    typename ThreadGemmShape_,
    /// The number of scalars per LDG for A.
    int kScalarsPerLdgA_ = 1,
    /// The number of scalars per LDG for B.
    int kScalarsPerLdgB_ = 1>
struct CgemmConfig
    : public GemmConfig<
          /// The scalar type for A.
          complex<float>,
          /// The scalar type for B.
          complex<float>,
          /// The scalar type for C.
          complex<float>,
          /// The scalar type for D.
          complex<float>,
          /// The tile size for the GEMM KxNxM.
          OutputTile_,
          /// The functor to do the math in the main loop.
          ThreadMultiplyAdd<ThreadGemmShape_, Shape<1, 4, 8>, complex<float>, complex<float>, complex<float>>,
          /// The number of scalars per LDG for A.
          kScalarsPerLdgA_,
          /// The number of scalars per STS for A.
          kScalarsPerLdgA_,
          /// The number of scalars per LDS for A.
          2,
          /// The number of scalars per LDG for B.
          kScalarsPerLdgB_,
          /// The number of scalars per STS for B.
          kScalarsPerLdgB_,
          /// The number of scalars per LDS for B.
          2,
          /// The number of scalars per LDG for C and STG for D.
          1,
          /// The number of scalars per STS for D.
          2,
          /// The number of scalars per LDS for D.
          1,
          /// The number of stages in shared memory.
          2,
          /// kResidueSeparate
          false,
          /// kResidueInPrologue
          false,
          /// kLaunchBounds
          false
          >{};

////////////////////////////////////////////////////////////////////////////////////////////////////

template <
    /// The layout for A.
    MatrixLayout::Kind kLayoutA_,
    /// The layout for B.
    MatrixLayout::Kind kLayoutB_,
    /// The tile size for threadblock-level GEMM (K-by-N-by-M)
    typename OutputTile_ = Shape<8, 64, 128>,
    /// The functor to use in the epilogue.
    typename EpilogueFunctor_ = LinearScaling<complex<float>>,
    /// Tile size for thread-level GEMM (K-by-N-by-M)
    typename ThreadGemmShape_ = Shape<8, 8, 8>,
    /// The number of cuComplexs loaded in one LDG for A.
    int kScalarsPerLdgA_ = 1,
    /// The number of cuComplexs loaded in one LDG for B.
    int kScalarsPerLdgB_ = 1,
    /// The index.
    typename Index_ = int,
    /// The Cgemm config.
    typename GemmConfig_ =
        CgemmConfig<OutputTile_, ThreadGemmShape_, kScalarsPerLdgA_, kScalarsPerLdgB_>,
    /// The traits class for the epilogue.
    typename GemmEpilogueTraits_ =
        SimplifiedGemmEpilogueTraits<GemmConfig_, EpilogueFunctor_, Index_> >
struct CgemmTraits : public SimplifiedGemmTraits<
                         // The layout for A.
                         kLayoutA_,
                         // The layout for B.
                         kLayoutB_,
                         // The config.
                         GemmConfig_,
                         // The epilogue.
                         GemmEpilogue<GemmEpilogueTraits_>,
                         // The index.
                         Index_> {};

////////////////////////////////////////////////////////////////////////////////////////////////////

}  // namespace gemm
}  // namespace cutlass

cgemm.cu

/*
  This example demonstrates how to call a CUTLASS GEMM kernel and provides a naive reference
  matrix multiply kernel to verify its correctness.

  The CUTLASS Gemm template is instantiated in the function CutlassCgemmNN. This kernel computes
  the general matrix product (GEMM) using single-precision complex arithmetic and assumes
  all matrices have column-major layout.

  The threadblock tile size is chosen here as cutlass::Shape<8, 32, 64> (K-by-N-by-M).
  See the CUTLASS Parallel for All blog post for more exposition on the tunable parameters available
  in CUTLASS.

  https://devblogs.nvidia.com/cutlass-linear-algebra-cuda/

  Aside from defining and launching the SGEMM kernel, this example does not use any other components
  or utilities within CUTLASS. Such utilities are demonstrated elsewhere in other examples and are
  prevalent in the CUTLASS unit tests.
*/

// Standard Library includes
#include <cassert>
#include <iostream>
#include <sstream>
#include <vector>


#include "cutlass/util/complex.h"


//
// CUTLASS includes needed for single-precision GEMM kernel
//

// Defines cutlass::gemm::Gemm, the generic Gemm computation template class.
#include "cutlass/gemm/gemm.h"

// Defines cutlass::gemm::CgemmTraits, the structural components for single-precision complex GEMM
#include "cgemm_traits.h"

///////////////////////////////////////////////////////////////////////////////////////////////////
//
// This function defines a CUTLASS GEMM kernel instantiation, constructs its parameters object,
// and launches it on the CUDA device.
//
///////////////////////////////////////////////////////////////////////////////////////////////////

/// Define a CUTLASS GEMM template and launch a GEMM kernel.
cudaError_t CutlassCgemmNN(
  int M,
  int N,
  int K,
  complex<float> alpha,
  complex<float> const *A,
  int lda,
  complex<float> const *B,
  int ldb,
  complex<float> beta,
  complex<float> *C,
  int ldc) {

  // Define the type of a single-precision complex CUTLASS GEMM with column-major
  // input matrices and a cutlass::Shape<8, 32, 64> threadblock tile size.
  //
  // Note, GemmTraits<> is a generic template defined for various general matrix product
  // computations within CUTLASS. It is intended to be maximally flexible, and consequently
  // it contains numerous template arguments.
  //
  // To keep the interface manageable, several helpers are defined for plausible compositions
  // including the following example for single-precision GEMM. Typical values are used as
  // default template arguments. See `cutlass/gemm/gemm_traits.h` for more details.
  //
  typedef cutlass::gemm::CgemmTraits<
    cutlass::MatrixLayout::kColumnMajor,   // layout of A matrix
    cutlass::MatrixLayout::kColumnMajor,   // layout of B matrix
    cutlass::Shape<8, 32, 64>            // threadblock tile size
  >
    GemmTraits;

  // Define a CUTLASS GEMM type from a GemmTraits<> instantiation.
  typedef cutlass::gemm::Gemm<GemmTraits> Gemm;

  // Construct and initialize CUTLASS GEMM parameters object.
  //
  // One of CUTLASS's design patterns is to define parameters objects that are constructible
  // in host code and passed to kernels by value. These may include pointers, strides, scalars,
  // and other arguments needed by Gemm and its components.
  //
  // The benefits of this pattern are (1.) a structured, composable strategy for passing host-constructible
  // arguments to kernels and (2.) minimized initialization overhead on kernel entry.
  //
  typename Gemm::Params params;

  int result = params.initialize(
    M,     // GEMM M dimension
    N,     // GEMM N dimension
    K,     // GEMM K dimension
    alpha, // scalar alpha
    A,     // matrix A operand
    lda,
    B,     // matrix B operand
    ldb,
    beta,  // scalar beta
    C,     // source matrix C
    ldc,
    C,     // destination matrix C (may be different memory than source C matrix)
    ldc
  );

  if (result) {
    std::cerr << "Failed to initialize CUTLASS Gemm::Params object." << std::endl;
    return cudaErrorInvalidValue;
  }

  // Launch the CUTLASS GEMM kernel.
  Gemm::launch(params);


  // Return any errors associated with the launch or cudaSuccess if no error.
  return cudaGetLastError();
}

///////////////////////////////////////////////////////////////////////////////////////////////////
//
// The source code after this point in the file is generic CUDA using the CUDA Runtime API
// and simple CUDA kernels to initialize matrices and compute the general matrix product.
//
///////////////////////////////////////////////////////////////////////////////////////////////////

/// Kernel to initialize a matrix; every element is set to 1 + 1i.
__global__ void InitializeMatrix_kernel(
  complex<float> *matrix,
  int ldm,
  int rows,
  int columns,
  int seed = 0) {

  int i = threadIdx.x + blockIdx.x * blockDim.x;
  int j = threadIdx.y + blockIdx.y * blockDim.y;

  if (i < rows && j < columns) {
    int offset = i + j * ldm;

    // Set every element to 1 + 1i (the seed parameter is unused here).
    matrix[offset].real() = 1.0f;
    matrix[offset].imag() = 1.0f;
  }
}

/// Simple function to initialize every matrix element to 1 + 1i.
cudaError_t InitializeMatrix(complex<float> *matrix, int ldm, int rows, int columns, int seed = 0) {

  dim3 block(16, 16);
  dim3 grid(
    (rows + block.x - 1) / block.x,
    (columns + block.y - 1) / block.y
  );

  InitializeMatrix_kernel<<< grid, block >>>(matrix, ldm, rows, columns, seed);

  return cudaGetLastError();
}

///////////////////////////////////////////////////////////////////////////////////////////////////

/// Allocates device memory for a matrix, then fills it with initial values.
cudaError_t AllocateMatrix(complex<float> **matrix, int ldm, int rows, int columns, int seed = 0) {
  cudaError_t result;

  size_t sizeof_matrix = sizeof(complex<float>) * ldm * columns;

  // Allocate device memory.
  result = cudaMalloc(reinterpret_cast<void **>(matrix), sizeof_matrix);

  if (result != cudaSuccess) {
    std::cerr << "Failed to allocate matrix: "
      << cudaGetErrorString(result) << std::endl;
    return result;
  }

  // Clear the allocation.
  result = cudaMemset(*matrix, 0, sizeof_matrix);

  if (result != cudaSuccess) {
    std::cerr << "Failed to clear matrix device memory: "
      << cudaGetErrorString(result) << std::endl;
    return result;
  }

  // Initialize matrix elements to arbitrary small integers.
  result = InitializeMatrix(*matrix, ldm, rows, columns, seed);

  if (result != cudaSuccess) {
    std::cerr << "Failed to initialize matrix: "
      << cudaGetErrorString(result) << std::endl;
    return result;
  }

  return result;
}

///////////////////////////////////////////////////////////////////////////////////////////////////

/// Naive reference GEMM computation.
__global__ void ReferenceGemm_kernel(
  int M,
  int N,
  int K,
  complex<float> alpha,
  complex<float> const *A,
  int lda,
  complex<float> const *B,
  int ldb,
  complex<float> beta,
  complex<float> *C,
  int ldc) {

  int i = threadIdx.x + blockIdx.x * blockDim.x;
  int j = threadIdx.y + blockIdx.y * blockDim.y;

  if (i < M && j < N) {
    complex<float> accumulator = 0;

    for (int k = 0; k < K; ++k) {
      accumulator += A[i + k * lda] * B[k + j * ldb];
    }

    C[i + j * ldc] = alpha * accumulator + beta * C[i + j * ldc];
  }
}

/// Reference GEMM computation.
cudaError_t ReferenceGemm(
  int M,
  int N,
  int K,
  complex<float> alpha,
  complex<float> const *A,
  int lda,
  complex<float> const *B,
  int ldb,
  complex<float> beta,
  complex<float> *C,
  int ldc) {

  dim3 block(16, 16);
  dim3 grid(
    (M + block.x - 1) / block.x,
    (N + block.y - 1) / block.y
  );

  ReferenceGemm_kernel<<< grid, block >>>(M, N, K, alpha, A, lda, B, ldb, beta, C, ldc);

  return cudaGetLastError();
}

///////////////////////////////////////////////////////////////////////////////////////////////////

/// Allocate several matrices in GPU device memory and call a single-precision
/// CUTLASS GEMM kernel.
cudaError_t TestCutlassGemm(int M, int N, int K, complex<float> alpha, complex<float> beta) {
  cudaError_t result;

  //
  // Define several matrices to be used as operands to GEMM kernels.
  //

  // Compute leading dimensions for each matrix.
  int lda = M;
  int ldb = K;
  int ldc = M;

  // Compute size in bytes of the C matrix.
  size_t sizeof_C = sizeof(complex<float>) * ldc * N;

  // Define pointers to matrices in GPU device memory.
  complex<float> *A;
  complex<float> *B;
  complex<float> *C_cutlass;
  complex<float> *C_reference;

  //
  // Allocate matrices in GPU device memory with arbitrary seeds.
  //

  result = AllocateMatrix(&A, lda, M, K, 0);

  if (result !=  cudaSuccess) {
    return result;
  }

  result = AllocateMatrix(&B, ldb, K, N, 17);

  if (result !=  cudaSuccess) {
    cudaFree(A);
    return result;
  }

  result = AllocateMatrix(&C_cutlass, ldc, M, N, 101);

  if (result != cudaSuccess) {
    cudaFree(A);
    cudaFree(B);
    return result;
  }

  result = AllocateMatrix(&C_reference, ldc, M, N, 101);

  if (result != cudaSuccess) {
    cudaFree(A);
    cudaFree(B);
    cudaFree(C_cutlass);
    return result;
  }

  result = cudaMemcpy(C_reference, C_cutlass, sizeof_C, cudaMemcpyDeviceToDevice);

  if (result != cudaSuccess) {
    std::cerr << "Failed to copy C_cutlass matrix to C_reference: "
      << cudaGetErrorString(result) << std::endl;

    cudaFree(C_reference);
    cudaFree(C_cutlass);
    cudaFree(B);
    cudaFree(A);

    return result;
  }

  //
  // Launch CUTLASS GEMM.
  //

  result = CutlassCgemmNN(M, N, K, alpha, A, lda, B, ldb, beta, C_cutlass, ldc);

  if (result != cudaSuccess) {
    std::cerr << "CUTLASS GEMM kernel failed: "
      << cudaGetErrorString(result) << std::endl;

    cudaFree(C_reference);
    cudaFree(C_cutlass);
    cudaFree(B);
    cudaFree(A);

    return result;
  }

  //
  // Verify.
  //

  // Launch reference GEMM
  result = ReferenceGemm(M, N, K, alpha, A, lda, B, ldb, beta, C_reference, ldc);

  if (result != cudaSuccess) {
    std::cerr << "Reference GEMM kernel failed: "
      << cudaGetErrorString(result) << std::endl;

    cudaFree(C_reference);
    cudaFree(C_cutlass);
    cudaFree(B);
    cudaFree(A);

    return result;
  }

  // Copy to host and verify equivalence.
  std::vector<complex<float>> host_cutlass(ldc * N, complex<float>());
  std::vector<complex<float>> host_reference(ldc * N, complex<float>());
#if 1
  result = cudaMemcpy(host_cutlass.data(), C_cutlass, sizeof_C, cudaMemcpyDeviceToHost);

  if (result != cudaSuccess) {
    std::cerr << "Failed to copy CUTLASS GEMM results: "
      << cudaGetErrorString(result) << std::endl;

    cudaFree(C_reference);
    cudaFree(C_cutlass);
    cudaFree(B);
    cudaFree(A);

    return result;
  }

  result = cudaMemcpy(host_reference.data(), C_reference, sizeof_C, cudaMemcpyDeviceToHost);

  if (result != cudaSuccess) {
    std::cerr << "Failed to copy Reference GEMM results: "
      << cudaGetErrorString(result) << std::endl;

    cudaFree(C_reference);
    cudaFree(C_cutlass);
    cudaFree(B);
    cudaFree(A);

    return result;
  }

  //
  // Free device memory allocations.
  //

  cudaFree(C_reference);
  cudaFree(C_cutlass);
  cudaFree(B);
  cudaFree(A);

  //
  // Test for bit equivalence of results.
  //
#endif

  for (int j = 0; j < 5; j++) {
    for (int i = 0; i < 5; i++) {
      complex<float>& hc = host_cutlass[j * ldc + i];
      complex<float>& hr = host_reference[j * ldc + i];
      if (hc != hr) {
        std::cout << "hc = " << hc << ", hr = " << hr << std::endl;
        //exit(0);
      }
    }
  }

  if (host_cutlass != host_reference) {
    std::cerr << "CUTLASS results incorrect." << std::endl;

    return cudaErrorUnknown;
  }

  return cudaSuccess;
}

///////////////////////////////////////////////////////////////////////////////////////////////////

/// Entry point to basic_gemm example.
//
// usage:
//
//   00_basic_gemm <M> <N> <K> <alpha> <beta>
//
int main(int argc, const char *arg[]) {

  assert(sizeof(complex<float>) == 8);

  //
  // Parse the command line to obtain GEMM dimensions and scalar values.
  //

  // GEMM problem dimensions.
  int problem[3] = { 128, 128, 128 };

  for (int i = 1; i < argc && i < 4; ++i) {
    std::stringstream ss(arg[i]);
    ss >> problem[i - 1];
  }

  // Scalars used for linear scaling the result of the matrix product.
  complex<float> scalars[2] = { complex<float>(1.0f, 1.0f), complex<float>(1.0f, 0.0f) };

  for (int i = 4; i < argc && i < 8; i += 2) {
    std::stringstream ss(arg[i]);
    // scalars[0] is alpha (argument 4), scalars[1] is beta (argument 6);
    // each argument supplies a real and an imaginary part.
    ss >> scalars[(i - 4) / 2].real() >> scalars[(i - 4) / 2].imag();
  }

  //
  // Run the CUTLASS GEMM test.
  //

  cudaError_t result = TestCutlassGemm(
    problem[0],     // GEMM M dimension
    problem[1],     // GEMM N dimension
    problem[2],     // GEMM K dimension
    scalars[0],     // alpha
    scalars[1]      // beta
  );

  if (result == cudaSuccess) {
    std::cout << "Passed." << std::endl;
  }

  // Exit.
  return result == cudaSuccess ? 0 : -1;
}

///////////////////////////////////////////////////////////////////////////////////////////////////
