nvidia / fastertransformer

Transformer related optimization, including BERT, GPT

License: Apache License 2.0

CMake 1.84% C++ 67.00% Cuda 29.22% Python 1.32% Shell 0.54% C 0.03% HCL 0.02% Makefile 0.03%
pytorch transformer gpt bert

fastertransformer's Introduction

Note: FasterTransformer development has transitioned to TensorRT-LLM. All developers are encouraged to leverage TensorRT-LLM to get the latest improvements on LLM Inference. The NVIDIA/FasterTransformer repo will stay up, but will not have further development.

FasterTransformer

This repository provides a script and recipe to run the highly optimized transformer-based encoder and decoder components, and it is tested and maintained by NVIDIA.


Model overview

In NLP, the encoder and decoder are two important components, and the transformer layer has become a popular architecture for both. FasterTransformer implements a highly optimized transformer layer for both the encoder and decoder for inference. On Volta, Turing and Ampere GPUs, the computing power of Tensor Cores is used automatically when the precision of the data and weights is FP16.

FasterTransformer is built on top of CUDA, cuBLAS, cuBLASLt and C++. We provide at least one API for each of the following frameworks: TensorFlow, PyTorch and Triton backend, so users can integrate FasterTransformer into these frameworks directly. For the supported frameworks, we also provide example code that demonstrates how to use FasterTransformer and shows its performance on these frameworks.

Support matrix

| Models | Framework | FP16 | INT8 (after Turing) | Sparsity (after Ampere) | Tensor parallel | Pipeline parallel | FP8 (after Hopper) |
|--------|-----------|------|---------------------|-------------------------|-----------------|-------------------|--------------------|
| BERT | TensorFlow | Yes | Yes | - | - | - | - |
| BERT | PyTorch | Yes | Yes | Yes | Yes | Yes | - |
| BERT | Triton backend | Yes | - | - | Yes | Yes | - |
| BERT | C++ | Yes | Yes | - | - | - | Yes |
| XLNet | C++ | Yes | - | - | - | - | - |
| Encoder | TensorFlow | Yes | Yes | - | - | - | - |
| Encoder | PyTorch | Yes | Yes | Yes | - | - | - |
| Decoder | TensorFlow | Yes | - | - | - | - | - |
| Decoder | PyTorch | Yes | - | - | - | - | - |
| Decoding | TensorFlow | Yes | - | - | - | - | - |
| Decoding | PyTorch | Yes | - | - | - | - | - |
| GPT | TensorFlow | Yes | - | - | - | - | - |
| GPT/OPT | PyTorch | Yes | - | - | Yes | Yes | Yes |
| GPT/OPT | Triton backend | Yes | - | - | Yes | Yes | - |
| GPT-MoE | PyTorch | Yes | - | - | Yes | Yes | - |
| BLOOM | PyTorch | Yes | - | - | Yes | Yes | - |
| BLOOM | Triton backend | Yes | - | - | Yes | Yes | - |
| GPT-J | Triton backend | Yes | - | - | Yes | Yes | - |
| Longformer | PyTorch | Yes | - | - | - | - | - |
| T5/UL2 | PyTorch | Yes | - | - | Yes | Yes | - |
| T5 | TensorFlow 2 | Yes | - | - | - | - | - |
| T5/UL2 | Triton backend | Yes | - | - | Yes | Yes | - |
| T5 | TensorRT | Yes | - | - | Yes | Yes | - |
| T5-MoE | PyTorch | Yes | - | - | Yes | Yes | - |
| Swin Transformer | PyTorch | Yes | Yes | - | - | - | - |
| Swin Transformer | TensorRT | Yes | Yes | - | - | - | - |
| ViT | PyTorch | Yes | Yes | - | - | - | - |
| ViT | TensorRT | Yes | Yes | - | - | - | - |
| GPT-NeoX | PyTorch | Yes | - | - | Yes | Yes | - |
| GPT-NeoX | Triton backend | Yes | - | - | Yes | Yes | - |
| BART/mBART | PyTorch | Yes | - | - | Yes | Yes | - |
| WeNet | C++ | Yes | - | - | - | - | - |
| DeBERTa | TensorFlow 2 | Yes | - | - | On-going | On-going | - |
| DeBERTa | PyTorch | Yes | - | - | On-going | On-going | - |
  • Note that FasterTransformer supports all of the models above in C++, because all source code is built on C++.

More details about specific models can be found in docs/xxx_guide.md, where xxx is the model name. Some common questions and their answers are collected in docs/QAList.md. Note that the Encoder and BERT models are similar, so both are explained together in bert_guide.md.

Advanced

The following code lists the directory structure of FasterTransformer:

/src/fastertransformer: source code of FasterTransformer
    |--/cutlass_extensions: Implementation of cutlass gemm/kernels.
    |--/kernels: CUDA kernels for different models/layers and operations, like addBiasResidual.
    |--/layers: Implementation of layer modules, like attention layer, ffn layer.
    |--/models: Implementation of different models, like BERT, GPT.
    |--/tensorrt_plugin: encapsulates FasterTransformer into a TensorRT plugin.
    |--/tf_op: custom TensorFlow OP implementation
    |--/th_op: custom PyTorch OP implementation
    |--/triton_backend: custom triton backend implementation
    |--/utils: Contains common cuda utils, like cublasMMWrapper, memory_utils
/examples: C++, tensorflow and pytorch interface examples
    |--/cpp: C++ interface examples
    |--/pytorch: PyTorch OP examples
    |--/tensorflow: TensorFlow OP examples
    |--/tensorrt: TensorRT examples
/docs: Documents to explain the details of implementation of different models, and show the benchmark
/benchmark: Contains the scripts to run the benchmarks of different models
/tests: Unit tests
/templates: Documents to explain how to add a new model/example into FasterTransformer repo

Note that many folders contain sub-folders to split the different models. The quantization tools have been moved to the examples, e.g. examples/tensorflow/bert/bert-quantization/ and examples/pytorch/bert/bert-quantization-sparsity/.

Global Environment

FasterTransformer provides some convenient environment variables for debugging and testing (see the usage sketch after the list below).

  1. FT_LOG_LEVEL: This environment variable controls the log level of debug messages. More details are in src/fastertransformer/utils/logger.h. Note that the program prints a lot of messages when the level is lower than DEBUG, and the program becomes very slow.
  2. FT_NVTX: If it is set to ON, as in FT_NVTX=ON ./bin/gpt_example, the program will insert NVTX tags to help with profiling.
  3. FT_DEBUG_LEVEL: If it is set to DEBUG, the program runs cudaDeviceSynchronize() after every kernel; otherwise, kernels are executed asynchronously by default. This is helpful for locating the failing kernel during debugging, but the flag affects performance significantly, so it should only be used for debugging.
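
The sketch below sets these variables for a single run from Python. This is a minimal sketch: the ./bin/gpt_example path is the build artifact mentioned above, and the same effect can be had by simply prefixing the variables on the shell command line.

import os
import subprocess

# Copy the current environment and enable the debug switches described above;
# they only affect the launched process.
env = dict(os.environ)
env["FT_LOG_LEVEL"] = "DEBUG"    # verbose logging; expect slower runs
env["FT_NVTX"] = "ON"            # insert NVTX ranges for profiling
env["FT_DEBUG_LEVEL"] = "DEBUG"  # cudaDeviceSynchronize() after every kernel; debugging only

subprocess.run(["./bin/gpt_example"], env=env, check=True)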

Performance

Hardware settings:

  • 8xA100-80GBs (with mclk 1593MHz, pclk 1410MHz) with AMD EPYC 7742 64-Core Processor
  • T4 (with mclk 5000MHz, pclk 1590MHz) with Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz

In order to run the following benchmarks, we need to install the Unix computing tool "bc" by

apt-get install bc

BERT base performance

The FP16 results of TensorFlow were obtained by running the benchmarks/bert/tf_benchmark.sh.

The INT8 results of TensorFlow were obtained by running the benchmarks/bert/tf_int8_benchmark.sh.

The FP16 results of PyTorch were obtained by running the benchmarks/bert/pyt_benchmark.sh.

The INT8 results of PyTorch were obtained by running the benchmarks/bert/pyt_int8_benchmark.sh.

More benchmarks are put in docs/bert_guide.md.

BERT base performances of FasterTransformer new features

The following figure compares the performance of the different features of FasterTransformer under FP16 on T4.

For large batch sizes and sequence lengths, both EFF-FT and FT-INT8-v2 bring about a 2x speedup. Using Effective FasterTransformer and INT8v2 at the same time brings about a 3.5x speedup compared to FasterTransformer FP16 for large cases.

BERT base performance on TensorFlow

The following figure compares the performances of different features of FasterTransformer and TensorFlow XLA under FP16 on T4.

For small batch size and sequence length, using FasterTransformer can bring about 3x speedup.

For large batch size and sequence length, using Effective FasterTransformer with INT8-v2 quantization can bring about 5x speedup.

BERT base performance on PyTorch

The following figure compares the performances of different features of FasterTransformer and PyTorch TorchScript under FP16 on T4.

For small batch size and sequence length, using FasterTransformer CustomExt can bring about 4x ~ 6x speedup.

For large batch size and sequence length, using Effective FasterTransformer with INT8-v2 quantization can bring about 5x speedup.

Decoding and Decoder performance

The results of TensorFlow were obtained by running the benchmarks/decoding/tf_decoding_beamsearch_benchmark.sh and benchmarks/decoding/tf_decoding_sampling_benchmark.sh

The results of PyTorch were obtained by running the benchmarks/decoding/pyt_decoding_beamsearch_benchmark.sh.

In the decoding experiments, we used the following parameters (a short sketch after the list shows how they relate):

  • head_num = 8
  • size_per_head = 64
  • num_layers = 6 for both encoder and decoder
  • vocabulary_size = 32001 for TensorFlow sample codes, 31538 for PyTorch sample codes
  • memory_hidden_dim = 512
  • max sequence length = 128
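
A minimal sketch of how these numbers relate (the hidden dimension is simply head_num * size_per_head, which equals the memory_hidden_dim listed above):

# Relation between the decoding benchmark parameters listed above.
head_num = 8
size_per_head = 64

hidden_dim = head_num * size_per_head
print(hidden_dim)  # 512, the same value as memory_hidden_dim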

More benchmarks are put in docs/decoder_guide.md.

Decoder and Decoding end-to-end translation performance on TensorFlow

The following figure shows the speedup of the FT-Decoder op and the FT-Decoding op compared to TensorFlow under FP16 on T4. Here, we use the throughput of translating a test set, because the total number of tokens of each method may be different. Compared to TensorFlow, FT-Decoder provides a 1.5x ~ 3x speedup, while FT-Decoding provides a 4x ~ 18x speedup.

Decoder and Decoding end-to-end translation performance on PyTorch

The following figure shows the speedup of the FT-Decoder op and the FT-Decoding op compared to PyTorch under FP16 on T4. Here, we use the throughput of translating a test set, because the total number of tokens of each method may be different. Compared to PyTorch, FT-Decoder provides a 1.2x ~ 3x speedup, while FT-Decoding provides a 3.8x ~ 13x speedup.

GPT performance

The following figure compares the performances of Megatron and FasterTransformer under FP16 on A100.

In the decoding experiments, we used the following parameters (a rough parameter-count estimate follows the list):

  • head_num = 96
  • size_per_head = 128
  • num_layers = 48 for GPT-89B model, 96 for GPT-175B model
  • data_type = FP16
  • vocab_size = 51200
  • top_p = 0.9
  • tensor parallel size = 8
  • input sequence length = 512
  • output sequence length = 32
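
As a rough sanity check on this configuration, the sketch below estimates the model size with the common ~12 * num_layers * hidden_size^2 approximation for a decoder-only transformer. This is only an illustrative approximation, not how FasterTransformer computes anything:

# Rough GPT parameter-count estimate from the benchmark configuration above
# (attention + FFN weights only; embeddings are ignored).
head_num = 96
size_per_head = 128
hidden_size = head_num * size_per_head  # 12288

for num_layers, name in [(48, "GPT-89B"), (96, "GPT-175B")]:
    approx_params = 12 * num_layers * hidden_size ** 2
    print(f"{name}: ~{approx_params / 1e9:.0f}B parameters")
# GPT-89B: ~87B, GPT-175B: ~174B -- consistent with the model names above.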

Release notes

Changelog

May 2023

  • Fix bugs of generation early stopping

January 2023

  • Support GPT MoE
  • Support FP8 for Bert and GPT (Experimental)
  • Support DeBERTa on TensorFlow 2 and PyTorch

Dec 2022

  • Release the FasterTransformer 5.2
  • Support min length penalty

Nov 2022

  • Support T5 Tensorflow 2 custom op.
  • Support T5 MoE
  • Support WeNet
  • Support BART & mBART
  • Support SwinV2
  • Initial support for w8a8 int8 mode with GPT (preview)
  • Support fused mha in GPT

Oct 2022

  • Support BLOOM

Sep 2022

  • Support factual sampling (link) in gpt
  • Support for IA3 adapting scheme in T5

Aug 2022

  • Support returning context tokens embeddings in GPT
  • Release the FasterTransformer 5.1
  • Support for interactive generation
  • Support for attention time-limited memory
  • Support mt5 and t5-v1.1

July 2022

  • Support UL2 huggingface ckpt. (link)
    • Fix bug of T5 under bfloat16.
  • Add ViT INT8 TensorRT Plugin
  • Support batch sampling
  • Support shared context optimization in GPT model

June 2022

  • Support streaming generation for triton backend.
  • Support OPT.
  • Support multi-node multi-GPU BERT under FP32, FP16 and BF16.

May 2022

  • Support bfloat16 on most models.
  • Support prefix-prompt for GPT-J.
  • Support GPT-NeoX.
    • epsilon value used in layernorm is now a parameter
    • rotary embedding GPT-NeoX style (only GPT-J was implemented)
    • load per-GPU layernorm and bias parameters
    • weight conversion from EleutherAI checkpoint

April 2022

  • Release the FasterTransformer 5.0
    • Change the default accumulation type of all gemm to FP32.
    • Support bfloat16 inference in GPT model.
    • Support Nemo Megatron T5 and Megatron-LM T5 model.
    • Support ViT.

March 2022

  • Support stop_ids and ban_bad_ids in GPT-J.
  • Support dynamic start_id and end_id in GPT-J, GPT, T5 and Decoding.

February 2022

  • Support Swin Transformer.
  • Optimize the k/v cache update of beam search by in-direction buffer.
  • Support runtime input for GPT-J, T5 and GPT.
  • Support soft prompt in GPT and GPT-J.
  • Support custom all reduce kernel.
    • Limitation:
      1. Only support tensor parallel size = 8 on DGX-A100.
      2. Only support CUDA with cudaMallocAsync.

December 2021

  • Add TensorRT plugin of T5 model.
  • Change some hyper-parameters of GPT model to runtime query.
  • Optimize the memory allocator under C++ code.
  • Fix bug of CUB including when using CUDA 11.5 or newer version.

November 2021

  • Update the FasterTransformer 5.0 beta
  • Add GPT-3 INT8 weight-only quantization for batch size <= 2.
  • Support multi-node multi-GPU inference on T5.
  • Enhance the multi-node multi-GPU support in GPT-3.

August 2021

  • Release the FasterTransformer 5.0 beta
    • Refactor the repo and codes
    • And special thanks to NAVER Corp. for contributing a lot to this version, as listed below.
      • Bugs fix
        • Fix error that occurs when batch_size is less than max_batch_size for gpt pytorch wrapper.
        • Fix memory leak that occurs every forward because of reused allocator.
        • Fix race condition that occurs in repetition penalty kernel.
      • Enhancement
        • Add random seed setting.
        • Fix GEMM buffer overflow on FP16 of GPT.
        • Change to invalidate finished buffer for every completion.
        • Introduce stop_before for early stop.
    • Support Longformer.
    • Rename layer_para to pipeline_para.
    • Optimize the sorting of top p sampling.
    • Support sparsity for Ampere GPUs on BERT.
    • Support size_per_head 96, 160, 192, 224, 256 for GPT model.
    • Support multi-node inference for GPT Triton backend.

June 2021

  • Support XLNet

April 2021

  • Release the FasterTransformer 4.0
    • Support multi-gpus and multi-nodes inference for GPT model on C++ and PyTorch.
    • Support single node, multi-gpus inference for GPT model on triton.
    • Add the int8 fused multi-head attention kernel for bert.
    • Add the FP16 fused multi-head attention kernel of V100 for bert.
    • Optimize the kernel of decoder.
    • Move to independent repo.
    • Eager mode PyTorch extension is deprecated.

Dec 2020

  • Release the FasterTransformer 3.1
    • Optimize the decoding by adding the finished mask to prevent useless computing.
    • Support opennmt encoder.
    • Remove the TensorRT plugin supporting.
    • TorchScript custom op is deprecated.

Nov 2020

  • Optimize the INT8 inference.
  • Support PyTorch INT8 inference.
  • Provide PyTorch INT8 quantization tools.
  • Integrate the fused multi-head attention kernel of TensorRT into FasterTransformer.
  • Add unit test of SQuAD.
  • Update the missed NGC checkpoints.

Sep 2020

  • Support GPT2
  • Release the FasterTransformer 3.0
    • Support INT8 quantization of encoder of cpp and TensorFlow op.
    • Add bert-tf-quantization tool.
    • Fix the issue that CMake 3.15 or CMake 3.16 fails to build this project.

Aug 2020

  • Fix the bug of trt plugin.

June 2020

  • Release the FasterTransformer 2.1
    • Add Effective FasterTransformer based on the idea of Effective Transformer.
    • Optimize the beam search kernels.
    • Add PyTorch op support.

May 2020

  • Fix the bug that seq_len of encoder must be larger than 3.
  • Add the position_encoding of decoding as an input of FasterTransformer decoding. This makes it convenient to use different types of position encoding; FasterTransformer does not compute the position encoding values, but only looks them up from the table.
  • Modify the method of loading the model in translate_sample.py.

April 2020

  • Rename decoding_opennmt.h to decoding_beamsearch.h
  • Add DiverseSiblingsSearch for decoding.
  • Add sampling into Decoding
    • The implementation is in the decoding_sampling.h
    • Add top_k sampling, top_p sampling for decoding.
  • Refactor the tensorflow custom op codes.
    • Merge bert_transformer_op.h, bert_transformer_op.cu.cc into bert_transformer_op.cc
    • Merge decoder.h, decoder.cu.cc into decoder.cc
    • Merge decoding_beamsearch.h, decoding_beamsearch.cu.cc into decoding_beamsearch.cc
  • Fix the bugs of finalize function decoding.py.
  • Fix the bug of tf DiverseSiblingSearch.
  • Add BLEU scorer bleu_score.py into utils. Note that the BLEU score requires python3.
  • Fuse QKV Gemm of encoder and masked_multi_head_attention of decoder.
  • Add dynamic batch size and dynamic sequence length features into all ops.

March 2020

  • Add feature in FasterTransformer 2.0
    • Add translate_sample.py to demonstrate how to translate a sentence by restoring the pretrained model of OpenNMT-tf.
  • Fix bugs of Fastertransformer 2.0
    • Fix the bug that the maximum sequence length of the decoder cannot be larger than 128.
    • Fix the bug that decoding does not check whether it is finished after each step.
    • Fix the bug of decoder about max_seq_len.
    • Modify the decoding model structure to fit the OpenNMT-tf decoding model.
      • Add a layer normalization layer after decoder.
      • Add a normalization for inputs of decoder

February 2020

  • Release the FasterTransformer 2.0
    • Provide a highly optimized OpenNMT-tf based decoder and decoding, including C++ API and TensorFlow op.
    • Refine the sample codes of encoder.
    • Add dynamic batch size feature into encoder op.

July 2019

  • Release the FasterTransformer 1.0
    • Provide a highly optimized bert equivalent transformer layer, including C++ API, TensorFlow op and TensorRT plugin.

Known issues

  • Cannot compile on tensorflow 2.10 due to undefined symbol issue.
  • Undefined symbol errors when importing the extension
    • Please import torch first. If this has been done, it is due to an incompatible C++ ABI. You may need to check that the PyTorch used during compilation and execution is the same, or check how your PyTorch was compiled, or the version of your GCC, etc. (A minimal loading sketch follows this list.)
  • Results of TensorFlow and the OP can differ in decoding. This problem is caused by the accumulated log probability, and we do not avoid this problem.
  • If you encounter problems in a custom environment, try using gcc/g++ 4.8 to build the TensorFlow op project, especially for TensorFlow 1.14.
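
As a quick illustration of the "import torch first" advice above, here is a minimal sketch of loading a custom TorchScript op extension. The library path and name below are illustrative assumptions, not the actual FasterTransformer artifact; substitute whatever shared library your build produces.

import torch  # import torch before the extension so its C++ symbols can be resolved

# Hypothetical path; replace with the shared library produced by your build.
torch.classes.load_library("./lib/libth_custom_ops.so")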

fastertransformer's People

Contributors

842974287, andabi, appleejii, byshiue, christinaz, daemyung, daun4168, f0rmiga, fmassa, jinwoongkim, lanking520, luliyucoordinate, mbalc, mymusise, noppayut, odellus, perkzzheng, prnake, rkindi, rohithkrn, shengr, trellixvulnteam, xsr-thu, yangruipis, ying1123, yuanzhedong, yuekaizhang, zhang-ge-hao, zhangxin81, zobinhuang


fastertransformer's Issues

Fastertransformer with tensorflow-serving

Hi, I was using FasterTransformer locally and it's pretty good; we have good inference speed now. For many online applications we usually use tensorflow-serving, so do you have a plan to integrate FasterTransformer with tensorflow-serving, or is there an existing method? Thanks

[FastTransformer/Pytorch] TXX, RuntimeError: CUDA error: invalid device function

  • FastTransformer v3.0
  • CUDA 10.2

With TXX, running bash pytorch/scripts/run_mrpc.sh thsext fp32 gives:

11/24/2020 18:50:14 - INFO - __main__ -   Use custom BERT encoder for TorchScript
Traceback (most recent call last):
  File "/workdir/fastertransformer/build/pytorch/run_glue.py", line 383, in <module>
    main()
  File "/workdir/fastertransformer/build/pytorch/run_glue.py", line 359, in main
    enc_ = torch.jit.trace(enc, (fake_inp, fake_mask))
  File "/usr/local/conda/lib/python3.6/site-packages/torch/jit/_trace.py", line 742, in trace
    _module_class,
  File "/usr/local/conda/lib/python3.6/site-packages/torch/jit/_trace.py", line 966, in trace_module
    _module_class,
  File "/usr/local/conda/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/conda/lib/python3.6/site-packages/torch/jit/_trace.py", line 327, in _check_trace
    copied_dict[name] = _clone_inputs(data)
  File "/usr/local/conda/lib/python3.6/site-packages/torch/jit/_trace.py", line 160, in _clone_inputs
    )(args)
  File "/usr/local/conda/lib/python3.6/site-packages/torch/autograd/function.py", line 282, in _map
    return type(obj)(mapped)
  File "/usr/local/conda/lib/python3.6/site-packages/torch/autograd/function.py", line 278, in <genexpr>
    mapped = (_map(x) for x in obj)
  File "/usr/local/conda/lib/python3.6/site-packages/torch/autograd/function.py", line 274, in _map
    return fn(obj)
  File "/usr/local/conda/lib/python3.6/site-packages/torch/jit/_trace.py", line 149, in clone_input
    .clone(memory_format=torch.preserve_format)
RuntimeError: CUDA error: invalid device function

However, with T4, it works.

[FasterTransformer/V2] No speedUp when the sequence length is large

Hello,
I implemented a GPT-2 model following FT, and then compared the performance between PyTorch (Fairseq) and FT. This is the result:

Setting: batch = 1, hidden_units = 1024, head_num = 16, size_per_head = 64 (time in ms)
seq_len:   8     16    32    64    128   256    512    800
PyTorch:   21    23    23    23    28    23.6   22.6   24
FT:        6.2   6.4   6.7   7.6   8.6   12.3   24     34.7

From this table, it seems that FT is much slower than PyTorch when seq_len is larger than about 500; it is hard to accept that FT is slower. And why does the performance of PyTorch barely change as the sequence length increases?
I see the same phenomenon when I test the masked BERT model.
Can anyone help?
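
For convenience, a small sketch that derives the FT-vs-PyTorch speedup from the timings in the table above (the numbers are copied verbatim from the report):

# Speedup of FT over PyTorch computed from the reported timings (ms).
seq_len = [8, 16, 32, 64, 128, 256, 512, 800]
pytorch = [21, 23, 23, 23, 28, 23.6, 22.6, 24]
ft      = [6.2, 6.4, 6.7, 7.6, 8.6, 12.3, 24, 34.7]

for s, p, f in zip(seq_len, pytorch, ft):
    print(f"seq_len={s:4d}  speedup={p / f:.2f}x")
# The ratio drops below 1x between seq_len 256 and 512, which is the
# crossover the reporter describes.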

fastertransformer nan error

When using my own pretrained model, I got NaN from the eighth transformer layer.
By reading fastertransformer/cuda/open_attention.cu, I found that the softmax input is not protected (by subtracting the max value); the right formula should be:

import numpy as np

def softmax(x):
    """Compute softmax values for each set of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

I'm trying to modify softmax_kernel_v2; maybe someone can offer a better version.
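
To see the reporter's point numerically, here is a short sketch comparing the naive and max-subtracted softmax on large logits (plain NumPy, independent of the CUDA kernel in question):

import numpy as np

def softmax_naive(x):
    e_x = np.exp(x)                 # overflows for large inputs
    return e_x / e_x.sum()

def softmax_stable(x):
    e_x = np.exp(x - np.max(x))     # subtracting the max keeps exp() in range
    return e_x / e_x.sum()

x = np.array([1000.0, 1001.0, 1002.0], dtype=np.float32)
print(softmax_naive(x))   # [nan nan nan] -- exp() overflows to inf, and inf/inf = nan
print(softmax_stable(x))  # [0.09003057 0.24472848 0.66524094]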

faster transformer: Segmentation fault when "./bin/gemm_fp16 16 128 12 64"

I can compile it successfully. But when I run ./bin/gemm_fp16 16 128 12 64, I get:

./bin/gemm_fp16 16 128 12 64
Device
FP16 Gemm Testing

GEMM test 0: [M: 2048, K: 768, N: 768] from_tensor * weightQ/K/V, attr * output_kernel
[FT][ERROR] CUDA runtime error: CUDA driver version is insufficient for CUDA runtime version /home/yons/Bert-master/bert/FasterTransformer/tools/gemm_test/gemm_fp16.cu:107
[FT][ERROR] CUDA runtime error: CUDA driver version is insufficient for CUDA runtime version /home/yons/Bert-master/bert/FasterTransformer/tools/gemm_test/gemm_fp16.cu:108
[FT][ERROR] CUDA runtime error: CUDA driver version is insufficient for CUDA runtime version /home/yons/Bert-master/bert/FasterTransformer/tools/gemm_test/gemm_fp16.cu:109
Segmentation fault (core dumped)

cuda:10.0
driver/nvidia/version:410.104
gcc:7.4.0
python3.6
Tensorflow 1.13.1
cmake:3.14.4

[Transformer/V2] two-way buffer to update Mask-attention's KV

I cannot understand the meaning of the two-way buffer for K_cache and V_cache in decoding_opennmt.h. What is the benefit of it? The update at the end of every step looks like just a copy from one buffer to the other. Is it enough to use just a one-way buffer?
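
One plausible reason for a double ("ping-pong") buffer, sketched below in NumPy: beam search may reorder the beams at every step, so the cache for the next step is a gather of the current cache by parent-beam indices, and gathering from one buffer into another avoids reading and writing the same memory in the same kernel. This is only an illustration of the general pattern, not a statement about the exact decoding_opennmt.h implementation.

import numpy as np

beam, seq, dim = 4, 8, 16
k_cache = [np.random.rand(beam, seq, dim), np.zeros((beam, seq, dim))]

def reorder_cache(step, parent_ids, src, dst):
    # The new cache is the old cache gathered by the parent beam of each new beam.
    # Writing into a separate buffer avoids in-place read/write hazards.
    dst[:, :step + 1] = src[parent_ids, :step + 1]

cur = 0
for step in range(seq - 1):
    parent_ids = np.random.randint(0, beam, size=beam)  # chosen by beam search
    reorder_cache(step, parent_ids, k_cache[cur], k_cache[1 - cur])
    cur = 1 - cur  # ping-pong between the two buffers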

A question about fastertransformer

Hi, I am using the faster transformer. I have a question about step 1 (generating the gemm_config.in file): the first parameter is named batch size; is this the train batch size or the eval batch size? BTW, I am using faster transformer for a BERT-based task. Thanks

[Performance of INT8] Feature requested

It's exciting to see the open-sourcing of FasterTransformer v3.0. However, I don't find the performance of INT8 on the application code in the README.md, while FP32 and FP16 are both analyzed. Where can I find these results?

[FastTransformer/v2] run translate_sample.py with batch_size=16 failed.

Related to FastTransformer/v2
FasterTransformer/v2

Describe the bug
run translate_sample.py with batch_size=16 failed.
2020-04-08 14:29:42.124155: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2020-04-08 14:29:43.819479: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
Traceback (most recent call last):
  File "translate_sample.py", line 248, in <module>
    sess.run([op_target_tokens, op_target_length, source])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: Input to reshape is a tensor with 264192 values, but the requested shape requires a multiple of 8192
    [[node Reshape (defined at translate_sample.py:143) ]]
    [[Minimum_3/_761]]
  (1) Invalid argument: Input to reshape is a tensor with 264192 values, but the requested shape requires a multiple of 8192
    [[node Reshape (defined at translate_sample.py:143) ]]
0 successful operations.
0 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node Reshape:
 transformer/encoder/LayerNorm/batchnorm/add_1 (defined at /usr/local/lib/python2.7/dist-packages/opennmt/layers/transformer.py:324)
Input Source operations connected to node Reshape:
 transformer/encoder/LayerNorm/batchnorm/add_1 (defined at /usr/local/lib/python2.7/dist-packages/opennmt/layers/transformer.py:324)

Original stack trace for u'Reshape':
  File "translate_sample.py", line 143, in <module>
    tf_encoder_result, [batch_size, -1, encoder_hidden_dim])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 7715, in reshape
    "Reshape", tensor=tensor, shape=shape, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()

To Reproduce
Steps to reproduce the behavior:

  1. enter docker and build fasttransformer
  2. ./bin/decoding_gemm 16 4 8 64 32001 100 512 0
  3. python translate_sample.py --batch_size=16

Expected behavior

Environment
Please provide at least:

  • docker: nvcr.io/nvidia/tensorflow:19.07-py2

single example inference seems slow

Hi, my environment is TF 1.13.1. I already set up FasterTransformer v1 and used the BERT example. When I used the BERT inference with an input test file, the predict time per sample is around 0.0035s (time used in estimator.predict / number of samples); the original BERT (without FasterTransformer) is around 0.007s.

However, when I used an input_fn builder (not file based) to run inference on only one sample, the time is 0.009s (the same as one inference of the original BERT, which is also 0.009s). Could you please help with this?

[FasterTransformer v2] EncoderInitParam has no member named attr_kernel_Q

FasterTransformer v2

It seems EncoderInitParam has changed in v2:

template <typename T>
class EncoderInitParam
{
public:
  const T *from_tensor;
  const T *to_tensor;

  AttentionWeight<T> self_attention;
  const T *attr_mask;
  LayerNormWeight<T> self_layernorm;

  FFNWeight<T> ffn;
  LayerNormWeight<T> ffn_layernorm;

  T *transformer_out;
  cublasHandle_t cublas_handle;
  cudaStream_t stream;
};

I want to use C++ API, but the example encoder has only one layer, so I try the TensorRT encoder example. When I build it, many errors appear:

trt_plugin/bert_transformer_plugin.h:131:37: error: 'class fastertransformer::EncoderInitParam<float>' has no member named 'attr_kernel_Q'
         encoder_param.attr_kernel_Q = d_attr_kernel_Q_;
                                     ^
trt_plugin/bert_transformer_plugin.h:132:37: error: 'class fastertransformer::EncoderInitParam<float>' has no member named 'attr_kernel_K'
         encoder_param.attr_kernel_K = d_attr_kernel_K_;
                                     ^
trt_plugin/bert_transformer_plugin.h:133:37: error: 'class fastertransformer::EncoderInitParam<float>' has no member named 'attr_kernel_V'
         encoder_param.attr_kernel_V = d_attr_kernel_V_;

It seems that TensorRT example of v2 depends on v1's core.

Q: Can you fix the example of TensorRT in v2? Or could you add N-layer encoder example code for C++ API?

export faster transformer model for tensorflow serving

I have export the fast transformer model and put it under the tensorflow serving model dir, but it failed when loading the model with msg:
2019-11-04 10:44:17.458999: E tensorflow_serving/util/retrier.cc:37] Loading servable: {name: house_price_ft version: 1572707458} failed: Not found: Op type not registered 'BertTransformer' in binary running on mmpayfaceaep1. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) tf.contrib.resampler should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.

Any idea on serving the faster transformer model?

[FasterTransformer] Run demo crash on P100

Related to FasterTransformer

Describe the bug

When executing the command on P100:

CUDA_VISIBLE_DEVICES=1 python sample/tensorflow/transformer_fp32.py 1 12 32 12 64


To Reproduce
Steps to reproduce the behavior:

  1. cd build
  2. cmake -DSM=60 -DCMAKE_BUILD_TYPE=Release -DBUILD_TF=ON -DTF_PATH=/path/to/python2.7/site-packages/tensorflow ..
  3. make -j64
  4. ./build/bin/gemm_fp32 1 12 32 12 64
  5. CUDA_VISIBLE_DEVICES=1 python sample/tensorflow/transformer_fp32.py 1 12 32 12 64

Environment
Please provide at least:

  • Container version: no use docker
  • GPUs in the system: (e.g. 8x Tesla V100-SXM2-16GB): P100
  • CUDA driver version (e.g. 418.67): 410.93
  • CUDA version: 10
  • GCC: 6.4.0
  • cmake: 3.8.2

Feedback of faster transformer

Hi, NVIDIA's faster transformer is good to use, but the problem I have is that it's really hard to reference NVIDIA's faster transformer work in my project, since it sits under a second-level folder of DeepLearningExamples. I really want a direct reference to faster transformer, in the same way as TensorRT or nvidia-docker.

[Fast Transformer] Is there anyway to do int8 quantization without calibration ?

We need a calibration dataset to collect the quantization scales, which is really inconvenient when we do INT8 quantization. It would be nice if INT8 conversion were like a simple data-type conversion. So could we do INT8 inference without calibration?
As I see it, we can get the quantization scale from the current input with a simple max function instead of a percentile or MSE method. Although the max function is sometimes affected by outliers, it only considers the current input values, and we can do per-channel quantization on it, which may increase the precision. Besides, we can fuse a kernel to get the max inside the GEMM at little cost.
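
A minimal NumPy sketch of the max-based, per-channel dynamic quantization described above (an illustration of the idea only, not FasterTransformer's INT8 path):

import numpy as np

def quantize_per_channel_max(x):
    # Symmetric int8 quantization whose per-channel scales come from the
    # current input's abs-max, so no calibration dataset is needed.
    scale = np.abs(x).max(axis=0) / 127.0        # one scale per column/channel
    scale = np.where(scale == 0.0, 1.0, scale)   # avoid division by zero
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.random.randn(64, 8).astype(np.float32)
q, scale = quantize_per_channel_max(x)
print(np.abs(dequantize(q, scale) - x).max())  # small quantization error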

what is 'encoder_gemm' in FasterTransformer?

In FasterTransformer, it is recommended that encoder_gemm be run first every time. May I ask why? It generates the gemm_config.in file, but I couldn't find anywhere that uses this file.
Is it purely used for GPU warm-up?

[FasterTransformer] speedup on T4 issue

Environment requirements

  • CMake 3.14.6
  • CUDA 10.0
  • CUDNN 7.4.2
  • Python 2.7
  • Tensorflow 1.13
  • GCC 6

I ran a performance test on T4 using transformer_fp16.py:
https://github.com/NVIDIA/DeepLearningExamples/blob/master/FasterTransformer/sample/tensorflow/transformer_fp16.py
When batch_size is {8, 16, 32}, XLA is better.


<batch_size, layers, seq_len, head_num, size_per_head> TensorFlow XLA on T4 FP16 (in ms) FasterTransformer T4 FP16 (in ms) Speedup GEMM parameters
(1, 6, 32, 12, 64) 1.889802 ms 1.194158 ms 1.583 107,100,107,114,111
(2, 6, 32, 12, 64) 1.940294 ms 1.277116 ms 1.519 107,100,107,101,115
(4, 6, 32, 12, 64) 2.257926 ms 1.715474 ms 1.316 107,100,107,102,100
(8, 6, 32, 12, 64) 2.82261 ms 2.83554 ms 0.95 107,100,107,115,105
(16, 6, 32, 12, 64) 4.363834 ms 4.560086 ms 0.96 100,110,107,107,99
(32, 6, 32, 12, 64) 7.870094 ms 8.690178 ms 0.915 100,99,107,100,111
(64, 6, 32, 12, 64) 15.913654 ms 14.668242 ms 1.084 100,103,110,100,115
(128, 6, 32, 12, 64) 30.419768 ms 25.81849 ms 1.178 110,106,110,99,100
(256, 6, 32, 12, 64) 58.770124 ms 50.166238 ms 1.172 103,103,103,109,115
(512, 6, 32, 12, 64) 121.467958 ms 98.644416 ms 1.231 103,103,103,108,100

Is my test result correct?

[FasterTransformer/effective transformer] using wrong offset to remove seq padding

Related to FasterTransformer/effective transform

Describe the bug
In the kernel remove_sequence_length_padding, the offset used to get the real location in the source (padded) tensor is wrong.

Current impl:
template <typename T>
__global__ void remove_sequence_length_padding(const T* src, T* tgt,
                                               const int* tmp_mask_offset,
                                               int* mask_offset,
                                               const int n)
{
  const int tid = threadIdx.x;
  const int bid = blockIdx.x;
  mask_offset[bid] = tmp_mask_offset[bid];

  // the src_seq_id is not right, which should be mask_offset[bid], no need to add bid.
  const int src_seq_id = bid + mask_offset[bid];
  const int tgt_seq_id = bid;

  for (int i = tid; i < n; i += blockDim.x)
  {
    tgt[tgt_seq_id * n + i] = src[src_seq_id * n + i];
  }
}

Expected behavior
Change:
const int src_seq_id = bid + mask_offset[bid];
to:
const int src_seq_id = mask_offset[bid];
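
For readers unfamiliar with what this kernel does, here is a NumPy sketch of the "remove padding" idea from Effective Transformer: the valid tokens of a padded [batch, seq, hidden] tensor are packed into a dense [total_valid, hidden] tensor. The sketch works from the padding mask directly and deliberately takes no position on which offset convention the CUDA kernel should use.

import numpy as np

batch, seq, hidden = 2, 4, 8
x = np.random.rand(batch, seq, hidden).astype(np.float32)
mask = np.array([[1, 1, 1, 0],      # 1 = real token, 0 = padding
                 [1, 1, 0, 0]])

# Pack only the valid rows; this is the effect the kernel above implements on the GPU.
flat = x.reshape(batch * seq, hidden)
valid_idx = np.flatnonzero(mask.reshape(-1))
packed = flat[valid_idx]            # shape: (total_valid_tokens, hidden)
print(packed.shape)                 # (5, 8)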

cublasGemmStridedBatchedEx Compute Type

I want to know why, when I set the compute type to CUDA_R_16F and the A, B, C data types to CUDA_R_16F, the result of the matrix multiplication is 0, but when I set the compute type to CUDA_R_32F and A, B, C to CUDA_R_16F, the answer is right.
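
The symptom is consistent with the general pitfall of accumulating an FP16 GEMM in FP16 (overflow or severe precision loss in the reduction), which is why an FP32 compute type is normally paired with FP16 inputs. The NumPy sketch below only illustrates that pitfall; it is not a statement about the exact cuBLAS internals or about why this particular case returns 0.

import numpy as np

vals = np.full(100, 1000.0, dtype=np.float16)   # true sum is 100000

# Accumulate in float16: the running sum exceeds the float16 max (~65504) and becomes inf.
acc16 = np.float16(0.0)
for v in vals:
    acc16 = np.float16(acc16 + v)

# Accumulate in float32: exact for this input.
acc32 = vals.astype(np.float32).sum()

print(acc16, acc32)  # inf 100000.0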

fp16 cross_check false

env:
ubuntu 18.04
python: 3.6.8
GCC: 8.0.1
tensorflow: 1.13.0-rc2
CUDA: 10.0

  1. I install in tensorflow mode with V100 :
    cmake -DSM=70 -DCMAKE_BUILD_TYPE=Release -DBUILD_TF=ON -DTF_PATH=/usr/local/lib/python2.7/dist-packages/tensorflow .. # Tensorflow mode

  2. and then generate the gemm_config.in with:
    ./build/bin/gemm_fp16 100 32 12 64
    run with:
    python transformer_fp16.py 100 12 32 12 64

But I got a result like the following (screenshot omitted): the FasterTransformer output has a big diff vs OriginTransformer in FP16, about 0.0332, while running OriginTransformer twice only gives a diff of 0.007812.

My questions are:
1. Is the FasterTransformer diff of 0.0332 big or not? Will it affect convergence? OriginTransformer run twice only differs by 0.007812.
2. Why are the diffs different (0.0332 vs 0.007812)? I think this may be because FasterTransformer fuses some ops while OriginTransformer doesn't, so the intermediate variables cause the difference?

[FastTransformer v3.1/TensorFlow] Get CUBLAS_STATUS_INTERNAL_ERROR when run tensorflow/gpt2-sample.py

Related to FastTransformer v3.1/TensorFlow/GPT-2

Describe the bug
If I run ./bin/decoding_gemm 4 1 12 64 50257 32 768 0 before python tensorflow/gpt2_sample.py, I got
Internal: [FT][ERROR] CUDA runtime error: CUBLAS_STATUS_INTERNAL_ERROR FasterTransformer/fastertransformer/cuda/open_decoder.cu:1708.
However, If I don't run ./bin/decoding_gemm 4 1 12 64 50257 32 768 0 before python tensorflow/gpt2_sample.py (use the default gemm), everything is OK.

To Reproduce
Steps to reproduce the behavior:

  1. nvidia-docker run -it -v local_dir:container_dir nvcr.io/nvidia/tensorflow:19.06-py3 bash
  2. cmake -DSM=70 -DCMAKE_BUILD_TYPE=Release -DBUILD_TF=ON -DTF_PATH=/usr/local/lib/python3.5/dist-packages/tensorflow ..
  3. make
  4. ./bin/decoding_gemm 4 1 12 64 50257 32 768 0
  5. python tensorflow/gpt2_sample.py

Expected behavior
There should be no error.

Environment
Please provide at least:

  • Container version: nvcr.io/nvidia/tensorflow:19.06-py3
  • GPUs in the system: 8x Tesla V100-32GB
  • CUDA driver version: 435.21

When will the transformer decoder be released?

Hi! Thanks for the great Fastertransformer, my team has benefited a lot from it.
I remember there was a livestream held by NVIDIA a few weeks ago, where the speaker mentioned releasing the Decoder version soon. I'd like to ask whether you still have this plan and when the decoder will be released?
Thanks :)

[FasterTransformer] CUBLAS_STATUS_NOT_INITIALIZED

Related to Model/Framework(s)
(FasterTransformer)

Describe the bug
I can build faster transformer for PyTorch without any error. However, when I run the GEMM test, I get the following error:

terminate called after throwing an instance of 'std::runtime_error'
what(): [FT][ERROR] CUDA runtime error: CUBLAS_STATUS_NOT_INITIALIZED /root/DeepLearningExamples/FasterTransformer/v3.1/fastertransformer/gemm_test/encoder_gemm_func.cc:114

To Reproduce
Steps to reproduce the behavior:

  1. sudo docker run --gpus all --network=host --privileged -w '/root' --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -it nvcr.io/nvidia/pytorch:20.07-py3 /bin/bash

  2. git clone https://github.com/NVIDIA/DeepLearningExamples
    cd DeepLearningExamples/FasterTransformer/v3.1
    mkdir -p build
    cd build

  3. cmake -DSM=70 -DCMAKE_BUILD_TYPE=Release -DBUILD_THE=ON -DBUILD_THS=ON -DCXX_STD=14 ..

  4. make

  5. pip install transformers==2.5.1

  6. ./bin/encoder_gemm 32 32 12 64 0 0

Expected behavior
should work without throwing error

Environment
Please provide at least:

  • Container version: nvcr.io/nvidia/pytorch:20.07-py3
  • GPUs in the system: Tesla V100-PCIE-16GB
  • CUDA driver version 10.2

Faster Transformer make error

Hi,
I failed to make the faster transformer. My environment is : V100 GPU, Cuda10, tensorflow 1.13.1, cmake 3.15.5, gcc 5.4.0, python 2.7

My make command is :
cmake -DSM=70 -DCMAKE_BUILD_TYPE=Release -DBUILD_TF=ON -DTF_PATH=/root/anaconda3/envs/py27/lib/python2.7/site-packages/tensorflow ..

I got the Error:
-- The CXX compiler identification is GNU 5.4.0
-- The CUDA compiler identification is NVIDIA 10.0.130
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Check for working CUDA compiler: /usr/local/cuda-10.0/bin/nvcc
-- Check for working CUDA compiler: /usr/local/cuda-10.0/bin/nvcc -- works
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found CUDA: /usr/local/cuda-10.0 (found suitable version "10.0", minimum required is "10.0")
-- Found CUDA: /usr/local/cuda-10.0 (found version "10.0")
-- Assign GPU architecture (sm=70)
-- Configuring done
-- Generating done
-- Build files have been written to: /home/FasterTransformer/build

In the build/CMakeFiles/CMakeError.log:
Run Build Command(s):/usr/bin/make cmTC_2e459/fast && /usr/bin/make -f CMakeFiles/cmTC_2e459.dir/build.make CMakeFiles/cmTC_2e459.dir/build
make[1]: Entering directory '/home/FasterTransformer/build/CMakeFiles/CMakeTmp'
Building CXX object CMakeFiles/cmTC_2e459.dir/src.cxx.o
/usr/bin/c++ -DCMAKE_HAVE_LIBC_PTHREAD -o CMakeFiles/cmTC_2e459.dir/src.cxx.o -c /home/hFasterTransformer/build/CMakeFiles/CMakeTmp/src.cxx
Linking CXX executable cmTC_2e459
/dev/pkgs/cmake-3.15.5-Linux-x86_64/bin/cmake -E cmake_link_script CMakeFiles/cmTC_2e459.dir/link.txt --verbose=1
/usr/bin/c++ -DCMAKE_HAVE_LIBC_PTHREAD CMakeFiles/cmTC_2e459.dir/src.cxx.o -o cmTC_2e459
CMakeFiles/cmTC_2e459.dir/src.cxx.o: In function `main':
src.cxx:(.text+0x3c): undefined reference to `pthread_create'
src.cxx:(.text+0x48): undefined reference to `pthread_detach'
src.cxx:(.text+0x59): undefined reference to `pthread_join'
src.cxx:(.text+0x6d): undefined reference to `pthread_atfork'
collect2: error: ld returned 1 exit status
CMakeFiles/cmTC_2e459.dir/build.make:86: recipe for target 'cmTC_2e459' failed
make[1]: *** [cmTC_2e459] Error 1
make[1]: Leaving directory '/home/FasterTransformer/build/CMakeFiles/CMakeTmp'
Makefile:121: recipe for target 'cmTC_2e459/fast' failed
make: *** [cmTC_2e459/fast] Error 2

Looking for help. Thanks!!!

It cause precision error issue after removing add_QKV_bias in FasterTransformer

Related to Model/Framework(s)
( FasterTransformer)

Describe the bug
When we set the QKV "use_bias=False" in the Python code and remove the CUDA kernel function "add_QKV_bias" from fastertransformer/cuda/open_attention.cu, it causes a precision issue in float32 mode.
Here is the cross-check result before and after removing the QKV bias.
Original:
#################################
cross_check False
max diff 4.027567e-06
min diff 0.0
After removing qkv bias:
#################################
cross_check False
max diff 0.0012447834
min diff 0.0
If run in float16 mode, the precision error is even bigger.
To Reproduce
Need to modify the code of open_attention.cu,open_attention.h,bert_transformer_op.cc and transformer_fp32.py,etc.

Expected behavior
Are there any tips about why this happens? Or any suggestions on how to avoid it?

Environment

  • Container version (nvcr.io/nvidia/tensorflow:19.10-py3):
  • GPUs in the system: ( Tesla V100-SXM2-16GB):
  • CUDA driver version ( 418.67):

does faster transformer compile under -DSM=52?

In the CMakeLists.txt, it seems to support compute capability 52:

set(SM_SETS 52 60 61 70 75 80).

But if I compile with

cmake -DSM=52 -DCMAKE_BUILD_TYPE=Release -DBUILD_PYT=ON ..
make

I end up with a compile error:

/FasterTransformer/fastertransformer/cuda/cuda_kernels.cu(110): error: more than one conversion function from "const half" to a built-in type applies:
            function "__half::operator float() const"
            function "__half::operator short() const"
            function "__half::operator unsigned short() const"
            function "__half::operator int() const"
            function "__half::operator unsigned int() const"
            function "__half::operator long long() const"
            function "__half::operator unsigned long long() const"
            function "__half::operator __nv_bool() const"
          detected during:
            instantiation of "void fastertransformer::update_logits_kernel(float *, const T *, const T *, int, const __nv_bool *, int) [with T=half]" 
(349): here
            instantiation of "void fastertransformer::update_logits(float *, const T *, const T *, int, const __nv_bool *, int, int, cudaStream_t) [with T=half]" 
(355): here

...

faster transformer compile error with docker

image: nvidia/cuda 10.0-cudnn7-devel-ubuntu16.04 docker image
cmake -DSM=70 -DCMAKE_BUILD_TYPE=Release -DBUILD_TF=ON -DTF_PATH=/usr/lib/python2.7/site-packages/tensorflow .. output:
-- The CXX compiler identification is GNU 5.4.0
-- The CUDA compiler identification is NVIDIA 10.0.130
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc -- works
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found CUDA: /usr/local/cuda (found suitable version "10.0", minimum required is "10.0")
-- Found CUDA: /usr/local/cuda (found version "10.0")
-- Assign GPU architecture (sm=70)
-- Configuring done
-- Generating done
-- Build files have been written to: /root/DeepLearningExamples/FasterTransformer/build

make output:
CMakeFiles/gemm_fp32.dir/gemm_fp32.cu.o: In function `__sti____cudaRegisterAll()':
tmpxft_0000054d_00000000-5_gemm_fp32.cudafe1.cpp:(.text.startup+0x15): undefined reference to `__cudaRegisterLinkedBinary_44_tmpxft_0000054d_00000000_6_gemm_fp32_cpp1_ii_5cd8620e'
collect2: error: ld returned 1 exit status
tools/gemm_test/CMakeFiles/gemm_fp32.dir/build.make:83: recipe for target 'bin/gemm_fp32' failed
make[2]: *** [bin/gemm_fp32] Error 1
CMakeFiles/Makefile2:148: recipe for target 'tools/gemm_test/CMakeFiles/gemm_fp32.dir/all' failed
make[1]: *** [tools/gemm_test/CMakeFiles/gemm_fp32.dir/all] Error 2
Makefile:83: recipe for target 'all' failed
make: *** [all] Error 2

maybe some advice for cmake

1. I think in FindNCCL.cmake, the "set(NCCL_INCLUDE_DIR $ENV{NCCL_INCLUDE_DIR} CACHE ...)" should be surrounded by an "if (DEFINED ENV{...})" check, to avoid the cached variable being set to "" when the env var is not set; in such cases, when the variable is set afterwards, the "null" value in the cache still takes effect.
2. When building the PyTorch version, setting BUILD_GPT=OFF doesn't work, maybe because gpt.h still has to be compiled.
3. Line 63 of fused_multihead_attention_op.cc: the rank of the from tensor should be 2, not 3.

[fast-transformer/v1] CUBLAS_STATUS_ARCH_MISMATCH

There is a CUDA runtime error when I execute the demo using the command
"./transformer_fp32 1 12 128 12 64". I execute the demo after running the command "cmake -DSM=37 -DCMAKE_BUILD_TYPE=Release -DBUILD_TRT=ON -DTRT_PATH=/root/TensorRT-5.1.5.0 -DBUILD_TF=ON -DTF_PATH=/root/anaconda2/lib/python2.7/site-packages/tensorflow .." and "./gemm_fp32 1 20 12 64"

The info is listed below:
[Device Tesla K80
before allocate free 11.10 GB total 11.17 GB
After allocate free 11.07 GB used 0.10 GB total 11.17 GB
terminate called after throwing an instance of 'std::runtime_error'
what(): [FT][ERROR] CUDA runtime error: CUBLAS_STATUS_ARCH_MISMATCH /root/DeepLearningExamples-master/FasterTransformer/v1/fastertransformer/cuda/open_attention.h:171

Aborted (core dumped)]

I searched the internet using the keyword "CUBLAS_STATUS_ARCH_MISMATCH" and found some info at https://docs.nvidia.com/cuda/cublas/index.html, which says CUBLAS_STATUS_ARCH_MISMATCH may be because "the device has a compute capability lower than 5.0".

Environment
cudnn version: 7.6.4
CUDA Version: 10.0
GPU version: K80
container: nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04

[FastTransformer/Pytorch] Input error, TorchScript FP16 with and without FastTransformer

Hi, I tried the SQuAD demo in FasterTransformer 3.0 and got good results. However, when I tried:

bash pytorch/scripts/run_squad.sh ths fp16

I got error:

DeepLearningExamples/FasterTransformer/v3.0/build/pytorch/run_squad.py(474): main
DeepLearningExamples/FasterTransformer/v3.0/build/pytorch/run_squad.py(489): <module>
RuntimeError: expected scalar type Float but found Half

And when I tried:

bash pytorch/scripts/run_squad.sh thsext fp16

I got:

RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "DeepLearningExamples/FasterTransformer/v3.0/build/pytorch/utils/encoder.py", line 116, in forward
    def forward(self, hidden_states, attention_mask, sequence_lengths=torch.Tensor(0).to(torch.int).cuda()):
        for i in range(self.layer_num):
            hidden_states = self.encoders[i].forward(hidden_states, attention_mask, sequence_lengths)
                            ~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        return (hidden_states,)
RuntimeError: Inconsistency of Tensor type: input

Only in TorchScript and FP16 mode do I get this problem. My environment:

  • T4, CUDA 10.2
  • Pytorch 1.6.0

[FasterTransformer/Pytorch] CMake build failed with undefined reference to pthread_create

Related to FasterTransformer/Pytorch

Describe the bug
cmake -DSM=80 ... build failed with undefined reference to pthread_create error.
Building with Docker succeeds anyway (but this cannot meet my requirements...).

Already tried:

  • I have libpthread.so installed under /lib/x86_64-linux-gnu/ and it is registered in /etc/ld.so.conf.d/x86_64-linux-gnu.conf already.
  • I tried to fix the provided CMakeLists.txt manually, but it does not help. I think the test code is out there.
set(THREADS_PREFER_PTHREAD_FLAG ON)
find_package(Threads REQUIRED)
  • I tried various cmake versions for x86-64 linux (ubuntu 20.04), and encountered same error (pthread_create checking test code is different)
    • 3.10
    • 3.14
    • 3.16
    • 3.20-rc4

To Reproduce
Steps to reproduce the behavior:

# 1. Setup python environment
conda create -n faster_transformer python=3.8
conda activate faster_transformer
conda install pytorch -c pytorch
pip install transformers==2.5.1 opennmt-py==1.1.1  # not the point

# 2. Clone git repository
git clone git@github.com:NVIDIA/DeepLearningExamples.git
cd DeepLearningExamples/FasterTransformer/v3.1
mkdir -p build
cd build

# 3. Build cmake project (I used SM=80 for A100 GPUs)
cmake -DSM=80 -DCMAKE_BUILD_TYPE=Release -DBUILD_THE=ON -DBUILD_THS=ON -DCXX_STD=14 ..   # Error!

Console outputs:

-- The CXX compiler identification is GNU 9.3.0
-- The CUDA compiler identification is NVIDIA 11.0.221
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc -- works
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found CUDA: /usr/local/cuda (found suitable version "11.0", minimum required is "10.1")
-- Add DCUDA11_MODE
-- Assign GPU architecture (sm=80)
-- Found CUDA: /usr/local/cuda (found version "11.0")
-- Caffe2: CUDA detected: 11.0
-- Caffe2: CUDA nvcc is: /usr/local/cuda/bin/nvcc
-- Caffe2: CUDA toolkit directory: /usr/local/cuda
-- Caffe2: Header version is: 11.0
-- Found CUDNN: /usr/lib/x86_64-linux-gnu/libcudnn.so
-- Found cuDNN: v8.0.5  (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libcudnn.so)
CMake Warning at /home/kalaluthien/.miniconda3/envs/faster_transformer/lib/python3.8/site-packages/torch/share/cmake/Caffe2/public/cuda.cmake:198 (message):
  Failed to compute shorthash for libnvrtc.so
Call Stack (most recent call first):
  /home/kalaluthien/.miniconda3/envs/faster_transformer/lib/python3.8/site-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:88 (include)

[FasterTransformer v3.1/TensorFlow] Get CUBLAS_STATUS_INTERNAL_ERROR when running tensorflow/gpt2_sample.py

Related to FasterTransformer v3.1/TensorFlow/GPT-2

Describe the bug
If I run ./bin/decoding_gemm 4 1 12 64 50257 32 768 0 before python tensorflow/gpt2_sample.py, I get
Internal: [FT][ERROR] CUDA runtime error: CUBLAS_STATUS_INTERNAL_ERROR FasterTransformer/fastertransformer/cuda/open_decoder.cu:1708.
However, if I don't run ./bin/decoding_gemm 4 1 12 64 50257 32 768 0 before python tensorflow/gpt2_sample.py (i.e. use the default GEMM algorithms), everything is OK.
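
A hedged workaround sketch based on the observation above: since the default GEMM path works, remove the gemm_config.in file written by ./bin/decoding_gemm (the file name and its working-directory location are assumptions taken from other logs on this page) so that decoding falls back to the default GEMM algorithms until the tuned configuration is debugged.

import pathlib

# Assumed name/location of the file produced by ./bin/decoding_gemm.
cfg = pathlib.Path("gemm_config.in")
if cfg.exists():
    cfg.unlink()  # decoding then falls back to the default GEMM algorithms
    print("Removed", cfg)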

To Reproduce
Steps to reproduce the behavior:

  1. nvidia-docker run -it -v local_dir:container_dir nvcr.io/nvidia/tensorflow:19.06-py3 bash
  2. cmake -DSM=70 -DCMAKE_BUILD_TYPE=Release -DBUILD_TF=ON -DTF_PATH=/usr/local/lib/python3.5/dist-packages/tensorflow ..
  3. make
  4. ./bin/decoding_gemm 4 1 12 64 50257 32 768 0
  5. python tensorflow/gpt2_sample.py

Expected behavior
There should be no error.

Environment
Please provide at least:

  • Container version: nvcr.io/nvidia/tensorflow:19.06-py3
  • GPUs in the system: 8x Tesla V100-32GB
  • CUDA driver version: 435.21

[FasterTransformer] nvcc fatal : redefinition of argument 'std'

Hi, I compiled the FasterTransformer code and got this error:

[  0%] Built target copy
[  2%] Building CXX object tools/gemm_test/CMakeFiles/decoding_gemm.dir/decoding_gemm.cc.o
[  4%] Linking CXX executable ../../bin/decoding_gemm
[  4%] Built target decoding_gemm
[  6%] Building CXX object tools/gemm_test/CMakeFiles/encoder_gemm.dir/encoder_gemm.cc.o
[  8%] Linking CXX executable ../../bin/encoder_gemm
[  8%] Built target encoder_gemm
[ 10%] Building CUDA object fastertransformer/cuda/CMakeFiles/topk.dir/topk_kernels.cu.o
nvcc fatal   : redefinition of argument 'std'
make[2]: *** [fastertransformer/cuda/CMakeFiles/topk.dir/topk_kernels.cu.o] Error 1
make[1]: *** [fastertransformer/cuda/CMakeFiles/topk.dir/all] Error 2
make: *** [all] Error 2

My commands are:

mkdir -p build
cd build
cmake -DSM=75 -DCMAKE_BUILD_TYPE=Release -DBUILD_THE=ON -DBUILD_THS=ON -DBUILD_THSOP=ON -DCXX_STD=14 ..
make

Any suggestion? Thank you very much!
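
A hedged diagnostic note: "nvcc fatal : redefinition of argument 'std'" generally means nvcc received two -std=... options, which can happen when -DCXX_STD=14 is combined with CUDA flags that already set a C++ standard. A minimal sketch (the build/ path is an assumption) for finding where the duplicated flag ends up in the generated build files:

import pathlib

# Scan the generated Makefiles and flag files for compile lines carrying "-std=" twice.
for path in pathlib.Path("build").rglob("*"):
    if not path.is_file() or path.suffix not in {".make", ".txt"}:
        continue
    for line in path.read_text(errors="ignore").splitlines():
        if line.count("-std=") > 1:
            print(f"{path}: {line.strip()}")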

libtf_fastertransformer.so: undefined symbol: _ZN10tensorflow12OpDefBuilder4AttrESs?

Hi, I converted the model to FP16 format, but when I run FasterTransformer it fails with libtf_fastertransformer.so: undefined symbol: _ZN10tensorflow12OpDefBuilder4AttrESs.

python ckpt_type_convert.py --init_checkpoint=$MODEL --fp16_checkpoint=imdb_output/fp16_model.ckpt

python run_classifier_fastertf.py --task_name=Imdb --do_eval=true --data_dir=$IMDB_DIR --vocab_file=$BERT_BASE_DIR/vocab.txt --bert_config_file=$BERT_BASE_DIR/bert_config.json --init_checkpoint=imdb_output/fp16_model.ckpt --max_seq_length=128 --eval_batch_size=16 --output_dir=imdb_output --floatx=float16

/home/ubt/anaconda3/envs/fastertf/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/ubt/anaconda3/envs/fastertf/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/ubt/anaconda3/envs/fastertf/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/ubt/anaconda3/envs/fastertf/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/ubt/anaconda3/envs/fastertf/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/ubt/anaconda3/envs/fastertf/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
Traceback (most recent call last):
File "run_classifier_fastertf.py", line 54, in
import fast_infer_util as fiu
File "/home/ubt/FasterTransformer/FasterTransformer_Bert/fast_infer_util.py", line 29, in
os.path.join(build_path, 'libtf_fastertransformer.so'))
File "/home/ubt/anaconda3/envs/fastertf/lib/python3.6/site-packages/tensorflow/python/framework/load_library.py", line 61, in load_op_library
lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: /home/ubt/FasterTransformer/FasterTransformer_Bert/./build/lib/libtf_fastertransformer.so: undefined symbol: _ZN10tensorflow12OpDefBuilder4AttrESs
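
A hedged note: an undefined TensorFlow symbol like this one (OpDefBuilder::Attr taking a pre-C++11-ABI std::string) usually means libtf_fastertransformer.so was built against a different TensorFlow build or C++ ABI than the TensorFlow that loads it. A minimal check of what the installed TensorFlow expects:

import tensorflow as tf

# Compare these with the flags used when libtf_fastertransformer.so was built;
# in particular -D_GLIBCXX_USE_CXX11_ABI and the TensorFlow version must match.
print("TensorFlow version:", tf.__version__)
print("Compile flags:", tf.sysconfig.get_compile_flags())
print("Link flags:", tf.sysconfig.get_link_flags())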

[FasterTransformer] sample/transformer_trt failed in fp16 mode

Related to FasterTransformer/TensorRT

Describe the bug
When running transformer_trt.cc in FP16 mode, I hit several CUDA errors during the forward pass.

CUDA Error: CUDA_ERROR_INVALID_VALUE CUDA Error: CUDA_ERROR_INVALID_VALUE /usr/local/app/workspace/ljq/DeepLearningExamples-master/FasterTransformer/v3.1/fastertransformer/trt_fused_multihead_attention/fused_multihead_attention_v2.h 507
[FT][ERROR] CUDA runtime error: invalid configuration argument /usr/local/app/workspace/ljq/DeepLearningExamples-master/FasterTransformer/v3.1/fastertransformer/trt_fused_multihead_attention/qkvToContext.cu:137

[FT][ERROR] CUDA runtime error: invalid configuration argument /usr/local/app/workspace/ljq/DeepLearningExamples-master/FasterTransformer/v3.1/fastertransformer/cuda/open_attention.h:626

The result also seems to be incorrect.
By printing out the kernel parameters, I found that params.b is -2, which should be a non-negative number since it is used as gridDim.y.

To Reproduce
Just run transformer_trt with fp16=1.

Expected behavior
There should be no errors, and the result should be the same as the other transformer samples.

Environment
Please provide at least:

  • Container version: self-made container
  • GPUs in the system: Tesla T4-16GB
  • CUDA driver version: 440.33
  • CUDA version: 10.2.89

transformer_fp32 core

./transformer_fp32 1 12 128 12 64
Device TITAN V
before allocate free 11.00 GB total 11.75 GB
After allocate free 10.96 GB used 0.79 GB total 11.75 GB
[FT][CALL] BertEncoderTransformer
[FT][CALL] OpenMultiHeadAttention
gemm_config.in is not found
loading GEMM algorithms error, using default GEMM algorithms
gemm_config.in is not found
loading GEMM algorithms error, using default GEMM algorithms!
[FT][CALL] initialize
[FT][CALL] initialize
[FT][CALL] forward
[FT][CALL] forward
transformer_fp32: xxx/DeepLearningExamples/FasterTransformer/fastertransformer/cuda/open_attention.cu:329: void fastertransformer::cuda::OpenMultiHeadAttention<OpType_>::multiHeadAttr_nofuse_kernelLauncher(cudaStream_t, cublasHandle_t, fastertransformer::cuda::OpenMultiHeadAttention<OpType_>::DataType_, const DataType_, fastertransformer::cuda::OpenMultiHeadAttention<OpType_>::DataType_, const DataType_, fastertransformer::cuda::OpenMultiHeadAttention<OpType_>::DataType_, const DataType_, const DataType_, fastertransformer::cuda::OpenMultiHeadAttention<OpType_>::DataType_, int, int, int, int, fastertransformer::cuda::OpenMultiHeadAttention<OpType_>::DataType_) [with fastertransformer::OperationType OpType_ = (fastertransformer::OperationType)0; cudaStream_t = CUstream_st*; cublasHandle_t = cublasContext*; fastertransformer::cuda::OpenMultiHeadAttention<OpType_>::DataType_ = float]: Assertion `k > 1024' failed.
Aborted (core dumped)

[FasterTransformer3.0/Pytorch] Translation with FasterTransformer 3.0 on PyTorch: demo model file fails to load

Related to FasterTransformer/Pytorch

Describe the bug

The downloaded transformer model can't be loaded and fails with this error:

Traceback (most recent call last):
  File "pytorch/load.py", line 37, in <module>
    fields, model, model_opt = load_test_model(opt, args)
  File "/app/build/pytorch/utils/translation_model.py", line 80, in load_test_model
    map_location=lambda storage, loc: storage)
  File "/opt/conda/lib/python3.6/site-packages/torch/serialization.py", line 585, in load
    with _open_zipfile_reader(f) as opened_zipfile:
  File "/opt/conda/lib/python3.6/site-packages/torch/serialization.py", line 245, in __init__
    super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: version_ <= kMaxSupportedFileFormatVersion INTERNAL ASSERT FAILED at ../caffe2/serialize/inline_container.cc:132, please report a bug to PyTorch. Attempted to read a PyTorch file with version 3, but the maximum supported version for reading is 2. Your PyTorch installation may be too old.

To Reproduce
Steps to reproduce the behavior:
I'm trying to follow the README and use the Docker container with PyTorch 1.5:

python pytorch/run_translation.py --batch_size 128 --beam_size 4 --model_type decoding_ext --data_type fp32

This seems to be because opennmt-py requires pytorch==1.6?
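
A hedged workaround sketch, assuming the downloaded checkpoint was saved by a newer PyTorch than the 1.5 inside the container: re-save it from an environment whose PyTorch can read it, using the legacy serialization format so that the older PyTorch can still load it (the file names below are placeholders for the downloaded model):

import torch

src = "downloaded_model.pt"         # placeholder for the demo checkpoint
dst = "downloaded_model_legacy.pt"

ckpt = torch.load(src, map_location="cpu")
# _use_new_zipfile_serialization=False writes the pre-1.6 container format,
# which PyTorch 1.5 can read.
torch.save(ckpt, dst, _use_new_zipfile_serialization=False)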

Environment
Please provide at least:

  • Container version (e.g. pytorch:19.05-py3): pytorch:20.03-py3
  • GPUs in the system: (e.g. 8x Tesla V100-SXM2-16GB): 1080Ti
  • CUDA driver version (e.g. 418.67): 10.2

[FasterTransformer/Pytorch] CMake build failed with undefined reference to pthread_create

Related to FasterTransformer/Pytorch

Describe the bug
The cmake -DSM=80 ... build fails with an "undefined reference to `pthread_create'" error.
Building with Docker does succeed, but that cannot meet my requirements.

Already tried:

  • I have libpthread.so installed under /lib/x86_64-linux-gnu/ and it is registered in /etc/ld.so.conf.d/x86_64-linux-gnu.conf already.
  • I tried to fix the provided CMakeLists.txt manually (for example, with the lines below), but it did not help; I suspect the problem is in CMake's own test code.
set(THREADS_PREFER_PTHREAD_FLAG ON)
find_package(Threads REQUIRED)
  • I tried various CMake versions for x86-64 Linux (Ubuntu 20.04) and encountered the same error (the pthread_create check's test code differs between versions):
    • 3.10
    • 3.14
    • 3.16
    • 3.20-rc4

To Reproduce
Steps to reproduce the behavior:

# 1. Setup python environment
conda create -n faster_transformer python=3.8
conda activate faster_transformer
conda install pytorch -c pytorch
pip install transformers==2.5.1 opennmt-py==1.1.1  # not the point

# 2. Clone git repository
git clone git@github.com:NVIDIA/DeepLearningExamples.git
cd DeepLearningExamples/FasterTransformer/v3.1
mkdir -p build
cd build

# 3. Build cmake project (I used SM=80 for A100 GPUs)
cmake -DSM=80 -DCMAKE_BUILD_TYPE=Release -DBUILD_THE=ON -DBUILD_THS=ON -DCXX_STD=14 ..   # Error!

Console outputs:

-- The CXX compiler identification is GNU 9.3.0
-- The CUDA compiler identification is NVIDIA 11.0.221
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc -- works
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found CUDA: /usr/local/cuda (found suitable version "11.0", minimum required is "10.1")
-- Add DCUDA11_MODE
-- Assign GPU architecture (sm=80)
-- Found CUDA: /usr/local/cuda (found version "11.0")
-- Caffe2: CUDA detected: 11.0
-- Caffe2: CUDA nvcc is: /usr/local/cuda/bin/nvcc
-- Caffe2: CUDA toolkit directory: /usr/local/cuda
-- Caffe2: Header version is: 11.0
-- Found CUDNN: /usr/lib/x86_64-linux-gnu/libcudnn.so
-- Found cuDNN: v8.0.5  (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libcudnn.so)
CMake Warning at /home/kalaluthien/.miniconda3/envs/faster_transformer/lib/python3.8/site-packages/torch/share/cmake/Caffe2/public/cuda.cmake:198 (message):
  Failed to compute shorthash for libnvrtc.so
Call Stack (most recent call first):
  /home/kalaluthien/.miniconda3/envs/faster_transformer/lib/python3.8/site-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:88 (include)
  /home/kalaluthien/.miniconda3/envs/faster_transformer/lib/python3.8/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:68 (find_package)
  CMakeLists.txt:155 (find_package)


CMake Warning at /home/kalaluthien/.miniconda3/envs/faster_transformer/lib/python3.8/site-packages/torch/share/cmake/Caffe2/public/utils.cmake:365 (message):
  In the future we will require one to explicitly pass TORCH_CUDA_ARCH_LIST
  to cmake instead of implicitly setting it as an env variable.  This will
  become a FATAL_ERROR in future version of pytorch.
Call Stack (most recent call first):
  /home/kalaluthien/.miniconda3/envs/faster_transformer/lib/python3.8/site-packages/torch/share/cmake/Caffe2/public/cuda.cmake:483 (torch_cuda_get_nvcc_gencode_flag)
  /home/kalaluthien/.miniconda3/envs/faster_transformer/lib/python3.8/site-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:88 (include)
  /home/kalaluthien/.miniconda3/envs/faster_transformer/lib/python3.8/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:68 (find_package)
  CMakeLists.txt:155 (find_package)


-- Added CUDA NVCC flags for: -gencode;arch=compute_80,code=sm_80
-- Found Torch: /home/kalaluthien/.miniconda3/envs/faster_transformer/lib/python3.8/site-packages/torch/lib/libtorch.so
<string>:3: DeprecationWarning: SO is deprecated, use EXT_SUFFIX
Traceback (most recent call last):
  File "<string>", line 1, in <module>
TypeError: _prepare_ldflags() missing 1 required positional argument: 'is_standalone'
CMake Error at CMakeLists.txt:175 (message):
  PyTorch link config Error.

Log file:

# CMakeFiles/CMakeError.log
Determining if the pthread_create exist failed with the following output:
Change Dir: /home/kalaluthien/DeepLearningExamples/FasterTransformer/v3.1/build/CMakeFiles/CMakeTmp

Run Build Command(s):/usr/bin/make cmTC_5ac33/fast
/usr/bin/make -f CMakeFiles/cmTC_5ac33.dir/build.make CMakeFiles/cmTC_5ac33.dir/build
make[1]: Entering directory '/home/kalaluthien/DeepLearningExamples/FasterTransformer/v3.1/build/CMakeFiles/CMakeTmp'
Building CXX object CMakeFiles/cmTC_5ac33.dir/CheckSymbolExists.cxx.o
/usr/bin/c++     -o CMakeFiles/cmTC_5ac33.dir/CheckSymbolExists.cxx.o -c /home/kalaluthien/DeepLearningExamples/FasterTransformer/v3.1/build/CMakeFiles/CMakeTmp/CheckSymbolExists.cxx
Linking CXX executable cmTC_5ac33
/home/kalaluthien/cmake-3.14.0-Linux-x86_64/bin/cmake -E cmake_link_script CMakeFiles/cmTC_5ac33.dir/link.txt --verbose=1
/usr/bin/c++       CMakeFiles/cmTC_5ac33.dir/CheckSymbolExists.cxx.o  -o cmTC_5ac33
/usr/bin/ld: CMakeFiles/cmTC_5ac33.dir/CheckSymbolExists.cxx.o: in function `main':
CheckSymbolExists.cxx:(.text+0x1f): undefined reference to `pthread_create'
collect2: error: ld returned 1 exit status
make[1]: *** [CMakeFiles/cmTC_5ac33.dir/build.make:87: cmTC_5ac33] Error 1
make[1]: Leaving directory '/home/kalaluthien/DeepLearningExamples/FasterTransformer/v3.1/build/CMakeFiles/CMakeTmp'
make: *** [Makefile:121: cmTC_5ac33/fast] Error 2

File /home/kalaluthien/DeepLearningExamples/FasterTransformer/v3.1/build/CMakeFiles/CMakeTmp/CheckSymbolExists.cxx:
/* */
#include <pthread.h>

int main(int argc, char** argv)
{
  (void)argv;
#ifndef pthread_create
  return ((int*)(&pthread_create))[argc];
#else
  (void)argc;
  return 0;
#endif
}

Determining if the function pthread_create exists in the pthreads failed with the following output:
Change Dir: /home/kalaluthien/DeepLearningExamples/FasterTransformer/v3.1/build/CMakeFiles/CMakeTmp

Run Build Command(s):/usr/bin/make cmTC_2be15/fast
/usr/bin/make -f CMakeFiles/cmTC_2be15.dir/build.make CMakeFiles/cmTC_2be15.dir/build
make[1]: Entering directory '/home/kalaluthien/DeepLearningExamples/FasterTransformer/v3.1/build/CMakeFiles/CMakeTmp'
Building CXX object CMakeFiles/cmTC_2be15.dir/CheckFunctionExists.cxx.o
/usr/bin/c++    -DCHECK_FUNCTION_EXISTS=pthread_create   -o CMakeFiles/cmTC_2be15.dir/CheckFunctionExists.cxx.o -c /home/kalaluthien/DeepLearningExamples/FasterTransformer/v3.1/build/CMakeFiles/CheckLibraryExists/CheckFunctionExists.cxx
Linking CXX executable cmTC_2be15
/home/kalaluthien/cmake-3.14.0-Linux-x86_64/bin/cmake -E cmake_link_script CMakeFiles/cmTC_2be15.dir/link.txt --verbose=1
/usr/bin/c++   -DCHECK_FUNCTION_EXISTS=pthread_create    CMakeFiles/cmTC_2be15.dir/CheckFunctionExists.cxx.o  -o cmTC_2be15 -lpthreads
/usr/bin/ld: cannot find -lpthreads
collect2: error: ld returned 1 exit status
make[1]: *** [CMakeFiles/cmTC_2be15.dir/build.make:87: cmTC_2be15] Error 1
make[1]: Leaving directory '/home/kalaluthien/DeepLearningExamples/FasterTransformer/v3.1/build/CMakeFiles/CMakeTmp'
make: *** [Makefile:121: cmTC_2be15/fast] Error 2

I found that CMake uses -lpthread when it compiles the FasterTransformer targets, and -lpthreads only for the CheckSymbolExists probe. Weird.
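
A hedged reading of this log: the failing -lpthreads probe is CMake's normal fallback check, and the console output above already reports "Found Threads: TRUE" once -lpthread is found, so the CMakeError.log entries are likely a red herring. The configure step actually stops at the TypeError raised by torch.utils.cpp_extension._prepare_ldflags, a private PyTorch helper whose newer versions require an is_standalone argument, which suggests a mismatch between the installed PyTorch and the Python one-liner invoked around CMakeLists.txt:175. A minimal sketch to confirm what the installed PyTorch expects:

import inspect
from torch.utils import cpp_extension

# If 'is_standalone' appears here, the CMake script's call predates this PyTorch.
print(inspect.signature(cpp_extension._prepare_ldflags))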

Expected behavior
The CMake project should build successfully.

Environment
Please provide at least:

  • Container version (e.g. pytorch:19.05-py3): latest native PyTorch, but that is not the point.
  • GPUs in the system: (e.g. 8x Tesla V100-SXM2-16GB): A100-SXM4-40GB
  • CUDA driver version (e.g. 418.67): 450.80.02

[Faster transformer] Having a guide on how to use weights from a Hugging Face transformer model (RoBERTa-based) with FasterTransformer 3.1

Related to FasterTransformer + Hugging Face + PyTorch

Is your feature request related to a problem? Please describe.
It seems that FasterTransformer should be able to import weights from a RoBERTa-based Hugging Face model, but the way to do it is not obvious.

Describe the solution you'd like
A section of the README dedicated to using weights from Hugging Face Transformers v4 (the latest version) in a FasterTransformer model.

Describe alternatives you've considered
N/A

Additional context
At some point in the project, Hugging Face Transformers v2 is used, but my attempt to load a RoBERTa-based model from Hugging Face v4 failed, even though in theory it's the same architecture. I tried renaming the layers to match those expected by BERT, but it didn't work: the outputs didn't match the ones produced before the transfer... There are probably other transformations to perform, but I didn't find which ones.

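# Map RoBERTa checkpoint key names onto the BERT-style names used by the demo scripts.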
def rewrite_layer_name(layer_name: str) -> str:
    if "roberta." in layer_name:
        layer_name = layer_name.replace("roberta.", "bert.")
    elif "classifier.dense." in layer_name:
        layer_name = layer_name.replace("classifier.dense.", "bert.pooler.dense.")
    elif "classifier.out_proj." in layer_name:
        layer_name = layer_name.replace("classifier.out_proj.", "classifier.")
    return layer_name
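
A hedged sketch of one known BERT/RoBERTa difference that key renaming alone does not cover (the helper name and the 514-row check are assumptions for roberta-base-style checkpoints); other differences, such as the tokenizer and vocabulary, may also contribute to the mismatched outputs:

# Hypothetical helper: RoBERTa's learned position embeddings are offset by
# padding_idx + 1 (= 2), so the first two rows are dropped when mapping the
# weights onto a BERT-style checkpoint (roberta-base stores 514 = 512 + 2 rows).
def adjust_position_embeddings(state_dict):
    key = "bert.embeddings.position_embeddings.weight"  # name after rewrite_layer_name
    weight = state_dict.get(key)
    if weight is not None and weight.shape[0] == 514:
        state_dict[key] = weight[2:]
    return state_dict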
