nvidia / fastertransformer

Transformer related optimization, including BERT, GPT

License: Apache License 2.0

CMake 1.84% C++ 67.00% Cuda 29.22% Python 1.32% Shell 0.54% C 0.03% HCL 0.02% Makefile 0.03%
pytorch transformer gpt bert

fastertransformer's Introduction

Note: FasterTransformer development has transitioned to TensorRT-LLM. All developers are encouraged to leverage TensorRT-LLM to get the latest improvements on LLM Inference. The NVIDIA/FasterTransformer repo will stay up, but will not have further development.

FasterTransformer

This repository provides a script and recipe to run the highly optimized transformer-based encoder and decoder components, and it is tested and maintained by NVIDIA.


Model overview

In NLP, the encoder and decoder are two important components, and the transformer layer has become a popular architecture for both. FasterTransformer implements a highly optimized transformer layer for both the encoder and decoder for inference. On Volta, Turing and Ampere GPUs, the computing power of Tensor Cores is used automatically when the precision of the data and weights is FP16.

FasterTransformer is built on top of CUDA, cuBLAS, cuBLASLt and C++. We provide at least one API for each of the following frameworks: TensorFlow, PyTorch and Triton backend, so users can integrate FasterTransformer into these frameworks directly. For the supported frameworks, we also provide example code that demonstrates how to use FasterTransformer and shows its performance on these frameworks.

Support matrix

| Models | Framework | FP16 | INT8 (after Turing) | Sparsity (after Ampere) | Tensor parallel | Pipeline parallel | FP8 (after Hopper) |
|--------|-----------|------|---------------------|-------------------------|-----------------|-------------------|--------------------|
| BERT | TensorFlow | Yes | Yes | - | - | - | - |
| BERT | PyTorch | Yes | Yes | Yes | Yes | Yes | - |
| BERT | Triton backend | Yes | - | - | Yes | Yes | - |
| BERT | C++ | Yes | Yes | - | - | - | Yes |
| XLNet | C++ | Yes | - | - | - | - | - |
| Encoder | TensorFlow | Yes | Yes | - | - | - | - |
| Encoder | PyTorch | Yes | Yes | Yes | - | - | - |
| Decoder | TensorFlow | Yes | - | - | - | - | - |
| Decoder | PyTorch | Yes | - | - | - | - | - |
| Decoding | TensorFlow | Yes | - | - | - | - | - |
| Decoding | PyTorch | Yes | - | - | - | - | - |
| GPT | TensorFlow | Yes | - | - | - | - | - |
| GPT/OPT | PyTorch | Yes | - | - | Yes | Yes | Yes |
| GPT/OPT | Triton backend | Yes | - | - | Yes | Yes | - |
| GPT-MoE | PyTorch | Yes | - | - | Yes | Yes | - |
| BLOOM | PyTorch | Yes | - | - | Yes | Yes | - |
| BLOOM | Triton backend | Yes | - | - | Yes | Yes | - |
| GPT-J | Triton backend | Yes | - | - | Yes | Yes | - |
| Longformer | PyTorch | Yes | - | - | - | - | - |
| T5/UL2 | PyTorch | Yes | - | - | Yes | Yes | - |
| T5 | TensorFlow 2 | Yes | - | - | - | - | - |
| T5/UL2 | Triton backend | Yes | - | - | Yes | Yes | - |
| T5 | TensorRT | Yes | - | - | Yes | Yes | - |
| T5-MoE | PyTorch | Yes | - | - | Yes | Yes | - |
| Swin Transformer | PyTorch | Yes | Yes | - | - | - | - |
| Swin Transformer | TensorRT | Yes | Yes | - | - | - | - |
| ViT | PyTorch | Yes | Yes | - | - | - | - |
| ViT | TensorRT | Yes | Yes | - | - | - | - |
| GPT-NeoX | PyTorch | Yes | - | - | Yes | Yes | - |
| GPT-NeoX | Triton backend | Yes | - | - | Yes | Yes | - |
| BART/mBART | PyTorch | Yes | - | - | Yes | Yes | - |
| WeNet | C++ | Yes | - | - | - | - | - |
| DeBERTa | TensorFlow 2 | Yes | - | - | On-going | On-going | - |
| DeBERTa | PyTorch | Yes | - | - | On-going | On-going | - |
  • Note that FasterTransformer supports all of the models above in C++, because all source code is built on C++.

More details about specific models can be found in docs/xxx_guide.md, where xxx is the model name. Some common questions and their answers are collected in docs/QAList.md. Note that the Encoder and BERT models are similar, so both are explained together in bert_guide.md.

Advanced

The following code lists the directory structure of FasterTransformer:

/src/fastertransformer: source code of FasterTransformer
    |--/cutlass_extensions: Implementation of cutlass gemm/kernels.
    |--/kernels: CUDA kernels for different models/layers and operations, like addBiasResidual.
    |--/layers: Implementation of layer modules, like attention layer, ffn layer.
    |--/models: Implementation of different models, like BERT, GPT.
    |--/tensorrt_plugin: encapsulates FasterTransformer into a TensorRT plugin.
    |--/tf_op: custom TensorFlow OP implementation
    |--/th_op: custom PyTorch OP implementation
    |--/triton_backend: custom triton backend implementation
    |--/utils: Contains common cuda utils, like cublasMMWrapper, memory_utils
/examples: C++, tensorflow and pytorch interface examples
    |--/cpp: C++ interface examples
    |--/pytorch: PyTorch OP examples
    |--/tensorflow: TensorFlow OP examples
    |--/tensorrt: TensorRT examples
/docs: Documents to explain the details of implementation of different models, and show the benchmark
/benchmark: Contains the scripts to run the benchmarks of different models
/tests: Unit tests
/templates: Documents to explain how to add a new model/example into FasterTransformer repo

Note that many folders contain sub-folders to split the different models. The quantization tools have been moved to the examples, e.g. examples/tensorflow/bert/bert-quantization/ and examples/pytorch/bert/bert-quantization-sparsity/.

Global Environment

FasterTransformer provides some convenient environment variables for debugging and testing (see the usage sketch after the list below).

  1. FT_LOG_LEVEL: This environment variable controls the log level of debug messages. More details are in src/fastertransformer/utils/logger.h. Note that the program prints a lot of messages when the level is lower than DEBUG, and the program becomes very slow.
  2. FT_NVTX: If it is set to ON, as in FT_NVTX=ON ./bin/gpt_example, the program will insert NVTX tags to help with profiling.
  3. FT_DEBUG_LEVEL: If it is set to DEBUG, the program runs cudaDeviceSynchronize() after every kernel; otherwise, kernels are executed asynchronously by default. This is helpful for locating the failing kernel during debugging, but the flag affects performance significantly, so it should only be used for debugging.
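
The sketch below sets these variables for a single run from Python. This is a minimal sketch: the ./bin/gpt_example path is the build artifact mentioned above, and the same effect can be had by simply prefixing the variables on the shell command line.

import os
import subprocess

# Copy the current environment and enable the debug switches described above;
# they only affect the launched process.
env = dict(os.environ)
env["FT_LOG_LEVEL"] = "DEBUG"    # verbose logging; expect slower runs
env["FT_NVTX"] = "ON"            # insert NVTX ranges for profiling
env["FT_DEBUG_LEVEL"] = "DEBUG"  # cudaDeviceSynchronize() after every kernel; debugging only

subprocess.run(["./bin/gpt_example"], env=env, check=True)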

Performance

Hardware settings:

  • 8xA100-80GBs (with mclk 1593MHz, pclk 1410MHz) with AMD EPYC 7742 64-Core Processor
  • T4 (with mclk 5000MHz, pclk 1590MHz) with Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz

In order to run the following benchmarks, we need to install the Unix computing tool "bc" by

apt-get install bc

BERT base performance

The FP16 results of TensorFlow were obtained by running the benchmarks/bert/tf_benchmark.sh.

The INT8 results of TensorFlow were obtained by running the benchmarks/bert/tf_int8_benchmark.sh.

The FP16 results of PyTorch were obtained by running the benchmarks/bert/pyt_benchmark.sh.

The INT8 results of PyTorch were obtained by running the benchmarks/bert/pyt_int8_benchmark.sh.

More benchmarks are put in docs/bert_guide.md.

BERT base performances of FasterTransformer new features

The following figure compares the performance of the different features of FasterTransformer under FP16 on T4.

For large batch sizes and sequence lengths, both EFF-FT and FT-INT8-v2 bring about a 2x speedup. Using Effective FasterTransformer and INT8v2 at the same time brings about a 3.5x speedup compared to FasterTransformer FP16 for large cases.

BERT base performance on TensorFlow

The following figure compares the performances of different features of FasterTransformer and TensorFlow XLA under FP16 on T4.

For small batch size and sequence length, using FasterTransformer can bring about 3x speedup.

For large batch size and sequence length, using Effective FasterTransformer with INT8-v2 quantization can bring about 5x speedup.

BERT base performance on PyTorch

The following figure compares the performances of different features of FasterTransformer and PyTorch TorchScript under FP16 on T4.

For small batch size and sequence length, using FasterTransformer CustomExt can bring about 4x ~ 6x speedup.

For large batch size and sequence length, using Effective FasterTransformer with INT8-v2 quantization can bring about 5x speedup.

Decoding and Decoder performance

The results of TensorFlow were obtained by running the benchmarks/decoding/tf_decoding_beamsearch_benchmark.sh and benchmarks/decoding/tf_decoding_sampling_benchmark.sh

The results of PyTorch were obtained by running the benchmarks/decoding/pyt_decoding_beamsearch_benchmark.sh.

In the decoding experiments, we used the following parameters (a short sketch after the list shows how they relate):

  • head_num = 8
  • size_per_head = 64
  • num_layers = 6 for both encoder and decoder
  • vocabulary_size = 32001 for TensorFlow sample codes, 31538 for PyTorch sample codes
  • memory_hidden_dim = 512
  • max sequence length = 128
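
A minimal sketch of how these numbers relate (the hidden dimension is simply head_num * size_per_head, which equals the memory_hidden_dim listed above):

# Relation between the decoding benchmark parameters listed above.
head_num = 8
size_per_head = 64

hidden_dim = head_num * size_per_head
print(hidden_dim)  # 512, the same value as memory_hidden_dim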

More benchmarks are put in docs/decoder_guide.md.

Decoder and Decoding end-to-end translation performance on TensorFlow

The following figure shows the speedup of the FT-Decoder op and the FT-Decoding op compared to TensorFlow under FP16 on T4. Here, we use the throughput of translating a test set, because the total number of tokens of each method may be different. Compared to TensorFlow, FT-Decoder provides a 1.5x ~ 3x speedup, while FT-Decoding provides a 4x ~ 18x speedup.

Decoder and Decoding end-to-end translation performance on PyTorch

The following figure shows the speedup of the FT-Decoder op and the FT-Decoding op compared to PyTorch under FP16 on T4. Here, we use the throughput of translating a test set, because the total number of tokens of each method may be different. Compared to PyTorch, FT-Decoder provides a 1.2x ~ 3x speedup, while FT-Decoding provides a 3.8x ~ 13x speedup.

GPT performance

The following figure compares the performances of Megatron and FasterTransformer under FP16 on A100.

In the decoding experiments, we used the following parameters (a rough parameter-count estimate follows the list):

  • head_num = 96
  • size_per_head = 128
  • num_layers = 48 for GPT-89B model, 96 for GPT-175B model
  • data_type = FP16
  • vocab_size = 51200
  • top_p = 0.9
  • tensor parallel size = 8
  • input sequence length = 512
  • output sequence length = 32
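
As a rough sanity check on this configuration, the sketch below estimates the model size with the common ~12 * num_layers * hidden_size^2 approximation for a decoder-only transformer. This is only an illustrative approximation, not how FasterTransformer computes anything:

# Rough GPT parameter-count estimate from the benchmark configuration above
# (attention + FFN weights only; embeddings are ignored).
head_num = 96
size_per_head = 128
hidden_size = head_num * size_per_head  # 12288

for num_layers, name in [(48, "GPT-89B"), (96, "GPT-175B")]:
    approx_params = 12 * num_layers * hidden_size ** 2
    print(f"{name}: ~{approx_params / 1e9:.0f}B parameters")
# GPT-89B: ~87B, GPT-175B: ~174B -- consistent with the model names above.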

Release notes

Changelog

May 2023

  • Fix bugs of generation early stopping

January 2023

  • Support GPT MoE
  • Support FP8 for Bert and GPT (Experimental)
  • Support DeBERTa on TensorFlow 2 and PyTorch

Dec 2022

  • Release the FasterTransformer 5.2
  • Support min length penalty

Nov 2022

  • Support T5 Tensorflow 2 custom op.
  • Support T5 MoE
  • Support WeNet
  • Support BART & mBART
  • Support SwinV2
  • Initial support for w8a8 int8 mode with GPT (preview)
  • Support fused mha in GPT

Oct 2022

  • Support BLOOM

Sep 2022

  • Support factual sampling (link) in gpt
  • Support for IA3 adapting scheme in T5

Aug 2022

  • Support returning context tokens embeddings in GPT
  • Release the FasterTransformer 5.1
  • Support for interactive generation
  • Support for attention time-limited memory
  • Support mt5 and t5-v1.1

July 2022

  • Support UL2 huggingface ckpt. (link)
    • Fix bug of T5 under bfloat16.
  • Add ViT INT8 TensorRT Plugin
  • Support batch sampling
  • Support shared context optimization in GPT model

June 2022

  • Support streaming generation for triton backend.
  • Support OPT.
  • Support multi-node multi-GPU BERT under FP32, FP16 and BF16.

May 2022

  • Support bfloat16 on most models.
  • Support prefix-prompt for GPT-J.
  • Support GPT-NeoX.
    • epsilon value used in layernorm is now a parameter
    • rotary embedding GPT-NeoX style (only GPT-J was implemented)
    • load per-GPU layernorm and bias parameters
    • weight conversion from EleutherAI checkpoint

April 2022

  • Release the FasterTransformer 5.0
    • Change the default accumulation type of all gemm to FP32.
    • Support bfloat16 inference in GPT model.
    • Support Nemo Megatron T5 and Megatron-LM T5 model.
    • Support ViT.

March 2022

  • Support stop_ids and ban_bad_ids in GPT-J.
  • Support dynamic start_id and end_id in GPT-J, GPT, T5 and Decoding.

February 2022

  • Support Swin Transformer.
  • Optimize the k/v cache update of beam search by in-direction buffer.
  • Support runtime input for GPT-J, T5 and GPT.
  • Support soft prompt in GPT and GPT-J.
  • Support custom all reduce kernel.
    • Limitation:
      1. Only support tensor parallel size = 8 on DGX-A100.
      2. Only support CUDA with cudaMallocAsync.

December 2021

  • Add TensorRT plugin of T5 model.
  • Change some hyper-parameters of GPT model to runtime query.
  • Optimize the memory allocator under C++ code.
  • Fix bug of CUB including when using CUDA 11.5 or newer version.

November 2021

  • Update the FasterTransformer 5.0 beta
  • Add GPT-3 INT8 weight-only quantization for batch size <= 2.
  • Support multi-node multi-GPU inference on T5.
  • Enhance the multi-node multi-GPU support in GPT-3.

August 2021

  • Release the FasterTransformer 5.0 beta
    • Refactor the repo and codes
    • And special thanks to NAVER Corp. for contributing a lot to this version, as listed below.
      • Bugs fix
        • Fix error that occurs when batch_size is less than max_batch_size for gpt pytorch wrapper.
        • Fix memory leak that occurs every forward because of reused allocator.
        • Fix race condition that occurs in repetition penalty kernel.
      • Enhancement
        • Add random seed setting.
        • Fix GEMM buffer overflow on FP16 of GPT.
        • Change to invalidate finished buffer for every completion.
        • Introduce stop_before for early stop.
    • Support Longformer.
    • Rename layer_para to pipeline_para.
    • Optimize the sorting of top p sampling.
    • Support sparsity for Ampere GPUs on BERT.
    • Support size_per_head 96, 160, 192, 224, 256 for GPT model.
    • Support multi-node inference for GPT Triton backend.

June 2021

  • Support XLNet

April 2021

  • Release the FasterTransformer 4.0
    • Support multi-gpus and multi-nodes inference for GPT model on C++ and PyTorch.
    • Support single node, multi-gpus inference for GPT model on triton.
    • Add the int8 fused multi-head attention kernel for bert.
    • Add the FP16 fused multi-head attention kernel of V100 for bert.
    • Optimize the kernel of decoder.
    • Move to independent repo.
    • Eager mode PyTorch extension is deprecated.

Dec 2020

  • Release the FasterTransformer 3.1
    • Optimize the decoding by adding the finished mask to prevent useless computing.
    • Support opennmt encoder.
    • Remove the TensorRT plugin supporting.
    • TorchScript custom op is deprecated.

Nov 2020

  • Optimize the INT8 inference.
  • Support PyTorch INT8 inference.
  • Provide PyTorch INT8 quantization tools.
  • Integrate the fused multi-head attention kernel of TensorRT into FasterTransformer.
  • Add unit test of SQuAD.
  • Update the missed NGC checkpoints.

Sep 2020

  • Support GPT2
  • Release the FasterTransformer 3.0
    • Support INT8 quantization of encoder of cpp and TensorFlow op.
    • Add bert-tf-quantization tool.
    • Fix the issue that CMake 3.15 or CMake 3.16 fails to build this project.

Aug 2020

  • Fix the bug of trt plugin.

June 2020

  • Release the FasterTransformer 2.1
    • Add Effective FasterTransformer based on the idea of Effective Transformer.
    • Optimize the beam search kernels.
    • Add PyTorch op support.

May 2020

  • Fix the bug that seq_len of encoder must be larger than 3.
  • Add the position_encoding of decoding as an input of FasterTransformer decoding. This makes it convenient to use different types of position encoding; FasterTransformer does not compute the position encoding values, but only looks them up from the table.
  • Modify the method of loading the model in translate_sample.py.

April 2020

  • Rename decoding_opennmt.h to decoding_beamsearch.h
  • Add DiverseSiblingsSearch for decoding.
  • Add sampling into Decoding
    • The implementation is in the decoding_sampling.h
    • Add top_k sampling, top_p sampling for decoding.
  • Refactor the tensorflow custom op codes.
    • Merge bert_transformer_op.h, bert_transformer_op.cu.cc into bert_transformer_op.cc
    • Merge decoder.h, decoder.cu.cc into decoder.cc
    • Merge decoding_beamsearch.h, decoding_beamsearch.cu.cc into decoding_beamsearch.cc
  • Fix the bugs of finalize function decoding.py.
  • Fix the bug of tf DiverseSiblingSearch.
  • Add BLEU scorer bleu_score.py into utils. Note that the BLEU score requires python3.
  • Fuse QKV Gemm of encoder and masked_multi_head_attention of decoder.
  • Add dynamic batch size and dynamic sequence length features into all ops.

March 2020

  • Add feature in FasterTransformer 2.0
    • Add translate_sample.py to demonstrate how to translate a sentence by restoring the pretrained model of OpenNMT-tf.
  • Fix bugs of Fastertransformer 2.0
    • Fix the bug that the maximum sequence length of the decoder cannot be larger than 128.
    • Fix the bug that decoding does not check whether it is finished after each step.
    • Fix the bug of decoder about max_seq_len.
    • Modify the decoding model structure to fit the OpenNMT-tf decoding model.
      • Add a layer normalization layer after decoder.
      • Add a normalization for inputs of decoder

February 2020

  • Release the FasterTransformer 2.0
    • Provide a highly optimized OpenNMT-tf based decoder and decoding, including C++ API and TensorFlow op.
    • Refine the sample codes of encoder.
    • Add dynamic batch size feature into encoder op.

July 2019

  • Release the FasterTransformer 1.0
    • Provide a highly optimized bert equivalent transformer layer, including C++ API, TensorFlow op and TensorRT plugin.

Known issues

  • Cannot compile on tensorflow 2.10 due to undefined symbol issue.
  • Undefined symbol errors when importing the extension
    • Please import torch first. If this has been done, it is due to an incompatible C++ ABI. You may need to check that the PyTorch used during compilation and execution is the same, or check how your PyTorch was compiled, or the version of your GCC, etc. (A minimal loading sketch follows this list.)
  • Results of TensorFlow and the OP can differ in decoding. This problem is caused by the accumulated log probability, and we do not avoid this problem.
  • If you encounter problems in a custom environment, try using gcc/g++ 4.8 to build the TensorFlow op project, especially for TensorFlow 1.14.
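
As a quick illustration of the "import torch first" advice above, here is a minimal sketch of loading a custom TorchScript op extension. The library path and name below are illustrative assumptions, not the actual FasterTransformer artifact; substitute whatever shared library your build produces.

import torch  # import torch before the extension so its C++ symbols can be resolved

# Hypothetical path; replace with the shared library produced by your build.
torch.classes.load_library("./lib/libth_custom_ops.so")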

fastertransformer's People

Contributors

842974287, andabi, appleejii, byshiue, christinaz, daemyung, daun4168, f0rmiga, fmassa, jinwoongkim, lanking520, luliyucoordinate, mbalc, mymusise, noppayut, odellus, perkzzheng, prnake, rkindi, rohithkrn, shengr, trellixvulnteam, xsr-thu, yangruipis, ying1123, yuanzhedong, yuekaizhang, zhang-ge-hao, zhangxin81, zobinhuang


fastertransformer's Issues

Fastertransformer with tensorflow-serving

Hi, I was using FasterTransformer locally and it's pretty good; we have good inference speed now. For many online applications we usually use tensorflow-serving, so do you have a plan to integrate FasterTransformer with tensorflow-serving, or is there an existing method? Thanks

[FastTransformer/Pytorch] TXX, RuntimeError: CUDA error: invalid device function

  • FastTransformer v3.0
  • CUDA 10.2

With TXX, running bash pytorch/scripts/run_mrpc.sh thsext fp32 gives:

11/24/2020 18:50:14 - INFO - __main__ -   Use custom BERT encoder for TorchScript
Traceback (most recent call last):
  File "/workdir/fastertransformer/build/pytorch/run_glue.py", line 383, in <module>
    main()
  File "/workdir/fastertransformer/build/pytorch/run_glue.py", line 359, in main
    enc_ = torch.jit.trace(enc, (fake_inp, fake_mask))
  File "/usr/local/conda/lib/python3.6/site-packages/torch/jit/_trace.py", line 742, in trace
    _module_class,
  File "/usr/local/conda/lib/python3.6/site-packages/torch/jit/_trace.py", line 966, in trace_module
    _module_class,
  File "/usr/local/conda/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/conda/lib/python3.6/site-packages/torch/jit/_trace.py", line 327, in _check_trace
    copied_dict[name] = _clone_inputs(data)
  File "/usr/local/conda/lib/python3.6/site-packages/torch/jit/_trace.py", line 160, in _clone_inputs
    )(args)
  File "/usr/local/conda/lib/python3.6/site-packages/torch/autograd/function.py", line 282, in _map
    return type(obj)(mapped)
  File "/usr/local/conda/lib/python3.6/site-packages/torch/autograd/function.py", line 278, in <genexpr>
    mapped = (_map(x) for x in obj)
  File "/usr/local/conda/lib/python3.6/site-packages/torch/autograd/function.py", line 274, in _map
    return fn(obj)
  File "/usr/local/conda/lib/python3.6/site-packages/torch/jit/_trace.py", line 149, in clone_input
    .clone(memory_format=torch.preserve_format)
RuntimeError: CUDA error: invalid device function

However, with T4, it works.

[FasterTransformer/V2] No speedUp when the sequence length is large

Hello,
I implemented a GPT-2 model following FT, and then compared the performance between PyTorch (Fairseq) and FT. This is the result:

Setting: batch = 1, hidden_units = 1024, head_num = 16, size_per_head = 64 (time in ms)
seq_len:   8     16    32    64    128   256    512    800
PyTorch:   21    23    23    23    28    23.6   22.6   24
FT:        6.2   6.4   6.7   7.6   8.6   12.3   24     34.7

From this table, it seems that FT is much slower than PyTorch when seq_len is larger than about 500; it is hard to accept that FT is slower. And why does the performance of PyTorch barely change as the sequence length increases?
I see the same phenomenon when I test the masked BERT model.
Can anyone help?
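
For convenience, a small sketch that derives the FT-vs-PyTorch speedup from the timings in the table above (the numbers are copied verbatim from the report):

# Speedup of FT over PyTorch computed from the reported timings (ms).
seq_len = [8, 16, 32, 64, 128, 256, 512, 800]
pytorch = [21, 23, 23, 23, 28, 23.6, 22.6, 24]
ft      = [6.2, 6.4, 6.7, 7.6, 8.6, 12.3, 24, 34.7]

for s, p, f in zip(seq_len, pytorch, ft):
    print(f"seq_len={s:4d}  speedup={p / f:.2f}x")
# The ratio drops below 1x between seq_len 256 and 512, which is the
# crossover the reporter describes.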

fastertransformer nan error

When using my own pretrained model, I got NaN from the eighth transformer layer.
By reading fastertransformer/cuda/open_attention.cu, I found that the softmax input is not protected (by subtracting the max value); the right formula should be:

import numpy as np

def softmax(x):
    """Compute softmax values for each set of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

I'm trying to modify softmax_kernel_v2; maybe someone can offer a better version.
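
To see the reporter's point numerically, here is a short sketch comparing the naive and max-subtracted softmax on large logits (plain NumPy, independent of the CUDA kernel in question):

import numpy as np

def softmax_naive(x):
    e_x = np.exp(x)                 # overflows for large inputs
    return e_x / e_x.sum()

def softmax_stable(x):
    e_x = np.exp(x - np.max(x))     # subtracting the max keeps exp() in range
    return e_x / e_x.sum()

x = np.array([1000.0, 1001.0, 1002.0], dtype=np.float32)
print(softmax_naive(x))   # [nan nan nan] -- exp() overflows to inf, and inf/inf = nan
print(softmax_stable(x))  # [0.09003057 0.24472848 0.66524094]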

faster transformer: Segmentation fault when "./bin/gemm_fp16 16 128 12 64"

I can compile it successfully. But when I run ./bin/gemm_fp16 16 128 12 64, I get:

./bin/gemm_fp16 16 128 12 64
Device
FP16 Gemm Testing

GEMM test 0: [M: 2048, K: 768, N: 768] from_tensor * weightQ/K/V, attr * output_kernel
[FT][ERROR] CUDA runtime error: CUDA driver version is insufficient for CUDA runtime version /home/yons/Bert-master/bert/FasterTransformer/tools/gemm_test/gemm_fp16.cu:107
[FT][ERROR] CUDA runtime error: CUDA driver version is insufficient for CUDA runtime version /home/yons/Bert-master/bert/FasterTransformer/tools/gemm_test/gemm_fp16.cu:108
[FT][ERROR] CUDA runtime error: CUDA driver version is insufficient for CUDA runtime version /home/yons/Bert-master/bert/FasterTransformer/tools/gemm_test/gemm_fp16.cu:109
Segmentation fault (core dumped)

cuda:10.0
driver/nvidia/version:410.104
gcc:7.4.0
python3.6
Tensorflow 1.13.1
cmake:3.14.4

[Transformer/V2] two-way buffer to update Mask-attention's KV

I cannot understand the meaning of the two-way buffer for K_cache and V_cache in decoding_opennmt.h. What is the benefit of it? The update at the end of every step looks like just a copy from one buffer to the other. Is it enough to use just a one-way buffer?
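
One plausible reason for a double ("ping-pong") buffer, sketched below in NumPy: beam search may reorder the beams at every step, so the cache for the next step is a gather of the current cache by parent-beam indices, and gathering from one buffer into another avoids reading and writing the same memory in the same kernel. This is only an illustration of the general pattern, not a statement about the exact decoding_opennmt.h implementation.

import numpy as np

beam, seq, dim = 4, 8, 16
k_cache = [np.random.rand(beam, seq, dim), np.zeros((beam, seq, dim))]

def reorder_cache(step, parent_ids, src, dst):
    # The new cache is the old cache gathered by the parent beam of each new beam.
    # Writing into a separate buffer avoids in-place read/write hazards.
    dst[:, :step + 1] = src[parent_ids, :step + 1]

cur = 0
for step in range(seq - 1):
    parent_ids = np.random.randint(0, beam, size=beam)  # chosen by beam search
    reorder_cache(step, parent_ids, k_cache[cur], k_cache[1 - cur])
    cur = 1 - cur  # ping-pong between the two buffers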

A question about fastertransformer

Hi, I am using the faster transformer. I have a question about step 1 (generating the gemm_config.in file): the first parameter is named batch size; is this the train batch size or the eval batch size? BTW, I am using faster transformer for a BERT-based task. Thanks

[Performance of INT8] Feature requested

It's exciting to see the open-sourcing of FasterTransformer v3.0. However, I don't find the performance of INT8 on the application code in the README.md, while FP32 and FP16 are both analyzed. Where can I find these results?

[FastTransformer/v2] run translate_sample.py with batch_size=16 failed.

Related to FastTransformer/v2
FasterTransformer/v2

Describe the bug
run translate_sample.py with batch_size=16 failed.
2020-04-08 14:29:42.124155: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2020-04-08 14:29:43.819479: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
Traceback (most recent call last):
  File "translate_sample.py", line 248, in <module>
    sess.run([op_target_tokens, op_target_length, source])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: Input to reshape is a tensor with 264192 values, but the requested shape requires a multiple of 8192
    [[node Reshape (defined at translate_sample.py:143) ]]
    [[Minimum_3/_761]]
  (1) Invalid argument: Input to reshape is a tensor with 264192 values, but the requested shape requires a multiple of 8192
    [[node Reshape (defined at translate_sample.py:143) ]]
0 successful operations.
0 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node Reshape:
 transformer/encoder/LayerNorm/batchnorm/add_1 (defined at /usr/local/lib/python2.7/dist-packages/opennmt/layers/transformer.py:324)
Input Source operations connected to node Reshape:
 transformer/encoder/LayerNorm/batchnorm/add_1 (defined at /usr/local/lib/python2.7/dist-packages/opennmt/layers/transformer.py:324)

Original stack trace for u'Reshape':
  File "translate_sample.py", line 143, in <module>
    tf_encoder_result, [batch_size, -1, encoder_hidden_dim])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 7715, in reshape
    "Reshape", tensor=tensor, shape=shape, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()

To Reproduce
Steps to reproduce the behavior:

  1. enter docker and build fasttransformer
  2. ./bin/decoding_gemm 16 4 8 64 32001 100 512 0
  3. python translate_sample.py --batch_size=16

Expected behavior

Environment
Please provide at least:

  • docker: nvcr.io/nvidia/tensorflow:19.07-py2

single example inference seems slow

Hi, my environment is TF 1.13.1. I already set up FasterTransformer v1 and used the BERT example. When I used the BERT inference with an input test file, the predict time per sample is around 0.0035s (time used in estimator.predict / number of samples); the original BERT (without FasterTransformer) is around 0.007s.

However, when I used an input_fn builder (not file based) to run inference on only one sample, the time is 0.009s (the same as one inference of the original BERT, which is also 0.009s). Could you please help with this?

[FasterTransformer v2] EncoderInitParam has no member named attr_kernel_Q

FasterTransformer v2

It seems EncoderInitParam has changed in v2:

template <typename T>
class EncoderInitParam
{
public:
  const T *from_tensor;
  const T *to_tensor;

  AttentionWeight<T> self_attention;
  const T *attr_mask;
  LayerNormWeight<T> self_layernorm;

  FFNWeight<T> ffn;
  LayerNormWeight<T> ffn_layernorm;

  T *transformer_out;
  cublasHandle_t cublas_handle;
  cudaStream_t stream;
};

I want to use C++ API, but the example encoder has only one layer, so I try the TensorRT encoder example. When I build it, many errors appear:

trt_plugin/bert_transformer_plugin.h:131:37: error: 'class fastertransformer::EncoderInitParam<float>' has no member named 'attr_kernel_Q'
         encoder_param.attr_kernel_Q = d_attr_kernel_Q_;
                                     ^
trt_plugin/bert_transformer_plugin.h:132:37: error: 'class fastertransformer::EncoderInitParam<float>' has no member named 'attr_kernel_K'
         encoder_param.attr_kernel_K = d_attr_kernel_K_;
                                     ^
trt_plugin/bert_transformer_plugin.h:133:37: error: 'class fastertransformer::EncoderInitParam<float>' has no member named 'attr_kernel_V'
         encoder_param.attr_kernel_V = d_attr_kernel_V_;

It seems that TensorRT example of v2 depends on v1's core.

Q: Can you fix the example of TensorRT in v2? Or could you add N-layer encoder example code for C++ API?

export faster transformer model for tensorflow serving

I have export the fast transformer model and put it under the tensorflow serving model dir, but it failed when loading the model with msg:
2019-11-04 10:44:17.458999: E tensorflow_serving/util/retrier.cc:37] Loading servable: {name: house_price_ft version: 1572707458} failed: Not found: Op type not registered 'BertTransformer' in binary running on mmpayfaceaep1. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) tf.contrib.resampler should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.

Any idea on serving the faster transformer model?

[FasterTransformer] Run demo crash on P100

Related to FasterTransformer

Describe the bug

When executing the command on P100:

CUDA_VISIBLE_DEVICES=1 python sample/tensorflow/transformer_fp32.py 1 12 32 12 64


To Reproduce
Steps to reproduce the behavior:

  1. cd build
  2. cmake -DSM=60 -DCMAKE_BUILD_TYPE=Release -DBUILD_TF=ON -DTF_PATH=/path/to/python2.7/site-packages/tensorflow ..
  3. make -j64
  4. ./build/bin/gemm_fp32 1 12 32 12 64
  5. CUDA_VISIBLE_DEVICES=1 python sample/tensorflow/transformer_fp32.py 1 12 32 12 64

Environment
Please provide at least:

  • Container version: no use docker
  • GPUs in the system: (e.g. 8x Tesla V100-SXM2-16GB): P100
  • CUDA driver version (e.g. 418.67): 410.93
  • CUDA version: 10
  • GCC: 6.4.0
  • cmake: 3.8.2

Feedback of faster transformer

Hi, NVIDIA's faster transformer is good to use, but the problem I have is that it's really hard to reference NVIDIA's faster transformer work in my project, since it sits under a second-level folder of DeepLearningExamples. I really want a direct reference to faster transformer, in the same way as TensorRT or nvidia-docker.

[Fast Transformer] Is there anyway to do int8 quantization without calibration ?

We need a calibration dataset to collect the quantization scales, which is really inconvenient when we do INT8 quantization. It would be nice if INT8 conversion were like a simple data-type conversion. So could we do INT8 inference without calibration?
As I see it, we can get the quantization scale from the current input with a simple max function instead of a percentile or MSE method. Although the max function is sometimes affected by outliers, it only considers the current input values, and we can do per-channel quantization on it, which may increase the precision. Besides, we can fuse a kernel to get the max inside the GEMM at little cost.
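
A minimal NumPy sketch of the max-based, per-channel dynamic quantization described above (an illustration of the idea only, not FasterTransformer's INT8 path):

import numpy as np

def quantize_per_channel_max(x):
    # Symmetric int8 quantization whose per-channel scales come from the
    # current input's abs-max, so no calibration dataset is needed.
    scale = np.abs(x).max(axis=0) / 127.0        # one scale per column/channel
    scale = np.where(scale == 0.0, 1.0, scale)   # avoid division by zero
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.random.randn(64, 8).astype(np.float32)
q, scale = quantize_per_channel_max(x)
print(np.abs(dequantize(q, scale) - x).max())  # small quantization error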

what is 'encoder_gemm' in FasterTransformer?

In FasterTransformer, it is recommended that encoder_gemm be run first every time. May I ask why? It generates the gemm_config.in file, but I couldn't find anywhere that uses this file.
Is it purely used for GPU warm-up?

[FasterTransformer] speedup on T4 issue

Environment requirements

  • CMake 3.14.6
  • CUDA 10.0
  • CUDNN 7.4.2
  • Python 2.7
  • Tensorflow 1.13
  • GCC 6

I ran a performance test on T4 using transformer_fp16.py:
https://github.com/NVIDIA/DeepLearningExamples/blob/master/FasterTransformer/sample/tensorflow/transformer_fp16.py
When batch_size is {8, 16, 32}, XLA is better.


<batch_size, layers, seq_len, head_num, size_per_head> TensorFlow XLA on T4 FP16 (in ms) FasterTransformer T4 FP16 (in ms) Speedup GEMM parameters
(1, 6, 32, 12, 64) 1.889802 ms 1.194158 ms 1.583 107,100,107,114,111
(2, 6, 32, 12, 64) 1.940294 ms 1.277116 ms 1.519 107,100,107,101,115
(4, 6, 32, 12, 64) 2.257926 ms 1.715474 ms 1.316 107,100,107,102,100
(8, 6, 32, 12, 64) 2.82261 ms 2.83554 ms 0.95 107,100,107,115,105
(16, 6, 32, 12, 64) 4.363834 ms 4.560086 ms 0.96 100,110,107,107,99
(32, 6, 32, 12, 64) 7.870094 ms 8.690178 ms 0.915 100,99,107,100,111
(64, 6, 32, 12, 64) 15.913654 ms 14.668242 ms 1.084 100,103,110,100,115
(128, 6, 32, 12, 64) 30.419768 ms 25.81849 ms 1.178 110,106,110,99,100
(256, 6, 32, 12, 64) 58.770124 ms 50.166238 ms 1.172 103,103,103,109,115
(512, 6, 32, 12, 64) 121.467958 ms 98.644416 ms 1.231 103,103,103,108,100

Is my test result correct?

[FasterTransformer/effective transformer] using wrong offset to remove seq padding

Related to FasterTransformer/effective transform

Describe the bug
In the kernel remove_sequence_length_padding, the offset used to get the real location in the source (padded) tensor is wrong.

Current impl:
template <typename T>
__global__ void remove_sequence_length_padding(const T* src, T* tgt,
                                               const int* tmp_mask_offset,
                                               int* mask_offset,
                                               const int n)
{
  const int tid = threadIdx.x;
  const int bid = blockIdx.x;
  mask_offset[bid] = tmp_mask_offset[bid];

  // the src_seq_id is not right, which should be mask_offset[bid], no need to add bid.
  const int src_seq_id = bid + mask_offset[bid];
  const int tgt_seq_id = bid;

  for (int i = tid; i < n; i += blockDim.x)
  {
    tgt[tgt_seq_id * n + i] = src[src_seq_id * n + i];
  }
}

Expected behavior
Change:
const int src_seq_id = bid + mask_offset[bid];
to:
const int src_seq_id = mask_offset[bid];
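
For readers unfamiliar with what this kernel does, here is a NumPy sketch of the "remove padding" idea from Effective Transformer: the valid tokens of a padded [batch, seq, hidden] tensor are packed into a dense [total_valid, hidden] tensor. The sketch works from the padding mask directly and deliberately takes no position on which offset convention the CUDA kernel should use.

import numpy as np

batch, seq, hidden = 2, 4, 8
x = np.random.rand(batch, seq, hidden).astype(np.float32)
mask = np.array([[1, 1, 1, 0],      # 1 = real token, 0 = padding
                 [1, 1, 0, 0]])

# Pack only the valid rows; this is the effect the kernel above implements on the GPU.
flat = x.reshape(batch * seq, hidden)
valid_idx = np.flatnonzero(mask.reshape(-1))
packed = flat[valid_idx]            # shape: (total_valid_tokens, hidden)
print(packed.shape)                 # (5, 8)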

cublasGemmStridedBatchedEx Compute Type

I want to know why, when I set the compute type to CUDA_R_16F and the A, B, C data types to CUDA_R_16F, the result of the matrix multiplication is 0, but when I set the compute type to CUDA_R_32F and A, B, C to CUDA_R_16F, the answer is right.
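
The symptom is consistent with the general pitfall of accumulating an FP16 GEMM in FP16 (overflow or severe precision loss in the reduction), which is why an FP32 compute type is normally paired with FP16 inputs. The NumPy sketch below only illustrates that pitfall; it is not a statement about the exact cuBLAS internals or about why this particular case returns 0.

import numpy as np

vals = np.full(100, 1000.0, dtype=np.float16)   # true sum is 100000

# Accumulate in float16: the running sum exceeds the float16 max (~65504) and becomes inf.
acc16 = np.float16(0.0)
for v in vals:
    acc16 = np.float16(acc16 + v)

# Accumulate in float32: exact for this input.
acc32 = vals.astype(np.float32).sum()

print(acc16, acc32)  # inf 100000.0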

fp16 cross_check false

env:
ubuntu 18.04
python: 3.6.8
GCC: 8.0.1
tensorflow: 1.13.0-rc2
CUDA: 10.0

  1. I install in tensorflow mode with V100 :
    cmake -DSM=70 -DCMAKE_BUILD_TYPE=Release -DBUILD_TF=ON -DTF_PATH=/usr/local/lib/python2.7/dist-packages/tensorflow .. # Tensorflow mode

  2. and then generate the gemm_config.in with:
    ./build/bin/gemm_fp16 100 32 12 64
    run with:
    python transformer_fp16.py 100 12 32 12 64

But I got a result like the following (screenshot omitted): the FasterTransformer output has a big diff vs OriginTransformer in FP16, about 0.0332, while running OriginTransformer twice only gives a diff of 0.007812.

My questions are:
1. Is the FasterTransformer diff of 0.0332 big or not? Will it affect convergence? OriginTransformer run twice only differs by 0.007812.
2. Why are the diffs different (0.0332 vs 0.007812)? I think this may be because FasterTransformer fuses some ops while OriginTransformer doesn't, so the intermediate variables cause the difference?

[FastTransformer v3.1/TensorFlow] Get CUBLAS_STATUS_INTERNAL_ERROR when run tensorflow/gpt2-sample.py

Related to FastTransformer v3.1/TensorFlow/GPT-2

Describe the bug
If I run ./bin/decoding_gemm 4 1 12 64 50257 32 768 0 before python tensorflow/gpt2_sample.py, I got
Internal: [FT][ERROR] CUDA runtime error: CUBLAS_STATUS_INTERNAL_ERROR FasterTransformer/fastertransformer/cuda/open_decoder.cu:1708.
However, If I don't run ./bin/decoding_gemm 4 1 12 64 50257 32 768 0 before python tensorflow/gpt2_sample.py (use the default gemm), everything is OK.

To Reproduce
Steps to reproduce the behavior:

  1. nvidia-docker run -it -v local_dir:container_dir nvcr.io/nvidia/tensorflow:19.06-py3 bash
  2. cmake -DSM=70 -DCMAKE_BUILD_TYPE=Release -DBUILD_TF=ON -DTF_PATH=/usr/local/lib/python3.5/dist-packages/tensorflow ..
  3. make
  4. ./bin/decoding_gemm 4 1 12 64 50257 32 768 0
  5. python tensorflow/gpt2_sample.py

Expected behavior
There should be no error.

Environment
Please provide at least:

  • Container version: nvcr.io/nvidia/tensorflow:19.06-py3
  • GPUs in the system: 8x Tesla V100-32GB
  • CUDA driver version: 435.21

When will the transformer decoder be released?

Hi! Thanks for the great Fastertransformer, my team has benefited a lot from it.
I remember there was a livestream held by NVIDIA a few weeks ago, where the speaker mentioned releasing the Decoder version soon. I'd like to ask whether you still have this plan and when the decoder will be released?
Thanks :)

[FasterTransformer] CUBLAS_STATUS_NOT_INITIALIZED

Related to Model/Framework(s)
(FasterTransformer)

Describe the bug
I can build faster transformer for PyTorch without any error. However, when I run the GEMM test, I get the following error:

terminate called after throwing an instance of 'std::runtime_error'
what(): [FT][ERROR] CUDA runtime error: CUBLAS_STATUS_NOT_INITIALIZED /root/DeepLearningExamples/FasterTransformer/v3.1/fastertransformer/gemm_test/encoder_gemm_func.cc:114

To Reproduce
Steps to reproduce the behavior:

  1. sudo docker run --gpus all --network=host --privileged -w '/root' --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -it nvcr.io/nvidia/pytorch:20.07-py3 /bin/bash

  2. git clone https://github.com/NVIDIA/DeepLearningExamples
    cd DeepLearningExamples/FasterTransformer/v3.1
    mkdir -p build
    cd build

  3. cmake -DSM=70 -DCMAKE_BUILD_TYPE=Release -DBUILD_THE=ON -DBUILD_THS=ON -DCXX_STD=14 ..

  4. make

  5. pip install transformers==2.5.1

  6. ./bin/encoder_gemm 32 32 12 64 0 0

Expected behavior
should work without throwing error

Environment
Please provide at least:

  • Container version: nvcr.io/nvidia/pytorch:20.07-py3
  • GPUs in the system: Tesla V100-PCIE-16GB
  • CUDA driver version 10.2

Faster Transformer make error

Hi,
I failed to make the faster transformer. My environment is : V100 GPU, Cuda10, tensorflow 1.13.1, cmake 3.15.5, gcc 5.4.0, python 2.7

My make command is :
cmake -DSM=70 -DCMAKE_BUILD_TYPE=Release -DBUILD_TF=ON -DTF_PATH=/root/anaconda3/envs/py27/lib/python2.7/site-packages/tensorflow ..

I got the Error:
-- The CXX compiler identification is GNU 5.4.0
-- The CUDA compiler identification is NVIDIA 10.0.130
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Check for working CUDA compiler: /usr/local/cuda-10.0/bin/nvcc
-- Check for working CUDA compiler: /usr/local/cuda-10.0/bin/nvcc -- works
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found CUDA: /usr/local/cuda-10.0 (found suitable version "10.0", minimum required is "10.0")
-- Found CUDA: /usr/local/cuda-10.0 (found version "10.0")
-- Assign GPU architecture (sm=70)
-- Configuring done
-- Generating done
-- Build files have been written to: /home/FasterTransformer/build

In the build/CMakeFiles/CMakeError.log:
Run Build Command(s):/usr/bin/make cmTC_2e459/fast && /usr/bin/make -f CMakeFiles/cmTC_2e459.dir/build.make CMakeFiles/cmTC_2e459.dir/build
make[1]: Entering directory '/home/FasterTransformer/build/CMakeFiles/CMakeTmp'
Building CXX object CMakeFiles/cmTC_2e459.dir/src.cxx.o
/usr/bin/c++ -DCMAKE_HAVE_LIBC_PTHREAD -o CMakeFiles/cmTC_2e459.dir/src.cxx.o -c /home/hFasterTransformer/build/CMakeFiles/CMakeTmp/src.cxx
Linking CXX executable cmTC_2e459
/dev/pkgs/cmake-3.15.5-Linux-x86_64/bin/cmake -E cmake_link_script CMakeFiles/cmTC_2e459.dir/link.txt --verbose=1
/usr/bin/c++ -DCMAKE_HAVE_LIBC_PTHREAD CMakeFiles/cmTC_2e459.dir/src.cxx.o -o cmTC_2e459
CMakeFiles/cmTC_2e459.dir/src.cxx.o: In function `main':
src.cxx:(.text+0x3c): undefined reference to `pthread_create'
src.cxx:(.text+0x48): undefined reference to `pthread_detach'
src.cxx:(.text+0x59): undefined reference to `pthread_join'
src.cxx:(.text+0x6d): undefined reference to `pthread_atfork'
collect2: error: ld returned 1 exit status
CMakeFiles/cmTC_2e459.dir/build.make:86: recipe for target 'cmTC_2e459' failed
make[1]: *** [cmTC_2e459] Error 1
make[1]: Leaving directory '/home/FasterTransformer/build/CMakeFiles/CMakeTmp'
Makefile:121: recipe for target 'cmTC_2e459/fast' failed
make: *** [cmTC_2e459/fast] Error 2

Looking for help. Thanks!!!

It cause precision error issue after removing add_QKV_bias in FasterTransformer

Related to Model/Framework(s)
( FasterTransformer)

Describe the bug
When we set the QKV "use_bias=False" in the Python code and remove the CUDA kernel function "add_QKV_bias" from fastertransformer/cuda/open_attention.cu, it causes a precision issue in float32 mode.
Here is the cross-check result before and after removing the QKV bias.
Original:
#################################
cross_check False
max diff 4.027567e-06
min diff 0.0
After removing qkv bias:
#################################
cross_check False
max diff 0.0012447834
min diff 0.0
If run in float16 mode, the precision error is even bigger.
To Reproduce
Need to modify the code of open_attention.cu,open_attention.h,bert_transformer_op.cc and transformer_fp32.py,etc.

Expected behavior
Are there any tips about why this happens? Or any suggestions on how to avoid it?

Environment

  • Container version (nvcr.io/nvidia/tensorflow:19.10-py3):
  • GPUs in the system: ( Tesla V100-SXM2-16GB):
  • CUDA driver version ( 418.67):

does faster transformer compile under -DSM=52?

In the CMakeLists.txt, it seems to support compute capability 52:

set(SM_SETS 52 60 61 70 75 80).

But if I compile with

cmake -DSM=52 -DCMAKE_BUILD_TYPE=Release -DBUILD_PYT=ON ..
make

I end up with a compile error:

/FasterTransformer/fastertransformer/cuda/cuda_kernels.cu(110): error: more than one conversion function from "const half" to a built-in type applies:
            function "__half::operator float() const"
            function "__half::operator short() const"
            function "__half::operator unsigned short() const"
            function "__half::operator int() const"
            function "__half::operator unsigned int() const"
            function "__half::operator long long() const"
            function "__half::operator unsigned long long() const"
            function "__half::operator __nv_bool() const"
          detected during:
            instantiation of "void fastertransformer::update_logits_kernel(float *, const T *, const T *, int, const __nv_bool *, int) [with T=half]" 
(349): here
            instantiation of "void fastertransformer::update_logits(float *, const T *, const T *, int, const __nv_bool *, int, int, cudaStream_t) [with T=half]" 
(355): here

...

faster transformer compile error with docker

image: nvidia/cuda 10.0-cudnn7-devel-ubuntu16.04 docker image
cmake -DSM=70 -DCMAKE_BUILD_TYPE=Release -DBUILD_TF=ON -DTF_PATH=/usr/lib/python2.7/site-packages/tensorflow .. output:
-- The CXX compiler identification is GNU 5.4.0
-- The CUDA compiler identification is NVIDIA 10.0.130
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc -- works
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found CUDA: /usr/local/cuda (found suitable version "10.0", minimum required is "10.0")
-- Found CUDA: /usr/local/cuda (found version "10.0")
-- Assign GPU architecture (sm=70)
-- Configuring done
-- Generating done
-- Build files have been written to: /root/DeepLearningExamples/FasterTransformer/build

make output:
CMakeFiles/gemm_fp32.dir/gemm_fp32.cu.o: In function `__sti____cudaRegisterAll()':
tmpxft_0000054d_00000000-5_gemm_fp32.cudafe1.cpp:(.text.startup+0x15): undefined reference to `__cudaRegisterLinkedBinary_44_tmpxft_0000054d_00000000_6_gemm_fp32_cpp1_ii_5cd8620e'
collect2: error: ld returned 1 exit status
tools/gemm_test/CMakeFiles/gemm_fp32.dir/build.make:83: recipe for target 'bin/gemm_fp32' failed
make[2]: *** [bin/gemm_fp32] Error 1
CMakeFiles/Makefile2:148: recipe for target 'tools/gemm_test/CMakeFiles/gemm_fp32.dir/all' failed
make[1]: *** [tools/gemm_test/CMakeFiles/gemm_fp32.dir/all] Error 2
Makefile:83: recipe for target 'all' failed
make: *** [all] Error 2

maybe some advice for cmake

1. I think in FindNCCL.cmake, the "set(NCCL_INCLUDE_DIR $ENV{NCCL_INCLUDE_DIR} CACHE ...)" should be surrounded by an "if (DEFINED ENV{...})" check, to avoid the cached variable being set to "" when the env var is not set; in such cases, when the variable is set afterwards, the "null" value in the cache still takes effect.
2. When building the PyTorch version, setting BUILD_GPT=OFF doesn't work, maybe because gpt.h still has to be compiled.
3. Line 63 of fused_multihead_attention_op.cc: the rank of the from tensor should be 2, not 3.

[fast-transformer/v1] CUBLAS_STATUS_ARCH_MISMATCH

There is a CUDA runtime error when I execute the demo using the command
"./transformer_fp32 1 12 128 12 64". I execute the demo after running the command "cmake -DSM=37 -DCMAKE_BUILD_TYPE=Release -DBUILD_TRT=ON -DTRT_PATH=/root/TensorRT-5.1.5.0 -DBUILD_TF=ON -DTF_PATH=/root/anaconda2/lib/python2.7/site-packages/tensorflow .." and "./gemm_fp32 1 20 12 64"

The info is listed below:
[Device Tesla K80
before allocate free 11.10 GB total 11.17 GB
After allocate free 11.07 GB used 0.10 GB total 11.17 GB
terminate called after throwing an instance of 'std::runtime_error'
what(): [FT][ERROR] CUDA runtime error: CUBLAS_STATUS_ARCH_MISMATCH /root/DeepLearningExamples-master/FasterTransformer/v1/fastertransformer/cuda/open_attention.h:171

Aborted (core dumped)]

I searched the internet using the keyword "CUBLAS_STATUS_ARCH_MISMATCH" and found some info at https://docs.nvidia.com/cuda/cublas/index.html, which says CUBLAS_STATUS_ARCH_MISMATCH may be because "the device has a compute capability lower than 5.0".

Environment
cudnn version: 7.6.4
CUDA Version: 10.0
GPU version: K80
container: nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04

[FastTransformer/Pytorch] Input error, TorchScript FP16 with and without FastTransformer

Hi, I tried the SQuAD demo in FasterTransformer 3.0 and got good results. However, when I tried:

bash pytorch/scripts/run_squad.sh ths fp16

I got error:

DeepLearningExamples/FasterTransformer/v3.0/build/pytorch/run_squad.py(474): main
DeepLearningExamples/FasterTransformer/v3.0/build/pytorch/run_squad.py(489): <module>
RuntimeError: expected scalar type Float but found Half

And when I tried:

bash pytorch/scripts/run_squad.sh thsext fp16

I got:

RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "DeepLearningExamples/FasterTransformer/v3.0/build/pytorch/utils/encoder.py", line 116, in forward
    def forward(self, hidden_states, attention_mask, sequence_lengths=torch.Tensor(0).to(torch.int).cuda()):
        for i in range(self.layer_num):
            hidden_states = self.encoders[i].forward(hidden_states, attention_mask, sequence_lengths)
                            ~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        return (hidden_states,)
RuntimeError: Inconsistency of Tensor type: input

Only in TorchScript and FP16 mode do I get this problem. My environment:

  • T4, CUDA 10.2
  • Pytorch 1.6.0

[FasterTransformer/Pytorch] CMake build failed with undefined reference to pthread_create

Related to FasterTransformer/Pytorch

Describe the bug
cmake -DSM=80 ... build failed with undefined reference to pthread_create error.
Building with Docker succeeds anyway (but this cannot meet my requirements...).

Already tried:

  • I have libpthread.so installed under /lib/x86_64-linux-gnu/ and it is registered in /etc/ld.so.conf.d/x86_64-linux-gnu.conf already.
  • I tried to fix the provided CMakeLists.txt manually, but it does not help. I think the test code is out there.
set(THREADS_PREFER_PTHREAD_FLAG ON)
find_package(Threads REQUIRED)
  • I tried various cmake versions for x86-64 linux (ubuntu 20.04), and encountered same error (pthread_create checking test code is different)
    • 3.10
    • 3.14
    • 3.16
    • 3.20-rc4

To Reproduce
Steps to reproduce the behavior:

# 1. Setup python environment
conda create -n faster_transformer python=3.8
conda activate faster_transformer
conda install pytorch -c pytorch
pip install transformers==2.5.1 opennmt-py==1.1.1  # not the point

# 2. Clone git repository
git clone git@github.com:NVIDIA/DeepLearningExamples.git
cd DeepLearningExamples/FasterTransformer/v3.1
mkdir -p build
cd build

# 3. Build cmake project (I used SM=80 for A100 GPUs)
cmake -DSM=80 -DCMAKE_BUILD_TYPE=Release -DBUILD_THE=ON -DBUILD_THS=ON -DCXX_STD=14 ..   # Error!

Console outputs:

-- The CXX compiler identification is GNU 9.3.0
-- The CUDA compiler identification is NVIDIA 11.0.221
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc -- works
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found CUDA: /usr/local/cuda (found suitable version "11.0", minimum required is "10.1")
-- Add DCUDA11_MODE
-- Assign GPU architecture (sm=80)
-- Found CUDA: /usr/local/cuda (found version "11.0")
-- Caffe2: CUDA detected: 11.0
-- Caffe2: CUDA nvcc is: /usr/local/cuda/bin/nvcc
-- Caffe2: CUDA toolkit directory: /usr/local/cuda
-- Caffe2: Header version is: 11.0
-- Found CUDNN: /usr/lib/x86_64-linux-gnu/libcudnn.so
-- Found cuDNN: v8.0.5  (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libcudnn.so)
CMake Warning at /home/kalaluthien/.miniconda3/envs/faster_transformer/lib/python3.8/site-packages/torch/share/cmake/Caffe2/public/cuda.cmake:198 (message):
  Failed to compute shorthash for libnvrtc.so
Call Stack (most recent call first):
  /home/kalaluthien/.miniconda3/envs/faster_transformer/lib/python3.8/site-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:88 (include)

[FasterTransformer v3.1/TensorFlow] Get CUBLAS_STATUS_INTERNAL_ERROR when running tensorflow/gpt2_sample.py

Related to FasterTransformer v3.1/TensorFlow/GPT-2

Describe the bug
If I run ./bin/decoding_gemm 4 1 12 64 50257 32 768 0 before python tensorflow/gpt2_sample.py, I get
Internal: [FT][ERROR] CUDA runtime error: CUBLAS_STATUS_INTERNAL_ERROR FasterTransformer/fastertransformer/cuda/open_decoder.cu:1708.
However, if I don't run ./bin/decoding_gemm 4 1 12 64 50257 32 768 0 before python tensorflow/gpt2_sample.py (i.e. use the default GEMM algorithms), everything is OK.
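
A hedged workaround sketch based on the observation above: since the default GEMM path works, remove the gemm_config.in file written by ./bin/decoding_gemm (the file name and its working-directory location are assumptions taken from other logs on this page) so that decoding falls back to the default GEMM algorithms until the tuned configuration is debugged.

import pathlib

# Assumed name/location of the file produced by ./bin/decoding_gemm.
cfg = pathlib.Path("gemm_config.in")
if cfg.exists():
    cfg.unlink()  # decoding then falls back to the default GEMM algorithms
    print("Removed", cfg)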

To Reproduce
Steps to reproduce the behavior:

  1. nvidia-docker run -it -v local_dir:container_dir nvcr.io/nvidia/tensorflow:19.06-py3 bash
  2. cmake -DSM=70 -DCMAKE_BUILD_TYPE=Release -DBUILD_TF=ON -DTF_PATH=/usr/local/lib/python3.5/dist-packages/tensorflow ..
  3. make
  4. ./bin/decoding_gemm 4 1 12 64 50257 32 768 0
  5. python tensorflow/gpt2_sample.py

Expected behavior
There should be no error.

Environment
Please provide at least:

  • Container version: nvcr.io/nvidia/tensorflow:19.06-py3
  • GPUs in the system: 8x Tesla V100-32GB
  • CUDA driver version: 435.21

[FasterTransformer] nvcc fatal : redefinition of argument 'std'

Hi, I compiled the FasterTransformer code and got this error:

[  0%] Built target copy
[  2%] Building CXX object tools/gemm_test/CMakeFiles/decoding_gemm.dir/decoding_gemm.cc.o
[  4%] Linking CXX executable ../../bin/decoding_gemm
[  4%] Built target decoding_gemm
[  6%] Building CXX object tools/gemm_test/CMakeFiles/encoder_gemm.dir/encoder_gemm.cc.o
[  8%] Linking CXX executable ../../bin/encoder_gemm
[  8%] Built target encoder_gemm
[ 10%] Building CUDA object fastertransformer/cuda/CMakeFiles/topk.dir/topk_kernels.cu.o
nvcc fatal   : redefinition of argument 'std'
make[2]: *** [fastertransformer/cuda/CMakeFiles/topk.dir/topk_kernels.cu.o] Error 1
make[1]: *** [fastertransformer/cuda/CMakeFiles/topk.dir/all] Error 2
make: *** [all] Error 2

My commands are:

mkdir -p build
cd build
cmake -DSM=75 -DCMAKE_BUILD_TYPE=Release -DBUILD_THE=ON -DBUILD_THS=ON -DBUILD_THSOP=ON -DCXX_STD=14 ..
make

Any suggestion? Thank you very much!
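
A hedged diagnostic note: "nvcc fatal : redefinition of argument 'std'" generally means nvcc received two -std=... options, which can happen when -DCXX_STD=14 is combined with CUDA flags that already set a C++ standard. A minimal sketch (the build/ path is an assumption) for finding where the duplicated flag ends up in the generated build files:

import pathlib

# Scan the generated Makefiles and flag files for compile lines carrying "-std=" twice.
for path in pathlib.Path("build").rglob("*"):
    if not path.is_file() or path.suffix not in {".make", ".txt"}:
        continue
    for line in path.read_text(errors="ignore").splitlines():
        if line.count("-std=") > 1:
            print(f"{path}: {line.strip()}")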

libtf_fastertransformer.so: undefined symbol: _ZN10tensorflow12OpDefBuilder4AttrESs?

Hi, I converted the model to FP16 format, but when I run FasterTransformer it fails with libtf_fastertransformer.so: undefined symbol: _ZN10tensorflow12OpDefBuilder4AttrESs.

python ckpt_type_convert.py --init_checkpoint=$MODEL --fp16_checkpoint=imdb_output/fp16_model.ckpt

python run_classifier_fastertf.py --task_name=Imdb --do_eval=true --data_dir=$IMDB_DIR --vocab_file=$BERT_BASE_DIR/vocab.txt --bert_config_file=$BERT_BASE_DIR/bert_config.json --init_checkpoint=imdb_output/fp16_model.ckpt --max_seq_length=128 --eval_batch_size=16 --output_dir=imdb_output --floatx=float16

/home/ubt/anaconda3/envs/fastertf/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/ubt/anaconda3/envs/fastertf/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/ubt/anaconda3/envs/fastertf/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/ubt/anaconda3/envs/fastertf/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/ubt/anaconda3/envs/fastertf/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/ubt/anaconda3/envs/fastertf/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
Traceback (most recent call last):
File "run_classifier_fastertf.py", line 54, in
import fast_infer_util as fiu
File "/home/ubt/FasterTransformer/FasterTransformer_Bert/fast_infer_util.py", line 29, in
os.path.join(build_path, 'libtf_fastertransformer.so'))
File "/home/ubt/anaconda3/envs/fastertf/lib/python3.6/site-packages/tensorflow/python/framework/load_library.py", line 61, in load_op_library
lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: /home/ubt/FasterTransformer/FasterTransformer_Bert/./build/lib/libtf_fastertransformer.so: undefined symbol: _ZN10tensorflow12OpDefBuilder4AttrESs
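
A hedged note: an undefined TensorFlow symbol like this one (OpDefBuilder::Attr taking a pre-C++11-ABI std::string) usually means libtf_fastertransformer.so was built against a different TensorFlow build or C++ ABI than the TensorFlow that loads it. A minimal check of what the installed TensorFlow expects:

import tensorflow as tf

# Compare these with the flags used when libtf_fastertransformer.so was built;
# in particular -D_GLIBCXX_USE_CXX11_ABI and the TensorFlow version must match.
print("TensorFlow version:", tf.__version__)
print("Compile flags:", tf.sysconfig.get_compile_flags())
print("Link flags:", tf.sysconfig.get_link_flags())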

[FasterTransformer] sample/transformer_trt failed in fp16 mode

Related to FasterTransformer/TensorRT

Describe the bug
When running transformer_trt.cc in FP16 mode, I hit several CUDA errors during the forward pass.

CUDA Error: CUDA_ERROR_INVALID_VALUE CUDA Error: CUDA_ERROR_INVALID_VALUE /usr/local/app/workspace/ljq/DeepLearningExamples-master/FasterTransformer/v3.1/fastertransformer/trt_fused_multihead_attention/fused_multihead_attention_v2.h 507
[FT][ERROR] CUDA runtime error: invalid configuration argument /usr/local/app/workspace/ljq/DeepLearningExamples-master/FasterTransformer/v3.1/fastertransformer/trt_fused_multihead_attention/qkvToContext.cu:137

[FT][ERROR] CUDA runtime error: invalid configuration argument /usr/local/app/workspace/ljq/DeepLearningExamples-master/FasterTransformer/v3.1/fastertransformer/cuda/open_attention.h:626

The result also seems to be incorrect.
By printing out the kernel parameters, I found that params.b is -2, which should be a non-negative number since it is used as gridDim.y.

To Reproduce
Just run transformer_trt with fp16=1.

Expected behavior
There should be no errors, and the result should be the same as the other transformer samples.

Environment
Please provide at least:

  • Container version: self-made container
  • GPUs in the system: Tesla T4-16GB
  • CUDA driver version: 440.33
  • CUDA version: 10.2.89

transformer_fp32 core

./transformer_fp32 1 12 128 12 64
Device TITAN V
before allocate free 11.00 GB total 11.75 GB
After allocate free 10.96 GB used 0.79 GB total 11.75 GB
[FT][CALL] BertEncoderTransformer
[FT][CALL] OpenMultiHeadAttention
gemm_config.in is not found
loading GEMM algorithms error, using default GEMM algorithms
gemm_config.in is not found
loading GEMM algorithms error, using default GEMM algorithms!
[FT][CALL] initialize
[FT][CALL] initialize
[FT][CALL] forward
[FT][CALL] forward
transformer_fp32: xxx/DeepLearningExamples/FasterTransformer/fastertransformer/cuda/open_attention.cu:329: void fastertransformer::cuda::OpenMultiHeadAttention<OpType_>::multiHeadAttr_nofuse_kernelLauncher(cudaStream_t, cublasHandle_t, fastertransformer::cuda::OpenMultiHeadAttention<OpType_>::DataType_, const DataType_, fastertransformer::cuda::OpenMultiHeadAttention<OpType_>::DataType_, const DataType_, fastertransformer::cuda::OpenMultiHeadAttention<OpType_>::DataType_, const DataType_, const DataType_, fastertransformer::cuda::OpenMultiHeadAttention<OpType_>::DataType_, int, int, int, int, fastertransformer::cuda::OpenMultiHeadAttention<OpType_>::DataType_) [with fastertransformer::OperationType OpType_ = (fastertransformer::OperationType)0; cudaStream_t = CUstream_st*; cublasHandle_t = cublasContext*; fastertransformer::cuda::OpenMultiHeadAttention<OpType_>::DataType_ = float]: Assertion `k > 1024' failed.
Aborted (core dumped)

[FasterTransformer3.0/Pytorch] Translation with FasterTransformer 3.0 on PyTorch: demo model file fails to load

Related to FasterTransformer/Pytorch

Describe the bug

The downloaded transformer model can't be loaded and fails with this error:

Traceback (most recent call last):
  File "pytorch/load.py", line 37, in <module>
    fields, model, model_opt = load_test_model(opt, args)
  File "/app/build/pytorch/utils/translation_model.py", line 80, in load_test_model
    map_location=lambda storage, loc: storage)
  File "/opt/conda/lib/python3.6/site-packages/torch/serialization.py", line 585, in load
    with _open_zipfile_reader(f) as opened_zipfile:
  File "/opt/conda/lib/python3.6/site-packages/torch/serialization.py", line 245, in __init__
    super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: version_ <= kMaxSupportedFileFormatVersion INTERNAL ASSERT FAILED at ../caffe2/serialize/inline_container.cc:132, please report a bug to PyTorch. Attempted to read a PyTorch file with version 3, but the maximum supported version for reading is 2. Your PyTorch installation may be too old.

To Reproduce
Steps to reproduce the behavior:
I'm trying to follow the README and use the Docker container with PyTorch 1.5:

python pytorch/run_translation.py --batch_size 128 --beam_size 4 --model_type decoding_ext --data_type fp32

This seems to be because opennmt-py requires pytorch==1.6?
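
A hedged workaround sketch, assuming the downloaded checkpoint was saved by a newer PyTorch than the 1.5 inside the container: re-save it from an environment whose PyTorch can read it, using the legacy serialization format so that the older PyTorch can still load it (the file names below are placeholders for the downloaded model):

import torch

src = "downloaded_model.pt"         # placeholder for the demo checkpoint
dst = "downloaded_model_legacy.pt"

ckpt = torch.load(src, map_location="cpu")
# _use_new_zipfile_serialization=False writes the pre-1.6 container format,
# which PyTorch 1.5 can read.
torch.save(ckpt, dst, _use_new_zipfile_serialization=False)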

Environment
Please provide at least:

  • Container version (e.g. pytorch:19.05-py3): pytorch:20.03-py3
  • GPUs in the system: (e.g. 8x Tesla V100-SXM2-16GB): 1080Ti
  • CUDA driver version (e.g. 418.67): 10.2

[FasterTransformer/Pytorch] CMake build failed with undefined reference to pthread_create

Related to FasterTransformer/Pytorch

Describe the bug
The cmake -DSM=80 ... build fails with an "undefined reference to `pthread_create'" error.
Building with Docker does succeed, but that cannot meet my requirements.

Already tried:

  • I have libpthread.so installed under /lib/x86_64-linux-gnu/ and it is registered in /etc/ld.so.conf.d/x86_64-linux-gnu.conf already.
  • I tried to fix the provided CMakeLists.txt manually (for example, with the lines below), but it did not help; I suspect the problem is in CMake's own test code.
set(THREADS_PREFER_PTHREAD_FLAG ON)
find_package(Threads REQUIRED)
  • I tried various CMake versions for x86-64 Linux (Ubuntu 20.04) and encountered the same error (the pthread_create check's test code differs between versions):
    • 3.10
    • 3.14
    • 3.16
    • 3.20-rc4

To Reproduce
Steps to reproduce the behavior:

# 1. Setup python environment
conda create -n faster_transformer python=3.8
conda activate faster_transformer
conda install pytorch -c pytorch
pip install transformers==2.5.1 opennmt-py==1.1.1  # not the point

# 2. Clone git repository
git clone git@github.com:NVIDIA/DeepLearningExamples.git
cd DeepLearningExamples/FasterTransformer/v3.1
mkdir -p build
cd build

# 3. Build cmake project (I used SM=80 for A100 GPUs)
cmake -DSM=80 -DCMAKE_BUILD_TYPE=Release -DBUILD_THE=ON -DBUILD_THS=ON -DCXX_STD=14 ..   # Error!

Console outputs:

-- The CXX compiler identification is GNU 9.3.0
-- The CUDA compiler identification is NVIDIA 11.0.221
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc -- works
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found CUDA: /usr/local/cuda (found suitable version "11.0", minimum required is "10.1")
-- Add DCUDA11_MODE
-- Assign GPU architecture (sm=80)
-- Found CUDA: /usr/local/cuda (found version "11.0")
-- Caffe2: CUDA detected: 11.0
-- Caffe2: CUDA nvcc is: /usr/local/cuda/bin/nvcc
-- Caffe2: CUDA toolkit directory: /usr/local/cuda
-- Caffe2: Header version is: 11.0
-- Found CUDNN: /usr/lib/x86_64-linux-gnu/libcudnn.so
-- Found cuDNN: v8.0.5  (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libcudnn.so)
CMake Warning at /home/kalaluthien/.miniconda3/envs/faster_transformer/lib/python3.8/site-packages/torch/share/cmake/Caffe2/public/cuda.cmake:198 (message):
  Failed to compute shorthash for libnvrtc.so
Call Stack (most recent call first):
  /home/kalaluthien/.miniconda3/envs/faster_transformer/lib/python3.8/site-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:88 (include)
  /home/kalaluthien/.miniconda3/envs/faster_transformer/lib/python3.8/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:68 (find_package)
  CMakeLists.txt:155 (find_package)


CMake Warning at /home/kalaluthien/.miniconda3/envs/faster_transformer/lib/python3.8/site-packages/torch/share/cmake/Caffe2/public/utils.cmake:365 (message):
  In the future we will require one to explicitly pass TORCH_CUDA_ARCH_LIST
  to cmake instead of implicitly setting it as an env variable.  This will
  become a FATAL_ERROR in future version of pytorch.
Call Stack (most recent call first):
  /home/kalaluthien/.miniconda3/envs/faster_transformer/lib/python3.8/site-packages/torch/share/cmake/Caffe2/public/cuda.cmake:483 (torch_cuda_get_nvcc_gencode_flag)
  /home/kalaluthien/.miniconda3/envs/faster_transformer/lib/python3.8/site-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:88 (include)
  /home/kalaluthien/.miniconda3/envs/faster_transformer/lib/python3.8/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:68 (find_package)
  CMakeLists.txt:155 (find_package)


-- Added CUDA NVCC flags for: -gencode;arch=compute_80,code=sm_80
-- Found Torch: /home/kalaluthien/.miniconda3/envs/faster_transformer/lib/python3.8/site-packages/torch/lib/libtorch.so
<string>:3: DeprecationWarning: SO is deprecated, use EXT_SUFFIX
Traceback (most recent call last):
  File "<string>", line 1, in <module>
TypeError: _prepare_ldflags() missing 1 required positional argument: 'is_standalone'
CMake Error at CMakeLists.txt:175 (message):
  PyTorch link config Error.

Log file:

# CMakeFiles/CMakeError.log
Determining if the pthread_create exist failed with the following output:
Change Dir: /home/kalaluthien/DeepLearningExamples/FasterTransformer/v3.1/build/CMakeFiles/CMakeTmp

Run Build Command(s):/usr/bin/make cmTC_5ac33/fast
/usr/bin/make -f CMakeFiles/cmTC_5ac33.dir/build.make CMakeFiles/cmTC_5ac33.dir/build
make[1]: Entering directory '/home/kalaluthien/DeepLearningExamples/FasterTransformer/v3.1/build/CMakeFiles/CMakeTmp'
Building CXX object CMakeFiles/cmTC_5ac33.dir/CheckSymbolExists.cxx.o
/usr/bin/c++     -o CMakeFiles/cmTC_5ac33.dir/CheckSymbolExists.cxx.o -c /home/kalaluthien/DeepLearningExamples/FasterTransformer/v3.1/build/CMakeFiles/CMakeTmp/CheckSymbolExists.cxx
Linking CXX executable cmTC_5ac33
/home/kalaluthien/cmake-3.14.0-Linux-x86_64/bin/cmake -E cmake_link_script CMakeFiles/cmTC_5ac33.dir/link.txt --verbose=1
/usr/bin/c++       CMakeFiles/cmTC_5ac33.dir/CheckSymbolExists.cxx.o  -o cmTC_5ac33
/usr/bin/ld: CMakeFiles/cmTC_5ac33.dir/CheckSymbolExists.cxx.o: in function `main':
CheckSymbolExists.cxx:(.text+0x1f): undefined reference to `pthread_create'
collect2: error: ld returned 1 exit status
make[1]: *** [CMakeFiles/cmTC_5ac33.dir/build.make:87: cmTC_5ac33] Error 1
make[1]: Leaving directory '/home/kalaluthien/DeepLearningExamples/FasterTransformer/v3.1/build/CMakeFiles/CMakeTmp'
make: *** [Makefile:121: cmTC_5ac33/fast] Error 2

File /home/kalaluthien/DeepLearningExamples/FasterTransformer/v3.1/build/CMakeFiles/CMakeTmp/CheckSymbolExists.cxx:
/* */
#include <pthread.h>

int main(int argc, char** argv)
{
  (void)argv;
#ifndef pthread_create
  return ((int*)(&pthread_create))[argc];
#else
  (void)argc;
  return 0;
#endif
}

Determining if the function pthread_create exists in the pthreads failed with the following output:
Change Dir: /home/kalaluthien/DeepLearningExamples/FasterTransformer/v3.1/build/CMakeFiles/CMakeTmp

Run Build Command(s):/usr/bin/make cmTC_2be15/fast
/usr/bin/make -f CMakeFiles/cmTC_2be15.dir/build.make CMakeFiles/cmTC_2be15.dir/build
make[1]: Entering directory '/home/kalaluthien/DeepLearningExamples/FasterTransformer/v3.1/build/CMakeFiles/CMakeTmp'
Building CXX object CMakeFiles/cmTC_2be15.dir/CheckFunctionExists.cxx.o
/usr/bin/c++    -DCHECK_FUNCTION_EXISTS=pthread_create   -o CMakeFiles/cmTC_2be15.dir/CheckFunctionExists.cxx.o -c /home/kalaluthien/DeepLearningExamples/FasterTransformer/v3.1/build/CMakeFiles/CheckLibraryExists/CheckFunctionExists.cxx
Linking CXX executable cmTC_2be15
/home/kalaluthien/cmake-3.14.0-Linux-x86_64/bin/cmake -E cmake_link_script CMakeFiles/cmTC_2be15.dir/link.txt --verbose=1
/usr/bin/c++   -DCHECK_FUNCTION_EXISTS=pthread_create    CMakeFiles/cmTC_2be15.dir/CheckFunctionExists.cxx.o  -o cmTC_2be15 -lpthreads
/usr/bin/ld: cannot find -lpthreads
collect2: error: ld returned 1 exit status
make[1]: *** [CMakeFiles/cmTC_2be15.dir/build.make:87: cmTC_2be15] Error 1
make[1]: Leaving directory '/home/kalaluthien/DeepLearningExamples/FasterTransformer/v3.1/build/CMakeFiles/CMakeTmp'
make: *** [Makefile:121: cmTC_2be15/fast] Error 2

I found that CMake uses -lpthread when it compiles the FasterTransformer targets, and -lpthreads only for the CheckSymbolExists probe. Weird.
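
A hedged reading of this log: the failing -lpthreads probe is CMake's normal fallback check, and the console output above already reports "Found Threads: TRUE" once -lpthread is found, so the CMakeError.log entries are likely a red herring. The configure step actually stops at the TypeError raised by torch.utils.cpp_extension._prepare_ldflags, a private PyTorch helper whose newer versions require an is_standalone argument, which suggests a mismatch between the installed PyTorch and the Python one-liner invoked around CMakeLists.txt:175. A minimal sketch to confirm what the installed PyTorch expects:

import inspect
from torch.utils import cpp_extension

# If 'is_standalone' appears here, the CMake script's call predates this PyTorch.
print(inspect.signature(cpp_extension._prepare_ldflags))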

Expected behavior
The CMake project should build successfully.

Environment
Please provide at least:

  • Container version (e.g. pytorch:19.05-py3): latest native PyTorch, but that is not the point.
  • GPUs in the system: (e.g. 8x Tesla V100-SXM2-16GB): A100-SXM4-40GB
  • CUDA driver version (e.g. 418.67): 450.80.02

[Faster transformer] Having a guide on how to use weights from a Hugging Face transformer model (RoBERTa-based) with FasterTransformer 3.1

Related to FasterTransformer + Hugging Face + PyTorch

Is your feature request related to a problem? Please describe.
It seems that FasterTransformer should be able to import weights from a RoBERTa-based Hugging Face model, but the way to do it is not obvious.

Describe the solution you'd like
A section of the README dedicated to using weights from Hugging Face Transformers v4 (the latest version) in a FasterTransformer model.

Describe alternatives you've considered
N/A

Additional context
At some point in the project, Hugging Face Transformers v2 is used, but my attempt to load a RoBERTa-based model from Hugging Face v4 failed, even though in theory it's the same architecture. I tried renaming the layers to match those expected by BERT, but it didn't work: the outputs didn't match the ones produced before the transfer... There are probably other transformations to perform, but I didn't find which ones.

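# Map RoBERTa checkpoint key names onto the BERT-style names used by the demo scripts.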
def rewrite_layer_name(layer_name: str) -> str:
    if "roberta." in layer_name:
        layer_name = layer_name.replace("roberta.", "bert.")
    elif "classifier.dense." in layer_name:
        layer_name = layer_name.replace("classifier.dense.", "bert.pooler.dense.")
    elif "classifier.out_proj." in layer_name:
        layer_name = layer_name.replace("classifier.out_proj.", "classifier.")
    return layer_name
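
A hedged sketch of one known BERT/RoBERTa difference that key renaming alone does not cover (the helper name and the 514-row check are assumptions for roberta-base-style checkpoints); other differences, such as the tokenizer and vocabulary, may also contribute to the mismatched outputs:

# Hypothetical helper: RoBERTa's learned position embeddings are offset by
# padding_idx + 1 (= 2), so the first two rows are dropped when mapping the
# weights onto a BERT-style checkpoint (roberta-base stores 514 = 512 + 2 rows).
def adjust_position_embeddings(state_dict):
    key = "bert.embeddings.position_embeddings.weight"  # name after rewrite_layer_name
    weight = state_dict.get(key)
    if weight is not None and weight.shape[0] == 514:
        state_dict[key] = weight[2:]
    return state_dict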
