
CTranslate2 Backend for Triton Inference Server

This is a backend based on CTranslate2 for NVIDIA's Triton Inference Server, which can be used to deploy translation and language models supported by CTranslate2 on Triton with both CPU and GPU capabilities.

It supports ragged and dynamic batching, and allows setting a subset of CTranslate2 decoding parameters in the model config.

Building

Make sure to have cmake installed on your system.

  1. Build and install CTranslate2: https://opennmt.net/CTranslate2/installation.html#compile-the-c-library
  2. Build the backend
mkdir build && cd build
export BACKEND_INSTALL_DIR=$(pwd)/install
cmake .. -DCMAKE_BUILD_TYPE=Release -DTRITON_ENABLE_GPU=1 -DCMAKE_INSTALL_PREFIX=$BACKEND_INSTALL_DIR
make install

This builds the backend into $BACKEND_INSTALL_DIR/backends/ctranslate2.

Setting up the backend

First install the pip package to convert models: pip install ctranslate2. Then create a model repository, which consists of a configuration (config.pbtxt) and the converted model.

For example, for the Helsinki-NLP/opus-mt-en-de HuggingFace transformer model, create a new directory, e.g. mkdir $MODEL_DIR/opus-mt-en-de. The converted model must live in a directory named model, nested inside a folder whose name is the numerical model version. From within $MODEL_DIR/opus-mt-en-de, run:

ct2-transformers-converter --model Helsinki-NLP/opus-mt-en-de --output_dir 1/model
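The resulting layout can be sketched as follows (a minimal illustration; a temporary directory stands in for $MODEL_DIR, and the converter would populate the 1/model directory with the actual model files):

```python
import tempfile
from pathlib import Path

# A temporary directory stands in for $MODEL_DIR in this sketch.
model_repo = Path(tempfile.mkdtemp())
model_dir = model_repo / "opus-mt-en-de"

# ct2-transformers-converter writes the converted model into "1/model".
(model_dir / "1" / "model").mkdir(parents=True)

# The Triton config sits next to the numbered version folder.
(model_dir / "config.pbtxt").touch()
```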

The minimal configuration for the model and backend is the following; example configs can be found in examples/model_repo:

backend: "ctranslate2" # must be ctranslate2
name: "opus-mt-en-de" # must be the same as the model name
max_batch_size: 128 # can be optimised based on available GPU memory
input [
  {
    name: "INPUT_IDS"
    data_type: TYPE_INT32
    dims: [ -1 ]
    allow_ragged_batch: true # needed for dynamic batching
  }
]
output [
  {
    name: "OUTPUT_IDS"
    data_type: TYPE_INT32
    dims: [ -1 ]
  }
]

instance_group [{ kind: KIND_GPU, count: 1 }] # use KIND_CPU for CPU inference
dynamic_batching {
  max_queue_delay_microseconds: 5000 # can be tuned based on latency requirements
}

Start tritonserver with the ctranslate2 backend directory and the model repository:

tritonserver --backend-directory $BACKEND_INSTALL_DIR/backends --model-repository $MODEL_DIR

The backend is set up for ragged, dynamic batching. This means you should send each input in its own request; Triton takes care of batching, using the dynamic_batching and max_batch_size configuration to build appropriately sized batches for best performance on GPUs.
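To illustrate what ragged batching means here (a simplified sketch in Python, not the backend's actual C++ code, with made-up token IDs): variable-length requests are concatenated without padding, and per-request lengths let the backend split the flat buffer back apart.

```python
import numpy as np

# Three separate client requests, each one variable-length INT32 sequence
# (token IDs are made up for illustration).
requests = [
    np.array([12, 7, 901, 2], dtype=np.int32),
    np.array([5, 44, 2], dtype=np.int32),
    np.array([88, 13, 6, 1024, 2], dtype=np.int32),
]

# With ragged batching, Triton hands the backend one flat buffer
# plus the per-request lengths -- no padding tokens anywhere.
ragged = np.concatenate(requests)
lengths = np.array([len(r) for r in requests], dtype=np.int32)

# The backend recovers each request's tokens from the lengths.
recovered = np.split(ragged, np.cumsum(lengths)[:-1])
```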

Providing a target prefix

There are models that require a special prefix token for the decoder. For example the M2M models need a token that specifies the target language, or sometimes it might be useful to start the translation with a specific prefix. An example config can be found in the facebook M2M config.pbtxt.

Sending requests

The backend expects token IDs as input and produces token IDs as output, using INT32 or wider integer types. Depending on the model, these may include beginning-of-sentence and end-of-sentence tokens. No padding tokens need to be added, though; the backend takes care of padding and batching.

You can use the official Triton clients to make requests; both the HTTP and gRPC protocols are supported. We provide an example for a Python client here.

You can try a working translation by running:

echo "How is your day going?" | python3 examples/client.py

If you want to use a special prefix for decoding, the request also needs the TARGET_PREFIX input set, which could look like this in Python:

import tritonclient.grpc
from tritonclient.utils import np_to_triton_dtype

# prefix_ids: an INT32 numpy array of decoder prefix token IDs
inputs.append(
    tritonclient.grpc.InferInput("TARGET_PREFIX", prefix_ids.shape, np_to_triton_dtype(prefix_ids.dtype))
)
inputs[-1].set_data_from_numpy(prefix_ids)

Configuration

The backend exposes a few parameters for customisation: currently a subset of the decoding options in the CTranslate2 C++ interface, plus some special ones that are useful for limiting inference compute requirements:

  • compute_type overrides the numerical precision used for the majority of the Transformer computation; it relates to the quantization types used in model conversion.
  • max_decoding_length_multiple limits the number of output tokens to a multiple of the number of input tokens. E.g. if the longest input sequence is 10 tokens and max_decoding_length_multiple: "2" is set, decoding is limited to 20 tokens.
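As a sketch of that arithmetic (illustrative lengths, not taken from a real model):

```python
# Token counts of the inputs in one batch (illustrative values).
input_lengths = [10, 7, 4]

# Parsed from the string value "2" in config.pbtxt.
max_decoding_length_multiple = 2

# Decoding is capped relative to the longest input in the batch.
max_decoding_length = max(input_lengths) * max_decoding_length_multiple
```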

Decoding parameters passed through to CTranslate2 (more to be added):

  • beam_size
  • repetition_penalty

The parameters can be set like this in the config.pbtxt:

parameters [
  {
    key: "compute_type"
    value {
      string_value: "float16" # optional, can be used to force a compute type
    }
  },
  {
    key: "max_decoding_length_multiple"
    value {
      string_value: "2"
    }
  },
  {
    key: "beam_size"
    value {
      string_value: "4"
    }
  },
  {
    key: "repetition_penalty"
    value {
      string_value: "1.5"
    }
  }
]
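Note that config.pbtxt parameter values are always strings, so they have to be parsed into typed values. A hypothetical sketch in Python (the real backend does this in C++):

```python
# Parameter table as it would arrive from the config.pbtxt above;
# Triton delivers every parameter value as a string.
params = {
    "compute_type": "float16",
    "max_decoding_length_multiple": "2",
    "beam_size": "4",
    "repetition_penalty": "1.5",
}

# Parse into typed values, falling back to illustrative defaults.
compute_type = params.get("compute_type", "default")
beam_size = int(params.get("beam_size", "1"))
repetition_penalty = float(params.get("repetition_penalty", "1.0"))
```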

Contributors

hennerm

Issues

Adding a minimal config example

Hey,

Thank you for bringing ctranslate2 to Triton! I am currently working on a project for a course using ctranslate2 and would like to host the model on Triton.

I got it up with Triton, but I am struggling to find the correct config.

Would it be possible to add a minimal example for a model?

Thank you!

Investigate support for streaming mode

Previously reported in #2 (comment) by @aamir-s18

Streaming mode could be useful for very big models. It can help in real-time use cases where we can improve User Experience by generating one token at a time.

Triton Server supports streaming with decoupled models

It needs to be investigated how CTranslate2 can be used to get decoded tokens one by one. Additionally, this might be trickier in a beam-decode setting, unless we are willing to always return the current best guess, which could flip previous words.

Skip tokenisation and detokenisation

CTranslate2 does not provide an API that takes numerical token IDs, so we convert the IDs we get from Triton to subwords for the encoder input and convert the decoder outputs back to IDs. This is unnecessary overhead, since it then has to be done on both the client and the server.
