alpayariyak / worker-vllm

This project forked from runpod-workers/worker-vllm

The RunPod worker template for serving our large language model endpoints. Powered by VLLM.

License: MIT License

Shell 5.15% Python 85.32% Dockerfile 9.52%

worker-vllm's Introduction

vLLM Endpoint | Serverless Worker

🚀 | This serverless worker uses vLLM, an inference and serving engine for large language models, behind the scenes and is integrated into RunPod's serverless environment. It supports dynamic auto-scaling through RunPod's built-in autoscaling feature.

Docker Arguments:

  1. HUGGING_FACE_HUB_TOKEN: Your private Hugging Face token. It is required for downloading models that require accepting an End User License Agreement (EULA), such as the Llama 2 family of models.
  2. MODEL_NAME: The Hugging Face model to use. Make sure the chosen model is supported by vLLM. Refer to the list of supported models for compatibility.
  3. TOKENIZER: (Optional) The tokenizer to use. To use the model's default tokenizer, omit this Docker argument entirely.
  4. STREAMING: Whether to use HTTP streaming. Specify True to enable HTTP streaming; otherwise, omit this argument.

Llama 2 7B Chat:

docker build . --platform linux/amd64 --build-arg HUGGING_FACE_HUB_TOKEN=your_hugging_face_token_here --build-arg MODEL_NAME=meta-llama/Llama-2-7b-chat-hf --build-arg TOKENIZER=hf-internal-testing/llama-tokenizer --build-arg STREAMING=True

Llama 2 13B Chat:

docker build . --platform linux/amd64 --build-arg HUGGING_FACE_HUB_TOKEN=your_hugging_face_token_here --build-arg MODEL_NAME=meta-llama/Llama-2-13b-chat-hf --build-arg TOKENIZER=hf-internal-testing/llama-tokenizer --build-arg STREAMING=True

Please make sure to replace your_hugging_face_token_here with your actual Hugging Face token to enable model downloads that require it.
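
The build commands above differ only in their build arguments, so they can be assembled programmatically. The following is a minimal sketch, not part of this repository: `docker_build_command` is a hypothetical helper that uses only the flags and argument names shown above and leaves out TOKENIZER and STREAMING when they are not requested.

```python
import shlex
from typing import List, Optional


def docker_build_command(
    model_name: str,
    hf_token: str,
    tokenizer: Optional[str] = None,
    streaming: bool = False,
    context: str = ".",
) -> List[str]:
    """Return the argv list for `docker build` (hypothetical helper)."""
    cmd = ["docker", "build", context, "--platform", "linux/amd64"]
    build_args = {
        "HUGGING_FACE_HUB_TOKEN": hf_token,
        "MODEL_NAME": model_name,
    }
    # Optional arguments are omitted entirely rather than passed empty.
    if tokenizer:
        build_args["TOKENIZER"] = tokenizer
    if streaming:
        build_args["STREAMING"] = "True"
    for key, value in build_args.items():
        cmd += ["--build-arg", f"{key}={value}"]
    return cmd


if __name__ == "__main__":
    argv = docker_build_command(
        model_name="meta-llama/Llama-2-7b-chat-hf",
        hf_token="your_hugging_face_token_here",
        tokenizer="hf-internal-testing/llama-tokenizer",
        streaming=True,
    )
    # Print a shell-safe command line; pass argv to subprocess.run to execute it.
    print(shlex.join(argv))
```

Returning a list rather than a single string keeps the command safe to hand to `subprocess.run` without shell quoting issues.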

Make sure Docker is installed and properly set up before running the docker build commands. Once built, you can deploy this serverless worker in your desired environment, where it scales automatically based on demand. For further questions or assistance, contact our support team.

Model Inputs

| Argument           | Type            | Default   | Description                                                                                                                                                      |
|--------------------|-----------------|-----------|------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| n                  | int             | 1         | Number of output sequences to return for the given prompt.                                                                                                      |
| best_of            | Optional[int]   | None      | Number of output sequences that are generated from the prompt. From these `best_of` sequences, the top `n` sequences are returned. `best_of` must be greater than or equal to `n`. This is treated as the beam width when `use_beam_search` is True. By default, `best_of` is set to `n`. |
| presence_penalty   | float           | 0.0       | Float that penalizes new tokens based on whether they appear in the generated text so far. Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat tokens.                        |
| frequency_penalty  | float           | 0.0       | Float that penalizes new tokens based on their frequency in the generated text so far. Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat tokens.                          |
| temperature        | float           | 1.0       | Float that controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make the model more random. Zero means greedy sampling.                                        |
| top_p              | float           | 1.0       | Float that controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens.                            |
| top_k              | int             | -1        | Integer that controls the number of top tokens to consider. Set to -1 to consider all tokens.                                                               |
| use_beam_search    | bool            | False     | Whether to use beam search instead of sampling.                                                                                                             |
| stop               | Union[None, str, List[str]] | None | List of strings that stop the generation when they are generated. The returned output will not contain the stop strings.                       |
| ignore_eos         | bool            | False     | Whether to ignore the EOS token and continue generating tokens after the EOS token is generated.                                                            |
| max_tokens         | int             | 256       | Maximum number of tokens to generate per output sequence.                                                                                                   |
| logprobs           | Optional[int]   | None      | Number of log probabilities to return per output token.                                                                                                     |
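
The defaults and constraints in the table can be checked on the client before a request is sent. The sketch below is illustrative only: `SamplingInputs` is a hypothetical class that mirrors the table and is not the worker's or vLLM's own implementation. It enforces just the two constraints the table states explicitly (`best_of >= n` and `top_p` in (0, 1]).

```python
from dataclasses import asdict, dataclass
from typing import List, Optional, Union


@dataclass
class SamplingInputs:
    """Hypothetical client-side mirror of the Model Inputs table."""

    n: int = 1
    best_of: Optional[int] = None
    presence_penalty: float = 0.0
    frequency_penalty: float = 0.0
    temperature: float = 1.0
    top_p: float = 1.0
    top_k: int = -1
    use_beam_search: bool = False
    stop: Union[None, str, List[str]] = None
    ignore_eos: bool = False
    max_tokens: int = 256
    logprobs: Optional[int] = None

    def __post_init__(self) -> None:
        # By default, best_of is set to n.
        if self.best_of is None:
            self.best_of = self.n
        if self.best_of < self.n:
            raise ValueError("best_of must be greater than or equal to n")
        if not 0.0 < self.top_p <= 1.0:
            raise ValueError("top_p must be in (0, 1]")


if __name__ == "__main__":
    params = SamplingInputs(n=2, max_tokens=100)
    print(asdict(params))
```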

Test Inputs

The following inputs can be used for testing the model:

{
    "input": {
        "prompt": "Who is the president of the United States?",
        "sampling_params": {
            "max_tokens": 100
        }
    }
}
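
One way to send this test input from Python is sketched below, using only the standard library. Several details are assumptions that do not come from this README: the synchronous endpoint URL pattern `https://api.runpod.ai/v2/<endpoint_id>/runsync`, the Bearer-token header, and the environment variable names `RUNPOD_ENDPOINT_ID` and `RUNPOD_API_KEY`. Check them against your RunPod account before relying on them.

```python
import json
import os
import urllib.request


def build_request(endpoint_id: str, api_key: str, prompt: str,
                  max_tokens: int = 100) -> urllib.request.Request:
    """Build a POST request carrying the test input (hypothetical helper)."""
    payload = {
        "input": {
            "prompt": prompt,
            "sampling_params": {"max_tokens": max_tokens},
        }
    }
    # Assumed URL pattern for a synchronous RunPod serverless call.
    url = f"https://api.runpod.ai/v2/{endpoint_id}/runsync"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


if __name__ == "__main__":
    endpoint_id = os.environ.get("RUNPOD_ENDPOINT_ID")
    api_key = os.environ.get("RUNPOD_API_KEY")
    if endpoint_id and api_key:
        request = build_request(
            endpoint_id, api_key,
            "Who is the president of the United States?",
        )
        # Only contacts the network when both variables are set.
        with urllib.request.urlopen(request) as response:
            print(json.load(response))
```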

worker-vllm's People

Contributors

jorghi12, vladmihaisima
