cog-vllm-405b-base's Introduction

Cog-vLLM: Run vLLM on Replicate

Cog is an open-source tool that lets you package machine learning models in a standard, production-ready container. vLLM is a fast and easy-to-use library for LLM inference and serving.

You can deploy your packaged model to your own infrastructure, or to Replicate.

Highlights

  • 🚀 Run vLLM in the cloud with an API. Deploy any vLLM-supported language model at scale on Replicate.

  • 🏭 Support multiple concurrent requests. Continuous batching works out of the box.

  • 🐢 Open Source, all the way down. Look inside, take it apart, make it do exactly what you need.

Quickstart

Go to replicate.com/replicate/vllm and create a new vLLM model from a supported Hugging Face repo, such as google/gemma-2b.

Important

Gated models require a Hugging Face API token, which you can set in the hf_token field of the model creation form.

Replicate downloads the model files, packages them into a .tar archive, and pushes a new version of your model that's ready to use.

From here, you can either use your model as-is, or customize it and push up your changes.

Local Development

If you're on a machine or VM with a GPU, you can try out changes before pushing them to Replicate.

Start by installing or upgrading Cog. You'll need Cog v0.10.0-alpha11:

$ sudo curl -o /usr/local/bin/cog -L "https://github.com/replicate/cog/releases/download/v0.10.0-alpha11/cog_$(uname -s)_$(uname -m)"
$ sudo chmod +x /usr/local/bin/cog

Then clone this repository:

$ git clone https://github.com/replicate/cog-vllm
$ cd cog-vllm

Go to the Replicate dashboard and navigate to the training for your vLLM model. From that page, copy the weights URL from the Download weights button.

Set the COG_WEIGHTS environment variable with that copied value:

$ export COG_WEIGHTS="..."

Now, make your first prediction against the model locally:

$ cog predict -e "COG_WEIGHTS=$COG_WEIGHTS" \
              -i prompt="Hello!"

The first time you run this command, Cog downloads the model weights and saves them to the models subdirectory.
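
If you would rather script this step, the same command can be assembled from Python. This is only a sketch: `build_predict_argv` and `run_prediction` are hypothetical helpers, not part of Cog, and they use nothing beyond the `-e` and `-i` flags shown in the command above.

```python
import subprocess


def build_predict_argv(prompt, weights_url):
    """Return the argument list for the `cog predict` command shown above.

    Hypothetical helper for illustration; it is not part of Cog.
    """
    return [
        "cog", "predict",
        "-e", f"COG_WEIGHTS={weights_url}",  # same env var as the shell example
        "-i", f"prompt={prompt}",            # model input, passed as key=value
    ]


def run_prediction(prompt, weights_url):
    """Run `cog predict`; raises CalledProcessError if the command fails."""
    # Passing a list instead of a shell string avoids quoting issues in the prompt.
    subprocess.run(build_predict_argv(prompt, weights_url), check=True)
```

Calling `run_prediction("Hello!", os.environ["COG_WEIGHTS"])` (after `import os`) should behave like the shell command, assuming `cog` is on your PATH and COG_WEIGHTS is exported.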

To make multiple predictions, start up the HTTP server and send it POST /predictions requests.

# Start the HTTP server
$ cog run -p 5000 -e "COG_WEIGHTS=$COG_WEIGHTS" python -m cog.server.http

# In a different terminal session, send requests to the server
$ curl http://localhost:5000/predictions -X POST \
    -H 'Content-Type: application/json' \
    -d '{"input": {"prompt": "Hello!"}}'
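
The curl request above can also be sent from Python using only the standard library. The sketch below assumes the server is listening on localhost:5000 as started above; `build_prediction_request` and `predict` are illustrative names, not part of Cog.

```python
import json
import urllib.request


def build_prediction_request(prompt, base_url="http://localhost:5000"):
    """Build the same POST /predictions request as the curl example."""
    body = json.dumps({"input": {"prompt": prompt}}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/predictions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(prompt, base_url="http://localhost:5000"):
    """Send one prediction request and return the decoded JSON response."""
    request = build_prediction_request(prompt, base_url)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the server running, `predict("Hello!")` returns the response body as a Python dictionary. Keeping the request builder separate from the network call makes it easy to inspect the payload without a running server.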

When you're finished working, you can push your changes to Replicate.

Grab your token from replicate.com/account and set it as an environment variable:

$ export REPLICATE_API_TOKEN=<your token>
$ echo $REPLICATE_API_TOKEN | cog login --token-stdin
$ cog push r8.im/<your-username>/<your-model-name>
--> ...
--> Pushing image 'r8.im/...'

After you push your model, you can try running it on Replicate.

Install the Replicate Python SDK:

$ pip install replicate

Create a prediction and stream its output:

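import replicate

# The client reads REPLICATE_API_TOKEN from the environment.
# Look up the model you pushed and start a prediction on its latest version.
model = replicate.models.get("<your-username>/<your-model-name>")
prediction = replicate.predictions.create(
    version=model.latest_version,
    input={"prompt": "Hello"},
    stream=True,
)

# Print each streamed event as it arrives, without adding newlines.
for event in prediction.stream():
    print(str(event), end="")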
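
If you need the complete text rather than incremental printing, the streamed events can be joined into one string. This is a minimal sketch: `collect_output` is a hypothetical helper, and it assumes only that each event converts to text with `str()`, as the loop above does.

```python
def collect_output(events):
    """Join streamed events into a single string.

    `events` can be any iterable whose items convert to text with str().
    """
    return "".join(str(event) for event in events)
```

For example, `text = collect_output(prediction.stream())` would gather the whole response before you use it.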

Contributors

technillogue, mattt, cosmicoptima, joehoover, cloneofsimo, zeke