
Comments (4)

richardliaw commented on June 18, 2024

What type of distributed inferencing do you plan to do? Is it model parallel or data parallel?


hetian127 commented on June 18, 2024

I just want to run online API serving for an LLM like Qwen1.5-110B-Chat.
My main steps are as follows:
1. I built a Docker image that includes the OFED driver; "ibstat" shows my 200G InfiniBand card.

2. I created a YAML file like this:

rayClusterConfig:
  rayVersion: '2.9.0' # should match the Ray version in the image of the containers
  ######################headGroupSpecs#################################
  # Ray head pod template.
  headGroupSpec:
    # The rayStartParams are used to configure the ray start command.
    # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of rayStartParams in KubeRay.
    # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in rayStartParams.
    rayStartParams:
      dashboard-host: '0.0.0.0'
    # pod template
    template:
      spec:
        containers:
        - name: ray-head
          image: repo:5000/harbor/rayvllm:v3
          resources:
            limits:
              nvidia.com/gpu: 8
              cpu: 8
              memory: 64Gi
            requests:
              nvidia.com/gpu: 8
              cpu: 8
              memory: 64Gi
          volumeMounts:
          - name: share
            mountPath: "/share"
          - name: shm
            mountPath: "/dev/shm"
          ports:
          - containerPort: 6379
            name: gcs-server
          - containerPort: 8265 # Ray dashboard
            name: dashboard
          - containerPort: 10001
            name: client
          - containerPort: 8000
            name: serve
          env:
          - name: USE_RDMA
            value: "true"
        volumes:
        - name: share
          hostPath:
            path: "/share"
            type: Directory
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: "64Gi"
  workerGroupSpecs:
  # the pod replicas in this group typed worker
  - replicas: 1
    minReplicas: 1
    maxReplicas: 5
    # logical group name, for this called small-group, also can be functional
    groupName: small-group
    # The rayStartParams are used to configure the ray start command.
    # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of rayStartParams in KubeRay.
    # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in rayStartParams.
    rayStartParams: {}
    # pod template
    template:
      spec:
        containers:
        - name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name', or '123-abc')
          image: repo:5000/harbor/rayvllm:v3
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          resources:
            limits:
              nvidia.com/gpu: 8
              cpu: "8"
              memory: "64Gi"
            requests:
              nvidia.com/gpu: 8
              cpu: "8"
              memory: "64Gi"
          volumeMounts:
          - name: share
            mountPath: "/share"
          - name: shm
            mountPath: "/dev/shm"
          env:
          - name: USE_RDMA
            value: "true"
        volumes:
        - name: share
          hostPath:
            path: "/share"
            type: Directory
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: "64Gi"
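
For completeness: the fragment above is only the rayClusterConfig portion, so it sits inside a full RayService/RayCluster manifest. Assuming that manifest is saved as ray-vllm.yaml (the filename is just a placeholder), applying it and sanity-checking the InfiniBand device from inside a pod looks roughly like this:

kubectl apply -f ray-vllm.yaml
kubectl get pods -o wide                      # head and worker pods should be Running
kubectl exec -it <head-pod-name> -- ibstat    # placeholder pod name; port state should be Active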

3. I created a head node and a worker node using KubeRay with the image I built, and ran the following command on the head node:

python -m vllm.entrypoints.openai.api_server \
    --model /path/Qwen1.5-110B-Chat \
    --tensor-parallel-size 16 \
    --host 0.0.0.0 \
    --trust-remote-code \
    --port 8000 \
    --worker-use-ray
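
One note (an assumption on my part about where the problem may be, not one of my original steps): vLLM's tensor-parallel traffic goes through NCCL, and NCCL decides on its own whether to use InfiniBand or fall back to TCP sockets. A rough sketch of environment variables that make this visible and steer NCCL toward the HCA; the device name mlx5 and interface eth0 are assumptions, and these would need to be present on every Ray node (e.g. in the pod env section), not just the shell that starts the server:

export NCCL_DEBUG=INFO           # NCCL logs which transport (IB or Socket) it selects
export NCCL_IB_HCA=mlx5          # restrict NCCL to the InfiniBand HCA (adjust to your device)
export NCCL_SOCKET_IFNAME=eth0   # interface used for NCCL bootstrap traffic (adjust to your setup)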

4. I ran a benchmark script like this:

python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model /path/Qwen1.5-110B-Chat \
    --dataset benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json \
    --request-rate 5 \
    --num-prompts 100 \
    --host xxxx \
    --port 8000 \
    --trust-remote-code

I observed on the Ray cluster's dashboard that the read/write throughput reaches up to 1.2 GB/s, but it does not utilize the InfiniBand network bandwidth.
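
For reference, a rough way to see whether traffic actually flows over the InfiniBand port while the benchmark runs; the device name mlx5_0 and port 1 are assumptions, so use whatever "ibstat" reports:

# Counters are in units of 4 bytes; sample them a few seconds apart.
cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_xmit_data
cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_rcv_data
# If they barely move during the benchmark, NCCL is most likely using TCP over Ethernet.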

So my plan is to use multiple nodes to perform distributed inference for large models, providing an OpenAI API server service and using the InfiniBand high-speed network for communication between nodes.
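
Once the server is up, the OpenAI-compatible endpoint can also be exercised directly; a minimal example (the host is a placeholder, and the model name defaults to the --model path used above):

curl http://<head-node-ip>:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "/path/Qwen1.5-110B-Chat", "prompt": "Hello", "max_tokens": 32}'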


xiphl commented on June 18, 2024

I have a similar use case. I tested it on a DGX cluster and deliberately spread the falcon180b model across multiple nodes (and saw that the read/write per node is about 2-3 GB/s).
I didn't set USE_RDMA, though.


mohittalele commented on June 18, 2024

@xiphl @hetian127 I also have a similar use case.

How did you actually spread the model? Was it using the SPREAD placement strategy described here: https://docs.ray.io/en/latest/ray-core/scheduling/placement-group.html?

And did vLLM automatically shard the model across multiple machines?

