intel / llm-on-ray

Pretrain, finetune and serve LLMs on Intel platforms with Ray

License: Apache License 2.0

Python 93.21% Dockerfile 0.23% Shell 5.20% Jinja 1.36%

llm-on-ray's Introduction

LLM-on-Ray

Introduction

LLM-on-Ray is a comprehensive solution designed to empower users in building, customizing, and deploying Large Language Models (LLMs). Whether you're starting from scratch with pretraining, looking to finetune an existing model, or aiming to deploy a production-ready LLM endpoint service, this project simplifies these complex processes into manageable steps.

LLM-on-Ray harnesses the power of Ray, an industry-leading framework for distributed computing, to scale your AI workloads efficiently. This integration ensures robust fault tolerance and cluster resource management, making your LLM projects more resilient and scalable.

LLM-on-Ray is built to operate across various hardware setups, including Intel CPU, Intel GPU and Intel Gaudi2. It incorporates several industry and Intel optimizations to maximize performance, including vLLM, llama.cpp, Intel Extension for PyTorch/DeepSpeed, IPEX-LLM, RecDP-LLM, NeuralChat and more.

Solution Technical Overview

LLM-on-Ray's modular workflow structure is designed to comprehensively cater to the various stages of LLM development, from pretraining and finetuning to serving. These workflows are intuitive, highly configurable, and tailored to meet the specific needs of each phase in the LLM lifecycle:

  • Pretraining Workflow: Provides the infrastructure to build LLMs from scratch.

    • Data Preparation: Includes a suite of tools for preparing your training data, facilitating tasks like the removal of Personally Identifiable Information (PII), data deduplication (dedup), and other preprocessing needs, making the data safe and efficient for training.
    • Megatron-DeepSpeed Integration: Leverages the power of Megatron-DeepSpeed to enable advanced capabilities such as pipeline parallelism, tensor parallelism, data parallelism, and Zero Redundancy Optimizer (ZeRO). This integration facilitates efficient and scalable model training from the ground up.
    • Robust Fault Tolerance: Offers automatic fault tolerance powered by Ray. This ensures high availability, reliability, and optimal performance for large scale pretraining.
  • Finetuning Workflow: Supports refinement of pre-trained models with proprietary or specialized data, improving models' accuracy and applicability to various use cases.

    • Ease of Customization: Users can easily configure the base model and resource allocation for the training job, and customize training parameters to fit their specific needs. This can be accomplished through a simple command line or via the Web UI.
    • Parameter Efficient Finetuning: Supports various parameter efficient finetuning methods such as LoRA to accelerate the finetuning process.
    • Reinforcement Learning with Human Feedback (RLHF): Users can further refine the model using RLHF, which leverages proximal policy optimization (PPO).
  • Serving Workflow: Deploys a scalable and production-ready LLM serving endpoint.

    • Easy Deployment of Models: Supports the deployment of both widely-used open-source models and custom finetuned models through flexible configurations.
    • Autoscaling and Scale-to-Zero Capabilities: Ensures high efficiency and cost-effectiveness in model deployment. The workflow can dynamically scale resources to match demand and scale down to zero when the model is not in use, optimizing resource usage and reducing operational costs.
    • Optimized for Performance and Efficiency: LLM-on-Ray incorporates several optimizations to maximize performance. This includes support for various precision levels and the utilization of advanced optimization techniques from Intel, ensuring efficient processing and reduced resource consumption.
    • OpenAI-Like REST API: Provides APIs similar to OpenAI's, making it easier for users to transition to or integrate open-source models into their systems.
  • Interactive Web UI for Enhanced Usability: In addition to the command line, LLM-on-Ray provides a Web UI, allowing users to easily finetune and deploy LLMs through a user-friendly interface. The UI also includes a chatbot application, enabling users to immediately test and refine the models.

Getting Started Locally With Source Code

This guide will assist you in setting up LLM-on-Ray on Intel CPU locally, covering the initial setup, finetuning models, and deploying them for serving.

Setup

1. Clone the repository, install llm-on-ray and its dependencies.

Software requirements: Git and Conda

git clone https://github.com/intel/llm-on-ray.git
cd llm-on-ray
conda create -n llm-on-ray python=3.9
conda activate llm-on-ray
pip install .[cpu] --extra-index-url https://download.pytorch.org/whl/cpu --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/cpu/us/

2. Start Ray

[Optional] If DeepSpeed is enabled or you are doing distributed finetuning, the oneCCL and Intel MPI libraries should be dynamically linked on every node before Ray starts:

source $(python -c "import oneccl_bindings_for_pytorch as torch_ccl; print(torch_ccl.cwd)")/env/setvars.sh

Start Ray locally using the following command. To launch a Ray cluster, please follow the setup document.

ray start --head
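
As an optional sanity check (not part of the documented flow), you can confirm from Python that the local cluster started above is reachable; this sketch only assumes Ray is installed in the active environment:

# optional check that the Ray cluster started above is reachable
import ray

ray.init(address="auto")        # attach to the running cluster instead of starting a new one
print(ray.cluster_resources())  # e.g. number of CPUs and memory available to Ray
ray.shutdown()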

Finetuning

Use the following command to finetune a model using an example dataset and default configurations. The finetuned model will be stored in /tmp/llm-ray/output by default. To customize the base model, dataset and configurations, please see the finetuning document:

llm_on_ray-finetune --config_file llm_on_ray/finetune/finetune.yaml

Serving

Deploy a model on Ray and expose an endpoint for serving. This command uses GPT2 as an example, but more model configuration examples can be found in the inference/models directory:

llm_on_ray-serve --config_file llm_on_ray/inference/models/gpt2.yaml

You can also serve a model directly by its model_id:

llm_on_ray-serve --models gpt2

List all supported model_ids along with their config file paths:

llm_on_ray-serve --list_model_ids

By default, serving provides an OpenAI-compatible API server (see the OpenAI API Reference). You can access and test it in several ways:

# using curl
export ENDPOINT_URL=http://localhost:8000/v1
curl $ENDPOINT_URL/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "gpt2",
    "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Hello!"}],
    "temperature": 0.7
    }'

# using requests library
python examples/inference/api_server_openai/query_http_requests.py

# using OpenAI SDK
pip install openai>=1.0
export OPENAI_BASE_URL=http://localhost:8000/v1
export OPENAI_API_KEY="not_a_real_key"
python examples/inference/api_server_openai/query_openai_sdk.py
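
If you prefer not to use the example scripts, a minimal inline call with the OpenAI SDK (>=1.0) looks roughly like the sketch below, assuming the same OPENAI_BASE_URL and dummy OPENAI_API_KEY exported above:

# minimal OpenAI SDK call against the local endpoint (sketch)
from openai import OpenAI

client = OpenAI()  # reads OPENAI_BASE_URL and OPENAI_API_KEY from the environment
resp = client.chat.completions.create(
    model="gpt2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    temperature=0.7,
)
print(resp.choices[0].message.content)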

Alternatively, you can serve a specific model through a simple endpoint according to the port and route_prefix parameters in the configuration file:

llm_on_ray-serve --config_file llm_on_ray/inference/models/gpt2.yaml --simple

After deploying the model endpoint, you can access and test it by using the script below:

python examples/inference/api_server_simple/query_single.py --model_endpoint http://127.0.0.1:8000/gpt2
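
If you want to query the simple endpoint by hand instead of via the script, it accepts a plain JSON POST; the payload field names below ("text", "config") are assumptions for illustration only, so treat query_single.py as the authoritative reference:

# hand-rolled request to the simple endpoint (sketch; field names are hypothetical)
import requests

payload = {
    "text": "Tell me a joke.",         # hypothetical prompt field
    "config": {"max_new_tokens": 64},  # hypothetical generation-config field
}
resp = requests.post("http://127.0.0.1:8000/gpt2", json=payload, timeout=60)
print(resp.text)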

Getting Started With Docker

This guide will assist you in setting up LLM-on-Ray with Docker.

git clone https://github.com/intel/llm-on-ray.git
cd llm-on-ray

The Dockerfile for users is dev/docker/Dockerfile.user.

1. Source Docker Functions

Detailed Docker parameters can be set in dev/scripts/docker-functions.sh.

source dev/scripts/docker-functions.sh

2. Build Docker Image

By default, the image is built with CPU and DeepSpeed for LLM serving.

build_docker 

Change the build_docker function's arguments for a different environment.

Use vLLM for LLM serving.

build_docker vllm 

Use IPEX-LLM for LLM serving.

build_docker ipex-llm 

3. Start Docker

Change any settings in dev/scripts/docker-functions.sh.

Run Docker serving gpt2 on CPU.

start_docker

Run Docker serving other supported models on CPU.

start_docker llama-2-7b-chat-hf

Run Docker with a different environment and model using start_docker {environment} {models}, for example:

start_docker vllm llama-2-7b-chat-hf

4. Start LLM-on-Ray

The model serving port in the Docker container is mapped to the local host.

Using the requests library:

python examples/inference/api_server_openai/query_http_requests.py

Using the OpenAI SDK:

python examples/inference/api_server_openai/query_openai_sdk.py

Documents

The following are detailed guidelines for pretraining, finetuning and serving LLMs in various computing environments.

Pretraining:

Finetuning:

Serving

Web UI

Disclaimer

To the extent that any public datasets are referenced by Intel or accessed using tools or code on this site those datasets are provided by the third party indicated as the data source. Intel does not create the data, or datasets, and does not warrant their accuracy or quality. By accessing the public dataset(s), or using a model trained on those datasets, you agree to the terms associated with those datasets and that your use complies with the applicable license.

Intel expressly disclaims the accuracy, adequacy, or completeness of any public datasets, and is not liable for any errors, omissions, or defects in the data, or for any reliance on the data. Intel is not liable for any liability or damages relating to your use of public datasets.

llm-on-ray's People

Contributors

carsonwang, deegue, dependabot[bot], faaany, harborn, jiafuzha, kepingyan, kira-lin, minmingzhu, rbrugaro, rdower, susu-noob, thequantumquirk, tianyil1, xuechendi, xwu99, yao531441, yuanwu2017, yutianchen666, zhangjian94cn, zhouyu5

llm-on-ray's Issues

Add ipex extra in pyproject.toml to use restricted transformers version

IPEX restricts the transformers version, but llm-on-ray doesn't. To verify IPEX and other llm-on-ray functions in parallel in CI, we can add a new ipex extra in pyproject.toml with the right transformers version, then add a corresponding nightly CI job in the GitHub workflow.

The brief steps are:

  1. Add an "ipex" extra in pyproject.toml:

    cpu = [
        "transformers>=4.35.0", # some models need higher version of transformers
        "intel_extension_for_pytorch==2.1.0+cpu",
        "torch==2.1.0+cpu",
        "oneccl_bind_pt==2.1.0+cpu"
    ]
    +ipex = [
    +    "transformers==4.31.0", # to make ipex fully functional; choose the right version based on the ipex version below
    +    "intel_extension_for_pytorch==2.1.0+cpu",
    +    "torch==2.1.0+cpu",
    +    "oneccl_bind_pt==2.1.0+cpu"
    +]

  2. Add a separate Dockerfile so that it can be cached properly in the CI build:
    Copy one of the Dockerfiles under dev/docker and rename it to Dockerfile.ipex. After that, replace 'pip install ...' with 'pip install .[ipex]'.

  3. Add nightly CI:
    Copy workflow-inference.yaml, rename it to workflow-inference-ipex.yaml, and call workflow-inference-ipex.yaml from workflow_orders_nightly.yaml:

    call-inference:
      uses: ./.github/workflows/workflow_inference.yml
      with:
        ci_type: nightly

    +call-inference-ipex:
    +  uses: ./.github/workflows/workflow_inference-ipex.yml

Inside workflow-inference-ipex.yaml, add inference tests for ipex supported models.

[Lint] Add file license header check

Add a check in the lint tests that any source code (*.py, *.sh, etc.) added to the repo has the following header.

#
# Copyright 2023 The LLM-on-Ray Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
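
A minimal sketch of what such a lint step could look like (this script and its choice of file patterns are assumptions, not an existing part of the repo):

# check_license_header.py -- verify the header above is present in source files (sketch)
import sys
from pathlib import Path

REQUIRED = "Copyright 2023 The LLM-on-Ray Authors."

def main() -> int:
    missing = []
    for pattern in ("*.py", "*.sh"):
        for path in Path(".").rglob(pattern):
            head = path.read_text(errors="ignore")[:2048]  # header should appear near the top
            if REQUIRED not in head:
                missing.append(str(path))
    if missing:
        print("Files missing the license header:")
        print("\n".join(missing))
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())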

Getting error while executing query_openai_sdk.py to test the inference

I ran inference of the Falcon-7b and neural-chat-7b-v3-1 models on the Ray server with the commands below:
python inference/serve.py --config_file inference/models/neural-chat-7b-v3-1.yaml --simple
python inference/serve.py --config_file inference/models/falcon-7b.yaml --simple
I could run a test inference with python examples/inference/api_server_simple/query_single.py --model_endpoint http://172.17.0.2:8000/neural-chat-7b-v3-1
I then exported:
export OPENAI_API_BASE=http://172.17.0.2:8000/falcon-7b
export OPENAI_API_KEY=
and tried to run python examples/inference/api_server_openai/query_openai_sdk.py, but I am getting the error below:

File "/root/llm-ray/examples/inference/api_server_openai/query_openai_sdk.py", line 45, in
models = openai.Model.list()
File "/usr/local/lib/python3.10/dist-packages/openai/api_resources/abstract/listable_api_resource.py", line 60, in list
response, _, api_key = requestor.request(
File "/usr/local/lib/python3.10/dist-packages/openai/api_requestor.py", line 298, in request
resp, got_stream = self._interpret_response(result, stream)
File "/usr/local/lib/python3.10/dist-packages/openai/api_requestor.py", line 700, in _interpret_response
self._interpret_response_line(
File "/usr/local/lib/python3.10/dist-packages/openai/api_requestor.py", line 757, in _interpret_response_line
raise error.APIError(
openai.error.APIError: HTTP code 500 from API (Unexpected error, traceback: ray::ServeReplica:falcon-7b:PredictorDeployment.handle_request_streaming() (pid=15684, ip=172.17.0.2)
File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/utils.py", line 165, in wrap_to_ray_error
raise exception
File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/replica.py", line 994, in call_user_method
await self._call_func_or_gen(
File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/replica.py", line 750, in _call_func_or_gen
result = await result
File "/root/llm-ray/inference/predictor_deployment.py", line 84, in call
json_request: Dict[str, Any] = await http_request.json()
File "/usr/local/lib/python3.10/dist-packages/starlette/requests.py", line 244, in json
self._json = json.loads(body)
File "/usr/lib/python3.10/json/init.py", line 346, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.10/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.10/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0).)

I installed openai version 0.28.0. Please let me know what the issue could be. Am I missing any installations?

Getting error while running start_ui.py

I followed the steps mentioned in the setup file to build the docker image for Gaudi
(https://github.com/intel/llm-on-ray/blob/main/docs/setup.md). I could run the Ray server using the command below:
ray start --head --node-ip-address 127.0.0.1 --dashboard-host='0.0.0.0' --dashboard-port=8265
I ran python -u ui/start_ui.py --master_ip_port "$node_ip:6379" since I could not figure out where to get node_user_name and conda_env_name for the command below:
python -u ui/start_ui.py --node_user_name $user --conda_env_name $conda_env --master_ip_port "$node_ip:6379"
I got many missing installation errors and installed the missing packages. After that I am getting the error below. Can you please let me know what I might be missing, and where to get node_user_name and conda_env_name from?

Error while connecting to Ray UI
Traceback (most recent call last):
File "/root/llm-ray/ui/start_ui.py", line 26, in
from inference.predictor_deployment import PredictorDeployment
File "/root/llm-ray/ui/../inference/predictor_deployment.py", line 21, in
from ray import serve
File "/usr/local/lib/python3.10/dist-packages/ray/serve/init.py", line 4, in
from ray.serve.api import (
File "/usr/local/lib/python3.10/dist-packages/ray/serve/api.py", line 15, in
from ray.serve.built_application import BuiltApplication
File "/usr/local/lib/python3.10/dist-packages/ray/serve/built_application.py", line 7, in
from ray.serve.deployment import Deployment
File "/usr/local/lib/python3.10/dist-packages/ray/serve/deployment.py", line 22, in
from ray.serve.context import _get_global_client
File "/usr/local/lib/python3.10/dist-packages/ray/serve/context.py", line 12, in
File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/client.py", line 28, in
from ray.serve._private.deploy_utils import get_deploy_args
File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/deploy_utils.py", line 8, in
from ray.serve.schema import ServeApplicationSchema
File "/usr/local/lib/python3.10/dist-packages/ray/serve/schema.py", line 141, in
class DeploymentSchema(BaseModel, allow_population_by_field_name=True):
File "/usr/local/lib/python3.10/dist-packages/ray/serve/schema.py", line 269, in DeploymentSchema
def num_replicas_and_autoscaling_config_mutually_exclusive(cls, values):
File "/usr/local/lib/python3.10/dist-packages/pydantic/deprecated/class_validators.py", line 231, in root_validator
return root_validator()(*__args) # type: ignore
File "/usr/local/lib/python3.10/dist-packages/pydantic/deprecated/class_validators.py", line 237, in root_validator
raise PydanticUserError(
pydantic.errors.PydanticUserError: If you use @root_validator with pre=False (the default) you MUST specify skip_on_failure=True. Note that @root_validator is deprecated and should be replaced with @model_validator.

Not able to run inference server for mistral 7b model, mpt-7b model on Ray

I built the Ray image as per https://github.com/intel/llm-on-ray/blob/main/docs/setup.md and could log in to the docker image and run the Ray server with ray start --head --node-ip-address 127.0.0.1 --dashboard-host='0.0.0.0' --dashboard-port=8265
To run inference with the mistral-7b-v0.1 model, I ran the command below:
python inference/serve.py --config_file inference/models/mistral-7b-v0.1.yaml --simple
First I got an installation error for intel_extension_for_pytorch; after installing that I am getting the error below.

After installing intel_extension_for_pytorch, I get the error below...
(ServeController pid=9051) await self._user_callable_wrapper.initialize_callable()
(ServeController pid=9051) File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/replica.py", line 778, in initialize_callable
(ServeController pid=9051) await self._call_func_or_gen(
(ServeController pid=9051) File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/replica.py", line 748, in _call_func_or_gen
(ServeController pid=9051) result = callable(*args, **kwargs)
(ServeController pid=9051) File "/root/llm-ray/inference/predictor_deployment.py", line 64, in init
(ServeController pid=9051) self.predictor = TransformerPredictor(infer_conf)
(ServeController pid=9051) File "/root/llm-ray/inference/transformer_predictor.py", line 79, in init
(ServeController pid=9051) import intel_extension_for_pytorch as ipex
(ServeController pid=9051) File "/usr/local/lib/python3.10/dist-packages/intel_extension_for_pytorch/init.py", line 94, in
(ServeController pid=9051) from . import cpu
(ServeController pid=9051) File "/usr/local/lib/python3.10/dist-packages/intel_extension_for_pytorch/cpu/init.py", line 1, in
(ServeController pid=9051) from . import runtime
(ServeController pid=9051) File "/usr/local/lib/python3.10/dist-packages/intel_extension_for_pytorch/cpu/runtime/init.py", line 3, in
(ServeController pid=9051) from .multi_stream import (
(ServeController pid=9051) File "/usr/local/lib/python3.10/dist-packages/intel_extension_for_pytorch/cpu/runtime/multi_stream.py", line 4, in
(ServeController pid=9051) import intel_extension_for_pytorch._C as core
(ServeController pid=9051) ImportError: /usr/local/lib/python3.10/dist-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-cpu.so: undefined symbol: _ZNK5torch8autograd4Node4nameEv

I also tried to run inference with the mpt model using python inference/serve.py --config_file inference/models/mpt-7b.yaml --simple
and I am getting the error below:

(ServeController pid=9051) File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
(ServeController pid=9051) return self.__get_result()
(ServeController pid=9051) File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
(ServeController pid=9051) raise self._exception
(ServeController pid=9051) File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/replica.py", line 570, in initialize_and_get_metadata
(ServeController pid=9051) raise RuntimeError(traceback.format_exc()) from None
(ServeController pid=9051) RuntimeError: Traceback (most recent call last):
(ServeController pid=9051) File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/replica.py", line 554, in initialize_and_get_metadata
(ServeController pid=9051) await self._user_callable_wrapper.initialize_callable()
(ServeController pid=9051) File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/replica.py", line 778, in initialize_callable
(ServeController pid=9051) await self._call_func_or_gen(
(ServeController pid=9051) File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/replica.py", line 748, in _call_func_or_gen
(ServeController pid=9051) result = callable(*args, **kwargs)
(ServeController pid=9051) File "/root/llm-ray/inference/predictor_deployment.py", line 64, in init
(ServeController pid=9051) self.predictor = TransformerPredictor(infer_conf)
(ServeController pid=9051) File "/root/llm-ray/inference/transformer_predictor.py", line 22, in init
(ServeController pid=9051) from optimum.habana.transformers.modeling_utils import (
(ServeController pid=9051) File "/root/optimum-habana/optimum/habana/transformers/modeling_utils.py", line 19, in
(ServeController pid=9051) from .models import (
(ServeController pid=9051) File "/root/optimum-habana/optimum/habana/transformers/models/init.py", line 59, in
(ServeController pid=9051) from .mpt import (
(ServeController pid=9051) File "/root/optimum-habana/optimum/habana/transformers/models/mpt/init.py", line 1, in
(ServeController pid=9051) from .modeling_mpt import (
(ServeController pid=9051) File "/root/optimum-habana/optimum/habana/transformers/models/mpt/modeling_mpt.py", line 24, in
(ServeController pid=9051) from transformers.models.mpt.modeling_mpt import MptForCausalLM, MptModel, _expand_mask, _make_causal_mask
(ServeController pid=9051) ImportError: cannot import name '_expand_mask' from 'transformers.models.mpt.modeling_mpt' (/usr/local/lib/python3.10/dist-packages/transformers/models/mpt/modeling_mpt.py)
(ServeController pid=9051) WARNING 2024-01-18 09:37:50,769 controller 9051 application_state.py:726 - The deployments ['PredictorDeployment'] are UNHEALTHY.
Traceback (most recent call last):
File "/root/llm-ray/inference/serve.py", line 170, in
main(sys.argv[1:])
File "/root/llm-ray/inference/serve.py", line 160, in main
openai_serve_run(deployments, host, route_prefix, args.port)
File "/root/llm-ray/inference/api_server_openai.py", line 75, in openai_serve_run
serve.run(
File "/usr/local/lib/python3.10/dist-packages/ray/serve/api.py", line 543, in run
client.deploy_application(
File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/client.py", line 50, in check
return f(self, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/client.py", line 321, in deploy_application
self._wait_for_application_running(name)

Can anyone help me out if I am missing anything, or if a specific version of a library is needed?

Optimize package requirements

To minimize requirements, the baseline packages are:

ray[serve, tune]
torch
intel_extension_for_pytorch
oneccl_bind_pt
transformers

If the dependency chain can't meet our requirements, we will add additional packages.

fix import from inference package

I am not sure why finetune and inference are installed as packages in the conda env. Depending on whether inference exists in the current directory and where the script is run from, either the installed package or the local package will be used on import, which is confusing:

from inference.inference_config import InferenceConfig, DEVICE_CPU

[UI] gradio package conflict

Gradio is bumped from 3.36 to 4.11 by #13. If we want to adapt to gradio 4.11, the code needs a lot of changes, and the pydantic version that gradio depends on conflicts with the one deepspeed depends on. We need a solution that allows Dependabot to ignore the gradio version, and then limit gradio<=3.36.

# gradio 4.11 needs pydantic>=2.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gradio 4.11.0 requires pydantic>=2.0, but you have pydantic 1.10.0 which is incompatible.
# deepspeed 0.11 needs pydantic<2.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
deepspeed 0.11.1 requires pydantic<2.0.0, but you have pydantic 2.5.3 which is incompatible.

Installation command not working

Hello! I tried to run your repo, but the first command is not working:

pip install .[cpu] -f https://developer.intel.com/ipex-whl-stable-cpu -f https://download.pytorch.org/whl/torch_stable.html

gives this error:

ERROR: Ignored the following yanked versions: 1.11.100, 1.12.200
ERROR: Ignored the following versions that require a different python version: 3.10.0.0 Requires-Python >=2.7, !=3.0., !=3.1., !=3.2., !=3.3., <3.5; 3.7.4.2 Requires-Python >=2.7, !=3.0., !=3.1., !=3.2., !=3.3., <3.5
ERROR: Could not find a version that satisfies the requirement intel-extension-for-pytorch==2.1.0+cpu; extra == "cpu" (from llm-on-ray[cpu]) (from versions: 1.10.100, 1.11.0, 1.11.200, 1.12.0, 1.12.100, 1.12.300, 1.13.0, 1.13.100, 2.0.0, 2.0.100, 2.1.0, 2.1.100, 2.2.0)
ERROR: No matching distribution found for intel-extension-for-pytorch==2.1.0+cpu; extra == "cpu"

[Serving] Example to chat from command line

Currently the examples just send requests to the serving server via HTTP request, the OpenAI SDK, etc. We can only demo chat from the web UI. It would be useful to support chat from the command line too.
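
A rough sketch of what such a command-line chat could look like on top of the existing OpenAI-compatible endpoint (the endpoint URL and model id below are assumptions):

# chat_cli.py -- command-line chat loop against the OpenAI-compatible endpoint (sketch)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not_a_real_key")  # assumed local endpoint
history = [{"role": "system", "content": "You are a helpful assistant."}]

while True:
    user_input = input("You: ").strip()
    if user_input in ("exit", "quit"):
        break
    history.append({"role": "user", "content": user_input})
    resp = client.chat.completions.create(model="gpt2", messages=history)  # model id is an assumption
    answer = resp.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    print("Assistant:", answer)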

[REST API] Support more parameters

Support more request parameters, including suffix, n>1, stop, logprobs, logit_bias, echo, presence_penalty, frequency_penalty. Currently only parameters included in transformers.generate are supported.

[Install] Align installation with ipex

@jiafuzha @KepingYan
https://github.com/intel/intel-extension-for-pytorch?tab=readme-ov-file#cpu-version

python -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
python -m pip install intel-extension-for-pytorch --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/cpu/us/
# for PRC user, you can check with the following link
python -m pip install intel-extension-for-pytorch --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/cpu/cn/

Support functions/tools in OpenAI API

Support functions/tools in the API to enable more use cases.
Refer to the OpenAI document below:
https://platform.openai.com/docs/api-reference/chat/create#chat-create-tools

tools:
A list of tools the model may call. Currently, only functions are supported as a tool. Use this to provide a list of functions the model may generate JSON inputs for.

tool_choice:
Controls which (if any) function is called by the model. none means the model will not call a function and instead generates a message. auto means the model can pick between generating a message or calling a function. Specifying a particular function via {"type": "function", "function": {"name": "my_function"}} forces the model to call that function.

none is the default when no functions are present. auto is the default if functions are present.

curl https://api.openai.com/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-d '{
  "model": "gpt-3.5-turbo",
  "messages": [
    {
      "role": "user",
      "content": "What is the weather like in Boston?"
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The city and state, e.g. San Francisco, CA"
            },
            "unit": {
              "type": "string",
              "enum": ["celsius", "fahrenheit"]
            }
          },
          "required": ["location"]
        }
      }
    }
  ],
  "tool_choice": "auto"
}'
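
For comparison, the same tool-calling request expressed with the OpenAI Python SDK (>=1.0) would look roughly like the sketch below; whether llm-on-ray's server accepts the tools and tool_choice fields is exactly what this issue is asking for:

# tools_request.py -- the curl request above expressed with the OpenAI Python SDK (sketch)
from openai import OpenAI

client = OpenAI()  # for llm-on-ray this would point at the local endpoint instead of api.openai.com
resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "What is the weather like in Boston?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "The city and state, e.g. San Francisco, CA"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }],
    tool_choice="auto",
)
print(resp.choices[0].message.tool_calls)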

serving mpt-7b-bigdl.yaml crashes when installed with pip install -e .[bigdl-cpu]

First create a new conda env with python 3.9, and install with:

pip install -e .[bigdl-cpu] -f https://developer.intel.com/ipex-whl-stable-cpu -f https://download.pytorch.org/whl/torch_stable.html

Then run:

python inference/serve.py --config_file inference/models/bigdl/mpt-7b-bigdl.yaml --serve_simple

It looks like something got messed up after installing bigdl-llm:

(screenshot of the error attached in the original issue)

After I switched to another env, it's OK.

@KepingYan @jiafuzha

Need to check if the request is valid JSON

inference/predictor_deployment.py:

async def __call__(self, http_request: Request) -> Union[StreamingResponse, str]:
        json_request: Dict[str, Any] = await http_request.json()

The type hint here doesn't guarantee that the body is actually a dict at runtime.
Exception handling should be added here for invalid requests.
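
One possible shape for that handling (a sketch only, not the project's actual fix):

# sketch of defensive JSON parsing for PredictorDeployment.__call__ (not the actual fix)
from json import JSONDecodeError
from typing import Any, Dict, Union

from starlette.requests import Request
from starlette.responses import JSONResponse, StreamingResponse


async def __call__(self, http_request: Request) -> Union[StreamingResponse, JSONResponse, str]:
    try:
        json_request: Dict[str, Any] = await http_request.json()
    except JSONDecodeError:
        return JSONResponse(status_code=400, content={"error": "Request body is not valid JSON."})
    if not isinstance(json_request, dict):
        return JSONResponse(status_code=400, content={"error": "Request body must be a JSON object."})
    ...  # continue with normal handling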
