intel / llm-on-ray

Pretrain, finetune and serve LLMs on Intel platforms with Ray

License: Apache License 2.0

Python 93.21% Dockerfile 0.23% Shell 5.20% Jinja 1.36%

llm-on-ray's Introduction

LLM-on-Ray

Introduction

LLM-on-Ray is a comprehensive solution designed to empower users in building, customizing, and deploying Large Language Models (LLMs). Whether you're starting from scratch with pretraining, looking to finetune an existing model, or aiming to deploy a production-ready LLM endpoint service, this project simplifies these complex processes into manageable steps.

LLM-on-Ray harnesses the power of Ray, an industry-leading framework for distributed computing, to scale your AI workloads efficiently. This integration ensures robust fault tolerance and cluster resource management, making your LLM projects more resilient and scalable.

LLM-on-Ray is built to operate across various hardware setups, including Intel CPU, Intel GPU and Intel Gaudi2. It incorporates several industry and Intel optimizations to maximize performance, including vLLM, llama.cpp, Intel Extension for PyTorch/DeepSpeed, IPEX-LLM, RecDP-LLM, NeuralChat and more.

Solution Technical Overview

LLM-on-Ray's modular workflow structure is designed to comprehensively cater to the various stages of LLM development, from pretraining and finetuning to serving. These workflows are intuitive, highly configurable, and tailored to meet the specific needs of each phase in the LLM lifecycle:

  • Pretraining Workflow: Provides the infrastructure to build LLMs from scratch.

    • Data Preparation: Includes a suite of tools for preparing your training data, facilitating tasks like the removal of Personally Identifiable Information (PII), data deduplication (dedup), and other preprocessing needs, making the data safe and efficient for training.
    • Megatron-DeepSpeed Integration: Leverages the power of Megatron-DeepSpeed to enable advanced capabilities such as pipeline parallelism, tensor parallelism, data parallelism, and Zero Redundancy Optimizer (ZeRO). This integration facilitates efficient and scalable model training from the ground up.
    • Robust Fault Tolerance: Offers automatic fault tolerance powered by Ray. This ensures high availability, reliability, and optimal performance for large scale pretraining.
  • Finetuning Workflow: Supports refinement of pre-trained models with proprietary or specialized data, improving models' accuracy and applicability to various use cases.

    • Ease of Customization: Users can easily configure the base model and resource allocation for the training job, and customize training parameters to fit their specific needs. This can be accomplished through a simple command line or via the Web UI.
    • Parameter Efficient Finetuning: Supports various parameter efficient finetuning methods such as LoRA to accelerate the finetuning process.
    • Reinforcement Learning with Human Feedback (RLHF): Users can further refine the model using RLHF, which leverages proximal policy optimization (PPO).
  • Serving Workflow: Deploys a scalable and production-ready LLM serving endpoint.

    • Easy Deployment of Models: Supports the deployment of both widely-used open-source models and custom finetuned models through flexible configurations.
    • Autoscaling and Scale-to-Zero Capabilities: Ensures high efficiency and cost-effectiveness in model deployment. The workflow can dynamically scale resources to match demand and scale down to zero when the model is not in use, optimizing resource usage and reducing operational costs.
    • Optimized for Performance and Efficiency: LLM-on-Ray incorporates several optimizations to maximize performance. This includes support for various precision levels and the utilization of advanced optimization techniques from Intel, ensuring efficient processing and reduced resource consumption.
    • OpenAI-Like REST API: Provides APIs similar to OpenAI's, making it easier for users to transition to or integrate open-source models into their systems.
  • Interactive Web UI for Enhanced Usability: In addition to the command line, LLM-on-Ray provides a Web UI, allowing users to easily finetune and deploy LLMs through a user-friendly interface. The UI also includes a chatbot application, enabling users to immediately test and refine the models.

Getting Started Locally With Source Code

This guide will assist you in setting up LLM-on-Ray on Intel CPU locally, covering the initial setup, finetuning models, and deploying them for serving.

Setup

1. Clone the repository, install llm-on-ray and its dependencies.

Software requirements: Git and Conda

git clone https://github.com/intel/llm-on-ray.git
cd llm-on-ray
conda create -n llm-on-ray python=3.9
conda activate llm-on-ray
pip install .[cpu] --extra-index-url https://download.pytorch.org/whl/cpu --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/cpu/us/

2. Start Ray

[Optional] If DeepSpeed is enabled or you are doing distributed finetuning, the oneCCL and Intel MPI libraries should be dynamically linked on every node before Ray starts:

source $(python -c "import oneccl_bindings_for_pytorch as torch_ccl; print(torch_ccl.cwd)")/env/setvars.sh

Start Ray locally using the following command. To launch a Ray cluster, please follow the setup document.

ray start --head
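
As an optional sanity check (not part of the documented flow), you can confirm from Python that the local cluster started above is reachable; this sketch only assumes Ray is installed in the active environment:

# optional check that the Ray cluster started above is reachable
import ray

ray.init(address="auto")        # attach to the running cluster instead of starting a new one
print(ray.cluster_resources())  # e.g. number of CPUs and memory available to Ray
ray.shutdown()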

Finetuning

Use the following command to finetune a model using an example dataset and default configurations. The finetuned model will be stored in /tmp/llm-ray/output by default. To customize the base model, dataset and configurations, please see the finetuning document:

llm_on_ray-finetune --config_file llm_on_ray/finetune/finetune.yaml

Serving

Deploy a model on Ray and expose an endpoint for serving. This command uses GPT2 as an example, but more model configuration examples can be found in the inference/models directory:

llm_on_ray-serve --config_file llm_on_ray/inference/models/gpt2.yaml

You can also serve a model directly by its model_id:

llm_on_ray-serve --models gpt2

List all supported model_ids along with their config file paths:

llm_on_ray-serve --list_model_ids

By default, serving provides an OpenAI-compatible API server (see the OpenAI API Reference). You can access and test it in several ways:

# using curl
export ENDPOINT_URL=http://localhost:8000/v1
curl $ENDPOINT_URL/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "gpt2",
    "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Hello!"}],
    "temperature": 0.7
    }'

# using requests library
python examples/inference/api_server_openai/query_http_requests.py

# using OpenAI SDK
pip install openai>=1.0
export OPENAI_BASE_URL=http://localhost:8000/v1
export OPENAI_API_KEY="not_a_real_key"
python examples/inference/api_server_openai/query_openai_sdk.py
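
If you prefer not to use the example scripts, a minimal inline call with the OpenAI SDK (>=1.0) looks roughly like the sketch below, assuming the same OPENAI_BASE_URL and dummy OPENAI_API_KEY exported above:

# minimal OpenAI SDK call against the local endpoint (sketch)
from openai import OpenAI

client = OpenAI()  # reads OPENAI_BASE_URL and OPENAI_API_KEY from the environment
resp = client.chat.completions.create(
    model="gpt2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    temperature=0.7,
)
print(resp.choices[0].message.content)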

Alternatively, you can serve a specific model through a simple endpoint according to the port and route_prefix parameters in the configuration file:

llm_on_ray-serve --config_file llm_on_ray/inference/models/gpt2.yaml --simple

After deploying the model endpoint, you can access and test it by using the script below:

python examples/inference/api_server_simple/query_single.py --model_endpoint http://127.0.0.1:8000/gpt2
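
If you want to query the simple endpoint by hand instead of via the script, it accepts a plain JSON POST; the payload field names below ("text", "config") are assumptions for illustration only, so treat query_single.py as the authoritative reference:

# hand-rolled request to the simple endpoint (sketch; field names are hypothetical)
import requests

payload = {
    "text": "Tell me a joke.",         # hypothetical prompt field
    "config": {"max_new_tokens": 64},  # hypothetical generation-config field
}
resp = requests.post("http://127.0.0.1:8000/gpt2", json=payload, timeout=60)
print(resp.text)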

Getting Started With Docker

This guide will assist you in setting up LLM-on-Ray with Docker.

git clone https://github.com/intel/llm-on-ray.git
cd llm-on-ray

The Dockerfile for users is dev/docker/Dockerfile.user.

1. Source Docker Functions

Detailed Docker parameters can be set in dev/scripts/docker-functions.sh.

source dev/scripts/docker-functions.sh

2. Build Docker Image

By default, the image is built with CPU and DeepSpeed for LLM serving.

build_docker 

Change the build_docker function's arguments for a different environment.

Use vLLM for LLM serving.

build_docker vllm 

Use IPEX-LLM for LLM serving.

build_docker ipex-llm 

3. Start Docker

Change any settings in dev/scripts/docker-functions.sh.

Run Docker serving gpt2 on CPU.

start_docker

Run Docker serving other supported models on CPU.

start_docker llama-2-7b-chat-hf

Run Docker with a different environment and model using start_docker {environment} {models}, for example:

start_docker vllm llama-2-7b-chat-hf

4. Start LLM-on-Ray

The model serving port in the Docker container is mapped to the local host.

Using the requests library:

python examples/inference/api_server_openai/query_http_requests.py

Using the OpenAI SDK:

python examples/inference/api_server_openai/query_openai_sdk.py

Documents

The following are detailed guidelines for pretraining, finetuning and serving LLMs in various computing environments.

Pretraining:

Finetuning:

Serving

Web UI

Disclaimer

To the extent that any public datasets are referenced by Intel or accessed using tools or code on this site those datasets are provided by the third party indicated as the data source. Intel does not create the data, or datasets, and does not warrant their accuracy or quality. By accessing the public dataset(s), or using a model trained on those datasets, you agree to the terms associated with those datasets and that your use complies with the applicable license.

Intel expressly disclaims the accuracy, adequacy, or completeness of any public datasets, and is not liable for any errors, omissions, or defects in the data, or for any reliance on the data. Intel is not liable for any liability or damages relating to your use of public datasets.

llm-on-ray's People

Contributors

carsonwang, deegue, dependabot[bot], faaany, harborn, jiafuzha, kepingyan, kira-lin, minmingzhu, rbrugaro, rdower, susu-noob, thequantumquirk, tianyil1, xuechendi, xwu99, yao531441, yuanwu2017, yutianchen666, zhangjian94cn, zhouyu5

llm-on-ray's Issues

Add ipex extra in pyproject.toml to use restricted transformers version

IPEX restricts the transformers version, but llm-on-ray doesn't. To verify IPEX and other llm-on-ray functions in parallel in CI, we can add a new ipex extra in pyproject.toml with the right transformers version, then add a corresponding nightly CI job in the GitHub workflow.

The brief steps are:

  1. Add an "ipex" extra in pyproject.toml:

    cpu = [
        "transformers>=4.35.0", # some models need higher version of transformers
        "intel_extension_for_pytorch==2.1.0+cpu",
        "torch==2.1.0+cpu",
        "oneccl_bind_pt==2.1.0+cpu"
    ]
    +ipex = [
    +    "transformers==4.31.0", # to make ipex fully functional; choose the right version based on the ipex version below
    +    "intel_extension_for_pytorch==2.1.0+cpu",
    +    "torch==2.1.0+cpu",
    +    "oneccl_bind_pt==2.1.0+cpu"
    +]

  2. Add a separate Dockerfile so that it can be cached properly in the CI build:
    Copy one of the Dockerfiles under dev/docker and rename it to Dockerfile.ipex. After that, replace 'pip install ...' with 'pip install .[ipex]'.

  3. Add nightly CI:
    Copy workflow-inference.yaml, rename it to workflow-inference-ipex.yaml, and call workflow-inference-ipex.yaml from workflow_orders_nightly.yaml:

    call-inference:
      uses: ./.github/workflows/workflow_inference.yml
      with:
        ci_type: nightly

    +call-inference-ipex:
    +  uses: ./.github/workflows/workflow_inference-ipex.yml

Inside workflow-inference-ipex.yaml, add inference tests for ipex supported models.

[Lint] Add file license header check

Add a check in the lint tests that any source code (*.py, *.sh, etc.) added to the repo has the following header.

#
# Copyright 2023 The LLM-on-Ray Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
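
A minimal sketch of what such a lint step could look like (this script and its choice of file patterns are assumptions, not an existing part of the repo):

# check_license_header.py -- verify the header above is present in source files (sketch)
import sys
from pathlib import Path

REQUIRED = "Copyright 2023 The LLM-on-Ray Authors."

def main() -> int:
    missing = []
    for pattern in ("*.py", "*.sh"):
        for path in Path(".").rglob(pattern):
            head = path.read_text(errors="ignore")[:2048]  # header should appear near the top
            if REQUIRED not in head:
                missing.append(str(path))
    if missing:
        print("Files missing the license header:")
        print("\n".join(missing))
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())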

Getting error while executing query_openai_sdk.py to test the inference

I ran inference of the Falcon-7b and neural-chat-7b-v3-1 models on the Ray server with the commands below:
python inference/serve.py --config_file inference/models/neural-chat-7b-v3-1.yaml --simple
python inference/serve.py --config_file inference/models/falcon-7b.yaml --simple
I could run a test inference with python examples/inference/api_server_simple/query_single.py --model_endpoint http://172.17.0.2:8000/neural-chat-7b-v3-1
I then exported:
export OPENAI_API_BASE=http://172.17.0.2:8000/falcon-7b
export OPENAI_API_KEY=
and tried to run python examples/inference/api_server_openai/query_openai_sdk.py, but I am getting the error below:

File "/root/llm-ray/examples/inference/api_server_openai/query_openai_sdk.py", line 45, in
models = openai.Model.list()
File "/usr/local/lib/python3.10/dist-packages/openai/api_resources/abstract/listable_api_resource.py", line 60, in list
response, _, api_key = requestor.request(
File "/usr/local/lib/python3.10/dist-packages/openai/api_requestor.py", line 298, in request
resp, got_stream = self._interpret_response(result, stream)
File "/usr/local/lib/python3.10/dist-packages/openai/api_requestor.py", line 700, in _interpret_response
self._interpret_response_line(
File "/usr/local/lib/python3.10/dist-packages/openai/api_requestor.py", line 757, in _interpret_response_line
raise error.APIError(
openai.error.APIError: HTTP code 500 from API (Unexpected error, traceback: ray::ServeReplica:falcon-7b:PredictorDeployment.handle_request_streaming() (pid=15684, ip=172.17.0.2)
File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/utils.py", line 165, in wrap_to_ray_error
raise exception
File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/replica.py", line 994, in call_user_method
await self._call_func_or_gen(
File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/replica.py", line 750, in _call_func_or_gen
result = await result
File "/root/llm-ray/inference/predictor_deployment.py", line 84, in call
json_request: Dict[str, Any] = await http_request.json()
File "/usr/local/lib/python3.10/dist-packages/starlette/requests.py", line 244, in json
self._json = json.loads(body)
File "/usr/lib/python3.10/json/init.py", line 346, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.10/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.10/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0).)

I installed openai version 0.28.0. Please let me know what the issue could be. Am I missing any installations?

Getting error while running start_ui.py

I followed the steps mentioned in the setup file to build the docker image for Gaudi
(https://github.com/intel/llm-on-ray/blob/main/docs/setup.md). I could run the Ray server using the command below:
ray start --head --node-ip-address 127.0.0.1 --dashboard-host='0.0.0.0' --dashboard-port=8265
I ran python -u ui/start_ui.py --master_ip_port "$node_ip:6379" since I could not figure out where to get node_user_name and conda_env_name for the command below:
python -u ui/start_ui.py --node_user_name $user --conda_env_name $conda_env --master_ip_port "$node_ip:6379"
I got many missing installation errors and installed the missing packages. After that I am getting the error below. Can you please let me know what I might be missing, and where to get node_user_name and conda_env_name from?

Error while connecting to Ray UI
Traceback (most recent call last):
File "/root/llm-ray/ui/start_ui.py", line 26, in
from inference.predictor_deployment import PredictorDeployment
File "/root/llm-ray/ui/../inference/predictor_deployment.py", line 21, in
from ray import serve
File "/usr/local/lib/python3.10/dist-packages/ray/serve/init.py", line 4, in
from ray.serve.api import (
File "/usr/local/lib/python3.10/dist-packages/ray/serve/api.py", line 15, in
from ray.serve.built_application import BuiltApplication
File "/usr/local/lib/python3.10/dist-packages/ray/serve/built_application.py", line 7, in
from ray.serve.deployment import Deployment
File "/usr/local/lib/python3.10/dist-packages/ray/serve/deployment.py", line 22, in
from ray.serve.context import _get_global_client
File "/usr/local/lib/python3.10/dist-packages/ray/serve/context.py", line 12, in
File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/client.py", line 28, in
from ray.serve._private.deploy_utils import get_deploy_args
File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/deploy_utils.py", line 8, in
from ray.serve.schema import ServeApplicationSchema
File "/usr/local/lib/python3.10/dist-packages/ray/serve/schema.py", line 141, in
class DeploymentSchema(BaseModel, allow_population_by_field_name=True):
File "/usr/local/lib/python3.10/dist-packages/ray/serve/schema.py", line 269, in DeploymentSchema
def num_replicas_and_autoscaling_config_mutually_exclusive(cls, values):
File "/usr/local/lib/python3.10/dist-packages/pydantic/deprecated/class_validators.py", line 231, in root_validator
return root_validator()(*__args) # type: ignore
File "/usr/local/lib/python3.10/dist-packages/pydantic/deprecated/class_validators.py", line 237, in root_validator
raise PydanticUserError(
pydantic.errors.PydanticUserError: If you use @root_validator with pre=False (the default) you MUST specify skip_on_failure=True. Note that @root_validator is deprecated and should be replaced with @model_validator.

Not able to run inference server for mistral 7b model, mpt-7b model on Ray

I built the Ray image as per https://github.com/intel/llm-on-ray/blob/main/docs/setup.md and could log in to the docker image and run the Ray server with ray start --head --node-ip-address 127.0.0.1 --dashboard-host='0.0.0.0' --dashboard-port=8265
To run inference with the mistral-7b-v0.1 model, I ran the command below:
python inference/serve.py --config_file inference/models/mistral-7b-v0.1.yaml --simple
First I got an installation error for intel_extension_for_pytorch; after installing that I am getting the error below.

After installing intel_extension_for_pytorch, I get the error below...
(ServeController pid=9051) await self._user_callable_wrapper.initialize_callable()
(ServeController pid=9051) File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/replica.py", line 778, in initialize_callable
(ServeController pid=9051) await self._call_func_or_gen(
(ServeController pid=9051) File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/replica.py", line 748, in _call_func_or_gen
(ServeController pid=9051) result = callable(*args, **kwargs)
(ServeController pid=9051) File "/root/llm-ray/inference/predictor_deployment.py", line 64, in init
(ServeController pid=9051) self.predictor = TransformerPredictor(infer_conf)
(ServeController pid=9051) File "/root/llm-ray/inference/transformer_predictor.py", line 79, in init
(ServeController pid=9051) import intel_extension_for_pytorch as ipex
(ServeController pid=9051) File "/usr/local/lib/python3.10/dist-packages/intel_extension_for_pytorch/init.py", line 94, in
(ServeController pid=9051) from . import cpu
(ServeController pid=9051) File "/usr/local/lib/python3.10/dist-packages/intel_extension_for_pytorch/cpu/init.py", line 1, in
(ServeController pid=9051) from . import runtime
(ServeController pid=9051) File "/usr/local/lib/python3.10/dist-packages/intel_extension_for_pytorch/cpu/runtime/init.py", line 3, in
(ServeController pid=9051) from .multi_stream import (
(ServeController pid=9051) File "/usr/local/lib/python3.10/dist-packages/intel_extension_for_pytorch/cpu/runtime/multi_stream.py", line 4, in
(ServeController pid=9051) import intel_extension_for_pytorch._C as core
(ServeController pid=9051) ImportError: /usr/local/lib/python3.10/dist-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-cpu.so: undefined symbol: _ZNK5torch8autograd4Node4nameEv

I also tried to run inference with the mpt model using python inference/serve.py --config_file inference/models/mpt-7b.yaml --simple
and I am getting the error below:

(ServeController pid=9051) File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
(ServeController pid=9051) return self.__get_result()
(ServeController pid=9051) File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
(ServeController pid=9051) raise self._exception
(ServeController pid=9051) File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/replica.py", line 570, in initialize_and_get_metadata
(ServeController pid=9051) raise RuntimeError(traceback.format_exc()) from None
(ServeController pid=9051) RuntimeError: Traceback (most recent call last):
(ServeController pid=9051) File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/replica.py", line 554, in initialize_and_get_metadata
(ServeController pid=9051) await self._user_callable_wrapper.initialize_callable()
(ServeController pid=9051) File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/replica.py", line 778, in initialize_callable
(ServeController pid=9051) await self._call_func_or_gen(
(ServeController pid=9051) File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/replica.py", line 748, in _call_func_or_gen
(ServeController pid=9051) result = callable(*args, **kwargs)
(ServeController pid=9051) File "/root/llm-ray/inference/predictor_deployment.py", line 64, in init
(ServeController pid=9051) self.predictor = TransformerPredictor(infer_conf)
(ServeController pid=9051) File "/root/llm-ray/inference/transformer_predictor.py", line 22, in init
(ServeController pid=9051) from optimum.habana.transformers.modeling_utils import (
(ServeController pid=9051) File "/root/optimum-habana/optimum/habana/transformers/modeling_utils.py", line 19, in
(ServeController pid=9051) from .models import (
(ServeController pid=9051) File "/root/optimum-habana/optimum/habana/transformers/models/init.py", line 59, in
(ServeController pid=9051) from .mpt import (
(ServeController pid=9051) File "/root/optimum-habana/optimum/habana/transformers/models/mpt/init.py", line 1, in
(ServeController pid=9051) from .modeling_mpt import (
(ServeController pid=9051) File "/root/optimum-habana/optimum/habana/transformers/models/mpt/modeling_mpt.py", line 24, in
(ServeController pid=9051) from transformers.models.mpt.modeling_mpt import MptForCausalLM, MptModel, _expand_mask, _make_causal_mask
(ServeController pid=9051) ImportError: cannot import name '_expand_mask' from 'transformers.models.mpt.modeling_mpt' (/usr/local/lib/python3.10/dist-packages/transformers/models/mpt/modeling_mpt.py)
(ServeController pid=9051) WARNING 2024-01-18 09:37:50,769 controller 9051 application_state.py:726 - The deployments ['PredictorDeployment'] are UNHEALTHY.
Traceback (most recent call last):
File "/root/llm-ray/inference/serve.py", line 170, in
main(sys.argv[1:])
File "/root/llm-ray/inference/serve.py", line 160, in main
openai_serve_run(deployments, host, route_prefix, args.port)
File "/root/llm-ray/inference/api_server_openai.py", line 75, in openai_serve_run
serve.run(
File "/usr/local/lib/python3.10/dist-packages/ray/serve/api.py", line 543, in run
client.deploy_application(
File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/client.py", line 50, in check
return f(self, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/client.py", line 321, in deploy_application
self._wait_for_application_running(name)

Can anyone help me out if I am missing anything, or if a specific version of a library is needed?

Optimize package requirements

To minimize requirements, the baseline packages are:

ray[serve, tune]
torch
intel_extension_for_pytorch
oneccl_bind_pt
transformers

If the dependency chain can't meet our requirements, we will add additional packages.

fix import from inference package

I am not sure why finetune and inference are installed as packages in the conda env. Depending on whether inference exists in the current directory and where the script is run from, either the installed package or the local package will be used on import, which is confusing:

from inference.inference_config import InferenceConfig, DEVICE_CPU

[UI] gradio package conflict

Gradio is bumped from 3.36 to 4.11 by #13. If we want to adapt to gradio 4.11, the code needs a lot of changes, and the pydantic version that gradio depends on conflicts with the one deepspeed depends on. We need a solution that allows Dependabot to ignore the gradio version, and then limit gradio<=3.36.

# gradio 4.11 needs pydantic>=2.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gradio 4.11.0 requires pydantic>=2.0, but you have pydantic 1.10.0 which is incompatible.
# deepspeed 0.11 needs pydantic<2.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
deepspeed 0.11.1 requires pydantic<2.0.0, but you have pydantic 2.5.3 which is incompatible.

Installation command not working

Hello! I tried to run your repo, but the first command is not working:

pip install .[cpu] -f https://developer.intel.com/ipex-whl-stable-cpu -f https://download.pytorch.org/whl/torch_stable.html

gives this error:

ERROR: Ignored the following yanked versions: 1.11.100, 1.12.200
ERROR: Ignored the following versions that require a different python version: 3.10.0.0 Requires-Python >=2.7, !=3.0., !=3.1., !=3.2., !=3.3., <3.5; 3.7.4.2 Requires-Python >=2.7, !=3.0., !=3.1., !=3.2., !=3.3., <3.5
ERROR: Could not find a version that satisfies the requirement intel-extension-for-pytorch==2.1.0+cpu; extra == "cpu" (from llm-on-ray[cpu]) (from versions: 1.10.100, 1.11.0, 1.11.200, 1.12.0, 1.12.100, 1.12.300, 1.13.0, 1.13.100, 2.0.0, 2.0.100, 2.1.0, 2.1.100, 2.2.0)
ERROR: No matching distribution found for intel-extension-for-pytorch==2.1.0+cpu; extra == "cpu"

[Serving] Example to chat from command line

Currently the examples just send requests to the serving server via HTTP request, the OpenAI SDK, etc. We can only demo chat from the web UI. It would be useful to support chat from the command line too.
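
A rough sketch of what such a command-line chat could look like on top of the existing OpenAI-compatible endpoint (the endpoint URL and model id below are assumptions):

# chat_cli.py -- command-line chat loop against the OpenAI-compatible endpoint (sketch)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not_a_real_key")  # assumed local endpoint
history = [{"role": "system", "content": "You are a helpful assistant."}]

while True:
    user_input = input("You: ").strip()
    if user_input in ("exit", "quit"):
        break
    history.append({"role": "user", "content": user_input})
    resp = client.chat.completions.create(model="gpt2", messages=history)  # model id is an assumption
    answer = resp.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    print("Assistant:", answer)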

[REST API] Support more parameters

Support more request parameters, including suffix, n>1, stop, logprobs, logit_bias, echo, presence_penalty, frequency_penalty. Currently only parameters included in transformers.generate are supported.

[Install] Align installation with ipex

@jiafuzha @KepingYan
https://github.com/intel/intel-extension-for-pytorch?tab=readme-ov-file#cpu-version

python -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
python -m pip install intel-extension-for-pytorch --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/cpu/us/
# for PRC user, you can check with the following link
python -m pip install intel-extension-for-pytorch --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/cpu/cn/

Support functions/tools in OpenAI API

Support functions/tools in the API to enable more use cases.
Refer to the OpenAI document below:
https://platform.openai.com/docs/api-reference/chat/create#chat-create-tools

tools:
A list of tools the model may call. Currently, only functions are supported as a tool. Use this to provide a list of functions the model may generate JSON inputs for.

tool_choice:
Controls which (if any) function is called by the model. none means the model will not call a function and instead generates a message. auto means the model can pick between generating a message or calling a function. Specifying a particular function via {"type": "function", "function": {"name": "my_function"}} forces the model to call that function.

none is the default when no functions are present. auto is the default if functions are present.

curl https://api.openai.com/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-d '{
  "model": "gpt-3.5-turbo",
  "messages": [
    {
      "role": "user",
      "content": "What is the weather like in Boston?"
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The city and state, e.g. San Francisco, CA"
            },
            "unit": {
              "type": "string",
              "enum": ["celsius", "fahrenheit"]
            }
          },
          "required": ["location"]
        }
      }
    }
  ],
  "tool_choice": "auto"
}'
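
For comparison, the same tool-calling request expressed with the OpenAI Python SDK (>=1.0) would look roughly like the sketch below; whether llm-on-ray's server accepts the tools and tool_choice fields is exactly what this issue is asking for:

# tools_request.py -- the curl request above expressed with the OpenAI Python SDK (sketch)
from openai import OpenAI

client = OpenAI()  # for llm-on-ray this would point at the local endpoint instead of api.openai.com
resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "What is the weather like in Boston?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "The city and state, e.g. San Francisco, CA"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }],
    tool_choice="auto",
)
print(resp.choices[0].message.tool_calls)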

serving mpt-7b-bigdl.yaml crashes when installed with pip install -e .[bigdl-cpu]

First create a new conda env with python 3.9, and install with:

pip install -e .[bigdl-cpu] -f https://developer.intel.com/ipex-whl-stable-cpu -f https://download.pytorch.org/whl/torch_stable.html

Then run:

python inference/serve.py --config_file inference/models/bigdl/mpt-7b-bigdl.yaml --serve_simple

It looks like something got messed up after installing bigdl-llm:

(screenshot of the error attached in the original issue)

After I switched to another env, it's OK.

@KepingYan @jiafuzha

Need to check if the request is valid JSON

inference/predictor_deployment.py:

async def __call__(self, http_request: Request) -> Union[StreamingResponse, str]:
        json_request: Dict[str, Any] = await http_request.json()

The type hint here doesn't guarantee that the body is actually a dict at runtime.
Exception handling should be added here for invalid requests.
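
One possible shape for that handling (a sketch only, not the project's actual fix):

# sketch of defensive JSON parsing for PredictorDeployment.__call__ (not the actual fix)
from json import JSONDecodeError
from typing import Any, Dict, Union

from starlette.requests import Request
from starlette.responses import JSONResponse, StreamingResponse


async def __call__(self, http_request: Request) -> Union[StreamingResponse, JSONResponse, str]:
    try:
        json_request: Dict[str, Any] = await http_request.json()
    except JSONDecodeError:
        return JSONResponse(status_code=400, content={"error": "Request body is not valid JSON."})
    if not isinstance(json_request, dict):
        return JSONResponse(status_code=400, content={"error": "Request body must be a JSON object."})
    ...  # continue with normal handling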
