Comments (9)

xwu99 commented on September 14, 2024

@yutianchen666 Could you help reproduce the issue? I am not sure if it is the OpenAI version causing the API break.

dkiran1 commented on September 14, 2024

I used openai==0.28, since the latest version gave an error and recommended using this version.

yutianchen666 commented on September 14, 2024

@yutianchen666 Could you help reproduce the issue? I am not sure if it is the OpenAI version causing the API break.

ok, I'll reproduce it soon

KepingYan commented on September 14, 2024

@dkiran1 Thank you for your report. If you want to use the OpenAI-compatible SDK, please remove the --simple parameter. After serving, set ENDPOINT_URL=http://localhost:8000/v1 when running query_http_requests.py, or set OPENAI_API_BASE=http://localhost:8000/v1 when running query_openai_sdk.py. See serve.md for more details.
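For reference, a minimal sketch of the SDK path described above (assuming openai==0.28 as used elsewhere in this thread, and a served model named "falcon-7b"; the key value is a placeholder, on the assumption that the local endpoint does not validate it):

import os
import openai

# Point the 0.28-era SDK at the local llm-on-ray endpoint instead of api.openai.com.
openai.api_base = os.environ.get("OPENAI_API_BASE", "http://localhost:8000/v1")
openai.api_key = "placeholder"  # assumed unused by the local server

response = openai.ChatCompletion.create(
    model="falcon-7b",  # must match the name the model was served under
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response["choices"][0]["message"]["content"])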

dkiran1 commented on September 14, 2024

Hi Yan, thanks for the details. I tried the above-mentioned steps and could run the inference server with the falcon model, but on running

python examples/inference/api_server_openai/query_openai_sdk.py --model_name="falcon-7b"

it waits a long time for a response and none arrives. I also tried the neural-chat model; it was working yesterday, but after upgrading the transformers library it gives this error:

(ServeController pid=11891) ERROR 2024-01-19 05:35:26,615 controller 11891 deployment_state.py:672 - Exception in replica 'neural-chat-7b-v3-1#PredictorDeployment#3jmxrf36', the replica will be stopped.
(ServeController pid=11891) Traceback (most recent call last):
(ServeController pid=11891) File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/deployment_state.py", line 670, in check_ready
(ServeController pid=11891) _, self._version = ray.get(self._ready_obj_ref)
(ServeController pid=11891) File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
(ServeController pid=11891) return fn(*args, **kwargs)
(ServeController pid=11891) File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
(ServeController pid=11891) return func(*args, **kwargs)
(ServeController pid=11891) File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2656, in get
(ServeController pid=11891) values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
(ServeController pid=11891) File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 869, in get_objects
(ServeController pid=11891) raise value.as_instanceof_cause()
(ServeController pid=11891) ray.exceptions.RayTaskError(RuntimeError): ray::ServeReplica:neural-chat-7b-v3-1:PredictorDeployment.initialize_and_get_metadata() (pid=18013, ip=172.17.0.2, actor_id=685216a503325bcc4e3c3c7701000000, repr=<ray.serve._private.replica.ServeReplica:neural-chat-7b-v3-1:PredictorDeployment object at 0x7fabd93efd00>)
(ServeController pid=11891) File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
(ServeController pid=11891) return self.__get_result()
(ServeController pid=11891) File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
(ServeController pid=11891) raise self._exception
(ServeController pid=11891) File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/replica.py", line 570, in initialize_and_get_metadata
(ServeController pid=11891) raise RuntimeError(traceback.format_exc()) from None
(ServeController pid=11891) RuntimeError: Traceback (most recent call last):
(ServeController pid=11891) File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/replica.py", line 554, in initialize_and_get_metadata
(ServeController pid=11891) await self._user_callable_wrapper.initialize_callable()
(ServeController pid=11891) File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/replica.py", line 778, in initialize_callable
(ServeController pid=11891) await self._call_func_or_gen(
(ServeController pid=11891) result = callable(*args, **kwargs)
(ServeController pid=11891) File "/root/llm-ray/inference/predictor_deployment.py", line 64, in __init__
(ServeController pid=11891) self.predictor = TransformerPredictor(infer_conf)
(ServeController pid=11891) File "/root/llm-ray/inference/transformer_predictor.py", line 22, in __init__
(ServeController pid=11891) from optimum.habana.transformers.modeling_utils import (
(ServeController pid=11891) File "/root/optimum-habana/optimum/habana/transformers/modeling_utils.py", line 19, in <module>
(ServeController pid=11891) from .models import (
(ServeController pid=11891) File "/root/optimum-habana/optimum/habana/transformers/models/__init__.py", line 59, in <module>
(ServeController pid=11891) from .mpt import (
(ServeController pid=11891) File "/root/optimum-habana/optimum/habana/transformers/models/mpt/__init__.py", line 1, in <module>
(ServeController pid=11891) from .modeling_mpt import (
(ServeController pid=11891) File "/root/optimum-habana/optimum/habana/transformers/models/mpt/modeling_mpt.py", line 24, in <module>
(ServeController pid=11891) from transformers.models.mpt.modeling_mpt import MptForCausalLM, MptModel, _expand_mask, _make_causal_mask
(ServeController pid=11891) ImportError: cannot import name '_expand_mask' from 'transformers.models.mpt.modeling_mpt' (/usr/local/lib/python3.10/dist-packages/transformers/models/mpt/modeling_mpt.py)
(ServeController pid=11891) INFO 2024-01-19 05:35:27,338 controller 11891 deployment_state.py:2188 - Replica neural-chat-7b-v3-1#PredictorDeployment#3jmxrf36 is stopped.
(ServeController pid=11891) INFO 2024-01-19 05:35:27,339 controller 11891 deployment_state.py:1850 - Adding 1 replica to deployment PredictorDeployment in application 'neural-chat-7b-v3-1'.
(ServeReplica:router:PredictorDeployment pid=18206) /usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
(ServeReplica:router:PredictorDeployment pid=18206) warnings.warn(
(ServeReplica:neural-chat-7b-v3-1:PredictorDeployment pid=18013) [WARNING|utils.py:190] 2024-01-19 05:35:26,443 >> optimum-habana v1.8.0.dev0 has been validated for SynapseAI v1.11.0 but the driver version is v1.13.0, this could lead to undefined behavior!

kira-lin commented on September 14, 2024

Hi @dkiran1, we currently have limited bandwidth and hardware to test on Gaudi, and the Gaudi-related part is not up to date. I just tested in Docker, in the vault.habana.ai/gaudi-docker/1.13.0/ubuntu22.04/habanalabs/pytorch-installer-2.1.0 container; you only need to:

# install llm-on-ray, assume mounted
pip install -e .
# install latest optimum[habana]
pip install optimum[habana]

Make sure the transformers version is 4.34.1, which is what optimum[habana] requires; your newer transformers is what caused the error. In addition, inference on Gaudi does not require IPEX.
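A quick sanity check after reinstalling (a sketch; adapt_transformers_to_gaudi is an optimum-habana entry point used here as a stand-in, since the exact symbols transformer_predictor.py imports are not shown above):

import transformers

# optimum[habana] pins transformers to 4.34.1; the ImportError above comes
# from a newer transformers that removed the private _expand_mask helper.
print(transformers.__version__)  # expect 4.34.1

# The import chain that failed in the traceback should now resolve:
from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi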

dkiran1 commented on September 14, 2024

Hi Lin, thanks a lot. After doing pip install optimum[habana], the neural-chat model works fine with query_openai_sdk.py. I will test the other models and post the status.

dkiran1 commented on September 14, 2024

I tested the falcon-7b, mpt-7b, mistral-7b, and neural-chat models. I could run the inference server for all of them, and I get responses for neural-chat and mistral-7b with query_openai_sdk.py, but it keeps waiting for a response with the mpt-7b and falcon models.

kira-lin commented on September 14, 2024

Hi @dkiran1 ,
When you use OpenAI serving, try adding the --max_new_tokens config. It seems optimum-habana requires this config. I'll look into why and how to fix this later.
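From the client side, the corresponding knob in the openai==0.28 SDK is max_tokens; a hedged sketch (that the server forwards this as max_new_tokens to optimum-habana is an assumption based on the comment above):

import openai

openai.api_base = "http://localhost:8000/v1"
openai.api_key = "placeholder"  # assumed unused by the local server

response = openai.ChatCompletion.create(
    model="mpt-7b",  # one of the models that hung without an explicit limit
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=128,  # bounds the generation length
)
print(response["choices"][0]["message"]["content"])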
