opencsgs / llm-inference Goto Github PK

llm-inference is a platform for publishing and managing llm inference, providing a wide range of out-of-the-box features for model deployment, such as UI, RESTful API, auto-scaling, computing resource management, monitoring, and more.

License: Apache License 2.0

Shell 0.28% Python 97.42% Dockerfile 0.17% JavaScript 0.58% Jupyter Notebook 1.54%

deepspeed llama-cpp llm-inference ray transformer vllm

llm-inference's People

Contributors

Stargazers

Watchers

Forkers

jasonhe258 pulltheflower depenglee1707 zhenrong-wang seanhh86 iqiuyu-0821 wanggxa

llm-inference's Issues

Auto load models from ./models for when api server start

avoid to ping huggingface when start serving to speed up the deployement

support "revision" in yaml defination

Upgrade ray 2.20.0

New feature needs for deploy ray on kubernetes.

Model streaming API enhancement

Stream request load-balance for multi-workers for predictor.
More parameters for model.generate api
Stream support for class DefaultTransformersPipeline
RouterDeployment api support format "/{model}/run/predict"
Model id mapping for api server, like mapping facebook/opt-125m to facebook--opt-125m

Upgrade ray to 2.9.3

Upgrade ray from 2.8.0 to 2.9.3

Failed to load qwen1_5-72b-chat-q5_k_m.gguf

(ServeController pid=9277) Traceback (most recent call last):
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/serve/_private/deployment_state.py", line 656, in check_ready
(ServeController pid=9277)     _, self._version = ray.get(self._ready_obj_ref)
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
(ServeController pid=9277)     return fn(*args, **kwargs)
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
(ServeController pid=9277)     return func(*args, **kwargs)
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/_private/worker.py", line 2624, in get
(ServeController pid=9277)     raise value.as_instanceof_cause()
(ServeController pid=9277) ray.exceptions.RayTaskError(RuntimeError): ray::5-72B-Chat-GGUF.initialize_and_get_metadata() (pid=9483, ip=172.17.0.3, actor_id=b5fcde3ad8e5c6c8e719d32404000000, repr=<ray.serve._private.replica.ServeReplica:Qwen--Qwen1.5-72B-Chat-GGUF:Qwen--Qwen1.5-72B-Chat-GGUF object at 0x7fa4048274c0>)
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/concurrent/futures/_base.py", line 451, in result
(ServeController pid=9277)     return self.__get_result()
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
(ServeController pid=9277)     raise self._exception
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 455, in initialize_and_get_metadata
(ServeController pid=9277)     raise RuntimeError(traceback.format_exc()) from None
(ServeController pid=9277) RuntimeError: Traceback (most recent call last):
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 445, in initialize_and_get_metadata
(ServeController pid=9277)     await self.replica.update_user_config(
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 724, in update_user_config
(ServeController pid=9277)     await reconfigure_method(user_config)
(ServeController pid=9277)   File "/data/llm-inference/llmserve/backend/server/app.py", line 151, in reconfigure
(ServeController pid=9277)     await self.rollover(
(ServeController pid=9277)   File "/data/llm-inference/llmserve/backend/llm/predictor.py", line 64, in rollover
(ServeController pid=9277)     self.new_worker_group = await self._create_worker_group(
(ServeController pid=9277)   File "/data/llm-inference/llmserve/backend/llm/predictor.py", line 154, in _create_worker_group
(ServeController pid=9277)     engine = await self.engine.launch_engine(scaling_config, self.pg, scaling_options)
(ServeController pid=9277)   File "/data/llm-inference/llmserve/backend/llm/engines/generic.py", line 333, in launch_engine
(ServeController pid=9277)     await asyncio.gather(
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/asyncio/tasks.py", line 650, in _wrap_awaitable
(ServeController pid=9277)     return (yield from awaitable.__await__())
(ServeController pid=9277) ray.exceptions.RayTaskError(ValueError): ray::PredictionWorker.init_model() (pid=9703, ip=172.17.0.3, actor_id=5691b4ad8e1d62a67ddc668004000000, repr=PredictionWorker:Qwen/Qwen1.5-72B-Chat-GGUF)
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/concurrent/futures/_base.py", line 451, in result
(ServeController pid=9277)     return self.__get_result()
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
(ServeController pid=9277)     raise self._exception
(ServeController pid=9277)   File "/data/llm-inference/llmserve/backend/llm/engines/generic.py", line 217, in init_model
(ServeController pid=9277)     self.generator = init_model(
(ServeController pid=9277)   File "/data/llm-inference/llmserve/backend/llm/utils.py", line 159, in inner
(ServeController pid=9277)     ret = func(*args, **kwargs)
(ServeController pid=9277)   File "/data/llm-inference/llmserve/backend/llm/engines/generic.py", line 133, in init_model
(ServeController pid=9277)     resp_batch = generate(
(ServeController pid=9277)   File "/data/llm-inference/llmserve/backend/llm/utils.py", line 159, in inner
(ServeController pid=9277)     ret = func(*args, **kwargs)
(ServeController pid=9277)   File "/data/llm-inference/llmserve/backend/llm/engines/generic.py", line 168, in generate
(ServeController pid=9277)     outputs = pipeline(
(ServeController pid=9277)   File "/data/llm-inference/llmserve/backend/llm/pipelines/llamacpp/llamacpp_pipeline.py", line 141, in __call__
(ServeController pid=9277)     output = self.model(input, **kwargs)
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/llama_cpp/llama.py", line 1547, in __call__
(ServeController pid=9277)     return self.create_completion(
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/llama_cpp/llama.py", line 1480, in create_completion
(ServeController pid=9277)     completion: Completion = next(completion_or_chunks)  # type: ignore
(ServeController pid=9277)   File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/llama_cpp/llama.py", line 959, in _create_completion
(ServeController pid=9277)     raise ValueError(
(ServeController pid=9277) ValueError: Requested tokens (818) exceed context window of 512
(ServeController pid=9277) INFO 2024-04-05 11:16:41,444 controller 9277 deployment_state.py:2185 - Replica Qwen--Qwen1.5-72B-Chat-GGUF#Qwen--Qwen1.5-72B-Chat-GGUF#ZgAOMG is stopped.
(ServeController pid=9277) INFO 2024-04-05 11:16:41,445 controller 9277 deployment_state.py:1831 - Adding 1 replica to deployment Qwen--Qwen1.5-72B-Chat-GGUF in application 'Qwen--Qwen1.5-72B-Chat-GGUF'.

Error happen when do inference for wukong dtype=bfloat16 of use default transformer pipeline load model

1:job_id:04000000
:actor_name:ServeReplica:default:opencsg--csg-wukong-1B
[INFO 2024-04-30 03:46:04,636] __init__.py: 14  Import vllm related stuff failed, please make sure 'vllm' is installed.
INFO 2024-04-30 03:46:04,723 default_opencsg--csg-wukong-1B IULCpr app.py:95 - LLM Deployment initialize
[INFO 2024-04-30 03:46:04,723] predictor.py: 27  LLM Predictor Initialize
INFO 2024-04-30 03:46:04,724 default_opencsg--csg-wukong-1B IULCpr app.py:145 - LLM Deployment Reconfiguring...
INFO 2024-04-30 03:46:04,724 default_opencsg--csg-wukong-1B IULCpr app.py:103 - LLM Deployment _should_reinit_worker_group
[INFO 2024-04-30 03:46:04,724] predictor.py: 48  Initializing new worker group ScalingConfig(trainer_resources={'CPU': 0}, num_workers=1, use_gpu=True, resources_per_worker={'CPU': 1.0, 'GPU': 1.0})
[INFO 2024-04-30 03:46:04,724] predictor.py: 59  Engine name is generic
[INFO 2024-04-30 03:46:04,724] predictor.py: 83  LLM Predictor creating a new worker group
[INFO 2024-04-30 03:46:04,818] predictor.py: 100  Build Prediction Worker with runtime_env:
[INFO 2024-04-30 03:46:04,819] predictor.py: 101  None
[INFO 2024-04-30 03:46:04,819] predictor.py: 109  Waiting for placement group to be ready...
[INFO 2024-04-30 03:46:04,887] predictor.py: 113  Starting initialize_node tasks...
[INFO 2024-04-30 03:46:06,970] predictor.py: 124  get version: [None]
[INFO 2024-04-30 03:46:06,970] generic.py: 351  Creating prediction workers...
[INFO 2024-04-30 03:46:06,975] generic.py: 358  Initializing torch_dist process group on workers...
[INFO 2024-04-30 03:46:09,210] generic.py: 368  Initializing model on workers with local_ranks: [0]
[INFO 2024-04-30 03:46:10,294] predictor.py: 68  Rolling over to new worker group [Actor(PredictionWorker, efd48e82c51a27d83f8078f604000000)]
INFO 2024-04-30 03:46:10,377 default_opencsg--csg-wukong-1B IULCpr app.py:236 - new_max_batch_size is 1
INFO 2024-04-30 03:46:10,377 default_opencsg--csg-wukong-1B IULCpr app.py:237 - new_batch_wait_timeout_s is 0
INFO 2024-04-30 03:46:10,377 default_opencsg--csg-wukong-1B IULCpr app.py:162 - LLM Deployment Reconfigured.
/home/yons/llm-inference/llmserve/backend/llm/predictor.py:212: RuntimeWarning: coroutine 'GenericEngine.check_health' was never awaited
  self.engine.check_health()
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
INFO 2024-04-30 03:47:11,008 default_opencsg--csg-wukong-1B IULCpr 0dd6808d-68b3-42fe-ac35-8f4ce6fb6d21 /api/v1/default/opencsg--csg-wukong-1B/run/predict app.py:210 - batch_generate_text prompts: [Prompt(prompt='What can I do', use_prompt_format=False)] 
INFO 2024-04-30 03:47:11,008 default_opencsg--csg-wukong-1B IULCpr 0dd6808d-68b3-42fe-ac35-8f4ce6fb6d21 /api/v1/default/opencsg--csg-wukong-1B/run/predict app.py:273 - Received 1 prompts [Prompt(prompt='What can I do', use_prompt_format=False)]. start_timestamp None timeout_s 100
[INFO 2024-04-30 03:47:11,008] generic.py: 416  LLM GenericEngine do async predict
ERROR 2024-04-30 03:47:11,135 default_opencsg--csg-wukong-1B IULCpr 0dd6808d-68b3-42fe-ac35-8f4ce6fb6d21 /api/v1/default/opencsg--csg-wukong-1B/run/predict replica.py:756 - Request failed due to RayTaskError:
Traceback (most recent call last):
  File "/home/yons/.conda/envs/abc/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 753, in wrap_user_method_call
    yield
  File "/home/yons/.conda/envs/abc/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 914, in call_user_method
    raise e from None
ray.exceptions.RayTaskError: �[36mray::ServeReplica:default:opencsg--csg-wukong-1B.handle_request()�[39m (pid=1492889, ip=192.168.80.2)
  File "/home/yons/.conda/envs/abc/lib/python3.10/site-packages/ray/serve/_private/utils.py", line 165, in wrap_to_ray_error
    raise exception
  File "/home/yons/.conda/envs/abc/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 895, in call_user_method
    result = await method_to_call(*request_args, **request_kwargs)
  File "/home/yons/llm-inference/llmserve/backend/server/app.py", line 217, in batch_generate_text
    texts = await asyncio.gather(
  File "/home/yons/.conda/envs/abc/lib/python3.10/site-packages/ray/serve/batching.py", line 498, in batch_wrapper
    return await enqueue_request(args, kwargs)
  File "/home/yons/.conda/envs/abc/lib/python3.10/site-packages/ray/serve/batching.py", line 228, in _process_batches
    results = await func_future
  File "/home/yons/llm-inference/llmserve/backend/server/app.py", line 285, in generate_text_batch
    prediction = await self._predict_async(
  File "/home/yons/llm-inference/llmserve/backend/llm/predictor.py", line 183, in _predict_async
    prediction = await self.engine.predict(prompts, generate, timeout_s=timeout_s, start_timestamp=start_timestamp, lock=self._base_worker_group_lock)
  File "/home/yons/llm-inference/llmserve/backend/llm/engines/generic.py", line 443, in predict
    await asyncio.gather(
  File "/home/yons/.conda/envs/abc/lib/python3.10/asyncio/tasks.py", line 650, in _wrap_awaitable
    return (yield from awaitable.__await__())
ray.exceptions.RayTaskError(RuntimeError): �[36mray::PredictionWorker.generate()�[39m (pid=1493087, ip=192.168.80.2, actor_id=efd48e82c51a27d83f8078f604000000, repr=PredictionWorker:opencsg/csg-wukong-1B)
  File "/home/yons/llm-inference/llmserve/backend/llm/engines/generic.py", line 268, in generate
    return generate(
  File "/home/yons/llm-inference/llmserve/backend/llm/utils.py", line 161, in inner
    ret = func(*args, **kwargs)
  File "/home/yons/llm-inference/llmserve/backend/llm/engines/generic.py", line 169, in generate
    outputs = pipeline(
  File "/home/yons/.conda/envs/abc/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/yons/llm-inference/llmserve/backend/llm/pipelines/default_transformers_pipeline.py", line 77, in __call__
    model_outputs = self.forward(model_inputs, **forward_params)
  File "/home/yons/llm-inference/llmserve/backend/llm/pipelines/default_transformers_pipeline.py", line 208, in forward
    generated_sequence = self.pipeline(**prompt_text, **generate_kwargs)
  File "/home/yons/.conda/envs/abc/lib/python3.10/site-packages/transformers/pipelines/text_generation.py", line 240, in __call__
    return super().__call__(text_inputs, **kwargs)
  File "/home/yons/.conda/envs/abc/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1187, in __call__
    outputs = list(final_iterator)
  File "/home/yons/.conda/envs/abc/lib/python3.10/site-packages/transformers/pipelines/pt_utils.py", line 124, in __next__
    item = next(self.iterator)
  File "/home/yons/.conda/envs/abc/lib/python3.10/site-packages/transformers/pipelines/pt_utils.py", line 125, in __next__
    processed = self.infer(item, **self.params)
  File "/home/yons/.conda/envs/abc/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1112, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
  File "/home/yons/.conda/envs/abc/lib/python3.10/site-packages/transformers/pipelines/text_generation.py", line 327, in _forward
    generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
  File "/home/yons/.conda/envs/abc/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/yons/.conda/envs/abc/lib/python3.10/site-packages/transformers/generation/utils.py", line 1575, in generate
    result = self._sample(
  File "/home/yons/.conda/envs/abc/lib/python3.10/site-packages/transformers/generation/utils.py", line 2735, in _sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
INFO 2024-04-30 03:47:11,135 default_opencsg--csg-wukong-1B IULCpr 0dd6808d-68b3-42fe-ac35-8f4ce6fb6d21 /api/v1/default/opencsg--csg-wukong-1B/run/predict replica.py:772 - BATCH_GENERATE_TEXT ERROR 127.1ms

Requested tokens (817) exceed context window of 512

(PredictionWorker:Qwen/Qwen1.5-72B-Chat-GGUF pid=42050) [INFO 2024-04-16 09:34:13,880] llamacpp_pipeline.py: 212 generate_kwargs: {'max_tokens': 1024, 'echo': False, 'stop': ['<|im_end|>'], 'logits_processor': [], 'stopping_criteria': []}
(ServeController pid=41618) ERROR 2024-04-16 09:34:14,246 controller 41618 deployment_state.py:658 - Exception in replica 'default#Qwen--Qwen1.5-72B-Chat-GGUF#dMqscG', the replica will be stopped.
(ServeController pid=41618) Traceback (most recent call last):
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/serve/_private/deployment_state.py", line 656, in check_ready
(ServeController pid=41618) _, self._version = ray.get(self._ready_obj_ref)
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
(ServeController pid=41618) return fn(*args, **kwargs)
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
(ServeController pid=41618) return func(*args, **kwargs)
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/_private/worker.py", line 2624, in get
(ServeController pid=41618) raise value.as_instanceof_cause()
(ServeController pid=41618) ray.exceptions.RayTaskError(RuntimeError): ray::5-72B-Chat-GGUF.initialize_and_get_metadata() (pid=41823, ip=172.17.0.2, actor_id=6aff10f7a7934a83f523892907000000, repr=<ray.serve._private.replica.ServeReplica:default:Qwen--Qwen1.5-72B-Chat-GGUF object at 0x7f24110af4c0>)
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/concurrent/futures/_base.py", line 451, in result
(ServeController pid=41618) return self.__get_result()
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
(ServeController pid=41618) raise self._exception
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 455, in initialize_and_get_metadata
(ServeController pid=41618) raise RuntimeError(traceback.format_exc()) from None
(ServeController pid=41618) RuntimeError: Traceback (most recent call last):
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 445, in initialize_and_get_metadata
(ServeController pid=41618) await self.replica.update_user_config(
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 724, in update_user_config
(ServeController pid=41618) await reconfigure_method(user_config)
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/server/app.py", line 154, in reconfigure
(ServeController pid=41618) await self.rollover(
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/predictor.py", line 64, in rollover
(ServeController pid=41618) self.new_worker_group = await self._create_worker_group(
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/predictor.py", line 159, in _create_worker_group
(ServeController pid=41618) engine = await self.engine.launch_engine(scaling_config, self.pg, scaling_options)
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/engines/generic.py", line 367, in launch_engine
(ServeController pid=41618) await asyncio.gather(
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/asyncio/tasks.py", line 650, in _wrap_awaitable
(ServeController pid=41618) return (yield from awaitable.await())
(ServeController pid=41618) ray.exceptions.RayTaskError(ValueError): ray::PredictionWorker.init_model() (pid=42050, ip=172.17.0.2, actor_id=b7ddc7c61575fad3b581750d07000000, repr=PredictionWorker:Qwen/Qwen1.5-72B-Chat-GGUF)
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/engines/generic.py", line 236, in init_model
(ServeController pid=41618) self.generator = init_model(
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/utils.py", line 161, in inner
(ServeController pid=41618) ret = func(*args, **kwargs)
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/engines/generic.py", line 133, in init_model
(ServeController pid=41618) resp_batch = generate(
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/utils.py", line 161, in inner
(ServeController pid=41618) ret = func(*args, **kwargs)
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/engines/generic.py", line 168, in generate
(ServeController pid=41618) outputs = pipeline(
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/pipelines/llamacpp/llamacpp_pipeline.py", line 102, in call
(ServeController pid=41618) for batch_response in self.stream(inputs, **kwargs):
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/pipelines/llamacpp/llamacpp_pipeline.py", line 214, in stream
(ServeController pid=41618) for token in output:
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/llama_cpp/llama.py", line 970, in _create_completion
(ServeController pid=41618) raise ValueError(
(ServeController pid=41618) ValueError: Requested tokens (817) exceed context window of 512
(ServeController pid=41618) INFO 2024-04-16 09:34:16,388 controller 41618 deployment_state.py:2185 - Replica default#Qwen--Qwen1.5-72B-Chat-GGUF#dMqscG is stopped.

Install dependency llama-cpp-python failed

Using cached exceptiongroup-1.2.0-py3-none-any.whl (16 kB)
Building wheels for collected packages: deepspeed, llama-cpp-python, llm-serve, ffmpy
  Building wheel for deepspeed (setup.py) ... done
  Created wheel for deepspeed: filename=deepspeed-0.14.0-py3-none-any.whl size=1400347 sha256=db3cabb92e930a4d76b2adf48e2bae802dc28c333d54d790ab2b4256efe03fe0
  Stored in directory: /Users/hhwang/Library/Caches/pip/wheels/23/96/24/bab20c3b4e2af15e195b339afaec373eca7072cf90620432e5
  Building wheel for llama-cpp-python (pyproject.toml) ... error
  error: subprocess-exited-with-error

  × Building wheel for llama-cpp-python (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [66 lines of output]
      *** scikit-build-core 0.8.2 using CMake 3.29.0 (wheel)
      *** Configuring CMake...
      2024-03-31 14:09:18,364 - scikit_build_core - WARNING - libdir/ldlibrary: /Users/hhwang/anaconda3/envs/abc/lib/libpython3.10.a is not a real file!
      2024-03-31 14:09:18,364 - scikit_build_core - WARNING - Can't find a Python library, got libdir=/Users/hhwang/anaconda3/envs/abc/lib, ldlibrary=libpython3.10.a, multiarch=darwin, masd=None
      loading initial cache file /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/CMakeInit.txt
      -- The C compiler identification is AppleClang 15.0.0.15000309
      -- The CXX compiler identification is AppleClang 15.0.0.15000309
      -- Detecting C compiler ABI info
      -- Detecting C compiler ABI info - done
      -- Check for working C compiler: /Library/Developer/CommandLineTools/usr/bin/cc - skipped
      -- Detecting C compile features
      -- Detecting C compile features - done
      -- Detecting CXX compiler ABI info
      -- Detecting CXX compiler ABI info - done
      -- Check for working CXX compiler: /Library/Developer/CommandLineTools/usr/bin/c++ - skipped
      -- Detecting CXX compile features
      -- Detecting CXX compile features - done
      -- Found Git: /usr/bin/git (found version "2.39.3 (Apple Git-146)")
      -- Performing Test CMAKE_HAVE_LIBC_PTHREAD
      -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
      -- Found Threads: TRUE
      -- Accelerate framework found
      -- Metal framework found
      -- Warning: ccache not found - consider installing it for faster compilation or disable this warning with LLAMA_CCACHE=OFF
      -- CMAKE_SYSTEM_PROCESSOR: arm64
      -- ARM detected
      -- Performing Test COMPILER_SUPPORTS_FP16_FORMAT_I3E
      -- Performing Test COMPILER_SUPPORTS_FP16_FORMAT_I3E - Failed
      CMake Warning (dev) at vendor/llama.cpp/CMakeLists.txt:1218 (install):
        Target llama has RESOURCE files but no RESOURCE DESTINATION.
      This warning is for project developers.  Use -Wno-dev to suppress it.

      CMake Warning (dev) at CMakeLists.txt:21 (install):
        Target llama has PUBLIC_HEADER files but no PUBLIC_HEADER DESTINATION.
      This warning is for project developers.  Use -Wno-dev to suppress it.

      CMake Warning (dev) at CMakeLists.txt:30 (install):
        Target llama has PUBLIC_HEADER files but no PUBLIC_HEADER DESTINATION.
      This warning is for project developers.  Use -Wno-dev to suppress it.

      -- Configuring done (0.5s)
      -- Generating done (0.0s)
      -- Build files have been written to: /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build
      *** Building project with Ninja...
      Change Dir: '/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build'

      Run Build Command(s): /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-build-env-h3q63wii/normal/lib/python3.10/site-packages/ninja/data/bin/ninja -v
      [1/25] cd /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/vendor/llama.cpp && xcrun -sdk macosx metal -O3 -c /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-metal.metal -o /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-metal.air && xcrun -sdk macosx metallib /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-metal.air -o /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/default.metallib && rm -f /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-metal.air && rm -f /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-common.h && rm -f /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-metal.metal
      FAILED: bin/default.metallib /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/default.metallib
      cd /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/vendor/llama.cpp && xcrun -sdk macosx metal -O3 -c /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-metal.metal -o /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-metal.air && xcrun -sdk macosx metallib /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-metal.air -o /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/default.metallib && rm -f /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-metal.air && rm -f /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-common.h && rm -f /var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/tmpujapt1jr/build/bin/ggml-metal.metal
      xcrun: error: unable to find utility "metal", not a developer tool or in PATH
      [2/25] cd /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp && /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-build-env-h3q63wii/normal/lib/python3.10/site-packages/cmake/data/bin/cmake -DMSVC= -DCMAKE_C_COMPILER_VERSION=15.0.0.15000309 -DCMAKE_C_COMPILER_ID=AppleClang -DCMAKE_VS_PLATFORM_NAME= -DCMAKE_C_COMPILER=/Library/Developer/CommandLineTools/usr/bin/cc -P /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/common/../scripts/gen-build-info-cpp.cmake
      -- Found Git: /usr/bin/git (found version "2.39.3 (Apple Git-146)")
      [3/25] /Library/Developer/CommandLineTools/usr/bin/cc -DACCELERATE_LAPACK_ILP64 -DACCELERATE_NEW_LAPACK -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_ACCELERATE -DGGML_USE_METAL -D_DARWIN_C_SOURCE -D_XOPEN_SOURCE=600 -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/. -F/Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk/System/Library/Frameworks -O3 -DNDEBUG -std=gnu11 -arch arm64 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk -mmacosx-version-min=14.3 -fPIC -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion -MD -MT vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-alloc.c.o -MF vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-alloc.c.o.d -o vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-alloc.c.o -c /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/ggml-alloc.c
      [4/25] /Library/Developer/CommandLineTools/usr/bin/cc -DACCELERATE_LAPACK_ILP64 -DACCELERATE_NEW_LAPACK -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_ACCELERATE -DGGML_USE_METAL -D_DARWIN_C_SOURCE -D_XOPEN_SOURCE=600 -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/. -F/Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk/System/Library/Frameworks -O3 -DNDEBUG -std=gnu11 -arch arm64 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk -mmacosx-version-min=14.3 -fPIC -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion -MD -MT vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-backend.c.o -MF vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-backend.c.o.d -o vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-backend.c.o -c /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/ggml-backend.c
      [5/25] /Library/Developer/CommandLineTools/usr/bin/c++ -DGGML_USE_METAL -DLLAMA_BUILD -DLLAMA_SHARED -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/examples/llava/. -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/examples/llava/../.. -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/examples/llava/../../common -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/. -O3 -DNDEBUG -std=gnu++11 -arch arm64 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk -mmacosx-version-min=14.3 -fPIC -Wno-cast-qual -MD -MT vendor/llama.cpp/examples/llava/CMakeFiles/llava.dir/llava.cpp.o -MF vendor/llama.cpp/examples/llava/CMakeFiles/llava.dir/llava.cpp.o.d -o vendor/llama.cpp/examples/llava/CMakeFiles/llava.dir/llava.cpp.o -c /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/examples/llava/llava.cpp
      [6/25] /Library/Developer/CommandLineTools/usr/bin/cc -DACCELERATE_LAPACK_ILP64 -DACCELERATE_NEW_LAPACK -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_ACCELERATE -DGGML_USE_METAL -D_DARWIN_C_SOURCE -D_XOPEN_SOURCE=600 -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/. -F/Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk/System/Library/Frameworks -O3 -DNDEBUG -std=gnu11 -arch arm64 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk -mmacosx-version-min=14.3 -fPIC -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion -MD -MT vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-metal.m.o -MF vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-metal.m.o.d -o vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-metal.m.o -c /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/ggml-metal.m
      [7/25] /Library/Developer/CommandLineTools/usr/bin/cc -DACCELERATE_LAPACK_ILP64 -DACCELERATE_NEW_LAPACK -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_ACCELERATE -DGGML_USE_METAL -D_DARWIN_C_SOURCE -D_XOPEN_SOURCE=600 -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/. -F/Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk/System/Library/Frameworks -O3 -DNDEBUG -std=gnu11 -arch arm64 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk -mmacosx-version-min=14.3 -fPIC -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion -MD -MT vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-quants.c.o -MF vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-quants.c.o.d -o vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-quants.c.o -c /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/ggml-quants.c
      [8/25] /Library/Developer/CommandLineTools/usr/bin/c++ -DACCELERATE_LAPACK_ILP64 -DACCELERATE_NEW_LAPACK -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DLLAMA_BUILD -DLLAMA_SHARED -D_DARWIN_C_SOURCE -D_XOPEN_SOURCE=600 -Dllama_EXPORTS -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/. -F/Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk/System/Library/Frameworks -O3 -DNDEBUG -std=gnu++11 -arch arm64 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk -mmacosx-version-min=14.3 -fPIC -Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -MD -MT vendor/llama.cpp/CMakeFiles/llama.dir/unicode.cpp.o -MF vendor/llama.cpp/CMakeFiles/llama.dir/unicode.cpp.o.d -o vendor/llama.cpp/CMakeFiles/llama.dir/unicode.cpp.o -c /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/unicode.cpp
      [9/25] /Library/Developer/CommandLineTools/usr/bin/c++ -DGGML_USE_METAL -DLLAMA_BUILD -DLLAMA_SHARED -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/examples/llava/. -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/examples/llava/../.. -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/examples/llava/../../common -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/. -O3 -DNDEBUG -std=gnu++11 -arch arm64 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk -mmacosx-version-min=14.3 -fPIC -Wno-cast-qual -MD -MT vendor/llama.cpp/examples/llava/CMakeFiles/llava.dir/clip.cpp.o -MF vendor/llama.cpp/examples/llava/CMakeFiles/llava.dir/clip.cpp.o.d -o vendor/llama.cpp/examples/llava/CMakeFiles/llava.dir/clip.cpp.o -c /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/examples/llava/clip.cpp
      [10/25] /Library/Developer/CommandLineTools/usr/bin/cc -DACCELERATE_LAPACK_ILP64 -DACCELERATE_NEW_LAPACK -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_ACCELERATE -DGGML_USE_METAL -D_DARWIN_C_SOURCE -D_XOPEN_SOURCE=600 -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/. -F/Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk/System/Library/Frameworks -O3 -DNDEBUG -std=gnu11 -arch arm64 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk -mmacosx-version-min=14.3 -fPIC -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion -MD -MT vendor/llama.cpp/CMakeFiles/ggml.dir/ggml.c.o -MF vendor/llama.cpp/CMakeFiles/ggml.dir/ggml.c.o.d -o vendor/llama.cpp/CMakeFiles/ggml.dir/ggml.c.o -c /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/ggml.c
      [11/25] /Library/Developer/CommandLineTools/usr/bin/c++ -DACCELERATE_LAPACK_ILP64 -DACCELERATE_NEW_LAPACK -DGGML_SCHED_MAX_COPIES=4 -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DLLAMA_BUILD -DLLAMA_SHARED -D_DARWIN_C_SOURCE -D_XOPEN_SOURCE=600 -Dllama_EXPORTS -I/private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/. -F/Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk/System/Library/Frameworks -O3 -DNDEBUG -std=gnu++11 -arch arm64 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk -mmacosx-version-min=14.3 -fPIC -Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -MD -MT vendor/llama.cpp/CMakeFiles/llama.dir/llama.cpp.o -MF vendor/llama.cpp/CMakeFiles/llama.dir/llama.cpp.o.d -o vendor/llama.cpp/CMakeFiles/llama.dir/llama.cpp.o -c /private/var/folders/wm/9mckczj143jdzdzr037_g8lm0000gn/T/pip-install-f3axqn06/llama-cpp-python_20b9c0a542fd49ac96cea9fb409a8d94/vendor/llama.cpp/llama.cpp
      ninja: build stopped: subcommand failed.


      *** CMake build failed
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for llama-cpp-python
  Building wheel for llm-serve (pyproject.toml) ... done
  Created wheel for llm-serve: filename=llm_serve-0.0.1-py3-none-any.whl size=100808 sha256=5896e4e7b35cf15f8977a5847a9ff40f78ed2ae42e95adc28def70cefc2b426c
  Stored in directory: /Users/hhwang/Library/Caches/pip/wheels/cb/6e/71/619b3e1f616ba182cb9bfc8e0e239a9e8402f4305bc75d27d7
  Building wheel for ffmpy (setup.py) ... done
  Created wheel for ffmpy: filename=ffmpy-0.3.2-py3-none-any.whl size=5582 sha256=f2f3304e01d27a1e9f63c8c504d5d56cf0a5c40ec98c2e805c1a5d8c41ea17be
  Stored in directory: /Users/hhwang/Library/Caches/pip/wheels/bd/65/9a/671fc6dcde07d4418df0c592f8df512b26d7a0029c2a23dd81
Successfully built deepspeed llm-serve ffmpy
Failed to build llama-cpp-python
ERROR: Could not build wheels for llama-cpp-python, which is required to install pyproject.toml-based projects

Inference throw timeout sometime

throw error of async timeout when inference process for model Qwen with batch_wait_timeout_s: 0 in yaml

Change model default download address

Investigate how to update model default download address to opencsg

Model inference cross multi-nodes

Model inference cross mulit-nodes

Wrong model id when there are -- in model id for do predict

Convert / to -- in model ID for url, then there is error when convert -- back to / for predict action.
for example, there is model with id opencsg/code--llama-v1.0, throw a error when do predict action.

No default value for "timeout" if missing "batch_wait_timeout_s: 0" in yaml config

Cause timeout exception when missing batch_wait_timeout_s: 0
must be a bug _

[BUG] Get error when try "translation" downstream model

Run command:
llm-serve start experimental --model ./models/translation--t5-small.yaml

get error:

(ServeController pid=26978)   File "/Users/lipeng/workspaces/github.com/depenglee1707/llm-inference/llmserve/backend/llm/pipelines/default_transformers_pipeline.py", line 125, in __call__
(ServeController pid=26978)     output = self.format_output(data[0], inputs, preprocess_time, generation_time)
(ServeController pid=26978)   File "/Users/lipeng/workspaces/github.com/depenglee1707/llm-inference/llmserve/backend/llm/pipelines/default_transformers_pipeline.py", line 183, in format_output
(ServeController pid=26978)     num_generated_tokens = len(self.tokenizer(output["generated_text"]).input_ids)
(ServeController pid=26978) TypeError: string indices must be integers

@jasonhe258 please take a look

vllm cannot address "runtime_env"

for Qwen/Qwen-7B, we set runtime_env like this:

  initialization:
    runtime_env:
      pip: ["transformers_stream_generator", "tiktoken"]

but when start up, still get the exception:

ImportError: This modeling file requires the following packages that were not found in your environment: tiktoken. Run `pip install tiktoken`

The usage introduction of `llm-serve` is not correct in quick_start.md

Wrong:

# llm-serve --help

 Usage: llm-serve [OPTIONS] COMMAND [ARGS]...

╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --help          Show this message and exit.                                                                                                                        │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ evaluate     Evaluate and summarize the results of a multi_query run with a strong 'evaluator' LLM like GPT-4.                                                     │
│ list         List available model(s) and deployed serving etc.                                                                                                     │
│ predict      Predict one or several models with one or multiple prompts, optionally read from file, and save the results to a file.                                │
│ start        Start application(s) for LLM serving, API server, experimention, fine-tuning and comparation.                                                         │
│ stop         Stop application(s) for LLM serving and API server.                                                                                                   │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

The sub command evaluate has already deprecated and removed

inference gradio web reponse random words for deepseek instrcuct model

while using rest api, everything seems to be OK:

curl -H "Content-Type: application/json" -X POST -d '{"prompt": "写一个快排吧"}' "http://127.0.0.1:8000/api/v1/default/opencsg--opencsg-deepseek-coder-1.3b-v0.1/run/predict" {"generated_text":"}\n\n\n# 快排\ndef quicksort(arr):\n if len(arr) <= 1:\n return arr\n pivot = arr[len(arr) // 2]\n left = [x for x in arr if x < pivot]\n middle = [x for x in arr if x == pivot]\n right = [x for x in arr if x > pivot]\n return quicksort(left) + middle + quicksort(right)\n\n# 测试\nprint(quicksort(arr))\n\n# 输出: [1, 2, 3, 4, 5, 6, 7, 8, 9]\n```\n\n这个程序使用了快速排序算法，它是一种高效的排序算法，基于分治法的原理。它选择一个元素作为枢轴，并根据它们与枢轴的大小将其他元素分成两个子数组，然后递归地对子数组进行排序。\n\n快速排序的平均时间复杂度为O(n log n)，最坏情况下的时间复杂度为O(n^2)，但这种情况很少发生。","num_input_tokens":16,"num_input_tokens_batch":16,"num_generated_tokens":267,"num_generated_tokens_batch":267,"preprocessing_time":0.008793507993686944,"generation_time":2.4766286090016365,"postprocessing_time":0.0009328589949291199,"generation_time_per_token":0.008751337840995181,"generation_time_per_token_batch":0.008751337840995181,"num_total_tokens":283,"num_total_tokens_batch":283,"total_time":2.4863549759902526,"total_time_per_token":0.008785706628940822,"total_time_per_token_batch":0.008785706628940822}(.llm-inference) root@opencsg-gpu1-4090:~/pl/workspace/depenglee/llm-inference#

Api server blocked when one request is in-process

Need more test for this issue

GGUF implements will make duplicate copy since cannot detect config.json file in the cache folder

速度和sglang相比哪个快？

Support Quantized Model

Support Quantized Model.
For example:
https://huggingface.co/THUDM/chatglm2-6b-int4
https://huggingface.co/Qwen/Qwen1.5-72B-Chat-GPTQ-Int4

API server startup slow

Too much yaml file need to be load during api server startup

vllm implements cannot support download model from repo besides hg

  initialization:
    runtime_env:
      env_vars:
        HF_ENDPOINT: https://hub.opencsg.com/hf
    initializer:
      type: Vllm
      from_pretrained_kwargs:
        trust_remote_code: true
    pipeline: vllm

this cannot work as expected

vllm, gguf, llamacpp, these integration cannot address local path of model

Better to implement streaming output feature

Better to implement streaming output feature, so user can read the LLM output word by word.

Add inference SDK for invoke

Add python sdk for inference api

Generate incorrect text format when use pipeline defaulttransformers

Set pipeline: defaulttransformers and prompt_format: "'role': 'user', 'content': {instruction}" in yaml, and seems there is text format issue in generated_text as following.

[{"generated_text":"'role': 'user', 'content': hello nihao\n{'role': 'user', 'content': '你好'}","num_input_tokens":2,"num_input_tokens_batch":2,"num_generated_tokens":26,"num_generated_tokens_batch":26,"preprocessing_time":0.007688470010180026,"generation_time":7.110702240024693,"postprocessing_time":0.0007505400571972132,"generation_time_per_token":0.2539536514294533,"generation_time_per_token_batch":0.2539536514294533,"num_total_tokens":28,"num_total_tokens_batch":28,"total_time":7.1191412500920705,"total_time_per_token":0.2542550446461454,"total_time_per_token_batch":0.2542550446461454}]

Enhance inference API to support OpenAI style

enable reset generate config on fly

for now the generation params is addressed in yaml files,
add the ability reset these params on fly is useful:

    generate_kwargs:
      do_sample: false
      max_new_tokens: 512
      min_new_tokens: 16
      temperature: 0.7
      repetition_penalty: 1.1
      top_p: 0.8
      top_k: 50
      pad_token: "<|extra_0|>"
      eos_token: "<|endoftext|>"

Expose model generate parameters by API server

generate_kwargs:
  do_sample: true
  max_new_tokens: 128
  min_new_tokens: 16
  temperature: 0.7
  repetition_penalty: 1.1
  top_p: 0.8
  top_k: 50

Api server was blocked when LLM deployment scaling config beyond the cluster resouces

for example, ray cluster just has 12 cpus.

curl -H "Content-Type: application/json" -H "user-name: default"  -d '[{"model_id": "facebook/opt-125m", "model_task": "text-generation", "model_revision": "main", "is_oob": true, "scaling_config": {"num_workers": 1, "num_gpus_per_worker": 1,"num_cpus_per_worker": 20}}]' -X POST "http://127.0.0.1:8000/api/start_serving"

Support load Qwen1.5-72B-Chat-GPTQ-Int4 by auto_gptq

Run Qwen1.5-72B-Chat-GPTQ-Int4 is much slower than Qwen1.5-72B-Chat by transformer package.
Quantited model need load by auto_gptq.

https://github.com/QwenLM/Qwen/blob/main/README_CN.md#%E6%8E%A8%E7%90%86%E6%80%A7%E8%83%BD

opencsgs / llm-inference Goto Github PK

llm-inference's People

Contributors

Stargazers

Watchers

Forkers

llm-inference's Issues

Recommend Projects

Recommend Topics

Recommend Org

Jobs