deepjavalibrary / djl-serving
A universal scalable machine learning model deployment solution.
License: Apache License 2.0
As titled. I'd like to use the UI control panel plugin, but it has no security verification, so I don't dare to expose the UI page in a production environment.
Could you add a simple login page, even if it only supports a single admin account? I think it would be very helpful to the project.
As titled. I used the inference API to pass in a file, and the following error occurred:
Caused by: java.lang.IllegalArgumentException: Malformed data
at ai.djl.ndarray.NDList.decode(NDList.java:124) ~[api-0.19.0.jar:?]
at ai.djl.ndarray.NDList.decode(NDList.java:85) ~[api-0.19.0.jar:?]
at ai.djl.modality.Input.getAsNDList(Input.java:328) ~[api-0.19.0.jar:?]
at ai.djl.modality.Input.getDataAsNDList(Input.java:198) ~[api-0.19.0.jar:?]
at ai.djl.translate.NoopServingTranslatorFactory$NoopServingTranslator.processInput(NoopServingTranslatorFactory.java:68) ~[api-0.19.0.jar:?]
... 8 more
Caused by: java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:202) ~[?:?]
at java.io.DataInputStream.readFully(DataInputStream.java:170) ~[?:?]
at ai.djl.ndarray.NDList.decode(NDList.java:99) ~[api-0.19.0.jar:?]
at ai.djl.ndarray.NDList.decode(NDList.java:85) ~[api-0.19.0.jar:?]
at ai.djl.modality.Input.getAsNDList(Input.java:328) ~[api-0.19.0.jar:?]
at ai.djl.modality.Input.getDataAsNDList(Input.java:198) ~[api-0.19.0.jar:?]
at ai.djl.translate.NoopServingTranslatorFactory$NoopServingTranslator.processInput(NoopServingTranslatorFactory.java:68) ~[api-0.19.0.jar:?]
For offline (air-gapped) installations, we should run
pip install -r requirements.txt --no-deps
instead of
pip install -r requirements.txt
This prevents pip from trying to resolve the wheels' dependencies over the network.
Using DeepSpeed AOT to partition the GPT-2 model works fine, but loading the partitioned model fails:
assert self.ckpt_load_enabled, "Meta tensors are not supported for this model currently."
Getting the output below from the streaming utils. As you can see, there is a space between "design" and "ing":
design ing , developing , testing , and maintain ing software
There should not be any space. I am using a LLaMA+LoRA model.
Wrong result:
```python
generator = stream_generator(model, tokenizer, prompt, **generate_kwargs)
generated = ""
for text in generator:
    generated += ' ' + text[0]
    paginator.add_cache(session_id, generated)
paginator.add_cache(session_id, generated + "<eos>")
```
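A hedged aside on the snippet above (reusing its names): if the streamed chunks are detokenized subword pieces that already carry their own leading whitespace, then joining them with an extra ' ' would itself split words like "designing" into "design ing". A minimal sketch under that assumption:

```python
# Sketch, assuming each streamed chunk already includes any leading
# whitespace it needs; append chunks verbatim instead of space-joining.
generated = ""
for text in generator:
    generated += text[0]  # no extra ' ' between subword pieces
    paginator.add_cache(session_id, generated)
paginator.add_cache(session_id, generated + "<eos>")
```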
In the saved partitioned model, the config.json file has been changed; it differs from the original config.json:
```json
{
  "_name_or_path": "bigscience/bloom-1b1",
  "apply_residual_connection_post_layernorm": false,
  "architectures": [
    "BloomModel"
  ],
  "attention_dropout": 0.0,
  "attention_softmax_in_fp32": true,
  "bias_dropout_fusion": true,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_dropout": 0.0,
  "hidden_size": 1536,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "masked_softmax_fusion": true,
  "model_type": "bloom",
  "n_head": 16,
  "n_inner": null,
  "n_layer": 24,
  "offset_alibi": 100,
  "pad_token_id": 3,
  "pretraining_tp": 1,
  "skip_bias_add": true,
  "skip_bias_add_qkv": false,
  "slow_but_exact": false,
  "torch_dtype": "float32",
  "transformers_version": "4.27.1",
  "unk_token_id": 0,
  "use_cache": true,
  "vocab_size": 250880
}
```
https://huggingface.co/togethercomputer/RedPajama-INCITE-Base-3B-v1
This model, tested on a p3.8xlarge with TP4, has weird NCCL issues. TP1 doesn't help either and throws a cuBLAS handle error.
serving.properties:
```
option.model_id=EleutherAI/gpt-neo-1.3B
option.task=text-generation
option.tensor_parallel_degree=2
option.dtype=fp16
#option.enable_streaming=true
option.enable_streaming=huggingface
engine=DeepSpeed
option.parallel_loading=true
```
curl command:
```
curl -X POST "http://localhost:8080/invocations" \
  -H "content-type: application/json" \
  -d '{"inputs": ["Large language model is"], "parameters": {"max_length": 25}}'
```
Output:
{"outputs": ["Large language model is"]}
{"outputs": "CUDA error: an illegal memory access was encountered\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n"}
If you first generate a single batch with a large sequence length (e.g. 2048), and afterwards try a batch size of 4, it fails with the error above.
Send a failure signal back to SageMaker.
The DJL containers should carry the Docker label that enables them for multi-container endpoints in SageMaker.
Currently the label is not set, and users get the error below:
An error occurred (ValidationException) when calling the CreateModel operation: Your Ecr Image 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.21.0-fastertransformer5.3.0-cu117 does not contain required com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true Docker label(s).
Will this change the current API? How? Change the Dockerfile and add the label.
Who will benefit from this enhancement? All SageMaker customers.
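A minimal sketch of the fix, assuming a single LABEL instruction in our Dockerfiles is sufficient (the label name and value come straight from the error message above):

```dockerfile
# Required so SageMaker accepts the container for multi-container endpoints.
LABEL com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true
```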
We have deployed a FLAN-T5 model on NVIDIA GPU infrastructure with the following serving.properties:
```
engine=Python
option.entryPoint=djl_python.deepspeed
option.task=text2text-generation
option.dtype=int8
option.device_map=balanced
batch_size=2
max_batch_delay=1
```
The model works fine for a single request, but for concurrent users it starts throwing HTTP 400 errors.
Dynamic batching should be supported by DJL Serving.
```json
{
  "code": 400,
  "type": "TranslateException",
  "message": "Batch output size mismatch, expected: 2, actual: 1"
}
```
Cannot run ./gradlew FJ or build under JDK 17.0.4 in the /serving directory. Both Yang and Sindhu ran into this.
To reproduce: on the master branch, run ./gradlew FJ or build under /djl-serving/serving.
If you replace an old model file with a new model zip file of the same name, the old model still gets registered again, even after deregistering it or restarting the server. You currently have to change the model file name before loading a new model.
HuggingFace repos download to the home directory by default, which has little space on SageMaker.
env = {"HUGGINGFACE_HUB_CACHE": "/tmp", "TRANSFORMERS_CACHE": "/tmp"}
Let's set these two variables in the containers we build.
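A minimal sketch of baking the same two variables into the image, assuming we set them in the Dockerfile rather than per endpoint:

```dockerfile
# Redirect HuggingFace caches to /tmp, which has more room on SageMaker hosts.
ENV HUGGINGFACE_HUB_CACHE=/tmp
ENV TRANSFORMERS_CACHE=/tmp
```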
```python
import transformers

model = transformers.AutoModelForCausalLM.from_pretrained(
    'mosaicml/mpt-7b',
    trust_remote_code=True
)
```
The model requires a new option, trust_remote_code=True, which needs to be enabled to run inference.
For example, in
```java
public ModelInfo(
        String id,
        String modelUrl,
        String version,
        String engineName,
        Class<I> inputClass,
        Class<O> outputClass,
        int queueSize,
        int maxIdleTime,
        int maxBatchDelay,
        int batchSize) {
```
there is no indication what the unit of time or delay is for maxIdleTime or maxBatchDelay. This makes the library frustrating to use, as I have to click through a bunch of source code and infer the intent from actual usage. The library would be much easier to use if such parameters were named with their unit, e.g. maxIdleTimeSecs and maxBatchDelayMillis.
This appears pervasive, e.g. Job.getBegin (which should probably be removed, as System.nanoTime() isn't absolute?) and Job.getWaitingTime (-> Job.getWaitingTimeMicrosecs). "Time" can usually be omitted if it's obvious, e.g. Job.getWaitingMicrosecs, maxIdleSecs.
serving.properties:
```
option.model_id=EleutherAI/gpt-neo-1.3B
option.task=text-generation
option.tensor_parallel_degree=2
option.dtype=fp16
option.enable_streaming=true
#option.enable_streaming=huggingface
engine=DeepSpeed
option.parallel_loading=true
```
curl command:
```
curl -X POST "http://localhost:8080/invocations" \
  -H "content-type: application/json" \
  -d '{"inputs": ["Large language model is"], "parameters": {"max_length": 2}}'
```
Expected 2 new tokens to be returned, but 50 tokens are returned.
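A hedged aside on the parameters (standard HuggingFace semantics, which may or may not be the bug here): in generate(), max_length counts the prompt tokens as well, while max_new_tokens bounds only the newly generated tokens; getting 50 tokens back regardless suggests the parameter is not being forwarded on the streaming path. A minimal sketch of the distinction:

```python
# Sketch: max_new_tokens bounds only the continuation, unlike max_length,
# which includes the prompt length.
from transformers import pipeline

pipe = pipeline("text-generation", model="EleutherAI/gpt-neo-1.3B")
out = pipe("Large language model is", max_new_tokens=2)  # exactly 2 new tokens
print(out[0]["generated_text"])
```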
The FasterTransformer handler will break with the upstream conversion-script changes in NVIDIA/FasterTransformer#568, because the model is not fetched using the from_pretrained() method.
In the current HuggingFace Accelerate implementation, we only use pipeline parallelism, not tensor parallelism. However, we still require the user to pass in tensor_parallel_degree, which doesn't make much sense. We should offer pipeline_parallel_degree to address this.
When I used djl-serving 0.19.0, I found that the model's prediction results were garbled for Chinese text.
I then found that the Ubuntu image does not support Chinese, so I added the installation of language-pack-zh-hans to install_djl_serving.sh and added
```
RUN localedef -c -f UTF-8 -i zh_CN zh_CN.utf8
ENV LC_ALL zh_CN.UTF-8
```
to the Dockerfile to build a new image. After that, Chinese prediction results worked.
Is my handling correct? Or should the prediction results already support Chinese, and the problem is with my model? Or do you intend to support Chinese prediction results?
As a final example, here is one that features a more complicated interaction. The human detection model will find all of the humans in an image. Then, the "splitHumans" function will turn all of them into separate images that can be treated as a list. The "map" will apply the "poseEstimation" model to each of the detected humans in the list.
```
workflow:
  humans: ["splitHumans", ["humanDetection", "in"]]
  out: ["map", "poseEstimation", "humans"]
```
https://github.com/deepjavalibrary/djl-serving/blob/master/wlm/src/main/java/ai/djl/serving/wlm/ModelInfo.java#L75-L80 currently reads:
```java
public ModelInfo(String modelUrl, Class<I> inputClass, Class<O> outputClass) {
    this.id = modelUrl;
    this.modelUrl = modelUrl;
    this.inputClass = inputClass;
    this.outputClass = outputClass;
}
```
This is missing the default initialization of queueSize et al. that is present in the Criteria constructor just below (https://github.com/deepjavalibrary/djl-serving/blob/master/wlm/src/main/java/ai/djl/serving/wlm/ModelInfo.java#L88-L99):
```java
WlmConfigManager config = WlmConfigManager.getInstance();
queueSize = config.getJobQueueSize();
maxIdleTime = config.getMaxIdleTime();
batchSize = config.getBatchSize();
maxBatchDelay = config.getMaxBatchDelay();
```
This makes the first constructor not very useful?
Special characters are not cleaned up.
Have you considered introducing a distributed deployment solution for djl-serving?
After all, it can only run on a single machine right now, and a single instance leaves some stability risks in a production environment.
Allow a URL that points to a model.py to be used as the entryPoint, so that users could supply any model.py URL for model deployment.
The pip command in the 0.18.0 Docker image is broken for some reason; it always exits with an error code.
In the central module, I just run
./gradlew run
but the terminal shows this:
Task :central:buildReactApp
asset main.js 2.06 MiB [compared for emit] (name: main) 1 related asset
orphan modules 78.8 KiB [orphan] 83 modules
runtime modules 972 bytes 5 modules
modules by path ./node_modules/ 1.68 MiB 240 modules
modules by path ./src/main/webapp/ 41.2 KiB
modules by path ./src/main/webapp/components/ 38 KiB
modules by path ./src/main/webapp/components/modelpanels/ 6.34 KiB 5 modules
modules by path ./src/main/webapp/components/*.jsx 11.5 KiB 3 modules
modules by path ./src/main/webapp/components/TabPanel/ 4.85 KiB 2 modules
+ 1 module
modules by path ./src/main/webapp/css/ 2.15 KiB
./src/main/webapp/css/useStyles.jsx 798 bytes [built] [code generated]
./src/main/webapp/css/style.css 537 bytes [built] [code generated]
./node_modules/css-loader/dist/cjs.js!./src/main/webapp/css/style.css 864 bytes [built] [code generated]
./src/main/webapp/Main.jsx 1.07 KiB [built] [code generated]
webpack 5.74.0 compiled successfully in 2680 ms
Task :central:run
Listening for transport dt_socket at address: 4000
[INFO ] - [id: 0xfb7e5f31] REGISTERED
[INFO ] - [id: 0xfb7e5f31] BIND: 0.0.0.0/0.0.0.0:8080
[INFO ] - [id: 0xfb7e5f31, L:/[0:0:0:0:0:0:0:0]:8080] ACTIVE
[INFO ] - [id: 0xfb7e5f31, L:/[0:0:0:0:0:0:0:0]:8080] READ: [id: 0xfd448de3, L:/[0:0:0:0:0:0:0:1]:8080 - R:/[0:0:0:0:0:0:0:1]:50076]
[INFO ] - [id: 0xfb7e5f31, L:/[0:0:0:0:0:0:0:0]:8080] READ: [id: 0x5fb3bd15, L:/[0:0:0:0:0:0:0:1]:8080 - R:/[0:0:0:0:0:0:0:1]:50077]
<============-> 95% EXECUTING [3m 33s]
<============-> 95% EXECUTING [1m 36s]
:central:run
Then I visit http://localhost:8080/, but there is no response and it just keeps waiting.
libgfortran.so.4 is missing in the image when trying to install numpy in the Docker container (cpu-full).
Different model packages may require different dependency versions. For example, protobuf has strict version requirements that vary across models, so users may need to install different pip wheels within a single environment. We need to find a way to address this. Ideally, the user could specify something like:
option.python_path=/path/to/python
The T5 model series is not supported because the handler loads models with AutoModelForCausalLM, while T5 is an encoder-decoder model. We need to support this special case.
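A minimal sketch of what the special case looks like, assuming the fix is to route T5-family models through the seq2seq auto class (AutoModelForCausalLM raises an error for them):

```python
# Sketch: T5 is encoder-decoder, so it loads via AutoModelForSeq2SeqLM.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

inputs = tokenizer("translate English to German: Hello", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```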
Can we create a new GPU Dockerfile based on ONNX Runtime?
Sometimes the MPI process hangs and the Python process cannot be restarted.
serving.properties:
```
option.model_id=EleutherAI/gpt-neo-1.3B
option.task=text-generation
option.tensor_parallel_degree=2
option.dtype=fp16
option.enable_streaming=true
engine=DeepSpeed
option.parallel_loading=true
```
curl command:
```
curl -X POST "http://localhost:8080/invocations" \
  -H "content-type: text/plain" \
  -d "Large language model is"
```
WARN PyProcess Primary job terminated normally, but 1 process returned
WARN PyProcess a non-zero exit code. Per user-direction, the job has been aborted.
This is not actually an error in DJLServing; just tracking it here. Will raise an issue in HF as well.
The HF pipeline actually tries to generate the outputs on CPU for the GPT-NeoX 20B model, despite device_map=auto being included in the configuration.
The workaround is to use the model.generate method, manually moving the input_ids to the GPU.
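A hedged sketch of that workaround (the model ID comes from the issue; the dtype and generation parameters are illustrative):

```python
# Sketch: bypass the pipeline and call model.generate() directly,
# moving input_ids to the GPU by hand as the workaround describes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-neox-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.float16
)

input_ids = tokenizer("Large language model is", return_tensors="pt").input_ids
input_ids = input_ids.to("cuda")  # the manual device move
output = model.generate(input_ids, max_new_tokens=25)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```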
Bug: RuntimeError: "topk_cpu" not implemented for 'Half'
Trying the GPT-NeoX 20B model with our huggingface.py handler. This was actually recorded as issues in transformers:
huggingface/transformers#18703
huggingface/transformers#19445
Customers are confused by TensorParallelDegree when using HuggingFace models. Maybe we should permanently rename it to ModelParallelDegree / model_parallel_degree for clarity.
Generating a self-signed certificate throws an error with JDK 17. Because of this, the Gradle build fails, as the following two tests fail:
serving/build/reports/tests/test/classes/ai.djl.serving.ModelServerTest.html#test
serving/build/reports/tests/test/classes/ai.djl.serving.ModelServerTest.html#testWorkflows
Stack trace of the error message:
java.security.cert.CertificateException: No provider succeeded to generate a self-signed certificate. See debug log for the root cause.
at io.netty.handler.ssl.util.SelfSignedCertificate.<init>(SelfSignedCertificate.java:249)
at io.netty.handler.ssl.util.SelfSignedCertificate.<init>(SelfSignedCertificate.java:166)
at io.netty.handler.ssl.util.SelfSignedCertificate.<init>(SelfSignedCertificate.java:115)
at io.netty.handler.ssl.util.SelfSignedCertificate.<init>(SelfSignedCertificate.java:90)
at ai.djl.serving.util.ConfigManager.getSslContext(ConfigManager.java:385)
at ai.djl.serving.ConfigManagerTest.testSsl(ConfigManagerTest.java:46)
at ai.djl.serving.ModelServerTest.test(ModelServerTest.java:291)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at org.testng.internal.invokers.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:135)
at org.testng.internal.invokers.TestInvoker.invokeMethod(TestInvoker.java:673)
at org.testng.internal.invokers.TestInvoker.invokeTestMethod(TestInvoker.java:220)
at org.testng.internal.invokers.MethodRunner.runInSequence(MethodRunner.java:50)
at org.testng.internal.invokers.TestInvoker$MethodInvocationAgent.invoke(TestInvoker.java:945)
at org.testng.internal.invokers.TestInvoker.invokeTestMethods(TestInvoker.java:193)
at org.testng.internal.invokers.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:146)
at org.testng.internal.invokers.TestMethodWorker.run(TestMethodWorker.java:128)
at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
at org.testng.TestRunner.privateRun(TestRunner.java:808)
at org.testng.TestRunner.run(TestRunner.java:603)
at org.testng.SuiteRunner.runTest(SuiteRunner.java:429)
at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:423)
at org.testng.SuiteRunner.privateRun(SuiteRunner.java:383)
at org.testng.SuiteRunner.run(SuiteRunner.java:326)
at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:52)
at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:95)
at org.testng.TestNG.runSuitesSequentially(TestNG.java:1249)
at org.testng.TestNG.runSuitesLocally(TestNG.java:1169)
at org.testng.TestNG.runSuites(TestNG.java:1092)
at org.testng.TestNG.run(TestNG.java:1060)
at org.gradle.api.internal.tasks.testing.testng.TestNGTestClassProcessor.runTests(TestNGTestClassProcessor.java:141)
at org.gradle.api.internal.tasks.testing.testng.TestNGTestClassProcessor.stop(TestNGTestClassProcessor.java:90)
at org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.stop(SuiteTestClassProcessor.java:61)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:36)
at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
at org.gradle.internal.dispatch.ContextClassLoaderDispatch.dispatch(ContextClassLoaderDispatch.java:33)
at org.gradle.internal.dispatch.ProxyDispatchAdapter$DispatchingInvocationHandler.invoke(ProxyDispatchAdapter.java:94)
at jdk.proxy2/jdk.proxy2.$Proxy5.stop(Unknown Source)
at org.gradle.api.internal.tasks.testing.worker.TestWorker$3.run(TestWorker.java:193)
at org.gradle.api.internal.tasks.testing.worker.TestWorker.executeAndMaintainThreadName(TestWorker.java:129)
at org.gradle.api.internal.tasks.testing.worker.TestWorker.execute(TestWorker.java:100)
at org.gradle.api.internal.tasks.testing.worker.TestWorker.execute(TestWorker.java:60)
at org.gradle.process.internal.worker.child.ActionExecutionWorker.execute(ActionExecutionWorker.java:56)
at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:133)
at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:71)
at worker.org.gradle.process.internal.worker.GradleWorkerMain.run(GradleWorkerMain.java:69)
at worker.org.gradle.process.internal.worker.GradleWorkerMain.main(GradleWorkerMain.java:74)
Suppressed: java.lang.NoClassDefFoundError: org/bouncycastle/jce/provider/BouncyCastleProvider
at io.netty.handler.ssl.util.SelfSignedCertificate.<init>(SelfSignedCertificate.java:240)
... 52 more
Caused by: java.lang.ClassNotFoundException: org.bouncycastle.jce.provider.BouncyCastleProvider
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
... 53 more
Caused by: java.lang.IllegalAccessError: class io.netty.handler.ssl.util.OpenJdkSelfSignedCertGenerator (in unnamed module @0x531d72ca) cannot access class sun.security.x509.X509CertInfo (in module java.base) because module java.base does not export sun.security.x509 to unnamed module @0x531d72ca
at io.netty.handler.ssl.util.OpenJdkSelfSignedCertGenerator.generate(OpenJdkSelfSignedCertGenerator.java:52)
at io.netty.handler.ssl.util.SelfSignedCertificate.<init>(SelfSignedCertificate.java:246)
... 52 more
This issue is described in Netty's GitHub repo. Adding org.bouncycastle:bcpkix-jdk15on:1.65 to the dependencies solved it; a Gradle sketch follows.
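A minimal Gradle sketch of that dependency fix (the configuration name is an assumption; put it wherever your test runtime resolves dependencies):

```groovy
dependencies {
    // BouncyCastle provider so Netty can generate the self-signed test cert on JDK 17
    testRuntimeOnly "org.bouncycastle:bcpkix-jdk15on:1.65"
}
```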
DJLServing should produce a timestamp for each log line.
I load my model on djl-serving, and the parameters should be a float[4] array or a specific class.
In the demo I just override processInput() and processOutput(), because the input class is a customized class.
But in djl-serving I don't know how to parse the parameters into the Input. I saw that maybe some code should be configured in a .yml file.
So is there any instruction for parsing customized parameters into the input?
While packaging OpenJDK for Homebrew, it was noticed that DJL Serving's tarball, downloaded from https://publish.djl.ai/djl-serving/serving-0.21.0.tar, reported a different checksum. It used to be 8fa8afd1a4181fc55e6ad2cb31cea8ec07fc4ad5df135e62bd07106ce3fc6c80 as of 2023-02-26 03:36 UTC, but now it is 523c742f80fb277bfc7f8c3f706ede4b28fbc5d95851d526f63d7e6f02c6c423. May I confirm whether the tarball was re-uploaded? Thanks!
Expected behavior: the tarball checksum should match the one in our formula (package description).
See CI failure here:
==> Downloading https://publish.djl.ai/djl-serving/serving-0.21.0.tar
Downloaded to: /Users/brew/Library/Caches/Homebrew/downloads/acdf5ceb0cf03acc49888f36839af1c9e017be2fce0c48dc17d9641f01945263--serving-0.21.0.tar
SHA256: 523c742f80fb277bfc7f8c3f706ede4b28fbc5d95851d526f63d7e6f02c6c423
Warning: Formula reports different sha256: 8fa8afd1a4181fc55e6ad2cb31cea8ec07fc4ad5df135e62bd07106ce3fc6c80
How to reproduce:
```
$ curl -L https://publish.djl.ai/djl-serving/serving-0.21.0.tar | shasum -a256 -
523c742f80fb277bfc7f8c3f706ede4b28fbc5d95851d526f63d7e6f02c6c423  -
```
(Apologies if this is already supported; the docs are unclear/confusing.)
DJL Serving claims support for TensorRT models on https://github.com/deepjavalibrary/djl-serving, yet the DJL FAQ doesn't mention TensorRT support (https://djl.ai/docs/faq.html), which is confusing.
I have a PyTorch model that I'd like to run inference on. For memory and performance reasons I'd like to quantize it to at least fp16, ideally uint8. To avoid quantization issues (which I do hit if I just convert the weights to fp16 in my PyTorch -> TorchScript model), I need to apply post-training quantization with suitable calibration.
The only path I've found to actually do that with PyTorch is via TensorRT. However, that creates something that is neither a TorchScript nor a TensorRT model; it's instead some Torch-TensorRT hybrid. OK. Now, to deploy that monstrosity, their docs (https://pytorch.org/TensorRT/tutorials/runtime.html#runtime) claim that all you have to do is link in libtorchtrt_runtime.so, which is included in their C++ distribution. Great.
Is it possible to do that and use this workflow with DJL Serving? Has anyone done it?
Is there another (better) path to get quantized inference in DJL?
Thanks!
Deploying GPT-NeoX (https://huggingface.co/EleutherAI/gpt-neox-20b) on SageMaker is unsuccessful.
Expected behavior: successfully deploy the model to a SageMaker endpoint using the latest version of the Large Model Inference container (https://github.com/aws/deep-learning-containers/blob/master/available_images.md).
Error message:
[INFO ] PyProcess - [1,2]<stdout>: File "/root/.djl.ai/python/0.20.0/djl_python/deepspeed.py", line 207, in _validate_model_type_and_task
[INFO ] PyProcess - [1,2]<stdout>:ValueError: model_type: gpt_neox is not currently supported by DeepSpeed
How to reproduce: use the following configuration in serving.properties:
```
engine=DeepSpeed
option.entryPoint=djl_python.deepspeed
option.tensor_parallel_degree=8
option.model_id=EleutherAI/gpt-neox-20b
```
gpt-neox is listed in SUPPORTED_MODEL_TYPES: https://github.com/deepjavalibrary/djl-serving/blob/master/engines/python/setup/djl_python/deepspeed.py#L39. However, examining the model_type with the following code:
```python
from transformers import AutoConfig

model_config = AutoConfig.from_pretrained("EleutherAI/gpt-neox-20b")
print(model_config.model_type)
```
shows that it is actually gpt_neox (with an underscore).
At line 183 of streaming_utils.py: https://github.com/deepjavalibrary/djl-serving/blob/master/engines/python/setup/djl_python/streaming_utils.py#L183
Traceback (most recent call last):
  File "/home/ubuntu/models/linguist/djl-model/steaming_test.py", line 68, in <module>
    next_token_id = decoding_method(
  File "/home/ubuntu/models/linguist/djl-model/streaming_utils.py", line 189, in _sampling_decoding
    logits[-1:, :] = processors(input_ids, logits[-1:, :])
RuntimeError: Output 0 of SliceBackward0 is a view and is being modified inplace. This view is the output of a function that returns multiple views. Such functions do not allow the output views to be modified inplace. You should replace the inplace operation by an out-of-place one.
Replace line 183 with the following:
```python
_logits = logits.detach().clone()
_logits[-1:, :] = processors(input_ids, logits[-1:, :])
logits = _logits
```
serving.properties:
```
engine=Python
option.entryPoint=djl_python.huggingface
option.tensor_parallel_degree=4
option.dtype=fp16
option.model_id=huggyllama/llama-13b
```
Using the above setup and running a simple command:
```
curl -X POST "http://127.0.0.1:8080/predictions/test" \
  -H 'Content-Type: application/json' \
  -d '{"parameters":{"max_new_tokens": 256, "min_new_tokens": 256},
       "inputs":["Large Language model is"]}'
```
it fails with the error:
The following `model_kwargs` are not used by the model: ['token_type_ids']
If we change the lines https://github.com/deepjavalibrary/djl-serving/blob/master/engines/python/setup/djl_python/huggingface.py#L234-L235 to
output_tokens = model.generate(input_tokens.input_ids, **kwargs)
the problem is resolved. This might be a HuggingFace bug.
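A hedged sketch of the failure mode (tokenizer behavior varies by transformers version, so treat this as illustrative): the LLaMA tokenizer can emit token_type_ids in its encoding, and splatting the whole encoding into generate() passes that unused key through to the model, which rejects it. Passing input_ids explicitly, as in the change above, sidesteps it:

```python
# Sketch: inspect what the tokenizer emits and pass only what generate() uses.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-13b")
enc = tokenizer("Large Language model is", return_tensors="pt")
print(enc.keys())  # may include 'token_type_ids' depending on the version

# model.generate(**enc)               # can fail: unused model_kwargs
# model.generate(enc.input_ids, ...)  # works: only pass what the model uses
```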
It looks like there is a 0.20.0 release, but there is no downloadable artifact at https://publish.djl.ai/djl-serving/serving-0.20.0.tar. Raising this issue to confirm whether anything is missing in the release process. Thanks!
After specifying the datatype for model loading as
option.dtype=fp16
deepspeed.py is not picking it up.
Update the deprecated set-output usage in the GitHub Actions workflows for DJL, DJL Demo, and DJLServing.
Integration and performance tests periodically fail due to a hub cache bug in Transformers 4.27.x (huggingface/transformers#22427).
Expected: consistent downloads and runs of our integration and performance pipelines.