deepjavalibrary / djl-serving

A universal scalable machine learning model deployment solution

License: Apache License 2.0

Java 53.27% HTML 0.04% JavaScript 0.67% CSS 0.11% Dockerfile 1.29% Shell 0.92% Python 40.08% Vue 3.61% Less 0.01%
deep-learning deployment djl inference pytorch serving

djl-serving's People

Contributors

a-ys, alexkarezin, amazon-auto, bryanktliu, c007456, chen3933, davidthomas426, dependabot[bot], ethnzhng, frankfliu, hana-meister, jimburtoft, kexinfeng, lanking520, lokiiiiii, maaquib, marckarp, nskool, oyy2000, rohithkrn, siddvenk, sindhuvahinis, skirdey, tosterberg, xyang16, ydm-amazon, zachgk

djl-serving's Issues

Hope this project can have a login page

I'd like to use the UI control panel plugin, but it has no authentication, so I don't dare expose the UI page in a production environment.

Could you add a simple login page, even if it only supports a single admin account? I think it would be very helpful to the project.

Error reported on incoming file using djl-serving on Windows

I used the inference API to pass in a file and the following error occurred:

Caused by: java.lang.IllegalArgumentException: Malformed data
at ai.djl.ndarray.NDList.decode(NDList.java:124) ~[api-0.19.0.jar:?]
at ai.djl.ndarray.NDList.decode(NDList.java:85) ~[api-0.19.0.jar:?]
at ai.djl.modality.Input.getAsNDList(Input.java:328) ~[api-0.19.0.jar:?]
at ai.djl.modality.Input.getDataAsNDList(Input.java:198) ~[api-0.19.0.jar:?]
at ai.djl.translate.NoopServingTranslatorFactory$NoopServingTranslator.processInput(NoopServingTranslatorFactory.java:68) ~[api-0.19.0.jar:?]
... 8 more
Caused by: java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:202) ~[?:?]
at java.io.DataInputStream.readFully(DataInputStream.java:170) ~[?:?]
at ai.djl.ndarray.NDList.decode(NDList.java:99) ~[api-0.19.0.jar:?]
at ai.djl.ndarray.NDList.decode(NDList.java:85) ~[api-0.19.0.jar:?]
at ai.djl.modality.Input.getAsNDList(Input.java:328) ~[api-0.19.0.jar:?]
at ai.djl.modality.Input.getDataAsNDList(Input.java:198) ~[api-0.19.0.jar:?]
at ai.djl.translate.NoopServingTranslatorFactory$NoopServingTranslator.processInput(NoopServingTranslatorFactory.java:68) ~[api-0.19.0.jar:?]

Sharded GPT2 model cannot be loaded with DeepSpeed

Using DeepSpeed AOT to partition the GPT2 model works fine, but loading the partitioned model fails:

assert self.ckpt_load_enabled, "Meta tensors are not supported for this model currently."
  1. The partition should fail if the model is not supported
  2. If the model doesn't support AOT with DeepSpeed, we should default to FasterTransformer

Streaming Llama Model Issue

Description

Getting the output below from the streaming utils. As you can see, there is a space between "design" and "ing":

design ing , developing , testing , and maintain ing software

Expected Behavior

There should not be any extra spaces. I am using a LLaMA + LoRA model.

Error Message

Wrong Result

How to Reproduce?

    generator = stream_generator(model, tokenizer, prompt, **generate_kwargs)
    generated = ""
    for text in generator:
        generated += ' ' + text[0]
        paginator.add_cache(session_id, generated)
    paginator.add_cache(session_id, generated + "<eos>")

docker pull is very slow

Description

(A clear and concise description of what the feature is.)

Will this change the current api? How?

Who will benefit from this enhancement?

References

  • list reference and related literature
  • list known implementations

Saved DeepSpeed sharded model does not support streaming

In the saved partition, the model's config.json has been changed; it differs from the original config.json:

{
  "_name_or_path": "bigscience/bloom-1b1",
  "apply_residual_connection_post_layernorm": false,
  "architectures": [
    "BloomModel"
  ],
  "attention_dropout": 0.0,
  "attention_softmax_in_fp32": true,
  "bias_dropout_fusion": true,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_dropout": 0.0,
  "hidden_size": 1536,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "masked_softmax_fusion": true,
  "model_type": "bloom",
  "n_head": 16,
  "n_inner": null,
  "n_layer": 24,
  "offset_alibi": 100,
  "pad_token_id": 3,
  "pretraining_tp": 1,
  "skip_bias_add": true,
  "skip_bias_add_qkv": false,
  "slow_but_exact": false,
  "torch_dtype": "float32",
  "transformers_version": "4.27.1",
  "unk_token_id": 0,
  "use_cache": true,
  "vocab_size": 250880
}

CUDA error when using huggingface streaming

serving.properties:

option.model_id=EleutherAI/gpt-neo-1.3B
option.task=text-generation
option.tensor_parallel_degree=2
option.dtype=fp16
#option.enable_streaming=true
option.enable_streaming=huggingface
engine=DeepSpeed
option.parallel_loading=true

curl command:

curl -X POST "http://localhost:8080/invocations" \
     -H "content-type: application/json" \
     -d '{"inputs": ["Large language model is"], "parameters": {"max_length" :25}}' 

{"outputs": ["Large language model is"]}

{"outputs": "CUDA error: an illegal memory access was encountered\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n"}

LMI container not compatible with Multi-container endpoint of SageMaker

Description

The DJL containers should include the Docker label required to enable them for multi-container endpoints in SageMaker.
Currently it is not set, and users get the error below:

An error occurred (ValidationException) when calling the CreateModel operation: Your Ecr Image 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.21.0-fastertransformer5.3.0-cu117 does not contain required com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true Docker label(s).

(A clear and concise description of what the feature is.)

Will this change the current api? How?
Change the Dockerfile and add the label.

Who will benefit from this enhancement?
All SageMaker customers

References

  • list reference and related literature
  • list known implementations

Python engine batching functionality does not work

Description

We have deployed a FLAN-T5 model on NVIDIA GPU infrastructure with the following serving.properties:

engine=Python
option.entryPoint=djl_python.deepspeed
option.task=text2text-generation
option.dtype=int8
option.device_map=balanced
batch_size=2
max_batch_delay=1

The model works fine for a single request, but with concurrent users it starts throwing an HTTP 400 error.

Expected Behavior

Dynamic batching should be supported by DJL Serving.

Error Message

{
"code": 400,
"type": "TranslateException",
"message": "Batch output size mismatch, expected: 2, actual: 1"
}

How to Reproduce?

(If you developed your own code, please provide a short script that reproduces the error. For existing examples, please provide link.)

Steps to reproduce

(Paste the commands you ran that produced the error.)

  1. Download the flan-t5-xl model from the Hugging Face repository to local disk
  2. Update serving.properties
  3. Deploy deepjavalibrary/djl-serving:0.22.1-deepspeed on the Kubernetes environment
  4. Set the container command in deployment.yaml to djl-serving -m /data/flanT5
  5. Load test with Apache JMeter scripts

What have you tried to solve it?

  1. Replaced the DeepSpeed entry point script with the latest from the main branch
  2. Used the Hugging Face entry point script

Cannot run ./gradlew FJ or build with JDK 17.0.4

Description

(A clear and concise description of what the bug is.)
Cannot run ./gradlew FJ or build with JDK 17.0.4 in the /serving directory.
Both Yang and Sindhu have run into this.

Expected Behavior

Runs successfully (screenshot omitted).

Error Message

(Screenshots of the error omitted.)

How to Reproduce?

In the master branch, run ./gradlew FJ or build under /djl-serving/serving.

Steps to reproduce

  1. cd /serving
  2. ./gradlew FJ or build

What have you tried to solve it?

  1. Changed the Java version to 11.0.16.

The model file with the same name does not reload

If you replace an old model file with a new model zip file of the same name, the old model is still registered again, even after deregistering it or restarting the server. You have to change the model file name before loading the new model.

Direct huggingface download to tmp

Description

Hugging Face repos download to the home directory by default, which has limited space on SageMaker.

env = {"HUGGINGFACE_HUB_CACHE": "/tmp", "TRANSFORMERS_CACHE": "/tmp"}

Let's set these two environment variables in the containers we build.
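
In the meantime, a custom entry point can apply the same workaround itself. A minimal sketch (these are the standard Hugging Face cache variables; they must be set before transformers or huggingface_hub is imported):

import os

# Redirect the Hugging Face caches to /tmp before any transformers import.
os.environ.setdefault("HUGGINGFACE_HUB_CACHE", "/tmp")
os.environ.setdefault("TRANSFORMERS_CACHE", "/tmp")

from transformers import AutoModelForCausalLM  # imported after the cache is redirected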

MosaicML/MPT7b model not working on DJLServing handler

Description

import transformers
model = transformers.AutoModelForCausalLM.from_pretrained(
  'mosaicml/mpt-7b',
  trust_remote_code=True
)

The model is adding a new option called

trust_remote_code=True

This needs to be enabled to run inference.
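
For reference, a minimal inference sketch with the flag enabled (my own sketch, not the DJLServing handler; the fp16 dtype and single-GPU placement are assumptions):

# Load MPT-7B with trust_remote_code=True and run a single generation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mosaicml/mpt-7b")
model = AutoModelForCausalLM.from_pretrained(
    "mosaicml/mpt-7b",
    trust_remote_code=True,      # required because MPT ships custom modeling code
    torch_dtype=torch.float16,
).to("cuda")

inputs = tokenizer("Large language model is", return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))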

Please add unit suffixes to relevant variables/methods

In e.g.

    public ModelInfo(
            String id,
            String modelUrl,
            String version,
            String engineName,
            Class<I> inputClass,
            Class<O> outputClass,
            int queueSize,
            int maxIdleTime,
            int maxBatchDelay,
            int batchSize) {

there's no indication what the unit of time or delay is for maxIdleTime or maxBatchDelay. This makes it frustrating to use the library as I have to click through a bunch of source code to see actual usage to infer what's intended.

It makes the library a lot easier to use if cases like that are named e.g. maxIdleTimeSecs and maxBatchDelayMillis.

This appears pervasive - e.g. Job.getBegin (which should probably be removed as System.nanoTime() isn't absolute?) and Job.getWaitingTime (-> Job.getWaitingTimeMicrosecs)

Time can usually be omitted if it's obvious, e.g. Job.getWaitingMicrosecs, maxIdleSecs.

DeepSpeed streaming, max_length is ignored

serving.properties:

option.model_id=EleutherAI/gpt-neo-1.3B
option.task=text-generation
option.tensor_parallel_degree=2
option.dtype=fp16
option.enable_streaming=true
#option.enable_streaming=huggingface
engine=DeepSpeed
option.parallel_loading=true

curl command:

curl -X POST "http://localhost:8080/invocations" \
     -H "content-type: application/json" \
     -d '{"inputs": ["Large language model is"], "parameters": {"max_length" :2}}' 

Expected to return 2 new tokens, but 50 tokens are returned

[Handler] Pipeline parallelism

Description

In the current HuggingFace Accelerate implementation, we only use pipeline parallelism, not tensor parallelism. However, we still require the user to pass in tensor_parallel_degree, which doesn't make much sense. We should offer pipeline_parallel_degree to address this.

djl-serving language support

When I used djl-serving 0.19.0, I found that the model's Chinese prediction results were garbled.

I found that the Ubuntu image does not support Chinese, so I added the installation of language-pack-zh-hans to install_djl_serving.sh and added

RUN localedef -c -f UTF-8 -i zh_CN zh_CN.utf8
ENV LC_ALL zh_CN.UTF-8

to the Dockerfile to build a new image. After that, prediction results in Chinese work.

Is my approach correct? Or do the prediction results already support Chinese and the problem is with my model?

Or do you intend to support Chinese prediction results out of the box?

Is there any example about this "map" operation?

As a final example, here is one that features a more complicated interaction. The human detection model will find all of the humans in an image. Then, the "splitHumans" function will turn all of them into separate images that can be treated as a list. The "map" will apply the "poseEstimation" model to each of the detected humans in the list.

workflow:
  humans: ["splitHumans", ["humanDetection", "in"]]
  out: ["map", "poseEstimation", "humans"]

First ModelInfo constructor leaves queueSize etc uninitialized

https://github.com/deepjavalibrary/djl-serving/blob/master/wlm/src/main/java/ai/djl/serving/wlm/ModelInfo.java#L75-L80 currently reads:

    public ModelInfo(String modelUrl, Class<I> inputClass, Class<O> outputClass) {
        this.id = modelUrl;
        this.modelUrl = modelUrl;
        this.inputClass = inputClass;
        this.outputClass = outputClass;
    }

This is missing the default initialization of queueSize et al present in the Criteria-constructor just below ( https://github.com/deepjavalibrary/djl-serving/blob/master/wlm/src/main/java/ai/djl/serving/wlm/ModelInfo.java#L88-L99 ):

       WlmConfigManager config = WlmConfigManager.getInstance();
       queueSize = config.getJobQueueSize();
       maxIdleTime = config.getMaxIdleTime();
       batchSize = config.getBatchSize();
       maxBatchDelay = config.getMaxBatchDelay();

This makes the first constructor not very useful?

Support setting option.entryPoint to a URL

Description

Allow a URL that points to a model.py to be used as the entryPoint.

Then users could provide any model.py URL for model deployment.

95%.......

In the central module, I just run
./gradlew run

but the terminal shows this:

Task :central:buildReactApp
asset main.js 2.06 MiB [compared for emit] (name: main) 1 related asset
orphan modules 78.8 KiB [orphan] 83 modules
runtime modules 972 bytes 5 modules
modules by path ./node_modules/ 1.68 MiB 240 modules
modules by path ./src/main/webapp/ 41.2 KiB
modules by path ./src/main/webapp/components/ 38 KiB
modules by path ./src/main/webapp/components/modelpanels/ 6.34 KiB 5 modules
modules by path ./src/main/webapp/components/*.jsx 11.5 KiB 3 modules
modules by path ./src/main/webapp/components/TabPanel/ 4.85 KiB 2 modules
+ 1 module
modules by path ./src/main/webapp/css/ 2.15 KiB
./src/main/webapp/css/useStyles.jsx 798 bytes [built] [code generated]
./src/main/webapp/css/style.css 537 bytes [built] [code generated]
./node_modules/css-loader/dist/cjs.js!./src/main/webapp/css/style.css 864 bytes [built] [code generated]
./src/main/webapp/Main.jsx 1.07 KiB [built] [code generated]
webpack 5.74.0 compiled successfully in 2680 ms

Task :central:run
Listening for transport dt_socket at address: 4000
[INFO ] - [id: 0xfb7e5f31] REGISTERED
[INFO ] - [id: 0xfb7e5f31] BIND: 0.0.0.0/0.0.0.0:8080
[INFO ] - [id: 0xfb7e5f31, L:/[0:0:0:0:0:0:0:0]:8080] ACTIVE
[INFO ] - [id: 0xfb7e5f31, L:/[0:0:0:0:0:0:0:0]:8080] READ: [id: 0xfd448de3, L:/[0:0:0:0:0:0:0:1]:8080 - R:/[0:0:0:0:0:0:0:1]:50076]
[INFO ] - [id: 0xfb7e5f31, L:/[0:0:0:0:0:0:0:0]:8080] READ: [id: 0x5fb3bd15, L:/[0:0:0:0:0:0:0:1]:8080 - R:/[0:0:0:0:0:0:0:1]:50077]
<============-> 95% EXECUTING [3m 33s]
<============-> 95% EXECUTING [1m 36s]
:central:run

Then I visit http://localhost:8080/, but there is no response and it just keeps waiting.

MME model dependency conflict scenario

Description

Different models may require different package versions. Protobuf in particular pins strict versions for different models, so users may need to install different pip wheels within a single environment. We need to find a way to address this. Ideally, the user could specify something like

option.python_path=/path/to/python

DeepSpeed (streaming or not), segmentation fault if content-type is text/plain

Sometimes the MPI process hangs and the Python process cannot be restarted.

serving.properties:

option.model_id=EleutherAI/gpt-neo-1.3B
option.task=text-generation
option.tensor_parallel_degree=2
option.dtype=fp16
option.enable_streaming=true
engine=DeepSpeed
option.parallel_loading=true

curl command:

curl -X POST "http://localhost:8080/invocations" \
    -H "content-type: text/plain" \
    -d "Large language model is"

WARN  PyProcess Primary job  terminated normally, but 1 process returned
WARN  PyProcess a non-zero exit code. Per user-direction, the job has been aborted.

[Not DJLServing]HFPipeline error - GPT Neox

Description

This is not actually an error in DJLServing; just tracking it here. Will raise an issue in HF as well.
The HF pipeline tries to generate the outputs on CPU despite device_map=auto being set for the GPT-NeoX 20B model.

The workaround is to use the model.generate method and manually move the input_ids to the GPU.

Error Message

 Bug: RuntimeError: "topk_cpu" not implemented for 'Half'

How to reproduce?

Try the GPT-NeoX 20B model with our huggingface.py handler.

This was also reported as issues in transformers:

huggingface/transformers#18703
huggingface/transformers#19445
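
For reference, a minimal sketch of the workaround described above (my own sketch, not the handler code; it assumes enough GPU memory for the device_map="auto" placement):

# Skip the pipeline and call model.generate directly with input_ids moved to GPU,
# so sampling ops such as top-k run on GPU instead of CPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-neox-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.float16
)

input_ids = tokenizer("Large language model is", return_tensors="pt").input_ids.to("cuda")
output_ids = model.generate(input_ids, do_sample=True, top_k=50, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))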

Rename TensorParallelDegree

Description

Customers are confused by TensorParallelDegree when using HuggingFace models. Maybe we can permanently rename it to ModelParallelDegree / model_parallel_degree to clarify.

Unable to generate self-signed certificate in jdk 17

Description

Generating a self-signed certificate throws an error with JDK 17. Because of this, the Gradle build fails, as the following two tests fail:
serving/build/reports/tests/test/classes/ai.djl.serving.ModelServerTest.html#test
serving/build/reports/tests/test/classes/ai.djl.serving.ModelServerTest.html#testWorkflows

Error Message

Stack trace of the error message:

java.security.cert.CertificateException: No provider succeeded to generate a self-signed certificate. See debug log for the root cause.
	at io.netty.handler.ssl.util.SelfSignedCertificate.<init>(SelfSignedCertificate.java:249)
	at io.netty.handler.ssl.util.SelfSignedCertificate.<init>(SelfSignedCertificate.java:166)
	at io.netty.handler.ssl.util.SelfSignedCertificate.<init>(SelfSignedCertificate.java:115)
	at io.netty.handler.ssl.util.SelfSignedCertificate.<init>(SelfSignedCertificate.java:90)
	at ai.djl.serving.util.ConfigManager.getSslContext(ConfigManager.java:385)
	at ai.djl.serving.ConfigManagerTest.testSsl(ConfigManagerTest.java:46)
	at ai.djl.serving.ModelServerTest.test(ModelServerTest.java:291)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at org.testng.internal.invokers.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:135)
	at org.testng.internal.invokers.TestInvoker.invokeMethod(TestInvoker.java:673)
	at org.testng.internal.invokers.TestInvoker.invokeTestMethod(TestInvoker.java:220)
	at org.testng.internal.invokers.MethodRunner.runInSequence(MethodRunner.java:50)
	at org.testng.internal.invokers.TestInvoker$MethodInvocationAgent.invoke(TestInvoker.java:945)
	at org.testng.internal.invokers.TestInvoker.invokeTestMethods(TestInvoker.java:193)
	at org.testng.internal.invokers.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:146)
	at org.testng.internal.invokers.TestMethodWorker.run(TestMethodWorker.java:128)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
	at org.testng.TestRunner.privateRun(TestRunner.java:808)
	at org.testng.TestRunner.run(TestRunner.java:603)
	at org.testng.SuiteRunner.runTest(SuiteRunner.java:429)
	at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:423)
	at org.testng.SuiteRunner.privateRun(SuiteRunner.java:383)
	at org.testng.SuiteRunner.run(SuiteRunner.java:326)
	at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:52)
	at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:95)
	at org.testng.TestNG.runSuitesSequentially(TestNG.java:1249)
	at org.testng.TestNG.runSuitesLocally(TestNG.java:1169)
	at org.testng.TestNG.runSuites(TestNG.java:1092)
	at org.testng.TestNG.run(TestNG.java:1060)
	at org.gradle.api.internal.tasks.testing.testng.TestNGTestClassProcessor.runTests(TestNGTestClassProcessor.java:141)
	at org.gradle.api.internal.tasks.testing.testng.TestNGTestClassProcessor.stop(TestNGTestClassProcessor.java:90)
	at org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.stop(SuiteTestClassProcessor.java:61)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:36)
	at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
	at org.gradle.internal.dispatch.ContextClassLoaderDispatch.dispatch(ContextClassLoaderDispatch.java:33)
	at org.gradle.internal.dispatch.ProxyDispatchAdapter$DispatchingInvocationHandler.invoke(ProxyDispatchAdapter.java:94)
	at jdk.proxy2/jdk.proxy2.$Proxy5.stop(Unknown Source)
	at org.gradle.api.internal.tasks.testing.worker.TestWorker$3.run(TestWorker.java:193)
	at org.gradle.api.internal.tasks.testing.worker.TestWorker.executeAndMaintainThreadName(TestWorker.java:129)
	at org.gradle.api.internal.tasks.testing.worker.TestWorker.execute(TestWorker.java:100)
	at org.gradle.api.internal.tasks.testing.worker.TestWorker.execute(TestWorker.java:60)
	at org.gradle.process.internal.worker.child.ActionExecutionWorker.execute(ActionExecutionWorker.java:56)
	at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:133)
	at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:71)
	at worker.org.gradle.process.internal.worker.GradleWorkerMain.run(GradleWorkerMain.java:69)
	at worker.org.gradle.process.internal.worker.GradleWorkerMain.main(GradleWorkerMain.java:74)
	Suppressed: java.lang.NoClassDefFoundError: org/bouncycastle/jce/provider/BouncyCastleProvider
		at io.netty.handler.ssl.util.SelfSignedCertificate.<init>(SelfSignedCertificate.java:240)
		... 52 more
	Caused by: java.lang.ClassNotFoundException: org.bouncycastle.jce.provider.BouncyCastleProvider
		at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641)
		at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
		at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
		... 53 more
Caused by: java.lang.IllegalAccessError: class io.netty.handler.ssl.util.OpenJdkSelfSignedCertGenerator (in unnamed module @0x531d72ca) cannot access class sun.security.x509.X509CertInfo (in module java.base) because module java.base does not export sun.security.x509 to unnamed module @0x531d72ca
	at io.netty.handler.ssl.util.OpenJdkSelfSignedCertGenerator.generate(OpenJdkSelfSignedCertGenerator.java:52)
	at io.netty.handler.ssl.util.SelfSignedCertificate.<init>(SelfSignedCertificate.java:246)
	... 52 more

This issue is described in Netty's GitHub repo.

What have you tried to solve it?

  1. Adding org.bouncycastle:bcpkix-jdk15on:1.65 to the dependencies solved it.
  2. The error does not occur with Java 11.

Is there any instruction for parsing customized parameters of the REST API?

I load my model on djl-serving, and the parameters should be a float[4] array or a specific class.

In the demo I just override processInput() and processOutput(), because the input class is a customized class.

But in djl-serving I don't know how to parse the parameters into the "input". I saw that maybe some code needs to be configured in a .yml file.

So is there any instruction for parsing customized parameters into the input?
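
One possible approach is a Python entry point (model.py) instead of a Java Translator. A minimal sketch, assuming the djl_python handler convention (handle(inputs) with the Input/Output helpers) and a JSON request body like {"data": [0.1, 0.2, 0.3, 0.4]}:

# Hypothetical custom model.py: parse a JSON body into a float array and echo it back.
import numpy as np
from djl_python import Input, Output


def handle(inputs: Input):
    if inputs.is_empty():
        return None                                 # warm-up / ping request

    payload = inputs.get_as_json()                  # assumes content-type: application/json
    data = np.asarray(payload["data"], dtype=np.float32)  # the float[4] parameters

    # ... run inference with `data` here; the result below is just a placeholder
    result = {"received": data.tolist()}
    return Output().add_as_json(result)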

Was the 0.21.0 source tarball updated?

Description

(A clear and concise description of what the bug is.)

While packaging OpenJDK for Homebrew, we noticed that DJL Serving's tarball, downloaded from https://publish.djl.ai/djl-serving/serving-0.21.0.tar, reports a different checksum. It used to be 8fa8afd1a4181fc55e6ad2cb31cea8ec07fc4ad5df135e62bd07106ce3fc6c80 as of 2023-02-26 03:36 UTC, but now it is 523c742f80fb277bfc7f8c3f706ede4b28fbc5d95851d526f63d7e6f02c6c423. Could you confirm whether the tarball was re-uploaded? Thanks!

Expected Behavior

Tarball checksum should match the one in our formula (package description).

Error Message

(Paste the complete error message, including stack trace.)

See CI failure here:

  ==> Downloading https://publish.djl.ai/djl-serving/serving-0.21.0.tar
  Downloaded to: /Users/brew/Library/Caches/Homebrew/downloads/acdf5ceb0cf03acc49888f36839af1c9e017be2fce0c48dc17d9641f01945263--serving-0.21.0.tar
  SHA256: 523c742f80fb277bfc7f8c3f706ede4b28fbc5d95851d526f63d7e6f02c6c423
  Warning: Formula reports different sha256: 8fa8afd1a4181fc55e6ad2cb31cea8ec07fc4ad5df135e62bd07106ce3fc6c80

How to Reproduce?

(If you developed your own code, please provide a short script that reproduces the error. For existing examples, please provide link.)

$ curl -L https://publish.djl.ai/djl-serving/serving-0.21.0.tar | shasum -a256 -
523c742f80fb277bfc7f8c3f706ede4b28fbc5d95851d526f63d7e6f02c6c423  -

Steps to reproduce

(Paste the commands you ran that produced the error.)

See above.

What have you tried to solve it?

N/A.

Support Torch-TensorRT / quantized inference for pytorch models?

(Apologies if this is already supported, the docs are unclear/confusing)

DJL Serving claims support for TensorRT models at https://github.com/deepjavalibrary/djl-serving

DJL FAQ doesn't mention support for TensorRT: https://djl.ai/docs/faq.html - ??

I have a pytorch model that I'd like to run inference on. For memory & performance reasons I'd like to quantize it to at least fp16, ideally uint8. To not run into quantization issues (which I do if I just change the format of the weights to fp16 in my pytorch -> torchscript model) I need to apply Post-Training-Quantization with suitable calibration.

The only path I've found to actually do that with pytorch is via TensorRT:

  1. https://developer.nvidia.com/tensorrt
  2. https://developer.nvidia.com/blog/accelerating-inference-up-to-6x-faster-in-pytorch-with-torch-tensorrt/
  3. https://pytorch.org/TensorRT/tutorials/ptq.html#ptq

However, that creates something that is neither a torchscript nor a TensorRT model, it's instead some Torch-TensorRT hybrid. Ok. Now, to deploy that monstrosity their docs ( https://pytorch.org/TensorRT/tutorials/runtime.html#runtime ) claim that all you have to do is link in libtorchtrt_runtime.so that's included in their C++ distribution. Great.

Is it possible to do that and use this workflow with DJL serving? Has anyone done it?

Is there another (better) path to get quantized inference in DJL?

Thanks!
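
For context, here is roughly what the Torch-TensorRT compile step looks like (my own sketch, not a confirmed DJL Serving workflow; the input shape and fp16-only precision are placeholder assumptions):

# Compile a TorchScript model to a Torch-TensorRT module at fp16 and save it as a
# regular TorchScript file; the saved artifact embeds the TensorRT engine.
import torch
import torch_tensorrt

model = torch.jit.load("model.pt").eval().cuda()

trt_model = torch_tensorrt.compile(
    model,
    inputs=[torch_tensorrt.Input((1, 3, 224, 224), dtype=torch.half)],
    enabled_precisions={torch.half},   # int8 would additionally need a PTQ calibrator
)

# Still TorchScript on the outside, so in principle it loads like any torchscript model,
# provided libtorchtrt_runtime.so is available at load time.
torch.jit.save(trt_model, "model_trt.pt")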

Unable to deploy GPT-Neox using Large Model inference container

Description

(A clear and concise description of what the bug is.)
Deploying GPT-Neox (https://huggingface.co/EleutherAI/gpt-neox-20b) on SageMaker is unsuccessful.

Expected Behavior

(what's the expected behavior?)
Successfully deploy the model to a SageMaker endpoint using the latest version of the Large Model Inference container. (https://github.com/aws/deep-learning-containers/blob/master/available_images.md)

Error Message

(Paste the complete error message, including stack trace.)
[INFO ] PyProcess - [1,2]<stdout>: File "/root/.djl.ai/python/0.20.0/djl_python/deepspeed.py", line 207, in _validate_model_type_and_task
[INFO ] PyProcess - [1,2]<stdout>:ValueError: model_type: gpt_neox is not currently supported by DeepSpeed

How to Reproduce?

(If you developed your own code, please provide a short script that reproduces the error. For existing examples, please provide link.)

Using the following configuration in serving.properties

engine = DeepSpeed
option.entryPoint=djl_python.deepspeed
option.tensor_parallel_degree=8
option.model_id=EleutherAI/gpt-neox-20b

Steps to reproduce

(Paste the commands you ran that produced the error.)

  1. Create a SageMaker model and model configuration and then create an endpoint.

What have you tried to solve it?

gpt-neox is listed in the SUPPORTED_MODEL_TYPES - https://github.com/deepjavalibrary/djl-serving/blob/master/engines/python/setup/djl_python/deepspeed.py#L39.

Examining the model_type using the following code -

from transformers import AutoConfig
model_config = AutoConfig.from_pretrained("EleutherAI/gpt-neox-20b")
model_config.model_type

shows that it is probably gpt_neox.

Bug in streaming

Description

At line 183 of streaming_utils.py: https://github.com/deepjavalibrary/djl-serving/blob/master/engines/python/setup/djl_python/streaming_utils.py#L183

Error Message

Traceback (most recent call last):
File "/home/ubuntu/models/linguist/djl-model/steaming_test.py", line 68, in
next_token_id = decoding_method(
File "/home/ubuntu/models/linguist/djl-model/streaming_utils.py", line 189, in _sampling_decoding
logits[-1:, :] = processors(input_ids, logits[-1:, :])
RuntimeError: Output 0 of SliceBackward0 is a view and is being modified inplace. This view is the output of a function that returns multiple views. Such functions do not allow the output views to be modified inplace. You should replace the inplace operation by an out-of-place one.

What have you tried to solve it?

Replace line 183 with the following:

_logits = logits.detach().clone()
_logits[-1:, :] = processors(input_ids, logits[-1:, :])
logits = _logits

HuggingFace default handler could not run LLaMA model

Description

engine=Python
option.entryPoint=djl_python.huggingface
option.tensor_parallel_degree=4
option.dtype=fp16
option.model_id=huggyllama/llama-13b

Using the above setup, running a simple command:

curl -X POST "http://127.0.0.1:8080/predictions/test" \
     -H 'Content-Type: application/json' \
     -d '{"parameters":{"max_new_tokens": 256, "min_new_tokens": 256},
          "inputs":["Large Language model is"]
          }'

results in the error:

The following `model_kwargs` are not used by the model: ['token_type_ids']

If we change the lines https://github.com/deepjavalibrary/djl-serving/blob/master/engines/python/setup/djl_python/huggingface.py#L234-L235 to

output_tokens = model.generate(input_tokens.input_ids, **kwargs)

the problem is resolved. This might be a Hugging Face bug.
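
An alternative client-side workaround (a sketch using transformers directly, outside the DJLServing handler) is to drop token_type_ids from the tokenized inputs before calling generate:

# Some tokenizers (as with this LLaMA checkpoint) emit token_type_ids that
# model.generate() rejects; popping the key avoids the model_kwargs error.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "huggyllama/llama-13b"   # same checkpoint as in the serving.properties above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

encoded = tokenizer("Large Language model is", return_tensors="pt").to(model.device)
encoded.pop("token_type_ids", None)   # drop the kwarg the model does not accept
output = model.generate(**encoded, max_new_tokens=256, min_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))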

About release 0.20.0

It looks like there is a 0.20.0 release, but there is no downloadable artifact at https://publish.djl.ai/djl-serving/serving-0.20.0.tar. Raising this issue to confirm whether anything is missing in the release process. Thanks!
