deepjavalibrary / djl-serving
A universal scalable machine learning model deployment solution.
License: Apache License 2.0
As titled. I'd like to use the UI control panel plugin, but it has no security verification, so I don't dare to expose the UI page in a production environment.
Could you add a simple login page, even if it only supports a single admin account? I think it would be very helpful to the project.
As titled. I used the inference API to pass in a file, and the following error occurred:
Caused by: java.lang.IllegalArgumentException: Malformed data
at ai.djl.ndarray.NDList.decode(NDList.java:124) ~[api-0.19.0.jar:?]
at ai.djl.ndarray.NDList.decode(NDList.java:85) ~[api-0.19.0.jar:?]
at ai.djl.modality.Input.getAsNDList(Input.java:328) ~[api-0.19.0.jar:?]
at ai.djl.modality.Input.getDataAsNDList(Input.java:198) ~[api-0.19.0.jar:?]
at ai.djl.translate.NoopServingTranslatorFactory$NoopServingTranslator.processInput(NoopServingTranslatorFactory.java:68) ~[api-0.19.0.jar:?]
... 8 more
Caused by: java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:202) ~[?:?]
at java.io.DataInputStream.readFully(DataInputStream.java:170) ~[?:?]
at ai.djl.ndarray.NDList.decode(NDList.java:99) ~[api-0.19.0.jar:?]
at ai.djl.ndarray.NDList.decode(NDList.java:85) ~[api-0.19.0.jar:?]
at ai.djl.modality.Input.getAsNDList(Input.java:328) ~[api-0.19.0.jar:?]
at ai.djl.modality.Input.getDataAsNDList(Input.java:198) ~[api-0.19.0.jar:?]
at ai.djl.translate.NoopServingTranslatorFactory$NoopServingTranslator.processInput(NoopServingTranslatorFactory.java:68) ~[api-0.19.0.jar:?]
For offline (air-gapped) installations, we should run
pip install -r requirements.txt --no-deps
instead of
pip install -r requirements.txt
This prevents pip from trying to resolve the wheels' dependencies over the network.
Using DeepSpeed AOT to partition the GPT-2 model works fine, but loading the partitioned model fails:
assert self.ckpt_load_enabled, "Meta tensors are not supported for this model currently."
Getting the output below from the streaming utils. As you can see, there is a space between "design" and "ing":
design ing , developing , testing , and maintain ing software
There should not be any space. I am using a LLaMA+LoRA model.
Wrong result:
```python
generator = stream_generator(model, tokenizer, prompt, **generate_kwargs)
generated = ""
for text in generator:
    generated += ' ' + text[0]
    paginator.add_cache(session_id, generated)
paginator.add_cache(session_id, generated + "<eos>")
```
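A hedged aside on the snippet above (reusing its names): if the streamed chunks are detokenized subword pieces that already carry their own leading whitespace, then joining them with an extra ' ' would itself split words like "designing" into "design ing". A minimal sketch under that assumption:

```python
# Sketch, assuming each streamed chunk already includes any leading
# whitespace it needs; append chunks verbatim instead of space-joining.
generated = ""
for text in generator:
    generated += text[0]  # no extra ' ' between subword pieces
    paginator.add_cache(session_id, generated)
paginator.add_cache(session_id, generated + "<eos>")
```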
In the saved partitioned model, the config.json file has been changed; it differs from the original config.json:
```json
{
  "_name_or_path": "bigscience/bloom-1b1",
  "apply_residual_connection_post_layernorm": false,
  "architectures": [
    "BloomModel"
  ],
  "attention_dropout": 0.0,
  "attention_softmax_in_fp32": true,
  "bias_dropout_fusion": true,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_dropout": 0.0,
  "hidden_size": 1536,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "masked_softmax_fusion": true,
  "model_type": "bloom",
  "n_head": 16,
  "n_inner": null,
  "n_layer": 24,
  "offset_alibi": 100,
  "pad_token_id": 3,
  "pretraining_tp": 1,
  "skip_bias_add": true,
  "skip_bias_add_qkv": false,
  "slow_but_exact": false,
  "torch_dtype": "float32",
  "transformers_version": "4.27.1",
  "unk_token_id": 0,
  "use_cache": true,
  "vocab_size": 250880
}
```
https://huggingface.co/togethercomputer/RedPajama-INCITE-Base-3B-v1
This model, tested on a p3.8xlarge with TP4, has weird NCCL issues. TP1 doesn't help either and throws a cuBLAS handle error.
serving.properties:
```
option.model_id=EleutherAI/gpt-neo-1.3B
option.task=text-generation
option.tensor_parallel_degree=2
option.dtype=fp16
#option.enable_streaming=true
option.enable_streaming=huggingface
engine=DeepSpeed
option.parallel_loading=true
```
curl command:
```
curl -X POST "http://localhost:8080/invocations" \
  -H "content-type: application/json" \
  -d '{"inputs": ["Large language model is"], "parameters": {"max_length": 25}}'
```
Output:
{"outputs": ["Large language model is"]}
{"outputs": "CUDA error: an illegal memory access was encountered\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n"}
If you first generate a single batch with a large sequence length (e.g. 2048), and afterwards try a batch size of 4, it fails with the error above.
Send a failure signal back to SageMaker.
The DJL containers should carry the Docker label that enables them for multi-container endpoints in SageMaker.
Currently the label is not set, and users get the error below:
An error occurred (ValidationException) when calling the CreateModel operation: Your Ecr Image 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.21.0-fastertransformer5.3.0-cu117 does not contain required com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true Docker label(s).
Will this change the current API? How? Change the Dockerfile and add the label.
Who will benefit from this enhancement? All SageMaker customers.
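A minimal sketch of the fix, assuming a single LABEL instruction in our Dockerfiles is sufficient (the label name and value come straight from the error message above):

```dockerfile
# Required so SageMaker accepts the container for multi-container endpoints.
LABEL com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true
```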
We have deployed a FLAN-T5 model on NVIDIA GPU infrastructure with the following serving.properties:
```
engine=Python
option.entryPoint=djl_python.deepspeed
option.task=text2text-generation
option.dtype=int8
option.device_map=balanced
batch_size=2
max_batch_delay=1
```
The model works fine for a single request, but for concurrent users it starts throwing HTTP 400 errors.
Dynamic batching should be supported by DJL Serving.
```json
{
  "code": 400,
  "type": "TranslateException",
  "message": "Batch output size mismatch, expected: 2, actual: 1"
}
```
Cannot run ./gradlew FJ or build under JDK 17.0.4 in the /serving directory. Both Yang and Sindhu ran into this.
To reproduce: on the master branch, run ./gradlew FJ or build under /djl-serving/serving.
If you replace an old model file with a new model zip file of the same name, the old model still gets registered again, even after deregistering it or restarting the server. You currently have to change the model file name before loading a new model.
HuggingFace repos download to the home directory by default, which has little space on SageMaker.
env = {"HUGGINGFACE_HUB_CACHE": "/tmp", "TRANSFORMERS_CACHE": "/tmp"}
Let's set these two variables in the containers we build.
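A minimal sketch of baking the same two variables into the image, assuming we set them in the Dockerfile rather than per endpoint:

```dockerfile
# Redirect HuggingFace caches to /tmp, which has more room on SageMaker hosts.
ENV HUGGINGFACE_HUB_CACHE=/tmp
ENV TRANSFORMERS_CACHE=/tmp
```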
```python
import transformers

model = transformers.AutoModelForCausalLM.from_pretrained(
    'mosaicml/mpt-7b',
    trust_remote_code=True
)
```
The model requires a new option, trust_remote_code=True, which needs to be enabled to run inference.
For example, in
```java
public ModelInfo(
        String id,
        String modelUrl,
        String version,
        String engineName,
        Class<I> inputClass,
        Class<O> outputClass,
        int queueSize,
        int maxIdleTime,
        int maxBatchDelay,
        int batchSize) {
```
there is no indication what the unit of time or delay is for maxIdleTime or maxBatchDelay. This makes the library frustrating to use, as I have to click through a bunch of source code and infer the intent from actual usage. The library would be much easier to use if such parameters were named with their unit, e.g. maxIdleTimeSecs and maxBatchDelayMillis.
This appears pervasive, e.g. Job.getBegin (which should probably be removed, as System.nanoTime() isn't absolute?) and Job.getWaitingTime (-> Job.getWaitingTimeMicrosecs). "Time" can usually be omitted if it's obvious, e.g. Job.getWaitingMicrosecs, maxIdleSecs.
serving.properties:
```
option.model_id=EleutherAI/gpt-neo-1.3B
option.task=text-generation
option.tensor_parallel_degree=2
option.dtype=fp16
option.enable_streaming=true
#option.enable_streaming=huggingface
engine=DeepSpeed
option.parallel_loading=true
```
curl command:
```
curl -X POST "http://localhost:8080/invocations" \
  -H "content-type: application/json" \
  -d '{"inputs": ["Large language model is"], "parameters": {"max_length": 2}}'
```
Expected 2 new tokens to be returned, but 50 tokens are returned.
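A hedged aside on the parameters (standard HuggingFace semantics, which may or may not be the bug here): in generate(), max_length counts the prompt tokens as well, while max_new_tokens bounds only the newly generated tokens; getting 50 tokens back regardless suggests the parameter is not being forwarded on the streaming path. A minimal sketch of the distinction:

```python
# Sketch: max_new_tokens bounds only the continuation, unlike max_length,
# which includes the prompt length.
from transformers import pipeline

pipe = pipeline("text-generation", model="EleutherAI/gpt-neo-1.3B")
out = pipe("Large language model is", max_new_tokens=2)  # exactly 2 new tokens
print(out[0]["generated_text"])
```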
The FasterTransformer handler will break with the upstream conversion-script changes in NVIDIA/FasterTransformer#568, because the model is not fetched using the from_pretrained() method.
In the current HuggingFace Accelerate implementation, we only use pipeline parallelism, not tensor parallelism. However, we still require the user to pass in tensor_parallel_degree, which doesn't make much sense. We should offer pipeline_parallel_degree to address this.
When I used djl-serving 0.19.0, I found that the model's prediction results were garbled for Chinese text.
I then found that the Ubuntu image does not support Chinese, so I added the installation of language-pack-zh-hans to install_djl_serving.sh and added
```
RUN localedef -c -f UTF-8 -i zh_CN zh_CN.utf8
ENV LC_ALL zh_CN.UTF-8
```
to the Dockerfile to build a new image. After that, Chinese prediction results worked.
Is my handling correct? Or should the prediction results already support Chinese, and the problem is with my model? Or do you intend to support Chinese prediction results?
As a final example, here is one that features a more complicated interaction. The human detection model will find all of the humans in an image. Then, the "splitHumans" function will turn all of them into separate images that can be treated as a list. The "map" will apply the "poseEstimation" model to each of the detected humans in the list.
```
workflow:
  humans: ["splitHumans", ["humanDetection", "in"]]
  out: ["map", "poseEstimation", "humans"]
```
https://github.com/deepjavalibrary/djl-serving/blob/master/wlm/src/main/java/ai/djl/serving/wlm/ModelInfo.java#L75-L80 currently reads:
```java
public ModelInfo(String modelUrl, Class<I> inputClass, Class<O> outputClass) {
    this.id = modelUrl;
    this.modelUrl = modelUrl;
    this.inputClass = inputClass;
    this.outputClass = outputClass;
}
```
This is missing the default initialization of queueSize et al. that is present in the Criteria constructor just below (https://github.com/deepjavalibrary/djl-serving/blob/master/wlm/src/main/java/ai/djl/serving/wlm/ModelInfo.java#L88-L99):
```java
WlmConfigManager config = WlmConfigManager.getInstance();
queueSize = config.getJobQueueSize();
maxIdleTime = config.getMaxIdleTime();
batchSize = config.getBatchSize();
maxBatchDelay = config.getMaxBatchDelay();
```
This makes the first constructor not very useful?
Special characters are not cleaned up.
Have you considered introducing a distributed deployment solution for djl-serving?
After all, it can only run on a single machine right now, and a single instance leaves some stability risks in a production environment.
Allow a URL that points to a model.py to be used as the entryPoint, so that users could supply any model.py URL for model deployment.
The pip command in the 0.18.0 Docker image is broken for some reason; it always exits with an error code.
In the central module, I just run
./gradlew run
but the terminal shows this:
Task :central:buildReactApp
asset main.js 2.06 MiB [compared for emit] (name: main) 1 related asset
orphan modules 78.8 KiB [orphan] 83 modules
runtime modules 972 bytes 5 modules
modules by path ./node_modules/ 1.68 MiB 240 modules
modules by path ./src/main/webapp/ 41.2 KiB
modules by path ./src/main/webapp/components/ 38 KiB
modules by path ./src/main/webapp/components/modelpanels/ 6.34 KiB 5 modules
modules by path ./src/main/webapp/components/*.jsx 11.5 KiB 3 modules
modules by path ./src/main/webapp/components/TabPanel/ 4.85 KiB 2 modules
+ 1 module
modules by path ./src/main/webapp/css/ 2.15 KiB
./src/main/webapp/css/useStyles.jsx 798 bytes [built] [code generated]
./src/main/webapp/css/style.css 537 bytes [built] [code generated]
./node_modules/css-loader/dist/cjs.js!./src/main/webapp/css/style.css 864 bytes [built] [code generated]
./src/main/webapp/Main.jsx 1.07 KiB [built] [code generated]
webpack 5.74.0 compiled successfully in 2680 ms
Task :central:run
Listening for transport dt_socket at address: 4000
[INFO ] - [id: 0xfb7e5f31] REGISTERED
[INFO ] - [id: 0xfb7e5f31] BIND: 0.0.0.0/0.0.0.0:8080
[INFO ] - [id: 0xfb7e5f31, L:/[0:0:0:0:0:0:0:0]:8080] ACTIVE
[INFO ] - [id: 0xfb7e5f31, L:/[0:0:0:0:0:0:0:0]:8080] READ: [id: 0xfd448de3, L:/[0:0:0:0:0:0:0:1]:8080 - R:/[0:0:0:0:0:0:0:1]:50076]
[INFO ] - [id: 0xfb7e5f31, L:/[0:0:0:0:0:0:0:0]:8080] READ: [id: 0x5fb3bd15, L:/[0:0:0:0:0:0:0:1]:8080 - R:/[0:0:0:0:0:0:0:1]:50077]
<============-> 95% EXECUTING [3m 33s]
<============-> 95% EXECUTING [1m 36s]
:central:run
Then I visit http://localhost:8080/, but there is no response and it just keeps waiting.
libgfortran.so.4 is missing in the image when trying to install numpy in the Docker container (cpu-full).
Different model packages may require different dependency versions. For example, protobuf has strict version requirements that vary across models, so users may need to install different pip wheels within a single environment. We need to find a way to address this. Ideally, the user could specify something like:
option.python_path=/path/to/python
The T5 model series is not supported because the handler loads models with AutoModelForCausalLM, while T5 is an encoder-decoder model. We need to support this special case.
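A minimal sketch of what the special case looks like, assuming the fix is to route T5-family models through the seq2seq auto class (AutoModelForCausalLM raises an error for them):

```python
# Sketch: T5 is encoder-decoder, so it loads via AutoModelForSeq2SeqLM.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

inputs = tokenizer("translate English to German: Hello", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```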
Can we create a new GPU Dockerfile based on ONNX Runtime?
Sometimes the MPI process hangs and the Python process cannot be restarted.
serving.properties:
```
option.model_id=EleutherAI/gpt-neo-1.3B
option.task=text-generation
option.tensor_parallel_degree=2
option.dtype=fp16
option.enable_streaming=true
engine=DeepSpeed
option.parallel_loading=true
```
curl command:
```
curl -X POST "http://localhost:8080/invocations" \
  -H "content-type: text/plain" \
  -d "Large language model is"
```
WARN PyProcess Primary job terminated normally, but 1 process returned
WARN PyProcess a non-zero exit code. Per user-direction, the job has been aborted.
This is not actually an error in DJLServing; just tracking it here. Will raise an issue in HF as well.
The HF pipeline actually tries to generate the outputs on CPU for the GPT-NeoX 20B model, despite device_map=auto being included in the configuration.
The workaround is to use the model.generate method, manually moving the input_ids to the GPU.
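A hedged sketch of that workaround (the model ID comes from the issue; the dtype and generation parameters are illustrative):

```python
# Sketch: bypass the pipeline and call model.generate() directly,
# moving input_ids to the GPU by hand as the workaround describes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-neox-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.float16
)

input_ids = tokenizer("Large language model is", return_tensors="pt").input_ids
input_ids = input_ids.to("cuda")  # the manual device move
output = model.generate(input_ids, max_new_tokens=25)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```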
Bug: RuntimeError: "topk_cpu" not implemented for 'Half'
Trying the GPT-NeoX 20B model with our huggingface.py handler. This was actually recorded as issues in transformers:
huggingface/transformers#18703
huggingface/transformers#19445
Customers are confused by TensorParallelDegree when using HuggingFace models. Maybe we should permanently rename it to ModelParallelDegree / model_parallel_degree for clarity.
Generating a self-signed certificate throws an error with JDK 17. Because of this, the Gradle build fails, as the following two tests fail:
serving/build/reports/tests/test/classes/ai.djl.serving.ModelServerTest.html#test
serving/build/reports/tests/test/classes/ai.djl.serving.ModelServerTest.html#testWorkflows
Stack trace of the error message:
java.security.cert.CertificateException: No provider succeeded to generate a self-signed certificate. See debug log for the root cause.
at io.netty.handler.ssl.util.SelfSignedCertificate.<init>(SelfSignedCertificate.java:249)
at io.netty.handler.ssl.util.SelfSignedCertificate.<init>(SelfSignedCertificate.java:166)
at io.netty.handler.ssl.util.SelfSignedCertificate.<init>(SelfSignedCertificate.java:115)
at io.netty.handler.ssl.util.SelfSignedCertificate.<init>(SelfSignedCertificate.java:90)
at ai.djl.serving.util.ConfigManager.getSslContext(ConfigManager.java:385)
at ai.djl.serving.ConfigManagerTest.testSsl(ConfigManagerTest.java:46)
at ai.djl.serving.ModelServerTest.test(ModelServerTest.java:291)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at org.testng.internal.invokers.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:135)
at org.testng.internal.invokers.TestInvoker.invokeMethod(TestInvoker.java:673)
at org.testng.internal.invokers.TestInvoker.invokeTestMethod(TestInvoker.java:220)
at org.testng.internal.invokers.MethodRunner.runInSequence(MethodRunner.java:50)
at org.testng.internal.invokers.TestInvoker$MethodInvocationAgent.invoke(TestInvoker.java:945)
at org.testng.internal.invokers.TestInvoker.invokeTestMethods(TestInvoker.java:193)
at org.testng.internal.invokers.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:146)
at org.testng.internal.invokers.TestMethodWorker.run(TestMethodWorker.java:128)
at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
at org.testng.TestRunner.privateRun(TestRunner.java:808)
at org.testng.TestRunner.run(TestRunner.java:603)
at org.testng.SuiteRunner.runTest(SuiteRunner.java:429)
at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:423)
at org.testng.SuiteRunner.privateRun(SuiteRunner.java:383)
at org.testng.SuiteRunner.run(SuiteRunner.java:326)
at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:52)
at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:95)
at org.testng.TestNG.runSuitesSequentially(TestNG.java:1249)
at org.testng.TestNG.runSuitesLocally(TestNG.java:1169)
at org.testng.TestNG.runSuites(TestNG.java:1092)
at org.testng.TestNG.run(TestNG.java:1060)
at org.gradle.api.internal.tasks.testing.testng.TestNGTestClassProcessor.runTests(TestNGTestClassProcessor.java:141)
at org.gradle.api.internal.tasks.testing.testng.TestNGTestClassProcessor.stop(TestNGTestClassProcessor.java:90)
at org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.stop(SuiteTestClassProcessor.java:61)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:36)
at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
at org.gradle.internal.dispatch.ContextClassLoaderDispatch.dispatch(ContextClassLoaderDispatch.java:33)
at org.gradle.internal.dispatch.ProxyDispatchAdapter$DispatchingInvocationHandler.invoke(ProxyDispatchAdapter.java:94)
at jdk.proxy2/jdk.proxy2.$Proxy5.stop(Unknown Source)
at org.gradle.api.internal.tasks.testing.worker.TestWorker$3.run(TestWorker.java:193)
at org.gradle.api.internal.tasks.testing.worker.TestWorker.executeAndMaintainThreadName(TestWorker.java:129)
at org.gradle.api.internal.tasks.testing.worker.TestWorker.execute(TestWorker.java:100)
at org.gradle.api.internal.tasks.testing.worker.TestWorker.execute(TestWorker.java:60)
at org.gradle.process.internal.worker.child.ActionExecutionWorker.execute(ActionExecutionWorker.java:56)
at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:133)
at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:71)
at worker.org.gradle.process.internal.worker.GradleWorkerMain.run(GradleWorkerMain.java:69)
at worker.org.gradle.process.internal.worker.GradleWorkerMain.main(GradleWorkerMain.java:74)
Suppressed: java.lang.NoClassDefFoundError: org/bouncycastle/jce/provider/BouncyCastleProvider
at io.netty.handler.ssl.util.SelfSignedCertificate.<init>(SelfSignedCertificate.java:240)
... 52 more
Caused by: java.lang.ClassNotFoundException: org.bouncycastle.jce.provider.BouncyCastleProvider
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
... 53 more
Caused by: java.lang.IllegalAccessError: class io.netty.handler.ssl.util.OpenJdkSelfSignedCertGenerator (in unnamed module @0x531d72ca) cannot access class sun.security.x509.X509CertInfo (in module java.base) because module java.base does not export sun.security.x509 to unnamed module @0x531d72ca
at io.netty.handler.ssl.util.OpenJdkSelfSignedCertGenerator.generate(OpenJdkSelfSignedCertGenerator.java:52)
at io.netty.handler.ssl.util.SelfSignedCertificate.<init>(SelfSignedCertificate.java:246)
... 52 more
This issue is described in Netty's GitHub repo. Adding org.bouncycastle:bcpkix-jdk15on:1.65 to the dependencies solved it; a Gradle sketch follows.
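A minimal Gradle sketch of that dependency fix (the configuration name is an assumption; put it wherever your test runtime resolves dependencies):

```groovy
dependencies {
    // BouncyCastle provider so Netty can generate the self-signed test cert on JDK 17
    testRuntimeOnly "org.bouncycastle:bcpkix-jdk15on:1.65"
}
```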
DJLServing should produce a timestamp for each log line.
I load my model on djl-serving, and the parameters should be a float[4] array or a specific class.
In the demo I just override processInput() and processOutput(), because the input class is a customized class.
But in djl-serving I don't know how to parse the parameters into the Input. I saw that maybe some code should be configured in a .yml file.
So is there any instruction for parsing customized parameters into the input?
While packaging OpenJDK for Homebrew, it was noticed that DJL Serving's tarball, downloaded from https://publish.djl.ai/djl-serving/serving-0.21.0.tar, reported a different checksum. It used to be 8fa8afd1a4181fc55e6ad2cb31cea8ec07fc4ad5df135e62bd07106ce3fc6c80 as of 2023-02-26 03:36 UTC, but now it is 523c742f80fb277bfc7f8c3f706ede4b28fbc5d95851d526f63d7e6f02c6c423. May I confirm whether the tarball was re-uploaded? Thanks!
Expected behavior: the tarball checksum should match the one in our formula (package description).
See CI failure here:
==> Downloading https://publish.djl.ai/djl-serving/serving-0.21.0.tar
Downloaded to: /Users/brew/Library/Caches/Homebrew/downloads/acdf5ceb0cf03acc49888f36839af1c9e017be2fce0c48dc17d9641f01945263--serving-0.21.0.tar
SHA256: 523c742f80fb277bfc7f8c3f706ede4b28fbc5d95851d526f63d7e6f02c6c423
Warning: Formula reports different sha256: 8fa8afd1a4181fc55e6ad2cb31cea8ec07fc4ad5df135e62bd07106ce3fc6c80
How to reproduce:
```
$ curl -L https://publish.djl.ai/djl-serving/serving-0.21.0.tar | shasum -a256 -
523c742f80fb277bfc7f8c3f706ede4b28fbc5d95851d526f63d7e6f02c6c423  -
```
(Apologies if this is already supported; the docs are unclear/confusing.)
DJL Serving claims support for TensorRT models on https://github.com/deepjavalibrary/djl-serving, yet the DJL FAQ doesn't mention TensorRT support (https://djl.ai/docs/faq.html), which is confusing.
I have a PyTorch model that I'd like to run inference on. For memory and performance reasons I'd like to quantize it to at least fp16, ideally uint8. To avoid quantization issues (which I do hit if I just convert the weights to fp16 in my PyTorch -> TorchScript model), I need to apply post-training quantization with suitable calibration.
The only path I've found to actually do that with PyTorch is via TensorRT. However, that creates something that is neither a TorchScript nor a TensorRT model; it's instead some Torch-TensorRT hybrid. OK. Now, to deploy that monstrosity, their docs (https://pytorch.org/TensorRT/tutorials/runtime.html#runtime) claim that all you have to do is link in libtorchtrt_runtime.so, which is included in their C++ distribution. Great.
Is it possible to do that and use this workflow with DJL Serving? Has anyone done it?
Is there another (better) path to get quantized inference in DJL?
Thanks!
Deploying GPT-NeoX (https://huggingface.co/EleutherAI/gpt-neox-20b) on SageMaker is unsuccessful.
Expected behavior: successfully deploy the model to a SageMaker endpoint using the latest version of the Large Model Inference container (https://github.com/aws/deep-learning-containers/blob/master/available_images.md).
Error message:
[INFO ] PyProcess - [1,2]<stdout>: File "/root/.djl.ai/python/0.20.0/djl_python/deepspeed.py", line 207, in _validate_model_type_and_task
[INFO ] PyProcess - [1,2]<stdout>:ValueError: model_type: gpt_neox is not currently supported by DeepSpeed
How to reproduce: use the following configuration in serving.properties:
```
engine=DeepSpeed
option.entryPoint=djl_python.deepspeed
option.tensor_parallel_degree=8
option.model_id=EleutherAI/gpt-neox-20b
```
gpt-neox is listed in SUPPORTED_MODEL_TYPES: https://github.com/deepjavalibrary/djl-serving/blob/master/engines/python/setup/djl_python/deepspeed.py#L39. However, examining the model_type with the following code:
```python
from transformers import AutoConfig

model_config = AutoConfig.from_pretrained("EleutherAI/gpt-neox-20b")
print(model_config.model_type)
```
shows that it is actually gpt_neox (with an underscore).
At line 183 of streaming_utils.py: https://github.com/deepjavalibrary/djl-serving/blob/master/engines/python/setup/djl_python/streaming_utils.py#L183
Traceback (most recent call last):
  File "/home/ubuntu/models/linguist/djl-model/steaming_test.py", line 68, in <module>
    next_token_id = decoding_method(
  File "/home/ubuntu/models/linguist/djl-model/streaming_utils.py", line 189, in _sampling_decoding
    logits[-1:, :] = processors(input_ids, logits[-1:, :])
RuntimeError: Output 0 of SliceBackward0 is a view and is being modified inplace. This view is the output of a function that returns multiple views. Such functions do not allow the output views to be modified inplace. You should replace the inplace operation by an out-of-place one.
Replace line 183 with the following:
```python
_logits = logits.detach().clone()
_logits[-1:, :] = processors(input_ids, logits[-1:, :])
logits = _logits
```
serving.properties:
```
engine=Python
option.entryPoint=djl_python.huggingface
option.tensor_parallel_degree=4
option.dtype=fp16
option.model_id=huggyllama/llama-13b
```
Using the above setup and running a simple command:
```
curl -X POST "http://127.0.0.1:8080/predictions/test" \
  -H 'Content-Type: application/json' \
  -d '{"parameters":{"max_new_tokens": 256, "min_new_tokens": 256},
       "inputs":["Large Language model is"]}'
```
it fails with the error:
The following `model_kwargs` are not used by the model: ['token_type_ids']
If we change the lines https://github.com/deepjavalibrary/djl-serving/blob/master/engines/python/setup/djl_python/huggingface.py#L234-L235 to
output_tokens = model.generate(input_tokens.input_ids, **kwargs)
the problem is resolved. This might be a HuggingFace bug.
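A hedged sketch of the failure mode (tokenizer behavior varies by transformers version, so treat this as illustrative): the LLaMA tokenizer can emit token_type_ids in its encoding, and splatting the whole encoding into generate() passes that unused key through to the model, which rejects it. Passing input_ids explicitly, as in the change above, sidesteps it:

```python
# Sketch: inspect what the tokenizer emits and pass only what generate() uses.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-13b")
enc = tokenizer("Large Language model is", return_tensors="pt")
print(enc.keys())  # may include 'token_type_ids' depending on the version

# model.generate(**enc)               # can fail: unused model_kwargs
# model.generate(enc.input_ids, ...)  # works: only pass what the model uses
```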
It looks like there is a 0.20.0 release, but there is no downloadable artifact at https://publish.djl.ai/djl-serving/serving-0.20.0.tar. Raising this issue to confirm whether anything is missing in the release process. Thanks!
After specifying the datatype for model loading as
option.dtype=fp16
deepspeed.py is not picking it up.
Update the deprecated set-output usage in the GitHub Actions workflows for DJL, DJL Demo, and DJLServing.
Integration and performance tests periodically fail due to a hub cache bug in Transformers 4.27.x (huggingface/transformers#22427).
Expected: consistent downloads and runs of our integration and performance pipelines.