TXT2IMAGE - TXTRuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasGemmStridedBatchedExFix (deepspeed-mii issue, CLOSED)

microsoft commented on July 21, 2024
TXT2IMAGE - TXTRuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasGemmStridedBatchedExFix

from deepspeed-mii.

Comments (5)

eran-sefirot commented on July 21, 2024

when running: python mii-sd.py

a_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'transformer_inference'
[2022-11-27 11:35:16,846] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 6581
[2022-11-27 11:35:16,846] [ERROR] [launch.py:324:sigkill_handler] ['/opt/conda/bin/python', '-m', 'mii.launch.multi_gpu_server', '--task-name', 'text-to-image', '--model', 'CompVis/stable-diffusion-v1-4', '--model-path', '/tmp/mii_models', '--port', '50050', '--ds-optimize', '--provider', 'diffusers', '--config', 'eyJ0ZW5zb3JfcGFyYWxsZWwiOiAxLCAicG9ydF9udW1iZXIiOiA1MDA1MCwgImR0eXBlIjogImZwMTYiLCAiZW5hYmxlX2N1ZGFfZ3JhcGgiOiBmYWxzZSwgImNoZWNrcG9pbnRfZGljdCI6IG51bGwsICJkZXBsb3lfcmFuayI6IFswXSwgInRvcmNoX2Rpc3RfcG9ydCI6IDI5NTAwLCAiaGZfYXV0aF90b2tlbiI6ICJoZl9Xc0NwVWFFYVhMbGtEZEtLTkVtS2NxZk9vTHBjcWxXWHF5IiwgInJlcGxhY2Vfd2l0aF9rZXJuZWxfaW5qZWN0IjogdHJ1ZSwgInByb2ZpbGVfbW9kZWxfdGltZSI6IGZhbHNlLCAic2tpcF9tb2RlbF9jaGVjayI6IGZhbHNlfQ=='] exits with return code = 1
[2022-11-27 11:35:18,791] [INFO] [server_client.py:117:_wait_until_server_is_live] waiting for server to start...
Traceback (most recent call last):
  File "/home/ec2-user/DeepSpeed-MII/examples/benchmark/txt2img/mii-sd.py", line 15, in <module>
    mii.deploy(task='text-to-image',
  File "/opt/conda/lib/python3.9/site-packages/mii/deployment.py", line 114, in deploy
    return _deploy_local(deployment_name, model_path=model_path)
  File "/opt/conda/lib/python3.9/site-packages/mii/deployment.py", line 120, in _deploy_local
    mii.utils.import_score_file(deployment_name).init()
  File "/tmp/mii_cache/sd_deploy/score.py", line 29, in init
    model = mii.MIIServerClient(task,
  File "/opt/conda/lib/python3.9/site-packages/mii/server_client.py", line 92, in __init__
    self._wait_until_server_is_live()
  File "/opt/conda/lib/python3.9/site-packages/mii/server_client.py", line 115, in _wait_until_server_is_live
    raise RuntimeError("server crashed for some reason, unable to proceed")
RuntimeError: server crashed for some reason, unable to proceed
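
For anyone debugging a similar launch failure: the `--config` value in the launch command above looks like base64-encoded JSON, so it can be decoded to check which options the server was actually started with. Below is a minimal sketch using only the standard library. The helper name `decode_config` and the sample value are illustrative assumptions, not part of MII. The decoded settings may include credentials such as an auth token, so treat the output as sensitive.

```python
import base64
import json


def decode_config(encoded: str) -> dict:
    """Decode a base64-encoded JSON string into a dict."""
    return json.loads(base64.b64decode(encoded).decode("utf-8"))


# Illustrative sample built in place so the example is self-contained;
# substitute the string that follows --config in your own log.
sample = base64.b64encode(
    json.dumps({"dtype": "fp16", "port_number": 50050}).encode("utf-8")
).decode("ascii")

config = decode_config(sample)
print(config["dtype"], config["port_number"])
```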

eran-sefirot commented on July 21, 2024

OK, I've installed the latest Deep Learning AMI with CUDA 11.7.
Now I get the following when running python mii-sd.py:

/opt/conda/envs/pytorch/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/apply_rotary_pos_emb.cu:8:10: fatal error: cuda_profiler_api.h: No such file or directory
#include <cuda_profiler_api.h>
^~~~~~~~~~~~~~~~~~~~~
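
A quick way to narrow down a "No such file or directory" error for a CUDA header is to check whether the file exists in the include directory the build is likely to use. This is a minimal sketch, not a DeepSpeed utility: the helper `find_header` is hypothetical, and the `CUDA_HOME` fallback of `/usr/local/cuda` is an assumption that may not match your install.

```python
import os
from pathlib import Path


def find_header(header, include_dirs):
    """Return the full paths under include_dirs where header exists."""
    return [
        str(Path(d) / header)
        for d in include_dirs
        if (Path(d) / header).is_file()
    ]


# Candidate include directory: an assumption, adjust for your environment.
cuda_home = os.environ.get("CUDA_HOME", "/usr/local/cuda")
candidates = [os.path.join(cuda_home, "include")]

# An empty list means the header was not found in the checked locations.
print(find_header("cuda_profiler_api.h", candidates))
```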

eran-sefirot commented on July 21, 2024

I've switched to a different AMI with PyTorch 1.2 and CUDA 1.6,
and now I get the following error:

Time to load spatial_inference op: 17.237044095993042 seconds
**** found and replaced unet w. <class 'deepspeed.model_implementations.diffusers.unet.DSUNet'>
About to start server
Started
[2022-11-27 13:35:10,519] [INFO] [server_client.py:117:_wait_until_server_is_live] waiting for server to start...
[2022-11-27 13:35:15,524] [INFO] [server_client.py:117:_wait_until_server_is_live] waiting for server to start...
[2022-11-27 13:35:15,524] [INFO] [server_client.py:118:_wait_until_server_is_live] server has started on 50050
Traceback (most recent call last):
  File "/home/ec2-user/DeepSpeed-MII/examples/benchmark/txt2img/mii-sd.py", line 23, in <module>
    results = pipe.query(prompts)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/mii/server_client.py", line 367, in query
    response = self.asyncio_loop.run_until_complete(
  File "/opt/conda/envs/pytorch/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/mii/server_client.py", line 263, in _query_in_tensor_parallel
    await responses[0]
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/mii/server_client.py", line 313, in _request_async_response
    response = await self.stubs[stub_id].Txt2ImgReply(req)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/grpc/aio/_call.py", line 290, in __await__
    raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
    status = StatusCode.UNKNOWN
    details = "Exception calling application: 'DSUNet' object has no attribute 'config'"
    debug_error_string = "UNKNOWN:Error received from peer ipv6:%5B::1%5D:50050 {grpc_message:"Exception calling application: 'DSUNet' object has no attribute 'config'", grpc_status:2, created_time:"2022-11-27T13:35:15.530649601+00:00"}"
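
As a general illustration of this kind of message: an "object has no attribute" error can appear when a wrapper class replaces an object and calling code still reads an attribute that only the wrapped object has. Whether that is the cause here is not confirmed in this thread. The sketch below uses hypothetical stand-in classes (`Inner`, `PlainWrapper`, `ForwardingWrapper`), not DeepSpeed or diffusers code, and shows one generic way a wrapper can fall back to the wrapped object.

```python
class Inner:
    """Stand-in for a wrapped model that carries a config attribute."""

    def __init__(self):
        self.config = {"in_channels": 4}


class PlainWrapper:
    """Wraps an object but does not expose the wrapped attributes."""

    def __init__(self, inner):
        self.inner = inner


class ForwardingWrapper(PlainWrapper):
    """Falls back to the wrapped object for attributes it lacks."""

    def __getattr__(self, name):
        # __getattr__ runs only when normal attribute lookup fails.
        inner = self.__dict__.get("inner")
        if inner is None:
            raise AttributeError(name)
        return getattr(inner, name)


try:
    PlainWrapper(Inner()).config
except AttributeError as err:
    print("plain wrapper:", err)

print("forwarding wrapper:", ForwardingWrapper(Inner()).config)
```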

mrwyattii commented on July 21, 2024

This was resolved recently. Please see #112 (comment)
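
Since the fix is described as recent, it can help to record which package versions are installed before retrying. This is a minimal sketch using only the standard library. The helper `installed_versions` is hypothetical, and the distribution names in the list are assumptions about what is relevant in this environment.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    result = {}
    for name in names:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result


# Assumed list of relevant distributions; edit as needed.
print(installed_versions(["deepspeed", "deepspeed-mii", "diffusers", "torch"]))
```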

mrwyattii commented on July 21, 2024

Please reopen if this issue is still not resolved.
