Comments (5)
Thanks for reporting this bug. We will take a look at this as soon as possible. I just created two test cases that reproduce the error (one with ZeRO and one with FP16 but no ZeRO).
https://github.com/microsoft/DeepSpeed/blob/jeffra/onecycle_bug/tests/unit/test_fp16.py#L147-L246
from deepspeed.
Note: I get the same error when I use the same configuration with LRRangeTest as the scheduler.
Hi @colanim, it should be up to date. Can you tell us this info from inside your docker container?
python -c 'import deepspeed; print("deepspeed info:", deepspeed.__version__, deepspeed.__git_branch__, deepspeed.__git_hash__)'
Also, I just looked at the latest docker build. It prints this same version info, and it appears to be aligned with the latest March 12th commit (3d3f8d3): https://dev.azure.com/DeepSpeedMSFT/DeepSpeed/_build/results?buildId=416&view=logs&j=3dc8fd7e-4368-5a92-293e-d53cefc8c4b3&t=a1aa9649-a94b-5ac4-3f5e-9bb6223edb04&l=1717
deepspeed info: 0.1.0 master 3d3f8d3
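For anyone who wants to run the same check from a script rather than a one-liner, here is a minimal sketch. The helper name `describe_build` is hypothetical (it is not part of DeepSpeed), and the attributes `__git_branch__` and `__git_hash__` are assumed to exist only because the command above uses them; other releases may not expose them, so missing fields fall back to "unknown".

```python
import importlib
from types import ModuleType


def describe_build(module: ModuleType) -> str:
    """Return 'version branch hash' for a module (hypothetical helper).

    Any attribute the module does not define is reported as 'unknown'
    instead of raising AttributeError.
    """
    fields = ("__version__", "__git_branch__", "__git_hash__")
    return " ".join(str(getattr(module, name, "unknown")) for name in fields)


if __name__ == "__main__":
    try:
        # Import lazily so the script still runs where deepspeed is absent.
        ds = importlib.import_module("deepspeed")
    except ImportError:
        print("deepspeed is not installed in this environment")
    else:
        print("deepspeed info:", describe_build(ds))
```

Run inside the container, this should report the same three fields as the one-liner, which makes it easy to compare against the commit hash of the latest build.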
I'm still hitting this issue in deepspeed/deepspeed:latest. How should I update the docker image to pull the latest code from source?
My bad, I didn't pull the latest image.
After doing docker pull deepspeed/deepspeed:latest, it's working 👍