karpathy / build-nanogpt
Video+code lecture on building nanoGPT from scratch
Thanks Andrej for the incredible video and content!!!
While trying to follow along (starting with the play.ipynb file), I have trouble getting the log file "log124M_40B/log.txt". I am on a PC and it's impossible for me to reproduce your results all the way to the end :(.
Could you provide a download link to this file, or point to its location (if I missed it in the repo...)?
fullgraph=True will make sure that there are no graph breaks (this may already be the case). mode="reduce-overhead" will use CUDA graphs if possible. See in [these benchmarks] that going from regular torch.compile to reduce-overhead gives a good 70-100% speed-up on top of regular torch.compile.
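For reference, a minimal sketch of what this could look like; the tiny Sequential model below is just a stand-in for the GPT module in the training script:

import torch
import torch.nn as nn

# Stand-in module; in the training script this would be the GPT model.
model = nn.Sequential(nn.Linear(8, 8), nn.GELU(), nn.Linear(8, 8)).cuda()

# fullgraph=True errors out on any graph break instead of silently falling back;
# mode="reduce-overhead" captures CUDA graphs when possible.
model = torch.compile(model, fullgraph=True, mode="reduce-overhead")

x = torch.randn(4, 8, device="cuda")
for _ in range(3):  # a few warm-up iterations so the CUDA graphs get recorded
    y = model(x)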
Hi,
I have tried to implement GPT-2 from scratch following the video tutorial. However, when I try to execute the code on 2 GPUs with:
torchrun --standalone --nproc_per_node=2 GPT.py
My program fails with the following error message:
Device: cuda:1
Device Count: 1
[rank1]: Traceback (most recent call last):
[rank1]: File "/my_transformer/GPT.py", line 238, in <module>
[rank1]: torch.cuda.set_device(device)
[rank1]: File "/.local/lib/python3.9/site-packages/torch/cuda/__init__.py", line 399, in set_device
[rank1]: torch._C._cuda_setDevice(device)
[rank1]: RuntimeError: CUDA error: invalid device ordinal
[rank1]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank1]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank1]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Device: cuda:0
Device Count: 1
Master-Process: True
Total desired batch size: 524288
Calculated gradient accumulation steps: 16.
loaded 338025 tokens.
W0626 10:12:11.821799 22703772874560 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2031187 closing signal SIGTERM
E0626 10:12:11.853472 22703772874560 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 1 (pid: 2031188) of binary: ~/my_transformer/.venv/bin/python3.9
Traceback (most recent call last):
File "~/my_transformer/.venv/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/.local/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "~/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 879, in main
run(args)
File "/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "~/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "~/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
GPT.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-06-26_10:12:11
host : haicn01.localdomain
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 2031188)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
If I execute with just 1 GPU, I get another error:
[rank0]: OutOfMemoryError: CUDA out of memory. Tried to allocate 786.00 MiB. GPU
Any ideas what the reason could be? I followed the video tutorial exactly and also checked the code in the repository. I should have enough memory; according to nvidia-smi I get the following output:
Wed Jun 26 10:51:56 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-SXM4-40GB On | 00000000:CA:00.0 Off | On |
| N/A 55C P0 165W / 400W | 612MiB / 40960MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+==================================+===========+=======================|
| 0 8 0 0 | 12MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
| 0 9 0 1 | 12MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Thanks in advance.
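For context, the device-selection part of my script follows the lecture, roughly like this (the environment variables are set by torchrun):

import os
import torch

# torchrun sets RANK/LOCAL_RANK/WORLD_SIZE for each spawned process
ddp_rank = int(os.environ['RANK'])
ddp_local_rank = int(os.environ['LOCAL_RANK'])
ddp_world_size = int(os.environ['WORLD_SIZE'])
# each process picks its own GPU, so rank 1 needs a second visible CUDA device
device = f'cuda:{ddp_local_rank}'
torch.cuda.set_device(device)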
When I loaded the smallest GPT-2 model weights from Hugging Face and performed inference using both flash attention and a manually implemented attention under the same seed setting, I obtained consistent results within each method individually. However, the results between the two methods were not consistent, and the manually implemented attention seemed to produce more reasonable outputs. Is this normal?
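For reference, a stripped-down version of the comparison I mean (random tensors instead of the real model, just to illustrate the two code paths):

import torch
import torch.nn.functional as F

torch.manual_seed(42)
B, nh, T, hs = 1, 12, 16, 64                       # batch, heads, seq len, head size
q, k, v = (torch.randn(B, nh, T, hs) for _ in range(3))

# manual attention with an explicit causal mask
att = (q @ k.transpose(-2, -1)) / (hs ** 0.5)
att = att.masked_fill(torch.tril(torch.ones(T, T)) == 0, float('-inf'))
y_manual = F.softmax(att, dim=-1) @ v

# fused flash attention path
y_flash = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# the two paths reorder floating point operations differently, so small numerical
# differences are expected, and those can flip sampled tokens even under the same seed
print((y_manual - y_flash).abs().max())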
I noticed that in the following snippet, the std of nn.Embedding is set to 0.02:
def _init_weights(self, module):
    if isinstance(module, nn.Linear):
        std = 0.02
        if hasattr(module, 'NANOGPT_SCALE_INIT'):
            std *= (2 * self.config.n_layer) ** -0.5
        torch.nn.init.normal_(module.weight, mean=0.0, std=std)
        if module.bias is not None:
            torch.nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
The official implementation sets it to 0.01, as noted in the video. It only matters for the positional embeddings, due to the weight-sharing scheme between wte and lm_head.
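A minimal sketch of the sharing in question (names follow the repo; the config numbers are just the GPT-2 124M values):

import torch.nn as nn

n_embd, vocab_size, block_size = 768, 50257, 1024

wte = nn.Embedding(vocab_size, n_embd)                # token embeddings
wpe = nn.Embedding(block_size, n_embd)                # positional embeddings
lm_head = nn.Linear(n_embd, vocab_size, bias=False)   # output projection

lm_head.weight = wte.weight                           # weight sharing: one parameter

# std=0.02 on wte is therefore also the init of lm_head; only wpe would change
# if the embedding init used std=0.01 instead.
nn.init.normal_(wte.weight, mean=0.0, std=0.02)
nn.init.normal_(wpe.weight, mean=0.0, std=0.02)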
Have a question - in the current form of making batches, aren't we throwing away information? E.g. we take an input and transform it into a B x T matrix. Now for each row, the first token is blind to the previous tokens, as we never put that sequence into the training loop. Wouldn't a better way to build the dataloader be something like a moving window?
Thanks so much Andrej for making these incredible videos!!!!
Could you please make them Google Colab friendly? I'm getting RuntimeErrors in the fineweb.py code, and compile won't work on Windows because Triton can't be installed there.
Thanks so much!!
I thought tensor parallelism would be an interesting idea. There's a tutorial for this and even some code examples, but so far no joy.
I started simple, trying to shard the MLP like this:
# run using: torchrun --standalone --nproc-per-node=2 train_gpt2_tp.py
import os
import torch.nn as nn
from torch.distributed._tensor.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import parallelize_module, ColwiseParallel, RowwiseParallel

_world_size = int(os.environ["WORLD_SIZE"])
device_mesh = init_device_mesh(device_type="cuda", mesh_shape=(_world_size,))

class Block(nn.Module):
    def __init__(self, config):
        ...
        # was: self.mlp = MLP(config)
        self.mlp = parallelize_module(
            module=MLP(config),
            device_mesh=device_mesh,
            parallelize_plan={
                "c_fc": ColwiseParallel(),
                "c_proj": RowwiseParallel(),
            },
        )
But PyTorch (nightly) gives me grief:
[rank0]: Traceback (most recent call last):
[rank0]: File "/mnt/Sync-shared/projects/repos/build-nanogpt/train_gpt2_tp.py", line 326, in <module>
[rank0]: norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.12/site-packages/torch/nn/utils/clip_grad.py", line 21, in _no_grad_wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.12/site-packages/torch/nn/utils/clip_grad.py", line 68, in clip_grad_norm_
[rank0]: norms.extend(torch._foreach_norm(device_grads, norm_type))
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.12/site-packages/torch/_compile.py", line 31, in inner
[rank0]: return disable_fn(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 600, in _fn
[rank0]: return fn(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.12/site-packages/torch/distributed/_tensor/api.py", line 309, in __torch_dispatch__
[rank0]: return DTensor._op_dispatcher.dispatch(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.12/site-packages/torch/distributed/_tensor/_dispatch.py", line 115, in dispatch
[rank0]: op_info = self.unwrap_to_op_info(op_call, args, kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.12/site-packages/torch/distributed/_tensor/_dispatch.py", line 348, in unwrap_to_op_info
[rank0]: args_schema.append(try_get_replicate_spec(arg, mesh))
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.12/site-packages/torch/distributed/_tensor/_dispatch.py", line 329, in try_get_replicate_spec
[rank0]: raise RuntimeError(
[rank0]: RuntimeError: aten._foreach_norm.Scalar: got mixed torch.Tensor and DTensor, need to convert all torch.Tensor to DTensor before calling distributed operators!
[rank1]: Traceback (most recent call last):
[rank1]: File "/mnt/Sync-shared/projects/repos/build-nanogpt/train_gpt2_tp.py", line 326, in <module>
[rank1]: norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/lib/python3.12/site-packages/torch/nn/utils/clip_grad.py", line 21, in _no_grad_wrapper
[rank1]: return func(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/lib/python3.12/site-packages/torch/nn/utils/clip_grad.py", line 68, in clip_grad_norm_
[rank1]: norms.extend(torch._foreach_norm(device_grads, norm_type))
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/lib/python3.12/site-packages/torch/_compile.py", line 31, in inner
[rank1]: return disable_fn(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 600, in _fn
[rank1]: return fn(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/lib/python3.12/site-packages/torch/distributed/_tensor/api.py", line 309, in __torch_dispatch__
[rank1]: return DTensor._op_dispatcher.dispatch(
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/lib/python3.12/site-packages/torch/distributed/_tensor/_dispatch.py", line 115, in dispatch
[rank1]: op_info = self.unwrap_to_op_info(op_call, args, kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/lib/python3.12/site-packages/torch/distributed/_tensor/_dispatch.py", line 348, in unwrap_to_op_info
[rank1]: args_schema.append(try_get_replicate_spec(arg, mesh))
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/lib/python3.12/site-packages/torch/distributed/_tensor/_dispatch.py", line 329, in try_get_replicate_spec
[rank1]: raise RuntimeError(
[rank1]: RuntimeError: aten._foreach_norm.Scalar: got mixed torch.Tensor and DTensor, need to convert all torch.Tensor to DTensor before calling distributed operators!
As a quick fix I tried converting what I thought were DTensors to local tensors:
class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)
        self.gelu = nn.GELU(approximate='tanh')
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)
        self.c_proj.NANOGPT_SCALE_INIT = 1

    def forward(self, x):
        x = self.c_fc(x)
        x = self.gelu(x)
        x = self.c_proj(x)
        return x.to_local()  # change here!
but then I get even more grief 🤦‍♂️:
[rank0]: File "/mnt/Sync-shared/projects/repos/build-nanogpt/train_gpt2_tp.py", line 58, in forward
[rank0]: return x.to_local()
[rank0]: ^^^^^^^^^^
[rank0]: AttributeError: 'AsyncCollectiveTensor' object has no attribute 'to_local'
Any ideas?
Below is what I get every time I try to shard the dataset. It does not look like the last shard is completing; I ran this multiple times, and each time it stops in the same spot. Any ideas?
Shard 97: 100%|██████████| 99999910/100000000 [00:10<00:00, 9236426.65tokens/s]
Shard 98: 100%|██████████| 99999499/100000000 [00:11<00:00, 8723382.11tokens/s]
Shard 99:  54%|█████▍    | 53989101/100000000 [00:08<00:07, 6051927.02tokens/s]
PS E:\build-nanogpt-master\build-nanogpt-master>
The issue is that gcc is not available on the machine, at least in my setup.
Check for GCC with the type gcc command or the path /usr/bin/gcc. If it is not available, install it via the build-essential package:
sudo apt-get install build-essential
Restart the shell and set use_compile = True in the training script. This worked for my setup.
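For what it's worth, the flag could also be gated on compiler availability; use_compile is the existing flag in the training script, and the check itself is just an idea, not something from the repo:

import shutil

# enable torch.compile only if a C compiler is on PATH (Triton/Inductor need one)
use_compile = shutil.which("gcc") is not None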
Edit: Not fixed.
The current script bypasses the text generation step when the model is compiled. However, if we change from model(...) to raw_model(...), we can still generate text when the model is compiled.
Lines 459 to 461 in 6104ab1
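Roughly, the change looks like this (assuming, as in the script, that raw_model is the unwrapped/uncompiled module and that its forward returns (logits, loss)):

import torch
import torch.nn.functional as F

def generate(raw_model, xgen, max_new_tokens=32, device="cuda"):
    # sample through raw_model(...) instead of model(...), so the compiled graph
    # is not re-triggered by the growing sequence length
    sample_rng = torch.Generator(device=device)
    sample_rng.manual_seed(42)
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits, _ = raw_model(xgen)                   # (B, T, vocab_size)
            probs = F.softmax(logits[:, -1, :], dim=-1)   # last position only
            topk_probs, topk_idx = torch.topk(probs, 50, dim=-1)
            ix = torch.multinomial(topk_probs, 1, generator=sample_rng)
            xgen = torch.cat((xgen, torch.gather(topk_idx, -1, ix)), dim=1)
    return xgen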
The data loader prepares the input data batch in chunks. Let's say the chunk size is 6; then you have a sliding-window approach where you advance each chunk by 6, as you show in the video:
Tokenized text: [ 5962, 22307, 25, 198, 8421, 356, 5120, 597, 2252, 11, 3285, 502, 2740, 13, 198, 198, 3237, 25, 1081, 5248, 461, 11, 2740, 13, 99]
Batch inputs:
tensor([[ 5962, 22307, 25, 198, 8421, 356],
[ 5120, 597, 2252, 11, 3285, 502],
[ 2740, 13, 198, 198, 3237, 25],
[ 198, 5248, 461, 11, 2740, 13]])
Batch targets (inputs shifted by +1):
tensor([[22307, 25, 198, 8421, 356, 5120],
[ 597, 2252, 11, 3285, 502, 2740],
[ 13, 198, 198, 3237, 25, 1081],
[ 5248, 461, 11, 2740, 13, 99]])
This works well, and this is usually also how I do it.
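In code, the chunked version is roughly the following (tokens is a 1-D tensor of token ids, pos the current position, B and T the batch size and sequence length):

import torch

def next_batch(tokens, pos, B, T):
    buf = tokens[pos : pos + B * T + 1]
    x = buf[:-1].view(B, T)   # inputs
    y = buf[1:].view(B, T)    # targets, shifted by one token
    pos += B * T              # advance by a full chunk of B*T tokens
    return x, y, pos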
However, I think that for exactly reproducing the original GPT-2 model, they had overlaps between the inputs, i.e., each new chunk starts just one token after the previous one:
Batch Inputs:
tensor([[ 5962, 22307,    25,   198,  8421,   356],
        [22307,    25,   198,  8421,   356,  5120],
        [   25,   198,  8421,   356,  5120,   597],
        [  198,  8421,   356,  5120,   597,  2252],
        ...])
Batch Targets:
tensor([[22307, 25, 198, 8421, 356, 5120],
[ 25, 198, 8421, 356, 5120, 597],
[ 198, 8421, 356, 5120, 597, 2252],
[ 8421, 356, 5120, 597, 2252, 11],
...])
I.e., instead of advancing the input by "chunk size", they advanced the input position by 1. Please correct me if I'm wrong.
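The overlapping variant described here would then look roughly like the sketch below (same assumptions as the earlier sketch; this is just an illustration of the idea, not a claim about the original training code):

import torch

def next_batch_stride1(tokens, pos, B, T):
    # each row starts one token after the previous one, so the windows overlap
    x = torch.stack([tokens[pos + i     : pos + i + T]     for i in range(B)])
    y = torch.stack([tokens[pos + i + 1 : pos + i + T + 1] for i in range(B)])
    pos += B                  # advance by one token per row instead of B*T
    return x, y, pos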