karpathy / build-nanogpt
Video+code lecture on building nanoGPT from scratch
Thanks Andrej for the incredible video and content!!!
While trying to follow along (starting with the play.ipynb file), I have trouble getting the log file "log124M_40B/log.txt". I am on a PC and it's impossible for me to reproduce your results all the way to the end :(.
Could you provide a download link to this file, or point to its location (if I missed it in the repo...)?
fullgraph=True will make sure that there are no graph breaks (this may already be the case). mode="reduce-overhead" will use CUDA graphs if possible. See in [these benchmarks] that going from regular torch.compile to reduce-overhead gives a good 70-100% speed-up on top of regular torch.compile.
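For reference, a minimal sketch of what this could look like; the tiny Sequential model below is just a stand-in for the GPT module in the training script:

import torch
import torch.nn as nn

# Stand-in module; in the training script this would be the GPT model.
model = nn.Sequential(nn.Linear(8, 8), nn.GELU(), nn.Linear(8, 8)).cuda()

# fullgraph=True errors out on any graph break instead of silently falling back;
# mode="reduce-overhead" captures CUDA graphs when possible.
model = torch.compile(model, fullgraph=True, mode="reduce-overhead")

x = torch.randn(4, 8, device="cuda")
for _ in range(3):  # a few warm-up iterations so the CUDA graphs get recorded
    y = model(x)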
Hi,
I have tried to implement GPT-2 from scratch following the video tutorial. However, when I try to execute the code on 2 GPUs with:
torchrun --standalone --nproc_per_node=2 GPT.py
My program fails with the following error message:
Device: cuda:1
Device Count: 1
[rank1]: Traceback (most recent call last):
[rank1]: File "/my_transformer/GPT.py", line 238, in <module>
[rank1]: torch.cuda.set_device(device)
[rank1]: File "/.local/lib/python3.9/site-packages/torch/cuda/__init__.py", line 399, in set_device
[rank1]: torch._C._cuda_setDevice(device)
[rank1]: RuntimeError: CUDA error: invalid device ordinal
[rank1]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank1]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank1]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Device: cuda:0
Device Count: 1
Master-Process: True
Total desired batch size: 524288
Calculated gradient accumulation steps: 16.
loaded 338025 tokens.
W0626 10:12:11.821799 22703772874560 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2031187 closing signal SIGTERM
E0626 10:12:11.853472 22703772874560 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 1 (pid: 2031188) of binary: ~/my_transformer/.venv/bin/python3.9
Traceback (most recent call last):
File "~/my_transformer/.venv/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/.local/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "~/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 879, in main
run(args)
File "/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "~/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "~/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
GPT.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-06-26_10:12:11
host : haicn01.localdomain
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 2031188)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
If I execute with just 1 GPU, I get another error:
[rank0]: OutOfMemoryError: CUDA out of memory. Tried to allocate 786.00 MiB. GPU
Any ideas what the reason could be? I followed the video tutorial exactly and also checked the code in the repository. I should have enough memory; according to nvidia-smi I get the following output:
Wed Jun 26 10:51:56 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-SXM4-40GB On | 00000000:CA:00.0 Off | On |
| N/A 55C P0 165W / 400W | 612MiB / 40960MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+==================================+===========+=======================|
| 0 8 0 0 | 12MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
| 0 9 0 1 | 12MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Thanks in advance.
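For context, the device-selection part of my script follows the lecture, roughly like this (the environment variables are set by torchrun):

import os
import torch

# torchrun sets RANK/LOCAL_RANK/WORLD_SIZE for each spawned process
ddp_rank = int(os.environ['RANK'])
ddp_local_rank = int(os.environ['LOCAL_RANK'])
ddp_world_size = int(os.environ['WORLD_SIZE'])
# each process picks its own GPU, so rank 1 needs a second visible CUDA device
device = f'cuda:{ddp_local_rank}'
torch.cuda.set_device(device)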
When I loaded the smallest GPT-2 model weights from Hugging Face and performed inference using both flash attention and a manually implemented attention under the same seed setting, I obtained consistent results within each method individually. However, the results between the two methods were not consistent, and the manually implemented attention seemed to produce more reasonable outputs. Is this normal?
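For reference, a stripped-down version of the comparison I mean (random tensors instead of the real model, just to illustrate the two code paths):

import torch
import torch.nn.functional as F

torch.manual_seed(42)
B, nh, T, hs = 1, 12, 16, 64                       # batch, heads, seq len, head size
q, k, v = (torch.randn(B, nh, T, hs) for _ in range(3))

# manual attention with an explicit causal mask
att = (q @ k.transpose(-2, -1)) / (hs ** 0.5)
att = att.masked_fill(torch.tril(torch.ones(T, T)) == 0, float('-inf'))
y_manual = F.softmax(att, dim=-1) @ v

# fused flash attention path
y_flash = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# the two paths reorder floating point operations differently, so small numerical
# differences are expected, and those can flip sampled tokens even under the same seed
print((y_manual - y_flash).abs().max())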
I noticed that in the following snippet, the std of nn.Embedding is set to 0.02:
def _init_weights(self, module):
    if isinstance(module, nn.Linear):
        std = 0.02
        if hasattr(module, 'NANOGPT_SCALE_INIT'):
            std *= (2 * self.config.n_layer) ** -0.5
        torch.nn.init.normal_(module.weight, mean=0.0, std=std)
        if module.bias is not None:
            torch.nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
The official implementation sets it to 0.01, as noted in the video. It only matters for the positional embeddings, due to the weight-sharing scheme between wte and lm_head.
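A minimal sketch of the sharing in question (names follow the repo; the config numbers are just the GPT-2 124M values):

import torch.nn as nn

n_embd, vocab_size, block_size = 768, 50257, 1024

wte = nn.Embedding(vocab_size, n_embd)                # token embeddings
wpe = nn.Embedding(block_size, n_embd)                # positional embeddings
lm_head = nn.Linear(n_embd, vocab_size, bias=False)   # output projection

lm_head.weight = wte.weight                           # weight sharing: one parameter

# std=0.02 on wte is therefore also the init of lm_head; only wpe would change
# if the embedding init used std=0.01 instead.
nn.init.normal_(wte.weight, mean=0.0, std=0.02)
nn.init.normal_(wpe.weight, mean=0.0, std=0.02)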
Have a question - in the current form of making batches, aren't we throwing away information? E.g. we take an input and transform it into a B x T matrix. Now for each row, the first token is blind to the previous tokens, as we never put that sequence into the training loop. Wouldn't a better way to build the dataloader be something like a moving window?
Thanks so much Andrej for making these incredible videos!!!!
Could you please make them Google Colab friendly? I'm getting RuntimeErrors in the fineweb.py code, and compile won't work on Windows because Triton can't be installed there.
Thanks so much!!
I thought tensor parallelism would be an interesting idea. There's a tutorial for this and even some code examples, but so far no joy.
I started simple, trying to shard the MLP like this:
# run using: torchrun --standalone --nproc-per-node=2 train_gpt2_tp.py
import os
import torch.nn as nn
from torch.distributed._tensor.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import parallelize_module, ColwiseParallel, RowwiseParallel

_world_size = int(os.environ["WORLD_SIZE"])
device_mesh = init_device_mesh(device_type="cuda", mesh_shape=(_world_size,))

class Block(nn.Module):
    def __init__(self, config):
        ...
        # was: self.mlp = MLP(config)
        self.mlp = parallelize_module(
            module=MLP(config),
            device_mesh=device_mesh,
            parallelize_plan={
                "c_fc": ColwiseParallel(),
                "c_proj": RowwiseParallel(),
            },
        )
But PyTorch (nightly) gives me grief:
[rank0]: Traceback (most recent call last):
[rank0]: File "/mnt/Sync-shared/projects/repos/build-nanogpt/train_gpt2_tp.py", line 326, in <module>
[rank0]: norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.12/site-packages/torch/nn/utils/clip_grad.py", line 21, in _no_grad_wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.12/site-packages/torch/nn/utils/clip_grad.py", line 68, in clip_grad_norm_
[rank0]: norms.extend(torch._foreach_norm(device_grads, norm_type))
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.12/site-packages/torch/_compile.py", line 31, in inner
[rank0]: return disable_fn(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 600, in _fn
[rank0]: return fn(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.12/site-packages/torch/distributed/_tensor/api.py", line 309, in __torch_dispatch__
[rank0]: return DTensor._op_dispatcher.dispatch(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.12/site-packages/torch/distributed/_tensor/_dispatch.py", line 115, in dispatch
[rank0]: op_info = self.unwrap_to_op_info(op_call, args, kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.12/site-packages/torch/distributed/_tensor/_dispatch.py", line 348, in unwrap_to_op_info
[rank0]: args_schema.append(try_get_replicate_spec(arg, mesh))
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.12/site-packages/torch/distributed/_tensor/_dispatch.py", line 329, in try_get_replicate_spec
[rank0]: raise RuntimeError(
[rank0]: RuntimeError: aten._foreach_norm.Scalar: got mixed torch.Tensor and DTensor, need to convert all torch.Tensor to DTensor before calling distributed operators!
[rank1]: Traceback (most recent call last):
[rank1]: File "/mnt/Sync-shared/projects/repos/build-nanogpt/train_gpt2_tp.py", line 326, in <module>
[rank1]: norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/lib/python3.12/site-packages/torch/nn/utils/clip_grad.py", line 21, in _no_grad_wrapper
[rank1]: return func(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/lib/python3.12/site-packages/torch/nn/utils/clip_grad.py", line 68, in clip_grad_norm_
[rank1]: norms.extend(torch._foreach_norm(device_grads, norm_type))
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/lib/python3.12/site-packages/torch/_compile.py", line 31, in inner
[rank1]: return disable_fn(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 600, in _fn
[rank1]: return fn(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/lib/python3.12/site-packages/torch/distributed/_tensor/api.py", line 309, in __torch_dispatch__
[rank1]: return DTensor._op_dispatcher.dispatch(
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/lib/python3.12/site-packages/torch/distributed/_tensor/_dispatch.py", line 115, in dispatch
[rank1]: op_info = self.unwrap_to_op_info(op_call, args, kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/lib/python3.12/site-packages/torch/distributed/_tensor/_dispatch.py", line 348, in unwrap_to_op_info
[rank1]: args_schema.append(try_get_replicate_spec(arg, mesh))
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/lib/python3.12/site-packages/torch/distributed/_tensor/_dispatch.py", line 329, in try_get_replicate_spec
[rank1]: raise RuntimeError(
[rank1]: RuntimeError: aten._foreach_norm.Scalar: got mixed torch.Tensor and DTensor, need to convert all torch.Tensor to DTensor before calling distributed operators!
As a quick fix I tried converting what I thought were DTensors to local tensors:
class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)
        self.gelu = nn.GELU(approximate='tanh')
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)
        self.c_proj.NANOGPT_SCALE_INIT = 1

    def forward(self, x):
        x = self.c_fc(x)
        x = self.gelu(x)
        x = self.c_proj(x)
        return x.to_local()  # change here!
but then I get even more grief 🤦‍♂️:
[rank0]: File "/mnt/Sync-shared/projects/repos/build-nanogpt/train_gpt2_tp.py", line 58, in forward
[rank0]: return x.to_local()
[rank0]: ^^^^^^^^^^
[rank0]: AttributeError: 'AsyncCollectiveTensor' object has no attribute 'to_local'
Any ideas?
Below is what I get every time I try to shard the dataset. It does not look like the last shard is completing; I ran this multiple times, and each time it stops in the same spot. Any ideas?
Shard 97: 100%|██████████| 99999910/100000000 [00:10<00:00, 9236426.65tokens/s]
Shard 98: 100%|██████████| 99999499/100000000 [00:11<00:00, 8723382.11tokens/s]
Shard 99:  54%|█████▍    | 53989101/100000000 [00:08<00:07, 6051927.02tokens/s]
PS E:\build-nanogpt-master\build-nanogpt-master>
The issue is that gcc is not available on the machine, at least in my setup.
Check for GCC with the type gcc command or the path /usr/bin/gcc. If it is not available, install it via the build-essential package:
sudo apt-get install build-essential
Restart the shell and set use_compile = True in the training script. This worked for my setup.
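For what it's worth, the flag could also be gated on compiler availability; use_compile is the existing flag in the training script, and the check itself is just an idea, not something from the repo:

import shutil

# enable torch.compile only if a C compiler is on PATH (Triton/Inductor need one)
use_compile = shutil.which("gcc") is not None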
Edit: Not fixed.
The current script bypasses the text generation step when the model is compiled. However, if we change from model(...) to raw_model(...), we can still generate text when the model is compiled.
Lines 459 to 461 in 6104ab1
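Roughly, the change looks like this (assuming, as in the script, that raw_model is the unwrapped/uncompiled module and that its forward returns (logits, loss)):

import torch
import torch.nn.functional as F

def generate(raw_model, xgen, max_new_tokens=32, device="cuda"):
    # sample through raw_model(...) instead of model(...), so the compiled graph
    # is not re-triggered by the growing sequence length
    sample_rng = torch.Generator(device=device)
    sample_rng.manual_seed(42)
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits, _ = raw_model(xgen)                   # (B, T, vocab_size)
            probs = F.softmax(logits[:, -1, :], dim=-1)   # last position only
            topk_probs, topk_idx = torch.topk(probs, 50, dim=-1)
            ix = torch.multinomial(topk_probs, 1, generator=sample_rng)
            xgen = torch.cat((xgen, torch.gather(topk_idx, -1, ix)), dim=1)
    return xgen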
The data loader prepares the input data batch in chunks. Let's say the chunk size is 6; then you have a sliding-window approach where you advance each chunk by 6, as you show in the video:
Tokenized text: [ 5962, 22307, 25, 198, 8421, 356, 5120, 597, 2252, 11, 3285, 502, 2740, 13, 198, 198, 3237, 25, 1081, 5248, 461, 11, 2740, 13, 99]
Batch inputs:
tensor([[ 5962, 22307, 25, 198, 8421, 356],
[ 5120, 597, 2252, 11, 3285, 502],
[ 2740, 13, 198, 198, 3237, 25],
[ 198, 5248, 461, 11, 2740, 13]])
Batch targets (inputs shifted by +1):
tensor([[22307, 25, 198, 8421, 356, 5120],
[ 597, 2252, 11, 3285, 502, 2740],
[ 13, 198, 198, 3237, 25, 1081],
[ 5248, 461, 11, 2740, 13, 99]])
This works well, and this is usually also how I do it.
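In code, the chunked version is roughly the following (tokens is a 1-D tensor of token ids, pos the current position, B and T the batch size and sequence length):

import torch

def next_batch(tokens, pos, B, T):
    buf = tokens[pos : pos + B * T + 1]
    x = buf[:-1].view(B, T)   # inputs
    y = buf[1:].view(B, T)    # targets, shifted by one token
    pos += B * T              # advance by a full chunk of B*T tokens
    return x, y, pos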
However, I think that for exactly reproducing the original GPT-2 model, they had overlaps between the inputs, i.e., each new chunk starts just one token after the previous one:
Batch Inputs:
tensor([[ 5962, 22307,    25,   198,  8421,   356],
        [22307,    25,   198,  8421,   356,  5120],
        [   25,   198,  8421,   356,  5120,   597],
        [  198,  8421,   356,  5120,   597,  2252],
        ...])
Batch Targets:
tensor([[22307, 25, 198, 8421, 356, 5120],
[ 25, 198, 8421, 356, 5120, 597],
[ 198, 8421, 356, 5120, 597, 2252],
[ 8421, 356, 5120, 597, 2252, 11],
...])
I.e., instead of advancing the input by "chunk size", they advanced the input position by 1. Please correct me if I'm wrong.
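The overlapping variant described here would then look roughly like the sketch below (same assumptions as the earlier sketch; this is just an illustration of the idea, not a claim about the original training code):

import torch

def next_batch_stride1(tokens, pos, B, T):
    # each row starts one token after the previous one, so the windows overlap
    x = torch.stack([tokens[pos + i     : pos + i + T]     for i in range(B)])
    y = torch.stack([tokens[pos + i + 1 : pos + i + T + 1] for i in range(B)])
    pos += B                  # advance by one token per row instead of B*T
    return x, y, pos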