build-nanogpt's Introduction

I like deep neural nets.

build-nanogpt's People

Contributors

karpathy, wilsoncwu, zhangfaen

build-nanogpt's Issues

Cannot get the log file "log124M_40B/log.txt"?

Thanks Andrej for the incredible video and content !!!

While trying to follow along (starting with the play.ipynb file), I had trouble getting the log file "log124M_40B/log.txt". I am on a PC, and it is impossible for me to reproduce your results through to the very end :(.

Could you provide the download link to this file, or point to the location of the file (if I missed it from the repo...)?

Executing with 1 GPU raises "OutOfMemory Exception", executing with 2 GPUs "RuntimeError: CUDA error: invalid device ordinal"

Hi,

I have implemented GPT-2 from scratch following the video tutorial. However, when I run the code on 2 GPUs with:

torchrun --standalone --nproc_per_node=2 GPT.py

My program fails with the following error message:

Device: cuda:1
Device Count: 1
[rank1]: Traceback (most recent call last):
[rank1]:   File "/my_transformer/GPT.py", line 238, in <module>
[rank1]:     torch.cuda.set_device(device)
[rank1]:   File "/.local/lib/python3.9/site-packages/torch/cuda/__init__.py", line 399, in set_device
[rank1]:     torch._C._cuda_setDevice(device)
[rank1]: RuntimeError: CUDA error: invalid device ordinal
[rank1]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank1]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank1]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Device: cuda:0
Device Count: 1
Master-Process: True
Total desired batch size: 524288
Calculated gradient accumulation steps: 16.
loaded 338025 tokens.
W0626 10:12:11.821799 22703772874560 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2031187 closing signal SIGTERM
E0626 10:12:11.853472 22703772874560 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 1 (pid: 2031188) of binary: ~/my_transformer/.venv/bin/python3.9
Traceback (most recent call last):
  File "~/my_transformer/.venv/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/.local/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "~/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "~/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "~/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
GPT.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-26_10:12:11
  host      : haicn01.localdomain
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 2031188)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

If I execute with just 1 GPU, I get another error:

[rank0]: OutOfMemoryError: CUDA out of memory. Tried to allocate 786.00 MiB. GPU

Any ideas what the reason could be? I followed the video tutorial exactly and also checked the code in the repository. I should have enough memory. nvidia-smi reports the following:

Wed Jun 26 10:51:56 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  |   00000000:CA:00.0 Off |                   On |
| N/A   55C    P0            165W /  400W |     612MiB /  40960MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                            |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                     Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                  |        ECC|                       |
|==================+==================================+===========+=======================|
|  0    8   0   0  |              12MiB /  4864MiB    | 14      0 |  1   0    0    0    0 |
|                  |                 0MiB /  8191MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0    9   0   1  |              12MiB /  4864MiB    | 14      0 |  1   0    0    0    0 |
|                  |                 0MiB /  8191MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Thanks in advance.
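
One thing the logs above already show: both ranks print "Device Count: 1", yet --nproc_per_node=2 starts a second process that asks for cuda:1, an index that does not exist when only one device is visible. The nvidia-smi output also lists MIG instances of 4864MiB each rather than the full 40GB, which may explain the out-of-memory error, although that is a guess. Below is a minimal sketch of a pre-flight check, written in plain Python so it runs without a GPU. pick_device is a hypothetical helper; in a real script, visible_devices would come from torch.cuda.device_count() and local_rank from the LOCAL_RANK environment variable that torchrun sets.

```python
def pick_device(local_rank: int, visible_devices: int) -> str:
    """Return the device string for this rank, or fail with a readable message.

    Hypothetical helper: in a training script, visible_devices would be
    torch.cuda.device_count() and local_rank would be int(os.environ["LOCAL_RANK"]).
    """
    if visible_devices < 1:
        raise RuntimeError("no CUDA device is visible to this process")
    if local_rank >= visible_devices:
        raise RuntimeError(
            f"local rank {local_rank} needs cuda:{local_rank}, but only "
            f"{visible_devices} device(s) are visible; lower --nproc_per_node"
        )
    return f"cuda:{local_rank}"


# Values taken from the log above: two processes, "Device Count: 1".
print(pick_device(0, 1))      # rank 0 can bind to cuda:0
try:
    pick_device(1, 1)         # rank 1 has no device to bind to
except RuntimeError as err:
    print(err)
```

Failing early like this replaces the opaque "invalid device ordinal" with a message that names the mismatch between process count and visible devices.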

Different inference results between flash attention and manually implemented attention appeared.

When I loaded the smallest GPT-2 model weights from Hugging Face and performed inference using both flash attention and a manually implemented attention under the same seed setting, I obtained consistent results within each method individually. However, the results between the two methods were not consistent, and the manually implemented attention seemed to produce more reasonable outputs. Is this normal?
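
For reference, flash attention is an exact algorithm: it computes the same softmax-weighted sum as the manual version, only in a streaming fashion, so the two should agree up to floating-point rounding. Larger differences may point to a setup mismatch, for example a missing causal mask; that is a guess, not a diagnosis. The sketch below is plain Python for a single query row, not the fused kernel, and naive_attention and streaming_attention are illustrative names.

```python
import math


def naive_attention(q, keys, values):
    """Softmax(q . k / sqrt(d)) weighted sum of values, computed the textbook way."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def streaming_attention(q, keys, values):
    """Same quantity, visiting one key/value at a time with a running max (online softmax)."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        s = scale * sum(a * b for a, b in zip(q, k))
        new_max = max(running_max, s)
        # Rescale everything accumulated so far to the new maximum.
        correction = math.exp(running_max - new_max)
        p = math.exp(s - new_max)
        denom = denom * correction + p
        acc = [a * correction + p * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]


q = [0.3, -1.2, 0.5, 2.0]
keys = [[0.1, 0.2, 0.3, 0.4], [1.0, -1.0, 0.5, 0.0], [-0.7, 0.9, 0.2, 1.5]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

a = naive_attention(q, keys, values)
b = streaming_attention(q, keys, values)
print(max(abs(x - y) for x, y in zip(a, b)))  # rounding-level difference only
```

In a causal model, the keys and values passed in for position t would be those of positions 0 through t only.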

Embeddings are initialized with std of 0.02

I noticed that in the following snippet the std of nn.Embedding is set to 0.02:

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            std = 0.02
            if hasattr(module, 'NANOGPT_SCALE_INIT'):
                std *= (2 * self.config.n_layer) ** -0.5
            torch.nn.init.normal_(module.weight, mean=0.0, std=std)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

The official implementation sets it to 0.01, as noted in the video. It only matters for the positional embeddings, due to the weight sharing scheme between wte and lm_head.
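
To put numbers on the snippet above, here is the arithmetic for the scaled initialization of layers flagged with NANOGPT_SCALE_INIT, assuming the 12-layer GPT-2 (124M) configuration. scaled_std is an illustrative name, not part of the repository.

```python
def scaled_std(n_layer: int, base_std: float = 0.02) -> float:
    """Std for layers flagged NANOGPT_SCALE_INIT: base_std * (2 * n_layer) ** -0.5."""
    return base_std * (2 * n_layer) ** -0.5


print(f"{scaled_std(12):.5f}")  # 0.00408
```

The embedding branch of the snippet does not apply this scaling, so both embedding tables start at the unscaled 0.02.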

Is dataloader making optimal batches?

I have a question: with the current way of making batches, aren't we throwing away information? We take an input and reshape it into a B * T matrix. For each row, the first token is blind to the tokens that came before it, because that sequence never goes through the training loop. Wouldn't a better data loader use something like a moving window?
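
To make the question concrete, the sketch below counts, for every target token, the longest context that any training window gives it. It is a toy illustration in plain Python, not the repository's data loader. max_context_per_target is a hypothetical helper, T is the sequence length, and stride is how far the window advances between rows.

```python
def max_context_per_target(num_tokens, T, stride):
    """Map each target token index to the longest context any window gives it."""
    best = {}
    # A window needs T inputs plus one extra token for the last target.
    for start in range(0, num_tokens - T, stride):
        for j in range(T):
            target = start + j + 1   # index of the token being predicted
            context = j + 1          # number of input tokens visible at position j
            best[target] = max(best.get(target, 0), context)
    return best


chunked = max_context_per_target(num_tokens=25, T=6, stride=6)  # non-overlapping rows
sliding = max_context_per_target(num_tokens=25, T=6, stride=1)  # moving window

# Token 7 is the first target of the second chunk.
print(chunked[7], sliding[7])  # 1 6
```

With stride equal to T, the target right after each chunk boundary is only ever predicted from a single token of context. With a stride of 1, every token past the first few is eventually predicted with the full window, at the cost of roughly T times more windows per pass over the data.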

Issues running the code on Windows

Thanks so much Andrej for making these incredible videos!!!!

Could you please make them Google Colab friendly? I'm getting RuntimeErrors in the fineweb.py code, and torch.compile won't work on Windows because Triton can't be installed there.

Thanks so much!!

Implement tensor parallelism

I thought tensor parallelism would be an interesting idea. There's a tutorial for this and even some code examples, but so far no joy.

I started simple, trying to shard the MLP like this:

# run using: torchrun --standalone --nproc-per-node=2 train_gpt2_tp.py

from torch.distributed._tensor.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import parallelize_module, ColwiseParallel, RowwiseParallel

_world_size = int(os.environ["WORLD_SIZE"])
device_mesh = init_device_mesh(device_type="cuda", mesh_shape=(_world_size,))

class Block(nn.Module):

    def __init__(self, config):

        ...
        
        # was: self.mlp = MLP(config)
        self.mlp = parallelize_module( 
            module=MLP(config),
            device_mesh=device_mesh,
            parallelize_plan={
                "c_fc": ColwiseParallel(),
                "c_proj": RowwiseParallel(),
            },
        )

But PyTorch (nightly) gives me grief:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/mnt/Sync-shared/projects/repos/build-nanogpt/train_gpt2_tp.py", line 326, in <module>
[rank0]:     norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.12/site-packages/torch/nn/utils/clip_grad.py", line 21, in _no_grad_wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.12/site-packages/torch/nn/utils/clip_grad.py", line 68, in clip_grad_norm_
[rank0]:     norms.extend(torch._foreach_norm(device_grads, norm_type))
[rank0]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.12/site-packages/torch/_compile.py", line 31, in inner
[rank0]:     return disable_fn(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 600, in _fn
[rank0]:     return fn(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.12/site-packages/torch/distributed/_tensor/api.py", line 309, in __torch_dispatch__
[rank0]:     return DTensor._op_dispatcher.dispatch(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.12/site-packages/torch/distributed/_tensor/_dispatch.py", line 115, in dispatch
[rank0]:     op_info = self.unwrap_to_op_info(op_call, args, kwargs)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.12/site-packages/torch/distributed/_tensor/_dispatch.py", line 348, in unwrap_to_op_info
[rank0]:     args_schema.append(try_get_replicate_spec(arg, mesh))
[rank0]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.12/site-packages/torch/distributed/_tensor/_dispatch.py", line 329, in try_get_replicate_spec
[rank0]:     raise RuntimeError(
[rank0]: RuntimeError: aten._foreach_norm.Scalar: got mixed torch.Tensor and DTensor, need to convert all torch.Tensor to DTensor before calling distributed operators!
[rank1]: Traceback (most recent call last):
[rank1]:   File "/mnt/Sync-shared/projects/repos/build-nanogpt/train_gpt2_tp.py", line 326, in <module>
[rank1]:     norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/lib/python3.12/site-packages/torch/nn/utils/clip_grad.py", line 21, in _no_grad_wrapper
[rank1]:     return func(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/lib/python3.12/site-packages/torch/nn/utils/clip_grad.py", line 68, in clip_grad_norm_
[rank1]:     norms.extend(torch._foreach_norm(device_grads, norm_type))
[rank1]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/lib/python3.12/site-packages/torch/_compile.py", line 31, in inner
[rank1]:     return disable_fn(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 600, in _fn
[rank1]:     return fn(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/lib/python3.12/site-packages/torch/distributed/_tensor/api.py", line 309, in __torch_dispatch__
[rank1]:     return DTensor._op_dispatcher.dispatch(
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/lib/python3.12/site-packages/torch/distributed/_tensor/_dispatch.py", line 115, in dispatch
[rank1]:     op_info = self.unwrap_to_op_info(op_call, args, kwargs)
[rank1]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/lib/python3.12/site-packages/torch/distributed/_tensor/_dispatch.py", line 348, in unwrap_to_op_info
[rank1]:     args_schema.append(try_get_replicate_spec(arg, mesh))
[rank1]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/lib/python3.12/site-packages/torch/distributed/_tensor/_dispatch.py", line 329, in try_get_replicate_spec
[rank1]:     raise RuntimeError(
[rank1]: RuntimeError: aten._foreach_norm.Scalar: got mixed torch.Tensor and DTensor, need to convert all torch.Tensor to DTensor before calling distributed operators!

As a quick fix I tried converting what I thought were DTensors to local tensors:

class MLP(nn.Module):

    def __init__(self, config):
        super().__init__()
        self.c_fc    = nn.Linear(config.n_embd, 4 * config.n_embd)
        self.gelu    = nn.GELU(approximate='tanh')
        self.c_proj  = nn.Linear(4 * config.n_embd, config.n_embd)
        self.c_proj.NANOGPT_SCALE_INIT = 1

    def forward(self, x):
        x = self.c_fc(x)
        x = self.gelu(x)
        x = self.c_proj(x)
        return x.to_local() # change here!

but then I get even more grief 🤦‍♂️:

[rank0]:   File "/mnt/Sync-shared/projects/repos/build-nanogpt/train_gpt2_tp.py", line 58, in forward
[rank0]:     return x.to_local()
[rank0]:            ^^^^^^^^^^
[rank0]: AttributeError: 'AsyncCollectiveTensor' object has no attribute 'to_local'

Any ideas? 🙏

Sharding the dataset not completing?

Below is what I get every time I try to shard the dataset. It does not look like the last shard completes. I ran this multiple times, and each time it stops at the same spot. Any ideas?

Shard 97: 100%|█████████▉| 99999910/100000000 [00:10<00:00, 9236426.65tokens/s]
Shard 98: 100%|█████████▉| 99999499/100000000 [00:11<00:00, 8723382.11tokens/s]
Shard 99:  54%|█████▍    | 53989101/100000000 [00:08<00:07, 6051927.02tokens/s]
PS E:\build-nanogpt-master\build-nanogpt-master>
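
One possible reading of the log above, offered as an assumption rather than a confirmed diagnosis: if the script packs tokens into fixed-size shards and writes whatever is left over once the data runs out, then the last progress bar stopping short of 100% is expected, since the dataset size is unlikely to be an exact multiple of the shard size. The sketch below shows that pattern in plain Python; shard_documents is a hypothetical helper, not the code in fineweb.py.

```python
def shard_documents(docs, shard_size):
    """Pack tokenized documents into fixed-size shards; the last one may be partial."""
    shards = []
    buffer = []
    for tokens in docs:
        buffer.extend(tokens)
        # Emit full shards as soon as enough tokens have accumulated.
        while len(buffer) >= shard_size:
            shards.append(buffer[:shard_size])
            buffer = buffer[shard_size:]
    # Whatever is left when the data runs out becomes a final, shorter shard.
    if buffer:
        shards.append(buffer)
    return shards


docs = [list(range(7)), list(range(5)), list(range(4))]  # 16 tokens in total
shards = shard_documents(docs, shard_size=6)
print([len(s) for s in shards])  # [6, 6, 4]
```

A quick way to tell the two cases apart is to check whether a file for the final shard exists on disk and whether its size is roughly the fraction shown in the last progress bar.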

Fix torch.compile Issue - Error with HellaSwag eval and Generation

The issue is that gcc is not available on the machine, at least in my setup.

Check whether GCC is available by running "type gcc" or by looking for /usr/bin/gcc. If it is not available, install it via the build-essential package:

sudo apt-get install build-essential

Restart the shell and set use_compile = True in the training script. This worked for my setup.

Edit: Not fixed.

Text generation can use raw_model instead of model

The current script bypasses the text generation step when the model is compiled. However, if we change from model(...) to raw_model(...), we can still generate the text when the model is compiled.

build-nanogpt/train_gpt2.py

Lines 459 to 461 in 6104ab1

with torch.no_grad():
    with torch.autocast(device_type=device_type, dtype=torch.bfloat16):
        logits, loss = model(xgen) # (B, T, vocab_size)

Chunking method in the original GPT-2 training dataset

The data loader prepares the input data batch in chunks. Let's say the chunk size is 6, then you have a sliding window approach where you advance each chunk by 6 as you show in the video:

Tokenized text: [ 5962, 22307, 25, 198, 8421, 356, 5120, 597, 2252, 11, 3285, 502, 2740, 13, 198, 198, 3237, 25, 1081, 5248, 461, 11, 2740, 13, 99]

Batch inputs:

tensor([[ 5962, 22307, 25,   198, 8421, 356],
        [ 5120, 597,   2252, 11,  3285, 502],
        [ 2740, 13,    198,  198, 3237, 25],
        [ 1081, 5248,  461,  11,  2740, 13]])

Batch targets (inputs shifted by +1):

tensor([[22307,  25, 198, 8421, 356, 5120],
        [ 597,  2252, 11, 3285, 502, 2740],
        [ 13,   198, 198, 3237,  25, 1081],
        [ 5248, 461,  11, 2740,  13, 99]])

This works well, and this is usually also how I do it.

However, to reproduce the original GPT-2 model exactly, I think they had overlaps between the inputs. That is, each new chunk starts just one token after the previous one:

Batch Inputs:

tensor([[ 5962, 22307,    25,  198, 8421,  356],
        [22307,    25,   198, 8421,  356, 5120],
        [   25,   198,  8421,  356, 5120,  597],
        [  198,  8421,   356, 5120,  597, 2252],
        ...])

Batch Targets:
tensor([[22307,    25,   198,  8421,  356, 5120],
        [   25,   198,  8421,   356, 5120,  597],
        [  198,  8421,   356,  5120,  597, 2252],
        [ 8421,   356,  5120,   597, 2252,   11],
        ...])

I.e., instead of advancing the input by "chunk size", they advanced the input position by 1. Please correct me if I'm wrong.
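
The two schemes differ only in how far the window advances, which the sketch below makes explicit using the tokenized text from this post. make_windows is an illustrative helper, not the repository's data loader, and whether the original GPT-2 pipeline actually used a stride of 1 remains the open question here.

```python
def make_windows(tokens, T, stride):
    """Return (inputs, targets); each target row is its input row shifted by one token."""
    inputs, targets = [], []
    # A window needs T inputs plus one extra token for the last target.
    for start in range(0, len(tokens) - T, stride):
        inputs.append(tokens[start:start + T])
        targets.append(tokens[start + 1:start + T + 1])
    return inputs, targets


tokens = [5962, 22307, 25, 198, 8421, 356, 5120, 597, 2252, 11, 3285, 502,
          2740, 13, 198, 198, 3237, 25, 1081, 5248, 461, 11, 2740, 13, 99]

chunk_x, chunk_y = make_windows(tokens, T=6, stride=6)  # non-overlapping chunks
slide_x, slide_y = make_windows(tokens, T=6, stride=1)  # advance by one token

print(chunk_x[3])                  # tokens[18:24] -> [1081, 5248, 461, 11, 2740, 13]
print(slide_x[1])                  # [22307, 25, 198, 8421, 356, 5120]
print(len(chunk_x), len(slide_x))  # 4 19
```

On these 25 tokens the stride-1 variant yields 19 windows instead of 4, so each token is seen many more times per pass over the data.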
