ng-video-lecture's People

Contributors

andrei-aksionov, karpathy

ng-video-lecture's Issues

Mac Studio can't generate tokens

The gpt.py code runs well on CUDA, but when I set the device to MPS the model can be trained, yet it cannot generate tokens: generation never finishes and keeps running forever. Thanks for any reply.
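
For context, the setup presumably swaps gpt.py's device line (which only selects between cuda and cpu) for something like the following; this is an assumption about the reporter's change, not code from the repo:

import torch

# hypothetical device selection that adds Apple's MPS backend
device = 'cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu'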

How to save, load, and finetune the model

Model:

model = LanguageModel()

To save:

import pickle

with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)

To load:

import pickle

with open('model.pkl', 'rb') as f:
    model = pickle.load(f)
m = model.to(device)
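
For finetuning specifically, the more idiomatic PyTorch route is to checkpoint the state_dict rather than pickle the whole module; a minimal sketch, with 'model.pt' as an illustrative path:

import torch

# save only the parameters (more robust to code changes than pickling the module)
torch.save(model.state_dict(), 'model.pt')

# load into a freshly constructed model, then keep training as usual to finetune
model = LanguageModel()
model.load_state_dict(torch.load('model.pt', map_location=device))
m = model.to(device)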

KeyError

Has anyone come across the same issue as the following?

Traceback (most recent call last):
  File "./gpt.py", line 224, in <module>
    print(decode(m.generate(context, max_new_tokens=3000)[0].tolist()))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./gpt.py", line 32, in <lambda>
    decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string
                               ^^^^^^^^^^^^^^^^^^^^
  File "./gpt.py", line 32, in <listcomp>
    decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string
                                ~~~~^^^
KeyError: -9223372036854775808
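
For what it's worth, -9223372036854775808 is INT64_MIN, which is what NaN becomes when cast to int64 on typical hardware, so one hypothesis (not verifiable from the traceback alone) is that the model produced NaNs during generation and a NaN got cast to an integer token id along the way. A quick check of the cast:

import torch

# NaN cast to int64 yields INT64_MIN, the exact key in the error above
print(torch.tensor(float('nan')).long().item())  # -9223372036854775808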

"index out of range" error when using a different embedding dimension than vocab_size

self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

If I change the 2nd parameter (the dimension of the embedding) to something other than vocab_size (e.g. 128), I get an "index out of range" error in generate().

To replicate the error, just change this line in the notebook:

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, 128)    # <-- change dimension to 128

And then rerun the cell:

torch.Size([32, 128])
tensor(5.2106, grad_fn=<NllLossBackward0>)
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-14-58747f1080e0> in <cell line: 48>()
     46 print(loss)
     47 
---> 48 print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))

5 frames
<ipython-input-14-58747f1080e0> in generate(self, idx, max_new_tokens)
     30         for _ in range(max_new_tokens):
     31             # get the predictions
---> 32             logits, loss = self(idx)
     33             # focus only on the last time step
     34             logits = logits[:, -1, :] # becomes (B, C)

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
   1499                 or _global_backward_pre_hooks or _global_backward_hooks
   1500                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501             return forward_call(*args, **kwargs)
   1502         # Do not call functions when jit is used
   1503         full_backward_hooks, non_full_backward_hooks = [], []

<ipython-input-14-58747f1080e0> in forward(self, idx, targets)
     14 
     15         # idx and targets are both (B,T) tensor of integers
---> 16         logits = self.token_embedding_table(idx) # (B,T,C)
     17 
     18         if targets is None:

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
   1499                 or _global_backward_pre_hooks or _global_backward_hooks
   1500                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501             return forward_call(*args, **kwargs)
   1502         # Do not call functions when jit is used
   1503         full_backward_hooks, non_full_backward_hooks = [], []

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/sparse.py in forward(self, input)
    160 
    161     def forward(self, input: Tensor) -> Tensor:
--> 162         return F.embedding(
    163             input, self.weight, self.padding_idx, self.max_norm,
    164             self.norm_type, self.scale_grad_by_freq, self.sparse)

/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   2208         # remove once script supports set_grad_enabled
   2209         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 2210     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   2211 
   2212 

IndexError: index out of range in self
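
For reference, the bigram version only gets away with nn.Embedding(vocab_size, vocab_size) because the embedding output doubles as the logits over the vocabulary. With a 128-dim embedding, generate() samples from 128 "classes", so it can emit token ids >= vocab_size, which then blow up the embedding lookup on the next step. A minimal sketch of one fix, decoupling the two sizes with a projection (lm_head is the name the lecture itself introduces later):

import torch.nn as nn
from torch.nn import functional as F

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # tokens are embedded into a 128-dim space...
        self.token_embedding_table = nn.Embedding(vocab_size, 128)
        # ...and projected back to vocab_size, so sampled ids stay in range
        self.lm_head = nn.Linear(128, vocab_size)

    def forward(self, idx, targets=None):
        emb = self.token_embedding_table(idx)   # (B,T,128)
        logits = self.lm_head(emb)              # (B,T,vocab_size)
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            loss = F.cross_entropy(logits.view(B*T, C), targets.view(B*T))
        return logits, loss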

pin

How do I crack a PIN that is shown like this (****)?

How is torch broadcasting (T, T) @ (B, T, C) ?!

At around 53:10 of the lecture, Andrej does a matrix multiplication with tensors of size (T, T) and (B, T, C). More precisely: (8, 8) @ (4, 8, 2).

Now, even after looking over PyTorch docs on broadcasting semantics, I'm surprised to see that this works - but sure enough, running the code produces an output of (4, 8, 2).

Can anyone explain how this broadcast works?

// align trailing dimensions
     8, 8
4, 8, 2

// pad missing dimensions with 1
1, 8, 8
4, 8, 2

// expand size-1 dimensions until they match
4, 8, 8
4, 8, 2

// now what???
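
For what it's worth, a sketch of the last step as I understand PyTorch's matmul semantics: once the shapes are broadcast to (4, 8, 8) and (4, 8, 2), the leading dimension is treated as a batch and the last two dimensions are matrix-multiplied per batch element:

import torch

wei = torch.ones(8, 8)      # (T, T)
x = torch.randn(4, 8, 2)    # (B, T, C)

# wei broadcasts to (4, 8, 8); then, for each batch element b,
# out[b] = wei @ x[b], i.e. (8, 8) @ (8, 2) -> (8, 2)
out = wei @ x
print(out.shape)            # torch.Size([4, 8, 2])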

wei value not 100% per row after dropout

It doesn't make sense to me, but

        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)

Although after this step the row-level percentages sum to 100%, after applying the dropout

        wei = self.dropout(wei)

the values can sum to above 100%. Any reason for that? Does it cause any issues? I mean, the overall calculation shouldn't be affected too much, and other parts of the network can compensate, but still.
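
For context, this is standard inverted dropout: during training, surviving elements are scaled by 1/(1-p), so a row that summed to 1.0 before dropout generally does not afterwards; in eval mode dropout is the identity and rows sum to 1.0 again. A quick demonstration:

import torch
import torch.nn as nn

drop = nn.Dropout(0.2)
row = torch.full((1, 10), 0.1)   # a softmax-like row summing to 1.0

drop.train()
print(drop(row).sum())           # varies around 1.0; kept entries are scaled by 1/0.8

drop.eval()
print(drop(row).sum())           # exactly 1.0; dropout is a no-op at eval time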

Call `model.eval()` before generating?

I understand why we have to call model.eval() before calculating the average loss in estimate_loss(). But shouldn't we similarly call model.eval() before we start generating from the model?
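
For reference, a minimal sketch of the pattern in question, with names following gpt.py (the practical difference comes mainly from the dropout layers, which are active in train mode and identity in eval mode):

model.eval()  # put dropout into its inference behavior
with torch.no_grad():
    context = torch.zeros((1, 1), dtype=torch.long, device=device)
    print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))
model.train()  # restore training mode afterwards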

Change the Title Please

"ng" is mostly used to describe Node.js things these days. Might I suggest a quick repo update with a more readable title? Great video and repo, btw. Cheers.

Discrepancy with dimensions

In the Colab notebook linked under your YT video, the dimension annotations for the single-head attention appear to be incorrect.

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,C)
        q = self.query(x) # (B,T,C)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,C)
        out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C)
        return out

I believe v.shape is not (B,T,C) but rather (B,T,hs). In this repository it is correct:

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # input of size (batch, time-step, channels)
        # output of size (batch, time-step, head size)
        B,T,C = x.shape
        k = self.key(x)   # (B,T,hs)
        q = self.query(x) # (B,T,hs)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,hs)
        out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out

This caused me some confusion; maybe you could change it?
Thank you for such a wonderful, educational project!

No license file

Please add an MIT license file.
You can do it quickly in the GitHub GUI by choosing the MIT license template.

About gpt.py lines 134-135

According to the Transformer paper, it seems that we could change

x = x + self.sa(self.ln1(x))
x = x + self.ffwd(self.ln2(x))

to

x = self.ln1(x + self.sa(x))
x = self.ln2(x + self.ffwd(x))

although the result is similar.
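
For context, these are the two standard arrangements of the residual block: the original paper applies LayerNorm after the residual (post-norm), while gpt.py uses the pre-norm variant popularized by GPT-2, which tends to train more stably in deep stacks. A minimal side-by-side sketch (module names are illustrative):

import torch.nn as nn

class PreNormBlock(nn.Module):
    """x = x + sublayer(ln(x)): the arrangement in gpt.py."""
    def __init__(self, sa, ffwd, n_embd):
        super().__init__()
        self.sa, self.ffwd = sa, ffwd
        self.ln1, self.ln2 = nn.LayerNorm(n_embd), nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

class PostNormBlock(nn.Module):
    """x = ln(x + sublayer(x)): the arrangement in the original paper."""
    def __init__(self, sa, ffwd, n_embd):
        super().__init__()
        self.sa, self.ffwd = sa, ffwd
        self.ln1, self.ln2 = nn.LayerNorm(n_embd), nn.LayerNorm(n_embd)

    def forward(self, x):
        x = self.ln1(x + self.sa(x))
        x = self.ln2(x + self.ffwd(x))
        return x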

Strange model behavior when taking the softmax in the wrong dimension

wei = F.softmax(wei, dim=-1) # (B, T, T)

I accidentally changed the softmax dimension to -2 instead of -1 and got incredibly low losses on both the training and validation sets. However, when generating from the model, I get very low-quality results. What is the explanation?

My guess is that I'm somehow leaking information by taking the softmax along the wrong dimension, which may explain why the training loss is very low. However, I don't quite get why the validation loss would also be low.

@karpathy Any idea why this is the case?

bug?: m vs model

I'm curious whether gpt.py is buggy (which is my guess, and GPT-4's guess as well: https://chat.openai.com/share/b50316a4-0f63-4813-8888-9cb3ca68b7f1), or why it isn't.

On line 199 of gpt.py we define m. We then use both m and model, when I think we should be using only one of them (I think line 199 should be model = ... and then we should only use model; there might be some spots where we have to move tensors so everything is on the same device).
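
One relevant PyTorch detail (a fact about the API, not something the repo states): nn.Module.to() moves a module's parameters in place and returns the same object, so m and model refer to the same module, and the mixed usage is confusing rather than incorrect. A quick check, assuming device is defined as in gpt.py:

model = GPTLanguageModel()   # the class defined in gpt.py
m = model.to(device)
print(m is model)            # True: Module.to() is in-place for modules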

time series data like BTC price

If we have time series data like the BTC price, then I suppose we don't need the token embedding.

How do we do the position embedding in this case?
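
For what it's worth, a common arrangement (my assumption, not from this repo) is to replace the token embedding with a linear projection of the continuous values and keep the learned position embedding exactly as in gpt.py:

import torch
import torch.nn as nn

n_embd, block_size = 64, 32   # illustrative sizes

value_proj = nn.Linear(1, n_embd)   # replaces nn.Embedding for continuous inputs
position_embedding_table = nn.Embedding(block_size, n_embd)   # unchanged from gpt.py

prices = torch.randn(4, block_size, 1)   # (B, T, 1) fake price series
tok_emb = value_proj(prices)             # (B, T, n_embd)
pos_emb = position_embedding_table(torch.arange(block_size))   # (T, n_embd)
x = tok_emb + pos_emb                    # (B, T, n_embd), just like in gpt.py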
