ng-video-lecture's Issues
Mac Studio can't generate tokens
The gpt.py code runs well on CUDA, but when I set the device to mps, the model can be trained yet cannot generate tokens: generation never finishes and keeps running forever. Thanks for any reply.
Can it be run on an Ubuntu PC with an NVIDIA 3060 GPU (8 GB)?
How to save, load, and fine-tune the model
Model:
model = LanguageModel()

To save:
import pickle

with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)

To load:
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)
m = model.to(device)
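For fine-tuning, a more standard PyTorch approach is to save the state_dict instead of pickling the whole module. A minimal sketch, assuming the model class and device from gpt.py (the exact class name there may differ):

import torch

# save only the learned parameters (the usual PyTorch recommendation)
torch.save(model.state_dict(), 'model.pt')

# to load: rebuild the architecture first, then restore the weights
model = LanguageModel()  # construct the model exactly as during training
model.load_state_dict(torch.load('model.pt', map_location=device))
m = model.to(device)
m.train()  # ready to continue fine-tuning with the existing training loop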
KeyError
Has anyone come across the same issue as the following?
Traceback (most recent call last):
File "./gpt.py", line 224, in <module>
print(decode(m.generate(context, max_new_tokens=3000)[0].tolist()))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "./gpt.py", line 32, in <lambda>
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string
^^^^^^^^^^^^^^^^^^^^
File "./gpt.py", line 32, in <listcomp>
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string
~~~~^^^
KeyError: -9223372036854775808
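A guess at the root cause (not verified against this repo): if the probabilities passed to torch.multinomial contain NaNs, which has been reported on some MPS builds, the sampled index can be garbage, and a NaN cast to int64 typically becomes the minimum int64 value, exactly the key in the traceback:

import torch

# NaN converted to int64 typically yields the minimum int64 value;
# the exact behavior is platform-dependent
print(torch.tensor(float('nan')).to(torch.int64))  # tensor(-9223372036854775808)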
"index out of range" error when using a different embedding dimension than vocab_size
Line 66 in 5220142
If I change the 2nd parameter (the embedding dimension) to something different from vocab_size (e.g. 128), I get an "index out of range" error in generate().
To replicate the error, just change this line in the notebook:
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, 128) # <-- change dimension to 128
And then rerun the cell:
torch.Size([32, 128])
tensor(5.2106, grad_fn=<NllLossBackward0>)
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-14-58747f1080e0> in <cell line: 48>()
46 print(loss)
47
---> 48 print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))
<ipython-input-14-58747f1080e0> in generate(self, idx, max_new_tokens)
30 for _ in range(max_new_tokens):
31 # get the predictions
---> 32 logits, loss = self(idx)
33 # focus only on the last time step
34 logits = logits[:, -1, :] # becomes (B, C)
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
<ipython-input-14-58747f1080e0> in forward(self, idx, targets)
14
15 # idx and targets are both (B,T) tensor of integers
---> 16 logits = self.token_embedding_table(idx) # (B,T,C)
17
18 if targets is None:
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/sparse.py in forward(self, input)
160
161 def forward(self, input: Tensor) -> Tensor:
--> 162 return F.embedding(
163 input, self.weight, self.padding_idx, self.max_norm,
164 self.norm_type, self.scale_grad_by_freq, self.sparse)
/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
2208 # remove once script supports set_grad_enabled
2209 _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 2210 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
2211
2212
IndexError: index out of range in self
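The error happens because the bigram model uses the embedding output directly as logits: with a 128-wide embedding, generate() samples token ids up to 127, which then index past the end of the (vocab_size, 128) table on the next forward pass. A minimal sketch of a fix (essentially what the lecture does later) is to project the embedding back to vocab_size with a linear head:

import torch
import torch.nn as nn
import torch.nn.functional as F

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size, n_embd=128):
        super().__init__()
        # the embedding can now have any width n_embd ...
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        # ... because a linear head projects it back to vocab_size logits
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        tok_emb = self.token_embedding_table(idx)  # (B, T, n_embd)
        logits = self.lm_head(tok_emb)             # (B, T, vocab_size)
        if targets is None:
            return logits, None
        B, T, C = logits.shape
        loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss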
pin
How do I crack a PIN showing this (****)?
How is torch broadcasting (T, T) @ (B, T, C) ?!
At around 53:10 of the lecture, Andrej does a matrix multiplication with tensors of size (T, T) and (B, T, C). More precisely: (8, 8) @ (4, 8, 2).
Now, even after looking over PyTorch docs on broadcasting semantics, I'm surprised to see that this works - but sure enough, running the code produces an output of (4, 8, 2).
Can anyone explain how this broadcast works?
// align trailing dimensions
8, 8
4, 8, 2
// pad missing dimensions with 1
1, 8, 8
4, 8, 2
// duplicate 1 dimensions until match
4, 8, 8
4, 8, 2
// now what???
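As far as I can tell from the torch.matmul docs, the last two dimensions are never broadcast against each other: everything except the last two dims is treated as batch dimensions and broadcast, and an ordinary matrix multiply then runs per batch element. So (1, 8, 8) is expanded to (4, 8, 8), and torch performs four independent (8, 8) @ (8, 2) products, each yielding (8, 2), for a final shape of (4, 8, 2). A quick check:

import torch

wei = torch.randn(8, 8)   # (T, T)
x = torch.randn(4, 8, 2)  # (B, T, C)

out = wei @ x             # batch dims broadcast, then per-batch matmul
print(out.shape)          # torch.Size([4, 8, 2])

# equivalent explicit loop: one (8, 8) @ (8, 2) product per batch element
ref = torch.stack([wei @ x[b] for b in range(x.shape[0])])
print(torch.allclose(out, ref))  # True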
gpt.py: how do I save the model after training and use it so that it returns text to me, like ChatGPT?
I have familiarized myself with the course and gpt.py; in principle, everything is clear with the training data, and I have prepared a dataset. However, I want to save the resulting GPT model, then connect to it, feed some text into it, and see how it responds.
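Roughly what I have in mind, as a sketch, assuming the encode/decode helpers, model, and device from gpt.py, and that the weights were saved with torch.save(model.state_dict(), 'gpt.pt'):

import torch

# restore the trained weights into a freshly constructed model
model.load_state_dict(torch.load('gpt.pt', map_location=device))
model.eval()

# encode a prompt, generate a continuation, and decode it back to text
prompt = "To be, or not to be"
context = torch.tensor([encode(prompt)], dtype=torch.long, device=device)
out = model.generate(context, max_new_tokens=200)[0].tolist()
print(decode(out))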
Might want to modify README to remove the "NOTE"
Now that you have added the "init" bit in d38c865
wei value not 100% per row after dropout
It doesn't make sense to me, but
wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
wei = F.softmax(wei, dim=-1) # (B, T, T)
Although after this step the row-level percentages sum to 100%, after applying the dropout
wei = self.dropout(wei)
some values increase above 100%. Any reason for that? Does it cause any issues? I mean, the overall calculation shouldn't be affected too much, and other parts of the network can overcome this, but still.
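A small self-contained check of what I believe is happening: during training, nn.Dropout zeroes entries with probability p and scales the survivors by 1/(1 - p), so individual rows no longer sum to 1 (they only do so in expectation); in eval mode dropout is the identity and the rows sum to 1 again:

import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.2)
wei = torch.softmax(torch.randn(4, 4), dim=-1)
print(wei.sum(dim=-1))        # rows sum to 1.0

drop.train()
print(drop(wei).sum(dim=-1))  # survivors scaled by 1/(1 - p): sums vary around 1

drop.eval()
print(drop(wei).sum(dim=-1))  # dropout is a no-op at eval time: sums are 1.0 again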
Call `model.eval()` before generating?
I understand why we have to call model.eval() before calculating the average loss in estimate_loss(). But should we not similarly call model.eval() before we start generating from the model?
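Something like this is what I have in mind (a sketch, using the model, context, and decode names from gpt.py):

import torch

model.eval()  # disable dropout so generation uses the full attention weights
with torch.no_grad():
    print(decode(model.generate(context, max_new_tokens=500)[0].tolist()))
model.train()  # switch back if training continues afterwards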
Change the Title Please
"ng" is mostly used to describe nodejs things these days. Might I suggest a quick repo update with a title that's way more readable? Great video and repo btw. Cheers.
The mathematical trick in self-attention: why does torch.allclose(xbow, xbow2) return False?
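Presumably this is just float32 round-off: the loop and the matmul accumulate in a different order. A self-contained check mirroring the lecture's code, where a looser tolerance passes:

import torch

torch.manual_seed(1337)
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)

# xbow: explicit running average over previous tokens, as in the lecture
xbow = torch.zeros(B, T, C)
for b in range(B):
    for t in range(T):
        xbow[b, t] = x[b, :t + 1].mean(0)

# xbow2: the matrix-multiplication trick
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
xbow2 = wei @ x

print(torch.allclose(xbow, xbow2))             # can be False in float32
print(torch.allclose(xbow, xbow2, atol=1e-6))  # True with a looser tolerance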
Discrepancy with dimensions
In the Colab notebook linked under your YT video, the dimensions for the single-headed attention appear to be incorrect.
class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x) # (B,T,C)
        q = self.query(x) # (B,T,C)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,C)
        out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C)
        return out
I believe v.shape is not (B, T, C) but rather (B, T, hs). In this repository it is correct:
class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # input of size (batch, time-step, channels)
        # output of size (batch, time-step, head size)
        B,T,C = x.shape
        k = self.key(x) # (B,T,hs)
        q = self.query(x) # (B,T,hs)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,hs)
        out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out
This caused me some confusion; maybe you could change it?
Thank you for such a wonderful, educational project!
UML diagram helping beginners understand gpt.py
Hey @karpathy, I created a UML diagram showcasing what's going on at a high level in gpt.py. This will make it easier for folks to hack on the rest of your repo ;D
I used Graphical Code Tracer to generate this.
Could you share code to run inference only?
How can I run only inference?
Position embedding seems wrong
No license file
Please add an MIT license file.
You can do it quickly using the GitHub GUI by choosing the MIT license template.
About gpt.py lines 134-135
According to the Transformer paper, it seems that we can change
x = x + self.sa(self.ln1(x))
x = x + self.ffwd(self.ln2(x))
to
x = self.ln1(x + self.sa(x))
x = self.ln2(x + self.ffwd(x))
The results are similar, though.
Strange model behavior when taking the softmax in the wrong dimension
Line 85 in 5220142
I accidentally changed the softmax dimension to -2 instead of -1 and got incredibly low losses on both the training and validation sets. However, when generating from the model, I get very low-quality results. What is the explanation?
My guess is that I'm somehow leaking information by taking the softmax along the wrong dimension, which may explain why the training loss is very low. However, I don't quite get why the validation loss would also be low.
@karpathy Any idea why this is the case?
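A quick sketch of my own that seems to confirm the leakage guess: with softmax over dim=-2, each column is normalized over the query positions, so a position's weights depend on scores computed from later tokens. That would also explain the low validation loss: the leak "cheats" on any data, but at generation time there are no future tokens to peek at. Perturbing only the last token then changes the output at position 0, something the causal mask is supposed to forbid:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, C = 8, 16
x = torch.randn(1, T, C)
Wq, Wk = torch.randn(C, C), torch.randn(C, C)
tril = torch.tril(torch.ones(T, T))

def attend(x, dim):
    # single-head attention scores with the causal mask, softmax along `dim`
    wei = (x @ Wq) @ (x @ Wk).transpose(-2, -1) * C ** -0.5
    wei = wei.masked_fill(tril == 0, float('-inf'))
    wei = F.softmax(wei, dim=dim)
    return wei @ x

x2 = x.clone()
x2[0, -1] = torch.randn(C)  # perturb only the LAST token

for dim in (-1, -2):
    same = torch.allclose(attend(x, dim)[0, 0], attend(x2, dim)[0, 0])
    print(f"softmax dim={dim}: position 0 unaffected by future token -> {same}")
# dim=-1 -> True (causal); dim=-2 -> False (the future leaks into the past)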
no longer bigram model?
In https://github.com/karpathy/ng-video-lecture/blob/master/gpt.py
on lines 136 / 137:
# super simple bigram model
class BigramLanguageModel(nn.Module):
Just to clarify, is this now a GPT model rather than a bigram model?
bug?: m vs model
I'm curious whether gpt.py is buggy (which is my guess, and GPT-4's guess as well: https://chat.openai.com/share/b50316a4-0f63-4813-8888-9cb3ca68b7f1), or why it's not.
On line 199 of gpt.py we define m. We then use both m and model, when I think we should be using only one of them (I think line 199 should be model = ..., and then we should only use model; there might be some spots where we have to move tensors so everything is on the same device).
Using the variable "model" after declaring variable "m"
When you call m = model.to(device), it returns a model that shares the same parameters as the original model but is located on the specified device.
So, in your training loop and anywhere else you use the model after this point, you should use m, not model.
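A quick self-contained check: for an nn.Module, .to() actually moves the parameters in place and returns the module itself, so m and model end up being the same object, and it should not matter which name is used:

import torch.nn as nn

model = nn.Linear(4, 4)
m = model.to('cpu')
print(m is model)  # True: Module.to() modifies the module in place and returns self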
Can it be run on Windows with only a CPU?
supplementary video lecture: could you share a link to this video, please?
"My current plan is to publish a supplementary video lecture and cover these parts, then I will also push the exact code changes to this repo."
Could you share a link to this video, please?
time series data like BTC price
If we have time-series data like the BTC price, then I suppose we don't need to do token_embedding. How do we do position_embedding in this case?
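A minimal sketch of one way to do it (my own assumption, not from the repo): replace the token-embedding lookup with a linear projection of each scalar value; the learned position embedding can stay exactly as in gpt.py:

import torch
import torch.nn as nn

block_size, n_embd = 256, 64

value_proj = nn.Linear(1, n_embd)  # replaces nn.Embedding(vocab_size, n_embd)
position_embedding_table = nn.Embedding(block_size, n_embd)  # unchanged

prices = torch.randn(4, block_size, 1)  # (B, T, 1) normalized price series
tok_emb = value_proj(prices)            # (B, T, n_embd)
pos_emb = position_embedding_table(torch.arange(block_size))  # (T, n_embd)
x = tok_emb + pos_emb                   # (B, T, n_embd), same as in gpt.py
print(x.shape)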