yandexdataschool / nlp_course
YSDA course in Natural Language Processing
Home Page: https://lena-voita.github.io/nlp_course.html
License: MIT License
Training SimpleModel with the default code produces the following exception at the end of training:
---------------------------------------------------------------------------
StopIteration Traceback (most recent call last)
<ipython-input-23-aa60453f1abd> in iterate_minibatches(data, batch_size, shuffle, cycle, max_batches)
10 if max_batches and total_batches >= max_batches:
---> 11 raise StopIteration()
12 if not cycle: break
StopIteration:
The above exception was the direct cause of the following exception:
RuntimeError Traceback (most recent call last)
<ipython-input-27-137835dc5639> in <module>()
----> 1 for batch in tqdm(iterate_minibatches(train_data, cycle=True, max_batches=2500)):
2 loss_t, _ = sess.run([trainer.loss, trainer.step],
3 {trainer.ph[key]: batch[key] for key in trainer.ph})
4 loss_history.append(loss_t)
5
/usr/lib/python3.7/site-packages/tqdm/_tqdm.py in __iter__(self)
977 """, fp_write=getattr(self.fp, 'write', sys.stderr.write))
978
--> 979 for obj in iterable:
980 yield obj
981 # Update and possibly print the progressbar.
RuntimeError: generator raised StopIteration
This is an intended behaviour change in Python 3.7, not a bug: under PEP 479, a StopIteration raised inside a generator is turned into a RuntimeError. Please see https://www.python.org/dev/peps/pep-0479/ for additional details.
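A possible fix (a minimal sketch, assuming the generator body matches the traceback; the pandas-style indexing is a placeholder): under PEP 479 a generator must terminate with return instead of raising StopIteration.

import numpy as np

def iterate_minibatches(data, batch_size=256, shuffle=True, cycle=False, max_batches=None):
    total_batches = 0
    while True:
        indices = np.arange(len(data))
        if shuffle:
            np.random.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            yield data.iloc[indices[start: start + batch_size]]
            total_batches += 1
            if max_batches and total_batches >= max_batches:
                return  # replaces raise StopIteration(), per PEP 479
        if not cycle:
            break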
If you have any issues with libraries, post 'em here.
We assume that you have the basic data science toolkit (sklearn, numpy/scipy/pandas), basically whatever comes with the default anaconda distribution.
If you don't have or can't install that (e.g. you use Windows and installation is tricky), there's a docker container available (see below).
pip install --upgrade nltk gensim bs4 editdistance
pip install --upgrade tensorflow keras
pip install bokeh tqdm
To enable GPU on tensorflow:
pip uninstall tensorflow (or conda uninstall tensorflow if you used anaconda), then
pip install tensorflow-gpu (or conda install tensorflow-gpu).
Make sure you have the appropriate CUDA toolkit.
Alternatively, get the course image from dockerhub:
docker pull justheuristic/nlp_course
If you want to build it yourself, use these instructions.
If you run into any trouble, feel free to post here, even if it's like "i don't know what the hell are all these words!".
Make sure to get rid of it! It's in the "Salary" category, and will kill your embedding/CNN/regression training.
np.argpartition is more efficient for retrieving the indices of the k largest elements in an array.
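For example (a quick sketch):

import numpy as np

scores = np.random.rand(10000)
k = 5
top_k = np.argpartition(scores, -k)[-k:]          # indices of the k largest values, unordered, O(n)
top_k_sorted = top_k[np.argsort(-scores[top_k])]  # order just those k if ranking matters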
Hello, I am trying to adapt the idea of learning on triplets to a classification task on an imbalanced dataset.
I select 1 anchor, 1 positive example from the same class and 1 negative example from a random other class. I want the model to learn to embed sentences of the same class closer together, and afterwards train an SVM or something else to classify based on the embeddings produced by the trained model.
Can you please suggest what the model's architecture could look like? In the course you suggested using several dense layers on top of pretrained BERT (you also suggested not training the BERT weights, just these dense layers).
What would be a good output size for the vector if I want to use it later for classification? Maybe 16?
I will be very grateful for suggestions!
P.S. Guys, you really are the best; your course helped me a lot in learning NLP!
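One possible architecture (my own sketch under the assumptions above, not from the course materials; TripletEncoder and the layer sizes are placeholders): a frozen pretrained BERT encoder with a small trainable projection head, trained with a triplet margin loss.

import torch.nn as nn
import transformers

class TripletEncoder(nn.Module):
    def __init__(self, out_dim=16):
        super().__init__()
        self.bert = transformers.AutoModel.from_pretrained('bert-base-uncased')
        for p in self.bert.parameters():
            p.requires_grad = False  # keep BERT frozen, train only the head
        self.head = nn.Sequential(nn.Linear(768, 128), nn.ReLU(), nn.Linear(128, out_dim))

    def forward(self, **tokens):
        return self.head(self.bert(**tokens)['pooler_output'])

criterion = nn.TripletMarginLoss(margin=1.0)
# loss = criterion(encoder(**anchor), encoder(**positive), encoder(**negative))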
There is a broken link in the second cell, which is supposed to load the data for the seminar:
https://ysda-seminars.s3.eu-central-1.amazonaws.com/Train_rev1.zip
Hi,
Thanks for a great course! I wonder whether BERT and its derivatives will be covered in this year's lectures.
Cheers,
Sergey,
https://github.com/yandexdataschool/nlp_course/blob/master/week08_multitask/README.md refers to seminar.ipynb, which no longer exists, and claims that there is no homework for the week, despite the presence of homework.ipynb in the directory.
First of all, thanks a lot for your great website! Really, hats off!
In the linked page, "generative" should be "discriminative" in the second bullet point.
Apologies for opening an issue for a very small thing.
Thanks.
Hi!
Talking about nlp_course/week01_embeddings/seminar.ipynb:
The line "Requirements: pip install --upgrade nltk gensim bokeh, but only if you're running locally" will install the latest versions of the libraries, because no exact versions are specified.
I suggest pinning the exact versions of the libraries the notebooks were written for.
As of May 2021, gensim has version 4.0.1
It means that
words = sorted(model.vocab.keys(),
key=lambda word: model.vocab[word].count,
reverse=True)[:1000]
will not work.
Better to replace it with
words = sorted(model.key_to_index.keys(),
key=lambda word: model.get_vecattr(word, "count"),
reverse=True)[:1000]
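A pin of the previous major version would also avoid the API change entirely (a sketch; gensim 4.0 is the release that replaced vocab with key_to_index/get_vecattr):

pip install 'gensim<4.0'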
Talking about nlp_course/week01_embeddings/homework.ipynb:
precision_top1 = precision(uk_ru_test, mapping.predict(X_test), 1)
precision_top5 = precision(uk_ru_test, mapping.predict(X_test), 5)
assert precision_top1 >= 0.635
assert precision_top5 >= 0.813
And here the second assert passes only after relaxing it to precision_top5 >= 0.811 (probably due to the new gensim version as well).
P.S. I will update this issue with new problems as I go through the course.
I think the tf-idf formula on slide 47 of the embeddings lecture is incorrect: in the denominator of the log argument it should be $\{d \in D : w \in d\}$, not $\{d \in D : w \in D\}$.
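For reference, the standard definition the correction points to:

$$\mathrm{tf\text{-}idf}(w, d, D) = \mathrm{tf}(w, d) \cdot \log \frac{|D|}{|\{d \in D : w \in d\}|}$$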
Problem with this seminar.
Right now an error is raised in the cells calling model.predict(make_batch(...)), because nn.Module has no predict method.
Please add a helper function like:

def predict(model, batch):
    return model(batch).detach().cpu().numpy()

and change the model.predict(make_batch(...)).detach().cpu() expressions to predict(model, make_batch(...)).
Greetings,
I am working on the week 2 homework (part 1) notebook and have a question about bow_vocabulary: the length of the bow_vocabulary I created is different from the length of the set of all tokens in the training set.
See the screenshot below:
The way I created bow_vocabulary is as follows: basically, like in week 1, I split the text on " " (space), which recovers the tokens generated by TweetTokenizer.
Then I count the occurrences of each token and keep only the top k words in the vocabulary.
When putting all tokens into a set, some tokens (str) are treated as the same one, so the length decreases.
I am wondering:
1. Is my way of creating bow_vocabulary correct?
2. Should tokens be pre-processed before being put into the set used in the notebook, so that the vocabulary is created as: tokens -> pre-processing -> vocabulary?
Thanks for taking the time to look at this; I am looking forward to hearing your reply!
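For what it's worth, a sketch of the top-k construction described above (train_texts and k are placeholders, and the texts are assumed to be pre-tokenized and space-joined):

from collections import Counter

k = 10000  # vocabulary size, placeholder
counts = Counter(tok for text in train_texts for tok in text.split())
bow_vocabulary = [tok for tok, _ in counts.most_common(k)]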
Hello.
Could you tell us what tool you used to create such beautiful graphics?
The slide link in lecture 8 gives a 403 error.
Where can I see that slide?
Thank you
class MyBertBasedClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = transformers.AutoModel.from_pretrained('bert-base-uncased')
        self.head = nn.Linear(768, 2)

    def forward(self, **kwargs):
        out = model(**tokens_info)['pooler_output']
        return self.head(out)

clf = MyBertBasedClassifier()
clf(**tokens_info)

The last line returns an error:
TypeError: _forward_unimplemented() got an unexpected keyword argument 'input_ids'
I double-checked the entire code and it is equal to the code in the original seminar notebook. Can't understand what's wrong.
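One likely cause (an assumption, the issue itself does not confirm it): if forward ends up defined outside the class body (e.g. an indentation slip when pasting), nn.Module falls back to its _forward_unimplemented stub, which produces exactly this TypeError. A sketch of a version that should work:

import torch.nn as nn
import transformers

class MyBertBasedClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = transformers.AutoModel.from_pretrained('bert-base-uncased')
        self.head = nn.Linear(768, 2)

    def forward(self, **kwargs):  # must be indented inside the class
        out = self.bert(**kwargs)['pooler_output']  # use self.bert and kwargs, not globals
        return self.head(out)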
Are there any solutions provided for the homework, and where can I find them? Thanks~
From the list of references for the word embeddings, this one is missing:
https://arxiv.org/pdf/1402.3722.pdf (Goldberg & Levy, "word2vec Explained")
and it is very important for educational purposes.
I am stuck building the model in week 2. Here is my code.
def build_model(n_tokens=len(tokens), n_cat_features=len(categorical_vectorizer.vocabulary_), hid_size=64):
    l_title = L.Input(shape=[None], name="Title")
    l_descr = L.Input(shape=[None], name="FullDescription")
    l_categ = L.Input(shape=[None], name="Categorical")

    # Build your monster!
    # Title
    x_t = Embedding(input_dim=n_tokens, output_dim=5, name="Title_Embedding")(l_title)
    x_t = Conv1D(filters=5, kernel_size=5, activation='relu')(x_t)
    x_t = MaxPooling1D(5)(x_t)
    x_t = Dense(1)(x_t)

    # FullDescription
    x_d = Embedding(input_dim=n_tokens, output_dim=5, name="FullDescription_Embedding")(l_descr)
    x_d = Conv1D(filters=5, kernel_size=5, activation='relu')(x_d)
    x_d = MaxPooling1D(5)(x_d)
    x_d = Dense(1)(x_d)

    # Categorical
    x_c = Embedding(input_dim=n_cat_features, output_dim=5, name="Categorical_Embedding")(l_categ)
    x_c = Dense(1)(x_c)

    # Concatenate
    concat = Concatenate()([x_t, x_d, x_c])
    output_layer = Dense(1)(concat)
    # end of your code

    model = keras.models.Model(inputs=[l_title, l_descr, l_categ], outputs=[output_layer])
    model.compile('adam', 'mean_squared_error', metrics=['mean_absolute_error'])
    return model
I get the following problem:
Expected the last dense layer to have 3 dimensions, but got array with shape (100, 1)
The reason I changed the shape of the categorical input layer to None is that I was not able to concatenate a defined layer with the other two undefined layers at the last step.
I chose an embedding layer for the categorical encoder because of the error "The shape of the input to "Flatten" is not fully defined (got (None, 1). Make sure to pass a complete "input_shape" or "batch_input_shape" argument to the first layer in your model.", so I used an embedding layer to keep all branches at the same rank.
Could you please help me with these problems? Thank you in advance.
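One possible resolution (a sketch, not the official solution): the shape error appears because Dense(1) after MaxPooling1D still outputs a rank-3 tensor (batch, time, 1), while the target is (batch, 1). A global pooling layer collapses the variable-length time axis so every branch becomes rank 2 and can be concatenated, e.g. for the title branch (assuming GlobalMaxPooling1D is imported from keras.layers like the other layers):

x_t = Embedding(input_dim=n_tokens, output_dim=5)(l_title)
x_t = Conv1D(filters=5, kernel_size=5, activation='relu')(x_t)
x_t = GlobalMaxPooling1D()(x_t)  # (batch, 5): no time axis left, so Dense behaves as expected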
Hi there,
This is a really great resource; is it possible to provide the homework solutions? Thanks!
I tried to download the two zip files with the wget command in Google Colab but received a 401 Unauthorized response. Has anybody managed to download the files with wget in Colab? Or has anybody uploaded these two files to their Drive and can share the file id, so they can be downloaded with gdown?
I have a problem with the dimensions in the network architecture:
In our seminar, the 'title' and the 'description' always have different dimensions. In the paper 'Convolutional Neural Networks for Sentence Classification' the author pads the inputs, but the code in 'seminar.ipynb' doesn't pad all batches to the same length. Do I need to modify the code in the '.ipynb', or are there other ways to handle the different dimensions?
I would be very grateful if I could get someone's help!
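In case it helps, a sketch of per-batch padding (pad_ix and the token-id lists are placeholders; each batch is padded only to its own maximum length, so different batches may still differ in length):

import numpy as np

def pad_batch(token_id_seqs, pad_ix=0):
    max_len = max(map(len, token_id_seqs))
    return np.array([seq + [pad_ix] * (max_len - len(seq)) for seq in token_id_seqs])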
Where to find the answers? It's so hard for me!!!
:(
Distributed Representations of Sentences and Documents: there is a blank space between [] and (), so the markdown link fails.
GloVe: Global Vectors for Word Representation: the resource is not arxiv.
Enriching Word Vectors with Subword Information: appears twice.
Missing "have" in sentence: A network can {have} several such blocks
Link: https://lena-voita.github.io/nlp_course/models/convolutional.html
Great course btw.
It contains the following code:

assert np.mean(train_history[:10]) > np.mean(train_history[-10:]), "The model didn't converge."

This is incorrect because train_history is a list of (i, loss) pairs. A correct way is:

assert np.mean(train_history[:10], axis=0)[1] > np.mean(train_history[-10:], axis=0)[1], "The model didn't converge."

Also, a couple of lines further down it refers to rnn_lm instead of window_lm.
from numpy import linalg as la

def learn_transform(X_train, Y_train):
    """
    :returns: W* : float matrix [emb_dim x emb_dim] as defined in the formulae above
    """
    u, sigma, vt = la.svd(np.matmul(X_train.T, Y_train))  # 300 x 300
    W = np.matmul(u, vt.T)
    return W
I completed the SVD code, but the result is bad.
W = learn_transform(X_train, Y_train)
ru_emb.most_similar([np.matmul(uk_emb["серпень"], W)])
[('Раззаков', 0.2569316625595093),
('Tyrrell', 0.25682011246681213),
('Cosworth', 0.2490822821855545),
('ANZ', 0.24802428483963013),
('owned', 0.24786119163036346),
('Bowie', 0.24696007370948792),
('Хемингуэй', 0.24553290009498596),
('debut', 0.24303579330444336),
('Тэдзука', 0.24089395999908447),
('underground', 0.24075618386268616)]
I don't know why.
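A likely cause (my reading of numpy's convention, not a confirmed answer): numpy.linalg.svd already returns the third factor transposed, so the orthogonal Procrustes solution W* = U V^T is u @ vt, not u @ vt.T:

import numpy as np

def learn_transform(X_train, Y_train):
    """:returns: W*, the orthogonal Procrustes solution mapping X to Y"""
    u, sigma, vt = np.linalg.svd(X_train.T @ Y_train)
    return u @ vt  # vt is already V^T, so no extra transpose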
The notebook contains the following function:

def precision(pairs, uk_vectors, topn=1):
    """
    :args:
        pairs = list of right word pairs [(uk_word_0, ru_word_0), ...]
        uk_vectors = list of embeddings for Ukraininan words
    :returns:
        precision_val, float number, total number of words for those we can find right translation at top K.
    """
This does not seem to match what the subsequent tests expect. The tests pass in, as uk_vectors, the vectors for Ukrainian words that have already been mapped by the linear transformation. Thus they are not embeddings of Ukrainian words; they are predicted embeddings of Russian words.
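A docstring that would match the tests might read (a suggestion, not the authors' wording; the parameter rename is mine):

def precision(pairs, mapped_vectors, topn=1):
    """
    :args:
        pairs = list of right word pairs [(uk_word_0, ru_word_0), ...]
        mapped_vectors = Ukrainian embeddings already mapped into the Russian
            space, i.e. predicted embeddings of the Russian translations
    :returns:
        precision_val, float: fraction of words whose correct translation
        is among the top-K nearest Russian neighbours
    """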
Hi,
I want to compare my homework against a reference answer.
It can be shown (see original paper) that a self-consistent linear mapping between semantic spaces should be orthogonal. TODO simplify phrases
Hi there,
This course is terrific material for learning NLP from scratch with PyTorch/TensorFlow.
But I have a problem: I don't know whether my code is right or wrong when I finish the homework.
Is there any discussion group for learners to discuss or verify whether our code is right?
Thanks in advance.
Does anyone have any solutions for the homework? I'm stuck on several problems that I need some help with.
Will the answers to the assignments be released, sir? 😁
Would it be possible to get a requirements.txt?
Since the install instructions mention no specific versions of python or gensim, I constantly run into bugs.
Thank you!!!
Hi, I can't download the data when I run the code, for example:
!wget https://www.dropbox.com/s/obaitrix9jyu84r/quora.txt?dl=1 -O ./quora.txt
The outcome is: /bin/bash: wget: command not found
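A workaround sketch for environments without wget (assuming Python is available; same URL and filename as above):

import urllib.request
urllib.request.urlretrieve('https://www.dropbox.com/s/obaitrix9jyu84r/quora.txt?dl=1', './quora.txt')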
It's a great project, but there are some homework problems I can't solve. Can you help me, or give me the answers? Thank you very much!
Thanks for the course; I really like both the presentation of the material and the difficulty level.
There is a small mistake in week 4: the link to the seminar recording in the 2020 branch points to week 3 (Language Modeling).
In the pytorch version of the seminar the bonus part doesn't work; I suppose that part works only for the keras version.
def explain(model, sample, col_name='Title'):
    """ Computes the effect each word had on model predictions """
    sample = dict(sample)
    sample_col_tokens = [tokens[token_to_id.get(tok, 0)] for tok in sample[col_name].split()]
    data_drop_one_token = pd.DataFrame([sample] * (len(sample_col_tokens) + 1))

    for drop_i in range(len(sample_col_tokens)):
        data_drop_one_token.loc[drop_i, col_name] = ' '.join(UNK if i == drop_i else tok
                                                             for i, tok in enumerate(sample_col_tokens))

    *predictions_drop_one_token, baseline_pred = model.predict(make_batch(data_drop_one_token))[:, 0]  # <- fails here: nn.Module has no .predict
    diffs = baseline_pred - predictions_drop_one_token
    return list(zip(sample_col_tokens, diffs))
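A hedged fix, reusing the predict() helper suggested in the earlier issue above (assuming the rest of the pipeline is unchanged):

# replace the failing line with:
*predictions_drop_one_token, baseline_pred = predict(model, make_batch(data_drop_one_token))[:, 0]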