yandexdataschool / nlp_course
YSDA course in Natural Language Processing
Home Page: https://lena-voita.github.io/nlp_course.html
License: MIT License
Training SimpleModel with the default code produces the following exception at the end of training:
---------------------------------------------------------------------------
StopIteration Traceback (most recent call last)
<ipython-input-23-aa60453f1abd> in iterate_minibatches(data, batch_size, shuffle, cycle, max_batches)
10 if max_batches and total_batches >= max_batches:
---> 11 raise StopIteration()
12 if not cycle: break
StopIteration:
The above exception was the direct cause of the following exception:
RuntimeError Traceback (most recent call last)
<ipython-input-27-137835dc5639> in <module>()
----> 1 for batch in tqdm(iterate_minibatches(train_data, cycle=True, max_batches=2500)):
2 loss_t, _ = sess.run([trainer.loss, trainer.step],
3 {trainer.ph[key]: batch[key] for key in trainer.ph})
4 loss_history.append(loss_t)
5
/usr/lib/python3.7/site-packages/tqdm/_tqdm.py in __iter__(self)
977 """, fp_write=getattr(self.fp, 'write', sys.stderr.write))
978
--> 979 for obj in iterable:
980 yield obj
981 # Update and possibly print the progressbar.
RuntimeError: generator raised StopIteration
This is an intended behaviour change in Python 3.7, not a bug: under PEP 479, a StopIteration raised inside a generator is turned into a RuntimeError. Please see https://www.python.org/dev/peps/pep-0479/ for additional details.
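A possible fix (a minimal sketch, assuming the generator body matches the traceback; the pandas-style indexing is a placeholder): under PEP 479 a generator must terminate with return instead of raising StopIteration.

import numpy as np

def iterate_minibatches(data, batch_size=256, shuffle=True, cycle=False, max_batches=None):
    total_batches = 0
    while True:
        indices = np.arange(len(data))
        if shuffle:
            np.random.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            yield data.iloc[indices[start: start + batch_size]]
            total_batches += 1
            if max_batches and total_batches >= max_batches:
                return  # replaces raise StopIteration(), per PEP 479
        if not cycle:
            break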
If you have any issues with libraries, post 'em here.
We assume that you have the basic data science toolkit (sklearn, numpy/scipy/pandas), basically whatever comes with the default anaconda distribution.
If you don't have or can't install that (e.g. you use Windows and installation is tricky), there's a docker container available (see below).
pip install --upgrade nltk gensim bs4 editdistance
pip install --upgrade tensorflow keras
pip install bokeh tqdm
To enable GPU on tensorflow:
pip uninstall tensorflow (or conda uninstall tensorflow if you used anaconda), then
pip install tensorflow-gpu (or conda install tensorflow-gpu).
Make sure you have the appropriate CUDA toolkit.
Alternatively, get the course image from dockerhub:
docker pull justheuristic/nlp_course
If you want to build it yourself, use these instructions.
If you run into any trouble, feel free to post here, even if it's like "i don't know what the hell are all these words!".
Make sure to get rid of it! It's in the "Salary" category, and will kill your embedding/CNN/regression training.
np.argpartition is more efficient for retrieving the indices of the k largest elements in an array.
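For example (a quick sketch):

import numpy as np

scores = np.random.rand(10000)
k = 5
top_k = np.argpartition(scores, -k)[-k:]          # indices of the k largest values, unordered, O(n)
top_k_sorted = top_k[np.argsort(-scores[top_k])]  # order just those k if ranking matters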
Hello, I am trying to adapt the idea of learning on triplets to a classification task on an imbalanced dataset.
I select 1 anchor, 1 positive example from the same class and 1 negative example from a random other class. I want the model to learn to embed sentences of the same class closer together, and afterwards train an SVM or something else to classify based on the embeddings produced by the trained model.
Can you please suggest what the model's architecture could look like? In the course you suggested using several dense layers on top of pretrained BERT (you also suggested not training the BERT weights, just these dense layers).
What would be a good output size for the vector if I want to use it later for classification? Maybe 16?
I will be very grateful for suggestions!
P.S. Guys, you really are the best; your course helped me a lot in learning NLP!
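One possible architecture (my own sketch under the assumptions above, not from the course materials; TripletEncoder and the layer sizes are placeholders): a frozen pretrained BERT encoder with a small trainable projection head, trained with a triplet margin loss.

import torch.nn as nn
import transformers

class TripletEncoder(nn.Module):
    def __init__(self, out_dim=16):
        super().__init__()
        self.bert = transformers.AutoModel.from_pretrained('bert-base-uncased')
        for p in self.bert.parameters():
            p.requires_grad = False  # keep BERT frozen, train only the head
        self.head = nn.Sequential(nn.Linear(768, 128), nn.ReLU(), nn.Linear(128, out_dim))

    def forward(self, **tokens):
        return self.head(self.bert(**tokens)['pooler_output'])

criterion = nn.TripletMarginLoss(margin=1.0)
# loss = criterion(encoder(**anchor), encoder(**positive), encoder(**negative))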
There is a broken link in the second cell, which is supposed to load the data for the seminar:
https://ysda-seminars.s3.eu-central-1.amazonaws.com/Train_rev1.zip
Hi,
Thanks for a great course! I wonder whether BERT and its derivatives will be covered in this year's lectures.
Cheers,
Sergey,
https://github.com/yandexdataschool/nlp_course/blob/master/week08_multitask/README.md refers to seminar.ipynb, which no longer exists, and claims that there is no homework for the week, despite the presence of homework.ipynb in the directory.
First of all, thanks a lot for your great website! Really, hats off!
In the linked page, "generative" should be "discriminative" in the second bullet point.
Apologies for opening an issue for a very small thing.
Thanks.
Hi!
Talking about nlp_course/week01_embeddings/seminar.ipynb:
The line "Requirements: pip install --upgrade nltk gensim bokeh, but only if you're running locally" will install the latest versions of the libraries, because no exact versions are specified.
I suggest pinning the exact versions of the libraries the notebooks were written for.
As of May 2021, gensim has version 4.0.1
It means that
words = sorted(model.vocab.keys(),
key=lambda word: model.vocab[word].count,
reverse=True)[:1000]
will not work.
Better to replace it with
words = sorted(model.key_to_index.keys(),
key=lambda word: model.get_vecattr(word, "count"),
reverse=True)[:1000]
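A pin of the previous major version would also avoid the API change entirely (a sketch; gensim 4.0 is the release that replaced vocab with key_to_index/get_vecattr):

pip install 'gensim<4.0'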
Talking about nlp_course/week01_embeddings/homework.ipynb:
precision_top1 = precision(uk_ru_test, mapping.predict(X_test), 1)
precision_top5 = precision(uk_ru_test, mapping.predict(X_test), 5)
assert precision_top1 >= 0.635
assert precision_top5 >= 0.813
And here the second assert passes only after relaxing it to precision_top5 >= 0.811 (probably due to the new gensim version as well).
P.S. I will update this issue with new problems as I go through the course.
I think the tf-idf formula on slide 47 of the embeddings lecture is incorrect: in the denominator of the log argument it should be $\{d \in D : w \in d\}$, not $\{d \in D : w \in D\}$.
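For reference, the standard definition the correction points to:

$$\mathrm{tf\text{-}idf}(w, d, D) = \mathrm{tf}(w, d) \cdot \log \frac{|D|}{|\{d \in D : w \in d\}|}$$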
Problem with this seminar.
Right now an error is raised in the cells calling model.predict(make_batch(...)), because nn.Module has no predict method.
Please add a helper function like:

def predict(model, batch):
    return model(batch).detach().cpu().numpy()

and change the model.predict(make_batch(...)).detach().cpu() expressions to predict(model, make_batch(...)).
Greetings,
I am working on the week 2 homework (part 1) notebook and have a question about bow_vocabulary: the length of the bow_vocabulary I created is different from the length of the set of all tokens in the training set.
See the screenshot below:
The way I created bow_vocabulary is as follows: basically, like in week 1, I split the text on " " (space), which recovers the tokens generated by TweetTokenizer.
Then I count the occurrences of each token and keep only the top k words in the vocabulary.
When putting all tokens into a set, some tokens (str) are treated as the same one, so the length decreases.
I am wondering:
1. Is my way of creating bow_vocabulary correct?
2. Should tokens be pre-processed before being put into the set used in the notebook, so that the vocabulary is created as: tokens -> pre-processing -> vocabulary?
Thanks for taking the time to look at this; I am looking forward to hearing your reply!
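For what it's worth, a sketch of the top-k construction described above (train_texts and k are placeholders, and the texts are assumed to be pre-tokenized and space-joined):

from collections import Counter

k = 10000  # vocabulary size, placeholder
counts = Counter(tok for text in train_texts for tok in text.split())
bow_vocabulary = [tok for tok, _ in counts.most_common(k)]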
Hello.
Could you tell us what tool you used to create such beautiful graphics?
The slide link in lecture 8 gives a 403 error.
Where can I see that slide?
Thank you
class MyBertBasedClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = transformers.AutoModel.from_pretrained('bert-base-uncased')
        self.head = nn.Linear(768, 2)

    def forward(self, **kwargs):
        out = model(**tokens_info)['pooler_output']
        return self.head(out)

clf = MyBertBasedClassifier()
clf(**tokens_info)

The last line returns an error:
TypeError: _forward_unimplemented() got an unexpected keyword argument 'input_ids'
I double-checked the entire code and it is equal to the code in the original seminar notebook. Can't understand what's wrong.
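One likely cause (an assumption, the issue itself does not confirm it): if forward ends up defined outside the class body (e.g. an indentation slip when pasting), nn.Module falls back to its _forward_unimplemented stub, which produces exactly this TypeError. A sketch of a version that should work:

import torch.nn as nn
import transformers

class MyBertBasedClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = transformers.AutoModel.from_pretrained('bert-base-uncased')
        self.head = nn.Linear(768, 2)

    def forward(self, **kwargs):  # must be indented inside the class
        out = self.bert(**kwargs)['pooler_output']  # use self.bert and kwargs, not globals
        return self.head(out)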
Are there any solutions provided for the homework, and where can I find them? Thanks~
From the list of references for the word embeddings, this one is missing:
https://arxiv.org/pdf/1402.3722.pdf (Goldberg & Levy, "word2vec Explained")
and it is very important for educational purposes.
I am stuck building the model in week 2. Here is my code.
def build_model(n_tokens=len(tokens), n_cat_features=len(categorical_vectorizer.vocabulary_), hid_size=64):
    l_title = L.Input(shape=[None], name="Title")
    l_descr = L.Input(shape=[None], name="FullDescription")
    l_categ = L.Input(shape=[None], name="Categorical")

    # Build your monster!
    # Title
    x_t = Embedding(input_dim=n_tokens, output_dim=5, name="Title_Embedding")(l_title)
    x_t = Conv1D(filters=5, kernel_size=5, activation='relu')(x_t)
    x_t = MaxPooling1D(5)(x_t)
    x_t = Dense(1)(x_t)

    # FullDescription
    x_d = Embedding(input_dim=n_tokens, output_dim=5, name="FullDescription_Embedding")(l_descr)
    x_d = Conv1D(filters=5, kernel_size=5, activation='relu')(x_d)
    x_d = MaxPooling1D(5)(x_d)
    x_d = Dense(1)(x_d)

    # Categorical
    x_c = Embedding(input_dim=n_cat_features, output_dim=5, name="Categorical_Embedding")(l_categ)
    x_c = Dense(1)(x_c)

    # Concatenate
    concat = Concatenate()([x_t, x_d, x_c])
    output_layer = Dense(1)(concat)
    # end of your code

    model = keras.models.Model(inputs=[l_title, l_descr, l_categ], outputs=[output_layer])
    model.compile('adam', 'mean_squared_error', metrics=['mean_absolute_error'])
    return model
I get the following problem:
Expected the last dense layer to have 3 dimensions, but got array with shape (100, 1)
The reason I changed the shape of the categorical input layer to None is that I was not able to concatenate a defined layer with the other two undefined layers at the last step.
I chose an embedding layer for the categorical encoder because of the error "The shape of the input to "Flatten" is not fully defined (got (None, 1). Make sure to pass a complete "input_shape" or "batch_input_shape" argument to the first layer in your model.", so I used an embedding layer to keep all branches at the same rank.
Could you please help me with these problems? Thank you in advance.
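One possible resolution (a sketch, not the official solution): the shape error appears because Dense(1) after MaxPooling1D still outputs a rank-3 tensor (batch, time, 1), while the target is (batch, 1). A global pooling layer collapses the variable-length time axis so every branch becomes rank 2 and can be concatenated, e.g. for the title branch (assuming GlobalMaxPooling1D is imported from keras.layers like the other layers):

x_t = Embedding(input_dim=n_tokens, output_dim=5)(l_title)
x_t = Conv1D(filters=5, kernel_size=5, activation='relu')(x_t)
x_t = GlobalMaxPooling1D()(x_t)  # (batch, 5): no time axis left, so Dense behaves as expected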
Hi there,
This is a really great resource; is it possible to provide the homework solutions? Thanks!
I tried to download the two zip files with the wget command in Google Colab but received a 401 Unauthorized response. Has anybody managed to download the files with wget in Colab? Or has anybody uploaded these two files to their Drive and can share the file id, so they can be downloaded with gdown?
I have a problem with the dimensions in the network architecture:
In our seminar, the 'title' and the 'description' always have different dimensions. In the paper 'Convolutional Neural Networks for Sentence Classification' the author pads the inputs, but the code in 'seminar.ipynb' doesn't pad all batches to the same length. Do I need to modify the code in the '.ipynb', or are there other ways to handle the different dimensions?
I would be very grateful if I could get someone's help!
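In case it helps, a sketch of per-batch padding (pad_ix and the token-id lists are placeholders; each batch is padded only to its own maximum length, so different batches may still differ in length):

import numpy as np

def pad_batch(token_id_seqs, pad_ix=0):
    max_len = max(map(len, token_id_seqs))
    return np.array([seq + [pad_ix] * (max_len - len(seq)) for seq in token_id_seqs])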
Where to find the answers? It's so hard for me!!!
:(
Distributed Representations of Sentences and Documents: there is a blank space between [] and (), so the markdown link fails.
GloVe: Global Vectors for Word Representation: the resource is not arxiv.
Enriching Word Vectors with Subword Information: appears twice.
Missing "have" in sentence: A network can {have} several such blocks
Link: https://lena-voita.github.io/nlp_course/models/convolutional.html
Great course btw.
It contains the following code:

assert np.mean(train_history[:10]) > np.mean(train_history[-10:]), "The model didn't converge."

This is incorrect because train_history is a list of (i, loss) pairs. A correct way is:

assert np.mean(train_history[:10], axis=0)[1] > np.mean(train_history[-10:], axis=0)[1], "The model didn't converge."

Also, a couple of lines further down it refers to rnn_lm instead of window_lm.
from numpy import linalg as la

def learn_transform(X_train, Y_train):
    """
    :returns: W* : float matrix [emb_dim x emb_dim] as defined in the formulae above
    """
    u, sigma, vt = la.svd(np.matmul(X_train.T, Y_train))  # 300 x 300
    W = np.matmul(u, vt.T)
    return W
I completed the SVD code, but the result is bad.
W = learn_transform(X_train, Y_train)
ru_emb.most_similar([np.matmul(uk_emb["серпень"], W)])
[('Раззаков', 0.2569316625595093),
('Tyrrell', 0.25682011246681213),
('Cosworth', 0.2490822821855545),
('ANZ', 0.24802428483963013),
('owned', 0.24786119163036346),
('Bowie', 0.24696007370948792),
('Хемингуэй', 0.24553290009498596),
('debut', 0.24303579330444336),
('Тэдзука', 0.24089395999908447),
('underground', 0.24075618386268616)]
I don't know why.
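A likely cause (my reading of numpy's convention, not a confirmed answer): numpy.linalg.svd already returns the third factor transposed, so the orthogonal Procrustes solution W* = U V^T is u @ vt, not u @ vt.T:

import numpy as np

def learn_transform(X_train, Y_train):
    """:returns: W*, the orthogonal Procrustes solution mapping X to Y"""
    u, sigma, vt = np.linalg.svd(X_train.T @ Y_train)
    return u @ vt  # vt is already V^T, so no extra transpose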
The notebook contains the following function:

def precision(pairs, uk_vectors, topn=1):
    """
    :args:
        pairs = list of right word pairs [(uk_word_0, ru_word_0), ...]
        uk_vectors = list of embeddings for Ukraininan words
    :returns:
        precision_val, float number, total number of words for those we can find right translation at top K.
    """
This does not seem to match what the subsequent tests expect. The tests pass in, as uk_vectors, the vectors for Ukrainian words that have already been mapped by the linear transformation. Thus they are not embeddings of Ukrainian words; they are predicted embeddings of Russian words.
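A docstring that would match the tests might read (a suggestion, not the authors' wording; the parameter rename is mine):

def precision(pairs, mapped_vectors, topn=1):
    """
    :args:
        pairs = list of right word pairs [(uk_word_0, ru_word_0), ...]
        mapped_vectors = Ukrainian embeddings already mapped into the Russian
            space, i.e. predicted embeddings of the Russian translations
    :returns:
        precision_val, float: fraction of words whose correct translation
        is among the top-K nearest Russian neighbours
    """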
Hi,
I want to compare my homework against a reference answer.
It can be shown (see original paper) that a self-consistent linear mapping between semantic spaces should be orthogonal. TODO simplify phrases
Hi there,
This course is terrific material for learning NLP from scratch with PyTorch/TensorFlow.
But I have a problem: I don't know whether my code is right or wrong when I finish the homework.
Is there any discussion group for learners to discuss or verify whether our code is right?
Thanks in advance.
Does anyone have any solutions for the homework? I'm stuck on several problems that I need some help with.
Will the answers to the assignments be released, sir? 😁
Would it be possible to get a requirements.txt?
Since the install instructions mention no specific versions of python or gensim, I constantly run into bugs.
Thank you!!!
Hi, I can't download the data when I run the code, for example:
!wget https://www.dropbox.com/s/obaitrix9jyu84r/quora.txt?dl=1 -O ./quora.txt
The outcome is: /bin/bash: wget: command not found
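A workaround sketch for environments without wget (assuming Python is available; same URL and filename as above):

import urllib.request
urllib.request.urlretrieve('https://www.dropbox.com/s/obaitrix9jyu84r/quora.txt?dl=1', './quora.txt')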
It's a great project, but there are some homework problems I can't solve. Can you help me, or give me the answers? Thank you very much!
Thanks for the course; I really like both the presentation of the material and the difficulty level.
There is a small mistake in week 4: the link to the seminar recording in the 2020 branch points to week 3 (Language Modeling).
In the pytorch version of the seminar the bonus part doesn't work; I suppose that part works only for the keras version.
def explain(model, sample, col_name='Title'):
    """ Computes the effect each word had on model predictions """
    sample = dict(sample)
    sample_col_tokens = [tokens[token_to_id.get(tok, 0)] for tok in sample[col_name].split()]
    data_drop_one_token = pd.DataFrame([sample] * (len(sample_col_tokens) + 1))

    for drop_i in range(len(sample_col_tokens)):
        data_drop_one_token.loc[drop_i, col_name] = ' '.join(UNK if i == drop_i else tok
                                                             for i, tok in enumerate(sample_col_tokens))

    *predictions_drop_one_token, baseline_pred = model.predict(make_batch(data_drop_one_token))[:, 0]  # <- fails here: nn.Module has no .predict
    diffs = baseline_pred - predictions_drop_one_token
    return list(zip(sample_col_tokens, diffs))
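A hedged fix, reusing the predict() helper suggested in the earlier issue above (assuming the rest of the pipeline is unchanged):

# replace the failing line with:
*predictions_drop_one_token, baseline_pred = predict(model, make_batch(data_drop_one_token))[:, 0]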