yandexdataschool / nlp_course

YSDA course in Natural Language Processing

Home Page: https://lena-voita.github.io/nlp_course.html

License: MIT License

Dockerfile 0.12% Shell 0.01% Python 3.04% Jupyter Notebook 95.19% HTML 1.51% C++ 0.03% Cuda 0.10%

nlp_course's People

Contributors

0xx400, alekseik1, alexeyhorkin, altsoph, artemxx, artnitolog, blacksamorez, drt7, falaleevar, femoiseev, filimonova-md, justheuristic, kovarsky, lena-voita, ludweeg, m-evdokimov, mryab, muhamob, nazarov-yuriy, neychev, nikitachampion, poedator, sashamn, sava-stepurin, sergey-v-galtsev, shakhrayv, tixfeniks, valvarl, vprov, yura52


nlp_course's Issues

Week 8 homework throws exception with Python 3.7

Training SimpleModel using default code produces the following exception at the end of training:

---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-23-aa60453f1abd> in iterate_minibatches(data, batch_size, shuffle, cycle, max_batches)
     10             if max_batches and total_batches >= max_batches:
---> 11                 raise StopIteration()
     12         if not cycle: break

StopIteration: 

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
<ipython-input-27-137835dc5639> in <module>()
----> 1 for batch in tqdm(iterate_minibatches(train_data, cycle=True, max_batches=2500)):
      2     loss_t, _ = sess.run([trainer.loss, trainer.step],
      3                          {trainer.ph[key]: batch[key] for key in trainer.ph})
      4     loss_history.append(loss_t)
      5 

/usr/lib/python3.7/site-packages/tqdm/_tqdm.py in __iter__(self)
    977 """, fp_write=getattr(self.fp, 'write', sys.stderr.write))
    978 
--> 979             for obj in iterable:
    980                 yield obj
    981                 # Update and possibly print the progressbar.

RuntimeError: generator raised StopIteration

This is an intentional behavior change in Python 3.7: a StopIteration raised inside a generator is now converted into a RuntimeError. Please see https://www.python.org/dev/peps/pep-0479/ for additional details.
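A minimal sketch of a PEP 479-compatible fix, with the loop structure inferred from the traceback above (variable names and the pandas-style indexing are assumptions): inside a generator, a plain return ends iteration cleanly, while raise StopIteration() now surfaces as a RuntimeError.

import numpy as np

def iterate_minibatches(data, batch_size=256, shuffle=True, cycle=False, max_batches=None):
    total_batches = 0
    while True:
        indices = np.arange(len(data))
        if shuffle:
            np.random.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            yield data.iloc[indices[start:start + batch_size]]  # assumes a DataFrame
            total_batches += 1
            if max_batches and total_batches >= max_batches:
                return  # was: raise StopIteration()
        if not cycle:
            break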

Installing libraries

If you have any issues with libraries, post 'em here.

We assume that you have a basic data science toolkit (sklearn, numpy/scipy/pandas): basically, whatever comes with the default Anaconda distribution.

If you don't have or can't install that (e.g. you use Windows and installation is tricky), there's a docker container available (see below).

Manual install

  • NLP: pip install --upgrade nltk gensim bs4 editdistance
  • Tensorflow: pip install --upgrade tensorflow keras
  • Other: pip install bokeh tqdm

Installing with GPU

To enable GPU on tensorflow,

  • uninstall tensorflow if you have it, pip uninstall tensorflow (or conda uninstall tensorflow if you used anaconda)
  • if you use conda (any OS), try conda install tensorflow-gpu
  • without conda (linux / mac OS only), just pip install tensorflow-gpu. Make sure you have the appropriate CUDA toolkit; a quick sanity check follows below.
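To verify that tensorflow actually sees the GPU (a minimal check; tf.test.is_gpu_available is the TF 1.x-era API, so the exact call is an assumption about your installed version):

import tensorflow as tf
print(tf.test.is_gpu_available())  # prints True if a CUDA device is visible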

Install with docker

Get the course image from dockerhub
(or just docker pull justheuristic/nlp_course if you have a docker shell)

If you want to build it yourself, use these instructions.
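Once you have the image, a hypothetical way to start it (the port mapping and entrypoint are assumptions; adjust to what the actual image exposes):

docker run -it --rm -p 8888:8888 justheuristic/nlp_course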

If you run into any trouble, feel free to post here, even if it's like "I don't know what the hell all these words are!".

Help with seminar on conversation

Hello, I am trying to adapt the idea of learning on triplets to a classification task on an imbalanced dataset.

I select 1 anchor, 1 positive example from the same class, and 1 negative example from a random other class. I want the model to learn to embed sentences of the same class closer together, and afterwards to train an SVM (or something else) to classify according to the embeddings produced by the trained model.

Can you please suggest what the model's architecture could look like? In the course you suggested using several dense layers on top of a pretrained BERT (you also suggested not training the BERT weights, just these dense layers).
What would be a good output size for the vector if I want to use it later for classification? Maybe 16?
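For concreteness, a minimal sketch of that setup (the layer sizes, including the 16-dim output, are illustrative assumptions, not course recommendations):

import torch.nn as nn
import transformers

class TripletEncoder(nn.Module):
    def __init__(self, emb_dim=16):
        super().__init__()
        self.bert = transformers.AutoModel.from_pretrained('bert-base-uncased')
        for p in self.bert.parameters():
            p.requires_grad = False  # freeze BERT, train only the head
        self.head = nn.Sequential(
            nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, emb_dim))

    def forward(self, **tokens):
        pooled = self.bert(**tokens)['pooler_output']  # [batch, 768]
        return self.head(pooled)

# training step: pull anchor/positive together, push anchor/negative apart
# loss = nn.TripletMarginLoss(margin=1.0)(enc(**a), enc(**p), enc(**n))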

I will be very grateful for suggestions!

P.S. Guys, you really are the best; your course helped me a lot in learning NLP!

A delta typo

First of all, thanks a lot for your great website! Really, hats off!

In the linked page, "generative" should be "discriminative" in the second bullet point.

Apologies for opening an issue for a very small thing.
Thanks.

Some code is not working due to new versions of libraries

Hi!
Talking about nlp_course/week01_embeddings/seminar.ipynb:
The line "Requirements: pip install --upgrade nltk gensim bokeh, but only if you're running locally." will install the latest versions of the libraries, because no exact versions are specified.
I suggest pinning the exact versions of the libraries you intended to use in the notebooks.
As of May 2021, gensim has version 4.0.1
It means that

words = sorted(model.vocab.keys(), 
               key=lambda word: model.vocab[word].count,
               reverse=True)[:1000]

will not work.
Better to replace it with

words = sorted(model.key_to_index.keys(), 
               key=lambda word: model.get_vecattr(word, "count"),
               reverse=True)[:1000]
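Alternatively, pinning the last pre-4.0 release keeps the original code working (the exact version the notebooks were written against is an assumption):

pip install gensim==3.8.3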

Talking about nlp_course/week01_embeddings/homework.ipynb:

precision_top1 = precision(uk_ru_test, mapping.predict(X_test), 1)
precision_top5 = precision(uk_ru_test, mapping.predict(X_test), 5)

assert precision_top1 >= 0.635
assert precision_top5 >= 0.813

And here it only passes after relaxing the threshold to precision_top5 >= 0.811 (probably due to the new gensim version as well).

P.S. I will update this issue with new problems as I go through the course.

week_2 seminar.ipynb

Problem with this seminar.
Right now an error is raised in the cells with model.predict(make_batch(...)), because nn.Module has no predict method.
Please add a function somewhere, like:

def predict(model, batch):
    return model(batch).detach().numpy()

and change the model.predict(make_batch(...)).detach().cpu() expressions to predict(model, make_batch(...)).

A question about bow_vocabulary in w2_homework_part1

Greetings,

I am working on the week 2 homework (part 1) notebook and have a question about bow_vocabulary: the length of the bow_vocabulary I created is different from the length of the set of all tokens in the training set.
(screenshot showing the length mismatch; not reproduced here)

The way I created bow_vocabulary is as follows:
basically, following the idea from week 1, I split the text on " " (space), which yields the tokens generated by TweetTokenizer.
Then I count the occurrences of each token and keep only the top k words in the vocabulary.
(screenshot not reproduced here)
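A minimal sketch of that construction (texts and k are assumed to come from the notebook):

from collections import Counter

def build_bow_vocabulary(texts, k=10000):
    counts = Counter(token for text in texts for token in text.split())
    return [token for token, _ in counts.most_common(k)]  # k most frequent tokens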

When putting all tokens into a set, some tokens (strings) are treated as the same one, so the length decreases.
I am wondering:

  1. Is my way of creating bow_vocabulary correct?
  2. My understanding is that we should keep and use the tokens from the tokenizer, so that the vocabulary is created as: tokens -> vocabulary.
    However, I also understand that some strings might be meaningless (consisting only of symbols, for example) and could be merged together, like in the set used in the notebook, so that the vocabulary is created as: tokens -> pre-processing -> vocabulary.
    Could you shed some light on this? That would be super helpful!

Thanks for your time to look at this and I am looking forward to hearing your reply!

week 3, seminar, Kneser-Ney smoothing formula

There must be a bug in the Kneser-Ney smoothing formula (seminar screenshot not reproduced here).
The denominator shouldn't contain a subtraction of delta from the counts. First of all, the denominator becomes 0 in some cases; also, there is no delta in the denominator on the lecture slides or the wiki page.
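For reference, the standard interpolated Kneser-Ney estimate (as on the wiki page mentioned above) discounts only the numerator:

\[
P_{KN}(w_i \mid w_{i-n+1}^{i-1}) =
  \frac{\max\left(c(w_{i-n+1}^{i}) - \delta,\; 0\right)}{\sum_{w'} c(w_{i-n+1}^{i-1} w')}
  + \lambda(w_{i-n+1}^{i-1}) \, P_{KN}(w_i \mid w_{i-n+2}^{i-1})
\]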

05_seminar: error - unexpected keyword argument 'input_ids'

class MyBertBasedClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = transformers.AutoModel.from_pretrained('bert-base-uncased')
        self.head = nn.Linear(768, 2)

    def forward(self, **kwargs):
        out = model(**tokens_info)['pooler_output']
        return self.head(out)

clf = MyBertBasedClassifier()
clf(**tokens_info)

The last line returns an error:
TypeError: _forward_unimplemented() got an unexpected keyword argument 'input_ids'

I double-checked the entire code and it is identical to the code in the original seminar notebook. Can't understand what's wrong.

Solution to homework?

Are there any solutions provided for the homework, and where can I find them? Thanks!

Problem building the model in week02_classification/seminar

I am stuck building the model in week 2. Here is my code.

def build_model(n_tokens=len(tokens), n_cat_features=len(categorical_vectorizer.vocabulary_), hid_size=64):
    l_title = L.Input(shape=[None], name="Title")
    l_descr = L.Input(shape=[None], name="FullDescription")
    l_categ = L.Input(shape=[None], name="Categorical")

    # Build your monster!
    # <YOUR CODE>

    # Title
    x_t = Embedding(input_dim=len(tokens), output_dim=5, name="Title_Embedding")(l_title)
    x_t = Conv1D(filters=5, kernel_size=5, activation='relu')(x_t)
    x_t = MaxPooling1D(5)(x_t)
    x_t = Dense(1)(x_t)

    # FullDescription
    x_d = Embedding(input_dim=len(tokens), output_dim=5, name="FullDescription_Embedding")(l_descr)
    x_d = Conv1D(filters=5, kernel_size=5, activation='relu')(x_d)
    x_d = MaxPooling1D(5)(x_d)
    x_d = Dense(1)(x_d)

    # Categorical
    x_c = Embedding(input_dim=n_cat_features, output_dim=5, name="Categorical_Embedding")(l_categ)
    x_c = Dense(1)(x_c)

    # Concat
    concat = Concatenate()([x_t, x_d, x_c])
    output_layer = Dense(1)(concat)

    # end of your code
    model = keras.models.Model(inputs=[l_title, l_descr, l_categ], outputs=[output_layer])
    model.compile('adam', 'mean_squared_error', metrics=['mean_absolute_error'])
    return model

I ran into the following problem.

Expected the last dense layer to have 3 dimensions, but got array with shape (100, 1)

The reason I changed the shape of the categorical input layer to None is that I was not able to concatenate a defined layer with the other two undefined layers at the last step.

I chose an embedding layer for the categorical encoder because of the error "The shape of the input to "Flatten" is not fully defined (got (None, 1)). Make sure to pass a complete "input_shape" or "batch_input_shape" argument to the first layer in your model.", so I used an embedding layer to keep all inputs at the same number of dimensions.

Could you please help me with these problems? Thank you in advance.
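For what it's worth, one common way to resolve the "expected 3 dimensions" mismatch (a sketch under assumed shapes, not the seminar's official solution) is to collapse the variable time axis with global pooling, so every branch outputs a fixed-size [batch, features] tensor before concatenation:

import keras
import keras.layers as L

def build_model(n_tokens, n_cat_features, hid_size=64):
    l_title = L.Input(shape=[None], name="Title")
    l_descr = L.Input(shape=[None], name="FullDescription")
    l_categ = L.Input(shape=[n_cat_features], name="Categorical")  # fixed-size one-hot features

    def text_branch(inp):
        x = L.Embedding(n_tokens, hid_size)(inp)
        x = L.Conv1D(hid_size, kernel_size=3, activation='relu')(x)
        return L.GlobalMaxPooling1D()(x)  # [batch, hid_size], time axis removed

    x_t = text_branch(l_title)
    x_d = text_branch(l_descr)
    x_c = L.Dense(hid_size, activation='relu')(l_categ)

    output_layer = L.Dense(1)(L.Concatenate()([x_t, x_d, x_c]))
    model = keras.models.Model(inputs=[l_title, l_descr, l_categ], outputs=[output_layer])
    model.compile('adam', 'mean_squared_error', metrics=['mean_absolute_error'])
    return model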

Week02 seminar Kneser–Ney smoothing

Hi!
I have a question about the Kneser-Ney smoothing formula. The formula for calculating lambda on the slide has two parts (slide image not reproduced here).

What is the difference between part 1 and part 2? I think they both count the number of times w_{i-n+1}, ..., w_{i-1} and w_i co-occur.
The formula on the wiki page also differs (image not reproduced here).
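For reference, the normalizing weight on the wiki page is

\[
\lambda(w_{i-n+1}^{i-1}) =
  \frac{\delta}{\sum_{w'} c(w_{i-n+1}^{i-1} w')}
  \,\left|\{\, w' : c(w_{i-n+1}^{i-1} w') > 0 \,\}\right|
\]

where the second factor counts the number of distinct word types that follow the context (not token co-occurrences), which is presumably the difference between the two parts on the slide.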

homework

Hi there,
This is a really great resource; is it possible to get the homework solutions? Thanks!

week02_classification/seminar: how to handle the different lengths of title and description?

I have some trouble with the dimensions in the network architecture:
in our seminar, the 'title' and the 'description' always have different lengths. In the paper 'Convolutional Neural Networks for Sentence Classification' the author pads the inputs, but the code in 'seminar.ipynb' doesn't pad all batches to the same length. Do I need to modify the code in the '.ipynb', or are there other solutions to handle the different lengths?
I would be very grateful for someone's help!
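One standard way to handle this (a minimal sketch; the token-id format and pad_id value are assumptions about the notebook's data) is to pad each batch to the length of its longest sequence:

import numpy as np

def pad_batch(sequences, pad_id=0):
    max_len = max(len(seq) for seq in sequences)
    batch = np.full((len(sequences), max_len), pad_id, dtype=np.int64)
    for i, seq in enumerate(sequences):
        batch[i, :len(seq)] = seq  # left-aligned, padding on the right
    return batch

Note that the title and description branches don't need to share a length: each can be padded (and later pooled) independently before the concatenation step.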

Some mistakes in week1 articles

  • Distributed Representations of Sentences and Documents: there is a blank space between [] and (), so the markdown link fails
  • GloVe: Global Vectors for Word Representation: the resource is not arxiv
  • Enriching Word Vectors with Subword Information: appears twice

solution

Are there any solutions provided for the homework, and where can I find them?

Thanks~

week3_lm/homework.ipynb does averaging incorrectly and refers to rnn_lm instead of window_lm

It contains the following code:

assert np.mean(train_history[:10]) > np.mean(train_history[-10:]), "The model didn't converge."

This is incorrect because train_history is a list of pairs (i, loss). A correct way is:

assert np.mean(train_history[:10], axis=0)[1] > np.mean(train_history[-10:], axis=0)[1], "The model didn't converge."

Also, a couple of lines further down, it refers to rnn_lm instead of window_lm.

I have a problem in week01 homework. "Making it better (orthogonal Procrustean problem)"

from numpy import linalg as la
def learn_transform(X_train, Y_train):
    """ 
    :returns: W* : float matrix[emb_dim x emb_dim] as defined in formulae above
    """
    u, sigma, vt = la.svd(np.matmul(X_train.T, Y_train)) # 300*300
    W = np.matmul(u,vt.T)
    return W

I completed the SVD code, but the results are bad.

W = learn_transform(X_train, Y_train)
ru_emb.most_similar([np.matmul(uk_emb["серпень"], W)])

[('Раззаков', 0.2569316625595093),
 ('Tyrrell', 0.25682011246681213),
 ('Cosworth', 0.2490822821855545),
 ('ANZ', 0.24802428483963013),
 ('owned', 0.24786119163036346),
 ('Bowie', 0.24696007370948792),
 ('Хемингуэй', 0.24553290009498596),
 ('debut', 0.24303579330444336),
 ('Тэдзука', 0.24089395999908447),
 ('underground', 0.24075618386268616)]

I don't know why.
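A likely culprit (a sketch, assuming the standard orthogonal Procrustes solution W* = U V^T for the SVD X^T Y = U S V^T): numpy.linalg.svd already returns V^T, so taking vt.T transposes V twice and breaks the solution.

import numpy as np
from numpy import linalg as la

def learn_transform(X_train, Y_train):
    """
    :returns: W* : float matrix[emb_dim x emb_dim] as defined in the formulae above
    """
    u, sigma, vt = la.svd(np.matmul(X_train.T, Y_train))  # vt is already V^T
    return np.matmul(u, vt)  # W* = U V^T, no extra transpose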

week1_embeddings: naming and description of parameters of precision() are incorrect

The notebook contains the following function:

def precision(pairs, uk_vectors, topn=1):
    """ 
    :args:
        pairs = list of right word pairs [(uk_word_0, ru_word_0), ...]
        uk_vectors = list of embeddings for Ukraininan words
    :returns:
        precision_val, float number, total number of words for those we can find right translation at top K.
    """

This does not seem to match what the subsequent tests expect. The tests pass, as uk_vectors, the vectors of Ukrainian words that have already been mapped by the linear transformation. Thus they are not embeddings of Ukrainian words; they are predicted embeddings of Russian words.

homework answer

Hi,
I want to compare my homework solutions against reference answers.

Is there any discussion group?

Hi there,
This course is terrific material for learning NLP from scratch with Pytorch/TensorFlow.
But I have a problem: when I finish the homework, I don't know whether my code is right or wrong.
Is there any discussion group where learners can discuss or verify that we are coding it right?

Thanks in advance.

Homework

Does anyone have any solutions for the homework? I'm stuck on several problems and need some help.

requirement.txt

Would it be possible to get a requirements.txt?
Since the install instructions don't mention specific versions of python or gensim, I constantly run into bugs.

Thank you!!!

I need some help!

It's a great project, but there is some homework I can't solve. Can you help me, or give me the answers? Thank you very much!

In pytorch week02_classification seminar, the bonus part doesn't work

In the pytorch version of the seminar, the bonus part doesn't work; I suppose that part works only for the keras version.

def explain(model, sample, col_name='Title'):
    """ Computes the effect each word had on model predictions """
    sample = dict(sample)
    sample_col_tokens = [tokens[token_to_id.get(tok, 0)] for tok in sample[col_name].split()]
    data_drop_one_token = pd.DataFrame([sample] * (len(sample_col_tokens) + 1))

    for drop_i in range(len(sample_col_tokens)):
        data_drop_one_token.loc[drop_i, col_name] = ' '.join(UNK if i == drop_i else tok
                                                   for i, tok in enumerate(sample_col_tokens)) 

    *predictions_drop_one_token, baseline_pred = model.predict(make_batch(data_drop_one_token))[:, 0]  # <- fails: nn.Module has no .predict
    diffs = baseline_pred - predictions_drop_one_token
    return list(zip(sample_col_tokens, diffs))
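A possible pytorch replacement for the failing line (a sketch, assuming make_batch returns tensors on the model's device):

import torch

with torch.no_grad():  # inference only, no gradient tracking
    preds = model(make_batch(data_drop_one_token))[:, 0].cpu().numpy()
*predictions_drop_one_token, baseline_pred = preds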
