
yorko / mlcourse.ai


Open Machine Learning Course

Home Page: https://mlcourse.ai

License: Other

Python 90.94% HTML 9.05% Batchfile 0.01% Shell 0.01%
machine-learning data-analysis data-science pandas algorithms numpy scipy matplotlib seaborn plotly scikit-learn kaggle-inclass vowpal-wabbit python ipynb docker math

mlcourse.ai's Introduction


mlcourse.ai – Open Machine Learning Course

License: CC BY-NC-SA 4.0

mlcourse.ai is an open Machine Learning course by OpenDataScience (ods.ai), led by Yury Kashnitsky (yorko). With both a Ph.D. in applied math and the Kaggle Competitions Master tier, Yury aimed to design an ML course with a good balance between theory and practice. Thus, the course combines math formulae in lectures with plenty of practice in the form of assignments and Kaggle Inclass competitions. Currently, the course is in a self-paced mode. Here we guide you through the self-paced mlcourse.ai.

Bonus: Additionally, you can purchase a Bonus Assignments pack with the best non-demo versions of mlcourse.ai assignments. Select the "Bonus Assignments" tier. Details of the deal are given on the main page mlcourse.ai.

Mirrors (🇬🇧-only): mlcourse.ai (main site), Kaggle Dataset (same notebooks as Kaggle Notebooks)

Self-paced passing

You are guided through 10 weeks of mlcourse.ai. For each week, from Pandas to Gradient Boosting, instructions are given on which articles to read, which lectures to watch, and which assignments to complete.

Articles

This is the list of articles published on medium.com 🇬🇧 and habr.com 🇷🇺. Notebooks in Chinese 🇨🇳 are also listed, and links to Kaggle Notebooks (in English) are given. Icons are clickable.

  1. Exploratory Data Analysis with Pandas 🇬🇧 🇷🇺 🇨🇳, Kaggle Notebook
  2. Visual Data Analysis with Python 🇬🇧 🇷🇺 🇨🇳, Kaggle Notebooks: part1, part2
  3. Classification, Decision Trees and k Nearest Neighbors 🇬🇧 🇷🇺 🇨🇳, Kaggle Notebook
  4. Linear Classification and Regression 🇬🇧 🇷🇺 🇨🇳, Kaggle Notebooks: part1, part2, part3, part4, part5
  5. Bagging and Random Forest 🇬🇧 🇷🇺 🇨🇳, Kaggle Notebooks: part1, part2, part3
  6. Feature Engineering and Feature Selection 🇬🇧 🇷🇺 🇨🇳, Kaggle Notebook
  7. Unsupervised Learning: Principal Component Analysis and Clustering 🇬🇧 🇷🇺 🇨🇳, Kaggle Notebook
  8. Vowpal Wabbit: Learning with Gigabytes of Data 🇬🇧 🇷🇺 🇨🇳, Kaggle Notebook
  9. Time Series Analysis with Python, part 1 🇬🇧 🇷🇺 🇨🇳. Predicting future with Facebook Prophet, part 2 🇬🇧, 🇨🇳 Kaggle Notebooks: part1, part2
  10. Gradient Boosting 🇬🇧 🇷🇺, 🇨🇳, Kaggle Notebook

Lectures

Video lectures are uploaded to this YouTube playlist. Introduction: video, slides

  1. Exploratory data analysis with Pandas, video
  2. Visualization, main plots for EDA, video
  3. Decision trees: theory and practical part
  4. Logistic regression: theoretical foundations, practical part (baselines in the "Alice" competition)
  5. Ensembles and Random Forest – part 1. Classification metrics – part 2. Example of a business task, predicting a customer payment – part 3
  6. Linear regression and regularization - theory, LASSO & Ridge, LTV prediction - practice
  7. Unsupervised learning - Principal Component Analysis and Clustering
  8. Stochastic Gradient Descent for classification and regression - part 1, part 2 TBA
  9. Time series analysis with Python (ARIMA, Prophet) - video
  10. Gradient boosting: basic ideas - part 1, key ideas behind Xgboost, LightGBM, and CatBoost + practice - part 2

Assignments

The following are demo assignments. Additionally, within the "Bonus Assignments" tier you can get access to non-demo assignments.

  1. Exploratory data analysis with Pandas, nbviewer, Kaggle Notebook, solution
  2. Analyzing cardiovascular disease data, nbviewer, Kaggle Notebook, solution
  3. Decision trees with a toy task and the UCI Adult dataset, nbviewer, Kaggle Notebook, solution
  4. Sarcasm detection, Kaggle Notebook, solution. Linear Regression as an optimization problem, nbviewer, Kaggle Notebook
  5. Logistic Regression and Random Forest in the credit scoring problem, nbviewer, Kaggle Notebook, solution
  6. Exploring OLS, Lasso and Random Forest in a regression task, nbviewer, Kaggle Notebook, solution
  7. Unsupervised learning, nbviewer, Kaggle Notebook, solution
  8. Implementing online regressor, nbviewer, Kaggle Notebook, solution
  9. Time series analysis, nbviewer, Kaggle Notebook, solution
  10. Beating baseline in a competition, Kaggle Notebook

Bonus assignments

Additionally, you can purchase a Bonus Assignments pack with the best non-demo versions of mlcourse.ai assignments. Select the "Bonus Assignments" tier on Patreon or a similar tier on Boosty (in Russian).

  

Details of the deal

mlcourse.ai is still in self-paced mode, but we offer Bonus Assignments with solutions for a contribution of $17/month. The idea is that you pay for ~1-5 months while studying the course materials, but a single contribution is also fine and grants you access to the bonus pack.

Note: the first payment is charged at the moment of joining the tier on Patreon, and the next payment is charged on the 1st day of the next month, so it's better to purchase the pack in the 1st half of the month.

mlcourse.ai is not meant to ever go fully monetized (it's created in the wonderful open ODS.ai community and will remain open and free), but this helps to cover some operational costs, and Yury also put quite some effort into assembling all the best assignments into one pack. Please note that, unlike the rest of the course content, the Bonus Assignments are copyrighted. Informally, Yury is fine with you sharing the pack with 2-3 friends, but public sharing of the Bonus Assignments pack is prohibited.


The bonus pack contains 10 assignments. In some of them, you are challenged to beat a baseline in a Kaggle competition under thorough guidance ("Alice" and "Medium"); in others, to implement an algorithm from scratch: an efficient stochastic gradient descent classifier and gradient boosting.

Kaggle competitions

  1. Catch Me If You Can: Intruder Detection through Webpage Session Tracking. Kaggle Inclass
  2. Predicting popularity of a Medium article. Kaggle Inclass
  3. DotA 2 winner prediction. Kaggle Inclass

Citing mlcourse.ai

If you happen to cite mlcourse.ai in your work, you can use this BibTeX record:

@misc{mlcourse_ai,
    author = {Kashnitsky, Yury},
    title = {mlcourse.ai – Open Machine Learning Course},
    year = {2020},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/Yorko/mlcourse.ai}},
}

Community

You can join the Singularis.ai Slack community to ask questions on the course materials. The community is mostly Russian-speaking but questions in English are still welcome.

mlcourse.ai's People

Contributors

chrisroj, ikolzin, sevaseva2001, shiuandinq, yorko


mlcourse.ai's Issues

Variance formula in Article 3 and Assignment 3

The formulas for variance D in Article 3 and Assignment 3 are somewhat different, and neither seems ideal. In the article, the index i is used for both the inner and the outer sums over y, which may not be quite right from a formal point of view. In the assignment, this was possibly corrected by using y_j and x_j; however, j and x_j are also used in the same paragraph to denote the split feature, which can be misleading.
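
For reference, assuming D denotes the variance of the targets y over the ℓ objects reaching a node (as in the article), one consistent way to write it with distinct indices for the outer and inner sums would be:

$$D = \frac{1}{\ell}\sum_{i=1}^{\ell}\left(y_i - \frac{1}{\ell}\sum_{j=1}^{\ell} y_j\right)^2$$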

A short guide to working with the course materials via Git/GitHub

Good afternoon!

First of all, I'd like to say that you're doing a great job with this course! The thoroughness with which you launched it is impressive!

Secondly, I'd like to suggest a small addition, namely a short guide on how to work with the course materials using Git/GitHub. I don't think all of your students know how to work with Git/GitHub. Judging by the experience of Yury's course at HSE last year, where he also published the materials on GitHub, it was not at all obvious how to update them correctly without losing one's own notes in the lectures.

I guess that most likely one needs to Fork / Clone, probably create a branch(?) and proceed somehow from there, but I'd like to know for sure. This might be useful to others as well.

Best regards,
Andrey

Topic 2: typos

In
mlcourse.ai-master/jupyter_english/topic02_visual_data_analysis/topic2_visual_data_analysis.ipynb
"ellpise" instead of "ellipse"

Topic 2 Part 2, "median ($50\%)"

Something went wrong with the formula "median ($50%)" in the boxplot() explanation in the "3. Seaborn" part. Judging by the previous article, "median ($50\%$)" should render correctly, unlike "($50\%)".

Minor typo in 4.1

"X – is a matrix of obesrvations and their features".
There's a typo in the word "observations".

Assignment 9

https://www.kaggle.com/kashnitsky/assignment-9-time-series-analysis

  1. The web form does not correspond to the questions in the task. At least the 1st question is missing.

  2. I doubt that the 1st question has a correct answer among the options. Simple operations lead to the result 3426.195682, which is not listed among the possible answers. Kernel attached. In Q2-4 the situation is the same: practically no coding, just copy-pasting from the lecture, and still the results are not among the listed answers. Although maybe I didn't understand the tasks correctly.
    kernel (2).zip

  3. For some reason numpy is not imported, even though it is definitely needed

Add demo gif to README

Disclaimer: This is a bot

It looks like your repo is trending. The github_trending_videos Instagram account automatically shows demo gifs of trending repos on GitHub.

Your README doesn't seem to have any demo gifs. Add one, and the next time the parser runs it will pick it up and post it on its Instagram feed. If you don't want this, just close this issue and we won't bother you again.

Topic 2 Workbook issue

In topic2_part2_telecom_churn_tsne.ipynb

fig, axes = plt.subplots(nrows=3, ncols=4, figsize=(16, 10))

for idx, feat in  enumerate(features):
    sns.boxplot(x='Churn', y=feat, data=df, ax=axes[idx / 4, idx % 4])
    axes[idx / 4, idx % 4].legend()
    axes[idx / 4, idx % 4].set_xlabel('Churn')
    axes[idx / 4, idx % 4].set_ylabel(feat);

would yield the following error:

IndexError                                Traceback (most recent call last)
<ipython-input-15-b3733e9b0263> in <module>()
      2 
      3 for idx, feat in  enumerate(features):
----> 4     sns.boxplot(x='Churn', y=feat, data=df, ax=axes[idx / 4, idx % 4])
      5     axes[idx / 4, idx % 4].legend()
      6     axes[idx / 4, idx % 4].set_xlabel('Churn')

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

This can be fixed by replacing idx / 4 with either int(idx / 4) or idx // 4, as in the sketch below.
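
A minimal sketch of the corrected loop (assuming the same df with the telecom churn data and the features list defined earlier in the notebook):

import seaborn as sns
from matplotlib import pyplot as plt

fig, axes = plt.subplots(nrows=3, ncols=4, figsize=(16, 10))

for idx, feat in enumerate(features):
    # integer division yields valid integer (row, col) indices into the axes array
    row, col = idx // 4, idx % 4
    sns.boxplot(x='Churn', y=feat, data=df, ax=axes[row, col])
    axes[row, col].set_xlabel('Churn')
    axes[row, col].set_ylabel(feat)

The .legend() call is dropped here, as sns.boxplot does not add labeled handles for it in this usage.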

Topic 2 Part 1, Box plot explanation

It looks like the word "horizontal", not "vertical", should be used in the phrase "The vertical line inside the box marks the median (50%) of the distribution" in the description of sns.boxplot picture.

Possibly, in the sentence "its length is determined by the 25th(Q1) and 75th(Q3) percentiles" the word "height" would be more suitable as well.

Confidence intervals interpretation

A 95% confidence interval does not mean that for a given realized interval there is a 95% probability that the population parameter lies within the interval (i.e., a 95% probability that the interval covers the population parameter).

But the Medium article says:
"In the end, we see that, with 95% probability, the average number of customer service calls from loyal customers lies between 1.4 and 1.49, while the churned clients called 2.06 through 2.40 times on average."
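
For illustration, a minimal sketch (with made-up numbers, not the course data) of how such a 95% t-interval for a mean is computed; the correct frequentist reading concerns the procedure, not a probability statement about one realized interval:

import numpy as np
from scipy import stats

# hypothetical per-customer counts of service calls (not the course dataset)
calls = np.array([1, 0, 2, 1, 3, 1, 0, 2, 1, 1])

mean, sem = calls.mean(), stats.sem(calls)
# 95% of intervals constructed this way would cover the true mean;
# it is not a 95% probability statement about this particular interval
low, high = stats.t.interval(0.95, len(calls) - 1, loc=mean, scale=sem)
print(f"95% CI for the mean number of calls: [{low:.2f}, {high:.2f}]")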

Yandex & MIPT, Coursera, Final project – User identification

Hello! Please clarify the expected answer format for week 2, question 2:
"Is the number of unique sites in a session normally distributed?" The form gives no clear instructions on how to phrase the answer; the variants "Нет", "No", the test statistic, and the p-value of the Shapiro-Wilk test are not accepted...
Maybe I computed it incorrectly, but there is no way to tell :)

A2 demo: incomplete information in answers form

The answers form asks: "What's the rounded difference between median values of age for smokers and non-smokers? You'll need to figure out the units of feature age in this dataset."
Logically, this should be answered in years, but it turns out that the expected answer is in months, so that should be mentioned.

The solution to question 5.11 is not stable

Even with the random_state parameters set, the best_score of the best model differs from the options given in the answers.

Confirmed by several participants who ran it.

Possibly, the specific package versions affect the computation.

I can attach an ipynb that reproduces this.

Topic 3, Some graphviz images missing

It looks like the graphviz images after code cells 9 and 13 were not rendered. Given that the graph after the 6th cell is present, it isn't a browser issue, and restoring them should not be difficult.

Topic 5 Part 1: KDE Plot used for 'Customer service calls'

The KDE plot is used for 'Customer service calls' in Topic 5 Part 1. Due to the nature of KDE, the plot is smoother than it should be for this type of discrete data: e.g., the peaks for loyal customers are lowered, and there is a tail of negative numbers of phone calls for churned customers.

I suggest switching to distplot, which can estimate the distribution with both a histogram and a kernel density estimate. An example of code is shown below.

import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = 10, 6

telecom_data = pd.read_csv('../../data/telecom_churn.csv')

# distplot overlays a histogram with a kernel density estimate,
# which suits this discrete feature better than a pure KDE plot
ax = sns.distplot(telecom_data[telecom_data['Churn'] == False]['Customer service calls'], label='Loyal')
ax = sns.distplot(telecom_data[telecom_data['Churn'] == True]['Customer service calls'], label='Churn')
ax.set(xlabel='Number of calls', ylabel='Density')
ax.legend()
plt.show()


Typo in 4.1

After the definition of linearity in the weights, there is a mathematical expression that cannot be displayed correctly, presumably due to a typo. It looks like ∀k is meant.
Best

where $\forall\ k\ $,

Topic 2: wrong direction on chart in section 4.3 t-SNE


The direction stated on the chart in the article on GitHub (https://mlcourse.ai/notebooks/blob/master/jupyter_english/topic02_visual_data_analysis/topic2_visual_data_analysis.ipynb) is wrong: "south-west".
But it is correct in the article on Medium (https://medium.com/open-machine-learning-course/open-machine-learning-course-topic-2-visual-data-analysis-in-python-846b989675cd), since a different version of the chart is used there.


It's not a critical inaccuracy, but a little embarrassing.

locally built docker image doesn't work

I've built a docker image locally using docker image build, and then tried to run it like this:

python run_docker_jupyter.py -t mlc_local

got this:

Running command
docker run -it  --rm -p 5022:22 -p 4545:4545 -v "/home/egor/private/mlcourse.ai":/notebooks -w /notebooks mlc_local jupyter
Command: jupyter
[I 12:44:17.454 NotebookApp] Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/traitlets/traitlets.py", line 528, in get
    value = obj._trait_values[self.name]
KeyError: 'allow_remote_access'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/notebook/notebookapp.py", line 869, in _default_allow_remote
    addr = ipaddress.ip_address(self.ip)
  File "/usr/lib/python3.5/ipaddress.py", line 54, in ip_address
    address)
ValueError: '' does not appear to be an IPv4 or IPv6 address

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/jupyter-notebook", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.5/dist-packages/jupyter_core/application.py", line 266, in launch_instance
    return super(JupyterApp, cls).launch_instance(argv=argv, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/traitlets/config/application.py", line 657, in launch_instance
    app.initialize(argv)
  File "<decorator-gen-7>", line 2, in initialize
  File "/usr/local/lib/python3.5/dist-packages/traitlets/config/application.py", line 87, in catch_config_error
    return method(app, *args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/notebook/notebookapp.py", line 1629, in initialize
    self.init_webapp()
  File "/usr/local/lib/python3.5/dist-packages/notebook/notebookapp.py", line 1379, in init_webapp
    self.jinja_environment_options,
  File "/usr/local/lib/python3.5/dist-packages/notebook/notebookapp.py", line 158, in __init__
    default_url, settings_overrides, jinja_env_options)
  File "/usr/local/lib/python3.5/dist-packages/notebook/notebookapp.py", line 251, in init_settings
    allow_remote_access=jupyter_app.allow_remote_access,
  File "/usr/local/lib/python3.5/dist-packages/traitlets/traitlets.py", line 556, in __get__
    return self.get(obj, cls)
  File "/usr/local/lib/python3.5/dist-packages/traitlets/traitlets.py", line 535, in get
    value = self._validate(obj, dynamic_default())
  File "/usr/local/lib/python3.5/dist-packages/notebook/notebookapp.py", line 872, in _default_allow_remote
    for info in socket.getaddrinfo(self.ip, self.port, 0, socket.SOCK_STREAM):
  File "/usr/lib/python3.5/socket.py", line 732, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -5] No address associated with hostname

Undefined names 'dprev_h' and 'dprev_c'

flake8 testing of https://github.com/Yorko/mlcourse_open

$ flake8 . --count --select=E901,E999,F821,F822,F823 --show-source --statistics

./class_cs231n/assignment3/cs231n/rnn_layers.py:264:16: F821 undefined name 'dprev_h'
    return dx, dprev_h, dprev_c, dWx, dWh, db
               ^
./class_cs231n/assignment3/cs231n/rnn_layers.py:264:25: F821 undefined name 'dprev_c'
    return dx, dprev_h, dprev_c, dWx, dWh, db
                        ^

Docker Image

The Docker image probably needs to be updated to the latest package versions.

Topic 3. Decision tree regressor, MSE

In the DecisionTreeRegressor example, the MSE in the plot title is computed incorrectly:
plt.title("Decision tree regressor, MSE = %.2f" % np.sum((y_test - reg_tree_pred) ** 2))
It should also be divided by the number of observations; I suggest fixing it like this:
plt.title("Decision tree regressor, MSE = %.4f" % (np.sum((y_test - reg_tree_pred) ** 2) / n_test))

File:
https://github.com/Yorko/mlcourse.ai/blob/master/jupyter_english/topic03_decision_trees_kNN/topic3_decision_trees_kNN.ipynb
And likewise in the Russian version:
https://github.com/Yorko/mlcourse.ai/blob/master/jupyter_russian/topic03_decision_trees_knn/topic3_trees_knn.ipynb

Topic 7 typo

Agglomerative clustering
# linkage — is an implementation if agglomerative algorithm
It should be of instead of if.

Assignment 7
For classification, use the support vector machine – class sklearn.svm.LinearSVC. In this course, we did study this algorithm separately, but it is well-known and you can read about it, for example here.
It seems that it should be didn't instead of did.

Topic 6

In the Kaggle kernel, in cell In [38], it is said that the performance got worse, when it actually got better.
https://www.kaggle.com/kashnitsky/topic-6-feature-engineering-and-feature-selection
In the habr article, however, the numbers really did get worse.

Also, in assignment 6, "Train a LASSO model", it is not said whether it should be trained on the scaled X or the original X. As I understand it, it should be the scaled X, in order to perform the next task of finding the least important features (see the sketch below).
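
A minimal sketch of the scaled-X approach described above (the variable names X and y and the alpha value are illustrative assumptions; scikit-learn only):

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# scale features so that Lasso coefficients are comparable across features
X_scaled = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=0.01, random_state=17)
lasso.fit(X_scaled, y)

# features with coefficients closest to zero are the least important ones
least_important = np.argsort(np.abs(lasso.coef_))[:3]
print(least_important)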

A1 (demo). Pandas and UCI adult dataset - small error

Small error in
https://github.com/Yorko/mlcourse.ai/blob/master/jupyter_english/assignments_demo/assignment01_pandas_uci_adult_solution.ipynb

Among whom the proportion of those who earn a lot (>50K) is more: among married or single men (marital-status feature)? Consider married those who have a marital-status starting with Married (Married-civ-spouse, Married-spouse-absent or Married-AF-spouse), the rest are considered bachelors.

One of the steps of the proposed solution is
data[(data['sex'] == 'Male') & (data['marital-status'].isin(['Never-married', 'Separated', 'Divorced']))]['salary'].value_counts()

Why aren't adults with marital-status == 'Widowed' considered? So, the correct step would be the following:
data[(data['sex'] == 'Male') & (data['marital-status'].isin(['Never-married', 'Separated', 'Divorced', 'Widowed']))]['salary'].value_counts()

Docker image - seaborn

It looks like the Docker image does not include seaborn 0.9.0; exercise 2 requires catplot.

assignment1_pandas_olympic Q10

https://github.com/Yorko/mlcourse.ai/blob/master/jupyter_english/assignments_fall2018/assignment1_pandas_olympic.ipynb
There was a question

  1. What is the absolute difference between the number of unique sports at the 1995 Olympics and 2016 Olympics?

But there were no official Olympic Games in 1995. The nearest Games were in 1996: https://en.wikipedia.org/wiki/1996_Summer_Olympics

If you analyze the results of those Games in 1996, the resulting option is not among the presented answers (43 new Event in 33 Sports in 2006).
