
yorko / mlcourse.ai


Open Machine Learning Course

Home Page: https://mlcourse.ai

License: Other

Python 90.94% HTML 9.05% Batchfile 0.01% Shell 0.01%
machine-learning data-analysis data-science pandas algorithms numpy scipy matplotlib seaborn plotly scikit-learn kaggle-inclass vowpal-wabbit python ipynb docker math

mlcourse.ai's Introduction


mlcourse.ai – Open Machine Learning Course

License: CC BY-NC-SA 4.0

mlcourse.ai is an open Machine Learning course by OpenDataScience (ods.ai), led by Yury Kashnitsky (yorko). With both a Ph.D. in applied math and the Kaggle Competitions Master tier, Yury aimed to design an ML course with a good balance between theory and practice. Thus, the course combines math formulae in lectures with plenty of practice in the form of assignments and Kaggle Inclass competitions. Currently, the course is in a self-paced mode. Here we guide you through the self-paced mlcourse.ai.

Bonus: Additionally, you can purchase a Bonus Assignments pack with the best non-demo versions of mlcourse.ai assignments. Select the "Bonus Assignments" tier. Details of the deal are given on the main page mlcourse.ai.

Mirrors (🇬🇧-only): mlcourse.ai (main site), Kaggle Dataset (same notebooks as Kaggle Notebooks)

Self-paced passing

You are guided through 10 weeks of mlcourse.ai. For each week, from Pandas to Gradient Boosting, instructions are given on which articles to read, which lectures to watch, and which assignments to complete.

Articles

This is the list of articles published on medium.com 🇬🇧 and habr.com 🇷🇺. Notebooks in Chinese 🇨🇳 are also listed, and links to Kaggle Notebooks (in English) are given. Icons are clickable.

  1. Exploratory Data Analysis with Pandas 🇬🇧 🇷🇺 🇨🇳, Kaggle Notebook
  2. Visual Data Analysis with Python 🇬🇧 🇷🇺 🇨🇳, Kaggle Notebooks: part1, part2
  3. Classification, Decision Trees and k Nearest Neighbors 🇬🇧 🇷🇺 🇨🇳, Kaggle Notebook
  4. Linear Classification and Regression 🇬🇧 🇷🇺 🇨🇳, Kaggle Notebooks: part1, part2, part3, part4, part5
  5. Bagging and Random Forest 🇬🇧 🇷🇺 🇨🇳, Kaggle Notebooks: part1, part2, part3
  6. Feature Engineering and Feature Selection 🇬🇧 🇷🇺 🇨🇳, Kaggle Notebook
  7. Unsupervised Learning: Principal Component Analysis and Clustering 🇬🇧 🇷🇺 🇨🇳, Kaggle Notebook
  8. Vowpal Wabbit: Learning with Gigabytes of Data 🇬🇧 🇷🇺 🇨🇳, Kaggle Notebook
  9. Time Series Analysis with Python, part 1 🇬🇧 🇷🇺 🇨🇳. Predicting future with Facebook Prophet, part 2 🇬🇧, 🇨🇳 Kaggle Notebooks: part1, part2
  10. Gradient Boosting 🇬🇧 🇷🇺, 🇨🇳, Kaggle Notebook

Lectures

Video lectures are uploaded to this YouTube playlist. Introduction: video, slides

  1. Exploratory data analysis with Pandas, video
  2. Visualization, main plots for EDA, video
  3. Decision trees: theory and practical part
  4. Logistic regression: theoretical foundations, practical part (baselines in the "Alice" competition)
  5. Ensembles and Random Forest – part 1. Classification metrics – part 2. Example of a business task, predicting a customer payment – part 3
  6. Linear regression and regularization - theory, LASSO & Ridge, LTV prediction - practice
  7. Unsupervised learning - Principal Component Analysis and Clustering
  8. Stochastic Gradient Descent for classification and regression - part 1, part 2 TBA
  9. Time series analysis with Python (ARIMA, Prophet) - video
  10. Gradient boosting: basic ideas - part 1, key ideas behind Xgboost, LightGBM, and CatBoost + practice - part 2

Assignments

The following are demo assignments. Additionally, within the "Bonus Assignments" tier you can get access to non-demo assignments.

  1. Exploratory data analysis with Pandas, nbviewer, Kaggle Notebook, solution
  2. Analyzing cardiovascular disease data, nbviewer, Kaggle Notebook, solution
  3. Decision trees with a toy task and the UCI Adult dataset, nbviewer, Kaggle Notebook, solution
  4. Sarcasm detection, Kaggle Notebook, solution. Linear Regression as an optimization problem, nbviewer, Kaggle Notebook
  5. Logistic Regression and Random Forest in the credit scoring problem, nbviewer, Kaggle Notebook, solution
  6. Exploring OLS, Lasso and Random Forest in a regression task, nbviewer, Kaggle Notebook, solution
  7. Unsupervised learning, nbviewer, Kaggle Notebook, solution
  8. Implementing online regressor, nbviewer, Kaggle Notebook, solution
  9. Time series analysis, nbviewer, Kaggle Notebook, solution
  10. Beating baseline in a competition, Kaggle Notebook

Bonus assignments

Additionally, you can purchase a Bonus Assignments pack with the best non-demo versions of mlcourse.ai assignments. Select the "Bonus Assignments" tier on Patreon or a similar tier on Boosty (in Russian).

  

Details of the deal

mlcourse.ai is still in self-paced mode, but we offer Bonus Assignments with solutions for a contribution of $17/month. The idea is that you pay for ~1-5 months while studying the course materials, but a single contribution is also fine and grants you access to the bonus pack.

Note: the first payment is charged at the moment of joining the tier on Patreon, and the next payment is charged on the 1st day of the next month, so it's better to purchase the pack in the 1st half of the month.

mlcourse.ai is not meant to ever go fully monetized (it's created in the wonderful open ODS.ai community and will remain open and free), but this helps to cover some operational costs, and Yury also put quite some effort into assembling all the best assignments into one pack. Please note that, unlike the rest of the course content, the Bonus Assignments are copyrighted. Informally, Yury is fine with you sharing the pack with 2-3 friends, but public sharing of the Bonus Assignments pack is prohibited.


The bonus pack contains 10 assignments. In some of them, you are challenged to beat a baseline in a Kaggle competition under thorough guidance ("Alice" and "Medium"); in others, to implement an algorithm from scratch: an efficient stochastic gradient descent classifier and gradient boosting.

Kaggle competitions

  1. Catch Me If You Can: Intruder Detection through Webpage Session Tracking. Kaggle Inclass
  2. Predicting popularity of a Medium article. Kaggle Inclass
  3. DotA 2 winner prediction. Kaggle Inclass

Citing mlcourse.ai

If you happen to cite mlcourse.ai in your work, you can use this BibTeX record:

@misc{mlcourse_ai,
    author = {Kashnitsky, Yury},
    title = {mlcourse.ai – Open Machine Learning Course},
    year = {2020},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/Yorko/mlcourse.ai}},
}

Community

You can join the Singularis.ai Slack community to ask questions on the course materials. The community is mostly Russian-speaking but questions in English are still welcome.

mlcourse.ai's People

Contributors

chrisroj, ikolzin, sevaseva2001, shiuandinq, yorko


mlcourse.ai's Issues

Variance formula in Article 3 and Assignment 3

The formulas for variance D in Article 3 and Assignment 3 are somewhat different, and neither seems ideal. In the article, the index i is used for both the inner and the outer sums over y, which may not be quite right from a formal point of view. In the assignment, this was possibly corrected by using y_j and x_j; however, j and x_j are also used in the same paragraph to denote the split feature, which can be misleading.
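
For reference, assuming D denotes the variance of the targets y over the ℓ objects reaching a node (as in the article), one consistent way to write it with distinct indices for the outer and inner sums would be:

$$D = \frac{1}{\ell}\sum_{i=1}^{\ell}\left(y_i - \frac{1}{\ell}\sum_{j=1}^{\ell} y_j\right)^2$$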

A short guide to working with the course materials via Git/GitHub

Good afternoon!

First of all, I'd like to say that you're doing a great job with this course! The thoroughness with which you launched it is impressive!

Secondly, I'd like to suggest a small addition, namely a short guide on how to work with the course materials using Git/GitHub. I don't think all of your students know how to work with Git/GitHub. Judging by the experience of Yury's course at HSE last year, where he also published the materials on GitHub, it was not at all obvious how to update them correctly without losing one's own notes in the lectures.

I guess that most likely one needs to Fork / Clone, probably create a branch(?) and proceed somehow from there, but I'd like to know for sure. This might be useful to others as well.

Best regards,
Andrey

Topic 2: typos

In
mlcourse.ai-master/jupyter_english/topic02_visual_data_analysis/topic2_visual_data_analysis.ipynb
"ellpise" instead of "ellipse"

Topic 2 Part 2, "median ($50\%)"

Something went wrong with the formula "median ($50%)" in the boxplot() explanation in the "3. Seaborn" part. Judging by the previous article, "median ($50\%$)" should render correctly, unlike "($50\%)".

Minor typo in 4.1

"X – is a matrix of obesrvations and their features".
There's a typo in the word "observations".

Assignment 9

https://www.kaggle.com/kashnitsky/assignment-9-time-series-analysis

  1. The web form does not correspond to the questions in the task. At least the 1st question is missing.

  2. I doubt that the 1st question has a correct answer among the options. Simple operations lead to the result 3426.195682, which is not listed among the possible answers. Kernel attached. In Q2-4 the situation is the same: practically no coding, just copy-pasting from the lecture, and still the results are not among the listed answers. Although maybe I didn't understand the tasks correctly.
    kernel (2).zip

  3. For some reason numpy is not imported, even though it is definitely needed

Add demo gif to README

Disclaimer: This is a bot

It looks like your repo is trending. The github_trending_videos Instagram account automatically shows demo gifs of trending repos on GitHub.

Your README doesn't seem to have any demo gifs. Add one, and the next time the parser runs it will pick it up and post it on its Instagram feed. If you don't want this, just close this issue and we won't bother you again.

Topic 2 Workbook issue

In topic2_part2_telecom_churn_tsne.ipynb

fig, axes = plt.subplots(nrows=3, ncols=4, figsize=(16, 10))

for idx, feat in  enumerate(features):
    sns.boxplot(x='Churn', y=feat, data=df, ax=axes[idx / 4, idx % 4])
    axes[idx / 4, idx % 4].legend()
    axes[idx / 4, idx % 4].set_xlabel('Churn')
    axes[idx / 4, idx % 4].set_ylabel(feat);

would yield the following error:

IndexError                                Traceback (most recent call last)
<ipython-input-15-b3733e9b0263> in <module>()
      2 
      3 for idx, feat in  enumerate(features):
----> 4     sns.boxplot(x='Churn', y=feat, data=df, ax=axes[idx / 4, idx % 4])
      5     axes[idx / 4, idx % 4].legend()
      6     axes[idx / 4, idx % 4].set_xlabel('Churn')

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

This can be fixed by replacing idx / 4 with either int(idx / 4) or idx // 4, as in the sketch below.
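
A minimal sketch of the corrected loop (assuming the same df with the telecom churn data and the features list defined earlier in the notebook):

import seaborn as sns
from matplotlib import pyplot as plt

fig, axes = plt.subplots(nrows=3, ncols=4, figsize=(16, 10))

for idx, feat in enumerate(features):
    # integer division yields valid integer (row, col) indices into the axes array
    row, col = idx // 4, idx % 4
    sns.boxplot(x='Churn', y=feat, data=df, ax=axes[row, col])
    axes[row, col].set_xlabel('Churn')
    axes[row, col].set_ylabel(feat)

The .legend() call is dropped here, as sns.boxplot does not add labeled handles for it in this usage.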

Topic 2 Part 1, Box plot explanation

It looks like the word "horizontal", not "vertical", should be used in the phrase "The vertical line inside the box marks the median (50%) of the distribution" in the description of sns.boxplot picture.

Possibly, in the sentence "its length is determined by the 25th(Q1) and 75th(Q3) percentiles" the word "height" would be more suitable as well.

Confidence intervals interpretation

A 95% confidence interval does not mean that for a given realized interval there is a 95% probability that the population parameter lies within the interval (i.e., a 95% probability that the interval covers the population parameter).

But the Medium article says:
"In the end, we see that, with 95% probability, the average number of customer service calls from loyal customers lies between 1.4 and 1.49, while the churned clients called 2.06 through 2.40 times on average."
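
For illustration, a minimal sketch (with made-up numbers, not the course data) of how such a 95% t-interval for a mean is computed; the correct frequentist reading concerns the procedure, not a probability statement about one realized interval:

import numpy as np
from scipy import stats

# hypothetical per-customer counts of service calls (not the course dataset)
calls = np.array([1, 0, 2, 1, 3, 1, 0, 2, 1, 1])

mean, sem = calls.mean(), stats.sem(calls)
# 95% of intervals constructed this way would cover the true mean;
# it is not a 95% probability statement about this particular interval
low, high = stats.t.interval(0.95, len(calls) - 1, loc=mean, scale=sem)
print(f"95% CI for the mean number of calls: [{low:.2f}, {high:.2f}]")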

Yandex & MIPT, Coursera, Final project – User identification

Hello! Please clarify the expected answer format for week 2, question 2:
"Is the number of unique sites in a session normally distributed?" The form gives no clear instructions on how to phrase the answer; the variants "Нет", "No", the test statistic, and the p-value of the Shapiro-Wilk test are not accepted...
Maybe I computed it incorrectly, but there is no way to tell :)

A2 demo: incomplete information in answers form

The answers form asks: "What's the rounded difference between median values of age for smokers and non-smokers? You'll need to figure out the units of feature age in this dataset."
Logically, this should be answered in years, but it turns out that the expected answer is in months, so that should be mentioned.

The solution to question 5.11 is not stable

Even with the random_state parameters set, the best_score of the best model differs from the options given in the answers.

Confirmed by several participants who ran it.

Possibly, the specific package versions affect the computation.

I can attach an ipynb that reproduces this.

Topic 3, Some graphviz images missing

It looks like the graphviz images after code cells 9 and 13 were not rendered. Given that the graph after the 6th cell is present, it isn't a browser issue, and restoring them should not be difficult.

Topic 5 Part 1: KDE Plot used for 'Customer service calls'

The KDE plot is used for 'Customer service calls' in Topic 5 Part 1. Due to the nature of KDE, the plot is smoother than it should be for this type of discrete data: e.g., the peaks for loyal customers are lowered, and there is a tail of negative numbers of phone calls for churned customers.

I suggest switching to distplot, which can estimate the distribution with both a histogram and a kernel density estimate. An example of code is shown below.

import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = 10, 6

telecom_data = pd.read_csv('../../data/telecom_churn.csv')

# distplot overlays a histogram with a kernel density estimate,
# which suits this discrete feature better than a pure KDE plot
ax = sns.distplot(telecom_data[telecom_data['Churn'] == False]['Customer service calls'], label='Loyal')
ax = sns.distplot(telecom_data[telecom_data['Churn'] == True]['Customer service calls'], label='Churn')
ax.set(xlabel='Number of calls', ylabel='Density')
ax.legend()
plt.show()


Typo in 4.1

After the definition of linearity in the weights, there is a mathematical expression that cannot be displayed correctly, presumably due to a typo. It looks like ∀k is meant.
Best

where $\forall\ k\ $,

Topic 2: wrong direction on chart in section 4.3 t-SNE


The direction stated on the chart in the article on GitHub (https://mlcourse.ai/notebooks/blob/master/jupyter_english/topic02_visual_data_analysis/topic2_visual_data_analysis.ipynb) is wrong: "south-west".
But it is correct in the article on Medium (https://medium.com/open-machine-learning-course/open-machine-learning-course-topic-2-visual-data-analysis-in-python-846b989675cd), since a different version of the chart is used there.


It's not a critical inaccuracy, but a little embarrassing.

locally built docker image doesn't work

I've built a docker image locally using docker image build, and then tried to run it like this:

python run_docker_jupyter.py -t mlc_local

got this:

Running command
docker run -it  --rm -p 5022:22 -p 4545:4545 -v "/home/egor/private/mlcourse.ai":/notebooks -w /notebooks mlc_local jupyter
Command: jupyter
[I 12:44:17.454 NotebookApp] Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/traitlets/traitlets.py", line 528, in get
    value = obj._trait_values[self.name]
KeyError: 'allow_remote_access'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/notebook/notebookapp.py", line 869, in _default_allow_remote
    addr = ipaddress.ip_address(self.ip)
  File "/usr/lib/python3.5/ipaddress.py", line 54, in ip_address
    address)
ValueError: '' does not appear to be an IPv4 or IPv6 address

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/jupyter-notebook", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.5/dist-packages/jupyter_core/application.py", line 266, in launch_instance
    return super(JupyterApp, cls).launch_instance(argv=argv, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/traitlets/config/application.py", line 657, in launch_instance
    app.initialize(argv)
  File "<decorator-gen-7>", line 2, in initialize
  File "/usr/local/lib/python3.5/dist-packages/traitlets/config/application.py", line 87, in catch_config_error
    return method(app, *args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/notebook/notebookapp.py", line 1629, in initialize
    self.init_webapp()
  File "/usr/local/lib/python3.5/dist-packages/notebook/notebookapp.py", line 1379, in init_webapp
    self.jinja_environment_options,
  File "/usr/local/lib/python3.5/dist-packages/notebook/notebookapp.py", line 158, in __init__
    default_url, settings_overrides, jinja_env_options)
  File "/usr/local/lib/python3.5/dist-packages/notebook/notebookapp.py", line 251, in init_settings
    allow_remote_access=jupyter_app.allow_remote_access,
  File "/usr/local/lib/python3.5/dist-packages/traitlets/traitlets.py", line 556, in __get__
    return self.get(obj, cls)
  File "/usr/local/lib/python3.5/dist-packages/traitlets/traitlets.py", line 535, in get
    value = self._validate(obj, dynamic_default())
  File "/usr/local/lib/python3.5/dist-packages/notebook/notebookapp.py", line 872, in _default_allow_remote
    for info in socket.getaddrinfo(self.ip, self.port, 0, socket.SOCK_STREAM):
  File "/usr/lib/python3.5/socket.py", line 732, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -5] No address associated with hostname

Undefined names 'dprev_h' and 'dprev_c'

flake8 testing of https://github.com/Yorko/mlcourse_open

$ flake8 . --count --select=E901,E999,F821,F822,F823 --show-source --statistics

./class_cs231n/assignment3/cs231n/rnn_layers.py:264:16: F821 undefined name 'dprev_h'
    return dx, dprev_h, dprev_c, dWx, dWh, db
               ^
./class_cs231n/assignment3/cs231n/rnn_layers.py:264:25: F821 undefined name 'dprev_c'
    return dx, dprev_h, dprev_c, dWx, dWh, db
                        ^

Docker Image

The Docker image probably needs to be updated to the latest package versions.

Topic 3. Decision tree regressor, MSE

In the DecisionTreeRegressor example, the MSE in the plot title is computed incorrectly:
plt.title("Decision tree regressor, MSE = %.2f" % np.sum((y_test - reg_tree_pred) ** 2))
It should also be divided by the number of observations; I suggest fixing it like this:
plt.title("Decision tree regressor, MSE = %.4f" % (np.sum((y_test - reg_tree_pred) ** 2) / n_test))

File:
https://github.com/Yorko/mlcourse.ai/blob/master/jupyter_english/topic03_decision_trees_kNN/topic3_decision_trees_kNN.ipynb
And likewise in the Russian version:
https://github.com/Yorko/mlcourse.ai/blob/master/jupyter_russian/topic03_decision_trees_knn/topic3_trees_knn.ipynb

Topic 7 typo

Agglomerative clustering
# linkage — is an implementation if agglomerative algorithm
It should be of instead of if.

Assignment 7
For classification, use the support vector machine – class sklearn.svm.LinearSVC. In this course, we did study this algorithm separately, but it is well-known and you can read about it, for example here.
It seems that it should be didn't instead of did.

Topic 6

In the Kaggle kernel, in cell In [38], it is said that the performance got worse, when it actually got better.
https://www.kaggle.com/kashnitsky/topic-6-feature-engineering-and-feature-selection
In the habr article, however, the numbers really did get worse.

Also, in assignment 6, "Train a LASSO model", it is not said whether it should be trained on the scaled X or the original X. As I understand it, it should be the scaled X, in order to perform the next task of finding the least important features (see the sketch below).
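
A minimal sketch of the scaled-X approach described above (the variable names X and y and the alpha value are illustrative assumptions; scikit-learn only):

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# scale features so that Lasso coefficients are comparable across features
X_scaled = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=0.01, random_state=17)
lasso.fit(X_scaled, y)

# features with coefficients closest to zero are the least important ones
least_important = np.argsort(np.abs(lasso.coef_))[:3]
print(least_important)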

A1 (demo). Pandas and UCI adult dataset - small error

Small error in
https://github.com/Yorko/mlcourse.ai/blob/master/jupyter_english/assignments_demo/assignment01_pandas_uci_adult_solution.ipynb

Among whom the proportion of those who earn a lot (>50K) is more: among married or single men (marital-status feature)? Consider married those who have a marital-status starting with Married (Married-civ-spouse, Married-spouse-absent or Married-AF-spouse), the rest are considered bachelors.

One of the steps of the proposed solution is
data[(data['sex'] == 'Male') & (data['marital-status'].isin(['Never-married', 'Separated', 'Divorced']))]['salary'].value_counts()

Why aren't adults with marital-status == 'Widowed' considered? So, the correct step would be the following:
data[(data['sex'] == 'Male') & (data['marital-status'].isin(['Never-married', 'Separated', 'Divorced', 'Widowed']))]['salary'].value_counts()

Docker image - seaborn

It looks like the Docker image does not include seaborn 0.9.0; exercise 2 requires catplot.

assignment1_pandas_olympic Q10

https://github.com/Yorko/mlcourse.ai/blob/master/jupyter_english/assignments_fall2018/assignment1_pandas_olympic.ipynb
There was a question

  1. What is the absolute difference between the number of unique sports at the 1995 Olympics and 2016 Olympics?

But there were no official Olympic Games in 1995. The nearest Games were in 1996: https://en.wikipedia.org/wiki/1996_Summer_Olympics

If you analyze the results of those Games in 1996, the resulting option is not among the presented answers (43 new Event in 33 Sports in 2006).
