Knowledge memos/citations on machine learning, based on the Coursera class Machine Learning by Andrew Ng.
Arthur Samuel's older and informal definition:
"the field of study that gives computers the ability to learn without being explicitly programmed."
Tom Mitchell's more modern definition:
"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."
- Example: playing checkers.
- E = the experience of playing many games of checkers
- T = the task of playing checkers.
- P = the probability that the program will win the next game.
In general, any machine learning problem can be assigned to one of two broad classifications: supervised learning and unsupervised learning.
***
- "right answer" is given.
- In supervised learning, we are given a data set and already know what our correct output should look like, having the idea that there is a relationship between the input and the output.
- Regression
- We try to predict results within a continuous output, meaning that we map input variables to some continuous function.
- ex. Given a picture of a person, we have to predict their age on the basis of the given picture.
- Classification
- We instead try to predict results in a discrete output; in other words, we map input variables into discrete categories.
- ex. Given a patient with a tumor, we have to predict whether the tumor is malignant or benign.
- In unsupervised learning, we approach problems with little or no idea what our results should look like.
- We can derive structure from data where we don't necessarily know the effect of the variables.
- We can derive this structure by clustering the data based on relationships among the variables in the data.
- With unsupervised learning there is no feedback based on the prediction results.
- Clustering
- ex. Take a collection of 1,000,000 different genes, and find a way to automatically group these genes into groups that are somehow similar or related by different variables, such as lifespan, location, roles, and so on.
- Non-clustering
- ex. The "Cocktail Party Algorithm" allows you to find structure in a chaotic environment (i.e. identifying individual voices and music from a mesh of sounds at a cocktail party).
***
- The cost function is the summation of the squared differences between the predicted values and the actual values:
- J(θ0, θ1) = (1/(2m)) * Σ(i=1..m) (h(x(i)) − y(i))^2
- The goal when solving a machine learning problem is, in other words, to minimize the cost function.
- It is also called the "Squared error function" or "Mean squared error".
- 1/m with the summation: averages the squared errors.
- 1/2: keeps the numbers smaller, and conveniently cancels the factor of 2 that appears when differentiating the square.
- If all data points (x, y) lie exactly on the hypothesis, the cost function = 0.
- ex. h(x) = θ0 + θ1 * x
(figures: 3D surface plot | contour plot of the cost function J(θ0, θ1))
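As a concrete illustration, here is a minimal NumPy sketch of the cost function above; the name compute_cost and the convention that X carries a leading column of ones (x0 = 1) are illustrative choices, not from the course.

```python
import numpy as np

def compute_cost(X, y, theta):
    """Squared error cost: J(theta) = 1/(2m) * sum((h(x) - y)^2)."""
    m = len(y)                 # number of training examples
    errors = X @ theta - y     # h(x(i)) - y(i) for every example at once
    return (errors @ errors) / (2 * m)
```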
***
- Start with some parameters θ.
- Keep changing θ to reduce J(θ) until we end up at a minimum:
- repeat { θj := θj − α * ∂J(θ)/∂θj } (update all θj simultaneously)
- "Batch" Gradient Descent: each step of gradient descent uses all the training examples.
- As we approach a local minimum, gradient descent automatically takes smaller steps, so there is no need to decrease α over time.
(figures: correct | incorrect)
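A minimal sketch of batch gradient descent for linear regression, following the update rule above and reusing the compute_cost conventions (X has a leading column of ones); the function name is an assumption.

```python
import numpy as np

def gradient_descent(X, y, theta, alpha, num_iters):
    """Batch gradient descent: every step uses all m training examples."""
    m = len(y)
    for _ in range(num_iters):
        gradient = X.T @ (X @ theta - y) / m   # dJ/dtheta_j for all j at once
        theta = theta - alpha * gradient       # simultaneous update of all theta_j
    return theta
```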
- Gradient descent can get stuck at a local optimum.
- Make each of the input values roughly the same range so that gradient descent converges efficiently and quickly.
- ideal range: −1 ≤ x(i) ≤ 1 or −0.5 ≤ x(i) ≤ 0.5
- Feature Scaling + Mean Normalization: x(i) := (x(i) − μ(i)) / s(i)
- s(i) in the formula above:
- divide the input values by the range (i.e. max − min) of the input variable.
- μ(i) in the formula above:
- subtract the average value of the input variable from each of its values.
- after processing, the average of the values is 0.
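A hedged NumPy sketch of mean normalization plus feature scaling as described above; feature_normalize is an assumed name. Apply it before adding the x0 = 1 bias column, since a constant column has zero range.

```python
import numpy as np

def feature_normalize(X):
    """x := (x - mu) / s, where s is the range (max - min) of each feature."""
    mu = X.mean(axis=0)                  # per-feature average
    s = X.max(axis=0) - X.min(axis=0)    # per-feature range
    return (X - mu) / s, mu, s           # keep mu and s to normalize new inputs
```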
- "α" in the gradient descent formula.
- If α is too small: slow convergence.
- If α is too large: may not decrease on every iteration and thus may not converge.
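A tiny self-contained demo of this behavior on the toy cost J(θ) = θ² (gradient 2θ); the α values are arbitrary picks, not from the course.

```python
# toy cost J(theta) = theta^2, gradient dJ/dtheta = 2 * theta
for alpha in (0.1, 1.1):        # small vs. too-large learning rate
    theta = 1.0
    for _ in range(5):
        theta -= alpha * 2 * theta
    print(alpha, theta)         # 0.1 shrinks toward 0; 1.1 oscillates and grows
```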
- Normal equation: minimize J by explicitly taking its derivatives with respect to the θj's and setting them to zero: θ = (Xᵀ X)⁻¹ Xᵀ y.
- This allows us to find the optimum θ without iteration.
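A one-line NumPy sketch of the normal equation; using pinv (rather than inv) is an implementation choice so the code also tolerates a non-invertible XᵀX.

```python
import numpy as np

def normal_equation(X, y):
    """theta = (X^T X)^(-1) X^T y -- closed form, no iterations, no alpha."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y
```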
***
name | logic |
---|---|
Model | h(x) = θ0 + θ1 * x |
Cost Function | J(θ0, θ1) = (1/(2m)) * Σ (h(x(i)) − y(i))^2 |
Algorithm | repeat { θj := θj − α * ∂J(θ0, θ1)/∂θj } (simultaneous update for j = 0, 1) |
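To tie the model, cost function, and algorithm together, here is a hypothetical end-to-end run reusing the compute_cost and gradient_descent sketches above; the data is made up so that y = 1 + 2x exactly.

```python
import numpy as np

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])         # x0 = 1 bias column, then x1
y = np.array([1.0, 3.0, 5.0])      # exactly y = 1 + 2 * x1

theta = gradient_descent(X, y, np.zeros(2), alpha=0.3, num_iters=500)
print(theta)                       # approaches [1.0, 2.0]
print(compute_cost(X, y, theta))   # approaches 0, since the data is on a line
```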
- Every formula below is equivalent.
description | formula |
---|---|
Expanded | h(x) = θ0 * x0 + θ1 * x1 + ... + θn * xn |
x: column (vector) | h(x) = θᵀ * x |
X: rows (one example per row) | h = X * θ |
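A small check, with made-up numbers, that the three forms above compute the same hypothesis.

```python
import numpy as np

theta = np.array([1.0, 2.0, 3.0])
x = np.array([1.0, 5.0, 6.0])      # x0 = 1, then the features

h_expanded = sum(theta[j] * x[j] for j in range(3))  # theta0*x0 + theta1*x1 + theta2*x2
h_vector = theta @ x                                 # theta^T x
assert np.isclose(h_expanded, h_vector)              # both give 29.0

X = np.array([[1.0, 5.0, 6.0],
              [1.0, 2.0, 3.0]])    # each row = one training example
print(X @ theta)                   # predictions for all rows at once: [29. 14.]
```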
***
- Logistic Function / Sigmoid Function: h(x) = g(θᵀ x), where g(z) = 1 / (1 + e^(−z)), so 0 < h(x) < 1.
- In order to get our discrete 0 or 1 classification, we can translate the output of the hypothesis function as follows: predict y = 1 when h(x) ≥ 0.5, and y = 0 when h(x) < 0.5.
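A minimal sketch of the sigmoid and the 0/1 translation described above; predict and the hard-coded 0.5 threshold are illustrative.

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z)); output is always strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, theta):
    """Translate h(x) = g(theta^T x) into a discrete 0 or 1 classification."""
    return (sigmoid(X @ theta) >= 0.5).astype(int)
```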
***
Label | Definition |
---|---|
x | input variable, feature |
y | output/target variable |
m | number of training examples |
h | hypothesis: logic (relation) between x and y |
θ | parameter in h |
J | cost function |
α | learning rate |
ℝ | set of real numbers |