
ANI

Deep Learning Specialization (coursera.org): https://www.coursera.org/specializations/deep-learning

Course 1 Neural Networks and Deep Learning

Week 1

What is a Neural Network?

  • Given a structured data set, a sequence of calculations produces a final output.

  • The # of training examples you give a neural network is like the number of practice problems you give a kid preparing for a math test.

  • The # of training examples is denoted m

  • m training examples: {(x1, y1), (x2, y2)... (xm, ym)} where x is a single instance of a problem (like one math problem) and y is the answer.

  • X = [x1 x2 ... xm] where X is an n_x by m matrix with one training example per column (the x in n_x refers to the number of descriptions in the problem. Ex. if the math problem says 5+5, n_x would be 3 because 5, +, 5 are used.)

Week 2

Logistic Regression (LR)

  • Given x (where x can be something like a picture), calculate yhat.
  • yhat is the probability that something is true given x (e.g. that there is a cat in the picture)
  • yhat = P(y = 1|x)
  • Parameters: w (an (n_x, 1) dimension vector like x) and b (a real number)
  • Outputs yhat = sigmoid(w_transpose * x + b) where z is w_transpose * x + b
  • sigmoid(z) = 1 / (1 + e^(-z)) (see the sketch below)
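A minimal numpy sketch of the computation described in the bullets above; the placeholder values for n_x, x, w, and b are illustrative, not from the course:

import numpy as np

def sigmoid(z):
    # sigmoid(z) = 1 / (1 + e^(-z))
    return 1 / (1 + np.exp(-z))

# yhat = sigmoid(w.T x + b) for a single example x of shape (n_x, 1)
n_x = 3
x = np.random.randn(n_x, 1)
w = np.zeros((n_x, 1))   # parameters start at some initial value
b = 0.0
z = np.dot(w.T, x) + b
yhat = sigmoid(z)        # probability that y = 1 given x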

alt text

LR Cost Function

  • Given m training samples, want yhat(i) to be equivalent to y(i) where yhat(i) is specific to that single training sample
  • Ex. Given 5 math problems, yhat(3) would be the probability that the answer to question 3 is right
  • Loss (error) function: the squared error L(yhat, y) = 1/2(yhat - y)^2 is one option, but it is not used because it makes the optimization non-convex; logistic regression instead uses L(yhat, y) = -(ylog(yhat) + (1-y)log(1-yhat))
  • Given yhat and y, you can find the amount of error you have
  • Knowing error tells you how much you can trust your answer
Taken from the Week 2 optional video, but relevant to the LR Cost function:
  • If y = 1: p(y|x) = yhat
  • If y = 0: p(y|x) = 1-yhat
  • Generalizing these two equations: p(y|x) = yhat^y(1-yhat)^(1-y)
  • Intuitive reasoning behind the equation: plugging in y = 1 gives p(y|x) = yhat, and plugging in y = 0 gives p(y|x) = 1-yhat

alt text

  • Taking the log of both sides:
log(p(y|x)) = log(yhat^y * (1-yhat)^(1-y))
            = ylog(yhat) + (1-y)log(1-yhat)
            = -L(yhat, y)
  • Note: log with base e, really just natural log
  • The negative sign is there because maximizing log(p(y|x)) is the same as minimizing the loss function

Total probability of all the predictions made on a training set:

alt text

  • Note: the log of a product = the sum of the logs (Ex. log(5x10) = log(5) + log(10))

  • yhat is in the range (0, 1) because it represents a probability

  • The Cost function is the average of all the calculated Loss values: J(w, b) = (1/m) * sum over i of L(yhat(i), y(i))

Gradient Descent

  • Want to find w & b that minimize Cost function

alt text

alt text

  • Partial differentiation is used when differentiating a function of more than one variable

Computational Graph

  • J(a, b, c) = 3(a + bc) where J is a function with three parameters aka variables and 3(a + bc) is an example function
    Substituting u = bc, v = a + u, and J = 3v

alt text

Where the derivative of the function J with respect to v (denoted as dJ/dv) = 3
Other derivatives include:
dJ/da = (dJ/dv)(dv/da) = 3 * 1 = 3      (since dv/da = 1)
dJ/du = (dJ/dv)(dv/du) = 3 * 1 = 3
dJ/db = (dJ/du)(du/db) = 3 * c = 3 * 2 = 6
dJ/dc = (dJ/du)(du/dc) = 3 * b = 3 * 3 = 9
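A quick finite-difference check of these chain-rule values; b = 3 and c = 2 are implied by the numbers above, while a = 5 is an arbitrary choice since dJ/da does not depend on a:

# Finite-difference check of the chain-rule derivatives above.
def J(a, b, c):
    u = b * c          # u = bc
    v = a + u          # v = a + u
    return 3 * v       # J = 3v

a, b, c, eps = 5.0, 3.0, 2.0, 1e-6
print((J(a + eps, b, c) - J(a, b, c)) / eps)   # ~3  -> dJ/da
print((J(a, b + eps, c) - J(a, b, c)) / eps)   # ~6  -> dJ/db
print((J(a, b, c + eps) - J(a, b, c)) / eps)   # ~9  -> dJ/dc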

Forward and Backward Propagation

  • z = w.T * x + b where .T refers to the transpose of a matrix
  • yhat = a = sig(z)
  • L(a, y) = -(ylog(yhat) + (1 - y)log(1 - yhat))

alt text

  • da/dz is the derivative of the sigmoid function with respect to z, which equals a(1 - a); combining this with dL/da gives dL/dz = a - y
  • dz/dw1 is the corresponding x1 (and similarly for the other weights), while dz/db is always one (therefore dL/db = dL/dz)

LR on m

import numpy as np

# One pass over m training examples (non-vectorized).
# dw1, dw2, db stand for dL/dw1, dL/dw2, dL/db accumulated over the examples.
J = 0
dw1 = 0
dw2 = 0
db = 0

for i in range(m):
    z[i] = np.dot(w.T, x[i]) + b
    a[i] = 1 / (1 + np.exp(-z[i]))                      # sigmoid(z)
    J += -(y[i] * np.log(a[i]) + (1 - y[i]) * np.log(1 - a[i]))
    dz = a[i] - y[i]                                    # dL/dz for example i
    dw1 += x1[i] * dz
    dw2 += x2[i] * dz
    db += dz

J = J / m
dw1 = dw1 / m
dw2 = dw2 / m
db = db / m
Side Note: the softmax function normalizes a matrix of scores so that each row becomes a probability distribution (values in (0, 1) that sum to 1); see the sketch below
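A small numpy sketch of softmax in that spirit; the row-wise normalization (axis = 1) is an assumption about how it is applied to a matrix:

import numpy as np

def softmax(x):
    # Subtract the row max for numerical stability, exponentiate, then
    # divide so that each row sums to 1.
    x_exp = np.exp(x - np.max(x, axis=1, keepdims=True))
    return x_exp / np.sum(x_exp, axis=1, keepdims=True)

scores = np.array([[9.0, 2.0, 5.0],
                   [7.0, 5.0, 0.0]])
print(softmax(scores))   # each row is now a probability distribution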

Vectorization

  • Vectorization is the removal of explicit loops to increase the efficiency of an algorithm
Vectorization to find z
  • z = w.T * x + b

Non-vectorized python code:

z = 0
for i in range (n_x):
   z += w[i] * x[i]

z += b

Vectorized python code:

z = np.dot(w.T, X) + b
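A rough timing comparison of the two approaches above, in the spirit of the course's vectorization demo; the array size is arbitrary:

import time
import numpy as np

n = 1_000_000
w = np.random.randn(n)
x = np.random.randn(n)

start = time.time()
z_loop = 0.0
for i in range(n):          # explicit loop
    z_loop += w[i] * x[i]
loop_time = time.time() - start

start = time.time()
z_vec = np.dot(w, x)        # vectorized dot product
vec_time = time.time() - start

print(loop_time, vec_time)  # the vectorized version is typically much faster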
Vectorizing the LR Gradient Computation
  • Note: the derivative of the loss with respect to a variable is commonly abbreviated as just "d" plus the variable name
    Ex: dL/dz is commonly written as dz
  • Due to a lack of clarity, this convention will not be followed
    dL/dz[1] = a[1] - y[1]
    dL/dZ = [dL/dz[1] dL/dz[2] ... dL/dz[m]]
    A = [a[1] ... a[m]]
    Y = [y[1] ... y[m]]
    dL/dZ = A - Y = [a[1] - y[1] ... a[m] - y[m]]

Non-vectorized dL/db:
db = 0

for i in range(m):
   db += dz[i]      # dz[i] here stands for dL/dz[i] = a[i] - y[i]

db /= m

Completely Vectorized:

Z = np.dot(w.T, X) + b
A = 1 / (1 + np.exp(-Z))          # sigmoid(Z)
dZ = A - Y                        # dL/dZ
dw = 1/m * np.dot(X, dZ.T)        # dJ/dw
db = 1/m * np.sum(dZ)             # dJ/db

Broadcasting

  • When using axis = 0, the operation runs vertically, down each column of the matrix
  • When using axis = 1, the operation runs horizontally, across each row of the matrix
np.dot() #matrix multiplication where you take row * column
matrix * matrix #element-wise multiplication
  • Broadcasting works on element-wise multiplication

Ex.

1. [1 2 3 4] + 100
   In the computer, it becomes:
   [1 2 3 4] + [100 100 100 100]
2. [[1 2 3] [4 5 6]] + [100 200 300]
   Becomes
   [[1 2 3] [4 5 6]] + [[100 200 300] [100 200 300]]
General Principle
An m x n matrix +, -, *, or / with a:
   1 x n matrix
   m x 1 matrix
# the 1 x n or m x 1 matrix is copied (broadcast) so that it becomes an m x n matrix
  • NB: Do not use a rank 1 array (shape (n,)); use an explicit (n, 1) or (1, n) shape instead
assert matrix_name.shape == (m, n)
# raises an error if the matrix is not m x n
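A short sketch of the rank-1 pitfall and the broadcasting rule above; the array contents are arbitrary:

import numpy as np

a = np.random.randn(5)        # rank 1 array, shape (5,): avoid this
a = a.reshape(5, 1)           # explicit column vector, shape (5, 1)
assert a.shape == (5, 1)      # raises an error if the shape is wrong

# Broadcasting: an (m, n) matrix combined with a (1, n) row is applied row by row
M = np.array([[1, 2, 3],
              [4, 5, 6]])
row = np.array([[100, 200, 300]])
print(M + row)                # [[101 202 303], [104 205 306]]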

Outline of Steps

  1. Find the number of training samples, the number of test samples, and how many features there are. Ex. for picture recognition an image is usually 64x64 pixels with 3 color channels, so there are 64x64x3 = 12288 features per image
  2. Combine all of the pixels of each image into one column vector
  3. Flatten and transpose the image arrays so that each column of the resulting matrix is one sample; passing -1 to reshape lets numpy infer the flattened dimension
    Ex. train_set_x_flatten = train_set_x_orig.reshape(train_set_x_orig.shape[0], -1).T test_set_x_flatten = test_set_x_orig.reshape(test_set_x_orig.shape[0], -1).T
    • This produces a single matrix with all the training samples and another one for all the test samples
  4. Normalize all of the pixels by dividing by 255 (the maximum value a pixel channel can take)

alt text

Where W is a column vector (with dimensions n by 1 where n is the number of pixels) of numbers that minimizes the cost function value
& b is a scalar that also minimizes the cost function value by giving the neural network an extra degree of freedom

  1. Build a function to take the sigmoid of z
  2. Find Yhat (aka A) by taking sigmoid(np.dot(w.T, X) + b)
  3. Calculate the loss: -(ylog(a) + (1-y)log(1-a))
  4. Calculate the cost: 1/m * sum(loss)
  5. Find dJ/dw: 1/m * dot product between X and (A - Y).T
  6. Find dJ/db: 1/m * sum(A - Y)
  7. Keep on iterating until you find the value of w & b that produce the smallest cost function value
    Use w = w - learning_rate * dJ/dw
    & b = b - learning_rate * dJ/db (the learning rate is a scalar, so ordinary multiplication is used rather than np.dot; see the full sketch after this list)
  8. Once you find the optimal w & b value, calculate Yhat aka A again using the optimal w & b
  9. If the value of A > .5, then there is a cat
    Else, there is not a cat
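A minimal numpy sketch stringing the numbered steps above together; the function names (model, predict) and hyperparameter values are illustrative, not the course's official code:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def model(X, Y, num_iterations=2000, learning_rate=0.005):
    # X has shape (n_x, m): one flattened, normalized image per column.
    # Y has shape (1, m): 1 if the image is a cat, 0 otherwise.
    n_x, m = X.shape
    w = np.zeros((n_x, 1))
    b = 0.0
    for _ in range(num_iterations):
        A = sigmoid(np.dot(w.T, X) + b)                               # step 2: yhat
        cost = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m   # steps 3-4
        dw = np.dot(X, (A - Y).T) / m                                 # step 5: dJ/dw
        db = np.sum(A - Y) / m                                        # step 6: dJ/db
        w = w - learning_rate * dw                                    # step 7
        b = b - learning_rate * db
    return w, b, cost

def predict(w, b, X):
    A = sigmoid(np.dot(w.T, X) + b)                                   # step 8
    return (A > 0.5).astype(int)                                      # step 9: cat or not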

Week 3

alt text

alt text

alt text

alt text

alt text

  • Where ReLU(Z) = max(0, Z) and Leaky ReLU(Z) = max(0.01 * Z, Z)

  • NB. tanh(x) = (e^x - e^(-x))/(e^x + e^(-x))

  • Activation functions are used because without them the network would just be a composition of linear functions, which would make all of the hidden layers useless

  • The ReLU function is more useful than the sigmoid function because its derivative stays large when z is positive, whereas the sigmoid's slope becomes very small at its extremes, which slows gradient descent

  • The sigmoid function is still useful for the final activation A because the output should be a probability between zero and one (see the sketch below)
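A small numpy sketch of the activation functions listed above, with their standard derivatives noted in comments:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))          # derivative: sigmoid(z) * (1 - sigmoid(z))

def tanh(z):
    return np.tanh(z)                    # (e^z - e^-z)/(e^z + e^-z); derivative: 1 - tanh(z)**2

def relu(z):
    return np.maximum(0, z)              # derivative: 1 if z > 0 else 0

def leaky_relu(z):
    return np.maximum(0.01 * z, z)       # derivative: 1 if z > 0 else 0.01

z = np.linspace(-3, 3, 7)
print(relu(z), leaky_relu(z))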

alt text

  • Note: in the figure, derivatives are abbreviated as "d" plus the variable name rather than the full dL/d-variable notation
  • Proof for dL/dZ[1] (a worked sketch follows after this list):
    dL/dZ[1] = (dL/dA[1])(dA[1]/dZ[1]) = (dL/dZ[2])(dZ[2]/dA[1])(dA[1]/dZ[1]) = (dL/dZ[2]) W[2] (dA[1]/dZ[1]), which in vectorized form is W[2].T dZ[2] * g[1]'(Z[1]) (an element-wise product, with dZ[2] denoting dL/dZ[2])
  • There isn't a 1/m for the dL/dZ terms because 1/m is the averaging constant
  • 1/m could be included in the dL/dZ equation, but it would add a step for the computer to calculate
  • Instead, it is just added to dW and dB
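A compact numpy sketch of one forward/backward pass for a single-hidden-layer network, following the equations above; the layer sizes, random data, and tanh hidden activation are assumptions for illustration:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

n_x, n_1, m = 4, 3, 5                       # input size, hidden units, examples
X = np.random.randn(n_x, m)
Y = (np.random.rand(1, m) > 0.5).astype(float)
W1, b1 = np.random.randn(n_1, n_x) * 0.01, np.zeros((n_1, 1))
W2, b2 = np.random.randn(1, n_1) * 0.01, np.zeros((1, 1))

# Forward propagation
Z1 = np.dot(W1, X) + b1
A1 = np.tanh(Z1)                            # hidden activation g[1]
Z2 = np.dot(W2, A1) + b2
A2 = sigmoid(Z2)                            # output activation

# Backward propagation
dZ2 = A2 - Y                                # dL/dZ2
dW2 = np.dot(dZ2, A1.T) / m                 # 1/m applied here, not on dZ
db2 = np.sum(dZ2, axis=1, keepdims=True) / m
dZ1 = np.dot(W2.T, dZ2) * (1 - A1 ** 2)     # W2.T dZ2 * g[1]'(Z1), element-wise
dW1 = np.dot(dZ1, X.T) / m
db1 = np.sum(dZ1, axis=1, keepdims=True) / m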

alt text

alt text

alt text

alt text

alt text

Week 4

  • Continued analysis of week 3 and application into Deep Neural Networks

  • Parameters include: W (weight) & b (bias)

  • Hyperparameters are parameters that affect W & b:

  1. Learning Rate
  2. # of iterations (how many times W := W - learning_rate * dW is applied)
  3. # of hidden layers
  4. # of hidden units
  5. activation function used

alt text

alt text

alt text

Building a Deep Neural Network

Steps to build neural network with multiple hidden layers

  1. Initialize parameters for an L-layer neural network
  2. Compute the activations A from Z through forward propagation, using ReLU for the hidden layers and sigmoid for the output layer
  3. Compute the loss and cost function
  4. Implement backward propagation
  5. Calculate dW and db and implement gradient descent
  6. Use the optimized parameters to calculate the output (a skeletal sketch of these steps follows below)
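A skeletal numpy sketch of the first three steps, assuming an illustrative layer_dims list and helper names (init_params, forward, cost) that are not the course's exact ones:

import numpy as np

def relu(Z):
    return np.maximum(0, Z)

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def init_params(layer_dims):
    # Step 1: small random weights, zero biases, for layers 1..L
    params = {}
    for l in range(1, len(layer_dims)):
        params["W" + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * 0.01
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return params

def forward(X, params, L):
    # Step 2: ReLU for hidden layers, sigmoid for the output layer
    A = X
    cache = {"A0": X}
    for l in range(1, L + 1):
        Z = np.dot(params["W" + str(l)], A) + params["b" + str(l)]
        A = sigmoid(Z) if l == L else relu(Z)
        cache["Z" + str(l)], cache["A" + str(l)] = Z, A
    return A, cache

def cost(AL, Y):
    # Step 3: cross-entropy cost
    m = Y.shape[1]
    return -np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m

layer_dims = [12288, 20, 7, 1]     # example architecture for a 3-layer network
params = init_params(layer_dims)

Steps 4 through 6 (backward propagation and the update W := W - learning_rate * dW for each layer) follow the same layer-by-layer pattern as the Week 3 sketch.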

Course 2 Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization

Week 1

alt text

  • High bias means underfitting while high variance means overfitting
  • If you have high bias:
  1. Use a bigger network
  2. Train for a longer period of time
  • If you have high variance:
  1. Get more training data
  2. Regularization
  • Instead of going from a train set straight to a test set, there is now a development set (dev set) where the programmer can compare two different algorithms
  • Modern big data training set means allocating 98% of examples to train set, 1% to dev set, and 1% to test set
  • Want dev and test set to come from the same distribution (cat photos online vs. cat videos taken on cell phone)

alt text

  • Where lambda is the regularization parameter
  • L2 regularization is used far more frequently
  • L1 regularization just makes w sparse, which means that there will be more 0s in the w vector, taking up less memory
  • Regularizing b can be omitted because w has so many more parameters that it has a much bigger effect
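A small sketch of the L2-regularized cost described above; lambd is used in place of lambda because lambda is a reserved word in Python:

import numpy as np

def l2_regularized_cost(cross_entropy_cost, weight_matrices, lambd, m):
    # J_regularized = cross-entropy cost + (lambda / (2m)) * sum of squared weights
    l2_term = (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weight_matrices)
    return cross_entropy_cost + l2_term

# Example: two weight matrices, lambda = 0.7, m = 100 examples
W1, W2 = np.random.randn(3, 4), np.random.randn(1, 3)
print(l2_regularized_cost(0.5, [W1, W2], lambd=0.7, m=100))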

alt text

  • Dropout regularization randomly shuts off nodes in different layers based on a given probability
  • Ex. keep_prob = .5: then in layers 1 through L, roughly half of the nodes are shut off

Implementation of Inverted Dropout:

alt text
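A minimal sketch of the inverted-dropout step the figure above shows, applied to an assumed layer-3 activation a3 with an assumed keep_prob of 0.8:

import numpy as np

keep_prob = 0.8
a3 = np.random.randn(5, 10)                                  # activations of layer 3, shape (units, m)

d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob    # random dropout mask
a3 = a3 * d3                                                 # shut off ~20% of the nodes
a3 = a3 / keep_prob                                          # invert: keep the expected value of z the same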

  • Dropout is used because it keeps nodes in a given layer from depending on any single feature, forcing the weights to spread out. A3 is divided by keep_prob to keep the expected value of z the same
  • Dropout does make it harder to determine J
  • Early stopping is another possible solution, where training (the updating of W) is stopped partway through so that W ends up neither too small nor too large
  • Do not use dropout at test time because you do not want the output to be random; thanks to the 1/keep_prob scaling applied during training, no extra scaling is needed at test time

Normalizing Inputs

alt text

  • Normalizing the inputs (subtracting the mean, then rescaling so each feature has unit variance) makes the cost function more symmetric, allowing faster gradient descent; see the sketch below
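A short sketch of input normalization as described above; note that the training-set mean and variance are reused for the test set:

import numpy as np

X_train = np.random.randn(3, 100) * 5 + 2        # shape (n_x, m), arbitrary data
mu = np.mean(X_train, axis=1, keepdims=True)     # per-feature mean
sigma2 = np.var(X_train, axis=1, keepdims=True)  # per-feature variance

X_train_norm = (X_train - mu) / np.sqrt(sigma2)

# Use the *training* mu and sigma2 for the test set as well
X_test = np.random.randn(3, 20) * 5 + 2
X_test_norm = (X_test - mu) / np.sqrt(sigma2)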

alt text

Vanishing/Exploding Gradients

alt text

  • In a very deep network, weights even slightly larger than 1 make the activations (and yhat) blow up exponentially with depth, while weights slightly smaller than 1 make them shrink toward zero, so gradients explode or vanish

alt text

  • Where Xavier initialization (scaling the random initial weights by sqrt(1/n[l-1])) helps reduce vanishing/exploding gradients; see the sketch below
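A small sketch of the initialization scaling mentioned above; sqrt(1/n) is the Xavier variant (often paired with tanh) and sqrt(2/n) is the He variant (often paired with ReLU):

import numpy as np

n_prev, n_curr = 128, 64                          # units in layers l-1 and l

# Xavier initialization (often used with tanh)
W_xavier = np.random.randn(n_curr, n_prev) * np.sqrt(1.0 / n_prev)

# He initialization (often used with ReLU)
W_he = np.random.randn(n_curr, n_prev) * np.sqrt(2.0 / n_prev)

b = np.zeros((n_curr, 1))                         # biases can stay at zero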

alt text

  • Where W[1], b[1], ..., W[L], b[L] are reshaped into one big vector, theta, and dW[1], db[1], ..., dW[L], db[L] are reshaped into another big vector, dtheta (used for gradient checking)
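A compact sketch of the gradient check this describes, comparing a two-sided numerical derivative of J(theta) with dtheta; the quadratic cost here is just a stand-in for the network's cost:

import numpy as np

def grad_check(J, theta, dtheta, eps=1e-7):
    # Two-sided difference approximation of dJ/dtheta, one component at a time
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps
        minus[i] -= eps
        approx[i] = (J(plus) - J(minus)) / (2 * eps)
    # Relative difference; a value around 1e-7 or smaller suggests backprop is correct
    return np.linalg.norm(approx - dtheta) / (np.linalg.norm(approx) + np.linalg.norm(dtheta))

theta = np.array([1.0, -2.0, 0.5])
J = lambda t: np.sum(t ** 2)          # stand-in cost
dtheta = 2 * theta                    # its analytic gradient
print(grad_check(J, theta, dtheta))   # should be very small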
