Shallow Neural Networks
Build a neural network with one hidden layer, using forward propagation and backpropagation.
Learning Objectives
- Describe hidden units and hidden layers
- Use units with a non-linear activation function, such as tanh
- Implement forward and backward propagation
- Apply random initialization to your neural network
- Increase fluency in Deep Learning notations and Neural Network Representations
- Implement a 2-class classification neural network with a single hidden layer
- Compute the cross entropy loss
Table of contents
- Shallow Neural Networks
- Neural Networks Overview
- Neural Network Representation
- Computing a Neural Network’s Output
- Vectorizing Across Multiple Examples
- Explanation for Vectorized Implementation
- Activation Functions
- Why do you need Non-Linear Activation Functions?
- Derivatives of Activation Functions
- Gradient Descent for Neural Networks
- Backpropagation Intuition (Optional)
- Random Initialization
- Heroes of Deep Learning (Optional)
Shallow Neural Networks
Neural Networks Overview
Neural Network Representation
- Input layer / Hidden Layer / Output layer
- The term hidden layer refers to the fact that in the training set, the true values for these nodes in the middle are not observed
- The input is the vector $x = a^{[0]}$, where $a$ stands for activation
Computing a Neural Network’s Output
Each unit (node) of the neural network computes the same two steps as logistic regression, a linear part followed by an activation:
- $z=w^Tx+b$
- $a=\sigma(z)$
By convention, we write $a_i^{[l]}$ and $z_i^{[l]}$, where:
- $i$ is the unit number
- $l$ is the layer number
Vectorization using vectors and matrices for the hidden layer
Vectorization using vectors and matrices for the output layer, with $x = a^{[0]}$
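In matrix form, stacking the row vectors $w_i^{[1]T}$ into $W^{[1]}$ (a reconstruction of the equations shown on those slides, in the course notation), for a single example $x = a^{[0]}$:

$$
\begin{aligned}
z^{[1]} &= W^{[1]}a^{[0]} + b^{[1]}, &\qquad a^{[1]} &= \sigma(z^{[1]}) \\
z^{[2]} &= W^{[2]}a^{[1]} + b^{[2]}, &\qquad a^{[2]} &= \sigma(z^{[2]})
\end{aligned}
$$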
Vectorizing Across Multiple Examples
We define:
- $x^{(i)}$: the $i$-th training example
- $a^{[2](i)}$: the prediction for the $i$-th example
Instead of implementing a loop over the different training examples, we build a matrix that stacks all the training vectors: each column is one example, as in the sketch below.
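A minimal NumPy sketch of this vectorized forward pass, assuming illustrative layer sizes and random inputs (all names below are illustrative, not taken from the course code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: n_x input features, n_h hidden units, n_y outputs, m examples
n_x, n_h, n_y, m = 2, 4, 1, 5

X = np.random.randn(n_x, m)               # each column is one training example x^(i)
W1, b1 = np.random.randn(n_h, n_x), np.zeros((n_h, 1))
W2, b2 = np.random.randn(n_y, n_h), np.zeros((n_y, 1))

# Forward propagation over all m examples at once, without a Python loop
Z1 = W1 @ X + b1                           # shape (n_h, m); column i is z^[1](i)
A1 = np.tanh(Z1)                           # hidden-layer activations
Z2 = W2 @ A1 + b2                          # shape (n_y, m)
A2 = sigmoid(Z2)                           # predictions; column i is a^[2](i)

print(A2.shape)                            # (1, 5)
```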
Explanation for Vectorized Implementation
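A sketch of the key observation: because each example $x^{(i)}$ sits in its own column of $X$, the matrix product applies $W^{[1]}x^{(i)} + b^{[1]}$ to every column at once (with $b^{[1]}$ broadcast to each column):

$$
X = \begin{bmatrix} \vert & & \vert \\ x^{(1)} & \cdots & x^{(m)} \\ \vert & & \vert \end{bmatrix}
\quad\Rightarrow\quad
Z^{[1]} = W^{[1]}X + b^{[1]} = \begin{bmatrix} \vert & & \vert \\ z^{[1](1)} & \cdots & z^{[1](m)} \\ \vert & & \vert \end{bmatrix}
$$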
Activation Functions
Also see: https://github.com/mauvaisetroupe/machine-learning-specialization-coursera/blob/main/c2-advanced-learning-algorithms/week2/README.md#choosing-activation-functions
In the previous examples we used the sigmoid function; the sigmoid is just one example of an activation function.
The hyperbolic tangent (tanh) almost always works better than the sigmoid, because its output is centered around zero, which makes training more efficient. One exception is the output layer of a binary classifier, where a sigmoid output between 0 and 1 is better suited to predicting a probability.
Rules of thumb for choosing an activation function (see the sketch after this list):
- never use the sigmoid activation function, except for the output layer of a binary classifier
- prefer the hyperbolic tangent over the sigmoid
- ReLU is the default choice (its one drawback is that the gradient is zero when $z < 0$)
- or try Leaky ReLU: $\max(0.01z, z)$
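A minimal NumPy sketch of these activation functions (the function names and sample values are illustrative, not from the course code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))     # output in (0, 1): reserve for binary-classification output layers

def tanh(z):
    return np.tanh(z)                   # output in (-1, 1): zero-centered, usually better than sigmoid in hidden layers

def relu(z):
    return np.maximum(0.0, z)           # default choice for hidden layers

def leaky_relu(z, slope=0.01):
    return np.maximum(slope * z, z)     # small slope for z < 0 avoids a zero gradient there

z = np.linspace(-3, 3, 7)
print(relu(z))
print(leaky_relu(z))
```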
Why do you need Non-Linear Activation Functions?
https://github.com/mauvaisetroupe/machine-learning-specialization-coursera/blob/main/c2-advanced-learning-algorithms/week2/README.md#why-do-we-need-activation-functions
Without a non-linear activation function, every layer computes a linear function of its input, so stacking layers still gives a linear function: the hidden layer adds no expressive power and the network can only learn linear relationships between $x$ and $y$.
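A one-line sketch of that argument for the two-layer network above, taking the identity ("linear") activation $g(z) = z$:

$$
a^{[2]} = W^{[2]}\left(W^{[1]}x + b^{[1]}\right) + b^{[2]} = \underbrace{W^{[2]}W^{[1]}}_{W'}x + \underbrace{W^{[2]}b^{[1]} + b^{[2]}}_{b'} = W'x + b'
$$

so no matter how many layers are stacked, the output is still a linear function of $x$.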
Derivatives of Activation Functions
Technically, the ReLU (and Leaky ReLU) derivative is not defined at zero, but in the algorithm we can simply take $g'(0) = 0$.
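For reference, the derivative formulas used in this course, written in terms of the activation $a = g(z)$ where convenient (standard results, summarized here):

$$
\begin{aligned}
\text{sigmoid:}\quad & g(z) = \frac{1}{1+e^{-z}}, & g'(z) &= g(z)\bigl(1-g(z)\bigr) = a(1-a) \\
\text{tanh:}\quad & g(z) = \tanh(z), & g'(z) &= 1-\tanh^2(z) = 1-a^2 \\
\text{ReLU:}\quad & g(z) = \max(0,z), & g'(z) &= \begin{cases} 0 & \text{if } z<0 \\ 1 & \text{if } z>0 \end{cases} \\
\text{Leaky ReLU:}\quad & g(z) = \max(0.01z,z), & g'(z) &= \begin{cases} 0.01 & \text{if } z<0 \\ 1 & \text{if } z>0 \end{cases}
\end{aligned}
$$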
Gradient Descent for Neural Networks
Backpropagation Intuition (Optional)
The gradient descent computation for logistic regression is explained here: https://github.com/mauvaisetroupe/deep-learning-specialization-coursera/tree/main/c1-neural-networks-and-deep-learning/week2#logistic-regression-gradient-descent
We have exactly the same kind of calculation for a neural network with two layers:
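For a single training example, this gives (a reconstruction of the standard two-layer backpropagation equations, assuming a sigmoid output unit with the cross-entropy loss; $\ast$ denotes the element-wise product):

$$
\begin{aligned}
dz^{[2]} &= a^{[2]} - y \\
dW^{[2]} &= dz^{[2]}\,a^{[1]T} \\
db^{[2]} &= dz^{[2]} \\
dz^{[1]} &= W^{[2]T}dz^{[2]} \ast g^{[1]\prime}(z^{[1]}) \\
dW^{[1]} &= dz^{[1]}\,x^{T} \\
db^{[1]} &= dz^{[1]}
\end{aligned}
$$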
The explanation of how all training examples are stacked into a matrix for the vectorized computation is here: https://github.com/mauvaisetroupe/deep-learning-specialization-coursera/tree/main/c1-neural-networks-and-deep-learning/week3#explanation-for-vectorized-implementation
We have already seen the vectorized gradient descent algorithm for logistic regression here: https://github.com/mauvaisetroupe/deep-learning-specialization-coursera/tree/main/c1-neural-networks-and-deep-learning/week2#vectorizing-logistic-regressions-gradient-output
If we apply these derivatives over all training examples to run the gradient descent algorithm, we obtain:
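A sketch of those vectorized gradients, with the examples stacked column-wise in $X$, $Y$, $Z^{[l]}$ and $A^{[l]}$ (reconstructed from the standard course derivation; the sums over columns are written in NumPy style):

$$
\begin{aligned}
dZ^{[2]} &= A^{[2]} - Y \\
dW^{[2]} &= \tfrac{1}{m}\,dZ^{[2]}A^{[1]T} \\
db^{[2]} &= \tfrac{1}{m}\,\mathrm{np.sum}(dZ^{[2]},\ \mathrm{axis=1},\ \mathrm{keepdims=True}) \\
dZ^{[1]} &= W^{[2]T}dZ^{[2]} \ast g^{[1]\prime}(Z^{[1]}) \\
dW^{[1]} &= \tfrac{1}{m}\,dZ^{[1]}X^{T} \\
db^{[1]} &= \tfrac{1}{m}\,\mathrm{np.sum}(dZ^{[1]},\ \mathrm{axis=1},\ \mathrm{keepdims=True})
\end{aligned}
$$

Each parameter is then updated as $W^{[l]} := W^{[l]} - \alpha\,dW^{[l]}$ and $b^{[l]} := b^{[l]} - \alpha\,db^{[l]}$, with $\alpha$ the learning rate.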
Random Initialization
If we initialize the weights to zero, all hidden units are symmetric: no matter how long we run gradient descent, they all keep computing exactly the same function. The solution is to initialize the parameters randomly.
We prefer to initialize the weights to very small random values: with a tanh or sigmoid activation function (or a sigmoid output unit), large weights push $z$ into the flat parts of these functions, where the gradients are very small and learning slows down.
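A minimal sketch of such an initialization, assuming the same illustrative layer sizes as above (the 0.01 factor is the value suggested in the lectures for shallow networks):

```python
import numpy as np

n_x, n_h, n_y = 2, 4, 1   # illustrative layer sizes

# Small random weights break the symmetry between hidden units; the 0.01 factor
# keeps z small so tanh/sigmoid units start away from their flat regions.
W1 = np.random.randn(n_h, n_x) * 0.01
b1 = np.zeros((n_h, 1))   # biases can safely be initialized to zero
W2 = np.random.randn(n_y, n_h) * 0.01
b2 = np.zeros((n_y, 1))
```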