Shallow Neural Networks
Build a neural network with one hidden layer, using forward propagation and backpropagation.
Learning Objectives
- Describe hidden units and hidden layers
- Use units with a non-linear activation function, such as tanh
- Implement forward and backward propagation
- Apply random initialization to your neural network
- Increase fluency in Deep Learning notations and Neural Network Representations
- Implement a 2-class classification neural network with a single hidden layer
- Compute the cross entropy loss
Table of contents
- Shallow Neural Networks
- Neural Networks Overview
- Neural Network Representation
- Computing a Neural Network’s Output
- Vectorizing Across Multiple Examples
- Explanation for Vectorized Implementation
- Activation Functions
- Why do you need Non-Linear Activation Functions?
- Derivatives of Activation Functions
- Gradient Descent for Neural Networks
- Backpropagation Intuition (Optional)
- Random Initialization
- Heroes of Deep Learning (Optional)
Shallow Neural Networks
Neural Networks Overview
Neural Network Representation
- Input layer / Hidden Layer / Output layer
- The term hidden layer refers to the fact that in the training set, the true values for these nodes in the middle are not observed
- The input is the vector $x = a^{[0]}$, where $a$ stands for activation
Computing a Neural Network’s Output
Each unit (node) of the neural network computes the same two steps as logistic regression, a linear part followed by an activation:
- $z=w^Tx+b$
- $a=\sigma(z)$
By convention, we write $a_i^{[l]}$ and $z_i^{[l]}$, where:
- $i$ is the unit number
- $l$ is the layer number
Vectorization using vectors and matrices for the hidden layer
Vectorization using vectors and matrices for the output layer, with $x = a^{[0]}$
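In matrix form, stacking the row vectors $w_i^{[1]T}$ into $W^{[1]}$ (a reconstruction of the equations shown on those slides, in the course notation), for a single example $x = a^{[0]}$:

$$
\begin{aligned}
z^{[1]} &= W^{[1]}a^{[0]} + b^{[1]}, &\qquad a^{[1]} &= \sigma(z^{[1]}) \\
z^{[2]} &= W^{[2]}a^{[1]} + b^{[2]}, &\qquad a^{[2]} &= \sigma(z^{[2]})
\end{aligned}
$$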
Vectorizing Across Multiple Examples
We define:
- $x^{(i)}$: the $i$-th training example
- $a^{[2](i)}$: the prediction for the $i$-th example
Instead of implementing a loop over the different training examples, we build a matrix that stacks all the training vectors: each column is one example, as in the sketch below.
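A minimal NumPy sketch of this vectorized forward pass, assuming illustrative layer sizes and random inputs (all names below are illustrative, not taken from the course code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: n_x input features, n_h hidden units, n_y outputs, m examples
n_x, n_h, n_y, m = 2, 4, 1, 5

X = np.random.randn(n_x, m)               # each column is one training example x^(i)
W1, b1 = np.random.randn(n_h, n_x), np.zeros((n_h, 1))
W2, b2 = np.random.randn(n_y, n_h), np.zeros((n_y, 1))

# Forward propagation over all m examples at once, without a Python loop
Z1 = W1 @ X + b1                           # shape (n_h, m); column i is z^[1](i)
A1 = np.tanh(Z1)                           # hidden-layer activations
Z2 = W2 @ A1 + b2                          # shape (n_y, m)
A2 = sigmoid(Z2)                           # predictions; column i is a^[2](i)

print(A2.shape)                            # (1, 5)
```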
Explanation for Vectorized Implementation
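A sketch of the key observation: because each example $x^{(i)}$ sits in its own column of $X$, the matrix product applies $W^{[1]}x^{(i)} + b^{[1]}$ to every column at once (with $b^{[1]}$ broadcast to each column):

$$
X = \begin{bmatrix} \vert & & \vert \\ x^{(1)} & \cdots & x^{(m)} \\ \vert & & \vert \end{bmatrix}
\quad\Rightarrow\quad
Z^{[1]} = W^{[1]}X + b^{[1]} = \begin{bmatrix} \vert & & \vert \\ z^{[1](1)} & \cdots & z^{[1](m)} \\ \vert & & \vert \end{bmatrix}
$$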
Activation Functions
Also see: https://github.com/mauvaisetroupe/machine-learning-specialization-coursera/blob/main/c2-advanced-learning-algorithms/week2/README.md#choosing-activation-functions
In the previous examples we used the sigmoid function; the sigmoid is just one example of an activation function.
The hyperbolic tangent (tanh) almost always works better than the sigmoid, because its output is centered around zero, which makes training more efficient. One exception is the output layer of a binary classifier, where a sigmoid output between 0 and 1 is better suited to predicting a probability.
Rules of thumb for choosing an activation function (see the sketch after this list):
- never use the sigmoid activation function, except for the output layer of a binary classifier
- prefer the hyperbolic tangent over the sigmoid
- ReLU is the default choice (its one drawback is that the gradient is zero when $z < 0$)
- or try Leaky ReLU: $\max(0.01z, z)$
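A minimal NumPy sketch of these activation functions (the function names and sample values are illustrative, not from the course code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))     # output in (0, 1): reserve for binary-classification output layers

def tanh(z):
    return np.tanh(z)                   # output in (-1, 1): zero-centered, usually better than sigmoid in hidden layers

def relu(z):
    return np.maximum(0.0, z)           # default choice for hidden layers

def leaky_relu(z, slope=0.01):
    return np.maximum(slope * z, z)     # small slope for z < 0 avoids a zero gradient there

z = np.linspace(-3, 3, 7)
print(relu(z))
print(leaky_relu(z))
```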
Why do you need Non-Linear Activation Functions?
https://github.com/mauvaisetroupe/machine-learning-specialization-coursera/blob/main/c2-advanced-learning-algorithms/week2/README.md#why-do-we-need-activation-functions
Without a non-linear activation function, every layer computes a linear function of its input, so stacking layers still gives a linear function: the hidden layer adds no expressive power and the network can only learn linear relationships between $x$ and $y$.
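A one-line sketch of that argument for the two-layer network above, taking the identity ("linear") activation $g(z) = z$:

$$
a^{[2]} = W^{[2]}\left(W^{[1]}x + b^{[1]}\right) + b^{[2]} = \underbrace{W^{[2]}W^{[1]}}_{W'}x + \underbrace{W^{[2]}b^{[1]} + b^{[2]}}_{b'} = W'x + b'
$$

so no matter how many layers are stacked, the output is still a linear function of $x$.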
Derivatives of Activation Functions
Technically, the ReLU (and Leaky ReLU) derivative is not defined at zero, but in the algorithm we can simply take $g'(0) = 0$.
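For reference, the derivative formulas used in this course, written in terms of the activation $a = g(z)$ where convenient (standard results, summarized here):

$$
\begin{aligned}
\text{sigmoid:}\quad & g(z) = \frac{1}{1+e^{-z}}, & g'(z) &= g(z)\bigl(1-g(z)\bigr) = a(1-a) \\
\text{tanh:}\quad & g(z) = \tanh(z), & g'(z) &= 1-\tanh^2(z) = 1-a^2 \\
\text{ReLU:}\quad & g(z) = \max(0,z), & g'(z) &= \begin{cases} 0 & \text{if } z<0 \\ 1 & \text{if } z>0 \end{cases} \\
\text{Leaky ReLU:}\quad & g(z) = \max(0.01z,z), & g'(z) &= \begin{cases} 0.01 & \text{if } z<0 \\ 1 & \text{if } z>0 \end{cases}
\end{aligned}
$$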
Gradient Descent for Neural Networks
Backpropagation Intuition (Optional)
The gradient descent computation for logistic regression is explained here: https://github.com/mauvaisetroupe/deep-learning-specialization-coursera/tree/main/c1-neural-networks-and-deep-learning/week2#logistic-regression-gradient-descent
We have exactly the same kind of calculation for a neural network with two layers:
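For a single training example, this gives (a reconstruction of the standard two-layer backpropagation equations, assuming a sigmoid output unit with the cross-entropy loss; $\ast$ denotes the element-wise product):

$$
\begin{aligned}
dz^{[2]} &= a^{[2]} - y \\
dW^{[2]} &= dz^{[2]}\,a^{[1]T} \\
db^{[2]} &= dz^{[2]} \\
dz^{[1]} &= W^{[2]T}dz^{[2]} \ast g^{[1]\prime}(z^{[1]}) \\
dW^{[1]} &= dz^{[1]}\,x^{T} \\
db^{[1]} &= dz^{[1]}
\end{aligned}
$$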
The explanation of how all training examples are stacked into a matrix for the vectorized computation is here: https://github.com/mauvaisetroupe/deep-learning-specialization-coursera/tree/main/c1-neural-networks-and-deep-learning/week3#explanation-for-vectorized-implementation
We have already seen the vectorized gradient descent algorithm for logistic regression here: https://github.com/mauvaisetroupe/deep-learning-specialization-coursera/tree/main/c1-neural-networks-and-deep-learning/week2#vectorizing-logistic-regressions-gradient-output
If we apply these derivatives over all training examples to run the gradient descent algorithm, we obtain:
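A sketch of those vectorized gradients, with the examples stacked column-wise in $X$, $Y$, $Z^{[l]}$ and $A^{[l]}$ (reconstructed from the standard course derivation; the sums over columns are written in NumPy style):

$$
\begin{aligned}
dZ^{[2]} &= A^{[2]} - Y \\
dW^{[2]} &= \tfrac{1}{m}\,dZ^{[2]}A^{[1]T} \\
db^{[2]} &= \tfrac{1}{m}\,\mathrm{np.sum}(dZ^{[2]},\ \mathrm{axis=1},\ \mathrm{keepdims=True}) \\
dZ^{[1]} &= W^{[2]T}dZ^{[2]} \ast g^{[1]\prime}(Z^{[1]}) \\
dW^{[1]} &= \tfrac{1}{m}\,dZ^{[1]}X^{T} \\
db^{[1]} &= \tfrac{1}{m}\,\mathrm{np.sum}(dZ^{[1]},\ \mathrm{axis=1},\ \mathrm{keepdims=True})
\end{aligned}
$$

Each parameter is then updated as $W^{[l]} := W^{[l]} - \alpha\,dW^{[l]}$ and $b^{[l]} := b^{[l]} - \alpha\,db^{[l]}$, with $\alpha$ the learning rate.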
Random Initialization
If we initialize the weights to zero, all hidden units are symmetric: no matter how long we run gradient descent, they all keep computing exactly the same function. The solution is to initialize the parameters randomly.
We prefer to initialize the weights to very small random values: with a tanh or sigmoid activation function (or a sigmoid output unit), large weights push $z$ into the flat parts of these functions, where the gradients are very small and learning slows down.
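A minimal sketch of such an initialization, assuming the same illustrative layer sizes as above (the 0.01 factor is the value suggested in the lectures for shallow networks):

```python
import numpy as np

n_x, n_h, n_y = 2, 4, 1   # illustrative layer sizes

# Small random weights break the symmetry between hidden units; the 0.01 factor
# keeps z small so tanh/sigmoid units start away from their flat regions.
W1 = np.random.randn(n_h, n_x) * 0.01
b1 = np.zeros((n_h, 1))   # biases can safely be initialized to zero
W2 = np.random.randn(n_y, n_h) * 0.01
b2 = np.zeros((n_y, 1))
```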