Shallow Neural Networks

Build a neural network with one hidden layer, using forward propagation and backpropagation.

Learning Objectives

  • Describe hidden units and hidden layers
  • Use units with a non-linear activation function, such as tanh
  • Implement forward and backward propagation
  • Apply random initialization to your neural network
  • Increase fluency in Deep Learning notations and Neural Network Representations
  • Implement a 2-class classification neural network with a single hidden layer
  • Compute the cross entropy loss
Table of contents
  1. Shallow Neural Networks
    1. Neural Networks Overview
    2. Neural Network Representation
    3. Computing a Neural Network’s Output
    4. Vectorizing Across Multiple Examples
    5. Explanation for Vectorized Implementation
    6. Activation Functions
    7. Why do you need Non-Linear Activation Functions?
    8. Derivatives of Activation Functions
    9. Gradient Descent for Neural Networks
    10. Backpropagation Intuition (Optional)
    11. Random Initialization
  2. Heroes of Deep Learning (Optional)
    1. Ian Goodfellow Interview

Shallow Neural Networks

Neural Networks Overview

Neural Network Representation

  • Input layer / Hidden Layer / Output layer
  • The term hidden layer refers to the fact that in the training set, the true values for these nodes in the middle are not observed
  • The input is the vector $x = a^{[0]}$, where $a$ stands for activation

Computing a Neural Network’s Output

Each unit (node) of the neural network performs two steps, similar to logistic regression:

  • $z=w^Tx+b$
  • $a=\sigma(z)$

By convention, we write $a_i^{[l]}$ and $z_i^{[l]}$, where:

  • $i$ is the unit number
  • and $l$ the layer number

Vectorization using vectors and matrices for the hidden layer

Vectorization using vectors and matrices for the output layer, with $x = a^{[0]}$
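A minimal sketch of this forward pass for a single example, assuming an illustrative two-layer network with `n_x` inputs, `n_h` hidden units and one sigmoid output unit (the sizes and variable names are assumptions, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Illustrative layer sizes
n_x, n_h = 3, 4                      # input features, hidden units
x = np.random.randn(n_x, 1)          # one example, shape (n_x, 1)

W1 = np.random.randn(n_h, n_x)       # hidden-layer weights
b1 = np.zeros((n_h, 1))
W2 = np.random.randn(1, n_h)         # output-layer weights
b2 = np.zeros((1, 1))

# Hidden layer: z^{[1]} = W^{[1]} x + b^{[1]},  a^{[1]} = tanh(z^{[1]})
z1 = W1 @ x + b1
a1 = np.tanh(z1)

# Output layer: z^{[2]} = W^{[2]} a^{[1]} + b^{[2]},  a^{[2]} = sigmoid(z^{[2]})
z2 = W2 @ a1 + b2
a2 = sigmoid(z2)                     # prediction, shape (1, 1)
```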

Vectorizing Across Multiple Examples

We define:

  • $x^{(i)}$ the example # i
  • and $a^{[2] (i)}$ the prediction # i

Instead of implementing a loop over the training examples, we stack all training vectors into a matrix. Each column is one example.
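A short sketch of this idea, reusing the same illustrative two-layer network as above: stacking the $m$ examples as the columns of `X` lets one matrix product replace the loop over examples.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

n_x, n_h, m = 3, 4, 5                 # illustrative sizes: features, hidden units, examples
X = np.random.randn(n_x, m)           # each column X[:, i] is one example x^{(i)}
W1, b1 = np.random.randn(n_h, n_x), np.zeros((n_h, 1))
W2, b2 = np.random.randn(1, n_h), np.zeros((1, 1))

# Vectorized: the columns of A2 are the predictions a^{[2](i)}
Z1 = W1 @ X + b1                      # b1 is broadcast across the m columns
A1 = np.tanh(Z1)
Z2 = W2 @ A1 + b2
A2 = sigmoid(Z2)                      # shape (1, m)

# Same result as an explicit loop over the examples
for i in range(m):
    a2_i = sigmoid(W2 @ np.tanh(W1 @ X[:, [i]] + b1) + b2)
    assert np.allclose(a2_i, A2[:, [i]])
```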

Explanation for Vectorized Implementation

Activation Functions

Also see: https://github.com/mauvaisetroupe/machine-learning-specialization-coursera/blob/main/c2-advanced-learning-algorithms/week2/README.md#choosing-activation-functions

In the previous examples we used the sigmoid function, which is an example of an activation function.

The hyperbolic tangent (tanh) almost always works better than the sigmoid function, because its output is centered around zero, which helps the training algorithm. One exception is the output layer of a binary classifier, where a sigmoid output between 0 and 1 is better suited to predicting 0 or 1.

Rules of thumb for choosing activation functions:

  • Never use the sigmoid activation function, except for the output layer of a binary classifier
  • Prefer the hyperbolic tangent (tanh) over sigmoid for hidden layers
  • ReLU is the default choice (its derivative is zero for negative $z$, but in practice this is rarely a problem)
  • Or try Leaky ReLU: $\max(0.01z, z)$
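For reference, a minimal NumPy sketch of these four activation functions (the 0.01 slope in Leaky ReLU is the conventional default):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))          # output in (0, 1), used for binary output layers

def tanh(z):
    return np.tanh(z)                    # output in (-1, 1), centered around zero

def relu(z):
    return np.maximum(0, z)              # max(0, z), default choice for hidden layers

def leaky_relu(z, slope=0.01):
    return np.maximum(slope * z, z)      # max(0.01*z, z), keeps a small gradient for z < 0
```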

Why do you need Non-Linear Activation Functions?

https://github.com/mauvaisetroupe/machine-learning-specialization-coursera/blob/main/c2-advanced-learning-algorithms/week2/README.md#why-do-we-need-activation-functions

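In short: if every layer used a linear (identity) activation, the whole network would collapse into a single linear function of the input, and the hidden layer would add no expressive power. A one-line derivation of this standard argument for the two-layer case, with $g(z) = z$:

$$a^{[2]} = W^{[2]}\left(W^{[1]}x + b^{[1]}\right) + b^{[2]} = \left(W^{[2]}W^{[1]}\right)x + \left(W^{[2]}b^{[1]} + b^{[2]}\right) = W'x + b'$$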

Derivatives of Activation Functions

Technically, the ReLU derivative is not defined at $z = 0$, but for the algorithm we can simply take $g'(0) = 0$.
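For reference, the standard derivatives from the lecture, as a small NumPy sketch (ReLU's $g'(0)$ is set to 0, as noted above):

```python
import numpy as np

def sigmoid_prime(z):
    s = 1 / (1 + np.exp(-z))
    return s * (1 - s)                   # g'(z) = g(z) (1 - g(z))

def tanh_prime(z):
    return 1 - np.tanh(z) ** 2           # g'(z) = 1 - tanh(z)^2

def relu_prime(z):
    return (z > 0).astype(float)         # 1 for z > 0, 0 otherwise (including z = 0)

def leaky_relu_prime(z, slope=0.01):
    return np.where(z > 0, 1.0, slope)   # slope 0.01 for z <= 0
```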

Gradient Descent for Neural Networks

Backpropagation Intuition (Optional)

The corresponding slide for logistic regression is explained here: https://github.com/mauvaisetroupe/deep-learning-specialization-coursera/tree/main/c1-neural-networks-and-deep-learning/week2#logistic-regression-gradient-descent

We apply exactly the same computation to a neural network with two layers:

The explanation of how the computation is vectorized over all training examples (by stacking them into a matrix) is here: https://github.com/mauvaisetroupe/deep-learning-specialization-coursera/tree/main/c1-neural-networks-and-deep-learning/week3#explanation-for-vectorized-implementation

We have already seen the gradient descent algorithm for vectorized logistic regression here: https://github.com/mauvaisetroupe/deep-learning-specialization-coursera/tree/main/c1-neural-networks-and-deep-learning/week2#vectorizing-logistic-regressions-gradient-output

Applying these derivatives over all training examples to run gradient descent, we obtain:
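A sketch of the resulting vectorized gradient equations for this two-layer network (tanh hidden layer, sigmoid output, cross-entropy loss; shapes follow the forward-pass sketch above, and `Y` is the $(1, m)$ matrix of labels):

```python
import numpy as np

def backward_pass(X, Y, A1, A2, W2):
    """Gradients for a two-layer network: tanh hidden layer, sigmoid output,
    cross-entropy loss. Shapes: X (n_x, m), Y (1, m), A1 (n_h, m), A2 (1, m)."""
    m = X.shape[1]
    dZ2 = A2 - Y                                        # (1, m)
    dW2 = (1 / m) * dZ2 @ A1.T                          # (1, n_h)
    db2 = (1 / m) * np.sum(dZ2, axis=1, keepdims=True)  # (1, 1)
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)                  # (n_h, m); 1 - A1^2 = tanh'(Z1)
    dW1 = (1 / m) * dZ1 @ X.T                           # (n_h, n_x)
    db1 = (1 / m) * np.sum(dZ1, axis=1, keepdims=True)  # (n_h, 1)
    return dW1, db1, dW2, db2
```

Each parameter is then updated as $W^{[l]} := W^{[l]} - \alpha \, dW^{[l]}$ and $b^{[l]} := b^{[l]} - \alpha \, db^{[l]}$, where $\alpha$ is the learning rate.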

Random Initialization

If we initialize the weights to zero, all hidden units are symmetric: no matter how long we run gradient descent, they all keep computing exactly the same function. The solution is to initialize the parameters randomly.

We prefer to initialize the weights to very small random values: with a tanh or sigmoid activation function, large weights push the activations into the flat parts of these functions, where gradients are tiny and learning is slow.
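A minimal sketch of this initialization (the 0.01 scaling factor is the one suggested in the lecture for a shallow network; `n_x`, `n_h`, `n_y` are the layer sizes):

```python
import numpy as np

def initialize_parameters(n_x, n_h, n_y):
    W1 = np.random.randn(n_h, n_x) * 0.01   # small random values break symmetry
    b1 = np.zeros((n_h, 1))                 # biases can safely start at zero
    W2 = np.random.randn(n_y, n_h) * 0.01
    b2 = np.zeros((n_y, 1))
    return {"W1": W1, "b1": b1, "W2": W2, "b2": b2}
```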

Heroes of Deep Learning (Optional)

Ian Goodfellow Interview