Deep Learning Specialization

Neural Networks and Deep Learning

Supervised Learning with Neural Networks


| Input | Output | Application | Algorithm |
|---|---|---|---|
| Home features | Price | Real estate | Standard NN |
| Ad, user info | Click on ad? (0/1) | Online advertising | Standard NN |
| Image | Object (1,…,1000) | Photo tagging | Convolutional NN |
| Audio | Text transcript | Speech recognition | Recurrent NN |
| English | Chinese | Machine translation | Recurrent NN |
| Image, radar info | Position of other cars | Autonomous driving | Hybrid |

Logistic Regression as a Neural Network

\[w,b \quad\rightarrow\quad z=w^Tx+b \quad\rightarrow\quad a=\sigma(z) \quad\rightarrow\quad L(a,y)\]

Given $x$, we compute

\[\hat{y}=a=\sigma(w^Tx+b)\]

with $\sigma(z)=\frac{1}{1+e^{-z}}$ and $\sigma'(z)=\sigma(z)(1-\sigma(z))$.

The loss (error) function is:

\[L(a,y)=-(y\log a + (1-y)\log(1-a))\]

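The cost function is:

\[J(w,b)=\frac{1}{m}\sum\limits_{i=1}^m L(a^{(i)},y^{(i)})\]

By the chain rule, the gradient of the loss for a single example is:

\[\frac{\partial L}{\partial w_i} = \frac{\partial L}{\partial a}\frac{\partial a}{\partial z}\frac{\partial z}{\partial w_i}=(-\frac{y}{a}+\frac{1-y}{1-a})~\cdot~a(1-a)~\cdot~x_i=(a-y)~x_i\]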
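The chain-rule product above simplifies to $(a-y)x_i$, since $(-\frac{y}{a}+\frac{1-y}{1-a})\cdot a(1-a) = -y(1-a)+(1-y)a = a-y$. The following is a minimal NumPy sketch of that computation for a single example; the function name `logistic_loss_and_grads` and the sample values are illustrative assumptions, not part of the course material.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss_and_grads(w, b, x, y):
    """Loss and gradients for one example (x: shape (n,), y: 0 or 1)."""
    z = w @ x + b
    a = sigmoid(z)
    loss = -(y * np.log(a) + (1 - y) * np.log(1 - a))
    dz = a - y      # (-y/a + (1-y)/(1-a)) * a(1-a) simplifies to a - y
    dw = dz * x     # dL/dw_i = (a - y) * x_i
    db = dz         # dL/db   = (a - y)
    return loss, dw, db

# Illustrative values (assumed for this example only).
w = np.array([0.5, -0.25])
b = 0.1
x = np.array([1.0, 2.0])
loss, dw, db = logistic_loss_and_grads(w, b, x, y=1)
```

Comparing `dw` against a finite-difference estimate of the loss is a quick way to convince yourself that the simplification is right.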

Neural Network

Figure: a 2-layer neural network

The first layer $a^{[0] (i)}=x^{(i)}$ is the input layer, the last layer $a^{[L] (i)}=\hat{y}^{(i)}$ is the output layer and the layers between them are hidden layers.

For $a^{[l] (i)}$ and $a_k^{[l+1] (i)}$ we have parameters $w_k^{[l+1]}$ and $b_k^{[l+1]}$ such that

\[z_k^{[l+1] (i)} = (w_k^{[l+1]})^T a^{[l] (i)} + b_k^{[l+1]}\] \[a_k^{[l+1] (i)} = g^{[l+1]}(z_k^{[l+1] (i)})\]

where $g(z)$ is the activation function.

So by noting

\[W^{[l]}=\begin{bmatrix} (w_1^{[l]})^T\\ (w_2^{[l]})^T\\ \vdots\\ (w_{n^{[l]}}^{[l]})^T \end{bmatrix}, \quad b^{[l]}=\begin{bmatrix} b_1^{[l]}\\ b_2^{[l]}\\ \vdots\\ b_{n^{[l]}}^{[l]} \end{bmatrix}, \quad z^{[l] (i)}=\begin{bmatrix} z_1^{[l] (i)}\\ z_2^{[l] (i)}\\ \vdots\\ z_{n^{[l]}}^{[l] (i)} \end{bmatrix} \quad a^{[l] (i)}=\begin{bmatrix} a_1^{[l] (i)}\\ a_2^{[l] (i)}\\ \vdots\\ a_{n^{[l]}}^{[l] (i)} \end{bmatrix}\]

we have:

\[z^{[l+1] (i)} = W^{[l+1]} a^{[l] (i)} + b^{[l+1]}\] \[a^{[l+1] (i)} = g^{[l+1]}(z^{[l+1] (i)})\]

We can continue vectorizing by noting:

\[X=\begin{bmatrix} x^{(1)} & x^{(2)} & \cdots & x^{(m)} \end{bmatrix}\] \[Z^{[l]}=\begin{bmatrix} z^{[l] (1)} & z^{[l] (2)} & \cdots & z^{[l] (m)} \end{bmatrix}\] \[A^{[l]}=\begin{bmatrix} a^{[l] (1)} & a^{[l] (2)} & \cdots & a^{[l] (m)} \end{bmatrix}\]

then (in fact the vector $b^{[l+1]}$ is added to each column)

\[Z^{[l+1]} = W^{[l+1]} A^{[l]} + b^{[l+1]}\] \[A^{[l+1]} = g^{[l+1]}(Z^{[l+1]})\]


This is forward propagation.
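The vectorized recurrence $Z^{[l+1]} = W^{[l+1]} A^{[l]} + b^{[l+1]}$, $A^{[l+1]} = g^{[l+1]}(Z^{[l+1]})$ can be sketched in NumPy as below. The function name `forward`, the layer sizes and the choice of tanh and sigmoid activations are assumptions made for the example.

```python
import numpy as np

def forward(X, params, activations):
    """Forward propagation through all layers.

    X: shape (n0, m), one example per column.
    params: list of (W, b), W of shape (n_l, n_{l-1}), b of shape (n_l, 1).
    activations: list of callables g^{[l]}, one per layer.
    Returns the lists [Z1..ZL] and [A0..AL].
    """
    A = X
    Zs, As = [], [X]
    for (W, b), g in zip(params, activations):
        Z = W @ A + b   # b is broadcast, i.e. added to each of the m columns
        A = g(Z)
        Zs.append(Z)
        As.append(A)
    return Zs, As

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5))   # n0 = 3 features, m = 5 examples
params = [(rng.standard_normal((4, 3)) * 0.01, np.zeros((4, 1))),
          (rng.standard_normal((1, 4)) * 0.01, np.zeros((1, 1)))]
Zs, As = forward(X, params, [np.tanh, sigmoid])
```

The intermediate `Zs` and `As` are kept because backpropagation needs them later.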

Activation functions

Gradient descent

For the cost function $J(W^{[l]},b^{[l]})=\frac{1}{m}\sum\limits_{i=1}^m L(\hat{y}^{(i)},y^{(i)})$, we note

\[dV := \frac{\partial J}{\partial V}\]

and we can update variables like $V = V - \alpha dV$

To calculate the gradient, since

\[z_k^{[l+1] (i)} = (w_k^{[l+1]})^T a^{[l] (i)} + b_k^{[l+1]}\] \[a_k^{[l+1] (i)} = g^{[l+1]}(z_k^{[l+1] (i)})\]

We have

\[\begin{align*} dz_k^{[l+1] (i)}&=\frac{\partial J}{\partial a_k^{[l+1] (i)}} \frac{\partial a_k^{[l+1] (i)}}{\partial z_k^{[l+1] (i)}} = da_k^{[l+1] (i)} {g^{[l+1]}}'(z_k^{[l+1] (i)}) \\ dw_k^{[l+1]}&=\frac{\partial J}{\partial z_k^{[l+1] (i)}} \frac{\partial z_k^{[l+1] (i)}}{\partial w_k^{[l+1]}} = dz_k^{[l+1] (i)} a^{[l] (i)} \\ db_k^{[l+1]}&=\frac{\partial J}{\partial z_k^{[l+1] (i)}} \frac{\partial z_k^{[l+1] (i)}}{\partial b_k^{[l+1]}} = dz_k^{[l+1] (i)} \\ da^{[l] (i)}&=\sum\limits_k \frac{\partial J}{\partial z_k^{[l+1] (i)}} \frac{\partial z_k^{[l+1] (i)}}{\partial a^{[l] (i)}} = \sum\limits_k dz_k^{[l+1] (i)} w_k^{[l+1]} \end{align*}\]


By noting, as before,

\[W^{[l]}=\begin{bmatrix} (w_1^{[l]})^T\\ (w_2^{[l]})^T\\ \vdots\\ (w_{n^{[l]}}^{[l]})^T \end{bmatrix}, \quad b^{[l]}=\begin{bmatrix} b_1^{[l]}\\ b_2^{[l]}\\ \vdots\\ b_{n^{[l]}}^{[l]} \end{bmatrix}, \quad z^{[l] (i)}=\begin{bmatrix} z_1^{[l] (i)}\\ z_2^{[l] (i)}\\ \vdots\\ z_{n^{[l]}}^{[l] (i)} \end{bmatrix}, \quad a^{[l] (i)}=\begin{bmatrix} a_1^{[l] (i)}\\ a_2^{[l] (i)}\\ \vdots\\ a_{n^{[l]}}^{[l] (i)} \end{bmatrix}\]

we now have, for the i-th example,

\[\begin{align*} dz^{[l+1] (i)}&= da^{[l+1] (i)}~\circ~{g^{[l+1]}}'(z^{[l+1] (i)}) \\ dW^{[l+1]}&= dz^{[l+1] (i)} (a^{[l] (i)})^T \\ db^{[l+1]}&= dz^{[l+1] (i)} \\ da^{[l] (i)}&=(W^{[l+1]})^Tdz^{[l+1] (i)} \end{align*}\]

By continuing vectorizing with the help of:

\[X=\begin{bmatrix} x^{(1)} & x^{(2)} & \cdots & x^{(m)} \end{bmatrix}\] \[Z^{[l]}=\begin{bmatrix} z^{[l] (1)} & z^{[l] (2)} & \cdots & z^{[l] (m)} \end{bmatrix}\] \[A^{[l]}=\begin{bmatrix} a^{[l] (1)} & a^{[l] (2)} & \cdots & a^{[l] (m)} \end{bmatrix}\]

we have,

\[\begin{align*} dZ^{[l+1]}&= dA^{[l+1]}~\circ~{g^{[l+1]}}'(Z^{[l+1]}) \\ dW^{[l+1]}&= \frac{1}{m}dZ^{[l+1]}(A^{[l]})^T \\ db^{[l+1]}&= \frac{1}{m}dZ^{[l+1]} [1]_{(m,1)} \\ dA^{[l]}&= (W^{[l+1]})^TdZ^{[l+1]} \end{align*}\]

Here $\circ$ means element-wise multiplication, $[1]_{(m,1)}$ means ones((m,1)) and we calculate the average over m examples for $dW, db$.

This is backpropagation.
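One backward step for layer $l+1$ can be sketched in NumPy as follows. Note that $dW^{[l+1]}$ must have the shape of $W^{[l+1]}$, that is $(n^{[l+1]}, n^{[l]})$, which is why $dZ^{[l+1]}$ is multiplied by the transpose of $A^{[l]}$. The function name `backward_layer` and the random test shapes are assumptions for illustration.

```python
import numpy as np

def backward_layer(dA_next, Z_next, A_prev, W_next, g_prime):
    """One backpropagation step for layer l+1.

    dA_next: dA^{[l+1]}, shape (n_{l+1}, m)
    Z_next:  Z^{[l+1]},  shape (n_{l+1}, m)
    A_prev:  A^{[l]},    shape (n_l, m)
    W_next:  W^{[l+1]},  shape (n_{l+1}, n_l)
    g_prime: derivative of the activation g^{[l+1]}
    """
    m = A_prev.shape[1]
    dZ = dA_next * g_prime(Z_next)           # element-wise product
    dW = dZ @ A_prev.T / m                   # average over the m examples
    db = dZ.sum(axis=1, keepdims=True) / m   # same as dZ @ ones((m, 1)) / m
    dA_prev = W_next.T @ dZ
    return dW, db, dA_prev

rng = np.random.default_rng(0)
dA_next = rng.standard_normal((2, 4))
Z_next = rng.standard_normal((2, 4))
A_prev = rng.standard_normal((3, 4))
W_next = rng.standard_normal((2, 3))
tanh_prime = lambda z: 1.0 - np.tanh(z) ** 2
dW, db, dA_prev = backward_layer(dA_next, Z_next, A_prev, W_next, tanh_prime)
```

Checking that each gradient has the same shape as the quantity it differentiates is a cheap sanity test for index mistakes.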


We need to initialize the weights randomly.

It is recommended to initialize $W$ with small random values, for example in [-0.01, 0.01], and to initialize $b$ with zeros. Small values of $W$ keep $Z$ small, so that activations such as sigmoid or tanh do not saturate.

If we initialize all the weights with zeros, then in each layer all the units compute the same function and receive the same gradient, so they stay identical throughout training.


The learning rate $\alpha$, the number of iterations, the number of hidden layers, the size of each hidden layer, the choice of activation function, the momentum term, the mini-batch size, the regularization parameters, etc. are all hyperparameters.

Gradient checking

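Gradient checking is for debugging only, not for training.

It does not work with dropout.

If we take $\varepsilon=10^{-7}$ and the L2 error between the numerical gradient and the backpropagation gradient is about $10^{-7}$, the implementation is probably correct.

If the gradient check fails, look at the individual components to locate the error.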
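A minimal sketch of a gradient check is given below. It uses a two-sided finite difference and, as an assumption not stated in these notes, normalises the L2 error by the norms of the two gradients. The toy cost $J(\theta)=\sum_i\theta_i^2$ stands in for a real network cost.

```python
import numpy as np

def numerical_grad(f, theta, eps=1e-7):
    """Two-sided finite-difference estimate of the gradient of f at theta."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

def relative_error(g_num, g_ana):
    """L2 distance between two gradients, normalised by their norms (assumed convention)."""
    return np.linalg.norm(g_num - g_ana) / (np.linalg.norm(g_num) + np.linalg.norm(g_ana))

# Toy cost whose analytical gradient is known: dJ/dtheta = 2 * theta.
J = lambda theta: np.sum(theta ** 2)
theta = np.array([1.0, -2.0, 3.0])
err = relative_error(numerical_grad(J, theta), 2 * theta)
```

Because the loop evaluates the cost twice per parameter, this is far too slow to run during training, which is why it is only a debugging tool.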

Hyperparameter tuning, Regularization and Optimization

Basic Recipe for Machine Learning:


L2 regularization

Logistic regression


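\[J(w,b)=\frac{1}{m}\sum\limits_{i=1}^m L(\hat{y}^{(i)},y^{(i)}) + \frac{\lambda}{2m}\|w\|_2^2\]

The regularization term for $b$ is omitted. If we use the $L^1$ norm instead, $w$ will be sparse.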

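Neural network

\[J(W^{[1]},b^{[1]},\ldots,W^{[L]},b^{[L]})=\frac{1}{m}\sum\limits_{i=1}^m L(\hat{y}^{(i)},y^{(i)}) + \frac{\lambda}{2m}\sum\limits_{l=1}^L \|W^{[l]}\|_F^2\]

where the Frobenius norm is defined as: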

\[\|A\|_F^2 = \sum\limits_i \sum\limits_j A_{ij}^2\]

So now the gradient of $W^{[l]}$ is:

\[dW^{[l]} = \frac{1}{m}dZ^{[l]}(A^{[l-1]})^T + \frac{\lambda}{m}W^{[l]}\]

Inverted Dropout regularization

Dropout regularization

For example, for the l-th layer we keep each unit with a probability $p$, so we can change $A^{[l]}$ as:

\[A^{[l]} = A^{[l]}~\circ~(\text{rand}((n^{[l]}, m)) < p)\] \[A^{[l]} = \frac{1}{p}A^{[l]}\]

For example, with 50 units and $p=0.8$, on average 10 units are shut off, so 20% of the elements of $A^{[l]}$ are zero and its expected value is reduced by 20%.

But since $Z^{[l+1]} = W^{[l+1]} A^{[l]} + b^{[l+1]}$, in order to preserve the expected value of $A^{[l]}$, we need to divide it by $p$.

Note that at test time we do not use dropout.

It is recommended to set a different keep probability $p$ for each layer: if the size of $A^{[l]}$ is large, we can set $p$ smaller (e.g. 0.5).

Also, remember that dropout is meant to prevent overfitting, so if there is no overfitting we do not have to apply it. In computer vision we almost never have enough data, so overfitting is often an issue.
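The two steps of inverted dropout (mask, then divide by the keep probability) can be sketched as follows. The function name `inverted_dropout` and the all-ones activation matrix are assumptions chosen so that the effect on the mean is easy to see.

```python
import numpy as np

def inverted_dropout(A, keep_prob, rng):
    """Zero each entry of A with probability 1 - keep_prob, then rescale."""
    mask = rng.random(A.shape) < keep_prob   # True where the unit is kept
    return A * mask / keep_prob, mask

rng = np.random.default_rng(0)
A = np.ones((50, 1000))                      # 50 units, 1000 examples
A_drop, mask = inverted_dropout(A, keep_prob=0.8, rng=rng)
```

About 80% of the entries survive and each survivor is scaled by $1/0.8$, so the mean of `A_drop` stays close to the mean of `A`. At test time the function is simply not called.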

Other methods

Data augmentation

For example, we can rotate/flip/cut images to make new data.

Early stopping

Plot the training error (or $J$) and the dev set error. If the dev set error starts to increase while the training error keeps decreasing, we stop training.


Normalizing training sets

Shift the mean to zero ($x=x-\mu$) and scale the variance to one ($x=x/\sigma$). Note that the same $\mu$ and $\sigma$ must be used to normalize the dev/test sets.

Do this every time: it does no harm, and we usually cannot be sure that it is unnecessary.
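A short sketch of this normalization, with statistics computed on the training set only and then reused for the test set. The helper names `fit_normalizer` and `normalize`, and the synthetic data, are assumptions for the example.

```python
import numpy as np

def fit_normalizer(X_train):
    """Per-feature mean and standard deviation; one example per column."""
    mu = X_train.mean(axis=1, keepdims=True)
    sigma = X_train.std(axis=1, keepdims=True)
    return mu, sigma

def normalize(X, mu, sigma):
    return (X - mu) / sigma

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=3.0, size=(2, 200))
X_test = rng.normal(loc=5.0, scale=3.0, size=(2, 50))
mu, sigma = fit_normalizer(X_train)
X_train_n = normalize(X_train, mu, sigma)
X_test_n = normalize(X_test, mu, sigma)   # reuse the training-set statistics
```

The test set will not have exactly zero mean and unit variance after this step, and that is expected: what matters is that both sets go through the same transformation.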

Vanishing/Exploding gradients

When a neural network is deep, the gradients can become very small or very large.

For example, if all hidden layers have the same size, $g(z)=z$, $b^{[l]}=0$ and $W^{[l]}=W$, then

\[\hat{y} = a^{[L]} = W^L x\]

where $W^L$ is the $L$-th power of $W$. It can be very large or very small depending on the eigenvalues of $W$.

Weight initialization for deep networks

A partial solution for gradient vanishing/exploding problems

We know that $W^{[l]}$ is of size $(n^{[l]},n^{[l-1]})$. To initialize it for tanh activations (Xavier initialization):

\[W^{[l]} = \sqrt{\frac{1}{n^{[l-1]}}}~\text{randn}((n^{[l]},n^{[l-1]}))\]

or

\[W^{[l]} = \sqrt{\frac{2}{n^{[l-1]}+n^{[l]}}}~\text{randn}((n^{[l]},n^{[l-1]}))\]

If we use ReLU activation functions,

\[W^{[l]} = \sqrt{\frac{2}{n^{[l-1]}}}~\text{randn}((n^{[l]},n^{[l-1]}))\]

Here randn samples from a standard normal distribution, so the prefactor is the standard deviation of the weights.
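The scaled initializations can be sketched as below, assuming the weights are drawn from a standard normal distribution before scaling. The function name `init_weights` and the labels `"xavier"` and `"he"` are names chosen for this example only.

```python
import numpy as np

def init_weights(n_out, n_in, kind, rng):
    """Sample W of shape (n_out, n_in) from a scaled standard normal."""
    if kind == "xavier":      # tanh: variance 1 / n_in
        scale = np.sqrt(1.0 / n_in)
    elif kind == "he":        # ReLU: variance 2 / n_in
        scale = np.sqrt(2.0 / n_in)
    else:
        raise ValueError(f"unknown kind: {kind}")
    return rng.standard_normal((n_out, n_in)) * scale

rng = np.random.default_rng(0)
W_tanh = init_weights(400, 500, "xavier", rng)
W_relu = init_weights(400, 500, "he", rng)
```

The scale shrinks as the number of inputs grows, which keeps the variance of $z = w^Tx$ roughly constant from layer to layer.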

Mini-batch gradient descent

Split the training dataset $X$ into mini-batches $X^{\{t\}}$, and use only one mini-batch to calculate each gradient. This way, the algorithm starts making progress before it has processed the entire training set.

Typically the mini-batch size is 64, 128, 256 or 512.
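Splitting a dataset into mini-batches can be sketched as follows. Shuffling the examples before splitting is an assumption added here (it is common practice but not stated above), and `make_mini_batches` is a hypothetical helper name.

```python
import numpy as np

def make_mini_batches(X, Y, batch_size, rng):
    """Shuffle the columns of X and Y together, then split into mini-batches."""
    m = X.shape[1]
    perm = rng.permutation(m)
    X, Y = X[:, perm], Y[:, perm]
    return [(X[:, k:k + batch_size], Y[:, k:k + batch_size])
            for k in range(0, m, batch_size)]

rng = np.random.default_rng(0)
X = np.arange(300).reshape(1, 300).astype(float)   # m = 300 examples
Y = (X > 150).astype(float)
batches = make_mini_batches(X, Y, batch_size=64, rng=rng)
```

With 300 examples and a batch size of 64 this gives four full mini-batches and a final one of 44 examples; each $(X^{\{t\}}, Y^{\{t\}})$ pair keeps inputs and labels aligned.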

Gradient descent with momentum

Exponentially weighted averages


Given data $\theta_1, \theta_2, \cdots$, we can compute a running average as

\[v_0=0,\quad v_t=\beta v_{t-1}+(1-\beta)\theta_t\]

$v_t$ is approximately an average over the last $\frac{1}{1-\beta}$ data points (for example, days for daily data).

In fact,

\[v_t=(1-\beta)\theta_t+\beta v_{t-1}=(1-\beta)\theta_t+(1-\beta)\beta\theta_{t-1}+\cdots+(1-\beta)\beta^{t-s}\theta_{s}+\cdots+(1-\beta)\beta^{t-1}\theta_1\]

and we have $\lim\limits_{\varepsilon\rightarrow 0}(1-\varepsilon)^{\frac{1}{\varepsilon}}=\frac{1}{e}$, so $\beta^{\frac{1}{1-\beta}}\approx\frac{1}{e}$

Bias correction

We can see that $v_t$ is a linear combination of the $\theta_i$ and that the sum of the coefficients is $1-\beta^t$. When $t$ is small, $v_t$ is too small and not accurate enough. So we can correct the bias as follows:

\[v_t^{\text{corrected}}=\frac{v_t}{1-\beta^t}\]
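Since the coefficients sum to $1-\beta^t$, dividing $v_t$ by $1-\beta^t$ removes the bias. The following pure-Python sketch shows the effect on a constant sequence; the function name `ewa` and the sample data are assumptions for the example.

```python
def ewa(thetas, beta, bias_correction=True):
    """Exponentially weighted averages of a sequence, starting from v_0 = 0."""
    v, out = 0.0, []
    for t, theta in enumerate(thetas, start=1):
        v = beta * v + (1 - beta) * theta
        out.append(v / (1 - beta ** t) if bias_correction else v)
    return out

data = [10.0] * 5
raw = ewa(data, beta=0.9, bias_correction=False)
corrected = ewa(data, beta=0.9)
```

For constant data the uncorrected average starts near 1 and only slowly climbs towards 10, whereas the corrected average equals 10 (up to rounding) from the first step.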



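For gradient descent with momentum, the idea is to average the gradients to damp oscillations, which can speed up the algorithm.

$V_{dW} = 0, V_{db} = 0$

On iteration $t$:

\[V_{dW} = \beta V_{dW} + (1-\beta)dW, \quad V_{db} = \beta V_{db} + (1-\beta)db\] \[W = W - \alpha V_{dW}, \quad b = b - \alpha V_{db}\]

In practice, bias correction is usually omitted here, and $\beta = 0.9$ is a common choice.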


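For RMSprop, we keep an exponentially weighted average of the squared gradients (the square is element-wise).

$S_{dW} = 0, S_{db} = 0$

On iteration $t$:

\[S_{dW} = \beta S_{dW} + (1-\beta)dW^2, \quad S_{db} = \beta S_{db} + (1-\beta)db^2\] \[W = W - \alpha\frac{dW}{\sqrt{S_{dW}}+\varepsilon}, \quad b = b - \alpha\frac{db}{\sqrt{S_{db}}+\varepsilon}\]

The idea is that when an element of the gradient is too large or too small, its update is rescaled: for example, if $db$ is too large, dividing by $\sqrt{S_{db}}$ makes the step smaller. The $\varepsilon$ prevents division by zero; normally $\varepsilon=10^{-8}$.

By using RMSprop, we can use a larger $\alpha$.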


Combine Momentum and RMSprop

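Adam: Adaptive Moment Estimation

$V_{dW} = 0, V_{db} = 0, S_{dW} = 0, S_{db} = 0$

On iteration $t$:

\[V_{dW} = \beta_1 V_{dW} + (1-\beta_1)dW, \quad S_{dW} = \beta_2 S_{dW} + (1-\beta_2)dW^2\] \[V_{dW}^{\text{corrected}} = \frac{V_{dW}}{1-\beta_1^t}, \quad S_{dW}^{\text{corrected}} = \frac{S_{dW}}{1-\beta_2^t}\] \[W = W - \alpha\frac{V_{dW}^{\text{corrected}}}{\sqrt{S_{dW}^{\text{corrected}}}+\varepsilon}\]

and similarly for $b$.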
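A minimal sketch of one Adam update, combining the momentum average $V$, the RMSprop average $S$ and bias correction. The default $\beta_2 = 0.999$ is an assumption (a commonly used value, not stated in these notes), and the quadratic cost is a toy stand-in for a real network.

```python
import numpy as np

def adam_step(w, dw, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter array; t starts at 1."""
    v = beta1 * v + (1 - beta1) * dw        # momentum term
    s = beta2 * s + (1 - beta2) * dw ** 2   # RMSprop term (element-wise square)
    v_hat = v / (1 - beta1 ** t)            # bias correction
    s_hat = s / (1 - beta2 ** t)
    w = w - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s

# Toy problem: minimise J(w) = ||w||^2, whose gradient is 2w.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
s = np.zeros_like(w)
for t in range(1, 2001):
    w, v, s = adam_step(w, 2 * w, v, s, t, alpha=0.01)
```

In a real network the same step is applied to every $W^{[l]}$ and $b^{[l]}$, each with its own `v` and `s`.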


Learning rate decay

We call one pass through the data an epoch. We can then decrease the learning rate $\alpha$ as follows:

\[\alpha = \frac{1}{1+\text{decay rate}~*~\text{epoch num}}\alpha_0\]
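The decay formula is easy to tabulate. The sketch below uses $\alpha_0 = 0.2$ and a decay rate of 1, which are values assumed for illustration.

```python
def decayed_learning_rate(alpha0, decay_rate, epoch_num):
    """alpha = alpha0 / (1 + decay_rate * epoch_num)."""
    return alpha0 / (1 + decay_rate * epoch_num)

# Learning rate at epochs 0, 1, 2 and 3.
schedule = [decayed_learning_rate(0.2, 1.0, epoch) for epoch in range(4)]
```

The learning rate is halved after the first epoch and then shrinks more and more slowly.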


Hyperparameter tuning


Try random values instead of a grid when tuning. After finding a good point, we can zoom in and sample more densely around it.

Appropriate scale

Batch Normalization

In a neural network, we can normalize $z^{[l]}$ (or $a^{[l]}$) in order to train $w^{[l]},b^{[l]}$ faster.

\[z_{\text{norm}}^{[l] (i)} = \frac{z^{[l] (i)} - \mu}{\sqrt{\sigma^2+\varepsilon}}\] \[\tilde{z}^{[l] (i)} = \gamma z_{\text{norm}}^{[l] (i)} + \beta\]

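Here $\mu$ and $\sigma^2$ are the mean and variance of $z^{[l] (i)}$ over the examples. We first normalize $z^{[l] (i)}$ to zero mean and unit variance, then rescale it so that it has mean $\beta$ and variance $\gamma^2$.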

\[X \xrightarrow{W^{[1]},b^{[1]}} Z^{[1]} \xrightarrow{\gamma^{[1]},\beta^{[1]}} \tilde{Z}^{[1]} \rightarrow a^{[1]} \rightarrow \cdots\]

Now, the parameters are $W^{[l]},b^{[l]},\gamma^{[l]},\beta^{[l]}$. We need to learn $\gamma^{[l]},\beta^{[l]}$ and they are of size ($n^{[l]}$,$1$).

In fact, we know that $Z^{[l+1]} = W^{[l+1]} A^{[l]} + b^{[l+1]}$, but since normalizing $Z^{[l+1]}$ subtracts its mean, the constant $b^{[l+1]}$ is cancelled and is no longer useful. So we can get rid of it; $\beta^{[l+1]}$ takes over its role.

Moreover, when using mini-batches, we normalize $Z^{[l]}$ using only the data in the mini-batch. At test time, we need to calculate $\mu, \sigma$ differently.

One approach is to calculate $\mu, \sigma$ for each mini-batch and then use exponentially weighted averages of them to estimate $\mu, \sigma$ for the entire dataset.
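The training-time forward pass of batch normalization for one layer can be sketched as follows. The function name `batch_norm_forward`, the value of `eps` and the sample $\gamma$, $\beta$ are assumptions; the running averages used at test time are left out.

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Normalise each row of Z over the mini-batch, then scale and shift.

    Z: shape (n, m); gamma, beta: shape (n, 1).
    """
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    return gamma * Z_norm + beta, mu, var

rng = np.random.default_rng(0)
Z = rng.normal(loc=3.0, scale=2.0, size=(4, 64))   # 4 units, mini-batch of 64
gamma = np.full((4, 1), 1.5)
beta = np.full((4, 1), -0.5)
Z_tilde, mu, var = batch_norm_forward(Z, gamma, beta)
```

Each row of `Z_tilde` ends up with mean $\beta$ and standard deviation $\gamma$, whatever the original mean of `Z` was, which also shows why a separate bias $b$ has no effect. The returned `mu` and `var` are what one would feed into the exponentially weighted averages.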

Multi-class classification

Softmax layer

hard max : [1, 0, 0, 0]

soft max : [0.8, 0.1, 0.002, 0.098]

To do multi-class classification, we change the last activation function.

We know that $z^{[L]} = W^{[L]} a^{[L-1]} + b^{[L]}$, then we define $a^{[L]}$ as follows:

\[t = \exp(z^{[L]}), \quad \hat{y}=a^{[L]}=\frac{t}{\sum t_i}\]

then $a_i^{[L]}$ represents the probability that the sample belongs to class $i$.

If we define $a^{[L]}=g^{[L]}(z^{[L]})$, the function $g^{[L]}$ takes a vector as input and the output is also a vector. It’s different from the other activation functions that we have seen.

Loss function

For example, we have

\[y=\begin{bmatrix}0\\1\\0\\0\end{bmatrix} ,\quad \hat{y}=\begin{bmatrix}0.3\\0.2\\0.1\\0.4\end{bmatrix}\]

then the loss function

\[L(\hat{y},y)=-\sum\limits_j y_j\log\hat{y}_j=-\log(0.2)\]
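The softmax activation and this loss can be sketched in NumPy as below. Subtracting the maximum score before exponentiating is a numerical-stability step added here as an assumption; it does not change the result. The score vector passed to `softmax` is made up for the example.

```python
import numpy as np

def softmax(z):
    """Softmax over a 1-D vector of scores."""
    t = np.exp(z - np.max(z))   # shifting by the max leaves the ratios unchanged
    return t / t.sum()

def cross_entropy(y_hat, y):
    """Loss -sum_j y_j log(y_hat_j) for a one-hot label y."""
    return -np.sum(y * np.log(y_hat))

# The example above: the true class is the second one.
y = np.array([0.0, 1.0, 0.0, 0.0])
y_hat = np.array([0.3, 0.2, 0.1, 0.4])
loss = cross_entropy(y_hat, y)   # equals -log(0.2)

probs = softmax(np.array([5.0, 2.0, -1.0, 3.0]))
```

Only the predicted probability of the true class enters the loss, so minimizing it pushes that probability towards 1.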

Deep learning frameworks

How to choose ?

Structuring Machine Learning Projects