Backpropagation

November 5, 2023
The following are my notes, prepared for the students as part of tutoring the Neural Networks class at Saarland University in fall 2023. Originally typeset in LaTeX, I decided that they are also a nice addition to my website. Most of the content is a more detailed discussion of the backpropagation section in Christopher Bishop's excellent PRML book (Bishop, 2006), with the addition of a few examples.
Consider a fully connected feed-forward neural network with one hidden layer of $M$ neurons, $D$ inputs and $K$ output units. A network of this form is depicted in Figure 1.
A forward pass involves the following calculations.
$$
\begin{aligned}
\mathbf{a}^{(1)} &= \boldsymbol{W}^{(1)}\mathbf{x} & a^{(1)}_m &= \sum_{d=1}^D \boldsymbol{W}^{(1)}_{md}x_d\\
\mathbf{z} &= h^{(1)} \left(\mathbf{a}^{(1)}\right) & z_m &= h^{(1)}\left(a^{(1)}_m\right)\\
\mathbf{a}^{(2)} &= \boldsymbol{W}^{(2)}\mathbf{z} & a^{(2)}_k &= \sum_{m=1}^M \boldsymbol{W}^{(2)}_{km}z_m\\
\mathbf{\hat{y}} &= h^{(2)} \left(\mathbf{a}^{(2)}\right) & \hat{y}_k &= h^{(2)}\left(a^{(2)}_k\right)
\end{aligned}
$$
Here $\mathbf{x} \in \mathbb{R}^{D}$, $\mathbf{a}^{(1)}, \mathbf{z} \in \mathbb{R}^{M}$, $\mathbf{a}^{(2)}, \mathbf{\hat{y}} \in \mathbb{R}^{K}$ and $\boldsymbol{W}^{(1)} \in \mathbb{R}^{M \times D}$, $\boldsymbol{W}^{(2)} \in \mathbb{R}^{K \times M}$.
The function $h^{(1)}$ is the activation function applied after the first set of activations $\mathbf{a}^{(1)}$, and $h^{(2)}$ is the output activation function.
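To make the notation concrete, here is a minimal NumPy sketch of this forward pass. The dimensions, the random weights, and the choice of a tanh hidden activation with an identity output activation are my own assumptions for illustration; they are not fixed by the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

D, M, K = 3, 4, 2                  # input, hidden and output dimensions (arbitrary)
W1 = rng.normal(size=(M, D))       # W^(1), shape M x D
W2 = rng.normal(size=(K, M))       # W^(2), shape K x M

def forward(x, W1, W2):
    """Forward pass of the one-hidden-layer network described above."""
    a1 = W1 @ x                    # a^(1) = W^(1) x
    z = np.tanh(a1)                # z = h^(1)(a^(1)); tanh is just an example choice
    a2 = W2 @ z                    # a^(2) = W^(2) z
    y_hat = a2                     # y_hat = h^(2)(a^(2)); identity output activation
    return a1, z, a2, y_hat

x = rng.normal(size=D)
a1, z, a2, y_hat = forward(x, W1, W2)
print(y_hat.shape)                 # (K,)
```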
Now we will consider a separable loss or error function $\mathrm{J}$.
$$
\mathrm{J}\left(\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^N; \boldsymbol{W}^{(1)}, \boldsymbol{W}^{(2)}\right) = \sum_{n=1}^N \mathrm{J}_n \left( \mathbf{x}_n, \mathbf{\hat{y}}_n ; \boldsymbol{W}^{(1)}, \boldsymbol{W}^{(2)} \right)
$$
To minimize the loss we are interested in the gradients of the weight matrices. From now on we will only consider the loss $\mathrm{J}_n$ of a single sample $(\mathbf{x}_n, \mathbf{\hat{y}}_n)$. Let's first consider the loss w.r.t. entry $\boldsymbol{W}^{(2)}_{ji}$. Because $\boldsymbol{W}^{(2)}_{ji}$ influences the output $\mathbf{\hat{y}}$ only through the activation $a_j^{(2)}$, by applying the chain rule we may also write:
$$
\begin{aligned}
\frac{\partial \mathrm{J}_n}{\partial \boldsymbol{W}^{(2)}_{ji}} &= \frac{\partial \mathrm{J}_n}{\partial a_j^{(2)}} \frac{\partial a_j^{(2)}}{\partial \boldsymbol{W}^{(2)}_{ji}}
\end{aligned}
$$
We can further simplify by again applying the chain rule, this time to the first term:
$$
\begin{aligned}
&= \frac{\partial \mathrm{J}_n}{\partial \hat{y}_j} \frac{\partial \hat{y}_j}{\partial a_j^{(2)}} \frac{\partial a_j^{(2)}}{\partial \boldsymbol{W}^{(2)}_{ji}}
\end{aligned}
$$
Here $\frac{\partial \mathrm{J}_n}{\partial \hat{y}_j}$ depends on the loss function. The derivatives of e.g. the squared error or the cross-entropy error function read:
$$
\frac{\partial \mathrm{J}_n}{\partial \hat{y}_j} = \begin{cases}
\hat{y}_j - y_{nj} & \text{when } \mathrm{J}_n = \frac{1}{2}\left( \hat{y}_j - y_{nj} \right)^2\\
-\frac{y_{nj}}{\hat{y}_j} & \text{when } \mathrm{J}_n = -\sum_{k=1}^K y_{nk} \log (\hat{y}_k)
\end{cases}
$$
The second term $\frac{\partial \hat{y}_j}{\partial a_j^{(2)}}$ is determined solely by the output activation function $h^{(2)}$ and reduces to $1$ if it is the identity function $h^{(2)}\left(\mathbf{a}^{(2)}\right) = \mathbf{a}^{(2)}$ (as is usually the case in regression). The remaining term is trivial to calculate:
$$
\begin{aligned}
\frac{\partial a_j^{(2)}}{\partial \boldsymbol{W}^{(2)}_{ji}} = \frac{\partial }{\partial \boldsymbol{W}^{(2)}_{ji}} \left( \sum_{m=1}^M \boldsymbol{W}^{(2)}_{jm}z_m \right)
= z_i
\end{aligned}
$$
Therefore the partial derivative for entry $ji$ of $\boldsymbol{W}^{(2)}$ reads:
$$
\begin{aligned}
\frac{\partial \mathrm{J}_n}{\partial \boldsymbol{W}^{(2)}_{ji}} &= \frac{\partial \mathrm{J}_n}{\partial a_j^{(2)}} \frac{\partial a_j^{(2)}}{\partial \boldsymbol{W}^{(2)}_{ji}} \\
&=\left[ \frac{\partial \mathrm{J}_n}{\partial \hat{y}_j} \frac{\partial \hat{y}_j}{\partial a_j^{(2)}} \right] \cdot z_i \\
&= \underbrace{\left[ \frac{\partial \mathrm{J}_n}{\partial \hat{y}_j} \cdot h^{(2)\prime} \left(a_j^{(2)} \right) \right]}_{\delta^{(2)}_j} \cdot z_i
= \delta^{(2)}_j \cdot z_i
\end{aligned}
$$
We denote this term by $\delta^{(2)}_j := \frac{\partial \mathrm{J}_n}{\partial a_j^{(2)}}$ and call it the error term.
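Collecting all entries $j$ and $i$, the gradient of the whole matrix $\boldsymbol{W}^{(2)}$ is the outer product of the error vector $\boldsymbol{\delta}^{(2)}$ and $\mathbf{z}$. A minimal NumPy sketch of this, assuming a squared-error loss with an identity output activation and arbitrary placeholder values for $\mathbf{z}$, $\mathbf{\hat{y}}$ and $\mathbf{y}$:

```python
import numpy as np

rng = np.random.default_rng(0)
M, K = 4, 2                        # hidden and output dimensions (arbitrary)

z = rng.normal(size=M)             # hidden activations from the forward pass
y_hat = rng.normal(size=K)         # network outputs (identity output activation)
y = rng.normal(size=K)             # targets

delta2 = y_hat - y                 # delta^(2)_j for J_n = 1/2 * sum_j (y_hat_j - y_j)^2
grad_W2 = np.outer(delta2, z)      # (dJ_n/dW^(2))_{ji} = delta^(2)_j * z_i, shape (K, M)
print(grad_W2.shape)               # (2, 4)
```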
The calculation of the partial derivatives of $\boldsymbol{W}^{(1)}_{ji}$ is a little more involved because it requires careful consideration of which terms are influenced by $\boldsymbol{W}^{(1)}_{ji}$ in later layers. As before, $\boldsymbol{W}^{(1)}_{ji}$ influences $\mathbf{\hat{y}}$ only through $a^{(1)}_j$, hence we may write:
$$
\frac{\partial \mathrm{J}_n}{\partial \boldsymbol{W}^{(1)}_{ji}} = \frac{\partial \mathrm{J}_n}{\partial a^{(1)}_j} \frac{\partial a^{(1)}_j}{\partial \boldsymbol{W}^{(1)}_{ji}}
$$
To begin with, we focus on the first term. Notice that $\mathrm{J}_n$ depends on $a^{(1)}_j$ only through $z_j$:
$$
\begin{aligned}
\frac{\partial \mathrm{J}_n}{\partial a^{(1)}_j} &= \frac{\partial \mathrm{J}_n}{\partial z_j} \frac{\partial z_j}{\partial a^{(1)}_j}
\end{aligned}
$$
This time however, $z_j$ is involved in the calculation of all output units $\hat{y}_k$ through $a^{(2)}_k = \sum_{m=1}^M \boldsymbol{W}^{(2)}_{km}z_m$. Therefore, to express $\frac{\partial \mathrm{J}_n}{\partial z_j}$ in terms of the $a^{(2)}_k$, we need to sum over all $k=1,\dots, K$:
$$
\begin{aligned}
&= \sum_{k=1}^{K} \frac{\partial \mathrm{J}_n}{\partial a^{(2)}_k} \frac{\partial a^{(2)}_k}{\partial z_j} \frac{\partial z_j}{\partial a^{(1)}_j}
\end{aligned}
$$
By definition, we know $\frac{\partial \mathrm{J}_n}{\partial a^{(2)}_k} = \delta^{(2)}_k$. It's easy to verify that
$\frac{\partial z_j}{\partial a^{(1)}_j} = h^{(1)\prime}\left(a^{(1)}_j\right)$, and similarly, by writing out the definition of $\frac{\partial a^{(2)}_k}{\partial z_j}$, it's clear that the result of that term is $\boldsymbol{W}^{(2)}_{kj}$.
Therefore we obtain:
$$
\begin{aligned}
&= \sum_{k=1}^{K} \delta^{(2)}_k\boldsymbol{W}^{(2)}_{kj} h^{(1)\prime}\left(a^{(1)}_j\right)\\
&= h^{(1)\prime}\left(a^{(1)}_j\right) \cdot \sum_{k=1}^{K} \boldsymbol{W}^{(2)}_{kj} \delta^{(2)}_k
\end{aligned}
$$
Only the partial derivative $\frac{\partial a^{(1)}_j}{\partial \boldsymbol{W}^{(1)}_{ji}}$ is left; however, the calculation is trivial:
$$
\frac{\partial a^{(1)}_j}{\partial \boldsymbol{W}^{(1)}_{ji}} = \frac{\partial }{\partial \boldsymbol{W}^{(1)}_{ji}} \left( \sum_{d=1}^D \boldsymbol{W}^{(1)}_{jd}x_d \right) = x_i
$$
Consequently, every entry $ji$ of $\boldsymbol{W}^{(1)}$ has the partial derivative:
$$
\begin{aligned}
\frac{\partial \mathrm{J}_n}{\partial \boldsymbol{W}^{(1)}_{ji}} &= \frac{\partial \mathrm{J}_n}{\partial a^{(1)}_j} \frac{\partial a^{(1)}_j}{\partial \boldsymbol{W}^{(1)}_{ji}} \\
&=\left[ \sum_{k=1}^{K} \frac{\partial \mathrm{J}_n}{\partial a^{(2)}_k} \frac{\partial a^{(2)}_k}{\partial z_j} \frac{\partial z_j}{\partial a^{(1)}_j} \right] \frac{\partial a^{(1)}_j}{\partial \boldsymbol{W}^{(1)}_{ji}}\\
&=\left[ \sum_{k=1}^{K} \frac{\partial \mathrm{J}_n}{\partial a^{(2)}_k} \frac{\partial a^{(2)}_k}{\partial z_j} \frac{\partial z_j}{\partial a^{(1)}_j} \right] \cdot x_i\\
&= \underbrace{\left[ h^{(1)\prime}\left(a^{(1)}_j\right) \cdot \sum_{k=1}^{K} \boldsymbol{W}^{(2)}_{kj} \delta^{(2)}_k \right]}_{\delta^{(1)}_j} \cdot x_i = \delta^{(1)}_j \cdot x_i
\end{aligned}
$$
Note that although we assumed a neural network with a single hidden layer, the partial derivatives for deeper layers take the same form as for $\boldsymbol{W}^{(1)}$. This can be seen by defining $\mathbf{x} = \mathbf{z}^{(0)} = h^{(0)}\left(\mathbf{a}^{(0)}\right)$ as the output of the previous layer, with $\mathbf{a}^{(0)} = \boldsymbol{W}^{(0)}\mathbf{z}^{(-1)}$.
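To tie the pieces together, the following self-contained NumPy sketch implements the forward pass and the $\delta$ recursion above for the two-layer network, and checks one gradient entry against a finite-difference approximation. The tanh hidden activation, the squared-error loss, the sizes and all variable names are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, K = 3, 4, 2

W1 = rng.normal(size=(M, D))
W2 = rng.normal(size=(K, M))
x = rng.normal(size=D)
y = rng.normal(size=K)

def loss_and_grads(W1, W2, x, y):
    """Squared-error loss of the one-hidden-layer network (tanh hidden units,
    identity output) and its gradients, following the delta recursion above."""
    # forward pass
    a1 = W1 @ x
    z = np.tanh(a1)
    a2 = W2 @ z
    y_hat = a2
    J = 0.5 * np.sum((y_hat - y) ** 2)

    # backward pass
    delta2 = y_hat - y                       # dJ_n/da^(2) (identity output activation)
    grad_W2 = np.outer(delta2, z)            # delta^(2)_j * z_i
    delta1 = (1 - z ** 2) * (W2.T @ delta2)  # h^(1)'(a^(1)) * sum_k W^(2)_{kj} delta^(2)_k
    grad_W1 = np.outer(delta1, x)            # delta^(1)_j * x_i
    return J, grad_W1, grad_W2

# finite-difference check of a single entry of W^(1)
J, grad_W1, grad_W2 = loss_and_grads(W1, W2, x, y)
eps = 1e-6
W1_pert = W1.copy()
W1_pert[1, 2] += eps
J_pert, _, _ = loss_and_grads(W1_pert, W2, x, y)
print(grad_W1[1, 2], (J_pert - J) / eps)     # the two values should agree closely
```

For a mini-batch, the per-sample gradients are simply summed (or averaged), matching the separable form of $\mathrm{J}$ above.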
Example 1. Consider a simple linear regression network with one hidden layer, ReLU activation functions and one output unit, i.e.
$$
\begin{aligned}
\mathbf{z} &= h^{(1)}\left(\mathbf{a}^{(1)}\right) = \max\left(0, \mathbf{a}^{(1)} \right)\\
\hat{y}&= h^{(2)}\left(a^{(2)}\right) = a^{(2)}
\end{aligned}
$$
The error function is given by the squared difference
$$
\mathrm{J}_n(\mathbf{x}_n, y_n) = \frac{1}{2} (\hat{y}_n - y_n)^2
$$
From the general result for $\frac{\partial \mathrm{J}_n}{\partial \boldsymbol{W}^{(2)}_{ji}}$ derived above, we know that we need to calculate $\delta^{(2)}$:
$$
\delta^{(2)} = \frac{\partial \mathrm{J}_n}{\partial \hat{y}} \cdot h^{(2)\prime}\left(a^{(2)} \right) = \frac{\partial \mathrm{J}_n}{\partial \hat{y}} = (\hat{y}_n - y_n)
$$
where we used the fact that $h^{(2)\prime}\left(a^{(2)} \right) = 1$ for the identity activation function at the last layer. Hence for the weights in the last layer we have:
$$
\frac{\partial \mathrm{J}_n}{\partial \boldsymbol{W}^{(2)}_{1i}} = \delta^{(2)} \cdot z_i = (\hat{y}_n - y_n) \cdot z_i
$$
We write $\boldsymbol{W}^{(2)}_{1i}$ instead of $\boldsymbol{W}^{(2)}_{i}$ to make it clear that we have only one output. The only unknown term in the partial derivatives of the weights $\boldsymbol{W}^{(1)}$ in the first layer is $\frac{\partial z_j}{\partial a^{(1)}_j}$:
$$
\begin{aligned}
\frac{\partial z_j}{\partial a^{(1)}_j} &= h^{(1)\prime}\left(a^{(1)}_j\right) = \text{ReLU}'\left( a^{(1)}_j \right) = \begin{cases}
1 & \text{if } a^{(1)}_j > 0 \\
0 & \text{otherwise}
\end{cases}
\end{aligned}
$$
Making use of the general result for $\frac{\partial \mathrm{J}_n}{\partial \boldsymbol{W}^{(1)}_{ji}}$ derived above, it is clear that
$$
\begin{aligned}
\frac{\partial \mathrm{J}_n}{\partial \boldsymbol{W}^{(1)}_{ji}} &= \text{ReLU}'\left( a^{(1)}_j \right) \cdot \left[ \sum_{k=1}^K \boldsymbol{W}_{kj}^{(2)} \delta^{(2)}_k \right] \cdot x_i\\
&=\text{ReLU}'\left( a^{(1)}_j \right) \cdot \boldsymbol{W}_{1j}^{(2)} \delta^{(2)} \cdot x_i\\
&= \text{ReLU}'\left( a^{(1)}_j \right) \cdot \boldsymbol{W}_{1j}^{(2)} (\hat{y}_n - y_n) \cdot x_i\\
&= \mathbb{I}\left[ a^{(1)}_j > 0 \right] \cdot \boldsymbol{W}_{1j}^{(2)} (\hat{y}_n - y_n) \cdot x_i
\end{aligned}
$$
where the sum over $k$ vanishes because we have only one output unit.
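A short NumPy sketch of these gradients (the sizes, random weights and the scalar target are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
D, M = 3, 4

W1 = rng.normal(size=(M, D))
W2 = rng.normal(size=(1, M))       # a single output unit, hence one row
x = rng.normal(size=D)
y = 0.7                            # scalar regression target (arbitrary)

# forward pass (ReLU hidden units, identity output)
a1 = W1 @ x
z = np.maximum(0.0, a1)
y_hat = (W2 @ z).item()            # a^(2) and y_hat are scalars here

# backward pass
delta2 = y_hat - y                                  # error term of the single output unit
grad_W2 = delta2 * z[None, :]                       # dJ_n/dW^(2)_{1i} = (y_hat - y) * z_i
delta1 = (a1 > 0).astype(float) * W2[0] * delta2    # I[a^(1)_j > 0] * W^(2)_{1j} * (y_hat - y)
grad_W1 = np.outer(delta1, x)                       # dJ_n/dW^(1)_{ji} = delta^(1)_j * x_i
```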
Example 2. This time we consider a binary classification task, i.e. we use the binary cross-entropy loss and a sigmoid activation function in the last layer. The network structure is the same as in Example 1. The binary cross-entropy is defined as:
$$
\mathrm{J}_n(\mathbf{x}_n, y_n) = -y_n\cdot\log{\hat{y}_n} - (1-y_n)\cdot\log(1-\hat{y}_n)
$$
Therefore
$$
\frac{\partial \mathrm{J}_n}{\partial \hat{y}} = -\frac{y_n}{\hat{y}_n} + \frac{1-y_n}{1 - \hat{y}_n} = \frac{\hat{y}_n - y_n}{\hat{y}_n\cdot(1-\hat{y}_n)}
$$
We also know that $h^{(2)} = \sigma$ and $\sigma'\left(a^{(2)} \right) = \sigma\left(a^{(2)} \right) \cdot \left[ 1 - \sigma\left(a^{(2)} \right) \right]$. Hence we may write:
$$
\begin{aligned}
\frac{\partial \mathrm{J}_n}{\partial \boldsymbol{W}^{(2)}_{1i}} &= \frac{\partial \mathrm{J}_n}{\partial \hat{y}} \cdot \sigma'\left(a^{(2)} \right) \cdot z_i\\
&= \frac{\hat{y}_n - y_n}{\hat{y}_n\cdot(1-\hat{y}_n)} \cdot \sigma\left(a^{(2)} \right) \cdot \left[ 1 - \sigma\left(a^{(2)} \right) \right] \cdot z_i
\end{aligned}
$$
Using the fact that $\hat{y}_n = \sigma\left(a^{(2)} \right)$, we are left with:
$$
= (\hat{y}_n - y_n) \cdot z_i
$$
As a consequence, the partial derivatives w.r.t. $\boldsymbol{W}^{(1)}$ take the same form as in Example 1.
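This simplification is easy to confirm numerically. A small, self-contained check, assuming arbitrary values for the pre-activation and the label:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def bce(a2, y):
    """Binary cross-entropy as a function of the output pre-activation a^(2)."""
    y_hat = sigmoid(a2)
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

a2, y = 0.3, 1.0                   # arbitrary pre-activation and binary label
eps = 1e-6

numeric = (bce(a2 + eps, y) - bce(a2 - eps, y)) / (2 * eps)   # central difference
analytic = sigmoid(a2) - y                                    # the simplified error term
print(numeric, analytic)           # the two values should agree closely
```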
Example 3. This example is an extension of Example 2. Instead of a binary classification task, we consider a classification task with $K$ classes; accordingly, we use the cross-entropy loss together with the softmax activation function in the last layer.
$$
\begin{aligned}
\mathrm{J}_n &= -\sum_{k=1}^K y_k \log (\hat{y}_k)\\
\hat{y}_j &= h^{(2)}\left(a^{(2)}_j \right) = \mathcal{S}\left(a^{(2)}_j \right) = \frac{\exp\left(a^{(2)}_j\right)}{\sum_{k=1}^{K} \exp \left(a^{(2)}_k\right) }
\end{aligned}
$$
We calculate $\delta^{(2)}_j$ first.
Because with the softmax activation function every $\hat{y}_j$ depends on all $a^{(2)}_k$, we don't simply have
$$
\frac{\partial \mathrm{J}_n}{\partial a^{(2)}_j} =\frac{\partial \mathrm{J}_n}{\partial \hat{y}_j}\frac{\partial \hat{y}_j}{\partial a^{(2)}_j}
$$
but rather:
$$
\begin{aligned}
\delta^{(2)}_j &= \frac{\partial \mathrm{J}_n}{\partial a^{(2)}_j}
= \sum_{k=1}^{K}\frac{\partial \mathrm{J}_n}{\partial \hat{y}_k}\frac{\partial \hat{y}_k}{\partial a^{(2)}_j}
\end{aligned}
$$
Here we obtain:
$$
\begin{aligned}
\frac{\partial \mathrm{J}_n}{\partial \hat{y}_k} = -\frac{y_k}{\hat{y}_k}
\end{aligned}
$$
and
$$
\begin{aligned}
\frac{\partial \hat{y}_k}{\partial a^{(2)}_j} &= \frac{\partial }{\partial a^{(2)}_j} \left(\mathcal{S}\left(a^{(2)}_k \right)\right)
= \frac{\partial }{\partial a^{(2)}_j} \left( \frac{\exp\left(a^{(2)}_k\right)}{\sum_{i=1}^{K} \exp \left(a^{(2)}_i\right) } \right)\\
&= \mathcal{S}\left(a^{(2)}_k \right) \left(\mathbb{I}[k=j] - \mathcal{S}\left(a^{(2)}_j \right) \right)\\
&= \hat{y}_k (\mathbb{I}[k=j] - \hat{y}_j)
\end{aligned}
$$
Therefore
$$
\begin{aligned}
\delta^{(2)}_j &= \frac{\partial \mathrm{J}_n}{\partial a^{(2)}_j} = \sum_{k=1}^{K}\frac{\partial \mathrm{J}_n}{\partial \hat{y}_k}\frac{\partial \hat{y}_k}{\partial a^{(2)}_j}\\
&= -\sum_{k=1}^{K} \frac{y_k}{\hat{y}_k} \cdot \hat{y}_k (\mathbb{I}[k=j] - \hat{y}_j) \\
&= -\sum_{k=1}^{K} y_k \cdot (\mathbb{I}[k=j] - \hat{y}_j)\\
&= \left(\sum_{k=1}^{K} -y_k \cdot \mathbb{I}[k=j]\right) +\sum_{k=1}^K y_k \cdot \hat{y}_j \\
&= -y_j + \hat{y}_j \cdot \sum_{k=1}^K y_k \\
&= \hat{y}_j - y_j
\end{aligned}
$$
where we have used the fact that $\sum_{k=1}^K y_k = 1$ to obtain the last equality.
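As in the previous examples, the result $\delta^{(2)}_j = \hat{y}_j - y_j$ can be verified numerically; the following self-contained NumPy check compares it against central finite differences of the cross-entropy (the dimensions and values are arbitrary assumptions):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))              # shift for numerical stability
    return e / e.sum()

def cross_entropy(a2, y):
    """Cross-entropy as a function of the output pre-activations a^(2)."""
    return -np.sum(y * np.log(softmax(a2)))

rng = np.random.default_rng(2)
K = 4
a2 = rng.normal(size=K)                    # arbitrary output pre-activations
y = np.zeros(K)
y[1] = 1.0                                 # one-hot target

eps = 1e-6
numeric = np.zeros(K)
for j in range(K):                         # central differences w.r.t. each a^(2)_j
    e_j = np.zeros(K)
    e_j[j] = eps
    numeric[j] = (cross_entropy(a2 + e_j, y) - cross_entropy(a2 - e_j, y)) / (2 * eps)

analytic = softmax(a2) - y                 # delta^(2) = y_hat - y
print(np.max(np.abs(numeric - analytic)))  # should be close to zero
```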
And we again have:
$$
\frac{\partial \mathrm{J}_n}{\partial \boldsymbol{W}^{(2)}_{ji}} = (\hat{y}_{nj} - y_{nj}) \cdot z_i
$$
Figure 1: Example feed-forward neural network with one hidden layer.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer-Verlag New York. https://link.springer.com/book/9780387310732