Backpropagation of Gradients

The values of layer $l$ in the neural network, after applying the activation function, are stored in a column vector $a^l$. The superscript denotes the layer number. The connections are stored in a weight matrix $W^l$, and the bias column vector is denoted $b^l$. The forward propagation is then obtained as


$$a^l = \sigma\!\left(W^l a^{l-1} + b^l\right)$$

We introduce a new vector $z^l = W^l a^{l-1} + b^l$, which holds the layer $l$ values before the transfer function $\sigma(\cdot)$ is applied. In summary, the forward pass of the neural network is

$$a^l = \sigma(z^l), \qquad z^l = W^l a^{l-1} + b^l$$
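As a minimal sketch of this forward step (assuming a logistic sigmoid for $\sigma$, which the text does not fix, and NumPy column vectors), one layer can be computed as:

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic transfer function, one possible choice for sigma."""
    return 1.0 / (1.0 + np.exp(-z))

def forward_layer(W, a_prev, b):
    """One forward step: z^l = W^l a^{l-1} + b^l and a^l = sigma(z^l)."""
    z = W @ a_prev + b
    return z, sigmoid(z)
```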

To simplify the procedure, we assume a simple 3-layer network with 3 inputs, 4 hidden units, and 2 outputs.

In this network, we have

  • Input: $a^0$
  • Layer 1: $z^1 = W^1 a^0 + b^1$ and $a^1 = \sigma(z^1)$
  • Layer 2: $z^2 = W^2 a^1 + b^2$ and $a^2 = \sigma(z^2)$

where the dimensions are $[W^1]_{4\times 3}$, $[W^2]_{2\times 4}$, and $[a^0]_{3\times 1}$, $[a^1]_{4\times 1}$, $[b^1]_{4\times 1}$, $[z^1]_{4\times 1}$, $[z^2]_{2\times 1}$, $[a^2]_{2\times 1}$. Our objective is to minimize the distance between the network's output $a^2$ and the target $t$. A simple cost function here is $C = \tfrac{1}{2}\lVert a^2 - t\rVert^2$. The weight matrices and bias vectors are updated using the gradient descent method, where we need the derivatives of the cost function with respect to the weights and biases:

$$W^l = W^l - \alpha\,\frac{\partial C}{\partial W^l}, \qquad b^l = b^l - \alpha\,\frac{\partial C}{\partial b^l}, \qquad l \in \{1, 2\}$$
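As a concrete numerical sketch of this setup (random data purely for illustration; the logistic sigmoid and the helper name `sgd_step` are assumptions, not from the text), the forward pass, the cost, and the update step look like:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Shapes from the text: 3 inputs, 4 hidden units, 2 outputs.
a0 = rng.normal(size=(3, 1))          # input column vector
t  = rng.normal(size=(2, 1))          # target
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=(2, 1))

# Forward pass through both layers.
z1 = W1 @ a0 + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
a2 = sigmoid(z2)

# Cost C = 1/2 * ||a2 - t||^2
C = 0.5 * np.sum((a2 - t) ** 2)

def sgd_step(param, grad, alpha=0.1):
    """Gradient-descent update: param <- param - alpha * dC/dparam."""
    return param - alpha * grad
# e.g. W2 = sgd_step(W2, dC_dW2) once the gradients derived below are available.
```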

We start by finding the derivative of the cost w.r.t. $W^2$. We have

$$\frac{\partial C}{\partial W^2} = (a^2 - t)\odot\frac{\partial a^2}{\partial W^2} = (a^2 - t)\odot\sigma'(z^2)\,\frac{\partial z^2}{\partial W^2} = (a^2 - t)\odot\sigma'(z^2)\,\frac{\partial \left(W^2 a^1 + b^2\right)}{\partial W^2}$$

where $A \odot B$ is the entry-wise product (i.e. $(A \odot B)_i = A_i B_i$). We further simplify the above relation as

$$\frac{\partial C}{\partial W^2} = \left[(a^2 - t)\odot\sigma'(z^2)\right]\left[a^1\right]^{\top} = \delta^2\left[a^1\right]^{\top}, \qquad \delta^2 \equiv (a^2 - t)\odot\sigma'(z^2)$$

Notice that the dimensions work out perfectly as

$$\left[\frac{\partial C}{\partial W^2}\right]_{2\times 4} = [\delta^2]_{2\times 1}\,\bigl([a^1]_{4\times 1}\bigr)^{\!\top}$$

Furthermore, we need to calculate the derivative with respect to $b^2$, where we similarly find that

$$\left[\frac{\partial C}{\partial b^2}\right]_{2\times 1} = (a^2 - t)\odot\sigma'(z^2)\,\frac{\partial \left(W^2 a^1 + b^2\right)}{\partial b^2} = [\delta^2]_{2\times 1}$$
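Continuing the numerical sketch above, the two layer-2 gradients follow directly from $\delta^2$ (here `sigmoid_prime` implements $\sigma'$ for the assumed logistic sigmoid):

```python
def sigmoid_prime(z):
    """Derivative of the logistic sigmoid, element-wise."""
    s = sigmoid(z)
    return s * (1.0 - s)

# delta^2 = (a^2 - t) ⊙ sigma'(z^2), shape (2, 1)
delta2 = (a2 - t) * sigmoid_prime(z2)

dC_dW2 = delta2 @ a1.T      # (2,1) @ (1,4) -> (2,4), matches W2
dC_db2 = delta2             # (2,1), matches b2
```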

Now, taking the derivatives with respect to $W^1$, we find

$$\begin{aligned}
\frac{\partial C}{\partial W^1} &= (a^2 - t)\odot\frac{\partial a^2}{\partial W^1} = \left[(a^2 - t)\odot\sigma'(z^2)\right]\frac{\partial z^2}{\partial W^1} = \delta^2\,\frac{\partial \left(W^2 a^1 + b^2\right)}{\partial W^1} \\
&= [W^2]^{\top}\delta^2\odot\sigma'(z^1)\,\frac{\partial z^1}{\partial W^1} = [W^2]^{\top}\delta^2\odot\sigma'(z^1)\,\frac{\partial \left(W^1 a^0 + b^1\right)}{\partial W^1}
\end{aligned}$$

To summarize, we have

$$\frac{\partial C}{\partial W^1} = \left[(W^2)^{\top}\delta^2\odot\sigma'(z^1)\right]\left[a^0\right]^{\top} = \delta^1\left[a^0\right]^{\top}, \qquad \delta^1 \equiv [W^2]^{\top}\delta^2\odot\sigma'(z^1)$$

In terms of dimensions we have

$$\left[\frac{\partial C}{\partial W^1}\right]_{4\times 3} = [\delta^1]_{4\times 1}\,\bigl([a^0]_{3\times 1}\bigr)^{\!\top}, \qquad [\delta^1]_{4\times 1} = \left[(W^2)^{\top}\right]_{4\times 2}[\delta^2]_{2\times 1}\odot\left[\sigma'(z^1)\right]_{4\times 1}$$

Similarly, taking the derivative w.r.t. the bias vector $b^1$, we find that

$$\left[\frac{\partial C}{\partial b^1}\right]_{4\times 1} = [\delta^1]_{4\times 1}$$
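Continuing the same sketch, the layer-1 gradients are:

```python
# delta^1 = (W^2)^T delta^2 ⊙ sigma'(z^1), shape (4, 1)
delta1 = (W2.T @ delta2) * sigmoid_prime(z1)

dC_dW1 = delta1 @ a0.T      # (4,1) @ (1,3) -> (4,3), matches W1
dC_db1 = delta1             # (4,1), matches b1
```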

To summarize, we obtained the following result

$$\frac{\partial C}{\partial W^2} = \delta^2\left[a^1\right]^{\top}, \qquad \frac{\partial C}{\partial b^2} = \delta^2, \qquad \delta^2 \equiv (a^2 - t)\odot\sigma'(z^2)$$

$$\frac{\partial C}{\partial W^1} = \delta^1\left[a^0\right]^{\top}, \qquad \frac{\partial C}{\partial b^1} = \delta^1, \qquad \delta^1 \equiv [W^2]^{\top}\delta^2\odot\sigma'(z^1)$$
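A quick way to check these closed-form results is a central finite-difference comparison against the numerical sketch above (the check below targets $\partial C/\partial W^1$; the other parameters can be verified the same way):

```python
def cost(W1_, b1_, W2_, b2_):
    """Recompute C for perturbed parameters, with a0 and t fixed."""
    a1_ = sigmoid(W1_ @ a0 + b1_)
    a2_ = sigmoid(W2_ @ a1_ + b2_)
    return 0.5 * np.sum((a2_ - t) ** 2)

eps = 1e-6
num_dW1 = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        Wp, Wm = W1.copy(), W1.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        num_dW1[i, j] = (cost(Wp, b1, W2, b2) - cost(Wm, b1, W2, b2)) / (2 * eps)

assert np.allclose(num_dW1, dC_dW1, atol=1e-6)
```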

Note that if we increase the number of layers, a similar pattern shows up.
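To illustrate that pattern, here is a minimal loop-based backward pass under the same logistic-sigmoid assumption (the function name and the list-based parameter layout are illustrative choices, not from the text):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(weights, biases, a0, t):
    """Gradients of C = 1/2 ||a^L - t||^2 for any number of layers.

    weights[l-1], biases[l-1] hold W^l and b^l for l = 1..L.
    """
    # Forward pass, storing every z^l and a^l.
    a_vals, z_vals = [a0], []
    for W, b in zip(weights, biases):
        z = W @ a_vals[-1] + b
        z_vals.append(z)
        a_vals.append(sigmoid(z))

    # Backward pass: delta^L = (a^L - t) ⊙ sigma'(z^L),
    # then delta^l = (W^{l+1})^T delta^{l+1} ⊙ sigma'(z^l).
    grads_W, grads_b = [], []
    delta = (a_vals[-1] - t) * sigmoid_prime(z_vals[-1])
    for l in reversed(range(len(weights))):
        grads_W.insert(0, delta @ a_vals[l].T)   # dC/dW^{l+1}
        grads_b.insert(0, delta)                 # dC/db^{l+1}
        if l > 0:
            delta = (weights[l].T @ delta) * sigmoid_prime(z_vals[l - 1])
    return grads_W, grads_b
```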