Backpropagation of the Cross-Entropy Loss

For a single-label multi-class classification problem, suppose the input to the network's output layer is $Z_{in}=[z_1,\cdots,z_i,\cdots,z_n]$, the output is $\hat Y=[\hat y_1,\cdots,\hat y_i,\cdots,\hat y_n]$, and the true class label is $Y = [y_1,\cdots,y_i,\cdots,y_n]$, where $n$ is the number of classes (the number of output-layer neurons). Typically:

$$\hat Y = Softmax(Z_{in})\tag{1}$$

$$\hat y_i = \frac {e^{z_i}}{\sum_{j=1}^n e^{z_j}}\tag{2}$$

where $Softmax$ is:

$$Softmax(Z_{in}) = [\cdots,\frac {e^{z_i}}{\sum_{j=1}^n e^{z_j}},\cdots]\tag{3}$$
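As a minimal NumPy sketch of (1)-(3) (the max-subtraction is a standard numerical-stability trick, an addition not present in the formulas above; it cancels in the ratio and leaves the result unchanged):

```python
import numpy as np

def softmax(z_in: np.ndarray) -> np.ndarray:
    """Softmax of equations (1)-(3), with a max-shift to avoid overflow."""
    exp_z = np.exp(z_in - np.max(z_in))
    return exp_z / np.sum(exp_z)

z_in = np.array([2.0, 1.0, 0.1])
y_hat = softmax(z_in)
print(y_hat)        # approximately [0.659 0.242 0.099]
print(y_hat.sum())  # 1.0
```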

The Cross-Entropy Loss

$$Loss(Y,\hat Y) = -\sum_{i=1}^ny_i*\ln(\hat y_i)\tag{4}$$
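A sketch of (4), assuming a one-hot true label; the `eps` term is an added safeguard against $\ln(0)$, not part of the formula:

```python
import numpy as np

def cross_entropy(y: np.ndarray, y_hat: np.ndarray, eps: float = 1e-12) -> float:
    """Cross-entropy loss of equation (4); eps guards against log(0)."""
    return float(-np.sum(y * np.log(y_hat + eps)))

y = np.array([0.0, 1.0, 0.0])      # one-hot true label
y_hat = np.array([0.2, 0.7, 0.1])  # softmax output
print(cross_entropy(y, y_hat))     # -ln(0.7) ≈ 0.3567
```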

The partial derivative of the loss with respect to the network output (a scalar differentiated with respect to a vector) is:

$$\frac {\partial Loss(Y,\hat Y)}{\partial \hat Y} = [-\frac {y_1}{\hat y_1},\cdots,-\frac {y_i}{\hat y_i},\cdots,-\frac {y_n}{\hat y_n}]\tag{5}$$

Every quantity encountered in the backpropagation derivation is a variable; the ultimate goal is to find the partial derivative of the loss with respect to each such variable. In a program, this formula alone is evaluated to obtain the gradient at the corresponding input point.
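Equation (5) translates directly to code (the function name `dloss_dyhat` is my own):

```python
import numpy as np

def dloss_dyhat(y: np.ndarray, y_hat: np.ndarray) -> np.ndarray:
    """Partial derivative of the loss w.r.t. the network output, equation (5)."""
    return -y / y_hat

y = np.array([0.0, 1.0, 0.0])
y_hat = np.array([0.2, 0.7, 0.1])
print(dloss_dyhat(y, y_hat))  # [-0. -1.4286 -0.]
```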

The Partial Derivative of Softmax

We now take the partial derivative of $\hat Y$ with respect to $Z_{in}$ (a vector differentiated with respect to a vector). From (2), writing the index of $\hat y$ as $k$, there are two cases.

When $k=i$:

$$\begin{split} \frac {\partial \hat y_i}{\partial z_i} &= e^{z_i} * \frac {1}{\sum_{j=1}^ne^{z_j}} + e^{z_i} * (-\frac {1}{(\sum_{j=1}^n e^{z_j})^2}) * e^{z_i} \cr &= \frac {e^{z_i}}{\sum_{j=1}^n e^{z_j}} - (\frac {e^{z_i}}{\sum_{j=1}^n e^{z_j}})^2 \cr &= \hat y_i - \hat y_i^2 \end{split} \tag{6}$$

When $k \neq i$:

$$\begin{split} \frac {\partial \hat y_k}{\partial z_i} &= e^{z_k} * (-\frac {1}{(\sum_{j=1}^n e^{z_j})^2}) * e^{z_i} \cr &= -\frac {e^{z_k}*e^{z_i}}{(\sum_{j=1}^n e^{z_j})^2} \cr &= - \hat y_k\hat y_i \end{split} \tag{7}$$
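Both cases can be sanity-checked with a forward finite difference (a sketch; the step `h` and the test logits are arbitrary choices):

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

z = np.array([2.0, 1.0, 0.1])
y_hat = softmax(z)
i, k = 0, 1  # perturb z_i; observe component i (case k=i) and k (case k!=i)
h = 1e-6

z_h = z.copy()
z_h[i] += h
numeric = (softmax(z_h) - y_hat) / h  # forward difference of d y_hat / d z_i

print(numeric[i], y_hat[i] - y_hat[i] ** 2)  # equation (6)
print(numeric[k], -y_hat[k] * y_hat[i])      # equation (7)
```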

Written as a matrix:

$$\frac {\partial \hat Y}{\partial Z_{in}} = \begin{bmatrix} \hat y_1-\hat y_1^2 & -\hat y_2\hat y_1 & \cdots & -\hat y_n\hat y_1\cr -\hat y_1\hat y_2 & \hat y_2-\hat y_2^2 & \cdots & -\hat y_n\hat y_2\cr \vdots & \vdots & \ddots & \vdots\cr -\hat y_1\hat y_n & -\hat y_2\hat y_n & \cdots & \hat y_n-\hat y_n^2\cr \end{bmatrix}\tag{8}$$

This is a symmetric matrix, so when applying the chain rule it makes no difference whether or not it is transposed.
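In compact form, (8) is $\mathrm{diag}(\hat Y) - \hat Y\hat Y^T$; a sketch that builds it and confirms the symmetry claim:

```python
import numpy as np

def softmax_jacobian(y_hat: np.ndarray) -> np.ndarray:
    """Jacobian of equation (8): diag(y_hat) - outer(y_hat, y_hat)."""
    return np.diag(y_hat) - np.outer(y_hat, y_hat)

y_hat = np.array([0.659, 0.242, 0.099])
J = softmax_jacobian(y_hat)
print(np.allclose(J, J.T))  # True: the Jacobian is symmetric
```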

From (5) and (8), the partial derivative of the loss $L$ with respect to the input $Z_{in}$ (a scalar differentiated with respect to a vector) is:

$$\begin{split} \frac {\partial L(Y,\hat Y)}{\partial Z_{in}} &= \frac {\partial L(Y,\hat Y)}{\partial \hat Y} (\frac {\partial \hat Y}{\partial Z_{in}})^T\cr &= [-\frac {y_1}{\hat y_1},\cdots,-\frac {y_i}{\hat y_i},\cdots,-\frac {y_n}{\hat y_n}] \begin{bmatrix} \hat y_1-\hat y_1^2 & -\hat y_2\hat y_1 & \cdots & -\hat y_n\hat y_1\cr -\hat y_1\hat y_2 & \hat y_2-\hat y_2^2 & \cdots & -\hat y_n\hat y_2\cr \vdots & \vdots & \ddots & \vdots\cr -\hat y_1\hat y_n & -\hat y_2\hat y_n & \cdots & \hat y_n-\hat y_n^2\cr \end{bmatrix}^T\cr &= [(\hat y_1-1)y_1+\hat y_1y_2+\cdots+\hat y_1y_n,\ \hat y_2y_1+(\hat y_2-1)y_2+\cdots+\hat y_2y_n,\cdots]\cr &= [\hat y_1\sum_{i=1}^ny_i-y_1,\cdots,\hat y_j\sum_{i=1}^ny_i-y_j,\cdots,\hat y_n\sum_{i=1}^ny_i-y_n]\cr &= [\hat y_1-y_1,\cdots,\hat y_j-y_j,\cdots,\hat y_n-y_n]\ \ \ \ (\sum_{i=1}^ny_i=1)\cr &= \hat Y-Y \end{split}\tag{9}$$
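A sketch (test logits and label arbitrary) verifying the closed form $\hat Y - Y$ against central finite differences:

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

def loss(z, y):
    return -np.sum(y * np.log(softmax(z)))

z = np.array([2.0, 1.0, 0.1])
y = np.array([0.0, 1.0, 0.0])

analytic = softmax(z) - y  # the closed form derived above: y_hat - y

# Independent check via central finite differences
h = 1e-6
numeric = np.array([(loss(z + h * e, y) - loss(z - h * e, y)) / (2 * h)
                    for e in np.eye(len(z))])

print(analytic)
print(np.allclose(analytic, numeric, atol=1e-6))  # True
```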