Forward and Backward Propagation in Neural Networks


$f(z)$ is the activation function; see the separate summary of activation functions (also called excitation functions).
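
As a hedged, minimal example of one common choice of activation (the post does not fix a particular $f$), a sigmoid and its derivative in NumPy might look like this; the function names are illustrative:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation f(z) = 1 / (1 + exp(-z)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative f'(z) = f(z) * (1 - f(z)), needed later for the error terms."""
    s = sigmoid(z)
    return s * (1.0 - s)
```
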
Input of hidden layer 1

$$z^{(1)}=W^{(1)}x^T+b^{(1)}\tag{1}$$

Output of hidden layer 1

$$n^{(1)}=f^{(1)}(z^{(1)})\tag{2}$$

Input of hidden layer 2

$$z^{(2)}=W^{(2)}n^{(1)}+b^{(2)}\tag{3}$$

Output of hidden layer 2

$$n^{(2)}=f^{(2)}(z^{(2)})\tag{4}$$

Input of hidden layer 3

$$z^{(3)}=W^{(3)}n^{(2)}+b^{(3)}\tag{5}$$

Output of hidden layer 3, which is the output layer

$$\widehat y = n^{(3)} = f^{(3)}(z^{(3)})\tag{6}$$
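
A minimal NumPy sketch of the forward pass in equations (1)–(6), assuming sigmoid activations for every layer and randomly initialized parameters; the layer sizes and variable names are illustrative, not from the original post:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative layer sizes: input 4, hidden layers of 5 and 3 units, output 2.
sizes = [4, 5, 3, 2]
W = [rng.standard_normal((sizes[i + 1], sizes[i])) for i in range(3)]  # W^(1)..W^(3)
b = [rng.standard_normal((sizes[i + 1], 1)) for i in range(3)]         # b^(1)..b^(3)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = rng.standard_normal((4, 1))   # input as a column vector

# Equations (1)-(6): z^(k) = W^(k) n^(k-1) + b^(k),  n^(k) = f^(k)(z^(k))
n = x
zs, ns = [], [n]                  # cache for the backward pass
for Wk, bk in zip(W, b):
    z = Wk @ n + bk
    n = sigmoid(z)
    zs.append(z)
    ns.append(n)

y_hat = n                         # equation (6): output of the last layer
```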

Loss function

$$L(y,\widehat y)\tag{7}$$
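
The loss $L(y,\widehat y)$ is left generic here; as one hedged example, a squared-error loss could be:

```python
import numpy as np

def mse_loss(y, y_hat):
    """One possible choice of L(y, y_hat): 0.5 * ||y_hat - y||^2."""
    return 0.5 * np.sum((y_hat - y) ** 2)
```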

In general, the input of hidden layer $k+1$ is

$$z^{(k+1)}=W^{(k+1)}n^{(k)}+b^{(k+1)}\tag{8}$$

and the output of hidden layer $k+1$ is

$$n^{(k+1)}= f^{(k+1)}(z^{(k+1)})\tag{9}$$
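
Equations (8)–(9) are exactly the loop body of a depth-agnostic forward pass. A sketch, assuming lists of weights `W`, biases `b`, and per-layer activation functions `fs` (all names illustrative):

```python
def forward(x, W, b, fs):
    """Generic forward pass caching z^(k) and n^(k) for the backward pass."""
    n = x
    zs, ns = [], [n]
    for Wk, bk, fk in zip(W, b, fs):
        z = Wk @ n + bk   # equation (8): z^(k+1) = W^(k+1) n^(k) + b^(k+1)
        n = fk(z)         # equation (9): n^(k+1) = f^(k+1)(z^(k+1))
        zs.append(z)
        ns.append(n)
    return zs, ns
```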

For a summary of loss functions, see https://blog.csdn.net/lien0906/article/details/78429768.

First, compute the partial derivative

$$\frac {\partial z^{(k)}}{\partial b^{(k)}}= \mathrm{diag}(1,1, \ldots ,1)\tag{10}$$

For the derivative of a column vector with respect to a column vector, see the material on matrix calculus.
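
Written component-wise, equation (10) follows directly from equation (8): each component of $z^{(k)}$ depends on exactly one component of $b^{(k)}$, with coefficient 1,

$$z_i^{(k)} = \sum_j W_{ij}^{(k)} n_j^{(k-1)} + b_i^{(k)} \quad\Longrightarrow\quad \frac{\partial z_i^{(k)}}{\partial b_j^{(k)}} = \begin{cases}1, & i = j\\ 0, & i \neq j\end{cases}$$

so the Jacobian is the identity matrix, as stated in (10).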

Next, compute the partial derivative $\frac {\partial L(y,\widehat y)}{\partial z^{(k)}}$.

The partial derivative $\frac {\partial L(y,\widehat y)}{\partial z^{(k)}}$ is also known as the error term (or "sensitivity"), usually written $\delta$; $\delta^{(k)}$ denotes the error term of the neurons in layer $k$, and its magnitude measures how much the layer-$k$ neurons contribute to the final total error.

$$\begin{aligned} \delta^{(k)} & = \frac {\partial L(y,\widehat y)}{\partial z^{(k)}}\\ & = \frac {\partial n^{(k)}}{\partial z^{(k)}} * \frac {\partial z^{(k+1)}}{\partial n^{(k)}} * \frac {\partial L(y,\widehat y)}{\partial z^{(k+1)}}\\ & = {f^{(k)}}'(z^{(k)}) * (W^{(k+1)})^T * \delta^{(k+1)} \end{aligned}\tag{11}$$
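
A sketch of the recursion (11) in code, assuming (as is standard) that the $f'$ factor multiplies element-wise while the $(W^{(k+1)})^T$ factor is a matrix product; `f_prime` is the derivative of the layer's activation (e.g. the `sigmoid_prime` above):

```python
def backprop_delta(delta_next, W_next, z_k, f_prime):
    """Equation (11): delta^(k) = f^(k)'(z^(k)) * (W^(k+1))^T delta^(k+1).
    Element-wise product with f', matrix product with W^T."""
    return f_prime(z_k) * (W_next.T @ delta_next)
```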

The two derivatives ultimately needed are

$$\frac {\partial L(y,\widehat y)}{\partial W^{(k)}} = \frac {\partial L(y,\widehat y)}{\partial z^{(k)}} * \frac {\partial z^{(k)}}{\partial W^{(k)}} = \delta^{(k)} * (n^{(k-1)})^T\tag{12}$$

$$\frac {\partial L(y,\widehat y)}{\partial b^{(k)}} = \frac {\partial L(y,\widehat y)}{\partial z^{(k)}} * \frac {\partial z^{(k)}}{\partial b^{(k)}} = \delta^{(k)}\tag{13}$$
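
Equations (12)–(13) translate almost verbatim into code; `delta_k` and `n_prev` follow the notation $\delta^{(k)}$ and $n^{(k-1)}$ above (a sketch, with column-vector shapes assumed):

```python
def layer_gradients(delta_k, n_prev):
    """Equation (12): dL/dW^(k) = delta^(k) (n^(k-1))^T.
    Equation (13): dL/db^(k) = delta^(k)."""
    dW = delta_k @ n_prev.T
    db = delta_k
    return dW, db
```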

Parameter updates in backward propagation

$$W^{(k)} = W^{(k)} - \alpha\left(\delta^{(k)}(n^{(k-1)})^T + W^{(k)}\right)\tag{14}$$

$$b^{(k)} = b^{(k)} - \alpha\,\delta^{(k)}\tag{15}$$

where $\alpha$ is the learning rate.
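
A sketch of the update rules (14)–(15). Note that (14) carries an extra $W^{(k)}$ term inside the parentheses, which appears to act as a weight-decay (L2-style) term, consistent with the regularization note below; the variable names are illustrative:

```python
def sgd_update(W_k, b_k, delta_k, n_prev, alpha):
    """Equations (14)-(15): gradient step with learning rate alpha.
    The extra W_k inside the parentheses mirrors the W^(k) term in (14)."""
    W_k = W_k - alpha * (delta_k @ n_prev.T + W_k)
    b_k = b_k - alpha * delta_k
    return W_k, b_k
```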

Regularization in backward propagation: L1 regularization and L2 regularization.
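
As a hedged sketch of how L1 and L2 regularization typically enter the weight gradient during backpropagation (the coefficient `lam` and the function name are assumptions, not from the original post):

```python
import numpy as np

def regularized_grad(dW, W_k, lam, kind="l2"):
    """Add the regularization gradient to dL/dW^(k):
    L2 adds lam * W^(k); L1 adds lam * sign(W^(k))."""
    if kind == "l2":
        return dW + lam * W_k
    return dW + lam * np.sign(W_k)
```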