Personal study notes; if you spot any problems, please don't hesitate to point them out!

The previous post left the sections on the derivatives of the matrix trace and matrix norms blank; this post fills them in.
The derivations cost me a good chunk of precious time. Verdict: brutal.

Here is a link to the previous post:

Matrix Derivatives

The most general case, the derivative of a matrix with respect to a matrix, was still missing from the analysis; it seems to come up particularly often in neural networks. Here we prove the product-rule formula for this case. The proof is exactly the same as the one for column vectors; the difference is that for column vectors the second term lets you pull out a common factor and obtain a block matrix, whereas for matrix-by-matrix derivatives no block structure appears and you only get a Kronecker (direct) product. The details are as follows.
Let $\mathbf{A}\in\mathbb{C}^{m\times l}$, $\mathbf{B}\in\mathbb{C}^{l\times n}$, and $\mathbf{W}\in\mathbb{C}^{p\times q}$. We want to prove:

$$\frac{\mathrm{d}(\mathbf{AB})}{\mathrm{d}\mathbf{W}}=\frac{\mathrm{d}\mathbf{A}}{\mathrm{d}\mathbf{W}}\big(\mathbf{B}\otimes\mathbf{I}_q\big)+\big(\mathbf{A}\otimes\mathbf{I}_p\big)\frac{\mathrm{d}\mathbf{B}}{\mathrm{d}\mathbf{W}}$$

By the definition of the derivative of a matrix with respect to a matrix, and the scalar product rule applied entrywise:

$$\begin{aligned}
\frac{\mathrm{d}(\mathbf{AB})}{\mathrm{d}\mathbf{W}}&=
\begin{bmatrix}
\frac{\mathrm{d}\sum_i a_{1i}b_{i1}}{\mathrm{d}\mathbf{W}}&\frac{\mathrm{d}\sum_i a_{1i}b_{i2}}{\mathrm{d}\mathbf{W}}&\cdots&\frac{\mathrm{d}\sum_i a_{1i}b_{in}}{\mathrm{d}\mathbf{W}}\\
\frac{\mathrm{d}\sum_i a_{2i}b_{i1}}{\mathrm{d}\mathbf{W}}&\frac{\mathrm{d}\sum_i a_{2i}b_{i2}}{\mathrm{d}\mathbf{W}}&\cdots&\frac{\mathrm{d}\sum_i a_{2i}b_{in}}{\mathrm{d}\mathbf{W}}\\
\vdots&\vdots&&\vdots\\
\frac{\mathrm{d}\sum_i a_{mi}b_{i1}}{\mathrm{d}\mathbf{W}}&\frac{\mathrm{d}\sum_i a_{mi}b_{i2}}{\mathrm{d}\mathbf{W}}&\cdots&\frac{\mathrm{d}\sum_i a_{mi}b_{in}}{\mathrm{d}\mathbf{W}}
\end{bmatrix}\\
&=
\begin{bmatrix}
\sum_i\frac{\mathrm{d}a_{1i}}{\mathrm{d}\mathbf{W}}b_{i1}&\sum_i\frac{\mathrm{d}a_{1i}}{\mathrm{d}\mathbf{W}}b_{i2}&\cdots&\sum_i\frac{\mathrm{d}a_{1i}}{\mathrm{d}\mathbf{W}}b_{in}\\
\sum_i\frac{\mathrm{d}a_{2i}}{\mathrm{d}\mathbf{W}}b_{i1}&\sum_i\frac{\mathrm{d}a_{2i}}{\mathrm{d}\mathbf{W}}b_{i2}&\cdots&\sum_i\frac{\mathrm{d}a_{2i}}{\mathrm{d}\mathbf{W}}b_{in}\\
\vdots&\vdots&&\vdots\\
\sum_i\frac{\mathrm{d}a_{mi}}{\mathrm{d}\mathbf{W}}b_{i1}&\sum_i\frac{\mathrm{d}a_{mi}}{\mathrm{d}\mathbf{W}}b_{i2}&\cdots&\sum_i\frac{\mathrm{d}a_{mi}}{\mathrm{d}\mathbf{W}}b_{in}
\end{bmatrix}
+
\begin{bmatrix}
\sum_i a_{1i}\frac{\mathrm{d}b_{i1}}{\mathrm{d}\mathbf{W}}&\sum_i a_{1i}\frac{\mathrm{d}b_{i2}}{\mathrm{d}\mathbf{W}}&\cdots&\sum_i a_{1i}\frac{\mathrm{d}b_{in}}{\mathrm{d}\mathbf{W}}\\
\sum_i a_{2i}\frac{\mathrm{d}b_{i1}}{\mathrm{d}\mathbf{W}}&\sum_i a_{2i}\frac{\mathrm{d}b_{i2}}{\mathrm{d}\mathbf{W}}&\cdots&\sum_i a_{2i}\frac{\mathrm{d}b_{in}}{\mathrm{d}\mathbf{W}}\\
\vdots&\vdots&&\vdots\\
\sum_i a_{mi}\frac{\mathrm{d}b_{i1}}{\mathrm{d}\mathbf{W}}&\sum_i a_{mi}\frac{\mathrm{d}b_{i2}}{\mathrm{d}\mathbf{W}}&\cdots&\sum_i a_{mi}\frac{\mathrm{d}b_{in}}{\mathrm{d}\mathbf{W}}
\end{bmatrix}
\end{aligned}$$

Consider the first term; it factors as follows:

$$\begin{aligned}
&\begin{bmatrix}
\sum_i\frac{\mathrm{d}a_{1i}}{\mathrm{d}\mathbf{W}}b_{i1}&\sum_i\frac{\mathrm{d}a_{1i}}{\mathrm{d}\mathbf{W}}b_{i2}&\cdots&\sum_i\frac{\mathrm{d}a_{1i}}{\mathrm{d}\mathbf{W}}b_{in}\\
\vdots&\vdots&&\vdots\\
\sum_i\frac{\mathrm{d}a_{mi}}{\mathrm{d}\mathbf{W}}b_{i1}&\sum_i\frac{\mathrm{d}a_{mi}}{\mathrm{d}\mathbf{W}}b_{i2}&\cdots&\sum_i\frac{\mathrm{d}a_{mi}}{\mathrm{d}\mathbf{W}}b_{in}
\end{bmatrix}\\
&=
\begin{bmatrix}
\frac{\mathrm{d}a_{11}}{\mathrm{d}\mathbf{W}}&\frac{\mathrm{d}a_{12}}{\mathrm{d}\mathbf{W}}&\cdots&\frac{\mathrm{d}a_{1l}}{\mathrm{d}\mathbf{W}}\\
\vdots&\vdots&&\vdots\\
\frac{\mathrm{d}a_{m1}}{\mathrm{d}\mathbf{W}}&\frac{\mathrm{d}a_{m2}}{\mathrm{d}\mathbf{W}}&\cdots&\frac{\mathrm{d}a_{ml}}{\mathrm{d}\mathbf{W}}
\end{bmatrix}
\begin{bmatrix}
b_{11}\mathbf{I}_q&b_{12}\mathbf{I}_q&\cdots&b_{1n}\mathbf{I}_q\\
\vdots&\vdots&&\vdots\\
b_{l1}\mathbf{I}_q&b_{l2}\mathbf{I}_q&\cdots&b_{ln}\mathbf{I}_q
\end{bmatrix}\\
&=\frac{\mathrm{d}\mathbf{A}}{\mathrm{d}\mathbf{W}}\big(\mathbf{B}\otimes\mathbf{I}_q\big)
\end{aligned}$$

Each $b_{ik}\mathbf{I}_q$ is a $q\times q$ scalar block, so the second factor is exactly $\mathbf{B}\otimes\mathbf{I}_q$ (of size $lq\times nq$), while the first factor is by definition $\frac{\mathrm{d}\mathbf{A}}{\mathrm{d}\mathbf{W}}$ (of size $mp\times lq$).

For the second term, a similar factorization applies:

$$\begin{aligned}
&\begin{bmatrix}
\sum_i a_{1i}\frac{\mathrm{d}b_{i1}}{\mathrm{d}\mathbf{W}}&\sum_i a_{1i}\frac{\mathrm{d}b_{i2}}{\mathrm{d}\mathbf{W}}&\cdots&\sum_i a_{1i}\frac{\mathrm{d}b_{in}}{\mathrm{d}\mathbf{W}}\\
\vdots&\vdots&&\vdots\\
\sum_i a_{mi}\frac{\mathrm{d}b_{i1}}{\mathrm{d}\mathbf{W}}&\sum_i a_{mi}\frac{\mathrm{d}b_{i2}}{\mathrm{d}\mathbf{W}}&\cdots&\sum_i a_{mi}\frac{\mathrm{d}b_{in}}{\mathrm{d}\mathbf{W}}
\end{bmatrix}\\
&=
\begin{bmatrix}
a_{11}\mathbf{I}_p&a_{12}\mathbf{I}_p&\cdots&a_{1l}\mathbf{I}_p\\
\vdots&\vdots&&\vdots\\
a_{m1}\mathbf{I}_p&a_{m2}\mathbf{I}_p&\cdots&a_{ml}\mathbf{I}_p
\end{bmatrix}
\begin{bmatrix}
\frac{\mathrm{d}b_{11}}{\mathrm{d}\mathbf{W}}&\frac{\mathrm{d}b_{12}}{\mathrm{d}\mathbf{W}}&\cdots&\frac{\mathrm{d}b_{1n}}{\mathrm{d}\mathbf{W}}\\
\vdots&\vdots&&\vdots\\
\frac{\mathrm{d}b_{l1}}{\mathrm{d}\mathbf{W}}&\frac{\mathrm{d}b_{l2}}{\mathrm{d}\mathbf{W}}&\cdots&\frac{\mathrm{d}b_{ln}}{\mathrm{d}\mathbf{W}}
\end{bmatrix}\\
&=\big(\mathbf{A}\otimes\mathbf{I}_p\big)\frac{\mathrm{d}\mathbf{B}}{\mathrm{d}\mathbf{W}}
\end{aligned}$$

Here each $a_{ji}\mathbf{I}_p$ is a $p\times p$ scalar block, so the first factor is $\mathbf{A}\otimes\mathbf{I}_p$ ($mp\times lp$), and the second factor is $\frac{\mathrm{d}\mathbf{B}}{\mathrm{d}\mathbf{W}}$ ($lp\times nq$).

Combining the two terms proves the formula.
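As a numerical sanity check on the identity just proved, here is a minimal NumPy sketch. The helper `matderiv` (a name made up for illustration) assembles the block-matrix derivative $\frac{\mathrm{d}\mathbf{F}}{\mathrm{d}\mathbf{W}}=[\frac{\mathrm{d}F_{jk}}{\mathrm{d}\mathbf{W}}]$ by central finite differences, and both sides of the product rule are compared for arbitrarily chosen smooth $\mathbf{A}(\mathbf{W})$ and $\mathbf{B}(\mathbf{W})$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, l, n, p, q = 2, 3, 2, 2, 3  # A: m x l, B: l x n, W: p x q

# Arbitrary smooth matrix functions of W (illustrative choices).
P, Q = rng.standard_normal((m, p)), rng.standard_normal((q, l))
R, S = rng.standard_normal((l, p)), rng.standard_normal((q, n))
A = lambda W: P @ W @ Q           # linear in W
B = lambda W: np.tanh(R @ W @ S)  # nonlinear in W

def matderiv(F, W, eps=1e-6):
    """dF/dW as the block matrix [dF_jk/dW], of shape (m'*p, n'*q)."""
    F0, (p_, q_) = F(W), W.shape
    D = np.zeros((F0.shape[0] * p_, F0.shape[1] * q_))
    for a in range(p_):
        for b in range(q_):
            dW = np.zeros_like(W)
            dW[a, b] = eps
            G = (F(W + dW) - F(W - dW)) / (2 * eps)  # dF/dw_ab, central difference
            D[a::p_, b::q_] = G                      # scatter entry (a,b) into every block
    return D

W = rng.standard_normal((p, q))
lhs = matderiv(lambda X: A(X) @ B(X), W)
rhs = matderiv(A, W) @ np.kron(B(W), np.eye(q)) + np.kron(A(W), np.eye(p)) @ matderiv(B, W)
print(np.allclose(lhs, rhs, atol=1e-5))
```

The layout convention matches the definition used above: block $(j,k)$ of $\frac{\mathrm{d}\mathbf{F}}{\mathrm{d}\mathbf{W}}$ occupies rows $jp$ to $(j+1)p$ and columns $kq$ to $(k+1)q$.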

The chain rule for matrix derivatives

Unlike scalar derivatives, the chain rule for matrices involves derivatives of matrices with respect to matrices, which is why it is introduced only after the previous section. Suppose we have a matrix of functions:

$$\mathbf{G}(\mathbf{F})=\begin{bmatrix}
g_{11}(\mathbf{F})&g_{12}(\mathbf{F})&\cdots&g_{1n}(\mathbf{F})\\
g_{21}(\mathbf{F})&g_{22}(\mathbf{F})&\cdots&g_{2n}(\mathbf{F})\\
\vdots&\vdots&&\vdots\\
g_{m1}(\mathbf{F})&g_{m2}(\mathbf{F})&\cdots&g_{mn}(\mathbf{F})
\end{bmatrix}$$

where the matrix $\mathbf{F}\in\mathbb{C}^{s\times t}$ is itself a matrix function of $\mathbf{A}\in\mathbb{C}^{p\times q}$. One can show that on the augmented linear space formed by $\mathbf{G}$ and $\mathbf{F}$ there exists an operator $\varphi:\mathbb{C}^{sp\times qt}\mapsto\mathbb{C}^{m\times n}$ realizing the corresponding chain rule for matrix derivatives, but its form is so cumbersome that it is of purely theoretical interest, so it is not derived here. Below we derive the special cases of the chain rule that are used most often.

Functions of the Hadamard (element-wise) product

The Hadamard product is a special matrix function whose derivative plays an important role in the back-propagation of neural networks; its derivative is a special case of the chain rule. For matrices $\mathbf{A},\mathbf{B}\in\mathbb{C}^{m\times n}$, the Hadamard product is defined as:

$$\mathbf{A}\odot\mathbf{B}=\begin{bmatrix}
a_{11}b_{11}&a_{12}b_{12}&\cdots&a_{1n}b_{1n}\\
a_{21}b_{21}&a_{22}b_{22}&\cdots&a_{2n}b_{2n}\\
\vdots&\vdots&&\vdots\\
a_{m1}b_{m1}&a_{m2}b_{m2}&\cdots&a_{mn}b_{mn}
\end{bmatrix}$$

Given a differentiation variable $\mathbf{W}\in\mathbb{C}^{m\times n}$, the derivative of the Hadamard product is:

$$\frac{\mathrm{d}(\mathbf{A}\odot\mathbf{B})}{\mathrm{d}\mathbf{W}}=\frac{\mathrm{d}\mathbf{A}}{\mathrm{d}\mathbf{W}}\odot\big[\mathbf{B}\otimes\mathbf{1}(\mathbf{W})\big]+\big[\mathbf{A}\otimes\mathbf{1}(\mathbf{W})\big]\odot\frac{\mathrm{d}\mathbf{B}}{\mathrm{d}\mathbf{W}}$$

where $\mathbf{1}(\mathbf{W})$ denotes the all-ones matrix with the same shape as $\mathbf{W}$.
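This rule can also be checked numerically. The sketch below (assuming NumPy; `matderiv` is a made-up finite-difference helper, and the choices of $\mathbf{A}(\mathbf{W})$ and $\mathbf{B}(\mathbf{W})$ are arbitrary) compares both sides, with `*` playing the role of $\odot$ and `np.kron(·, ones)` playing the role of $\cdot\otimes\mathbf{1}(\mathbf{W})$:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 2, 3  # A, B, W are all m x n, as in the text

P, Q = rng.standard_normal((m, m)), rng.standard_normal((n, n))
R, S = rng.standard_normal((m, m)), rng.standard_normal((n, n))
A = lambda W: np.sin(P @ W @ Q)
B = lambda W: np.cos(R @ W @ S)

def matderiv(F, W, eps=1e-6):
    """dF/dW as the block matrix [dF_jk/dW], via central differences."""
    F0, (p_, q_) = F(W), W.shape
    D = np.zeros((F0.shape[0] * p_, F0.shape[1] * q_))
    for a in range(p_):
        for b in range(q_):
            dW = np.zeros_like(W)
            dW[a, b] = eps
            D[a::p_, b::q_] = (F(W + dW) - F(W - dW)) / (2 * eps)
    return D

W = rng.standard_normal((m, n))
ones = np.ones_like(W)  # the all-ones matrix 1(W)
lhs = matderiv(lambda X: A(X) * B(X), W)  # * is the Hadamard product in NumPy
rhs = matderiv(A, W) * np.kron(B(W), ones) + np.kron(A(W), ones) * matderiv(B, W)
print(np.allclose(lhs, rhs, atol=1e-5))
```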

Element-wise function application

An element-wise function applies one and the same scalar map to every element of a matrix, thereby transforming the whole matrix; its derivative is another special case of the chain rule. For a matrix $\mathbf{A}\in\mathbb{C}^{m\times n}$ and a scalar function $f:\mathbb{C}\mapsto\mathbb{C}$, define the element-wise application as:

$$f\odot(\mathbf{A})\triangleq f\Big|_\mathbf{A}\triangleq\begin{bmatrix}
f(a_{11})&f(a_{12})&\cdots&f(a_{1n})\\
f(a_{21})&f(a_{22})&\cdots&f(a_{2n})\\
\vdots&\vdots&&\vdots\\
f(a_{m1})&f(a_{m2})&\cdots&f(a_{mn})
\end{bmatrix}$$

Given a differentiation variable $\mathbf{W}\in\mathbb{C}^{m\times n}$, the derivative of the element-wise function is:

$$\frac{\mathrm{d}\,f\odot(\mathbf{A})}{\mathrm{d}\mathbf{W}}=\Big[\frac{\mathrm{d}f}{\mathrm{d}x}\Big|_\mathbf{A}\otimes\mathbf{1}(\mathbf{W})\Big]\odot\frac{\mathrm{d}\mathbf{A}}{\mathrm{d}\mathbf{W}}$$

where:

$$\frac{\mathrm{d}f}{\mathrm{d}x}\Big|_\mathbf{A}\triangleq\begin{bmatrix}
\frac{\mathrm{d}f}{\mathrm{d}x}\big|_{a_{11}}&\frac{\mathrm{d}f}{\mathrm{d}x}\big|_{a_{12}}&\cdots&\frac{\mathrm{d}f}{\mathrm{d}x}\big|_{a_{1n}}\\
\frac{\mathrm{d}f}{\mathrm{d}x}\big|_{a_{21}}&\frac{\mathrm{d}f}{\mathrm{d}x}\big|_{a_{22}}&\cdots&\frac{\mathrm{d}f}{\mathrm{d}x}\big|_{a_{2n}}\\
\vdots&\vdots&&\vdots\\
\frac{\mathrm{d}f}{\mathrm{d}x}\big|_{a_{m1}}&\frac{\mathrm{d}f}{\mathrm{d}x}\big|_{a_{m2}}&\cdots&\frac{\mathrm{d}f}{\mathrm{d}x}\big|_{a_{mn}}
\end{bmatrix}$$

The formula shows that nested functions of the elements expand exactly as in the scalar chain rule, i.e.:

$$\frac{\mathrm{d}\,g\circ f\odot(\mathbf{A})}{\mathrm{d}\mathbf{W}}=\Big[\frac{\mathrm{d}g}{\mathrm{d}x}\Big|_{f\odot(\mathbf{A})}\otimes\mathbf{1}(\mathbf{W})\Big]\odot\Big[\frac{\mathrm{d}f}{\mathrm{d}x}\Big|_\mathbf{A}\otimes\mathbf{1}(\mathbf{W})\Big]\odot\frac{\mathrm{d}\mathbf{A}}{\mathrm{d}\mathbf{W}}$$
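The nested version can be checked numerically as well. A NumPy sketch (the helper `matderiv` and the choice $f=\tanh$, $g=\sin$, $\mathbf{A}(\mathbf{W})$ linear are all assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 3, 2
P, Q = rng.standard_normal((m, m)), rng.standard_normal((n, n))
A = lambda W: P @ W @ Q                           # a smooth matrix function of W
f, fp = np.tanh, lambda x: 1 - np.tanh(x) ** 2    # f and its scalar derivative
g, gp = np.sin, np.cos                            # g and its scalar derivative

def matderiv(F, W, eps=1e-6):
    """dF/dW as the block matrix [dF_jk/dW], via central differences."""
    F0, (p_, q_) = F(W), W.shape
    D = np.zeros((F0.shape[0] * p_, F0.shape[1] * q_))
    for a in range(p_):
        for b in range(q_):
            dW = np.zeros_like(W)
            dW[a, b] = eps
            D[a::p_, b::q_] = (F(W + dW) - F(W - dW)) / (2 * eps)
    return D

W = rng.standard_normal((m, n))
ones = np.ones_like(W)
# chain rule: d g∘f⊙(A)/dW = (g'|_{f(A)} ⊗ 1) ⊙ (f'|_A ⊗ 1) ⊙ dA/dW
lhs = matderiv(lambda X: g(f(A(X))), W)
rhs = np.kron(gp(f(A(W))), ones) * np.kron(fp(A(W)), ones) * matderiv(A, W)
print(np.allclose(lhs, rhs, atol=1e-5))
```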

Derivatives of element-wise matrix operators

Generalizing slightly, we obtain the derivative of an element-wise matrix operator. Define:

$$\mathbf{F}\odot(\mathbf{A})\triangleq\mathbf{F}\Big|_\mathbf{A}\triangleq\begin{bmatrix}
f_{11}(a_{11})&f_{12}(a_{12})&\cdots&f_{1n}(a_{1n})\\
f_{21}(a_{21})&f_{22}(a_{22})&\cdots&f_{2n}(a_{2n})\\
\vdots&\vdots&&\vdots\\
f_{m1}(a_{m1})&f_{m2}(a_{m2})&\cdots&f_{mn}(a_{mn})
\end{bmatrix}$$

Given a differentiation variable $\mathbf{W}\in\mathbb{C}^{m\times n}$, the derivative of this element-wise operator is:

$$\frac{\mathrm{d}\,\mathbf{F}\odot(\mathbf{A})}{\mathrm{d}\mathbf{W}}=\Big[\frac{\mathrm{d}\mathbf{F}}{\mathrm{d}x}\Big|_\mathbf{A}\otimes\mathbf{1}(\mathbf{W})\Big]\odot\frac{\mathrm{d}\mathbf{A}}{\mathrm{d}\mathbf{W}}$$

where:

$$\frac{\mathrm{d}\mathbf{F}}{\mathrm{d}x}\Big|_\mathbf{A}\triangleq\begin{bmatrix}
\frac{\mathrm{d}f_{11}}{\mathrm{d}x}\big|_{a_{11}}&\frac{\mathrm{d}f_{12}}{\mathrm{d}x}\big|_{a_{12}}&\cdots&\frac{\mathrm{d}f_{1n}}{\mathrm{d}x}\big|_{a_{1n}}\\
\frac{\mathrm{d}f_{21}}{\mathrm{d}x}\big|_{a_{21}}&\frac{\mathrm{d}f_{22}}{\mathrm{d}x}\big|_{a_{22}}&\cdots&\frac{\mathrm{d}f_{2n}}{\mathrm{d}x}\big|_{a_{2n}}\\
\vdots&\vdots&&\vdots\\
\frac{\mathrm{d}f_{m1}}{\mathrm{d}x}\big|_{a_{m1}}&\frac{\mathrm{d}f_{m2}}{\mathrm{d}x}\big|_{a_{m2}}&\cdots&\frac{\mathrm{d}f_{mn}}{\mathrm{d}x}\big|_{a_{mn}}
\end{bmatrix}$$

The matrix trace

Let $\mathbf{A},\mathbf{B}\in\mathbb{C}^{n\times n}$ and $\mathbf{W}\in\mathbb{C}^{p\times q}$.

Definition

For a square matrix, the trace is defined as the sum of all the entries on the main diagonal:

$$\mathrm{tr}(\mathbf{A})=\sum_{i=1}^n a_{ii}$$

Properties

When discussing the trace we consider only square matrices; for non-square matrices the trace is not defined. The trace equals the sum of all eigenvalues of the matrix, which can be proved from Vieta's formulas for the characteristic polynomial:

$$\mathrm{tr}(\mathbf{A})=\sum_{i=1}^n\lambda_i$$

The trace is invariant under swapping the factors of a product:

$$\mathrm{tr}(\mathbf{AB})=\mathrm{tr}(\mathbf{BA})$$
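Both properties are easy to confirm numerically (a quick NumPy sketch with random matrices; for a real matrix the imaginary parts of the eigenvalues cancel in the sum):

```python
import numpy as np

rng = np.random.default_rng(3)
A, B = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))

# trace equals the sum of the eigenvalues
print(np.isclose(np.trace(A), np.linalg.eigvals(A).sum().real))
# trace of a product is invariant under swapping the factors
print(np.isclose(np.trace(A @ B), np.trace(B @ A)))
```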

Differential properties

The trace of a matrix is just a scalar, so it is differentiated like any other scalar. For traces of matrix expressions, the trace and the derivative cannot be interchanged:

$$\frac{\mathrm{d}[\mathrm{tr}(\mathbf{A})]}{\mathrm{d}\mathbf{W}}=\sum_{i=1}^n\frac{\mathrm{d}a_{ii}}{\mathrm{d}\mathbf{W}}$$

The formula above also shows why: the result involves only the diagonal entries of $\mathbf{A}$, not an expression in all of its entries, so the derivative and the trace cannot be swapped. For the trace of a product of two matrices, however, the derivative can be written compactly entrywise:

$$\frac{\mathrm{d}[\mathrm{tr}(\mathbf{AB})]}{\mathrm{d}\mathbf{W}}=\Big[\mathrm{tr}\Big(\frac{\mathrm{d}\mathbf{A}}{\mathrm{d}w_{ij}}\mathbf{B}+\mathbf{A}\frac{\mathrm{d}\mathbf{B}}{\mathrm{d}w_{ij}}\Big)\Big]_{p\times q}$$
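An entrywise finite-difference check of this rule (a NumPy sketch; `dmat` is a made-up helper, and the functions $\mathbf{A}(\mathbf{W})$, $\mathbf{B}(\mathbf{W})$ are arbitrary smooth choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n_, p, q = 3, 2, 2  # A(W), B(W) are n_ x n_; W is p x q
P, Q = rng.standard_normal((n_, p)), rng.standard_normal((q, n_))
R, S = rng.standard_normal((n_, p)), rng.standard_normal((q, n_))
A = lambda W: P @ W @ Q
B = lambda W: np.tanh(R @ W @ S)

def dmat(F, W, a, b, eps=1e-6):
    """dF/dw_ab by central differences (works for matrix- or scalar-valued F)."""
    dW = np.zeros_like(W)
    dW[a, b] = eps
    return (F(W + dW) - F(W - dW)) / (2 * eps)

W = rng.standard_normal((p, q))
# entry (a,b) of d tr(AB)/dW versus tr(dA/dw_ab B + A dB/dw_ab)
lhs = np.array([[dmat(lambda X: np.trace(A(X) @ B(X)), W, a, b)
                 for b in range(q)] for a in range(p)])
rhs = np.array([[np.trace(dmat(A, W, a, b) @ B(W) + A(W) @ dmat(B, W, a, b))
                 for b in range(q)] for a in range(p)])
print(np.allclose(lhs, rhs, atol=1e-5))
```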

Matrix norms

For any $\mathbf{A}\in\mathbb{C}^{m\times n}$, define the following norms.

Entrywise matrix norms

Sum norm: $||\mathbf{A}||_M=\displaystyle\sum^n_{j=1}\sum^m_{i=1}|a_{ij}|$

Frobenius norm: $||\mathbf{A}||_F=\displaystyle\sqrt{\sum^n_{j=1}\sum^m_{i=1}|a_{ij}|^2}$

G norm: $||\mathbf{A}||_G=n\cdot\displaystyle\max_{i,j}|a_{ij}|$

Matrix norms induced by vector norms

The largest singular value of a matrix is written $s_{\max}=\max\{s_i\}_{i=1}^{n}$. Note that this is in general different from the spectral radius $\rho(\mathbf{A})=\max_i|\lambda_i|$, the largest absolute eigenvalue; the two coincide for normal matrices.

Row-sum norm: $||\mathbf{A}||_{\infty}=\displaystyle\max_i\sum^n_{j=1}|a_{ij}|$

Column-sum norm: $||\mathbf{A}||_1=\displaystyle\max_j\sum^m_{i=1}|a_{ij}|$

Spectral norm: $||\mathbf{A}||_2=\max\{s_i\}_{i=1}^{n}$, the largest singular value of $\mathbf{A}$.
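The definitions above line up with the `ord` options of NumPy's built-in `np.linalg.norm`, which makes for a quick cross-check (a sketch with a random real matrix):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((3, 4))

fro     = np.sqrt((np.abs(A) ** 2).sum())           # Frobenius norm (note the square root)
row_sum = np.abs(A).sum(axis=1).max()               # max over rows i of sum_j |a_ij|
col_sum = np.abs(A).sum(axis=0).max()               # max over columns j of sum_i |a_ij|
spec    = np.linalg.svd(A, compute_uv=False).max()  # largest singular value

print(np.isclose(fro, np.linalg.norm(A, 'fro')),
      np.isclose(row_sum, np.linalg.norm(A, np.inf)),
      np.isclose(col_sum, np.linalg.norm(A, 1)),
      np.isclose(spec, np.linalg.norm(A, 2)))
```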

Derivatives of matrix norms

Most matrix norms cannot be differentiated directly. Among the common ones, the Frobenius norm can; it is computed most conveniently from:

$$||\mathbf{A}||_F^2=\mathrm{tr}(\mathbf{A}^{\mathrm{T}}\mathbf{A})$$

For simplicity we work with the squared norm; error terms of exactly this form are common in regularization (e.g. weight decay). Differentiating the square, and applying the trace blockwise as in the previous section:

$$\begin{aligned}
\frac{\mathrm{d}||\mathbf{A}||_F^2}{\mathrm{d}\mathbf{W}}&=\frac{\mathrm{d}\big[\mathrm{tr}(\mathbf{A}^{\mathrm{T}}\mathbf{A})\big]}{\mathrm{d}\mathbf{W}}=\mathrm{tr}\Big(\frac{\mathrm{d}(\mathbf{A}^{\mathrm{T}}\mathbf{A})}{\mathrm{d}\mathbf{W}}\Big)\\
&=\mathrm{tr}\Big[\frac{\mathrm{d}\mathbf{A}^\mathrm{T}}{\mathrm{d}\mathbf{W}}\big(\mathbf{A}\otimes\mathbf{I}_q\big)+\big(\mathbf{A}^\mathrm{T}\otimes\mathbf{I}_p\big)\frac{\mathrm{d}\mathbf{A}}{\mathrm{d}\mathbf{W}}\Big]
\end{aligned}$$

In fact, since $\mathrm{tr}(\mathbf{M})=\mathrm{tr}(\mathbf{M}^\mathrm{T})$ and the two terms are entrywise transposes of each other, they contribute equally, and we obtain in one step:

$$\begin{aligned}
\frac{\mathrm{d}||\mathbf{A}||_F^2}{\mathrm{d}\mathbf{W}}&=\mathrm{tr}\Big[\frac{\mathrm{d}\mathbf{A}^\mathrm{T}}{\mathrm{d}\mathbf{W}}\big(\mathbf{A}\otimes\mathbf{I}_q\big)+\big(\mathbf{A}^\mathrm{T}\otimes\mathbf{I}_p\big)\frac{\mathrm{d}\mathbf{A}}{\mathrm{d}\mathbf{W}}\Big]\\
&=2\,\mathrm{tr}\Big[\frac{\mathrm{d}\mathbf{A}^\mathrm{T}}{\mathrm{d}\mathbf{W}}\big(\mathbf{A}\otimes\mathbf{I}_q\big)\Big]
\end{aligned}$$
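An entrywise sanity check: for each entry $w_{ab}$, the derivative above says $\frac{\mathrm{d}||\mathbf{A}||_F^2}{\mathrm{d}w_{ab}}=2\,\mathrm{tr}\big((\frac{\mathrm{d}\mathbf{A}}{\mathrm{d}w_{ab}})^\mathrm{T}\mathbf{A}\big)$, which equals twice the elementwise sum of $\frac{\mathrm{d}\mathbf{A}}{\mathrm{d}w_{ab}}\odot\mathbf{A}$. A NumPy sketch (`dmat` is a made-up finite-difference helper; the choice of $\mathbf{A}(\mathbf{W})$ is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
m, n, p, q = 3, 2, 2, 2
P, Q = rng.standard_normal((m, p)), rng.standard_normal((q, n))
A = lambda W: np.tanh(P @ W @ Q)

def dmat(F, W, a, b, eps=1e-6):
    """dF/dw_ab by central differences."""
    dW = np.zeros_like(W)
    dW[a, b] = eps
    return (F(W + dW) - F(W - dW)) / (2 * eps)

W = rng.standard_normal((p, q))
# entry (a,b) of d||A||_F^2/dW versus 2 tr((dA/dw_ab)^T A) = 2 sum(dA/dw_ab * A)
lhs = np.array([[dmat(lambda X: (A(X) ** 2).sum(), W, a, b)
                 for b in range(q)] for a in range(p)])
rhs = np.array([[2 * np.sum(dmat(A, W, a, b) * A(W))
                 for b in range(q)] for a in range(p)])
print(np.allclose(lhs, rhs, atol=1e-5))
```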

Derivatives of vector norms

Specializing the matrix to a column vector yields the derivative of a vector norm, an even more common case. The squared 2-norm of a vector is defined by:

$$||\mathbf{X}||_2^2=\mathbf{X}^\mathrm{T}\mathbf{X}$$

Hence, for column vectors $\mathbf{Y}\in\mathbb{C}^{m\times 1}$ and $\mathbf{X}\in\mathbb{C}^{n\times 1}$:

$$\begin{aligned}
\frac{\mathrm{d}||\mathbf{Y}||_2^2}{\mathrm{d}\mathbf{X}}&=\frac{\mathrm{d}\mathbf{Y}^\mathrm{T}}{\mathrm{d}\mathbf{X}}\big(\mathbf{Y}\otimes\mathbf{I}_1\big)+\big(\mathbf{Y}^\mathrm{T}\otimes\mathbf{I}_n\big)\frac{\mathrm{d}\mathbf{Y}}{\mathrm{d}\mathbf{X}}\\
&=\frac{\mathrm{d}\mathbf{Y}^\mathrm{T}}{\mathrm{d}\mathbf{X}}\mathbf{Y}+\mathcal{V}\Big(\mathbf{Y}^\mathrm{T}\frac{\mathrm{d}\mathbf{Y}}{\mathrm{d}\mathbf{X}^\mathrm{T}}\Big)=2\,\frac{\mathrm{d}\mathbf{Y}^\mathrm{T}}{\mathrm{d}\mathbf{X}}\mathbf{Y}
\end{aligned}$$
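Writing $\frac{\mathrm{d}\mathbf{Y}}{\mathrm{d}\mathbf{X}^\mathrm{T}}$ as the ordinary Jacobian $\mathbf{J}$, the result says the gradient of $||\mathbf{Y}(\mathbf{X})||_2^2$ is $2\mathbf{J}^\mathrm{T}\mathbf{Y}$. A NumPy sketch (the helpers `grad` and `jac` are made up for illustration, and $\mathbf{Y}(\mathbf{X})=\tanh(\mathbf{MX})$ is an arbitrary smooth choice):

```python
import numpy as np

rng = np.random.default_rng(7)
m, n = 4, 3
M = rng.standard_normal((m, n))
Y = lambda X: np.tanh(M @ X)  # Y: R^n -> R^m

def grad(f, X, eps=1e-6):
    """Gradient of a scalar function by central differences."""
    g = np.zeros_like(X)
    for i in range(X.size):
        dX = np.zeros_like(X)
        dX[i] = eps
        g[i] = (f(X + dX) - f(X - dX)) / (2 * eps)
    return g

def jac(F, X, eps=1e-6):
    """Jacobian dY/dX^T by central differences, one column per x_i."""
    cols = []
    for i in range(X.size):
        dX = np.zeros_like(X)
        dX[i] = eps
        cols.append((F(X + dX) - F(X - dX)) / (2 * eps))
    return np.stack(cols, axis=1)

X = rng.standard_normal(n)
lhs = grad(lambda Z: (Y(Z) ** 2).sum(), X)
rhs = 2 * jac(Y, X).T @ Y(X)  # 2 (dY^T/dX) Y
print(np.allclose(lhs, rhs, atol=1e-5))
```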

Table of common matrix derivatives

Let $\mathbf{A},\mathbf{B},\mathbf{D},\cdots$ be complex matrices of compatible sizes, and let $\mathbf{X},\mathbf{Y},\mathbf{Z}\in\mathbb{C}^{q\times 1}$ be column vectors.

Linear forms

$$\begin{aligned}
\frac{\mathrm{d}(\mathbf{AX})}{\mathrm{d}\mathbf{X}}&=\mathcal{V}(\mathbf{A})\\
\frac{\mathrm{d}(\mathbf{AX})}{\mathrm{d}\mathbf{X}^\mathrm{T}}&=\mathbf{A}\\
\frac{\mathrm{d}\mathbf{X}}{\mathrm{d}\mathbf{X}}&=\mathcal{V}(\mathbf{I}_q)\\
\frac{\mathrm{d}(\mathbf{AX})}{\mathrm{d}\mathbf{A}}&=\mathcal{V}(\mathbf{I}_m)\mathcal{V}^\mathrm{T}(\mathbf{X}^\mathrm{T})\\
\frac{\mathrm{d}(\mathbf{A}_{m\times q}\mathbf{X})}{\mathrm{d}\mathbf{A}^\mathrm{T}}&=\mathbf{X}\otimes\mathbf{I}_m\\
\frac{\mathrm{d}(\mathbf{AB})}{\mathrm{d}\mathbf{A}}&=\mathcal{V}(\mathbf{I}_m)\mathcal{V}^\mathrm{T}(\mathbf{B}^\mathrm{T})
\end{aligned}$$

Quadratic forms

$$\begin{aligned}
\frac{\mathrm{d}(\mathbf{X}^\mathrm{T}\mathbf{AX})}{\mathrm{d}\mathbf{X}}&=(\mathbf{A}+\mathbf{A}^\mathrm{T})\mathbf{X}\\
\frac{\mathrm{d}(\mathbf{X}^\mathrm{T}\mathbf{AX})}{\mathrm{d}\mathbf{X}^\mathrm{T}}&=\mathbf{X}^\mathrm{T}(\mathbf{A}+\mathbf{A}^\mathrm{T})\\
\frac{\mathrm{d}(\mathbf{X}^\mathrm{T}\mathbf{AX})}{\mathrm{d}\mathbf{A}}&=(\mathbf{X}^\mathrm{T}\otimes\mathbf{I}_q)\frac{\mathrm{d}\mathbf{A}}{\mathrm{d}\mathbf{A}}(\mathbf{X}\otimes\mathbf{I}_q)=\mathbf{X}\mathbf{X}^\mathrm{T}\\
\frac{\mathrm{d}(\mathbf{X}^\mathrm{T}\mathbf{AY})}{\mathrm{d}\mathbf{A}}&=\mathbf{X}\mathbf{Y}^\mathrm{T}
\end{aligned}$$
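Three of these table entries have direct finite-difference checks (a NumPy sketch with random data; the construction of the Jacobian and gradients here is an illustration, not part of the table):

```python
import numpy as np

rng = np.random.default_rng(8)
q_ = 3
A = rng.standard_normal((q_, q_))
X = rng.standard_normal(q_)
eps = 1e-6

# Jacobian of AX with respect to X^T is A itself
J = np.stack([(A @ (X + eps * e) - A @ (X - eps * e)) / (2 * eps)
              for e in np.eye(q_)], axis=1)
print(np.allclose(J, A, atol=1e-6))

# gradient of the quadratic form X^T A X is (A + A^T) X
g = np.array([((X + eps * e) @ A @ (X + eps * e)
               - (X - eps * e) @ A @ (X - eps * e)) / (2 * eps)
              for e in np.eye(q_)])
print(np.allclose(g, (A + A.T) @ X, atol=1e-4))

# entrywise derivative of X^T A X with respect to A is X X^T
G = np.array([[(X @ (A + eps * np.outer(ei, ej)) @ X
                - X @ (A - eps * np.outer(ei, ej)) @ X) / (2 * eps)
               for ej in np.eye(q_)] for ei in np.eye(q_)])
print(np.allclose(G, np.outer(X, X), atol=1e-4))
```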

Trace forms

Using the compressing property and the cyclic property of the trace, derivatives of certain norms with respect to a matrix can be simplified; derivatives of norms with respect to a vector can be computed directly with the product rule. For example:

$$\begin{aligned}
\frac{\mathrm{d}||\mathbf{A}||_F^2}{\mathrm{d}\mathbf{A}}&=\Big[\mathrm{tr}\Big(\frac{\mathrm{d}\mathbf{A}^\mathrm{T}}{\mathrm{d}a_{ij}}\mathbf{A}+\mathbf{A}^\mathrm{T}\frac{\mathrm{d}\mathbf{A}}{\mathrm{d}a_{ij}}\Big)\Big]_{m\times n}=2\mathbf{A}\\
\frac{\mathrm{d}[\mathrm{tr}(\mathbf{AB})]}{\mathrm{d}\mathbf{A}}&=\frac{\mathrm{d}[\mathrm{tr}(\mathbf{BA})]}{\mathrm{d}\mathbf{A}}=\mathbf{B}^\mathrm{T}\\
\frac{\mathrm{d}||\mathbf{AX}||_2^2}{\mathrm{d}\mathbf{A}}&=\Big[\mathrm{tr}\Big(\frac{\mathrm{d}(\mathbf{X}^\mathrm{T}\mathbf{A}^\mathrm{T})}{\mathrm{d}a_{ij}}\mathbf{AX}+\mathbf{X}^\mathrm{T}\mathbf{A}^\mathrm{T}\frac{\mathrm{d}(\mathbf{AX})}{\mathrm{d}a_{ij}}\Big)\Big]_{m\times n}=2\mathbf{AX}\mathbf{X}^\mathrm{T}
\end{aligned}$$

Remark: the following identity provides the simplification used in each line above:

$$\begin{aligned}
\mathrm{tr}\Big(\frac{\mathrm{d}\mathbf{A}}{\mathrm{d}a_{ij}}\mathbf{B}\Big)&=\mathrm{tr}(\mathbf{E}_{ij}\mathbf{B})=b_{ji}\\
\Big[\mathrm{tr}\Big(\frac{\mathrm{d}\mathbf{A}}{\mathrm{d}a_{ij}}\mathbf{B}\Big)\Big]_{m\times n}&=\mathbf{B}^\mathrm{T}
\end{aligned}$$

where $\mathbf{E}_{ij}$ denotes the matrix with a 1 in position $(i,j)$ and zeros elsewhere.
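The trace-form derivatives can all be evaluated numerically with one entrywise helper (a NumPy sketch; `grad` is a made-up finite-difference routine, and the data is random). It confirms $\frac{\mathrm{d}||\mathbf{A}||_F^2}{\mathrm{d}\mathbf{A}}=2\mathbf{A}$, $\frac{\mathrm{d}[\mathrm{tr}(\mathbf{AB})]}{\mathrm{d}\mathbf{A}}=\mathbf{B}^\mathrm{T}$, and $\frac{\mathrm{d}||\mathbf{AX}||_2^2}{\mathrm{d}\mathbf{A}}=2\mathbf{AX}\mathbf{X}^\mathrm{T}$:

```python
import numpy as np

rng = np.random.default_rng(9)
m, n = 3, 3
A, B = rng.standard_normal((m, n)), rng.standard_normal((n, m))
X = rng.standard_normal(n)

def grad(f, A, eps=1e-6):
    """Entrywise derivative [df/da_ij] of a scalar function by central differences."""
    G = np.zeros_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            E = np.zeros_like(A)
            E[i, j] = eps
            G[i, j] = (f(A + E) - f(A - E)) / (2 * eps)
    return G

print(np.allclose(grad(lambda M: (M ** 2).sum(), A), 2 * A, atol=1e-4))
print(np.allclose(grad(lambda M: np.trace(M @ B), A), B.T, atol=1e-4))
print(np.allclose(grad(lambda M: ((M @ X) ** 2).sum(), A),
                  2 * np.outer(A @ X, X), atol=1e-4))
```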