In some deep neural networks (DNNs), we may need to implement matrix multiplication, i.e., $\mathbf{Z}=\mathbf{XY}$. In the popular Caffe library, the closest implementation of matrix multiplication is its InnerProduct layer, i.e., $\mathbf{z=Wx+b}$. The difference, however, is that the weight matrix $\mathbf{W} \in \mathbb{R}^{M \times N}$ has to be a parameter blob associated with the InnerProduct layer, rather than a bottom input blob of the layer. Emulating matrix multiplication in standard Caffe (as of April 2017) is doable, but neither straightforward nor efficient, as explained in several stackoverflow discussions [1, 2].
Since I’m not satisfied with such a workaround, I’ll describe below how to directly implement a matrix multiplication layer (in C++, for both CPU and GPU).
Assuming a basic understanding of backward propagation for training DNNs with stochastic gradient descent (SGD), we first need to work out the partial derivatives.
Without loss of generality [3], we have a loss function $l: \mathbb{R}^{M \times N} \to \mathbb{R}$. It accepts an input matrix $\mathbf{Z} \in \mathbb{R}^{M \times N}$, which is itself a matrix function performing matrix multiplication of two input matrices $\mathbf{X} \in \mathbb{R}^{M \times K}$ and $\mathbf{Y} \in \mathbb{R}^{K \times N}$, i.e., $\mathbf{Z=XY}$.
Now we want to find the partial derivatives of the loss function w.r.t. both $\mathbf{X}$ and $\mathbf{Y}$, i.e., $\underset{K \times M}{\frac{\partial l(\mathbf{Z})}{\partial \mathbf{X}}}$ and $\underset{N \times K}{\frac{\partial l(\mathbf{Z})}{\partial \mathbf{Y}}}$, as expressions of the loss function’s partial derivative w.r.t. its direct input $\mathbf{Z}$, i.e., $\underset{N \times M}{\frac{\partial l(\mathbf{Z})}{\partial \mathbf{Z}}}$, which is assumed known [4].
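To make the numerator-layout convention [4] concrete (it is what determines the shapes annotated under the derivatives above): a scalar-by-matrix derivative is laid out like the transpose of the matrix it is taken with respect to,

\begin{equation*}
\left(\frac{\partial l}{\partial \mathbf{X}}\right)_{ji} = \frac{\partial l}{\partial X_{ij}},
\quad \text{so that } \mathbf{X} \in \mathbb{R}^{M \times K} \text{ gives } \frac{\partial l}{\partial \mathbf{X}} \in \mathbb{R}^{K \times M}.
\end{equation*}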
Since $\frac{\partial l(\mathbf{Z})}{\partial \mathbf{X}}$ is just a set of scalar-by-scalar partial derivatives arranged in a matrix layout, we can consider one entry and expand it using the chain rule and a scalar-by-matrix derivative identity [5]:

\begin{align}
\frac{\partial l(\mathbf{Z})}{\partial X_{ij}}
&= \operatorname{tr}\left[\frac{\partial l(\mathbf{Z})}{\partial \mathbf{Z}} \, \frac{\partial \mathbf{Z}}{\partial X_{ij}}\right] \nonumber \\
&= \operatorname{tr}\left[\frac{\partial l(\mathbf{Z})}{\partial \mathbf{Z}} \, \mathbf{J}^{ij} \, \mathbf{Y}\right] \label{eq:partial_Z_partial_Xij} \\
&= \operatorname{tr}\left[\mathbf{J}^{ij} \, \mathbf{Y} \, \frac{\partial l(\mathbf{Z})}{\partial \mathbf{Z}}\right] \label{eq:partial_l_partial_Xij} \\
&= \operatorname{tr}\left[\mathbf{J}^{ij} \, \mathbf{A}\right], \quad \text{where } \underset{K \times M}{\mathbf{A}} = \mathbf{Y} \, \frac{\partial l(\mathbf{Z})}{\partial \mathbf{Z}} \label{eq:A} \\
&= A_{ji} \label{eq:A_ji}
\end{align}
Note that equation $\eqref{eq:partial_Z_partial_Xij}$ comes from expanding $\frac{\partial \mathbf{Z}}{\partial X_{ij}} = \frac{\partial (\mathbf{XY})}{\partial X_{ij}} = \mathbf{J}^{ij}\mathbf{Y}$ using the chain rule [6], where the single-entry matrix $\underset{M \times K}{\mathbf{J}^{ij}}$ is $1$ at the $(i,j)$-th entry and $0$ everywhere else. And equation $\eqref{eq:partial_l_partial_Xij}$ comes from the cyclic property of the trace of a product of matrices [7].
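For completeness, the two trace facts used in the derivation are the cyclic invariance of the trace and the contraction identity for a single-entry matrix:

\begin{equation*}
\operatorname{tr}\left[\mathbf{ABC}\right] = \operatorname{tr}\left[\mathbf{BCA}\right] = \operatorname{tr}\left[\mathbf{CAB}\right],
\qquad
\operatorname{tr}\left[\mathbf{J}^{ij}\mathbf{A}\right] = \sum_{k,m} J^{ij}_{km} A_{mk} = A_{ji}.
\end{equation*}

Applying the second identity with $\mathbf{A} = \mathbf{Y}\,\frac{\partial l(\mathbf{Z})}{\partial \mathbf{Z}}$ gives equation $\eqref{eq:A_ji}$ directly.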
From equations $\eqref{eq:A}$ and $\eqref{eq:A_ji}$, and recalling that in numerator layout $\left(\frac{\partial l(\mathbf{Z})}{\partial \mathbf{X}}\right)_{ji} = \frac{\partial l(\mathbf{Z})}{\partial X_{ij}}$, one can easily conclude that:

\begin{equation}
\underset{K \times M}{\frac{\partial l(\mathbf{Z})}{\partial \mathbf{X}}} = \mathbf{A} = \mathbf{Y} \, \frac{\partial l(\mathbf{Z})}{\partial \mathbf{Z}} \label{eq:partial_l_partial_X}
\end{equation}
And similarly:

\begin{equation}
\underset{N \times K}{\frac{\partial l(\mathbf{Z})}{\partial \mathbf{Y}}} = \frac{\partial l(\mathbf{Z})}{\partial \mathbf{Z}} \, \mathbf{X} \label{eq:partial_l_partial_Y}
\end{equation}
In Caffe, each variable (e.g., $\mathbf{X}$) is stored as a blob, and each blob contains two parts of values: data and diff. data stores the variable itself (i.e., $\mathbf{X}$), while diff stores the partial derivatives of the DNN’s loss function $l$ w.r.t. the variable (i.e., $\left(\frac{\partial l}{\partial \mathbf{X}}\right)^T$). Note that we are transposing the partial-derivative matrix here: in Caffe, diff has the same shape as data, whereas in numerator layout $\frac{\partial l}{\partial \mathbf{X}}$ is a $K \times M$ matrix whose entries correspond to those of the transpose of $\mathbf{X}$, which itself has shape $M \times K$.
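As a minimal sketch of this convention using Caffe’s `Blob` API (the shape values here are purely illustrative):

```cpp
#include <vector>

#include "caffe/blob.hpp"

// A blob holding X (M x K): "data" stores X itself, while "diff" stores (dl/dX)^T
// laid out in the same M x K shape, so both parts always have the same count.
void blob_layout_sketch() {
  const int M = 4, K = 3;
  std::vector<int> shape(2);
  shape[0] = M;
  shape[1] = K;
  caffe::Blob<float> X(shape);
  CHECK_EQ(X.count(), M * K);              // same count for data and diff
  const float* x_data = X.cpu_data();      // read the values of X
  float* x_diff = X.mutable_cpu_diff();    // write the gradient w.r.t. X here
  CHECK(x_data != NULL);
  CHECK(x_diff != NULL);
}
```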
With these in mind, we can re-write equations $\eqref{eq:partial_l_partial_X}$ and $\eqref{eq:partial_l_partial_Y}$ in terms of the blobs’ data and diff parts:

\begin{equation}
\mathbf{X}\text{.diff} = \left(\frac{\partial l}{\partial \mathbf{X}}\right)^T = \left(\frac{\partial l}{\partial \mathbf{Z}}\right)^T \mathbf{Y}^T = \mathbf{Z}\text{.diff} \times \left(\mathbf{Y}\text{.data}\right)^T \label{eq:X_diff}
\end{equation}

\begin{equation}
\mathbf{Y}\text{.diff} = \left(\frac{\partial l}{\partial \mathbf{Y}}\right)^T = \mathbf{X}^T \left(\frac{\partial l}{\partial \mathbf{Z}}\right)^T = \left(\mathbf{X}\text{.data}\right)^T \times \mathbf{Z}\text{.diff} \label{eq:Y_diff}
\end{equation}
Using equations $\eqref{eq:X_diff}$ and $\eqref{eq:Y_diff}$, we can write the backward propagation function (forward propagation is trivial) as the following pseudocode:
```cpp
backward(bottom, top) {
  // dl/dX' = dl/d(XY)' * Y', i.e., bottom[0].diff = top[0].diff * bottom[1].data'
  caffe_cpu_gemm(CblasNoTrans, CblasTrans, M, K, N, 1, top[0].diff, bottom[1].data, 0, bottom[0].diff);
  // dl/dY' = X' * dl/d(XY)', i.e., bottom[1].diff = bottom[0].data' * top[0].diff
  caffe_cpu_gemm(CblasTrans, CblasNoTrans, K, N, M, 1, bottom[0].data, top[0].diff, 0, bottom[1].diff);
}
```
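Fleshed out against Caffe’s actual layer interface, a minimal CPU-only sketch could look as follows. The class name `MatrixMultiplicationLayer` and the cached shape members `M_`, `K_`, `N_` are assumptions made for illustration (the GPU code, `LayerSetUp`, and layer registration are omitted), and this is not necessarily how the full implementation mentioned below is organized:

```cpp
#include <vector>

#include "caffe/blob.hpp"
#include "caffe/layer.hpp"
#include "caffe/util/math_functions.hpp"

namespace caffe {

// Assumed layer: bottom[0] is X (M_ x K_), bottom[1] is Y (K_ x N_), top[0] is Z = XY.
template <typename Dtype>
class MatrixMultiplicationLayer : public Layer<Dtype> {
 public:
  explicit MatrixMultiplicationLayer(const LayerParameter& param)
      : Layer<Dtype>(param) {}
  virtual void Reshape(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual inline const char* type() const { return "MatrixMultiplication"; }
  virtual inline int ExactNumBottomBlobs() const { return 2; }
  virtual inline int ExactNumTopBlobs() const { return 1; }

 protected:
  virtual void Forward_cpu(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual void Backward_cpu(const vector<Blob<Dtype>*>& top,
      const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);

  int M_, K_, N_;  // X is M_ x K_, Y is K_ x N_, Z is M_ x N_
};

template <typename Dtype>
void MatrixMultiplicationLayer<Dtype>::Reshape(
    const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top) {
  M_ = bottom[0]->shape(0);
  K_ = bottom[0]->shape(1);
  N_ = bottom[1]->shape(1);
  CHECK_EQ(K_, bottom[1]->shape(0)) << "Inner dimensions of X and Y must match.";
  vector<int> top_shape(2);
  top_shape[0] = M_;
  top_shape[1] = N_;
  top[0]->Reshape(top_shape);
}

template <typename Dtype>
void MatrixMultiplicationLayer<Dtype>::Forward_cpu(
    const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top) {
  // Z = X * Y, i.e., top[0].data = bottom[0].data * bottom[1].data
  caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, M_, N_, K_, Dtype(1),
      bottom[0]->cpu_data(), bottom[1]->cpu_data(), Dtype(0),
      top[0]->mutable_cpu_data());
}

template <typename Dtype>
void MatrixMultiplicationLayer<Dtype>::Backward_cpu(
    const vector<Blob<Dtype>*>& top, const vector<bool>& propagate_down,
    const vector<Blob<Dtype>*>& bottom) {
  if (propagate_down[0]) {
    // Equation (X_diff): bottom[0].diff = top[0].diff * bottom[1].data'
    caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasTrans, M_, K_, N_, Dtype(1),
        top[0]->cpu_diff(), bottom[1]->cpu_data(), Dtype(0),
        bottom[0]->mutable_cpu_diff());
  }
  if (propagate_down[1]) {
    // Equation (Y_diff): bottom[1].diff = bottom[0].data' * top[0].diff
    caffe_cpu_gemm<Dtype>(CblasTrans, CblasNoTrans, K_, N_, M_, Dtype(1),
        bottom[0]->cpu_data(), top[0]->cpu_diff(), Dtype(0),
        bottom[1]->mutable_cpu_diff());
  }
}

}  // namespace caffe
```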
Of course, we need to take care of batched operations (see the sketch below) and many other implementation details, following these wonderful tutorials [8, 9]. A full implementation can be found in my own version of Caffe.
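For example, if the two bottom blobs each hold a batch of $B$ independent matrices (shapes $B \times M \times K$ and $B \times K \times N$, producing a $B \times M \times N$ top), one simple way to batch the computation, sketched under those assumptions and not necessarily how the full implementation does it, is to loop over the batch and advance the raw pointers by one matrix per step:

```cpp
#include "caffe/util/math_functions.hpp"

namespace caffe {

// Batched forward pass: Z[b] = X[b] * Y[b] for b = 0 .. B-1, with each blob storing
// its B matrices contiguously in row-major order.
template <typename Dtype>
void batched_matmul_forward(const Dtype* X, const Dtype* Y, Dtype* Z,
                            const int B, const int M, const int K, const int N) {
  for (int b = 0; b < B; ++b) {
    caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, M, N, K, Dtype(1),
        X + b * M * K, Y + b * K * N, Dtype(0), Z + b * M * N);
  }
}

}  // namespace caffe
```

The backward pass is batched the same way, applying the two gemm calls of the pseudocode above to each sample’s slice of data and diff.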
[3] $l$ does NOT need to be just the final loss-function layer of the DNN; it can be the composition of the final loss function with all the layers after the matrix multiplication layer, which is still a scalar-valued function. This is therefore a general formulation.
[4] We are sticking to the numerator-layout notation for matrix calculus.
[5] https://en.wikipedia.org/wiki/Matrix_calculus#Scalar-by-matrix_identities
[6] See equations 73-74 in section 2.4.1 of The Matrix Cookbook [10].
[8] https://github.com/BVLC/caffe/wiki/Simple-Example:-Sin-Layer
[10] Petersen, Kaare Brandt, and Pedersen, Michael Syskind. The Matrix Cookbook.