How to Write a Matrix Multiplication Layer in Caffe?

Published: 17 Apr 2017 | Updated: 18 Apr 2017 | Tags: math code note caffe

In some deep neural networks (DNNs), we may need to implement matrix multiplication, i.e., $\mathbf{Z}=\mathbf{XY}$. In the popular Caffe library, the closest thing to matrix multiplication is its InnerProduct layer, i.e., $\mathbf{z=Wx+b}$. The difference, however, is that the weight matrix $\mathbf{W} \in \mathbb{R}^{M \times N}$ has to be a parameter blob associated with the InnerProduct layer, rather than a bottom input blob of the layer. Emulating matrix multiplication in standard Caffe (as of April 2017) is doable, but neither straightforward nor efficient, as explained in several Stack Overflow discussions[^1][^2].

Since I’m not satisfied with such a solution, I’ll describe below how to directly implement a matrix multiplication layer (in C++, for both CPU and GPU).

Math

Assuming a basic understanding of backward propagation for training DNNs with stochastic gradient descent (SGD), we first need to work out the partial derivatives.

Definitions

Without loss of generality[^3], we have a loss function $l: \mathbb{R}^{M \times N} \to \mathbb{R}$. It accepts an input matrix $\mathbf{Z} \in \mathbb{R}^{M \times N}$, which is itself the matrix product of two input matrices $\mathbf{X} \in \mathbb{R}^{M \times K}$ and $\mathbf{Y} \in \mathbb{R}^{K \times N}$, i.e., $\mathbf{Z=XY}$.

Now we want to find the partial derivatives of the loss function w.r.t. both $\mathbf{X}$ and $\mathbf{Y}$, i.e., $\underset{K \times M}{\frac{\partial l(\mathbf{Z})}{\partial \mathbf{X}}}$ and $\underset{N \times K}{\frac{\partial l(\mathbf{Z})}{\partial \mathbf{Y}}}$, expressed in terms of the loss function’s partial derivative w.r.t. its direct input $\mathbf{Z}$, i.e., $\underset{N \times M}{\frac{\partial l(\mathbf{Z})}{\partial \mathbf{Z}}}$, which is assumed known[^4].
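
To make the shape bookkeeping concrete, here is a small example of my own (the numbers are not part of the derivation) with $M=2$, $K=3$, $N=4$, using the numerator-layout convention throughout:

$$
\mathbf{X} \in \mathbb{R}^{2 \times 3}, \quad
\mathbf{Y} \in \mathbb{R}^{3 \times 4}, \quad
\mathbf{Z}=\mathbf{XY} \in \mathbb{R}^{2 \times 4}, \quad
\frac{\partial l(\mathbf{Z})}{\partial \mathbf{Z}} \in \mathbb{R}^{4 \times 2}, \quad
\frac{\partial l(\mathbf{Z})}{\partial \mathbf{X}} \in \mathbb{R}^{3 \times 2}, \quad
\frac{\partial l(\mathbf{Z})}{\partial \mathbf{Y}} \in \mathbb{R}^{4 \times 3}.
$$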

Derivations

Since $\frac{\partial l(\mathbf{Z})}{\partial \mathbf{X}}$ is just a set of scalar-by-scalar partial derivatives arranged in a matrix layout, we can consider one entry and expand it using the chain rule and a scalar-by-matrix derivative identity[^5]:

$$
\begin{align}
\frac{\partial l(\mathbf{Z})}{\partial X_{ij}}
&= \operatorname{tr}\left[ \frac{\partial l(\mathbf{Z})}{\partial \mathbf{Z}} \frac{\partial \mathbf{Z}}{\partial X_{ij}} \right] \\
&= \operatorname{tr}\left[ \frac{\partial l(\mathbf{Z})}{\partial \mathbf{Z}} \mathbf{J}^{ij} \mathbf{Y} \right] \label{eq:partial_Z_partial_Xij} \\
&= \operatorname{tr}\left[ \mathbf{Y} \frac{\partial l(\mathbf{Z})}{\partial \mathbf{Z}} \mathbf{J}^{ij} \right] \label{eq:partial_l_partial_Xij} \\
&= \operatorname{tr}\left[ \mathbf{A} \mathbf{J}^{ij} \right], \quad \text{where } \underset{K \times M}{\mathbf{A}} = \mathbf{Y} \frac{\partial l(\mathbf{Z})}{\partial \mathbf{Z}} \label{eq:A} \\
&= A_{ji} \label{eq:A_ji}
\end{align}
$$

Note that equation $\eqref{eq:partial_Z_partial_Xij}$ comes from expanding $\frac{\partial \mathbf{Z}}{\partial X_{ij}} = \frac{\partial (\mathbf{XY})}{\partial X_{ij}} = \mathbf{J}^{ij}\mathbf{Y}$[^6], where the single-entry matrix $\underset{M \times K}{\mathbf{J}^{ij}}$ is $1$ at the $(i,j)$-th entry and $0$ everywhere else. Equation $\eqref{eq:partial_l_partial_Xij}$ comes from the cyclic property of the trace of a product of matrices[^7], and the last two steps $\eqref{eq:A}$ and $\eqref{eq:A_ji}$ simply name the product $\mathbf{A} = \mathbf{Y}\frac{\partial l(\mathbf{Z})}{\partial \mathbf{Z}}$ and use $\operatorname{tr}\left[\mathbf{A}\mathbf{J}^{ij}\right] = A_{ji}$.
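
For completeness (the Conclusions below summarize this step as “similarly”), the analogous expansion for an entry of $\mathbf{Y}$ goes through the same way, now with a single-entry matrix $\underset{K \times N}{\mathbf{J}^{ij}}$ sized like $\mathbf{Y}$:

$$
\frac{\partial l(\mathbf{Z})}{\partial Y_{ij}}
= \operatorname{tr}\left[ \frac{\partial l(\mathbf{Z})}{\partial \mathbf{Z}} \frac{\partial \mathbf{Z}}{\partial Y_{ij}} \right]
= \operatorname{tr}\left[ \frac{\partial l(\mathbf{Z})}{\partial \mathbf{Z}} \mathbf{X} \mathbf{J}^{ij} \right]
= \left( \frac{\partial l(\mathbf{Z})}{\partial \mathbf{Z}} \mathbf{X} \right)_{ji}
$$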

Conclusions

From equations $\eqref{eq:A}$ and $\eqref{eq:A_ji}$, one can easily conclude that:

$$
\begin{equation}
\underset{K \times M}{\frac{\partial l(\mathbf{Z})}{\partial \mathbf{X}}} = \mathbf{Y} \frac{\partial l(\mathbf{Z})}{\partial \mathbf{Z}} \label{eq:partial_l_partial_X}
\end{equation}
$$

And similarly:

$$
\begin{equation}
\underset{N \times K}{\frac{\partial l(\mathbf{Z})}{\partial \mathbf{Y}}} = \frac{\partial l(\mathbf{Z})}{\partial \mathbf{Z}} \mathbf{X} \label{eq:partial_l_partial_Y}
\end{equation}
$$
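
As a quick sanity check (my own addition, not part of the original derivation), the two formulas can be verified numerically with finite differences. The short standalone C++ program below does this for random matrices, using the arbitrary loss $l(\mathbf{Z}) = \sum_{mn} Z_{mn}^2$, so that $\frac{\partial l}{\partial Z_{mn}} = 2Z_{mn}$; it needs no Caffe code at all.

    // Finite-difference check of dl/dX = Y * dl/dZ and dl/dY = dl/dZ * X
    // (numerator layout), using l(Z) = sum_{mn} Z_{mn}^2.
    #include <algorithm>
    #include <cmath>
    #include <cstdio>
    #include <cstdlib>
    #include <vector>

    static const int M = 2, K = 3, N = 4;

    // Z = X * Y, with X: MxK, Y: KxN, Z: MxN (row-major).
    static void matmul(const std::vector<double>& X, const std::vector<double>& Y,
                       std::vector<double>& Z) {
      for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n) {
          double s = 0.0;
          for (int k = 0; k < K; ++k) s += X[m * K + k] * Y[k * N + n];
          Z[m * N + n] = s;
        }
    }

    // l(Z) = sum_{mn} Z_{mn}^2.
    static double loss(const std::vector<double>& X, const std::vector<double>& Y) {
      std::vector<double> Z(M * N);
      matmul(X, Y, Z);
      double l = 0.0;
      for (double z : Z) l += z * z;
      return l;
    }

    int main() {
      std::vector<double> X(M * K), Y(K * N), Z(M * N);
      for (double& v : X) v = rand() / double(RAND_MAX) - 0.5;
      for (double& v : Y) v = rand() / double(RAND_MAX) - 0.5;
      matmul(X, Y, Z);

      const double eps = 1e-6;
      const double l0 = loss(X, Y);
      double max_err = 0.0;

      // Analytic dl/dX_{ik} = (Y * dl/dZ)_{ki} = sum_n Y_{kn} * 2*Z_{in}.
      for (int i = 0; i < M; ++i)
        for (int k = 0; k < K; ++k) {
          double analytic = 0.0;
          for (int n = 0; n < N; ++n) analytic += Y[k * N + n] * 2.0 * Z[i * N + n];
          std::vector<double> Xp = X;
          Xp[i * K + k] += eps;
          const double numeric = (loss(Xp, Y) - l0) / eps;
          max_err = std::max(max_err, std::fabs(analytic - numeric));
        }

      // Analytic dl/dY_{kj} = (dl/dZ * X)_{jk} = sum_m 2*Z_{mj} * X_{mk}.
      for (int k = 0; k < K; ++k)
        for (int j = 0; j < N; ++j) {
          double analytic = 0.0;
          for (int m = 0; m < M; ++m) analytic += 2.0 * Z[m * N + j] * X[m * K + k];
          std::vector<double> Yp = Y;
          Yp[k * N + j] += eps;
          const double numeric = (loss(X, Yp) - l0) / eps;
          max_err = std::max(max_err, std::fabs(analytic - numeric));
        }

      std::printf("max |analytic - numeric| = %g\n", max_err);
      return 0;
    }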

Code

In Caffe, each variable (e.g., $\mathbf{X}$) is stored as a blob, and each blob holds two sets of values, data and diff: data stores the variable itself (i.e., $\mathbf{X}$), while diff stores the partial derivative of the DNN’s loss function $l$ w.r.t. that variable (i.e., $\left(\frac{\partial l}{\partial \mathbf{X}}\right)^T$). Note that we transpose the partial derivative matrix here, because in Caffe diff and data must have the same shape, whereas in numerator layout $\frac{\partial l}{\partial \mathbf{X}}$ has the transposed shape of $\mathbf{X}$ (e.g., $K \times M$ for our $M \times K$ input $\mathbf{X}$).

With this in mind, we can rewrite equations $\eqref{eq:partial_l_partial_X}$ and $\eqref{eq:partial_l_partial_Y}$ in terms of blob contents as the following equations:

$$
\begin{equation}
\mathbf{X}\text{.diff} = \left(\frac{\partial l}{\partial \mathbf{X}}\right)^{T} = \left(\frac{\partial l}{\partial \mathbf{Z}}\right)^{T} \mathbf{Y}^{T} = \mathbf{Z}\text{.diff} \cdot \left(\mathbf{Y}\text{.data}\right)^{T} \label{eq:X_diff}
\end{equation}
$$

$$
\begin{equation}
\mathbf{Y}\text{.diff} = \left(\frac{\partial l}{\partial \mathbf{Y}}\right)^{T} = \mathbf{X}^{T} \left(\frac{\partial l}{\partial \mathbf{Z}}\right)^{T} = \left(\mathbf{X}\text{.data}\right)^{T} \cdot \mathbf{Z}\text{.diff} \label{eq:Y_diff}
\end{equation}
$$

Using equations $\eqref{eq:X_diff}$ and $\eqref{eq:Y_diff}$, we can write the backward propagation function (the forward propagation is trivial) as the following pseudocode:

    // Pseudocode; caffe_cpu_gemm(TransA, TransB, ..., alpha, A, B, beta, C)
    // computes C = alpha * op(A) * op(B) + beta * C, where the three integer
    // arguments are (rows of C, cols of C, inner dimension).
    backward(bottom, top) {
      // dl/dX' = dl/d(XY)' * Y', i.e., bottom[0].diff = top[0].diff * bottom[1].data'
      caffe_cpu_gemm(CblasNoTrans, CblasTrans, M, K, N, 1, top[0].diff, bottom[1].data, 0, bottom[0].diff);

      // dl/dY' = X' * dl/d(XY)', i.e., bottom[1].diff = bottom[0].data' * top[0].diff
      caffe_cpu_gemm(CblasTrans, CblasNoTrans, K, N, M, 1, bottom[0].data, top[0].diff, 0, bottom[1].diff);
    }
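
For reference, here is a minimal sketch of how the non-batched CPU forward and backward passes might look as actual Caffe layer methods. The class name `MatrixMultiplicationLayer` and the cached shape members `M_`, `K_`, `N_` (which would be set in the layer’s `Reshape`) are my own assumptions, not names from Caffe itself; everything else uses only the standard `Blob` accessors and `caffe_cpu_gemm` from `caffe/util/math_functions.hpp`.

    #include <vector>

    #include "caffe/blob.hpp"
    #include "caffe/layer.hpp"
    #include "caffe/util/math_functions.hpp"

    namespace caffe {

    // Assumed: MatrixMultiplicationLayer<Dtype> is declared elsewhere, deriving
    // from Layer<Dtype>, with int members M_, K_, N_ filled in by Reshape().

    // Forward: Z.data = X.data * Y.data, i.e., top[0] (MxN) = bottom[0] (MxK) * bottom[1] (KxN).
    template <typename Dtype>
    void MatrixMultiplicationLayer<Dtype>::Forward_cpu(
        const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top) {
      caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, M_, N_, K_,
          Dtype(1), bottom[0]->cpu_data(), bottom[1]->cpu_data(),
          Dtype(0), top[0]->mutable_cpu_data());
    }

    // Backward: X.diff = Z.diff * Y.data^T and Y.diff = X.data^T * Z.diff.
    template <typename Dtype>
    void MatrixMultiplicationLayer<Dtype>::Backward_cpu(
        const vector<Blob<Dtype>*>& top, const vector<bool>& propagate_down,
        const vector<Blob<Dtype>*>& bottom) {
      if (propagate_down[0]) {
        caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasTrans, M_, K_, N_,
            Dtype(1), top[0]->cpu_diff(), bottom[1]->cpu_data(),
            Dtype(0), bottom[0]->mutable_cpu_diff());
      }
      if (propagate_down[1]) {
        caffe_cpu_gemm<Dtype>(CblasTrans, CblasNoTrans, K_, N_, M_,
            Dtype(1), bottom[0]->cpu_data(), top[0]->cpu_diff(),
            Dtype(0), bottom[1]->mutable_cpu_diff());
      }
    }

    }  // namespace caffe

The GPU versions would look essentially the same, swapping `caffe_cpu_gemm` for `caffe_gpu_gemm` and the `cpu_data()`/`cpu_diff()` accessors for their `gpu_*` counterparts.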

Of course, we still need to take care of batched operations and many other implementation details, following these wonderful tutorials[^8][^9]. A full implementation can be found in my own version of Caffe.
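
As one illustration of the batching point above (again a sketch under my own assumptions, not the full implementation): if the two bottoms are batched as $B \times M \times K$ and $B \times K \times N$ blobs, the backward pass can simply loop over the batch dimension and offset the raw pointers, e.g. for the $\mathbf{X}$ gradient:

    // Batched variant of the first gemm above: one (M_ x K_) X-gradient per sample.
    // B_ (batch size) is assumed to be cached in Reshape(), like M_, K_, N_.
    for (int b = 0; b < B_; ++b) {
      caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasTrans, M_, K_, N_,
          Dtype(1),
          top[0]->cpu_diff() + b * M_ * N_,              // dl/dZ' for sample b
          bottom[1]->cpu_data() + b * K_ * N_,           // Y for sample b
          Dtype(0),
          bottom[0]->mutable_cpu_diff() + b * M_ * K_);  // dl/dX' for sample b
    }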

[^1]: http://stackoverflow.com/a/41658699/2303236

[^2]: http://stackoverflow.com/a/41580924/2303236

[^3]: $l$ does NOT need to be the final loss function layer of the DNN; it can be the composition of the final loss function with all layers after the matrix multiplication layer, as long as it remains a scalar-valued function. Thus this is a general formulation.

[^4]: We are sticking to the numerator-layout notation for matrix calculus.

[^5]: https://en.wikipedia.org/wiki/Matrix_calculus#Scalar-by-matrix_identities

[^6]: See equations 73-74 in section 2.4.1 of the Matrix Cookbook[^10].

[^7]: See equation 16 in section 1.1 of the Matrix Cookbook[^10].

[^8]: https://github.com/BVLC/caffe/wiki/Simple-Example:-Sin-Layer

[^9]: https://chrischoy.github.io/research/making-caffe-layer/

[^10]: Petersen, Kaare Brandt and Pedersen, Michael Syskind. *The Matrix Cookbook*.