This appendix collects the 20 most-used matrix calculus identities in a single reference. Each identity includes a short derivation sketch so you can see why it holds, not just what it says. Section 21.7 in Chapter 21 has the same identities as a quick-lookup table; come here when you need the reasoning.
29.1.1 Layout Convention
All derivatives use the denominator layout (also called Hessian layout or numerator-layout transposed):
Under this convention the chain rule for a scalar \(L\) via \(f:\mathbb{R}^n\to\mathbb{R}^m\) is \(\nabla_x L = J_f^T \nabla_f L\), matching backpropagation.
29.1.2 Two Derivation Techniques
Technique 1 – Differential form for vector gradients. Write \(df\) as a linear function of \(dx\) and read off the gradient:
\[df = g(x)^T\,dx \implies \nabla_x f = g(x).\]
Technique 2 – Trace inner product for matrix gradients. Write \(df\) as a trace inner product with \(dW\) and read off the gradient:
Both techniques exploit the fact that the differential is unique: matching coefficients gives the derivative directly.
29.1.3 Finite-Difference Verification
Every identity in this appendix is verified numerically with central differences. The three helpers used throughout are:
import numpy as npdef fd_grad(f, x, h=1e-5):"""Central-difference gradient of scalar f at vector x.""" g = np.zeros_like(x, dtype=float)for i inrange(len(x)): xp, xm = x.copy(), x.copy() xp[i] += h; xm[i] -= h g[i] = (f(xp) - f(xm)) / (2* h)return g # shape (n,)def fd_matrix_grad(f, W, h=1e-5):"""Central-difference gradient of scalar f w.r.t. matrix W.""" G = np.zeros_like(W, dtype=float)for i inrange(W.shape[0]):for j inrange(W.shape[1]): Wp, Wm = W.copy(), W.copy() Wp[i, j] += h; Wm[i, j] -= h G[i, j] = (f(Wp) - f(Wm)) / (2* h)return G # shape same as Wdef fd_jac(f, x, h=1e-5):"""Central-difference Jacobian of vector f at vector x.""" fx = f(x) J = np.zeros((len(fx), len(x)), dtype=float)for j inrange(len(x)): xp, xm = x.copy(), x.copy() xp[j] += h; xm[j] -= h J[:, j] = (f(xp) - f(xm)) / (2* h)return J # shape (m, n)print("FD helpers defined.")
FD helpers defined.
29.1.4 Identity Index
ID
Function
Gradient / Jacobian / Hessian
D1
\(a^T x\)
\(a\)
D2
\(x^T A x\)
\((A + A^T)x\)
D3
\(\|Ax - b\|^2\)
\(2A^T(Ax - b)\)
D4
\(\|x\|^2\)
\(2x\)
D5
\((x-\mu)^T\Sigma^{-1}(x-\mu)\)
\(2\Sigma^{-1}(x-\mu)\)
D6
\(\operatorname{tr}(WA)\)
\(A^T\)
D7
\(\operatorname{tr}(W^T A)\)
\(A\)
D8
\(\operatorname{tr}(AWB W^T)\)
\(A^T W B^T + AWB\)
D9
\(a^T W b\)
\(ab^T\)
D10
\(x^T W x\)
\(xx^T\)
D11
\(\|W\|_F^2\)
\(2W\)
D12
\(L(Wx + b)\) (layer loss)
\((\nabla_y L)\,x^T\)
D13
\(Ax + b\)
\(A\) (Jacobian)
D14
\(\varphi(x)\) element-wise
\(\operatorname{diag}(\varphi'(x))\) (Jacobian)
D15
\(x^{\odot 2}\)
\(2\operatorname{diag}(x)\) (Jacobian)
D16
\(x / \|x\|\)
\((I - \hat x\hat x^T)/\|x\|\) (Jacobian)
D17
\(Af(x)\)
\(AJ_f\) (Jacobian)
D18
\(x^T Ax\)
\(A + A^T\) (Hessian)
D19
\(\|Ax - b\|^2\)
\(2A^T A\) (Hessian)
D20
\(\|r(x)\|^2\) nonlinear
\(\approx 2J_r^T J_r\) (Gauss-Newton)
29.2 Scalar-by-Vector Gradients
Identities D1-D5 cover the most common scalar functions of a vector \(x \in \mathbb{R}^n\). All derivations use Technique 1: write \(df = g^T dx\) and read off \(\nabla f = g\).
29.2.1 D1 – Linear form
\[\nabla_x (a^T x) = a\]
Derivation.\(d(a^T x) = a^T\,dx\), so \(g = a\).
Corollary:\(\nabla_x (x^T a) = a\) (same function, transposition is scalar).
import numpy as npdef fd_grad(f, x, h=1e-5): g = np.zeros_like(x, dtype=float)for i inrange(len(x)): xp, xm = x.copy(), x.copy() xp[i] += h; xm[i] -= h g[i] = (f(xp) - f(xm)) / (2* h)return grng = np.random.default_rng(0)n =5a = rng.standard_normal(n) # shape (5,)x = rng.standard_normal(n) # shape (5,)analytical = a # shape (5,)numerical = fd_grad(lambda v: a @ v, x) # shape (5,)print("D1 max error:", np.max(np.abs(analytical - numerical)))assert np.allclose(analytical, numerical, atol=1e-7)print("D1 passed.")
D1 max error: 7.679162861151667e-12
D1 passed.
29.2.2 D2 – Quadratic form
\[\nabla_x (x^T A x) = (A + A^T)x\]
For symmetric \(A\): \(\nabla_x (x^T A x) = 2Ax\).
Identities D6-D12 cover scalar functions of a matrix \(W\). All derivations use Technique 2: write \(df = \operatorname{tr}(G^T dW)\) and read off \(\partial f/\partial W = G\).
Key tool: the cyclic property\(\operatorname{tr}(ABC) = \operatorname{tr}(CAB) = \operatorname{tr}(BCA)\).
\[\frac{\partial}{\partial W}\operatorname{tr}(W^T A) = A\]
Derivation.\[d\,\operatorname{tr}(W^T A) = \operatorname{tr}((dW)^T A) = \operatorname{tr}(A^T\,dW) \cdot \ldots\]
More carefully: \(\operatorname{tr}((dW)^T A) = \operatorname{tr}(A^T dW)\) because \(\operatorname{tr}(B^T C) = \operatorname{tr}(C^T B)\) for any matrices. So \(\operatorname{tr}(G^T dW) = \operatorname{tr}(A^T dW)\) gives \(G = A\).
Derivation.\([\varphi(x)]_i = \varphi(x_i)\), so \(\partial [\varphi(x)]_i/\partial x_j = \varphi'(x_i)\,\delta_{ij}\). The result is diagonal with \(\varphi'(x_i)\) on the diagonal.
Derivation. Let \(r = \|x\|\) and \(u = x/r\). \[du = \frac{dx}{r} - \frac{x}{r^2}\,dr.\] Since \(r = (x^T x)^{1/2}\), we have \(dr = x^T dx / r = \hat{x}^T dx\). \[du = \frac{dx}{r} - \frac{x\,\hat{x}^T}{r^2}\,dx = \frac{1}{r}\bigl(I - \hat{x}\hat{x}^T\bigr)dx.\] The Jacobian is \(J = (I - \hat{x}\hat{x}^T)/r\).
Geometrically, \(I - \hat{x}\hat{x}^T\) is the projection onto the hyperplane orthogonal to \(\hat{x}\): motion along \(x\) does not change \(u\), while transverse motion does.
Identities D18-D20 cover the Hessian \(H_f \in \mathbb{R}^{n\times n}\), where \([H_f]_{ij} = \partial^2 f/\partial x_i\,\partial x_j\). Numerical verification uses second-order central differences: \[[H]_{ij} \approx \frac{f(x + h e_i + h e_j) - f(x + h e_i) - f(x + h e_j) + f(x)}{h^2}.\]
29.5.1 D18 – Hessian of the quadratic form
\[H(x^T Ax) = A + A^T\]
For symmetric \(A\): \(H = 2A\).
Derivation. From D2, \(\nabla(x^T Ax) = (A + A^T)x\). Differentiating again: \(H = J_{\nabla} = A + A^T\) (constant matrix, no \(x\) dependence).
import numpy as npdef fd_hess(f, x, h=1e-4): n =len(x) H = np.zeros((n, n), dtype=float)for i inrange(n):for j inrange(n): xpp = x.copy(); xpp[i] += h; xpp[j] += h xpn = x.copy(); xpn[i] += h xnp = x.copy(); xnp[j] += h H[i, j] = (f(xpp) - f(xpn) - f(xnp) + f(x)) / h**2return Hrng = np.random.default_rng(0)n =4A = rng.standard_normal((n, n)) # shape (4, 4)x = rng.standard_normal(n) # shape (4,)analytical = A + A.T # shape (4, 4)numerical = fd_hess(lambda v: v @ A @ v, x) # shape (4, 4)print("D18 max error:", np.max(np.abs(analytical - numerical)))assert np.allclose(analytical, numerical, atol=1e-5)print("D18 passed.")
D18 max error: 5.865266727855811e-08
D18 passed.
29.5.2 D19 – Hessian of least-squares
\[H(\|Ax - b\|^2) = 2A^T A\]
Derivation. From D3, \(\nabla \|Ax-b\|^2 = 2A^T(Ax-b) = 2A^T A x - 2A^T b\). The Hessian is the Jacobian of this gradient: \[H = J_{2A^T Ax - 2A^Tb} = 2A^T A.\] Note \(A^T A\) is always symmetric positive semidefinite (\(x^T A^T A x = \|Ax\|^2 \geq 0\)).
import numpy as npdef fd_hess(f, x, h=1e-4): n =len(x) H = np.zeros((n, n), dtype=float)for i inrange(n):for j inrange(n): xpp = x.copy(); xpp[i] += h; xpp[j] += h xpn = x.copy(); xpn[i] += h xnp = x.copy(); xnp[j] += h H[i, j] = (f(xpp) - f(xpn) - f(xnp) + f(x)) / h**2return Hrng = np.random.default_rng(1)m, n =6, 4A = rng.standard_normal((m, n)) # shape (6, 4)b = rng.standard_normal(m) # shape (6,)x = rng.standard_normal(n) # shape (4,)analytical =2* A.T @ A # shape (4, 4)numerical = fd_hess(lambda v: np.sum((A @ v - b)**2), x) # shape (4, 4)print("D19 max error:", np.max(np.abs(analytical - numerical)))assert np.allclose(analytical, numerical, atol=1e-5)print("D19 passed.")
D19 max error: 1.6648774173941283e-06
D19 passed.
29.5.3 D20 – Gauss-Newton Hessian approximation
\[H(\|r(x)\|^2) \approx 2J_r^T J_r\]
where \(r:\mathbb{R}^n\to\mathbb{R}^m\) is a nonlinear residual and \(J_r\) is its Jacobian.
Derivation.\(\|r(x)\|^2 = r(x)^T r(x)\). By D2 applied to the gradient of a composition: \[\nabla_x \|r\|^2 = 2J_r^T r(x).\] Taking the Jacobian of this gradient: \[H = 2J_r^T J_r + 2\sum_k r_k(x)\,H_{r_k}.\] The Gauss-Newton approximation drops the second-order term\(\sum_k r_k H_{r_k}\). This is accurate when the residuals \(r_k\) are small (near-optimal) or the curvature \(H_{r_k}\) is small.
The approximation \(H \approx 2J_r^T J_r\) is always positive semidefinite, which guarantees descent directions in nonlinear least-squares solvers.
import numpy as npdef fd_jac(f, x, h=1e-5): fx = f(x) J = np.zeros((len(fx), len(x)), dtype=float)for j inrange(len(x)): xp, xm = x.copy(), x.copy() xp[j] += h; xm[j] -= h J[:, j] = (f(xp) - f(xm)) / (2* h)return Jdef fd_hess(f, x, h=1e-4): n =len(x) H = np.zeros((n, n), dtype=float)for i inrange(n):for j inrange(n): xpp = x.copy(); xpp[i] += h; xpp[j] += h xpn = x.copy(); xpn[i] += h xnp = x.copy(); xnp[j] += h H[i, j] = (f(xpp) - f(xpn) - f(xnp) + f(x)) / h**2return Hrng = np.random.default_rng(2)n =3# Nonlinear residual: r(x) = [sin(x0), x1^2 - 1, exp(x2) - 2]def r(x): return np.array([np.sin(x[0]), x[1]**2-1, np.exp(x[2]) -2])def obj(x): return np.dot(r(x), r(x))x0 = rng.standard_normal(n) # shape (3,)# Gauss-Newton approximationJr = fd_jac(r, x0) # shape (3, 3)H_gn =2* Jr.T @ Jr # shape (3, 3)# True HessianH_true = fd_hess(obj, x0) # shape (3, 3)# Show both -- they differ by the second-order termprint("H_true (exact):\n", H_true.round(4))print("H_GN (approx):\n", H_gn.round(4))print("Difference:\n", (H_true - H_gn).round(4))# Near the minimum the residuals are small; demonstrate with x near zerox_near = np.array([0.0, 1.0, np.log(2.0)]) # r(x_near) ~ 0Jr_near = fd_jac(r, x_near)H_gn_near =2* Jr_near.T @ Jr_nearH_true_near = fd_hess(obj, x_near)print("\nAt near-minimum x:")print("||H_true - H_GN|| =", np.linalg.norm(H_true_near - H_gn_near).round(6))
This is the Jacobian chain rule – the analog of \((g\circ f)' = g' \circ f'\) for scalar functions. For a chain of \(k\) layers, \(J_{h_k \circ \cdots \circ h_1} = J_{h_k}\cdots J_{h_1}\).
29.6.3 Dense Layer: Three Backward Gradients
For a dense layer \(y = Wx + b\) with upstream gradient \(\delta = \nabla_y L\):
Gradient
Formula
Identity used
\(\partial L/\partial W\)
\(\delta\, x^T\)
D12
\(\partial L/\partial x\)
\(W^T \delta\)
Chain rule on D13
\(\partial L/\partial b\)
\(\delta\)
D13 (\(J_b = I\))
Derivation for \(\partial L/\partial x\).\(y = Wx + b\), so \(J_x = W\) (identity D13). By the chain rule: \(\nabla_x L = J_x^T \nabla_y L = W^T \delta\).
dL/dW max error: 4.0547509794208736e-11
dL/dx max error: 6.670264340868926e-11
dL/db max error: 2.6076474313185827e-11
All dense-layer backward gradients passed.
29.6.4 Two-Layer Chain Rule Example
For a two-layer network \(L = \ell(W_2\,\sigma(W_1 x))\) where \(\ell\) is a scalar loss and \(\sigma\) is element-wise sigmoid: