29  Appendix D: Matrix Calculus Identity Reference

29.1 Overview

This appendix collects the 20 most-used matrix calculus identities in a single reference. Each identity includes a short derivation sketch so you can see why it holds, not just what it says. Section 21.7 in Chapter 21 has the same identities as a quick-lookup table; come here when you need the reasoning.


29.1.1 Layout Convention

All derivatives use the denominator layout (also called Hessian layout or numerator-layout transposed):

Expression Shape Rule
\(\nabla_x f\) \((n,1)\) column \([\nabla_x f]_i = \partial f / \partial x_i\)
\(\partial f/\partial W\) same shape as \(W\) \([\partial f/\partial W]_{ij} = \partial f/\partial W_{ij}\)
\(J_f = \partial f/\partial x^T\) \((m,n)\) \([J_f]_{ij} = \partial f_i/\partial x_j\)
\(H_f\) \((n,n)\) \([H_f]_{ij} = \partial^2 f/\partial x_i\,\partial x_j\)

Under this convention the chain rule for a scalar \(L\) via \(f:\mathbb{R}^n\to\mathbb{R}^m\) is \(\nabla_x L = J_f^T \nabla_f L\), matching backpropagation.


29.1.2 Two Derivation Techniques

Technique 1 – Differential form for vector gradients. Write \(df\) as a linear function of \(dx\) and read off the gradient:

\[df = g(x)^T\,dx \implies \nabla_x f = g(x).\]

Technique 2 – Trace inner product for matrix gradients. Write \(df\) as a trace inner product with \(dW\) and read off the gradient:

\[df = \operatorname{tr}(G(W)^T\,dW) \implies \frac{\partial f}{\partial W} = G(W).\]

Both techniques exploit the fact that the differential is unique: matching coefficients gives the derivative directly.


29.1.3 Finite-Difference Verification

Every identity in this appendix is verified numerically with central differences. The three helpers used throughout are:

import numpy as np

def fd_grad(f, x, h=1e-5):
    """Central-difference gradient of scalar f at vector x."""
    g = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        xp, xm = x.copy(), x.copy()
        xp[i] += h; xm[i] -= h
        g[i] = (f(xp) - f(xm)) / (2 * h)
    return g    # shape (n,)

def fd_matrix_grad(f, W, h=1e-5):
    """Central-difference gradient of scalar f w.r.t. matrix W."""
    G = np.zeros_like(W, dtype=float)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            Wp, Wm = W.copy(), W.copy()
            Wp[i, j] += h; Wm[i, j] -= h
            G[i, j] = (f(Wp) - f(Wm)) / (2 * h)
    return G    # shape same as W

def fd_jac(f, x, h=1e-5):
    """Central-difference Jacobian of vector f at vector x."""
    fx = f(x)
    J = np.zeros((len(fx), len(x)), dtype=float)
    for j in range(len(x)):
        xp, xm = x.copy(), x.copy()
        xp[j] += h; xm[j] -= h
        J[:, j] = (f(xp) - f(xm)) / (2 * h)
    return J    # shape (m, n)

print("FD helpers defined.")
FD helpers defined.

29.1.4 Identity Index

ID Function Gradient / Jacobian / Hessian
D1 \(a^T x\) \(a\)
D2 \(x^T A x\) \((A + A^T)x\)
D3 \(\|Ax - b\|^2\) \(2A^T(Ax - b)\)
D4 \(\|x\|^2\) \(2x\)
D5 \((x-\mu)^T\Sigma^{-1}(x-\mu)\) \(2\Sigma^{-1}(x-\mu)\)
D6 \(\operatorname{tr}(WA)\) \(A^T\)
D7 \(\operatorname{tr}(W^T A)\) \(A\)
D8 \(\operatorname{tr}(AWB W^T)\) \(A^T W B^T + AWB\)
D9 \(a^T W b\) \(ab^T\)
D10 \(x^T W x\) \(xx^T\)
D11 \(\|W\|_F^2\) \(2W\)
D12 \(L(Wx + b)\) (layer loss) \((\nabla_y L)\,x^T\)
D13 \(Ax + b\) \(A\) (Jacobian)
D14 \(\varphi(x)\) element-wise \(\operatorname{diag}(\varphi'(x))\) (Jacobian)
D15 \(x^{\odot 2}\) \(2\operatorname{diag}(x)\) (Jacobian)
D16 \(x / \|x\|\) \((I - \hat x\hat x^T)/\|x\|\) (Jacobian)
D17 \(Af(x)\) \(AJ_f\) (Jacobian)
D18 \(x^T Ax\) \(A + A^T\) (Hessian)
D19 \(\|Ax - b\|^2\) \(2A^T A\) (Hessian)
D20 \(\|r(x)\|^2\) nonlinear \(\approx 2J_r^T J_r\) (Gauss-Newton)

29.2 Scalar-by-Vector Gradients

Identities D1-D5 cover the most common scalar functions of a vector \(x \in \mathbb{R}^n\). All derivations use Technique 1: write \(df = g^T dx\) and read off \(\nabla f = g\).


29.2.1 D1 – Linear form

\[\nabla_x (a^T x) = a\]

Derivation. \(d(a^T x) = a^T\,dx\), so \(g = a\).

Corollary: \(\nabla_x (x^T a) = a\) (same function, transposition is scalar).

import numpy as np

def fd_grad(f, x, h=1e-5):
    g = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        xp, xm = x.copy(), x.copy()
        xp[i] += h; xm[i] -= h
        g[i] = (f(xp) - f(xm)) / (2 * h)
    return g

rng = np.random.default_rng(0)
n = 5
a = rng.standard_normal(n)    # shape (5,)
x = rng.standard_normal(n)    # shape (5,)

analytical = a                           # shape (5,)
numerical  = fd_grad(lambda v: a @ v, x) # shape (5,)
print("D1 max error:", np.max(np.abs(analytical - numerical)))
assert np.allclose(analytical, numerical, atol=1e-7)
print("D1 passed.")
D1 max error: 7.679162861151667e-12
D1 passed.

29.2.2 D2 – Quadratic form

\[\nabla_x (x^T A x) = (A + A^T)x\]

For symmetric \(A\): \(\nabla_x (x^T A x) = 2Ax\).

Derivation. \[d(x^T Ax) = (dx)^T Ax + x^T A\,dx = (A^T x)^T dx + (Ax)^T dx \cdot 1\]

Wait – more carefully: \((dx)^T Ax = x^T A^T\,dx\) after transposing the scalar, so \[d(x^T Ax) = x^T A^T\,dx + x^T A\,dx = x^T(A^T + A)\,dx.\] Thus \(g = (A + A^T)x\).

import numpy as np

def fd_grad(f, x, h=1e-5):
    g = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        xp, xm = x.copy(), x.copy()
        xp[i] += h; xm[i] -= h
        g[i] = (f(xp) - f(xm)) / (2 * h)
    return g

rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n))    # shape (4, 4)  general (not symmetric)
x = rng.standard_normal(n)          # shape (4,)

analytical = (A + A.T) @ x                       # shape (4,)
numerical  = fd_grad(lambda v: v @ A @ v, x)     # shape (4,)
print("D2 (general A) max error:", np.max(np.abs(analytical - numerical)))

A_sym = (A + A.T) / 2                            # shape (4, 4)  symmetric
analytical_sym = 2 * A_sym @ x                   # shape (4,)
numerical_sym  = fd_grad(lambda v: v @ A_sym @ v, x)
print("D2 (symmetric A) max error:", np.max(np.abs(analytical_sym - numerical_sym)))
assert np.allclose(analytical, numerical, atol=1e-7)
assert np.allclose(analytical_sym, numerical_sym, atol=1e-7)
print("D2 passed.")
D2 (general A) max error: 3.646971613591177e-12
D2 (symmetric A) max error: 8.434919429589627e-13
D2 passed.

29.2.3 D3 – Least-squares gradient

\[\nabla_x \|Ax - b\|^2 = 2A^T(Ax - b)\]

Derivation. Let \(r = Ax - b\), so \(\|r\|^2 = r^T r\). \[d\|r\|^2 = 2r^T\,dr = 2(Ax-b)^T A\,dx = [2A^T(Ax-b)]^T dx.\] Thus \(\nabla = 2A^T(Ax-b)\).

This is the normal-equation gradient; setting it to zero gives \(A^T Ax = A^T b\).

import numpy as np

def fd_grad(f, x, h=1e-5):
    g = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        xp, xm = x.copy(), x.copy()
        xp[i] += h; xm[i] -= h
        g[i] = (f(xp) - f(xm)) / (2 * h)
    return g

rng = np.random.default_rng(2)
m, n = 6, 4
A = rng.standard_normal((m, n))    # shape (6, 4)
b = rng.standard_normal(m)          # shape (6,)
x = rng.standard_normal(n)          # shape (4,)

analytical = 2 * A.T @ (A @ x - b)                        # shape (4,)
numerical  = fd_grad(lambda v: np.sum((A @ v - b)**2), x)  # shape (4,)
print("D3 max error:", np.max(np.abs(analytical - numerical)))
assert np.allclose(analytical, numerical, atol=1e-7)
print("D3 passed.")
D3 max error: 1.4594792041577875e-10
D3 passed.

29.2.4 D4 – Squared norm

\[\nabla_x \|x\|^2 = 2x\]

Derivation. Special case of D3 with \(A = I\), \(b = 0\): \(\nabla \|x\|^2 = 2I(Ix - 0) = 2x\).

Alternatively from D2: \(\|x\|^2 = x^T I x\) with \(A = I\) (symmetric), so \(\nabla = 2Ix = 2x\).

import numpy as np

def fd_grad(f, x, h=1e-5):
    g = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        xp, xm = x.copy(), x.copy()
        xp[i] += h; xm[i] -= h
        g[i] = (f(xp) - f(xm)) / (2 * h)
    return g

rng = np.random.default_rng(3)
n = 5
x = rng.standard_normal(n)    # shape (5,)

analytical = 2 * x                                  # shape (5,)
numerical  = fd_grad(lambda v: np.dot(v, v), x)     # shape (5,)
print("D4 max error:", np.max(np.abs(analytical - numerical)))
assert np.allclose(analytical, numerical, atol=1e-7)
print("D4 passed.")
D4 max error: 3.980993312779901e-11
D4 passed.

29.2.5 D5 – Mahalanobis distance

\[\nabla_x\,(x - \mu)^T \Sigma^{-1}(x - \mu) = 2\Sigma^{-1}(x - \mu)\]

Derivation. Let \(z = x - \mu\), so \(dz = dx\). From D2 with \(A = \Sigma^{-1}\) (symmetric positive definite): \[\nabla_z (z^T \Sigma^{-1} z) = 2\Sigma^{-1} z = 2\Sigma^{-1}(x - \mu).\]

This gradient appears in the score function of a Gaussian log-likelihood: \(\nabla_x \log \mathcal{N}(x;\mu,\Sigma) = -\Sigma^{-1}(x-\mu)\).

import numpy as np

def fd_grad(f, x, h=1e-5):
    g = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        xp, xm = x.copy(), x.copy()
        xp[i] += h; xm[i] -= h
        g[i] = (f(xp) - f(xm)) / (2 * h)
    return g

rng = np.random.default_rng(4)
n = 4
S = rng.standard_normal((n, n))     # shape (4, 4)
Sigma = S @ S.T + np.eye(n)         # shape (4, 4)  symmetric positive definite
Sinv  = np.linalg.inv(Sigma)        # shape (4, 4)
mu    = rng.standard_normal(n)      # shape (4,)
x     = rng.standard_normal(n)      # shape (4,)

def mahal(v):
    z = v - mu
    return float(z @ Sinv @ z)

analytical = 2 * Sinv @ (x - mu)       # shape (4,)
numerical  = fd_grad(mahal, x)          # shape (4,)
print("D5 max error:", np.max(np.abs(analytical - numerical)))
assert np.allclose(analytical, numerical, atol=1e-7)
print("D5 passed.")
D5 max error: 1.5869972003201838e-11
D5 passed.

29.3 Scalar-by-Matrix Gradients

Identities D6-D12 cover scalar functions of a matrix \(W\). All derivations use Technique 2: write \(df = \operatorname{tr}(G^T dW)\) and read off \(\partial f/\partial W = G\).

Key tool: the cyclic property \(\operatorname{tr}(ABC) = \operatorname{tr}(CAB) = \operatorname{tr}(BCA)\).


29.3.1 D6 – Trace of product

\[\frac{\partial}{\partial W}\operatorname{tr}(WA) = A^T\]

Derivation. \[d\,\operatorname{tr}(WA) = \operatorname{tr}((dW)A) = \operatorname{tr}(A\,dW) = \operatorname{tr}((A^T)^T dW).\] So \(G = A^T\).

Corollary D6b: \(\partial\,\operatorname{tr}(AW)/\partial W = A^T\) (cyclic: \(\operatorname{tr}(AW) = \operatorname{tr}(WA)\)).

import numpy as np

def fd_matrix_grad(f, W, h=1e-5):
    G = np.zeros_like(W, dtype=float)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            Wp, Wm = W.copy(), W.copy()
            Wp[i, j] += h; Wm[i, j] -= h
            G[i, j] = (f(Wp) - f(Wm)) / (2 * h)
    return G

rng = np.random.default_rng(0)
m, n = 3, 4
W = rng.standard_normal((m, n))    # shape (3, 4)
A = rng.standard_normal((n, m))    # shape (4, 3)

analytical = A.T                                       # shape (3, 4)
numerical  = fd_matrix_grad(lambda M: np.trace(M @ A), W)  # shape (3, 4)
print("D6 max error:", np.max(np.abs(analytical - numerical)))
assert np.allclose(analytical, numerical, atol=1e-7)
print("D6 passed.")
D6 max error: 1.9034995801803234e-11
D6 passed.

29.3.2 D7 – Trace with transpose

\[\frac{\partial}{\partial W}\operatorname{tr}(W^T A) = A\]

Derivation. \[d\,\operatorname{tr}(W^T A) = \operatorname{tr}((dW)^T A) = \operatorname{tr}(A^T\,dW) \cdot \ldots\]

More carefully: \(\operatorname{tr}((dW)^T A) = \operatorname{tr}(A^T dW)\) because \(\operatorname{tr}(B^T C) = \operatorname{tr}(C^T B)\) for any matrices. So \(\operatorname{tr}(G^T dW) = \operatorname{tr}(A^T dW)\) gives \(G = A\).

import numpy as np

def fd_matrix_grad(f, W, h=1e-5):
    G = np.zeros_like(W, dtype=float)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            Wp, Wm = W.copy(), W.copy()
            Wp[i, j] += h; Wm[i, j] -= h
            G[i, j] = (f(Wp) - f(Wm)) / (2 * h)
    return G

rng = np.random.default_rng(1)
m, n = 3, 4
W = rng.standard_normal((m, n))    # shape (3, 4)
A = rng.standard_normal((m, n))    # shape (3, 4)

analytical = A                                          # shape (3, 4)
numerical  = fd_matrix_grad(lambda M: np.trace(M.T @ A), W)  # shape (3, 4)
print("D7 max error:", np.max(np.abs(analytical - numerical)))
assert np.allclose(analytical, numerical, atol=1e-7)
print("D7 passed.")
D7 max error: 7.352340958277637e-12
D7 passed.

29.3.3 D8 – Trace of quadratic form in W

\[\frac{\partial}{\partial W}\operatorname{tr}(AWB W^T) = A^T W B^T + AWB\]

Derivation. Split the differential: \[d\,\operatorname{tr}(AWB W^T) = \operatorname{tr}(A(dW)BW^T) + \operatorname{tr}(AWB(dW)^T).\]

Term 1: \(\operatorname{tr}(A\,dW\,BW^T) = \operatorname{tr}(W^T A\,dW\,B)\) (cyclic) \(= \operatorname{tr}(B W^T A\,dW)\) (cyclic again) \(= \operatorname{tr}((A^T W B^T)^T dW)\), so \(G_1 = A^T W B^T\).

Term 2: \(\operatorname{tr}(AWB(dW)^T) = \operatorname{tr}((AWB)^T \cdot (dW)^T \cdot I)\) – actually, use \(\operatorname{tr}(C^T dW) = \operatorname{tr}(C\,dW^T)^T \ldots\) Simpler: transpose the scalar trace: \(\operatorname{tr}(AWB\,(dW)^T) = \operatorname{tr}(dW\,AWB)\) \(= \operatorname{tr}((AWB)^T \cdot \ldots)\) – use \(\operatorname{tr}(X^T Y) = \operatorname{tr}(Y X^T)\): \(\operatorname{tr}(AWB(dW)^T) = \operatorname{tr}((dW)(AWB)) = \operatorname{tr}((AWB)^T dW)^T = \operatorname{tr}((AWB)^T dW)\). Wait, scalars equal their transpose, so \(\operatorname{tr}(AWB(dW)^T) = \operatorname{tr}((AWB(dW)^T)^T) = \operatorname{tr}(dW\,(AWB)^T)\) \(= \operatorname{tr}((AWB) dW^T)\)… Let me be direct:

\(\operatorname{tr}(C D^T) = \operatorname{tr}(D C^T)\) (cyclic on the transposed version). Set \(C = AWB\) and \(D = dW\): \(\operatorname{tr}(AWB\,(dW)^T) = \operatorname{tr}(dW\,(AWB)^T) = \operatorname{tr}((AWB)^T\,dW)\). So \(G_2 = AWB\).

Total: \(G = G_1 + G_2 = A^T W B^T + AWB\).

import numpy as np

def fd_matrix_grad(f, W, h=1e-5):
    G = np.zeros_like(W, dtype=float)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            Wp, Wm = W.copy(), W.copy()
            Wp[i, j] += h; Wm[i, j] -= h
            G[i, j] = (f(Wp) - f(Wm)) / (2 * h)
    return G

rng = np.random.default_rng(2)
m, n = 3, 4
W = rng.standard_normal((m, n))    # shape (3, 4)
A = rng.standard_normal((m, m))    # shape (3, 3)
B = rng.standard_normal((n, n))    # shape (4, 4)

analytical = A.T @ W @ B.T + A @ W @ B       # shape (3, 4)
numerical  = fd_matrix_grad(
    lambda M: np.trace(A @ M @ B @ M.T), W)  # shape (3, 4)
print("D8 max error:", np.max(np.abs(analytical - numerical)))
assert np.allclose(analytical, numerical, atol=1e-7)
print("D8 passed.")
D8 max error: 1.9040746757070792e-10
D8 passed.

29.3.4 D9 – Bilinear form in W

\[\frac{\partial}{\partial W}(a^T W b) = ab^T\]

Derivation. \(d(a^T W b) = a^T(dW)b = \operatorname{tr}(a^T(dW)b) = \operatorname{tr}(ba^T\,dW)\) \(= \operatorname{tr}((ab^T)^T dW)\), so \(G = ab^T\).

This is the rank-1 outer product rule: the gradient of a bilinear form \(a^T W b\) is the outer product \(ab^T\).

import numpy as np

def fd_matrix_grad(f, W, h=1e-5):
    G = np.zeros_like(W, dtype=float)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            Wp, Wm = W.copy(), W.copy()
            Wp[i, j] += h; Wm[i, j] -= h
            G[i, j] = (f(Wp) - f(Wm)) / (2 * h)
    return G

rng = np.random.default_rng(3)
m, n = 3, 4
W = rng.standard_normal((m, n))    # shape (3, 4)
a = rng.standard_normal(m)          # shape (3,)
b = rng.standard_normal(n)          # shape (4,)

analytical = np.outer(a, b)                          # shape (3, 4)
numerical  = fd_matrix_grad(lambda M: a @ M @ b, W)  # shape (3, 4)
print("D9 max error:", np.max(np.abs(analytical - numerical)))
assert np.allclose(analytical, numerical, atol=1e-7)
print("D9 passed.")
D9 max error: 1.514932623791765e-11
D9 passed.

29.3.5 D10 – Quadratic form with shared vector

\[\frac{\partial}{\partial W}(x^T W x) = xx^T\]

Derivation. Special case of D9 with \(a = b = x\): \(d(x^T W x) = \operatorname{tr}((xx^T)^T dW)\), so \(G = xx^T\).

import numpy as np

def fd_matrix_grad(f, W, h=1e-5):
    G = np.zeros_like(W, dtype=float)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            Wp, Wm = W.copy(), W.copy()
            Wp[i, j] += h; Wm[i, j] -= h
            G[i, j] = (f(Wp) - f(Wm)) / (2 * h)
    return G

rng = np.random.default_rng(4)
n = 4
W = rng.standard_normal((n, n))    # shape (4, 4)
x = rng.standard_normal(n)          # shape (4,)

analytical = np.outer(x, x)                          # shape (4, 4)
numerical  = fd_matrix_grad(lambda M: x @ M @ x, W)  # shape (4, 4)
print("D10 max error:", np.max(np.abs(analytical - numerical)))
assert np.allclose(analytical, numerical, atol=1e-7)
print("D10 passed.")
D10 max error: 4.426981004002073e-11
D10 passed.

29.3.6 D11 – Frobenius norm squared

\[\frac{\partial}{\partial W}\|W\|_F^2 = 2W\]

Derivation. \(\|W\|_F^2 = \operatorname{tr}(W^T W)\). \[d\,\operatorname{tr}(W^T W) = \operatorname{tr}((dW)^T W + W^T dW) = \operatorname{tr}(W^T dW) + \operatorname{tr}(W^T dW) = 2\operatorname{tr}(W^T dW).\] So \(G = 2W\).

import numpy as np

def fd_matrix_grad(f, W, h=1e-5):
    G = np.zeros_like(W, dtype=float)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            Wp, Wm = W.copy(), W.copy()
            Wp[i, j] += h; Wm[i, j] -= h
            G[i, j] = (f(Wp) - f(Wm)) / (2 * h)
    return G

rng = np.random.default_rng(5)
m, n = 3, 4
W = rng.standard_normal((m, n))    # shape (3, 4)

analytical = 2 * W                                      # shape (3, 4)
numerical  = fd_matrix_grad(lambda M: np.sum(M**2), W)  # shape (3, 4)
print("D11 max error:", np.max(np.abs(analytical - numerical)))
assert np.allclose(analytical, numerical, atol=1e-7)
print("D11 passed.")
D11 max error: 5.258804502972225e-11
D11 passed.

29.3.7 D12 – Layer loss gradient (backprop)

\[\frac{\partial L}{\partial W} = (\nabla_y L)\,x^T\]

where \(y = Wx + b\), \(L\) is a scalar loss, and \(\nabla_y L \in \mathbb{R}^m\) is the upstream gradient.

Derivation. \(dy = (dW)x\), so \(dL = (\nabla_y L)^T dy = (\nabla_y L)^T (dW)x = \operatorname{tr}((\nabla_y L)^T (dW) x) = \operatorname{tr}(x (\nabla_y L)^T dW) = \operatorname{tr}((\nabla_y L\, x^T)^T dW)\).

So \(\partial L/\partial W = (\nabla_y L)\,x^T\), a rank-1 outer product.

This is the weight-gradient formula in every dense-layer backward pass.

import numpy as np

def fd_matrix_grad(f, W, h=1e-5):
    G = np.zeros_like(W, dtype=float)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            Wp, Wm = W.copy(), W.copy()
            Wp[i, j] += h; Wm[i, j] -= h
            G[i, j] = (f(Wp) - f(Wm)) / (2 * h)
    return G

rng = np.random.default_rng(6)
m, n = 3, 4
W = rng.standard_normal((m, n))    # shape (3, 4)
b = rng.standard_normal(m)          # shape (3,)
x = rng.standard_normal(n)          # shape (4,)

# Use L = sum(y^2)/2 so grad_y L = y
def loss(M):
    y = M @ x + b      # shape (3,)
    return 0.5 * np.dot(y, y)

y     = W @ x + b              # shape (3,)
delta = y                      # grad_y L = y for L = ||y||^2/2
analytical = np.outer(delta, x)    # shape (3, 4)
numerical  = fd_matrix_grad(loss, W)
print("D12 max error:", np.max(np.abs(analytical - numerical)))
assert np.allclose(analytical, numerical, atol=1e-7)
print("D12 passed.")
D12 max error: 1.6325785168191942e-10
D12 passed.

29.4 Vector-by-Vector Jacobians

Identities D13-D17 cover functions \(f:\mathbb{R}^n\to\mathbb{R}^m\). The Jacobian \(J_f \in \mathbb{R}^{m\times n}\) has \([J_f]_{ij} = \partial f_i/\partial x_j\).

All derivations use the indicial approach: write \(f_i\) as a scalar and differentiate with respect to \(x_j\).


29.4.1 D13 – Affine map

\[J_{Ax+b} = A \in \mathbb{R}^{m\times n}\]

Derivation. \([Ax + b]_i = \sum_k A_{ik} x_k + b_i\). \(\partial/\partial x_j = A_{ij}\). So \([J]_{ij} = A_{ij}\), i.e., \(J = A\).

import numpy as np

def fd_jac(f, x, h=1e-5):
    fx = f(x)
    J = np.zeros((len(fx), len(x)), dtype=float)
    for j in range(len(x)):
        xp, xm = x.copy(), x.copy()
        xp[j] += h; xm[j] -= h
        J[:, j] = (f(xp) - f(xm)) / (2 * h)
    return J

rng = np.random.default_rng(0)
m, n = 4, 5
A = rng.standard_normal((m, n))    # shape (4, 5)
b = rng.standard_normal(m)          # shape (4,)
x = rng.standard_normal(n)          # shape (5,)

analytical = A                                    # shape (4, 5)
numerical  = fd_jac(lambda v: A @ v + b, x)      # shape (4, 5)
print("D13 max error:", np.max(np.abs(analytical - numerical)))
assert np.allclose(analytical, numerical, atol=1e-7)
print("D13 passed.")
D13 max error: 1.9626522629323517e-11
D13 passed.

29.4.2 D14 – Element-wise nonlinearity

\[J_{\varphi(x)} = \operatorname{diag}(\varphi'(x)) \in \mathbb{R}^{n\times n}\]

Derivation. \([\varphi(x)]_i = \varphi(x_i)\), so \(\partial [\varphi(x)]_i/\partial x_j = \varphi'(x_i)\,\delta_{ij}\). The result is diagonal with \(\varphi'(x_i)\) on the diagonal.

Common special cases: - Sigmoid \(\sigma(x)\): \(\sigma'(x_i) = \sigma(x_i)(1 - \sigma(x_i))\). - ReLU: \(\varphi'(x_i) = \mathbf{1}[x_i > 0]\). - \(\tanh\): \(\varphi'(x_i) = 1 - \tanh^2(x_i)\).

import numpy as np

def fd_jac(f, x, h=1e-5):
    fx = f(x)
    J = np.zeros((len(fx), len(x)), dtype=float)
    for j in range(len(x)):
        xp, xm = x.copy(), x.copy()
        xp[j] += h; xm[j] -= h
        J[:, j] = (f(xp) - f(xm)) / (2 * h)
    return J

rng = np.random.default_rng(1)
n = 5
x = rng.standard_normal(n)    # shape (5,)

# Sigmoid
def sigmoid(v): return 1.0 / (1.0 + np.exp(-v))

s = sigmoid(x)                                           # shape (5,)
analytical = np.diag(s * (1 - s))                        # shape (5, 5)
numerical  = fd_jac(sigmoid, x)                          # shape (5, 5)
print("D14 (sigmoid) max error:", np.max(np.abs(analytical - numerical)))

# tanh
analytical_tanh = np.diag(1.0 - np.tanh(x)**2)          # shape (5, 5)
numerical_tanh  = fd_jac(np.tanh, x)                     # shape (5, 5)
print("D14 (tanh)    max error:", np.max(np.abs(analytical_tanh - numerical_tanh)))

assert np.allclose(analytical, numerical, atol=1e-7)
assert np.allclose(analytical_tanh, numerical_tanh, atol=1e-7)
print("D14 passed.")
D14 (sigmoid) max error: 2.310290847518104e-12
D14 (tanh)    max error: 2.129085796553909e-11
D14 passed.

29.4.3 D15 – Element-wise square

\[J_{x^{\odot 2}} = 2\operatorname{diag}(x) \in \mathbb{R}^{n\times n}\]

Derivation. Special case of D14 with \(\varphi(t) = t^2\), so \(\varphi'(t) = 2t\). The diagonal entries are \(2x_i\).

import numpy as np

def fd_jac(f, x, h=1e-5):
    fx = f(x)
    J = np.zeros((len(fx), len(x)), dtype=float)
    for j in range(len(x)):
        xp, xm = x.copy(), x.copy()
        xp[j] += h; xm[j] -= h
        J[:, j] = (f(xp) - f(xm)) / (2 * h)
    return J

rng = np.random.default_rng(2)
n = 5
x = rng.standard_normal(n)    # shape (5,)

analytical = 2 * np.diag(x)               # shape (5, 5)
numerical  = fd_jac(lambda v: v**2, x)    # shape (5, 5)
print("D15 max error:", np.max(np.abs(analytical - numerical)))
assert np.allclose(analytical, numerical, atol=1e-7)
print("D15 passed.")
D15 max error: 4.704059364257773e-11
D15 passed.

29.4.4 D16 – Unit-vector normalization

\[J_{x/\|x\|} = \frac{I - \hat{x}\hat{x}^T}{\|x\|} \in \mathbb{R}^{n\times n}\]

where \(\hat{x} = x/\|x\|\).

Derivation. Let \(r = \|x\|\) and \(u = x/r\). \[du = \frac{dx}{r} - \frac{x}{r^2}\,dr.\] Since \(r = (x^T x)^{1/2}\), we have \(dr = x^T dx / r = \hat{x}^T dx\). \[du = \frac{dx}{r} - \frac{x\,\hat{x}^T}{r^2}\,dx = \frac{1}{r}\bigl(I - \hat{x}\hat{x}^T\bigr)dx.\] The Jacobian is \(J = (I - \hat{x}\hat{x}^T)/r\).

Geometrically, \(I - \hat{x}\hat{x}^T\) is the projection onto the hyperplane orthogonal to \(\hat{x}\): motion along \(x\) does not change \(u\), while transverse motion does.

import numpy as np

def fd_jac(f, x, h=1e-5):
    fx = f(x)
    J = np.zeros((len(fx), len(x)), dtype=float)
    for j in range(len(x)):
        xp, xm = x.copy(), x.copy()
        xp[j] += h; xm[j] -= h
        J[:, j] = (f(xp) - f(xm)) / (2 * h)
    return J

rng = np.random.default_rng(3)
n = 4
x = rng.standard_normal(n)    # shape (4,)

r    = np.linalg.norm(x)
xhat = x / r                              # shape (4,)

analytical = (np.eye(n) - np.outer(xhat, xhat)) / r   # shape (4, 4)
numerical  = fd_jac(lambda v: v / np.linalg.norm(v), x)  # shape (4, 4)
print("D16 max error:", np.max(np.abs(analytical - numerical)))
assert np.allclose(analytical, numerical, atol=1e-7)
print("D16 passed.")
D16 max error: 4.571870659830779e-12
D16 passed.

29.4.5 D17 – Left-multiplication

\[J_{Af(x)} = A\,J_f \in \mathbb{R}^{k\times n}\]

where \(A \in \mathbb{R}^{k\times m}\) is constant and \(f:\mathbb{R}^n\to\mathbb{R}^m\).

Derivation. \([Af(x)]_i = \sum_l A_{il} f_l(x)\). \(\partial [Af]_i/\partial x_j = \sum_l A_{il} [J_f]_{lj}\). In matrix form: \(J_{Af} = A J_f\).

This is the matrix form of the scalar rule \(\partial(a^T f)/\partial x = a^T J_f\).

import numpy as np

def fd_jac(f, x, h=1e-5):
    fx = f(x)
    J = np.zeros((len(fx), len(x)), dtype=float)
    for j in range(len(x)):
        xp, xm = x.copy(), x.copy()
        xp[j] += h; xm[j] -= h
        J[:, j] = (f(xp) - f(xm)) / (2 * h)
    return J

rng = np.random.default_rng(4)
n, m, k = 4, 5, 3
A  = rng.standard_normal((k, m))    # shape (3, 5)
B  = rng.standard_normal((m, n))    # shape (5, 4)
x  = rng.standard_normal(n)          # shape (4,)

# f(x) = sigmoid(Bx)
def sigmoid(v): return 1.0 / (1.0 + np.exp(-v))
def f(v): return sigmoid(B @ v)

# J_f = diag(sigma') * B  (chain rule: D14 o D13)
s   = sigmoid(B @ x)                       # shape (5,)
Jf  = np.diag(s * (1 - s)) @ B             # shape (5, 4)

analytical = A @ Jf                              # shape (3, 4)
numerical  = fd_jac(lambda v: A @ f(v), x)      # shape (3, 4)
print("D17 max error:", np.max(np.abs(analytical - numerical)))
assert np.allclose(analytical, numerical, atol=1e-7)
print("D17 passed.")
D17 max error: 2.929265163764683e-11
D17 passed.

29.5 Hessians

Identities D18-D20 cover the Hessian \(H_f \in \mathbb{R}^{n\times n}\), where \([H_f]_{ij} = \partial^2 f/\partial x_i\,\partial x_j\). Numerical verification uses second-order central differences: \[[H]_{ij} \approx \frac{f(x + h e_i + h e_j) - f(x + h e_i) - f(x + h e_j) + f(x)}{h^2}.\]


29.5.1 D18 – Hessian of the quadratic form

\[H(x^T Ax) = A + A^T\]

For symmetric \(A\): \(H = 2A\).

Derivation. From D2, \(\nabla(x^T Ax) = (A + A^T)x\). Differentiating again: \(H = J_{\nabla} = A + A^T\) (constant matrix, no \(x\) dependence).

import numpy as np

def fd_hess(f, x, h=1e-4):
    n = len(x)
    H = np.zeros((n, n), dtype=float)
    for i in range(n):
        for j in range(n):
            xpp = x.copy(); xpp[i] += h; xpp[j] += h
            xpn = x.copy(); xpn[i] += h
            xnp = x.copy(); xnp[j] += h
            H[i, j] = (f(xpp) - f(xpn) - f(xnp) + f(x)) / h**2
    return H

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))    # shape (4, 4)
x = rng.standard_normal(n)          # shape (4,)

analytical = A + A.T                                # shape (4, 4)
numerical  = fd_hess(lambda v: v @ A @ v, x)        # shape (4, 4)
print("D18 max error:", np.max(np.abs(analytical - numerical)))
assert np.allclose(analytical, numerical, atol=1e-5)
print("D18 passed.")
D18 max error: 5.865266727855811e-08
D18 passed.

29.5.2 D19 – Hessian of least-squares

\[H(\|Ax - b\|^2) = 2A^T A\]

Derivation. From D3, \(\nabla \|Ax-b\|^2 = 2A^T(Ax-b) = 2A^T A x - 2A^T b\). The Hessian is the Jacobian of this gradient: \[H = J_{2A^T Ax - 2A^Tb} = 2A^T A.\] Note \(A^T A\) is always symmetric positive semidefinite (\(x^T A^T A x = \|Ax\|^2 \geq 0\)).

import numpy as np

def fd_hess(f, x, h=1e-4):
    n = len(x)
    H = np.zeros((n, n), dtype=float)
    for i in range(n):
        for j in range(n):
            xpp = x.copy(); xpp[i] += h; xpp[j] += h
            xpn = x.copy(); xpn[i] += h
            xnp = x.copy(); xnp[j] += h
            H[i, j] = (f(xpp) - f(xpn) - f(xnp) + f(x)) / h**2
    return H

rng = np.random.default_rng(1)
m, n = 6, 4
A = rng.standard_normal((m, n))    # shape (6, 4)
b = rng.standard_normal(m)          # shape (6,)
x = rng.standard_normal(n)          # shape (4,)

analytical = 2 * A.T @ A                                       # shape (4, 4)
numerical  = fd_hess(lambda v: np.sum((A @ v - b)**2), x)      # shape (4, 4)
print("D19 max error:", np.max(np.abs(analytical - numerical)))
assert np.allclose(analytical, numerical, atol=1e-5)
print("D19 passed.")
D19 max error: 1.6648774173941283e-06
D19 passed.

29.5.3 D20 – Gauss-Newton Hessian approximation

\[H(\|r(x)\|^2) \approx 2J_r^T J_r\]

where \(r:\mathbb{R}^n\to\mathbb{R}^m\) is a nonlinear residual and \(J_r\) is its Jacobian.

Derivation. \(\|r(x)\|^2 = r(x)^T r(x)\). By D2 applied to the gradient of a composition: \[\nabla_x \|r\|^2 = 2J_r^T r(x).\] Taking the Jacobian of this gradient: \[H = 2J_r^T J_r + 2\sum_k r_k(x)\,H_{r_k}.\] The Gauss-Newton approximation drops the second-order term \(\sum_k r_k H_{r_k}\). This is accurate when the residuals \(r_k\) are small (near-optimal) or the curvature \(H_{r_k}\) is small.

The approximation \(H \approx 2J_r^T J_r\) is always positive semidefinite, which guarantees descent directions in nonlinear least-squares solvers.

import numpy as np

def fd_jac(f, x, h=1e-5):
    fx = f(x)
    J = np.zeros((len(fx), len(x)), dtype=float)
    for j in range(len(x)):
        xp, xm = x.copy(), x.copy()
        xp[j] += h; xm[j] -= h
        J[:, j] = (f(xp) - f(xm)) / (2 * h)
    return J

def fd_hess(f, x, h=1e-4):
    n = len(x)
    H = np.zeros((n, n), dtype=float)
    for i in range(n):
        for j in range(n):
            xpp = x.copy(); xpp[i] += h; xpp[j] += h
            xpn = x.copy(); xpn[i] += h
            xnp = x.copy(); xnp[j] += h
            H[i, j] = (f(xpp) - f(xpn) - f(xnp) + f(x)) / h**2
    return H

rng = np.random.default_rng(2)
n = 3

# Nonlinear residual: r(x) = [sin(x0), x1^2 - 1, exp(x2) - 2]
def r(x): return np.array([np.sin(x[0]), x[1]**2 - 1, np.exp(x[2]) - 2])
def obj(x): return np.dot(r(x), r(x))

x0 = rng.standard_normal(n)    # shape (3,)

# Gauss-Newton approximation
Jr  = fd_jac(r, x0)                   # shape (3, 3)
H_gn = 2 * Jr.T @ Jr                  # shape (3, 3)

# True Hessian
H_true = fd_hess(obj, x0)             # shape (3, 3)

# Show both -- they differ by the second-order term
print("H_true (exact):\n", H_true.round(4))
print("H_GN  (approx):\n", H_gn.round(4))
print("Difference:\n", (H_true - H_gn).round(4))

# Near the minimum the residuals are small; demonstrate with x near zero
x_near = np.array([0.0, 1.0, np.log(2.0)])   # r(x_near) ~ 0
Jr_near  = fd_jac(r, x_near)
H_gn_near   = 2 * Jr_near.T @ Jr_near
H_true_near = fd_hess(obj, x_near)
print("\nAt near-minimum x:")
print("||H_true - H_GN|| =", np.linalg.norm(H_true_near - H_gn_near).round(6))
H_true (exact):
 [[ 1.8586 -0.     -0.    ]
 [-0.     -0.7221 -0.    ]
 [-0.     -0.     -0.8954]]
H_GN  (approx):
 [[1.9294 0.     0.    ]
 [0.     2.1861 0.    ]
 [0.     0.     0.8755]]
Difference:
 [[-0.0708 -0.     -0.    ]
 [-0.     -2.9082 -0.    ]
 [-0.     -0.     -1.7709]]

At near-minimum x:
||H_true - H_GN|| = 0.003394

29.6 Chain Rule and Backpropagation

The chain rule ties the identities in this appendix into a complete system for computing gradients in composite functions.


29.6.1 Scalar Loss via a Vector Map

Let \(f:\mathbb{R}^n\to\mathbb{R}^m\) and \(L:\mathbb{R}^m\to\mathbb{R}\). The composite gradient is:

\[\nabla_x L = J_f^T\,\nabla_f L\]

Derivation. \(dL = (\nabla_f L)^T df = (\nabla_f L)^T J_f\,dx = (J_f^T \nabla_f L)^T dx\). So \(\nabla_x L = J_f^T \nabla_f L\).

The transpose \(J_f^T\) “pulls back” the upstream gradient – this is the vector-Jacobian product (VJP) computed in every reverse-mode autodiff system.


29.6.2 Composed Vector Maps

Let \(h = g \circ f\) where \(f:\mathbb{R}^n\to\mathbb{R}^m\) and \(g:\mathbb{R}^m\to\mathbb{R}^k\).

\[J_h = J_g\,J_f\]

Derivation. \(h(x) = g(f(x))\). Linearizing: \(dh = J_g\,df = J_g\,J_f\,dx\), so \(J_h = J_g J_f\).

This is the Jacobian chain rule – the analog of \((g\circ f)' = g' \circ f'\) for scalar functions. For a chain of \(k\) layers, \(J_{h_k \circ \cdots \circ h_1} = J_{h_k}\cdots J_{h_1}\).


29.6.3 Dense Layer: Three Backward Gradients

For a dense layer \(y = Wx + b\) with upstream gradient \(\delta = \nabla_y L\):

Gradient Formula Identity used
\(\partial L/\partial W\) \(\delta\, x^T\) D12
\(\partial L/\partial x\) \(W^T \delta\) Chain rule on D13
\(\partial L/\partial b\) \(\delta\) D13 (\(J_b = I\))

Derivation for \(\partial L/\partial x\). \(y = Wx + b\), so \(J_x = W\) (identity D13). By the chain rule: \(\nabla_x L = J_x^T \nabla_y L = W^T \delta\).

import numpy as np

def fd_grad(f, x, h=1e-5):
    g = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        xp, xm = x.copy(), x.copy()
        xp[i] += h; xm[i] -= h
        g[i] = (f(xp) - f(xm)) / (2 * h)
    return g

def fd_matrix_grad(f, W, h=1e-5):
    G = np.zeros_like(W, dtype=float)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            Wp, Wm = W.copy(), W.copy()
            Wp[i, j] += h; Wm[i, j] -= h
            G[i, j] = (f(Wp) - f(Wm)) / (2 * h)
    return G

rng = np.random.default_rng(0)
m, n = 4, 5
W = rng.standard_normal((m, n))    # shape (4, 5)
b = rng.standard_normal(m)          # shape (4,)
x = rng.standard_normal(n)          # shape (5,)

# Scalar loss: L = 0.5 * ||Wx + b||^2
def loss(W_, x_, b_): return 0.5 * np.dot(W_ @ x_ + b_, W_ @ x_ + b_)

y     = W @ x + b              # shape (4,)
delta = y                      # grad_y L = y for L = 0.5 ||y||^2

# Analytical
dL_dW = np.outer(delta, x)    # shape (4, 5)
dL_dx = W.T @ delta            # shape (5,)
dL_db = delta                  # shape (4,)

# Numerical
dL_dW_num = fd_matrix_grad(lambda M: loss(M, x, b), W)
dL_dx_num = fd_grad(lambda v: loss(W, v, b), x)
dL_db_num = fd_grad(lambda c: loss(W, x, c), b)

print("dL/dW max error:", np.max(np.abs(dL_dW - dL_dW_num)))
print("dL/dx max error:", np.max(np.abs(dL_dx - dL_dx_num)))
print("dL/db max error:", np.max(np.abs(dL_db - dL_db_num)))
assert np.allclose(dL_dW, dL_dW_num, atol=1e-7)
assert np.allclose(dL_dx, dL_dx_num, atol=1e-7)
assert np.allclose(dL_db, dL_db_num, atol=1e-7)
print("All dense-layer backward gradients passed.")
dL/dW max error: 4.0547509794208736e-11
dL/dx max error: 6.670264340868926e-11
dL/db max error: 2.6076474313185827e-11
All dense-layer backward gradients passed.

29.6.4 Two-Layer Chain Rule Example

For a two-layer network \(L = \ell(W_2\,\sigma(W_1 x))\) where \(\ell\) is a scalar loss and \(\sigma\) is element-wise sigmoid:

import numpy as np

def fd_matrix_grad(f, W, h=1e-5):
    G = np.zeros_like(W, dtype=float)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            Wp, Wm = W.copy(), W.copy()
            Wp[i, j] += h; Wm[i, j] -= h
            G[i, j] = (f(Wp) - f(Wm)) / (2 * h)
    return G

def sigmoid(v): return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(1)
d0, d1, d2 = 4, 5, 3
W1 = rng.standard_normal((d1, d0))    # shape (5, 4)
W2 = rng.standard_normal((d2, d1))    # shape (3, 5)
x  = rng.standard_normal(d0)           # shape (4,)

def forward(W1_, W2_, x_):
    h = sigmoid(W1_ @ x_)    # shape (5,)
    y = W2_ @ h               # shape (3,)
    return 0.5 * np.dot(y, y)

# --- Analytical backward pass ---
# Forward
z1 = W1 @ x                   # shape (5,)  pre-activation
h  = sigmoid(z1)               # shape (5,)  hidden layer
y  = W2 @ h                    # shape (3,)  output

# Backward
delta2 = y                                  # dL/dy = y, shape (3,)
dL_dW2 = np.outer(delta2, h)               # shape (3, 5)  D12
delta1 = W2.T @ delta2                     # shape (5,)    chain rule
delta1 = delta1 * (h * (1.0 - h))         # shape (5,)    D14 (sigmoid)
dL_dW1 = np.outer(delta1, x)              # shape (5, 4)  D12

# --- Numerical verification ---
dL_dW1_num = fd_matrix_grad(lambda M: forward(M, W2, x), W1)
dL_dW2_num = fd_matrix_grad(lambda M: forward(W1, M, x), W2)

print("dL/dW1 max error:", np.max(np.abs(dL_dW1 - dL_dW1_num)))
print("dL/dW2 max error:", np.max(np.abs(dL_dW2 - dL_dW2_num)))
assert np.allclose(dL_dW1, dL_dW1_num, atol=1e-7)
assert np.allclose(dL_dW2, dL_dW2_num, atol=1e-7)
print("Two-layer backprop passed.")
dL/dW1 max error: 3.160205430674523e-11
dL/dW2 max error: 2.111300023699414e-11
Two-layer backprop passed.

29.6.5 Summary: Chain Rule Formulas

Setting Formula
Scalar \(L\) via \(f:\mathbb{R}^n\to\mathbb{R}^m\) \(\nabla_x L = J_f^T \nabla_f L\)
Composed maps \(h = g\circ f\) \(J_h = J_g J_f\)
Dense layer \(\partial L/\partial W\) \(\delta\, x^T\) where \(\delta = \nabla_y L\)
Dense layer \(\partial L/\partial x\) \(W^T \delta\)
Dense layer \(\partial L/\partial b\) \(\delta\)
Element-wise act. \(\partial L/\partial z\) \(\varphi'(z)\odot\delta\)