31  Appendix F: Tensor Notation for Deep Learning

31.1 Overview

This appendix is a bridge between the linear algebra in this book and the array-manipulation idioms you will encounter in deep learning frameworks. The core math does not change – matrix multiply is still matrix multiply – but the notation and the API do.

Three things are new relative to the rest of this book:

  1. Tensors – arrays with more than two axes – appear everywhere in deep learning: a minibatch of images is a rank-4 tensor (batch, height, width, channel); a sequence of token embeddings is rank-3 (batch, time, feature).

  2. Einstein summation – writing contractions by repeating index letters and summing over them – is the compact notation that replaces explicit for loops and np.dot / np.matmul calls when axes multiply beyond two.

  3. Framework APIs (numpy.einsum, torch.einsum) implement Einstein summation directly, giving you a single notation that covers dot products, matrix multiplies, batched multiplies, outer products, traces, and more.


31.1.1 What Is a Tensor?

In the physics and differential geometry sense, a tensor is a coordinate-independent multilinear map. In deep learning the word means something simpler: an \(N\)-dimensional array of numbers, where \(N\) is called the rank (or order, to avoid confusion with matrix rank).

Rank Name Example shape
0 Scalar ()
1 Vector (n,)
2 Matrix (m, n)
3 3-tensor (b, m, n) – a batch of \(b\) matrices
4 4-tensor (b, h, w, c) – a batch of images
import numpy as np

rng = np.random.default_rng(0)

scalar  = rng.standard_normal(())            # shape ()
vector  = rng.standard_normal((5,))          # shape (5,)
matrix  = rng.standard_normal((3, 4))        # shape (3, 4)
batch   = rng.standard_normal((8, 3, 4))     # shape (8, 3, 4)  -- 8 matrices
images  = rng.standard_normal((16, 28, 28, 1))  # shape (16, 28, 28, 1)

for name, arr in [("scalar", scalar), ("vector", vector),
                  ("matrix", matrix), ("batch", batch), ("images", images)]:
    print(f"{name:8s}  ndim={arr.ndim}  shape={arr.shape}")
scalar    ndim=0  shape=()
vector    ndim=1  shape=(5,)
matrix    ndim=2  shape=(3, 4)
batch     ndim=3  shape=(8, 3, 4)
images    ndim=4  shape=(16, 28, 28, 1)

31.1.2 Axis Labeling Convention

In deep learning it is standard to give axes names rather than just indices. This appendix uses these labels consistently:

Letter Axis meaning
b batch size
s, t sequence (time) length
d, e embedding / feature dimension
h number of attention heads
m, n general matrix dimensions
i, j, k summation (contracted) indices
H, W image height, width
C channel count
r filter radius (kernel)

These are the same letters used in einsum strings throughout this appendix.


31.2 Einstein Summation

Einstein’s rule: any index letter that appears exactly twice is summed over. In standard matrix notation \(C_{ij} = \sum_k A_{ik} B_{kj}\) the summation sign and the \(\sum_k\) are written out explicitly. The Einstein version drops the \(\sum\):

\[C_{ij} = A_{ik} B_{kj}\]

because \(k\) appears twice (once in \(A\), once in \(B\)), so the sum is implied.

numpy.einsum and torch.einsum accept a string that encodes the index pattern, then perform the implied contraction:

np.einsum("ik,kj->ij", A, B)   # matrix multiply

The string has the form <input1>,<input2>,...-><output>. Output indices are kept; input-only indices are summed over.


31.2.1 Einsum Catalog

The table below lists the 12 most common operations. Every one is verified numerically.

import numpy as np

rng = np.random.default_rng(42)

m, n, k, b = 4, 5, 3, 8

A  = rng.standard_normal((m, k))   # shape (m, k)
B  = rng.standard_normal((k, n))   # shape (k, n)
x  = rng.standard_normal((k,))     # shape (k,)
y  = rng.standard_normal((m,))     # shape (m,)
u  = rng.standard_normal((n,))     # shape (n,)
S  = rng.standard_normal((m, m))   # shape (m, m)  -- square
BA = rng.standard_normal((b, m, k))  # shape (b, m, k)  -- batch of matrices
BB = rng.standard_normal((b, k, n))  # shape (b, k, n)

# --- 1. Dot product: scalar = u . v ---
dot_es  = np.einsum("i,i->", x, x)
dot_np  = np.dot(x, x)
assert np.isclose(dot_es, dot_np)

# --- 2. Outer product: M[i,j] = u[i]*v[j] ---
outer_es = np.einsum("i,j->ij", y, u)   # shape (m, n)
outer_np = np.outer(y, u)
assert np.allclose(outer_es, outer_np)

# --- 3. Matrix-vector: y[i] = A[i,k] * x[k] ---
mv_es = np.einsum("ik,k->i", A, x)      # shape (m,)
mv_np = A @ x
assert np.allclose(mv_es, mv_np)

# --- 4. Matrix-matrix: C[i,j] = A[i,k] * B[k,j] ---
mm_es = np.einsum("ik,kj->ij", A, B)    # shape (m, n)
mm_np = A @ B
assert np.allclose(mm_es, mm_np)

# --- 5. Transposed multiply: C[i,j] = A[k,i] * B[k,j] (A^T @ B) ---
atb_es = np.einsum("ki,kj->ij", A, A)   # shape (k, k)... wait shape (m,m)
# A is (m,k): A^T is (k,m), A^T @ A is (k,k)
AtA_es = np.einsum("ik,il->kl", A, A)   # shape (k, k)
AtA_np = A.T @ A
assert np.allclose(AtA_es, AtA_np)

# --- 6. Trace: scalar = A[i,i] ---
trace_es = np.einsum("ii->", S)
trace_np = np.trace(S)
assert np.isclose(trace_es, trace_np)

# --- 7. Element-wise (Hadamard): C[i,j] = A[i,j] * A[i,j] ---
S2 = rng.standard_normal((m, k))
had_es = np.einsum("ij,ij->ij", A, S2)
had_np = A * S2
assert np.allclose(had_es, had_np)

# --- 8. Row sum: v[i] = sum_j A[i,j] ---
rowsum_es = np.einsum("ij->i", A)        # shape (m,)
rowsum_np = A.sum(axis=1)
assert np.allclose(rowsum_es, rowsum_np)

# --- 9. Column sum: v[j] = sum_i A[i,j] ---
colsum_es = np.einsum("ij->j", A)        # shape (k,)
colsum_np = A.sum(axis=0)
assert np.allclose(colsum_es, colsum_np)

# --- 10. Transpose: B[j,i] = A[i,j] ---
T_es = np.einsum("ij->ji", A)            # shape (k, m)
T_np = A.T
assert np.allclose(T_es, T_np)

# --- 11. Batched matrix multiply: C[b,i,j] = sum_k A[b,i,k] * B[b,k,j] ---
bmm_es = np.einsum("bik,bkj->bij", BA, BB)   # shape (b, m, n)
bmm_np = BA @ BB
assert np.allclose(bmm_es, bmm_np)

# --- 12. Quadratic form: scalar = x^T S x ---
S_sq = S @ S.T                                # make PSD, shape (m,m)
xS   = rng.standard_normal((m,))
qf_es = np.einsum("i,ij,j->", xS, S_sq, xS)
qf_np = xS @ S_sq @ xS
assert np.isclose(qf_es, qf_np)

print("All 12 einsum patterns verified.")
All 12 einsum patterns verified.

31.2.2 Einsum Cheat Sheet

Operation Einsum string NumPy equivalent
Dot product "i,i->" np.dot(x, y)
Outer product "i,j->ij" np.outer(x, y)
Matrix-vector "ij,j->i" A @ x
Matrix-matrix "ij,jk->ik" A @ B
\(A^T A\) "ji,jk->ik" A.T @ A
Trace "ii->" np.trace(A)
Hadamard "ij,ij->ij" A * B
Row sum "ij->i" A.sum(1)
Transpose "ij->ji" A.T
Batched matmul "bij,bjk->bik" A @ B (broadcast)
Quadratic form "i,ij,j->" x @ A @ x

31.2.3 Performance Note

numpy.einsum accepts an optimize argument that chooses the lowest-cost contraction order when chaining more than two arrays. For deep learning workloads, torch.einsum dispatches to optimized BLAS/cuBLAS kernels and supports automatic differentiation.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 60))   # shape (50, 60)
B = rng.standard_normal((60, 70))   # shape (60, 70)
C = rng.standard_normal((70, 80))   # shape (70, 80)

# Two equivalent contractions; optimize picks the cheaper order
result = np.einsum("ij,jk,kl->il", A, B, C, optimize=True)  # shape (50, 80)
check  = (A @ B) @ C
print("shape:", result.shape)
print("max error:", np.max(np.abs(result - check)))
assert np.allclose(result, check, atol=1e-10)
print("Optimized three-way einsum verified.")
shape: (50, 80)
max error: 1.7053025658242404e-13
Optimized three-way einsum verified.

31.3 Transformer Attention as Einsum

Scaled dot-product attention is the single operation that powers every Transformer model. Its formula looks opaque in conventional notation but becomes transparent when written with explicit indices.


31.3.1 Scaled Dot-Product Attention

Given queries \(Q \in \mathbb{R}^{s \times d}\), keys \(K \in \mathbb{R}^{t \times d}\), and values \(V \in \mathbb{R}^{t \times d_v}\), the attention output is:

\[\operatorname{Attn}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^T}{\sqrt{d}}\right)V\]

In shape terms: \((s,d) \times (d,t) \to (s,t)\), then \((s,t) \times (t,d_v) \to (s,d_v)\).

Using einsum notation, the score matrix is:

scores[s, t] = Q[s, d] * K[t, d]  (sum over d)
np.einsum("sd,td->st", Q, K)

The full single-head attention in NumPy:

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

S  = 6    # query sequence length
T  = 8    # key/value sequence length
d  = 16   # key/query dimension
dv = 16   # value dimension

Q = rng.standard_normal((S, d))    # shape (S, d)
K = rng.standard_normal((T, d))    # shape (T, d)
V = rng.standard_normal((T, dv))   # shape (T, dv)

# --- Compute attention using einsum ---
# Step 1: scores[s, t] = Q[s, d] * K[t, d]
scores = np.einsum("sd,td->st", Q, K) / np.sqrt(d)   # shape (S, T)

# Step 2: attention weights (row-wise softmax)
weights = softmax(scores, axis=1)                     # shape (S, T)

# Step 3: weighted sum of values
output = np.einsum("st,td->sd", weights, V)           # shape (S, dv)

# Verify against explicit matrix multiply
scores_mm = (Q @ K.T) / np.sqrt(d)
output_mm = softmax(scores_mm, axis=1) @ V

print("output shape:", output.shape)
print("max error vs matmul:", np.max(np.abs(output - output_mm)))
assert np.allclose(output, output_mm)
print("Single-head attention verified.")
output shape: (6, 16)
max error vs matmul: 4.3021142204224816e-16
Single-head attention verified.

31.3.2 Multi-Head Attention

In multi-head attention, \(H\) parallel heads each project Q, K, V into a lower-dimensional subspace, run attention independently, then concatenate:

\[\text{MultiHead}(Q,K,V) = \operatorname{Concat}(\text{head}_1,\ldots,\text{head}_H)W^O\]

where \(\text{head}_h = \operatorname{Attn}(QW_h^Q, KW_h^K, VW_h^V)\).

With einsum, all \(H\) heads process the full batch simultaneously. Typical shapes for a batch of \(b\) sequences:

Tensor Shape Meaning
\(Q\) \((b, s, d_{\text{model}})\) Queries (batch, seq, embed)
\(W^Q_h\) \((H, d_{\text{model}}, d_h)\) Per-head query projection
\(Q_h\) \((b, H, s, d_h)\) Projected queries
Scores \((b, H, s, t)\) Attention logits
Output \((b, H, s, d_h)\) Per-head output
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)

B   = 2     # batch size
S   = 6     # sequence length (query)
T   = 8     # sequence length (key/value)
D   = 32    # model dimension
H   = 4     # number of heads
dh  = D // H  # head dimension = 8

# Input sequences
Q_in = rng.standard_normal((B, S, D))    # shape (B, S, D)
K_in = rng.standard_normal((B, T, D))    # shape (B, T, D)
V_in = rng.standard_normal((B, T, D))    # shape (B, T, D)

# Learnable projection weights
WQ = rng.standard_normal((H, D, dh))    # shape (H, D, dh)
WK = rng.standard_normal((H, D, dh))    # shape (H, D, dh)
WV = rng.standard_normal((H, D, dh))    # shape (H, D, dh)
WO = rng.standard_normal((H * dh, D))   # shape (H*dh, D)

# Project: Q_h[b,h,s,d] = Q_in[b,s,D] * WQ[h,D,d]
Q_h = np.einsum("bsd,hde->bhse", Q_in, WQ)  # shape (B, H, S, dh)
K_h = np.einsum("btd,hde->bhte", K_in, WK)  # shape (B, H, T, dh)
V_h = np.einsum("btd,hde->bhte", V_in, WV)  # shape (B, H, T, dh)

# Attention scores: scores[b,h,s,t] = Q_h[b,h,s,d] * K_h[b,h,t,d]
scores = np.einsum("bhsd,bhtd->bhst", Q_h, K_h) / np.sqrt(dh)  # (B,H,S,T)
weights = softmax(scores, axis=-1)                               # (B,H,S,T)

# Weighted values: out[b,h,s,d] = weights[b,h,s,t] * V_h[b,h,t,d]
out_h = np.einsum("bhst,bhtd->bhsd", weights, V_h)  # shape (B, H, S, dh)

# Concatenate heads and project out
# Reshape (B, H, S, dh) -> (B, S, H*dh)
out_cat = out_h.transpose(0, 2, 1, 3).reshape(B, S, H * dh)   # (B, S, H*dh)
out_final = np.einsum("bsd,de->bse", out_cat, WO)              # (B, S, D)

print("Multi-head attention output shape:", out_final.shape)
assert out_final.shape == (B, S, D)
print("Multi-head attention verified.")
Multi-head attention output shape: (2, 6, 32)
Multi-head attention verified.

31.3.3 Complexity: Why Attention Is Quadratic

The score matrix has shape \((b, H, s, t)\). For self-attention \(s = t = n\). The cost of computing all scores is \(O(n^2 d)\) per head – quadratic in sequence length. This is the bottleneck that Sparse Attention, Linformer, Performer, and FlashAttention all target.

Stage Dominant cost
Score computation \(O(b H n^2 d_h)\)
Softmax \(O(b H n^2)\)
Weighted sum \(O(b H n^2 d_h)\)
Total \(O(b H n^2 d_h)\) = \(O(b n^2 d)\)

31.4 Convolutions as Tensor Contractions

A 2-D convolution layer applies a bank of learned filters to an input image. Written out, it looks like a quadruple loop – but it is a tensor contraction and can be expressed with a single einsum once you understand the index structure.


31.4.1 2-D Convolution: Index Form

Let the input be \(X \in \mathbb{R}^{H \times W \times C_\text{in}}\) (one image, no batch dimension for clarity). A filter bank \(F \in \mathbb{R}^{r \times r \times C_\text{in} \times C_\text{out}}\) produces output \(Y \in \mathbb{R}^{H' \times W' \times C_\text{out}}\) by:

\[Y[i, j, c_\text{out}] = \sum_{p=0}^{r-1} \sum_{q=0}^{r-1} \sum_{c=0}^{C_\text{in}-1} X[i+p,\, j+q,\, c]\; F[p,\, q,\, c,\, c_\text{out}]\]

This is a tensor contraction over three axes: \(p, q, c\). The output spatial coordinates \(i, j\) remain free; they are never contracted.


31.4.2 Einsum Convolution (Single Image)

A direct einsum over a sliding window is clearest to read. The code below extracts patches first (the im2col trick), then contracts.

import numpy as np

rng = np.random.default_rng(0)

H, W, Cin  = 8, 8, 3     # input height, width, channels
r          = 3            # filter (kernel) size
Cout       = 4            # output channels
stride     = 1

# Input image and filter bank
X = rng.standard_normal((H, W, Cin))             # shape (H, W, Cin)
F = rng.standard_normal((r, r, Cin, Cout))        # shape (r, r, Cin, Cout)

# Output spatial dimensions
Ho = (H - r) // stride + 1   # = 6
Wo = (W - r) // stride + 1   # = 6

# --- Method 1: explicit loop (ground truth) ---
Y_loop = np.zeros((Ho, Wo, Cout))
for i in range(Ho):
    for j in range(Wo):
        patch = X[i:i+r, j:j+r, :]          # shape (r, r, Cin)
        Y_loop[i, j, :] = np.einsum("pqc,pqco->o", patch, F)

# --- Method 2: im2col + einsum ---
# Extract all (Ho*Wo) patches into a single array
patches = np.lib.stride_tricks.sliding_window_view(
    X, window_shape=(r, r, Cin)
)[::stride, ::stride, 0]               # shape (Ho, Wo, r, r, Cin)

Y_einsum = np.einsum("ijpqc,pqco->ijo", patches, F)  # shape (Ho, Wo, Cout)

print("Output shape:", Y_einsum.shape)
print("Max error vs loop:", np.max(np.abs(Y_einsum - Y_loop)))
assert np.allclose(Y_einsum, Y_loop, atol=1e-10)
print("im2col einsum convolution verified.")
Output shape: (6, 6, 4)
Max error vs loop: 0.0
im2col einsum convolution verified.

31.4.3 Batched Convolution

In practice a minibatch of \(b\) images is processed together. The batch axis \(b\) is free (not contracted), so it simply prepends to both the patch and output tensors.

import numpy as np

rng = np.random.default_rng(1)

B, H, W, Cin = 4, 8, 8, 3    # batch, height, width, in-channels
r    = 3                       # kernel size
Cout = 4                       # out-channels
stride = 1

X = rng.standard_normal((B, H, W, Cin))          # shape (B, H, W, Cin)
F = rng.standard_normal((r, r, Cin, Cout))        # shape (r, r, Cin, Cout)

Ho = (H - r) // stride + 1   # = 6
Wo = (W - r) // stride + 1   # = 6

# Extract patches for the whole batch
# sliding_window_view on (B, H, W, Cin) with window (1, r, r, Cin)
patches = np.lib.stride_tricks.sliding_window_view(
    X, window_shape=(1, r, r, Cin)
)[:, ::stride, ::stride, 0, 0]    # shape (B, Ho, Wo, r, r, Cin)

Y_batch = np.einsum("bijpqc,pqco->bijo", patches, F)  # shape (B, Ho, Wo, Cout)

print("Batched output shape:", Y_batch.shape)
assert Y_batch.shape == (B, Ho, Wo, Cout)
print("Batched convolution verified.")
Batched output shape: (4, 6, 6, 4)
Batched convolution verified.

31.4.4 Convolution vs Matrix Multiply: The Unified View

Both operations are tensor contractions. The difference is which indices are free and which are contracted:

Operation Free indices Contracted indices
Matrix multiply \(C=AB\) \(i\) (row), \(j\) (col) \(k\) (inner)
Conv output \(Y[i,j,c_\text{out}]\) \(i,j,c_\text{out}\) \(p,q,c_\text{in}\)
Attention output \(O[s,d]\) \(s,d\) \(t\) (weighted sum)

The im2col transformation turns a convolution into a matrix multiply by explicitly stacking the contracted patch axes into a single dimension. This is how cuDNN and PyTorch implement convolutions on GPU: reshape the input patches into a large matrix, then call a single batched GEMM.


31.4.5 im2col as a Matrix Multiply

After im2col the patch tensor \(P \in \mathbb{R}^{(H'W') \times (r^2 C_\text{in})}\) and the reshaped filter \(\hat{F} \in \mathbb{R}^{(r^2 C_\text{in}) \times C_\text{out}}\) give the output in one matrix multiply:

\[Y_{\text{flat}} = P\,\hat{F} \in \mathbb{R}^{(H'W') \times C_\text{out}}\]

import numpy as np

rng = np.random.default_rng(2)

H, W, Cin = 8, 8, 3
r    = 3
Cout = 4
stride = 1
Ho   = (H - r) // stride + 1   # = 6
Wo   = (W - r) // stride + 1   # = 6

X = rng.standard_normal((H, W, Cin))          # shape (H, W, Cin)
F = rng.standard_normal((r, r, Cin, Cout))    # shape (r, r, Cin, Cout)

# im2col: extract and flatten patches
patches = np.lib.stride_tricks.sliding_window_view(
    X, window_shape=(r, r, Cin)
)[::stride, ::stride, 0]              # shape (Ho, Wo, r, r, Cin)

P = patches.reshape(Ho * Wo, r * r * Cin)    # shape (Ho*Wo, r*r*Cin)
F_flat = F.reshape(r * r * Cin, Cout)        # shape (r*r*Cin, Cout)

Y_gemm = (P @ F_flat).reshape(Ho, Wo, Cout)  # shape (Ho, Wo, Cout)

# Ground truth (loop)
Y_ref = np.zeros((Ho, Wo, Cout))
for i in range(Ho):
    for j in range(Wo):
        patch = X[i:i+r, j:j+r, :]
        Y_ref[i, j, :] = np.einsum("pqc,pqco->o", patch, F)

print("im2col GEMM output shape:", Y_gemm.shape)
print("Max error vs loop:", np.max(np.abs(Y_gemm - Y_ref)))
assert np.allclose(Y_gemm, Y_ref, atol=1e-10)
print("im2col GEMM convolution verified.")
im2col GEMM output shape: (6, 6, 4)
Max error vs loop: 2.6645352591003757e-15
im2col GEMM convolution verified.

31.4.6 Summary

Concept Key einsum string Section
Dot product "i,i->" Section 31.2
Batched matmul "bij,bjk->bik" Section 31.2
Single-head attention "sd,td->st", "st,td->sd" Section 31.3
Multi-head projection "bsd,hde->bhse" Section 31.3
Multi-head scores "bhsd,bhtd->bhst" Section 31.3
Single-image conv "ijpqc,pqco->ijo" Section 31.4
Batched conv "bijpqc,pqco->bijo" Section 31.4