The Bias-Variance Tradeoff: Diagram and Proof

27 February 2026

mathclaude-authored

The bias-variance tradeoff is one of the central results in statistical learning theory. It tells us that prediction error can be decomposed into three terms — and that reducing one typically increases another. Here we state the decomposition, illustrate it, and work through the proof.

The picture

[Figure: bias-variance tradeoff diagram. Total error is a U-shaped curve in model complexity, decomposed into decreasing squared bias, increasing variance, and constant irreducible error, with the optimal complexity point marked.]

As model complexity increases, bias falls (the model can represent more structure) while variance rises (the model becomes more sensitive to the particular training data). Total error is minimised at the point where these competing pressures balance.
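This curve can be reproduced numerically. The sketch below is a minimal Monte Carlo illustration under assumed choices not taken from the post (truth $\sin(2\pi x)$, noise level, query point, and polynomial degrees are all illustrative): for each degree, it fits a polynomial to many independently drawn training sets and estimates the squared bias and variance of the prediction at one fixed point.

```python
import numpy as np

# Illustrative setup (assumed, not from the post): fit polynomials of
# increasing degree to noisy samples of sin(2*pi*x), and estimate the
# squared bias and variance of the prediction at one query point over
# many independently drawn training sets.
def bias_variance(degree, n_train=30, n_sets=500, sigma=0.3, x0=0.35, seed=0):
    rng = np.random.default_rng(seed)
    f = lambda x: np.sin(2 * np.pi * x)          # "true" function
    preds = np.empty(n_sets)
    for s in range(n_sets):
        x = rng.uniform(0.0, 1.0, n_train)
        y = f(x) + rng.normal(0.0, sigma, n_train)
        coeffs = np.polyfit(x, y, degree)        # least-squares polynomial fit
        preds[s] = np.polyval(coeffs, x0)        # prediction at the query point
    bias_sq = (preds.mean() - f(x0)) ** 2        # (E_D[f_hat] - f)^2
    variance = preds.var()                       # Var_D[f_hat]
    return bias_sq, variance

for degree in (1, 3, 9):
    b2, v = bias_variance(degree)
    print(f"degree {degree}: bias^2 = {b2:.4f}, variance = {v:.4f}")
```

A straight line cannot represent the sine, so its squared bias is large while its variance is tiny; a degree-9 fit drives the bias toward zero but its prediction swings noticeably from one training set to the next.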

Setup

Suppose we are predicting a target variable $y$ from input $x$, where the true relationship is

$$y = f(x) + \epsilon,$$

with $\mathbb{E}[\epsilon] = 0$ and $\operatorname{Var}(\epsilon) = \sigma^2$. The noise $\epsilon$ is independent of $x$.

We train a model $\hat{f}(x)$ on a dataset $D$ drawn from the joint distribution of $(x, y)$. Since $D$ is random, $\hat{f}$ is itself a random variable. We want to characterise the expected prediction error at a fixed point $x$:

$$\operatorname{MSE}(x) = \mathbb{E}_D\!\left[(y - \hat{f}(x))^2\right].$$

The expectation is over both the randomness in $D$ (which determines $\hat{f}$) and the noise $\epsilon$ in $y$.
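This expectation can be approximated by simulation: draw many datasets, train on each, and score the prediction against a fresh noisy target. The sketch below uses a deliberately crude, assumed example (quadratic truth, a "model" that just predicts the sample mean of the training targets, and illustrative values for $\sigma$, $n$, and the query point).

```python
import numpy as np

# Illustrative sketch (assumptions, not from the post): truth f(x) = x^2,
# noise sd sigma, and a crude model that predicts the sample mean of the
# training targets. MSE(x0) is estimated by averaging the squared error
# over many independent draws of (dataset D, fresh noise at x0).
def estimate_mse(x0=0.8, n_train=50, n_sets=20_000, sigma=0.5, seed=1):
    rng = np.random.default_rng(seed)
    f = lambda x: x ** 2
    errs = np.empty(n_sets)
    for s in range(n_sets):
        x = rng.uniform(-1.0, 1.0, n_train)
        y = f(x) + rng.normal(0.0, sigma, n_train)
        f_hat = y.mean()                          # "train" the constant predictor on D
        y0 = f(x0) + rng.normal(0.0, sigma)       # fresh noisy target at x0
        errs[s] = (y0 - f_hat) ** 2
    return errs.mean()

print(f"Monte Carlo MSE(0.8) = {estimate_mse():.3f}")
```

Even this toy estimator exhibits all three ingredients of the decomposition: it is biased (a constant cannot track $x^2$), it varies across training sets, and the fresh noise contributes $\sigma^2$.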

The decomposition

Theorem. The mean squared error decomposes as

$$\mathbb{E}_D\!\left[(y - \hat{f}(x))^2\right] = \operatorname{Bias}\!\left[\hat{f}(x)\right]^2 + \operatorname{Var}_D\!\left[\hat{f}(x)\right] + \sigma^2,$$

where

$$\operatorname{Bias}\!\left[\hat{f}(x)\right] = \mathbb{E}_D\!\left[\hat{f}(x)\right] - f(x)$$

is the systematic error of the model, and $\operatorname{Var}_D[\hat{f}(x)]$ measures how much the model's prediction fluctuates across different training sets.
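Before proving the identity, it is worth checking it numerically. The sketch below uses an assumed toy setup (quadratic truth, a straight-line model fit by least squares, illustrative $\sigma$, $n$, and query point): it estimates the left-hand side directly, estimates the squared bias and variance from the same simulated predictions, and compares.

```python
import numpy as np

# Numerical check of the decomposition (assumed toy setup: quadratic
# truth, a straight-line model fit by least squares, fixed query point).
def check_decomposition(x0=0.8, n_train=50, n_sets=20_000, sigma=0.5, seed=2):
    rng = np.random.default_rng(seed)
    f = lambda x: x ** 2
    preds = np.empty(n_sets)
    errs = np.empty(n_sets)
    for s in range(n_sets):
        x = rng.uniform(-1.0, 1.0, n_train)
        y = f(x) + rng.normal(0.0, sigma, n_train)
        slope, intercept = np.polyfit(x, y, 1)    # train f_hat on this D
        preds[s] = slope * x0 + intercept
        y0 = f(x0) + rng.normal(0.0, sigma)       # fresh noisy target at x0
        errs[s] = (y0 - preds[s]) ** 2
    bias_sq = (preds.mean() - f(x0)) ** 2         # Bias[f_hat]^2
    variance = preds.var()                        # Var_D[f_hat]
    return bias_sq, variance, errs.mean()         # ..., direct MSE estimate

b2, var, mse = check_decomposition()
print(f"bias^2 + variance + sigma^2 = {b2 + var + 0.5**2:.3f}")
print(f"direct Monte Carlo MSE      = {mse:.3f}")
```

Up to Monte Carlo error, the two printed numbers agree, which is exactly what the proof below establishes in general.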

Proof

We begin by introducing the quantity $\mathbb{E}_D[\hat{f}(x)]$, which we abbreviate as $\bar{f}(x)$. This is the average prediction of the model across all possible training sets.

Step 1. Substitute $y = f(x) + \epsilon$ and expand.

$$\mathbb{E}\!\left[(y - \hat{f})^2\right] = \mathbb{E}\!\left[(f + \epsilon - \hat{f})^2\right].$$

We suppress the argument $x$ for readability. Now add and subtract $\bar{f}$:

$$= \mathbb{E}\!\left[\bigl((f - \bar{f}) + (\bar{f} - \hat{f}) + \epsilon\bigr)^2\right].$$

Step 2. Expand the square. Writing $a = f - \bar{f}$, $b = \bar{f} - \hat{f}$, and $c = \epsilon$:

$$(a + b + c)^2 = a^2 + b^2 + c^2 + 2ab + 2ac + 2bc.$$

Take expectations term by term.

Step 3. Evaluate each squared term. The quantity $a = f - \bar{f}$ is deterministic: neither $f$ nor $\bar{f}$ depends on the draw of $D$ or on $\epsilon$. Hence

$$\mathbb{E}[a^2] = (f - \bar{f})^2 = \operatorname{Bias}[\hat{f}]^2.$$

Next, since $\bar{f} = \mathbb{E}_D[\hat{f}]$,

$$\mathbb{E}[b^2] = \mathbb{E}_D\!\left[(\hat{f} - \bar{f})^2\right] = \operatorname{Var}_D[\hat{f}].$$

Finally, $\mathbb{E}[c^2] = \mathbb{E}[\epsilon^2] = \operatorname{Var}(\epsilon) = \sigma^2$, using $\mathbb{E}[\epsilon] = 0$.

Step 4. Show that all cross terms vanish. Since $a$ is deterministic and $\mathbb{E}[b] = \bar{f} - \mathbb{E}_D[\hat{f}] = 0$, we get $\mathbb{E}[2ab] = 2a\,\mathbb{E}[b] = 0$. Similarly $\mathbb{E}[2ac] = 2a\,\mathbb{E}[\epsilon] = 0$. For the last cross term, $b$ depends only on the training set $D$, while $\epsilon$ is the noise in the test observation, independent of $D$; therefore

$$\mathbb{E}[2bc] = 2\,\mathbb{E}[b]\,\mathbb{E}[\epsilon] = 0.$$

Step 5. Collect the surviving terms:

$$\mathbb{E}\!\left[(y - \hat{f})^2\right] = \operatorname{Bias}[\hat{f}]^2 + \operatorname{Var}_D[\hat{f}] + \sigma^2. \qquad \blacksquare$$

Interpretation

The three terms play distinct roles. The squared bias measures systematic error: how far the average prediction $\bar{f}(x)$ sits from the truth $f(x)$, no matter how much data we gather. The variance measures instability: how much the prediction would change if we retrained on a different sample. The irreducible error $\sigma^2$ is the noise floor; no model, however flexible, can predict the noise in $y$.

The tradeoff arises because the tools that reduce bias (more flexible models, more parameters) typically increase variance, and vice versa. The proof makes precise what the diagram shows: total error is the sum of these three quantities, and minimising it requires balancing the first two.

References

The bias-variance decomposition originates in Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4(1), 1–58. The proof presented here follows the standard textbook treatment found in Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.), Section 7.3. A more accessible presentation appears in James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning, Section 2.2.