The Bias-Variance Tradeoff: Diagram and Proof
27 February 2026
The bias-variance tradeoff is one of the central results in statistical learning theory. It tells us that prediction error can be decomposed into three terms — and that reducing one typically increases another. Here we state the decomposition, illustrate it, and work through the proof.
The picture
As model complexity increases, bias falls (the model can represent more structure) while variance rises (the model becomes more sensitive to the particular training data). Total error is minimised at the point where these competing pressures balance.
Setup
Suppose we are predicting a target variable $y$ from input $x$, where the true relationship is

$$y = f(x) + \varepsilon,$$

with $\mathbb{E}[\varepsilon] = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2$. The noise $\varepsilon$ is independent of $x$.

We train a model $\hat{f}$ on a dataset $D$ drawn from the joint distribution of $(x, y)$. Since $D$ is random, $\hat{f}$ is itself a random variable. We want to characterise the expected prediction error at a fixed point $x_0$:

$$\mathbb{E}\big[(y - \hat{f}(x_0))^2\big].$$

The expectation is over both the randomness in $D$ (which determines $\hat{f}$) and the noise $\varepsilon$ in $y$.
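This setup is easy to make concrete in code. The sine target and noise level in the sketch below are hypothetical choices for illustration, not part of the result itself:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """True function f -- a hypothetical choice for illustration."""
    return np.sin(2 * np.pi * x)

sigma = 0.3  # noise standard deviation, also illustrative

def sample_dataset(n=30):
    """Draw one training set D of n points: y = f(x) + noise."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = f(x) + rng.normal(0.0, sigma, size=n)
    return x, y
```

Each call to `sample_dataset` produces a fresh dataset, and hence a fresh fitted model; the expectations in the decomposition average over these draws.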
The decomposition
Theorem. The mean squared error decomposes as

$$\mathbb{E}\big[(y - \hat{f}(x_0))^2\big] = \mathrm{Bias}\big[\hat{f}(x_0)\big]^2 + \mathrm{Var}\big[\hat{f}(x_0)\big] + \sigma^2,$$

where

$$\mathrm{Bias}\big[\hat{f}(x_0)\big] = \mathbb{E}\big[\hat{f}(x_0)\big] - f(x_0)$$

is the systematic error of the model, and $\mathrm{Var}\big[\hat{f}(x_0)\big] = \mathbb{E}\big[\big(\hat{f}(x_0) - \mathbb{E}[\hat{f}(x_0)]\big)^2\big]$ measures how much the model's prediction fluctuates across different training sets.
Proof
We begin by introducing the quantity $\mathbb{E}\big[\hat{f}(x_0)\big]$, which we abbreviate as $\bar{f}$. This is the average prediction of the model across all possible training sets.

Step 1. Substitute and expand.

We suppress the argument $x_0$ for readability. Substituting $y = f + \varepsilon$ and then adding and subtracting $\bar{f}$:

$$\mathbb{E}\big[(y - \hat{f})^2\big] = \mathbb{E}\big[(f + \varepsilon - \hat{f})^2\big] = \mathbb{E}\Big[\big((f - \bar{f}) + (\bar{f} - \hat{f}) + \varepsilon\big)^2\Big].$$

Step 2. Expand the square. Writing $a = f - \bar{f}$, $b = \bar{f} - \hat{f}$, and $c = \varepsilon$:

$$(a + b + c)^2 = a^2 + b^2 + c^2 + 2ab + 2ac + 2bc.$$

Take expectations term by term.
Step 3. Evaluate each term.

- $\mathbb{E}[a^2] = (f - \bar{f})^2 = \mathrm{Bias}\big[\hat{f}\big]^2$, since $f - \bar{f}$ is a constant (no randomness).
- $\mathbb{E}[b^2] = \mathbb{E}\big[(\hat{f} - \bar{f})^2\big] = \mathrm{Var}\big[\hat{f}\big]$, by definition of variance (since $\mathbb{E}[\hat{f}] = \bar{f}$).
- $\mathbb{E}[c^2] = \mathbb{E}[\varepsilon^2] = \sigma^2$, since $\mathbb{E}[\varepsilon] = 0$.
Step 4. Show that all cross terms vanish.

- $\mathbb{E}[2ab] = 2(f - \bar{f})\,\mathbb{E}\big[\bar{f} - \hat{f}\big] = 0$, since $\mathbb{E}[\hat{f}] = \bar{f}$.
- $\mathbb{E}[2ac] = 2(f - \bar{f})\,\mathbb{E}[\varepsilon] = 0$, since $\mathbb{E}[\varepsilon] = 0$.
- $\mathbb{E}[2bc] = 2\,\mathbb{E}\big[(\bar{f} - \hat{f})\,\varepsilon\big] = 2\,\mathbb{E}\big[\bar{f} - \hat{f}\big]\,\mathbb{E}[\varepsilon] = 0$, since $\varepsilon$ is independent of $D$ (and hence of $\hat{f}$), and both factors have zero mean.
Step 5. Collect the surviving terms:

$$\mathbb{E}\big[(y - \hat{f}(x_0))^2\big] = \mathrm{Bias}\big[\hat{f}(x_0)\big]^2 + \mathrm{Var}\big[\hat{f}(x_0)\big] + \sigma^2. \qquad \blacksquare$$
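The identity can also be checked numerically. The sketch below (again assuming an illustrative sine target, Gaussian noise, and a cubic least-squares fit) estimates both sides by Monte Carlo at a fixed point $x_0$: the left side from fresh noisy observations, the right side from the spread of predictions across training sets.

```python
import numpy as np

rng = np.random.default_rng(42)
f = lambda x: np.sin(2 * np.pi * x)   # hypothetical true function
sigma, degree, x0, n_reps = 0.3, 3, 0.25, 5000

preds = np.empty(n_reps)
sq_errors = np.empty(n_reps)
for i in range(n_reps):
    # Fresh training set D -> fitted model f_hat.
    x = rng.uniform(0, 1, 30)
    y = f(x) + rng.normal(0, sigma, 30)
    preds[i] = np.polyval(np.polyfit(x, y, degree), x0)
    # Independent noisy observation at x0 for the left-hand side.
    y0 = f(x0) + rng.normal(0, sigma)
    sq_errors[i] = (y0 - preds[i]) ** 2

mse = sq_errors.mean()                       # E[(y - f_hat(x0))^2]
bias_sq = (preds.mean() - f(x0)) ** 2        # Bias^2
variance = preds.var()                       # Var[f_hat(x0)]
print(mse, bias_sq + variance + sigma ** 2)  # the two should be close
```

Up to Monte Carlo error, the empirical mean squared error matches the sum of the three terms on the right-hand side.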
Interpretation
The three terms play distinct roles:

- Bias measures how far the model's average prediction is from the truth. High bias means the model class is too restrictive to capture $f$. This is underfitting.
- Variance measures how much the prediction changes when trained on different data. High variance means the model is too sensitive to the particular sample. This is overfitting.
- $\sigma^2$ is the irreducible error: noise inherent in the data-generating process. No model can eliminate it.
The tradeoff arises because the tools that reduce bias (more flexible models, more parameters) typically increase variance, and vice versa. The proof makes precise what the diagram shows: total error is the sum of these three quantities, and minimising it requires balancing the first two.
References
The bias-variance decomposition originates in Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4(1), 1–58. The proof presented here follows the standard textbook treatment found in Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.), Section 7.3. A more accessible presentation appears in James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning, Section 2.2.