Foundations

This chapter introduces the key concepts that underpin the toolkit's estimator definitions and comparisons. Unlike the main estimator definitions, the foundations require knowledge of classic statistical methods. Well-known facts and commonly accepted notation are used without special introduction.

From Statistical Efficiency to Drift

Statistical efficiency measures estimator precision (Serfling 2009). When multiple estimators target the same quantity, efficiency determines which provides more reliable results.

Efficiency measures how tightly estimates cluster around the true value across repeated samples. For an estimator T applied to samples from distribution X, absolute efficiency is defined relative to the optimal estimator T^*:

\text{Efficiency}(T, X) = \frac{\text{Var}[T^*(X_1, \ldots, X_n)]}{\text{Var}[T(X_1, \ldots, X_n)]}

Relative efficiency compares two estimators by taking the ratio of their variances:

\text{RelativeEfficiency}(T_1, T_2, X) = \frac{\text{Var}[T_2(X_1, \ldots, X_n)]}{\text{Var}[T_1(X_1, \ldots, X_n)]}

Under \underline{\operatorname{Additive}} (Normal) distributions, this approach works well. The sample mean achieves optimal efficiency, while the median operates at roughly 64% efficiency.

However, this variance-based definition creates four critical limitations:

  • Absolute efficiency requires knowing the optimal estimator, which is difficult to determine. For many distributions, deriving the minimum-variance unbiased estimator requires complex mathematical analysis. Without this reference point, absolute efficiency cannot be computed.
  • Relative efficiency only compares estimator pairs, preventing systematic evaluation. This limits understanding of how multiple estimators perform relative to each other. Practitioners cannot rank estimators comprehensively or evaluate individual performance in isolation.
  • The approach depends on variance calculations that break down when variance becomes infinite or when distributions have heavy tails. Many real-world distributions, such as those with power-law tails, exhibit infinite variance. When the variance is undefined, efficiency comparisons become impossible.
  • Variance is not robust to outliers, which can corrupt efficiency calculations. A single extreme observation can greatly inflate variance estimates. This sensitivity can make efficient estimators look inefficient and vice versa (illustrated in the sketch below).
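The last limitation is easy to see numerically. The sketch below (plain NumPy, not toolkit code) contaminates a well-behaved sample with a single extreme value: the sample variance explodes, while a median-of-pairwise-differences spread, used here as a simple stand-in for \operatorname{Spread}, barely moves.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=10.0, scale=1.0, size=100)
x_corrupted = np.append(x, 1000.0)  # a single extreme observation

def pairwise_spread(values):
    """Median absolute pairwise difference (a robust, Shamos-style spread)."""
    diffs = np.abs(values[:, None] - values[None, :])
    return np.median(diffs[np.triu_indices(len(values), k=1)])

print("variance:", np.var(x, ddof=1), "->", np.var(x_corrupted, ddof=1))
print("spread:  ", pairwise_spread(x), "->", pairwise_spread(x_corrupted))
```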

The \operatorname{Drift} concept provides a robust alternative. Drift measures estimator precision using \operatorname{Spread} instead of variance, providing reliable comparisons across a wide range of distributions.

For an average estimator T, random variable X, and sample size n:

\operatorname{AvgDrift}(T, X, n) = \frac{\sqrt{n} \cdot \operatorname{Spread}[T(X_1, \ldots, X_n)]}{\operatorname{Spread}[X]}

This formula measures estimator variability compared to data variability. \operatorname{Spread}[T(X_1, \ldots, X_n)] captures the median absolute difference between estimates across repeated samples. Multiplying by \sqrt{n} removes the sample size dependency, making drift values comparable across different sample sizes. Dividing by \operatorname{Spread}[X] creates a scale-free measure that provides consistent drift values across different distribution parameters and measurement units.
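As a concrete illustration of this formula, here is a minimal Monte-Carlo sketch for the sample mean and the sample median under a standard normal distribution. It is not toolkit code: \operatorname{Spread} is approximated by the median absolute pairwise difference, applied to repeated estimates and to a large reference sample.

```python
import numpy as np

rng = np.random.default_rng(0)

def spread(values):
    """Median absolute pairwise difference, a stand-in for Spread."""
    diffs = np.abs(values[:, None] - values[None, :])
    return np.median(diffs[np.triu_indices(len(values), k=1)])

def avg_drift(estimator, n, replications=2000):
    """Monte-Carlo approximation of AvgDrift(T, Normal, n)."""
    estimates = np.array([estimator(rng.normal(size=n)) for _ in range(replications)])
    reference = rng.normal(size=2000)  # approximates Spread[X]
    return np.sqrt(n) * spread(estimates) / spread(reference)

print("mean  :", round(avg_drift(np.mean, n=100), 2))    # close to 1.0
print("median:", round(avg_drift(np.median, n=100), 2))  # close to 1.25
```

The printed values approximate the 1.0 and 1.25 figures discussed later in this section.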

Dispersion estimators use a parallel formulation:

\operatorname{DispDrift}(T, X, n) = \sqrt{n} \cdot \operatorname{RelSpread}[T(X_1, \ldots, X_n)]

Here \operatorname{RelSpread} (where \operatorname{RelSpread}[Y] = \frac{\operatorname{Spread}[Y]}{\lvert \operatorname{Center}[Y] \rvert}) normalizes by the estimator's typical value for fair comparison.
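A parallel Monte-Carlo sketch approximates \operatorname{DispDrift} for the sample standard deviation. Again, this is illustrative code rather than the toolkit API, with the median standing in for \operatorname{Center} inside \operatorname{RelSpread}.

```python
import numpy as np

rng = np.random.default_rng(1)

def spread(values):
    """Median absolute pairwise difference, a stand-in for Spread."""
    diffs = np.abs(values[:, None] - values[None, :])
    return np.median(diffs[np.triu_indices(len(values), k=1)])

def rel_spread(values):
    """RelSpread[Y] = Spread[Y] / |Center[Y]|; the median stands in for Center here."""
    return spread(values) / abs(np.median(values))

def disp_drift(estimator, n, replications=2000):
    """Monte-Carlo approximation of DispDrift(T, Normal, n)."""
    estimates = np.array([estimator(rng.normal(size=n)) for _ in range(replications)])
    return np.sqrt(n) * rel_spread(estimates)

print("standard deviation:", round(disp_drift(np.std, n=100), 2))
```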

Drift offers four key advantages:

  • For estimators with \sqrt{n} convergence rates, drift remains finite and comparable across distributions; for heavier tails, drift may diverge, flagging estimator instability.
  • It provides absolute precision measures rather than only pairwise comparisons.
  • The robust \operatorname{Spread} foundation resists outlier distortion that corrupts variance-based calculations.
  • The \sqrt{n} normalization removes the sample-size dependency, so drift values can be compared directly across estimators evaluated at different sample sizes.

Under \underline{\operatorname{Additive}} (Normal) conditions, drift matches traditional efficiency. The sample mean achieves drift near 1.0; the median achieves drift around 1.25. This consistency validates drift as a proper generalization of efficiency that extends to realistic data conditions where traditional efficiency fails.

When switching from one estimator to another while maintaining the same precision, the required sample size adjustment follows:

n_{\text{new}} = n_{\text{original}} \cdot \frac{\operatorname{Drift}^2(T_2, X)}{\operatorname{Drift}^2(T_1, X)}

Here T_1 is the original estimator and T_2 is its replacement.

The ratio of squared drifts determines the data requirement change. If T_2 has drift 1.5 times higher than T_1, then T_2 requires (1.5)^2 = 2.25 times more data to match T_1's precision. Conversely, switching to a more precise estimator allows smaller sample sizes.
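In code, the adjustment is a one-line formula; the sketch below simply restates it (no toolkit API involved), using the mean-versus-median drifts from the Normal case as an example.

```python
def required_sample_size(n_original, drift_original, drift_new):
    """Sample size the new estimator needs to match the original estimator's precision."""
    return n_original * (drift_new / drift_original) ** 2

# Replacing the mean (drift ~1.0) with the median (drift ~1.25) under Normal data:
print(required_sample_size(1000, drift_original=1.0, drift_new=1.25))  # 1562.5
```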

For asymptotic analysis, \operatorname{Drift}(T, X) denotes the limiting value as n \to \infty. With a baseline estimator, rescaled drift values enable direct comparisons:

\operatorname{Drift}_{\text{baseline}}(T, X) = \operatorname{Drift}(T, X) / \operatorname{Drift}(T_{\text{baseline}}, X)

The standard drift definition assumes \sqrt{n} convergence rates typical under \underline{\operatorname{Additive}} (Normal) conditions. For broader applicability, drift generalizes to:

\operatorname{AvgDrift}(T, X, n) = \frac{n^{\text{instability}} \cdot \operatorname{Spread}[T(X_1, \ldots, X_n)]}{\operatorname{Spread}[X]}

\operatorname{DispDrift}(T, X, n) = n^{\text{instability}} \cdot \operatorname{RelSpread}[T(X_1, \ldots, X_n)]

The instability parameter adapts to estimator convergence rates. The toolkit uses \text{instability} = \frac{1}{2} throughout because this choice provides natural intuition and mental representation for the \underline{\operatorname{Additive}} (Normal) distribution. Rather than introduce additional complexity through variable instability parameters, the fixed \sqrt{n} scaling offers practical convenience while maintaining theoretical rigor for the distribution classes most common in applications.
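If one did want to experiment with other convergence rates, the generalization changes only the scaling factor. A hypothetical variant of the earlier sketch (not toolkit code, with illustrative numbers) might look like this:

```python
def generalized_avg_drift(estimate_spread, data_spread, n, instability=0.5):
    """Generalized AvgDrift: n**instability replaces sqrt(n); 0.5 recovers the standard form.

    `estimate_spread` approximates Spread[T(X_1, ..., X_n)] from repeated estimates,
    `data_spread` approximates Spread[X].
    """
    return n ** instability * estimate_spread / data_spread

# With the toolkit's fixed choice instability = 0.5, this equals the standard AvgDrift.
print(generalized_avg_drift(estimate_spread=0.1, data_spread=1.0, n=100))  # 1.0
```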

From Confidence Level to Misrate

Traditional statistics expresses uncertainty through confidence levels: 95% confidence interval, 99% confidence, 99.9% confidence. This convention emerged from early statistical practice when tables printed confidence intervals for common levels like 90%, 95%, and 99%.

The confidence level approach creates practical problems:

  • Cognitive difficulty with high confidence. Distinguishing between 99.999% and 99.9999% confidence requires mental effort. The difference matters — one represents a 1-in-100,000 error rate, the other 1-in-1,000,000 — but the representation obscures this distinction.
  • Asymmetric scale. The confidence level scale compresses near 100%, where most practical values cluster. Moving from 90% to 95% represents a 2× change in error rate, while moving from 99% to 99.9% represents a 10× change, despite similar visual spacing.
  • Indirect interpretation. Practitioners care about error rates, not success rates. "What's the chance I'm wrong?" matters more than "What's the chance I'm right?" Confidence level forces mental subtraction to answer the natural question.
  • Unclear defaults. Traditional practice offers no clear default confidence level. Different fields use different conventions (95%, 99%, 99.9%), creating inconsistency and requiring arbitrary choices.

The \mathrm{misrate} parameter provides a more natural representation. Misrate expresses the probability that computed bounds fail to contain the true value:

\mathrm{misrate} = 1 - \text{confidence level}
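The conversion is plain arithmetic; a minimal illustration (no toolkit API assumed):

```python
# Converting familiar confidence levels into misrates.
for confidence in (0.90, 0.95, 0.99, 0.999, 0.999999):
    print(f"confidence {confidence:<8}  ->  misrate {1 - confidence:.0e}")
```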

This simple inversion provides several advantages:

  • Direct interpretation. \mathrm{misrate} = 0.01 means a 1% chance of error, or wrong 1 time in 100. \mathrm{misrate} = 10^{-6} means wrong 1 time in a million. No mental arithmetic required.
  • Linear scale for practical values. \mathrm{misrate} = 0.1 (10%), \mathrm{misrate} = 0.01 (1%), and \mathrm{misrate} = 0.001 (0.1%) form a natural sequence. Scientific notation handles extreme values cleanly: 10^{-3}, 10^{-6}, 10^{-9}.
  • Clear comparisons. 10^{-5} versus 10^{-6} immediately shows a 10× difference in error tolerance; 99.999% versus 99.9999% confidence obscures this same relationship.
  • Pragmatic default. The toolkit recommends \mathrm{misrate} = 10^{-3} (a one-in-a-thousand error rate) as a reasonable default for everyday analysis. For critical decisions where errors are costly, use \mathrm{misrate} = 10^{-6} (one-in-a-million).

The terminology shift from confidence level to misrate parallels other clarifying renames in this toolkit. Just as \underline{\operatorname{Additive}} better describes the distribution's formation than Normal, and \operatorname{Center} better describes the estimator's purpose than Hodges-Lehmann, \mathrm{misrate} better describes the quantity practitioners actually reason about: the probability of error.

Traditional confidence intervals become bounds in this framework, eliminating statistical jargon in favor of descriptive terminology. \operatorname{ShiftBounds}(\mathbf{x}, \mathbf{y}, \mathrm{misrate}) clearly indicates what it provides: bounds on the shift, with a specified error rate. No background in classical statistics is required to understand the concept.

Invariance

Invariance properties determine how estimators respond to data transformations. These properties are crucial for analysis design and interpretation:

  • Location-invariant estimators are invariant to additive shifts: T(\mathbf{x} + k) = T(\mathbf{x})
  • Scale-invariant estimators are invariant to positive rescaling: T(k \cdot \mathbf{x}) = T(\mathbf{x}) for k > 0
  • Equivariant estimators change predictably with transformations, maintaining relative relationships

Choosing estimators with appropriate invariance properties ensures that results remain meaningful across different measurement scales, units, and data transformations. For example, when comparing datasets collected with different instruments or protocols, location-invariant estimators eliminate the need for data centering, while scale-invariant estimators eliminate the need for normalization.

Location-invariance: An estimator T is location-invariant if adding a constant to the measurements leaves the result unchanged:

T(\mathbf{x} + k) = T(\mathbf{x}) \qquad T(\mathbf{x} + k, \mathbf{y} + k) = T(\mathbf{x}, \mathbf{y})

Location-equivariance: An estimator T is location-equivariant if it shifts with the data:

T(\mathbf{x} + k) = T(\mathbf{x}) + k \qquad T(\mathbf{x} + k_1, \mathbf{y} + k_2) = T(\mathbf{x}, \mathbf{y}) + f(k_1, k_2)

Scale-invariance: An estimator T is scale-invariant if multiplying by a positive constant leaves the result unchanged:

T(k \cdot \mathbf{x}) = T(\mathbf{x}) \quad \text{for } k > 0 \qquad T(k \cdot \mathbf{x}, k \cdot \mathbf{y}) = T(\mathbf{x}, \mathbf{y}) \quad \text{for } k > 0

Scale-equivariance: An estimator T is scale-equivariant if it scales proportionally with the data:

T(k \cdot \mathbf{x}) = k \cdot T(\mathbf{x}) \text{ or } \lvert k \rvert \cdot T(\mathbf{x}) \quad \text{for } k \neq 0 \qquad T(k \cdot \mathbf{x}, k \cdot \mathbf{y}) = k \cdot T(\mathbf{x}, \mathbf{y}) \text{ or } \lvert k \rvert \cdot T(\mathbf{x}, \mathbf{y}) \quad \text{for } k \neq 0
These definitions yield the following invariance properties for the toolkit's estimators:

             Location      Scale
Center       Equivariant   Equivariant
Spread       Invariant     Equivariant
Shift        Invariant     Equivariant
Ratio        –             Invariant
Disparity    Invariant     Invariant
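These table entries can be spot-checked numerically. The sketch below verifies, up to floating-point noise, the Spread row using a median-of-pairwise-differences spread as a stand-in for \operatorname{Spread} (illustrative helper names, not toolkit API):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=50)
k = 3.7

def spread(values):
    """Median absolute pairwise difference (stand-in for Spread)."""
    diffs = np.abs(values[:, None] - values[None, :])
    return np.median(diffs[np.triu_indices(len(values), k=1)])

assert np.isclose(spread(x + k), spread(x))      # location-invariant: T(x + k) = T(x)
assert np.isclose(spread(k * x), k * spread(x))  # scale-equivariant:  T(k * x) = k * T(x)
print("Spread: location-invariant and scale-equivariant on this sample")
```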