Methodology

This chapter examines the methodological principles that guide Pragmastat's design and application.

Pragmatic Philosophy

The toolkit's foundations rest on pragmatist epistemology: truth is determined by practical consequences, not abstract correspondence with reality.

  • Truth is what works — An estimator is correct if it produces useful results across realistic conditions
  • Meaning from consequences — The value of a statistical method lies in what it enables, not its theoretical elegance
  • Theory serves practice — Mathematical analysis provides insight, but empirical validation determines adoption
  • Utility as criterion — When methods conflict, prefer the one that solves more real problems

This stance inverts the traditional relationship between theory and practice. Rather than deriving methods from first principles and hoping they apply, we evaluate methods by their performance and seek theoretical understanding afterward.

Procedure-First Empiricism

Traditional statistical practice follows an assumptions-first methodology:

  • Assume a data-generating model (e.g., observations are normally distributed)
  • Derive the optimal procedure under those assumptions
  • Apply the procedure to data, hoping assumptions approximately hold

This toolkit inverts the process:

  • Select procedures based on desired properties (robustness, equivariance, interpretability)
  • Empirically measure performance across a wide range of conditions
  • Use theory to explain and predict observed behavior

Monte Carlo simulation serves as the primary instrument of knowledge. Rather than deriving asymptotic formulas for estimator variance, we measure actual variance across thousands of simulated samples. Drift tables in this manual are empirically measured, not analytically derived.

This approach has practical advantages: simulations can explore conditions that resist closed-form analysis, and empirical results are self-validating — they show what actually happens, not what theory predicts should happen.

For the formal treatment of domain assumptions that govern valid inputs, see the Assumptions chapter.

Epistemic Humility

No perfectly Gaussian, log-normal, or Pareto distributions exist in real data. Every distribution we name is a useful fiction — a model we employ because it approximates reality well enough for our purposes, while knowing it cannot be exactly correct.

  • Models are approximations — They capture essential structure while ignoring irrelevant details
  • Approximations fail at boundaries — Edge cases, extreme values, and distribution tails often violate assumptions
  • Graceful degradation — Methods should produce sensible (if less precise) results when assumptions weaken

The toolkit embodies this humility by choosing estimators that remain interpretable and bounded even when distributional assumptions break down. A robust estimator may sacrifice some efficiency under ideal conditions in exchange for reliable behavior when conditions degrade.

The Pairwise Principle

A structural insight unifies all primary robust estimators in this toolkit: they are medians of pairwise operations.

| Estimator | Pairwise Operation | Result |
|-----------|--------------------|--------|
| $\operatorname{Center}$ | $\frac{x_i + x_j}{2}$ | Median of pairwise averages |
| $\operatorname{Spread}$ | $\lvert x_i - x_j \rvert$ | Median of pairwise differences |
| $\operatorname{Shift}$ | $x_i - y_j$ | Median of cross-sample differences |
| $\operatorname{Ratio}$ | $\log(x_i) - \log(y_j)$ | $\exp$ of median of log-differences |
| $\operatorname{Dominance}$ | $\mathbf{1}(x_i > y_j)$ | Proportion of pairwise comparisons |

For multiplicative quantities like $\operatorname{Ratio}$, the pairwise operation is defined in log-space, aggregated with the median, then mapped back with $\exp$. This canonical-scale approach preserves the median-of-pairwise-operations principle while ensuring exact multiplicative antisymmetry: $\operatorname{Ratio}(\mathbf{x}, \mathbf{y}) \times \operatorname{Ratio}(\mathbf{y}, \mathbf{x}) = 1$.
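The definitions above translate directly into code. The following Python sketch uses naive $O(n^2)$ loops (or $O(nm)$ for the two-sample forms) purely for illustration; the pair-indexing conventions (such as including $i = j$ for $\operatorname{Center}$) are assumptions of the sketch, and production implementations use the fast algorithms described below.

```python
import math
from statistics import median

# Naive sketches of the pairwise estimators; for illustration only.

def center(x):
    # Median of pairwise averages (x_i + x_j) / 2, here over i <= j.
    return median((x[i] + x[j]) / 2 for i in range(len(x)) for j in range(i, len(x)))

def spread(x):
    # Median of pairwise absolute differences |x_i - x_j| over i < j.
    return median(abs(x[i] - x[j]) for i in range(len(x)) for j in range(i + 1, len(x)))

def shift(x, y):
    # Median of cross-sample differences x_i - y_j.
    return median(xi - yj for xi in x for yj in y)

def ratio(x, y):
    # exp(median of log-differences); requires strictly positive inputs.
    return math.exp(median(math.log(xi) - math.log(yj) for xi in x for yj in y))

def dominance(x, y):
    # Proportion of pairs with x_i > y_j.
    return sum(xi > yj for xi in x for yj in y) / (len(x) * len(y))

print(center([1, 2, 3, 4, 5]))      # 3.0
print(shift([5, 6, 7], [1, 2, 3]))  # 4.0
```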

This pairwise structure provides three benefits:

  • Natural robustness — Comparing measurements to each other, not to external references, limits outlier influence
  • Self-calibration — The sample serves as its own reference distribution, requiring no external assumptions
  • Algebraic closure — Pairwise operations preserve symmetry and equivariance properties

The pairwise principle also enables efficient computation. Matrices of pairwise operations have structural properties (sorted rows and columns) that fast algorithms exploit to achieve $O(n \log n)$ complexity.
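The sorted structure is easy to verify directly: when $x$ is sorted, the matrix $M_{ij} = x_i + x_j$ has nondecreasing rows and columns, and it is exactly this property that selection-style algorithms exploit. A minimal check:

```python
x = sorted([3.1, 1.4, 2.7, 0.5])
M = [[xi + xj for xj in x] for xi in x]

# Every row and every column of the pairwise-sum matrix is nondecreasing.
assert all(row == sorted(row) for row in M)
assert all(list(col) == sorted(col) for col in zip(*M))
```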

Median as Universal Aggregator

The median is the final step in each pairwise estimator. Why median specifically?

The median achieves the maximum possible breakdown point (50%) among all translation-equivariant location estimators. Up to half the data can be arbitrarily corrupted before the median becomes unbounded.

However, $\operatorname{Center}$ and $\operatorname{Spread}$ achieve only 29% breakdown — not 50%. This is deliberate: a tradeoff between robustness and precision.

| Breakdown | Robustness | Precision | Estimators |
|-----------|------------|-----------|------------|
| 0% | None | Optimal under assumptions | $\operatorname{Mean}$, $\operatorname{StdDev}$ |
| 29% | Substantial | Near-optimal | $\operatorname{Center}$, $\operatorname{Spread}$ |
| 50% | Maximum | Reduced | $\operatorname{Median}$, $\operatorname{MAD}$ |

The 29% breakdown point survives approximately one corrupted measurement in four while maintaining roughly 95% asymptotic efficiency under ideal Gaussian conditions. This represents the practical optimum: enough robustness for realistic contamination levels, enough efficiency to compete with traditional methods when data is clean.
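The tradeoff is easy to demonstrate. In the sketch below (sample values are invented for illustration, and the naive center from the earlier sketch is reused), corrupting 25% of the measurements, below the 29% breakdown point, destroys the mean while $\operatorname{Center}$ barely moves.

```python
from statistics import mean, median

def center(x):
    # Naive median of pairwise averages (i <= j), for illustration only.
    return median((x[i] + x[j]) / 2 for i in range(len(x)) for j in range(i, len(x)))

clean = [9.8, 10.1, 9.9, 10.2, 10.0, 9.7, 10.3, 10.0]
corrupted = clean[:6] + [1e6, 1e6]  # 2 of 8 measurements corrupted (25% < 29%)

print(mean(clean), center(clean))          # both close to 10
print(mean(corrupted), center(corrupted))  # mean ~250,000; center still close to 10
```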

Convergence Conventions

Drift normalizes estimator variability by $\sqrt{n}$, making precision comparable across sample sizes:

$$\operatorname{Drift} = \operatorname{Spread}(\text{estimates}) \times \sqrt{n}$$

This normalization embeds a deliberate assumption: most useful estimators converge at the $\sqrt{n}$ rate. The Central Limit Theorem guarantees this rate for means under mild conditions, and median-based estimators inherit similar convergence behavior.

  • Common case default — $\sqrt{n}$ convergence covers the vast majority of practical estimators
  • Intuitive interpretation — Drift represents effective standard deviation at $n = 1$
  • Mental calculation — Expected precision at any $n$ is simply $\frac{\operatorname{Drift}}{\sqrt{n}}$

For estimators with non-standard convergence (e.g., extreme value statistics), drift generalizes to $n^{\text{instability}}$ where the instability exponent differs from $0.5$. But the toolkit deliberately uses $\sqrt{n}$ throughout because it matches the common case and provides intuitive interpretation without complicating the universal mechanism.

This is pragmatic universalism: adopt the common case as default, acknowledge exceptions exist, and handle them explicitly rather than burdening the common case with unnecessary generality.
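The convention is straightforward to reproduce empirically. The sketch below measures drift for the sample median under an $\underline{\operatorname{Additive}}$ (Gaussian) model; the helper names and simulation parameters are illustrative rather than part of the toolkit's API. If an estimator converges at the $\sqrt{n}$ rate, the printed values stay roughly constant as $n$ grows.

```python
import random
from statistics import median

def spread(x):
    # Naive median of pairwise absolute differences, for illustration only.
    return median(abs(x[i] - x[j]) for i in range(len(x)) for j in range(i + 1, len(x)))

def drift(estimator, n, simulations=1000, seed=1729):
    # Drift = Spread(estimates) * sqrt(n), measured across simulated samples.
    rng = random.Random(seed)
    estimates = [estimator([rng.gauss(0.0, 1.0) for _ in range(n)])
                 for _ in range(simulations)]
    return spread(estimates) * n ** 0.5

for n in (10, 40, 160):
    print(n, round(drift(median, n), 3))  # roughly constant across n
```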

Structural Unity

All robust estimators in this toolkit share a common mathematical structure:

$$\text{Estimator} = \text{InvTransform}(\operatorname{Median}(\text{PairwiseOperation}(\text{Transform}(x), \text{Transform}(y))))$$

For additive estimators ($\operatorname{Center}$, $\operatorname{Spread}$, $\operatorname{Shift}$), $\text{Transform}$ is the identity. For multiplicative estimators ($\operatorname{Ratio}$), $\text{Transform} = \log$ and $\text{InvTransform} = \exp$.
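A minimal sketch of this shared skeleton (names here are illustrative): one generic routine, parameterized by the pairwise operation and an optional transform pair, recovers both the additive and multiplicative estimators. One-sample estimators fit the same skeleton with $y = x$ and the appropriate operation, up to the pair-indexing convention.

```python
import math
from statistics import median

def pairwise_estimate(x, y, op, transform=None, inv_transform=None):
    # InvTransform(Median(op(Transform(x_i), Transform(y_j)))) over all pairs.
    identity = lambda v: v
    transform = transform or identity
    inv_transform = inv_transform or identity
    tx = [transform(v) for v in x]
    ty = [transform(v) for v in y]
    return inv_transform(median(op(a, b) for a in tx for b in ty))

def shift(x, y):
    # Additive: identity transform, pairwise differences.
    return pairwise_estimate(x, y, lambda a, b: a - b)

def ratio(x, y):
    # Multiplicative: log transform, pairwise differences in log-space, exp back.
    return pairwise_estimate(x, y, lambda a, b: a - b,
                             transform=math.log, inv_transform=math.exp)
```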

This structural unity is not merely aesthetic — it enables unified algorithmic optimization.

  • Sorted structure — Matrices of pairwise operations have sorted rows and columns
  • Monahan's algorithm — Exploits sorted structure for $O(n \log n)$ $\operatorname{Center}$/$\operatorname{Spread}$
  • Fast shift — Exploits cross-sample matrix structure for efficient two-sample comparison

Because all estimators share the same median-of-pairwise form, insights that accelerate one can often be adapted to accelerate others. A single theoretical framework covers all primary estimators.

Generative Naming

Names in this toolkit encode operational knowledge rather than historical provenance.

| Traditional | Pragmastat | What's Encoded |
|-------------|------------|----------------|
| Gaussian / Normal | $\underline{\operatorname{Additive}}$ | Formation: sum of independent factors (CLT) |
| Log-normal / Galton | $\underline{\operatorname{Multiplic}}$ | Formation: product of independent factors |
| Pareto | $\underline{\operatorname{Power}}$ | Behavior: power-law relationship |
| Hodges-Lehmann | $\operatorname{Center}$ | Function: measures central tendency |
| Shamos | $\operatorname{Spread}$ | Function: measures variability |
| (none) | sparity | Assumption: property of having positive spread |

Reading $\underline{\operatorname{Additive}}$ activates a generative model: this distribution arises when many independent factors add together. Reading "Gaussian" requires recalling an association with Carl Friedrich Gauss, then remembering what properties that name implies.

Generative names create immediate intuition about when a model applies. $\underline{\operatorname{Additive}}$ distributions arise from additive processes. $\underline{\operatorname{Multiplic}}$ distributions arise from multiplicative processes. The name itself encodes the formation mechanism.

The Inversion Principle

Traditional statistical outputs often require mental transformation before use. This toolkit inverts such framings to present information in directly actionable form, following principles of user-centered design (Norman 2013).

| Traditional | Pragmastat | Reason for Inversion |
|-------------|------------|----------------------|
| Confidence level (95%) | $\mathrm{misrate}$ (0.05) | Direct error interpretation |
| Confidence interval | Bounds | Plain language, no jargon |
| Hypothesis test (p-value) | Bounds estimation | "What's plausible?" not "Is zero plausible?" |
| Efficiency (variance ratio) | Drift (spread-based) | Works with heavy tails |

Consider the confidence level vs. misrate inversion. A 95% confidence interval requires understanding: "If I repeated this procedure infinitely, 95% of intervals would contain the true value." A 5% misrate states directly: "This procedure errs about 5% of the time."

The shift from confidence intervals to bounds, and from hypothesis testing to interval estimation, moves from frequentist theology toward decision-relevant inference. The practitioner asks "What values are plausible for this parameter?" rather than "Can I reject the hypothesis that this parameter equals zero?"

Multi-Audience Design

This manual serves readers with diverse backgrounds and conflicting preferences:

| Audience | Priorities | Challenges |
|----------|------------|------------|
| Experienced academics | Rigor, derivation, formalism, citations | May find practical focus too shallow |
| Professional developers | Examples, APIs, searchability, minimalism | May find theory intimidating |
| Students and beginners | Clarity, intuition, progressive disclosure | Need both theory and practice |
| Large language models | Structure, consistency, unambiguous definitions | Need form-independent content |

These audiences have conflicting needs. Academics want complete derivations; developers want quick answers. Beginners need gentle introductions; experts need dense references. LLMs need predictable structure; humans appreciate variety.

The manual targets a neutral zone where all audiences find acceptable content:

  • Signature first — Mathematical definition immediately visible
  • Example second — Concrete computation before abstract explanation
  • Detail optional — Properties, corner cases, and theory follow for those who need them
  • Every sentence earns its place — No filler prose, no redundant explanation

Structural Principles

  • Concrete over abstract — Numbers and examples before symbols and theory
  • Precision without verbosity — Mathematical rigor in minimal words
  • Consistent layout — Same structure across all toolkit items enables scanning
  • Self-contained sections — Each section readable independently

LLM-Friendliness

The manual's structure also serves machine readers:

  • Predictable patterns — Consistent section ordering aids extraction
  • Explicit definitions — No implicit knowledge assumed
  • Tabular data — Structured information in tables, not prose
  • Short paragraphs — Content chunks cleanly for context windows

This multi-audience optimization forces elimination of audience-specific conventions, revealing form-independent essential content that serves everyone adequately rather than serving one group perfectly and others poorly.

Reference Tests as Specification

The toolkit maintains seven implementations across different programming languages: Python, TypeScript, R, C#, Kotlin, Rust, and Go. Each implementation must produce identical numerical results for all estimators.

This cross-language consistency is achieved through executable specifications:

Manual (definitions) ↔ C# (reference) → JSON (tests) → All languages (validation)

The specification IS the test suite. Reference tests serve three critical purposes:

  • Cross-language validation — All implementations pass identical test cases
  • Regression prevention — Changes validated against known outputs
  • Implementation guidance — Concrete examples for porting to new languages

Test Design Principles

  • Minimal sufficiency — Smallest test set providing high confidence in correctness
  • Comprehensive coverage — Both typical cases and edge cases that expose errors
  • Deterministic reproducibility — Fixed seeds for all random tests

Test Categories

  • Canonical cases — Deterministic inputs like natural number sequences where outputs are easily verified
  • Edge cases — Boundary conditions: single element, zeros, minimum viable sample sizes
  • Fuzzy tests — Controlled random exploration beyond hand-crafted examples

The C# implementation serves as the reference generator. All test cases are defined programmatically, executed to produce expected outputs, and serialized to JSON. Other implementations load these JSON files and verify their outputs match within numerical tolerance.

Cross-Language Determinism

Reproducibility requires determinism at every layer. When a simulation in Python produces a result, the same simulation in Rust, Go, or any other supported language must produce the identical result.

  • Portable RNG — $\operatorname{Rng}(\text{experiment-1})$ produces identical sequences in all languages
  • Specified algorithms — xoshiro256++ for generation, SplitMix64 for seeding, FNV-1a for string hashing
  • No implementation-dependent behavior — Floating-point operations follow IEEE 754
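The sketch below shows one way these pieces could compose; the three algorithms are rendered faithfully from their published definitions, but the seeding pipeline (FNV-1a hash of the experiment name feeding SplitMix64, which fills the xoshiro256++ state) is an illustrative assumption rather than the toolkit's normative derivation. All arithmetic is modulo $2^{64}$, so the sequence is reproducible in any language.

```python
MASK64 = (1 << 64) - 1

def fnv1a(s: str) -> int:
    # FNV-1a 64-bit hash of a UTF-8 string.
    h = 0xCBF29CE484222325
    for b in s.encode("utf-8"):
        h = ((h ^ b) * 0x100000001B3) & MASK64
    return h

def splitmix64(state: int):
    # SplitMix64: expands one 64-bit seed into a stream of 64-bit values.
    while True:
        state = (state + 0x9E3779B97F4A7C15) & MASK64
        z = state
        z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
        z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
        yield (z ^ (z >> 31)) & MASK64

def rotl(v: int, k: int) -> int:
    return ((v << k) | (v >> (64 - k))) & MASK64

class Xoshiro256pp:
    # xoshiro256++ generator, seeded from a string name.
    def __init__(self, name: str):
        seeder = splitmix64(fnv1a(name))
        self.s = [next(seeder) for _ in range(4)]

    def next_u64(self) -> int:
        s = self.s
        result = (rotl((s[0] + s[3]) & MASK64, 23) + s[0]) & MASK64
        t = (s[1] << 17) & MASK64
        s[2] ^= s[0]; s[3] ^= s[1]; s[1] ^= s[2]; s[0] ^= s[3]
        s[2] ^= t
        s[3] = rotl(s[3], 45)
        return result

rng = Xoshiro256pp("experiment-1")
print(rng.next_u64())  # same value in every language, given the same derivation
```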

Unified API

Beyond numerical determinism, the toolkit maintains a consistent API across all implementations. Function names, parameter orders, and return types follow the same conventions in every language.

  • Same vocabulary — $\operatorname{Center}$, $\operatorname{Spread}$, $\operatorname{Shift}$ mean the same thing everywhere
  • Same signaturesCenter(x) in Python, Center(x) in Rust, Center(x) in Go
  • Same behavior — Edge cases, error conditions, and defaults are identical

This unified API enables frictionless language switching. A practitioner prototyping in Python can port to Rust for production without learning new abstractions or revalidating statistical assumptions. The mental model transfers directly; only syntax changes.

Benefits of Unification

  • Debugging across languages — A failing test in TypeScript can be debugged in C#
  • Verified ports — New implementations can be validated against existing ones
  • Reproducible research — Results can be reproduced in any supported language
  • Team flexibility — Different team members can use preferred languages on the same analysis
  • Migration paths — Move from prototype to production without statistical revalidation

Summary Principles

The methodology of this toolkit can be distilled into twelve guiding principles:

  • Name things by what they do, not who discovered them — Generative names encode operational knowledge
  • All models are wrong; design for graceful degradation — Robust methods fail gently
  • Evaluate empirically, organize theoretically — Simulation before derivation
  • Self-reference provides robustness — Pairwise operations compare data to itself
  • 29% breakdown is the practical optimum — Balance robustness and precision
  • Invert framings that require mental transformation — Present directly actionable information
  • Default to the common case — Use $\sqrt{n}$ convergence; handle exceptions explicitly
  • Multi-audience optimization reveals essential content — Serve everyone adequately, not one group perfectly
  • Executable specifications are reliable specifications — Tests define correctness
  • Reproducibility requires portable determinism — Same seeds, same results, any language
  • Structural unity enables unified optimization — The median-of-pairwise form admits fast algorithms
  • Utility is the ultimate criterion — Methods that solve real problems are correct methods

Strict Domains Principle

For each function parameter, Pragmastat enforces the strictest domain that:

  • Supports virtually all legitimate real-world use cases
  • Rejects pathological cases that would produce misleading results
  • Fails immediately with actionable guidance rather than silently degrading

Rationale: Learning from NHST Problems

Traditional tools accept arbitrary confidence levels without warning when the requested precision exceeds data resolution. This leads to misleading results: a practitioner requests 99.99% confidence with $n=5$ and receives bounds that look like valid statistical inference but actually have much lower coverage.

Strict validation approach

  • Making impossible requests impossible — If $n=5$ cannot achieve 99% confidence, the function rejects $\mathrm{misrate}=0.01$ rather than returning meaningless bounds.
  • Actionable errors — Messages explain WHY the request failed and HOW to fix it.
  • Explicit tradeoffs — Practitioners learn their data's actual resolution limits.

Minimum achievable misrate

For one-sample bounds, the minimum achievable misrate is $2^{1-n}$:

| $n$ | $\mathrm{misrate}_{\min}$ | Max confidence | Notes |
|-----|---------------------------|----------------|-------|
| 2 | 0.5 | 50% | only trivial bounds possible |
| 5 | 0.0625 | 93.75% | cannot achieve 95% |
| 7 | 0.0156 | 98.4% | cannot achieve 99% |
| 10 | 0.00195 | 99.8% | most practical misrates achievable |
| 20 | $1.9 \times 10^{-6}$ | 99.9998% | $\mathrm{misrate} = 10^{-6}$ requires $n \geq 21$ |

Practical implications

  • $n \leq 5$: Cannot achieve 95% confidence ($\mathrm{misrate} = 0.05$)
  • $n \leq 7$: Cannot achieve 99% confidence ($\mathrm{misrate} = 0.01$)
  • $n \geq 21$: $\mathrm{misrate} = 10^{-6}$ is achievable

This principle ensures that Pragmastat functions never silently produce misleading results when the requested precision exceeds what the data can support.
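A hypothetical validator in this spirit (the function name and message wording are illustrative, not the toolkit's actual API):

```python
def validate_misrate(n: int, misrate: float) -> None:
    # Reject requests the data cannot support, with actionable guidance.
    min_misrate = 2.0 ** (1 - n)  # minimum achievable misrate for one-sample bounds
    if misrate < min_misrate:
        raise ValueError(
            f"misrate={misrate} is not achievable with n={n}: the minimum "
            f"achievable misrate is 2^(1-n) = {min_misrate:.3g}. Collect more "
            f"data or request misrate >= {min_misrate:.3g}."
        )

validate_misrate(10, 0.05)  # ok: 0.05 >= 2^-9 ~ 0.00195
try:
    validate_misrate(5, 0.01)  # rejected: 0.01 < 2^-4 = 0.0625
except ValueError as e:
    print(e)
```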

Test Framework

The reference test framework consists of three components:

Test generation — The C# implementation defines test inputs programmatically using builder patterns. For deterministic cases, inputs are explicitly specified. For random cases, the framework uses controlled seeds with System.Random to ensure reproducibility across all platforms.

The random generation mechanism works as follows:

  • Each test suite builder maintains a seed counter initialized to zero.
  • For one-sample estimators, each distribution type receives the next available seed. The same random generator produces all samples for all sizes within that distribution.
  • For two-sample estimators, each pair of distributions receives two consecutive seeds: one for the $\mathbf{x}$ sample generator and one for the $\mathbf{y}$ sample generator.
  • The seed counter increments with each random generator creation, ensuring deterministic test data generation.

For $\underline{\operatorname{Additive}}$ distributions, random values are generated using the Box-Muller transform, which converts pairs of uniform random values into normally distributed values. The transform applies the formula:

$$X = \mu + \sigma \sqrt{-2 \ln(U_1)} \sin(2 \pi U_2)$$

where $U_1, U_2$ are uniform random values from $\underline{\operatorname{Uniform}}(0, 1)$, $\mu$ is the mean, and $\sigma$ is the standard deviation.

For $\underline{\operatorname{Uniform}}$ distributions, random values are generated directly using the quantile function:

$$X = \min + U \cdot (\max - \min)$$

where $U$ is a uniform random value from $\underline{\operatorname{Uniform}}(0, 1)$.
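Both generation rules are short enough to sketch. Here Python's random.Random stands in for the C# generator; since the actual test data comes from the serialized C# outputs, this is illustrative only.

```python
import math
import random

def next_additive(rng: random.Random, mu: float, sigma: float) -> float:
    # Box-Muller transform: two uniforms -> one normally distributed value.
    u1 = 1.0 - rng.random()  # shift from [0, 1) to (0, 1] so log(u1) is defined
    u2 = rng.random()
    return mu + sigma * math.sqrt(-2.0 * math.log(u1)) * math.sin(2.0 * math.pi * u2)

def next_uniform(rng: random.Random, lo: float, hi: float) -> float:
    # Quantile function of Uniform(lo, hi).
    return lo + rng.random() * (hi - lo)

rng = random.Random(0)  # seed counter starts at zero, as in the suite builders
x = [next_additive(rng, mu=0.0, sigma=1.0) for _ in range(10)]
y = [next_uniform(rng, lo=0.0, hi=1.0) for _ in range(10)]
```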

The framework executes the reference implementation on all generated inputs and serializes input-output pairs to JSON format.

Test validation — Each language implementation loads the JSON test cases and executes them against its local estimator implementation. Assertions verify that outputs match expected values within a given numerical tolerance (typically $10^{-10}$ for relative error).

Test data format — Each test case is a JSON file containing input and output fields. For one-sample estimators, the input contains array x and optional parameters. For two-sample estimators, input contains arrays x and y. For bounds estimators ($\operatorname{ShiftBounds}$, $\operatorname{RatioBounds}$), input additionally contains misrate. Output is a single numeric value for point estimators, or an object with lower and upper fields for bounds estimators.
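A sketch of consuming such a test case (the inline JSON below is a fabricated example of the shape, not a case from the actual suite):

```python
import json
from statistics import median

def center(x):
    # Naive median of pairwise averages (i <= j), for illustration only.
    return median((x[i] + x[j]) / 2 for i in range(len(x)) for j in range(i, len(x)))

case = json.loads('{"input": {"x": [1, 2, 3, 4, 5]}, "output": 3.0}')

def matches(actual: float, expected: float, tol: float = 1e-10) -> bool:
    # Relative-error comparison, with an absolute fallback near zero.
    return abs(actual - expected) <= tol * max(abs(expected), 1.0)

assert matches(center(case["input"]["x"]), case["output"])
```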

Performance testing — The toolkit provides $O(n \log n)$ fast algorithms for the $\operatorname{Center}$, $\operatorname{Spread}$, and $\operatorname{Shift}$ estimators, dramatically more efficient than naive implementations that materialize all pairwise combinations. Performance tests use sample size $n = 100{,}000$ (one-sample) or $n = m = 100{,}000$ (two-sample). This size creates a clear performance distinction: fast implementations ($O(n \log n)$ or $O((m+n) \log L)$) complete in under 5 seconds on modern hardware across all supported languages, while naive implementations ($O(n^2 \log n)$ or $O(mn \log(mn))$) would be prohibitively slow, taking hours or failing due to memory exhaustion. With $n = 100{,}000$, naive approaches would need to materialize approximately 5 billion pairwise values for $\operatorname{Center}$/$\operatorname{Spread}$ or 10 billion for $\operatorname{Shift}$, whereas fast algorithms require only $O(n)$ additional memory.

Performance tests serve dual purposes: correctness validation at scale and performance regression detection. They ensure implementations use the efficient algorithms and remain practical for real-world datasets with hundreds of thousands of observations. Performance test specifications are provided in the respective estimator sections above.

This framework ensures that all seven language implementations maintain strict numerical agreement across the full test suite.