Methodology

This chapter examines the methodological principles that guide Pragmastat's design and application.

Pragmatic Philosophy

The toolkit's foundations rest on pragmatist epistemology: truth is determined by practical consequences, not abstract correspondence with reality.

  • Truth is what works — An estimator is correct if it produces useful results across realistic conditions
  • Meaning from consequences — The value of a statistical method lies in what it enables, not its theoretical elegance
  • Theory serves practice — Mathematical analysis provides insight, but empirical validation determines adoption
  • Utility as criterion — When methods conflict, prefer the one that solves more real problems

This stance inverts the traditional relationship between theory and practice. Rather than deriving methods from first principles and hoping they apply, we evaluate methods by their performance and seek theoretical understanding afterward.

Procedure-First Empiricism

Traditional statistical practice follows an assumptions-first methodology:

  • Assume a data-generating model (e.g., observations are normally distributed)
  • Derive the optimal procedure under those assumptions
  • Apply the procedure to data, hoping assumptions approximately hold

This toolkit inverts the process:

  • Select procedures based on desired properties (robustness, equivariance, interpretability)
  • Empirically measure performance across a wide range of conditions
  • Use theory to explain and predict observed behavior

Monte Carlo simulation serves as the primary instrument of knowledge. Rather than deriving asymptotic formulas for estimator variance, we measure actual variance across thousands of simulated samples. Drift tables in this manual are empirically measured, not analytically derived.

This approach has practical advantages: simulations can explore conditions that resist closed-form analysis, and empirical results are self-validating — they show what actually happens, not what theory predicts should happen.

For the formal treatment of domain assumptions that govern valid inputs, see the Assumptions chapter.

Epistemic Humility

No perfectly Gaussian, log-normal, or Pareto distributions exist in real data. Every distribution we name is a useful fiction — a model we employ because it approximates reality well enough for our purposes, while knowing it cannot be exactly correct.

  • Models are approximations — They capture essential structure while ignoring irrelevant details
  • Approximations fail at boundaries — Edge cases, extreme values, and distribution tails often violate assumptions
  • Graceful degradation — Methods should produce sensible (if less precise) results when assumptions weaken

The toolkit embodies this humility by choosing estimators that remain interpretable and bounded even when distributional assumptions break down. A robust estimator may sacrifice some efficiency under ideal conditions in exchange for reliable behavior when conditions degrade.

The Pairwise Principle

A structural insight unifies all primary robust estimators in this toolkit: they are medians of pairwise operations.

| Estimator | Pairwise Operation | Result |
|-----------|--------------------|--------|
| $\operatorname{Center}$ | $\frac{x_i + x_j}{2}$ | Median of pairwise averages |
| $\operatorname{Spread}$ | $\lvert x_i - x_j \rvert$ | Median of pairwise differences |
| $\operatorname{Shift}$ | $x_i - y_j$ | Median of cross-sample differences |
| $\operatorname{Ratio}$ | $\log(x_i) - \log(y_j)$ | $\exp$ of median of log-differences |
| $\operatorname{Dominance}$ | $\mathbf{1}(x_i > y_j)$ | Proportion of pairwise comparisons |

For multiplicative quantities like $\operatorname{Ratio}$, the pairwise operation is defined in log-space, aggregated with the median, then mapped back with $\exp$. This canonical-scale approach preserves the median-of-pairwise-operations principle while ensuring exact multiplicative antisymmetry: $\operatorname{Ratio}(\mathbf{x}, \mathbf{y}) \times \operatorname{Ratio}(\mathbf{y}, \mathbf{x}) = 1$.
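The definitions above translate directly into code. The following Python sketch uses naive $O(n^2)$ loops (or $O(nm)$ for the two-sample forms) purely for illustration; the pair-indexing conventions (such as including $i = j$ for $\operatorname{Center}$) are assumptions of the sketch, and production implementations use the fast algorithms described below.

```python
import math
from statistics import median

# Naive sketches of the pairwise estimators; for illustration only.

def center(x):
    # Median of pairwise averages (x_i + x_j) / 2, here over i <= j.
    return median((x[i] + x[j]) / 2 for i in range(len(x)) for j in range(i, len(x)))

def spread(x):
    # Median of pairwise absolute differences |x_i - x_j| over i < j.
    return median(abs(x[i] - x[j]) for i in range(len(x)) for j in range(i + 1, len(x)))

def shift(x, y):
    # Median of cross-sample differences x_i - y_j.
    return median(xi - yj for xi in x for yj in y)

def ratio(x, y):
    # exp(median of log-differences); requires strictly positive inputs.
    return math.exp(median(math.log(xi) - math.log(yj) for xi in x for yj in y))

def dominance(x, y):
    # Proportion of pairs with x_i > y_j.
    return sum(xi > yj for xi in x for yj in y) / (len(x) * len(y))

print(center([1, 2, 3, 4, 5]))      # 3.0
print(shift([5, 6, 7], [1, 2, 3]))  # 4.0
```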

This pairwise structure provides three benefits:

  • Natural robustness — Comparing measurements to each other, not to external references, limits outlier influence
  • Self-calibration — The sample serves as its own reference distribution, requiring no external assumptions
  • Algebraic closure — Pairwise operations preserve symmetry and equivariance properties

The pairwise principle also enables efficient computation. Matrices of pairwise operations have structural properties (sorted rows and columns) that fast algorithms exploit to achieve $O(n \log n)$ complexity.
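The sorted structure is easy to verify directly: when $x$ is sorted, the matrix $M_{ij} = x_i + x_j$ has nondecreasing rows and columns, and it is exactly this property that selection-style algorithms exploit. A minimal check:

```python
x = sorted([3.1, 1.4, 2.7, 0.5])
M = [[xi + xj for xj in x] for xi in x]

# Every row and every column of the pairwise-sum matrix is nondecreasing.
assert all(row == sorted(row) for row in M)
assert all(list(col) == sorted(col) for col in zip(*M))
```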

Median as Universal Aggregator

The median is the final step in each pairwise estimator. Why median specifically?

The median achieves the maximum possible breakdown point (50%) among all translation-equivariant location estimators. Up to half the data can be arbitrarily corrupted before the median becomes unbounded.

However, $\operatorname{Center}$ and $\operatorname{Spread}$ achieve only 29% breakdown — not 50%. This is deliberate: a tradeoff between robustness and precision.

| Breakdown | Robustness | Precision | Estimators |
|-----------|------------|-----------|------------|
| 0% | None | Optimal under assumptions | $\operatorname{Mean}$, $\operatorname{StdDev}$ |
| 29% | Substantial | Near-optimal | $\operatorname{Center}$, $\operatorname{Spread}$ |
| 50% | Maximum | Reduced | $\operatorname{Median}$, $\operatorname{MAD}$ |

The 29% breakdown point survives approximately one corrupted measurement in four while maintaining roughly 95% asymptotic efficiency under ideal Gaussian conditions. This represents the practical optimum: enough robustness for realistic contamination levels, enough efficiency to compete with traditional methods when data is clean.
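The tradeoff is easy to demonstrate. In the sketch below (sample values are invented for illustration, and the naive center from the earlier sketch is reused), corrupting 25% of the measurements, below the 29% breakdown point, destroys the mean while $\operatorname{Center}$ barely moves.

```python
from statistics import mean, median

def center(x):
    # Naive median of pairwise averages (i <= j), for illustration only.
    return median((x[i] + x[j]) / 2 for i in range(len(x)) for j in range(i, len(x)))

clean = [9.8, 10.1, 9.9, 10.2, 10.0, 9.7, 10.3, 10.0]
corrupted = clean[:6] + [1e6, 1e6]  # 2 of 8 measurements corrupted (25% < 29%)

print(mean(clean), center(clean))          # both close to 10
print(mean(corrupted), center(corrupted))  # mean ~250,000; center still close to 10
```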

Convergence Conventions

Drift normalizes estimator variability by $\sqrt{n}$, making precision comparable across sample sizes:

$$\operatorname{Drift} = \operatorname{Spread}(\text{estimates}) \times \sqrt{n}$$

This normalization embeds a deliberate assumption: most useful estimators converge at the $\sqrt{n}$ rate. The Central Limit Theorem guarantees this rate for means under mild conditions, and median-based estimators inherit similar convergence behavior.

  • Common case default — $\sqrt{n}$ convergence covers the vast majority of practical estimators
  • Intuitive interpretation — Drift represents effective standard deviation at $n = 1$
  • Mental calculation — Expected precision at any $n$ is simply $\frac{\operatorname{Drift}}{\sqrt{n}}$

For estimators with non-standard convergence (e.g., extreme value statistics), drift generalizes to $n^{\text{instability}}$ where the instability exponent differs from $0.5$. But the toolkit deliberately uses $\sqrt{n}$ throughout because it matches the common case and provides intuitive interpretation without complicating the universal mechanism.

This is pragmatic universalism: adopt the common case as default, acknowledge exceptions exist, and handle them explicitly rather than burdening the common case with unnecessary generality.
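The convention is straightforward to reproduce empirically. The sketch below measures drift for the sample median under an $\underline{\operatorname{Additive}}$ (Gaussian) model; the helper names and simulation parameters are illustrative rather than part of the toolkit's API. If an estimator converges at the $\sqrt{n}$ rate, the printed values stay roughly constant as $n$ grows.

```python
import random
from statistics import median

def spread(x):
    # Naive median of pairwise absolute differences, for illustration only.
    return median(abs(x[i] - x[j]) for i in range(len(x)) for j in range(i + 1, len(x)))

def drift(estimator, n, simulations=1000, seed=1729):
    # Drift = Spread(estimates) * sqrt(n), measured across simulated samples.
    rng = random.Random(seed)
    estimates = [estimator([rng.gauss(0.0, 1.0) for _ in range(n)])
                 for _ in range(simulations)]
    return spread(estimates) * n ** 0.5

for n in (10, 40, 160):
    print(n, round(drift(median, n), 3))  # roughly constant across n
```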

Structural Unity

All robust estimators in this toolkit share a common mathematical structure:

$$\text{Estimator} = \text{InvTransform}(\operatorname{Median}(\text{PairwiseOperation}(\text{Transform}(x), \text{Transform}(y))))$$

For additive estimators ($\operatorname{Center}$, $\operatorname{Spread}$, $\operatorname{Shift}$), $\text{Transform}$ is the identity. For multiplicative estimators ($\operatorname{Ratio}$), $\text{Transform} = \log$ and $\text{InvTransform} = \exp$.
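A minimal sketch of this shared skeleton (names here are illustrative): one generic routine, parameterized by the pairwise operation and an optional transform pair, recovers both the additive and multiplicative estimators. One-sample estimators fit the same skeleton with $y = x$ and the appropriate operation, up to the pair-indexing convention.

```python
import math
from statistics import median

def pairwise_estimate(x, y, op, transform=None, inv_transform=None):
    # InvTransform(Median(op(Transform(x_i), Transform(y_j)))) over all pairs.
    identity = lambda v: v
    transform = transform or identity
    inv_transform = inv_transform or identity
    tx = [transform(v) for v in x]
    ty = [transform(v) for v in y]
    return inv_transform(median(op(a, b) for a in tx for b in ty))

def shift(x, y):
    # Additive: identity transform, pairwise differences.
    return pairwise_estimate(x, y, lambda a, b: a - b)

def ratio(x, y):
    # Multiplicative: log transform, pairwise differences in log-space, exp back.
    return pairwise_estimate(x, y, lambda a, b: a - b,
                             transform=math.log, inv_transform=math.exp)
```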

This structural unity is not merely aesthetic — it enables unified algorithmic optimization.

  • Sorted structure — Matrices of pairwise operations have sorted rows and columns
  • Monahan's algorithm — Exploits sorted structure for $O(n \log n)$ $\operatorname{Center}$/$\operatorname{Spread}$
  • Fast shift — Exploits cross-sample matrix structure for efficient two-sample comparison

Because all estimators share the same median-of-pairwise form, insights that accelerate one can often be adapted to accelerate others. A single theoretical framework covers all primary estimators.

Generative Naming

Names in this toolkit encode operational knowledge rather than historical provenance.

| Traditional | Pragmastat | What's Encoded |
|-------------|------------|----------------|
| Gaussian / Normal | $\underline{\operatorname{Additive}}$ | Formation: sum of independent factors (CLT) |
| Log-normal / Galton | $\underline{\operatorname{Multiplic}}$ | Formation: product of independent factors |
| Pareto | $\underline{\operatorname{Power}}$ | Behavior: power-law relationship |
| Hodges-Lehmann | $\operatorname{Center}$ | Function: measures central tendency |
| Shamos | $\operatorname{Spread}$ | Function: measures variability |
| (none) | sparity | Assumption: property of having positive spread |

Reading $\underline{\operatorname{Additive}}$ activates a generative model: this distribution arises when many independent factors add together. Reading "Gaussian" requires recalling an association with Carl Friedrich Gauss, then remembering what properties that name implies.

Generative names create immediate intuition about when a model applies. $\underline{\operatorname{Additive}}$ distributions arise from additive processes. $\underline{\operatorname{Multiplic}}$ distributions arise from multiplicative processes. The name itself encodes the formation mechanism.

The Inversion Principle

Traditional statistical outputs often require mental transformation before use. This toolkit inverts such framings to present information in directly actionable form, following principles of user-centered design (Norman 2013).

| Traditional | Pragmastat | Reason for Inversion |
|-------------|------------|----------------------|
| Confidence level (95%) | $\mathrm{misrate}$ (0.05) | Direct error interpretation |
| Confidence interval | Bounds | Plain language, no jargon |
| Hypothesis test (p-value) | Bounds estimation | "What's plausible?" not "Is zero plausible?" |
| Efficiency (variance ratio) | Drift (spread-based) | Works with heavy tails |

Consider the confidence level vs. misrate inversion. A 95% confidence interval requires understanding: "If I repeated this procedure infinitely, 95% of intervals would contain the true value." A 5% misrate states directly: "This procedure errs about 5% of the time."

The shift from confidence intervals to bounds, and from hypothesis testing to interval estimation, moves from frequentist theology toward decision-relevant inference. The practitioner asks "What values are plausible for this parameter?" rather than "Can I reject the hypothesis that this parameter equals zero?"

Multi-Audience Design

This manual serves readers with diverse backgrounds and conflicting preferences:

| Audience | Priorities | Challenges |
|----------|------------|------------|
| Experienced academics | Rigor, derivation, formalism, citations | May find practical focus too shallow |
| Professional developers | Examples, APIs, searchability, minimalism | May find theory intimidating |
| Students and beginners | Clarity, intuition, progressive disclosure | Need both theory and practice |
| Large language models | Structure, consistency, unambiguous definitions | Need form-independent content |

These audiences have conflicting needs. Academics want complete derivations; developers want quick answers. Beginners need gentle introductions; experts need dense references. LLMs need predictable structure; humans appreciate variety.

The manual targets a neutral zone where all audiences find acceptable content:

  • Signature first — Mathematical definition immediately visible
  • Example second — Concrete computation before abstract explanation
  • Detail optional — Properties, corner cases, and theory follow for those who need them
  • Every sentence earns its place — No filler prose, no redundant explanation

Structural Principles

  • Concrete over abstract — Numbers and examples before symbols and theory
  • Precision without verbosity — Mathematical rigor in minimal words
  • Consistent layout — Same structure across all toolkit items enables scanning
  • Self-contained sections — Each section readable independently

LLM-Friendliness

The manual's structure also serves machine readers:

  • Predictable patterns — Consistent section ordering aids extraction
  • Explicit definitions — No implicit knowledge assumed
  • Tabular data — Structured information in tables, not prose
  • Short paragraphs — Content chunks cleanly for context windows

This multi-audience optimization forces elimination of audience-specific conventions, revealing form-independent essential content that serves everyone adequately rather than serving one group perfectly and others poorly.

Reference Tests as Specification

The toolkit maintains seven implementations across different programming languages: Python, TypeScript, R, C#, Kotlin, Rust, and Go. Each implementation must produce identical numerical results for all estimators.

This cross-language consistency is achieved through executable specifications:

Manual (definitions) ↔ C# (reference) → JSON (tests) → All languages (validation)

The specification IS the test suite. Reference tests serve three critical purposes:

  • Cross-language validation — All implementations pass identical test cases
  • Regression prevention — Changes validated against known outputs
  • Implementation guidance — Concrete examples for porting to new languages

Test Design Principles

  • Minimal sufficiency — Smallest test set providing high confidence in correctness
  • Comprehensive coverage — Both typical cases and edge cases that expose errors
  • Deterministic reproducibility — Fixed seeds for all random tests

Test Categories

  • Canonical cases — Deterministic inputs like natural number sequences where outputs are easily verified
  • Edge cases — Boundary conditions: single element, zeros, minimum viable sample sizes
  • Fuzzy tests — Controlled random exploration beyond hand-crafted examples

The C# implementation serves as the reference generator. All test cases are defined programmatically, executed to produce expected outputs, and serialized to JSON. Other implementations load these JSON files and verify their outputs match within numerical tolerance.

Cross-Language Determinism

Reproducibility requires determinism at every layer. When a simulation in Python produces a result, the same simulation in Rust, Go, or any other supported language must produce the identical result.

  • Portable RNG — $\operatorname{Rng}(\text{experiment-1})$ produces identical sequences in all languages
  • Specified algorithms — xoshiro256++ for generation, SplitMix64 for seeding, FNV-1a for string hashing
  • No implementation-dependent behavior — Floating-point operations follow IEEE 754
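The sketch below shows one way these pieces could compose; the three algorithms are rendered faithfully from their published definitions, but the seeding pipeline (FNV-1a hash of the experiment name feeding SplitMix64, which fills the xoshiro256++ state) is an illustrative assumption rather than the toolkit's normative derivation. All arithmetic is modulo $2^{64}$, so the sequence is reproducible in any language.

```python
MASK64 = (1 << 64) - 1

def fnv1a(s: str) -> int:
    # FNV-1a 64-bit hash of a UTF-8 string.
    h = 0xCBF29CE484222325
    for b in s.encode("utf-8"):
        h = ((h ^ b) * 0x100000001B3) & MASK64
    return h

def splitmix64(state: int):
    # SplitMix64: expands one 64-bit seed into a stream of 64-bit values.
    while True:
        state = (state + 0x9E3779B97F4A7C15) & MASK64
        z = state
        z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
        z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
        yield (z ^ (z >> 31)) & MASK64

def rotl(v: int, k: int) -> int:
    return ((v << k) | (v >> (64 - k))) & MASK64

class Xoshiro256pp:
    # xoshiro256++ generator, seeded from a string name.
    def __init__(self, name: str):
        seeder = splitmix64(fnv1a(name))
        self.s = [next(seeder) for _ in range(4)]

    def next_u64(self) -> int:
        s = self.s
        result = (rotl((s[0] + s[3]) & MASK64, 23) + s[0]) & MASK64
        t = (s[1] << 17) & MASK64
        s[2] ^= s[0]; s[3] ^= s[1]; s[1] ^= s[2]; s[0] ^= s[3]
        s[2] ^= t
        s[3] = rotl(s[3], 45)
        return result

rng = Xoshiro256pp("experiment-1")
print(rng.next_u64())  # same value in every language, given the same derivation
```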

Unified API

Beyond numerical determinism, the toolkit maintains a consistent API across all implementations. Function names, parameter orders, and return types follow the same conventions in every language.

  • Same vocabulary — $\operatorname{Center}$, $\operatorname{Spread}$, $\operatorname{Shift}$ mean the same thing everywhere
  • Same signaturesCenter(x) in Python, Center(x) in Rust, Center(x) in Go
  • Same behavior — Edge cases, error conditions, and defaults are identical

This unified API enables frictionless language switching. A practitioner prototyping in Python can port to Rust for production without learning new abstractions or revalidating statistical assumptions. The mental model transfers directly; only syntax changes.

Benefits of Unification

  • Debugging across languages — A failing test in TypeScript can be debugged in C#
  • Verified ports — New implementations can be validated against existing ones
  • Reproducible research — Results can be reproduced in any supported language
  • Team flexibility — Different team members can use preferred languages on the same analysis
  • Migration paths — Move from prototype to production without statistical revalidation

Summary Principles

The methodology of this toolkit can be distilled into twelve guiding principles:

  • Name things by what they do, not who discovered them — Generative names encode operational knowledge
  • All models are wrong; design for graceful degradation — Robust methods fail gently
  • Evaluate empirically, organize theoretically — Simulation before derivation
  • Self-reference provides robustness — Pairwise operations compare data to itself
  • 29% breakdown is the practical optimum — Balance robustness and precision
  • Invert framings that require mental transformation — Present directly actionable information
  • Default to the common case — Use $\sqrt{n}$ convergence; handle exceptions explicitly
  • Multi-audience optimization reveals essential content — Serve everyone adequately, not one group perfectly
  • Executable specifications are reliable specifications — Tests define correctness
  • Reproducibility requires portable determinism — Same seeds, same results, any language
  • Structural unity enables unified optimization — The median-of-pairwise form admits fast algorithms
  • Utility is the ultimate criterion — Methods that solve real problems are correct methods

Strict Domains Principle

For each function parameter, Pragmastat enforces the strictest domain that:

  • Supports virtually all legitimate real-world use cases
  • Rejects pathological cases that would produce misleading results
  • Fails immediately with actionable guidance rather than silently degrading

Rationale: Learning from NHST Problems

Traditional tools accept arbitrary confidence levels without warning when the requested precision exceeds data resolution. This leads to misleading results: a practitioner requests 99.99% confidence with $n=5$ and receives bounds that look like valid statistical inference but actually have much lower coverage.

Strict validation approach

  • Making impossible requests impossible — If $n=5$ cannot achieve 99% confidence, the function rejects $\mathrm{misrate}=0.01$ rather than returning meaningless bounds.
  • Actionable errors — Messages explain WHY the request failed and HOW to fix it.
  • Explicit tradeoffs — Practitioners learn their data's actual resolution limits.

Minimum achievable misrate

For one-sample bounds, the minimum achievable misrate is $2^{1-n}$:

| $n$ | $\mathrm{misrate}_{\min}$ | Max confidence | Notes |
|-----|---------------------------|----------------|-------|
| 2 | 0.5 | 50% | only trivial bounds possible |
| 5 | 0.0625 | 93.75% | cannot achieve 95% |
| 7 | 0.0156 | 98.4% | cannot achieve 99% |
| 10 | 0.00195 | 99.8% | most practical misrates achievable |
| 20 | $1.9 \times 10^{-6}$ | 99.9998% | $\mathrm{misrate} = 10^{-6}$ requires $n \geq 21$ |

Practical implications

  • $n \leq 5$: Cannot achieve 95% confidence ($\mathrm{misrate} = 0.05$)
  • $n \leq 7$: Cannot achieve 99% confidence ($\mathrm{misrate} = 0.01$)
  • $n \geq 21$: $\mathrm{misrate} = 10^{-6}$ is achievable

This principle ensures that Pragmastat functions never silently produce misleading results when the requested precision exceeds what the data can support.
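A hypothetical validator in this spirit (the function name and message wording are illustrative, not the toolkit's actual API):

```python
def validate_misrate(n: int, misrate: float) -> None:
    # Reject requests the data cannot support, with actionable guidance.
    min_misrate = 2.0 ** (1 - n)  # minimum achievable misrate for one-sample bounds
    if misrate < min_misrate:
        raise ValueError(
            f"misrate={misrate} is not achievable with n={n}: the minimum "
            f"achievable misrate is 2^(1-n) = {min_misrate:.3g}. Collect more "
            f"data or request misrate >= {min_misrate:.3g}."
        )

validate_misrate(10, 0.05)  # ok: 0.05 >= 2^-9 ~ 0.00195
try:
    validate_misrate(5, 0.01)  # rejected: 0.01 < 2^-4 = 0.0625
except ValueError as e:
    print(e)
```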

Test Framework

The reference test framework consists of three components:

Test generation — The C# implementation defines test inputs programmatically using builder patterns. For deterministic cases, inputs are explicitly specified. For random cases, the framework uses controlled seeds with System.Random to ensure reproducibility across all platforms.

The random generation mechanism works as follows:

  • Each test suite builder maintains a seed counter initialized to zero.
  • For one-sample estimators, each distribution type receives the next available seed. The same random generator produces all samples for all sizes within that distribution.
  • For two-sample estimators, each pair of distributions receives two consecutive seeds: one for the $\mathbf{x}$ sample generator and one for the $\mathbf{y}$ sample generator.
  • The seed counter increments with each random generator creation, ensuring deterministic test data generation.

For $\underline{\operatorname{Additive}}$ distributions, random values are generated using the Box-Muller transform, which converts pairs of uniform random values into normally distributed values. The transform applies the formula:

$$X = \mu + \sigma \sqrt{-2 \ln(U_1)} \sin(2 \pi U_2)$$

where $U_1, U_2$ are uniform random values from $\underline{\operatorname{Uniform}}(0, 1)$, $\mu$ is the mean, and $\sigma$ is the standard deviation.

For $\underline{\operatorname{Uniform}}$ distributions, random values are generated directly using the quantile function:

$$X = \min + U \cdot (\max - \min)$$

where $U$ is a uniform random value from $\underline{\operatorname{Uniform}}(0, 1)$.
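Both generation rules are short enough to sketch. Here Python's random.Random stands in for the C# generator; since the actual test data comes from the serialized C# outputs, this is illustrative only.

```python
import math
import random

def next_additive(rng: random.Random, mu: float, sigma: float) -> float:
    # Box-Muller transform: two uniforms -> one normally distributed value.
    u1 = 1.0 - rng.random()  # shift from [0, 1) to (0, 1] so log(u1) is defined
    u2 = rng.random()
    return mu + sigma * math.sqrt(-2.0 * math.log(u1)) * math.sin(2.0 * math.pi * u2)

def next_uniform(rng: random.Random, lo: float, hi: float) -> float:
    # Quantile function of Uniform(lo, hi).
    return lo + rng.random() * (hi - lo)

rng = random.Random(0)  # seed counter starts at zero, as in the suite builders
x = [next_additive(rng, mu=0.0, sigma=1.0) for _ in range(10)]
y = [next_uniform(rng, lo=0.0, hi=1.0) for _ in range(10)]
```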

The framework executes the reference implementation on all generated inputs and serializes input-output pairs to JSON format.

Test validation — Each language implementation loads the JSON test cases and executes them against its local estimator implementation. Assertions verify that outputs match expected values within a given numerical tolerance (typically $10^{-10}$ for relative error).

Test data format — Each test case is a JSON file containing input and output fields. For one-sample estimators, the input contains array x and optional parameters. For two-sample estimators, input contains arrays x and y. For bounds estimators ($\operatorname{ShiftBounds}$, $\operatorname{RatioBounds}$), input additionally contains misrate. Output is a single numeric value for point estimators, or an object with lower and upper fields for bounds estimators.
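A sketch of consuming such a test case (the inline JSON below is a fabricated example of the shape, not a case from the actual suite):

```python
import json
from statistics import median

def center(x):
    # Naive median of pairwise averages (i <= j), for illustration only.
    return median((x[i] + x[j]) / 2 for i in range(len(x)) for j in range(i, len(x)))

case = json.loads('{"input": {"x": [1, 2, 3, 4, 5]}, "output": 3.0}')

def matches(actual: float, expected: float, tol: float = 1e-10) -> bool:
    # Relative-error comparison, with an absolute fallback near zero.
    return abs(actual - expected) <= tol * max(abs(expected), 1.0)

assert matches(center(case["input"]["x"]), case["output"])
```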

Performance testing — The toolkit provides $O(n \log n)$ fast algorithms for the $\operatorname{Center}$, $\operatorname{Spread}$, and $\operatorname{Shift}$ estimators, dramatically more efficient than naive implementations that materialize all pairwise combinations. Performance tests use sample size $n = 100{,}000$ (one-sample) or $n = m = 100{,}000$ (two-sample). This size creates a clear performance distinction: fast implementations ($O(n \log n)$ or $O((m+n) \log L)$) complete in under 5 seconds on modern hardware across all supported languages, while naive implementations ($O(n^2 \log n)$ or $O(mn \log(mn))$) would be prohibitively slow, taking hours or failing due to memory exhaustion. With $n = 100{,}000$, naive approaches would need to materialize approximately 5 billion pairwise values for $\operatorname{Center}$/$\operatorname{Spread}$ or 10 billion for $\operatorname{Shift}$, whereas fast algorithms require only $O(n)$ additional memory.

Performance tests serve dual purposes: correctness validation at scale and performance regression detection. They ensure implementations use the efficient algorithms and remain practical for real-world datasets with hundreds of thousands of observations. Performance test specifications are provided in the respective estimator sections above.

This framework ensures that all seven language implementations maintain strict numerical agreement across the full test suite.