Methodology
This chapter examines the methodological principles that guide Pragmastat's design and application.
Pragmatic Philosophy
The toolkit's foundations rest on pragmatist epistemology: truth is determined by practical consequences, not abstract correspondence with reality.
- Truth is what works — An estimator is correct if it produces useful results across realistic conditions
- Meaning from consequences — The value of a statistical method lies in what it enables, not its theoretical elegance
- Theory serves practice — Mathematical analysis provides insight, but empirical validation determines adoption
- Utility as criterion — When methods conflict, prefer the one that solves more real problems
This stance inverts the traditional relationship between theory and practice. Rather than deriving methods from first principles and hoping they apply, we evaluate methods by their performance and seek theoretical understanding afterward.
Procedure-First Empiricism
Traditional statistical practice follows an assumptions-first methodology:
- Assume a data-generating model (e.g., observations are normally distributed)
- Derive the optimal procedure under those assumptions
- Apply the procedure to data, hoping assumptions approximately hold
This toolkit inverts the process:
- Select procedures based on desired properties (robustness, equivariance, interpretability)
- Empirically measure performance across a wide range of conditions
- Use theory to explain and predict observed behavior
Monte Carlo simulation serves as the primary instrument of knowledge. Rather than deriving asymptotic formulas for estimator variance, we measure actual variance across thousands of simulated samples. Drift tables in this manual are empirically measured, not analytically derived.
This approach has practical advantages: simulations can explore conditions that resist closed-form analysis, and empirical results are self-validating — they show what actually happens, not what theory predicts should happen.
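To make this concrete, a drift-style measurement reduces to a short simulation loop. The following Python sketch is illustrative only: the Gaussian sampler, the sample sizes, and the standard-deviation normalization stand in for the manual's actual drift procedure.

```python
import random
import statistics

def measure_drift(estimator, n, iterations=10_000, seed=1729):
    """Empirically measure sqrt(n)-normalized variability of an estimator."""
    rng = random.Random(seed)
    estimates = [
        estimator([rng.gauss(0, 1) for _ in range(n)])
        for _ in range(iterations)
    ]
    # Scale the sampling-distribution variability by sqrt(n) so that
    # results are comparable across sample sizes.
    return statistics.stdev(estimates) * n ** 0.5

# No asymptotic derivation required: the same harness works for any estimator.
for n in (10, 100, 1000):
    print(n, round(measure_drift(statistics.median, n), 3))
```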
For the formal treatment of domain assumptions that govern valid inputs, see the Assumptions chapter.
Epistemic Humility
No perfectly Gaussian, log-normal, or Pareto distributions exist in real data. Every distribution we name is a useful fiction — a model we employ because it approximates reality well enough for our purposes, while knowing it cannot be exactly correct.
- Models are approximations — They capture essential structure while ignoring irrelevant details
- Approximations fail at boundaries — Edge cases, extreme values, and distribution tails often violate assumptions
- Graceful degradation — Methods should produce sensible (if less precise) results when assumptions weaken
The toolkit embodies this humility by choosing estimators that remain interpretable and bounded even when distributional assumptions break down. A robust estimator may sacrifice some efficiency under ideal conditions in exchange for reliable behavior when conditions degrade.
The Pairwise Principle
A structural insight unifies all primary robust estimators in this toolkit: they are medians of pairwise operations.
| Estimator | Pairwise Operation |
|---|---|
| Center | Median of pairwise averages |
| Spread | Median of absolute pairwise differences |
| Shift | Median of cross-sample differences |
| Ratio | exp(median of cross-sample log-differences) |
|  | Proportion of pairwise comparisons |
For multiplicative quantities like Ratio, the pairwise operation is defined in log-space, aggregated with the median, then mapped back with exp. This canonical-scale approach preserves the median-of-pairwise-operations principle while ensuring exact multiplicative antisymmetry: Ratio(x, y) = 1 / Ratio(y, x).
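A minimal sketch of the log-space construction (the function name ratio is illustrative, and all pairs are materialized for clarity):

```python
import math
import statistics

def ratio(x, y):
    """exp(median of pairwise log-differences) between two positive samples."""
    log_diffs = [math.log(xi) - math.log(yj) for xi in x for yj in y]
    return math.exp(statistics.median(log_diffs))

x, y = [1.2, 3.4, 5.6], [0.7, 1.1, 2.9]
# Multiplicative antisymmetry holds by construction: Ratio(x, y) = 1 / Ratio(y, x)
assert math.isclose(ratio(x, y) * ratio(y, x), 1.0)
```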
This pairwise structure provides three benefits:
- Natural robustness — Comparing measurements to each other, not to external references, limits outlier influence
- Self-calibration — The sample serves as its own reference distribution, requiring no external assumptions
- Algebraic closure — Pairwise operations preserve symmetry and equivariance properties
The pairwise principle also enables efficient computation. Matrices of pairwise operations have structural properties (sorted rows and columns) that fast algorithms exploit to achieve O(n log n) complexity.
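The sorted structure is easy to verify directly. This sketch materializes a full matrix of pairwise averages only to demonstrate the property; fast algorithms exploit it without ever building the matrix:

```python
x = sorted([4.2, 1.0, 3.3, 2.8, 5.9])

# Over sorted data, the matrix of pairwise averages ascends along every
# row and every column, which enables selection-style median algorithms.
m = [[(xi + xj) / 2 for xj in x] for xi in x]

assert all(row == sorted(row) for row in m)              # rows ascend
assert all(list(col) == sorted(col) for col in zip(*m))  # columns ascend
```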
Median as Universal Aggregator
The median is the final step in each pairwise estimator. Why median specifically?
The median achieves the maximum possible breakdown point (50%) among all translation-equivariant location estimators. Up to half the data can be arbitrarily corrupted before the median becomes unbounded.
However, Center and Spread achieve only 29% breakdown — not 50%. This is deliberate: a tradeoff between robustness and precision.
| Breakdown | Robustness | Precision | Estimators |
|---|---|---|---|
| 0% | None | Optimal under assumptions | Mean, standard deviation |
| 29% | Substantial | Near-optimal | Center, Spread |
| 50% | Maximum | Reduced | Median, MAD |
The 29% breakdown point survives approximately one corrupted measurement in four while maintaining roughly 95% asymptotic efficiency under ideal Gaussian conditions. This represents the practical optimum: enough robustness for realistic contamination levels, enough efficiency to compete with traditional methods when data is clean.
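This tradeoff is visible in a small simulation. Below is a sketch comparing the mean with a naive median-of-pairwise-averages estimator (named center here for illustration) after corrupting one measurement:

```python
import random
import statistics

def center(x):
    """Naive O(n^2) median of pairwise averages (Hodges-Lehmann-style)."""
    n = len(x)
    return statistics.median(
        (x[i] + x[j]) / 2 for i in range(n) for j in range(i, n)
    )

rng = random.Random(42)
clean = [rng.gauss(10, 1) for _ in range(20)]
dirty = clean[:-1] + [1e6]  # replace one measurement with an extreme outlier

print(statistics.mean(clean), statistics.mean(dirty))  # mean is dragged far away
print(center(clean), center(dirty))                    # center shifts only slightly
```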
Convergence Conventions
Drift normalizes estimator variability by √n, making precision comparable across sample sizes: the spread of the estimator's sampling distribution is multiplied by √n.
This normalization embeds a deliberate assumption: most useful estimators converge at the √n rate. The Central Limit Theorem guarantees this rate for means under mild conditions, and median-based estimators inherit similar convergence behavior.
- Common case default — √n convergence covers the vast majority of practical estimators
- Intuitive interpretation — Drift represents the effective standard deviation at n = 1
- Mental calculation — Expected precision at any n is simply Drift/√n (a drift of 10 implies precision near 10/√100 = 1 at n = 100)
For estimators with non-standard convergence (e.g., extreme value statistics), drift generalizes to normalization by nᵅ, where the exponent α differs from 1/2. But the toolkit deliberately uses √n throughout because it matches the common case and provides intuitive interpretation without complicating the universal mechanism.
This is pragmatic universalism: adopt the common case as default, acknowledge exceptions exist, and handle them explicitly rather than burdening the common case with unnecessary generality.
Structural Unity
All robust estimators in this toolkit share a common mathematical structure:

estimate = g⁻¹( median of g(pairwise operation) )

For additive estimators (Center, Spread, Shift), g is the identity. For multiplicative estimators (Ratio), g = log and g⁻¹ = exp.
This structural unity is not merely aesthetic — it enables unified algorithmic optimization.
- Sorted structure — Matrices of pairwise operations have sorted rows and columns
- Monahan's algorithm — Exploits sorted structure to compute Center and Spread in O(n log n)
- Fast shift — Exploits cross-sample matrix structure for efficient two-sample comparison
Because all estimators share the same median of pairwise form, insights that accelerate one can often be adapted to accelerate others. A single theoretical framework covers all primary estimators.
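The shared form can be written down once. A sketch with illustrative names, where g selects the canonical scale:

```python
import math
import statistics

def pairwise_median(x, y, op, g=lambda v: v, g_inv=lambda v: v):
    """Unified form: g_inv(median of g(op(xi, yj))) over all cross-sample pairs."""
    return g_inv(statistics.median(g(op(xi, yj)) for xi in x for yj in y))

# Additive estimator: g is the identity.
shift = lambda x, y: pairwise_median(x, y, lambda a, b: a - b)

# Multiplicative estimator: g = log, g_inv = exp.
ratio = lambda x, y: pairwise_median(x, y, lambda a, b: a / b, math.log, math.exp)
```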
Generative Naming
Names in this toolkit encode operational knowledge rather than historical provenance.
| Traditional | Pragmastat | What's Encoded |
|---|---|---|
| Gaussian / Normal | Additive | Formation: sum of independent factors (CLT) |
| Log-normal / Galton | Multiplicative | Formation: product of independent factors |
| Pareto | Power | Behavior: power-law relationship |
| Hodges-Lehmann | Center | Function: measures central tendency |
| Shamos | Spread | Function: measures variability |
| (none) | sparity | Assumption: property of having positive spread |
Reading Additive activates a generative model: this distribution arises when many independent factors add together. Reading Gaussian requires recalling an association with Carl Friedrich Gauss, then remembering what properties that name implies.
Generative names create immediate intuition about when a model applies. Additive distributions arise from additive processes. Multiplicative distributions arise from multiplicative processes. The name itself encodes the formation mechanism.
The Inversion Principle
Traditional statistical outputs often require mental transformation before use. This toolkit inverts such framings to present information in directly actionable form, following principles of user-centered design (Norman 2013).
| Traditional | Pragmastat | Reason for Inversion |
|---|---|---|
| Confidence level (95%) | Misrate (0.05) | Direct error interpretation |
| Confidence interval | Bounds | Plain language, no jargon |
| Hypothesis test (p-value) | Bounds estimation | "What's plausible?" not "Is zero plausible?" |
| Efficiency (variance ratio) | Drift (spread-based) | Works with heavy tails |
Consider the confidence level vs. misrate inversion. A 95% confidence interval requires understanding: "If I repeated this procedure infinitely, 95% of intervals would contain the true value." A 5% misrate states directly: "This procedure errs about 5% of the time."
The shift from confidence intervals to bounds, and from hypothesis testing to interval estimation, moves from frequentist theology toward decision-relevant inference. The practitioner asks "What values are plausible for this parameter?" rather than "Can I reject the hypothesis that this parameter equals zero?"
Multi-Audience Design
This manual serves readers with diverse backgrounds and conflicting preferences:
| Audience | Priorities | Challenges |
|---|---|---|
| Experienced academics | Rigor, derivation, formalism, citations | May find practical focus too shallow |
| Professional developers | Examples, APIs, searchability, minimalism | May find theory intimidating |
| Students and beginners | Clarity, intuition, progressive disclosure | Need both theory and practice |
| Large language models | Structure, consistency, unambiguous definitions | Need form-independent content |
These audiences have conflicting needs. Academics want complete derivations; developers want quick answers. Beginners need gentle introductions; experts need dense references. LLMs need predictable structure; humans appreciate variety.
The manual targets a neutral zone where all audiences find acceptable content:
- Signature first — Mathematical definition immediately visible
- Example second — Concrete computation before abstract explanation
- Detail optional — Properties, corner cases, and theory follow for those who need them
- Every sentence earns its place — No filler prose, no redundant explanation
Structural Principles
- Concrete over abstract — Numbers and examples before symbols and theory
- Precision without verbosity — Mathematical rigor in minimal words
- Consistent layout — Same structure across all toolkit items enables scanning
- Self-contained sections — Each section readable independently
LLM-Friendliness
The manual's structure also serves machine readers:
- Predictable patterns — Consistent section ordering aids extraction
- Explicit definitions — No implicit knowledge assumed
- Tabular data — Structured information in tables, not prose
- Short paragraphs — Content chunks cleanly for context windows
This multi-audience optimization forces elimination of audience-specific conventions, revealing form-independent essential content that serves everyone adequately rather than serving one group perfectly and others poorly.
Reference Tests as Specification
The toolkit maintains seven implementations across different programming languages: Python, TypeScript, R, C#, Kotlin, Rust, and Go. Each implementation must produce identical numerical results for all estimators.
This cross-language consistency is achieved through executable specifications:
Manual (definitions) ↔ C# (reference) → JSON (tests) → All languages (validation)
The specification IS the test suite. Reference tests serve three critical purposes:
- Cross-language validation — All implementations pass identical test cases
- Regression prevention — Changes validated against known outputs
- Implementation guidance — Concrete examples for porting to new languages
Test Design Principles
- Minimal sufficiency — Smallest test set providing high confidence in correctness
- Comprehensive coverage — Both typical cases and edge cases that expose errors
- Deterministic reproducibility — Fixed seeds for all random tests
Test Categories
- Canonical cases — Deterministic inputs like natural number sequences where outputs are easily verified
- Edge cases — Boundary conditions: single element, zeros, minimum viable sample sizes
- Fuzzy tests — Controlled random exploration beyond hand-crafted examples
The C# implementation serves as the reference generator. All test cases are defined programmatically, executed to produce expected outputs, and serialized to JSON. Other implementations load these JSON files and verify their outputs match within numerical tolerance.
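In any target language, the validation side stays small. A Python sketch for point estimators, where the directory layout and the estimator registry are illustrative and the tolerance value is a placeholder:

```python
import json
import math
from pathlib import Path

def validate(test_dir, estimators, rel_tol=1e-9):
    """Replay every JSON reference test against local implementations."""
    for path in Path(test_dir).glob("**/*.json"):
        case = json.loads(path.read_text())
        name = path.parent.name  # assume one directory per estimator
        actual = estimators[name](**case["input"])
        # Bounds estimators return {lower, upper} and need a two-field check.
        assert math.isclose(actual, case["output"], rel_tol=rel_tol), path
```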
Cross-Language Determinism
Reproducibility requires determinism at every layer. When a simulation in Python produces a result, the same simulation in Rust, Go, or any other supported language must produce the identical result.
- Portable RNG — produces identical sequences in all languages
- Specified algorithms — xoshiro256++ for generation, SplitMix64 for seeding, FNV-1a for string hashing (see the sketch below)
- No implementation-dependent behavior — Floating-point operations follow IEEE 754
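These algorithms are published and straightforward to port. A compact Python sketch of the generator stack, with SplitMix64 expanding one seed into xoshiro256++ state (FNV-1a omitted for brevity; the class name is illustrative):

```python
MASK = (1 << 64) - 1

def splitmix64(state):
    """SplitMix64: expands one 64-bit seed into a stream of 64-bit values."""
    while True:
        state = (state + 0x9E3779B97F4A7C15) & MASK
        z = state
        z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK
        z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK
        yield z ^ (z >> 31)

def rotl(v, k):
    """Rotate a 64-bit value left by k bits."""
    return ((v << k) | (v >> (64 - k))) & MASK

class Xoshiro256PP:
    """xoshiro256++: same seed, same sequence, in any language."""
    def __init__(self, seed):
        sm = splitmix64(seed)
        self.s = [next(sm) for _ in range(4)]

    def next_uint64(self):
        s = self.s
        result = (rotl((s[0] + s[3]) & MASK, 23) + s[0]) & MASK
        t = (s[1] << 17) & MASK
        s[2] ^= s[0]
        s[3] ^= s[1]
        s[1] ^= s[2]
        s[0] ^= s[3]
        s[2] ^= t
        s[3] = rotl(s[3], 45)
        return result
```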
Unified API
Beyond numerical determinism, the toolkit maintains a consistent API across all implementations. Function names, parameter orders, and return types follow the same conventions in every language.
- Same vocabulary — Center, Spread, and Shift mean the same thing everywhere
- Same signatures — `Center(x)` in Python, `Center(x)` in Rust, `Center(x)` in Go
This unified API enables frictionless language switching. A practitioner prototyping in Python can port to Rust for production without learning new abstractions or revalidating statistical assumptions. The mental model transfers directly; only syntax changes.
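A Python session might therefore look like the sketch below; the function names follow the manual's unified vocabulary, while the exact import path is illustrative:

```python
from pragmastat import center, spread, shift  # import path may differ per release

x = [1.8, 2.1, 2.4, 2.6, 3.0]
y = [1.2, 1.4, 1.5, 1.9, 2.2]

print(center(x))    # one-sample location
print(spread(x))    # one-sample dispersion
print(shift(x, y))  # two-sample difference
```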
Benefits of Unification
- Debugging across languages — A failing test in TypeScript can be debugged in C#
- Verified ports — New implementations can be validated against existing ones
- Reproducible research — Results can be reproduced in any supported language
- Team flexibility — Different team members can use preferred languages on the same analysis
- Migration paths — Move from prototype to production without statistical revalidation
Summary Principles
The methodology of this toolkit can be distilled into twelve guiding principles:
- Name things by what they do, not who discovered them — Generative names encode operational knowledge
- All models are wrong; design for graceful degradation — Robust methods fail gently
- Evaluate empirically, organize theoretically — Simulation before derivation
- Self-reference provides robustness — Pairwise operations compare data to itself
- 29% breakdown is the practical optimum — Balance robustness and precision
- Invert framings that require mental transformation — Present directly actionable information
- Default to the common case — Use √n convergence; handle exceptions explicitly
- Multi-audience optimization reveals essential content — Serve everyone adequately, not one group perfectly
- Executable specifications are reliable specifications — Tests define correctness
- Reproducibility requires portable determinism — Same seeds, same results, any language
- Structural unity enables unified optimization — Median of pairwise admits fast algorithms
- Utility is the ultimate criterion — Methods that solve real problems are correct methods
Strict Domains Principle
For each function parameter, Pragmastat enforces the strictest domain that:
- Supports virtually all legitimate real-world use cases
- Rejects pathological cases that would produce misleading results
- Fails immediately with actionable guidance rather than silently degrading
Rationale: Learning from NHST Problems
Traditional tools accept arbitrary confidence levels without warning when the requested precision exceeds data resolution. This leads to misleading results: a practitioner requests 99.99% confidence from a small sample and receives bounds that look like valid statistical inference but actually have much lower coverage.
Strict validation approach
- Making impossible requests impossible — If n = 7 cannot achieve 99% confidence, the function rejects the request rather than returning meaningless bounds.
- Actionable errors — Messages explain WHY the request failed and HOW to fix it.
- Explicit tradeoffs — Practitioners learn their data's actual resolution limits.
Minimum achievable misrate

For one-sample bounds, the minimum achievable misrate is 2/2ⁿ:
| n | min misrate | max confidence | notes |
|---|---|---|---|
| 2 | 0.5 | 50% | only trivial bounds possible |
| 5 | 0.0625 | 93.75% | cannot achieve 95% |
| 7 | 0.0156 | 98.4% | cannot achieve 99% |
| 10 | 0.00195 | 99.8% | most practical misrates achievable |
| 20 | 0.0000019 | 99.9998% | misrate 0.00001 is achievable |
Practical implications
- n ≤ 5: Cannot achieve 95% confidence (minimum misrate 0.0625 > 0.05)
- n ≤ 7: Cannot achieve 99% confidence (minimum misrate 0.0156 > 0.01)
- n ≥ 20: Misrate 0.00001 is achievable
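A sketch of this validation rule for one-sample bounds, with illustrative message text:

```python
def validate_misrate(n, misrate):
    """Reject unachievable requests with actionable guidance, not bad bounds."""
    min_misrate = 2 / 2 ** n
    if misrate < min_misrate:
        raise ValueError(
            f"misrate={misrate} is unachievable for n={n}: "
            f"the minimum achievable misrate is {min_misrate:.2g}; "
            f"collect more data or request misrate >= {min_misrate:.2g}"
        )

validate_misrate(10, 0.05)  # fine: 0.05 >= 0.00195
try:
    validate_misrate(5, 0.05)  # rejected: minimum achievable is 0.0625
except ValueError as error:
    print(error)
```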
This principle ensures that Pragmastat functions never silently produce misleading results when the requested precision exceeds what the data can support.
Test Framework
The reference test framework consists of three components:
Test generation — The C# implementation defines test inputs programmatically using builder patterns. For deterministic cases, inputs are explicitly specified. For random cases, the framework uses controlled seeds with `System.Random` to ensure reproducibility across all platforms.
The random generation mechanism works as follows:
- Each test suite builder maintains a seed counter initialized to zero.
- For one-sample estimators, each distribution type receives the next available seed. The same random generator produces all samples for all sizes within that distribution.
- For two-sample estimators, each pair of distributions receives two consecutive seeds: one for the x sample generator and one for the y sample generator.
- The seed counter increments with each random generator creation, ensuring deterministic test data generation.
For normal distributions, random values are generated using the Box–Muller transform, which converts pairs of uniform random values into normally distributed values. The transform applies the formula:

x = μ + σ · √(−2 ln u₁) · cos(2π u₂)

where u₁ and u₂ are uniform random values from (0, 1), μ is the mean, and σ is the standard deviation.
For uniform distributions, random values are generated directly using the quantile function:

x = a + (b − a) · u

where u is a uniform random value from [0, 1) and [a, b] is the support.
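A sketch of both generation paths together with the seed bookkeeping (simplified relative to the C# builders; Python's generator stands in for `System.Random`, so the numbers differ):

```python
import math
import random

class SuiteBuilder:
    """Hands out consecutive seeds, mirroring the seed-counter mechanism."""
    def __init__(self):
        self.seed_counter = 0

    def next_rng(self):
        rng = random.Random(self.seed_counter)
        self.seed_counter += 1
        return rng

def box_muller(rng, mu, sigma):
    """Normal value: mu + sigma * sqrt(-2 ln u1) * cos(2 pi u2)."""
    u1 = 1.0 - rng.random()  # shift to (0, 1] so that log(u1) is defined
    u2 = rng.random()
    return mu + sigma * math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)

def uniform_quantile(rng, a, b):
    """Uniform value via the quantile function: a + (b - a) * u."""
    return a + (b - a) * rng.random()

builder = SuiteBuilder()
rng_x, rng_y = builder.next_rng(), builder.next_rng()  # two-sample: consecutive seeds
sample_x = [box_muller(rng_x, 0.0, 1.0) for _ in range(10)]
sample_y = [uniform_quantile(rng_y, 0.0, 1.0) for _ in range(10)]
```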
The framework executes the reference implementation on all generated inputs and serializes input-output pairs to JSON format.
Test validation — Each language implementation loads the JSON test cases and executes them against its local estimator implementation. Assertions verify that outputs match expected values within a specified numerical tolerance on the relative error.
Test data format — Each test case is a JSON file containing input and output fields. For one-sample estimators, the input contains array x and optional parameters. For two-sample estimators, input contains arrays x and y. For bounds estimators, input additionally contains misrate. Output is a single numeric value for point estimators, or an object with lower and upper fields for bounds estimators.
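For illustration, the two shapes written as Python literals (all numeric values are placeholders):

```python
# Point estimator: single numeric output.
point_case = {
    "input": {"x": [1.0, 2.0, 3.0, 4.0, 5.0]},
    "output": 3.0,
}

# Bounds estimator: misrate in the input, an interval in the output.
bounds_case = {
    "input": {"x": [1.0, 2.0, 3.0, 4.0, 5.0], "misrate": 0.0625},
    "output": {"lower": 1.5, "upper": 4.5},
}
```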
Performance testing — The toolkit provides fast algorithms for the Center, Spread, and Shift estimators, dramatically more efficient than naive implementations that materialize all pairwise combinations. Performance tests use sample size n = 100,000 (for one-sample) or n = m = 100,000 (for two-sample). This specific size creates a clear performance distinction: fast O(n log n) implementations complete in under 5 seconds on modern hardware across all supported languages, while naive O(n²) implementations would be prohibitively slow, taking hours or failing due to memory exhaustion. With n = 100,000, naive approaches would need to materialize approximately 5 billion pairwise values for Center/Spread or 10 billion for Shift (tens of gigabytes at 8 bytes per value), whereas fast algorithms require only O(n) additional memory.

Performance tests serve dual purposes: correctness validation at scale and performance regression detection, ensuring implementations use the efficient algorithms and remain practical for real-world datasets with hundreds of thousands of observations. Performance test specifications are provided in the respective estimator sections above.
This framework ensures that all seven language implementations maintain strict numerical agreement across the full test suite.