The Center test suite contains 38 correctness test cases stored in the repository (24 original + 14 unsorted), plus 1 performance test that should be implemented manually (see Test Framework).
Demo examples (n=5) — from manual introduction, validating properties:
natural-4: x=(1,2,3,4), expected output: 2.5 (smallest even size with rich structure)
Negative values (n=3) — sign handling validation:
negative-3: x=(−3,−2,−1), expected output: −2
Zero values (n=1,2) — edge case testing with zeros:
zeros-1: x=(0), expected output: 0
zeros-2: x=(0,0), expected output: 0
Additive distribution (n=5,10,30) — fuzz testing with Additive(10,1):
additive-5, additive-10, additive-30: random samples generated with seed 0
Uniform distribution (n=5,100) — fuzz testing with Uniform(0,1):
uniform-5, uniform-100: random samples generated with seed 1
The random samples validate that Center performs correctly on realistic distributions at various sample sizes. The progression from small (n=5) to large (n=100) samples helps identify issues that only manifest at specific scales.
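For the small deterministic cases above, the definition can be cross-checked directly: Center is the median of all pairwise averages (x_i + x_j)/2 with i ≤ j. The following is a minimal Python sketch of that naive O(n²) form, intended only for validating small inputs, not the fast algorithm (the function name is illustrative):

```python
from statistics import median

def center_naive(x):
    """Median of all pairwise averages (x[i] + x[j]) / 2 with i <= j.
    O(n^2) time and memory; only suitable for small validation inputs."""
    n = len(x)
    averages = [(x[i] + x[j]) / 2 for i in range(n) for j in range(i, n)]
    return median(averages)

print(center_naive([1, 2, 3, 4]))  # natural-4  → 2.5
print(center_naive([-3, -2, -1]))  # negative-3 → -2
print(center_naive([0]))           # zeros-1    → 0
```

Running this against the deterministic cases listed above reproduces the stored expected outputs.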
Algorithm stress tests — edge cases for fast algorithm implementation:
These tests ensure implementations correctly sort input data before computing pairwise averages. The variety of shuffle patterns (reverse, rotation, interleaving, single element displacement) catches common sorting bugs.
Performance test — validates the fast O(n log n) algorithm:
Input: x=(1,2,3,…,100000)
Expected output: 50000.5
Time constraint: Must complete in under 5 seconds
Purpose: Ensures that the implementation uses the efficient algorithm rather than materializing all n(n+1)/2 ≈ 5 billion pairwise averages
This test case is not stored in the repository because it generates a large JSON file (approximately 1.5 MB). Each language implementation should manually implement this test with the hardcoded expected result.
Spread Tests
Spread(x) = median_{1 ≤ i < j ≤ n} |x_i − x_j|
The Spread test suite contains 30 correctness test cases stored in the repository (20 original + 10 unsorted), plus 1 performance test that should be implemented manually (see Test Framework).
Demo examples (n=5) — from manual introduction, validating properties:
natural-4: x=(1,2,3,4), expected output: 1.5 (smallest even size with rich structure)
Negative values (n=3) — sign handling validation:
negative-3: x=(−3,−2,−1), expected output: 1
Additive distribution (n=5,10,30) — Additive(10,1):
additive-5, additive-10, additive-30: random samples generated with seed 0
Uniform distribution (n=5,100) — Uniform(0,1):
uniform-5, uniform-100: random samples generated with seed 1
The natural sequence cases validate the basic pairwise difference calculation. Constant samples and n=1 are excluded because Spread requires Spread(x)>0.
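The same style of direct cross-check applies here: Spread is the median of all pairwise absolute differences with i < j. A naive Python sketch for the small cases (illustrative name, not the fast algorithm):

```python
from statistics import median

def spread_naive(x):
    """Median of all pairwise absolute differences |x[i] - x[j]| with i < j.
    O(n^2) time and memory; only suitable for small validation inputs."""
    n = len(x)
    diffs = [abs(x[i] - x[j]) for i in range(n) for j in range(i + 1, n)]
    return median(diffs)

print(spread_naive([1, 2, 3, 4]))  # natural-4  → 1.5
print(spread_naive([-3, -2, -1]))  # negative-3 → 1
```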
Algorithm stress tests — edge cases for fast algorithm implementation:
unsorted-extreme-wide-unsorted-5: x=(1000,0.001,1000000,100,1) (wide range unsorted)
These tests verify that implementations correctly sort input before computing pairwise differences. Since Spread uses absolute differences, order-dependent bugs would manifest differently than in Center.
Performance test — validates the fast O(n log n) algorithm:
Input: x=(1,2,3,…,100000)
Expected output: 29290
Time constraint: Must complete in under 5 seconds
Purpose: Ensures that the implementation uses the efficient algorithm rather than materializing all n(n−1)/2 ≈ 5 billion pairwise differences
This test case is not stored in the repository because it generates a large JSON file (approximately 1.5 MB). Each language implementation should manually implement this test with the hardcoded expected result.
RelSpread Tests
RelSpread(x) = Spread(x) / |Center(x)|
The RelSpread test suite contains 18 test cases (13 original + 5 unsorted) focusing on relative dispersion.
Demo examples (n=5) — from manual introduction, validating properties:
natural-4: x=(1,2,3,4), expected output: 0.6 (validates composite with even size)
Uniform distribution (n=5,10,20,30,100) — Uniform(0,1):
uniform-5, uniform-10, uniform-20, uniform-30, uniform-100: random samples generated with seed 0
The uniform distribution tests span multiple sample sizes to verify that RelSpread correctly normalizes dispersion. Zero and negative values are excluded because RelSpread requires strictly positive samples.
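Because RelSpread is a pure composition of the two estimators above, its demo value can be reproduced from naive Center and Spread implementations. This sketch assumes only the definitions already given (function names are illustrative):

```python
from statistics import median

def center_naive(x):
    # Median of all pairwise averages (x[i] + x[j]) / 2 with i <= j.
    n = len(x)
    return median((x[i] + x[j]) / 2 for i in range(n) for j in range(i, n))

def spread_naive(x):
    # Median of all pairwise absolute differences with i < j.
    n = len(x)
    return median(abs(x[i] - x[j]) for i in range(n) for j in range(i + 1, n))

def rel_spread_naive(x):
    # RelSpread(x) = Spread(x) / |Center(x)|; requires Center(x) != 0.
    return spread_naive(x) / abs(center_naive(x))

print(rel_spread_naive([1, 2, 3, 4]))  # 1.5 / 2.5 → 0.6 (natural-4)
```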
Composite estimator stress tests — edge cases specific to division operation:
composite-small-center: x=(0.001,0.002,0.003,0.004,0.005) (small center, tests division stability)
composite-large-spread: x=(1,100,200,300,1000) (large spread relative to center)
Since RelSpread combines both Center and Spread, these tests verify that sorting works correctly for composite estimators.
Shift Tests
Shift(x, y) = median_{1 ≤ i ≤ n, 1 ≤ j ≤ m} (x_i − y_j)
The Shift test suite contains 60 correctness test cases stored in the repository (42 original + 18 unsorted), plus 1 performance test that should be implemented manually (see Test Framework).
Demo examples (n=m=5) — from manual introduction, validating properties:
The natural sequences validate anti-symmetry (Shift(x,y)=−Shift(y,x)) and the identity property (Shift(x,x)=0). The asymmetric size combinations test the two-sample algorithm with unbalanced inputs.
Algorithm stress tests — edge cases for fast binary search algorithm:
These tests are critical for two-sample estimators because they verify that x and y are sorted independently. The variety includes cases where only one sample is unsorted, ensuring implementations don't incorrectly assume pre-sorted input or sort samples together.
Performance test — validates the fast O((m+n) log L) binary search algorithm:
Input: x=(1,2,3,…,100000), y=(1,2,3,…,100000)
Expected output: 0
Time constraint: Must complete in under 5 seconds
Purpose: Ensures that the implementation uses the efficient algorithm rather than materializing all m·n = 10 billion pairwise differences
This test case is not stored in the repository because it generates a large JSON file (approximately 1.5 MB). Each language implementation should manually implement this test with the hardcoded expected result.
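The key to avoiding the m·n materialization is a counting primitive: for a candidate value t, count how many pairwise differences fall at or below t, using only the two sorted samples. One way to sketch that primitive (the binary-search selection loop around it is omitted here):

```python
from bisect import bisect_left

def count_diffs_at_most(x_sorted, y_sorted, t):
    """Count pairs (i, j) with x_sorted[i] - y_sorted[j] <= t in
    O(n log m) time, without building the n*m differences.
    Uses the equivalence: x_i - y_j <= t  <=>  y_j >= x_i - t."""
    m = len(y_sorted)
    return sum(m - bisect_left(y_sorted, xi - t) for xi in x_sorted)

# Agreement with the naive count on a small example:
x, y = sorted([0.5, 2.0, 3.5, 7.0]), sorted([1.0, 2.0, 6.0])
naive = sum(1 for a in x for b in y if a - b <= 1.5)
print(count_diffs_at_most(x, y, 1.5) == naive)  # → True
```

Binary searching on t with this counter locates any order statistic of the pairwise differences, including the median, without enumerating them.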
Ratio Tests
Ratio(x, y) = exp(Shift(log x, log y))
The Ratio test suite contains 37 test cases (25 original + 12 unsorted), excluding zero values due to division constraints. The new definition uses geometric interpolation (via log-space), which affects expected values for even m×n cases.
Demo examples (n=m=5) — from manual introduction, validating properties:
Note: all generated values are strictly positive (no zeros); values near zero test numerical stability of log-transformation
The natural sequences verify the identity property (Ratio(x,x)=1) and validate ratio calculations with simple integer inputs. Note that implementations should handle the practical constraint of avoiding division by values near zero.
Unsorted tests — verify independent sorting for ratio calculation (12 tests):
unsorted-x-natural-{n}-{m} for (n, m) ∈ {(3,3), (4,4)}: X unsorted (reversed), Y sorted (2 tests)
unsorted-y-natural-{n}-{m} for (n, m) ∈ {(3,3), (4,4)}: X sorted, Y unsorted (reversed) (2 tests)
unsorted-both-natural-{n}-{m} for (n, m) ∈ {(3,3), (4,4)}: both unsorted (reversed) (2 tests)
unsorted-demo-unsorted-x: x=(16,1,8,2,4), y=(2,4,8,16,32) (demo-1 with X unsorted)
unsorted-demo-unsorted-y: x=(1,2,4,8,16), y=(32,2,16,4,8) (demo-1 with Y unsorted)
unsorted-demo-both-unsorted: x=(8,1,16,4,2), y=(16,32,2,8,4) (demo-1 both unsorted)
unsorted-identity-unsorted: x=(4,1,8,2,16), y=(16,1,8,4,2) (identity property, both unsorted)
unsorted-asymmetric-unsorted-2-3: x=(2,1), y=(3,1,2) (asymmetric, both unsorted)
unsorted-power-unsorted-5: x=(16,2,8,1,4), y=(32,4,16,2,8) (powers of 2 unsorted)
AvgSpread Tests
AvgSpread(x, y) = (n·Spread(x) + m·Spread(y)) / (n + m)
The AvgSpread test suite contains 36 test cases (24 original + 12 unsorted). Since AvgSpread computes Spread(x) and Spread(y) independently, unsorted tests are critical to verify that both samples are sorted independently before computing their spreads.
Demo examples (n=m=5) — from manual introduction, validating properties:
Additive distribution ((n, m) ∈ {5, 10, 30} × {5, 10, 30}) — 9 combinations with Additive(10,1):
Tests pooled dispersion across different sample size combinations
Random generation: x uses seed 0, y uses seed 1
Uniform distribution ((n, m) ∈ {5, 100} × {5, 100}) — 4 combinations with Uniform(0,1):
Validates correct weighting when sample sizes differ substantially
Random generation: x uses seed 2, y uses seed 3
The asymmetric size combinations are particularly important for AvgSpread because the estimator must correctly weight each sample's contribution by its size.
Composite estimator stress tests — edge cases for weighted averaging:
composite-asymmetric-weights: x=(1,2), y=(3,4,5,6,7,8,9,10) (2 vs 8, tests weighting formula)
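The weighting can be verified directly on this stress case with a naive sketch that restates the formula (n·Spread(x) + m·Spread(y)) / (n + m); the function names are illustrative:

```python
from statistics import median

def spread_naive(x):
    # Median of all pairwise absolute differences with i < j (naive form).
    n = len(x)
    return median(abs(x[i] - x[j]) for i in range(n) for j in range(i + 1, n))

def avg_spread_naive(x, y):
    # Size-weighted average: (n * Spread(x) + m * Spread(y)) / (n + m).
    n, m = len(x), len(y)
    return (n * spread_naive(x) + m * spread_naive(y)) / (n + m)

x, y = [1, 2], [3, 4, 5, 6, 7, 8, 9, 10]
print(avg_spread_naive(x, y))  # (2*1 + 8*3) / 10 → 2.6
```

Here Spread(x) = 1 (the single difference) and Spread(y) = 3 (the median of the 28 differences of eight consecutive integers), so the 2-vs-8 weighting yields 2.6 under the formula above.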
Unsorted tests — critical for verifying independent sorting (12 tests):
unsorted-x-natural-{n}-{m} for (n, m) ∈ {(3,3), (4,4)}: X unsorted (reversed), Y sorted (2 tests)
unsorted-y-natural-{n}-{m} for (n, m) ∈ {(3,3), (4,4)}: X sorted, Y unsorted (reversed) (2 tests)
unsorted-both-natural-{n}-{m} for (n, m) ∈ {(3,3), (4,4)}: both unsorted (reversed) (2 tests)
unsorted-demo-unsorted-x: x=(12,0,6,3,9), y=(0,2,4,6,8) (demo-1 with X unsorted)
unsorted-demo-unsorted-y: x=(0,3,6,9,12), y=(8,0,4,2,6) (demo-1 with Y unsorted)
unsorted-demo-both-unsorted: x=(9,0,12,3,6), y=(6,0,8,2,4) (demo-1 both unsorted)
These tests verify that implementations compute Spread(x) and Spread(y) with properly sorted samples.
Disparity Tests
Disparity(x,y)=Shift(x,y)/AvgSpread(x,y)
The Disparity test suite contains 28 test cases (16 original + 12 unsorted). Since Disparity combines Shift and AvgSpread, unsorted tests verify both components handle sorting correctly.
Demo examples (n=m=5) — from manual introduction, validating properties:
The smaller test set for Disparity reflects implementation confidence. Since Disparity combines Shift and AvgSpread, correct implementation of those components ensures Disparity correctness. The test cases validate the division operation and confirm scale-free properties.
unsorted-anti-symmetry-unsorted: x=(8,0,4,2,6), y=(12,0,6,3,9) (demo-4 reversed and unsorted)
As a composite estimator, Disparity tests both the numerator (Shift) and denominator (AvgSpread). Unsorted variants verify end-to-end correctness including invariance properties.
PairwiseMargin Tests
PairwiseMargin(n,m,misrate)
The PairwiseMargin test suite contains 178 test cases (4 demo + 4 natural + 10 edge + 12 small grid + 148 large grid). The domain constraint misrate ≥ 2/C(n+m, n) is enforced; inputs violating it return a domain error. Combinations where the requested misrate falls below the minimum achievable misrate are excluded from the grid.
Demo examples (n=m=30) — from manual introduction:
These edge cases validate correct handling of boundary conditions, the symmetry property PairwiseMargin(n,m,misrate)=PairwiseMargin(m,n,misrate), and extreme asymmetry in sample sizes.
Comprehensive grid — systematic coverage for thorough validation:
The comprehensive grid validates both symmetric (n=m) and asymmetric sample size combinations across six orders of magnitude in misrate, ensuring robust coverage of the parameter space.
SignedRankMargin Tests
SignedRankMargin(n,misrate)
The SignedRankMargin test suite contains 39 correctness test cases (4 demo + 6 boundary + 7 exact + 20 medium + 2 error).
Demo examples (n=30) — from manual introduction:
demo-1: n=30, misrate=10^-6, expected output: 46
demo-2: n=30, misrate=10^-5, expected output: 74
demo-3: n=30, misrate=10^-4, expected output: 112
demo-4: n=30, misrate=10^-3, expected output: 158
These demo cases match the reference values used throughout the manual to illustrate CenterBounds construction.
The medium sample tests validate the transition region between exact computation (n≤63) and approximate computation, ensuring consistent results across sample sizes and misrate values.
error-n0: n=0, misrate=0.05 (invalid: n must be positive)
The second error case verifies that implementations correctly reject n=1 with misrate=0.5 as invalid input, since the minimum achievable misrate for n=1 is 2^0 = 1.0.
The ShiftBounds test suite contains 61 correctness test cases (3 demo + 9 natural + 6 property + 10 edge + 9 additive + 4 uniform + 5 misrate + 15 unsorted). Since ShiftBounds returns bounds rather than a point estimate, tests validate that the bounds contain Shift(x,y) and satisfy equivariance properties. Each test case output is a JSON object with lower and upper fields representing the interval bounds. The domain constraint misrate ≥ 2/C(n+m, n) is enforced; inputs violating it return a domain error.
Demo examples (n=m=5) — from manual introduction, validating basic bounds:
These cases illustrate how tighter misrates produce wider bounds and validate the identity property where identical samples yield bounds containing zero.
These tests use identical samples with varying misrates to validate the monotonicity property: smaller misrates (higher confidence) produce wider bounds. The sequence demonstrates how bound width increases as misrate decreases, helping implementations verify correct margin calculation.
Unsorted tests — verify independent sorting of x and y (15 tests):
unsorted-x-natural-5-5: x=(5,3,1,4,2), y=(1,2,3,4,5), misrate=10^-2 (X reversed, Y sorted)
unsorted-y-natural-5-5: x=(1,2,3,4,5), y=(5,3,1,4,2), misrate=10^-2 (X sorted, Y reversed)
unsorted-asymmetric-5-10: x=(2,5,1,3,4), y=(10,5,2,8,4,1,9,3,7,6), misrate=10^-2 (asymmetric sizes, both unsorted)
unsorted-duplicates: x=(3,3,3,3,3), y=(5,5,5,5,5), misrate=10^-2 (all duplicates, any order)
unsorted-mixed-duplicates-x: x=(2,1,3,2,1), y=(1,1,2,2,3), misrate=10^-2 (X has unsorted duplicates)
unsorted-mixed-duplicates-y: x=(1,1,2,2,3), y=(3,2,1,3,2), misrate=10^-2 (Y has unsorted duplicates)
These unsorted tests are critical because ShiftBounds computes bounds from pairwise differences, requiring both samples to be sorted independently. The variety ensures implementations don't incorrectly assume pre-sorted input or sort samples together. Each test must produce identical output to its sorted counterpart, validating that the implementation correctly handles the sorting step.
No performance test — ShiftBounds uses the FastShift algorithm internally, which is already validated by the Shift performance test. Since bounds computation involves only two quantile calculations from the pairwise differences (at positions determined by PairwiseMargin), the performance characteristics are equivalent to computing two Shift estimates, which completes efficiently for large samples.
The RatioBounds test suite contains 61 correctness test cases (3 demo + 9 natural + 6 property + 10 edge + 9 multiplic + 4 uniform + 5 misrate + 15 unsorted). Since RatioBounds returns bounds rather than a point estimate, tests validate that the bounds contain Ratio(x,y) and satisfy equivariance properties. Each test case output is a JSON object with lower and upper fields representing the interval bounds. All samples must contain strictly positive values. The domain constraint misrate ≥ 2/C(n+m, n) is enforced; inputs violating it return a domain error.
These cases illustrate how tighter misrates produce wider bounds and validate the identity property where identical samples yield bounds containing one.
These tests use identical samples with varying misrates to validate the monotonicity property: smaller misrates (higher confidence) produce wider bounds. The sequence demonstrates how bound width increases as misrate decreases, helping implementations verify correct margin calculation.
Unsorted tests — verify independent sorting of x and y (15 tests):
unsorted-x-natural-5-5: x=(5,3,1,4,2), y=(1,2,3,4,5), misrate=10^-2 (X reversed, Y sorted)
unsorted-y-natural-5-5: x=(1,2,3,4,5), y=(5,3,1,4,2), misrate=10^-2 (X sorted, Y reversed)
unsorted-demo-unsorted-x: x=(5,1,4,2,3), y=(2,3,4,5,6), misrate=0.05 (demo-1 X unsorted)
unsorted-demo-unsorted-y: x=(1,2,3,4,5), y=(6,2,5,3,4), misrate=0.05 (demo-1 Y unsorted)
unsorted-demo-both-unsorted: x=(4,1,5,2,3), y=(5,2,6,3,4), misrate=0.05 (demo-1 both unsorted)
unsorted-identity-unsorted: x=(4,1,5,2,3), y=(5,1,4,3,2), misrate=10^-2 (identity property, both unsorted)
unsorted-scale-unsorted: x=(10,30,20), y=(15,5,10), misrate=0.5 (scale relationship, both unsorted)
unsorted-asymmetric-5-10: x=(2,5,1,3,4), y=(10,5,2,8,4,1,9,3,7,6), misrate=10^-2 (asymmetric sizes, both unsorted)
unsorted-duplicates: x=(3,3,3,3,3), y=(5,5,5,5,5), misrate=10^-2 (all duplicates, any order)
unsorted-mixed-duplicates-x: x=(2,1,3,2,1), y=(1,1,2,2,3), misrate=10^-2 (X has unsorted duplicates)
unsorted-mixed-duplicates-y: x=(1,1,2,2,3), y=(3,2,1,3,2), misrate=10^-2 (Y has unsorted duplicates)
These unsorted tests are critical because RatioBounds computes bounds from pairwise ratios, requiring both samples to be sorted independently. The variety ensures implementations don't incorrectly assume pre-sorted input or sort samples together. Each test must produce identical output to its sorted counterpart, validating that the implementation correctly handles the sorting step.
No performance test — RatioBounds uses the FastRatio algorithm internally, which delegates to FastShift in log-space. Since bounds computation involves only two quantile calculations from the pairwise differences (at positions determined by PairwiseMargin), the performance characteristics are equivalent to computing two Ratio estimates, which completes efficiently for large samples.
CenterBounds Tests
CenterBounds(x, misrate) = [w_(k_left), w_(k_right)]
where w is the sorted sequence of pairwise averages (x_i + x_j)/2 for i ≤ j, k_left = ⌊SignedRankMargin/2⌋ + 1, k_right = N − ⌊SignedRankMargin/2⌋, and N = n(n+1)/2.
The CenterBounds test suite contains 43 test cases (3 demo + 4 natural + 5 property + 7 edge + 4 symmetric + 4 asymmetric + 2 additive + 2 uniform + 4 misrate + 6 unsorted + 2 error cases). Since CenterBounds returns bounds rather than a point estimate, tests validate that bounds contain Center(x) and satisfy equivariance properties. Each test case output is a JSON object with lower and upper fields representing the interval bounds.
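The index arithmetic can be illustrated with a small sketch. The margin value below (4) is a hypothetical stand-in for a real SignedRankMargin(n, misrate) result, chosen only to exercise the formula:

```python
def center_bounds_from_margin(x, margin):
    """Select interval endpoints from the sorted pairwise averages.
    `margin` stands in for SignedRankMargin(n, misrate); here it is
    supplied directly as a hypothetical value for illustration."""
    n = len(x)
    w = sorted((x[i] + x[j]) / 2 for i in range(n) for j in range(i, n))
    N = n * (n + 1) // 2           # number of pairwise averages
    k_left = margin // 2 + 1       # 1-based index of the lower endpoint
    k_right = N - margin // 2      # 1-based index of the upper endpoint
    return (w[k_left - 1], w[k_right - 1])

print(center_bounds_from_margin([1, 2, 3, 4], margin=4))  # → (2.0, 3.0)
```

With margin = 0 the interval spans the full range of pairwise averages; larger margins trim symmetrically from both ends.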
Demo examples — from manual introduction, validating basic bounds:
The reference test framework consists of three components:
Test generation — The C# implementation defines test inputs programmatically using builder patterns. For deterministic cases, inputs are explicitly specified. For random cases, the framework uses controlled seeds with System.Random to ensure reproducibility across all platforms.
The random generation mechanism works as follows:
Each test suite builder maintains a seed counter initialized to zero.
For one-sample estimators, each distribution type receives the next available seed. The same random generator produces all samples for all sizes within that distribution.
For two-sample estimators, each pair of distributions receives two consecutive seeds: one for the x sample generator and one for the y sample generator.
The seed counter increments with each random generator creation, ensuring deterministic test data generation.
For Additive distributions, random values are generated using the Box–Muller transform, which converts pairs of uniform random values into normally distributed values. The transform applies the formula:
X = μ + σ·√(−2 ln U1)·sin(2π U2)
where U1,U2 are uniform random values from Uniform(0,1), μ is the mean, and σ is the standard deviation.
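A Python sketch of the Additive generation follows. The reference framework uses C#'s System.Random, so the exact random streams differ; this only illustrates the transform itself, and the function name is illustrative:

```python
import math
import random

def additive_sample(rng, n, mu, sigma):
    """Box-Muller: X = mu + sigma * sqrt(-2 ln U1) * sin(2 pi U2)."""
    out = []
    for _ in range(n):
        u1 = 1.0 - rng.random()  # shift to (0, 1] so log(u1) is defined
        u2 = rng.random()
        out.append(mu + sigma * math.sqrt(-2.0 * math.log(u1))
                      * math.sin(2.0 * math.pi * u2))
    return out

rng = random.Random(0)  # seeded for reproducibility
sample = additive_sample(rng, 10_000, mu=10.0, sigma=1.0)
```

With 10,000 draws the sample mean and standard deviation land close to the Additive(10, 1) parameters, which is a quick sanity check on the transform.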
For Uniform distributions, random values are generated directly using the quantile function:
X=min+U⋅(max−min)
where U is a uniform random value from Uniform(0,1).
The framework executes the reference implementation on all generated inputs and serializes input-output pairs to JSON format.
Test validation — Each language implementation loads the JSON test cases and executes them against its local estimator implementation. Assertions verify that outputs match expected values within a given numerical tolerance (typically 10^-10 for relative error).
Test data format — Each test case is a JSON file containing input and output fields. For one-sample estimators, the input contains array x and optional parameters. For two-sample estimators, input contains arrays x and y. For bounds estimators (ShiftBounds, RatioBounds), input additionally contains misrate. Output is a single numeric value for point estimators, or an object with lower and upper fields for bounds estimators.
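The validation step can be sketched as follows. The field names (input, output, x, y, misrate, lower, upper) follow the format described above; center_naive is just a stand-in local estimator, and the helper name is illustrative:

```python
import json
import math
from statistics import median

def center_naive(x):
    # Stand-in local estimator: median of pairwise averages with i <= j.
    n = len(x)
    return median((x[i] + x[j]) / 2 for i in range(n) for j in range(i, n))

def check_case(case_json, estimator, rel_tol=1e-10):
    """Parse one test-case JSON document and compare the local estimator's
    output to the stored expected value within a relative tolerance."""
    case = json.loads(case_json)
    inp, expected = case["input"], case["output"]
    args = [inp["x"]]
    if "y" in inp:
        args.append(inp["y"])
    if "misrate" in inp:
        args.append(inp["misrate"])
    actual = estimator(*args)
    if isinstance(expected, dict):  # bounds estimators: lower/upper object
        return all(math.isclose(actual[k], expected[k], rel_tol=rel_tol)
                   for k in ("lower", "upper"))
    # abs_tol guards comparisons where the expected value is exactly zero
    return math.isclose(actual, expected, rel_tol=rel_tol, abs_tol=1e-15)

case = '{"input": {"x": [1, 2, 3, 4]}, "output": 2.5}'
print(check_case(case, center_naive))  # → True
```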
Performance testing — The toolkit provides O(n log n) fast algorithms for the Center, Spread, and Shift estimators, dramatically more efficient than naive implementations that materialize all pairwise combinations. Performance tests use sample size n = 100,000 (for one-sample) or n = m = 100,000 (for two-sample). This specific size creates a clear performance distinction: fast implementations (O(n log n) or O((m+n) log L)) complete in under 5 seconds on modern hardware across all supported languages, while naive implementations (O(n² log n) or O(mn log(mn))) would be prohibitively slow (taking hours or failing due to memory exhaustion). With n = 100,000, naive approaches would need to materialize approximately 5 billion pairwise values for Center/Spread or 10 billion for Shift, whereas the fast algorithms require only O(n) additional memory. Performance tests serve dual purposes: correctness validation at scale and performance regression detection, ensuring implementations use the efficient algorithms and remain practical for real-world datasets with hundreds of thousands of observations. Performance test specifications are provided in the respective estimator sections above.
This framework ensures that all seven language implementations maintain strict numerical agreement across the full test suite.