PairwiseMargin
Exclusion count for dominance-based bounds.
- Purpose — determines how many extreme pairwise differences to exclude when constructing bounds
- Based on — distribution of pairwise comparison counts (dominance) under random sampling
- Returns — total margin split evenly between lower and upper tails
- Used by — bounds construction, to select appropriate order statistics
- Complexity — exact for small samples, approximated for large
- Domain — n ≥ 1, m ≥ 1, misrate in [misrate_min, 1], where misrate_min is the minimum achievable misrate
- Unit — count
- Note — assumes weak continuity (ties from measurement resolution are tolerated)
Properties
- Symmetry — PairwiseMargin(n, m, misrate) = PairwiseMargin(m, n, misrate)
- Bounds — 0 ≤ PairwiseMargin(n, m, misrate) ≤ n·m
Example
PairwiseMargin(30, 30, 1e-6) = 276
PairwiseMargin(30, 30, 1e-4) = 390
PairwiseMargin(30, 30, 1e-3) = 464
This is a supporting function used internally by the bounds construction, so most users do not need to call it directly. It calculates how many extreme pairwise differences should be excluded when constructing bounds, based on the sample sizes and the desired error rate. A lower misrate (higher confidence) results in a smaller margin, which produces wider bounds. The function automatically chooses between exact computation for small samples and a fast approximation for large samples.
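For orientation, a minimal usage sketch based on the reference implementation listed later in this section. The class is declared internal, so this assumes in-assembly access; typical users reach these values through the higher-level bound estimators.

```csharp
using System;
using Pragmastat.Functions;

// Sketch: reproduces the example values above via the reference implementation
// listed later in this section (requires internal access to the class).
int m1 = PairwiseMargin.Instance.Calc(30, 30, 1e-6); // 276
int m2 = PairwiseMargin.Instance.Calc(30, 30, 1e-4); // 390
int m3 = PairwiseMargin.Instance.Calc(30, 30, 1e-3); // 464
Console.WriteLine($"{m1}, {m2}, {m3}");
```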
Algorithm
The function determines how many extreme pairwise differences to exclude when constructing bounds around the shift estimate. Given samples $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_m)$, the estimator computes all $n \cdot m$ pairwise differences $x_i - y_j$ and sorts them. The bounds select specific order statistics from this sorted sequence. The challenge lies in determining which order statistics produce bounds that contain the true shift with probability $1 - \text{misrate}$.
Random sampling creates natural variation in pairwise differences. Even when populations have identical distributions, sampling variation produces both positive and negative differences. The margin function quantifies this sampling variability: it specifies how many extreme pairwise differences could occur by chance with probability misrate. For symmetric bounds, this margin splits evenly between the tails, excluding margin/2 differences from the lower tail and margin/2 from the upper tail.
Computing the margin requires understanding the distribution of pairwise comparisons. Each pairwise difference corresponds to a comparison: $x_i - y_j > 0$ exactly when $x_i > y_j$. This connection motivates the dominance function:

$$\mathrm{Dominance}(x, y) = \#\{(i, j) : x_i > y_j\}$$
The dominance function counts how many pairwise comparisons favor $x$ over $y$. Both the shift estimator and the dominance function operate on the same collection of pairwise differences. The estimator examines difference values, returning their median as a location estimate. The dominance function examines difference signs, counting how many comparisons produce positive differences. While the estimator provides the estimate itself, the dominance distribution determines which order statistics form reliable bounds around that estimate.
When populations have equivalent distributions, the dominance count concentrates near $n \cdot m / 2$ by symmetry. The distribution of this count across all possible sample orderings determines reliable bounds. If the count deviates from $n \cdot m / 2$ by at least margin/2 with probability at most misrate, then the interval that excludes the margin most extreme pairwise differences contains zero with probability at least $1 - \text{misrate}$. Translation invariance extends this relationship to arbitrary shifts: the margin computed from the comparison distribution applies regardless of the true shift value.
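The following sketch shows how a computed margin might translate into bounds over the sorted pairwise differences. The exact index convention of the toolkit's bound estimator is not restated in this section, so treat the offsets below as illustrative only.

```csharp
using System.Collections.Generic;

static class BoundsSketch
{
    // Illustrative only: the toolkit's bound estimator may use a slightly
    // different index convention; this shows the general idea of excluding
    // margin/2 extreme pairwise differences from each tail.
    public static (double Lower, double Upper) FromMargin(double[] x, double[] y, int margin)
    {
        var diffs = new List<double>(x.Length * y.Length);
        foreach (double xi in x)
            foreach (double yj in y)
                diffs.Add(xi - yj); // all n*m pairwise differences

        diffs.Sort();
        int k = margin / 2;         // differences excluded per tail
        return (diffs[k], diffs[^(k + 1)]);
    }
}
```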
Two computational approaches provide the distribution of the dominance count: exact calculation for small samples and approximation for large samples.
Exact method
Small sample sizes allow exact computation without approximation. The exact approach exploits a fundamental symmetry: under equivalent populations, all orderings of the combined measurements occur with equal probability. This symmetry enables direct calculation of how many orderings produce each comparison count.
Direct computation faces a combinatorial challenge. Enumerating all orderings to count comparison outcomes requires substantial resources. For samples beyond a few dozen measurements, naive implementation becomes impractical.
Löffler's recurrence relation (Löffler 1982) resolves this through algebraic structure. The recurrence exploits cycle properties in the comparison distribution, reducing memory requirements while maintaining exact calculation. The algorithm builds cumulative probabilities sequentially until reaching the threshold corresponding to the desired error rate. This approach extends practical exact computation to combined sample sizes of several hundred.
Define $c_{n,m}(u)$ as the number of orderings producing exactly $u$ comparisons favoring $x$. The probability mass function becomes:

$$\Pr[\mathrm{Dominance}(x, y) = u] = \frac{c_{n,m}(u)}{\binom{n+m}{n}}$$
A direct recurrence follows from considering the largest measurement. The rightmost element comes from either $x$ (contributing $m$ comparisons) or $y$ (contributing zero):

$$c_{n,m}(u) = c_{n-1,m}(u - m) + c_{n,m-1}(u)$$

with base cases $c_{0,m}(0) = c_{n,0}(0) = 1$ and $c_{n,m}(u) = 0$ whenever $u < 0$.
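To make the recurrence concrete, here is a memoized sketch with hypothetical helper names. It is deliberately memory-hungry and is not the toolkit's method, which uses the Löffler reformulation described next.

```csharp
using System.Collections.Generic;

static class ExactCountsSketch
{
    private static readonly Dictionary<(int n, int m, int u), double> Cache = new();

    // c(n, m, u): number of orderings of the combined sample in which exactly u
    // of the n*m pairwise comparisons favor x. The largest value comes either
    // from x (adding m comparisons) or from y (adding none).
    public static double Count(int n, int m, int u)
    {
        if (u < 0 || u > n * m) return 0;
        if (n == 0 || m == 0) return u == 0 ? 1 : 0;
        if (Cache.TryGetValue((n, m, u), out double cached)) return cached;

        double result = Count(n - 1, m, u - m) + Count(n, m - 1, u);
        Cache[(n, m, u)] = result;
        return result;
    }

    // P(Dominance = u) under equivalent populations: counts divided by C(n + m, n).
    public static double Pmf(int n, int m, int u)
    {
        double total = 1;
        for (int i = 1; i <= n; i++)
            total *= (double)(m + i) / i; // C(n + m, n) in floating point
        return Count(n, m, u) / total;
    }
}
```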
Direct implementation of this recurrence requires time and memory that grow quadratically with the sample sizes. An alternative recurrence (Löffler 1982) exploits cycle structure:

$$u \cdot c_{n,m}(u) = \sum_{i=0}^{u-1} c_{n,m}(i)\, \sigma(u - i)$$

where $\sigma(t)$ captures structural properties through divisors:

$$\sigma(t) = \sum_{\substack{d \mid t \\ 1 \le d \le n}} d \;-\; \sum_{\substack{d \mid t \\ m < d \le m+n}} d$$

This reduces memory to linear scale and enables efficient computation of the cumulative distribution term by term.
The algorithm computes cumulative probabilities sequentially until the cumulative probability reaches the threshold misrate/2. By symmetry, the lower and upper tails require the same threshold $u^*$, so the total margin equals $2 u^*$.
The sequential computation proceeds incrementally. Starting from $u = 0$ with base probability $1 / \binom{n+m}{n}$, the algorithm computes the probability of $u = 1$, then $u = 2$, and so on, accumulating the cumulative distribution function with each step. The loop terminates as soon as the accumulated probability reaches misrate/2, returning the threshold value without computing further probabilities.
This sequential approach performs particularly well for small misrates. The threshold typically remains small even with large sample sizes, requiring only a few iterations whether $n$ and $m$ equal 50 or 200. The algorithm computes only the extreme tail probabilities needed to reach the threshold, never touching the vast majority of probability mass concentrated near $n \cdot m / 2$. This efficiency advantage grows as misrates decrease: stricter bounds require fewer computed values, making exact calculation particularly attractive for high-confidence applications.
Approximate method
Large samples make exact computation impractical. The dominance count concentrates near $n m / 2$ with variance $n m (n + m + 1) / 12$. A basic normal approximation suffices asymptotically:

$$\mathrm{Dominance}(x, y) \;\approx\; \mathcal{N}\!\left(\frac{n m}{2},\; \frac{n m (n + m + 1)}{12}\right)$$
This approximation underestimates tail probabilities for moderate sample sizes. The normal approximation provides a baseline but fails to capture the true distribution shape in the tails, producing mis-calibrated probabilities that become problematic for small error rates.
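For comparison with the Edgeworth-corrected CDF implemented later in this section, the plain normal baseline can be sketched as follows. The class name is hypothetical, and `Erf` uses a textbook rational approximation rather than the toolkit's `AcmAlgorithm209.Gauss` routine.

```csharp
using System;

static class NormalApproxSketch
{
    // Baseline normal approximation to the dominance-count CDF, with the same
    // -0.5 continuity correction as the Edgeworth version shown below.
    public static double Cdf(int n, int m, long u)
    {
        double nm = (double)n * m;
        double mean = nm / 2.0;
        double sd = Math.Sqrt(nm * (n + m + 1) / 12.0);
        double z = (u - mean - 0.5) / sd;
        return 0.5 * (1.0 + Erf(z / Math.Sqrt(2.0)));
    }

    // Textbook rational approximation to erf (Abramowitz & Stegun 7.1.26).
    private static double Erf(double x)
    {
        double t = 1.0 / (1.0 + 0.3275911 * Math.Abs(x));
        double poly = ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
                        - 0.284496736) * t + 0.254829592) * t;
        double y = 1.0 - poly * Math.Exp(-x * x);
        return x >= 0 ? y : -y;
    }
}
```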
The Edgeworth expansion refines this approximation through moment-based corrections (Fix & Hodges 1955). The expansion starts with the normal cumulative distribution as a baseline, then adds correction terms that account for the distribution's asymmetry (skewness) and tail weight (kurtosis). These corrections use Hermite polynomials to adjust the baseline curve where the normal approximation deviates most from the true distribution. The first few correction terms typically achieve the practical balance between accuracy and computational cost, substantially improving tail probability estimates compared to the basic approximation.
The standardized comparison count (with continuity correction):

$$z = \frac{u - n m / 2 - 1/2}{\sqrt{n m (n + m + 1) / 12}}$$

produces the approximated cumulative distribution:

$$F(u) \approx \Phi(z) + e_3 f_3(z) + e_5 f_5(z) + e_7 f_7(z)$$

where $\Phi$ denotes the standard normal CDF.
The correction coefficients depend on standardized moments:

$$e_3 = \frac{1}{4!}\left(\frac{\mu_4}{\mu_2^2} - 3\right), \qquad
e_5 = \frac{1}{6!}\left(\frac{\mu_6}{\mu_2^3} - 15\frac{\mu_4}{\mu_2^2} + 30\right), \qquad
e_7 = \frac{35}{8!}\left(\frac{\mu_4}{\mu_2^2} - 3\right)^2$$

The central moments $\mu_2$, $\mu_4$, $\mu_6$ are computed from the sample sizes:

$$\mu_2 = \frac{n m (n + m + 1)}{12}$$

$$\mu_4 = \frac{n m (n + m + 1)}{240}\left(5 m n (m + n) - 2(m^2 + n^2) + 3 m n - 2(n + m)\right)$$

$$\mu_6 = \frac{n m (n + m + 1)}{4032}\Big(35 m^2 n^2 (m^2 + n^2) + 70 m^3 n^3 - 42 m n (m^3 + n^3) - 14 m^2 n^2 (n + m) + 16 (n^4 + m^4) - 52 n m (n^2 + m^2) - 43 n^2 m^2 + 32 (m^3 + n^3) + 14 m n (n + m) + 8 (n^2 + m^2) + 16 n m - 8 (n + m)\Big)$$

The correction terms use Hermite polynomials:

$$f_3(z) = -\varphi(z)\,(z^3 - 3 z), \qquad
f_5(z) = -\varphi(z)\,(z^5 - 10 z^3 + 15 z), \qquad
f_7(z) = -\varphi(z)\,(z^7 - 21 z^5 + 105 z^3 - 105 z)$$

where $\varphi$ denotes the standard normal density.
Binary search locates the threshold value efficiently. The algorithm maintains a search interval $[a, b]$ initialized to $[0, n \cdot m]$. Each iteration computes the midpoint $c$ and evaluates the Edgeworth CDF at $c$. If the CDF value falls below the target tail probability (misrate/2), the threshold lies above $c$ and the search continues in $[c, b]$. Otherwise, the threshold lies at or below $c$ and the search continues in $[a, c]$. The loop terminates when $a$ and $b$ become adjacent, requiring $O(\log(n \cdot m))$ CDF evaluations.
This binary search exhibits uniform performance across misrate values. Whether the misrate is large or tiny, the algorithm performs the same number of iterations, determined solely by the sample sizes. Each CDF evaluation costs constant time regardless of the threshold location, making the approximate method particularly efficient for large samples where exact computation becomes impractical. The logarithmic scaling ensures that doubling the sample size adds only one additional iteration, enabling practical computation for samples in the thousands or tens of thousands.
The toolkit selects between exact and approximate computation based on combined sample size: the exact method for $n + m \leq 400$, the approximate method for $n + m > 400$. The exact method guarantees correctness, but its memory and time grow quadratically with sample size; for $n = m = 200$, this requires 40,000 memory locations. The approximate method achieves 1% accuracy with constant-time evaluations. For large samples, the approximate method completes in milliseconds versus minutes for exact computation.
Both methods handle discrete data. Repeated measurements produce tied pairwise differences, creating plateaus in the sorted sequence. The exact method counts orderings without assuming continuity. The approximate method's moment-based corrections capture the actual distribution shape regardless of discreteness.
Minimum achievable misrate
The misrate parameter controls how many extreme pairwise differences the bounds exclude. A lower misrate means higher confidence: the bounds exclude fewer extreme values and therefore become wider. However, sample size limits how small the misrate can meaningfully become.
Consider the most extreme configuration: all measurements from $x$ exceed all measurements from $y$, giving $\mathrm{Dominance}(x, y) = n \cdot m$. Under equivalent populations, this arrangement occurs purely by chance. The probability equals the chance of having all elements from $x$ occupy the top $n$ positions among the $n + m$ total measurements:

$$P_{\text{extreme}} = \frac{1}{\binom{n+m}{n}}$$
This represents the minimum probability of the most extreme ordering under random sampling. Because the margin splits evenly between two tails, the smallest supportable misrate is twice this probability: $\text{misrate}_{\min} = 2 / \binom{n+m}{n}$. Setting the misrate below this value makes bounds construction problematic. The exact distribution of the dominance count cannot support misrates smaller than the probability of its most extreme realization. Attempting to construct bounds with a smaller misrate forces the algorithm to exclude zero pairwise differences from the tails, making the margin zero. The resulting bounds span all pairwise differences, returning the full range from the smallest to the largest pairwise difference regardless of the desired confidence level. These bounds convey no useful information beyond the range of observed pairwise differences.
For small samples, $\text{misrate}_{\min}$ can exceed commonly used values. With $n = m = 3$, the minimum misrate equals 0.1, making the typical choice of 0.05 impossible. With $n = m = 2$, the minimum rises to 0.333, exceeding even generous choices such as 0.25.
The table below shows $\text{misrate}_{\min}$ for small sample sizes ($n$ in rows, $m$ in columns):
| n \ m | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1.000000 | 0.666667 | 0.500000 | 0.400000 | 0.333333 | 0.285714 | 0.250000 | 0.222222 | 0.200000 | 0.181818 |
| 2 | 0.666667 | 0.333333 | 0.200000 | 0.133333 | 0.095238 | 0.071429 | 0.055556 | 0.044444 | 0.036364 | 0.030303 |
| 3 | 0.500000 | 0.200000 | 0.100000 | 0.057143 | 0.035714 | 0.023810 | 0.016667 | 0.012121 | 0.009091 | 0.006993 |
| 4 | 0.400000 | 0.133333 | 0.057143 | 0.028571 | 0.015873 | 0.009524 | 0.006061 | 0.004040 | 0.002797 | 0.001998 |
| 5 | 0.333333 | 0.095238 | 0.035714 | 0.015873 | 0.007937 | 0.004329 | 0.002525 | 0.001554 | 0.000999 | 0.000666 |
| 6 | 0.285714 | 0.071429 | 0.023810 | 0.009524 | 0.004329 | 0.002165 | 0.001166 | 0.000666 | 0.000400 | 0.000250 |
| 7 | 0.250000 | 0.055556 | 0.016667 | 0.006061 | 0.002525 | 0.001166 | 0.000583 | 0.000311 | 0.000175 | 0.000103 |
| 8 | 0.222222 | 0.044444 | 0.012121 | 0.004040 | 0.001554 | 0.000666 | 0.000311 | 0.000155 | 0.000082 | 0.000046 |
| 9 | 0.200000 | 0.036364 | 0.009091 | 0.002797 | 0.000999 | 0.000400 | 0.000175 | 0.000082 | 0.000041 | 0.000022 |
| 10 | 0.181818 | 0.030303 | 0.006993 | 0.001998 | 0.000666 | 0.000250 | 0.000103 | 0.000046 | 0.000022 | 0.000011 |
For meaningful bounds construction, choose a misrate above the minimum achievable value $\text{misrate}_{\min}(n, m)$. This ensures the margin function excludes at least some extreme pairwise differences, producing bounds narrower than the full range. When working with small samples, verify that the desired misrate exceeds the minimum for the given sample sizes. With moderate sample sizes ($n = m \geq 5$), $\text{misrate}_{\min}$ drops below 0.01, making standard choices like 0.05 feasible.
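The table entries are consistent with the two-sided formula $\text{misrate}_{\min} = 2 / \binom{n+m}{n}$. A short sketch to reproduce them; the helper below is hypothetical and distinct from the toolkit's `MinAchievableMisrate.TwoSample`:

```csharp
static class MinMisrateSketch
{
    // Two-sided minimum achievable misrate, assuming misrate_min = 2 / C(n + m, n),
    // which reproduces the table above.
    public static double TwoSampleMin(int n, int m)
    {
        double binom = 1;
        for (int i = 1; i <= n; i++)
            binom *= (double)(m + i) / i; // C(n + m, n) in floating point
        return 2.0 / binom;
    }
}

// Example: TwoSampleMin(5, 5) = 2 / 252 ≈ 0.007937, matching the table entry.
```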
using System;
using System.Collections.Generic;
using static System.Math;
using JetBrains.Annotations;
using Pragmastat.Exceptions;
using Pragmastat.Internal;
namespace Pragmastat.Functions;
/// <summary>
/// PairwiseMargin function
/// </summary>
/// <param name="threshold">The maximum value for n+m, after which implementation switches from exact to approx</param>
internal class PairwiseMargin(int threshold = PairwiseMargin.MaxExactSize)
{
public static readonly PairwiseMargin Instance = new();
private const int MaxExactSize = 400;
[PublicAPI]
public int Calc(int n, int m, double misrate)
{
if (n <= 0)
throw AssumptionException.Domain(Subject.X);
if (m <= 0)
throw AssumptionException.Domain(Subject.Y);
if (double.IsNaN(misrate) || misrate < 0 || misrate > 1)
throw AssumptionException.Domain(Subject.Misrate);
double minMisrate = MinAchievableMisrate.TwoSample(n, m);
if (misrate < minMisrate)
throw AssumptionException.Domain(Subject.Misrate);
return n + m <= threshold
? CalcExact(n, m, misrate)
: CalcApprox(n, m, misrate);
}
internal int CalcExact(int n, int m, double misrate)
{
int raw = CalcExactRaw(n, m, misrate / 2);
return checked(raw * 2);
}
internal int CalcApprox(int n, int m, double misrate)
{
long raw = CalcApproxRaw(n, m, misrate / 2);
long margin = raw * 2;
if (margin > int.MaxValue)
throw new OverflowException($"Pairwise margin exceeds supported range for n={n}, m={m}");
return (int)margin;
}
// Inverse implementation of Andreas Löffler's (1982) "Über eine Partition der nat. Zahlen und ihre Anwendung beim U-Test"
private static int CalcExactRaw(int n, int m, double p)
{
double total = n + m < BinomialCoefficientFunction.MaxAcceptableN
? BinomialCoefficientFunction.BinomialCoefficient(n + m, m)
: BinomialCoefficientFunction.BinomialCoefficient(n + m, m * 1.0);
var pmf = new List<double> { 1 }; // pmf[0] = 1
var sigma = new List<double> { 0 }; // sigma[0] is unused
int u = 0;
double cdf = 1.0 / total;
if (cdf >= p)
return 0;
while (true)
{
u++;
// Ensure sigma has entry for u
if (sigma.Count <= u)
{
int value = 0;
for (int d = 1; d <= n; d++)
if (u % d == 0 && u >= d)
value += d;
for (int d = m + 1; d <= m + n; d++)
if (u % d == 0 && u >= d)
value -= d;
sigma.Add(value);
}
// Compute pmf[u] using Loeffler recurrence
double sum = 0;
for (int i = 0; i < u; i++)
sum += pmf[i] * sigma[u - i];
sum /= u;
pmf.Add(sum);
cdf += sum / total;
if (cdf >= p)
return u;
if (sum == 0)
break;
}
return pmf.Count - 1;
}
// Inverse Edgeworth Approximation
private static long CalcApproxRaw(int n, int m, double misrate)
{
long a = 0;
long b = (long)n * m;
while (a < b - 1)
{
long c = (a + b) / 2;
double p = EdgeworthCdf(n, m, c);
if (p < misrate)
a = c;
else
b = c;
}
return EdgeworthCdf(n, m, b) < misrate ? b : a;
}
private static double EdgeworthCdf(int n, int m, long u)
{
double nm = (double)n * m;
double mu = nm / 2.0;
double su = Sqrt(nm * (n + m + 1) / 12.0);
// -0.5 continuity correction: computing P(U ≥ u) for a right-tail discrete CDF
double z = (u - mu - 0.5) / su;
double phi = Exp(-z.Sqr() / 2) / Sqrt(2 * PI);
double Phi = AcmAlgorithm209.Gauss(z);
// Pre-compute powers of n and m for efficiency
double n2 = (double)n * n;
double n3 = n2 * n;
double n4 = n2 * n2;
double m2 = (double)m * m;
double m3 = m2 * m;
double m4 = m2 * m2;
double mu2 = (double)n * m * (n + m + 1) / 12.0;
double mu4 =
(double)n * m * (n + m + 1) *
(0
+ 5 * m * n * (m + n)
- 2 * (m2 + n2)
+ 3 * m * n
- 2 * (n + m)
) / 240.0;
double mu6 =
(double)n * m * (n + m + 1) *
(0
+ 35 * m2 * n2 * (m2 + n2)
+ 70 * m3 * n3
- 42 * m * n * (m3 + n3)
- 14 * m2 * n2 * (n + m)
+ 16 * (n4 + m4)
- 52 * n * m * (n2 + m2)
- 43 * n2 * m2
+ 32 * (m3 + n3)
+ 14 * m * n * (n + m)
+ 8 * (n2 + m2)
+ 16 * n * m
- 8 * (n + m)
) / 4032.0;
// Pre-compute powers of mu2 and related terms
double mu2_2 = mu2 * mu2;
double mu2_3 = mu2_2 * mu2;
double mu4_mu2_2 = mu4 / mu2_2;
// Factorial constants: 4! = 24, 6! = 720, 8! = 40320
double e3 = (mu4_mu2_2 - 3) / 24.0;
double e5 = (mu6 / mu2_3 - 15 * mu4_mu2_2 + 30) / 720.0;
double e7 = 35 * (mu4_mu2_2 - 3) * (mu4_mu2_2 - 3) / 40320.0;
// Pre-compute powers of z for Hermite polynomials
double z2 = z * z;
double z3 = z2 * z;
double z5 = z3 * z2;
double z7 = z5 * z2;
double f3 = -phi * (z3 - 3 * z);
double f5 = -phi * (z5 - 10 * z3 + 15 * z);
double f7 = -phi * (z7 - 21 * z5 + 105 * z3 - 105 * z);
double edgeworth = Phi + e3 * f3 + e5 * f5 + e7 * f7;
return Min(Max(edgeworth, 0), 1);
}
}
Notes
The Mann-Whitney test (also known as the Wilcoxon rank-sum test) ranks among the most widely used non-parametric statistical tests, testing whether two independent samples come from the same distribution. Under normality, it achieves nearly the same precision as Student's t-test, while maintaining reliability under diverse distributional conditions where the t-test fails.
The test operates by comparing all pairs of measurements between the two samples. Given samples $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_m)$, the Mann-Whitney statistic counts how many pairs satisfy $x_i > y_j$:

$$U = \#\{(i, j) : x_i > y_j\}$$

If the samples come from the same distribution, $U$ should be near $n \cdot m / 2$ (roughly half the pairs favor $x$, half favor $y$). Large deviations from $n \cdot m / 2$ suggest the distributions differ.
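To make the counting concrete, here is a brute-force sketch of the comparison count (hypothetical helper name; shown only for illustration):

```csharp
static class DominanceSketch
{
    // Counts how many of the n*m pairs (x_i, y_j) satisfy x_i > y_j
    // (brute force, for illustration only).
    public static long Count(double[] x, double[] y)
    {
        long u = 0;
        foreach (double xi in x)
            foreach (double yj in y)
                if (xi > yj)
                    u++;
        return u;
    }
}

// Under equivalent distributions the count concentrates near n*m/2;
// e.g. Count(new[] { 1.0, 3.0 }, new[] { 2.0, 4.0 }) returns 1 of the 4 pairs.
```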
The test answers: could this value of $U$ arise by chance if the samples were truly equivalent? The p-value quantifies this probability. If $p < 0.05$, traditional practice declares the difference statistically significant.
This approach creates several problems for practitioners:
- Binary thinking. The test produces a yes/no answer: reject or fail to reject the null hypothesis. Practitioners typically want to know the magnitude of difference, not just whether one exists.
- Arbitrary thresholds. The 0.05 threshold has no universal justification, yet it dominates practice and creates a false dichotomy between $p < 0.05$ and $p \geq 0.05$.
- Hypothesis-centric framework. The test assumes a null hypothesis of no difference and evaluates evidence against it. Real questions rarely concern exact equality; practitioners want to know "how different?" rather than "different or not?"
- Inverted logic. The natural question is "what shifts are consistent with my data?" The test answers "is this specific shift (zero) consistent with my data?"
The toolkit inverts this framework. Instead of testing whether a hypothesized shift is plausible, we compute which shifts are plausible given the data. This inversion transforms hypothesis testing into bounds estimation.
The mathematical foundation remains the same. The distribution of pairwise comparisons under random sampling determines which order statistics of pairwise differences form reliable bounds. The Mann-Whitney statistic measures pairwise comparisons ($x_i > y_j$). The shift estimator uses pairwise differences ($x_i - y_j$). These quantities are mathematically related: a pairwise difference is positive exactly when $x_i > y_j$. The toolkit renames this comparison count from $U$ to Dominance, clarifying its purpose: measuring how often one sample dominates the other in pairwise comparisons.
The distribution of the dominance count determines which order statistics form reliable bounds. Define the margin function PairwiseMargin(n, m, misrate): it computes how many extreme pairwise differences could occur by chance with probability misrate, based on the distribution of pairwise comparisons.
The function requires knowing the distribution of pairwise comparisons under sampling. Two computational approaches exist:
- Exact computation (Löffler's algorithm, 1982). Uses a recurrence relation to compute the exact distribution of pairwise comparisons for small samples. Practical for combined sample sizes up to several hundred.
- Approximation (Edgeworth expansion; Fix & Hodges 1955). Refines the normal approximation with correction terms based on higher moments of the distribution. Provides accurate results for large samples where exact computation becomes impractical.
The toolkit automatically selects the appropriate method based on sample sizes, ensuring both accuracy and computational efficiency.
This approach naturally complements the toolkit's other pairwise estimators:
- Center uses the median of pairwise averages
- Shift uses the median of pairwise differences
- the shift bounds use order statistics of the same pairwise differences
All procedures build on pairwise operations. This structural consistency reflects the mathematical unity underlying robust statistics: pairwise operations provide natural robustness while maintaining computational feasibility and statistical efficiency.
The inversion from hypothesis testing to bounds estimation represents a philosophical shift in statistical practice. Traditional methods ask "should I believe this specific hypothesis?" Pragmatic methods ask "what should I believe, given this data?" Bounds provide actionable answers: they tell practitioners which values are plausible, enabling informed decisions without arbitrary significance thresholds.
Traditional Mann-Whitney implementations apply tie correction when samples contain repeated values. This correction modifies variance calculations to account for tied observations, changing p-values and confidence intervals in ways that depend on measurement precision. The toolkit deliberately omits tie correction. Continuous distributions produce theoretically distinct values; observed ties result from finite measurement precision and digital representation. When measurements appear identical, this reflects rounding of underlying continuous variation, not true equality in the measured quantity. Treating ties as artifacts of discretization rather than distributional features simplifies computation while maintaining accuracy. The exact and approximate methods compute comparison distributions without requiring adjustments for tied values, eliminating a source of complexity and potential inconsistency in statistical practice.
Historical Development
The mathematical foundations emerged through decades of refinement. Mann and Whitney (1947) established the distribution of pairwise comparisons under random sampling, creating the theoretical basis for comparing samples through rank-based methods. Their work demonstrated that comparison counts follow predictable patterns regardless of the underlying population distributions.
The original computational approaches suffered from severe limitations. Mann and Whitney proposed a slow exact method requiring exponential resources and a normal approximation that proved grossly inaccurate for practical use. The approximation works reasonably in distribution centers but fails catastrophically in the tails where practitioners most need accuracy. For moderate sample sizes, tail approximation errors can reach orders of magnitude.
Fix and Hodges (1955) addressed the approximation problem through higher-order corrections. Their expansion adds terms based on the distribution's actual moments rather than assuming perfect normality. This refinement reduces tail probability errors from orders of magnitude to roughly 1%, making approximation practical for large samples where exact computation becomes infeasible.
Löffler (1982) solved the exact computation problem through algorithmic innovation. The naive recurrence requires quadratic memory, which is infeasible for samples beyond a few dozen measurements. Löffler discovered a reformulation that reduces memory to linear scale, making exact computation practical for combined sample sizes up to several hundred.
Despite these advances, most statistical software continues using the 1947 approximation. The computational literature contains the solutions, but software implementations lag decades behind theoretical developments. This toolkit implements both the exact method for small samples and the refined approximation for large samples, automatically selecting the appropriate approach based on sample sizes.
The shift from hypothesis testing to bounds estimation requires no new mathematics. The same comparison distributions that enable hypothesis tests also determine which order statistics form reliable bounds. Traditional applications ask "is zero plausible?" and answer yes or no. This toolkit asks "which values are plausible?" and answers with an interval. The perspective inverts while the mathematical foundation remains identical.
Tests
The test suite contains 178 test cases (4 demo + 4 natural + 10 edge + 12 small grid + 148 large grid). The domain constraints ($n \geq 1$, $m \geq 1$, misrate within $[\text{misrate}_{\min}, 1]$) are enforced; inputs violating them return a domain error. Combinations where the requested misrate falls below the minimum achievable misrate are excluded from the grid.
Demo examples (4 tests) — from the manual introduction:
- demo-1: , , , expected output:
- demo-2: , , , expected output:
- demo-3: , , , expected output:
- demo-4: , , , expected output:
These demo cases match the reference values used throughout the manual to illustrate construction.
Natural sequences ( × 2 misrates, filtered by min misrate) — 4 tests:
- Misrate values:
- After filtering by , only 4 combinations survive:
- natural-3-3-mr1: , , , expected output:
- natural-3-4-mr1: , ,
- natural-4-3-mr1: , ,
- natural-4-4-mr1: , ,
Edge cases — boundary condition validation (10 tests):
- boundary-min: , , (minimum samples with maximum misrate, expected output: )
- boundary-zero-margin-small: , , (strict misrate with sufficient samples)
- boundary-loose: , , (very permissive misrate)
- symmetry-2-5: , , (tests symmetry property)
- symmetry-5-2: , , (symmetric counterpart, same output as above)
- symmetry-3-7: , , (asymmetric sizes)
- symmetry-7-3: , , (symmetric counterpart)
- asymmetry-extreme-1-100: , , (extreme size difference)
- asymmetry-extreme-100-1: , , (reversed extreme)
- asymmetry-extreme-2-50: , , (highly unbalanced)
These edge cases validate correct handling of boundary conditions, the symmetry property PairwiseMargin(n, m, misrate) = PairwiseMargin(m, n, misrate), and extreme asymmetry in sample sizes.
Comprehensive grid — systematic coverage for thorough validation:
Small sample combinations ( × 6 misrates, filtered) — 12 tests:
- Misrate values:
- Combinations where are excluded
- Test naming: n{n}_m{m}_mr{k}, where {k} is the negative log10 of the misrate
- Examples:
  - n5_m5_mr1: , , , expected output:
  - n5_m5_mr2: , ,
Large sample combinations ( × 6 misrates, filtered) — 148 tests:
- Misrate values: same as small samples
- Combinations where are excluded (affects at misrates and )
- Test naming: n{n}_m{m}_r{k}, where {k} is the negative log10 of the misrate
- Examples:
  - n10_m10_r1: , , , expected output:
  - n50_m50_r3: , , , expected output:
  - n100_m100_r6: , , , expected output:
The comprehensive grid validates both symmetric ($n = m$) and asymmetric sample size combinations across six orders of magnitude in misrate, ensuring robust coverage of the parameter space.