PairwiseMargin

\operatorname{PairwiseMargin}(n, m, \mathrm{misrate})

Exclusion count for dominance-based bounds.

  • Purpose — determines extreme pairwise differences to exclude when constructing bounds
  • Based on — distribution of \operatorname{Dominance}(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^n \sum_{j=1}^m \mathbb{1}(x_i > y_j) under random sampling
  • Returns — total margin split evenly between lower and upper tails
  • Used by — \operatorname{ShiftBounds} to select appropriate order statistics
  • Complexity — exact for small samples, approximated for large
  • Domain — n, m \geq 1, \mathrm{misrate} \geq \frac{2}{\binom{n+m}{n}} (minimum achievable)
  • Unit — count
  • Note — assumes weak continuity (ties from measurement resolution are tolerated)

Properties

  • Symmetry — \operatorname{PairwiseMargin}(n, m, \mathrm{misrate}) = \operatorname{PairwiseMargin}(m, n, \mathrm{misrate})
  • Bounds — 0 \leq \operatorname{PairwiseMargin}(n, m, \mathrm{misrate}) \leq nm

Example

  • PairwiseMargin(30, 30, 1e-6) = 276
  • PairwiseMargin(30, 30, 1e-4) = 390
  • PairwiseMargin(30, 30, 1e-3) = 464

This is a supporting function that \operatorname{ShiftBounds} uses internally, so most users do not need to call it directly. It calculates how many extreme pairwise differences should be excluded when constructing bounds, based on sample sizes and the desired error rate. A lower misrate (higher confidence) results in a smaller margin, which produces wider bounds. The function automatically chooses between exact computation for small samples and a fast approximation for large samples.
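For orientation, a minimal sketch of the internal call as defined in the listing below (the class is internal, so this mirrors how toolkit code such as \operatorname{ShiftBounds} invokes it rather than a public API; the expected values come from the examples above):

int margin6 = PairwiseMargin.Instance.Calc(30, 30, 1e-6); // 276
int margin4 = PairwiseMargin.Instance.Calc(30, 30, 1e-4); // 390
int margin3 = PairwiseMargin.Instance.Calc(30, 30, 1e-3); // 464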

Algorithm

The \operatorname{PairwiseMargin} function determines how many extreme pairwise differences to exclude when constructing bounds around \operatorname{Shift}(\mathbf{x}, \mathbf{y}). Given samples \mathbf{x} = (x_1, \ldots, x_n) and \mathbf{y} = (y_1, \ldots, y_m), the \operatorname{ShiftBounds} estimator computes all nm pairwise differences z_{ij} = x_i - y_j and sorts them. The bounds select specific order statistics from this sorted sequence: [z_{(k_{\text{left}})}, z_{(k_{\text{right}})}]. The challenge lies in determining which order statistics produce bounds that contain the true shift \operatorname{Shift}[X, Y] with probability 1 - \mathrm{misrate}.

Random sampling creates natural variation in pairwise differences. Even when populations have identical distributions, sampling variation produces both positive and negative differences. The margin function quantifies this sampling variability: it specifies how many extreme pairwise differences could occur by chance with probability \mathrm{misrate}. For symmetric bounds, this margin splits evenly between the tails, giving k_{\text{left}} = \lfloor \frac{\operatorname{PairwiseMargin}(n, m, \mathrm{misrate})}{2} \rfloor + 1 and k_{\text{right}} = nm - \lfloor \frac{\operatorname{PairwiseMargin}(n, m, \mathrm{misrate})}{2} \rfloor.
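As a worked example using the demo values above: with n = m = 30 and \mathrm{misrate} = 10^{-3}, the margin is 464, so k_{\text{left}} = \lfloor 464/2 \rfloor + 1 = 233 and k_{\text{right}} = 900 - 232 = 668; the bounds are the 233rd and 668th of the 900 sorted pairwise differences.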

Computing the margin requires understanding the distribution of pairwise comparisons. Each pairwise difference corresponds to a comparison: x_i - y_j > 0 exactly when x_i > y_j. This connection motivates the dominance function:

\operatorname{Dominance}(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^n \sum_{j=1}^m \mathbb{1}(x_i > y_j)

The dominance function counts how many pairwise comparisons favor \mathbf{x} over \mathbf{y}. Both \operatorname{Shift} and \operatorname{Dominance} operate on the same collection of nm pairwise differences. The \operatorname{Shift} estimator examines difference values, returning the median as a location estimate. The \operatorname{Dominance} function examines difference signs, counting how many comparisons produce positive differences. While \operatorname{Shift} provides the estimate itself, \operatorname{Dominance} determines which order statistics form reliable bounds around that estimate.
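A minimal standalone sketch of the dominance count, taken directly from the definition above (a plain O(nm) scan; the name and signature are illustrative, not part of the toolkit API):

static int Dominance(double[] x, double[] y)
{
  // Count pairwise comparisons favoring x over y.
  int count = 0;
  foreach (double xi in x)
    foreach (double yj in y)
      if (xi > yj)
        count++;
  return count;
}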

When populations have equivalent distributions, \operatorname{Dominance} concentrates near \frac{nm}{2} by symmetry. The distribution of \operatorname{Dominance} across all possible sample orderings determines reliable bounds. If \operatorname{Dominance} deviates from \frac{nm}{2} by at least \frac{k}{2} with probability \mathrm{misrate}, then the interval excluding the k most extreme pairwise differences contains zero with probability 1 - \mathrm{misrate}. Translation invariance extends this relationship to arbitrary shifts: the margin computed from the comparison distribution applies regardless of the true shift value.

Two computational approaches provide the distribution of \operatorname{Dominance}: exact calculation for small samples and approximation for large samples.

Exact method

Small sample sizes allow exact computation without approximation. The exact approach exploits a fundamental symmetry: under equivalent populations, all \binom{n+m}{n} orderings of the combined measurements occur with equal probability. This symmetry enables direct calculation of how many orderings produce each comparison count.

Direct computation faces a combinatorial challenge. Enumerating all orderings to count comparison outcomes requires substantial resources. For samples beyond a few dozen measurements, naive implementation becomes impractical.

Löffler's recurrence relation (Löffler 1982) resolves this through algebraic structure. The recurrence exploits cycle properties in the comparison distribution, reducing memory requirements while maintaining exact calculation. The algorithm builds cumulative probabilities sequentially until reaching the threshold corresponding to the desired error rate. This approach extends practical exact computation to combined sample sizes of several hundred.

Define p_{n,m}(c) as the number of orderings producing exactly c comparisons favoring \mathbf{x}. The probability mass function becomes:

\Pr(\operatorname{Dominance} = c) = \frac{p_{n,m}(c)}{\binom{n+m}{n}}

A direct recurrence follows from considering the largest measurement. The rightmost element comes from either \mathbf{x} (contributing m comparisons) or \mathbf{y} (contributing zero):

p_{n,m}(c) = p_{n-1,m}(c - m) + p_{n,m-1}(c)

with base cases p_{n,0}(0) = 1 and p_{0,m}(0) = 1, and p = 0 for counts outside the achievable range.
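A minimal standalone sketch of this direct recurrence (illustrative only, not the toolkit's implementation; it recomputes subproblems, so it is usable only for tiny samples):

// Number of orderings of n x-values and m y-values with exactly c comparisons favoring x.
static long DirectCount(int n, int m, int c)
{
  if (c < 0 || c > (long)n * m)
    return 0; // out-of-range counts are impossible
  if (n == 0 || m == 0)
    return c == 0 ? 1 : 0; // base cases: p_{n,0}(0) = p_{0,m}(0) = 1
  return DirectCount(n - 1, m, c - m) + DirectCount(n, m - 1, c);
}

// For n = m = 2, DirectCount yields 1, 1, 2, 1, 1 for c = 0..4,
// summing to C(4,2) = 6 equally likely orderings.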

Direct implementation requires O(n \cdot m \cdot nm) time and O(nm) memory. An alternative recurrence (Löffler 1982) exploits cycle structure:

p_{n,m}(c) = \frac{1}{c} \sum_{i=0}^{c-1} p_{n,m}(i) \cdot \sigma_{n,m}(c - i)

where \sigma_{n,m}(d) captures structural properties through divisors:

\sigma_{n,m}(d) = \sum_{k \mid d} \epsilon_k \cdot k, \quad \epsilon_k = \begin{cases} 1 & \text{if } 1 \leq k \leq n \\ -1 & \text{if } m+1 \leq k \leq m+n \\ 0 & \text{otherwise} \end{cases}
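As a concrete check: for n = m = 3 and d = 4, the divisors of 4 are k \in \{1, 2, 4\}; here k = 1 and k = 2 lie in [1, n] and contribute +1 + 2, while k = 4 lies in [m+1, m+n] = [4, 6] and contributes -4, giving \sigma_{3,3}(4) = -1.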

This reduces memory to O(nm) and enables efficient computation through c = nm.

The algorithm computes cumulative probabilities \Pr(\operatorname{Dominance} \leq c) sequentially until the threshold \frac{\mathrm{misrate}}{2} is exceeded. By symmetry, the lower and upper thresholds determine the total margin \operatorname{PairwiseMargin} = 2c.

The sequential computation proceeds incrementally. Starting from u = 0 with base probability p_{n,m}(0) = 1, the algorithm computes p_{n,m}(1), then p_{n,m}(2), and so on, accumulating the cumulative distribution function with each step. The loop terminates as soon as \Pr(\operatorname{Dominance} \leq u) reaches \frac{\mathrm{misrate}}{2}, returning the threshold value u without computing further probabilities.
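As a worked trace for n = m = 2 (so \binom{4}{2} = 6 orderings): \sigma_{2,2}(1) = 1 and \sigma_{2,2}(2) = 3, so the recurrence gives p_{2,2}(1) = 1 \cdot 1 = 1 and p_{2,2}(2) = \frac{1}{2}(1 \cdot 3 + 1 \cdot 1) = 2, matching the exact ordering counts 1, 1, 2 for c = 0, 1, 2. The cumulative probabilities \frac{1}{6}, \frac{2}{6}, \frac{4}{6} grow step by step, and the loop returns the first u whose cumulative probability reaches \frac{\mathrm{misrate}}{2}.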

This sequential approach performs particularly well for small misrates. For \mathrm{misrate} = 10^{-6}, the threshold u typically remains small even with large sample sizes, requiring only a few iterations regardless of whether n and m equal 50 or 200. The algorithm computes only the extreme tail probabilities needed to reach the threshold, never touching the vast majority of probability mass concentrated near \frac{nm}{2}. This efficiency advantage grows as misrates decrease: stricter bounds require fewer computed values, making exact calculation particularly attractive for high-confidence applications.

Approximate method

Large samples make exact computation impractical. The dominance count \operatorname{Dominance} concentrates near \frac{nm}{2} with variance \frac{nm(n+m+1)}{12}. A basic \underline{\operatorname{Additive}} (Normal) approximation suffices asymptotically:

\operatorname{Dominance} \approx \underline{\operatorname{Additive}}\left(\frac{nm}{2}, \sqrt{\frac{nm(n+m+1)}{12}}\right)

This approximation underestimates tail probabilities for moderate sample sizes. The \underline{\operatorname{Additive}} (Normal) approximation provides a baseline but fails to capture the true distribution shape in the tails, producing mis-calibrated probabilities that become problematic for small error rates.

The Edgeworth expansion refines this approximation through moment-based corrections (Fix & Hodges 1955). The expansion starts with the \underline{\operatorname{Additive}} (Normal) cumulative distribution as a baseline, then adds correction terms that account for the distribution's asymmetry (skewness) and tail weight (kurtosis). These corrections use Hermite polynomials to adjust the baseline curve where the \underline{\operatorname{Additive}} (Normal) approximation deviates most from the true distribution. The first few correction terms typically achieve the practical balance between accuracy and computational cost, substantially improving tail probability estimates compared to the basic approximation.

The standardized comparison count (with continuity correction):

z = \frac{c - \frac{nm}{2} - 0.5}{\sqrt{\frac{nm(n+m+1)}{12}}}

produces the approximated cumulative distribution:

\Pr(\operatorname{Dominance} \leq c) \approx \Phi(z) + e_3 \phi^{(3)}(z) + e_5 \phi^{(5)}(z) + e_7 \phi^{(7)}(z)

where \Phi denotes the standard \underline{\operatorname{Additive}} (Normal) CDF.
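As a numeric illustration of the standardization (using the demo values, even though n + m = 60 would actually route to the exact method): for n = m = 30 and c = 138 (half the demo margin 276), \frac{nm}{2} = 450 and \sqrt{\frac{nm(n+m+1)}{12}} = \sqrt{4575} \approx 67.6, so z \approx \frac{138 - 450.5}{67.6} \approx -4.62, deep in the lower tail where the correction terms matter most.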

The correction coefficients depend on standardized moments:

e_3 = \frac{1}{24}\left(\frac{\mu_4}{\mu_2^2} - 3\right), \quad e_5 = \frac{1}{720}\left(\frac{\mu_6}{\mu_2^3} - 15\frac{\mu_4}{\mu_2^2} + 30\right), \quad e_7 = \frac{35}{40320}\left(\frac{\mu_4}{\mu_2^2} - 3\right)^2

The moments \mu_2, \mu_4, \mu_6 are computed from sample sizes:

\mu_2 = \frac{nm(n+m+1)}{12}

\mu_4 = \frac{nm(n+m+1)}{240} \left( 5nm(n+m) - 2(n^2 + m^2) + 3nm - 2(n+m) \right)

\mu_6 = \frac{nm(n+m+1)}{4032} \big( 35n^2m^2(n^2 + m^2) + 70n^3m^3 - 42nm(n^3 + m^3) - 14n^2m^2(n+m) + 16(n^4 + m^4) - 52nm(n^2 + m^2) - 43n^2m^2 + 32(n^3 + m^3) + 14nm(n+m) + 8(n^2 + m^2) + 16nm - 8(n+m) \big)

The correction terms use Hermite polynomials:

\phi^{(k)}(z) = -\phi(z) H_k(z)

H_3(z) = z^3 - 3z, \quad H_5(z) = z^5 - 10z^3 + 15z, \quad H_7(z) = z^7 - 21z^5 + 105z^3 - 105z

Binary search locates the threshold value efficiently. The algorithm maintains a search interval [a, b] initialized to [0, nm]. Each iteration computes the midpoint c = \lfloor \frac{a + b}{2} \rfloor and evaluates the Edgeworth CDF at c. If \Pr(\operatorname{Dominance} \leq c) < \frac{\mathrm{misrate}}{2}, the threshold lies above c and the search continues in [c, b]. If \Pr(\operatorname{Dominance} \leq c) \geq \frac{\mathrm{misrate}}{2}, the threshold lies below c and the search continues in [a, c]. The loop terminates when a and b become adjacent, requiring O(\log(nm)) CDF evaluations.

This binary search exhibits uniform performance across misrate values. Whether computing bounds for \mathrm{misrate} = 10^{-6} or \mathrm{misrate} = 0.05, the algorithm performs the same number of iterations determined solely by the sample sizes. Each CDF evaluation costs constant time regardless of the threshold location, making the approximate method particularly efficient for large samples where exact computation becomes impractical. The logarithmic scaling ensures that doubling the sample size adds only one additional iteration, enabling practical computation for samples in the thousands or tens of thousands.

The toolkit selects between exact and approximate computation based on combined sample size: exact method for n + m \leq 400, approximate method for n + m > 400. The exact method guarantees correctness but scales as O(nm) memory and O((nm)^2) time. For n = m = 200, this requires 40,000 memory locations. The approximate method achieves 1% accuracy with O(\log(nm)) constant-time evaluations. For n = m = 10000, the approximate method completes in milliseconds versus minutes for exact computation.

Both methods handle discrete data. Repeated measurements produce tied pairwise differences, creating plateaus in the sorted sequence. The exact method counts orderings without assuming continuity. The approximate method's moment-based corrections capture the actual distribution shape regardless of discreteness.

Minimum achievable misrate

The \mathrm{misrate} parameter controls how many extreme pairwise differences the bounds exclude. A lower misrate demands higher confidence: it excludes fewer extreme values and therefore produces wider bounds. However, sample size limits how small misrate can meaningfully become.

Consider the most extreme configuration: all measurements from \mathbf{x} exceed all measurements from \mathbf{y}, giving x_1, \ldots, x_n > y_1, \ldots, y_m. Under equivalent populations, this arrangement occurs purely by chance. The probability equals the chance of having all n elements from \mathbf{x} occupy the top n positions among n+m total measurements:

\mathrm{misrate}_{\min} = \frac{2}{\binom{n+m}{n}} = \frac{2 \cdot n! \cdot m!}{(n+m)!}

This represents the minimum probability of the most extreme ordering under random sampling. Setting \mathrm{misrate} < \mathrm{misrate}_{\min} makes bounds construction problematic. The exact distribution of \operatorname{Dominance} cannot support misrates smaller than the probability of its most extreme realization. Attempting to construct bounds with \mathrm{misrate} < \mathrm{misrate}_{\min} forces the algorithm to exclude zero pairwise differences from the tails, making \operatorname{PairwiseMargin} = 0. The resulting bounds span all nm pairwise differences, returning [z_{(1)}, z_{(nm)}] regardless of the desired confidence level. These bounds convey no useful information beyond the range of observed pairwise differences.

For small samples, \mathrm{misrate}_{\min} can exceed commonly used values. With n = m = 6, the minimum misrate equals \frac{2}{\binom{12}{6}} \approx 0.00216, making the typical choice of \mathrm{misrate} = 10^{-3} impossible. With n = m = 4, the minimum becomes \frac{2}{\binom{8}{4}} \approx 0.0286, exceeding even \mathrm{misrate} = 0.01.
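The toolkit performs this check via MinAchievableMisrate.TwoSample (see the listing below). A hypothetical standalone equivalent, computing the binomial coefficient multiplicatively to avoid factorial overflow:

static double MinMisrate(int n, int m)
{
  // 2 / C(n+m, n), built up as a product of (m+k)/k for k = 1..n.
  double binom = 1;
  for (int k = 1; k <= n; k++)
    binom *= (double)(m + k) / k;
  return 2 / binom;
}

// MinMisrate(6, 6) ≈ 0.002165; MinMisrate(4, 4) ≈ 0.028571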

The table below shows \mathrm{misrate}_{\min} for small sample sizes:

| n \ m | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1.000000 | 0.666667 | 0.500000 | 0.400000 | 0.333333 | 0.285714 | 0.250000 | 0.222222 | 0.200000 | 0.181818 |
| 2 | 0.666667 | 0.333333 | 0.200000 | 0.133333 | 0.095238 | 0.071429 | 0.055556 | 0.044444 | 0.036364 | 0.030303 |
| 3 | 0.500000 | 0.200000 | 0.100000 | 0.057143 | 0.035714 | 0.023810 | 0.016667 | 0.012121 | 0.009091 | 0.006993 |
| 4 | 0.400000 | 0.133333 | 0.057143 | 0.028571 | 0.015873 | 0.009524 | 0.006061 | 0.004040 | 0.002797 | 0.001998 |
| 5 | 0.333333 | 0.095238 | 0.035714 | 0.015873 | 0.007937 | 0.004329 | 0.002525 | 0.001554 | 0.000999 | 0.000666 |
| 6 | 0.285714 | 0.071429 | 0.023810 | 0.009524 | 0.004329 | 0.002165 | 0.001166 | 0.000666 | 0.000400 | 0.000250 |
| 7 | 0.250000 | 0.055556 | 0.016667 | 0.006061 | 0.002525 | 0.001166 | 0.000583 | 0.000311 | 0.000175 | 0.000103 |
| 8 | 0.222222 | 0.044444 | 0.012121 | 0.004040 | 0.001554 | 0.000666 | 0.000311 | 0.000155 | 0.000082 | 0.000046 |
| 9 | 0.200000 | 0.036364 | 0.009091 | 0.002797 | 0.000999 | 0.000400 | 0.000175 | 0.000082 | 0.000041 | 0.000022 |
| 10 | 0.181818 | 0.030303 | 0.006993 | 0.001998 | 0.000666 | 0.000250 | 0.000103 | 0.000046 | 0.000022 | 0.000011 |

For meaningful bounds construction, choose \mathrm{misrate} > \mathrm{misrate}_{\min}. This ensures the margin function excludes at least some extreme pairwise differences, producing bounds narrower than the full range. When working with small samples, verify that the desired misrate exceeds \mathrm{misrate}_{\min} for the given sample sizes. With moderate sample sizes (n, m \geq 15), \mathrm{misrate}_{\min} drops below 10^{-7}, making standard choices like \mathrm{misrate} = 10^{-6} feasible.

using System;
using System.Collections.Generic;
using JetBrains.Annotations;
using Pragmastat.Exceptions;
using Pragmastat.Internal;
using static System.Math;

namespace Pragmastat.Functions;

/// <summary>
/// PairwiseMargin function
/// </summary>
/// <param name="threshold">The maximum value for n+m, after which implementation switches from exact to approx</param>
internal class PairwiseMargin(int threshold = PairwiseMargin.MaxExactSize)
{
  public static readonly PairwiseMargin Instance = new();

  private const int MaxExactSize = 400;

  [PublicAPI]
  public int Calc(int n, int m, double misrate)
  {
    if (n <= 0)
      throw AssumptionException.Domain(Subject.X);
    if (m <= 0)
      throw AssumptionException.Domain(Subject.Y);
    if (double.IsNaN(misrate) || misrate < 0 || misrate > 1)
      throw AssumptionException.Domain(Subject.Misrate);

    double minMisrate = MinAchievableMisrate.TwoSample(n, m);
    if (misrate < minMisrate)
      throw AssumptionException.Domain(Subject.Misrate);

    return n + m <= threshold
      ? CalcExact(n, m, misrate)
      : CalcApprox(n, m, misrate);
  }

  internal int CalcExact(int n, int m, double misrate)
  {
    int raw = CalcExactRaw(n, m, misrate / 2);
    return checked(raw * 2);
  }

  internal int CalcApprox(int n, int m, double misrate)
  {
    long raw = CalcApproxRaw(n, m, misrate / 2);
    long margin = raw * 2;
    if (margin > int.MaxValue)
      throw new OverflowException($"Pairwise margin exceeds supported range for n={n}, m={m}");
    return (int)margin;
  }

  // Inverted implementation of Andreas Löffler's (1982) "Über eine Partition der nat. Zahlen und ihre Anwendung beim U-Test"
  private static int CalcExactRaw(int n, int m, double p)
  {
    double total = n + m < BinomialCoefficientFunction.MaxAcceptableN
      ? BinomialCoefficientFunction.BinomialCoefficient(n + m, m)
      : BinomialCoefficientFunction.BinomialCoefficient(n + m, m * 1.0);

    var pmf = new List<double> { 1 }; // pmf[0] = 1
    var sigma = new List<double> { 0 }; // sigma[0] is unused

    int u = 0;
    double cdf = 1.0 / total;

    if (cdf >= p)
      return 0;

    while (true)
    {
      u++;
      // Ensure sigma has entry for u
      if (sigma.Count <= u)
      {
        int value = 0;
        for (int d = 1; d <= n; d++)
          if (u % d == 0 && u >= d)
            value += d;
        for (int d = m + 1; d <= m + n; d++)
          if (u % d == 0 && u >= d)
            value -= d;
        sigma.Add(value);
      }

      // Compute pmf[u] using Loeffler recurrence
      double sum = 0;
      for (int i = 0; i < u; i++)
        sum += pmf[i] * sigma[u - i];
      sum /= u;
      pmf.Add(sum);

      cdf += sum / total;
      if (cdf >= p)
        return u;
      if (sum == 0)
        break;
    }

    return pmf.Count - 1;
  }

  // Inverse Edgeworth Approximation
  private static long CalcApproxRaw(int n, int m, double misrate)
  {
    long a = 0;
    long b = (long)n * m;
    while (a < b - 1)
    {
      long c = (a + b) / 2;
      double p = EdgeworthCdf(n, m, c);
      if (p < misrate)
        a = c;
      else
        b = c;
    }

    return EdgeworthCdf(n, m, b) < misrate ? b : a;
  }

  private static double EdgeworthCdf(int n, int m, long u)
  {
    double nm = (double)n * m;
    double mu = nm / 2.0;
    double su = Sqrt(nm * (n + m + 1) / 12.0);
    // -0.5 continuity correction: computing P(U ≥ u) for a right-tail discrete CDF
    double z = (u - mu - 0.5) / su;
    double phi = Exp(-z.Sqr() / 2) / Sqrt(2 * PI);
    double Phi = AcmAlgorithm209.Gauss(z);

    // Pre-compute powers of n and m for efficiency
    double n2 = (double)n * n;
    double n3 = n2 * n;
    double n4 = n2 * n2;
    double m2 = (double)m * m;
    double m3 = m2 * m;
    double m4 = m2 * m2;

    double mu2 = (double)n * m * (n + m + 1) / 12.0;
    double mu4 =
      (double)n * m * (n + m + 1) *
      (0
       + 5 * m * n * (m + n)
       - 2 * (m2 + n2)
       + 3 * m * n
       - 2 * (n + m)
      ) / 240.0;

    double mu6 =
      (double)n * m * (n + m + 1) *
      (0
       + 35 * m2 * n2 * (m2 + n2)
       + 70 * m3 * n3
       - 42 * m * n * (m3 + n3)
       - 14 * m2 * n2 * (n + m)
       + 16 * (n4 + m4)
       - 52 * n * m * (n2 + m2)
       - 43 * n2 * m2
       + 32 * (m3 + n3)
       + 14 * m * n * (n + m)
       + 8 * (n2 + m2)
       + 16 * n * m
       - 8 * (n + m)
      ) / 4032.0;

    // Pre-compute powers of mu2 and related terms
    double mu2_2 = mu2 * mu2;
    double mu2_3 = mu2_2 * mu2;
    double mu4_mu2_2 = mu4 / mu2_2;

    // Factorial constants: 4! = 24, 6! = 720, 8! = 40320
    double e3 = (mu4_mu2_2 - 3) / 24.0;
    double e5 = (mu6 / mu2_3 - 15 * mu4_mu2_2 + 30) / 720.0;
    double e7 = 35 * (mu4_mu2_2 - 3) * (mu4_mu2_2 - 3) / 40320.0;

    // Pre-compute powers of z for Hermite polynomials
    double z2 = z * z;
    double z3 = z2 * z;
    double z5 = z3 * z2;
    double z7 = z5 * z2;

    double f3 = -phi * (z3 - 3 * z);
    double f5 = -phi * (z5 - 10 * z3 + 15 * z);
    double f7 = -phi * (z7 - 21 * z5 + 105 * z3 - 105 * z);

    double edgeworth = Phi + e3 * f3 + e5 * f5 + e7 * f7;
    return Min(Max(edgeworth, 0), 1);
  }
}

Notes

The Mann-Whitney U test (also known as the Wilcoxon rank-sum test) ranks among the most widely used non-parametric statistical tests, testing whether two independent samples come from the same distribution. Under \underline{\operatorname{Additive}} (Normal) conditions, it achieves nearly the same precision as Student's t-test, while maintaining reliability under diverse distributional conditions where the t-test fails.

The test operates by comparing all pairs of measurements between the two samples. Given samples \mathbf{x} = (x_1, \ldots, x_n) and \mathbf{y} = (y_1, \ldots, y_m), the Mann-Whitney U statistic counts how many pairs satisfy x_i > y_j:

U = \sum_{i=1}^n \sum_{j=1}^m \mathbb{1}(x_i > y_j)

If the samples come from the same distribution, U should be near \frac{nm}{2} (roughly half the pairs favor \mathbf{x}, half favor \mathbf{y}). Large deviations from \frac{nm}{2} suggest the distributions differ.

The test answers: "Could this U value arise by chance if the samples were truly equivalent?" The p-value quantifies this probability. If p < 0.05, traditional practice declares the difference statistically significant.

This approach creates several problems for practitioners:

  • Binary thinking. The test produces a yes/no answer: reject or fail to reject the null hypothesis. Practitioners typically want to know the magnitude of difference, not just whether one exists.
  • Arbitrary thresholds. The 0.05 threshold has no universal justification, yet it dominates practice and creates a false dichotomy between p = 0.049 and p = 0.051.
  • Hypothesis-centric framework. The test assumes a null hypothesis of no difference and evaluates evidence against it. Real questions rarely concern exact equality; practitioners want to know "how different?" rather than "different or not?"
  • Inverted logic. The natural question is "what shifts are consistent with my data?" The test answers "is this specific shift (zero) consistent with my data?"

The toolkit inverts this framework. Instead of testing whether a hypothesized shift is plausible, we compute which shifts are plausible given the data. This inversion transforms hypothesis testing into bounds estimation.

The mathematical foundation remains the same. The distribution of pairwise comparisons under random sampling determines which order statistics of pairwise differences form reliable bounds. The Mann-Whitney U statistic measures pairwise comparisons (x_i > y_j). The \operatorname{Shift} estimator uses pairwise differences (x_i - y_j). These quantities are mathematically related: a pairwise difference x_i - y_j is positive exactly when x_i > y_j. The toolkit renames this comparison count from U to \operatorname{Dominance}(\mathbf{x}, \mathbf{y}), clarifying its purpose: measuring how often one sample dominates the other in pairwise comparisons.

The distribution of \operatorname{Dominance} determines which order statistics form reliable bounds. Define the margin function:

\operatorname{PairwiseMargin}(n, m, \mathrm{misrate}) = \text{number of pairwise differences to exclude from bounds}

This function computes how many extreme pairwise differences could occur by chance with probability \mathrm{misrate}, based on the distribution of pairwise comparisons.

The \operatorname{PairwiseMargin} function requires knowing the distribution of pairwise comparisons under sampling. Two computational approaches exist:

  • Exact computation (Löffler's algorithm, 1982). Uses a recurrence relation to compute the exact distribution of pairwise comparisons for small samples. Practical for combined sample sizes up to several hundred.
  • Approximation (Edgeworth expansion, 1955). Refines the normal approximation with correction terms based on higher moments of the distribution. Provides accurate results for large samples where exact computation becomes impractical.

The toolkit automatically selects the appropriate method based on sample sizes, ensuring both accuracy and computational efficiency.

This approach naturally complements \operatorname{Center} and \operatorname{Spread}:

  • \operatorname{Center}(\mathbf{x}) uses the median of pairwise averages \frac{x_i + x_j}{2}
  • \operatorname{Spread}(\mathbf{x}) uses the median of pairwise differences \lvert x_i - x_j \rvert
  • \operatorname{Shift}(\mathbf{x}, \mathbf{y}) uses the median of pairwise differences x_i - y_j
  • \operatorname{ShiftBounds}(\mathbf{x}, \mathbf{y}, \mathrm{misrate}) uses order statistics of the same pairwise differences

All procedures build on pairwise operations. This structural consistency reflects the mathematical unity underlying robust statistics: pairwise operations provide natural robustness while maintaining computational feasibility and statistical efficiency.

The inversion from hypothesis testing to bounds estimation represents a philosophical shift in statistical practice. Traditional methods ask "should I believe this specific hypothesis?" Pragmatic methods ask "what should I believe, given this data?" Bounds provide actionable answers: they tell practitioners which values are plausible, enabling informed decisions without arbitrary significance thresholds.

Traditional Mann-Whitney implementations apply tie correction when samples contain repeated values. This correction modifies variance calculations to account for tied observations, changing p-values and confidence intervals in ways that depend on measurement precision. The toolkit deliberately omits tie correction. Continuous distributions produce theoretically distinct values; observed ties result from finite measurement precision and digital representation. When measurements appear identical, this reflects rounding of underlying continuous variation, not true equality in the measured quantity. Treating ties as artifacts of discretization rather than distributional features simplifies computation while maintaining accuracy. The exact and approximate methods compute comparison distributions without requiring adjustments for tied values, eliminating a source of complexity and potential inconsistency in statistical practice.

Historical Development

The mathematical foundations emerged through decades of refinement. Mann and Whitney (1947) established the distribution of pairwise comparisons under random sampling, creating the theoretical basis for comparing samples through rank-based methods. Their work demonstrated that comparison counts follow predictable patterns regardless of the underlying population distributions.

The original computational approaches suffered from severe limitations. Mann and Whitney proposed a slow exact method requiring exponential resources and a normal approximation that proved grossly inaccurate for practical use. The approximation works reasonably in distribution centers but fails catastrophically in the tails where practitioners most need accuracy. For moderate sample sizes, approximation errors can exceed factors of 10^{11}.

Fix and Hodges (1955) addressed the approximation problem through higher-order corrections. Their expansion adds terms based on the distribution's actual moments rather than assuming perfect normality. This refinement reduces tail probability errors from orders of magnitude to roughly 1%, making approximation practical for large samples where exact computation becomes infeasible.

Löffler (1982) solved the exact computation problem through algorithmic innovation. The naive recurrence requires quadratic memory, infeasible for samples beyond a few dozen measurements. Löffler discovered a reformulation that reduces memory to linear scale, making exact computation practical for combined sample sizes up to several hundred.

Despite these advances, most statistical software continues using the 1947 approximation. The computational literature contains the solutions, but software implementations lag decades behind theoretical developments. This toolkit implements both the exact method for small samples and the refined approximation for large samples, automatically selecting the appropriate approach based on sample sizes.

The shift from hypothesis testing to bounds estimation requires no new mathematics. The same comparison distributions that enable hypothesis tests also determine which order statistics form reliable bounds. Traditional applications ask "is zero plausible?" and answer yes or no. This toolkit asks "which values are plausible?" and answers with an interval. The perspective inverts while the mathematical foundation remains identical.

Tests

\operatorname{PairwiseMargin}(n, m, \mathrm{misrate})

The \operatorname{PairwiseMargin} test suite contains 178 test cases (4 demo + 4 natural + 10 edge + 12 small grid + 148 large grid). The domain constraint \mathrm{misrate} \geq \frac{2}{\binom{n+m}{n}} is enforced; inputs violating this return a domain error. Combinations where the requested misrate falls below the minimum achievable misrate are excluded from the grid.

Demo examples (n = m = 30) — from manual introduction:

  • demo-1: n=30, m=30, \mathrm{misrate}=10^{-6}, expected output: 276
  • demo-2: n=30, m=30, \mathrm{misrate}=10^{-5}, expected output: 328
  • demo-3: n=30, m=30, \mathrm{misrate}=10^{-4}, expected output: 390
  • demo-4: n=30, m=30, \mathrm{misrate}=10^{-3}, expected output: 464

These demo cases match the reference values used throughout the manual to illustrate \operatorname{ShiftBounds} construction.

Natural sequences ([n, m] \in \{1, 2, 3, 4\} \times \{1, 2, 3, 4\} \times 2 misrates, filtered by min misrate) — 4 tests:

  • Misrate values: \mathrm{misrate} \in \{10^{-1}, 10^{-2}\}
  • After filtering by \mathrm{misrate} \geq \frac{2}{\binom{n+m}{n}}, only 4 combinations survive:
  • natural-3-3-mr1: n=3, m=3, \mathrm{misrate}=0.1, expected output: 0
  • natural-3-4-mr1: n=3, m=4, \mathrm{misrate}=0.1
  • natural-4-3-mr1: n=4, m=3, \mathrm{misrate}=0.1
  • natural-4-4-mr1: n=4, m=4, \mathrm{misrate}=0.1

Edge cases — boundary condition validation (10 tests):

  • boundary-min: n=1, m=1, \mathrm{misrate}=1.0 (minimum samples with maximum misrate, expected output: 0)
  • boundary-zero-margin-small: n=20, m=20, \mathrm{misrate}=10^{-6} (strict misrate with sufficient samples)
  • boundary-loose: n=5, m=5, \mathrm{misrate}=0.9 (very permissive misrate)
  • symmetry-2-5: n=2, m=5, \mathrm{misrate}=0.1 (tests symmetry property)
  • symmetry-5-2: n=5, m=2, \mathrm{misrate}=0.1 (symmetric counterpart, same output as above)
  • symmetry-3-7: n=3, m=7, \mathrm{misrate}=0.05 (asymmetric sizes)
  • symmetry-7-3: n=7, m=3, \mathrm{misrate}=0.05 (symmetric counterpart)
  • asymmetry-extreme-1-100: n=1, m=100, \mathrm{misrate}=0.1 (extreme size difference)
  • asymmetry-extreme-100-1: n=100, m=1, \mathrm{misrate}=0.1 (reversed extreme)
  • asymmetry-extreme-2-50: n=2, m=50, \mathrm{misrate}=0.05 (highly unbalanced)

These edge cases validate correct handling of boundary conditions, the symmetry property \operatorname{PairwiseMargin}(n, m, \mathrm{misrate}) = \operatorname{PairwiseMargin}(m, n, \mathrm{misrate}), and extreme asymmetry in sample sizes. A minimal sketch of how the symmetry cases might be driven follows below.
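This sketch assumes an xUnit-style harness with InternalsVisibleTo access to the internal class; the test class name and attribute plumbing are illustrative, while the parameters follow the list above:

using Xunit;

public class PairwiseMarginSymmetryTests
{
  [Theory]
  [InlineData(2, 5, 0.1)]
  [InlineData(3, 7, 0.05)]
  public void MarginIsSymmetricInSampleSizes(int n, int m, double misrate)
  {
    // Swapping n and m must not change the margin.
    Assert.Equal(
      PairwiseMargin.Instance.Calc(n, m, misrate),
      PairwiseMargin.Instance.Calc(m, n, misrate));
  }
}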

Comprehensive grid — systematic coverage for thorough validation:

Small sample combinations ([n, m] \in \{1, 2, 3, 4, 5\} \times \{1, 2, 3, 4, 5\} \times 6 misrates, filtered) — 12 tests:

  • Misrate values: \mathrm{misrate} \in \{10^{-1}, 10^{-2}, 10^{-3}, 10^{-4}, 10^{-5}, 10^{-6}\}
  • Combinations where \mathrm{misrate} < \frac{2}{\binom{n+m}{n}} are excluded
  • Test naming: n{n}_m{m}_mr{k} where k is the negative log10 of misrate
  • Examples:
  • n5_m5_mr1: n=5, m=5, \mathrm{misrate}=0.1, expected output: 10
  • n5_m5_mr2: n=5, m=5, \mathrm{misrate}=0.01

Large sample combinations ([n, m] \in \{10, 20, 30, 50, 100\} \times \{10, 20, 30, 50, 100\} \times 6 misrates, filtered) — 148 tests:

  • Misrate values: same as small samples
  • Combinations where \mathrm{misrate} < \frac{2}{\binom{n+m}{n}} are excluded (affects n = m = 10 at misrates 10^{-5} and 10^{-6})
  • Test naming: n{n}_m{m}_r{k} where k is the negative log10 of misrate
  • Examples:
  • n10_m10_r1: n=10, m=10, \mathrm{misrate}=0.1, expected output: 56
  • n50_m50_r3: n=50, m=50, \mathrm{misrate}=0.001, expected output: 1556
  • n100_m100_r6: n=100, m=100, \mathrm{misrate}=10^{-6}, expected output: 6060

The comprehensive grid validates both symmetric (n = m) and asymmetric sample size combinations across six orders of magnitude in misrate, ensuring robust coverage of the parameter space.

References

Löffler, Andreas (1982). On the calculation of the exact null-distribution of the Wilcoxon-Mann-Whitney U-statistic.
Fix, Evelyn & Hodges, J. L. (1955). Significance probabilities of the Wilcoxon test. The Annals of Mathematical Statistics.