PairwiseMargin
Exclusion count for dominance-based bounds.
- Purpose — determines how many extreme pairwise differences to exclude when constructing bounds
- Based on — distribution of pairwise comparison counts (dominance) under random sampling
- Returns — total margin split evenly between lower and upper tails
- Used by — bounds construction, to select appropriate order statistics
- Complexity — exact for small samples, approximated for large
- Domain — n ≥ 1, m ≥ 1, misrate in [misrate_min, 1], where misrate_min is the minimum achievable misrate
- Unit — count
- Note — assumes weak continuity (ties from measurement resolution are tolerated)
Properties
- Symmetry — PairwiseMargin(n, m, misrate) = PairwiseMargin(m, n, misrate)
- Bounds — 0 ≤ PairwiseMargin(n, m, misrate) ≤ n·m
Example
PairwiseMargin(30, 30, 1e-6) = 276
PairwiseMargin(30, 30, 1e-4) = 390
PairwiseMargin(30, 30, 1e-3) = 464
This is a supporting function used internally by the bounds construction, so most users do not need to call it directly. It calculates how many extreme pairwise differences should be excluded when constructing bounds, based on the sample sizes and the desired error rate. A lower misrate (higher confidence) results in a smaller margin, which produces wider bounds. The function automatically chooses between exact computation for small samples and a fast approximation for large samples.
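For orientation, a minimal usage sketch based on the reference implementation listed later in this section. The class is declared internal, so this assumes in-assembly access; typical users reach these values through the higher-level bound estimators.

```csharp
using System;
using Pragmastat.Functions;

// Sketch: reproduces the example values above via the reference implementation
// listed later in this section (requires internal access to the class).
int m1 = PairwiseMargin.Instance.Calc(30, 30, 1e-6); // 276
int m2 = PairwiseMargin.Instance.Calc(30, 30, 1e-4); // 390
int m3 = PairwiseMargin.Instance.Calc(30, 30, 1e-3); // 464
Console.WriteLine($"{m1}, {m2}, {m3}");
```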
Algorithm
The function determines how many extreme pairwise differences to exclude when constructing bounds around the shift estimate. Given samples $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_m)$, the estimator computes all $n \cdot m$ pairwise differences $x_i - y_j$ and sorts them. The bounds select specific order statistics from this sorted sequence. The challenge lies in determining which order statistics produce bounds that contain the true shift with probability $1 - \text{misrate}$.
Random sampling creates natural variation in pairwise differences. Even when populations have identical distributions, sampling variation produces both positive and negative differences. The margin function quantifies this sampling variability: it specifies how many extreme pairwise differences could occur by chance with probability misrate. For symmetric bounds, this margin splits evenly between the tails, excluding margin/2 differences from the lower tail and margin/2 from the upper tail.
Computing the margin requires understanding the distribution of pairwise comparisons. Each pairwise difference corresponds to a comparison: $x_i - y_j > 0$ exactly when $x_i > y_j$. This connection motivates the dominance function:

$$\mathrm{Dominance}(x, y) = \#\{(i, j) : x_i > y_j\}$$
The dominance function counts how many pairwise comparisons favor $x$ over $y$. Both the shift estimator and the dominance function operate on the same collection of pairwise differences. The estimator examines difference values, returning their median as a location estimate. The dominance function examines difference signs, counting how many comparisons produce positive differences. While the estimator provides the estimate itself, the dominance distribution determines which order statistics form reliable bounds around that estimate.
When populations have equivalent distributions, the dominance count concentrates near $n \cdot m / 2$ by symmetry. The distribution of this count across all possible sample orderings determines reliable bounds. If the count deviates from $n \cdot m / 2$ by at least margin/2 with probability at most misrate, then the interval that excludes the margin most extreme pairwise differences contains zero with probability at least $1 - \text{misrate}$. Translation invariance extends this relationship to arbitrary shifts: the margin computed from the comparison distribution applies regardless of the true shift value.
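The following sketch shows how a computed margin might translate into bounds over the sorted pairwise differences. The exact index convention of the toolkit's bound estimator is not restated in this section, so treat the offsets below as illustrative only.

```csharp
using System.Collections.Generic;

static class BoundsSketch
{
    // Illustrative only: the toolkit's bound estimator may use a slightly
    // different index convention; this shows the general idea of excluding
    // margin/2 extreme pairwise differences from each tail.
    public static (double Lower, double Upper) FromMargin(double[] x, double[] y, int margin)
    {
        var diffs = new List<double>(x.Length * y.Length);
        foreach (double xi in x)
            foreach (double yj in y)
                diffs.Add(xi - yj); // all n*m pairwise differences

        diffs.Sort();
        int k = margin / 2;         // differences excluded per tail
        return (diffs[k], diffs[^(k + 1)]);
    }
}
```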
Two computational approaches provide the distribution of the dominance count: exact calculation for small samples and approximation for large samples.
Exact method
Small sample sizes allow exact computation without approximation. The exact approach exploits a fundamental symmetry: under equivalent populations, all orderings of the combined measurements occur with equal probability. This symmetry enables direct calculation of how many orderings produce each comparison count.
Direct computation faces a combinatorial challenge. Enumerating all orderings to count comparison outcomes requires substantial resources. For samples beyond a few dozen measurements, naive implementation becomes impractical.
Löffler's recurrence relation (Löffler 1982) resolves this through algebraic structure. The recurrence exploits cycle properties in the comparison distribution, reducing memory requirements while maintaining exact calculation. The algorithm builds cumulative probabilities sequentially until reaching the threshold corresponding to the desired error rate. This approach extends practical exact computation to combined sample sizes of several hundred.
Define $c_{n,m}(u)$ as the number of orderings producing exactly $u$ comparisons favoring $x$. The probability mass function becomes:

$$\Pr[\mathrm{Dominance}(x, y) = u] = \frac{c_{n,m}(u)}{\binom{n+m}{n}}$$
A direct recurrence follows from considering the largest measurement. The rightmost element comes from either $x$ (contributing $m$ comparisons) or $y$ (contributing zero):

$$c_{n,m}(u) = c_{n-1,m}(u - m) + c_{n,m-1}(u)$$

with base cases $c_{0,m}(0) = c_{n,0}(0) = 1$ and $c_{n,m}(u) = 0$ whenever $u < 0$.
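To make the recurrence concrete, here is a memoized sketch with hypothetical helper names. It is deliberately memory-hungry and is not the toolkit's method, which uses the Löffler reformulation described next.

```csharp
using System.Collections.Generic;

static class ExactCountsSketch
{
    private static readonly Dictionary<(int n, int m, int u), double> Cache = new();

    // c(n, m, u): number of orderings of the combined sample in which exactly u
    // of the n*m pairwise comparisons favor x. The largest value comes either
    // from x (adding m comparisons) or from y (adding none).
    public static double Count(int n, int m, int u)
    {
        if (u < 0 || u > n * m) return 0;
        if (n == 0 || m == 0) return u == 0 ? 1 : 0;
        if (Cache.TryGetValue((n, m, u), out double cached)) return cached;

        double result = Count(n - 1, m, u - m) + Count(n, m - 1, u);
        Cache[(n, m, u)] = result;
        return result;
    }

    // P(Dominance = u) under equivalent populations: counts divided by C(n + m, n).
    public static double Pmf(int n, int m, int u)
    {
        double total = 1;
        for (int i = 1; i <= n; i++)
            total *= (double)(m + i) / i; // C(n + m, n) in floating point
        return Count(n, m, u) / total;
    }
}
```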
Direct implementation of this recurrence requires time and memory that grow quadratically with the sample sizes. An alternative recurrence (Löffler 1982) exploits cycle structure:

$$u \cdot c_{n,m}(u) = \sum_{i=0}^{u-1} c_{n,m}(i)\, \sigma(u - i)$$

where $\sigma(t)$ captures structural properties through divisors:

$$\sigma(t) = \sum_{\substack{d \mid t \\ 1 \le d \le n}} d \;-\; \sum_{\substack{d \mid t \\ m < d \le m+n}} d$$

This reduces memory to linear scale and enables efficient computation of the cumulative distribution term by term.
The algorithm computes cumulative probabilities sequentially until the cumulative probability reaches the threshold misrate/2. By symmetry, the lower and upper tails require the same threshold $u^*$, so the total margin equals $2 u^*$.
The sequential computation proceeds incrementally. Starting from $u = 0$ with base probability $1 / \binom{n+m}{n}$, the algorithm computes the probability of $u = 1$, then $u = 2$, and so on, accumulating the cumulative distribution function with each step. The loop terminates as soon as the accumulated probability reaches misrate/2, returning the threshold value without computing further probabilities.
This sequential approach performs particularly well for small misrates. The threshold typically remains small even with large sample sizes, requiring only a few iterations whether $n$ and $m$ equal 50 or 200. The algorithm computes only the extreme tail probabilities needed to reach the threshold, never touching the vast majority of probability mass concentrated near $n \cdot m / 2$. This efficiency advantage grows as misrates decrease: stricter bounds require fewer computed values, making exact calculation particularly attractive for high-confidence applications.
Approximate method
Large samples make exact computation impractical. The dominance count concentrates near $n m / 2$ with variance $n m (n + m + 1) / 12$. A basic normal approximation suffices asymptotically:

$$\mathrm{Dominance}(x, y) \;\approx\; \mathcal{N}\!\left(\frac{n m}{2},\; \frac{n m (n + m + 1)}{12}\right)$$
This approximation underestimates tail probabilities for moderate sample sizes. The normal approximation provides a baseline but fails to capture the true distribution shape in the tails, producing mis-calibrated probabilities that become problematic for small error rates.
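For comparison with the Edgeworth-corrected CDF implemented later in this section, the plain normal baseline can be sketched as follows. The class name is hypothetical, and `Erf` uses a textbook rational approximation rather than the toolkit's `AcmAlgorithm209.Gauss` routine.

```csharp
using System;

static class NormalApproxSketch
{
    // Baseline normal approximation to the dominance-count CDF, with the same
    // -0.5 continuity correction as the Edgeworth version shown below.
    public static double Cdf(int n, int m, long u)
    {
        double nm = (double)n * m;
        double mean = nm / 2.0;
        double sd = Math.Sqrt(nm * (n + m + 1) / 12.0);
        double z = (u - mean - 0.5) / sd;
        return 0.5 * (1.0 + Erf(z / Math.Sqrt(2.0)));
    }

    // Textbook rational approximation to erf (Abramowitz & Stegun 7.1.26).
    private static double Erf(double x)
    {
        double t = 1.0 / (1.0 + 0.3275911 * Math.Abs(x));
        double poly = ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
                        - 0.284496736) * t + 0.254829592) * t;
        double y = 1.0 - poly * Math.Exp(-x * x);
        return x >= 0 ? y : -y;
    }
}
```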
The Edgeworth expansion refines this approximation through moment-based corrections (Fix & Hodges 1955). The expansion starts with the normal cumulative distribution as a baseline, then adds correction terms that account for the distribution's asymmetry (skewness) and tail weight (kurtosis). These corrections use Hermite polynomials to adjust the baseline curve where the normal approximation deviates most from the true distribution. The first few correction terms typically achieve the practical balance between accuracy and computational cost, substantially improving tail probability estimates compared to the basic approximation.
The standardized comparison count (with continuity correction):

$$z = \frac{u - n m / 2 - 1/2}{\sqrt{n m (n + m + 1) / 12}}$$

produces the approximated cumulative distribution:

$$F(u) \approx \Phi(z) + e_3 f_3(z) + e_5 f_5(z) + e_7 f_7(z)$$

where $\Phi$ denotes the standard normal CDF.
The correction coefficients depend on standardized moments:

$$e_3 = \frac{1}{4!}\left(\frac{\mu_4}{\mu_2^2} - 3\right), \qquad
e_5 = \frac{1}{6!}\left(\frac{\mu_6}{\mu_2^3} - 15\frac{\mu_4}{\mu_2^2} + 30\right), \qquad
e_7 = \frac{35}{8!}\left(\frac{\mu_4}{\mu_2^2} - 3\right)^2$$

The central moments $\mu_2$, $\mu_4$, $\mu_6$ are computed from the sample sizes:

$$\mu_2 = \frac{n m (n + m + 1)}{12}$$

$$\mu_4 = \frac{n m (n + m + 1)}{240}\left(5 m n (m + n) - 2(m^2 + n^2) + 3 m n - 2(n + m)\right)$$

$$\mu_6 = \frac{n m (n + m + 1)}{4032}\Big(35 m^2 n^2 (m^2 + n^2) + 70 m^3 n^3 - 42 m n (m^3 + n^3) - 14 m^2 n^2 (n + m) + 16 (n^4 + m^4) - 52 n m (n^2 + m^2) - 43 n^2 m^2 + 32 (m^3 + n^3) + 14 m n (n + m) + 8 (n^2 + m^2) + 16 n m - 8 (n + m)\Big)$$

The correction terms use Hermite polynomials:

$$f_3(z) = -\varphi(z)\,(z^3 - 3 z), \qquad
f_5(z) = -\varphi(z)\,(z^5 - 10 z^3 + 15 z), \qquad
f_7(z) = -\varphi(z)\,(z^7 - 21 z^5 + 105 z^3 - 105 z)$$

where $\varphi$ denotes the standard normal density.
Binary search locates the threshold value efficiently. The algorithm maintains a search interval $[a, b]$ initialized to $[0, n \cdot m]$. Each iteration computes the midpoint $c$ and evaluates the Edgeworth CDF at $c$. If the CDF value falls below the target tail probability (misrate/2), the threshold lies above $c$ and the search continues in $[c, b]$. Otherwise, the threshold lies at or below $c$ and the search continues in $[a, c]$. The loop terminates when $a$ and $b$ become adjacent, requiring $O(\log(n \cdot m))$ CDF evaluations.
This binary search exhibits uniform performance across misrate values. Whether the misrate is large or tiny, the algorithm performs the same number of iterations, determined solely by the sample sizes. Each CDF evaluation costs constant time regardless of the threshold location, making the approximate method particularly efficient for large samples where exact computation becomes impractical. The logarithmic scaling ensures that doubling the sample size adds only one additional iteration, enabling practical computation for samples in the thousands or tens of thousands.
The toolkit selects between exact and approximate computation based on combined sample size: the exact method for $n + m \leq 400$, the approximate method for $n + m > 400$. The exact method guarantees correctness, but its memory and time grow quadratically with sample size; for $n = m = 200$, this requires 40,000 memory locations. The approximate method achieves 1% accuracy with constant-time evaluations. For large samples, the approximate method completes in milliseconds versus minutes for exact computation.
Both methods handle discrete data. Repeated measurements produce tied pairwise differences, creating plateaus in the sorted sequence. The exact method counts orderings without assuming continuity. The approximate method's moment-based corrections capture the actual distribution shape regardless of discreteness.
Minimum achievable misrate
The misrate parameter controls how many extreme pairwise differences the bounds exclude. A lower misrate means higher confidence: the bounds exclude fewer extreme values and therefore become wider. However, sample size limits how small the misrate can meaningfully become.
Consider the most extreme configuration: all measurements from $x$ exceed all measurements from $y$, giving $\mathrm{Dominance}(x, y) = n \cdot m$. Under equivalent populations, this arrangement occurs purely by chance. The probability equals the chance of having all elements from $x$ occupy the top $n$ positions among the $n + m$ total measurements:

$$P_{\text{extreme}} = \frac{1}{\binom{n+m}{n}}$$
This represents the minimum probability of the most extreme ordering under random sampling. Because the margin splits evenly between two tails, the smallest supportable misrate is twice this probability: $\text{misrate}_{\min} = 2 / \binom{n+m}{n}$. Setting the misrate below this value makes bounds construction problematic. The exact distribution of the dominance count cannot support misrates smaller than the probability of its most extreme realization. Attempting to construct bounds with a smaller misrate forces the algorithm to exclude zero pairwise differences from the tails, making the margin zero. The resulting bounds span all pairwise differences, returning the full range from the smallest to the largest pairwise difference regardless of the desired confidence level. These bounds convey no useful information beyond the range of observed pairwise differences.
For small samples, $\text{misrate}_{\min}$ can exceed commonly used values. With $n = m = 3$, the minimum misrate equals 0.1, making the typical choice of 0.05 impossible. With $n = m = 2$, the minimum rises to 0.333, exceeding even generous choices such as 0.25.
The table below shows $\text{misrate}_{\min}$ for small sample sizes ($n$ in rows, $m$ in columns):
| n \ m | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1.000000 | 0.666667 | 0.500000 | 0.400000 | 0.333333 | 0.285714 | 0.250000 | 0.222222 | 0.200000 | 0.181818 |
| 2 | 0.666667 | 0.333333 | 0.200000 | 0.133333 | 0.095238 | 0.071429 | 0.055556 | 0.044444 | 0.036364 | 0.030303 |
| 3 | 0.500000 | 0.200000 | 0.100000 | 0.057143 | 0.035714 | 0.023810 | 0.016667 | 0.012121 | 0.009091 | 0.006993 |
| 4 | 0.400000 | 0.133333 | 0.057143 | 0.028571 | 0.015873 | 0.009524 | 0.006061 | 0.004040 | 0.002797 | 0.001998 |
| 5 | 0.333333 | 0.095238 | 0.035714 | 0.015873 | 0.007937 | 0.004329 | 0.002525 | 0.001554 | 0.000999 | 0.000666 |
| 6 | 0.285714 | 0.071429 | 0.023810 | 0.009524 | 0.004329 | 0.002165 | 0.001166 | 0.000666 | 0.000400 | 0.000250 |
| 7 | 0.250000 | 0.055556 | 0.016667 | 0.006061 | 0.002525 | 0.001166 | 0.000583 | 0.000311 | 0.000175 | 0.000103 |
| 8 | 0.222222 | 0.044444 | 0.012121 | 0.004040 | 0.001554 | 0.000666 | 0.000311 | 0.000155 | 0.000082 | 0.000046 |
| 9 | 0.200000 | 0.036364 | 0.009091 | 0.002797 | 0.000999 | 0.000400 | 0.000175 | 0.000082 | 0.000041 | 0.000022 |
| 10 | 0.181818 | 0.030303 | 0.006993 | 0.001998 | 0.000666 | 0.000250 | 0.000103 | 0.000046 | 0.000022 | 0.000011 |
For meaningful bounds construction, choose a misrate above the minimum achievable value $\text{misrate}_{\min}(n, m)$. This ensures the margin function excludes at least some extreme pairwise differences, producing bounds narrower than the full range. When working with small samples, verify that the desired misrate exceeds the minimum for the given sample sizes. With moderate sample sizes ($n = m \geq 5$), $\text{misrate}_{\min}$ drops below 0.01, making standard choices like 0.05 feasible.
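The table entries are consistent with the two-sided formula $\text{misrate}_{\min} = 2 / \binom{n+m}{n}$. A short sketch to reproduce them; the helper below is hypothetical and distinct from the toolkit's `MinAchievableMisrate.TwoSample`:

```csharp
static class MinMisrateSketch
{
    // Two-sided minimum achievable misrate, assuming misrate_min = 2 / C(n + m, n),
    // which reproduces the table above.
    public static double TwoSampleMin(int n, int m)
    {
        double binom = 1;
        for (int i = 1; i <= n; i++)
            binom *= (double)(m + i) / i; // C(n + m, n) in floating point
        return 2.0 / binom;
    }
}

// Example: TwoSampleMin(5, 5) = 2 / 252 ≈ 0.007937, matching the table entry.
```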
using System;
using System.Collections.Generic;
using static System.Math;
using JetBrains.Annotations;
using Pragmastat.Exceptions;
using Pragmastat.Internal;
namespace Pragmastat.Functions;
/// <summary>
/// PairwiseMargin function
/// </summary>
/// <param name="threshold">The maximum value for n+m, after which implementation switches from exact to approx</param>
internal class PairwiseMargin(int threshold = PairwiseMargin.MaxExactSize)
{
public static readonly PairwiseMargin Instance = new();
private const int MaxExactSize = 400;
[PublicAPI]
public int Calc(int n, int m, double misrate)
{
if (n <= 0)
throw AssumptionException.Domain(Subject.X);
if (m <= 0)
throw AssumptionException.Domain(Subject.Y);
if (double.IsNaN(misrate) || misrate < 0 || misrate > 1)
throw AssumptionException.Domain(Subject.Misrate);
double minMisrate = MinAchievableMisrate.TwoSample(n, m);
if (misrate < minMisrate)
throw AssumptionException.Domain(Subject.Misrate);
return n + m <= threshold
? CalcExact(n, m, misrate)
: CalcApprox(n, m, misrate);
}
internal int CalcExact(int n, int m, double misrate)
{
int raw = CalcExactRaw(n, m, misrate / 2);
return checked(raw * 2);
}
internal int CalcApprox(int n, int m, double misrate)
{
long raw = CalcApproxRaw(n, m, misrate / 2);
long margin = raw * 2;
if (margin > int.MaxValue)
throw new OverflowException($"Pairwise margin exceeds supported range for n={n}, m={m}");
return (int)margin;
}
// Inverse implementation of Andreas Löffler's (1982) "Über eine Partition der nat. Zahlen und ihre Anwendung beim U-Test"
private static int CalcExactRaw(int n, int m, double p)
{
double total = n + m < BinomialCoefficientFunction.MaxAcceptableN
? BinomialCoefficientFunction.BinomialCoefficient(n + m, m)
: BinomialCoefficientFunction.BinomialCoefficient(n + m, m * 1.0);
var pmf = new List<double> { 1 }; // pmf[0] = 1
var sigma = new List<double> { 0 }; // sigma[0] is unused
int u = 0;
double cdf = 1.0 / total;
if (cdf >= p)
return 0;
while (true)
{
u++;
// Ensure sigma has entry for u
if (sigma.Count <= u)
{
int value = 0;
for (int d = 1; d <= n; d++)
if (u % d == 0 && u >= d)
value += d;
for (int d = m + 1; d <= m + n; d++)
if (u % d == 0 && u >= d)
value -= d;
sigma.Add(value);
}
// Compute pmf[u] using Loeffler recurrence
double sum = 0;
for (int i = 0; i < u; i++)
sum += pmf[i] * sigma[u - i];
sum /= u;
pmf.Add(sum);
cdf += sum / total;
if (cdf >= p)
return u;
if (sum == 0)
break;
}
return pmf.Count - 1;
}
// Inverse Edgeworth Approximation
private static long CalcApproxRaw(int n, int m, double misrate)
{
long a = 0;
long b = (long)n * m;
while (a < b - 1)
{
long c = (a + b) / 2;
double p = EdgeworthCdf(n, m, c);
if (p < misrate)
a = c;
else
b = c;
}
return EdgeworthCdf(n, m, b) < misrate ? b : a;
}
private static double EdgeworthCdf(int n, int m, long u)
{
double nm = (double)n * m;
double mu = nm / 2.0;
double su = Sqrt(nm * (n + m + 1) / 12.0);
// -0.5 continuity correction: computing P(U ≥ u) for a right-tail discrete CDF
double z = (u - mu - 0.5) / su;
double phi = Exp(-z.Sqr() / 2) / Sqrt(2 * PI);
double Phi = AcmAlgorithm209.Gauss(z);
// Pre-compute powers of n and m for efficiency
double n2 = (double)n * n;
double n3 = n2 * n;
double n4 = n2 * n2;
double m2 = (double)m * m;
double m3 = m2 * m;
double m4 = m2 * m2;
double mu2 = (double)n * m * (n + m + 1) / 12.0;
double mu4 =
(double)n * m * (n + m + 1) *
(0
+ 5 * m * n * (m + n)
- 2 * (m2 + n2)
+ 3 * m * n
- 2 * (n + m)
) / 240.0;
double mu6 =
(double)n * m * (n + m + 1) *
(0
+ 35 * m2 * n2 * (m2 + n2)
+ 70 * m3 * n3
- 42 * m * n * (m3 + n3)
- 14 * m2 * n2 * (n + m)
+ 16 * (n4 + m4)
- 52 * n * m * (n2 + m2)
- 43 * n2 * m2
+ 32 * (m3 + n3)
+ 14 * m * n * (n + m)
+ 8 * (n2 + m2)
+ 16 * n * m
- 8 * (n + m)
) / 4032.0;
// Pre-compute powers of mu2 and related terms
double mu2_2 = mu2 * mu2;
double mu2_3 = mu2_2 * mu2;
double mu4_mu2_2 = mu4 / mu2_2;
// Factorial constants: 4! = 24, 6! = 720, 8! = 40320
double e3 = (mu4_mu2_2 - 3) / 24.0;
double e5 = (mu6 / mu2_3 - 15 * mu4_mu2_2 + 30) / 720.0;
double e7 = 35 * (mu4_mu2_2 - 3) * (mu4_mu2_2 - 3) / 40320.0;
// Pre-compute powers of z for Hermite polynomials
double z2 = z * z;
double z3 = z2 * z;
double z5 = z3 * z2;
double z7 = z5 * z2;
double f3 = -phi * (z3 - 3 * z);
double f5 = -phi * (z5 - 10 * z3 + 15 * z);
double f7 = -phi * (z7 - 21 * z5 + 105 * z3 - 105 * z);
double edgeworth = Phi + e3 * f3 + e5 * f5 + e7 * f7;
return Min(Max(edgeworth, 0), 1);
}
}
Notes
The Mann-Whitney test (also known as the Wilcoxon rank-sum test) ranks among the most widely used non-parametric statistical tests, testing whether two independent samples come from the same distribution. Under normality, it achieves nearly the same precision as Student's t-test, while maintaining reliability under diverse distributional conditions where the t-test fails.
The test operates by comparing all pairs of measurements between the two samples. Given samples $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_m)$, the Mann-Whitney statistic counts how many pairs satisfy $x_i > y_j$:

$$U = \#\{(i, j) : x_i > y_j\}$$

If the samples come from the same distribution, $U$ should be near $n \cdot m / 2$ (roughly half the pairs favor $x$, half favor $y$). Large deviations from $n \cdot m / 2$ suggest the distributions differ.
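To make the counting concrete, here is a brute-force sketch of the comparison count (hypothetical helper name; shown only for illustration):

```csharp
static class DominanceSketch
{
    // Counts how many of the n*m pairs (x_i, y_j) satisfy x_i > y_j
    // (brute force, for illustration only).
    public static long Count(double[] x, double[] y)
    {
        long u = 0;
        foreach (double xi in x)
            foreach (double yj in y)
                if (xi > yj)
                    u++;
        return u;
    }
}

// Under equivalent distributions the count concentrates near n*m/2;
// e.g. Count(new[] { 1.0, 3.0 }, new[] { 2.0, 4.0 }) returns 1 of the 4 pairs.
```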
The test answers: could this value of $U$ arise by chance if the samples were truly equivalent? The p-value quantifies this probability. If $p < 0.05$, traditional practice declares the difference statistically significant.
This approach creates several problems for practitioners:
- Binary thinking. The test produces a yes/no answer: reject or fail to reject the null hypothesis. Practitioners typically want to know the magnitude of difference, not just whether one exists.
- Arbitrary thresholds. The 0.05 threshold has no universal justification, yet it dominates practice and creates a false dichotomy between $p < 0.05$ and $p \geq 0.05$.
- Hypothesis-centric framework. The test assumes a null hypothesis of no difference and evaluates evidence against it. Real questions rarely concern exact equality; practitioners want to know "how different?" rather than "different or not?"
- Inverted logic. The natural question is "what shifts are consistent with my data?" The test answers "is this specific shift (zero) consistent with my data?"
The toolkit inverts this framework. Instead of testing whether a hypothesized shift is plausible, we compute which shifts are plausible given the data. This inversion transforms hypothesis testing into bounds estimation.
The mathematical foundation remains the same. The distribution of pairwise comparisons under random sampling determines which order statistics of pairwise differences form reliable bounds. The Mann-Whitney statistic measures pairwise comparisons ($x_i > y_j$). The shift estimator uses pairwise differences ($x_i - y_j$). These quantities are mathematically related: a pairwise difference is positive exactly when $x_i > y_j$. The toolkit renames this comparison count from $U$ to Dominance, clarifying its purpose: measuring how often one sample dominates the other in pairwise comparisons.
The distribution of the dominance count determines which order statistics form reliable bounds. Define the margin function PairwiseMargin(n, m, misrate): it computes how many extreme pairwise differences could occur by chance with probability misrate, based on the distribution of pairwise comparisons.
The function requires knowing the distribution of pairwise comparisons under sampling. Two computational approaches exist:
- Exact computation (Löffler's algorithm, 1982). Uses a recurrence relation to compute the exact distribution of pairwise comparisons for small samples. Practical for combined sample sizes up to several hundred.
- Approximation (Edgeworth expansion; Fix & Hodges 1955). Refines the normal approximation with correction terms based on higher moments of the distribution. Provides accurate results for large samples where exact computation becomes impractical.
The toolkit automatically selects the appropriate method based on sample sizes, ensuring both accuracy and computational efficiency.
This approach naturally complements the toolkit's other pairwise estimators:
- Center uses the median of pairwise averages
- Shift uses the median of pairwise differences
- the shift bounds use order statistics of the same pairwise differences
All procedures build on pairwise operations. This structural consistency reflects the mathematical unity underlying robust statistics: pairwise operations provide natural robustness while maintaining computational feasibility and statistical efficiency.
The inversion from hypothesis testing to bounds estimation represents a philosophical shift in statistical practice. Traditional methods ask "should I believe this specific hypothesis?" Pragmatic methods ask "what should I believe, given this data?" Bounds provide actionable answers: they tell practitioners which values are plausible, enabling informed decisions without arbitrary significance thresholds.
Traditional Mann-Whitney implementations apply tie correction when samples contain repeated values. This correction modifies variance calculations to account for tied observations, changing p-values and confidence intervals in ways that depend on measurement precision. The toolkit deliberately omits tie correction. Continuous distributions produce theoretically distinct values; observed ties result from finite measurement precision and digital representation. When measurements appear identical, this reflects rounding of underlying continuous variation, not true equality in the measured quantity. Treating ties as artifacts of discretization rather than distributional features simplifies computation while maintaining accuracy. The exact and approximate methods compute comparison distributions without requiring adjustments for tied values, eliminating a source of complexity and potential inconsistency in statistical practice.
Historical Development
The mathematical foundations emerged through decades of refinement. Mann and Whitney (1947) established the distribution of pairwise comparisons under random sampling, creating the theoretical basis for comparing samples through rank-based methods. Their work demonstrated that comparison counts follow predictable patterns regardless of the underlying population distributions.
The original computational approaches suffered from severe limitations. Mann and Whitney proposed a slow exact method requiring exponential resources and a normal approximation that proved grossly inaccurate for practical use. The approximation works reasonably in distribution centers but fails catastrophically in the tails where practitioners most need accuracy. For moderate sample sizes, tail approximation errors can reach orders of magnitude.
Fix and Hodges (1955) addressed the approximation problem through higher-order corrections. Their expansion adds terms based on the distribution's actual moments rather than assuming perfect normality. This refinement reduces tail probability errors from orders of magnitude to roughly 1%, making approximation practical for large samples where exact computation becomes infeasible.
Löffler (1982) solved the exact computation problem through algorithmic innovation. The naive recurrence requires quadratic memory, which is infeasible for samples beyond a few dozen measurements. Löffler discovered a reformulation that reduces memory to linear scale, making exact computation practical for combined sample sizes up to several hundred.
Despite these advances, most statistical software continues using the 1947 approximation. The computational literature contains the solutions, but software implementations lag decades behind theoretical developments. This toolkit implements both the exact method for small samples and the refined approximation for large samples, automatically selecting the appropriate approach based on sample sizes.
The shift from hypothesis testing to bounds estimation requires no new mathematics. The same comparison distributions that enable hypothesis tests also determine which order statistics form reliable bounds. Traditional applications ask "is zero plausible?" and answer yes or no. This toolkit asks "which values are plausible?" and answers with an interval. The perspective inverts while the mathematical foundation remains identical.
Tests
The test suite contains 178 test cases (4 demo + 4 natural + 10 edge + 12 small grid + 148 large grid). The domain constraints ($n \geq 1$, $m \geq 1$, misrate within $[\text{misrate}_{\min}, 1]$) are enforced; inputs violating them return a domain error. Combinations where the requested misrate falls below the minimum achievable misrate are excluded from the grid.
Demo examples (4 tests) — from the manual introduction:
- demo-1: , , , expected output:
- demo-2: , , , expected output:
- demo-3: , , , expected output:
- demo-4: , , , expected output:
These demo cases match the reference values used throughout the manual to illustrate construction.
Natural sequences ( × 2 misrates, filtered by min misrate) — 4 tests:
- Misrate values:
- After filtering by , only 4 combinations survive:
- natural-3-3-mr1: , , , expected output:
- natural-3-4-mr1: , ,
- natural-4-3-mr1: , ,
- natural-4-4-mr1: , ,
Edge cases — boundary condition validation (10 tests):
- boundary-min: , , (minimum samples with maximum misrate, expected output: )
- boundary-zero-margin-small: , , (strict misrate with sufficient samples)
- boundary-loose: , , (very permissive misrate)
- symmetry-2-5: , , (tests symmetry property)
- symmetry-5-2: , , (symmetric counterpart, same output as above)
- symmetry-3-7: , , (asymmetric sizes)
- symmetry-7-3: , , (symmetric counterpart)
- asymmetry-extreme-1-100: , , (extreme size difference)
- asymmetry-extreme-100-1: , , (reversed extreme)
- asymmetry-extreme-2-50: , , (highly unbalanced)
These edge cases validate correct handling of boundary conditions, the symmetry property PairwiseMargin(n, m, misrate) = PairwiseMargin(m, n, misrate), and extreme asymmetry in sample sizes.
Comprehensive grid — systematic coverage for thorough validation:
Small sample combinations ( × 6 misrates, filtered) — 12 tests:
- Misrate values:
- Combinations where are excluded
- Test naming: n{n}_m{m}_mr{k}, where {k} is the negative log10 of the misrate
- Examples:
  - n5_m5_mr1: , , , expected output:
  - n5_m5_mr2: , ,
Large sample combinations ( × 6 misrates, filtered) — 148 tests:
- Misrate values: same as small samples
- Combinations where are excluded (affects at misrates and )
- Test naming: n{n}_m{m}_r{k}, where {k} is the negative log10 of the misrate
- Examples:
  - n10_m10_r1: , , , expected output:
  - n50_m50_r3: , , , expected output:
  - n100_m100_r6: , , , expected output:
The comprehensive grid validates both symmetric ($n = m$) and asymmetric sample size combinations across six orders of magnitude in misrate, ensuring robust coverage of the parameter space.