Algorithms
This chapter describes the core algorithms that power the toolkit.
Pseudorandom Number Generation
The toolkit provides deterministic randomization utilities that produce identical sequences across all supported programming languages. This cross-language reproducibility enables deterministic experiments and cross-validation of statistical analyses.
The core random number generator uses the xoshiro256++ algorithm (see Blackman & Vigna 2021), a member of the xoshiro/xoroshiro family developed by David Blackman and Sebastiano Vigna. This algorithm was selected for several reasons:
- Quality: passes all tests in the BigCrush test suite from TestU01
- Speed: extremely fast due to simple bitwise operations (shifts, rotations, XORs)
- Period: period of 2^256 - 1, sufficient for parallel simulations
- Adoption: used by .NET 6+, Julia, and Rust's rand crate
The algorithm maintains a 256-bit state (four 64-bit words s0, s1, s2, s3) and produces 64-bit outputs. Each step updates the state through a combination of XOR, shift, and rotate operations.
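The update step can be sketched in Python (an illustrative sketch following Blackman & Vigna's reference implementation, not the toolkit's own source; the explicit 64-bit masking emulates unsigned overflow):

```python
MASK64 = (1 << 64) - 1

def rotl(x, k):
    """64-bit left rotation."""
    return ((x << k) | (x >> (64 - k))) & MASK64

def xoshiro256pp_next(s):
    """Advance a 4-word xoshiro256++ state in place; return a 64-bit output."""
    result = (rotl((s[0] + s[3]) & MASK64, 23) + s[0]) & MASK64
    t = (s[1] << 17) & MASK64
    s[2] ^= s[0]
    s[3] ^= s[1]
    s[1] ^= s[2]
    s[0] ^= s[3]
    s[2] ^= t
    s[3] = rotl(s[3], 45)
    return result
```

Identical initial states always yield identical output streams, which is exactly the cross-language reproducibility property described above.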
Seed Initialization
Converting a single seed value into the full 256-bit state requires a seeding algorithm. The toolkit uses SplitMix64 (see Steele et al. 2014) for this purpose.
Four consecutive outputs from SplitMix64 initialize the four words of the xoshiro256++ state (s0, s1, s2, s3). This approach provides high-quality initial states from simple integer seeds.
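The seeding path can be sketched as follows (a Python sketch using the standard SplitMix64 constants; function names are ours, not the toolkit's API):

```python
MASK64 = (1 << 64) - 1

def splitmix64(seed):
    """Generator yielding the SplitMix64 sequence for a 64-bit seed."""
    state = seed & MASK64
    while True:
        state = (state + 0x9E3779B97F4A7C15) & MASK64
        z = state
        z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
        z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
        yield z ^ (z >> 31)

def seed_xoshiro256(seed):
    """Derive a 256-bit xoshiro256++ state from a single integer seed."""
    g = splitmix64(seed)
    return [next(g) for _ in range(4)]
```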
String Seeds
For named experiments, string seeds are converted to integers using the FNV-1a hash (see Fowler et al. 1991).
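A Python sketch of the 64-bit FNV-1a hash (standard offset basis and prime; the toolkit's exact byte-encoding convention is assumed to be UTF-8):

```python
def fnv1a_64(text):
    """64-bit FNV-1a hash of a UTF-8 string."""
    h = 0xCBF29CE484222325               # FNV-1a 64-bit offset basis
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 0x100000001B3) & ((1 << 64) - 1)   # FNV prime, mod 2^64
    return h
```

The resulting integer then feeds the SplitMix64 seeding path like any numeric seed.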
This enables meaningful experiment identifiers while maintaining determinism.
Uniform Floating-Point Generation
To generate uniform values in [0, 1), the upper 53 bits of a 64-bit output are used: value = (output >> 11) * 2^-53. The 53-bit significand of IEEE 754 double precision ensures every multiple of 2^-53 in [0, 1) is reachable.
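The conversion is a one-liner; here is a Python sketch with its boundary behavior:

```python
def uniform_from_bits(u64):
    """Map a 64-bit RNG output to a double in [0, 1) using the top 53 bits."""
    return (u64 >> 11) * (1.0 / (1 << 53))
```

The largest possible output maps to (2^53 - 1) / 2^53, so the result is always strictly below 1.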
Shuffle Algorithm
The shuffle function uses the Fisher-Yates algorithm (see Fisher & Yates 1938, Knuth 1997), also known as the Knuth shuffle:
for i from n-1 down to 1:
    j = uniform_int(0, i+1)    # uniform j in [0, i]
    swap(array[i], array[j])
This produces a uniformly random permutation in O(n) time with O(1) additional space. The algorithm is unbiased: each of the n! permutations has equal probability.
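The pseudocode above translates directly to Python (a sketch using Python's random.Random in place of the toolkit's xoshiro generator):

```python
import random

def fisher_yates_shuffle(array, rng):
    """In-place unbiased shuffle; rng.randrange stands in for uniform_int."""
    for i in range(len(array) - 1, 0, -1):
        j = rng.randrange(i + 1)   # uniform j in [0, i]
        array[i], array[j] = array[j], array[i]
```

With a fixed seed, the permutation is fully deterministic, matching the reproducibility guarantees described earlier.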
Selection Sampling
The sampling function uses selection sampling (see Fan et al. 1962) to select k elements from a sequence of n without replacement:
seen = 0, selected = 0
for each element x at position i:
    if uniform() < (k - selected) / (n - seen):
        output x
        selected += 1
    seen += 1
This algorithm preserves the original order of elements (order of first appearance) and requires only a single pass through the data. Each element is selected with the correct marginal probability k/n, producing a simple random sample.
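A Python sketch of the pass above (again substituting Python's random.Random for the toolkit's generator). Note the two invariants: the acceptance probability becomes 1 when the remaining elements are all needed, and 0 once k elements are taken, so the output always has exactly k elements:

```python
import random

def selection_sample(items, k, rng):
    """Single-pass sample of k items without replacement, preserving order."""
    n = len(items)
    out = []
    seen = 0
    for x in items:
        # Accept with probability (k - selected) / (n - seen)
        if rng.random() < (k - len(out)) / (n - seen):
            out.append(x)
        seen += 1
    return out
```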
Distribution Sampling
Box-Muller Transform
The normal distribution sampler uses the Box-Muller transform (see Box & Muller 1958) to convert uniform random values to normally distributed values. Given two independent uniform values u1 and u2:

z0 = sqrt(-2 ln u1) * cos(2 pi u2)
z1 = sqrt(-2 ln u1) * sin(2 pi u2)

Both z0 and z1 are independent standard normal values. The implementation uses only z0 to maintain cross-language determinism.
Numerical Constants
Distribution sampling requires handling edge cases where floating-point operations would be undefined. Two constants are used across all language implementations:
Machine Epsilon (eps = 2^-52 ≈ 2.220446049250313e-16): The smallest eps such that 1 + eps != 1 in float64 arithmetic.
Used when u = 1 to avoid log(0) or division by zero in inverse transform sampling.
Smallest Positive Subnormal (2^-1074 ≈ 4.94e-324): The smallest positive value representable in IEEE 754 binary64.
Used when u1 = 0 to avoid log(0) in the Box-Muller transform.
All language implementations use the same literal values for these constants (not language-specific builtins like Number.EPSILON or f64::EPSILON) to ensure bit-identical outputs across languages.
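The two literals and their defining properties can be checked directly in any IEEE 754 environment. A Python sketch (the clamp helper is a hypothetical illustration of how the constants are applied, not the toolkit's API):

```python
EPS = 2.220446049250313e-16   # machine epsilon for float64 (2**-52)
MIN_SUBNORMAL = 5e-324        # smallest positive subnormal float64 (2**-1074)

def clamp_unit_open(u):
    """Hypothetical helper: keep u inside (0, 1) for inverse-transform formulas."""
    if u <= 0.0:
        return MIN_SUBNORMAL
    if u >= 1.0:
        return 1.0 - EPS
    return u
```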
Inverse Transform Sampling
Other distributions use inverse transform sampling. Given a uniform value u in [0, 1) and the inverse CDF F^-1, the sample is x = F^-1(u):
- Exponential (rate lambda): x = -ln(1 - u) / lambda
- Pareto (Power) with scale x_m and shape alpha: x = x_m / (1 - u)^(1/alpha)
- Uniform(a, b): x = a + u * (b - a)
The log-normal distribution samples z from Normal(mu, sigma) and applies the exponential transform: x = exp(z).
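The inverse CDFs above are standard; a Python sketch (the toolkit's exact parameterization, e.g. -ln(u) versus -ln(1 - u), is an assumption here):

```python
import math

def inv_exponential(u, rate):
    """Exponential(rate) via inverse CDF: F^-1(u) = -ln(1 - u) / rate."""
    return -math.log(1.0 - u) / rate

def inv_pareto(u, xm, alpha):
    """Pareto(xm, alpha) via inverse CDF: F^-1(u) = xm / (1 - u)^(1/alpha)."""
    return xm / (1.0 - u) ** (1.0 / alpha)

def inv_uniform(u, a, b):
    """Uniform(a, b) via inverse CDF: F^-1(u) = a + u * (b - a)."""
    return a + u * (b - a)
```

For example, the exponential median is -ln(0.5) = ln 2, recovered exactly at u = 0.5.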
Fast Center
The Center estimator computes the median of all pairwise averages from a sample. Given a dataset x_1, ..., x_n, this estimator is defined as:

Center = median of (x_i + x_j) / 2 over all 1 <= i <= j <= n

A direct implementation would generate all pairwise averages and sort them. With n = 10,000, this approach creates approximately 50 million values, requiring quadratic memory and time.
The breakthrough came in 1984 when John Monahan developed an algorithm that reduces expected complexity to O(n log n) while using only linear memory (see Monahan 1984). The algorithm exploits the inherent structure in pairwise sums rather than computing them explicitly. After sorting the values x_1 <= x_2 <= ... <= x_n, the algorithm considers the implicit upper triangular matrix M where M[i][j] = x_i + x_j for i <= j. This matrix has a crucial property: each row and column is sorted in non-decreasing order, enabling efficient median selection without storing the matrix.
Rather than sorting all pairwise sums, the algorithm uses a selection approach similar to quickselect. It maintains search bounds for each matrix row and iteratively narrows the search space. For each row i, the algorithm tracks active column indices from L_i to R_i, defining which pairwise sums remain candidates for the median. It selects a candidate sum as a pivot using randomized selection from active matrix elements, then counts how many pairwise sums fall below the pivot. Because both rows and columns are sorted, this counting takes only O(n) time using a two-pointer sweep from the matrix's upper-right corner.
The median corresponds to rank k = (N + 1) / 2, where N = n(n + 1) / 2 is the total number of pairwise sums (for even N, the two middle ranks are averaged). If fewer than k sums lie below the pivot, the median must be larger; if more than k sums lie below the pivot, the median must be smaller. Based on this comparison, the algorithm eliminates portions of each row that cannot contain the median, shrinking the active search space while preserving the true median.
Real data often contain repeated values, which can cause the selection process to stall. When the algorithm detects no progress between iterations, it switches to a midrange strategy: find the smallest and largest pairwise sums still in the search space, then use their average as the next pivot. If the minimum equals the maximum, all remaining candidates are identical and the algorithm terminates. This tie-breaking mechanism ensures reliable convergence with discrete or duplicated data.
The algorithm achieves O(n log n) expected time through linear partitioning (each pivot evaluation requires only O(n) operations) and logarithmic iterations (randomized pivot selection leads to O(log n) expected iterations, similar to quickselect). The algorithm maintains only row bounds and counters, using O(n) additional space. This matches the complexity of sorting a single array while avoiding the quadratic memory and time explosion of computing all pairwise combinations.
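The counting step is the heart of the algorithm. Here is a minimal Python sketch of the two-pointer count of pairwise sums below a pivot (the same sweep the partition step of the C# listing below performs), which can be validated against a brute-force count:

```python
def count_sums_below(x, pivot):
    """Count pairs (i, j) with i <= j and x[i] + x[j] < pivot; x sorted ascending."""
    n = len(x)
    count = 0
    col = n - 1  # global column pointer, only ever moves left
    for row in range(n):
        # Row sums increase with the column; find the last column below pivot.
        while col >= row and x[row] + x[col] >= pivot:
            col -= 1
        if col >= row:
            count += col - row + 1
    return count
```

Because x[row] grows as the row index advances, the boundary column never moves right, so the whole sweep is O(n).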
namespace Pragmastat.Algorithms;
using System;
using System.Collections.Generic;
using System.Linq;
using static System.Math;
internal static class FastCenter
{
/// <summary>
/// ACM Algorithm 616: fast computation of the Hodges-Lehmann location estimator
/// </summary>
/// <remarks>
/// Computes the median of all pairwise averages (xi + xj)/2 efficiently.
/// See: John F Monahan, "Algorithm 616: fast computation of the Hodges-Lehmann location estimator"
/// (1984) DOI: 10.1145/1271.319414
/// </remarks>
/// <param name="values">Sample of values (must be sorted ascending when isSorted is true)</param>
/// <param name="random">Random number generator (a new one is created when null)</param>
/// <param name="isSorted">Whether the values are already sorted</param>
/// <returns>Exact center value (Hodges-Lehmann estimator)</returns>
public static double Estimate(IReadOnlyList<double> values, Random? random = null, bool isSorted = false)
{
int n = values.Count;
if (n == 1) return values[0];
if (n == 2) return (values[0] + values[1]) / 2;
random ??= new Random();
if (!isSorted)
values = values.OrderBy(x => x).ToList();
// Calculate target median rank(s) among all pairwise sums
long totalPairs = (long)n * (n + 1) / 2;
long medianRankLow = (totalPairs + 1) / 2; // For odd totalPairs, this is the median
long medianRankHigh =
(totalPairs + 2) / 2; // For even totalPairs, average of ranks medianRankLow and medianRankHigh
// Initialize search bounds for each row in the implicit matrix
long[] leftBounds = new long[n];
long[] rightBounds = new long[n];
long[] partitionCounts = new long[n];
for (int i = 0; i < n; i++)
{
leftBounds[i] = i + 1; // Row i can pair with columns [i+1..n] (1-based indexing)
rightBounds[i] = n; // Initially, all columns are available
}
// Start with a good pivot: sum of middle elements (handles both odd and even n)
double pivot = values[(n - 1) / 2] + values[n / 2];
long activeSetSize = totalPairs;
long previousCount = 0;
while (true)
{
// === PARTITION STEP ===
// Count pairwise sums less than current pivot
long countBelowPivot = 0;
long currentColumn = n;
for (int row = 1; row <= n; row++)
{
partitionCounts[row - 1] = 0;
// Move left from current column until we find sums < pivot
// This exploits the sorted nature of the matrix
while (currentColumn >= row && values[row - 1] + values[(int)currentColumn - 1] >= pivot)
currentColumn--;
// Count elements in this row that are < pivot
if (currentColumn >= row)
{
long elementsBelow = currentColumn - row + 1;
partitionCounts[row - 1] = elementsBelow;
countBelowPivot += elementsBelow;
}
}
// === CONVERGENCE CHECK ===
// If no progress, we have ties - break them using midrange strategy
if (countBelowPivot == previousCount)
{
double minActiveSum = double.MaxValue;
double maxActiveSum = double.MinValue;
// Find the range of sums still in the active search space
for (int i = 0; i < n; i++)
{
if (leftBounds[i] > rightBounds[i]) continue; // Skip empty rows
double rowValue = values[i];
double smallestInRow = values[(int)leftBounds[i] - 1] + rowValue;
double largestInRow = values[(int)rightBounds[i] - 1] + rowValue;
minActiveSum = Min(minActiveSum, smallestInRow);
maxActiveSum = Max(maxActiveSum, largestInRow);
}
pivot = (minActiveSum + maxActiveSum) / 2;
if (pivot <= minActiveSum || pivot > maxActiveSum) pivot = maxActiveSum;
// If all remaining values are identical, we're done
if (minActiveSum == maxActiveSum || activeSetSize <= 2)
return pivot / 2;
continue;
}
// === TARGET CHECK ===
// Check if we've found the median rank(s)
bool atTargetRank = countBelowPivot == medianRankLow || countBelowPivot == medianRankHigh - 1;
if (atTargetRank)
{
// Find the boundary values: largest < pivot and smallest >= pivot
double largestBelowPivot = double.MinValue;
double smallestAtOrAbovePivot = double.MaxValue;
for (int i = 1; i <= n; i++)
{
long countInRow = partitionCounts[i - 1];
double rowValue = values[i - 1];
long totalInRow = n - i + 1;
// Find largest sum in this row that's < pivot
if (countInRow > 0)
{
long lastBelowIndex = i + countInRow - 1;
double lastBelowValue = rowValue + values[(int)lastBelowIndex - 1];
largestBelowPivot = Max(largestBelowPivot, lastBelowValue);
}
// Find smallest sum in this row that's >= pivot
if (countInRow < totalInRow)
{
long firstAtOrAboveIndex = i + countInRow;
double firstAtOrAboveValue = rowValue + values[(int)firstAtOrAboveIndex - 1];
smallestAtOrAbovePivot = Min(smallestAtOrAbovePivot, firstAtOrAboveValue);
}
}
// Calculate final result based on whether we have odd or even number of pairs
if (medianRankLow < medianRankHigh)
{
// Even total: average the two middle values
return (smallestAtOrAbovePivot + largestBelowPivot) / 4;
}
else
{
// Odd total: return the single middle value
bool needLargest = countBelowPivot == medianRankLow;
return (needLargest ? largestBelowPivot : smallestAtOrAbovePivot) / 2;
}
}
// === UPDATE BOUNDS ===
// Narrow the search space based on partition result
if (countBelowPivot < medianRankLow)
{
// Too few values below pivot - eliminate smaller values, search higher
for (int i = 0; i < n; i++)
leftBounds[i] = i + partitionCounts[i] + 1;
}
else
{
// Too many values below pivot - eliminate larger values, search lower
for (int i = 0; i < n; i++)
rightBounds[i] = i + partitionCounts[i];
}
// === PREPARE NEXT ITERATION ===
previousCount = countBelowPivot;
// Recalculate how many elements remain in the active search space
activeSetSize = 0;
for (int i = 0; i < n; i++)
{
long rowSize = rightBounds[i] - leftBounds[i] + 1;
activeSetSize += Max(0, rowSize);
}
// Choose next pivot based on remaining active set size
if (activeSetSize > 2)
{
// Use randomized row median strategy for efficiency
// Handle large activeSetSize by using double precision for random selection
double randomFraction = random.NextDouble();
long targetIndex = (long)(randomFraction * activeSetSize);
int selectedRow = 0;
// Find which row contains the target index
long cumulativeSize = 0;
for (int i = 0; i < n; i++)
{
long rowSize = Max(0, rightBounds[i] - leftBounds[i] + 1);
if (targetIndex < cumulativeSize + rowSize)
{
selectedRow = i;
break;
}
cumulativeSize += rowSize;
}
// Use median element of the selected row as pivot
long medianColumnInRow = (leftBounds[selectedRow] + rightBounds[selectedRow]) / 2;
pivot = values[selectedRow] + values[(int)medianColumnInRow - 1];
}
else
{
// Few elements remain - use midrange strategy
double minRemainingSum = double.MaxValue;
double maxRemainingSum = double.MinValue;
for (int i = 0; i < n; i++)
{
if (leftBounds[i] > rightBounds[i]) continue; // Skip empty rows
double rowValue = values[i];
double minInRow = values[(int)leftBounds[i] - 1] + rowValue;
double maxInRow = values[(int)rightBounds[i] - 1] + rowValue;
minRemainingSum = Min(minRemainingSum, minInRow);
maxRemainingSum = Max(maxRemainingSum, maxInRow);
}
pivot = (minRemainingSum + maxRemainingSum) / 2;
if (pivot <= minRemainingSum || pivot > maxRemainingSum)
pivot = maxRemainingSum;
if (minRemainingSum == maxRemainingSum)
return pivot / 2;
}
}
}
}
Fast Spread
The Spread estimator computes the median of all pairwise absolute differences. Given a sample x_1, ..., x_n, this estimator is defined as:

Spread = median of |x_i - x_j| over all 1 <= i < j <= n
Like Center, computing Spread naively requires generating all pairwise differences, sorting them, and finding the median, a quadratic approach that becomes computationally prohibitive for large datasets.
The same structural principles that accelerate Center computation also apply to pairwise differences, yielding an exact algorithm. After sorting the input to obtain x_1 <= x_2 <= ... <= x_n, all pairwise absolute differences with i < j become positive differences x_j - x_i. This allows considering the implicit upper triangular matrix D where D[i][j] = x_j - x_i for i < j. This matrix has a crucial structural property: for a fixed row i, differences increase monotonically with j, while for a fixed column j, differences decrease as i increases. This sorted structure enables linear-time counting of elements below any threshold.
The algorithm applies Monahan's selection strategy (Monahan 1984), adapted for differences rather than sums. For each row i, it tracks active column indices representing differences still under consideration, initially spanning columns i + 1 through n. It chooses candidate differences from the active set using weighted random row selection, maintaining expected logarithmic convergence while avoiding expensive pivot computations. For any pivot value, the number of differences falling below it is counted using a single sweep; the monotonic structure ensures this counting requires only O(n) operations. While counting, the largest difference below the pivot and the smallest difference at or above it are maintained; these boundary values become the exact answer when the target rank is reached.
The algorithm naturally handles both odd and even cases. For an odd number of differences, it returns the single middle element when the count exactly hits the median rank. For an even number of differences, it returns the average of the two middle elements; boundary tracking during counting provides both values simultaneously. Unlike approximation methods, this algorithm returns the precise median of all pairwise differences; randomness affects only performance, not correctness.
The algorithm includes the same stall-handling mechanisms as the center algorithm. It tracks whether the count below the pivot changes between iterations; when progress stalls due to tied values, it computes the range of remaining active differences and pivots to their midrange. This midrange strategy ensures convergence even with highly discrete data or datasets with many identical values.
Several optimizations make the implementation practical for production use. A global column pointer that never moves backward during counting exploits the matrix structure to avoid redundant comparisons. Exact boundary values are captured during each counting pass, eliminating the need for additional searches when the target rank is reached. Using only O(n) additional space for row bounds and counters, the algorithm achieves expected O(n log n) time complexity with minimal memory overhead, making robust scale estimation practical for large datasets.
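The difference-counting sweep mirrors the pairwise-sum count from the Center section, with the pointer moving forward instead of backward. A Python sketch (illustrative; the C# listing below is the reference implementation):

```python
def count_diffs_below(x, pivot):
    """Count pairs i < j with x[j] - x[i] < pivot; x sorted ascending."""
    n = len(x)
    count = 0
    j = 1  # global column pointer, only ever moves right
    for i in range(n - 1):
        if j < i + 1:
            j = i + 1
        # Differences in a row grow with j; advance to the first one >= pivot.
        while j < n and x[j] - x[i] < pivot:
            j += 1
        count += j - (i + 1)
    return count
```

As i increases, x[j] - x[i] shrinks for fixed j, so the boundary column only moves right and the sweep stays O(n).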
namespace Pragmastat.Algorithms;
using System;
using System.Collections.Generic;
using static System.Math;
internal static class FastSpread
{
/// <summary>
/// Shamos "Spread". Expected O(n log n) time, O(n) extra space. Exact.
/// </summary>
public static double Estimate(IReadOnlyList<double> values, Random? random = null, bool isSorted = false)
{
int n = values.Count;
if (n <= 1) return 0;
if (n == 2) return Abs(values[1] - values[0]);
random ??= new Random();
// Prepare a sorted working copy.
double[] a = isSorted ? CopySorted(values) : EnsureSorted(values);
// Total number of pairwise differences with i < j
long N = (long)n * (n - 1) / 2;
long kLow = (N + 1) / 2; // 1-based rank of lower middle
long kHigh = (N + 2) / 2; // 1-based rank of upper middle
// Per-row active bounds over columns j (0-based indices).
// Row i allows j in [i+1, n-1] initially.
int[] L = new int[n];
int[] R = new int[n];
long[] rowCounts = new long[n]; // # of elements in row i that are < pivot (for current partition)
for (int i = 0; i < n; i++)
{
L[i] = Min(i + 1, n); // n means empty
R[i] = n - 1; // inclusive
if (L[i] > R[i])
{
L[i] = 1;
R[i] = 0;
} // mark empty
}
// A reasonable initial pivot: a central gap
double pivot = a[n / 2] - a[(n - 1) / 2];
long prevCountBelow = -1;
while (true)
{
// === PARTITION: count how many differences are < pivot; also track boundary neighbors ===
long countBelow = 0;
double largestBelow = double.NegativeInfinity; // max difference < pivot
double smallestAtOrAbove = double.PositiveInfinity; // min difference >= pivot
int j = 1; // global two-pointer (non-decreasing across rows)
for (int i = 0; i < n - 1; i++)
{
if (j < i + 1) j = i + 1;
while (j < n && a[j] - a[i] < pivot) j++;
long cntRow = j - (i + 1);
if (cntRow < 0) cntRow = 0;
rowCounts[i] = cntRow;
countBelow += cntRow;
// boundary elements for this row
if (cntRow > 0)
{
// last < pivot in this row is (j-1)
double candBelow = a[j - 1] - a[i];
if (candBelow > largestBelow) largestBelow = candBelow;
}
if (j < n)
{
double candAtOrAbove = a[j] - a[i];
if (candAtOrAbove < smallestAtOrAbove) smallestAtOrAbove = candAtOrAbove;
}
}
// === TARGET CHECK ===
// If we've split exactly at the middle, we can return using the boundaries we just found.
bool atTarget =
(countBelow == kLow) || // lower middle is the largest < pivot
(countBelow == (kHigh - 1)); // upper middle is the smallest >= pivot
if (atTarget)
{
if (kLow < kHigh)
{
// Even N: average the two central order stats.
return 0.5 * (largestBelow + smallestAtOrAbove);
}
else
{
// Odd N: pick the single middle depending on which side we hit.
bool needLargest = (countBelow == kLow);
return needLargest ? largestBelow : smallestAtOrAbove;
}
}
// === STALL HANDLING (ties / no progress) ===
if (countBelow == prevCountBelow)
{
// Compute min/max remaining difference in the ACTIVE set and pivot to their midrange.
double minActive = double.PositiveInfinity;
double maxActive = double.NegativeInfinity;
long active = 0;
for (int i = 0; i < n - 1; i++)
{
int Li = L[i], Ri = R[i];
if (Li > Ri) continue;
double rowMin = a[Li] - a[i];
double rowMax = a[Ri] - a[i];
if (rowMin < minActive) minActive = rowMin;
if (rowMax > maxActive) maxActive = rowMax;
active += (Ri - Li + 1);
}
if (active <= 0)
{
// No active candidates left: the only consistent answer is the boundary implied by counts.
// Fall back to neighbors from this partition.
if (kLow < kHigh) return 0.5 * (largestBelow + smallestAtOrAbove);
return (countBelow >= kLow) ? largestBelow : smallestAtOrAbove;
}
if (maxActive <= minActive) return minActive; // all remaining equal
double mid = 0.5 * (minActive + maxActive);
pivot = (mid > minActive && mid <= maxActive) ? mid : maxActive;
prevCountBelow = countBelow;
continue;
}
// === SHRINK ACTIVE WINDOW ===
if (countBelow < kLow)
{
// Need larger differences: discard all strictly below pivot.
for (int i = 0; i < n - 1; i++)
{
// First j with a[j] - a[i] >= pivot is j = i + 1 + cntRow (may be n => empty row)
int newL = i + 1 + (int)rowCounts[i];
if (newL > L[i]) L[i] = newL; // do NOT clamp; allow L[i] == n to mean empty
if (L[i] > R[i])
{
L[i] = 1;
R[i] = 0;
} // mark empty
}
}
else
{
// Too many below: keep only those strictly below pivot.
for (int i = 0; i < n - 1; i++)
{
// Last j with a[j] - a[i] < pivot is j = i + cntRow (not cntRow-1!)
int newR = i + (int)rowCounts[i];
if (newR < R[i]) R[i] = newR; // shrink downward to the true last-below
if (R[i] < i + 1)
{
L[i] = 1;
R[i] = 0;
} // empty row if none remain
}
}
prevCountBelow = countBelow;
// === CHOOSE NEXT PIVOT FROM ACTIVE SET (weighted random row, then row median) ===
long activeSize = 0;
for (int i = 0; i < n - 1; i++)
{
if (L[i] <= R[i]) activeSize += (R[i] - L[i] + 1);
}
if (activeSize <= 2)
{
// Few candidates left: return midrange of remaining exactly.
double minRem = double.PositiveInfinity, maxRem = double.NegativeInfinity;
for (int i = 0; i < n - 1; i++)
{
if (L[i] > R[i]) continue;
double lo = a[L[i]] - a[i];
double hi = a[R[i]] - a[i];
if (lo < minRem) minRem = lo;
if (hi > maxRem) maxRem = hi;
}
if (activeSize <= 0) // safety net; fall back to boundary from last partition
{
if (kLow < kHigh) return 0.5 * (largestBelow + smallestAtOrAbove);
return (countBelow >= kLow) ? largestBelow : smallestAtOrAbove;
}
if (kLow < kHigh) return 0.5 * (minRem + maxRem);
return (Abs((kLow - 1) - countBelow) <= Abs(countBelow - kLow)) ? minRem : maxRem;
}
else
{
long t = NextIndex(random, activeSize); // 0..activeSize-1
long acc = 0;
int row = 0;
for (; row < n - 1; row++)
{
if (L[row] > R[row]) continue;
long size = R[row] - L[row] + 1;
if (t < acc + size) break;
acc += size;
}
// Median column of the selected row
int col = (L[row] + R[row]) >> 1;
pivot = a[col] - a[row];
}
}
}
// --- Helpers ---
private static double[] CopySorted(IReadOnlyList<double> values)
{
var a = new double[values.Count];
for (int i = 0; i < a.Length; i++)
{
double v = values[i];
if (double.IsNaN(v)) throw new ArgumentException("NaN not allowed.", nameof(values));
a[i] = v;
}
Array.Sort(a);
return a;
}
private static double[] EnsureSorted(IReadOnlyList<double> values)
{
// Trust caller; still copy to array for fast indexed access.
var a = new double[values.Count];
for (int i = 0; i < a.Length; i++)
{
double v = values[i];
if (double.IsNaN(v)) throw new ArgumentException("NaN not allowed.", nameof(values));
a[i] = v;
}
return a;
}
private static long NextIndex(Random rng, long limitExclusive)
{
// Uniform 0..limitExclusive-1 even for large ranges.
// Use rejection sampling for correctness.
ulong uLimit = (ulong)limitExclusive;
if (uLimit <= int.MaxValue)
{
return rng.Next((int)uLimit);
}
while (true)
{
ulong u = ((ulong)(uint)rng.Next() << 32) | (uint)rng.Next();
ulong r = u % uLimit;
if (u - r <= ulong.MaxValue - (ulong.MaxValue % uLimit)) return (long)r;
}
}
}
Fast Shift
The Shift estimator measures the median of all pairwise differences between elements of two samples. Given samples x = (x_1, ..., x_m) and y = (y_1, ..., y_n), this estimator is defined as:

Shift = median of (x_i - y_j) over all pairs (i, j)

This definition represents a special case of a more general problem: computing arbitrary quantiles of all pairwise differences. For samples of size m and n, the total number of pairwise differences is m * n. A naive approach would materialize all differences, sort them, and extract the desired quantile. With m = n = 10,000, this approach creates 100 million values, requiring quadratic memory and time.
The presented algorithm avoids materializing pairwise differences by exploiting the sorted structure of the samples. After sorting both samples to obtain x_1 <= ... <= x_m and y_1 <= ... <= y_n, the key insight is that it's possible to count how many pairwise differences fall below any threshold without computing them explicitly. This counting operation enables a binary search over the continuous space of possible difference values, iteratively narrowing the search range until it converges to the exact quantile.
The algorithm operates through a value-space search rather than index-space selection. It maintains a search interval [lo, hi], initialized to the range of all possible differences: lo = x_1 - y_n, hi = x_m - y_1. At each iteration, it selects a candidate value within this interval and counts how many pairwise differences are less than or equal to this threshold. For the median (quantile p = 0.5), if fewer than half the differences lie below the threshold, the median must be larger; if more than half lie below, the median must be smaller. Based on this comparison, the search space is reduced by eliminating portions that cannot contain the target quantile.
The counting operation achieves linear complexity via a two-pointer sweep. For a given threshold t, the number of pairs satisfying x_i - y_j <= t is counted. This is equivalent to counting pairs where y_j >= x_i - t. For each row i in the implicit matrix of differences, a column pointer advances through the sorted array y while x_i - y_j > t, stopping at the first position j where x_i - y_j <= t. All subsequent positions in that row satisfy the condition, contributing n - j pairs to the count for row i. Because both samples are sorted, the column pointer advances monotonically and never backtracks across rows, making each counting pass O(m + n) regardless of the total number of differences.
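The cross-sample sweep can be sketched in a few lines of Python (illustrative; the C# listing below is the reference):

```python
def count_shift_le(x, y, t):
    """Count pairs with x[i] - y[j] <= t; both x and y sorted ascending."""
    m, n = len(x), len(y)
    count = 0
    j = 0  # only advances: a larger x[i] needs a larger y[j] to stay <= t
    for i in range(m):
        while j < n and x[i] - y[j] > t:
            j += 1
        count += n - j  # every remaining y[j] satisfies x[i] - y[j] <= t
    return count
```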
During each counting pass, the algorithm tracks boundary values: the largest difference at or below the threshold and the smallest difference above it. When the count exactly matches the target rank (or the two middle ranks for even-length samples), these boundary values provide the exact answer without additional searches. For Type-7 quantile computation (Hyndman & Fan 1996), which interpolates between order statistics, the algorithm collects the necessary boundary values in a single pass and performs linear interpolation: with h = 1 + (N - 1) * p, lower rank l = floor(h), and weight w = h - l, the result is (1 - w) * v_l + w * v_(l+1), where v_k denotes the k-th smallest difference.
Real datasets often contain discrete or repeated values that can cause search stagnation. The algorithm detects when the search interval stops shrinking between iterations, indicating that multiple pairwise differences share the same value. When the closest difference below the threshold equals the closest above, all remaining candidates are identical and the algorithm terminates immediately. Otherwise, it uses the boundary values from the counting pass to snap the search interval to actual difference values, ensuring reliable convergence even with highly discrete data.
The binary search employs numerically stable midpoint calculations and terminates when the search interval collapses to a single value or when boundary tracking confirms convergence. Iteration limits are included as a safety mechanism, though convergence typically occurs much earlier due to the exponential narrowing of the search space.
The algorithm naturally generalizes to multiple quantiles by computing each one independently. For k quantiles with samples of size m and n, the total complexity is O(k * (m + n) * log(1/eps)), where eps represents the convergence precision. This is dramatically more efficient than the naive approach, especially for large m and n with small k. The algorithm requires only O(1) additional space beyond the sorted input arrays, making it practical for large-scale statistical analysis where memory constraints prohibit materializing quadratic data structures.
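For validating a fast implementation, a quadratic reference is useful: materialize all differences and apply the Type-7 rule directly. A Python sketch (test scaffolding, not the production path):

```python
import math

def naive_shift_quantile(x, y, p):
    """Quadratic reference: all pairwise differences, Type-7 quantile.
    h = 1 + (N - 1) * p; interpolate between ranks floor(h) and floor(h) + 1."""
    diffs = sorted(xi - yj for xi in x for yj in y)
    n = len(diffs)
    h = 1.0 + (n - 1) * p
    lo = int(math.floor(h))        # 1-based lower rank
    w = h - lo
    lo_v = diffs[lo - 1]
    hi_v = diffs[min(lo, n - 1)]   # element at rank lo + 1, clamped
    return lo_v if w == 0.0 else (1.0 - w) * lo_v + w * hi_v
```

The fast algorithm must agree with this reference on every input; only the cost differs.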
namespace Pragmastat.Algorithms;
using System;
using System.Collections.Generic;
using System.Linq;
public static class FastShift
{
/// <summary>
/// Computes quantiles of all pairwise differences { x_i - y_j }.
/// Time: O((m + n) * log(precision)) per quantile. Space: O(1).
/// </summary>
/// <param name="p">Probabilities in [0, 1].</param>
/// <param name="assumeSorted">If false, collections will be sorted.</param>
public static double[] Estimate(IReadOnlyList<double> x, IReadOnlyList<double> y, double[] p, bool assumeSorted = false)
{
if (x == null || y == null || p == null)
throw new ArgumentNullException();
if (x.Count == 0 || y.Count == 0)
throw new ArgumentException("x and y must be non-empty.");
foreach (double pk in p)
if (double.IsNaN(pk) || pk < 0.0 || pk > 1.0)
throw new ArgumentOutOfRangeException(nameof(p), "Probabilities must be within [0, 1].");
double[] xs, ys;
if (assumeSorted)
{
xs = x as double[] ?? x.ToArray();
ys = y as double[] ?? y.ToArray();
}
else
{
xs = x.OrderBy(v => v).ToArray();
ys = y.OrderBy(v => v).ToArray();
}
int m = xs.Length;
int n = ys.Length;
long total = (long)m * n;
// Type-7 quantile: h = 1 + (n-1)*p, then interpolate between floor(h) and ceil(h)
var requiredRanks = new SortedSet<long>();
var interpolationParams = new (long lowerRank, long upperRank, double weight)[p.Length];
for (int i = 0; i < p.Length; i++)
{
double h = 1.0 + (total - 1) * p[i];
long lowerRank = (long)Math.Floor(h);
long upperRank = (long)Math.Ceiling(h);
double weight = h - lowerRank;
if (lowerRank < 1) lowerRank = 1;
if (upperRank > total) upperRank = total;
interpolationParams[i] = (lowerRank, upperRank, weight);
requiredRanks.Add(lowerRank);
requiredRanks.Add(upperRank);
}
var rankValues = new Dictionary<long, double>();
foreach (long rank in requiredRanks)
rankValues[rank] = SelectKthPairwiseDiff(xs, ys, rank);
var result = new double[p.Length];
for (int i = 0; i < p.Length; i++)
{
var (lowerRank, upperRank, weight) = interpolationParams[i];
double lower = rankValues[lowerRank];
double upper = rankValues[upperRank];
result[i] = weight == 0.0 ? lower : (1.0 - weight) * lower + weight * upper;
}
return result;
}
// Binary search in [min_diff, max_diff] that snaps to actual discrete values.
// Avoids materializing all m*n differences.
private static double SelectKthPairwiseDiff(double[] x, double[] y, long k)
{
int m = x.Length;
int n = y.Length;
long total = (long)m * n;
if (k < 1 || k > total)
throw new ArgumentOutOfRangeException(nameof(k));
double searchMin = x[0] - y[n - 1];
double searchMax = x[m - 1] - y[0];
if (double.IsNaN(searchMin) || double.IsNaN(searchMax))
throw new InvalidOperationException("NaN in input values.");
const int maxIterations = 128; // Sufficient for double precision convergence
double prevMin = double.NegativeInfinity;
double prevMax = double.PositiveInfinity;
for (int iter = 0; iter < maxIterations && searchMin != searchMax; iter++)
{
double mid = Midpoint(searchMin, searchMax);
CountAndNeighbors(x, y, mid, out long countLessOrEqual, out double closestBelow, out double closestAbove);
if (closestBelow == closestAbove)
return closestBelow;
// No progress means we're stuck between two discrete values
if (searchMin == prevMin && searchMax == prevMax)
return countLessOrEqual >= k ? closestBelow : closestAbove;
prevMin = searchMin;
prevMax = searchMax;
if (countLessOrEqual >= k)
searchMax = closestBelow;
else
searchMin = closestAbove;
}
if (searchMin != searchMax)
throw new InvalidOperationException("Convergence failure (pathological input).");
return searchMin;
}
// Two-pointer algorithm: counts pairs where x[i] - y[j] <= threshold, and tracks
// the closest actual differences on either side of threshold.
private static void CountAndNeighbors(
double[] x, double[] y, double threshold,
out long countLessOrEqual, out double closestBelow, out double closestAbove)
{
int m = x.Length, n = y.Length;
long count = 0;
double maxBelow = double.NegativeInfinity;
double minAbove = double.PositiveInfinity;
int j = 0;
for (int i = 0; i < m; i++)
{
while (j < n && x[i] - y[j] > threshold)
j++;
count += (n - j);
if (j < n)
{
double diff = x[i] - y[j];
if (diff > maxBelow) maxBelow = diff;
}
if (j > 0)
{
double diff = x[i] - y[j - 1];
if (diff < minAbove) minAbove = diff;
}
}
// Fallback to actual min/max if no boundaries found (shouldn't happen in normal operation)
if (double.IsNegativeInfinity(maxBelow))
maxBelow = x[0] - y[n - 1];
if (double.IsPositiveInfinity(minAbove))
minAbove = x[m - 1] - y[0];
countLessOrEqual = count;
closestBelow = maxBelow;
closestAbove = minAbove;
}
private static double Midpoint(double a, double b) => a + (b - a) * 0.5;
}
Fast Ratio
The Ratio estimator measures the typical multiplicative relationship between elements of two samples. Given samples x = (x_1, ..., x_m) and y = (y_1, ..., y_n) with all positive values, this estimator is defined as the median of all pairwise ratios:

Ratio(x, y) = median { x_i / y_j }

A naive approach would compute all m·n ratios, sort them, and extract the median. With m = n = 10,000, this creates 100 million values, requiring quadratic memory and time.
The presented algorithm avoids this cost by exploiting the multiplicative-additive duality. Taking the logarithm of a ratio yields a difference:

log(x_i / y_j) = log(x_i) − log(y_j)

This transforms the problem of finding the median of pairwise ratios into finding the median of pairwise differences in log-space.
The algorithm operates in three steps:
- Log-transform — apply log to each element of both samples. If x was sorted, log(x) remains sorted (log is monotonically increasing for positive values).
- Delegate to FastShift — use the Fast Shift algorithm to compute the desired quantile of pairwise differences in log-space. This leverages the O((m + n) · log(1/ε)) complexity of FastShift.
- Exp-transform — apply exp to convert the result back to ratio-space. If the log-space median is w, then exp(w) is the original ratio.

The total complexity is O((m + n) · log(1/ε)) per quantile, where ε represents the convergence precision of the log-space binary search. This is dramatically more efficient than the naive approach.
Memory usage is O(m + n) for storing the log-transformed samples, compared to O(m·n) for materializing all pairwise ratios.
This algorithm naturally extends to computing bounds on ratios: the ratio bounds computation uses the same log-exp transformation, delegating to the corresponding shift bounds computation in log-space.
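The log-exp duality is easy to verify against a brute-force computation. The following Python sketch uses illustrative helper names (`naive_ratio_median`, `fast_ratio_median`); the toolkit's actual implementation delegates the log-space step to FastShift rather than computing it naively:

```python
import math
import random
import statistics

def naive_ratio_median(x, y):
    """Median of all m*n pairwise ratios, materialized explicitly."""
    return statistics.median(xi / yj for xi in x for yj in y)

def fast_ratio_median(x, y):
    """Same median via the log-exp duality:
    median(x_i / y_j) = exp(median(log x_i - log y_j)).
    The log-space median is still computed naively here; the toolkit
    delegates this step to FastShift instead."""
    log_diffs = [math.log(xi) - math.log(yj) for xi in x for yj in y]
    return math.exp(statistics.median(log_diffs))

random.seed(42)
# Odd m*n keeps the median a single order statistic, so exp and median
# commute exactly (no averaging of two midpoints is involved).
x = [random.uniform(1.0, 10.0) for _ in range(25)]
y = [random.uniform(1.0, 10.0) for _ in range(21)]
assert math.isclose(naive_ratio_median(x, y), fast_ratio_median(x, y), rel_tol=1e-12)
```

Note that when m·n is even, the conventional median averages two middle values, and the arithmetic mean in log-space corresponds to a geometric mean in ratio-space; the exactness above relies on the median being a single element.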
namespace Pragmastat.Algorithms;
using System;
using System.Collections.Generic;
using System.Linq;
using Pragmastat.Exceptions;
using Pragmastat.Internal;
/// <summary>
/// Computes quantiles of pairwise ratios via log-transformation and FastShift delegation.
/// Ratio(x, y) = exp(Shift(log(x), log(y)))
/// </summary>
public static class FastRatio
{
/// <summary>
/// Computes quantiles of all pairwise ratios { x_i / y_j }.
/// Time: O((m + n) * log(precision)) per quantile. Space: O(m + n).
/// </summary>
/// <remarks>
/// Log-transformation preserves sort order for positive values.
/// </remarks>
/// <param name="x">First sample (must be strictly positive).</param>
/// <param name="y">Second sample (must be strictly positive).</param>
/// <param name="p">Probabilities in [0, 1].</param>
/// <param name="assumeSorted">If true, assumes x and y are already sorted in ascending order.</param>
/// <returns>Quantile values for each probability in p.</returns>
public static double[] Estimate(IReadOnlyList<double> x, IReadOnlyList<double> y, double[] p, bool assumeSorted = false)
{
// Log-transform both samples (includes positivity check)
var logX = MathExtensions.Log(x, Subject.X);
var logY = MathExtensions.Log(y, Subject.Y);
// Delegate to FastShift in log-space
var logResult = FastShift.Estimate(logX, logY, p, assumeSorted);
// Exp-transform back to ratio-space
return logResult.Select(v => Math.Exp(v)).ToArray();
}
}
Fast PairwiseMargin
The PairwiseMargin function determines how many extreme pairwise differences to exclude when constructing bounds around Shift(x, y). Given samples x of size n and y of size m, the estimator computes all n·m pairwise differences x_i − y_j and sorts them. The bounds select specific order statistics from this sorted sequence: the values remaining after the most extreme differences are trimmed from each tail. The challenge lies in determining which order statistics produce bounds that contain the true shift with probability 1 − misrate.
Random sampling creates natural variation in pairwise differences. Even when populations have identical distributions, sampling variation produces both positive and negative differences. The margin function quantifies this sampling variability: it specifies how many extreme pairwise differences could occur by chance with probability misrate. For symmetric bounds, this margin splits evenly between the tails, excluding margin/2 differences from each side.
Computing the margin requires understanding the distribution of pairwise comparisons. Each pairwise difference corresponds to a comparison: x_i − y_j > 0 exactly when x_i > y_j. This connection motivates the dominance function:

Γ(x, y) = #{ (i, j) : x_i > y_j }

The dominance function counts how many pairwise comparisons favor x over y. Both Shift and Γ operate on the same collection of pairwise differences. The Shift estimator examines difference values, returning the median as a location estimate. The dominance function examines difference signs, counting how many comparisons produce positive differences. While Shift provides the estimate itself, Γ determines which order statistics form reliable bounds around that estimate.
When populations have equivalent distributions, Γ concentrates near n·m/2 by symmetry. The distribution of Γ across all possible sample orderings determines reliable bounds. If Γ deviates from n·m/2 by at least margin/2 only with probability misrate, then the interval excluding the margin most extreme pairwise differences contains zero with probability 1 − misrate. Translation invariance extends this relationship to arbitrary shifts: the margin computed from the comparison distribution applies regardless of the true shift value.
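As a concrete illustration, the dominance count can be computed directly by brute force (the name `dominance` is illustrative, not part of the toolkit's API):

```python
def dominance(x, y):
    """Count pairwise comparisons favoring x over y: #{(i, j) : x_i > y_j}.
    Equivalently, the number of positive pairwise differences x_i - y_j."""
    return sum(1 for xi in x for yj in y if xi > yj)

# With similar samples the count lands near n*m/2 = 4.5.
x = [1.0, 4.0, 6.0]
y = [2.0, 3.0, 5.0]
assert dominance(x, y) == 5
assert dominance(y, x) == 9 - 5  # without ties, the two counts partition all pairs
```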
Two computational approaches provide the distribution of Γ: exact calculation for small samples and approximation for large samples.
Exact method
Small sample sizes allow exact computation without approximation. The exact approach exploits a fundamental symmetry: under equivalent populations, all orderings of the combined measurements occur with equal probability. This symmetry enables direct calculation of how many orderings produce each comparison count.
Direct computation faces a combinatorial challenge. Enumerating all orderings to count comparison outcomes requires substantial resources. For samples beyond a few dozen measurements, naive implementation becomes impractical.
Löffler's recurrence relation (Löffler 1982) resolves this through algebraic structure. The recurrence exploits cycle properties in the comparison distribution, reducing memory requirements while maintaining exact calculation. The algorithm builds cumulative probabilities sequentially until reaching the threshold corresponding to the desired error rate. This approach extends practical exact computation to combined sample sizes of several hundred.
Define N(u) as the number of orderings producing exactly u comparisons favoring x. The probability mass function of the dominance count Γ becomes:

P(Γ = u) = N(u) / C(n + m, n)

A direct recurrence follows from considering the largest measurement. The rightmost element comes from either x (contributing m comparisons) or y (contributing zero):

N_{n,m}(u) = N_{n−1,m}(u − m) + N_{n,m−1}(u)

with base cases N_{n,0}(0) = N_{0,m}(0) = 1 and N(u) = 0 for u < 0.
Direct implementation requires a table over all (n, m, u) combinations, with time and memory growing as the product of the sample sizes and the range of u. An alternative recurrence (Löffler 1982) exploits cycle structure:

N(u) = (1/u) · Σ_{t=1}^{u} σ(t) · N(u − t)

where σ(t) captures structural properties through divisors:

σ(t) = Σ_{d | t, 1 ≤ d ≤ n} d − Σ_{d | t, m+1 ≤ d ≤ m+n} d

This reduces memory to O(u) and enables computing each successive N(u) from previously accumulated values.
The algorithm computes cumulative probabilities sequentially until the threshold misrate/2 is exceeded. By symmetry, the lower and upper thresholds determine the total margin as twice the one-sided value.
The sequential computation proceeds incrementally. Starting from u = 0 with base probability P(Γ = 0) = 1 / C(n + m, n), the algorithm computes P(Γ = 1), then P(Γ = 2), and so on, accumulating the cumulative distribution function with each step. The loop terminates as soon as the cumulative probability reaches misrate/2, returning the threshold value without computing further probabilities.
This sequential approach performs particularly well for small misrates. For misrate = 0.01, the threshold typically remains small even with large sample sizes, requiring only a few iterations regardless of whether n and m equal 50 or 200. The algorithm computes only the extreme tail probabilities needed to reach the threshold, never touching the vast majority of probability mass concentrated near n·m/2. This efficiency advantage grows as misrates decrease: stricter bounds require fewer computed values, making exact calculation particularly attractive for high-confidence applications.
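The direct recurrence above can be sketched in Python to make the counting concrete. This illustrates the naive tabulated approach rather than the Löffler recurrence the toolkit implements; `orderings` is an illustrative name:

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def orderings(n, m, u):
    """Number of interleavings of n x-values and m y-values in which
    exactly u of the n*m pairwise comparisons favor x."""
    if u < 0:
        return 0
    if n == 0 or m == 0:
        return 1 if u == 0 else 0
    # The largest remaining element is either from x (it beats all m
    # y-values, contributing m comparisons) or from y (contributing none).
    return orderings(n - 1, m, u - m) + orderings(n, m - 1, u)

n, m = 4, 5
total = sum(orderings(n, m, u) for u in range(n * m + 1))
assert total == comb(n + m, n)   # all C(9, 4) = 126 orderings accounted for
assert orderings(n, m, 0) == 1   # the single most extreme arrangement
```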
Approximate method
Large samples make exact computation impractical. The dominance count Γ concentrates near n·m/2 with variance n·m·(n + m + 1)/12. A basic normal approximation suffices asymptotically:

P(Γ ≤ u) ≈ Φ((u − n·m/2) / σ), where σ = sqrt(n·m·(n + m + 1)/12)

This approximation underestimates tail probabilities for moderate sample sizes. The normal approximation provides a baseline but fails to capture the true distribution shape in the tails, producing mis-calibrated probabilities that become problematic for small error rates.
The Edgeworth expansion refines this approximation through moment-based corrections (Fix & Hodges 1955). The expansion starts with the normal cumulative distribution as a baseline, then adds correction terms that account for the distribution's asymmetry (skewness) and tail weight (kurtosis). These corrections use Hermite polynomials to adjust the baseline curve where the normal approximation deviates most from the true distribution. The first few correction terms typically achieve the practical balance between accuracy and computational cost, substantially improving tail probability estimates compared to the basic approximation.
The standardized comparison count (with continuity correction):

z = (u − n·m/2 − 1/2) / σ

produces the approximated cumulative distribution:

F(u) ≈ Φ(z) + e3·f3(z) + e5·f5(z) + e7·f7(z)

where Φ denotes the standard normal CDF.
The correction coefficients depend on standardized moments:

e3 = (μ4/μ2² − 3) / 4!
e5 = (μ6/μ2³ − 15·μ4/μ2² + 30) / 6!
e7 = 35·(μ4/μ2² − 3)² / 8!

The central moments μ2, μ4, μ6 are computed from the sample sizes: μ2 = n·m·(n + m + 1)/12, with μ4 and μ6 given by the polynomial expressions in the implementation below.
The correction terms use Hermite polynomials (φ denotes the standard normal density):

f3(z) = −φ(z)·(z³ − 3z)
f5(z) = −φ(z)·(z⁵ − 10z³ + 15z)
f7(z) = −φ(z)·(z⁷ − 21z⁵ + 105z³ − 105z)

Binary search locates the threshold value efficiently. The algorithm maintains a search interval [a, b] initialized to [0, n·m]. Each iteration computes the midpoint c and evaluates the Edgeworth CDF at c. If F(c) < misrate/2, the threshold lies above c and the search continues in [c, b]. If F(c) ≥ misrate/2, the threshold lies at or below c and the search continues in [a, c]. The loop terminates when a and b become adjacent, requiring O(log(n·m)) CDF evaluations.
This binary search exhibits uniform performance across misrate values. Whether computing bounds for misrate = 0.1 or misrate = 0.001, the algorithm performs the same number of iterations, determined solely by the sample sizes. Each CDF evaluation costs constant time regardless of the threshold location, making the approximate method particularly efficient for large samples where exact computation becomes impractical. The logarithmic scaling ensures that doubling the sample size adds only one additional iteration, enabling practical computation for samples in the thousands or tens of thousands.
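The baseline normal approximation and the threshold search can be sketched as follows. This Python sketch deliberately omits the Edgeworth correction terms, so its thresholds only approximate the toolkit's; the names and the continuity-correction convention are illustrative assumptions:

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def gamma_cdf_normal(n, m, u):
    """Baseline normal approximation to P(Gamma <= u) with a +0.5
    continuity correction (one common convention)."""
    mu = n * m / 2.0
    sigma = sqrt(n * m * (n + m + 1) / 12.0)
    return normal_cdf((u - mu + 0.5) / sigma)

def approx_threshold(n, m, p):
    """Largest u whose approximate CDF stays below p, via binary search."""
    a, b = 0, n * m
    while a < b - 1:
        c = (a + b) // 2
        if gamma_cdf_normal(n, m, c) < p:
            a = c
        else:
            b = c
    return b if gamma_cdf_normal(n, m, b) < p else a

# One-sided threshold for misrate/2 = 0.025 with n = m = 10.
assert approx_threshold(10, 10, 0.025) == 23
```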
The toolkit selects between exact and approximate computation based on combined sample size: the exact method for n + m ≤ 400, the approximate method for n + m > 400. The exact method guarantees correctness but its memory and time grow with the range of the dominance count. For n = m = 200, the distribution spans n·m = 40,000 values. The approximate method achieves about 1% accuracy with constant-time evaluations. For large samples, the approximate method completes in milliseconds versus minutes for exact computation.
Both methods handle discrete data. Repeated measurements produce tied pairwise differences, creating plateaus in the sorted sequence. The exact method counts orderings without assuming continuity. The approximate method's moment-based corrections capture the actual distribution shape regardless of discreteness.
Minimum achievable misrate
The misrate parameter controls how many extreme pairwise differences the bounds exclude. Lower misrate produces wider bounds with higher confidence because fewer extreme values are excluded. However, sample size limits how small misrate can meaningfully become.
Consider the most extreme configuration: all measurements from x exceed all measurements from y, giving Γ = n·m. Under equivalent populations, this arrangement occurs purely by chance. The probability equals the chance of having all n elements from x occupy the top positions among the n + m total measurements:

1 / C(n + m, n)

This represents the minimum probability of the most extreme ordering under random sampling; counting both tails gives the minimum achievable misrate min_misrate = 2 / C(n + m, n). Setting misrate below min_misrate makes bounds construction problematic. The exact distribution of Γ cannot support misrates smaller than the probability of its most extreme realization. Attempting to construct bounds with misrate < min_misrate forces the algorithm to exclude zero pairwise differences from the tails, making margin = 0. The resulting bounds span all pairwise differences, returning the full observed range regardless of the desired confidence level. These bounds convey no useful information beyond the range of observed pairwise differences.
For small samples, min_misrate can exceed commonly used values. With n = m = 3, the minimum misrate equals 0.1, making the typical choice of misrate = 0.05 impossible. With n = m = 2, the minimum becomes 1/3, exceeding even lenient choices like 0.25.
The table below shows min_misrate for small sample sizes:
| n \ m | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1.000000 | 0.666667 | 0.500000 | 0.400000 | 0.333333 | 0.285714 | 0.250000 | 0.222222 | 0.200000 | 0.181818 |
| 2 | 0.666667 | 0.333333 | 0.200000 | 0.133333 | 0.095238 | 0.071429 | 0.055556 | 0.044444 | 0.036364 | 0.030303 |
| 3 | 0.500000 | 0.200000 | 0.100000 | 0.057143 | 0.035714 | 0.023810 | 0.016667 | 0.012121 | 0.009091 | 0.006993 |
| 4 | 0.400000 | 0.133333 | 0.057143 | 0.028571 | 0.015873 | 0.009524 | 0.006061 | 0.004040 | 0.002797 | 0.001998 |
| 5 | 0.333333 | 0.095238 | 0.035714 | 0.015873 | 0.007937 | 0.004329 | 0.002525 | 0.001554 | 0.000999 | 0.000666 |
| 6 | 0.285714 | 0.071429 | 0.023810 | 0.009524 | 0.004329 | 0.002165 | 0.001166 | 0.000666 | 0.000400 | 0.000250 |
| 7 | 0.250000 | 0.055556 | 0.016667 | 0.006061 | 0.002525 | 0.001166 | 0.000583 | 0.000311 | 0.000175 | 0.000103 |
| 8 | 0.222222 | 0.044444 | 0.012121 | 0.004040 | 0.001554 | 0.000666 | 0.000311 | 0.000155 | 0.000082 | 0.000046 |
| 9 | 0.200000 | 0.036364 | 0.009091 | 0.002797 | 0.000999 | 0.000400 | 0.000175 | 0.000082 | 0.000041 | 0.000022 |
| 10 | 0.181818 | 0.030303 | 0.006993 | 0.001998 | 0.000666 | 0.000250 | 0.000103 | 0.000046 | 0.000022 | 0.000011 |
For meaningful bounds construction, choose misrate ≥ min_misrate. This ensures the margin function excludes at least some extreme pairwise differences, producing bounds narrower than the full range. When working with small samples, verify that the desired misrate exceeds min_misrate for the given sample sizes. With moderate sample sizes (n = m ≥ 5), min_misrate drops below 0.01, making standard choices like misrate = 0.05 feasible.
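The minimum achievable misrate is straightforward to compute; this Python sketch (illustrative name) reproduces entries of the table above:

```python
from math import comb

def min_misrate_two_sample(n, m):
    """Two-sided minimum achievable misrate: twice the probability that
    all of x lands above all of y under equivalent populations."""
    return 2.0 / comb(n + m, n)

# Matches the table above.
assert abs(min_misrate_two_sample(3, 3) - 0.1) < 1e-12
assert abs(min_misrate_two_sample(2, 2) - 1.0 / 3.0) < 1e-12
assert min_misrate_two_sample(5, 5) < 0.01  # misrate = 0.05 is feasible
```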
using System;
using System.Collections.Generic;
using JetBrains.Annotations;
using Pragmastat.Exceptions;
using Pragmastat.Internal;
using static System.Math;
namespace Pragmastat.Functions;
/// <summary>
/// PairwiseMargin function
/// </summary>
/// <param name="threshold">The maximum value for n+m, after which implementation switches from exact to approx</param>
internal class PairwiseMargin(int threshold = PairwiseMargin.MaxExactSize)
{
public static readonly PairwiseMargin Instance = new();
private const int MaxExactSize = 400;
[PublicAPI]
public int Calc(int n, int m, double misrate)
{
if (n <= 0)
throw AssumptionException.Domain(Subject.X);
if (m <= 0)
throw AssumptionException.Domain(Subject.Y);
if (double.IsNaN(misrate) || misrate < 0 || misrate > 1)
throw AssumptionException.Domain(Subject.Misrate);
double minMisrate = MinAchievableMisrate.TwoSample(n, m);
if (misrate < minMisrate)
throw AssumptionException.Domain(Subject.Misrate);
return n + m <= threshold
? CalcExact(n, m, misrate)
: CalcApprox(n, m, misrate);
}
internal int CalcExact(int n, int m, double misrate)
{
int raw = CalcExactRaw(n, m, misrate / 2);
return checked(raw * 2);
}
internal int CalcApprox(int n, int m, double misrate)
{
long raw = CalcApproxRaw(n, m, misrate / 2);
long margin = raw * 2;
if (margin > int.MaxValue)
throw new OverflowException($"Pairwise margin exceeds supported range for n={n}, m={m}");
return (int)margin;
}
// Inverse implementation of Andreas Löffler's (1982) "Über eine Partition der nat. Zahlen und ihre Anwendung beim U-Test"
private static int CalcExactRaw(int n, int m, double p)
{
double total = n + m < BinomialCoefficientFunction.MaxAcceptableN
? BinomialCoefficientFunction.BinomialCoefficient(n + m, m)
: BinomialCoefficientFunction.BinomialCoefficient(n + m, m * 1.0);
var pmf = new List<double> { 1 }; // pmf[0] = 1
var sigma = new List<double> { 0 }; // sigma[0] is unused
int u = 0;
double cdf = 1.0 / total;
if (cdf >= p)
return 0;
while (true)
{
u++;
// Ensure sigma has entry for u
if (sigma.Count <= u)
{
int value = 0;
for (int d = 1; d <= n; d++)
if (u % d == 0 && u >= d)
value += d;
for (int d = m + 1; d <= m + n; d++)
if (u % d == 0 && u >= d)
value -= d;
sigma.Add(value);
}
// Compute pmf[u] using Loeffler recurrence
double sum = 0;
for (int i = 0; i < u; i++)
sum += pmf[i] * sigma[u - i];
sum /= u;
pmf.Add(sum);
cdf += sum / total;
if (cdf >= p)
return u;
if (sum == 0)
break;
}
return pmf.Count - 1;
}
// Inverse Edgeworth Approximation
private static long CalcApproxRaw(int n, int m, double misrate)
{
long a = 0;
long b = (long)n * m;
while (a < b - 1)
{
long c = (a + b) / 2;
double p = EdgeworthCdf(n, m, c);
if (p < misrate)
a = c;
else
b = c;
}
return EdgeworthCdf(n, m, b) < misrate ? b : a;
}
private static double EdgeworthCdf(int n, int m, long u)
{
double nm = (double)n * m;
double mu = nm / 2.0;
double su = Sqrt(nm * (n + m + 1) / 12.0);
// -0.5 continuity correction: computing P(U ≥ u) for a right-tail discrete CDF
double z = (u - mu - 0.5) / su;
double phi = Exp(-z.Sqr() / 2) / Sqrt(2 * PI);
double Phi = AcmAlgorithm209.Gauss(z);
// Pre-compute powers of n and m for efficiency
double n2 = (double)n * n;
double n3 = n2 * n;
double n4 = n2 * n2;
double m2 = (double)m * m;
double m3 = m2 * m;
double m4 = m2 * m2;
double mu2 = (double)n * m * (n + m + 1) / 12.0;
double mu4 =
(double)n * m * (n + m + 1) *
(0
+ 5 * m * n * (m + n)
- 2 * (m2 + n2)
+ 3 * m * n
- 2 * (n + m)
) / 240.0;
double mu6 =
(double)n * m * (n + m + 1) *
(0
+ 35 * m2 * n2 * (m2 + n2)
+ 70 * m3 * n3
- 42 * m * n * (m3 + n3)
- 14 * m2 * n2 * (n + m)
+ 16 * (n4 + m4)
- 52 * n * m * (n2 + m2)
- 43 * n2 * m2
+ 32 * (m3 + n3)
+ 14 * m * n * (n + m)
+ 8 * (n2 + m2)
+ 16 * n * m
- 8 * (n + m)
) / 4032.0;
// Pre-compute powers of mu2 and related terms
double mu2_2 = mu2 * mu2;
double mu2_3 = mu2_2 * mu2;
double mu4_mu2_2 = mu4 / mu2_2;
// Factorial constants: 4! = 24, 6! = 720, 8! = 40320
double e3 = (mu4_mu2_2 - 3) / 24.0;
double e5 = (mu6 / mu2_3 - 15 * mu4_mu2_2 + 30) / 720.0;
double e7 = 35 * (mu4_mu2_2 - 3) * (mu4_mu2_2 - 3) / 40320.0;
// Pre-compute powers of z for Hermite polynomials
double z2 = z * z;
double z3 = z2 * z;
double z5 = z3 * z2;
double z7 = z5 * z2;
double f3 = -phi * (z3 - 3 * z);
double f5 = -phi * (z5 - 10 * z3 + 15 * z);
double f7 = -phi * (z7 - 21 * z5 + 105 * z3 - 105 * z);
double edgeworth = Phi + e3 * f3 + e5 * f5 + e7 * f7;
return Min(Max(edgeworth, 0), 1);
}
}
Fast SignedRankMargin
The SignedRankMargin function determines how many extreme pairwise averages to exclude when constructing bounds around Center(x). Given a sample x = (x_1, ..., x_n), the Center estimator computes all pairwise averages (x_i + x_j)/2 for i ≤ j and sorts them. The bounds select specific order statistics from this sorted sequence: the values remaining after the most extreme averages are trimmed from each tail. The challenge lies in determining which order statistics produce bounds that contain the true center with probability 1 − misrate.
The margin function is the one-sample analog of PairwiseMargin. While PairwiseMargin uses the Mann-Whitney distribution for two-sample comparisons, SignedRankMargin uses the Wilcoxon signed-rank distribution for one-sample inference. Under the weak symmetry assumption, the signed-rank statistic has a known distribution that enables exact computation of bounds coverage.
For symmetric distributions, consider the signs of deviations from the center. The Wilcoxon signed-rank statistic sums the ranks of positive deviations:

W = Σ_{i : x_i > θ} R_i

where R_i is the rank of |x_i − θ| among all |x_1 − θ|, ..., |x_n − θ|, and θ is the true center. Under symmetry, each deviation is equally likely to be positive or negative, giving a discrete distribution over {0, 1, ..., n(n + 1)/2}.
The connection to pairwise averages is fundamental: the k-th order statistic of the sorted pairwise averages corresponds to a specific threshold of the signed-rank statistic. By computing the distribution of W, we determine which order statistics provide bounds with the desired coverage.
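For concreteness, the signed-rank statistic can be computed directly for a candidate center. This Python sketch uses illustrative names and assumes no ties among absolute deviations:

```python
def signed_rank_statistic(x, theta):
    """Wilcoxon signed-rank statistic W: the sum of the ranks of |x_i - theta|
    over the observations with x_i > theta (no-ties version)."""
    by_abs_deviation = sorted((abs(v - theta), v > theta) for v in x)
    return sum(rank
               for rank, (_, positive) in enumerate(by_abs_deviation, start=1)
               if positive)

# All deviations positive: W reaches its maximum n(n + 1)/2 = 6.
assert signed_rank_statistic([1.0, 2.0, 10.0], 0.0) == 6
# Mixed signs: |1| -> rank 1 (+), |2| -> rank 2 (+), |-3| -> rank 3 (-).
assert signed_rank_statistic([-3.0, 1.0, 2.0], 0.0) == 1 + 2
```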
Two computational approaches provide the distribution of W: exact calculation for small samples and approximation for large samples.
Exact method
Small sample sizes allow exact computation without approximation. The Wilcoxon signed-rank distribution has 2^n equally likely outcomes under symmetry, corresponding to all possible sign patterns for the n deviations from the center.
Dynamic programming builds the probability mass function efficiently. Define c(i, w) as the number of sign patterns over ranks 1..i producing signed-rank statistic equal to w. The recurrence considers whether to include rank i in the positive sum:

c(i, w) = c(i − 1, w) + c(i − 1, w − i)

with base case c(0, 0) = 1. This builds the distribution incrementally, rank by rank.
The algorithm computes cumulative probabilities sequentially until the threshold misrate/2 is exceeded. For symmetric two-tailed bounds, the margin becomes twice the one-sided threshold. Memory is O(n²) for storing the probability array, and time is O(n³) for the full computation.
The sequential computation performs well for small misrates. For misrate = 0.01, the threshold typically remains small, requiring only iterations through the lower tail regardless of sample size.
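The dynamic program over sign patterns is compact. The following Python sketch builds the full count array (illustrative names; the toolkit stops early once the cumulative threshold is reached):

```python
def signed_rank_counts(n):
    """Count the sign patterns over ranks 1..n that yield each value of the
    Wilcoxon signed-rank statistic W, via dynamic programming."""
    max_w = n * (n + 1) // 2
    count = [0] * (max_w + 1)
    count[0] = 1  # base case: the empty pattern has W = 0
    for i in range(1, n + 1):
        # Iterate downward so each rank is included at most once.
        for w in range(max_w, i - 1, -1):
            count[w] += count[w - i]
    return count

counts = signed_rank_counts(4)
assert sum(counts) == 2 ** 4          # all 16 sign patterns accounted for
assert counts[0] == counts[10] == 1   # all-negative / all-positive extremes
```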
Approximate method
Large samples make exact computation impractical. For n > 63, the Wilcoxon distribution is approximated using an Edgeworth expansion.
Under symmetry, the signed-rank statistic has:

E[W] = n(n + 1)/4, Var[W] = n(n + 1)(2n + 1)/24
The basic normal approximation uses these moments directly, but underestimates tail probabilities for moderate sample sizes.
The Edgeworth expansion refines this through moment-based corrections. The fourth cumulant of W is:

κ4 = −n(n + 1)(2n + 1)(3n² + 3n − 1)/240

This enables a kurtosis correction:

e3 = κ4 / (24·σ⁴)

The approximated CDF becomes:

F(w) ≈ Φ(z) − e3·φ(z)·(z³ − 3z)

where z = (w − n(n + 1)/4 + 1/2) / σ includes a continuity correction and σ = sqrt(Var[W]).
Binary search locates the threshold efficiently. Each CDF evaluation costs O(1), and O(log n) evaluations suffice, so the approximate method completes in near-constant time regardless of sample size.
The toolkit uses exact computation for n ≤ 63 and approximation for n > 63. At n = 63, the exact method requires arrays of size n(n + 1)/2 + 1 = 2017, which remains practical on modern hardware. The transition at n = 63 is determined by the requirement that the total count 2^n fits in a 64-bit integer. The approximation achieves sub-1% accuracy for n > 63, making the transition smooth.
Minimum achievable misrate
The misrate parameter controls how many extreme pairwise averages the bounds exclude. However, sample size limits how small misrate can meaningfully become.
The most extreme configuration occurs when all signs are positive (or all negative): W = n(n + 1)/2 or W = 0. Under symmetry, each extreme occurs with probability 1/2^n, so counting both tails gives the minimum achievable misrate:

min_misrate = 2^(1 − n)

Setting misrate below min_misrate makes bounds construction problematic. Pragmastat rejects such requests with a domain error.
The table below shows min_misrate for small sample sizes:
| n | 2 | 3 | 5 | 7 | 10 |
|---|---|---|---|---|---|
| min_misrate | 0.5 | 0.25 | 0.0625 | 0.0156 | 0.00195 |
| max confidence | 50% | 75% | 93.75% | 98.4% | 99.8% |
For meaningful bounds construction, choose misrate ≥ 2^(1 − n). With n = 10, standard choices like misrate = 0.05 become feasible. With n ≥ 15, even misrate = 0.0001 is achievable.
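This Python sketch (illustrative name) reproduces the table entries:

```python
def min_misrate_one_sample(n):
    """Two-sided minimum achievable misrate for one-sample bounds: twice the
    probability of an all-positive (or all-negative) sign pattern."""
    return 2.0 ** (1 - n)

# Matches the table above.
assert min_misrate_one_sample(2) == 0.5
assert min_misrate_one_sample(5) == 0.0625
assert min_misrate_one_sample(10) < 0.002
```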
using System;
using JetBrains.Annotations;
using Pragmastat.Exceptions;
using Pragmastat.Internal;
using static System.Math;
namespace Pragmastat.Functions;
/// <summary>
/// SignedRankMargin function for one-sample bounds.
/// One-sample analog of PairwiseMargin using Wilcoxon signed-rank distribution.
/// </summary>
/// <param name="threshold">Maximum n for exact computation; larger n uses approximation</param>
internal class SignedRankMargin(int threshold = SignedRankMargin.MaxExactSize)
{
public static readonly SignedRankMargin Instance = new();
private const int MaxExactSize = 63;
[PublicAPI]
public int Calc(int n, double misrate)
{
if (n <= 0)
throw AssumptionException.Domain(Subject.X);
if (double.IsNaN(misrate) || misrate < 0 || misrate > 1)
throw AssumptionException.Domain(Subject.Misrate);
double minMisrate = MinAchievableMisrate.OneSample(n);
if (misrate < minMisrate)
throw AssumptionException.Domain(Subject.Misrate);
return n <= threshold
? CalcExact(n, misrate)
: CalcApprox(n, misrate);
}
internal int CalcExact(int n, double misrate)
{
int raw = CalcExactRaw(n, misrate / 2);
return checked(raw * 2);
}
internal int CalcApprox(int n, double misrate)
{
long raw = CalcApproxRaw(n, misrate / 2);
long margin = raw * 2;
if (margin > int.MaxValue)
throw new OverflowException($"Signed-rank margin exceeds supported range for n={n}");
return (int)margin;
}
/// <summary>
/// Compute one-sided margin using exact Wilcoxon signed-rank distribution.
/// Uses dynamic programming to compute the CDF.
/// </summary>
private static int CalcExactRaw(int n, double p)
{
ulong total = 1UL << n;
long maxW = (long)n * (n + 1) / 2;
var count = new ulong[maxW + 1];
count[0] = 1;
for (int i = 1; i <= n; i++)
{
for (long w = Min(maxW, (long)i * (i + 1) / 2); w >= i; w--)
count[w] += count[w - i];
}
ulong cumulative = 0;
for (int w = 0; w <= maxW; w++)
{
cumulative += count[w];
double cdf = (double)cumulative / total;
if (cdf >= p)
return w;
}
return (int)maxW;
}
/// <summary>
/// Compute one-sided margin using Edgeworth approximation for large n.
/// </summary>
private static long CalcApproxRaw(int n, double misrate)
{
long maxW = (long)n * (n + 1) / 2;
long a = 0;
long b = maxW;
while (a < b - 1)
{
long c = (a + b) / 2;
double cdf = EdgeworthCdf(n, c);
if (cdf < misrate)
a = c;
else
b = c;
}
return EdgeworthCdf(n, b) < misrate ? b : a;
}
/// <summary>
/// Edgeworth expansion for Wilcoxon signed-rank distribution CDF.
/// </summary>
private static double EdgeworthCdf(int n, long w)
{
double mu = (double)n * (n + 1) / 4.0;
double sigma2 = n * (n + 1.0) * (2 * n + 1) / 24.0;
double sigma = Sqrt(sigma2);
// +0.5 continuity correction: computing P(W ≤ w) for a left-tail discrete CDF
double z = (w - mu + 0.5) / sigma;
double phi = Exp(-z * z / 2) / Sqrt(2 * PI);
double Phi = AcmAlgorithm209.Gauss(z);
double nd = n;
double kappa4 = -nd * (nd + 1) * (2 * nd + 1) * (3 * nd * nd + 3 * nd - 1) / 240.0;
double e3 = kappa4 / (24 * sigma2 * sigma2);
double z2 = z * z;
double z3 = z2 * z;
double f3 = -phi * (z3 - 3 * z);
double edgeworth = Phi + e3 * f3;
return Min(Max(edgeworth, 0), 1);
}
}
Fast CenterQuantiles
The Center estimator computes the median of all pairwise averages (x_i + x_j)/2 for 1 ≤ i ≤ j ≤ n. For bounds computation, we need not just the median but specific order statistics: the k-th smallest pairwise average. Given a sample of size n, there are n(n + 1)/2 such pairwise averages.
A naive approach would materialize all pairwise averages, sort them, and extract the desired quantile. With n = 10,000, this creates approximately 50 million values, requiring quadratic memory and time. The fast algorithm avoids materializing the pairs entirely.
The algorithm exploits the sorted structure of the implicit pairwise average matrix. After sorting the input, the pairwise averages form a symmetric matrix where both rows and columns are sorted. Instead of searching through indices, the algorithm searches through values. It maintains a search interval [lo, hi], initialized to the range of all possible pairwise averages (from the smallest element to the largest). At each iteration, it asks: how many pairwise averages are at most this threshold? Based on the count, it narrows the search interval. For finding the k-th smallest pairwise average: if count ≥ k, the target is at or below the threshold and the search continues in the lower half; if count < k, the target is above the threshold and the search continues in the upper half.
The key operation is counting pairwise averages at or below a threshold t. For each i, we need to count how many j ≥ i satisfy (x_i + x_j)/2 ≤ t, equivalently x_j ≤ 2t − x_i. Because the array is sorted, a single pass through the array suffices. For each i, a pointer j tracks the largest index where x_j ≤ 2t − x_i. As i increases, x_i increases, so the bound 2t − x_i decreases, meaning j can only decrease (or stay the same). This two-pointer technique ensures each element is visited at most twice, making the counting operation O(n) regardless of the threshold value.
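The counting sweep can be sketched directly; `count_pairs_le` is an illustrative name mirroring the CountPairsLessOrEqual routine in the listing below:

```python
def count_pairs_le(sorted_x, t):
    """Count pairwise averages (x_i + x_j)/2 with i <= j that are <= t,
    using a two-pointer sweep over the sorted array."""
    n = len(sorted_x)
    count = 0
    j = n - 1  # not reset: the bound 2t - x_i only shrinks as i grows
    for i in range(n):
        while j >= 0 and sorted_x[j] > 2 * t - sorted_x[i]:
            j -= 1
        if j >= i:
            count += j - i + 1
    return count

xs = [1.0, 2.0, 4.0]
# Pairwise averages: 1.0, 1.5, 2.0, 2.5, 3.0, 4.0
assert count_pairs_le(xs, 2.0) == 3
assert count_pairs_le(xs, 4.0) == 6
```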
Binary search converges to an approximate value, but bounds require exact pairwise averages. After the search converges, the algorithm identifies candidate exact values near the approximate threshold, then selects the one at the correct rank. The candidate generation examines positions near the boundary: for each row , it finds the values where the pairwise average crosses the threshold, collecting the actual pairwise average values. Sorting these candidates and verifying ranks ensures the exact quantile is returned.
Bounds computation needs two quantiles: a lower and an upper order statistic. The algorithm computes each independently using the same technique. For symmetric bounds around the center, the lower and upper ranks trim an equal number of extreme averages from each tail.
The algorithm achieves O(n log n) time complexity for sorting, plus O(n log(1/ε)) for the binary search, where ε is the value-range precision. Memory usage is O(n) for the sorted array plus O(n) for candidate generation. This is dramatically more efficient than the naive approach. For n = 10,000, the fast algorithm completes in milliseconds versus minutes for the naive approach.
The algorithm uses relative tolerance for convergence: the search stops once hi − lo ≤ ε · max(1, |lo|, |hi|). The floor of 1 prevents degenerate tolerance when bounds are near zero. This ensures stable behavior across different scales of input data. For candidate generation near the threshold, small tolerances prevent missing exact values due to floating-point imprecision.
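Putting the pieces together, the value-space binary search can be sketched in Python. Unlike the toolkit's implementation, this sketch skips the candidate-generation step and returns a value within tolerance of the exact order statistic; names are illustrative:

```python
def kth_pairwise_average(sorted_x, k, eps=1e-12):
    """Approximate k-th smallest pairwise average (1-based) via binary
    search on values, reusing the two-pointer counting sweep."""
    def count_le(t):
        n, c, j = len(sorted_x), 0, len(sorted_x) - 1
        for i in range(n):
            while j >= 0 and sorted_x[j] > 2 * t - sorted_x[i]:
                j -= 1
            if j >= i:
                c += j - i + 1
        return c

    lo, hi = sorted_x[0], sorted_x[-1]
    while hi - lo > eps * max(1.0, abs(lo), abs(hi)):
        mid = (lo + hi) / 2
        if count_le(mid) >= k:
            hi = mid  # the k-th smallest average is at or below mid
        else:
            lo = mid
    return (lo + hi) / 2

xs = [1.0, 2.0, 4.0]
# Sorted pairwise averages: 1.0, 1.5, 2.0, 2.5, 3.0, 4.0
assert abs(kth_pairwise_average(xs, 4) - 2.5) < 1e-9
```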
using System;
using System.Collections.Generic;
using System.Diagnostics;
using static System.Math;
namespace Pragmastat.Algorithms;
/// <summary>
/// Efficiently computes quantiles from all pairwise averages (x[i] + x[j]) / 2 for i ≤ j.
/// Uses binary search with counting function to avoid materializing all N(N+1)/2 pairs.
/// </summary>
internal static class FastCenterQuantiles
{
private static bool IsSorted(IReadOnlyList<double> list)
{
for (int i = 1; i < list.Count; i++)
if (list[i] < list[i - 1])
return false;
return true;
}
/// <summary>
/// Relative epsilon for floating-point comparisons in binary search convergence.
/// </summary>
private const double RelativeEpsilon = 1e-14;
/// <summary>
/// Compute specified quantile from pairwise averages.
/// </summary>
/// <param name="sorted">Sorted input array.</param>
/// <param name="k">1-based rank of the desired quantile.</param>
/// <returns>The k-th smallest pairwise average.</returns>
public static double Quantile(IReadOnlyList<double> sorted, long k)
{
Debug.Assert(
IsSorted(sorted),
"FastCenterQuantiles.Quantile: input must be sorted");
int n = sorted.Count;
if (n == 0)
throw new ArgumentException("Input cannot be empty", nameof(sorted));
long totalPairs = (long)n * (n + 1) / 2;
if (k < 1 || k > totalPairs)
throw new ArgumentOutOfRangeException(nameof(k), $"k must be in range [1, {totalPairs}]");
if (n == 1)
return sorted[0];
return FindExactQuantile(sorted, k);
}
/// <summary>
/// Compute both lower and upper bounds from pairwise averages.
/// </summary>
/// <param name="sorted">Sorted input array.</param>
/// <param name="marginLo">Rank of lower bound (1-based).</param>
/// <param name="marginHi">Rank of upper bound (1-based).</param>
/// <returns>Lower and upper quantiles.</returns>
public static (double Lo, double Hi) Bounds(IReadOnlyList<double> sorted, long marginLo, long marginHi)
{
Debug.Assert(
IsSorted(sorted),
"FastCenterQuantiles.Bounds: input must be sorted");
int n = sorted.Count;
if (n == 0)
throw new ArgumentException("Input cannot be empty", nameof(sorted));
long totalPairs = (long)n * (n + 1) / 2;
marginLo = Max(1, Min(marginLo, totalPairs));
marginHi = Max(1, Min(marginHi, totalPairs));
double lo = FindExactQuantile(sorted, marginLo);
double hi = FindExactQuantile(sorted, marginHi);
return (Min(lo, hi), Max(lo, hi));
}
/// <summary>
/// Count pairwise averages ≤ target value.
/// </summary>
private static long CountPairsLessOrEqual(IReadOnlyList<double> sorted, double target)
{
int n = sorted.Count;
long count = 0;
// j is not reset: as i increases, threshold decreases monotonically
int j = n - 1;
for (int i = 0; i < n; i++)
{
double threshold = 2 * target - sorted[i];
while (j >= 0 && sorted[j] > threshold)
j--;
if (j >= i)
count += j - i + 1;
}
return count;
}
/// <summary>
/// Find exact k-th pairwise average using selection algorithm.
/// </summary>
private static double FindExactQuantile(IReadOnlyList<double> sorted, long k)
{
int n = sorted.Count;
long totalPairs = (long)n * (n + 1) / 2;
if (n == 1)
return sorted[0];
if (k == 1)
return sorted[0];
if (k == totalPairs)
return sorted[n - 1];
double lo = sorted[0];
double hi = sorted[n - 1];
const double eps = RelativeEpsilon;
var candidates = new List<double>(n);
while (hi - lo > eps * Max(1.0, Max(Abs(lo), Abs(hi))))
{
double mid = (lo + hi) / 2;
long countLessOrEqual = CountPairsLessOrEqual(sorted, mid);
if (countLessOrEqual >= k)
hi = mid;
else
lo = mid;
}
double target = (lo + hi) / 2;
for (int i = 0; i < n; i++)
{
double threshold = 2 * target - sorted[i];
int left = i;
int right = n;
while (left < right)
{
int m = (left + right) / 2;
if (sorted[m] < threshold - eps)
left = m + 1;
else
right = m;
}
if (left < n && left >= i && Abs(sorted[left] - threshold) < eps * Max(1.0, Abs(threshold)))
candidates.Add((sorted[i] + sorted[left]) / 2);
if (left > i)
{
double avgBefore = (sorted[i] + sorted[left - 1]) / 2;
if (avgBefore <= target + eps)
candidates.Add(avgBefore);
}
}
if (candidates.Count == 0)
return target;
candidates.Sort();
foreach (double candidate in candidates)
{
long countAtCandidate = CountPairsLessOrEqual(sorted, candidate);
if (countAtCandidate >= k)
return candidate;
}
return target;
}
}