SIMD median selection networks (AVX2): Hybrid SIMD+scalar selection
networks for n = 8, 16, 32 are 23–58% faster than scalar median_net.
Accelerates robScale, robLoc, MAD, and ADM median dispatch.
Raised sort network threshold (16 → 56): Branchless sorting networks
(sort_net) beat std::sort by 1.2–3× for n ≤ 60. Benefits Qn exact
path (n ≤ 40), Sn micro-path, GMD, and IQR sort-then-index.
Ensemble uses a separate lower threshold (20) to avoid L1 I-cache pressure
in the bootstrap hot loop.
Raised median network threshold (16 → 36): Scalar median_net
(selection networks) beats Floyd-Rivest through n ≈ 36 for both odd and
even sample sizes.
Earlier ensemble parallelism: TBB parallel_for now fires for all
ensemble bootstrap calls with ≥ 2 cores and ≥ 16 replicates (was gated
on n × n_boot ≥ 10000). Ensemble is 4–8× faster for n = 3–49 on
multi-core machines.
ADM AVX2 auto-vectorization restored: adm_core_avx2 now contains
the vectorizable loop body directly instead of delegating to
adm_core_scalar via plain inline. Eliminates a 6–9% regression at
medium n (1024–4096) caused by lost auto-vectorization across the
target-attribute boundary.
Dropped AVX-512 dispatch: Removed ROBSCALE_TARGET_AVX512F,
ROBSCALE_HAS_AVX512_DISPATCH, and the 8-wide AVX-512 tanh path.
The tanh backend hierarchy is now: Apple Accelerate → glibc libmvec
(AVX2 4-wide) → SLEEF (AVX2 4-wide) → OpenMP SIMD → scalar.
Runtime AVX2 dispatch for ADM: adm_core now takes an explicit
use_avx2 flag, computed once at each entry point via
RuntimeConfig::get(). Non-AVX2 machines take the scalar path safely.
L2 cache plausibility guard: Hardware detection now rejects
l2_per_core values below 64 KB as implausible and falls back to
256 KB. Applied across Linux, macOS, and Windows detection paths.
configure sed delimiter: Replaced | delimiter with ! in the
11-expression sed block to survive TBB rpath values containing |.
RcppParallel.h include ordering: Added #include <RcppParallel.h>
before system TBB headers in rob_scale.cpp and rob_loc.cpp so that
RcppParallel's bundled TBB 2019 include guards fire first.
Explicit <cmath> include: Added direct #include <cmath> to
rob_scale.cpp and rob_loc.cpp for portability.
macOS TBB detection: The configure script now checks for
libtbb.dylib in addition to libtbb.so when detecting RcppParallel's
bundled TBB.
SystemRequirements: Removed TBB from SystemRequirements in
DESCRIPTION. TBB is provided by RcppParallel or detected at
configure time; listing it as a system requirement was incorrect.
<immintrin.h> guard: Narrowed the <immintrin.h> include in
qnsn_kernels.h from an architecture check to ROBSCALE_HAS_AVX2_DISPATCH.
Clang 21 ABI compatibility: Added target("avx2,fma") and
target("avx512f") attributes to extern "C" declarations of
libmvec/SLEEF vectorized tanh functions. Clang 21 (CRAN Debian)
requires target attributes on __m256d/__m512d function signatures
even at declaration scope.
TBB guard alignment: Unified all TBB preprocessor guards across
qn_estimator.cpp, sn_estimator.cpp, ensemble.cpp, and
worker_compat.h to use the combined
(ROBSCALE_HAS_SYSTEM_TBB || USE_DIRECT_TBB) pattern. Previously
these files only checked USE_DIRECT_TBB, causing system-oneTBB
builds (CRAN Linux) to fall back to RcppParallel instead of using
direct TBB parallelism for Qn, Sn, and ensemble bootstrap.
Unused variable warning: Eliminated use_avx2 warning on
system-oneTBB builds by expanding the guard that consumes it.
scale_robust(x, method = "qn") with n >= threshold now correctly returns
the Qn estimator instead of silently returning gmd(x). The auto_switch
mechanism now only applies to method = "ensemble"; named methods are always
dispatched as requested.scale_robust(..., ci = TRUE) now supports bootstrap confidence intervals for
individual named methods via boot_method = "bca", "percentile", or
"parametric". The new C++ function cpp_single_estimator_ci_bounds runs
the bootstrap on the single requested estimator.
boot_method = "analytical" is now accepted explicitly by scale_robust().
It selects the chi-squared interval for sd and the ARE-based normal
approximation for all other named methods.
print.robscale_ci now shows the CI method in the output header
(e.g., 95% CI (analytical): [0.6123, 1.1456]).
AVX2 detection guarded for non-GCC/Clang compilers (WU-PORT-1):
__builtin_cpu_supports("avx2") is now wrapped in
#if defined(__GNUC__) || defined(__clang__). MSVC and ICC builds
previously hit an undefined-identifier error here; on unsupported
compilers, SIMD capability defaults conservatively to None.
hardware_concurrency() zero return clamped (WU-PORT-1):
std::thread::hardware_concurrency() returns 0 on platforms that
cannot determine the thread count. The result is clamped to
std::max(..., 1u), preventing a zero-divisor in the TBB
grain-size calculation on such platforms.
sysfs L2-cache parse hardened (WU-PORT-1): The Linux sysfs
integer read now wraps std::stoul in try { } catch (...).
Malformed or out-of-range values fall back to
L2_FALLBACK_KB * 1024 and num_logical_cores = 1 instead of
propagating an uncaught exception to the R session.
mad.cpp micro-buffer tied to ROBSCALE_MICRO_BUFFER_SIZE
(WU-PORT-2): Both mad_impl_auto and mad_impl_center previously
hard-coded buf_micro[64] and n <= 64. Both now reference
buf_micro[ROBSCALE_MICRO_BUFFER_SIZE] and
n <= static_cast<int>(ROBSCALE_MICRO_BUFFER_SIZE). Changing the
package-wide constant in robscale_config.h now propagates to both
MAD paths without manual adjustment.
Ensemble bootstrap singleton overhead eliminated (WU-PERF-1):
ensemble_one_replicate called
robscale::qnsn::RuntimeConfig::get().sort_boost_threshold on
every bootstrap replicate. The call is replaced with the
compile-time constant ROBSCALE_SORT_BOOST_THRESHOLD, removing the
static-local singleton access from the hot path. Gate results:
~59% speedup at n = 1,000--5,000 (stable-zone ratio ≈ 0.41).
AVX2 diff-fill enabled in TBB Qn refinement path (WU-PERF-2):
The qn_fill_diffs_avx2 kernel previously ran only in the
brute-force path. It now dispatches inside the TBB parallel_for
lambda when use_avx2 && len >= 4; the use_avx2 flag is hoisted
before the parallel region to avoid repeated CPUID lookups inside
the lambda. Gate results: ~11% speedup at n = 1,000--10,000
(stable-zone ratio ≈ 0.89).
AVX-512 tanh dispatch: On glibc ≥ 2.35 systems where libmvec provides
_ZGVeN8v_tanh, the bulk tanh kernel processes 8 doubles per iteration
instead of 4. CPUID dispatch ensures the 8-wide path runs only on AVX-512F
hardware; other systems fall through to AVX2 with zero overhead.
GMD AVX2 FMA kernel: The Gini mean difference weighted sum uses a shared
gmd_weighted_sum kernel with _mm256_fmadd_pd for n ≥ 8 on AVX2
hardware, replacing three duplicate scalar loops.
Estimator-major bootstrap layout: boot_results is stored as
7 × n_boot (estimator-major) so the mean/variance reduction pass reads each
estimator's replicates contiguously (stride-1 instead of stride-7).
Dead tmp parameter removed from rob_scale_compute
(WU-CLEAN-1): The function accepted an unused
double* ROBSCALE_RESTRICT tmp pointer left over from a planned
optimization that was never implemented. Removed from the
definition, its forward declaration, and all three call sites.
STACK_SIZE unified to ROBSCALE_SN_STACK_THRESHOLD
(WU-CLEAN-1): Four independent constexpr int STACK_SIZE = 2048
definitions in gmd.cpp, iqr.cpp, mad.cpp, and
sn_estimator.cpp are replaced with the package-wide constant in
robscale_config.h.
iqr_select_and_interp helper extracted; dead buf2 removed
(WU-CLEAN-2): The pdqselect + max-scan + Type-7 interpolation block
was duplicated inline across the Q1 and Q3 selection paths in
iqr(). Extracting it as static inline iqr_select_and_interp in
estimators_internal.h eliminated the duplication and enabled
compiler CSE across both call sites. Side effect: ~75% speedup at
n = 1,000 (ratio ≈ 0.25). The now-unused buf2 parameter is
removed from iqr() and its call site in ensemble.cpp.
sn_inner_serial template extracted; Sn inner-loop triplicated no
more (WU-CLEAN-3): The Sn inner loop appeared identically in
SnWorker::operator(), the Tier-1 serial fallback, and the Tier-2
large-$n$ branch. All three now delegate to
sn_inner_serial<T>, tagged __attribute__((always_inline)). The
attribute is load-bearing: without it, GCC 15 declines to inline
across all three call sites at -O2, producing a ~5× regression at
n = 1,000.
ROBSCALE_SORT_MEDIAN_THRESHOLD renamed
ROBSCALE_SORT_NETWORK_THRESHOLD (WU-CLEAN-4): The macro guards
the sort-network dispatch path, not median calculation. The old name
was misleading. Renamed across all six files; no value change.
Documentation additions (WU-CLEAN-4): ensemble_one_replicate
receives a full doc comment covering its parameters, the XorShift32
PRNG bit-shifts, and the sorted-once design pattern.
discover_linux, discover_macos, and discover_windows in
qnsn_hardware_info.h receive brief function-level comments. The
diag.cpp section labels are replaced with semantic names
("M-scale iteration diagnostics", "Median crossover benchmark
helpers").
iqr_scaled() dead stack eliminated (OPT-I1+I2): The entry frame previously
allocated both a 32 KB buf_stack[4096] and a 2 KB buf_micro[256] unconditionally,
even for n=2. buf_stack is now declared only inside a ROBSCALE_NOINLINE large-n
helper (n > 256); buf_stack size reduced 4096 → 2048 (IQR needs one copy).
Stack at small n reduced from ~34 KB to ~2 KB.
ROBSCALE_RESTRICT on iqr_impl_large and interp_q7 (OPT-I5): Pointer
parameters carry ROBSCALE_RESTRICT, allowing the compiler to generate stronger
SIMD code in the selection and scan loops.
iqr_scaled() n≤16 sort fast path (OPT-I6): For n≤16, two pdqselect calls
plus two min-scan passes are replaced by a single small_sort (branch-free sorting
network) followed by direct index reads. Q1/Q3 are O(1) after the sort.
Symmetric Q1 selection for n>16 with frac1>0 (OPT-I3): When the Q1 quantile
requires interpolation ((n-1) % 4 ≠ 0, ~75% of n values), the lower-partition
scan is reduced from O(0.75n) to O(0.25n). Instead of finding the minimum of
buf[lo1+1..n-1], pdqselect places lo1+1 exactly and a max-scan over
buf[0..lo1] recovers Q1. Applies to both the micro path (17≤n≤256) and the
large-n NOINLINE helper (n>257). Gate results: n=100 −4.3%, n=1000 −3.3%.
ROBSCALE_RESTRICT on estimators_internal::iqr() (OPT-I8): Pointer
parameters of the inline iqr() function (used in the ensemble compute path) carry
ROBSCALE_RESTRICT, allowing alias-free code generation in the pdqselect path.
iqr_sorted() added (OPT-I7): New robscale::internal::iqr_sorted(sorted_x, n)
for pre-sorted data — O(1) Type 7 quantile reads, no copy, no selection. Completes
the sorted-variant family alongside sn_sorted() and qn_sorted().
mad_scaled() small-n stack frame reduced 16 KB → 512 B (OPT-M1): mad_impl_auto
and mad_impl_center now route n > 64 through ROBSCALE_NOINLINE helpers.
The 16 KB buf_stack[2048] is declared only in those helpers, never in the entry frame.
#pragma omp simd on all deviation loops (OPT-M2): All absolute-deviation loops
in mad_impl_auto, mad_impl_center, and their large-n helpers are SIMD-annotated
via the established ROBSCALE_HAS_OMP_SIMD guard pattern.
ROBSCALE_RESTRICT on NOINLINE helper parameters (OPT-M3): mad_impl_auto_large
and mad_impl_center_large carry ROBSCALE_RESTRICT on their input pointers,
allowing the compiler to generate stronger SIMD code.
bulk_abs_diff / bulk_abs_diff_inplace shared SIMD kernels (OPT-M4): Two
inline SIMD kernels added to src/robust_core.h. All six deviation loops in
mad.cpp are replaced with kernel calls. n=1000 shows ~2–3.5% improvement.
mad_from_data uses bulk_abs_diff_inplace (OPT-M5): The deviation loop in
estimators_internal.h::mad_from_data (used by the ensemble compute path) is
replaced with the shared kernel.
Ensemble MAD block uses bulk_abs_diff (OPT-M6): The per-replicate deviation
loop in ensemble_one_replicate is replaced with bulk_abs_diff.
V-shaped deviation median in ensemble MAD block (OPT-M7): Since resample is
sorted, the deviation array is V-shaped. vshaped_mad finds the median of the two
sorted halves via O(log n) binary search (kth_of_two_sorted), replacing the O(n)
median_select call and eliminating the work2 write-read cycle.
gmd() small-n stack frame reduced 17 KB → 1 KB (OPT-G1): gmd_impl now
routes n ≤ 128 through an ROBSCALE_NOINLINE helper that allocates only a
128-double arena. The large buf_stack[2048] is declared only in the n > 128
path, eliminating 16 KB of unnecessary stack allocation on every small-sample
call.
Ensemble bootstrap sort upgraded to boost::float_sort (OPT-G4): The
per-replicate sort in ensemble_one_replicate now dispatches to
boost::sort::spreadsort::float_sort (serial radix sort) for n above the
runtime-configured sort_boost_threshold. Expected speedup ~1.5–2× for
bootstrap samples of size ≥ 512. tbb::parallel_sort is not used here to
avoid nested-TBB oversubscription inside the outer tbb::parallel_for.
gmd() internal sort path uses optimized_sort (OPT-G3): internal::gmd()
in estimators_internal.h now uses robscale::qnsn::optimized_sort (tiered
std::sort → boost::float_sort → tbb::parallel_sort) instead of plain
std::sort. Call sites (compute_all_estimators at lines 216 and 330 of
ensemble.cpp) are serial, so the optional TBB dispatch is safe.
GMD kernel scale factor precomputed (OPT-G5): constant * 2 / (n*(n-1))
is now computed once before the accumulation loop across all three call sites
(gmd_sorted, internal::gmd(), ensemble inline block), reducing per-element
work to a single multiply.
ROBSCALE_RESTRICT annotations on gmd_sorted and gmd_core (OPT-G6):
No-aliasing hints allow the compiler to generate tighter SIMD code for the
accumulation loop.
robScale()): The previous
guard rejected a valid extrapolation whenever the Aitken jump exceeded the
two-step pair displacement (|cand - s2| < |s2 - s0|). For convergence
rates $r \gtrsim 0.73$ this condition is never satisfied, so the
acceleration never fired and the M-scale iteration required up to 80
rho_sum evaluations per call at small $n$. Removing the guard — keeping
only the minimal safety candidate > 0 — cuts typical rho_evals from
50–80 down to 7–12 for the previously-slow cases (n=5 seed=47/247,
n=7 seed=49/149). The case n=7 seed=49, which previously hit maxit without
converging, now converges in fewer than 40 evaluations.Fused AVX2 kernel threshold lowered from n≥8 to n≥4 (robScale()):
The use_fused flag in rob_scale_compute now activates at $n \geq 4$
instead of $n \geq 8$. Inputs of length 4, 5, 6, and 7 now use the
single-pass fused AVX2 kernel rather than the three-pass scalar fallback,
eliminating the remaining per-call overhead for very small samples.
Fused single-pass NR kernel for robLoc() (OPT-L1): rob_loc_nr_step_avx2
collapses the three-pass Newton--Raphson loop (scale residuals, bulk_tanh,
accumulate) into a single AVX2 pass. Two accumulators advance in lockstep:
acc_psi via addpd and acc_p2 via fmadd ($p_i^2 = \tanh^2(u_i)$).
The derivative sum follows from $\sum \text{sech}^2(u_i) = n - \sum \tanh^2(u_i)$,
avoiding a second transcendental evaluation. A degenerate-scale guard
(sum_dpsi < DBL_MIN) prevents NaN on near-constant inputs.
RuntimeConfig hoist for robLoc() (OPT-L2): The SIMD dispatch flag
was re-read from thread-local storage on every Newton--Raphson iteration.
CPUID features are invariant over process lifetime; the flag is now hoisted
once before the loop. Added bulk_tanh_dispatched() to robust_core.h to
accept the pre-hoisted flag.
Warm-cache MAD for robLoc() (OPT-L4): rob_loc_core copies input
into buf[] and selects the median in place. MAD is permutation-invariant,
so mad_select now reads from buf[]---warm in L1/L2 cache---rather than
the cold input array.
TBB parallel NR for robLoc() (OPT-L3): For $n \geq \max(4096, L_2/32)$,
rob_loc_parallel_compute partitions the array across TBB threads via
parallel_reduce. Each chunk calls rob_loc_nr_step_avx2; an NRAccum
struct accumulates partial sums with operator+=.
Combined speedup: These four optimizations yield 2.6--4.1x speedup
over revss for robLoc() at $n = 100$ to $10{,}000$ on x86_64 with AVX2
(benchmark ratios: 0.347 at $n = 100$, 0.243 at $n = 10{,}000$).
Sorting-network threshold corrected for robScale() (OPT-A): Lowered
ROBSCALE_SORT_MEDIAN_THRESHOLD from 64 to 16. The median-net comparator
count grows as $O(n^{1.5})$ ($n = 64$: 337 swaps; $n = 16$: 46 swaps);
Floyd--Rivest is $O(n)$, crossing over at $n \approx 16$. Using sorting
networks up to $n = 64$ incurred ~3.5× unnecessary overhead per median call.
This change eliminates the $n = 500$--$1{,}000$ regression in robScale():
benchmarks show ratios 0.201 ($n = 500$) and 0.174 ($n = 1{,}000$), versus
1.126 and 1.122 before the fix (a 5--6× reversal).
PLT elimination for median_net<T> (OPT-B): Added
__attribute__((visibility("hidden"))) to the median_net<T> template
definition in sort_net.h. On Linux -fPIC, this eliminates PLT
indirection for all intra-DSO calls, replacing the W (weak) symbol with a
direct $t$ (local) binding.
adm() hot-buffer pass (OPT-A): adm_impl_auto() now passes the
already-copied working buffer to adm_core() rather than the cold SEXP
pointer. The $n$ doubles remain in L1/L2 cache from the initial copy
through the deviation sum, avoiding an additional cold read.
adm() stack-frame split (OPT-B): Extracted adm_large_n() as a
ROBSCALE_NOINLINE helper. This reduces the adm_impl_auto() stack frame
from 8,256 to 624 bytes (objdump verified), preventing the 32 KB large-$n$
buffer from penalizing small calls.
Benchmark seed pool widened to 7 seeds (benchmarks/run_benchmarks.R):
BENCH_SEEDS increased from 3 to 7 seeds (spaced 50 apart: 42, 92, 142,
192, 242, 292, 342). At small $n$ ($n \leq 16$) the previous 3-seed pool
had high variance because 2 of 3 seeds could land on the previously-slow
Aitken path, biasing the pooled-median timing toward the slow cluster.
Added rob_loc_scalar_impl, rob_loc_has_parallel, and rob_loc_serial_impl
diagnostic exports for TDD correctness gates.
Added compile-time diagnostic helpers (src/diag.cpp) for sorting-network
threshold validation; not user-facing.
Fused median-then-MAD: mad_scaled() now computes absolute deviations
in-place on the median selection buffer, eliminating the second scratch array.
Memory drops from $2n$ to $n$ doubles. The same fusion applies to robScale()
and the ensemble's internal MAD and M-scale paths, reducing total ensemble
workspace from $3n$ to $2n$ doubles. Benchmarks show 7--15% speedup at medium
$n$ on x86_64, with no regressions on ARM64 or at small $n$.
Noinline small-$n$ dispatch for robScale() and robLoc(): extracted
shared core logic (rob_scale_core, rob_loc_core) and separated the
small-$n$ path ($n \leq 64$) into ROBSCALE_NOINLINE helpers with a 1 KB
stack frame. The large-$n$ path retains its 32 KB buffer independently.
Benchmarks showed no measurable timing improvement (the R-to-C++ .Call()
boundary dominates at small $n$), but the refactoring clarifies the code
structure and prevents the compiler from penalizing small calls with the
large-$n$ stack reservation.
ROBSCALE_NOINLINE portability macro to robscale_config.h
(__attribute__((noinline)) on GCC/Clang, __declspec(noinline) on MSVC).estimators_internal.h: mad_from_data() and rob_scale() signatures
reduced from three to two buffer arguments.ensemble.cpp: workspace allocation reduced from $3n$ to $2n$ doubles.revss v3.0.0,
which changed robScale() and robLoc() defaults to use bias-corrected "AA"
constants. The adm() cross-check remains active for all revss versions.robScale() and robLoc(),
decoupled from upstream revss. Catches regressions in our own algorithm
independent of revss version changes.qn() and sn() kernels,
achieving optimal vectorization across x86_64 (AVX2/FMA) and ARM64 (NEON).scale_robust() provides a unified dispatcher that automatically
selects between a variance-weighted ensemble of 7 estimators (for small
samples) and the Gini Mean Difference (GMD) for larger samples (n >= 20).gmd(), iqr_scaled(), mad_scaled(), and sd_c4()
to the public API.ci = TRUE to all scale estimators, providing
analytical intervals (GMD, MAD, Sn, Qn, bias-corrected SD) or bootstrap-based
intervals (ensemble).stats::mad() and stats::IQR() with optimized C++
implementations (mad_scaled(), iqr_scaled()) using O(n) selection via the
pdqselect algorithm.robustbase,
Hmisc, GiniDistance, and collapse.inst/CITATION
and documentation.std::sort for n = 9-16 on both ARM64 (Apple Silicon)
and x86_64 (AMD Zen 3).robScale() that was
inadvertently removed, causing incorrect results when MAD collapses for
n < 4.#pragma GCC diagnostic ignored "-Wdeprecated-volatile" from
src/qn_estimator.cpp, src/sn_estimator.cpp, and src/qnsn_sort_utils.h
to resolve the "pragmas suppressing diagnostics" NOTE requested by CRAN
maintainers.configure.win and cleanup.win to suppress spurious
win-builder warnings about missing Windows configuration.qnsn_kernels.h that
Clang 17 on macOS promoted to a compilation WARNING.RcppParallel::LdFlags() to the canonical
RcppParallel::RcppParallelLibs() in Makevars.win and Makevars.in.<Accelerate/Accelerate.h> include and resolving a
symbol conflict with R's COMPLEX type.CXX_STD = CXX17 in Makevars (required by
CRAN policy for packages using C++17 features).GNU make in SystemRequirements ($(shell) is a GNU
extension used for RcppParallel linking).inst/WORDLIST to whitelist technical terms from aspell.DESCRIPTION to satisfy metadata checks.-ltbb flag from Makevars which caused warnings
on some macOS configurations.qn() and sn() (Rousseeuw &
Croux, 1993) with specialized sorting network kernels and cache-aware
parallelization.robustbase to Suggests for reference and benchmarking.configure script auto-detects SIMD
capabilities (AVX2/FMA on x86_64, NEON on ARM64) without requiring
any environment variables.adm() now raises an error when na.rm = FALSE and
NAs are present, matching robLoc() and robScale() behavior.new[]/delete[] with std::unique_ptr for RAII memory
safety in all C++ estimator functions.PSI_NORM_CONST,
INV_PSI_NORM_CONST) from robust_core.h.half_inv_s above the Newton--Raphson loop in
rob_loc.cpp.robLoc and robScale to README.adm(), maxit edge cases
for robLoc() and robScale(), sorting network verification for n = 2--8.cph role to Authors@R in DESCRIPTION.Initial release.
adm(), robLoc(), and robScale() implementing
the robust location and scale M-estimators of Rousseeuw & Verboven (2002)
for very small samples.revss package.tanh(x/2) identity for the logistic psi function with
platform-vectorized bulk evaluation (Apple Accelerate on macOS,
OpenMP SIMD hints on Linux).revss verified across 5,400 systematic
comparisons (tolerance: sqrt(.Machine$double.eps)).