SIMD Dispatch Matrix

bwa-mem3 ships one binary per platform. The x86 binary contains compiled kernels for every supported SIMD tier (sse41 / sse42 / avx / avx2 / avx512bw) and dispatches in process at startup. The arm64 binary contains a single NEON kernel path. There are no bwa-mem3.<tier> companion files on disk and no launcher binary.

Dispatch flowchart

flowchart TD
    A[bwa-mem3 mem starts] --> B{Platform?}
    B -- ARM / aarch64 --> C[NEON kernel TU, no dispatch]
    B -- x86 --> D[bwamem3_simd_init in src/simd_dispatch.cpp]
    D --> E[__builtin_cpu_supports]
    E --> F{Host capability?}
    F -- AVX-512BW --> G1[g_tier = avx512bw]
    F -- AVX2 --> G2[g_tier = avx2]
    F -- AVX --> G3[g_tier = avx]
    F -- SSE4.2 --> G4[g_tier = sse42]
    F -- SSE4.1 --> G5[g_tier = sse41]
    F -- below build floor --> H["exit(2): host below SIMD floor"]
    G1 & G2 & G3 & G4 & G5 --> I[Per-kernel factory selects matching tier]

Tier detection runs once during main(). Subsequent kernel calls pay a single indirect-call hop through a factory vtable (or an extern "C" wrapper for free-function ksw_* kernels) — about 0.3 ns per call after BTB warm-up, well below run-to-run noise on the bwa-mem3-bench corpus.

If the host CPU does not meet the build’s compile-time SIMD floor (BASELINE_ARCH, default avx2 since PR #84), the binary exits with code 2 and an [E::bwamem3] message naming the gap before any alignment work runs. bwa-mem3 version, --help, and -h are exempt and always succeed so operators can introspect a binary on a host that cannot run alignment. See Host requirements.
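The floor precheck described above might look something like the sketch below. It assumes the exempt commands are recognized from the first argument only, and the exact wording of the error message is invented for illustration. `Tier`, `tier_name`, `is_exempt`, and `floor_precheck` are hypothetical names, not the binary's actual symbols.

```cpp
#include <cstdio>
#include <cstring>

// Hypothetical tier enum, ordered lowest to highest.
enum class Tier { below_sse41 = 0, sse41, sse42, avx, avx2, avx512bw };

inline const char* tier_name(Tier t) {
    static const char* const names[] = {
        "below sse41", "sse41", "sse42", "avx", "avx2", "avx512bw",
    };
    return names[static_cast<int>(t)];
}

// Introspection commands skip the floor check so they work on any host.
// Assumption: only argv[1] is inspected.
inline bool is_exempt(int argc, const char* const* argv) {
    if (argc < 2) return false;
    return std::strcmp(argv[1], "version") == 0 ||
           std::strcmp(argv[1], "--help") == 0 ||
           std::strcmp(argv[1], "-h") == 0;
}

// Returns 0 if startup may continue, 2 if the host is below the build floor.
inline int floor_precheck(Tier host, Tier floor, int argc, const char* const* argv) {
    if (is_exempt(argc, argv)) return 0;
    if (host >= floor) return 0;
    // Message text is illustrative; only the [E::bwamem3] prefix comes from the docs.
    std::fprintf(stderr, "[E::bwamem3] host SIMD tier (%s) is below the build floor (%s)\n",
                 tier_name(host), tier_name(floor));
    return 2;
}
```

In this sketch, `main()` would call `floor_precheck` before any alignment setup and pass a non-zero result straight to `exit`.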

Building

make                              # single multi-tier x86 binary, BASELINE_ARCH=avx2
make BASELINE_ARCH=sse41          # lower host SIMD floor / maximize portability (~10–15% slower on AVX2 hosts)
make BASELINE_ARCH=avx512bw       # AVX-512BW-only fleet (locks the host floor)
make arm64                        # single NEON binary, no dispatch table

BASELINE_ARCH controls the tier at which non-kernel translation units compile. The hand-tuned kernel TUs in KERNEL_SRCS (bandedSWA, kswv, ksw, sam_encode) are always compiled at every supported tier and dispatched at runtime, so a build at BASELINE_ARCH=avx2 still uses the AVX-512BW kernels on AVX-512BW hosts. The trade-off lies in the non-kernel TUs, which are never auto-vectorized above BASELINE_ARCH; see BASELINE_ARCH=avx512bw build flag for the empirical perf characterization.

Supported x86 tiers (minimum CPU for each tier’s kernel path):

| Tier | Arch flags | Minimum CPU (Intel / AMD) |
|---|---|---|
| sse41 | -msse4.1 | Penryn (2007) / Bulldozer (2011) |
| sse42 | -msse4.2 | Nehalem (2008) / Bulldozer (2011) |
| avx | -mavx | Sandy Bridge (2011) / Bulldozer (2011) |
| avx2 | -mavx2 | Haswell (2013) / Excavator (2015) |
| avx512bw | -mavx512f -mavx512bw -mprefer-vector-width=256 | Skylake-X (2017) / Zen 4 (2022) |

For arm64 builds:

| Binary | Arch flags | Platform |
|---|---|---|
| bwa-mem3 (arm64) | -DAPPLE_SILICON=1 + native NEON / sse2neon shim | Any aarch64 / Apple Silicon |

Kernel vectorization coverage

| Kernel | SSE4.1 | SSE4.2 | AVX | AVX2 | AVX-512BW | NEON (arm64) |
|---|---|---|---|---|---|---|
| kswv (vectorized Smith-Waterman) | 8-wide int16 | 8-wide int16 | 8-wide int16 | 16-wide int16 | 32-wide int16 | 8-wide int16 (native) |
| bandedSWA (banded alignment / mate-rescue) | vectorized | vectorized | vectorized | vectorized | vectorized | native NEON blendv |
| ksw_* (SW extension free functions) | per-tier | per-tier | per-tier | per-tier | per-tier | per-tier (NEON) |
| sam_encode (SAM seq/qual encoder) | per-tier | per-tier | per-tier | per-tier | per-tier | per-tier (NEON) |
| FM-index lookup (FMI_search) | scalar popcount | scalar popcount | scalar popcount | scalar popcount | scalar popcount | __builtin_popcountl |
| libsais BWT construction | scalar | scalar | scalar | OpenMP parallel | OpenMP parallel | OpenMP parallel |
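The kswv lane counts in the table follow from the integer vector width divided by the 16-bit element size. The AVX tier stays at 8 lanes, presumably because AVX without AVX2 provides 256-bit floating-point operations only, so integer SIMD remains 128 bits wide. A small compile-time check of that arithmetic (`int16_lanes` is a hypothetical helper, not a bwa-mem3 function):

```cpp
// int16 lanes per vector register = register width in bits / 16.
constexpr int int16_lanes(int vector_bits) { return vector_bits / 16; }

static_assert(int16_lanes(128) == 8,  "SSE4.1 / SSE4.2 / AVX / NEON: 128-bit integer vectors");
static_assert(int16_lanes(256) == 16, "AVX2: 256-bit integer vectors");
static_assert(int16_lanes(512) == 32, "AVX-512BW: 512-bit integer vectors");
```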

Note — FM-index is memory-bound

The FM-index backward-extension loop is limited by pointer-chasing through the cp_occ arrays, not by computation. Additional SIMD width does not increase throughput here. See Developer Guide — Apple Silicon / NEON port for the profiling evidence.

Runtime overrides

Two environment variables tune dispatch:

| Variable | Effect |
|---|---|
| BWAMEM3_FORCE_TIER=<tier> | Forces a specific tier (sse41 / sse42 / avx / avx2 / avx512bw). Downgrade-only: requests above the host’s detected tier (which would SIGILL) and unknown names are rejected with a stderr warning. Used by test/regression/all_tiers_parity.sh to confirm byte-identical SAM across all tiers on AVX-512 hosts. |
| BWAMEM3_DEBUG_SIMD=1 | Prints a one-line [I::bwamem3_simd_init_body] startup banner with the build baseline, the detected host capability, and the resolved tier. Also enables the build-baseline-vs-host gap warning. |
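The downgrade-only rule for BWAMEM3_FORCE_TIER can be sketched as below. The sketch assumes that a rejected request falls back to the detected tier, which the table does not state explicitly, and it does not model any interaction with the build floor. `Tier`, `parse_tier`, and `apply_force_tier` are hypothetical names, and the warning text is invented.

```cpp
#include <cstdio>
#include <cstring>

// Hypothetical tier enum, ordered lowest to highest.
enum class Tier { below_sse41 = 0, sse41, sse42, avx, avx2, avx512bw };

// Map a tier name to its enum value; returns false for unknown names.
inline bool parse_tier(const char* name, Tier* out) {
    static const struct { const char* name; Tier tier; } known[] = {
        {"sse41", Tier::sse41}, {"sse42", Tier::sse42}, {"avx", Tier::avx},
        {"avx2", Tier::avx2},   {"avx512bw", Tier::avx512bw},
    };
    for (const auto& k : known) {
        if (std::strcmp(name, k.name) == 0) { *out = k.tier; return true; }
    }
    return false;
}

// Apply the override. `requested` is the raw environment value (may be null).
// Assumption: rejected requests keep the detected tier.
inline Tier apply_force_tier(const char* requested, Tier detected) {
    if (requested == nullptr || *requested == '\0') return detected;
    Tier want = detected;
    if (!parse_tier(requested, &want)) {
        std::fprintf(stderr, "warning: unknown BWAMEM3_FORCE_TIER '%s' ignored\n", requested);
        return detected;
    }
    if (want > detected) {
        // Running a tier the host lacks would raise SIGILL, so refuse upgrades.
        std::fprintf(stderr, "warning: BWAMEM3_FORCE_TIER '%s' exceeds host capability\n", requested);
        return detected;
    }
    return want;
}
```

A caller would pass `std::getenv("BWAMEM3_FORCE_TIER")` (from `<cstdlib>`) as `requested`, after host detection has produced `detected`.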

Use bwa-mem3 version to read the resolved tier without running an alignment:

v0.2.0
SIMD floor:   avx2 (x86-64-v3, Haswell 2013+); kernels: sse41 sse42 avx avx2 avx512bw
SIMD runtime: avx512bw (BWAMEM3_FORCE_TIER unset)

Why in-process dispatch, not separate binaries

The pre-PR-#83 design shipped six binaries (one launcher plus one per ISA tier), and the launcher exec'd the matching tier binary at startup. That worked but cost ~120 MB on disk, required all six binaries to be present in the same directory, and made BWAMEM3_FORCE_TIER impossible without re-exec’ing a different file. The current single-binary design keeps the per-tier compile granularity for the hand-tuned kernel TUs while collapsing distribution to one file (~25 MB), and adds a runtime tier override and a clean host-floor precheck. Indirect-call overhead is the only trade-off, and it is below the measurement noise floor on every architecture in the bench matrix.


See also: Performance overview · PGO build · Host requirements · Developer Guide — SIMD dispatch architecture · Developer Guide — Single-binary SIMD dispatch (x86) · Developer Guide — Apple Silicon / NEON port · BASELINE_ARCH=avx512bw build flag