SIMD Dispatch Matrix

bwa-mem3 uses a multi-binary dispatch strategy on x86: the bwa-mem3 launcher reads the CPU’s CPUID feature bits at startup, then calls execv on the highest-capability variant binary found on disk. On ARM there is only one NEON level, so the launcher execs bwa-mem3.arm64 directly without any feature check.

Dispatch flowchart

```mermaid
flowchart TD
    A[bwa-mem3 launcher starts] --> B{Platform?}
    B -- ARM / aarch64 --> C[exec bwa-mem3.arm64]
    B -- x86 --> D[read CPUID via __cpuidex]
    D --> E{AVX-512BW supported?}
    E -- yes --> F[exec bwa-mem3.avx512bw]
    E -- no --> G{AVX2 supported?}
    G -- yes --> H[exec bwa-mem3.avx2]
    G -- no --> I{AVX supported?}
    I -- yes --> J[exec bwa-mem3.avx]
    I -- no --> K{SSE4.2 supported?}
    K -- yes --> L[exec bwa-mem3.sse42]
    K -- no --> M{SSE4.1 supported?}
    M -- yes --> N[exec bwa-mem3.sse41]
    M -- no --> O[error: no supported variant found]
```

The launcher reads CPUID leaf 1 (for the SSE4.1, SSE4.2, and AVX flags) and leaf 7 (for the AVX2 and AVX-512 flags). It checks variants in descending capability order and execs the first one that the CPU supports and whose binary it finds on disk. If no variant binary is executable, it exits with an error.
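The descending-order check can be sketched in C as follows. This is a minimal illustration, not the launcher's actual source: the names `pick_variant`, `detect_variant`, and `suffix_of` are hypothetical, and it uses the GCC/Clang `<cpuid.h>` helpers rather than `__cpuidex`. The bit positions are the ones documented for CPUID leaf 1 (ECX) and leaf 7 subleaf 0 (EBX). A production check would normally also confirm via XGETBV that the OS saves the wider register state; that step is omitted here.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helpers: __get_cpuid, __get_cpuid_count */
#endif

/* CPUID feature bits. */
#define BIT_SSE41    (1u << 19)  /* leaf 1, ECX */
#define BIT_SSE42    (1u << 20)  /* leaf 1, ECX */
#define BIT_AVX      (1u << 28)  /* leaf 1, ECX */
#define BIT_AVX2     (1u << 5)   /* leaf 7 subleaf 0, EBX */
#define BIT_AVX512F  (1u << 16)  /* leaf 7 subleaf 0, EBX */
#define BIT_AVX512BW (1u << 30)  /* leaf 7 subleaf 0, EBX */

/* Hypothetical enum mirroring the five x86 variants, lowest to highest. */
enum variant { V_NONE = -1, V_SSE41, V_SSE42, V_AVX, V_AVX2, V_AVX512BW, V_COUNT };

static const char *const variant_suffix[V_COUNT] = {
    "sse41", "sse42", "avx", "avx2", "avx512bw"
};

/* Map raw register values to the highest matching variant. */
enum variant pick_variant(uint32_t leaf1_ecx, uint32_t leaf7_ebx)
{
    if ((leaf7_ebx & BIT_AVX512F) && (leaf7_ebx & BIT_AVX512BW)) return V_AVX512BW;
    if (leaf7_ebx & BIT_AVX2)  return V_AVX2;
    if (leaf1_ecx & BIT_AVX)   return V_AVX;
    if (leaf1_ecx & BIT_SSE42) return V_SSE42;
    if (leaf1_ecx & BIT_SSE41) return V_SSE41;
    return V_NONE;
}

/* File-name suffix for a valid variant, e.g. "avx2" for bwa-mem3.avx2. */
const char *suffix_of(enum variant v)
{
    assert(v > V_NONE && v < V_COUNT);
    return variant_suffix[v];
}

/* Read both leaves on x86; report V_NONE on any other platform. */
enum variant detect_variant(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned eax, ebx, ecx, edx;
    uint32_t leaf1_ecx = 0, leaf7_ebx = 0;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) leaf1_ecx = ecx;
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) leaf7_ebx = ebx;
    return pick_variant(leaf1_ecx, leaf7_ebx);
#else
    return V_NONE;
#endif
}
```

Keeping the bit decoding in `pick_variant` separate from the CPUID read makes the priority order testable on any machine, including ARM hosts.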

The ARM path always tries bwa-mem3.arm64 first, then falls back to the bare bwa-mem3 name (which on ARM is a symlink to bwa-mem3.arm64 created by make arm64).
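The "try one name, then fall back to the next" behaviour on both platforms can be sketched as an ordered probe of candidate file names. The helper below is an assumption for illustration (`first_executable` is not a name from the bwa-mem3 source); it relies only on the POSIX `access` call.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>   /* access, X_OK (POSIX) */

/* Return the index of the first name in `names` for which `dir/name`
 * passes access(X_OK), or -1 if none does. On success `out` holds the
 * full path; on failure its contents are unspecified.
 * Note: access(X_OK) also succeeds for searchable directories, so a
 * stricter launcher might add a stat() check. */
int first_executable(const char *dir, const char *const names[], int n,
                     char *out, size_t out_size)
{
    assert(dir != NULL && names != NULL && out != NULL);
    for (int i = 0; i < n; i++) {
        int len = snprintf(out, out_size, "%s/%s", dir, names[i]);
        if (len < 0 || (size_t)len >= out_size)
            continue;               /* path did not fit: skip this candidate */
        if (access(out, X_OK) == 0)
            return i;
    }
    return -1;
}
```

On ARM the candidate list would be `{"bwa-mem3.arm64", "bwa-mem3"}`; on x86 it would be the CPU-supported variants in descending order. After a successful probe, a launcher would pass `out` to `execv`, which only returns if the exec itself fails.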

Building the variant binaries

make multi builds all five x86 variants and the launcher in sequence:

```
make multi
```

This produces:

| Filename | Arch flags | Minimum CPU (Intel / AMD) |
| --- | --- | --- |
| bwa-mem3.sse41 | -msse4.1 | Penryn (2007) / Bulldozer (2011) |
| bwa-mem3.sse42 | -msse4.2 | Nehalem (2008) / Bulldozer (2011) |
| bwa-mem3.avx | -mavx | Sandy Bridge (2011) / Bulldozer (2011) |
| bwa-mem3.avx2 | -mavx2 | Haswell (2013) / Excavator (2015) |
| bwa-mem3.avx512bw | -mavx512f -mavx512bw | Skylake-X (2017) / Zen 4 (2022) |

For ARM builds, make arm64 produces a single binary and creates the symlink:

| Filename | Arch flags | Platform |
| --- | --- | --- |
| bwa-mem3.arm64 | -DAPPLE_SILICON=1 + sse2neon shim | Any aarch64 / Apple Silicon |

To build a single-arch binary for a known target (e.g. for a cluster with uniform hardware):

```
make arch=avx2
```

The resulting binary is named bwa-mem3 and contains only AVX2 code. The launcher is not built; it is not needed when the target ISA is fixed.

Kernel vectorization coverage

The table below lists the kernels that have SIMD implementations and which ISA levels they cover.

| Kernel | SSE4.1 | SSE4.2 | AVX | AVX2 | AVX-512BW | NEON (arm64) |
| --- | --- | --- | --- | --- | --- | --- |
| kswv (vectorized Smith-Waterman) | 8-wide int16 | 8-wide int16 | 8-wide int16 | 16-wide int16 | 32-wide int16 | 8-wide int16 (native) |
| bandedSWA (banded alignment) | SSE2 baseline | SSE2 baseline | SSE2 baseline | SSE2 baseline | SSE2 baseline | native NEON blendv |
| FM-index lookup (FMI_search) | SSE popcount | SSE popcount | SSE popcount | SSE popcount | SSE popcount | __builtin_popcountl |
| libsais BWT construction | scalar | scalar | scalar | OpenMP parallel | OpenMP parallel | OpenMP parallel |
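The kswv lane counts in the table follow directly from the register width divided by the 16-bit element size. The short sketch below spells out that arithmetic; `lanes` is a hypothetical helper used only for illustration.

```c
#include <assert.h>

/* Number of elements of `elem_bits` width that fit in one vector register. */
int lanes(int register_bits, int elem_bits)
{
    assert(elem_bits > 0 && register_bits % elem_bits == 0);
    return register_bits / elem_bits;
}

/* For int16 elements:
 *   lanes(128, 16) ==  8   SSE4.1, SSE4.2, NEON (128-bit registers)
 *   lanes(256, 16) == 16   AVX2 (256-bit integer operations)
 *   lanes(512, 16) == 32   AVX-512BW (512-bit operations on 16-bit elements)
 */
```

The AVX column stays at 8 lanes because AVX without AVX2 provides 256-bit operations only for floating point; 256-bit integer arithmetic arrived with AVX2.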

Note — FM-index is memory-bound

The FM-index backward-extension loop is limited by pointer-chasing through the cp_occ arrays, not by computation. Additional SIMD width does not increase throughput here. See the Apple Silicon optimization log in Developer Guide — Apple Silicon / NEON port for the profiling evidence.
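To see why wider SIMD does not help, consider a simplified checkpointed occurrence count. The layout and the names `checkpoint`, `build_checkpoints`, `occ`, and `backward_step` are assumptions made for this sketch (a plain unidirectional FM-index over a 4-symbol alphabet), not bwa-mem3's actual cp_occ structure. Each lookup is one load of a 64-byte record plus a single popcount, and the address of the next lookup depends on the result of the current one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical checkpoint covering 64 BWT positions (64 bytes in total). */
typedef struct {
    int64_t  count[4];  /* occurrences of each symbol before this block */
    uint64_t bits[4];   /* one-hot bitmap of each symbol inside the block */
} checkpoint;

/* Fill n/64 + 1 checkpoints from a BWT whose symbols are codes 0..3. */
void build_checkpoints(const uint8_t *bwt, size_t n, checkpoint *cp)
{
    int64_t running[4] = {0, 0, 0, 0};
    size_t nblocks = n / 64 + 1;
    for (size_t b = 0; b < nblocks; b++) {
        for (int c = 0; c < 4; c++) {
            cp[b].count[c] = running[c];
            cp[b].bits[c] = 0;
        }
        for (size_t j = 0; j < 64 && b * 64 + j < n; j++) {
            uint8_t c = bwt[b * 64 + j];
            assert(c < 4);
            cp[b].bits[c] |= 1ULL << j;
            running[c]++;
        }
    }
}

/* Number of occurrences of symbol c in bwt[0..i). */
int64_t occ(const checkpoint *cp, int c, size_t i)
{
    const checkpoint *p = &cp[i >> 6];        /* the load that tends to miss cache */
    uint64_t mask = (1ULL << (i & 63)) - 1;   /* keep positions below i in the block */
    return p->count[c] + __builtin_popcountll(p->bits[c] & mask);
}

/* One backward-search step on the half-open interval [lo, hi). The new
 * bounds come out of occ(), so the checkpoint addresses for the next step
 * are unknown until these loads complete. C[c] is the number of symbols
 * in the text that sort before c. */
void backward_step(const checkpoint *cp, const int64_t C[4], int c,
                   size_t *lo, size_t *hi)
{
    *lo = (size_t)(C[c] + occ(cp, c, *lo));
    *hi = (size_t)(C[c] + occ(cp, c, *hi));
}
```

The arithmetic per step is a mask, an AND, and a popcount; the cost is dominated by waiting for `cp[i >> 6]` to arrive from memory, which vector width cannot shorten.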

Why the launcher uses execv, not a function pointer

The multi-binary design was inherited from bwa-mem2. Because each variant is built as a separate binary with its own arch flags, the compiler can use the target ISA’s full instruction set throughout: not just in hand-vectorized loops but also in auto-vectorized loops, register allocation, and branch heuristics. A single-binary dispatcher that calls ISA-specific function pointers achieves the same for hand-written kernels but leaves the compiler’s auto-vectorization gated at the baseline ISA everywhere else. For a workload with this many scalar loops, the execv approach yields a measurable difference. On ARM, all CPUs have the same NEON level, so a single binary is sufficient.
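For contrast, the function-pointer alternative looks roughly like the sketch below. It is an illustration of the rejected design, not code from bwa-mem3; the names `sum_baseline`, `sum_avx2`, and `select_sum` are hypothetical, and it assumes GCC or Clang for the `target` attribute and the `__builtin_cpu_supports` built-in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Baseline version: compiled with whatever -m flags the whole file gets. */
static int64_t sum_baseline(const int16_t *v, size_t n)
{
    assert(v != NULL || n == 0);
    int64_t s = 0;
    for (size_t i = 0; i < n; i++) s += v[i];
    return s;
}

#if defined(__x86_64__) && defined(__GNUC__)
/* Same loop, but only this one function may be compiled to AVX2 code. */
__attribute__((target("avx2")))
static int64_t sum_avx2(const int16_t *v, size_t n)
{
    int64_t s = 0;
    for (size_t i = 0; i < n; i++) s += v[i];
    return s;
}
#endif

typedef int64_t (*sum_fn)(const int16_t *, size_t);

/* Choose an implementation once, at run time. */
sum_fn select_sum(void)
{
#if defined(__x86_64__) && defined(__GNUC__)
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return sum_avx2;
#endif
    return sum_baseline;
}
```

Only `sum_avx2` benefits from the wider ISA here. Every other function in the binary, including `select_sum` and all callers, is still compiled for the baseline, which is the limitation described above. In the execv design the whole variant binary is compiled with the variant's arch flags, so no per-function annotation is needed.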


See also: Performance overview · PGO build · Developer Guide — SIMD dispatch architecture · Developer Guide — Multi-binary launcher (x86) · Developer Guide — Apple Silicon / NEON port