SIMD dispatch architecture
bwa-mem3 uses two complementary mechanisms to run the best available
SIMD code path at run time: in-process tier dispatch on x86
(handled separately in
Single-binary SIMD dispatch (x86)) and compile-time
conditional compilation inside each kernel translation unit,
mediated by src/simd_compat.h and src/kernel_dispatch.h.
This page covers the compile-time layer: what the macros do, which
kernels are vectorised at each ISA level, and how the dispatch
decision flows from main() to a tier-specific kernel instruction.
The simd_compat.h abstraction layer
src/simd_compat.h is the single point where platform detection and
intrinsic selection occur. It is included by every file that touches
SIMD code. The header resolves to one of four paths:
| Platform | Branch condition | Intrinsic headers |
|---|---|---|
| ARM / Apple Silicon | __ARM_NEON or __aarch64__ | sse2neon.h (translation) + <arm_neon.h> (native) |
| x86 AVX-512BW | __AVX512BW__ | <immintrin.h> |
| x86 AVX2 | __AVX2__ | <immintrin.h> |
| x86 SSE4.1 / SSE2 | __SSE4_1__ or __SSE2__ | <smmintrin.h> + <emmintrin.h> |
The ARM path defines APPLE_SILICON 1, sets SIMD_WIDTH8 = 16 and
SIMD_WIDTH16 = 8 (128-bit NEON lanes), defines a
posix_memalign-backed _mm_malloc replacement that enforces the
128-byte Apple Silicon cache-line alignment, and provides two
optimised NEON helpers that sse2neon does not generate efficiently:
- _mm_movemask_epi16: extracts the MSB of each 16-bit element using vshrq_n_u16 + vmovn_u16 + position-weighted vaddv_u8, replacing the _mm_movemask_epi8(v) & 0xAAAA pattern used in bandedSWA.cpp.
- _mm_blendv_epi16_fast: a bitwise select on 16-bit elements via NEON vbslq_s16, replacing the OR/AND/ANDNOT sequence sse2neon emits for _mm_blendv_epi8.
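The relationship between the two mask forms can be checked with a scalar model. The sketch below is not the NEON implementation: it uses a plain array as a stand-in for one 128-bit vector, and it assumes the 16-bit helper returns a compact 8-bit mask with bit i holding the MSB of lane i (the source does not state the exact bit layout). All model_* and spread_* names are hypothetical.

```cpp
#include <array>
#include <cstdint>

// Scalar stand-in for one 128-bit vector of eight 16-bit lanes.
using Lanes16 = std::array<std::int16_t, 8>;

// Model of _mm_movemask_epi8: bit b is the MSB of byte b (little-endian lanes),
// so lane i contributes bit 2i (low byte) and bit 2i+1 (high byte).
inline std::uint32_t model_movemask_epi8(const Lanes16& v) {
    std::uint32_t mask = 0;
    for (int lane = 0; lane < 8; ++lane) {
        const auto u = static_cast<std::uint16_t>(v[lane]);
        mask |= ((u >> 7) & 1u) << (2 * lane);       // MSB of the low byte
        mask |= ((u >> 15) & 1u) << (2 * lane + 1);  // MSB of the high byte
    }
    return mask;
}

// Assumed semantics of the 16-bit helper: bit i is the sign bit of lane i.
inline std::uint32_t model_movemask_epi16(const Lanes16& v) {
    std::uint32_t mask = 0;
    for (int lane = 0; lane < 8; ++lane)
        mask |= ((static_cast<std::uint16_t>(v[lane]) >> 15) & 1u) << lane;
    return mask;
}

// Moves bit i of a compact mask to bit 2i+1, the layout that
// (movemask_epi8 & 0xAAAA) produces.
inline std::uint32_t spread_to_odd_bits(std::uint32_t compact) {
    std::uint32_t out = 0;
    for (int i = 0; i < 8; ++i)
        out |= ((compact >> i) & 1u) << (2 * i + 1);
    return out;
}
```

Under that assumption, masking the byte-wise result with 0xAAAA and spreading the compact 16-bit mask carry the same information, which is why one can replace the other as long as the caller's comparisons are adjusted to the compact layout.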
SIMD_WIDTH8 and SIMD_WIDTH16 control the lane counts in kswv.cpp
and bandedSWA.cpp. The macros differ per ISA level:
| ISA | SIMD_WIDTH8 | SIMD_WIDTH16 |
|---|---|---|
| SSE4.1 | 16 | 8 |
| AVX2 | 32 | 16 |
| AVX-512BW | 64 | 32 |
| ARM NEON | 16 | 8 |
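The width table follows directly from the vector size in bytes. A minimal sketch of how such a compile-time ladder might look is shown below; the DEMO_* macro names and padded_count are hypothetical so they cannot collide with the real definitions in simd_compat.h, and the final branch doubles as a fallback so the sketch compiles anywhere.

```cpp
#include <cassert>
#include <cstddef>

// Same branch order as the platform table: ARM first, then widest x86 tier down.
#if defined(__ARM_NEON) || defined(__aarch64__)
#  define DEMO_VECTOR_BYTES 16   // 128-bit NEON
#elif defined(__AVX512BW__)
#  define DEMO_VECTOR_BYTES 64   // 512-bit
#elif defined(__AVX2__)
#  define DEMO_VECTOR_BYTES 32   // 256-bit
#else
#  define DEMO_VECTOR_BYTES 16   // SSE4.1 / SSE2, and the fallback for this sketch
#endif

#define DEMO_SIMD_WIDTH8  (DEMO_VECTOR_BYTES / 1)  // 8-bit lanes per vector
#define DEMO_SIMD_WIDTH16 (DEMO_VECTOR_BYTES / 2)  // 16-bit lanes per vector

static_assert(DEMO_SIMD_WIDTH8 == 2 * DEMO_SIMD_WIDTH16,
              "a vector holds twice as many 8-bit lanes as 16-bit lanes");

// Rounds a batch of n sequence pairs up to a whole number of vectors.
inline std::size_t padded_count(std::size_t n, std::size_t lanes) {
    assert(lanes > 0);
    return (n + lanes - 1) / lanes * lanes;
}
```

For example, padded_count(17, 8) is 24: a 17-pair batch at 8 lanes needs three full vectors.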
Per-tier compilation and symbol mangling
On x86 the four kernel translation units listed in KERNEL_SRCS
(bandedSWA.cpp, kswv.cpp, ksw.cpp, sam_encode.cpp) are
compiled five times each — once per supported tier
(sse41 / sse42 / avx / avx2 / avx512bw) — with tier-specific
-m... flags. src/kernel_dispatch.h is a preprocessor-only header
that renames each exported kernel symbol per a
KERNEL_VARIANT=_<tier> macro, so the five tier compiles produce
non-colliding symbols that all link into one binary.
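The renaming step can be illustrated with ordinary token pasting. This is a sketch of the general technique, not the contents of kernel_dispatch.h: the DEMO_* macros and demo_kernel are hypothetical, and KERNEL_VARIANT is given a default here only so the example compiles without a -D flag.

```cpp
#include <string>

// Assumption: the build would pass -DKERNEL_VARIANT=_<tier> for each tier compile.
#ifndef KERNEL_VARIANT
#define KERNEL_VARIANT _avx2
#endif

// Two-level paste so KERNEL_VARIANT is expanded before ## is applied.
#define DEMO_PASTE2(a, b) a##b
#define DEMO_PASTE(a, b)  DEMO_PASTE2(a, b)
#define DEMO_TIERED(name) DEMO_PASTE(name, KERNEL_VARIANT)

// Two-level stringify, for printing the mangled name.
#define DEMO_STR2(x) #x
#define DEMO_STR(x)  DEMO_STR2(x)

// With the default above this defines demo_kernel_avx2 and returns its own name.
extern "C" const char* DEMO_TIERED(demo_kernel)(void) {
    return DEMO_STR(DEMO_TIERED(demo_kernel));
}

// Convenience wrapper for the sketch: calls whichever variant this TU defined.
inline std::string demo_kernel_name() {
    return DEMO_TIERED(demo_kernel)();
}
```

Compiling the same source again with a different KERNEL_VARIANT would define a differently named function, which is what lets several tier objects link together.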
bandedSWA.h adds an abstract IBandedPairWiseSW interface;
BandedPairWiseSW is final and inherits from it. kswv.h mirrors
this with Ikswv. Each per-tier kernel TU exports a C-linkage factory
function (make_bsw_kernel_<tier>, make_kswv_kernel_<tier>) that
returns a std::unique_ptr<I*> to the tier-specific concrete class.
The dispatcher in src/simd_dispatch.cpp switches on g_tier and
calls the matching factory; the call sites in bwamem.cpp and
bwamem_pair.cpp see only the interface. This separation keeps the
dispatcher TU free of class-layout knowledge and sidesteps the ODR
risk that would arise from each tier’s compile pulling in a
differently-laid-out concrete class definition.
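The interface-plus-factory arrangement described above can be sketched as follows. Everything here is a hypothetical stand-in (IDemoKernel, the DemoKernel* classes, make_demo_kernel*); the sketch keeps all classes in one file and, as a simplification, lets the three 128-bit tiers share one concrete class, whereas the real build exports one factory per tier from separate translation units.

```cpp
#include <memory>
#include <stdexcept>

enum class Tier { sse41, sse42, avx, avx2, avx512bw };

// Call sites see only this interface (stand-in for IBandedPairWiseSW / Ikswv).
struct IDemoKernel {
    virtual ~IDemoKernel() = default;
    virtual int lanes16() const = 0;  // 16-bit lanes processed per vector
};

// Concrete classes stay private to their TU; here an anonymous namespace plays that role.
namespace {
struct DemoKernel128 final : IDemoKernel { int lanes16() const override { return 8; } };
struct DemoKernel256 final : IDemoKernel { int lanes16() const override { return 16; } };
struct DemoKernel512 final : IDemoKernel { int lanes16() const override { return 32; } };
}  // namespace

// Per-tier factories: the only symbols the dispatcher needs to know about.
std::unique_ptr<IDemoKernel> make_demo_kernel_128() { return std::make_unique<DemoKernel128>(); }
std::unique_ptr<IDemoKernel> make_demo_kernel_256() { return std::make_unique<DemoKernel256>(); }
std::unique_ptr<IDemoKernel> make_demo_kernel_512() { return std::make_unique<DemoKernel512>(); }

// Dispatcher: switches on the tier and never touches a concrete class layout.
std::unique_ptr<IDemoKernel> make_demo_kernel(Tier tier) {
    switch (tier) {
        case Tier::sse41:
        case Tier::sse42:
        case Tier::avx:      return make_demo_kernel_128();
        case Tier::avx2:     return make_demo_kernel_256();
        case Tier::avx512bw: return make_demo_kernel_512();
    }
    throw std::logic_error("unknown tier");
}
```

A caller would hold the returned pointer and use only the virtual interface, for example make_demo_kernel(Tier::avx2)->lanes16().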
The free-function ksw_* family (ksw_extend2, ksw_global2,
ksw_extend, ksw_global, ksw_align2, ksw_align) is dispatched
through thin extern "C" wrappers in simd_dispatch.cpp that switch
on g_tier and tail-call the matching mangled per-tier symbol.
Internal aux helpers in ksw.cpp (ksw_qinit, ksw_u8, ksw_i16)
are forced static so the five tier compiles do not multi-define
them. The SAM seq/qual encoder previously inlined in bwamem.cpp was
lifted into src/sam_encode.{h,cpp} so it also participates in
per-tier compilation.
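For the free functions, the wrapper pattern might look like the sketch below. The names (demo_extend*, g_demo_tier, demo_clamp) are hypothetical, only three tiers are shown, and the call counters exist purely so the sketch can show which variant ran; none of this is taken from simd_dispatch.cpp.

```cpp
// Hypothetical stand-ins for the real tier enum and g_tier.
enum DemoTier { DEMO_SSE41 = 0, DEMO_AVX2 = 1, DEMO_AVX512BW = 2, DEMO_TIER_COUNT = 3 };
static DemoTier g_demo_tier = DEMO_SSE41;
static int g_demo_calls[DEMO_TIER_COUNT] = {0, 0, 0};  // instrumentation for the sketch only

// Internal helper kept static: if this source were compiled once per tier,
// each object would get its own private copy instead of a duplicate external symbol.
static int demo_clamp(int x) { return x < 0 ? 0 : x; }

extern "C" {

// Stand-ins for the mangled per-tier symbols (identical bodies in this sketch).
int demo_extend_sse41(int h, int s)    { ++g_demo_calls[DEMO_SSE41];    return demo_clamp(h + s); }
int demo_extend_avx2(int h, int s)     { ++g_demo_calls[DEMO_AVX2];     return demo_clamp(h + s); }
int demo_extend_avx512bw(int h, int s) { ++g_demo_calls[DEMO_AVX512BW]; return demo_clamp(h + s); }

// Tier-agnostic entry point: the only name non-kernel code would call.
int demo_extend(int h, int s) {
    switch (g_demo_tier) {
        case DEMO_AVX512BW: return demo_extend_avx512bw(h, s);
        case DEMO_AVX2:     return demo_extend_avx2(h, s);
        default:            return demo_extend_sse41(h, s);
    }
}

}  // extern "C"
```

Because each case is a plain return of the callee's result, a compiler is free to turn the call into a tail call, so the wrapper adds little beyond the switch.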
All non-kernel TUs (bwamem.cpp, bwamem_pair.cpp, fastmap.cpp,
FMI_search.cpp, bntseq.cpp, …) compile once at the
BASELINE_ARCH tier (default avx2, set by the make line). They
call into the dispatcher’s tier-agnostic entry points, which fan out
to the per-tier kernels at run time. See
Single-binary SIMD dispatch (x86) for the runtime
selection and override semantics, and
BASELINE_ARCH=avx512bw build flag
for why non-kernel TUs do not auto-vectorize at 512-bit by default.
On arm64 there is one NEON tier and one kernel compile per TU; the dispatch tables collapse to single-entry switches and the per-tier mangling layer is a no-op.
Dispatch diagram
The full dispatch decision, from the shell to a kernel instruction, follows this flow:
flowchart TD
A[User runs: bwa-mem3 mem ...] --> B{Platform}
B -- ARM / Apple Silicon --> C[bwa-mem3 main, single NEON kernel TU]
B -- x86 --> D[bwa-mem3 main, calls bwamem3_simd_init in src/simd_dispatch.cpp]
D --> E{__builtin_cpu_supports + BWAMEM3_FORCE_TIER}
E -- AVX-512BW --> F1[g_tier = avx512bw]
E -- AVX2 --> F2[g_tier = avx2]
E -- AVX --> F3[g_tier = avx]
E -- SSE4.2 --> F4[g_tier = sse42]
E -- SSE4.1 --> F5[g_tier = sse41]
F1 & F2 & F3 & F4 & F5 --> G[Non-kernel TUs run\nat BASELINE_ARCH tier]
C --> G
G --> H{Kernel call}
H -- kswv\nbatched SW --> I[per-tier kswv.<tier>.o\nvia make_kswv_kernel_<tier>]
H -- bandedSWA\nmate-rescue --> J[per-tier bandedSWA.<tier>.o\nvia make_bsw_kernel_<tier>]
H -- ksw_align2 etc.\nfree functions --> K[per-tier ksw.<tier>.o\nvia extern-C wrapper in simd_dispatch.cpp]
H -- sam_encode --> L[per-tier sam_encode.<tier>.o]
H -- FMI_search\nbackward extension --> M[FMI_search.cpp\n__builtin_popcountl — not SIMD]
H -- libsais\nBWT construction --> N[libsais.c\nOpenMP parallel SA-IS]
I --> O[SIMD instructions\nat the dispatched tier]
J --> O
K --> O
L --> O
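The x86 branch of the diagram can be sketched in code. This is an illustration under stated assumptions, not the logic of bwamem3_simd_init: the function names are hypothetical, the accepted override strings are assumed to match the tier names used on this page, and the policy of ignoring an override above the detected tier is a guess (the actual override semantics are documented in Single-binary SIMD dispatch (x86)). __builtin_cpu_init and __builtin_cpu_supports are GCC/Clang builtins on x86; on other targets the sketch falls back to the lowest tier.

```cpp
#include <cstdlib>
#include <cstring>

enum class Tier { sse41 = 0, sse42, avx, avx2, avx512bw };

// Highest tier the running CPU reports, probing from widest to narrowest.
inline Tier detect_cpu_tier() {
#if (defined(__x86_64__) || defined(__i386__)) && (defined(__GNUC__) || defined(__clang__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx512bw")) return Tier::avx512bw;
    if (__builtin_cpu_supports("avx2"))     return Tier::avx2;
    if (__builtin_cpu_supports("avx"))      return Tier::avx;
    if (__builtin_cpu_supports("sse4.2"))   return Tier::sse42;
#endif
    return Tier::sse41;  // lowest tier; also the non-x86 fallback in this sketch
}

// Maps a tier name to its enum value; returns false for unrecognised names.
inline bool parse_tier(const char* name, Tier& out) {
    static const struct { const char* name; Tier tier; } kNames[] = {
        {"sse41", Tier::sse41}, {"sse42", Tier::sse42}, {"avx", Tier::avx},
        {"avx2", Tier::avx2},   {"avx512bw", Tier::avx512bw},
    };
    for (const auto& entry : kNames)
        if (std::strcmp(name, entry.name) == 0) { out = entry.tier; return true; }
    return false;
}

// Assumed policy: honour a forced tier only if the CPU can actually run it.
inline Tier choose_tier(Tier detected, const char* forced) {
    Tier requested;
    if (forced != nullptr && parse_tier(forced, requested) && requested <= detected)
        return requested;
    return detected;
}
```

A start-up call would then be choose_tier(detect_cpu_tier(), std::getenv("BWAMEM3_FORCE_TIER")), with the result stored once and read by every dispatch switch.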
Per-kernel vectorisation status
| Kernel | SSE4.1 | SSE4.2 | AVX | AVX2 | AVX-512BW | ARM NEON |
|---|---|---|---|---|---|---|
| kswv (batched Smith-Waterman) | 8-wide int16 | 8-wide int16 | 8-wide int16 | 16-wide int16 | 32-wide int16 | 8-wide int16 (native) |
| bandedSWA (banded SW / mate-rescue) | vectorised | vectorised | vectorised | vectorised | vectorised | native NEON blendv |
| ksw_* free functions (SW extension) | per-tier | per-tier | per-tier | per-tier | per-tier | per-tier (NEON) |
| sam_encode (SAM seq/qual encoder) | per-tier | per-tier | per-tier | per-tier | per-tier | per-tier (NEON) |
| FMI_search (FM-index backward ext.) | scalar | scalar | scalar | scalar | scalar | scalar |
| libsais (BWT / SA construction) | OpenMP only | OpenMP only | OpenMP only | OpenMP only | OpenMP only | OpenMP only |
FMI_search is memory-bound with sequential pointer-chasing
dependencies; adding SIMD to it produces no measurable speedup.
libsais benefits from OpenMP-parallel induced sorting but not from
SIMD widening within a single thread.
Adding a new SIMD kernel
- Include simd_compat.h rather than any platform intrinsic header directly.
- Use SIMD_WIDTH8 / SIMD_WIDTH16 for lane-count arithmetic so the code compiles correctly across all ISA levels.
- If the kernel needs per-tier compilation:
  - Add the source to KERNEL_SRCS in the Makefile so the per-tier pattern rules (src/%.<tier>.o) pick it up.
  - Use the KERNEL_VARIANT rename macros from src/kernel_dispatch.h to expose mangled symbols.
  - Export a C-linkage factory or dispatcher entry point from the per-tier TU and add a switch on g_tier in src/simd_dispatch.cpp.
- For ARM-specific optimisations, gate them with #ifdef APPLE_SILICON (or #if defined(__ARM_NEON)) and provide a simd_compat.h-routed fallback for x86.
- Verify correctness on at least SSE4.1 (lowest supported x86 tier) and ARM64 using make test, then run test/regression/all_tiers_parity.sh to confirm byte-identical SAM across every x86 tier under BWAMEM3_FORCE_TIER.
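The parity requirement in the last step amounts to this: changing the lane count must never change the result. A toy illustration is sketched below, using a made-up kernel (element-wise max(a + b, 0)) and hypothetical function names; it only models lane-chunked processing with a scalar tail and says nothing about how all_tiers_parity.sh is implemented.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar reference for the toy kernel: out[i] = max(a[i] + b[i], 0).
inline std::vector<int> toy_reference(const std::vector<int>& a, const std::vector<int>& b) {
    assert(a.size() == b.size());
    std::vector<int> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) out[i] = std::max(a[i] + b[i], 0);
    return out;
}

// Same kernel processed `lanes` elements per step, as a vector loop would,
// followed by a scalar loop for the remainder.
inline std::vector<int> toy_chunked(const std::vector<int>& a, const std::vector<int>& b,
                                    std::size_t lanes) {
    assert(a.size() == b.size() && lanes > 0);
    std::vector<int> out(a.size());
    std::size_t i = 0;
    for (; i + lanes <= a.size(); i += lanes)       // full "vectors"
        for (std::size_t l = 0; l < lanes; ++l)
            out[i + l] = std::max(a[i + l] + b[i + l], 0);
    for (; i < a.size(); ++i)                       // tail shorter than one vector
        out[i] = std::max(a[i] + b[i], 0);
    return out;
}

// True when the 8-, 16- and 32-lane variants all reproduce the reference exactly.
inline bool all_widths_match(const std::vector<int>& a, const std::vector<int>& b) {
    const std::vector<int> ref = toy_reference(a, b);
    for (std::size_t lanes : {std::size_t{8}, std::size_t{16}, std::size_t{32}})
        if (toy_chunked(a, b, lanes) != ref) return false;
    return true;
}
```

Input lengths that are not a multiple of any lane count (37, for instance) are the useful test cases, since they exercise the tail path at every width.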
Tip — Testing SIMD correctness
The kswv unit tests in test/unit/test_kswv*.cpp use synthetic sequence-pair generators that drive edge cases (empty batches, nrow==0, homopolymers) across every SIMD width. Run them with ./test/bwa_mem3_tests_unit --test-suite="unit/kswv" after modifying any vectorised kernel, then loop BWAMEM3_FORCE_TIER over all five tiers in an end-to-end smoke run to catch dispatcher-wiring regressions that the unit tests miss.
See also:
Single-binary SIMD dispatch (x86) ·
Apple Silicon / NEON port ·
Building from source ·
Performance → SIMD dispatch matrix ·
BASELINE_ARCH=avx512bw build flag ·
Regression test framework