Architecture Support
This page covers the architecture-specific build and runtime work carried in bwa-mem3. The goal is a single codebase that builds cleanly on all supported targets and runs the best available SIMD kernels on each.
For the full dispatch matrix and runtime selection logic, see Performance → SIMD dispatch matrix and Developer Guide → SIMD dispatch architecture.
Linux ARM64 / aarch64 build (PR #1)
The Apple Silicon work that reached the fork in commit ae73227 gated ARM
behavior on $(UNAME_M) == arm64. On macOS, uname -m returns arm64. On
Linux ARM64, it returns aarch64. The Makefile’s ifeq check therefore fell
through to the x86 multi target on every Linux aarch64 host, failing with:
g++: error: unrecognized command-line option '-msse'
PR #1 introduces an IS_ARM variable ($(filter $(UNAME_M),arm64 aarch64))
that matches both names. All four architecture-conditional blocks in the
Makefile are rewritten to use IS_ARM: the NEON/sse2neon flag block, the x86
arch-specific block, the ARM64 single-binary build block, and the multi
target ARM64 short-circuit. The CI workflow now also triggers on pushes
to fg-main (the integration branch at the time of PR #1, renamed to main
in the 0.1.0-pre release) and gains an ubuntu-24.04-arm matrix row, so the
aarch64 path is exercised on every PR.
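The matching rule can be modelled outside make. The sketch below is illustrative C++ only, not code from the repository: the helper names are hypothetical, and the functions simply mirror the old single-name check and the new two-name filter against the strings that uname -m reports.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of bwa-mem3): true exactly when the
// Makefile expression $(filter $(UNAME_M),arm64 aarch64) would be non-empty
// for a plain machine name.
inline bool is_arm_machine(const std::string& uname_m) {
    return uname_m == "arm64" || uname_m == "aarch64";
}

// The pre-PR #1 style of check, shown for contrast: only the macOS
// spelling matches, so Linux aarch64 falls through to the x86 path.
inline bool is_arm_machine_legacy(const std::string& uname_m) {
    return uname_m == "arm64";
}

// Usage: both spellings pass the new check; only one passed the old one.
inline void demo_is_arm() {
    assert(is_arm_machine("arm64") && is_arm_machine("aarch64"));
    assert(is_arm_machine_legacy("arm64") && !is_arm_machine_legacy("aarch64"));
}
```

A host reporting x86_64 fails both checks and continues to the x86 blocks, as before.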
arch=avx512bw explicit build target (PR #16)
The AVX-512 Smith-Waterman kernels in bwa-mem2 are guarded by the
__AVX512BW__ preprocessor macro — not __AVX512F__. The only way to
build them before this PR was arch=avx512, but the (then) make multi
rule emitted the dispatch binary as bwa-mem2.avx512bw. The build
selector (avx512), the preprocessor guard (__AVX512BW__), and the
dispatcher suffix (.avx512bw) disagreed.
PR #16 added arch=avx512bw as an explicit Makefile target with flags
-mavx512f -mavx512bw and switched the multi-binary make path to use
it. The legacy arch=avx512 was preserved as an alias with identical
flags. No C++ was changed; the fix was 11 insertions and 2 deletions in
the Makefile.
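To see why the guard macro matters, the following minimal sketch reports which AVX-512 feature macros a translation unit was compiled with. The function names are hypothetical and are not taken from bwa-mem2 or bwa-mem3. Only the __AVX512F__ and __AVX512BW__ macros are real: GCC and Clang define them when the matching -mavx512f and -mavx512bw flags are in effect.

```cpp
#include <cassert>
#include <cstring>

// Reports the AVX-512 tier visible to this translation unit.
inline const char* avx512_tier() {
#if defined(__AVX512BW__)
    return "avx512bw";      // e.g. built with -mavx512f -mavx512bw
#elif defined(__AVX512F__)
    return "avx512f-only";  // foundation only: a __AVX512BW__ guard stays false
#else
    return "none";
#endif
}

// Hypothetical stand-in for a kernel guarded the way the text describes.
inline bool avx512bw_kernel_compiled() {
#if defined(__AVX512BW__)
    return true;
#else
    return false;
#endif
}

// Usage: the guarded kernel is present exactly when the tier is "avx512bw".
inline void check_guard_consistency() {
    const bool is_bw = std::strcmp(avx512_tier(), "avx512bw") == 0;
    assert(is_bw == avx512bw_kernel_compiled());
}
```

Naming the build target after the macro that actually guards the kernels keeps the selector, the guard, and the binary suffix in agreement.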
PR #83 has since replaced the multi-binary scheme with a single binary
that compiles each kernel TU at every supported tier and dispatches
in process; the avx512bw tier name and flag set survived the
transition unchanged, and the arch=avx512bw build target remains the
single-arch fallback for clusters with uniform AVX-512BW hardware. The
pre-#16 mismatch between selector, guard, and suffix is therefore
resolved in both the historical multi-binary layout and the current
single-binary layout.
This is a pure build-correctness fix: before PR #16, arch=avx512bw
and the legacy multi-binary build on AVX-512BW hardware silently
compiled the wrong kernel (see
Correctness → AVX-512BW dispatch guard for the
downstream effect).
NEON kswv mate-rescue (PR #18)
bwa-mem2 has a batched mate-rescue Smith-Waterman path (BWAMEM_BATCHED_MATESW)
that uses SIMD kswv kernels to score rescue candidates in parallel. On ARM64
the gate was __AVX512BW__, which is never true on NEON hardware. The NEON
kswv::getScores8 kernel existed in the source but was unreachable in
production.
PR #18 enables this path on ARM64 by replacing the __AVX512BW__ gate with a
new BWAMEM_BATCHED_MATESW macro that fires on NEON/Apple Silicon as well.
Along the way, four kernel bugs were found and fixed:
- te split: the te (traceback end) value needed separate hi/lo tracking for 16-lane u8 batches.
- Freeze mask: a frozen_vec mask now gates gmax/te/qe updates after KSW_XSTOP fires, preventing stale values from escaping to the score2 scan.
- Per-lane score2 exclusion: len1, low/high, and qe masks were not applied per-lane in Loop 1, allowing lanes without a valid primary to contribute spurious suboptimal scores.
- minsc filter on rowMax: sub-minsc plateau scores were leaking into score2 because the scalar ksw_u8 gating condition (imax >= minsc) was not replicated.
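Two of these fixes can be illustrated with a scalar model. The sketch below is not the NEON kernel: LaneState, apply_row, and is_score2_candidate are hypothetical names, and the stop rule (freeze a lane once its best score reaches a stop score) is a simplifying assumption about how an early-stop flag behaves. It shows only the idea that a batched kernel cannot break out of the loop per lane, so a per-lane freeze flag has to stand in for the scalar early exit.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kLanes = 16;  // illustrative lane count for a u8 batch

struct LaneState {
    std::array<int, kLanes> gmax{};     // best score seen so far, per lane
    std::array<int, kLanes> te{};       // row index where gmax was reached
    std::array<bool, kLanes> frozen{};  // set once the lane reaches its stop score
    LaneState() { te.fill(-1); }
};

// Apply one row of per-lane maxima. A frozen lane ignores all later rows,
// which a scalar loop gets for free by breaking out early.
inline void apply_row(LaneState& s, int row,
                      const std::array<int, kLanes>& row_max, int stop_score) {
    assert(row >= 0);
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
        if (s.frozen[lane]) continue;  // the freeze mask gates the update
        if (row_max[lane] > s.gmax[lane]) {
            s.gmax[lane] = row_max[lane];
            s.te[lane] = row;
        }
        if (s.gmax[lane] >= stop_score) s.frozen[lane] = true;
    }
}

// The scalar gating rule named in the text: a row maximum only becomes a
// score2 candidate when it reaches minsc.
inline bool is_score2_candidate(int row_max, int minsc) {
    return row_max >= minsc;
}
```

Without the frozen check, a lane that had already stopped would keep absorbing later row maxima and report a gmax/te pair the scalar path never produces.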
Measured on an M-series Mac (8 threads, 500k PE 100 bp reads on chr17): 1.42× speedup (−29.4% wall time) with byte-identical sorted SAM output.
AVX2 kswv mate-rescue (PR #20)
PR #18 enabled batched mate-rescue on ARM64. Most x86 production deployments
(AWS c6a, c6i, older Xeons) use AVX2 without AVX-512BW and were excluded from
the same gate. PR #20 extends the batched path to AVX2 by adding a 256-bit
kswv256_u8 kernel and widening BWAMEM_BATCHED_MATESW to fire on __AVX2__.
The AVX2 kernel is a direct port of the corrected NEON kernel from PR #18,
with an additional fix for per-lane te2 tracking (_mm256_blendv_epi8 on a
sign-extended 8→16 bit mask). Verified byte-identical sorted SAM vs the
pre-BWAMEM_BATCHED_MATESW scalar control on EC2 m5.xlarge (Skylake-SP, 4
threads, 500k chr17 PE pairs).
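The reason the mask is sign-extended rather than zero-extended can be shown with a scalar model of a per-byte blend. This is an illustrative sketch with hypothetical helper names, not the kswv256_u8 code. It relies only on the documented behaviour of _mm256_blendv_epi8, which picks each output byte according to the top bit of the corresponding mask byte.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of a per-byte blend over one 16-bit lane: each output byte
// comes from b when the top bit of the matching mask byte is set, else from a.
inline std::uint16_t blend_bytes16(std::uint16_t a, std::uint16_t b,
                                   std::uint16_t mask) {
    std::uint16_t out = 0;
    for (int byte = 0; byte < 2; ++byte) {
        const unsigned byte_bits = 0xFFu << (8 * byte);
        const bool take_b = ((mask >> (8 * byte + 7)) & 1u) != 0;
        out = static_cast<std::uint16_t>(out | ((take_b ? b : a) & byte_bits));
    }
    return out;
}

// Widen an 8-bit lane mask (all zeros or all ones) to 16 bits with sign
// extension, so 0xFF becomes 0xFFFF and both bytes of the lane are selected.
inline std::uint16_t sign_extend_mask(std::uint8_t m) {
    assert(m == 0x00 || m == 0xFF);
    return static_cast<std::uint16_t>(
        static_cast<std::int16_t>(static_cast<std::int8_t>(m)));
}

// Zero extension, for contrast: 0xFF becomes 0x00FF and the high byte of
// the lane keeps its old value.
inline std::uint16_t zero_extend_mask(std::uint8_t m) {
    return static_cast<std::uint16_t>(m);
}
```

With a = 0x0102 and b = 0x0A0B, a sign-extended 0xFF mask yields 0x0A0B, while a zero-extended one yields the mixed value 0x010B, which is how a 16-bit te2 lane could end up half-updated.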
Note: PR #20 introduced a score2 plateau regression in the AVX2 kernel that was identified and fixed in the correctness series (PRs #27, #28, #29).
Changes catalog
| Item | bwa-mem3 PR | Upstream PR/issue | Status |
|---|---|---|---|
| Linux ARM64 / aarch64 build + CI | #1 | bwa-mem2#288 | fork-only (upstream PR open) |
| arch=avx512bw explicit target | #16 | — | fork-only |
| NEON kswv mate-rescue kernel | #18 | — | fork-only |
| AVX2 kswv mate-rescue kernel | #20 | — | fork-only |
See also: Performance → SIMD dispatch matrix · Developer Guide → SIMD dispatch architecture · Developer Guide → Apple Silicon / NEON port · Correctness fixes · Performance → PGO build