Apple Silicon / NEON port

bwa-mem3 supports ARM64 (Apple Silicon and Linux aarch64) as a first-class build target. The port uses the sse2neon translation shim as a baseline and replaces the two most performance-critical SSE paths with native NEON intrinsics.

Architecture overview

The ARM build compiles a single binary with a single NEON kernel translation unit (TU). All current ARM64 CPUs share one NEON instruction-set level, so the per-tier dispatch table used by the x86 single-binary build (see Single-binary SIMD dispatch (x86)) collapses to a one-entry switch on aarch64, and dispatch overhead is effectively zero. make arm64 builds and installs the binary under the bare bwa-mem3 name.

sse2neon shim

ext/sse2neon/sse2neon.h is a header-only library that maps Intel SSE intrinsics to their NEON equivalents. When APPLE_SILICON=1 is defined (set automatically when uname -m is arm64 or aarch64), src/simd_compat.h includes sse2neon and defines the SSE feature test macros (__SSE__ through __SSE4_2__) so that code guarded by those macros compiles without changes.

The translation is not zero-cost for all operations. Two patterns that sse2neon handles poorly are replaced with native NEON in src/simd_compat.h:

  • _mm_movemask_epi16 — used heavily in bandedSWA.cpp to extract the sign bit of each 16-bit lane. The native implementation shifts right by 15, narrows to 8-bit with vmovn_u16, and reduces with position-weighted vaddv_u8.
  • _mm_blendv_epi16_fast — a bitwise select on 16-bit lanes using vbslq_s16. Replaces the three-operation OR/AND/ANDNOT sequence sse2neon emits for _mm_blendv_epi8.
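The two replacements described above can be sketched as follows. This is an illustration written from the description, not a copy of src/simd_compat.h: the function names carry a _sketch suffix to mark them as hypothetical, and the (a, b, mask) argument order for the blend is an assumption borrowed from the SSE blendv convention. The array wrappers movemask16 and blend16 are also hypothetical; they add a scalar fallback so the example compiles on non-ARM hosts.

```cpp
#include <cstdint>

#if defined(__aarch64__) || defined(__arm64__)
#include <arm_neon.h>

// Bit i of the result is the sign bit of 16-bit lane i.
static inline int movemask_epi16_sketch(int16x8_t v) {
    // Logical shift right by 15 leaves 0 or 1 in every lane.
    uint16x8_t sign = vshrq_n_u16(vreinterpretq_u16_s16(v), 15);
    // Narrow to eight 8-bit lanes; values are still 0 or 1.
    uint8x8_t bits = vmovn_u16(sign);
    // Weight lane i by 2^i, then add across the vector (maximum sum is 255).
    static const uint8_t kWeights[8] = {1, 2, 4, 8, 16, 32, 64, 128};
    return vaddv_u8(vmul_u8(bits, vld1_u8(kWeights)));
}

// Bitwise select: bits of b where mask is set, bits of a elsewhere.
static inline int16x8_t blendv_epi16_sketch(int16x8_t a, int16x8_t b, int16x8_t mask) {
    return vbslq_s16(vreinterpretq_u16_s16(mask), b, a);
}
#endif

// Hypothetical array wrapper with a scalar fallback for non-ARM hosts.
static inline int movemask16(const int16_t lanes[8]) {
#if defined(__aarch64__) || defined(__arm64__)
    return movemask_epi16_sketch(vld1q_s16(lanes));
#else
    int m = 0;
    for (int i = 0; i < 8; ++i)
        m |= ((static_cast<uint16_t>(lanes[i]) >> 15) & 1) << i;
    return m;
#endif
}

// Hypothetical array wrapper around the blend, same fallback idea.
static inline void blend16(const int16_t a[8], const int16_t b[8],
                           const int16_t mask[8], int16_t out[8]) {
#if defined(__aarch64__) || defined(__arm64__)
    vst1q_s16(out, blendv_epi16_sketch(vld1q_s16(a), vld1q_s16(b), vld1q_s16(mask)));
#else
    for (int i = 0; i < 8; ++i)
        out[i] = static_cast<int16_t>((b[i] & mask[i]) | (a[i] & ~mask[i]));
#endif
}
```

The blend is a single vbslq_s16 instruction because each mask lane is expected to be all ones or all zeros, so a bitwise select gives the same result as a per-lane select.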

Memory alignment

Apple Silicon uses 128-byte cache lines (versus 64 bytes on x86). simd_compat.h overrides _mm_malloc on ARM to call posix_memalign with a minimum alignment of 128 bytes for all SIMD allocations. CACHE_LINE_BYTES is set to 128 in macro.h when APPLE_SILICON=1.
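A minimal sketch of such an allocator is shown below, assuming a POSIX system. The names simd_aligned_malloc, simd_aligned_free and kMinSimdAlign are hypothetical stand-ins; the actual override in simd_compat.h may differ in naming and error handling.

```cpp
#include <stdlib.h>  // posix_memalign, free
#include <stddef.h>
#include <stdint.h>

// Minimum alignment from the text: one 128-byte Apple Silicon cache line.
static const size_t kMinSimdAlign = 128;

// Hypothetical stand-in for the overridden _mm_malloc.
static inline void* simd_aligned_malloc(size_t size, size_t align) {
    // Raise small alignment requests (for example 16 or 64) to 128 bytes.
    if (align < kMinSimdAlign) align = kMinSimdAlign;
    void* p = nullptr;
    // posix_memalign requires a power-of-two alignment that is a multiple
    // of sizeof(void*); it returns non-zero on failure.
    if (posix_memalign(&p, align, size) != 0) return nullptr;
    return p;
}

// Memory obtained from posix_memalign is released with plain free().
static inline void simd_aligned_free(void* p) { free(p); }
```

Because the floor is applied inside the allocator, call sites that request 16-byte or 64-byte alignment do not need to change to get cache-line-aligned buffers.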

Accelerate.framework

The Makefile links -framework Accelerate on macOS ARM builds. The framework is linked but not used for computation: bwa-mem3’s hot paths (Smith-Waterman, FM-index) do not match the large-matrix / large-vector patterns that BLAS and vDSP target. The link is retained to keep the option open and adds no overhead at runtime.

P-core / E-core detection

src/fastmap.cpp calls HTStatus() on macOS to detect the Apple Silicon microarchitecture. HTStatus() reads the hw.perflevel0.physicalcpu and hw.perflevel1.physicalcpu sysctl keys to report the P-core and E-core counts, and it also reports the L2 cache size (typically 4 MB on M-series chips). This information is printed at startup for diagnostic purposes. The L2 cache size is used to validate the compile-time BATCH_SIZE setting (currently 1024, which is already optimal for a 4 MB L2 cache).
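The sysctl lookup can be sketched as below. This is not the HTStatus() source: CoreInfo, read_sysctl_int and query_core_info are hypothetical names, and hw.l2cachesize is an assumed key, since the text does not say which sysctl supplies the L2 size. Off macOS the helper reports failure so the example still compiles.

```cpp
#include <cstdint>
#include <cstddef>
#if defined(__APPLE__)
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

// Hypothetical result record for the startup diagnostics.
struct CoreInfo {
    bool    valid;     // true when both perflevel keys were readable
    int64_t p_cores;   // hw.perflevel0.physicalcpu
    int64_t e_cores;   // hw.perflevel1.physicalcpu
    int64_t l2_bytes;  // assumed key: hw.l2cachesize
};

// Reads one integer sysctl. Returns false if the key is missing or the
// platform is not macOS.
static bool read_sysctl_int(const char* name, int64_t* out) {
#if defined(__APPLE__)
    // Zero-initialised 8-byte buffer: a 4-byte result lands in the low
    // bytes on little-endian hardware and still reads back correctly.
    int64_t value = 0;
    size_t len = sizeof(value);
    if (sysctlbyname(name, &value, &len, nullptr, 0) != 0) return false;
    *out = value;
    return true;
#else
    (void)name;
    (void)out;
    return false;
#endif
}

static CoreInfo query_core_info() {
    CoreInfo info = {false, 0, 0, 0};
    info.valid = read_sysctl_int("hw.perflevel0.physicalcpu", &info.p_cores) &&
                 read_sysctl_int("hw.perflevel1.physicalcpu", &info.e_cores);
    // Best effort: l2_bytes stays 0 if the assumed key is unavailable.
    read_sysctl_int("hw.l2cachesize", &info.l2_bytes);
    return info;
}
```

A caller would print the three values at startup and fall back to generic thread-count logic when valid is false, for example on an Intel Mac where the perflevel keys do not exist.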

Benchmark results

All measurements use 100K paired-end reads, 5% error rate, 30% indels, chr17 reference, 8 threads, on an M-series Apple Silicon machine.

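| Build                               | Wall-clock (avg, s) | vs. baseline |
|-------------------------------------|---------------------|--------------|
| sse2neon baseline (no native NEON)  | 15.4                |              |
| + native NEON kswv.cpp              | 14.4                | ~7% faster   |
| + native NEON bandedSWA.cpp blendv  | 13.8                | ~4% faster   |
| PGO on top of native NEON           | ~13.4               | ~3% further  |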

The FM-index (FMI_search.cpp) is memory-bound with sequential pointer-chasing dependencies and does not benefit from SIMD. libsais benefits from OpenMP-parallel suffix-array construction but not from SIMD widening within a single thread.

Optimization task summary

| Task                         | Status | Impact | Notes                                                       |
|------------------------------|--------|--------|-------------------------------------------------------------|
| Correctness verification     | done   |        | 200,006 alignments, 0 differences vs. reference             |
| Dynamic L2 cache detection   | done   | ~0%    | 4 MB detected; compile-time BATCH_SIZE=1024 already optimal |
| Native NEON bandedSWA.cpp    | done   | ~4%    | vbsl-based blendv in simd_compat.h                          |
| Per-tier dispatch table      | N/A    | 0%     | Collapses to one entry on ARM (single NEON level)           |
| Accelerate.framework         | done   | ~0%    | Linked; no suitable compute patterns                        |
| M1/M2/M3/M4 detection        | done   | ~0%    | P/E-core counts and L2 cache via sysctl                     |
| Native NEON FMI_search.cpp   | N/A    | 0%     | Memory-bound; SIMD cannot help                              |
| Profile-Guided Optimization  | done   | ~3%    | make pgo-generate / make pgo-use                            |

Building for Apple Silicon

# Standard arm64 build
make arch=arm64

# PGO build (recommended for production on Apple Silicon)
make pgo-generate PGO_ARCH=arm64
./bwa-mem3.pgo-instr mem -t 8 ref.fa r1.fq.gz r2.fq.gz > /dev/null
make pgo-use PGO_ARCH=arm64

The resulting bwa-mem3.pgo binary delivers the full ~10% improvement over the pure sse2neon baseline.

Tip — Recommended production build on Apple Silicon

Use PGO for production deployments. The combined ~10% improvement from native NEON kernels plus PGO is consistent and verified on M-series hardware.

Files modified in the NEON port

  • src/kswv.cpp, src/kswv.h: native NEON batched Smith-Waterman
  • src/bandedSWA.h: SIMD width definitions for ARM
  • src/simd_compat.h: sse2neon integration, aligned allocation, _mm_blendv_epi16_fast, _mm_movemask_epi16
  • src/fastmap.cpp: L2 cache detection, HTStatus() for non-NUMA (macOS)
  • src/macro.h: BATCH_SIZE and CACHE_LINE_BYTES tuning for Apple Silicon
  • Makefile: arm64 target, sse2neon flags, Accelerate linkage, PGO targets

See also: SIMD dispatch architecture · Building from source · Performance → PGO build · Performance → SIMD dispatch matrix · What’s Different → Architecture support