Build

This page describes the recommended build configuration for production use of bwa-mem3.

Choose the right arch target

The default make invocation builds a single multi-tier binary on x86 (or a single NEON binary on arm64). On production clusters where the CPU family is uniform, you can slim the binary down by building one tier only: the per-tier dispatch table is dropped and the binary ships a single kernel path.

# Most modern x86-64 servers (Haswell or later):
make arch=avx2

# Intel Cascade Lake / Sapphire Rapids, AWS c7i/m7i:
make arch=avx512bw

# Apple Silicon / AWS Graviton:
make arch=arm64

Omit arch= if the deployment target is heterogeneous or unknown. The default make produces a single binary that includes every supported x86 tier and dispatches at runtime via __builtin_cpu_supports. To tune the compile baseline for the non-kernel translation units, set BASELINE_ARCH= (default avx2); see Single-binary SIMD dispatch (x86).

See SIMD dispatch matrix for the full list of targets and which kernels each vectorizes.
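One way to pick the arch= value is to inspect the CPU of a host that matches the deployment machines. The following is a minimal sketch, not part of bwa-mem3: the pick_arch helper is hypothetical, it assumes a Linux host whose /proc/cpuinfo lists flags such as avx2 and avx512bw, and it covers only the three targets shown above. It prints the suggested command rather than running it.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with bwa-mem3): map a machine type and a
# CPU flag list to one of the arch= values shown above.
# $1 = machine type (e.g. from `uname -m`), $2 = space-separated CPU flags.
pick_arch() {
    case "$1" in
        aarch64|arm64) echo arm64; return ;;
    esac
    case " $2 " in
        *" avx512bw "*) echo avx512bw ;;
        *" avx2 "*)     echo avx2 ;;
        *)              echo "" ;;   # unknown: use the default multi-tier build
    esac
}

# Assumption: Linux x86 exposes a "flags" line in /proc/cpuinfo.
flags=$(grep -m1 '^flags' /proc/cpuinfo 2>/dev/null | cut -d: -f2)
arch=$(pick_arch "$(uname -m)" "$flags")
if [ -n "$arch" ]; then
    echo "make arch=$arch"
else
    echo "make"
fi
```

If the fleet mixes CPU generations, the flags of one host say nothing about the others, so the default multi-tier build remains the safer choice.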

Use a recent compiler (especially on ARM)

Use the newest C++ compiler available, and on ARM/aarch64 prefer a recent clang. The compiler matters more on ARM than on x86: the aarch64 build runs its SIMD through the sse2neon translation layer rather than hand-written intrinsics, so codegen quality — and therefore throughput — depends heavily on the compiler and its version.

Measured on AWS Graviton4 (c8g.4xlarge, 16 cores), hg38, 5M read pairs, make arch=arm64, best-of-3 CPU-seconds:

Compiler	CPU-seconds	vs gcc 15.2
gcc 15.2	1779	—
clang 22.1	1679	~6% faster

Two takeaways:

clang generally emits better NEON than gcc for sse2neon-translated code — about 6% fewer CPU-seconds here.
Compiler version matters as much as the vendor. A larger ~18% clang-over-gcc gap has been reported against an older gcc (~13); against a modern gcc (15.2) it narrows to ~6%, because recent gcc closed most of the NEON-codegen gap. Bumping the gcc version is often most of the win even without switching to clang.
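The ~6% figure follows directly from the two CPU-second measurements in the table above. As a quick check, the relative saving can be recomputed with awk (saving_pct is a hypothetical helper name used only for this illustration):

```shell
# Hypothetical helper: percentage of CPU-seconds saved.
# $1 = baseline CPU-seconds, $2 = CPU-seconds with the faster compiler.
saving_pct() {
    awk -v base="$1" -v fast="$2" 'BEGIN { printf "%.1f\n", (base - fast) / base * 100 }'
}

# gcc 15.2 (1779 s) versus clang 22.1 (1679 s), values from the table above.
saving_pct 1779 1679   # prints 5.6
```

The same helper can be reused to compare your own best-of-3 timings when evaluating a compiler upgrade.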

If you build the arm64 binary with clang, note the OpenMP runtime changes from libgomp to libomp (llvm-openmp) — see Multi-architecture deployment.
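To see which OpenMP runtime a given build depends on, you can inspect its shared-library dependencies. This is a minimal sketch under stated assumptions: omp_runtime is a hypothetical helper, the binary is assumed to be dynamically linked against its OpenMP runtime, and ldd is assumed to be available (Linux). A statically linked runtime would be reported as none.

```shell
# Hypothetical helper: classify the OpenMP runtime named in `ldd` output
# read from stdin. Prints libgomp, libomp, or none.
omp_runtime() {
    deps=$(cat)
    case "$deps" in
        *libgomp*) echo libgomp ;;
        *libomp*)  echo libomp ;;
        *)         echo none ;;
    esac
}

# Usage on Linux (the path to the binary is an assumption):
#   ldd ./bwa-mem3 | omp_runtime
```

Whichever runtime is reported must also be installed on the deployment hosts.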

Profile-Guided Optimization (PGO)

PGO adds 3–5% throughput on real workloads and is recommended for any installation that runs many alignment jobs against the same reference. It is opt-in — the default make does not use it. The generate → train → use workflow, the PGO_ARCH= selector, PGO_PROFILE_DIR=, and the training-data caveats are all in Performance → PGO build; the Summary below shows the production recipe.

mimalloc

mimalloc is compiled in by default (USE_MIMALLOC=1). The allocator improves multi-threaded throughput by reducing lock contention on malloc and free hot paths. Run bwa-mem3 version to confirm it is active:

bwa-mem3 version
# Expected output includes a line like:
#   mimalloc 3.x.x
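In a deployment script, the same check can be automated. The sketch below is illustrative only: has_mimalloc is a hypothetical helper, and it assumes nothing about the version output beyond the mimalloc line described above. Because it is not stated here whether the version text goes to stdout or stderr, the usage line captures both.

```shell
# Hypothetical helper: succeed if the version text on stdin mentions mimalloc.
has_mimalloc() {
    grep -q 'mimalloc'
}

# Usage, assuming bwa-mem3 is on PATH:
#   bwa-mem3 version 2>&1 | has_mimalloc || echo "warning: mimalloc not reported" >&2
```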

To build without mimalloc (for example, when using AddressSanitizer, or on a system whose own allocator is known to conflict with mimalloc):

make USE_MIMALLOC=0

Summary

For a production installation on a known x86 server with AVX2:

make pgo-generate PGO_ARCH=avx2
./bwa-mem3.pgo-instr.avx2 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null
make pgo-use PGO_ARCH=avx2
# Deploy: bwa-mem3.pgo.avx2

On ARM/aarch64 (Apple Silicon, AWS Graviton), build with a recent clang and apply PGO on top:

make pgo-generate PGO_ARCH=arm64 CXX=clang++
./bwa-mem3.pgo-instr mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null
make pgo-use PGO_ARCH=arm64 CXX=clang++
# Deploy: bwa-mem3.pgo