Build
This page describes the recommended build configuration for production use of bwa-mem3.
Choose the right arch target
The default make invocation builds the multi-binary launcher on x86 (or a
single ARM64 binary on Apple Silicon). For production servers where the CPU
family is known, specify the target explicitly so the compiler can generate
tighter code and the binary does not need the launcher overhead:
# Most modern x86-64 servers (any CPU with AVX2, e.g. Intel Haswell or later):
make arch=avx2
# Intel Cascade Lake / Sapphire Rapids, AWS c7i/m7i:
make arch=avx512bw
# Apple Silicon / AWS Graviton:
make arch=arm64
Omit arch= if the deployment target is heterogeneous or unknown. A plain make
(with no arguments) builds the full multi-binary suite on x86, and the launcher
selects the fastest variant at runtime via cpuid.
See SIMD dispatch matrix for the full list of targets and which kernels each vectorizes.
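When the CPU family of a server is not documented, the choice of arch= value can be derived from the host itself. The following is a minimal sketch, assuming a Linux host where /proc/cpuinfo exposes a "flags" line; pick_arch is a hypothetical helper, not part of bwa-mem3, and it only knows the three arch= values listed above. Run it on the deployment host, not the build host, if the two differ.

```shell
#!/usr/bin/env bash
# Sketch: map a machine type and CPU feature flags to one of the arch= values
# named in this page. pick_arch is a hypothetical helper, not part of bwa-mem3.
pick_arch() {
    local machine="$1"
    local flags=" $2 "          # pad so every flag is surrounded by spaces
    case "$machine" in
        aarch64|arm64) echo "arm64"; return ;;
    esac
    case "$flags" in
        *" avx512bw "*) echo "avx512bw" ;;
        *" avx2 "*)     echo "avx2" ;;
        *)              echo "" ;;   # unknown: fall back to plain `make`
    esac
}

# Linux only: the first "flags" line of /proc/cpuinfo lists CPU extensions.
flags="$(grep -m1 '^flags' /proc/cpuinfo 2>/dev/null || true)"
arch="$(pick_arch "$(uname -m)" "$flags")"
if [ -n "$arch" ]; then
    echo "suggested: make arch=$arch"
else
    echo "suggested: make"
fi
```

An empty result deliberately falls back to the default multi-binary build rather than guessing a target.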
Profile-Guided Optimization (PGO)
PGO typically yields a 3–5% throughput improvement on real workloads. It is opt-in (the standard
make target does not use it), but it is recommended for any installation that
will run many alignment jobs against the same reference.
The workflow has three steps:
# Step 1: Build an instrumented binary (produces bwa-mem3.pgo-instr).
make pgo-generate
# Step 2: Run a representative training workload.
# Use reads and a reference that reflect actual production input.
# About 10–30 million read pairs is sufficient.
./bwa-mem3.pgo-instr mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null
# Step 3: Build the PGO-optimized binary (produces bwa-mem3.pgo).
make pgo-use
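Step 2 asks for roughly 10–30 million read pairs. If the production FASTQ files are much larger, one simple way to cut them down is to keep the first N records of each file. The sketch below assumes standard four-line FASTQ records (no wrapped sequence lines); fastq_head and the train_*.fq.gz file names are hypothetical. Taking the head of a file is not a random sample, so it is only appropriate when the start of the file is representative.

```shell
#!/usr/bin/env bash
# Sketch: keep the first N records of a FASTQ stream (4 lines per record).
# fastq_head is a hypothetical helper, not part of bwa-mem3.
fastq_head() {
    local n_reads="$1"
    head -n "$(( n_reads * 4 ))"
}

# Example (hypothetical output names): 10 million pairs for training.
# Using the same N for both files keeps R1 and R2 in sync.
#   gzip -dc R1.fq.gz | fastq_head 10000000 | gzip > train_R1.fq.gz
#   gzip -dc R2.fq.gz | fastq_head 10000000 | gzip > train_R2.fq.gz
```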
To target a specific SIMD level, pass PGO_ARCH=:
make pgo-generate PGO_ARCH=avx2
./bwa-mem3.pgo-instr.avx2 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null
make pgo-use PGO_ARCH=avx2
# Produces: bwa-mem3.pgo.avx2
Profile data is written to pgo_profiles/ by default. Pass
PGO_PROFILE_DIR=<path> to change the location.
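Before running make pgo-use, it can be worth confirming that the training run actually wrote profile data. The following sketch only checks that the profile directory exists and contains at least one file; profiles_present is a hypothetical helper, and the names and format of the profile files are not assumed here because they depend on the compiler.

```shell
#!/usr/bin/env bash
# Sketch: succeed only if the profile directory holds at least one file.
# profiles_present is a hypothetical helper, not part of bwa-mem3.
profiles_present() {
    local dir="${1:-pgo_profiles}"   # default location named in this page
    [ -d "$dir" ] && [ -n "$(find "$dir" -type f 2>/dev/null | head -n 1)" ]
}

# Example:
#   profiles_present pgo_profiles && make pgo-use
```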
Tip — Training data matters
The training workload should resemble production input in read length, base quality distribution, and reference composition. Training reads that are much shorter or longer than production reads, or that are too easy (low mismatch rate), skew the collected profile and may produce a build that is slower than the non-PGO baseline on real data.
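Read length is the easiest of these properties to compare. As a rough check, the sketch below computes the mean read length of a FASTQ stream so that the training and production files can be compared side by side. It assumes four-line FASTQ records; mean_read_length is a hypothetical helper, and a matching mean says nothing about base quality or mismatch rate.

```shell
#!/usr/bin/env bash
# Sketch: mean sequence length of a 4-line-per-record FASTQ stream on stdin.
# mean_read_length is a hypothetical helper, not part of bwa-mem3.
mean_read_length() {
    awk 'NR % 4 == 2 { n++; total += length($0) }
         END { if (n) printf "%.1f\n", total / n }'
}

# Example: compare the first 100,000 reads of a training and a production file.
#   gzip -dc R1.fq.gz | head -n 400000 | mean_read_length
```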
mimalloc
mimalloc is compiled in by default (USE_MIMALLOC=1). The allocator
improves multi-threaded throughput by reducing lock contention on malloc
and free hot paths. Run bwa-mem3 version to confirm it is active:
bwa-mem3 version
# Expected output includes a line like:
# mimalloc 3.x.x
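For automated deployments, the same check can be scripted. This is a minimal sketch that relies only on the version output containing a line with "mimalloc", as shown above; has_mimalloc is a hypothetical helper, and the assumption that the line is absent from a USE_MIMALLOC=0 build should be verified against an actual build.

```shell
#!/usr/bin/env bash
# Sketch: succeed if the text on stdin mentions mimalloc (case-insensitive).
# has_mimalloc is a hypothetical helper, not part of bwa-mem3.
has_mimalloc() {
    grep -qi 'mimalloc'
}

# Example:
#   bwa-mem3 version | has_mimalloc || echo "warning: mimalloc not reported" >&2
```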
To build without mimalloc (for example, when using AddressSanitizer or on a system with a known-incompatible allocator):
make USE_MIMALLOC=0
Summary
For a production installation on a known x86 server with AVX2:
make pgo-generate PGO_ARCH=avx2
./bwa-mem3.pgo-instr.avx2 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null
make pgo-use PGO_ARCH=avx2
# Deploy: bwa-mem3.pgo.avx2
See also: SIMD dispatch matrix · PGO build · Memory allocator (mimalloc) · Building from source · Anti-patterns