Build

This page describes the recommended build configuration for production use of bwa-mem3.

Choose the right arch target

The default make invocation builds a single multi-tier binary on x86 (or a single NEON binary on arm64). For production clusters where the CPU family is uniform, you can trim further by building one tier only: the resulting binary drops the per-tier dispatch table and ships a single kernel path.

# Most modern x86-64 servers (Haswell or later):
make arch=avx2

# Intel Cascade Lake / Sapphire Rapids, AWS c7i/m7i:
make arch=avx512bw

# Apple Silicon / AWS Graviton:
make arch=arm64

Omit arch= if the deployment target is heterogeneous or unknown. The default make produces a single binary that includes every supported x86 tier and selects one at runtime via __builtin_cpu_supports. To change the instruction-set baseline used to compile the non-kernel translation units, set BASELINE_ARCH= (default avx2); see Single-binary SIMD dispatch (x86).

See SIMD dispatch matrix for the full list of targets and which kernels each vectorizes.
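
As a rough aid for picking a target, the sketch below maps a machine name and a CPU flag list to one of the arch= values shown above. The helper name pick_arch is hypothetical and not part of bwa-mem3, and the /proc/cpuinfo lookup assumes Linux. It inspects only the machine it runs on, so run it on a representative node of the cluster.

```shell
# Suggest an arch= value from a machine name and a space-separated CPU flag list.
# Prints nothing when no single-tier target fits; in that case omit arch=.
# pick_arch is a hypothetical helper, not something bwa-mem3 ships.
pick_arch() {
    local machine=$1
    local cpu_flags=" $2 "
    case "$machine" in
        arm64|aarch64)
            echo arm64 ;;
        x86_64|amd64)
            case "$cpu_flags" in
                *" avx512bw "*) echo avx512bw ;;
                *" avx2 "*)     echo avx2 ;;
            esac ;;
    esac
}

# Linux-only usage: the first "flags" line of /proc/cpuinfo lists x86 CPU features.
host_flags=$(grep -m1 '^flags' /proc/cpuinfo 2>/dev/null | cut -d: -f2)
arch=$(pick_arch "$(uname -m)" "$host_flags")
echo "suggested: make ${arch:+arch=$arch}"
```

If the helper prints no target, fall back to the default multi-tier build.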

Profile-Guided Optimization (PGO)

PGO typically yields a 3–5% throughput improvement on real workloads. It is opt-in (the standard make target does not use it), but it is recommended for any installation that will run many alignment jobs against the same reference.

The workflow has three steps:

# Step 1: Build an instrumented binary (produces bwa-mem3.pgo-instr).
make pgo-generate

# Step 2: Run a representative training workload.
#   Use reads and a reference that reflect actual production input.
#   About 10–30 million read pairs is sufficient.
./bwa-mem3.pgo-instr mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null

# Step 3: Build the PGO-optimized binary (produces bwa-mem3.pgo).
make pgo-use
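
If the production FASTQ files are much larger than the training size suggested in step 2, one simple way to cut them down is to keep the first N records of each file. The sketch below assumes plain 4-line FASTQ records (no wrapped sequence lines) and that R1 and R2 list reads in the same order, so taking the same N from both keeps pairs together. The helper name head_reads and the train_* file names are hypothetical. The head of a file is not a random sample, so treat this as a convenience rather than a recommendation.

```shell
# Print the first N FASTQ records (4 lines each) read from stdin.
# head_reads is a hypothetical helper; it assumes unwrapped 4-line records.
head_reads() {
    head -n "$(( $1 * 4 ))"
}

# Hypothetical usage: 20 million pairs, inside the 10 to 30 million range
# mentioned in step 2. Uncomment and adjust the file names to use it.
# gzip -cd R1.fq.gz | head_reads 20000000 | gzip > train_R1.fq.gz
# gzip -cd R2.fq.gz | head_reads 20000000 | gzip > train_R2.fq.gz
```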

To target a specific SIMD level, pass PGO_ARCH=:

make pgo-generate PGO_ARCH=avx2
./bwa-mem3.pgo-instr.avx2 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null
make pgo-use PGO_ARCH=avx2
# Produces: bwa-mem3.pgo.avx2

Profile data is written to pgo_profiles/ by default. Pass PGO_PROFILE_DIR=<path> to change the location.
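
Before running make pgo-use, it can be worth confirming that the training run actually left profile data behind, for example if the instrumented run was killed early. The check below is a minimal sketch: the helper name have_profiles is hypothetical, and it assumes that profile data lands as regular files somewhere under the profile directory (GCC writes .gcda files and Clang writes .profraw files, but which compiler the Makefile uses is not stated here).

```shell
# Succeeds when the profile directory exists and contains at least one file.
# have_profiles is a hypothetical helper; the default matches pgo_profiles/.
have_profiles() {
    local dir=${1:-pgo_profiles}
    [ -d "$dir" ] && [ -n "$(find "$dir" -type f | head -n 1)" ]
}

# If you passed PGO_PROFILE_DIR=<path> to make, export the same path here.
if have_profiles "${PGO_PROFILE_DIR:-pgo_profiles}"; then
    echo "profile data found; continue with: make pgo-use"
else
    echo "no profile data found; rerun the training step" >&2
fi
```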

Tip: Training data matters

The training workload should resemble production input in read length, base quality distribution, and reference composition. Training reads that are shorter or longer than production reads, or that are too easy to align (low mismatch rate), skew the recorded branch profile and may produce a build that is slower than the non-PGO baseline on real data.
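
One way to check whether a PGO build helped is to time the baseline and PGO binaries on the same held-out input with the same thread count, then compare wall-clock seconds. The helper below is a small sketch of that arithmetic; the name pgo_gain_pct and the timings in the example are made up for illustration. Since throughput is inversely proportional to runtime, the gain is (baseline / pgo - 1) × 100, and a negative result means the PGO build is slower.

```shell
# Percent throughput gain of the PGO build over the baseline.
# Arguments: baseline wall-clock seconds, PGO wall-clock seconds.
# pgo_gain_pct is a hypothetical helper, not part of bwa-mem3.
pgo_gain_pct() {
    awk -v base="$1" -v pgo="$2" 'BEGIN { printf "%.1f\n", (base / pgo - 1) * 100 }'
}

# Made-up timings: 100 s for the baseline, 96 s for the PGO build.
pgo_gain_pct 100 96    # prints 4.2
```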

mimalloc

mimalloc is compiled in by default (USE_MIMALLOC=1). The allocator improves multi-threaded throughput by reducing lock contention on malloc and free hot paths. Run bwa-mem3 version to confirm it is active:

bwa-mem3 version
# Expected output includes a line like:
#   mimalloc 3.x.x
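
For a scripted deployment check, the same test can be automated by searching the version output for the allocator name. This is a sketch that assumes the mimalloc line is absent when the allocator is not compiled in, which this page does not state explicitly; the helper name reports_mimalloc is hypothetical.

```shell
# Succeeds when the text on stdin mentions mimalloc (case-insensitive).
# reports_mimalloc is a hypothetical helper, not part of bwa-mem3.
reports_mimalloc() {
    grep -qi 'mimalloc'
}

# Hypothetical usage in a deployment script; uncomment to use it.
# bwa-mem3 version | reports_mimalloc || echo "warning: mimalloc not reported" >&2
```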

To build without mimalloc (for example, when using AddressSanitizer or on a system with a known-incompatible allocator):

make USE_MIMALLOC=0

Summary

For a production installation on a known x86 server with AVX2:

make pgo-generate PGO_ARCH=avx2
./bwa-mem3.pgo-instr.avx2 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null
make pgo-use PGO_ARCH=avx2
# Deploy: bwa-mem3.pgo.avx2
