Build

This page describes the recommended build configuration for production use of bwa-mem3.

Choose the right arch target

The default make invocation builds the multi-binary launcher on x86 (or a single ARM64 binary on Apple Silicon). For production servers where the CPU family is known, specify the target explicitly so the compiler can generate tighter code and the binary does not need the launcher overhead:

# Most modern x86-64 servers (Skylake or later):
make arch=avx2

# Intel Cascade Lake / Sapphire Rapids, AWS c7i/m7i:
make arch=avx512bw

# Apple Silicon / AWS Graviton:
make arch=arm64

Omit arch= if the deployment target is heterogeneous or unknown; make (with no arguments) builds the full multi-binary suite on x86 and selects the fastest variant at runtime via cpuid.

See SIMD dispatch matrix for the full list of targets and which kernels each vectorizes.
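As a rough illustration of the choice above, the following sketch maps a machine type and CPU flag list to one of the arch= values named on this page. The helper pick_arch is hypothetical and not part of bwa-mem3. It assumes a Linux build host whose /proc/cpuinfo flag names (avx2, avx512bw) correspond to the make targets, and it only makes sense when the build host has the same CPU family as the deployment target.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with bwa-mem3): choose an arch= value.
#   $1: machine type, e.g. the output of `uname -m`
#   $2: CPU flag list, e.g. the "flags" line of /proc/cpuinfo (Linux only)
# Prints avx512bw, avx2, arm64, or an empty line when nothing matches.
pick_arch() (
    machine=$1
    flags=$2
    case "$machine" in
        arm64|aarch64) echo arm64; exit 0 ;;
    esac
    # Pad with spaces so only whole flag names match.
    case " $flags " in
        *" avx512bw "*) echo avx512bw ;;
        *" avx2 "*)     echo avx2 ;;
        *)              echo "" ;;   # unknown: fall back to plain `make`
    esac
)

# Example use on a Linux build host:
#   arch=$(pick_arch "$(uname -m)" "$(grep -m1 '^flags' /proc/cpuinfo)")
#   if [ -n "$arch" ]; then make arch="$arch"; else make; fi
```

An empty result deliberately falls through to the default multi-binary build, which matches the advice for unknown targets.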

Profile-Guided Optimization (PGO)

PGO typically yields a 3–5% throughput improvement on real workloads. It is opt-in (the standard make target does not use it) but is recommended for any installation that will run many alignment jobs against the same reference.

The workflow is three steps:

# Step 1: Build an instrumented binary (produces bwa-mem3.pgo-instr).
make pgo-generate

# Step 2: Run a representative training workload.
#   Use reads and a reference that reflect actual production input.
#   About 10–30 million read pairs is sufficient.
./bwa-mem3.pgo-instr mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null

# Step 3: Build the PGO-optimized binary (produces bwa-mem3.pgo).
make pgo-use
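One way to prepare a training set of the size suggested in Step 2 is to cut a fixed number of records from each FASTQ file. The sketch below uses a hypothetical helper, first_records, and assumes standard four-line FASTQ records with R1 and R2 in the same order. Taking the first N records is not a random sample, so it may not be representative of the whole run.

```shell
#!/bin/sh
# Hypothetical helper: copy the first N FASTQ records from stdin to stdout.
# Assumes each record is exactly four lines (no wrapped sequence lines).
first_records() {
    head -n "$(( $1 * 4 ))"
}

# Example use (output file names are placeholders):
#   gzip -dc R1.fq.gz | first_records 20000000 | gzip > train_R1.fq.gz
#   gzip -dc R2.fq.gz | first_records 20000000 | gzip > train_R2.fq.gz
```

Applying the same N to both files keeps the pairs in sync as long as the inputs were already paired line for line.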

To target a specific SIMD level, pass PGO_ARCH=:

make pgo-generate PGO_ARCH=avx2
./bwa-mem3.pgo-instr.avx2 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null
make pgo-use PGO_ARCH=avx2
# Produces: bwa-mem3.pgo.avx2

Profile data is written to pgo_profiles/ by default. Pass PGO_PROFILE_DIR=<path> to change the location.
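The binary names used above follow a simple pattern, which a wrapper script could encode as below. This is only a sketch: pgo_binary is a hypothetical helper, and the naming rule is inferred from the four names shown on this page rather than from the Makefile itself.

```shell
#!/bin/sh
# Hypothetical helper: print the binary name for a PGO stage.
#   $1: "instr" (result of make pgo-generate) or "opt" (result of make pgo-use)
#   $2: optional PGO_ARCH value, e.g. avx2
pgo_binary() (
    case "$1" in
        instr) base=bwa-mem3.pgo-instr ;;
        opt)   base=bwa-mem3.pgo ;;
        *)     echo "unknown stage: $1" >&2; exit 1 ;;
    esac
    if [ -n "${2:-}" ]; then
        echo "$base.$2"
    else
        echo "$base"
    fi
)

# Example use with a variable SIMD level:
#   arch=avx2
#   make pgo-generate PGO_ARCH="$arch"
#   "./$(pgo_binary instr "$arch")" mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null
#   make pgo-use PGO_ARCH="$arch"
```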

Tip — Training data matters

The training workload should resemble production input in read length, base quality distribution, and reference composition. A training set whose reads are shorter or longer than production reads, or too easy to align (low mismatch rate), will bias the recorded branch profile and may produce a build that is slower than the non-PGO baseline on real data.
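A quick sanity check on read length, one of the properties mentioned above, might look like the following. The helper fastq_stats is hypothetical and assumes uncompressed four-line FASTQ on stdin. It says nothing about base quality or mismatch rate, which would need separate checks.

```shell
#!/bin/sh
# Hypothetical helper: print "<read_count> <mean_read_length>" for FASTQ on stdin.
# Sequence lines are every 4th line starting at line 2.
fastq_stats() {
    awk 'NR % 4 == 2 { n++; total += length($0) }
         END {
             if (n) printf "%d %.1f\n", n, total / n
             else   print "0 0.0"
         }'
}

# Example use (file names are placeholders):
#   gzip -dc train_R1.fq.gz      | fastq_stats
#   gzip -dc production_R1.fq.gz | fastq_stats
```

If the two mean lengths differ noticeably, the training set is probably a poor stand-in for production input.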

mimalloc

mimalloc is compiled in by default (USE_MIMALLOC=1). The allocator improves multi-threaded throughput by reducing lock contention on malloc and free hot paths. Run bwa-mem3 version to confirm it is active:

bwa-mem3 version
# Expected output includes a line like:
#   mimalloc 3.x.x
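In a deployment script, the same check could be automated along these lines. The helper has_mimalloc is hypothetical, and the pattern is an assumption based only on the sample line above (the word mimalloc followed by a version number). The real output format may differ.

```shell
#!/bin/sh
# Hypothetical helper: succeed if stdin has a line like "mimalloc 3.x.x".
# The pattern is inferred from the example output, not from the source code.
has_mimalloc() {
    grep -Eq 'mimalloc[[:space:]]+[0-9]'
}

# Example use:
#   if bwa-mem3 version 2>&1 | has_mimalloc; then
#       echo "mimalloc is active"
#   else
#       echo "mimalloc not reported by this build" >&2
#   fi
```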

To build without mimalloc (for example, when using AddressSanitizer or on a system with a known-incompatible allocator):

make USE_MIMALLOC=0

Summary

For a production installation on a known x86 server with AVX2:

make pgo-generate PGO_ARCH=avx2
./bwa-mem3.pgo-instr.avx2 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null
make pgo-use PGO_ARCH=avx2
# Deploy: bwa-mem3.pgo.avx2

See also: SIMD dispatch matrix · PGO build · Memory allocator (mimalloc) · Building from source · Anti-patterns