PGO Build
Profile-Guided Optimization (PGO) is a two-pass compiler technique. In the first pass (pgo-generate) the compiler inserts counters into every branch, call site, and loop back-edge. You run a representative training workload against the instrumented binary so those counters accumulate real branch-probability data. In the second pass (pgo-use) the compiler recompiles every translation unit using the collected profiles to make better inlining, branch-prediction, and code-layout decisions.
bwa-mem3’s Makefile provides three targets that implement this workflow.
Observed gains
On Apple Silicon (M-series), PGO delivered approximately 3% throughput improvement over the native NEON build. The gain on x86 depends on the workload — short-read paired-end alignment on avx2 or avx512bw hardware typically sees 2–5%. PGO is most useful when you will run the same binary on the same hardware against the same workload repeatedly (e.g. a production pipeline node). It is not worth the extra build time for one-off or exploratory runs.
Workflow
Step 1: Build the instrumented binary
make pgo-generate
By default PGO_ARCH is set to arm64 on Apple Silicon / aarch64 hosts and native on x86 hosts. To target a specific ISA, pass PGO_ARCH explicitly:
make pgo-generate PGO_ARCH=avx2
This produces a binary named bwa-mem3.pgo-instr (or bwa-mem3.pgo-instr.avx2 for non-default arch). Profiles are written to the directory pgo_profiles/ by default. Override with PGO_PROFILE_DIR:
make pgo-generate PGO_ARCH=avx2 PGO_PROFILE_DIR=/scratch/pgo-profiles-avx2
Step 2: Run the training workload
Run a workload that is representative of your production use. A single-end or paired-end alignment run against the same reference and similar read length is sufficient. A larger training run produces more stable profiles but 5–10 million read pairs is generally enough.
./bwa-mem3.pgo-instr mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null
The run discards output so you are measuring the alignment work alone.
Tip — Training workload size
Aim for a training run that exercises the same code paths as your production workload. If you align 150 bp paired-end reads in production, train on 150 bp reads. If you use
--meth, include a methylation alignment run in training. A few million read pairs is sufficient; a full WGS run provides diminishing returns.
Step 3: Build the optimized binary
make pgo-use
Or with matching arch and profile dir:
make pgo-use PGO_ARCH=avx2 PGO_PROFILE_DIR=/scratch/pgo-profiles-avx2
This produces bwa-mem3.pgo (or bwa-mem3.pgo.avx2). The binary is ready to use in production.
Step 4: Clean up instrumentation artifacts
make pgo-clean
This removes the profile directory and all bwa-mem3.pgo-instr* and bwa-mem3.pgo* files.
Multi-arch builds with PGO
Each architecture requires its own profile because the instrumentation counters are embedded in arch-specific code. Run the full three-step workflow once per arch and keep the profiles in separate directories:
# AVX2 profile
make pgo-generate PGO_ARCH=avx2 PGO_PROFILE_DIR=pgo_profiles_avx2
./bwa-mem3.pgo-instr.avx2 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null
make pgo-use PGO_ARCH=avx2 PGO_PROFILE_DIR=pgo_profiles_avx2
# AVX-512BW profile (separate host or same host with matching CPU)
make pgo-generate PGO_ARCH=avx512bw PGO_PROFILE_DIR=pgo_profiles_avx512bw
./bwa-mem3.pgo-instr.avx512bw mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null
make pgo-use PGO_ARCH=avx512bw PGO_PROFILE_DIR=pgo_profiles_avx512bw
Warning — Profile portability
Profile data collected on one microarchitecture is not portable to a different one. An AVX2 profile collected on a Haswell CPU will not improve — and may pessimize — an AVX-512BW build run on a Sapphire Rapids CPU. Always collect profiles on the same hardware class where the optimized binary will run.
PGO and the multi-binary layout
The PGO targets produce a single optimized binary for a single arch target. They do not rebuild the full make multi set. If you want PGO-optimized multi-binary dispatch, build and profile each arch variant separately, place them alongside the launcher, and verify with ./bwa-mem3 version.
Relationship to LTO
make lto-build produces a Link-Time Optimization binary; make pgo-use produces a PGO-optimized binary. Both are independent opt-in targets. You can combine them by passing -flto (or -flto=thin for clang) as part of EXTRA_CXXFLAGS during the pgo-use step, but the combination has not been systematically benchmarked. In practice, LTO and PGO each provide modest single-digit gains; their interaction is compiler-specific.
See also: Performance overview · SIMD dispatch matrix · Tuning checklist · Best Practices — Build · Developer Guide — Building from source