Tuning Checklist
The items below are ordered by expected impact for most workloads. Work through them in sequence; there is little point optimizing output format before confirming you are running the right binary for your CPU.
1. Run the right binary for your CPU
If you built with make multi (recommended for production x86 deployments), the bwa-mem3 launcher reads CPUID at startup and execs the highest-capability variant automatically. Verify which variant is running by checking the banner printed to stderr at the start of a mem run:
-----------------------------
Executing in AVX512 mode!!
-----------------------------
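To check the banner from a script rather than by eye, one option is to pull the mode name out of the captured stderr. The sketch below assumes the banner line always has the form shown above ("Executing in <MODE> mode!!"); `banner_mode` is a hypothetical helper name, not part of bwa-mem3.

```shell
# Print the SIMD mode named in a bwa-mem3 banner read from stdin.
# Assumption: the banner line looks like "Executing in <MODE> mode!!".
# Prints nothing if no such line is found.
banner_mode() {
    sed -n '/Executing in [^ ]* mode/{
        s/.*Executing in \([^ ]*\) mode.*/\1/p
        q
    }'
}

# Example use (mem.log is a hypothetical file holding the stderr of a mem run):
#   bwa-mem3 mem -t 16 ref.fa R1.fq.gz R2.fq.gz 2> mem.log > out.sam
#   banner_mode < mem.log
```

The helper stops at the first matching line, so it stays cheap even when the log is long.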
If the banner says SSE4.1 on a machine you believe supports AVX2, the variant binary may be missing from the directory. Confirm with:
ls -1 bwa-mem3.sse41 bwa-mem3.sse42 bwa-mem3.avx bwa-mem3.avx2 bwa-mem3.avx512bw 2>&1
If files are missing, rebuild with make multi.
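The same check can be scripted so that it names only the variants that are absent. This is a minimal sketch: the variant suffixes are copied from the `ls` command above, while the function name `missing_variants` and its directory argument are hypothetical.

```shell
# List the variant binaries that are missing (or not executable) in a directory.
# Variant suffixes are taken from the ls command above; pass the install
# directory as the first argument (defaults to the current directory).
missing_variants() {
    dir=${1:-.}
    for v in sse41 sse42 avx avx2 avx512bw; do
        [ -x "$dir/bwa-mem3.$v" ] || echo "bwa-mem3.$v"
    done
}

# Example use: an empty result means every variant is present.
#   missing_variants /path/to/bwa-mem3/dir
```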
On ARM / Apple Silicon there is only a single binary variant. Confirm it is in use:
ls -la bwa-mem3
# expect: bwa-mem3 -> bwa-mem3.arm64
See SIMD dispatch matrix for the full dispatch logic and the minimum CPU requirements for each variant.
Tip — Single-arch deployments
On a cluster where all nodes have the same CPU, build with make arch=avx2 (or the appropriate ISA). The launcher overhead is negligible, but removing it simplifies deployment: there is only one binary to distribute and no variant lookup that can fail.
2. Build with PGO if you will run repeatedly
For production pipeline nodes that will process many samples against the same reference, a PGO build provides an additional 2–5% throughput at the cost of one extra build pass and a training run:
make pgo-generate PGO_ARCH=avx2
./bwa-mem3.pgo-instr.avx2 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null
make pgo-use PGO_ARCH=avx2
See PGO build for the full workflow, including multi-arch and profile portability notes.
3. Use shared memory for many small samples
When aligning many samples on one machine against the same reference, loading the index into POSIX shared memory once and reusing it across all mem invocations eliminates redundant I/O and reduces per-sample startup time significantly. The benefit grows with the number of samples and the size of the reference.
# Load the index into shared memory once
bwa-mem3 shm ref.fa
# Align each sample against the in-memory index
bwa-mem3 mem -t 16 ref.fa R1.fq.gz R2.fq.gz | samtools sort -@ 4 -o sample.bam -
# When finished with all samples, drop the shared segment
bwa-mem3 shm -d
Warning — No staleness check
bwa-mem3 shm does not detect whether the on-disk index has changed after the segment was loaded. Always run bwa-mem3 shm -d before re-indexing a reference, then re-load with bwa-mem3 shm. Failing to do so results in alignments against a stale index.
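For a batch run, the load, align, and drop steps can be wrapped in one function so that the segment is dropped even when a sample fails. The sketch below is illustrative only: the `<name>_R1.fq.gz` / `<name>_R2.fq.gz` file layout, the `align_batch` name, and the `BWA` override variable are assumptions, and the thread counts are simply copied from the example above.

```shell
# Assumption: BWA names the bwa-mem3 launcher; override it to use another path.
BWA=${BWA:-bwa-mem3}

# Align every <name>_R1.fq.gz / <name>_R2.fq.gz pair in a directory against
# one reference, loading the index into shared memory once for the whole batch.
# Usage: align_batch ref.fa /path/to/samples
align_batch() {
    ref=$1
    dir=$2
    "$BWA" shm "$ref" || return 1          # load the index once
    status=0
    for r1 in "$dir"/*_R1.fq.gz; do
        [ -e "$r1" ] || continue           # no matching files in the directory
        r2=${r1%_R1.fq.gz}_R2.fq.gz
        name=$(basename "$r1" _R1.fq.gz)
        "$BWA" mem -t 16 "$ref" "$r1" "$r2" \
            | samtools sort -@ 4 -o "$dir/$name.bam" - || status=1
    done
    "$BWA" shm -d                          # always drop the segment afterwards
    return "$status"
}
```

Dropping the segment at the end of every batch also limits the window in which a re-indexed reference could be aligned against a stale in-memory copy.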
See Getting Started — Shared-memory index and Best Practices — Multi-sample workflows for complete workflows.
4. Emit BAM directly
Use --bam (or --bam=0 for uncompressed BAM) to emit BAM instead of SAM. Uncompressed BAM avoids the text-formatting cost on the aligner side and the text-parsing cost on the downstream side. samtools sort reads BAM natively and is fastest when the input is uncompressed:
bwa-mem3 mem --bam=0 -t 16 ref.fa R1.fq.gz R2.fq.gz \
| samtools sort -@ 8 -o out.bam -
samtools index out.bam
The --bam flag (without =0) produces BGZF-compressed BAM. This is useful when writing directly to disk without a downstream piped tool.
See Best Practices — Output format for guidance on when SAM is still appropriate.
5. Pipe to a multi-threaded sorter
Sorting is typically the bottleneck after alignment. Keep a separate thread budget for samtools sort:
bwa-mem3 mem --bam=0 -t 12 ref.fa R1.fq.gz R2.fq.gz \
| samtools sort -@ 8 -m 2G -o out.bam -
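On a 16-core machine, allocating 12 threads to mem and 8 to samtools sort is a common starting point. The total of 20 threads deliberately exceeds the core count, on the expectation that the two stages, connected by the pipe, do not both saturate the CPU at the same time. The aligner is generally CPU-bound; the sorter is I/O-bound during merge. Profile both stages to find the right split for your hardware.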
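To carry the 12 and 8 thread figures over to machines with a different core count, the same ratios (three quarters of the cores for the aligner, half for the sorter) can be computed in the shell. This is a hypothetical starting heuristic derived only from the 16-core example, not a recommendation from bwa-mem3 itself; `thread_split` is an illustrative name.

```shell
# Print "<mem_threads> <sort_threads>" for a given core count.
# Ratios (3/4 and 1/2 of the cores) mirror the 12 + 8 on 16 cores example;
# treat them as an assumption to be adjusted after profiling.
thread_split() {
    cores=$1
    mem_threads=$(( cores * 3 / 4 ))
    sort_threads=$(( cores / 2 ))
    [ "$mem_threads" -ge 1 ] || mem_threads=1     # never go below one thread
    [ "$sort_threads" -ge 1 ] || sort_threads=1
    echo "$mem_threads $sort_threads"
}

# Example use (nproc is GNU coreutils; on macOS, sysctl -n hw.ncpu is the usual substitute):
#   set -- $(thread_split "$(nproc)")
#   bwa-mem3 mem --bam=0 -t "$1" ref.fa R1.fq.gz R2.fq.gz \
#     | samtools sort -@ "$2" -m 2G -o out.bam -
```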
Tip — Thread count tuning
bwa-mem3 mem scales well to 16–32 threads on most workloads. Beyond 32 threads the per-thread work unit becomes small enough that synchronization overhead starts to erode gains. See User Guide — Threading and resource use for thread-scaling data.
Summary table
| Item | Action | Reference |
|---|---|---|
| Right binary for CPU | make multi; verify banner | SIMD dispatch matrix |
| PGO for production | pgo-generate → train → pgo-use | PGO build |
| Shared-memory index | bwa-mem3 shm ref.fa before batch runs | Quick start: shm |
| Emit uncompressed BAM | --bam=0 | Best Practices — Output format |
| Multi-threaded sort | samtools sort -@ with appropriate thread split | User Guide — Threading |
See also: Performance overview · SIMD dispatch matrix · PGO build · Best Practices — Build · User Guide — Threading and resource use