Performance Improvements

This page covers the performance work carried in bwa-mem3 on top of upstream bwa-mem2. Almost every change listed here is a throughput, memory, or supporting (test/hardening/cleanup) change that preserves the aligner’s output; the one exception is the deterministic tie-break ordering in #123, which can reorder equal-scoring alignments relative to upstream (see Equivalence with bwa-mem2 for the full, audited list of where bwa-mem3 output diverges).

For a reader-friendly grouping of what drives the speedup — by machine architecture, hot-path rewrite, indexing, allocation/I/O, and build-time — see Performance → Overview. For current benchmark numbers across architectures and workloads, see bwa-mem3-bench, the canonical source of truth for benchmark methodology and results.

Lockstep SMEM batching (PR #33)

Seeding in bwa-mem2 advances one read’s SMEM walk at a time. Because each forward/backward extension step issues a random access into the cp_occ checkpoint array (~4 GB for a human genome), the CPU stalls on cache misses between steps. Lockstep batching advances the SMEM walks of SMEM_LOCKSTEP_N reads in slot-interleaved round-robin order, so that the out-of-order engine can overlap the cp_occ cache-miss loads for read i+N with the compute-bound walk of read i.

Each read slot (BatchSlot) carries its own prev[] walk buffer and match_buf[] reorder buffer. A tight recycling loop assigns each finished slot to the next unprocessed read immediately. The match-emit cursor enforces input-index order, so output is byte-identical to the scalar path. SMEM_LOCKSTEP_N is compile-time tunable; N=1 dispatches to the unchanged scalar path for bisection.
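
The slot recycling and in-order emit described above can be illustrated with a small, self-contained sketch. This is not the bwa-mem3 implementation: `Slot`, `lockstep_walk`, and the toy "walk" (consume one character per step, then report the read length) are hypothetical stand-ins chosen only to show the control flow. The real BatchSlot carries prev[] and match_buf[] and performs FM-index extensions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a batch slot: tracks which read it holds and how
// far that read's walk has progressed.
struct Slot {
    bool busy = false;
    std::size_t read_idx = 0;
    std::size_t pos = 0;
};

// Advance up to n_slots walks in round-robin order, one step per slot per
// pass, and emit results strictly in input-index order.
std::vector<std::size_t> lockstep_walk(const std::vector<std::string>& reads,
                                       std::size_t n_slots) {
    assert(n_slots >= 1);
    std::vector<Slot> slots(n_slots);
    std::vector<std::size_t> result(reads.size());  // reorder buffer
    std::vector<bool> done(reads.size(), false);
    std::vector<std::size_t> emitted;               // final output stream
    std::size_t next_read = 0;    // next unprocessed read
    std::size_t emit_cursor = 0;  // lowest input index not yet emitted
    std::size_t active = 0;       // number of busy slots

    // Initial fill: hand one read to each slot while reads remain.
    for (Slot& s : slots) {
        if (next_read == reads.size()) break;
        s.busy = true;
        s.read_idx = next_read++;
        s.pos = 0;
        ++active;
    }

    while (active > 0) {
        for (Slot& s : slots) {
            if (!s.busy) continue;
            const std::string& r = reads[s.read_idx];
            if (s.pos < r.size()) {  // one walk step for this slot
                ++s.pos;
                continue;
            }
            // Walk finished: park the result, then drain everything that is
            // now contiguous from the emit cursor.
            result[s.read_idx] = r.size();
            done[s.read_idx] = true;
            while (emit_cursor < reads.size() && done[emit_cursor]) {
                emitted.push_back(result[emit_cursor]);
                ++emit_cursor;
            }
            // Recycle the slot immediately, or retire it if no reads remain.
            if (next_read < reads.size()) {
                s.read_idx = next_read++;
                s.pos = 0;
            } else {
                s.busy = false;
                --active;
            }
        }
    }
    return emitted;
}
```

In this sketch a short read can finish before an earlier, longer one, but its result stays parked until the emit cursor reaches it, which is why the output order does not depend on the slot count. Calling `lockstep_walk(reads, 1)` degenerates to processing one read at a time.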

Measured improvement on 150 bp NovaSeq WGS (1M pairs, hg38, Graviton3 r7g.4xlarge, 8 threads): −6.1% wall time (82 s → 77 s). The backwardExt hot cp_occ load share dropped from 65.5% to 53.3% of function time — direct evidence that the OoO engine is overlapping cross-slot loads. On 300 bp MiSeq reads the workload is SW-dominated (~85% of cycles in kswv kernels) and the SMEM improvement is within noise; parity holds.

Supersedes PR #15 (cross-read _mm_prefetch shape), which regressed on Graviton3.

Batched `-H` header ingestion (PR #49, closes issue #37)

Passing a large header file via -H <file> re-ran strlen on the growing header string and called realloc on every input line, making ingestion O(n²) in the number of header lines. For a ~70 MB / ~1.5 M-line header (reported in upstream bwa-mem2#204) this caused runtimes exceeding 10 minutes before alignment started.

The fix introduces bwa_insert_header_file, a batched helper that determines the file size with fseek/ftell, allocates a single buffer, copies all @-prefixed lines in one pass, and calls bwa_insert_header once. The fix also addresses four correctness gaps in upstream PR #204: the return-value assignment was dropped (leaving hdr_line stale after realloc), const FILE* caused compiler warnings, empty files were not guarded, and each fgets call was not bounded by the remaining buffer space. A regression test (test/header_insert_test.cpp) diffs the batched path against the pre-patch per-line baseline across eight edge cases.
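
A minimal sketch of the batched approach, under stated assumptions: `read_header_lines` is a hypothetical name, not the signature of bwa_insert_header_file, and it returns the collected lines as a std::string instead of calling bwa_insert_header. It only illustrates the size-once, allocate-once, bounded-fgets pattern described above.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <string>

// Hypothetical helper: collect every '@'-prefixed line of fp into one buffer.
// Kept lines retain their trailing newline in this sketch.
std::string read_header_lines(std::FILE* fp) {
    assert(fp != nullptr);
    if (std::fseek(fp, 0, SEEK_END) != 0) return {};
    const long size = std::ftell(fp);
    if (size <= 0) return {};  // guard: empty file or ftell failure
    std::rewind(fp);

    // Single allocation: file size plus one byte for fgets' terminator.
    std::string buf(static_cast<std::size_t>(size) + 1, '\0');
    std::size_t used = 0;

    while (used + 1 < buf.size()) {
        char* line = &buf[used];
        // Bound each fgets by the space that is actually left. Because `used`
        // never exceeds the bytes consumed so far, a whole line always fits
        // (assuming no single line exceeds INT_MAX bytes).
        const std::size_t room = buf.size() - used;
        const int n = static_cast<int>(
            std::min<std::size_t>(room, static_cast<std::size_t>(INT_MAX)));
        if (std::fgets(line, n, fp) == nullptr) break;
        // Keep header lines; anything else is overwritten by the next read.
        if (line[0] == '@') used += std::strlen(line);
    }
    buf.resize(used);
    return buf;
}
```

The cost is linear in the file size: each byte is read once and copied at most once, and no strlen is ever run over the accumulated header.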

libsais FM-index construction (PR #57)

bwa-mem3 index now builds the FM-index using libsais v2.9.1 (Ilya Grebnov) instead of the sais-lite (Yuta Mori saisxx) library that bwa-mem2 inherited. libsais is actively maintained, supports OpenMP-parallel induced sorting, and produces a byte-identical FM-index. An index built by bwa-mem2 index is read without re-indexing; a bwa (v1) index uses a different format and must be rebuilt with bwa-mem3 index (see Coming from bwa or bwa-mem2).

For a human reference (GRCh38 + decoys), libsais reduces indexing wall time and peak memory vs sais-lite. Exact numbers depend on thread count and available RAM; see the PR body for measurements on Graviton3.

Consolidated mapping speedups (PR #58)

PR #58 is a multi-phase performance audit of bwa-mem2’s hot path, squashed and rebased onto main. It incorporates improvements across five subsystems:

ksw2 banded SW — tuned the band extension loop to reduce redundant computation in the common case.
SMEM lockstep batching — additional refinements on top of PR #33.
SAL prefetch — prefetch hints for the suffix array lookup hot path.
SAM record building — reduced per-record allocation in the text formatting path.
PGO build — the opt-in profile-guided optimization target (see also Performance → PGO build) is included in this suite.
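
As a generic illustration of the software-prefetch idea named in the SAL item above (not the code from PR #58), the sketch below gathers values from a large table while hinting the entry that will be needed a few iterations later. The function name `gather_with_prefetch`, the default prefetch distance, and the use of the GCC/Clang `__builtin_prefetch` builtin are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: look up table[idx[i]] for every i, issuing a prefetch
// hint for the lookup `dist` iterations ahead so that its cache miss can
// overlap with the current work.
std::vector<std::int64_t> gather_with_prefetch(
        const std::vector<std::int64_t>& table,
        const std::vector<std::size_t>& idx,
        std::size_t dist = 8) {
    std::vector<std::int64_t> out;
    out.reserve(idx.size());
    for (std::size_t i = 0; i < idx.size(); ++i) {
        if (i + dist < idx.size()) {
            assert(idx[i + dist] < table.size());
#if defined(__GNUC__) || defined(__clang__)
            // Read-only hint with low temporal locality; a no-op elsewhere.
            __builtin_prefetch(&table[idx[i + dist]], 0, 1);
#endif
        }
        assert(idx[i] < table.size());
        out.push_back(table[idx[i]]);
    }
    return out;
}
```

The prefetch only changes timing, never results, so the output is the same for any `dist`. A suitable distance depends on the memory latency and the per-iteration work, and has to be measured on the target machine.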

On the smoke-1M workload (1M PE 150 bp reads, hg38, Graviton3 r7g.4xlarge, 16 threads, warm page cache), this PR contributed the largest single-step wall time reduction in the main branch’s performance history. Benchmark details are maintained at bwa-mem3-bench.

Full change list

Every performance PR — with its upstream disposition — is in the PR catalog (filter on the Performance class). The sections above narrate the load-bearing ones.

Open / in progress

Not yet merged to main:

| Item | Stage | Mechanism | bwa-mem3 PR |
| --- | --- | --- | --- |
| AVX2 8-bit wrapper prefetch | sw | Adds the missing next-batch ref/query software prefetch to `smithWatermanBatchWrapper8`, the lone SW wrapper that lacked it (follow-up to #161) | #163 |

Several correctness and crash fixes underpin the long-read SW, indexing, and high-throughput work rather than adding speed themselves: SMEM read positions widened int16_t → int32_t to stop a long-read SIGSEGV (#142, merged); a persistent kt_for worker pool that fixes a multi-chunk SIGSEGV under mimalloc v3 (#154, merged); and mem_lim widened to int64 to stop an SA-staging buffer overflow on highly repetitive seeds (#156, merged).
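
To make the first of these fixes concrete: a 16-bit signed integer tops out at 32,767, so any position beyond that in a long read cannot be stored in an int16_t. The range check below is a small illustrative sketch; `position_fits` is a hypothetical helper and does not appear in the bwa-mem3 sources.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper: true if `pos` is representable in integer type T.
template <typename T>
bool position_fits(std::int64_t pos) {
    return pos >= std::numeric_limits<T>::min() &&
           pos <= std::numeric_limits<T>::max();
}
```

For example, `position_fits<std::int16_t>(40000)` is false while `position_fits<std::int32_t>(40000)` is true, which is why short reads never hit the problem and long reads do.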
