Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Performance Improvements

This page covers the performance work carried in bwa-mem3 on top of upstream bwa-mem2. Every change listed here preserves byte-identical SAM/BAM output vs the upstream baseline it was benchmarked against.

For current benchmark numbers across architectures and workloads, see bwa-mem3-bench, the canonical source of truth for benchmark methodology and results.

Lockstep SMEM batching (PR #33)

Seeding in bwa-mem2 advances one read’s SMEM walk at a time. Because each forward/backward extension step issues a random access into the cp_occ checkpoint array (~4 GB for human genome), the CPU stalls on cache misses between steps. Lockstep batching advances SMEM_LOCKSTEP_N reads’ SMEM walks in slot-interleaved round-robin order so that the out-of-order engine can overlap the cp_occ cache-miss loads for read i+N with the compute-bound walk of read i.

Each read slot (BatchSlot) carries its own prev[] walk buffer and match_buf[] reorder buffer. A tight recycling loop assigns finished slots to the next unprocessed read immediately. The match-emit cursor enforces input-index order so output is byte-identical to scalar. SMEM_LOCKSTEP_N is compile-time tunable; N=1 dispatches to the unchanged scalar path for bisection.

Measured improvement on 150 bp NovaSeq WGS (1M pairs, hg38, Graviton3 r7g.4xlarge, 8 threads): −6.1% wall time (82 s → 77 s). The backwardExt hot cp_occ load share dropped from 65.5% to 53.3% of function time — direct evidence that the OoO engine is overlapping cross-slot loads. On 300 bp MiSeq reads the workload is SW-dominated (~85% of cycles in kswv kernels) and the SMEM improvement is within noise; parity holds.

Supersedes PR #15 (cross-read _mm_prefetch shape), which regressed on Graviton3.

Batched -H header ingestion (PR #49, closes issue #37)

Passing a large header file via -H <file> re-ran strlen on the growing header string and called realloc on every input line, making ingestion O(n²) in the number of header lines. For a ~70 MB / ~1.5 M-line header (reported in upstream bwa-mem2#204) this caused runtimes exceeding 10 minutes before alignment started.

The fix introduces bwa_insert_header_file, a batched helper that determines the file size with fseek/ftell, allocates a single buffer, copies all @-prefixed lines in one pass, and calls bwa_insert_header once. The fix also addresses four correctness gaps in the upstream PR #204: the return-value assignment was dropped (leaving hdr_line stale after realloc), const FILE* caused compiler warnings, empty files were not guarded, and each fgets was not bounded by remaining buffer. A regression test (test/header_insert_test.cpp) diffs the batched path against the pre-patch per-line baseline across eight edge cases.

libsais FM-index construction (PR #57)

bwa-mem3 index now builds the FM-index using libsais v2.9.1 (Ilya Grebnov) instead of the sais-lite (Yuta Mori saisxx) library that bwa-mem2 inherited. libsais is actively maintained, supports OpenMP-parallel induced sorting, and produces a byte-identical FM-index. No changes are required to existing indexes — bwa-mem3 reads index files built by bwa-mem2 index without re-indexing.

For a human reference (GRCh38 + decoys), libsais reduces indexing wall time and peak memory vs sais-lite. Exact numbers depend on thread count and available RAM; see the PR body for measurements on Graviton3.

Consolidated mapping speedups (PR #58)

PR #58 is a multi-phase performance audit of bwa-mem2’s hot path, squashed and rebased onto main. It incorporates improvements across five subsystems:

  • ksw2 banded SW — tuned the band extension loop to reduce redundant computation in the common case.
  • SMEM lockstep batching — additional refinements on top of PR #33.
  • SAL prefetch — prefetch hints for the suffix array lookup hot path.
  • SAM record building — reduced per-record allocation in the text formatting path.
  • PGO build — the opt-in profile-guided optimization target (see also Performance → PGO build) is included in this suite.

On the smoke-1M workload (1M PE 150 bp reads, hg38, Graviton3 r7g.4xlarge, 16 threads, warm page cache), this PR contributed the largest single-step wall time reduction in the main branch’s performance history. Benchmark details are maintained at bwa-mem3-bench.


Changes catalog

Itembwa-mem3 PRUpstream PR/issueStatus
Lockstep SMEM batching#33fork-only
Batched -H header ingestion#49bwa-mem2#204fork-only (upstream PR open)
Large header performance (issue)issue #37closed by #49
libsais FM-index construction#57fork-only
Consolidated mapping speedups#58fork-only

See also: Performance → Overview · Performance → PGO build · Correctness fixes · Build & infrastructure · bwa-mem3-bench