Correctness Fixes
This page documents bugs present in upstream bwa-mem2 that bwa-mem3 fixes. Each
fix is isolated to a single PR so it can be reviewed independently and dropped
from main once upstream merges the equivalent patch.
@PG CL: tab escaping (PR #54)
When a read-group string is passed via -R '@RG\tID:x\tSM:y', the tab
characters in the argument were copied verbatim into the @PG CL: SAM header
field. The SAM specification uses tabs as field delimiters, so the resulting
header line appeared to have extra ID: and other tag fields embedded inside
CL:. Lenient parsers (samtools, htsjdk) tolerated the output; strict parsers
(noodles, some fgbio configurations) rejected the file as malformed.
The fix replaces each tab character with a space when building the @PG CL:
value in src/main.cpp. The @RG line itself is not modified, so the
read-group metadata is preserved correctly. A regression shell test
(test/pg_cl_escape_test.sh) asserts that the @PG line contains exactly
five tab-separated fields after the fix. Upstream issue reference:
bwa-mem2#293.
SMEM buffer overflow on reads longer than 151 bp (PR #55)
bwa-mem2 hardcoded READ_LEN 151 in src/macro.h to size the per-thread
matchArray SMEM buffer at compile time. The FMI walk wrote past this buffer
without bounds checking when reads exceeded 151 bp, causing memory corruption
that manifested as segfaults or silent wrong output on 300 bp MiSeq reads,
error-corrected long reads, and any run with a non-default -k that extended
seed length.
A second cap, MAX_READ_LEN_FOR_LOCKSTEP 512, guarded the lockstep driver’s
per-slot stack arrays with a hard assert that aborted on anything longer.
The fix eliminates both compile-time caps. Every per-thread SMEM buffer is now
heap-allocated on the memory management context (mmc) and grown on demand
from each batch’s observed max_readlength. The pre-walk grow in
mem_collect_smem sizes matchArray[tid] to BATCH_MUL * BATCH_SIZE * max_readlength, and all array writes are bounds-checked with a structured
smem_overflow_die on overflow. Regression tests cover 300 bp, 1 kbp, and
3 kbp phiX reads; all three segfaulted before the fix and produce correct
NM:i:0 alignments after. Upstream references:
bwa-mem2#210 (issue),
bwa-mem2#238
(closed unmerged upstream PR).
kswv nrow==0 guard (PR #51)
When a SIMD batch contained only padding pairs (all len1 == 0), the DP loop
never executed and nrow was zero. The post-loop rowMax + (i-1) * SIMD_WIDTH
store still executed, walking SIMD_WIDTH bytes before the beginning of the
rowMax allocation. On glibc this produced a free(): invalid pointer abort;
on macOS libc it silently corrupted the heap.
The fix wraps the post-loop store in an if (i > 0) guard on all five SIMD
kswv kernels: NEON u8, NEON 16, AVX2 u8, AVX-512BW u8, and AVX-512BW 16. The
upstream patch bwa-mem2#289
covered only the two AVX-512BW kernels; bwa-mem3 broadens it to the three
additional kernels carried in this fork. A dedicated regression test
(test/kswv_nrow_zero_test.cpp) builds all-padding batches and verifies each
kernel is clean under AddressSanitizer.
kswv score2 plateau series (PRs #26, #27, #28, #29, #30, #31)
The batched mate-rescue Smith-Waterman path (kswv) contains a family of
related bugs across its SIMD kernels that inflated the suboptimal score
(score2 / XS) and consequently deflated MAPQ relative to upstream
bwa-mem2.
AVX-512BW dispatch guard (PR #26). GCC with -mavx512bw automatically
defines __AVX2__, so the #elif __AVX2__ branch in src/kswv.h and
src/kswv.cpp matched first on every AVX-512BW build. The 256-bit AVX2 kernel
produced only 32-lane results into 64-lane score[]/te1[]/qe[] arrays
sized for AVX-512BW; the upper 32 lanes held uninitialized values.
mem_matesw_batch_post read those bogus te values, bwa_gen_cigar2
returned NULL, and mem_reg2aln triggered an a.cigar != NULL assertion on
every AVX-512BW dispatch host (AWS c7a, c7i). The fix qualifies the #elif __AVX2__ guard with !__AVX512BW__, matching the existing pattern in
bandedSWA.h. Closes issue #25.
AVX2 score2 plateau fix (PR #27 closed, PR #28 merged). The AVX2 256-bit
kswv kernel added in PR #20 used a dense SIMD max over every rowMax row to
compute the suboptimal score. Scalar ksw_u8 instead collapses consecutive
rows above minsc into a single b[] entry anchored at the max-score row,
then finds the best anchor outside the primary region. The dense max pulled in
tail rows from a plateau whose anchor sat inside the primary region, inflating
XS by 1–4 on a minority of reads and reducing MAPQ by 2–18 on those reads.
PR #27 (closed) temporarily disabled the AVX2 batched path. PR #28 fixes the
kernel itself by replacing the dense scan with a per-lane scalar emulation of
the b[] build-and-scan logic.
NEON and AVX-512BW 8-bit port (PR #29). The same dense-rowMax score2 scan
existed in kswv_neon_u8 and kswv_512_u8. Confirmed on ARM: rebuilding
smoke-1M on darwin/arm64 pre-fix produced the identical four MAPQ regressions
as the AVX2 case. PR #29 ports the per-lane scalar b[]-emulation fix to both
kernels.
AVX-512BW 16-bit port (PR #30). kswv_512_16 carried four bugs: the same
dense-rowMax plateau pattern, aggregate maxl/minh bounds instead of
per-lane bounds (a gap from PR #21), no minsc filter, and no qe mask. The
per-lane scalar emulation from PR #29 fixes all four naturally.
NEON 16-bit rewrite (PR #31). kswv_neon_16 was effectively dead code
before this PR. Five interacting bugs produced 20,435 BAM diffs vs scalar
reference on smoke-1M -A 2: the score table reinterpreted int16 xor
indices as int8 lookups (inflating match scores by ~256 per cell), the table
was too small for the 16-bit SoA encoding, rowMax was never written, the
early-exit fired on row 0 for all pairs without a KSW_XSTOP target, and all
the fix-3 class bugs from PRs #28–#30 were missing. The PR rewrites the kernel
from scratch against kswv_neon_u8’s structure using 32-byte int8 tables
indexed via vqtbl2_s8, per-lane freeze, exit0 bitmap, and per-lane scalar
score2.
kseq2bseq1 zero-initialization (PR #22)
bseq_read_orig grows its sequence buffer with realloc, leaving tail entries
uninitialized. kseq2bseq1 populated only name, comment, seq, qual,
and l_seq for each entry, leaving sam, bams, n_bams, and cap_bams at
whatever values realloc happened to return. PR #13 added an unconditional
free(ret->seqs[i].bams) in the output loop (fastmap.cpp:571), which turned
those garbage values into a crash — a pointer being freed was not allocated
abort under system malloc and a SIGSEGV under mimalloc — once input exceeded
the initial 256-sequence allocation. The crash was deterministic and
reproducible with -t1.
The fix is a single memset(s, 0, sizeof(*s)) at the top of kseq2bseq1.
Proper-pair flag from emitted alignment (PR #17)
In the no_pairing emission path of mem_sam_pe and mem_sam_pe_batch_post,
the proper-pair bit (0x2) was computed from a[i].a[0].rb regardless of
which alignment was actually emitted. When the primary’s alignment score fell
below the reporting threshold opt->T but a non-primary ALT hit cleared it,
mem_reg2aln emitted a[i].a[n_pri[i]] while mem_infer_dir still read the
below-threshold primary. In that case the SAM flag did not reflect the
coordinates in the record.
The fix stores the selected alignment index per mate in a which[2] array and
passes a[i].a[which[i]].rb to mem_infer_dir, ensuring the proper-pair
flag always matches the emitted record. The bug was present in the bwa-mem2
initial commit from 2019. Upstream reference: pre-existing bug, no open
upstream PR at time of merge.
Changes catalog
| Item | bwa-mem3 PR | Upstream PR/issue | Status |
|---|---|---|---|
@PG CL: tab escape | #54 | bwa-mem2#293 | fork-only (open upstream issue) |
| SMEM buffer overflow on >151 bp reads | #55 | bwa-mem2#238, bwa-mem2#210 | fork-only (upstream PR closed unmerged) |
| kswv nrow==0 guard | #51 | bwa-mem2#289 | fork-only (upstream PR open) |
| AVX-512BW dispatch guard | #26 | — | fork-only |
| AVX2 score2 plateau disable (superseded) | #27 | — | closed (superseded by #28) |
| AVX2 score2 plateau fix | #28 | — | fork-only |
| NEON + AVX-512BW 8-bit score2 fix | #29 | — | fork-only |
| AVX-512BW 16-bit score2 fix | #30 | — | fork-only |
| NEON 16-bit kernel rewrite | #31 | — | fork-only |
| kseq2bseq1 zero-initialization | #22 | — | fork-only |
| Proper-pair flag from emitted alignment | #17 | — | fork-only |
See also: Performance improvements · Architecture support · Upstream PR status · Developer Guide → Regression test framework · Performance → SIMD dispatch matrix