Correctness Fixes

This page documents bugs present in upstream bwa-mem2 that bwa-mem3 fixes. Each fix is isolated to a single PR so it can be reviewed independently and dropped from main once upstream merges the equivalent patch.

`@PG CL:` tab escaping (PR #54)

When a read-group string is passed via -R '@RG\tID:x\tSM:y', the tab characters in the argument were copied verbatim into the @PG CL: SAM header field. The SAM specification uses tabs as field delimiters, so the resulting header line appeared to have extra ID: and other tag fields embedded inside CL:. Lenient parsers (samtools, htsjdk) tolerated the output; strict parsers (noodles, some fgbio configurations) rejected the file as malformed.

The fix replaces each tab character with a space when building the @PG CL: value in src/main.cpp. The @RG line itself is not modified, so the read-group metadata is preserved correctly. A regression shell test (test/pg_cl_escape_test.sh) asserts that the @PG line contains exactly five tab-separated fields after the fix. Upstream issue reference: bwa-mem2#293.

SMEM buffer overflow on reads longer than 151 bp (PR #55)

bwa-mem2 hardcoded READ_LEN 151 in src/macro.h to size the per-thread matchArray SMEM buffer at compile time. The FMI walk wrote past this buffer without bounds checking when reads exceeded 151 bp, causing memory corruption that manifested as segfaults or silent wrong output on 300 bp MiSeq reads, error-corrected long reads, and any run with a non-default -k that extended seed length.

A second cap, MAX_READ_LEN_FOR_LOCKSTEP 512, guarded the lockstep driver’s per-slot stack arrays with a hard assert that aborted on anything longer.

The fix eliminates both compile-time caps. Every per-thread SMEM buffer is now heap-allocated on the memory management context (mmc) and grown on demand from each batch’s observed max_readlength. The pre-walk grow in mem_collect_smem sizes matchArray[tid] to BATCH_MUL * BATCH_SIZE * max_readlength, and all array writes are bounds-checked with a structured smem_overflow_die on overflow. Regression tests cover 300 bp, 1 kbp, and 3 kbp phiX reads; all three segfaulted before the fix and produce correct NM:i:0 alignments after. Upstream references: bwa-mem2#210 (issue), bwa-mem2#238 (closed unmerged upstream PR).

`kswv` nrow==0 guard (PR #51)

When a SIMD batch contained only padding pairs (all len1 == 0), the DP loop never executed and nrow was zero. The post-loop rowMax + (i-1) * SIMD_WIDTH store still executed, walking SIMD_WIDTH bytes before the beginning of the rowMax allocation. On glibc this produced a free(): invalid pointer abort; on macOS libc it silently corrupted the heap.

The fix wraps the post-loop store in an if (i > 0) guard on all five SIMD kswv kernels: NEON u8, NEON 16, AVX2 u8, AVX-512BW u8, and AVX-512BW 16. The upstream patch bwa-mem2#289 covered only the two AVX-512BW kernels; bwa-mem3 broadens it to the three additional kernels carried in this fork. A dedicated regression test (test/kswv_nrow_zero_test.cpp) builds all-padding batches and verifies each kernel is clean under AddressSanitizer.

kswv score2 plateau series (PRs #26, #27, #28, #29, #30, #31)

The batched mate-rescue Smith-Waterman path (kswv) contains a family of related bugs across its SIMD kernels that inflated the suboptimal score (score2 / XS) and consequently deflated MAPQ relative to upstream bwa-mem2.

AVX-512BW dispatch guard (PR #26). GCC with -mavx512bw automatically defines __AVX2__, so the #elif __AVX2__ branch in src/kswv.h and src/kswv.cpp matched first on every AVX-512BW build. The 256-bit AVX2 kernel produced only 32-lane results into 64-lane score[]/te1[]/qe[] arrays sized for AVX-512BW; the upper 32 lanes held uninitialized values. mem_matesw_batch_post read those bogus te values, bwa_gen_cigar2 returned NULL, and mem_reg2aln triggered an a.cigar != NULL assertion on every AVX-512BW dispatch host (AWS c7a, c7i). The fix qualifies the #elif __AVX2__ guard with !__AVX512BW__, matching the existing pattern in bandedSWA.h. Closes issue #25.

AVX2 score2 plateau fix (PR #27 closed, PR #28 merged). The AVX2 256-bit kswv kernel added in PR #20 used a dense SIMD max over every rowMax row to compute the suboptimal score. Scalar ksw_u8 instead collapses consecutive rows above minsc into a single b[] entry anchored at the max-score row, then finds the best anchor outside the primary region. The dense max pulled in tail rows from a plateau whose anchor sat inside the primary region, inflating XS by 1–4 on a minority of reads and reducing MAPQ by 2–18 on those reads. PR #27 (closed) temporarily disabled the AVX2 batched path. PR #28 fixes the kernel itself by replacing the dense scan with a per-lane scalar emulation of the b[] build-and-scan logic.

NEON and AVX-512BW 8-bit port (PR #29). The same dense-rowMax score2 scan existed in kswv_neon_u8 and kswv_512_u8. Confirmed on ARM: rebuilding smoke-1M on darwin/arm64 pre-fix produced the identical four MAPQ regressions as the AVX2 case. PR #29 ports the per-lane scalar b[]-emulation fix to both kernels.

AVX-512BW 16-bit port (PR #30). kswv_512_16 carried four bugs: the same dense-rowMax plateau pattern, aggregate maxl/minh bounds instead of per-lane bounds (a gap from PR #21), no minsc filter, and no qe mask. The per-lane scalar emulation from PR #29 fixes all four naturally.

NEON 16-bit rewrite (PR #31). kswv_neon_16 was effectively dead code before this PR. Five interacting bugs produced 20,435 BAM diffs vs scalar reference on smoke-1M -A 2: the score table reinterpreted int16 xor indices as int8 lookups (inflating match scores by ~256 per cell), the table was too small for the 16-bit SoA encoding, rowMax was never written, the early-exit fired on row 0 for all pairs without a KSW_XSTOP target, and all the fix-3 class bugs from PRs #28–#30 were missing. The PR rewrites the kernel from scratch against kswv_neon_u8’s structure using 32-byte int8 tables indexed via vqtbl2_s8, per-lane freeze, exit0 bitmap, and per-lane scalar score2.

`kseq2bseq1` zero-initialization (PR #22)

bseq_read_orig grows its sequence buffer with realloc, leaving tail entries uninitialized. kseq2bseq1 populated only name, comment, seq, qual, and l_seq for each entry, leaving sam, bams, n_bams, and cap_bams at whatever values realloc happened to return. PR #13 added an unconditional free(ret->seqs[i].bams) in the output loop (fastmap.cpp:571), which turned those garbage values into a crash — a pointer being freed was not allocated abort under system malloc and a SIGSEGV under mimalloc — once input exceeded the initial 256-sequence allocation. The crash was deterministic and reproducible with -t1.

The fix is a single memset(s, 0, sizeof(*s)) at the top of kseq2bseq1.

Proper-pair flag from emitted alignment (PR #17)

In the no_pairing emission path of mem_sam_pe and mem_sam_pe_batch_post, the proper-pair bit (0x2) was computed from a[i].a[0].rb regardless of which alignment was actually emitted. When the primary’s alignment score fell below the reporting threshold opt->T but a non-primary ALT hit cleared it, mem_reg2aln emitted a[i].a[n_pri[i]] while mem_infer_dir still read the below-threshold primary. In that case the SAM flag did not reflect the coordinates in the record.

The fix stores the selected alignment index per mate in a which[2] array and passes a[i].a[which[i]].rb to mem_infer_dir, ensuring the proper-pair flag always matches the emitted record. The bug was present in the bwa-mem2 initial commit from 2019. Upstream reference: pre-existing bug, no open upstream PR at time of merge.

Changes catalog

Item	bwa-mem3 PR	Upstream PR/issue	Status
`@PG CL:` tab escape	#54	bwa-mem2#293	fork-only (open upstream issue)
SMEM buffer overflow on >151 bp reads	#55	bwa-mem2#238, bwa-mem2#210	fork-only (upstream PR closed unmerged)
kswv nrow==0 guard	#51	bwa-mem2#289	fork-only (upstream PR open)
AVX-512BW dispatch guard	#26	—	fork-only
AVX2 score2 plateau disable (superseded)	#27	—	closed (superseded by #28)
AVX2 score2 plateau fix	#28	—	fork-only
NEON + AVX-512BW 8-bit score2 fix	#29	—	fork-only
AVX-512BW 16-bit score2 fix	#30	—	fork-only
NEON 16-bit kernel rewrite	#31	—	fork-only
kseq2bseq1 zero-initialization	#22	—	fork-only
Proper-pair flag from emitted alignment	#17	—	fork-only

Keyboard shortcuts

bwa-mem3