Changelog
Release 0.2.0 (2026-05-13)
Operational / packaging
- Single-binary SIMD dispatch on x86 (#83). The previous multi-binary build (make multi producing five bwa-mem3.<tier> ISA variants plus a runsimd.cpp launcher that execv’d the matching tier) is replaced by a single binary that contains compiled kernels for every supported tier (sse41/sse42/avx/avx2/avx512bw) and selects one in process at startup via __builtin_cpu_supports. Install size drops from ~120 MB to ~25 MB; per-call overhead is one indirect branch (~0.3 ns after BTB warm-up). No .<tier> companion files are produced or needed. See docs/src/developer-guide/launcher.md.
- BWAMEM3_FORCE_TIER=<tier> and BWAMEM3_DEBUG_SIMD=1 env vars (#83). BWAMEM3_FORCE_TIER is downgrade-only and replaces the prior “exec the bwa-mem3.sse41 binary” A/B-testing pattern; up-tier or unrecognized requests are rejected with a stderr warning.
- BASELINE_ARCH=avx2 is the new default for non-kernel translation units on x86 (#84, supersedes the SSE4.1 floor that PR #83 originally shipped with). Override via make BASELINE_ARCH=<tier>. AVX-512BW hosts using BASELINE_ARCH=avx512bw see a small additional speedup on Zen 4 with -mprefer-vector-width=256 (#86) and roughly flat results on Sapphire Rapids; see docs/src/whats-different/avx512-baseline.md for the characterization.
- Host-floor precheck (#95). bwa-mem3 mem, bwa-mem3 index, and bwa-mem3 shm refuse to run with exit code 2 and an [E::bwamem3] stderr message when the host CPU does not meet the build’s compile-time SIMD floor, instead of SIGILL-ing deep in alignment. bwa-mem3 version, --help, and -h are exempt and always succeed.
- bwa-mem3 version now prints SIMD floor: (build’s required minimum) and SIMD runtime: (resolved tier) lines on stdout, plus a [W::bwa-mem3] warning on stderr (exit 0) if the host is below the floor. See docs/src/getting-started/host-requirements.md.
- bwa-mem3 shm performs a statvfs("/dev/shm") capacity preflight (#86). When /dev/shm is too small for the index, the stage aborts with an [E::bwa_shm_stage] message naming /dev/shm, the required size, and a mount -o remount,size=... hint, replacing the prior [fread] Bad address failure mode. statvfs failures (no /dev/shm, restricted sandbox) are non-fatal and the stage proceeds.
- bwa-mem3 shm / bwactl registry RMW is now serialized via a POSIX named semaphore (#82, closes #66). Concurrent shm stage / shm drop invocations across processes no longer race when updating the registry; the prior best-effort flock was per-open and did not cover the read-modify-write window.
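The tier-selection rules described above (downgrade-only override, host-floor precheck) can be sketched as a small model. This is an illustration only, not the project's C++ implementation: the function names resolve_tier and meets_floor are hypothetical, the warning strings are invented, and the ascending tier order is assumed from the order in which the changelog lists the tiers.

```python
# Tier names in ascending capability order (assumed from the changelog's listing).
TIERS = ["sse41", "sse42", "avx", "avx2", "avx512bw"]


def resolve_tier(host_tier, forced=None):
    """Return (tier, warning) for a host whose best supported tier is host_tier.

    `forced` stands in for the value of BWAMEM3_FORCE_TIER (None if unset).
    The override is downgrade-only: an up-tier or unrecognized request is
    ignored and reported through the warning string (wording is hypothetical).
    """
    if forced is None:
        return host_tier, None
    if forced not in TIERS:
        return host_tier, f"unrecognized tier {forced!r}; using {host_tier}"
    if TIERS.index(forced) > TIERS.index(host_tier):
        return host_tier, f"cannot up-tier to {forced}; using {host_tier}"
    return forced, None


def meets_floor(host_tier, floor_tier):
    """True when the host supports at least the build's compile-time floor."""
    return TIERS.index(host_tier) >= TIERS.index(floor_tier)
```

Under this model, an avx2 host with the override set to sse42 runs the sse42 kernels, while a request for avx512bw on the same host is ignored with a warning. A host below the floor fails meets_floor, which corresponds to the exit-code-2 refusal described above.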
Methylation
- mem --meth emits Bismark-compatible auxiliary tags XR:Z (read conversion CT/GA), XG:Z (genome strand CT/GA), and XM:Z (per-base methylation call string) (#90). These replace the prior bwameth-style YS:Z/YC:Z/YD:Z on output (still used internally for SEQ restoration). The reference-annotation XR:Z from -V is suppressed under --meth to avoid colliding with the Bismark semantics. Downstream tools that previously read YS:Z/YC:Z/YD:Z must be pointed at the corresponding XR:Z/XG:Z and the per-base XM:Z. See docs/src/methylation/tags.md.
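For readers new to the Bismark convention, the following sketch shows how an XM:Z call string is formed in the simplest case: an ungapped read on the original top strand (XR:Z:CT, XG:Z:CT). The letter codes are Bismark's (z/Z for CpG, x/X for CHG, h/H for CHH, u/U for unknown context, "." for positions with no call). The function name xm_string is hypothetical, and reverse-strand reads, indels, and clipping are deliberately left out, so this is not a description of how mem --meth computes the tag.

```python
def xm_string(read, ref):
    """Build a Bismark-style XM:Z string for an ungapped forward-strand alignment.

    `ref` should extend two bases past the end of `read` so that the context
    of the last read bases can be classified; otherwise they are called 'u'/'U'.
    """
    calls = []
    for i, base in enumerate(read):
        if ref[i] != "C" or base not in "CT":
            calls.append(".")  # not a cytosine position, or a mismatch
            continue
        nxt, nxt2 = ref[i + 1 : i + 2], ref[i + 2 : i + 3]
        if nxt == "G":
            code = "z"  # CpG context
        elif nxt in ("A", "C", "T") and nxt2 == "G":
            code = "x"  # CHG context
        elif nxt in ("A", "C", "T") and nxt2 in ("A", "C", "T"):
            code = "h"  # CHH context
        else:
            code = "u"  # unknown context (N, or ran off the reference)
        # C kept in the read means methylated (upper case);
        # C read as T means unmethylated (lower case).
        calls.append(code.upper() if base == "C" else code)
    return "".join(calls)
```

A downstream tool migrating from YS:Z/YC:Z/YD:Z would read methylation state per base from this string instead of re-deriving it from the restored SEQ.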
Correctness
- Fixed SIGSEGV in mem_matesw on shm-backed ref_string (#85). ksw_align2 mutates its reference slice in place; when the slice pointed into a read-only shm segment, this faulted. Now copies the slice before passing it in.
- FMI_search sampled-SA prefetch: parenthesized SA_COMPX_MASK precedence so the masked offset is computed against the correct operand (#73). The unparenthesized form was silently producing wrong-but-harmless prefetch addresses; no alignment output was affected.
- bntseq .alt parser bounds the line buffer to prevent a stack-overflow on malicious or malformed .alt files (#74).
- display_stats clamps the per-thread bucket count to LIM_C so --profile with -t greater than the compiled-in limit no longer writes past the end of the stats array (#81).
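The copy-before-mutate pattern behind the mem_matesw fix can be shown in miniature. In this sketch a read-only memoryview plays the role of the shm-backed reference, and align_in_place is a hypothetical stand-in for a routine that rewrites its input buffer; it is not ksw_align2. In Python the unsafe call raises TypeError where the C++ code faulted with SIGSEGV.

```python
def align_in_place(ref):
    """Hypothetical routine that overwrites its reference buffer.

    Here it converts ASCII bases to small integer codes in place.
    """
    codes = {ord("A"): 0, ord("C"): 1, ord("G"): 2, ord("T"): 3}
    for i, b in enumerate(ref):
        ref[i] = codes.get(b, 4)
    return len(ref)


def align_readonly_slice(shared, start, end):
    """Copy the slice into a private, writable buffer before aligning."""
    scratch = bytearray(shared[start:end])  # private copy; `shared` is untouched
    return align_in_place(scratch), bytes(scratch)


# A read-only view stands in for the shared-memory reference.
shared_ref = memoryview(b"ACGTACGT")
```

Calling align_in_place(shared_ref[2:6]) directly fails because the view is read-only, whereas align_readonly_slice(shared_ref, 2, 6) succeeds and leaves the shared bytes unchanged. The cost is one small copy per call.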
Performance
- x86 wall-time improvements on the bench (vs the 0.1.0-pre baseline): AVX2 (c6a) −17 to −22%, AVX-512 AMD Zen4 (c7a) −16 to −24%, AVX-512 Intel SPR (c7i) −28 to −30% across wgs / wes / panel-twist 5M-read samples. Concordance vs upstream bwa-mem2 v2.2.1 remains 100.0000% on all non-methylation cells. arm64 (c7g / c8g) is flat (within ±2%). The wins are attributable primarily to (a) capping AVX-512BW auto-vectorization at 256-bit on the avx512bw target (#86) and (b) inlining FMI_search::backwardExt to recover a gcc 12+ wall-clock regression (#88). See docs/src/performance/overview.md for the reference numbers across architectures.
- Smaller contributions in the release window: per-strip L1 prefetches across all kswv u8/u16 kernels (#70); SMEM_LOCKSTEP_N bumped from 8 to 16 (#75); closed-form ungapped HIT path when total_mis == 0 (#77); ksort switched to an on-stack buffer for small n to drop a per-call malloc (#78); libsais_build skips a wasted zero-init pass on its unpack and SA buffers, trimming index-build time (#80).
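The changelog does not spell out how the closed-form path in #77 works. One plausible reading, sketched here under that assumption, is that when an ungapped comparison finds zero mismatches, the score and CIGAR follow directly from the read length and no dynamic programming is needed. The function name ungapped_hit and its parameters are hypothetical; the default match score of 1 mirrors bwa-mem's default -A value.

```python
def ungapped_hit(read, ref_window, match_score=1):
    """Return (score, cigar) without dynamic programming, or None.

    The shortcut applies only when the read and the equally long reference
    window agree at every position (zero mismatches).
    """
    if len(read) != len(ref_window):
        return None
    total_mis = sum(1 for r, g in zip(read, ref_window) if r != g)
    if total_mis == 0:
        return len(read) * match_score, f"{len(read)}M"
    return None  # caller falls back to the full alignment
```

Returning None for every other case keeps the shortcut strictly additive: any read that does not match exactly takes the existing alignment path unchanged.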
Release 0.1.0-pre (2026-04-28)
- Project renamed from bwa-mem2 to bwa-mem3. The new project tracks Fulcrum Genomics’ performance and feature work on top of the upstream bwa-mem2 codebase.
- Default branch renamed from fg-main to main.
- Binary renamed from bwa-mem2 to bwa-mem3. Arch-suffixed variants (bwa-mem3.sse41, .sse42, .avx, .avx2, .avx512bw, .arm64, .pgo, .profile, .lto) renamed to match.
- @PG SAM header tags now read ID:bwa-mem3 PN:bwa-mem3 (and bwa-mem3-meth for --meth mode).
- Test binaries renamed: bwa_mem2_tests_unit → bwa_mem3_tests_unit, bwa_mem2_tests_integration → bwa_mem3_tests_integration.
- .bwt.2bit.64 index file format unchanged; bwa-mem3 reads indexes built by bwa-mem2 index without re-indexing.
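Pipelines that key on the @PG program name may need updating after the rename. The sketch below parses @PG lines, which are tab-separated TAG:VALUE fields per the SAM specification, and checks the PN value. The helper names are hypothetical, and treating bwa-mem3-meth as a PN value is an assumption, since the entry above does not say which tag carries it.

```python
def pg_records(header_lines):
    """Parse @PG header lines into dicts keyed by two-letter tag."""
    records = []
    for line in header_lines:
        if not line.startswith("@PG\t"):
            continue
        fields = line.rstrip("\n").split("\t")[1:]
        records.append(dict(f.split(":", 1) for f in fields if ":" in f))
    return records


def produced_by_bwa_mem3(header_lines):
    """True if any @PG record names bwa-mem3 or bwa-mem3-meth as its program."""
    names = {"bwa-mem3", "bwa-mem3-meth"}  # second value is an assumption
    return any(rec.get("PN") in names for rec in pg_records(header_lines))
```

A check written against PN:bwa-mem2 returns False for files produced after the rename, which is the case this helper is meant to make explicit.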
Release 2.2.1 (17 March 2021)
Hotfix for v2.2: Fixed the bug mentioned in #135.
Release 2.2 (8 March 2021)
Changes since the last release (2.1):
- Passed the validation test on ~88 billion reads (Credits: Keiran Raine, CASM division, Sanger Institute)
- Fixed bugs reported in #109 that caused mismatches between bwa-mem and bwa-mem2
- Fixed the issue (#112) that caused a crash due to a corrupted thread id
- Now uses all the SSE flags to create optimized SSE41 and SSE42 binaries
Release 2.1 (16 October 2020)
Release 2.1 of BWA-MEM2.
Changes since the last release (2.0):
- Smaller index: the index is 8 times smaller on disk and 4 times smaller in memory, due to moving to only one type of FM-index (2bit.64 instead of 2bit.64 and 8bit.32) and 8x compression of the suffix array. For example, for the human genome, index size on disk is down to ~10GB from ~80GB and memory footprint is down to ~10GB from ~40GB. Index IO time decreases substantially as a result, with hardly any performance impact on read mapping.
- Added support for 2 more execution modes: sse4.2 and avx.
- Fixed multiple bugs including those reported in Issues #71, #80 and #85.
- Merged multiple pull requests.
Release 2.0 (9 July 2020)
This is the first production release of BWA-MEM2.
Changes since the last release:
- Made the source code more secure with more than 300 changes all across it.
- Added support for memory re-allocations in case the pre-allocated fixed memory is insufficient.
- Added support for the MC tag in SAM output and for the -5 and -q command-line flags.
- The output is now identical to the output of bwa-mem-0.7.17.
- Merged the index-building code with the FMI_Search class.
- Added support for the same ways of supplying read files as bwa-mem.
- Fixed a bug in the AVX512 SAM-processing code that was leading to incorrect output.
Release 2.0pre2 (4 February 2020)
Miscellaneous changes:
- Changed the license from GPL to MIT.
- IMPORTANT: the index structure has changed since commit 6743183. Please rebuild the index if you are using a later commit or the new release.
- Added charts in README.md comparing the performance of bwa-mem2 with bwa-mem.
Major code changes:
- Fixed handling of variable-length reads.
- Fixed a bug involving reads longer than 250bp.
- Added support for allocating more memory in small chunks if the large pre-allocated fixed memory is insufficient. This is needed very rarely (and so has no impact on performance) but prevents assertion failures, and therefore crashes, in that scenario.
- Fixed a memory leak caused by not releasing the memory allocated for seeds after smem.
- Fixed a segfault caused by misaligned small memory allocations in the optimized banded Smith-Waterman.
- Enabled support for genomes larger than 7-8 billion nucleotides (e.g. the wheat genome).
- Fixed a segfault occurring (with the gcc compiler) while reading the index.