Home

bwa-mem3

A faster, more correct, drop-in replacement for bwa mem and bwa-mem2.

If you align short reads with bwa or bwa-mem2 today, bwa-mem3 will give you the same answers — only quicker, with fewer rough edges, and with first-class support for things you used to need a wrapper script for.

Why bwa-mem3

Drop in, go faster. Same algorithm, same flags, and equivalent alignments to bwa-mem2 (see What’s different from bwa-mem2 for the few output differences), while consolidated mapping speedups, a memory-bounded index builder, batched header ingestion, and a tuned allocator add up to measurable wall-clock wins on real workloads.
Methylation in one binary. A --meth flag adds native bisulfite/EM-seq alignment: bwameth-compatible read placement by default (--meth-scoring collapsed), or variant-aware scoring on request (--meth-scoring genomic). No Python, no inline conversion script, no separate post-processing step. One bwa-mem3 index --meth ref.fa, one bwa-mem3 mem --meth ref.fa R1.fq R2.fq, done — headers consolidated and Bismark tags emitted.
Stage the index once, align many. A bwa-mem3 shm subcommand pins the FM-index in shared memory so back-to-back runs on the same host skip the ~11 GB index read every time.
Correctness fixes upstream haven’t merged yet. Tabs in -R, 151+ bp reads, AVX-512 mate-rescue, kswv score2 plateau across NEON/AVX2/AVX-512BW, mem_sam_pe proper-pair flag — every fix tracked back to the upstream PR or issue that found it.
Architecture-aware out of the box. SSE4.1, SSE4.2, AVX, AVX2, AVX-512BW, and ARM64/NEON. One binary per platform; the dispatcher picks the right tier for your CPU in-process at startup.

Get started in 30 seconds

git clone --recursive https://github.com/fg-labs/bwa-mem3
cd bwa-mem3 && make
./bwa-mem3 index ref.fa
./bwa-mem3 mem -t 16 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -@ 8 -o out.bam

Tip — Emit BAM directly

For production pipelines, add --bam=0 to skip the SAM text round-trip entirely. See Best Practices: Output format.

Where to start

Installation — Build from source (Bioconda is on the way).
Quick start: align paired-end FASTQs — Two commands to your first alignment.
Quick start: methylation — The one-binary bwameth.py replacement, in two commands.
Best Practices — The five things that actually move the needle for production runs.
What’s different from bwa-mem2 — Every fix and feature, with upstream cross-references.

What’s in this book

Getting Started — Install and run your first alignment.
User Guide — Indexing, alignment, output, threading, allocator notes.
Performance — Where the speed comes from and how to get more.
Best Practices — Build, run, and deploy recommendations.
CLI Reference — Every flag, auto-captured from --help.
Methylation Reference — --meth mode in full.
What’s Different from bwa-mem2 — The full changelog, by category.
Developer Guide — Build matrix, SIMD dispatch, regression tests, contributing.
Related Projects — bwa-mem3-bench, bwa-mem3-rs, fgumi, bwa-mem2 upstream.
Reference — PR catalog, glossary, citation, license, changelog.

bwa-mem3 is a derivative of bwa-mem2 maintained by Fulcrum Genomics. MIT licensed. See License and Citation.

Installation

Already using bwa or bwa-mem2? After installing, see Coming from bwa or bwa-mem2 — the command line is unchanged, but the index and output have a couple of migration notes.

Bioconda (coming soon)

A Bioconda package for bwa-mem3 is in preparation. Once published, installation will be:

conda install -c bioconda bwa-mem3

This will be the recommended path for most users. Check back here or watch the fg-labs/bwa-mem3 repository for the announcement.

Build from source

Until the Bioconda package is available, build from source using the steps below.

Prerequisites

bwa-mem3 vendors several libraries as git submodules. Building from source requires the toolchain to compile bwa-mem3 itself plus the bootstrap tools each vendored library needs.

Tool	Why it’s needed	Minimum version
C++14 compiler (GCC or Clang)	bwa-mem3 itself	GCC 8+ / Clang 7+
GNU make	top-level build	3.81+
Git	submodule checkout (with `--recursive`)	any recent
autoconf, automake, autoconf-archive, libtool	`ext/htslib` runs `autoreconf -i && ./configure` during build	any recent
pkg-config	htslib’s `configure` uses it to locate zlib	any recent
zlib development headers	htslib links against zlib	any recent
libdeflate development headers	`src/fast_reader.c` uses libdeflate for BGZF block decode	any recent
OpenMP runtime	`ext/libsais` uses OpenMP for parallel suffix-array construction	see notes below
CMake	building bundled mimalloc (default; skip if you pass `USE_MIMALLOC=0`)	3.12+

OpenMP notes.

On Linux with GCC, libgomp ships with the compiler — no extra package needed.

On Linux with Clang, install libomp-dev (Debian/Ubuntu) or libomp-devel (RHEL/Fedora).

On macOS, install Homebrew’s libomp (brew install libomp). The Makefile auto-detects the Homebrew prefix; set LIBOMP_PREFIX=/path/to/libomp if you installed it elsewhere.

Install prerequisites by platform

Debian / Ubuntu:

sudo apt-get install \
    build-essential git cmake pkg-config \
    autoconf automake autoconf-archive libtool \
    zlib1g-dev libdeflate-dev \
    libomp-dev          # only needed if building with Clang

RHEL / Fedora / Amazon Linux:

sudo dnf install \
    gcc gcc-c++ make git cmake pkgconf-pkg-config \
    autoconf automake autoconf-archive libtool \
    zlib-devel libdeflate-devel \
    libomp-devel        # only needed if building with Clang

Amazon Linux 2023 has no libdeflate-devel. The dnf install above will fail on that one package. Build and install libdeflate from source instead (e.g. v1.22): configure with cmake -B build -DCMAKE_INSTALL_PREFIX=/usr/local -DLIBDEFLATE_BUILD_SHARED_LIB=OFF, then cmake --build build && sudo cmake --install build to put the headers and static library under /usr/local. CMake installs the library to lib or lib64 depending on the distro, so set LIBRARY_PATH=/usr/local/lib64:/usr/local/lib (covering both) before running bwa-mem3’s make. Other RHEL/Fedora releases have the package.

macOS (Homebrew):

xcode-select --install   # Apple Clang + git + make
brew install \
    cmake pkg-config \
    autoconf automake autoconf-archive libtool \
    libdeflate libomp

What happens if a prereq is missing. The Makefile fails fast with an actionable error: a missing libomp on macOS, a missing autoreconf, or a missing cmake each produces a one-line hint pointing at the install command above. There is no need to install everything up front; if you prefer, install only what the error message asks for.

Clone and build

git clone --recursive https://github.com/fg-labs/bwa-mem3
cd bwa-mem3
make

The --recursive flag is required. bwa-mem3 vendors several libraries (mimalloc, sse2neon, and others) as git submodules. A non-recursive clone will fail to compile.

Warning — Shallow clone submodule pitfall

If you cloned without --recursive, initialize the submodules before running make:
git submodule update --init --recursive
Forgetting this step is the most common source of build failures.

Target architecture

By default, make builds a general-purpose binary that runs on any supported CPU. For maximum performance, specify the architecture that matches your deployment target:

Command	Requires	Notes
`make`	SSE4.1 or better (x86), any (ARM)	Default; selects best dispatch at runtime on x86
`make arch=avx2`	AVX2 (e.g. Haswell, Zen 2)	Recommended for modern x86 servers
`make arch=avx512bw`	AVX-512BW (e.g. Skylake-X, Ice Lake, Sapphire Rapids)	Maximum x86 performance
`make arch=arm64`	Apple Silicon / AWS Graviton	NEON-vectorized build

See Performance — SIMD dispatch matrix for the full matrix of which kernels are vectorized under each target.
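As a rough sketch of how the table above might be applied on a Linux x86 build host that is also the deployment target, the helper below maps a CPU feature string (for example the `flags` line of /proc/cpuinfo) to one of the listed `make` commands. The name `pick_make_arch` is hypothetical and not part of bwa-mem3. On arm64 you would pass `arch=arm64` directly.

```shell
# Hypothetical helper: choose a `make` command from a CPU flag string.
# Only the arch= values listed in the table above are used.
pick_make_arch() {
    flags=" $1 "
    case "$flags" in
        *" avx512bw "*) echo "make arch=avx512bw" ;;
        *" avx2 "*)     echo "make arch=avx2" ;;
        *)              echo "make" ;;   # default build
    esac
}

# Example on Linux (x86): feed it the first flags line from /proc/cpuinfo.
# pick_make_arch "$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)"
```

If the binary will run on hosts other than the build host, choose the target by the weakest CPU in the fleet rather than by the build machine.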

Memory allocator (mimalloc)

bwa-mem3 bundles mimalloc and links it into every binary by default. mimalloc reduces allocator contention under high thread counts and lowers wall-clock time on multi-threaded alignment runs.

To build without mimalloc, pass USE_MIMALLOC=0:

make USE_MIMALLOC=0

See User Guide — Memory allocator for details on how mimalloc is linked on Linux versus macOS and when opting out is appropriate.

Smoke test

After building, run the smoke test to confirm the binary works and report which allocator is active:

./bwa-mem3 version

Expected output (with mimalloc; the SIMD floor and runtime lines shown under Host requirements are omitted here):

v0.2.0-12-gabcdef1
mimalloc 3.x.x

If the mimalloc line is absent, the build linked the system allocator (expected when USE_MIMALLOC=0 was passed or when the vendor submodule was not initialized).

Next: host requirements

If you’re planning to deploy bwa-mem3 across a heterogeneous fleet (AWS Batch, mixed compute clusters), read Host requirements for the supported CPU floor and Best Practices → Multi-architecture deployment for the deployment recipe.

Coming from bwa or bwa-mem2

bwa-mem3’s command line is a drop-in for bwa mem (the original BWA) and bwa-mem2 — same subcommand, same flags, same FASTQ inputs. The one catch is the index: a bwa-mem2 index is reused as-is, but a bwa (v1) index must be rebuilt once (different format) before that unchanged command line applies. Three things are worth knowing before you switch.

1. Rebuild the index (bwa) — or reuse it (bwa-mem2)

bwa-mem3 uses the bwa-mem2 index format (ref.fa.bwt.2bit.64 + ref.fa.pac, read on demand via pac-fetch).

Coming from bwa (v1): your bwa index files use a different FM-index format (.bwt / .sa) and cannot be used. Rebuild once — it takes a few minutes and leaves your FASTA untouched:
```
bwa-mem3 index ref.fa
```
Coming from bwa-mem2: your existing index works as-is, no rebuild. bwa-mem3 reads the .bwt.2bit.64 and .pac your bwa-mem2 index already produced; it ignores the .0123 file (and no longer builds one — pass index --emit-unpacked-ref only if some other tool still needs it).

See Indexing the reference for details.

2. The command line is the same

Every bwa mem flag is accepted, so an existing invocation runs unchanged — swap the binary and go:

# was: bwa mem -t 16 -R '@RG\tID:s1\tSM:s1' ref.fa R1.fq.gz R2.fq.gz > out.sam
bwa-mem3 mem -t 16 -R '@RG\tID:s1\tSM:s1' ref.fa R1.fq.gz R2.fq.gz > out.sam

bwa-mem3 adds a few flags on top (none change default behavior): --bam[=N] (emit BAM directly), --meth (native bisulfite/EM-seq), --supp-rep-hard-cap and --min-ext-len (opt-in tuning). See the CLI reference.

3. Output is equivalent, not byte-identical

Where each read maps — position, CIGAR, MAPQ, FLAG — is preserved on the data we have tested, but the SAM byte stream is not identical to bwa/bwa-mem2: bwa-mem3 emits a few additional tags, converges per-architecture SIMD score2/MAPQ toward the scalar reference, and breaks ties deterministically. If you validate against a previous bwa/bwa-mem2 release, expect (and audit) these differences — see Equivalence with bwa-mem2 for the field-by-field comparison and a per-PR trail.
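One way to audit those differences is to compare only the placement fields and ignore the tags. The sketch below assumes two SAM files produced from the same input in the same record order (old.sam and new.sam are placeholder names) and relies on the standard SAM column order. `sam_core_fields` is a hypothetical helper name.

```shell
# Hypothetical helper: print QNAME, FLAG, RNAME, POS, MAPQ, CIGAR for each
# alignment record, skipping header lines (those starting with '@').
sam_core_fields() {
    awk -F'\t' -v OFS='\t' '!/^@/ { print $1, $2, $3, $4, $5, $6 }' "$@"
}

# Usage (file names are placeholders):
#   sam_core_fields old.sam > old.core.tsv
#   sam_core_fields new.sam > new.core.tsv
#   diff old.core.tsv new.core.tsv
```

An empty diff means the two runs agree on placement. Any remaining lines are the records to inspect against the field-by-field comparison.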

Recommended migration sequence

1. Install bwa-mem3 (see Installation).
2. Index: rebuild from bwa, or reuse a bwa-mem2 index (see above).
3. Run the drop-in profile first, with no extra flags, and validate against your current pipeline so the only changed variable is the binary.
4. Then opt into the recommended profile (-m 10 -y 0) for bwa-mem3’s best speed/accuracy trade-off. See Settings profiles: drop-in vs recommended.

Host requirements

bwa-mem3 runs on the hosts in the table below. Verify your host with bwa-mem3 version — the SIMD floor and runtime lines tell you what the binary needs and what your host provides.

Platform	Default build floor	Earliest supported CPU	Notes
Linux x86_64	AVX2 (`BASELINE_ARCH=avx2`)	Intel Haswell (2013); AMD Zen / Naples (2017)	Auto-selects best of `sse41 / sse42 / avx / avx2 / avx512bw` at runtime
Linux x86_64 (legacy)	SSE4.1 (`BASELINE_ARCH=sse41`)	Intel Nehalem (2008); AMD Bulldozer (2011)	Opt-in rebuild; ~10-15% slower on AVX2 hosts
Linux arm64	NEON (aarch64 ABI baseline)	Any aarch64 host	Single tier; NEON is mandatory in the aarch64 ABI
macOS arm64	NEON	Apple M1 (2020)	Apple Silicon only; macOS x86_64 is unsupported

How to verify

$ bwa-mem3 version
v0.2.0-12-gabcdef1
SIMD floor: avx2 (x86-64-v3, Haswell 2013+); kernels: sse41 sse42 avx avx2 avx512bw
SIMD runtime: avx512bw (BWAMEM3_FORCE_TIER unset)
mimalloc 3.x.x

The SIMD floor: line tells you what host features the binary requires.
The SIMD runtime: line tells you what kernel tier was selected at startup.
On a host below the floor, bwa-mem3 version writes a [W::bwa-mem3] warning line to stderr (not stdout) and still exits 0, so the diagnostic command stays usable even on hosts that cannot run alignment. The floor + runtime lines remain on stdout, so bwa-mem3 version | grep '^SIMD' works in CI scripts even on too-old hosts.
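For CI scripts that need the tier names rather than the whole line, a small parser along these lines may help. It assumes the line layout shown in the sample output above, and `simd_tier` is a hypothetical helper name.

```shell
# Hypothetical helper: read `bwa-mem3 version` output on stdin and print the
# tier named on the requested line ("floor" or "runtime").
simd_tier() {
    awk -v key="SIMD $1:" 'index($0, key) == 1 { print $3; exit }'
}

# Usage in a CI script:
#   floor=$(bwa-mem3 version 2>/dev/null | simd_tier floor)
#   runtime=$(bwa-mem3 version 2>/dev/null | simd_tier runtime)
```

Because the warning goes to stderr, discarding stderr here does not lose the two SIMD lines.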

Failure mode on too-old hosts

If you run bwa-mem3 mem (or another alignment subcommand) on a host below the floor, the binary refuses with exit code 2 and a stderr message identifying the gap:

[E::bwamem3] this binary was compiled for SIMD floor avx2 and emits avx2
instructions in non-kernel translation units. The host CPU does not support
avx2 (detected: sse42). Running would SIGILL on the first avx2 instruction.

To run on this host, rebuild bwa-mem3 with BASELINE_ARCH=sse42 (or lower),
or use a binary built for a lower SIMD floor.

The version subcommand stays exit-0 so introspection still works on the same host.
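A pipeline wrapper can use that exit code to tell a too-old host apart from other failures. The sketch below is one possible shape. It assumes exit status 2 is not also used for unrelated errors, which the text above does not state, and `run_alignment` is a hypothetical name.

```shell
# Hypothetical wrapper: run an alignment command and flag a below-floor host.
# Exit status 2 is the refusal described above; every status is passed
# through unchanged so the caller can still react to it.
run_alignment() {
    status=0
    "$@" || status=$?
    if [ "$status" -eq 2 ]; then
        echo "host may be below the binary's SIMD floor; rebuild or reschedule" >&2
    fi
    return "$status"
}

# Usage:
#   run_alignment bwa-mem3 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > out.sam
```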

Mixed-architecture fleets

For AWS Batch and other heterogeneous compute environments where the same job may schedule onto x86_64 or arm64 hosts, see Best Practices → Multi-architecture deployment.

Quick start: align paired-end FASTQs

This page walks through the two-command workflow: index the reference once, then align reads.

Switching from bwa or bwa-mem2? The command line is unchanged, but rebuild the index if you’re coming from bwa v1 (a bwa-mem2 index is reused as-is), and expect output that is functionally equivalent but not byte-identical. See Coming from bwa or bwa-mem2.

Index the reference

bwa-mem3 index ref.fa

This produces four index files alongside ref.fa:

File	Description
`ref.fa.bwt.2bit.64`	FM-index in 2-bit packed format
`ref.fa.amb`	Ambiguous base positions
`ref.fa.ann`	Sequence name and length annotations
`ref.fa.pac`	2-bit packed reference sequence

(No ref.fa.0123 is written: mem reconstructs reference bases from .pac on demand. Pass index --emit-unpacked-ref if a tool such as bwa-mem2 needs it.)

Indexing hg38 takes a few minutes on a multi-core host (construction is multi-threaded and memory-bounded — see Indexing the reference); the final index is roughly 11 GB on disk. The index is read once per mem invocation; for workloads that align many samples, load it into shared memory first (see Quick start: shared-memory index).

Align paired-end reads

bwa-mem3 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > out.sam

-t 16 sets the thread count to 16. bwa-mem3 scales well up to the number of physical CPU cores; hyperthreading provides diminishing returns above that point. See User Guide — Threading and resource use for recommendations at different core counts.

The default output is SAM on stdout. To write BAM directly, add --bam (uncompressed by default — best for piping to samtools sort; use --bam=6 for BGZF-compressed BAM):

bwa-mem3 mem --bam -t 16 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -@ 8 -o out.bam -
samtools index out.bam

Tip — Prefer BAM output in production

Piping BAM (--bam) to samtools sort avoids the text formatting and parsing overhead of SAM on both sides of the pipe. For large cohorts this yields a measurable wall-clock reduction. See Best Practices — Output format for the recommended pipeline and a discussion of when SAM is still useful.

Read group tagging

For downstream tools that require a @RG header (most variant callers), pass -R:

bwa-mem3 mem -t 16 \
  -R '@RG\tID:sample1\tSM:sample1\tPL:ILLUMINA\tLB:lib1' \
  ref.fa R1.fq.gz R2.fq.gz > out.sam

The value is a tab-delimited string following BWA conventions. Every aligned record receives an RG:Z: tag matching the ID field of the read-group header.
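When read groups are generated per sample, building the string in one place avoids quoting mistakes. The following is a minimal sketch. `make_read_group` is a hypothetical helper, and the four fields are simply the ones used in the example above.

```shell
# Hypothetical helper: build an -R value from ID, sample, platform, library.
# The doubled backslashes make printf emit a literal backslash followed by t,
# matching the single-quoted form shown above.
make_read_group() {
    printf '@RG\\tID:%s\\tSM:%s\\tPL:%s\\tLB:%s' "$1" "$2" "$3" "$4"
}

# Usage:
#   bwa-mem3 mem -t 16 \
#     -R "$(make_read_group sample1 sample1 ILLUMINA lib1)" \
#     ref.fa R1.fq.gz R2.fq.gz > out.sam
```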

bwa-mem3 emits the standard SAM tags plus the fork’s HN:i tag; for the full list see Output: SAM/BAM, headers, tags (and Methylation SAM tags under --meth).

Quick start: methylation alignment

bwa-mem3 supports bisulfite-converted (WGBS/RRBS/EM-seq) read alignment through a single --meth flag on both index and mem. No Python interpreter, no piped preprocessor, and no separate postprocessing step are required.

Note — bwameth-compatible by default, variant-aware on request

By default (--meth-scoring collapsed) bwa-mem3 closely tracks bwameth.py’s read placement and emits the standard Bismark tags methylation callers expect (MethylDackel, Bismark, PileOMeth, etc.). It is a placement drop-in — not a byte-for-byte reproduction of bwameth: a small fraction of records (~1% on typical WGBS/EM-seq) still differ in POS/CIGAR/MAPQ, so re-validate if you are pinned to a specific bwameth release (see bwameth.py drop-in mapping). Add --meth-scoring genomic to opt into variant-aware scoring (truthful NM/MD; one BAM for both methylation and variant calling).

Index the reference for methylation

Build the methylation index once:

bwa-mem3 index --meth ref.fa

This builds the normal 4-letter index at the bare prefix plus a converted seed index:

File(s)	Description
`ref.fa.amb`, `ref.fa.ann`, `ref.fa.bwt.2bit.64`, `ref.fa.pac`	Normal index over the original reference (used for scoring/extension; bases are pac-fetched from `.pac`, so no `.0123` is built).
`ref.fa.meth.fa`	Per-strand C→T / G→A converted FASTA (`f`/`r` doubled contigs).
`ref.fa.meth.*`	FM-index over the converted FASTA (used only for seeding).

Both indexes live next to ref.fa; bwa-mem3 index ref.fa (no --meth) builds only the first set.

Align bisulfite-converted reads

# Default: collapsed (bwameth-compatible placement)
bwa-mem3 mem --meth -t 16 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -o out.bam
samtools index out.bam

# Opt into variant-aware scoring
bwa-mem3 mem --meth --meth-scoring genomic -t 16 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -o out.bam

# Single-end (RRBS etc.): pass one FASTQ. The C→T (R1) projection is used for all reads.
bwa-mem3 mem --meth -t 16 ref.fa R1.fq.gz \
  | samtools sort -o out.bam

Pass the original (unconverted) reference path. bwa-mem3 finds the ref.fa.meth.* seed index automatically when --meth is active. The paired-end-only flags (-U 100, -p) do not apply to single-end input.

What `--meth` does

In one process, --meth replaces what bwameth.py does with external tools:

Seeds in 3-letter space, scores in 4-letter space — projects each read (R1 C→T, R2 G→A) to seed against ref.fa.meth.*, then extends and scores against the original reference, restoring the original bases in SEQ.
Applies bwameth-aligned defaults — -L 10 -U 100 -T 40 -M -C plus mode-dependent -B (2 for collapsed, 4 for genomic).
Post-processes the BAM inline — consolidates f/r @SQ headers back to real chromosomes, emits Bismark XR/XG/XM tags, runs optional --chimera-qc, and writes uncompressed BAM ready for samtools sort.

See the Methylation Reference — Overview for the mechanism in detail, Flags for the scoring modes and QC flags, and SAM tags for the tag definitions.

Quick start: shared-memory index

The bwa-mem3 hg38 index is roughly 11 GB on disk (~15 GB once loaded into RAM). By default, every bwa-mem3 mem invocation reads the index from disk, which can take 30–60 seconds on a spinning disk and several seconds even on fast NVMe storage. For workloads that align many small samples in sequence on the same machine, this per-invocation overhead accumulates.

bwa-mem3 shm stages the index once into POSIX shared memory. Subsequent mem invocations attach to the in-memory segment instead of reading from disk, reducing per-sample startup time to near zero.

Stage the index

bwa-mem3 shm ref.fa

This reads the index files from disk and copies them into a POSIX shared-memory segment. The command returns when staging is complete. The index stays in memory until it is explicitly dropped or the system is rebooted.

To stage a methylation (--meth) index:

bwa-mem3 shm --meth ref.fa

A standard and a methylation index for the same reference can be staged simultaneously; they occupy separate named segments.

Align using the staged index

No extra flag is needed. When bwa-mem3 mem starts, it checks whether a matching shared-memory segment exists. If one does, it attaches automatically:

bwa-mem3 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > out.sam
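A batch run over many small samples might then look like the sketch below. It assumes the index was already staged with bwa-mem3 shm ref.fa, and that each sample has files named <sample>_R1.fq.gz and <sample>_R2.fq.gz. That naming convention and the `align_samples` function are invented for this example.

```shell
# Hypothetical batch helper: align each named sample against one reference.
# Every mem invocation attaches to the staged segment if one exists, so only
# the staging step (run separately, beforehand) pays the index load cost.
align_samples() {
    ref=$1
    shift
    for sample in "$@"; do
        bwa-mem3 mem -t 16 "$ref" "${sample}_R1.fq.gz" "${sample}_R2.fq.gz" \
            > "${sample}.sam" || return 1
    done
}

# Usage:
#   bwa-mem3 shm ref.fa
#   align_samples ref.fa s1 s2 s3
```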

Inspect and drop staged segments

List all currently staged indexes:

bwa-mem3 shm -l

Drop all staged segments:

bwa-mem3 shm -d

When to use shared-memory indexing

Shared-memory indexing is most beneficial when:

Aligning tens to hundreds of small samples (e.g. amplicon panels, targeted sequencing), where the per-sample index load time dominates the per-sample alignment time.
Running a batch pipeline on a single large machine where the index fits comfortably in RAM (approximately 15 GB for hg38 with the standard index).
Using the same reference for all samples in the batch; a new shm invocation is required for each distinct reference.

It provides little benefit when:

Aligning a small number of large samples (WGS), where alignment time far exceeds index load time.
The available RAM is insufficient to hold the index alongside the operating system and alignment worker processes.

Warning — No staleness check — always drop before re-indexing

bwa-mem3 shm does not detect whether the on-disk index files have changed after staging. If you run bwa-mem3 index ref.fa again (e.g. to rebuild after a reference update), the shared-memory segment is not invalidated. Subsequent mem invocations will attach to the stale segment and produce silently incorrect alignments.

Always drop the segment before re-indexing:
bwa-mem3 shm -d
bwa-mem3 index ref.fa
bwa-mem3 shm ref.fa

Indexing the reference

Before aligning reads, build an FM-index from the reference FASTA with bwa-mem3 index. The index is written to disk and read back at the start of every mem run, so you build it once and reuse it indefinitely.

Basic indexing

bwa-mem3 index ref.fa

The command writes four files alongside the input FASTA:

File	Contents
`ref.fa.bwt.2bit.64`	Burrows-Wheeler Transform, 2-bit packed, 64-bit offsets
`ref.fa.amb`	Coordinates and counts of ambiguous (N) bases
`ref.fa.ann`	Sequence names and lengths
`ref.fa.pac`	Forward sequence, 2-bit packed (one base per 2 bits)

The .bwt.2bit.64 file dominates disk usage. For the human reference (hg38), expect roughly 11 GB total across all four files.

Note — reusing an existing index

From bwa-mem2: an index built by bwa-mem2 index works as-is — no rebuild. bwa-mem3 reads its .bwt.2bit.64 and .pac and ignores the .0123.

From bwa (v1): a bwa index uses a different FM-index format (.bwt / .sa) and cannot be reused — run bwa-mem3 index ref.fa to rebuild (a few minutes; the FASTA is unchanged).

See Coming from bwa or bwa-mem2.

Note — no .0123 by default

Earlier releases (and bwa-mem2) also wrote ref.fa.0123, an unpacked forward+reverse reference (~8× the .pac; ~6.4 GB on hg38). bwa-mem3 no longer builds it: mem reconstructs the bases it needs directly from the packed .pac on demand (pac-fetch), so .0123 is never read. Alignment output is byte-for-byte identical with or without the file. Pass --emit-unpacked-ref to index if you need the file for an external tool that still requires it (e.g. bwa-mem2):
bwa-mem3 index --emit-unpacked-ref ref.fa   # also writes ref.fa.0123
mem ignores any .0123 present and always pac-fetches from .pac.

Methylation index (`--meth`)

bwa-mem3 index --meth ref.fa

Methylation mode builds a dual index: the normal FM-index over the original reference (at the bare prefix), plus a converted seed index under the .meth prefix. The seed index is built over a per-strand-converted FASTA (ref.fa.meth.fa) whose contigs are doubled (f-prefixed C→T, r-prefixed G→A):

# normal index over the original reference (used for scoring/extension)
ref.fa.amb
ref.fa.ann
ref.fa.bwt.2bit.64
ref.fa.pac
# converted seed index (used only for seeding)
ref.fa.meth.fa
ref.fa.meth.amb
ref.fa.meth.ann
ref.fa.meth.bwt.2bit.64
ref.fa.meth.pac
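Before launching a long run it can be useful to confirm that every file in the lists above is present. The sketch below only encodes the file names documented here. The function names are hypothetical and nothing in it is provided by bwa-mem3 itself.

```shell
# Hypothetical helper: list the index files expected for a FASTA prefix.
# Pass "meth" as the second argument to add the converted seed index.
expected_index_files() {
    prefix=$1
    for ext in amb ann bwt.2bit.64 pac; do
        echo "${prefix}.${ext}"
    done
    if [ "${2:-}" = "meth" ]; then
        echo "${prefix}.meth.fa"
        for ext in amb ann bwt.2bit.64 pac; do
            echo "${prefix}.meth.${ext}"
        done
    fi
}

# Print each expected file that does not exist; no output means complete.
missing_index_files() {
    expected_index_files "$@" | while IFS= read -r f; do
        [ -e "$f" ] || echo "$f"
    done
}

# Usage:
#   missing_index_files ref.fa          # plain index
#   missing_index_files ref.fa meth     # dual methylation index
```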

Note — neither index ships a .0123

By default, neither the original index nor the .meth seed index writes an unpacked .0123. mem --meth seeds against the seed FM-index but scores/extends against the original reference, whose bases it pac-fetches from ref.fa.pac. The original .0123 (~6.4 GB) is therefore unnecessary, and the seed’s .0123 (~13 GB) would never be read at all. Skipping both saves ~19 GB of disk on hg38, while the runtime RSS reduction comes from avoiding the original .0123 load. (--emit-unpacked-ref applies to the original index only, for bwa-mem2 compatibility; the seed never needs one.)

The .meth seed FM-index is roughly twice the size of the normal FM-index (its contigs are doubled), so a --meth build is larger than a plain build but well under 3× (by default no .0123 is written; --emit-unpacked-ref adds it for the original index only). For hg38, budget on the order of 35 GB of disk for the combined dual index plus the intermediate ref.fa.meth.fa.

Tip — Pass the original FASTA to mem, not the seed index

When running bwa-mem3 mem --meth, pass the original FASTA path (ref.fa); bwa-mem3 finds ref.fa.meth.* automatically. A legacy ref.fa.bwameth.c2t index from an older release is not usable — rebuild with bwa-mem3 index --meth (see Migrating from bwameth.py c2t).

Output file locations

Index files are written next to the input FASTA. The input path is taken verbatim as the prefix, so passing a FASTA by absolute path writes the index into that FASTA’s directory, wherever the command is run from:

bwa-mem3 index /data/indexes/hg38/hg38.fa
# writes hg38.fa.bwt.2bit.64, etc. into /data/indexes/hg38/

Time and memory

Index construction is multi-threaded and memory-bounded. index -t defaults to the detected core count (cgroup-aware), and --max-memory caps peak RAM at min(50% of RAM, 32 GB) by default — raise or lower it to trade memory against spill I/O. On a typical multi-core host, indexing hg38 takes a few minutes (longer if pinned to a single core).
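The default cap is simple arithmetic. The sketch below reproduces min(50% of RAM, 32 GB) for a host with a given amount of RAM. Working in whole gigabytes with integer division is an assumption made for illustration, and `default_max_memory_gb` is a hypothetical name, not the binary's own code.

```shell
# Sketch of the default cap described above: min(50% of RAM, 32 GB).
# Takes total RAM in whole GB and prints the cap in GB (integer arithmetic,
# so odd sizes round down).
default_max_memory_gb() {
    half=$(( $1 / 2 ))
    if [ "$half" -lt 32 ]; then
        echo "$half"
    else
        echo 32
    fi
}

# default_max_memory_gb 16    # 8: half of RAM is below the 32 GB ceiling
# default_max_memory_gb 256   # 32: the ceiling applies
```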

bwa-mem3 builds the suffix array with libsais, whose OpenMP-parallel, memory-bounded construction is faster and leaner than the original bwa-mem2 approach. See Performance improvements for benchmark numbers.

Warning — Do not index over a live shared-memory segment

If you have previously staged the index into shared memory with bwa-mem3 shm, drop the segment before re-indexing:
bwa-mem3 shm -d
bwa-mem3 index ref.fa
There is no staleness check. If bwa-mem3 mem finds a matching segment in shared memory it will attach to it even when the on-disk index has been updated. See Quick start: shared-memory index.

Arch flags and the index format

The FM-index format is architecture-independent. A single index works across every SIMD tier and every supported platform: the x86 binary’s AVX2 / AVX-512BW dispatch paths and the arm64 NEON binary all read the same on-disk layout.

Aligning short reads (mem)

bwa-mem3 mem aligns one or two FASTQ files against an indexed reference and writes SAM (default) or BAM (--bam) to stdout. It is a drop-in replacement for bwa-mem2 mem and supports all standard bwa-mem flags.

Basic usage

Paired-end:

bwa-mem3 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > out.sam

Single-end:

bwa-mem3 mem -t 16 ref.fa reads.fq.gz > out.sam

Pipe directly to samtools:

bwa-mem3 mem --bam=0 -t 16 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -@ 8 -o out.bam -
samtools index out.bam

Using --bam=0 (uncompressed BAM) avoids SAM text formatting on the write side and SAM parsing on the samtools side, and skips the wasted compression that samtools sort would immediately decompress; the BAM bytes flow between processes in the pipe.

Key flags

Threading: `-t`

-t INT   number of threads [1]

Performance scales well through 8–16 threads on most machines. Beyond 32 threads, returns diminish on typical workloads because inter-thread locking and I/O become the bottleneck. See Threading and resource use for detailed guidance.

Read-group header: `-R`

-R STR   read group header line, e.g. '@RG\tID:sample1\tSM:sample1\tLB:lib1\tPL:ILLUMINA'

Every production alignment should include a @RG header. The ID in the -R string is embedded as an RG:Z: tag on every output record.

Tip — Escape the tab correctly

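Separate the fields with \t and always quote the string; unquoted, the shell drops the backslash and the fields run together. Inside single quotes the shell passes the backslash and the t through unchanged. With $'...' quoting (bash, zsh, ksh) the shell turns each \t into a real tab character before bwa-mem3 sees it: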
bwa-mem3 mem -R $'@RG\tID:s1\tSM:sample1' -t 16 ref.fa R1.fq.gz R2.fq.gz
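To see what a given quoting style actually hands to the program, you can count the real tab characters in the argument. The sketch below uses printf to build the tabbed string, which yields the same bytes as the $'...' form while staying portable to plain sh. `count_tabs` is a hypothetical helper.

```shell
# Hypothetical helper: count real tab characters in a string.
count_tabs() {
    printf '%s' "$1" | tr -cd '\t' | wc -c | tr -d ' '
}

# Single quotes: the backslash and t stay as two ordinary characters.
count_tabs '@RG\tID:s1\tSM:sample1'

# Real tabs, as $'...' quoting would produce.
count_tabs "$(printf '@RG\tID:s1\tSM:sample1')"
```

The first call reports no tabs and the second reports two, one per field separator.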

Chunk size: `-K`

-K INT   process INT input bases in each batch [10000000]

Larger -K values increase memory use but can improve throughput on very deep or very wide batches. The default is appropriate for most workloads.

SAM output control: `-S`, `-P`

-S    skip mate rescue
-P    skip pairing; mate rescue performed unless -S also in use

These flags are primarily useful for debugging or non-standard workflows. Normal paired-end alignments should leave both at their defaults.

Output modes

SAM (default)

bwa-mem3 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > out.sam

Plain-text SAM. Suitable for inspection, compatibility testing, and piping to tools that consume SAM.

BAM (`--bam=0`)

bwa-mem3 mem --bam=0 -t 16 ref.fa R1.fq.gz R2.fq.gz > out.bam

Writes BAM directly. --bam=0 is uncompressed BAM, which avoids double-compression when piping into a downstream sorter and is roughly 10–15% faster end-to-end. Pass --bam=6 to write a fully compressed BAM if the output is the final product.

Note — --bam=0 is the recommended output mode

For production pipelines, always use --bam=0 and pipe to samtools sort. See Best Practices: output format for the canonical pipeline.

Methylation alignment (`--meth`)

Pass --meth for bisulfite (WGBS/RRBS) or EM-seq samples. This activates inline read conversion (C→T for R1, G→A for R2), bwameth.py-compatible flag defaults, and inline BAM post-processing. See Quick start: methylation alignment for the two-command workflow and the Methylation Reference for full detail.

Shared-memory index auto-attach

When bwa-mem3 shm has staged the index into shared memory, bwa-mem3 mem attaches automatically — no extra flag is required. The shared-memory path is transparent to users.

Cross-references

The full flag list is in the CLI Reference: mem page.

Output: SAM/BAM, headers, tags

bwa-mem3 writes output in either SAM (default) or BAM (--bam) format. This page covers the header structure and every non-standard SAM tag emitted by bwa-mem3.

Output format

By default, bwa-mem3 mem writes SAM to stdout. Pass --bam (or --bam=N for a specific compression level) to write BAM. Level 0 (uncompressed) is the default when --bam is given without an argument, which is optimal when piping to a downstream samtools sort.

# SAM (default)
bwa-mem3 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > out.sam

# Uncompressed BAM — best for piping
bwa-mem3 mem --bam -t 16 ref.fa R1.fq.gz R2.fq.gz | samtools sort -@ 8 -o out.bam -

# Compressed BAM — useful when the output is the final file
bwa-mem3 mem --bam=6 -t 16 ref.fa R1.fq.gz R2.fq.gz > out.bam

SAM header

`@HD`

A default @HD VN:1.6 SO:unsorted line is emitted unless the user supplies one via -H. The sort order is unsorted because bwa-mem3 writes records in input read order; downstream sorting is always a separate step.

`@SQ`

One @SQ line is written per reference sequence, with the sequence name (SN:) and length (LN:) derived from the FM-index. If the index was built with a .dict or .hdr file that supplies @SQ records, those records are used instead of the auto-generated ones.

In methylation mode (--meth), the doubled reference contains sequences with an f or r prefix in their names. The inline BAM post-processor collapses these back to canonical chromosome names so that the output @SQ lines match a standard non-methylation alignment. See Chimera QC and header rewriting.

`@PG`

One @PG entry is written in standard mode:

ID	Description
`bwa-mem3`	The alignment step. `VN:` is the bwa-mem3 version string; `CL:` is the full command line.

In methylation mode (--meth), a second @PG entry is appended:

ID	Description
`bwa-mem3-meth`	The inline post-processor. `VN:` carries the version with `-meth` suffix; `CL:` is the full command line.

The bwa-mem3-meth entry follows immediately after the bwa-mem3 entry and records the post-processing step as a distinct pipeline node, matching the convention of separate-tool pipelines.

Tags emitted by bwa-mem3

Standard tags

bwa-mem3 emits the same standard tags as bwa-mem2 (NM:i, MD:Z, AS:i, XS:i, SA:Z, RG:Z, XA:Z, MC:Z, etc.). These are documented in the SAM specification and are not described further here.

bwa-mem3 additionally emits MQ:i on paired-end records — the mate’s mapping quality, set alongside MC:Z (the mate’s CIGAR) so callers that key off the mate’s MAPQ don’t need to look at the mate record. Both SAM and --bam output paths emit it. Backported from lh3/bwa PR #330 in fg-labs PR #35.

The XA:Z field set widens from chr,pos,CIGAR,NM to chr,pos,CIGAR,NM,score,mapq when -u (a.k.a. the upstream “XB” toggle) is passed; the tag name itself remains XA:Z for downstream compatibility. Tools that parse XA:Z need to be aware of the two possible field widths.
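A parser can tell the two layouts apart by counting the comma-separated fields of one hit. The following is a minimal sketch, assuming the usual XA:Z layout of semicolon-terminated hits; the helper name xa_field_width and the sample values are illustrative and not part of bwa-mem3.

```shell
# Hypothetical helper: report how many comma-separated fields the first hit
# of an XA:Z value carries (4 = chr,pos,CIGAR,NM; 6 = ...,score,mapq under -u).
xa_field_width() (
  set -f                  # no globbing while the entry is word-split
  entry=${1#XA:Z:}        # drop the tag prefix if present
  entry=${entry%%;*}      # keep only the first hit
  IFS=,
  set -- $entry           # split the hit on commas
  echo "$#"
)

xa_field_width 'XA:Z:chr1,+10468,100M,1;'        # narrow layout
xa_field_width 'XA:Z:chr1,+10468,100M,1,95,0;'   # wide layout (-u)
```

A downstream tool could branch on the returned width instead of assuming a fixed number of fields.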

`HN:i` — total alignment hit count

HN:i:<count>

The total number of alignment hits (both reported and suppressed) that the aligner found for this read, counted before the -h cap on reported hits is applied. Useful for distinguishing “uniquely mapped” from “multi-mapped” reads without relying solely on MAPQ.

HN:i is emitted on the primary alignment record only.

Methylation-only tags

The following Bismark-compatible tags are emitted only when --meth is active. See SAM tags: XR, XG, XM for the full per-tag reference, including the XM:Z character alphabet and the XG:Z strand-pick semantics.

Tag	Type	Description
`XR:Z`	string	Read conversion direction: `CT` (R1 / SE) or `GA` (R2)
`XG:Z`	string	Genome strand of the alignment: `CT` (OT) or `GA` (OB)
`XM:Z`	string	Per-base methylation call string (length = `SEQ`)

The bwameth-style YS:Z / YC:Z tags exist only as an internal carrier on bseq1_t.comment for SEQ restoration and XR:Z derivation; they are suppressed at BAM emit and never appear in output. The bwameth YD:Z strand tag has been replaced by Bismark XG:Z and is not emitted.

MAPQ semantics

MAPQ semantics are inherited from bwa-mem2 and follow the same scoring model. In methylation mode, alignments identified as chimeras (longest M/=/X run covering less than 44% of the read length) have their MAPQ capped at 1 and the 0x200 (QC fail) flag set. See Chimera QC and header rewriting.
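The 44% threshold can be checked with integer arithmetic. This is only an illustrative sketch of the rule as stated above: the function name is_chimera is hypothetical, and how the aligner itself measures the longest M/=/X run is not shown here.

```shell
# Illustrative only: apply the 44% rule to a longest-run length and a read length.
# Integer form of the comparison: run/len < 0.44  <=>  run*100 < len*44.
is_chimera() {
  longest_run=$1
  read_len=$2
  [ $((longest_run * 100)) -lt $((read_len * 44)) ]
}

if is_chimera 60 150; then echo "chimera: MAPQ capped at 1, 0x200 set"; fi
if ! is_chimera 70 150; then echo "not a chimera"; fi
```

For a 150 bp read the cut-off sits at 66 bases: a longest run of 65 is flagged, a run of 66 is not.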

Threading and resource use

The `-t` flag

-t INT   number of threads [1]

bwa-mem3 parallelizes alignment by dividing the input into batches (sized by -K, or by chunk_size × n_threads when -K is not given) and processing batches concurrently. Threads share the in-memory FM-index; there is no per-thread copy.

How threads interact with performance

Where threads help

Seed finding (SMEM enumeration) is fully parallel across reads in a batch.
Extension (banded Smith-Waterman) is fully parallel.
Pair rescue is parallel.
BAM encoding (when --bam is active) is parallel.

Where threads stop helping

Wall-clock alignment time scales well with thread count up to approximately 16–32 threads on a modern CPU. Beyond that, several effects flatten the curve:

FM-index bandwidth. The resident index for hg38 is ~15 GB and does not fit in the L3 cache of any current server. At high thread counts, threads contend for memory bandwidth accessing the BWT.
IO contention. On spinning disk or a shared network filesystem, concurrent reads of the same large index file saturate IO bandwidth before the CPU is saturated.
Output serialization. SAM output is serialized per-record to stdout. BAM output with --bam reduces this bottleneck but does not eliminate it entirely.

Recommended thread counts

Machine	Recommended `-t`	Notes
16-core workstation	12–14	Leave 2–4 cores for `samtools sort`
32-core server	24–28	Leave cores for downstream and OS overhead
64-core server	40–48	Marginal returns above 48; test with your workload
Multiple parallel runs	divide evenly	See below

These are starting points. Profile with your specific data and storage configuration to find the practical optimum.

Running multiple parallel alignments

When running multiple bwa-mem3 mem processes on the same machine, divide threads so that the total does not exceed the physical core count. For example, on a 32-core machine running four concurrent samples:

# Four parallel runs, 8 threads each
for sample in a b c d; do
  bwa-mem3 mem --bam -t 8 ref.fa ${sample}_R1.fq.gz ${sample}_R2.fq.gz \
    | samtools sort -@ 2 -o ${sample}.bam - &
done
wait

Using shared memory (bwa-mem3 shm) amortizes the index read-in cost across all four runs. See Quick start: shared-memory index and Best Practices: multi-sample workflows.
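The thread split for concurrent runs is simple division. A minimal sketch, assuming you also want the option of reserving a few threads per job for the sorter; the helper name threads_per_job and the reserve argument are illustrative and not part of bwa-mem3.

```shell
# Hypothetical helper: threads for each bwa-mem3 process when several jobs
# share one host. The optional third argument reserves threads per job
# (for example for `samtools sort`).
threads_per_job() {
  cores=$1; jobs=$2; reserve=${3:-0}
  per_job=$(( cores / jobs - reserve ))
  [ "$per_job" -ge 1 ] || per_job=1     # never drop below one thread
  echo "$per_job"
}

threads_per_job 32 4      # 8, the split used in the loop above
threads_per_job 32 4 2    # 6, leaving 2 threads per job for the sorter
```

Reserving sorter threads keeps the combined thread count of all pipes within the physical core count.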

Memory use

Peak RAM is the resident index (~15 GB for hg38, ~22 GB under --meth) plus a per-batch working set that scales with the effective batch size (chunk_size × n_threads by default), so it grows with -t unless -K fixes the batch size. The per-batch term is what tips memory-constrained or wide-window (e.g. Hi-C) runs into OOM; bwa-mem3 shm lets concurrent processes share one physical copy of the index. For the full budgeting model and the -K/-t interaction, see Memory budgeting and data-type tuning.

IO recommendations

Use local NVMe storage for the index files when possible. The ~11 GB index read is the dominant IO event at the start of each mem run.
Write BAM (--bam) to a fast local disk or pipe directly to samtools sort. Avoid writing uncompressed SAM to a network filesystem.
Separate read and write paths if your storage topology allows it: read the index from one volume and write sorted BAM to another.

Memory allocator (mimalloc)

bwa-mem3 vendors and links mimalloc, Microsoft’s high-performance memory allocator, into every binary by default. On multi-threaded alignment workloads, mimalloc reduces wall-clock time by replacing the system allocator with one optimized for many small, short-lived allocations — exactly the access pattern produced by the inner alignment loops.

What mimalloc replaces

The system allocator (glibc malloc on Linux, libSystem malloc on macOS) is a general-purpose allocator that relies on locks shared between threads. Under heavy multi-threaded allocation pressure (16+ threads each issuing thousands of short-lived allocations per batch), contention on those locks becomes a measurable bottleneck. mimalloc uses per-thread free lists and a segment-based heap to eliminate most of this contention.

Platform-specific linkage

The linkage strategy differs by OS:

Platform	Mechanism
Linux	Static linkage with `--whole-archive`. The entire mimalloc static library is embedded into the `bwa-mem3` binary; its `malloc`/`free` symbols take precedence over `glibc`’s at link time.
macOS	Dynamic linkage via dyld interposing. `libmimalloc.dylib` is built alongside the binary; dyld’s `DYLD_INSERT_LIBRARIES` interposing mechanism replaces `malloc`/`free` at load time. The dylib ships next to the binary.

Warning — macOS: keep libmimalloc.dylib next to the binary

On macOS, libmimalloc.dylib must remain in the same directory as the bwa-mem3 binary (or be reachable via the embedded rpath). The binary carries a hard dynamic dependency on libmimalloc.dylib; if you move bwa-mem3 without it, dyld can no longer resolve the library and the binary fails to launch (dyld: Library not loaded) rather than silently running on the system allocator. Whenever mimalloc is linked and loaded, bwa-mem3 version reports its status explicitly — see Verifying that mimalloc is active below.

Verifying that mimalloc is active

Run:

./bwa-mem3 version

The version output always carries a mimalloc line whenever the library is linked, with a status suffix reporting whether it is actually intercepting the standard allocator:

mimalloc 3.x.x (active)

(active) means a standard malloc allocation was routed to mimalloc — the allocator is doing its job. If instead you see:

mimalloc 3.x.x (linked but NOT overriding malloc)

then a libmimalloc that exports only the mi_* API is linked — e.g. a distro/conda libmimalloc built without the malloc override — and every real malloc/free is still going to the system allocator. If the mimalloc line is absent entirely, the binary was built with USE_MIMALLOC=0.

The status is probed at runtime by allocating through the standard malloc and asking mimalloc whether the pointer lives in one of its heap regions, so it reflects the allocator that is genuinely in effect — not merely what was linked.
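In a pipeline preflight the status line can be checked mechanically. The following is a sketch under two assumptions: that bwa-mem3 version prints to stdout, and that the status line begins with "mimalloc " as in the examples above. The helper name mimalloc_status is hypothetical.

```shell
# Hypothetical helper: classify the allocator status from `bwa-mem3 version`
# output supplied on stdin.
mimalloc_status() {
  line=$(grep '^mimalloc ' || true)
  case $line in
    *'(active)'*)        echo active ;;
    *'NOT overriding'*)  echo not-overriding ;;
    '')                  echo absent ;;        # built with USE_MIMALLOC=0
    *)                   echo unknown ;;
  esac
}

# Intended use: ./bwa-mem3 version | mimalloc_status
printf 'mimalloc 3.x.x (active)\n' | mimalloc_status
printf 'mimalloc 3.x.x (linked but NOT overriding malloc)\n' | mimalloc_status
```

A deployment script could refuse to start a production run unless the helper reports active.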

Opting out

Pass USE_MIMALLOC=0 at build time to produce a binary linked against the system allocator:

make USE_MIMALLOC=0

Reasons to opt out:

AddressSanitizer (ASAN) builds. The Makefile automatically sets USE_MIMALLOC=0 when ASAN_FLAGS is detected, because ASAN and mimalloc’s malloc interposing cannot coexist cleanly.
Container environments where distributing a dylib alongside the binary is inconvenient.
Reproducibility testing to isolate whether a behavioral difference is allocator-related.

Note — Default is on

USE_MIMALLOC=1 is the default. Opt-out is not recommended for production workloads — mimalloc measurably reduces wall time on multi-threaded runs.

Build internals

The mimalloc source lives in ext/mimalloc/ as a git submodule. The Makefile target builds it via CMake before linking bwa-mem3. The relevant Makefile variables are MIMALLOC_SRC, MIMALLOC_BUILD, and MIMALLOC_LIB.

The feature was introduced in bwa-mem3 as part of the performance improvement work. See Features and Build & infrastructure for the PR history.

Memory budgeting and data-type tuning

Most short-read alignments run comfortably with default settings. A handful of situations push memory hard enough to matter: tight memory caps (containers, cgroups, schedulers), very high thread counts, and data types with unusually wide insert-size distributions such as Hi-C. This page explains how peak memory is built up, how to fit a run inside a fixed cap, and how to align Hi-C and similar data without wasting memory.

How peak memory is built up

Peak resident memory during bwa-mem3 mem is the sum of two parts:

peak RSS  ≈  resident index  +  per-batch working set

Resident index. The FM-index, packed reference (.pac), and related structures are loaded once and shared across all threads (there is no per-thread copy). For hg38 the resident index baseline is roughly 15 GB. This is fixed for a given reference and does not change with -t or -K.

mem reconstructs the reference bases it needs for scoring and extension directly from the packed .pac on demand (pac-fetch), so the unpacked .0123 reference (~6.4 GB on hg38) is neither loaded nor required on disk: the .pac already holds the same bases at one-quarter the size. This is the only reference path, and its alignments are byte-for-byte identical to those produced by the historical .0123 load. On hg38 (5M read pairs, -t 16) pac-fetch lowered peak RSS by ~6.2 GB for a plain alignment and ~6.3 GB for --meth, at neutral-to-slightly-faster wall time.

Per-batch working set. On top of the index, each in-flight batch holds the reads, their seeds, candidate alignment regions, and the reference windows used by mate rescue. This scales with the effective batch size (below) and with the data: longer reference windows — for example wide mate-rescue windows — make it larger.

Methylation (`--meth`) mode

--meth loads a dual index: the doubled seed FM-index for seeding plus the original reference’s packed .pac for scoring/extension. The seed FM-index is roughly twice the size of a plain FM-index (its contigs are doubled), so the resident index for hg38 is on the order of 22 GB (seed FM-index ~21 GB + original .pac ~1 GB), versus ~15 GB for a plain alignment.

As with plain alignment, the original reference’s bases are pac-fetched from its .pac — the original unpacked .0123 (~6.4 GB) is neither built nor loaded. The seed index’s own unpacked .0123 (~13 GB) and packed .pac (~1.6 GB) are likewise neither built nor loaded — mem --meth extends against the original reference, never the seed, so the seed’s bases are never read. (Earlier --meth builds loaded the original .0123 plus both seed files, costing ~20 GB of extra RSS in total.) Under bwa-mem3 shm --meth the staged seed segment is likewise seed-only, omitting the seed PAC and .0123. The per-batch levers below (-K, -t) apply unchanged.

Effective batch size: `-K` and the `-t` multiplier

The single most important knob for per-batch memory is the effective batch size, and it is easy to misjudge because of how it interacts with -t:

no -K     effective batch = chunk_size × n_threads   (chunk_size default 10,000,000)
-K INT    effective batch = INT                       (fixed, regardless of n_threads)

So with the default chunk_size and -t 16, each batch is 10,000,000 × 16 = 160,000,000 bases — not 10 M. Setting -K 1000000 makes the batch a fixed 1 M bases regardless of thread count: in that example a 160× reduction in the per-batch working set, not 10×.

Two consequences worth remembering:

Raising -t raises peak memory by default, because the batch grows with the thread count. If you scale threads up on a memory-constrained host, scale -K down to compensate.
-K also makes output deterministic across different -t values (see -K in the mem reference), which is why it is useful for reproducibility independent of memory.
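The rule above is easy to tabulate before launching a run. A minimal sketch using the documented default chunk_size of 10,000,000; the helper name effective_batch is illustrative.

```shell
# Illustrative: effective batch size in bases, following the rule above.
# Usage: effective_batch N_THREADS [K]
effective_batch() {
  n_threads=$1
  k=${2:-}
  chunk_size=10000000                   # documented default
  if [ -n "$k" ]; then
    echo "$k"                           # -K fixes the batch regardless of -t
  else
    echo $(( chunk_size * n_threads ))  # default: grows with the thread count
  fi
}

effective_batch 16            # 160000000
effective_batch 16 1000000    # 1000000
echo "$(( $(effective_batch 16) / $(effective_batch 16 1000000) ))x reduction"   # 160x reduction
```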

Fitting a run inside a memory cap

Under a hard cap (a --memory cgroup limit, a Slurm --mem, a container limit), budget like this:

per-batch budget  ≈  memory cap  −  resident index  −  headroom

For hg38 under a 22 GB cap, that is roughly 22 − 15 − a couple GB ≈ a few GB for the per-batch working set — not much. The levers, in order of preference:

Lower -K. -K 1000000 keeps the per-batch working set well under 1 GB for typical short reads and costs little throughput. This is the first thing to try.
Lower -t. Fewer threads shrink the default batch (since it is chunk_size × n_threads) and reduce concurrency-related overhead. Prefer adjusting -K first so you keep your cores busy.
Use bwa-mem3 shm if you run several samples on one host: the index is mapped from a shared segment and its pages are shared across processes, so the ~15 GB index is paid once rather than per process. See Quick start: shared-memory index.
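The budget line can be evaluated per job before submission. A rough sketch in whole GB; the helper name per_batch_budget_gb and the 2 GB default headroom are assumptions for illustration, not measured values.

```shell
# Illustrative budget in whole GB: cap - resident index - headroom.
per_batch_budget_gb() {
  cap=$1; index=$2; headroom=${3:-2}   # 2 GB headroom is an assumed default
  echo $(( cap - index - headroom ))
}

per_batch_budget_gb 22 15      # plain hg38 under a 22 GB cap: 5
per_batch_budget_gb 22 22      # --meth index under the same cap: negative, does not fit
```

A zero or negative result means no -K setting can help: the cap itself has to rise, or the index has to be shared through bwa-mem3 shm.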

Tip — a silent OOM looks like a truncated BAM

When the kernel OOM-kills bwa-mem3 (for example, a process exceeding its cgroup limit), it receives SIGKILL with no error message. A downstream samtools in the same pipe may still exit 0, leaving a header-only or truncated BAM that masquerades as success. Always run alignment pipes under set -o pipefail and verify the output BAM is non-empty (e.g. a record count or samtools quickcheck) before treating a run as complete.
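The effect of pipefail can be reproduced without an aligner. This bash sketch uses a stand-in process that kills itself with SIGKILL in place of an OOM-killed bwa-mem3, and cat in place of samtools; both stand-ins and the function name simulate_oom_kill are illustrative.

```shell
# Requires bash: pipefail is not available in every POSIX sh.
# Stand-in for an aligner that the kernel OOM-kills mid-run (SIGKILL, no message).
simulate_oom_kill() { bash -c 'kill -KILL $$'; }

# Default behaviour: the pipeline's status is that of the last command only.
status_without=$( (set +o pipefail; simulate_oom_kill | cat >/dev/null) && echo 0 || echo $? )

# With pipefail: a failure anywhere in the pipeline is reported.
status_with=$( (set -o pipefail; simulate_oom_kill | cat >/dev/null) && echo 0 || echo $? )

echo "without pipefail: exit $status_without"
echo "with pipefail:    exit $status_with"
```

Without pipefail the pipe reports 0 even though the first stage died; with it the pipe reports 137 (128 + signal 9), which a wrapper script or scheduler can act on.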

Hi-C and other wide-insert data

Mate rescue searches a reference window whose width is derived from the observed insert-size distribution (roughly the inter-quartile range of inferred insert sizes, scaled out by several multiples). For standard paired-end libraries this window is a few hundred bases. For Hi-C (and capture-C / 3C-style data) the “insert size” between mates is not a library property at all — mates are ligation contacts that can sit anywhere in the genome — so the inferred distribution is enormously wide and the per-attempt rescue window balloons to tens of kilobases. The result is both wasted compute and a large per-batch working set, which is a common cause of OOM on Hi-C.

The fix is to align Hi-C the way Hi-C pipelines expect, with -5SP:

bwa-mem3 mem -5SP -t 16 --bam ref.fa hic_R1.fq.gz hic_R2.fq.gz \
  | samtools sort -n -@ 2 -o hic.namesorted.bam -

-S skips mate rescue — this removes the wide rescue windows entirely and is the bulk of the memory and time saving.
-P skips pairing, which is meaningless for Hi-C contacts.
-5 marks the alignment with the smallest coordinate as primary for split reads, the convention expected by Hi-C tools (Juicer, Arima, pairtools, etc.).

-5SP is the standard Hi-C invocation for the bwa mem family; it is the correct mode for Hi-C, not merely a memory workaround. Because it changes which alignments are reported, apply it consistently across any runs you intend to compare. Hi-C output is typically name-sorted (samtools sort -n) for the downstream contact caller rather than coordinate-sorted.

For other data with genuinely large but real inserts (e.g. some long-fragment or mate-pair libraries), mate rescue is still meaningful — keep it, but lower -K to bound the per-batch working set rather than disabling rescue.

Tips and best practices

Quick operational pointers for running bwa-mem3. Each links to the page with the full rationale.

Index once, align many. The on-disk index is stable across bwa-mem3 releases and across every SIMD tier in the binary; you do not re-index when upgrading unless the release notes say so. See Indexing the reference.
Pipe --bam straight into samtools sort — never write an intermediate unsorted BAM. See Best Practices → Output format.
Stage the index in shared memory for batch workloads (bwa-mem3 shm), and always shm -d before re-indexing — there is no staleness check. See Quick start: shared-memory index and Anti-patterns.
Divide threads explicitly across concurrent jobs so the total stays at or below the physical core count. See Threading and resource use.
Confirm the resolved SIMD tier with bwa-mem3 version (or BWAMEM3_DEBUG_SIMD=1); one binary carries every tier and selects at startup. See Performance → SIMD dispatch matrix.
Always pass a read-group (-R with at least ID: and SM:) — GATK, fgbio, and Picard require an @RG header. See Aligning short reads.

For an impact-ordered tuning walkthrough, see Best Practices → Optimization checklist.

Performance Overview

Performance claims in this section are benchmarked, not asserted. The canonical source of truth for benchmark methodology, hardware configurations, and current numbers is bwa-mem3-bench, a reproducible benchmarking harness that runs across AWS Batch architectures (x86 AVX2, AVX-512, ARM Graviton). Consult that repository before drawing conclusions from isolated anecdotal timings.

What drives bwa-mem3’s performance

There is no single “the speedup.” bwa-mem3 inherits bwa-mem2’s SIMD-vectorized core and layers on a series of independent improvements, each targeting a different bottleneck in the align pipeline. How much any one of them helps — and therefore the total — depends heavily on the workload: read length, error rate, reference size, thread count, CPU architecture (AVX2 vs AVX-512 vs NEON), and whether the index is cold or warm in the page cache. A short-read whole-genome run is dominated by seeding and FM-index walks; a long-read or high-error run spends most of its cycles in the Smith–Waterman kernels; a many-sample run can be bottlenecked on header ingestion or decompression. The drivers below group by what part of the machine or algorithm they fix. For real, reproducible numbers on specific hardware, always defer to bwa-mem3-bench rather than any single anecdote here.

For the full per-change list with PR links and status, see What’s Different — Performance improvements.

1. Getting more out of the machine (SIMD / microarchitecture)

The alignment kernels (kswv mate-rescue, bandedSWA) are compiled for the widest SIMD ISA the host supports — SSE4.1 through AVX-512BW on x86, native NEON on ARM — and selected at runtime by an in-process dispatcher (a single binary that picks the right kernel per host, #83) rather than the older multi-binary execv launcher.

Native NEON kernels replaced the sse2neon translation shim in the two hottest kernels on ARM, worth roughly 10% additional throughput on Apple Silicon over the pure-translation baseline.
AVX2 as the x86 baseline (#84): restoring the non-kernel translation units to -mavx2 recovered ~+15% user time on wgs-5M / ~+11% on wes-5M that had been lost when the baseline briefly dropped to SSE4.1 — hot non-kernel paths (chain extension, FM-index BWT walks, mate scoring) auto-vectorize at 256-bit width again.
Capping AVX-512BW auto-vectorization at 256-bit (#86) avoids the frequency downclock that wide 512-bit auto-vec code triggers on some x86 parts, where the wider vectors cost more in clock than they return.
Per-strip L1 prefetches added to the 8-/16-bit kswv kernels that lacked them (#70, bringing them in line with kswv512_16) stop the inner SW loop from stalling on first-touch L1 misses.
A recovered 8-bit banded SW path for reads ≥128 bp (#140 and follow-ups) keeps long-read alignment in the cheaper 8-bit lane width where it is valid.
Per-ISA SW kernel hand-tuning. A NEON pass (#160) replaced the multi-instruction sse2neon “all lanes zero” tests in the hot band-narrowing scans with a single vmaxvq horizontal reduction; an AVX2 pass (#161) relieved the port-5 vpblendvb/vpshufb bottleneck that ceilings the inner loop on Zen3. Both are byte-identical to the prior output.
AVX2 16-bit mate rescue (#162): AVX2-only hosts (e.g. Zen3, which run the AVX2 default) previously fell back to scalar ksw_align2 for 16-bit mate rescue because only NEON and AVX-512 had a batched 16-bit kswv; kswv256_16 closes that gap.

See the SIMD dispatch matrix for the full ISA picture.

2. Fixing bad patterns in the hot path (algorithmic rewrites)

Several gains come simply from rewriting code that did more work than it needed to — the “rewrites wind up fixing bad patterns” effect — without changing where reads map.

Lockstep SMEM batching (#33, widened from 8 to 16 reads in #75): advances several reads’ seed walks in interleaved order so the out-of-order engine overlaps the random cp_occ checkpoint-array cache misses of one read with the compute of another. Measured ~−6% wall time on 150 bp WGS, with the hot cp_occ load share dropping from 65.5% to 53.3% of seeding time.
O(n²) → O(n) header ingestion (#49): the -H path re-ran strlen + realloc per line; batching the read took a ~70 MB / 1.5 M-line header from >10 minutes to under a second before alignment even starts.
Closed-form ungapped scoring when there are no mismatches (#77) replaces a per-base Kadane-style walk with a direct store.
On-stack sort buffers for small arrays (#78) removes a per-read malloc/free that was dominating the sort of the typical 5–30-element alignment-region arrays.
Inlining backwardExt (#88) eliminates a struct-by-value ABI pass that gcc 12+ could not optimize away, recovering (and beating) the older compiler’s baseline.
pdqsort with stable tie-breaks at the dedup-patch sort sites (#123).
The consolidated mapping-speedup audit (#58) bundles tuning across the ksw2 band loop, SMEM batching, suffix-array lookup prefetch, and SAM record building — historically the single largest wall-time step in main.

3. Indexing

bwa-mem3 index builds the FM-index with the linear-time libsais library (#57) instead of the older sais-lite, cutting both wall time and peak memory while producing a byte-identical index (existing indexes need no rebuild). Construction also skips the wasted zero-initialization of unpack and suffix-array buffers (#80), which on a doubled-human input avoided tens of GiB of write-then-overwrite zero-fill, and right-sizes the SA-entry staging buffers to the actual write count rather than the uncapped SA-interval sum (#157).

4. Memory allocation and I/O

mimalloc, vendored and statically linked by default (#19), replaces the system allocator and avoids glibc ptmalloc’s lock contention at high thread counts — a consistent multi-threaded throughput win. See User Guide — Memory allocator.
Faster read ingestion: a content-detecting FASTQ fast path over libdeflate BGZF (#128, merged) cuts the cost of decompressing and parsing input, which matters most when the aligner would otherwise be I/O- or parse-bound. A vendored zlib-ng inflate path with a third pipeline worker (#153, merged) extends the same idea: it speeds up the read stage ~2.2× on gzip input and, because the serial read stage becomes a larger share of the wall time as the compute stage parallelizes, reduces end-to-end wall time by up to 7.8% at 96 cores. The accompanying --profile stage-timing mode (#152) is what attributes wall time across the read‖proc‖write stages to surface exactly this effect.

5. Build-time optimization (PGO)

The build provides opt-in make pgo-generate / make pgo-use targets that recompile with branch-probability and call-frequency profiles gathered from a representative workload — ~3% on Apple Silicon, workload-dependent on x86. PGO is not applied to the default make output. See PGO build.

Reference numbers across architectures

Wall-time medians from bwa-mem3-bench at SHA a02fcb4 (2026-06-20), 5 reps per cell, t≈16, hg38, paired-end 150 bp:

sample	c6a (x86-64, AVX2, Zen3)	c7a (x86-64, AVX-512, Zen4)	c7i (x86-64, AVX-512, SPR)	c7g (arm64, NEON, Graviton3)	c8g (arm64, NEON, Graviton4)
wgs-5M	131.63 s	92.66 s	112.98 s	154.27 s	127.93 s
wes-5M	76.57 s	56.53 s	67.66 s	79.67 s	65.26 s
panel-twist-5M	150.84 s	102.72 s	156.42 s	181.37 s	148.50 s

Concordance vs upstream bwa-mem2 v2.2.1 on these cells, measured over primary-alignment records: wgs-5M 99.9893%, wes-5M 99.9996%, panel-twist-5M 99.9414%. bwa-mem3 is intentionally not byte-identical to bwa-mem2 — the residual differences are additive SAM tags, per-architecture SIMD score2/MAPQ convergence, deterministic tie-breaks, and a small number of additional supplementary alignments; see Equivalence with bwa-mem2 for the full audited breakdown. NEON-vs-x86 cross-architecture concordance on the same builds remains 100.0000% (the ARM and x86 fg-labs builds produce identical records). Spot-pool noise envelope (rep-to-rep CV) for this run: ~1–2% on c7a / c7g / c8g, ~1–5% on c6a, ~10–16% on c7i — the c7i medians in particular carry wide error bars and should be read as directional. See the bench repo for the methodology, the full per-rep table, and noisier instance classes (e.g. m7i) excluded from this summary.

Release-to-release speedups are deliberately uneven across this grid. A workload’s gain scales with the share of its wall time spent in the Smith–Waterman kernels — highest on wgs-5M (~85% of cycles), lowest on the seed/IO-bound wes-5M, and intermediate on panel-twist-5M, whose deep target coverage produces many split alignments that each re-enter SW. That per-workload factor is multiplied by how heavily a given release retuned the host’s per-ISA SW kernel: v0.3.0 concentrated on the NEON (#160, #166) and AVX2 (#161, #162) kernels, so Graviton and AVX2 hosts moved more than the already-tuned AVX-512BW path. See SIMD dispatch for how the per-host kernel is selected.

Benchmarking responsibly

Alignment throughput is sensitive to read length, error rate, reference size, thread count, CPU architecture, NUMA topology, and whether the index is cold (read from disk) or warm (already in the kernel page cache). The bwa-mem3-bench harness controls for these variables by running standardized workloads on defined instance types. If you need numbers for a procurement or publication decision, run the harness against your target hardware.

SIMD Dispatch Matrix

bwa-mem3 ships one binary per platform. The x86 binary contains compiled kernels for every supported SIMD tier (sse41 / sse42 / avx / avx2 / avx512bw) and dispatches in process at startup. The arm64 binary contains a single NEON kernel path. There are no bwa-mem3.<tier> companion files on disk and no launcher binary.

Dispatch flowchart

flowchart TD
    A[bwa-mem3 mem starts] --> B{Platform?}
    B -- ARM / aarch64 --> C[NEON kernel TU, no dispatch]
    B -- x86 --> D[bwamem3_simd_init in src/simd_dispatch.cpp]
    D --> E[__builtin_cpu_supports]
    E --> F{Host capability?}
    F -- AVX-512BW --> G1[g_tier = avx512bw]
    F -- AVX2 --> G2[g_tier = avx2]
    F -- AVX --> G3[g_tier = avx]
    F -- SSE4.2 --> G4[g_tier = sse42]
    F -- SSE4.1 --> G5[g_tier = sse41]
    F -- below build floor --> H[exit(2): host below SIMD floor]
    G1 & G2 & G3 & G4 & G5 --> I[Per-kernel factory selects matching tier]

Tier detection runs once during main(). Subsequent kernel calls pay a single indirect-call hop through a factory vtable (or an extern "C" wrapper for free-function ksw_* kernels) — about 0.3 ns per call after BTB warm-up, well below run-to-run noise on the bwa-mem3-bench corpus.

If the host CPU does not meet the build’s compile-time SIMD floor (BASELINE_ARCH, default avx2 since PR #84), the binary exits with code 2 and an [E::bwamem3] message naming the gap before any alignment work runs. bwa-mem3 version, --help, and -h are exempt and always succeed so operators can introspect a binary on a host that cannot run alignment. See Host requirements.
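The selection order in the flowchart can be modelled in a few lines. This is an illustrative model only, not the real dispatcher, which runs in C++ via __builtin_cpu_supports. It assumes the flag spellings of the Linux /proc/cpuinfo "flags" line and, for simplicity, treats sse41 as the floor, whereas a default build's floor is avx2. The function name pick_tier is hypothetical.

```shell
# Illustrative model of the flowchart: pick the widest tier present in a
# space-separated CPU flag list (spelled as in /proc/cpuinfo).
pick_tier() {
  case " $1 " in
    *" avx512bw "*) echo avx512bw ;;
    *" avx2 "*)     echo avx2 ;;
    *" avx "*)      echo avx ;;
    *" sse4_2 "*)   echo sse42 ;;
    *" sse4_1 "*)   echo sse41 ;;
    *)              echo "host below SIMD floor" >&2; return 2 ;;
  esac
}

pick_tier "sse4_1 sse4_2 avx avx2 avx512f avx512bw"   # avx512bw
pick_tier "sse4_1 sse4_2 avx avx2"                    # avx2
pick_tier "sse2 ssse3" || echo "exit code $?"         # exit code 2
```

The patterns are padded with spaces so that avx does not match inside avx2 or avx512bw.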

`BASELINE_ARCH` and the kernel tiers

The build targets (make, make arch=…, make BASELINE_ARCH=…, make arm64) are documented in Best Practices → Build. What matters for dispatch is BASELINE_ARCH:

BASELINE_ARCH controls the tier at which non-kernel translation units compile. The hand-tuned kernel TUs in KERNEL_SRCS (bandedSWA, kswv, ksw, sam_encode) are always compiled at every supported tier and dispatched at runtime, so a build at BASELINE_ARCH=avx2 still uses the AVX-512BW kernels on AVX-512BW hosts. The non-kernel TUs are not auto-vectorized above BASELINE_ARCH, which is the trade-off — see BASELINE_ARCH=avx512bw build flag for the empirical perf characterization.

Supported x86 tiers (minimum CPU for each tier’s kernel path):

Tier	Arch flags	Minimum CPU
`sse41`	`-msse4.1`	Penryn (2007) / Bulldozer (2011)
`sse42`	`-msse4.2`	Nehalem (2008) / Bulldozer (2011)
`avx`	`-mavx`	Sandy Bridge (2011) / Bulldozer (2011)
`avx2`	`-mavx2`	Haswell (2013) / Excavator (2015)
`avx512bw`	`-mavx512f -mavx512bw -mprefer-vector-width=256`	Skylake-X (2017) / Zen 4 (2022)

For arm64 builds:

Binary	Arch flags	Platform
`bwa-mem3` (arm64)	`-DAPPLE_SILICON=1` + native NEON / sse2neon shim	Any aarch64 / Apple Silicon

Kernel vectorization coverage

Kernel	SSE4.1	SSE4.2	AVX	AVX2	AVX-512BW	NEON (arm64)
`kswv` (vectorized Smith-Waterman)	8-wide int16	8-wide int16	8-wide int16	16-wide int16	32-wide int16	8-wide int16 (native)
`bandedSWA` (banded alignment / mate-rescue)	vectorized	vectorized	vectorized	vectorized	vectorized	native NEON blendv
`ksw_*` (SW extension free functions)	per-tier	per-tier	per-tier	per-tier	per-tier	per-tier (NEON)
`sam_encode` (SAM seq/qual encoder)	per-tier	per-tier	per-tier	per-tier	per-tier	per-tier (NEON)
FM-index lookup (`FMI_search`)	scalar popcount	scalar popcount	scalar popcount	scalar popcount	scalar popcount	`__builtin_popcountl`
libsais BWT construction	scalar	scalar	scalar	OpenMP parallel	OpenMP parallel	OpenMP parallel

Note — FM-index is memory-bound

The FM-index backward-extension loop is limited by pointer-chasing through the cp_occ arrays, not by computation. Additional SIMD width does not increase throughput here. See Developer Guide — Apple Silicon / NEON port for the profiling evidence.

Runtime overrides

Two environment variables tune dispatch:

Variable	Effect
`BWAMEM3_FORCE_TIER=<tier>`	Forces a specific tier (`sse41` / `sse42` / `avx` / `avx2` / `avx512bw`). Downgrade-only: requests above the host’s detected tier (which would SIGILL) and unknown names are rejected with a stderr warning. Used by `test/regression/all_tiers_parity.sh` to confirm byte-identical SAM across all tiers on AVX-512 hosts.
`BWAMEM3_DEBUG_SIMD=1`	Prints a one-line `[I::bwamem3_simd_init_body]` startup banner with the build baseline, the detected host capability, and the resolved tier. Also enables the build-baseline-vs-host gap warning.
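The downgrade-only rule for BWAMEM3_FORCE_TIER can be modelled as a rank comparison. This is a sketch, not the real implementation: the function names are hypothetical, and falling back to the detected tier after a rejected request is an assumption, since the table above only states that the request is rejected with a warning.

```shell
# Illustrative rank of each x86 tier; 0 marks an unknown name.
tier_rank() {
  case $1 in
    sse41) echo 1 ;; sse42) echo 2 ;; avx) echo 3 ;;
    avx2) echo 4 ;; avx512bw) echo 5 ;;
    *) echo 0 ;;
  esac
}

# resolve_tier HOST_TIER [FORCED_TIER] -> tier that would be used
resolve_tier() {
  host=$1; forced=${2:-}
  if [ -z "$forced" ]; then echo "$host"; return; fi
  want=$(tier_rank "$forced"); have=$(tier_rank "$host")
  if [ "$want" -eq 0 ] || [ "$want" -gt "$have" ]; then
    echo "warning: ignoring BWAMEM3_FORCE_TIER=$forced" >&2
    echo "$host"                       # assumed fallback: keep the detected tier
  else
    echo "$forced"
  fi
}

resolve_tier avx512bw avx2       # avx2 (downgrade accepted)
resolve_tier avx2 avx512bw       # avx2 (upgrade rejected, warning on stderr)
resolve_tier avx2 neon           # avx2 (unknown name rejected)
```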

Use bwa-mem3 version to read the resolved tier without alignment:

v0.2.0
SIMD floor:   avx2 (x86-64-v3, Haswell 2013+); kernels: sse41 sse42 avx avx2 avx512bw
SIMD runtime: avx512bw (BWAMEM3_FORCE_TIER unset)

Why in-process dispatch, not separate binaries

The pre-PR-#83 design shipped six binaries (one launcher plus one per ISA tier), with the launcher calling execv on the matching tier binary at startup. That worked but cost ~120 MB on disk, required all six binaries to be present in the same directory, and made BWAMEM3_FORCE_TIER impossible without re-exec’ing a different file. The current single-binary design keeps the per-tier compile granularity for the hand-tuned kernel TUs while collapsing distribution to one file (~25 MB), and adds runtime tier override and a clean host-floor precheck. Indirect-call overhead is the only trade-off, and it is below the measurement noise floor on every architecture in the bench matrix.

PGO Build

Profile-Guided Optimization (PGO) is a two-pass compiler technique. In the first pass (pgo-generate) the compiler inserts counters into every branch, call site, and loop back-edge. You run a representative training workload against the instrumented binary so those counters accumulate real branch-probability data. In the second pass (pgo-use) the compiler recompiles every translation unit using the collected profiles to make better inlining, branch-prediction, and code-layout decisions.

bwa-mem3’s Makefile provides three targets that implement this workflow.

Observed gains

On Apple Silicon (M-series), PGO delivered approximately 3% throughput improvement over the native NEON build. The gain on x86 depends on the workload — short-read paired-end alignment on avx2 or avx512bw hardware typically sees 2–5%. PGO is most useful when you will run the same binary on the same hardware against the same workload repeatedly (e.g. a production pipeline node). It is not worth the extra build time for one-off or exploratory runs.

Workflow

Step 1: Build the instrumented binary

make pgo-generate

By default PGO_ARCH is set to arm64 on Apple Silicon / aarch64 hosts and native on x86 hosts. To target a specific ISA, pass PGO_ARCH explicitly:

make pgo-generate PGO_ARCH=avx2

This produces a binary named bwa-mem3.pgo-instr (or bwa-mem3.pgo-instr.avx2 for non-default arch). Profiles are written to the directory pgo_profiles/ by default. Override with PGO_PROFILE_DIR:

make pgo-generate PGO_ARCH=avx2 PGO_PROFILE_DIR=/scratch/pgo-profiles-avx2

Step 2: Run the training workload

Run a workload that is representative of your production use. A single-end or paired-end alignment run against the same reference and similar read length is sufficient. A larger training run produces more stable profiles but 5–10 million read pairs is generally enough.

./bwa-mem3.pgo-instr mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null

The run discards its output: the point of this step is the profile data written to the profile directory, not the alignments themselves.

Tip — Training workload size

Aim for a training run that exercises the same code paths as your production workload. If you align 150 bp paired-end reads in production, train on 150 bp reads. If you use --meth, include a methylation alignment run in training. A few million read pairs is sufficient; a full WGS run provides diminishing returns.

Step 3: Build the optimized binary

make pgo-use

Or with matching arch and profile dir:

make pgo-use PGO_ARCH=avx2 PGO_PROFILE_DIR=/scratch/pgo-profiles-avx2

This produces bwa-mem3.pgo (or bwa-mem3.pgo.avx2). The binary is ready to use in production.

Step 4: Clean up instrumentation artifacts

make pgo-clean

This removes the profile directory and all bwa-mem3.pgo-instr* and bwa-mem3.pgo* files.

Multi-arch builds with PGO

Each architecture requires its own profile because the instrumentation counters are embedded in arch-specific code. Run steps 1–3 (generate, train, use) once per arch and keep the profiles in separate directories:

# AVX2 profile
make pgo-generate PGO_ARCH=avx2 PGO_PROFILE_DIR=pgo_profiles_avx2
./bwa-mem3.pgo-instr.avx2 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null
make pgo-use PGO_ARCH=avx2 PGO_PROFILE_DIR=pgo_profiles_avx2

# AVX-512BW profile (separate host or same host with matching CPU)
make pgo-generate PGO_ARCH=avx512bw PGO_PROFILE_DIR=pgo_profiles_avx512bw
./bwa-mem3.pgo-instr.avx512bw mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null
make pgo-use PGO_ARCH=avx512bw PGO_PROFILE_DIR=pgo_profiles_avx512bw
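When more than two architectures are involved, the per-arch command sequence can be generated rather than typed. The sketch below only builds the argument lists for the documented `pgo-generate`, training, and `pgo-use` steps. The helper name `pgo_commands` and the `pgo_profiles_<arch>` directory naming are assumptions that mirror the example above.

```python
def pgo_commands(arch, train_args):
    """Return the three argv lists (generate, train, use) for one arch."""
    profile_dir = f"pgo_profiles_{arch}"  # one profile directory per arch
    make_vars = [f"PGO_ARCH={arch}", f"PGO_PROFILE_DIR={profile_dir}"]
    return [
        ["make", "pgo-generate", *make_vars],
        [f"./bwa-mem3.pgo-instr.{arch}", "mem", *train_args],
        ["make", "pgo-use", *make_vars],
    ]


TRAIN_ARGS = ["-t", "16", "ref.fa", "R1.fq.gz", "R2.fq.gz"]

for arch in ("avx2", "avx512bw"):
    for argv in pgo_commands(arch, TRAIN_ARGS):
        print(" ".join(argv))
```

If the lists are executed with `subprocess.run`, pass `stdout=subprocess.DEVNULL` for the training step to reproduce the `> /dev/null` redirect, and run each arch on a host that actually supports it.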

Warning — Profile portability

Profile data collected on one microarchitecture is not portable to a different one. An AVX2 profile collected on a Haswell CPU will not improve — and may pessimize — an AVX-512BW build run on a Sapphire Rapids CPU. Always collect profiles on the same hardware class where the optimized binary will run.

PGO and the single-binary multi-tier build

The PGO targets produce one optimized binary for a single arch= target. They do not yet rebuild the per-tier kernel TUs of the single multi-tier binary that the default make produces. If you need PGO across more than one host class, build and profile each arch= variant separately and deploy whichever matches the target fleet; bwa-mem3 version reports the resolved tier so you can confirm. PGO for the in-process multi-tier dispatch path is tracked as a future enhancement.

Relationship to LTO

make lto-build produces a Link-Time Optimization binary; make pgo-use produces a PGO-optimized binary. Both are independent opt-in targets. You can combine them by passing -flto (or -flto=thin for clang) as part of EXTRA_CXXFLAGS during the pgo-use step, but the combination has not been systematically benchmarked. In practice, LTO and PGO each provide modest single-digit gains; their interaction is compiler-specific.

Optimization checklist

The items below are ordered by expected impact for most workloads. Work through them in sequence; there is little point optimizing output format before confirming you are running the right binary for your CPU.

1. Confirm the resolved SIMD tier matches your CPU

The default make produces a single binary that contains every supported x86 SIMD tier and selects one in process at startup. Verify which tier is running:

bwa-mem3 version
# expect: SIMD floor: <build_floor>; SIMD runtime: <resolved_tier>

If the runtime tier is below what your CPU supports, double-check whether you accidentally built with a lower BASELINE_ARCH= or set BWAMEM3_FORCE_TIER in the environment. Set BWAMEM3_DEBUG_SIMD=1 to get a startup banner on stderr at the start of a mem run.

On ARM / Apple Silicon, the binary has one NEON tier; bwa-mem3 version reports SIMD runtime: neon.

See SIMD dispatch matrix for the full dispatch logic and the minimum CPU requirements for each tier.

Tip — Single-arch deployments

On a cluster where every node has the same CPU, build with make arch=avx2 (or the appropriate ISA). The runtime dispatch overhead is negligible, but a single-arch build trims the binary and removes any chance of BWAMEM3_FORCE_TIER accidentally downgrading throughput in production.

2. Build with PGO if you will run repeatedly

For production pipeline nodes that will process many samples against the same reference, a PGO build provides an additional 2–5% throughput at the cost of one extra build pass and a training run:

make pgo-generate PGO_ARCH=avx2
./bwa-mem3.pgo-instr.avx2 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null
make pgo-use PGO_ARCH=avx2

See PGO build for the full workflow, including multi-arch and profile portability notes.

3. Use shared memory for many small samples

When aligning many samples on one machine against the same reference, loading the index into POSIX shared memory once and reusing it across all mem invocations eliminates redundant I/O and reduces per-sample startup time significantly. The benefit grows with the number of samples and the size of the reference.

# Load the index into shared memory once
bwa-mem3 shm ref.fa

# Align each sample against the in-memory index
bwa-mem3 mem -t 16 ref.fa R1.fq.gz R2.fq.gz | samtools sort -@ 4 -o sample.bam -

# When finished with all samples, drop the shared segment
bwa-mem3 shm -d

Warning — No staleness check

bwa-mem3 shm does not detect whether the on-disk index has changed after the segment was loaded. Always run bwa-mem3 shm -d before re-indexing a reference and re-loading with bwa-mem3 shm. Failing to do so results in alignments against a stale index.

See Getting Started — Shared-memory index and Best Practices — Multi-sample workflows for complete workflows.

4. Emit BAM directly

Use --bam (or --bam=0 for uncompressed BAM) to emit BAM instead of SAM. Uncompressed BAM avoids the text-formatting cost on the aligner side and the text-parsing cost on the downstream side. samtools sort reads BAM natively and is fastest when the input is uncompressed:

bwa-mem3 mem --bam=0 -t 16 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -@ 8 -o out.bam -
samtools index out.bam

A compression level — --bam=6, say — produces BGZF-compressed BAM, useful when writing directly to disk without a downstream piped tool.

See Best Practices — Output format for guidance on when SAM is still appropriate.

5. Pipe to a multi-threaded sorter

Sorting is typically the bottleneck after alignment. Keep a separate thread budget for samtools sort:

bwa-mem3 mem --bam=0 -t 12 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -@ 8 -m 2G -o out.bam -

On a 16-core machine, allocating 12 threads to mem and 8 to samtools sort (with overlap via the pipe) is a common starting point. The aligner is generally CPU-bound; the sorter is I/O-bound during merge. Profile both stages to find the right split for your hardware.
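For pipelines driven from Python rather than a shell, the same two-stage pipe can be wired up with `subprocess`. This is a sketch under the assumption that `bwa-mem3` and `samtools` are on `PATH`; the function names are illustrative, and only flags already shown above are used.

```python
import subprocess


def pipeline_argv(align_threads, sort_threads, ref, r1, r2, out_bam, sort_mem="2G"):
    """Build argv lists for the aligner and the sorter with separate thread budgets."""
    align = ["bwa-mem3", "mem", "--bam=0", "-t", str(align_threads), ref, r1, r2]
    sort = ["samtools", "sort", "-@", str(sort_threads), "-m", sort_mem,
            "-o", out_bam, "-"]
    return align, sort


def run_pipeline(align, sort):
    """Pipe the aligner into the sorter and return both exit codes."""
    aligner = subprocess.Popen(align, stdout=subprocess.PIPE)
    sorter = subprocess.Popen(sort, stdin=aligner.stdout)
    aligner.stdout.close()  # let the aligner see a broken pipe if the sorter exits
    sort_rc = sorter.wait()
    align_rc = aligner.wait()
    return align_rc, sort_rc


align_cmd, sort_cmd = pipeline_argv(12, 8, "ref.fa", "R1.fq.gz", "R2.fq.gz", "out.bam")
print(" ".join(align_cmd), "|", " ".join(sort_cmd))
```

Keeping the thread counts as parameters makes it easy to sweep a few splits and time each one on the target hardware.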

Tip — Thread count tuning

bwa-mem3 mem scales well to 16–32 threads on most workloads. Beyond 32 threads the per-thread work unit becomes small enough that synchronization overhead starts to erode gains. See User Guide — Threading and resource use for thread-scaling data.

6. Reorder seeds longest-first (`--seed-order local-longest`)

After SA-interval resolution bwa-mem3 builds chains by scanning seeds in the order they were emitted. --seed-order local-longest re-sorts each read’s resolved seeds by decreasing length before chaining. The longest seed anchors its chain first and absorbs shorter seeds that are fully contained within it, so contained sub-seeds are never extended — they contribute nothing a long seed does not already cover.

bwa-mem3 mem --seed-order local-longest -t 16 ref.fa R1.fq.gz R2.fq.gz | samtools sort -@ 4 -o out.bam -

Measured on 50,000 real WGS reads (1000 Genomes HG00096, hg38), local-longest raises the fraction of seeds absorbed before banded Smith-Waterman extension from 38.2 % to 43.7 %, which is an ~8.9 % reduction in extended seeds. Since Smith-Waterman extension is typically the dominant per-read cost in bwa-mem3 mem, the gain scales with the seed absorption rate of a given dataset.
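The ~8.9 % figure follows from the two absorption rates if every seed that is not absorbed goes on to extension. That reading is an assumption of the sketch below, and the function name is illustrative.

```python
def extended_seed_reduction(absorbed_before_pct, absorbed_after_pct):
    """Relative drop (in %) of seeds reaching extension, given absorption rates in %."""
    extended_before = 100.0 - absorbed_before_pct  # seeds extended with --seed-order off
    extended_after = 100.0 - absorbed_after_pct    # seeds extended with local-longest
    return (extended_before - extended_after) / extended_before * 100.0


reduction = extended_seed_reduction(38.2, 43.7)
print(f"{reduction:.1f} % fewer extended seeds")
```

The same helper can be applied to absorption rates measured on your own data to estimate how much extension work the option would remove.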

Accuracy and byte-identity. F1 is flat on an easy simulated profile (holodeck, ~94.4 %; no regression vs --seed-order off). Hard-data F1 validation on divergent/indel-rich reads and GIAB benchmarks is not yet complete, so --seed-order local-longest is opt-in only and the default remains off. Non-off modes are not byte-identical to off: they can shift secondary alignments, XA:Z:, XS:i, and HN:i tags, and a small number of primaries. See Equivalence → Seed ordering for the full taxonomy.

Additional advanced modes are accepted by the option (global-longest, absorb-count, most-absorb) but are unadvertised pending further validation.

7. Enable SMEM deduplication (opt-in, not byte-identical)

--smem-dedup removes duplicate SMEM seeds before SA expansion, cutting SA lookups by roughly 10 % on typical short-read WGS data. Off by default to preserve byte-identical output. Enable only when output identity with upstream bwa-mem2 is not required:

bwa-mem3 mem --smem-dedup -t 16 ref.fa R1.fq.gz R2.fq.gz | ...

The accuracy impact is confined to a small fraction of reads: on 50 k WGS reads vs hg38 only 2 reads (0.004 %) changed — one XS tag update on a MAPQ-60 read (primary placement unchanged) and one tie-break shift on a MAPQ-0 equal-score locus. All uniquely-mapped reads are unaffected.

Not byte-identical

--smem-dedup changes the SMEM set seen by chaining for reads that had duplicate SMEMs (arising from B-tree duplicate keys in the SA). This can alter XS tags and equal-score placements on 0.004 % of reads (2 of 50 k on the WGS validation set). Do not enable in pipelines that compare output to a bwa-mem2 baseline.

Summary table

Item	Action	Reference
Right SIMD tier for CPU	`bwa-mem3 version`; verify `SIMD runtime:`	SIMD dispatch matrix
PGO for production	`pgo-generate` → train → `pgo-use`	PGO build
Shared-memory index	`bwa-mem3 shm ref.fa` before batch runs	Quick start: shm
Emit uncompressed BAM	`--bam=0`	Best Practices — Output format
Multi-threaded sort	`samtools sort -@` with appropriate thread split	User Guide — Threading
Reorder seeds longest-first	`--seed-order local-longest`	Equivalence
SMEM deduplication	`--smem-dedup`; ~10 % fewer SA lookups; opt-in, not byte-identical	Features → --smem-dedup

Settings profiles: bwa drop-in vs recommended

bwa-mem3’s defaults track bwa-mem / bwa-mem2 behavior — it is a faster drop-in for an existing bwa-mem2 pipeline. Some defaults the bwa lineage inherited are, however, more conservative than necessary. This page defines two profiles:

a drop-in profile that keeps the bwa-mem/bwa-mem2 defaults, for migrations and parity checks;
a recommended profile that deviates from those defaults where we have benchmarked the change to be near-neutral on accuracy and clearly faster.

The defaults ship as the drop-in profile. The recommended profile is opt-in — you turn it on with explicit flags — so upgrading bwa-mem3 never silently changes your alignments.

Which profile?

Your situation	Profile	Invocation
Migrating a bwa-mem2 pipeline, or validating against bwa/bwa-mem2	Drop-in	`bwa-mem3 mem` (no extra flags)
New pipeline, or migration already validated	Recommended	`bwa-mem3 mem -m 10 -y 0` (add `-s 2` under `--meth`)

bwa-mem3 is already, by design, not byte-identical to bwa-mem2 even at default settings (additive SAM tags, per-architecture SIMD score2/MAPQ convergence, deterministic tie-breaks, a few extra supplementary alignments) — it is 99.94%–99.9996% concordant on primary records (panel-twist to WES). See Equivalence with bwa-mem2. “Drop-in” therefore means as close to bwa-mem2 as bwa-mem3 gets; the recommended profile is a further, documented deviation.

Drop-in profile (default)

bwa-mem3 mem -t <N> ref.fa R1.fq R2.fq > out.sam

No extra flags. Use this when you are:

reproducing or validating against bwa-mem / bwa-mem2 output, or
migrating an existing bwa-mem2 pipeline and want to change one variable (the aligner binary) at a time before tuning anything else.

Recommended profile

bwa-mem3 mem -t <N> -m 10 -y 0 ref.fa R1.fq R2.fq > out.sam

Shorthand: bwa-mem3 mem --fast applies -m 10 -y 0 --min-ext-len 30 --smem-dedup --skip-contained-ext --max-extend-chains 20 --adaptive-band --extend-mate-concordant (plus -s 2 and a lower --max-extend-chains 10 under --meth) in one flag. Explicit flags still override individual levers where applicable; --smem-dedup, --skip-contained-ext and --adaptive-band are forced on with no opt-out (--adaptive-band is a no-op on short reads, a ~25% speedup on long-read runs). --skip-contained-ext no-ops under --meth (its own internal gate disables it there), so on a --meth run the effective levers are -m 10 -y 0 --min-ext-len 30 --smem-dedup --max-extend-chains 10 --adaptive-band -s 2 --extend-mate-concordant. See mem → --fast.
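The expansion described above can be written down as a small lookup, which is handy when a pipeline needs to log the effective levers next to each run. This is a sketch transcribed from the paragraph above, not read from the bwa-mem3 source; the function name is illustrative.

```python
def fast_effective_levers(meth=False):
    """Flags that --fast is described as applying, with and without --meth."""
    levers = ["-m", "10", "-y", "0", "--min-ext-len", "30", "--smem-dedup"]
    if not meth:
        # Described as a no-op under --meth, so it is left out of the effective set.
        levers += ["--skip-contained-ext"]
    levers += ["--max-extend-chains", "10" if meth else "20", "--adaptive-band"]
    if meth:
        levers += ["-s", "2"]
    levers += ["--extend-mate-concordant"]
    return levers


print(" ".join(fast_effective_levers()))
print(" ".join(fast_effective_levers(meth=True)))
```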

Use this for new pipelines, or once a drop-in migration is validated and you want bwa-mem3’s best speed/accuracy trade-off. Current recommended deviations:

flag	default (drop-in)	recommended	effect
`-m` (mate-rescue depth)	50	10	~11–22% less alignment CPU; near-neutral accuracy (see below)
`-y` (3rd-round seeding occurrence)	20	0	~11–30% less alignment CPU; F1 near-neutral across regimes (within ±0.02; better on divergent/repeat — see below)
`-s` (Pass-2 re-seed width), `--meth` only	10	2	light re-seed: ~same speed as `-s 0` but recovers the MAPQ/placement `-s 0` lost (see below)
`--min-ext-len` (skip short-seed extension), standard-error reads	0	30	~10–20% less alignment CPU; accuracy change confined to the already-low-confidence tail (see below)

This table will grow as we benchmark additional tunings; each entry is gated on the same “measurably faster, near-neutral accuracy” bar.

For a bisulfite (--meth) pipeline the recommended invocation is therefore:

bwa-mem3 mem -t <N> --meth -m 10 -s 2 -y 0 ref.fa R1.fq R2.fq > out.sam

Mate-rescue depth: `-m 10`

-m caps how many near-best candidate loci per read get a mate-rescue Smith–Waterman pass. bwa-mem and bwa-mem2 both default to 50; we measured the marginal value of that depth directly — end-to-end and at the read level against simulated golden truth.

The depth beyond ~10 is a measured wash. Mate rescue earns its keep on the first few candidates, but candidates 11–50 only ever engage on reads with many near-best loci — repetitive and cross-contig placements — and on truth data deep rescue places about as many of those reads correctly as it mis-places. Lowering -m 50 → 10 changes which low-confidence reads win, but the net true-recall cost is ≈ 0.003–0.004% (≈200 reads in 5.4 M), on both methylation and non-methylation data, and the accuracy-vs-depth curve is flat (no depth between 10 and 50 is meaningfully better). Disabling rescue entirely (-S, or -m 0) is a real regression — so rescue matters, just not 50-deep.

In exchange you get a real speed-up, largest exactly where deep rescue binds (high-depth amplicon/panel and repetitive regions):

dataset	alignment CPU	wall
panel-twist (high depth)	−22%	−21%
wgs	−19%	−19%
wes	−18%	−14%
meth (em-seq)	−11%	−9%
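A CPU reduction converts to a throughput factor as 1 / (1 − reduction). The sketch below applies that to the alignment-CPU column, assuming the same core budget before and after and ignoring non-alignment stages; the names are illustrative.

```python
# Alignment-CPU deltas (%) copied from the table above.
CPU_DELTA_PCT = {
    "panel-twist": -22,
    "wgs": -19,
    "wes": -18,
    "meth (em-seq)": -11,
}


def throughput_factor(cpu_delta_pct):
    """Reads per CPU-second relative to the -m 50 default."""
    return 1.0 / (1.0 + cpu_delta_pct / 100.0)


for dataset, delta in CPU_DELTA_PCT.items():
    print(f"{dataset}: {throughput_factor(delta):.2f}x")
```

A −22 % CPU change therefore corresponds to roughly 1.28 times as many reads aligned per CPU-second, slightly more than the raw percentage suggests.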

Aggregate samtools flagstat impact is a uniform, directional sub-0.2 pp reduction in mapped%/proper-pair% (≤0.02 pp on WGS/WES; ≤0.12–0.19 pp on amplicon/meth), entirely in the low-MAPQ deep-rescue tail: the mapped count only ever decreases (it never newly maps a previously-unmapped read), and the handful of reads whose placement does change net to ≈0 recall against golden truth.

One caveat: this golden-truth metric scores a single best placement per read, so it does not capture downstream tools that aggregate over repeats — depth-based CNV, SV/mobile-element calling in repetitive regions, or repeat-region methylation. If your pipeline relies on which repeat copy is reported (rather than just the confident, MAPQ-high tier), validate -m 10 against your own analysis rather than assuming neutrality.

Numbers are from bwa-mem3-bench on 5 M-read real datasets plus a multi-contig holodeck golden-truth ablation; consult the bench for methodology and current figures.

Third-round seeding: `-y 0`

bwa-mem seeds in up to three rounds: (1) SMEMs, (2) reseeding long SMEMs (-r), and (3) an occurrence-bounded “seed strategy” round (-y) that, for every read position, grows an exact match until it occurs fewer than -y (default 20) times in the genome and emits it. Round 3 is a repeat-region safety net. -y 0 disables it entirely.
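To make the round-3 rule concrete, here is a toy Python sketch that follows the description above literally: for each read position, grow an exact match until it occurs fewer than `max_occ` times, then emit it. It counts occurrences by brute-force string search rather than with an FM-index, and `min_len` (modelled on bwa-mem's minimum seed length) is an assumption, so treat it as an illustration of the rule and not as the aligner's implementation.

```python
def count_occurrences(genome, pattern):
    """Count exact, possibly overlapping, occurrences of pattern in genome."""
    count = 0
    start = genome.find(pattern)
    while start != -1:
        count += 1
        start = genome.find(pattern, start + 1)
    return count


def third_round_seeds(read, genome, max_occ=20, min_len=19):
    """Return (start, end, occurrences) for each emitted round-3 seed."""
    if max_occ <= 0:  # -y 0 disables the round entirely
        return []
    seeds = []
    for start in range(len(read)):
        for end in range(start + min_len, len(read) + 1):
            occ = count_occurrences(genome, read[start:end])
            if occ == 0:
                break  # no exact match left to grow
            if occ < max_occ:
                seeds.append((start, end, occ))
                break  # emit the first match that is rare enough
    return seeds


# Tiny toy inputs: a repetitive prefix followed by a unique suffix.
GENOME = "ACAC" * 5 + "GATTACA"
READ = "ACACGATT"
print(third_round_seeds(READ, GENOME, max_occ=3, min_len=2))
print(third_round_seeds(READ, GENOME, max_occ=0))
```

In the toy genome the repetitive `ACAC` prefix keeps a match above the occurrence bound until it reaches the unique `G`, which is the behaviour that makes this round a source of extra repeat-region candidates.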

It is net-useless-to-harmful, even in the repeats it was built for. We swept -y 0 (and the more-aggressive -y 0 -r 10, see below) against the stock default across four golden-truth regimes (holodeck, read-name truth) and on real WES + WGS:

regime	ΔF1 (`-y 0` − default)	ΔF1 (`-y 0 -r 10` − default)	ΔF1 in the MAPQ-0 (repeat) bin (`-y 0` / `-y 0 -r 10`)
easy (150 bp, default error)	−0.010	−0.001	−0.14 / −0.04
substitution-divergent (2–15%)	+0.008	−0.074	+0.13 / −0.07
indel-rich (37 % indels)	−0.016	−0.017	−0.02 / −0.03
repeat-enriched (reads simulated from RepeatMasker, 49 % of hg38)	+0.007	+0.004	+0.16 / −0.02

The tiny easy/indel deltas are ≤0.016 in absolute F1; on the divergent and repeat-enriched regimes — where round 3 is supposed to earn its keep — removing it makes F1 better. Mechanism: its low-occurrence (<20-hit) seeds add spurious near-tied repeat candidates that slightly degrade placement; dropping them lets the correct unique anchor win. (The regime sweep was on standard, non---meth reads; -y 0 is a generic seeding change and is recommended for --meth on the same basis, but was not separately F1-measured there.)

On real data the confident core is untouched. On 20 M real WES+WGS reads, zero confidently and uniquely mapped reads (MAPQ ≥ 60, NM ≤ 1, no soft-clip) became unmapped under -y 0, and 3–4 per 10 M lost any alignment score. Every regression lands on the already-low-confidence tail (base MAPQ ≈ 0, heavily soft-clipped or high-NM), and round 3 so often produced worse primaries that removing it improves nearly as many reads as it worsens (on WES, 3,194 reads improve vs 3,784 that worsen).

In exchange, alignment CPU drops ~11–30 % single-thread, larger on data with more repeat content (WGS > WES). The win is confirmed cross-architecture (Graviton4/Linux: realistic +29.9 %, HG002 WGS +19.8 %, matching macOS) — it cuts cold-memory seeding work, not arithmetic, so it ports.

More aggressive (clean-data only): also drop round 2 with -y 0 -r 10. That roughly halves alignment CPU (~50–63 %), but round 2 supplies genuine split-read/divergence sensitivity: dropping it costs 0.074 F1 on divergent data (the -y 0 -r 10 column above, concentrated in the repeat/multimapper bin). Use -y 0 -r 10 only on known-clean, low-divergence libraries; -y 0 alone is the broadly-safe recommendation.

Pass-2 re-seeding under `--meth`: `-s 2`

--fast --meth uses -s 2 (light Pass-2 re-seed), not -s 0. Earlier releases set -s 0 (no re-seed). Read-level analysis on holodeck sim reads showed -s 0 inflates MAPQ: the affected reads map on occurrence-1 SMEMs that hide an interior repeat only Pass-2 would surface, so without it they look uniquely placed and MAPQ is pushed toward 60 — the calibration caveat noted below, now quantified. -s 2 re-seeds exactly those occurrence-1 SMEMs (the cheap subset), recovering MAPQ and placement at ≈ the speed of -s 0 (3.6× vs 3.25× on twist-em-seq 5M; MAPQ matches the -s 10 default on 85/86 changed reads; placement 97.6% = default). This is the “cheap middle ground” the read-confidence fallback at the end of this section did not find (it re-seeds by SMEM occurrence, not by read confidence). Validation status: the -s 0 recall/speed figures below are the genome-wide em-seq bench; the -s 2 MAPQ/placement recovery is read-level + chr1 sim, with a genome-wide em-seq re-run pending.

-s controls bwa-mem’s Pass-2 “re-seeding” — after the first seeding pass, long super-maximal exact matches whose occurrence count is ≤ -s are re-seeded from their midpoint to recover shorter, more frequent matches that a long seed would otherwise swallow. Under --meth, reads are projected into a 3-letter alphabet (C→T and G→A), which collapses sequence complexity: in that space Pass-2 mostly mines low-complexity sub-seeds that inflate the candidate set without changing placement. Setting -s 0 disables Pass-2 entirely (an exact seed’s interval size is always ≥ 1, so the re-seed gate never fires). This is methylation-specific — in the full 4-letter alphabet Pass-2 still earns its keep, so the recommendation is scoped to --meth.
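The re-seed gate can be sketched as a predicate on one SMEM. The occurrence test (`occurrences <= split_width`) comes from the description above; the length threshold is modelled on bwa-mem's `-k 19` and `-r 1.5` defaults and is an assumption here, as is the function name.

```python
def pass2_reseeds(smem_len, occurrences, split_width, min_seed_len=19, split_factor=1.5):
    """True if an SMEM would be re-seeded from its midpoint in Pass-2."""
    split_len = int(min_seed_len * split_factor + 0.499)  # "long" SMEM threshold
    return smem_len >= split_len and occurrences <= split_width


# An exact seed always has at least one occurrence, so -s 0 never fires.
print(pass2_reseeds(60, 1, split_width=0))
print(pass2_reseeds(60, 1, split_width=2))
print(pass2_reseeds(60, 5, split_width=2))
print(pass2_reseeds(60, 5, split_width=10))
```

Lowering `-s` therefore narrows the set of SMEMs that are re-seeded, and at 0 the set is empty.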

Placement is a measured wash, even where re-seeding should matter most. We aligned ~33 M holodeck em-seq golden-truth reads with -s 0 vs the -s 10 default across two references — single-contig chr22 and a deliberately hard multi-contig build (chr19–22 + their ALT scaffolds + decoys + HLA, 2,977 contigs) that stresses cross-contig multi-mapping. In both regimes the net true-recall difference is within noise:

reference (10 seeds)	records	net recall Δ (`-s 0` − `-s 10`)	significance
single-contig chr22	16.9 M	−22 (−0.0001%)	McNemar p = 0.87
multi-contig (ALT/decoy/HLA)	16.4 M	−453 (−0.0028%)	McNemar p = 0.23
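McNemar's test looks only at the discordant reads: those placed correctly under one setting and incorrectly under the other. The net recall Δ is the difference of the two discordant counts, and the p-value asks whether that difference is larger than chance. The sketch below uses the chi-square approximation without continuity correction. The counts in the example are hypothetical placeholders, not the bench's discordant counts, which are not given here.

```python
import math


def mcnemar_p(only_a_correct, only_b_correct):
    """Two-sided McNemar p-value (chi-square approximation, 1 degree of freedom)."""
    n = only_a_correct + only_b_correct
    if n == 0:
        return 1.0
    chi2 = (only_a_correct - only_b_correct) ** 2 / n
    # Survival function of a chi-square variable with 1 d.o.f.
    return math.erfc(math.sqrt(chi2 / 2.0))


# Hypothetical discordant counts, for illustration only.
only_s0_correct, only_s10_correct = 1000, 1040
print("net recall delta:", only_s0_correct - only_s10_correct)
print("p-value:", mcnemar_p(only_s0_correct, only_s10_correct))
```

A small net Δ on top of many discordant reads gives a large p-value, which is the pattern the table reports.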

The direction (a sub-0.003% loss concentrated in the low-MAPQ cross-contig tail) mirrors -m 10’s, and -s 0 is not the same as disabling all seeding — it only removes the redundant second pass.

The one caveat is mapping-quality calibration. Removing Pass-2 shrinks the candidate set, so MAPQ trends upward (≈95% of changed reads). On easy references this is benign re-distribution; in the hard multi-contig regime the dominant high-confidence bin (MAPQ 50–60, ~84% of reads) stays identically calibrated (0.020% mis-placement either way), but the small mid-MAPQ bins (~3% of reads) are modestly to ~2× more error-prone at a given reported MAPQ under -s 0. This is negligible for methylation quantification but worth knowing for callers that hard-filter on MAPQ in repetitive regions.

In exchange, seeding is ~20% cheaper. On chr22 (single thread, best-of-5): alignment user-CPU −19.7% (12.96 s vs 16.14 s) and peak RSS −6.5%, with the gain largest on high-depth and repetitive inputs where the redundant Pass-2 candidate set is largest.

A selective fallback that skips Pass-2 globally but re-seeds only low-confidence reads was evaluated and does not help: in the multi-contig regime it would re-seed ~13% of reads yet recovers about as many placements as it sacrifices (net within noise). Gating on read confidence therefore offers no cheap middle ground; the occurrence-gated -s 2 described at the top of this section is the one that does.

Short-seed extension: `--min-ext-len 30`

--min-ext-len INT skips banded Smith–Waterman extension of seeds shorter than INT bp in chains that still hold a longer anchor seed — those short seeds are collinear with the anchor, whose extension already covers them, so dropping them is near output-neutral and pure speed. A chain whose seeds are all short is left untouched, so the filter never empties a chain and never drops a read; such all-short chains (common on low-mappability / repetitive loci) extend exactly as the default does. Off by default (0 → byte-identical to baseline). Extension is ~60% of mem CPU and almost all of it is spent on short seeds — seeds ≤40 bp hold ~90% of all banded-SW cells yet are ~99% wasted, because long seeds already resolve via the ungapped fast-path. Skipping the redundant ones thins the extension stage without touching seeding or chaining.
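The non-emptying rule is easiest to see on one chain's seed lengths. The sketch below assumes that "a longer anchor seed" means any seed of at least `min_ext_len` bp; the function name and the list-of-lengths representation are illustrative.

```python
def seeds_to_extend(seed_lengths, min_ext_len=0):
    """Return the seed lengths in one chain that still go to banded SW extension."""
    if min_ext_len <= 0:
        return list(seed_lengths)  # filter off: behave like the default
    anchors = [n for n in seed_lengths if n >= min_ext_len]
    if not anchors:
        return list(seed_lengths)  # all-short chain: leave it untouched
    return anchors  # short seeds are skipped only when an anchor remains


print(seeds_to_extend([75, 22, 19], min_ext_len=30))  # anchor present
print(seeds_to_extend([22, 19], min_ext_len=30))      # all-short chain
print(seeds_to_extend([75, 22], min_ext_len=0))       # filter disabled
```

Because the all-short branch returns the chain unchanged, the filter can only remove work from chains that already have a long seed to carry the alignment.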

Recall-safe (non-emptying filter). Earlier releases dropped every short seed unconditionally, which emptied all-short chains and silently unmapped reads whose only evidence was short — negligible on clean WGS but catastrophic on low-mappability short-read data (a 151 bp low-mappability sample lost 63% of its mappings). The current filter only drops a short seed when a longer anchor survives in the same chain, making it a strict recall improvement: it can reduce extension work but can never lose a read.

The accuracy change is small, confined to low-confidence reads, and costs no recall. On real HG002 1M PE WGS at --min-ext-len 30 (non-emptying filter), the mapped count is unchanged from default (99.75% both), and only ~0.10% of reads change locus — down from ~0.40% under the old emptying filter, because the reads that used to vanish now map identically to default. The locus changes concentrate in the low-confidence tail (repeat/paralog churn); ~0.005% of reads (≈100 of 2 M) change at MAPQ ≥ 60.

In exchange, alignment CPU drops ~10% single-thread at --min-ext-len 30 (measured −9.5% main_mem on HG002 WGS-1M), largest on data carrying many short seeds. Because the speedup thins extension rather than speeding seeding — and seeding dominates the wall — the wall-clock gain is smaller than the ~90% banded-SW cell reduction would suggest.

Higher thresholds no longer cliff. Because all-short chains are now protected, raising --min-ext-len to 40–50 no longer drops reads: on HG002 WGS-1M the mapped count holds at 99.75% and divergence stays ~0.10% across 30–50. The previously-documented high-error F1 cliff (F1 −1.4% at 40, a cliff at 50 on 2–15% simulated substitution error) was a property of the emptying behavior — its mechanism, correct alignments carried solely on short seeds being dropped, is exactly what the non-emptying filter now protects — so it is expected to be largely removed, pending re-validation of the simulated-error sweep under the new filter. Indels and structural variants were never a contraindication — an indel splits a read into two still-long exact segments the fast-path handles, so indel-rich data is free.

30 remains the recommended value. Validation status: the non-emptying filter changes the observable output of --min-ext-len (and therefore --fast), so the cross-architecture speed figures and the golden-truth F1 sweep warrant a fresh bwa-mem3-bench run (multi-thread, all regimes) to confirm these --fast accuracy figures genome-wide. --min-ext-len and --fast stay opt-in; the drop-in defaults are unchanged.

Chain extension cap: `--max-extend-chains`

--max-extend-chains INT caps the number of chains that reach banded Smith–Waterman extension to the top-INT by chain weight (applied after mem_chain_flt); the dropped chains are the lowest-weight secondaries. It is the only lever that reduces the number of chains extended per read — the other --fast levers (-y 0, --min-ext-len, --smem-dedup) cut seeds and SW-per-chain but leave chains extended nearly unchanged — so it is orthogonal and adds a real marginal speedup on top of them. Off by default (0 → byte-identical to baseline). As a safety fallback the cap is a no-op for pathological reads with more than 4096 chains (MAX_EXTEND_CHAINS_CAP): those reads extend all of their chains as usual, so the option has no effect on them.
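A minimal sketch of the cap, assuming chains are represented by their weights and that ties keep their original order (the tie-break is an assumption, as is the function name). The 4096 fallback constant is taken from the paragraph above.

```python
MAX_EXTEND_CHAINS_CAP = 4096  # above this many chains the cap is a no-op


def chains_to_extend(chain_weights, max_extend_chains=0):
    """Return indices of the chains that reach extension, in original order."""
    n = len(chain_weights)
    if max_extend_chains <= 0 or n > MAX_EXTEND_CHAINS_CAP or n <= max_extend_chains:
        return list(range(n))
    ranked = sorted(range(n), key=lambda i: chain_weights[i], reverse=True)
    return sorted(ranked[:max_extend_chains])


print(chains_to_extend([10, 50, 30, 20], max_extend_chains=2))
print(chains_to_extend([10, 50], max_extend_chains=0))
print(len(chains_to_extend([1] * 5000, max_extend_chains=5)))
```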

Not byte-identical. Dropping candidate chains removes low-weight secondaries, so XS, secondary alignments, and MAPQ can move on multi-mapping reads. High-confidence (uniquely-placed) reads are unaffected; the default path (0) is verified byte-identical to the base branch.

The accuracy/speed tradeoff is a smooth monotonic curve with a knee at N = 4–5. On holodeck truth (sim-wgs-place, 10.7 M reads), standalone N = 5 is −23% total alignment CPU at +20 high-confidence (MAPQ ≥ 60) mismaps out of 9.5 M (1529 → 1549). Stacked on top of --fast it is −15% marginal CPU at +21 high-confidence mismaps (1559 → 1580, +0.0002% absolute; overall −0.045 pp). N = 2 is too aggressive (+13.5% high-confidence mismaps). --fast sets 20 and pairs it with --extend-mate-concordant (below): the standalone MAPQ ≥ 1 tail rises 3.8× at cap 5, but mate-concordant retention closes most of that at cap 20 for ~−20% aligner CPU (fg-labs/bwa-mem3#202). --max-extend-chains and --fast stay opt-in; the drop-in defaults are unchanged.

Mate-concordant chain retention: `--extend-mate-concordant`

The chain cap interacts badly with paired-end pairing, most acutely under --meth. Bisulfite reads are projected into a 3-letter alphabet (C→T, G→A), which collapses sequence complexity and flattens chain weights: a read carries many similarly-weighted chains instead of one clear winner. Capping to the top-5 by weight then frequently drops a read’s true low-weight chain — not because it was chain-filtered away (in 89% of the regressions instrumented on a 50k-pair slice the true chain is still a candidate at cap-5), but because capping starves PE pairing and mate rescue of the secondary anchors that let the true concordant pair win. Both mates then flip together to a wrong concordant locus (99% flagged proper-pair). On a 1M-pair sim-meth-place slice this dropped correct placement 98.10% → 97.63% and raised confident (MAPQ ≥ 30) mismaps 4.4× (0.036% → 0.160%).

--extend-mate-concordant fixes this at the chain-cap stage: when --max-extend-chains would cap a paired-end read, it additionally retains any chain concordant with one of the mate’s chains — same contig, FR (“innie”) orientation, within a window — even if it ranks below the cap. This keeps the true pair’s low-weight anchor while still dropping the far/redundant chains the cap targets. The mate scan is bounded (MATE_SCAN_MAX, 256) to keep it O(n) in practice.

The window is sized to the aligner’s own proper-pair insert bound. --extend-mate-concordant (bare) is auto: it uses the estimated pes[FR].high (inferred from the data each chunk and read by the next chunk’s cap, which runs before pairing), falling back to a built-in default until the insert size is known. --extend-mate-concordant=INT pins a fixed bp window; =0 disables. Matching the window to the insert bound matters because retained chains are then extended — a wide window admits far and spurious concordant chains, adding alignment CPU on chain-rich reads, so auto keeps only genuine pair anchors.
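The retention rule can be sketched as follows. This is a simplified model: each chain is reduced to a contig, a leftmost reference position, a strand, and a weight, and "FR within a window" is read as a forward chain at or upstream of a reverse chain by no more than `window` bp. The `Chain` type and function names are illustrative; only `MATE_SCAN_MAX` comes from the text.

```python
from collections import namedtuple

Chain = namedtuple("Chain", "contig pos is_reverse weight")
MATE_SCAN_MAX = 256  # bound on the number of mate chains scanned


def is_concordant(chain, mate, window):
    """Same contig, FR orientation, and within `window` bp."""
    if chain.contig != mate.contig or chain.is_reverse == mate.is_reverse:
        return False
    fwd, rev = (chain, mate) if not chain.is_reverse else (mate, chain)
    return 0 <= rev.pos - fwd.pos <= window


def retained_chains(chains, mate_chains, cap, window):
    """Top-`cap` chains by weight, plus mate-concordant chains below the cap."""
    if cap <= 0 or len(chains) <= cap:
        return list(chains)  # no cap in effect, so nothing to rescue
    ranked = sorted(chains, key=lambda c: c.weight, reverse=True)
    kept = ranked[:cap]
    mates = mate_chains[:MATE_SCAN_MAX]
    for chain in ranked[cap:]:
        if window > 0 and any(is_concordant(chain, m, window) for m in mates):
            kept.append(chain)
    return kept


strong = Chain("chr1", 1000, False, 90)
decoy = Chain("chr2", 5000, False, 80)
true_chain = Chain("chr3", 700, False, 40)  # low weight, but pairs with the mate
READ_CHAINS = [strong, decoy, true_chain]
MATE_CHAINS = [Chain("chr3", 1000, True, 70)]

print(retained_chains(READ_CHAINS, MATE_CHAINS, cap=2, window=500))
print(retained_chains(READ_CHAINS, MATE_CHAINS, cap=2, window=0))
```

With a window of 500 bp the low-weight chr3 chain survives the cap because it pairs with the mate; with the window disabled the plain top-2 cap applies.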

Validation. On the canonical bench (holodeck eval, 5 reps on m7i), the option recovers part of the --fast --meth regression at ~1% alignment CPU with the auto window:

configuration (`--fast --meth`)	placement correct % (sim-meth-place / sim-meth-vars)	MAPQ ≥ 30 mismap % (sim-meth-place / sim-meth-vars)	align CPU vs 0.5.0
default `--meth` (reference)	94.27 / 95.11	5.73 / 4.89	n/a
0.5.0 cap-5 (regressed)	93.14 / 94.22	6.86 / 5.78	baseline
+ `--extend-mate-concordant` (auto)	93.71 / 94.46	6.29 / 5.54	+1%

So placement recovers roughly a quarter to a half of the gap (not to parity), confident mismaps recover similarly, RSS is unchanged, and the CPU cost is ~1%. The window sizing is what makes it cheap: a fixed 2000 bp window recovered the same accuracy but cost +16–21% CPU on chain-rich simulated reads, because it retained and extended far/spurious concordant chains; the auto pes[FR].high window admits only genuine pair anchors. (Turning the chain cap off entirely under --meth recovers similar accuracy but at a larger, uniform CPU cost.) Full per-dataset figures: fg-labs/bwa-mem3#195.
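The "quarter to a half" figure can be reproduced from the table. The sketch below computes the recovered share of the regression as (fixed - regressed) / (reference - regressed); the helper name recovered_pct is hypothetical, and the inputs are the placement-correct percentages quoted above.

```shell
# Share of the cap-5 regression that the option recovers, in percent.
# Arguments: reference value, regressed value, value with the fix.
recovered_pct() {
  awk -v ref="$1" -v reg="$2" -v fix="$3" \
    'BEGIN { printf "%.0f\n", 100 * (fix - reg) / (ref - reg) }'
}

recovered_pct 94.27 93.14 93.71   # sim-meth-place
recovered_pct 95.11 94.22 94.46   # sim-meth-vars
```

This gives about 50% for sim-meth-place and about 27% for sim-meth-vars, which is where the range in the text comes from.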

--fast enables it automatically for both non-meth and --meth runs. The non-meth case (fg-labs/bwa-mem3#202): the top-INT cap inflates the confident (MAPQ ≥ 1) mis-placement tail 3.8× on sim-wgs-place (3,626 → 13,921 vs uncapped) by the same mechanism — the true chain is low-weight but mate-concordant in ~98% of cap-dropped reads. Paired with --max-extend-chains 20, mate-concordant retention closes most of that tail (→ ~4,450, verified on a rebuilt --fast) at ~−20% aligner CPU. It is a no-op unless a chain cap is actually in effect, and like --max-extend-chains it is not byte-identical when it retains a chain. --extend-mate-concordant and --fast stay opt-in; the drop-in defaults are unchanged.

Speed: drop-in and recommended

When comparing bwa-mem3 to a stock bwa-mem2 baseline, report both layers:

Drop-in speed-up — bwa-mem3 vs bwa-mem2 at identical settings, from its vectorized kernels, lockstep SMEM batching, libsais indexing, and mimalloc. See Performance Overview and What’s Different — Performance.
Recommended-profile speed-up — the additional ~11–22% alignment-CPU reduction from -m 10, on top of the drop-in gain.

A benchmark that quotes only default-vs-default settings understates the throughput available to a caller who has adopted the recommended profile; quote both so readers can pick the comparison that matches their deployment.

Situational: `--supp-rep-hard-cap` for SV-aware pipelines

This is not part of the recommended profile — it is a situational knob, not a free speed-up, and the default (0, off) is correct for plain alignment.

If you run structural-variant or breakpoint detection downstream (e.g. fgsv SvPileup), --supp-rep-hard-cap 20 suppresses repeat-induced spurious supplementary breakpoints — split alignments anchored in repeats whose mapping quality is overestimated — at no measured cost to real SV breakpoints. On GIAB HG002 CMRG (GRCh38, 2×250) it stripped repeat artifacts while preserving every credible real-SV breakpoint, including ones in moderately-repetitive regions.

Do not use more aggressive values: --supp-rep-hard-cap ≤ 10 began suppressing real GIAB SV breakpoints (a verified insertion at chr3:45890256 and deletion at chr18:79739776) whose supporting split reads happen to land in repetitive regions.

Caveats: this was measured on a single repeat-enriched truth set at the breakpoint level (not end-to-end SV calls), so treat 20 as a sensible starting point for SV workflows rather than a universal recommendation. It has no effect on primary-alignment MAPQ or on non-SV pipelines.

Situational: `--adaptive-band` for long reads

This is not part of the recommended (short-read) profile — it is a long-read lever, and the default (off) is correct for standard Illumina WGS/WES.

--adaptive-band starts banded Smith-Waterman tight and expands each extension only to the band its chain’s seed geometry implies, instead of the fixed -w band (100) for every extension. The band only constrains the DP matrix when the extension’s reference window exceeds it — an intrinsically long-read condition — so:

Use it for long reads: SBX, PacBio HiFi, ONT, or any run whose reads are roughly ≥ 200 bp. On SBX (HG002, 240 bp+) it cut alignment CPU by ~25 %.
No-op on short reads: WGS (~150 bp) and WES (~76 bp) extensions are already smaller than the band, so there is nothing to trim; those reads run on the 8-bit kernel, which the option leaves untouched. Enabling it on a short-read run neither helps nor hurts.

Accuracy is preserved: placement is identical to default on holodeck sim-wgs-place (MAPQ-60+ mismaps unchanged), and indel representation matches the -w 100 default (indels up to the chaining limit still emit a single D/I CIGAR, so small/mid-size indel callers are unaffected). Like --fast, it is not byte-identical when enabled (it shifts a small number of borderline secondary alignments), which is why it is an opt-in flag rather than a default.
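A small sketch of how a pipeline might decide whether the flag is worth passing, using the roughly 200 bp rule of thumb above. The helper names and the reads.fq.gz path are hypothetical, and the length estimate assumes plain four-line FASTQ records.

```shell
# Mean read length of a FASTQ stream on stdin
# (the sequence is line 2 of each 4-line record).
mean_read_len() {
  awk 'NR % 4 == 2 { n++; total += length($0) }
       END { if (n) printf "%d\n", total / n; else print 0 }'
}

# Emit the flag only at or above the ~200 bp rule of thumb.
adaptive_band_flag() {
  if [ "$1" -ge 200 ]; then echo "--adaptive-band"; fi
}

# Example on real data (commented out; reads.fq.gz is a placeholder):
# len=$(gzip -dc reads.fq.gz | head -n 40000 | mean_read_len)
# bwa-mem3 mem $(adaptive_band_flag "$len") -t 16 ref.fa reads.fq.gz
```

Because the option is a no-op on short reads, the check mainly documents intent; passing the flag unconditionally would not change short-read results beyond what the text describes.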

Build

This page describes the recommended build configuration for production use of bwa-mem3.

Choose the right arch target

The default make invocation builds a single multi-tier binary on x86 (or a single NEON binary on arm64). For production clusters where the CPU family is uniform, you can trim further by building one tier only — the binary drops the per-tier dispatch table and ships a single kernel path:

# Most modern x86-64 servers (Haswell or later):
make arch=avx2

# Intel Cascade Lake / Sapphire Rapids, AWS c7i/m7i:
make arch=avx512bw

# Apple Silicon / AWS Graviton:
make arch=arm64

Omit arch= if the deployment target is heterogeneous or unknown; the default make produces a single binary that includes every supported x86 tier and dispatches at runtime via __builtin_cpu_supports. Tune the non-kernel TU compile baseline with BASELINE_ARCH= (default avx2) — see Single-binary SIMD dispatch (x86).

See SIMD dispatch matrix for the full list of targets and which kernels each vectorizes.
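For a uniform fleet, the single-tier target can be chosen from the CPU flags of a representative host. The sketch below is an assumption-laden helper (pick_arch is hypothetical) that only returns the three arch= values shown above and otherwise falls back to the default multi-tier make. The flag names are the ones Linux reports in /proc/cpuinfo.

```shell
# Choose a single-tier arch= value from a machine name and a CPU-flag string.
# pick_arch is a hypothetical helper; it only returns targets named above.
pick_arch() {
  machine=$1
  flags=" $2 "
  case "$machine" in
    aarch64|arm64) echo "arm64"; return ;;
  esac
  case "$flags" in
    *" avx512bw "*) echo "avx512bw" ;;
    *" avx2 "*)     echo "avx2" ;;
    *)              echo "" ;;   # older x86: keep the default multi-tier make
  esac
}

# On Linux, the flag string can be read from /proc/cpuinfo (commented out):
# flags=$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)
# arch=$(pick_arch "$(uname -m)" "$flags")
# if [ -n "$arch" ]; then make arch="$arch"; else make; fi
```

This only makes sense when the build host is representative of every deployment host; for mixed fleets the default make remains the safer choice.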

Use a recent compiler (especially on ARM)

Use the newest C++ compiler available, and on ARM/aarch64 prefer a recent clang. The compiler matters more on ARM than on x86: the aarch64 build runs its SIMD through the sse2neon translation layer rather than hand-written intrinsics, so codegen quality — and therefore throughput — depends heavily on the compiler and its version.

Measured on AWS Graviton4 (c8g.4xlarge, 16 cores), hg38, 5M read pairs, make arch=arm64, best-of-3 CPU-seconds:

Compiler	CPU-seconds	vs gcc 15.2
gcc 15.2	1779	—
clang 22.1	1679	~6% faster

Two takeaways:

clang generally emits better NEON than gcc for sse2neon-translated code — about 6% fewer CPU-seconds here.
Compiler version matters as much as the vendor. A larger ~18% clang-over-gcc gap has been reported against an older gcc (~13); against a modern gcc (15.2) it narrows to ~6%, because recent gcc closed most of the NEON-codegen gap. Bumping the gcc version is often most of the win even without switching to clang.

If you build the arm64 binary with clang, note the OpenMP runtime changes from libgomp to libomp (llvm-openmp) — see Multi-architecture deployment.

Profile-Guided Optimization (PGO)

PGO adds 3–5% throughput on real workloads and is recommended for any installation that runs many alignment jobs against the same reference. It is opt-in — the default make does not use it. The generate → train → use workflow, the PGO_ARCH= selector, PGO_PROFILE_DIR=, and the training-data caveats are all in Performance → PGO build; the Summary below shows the production recipe.

mimalloc

mimalloc is compiled in by default (USE_MIMALLOC=1). The allocator improves multi-threaded throughput by reducing lock contention on malloc and free hot paths. Run bwa-mem3 version to confirm it is active:

bwa-mem3 version
# Expected output includes a line like:
#   mimalloc 3.x.x

To build without mimalloc (for example, when using AddressSanitizer or on a system with a known-incompatible allocator):

make USE_MIMALLOC=0

Summary

For a production installation on a known x86 server with AVX2:

make pgo-generate PGO_ARCH=avx2
./bwa-mem3.pgo-instr.avx2 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null
make pgo-use PGO_ARCH=avx2
# Deploy: bwa-mem3.pgo.avx2

On ARM/aarch64 (Apple Silicon, AWS Graviton), build with a recent clang and apply PGO on top:

make pgo-generate PGO_ARCH=arm64 CXX=clang++
./bwa-mem3.pgo-instr mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null
make pgo-use PGO_ARCH=arm64 CXX=clang++
# Deploy: bwa-mem3.pgo

Output Format

The choice of output format (SAM, compressed BAM, or uncompressed BAM) has a measurable effect on end-to-end pipeline wall time. SAM is what bwa-mem3 writes when --bam is not set; this page explains why uncompressed BAM is the better choice when piping into samtools sort and shows the recommended pipeline.

Why uncompressed BAM is faster than SAM

When bwa-mem3 writes SAM (the default when --bam is not set), every alignment record must be serialized into ASCII text: integers are formatted as decimal strings, bases are encoded as characters, and flags are written as decimal numbers. The receiving process — typically samtools sort — then parses each field back from text into binary integers. Both conversions are pure overhead: the data is binary inside bwa-mem3 and binary inside samtools; text is only an interchange format that is immediately discarded.

Uncompressed BAM (--bam=0) bypasses this round-trip. bwa-mem3 writes binary BAM records directly via htslib’s wb0 mode. The write path performs no text formatting; the read path in samtools sort performs no text parsing. The htslib overhead of the wb0 write is negligible — it is effectively a buffered write(2) call with a small BAM block header prepended.

Compressed BAM (--bam=1) adds BGZF compression on top, which costs CPU on the write side and gains nothing: the stream only crosses a kernel pipe buffer, never a disk or a network, samtools sort has to decompress it again on read, and it re-compresses its own output anyway. Compressed BAM on a pipe wastes CPU on both sides.

Recommended pipeline

bwa-mem3 mem --bam=0 -t 16 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -@ 8 -o out.bam -
samtools index out.bam

The -@ 8 flag gives samtools sort eight compression threads for writing the final sorted BAM. Tune this number based on available cores; the total core count should be split so that alignment threads and sort threads do not contend. A 16:8 split (bwa-mem3:samtools) works well on 24-core machines.

Tip — Thread allocation

Do not give all cores to bwa-mem3. Downstream samtools sort needs threads to compress and write the sorted BAM. Leaving 4–8 threads for samtools sort keeps the pipeline balanced and prevents a write bottleneck that would stall the aligner.
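One way to turn the guidance above into a number is a small helper that splits a core budget. The rule used here (one third of the cores for samtools sort, clamped to the 4 to 8 range from the tip) is an assumption chosen so that 24 cores yields the 16:8 split quoted earlier; split_threads is a hypothetical name.

```shell
# Split a core budget between the aligner and samtools sort.
# Heuristic (an assumption): sort gets one third of the cores, clamped to 4..8.
split_threads() {
  cores=$1
  sort_t=$((cores / 3))
  if [ "$sort_t" -lt 4 ]; then sort_t=4; fi
  if [ "$sort_t" -gt 8 ]; then sort_t=8; fi
  align_t=$((cores - sort_t))
  if [ "$align_t" -lt 1 ]; then align_t=1; fi
  echo "$align_t $sort_t"
}

split_threads 24   # aligner threads, then sort threads

# Usage in a pipeline (commented out; file names are placeholders):
# set -- $(split_threads "$(nproc)")
# bwa-mem3 mem --bam=0 -t "$1" ref.fa R1.fq.gz R2.fq.gz \
#   | samtools sort -@ "$2" -o out.bam -
```

Treat the output as a starting point and adjust after watching CPU utilisation of both processes on real data.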

Methylation output

The --meth path always writes uncompressed BAM internally, regardless of the --bam flag. The post-processing step (header rewrite, Bismark XR:Z / XG:Z / XM:Z tag emission, opt-in chimera QC) is performed inline before the record is handed to htslib, so the same pipeline shape applies:

bwa-mem3 mem --meth --bam=0 -t 16 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -@ 8 -o out.bam -
samtools index out.bam

When SAM is appropriate

SAM (the default, equivalent to omitting --bam) remains the right choice for:

Debugging. Plain text is readable with less, grep, and any text editor, making it easy to inspect individual records without samtools view.
Ad-hoc inspection. When you need to scan a few thousand reads to diagnose a mapping problem, piping to SAM and reading the output directly is faster than writing a BAM file and then querying it.
Compatibility with tools that require SAM input. Some legacy tools do not accept BAM. If the downstream tool does not support BAM, use SAM.

For production alignment jobs that feed samtools sort, always use --bam=0.

Summary table

Format	`--bam` value	Pipe overhead	Recommended for
SAM	(default / omit)	High (text round-trip)	Debugging, ad-hoc inspection
Uncompressed BAM	`0`	Negligible	Production pipelines
Compressed BAM	`1`	High on write side	Writing directly to a file (no downstream sort)

Multi-Sample Workflows

When you need to align many samples back-to-back against the same reference on a single machine, loading the FM-index into shared memory once — and keeping it resident across all alignment jobs — eliminates the index I/O cost for every sample after the first.

The problem: repeated index loads

The bwa-mem3 index for hg38 is roughly 11 GB on disk (~15 GB resident once loaded). Without shared memory, bwa-mem3 mem reads the entire index from disk on every invocation. On a fast NVMe drive this takes 30–60 seconds; on a network-attached or spinning-disk filesystem it can take several minutes. For a batch of 100 samples, that is 50–100 minutes of pure I/O overhead on NVMe and several hours on slower storage.

Staging the index once with `bwa-mem3 shm`

# Stage the index into shared memory (one-time cost, ~15 GB for hg38).
bwa-mem3 shm ref.fa

# Align each sample. bwa-mem3 mem attaches automatically — no extra flag.
bwa-mem3 mem --bam=0 -t 16 ref.fa sample1_R1.fq.gz sample1_R2.fq.gz \
  | samtools sort -@ 4 -o sample1.bam -
bwa-mem3 mem --bam=0 -t 16 ref.fa sample2_R1.fq.gz sample2_R2.fq.gz \
  | samtools sort -@ 4 -o sample2.bam -
# ...

# When done, release the segment.
bwa-mem3 shm -d

For methylation workflows, stage the c2t index instead:

bwa-mem3 shm --meth ref.fa
bwa-mem3 mem --meth --bam=0 -t 16 ref.fa sample1_R1.fq.gz sample1_R2.fq.gz \
  | samtools sort -@ 4 -o sample1.bam -
bwa-mem3 shm -d

Confirming the index is staged

bwa-mem3 shm -l
# Prints the basename and memory usage of each staged segment.

If the listing is empty, the index is not staged and bwa-mem3 mem will fall back to loading from disk.
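A batch script can refuse to start when the index is not staged, instead of silently paying the disk load on every sample. The sketch below assumes only that the reference basename appears somewhere in the bwa-mem3 shm -l listing; the exact listing layout is not specified above, and is_staged is a hypothetical helper.

```shell
# Succeeds if the reference basename appears in a `bwa-mem3 shm -l`
# listing supplied on stdin. The listing layout is an assumption.
is_staged() {
  grep -qF -- "$1"
}

# Guard for a batch script (commented out; ref.fa is a placeholder):
# if ! bwa-mem3 shm -l | is_staged "ref.fa"; then
#   echo "index not staged; run: bwa-mem3 shm ref.fa" >&2
#   exit 1
# fi
```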

Thread layout for parallel alignment

Running multiple bwa-mem3 mem instances in parallel is efficient when the samples are independent and the machine has enough cores. The shared-memory index eliminates disk contention, so the bottleneck becomes CPU and memory bandwidth.

Guidelines for N-core machines:

N = 32: Two instances at -t 14 each (28 aligner threads), each piped to samtools sort -@ 4. That leaves 4 cores of headroom, which the sort threads share with the OS and I/O.
N = 64: Two to four instances at -t 14 to -t 16, each with -@ 4 for samtools sort.
N = 128: Four to eight instances; keep at least 8–16 cores free for samtools sort threads and OS scheduling.

Tip — Memory bandwidth limit

The FM-index lookup is memory-bandwidth bound. On machines with NUMA topology (multi-socket or multi-chiplet), binding each bwa-mem3 instance to a NUMA node with numactl --cpunodebind=N --membind=N can improve throughput by reducing cross-node memory traffic.
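As a sketch of the binding described in the tip, the helper below prints one numactl-wrapped aligner command per NUMA node. It is a dry run: nothing is executed, the sample names and -t 14 are placeholders, numa_commands is a hypothetical name, and the downstream samtools sort stage is left out for brevity.

```shell
# Print one numactl-wrapped command per NUMA node (dry run only).
# Sample naming (sampleN) and -t 14 are placeholders.
numa_commands() {
  nodes=$1
  node=0
  while [ "$node" -lt "$nodes" ]; do
    s="sample$((node + 1))"
    printf 'numactl --cpunodebind=%d --membind=%d bwa-mem3 mem --bam=0 -t 14 ref.fa %s_R1.fq.gz %s_R2.fq.gz\n' \
      "$node" "$node" "$s" "$s"
    node=$((node + 1))
  done
}

numa_commands 2
```

The node count for a given host can be read from numactl --hardware before generating the commands.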

Scripting a batch with a loop

bwa-mem3 shm ref.fa

for sample in sample1 sample2 sample3; do
  bwa-mem3 mem --bam=0 -t 16 ref.fa "${sample}_R1.fq.gz" "${sample}_R2.fq.gz" \
    | samtools sort -@ 4 -o "${sample}.bam" -
  samtools index "${sample}.bam"
done

bwa-mem3 shm -d

For parallel execution, replace the for loop body with a background job (or use a workflow manager such as Snakemake or Nextflow) and limit the degree of parallelism to match available cores.
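A minimal sketch of the background-job variant, assuming plain POSIX sh and simple batching: it starts up to MAX_JOBS samples, waits for the whole group, then starts the next group. The names align_one, MAX_JOBS, and DRY_RUN are hypothetical, and DRY_RUN defaults to 1 so the block only prints what it would do.

```shell
MAX_JOBS=2                 # concurrent bwa-mem3 instances (placeholder)
DRY_RUN=${DRY_RUN:-1}      # 1 = only print what would run

align_one() {
  s=$1
  if [ "$DRY_RUN" = 1 ]; then
    echo "would align $s"
    return 0
  fi
  bwa-mem3 mem --bam=0 -t 14 ref.fa "${s}_R1.fq.gz" "${s}_R2.fq.gz" \
    | samtools sort -@ 4 -o "${s}.bam" - \
    && samtools index "${s}.bam"
}

running=0
for s in sample1 sample2 sample3; do
  align_one "$s" &
  running=$((running + 1))
  if [ "$running" -ge "$MAX_JOBS" ]; then
    wait                   # simple batching: let the current group finish
    running=0
  fi
done
wait
```

Group-wise waiting leaves cores idle when samples differ a lot in size; a workflow manager schedules more tightly and is the better fit for large batches.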

Warning — Stale segment footgun

If you need to re-index the reference (e.g. after updating it), always run bwa-mem3 shm -d before bwa-mem3 index. There is no automatic staleness check. See Anti-patterns for details.

Methylation Defaults

bwa-mem3 mem --meth ships with scoring and filtering defaults aligned with the bwameth.py reference implementation (in the default collapsed scoring mode). This page is prescriptive: when to keep those defaults and when to override them. For what the flags are and how the scoring modes work, see the reference:

Scoring modes (collapsed vs genomic) and the full --meth flag set — Methylation Reference → Flags.
Placement compatibility with bwameth.py (the default collapsed mode is a placement drop-in, not byte-identical — ~1% of records differ in POS/CIGAR/MAPQ, so re-validate if you are pinned to a specific bwameth release) — bwameth.py drop-in mapping.

When to keep the defaults

For standard whole-genome bisulfite sequencing (WGBS) workflows, the defaults (collapsed scoring plus -B 2 -L 10 -U 100 -T 40 -M -C) are appropriate as-is. They were derived from the bwameth.py codebase and are expected by most downstream methylation callers. Unless you have a specific reason to deviate, use:

bwa-mem3 mem --meth -t 16 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -@ 4 -o out.bam -
samtools index out.bam

When to override

Low-coverage or targeted bisulfite sequencing. If your library covers a small target region and insert sizes are more variable, consider lowering -T (e.g. -T 20) to recover short or soft-clipped alignments in the target.

Amplicon bisulfite sequencing. Amplicon reads have uniform insert sizes; the default -U 100 is appropriate. However, if your amplicons are short (< 100 bp), consider raising -L (the end-clipping penalty) above the --meth default of 10 to reduce clipping at read ends.

Non-standard conversion chemistry. Some library preparations use only one strand conversion (C→T only, not G→A). In such cases, --set-as-failed r suppresses alignments to the reverse-complement strand, which reduces noise from strand-ambiguous alignments:

bwa-mem3 mem --meth --set-as-failed r -t 16 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -@ 4 -o out.bam -

Chimera QC is opt-in (off by default, matching Bismark). Leave it off for standard directional EM-seq / WGBS. Turn it on for PBAT / scBS-Seq libraries (where intra-fragment chimerism is common) or when you want bwameth.py-equivalent flagging — pass --chimera-qc (see Flags for the exact heuristic):

bwa-mem3 mem --meth --chimera-qc -t 16 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -@ 4 -o out.bam -

Note — Overriding scoring defaults

Scoring flags supplied on the command line override the defaults set by --meth. For example, bwa-mem3 mem --meth -B 4 ... uses -B 4 (not 2). Order does not matter: the defaults are applied after the whole command line is parsed, so -B 4 --meth and --meth -B 4 behave identically.

This applies to -B, -L, -U, and -T. It does not apply to -M and -C: bwa has no option that unsets either one, so --meth sets them unconditionally and they cannot be turned off from the command line. See Flags → -M and split alignments.
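The order-independence follows from applying defaults only after the whole command line has been read. The toy model below illustrates that rule for -B alone; it is not the real parser, effective_B is a hypothetical name, 2 is the --meth default quoted above, and 4 is the stock bwa mem default.

```shell
# Toy model of "parse everything, then fill in defaults". NOT the real parser.
effective_B() {
  meth=0
  user_B=""
  while [ $# -gt 0 ]; do
    case "$1" in
      --meth) meth=1 ;;
      -B)     user_B=${2-}; [ $# -gt 1 ] && shift ;;
    esac
    shift
  done
  if [ -n "$user_B" ]; then
    echo "$user_B"          # an explicit value always wins
  elif [ "$meth" = 1 ]; then
    echo 2                  # --meth default
  else
    echo 4                  # stock default
  fi
}

effective_B --meth -B 6     # prints 6
effective_B -B 6 --meth     # prints 6
effective_B --meth          # prints 2
```

A parser that applied the --meth defaults at the moment it saw the flag would give different answers for the first two calls, which is the behavior the note rules out.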

Downstream tool compatibility

The --meth output BAM carries the Bismark XR:Z / XG:Z / XM:Z tag set, so it feeds Bismark-aware callers directly — bismark_methylation_extractor, methylKit, methtuple, DMRfinder, epialleleR, MethylDackel, and biscuit per-read tools have all been used successfully. See SAM tags for the tag definitions and which tools read which tag.

Multi-architecture deployment

This page covers running bwa-mem3 in heterogeneous compute environments — AWS Batch with mixed instance families, GCP Batch with mixed CPU platforms, on-prem Slurm with mixed nodes, Kubernetes clusters with mixed node pools.

Within x86_64: one binary, dynamic dispatch

bwa-mem3 ships a single x86_64 binary that contains five SIMD kernel tiers (sse41, sse42, avx, avx2, avx512bw) and selects the best one at runtime via __builtin_cpu_supports. See src/simd_dispatch.cpp for the dispatcher and src/kernel_dispatch.h for the per-tier symbol mangling.

Build once at the BASELINE_ARCH floor that matches your fleet’s oldest x86 host. The default BASELINE_ARCH=avx2 covers Intel Haswell (2013) and AMD Zen (2017) onward. Within that floor, every host transparently uses its best available tier for the hot kernel paths.

Across x86_64 and arm64

A single ELF binary cannot span CPU architectures. You must build two binaries (one for x86_64, one for arm64) and package them so the right one runs on each host.

The recommended approach is a Docker manifest-list (multi-arch) image of your own making, with one image per architecture under a single tag. Example:

FROM ubuntu:24.04 AS build
RUN apt-get update && apt-get install -y \
    build-essential git cmake pkg-config \
    autoconf automake autoconf-archive libtool \
    zlib1g-dev libdeflate-dev
WORKDIR /src
RUN git clone --recursive https://github.com/fg-labs/bwa-mem3 .
RUN make -j

FROM ubuntu:24.04
COPY --from=build /src/bwa-mem3 /usr/local/bin/bwa-mem3
RUN apt-get update && apt-get install -y libgomp1 zlib1g libdeflate0 \
 && rm -rf /var/lib/apt/lists/*
ENTRYPOINT ["bwa-mem3"]

Build for both architectures with one command:

docker buildx build --platform linux/amd64,linux/arm64 \
  -t <registry>/<image>:<tag> --push .

Building the arm64 image with clang? A recent clang produces faster NEON code than gcc on aarch64 (see Best Practices → Build). If you switch the arm64 build to clang, swap the OpenMP runtime: clang links libomp (LLVM) rather than libgomp, so the runtime stage needs libomp5 (or llvm-openmp) in place of libgomp1.

AWS Batch, GCP Batch, Kubernetes, and containerd all read the manifest list and pull the image that matches the host’s architecture. The submitter references one tag; the runtime picks the right binary automatically.

Verifying at runtime

bwa-mem3 version reports the build’s floor, the kernels compiled in, and the resolved runtime tier. Use this in CI or in your Batch job’s startup script to confirm the right image was pulled:

$ bwa-mem3 version
v0.2.0-12-gabcdef1
SIMD floor: avx2 (x86-64-v3, Haswell 2013+); kernels: sse41 sse42 avx avx2 avx512bw
SIMD runtime: avx512bw (BWAMEM3_FORCE_TIER unset)

Grep for SIMD runtime: to record the tier each job ran at — useful for post-mortem diagnosis of perf regressions.
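A minimal sketch of that logging step, assuming the version output keeps the "SIMD runtime:" line format shown in the sample above. The helper name simd_runtime_tier is hypothetical.

```shell
# Extract the resolved tier from `bwa-mem3 version` output on stdin.
# Relies on the "SIMD runtime: <tier> (...)" line shown in the sample above.
simd_runtime_tier() {
  awk '/^SIMD runtime:/ { print $3; exit }'
}

# In a job startup script (commented out):
# tier=$(bwa-mem3 version | simd_runtime_tier)
# echo "bwa-mem3 tier=$tier host=$(hostname)" >&2
```

Writing the tier next to the host name in the job log makes it easy to group run times by tier afterwards.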

Pre-Haswell hosts

If your fleet really must include pre-Haswell x86 (for example AWS c3/m3, or Sandy Bridge / Ivy Bridge Xeons), rebuild with a lower floor:

make BASELINE_ARCH=sse41

Expect roughly 10-15% slower wall time on AVX2 hosts in the same container compared to a default BASELINE_ARCH=avx2 build. This is the trade-off for broader host coverage; only do it if you actually need pre-Haswell support.

The default BASELINE_ARCH=avx2 covers virtually every modern compute environment. AWS, GCP, and Azure all default to Haswell-or-newer instance types in current-generation compute environments.

What the host-floor precheck does

If a job is scheduled onto a host that doesn’t meet the build’s floor (e.g. an avx2-baseline binary lands on a pre-Haswell host), bwa-mem3 mem refuses to run with exit code 2 and a clear stderr message:

[E::bwamem3] this binary was compiled for SIMD floor avx2 and emits avx2
instructions in non-kernel translation units. The host CPU does not support
avx2 (detected: sse42). Running would SIGILL on the first avx2 instruction.

To run on this host, rebuild bwa-mem3 with BASELINE_ARCH=sse42 (or lower),
or use a binary built for a lower SIMD floor.

This is a clean failure: the job exits before any billable alignment work starts. Compare to the alternative without the precheck (SIGILL deep inside an alignment job, opaque process death, wasted compute).

Defence-in-depth recommendation: configure your AWS Batch compute environment (or equivalent) to exclude instance families older than your binary’s floor. The precheck protects against accidental scheduling; an allowlist at the orchestrator level prevents the scheduling decision in the first place.
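A job wrapper can tell the precheck failure apart from ordinary failures so that the orchestrator reschedules instead of retrying on the same host. The sketch below keys on exit code 2 plus the "SIMD floor" phrase from the message above; checking both is a precaution, on the assumption that other errors might also exit with 2. The name classify_exit is hypothetical.

```shell
# Classify a finished bwa-mem3 job from its exit code and captured stderr file.
classify_exit() {
  rc=$1
  err_file=$2
  if [ "$rc" -eq 0 ]; then
    echo "ok"
  elif [ "$rc" -eq 2 ] && grep -q 'SIMD floor' "$err_file"; then
    echo "host-below-floor"   # reschedule elsewhere; retrying here cannot succeed
  else
    echo "failed"
  fi
}

# Usage (commented out; file names are placeholders):
# bwa-mem3 mem --bam=0 -t 16 ref.fa R1.fq.gz R2.fq.gz > out.ubam 2> align.err
# classify_exit "$?" align.err
```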

Anti-Patterns

This page documents common mistakes that produce incorrect results or unnecessary failures when using bwa-mem3.

Re-indexing without dropping the shared-memory segment

Warning — Footgun

bwa-mem3 shm does not detect stale segments. If you re-run bwa-mem3 index after a shared-memory segment is already staged, the on-disk index files will not match the in-memory segment. bwa-mem3 mem will attach to the stale segment and produce incorrect alignments without any warning.

Always run bwa-mem3 shm -d before re-indexing:

bwa-mem3 shm -d           # drop all staged segments
bwa-mem3 index ref.fa     # rebuild the on-disk index
bwa-mem3 shm ref.fa       # re-stage the new index

There is no automatic staleness check in the implementation. The segment name is derived from the reference basename only; no content hash or modification timestamp is stored.

To confirm that no stale segments are staged, use bwa-mem3 shm -l before running any indexing step.

Forgetting to initialize submodules

bwa-mem3 depends on several submodules (ext/htslib, ext/libsais, ext/mimalloc, ext/sse2neon). A clone made without --recursive leaves those directories empty, so the build fails: typically at compile time with missing headers, or at the linking step with missing symbols.

Warning — Missing submodules

Always clone with --recursive, or initialize submodules after cloning:

git clone --recursive https://github.com/fg-labs/bwa-mem3
# or, after a plain (non-recursive) clone:
git submodule update --init --recursive

If make reports missing headers (e.g. htslib/hts.h: No such file or directory), the submodules were not initialized.
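A pre-build check can catch this before make produces a long error log. The sketch below treats a submodule directory as uninitialized when it is absent or empty; the directory names are the ones listed above (the list may not be exhaustive), and missing_submodules is a hypothetical helper.

```shell
# List submodule directories (names from the text above) that are absent or empty.
missing_submodules() {
  root=$1
  for d in ext/htslib ext/libsais ext/mimalloc ext/sse2neon; do
    if [ -z "$(ls -A "$root/$d" 2>/dev/null)" ]; then
      echo "$d"
    fi
  done
}

# Before building (commented out):
# if [ -n "$(missing_submodules .)" ]; then
#   git submodule update --init --recursive
# fi
# make
```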

Leaving `BASELINE_ARCH` at the default on a known higher-tier CPU

The default make (no arch=) builds the multi-tier single binary with non-kernel TUs compiled at BASELINE_ARCH=avx2. On a production server with a known higher-tier CPU family, this leaves auto-vectorized non-kernel hot paths at 256-bit width when the host could go wider, or keeps the host-floor precheck at avx2 when the deployment surface is strictly AVX-512.

Pick a BASELINE_ARCH= (or a single-tier arch=) build target that matches the deployment; see Build → arch targets. The default is correct only when the binary is distributed across multiple CPU families or the target is genuinely unknown. Note that BASELINE_ARCH=avx512bw does not always beat avx2 even on AVX-512 hosts; see BASELINE_ARCH=avx512bw build flag for the empirical characterization.

Mixing bwa-mem3 and bwa-mem2 outputs in the same pipeline

bwa-mem3 adds several custom SAM tags that bwa-mem2 does not emit: HN:i (total number of primary alignments — both reported and suppressed — that the aligner found for this read, before the -h supplementary cap is applied), and — in --meth mode — the Bismark-compatible XR:Z (read conversion direction), XG:Z (genome strand), and XM:Z (per-base methylation call string) tags. It also rewrites @SQ header lines in --meth mode (collapsing f/r strand prefixes back to one entry per chromosome).

Warning — Header and tag mismatch

Do not merge BAM files produced by bwa-mem3 and bwa-mem2 without verifying that the @PG headers and custom tags are handled correctly by the downstream tool. In methylation workflows, a bwa-mem2 BAM mixed into a bwa-mem3 --meth pipeline will be missing the XR:Z / XG:Z / XM:Z Bismark annotations, which will cause methylation callers to silently drop or misclassify those records.

If you must merge outputs from both tools, run samtools view -H on both files and confirm that @SQ lines are consistent and that the downstream tool can tolerate the tag differences.
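To see how many records in a merged file lack the Bismark annotations, a short awk filter over the SAM text is enough. This is a sketch: count_missing_tag is a hypothetical helper, and it relies only on the SAM convention that optional tags start at column 12.

```shell
# Count SAM records on stdin that lack a given two-letter tag.
# Header lines (starting with @) are skipped; optional fields start at column 12.
count_missing_tag() {
  awk -v tag="$1" -F '\t' '
    /^@/ { next }
    {
      found = 0
      for (i = 12; i <= NF; i++) if (index($i, tag ":") == 1) { found = 1; break }
      if (!found) missing++
    }
    END { print missing + 0 }'
}

# Usage (commented out; merged.bam is a placeholder):
# samtools view -h merged.bam | count_missing_tag XM
```

A non-zero count on a file that is supposed to be pure --meth output is a sign that records from another aligner were mixed in.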

Writing compressed BAM to a pipe

Passing --bam=1 (compressed BAM) when piping to samtools sort compresses the stream on the bwa-mem3 side and then immediately decompresses it on the samtools side. This wastes CPU on both ends with no benefit.

Use --bam=0 (uncompressed BAM) for all pipe-to-sort workflows. See Output format for the full explanation and recommended pipeline.

CLI Reference Overview

bwa-mem3 exposes four subcommands: index, mem, shm, and version. Run bwa-mem3 <subcommand> --help to see the full option list for any command.

bwa-mem3 mem is command-line compatible with bwa mem and bwa-mem2 mem: every existing flag is accepted, so an existing invocation runs unchanged. The fork-added flags include --bam[=N], --meth (with --meth-scoring, --set-as-failed, --chimera-qc), --supp-rep-hard-cap, --min-ext-len, and --legacy-reader, as well as the opt-in speed flags described elsewhere in this book (--fast, --max-extend-chains, --extend-mate-concordant, --adaptive-band); all are off or output-neutral by default. See Coming from bwa or bwa-mem2.

How this section is structured

Each subcommand page follows the same layout:

Introduction — what the subcommand does and when to reach for it.
Synopsis — the verbatim --help output, auto-captured from the binary at build time and included here via mdbook’s {{#include}} directive. The snippet is regenerated by make docs-cli and CI fails if it drifts from the binary.
Common usage — two or three worked command-line examples.
Flag reference (for mem, grouped by topic) — per-flag prose covering semantics, defaults, and interaction with other flags that the --help text does not have room to explain.
Notes / Gotchas — operational warnings about non-obvious behavior.
See also — cross-links to related pages in this book.

Subcommands

index builds the FM-index from a reference FASTA. Pass --meth to build a dual index (the normal index plus a converted .meth seed index) for methylation alignment.

mem aligns short reads against an indexed reference, producing SAM or BAM output. It is the primary alignment subcommand. The flag surface is large; the mem reference page groups flags by purpose to make them easier to navigate.

shm stages an FM-index into POSIX shared memory so that repeated bwa-mem3 mem invocations on the same machine skip the per-run disk read. It also lists and destroys staged segments.

version prints the bwa-mem3 release version, the SIMD floor, the kernels compiled in, and the resolved runtime tier, and, when mimalloc is compiled in, the mimalloc version.

index

bwa-mem3 index builds the FM-index (BWT + suffix array) that bwa-mem3 mem requires for alignment. Run it once per reference; the resulting files sit alongside the input FASTA and are reused for all subsequent alignment jobs. Pass --meth to build a dual index — the normal index plus a converted .meth seed index — for bisulfite-seq alignment.

Synopsis

Usage: bwa-mem3 index [-p prefix] [-t N] [--max-memory SIZE] [--tmp-dir PATH] [--meth] <in.fasta>

  -p STR             output prefix (default: <in.fasta>)
  -t INT             worker threads [auto: detected cores, cgroup-aware]
  --max-memory SIZE  peak memory budget; SIZE accepts a G/M/K suffix
                     (case-insensitive) or bare bytes
                     [auto: min(50% of RAM, 32G), cgroup-aware]
  --tmp-dir PATH     scratch directory [$TMPDIR]
  --meth             build a BS-aware dual index. Writes the original-alphabet
                     index at <in.fasta>.* plus a converted seed FM-index at
                     <in.fasta>.meth.* (used by `bwa-mem3 mem --meth`).
  --emit-unpacked-ref also write the unpacked `<prefix>.0123` reference. Off by
                     default: `mem` pac-fetches bases from `.pac`, so `.0123`
                     is never read. Enable only for an external consumer that
                     still requires it (e.g. bwa-mem2); ~8x the size of `.pac`.
  -h, --help         print this help message and exit

Common usage

Build a standard index using all available cores:

bwa-mem3 index ref.fa

Build a methylation-aware index (required before bwa-mem3 mem --meth):

bwa-mem3 index --meth ref.fa

Limit peak RAM to 16 GB and write scratch data to /scratch:

bwa-mem3 index --max-memory 16G --tmp-dir /scratch ref.fa

Flag reference

`-p STR` — output prefix

By default, index files are written alongside <in.fasta> using the FASTA path as a prefix (e.g. ref.fa.bwt.2bit.64, ref.fa.pac, etc.). Use -p to write them to a different base path, such as a dedicated index directory:

bwa-mem3 index -p /idx/hg38 ref.fa
# writes /idx/hg38.bwt.2bit.64, /idx/hg38.pac, …
# align with: bwa-mem3 mem /idx/hg38 R1.fq R2.fq

`-t INT` — worker threads

Controls the number of threads used during index construction. The default auto-detects available cores and is cgroup-aware, so it behaves correctly inside containers and on shared cluster nodes. Set explicitly when you want to cap CPU usage.

`--max-memory SIZE` — peak memory budget

Limits how much RAM the indexer may use at once. SIZE accepts a G, M, or K suffix (case-insensitive) or a bare byte count. The default is min(50% of RAM, 32 GB), computed in a cgroup-aware manner.

For large references (hg38 and above) on machines with limited RAM, setting this to a value lower than the reference size causes the indexer to partition work and use --tmp-dir for intermediate files, at the cost of extra I/O.
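The SIZE syntax and the documented default can be sketched as follows. Both helper names are hypothetical, and the use of binary multiples (K = 1024) is an assumption, since the help text does not say whether suffixes are binary or decimal.

```shell
# Convert a SIZE such as 16G, 512m, or 1048576 to bytes.
# Binary multiples (K = 1024) are an assumption; the help text does not say.
parse_size() {
  num=${1%[GgMmKk]}
  case "$num" in ''|*[!0-9]*) echo "parse_size: bad SIZE '$1'" >&2; return 1 ;; esac
  case "$1" in
    *[Gg]) echo $((num * 1024 * 1024 * 1024)) ;;
    *[Mm]) echo $((num * 1024 * 1024)) ;;
    *[Kk]) echo $((num * 1024)) ;;
    *)     echo "$num" ;;
  esac
}

# The documented default, min(50% of RAM, 32G), given total RAM in bytes.
default_budget() {
  half=$(($1 / 2))
  cap=$(parse_size 32G)
  if [ "$half" -lt "$cap" ]; then echo "$half"; else echo "$cap"; fi
}

default_budget "$(parse_size 16G)"    # half of RAM on a 16G host
default_budget "$(parse_size 128G)"   # capped at 32G on a 128G host
```

On a 16G host the default therefore lands at half of RAM, while on a 128G host the 32G cap applies; the sketch ignores the cgroup limits that the real default also honours.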

`--tmp-dir PATH` — scratch directory

Scratch directory for intermediate files when memory is partitioned. Defaults to $TMPDIR. Point this at a fast local disk (NVMe or ramdisk) to minimize wall-clock time when --max-memory forces partitioned construction.

`--emit-unpacked-ref` — also write `<prefix>.0123`

Off by default. bwa-mem3 mem reconstructs reference bases from the packed .pac on demand (pac-fetch), so the unpacked .0123 (~8× the .pac; ~6.4 GB on hg38) is never read and is not built. Enable this flag only when an external consumer still requires the file — for example, sharing an index with bwa-mem2, which loads .0123 directly:

bwa-mem3 index --emit-unpacked-ref ref.fa   # additionally writes ref.fa.0123

For --meth the flag applies to the original index only; the .meth seed index never needs an unpacked reference.

`--meth` — build a methylation (dual) index

Builds a dual index: the normal FM-index over the original FASTA (at the bare prefix), plus a converted seed FM-index under the .meth prefix, built over a per-strand-converted FASTA <in.fasta>.meth.fa (f-prefixed C→T and r-prefixed G→A doubled contigs). All files are placed alongside the original FASTA.

By default, neither index writes an unpacked .0123: mem --meth extends against the original reference (whose bases it pac-fetches from .pac), never the seed, so the original .0123 (~6.4 GB) is unnecessary and the seed’s unpacked bases are never read at all (~13 GB). Not building either saves ~19 GB of disk on hg38; the runtime RSS reduction comes from avoiding the original .0123 load (~6.4 GB). (--emit-unpacked-ref overrides this for the original index only; the .meth seed never needs one.)

Pass the original FASTA prefix to all three commands (index, shm, and mem); the .meth seed index is located automatically when --meth is present.

Notes / Gotchas

Tip — Index once, align many times

A standard hg38 index is ~11 GB of index files on disk and takes several minutes to build. A --meth build adds the seed index on top: the doubled seed FM-index (~21 GB) plus its packed .pac (~1.6 GB), for roughly 34 GB of index files (~37 GB including the converted .meth.fa). That is a little over triple the plain footprint, and it stays there because by default no unpacked .0123 is built for either index (--emit-unpacked-ref adds it for the original only). Build once and store on shared storage; all alignment jobs on the same reference share the files.
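The footprint figures in this tip can be added up with a one-line helper. This is only arithmetic on the approximate hg38 sizes quoted above; index_footprint_gb is a hypothetical name, and the inputs are in GB.

```shell
# Sum on-disk index components in GB: plain index, seed FM-index, seed .pac.
index_footprint_gb() {
  awk -v plain="$1" -v seed_fm="$2" -v seed_pac="$3" \
    'BEGIN { printf "%.1f\n", plain + seed_fm + seed_pac }'
}

index_footprint_gb 11 21 1.6   # approximate hg38 figures from the text
```

The result excludes the converted .meth.fa, which the text counts separately.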

Note — a --meth index is a superset, not a separate index

index --meth writes the normal index at the bare prefix plus the .meth seed index. The bare-prefix index is an ordinary index, so bwa-mem3 mem ref.fa (without --meth) works fine for standard alignment against the same files — no separate index directory is needed. Only --meth runs use the .meth seed index.

mem

bwa-mem3 mem aligns short DNA reads against an indexed reference genome using the BWA-MEM algorithm. It accepts one or two FASTQ files (single-end or paired-end) and writes alignments to stdout in SAM or BAM format. It is the primary alignment subcommand; nearly all bwa-mem3 usage flows through it.

Synopsis

Usage: bwa-mem3 mem [options] <idxbase> <in1.fq> [in2.fq]
Options:
  Algorithm options:
    -o STR        Output SAM file name
    --bam[=N]     Emit BAM instead of SAM text. N=0 (default) = uncompressed;
                  1..9 = BGZF deflate levels. Writes to stdout; redirect with `>`.
    -t INT        number of threads [1]
    -k INT        minimum seed length [19]
    -w INT        band width for banded alignment [100]
    -d INT        off-diagonal X-dropoff [100]
    -r FLOAT      look for internal seeds inside a seed longer than {-k} * FLOAT [1.5]
    -y INT        seed occurrence for the 3rd round seeding [20]
    -c INT        skip seeds with more than INT occurrences [500]
    --smem-dedup  dedup identical SMEMs before chaining: fewer SA lookups, ~10% fewer; opt-in, NOT byte-identical (changes XS/secondary on a small fraction of reads) [off]
    --skip-contained-ext  skip banded-SW extension of seeds contained (same diagonal) in a longer in-chain seed; byte-identical (non-meth), ~10% less alignment CPU; no effect under --meth [off]
    --max-extend-chains INT  cap chains extended per read to the top-INT by weight; ~23% less alignment CPU, high-confidence placement unaffected; ignored for reads with >4096 chains; opt-in, NOT byte-identical (0 = off) [0]
    --adaptive-band  adaptive banded-SW: start tight and expand each pair to its chain-geometry band on long-extension reads; long-read speedup (~1.3x on SBX), no-op on short reads; opt-in, NOT byte-identical [off]
    --extend-mate-concordant[=INT]  when --max-extend-chains caps a PE read, also keep any chain concordant (same contig, FR, within INT bp) with a mate chain; recovers the true pair's low-weight chain the cap would drop (mainly --meth). Bare = auto (window = estimated proper-pair insert high bound); =INT = fixed bp; =0 = off. Opt-in, NOT byte-identical [off]
    -D FLOAT      drop chains shorter than FLOAT fraction of the longest overlapping chain [0.50]
    -W INT        discard a chain if seeded bases shorter than INT [0]
    -m INT        perform at most INT rounds of mate rescues for each read [50]
    -S            skip mate rescue
    -P            skip pairing; mate rescue performed unless -S also in use
    --fast        speed preset: -m 10 -y 0 --min-ext-len 30 --smem-dedup
                  --skip-contained-ext --max-extend-chains 20 --adaptive-band
                  --extend-mate-concordant (under --meth: --max-extend-chains 10,
                  -s 2). Opt-in; explicit
                  flags override where applicable; --smem-dedup,
                  --skip-contained-ext and --adaptive-band are always enabled.
                  NOT byte-identical to the default (divergence confined to the
                  low-confidence tail).
Scoring options:
   -A INT        score for a sequence match, which scales options -TdBOELU unless overridden [1]
   -B INT        penalty for a mismatch [4]
   -O INT[,INT]  gap open penalties for deletions and insertions [6,6]
   -E INT[,INT]  gap extension penalty; a gap of size k cost '{-O} + {-E}*k' [1,1]
   -L INT[,INT]  penalty for 5'- and 3'-end clipping [5,5]
   -U INT        penalty for an unpaired read pair [17]
Input/output options:
   -p            smart pairing (ignoring in2.fq)
   -R STR        read group header line such as '@RG\tID:foo\tSM:bar' [null]
   -H STR/FILE   insert STR to header if it starts with @; or insert lines in FILE [null]
   -j            treat ALT contigs as part of the primary assembly (i.e. ignore <idxbase>.alt file)
   -5            for split alignment, take the alignment with the smallest coordinate as primary
   -q            don't modify mapQ of supplementary alignments
   -K INT        process INT input bases in each batch regardless of nThreads (for reproducibility) []
   -v INT        verbose level: 1=error, 2=warning, 3=message, 4+=debugging [3]
   -T INT        minimum score to output [30]
   -h INT[,INT]  if there are <INT hits with score >80.00% of the max score, output all in XA [5,200]
   -z FLOAT      the fraction of the max score to use with -h [0.80]
   -u            output XB instead of XA; XB is XA with the alignment score and mapping quality added
   -a            output all alignments for SE or unpaired PE
   -C            append FASTA/FASTQ comment to SAM output
   -V            output the reference FASTA header in the XR tag
   -Y            use soft clipping for supplementary alignments
   -M            mark shorter split hits as secondary
   -I FLOAT[,FLOAT[,INT[,INT]]]
                 specify the mean, standard deviation (10% of the mean if absent), max
                 (4 sigma from the mean if absent) and min of the insert size distribution.
                 FR orientation only. [inferred]
Bisulfite (--meth) options:
   --meth        enable inline bwameth-style C→T/G→A read conversion + meth-aware BAM
                 emission. Implies --bam. Requires the reference to have been built
                 with `bwa-mem3 index --meth` (emits the original index plus a
                 ref.fa.meth.* converted seed index).
   --meth-scoring collapsed|genomic
                 bisulfite scoring mode [collapsed]. collapsed: C/T (and G/A)
                 interchangeable, bwameth-compatible placement (sets -B 2).
                 genomic: free only the conversion direction, keep variants as
                 mismatches (variant-aware, truthful NM/MD; -B 4).
   --set-as-failed f|r
                 flag alignments to the matching strand ('f' or 'r') as QC-fail (0x200)
   --chimera-qc
                 enable the bwameth.py-style longest-match <44% chimera heuristic
                 (sets 0x200, clears 0x2, caps MAPQ at 1). Off by default; not in Bismark.
Supplementary MAPQ rescoring (fg-labs extension):
   --supp-rep-hard-cap INT
                 force MAPQ=0 for supplementary alignments whose chain contains any seed
                 with >=INT genome occurrences (i.e. the supp region is repetitive on its
                 own). 0 disables (default). Typical values 5-20; lower = more aggressive.
                 Primary MAPQ is unaffected.
Seed ordering (fg-labs extension):
   --seed-order STR
                 seed emission order before chaining: off|local-longest [off]
                 (advanced modes: global-longest, absorb-count, most-absorb; see docs)
Input reader:
   --legacy-reader
                 use the legacy gzFile/kseq input reader instead of the default
                 content-detecting fast reader (escape hatch / A-B baseline).
Help:
   --help        print this help message and exit
Note: Please read the man page for detailed description of the command line and options.

Common usage

Paired-end alignment, 16 threads, SAM to stdout:

bwa-mem3 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > out.sam

Paired-end alignment, emit uncompressed BAM, pipe directly to samtools sort:

bwa-mem3 mem --bam -t 16 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -@ 8 -o out.bam -
samtools index out.bam

Paired-end methylation alignment with a read group header:

bwa-mem3 mem --meth -t 16 \
  -R '@RG\tID:lib1\tSM:sample1\tPL:ILLUMINA' \
  ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -o out.bam -

Flag reference

Input / output

`-o STR` — output file

Write output to STR instead of stdout. Honored for both SAM and --bam output; the path is opened lazily so BAM mode can hand it to htslib instead of truncating it as a SAM-text file. Stdout redirection (>) remains an alternative.

`--bam[=N]` — emit BAM

Emit BAM instead of SAM. N controls BGZF compression: 0 (default when --bam is used without =) writes uncompressed BAM, which costs almost no CPU and is the recommended mode for piping to samtools sort. Values 1–9 select increasing BGZF deflate levels; use --bam=6 or --bam=9 only when writing directly to final storage without a downstream sort step.

Tip — Prefer --bam for production pipelines

Uncompressed BAM (--bam or --bam=0) eliminates the text-formatting cost on the aligner side and the text-parse cost on the samtools sort side. For any pipeline that immediately sorts or processes the output, this is faster than SAM at no quality cost.

`-R STR` — read group header

Injects a @RG header line and tags every alignment with RG:Z:<ID>. The value is a tab-separated @RG line with literal \t escapes, for example:

-R '@RG\tID:run1\tSM:HG001\tPL:ILLUMINA\tLB:lib1'

bwa-mem3 escapes any literal tab characters inside -R values before writing them to the @PG CL: field, preventing header corruption (fix for issue #45).
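When the read-group fields come from variables, it is easy to emit real tab characters by accident. A minimal sketch of a builder that produces the literal \t form shown above follows; rg_line is a hypothetical helper, and the choice of ID, SM, and PL as the three fields is only an example.

```shell
# Build an @RG string with literal backslash-t separators.
# "\\t" in the printf format emits the two characters '\' and 't', not a tab.
rg_line() {
  printf '@RG\\tID:%s\\tSM:%s\\tPL:%s\n' "$1" "$2" "$3"
}

rg_line run1 HG001 ILLUMINA

# Usage (not run here):
#   bwa-mem3 mem -R "$(rg_line run1 HG001 ILLUMINA)" ref.fa R1.fq.gz R2.fq.gz
```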

`-H STR/FILE` — extra header lines

If STR begins with @, it is injected verbatim as a header line. Otherwise STR is treated as a path and every line in the file is injected. Useful for adding @CO comments or custom @RG / @PG entries.

`-p` — smart pairing

Reads interleaved paired-end data from a single FASTQ file (in1.fq) rather than two separate files. The second positional argument (in2.fq) is ignored.

`-5` — leftmost-coordinate primary

For split alignments, designates the alignment with the smallest genomic coordinate as primary, rather than the longest alignment. Useful for some downstream tools that expect the leftmost alignment to be primary.

`-q` — preserve supplementary MAPQ

By default, bwa-mem3 may downgrade the MAPQ of supplementary alignments. -q suppresses that adjustment.

`-K INT` — fixed batch size

Forces each thread batch to process exactly INT input bases regardless of the number of threads. Useful when you need bit-for-bit reproducible output across runs with different -t values: fix -K to the same value and the output is deterministic.
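One way to check this property on your own data is to align twice with the same -K and different -t, then compare the records. The sketch below assumes the @PG header line records the command line and therefore differs between the two runs, so it is excluded from the comparison. same_alignments is a hypothetical helper, and the -K value in the usage comment is an arbitrary illustration.

```shell
# Succeeds when two SAM files are identical apart from @PG header lines.
same_alignments() (
  a=$(grep -v '^@PG' "$1" | cksum)
  b=$(grep -v '^@PG' "$2" | cksum)
  [ "$a" = "$b" ]
)

# Usage (not run here):
#   bwa-mem3 mem -K 100000000 -t 4  ref.fa R1.fq R2.fq > t4.sam
#   bwa-mem3 mem -K 100000000 -t 16 ref.fa R1.fq R2.fq > t16.sam
#   same_alignments t4.sam t16.sam && echo "deterministic across -t"
```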

`-v INT` — verbosity

Controls stderr diagnostic output: 1 = errors only, 2 = warnings, 3 = informational messages (default), 4+ = debugging.

`-a` — all alignments

Output all alignments for single-end or unpaired paired-end reads, including secondary alignments. Equivalent to enabling secondary-alignment reporting.

`-C` — append FASTA/FASTQ comment

Appends the comment field from the FASTA/FASTQ header to the SAM output as an additional column. Useful when the comment carries barcodes or UMIs.

`-V` — reference header in XR tag

Emits the reference FASTA header line for each alignment position as an XR SAM tag.

Under --meth, XR:Z instead carries the Bismark read-conversion direction (CT/GA) and this reference-annotation use of XR is suppressed — see Methylation Reference → Flags.

`-Y` — soft-clip supplementary alignments

Uses soft clipping instead of hard clipping for supplementary alignments. Some downstream tools require this.

`-M` — mark shorter split hits as secondary

Marks the shorter alignment in a split read as secondary (sets 0x100 flag) rather than supplementary. Required for compatibility with tools that do not handle supplementary alignments (e.g. Picard’s duplicate-marking before certain versions).

`-j` — treat ALT contigs as primary

Treats ALT contigs as part of the primary assembly by ignoring the <idxbase>.alt file. Use when your workflow does not include ALT-aware postprocessing.

Scoring

All scoring flags take integer values, or a comma-separated integer pair where shown. Changing -A (match score) scales the penalty flags whose defaults are multiples of -A; flags you set explicitly are not rescaled.

Flag	Default	Meaning
`-A INT`	1	Score for a sequence match. Scales `-T`, `-d`, `-B`, `-O`, `-E`, `-L`, `-U` unless overridden.
`-B INT`	4	Mismatch penalty.
`-O INT[,INT]`	6,6	Gap open penalty for deletions and insertions respectively.
`-E INT[,INT]`	1,1	Gap extension penalty per base. A gap of length k costs `-O + -E * k`.
`-L INT[,INT]`	5,5	Clipping penalty for 5’ and 3’ ends.
`-U INT`	17	Penalty for an unpaired read pair (affects mate-rescue scoring).
`-T INT`	30	Minimum alignment score to output. Alignments below this threshold are not reported.
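The gap-cost formula in the -E row can be checked with a small helper. This is a sketch of the documented formula only; gap_cost is a hypothetical name.

```shell
# Cost of a gap of length k: open + ext * k  (arguments: open ext k)
gap_cost() {
  echo $(($1 + $2 * $3))
}

gap_cost 6 1 3   # defaults -O 6, -E 1, a 3-base gap; prints 9
```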

Note — --meth overrides scoring defaults

When --meth is active, bwa-mem3 applies -L 10 -U 100 -T 40 -M -C plus a mode-dependent mismatch penalty: -B 2 for --meth-scoring collapsed (default, bwameth-compatible) and -B 4 for --meth-scoring genomic. This mirrors bwameth’s bwa mem -T 40 -B 2 -L 10 -CM (with -U 100 for paired-end). The scoring values (-B, -L, -U, -T) can still be overridden by passing the flag explicitly, in any position relative to --meth; -M and -C cannot, since bwa has no option that unsets them.

Those constants are quoted at bwameth’s match score (-A 1) and scale with -A like every other score-derived default above: under -A 2 the effective values are -L 20 -U 200 -T 80 and -B 4 (collapsed) / -B 8 (genomic).
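The scaling rule can be written out as a short sketch. It only reproduces the constants stated in this note; meth_defaults is a hypothetical helper, not something bwa-mem3 provides.

```shell
# Effective --meth scoring defaults for match score A and scoring mode.
# Arguments: A  collapsed|genomic
meth_defaults() (
  a=$1
  case $2 in
    collapsed) b=2 ;;
    genomic)   b=4 ;;
    *) echo "unknown mode: $2" >&2; exit 1 ;;
  esac
  echo "-B $((b * a)) -L $((10 * a)) -U $((100 * a)) -T $((40 * a))"
)

meth_defaults 1 collapsed
meth_defaults 2 genomic
```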

Paired-end

`-I FLOAT[,FLOAT[,INT[,INT]]]` — insert size distribution

Specifies the mean, standard deviation (default: 10% of mean), maximum (default: 4 sigma above mean), and minimum of the insert size distribution for FR-orientation paired-end reads. By default bwa-mem3 infers these parameters from the first batch of reads. Provide them explicitly for speed or when the reference is short and inference may be inaccurate.
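To see what the omitted fields default to, the sketch below derives the standard deviation and maximum from a mean, following the defaults stated above (10% of the mean, and the mean plus 4 standard deviations). insert_size_defaults is a hypothetical helper. The minimum is left out because the text gives no default for it.

```shell
# Print "<sd> <max>" for a given mean and optional sd.
insert_size_defaults() {
  awk -v mean="$1" -v sd="${2:-}" 'BEGIN {
    if (sd == "") sd = 0.10 * mean      # 10% of the mean if absent
    printf "%g %g\n", sd, mean + 4 * sd # max = 4 sigma above the mean
  }'
}

insert_size_defaults 400      # sd and max derived from the mean
insert_size_defaults 500 50   # explicit sd
```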

`-m INT` — mate rescue rounds

Maximum number of mate-rescue attempts per read. Reduce to speed up alignment on data where the default (50) wastes time on unrescuable pairs. See Settings profiles for the benchmarked -m 10 recommendation.

`-S` — skip mate rescue

Disables mate rescue entirely. Faster but may reduce sensitivity for discordant pairs.

`-P` — skip pairing

Skips the pairing step; mate rescue still runs unless -S is also given.

Filtering

`-c INT` — skip repetitive seeds

Seeds with more than INT occurrences in the reference are skipped. Lowering this (e.g. to 50) speeds up alignment of highly repetitive reads but may reduce sensitivity. Raising it increases sensitivity in repeat-heavy regions at a cost in runtime.

`-D FLOAT` — chain length fraction

Drops chains shorter than FLOAT times the longest overlapping chain. The default (0.50) discards chains that are less than half the length of the best chain.

`-W INT` — minimum seeded bases

Discards chains with fewer than INT seeded bases. Raising this filters out very short, low-confidence chains.

`--min-ext-len INT` — skip Smith-Waterman extension of short seeds

Off by default (0) → output byte-identical to baseline. When INT > 0, a short seed (< INT bp) is dropped before banded Smith-Waterman only if its chain still has a longer anchor seed — its extension is then redundant (the anchor already covers it), so skipping it is near output-neutral (~10 % less alignment CPU at 30). A chain whose seeds are all short is left untouched, so the filter never empties a chain or drops a read: it is recall-safe by construction. 30 is the recommended value. For the benchmarks, behavior details, and validation status, see Settings profiles → --min-ext-len 30.

`--max-extend-chains INT` — cap chains extended per read

Off by default (0) → output byte-identical to baseline. When INT > 0, only the top-INT chains by weight (after chain filtering) reach banded Smith-Waterman extension; the remaining lower-weight chains are dropped before extension. This is the only lever that reduces the number of chains extended per read, so it is orthogonal to the seed- and SW-per-chain levers and adds a real marginal speedup on top of them (~15 % marginal alignment CPU on top of --fast, ~23 % standalone, at 5). It is not byte-identical: dropping candidate chains removes low-weight secondaries, so XS, secondary alignments, and MAPQ can shift on multi-mapping reads. High-confidence (uniquely-placed) reads are unaffected. The cap is a safety no-op for pathological reads with more than 4096 chains (MAX_EXTEND_CHAINS_CAP): those reads extend all of their chains as usual, so --max-extend-chains has no effect on them. --fast sets 20. For the accuracy/speed curve and validation status, see Settings profiles → --max-extend-chains.

`--adaptive-band` — adaptive banded Smith-Waterman for long reads

Off by default → output byte-identical to baseline. When set, banded extension starts at a tight band and expands each pair only to the band its chain’s seed geometry actually needs (the inter-seed indel), rather than the fixed -w band (100) for every extension.

When to use it: long reads. The band only constrains the DP matrix when the extension’s reference window exceeds it (ref_window > 2·w+1), which happens for long reads. So this is a long-read lever — SBX, PacBio HiFi, ONT, or any run whose reads are roughly ≥ 200 bp. On SBX (HG002, 240 bp+) it cuts alignment CPU by ~25 %. On short-read data (WGS ~150 bp, WES ~76 bp) the extension matrix is already smaller than the band, so there is nothing to trim: those reads run on the 8-bit kernel, which this option deliberately leaves untouched, making it a no-op on short reads (enabling it on a WGS/WES run neither helps nor hurts).
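The condition quoted above (ref_window > 2·w+1) is easy to evaluate by hand. A minimal sketch follows; band_constrains is a hypothetical helper, and how bwa-mem3 derives ref_window from a given read is not covered here.

```shell
# Succeeds when the band actually constrains the DP matrix,
# i.e. ref_window > 2*w + 1. Arguments: ref_window [w, default 100]
band_constrains() {
  [ "$1" -gt $((2 * ${2:-100} + 1)) ]
}

band_constrains 150 && echo "150: band matters" || echo "150: no-op"
band_constrains 300 && echo "300: band matters" || echo "300: no-op"
```

With the default -w 100 the threshold is 201, which matches the "roughly 200 bp or longer" guidance.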

Accuracy: placement is unchanged (holodeck sim-wgs-place: MAPQ-60+ mismaps identical to default) and indel representation is preserved — indels up to the chaining limit still emit a single D/I CIGAR, matching the -w 100 default, so small/mid-size indel callability is unaffected.

Not byte-identical when on. Like --fast, enabling it shifts a small number of borderline secondary alignments (starting tight and expanding can change which of several near-tied placements wins). It is therefore an opt-in flag, not a default.

`--extend-mate-concordant` — retain mate-concordant chains under a chain cap

Takes an optional window: --extend-mate-concordant (bare) = auto, sizing the window to the estimated proper-pair insert bound (pes[FR].high, inferred from the data during the run); --extend-mate-concordant=INT pins a fixed window in bp; --extend-mate-concordant=0 disables it. Off by default → no effect. When on (and --max-extend-chains is capping a paired-end read), a chain that would be dropped by the cap is instead retained if it is concordant with one of the mate’s chains — same contig, FR (“innie”) orientation, within the window. It only does anything when a chain cap is in effect, so it is a strict no-op without --max-extend-chains.

The window matters: too wide and it retains — and then extends — far/spurious concordant chains, adding alignment CPU on chain-rich reads; sizing it to the aligner’s own proper-pair insert bound (the auto default) admits only genuine pair anchors. Before the insert size is estimated (the first chunk), auto falls back to a built-in default.

When to use it: --meth. Bisulfite’s collapsed 3-letter alphabet flattens chain weights, so under --max-extend-chains the cap often drops a read’s true low-weight chain and starves PE pairing of the anchor that lets the true concordant pair win, flipping both mates to a wrong concordant locus. This option recovers that anchor. --fast enables it (auto) automatically, for both non-meth and --meth runs. The recovery is partial (it narrows, but does not fully close, the placement gap to default), and with the auto window the alignment-CPU cost is ~1%; sizing the window to the insert bound is what keeps it there (a wide fixed window instead retains and extends far/spurious concordant chains, costing 15–20% on chain-rich reads). See the benchmarked per-dataset figures in fg-labs/bwa-mem3#195.

Not byte-identical when it retains a chain. Like --max-extend-chains, keeping an extra candidate can move XS, secondaries, and MAPQ on multi-mapping reads; high-confidence placement is unaffected. For the placement/mismap validation, see Settings profiles → --extend-mate-concordant.

`-h INT[,INT]` — secondary alignment reporting

If there are fewer than INT hits with score exceeding FLOAT (see -z) times the maximum score, all of them are output in the XA auxiliary tag. The second integer is a hard cap on the number of XA entries. Defaults: 5, 200.

`-z FLOAT` — secondary score fraction

Fraction of the maximum alignment score used as the threshold for secondary hit reporting with -h. Default: 0.80.

`-u` — emit XB instead of XA

Outputs XB in place of XA. XB is an extension of XA that also carries the alignment score and mapping quality for each secondary hit.

Speed preset

`--fast` — speed preset (opt-in, not byte-identical)

--fast is a one-flag shorthand for the characterized speed levers:

bwa-mem3 mem --fast  ≡  -m 10 -y 0 --min-ext-len 30 --smem-dedup --skip-contained-ext --max-extend-chains 20 --adaptive-band --extend-mate-concordant

--skip-contained-ext is byte-identical to the default on non-meth single- and paired-end reads and no-ops under --meth (via its own internal gate), so it is pure upside where it applies (~10% lower alignment CPU on long-read inputs) and safe elsewhere.

--adaptive-band (see above) is included because it is a strict no-op on short reads (the reads --fast primarily targets) and a ~25% alignment-CPU speedup on long-read (SBX/HiFi/ONT) runs, so bundling it only helps.

--extend-mate-concordant repairs the chain-cap pairing regression — the true, low-weight but mate-concordant chain the cap would otherwise drop — and is included for both non-meth and --meth --fast (see Settings profiles → --extend-mate-concordant).

Under --meth it additionally sets -s 2 (light Pass-2 re-seeding) and lowers the chain cap to 10. Earlier releases used -s 0 (no re-seed), which inflated MAPQ on bisulfite reads; -s 2 recovers the MAPQ/placement at nearly the same speed (see Settings profiles → Pass-2 re-seeding).

Each lever is applied only if you did not set it explicitly, so explicit flags win where applicable (--fast -m 30 keeps -m 30; --fast --max-extend-chains 8 keeps 8); --smem-dedup, --skip-contained-ext, and --adaptive-band are always enabled and cannot be opted back out of once --fast is set. Output is not byte-identical to the default; the accuracy cost of each lever is characterized in Settings profiles and is confined to the already-low-confidence tail. bwa-mem3 mem prints the resolved preset to stderr ([M::main_mem] --fast: ...) so runs are self-documenting.
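For reference, the expansion described in this section can be written as a small sketch. It only restates the flag list given above, with the --meth differences (chain cap 10 and -s 2), and does not model explicit overrides. fast_expansion is a hypothetical helper; the stderr line printed by bwa-mem3 mem remains the authoritative record of what a run actually used.

```shell
# Print the flags --fast stands for; pass --meth for the bisulfite variant.
fast_expansion() (
  cap=20
  extra=""
  if [ "${1:-}" = "--meth" ]; then
    cap=10          # lower chain cap under --meth
    extra=" -s 2"   # light Pass-2 re-seeding under --meth
  fi
  printf '%s %s\n' \
    "-m 10 -y 0 --min-ext-len 30 --smem-dedup --skip-contained-ext" \
    "--max-extend-chains $cap --adaptive-band --extend-mate-concordant$extra"
)

fast_expansion
fast_expansion --meth
```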

Methylation (`--meth`)

`--meth` — enable bisulfite alignment mode

Activates bisulfite alignment: each read is projected (R1 C→T, R2 G→A) to find seeds in the converted .meth seed index, then extended and scored against the original 4-letter reference, with inline BAM post-processing and forced --bam output. The reference must have been indexed with bwa-mem3 index --meth.

Pass the original FASTA prefix as <idxbase> (e.g. ref.fa); the ref.fa.meth.* seed index alongside it is found automatically. A legacy bwameth .bwameth.c2t index is not used directly — rebuild with index --meth (see Migrating from bwameth.py c2t).

See Methylation Reference for the full treatment.

`--meth-scoring {collapsed|genomic}` — bisulfite scoring model

Selects how the 4-letter matrix treats converted bases. collapsed (default) frees C↔T and G↔A both ways (bwameth-compatible placement, sets -B 2); genomic frees only the conversion direction, keeping real variants as mismatches (variant-aware, truthful NM/MD, keeps -B 4). Only meaningful with --meth. See Flags → --meth-scoring.

`--set-as-failed {f|r}` — strand QC-fail flag

Forces the QC-fail bit (0x200) on all alignments to the forward (f) or reverse (r) bisulfite strand. Used when one strand is known to be unreliable for a given library preparation.

`--chimera-qc` — opt in to bwameth.py-style chimera heuristic

Off by default (matches Bismark, which has no equivalent heuristic). When set, mapped records whose longest M/=/X CIGAR run is less than 44 % of the read length get 0x200 set, 0x2 cleared, and MAPQ capped at 1. Useful for PBAT / scBS-Seq libraries where intra-fragment chimerism is common, or when reproducing bwameth.py output bit-for-bit.
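To make the 44% rule concrete, here is a sketch that finds the longest M/=/X operation in a CIGAR string and applies the threshold. It assumes each CIGAR operation counts as its own run (adjacent match operations are not merged) and takes the read length as an argument. Both helper names are hypothetical, and this is not the aligner's implementation.

```shell
# Length of the longest single M, = or X operation in a CIGAR string.
longest_match_run() {
  printf '%s\n' "$1" | awk '{
    best = 0
    cigar = $0
    while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
      len = substr(cigar, 1, RLENGTH - 1) + 0
      op  = substr(cigar, RLENGTH, 1)
      if ((op == "M" || op == "=" || op == "X") && len > best) best = len
      cigar = substr(cigar, RLENGTH + 1)
    }
    print best
  }'
}

# Succeeds when the longest run is below 44% of the read length.
# Arguments: CIGAR read_length
is_chimera_candidate() (
  run=$(longest_match_run "$1")
  [ $((run * 100)) -lt $(($2 * 44)) ]
)

longest_match_run 30M5I20M45S
is_chimera_candidate 30M5I20M45S 100 && echo "would be flagged 0x200"
```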

Threading

`-t INT` — number of threads

Number of worker threads. Defaults to 1. Set to the number of physical cores available to this job. Scaling is workload- and hardware-dependent: on typical machines the curve flattens around 16–32 threads (FM-index bandwidth and I/O contention dominate); on high-memory / fast-I/O servers the aligner can keep scaling toward ~64 threads on hg38 before saturating.

See User Guide — Threading and resource use for guidance on thread counts at various machine sizes.

Supplementary MAPQ rescoring

`--supp-rep-hard-cap INT` — cap MAPQ for repetitive supplementary alignments

Forces MAPQ=0 for supplementary alignments whose chain contains any seed with at least INT occurrences in the genome. This targets supplementary alignments anchored in repetitive regions that upstream MAPQ scoring may overestimate. 0 disables the cap (default). Typical values are 5–20; lower values are more aggressive. Primary alignment MAPQ is unaffected.

Debug

`-k INT` — minimum seed length

Minimum exact-match seed length. Shorter seeds increase sensitivity but raise runtime. The default (19) is calibrated for 100–150 bp Illumina reads.

`-w INT` — band width

Band width for the banded Smith-Waterman extension. Wider bands can recover alignments with long indels at greater CPU cost.

`-d INT` — X-dropoff

Off-diagonal X-dropoff for the Z-drop heuristic. Controls how far an alignment extension continues after a score drop.

`-r FLOAT` — re-seeding factor

Seeds longer than -k * FLOAT are re-seeded internally to find sub-seeds (bwa-mem’s second seeding round). Lowering produces more seeds / higher sensitivity at greater cost; raising (e.g. -r 10) suppresses the round. Round 2 is genuine split-read/divergence sensitivity, so only suppress it on known-clean data — see Settings profiles → -y 0.

`-y INT` — third-round seed occurrence threshold

bwa-mem’s third seeding round: for each read position, grow an exact match until it occurs fewer than INT times in the genome (default 20), then emit it as a seed — a repeat-region safety net. -y 0 disables the round, cutting ~11–30 % of alignment CPU with F1-near-neutral accuracy; it is part of the recommended profile. For the regime sweep and rationale, see Settings profiles → -y 0.

`--legacy-reader` — use the legacy input reader

Read input with the legacy gzFile/kseq reader instead of the default content-detecting fast reader. An escape hatch for A/B baselining or working around an input the fast reader mishandles; not needed in normal use.

Notes / Gotchas

Warning — --meth requires a --meth index

Running bwa-mem3 mem --meth against a standard index (one built without --meth) produces incorrect alignments without an error. Confirm that the index was built with bwa-mem3 index --meth before aligning bisulfite data.

Note — SIMD variant printed to stderr at startup

When mem starts it prints a banner (Executing in AVX512 mode!! etc.) to stderr. This is informational and does not affect stdout output.

shm

bwa-mem3 shm stages an FM-index into POSIX shared memory so that subsequent bwa-mem3 mem invocations on the same machine attach to the in-memory segment instead of re-reading the index files from disk. For workloads that align many small samples back-to-back against the same reference — such as clinical panels or amplicon sequencing — this removes the dominant I/O bottleneck. shm also lists and destroys staged segments.

Synopsis


Usage: bwa-mem3 shm [-d|-l|--help] [--meth] [idxbase]

Options:
  -d        destroy all indices in shared memory (matches bwa v1 behavior)
  -l        list names of indices in shared memory
  --meth    stage a `bwa-mem3 index --meth` index — auto-appends
            `.meth` to <idxbase>, mirroring `mem --meth`
  -h --help print this help and exit

Stage with no flags: `bwa-mem3 shm <idxbase>` loads the index into
POSIX shared memory; subsequent `bwa-mem3 mem <idxbase> ...` runs
auto-attach instead of re-reading from disk. For meth indices, pass
the same plain `<idxbase>` to all three commands plus `--meth` on
`index`, `shm`, and `mem` (the c2t suffix is auto-appended).

Footgun: if you re-build the index, run `bwa-mem3 shm -d` first.
There is no staleness check -- a stale segment will silently mis-align.

Stuck-lock recovery: concurrent stagers are serialized by a named
       POSIX semaphore. If a stager is kill -9'd mid-stage, the lock
       persists and subsequent stages block forever. `bwa-mem3 shm -d`
       unlinks the semaphore alongside the registry; rerun afterwards.

macOS: POSIX shm has implementation-defined per-segment caps; large
       indices may simply fail to stage. Prefer Linux for production.
Linux: /dev/shm defaults to ~50% of RAM on bare metal; in containers
       it is often much smaller and may need raising via --shm-size
       (Docker) or an emptyDir tmpfs (Kubernetes).

Common usage

Stage a standard index, align two samples, then release the segment:

bwa-mem3 shm ref.fa
bwa-mem3 mem -t 16 ref.fa sample1_R1.fq sample1_R2.fq > sample1.sam
bwa-mem3 mem -t 16 ref.fa sample2_R1.fq sample2_R2.fq > sample2.sam
bwa-mem3 shm -d

Stage a methylation index and align:

bwa-mem3 shm --meth ref.fa
bwa-mem3 mem --meth -t 16 ref.fa R1.fq R2.fq | samtools sort -o out.bam -
bwa-mem3 shm -d

List all currently staged segments:

bwa-mem3 shm -l

Flag reference

(no flags) `<idxbase>` — stage an index

Loads all index files for <idxbase> into a POSIX shared-memory segment. After staging, any bwa-mem3 mem <idxbase> ... on the same machine auto-attaches and reads from memory rather than disk.

`-d` — destroy all segments

Removes every bwa-mem3 shared-memory segment on the machine. This is the correct clean-up command after a batch job and the required step before re-building the index (see the footgun warning below).

`-l` — list staged indices

Prints the names of all currently staged segments. Useful to confirm that staging succeeded before launching alignment jobs.

`--meth` — stage a methylation index

Stages the .meth seed index (<idxbase>.meth.*) into shared memory, mirroring the behavior of bwa-mem3 index --meth and bwa-mem3 mem --meth. Pass the same plain <idxbase> to all three commands; the .meth suffix is handled transparently.

The staged seed segment is seed-only: it holds the seed FM-index and contig metadata (BNS) but omits the packed reference (PAC), because mem --meth extends against the original reference and never reads the seed’s bases. This trims ~1.6 GB from the staged segment on hg38. (The unpacked .0123 reference is never staged for any index — plain or seed — because mem pac-fetches reference bases from .pac on demand; that saves the seed’s ~13 GB and the original’s ~6.4 GB versus staging .0123.)

Notes / Gotchas

Warning — No staleness check — always destroy before re-indexing

There is no staleness check. If you re-run bwa-mem3 index ref.fa after staging, the on-disk index files will not match the in-memory segment, but bwa-mem3 mem will still attach to the stale segment and silently produce incorrect alignments. Always run bwa-mem3 shm -d before re-indexing.

Note — Platform limits

macOS: POSIX shared memory has implementation-defined per-segment size caps. Staging a full hg38 index (~18 GB; ~21 GB for a --meth seed segment) may fail silently or with a cryptic error. Prefer Linux for production use with large references.

Linux containers: /dev/shm typically defaults to ~50% of physical RAM on bare metal but is often much smaller inside Docker containers or Kubernetes pods. Raise the limit with --shm-size (Docker) or an emptyDir tmpfs volume with an explicit size (Kubernetes) before attempting to stage a large index.

Note — /dev/shm capacity preflight (PR #86)

Before opening the segment, bwa-mem3 shm calls statvfs("/dev/shm") and compares the available bytes against the index’s total_size. If /dev/shm is too small the stage aborts cleanly with an [E::bwa_shm_stage] message that names /dev/shm, the required size, and a mount -o remount,size=... hint. This replaces the previous failure mode where ftruncate succeeded lazily and pack_into later surfaced ENOSPC as [fread] Bad address with no indication that /dev/shm was the cause. The preflight is best-effort: a statvfs failure (no /dev/shm, restricted sandbox, ENOSYS) is non-fatal and the stage proceeds. As a rough sizing guide, hg38 stages ~17 GB; AWS instances default to RAM/2 (so c7a.4xlarge / c7i.4xlarge at 32 GB get ~16 GB of /dev/shm, which is just under the index size — a remount,size=28g is the documented fix).
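The capacity check described above can be approximated from a wrapper script before calling bwa-mem3 shm. The following is a minimal sketch, not the tool's implementation: the helper name shm_has_room is hypothetical, and the use of f_bavail * f_frsize as "available bytes" is an assumption about how the free space is measured.

```python
import os

GIB = 1024 ** 3


def shm_has_room(required_bytes, path="/dev/shm"):
    """Return True/False if `path` can hold `required_bytes`, or None if unknown.

    Hypothetical helper. Like the preflight described above it is best-effort:
    a failing statvfs yields "no verdict" (None) rather than an error.
    """
    try:
        st = os.statvfs(path)
    except (OSError, AttributeError):
        # Path missing, restricted sandbox, or a platform without statvfs.
        return None
    # Assumption: bytes available to unprivileged users = f_bavail * f_frsize.
    available = st.f_bavail * st.f_frsize
    return available >= required_bytes


if __name__ == "__main__":
    need = 17 * GIB  # rough hg38 figure quoted above
    verdict = shm_has_room(need)
    messages = {
        True: "enough room to stage",
        False: "/dev/shm too small, remount with a larger size first",
        None: "could not check /dev/shm, proceeding unverified",
    }
    print(messages[verdict])
```

Returning None instead of raising keeps the wrapper as permissive as the preflight itself: an unreadable /dev/shm is reported, not treated as a failure.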

Note — Stuck-lock recovery

Concurrent bwa-mem3 shm <prefix> invocations are serialized by a named POSIX semaphore (/bwactl_lock) so the registry stays consistent. POSIX semaphores have no SEM_UNDO equivalent: if a stager segfaults or is kill -9’d while holding the lock, every subsequent stage will block in sem_wait forever. Run bwa-mem3 shm -d to recover — it unlinks the semaphore alongside the registry, freeing the next stager.

version

bwa-mem3 version prints the release version, the build’s compiled-in SIMD floor, the SIMD tier resolved at runtime, and (when mimalloc is compiled in) the mimalloc version. It is the canonical way to confirm which build is on PATH, what host class it requires, and what kernel path it will dispatch to.

bwa-mem3 version always exits 0 — even on a host below the build’s SIMD floor — so operators can introspect a binary on a host that cannot actually run alignment. bwa-mem3 <subcommand> --help and -h share the same property.

Synopsis

v<MAJOR.MINOR>-<N>-g<COMMIT>
mimalloc 3.3.0 (active)

Common usage

Confirm the installed version, SIMD floor, and resolved tier:

./bwa-mem3 version

A typical run on an AVX-512BW host with the default BASELINE_ARCH=avx2 build prints (mimalloc line on stderr, the rest on stdout — order in a merged stream is not guaranteed):

v0.3.0-12-gabcdef1
SIMD floor: avx2 (x86-64-v3, Haswell 2013+); kernels: sse41 sse42 avx avx2 avx512bw
SIMD runtime: avx512bw (BWAMEM3_FORCE_TIER unset)
mimalloc 3.3.0

version line — bwa-mem3’s release string, derived from git describe at build time and stored as PACKAGE_VERSION in the binary. When building from a tarball without git history, the fallback value is set via FG_LABS_VERSION_FALLBACK at compile time.
SIMD floor: — the compile-time minimum the binary requires. Set by BASELINE_ARCH (default avx2) and listed alongside the per-tier kernel set the binary carries.
SIMD runtime: — the tier resolved at startup by __builtin_cpu_supports (or BWAMEM3_FORCE_TIER if set in the environment). On arm64 this is always neon.
mimalloc line — present only when USE_MIMALLOC=1 (the default).

On a host below the SIMD floor, version also writes a [W::bwa-mem3] warning on stderr identifying the gap (and the alignment subcommands will refuse to run with exit code 2 — see Host requirements for the exit-2 message format and rebuild instructions).

Notes / Gotchas

Tip — version | grep is safe in CI

The version, SIMD floor:, and SIMD runtime: lines all go to stdout; the mimalloc line and any host-below-floor warning go to stderr. So bwa-mem3 version | grep '^SIMD' works in CI scripts even on hosts that cannot run alignment. Use 2>/dev/null to suppress the mimalloc and warning lines if you want stdout only.
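For CI scripts that need more than a grep, the stdout lines can be parsed into a small record. This is a sketch under the assumption that the output keeps the shape of the sample shown above; parse_version_stdout is a hypothetical helper, not part of bwa-mem3.

```python
def parse_version_stdout(text):
    """Extract the release string and SIMD tiers from `bwa-mem3 version` stdout.

    Assumes the layout of the sample output above: one version line, then
    "SIMD floor: <tier> ..." and "SIMD runtime: <tier> ..." lines.
    """
    info = {"version": None, "simd_floor": None, "simd_runtime": None}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("SIMD floor:"):
            # "SIMD floor: avx2 (x86-64-v3, ...); kernels: ..." -> "avx2"
            info["simd_floor"] = line.split(":", 1)[1].split()[0]
        elif line.startswith("SIMD runtime:"):
            info["simd_runtime"] = line.split(":", 1)[1].split()[0]
        elif line and info["version"] is None and not line.startswith("mimalloc"):
            # First remaining non-empty line is taken as the version string.
            info["version"] = line
    return info


if __name__ == "__main__":
    sample = (
        "v0.3.0-12-gabcdef1\n"
        "SIMD floor: avx2 (x86-64-v3, Haswell 2013+); kernels: sse41 sse42 avx avx2 avx512bw\n"
        "SIMD runtime: avx512bw (BWAMEM3_FORCE_TIER unset)\n"
    )
    print(parse_version_stdout(sample))
```

In a real pipeline the text would come from the captured stdout of bwa-mem3 version; because stderr is a separate stream, the mimalloc line and any warning do not need to be filtered out first.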

Tip — No mimalloc line means USE_MIMALLOC=0

If no mimalloc line appears, the binary was built without the bundled allocator (make USE_MIMALLOC=0). See User Guide — Memory allocator (mimalloc) for when this is appropriate.

Methylation Reference Overview

bwa-mem3 mem --meth is a single-binary, single-command bisulfite/EM-seq aligner. One bwa-mem3 index --meth builds the reference, and one bwa-mem3 mem --meth aligns raw FASTQ to a sorted-ready BAM — no Python, no piped read-conversion preprocessor, and no separate post-processing script.

What sets it apart from the classic bwameth.py approach is where the alignment is scored. bwameth.py converts the reads and the reference into 3-letter (C→T) space and aligns entirely in that collapsed space. bwa-mem3 --meth only uses 3-letter space to find seeds; it then extends, scores, and reports every alignment against the original 4-letter reference using a per-strand asymmetric substitution matrix. The output is in the original alphabet with Bismark-compatible XR/XG/XM tags, so one BAM serves both methylation calling and — in the variant-aware scoring mode — variant calling, because real C/T and G/A variants stay literal mismatches in NM/MD.

The code path without --meth is byte-for-byte unchanged.

Two scoring modes: `--meth-scoring`

Because scoring happens in 4-letter space, --meth can choose how lenient to be about bisulfite-converted bases. This is controlled by --meth-scoring:

Mode	Default?	Matrix	`-B`	Behavior
`collapsed`	yes	frees C↔T and G↔A both ways (two cells)	`2`	bwameth-compatible placement — C/T and G/A are interchangeable, so it closely tracks bwameth’s collapsed-space mapping. A close approximation, not exact: ~1% of records differ in `POS`/`CIGAR`/`MAPQ`, so re-validate if pinned to a bwameth release.
`genomic`	no (opt-in)	frees only the conversion direction (one cell)	`4`	variant-aware — a real C/T or G/A variant scores as a mismatch, so `NM`/`MD` are truthful and the BAM is usable for variant calling.

The default is collapsed, so existing methylation pipelines see bwameth-compatible read placement unless they explicitly opt into genomic. collapsed closely tracks bwameth’s placement and emits the same Bismark tags, but it is a placement drop-in — not byte-identical: ~1% of records differ in POS/CIGAR/MAPQ, so re-validate if you are pinned to a specific bwameth release. See bwameth.py drop-in mapping for the full placement-compatibility caveat.

Pipeline at a glance

The diagram below shows the internal flow when bwa-mem3 mem --meth runs. Every step executes inside the single process; no external programs or temporary files are required.

flowchart LR
    A[Raw FASTQ\nR1 / R2] -->|project R1 C→T,\nR2 G→A for SEEDING ONLY| B[seed in .meth\ndoubled seed index]
    B -->|remap each seed →\noriginal coords + OT/OB hypothesis| C[extend + SCORE\nORIGINAL read vs ORIGINAL ref\nper-strand asymmetric matrix]
    C -->|--meth-scoring\ncollapsed / genomic| D[original-alphabet\nalignment]
    D -->|XR/XG/XM Bismark tags\noptional --chimera-qc| E[BAM output]

Steps:

Seed projection. Each read is projected into 3-letter space for seeding only: R1 has every C replaced with T, R2 has every G replaced with A. The original bases are preserved on a first-class per-read field (bseq1_t.meth_orig_seq) and drive scoring and output later. The projection is in-memory; the FASTQ is never rewritten.
Seeding against the .meth doubled seed index. The projected read is seeded against the converted seed FM-index (<ref>.meth.*), which contains a forward C→T projection (f-prefixed contigs) and a reverse G→A projection (r-prefixed contigs) of each chromosome.
Seed remap to original coordinates. Every seed is mapped back to original genome coordinates, and the contig prefix it came from sets a strand hypothesis: f → OT (top strand), r → OB (bottom strand). This hypothesis selects the per-strand matrix and feeds the Bismark XG:Z tag.
4-letter extension and scoring. The original read is extended and scored against the original 4-letter reference window using the per-strand asymmetric matrix (see --meth-scoring). OT frees ref-C × read-T (the unmethylated C→T conversion); OB frees ref-G × read-A. The seed’s own true score is recomputed in this matrix too, so a seed-internal variant correctly lowers the alignment score rather than being assumed a perfect match.
Original-alphabet output. Records are written against the original chromosome names and coordinates, with the original read bases in SEQ, plus Bismark XR:Z (read conversion), XG:Z (genome strand), and XM:Z (per-base methylation call) tags. Optional --chimera-qc (off by default, matching Bismark) flags chimeric reads. The @PG ID:bwa-mem3-meth line records the command line. Output is uncompressed BAM (wb0); pipe directly to samtools sort.

Quick-start commands

# Index once: builds the normal index at the bare prefix PLUS a .meth seed index.
bwa-mem3 index --meth ref.fa

# Align paired-end FASTQs (collapsed = bwameth-compatible placement, the default).
bwa-mem3 mem --meth -t 16 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -o out.bam
samtools index out.bam

# Opt into variant-aware scoring (truthful NM/MD; BAM usable for variant calling).
bwa-mem3 mem --meth --meth-scoring genomic -t 16 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -o out.genomic.bam

Note — scoring defaults

--meth applies -L 10 -U 100 -T 40 -M -C in both modes, plus the mode-dependent mismatch penalty: -B 2 for collapsed, -B 4 for genomic. These mirror bwameth’s bwa mem -T 40 -B 2 -L 10 -CM (with -U 100 for paired-end). The scoring values (-B, -L, -U, -T) can be overridden on the command line, in any position relative to --meth. -M and -C cannot — bwa has no option that unsets them, so --meth applies them unconditionally.

These constants are quoted at bwa’s default match score (-A 1, what bwameth runs). Like every other score-derived default, they scale with -A: under -A 2 the effective values are -L 20 -U 200 -T 80 and -B 4 (collapsed) / -B 8 (genomic).
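The scaling rule can be written down as a small lookup. This is an illustrative sketch only: effective_meth_defaults is a hypothetical name, and the assumption that every value scales linearly with -A is inferred from the -A 2 example above rather than taken from the source code.

```python
def effective_meth_defaults(match_score=1, mode="collapsed"):
    """Return the --meth scoring defaults after scaling by the -A match score.

    Assumption: each default is its -A 1 value multiplied by -A, which
    reproduces the -A 2 figures quoted above.
    """
    if mode not in ("collapsed", "genomic"):
        raise ValueError("mode must be 'collapsed' or 'genomic'")
    base = {
        "L": 10,   # clipping penalty
        "U": 100,  # unpaired read pair penalty
        "T": 40,   # minimum score to output
        "B": 2 if mode == "collapsed" else 4,  # mismatch penalty
    }
    return {flag: value * match_score for flag, value in base.items()}


if __name__ == "__main__":
    for mode in ("collapsed", "genomic"):
        print(mode, effective_meth_defaults(2, mode))
```

Values passed explicitly on the command line would replace the corresponding entries; the sketch covers only the defaults.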

bwameth.py Drop-In Mapping

bwa-mem3 --meth is designed so that, in its default collapsed mode, read placement closely tracks the bwameth.py pipeline for the standard case (with a small, bounded divergence — see the callout below), while emitting the Bismark tag set methylation callers expect. This page explains what is the same, what differs, and where the two approaches diverge by design.

Important — placement drop-in, not byte-identical, and not drift-free

collapsed approximates bwameth’s placement (where reads map and their primary/MAPQ behavior) because both treat C/T and G/A as interchangeable — but it is neither a byte-for-byte reproduction nor drift-free. bwa-mem3 scores against the original 4-letter reference rather than in collapsed space, so a small but nonzero fraction of records differ from bwameth.py in POS, CIGAR, or MAPQ — on the order of ~1% of records on typical WGBS/EM-seq, with true mapped-position (POS) changes affecting a smaller subset (a few tenths of a percent). The 4-letter scoring path widens this versus a pure collapsed-space aligner — and versus the older 3-letter --meth releases — and the opt-in genomic mode diverges further on purpose (it penalizes real variants). The tag schema is also Bismark (XR/XG/XM), not bwameth (YS/YC/YD).

If you are pinned to a specific bwameth release (e.g. a clinical pipeline validated against a bwameth version), treat collapsed as a new aligner and re-validate against your own bwameth output — do not assume placement equivalence. The divergence is small and bounded, but it is real and it is larger than the older 3-letter --meth path.

Command comparison

bwameth.py pipeline (multi-step)

# Step 1: build a single doubled (c2t) reference
bwameth.py index ref.fa                # writes ref.fa.bwameth.c2t + FMI

# Step 2: align (bwameth.py converts reads, calls bwa/bwa-mem2, post-processes)
bwameth.py map --bwa-mem2 -t 16 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -o out.bam
samtools index out.bam

bwa-mem3 --meth (single binary)

# Step 1: build a dual index (original index + .meth seed index)
bwa-mem3 index --meth ref.fa           # writes ref.fa.* AND ref.fa.meth.*

# Step 2: align (inline seed projection + 4-letter scoring + post-processing)
bwa-mem3 mem --meth -t 16 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -o out.bam
samtools index out.bam

The index layouts differ. bwameth.py builds one collapsed doubled reference (ref.fa.bwameth.c2t + FMI) and aligns entirely against it. bwa-mem3 builds two indexes: the normal 4-letter index at the bare prefix (for scoring/extension) and a converted seed index ref.fa.meth.* (for seeding only). A legacy bwameth .bwameth.c2t index is not used directly — rebuild with index --meth (see Migrating from bwameth.py c2t).

What is gained

No Python or bwameth.py dependency. Read seeding, 4-letter scoring, and BAM post-processing all run inside a single bwa-mem3 process. One binary, no virtual environment, no bwameth.py version pinning.

No intermediate files. No converted FASTQ is written; the C→T / G→A projection is applied in-memory to the seeding copy of each read.

Variant-aware option. --meth-scoring genomic scores real C/T and G/A variants as mismatches, so a single BAM supports both methylation calling and variant calling — something a collapsed-space aligner cannot produce.

Inline BAM post-processing. Header rewriting, Bismark XR/XG/XM tags, opt-in chimera QC (--chimera-qc), and QC-fail propagation happen in the same pass. Output is uncompressed BAM (wb0) that samtools sort reads natively.

bwameth-aligned defaults (collapsed). --meth-scoring collapsed applies -B 2 -L 10 -U 100 -T 40 -M -C, mirroring bwameth’s bwa mem -T 40 -B 2 -L 10 -CM (plus -U 100 for paired-end). genomic uses the same set but keeps -B 4. The scoring parameters (-B, -L, -U, -T) can be overridden on the command line, in any position relative to --meth. -M and -C cannot: bwa has no option that unsets them, so --meth applies them unconditionally.

These constants are quoted at bwa’s default match score (-A 1, what bwameth runs). Like every other score-derived default, they scale with -A: under -A 2 the effective values are -L 20 -U 200 -T 80 and -B 4 (collapsed) / -B 8 (genomic).

What stays the same (collapsed mode)

The output BAM carries the standard methylation tag set, flags, and SEQ representation, and read placement closely tracks bwameth in the standard case. “Stays the same” here means functionally equivalent, not identical: as the callout above notes, ~1% of records still differ in POS/CIGAR/MAPQ. The @PG provenance line and the tag schema intentionally differ:

Field	bwameth.py	bwa-mem3 --meth
`@SQ` headers	One per real chromosome	One per real chromosome
Read placement (collapsed)	reference	Closely tracks in the standard case; ~1% of records differ in `POS`/`CIGAR`/`MAPQ` (re-validate if pinned to a bwameth release)
Methylation aux tags	`YS:Z`, `YC:Z`, `YD:Z`	`XR:Z`, `XG:Z`, `XM:Z` (Bismark)
`@PG`	`ID:bwameth`	`ID:bwa-mem3-meth`
Chimera QC threshold	Longest M < 44% of read	Same (44%), opt-in via `--chimera-qc`
Chimera QC flags	`0x200`, clear `0x2`, MAPQ ≤ 1	Same
SEQ field	Pre-conversion bases (RC-flipped when `is_rev`)	Same
`NM`/`MD`	Collapsed (conversions and real variants both hidden)	Conversions hidden; real variants hidden in `collapsed`, shown in `genomic`

bwa-mem3 emits the Bismark-compatible XR:Z / XG:Z / XM:Z tag set rather than bwameth’s YS:Z / YC:Z / YD:Z, so output is directly consumable by bismark_methylation_extractor, methylKit, methtuple, DMRfinder, and epialleleR in addition to MethylDackel and biscuit. Tools that expect YS/YC/YD must be pointed at the corresponding XR/XG (and per-base XM) tags.

When to prefer bwameth.py

If your workflow requires bwameth.py-specific features (e.g. bwameth.py markduplicates or non-standard post-processors), or strict byte-for-byte reproduction of a bwameth release, continue using bwameth.py. bwa-mem3 --meth targets the indexing + alignment + standard post-processing path, with bwameth-compatible placement (collapsed) or variant-aware scoring (genomic).

Conversion Details (C→T, G→A)

Bisulfite sequencing converts unmethylated cytosines to uracil (read as thymine after PCR). bwa-mem3 --meth uses a C→T / G→A projection to find seeds, but — unlike a classic 3-letter aligner — it then scores and reports against the original 4-letter reference. This page describes what gets projected, where, and how the original bases come back for scoring and output.

What gets projected (for seeding only)

Paired-end bisulfite reads follow a strand convention:

R1 (read 1): every C is replaced with T (models the OT / CTOB strands).
R2 (read 2): every G is replaced with A (models the OB / CTOT strands).

Single-end mode uses the R1 (C→T) rule for all reads.

This projection is applied only to the copy of the read used for seeding. The .meth doubled seed index built by bwa-mem3 index --meth holds two projections of each chromosome:

f-prefixed contigs (e.g. fchr1): the chromosome with every C → T.
r-prefixed contigs (e.g. rchr1): the reverse-complement strand with every G → A.

Projected R1 reads seed against f-prefixed contigs and projected R2 reads against r-prefixed contigs. The contig prefix records the strand hypothesis (f → OT, r → OB), which both selects the per-strand scoring matrix and feeds the Bismark XG:Z tag (CT for OT, GA for OB).
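The read-side projection rule above is simple enough to sketch directly. The function and dictionary key names below are hypothetical; only the C→T / G→A rule and the idea of keeping the original bases alongside the seeding copy come from the text.

```python
def project_for_seeding(seq, is_read2=False):
    """Project a read into 3-letter space: C->T for R1/SE, G->A for R2."""
    table = str.maketrans("Gg", "Aa") if is_read2 else str.maketrans("Cc", "Tt")
    return seq.translate(table)


def ingest(seq, is_read2=False):
    """Keep the original bases next to the projected seeding copy.

    'orig_seq' plays the role the text assigns to bseq1_t.meth_orig_seq;
    the dict layout itself is only an illustration.
    """
    return {"orig_seq": seq, "seed_seq": project_for_seeding(seq, is_read2)}


if __name__ == "__main__":
    print(ingest("ACGTC"))                 # R1 or single-end read
    print(ingest("ACGTG", is_read2=True))  # R2 read
```

Only seed_seq is used to find seeds; orig_seq is what later drives scoring and the SEQ field.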

Key difference from bwameth / 3-letter aligners

A classic 3-letter aligner also converts the reference and the read, then scores in collapsed space — so a real C/T or G/A variant is invisible. Here the projection is used only to locate seeds; extension and scoring run on the original bases against the original reference (next sections). That is what lets --meth-scoring genomic tell a real variant apart from a conversion.

Where projection happens

Seed projection runs inside src/fastmap.cpp in the meth_mode ingest block, right after sequence parsing. It writes the projected bases into the in-memory seeding buffer; the original FASTQ is never rewritten.

Before projecting, the original sequence is preserved on a first-class per-read field, bseq1_t.meth_orig_seq. (For interoperability with an external c2t converter that does not populate that field, the original bases can also be carried on the read comment as YS:Z:<l_seq bases>\tYC:Z:<direction>, where <direction> is CT for R1 and GA for R2.) These carriers are internal: they are not emitted to the output BAM.

Scoring against the original reference

After seeds are remapped to original coordinates with their OT/OB hypothesis, bwa-mem3 extends and scores the original read against the original 4-letter reference window using the per-strand asymmetric matrix:

OT frees ref-C × read-T (the expected unmethylated C→T conversion).
OB frees ref-G × read-A (the expected bottom-strand G→A conversion).

Under --meth-scoring collapsed (default) the mirror cell is freed too (ref-T × read-C, ref-A × read-G), so C/T and G/A are interchangeable and placement matches bwameth. Under --meth-scoring genomic only the conversion direction is freed, so a real variant stays a mismatch. See Overview → --meth-scoring.

The seed’s own ungapped score is recomputed in the same matrix (not assumed to be a perfect len × match), so under --meth-scoring genomic a seed-internal C/T or G/A variant correctly lowers the alignment score, AS, and MAPQ. Under --meth-scoring collapsed the mirror cell is freed, so such a variant is scored as a conversion and does not penalize placement.
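As a worked illustration of the two matrices, the sketch below scores an ungapped read against a reference window. It is a simplification under stated assumptions: the names freed_cells and ungapped_score are hypothetical, "freed" is taken to mean "scored as a match", N bases and gaps are ignored, and the default penalties are the -A 1 values quoted in this documentation.

```python
def freed_cells(hypothesis, mode):
    """Return the (ref_base, read_base) pairs that score as matches."""
    conversion = {"OT": ("C", "T"), "OB": ("G", "A")}[hypothesis]
    cells = {conversion}
    if mode == "collapsed":
        # Mirror cell: ref-T x read-C (OT) or ref-A x read-G (OB).
        cells.add(conversion[::-1])
    elif mode != "genomic":
        raise ValueError("mode must be 'collapsed' or 'genomic'")
    return cells


def ungapped_score(ref, read, hypothesis, mode, match=1, mismatch=None):
    """Score read vs ref base by base with the per-strand asymmetric matrix."""
    if mismatch is None:
        mismatch = 2 if mode == "collapsed" else 4  # -B default per mode
    free = freed_cells(hypothesis, mode)
    score = 0
    for r, q in zip(ref, read):
        score += match if (r == q or (r, q) in free) else -mismatch
    return score


if __name__ == "__main__":
    # ref-C x read-T is a conversion: free in both modes.
    print(ungapped_score("ACGT", "ATGT", "OT", "genomic"))
    # ref-T x read-C is a real variant: free only in collapsed.
    print(ungapped_score("ATGT", "ACGT", "OT", "collapsed"))
    print(ungapped_score("ATGT", "ACGT", "OT", "genomic"))
```

The last two calls differ by 5 points, which is the match score plus the genomic mismatch penalty for the single variant base.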

Sequence restoration in the BAM SEQ field

Methylation callers (MethylDackel, Bismark tools) read the BAM SEQ field to see real C/T bases, not the projected T/A. meth_mem_aln_to_bam (in src/meth_bam.cpp) restores the original bases before writing each record:

The original bases come from bseq1_t.meth_orig_seq (the first-class field), falling back to the YS:Z comment carrier only when that field is absent.
For forward-aligned records (!p.is_rev), the original bases are copied directly into SEQ.
For reverse-aligned records (p.is_rev), they are reverse-complemented with the standard TGCAN table.
If neither carrier is available (e.g. an external c2t converter that emits neither), the code falls back to the seeding buffer in s->seq, with the same RC flip.
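The restoration order and the reverse-complement step can be sketched as follows. The helper names are hypothetical and the [qb, qe) trimming is left out; the fallback order and the TGCAN complement table follow the list above.

```python
# Complement table for A, C, G, T, N (the "TGCAN" table named above).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def pick_source(meth_orig_seq=None, ys_comment=None, seed_seq=None):
    """Choose the base source: first-class field, then YS:Z carrier, then seed buffer."""
    for candidate in (meth_orig_seq, ys_comment, seed_seq):
        if candidate:
            return candidate
    raise ValueError("no sequence source available")


def restore_seq(bases, is_rev):
    """Return the bases for SEQ: copied as-is, or reverse-complemented if is_rev."""
    return bases.translate(_COMPLEMENT)[::-1] if is_rev else bases


if __name__ == "__main__":
    source = pick_source(meth_orig_seq="ACGTN", seed_seq="ATGTN")
    print(restore_seq(source, is_rev=False))
    print(restore_seq(source, is_rev=True))
```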

Warning — Soft-clip and supplementary trimming

When computing the SEQ range for supplementary alignments, the qb/qe boundaries account for soft-clip / hard-clip operations at the CIGAR ends. The restoration applies over the same trimmed range, so SEQ length always matches the emitted CIGAR.

QUAL field handling

The QUAL field is taken directly from the original FASTQ (bseq1_t.qual) over the same [qb, qe) range and is never modified. Quality scores correspond to the original base calls.

Relationship to the reference index

bwa-mem3 index --meth ref.fa writes two indexes:

the normal 4-letter index at the bare prefix (ref.fa.amb, .ann, .bwt.2bit.64, .pac) — used for scoring/extension against the original reference, and
the converted seed index ref.fa.meth.* (built over a per-strand-converted FASTA ref.fa.meth.fa with the f/r doubled contigs) — used only for seeding.

Neither index writes an unpacked .0123: seeding uses the seed FM-index, and scoring/extension pac-fetches the original reference’s bases from ref.fa.pac on demand. So the original .0123 (~6.4 GB) is unnecessary and the seed’s unpacked bases are never read (~13 GB) — saving ~19 GB of disk on hg38, while the runtime RSS reduction comes from avoiding the original .0123 load (~6.4 GB).

This dual-index layout differs from bwameth.py, which builds a single ref.fa.bwameth.c2t doubled reference and aligns entirely against it. A legacy bwameth .bwameth.c2t index cannot be reused directly — rebuild with index --meth (see Migrating from bwameth.py c2t).

SAM Tags: XR, XG, XM (Bismark-compatible)

bwa-mem3 mem --meth emits three Bismark-compatible auxiliary tags on each output record: XR:Z, XG:Z, and XM:Z. These tags are read by bismark_methylation_extractor, deduplicate_bismark, methylKit processBismarkAln, methtuple, DMRfinder, epialleleR, MethylDackel, and biscuit’s per-read methylation tools.

Tag reference

`XR:Z` — read conversion direction

Property	Value
Type	`Z` (NUL-terminated string)
Values	`CT` (R1 / SE) or `GA` (R2)
Set by	`meth_mem_aln_to_bam` from the read’s conversion direction (R1 → `CT`, R2 → `GA`)
Emitted on	All records (mapped and unmapped)

XR:Z records which conversion was applied to the read at FASTQ ingest:

CT — C→T conversion applied; this is an R1 read or single-end read.
GA — G→A conversion applied; this is an R2 read.

`XG:Z` — genome strand of the alignment

Property	Value
Type	`Z` (NUL-terminated string)
Values	`CT` (aligned to original top, `f`-prefixed contig) or `GA` (aligned to original bottom, `r`-prefixed contig)
Set by	`meth_mem_aln_to_bam` from `meth_chrom_map_t.direction`
Emitted on	Mapped records only

XG:Z indicates which doubled-reference strand the read aligned to:

CT — read aligned under the OT (top-strand) hypothesis (f-prefixed seed contig).
GA — read aligned under the OB (bottom-strand) hypothesis (r-prefixed seed contig).

For properly paired directional reads, R1 and R2 of a fragment naturally share XG:Z. Discordant pairs (flagged with 0x200 only when --chimera-qc is enabled) may see XG:Z diverge between mates.

`XM:Z` — methylation call string

Property	Value
Type	`Z` (NUL-terminated string)
Length	Equal to `SEQ` length
Set by	`meth_build_xm` (`src/meth_xm.cpp`) walking SEQ-orientation read against un-converted ref
Emitted on	Mapped records only

Per-base methylation call. Each character corresponds to one SEQ base:

char	meaning
`z` / `Z`	unmethylated / methylated C in CpG context
`x` / `X`	unmethylated / methylated C in CHG context
`h` / `H`	unmethylated / methylated C in CHH context
`u` / `U`	unmethylated / methylated C in unknown context (N within 1 or 2 bp downstream of the C, on the read’s source strand)
`.`	non-C reference at this position, sequencing mismatch (read base ≠ C/T at a ref C), insertion, soft clip, or N at the C position itself

The string is in SEQ orientation (matches the BAM SEQ field): for reads with the 0x10 flag set, both SEQ and XM:Z are reverse-complemented relative to FASTQ-original orientation.

Computation

Under --meth, the original 4-letter reference is loaded directly alongside the .meth seed index — its bns/pac handles, via meth_orig_ref_load_handles in src/fastmap.cpp. (No conversion or fold is needed: unlike the retired D1 design, the reference used for scoring and XM is the unmodified reference, not a recovered projection of the doubled c2t index.)

Per mapped record, meth_build_xm (src/meth_bam.cpp) slices the original forward-strand reference window at the read’s footprint plus 2 bp of context on either side, then walks the BAM CIGAR jointly over the restored SEQ and the ref window. The classifier matches Bismark’s methylation_call:

match position with ref[t] == 'C' (top strand) or 'G' (bottom strand):
    determine context from ref[t±1], ref[t±2]
        N in either context base    -> u/U (unknown context)
        ref[t±1] == G/C (per strand) -> z/Z (CpG)
        ref[t±2] == G/C (per strand) -> x/X (CHG)
        otherwise                    -> h/H (CHH)
    determine methylation:
        read base == C/G (per strand) -> uppercase (methylated)
        read base == T/A (per strand) -> lowercase (unmethylated)
        otherwise                      -> '.'
insertion / soft clip                  -> '.' per consumed read base
deletion / N op                         -> no XM emit
hard clip / pad                         -> no XM emit

The top vs bottom strand choice is driven by XG:Z (= the winning OT/OB hypothesis, p.meth_hypothesis), not by the SAM 0x10 (RC) flag. CTOT reads (R2 mapped forward to a top-strand contig with 0x10 set) and OB reads (R1 mapped RC to a bottom-strand contig) are both handled by reading the rule table from the strand encoded in XG. The walk runs in SEQ orientation throughout — no RC of the ref slice or the read.

For bottom-strand methylation, the C of interest at forward position P is encoded as a G on the forward strand (complement of bottom-strand C). The downstream context on the bottom strand corresponds to upstream positions on the forward strand; the classifier indexes ref[t-1] and ref[t-2] instead of ref[t+1] and ref[t+2], and looks for a C (forward) instead of a G to flag CpG.
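The per-position rule table above can be sketched for aligned (match) positions only. This is a simplified illustration, not meth_build_xm itself: xm_call is a hypothetical name, CIGAR handling is omitted, the reference is assumed to be uppercase, and positions outside the supplied window are treated as N (an assumption for the sketch).

```python
def xm_call(ref, t, read_base, top_strand=True):
    """Return the XM character for one aligned position.

    ref        : forward-strand reference window (uppercase)
    t          : index of the aligned reference base within `ref`
    read_base  : read base in SEQ orientation
    top_strand : True for XG:Z:CT (OT hypothesis), False for XG:Z:GA (OB)
    """
    if top_strand:
        target, meth, unmeth, step, partner = "C", "C", "T", +1, "G"
    else:
        # Bottom-strand C shows up as G on the forward strand, and its
        # downstream context lies at lower forward coordinates.
        target, meth, unmeth, step, partner = "G", "G", "A", -1, "C"
    if ref[t] != target:
        return "."

    def context(offset):
        i = t + step * offset
        return ref[i] if 0 <= i < len(ref) else "N"  # window edge treated as N

    c1, c2 = context(1), context(2)
    if "N" in (c1, c2):
        letter = "u"   # unknown context
    elif c1 == partner:
        letter = "z"   # CpG
    elif c2 == partner:
        letter = "x"   # CHG
    else:
        letter = "h"   # CHH
    if read_base == meth:
        return letter.upper()  # methylated
    if read_base == unmeth:
        return letter          # unmethylated
    return "."                 # sequencing mismatch


if __name__ == "__main__":
    ref = "ACGTT"
    print("".join(xm_call(ref, i, b) for i, b in enumerate("ATGTT")))
```

Reading the rule set from the strand flag rather than from 0x10 mirrors the point made above: the strand hypothesis, not the read orientation, decides which reference base is the cytosine of interest.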

Inspecting tags with samtools

samtools view out.bam | head -1 | tr '\t' '\n' | grep -E '^X[RGM]:'

Expected output looks like:

XR:Z:CT
XG:Z:CT
XM:Z:..z..h..Z..x..h.....Z..

Chimera QC and Header Rewriting

After the alignment kernel produces mem_aln_t records, bwa-mem3 --meth applies a set of post-processing steps before writing BAM output. These steps are implemented in src/meth_bam.cpp and run in the same process, in the same pass over the aligned records.

`@SQ` header consolidation

The .meth seed reference (ref.fa.meth.fa) contains two contigs for each chromosome:

fchr1, fchr2, … — C→T projections of each chromosome.
rchr1, rchr2, … — G→A projections of each chromosome.

If the raw alignment header were written directly, every downstream tool would see twice as many sequences as there are real chromosomes, with unfamiliar f/r-prefixed names. meth_bam_writer_open instead builds a consolidated header using the meth_chrom_map_t:

meth_chrom_map_build_from_bns iterates over bns->anns and strips the leading f/r from each contig name.
The first contig with a given stripped name registers that name in the output list; subsequent contigs with the same stripped name map to the same output index.
The BAM @SQ lines are written from the consolidated list — one SN: per real chromosome.

RNAME, RNEXT, and SA/XA tag contig references in every record are rewritten through cmap->out_tid and cmap->output_names so they reference the consolidated names. The mapping from internal (doubled-ref) contig index to output contig index is cmap->out_tid[p.rid].
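The consolidation steps can be sketched as a single pass over the seed contig names. The function name is hypothetical, and the sketch assumes every seed contig carries exactly one leading f or r character, as in the fchr1 / rchr1 examples above.

```python
def consolidate_contigs(seed_names):
    """Map doubled seed contigs to one output contig per real chromosome.

    Returns (output_names, out_tid), where out_tid[i] is the output index
    for seed contig i, in the spirit of cmap->output_names / cmap->out_tid.
    """
    output_names = []
    index_of = {}
    out_tid = []
    for name in seed_names:
        if name[:1] not in ("f", "r"):
            raise ValueError("expected an f- or r-prefixed contig: %r" % name)
        real = name[1:]  # strip the strand prefix
        if real not in index_of:
            # First sighting registers the name in the output list.
            index_of[real] = len(output_names)
            output_names.append(real)
        out_tid.append(index_of[real])
    return output_names, out_tid


if __name__ == "__main__":
    names, tids = consolidate_contigs(["fchr1", "rchr1", "fchr2", "rchr2"])
    print(names, tids)
```

Because both strands of a chromosome share one output index, mates placed on fchr1 and rchr1 compare as the same reference when TLEN is computed.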

Note — TLEN computation uses consolidated TIDs

Template length (TLEN) is computed using the consolidated output TIDs, not the internal p.rid values. Two mates that rescue onto fchr1 and rchr1 respectively both map to output chr1, so TLEN is reported as a non-zero distance rather than zero (which would happen if the mismatched internal TIDs were used).

Chimera QC heuristic (opt-in)

bwameth.py applies a heuristic to flag reads that look like chimeric fragments: if the longest contiguous alignment run (counted over M/=/X CIGAR operations) covers less than 44% of the read length, the read is considered a potential chimera. Bismark does not apply this kind of heuristic.

bwa-mem3 --meth makes this opt-in via --chimera-qc (default off, so the runtime posture matches Bismark). When enabled, the check inside meth_mem_aln_to_bam does:

if (100 * longest_M_run < 44 * l_seq):
    flag  |=  0x200   # set QC fail
    flag  &= ~0x2     # clear proper pair
    mapq   =  min(mapq, 1)

The threshold constant is MIN_LONGEST_M_PCT = 44 (defined at the top of src/meth_bam.cpp). The longest run is computed by cigar_longest_m_mem from src/cigar_util.cpp, which counts M, =, and X operations.

The chimera heuristic is only applied to mapped records (!(flag & 0x4) && direction != 0). Unmapped records are not touched.
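A runnable version of the check, working on a CIGAR string, might look like the sketch below. The function names are hypothetical, and treating adjacent M/=/X operations as one contiguous run is an assumption; the source only states that cigar_longest_m_mem counts M, =, and X operations.

```python
import re

MIN_LONGEST_M_PCT = 44  # threshold constant quoted above


def longest_m_run(cigar):
    """Longest contiguous stretch of M/=/X bases in a CIGAR string."""
    longest = run = 0
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        if op in "M=X":
            run += int(length)
            longest = max(longest, run)
        else:
            run = 0  # any other operation ends the current run
    return longest


def apply_chimera_qc(flag, mapq, cigar, l_seq):
    """Apply the 44% rule to one mapped record; returns (flag, mapq)."""
    if flag & 0x4:  # unmapped records are not touched
        return flag, mapq
    if 100 * longest_m_run(cigar) < MIN_LONGEST_M_PCT * l_seq:
        flag |= 0x200        # set QC fail
        flag &= ~0x2         # clear proper pair
        mapq = min(mapq, 1)  # cap MAPQ at 1
    return flag, mapq


if __name__ == "__main__":
    print(apply_chimera_qc(99, 60, "30M70S", 100))  # 30% aligned
    print(apply_chimera_qc(99, 60, "50M50S", 100))  # 50% aligned
```

The integer comparison 100 * run < 44 * l_seq avoids floating-point rounding at the boundary, so a read with exactly 44% aligned is not flagged.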

See Flags for when to use --chimera-qc (PBAT / scBS-Seq; bwameth.py-equivalence runs).

`--set-as-failed` strand filtering

Before the chimera check, meth_mem_aln_to_bam checks whether opt->meth_set_as_failed is set and matches the record’s strand direction:

if (meth_set_as_failed != 0 && meth_set_as_failed == direction):
    flag |= 0x200

This unconditionally marks all alignments to the specified strand (f or r) as QC-failed before chimera logic runs. The chimera check then applies on top of the already-set fail flag.

Pair-level QC-fail propagation

Once per query-name group (all records sharing the same QNAME; not an @RG read group), after individual records have been processed:

meth_bam_group_propagate_qcfail(group, n)

This function scans all records in the group. If any record has 0x200 set, it propagates that flag to every other record in the group and clears 0x2 (proper pair) on all of them. This ensures that a chimeric or strand-filtered primary alignment also marks its split hits and the mate as QC-failed, preventing inconsistent flag states in the output BAM. (Under --meth those split hits carry 0x100, not 0x800 — see -M and split alignments.)
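The propagation rule reduces to a few lines when the records are represented by their FLAG values alone. This is a sketch of the behavior described above, with a hypothetical function name; the real function operates on BAM records.

```python
QCFAIL = 0x200
PROPER_PAIR = 0x2


def propagate_qcfail(flags):
    """Spread 0x200 across a query-name group and clear 0x2 on every record."""
    if not any(f & QCFAIL for f in flags):
        return list(flags)  # nothing failed: the group is left unchanged
    return [(f | QCFAIL) & ~PROPER_PAIR for f in flags]


if __name__ == "__main__":
    # Mate 2 was marked QC-failed; mate 1 inherits the flag.
    print(propagate_qcfail([99, 147 | QCFAIL]))
    # No failure in the group: flags pass through untouched.
    print(propagate_qcfail([99, 147]))
```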

`@PG ID:bwa-mem3-meth` insertion

meth_bam_writer_open appends a @PG line to the header after the original bwa-mem3 @PG entry:

@PG  ID:bwa-mem3-meth  PN:bwa-mem3-meth  VN:<version>-meth  CL:<command line>

The <command line> field is the full bwa-mem3 mem --meth ... invocation with embedded tab characters replaced by spaces (htslib does not permit literal tabs in @PG CL: fields). This records the exact parameters used for provenance and reproducibility.

Tip — Verifying the header

After alignment, confirm consolidation and provenance with:
samtools view -H out.bam | grep -E '^@SQ|^@PG'
You should see one @SQ line per chromosome (no f/r prefixes) and both @PG ID:bwa-mem3 and @PG ID:bwa-mem3-meth entries.

Flags: --meth-scoring, --set-as-failed, --chimera-qc

bwa-mem3 --meth adds three flags. --meth-scoring selects the bisulfite scoring model; --set-as-failed and --chimera-qc control QC behavior during BAM post-processing (both affect the chimera QC and strand-filtering logic inside meth_mem_aln_to_bam, src/meth_bam.cpp).

`--meth-scoring {collapsed|genomic}`

Selects how the 4-letter scoring matrix treats bisulfite-converted bases. bwa-mem3 --meth scores against the original reference, so it can either collapse the conversion (bwameth-style) or keep it variant-aware.

Accepted values:

collapsed (default) — free both conversion directions: C↔T and G↔A are interchangeable (a two-cell matrix). Reproduces bwameth’s collapsed-space placement and sets the mismatch penalty to -B 2. Use this when you need bwameth-compatible read placement (the drop-in default).
genomic — free only the conversion direction (a one-cell matrix), so the mirror cell stays a real mismatch. A genuine C/T or G/A variant is penalized, making NM/MD truthful and the BAM usable for variant calling. Keeps bwa’s default -B 4.

Important — collapsed is a placement drop-in, not byte-identical to bwameth

collapsed closely tracks bwameth’s placement but scores against the original 4-letter reference, so ~1% of records differ from bwameth.py in POS/CIGAR/MAPQ. If you are pinned to a specific bwameth release, re-validate against your own bwameth output — see bwameth.py drop-in mapping for the full caveat.

Effect on output:

The mode changes alignment score, MAPQ, NM, MD, and occasionally placement and CIGAR. On a real C/T (or G/A) variant under a seed, genomic lowers the score by A + B (the -A match score plus the -B mismatch penalty) relative to collapsed, because the freed match becomes a mismatch. That can break paralog ties in genomic’s favor and avoid spurious indels. The Bismark XR/XG/XM tags and the SEQ field are identical in both modes.
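The score difference can be made concrete with a toy scoring matrix. The sketch below models only the C→T strand and assumes the freed cells are (reference C, read T) for genomic plus the mirror cell (reference T, read C) for collapsed; the helper strand_matrix and the cell layout are illustrative assumptions, not the aligner's internal representation. The values of -A and -B are the defaults stated in this section (A = 1; B = 2 for collapsed, B = 4 for genomic).

```python
BASES = "ACGT"


def strand_matrix(match, mismatch, freed):
    """Return {(ref, read): score}; pairs listed in `freed` score as matches."""
    return {
        (ref, read): match if ref == read or (ref, read) in freed else -mismatch
        for ref in BASES
        for read in BASES
    }


# Hypothetical C->T strand matrices under each mode.
genomic = strand_matrix(1, 4, {("C", "T")})                # one freed cell
collapsed = strand_matrix(1, 2, {("C", "T"), ("T", "C")})  # mirror cell freed too

# A read C over a reference T is a genuine variant on this strand.
delta = collapsed[("T", "C")] - genomic[("T", "C")]
```

Here delta works out to A + B = 1 + 4 = 5: collapsed scores the cell as a match (+1) while genomic scores it as a mismatch (-4). The conversion cell itself, reference C read as T, scores +1 in both modes.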

When to use it:

Keep collapsed for methylation-only workflows that must match bwameth placement (e.g. clinical pipelines validated against a bwameth release). Choose genomic when you want one BAM that serves both methylation and variant calling, or want the aligner to distinguish real variants from conversions.

Note — -B follows the mode, but you can override it

collapsed sets -B 2 and genomic keeps -B 4 by default. An explicit -B overrides the mode default and still reaches the per-strand matrices, whether it appears before or after --meth. The other --meth defaults (-L 10 -U 100 -T 40 -M -C) are the same in both modes.

`--set-as-failed {f|r}`

Marks every alignment to the specified strand as QC-failed (0x200) regardless of alignment quality or CIGAR structure.

Accepted values:

f — flag all alignments to f-prefixed contigs (C→T top-strand projection).
r — flag all alignments to r-prefixed contigs (G→A bottom-strand projection).

Effect on records:

When --set-as-failed f (or r) is set and a mapped record’s strand matches the specified value, the record’s SAM flag has 0x200 set. If --chimera-qc is also active, the chimera heuristic runs on top, possibly clearing 0x2 and capping MAPQ. QC-fail propagation then spreads the flag to all records in the read group.

When to use it:

Some experimental designs produce reads that are expected to align exclusively to one strand. Flagging the other strand as QC-failed before downstream analysis prevents spurious methylation calls from mis-strand alignments. It is also useful for diagnosing library preparation issues: run once with --set-as-failed r and once without to compare yield on each strand.

Warning — All records on the strand are flagged

--set-as-failed is a blunt instrument. It marks every alignment to the chosen strand, including correctly aligned reads that simply happened to land on the complementary strand due to library structure. Use this flag only when your library is expected to be strand-specific.

`--chimera-qc`

Enables the bwameth.py-style longest-M chimera heuristic. Off by default; this is the Bismark-equivalent posture, since Bismark itself does not apply this kind of QC heuristic.

When --chimera-qc is set, any mapped record whose longest M/=/X CIGAR run covers less than 44 % of the read length receives:

0x200 (QC fail) set.
0x2 (proper pair) cleared.
MAPQ capped at 1.

QC-fail propagation across the read group also applies.

When to use it:

The 44 % threshold was calibrated by bwameth.py for standard mammalian whole-genome bisulfite-sequencing (WGBS) libraries with typical read lengths and is helpful on PBAT / scBS-Seq libraries where intra-fragment chimeras are common. For Bismark-equivalent output (and most directional EM-seq / WGBS workflows), leave it off.

It is also useful when benchmarking: comparing bwa-mem3 --meth output against bwameth.py output is cleaner with --chimera-qc enabled, since bwameth.py’s chimera logic always runs.

Note — Pair-level propagation still applies

--chimera-qc controls only whether the heuristic itself runs. --set-as-failed is independent: when active, those flags are still set, and meth_bam_group_propagate_qcfail propagates any 0x200 flags across the read group regardless of --chimera-qc.

Flag interaction summary

Condition	`0x200` set?	`0x2` cleared?	MAPQ capped?
Normal aligned record (default, no flags)	No	No	No
`--chimera-qc` triggers (longest M/=/X < 44%)	Yes	Yes	Yes (≤1)
`--set-as-failed` strand matches	Yes	No	No
Both `--chimera-qc` + `--set-as-failed` active	Yes	Yes	Yes (≤1)

`-M` and split alignments

--meth sets -M and -C unconditionally (bwa has no option that unsets either), so unlike the scoring defaults these two cannot be overridden.

-M changes how a split (chimeric) hit is flagged, not whether it is emitted: supplementary (0x800) becomes secondary (0x100). The record is otherwise identical — SA:Z, SEQ, QUAL, NM/MD and the methylation tags are all retained. So samtools flagstat on a --meth BAM always reports 0 supplementary, and anything selecting split-read evidence by the 0x800 bit will miss it. Methylation calling is unaffected.

Note — on short reads, -T 40 filters far more split hits than -M does

A split arm covers only part of the read, so its score scales with arm length; below T it is dropped outright, before -M is consulted. On 1M Twist EM-seq pairs (75 bp) vs hg38, --meth emitted 94 split records against 4039 at -T 30 — and of those 4039, 3945 scored AS < 40 versus 94 at AS >= 40, exactly the default-threshold survivors. (-L 10 accounts for ~18 more.)

The effect shrinks with read length — a 150 bp read splits into ~75 bp arms scoring well clear of 40 — though that has not been measured. Passing -T 30 (optionally -L 5) alongside --meth recovers those dropped split records, at the cost of divergence from bwameth placement.

This is not a substitute for supplementary alignments. -M is still unconditional, so the recovered records are flagged secondary (0x100), not supplementary (0x800) — an SV caller that selects split-read evidence by the 0x800 bit will ignore them no matter how low -T goes. Lowering -T is useful for manual inspection, or for callers that accept 0x100 records and read SA:Z directly.

`-V` reference annotation `XR:Z` is suppressed under `--meth`

bwa-mem3 mem -V normally emits the contig annotation as an XR:Z auxiliary field. Under --meth, XR:Z carries the Bismark read-conversion direction (CT/GA) instead. The reference-annotation XR:Z is silently suppressed when --meth is active so the two uses don’t collide. There is no flag to override this — -V is a no-op for XR:Z under --meth. See tags.md.

Migrating from external bwameth.py c2t

Earlier (D1) bwa-mem3 --meth releases mirrored bwameth.py: they aligned pre-converted reads against a single ref.fa.bwameth.c2t doubled reference, so a bwameth-style external c2t workflow could be wired up directly. The current (D3) design does not support that pattern, because it seeds in 3-letter space but scores against the original 4-letter reference. This page explains what changed and how to migrate.

Important — external c2t interop is no longer supported

bwa-mem3 mem --meth no longer aligns against a .bwameth.c2t reference, and it cannot consume pre-converted reads. It must do the C→T / G→A projection itself (for seeding) so it can keep the original bases for scoring against the original reference. Pass raw FASTQ and the original ref.fa prefix.

What changed

	D1 (old)	D3 (current)
Reference used for alignment	`ref.fa.bwameth.c2t` (converted, doubled)	`ref.fa` original 4-letter index + `ref.fa.meth.*` seed index
Reads	pre-converted, or inline-converted	raw FASTQ; projected internally for seeding only
External `bwameth.py c2t` reads piped in	supported	not supported
Passing a `.bwameth.c2t` reference path	used as-is	errors: “found a legacy ‘.bwameth.c2t’ index … Re-run: bwa-mem3 index --meth”

Because scoring now runs on the original bases, feeding pre-converted reads would make the converted bases look like the truth and corrupt scoring, NM/MD, and the XM methylation string. That is why the external-c2t path was removed rather than adapted.

How to migrate

Rebuild the index. A legacy ref.fa.bwameth.c2t index is not usable. Build the dual index from the original FASTA:
```
bwa-mem3 index --meth ref.fa     # writes ref.fa.* and ref.fa.meth.*
```
Pass raw FASTQ and the original prefix. Drop any bwameth.py c2t preprocessing step and the -p /dev/stdin plumbing:
```
bwa-mem3 mem --meth -t 16 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -o out.bam
```
bwa-mem3 finds ref.fa.meth.* automatically and does the seed-time projection internally. (You can pass the ref.fa.meth seed-index path directly if you prefer, but the original-reference handles must sit alongside it.)

If you specifically need bwameth.py’s collapsed-space alignment or its own c2t tooling, continue to use bwameth.py itself — see bwameth.py drop-in mapping for how --meth-scoring collapsed reproduces its placement instead.

What’s Different from bwa-mem2

This section tracks every change that bwa-mem3 carries on top of upstream bwa-mem2/bwa-mem2’s master branch, explains why each change was made, and records its upstream disposition.

bwa-mem3 is not byte-identical to bwa-mem2. Upstream reproduces the original bwa exactly; bwa-mem3 does not — it emits extra SAM tags, fixes crashes and SIMD scoring bugs, and changes tie resolution. On the data tested, the core alignment (position, CIGAR, MAPQ, FLAG) is preserved, but the SAM byte stream is not. See Equivalence with bwa-mem2 for the field-by-field comparison.

How this section is organized

Each page covers one category of change:

Equivalence with bwa-mem2 — what is and isn’t preserved, with the verified concordance check and the declared-divergence catalog.
Correctness fixes — upstream bugs fixed in bwa-mem3 (the kswv score2 series, the proper-pair regression, the zero-init crash, the SMEM overflow, @PG tab-escaping).
Performance improvements — lockstep SMEM batching, batched -H ingestion, libsais FM-index construction, and the consolidated mapping speedups.
Features — --meth, mimalloc, --supp-rep-hard-cap, bwa-mem3 shm, the HN:i tag, and --bam=LEVEL.
Architecture support — Linux ARM64/aarch64, arch=avx512bw, and the NEON / AVX2 kswv mate-rescue kernels.
Build & infrastructure — the doctest framework, version stamping, PGO targets, flag forwarding, and the CI matrix.

The flat per-PR record — every fork-carried change with its bwa-mem3 PR, class, and upstream bwa-mem2 disposition — lives in one place: the PR catalog. The pages here explain the why behind each class.

Notable fork-level changes

Vendored mimalloc allocator: ext/mimalloc is pinned at v3.3.0 and linked into every binary by default (USE_MIMALLOC=1). Linux uses --whole-archive static linkage; macOS uses dyld-interposed shared linkage. USE_MIMALLOC=1 is the supported and recommended default on all platforms; USE_MIMALLOC=0 is provided as a best-effort opt-out and is CI-gated on Linux x86 only. See Features for details.
--supp-rep-hard-cap INT (opt-in, default disabled): forces MAPQ=0 on supplementary alignments whose chain contains a seed with >=INT genome occurrences. Addresses the long-standing bwa/bwa-mem2 issue where a supp fragment that maps to many places standalone (e.g. a short read in a CCATCC repeat) inherits a high MAPQ from its primary because the supp’s competing repetitive chains get filtered out during the full-read pipeline and therefore never contribute to its sub/sub_n. See upstream #260 for the reporter case. Primary MAPQ is unaffected, and with the flag unset this change has no effect on output. Typical values are 5–20 (lower = more aggressive); the upstream #260 repro drops from MAPQ=60 to MAPQ=0 at --supp-rep-hard-cap 18.

Version stamping

PACKAGE_VERSION (the value reported by bwa-mem3 version and written to the @PG VN: SAM header field) is generated at build time by the Makefile from git describe --tags --dirty, e.g. v2.3-30-g61813ef for a tree 30 commits past upstream tag v2.3 at commit 61813ef.

No manual bumping required: cut a fresh release by tagging the commit (git tag -a vX.Y-fg-labs.N -m ...) and the next build picks it up.
Builds where git describe --tags fails (source-tarball extractions, or shallow clones / checkouts with no tag reachable from HEAD — including CI’s default actions/checkout fetch-depth of 1) fall back to the static FG_LABS_VERSION_FALLBACK in Makefile. Bump that when cutting a release that will be consumed as a tarball, or in CI artifacts.
src/version.h is generated and .gitignored; make clean removes it.
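To make the version string easier to read in logs or @PG headers, the following sketch decodes the git describe --tags --dirty format into its parts. The helper name parse_describe is hypothetical and not part of bwa-mem3; it assumes the standard TAG-N-gSHA[-dirty] layout, with a bare tag when HEAD sits exactly on a tag.

```python
import re

# TAG-N-gSHA with an optional -dirty suffix, as printed by `git describe --tags --dirty`.
DESCRIBE = re.compile(
    r"^(?P<tag>.+)-(?P<commits>\d+)-g(?P<sha>[0-9a-f]+)(?P<dirty>-dirty)?$"
)


def parse_describe(text):
    """Split a git-describe string into tag, commits past the tag, short SHA, dirty."""
    text = text.strip()
    m = DESCRIBE.match(text)
    if m is None:
        # HEAD is exactly on a tag (or a static fallback string was used).
        dirty = text.endswith("-dirty")
        tag = text[: -len("-dirty")] if dirty else text
        return {"tag": tag, "commits": 0, "sha": None, "dirty": dirty}
    return {
        "tag": m["tag"],
        "commits": int(m["commits"]),
        "sha": m["sha"],
        "dirty": m["dirty"] is not None,
    }
```

Applied to the example above, parse_describe("v2.3-30-g61813ef") yields tag v2.3, 30 commits, and short SHA 61813ef.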

Branching and update policy

master tracks upstream unchanged.
main is upstream/master plus the commits above. Rebased onto upstream roughly quarterly, or sooner when an upstream release we care about lands.
Contributions go via PR targeting main. CI and CodeRabbit gate merges.
Any PR that adds or removes a fork-carried commit must add a row to the PR catalog in the same PR (the FG-MAIN-TABLE rule).

Consuming

Clone this repo and check out main:

git clone https://github.com/fg-labs/bwa-mem3.git
cd bwa-mem3
git checkout main

Or vendor the branch into a downstream repo by pinning to a specific commit (not the branch tip) so your build is reproducible.

Relationship to upstream

We submit the generally-useful fixes and features carried here as PRs against bwa-mem2/bwa-mem2 when the upstream maintainers are actively merging; while they are not, fixes land here first and we drop them from main once they appear upstream.

Equivalence with bwa-mem2 (bit-identity)

bwa-mem3 is not byte-identical to bwa-mem2, and that is intentional. Upstream bwa-mem2 advertises exact equivalence with the original bwa — “produces alignment identical to bwa” and “exact same output as bwa-mem(2)” — and that guarantee was the right bar for a project whose sole charter was to reproduce bwa mem faster. bwa-mem3 has a broader charter: it adds informative SAM tags, fixes crashes and undefined behavior, corrects SIMD scoring kernels, and makes tie resolution deterministic. Several of those changes necessarily move output away from a byte-for-byte match. We have consciously stepped below the bit-identity bar, and this page records exactly where, and gives an auditable trail back to every merged pull request so a reader can decide for themselves whether each divergence matters for their workflow.

The short version: on the data we have tested, the core alignment — where each read maps and how — is preserved on essentially every read. The SAM byte stream is not, primarily because bwa-mem3 emits additional auxiliary tags that upstream never wrote, and secondarily because it adds a handful of supplementary alignments and shifts MAPQ/CIGAR on a small per-architecture fraction of reads. Beyond the additive tags, the remaining divergences are latent, opt-in, or per-architecture; they are described under “What differs” below and in more depth on the Correctness fixes, Performance improvements, and Features pages.

What is preserved

We ran an empirical concordance check with bwa-mem3-bench at commit a02fcb4 (current main), comparing bwa-mem3 against upstream bwa-mem2 v2.2.1 on x86 hosts across whole-genome, whole-exome, and panel workloads. Primary-alignment concordance — reference name, position, CIGAR, MAPQ, and placement flags compared per read end — is:

sample	primary concordance	primary records
wes-5M	99.9996%	10,051,170
wgs-5M	99.9893%	9,980,872
panel-twist-5M	99.9414%	7,913,324

Where each read maps is preserved on essentially every read. The well-under-0.1% of primary records that differ do so in MAPQ, CIGAR, or position, and are accounted for by the per-architecture SIMD score2/MAPQ convergence and the deterministic tie-break change described under “What differs” below and on the Correctness fixes page. (On the 1M-read smoke-1M cell the figure is 99.946%; the larger exome and genome cells above are more representative.) Cross-architecture, the NEON (ARM) and x86 builds are byte-identical to each other — 100.0000% concordance over all records, supplementary alignments included.

What differs

Additive SAM tags

The most pervasive difference is two additive auxiliary tags that bwa-mem3 emits and upstream does not:

MQ:i — mate mapping quality, present on ~100% of bwa-mem3 records and absent from upstream output.
HN:i — total hit count per primary, present on 54,188 of the 64,763 bwa-mem3 records and on 0 upstream records.

A representative record (SRR34589119.1) makes the shape of the difference concrete. bwa-mem3 emits:

… MC:Z:53S12M1D6M2D22M58S  MQ:i:19  AS:i:25  XS:i:23  HN:i:9

while upstream emits:

… MC:Z:53S12M1D6M2D22M58S  AS:i:25  XS:i:23

Same alignment, same scores — two extra tags. Because these tags are inserted into the optional-field area of the record, the line is no longer byte-identical to upstream even though the alignment it describes is. MQ:i is one of the lh3/bwa tags ported in #35; HN:i is added in #42. See Features → HN:i hit count tag for the full semantics.

Separately, the @PG header line reports ID:bwa-mem3 / PN:bwa-mem3 rather than bwa-mem2, which is also a byte-level header difference by design.

Additional supplementary alignments

On the default build, with no special flags, bwa-mem3 emits a small number of additional supplementary (chimeric/split) alignments that upstream bwa-mem2 v2.2.1 does not. On wes-5M (5,025,585 read pairs) bwa-mem3 emits 5,123 supplementary records versus upstream’s 5,118 — five extra split alignments, on five templates (0.0001%). The primary alignment of every affected pair is unchanged; only an extra supplementary record is added. This is measured by bwa-mem3-bench’s compare-bams, which reports per-template supplementary count-mismatch and position-unmatched rates alongside primary concordance. These additions are default-on behavior — they occur with no special flags — and are not a product of the opt-in --supp-rep-hard-cap rescoring, which only lowers MAPQ and never adds records. Pinning them to a specific upstream-divergence PR is tracked as follow-up (#127).

Divergences that are latent, opt-in, or per-architecture

The following changes can move alignments, scores, or MAPQ relative to upstream, but did not surface as primary-alignment differences on the measured cells because they are gated, latent, or only active on other inputs or architectures:

Proper-pair FLAG recompute (#17, default-on). bwa-mem3 computes the 0x2 bit from the alignment actually emitted rather than from the below-threshold primary. This only changes the flag in the rare case where the primary’s score is under opt->T but an ALT hit clears it; on smoke-1M no record hit that path, so the full FLAG matched upstream exactly. See Correctness fixes → Proper-pair flag.
SIMD scoring-kernel fixes (#21, #26, #28, #29, #30, #31). These correct the batched mate-rescue kswv kernels so the suboptimal score (score2 → XS:i/MAPQ) converges toward the scalar ksw_align2 reference. They move XS/MAPQ on the minority of reads where the SIMD kernel previously diverged, and the affected reads differ by architecture (AVX2 vs NEON vs AVX-512BW). See Correctness fixes → kswv score2 plateau series.
Seeding correctness fixes (#55, #73, #100). These fix buffer sizing and a prefetch-mask precedence bug. They change alignments only where the old bug actually triggered (e.g. reads longer than 151 bp for #55; #73 is a prefetch hint with no semantic change).
Opt-in MAPQ rescoring (#56, #101, #118, default-off). --supp-rep-hard-cap INT forces MAPQ=0 on supplementary alignments anchored in repetitive seeds. With no flag the output is unchanged; #101 makes the flag actually take effect (it shipped as a silent no-op before), and #118 is its regression test. See Features → --supp-rep-hard-cap.
Tie-break determinism (#123). Makes secondary-alignment ordering deterministic across runs; can reorder equal-scoring ties relative to upstream’s order.

(The resolve→order→chain seeding refactor that backs --seed-order is byte-identical in its default off mode; it is described in the dedicated section below rather than listed here, since only its non-off modes are divergent.)

Seed ordering (`--seed-order`, opt-in)

--seed-order off (the default) preserves byte-identical output. The internal resolve→order→chain refactor that enables it was verified to be bit-for-bit identical to the previous code path across single-end, paired-end, and threaded runs.

--seed-order local-longest (and the unadvertised advanced modes global-longest, absorb-count, most-absorb) are opt-in and are not byte-identical. They reorder each read’s SA-resolved seeds before chaining so that the longest seeds anchor chains first; contained shorter seeds are then absorbed rather than extended. The output shift is non-trivial: secondary alignments, XA:Z:, XS:i, and HN:i can all change, and a small number of primary alignments may shift as well.

Accuracy on an easy simulated profile (holodeck, ~94.4 % F1) is flat relative to off. Hard-data F1 validation on divergent/indel-rich reads and GIAB benchmarks is not yet complete. Because the accuracy gate is not fully passed, all non-off modes remain opt-in only; the default stays off. See Optimization checklist → Reorder seeds longest-first for usage guidance.

Declared divergence catalog

The divergences described above are tracked as a structured registry in bwa-mem3-bench (docs/expected-divergences.yaml). Each carries a per-sample concordance-drift budget that the benchmark gates against on every run — a new bwa-mem3 build that drifts beyond its budget fails CI rather than silently shipping a regression. The table below is generated from that registry; do not edit it by hand (see bwa-mem3-bench → Per-release concordance history for how it is regenerated).

id	pr	affected	samples	budget_%	summary
FG-PRIMARY-DRIFT	fg-labs/bwa-mem3#123	primary_alignment	wgs-5M, wes-5M, panel-twist-5M, smoke-1M	0.1000	Per-architecture SIMD score2/MAPQ convergence (#21, #26, #28-#31) and deterministic tie-break ordering (#123) shift MAPQ, CIGAR, or position on a small fraction of primary alignments relative to bwa-mem2 v2.2.1. Where each read maps is preserved; the affected reads differ in placement detail, and the set varies by SIMD architecture.
FG-METH-DIVERGENCE	fg-labs/bwa-mem3#90	meth_alignment	meth-twist-emseq-5M, smoke-meth	1.5000	Bisulfite (--meth) mode against the bwameth.py baseline diverges beyond the ignored YD/XM/XG tag set (Bismark-compatible XR/XG/XM tags and C->T/G->A conversion handling), giving a larger but still-bounded concordance drift on methylation workloads.
FG-SUPP-ADDITIONS	TBD	supplementary_alignment	all	0.0000	bwa-mem3 emits a small number of additional supplementary (split/chimeric) alignments vs bwa-mem2 v2.2.1 on the default build (e.g. wes-5M: 5123 vs 5118). Primary alignments are unchanged. Tracked as a supplementary-count metric (compare-bams supp_count_mismatch / supp_unmatched); it does not affect the primary-concordance drift budget, hence 0.0 here.

Per-PR audit trail

Every fork-carried change is listed — with its class and upstream bwa-mem2 disposition — in the PR catalog. The declared divergence catalog above calls out the entries that actually affect output.

See also: Overview · Correctness fixes · Performance improvements · Features · PR catalog

Correctness Fixes

This page documents bugs present in upstream bwa-mem2 that bwa-mem3 fixes. Each fix is isolated to a single PR so it can be reviewed independently and dropped from main once upstream merges the equivalent patch.

`@PG CL:` tab escaping (PR #54)

When a read-group string is passed via -R '@RG\tID:x\tSM:y', the tab characters in the argument were copied verbatim into the @PG CL: SAM header field. The SAM specification uses tabs as field delimiters, so the resulting header line appeared to have extra ID: and other tag fields embedded inside CL:. Lenient parsers (samtools, htsjdk) tolerated the output; strict parsers (noodles, some fgbio configurations) rejected the file as malformed.

The fix replaces each tab character with a space when building the @PG CL: value in src/main.cpp. The @RG line itself is not modified, so the read-group metadata is preserved correctly. A regression shell test (test/pg_cl_escape_test.sh) asserts that the @PG line contains exactly five tab-separated fields after the fix. Upstream issue reference: bwa-mem2#293.
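The escaping rule and the five-field assertion can be illustrated with a short Python model. This is not the code in src/main.cpp: build_pg_line is a hypothetical helper, and the field order (ID, PN, VN, CL) is assumed from the @PG lines shown in this documentation.

```python
def build_pg_line(argv, version, prog="bwa-mem3"):
    """Build an @PG header line, replacing tabs in the command line with spaces."""
    command_line = " ".join(argv).replace("\t", " ")
    fields = ["@PG", f"ID:{prog}", f"PN:{prog}", f"VN:{version}", f"CL:{command_line}"]
    return "\t".join(fields)


# A read-group argument whose tabs have already been expanded to real tab characters.
argv = ["bwa-mem3", "mem", "-R", "@RG\tID:x\tSM:y", "ref.fa"]
line = build_pg_line(argv, "v2.3")
```

Without the replace call, the same input would split into seven tab-separated fields, which is the malformed shape that strict parsers rejected.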

SMEM buffer overflow on reads longer than 151 bp (PR #55)

bwa-mem2 hardcoded READ_LEN 151 in src/macro.h to size the per-thread matchArray SMEM buffer at compile time. The FMI walk wrote past this buffer without bounds checking when reads exceeded 151 bp, causing memory corruption that manifested as segfaults or silent wrong output on 300 bp MiSeq reads, error-corrected long reads, and any run with a non-default -k that extended seed length.

A second cap, MAX_READ_LEN_FOR_LOCKSTEP 512, guarded the lockstep driver’s per-slot stack arrays with a hard assert that aborted on anything longer.

The fix eliminates both compile-time caps. Every per-thread SMEM buffer is now heap-allocated on the memory management context (mmc) and grown on demand from each batch’s observed max_readlength. The pre-walk grow in mem_collect_smem sizes matchArray[tid] to BATCH_MUL * BATCH_SIZE * max_readlength, and all array writes are bounds-checked with a structured smem_overflow_die on overflow. Regression tests cover 300 bp, 1 kbp, and 3 kbp phiX reads; all three segfaulted before the fix and produce correct NM:i:0 alignments after. Upstream references: bwa-mem2#210 (issue), bwa-mem2#238 (closed unmerged upstream PR).

`kswv` nrow==0 guard (PR #51)

When a SIMD batch contained only padding pairs (all len1 == 0), the DP loop never executed and nrow was zero. The post-loop rowMax + (i-1) * SIMD_WIDTH store still executed, walking SIMD_WIDTH bytes before the beginning of the rowMax allocation. On glibc this produced a free(): invalid pointer abort; on macOS libc it silently corrupted the heap.

The fix wraps the post-loop store in an if (i > 0) guard on all five SIMD kswv kernels: NEON u8, NEON 16, AVX2 u8, AVX-512BW u8, and AVX-512BW 16. The upstream patch bwa-mem2#289 covered only the two AVX-512BW kernels; bwa-mem3 broadens it to the three additional kernels carried in this fork. A dedicated regression test (test/kswv_nrow_zero_test.cpp) builds all-padding batches and verifies each kernel is clean under AddressSanitizer.

kswv score2 plateau series (PRs #26, #27, #28, #29, #30, #31)

The batched mate-rescue Smith-Waterman path (kswv) contains a family of related bugs across its SIMD kernels that inflated the suboptimal score (score2 / XS) and consequently deflated MAPQ relative to upstream bwa-mem2.

AVX-512BW dispatch guard (PR #26). GCC with -mavx512bw automatically defines __AVX2__, so the #elif __AVX2__ branch in src/kswv.h and src/kswv.cpp matched first on every AVX-512BW build. The 256-bit AVX2 kernel produced only 32-lane results into 64-lane score[]/te1[]/qe[] arrays sized for AVX-512BW; the upper 32 lanes held uninitialized values. mem_matesw_batch_post read those bogus te values, bwa_gen_cigar2 returned NULL, and mem_reg2aln triggered an a.cigar != NULL assertion on every AVX-512BW dispatch host (AWS c7a, c7i). The fix qualifies the #elif __AVX2__ guard with !__AVX512BW__, matching the existing pattern in bandedSWA.h. Closes issue #25.

AVX2 score2 plateau fix (PR #27 closed, PR #28 merged). The AVX2 256-bit kswv kernel added in PR #20 used a dense SIMD max over every rowMax row to compute the suboptimal score. Scalar ksw_u8 instead collapses consecutive rows above minsc into a single b[] entry anchored at the max-score row, then finds the best anchor outside the primary region. The dense max pulled in tail rows from a plateau whose anchor sat inside the primary region, inflating XS by 1–4 on a minority of reads and reducing MAPQ by 2–18 on those reads. PR #27 (closed) temporarily disabled the AVX2 batched path. PR #28 fixes the kernel itself by replacing the dense scan with a per-lane scalar emulation of the b[] build-and-scan logic.

NEON and AVX-512BW 8-bit port (PR #29). The same dense-rowMax score2 scan existed in kswv_neon_u8 and kswv_512_u8. Confirmed on ARM: rebuilding smoke-1M on darwin/arm64 pre-fix produced the same four MAPQ regressions as the AVX2 case. PR #29 ports the per-lane scalar b[]-emulation fix to both kernels.

AVX-512BW 16-bit port (PR #30). kswv_512_16 carried four bugs: the same dense-rowMax plateau pattern, aggregate maxl/minh bounds instead of per-lane bounds (a gap from PR #21), no minsc filter, and no qe mask. The per-lane scalar emulation from PR #29 fixes all four naturally.

NEON 16-bit rewrite (PR #31). kswv_neon_16 was effectively dead code before this PR. Five interacting bugs produced 20,435 BAM diffs vs scalar reference on smoke-1M -A 2: the score table reinterpreted int16 xor indices as int8 lookups (inflating match scores by ~256 per cell), the table was too small for the 16-bit SoA encoding, rowMax was never written, the early-exit fired on row 0 for all pairs without a KSW_XSTOP target, and all the fix-3 class bugs from PRs #28–#30 were missing. The PR rewrites the kernel from scratch against kswv_neon_u8’s structure using 32-byte int8 tables indexed via vqtbl2_s8, per-lane freeze, exit0 bitmap, and per-lane scalar score2.
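The difference between the dense scan and the plateau-anchored scan (the AVX2 fix in PR #28, later ported to the other kernels) can be shown on a single lane. The sketch below follows the prose description above, not the kernel source: row_max stands for one lane's per-row maxima, low and high bound the primary region, and both function names are hypothetical. The exact b[] bookkeeping in scalar ksw_u8 may differ in detail.

```python
def dense_score2(row_max, low, high):
    """Pre-fix shape: best row maximum over every row outside [low, high]."""
    return max((s for i, s in enumerate(row_max) if i < low or i > high), default=0)


def plateau_score2(row_max, minsc, low, high):
    """Collapse each run of consecutive rows >= minsc to its best row, then
    take the best anchor whose row lies outside the primary region."""
    anchors = []     # one (score, row) entry per plateau
    prev_row = None  # last row that passed the minsc filter
    for i, s in enumerate(row_max):
        if s < minsc:
            continue
        if prev_row is not None and prev_row + 1 == i:
            if s > anchors[-1][0]:
                anchors[-1] = (s, i)  # same plateau: move the anchor to the better row
        else:
            anchors.append((s, i))    # new plateau
        prev_row = i
    return max((s for s, row in anchors if row < low or row > high), default=0)


row_max = [0, 0, 30, 40, 50, 0, 0, 0, 20, 0]
dense = dense_score2(row_max, low=3, high=5)
anchored = plateau_score2(row_max, minsc=10, low=3, high=5)
```

In this example rows 2 to 4 form one plateau anchored at row 4, inside the primary region. The dense scan still picks up row 2 and reports 30, while the anchored scan discards the whole plateau and reports 20 from the separate hit at row 8. That gap is the kind of XS inflation the series removes.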

`kseq2bseq1` zero-initialization (PR #22)

bseq_read_orig grows its sequence buffer with realloc, leaving tail entries uninitialized. kseq2bseq1 populated only name, comment, seq, qual, and l_seq for each entry, leaving sam, bams, n_bams, and cap_bams at whatever values realloc happened to return. PR #13 added an unconditional free(ret->seqs[i].bams) in the output loop (fastmap.cpp:571), which turned those garbage values into a crash once input exceeded the initial 256-sequence allocation: a “pointer being freed was not allocated” abort under system malloc and a SIGSEGV under mimalloc. The crash was deterministic and reproducible with -t1.

The fix is a single memset(s, 0, sizeof(*s)) at the top of kseq2bseq1.

Proper-pair flag from emitted alignment (PR #17)

In the no_pairing emission path of mem_sam_pe and mem_sam_pe_batch_post, the proper-pair bit (0x2) was computed from a[i].a[0].rb regardless of which alignment was actually emitted. When the primary’s alignment score fell below the reporting threshold opt->T but a non-primary ALT hit cleared it, mem_reg2aln emitted a[i].a[n_pri[i]] while mem_infer_dir still read the below-threshold primary. In that case the SAM flag did not reflect the coordinates in the record.

The fix stores the selected alignment index per mate in a which[2] array and passes a[i].a[which[i]].rb to mem_infer_dir, ensuring the proper-pair flag always matches the emitted record. The bug was present in the bwa-mem2 initial commit from 2019. Upstream reference: pre-existing bug, no open upstream PR at time of merge.

Changes catalog

Item	bwa-mem3 PR	Upstream PR/issue	Status
`@PG CL:` tab escape	#54	bwa-mem2#293	fork-only (open upstream issue)
SMEM buffer overflow on >151 bp reads	#55	bwa-mem2#238, bwa-mem2#210	fork-only (upstream PR closed unmerged)
kswv nrow==0 guard	#51	bwa-mem2#289	fork-only (upstream PR open)
AVX-512BW dispatch guard	#26	—	fork-only
AVX2 score2 plateau disable (superseded)	#27	—	closed (superseded by #28)
AVX2 score2 plateau fix	#28	—	fork-only
NEON + AVX-512BW 8-bit score2 fix	#29	—	fork-only
AVX-512BW 16-bit score2 fix	#30	—	fork-only
NEON 16-bit kernel rewrite	#31	—	fork-only
kseq2bseq1 zero-initialization	#22	—	fork-only
Proper-pair flag from emitted alignment	#17	—	fork-only

Performance Improvements

This page covers the performance work carried in bwa-mem3 on top of upstream bwa-mem2. Almost every change listed here is a throughput, memory, or supporting (test/hardening/cleanup) change that preserves the aligner’s output; the one exception is the deterministic tie-break ordering in #123, which can reorder equal-scoring alignments relative to upstream (see Equivalence with bwa-mem2 for the full, audited list of where bwa-mem3 output diverges).

For a reader-friendly grouping of what drives the speedup — by machine architecture, hot-path rewrite, indexing, allocation/I/O, and build-time — see Performance → Overview. For current benchmark numbers across architectures and workloads, see bwa-mem3-bench, the canonical source of truth for benchmark methodology and results.

Lockstep SMEM batching (PR #33)

Seeding in bwa-mem2 advances one read’s SMEM walk at a time. Because each forward/backward extension step issues a random access into the cp_occ checkpoint array (~4 GB for human genome), the CPU stalls on cache misses between steps. Lockstep batching advances SMEM_LOCKSTEP_N reads’ SMEM walks in slot-interleaved round-robin order so that the out-of-order engine can overlap the cp_occ cache-miss loads for read i+N with the compute-bound walk of read i.

Each read slot (BatchSlot) carries its own prev[] walk buffer and match_buf[] reorder buffer. A tight recycling loop assigns finished slots to the next unprocessed read immediately. The match-emit cursor enforces input-index order so output is byte-identical to scalar. SMEM_LOCKSTEP_N is compile-time tunable; N=1 dispatches to the unchanged scalar path for bisection.
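
The scheduling shape (round-robin stepping, immediate slot recycling, in-order emission) can be sketched without any FM-index code. This is a toy model under stated assumptions: Slot is a hypothetical stand-in for BatchSlot, each read's walk is reduced to a step count, and lockstep_walk is an illustrative name.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for BatchSlot: which read the slot holds and how many
// extension steps that read still needs.
struct Slot {
    int read;
    int remaining;
};

// work[i] is the number of extension steps read i needs. Returns the order in
// which reads are emitted; if `finished` is non-null it receives the order in
// which reads actually completed.
std::vector<int> lockstep_walk(const std::vector<int> &work, std::size_t n_slots,
                               std::vector<int> *finished = nullptr) {
    if (n_slots == 0) n_slots = 1;  // one slot degenerates to one read at a time
    std::vector<Slot> slots(n_slots, Slot{-1, 0});
    std::vector<char> done(work.size(), 0);
    std::vector<int> emitted;
    std::size_t next_read = 0, cursor = 0;

    // Hand the next unprocessed read to a slot, or mark the slot idle.
    auto refill = [&](Slot &s) {
        if (next_read < work.size()) {
            s.read = static_cast<int>(next_read);
            s.remaining = work[next_read];
            ++next_read;
        } else {
            s.read = -1;
        }
    };
    for (Slot &s : slots) refill(s);

    while (cursor < work.size()) {
        for (Slot &s : slots) {  // one step per live slot, in round-robin order
            if (s.read < 0) continue;
            if (s.remaining > 0) --s.remaining;  // stands in for one extension step
            if (s.remaining <= 0) {
                done[s.read] = 1;
                if (finished) finished->push_back(s.read);
                refill(s);  // recycle the slot right away
            }
        }
        // Emit strictly in input order, however the reads finished.
        while (cursor < work.size() && done[cursor])
            emitted.push_back(static_cast<int>(cursor++));
    }
    return emitted;
}
```

For work = {5, 1, 1, 3} and two slots, reads complete in the order 1, 2, 0, 3 but are emitted as 0, 1, 2, 3, which is the property the match-emit cursor provides.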

Measured improvement on 150 bp NovaSeq WGS (1M pairs, hg38, Graviton3 r7g.4xlarge, 8 threads): −6.1% wall time (82 s → 77 s). The backwardExt hot cp_occ load share dropped from 65.5% to 53.3% of function time — direct evidence that the OoO engine is overlapping cross-slot loads. On 300 bp MiSeq reads the workload is SW-dominated (~85% of cycles in kswv kernels) and the SMEM improvement is within noise; parity holds.

Supersedes PR #15 (cross-read _mm_prefetch shape), which regressed on Graviton3.

Batched `-H` header ingestion (PR #49, closes issue #37)

Passing a large header file via -H <file> re-ran strlen on the growing header string and called realloc on every input line, making ingestion O(n²) in the number of header lines. For a ~70 MB / ~1.5 M-line header (reported in upstream bwa-mem2#204) this caused runtimes exceeding 10 minutes before alignment started.

The fix introduces bwa_insert_header_file, a batched helper that determines the file size with fseek/ftell, allocates a single buffer, copies all @-prefixed lines in one pass, and calls bwa_insert_header once. The fix also addresses four correctness gaps in the upstream PR #204: the return-value assignment was dropped (leaving hdr_line stale after realloc), const FILE* caused compiler warnings, empty files were not guarded, and each fgets was not bounded by remaining buffer. A regression test (test/header_insert_test.cpp) diffs the batched path against the pre-patch per-line baseline across eight edge cases.
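
The difference between per-line growth and a single sized buffer can be shown on in-memory lines. The sketch below is an assumption-laden illustration, not bwa_insert_header_file itself: it takes lines that are already loaded instead of using fseek/ftell on a FILE*, and join_header_lines is a made-up name.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Keep only '@'-prefixed lines and join them with '\n'. The destination is
// sized once up front, so appending never has to rescan or regrow the
// accumulated header the way a per-line strlen + realloc loop does.
std::string join_header_lines(const std::vector<std::string> &lines) {
    std::size_t total = 0;
    for (const std::string &l : lines)
        if (!l.empty() && l[0] == '@') total += l.size() + 1;  // +1 for the separator

    std::string out;
    out.reserve(total);
    for (const std::string &l : lines) {
        if (l.empty() || l[0] != '@') continue;  // skip blank and non-header lines
        if (!out.empty()) out.push_back('\n');
        out += l;
    }
    return out;
}
```

The result would then be handed to the header-insertion routine once, rather than once per line.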

libsais FM-index construction (PR #57)

bwa-mem3 index now builds the FM-index using libsais v2.9.1 (Ilya Grebnov) instead of the sais-lite (Yuta Mori saisxx) library that bwa-mem2 inherited. libsais is actively maintained, supports OpenMP-parallel induced sorting, and produces a byte-identical FM-index. An index built by bwa-mem2 index is read without re-indexing; a bwa (v1) index uses a different format and must be rebuilt with bwa-mem3 index (see Coming from bwa or bwa-mem2).

For a human reference (GRCh38 + decoys), libsais reduces indexing wall time and peak memory vs sais-lite. Exact numbers depend on thread count and available RAM; see the PR body for measurements on Graviton3.

Consolidated mapping speedups (PR #58)

PR #58 is a multi-phase performance audit of bwa-mem2’s hot path, squashed and rebased onto main. It incorporates improvements across five subsystems:

ksw2 banded SW — tuned the band extension loop to reduce redundant computation in the common case.
SMEM lockstep batching — additional refinements on top of PR #33.
SAL prefetch — prefetch hints for the suffix array lookup hot path.
SAM record building — reduced per-record allocation in the text formatting path.
PGO build — the opt-in profile-guided optimization target (see also Performance → PGO build) is included in this suite.

On the smoke-1M workload (1M PE 150 bp reads, hg38, Graviton3 r7g.4xlarge, 16 threads, warm page cache), this PR contributed the largest single-step wall time reduction in the main branch’s performance history. Benchmark details are maintained at bwa-mem3-bench.

Full change list

Every performance PR — with its upstream disposition — is in the PR catalog (filter on the Performance class). The sections above narrate the load-bearing ones.

Open / in progress

Not yet merged to main:

Item	Stage	Mechanism	bwa-mem3 PR
AVX2 8-bit wrapper prefetch	sw	Adds the missing next-batch ref/query software-prefetch to `smithWatermanBatchWrapper8` — the lone SW wrapper that lacked it (follow-up to #161)	#163

Several correctness and crash fixes underpin the long-read SW, indexing, and high-throughput work rather than adding speed themselves: SMEM read positions widened int16_t → int32_t to stop a long-read SIGSEGV (#142, merged); a persistent kt_for worker pool that fixes a multi-chunk SIGSEGV under mimalloc v3 (#154, merged); and mem_lim widened to int64 to stop an SA-staging buffer overflow on highly repetitive seeds (#156, merged).

Features

This page covers user-facing features added to bwa-mem3 on top of upstream bwa-mem2. None of these features change default behavior: output produced by bwa-mem3 mem without any of these flags is byte-identical to the corresponding bwa-mem2 output (except for the @PG ID: and PN: fields which now read bwa-mem3).

`--meth` bisulfite alignment mode (PR #13)

--meth adds native bisulfite/EM-seq alignment to bwa-mem3 index and bwa-mem3 mem in a single binary — no Python, no separate post-processing step, no bwameth.py dependency. In the default --meth-scoring collapsed mode it reproduces bwameth.py’s read placement (a placement drop-in, not a byte-for-byte clone); --meth-scoring genomic opts into variant-aware scoring bwameth cannot produce.

bwa-mem3 index --meth ref.fa          # once per reference
bwa-mem3 mem --meth ref.fa R1.fq R2.fq | samtools sort -o out.bam

index --meth builds a dual index: the normal index over the original reference plus a converted seed index <ref>.meth.* (over <ref>.meth.fa, a doubled reference with f-prefixed C→T and r-prefixed G→A contigs). Reads seed in that 3-letter space but are scored against the original 4-letter reference, so the index layout is not the same as bwameth.py’s single .bwameth.c2t reference.

mem --meth projects each read (R1 C→T, R2 G→A) to find seeds in the .meth index, preserving the original bases on the first-class bseq1_t.meth_orig_seq field (a YS:Z/YC:Z comment carrier is a fallback only; neither reaches the BAM). It then:

Extends and scores against the original 4-letter reference with a per-strand asymmetric matrix.
Consolidates the f/r contig pairs back to one @SQ per real chromosome.
Emits Bismark-compatible XR:Z (read conversion direction), XG:Z (genome strand), and XM:Z (per-base methylation call string) auxiliary tags.
Restores the original bases into the BAM SEQ field for CpG-calling tools.
Optionally applies a chimera QC heuristic when --chimera-qc is passed (off by default): if the longest M/=/X run is < 44% of read length, set 0x200, clear proper-pair 0x2, and cap MAPQ at 1.
Writes a @PG ID:bwa-mem3-meth entry.
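
The chimera QC rule is small enough to sketch directly. The block below is illustrative only: Rec, longest_match_op, and apply_chimera_qc are hypothetical names, and it assumes that a "run" means a single M, = or X CIGAR operation (adjacent operations are not merged).

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical minimal record: SAM FLAG and MAPQ only.
struct Rec {
    unsigned flag;
    int mapq;
};

// Length of the longest single M, = or X operation in a CIGAR string.
int longest_match_op(const std::string &cigar) {
    int best = 0, len = 0;
    for (char ch : cigar) {
        if (std::isdigit(static_cast<unsigned char>(ch))) {
            len = len * 10 + (ch - '0');
            continue;
        }
        if (ch == 'M' || ch == '=' || ch == 'X') best = std::max(best, len);
        len = 0;  // the length belongs to the operation just consumed
    }
    return best;
}

// Apply the heuristic; returns true when the record was marked.
// Integer arithmetic keeps the 44% boundary exact.
bool apply_chimera_qc(Rec &r, const std::string &cigar, int read_len) {
    if (longest_match_op(cigar) * 100 >= 44 * read_len) return false;
    r.flag |= 0x200u;              // QC fail
    r.flag &= ~0x2u;               // clear proper-pair
    r.mapq = std::min(r.mapq, 1);  // cap MAPQ at 1
    return true;
}
```

A 100 bp read with CIGAR 40M60S is marked (40 < 44), while 44M56S sits exactly on the threshold and is left untouched.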

In the default collapsed mode this reproduces bwameth.py’s read placement (chrom, pos) for the standard case, while scoring against the original reference rather than in collapsed space — so it is not byte-for-byte identical to bwameth output. Stacks on PR #12 (--bam). See the Methylation Reference for full details.

Vendored mimalloc allocator (PR #19)

bwa-mem3 vendors mimalloc v3.3.0 as a pinned submodule at ext/mimalloc and links it into every binary by default (USE_MIMALLOC=1). On Linux, static linkage uses --whole-archive; on macOS, dyld-interposed shared linkage is used.

Measured on AWS c7g.4xlarge (Graviton3, 16 threads, 29M 150 bp paired-end exome-capture reads vs hg38, page cache dropped between iterations): −19.7% wall-clock time (528.6 s → 424.7 s, a 1.24× speedup) compared to the same build with USE_MIMALLOC=0. No user-visible interface change; no runtime configuration required.

USE_MIMALLOC=0 is a supported best-effort opt-out and is CI-gated on Linux x86. bwa-mem3 version prints the mimalloc version string when it is active.

`--supp-rep-hard-cap` supplementary MAPQ rescoring (PR #56)

Supplementary alignments for a split read inherit MAPQ from the full-read scoring pipeline. Competing repetitive chains for the supplementary fragment are filtered out during full-read chain scoring (mem_chain_flt) before Smith-Waterman, so they never contribute to sub/sub_n. A supp fragment landing in a CCATCC repeat that would map equally well to 50+ locations standalone can therefore carry MAPQ=60 from its primary.

--supp-rep-hard-cap INT opts into rescoring: if any seed in a supplementary alignment’s chain has >=INT genome occurrences (from the SMEM SA count), the supplementary MAPQ is forced to 0. Primary alignment MAPQ and coordinates are unaffected. Default output (no flag) is byte-identical to upstream bwa-mem2.

The SMEM SA-occurrence count is preserved on each seed as mem_seed_t.n_hits and propagated to mem_alnreg_t.chain_n_hits during chain-to-alignment conversion. Typical values for INT are 5–20; lower is more aggressive. The upstream bwa-mem2#260 reporter case drops from MAPQ=60 to MAPQ=0 at --supp-rep-hard-cap 18. Closes issue #46.
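
As a rough sketch of the rule, assuming simplified types: ChainSeed and Aln below are hypothetical stand-ins for mem_seed_t and mem_alnreg_t, apply_supp_rep_hard_cap is an illustrative name, and treating cap <= 0 as "flag not given" is an assumption of this example.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-ins: a seed with its SA occurrence count, and an
// alignment that remembers the seeds of the chain it came from.
struct ChainSeed {
    int n_hits;
};
struct Aln {
    bool supplementary;
    int mapq;
    std::vector<ChainSeed> chain;
};

// Force MAPQ to 0 on supplementary alignments whose chain contains at least
// one seed with n_hits >= cap. Primary alignments are never touched.
void apply_supp_rep_hard_cap(std::vector<Aln> &alns, int cap) {
    if (cap <= 0) return;  // assumption: non-positive cap means the flag is off
    for (Aln &a : alns) {
        if (!a.supplementary) continue;
        bool repetitive = std::any_of(a.chain.begin(), a.chain.end(),
                                      [cap](const ChainSeed &s) { return s.n_hits >= cap; });
        if (repetitive) a.mapq = 0;
    }
}
```

Storing a single per-alignment maximum, as chain_n_hits suggests, would give the same answer as scanning the seeds, since "any seed >= cap" is equivalent to "max over seeds >= cap".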

Shared-memory index: `bwa-mem3 shm` (PR #65)

bwa-mem3 mem reloads the FM-index from disk on every invocation. For hg38 the index is ~18 GB; for short alignment jobs (targeted panels, small sample batches) this load cost dominates runtime and makes per-invocation IOPS the bottleneck.

PR #65 ports the shm command from bwa (v1) to bwa-mem3 with strict v1 CLI parity:

bwa-mem3 shm <index-prefix>    # load index into shared-memory segment once
bwa-mem3 mem <index-prefix> ...  # subsequent runs attach instead of re-reading
bwa-mem3 shm -d <index-prefix>  # detach and free the segment

The index lives in a POSIX shared-memory segment. Multiple bwa-mem3 mem processes on the same host share the same in-memory copy. Closes issue #64.

Warning — Stale index

bwa-mem3 shm does not detect when the on-disk index has been rebuilt. Always run bwa-mem3 shm -d <prefix> before running bwa-mem3 index and then re-stage with bwa-mem3 shm <prefix>. Using a stale shared-memory segment produces silently wrong alignments.

`bwa-mem3 shm --meth` (PR #67)

bwa-mem3 mem --meth <prefix> locates the .meth seed index built by bwa-mem3 index --meth <prefix> automatically. Before PR #67, staging a methylation index in shared memory required passing the full suffixed seed-index path to shm while continuing to pass the plain prefix to mem. The mismatch was easy to forget, and the failure mode — a run that silently attached the wrong segment — was difficult to diagnose.

PR #67 adds --meth support to bwa-mem3 shm so the same plain-prefix convention works end-to-end:

bwa-mem3 shm --meth ref.fa       # stages ref.fa.meth.*
bwa-mem3 mem --meth ref.fa ...   # attaches automatically
bwa-mem3 shm -d --meth ref.fa   # detaches

`HN:i` hit count tag (PR #42)

Every primary SAM/BAM record now carries an HN:i:<n> tag reporting the number of secondary alignment candidates clustered with this primary under XA_drop_ratio. This count is captured before the -h/max_XA_hits cap truncates the XA:Z: string, so HN reports the true number of alternate loci even when no XA:Z: field appears in the record.

This makes it possible to distinguish:

HN:i:0 + no XA:Z: — genuinely unique mapper.
HN:i:N + XA:Z:... (N ≤ -h) — multi-mapper with all alternates listed.
HN:i:N + no XA:Z: (N > -h) — multi-mapper whose alternates were suppressed by the cap.
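
Downstream code can apply the three cases above to a SAM text record. The sketch below is a hypothetical helper (classify_hits and HitClass are not part of bwa-mem3); it only looks at whether HN:i and XA:Z are present among the optional fields and assumes a well-formed integer after HN:i:.

```cpp
#include <sstream>
#include <string>

enum class HitClass { Unique, AllListed, Suppressed, NoHN };

// Classify one tab-separated SAM record by its HN:i and XA:Z optional fields.
HitClass classify_hits(const std::string &sam_line) {
    std::istringstream in(sam_line);
    std::string field;
    bool has_hn = false, has_xa = false;
    long hn = 0;
    for (int col = 0; std::getline(in, field, '\t'); ++col) {
        if (col < 11) continue;  // skip the 11 mandatory SAM columns
        if (field.compare(0, 5, "HN:i:") == 0) {
            has_hn = true;
            hn = std::stol(field.substr(5));
        } else if (field.compare(0, 5, "XA:Z:") == 0) {
            has_xa = true;
        }
    }
    if (!has_hn) return HitClass::NoHN;  // e.g. non-primary record or -a output
    if (hn == 0 && !has_xa) return HitClass::Unique;
    return has_xa ? HitClass::AllListed : HitClass::Suppressed;
}
```

A record with HN:i:12 and no XA:Z field comes back as Suppressed, which is the case that is otherwise indistinguishable from a unique mapper.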

Motivated by lh3/bwa#438, which adds HN to bwa aln. HN is emitted in both SAM (mem_aln2sam) and BAM (mem_aln_to_bam) paths and is absent when -a (MEM_F_ALL) is active.

`--bam=LEVEL` direct BAM output (PR #12)

bwa-mem3 mem --bam (or --bam=0 through --bam=9) emits BAM directly via htslib, bypassing the SAM-text-to-BAM conversion round trip that normally occurs when the output is piped to samtools view -bS.

--bam / --bam=0: uncompressed BAM (BGZF framing only) — near-zero CPU overhead, smaller than SAM text, fast downstream parsing.
--bam=1..9: BGZF deflate at the specified level.
No flag: SAM text on stdout (default, unchanged).

The implementation adds src/bam_writer.{h,cpp}, a new module that converts mem_aln_t to bam1_t via mem_aln_to_bam. htslib v1.21 is pulled in as a submodule at ext/htslib. On the bwameth.py example fixture (92,961 records), samtools view of --bam output vs SAM text produces a zero-line diff across all 11 SAM columns and all aux tags. See Best Practices → Output format for the recommended pipeline.

`--smem-dedup` SMEM deduplication

--smem-dedup opts into removing fully-identical duplicate SMEM seeds before SA expansion. Off by default → output byte-identical to baseline. When enabled, duplicate SMEMs (same rid, query span [m,n), SA interval [k,l) of size s) that appear adjacent in the sorted SMEM array are compacted in O(n) with no allocation.
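
The compaction step is the classic two-cursor pass over a sorted array. A minimal sketch, assuming a trimmed-down Smem struct that carries only the fields named above; dedup_adjacent is an illustrative name, not the function in the source.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical trimmed-down SMEM: read id, query span [m,n), SA interval
// [k,l) and its size s.
struct Smem {
    int rid;
    int m, n;
    int64_t k, l, s;
};

inline bool same_smem(const Smem &a, const Smem &b) {
    return a.rid == b.rid && a.m == b.m && a.n == b.n &&
           a.k == b.k && a.l == b.l && a.s == b.s;
}

// Remove entries identical to their predecessor. One pass, in place, no
// allocation. Returns how many entries were dropped. Only adjacent duplicates
// are caught, so the input must already be sorted.
std::size_t dedup_adjacent(std::vector<Smem> &v) {
    if (v.empty()) return 0;
    std::size_t w = 1;  // next write position
    for (std::size_t r = 1; r < v.size(); ++r)
        if (!same_smem(v[r], v[w - 1])) v[w++] = v[r];
    std::size_t removed = v.size() - w;
    v.resize(w);  // shrinking never reallocates
    return removed;
}
```

This is equivalent to std::unique followed by erase; it is written out here to make the O(n), allocation-free shape explicit.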

Duplicate SMEMs arise when seeding reports the same match (identical query span and SA interval) more than once, which happens most often in repeat-dense regions. On 50 k WGS reads vs hg38, roughly 10 % fewer SA lookups are performed with --smem-dedup active. The wall-clock impact on an arm64 host is ~1–2 % (seeding is a fraction of total runtime); on x86 workloads where SA expansion is a larger share the gain is correspondingly larger.

The accuracy impact is small and bounded: only reads that carried duplicate SMEMs can be affected, and only if the deduplication changes which chain wins. On the 50 k validation set, 2 reads (0.004 %) differed — one XS tag update on a MAPQ-60 read (primary coordinates unchanged) and one equal-score MAPQ-0 tie-break shift. Zero uniquely-mapped reads changed. See the root cause analysis for the full characterization.

Not byte-identical

Do not enable in pipelines that compare output to a bwa-mem2 or bwa-mem3 baseline. The changes are benign and bounded, but they are real SAM changes.

`--min-ext-len` short-seed extension filter

--min-ext-len INT opts into skipping banded Smith-Waterman extension of short seeds (< INT bp) that sit in a chain with a longer anchor seed — the anchor’s extension already covers them, so their own extension is redundant. Off by default (0) → output byte-identical to baseline.

Smith-Waterman extension is ~60 % of bwa-mem3 mem CPU, and almost all of it is spent on short seeds: seeds ≤40 bp hold roughly 90 % of all banded-SW cells yet are ~99 % wasted, because long seeds already resolve via the ungapped fast-path at near-zero cost. The filter drops those redundant short seeds before extension (mem_chain_drop_short_seeds, a stable in-place compaction of each chain’s seeds, called from mem_flt_chained_seeds in src/bwamem.cpp), so their extension never runs while seeding and chaining are untouched.

Recall-safe by construction. A chain whose seeds are all short is left intact — dropping its only evidence would unmap the read — so the filter never empties a chain. (An earlier version dropped every short seed unconditionally, which silently unmapped low-mappability reads: a 151 bp low-mappability sample lost 63 % of its mappings. The anchor guard fixes this; it is a strict recall improvement that can reduce work but never lose a read.)
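
The anchor guard can be sketched as a two-pass filter over one chain. This is an illustration under assumptions, not mem_chain_drop_short_seeds itself: Seed is a hypothetical two-field struct, drop_short_seeds is a made-up name, and an "anchor" is taken to be any seed of at least min_ext_len bp.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical minimal seed: query start and length in bp.
struct Seed {
    int qbeg;
    int len;
};

// Drop seeds shorter than min_ext_len from one chain, but only when the chain
// also holds at least one seed that is long enough to act as the anchor.
// Stable, in place. Returns the number of seeds dropped.
std::size_t drop_short_seeds(std::vector<Seed> &chain, int min_ext_len) {
    if (min_ext_len <= 0) return 0;  // 0 means the filter is off
    bool has_anchor = false;
    for (const Seed &s : chain)
        if (s.len >= min_ext_len) { has_anchor = true; break; }
    if (!has_anchor) return 0;  // all-short chain: leave it intact

    std::size_t w = 0;
    for (std::size_t r = 0; r < chain.size(); ++r)
        if (chain[r].len >= min_ext_len) chain[w++] = chain[r];
    std::size_t dropped = chain.size() - w;
    chain.resize(w);
    return dropped;
}
```

Because the early return fires before any seed is removed, a chain can shrink but can never become empty.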

Measured single-thread on hg38 (HG002 1M PE WGS, non-emptying filter):

~10 % lower main_mem CPU at --min-ext-len 30 (−9.5 % measured), with no recall loss — mapped count is identical to default (99.75 %). ~0.10 % of reads change locus (down from ~0.40 % under the old emptying filter), confined to the low-confidence tail; ~0.005 % of reads change at MAPQ ≥ 60.
Higher thresholds no longer cliff: mapped count and ~0.10 % divergence hold flat across 30–50, because all-short chains are now protected.
The previously-documented high-error F1 cliff and the cross-architecture speed figures were measured under the emptying behavior; both need a fresh bwa-mem3-bench run under the non-emptying filter. Indels and structural variants were never a contraindication (an indel leaves two still-long exact segments the fast-path handles).

30 is the recommended opt-in value. Because the filter thins the extension stage, not seeding, the wall-clock gain is smaller than the cell-count reduction suggests: seeding and chaining, which the filter does not touch, still account for the rest of the runtime. See CLI → mem --min-ext-len and Settings profiles for the recommended operating point.

`--seed-order` seed reordering before chaining

--seed-order <mode> reorders each read’s SA-resolved seeds before chaining. The default off preserves byte-identical output. The recommended opt-in mode is local-longest, which sorts seeds by decreasing length so the longest seed anchors its chain first and absorbs contained shorter seeds — those sub-seeds then never reach banded Smith-Waterman.

bwa-mem3 mem --seed-order local-longest -t 16 ref.fa R1.fq.gz R2.fq.gz | samtools sort -@ 4 -o out.bam -
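
The reordering itself amounts to a sort by decreasing seed length. A minimal sketch, assuming a hypothetical two-field Seed struct and an illustrative function name; using a stable sort so that equal-length seeds keep their original relative order is an assumption of this example, not a documented property of the flag.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical minimal seed: query start and length in bp.
struct Seed {
    int qbeg;
    int len;
};

// Reorder one read's seeds longest-first before chaining. Ties keep their
// input order because the sort is stable.
void order_longest_first(std::vector<Seed> &seeds) {
    std::stable_sort(seeds.begin(), seeds.end(),
                     [](const Seed &a, const Seed &b) { return a.len > b.len; });
}
```

With the longest seed first, chaining sees the anchor before the shorter seeds it contains, which is what lets those shorter seeds be absorbed instead of extended.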

Measured on 50,000 real WGS reads (1000 Genomes HG00096, hg38), local-longest reduces extended seeds by ~8.9 % (absorbed fraction increases from 38.2 % to 43.7 %). Since Smith-Waterman extension is typically the dominant per-read cost in bwa-mem3 mem, this translates to a meaningful throughput gain on extension-heavy workloads.

--seed-order local-longest is not byte-identical to the default — it can shift secondary alignments, XA:Z:, XS:i, HN:i, and a small number of primaries. Accuracy is flat on easy simulated data (holodeck, F1 ~94.4 %; no regression vs off), but hard-data F1 validation on divergent/indel-rich reads and GIAB benchmarks is not yet complete. For that reason, all non-off modes are opt-in only and the default stays off.

See Optimization checklist → Reorder seeds longest-first and Equivalence → Seed ordering for full details.

Changes catalog

Item	bwa-mem3 PR	Upstream PR/issue	Status
`--meth` bisulfite alignment mode	#13	—	fork-only
Vendored mimalloc allocator	#19	—	fork-only
`--supp-rep-hard-cap` MAPQ rescoring	#56	bwa-mem2#260	fork-only (upstream issue open)
`bwa-mem3 shm` shared-memory index	#65	—	fork-only
`shm --meth` symmetry	#67	—	fork-only
`HN:i` hit count tag	#42	lh3/bwa#438	fork-only (analogous to bwa aln)
`--bam=LEVEL` direct BAM output	#12	—	fork-only
`--smem-dedup` SMEM deduplication	#187	—	fork-only (opt-in, not byte-identical)
`--min-ext-len` short-seed extension filter	pending	—	fork-only (opt-in, off by default)
`--seed-order` seed reordering	#186	—	fork-only (opt-in, off by default)
`--skip-contained-ext` contained-seed extension skip	#192	—	fork-only (opt-in, byte-identical non-meth, no-op under --meth)
`--max-extend-chains` chain-extension cap	#193	—	fork-only (opt-in, not byte-identical)
`--extend-mate-concordant` mate-concordant chain retention	#195	—	fork-only (opt-in, not byte-identical)

Architecture Support

This page covers the architecture-specific build and runtime work carried in bwa-mem3. The goal is a single codebase that builds cleanly on all supported targets and runs the best available SIMD kernels on each.

For the full dispatch matrix and runtime selection logic, see Performance → SIMD dispatch matrix and Developer Guide → SIMD dispatch architecture.

Linux ARM64 / aarch64 build (PR #1)

The Apple Silicon work that reached the fork in commit ae73227 gated ARM behavior on $(UNAME_M) == arm64. On macOS, uname -m returns arm64. On Linux ARM64, it returns aarch64. The Makefile’s ifeq check therefore fell through to the x86 multi target on every Linux aarch64 host, failing with:

g++: error: unrecognized command-line option '-msse'

PR #1 introduces an IS_ARM variable ($(filter $(UNAME_M),arm64 aarch64)) that matches both names. All four architecture-conditional blocks in the Makefile are rewritten to use IS_ARM: the NEON/sse2neon flag block, the x86 arch-specific block, the ARM64 single-binary build block, and the multi target ARM64 short-circuit. The CI workflow is extended to trigger on pushes to fg-main (the integration branch at the time of PR #1, renamed to main in the 0.1.0-pre release) and adds an ubuntu-24.04-arm matrix row so the aarch64 path is exercised on every PR.

`arch=avx512bw` explicit build target (PR #16)

The AVX-512 Smith-Waterman kernels in bwa-mem2 are guarded by the __AVX512BW__ preprocessor macro — not __AVX512F__. The only way to build them before this PR was arch=avx512, but the (then) make multi rule emitted the dispatch binary as bwa-mem2.avx512bw. The build selector (avx512), the preprocessor guard (__AVX512BW__), and the dispatcher suffix (.avx512bw) disagreed.

PR #16 added arch=avx512bw as an explicit Makefile target with flags -mavx512f -mavx512bw and switched the multi-binary make path to use it. The legacy arch=avx512 was preserved as an alias with identical flags. No C++ was changed; the fix was 11 insertions and 2 deletions in the Makefile.

PR #83 has since replaced the multi-binary scheme with a single binary that compiles each kernel TU at every supported tier and dispatches in process; the avx512bw tier name and flag set survived the transition unchanged, and the arch=avx512bw build target remains the single-arch fallback for clusters with uniform AVX-512BW hardware. The pre-#16 mismatch between selector, guard, and suffix is therefore resolved in both the historical multi-binary layout and the current single-binary layout.

This is a pure build-correctness fix: before PR #16, arch=avx512bw and the legacy multi-binary build on AVX-512BW hardware silently compiled the wrong kernel (see Correctness → AVX-512BW dispatch guard for the downstream effect).

NEON kswv mate-rescue (PR #18)

bwa-mem2 has a batched mate-rescue Smith-Waterman path (BWAMEM_BATCHED_MATESW) that uses SIMD kswv kernels to score rescue candidates in parallel. On ARM64 the gate was __AVX512BW__, which is never true on NEON hardware. The NEON kswv::getScores8 kernel existed in the source but was unreachable in production.

PR #18 enables this path on ARM64 by replacing the __AVX512BW__ gate with a new BWAMEM_BATCHED_MATESW macro that fires on NEON/Apple Silicon as well. Along the way, four kernel bugs were found and fixed:

te split — the te (target end) value needed separate hi/lo tracking for 16-lane u8 batches.
Freeze mask — a frozen_vec mask now gates gmax/te/qe updates after KSW_XSTOP fires, preventing stale values from escaping to the score2 scan.
Per-lane score2 exclusion — len1, low/high, and qe masks were not applied per-lane in Loop 1, allowing lanes without a valid primary to contribute spurious suboptimal scores.
minsc filter on rowMax — sub-minsc plateau scores were leaking into score2 because the scalar ksw_u8 gating condition (imax >= minsc) was not replicated.

Measured on an M-series Mac (8 threads, 500k PE 100 bp reads on chr17): 1.42× speedup (−29.4% wall time) with byte-identical sorted SAM output.

AVX2 kswv mate-rescue (PR #20)

PR #18 enabled batched mate-rescue on ARM64. Most x86 production deployments (AWS c6a, older Xeons) use AVX2 without AVX-512BW and were excluded from the same gate. PR #20 extends the batched path to AVX2 by adding a 256-bit kswv256_u8 kernel and widening BWAMEM_BATCHED_MATESW to fire on __AVX2__.

The AVX2 kernel is a direct port of the corrected NEON kernel from PR #18, with an additional fix for per-lane te2 tracking (_mm256_blendv_epi8 on a sign-extended 8→16 bit mask). Verified byte-identical sorted SAM vs the pre-BWAMEM_BATCHED_MATESW scalar control on EC2 m5.xlarge (Skylake-SP, 4 threads, 500k chr17 PE pairs).

Note: PR #20 introduced a score2 plateau regression in the AVX2 kernel that was identified and fixed in the correctness series (PRs #27, #28, #29).

Changes catalog

Item	bwa-mem3 PR	Upstream PR/issue	Status
Linux ARM64 / aarch64 build + CI	#1	bwa-mem2#288	fork-only (upstream PR open)
`arch=avx512bw` explicit target	#16	—	fork-only
NEON kswv mate-rescue kernel	#18	—	fork-only
AVX2 kswv mate-rescue kernel	#20	—	fork-only

Build & Infrastructure

This page covers the build-system, testing, and CI infrastructure changes carried in bwa-mem3 on top of upstream bwa-mem2.

doctest framework and Codecov (PR #34)

PR #34 establishes the long-term test infrastructure for bwa-mem3:

doctest 2.4.11 is vendored as a single-header under ext/doctest/, with the SHA256 recorded in ext/doctest/VERSION.
A new test/framework/ static library provides shared helpers: scoring matrices, deterministic sequence-pair generators, kswv-style batch packers, scalar and SIMD runners, kswr comparators, a JUnit reporter hook, and a shared main.
Two test binaries are produced: bwa_mem3_tests_unit (runs on every CI matrix row) and bwa_mem3_tests_integration (runs on a subset of rows).
The existing kswv_selftest is ported to test/unit/test_kswv_correctness.cpp — 30,049 assertions against scalar ksw_align2 on 10k random plus curated edge pairs.
Five legacy integration sources are moved to test/integration/ via git mv; their binaries still emit at test/<name> so existing scripts keep working.
Five inline CI bash regression blocks are extracted to test/regression/*.sh (phix_parity, chr22_parity, thread_determinism, bam_roundtrip, meth_oracle).
A coverage CI job builds libbwa.a and both test binaries with COVERAGE=1 (-O0 --coverage), runs both test binaries, collects Cobertura XML via gcovr, and uploads to Codecov via codecov/codecov-action.

`PACKAGE_VERSION` from `git describe` (PR #52)

Before PR #52, src/main.cpp hardcoded PACKAGE_VERSION "2.2.1". This string appeared in bwa-mem3 version output and in the @PG VN: SAM header field but was never updated, causing every build to report an outdated version.

The Makefile now generates src/version.h from git describe --tags --dirty, falling back to a static FG_LABS_VERSION_FALLBACK when git describe cannot reach a tag (source-tarball extractions, shallow clones — e.g. CI with the default fetch-depth: 1). A write-if-changed mechanism (cmp -s + mv) regenerates the file on every invocation but only bumps its mtime when the stamped string changes, so only main.o is rebuilt when the version changes, not the entire tree. src/version.h is .gitignored and removed by make clean. Fixes issue #40. Related upstream: bwa-mem2#283, bwa-mem2#284.

PGO target parameterization (PR #59)

The original pgo-generate and pgo-use Makefile targets hardcoded arch=arm64 and a single shared pgo_profiles/ directory. PR #59 generalizes both:

PGO_ARCH (default: arm64 on ARM hosts, native otherwise) passes through to the recursive make invocation as arch=$(PGO_ARCH). Accepts the same values as the rest of the Makefile: arm64, sse41, avx2, avx512bw, native, etc.
PGO_PROFILE_DIR is now overridable (?= instead of =). Each (arch × training-regime) combination can capture into its own directory.
When PGO_ARCH != arm64, the output binaries are named bwa-mem3.pgo-instr.<arch> and bwa-mem3.pgo.<arch> so multiple per-arch PGO builds coexist. The default arm64 names are unchanged for backward compatibility.
pgo-clean now removes arch-suffixed PGO binaries in addition to bare names.

This enables the benchmarking workflow at bwa-mem3-bench, which requires per-arch × per-regime profile capture. See also Performance → PGO build.

`CXXFLAGS`/`CPPFLAGS`/`LDFLAGS` forwarding (PR #50)

At the time of PR #50, the Makefile’s multi: rule compiled runsimd.cpp (the x86 multi-binary launcher) without honoring CXXFLAGS, CPPFLAGS, or LDFLAGS. The $(EXE) link honored CXXFLAGS and LDFLAGS but not CPPFLAGS. PR #83 has since replaced the multi-binary scheme with a single binary that builds via the single: target (the default), and that target inherits the same flag-forwarding behavior.

PR #50 mirrored upstream bwa-mem2#290: the compile rules now honor all three variables, and $(EXE) link adds $(CPPFLAGS). This allows downstream packagers (Debian, Bioconda) and reproducible-build systems to inject hardening flags (-D_FORTIFY_SOURCE=2, -fstack-protector-strong, -Wl,-z,relro) through the environment without patching the Makefile. No functional change unless the env vars are set. Closes issue #39.

Testing and CI (PR #23, #24)

The test harness and the GitHub Actions CI matrix (multi-arch builds, the canonical deep-test row, --bam roundtrip and thread-determinism checks, chr22 parity vs bwa, and the --meth regressions) are contributor-facing. See the Developer Guide → Regression test framework for what runs and how to run it locally.

`BASELINE_ARCH=avx512bw` build flag

This page documents the empirical perf characterization of building bwa-mem3 with BASELINE_ARCH=avx512bw and the -mprefer-vector-width=256 mitigation that ships as part of that target.

Background: BASELINE_ARCH

bwa-mem3 ships a single x86 binary with all five SIMD tiers (sse41 / sse42 / avx / avx2 / avx512bw) compiled for the hand-tuned kernel TUs (KERNEL_SRCS in the Makefile: bandedSWA, kswv, ksw, sam_encode). The runtime dispatcher in src/simd_dispatch.cpp picks the right tier per kernel call based on __builtin_cpu_supports.
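
Tier selection of this kind can be sketched with the GCC/Clang __builtin_cpu_supports builtin. The block is a simplified illustration, not src/simd_dispatch.cpp: CpuFeatures, pick_tier, and detect_features are hypothetical names, the tier strings simply reuse the names above, and on non-x86 or non-GCC-compatible compilers detection reports no features.

```cpp
#include <string>

// Feature bits relevant to the five x86 tiers named above.
struct CpuFeatures {
    bool sse41, sse42, avx, avx2, avx512bw;
};

// Pure selection step: the highest tier the feature set allows.
std::string pick_tier(const CpuFeatures &f) {
    if (f.avx512bw) return "avx512bw";
    if (f.avx2) return "avx2";
    if (f.avx) return "avx";
    if (f.sse42) return "sse42";
    if (f.sse41) return "sse41";
    return "unsupported";
}

// Query the running CPU. Compiles everywhere; only x86 with GCC or Clang
// actually probes the hardware.
CpuFeatures detect_features() {
    CpuFeatures f = {false, false, false, false, false};
#if (defined(__x86_64__) || defined(__i386__)) && (defined(__GNUC__) || defined(__clang__))
    __builtin_cpu_init();
    f.sse41 = __builtin_cpu_supports("sse4.1") != 0;
    f.sse42 = __builtin_cpu_supports("sse4.2") != 0;
    f.avx = __builtin_cpu_supports("avx") != 0;
    f.avx2 = __builtin_cpu_supports("avx2") != 0;
    f.avx512bw = __builtin_cpu_supports("avx512bw") != 0;
#endif
    return f;
}
```

Keeping the selection step separate from the probe makes the tier choice testable with synthetic feature sets; a real dispatcher would typically cache the result once and map it to function pointers.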

Everything outside KERNEL_SRCS — bwamem.cpp, bwamem_pair.cpp, FMI_search.cpp, fastmap.cpp, bntseq.cpp, etc. — is compiled once at the tier set by the BASELINE_ARCH Makefile variable (default: avx2). The compiler can auto-vectorize loops in those TUs at up to that tier’s width.

PR #84 raised the default from sse41 to avx2 after measuring ~10-15% wall-time gains on AVX2 hosts (c6a, etc.) when the auto-vectorizer could finally widen hot non-kernel loops to 256-bit.

The naive expectation: avx2 → avx512bw should give another tier

Following PR #84’s logic, you might expect BASELINE_ARCH=avx512bw to unlock another ~10-15% on AVX-512BW hosts (c7a, c7i, m7i) by widening auto-vectorization to 512-bit. It does not. The avx2 → avx512bw transition has fundamentally different hardware economics from the sse41 → avx2 transition.

The two AVX-512 perf hazards

1. AMD Zen 4 µop-split (c7a)

AMD’s Zen 4 cores (c7a / Genoa, and the Zen 4c Bergamo parts) implement 512-bit AVX-512 operations by issuing 2× 256-bit µops per 512-bit op. For auto-vectorized loops:

Iteration latency doubles.
Iteration count only halves if the trip count is large enough. Short-trip loops eat the 2× latency without amortizing.
512-bit instruction encodings are larger → more I-cache pressure.

Net: loops that auto-vectorized productively at 256-bit AVX2 lose performance when the compiler widens them to 512-bit.

2. Intel Sapphire Rapids transition + downclock (c7i / m7i)

Intel’s Sapphire Rapids has native 512-bit execution units, so the µop-split issue does not apply. But it pays:

~3-5% AVX-512 frequency downclock under sustained heavy 512-bit use.
AVX-512 ↔ AVX2 transition penalties when non-kernel TUs running 512-bit code call into the 256-bit hand-tuned kernel TUs (which always run at host tier via the dispatcher).

Net: small or zero gain from widening, often offset by the transition costs.

Mitigation: `-mprefer-vector-width=256`

The canonical mitigation (used by FFmpeg, libvpx, Intel ISPC) is to keep AVX-512BW capabilities available but cap auto-vectorization at 256-bit width. The flag -mprefer-vector-width=256 (gcc / clang) / -qopt-zmm-usage=low (icpc) does exactly that:

The compiler can still emit AVX-512BW instructions where it explicitly needs them (mask registers, byte/word lane permutes, gather/scatter, the 32-zmm register file).
The auto-vectorizer’s preferred SIMD width stays at 256-bit, dodging Zen 4’s µop-split and Intel’s downclock/transition costs.

The Makefile bakes this flag into arch=avx512bw directly. Hand-tuned 512-bit kernel intrinsics in KERNEL_SRCS are unaffected — the cap is about auto-vec, not intrinsics.

Empirical numbers

Measured on c7a.4xlarge (AMD EPYC 9R14, Zen 4) and c7i.4xlarge (Intel Xeon Platinum 8488C, Sapphire Rapids), running the bench’s wgs-5M sample (1kg HG00096, 5M PE reads on hg38) with the index shm-warmed via bwa-mem3 shm. Values are the median of 3 reps, timed with tricord (fg-labs/tricord):

host	avx2	avx512bw	avx512bw + -mprefer-vector-width=256 (default)
c7a (Zen 4)	105.70 s	103.40 s (−2.2%)	101.03 s (−4.4%)
c7i (Sapphire Rapids)	156.50 s	155.47 s (−0.7%)	155.41 s (−0.7%)

The gain is real but small. Defaulting BASELINE_ARCH=avx2 for x86 distribution is still correct: it’s portable across every AVX2-capable x86 host and loses only ~1-4% (0.7% on c7i, 4.4% on c7a) to the host-locked avx512bw build on AVX-512 hosts.

When to use `BASELINE_ARCH=avx512bw`

Production fleets pinned to AVX-512BW hosts (c7a / c7i / m7i): ship a host-locked build for the small (~1-4%) extra gain. The Makefile’s arch=avx512bw includes the -mprefer-vector-width=256 cap by default, which is empirically the right choice for both Zen 4 and Sapphire Rapids. The binary will not run on hosts below avx512bw (see Host-floor enforcement), so pair it with explicit Batch queue / image plumbing.
Mixed fleets / generic x86 distribution: stay on the default BASELINE_ARCH=avx2. The ~1-4% gap is small enough that portability is worth it.

Benchmarking it yourself

Build the avx2 and avx512bw variants (the latter already includes -mprefer-vector-width=256 by default) and A/B them on your target host with bwa-mem3-bench; GCUPS is within-host only, so compare on the hardware you will deploy on. (An earlier bench-infrastructure regression here was traced to a toolchain difference, not avx512bw itself, and is resolved.)

Building from source

This page documents every build target available in the Makefile and what each produces. For the recommended production build workflow see Best Practices → Build.

Prerequisites

A C++14-capable compiler: GCC 7+ or Clang 6+ on Linux; Clang 15+ (Xcode) on macOS.
GNU make 3.81+.
CMake 3.12+ (required only when USE_MIMALLOC=1, which is the default).
autoconf, automake, autoconf-archive, libtool, pkg-config — ext/htslib’s build runs autoreconf -i && ./configure and locates zlib via pkg-config.
zlib development headers — htslib links against zlib.
libdeflate development headers — src/fast_reader.c uses libdeflate for BGZF block decode (htslib also links it transitively). Debian/Ubuntu: libdeflate-dev; RHEL/Fedora: libdeflate-devel; macOS: brew install libdeflate (the Makefile auto-detects the Homebrew prefix or honours LIBDEFLATE_PREFIX). Amazon Linux 2023 ships no libdeflate-devel — build and install it from source (e.g. libdeflate v1.22): cmake -B build -DCMAKE_INSTALL_PREFIX=/usr/local -DLIBDEFLATE_BUILD_SHARED_LIB=OFF, then cmake --build build && sudo cmake --install build, and set LIBRARY_PATH=/usr/local/lib64:/usr/local/lib (CMake installs to lib or lib64 depending on the distro) before make.
OpenMP runtime — libsais uses OpenMP for parallel suffix-array construction. Linux + GCC: libgomp ships with the compiler, nothing extra to install. Linux + Clang: libomp-dev (Debian) / libomp-devel (RHEL). macOS: brew install libomp; the Makefile auto-detects the Homebrew prefix or honours LIBOMP_PREFIX.
Git submodules initialised: git submodule update --init --recursive.

See Getting Started → Installation for the full per-platform install commands.

Warning — Submodules must be present

The build will fail with a clear error message if any of the required submodules (ext/libsais, ext/htslib, ext/mimalloc, ext/sse2neon) are missing. Always clone with --recursive or run git submodule update --init --recursive before make.

Standard builds

Default build (host-native)

make

On x86 hosts this is equivalent to make single (see below): one binary containing all five SIMD tiers, dispatched in process at startup. On Apple Silicon and other aarch64 hosts the Makefile detects the architecture and builds a single ARM64 binary with one NEON kernel TU.

The resulting binary is bwa-mem3 in the repo root.

Single multi-tier x86 build (default on x86)

make single                       # alias of the default `make`
make BASELINE_ARCH=avx512bw       # raise non-kernel TU compile baseline
make BASELINE_ARCH=sse41          # lower it for pre-Haswell hosts

Builds one bwa-mem3 binary. The four hand-tuned kernel TUs in KERNEL_SRCS (bandedSWA.cpp, kswv.cpp, ksw.cpp, sam_encode.cpp) are compiled five times each — once per supported tier (sse41 / sse42 / avx / avx2 / avx512bw) — and dispatched at runtime via __builtin_cpu_supports. Non-kernel TUs compile once at BASELINE_ARCH (default avx2 since PR #84). See Single-binary SIMD dispatch (x86) for the full design.

Single-tier x86 builds

Pass arch=<target> to compile a single binary with kernels for one tier only (no runtime dispatch table — useful on clusters with uniform hardware):

Command	SIMD level	`ARCH_FLAGS`
`make arch=sse41`	SSE4.1	`-msse … -msse4.1`
`make arch=sse42`	SSE4.2	`-msse … -msse4.2`
`make arch=avx`	AVX	`-mavx`
`make arch=avx2`	AVX2	`-mavx2`
`make arch=avx512bw`	AVX-512BW	`-mavx512f -mavx512bw -mprefer-vector-width=256`
`make arch=native`	host CPU features	`-march=native`

For Intel compiler (icpc / icpx) the flags differ slightly; see the Makefile for the ifeq ($(CXX), icpc) branches. The avx512bw target keeps the -mprefer-vector-width=256 cap from PR #86 — see BASELINE_ARCH=avx512bw build flag for the empirical perf characterization.

ARM64 / Apple Silicon build

make arch=arm64

Compiles a single binary bwa-mem3 with one NEON kernel TU. See Apple Silicon / NEON port for background.

Tuned builds

Profile-Guided Optimization (PGO)

PGO produces the best single-binary performance. The workflow is two-phase:

# Phase 1: instrument binary
make pgo-generate                              # builds bwa-mem3.pgo-instr (arm64 default)
make pgo-generate PGO_ARCH=avx2               # or a specific x86 target

# Run your training workload with the instrumented binary
./bwa-mem3.pgo-instr mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null

# Phase 2: optimised binary
make pgo-use                                   # builds bwa-mem3.pgo
make pgo-use PGO_ARCH=avx2                     # matching arch

PGO_ARCH accepts the same values as arch=. PGO_PROFILE_DIR defaults to pgo_profiles/ but can be overridden. Output binaries are named bwa-mem3.pgo (default arch) or bwa-mem3.pgo.<arch> when a non-default arch is specified, so multiple arch builds coexist.

Clean up instrumented objects and profile data:

make pgo-clean

Link-Time Optimization (LTO)

make lto-build                                 # builds bwa-mem3.lto (native arch)
make lto-build LTO_ARCH=avx2                   # explicit arch

LTO compiles bwa-mem3’s own translation units with -flto (thin LTO on Clang, full LTO on GCC) plus -fno-semantic-interposition on GCC. Third-party libraries (htslib, mimalloc) are linked without LTO. Clean:

make lto-clean

Compute-only profile binary

Used when profiling CPU hotspots without I/O noise. The -DDISABLE_OUTPUT flag short-circuits all BAM/SAM write paths and the file-open / header-emit step, so only alignment work contributes to wall time.

make profile-build                             # builds bwa-mem3.profile (native)
make profile-build PROFILE_ARCH=avx2          # explicit arch
./bwa-mem3.profile mem -t 16 ref.fa R1.fq.gz R2.fq.gz

make profile-clean

Build knobs

Variable	Default	Effect
`USE_MIMALLOC`	`1`	Include mimalloc; set `0` to use the system allocator
`ASAN`	(unset)	Set to any non-empty value to enable AddressSanitizer (forces `USE_MIMALLOC=0`)
`COVERAGE`	(unset)	Set to enable `--coverage` + `-O0` for gcov line-level coverage
`EXTRA_CXXFLAGS`	(empty)	Appended to `CXXFLAGS`; forwarded through PGO / LTO targets
`DISABLE_BATCHED_MATESW`	(unset)	Set to `1` to disable the batched mate-rescue SW path on ARM
`CXX`	`c++`	Compiler. Paired `CC` is auto-derived from `CXX` for libsais.

Cleaning

make clean

Removes object files, libbwa.a, all binaries, test binaries, libsais objects, htslib, and the mimalloc build tree.

make docs-clean

Removes only the mdbook build output (docs/book/). See the Documentation targets below for the full list.

Documentation targets

Target	Action
`make docs`	Build the mdbook into `docs/book/html/`
`make docs-serve`	Live-preview at `http://localhost:3000`
`make docs-cli`	Capture `--help` output for each subcommand into `docs/_generated/cli/`
`make docs-clean`	Remove `docs/book/`
`make docs-install-tools`	`cargo install` mdbook, mdbook-mermaid, and mdbook-linkcheck2

The build runs the mdbook-linkcheck2 backend, which fails the build on a dead internal link (a link to a page that does not exist). This guards against broken cross-references reaching the published site — mdBook on its own only warns. External (web) links are not checked, and bracketed literal text in the captured CLI snippets (e.g. [P]) is reported only as a non-fatal warning. Because a second output backend is configured, the HTML site is written to docs/book/html/ rather than docs/book/.

SIMD dispatch architecture

bwa-mem3 uses two complementary mechanisms to run the best available SIMD code path at run time: in-process tier dispatch on x86 (handled separately in Single-binary SIMD dispatch (x86)) and compile-time conditional compilation inside each kernel translation unit, mediated by src/simd_compat.h and src/kernel_dispatch.h.

This page covers the compile-time layer: what the macros do, which kernels are vectorised at each ISA level, and how the dispatch decision flows from main() to a tier-specific kernel instruction.

The `simd_compat.h` abstraction layer

src/simd_compat.h is the single point where platform detection and intrinsic selection occur. It is included by every file that touches SIMD code. The header resolves to one of four paths:

Platform	Branch condition	Intrinsic headers
ARM / Apple Silicon	`__ARM_NEON` or `__aarch64__`	`sse2neon.h` (translation) + `<arm_neon.h>` (native)
x86 AVX-512BW	`__AVX512BW__`	`<immintrin.h>`
x86 AVX2	`__AVX2__`	`<immintrin.h>`
x86 SSE4.1 / SSE2	`__SSE4_1__` or `__SSE2__`	`<smmintrin.h>` + `<emmintrin.h>`

The ARM path defines APPLE_SILICON 1, sets SIMD_WIDTH8 = 16 and SIMD_WIDTH16 = 8 (128-bit NEON lanes), defines a posix_memalign-backed _mm_malloc replacement that enforces the 128-byte Apple Silicon cache-line alignment, and provides two optimised NEON helpers that sse2neon does not generate efficiently:

_mm_movemask_epi16 — extracts the MSB of each 16-bit element using vshrq_n_u16 + vmovn_u16 + position-weighted vaddv_u8, replacing the _mm_movemask_epi8(v) & 0xAAAA pattern used in bandedSWA.cpp.
_mm_blendv_epi16_fast — a bitwise select on 16-bit elements via NEON vbslq_s16, replacing the OR/AND/ANDNOT sequence sse2neon emits for _mm_blendv_epi8.

SIMD_WIDTH8 and SIMD_WIDTH16 control the lane counts in kswv.cpp and bandedSWA.cpp. The macros differ per ISA level:

ISA	`SIMD_WIDTH8`	`SIMD_WIDTH16`
SSE4.1	16	8
AVX2	32	16
AVX-512BW	64	32
ARM NEON	16	8

Per-tier compilation and symbol mangling

On x86 the four kernel translation units listed in KERNEL_SRCS (bandedSWA.cpp, kswv.cpp, ksw.cpp, sam_encode.cpp) are compiled five times each — once per supported tier (sse41 / sse42 / avx / avx2 / avx512bw) — with tier-specific -m... flags. src/kernel_dispatch.h is a preprocessor-only header that renames each exported kernel symbol per a KERNEL_VARIANT=_<tier> macro, so the five tier compiles produce non-colliding symbols that all link into one binary.

bandedSWA.h adds an abstract IBandedPairWiseSW interface; BandedPairWiseSW is final and inherits from it. kswv.h mirrors this with Ikswv. Each per-tier kernel TU exports a C-linkage factory function (make_bsw_kernel_<tier>, make_kswv_kernel_<tier>) that returns a std::unique_ptr<I*> to the tier-specific concrete class. The dispatcher in src/simd_dispatch.cpp switches on g_tier and calls the matching factory; the call sites in bwamem.cpp and bwamem_pair.cpp see only the interface. This separation keeps the dispatcher TU free of class-layout knowledge and sidesteps the ODR risk that would arise from each tier’s compile pulling in a differently-laid-out concrete class definition.

The free-function ksw_* family (ksw_extend2, ksw_global2, ksw_extend, ksw_global, ksw_align2, ksw_align) is dispatched through thin extern "C" wrappers in simd_dispatch.cpp that switch on g_tier and tail-call the matching mangled per-tier symbol. Internal aux helpers in ksw.cpp (ksw_qinit, ksw_u8, ksw_i16) are forced static so the five tier compiles do not multi-define them. The SAM seq/qual encoder previously inlined in bwamem.cpp was lifted into src/sam_encode.{h,cpp} so it also participates in per-tier compilation.

All non-kernel TUs (bwamem.cpp, bwamem_pair.cpp, fastmap.cpp, FMI_search.cpp, bntseq.cpp, …) compile once at the BASELINE_ARCH tier (default avx2, set by the make line). They call into the dispatcher’s tier-agnostic entry points, which fan out to the per-tier kernels at run time. See Single-binary SIMD dispatch (x86) for the runtime selection and override semantics, and BASELINE_ARCH=avx512bw build flag for why non-kernel TUs do not auto-vectorize at 512-bit by default.

On arm64 there is one NEON tier and one kernel compile per TU; the dispatch tables collapse to single-entry switches and the per-tier mangling layer is a no-op.

Dispatch diagram

The full dispatch decision, from the shell to a kernel instruction, follows this flow:

flowchart TD
    A[User runs: bwa-mem3 mem ...] --> B{Platform}

    B -- ARM / Apple Silicon --> C[bwa-mem3 main, single NEON kernel TU]
    B -- x86 --> D[bwa-mem3 main, calls bwamem3_simd_init in src/simd_dispatch.cpp]

    D --> E{__builtin_cpu_supports + BWAMEM3_FORCE_TIER}
    E -- AVX-512BW --> F1[g_tier = avx512bw]
    E -- AVX2 --> F2[g_tier = avx2]
    E -- AVX --> F3[g_tier = avx]
    E -- SSE4.2 --> F4[g_tier = sse42]
    E -- SSE4.1 --> F5[g_tier = sse41]

    F1 & F2 & F3 & F4 & F5 --> G[Non-kernel TUs run\nat BASELINE_ARCH tier]
    C --> G

    G --> H{Kernel call}

    H -- kswv\nbatched SW --> I[per-tier kswv.<tier>.o\nvia make_kswv_kernel_<tier>]
    H -- bandedSWA\nmate-rescue --> J[per-tier bandedSWA.<tier>.o\nvia make_bsw_kernel_<tier>]
    H -- ksw_align2 etc.\nfree functions --> K[per-tier ksw.<tier>.o\nvia extern-C wrapper in simd_dispatch.cpp]
    H -- sam_encode --> L[per-tier sam_encode.<tier>.o]
    H -- FMI_search\nbackward extension --> M[FMI_search.cpp\n__builtin_popcountl — not SIMD]
    H -- libsais\nBWT construction --> N[libsais.c\nOpenMP parallel SA-IS]

    I --> O[SIMD instructions\nat the dispatched tier]
    J --> O
    K --> O
    L --> O

Per-kernel vectorisation status

Kernel	SSE4.1	SSE4.2	AVX	AVX2	AVX-512BW	ARM NEON
`kswv` (batched Smith-Waterman)	8-wide int16	8-wide int16	8-wide int16	16-wide int16	32-wide int16	8-wide int16 (native)
`bandedSWA` (banded SW / mate-rescue)	vectorised	vectorised	vectorised	vectorised	vectorised	native NEON blendv
`ksw_*` free functions (SW extension)	per-tier	per-tier	per-tier	per-tier	per-tier	per-tier (NEON)
`sam_encode` (SAM seq/qual encoder)	per-tier	per-tier	per-tier	per-tier	per-tier	per-tier (NEON)
`FMI_search` (FM-index backward ext.)	scalar	scalar	scalar	scalar	scalar	scalar
`libsais` (BWT / SA construction)	OpenMP only	OpenMP only	OpenMP only	OpenMP only	OpenMP only	OpenMP only

FMI_search is memory-bound with sequential pointer-chasing dependencies; adding SIMD to it produces no measurable speedup. libsais benefits from OpenMP-parallel induced sorting but not from SIMD widening within a single thread.

Adding a new SIMD kernel

Include simd_compat.h rather than any platform intrinsic header directly.
Use SIMD_WIDTH8 / SIMD_WIDTH16 for lane-count arithmetic so the code compiles correctly across all ISA levels.
If the kernel needs per-tier compilation:
- Add the source to KERNEL_SRCS in the Makefile so the per-tier pattern rules (src/%.<tier>.o) pick it up.
- Use the KERNEL_VARIANT rename macros from src/kernel_dispatch.h to expose mangled symbols.
- Export a C-linkage factory or dispatcher entry point from the per-tier TU and add a switch on g_tier in src/simd_dispatch.cpp.
For ARM-specific optimisations, gate them with #ifdef APPLE_SILICON (or #if defined(__ARM_NEON)) and provide a simd_compat.h-routed fallback for x86.
Verify correctness on at least SSE4.1 (lowest supported x86 tier) and ARM64 using make test, then run test/regression/all_tiers_parity.sh to confirm byte-identical SAM across every x86 tier under BWAMEM3_FORCE_TIER.

Tip — Testing SIMD correctness

The kswv unit tests in test/unit/test_kswv*.cpp use synthetic sequence-pair generators that drive edge cases (empty batches, nrow==0, homopolymers) across every SIMD width. Run them with ./test/bwa_mem3_tests_unit --test-suite="unit/kswv" after modifying any vectorised kernel, then loop BWAMEM3_FORCE_TIER over all five tiers in an end-to-end smoke run to catch dispatcher-wiring regressions that the unit tests miss.

Single-binary SIMD dispatch (x86)

On x86 Linux and x86 macOS, bwa-mem3 is a single binary that contains compiled kernels for every supported SIMD tier (sse41 / sse42 / avx / avx2 / avx512bw). At startup the binary detects the host CPU’s capabilities and selects the matching tier in process, without fork or exec. There is no separate launcher binary and no bwa-mem3.<tier> variant files on disk.

ARM / Apple Silicon does not need tier dispatch at all: there is only one NEON instruction-set level across current ARM64 CPUs, so the arm64 build is a single binary with one kernel TU. The dispatch machinery described below is only meaningful on x86.

This design replaces the multi-binary execv launcher inherited from bwa-mem2. The motivation, validation, and trade-offs are tracked in PR #83; the AVX-512 auto-vectorization cap that ships alongside it is documented in BASELINE_ARCH=avx512bw build flag.

What the build produces

make            # default: single multi-tier binary, BASELINE_ARCH=avx2
make single     # explicit alias of the default target

Produces one file in the repo root:

File	Contains	Non-kernel TU compile flags
`bwa-mem3`	All 5 x86 tier kernels + dispatcher + non-kernel TUs	`BASELINE_ARCH` (default `avx2`)

The four kernel translation units listed in the Makefile’s KERNEL_SRCS (bandedSWA.cpp, kswv.cpp, ksw.cpp, sam_encode.cpp) are compiled five times each, once per tier, with tier-specific -m... flags. Every non-kernel TU is compiled once at the BASELINE_ARCH tier. BASELINE_ARCH defaults to avx2 (PR #84) and can be set on the make line:

make BASELINE_ARCH=avx512bw       # for an AVX-512BW-only fleet
make BASELINE_ARCH=sse41          # for pre-Haswell hosts (~10–15% slower on AVX2)

Lowering BASELINE_ARCH reduces the supported host floor and is the documented escape hatch for vintage hardware. Raising it locks the binary to that host class and disables the host-floor precheck for lower tiers. The bwa-mem3 version banner prints the resulting SIMD floor: line so operators can confirm the build matches the intended deployment surface — see Host requirements and BASELINE_ARCH=avx512bw build flag.

For ARM, make arch=arm64 produces a single binary with a single NEON kernel TU; no dispatch table is generated.

Runtime tier selection

src/simd_dispatch.cpp provides three pieces:

bwamem3_simd_init() — idempotent initializer called from main.cpp. Caches the host’s raw capability into a file-scope g_host_capability and the effective dispatch tier into a separate g_tier (the two differ when BWAMEM3_FORCE_TIER is set).
An enum of supported tiers (sse41 → sse42 → avx → avx2 → avx512bw, plus neon on arm64) and bwamem3_simd_tier_name() for stderr reporting.
Per-kernel factory functions (make_bsw_kernel_<tier>, make_kswv_kernel_<tier>) and free-function dispatch wrappers (ksw_extend2, ksw_global2, ksw_extend, ksw_global, ksw_align2, ksw_align, sam_encode_*) that switch on g_tier and call into the matching mangled per-tier symbol.

x86 detection uses __builtin_cpu_supports directly; arm64 reports neon unconditionally. The selection happens once at startup and the result is cached in a TU-level global — subsequent kernel calls pay a single indirect-call overhead through a vtable (for the BandedPairWiseSW / kswv factories) or an extern "C" wrapper (for the ksw_* free functions). Per PR #83 measurement, the indirect call costs ~0.3 ns after BTB warm-up, so a 1M-read alignment with ~100M kernel calls adds roughly 30 ms — well below run-to-run noise on every tested host.

Symbol mangling per tier

The compile-time machinery — the KERNEL_VARIANT symbol-rename scheme, the IBandedPairWiseSW / Ikswv interface split that keeps the dispatcher TU free of class-layout knowledge, and the ODR-collision avoidance — is documented once in SIMD dispatch architecture → Per-tier compilation and symbol mangling. This page covers the runtime side: tier selection, host-floor enforcement, and distribution.

Environment overrides

Two environment variables are exposed at runtime:

Variable	Behavior
`BWAMEM3_FORCE_TIER=<tier>`	Force the dispatcher to use `<tier>` (one of `sse41` `sse42` `avx` `avx2` `avx512bw`). Downgrade-only: requests above the detected host tier (which would SIGILL on the first wider instruction) and unrecognized names are rejected with a stderr warning and the dispatcher falls back to the detected tier. Replaces the prior “exec the `bwa-mem3.sse41` binary” pattern for A/B regression testing on AVX-512 hosts.
`BWAMEM3_DEBUG_SIMD=1`	Print a one-line `[I::bwamem3_simd_init_body]` banner at startup naming the build baseline (`g_build_tier`), the detected host capability, and the resolved dispatch tier. Also enables the build-baseline-vs-host gap warning that PR #84 originally emitted unconditionally and PR #86 demoted to debug-only.

Both are read once during bwamem3_simd_init() and ignored after that call returns.

Host-floor enforcement

bwa-mem3 mem, bwa-mem3 index, and bwa-mem3 shm all call bwamem3_enforce_host_floor() early in main() (PR #95). The check compares g_host_capability against the compile-time g_build_tier (derived from compiler predefined macros, reflecting whichever BASELINE_ARCH was set at build time) and exits with code 2 and an [E::bwamem3] message naming the gap if the host cannot execute the binary’s compiled-in instructions. This converts what would otherwise be an unhelpful SIGILL deep in alignment into a clean abort at startup.

Diagnostic invocations opt out: bwa-mem3 version, bwa-mem3 <subcommand> --help, and bwa-mem3 <subcommand> -h always succeed regardless of host capability, so operators can introspect a binary on a host that cannot run alignment. The version command prints SIMD floor: (the build’s required minimum) and SIMD runtime: (the resolved tier) on stdout; on a too-old host it also emits a [W::bwa-mem3] warning on stderr.

The simd_dispatch.cpp translation unit is compiled at BASELINE_ARCH like every other non-kernel TU; an earlier draft forced it to -march=x86-64 to keep the precheck SIGILL-safe, but that broke g_build_tier (a static constexpr derived from __AVX2__ / __SSE4_1__ / etc., which are only defined when the matching -m flag is in scope) — every binary reported its floor as scalar and the precheck became a no-op. In practice the precheck path is scalar-only (std::call_once, integer comparisons, getenv, snprintf, fputs, exit) with no array loops the compiler could autovectorize, so it stays SIGILL-safe even when BASELINE_ARCH=avx2 (or higher) for the rest of the binary.

Per-tier parity validation

test/regression/all_tiers_parity.sh runs bwa-mem3 mem with BWAMEM3_FORCE_TIER walking the full ladder (sse41 → sse42 → avx → avx2 → avx512bw) on the same input and diffs the SAM output. The expected result is byte-identical SAM across every tier; any divergence is a bug in either a kernel TU or the per-kernel factory wiring. CI runs this script on the x86 matrix row.

Trade-offs vs the prior multi-binary launcher

Property	Pre-PR-#83 (multi-binary `execv`)	Current (single binary, in-process dispatch)
Install size	~120 MB (5 ISA binaries + launcher)	~25 MB (one binary)
Build cost	5 sequential clean rebuilds + launcher	One parallel build
Process model	`bwa-mem3` (launcher) → `execv` → `bwa-mem3.<tier>`	One process, one `main()`
Per-call overhead	Direct call (tier fixed at launch via separate binary)	Indirect call through factory vtable or `extern "C"` wrapper (~0.3 ns / call)
Non-kernel auto-vectorization	At each binary’s compile tier	At `BASELINE_ARCH` (default `avx2`); raise via `BASELINE_ARCH=`
Tier override	Run the `.<tier>` binary directly	`BWAMEM3_FORCE_TIER=<tier>` (downgrade-only)
`runsimd.cpp` (220-line launcher)	Required	Removed

The ~0.3 ns indirect-call cost is amortized across alignment work and has not been measurable in any bench cell. Compiling the non-kernel TUs at BASELINE_ARCH closes the gap that PR #84 identified: PR #83 had originally regressed performance by silently hardcoding the non-kernel compile to sse41.

Distribution layout

For deployment on any x86 host meeting the build’s floor:

bin/
  bwa-mem3       ← single binary, dispatches in-process

For ARM:

bin/
  bwa-mem3       ← single binary, NEON kernels only

No .<tier>-suffixed companion files are produced or needed. When shipping a Docker image intended for a mixed-microarch fleet, build at the lowest expected tier (e.g. BASELINE_ARCH=avx2 for “AVX2 and newer”) — the runtime dispatcher will still pick AVX-512BW kernels on AVX-512 hosts via the per-tier factory tables. See Multi-architecture deployment for the docker buildx manifest-list recipe.

The legacy Executing in AVX2 mode!! banner is gone. Use either:

bwa-mem3 version — prints SIMD floor: and SIMD runtime: lines on stdout (always available, no alignment required).
BWAMEM3_DEBUG_SIMD=1 bwa-mem3 mem … — prints a one-line [I::bwamem3_simd_init_body] banner on stderr at the start of the run.

Apple Silicon / NEON port

bwa-mem3 supports ARM64 (Apple Silicon and Linux aarch64) as a first-class build target. The port uses the sse2neon translation shim as a baseline and replaces the two most performance-critical SSE paths with native NEON intrinsics.

Architecture overview

The ARM build compiles a single binary with a single NEON kernel TU. There is only one NEON instruction-set level on all current ARM64 CPUs, so the per-tier dispatch table used by the x86 single-binary build (see Single-binary SIMD dispatch (x86)) collapses to a one-entry switch on aarch64, and there is effectively no dispatch overhead. make arch=arm64 builds and installs the binary at the bare bwa-mem3 name.

sse2neon shim

ext/sse2neon/sse2neon.h is a header-only library that maps Intel SSE intrinsics to their NEON equivalents. When APPLE_SILICON=1 is defined (set automatically when uname -m is arm64 or aarch64), src/simd_compat.h includes sse2neon and defines the SSE feature test macros (__SSE__ through __SSE4_2__) so that code guarded by those macros compiles without changes.

The translation is not zero-cost for all operations. Two patterns that sse2neon handles poorly are replaced with native NEON in src/simd_compat.h:

_mm_movemask_epi16 — used heavily in bandedSWA.cpp to extract the sign bit of each 16-bit lane. The native implementation shifts right by 15, narrows to 8-bit with vmovn_u16, and reduces with position-weighted vaddv_u8.
_mm_blendv_epi16_fast — a bitwise select on 16-bit lanes using vbslq_s16. Replaces the three-operation OR/AND/ANDNOT sequence sse2neon emits for _mm_blendv_epi8.

Because the bulk of the ARM SIMD path is compiler-translated rather than hand-written intrinsics, codegen quality is unusually sensitive to the compiler and its version — a recent clang or gcc closes most of the gap to a hypothetical full native port. See Best Practices → Build for measured numbers and the recommendation.

Memory alignment

Apple Silicon uses 128-byte cache lines (versus 64 bytes on x86). simd_compat.h overrides _mm_malloc on ARM to call posix_memalign with a minimum alignment of 128 bytes for all SIMD allocations. CACHE_LINE_BYTES is set to 128 in macro.h when APPLE_SILICON=1.

Accelerate.framework

The Makefile links -framework Accelerate on macOS ARM builds. The framework is linked but not used for computation: bwa-mem3’s hot paths (Smith-Waterman, FM-index) do not match the large-matrix / large-vector patterns that BLAS and vDSP target. The link is retained to keep the option open and adds no overhead at runtime.

P-core / E-core detection

src/fastmap.cpp calls HTStatus() on macOS to detect the Apple Silicon microarchitecture. HTStatus() reads the hw.perflevel0.physicalcpu and hw.perflevel1.physicalcpu sysctl keys to report P-core and E-core counts and the L2 cache size (typically 4 MB on M-series chips). This information is printed at startup for diagnostic purposes. The L2 cache size is used to validate the compile-time BATCH_SIZE setting (currently 1024, which was already optimal for a 4 MB L2 cache).

Benchmark results

All measurements use 100K paired-end reads, 5% error rate, 30% indels, chr17 reference, 8 threads, on an M-series Apple Silicon machine.

Build	Wall-clock (avg, s)	vs. previous row
sse2neon baseline (no native NEON)	15.4	—
+ native NEON `kswv.cpp`	14.4	~7% faster
+ native NEON `bandedSWA.cpp` blendv	13.8	~4% faster
PGO on top of native NEON	~13.4	~3% further

The FM-index (FMI_search.cpp) is memory-bound with sequential pointer-chasing dependencies and does not benefit from SIMD. libsais benefits from OpenMP-parallel suffix-array construction but not from SIMD widening within a single thread.

Optimization task summary

Task	Status	Impact	Notes
Correctness verification	done	—	200,006 alignments, 0 differences vs. reference
Dynamic L2 cache detection	done	~0%	4 MB detected; compile-time `BATCH_SIZE=1024` already optimal
Native NEON `bandedSWA.cpp`	done	~4%	`vbsl`-based blendv in `simd_compat.h`
Per-tier dispatch table	N/A	0%	Collapses to one entry on ARM (single NEON level)
Accelerate.framework	done	~0%	Linked; no suitable compute patterns
M1/M2/M3/M4 detection	done	~0%	P/E-core counts and L2 cache via sysctl
Native NEON `FMI_search.cpp`	N/A	0%	Memory-bound; SIMD cannot help
Profile-Guided Optimization	done	~3%	`make pgo-generate` / `make pgo-use`

Building for Apple Silicon

# Standard arm64 build
make arch=arm64

# PGO build (recommended for production on Apple Silicon)
make pgo-generate PGO_ARCH=arm64
./bwa-mem3.pgo-instr mem -t 8 ref.fa R1.fq.gz R2.fq.gz > /dev/null
make pgo-use PGO_ARCH=arm64

The resulting bwa-mem3.pgo binary delivers the full ~10% improvement over the pure sse2neon baseline.

Tip — Recommended production build on Apple Silicon

Use PGO for production deployments. The combined improvement from native NEON kernels plus PGO (roughly 10–13% in the benchmark above) is consistent and verified on M-series hardware.

Files modified in the NEON port

src/kswv.cpp, src/kswv.h — native NEON batched Smith-Waterman
src/bandedSWA.h — SIMD width definitions for ARM
src/simd_compat.h — sse2neon integration, aligned allocation, _mm_blendv_epi16_fast, _mm_movemask_epi16
src/fastmap.cpp — L2 cache detection, HTStatus() for non-NUMA (macOS)
src/macro.h — BATCH_SIZE and CACHE_LINE_BYTES tuning for Apple Silicon
Makefile — arm64 target, sse2neon flags, Accelerate linkage, PGO targets

Regression test framework

bwa-mem3 has three categories of tests — unit, integration, and regression — plus a separate benchmark harness in bench/. Understanding the distinction helps you choose where to add a new test and what to expect from CI.

Test categories

Category	Binary / runner	Fixtures	CI scope
unit	`test/bwa_mem3_tests_unit`	None; all inputs synthetic	Every matrix row
integration	`test/bwa_mem3_tests_integration`	Small committed FASTAs / FMI in `test/fixtures/`	SSE4.1, AVX2, ARM64 Linux, macOS ARM
regression	`test/regression/*.sh`	Downloaded references (phiX, chr22) + bwa + dwgsim	Canonical AVX2 row only

Unit tests must use only synthetic inputs generated programmatically and complete in under 100 ms each. They exercise individual kernels in isolation: kswv scoring, banded Smith-Waterman, KSW, FM-index operations, SMEM extraction, BAM encoding, and pair handling.

Integration tests may load small committed fixtures from test/fixtures/ and have a per-test budget of 10 seconds. They exercise cross-component paths: index loading, SMEM-to-alignment pipelines, and output format validation.

Regression tests are standalone bash scripts that shell out to the bwa-mem3 binary, may diff against third-party tool output (bwa, bwa-meth, samtools), and require fixtures that are either committed to the fixtures directory or downloaded by CI at run time.

Running tests locally

# Build the aligner and test binaries
make
make -C test -j$(nproc 2>/dev/null || sysctl -n hw.ncpu)

# Run all unit tests
./test/bwa_mem3_tests_unit

# Run all integration tests
./test/bwa_mem3_tests_integration

# Run a specific test case or suite
./test/bwa_mem3_tests_unit --test-case="*kswv*"
./test/bwa_mem3_tests_unit --test-suite="unit/kswv"
./test/bwa_mem3_tests_unit --test-suite-exclude=slow

# Verbose output (also print passing assertions)
./test/bwa_mem3_tests_unit --success

The make test target is a convenience shortcut that builds and runs the unit and integration binaries plus the two legacy standalone regression tests (kswv_nrow_zero_test and shm_section_find_test):

make test

Running a regression test locally

Regression scripts read environment variables to locate the binary under test and the fixtures; the example below sets BWA_MEM2 and CI_TEST_DIR. The phiX parity test also requires dwgsim to simulate reads:

mkdir -p /tmp/ci-test && cd /tmp/ci-test
curl -sL "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/819/615/GCF_000819615.1_ViralProj14015/GCF_000819615.1_ViralProj14015_genomic.fna.gz" | gunzip > phix174.fa
dwgsim -z 42 -N 500 -1 150 -2 150 -r 0.001 -S 2 phix174.fa reads
cd -
BWA_MEM2="$(pwd)/bwa-mem3" CI_TEST_DIR=/tmp/ci-test bash test/regression/phix_parity.sh

Test framework

The unit and integration binaries are built on doctest, a single-header C++ test framework. Tests are discovered by file glob: any test/unit/test_*.cpp file is compiled into the unit binary; any test/integration/test_*.cpp file is compiled into the integration binary. No Makefile edit is needed when adding a new test_*.cpp.
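
The glob rule can be illustrated with a small Python sketch that predicts which test binary a source file would be compiled into. This is only a model of the rule as described here, not the Makefile logic itself; the function name target_binary and the example file names in the usage note are hypothetical.

```python
import posixpath
from fnmatch import fnmatchcase
from typing import Optional

# Directory -> binary, following the glob rule and the "Test categories" table.
BINARIES = {
    "test/unit": "test/bwa_mem3_tests_unit",
    "test/integration": "test/bwa_mem3_tests_integration",
}


def target_binary(path: str) -> Optional[str]:
    """Return the test binary that would pick up `path`, or None if no glob matches."""
    directory, name = posixpath.split(path)
    if not fnmatchcase(name, "test_*.cpp"):
        return None  # helpers and headers are not discovered as tests
    return BINARIES.get(directory)
```

For example, target_binary("test/unit/test_kswv.cpp") returns the unit binary, while a file such as test/unit/helpers.cpp is not picked up because its name does not match test_*.cpp.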

Test organisation

Tag each TEST_CASE with doctest::test_suite("category/module"):

TEST_CASE("nrow==0 batch does not store out of bounds"
          * doctest::test_suite("unit/kswv")) {
    // ...
}

The test_suite decorator overrides rather than adds: a test case belongs to exactly one suite. Encode both the category (unit or integration) and the module (kswv, bandedsw, ksw, fmindex, smem, bam, pair, cigar, util) in a single slash-separated string.
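
A suite name that follows this convention is easy to check mechanically. The sketch below is a hypothetical lint helper (the repository is not known to ship one); it accepts only the categories and modules listed above.

```python
CATEGORIES = {"unit", "integration"}
MODULES = {"kswv", "bandedsw", "ksw", "fmindex", "smem", "bam", "pair", "cigar", "util"}


def is_valid_suite(name: str) -> bool:
    """True when `name` has the form '<category>/<module>' using the listed values."""
    category, sep, module = name.partition("/")
    return sep == "/" and category in CATEGORIES and module in MODULES
```

Here is_valid_suite("unit/kswv") is accepted, whereas a bare "kswv" or a three-part name such as "unit/kswv/extra" is rejected.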

Framework helpers

The test/framework/ directory provides helpers shared across test files:

Header	Provides
`scoring.h`	`ScoringMatrix`, `build_scoring_matrix`, `default_scoring_matrix`
`seqpair.h`	`TestPair` struct
`seqpair_gen.h`	Deterministic pair generators: random, exact-match, all-mismatch, homopolymer, sub-cluster, N-bases
`seqpair_batch.h`	`BatchBuffers` — flat-layout packer for kswv batch input
`ksw_runner.h`	`run_scalar_ksw`, default gap/extra parameters
`kswv_runner.h`	Two-pass `run_kswv_batch`
`kswr_cmp.h`	Score / coordinate / score2 comparators
`junit_reporter.h`	CI matrix-row banner and JUnit XML output

Debugging a failing test

# Break into debugger at the first failing assertion
./test/bwa_mem3_tests_unit --test-case="*kswv*" --break

# Run a single SUBCASE
./test/bwa_mem3_tests_unit --test-case="*foo*" --subcase="bar"

# Enable per-phase diagnostics for kswv tests
BWA_TESTS_DEBUG_PHASE0=1 BWA_TESTS_DEBUG_PHASE1=1 \
  ./test/bwa_mem3_tests_unit --test-suite="unit/kswv"

JUnit artifacts are uploaded per CI matrix row (unit-results-<name>.xml, integration-results-<name>.xml) and available on the Actions run page.

Tip — Use ASAN for memory bugs

Build with make ASAN=1 test to catch out-of-bounds writes in vectorised kernels. The kswv_nrow_zero_test specifically exercises the nrow==0 path that triggered a pre-allocation store bug; ASAN reports this immediately rather than at a later allocator operation.

Standalone regression tests

Three standalone regression tests live outside the doctest harness because they predated it. The two binaries are built and run by make test; the third is script-driven:

kswv_nrow_zero_test — binary; exercises the all-len1==0 batch path in every SIMD kswv variant. Catches the nrow==0 rowMax store overrun from issue #38 / upstream bwa-mem2 PR #289.
shm_section_find_test — binary; exercises the shared-memory index section-find logic.
shm_pack_round_trip_test — script-driven, invoked via test/shm_pack_round_trip_test.sh, which builds the phiX index first.

Additional integration shell scripts in test/:

Script	What it tests
`pg_cl_escape_test.sh`	`@PG CL:` tab/newline escape in SAM headers
`mimalloc_loaded_test.sh`	mimalloc override is active when `USE_MIMALLOC=1`
`shm_round_trip_test.sh`	`bwa-mem3 shm` load / list / drop cycle
`shm_meth_test.sh`	`--meth` index compatibility with `shm`
`help_prescan_test.sh`	`--help` prints without running alignment
`libsais_*.sh`	libsais index correctness vs. BWA / determinism

Benchmark harness (`bench/`)

bench/ is a separate performance measurement harness used during development to gate performance PRs. It is not part of the CI test suite.

cp bench/config.env.example bench/config.env
# Edit config.env to point at your index, reads, and binary paths
bench/run.sh baseline         # N trials; appends to bench/results.csv
bench/run.sh candidate        # N trials on the candidate binary
bench/compare.sh baseline candidate  # wall-clock / RSS / md5 delta report

Each run records: tag, host, architecture, binary path, thread count, trial index, wall-clock seconds, max RSS (KB), and a golden md5 (single-threaded, @PG-stripped SAM). The md5 verifies byte-identical output across builds; wall-clock is the primary performance metric.
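
To make the comparison concrete, here is a minimal Python sketch of the kind of report such a comparison could produce from the recorded rows. It is not the logic of bench/compare.sh: the dictionary keys (tag, wall_s, md5) are assumed names for the recorded fields, the use of the median is our choice, and the numbers are made up for illustration.

```python
import statistics


def compare(rows, baseline_tag, candidate_tag):
    """Summarise wall-clock speedup and output identity for two tagged runs."""
    base = [r for r in rows if r["tag"] == baseline_tag]
    cand = [r for r in rows if r["tag"] == candidate_tag]
    base_median = statistics.median(r["wall_s"] for r in base)
    cand_median = statistics.median(r["wall_s"] for r in cand)
    md5s = {r["md5"] for r in base + cand}
    return {
        "speedup": base_median / cand_median,  # > 1.0 means the candidate is faster
        "identical_output": len(md5s) == 1,    # every trial produced the same SAM
    }


# Made-up trial data, three trials per tag.
rows = [
    {"tag": "baseline", "wall_s": 10.0, "md5": "abc"},
    {"tag": "baseline", "wall_s": 12.0, "md5": "abc"},
    {"tag": "baseline", "wall_s": 11.0, "md5": "abc"},
    {"tag": "candidate", "wall_s": 10.0, "md5": "abc"},
    {"tag": "candidate", "wall_s": 9.0, "md5": "abc"},
    {"tag": "candidate", "wall_s": 11.0, "md5": "abc"},
]
report = compare(rows, "baseline", "candidate")
```

Checking the md5 first and the timing second mirrors the roles described above: a faster candidate is only interesting if its output is byte-identical.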

Release process

bwa-mem3 follows semantic versioning. Releases are automated with release-please: every push to main updates a standing “release PR” that bumps the version and regenerates the changelog from the Conventional Commits history; merging that PR tags the release and publishes a GitHub Release. The version string is derived from version.txt and embedded in every binary at compile time.

Version stamping

version.txt at the repo root is the single source of truth for the version. release-please rewrites it (and .release-please-manifest.json) in the release PR; do not edit it by hand.

The Makefile computes the build’s version string from it at parse time:

# version.txt is the single source of truth; scripts/version.sh reads it
# and appends an informational git-describe-style dev suffix.
VERSION_STRING := $(shell scripts/version.sh)

scripts/version.sh reads the base version from version.txt and, when git is available, appends a dev suffix that surfaces how far the working tree is from the matching tag:

Working-tree state	Example string
Source tarball / shallow clone (no git)	`0.4.0`
Clean and exactly at tag `v0.4.0`	`0.4.0`
Clean, not at the tag	`0.4.0-3f7ab2e`
Uncommitted changes at the tag	`0.4.0-dirty`
Uncommitted changes, not at the tag	`0.4.0-3f7ab2e-dirty`

“At the tag” means HEAD is precisely the commit pointed to by tag v<base> (or <base>). Manifest drift — HEAD at a tag that disagrees with version.txt — is treated as “not at tag”, so the -<sha> suffix surfaces the drift visibly rather than silently printing a wrong bare version.
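
The suffix rules in the table can be summarised in a few lines. The Python sketch below is a model of the documented behaviour, not a port of scripts/version.sh; the function name and parameters are hypothetical, and the caller is assumed to have already decided whether HEAD is at the tag matching version.txt (so manifest drift arrives here as at_tag=False).

```python
from typing import Optional


def version_string(base: str, short_sha: Optional[str], at_tag: bool, dirty: bool) -> str:
    """Compose the version string; short_sha is None when git is unavailable."""
    if short_sha is None:
        return base  # e.g. a source tarball: bare base version
    suffix = ""
    if not at_tag:
        suffix += f"-{short_sha}"  # surfaces distance from the matching tag
    if dirty:
        suffix += "-dirty"         # uncommitted changes in the working tree
    return base + suffix
```

Each row of the table corresponds to one combination of the three inputs; for instance, a dirty tree that is not at the tag yields 0.4.0-3f7ab2e-dirty.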

The string is written into src/version.h by the src/version.h: FORCE rule, which runs on every make invocation but only touches the file when the string changes. This minimises unnecessary recompilation of src/main.o.

PACKAGE_VERSION from src/version.h appears in:

bwa-mem3 version output (stdout).
The @PG VN: field in every SAM/BAM file produced by bwa-mem3 mem.

Note that the version string has no leading v (it mirrors version.txt, e.g. 0.4.0), even though the git tags do (v0.4.0).

Verifying the version

./bwa-mem3 version
# Example output on a release tag:
# 0.4.0
# mimalloc 3.3.0        ← if USE_MIMALLOC=1

On a commit past the tag the string carries the short SHA suffix:

0.4.0-3f7ab2e

Semver policy

bwa-mem3 follows semver, interpreted for an alignment tool as follows.

MAJOR (X.0.0) — bump when the change would break a downstream consumer that pinned the previous version without checking release notes. Concretely:

An on-disk index file format change (a re-index is required to use the new version).
Removal or rename of a CLI flag or subcommand.
A SAM/BAM tag is removed, renamed, or its type/value space changes incompatibly (a column-fixed downstream parser would break). Adding a new tag is not a major change.
A change to the resolved primary alignment that is intentional and affects more than a negligible fraction of reads (e.g. a MAPQ recalibration applied unconditionally). Concordance regressions attributable to bug fixes are not major changes — call them out in the release notes under “Correctness” instead.
Dropping support for a previously supported host class (e.g. raising the build’s compiled-in BASELINE_ARCH floor in a way that excludes hosts the previous release ran on).

MINOR (0.X.0) — bump for any user-visible new functionality that does not break consumers pinned to the previous minor. Examples:

A new CLI flag or subcommand.
A new SAM aux tag emitted on output (e.g. HN:i in 0.1.0, the Bismark XR:Z / XG:Z / XM:Z set in 0.2.0).
A new operational feature (e.g. bwa-mem3 shm, in-process SIMD dispatch).
A user-facing default change that is documented in release notes but does not require any consumer action (e.g. BASELINE_ARCH=avx2 as the build default).
New performance characteristics that change wall-time meaningfully.

PATCH (0.0.X) — bump for bug fixes, doc-only changes, build fixes, and internal refactors that have no user-visible behavioral delta. Pre-existing-bug fixes that incidentally shift output for a small fraction of reads are patch-level when called out in the release notes; widespread output shifts (>0.1% of reads on a typical WGS bench cell) deserve MINOR or MAJOR depending on the source.

While the project is pre-1.0, the leading 0. is treated literally — 0.2.0 may make breaking changes vs 0.1.0 if called out clearly in the release notes. After 1.0, MAJOR bumps are reserved for genuinely breaking changes.

Release-readiness checklist

Run through this list on the candidate main commit before merging the release PR (merging it is what tags and publishes the release — see Cutting a release). Every item must pass.

Build and test

make clean && make succeeds at the default BASELINE_ARCH (avx2) on a Linux x86_64 host.
make clean && make BASELINE_ARCH=sse41 succeeds on the same host — confirms the portability floor still compiles.
make clean && make succeeds on an arm64 host (Apple Silicon or aarch64 Linux).
make test passes on both x86_64 and arm64.
test/regression/all_tiers_parity.sh produces byte-identical SAM across BWAMEM3_FORCE_TIER=sse41 → sse42 → avx → avx2 → avx512bw on an AVX-512BW host. Failures here indicate a per-tier kernel or dispatcher-wiring regression — fix before tagging.

Bench

bwa-mem3-bench run submitted on the candidate SHA via bwa_mem3_bench.cli submit --fg-labs-sha <sha> (or the local smoke path for a fast sanity check).
bench regression --prev <previous-tag-sha> reports gate PASS — concordance ≥ 99.999% on every vs-baseline.json cell except methylation (which is expected to drift vs the bwameth baseline; see the methylation carve-out below) and no cell labeled REGRESSION.
Methylation cells reviewed for expected-drift consistency: the meth-twist-emseq-5M concordance vs the bwameth baseline should sit at ~98.9% post-PR-#90, with the per-class breakdown matching the entry in bwa-mem3-bench/docs/expected-divergences.yaml (or the entry added in this release — the file is in the bench repo, not in this repo).
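
The gate described in the bench items above can be sketched as a simple predicate. This is an illustration only, not the bench tool's implementation: the cell fields (sample, concordance, label, methylation) are assumed names, and the example values are made up.

```python
THRESHOLD = 99.999  # percent, from the checklist item above


def gate(cells):
    """Return ("PASS" | "FAIL", failures) for a list of benchmark cells."""
    failures = []
    for cell in cells:
        if cell["label"] == "REGRESSION":
            failures.append((cell["sample"], "labeled REGRESSION"))
        elif not cell["methylation"] and cell["concordance"] < THRESHOLD:
            # Methylation cells are exempt from the threshold (reviewed separately).
            failures.append((cell["sample"], "concordance below threshold"))
    return ("PASS" if not failures else "FAIL", failures)


# Made-up cells: one ordinary cell and one methylation cell.
cells = [
    {"sample": "wgs", "concordance": 100.0, "label": "OK", "methylation": False},
    {"sample": "meth", "concordance": 98.9, "label": "OK", "methylation": True},
]
status, failures = gate(cells)
```

The methylation carve-out only skips the concordance threshold; a methylation cell labeled REGRESSION would still fail the gate in this sketch.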

Docs

make docs builds cleanly with no mdbook warnings.
The release notes are generated automatically by release-please from the conventional-commit history, so the real check is upstream: every user-visible PR in the release window has a correct conventional-commit type, and any breaking change carries a ! / BREAKING CHANGE: marker so it lands in the ⚠ BREAKING CHANGES section (see Flagging breaking changes). NEWS.md is not updated — it is frozen at 0.2.0; 0.3.0 and later live only in CHANGELOG.md and the GitHub Releases page.
docs/src/reference/pr-catalog.md FG-MAIN-TABLE block has a row for every fork-carried PR landed since the previous tag, with its upstream disposition (see Contributing).
docs/src/reference/changelog.md and docs/src/cli/version.md examples reference the new release string.
Spot-check the bwa-mem3-bench reference numbers in docs/src/performance/overview.md against the bench’s regression.md for the tagging SHA.

Cutting a release

Releases are not tagged by hand. The .github/workflows/release.yml workflow runs release-please on every push to main and a tarball job on each published release.

release-please maintains a standing release PR. After PRs land on main, release-please opens (or updates) a PR titled chore(main): release X.Y.Z. It computes the next version from the conventional-commit history since the last tag, bumps version.txt and .release-please-manifest.json, and prepends the generated section to CHANGELOG.md. The bump level is driven entirely by the commit types: feat: → minor, fix:/perf:/etc. → patch, and a ! / BREAKING CHANGE: marker forces the breaking bump (a minor pre-1.0, since bump-minor-pre-major is set). If the proposed version is wrong, the fix is upstream — correct the offending commit’s type or add a breaking marker (see Flagging breaking changes), not the release PR.
Review the release PR. Confirm the proposed version matches the change set per the semver policy, and that the generated CHANGELOG.md reads correctly — in particular that any breaking change appears under ⚠ BREAKING CHANGES. Run the release-readiness checklist against the PR’s base commit.
Merge the release PR. This is the action that ships the release. release-please creates the vX.Y.Z tag and a GitHub Release whose body is the generated changelog section. Read the Docs activates a versioned build at /vX.Y.Z/ automatically once the tag appears.
The tarball job runs automatically once the release is created. It checks out the tag with submodules, verifies version.txt matches the tag, builds the vendored Source_code_including_submodules.tar.gz (all submodules bundled, no .git/), smoke-tests that it compiles and reports the right version, uploads the asset plus its .sha256, and appends a “For packagers” block (with the asset URL and sha256) to the release body. Bioconda recipes pin against this asset.
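
The bump rule in the first step can be modelled in a few lines of Python. This is a simplified sketch of the rule as stated on this page, not release-please's actual algorithm: every non-feat, non-breaking type is treated as a patch, and the function name and input shape are hypothetical.

```python
def next_version(current: str, commits) -> str:
    """commits: iterable of (type, breaking) pairs, e.g. ("feat", False)."""
    major, minor, patch = (int(part) for part in current.split("."))
    commits = list(commits)
    if not commits:
        return current
    if any(breaking for _, breaking in commits):
        if major == 0:
            return f"0.{minor + 1}.0"  # bump-minor-pre-major: a break bumps minor
        return f"{major + 1}.0.0"
    if any(kind == "feat" for kind, _ in commits):
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"
```

Under this model a breaking perf commit on 0.4.0 proposes 0.5.0, the same result as a feat commit, which is why the breaking marker matters for the changelog rather than for the version number while the project is pre-1.0.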

Note — Manual rebuild of a tarball asset

The tarball job can be re-run for an existing tag via the workflow’s workflow_dispatch input (e.g. to repair a missing asset), but v0.2.0 is rejected — its asset is pinned by sha256 in an open bioconda PR and must not change.

Note — Tarball builds and the version string

A source tarball has no git history, so scripts/version.sh cannot append a dev suffix — but it still reads the git-tracked version.txt, so a make from the tarball prints the bare base version (e.g. 0.4.0) with no further action needed.

Post-release verification

After the release PR is merged and the GitHub release is published:

Wait ~5 minutes for Read the Docs to build the new version, then open https://bwa-mem3.readthedocs.io/en/v0.X.Y/ and confirm:
- The version selector lists v0.X.Y.
- The home page renders with no missing-page errors.
- developer-guide/launcher.md, performance/overview.md, and methylation/tags.md all render with their mermaid diagrams and tables intact (these are the most diagram-heavy pages).

Pull the tag in a clean clone and verify bwa-mem3 version reports the bare version string (no -<sha> dev suffix):

git clone -b v0.X.Y --depth 1 https://github.com/fg-labs/bwa-mem3.git
cd bwa-mem3 && make
./bwa-mem3 version | head -1
# expect: 0.X.Y   (no leading 'v'; mirrors version.txt)

If the docs build failed on RTD or the version string is wrong, do not delete or move the tag. Tags are immutable in practice — let release-please open the next release PR with the fix (a follow-up 0.X.(Y+1) patch) instead.

Branch and tag conventions

All release tags are on the main branch, which carries both upstream bwa-mem2 commits and fork-carried changes. See Branch and worktree conventions for the full branching model.
Tags are prefixed with v: v0.1.0, v0.2.0, etc.
Pre-release tags use a -pre suffix: v0.1.0-pre.
Patch releases increment the third component: v0.1.1.
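
A tag that follows these conventions can be parsed with a short regular expression. The sketch below is illustrative only; the helper name parse_tag is hypothetical, and it accepts exactly the forms listed above (a v prefix, three numeric components, and an optional -pre suffix).

```python
import re

TAG_RE = re.compile(r"v(\d+)\.(\d+)\.(\d+)(-pre)?")


def parse_tag(tag: str):
    """Return (major, minor, patch, is_prerelease), or None if the tag is malformed."""
    match = TAG_RE.fullmatch(tag)
    if match is None:
        return None
    major, minor, patch = (int(group) for group in match.group(1, 2, 3))
    return (major, minor, patch, match.group(4) is not None)
```

For example, parse_tag("v0.1.0-pre") yields (0, 1, 0, True), while a bare 0.1.0 without the v prefix is rejected.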

What’s Different table update

When a release bundles new fork-carried commits that were not previously documented, update the FG-MAIN-TABLE in docs/src/reference/pr-catalog.md in the same PR before tagging. See Contributing for the rule.

Branch and worktree conventions

This page describes how the bwa-mem3 repository branches relate to upstream bwa-mem2, the policy for where PRs land, and the conventions for local worktrees when working on multiple branches simultaneously.

Branch model

`master` — upstream mirror

master tracks the upstream bwa-mem2 master branch verbatim. No fork-carried changes are applied here. When upstream bwa-mem2 merges new commits, master is fast-forwarded to match.

master is the starting point for upstream rebase operations. It is never the target of fork PRs.

`main` — fork integration branch

main carries all fork-carried commits on top of a rebased upstream baseline. This is the branch that:

All new feature, fix, and improvement PRs target.
All git tags (v0.X.Y) are placed on.
Read the Docs /latest/ follows.

When upstream bwa-mem2 makes significant changes, master is fast-forwarded and then main is rebased onto the new master tip. The rebase is verified by running the full test suite before the result is pushed.

Feature and fix branches

All development work happens on short-lived branches that are merged into main via pull request. Branch name conventions:

Prefix	Use
`feat/`	New features or capabilities
`fix/`	Bug fixes
`perf/`	Performance improvements
`test/`	Test additions or improvements
`docs/`	Documentation changes
`ci/`	CI / build system changes
`refactor/`	Code restructuring without behaviour change

Branch names use kebab-case after the prefix: fix/kswv-nrow-zero, perf/libsais-fm-index, test/regression-tests.
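
As an illustration, the naming rule can be expressed as a regular expression. This is a hypothetical check, not something the repository is known to enforce; it assumes kebab-case means lowercase letters and digits separated by single hyphens.

```python
import re

PREFIXES = ("feat", "fix", "perf", "test", "docs", "ci", "refactor")
BRANCH_RE = re.compile(r"(?:%s)/[a-z0-9]+(?:-[a-z0-9]+)*" % "|".join(PREFIXES))


def is_valid_branch(name: str) -> bool:
    """True when `name` is '<prefix>/<kebab-case-slug>' with a listed prefix."""
    return BRANCH_RE.fullmatch(name) is not None
```

The three example names above all pass, while names without a prefix, with uppercase letters, or with a trailing hyphen do not.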

Upstream rebase cadence

main is rebased onto master (i.e., onto upstream bwa-mem2) periodically — not on every upstream commit, but when upstream merges a batch of changes worth incorporating. The process is:

Fast-forward master to the new upstream tip.
Rebase main onto master, resolving any conflicts.
Run make && make test to confirm the rebase is correct.
Push master and main to the fg-labs remote.

Warning — Do not merge upstream into main

Always rebase rather than merge when incorporating upstream changes. Merge commits obscure the fork-carried commit history and make the What’s Different table harder to maintain.

Worktrees for parallel branches

When working on multiple branches simultaneously, use git worktrees instead of stashing or switching branches. Each worktree is a sibling directory of the main clone.

Creating a worktree for a PR branch

# Fetch the PR's head branch from the fg-labs remote
git fetch fg-labs <head-branch-name>

# Create a worktree with a local branch tracking the remote branch
git worktree add ../pr-<N> -b pr-<N> --track fg-labs/<head-branch-name>

The local branch name and directory name match the PR number (pr-N).

Creating a worktree for a new issue branch

# Fetch the latest main from fg-labs
git fetch fg-labs main

# Create a new feature branch off fg-labs/main
git worktree add ../issue-<N> -b <prefix>/issue-<N>-<short-slug> fg-labs/main

# Unset the upstream so the branch is untracked until first push
git -C ../issue-<N> branch --unset-upstream

On first push, push to fg-labs so the head branch is in the same organisation as the PR base:

git push -u fg-labs HEAD

Worktree naming conventions

Directory name	Branch type
`main/`	Primary checkout; tracks `fg-labs/main`
`pr-<N>/`	PR review; local branch `pr-N` tracks `fg-labs/<head-branch>`
`issue-<N>/`	Issue work; local branch `<prefix>/issue-N-<slug>`
Descriptive name	Feature work not yet tied to a PR or issue

Listing and removing worktrees

# List all worktrees
git worktree list

# Remove a worktree after the PR is merged
git worktree remove ../pr-<N>
git branch -D pr-<N>

# Remove an issue worktree
git worktree remove ../issue-<N>
git branch -D <prefix>/issue-<N>-<slug>

Note — Worktree directories are siblings, not nested

All worktree directories sit next to the main clone at the same directory level, not inside it. This avoids confusing git commands that walk parent directories looking for .git.

PR policy

All PRs target main.
PRs from fork contributors should be opened against fg-labs/bwa-mem3 main.
Every PR that adds a fork-carried commit must update the FG-MAIN-TABLE in docs/src/reference/pr-catalog.md in the same PR. See Contributing.
Merge policy: squash-merge for single-commit changes; rebase-merge for multi-commit PRs with a clean commit history.

Contributing

This page covers the mechanics of submitting changes to bwa-mem3: commit conventions, PR workflow, CI requirements, and the rule for keeping the fork-lineage table current.

Before you start

Check the open issues and existing PRs to avoid duplicate work.
For substantial changes, open an issue first to discuss scope and approach.
Fork or branch from fg-labs/bwa-mem3 main. See Branch and worktree conventions for the branching model.

Commit message conventions

bwa-mem3 follows Conventional Commits (v1.0.0). Every commit message must start with a type prefix:

Prefix	Use
`feat:`	New feature or capability
`fix:`	Bug fix
`perf:`	Performance improvement
`test:`	Test additions or changes
`docs:`	Documentation only
`ci:`	CI / build-system changes
`refactor:`	Restructuring without behaviour change
`chore:`	Maintenance (dependency bumps, version pins)

The subject line is lowercase after the prefix, imperative mood, no trailing period. Keep it under 72 characters. Body lines wrap at 100 characters.

Good:

fix: kswv nrow==0 batch skips rowMax store when i==0

Exercises the all-len1==0 path across SSE4.1, AVX2, AVX-512BW, and ARM NEON.
Without the `if (i > 0)` guard, the store writes SIMD_WIDTH* bytes before the
allocation.

Closes #38.

Not acceptable:

Fixed stuff
Updated kswv
WIP
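
The mechanical parts of the subject-line rules can be checked with a small script. The Python sketch below is a hypothetical helper, not a hook shipped with the repository; it reads "lowercase after the prefix" as "the description does not start with an uppercase letter" and cannot judge imperative mood.

```python
import re

TYPES = ("feat", "fix", "perf", "test", "docs", "ci", "refactor", "chore")
SUBJECT_RE = re.compile(
    r"(?P<type>%s)(?:\([^)]+\))?(?P<bang>!)?: (?P<desc>.+)" % "|".join(TYPES)
)


def subject_problems(subject: str):
    """Return a list of rule violations for a commit subject (empty if none)."""
    match = SUBJECT_RE.fullmatch(subject)
    if match is None:
        return ["missing or unknown type prefix"]
    problems = []
    desc = match.group("desc")
    if desc[0].isupper():
        problems.append("description should start lowercase")
    if desc.endswith("."):
        problems.append("trailing period")
    if len(subject) >= 72:
        problems.append("subject is 72 characters or longer")
    return problems
```

The "Good" subject above produces an empty list, and each of the three "Not acceptable" examples is rejected for lacking a type prefix.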

Flagging breaking changes

Releases are generated automatically by release-please from the conventional-commit history (see Release process). A change that breaks a downstream consumer — see the semver policy for what qualifies — must be marked so the breaking-change notice lands in CHANGELOG.md and the GitHub release body. Describing the break only in the commit body is not enough: release-please does not read prose, so an unmarked break is silently filed under its plain type (e.g. a perf: commit lands under “Performance” with no warning) and downstream users never see it.

Mark a break in either of two ways (or both):

Append a ! after the type/scope: perf(index)!: stop building .0123 by default, and/or
Add a BREAKING CHANGE: footer (the colon is required) describing the break and the migration path.

perf(index)!: stop building the unpacked .0123 reference by default

BREAKING CHANGE: `bwa-mem3 index` no longer writes `.0123`; `mem` reconstructs
reference bases from `.pac` on demand. External tools that read `.0123` directly
(e.g. sharing an index with bwa-mem2) must re-run with `index --emit-unpacked-ref`.

release-please then emits a ⚠ BREAKING CHANGES section automatically. While the project is pre-1.0 this still bumps only the minor version (bump-minor-pre-major is set), matching the semver policy’s rule that a pre-1.0 break is allowed when it is called out clearly — the footer is how it gets called out.
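
The difference between a marked and an unmarked break can be shown with a small detector. This Python sketch follows the two markers described on this page and is not release-please's own parser; the function name is hypothetical.

```python
import re

# "!" directly before the colon of the subject, after an optional (scope).
BANG_RE = re.compile(r"^[a-z]+(?:\([^)]+\))?!: ")
# A footer line that starts with "BREAKING CHANGE: " (the colon is required).
FOOTER_RE = re.compile(r"^BREAKING CHANGE: ", re.MULTILINE)


def is_breaking(message: str) -> bool:
    """True when the commit message carries an explicit breaking-change marker."""
    subject, _, body = message.partition("\n")
    return bool(BANG_RE.match(subject)) or bool(FOOTER_RE.search(body))
```

A message whose body merely says that something breaks returns False, which is the failure mode described above: prose alone is not detected.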

Note — Squash merges

The project squash-merges single-commit PRs and rebase-merges multi-commit PRs with a clean history. For a squash-merged PR, release-please reads only the squash-merge commit subject (which defaults to the PR title), not the individual commit bodies — so put the ! / BREAKING CHANGE: marker on the PR title or the squash commit message itself, because a marker buried in a sub-commit body is discarded at squash time. A rebase-merged PR keeps its individual commits, so a marker in any sub-commit message is preserved and detected; even so, prefer putting it on the PR title as well so the convention is uniform regardless of how the PR is merged.

Pull request workflow

Push your branch to fg-labs/bwa-mem3 (or your fork) and open a PR targeting fg-labs/bwa-mem3 main.
The PR description should explain the motivation, summarise the change, and note any benchmarks or test results.
All CI jobs must pass before merge. See CI matrix below.
CodeRabbitAI reviews every PR automatically. Address all comments, including inline suggestions, summary comments, and nitpicks. Do not dismiss comments without a reply explaining why the suggestion was not adopted.
A project maintainer will review and merge once CI is green and all comments are resolved.

Note — Draft PRs first

Open PRs as drafts while CI is running or while you are actively revising. Convert to ready-for-review only when the branch is stable, CI is green, and you have self-reviewed the diff.

The FG-MAIN-TABLE rule

Every PR that introduces a new fork-carried commit — a commit that is on main but not on master (the upstream bwa-mem2 mirror) — must update the FG-MAIN-TABLE block in docs/src/reference/pr-catalog.md in the same PR.

The table records each fork-carried change, its bwa-mem3 PR number, the corresponding upstream bwa-mem2 PR or issue (if any), and its upstream status. Keeping this table current is the primary mechanism by which the project maintains transparency about its relationship to upstream.

Warning — Do not skip the table update

A PR that adds a fork-carried commit but omits the table update will be sent back for revision. The table is reviewed as part of the standard PR checklist.

What counts as a fork-carried commit

A commit is fork-carried if:

It adds new behaviour, fixes a bug, or changes build infrastructure in a way that diverges from upstream bwa-mem2 master.
It is present on fg-labs/bwa-mem3 main but not (yet) merged upstream.

Pure documentation commits, CI-only changes, and upstream-rebase bookkeeping commits do not need a table entry.

CI matrix

CI runs on every PR and on push to main. The matrix covers:

Row	Architecture	ISA	Platform
`sse41`	x86_64	SSE4.1	Ubuntu
`avx2`	x86_64	AVX2	Ubuntu (canonical)
`avx512bw`	x86_64	AVX-512BW	Ubuntu
`arm64-linux`	aarch64	NEON	Ubuntu ARM
`arm64-macos`	arm64	NEON	macOS

The canonical row (avx2) is the only one that runs regression tests (shell scripts in test/regression/). Unit tests run on every row. Integration tests run on four rows: sse41, avx2, arm64-linux, and arm64-macos.

A PR must pass all rows before merge.
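
The scope rules can be restated as a lookup from matrix row to test categories. The sketch below is a reading aid in Python, not the CI configuration; it assumes the integration rows named in the text correspond to the row names sse41, avx2, arm64-linux, and arm64-macos from the table.

```python
INTEGRATION_ROWS = {"sse41", "avx2", "arm64-linux", "arm64-macos"}
ALL_ROWS = INTEGRATION_ROWS | {"avx512bw"}
CANONICAL_ROW = "avx2"


def categories_for(row: str):
    """Return the test categories that run on a given CI matrix row."""
    if row not in ALL_ROWS:
        raise ValueError(f"unknown CI row: {row}")
    categories = ["unit"]  # unit tests run everywhere
    if row in INTEGRATION_ROWS:
        categories.append("integration")
    if row == CANONICAL_ROW:
        categories.append("regression")
    return categories
```

Under this reading the avx512bw row runs unit tests only, and avx2 is the only row that runs all three categories.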

Code style

C++14, gnu++14 dialect.
Match the style of the surrounding code. The codebase inherits the upstream bwa-mem2 style, which is C-ish C++ with minimal STL use in hot paths.
For new test code, follow the doctest patterns documented in the test framework.
New SIMD code must include src/simd_compat.h rather than platform-specific headers directly. See SIMD dispatch architecture.

Adding a test for your change

Bug fix → add a unit test or integration test that fails without the fix and passes with it.
New feature → add unit tests for the core logic and, if the feature is end-to-end testable with a shell invocation, a regression test in test/regression/.
Performance change → run the benchmark harness (bench/) to confirm the improvement and include median wall-clock numbers in the PR description.

See Regression test framework for the full guide on where to add tests and how to organise them.

bwa-mem3-bench

bwa-mem3-bench is a benchmarking suite that measures the alignment performance of bwa-mem3 against the upstream bwa-mem2 v2.2.1 baseline. It runs on AWS Batch spot instances across four dataset types, all aligned against the hg38 reference: whole-genome sequencing (WGS), whole-exome sequencing (WES), panel, and bisulfite sequencing (methylation). The suite covers three CPU microarchitectures: ARM NEON, x86 AVX2, and x86 AVX-512. Results are collected into a SQLite database for local analysis and reporting. The project is implemented in Python (orchestration, reporting, and CLI), Rust (BAM comparison tool), Snakemake (alignment workflow), and AWS CDK (cloud infrastructure).

When you’d use it

Use bwa-mem3-bench when you need reproducible, multi-architecture throughput numbers before committing a bwa-mem3 change to production or before deciding whether to adopt bwa-mem3 in place of bwa-mem2. It provides a structured “bless baseline, then compare” workflow: an upstream bwa-mem2 run is blessed once per upstream tag and stored in S3; subsequent bwa-mem3 runs are measured against that fixed baseline. Running a full benchmark fires a Snakemake coordinator job on AWS Batch and costs roughly $10 in spot capacity.

How it relates to bwa-mem3

bwa-mem3-bench is the authoritative source of benchmark evidence for every performance claim made in the bwa-mem3 documentation and changelog. When the Performance Overview cites speedup numbers, those numbers come from bwa-mem3-bench runs collected after the relevant PR was merged. The suite also validates that bwa-mem3 does not regress relative to bwa-mem2 on any supported architecture before a new release is tagged.

Per-release concordance history

Per-(release, sample) primary-alignment concordance against upstream bwa-mem2 v2.2.1, with supplementary-alignment counts, across released bwa-mem3 versions. Concordance is the minimum vs-baseline value over reps and x86 architectures (deterministic per sample); supp_query/supp_baseline are total supplementary records emitted by bwa-mem3 and bwa-mem2, and count_mismatch is the number of templates whose supplementary count differs. The divergence catalog explains what each kind of drift is and its budget.

This table and the divergence catalog are both generated from the benchmark database — do not edit them by hand. Regenerate after a new release is collected with pixi run python -m bwa_mem3_bench.cli bench docs --releases v0.2.0=<sha>,v0.2.1=<sha>,... (in the bwa-mem3-bench repo), then replace the content between the FG-DIVERGENCE-CATALOG / FG-RELEASE-TABLE markers with the emitted runs/docs/{divergence-catalog.md,release-table.md} (the inject_between_markers helper in bwa_mem3_bench.report.docs does exactly this splice).

release	sample	concordance_%	supp_query	supp_baseline	count_mismatch
v0.2.0	meth-twist-emseq-5M	98.8852	0	0	0
v0.2.0	panel-twist-5M	100.0000	186946	186946	0
v0.2.0	smoke-1M	100.0000	1455	1455	0
v0.2.0	smoke-meth	98.8573	0	0	0
v0.2.0	wes-5M	100.0000	5118	5118	0
v0.2.0	wgs-5M	100.0000	49686	49686	0
v0.2.1	meth-twist-emseq-5M	98.8852	0	0	0
v0.2.1	panel-twist-5M	100.0000	186946	186946	0
v0.2.1	smoke-1M	100.0000	1455	1455	0
v0.2.1	smoke-meth	98.8573	0	0	0
v0.2.1	wes-5M	100.0000	5118	5118	0
v0.2.1	wgs-5M	100.0000	49686	49686	0
v0.2.2	meth-twist-emseq-5M	98.8773	0	0	0
v0.2.2	panel-twist-5M	99.9414	187039	186946	199
v0.2.2	smoke-1M	99.9460	0	0	0
v0.2.2	smoke-meth	98.8429	0	0	0
v0.2.2	wes-5M	99.9996	5123	5118	5
v0.2.2	wgs-5M	99.9893	49926	49686	256
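
The column definitions above can be sketched as two small reductions. This is a minimal illustration under stated assumptions: the row layout, the function names, and every number in it are invented for the example and are not taken from the benchmark database.

```python
from collections import defaultdict

# Illustrative per-run measurements: (release, sample, arch, rep, concordance_%).
# These values are made up for the example; they are not benchmark results.
runs = [
    ("vX", "sample-a", "x86-arch-1", 1, 99.95),
    ("vX", "sample-a", "x86-arch-1", 2, 99.95),
    ("vX", "sample-a", "x86-arch-2", 1, 99.94),
    ("vX", "sample-b", "x86-arch-1", 1, 100.0),
]


def min_concordance(rows):
    """Reduce per-run values to one number per (release, sample): the minimum."""
    worst = defaultdict(lambda: float("inf"))
    for release, sample, _arch, _rep, pct in rows:
        key = (release, sample)
        worst[key] = min(worst[key], pct)
    return dict(worst)


def count_mismatch(supp_query, supp_baseline):
    """Count templates whose supplementary-record count differs between tools."""
    names = set(supp_query) | set(supp_baseline)
    return sum(supp_query.get(n, 0) != supp_baseline.get(n, 0) for n in names)


table = min_concordance(runs)
print(table)
print(count_mismatch({"t1": 1, "t2": 0}, {"t1": 1, "t2": 2, "t3": 1}))
```

Taking the minimum rather than the mean means a single divergent architecture or replicate is enough to lower the reported figure.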

bwa-mem3-rs

bwa-mem3-rs is a Rust crate that provides idiomatic bindings to the bwa-mem family of short-read aligners — bwa (original), bwa-mem2, and bwa-mem3. It exposes a safe Rust API over the underlying C++ alignment engine, allowing Rust programs to index a reference, configure alignment parameters, and align reads without shelling out to an external process. The bindings link statically against the chosen backend, so a binary built with bwa-mem3-rs carries the aligner and its SIMD kernels as a self-contained artifact.

When you’d use it

Use bwa-mem3-rs when you are building a Rust bioinformatics tool or pipeline that needs short-read alignment as an in-process library call rather than a subprocess invocation. It is especially useful when latency between reads arriving and alignments being available matters (no process-startup overhead), or when you want tight integration between the aligner’s output and downstream Rust code such as UMI grouping, consensus calling, or duplicate marking.

How it relates to bwa-mem3

bwa-mem3-rs targets bwa-mem3 as its primary high-performance backend. It is the intended integration path for fgumi and other Fulcrum Genomics tools that need alignment as a library dependency. Changes to bwa-mem3’s public API, flag semantics, or output format are coordinated with bwa-mem3-rs to keep the bindings current.

bwa-mem2 (upstream)

bwa-mem2 is the direct predecessor of bwa-mem3 and the project from which the bwa-mem3 fork is derived. It was created at Intel’s Parallel Computing Lab by Vasimuddin Md and Sanchit Misra to accelerate the alignment algorithm originally written by Heng Li in bwa. bwa-mem2 achieves a 1.3–3.1x throughput improvement over the original bwa-mem by replacing key inner loops with vectorised implementations (SSE4.1, SSE4.2, AVX2, and AVX-512) and by switching to a more compact FM-index encoding. Its output is identical to bwa-mem at the alignment level, and it is distributed under the MIT license.

Lineage

The bwa alignment family has evolved through three generations, each building on the last:

bwa — Written by Heng Li. Established the BWA-MEM algorithm, the SAM output format conventions, and the .bwt / .pac / .ann / .amb index layout.
bwa-mem2 (Vasimuddin et al., Intel) — Replaced scalar inner loops with SIMD kernels; introduced the compact .bwt.2bit.64 and .0123 index formats; retained full output compatibility with bwa-mem. (bwa-mem3 reads the same formats but no longer builds the unpacked .0123 by default — it pac-fetches reference bases from .pac. If you want to share a bwa-mem3 index with bwa-mem2, which loads .0123 directly, build it with bwa-mem3 index --emit-unpacked-ref.)
bwa-mem3 (Fulcrum Genomics fork) — Carries correctness fixes, performance improvements, new features (bisulfite alignment, mimalloc, ARM Neon), and expanded architecture support on top of the bwa-mem2 codebase. See What’s Different from bwa-mem2 for the full change catalog.

When you’d use it

Use bwa-mem2 directly when you need a stable, widely validated aligner with precompiled binaries available via Bioconda and the project’s GitHub releases page, and when you do not require the features or fixes that bwa-mem3 adds. bwa-mem2 is also the right choice when you are working in an environment where the bwa-mem3 fork has not yet been validated against your specific reference or sequencing library type.

How it relates to bwa-mem3

bwa-mem3 tracks bwa-mem2’s master branch and periodically rebases fork-carried commits on top of upstream changes. The What’s Different section documents every divergence between the two projects, and the PR catalog page tracks which bwa-mem3 changes have been proposed back to bwa-mem2. The goal is to keep the fork divergence minimal and to upstream as many fixes as practical.

fgumi

fgumi (Fulcrum Genomics Unique Molecular Indexing tools) is a high-performance suite of command-line tools for processing UMI-tagged next-generation sequencing data. Written in Rust, it provides UMI extraction from FASTQ files, read grouping by UMI with configurable assignment strategies, UMI-aware deduplication, simplex and duplex consensus calling, CODEC consensus calling, quality filtering of consensus reads, and overlapping read-pair clipping. fgumi is the intended successor to the Scala-based fgbio toolkit for UMI processing, targeting significantly higher throughput on multi-core systems. It is published on Bioconda and documented at https://fgumi.readthedocs.io.

Warning — Research preview

fgumi is currently a research preview. The Fulcrum Genomics team targets June 2026 for recommending fgumi over fgbio for production use. Verify fitness for your application before deploying in a clinical or production pipeline.

When you’d use it

Use fgumi when your sequencing library includes unique molecular identifiers and you need to group reads by UMI, call simplex or duplex consensus sequences, or remove PCR duplicates in a UMI-aware manner. It handles the standard commercial UMI library preparations (IDT xGen, KAPA, Twist, QIAseq, and others) and the CODEC protocol for duplex sequencing. fgumi is designed to be run after alignment with bwa-mem3 (or bwa-mem2) and before downstream variant calling or methylation analysis.

How it relates to bwa-mem3

fgumi and bwa-mem3 are sibling projects maintained by Fulcrum Genomics and are designed to work together in the same alignment-and-consensus pipeline. bwa-mem3 provides the aligned BAM that fgumi takes as input for grouping and consensus calling. The two projects share build and documentation conventions (mdbook on Read the Docs, Fulcrum theme, conventional commits) and are benchmarked together in the fgumi-benchmarks internal dataset suite. The intended integration path for in-process alignment within fgumi is bwa-mem3-rs, the Rust bindings for bwa-mem3.

bwameth.py

bwameth.py is a Python script written by Brent Pedersen that implements bisulfite sequencing (BS-Seq) alignment using the in-silico three-letter genome approach. It converts all cytosines to thymines in both the reference and the reads (C-to-T on the forward strand, G-to-A on the reverse), aligns the converted sequences with bwa-mem (or optionally bwa-mem2), and then recovers the original read sequence from the aligner’s tag output to tabulate methylation. bwameth.py supports single-end and paired-end reads from the directional bisulfite protocol and is published at https://arxiv.org/abs/1401.1129.
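
The three-letter projection itself is a plain character substitution. The following is a minimal sketch of the idea, not code from bwameth.py; the function names c2t and g2a are chosen here for illustration.

```python
def c2t(seq: str) -> str:
    """Forward-strand projection: every C becomes T."""
    return seq.upper().replace("C", "T")


def g2a(seq: str) -> str:
    """Reverse-strand projection: every G becomes A."""
    return seq.upper().replace("G", "A")


read1 = "ACGTCCGA"
read2 = "TCGGACGT"
print(c2t(read1))  # ATGTTTGA
print(g2a(read2))  # TCAAACAT
```

Because the substitution discards information, the original read has to be carried alongside the converted one so that methylation can be tabulated later.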

When you’d use it

Use bwameth.py when you need a battle-tested, community-supported bisulfite aligner that runs on top of the standard bwa-mem or bwa-mem2 you have already installed, and when you prefer a Python wrapper over a self-contained binary. It also remains the reference for downstream tabulation tools such as MethylDackel and SNP callers such as biscuit that expect the bwameth.py output format. For the actual methylation tabulation and variant calling steps, bwameth.py’s author recommends those dedicated tools rather than the tabulation utilities bundled with the original script.

How it relates to bwa-mem3

bwa-mem3 mem --meth is a single-binary alignment pipeline that, in its default collapsed mode, reproduces bwameth.py’s read placement (it is a placement drop-in, not a byte-for-byte clone). The key difference is where it scores: where bwameth.py converts both reads and reference to 3-letter space and aligns there, bwa-mem3 uses the 3-letter projection only to find seeds, then extends and scores against the original 4-letter reference. That enables a second, opt-in mode — --meth-scoring genomic — which keeps real C/T and G/A variants as mismatches (truthful NM/MD), something a collapsed-space aligner cannot do.

It rewrites the @SQ headers to consolidate the per-strand contig pairs back to canonical chromosome names, emits Bismark-compatible XR:Z / XG:Z / XM:Z auxiliary tags, and writes a @PG ID:bwa-mem3-meth header. The bwameth.py-style chimera QC heuristic is available via --chimera-qc (off by default — Bismark behavior). The Methylation Reference documents the full implementation, including the two --meth-scoring modes, the Bismark tags, and the --set-as-failed / --chimera-qc flags.

Note — external c2t interop was removed

Because scoring runs on the original bases, bwa-mem3 mem --meth can no longer consume pre-converted reads or a .bwameth.c2t reference. Pass raw FASTQ and the original ref.fa prefix; see Migrating from bwameth.py c2t.

PR catalog

This page is the single, consolidated record of every change carried in bwa-mem3 main on top of upstream bwa-mem2. Each row lists the bwa-mem3 PR, a one-line description, its class, and its upstream bwa-mem2 disposition (PR/issue and status). The narrative What’s Different pages (correctness, performance, features, architecture support, build & infrastructure) explain the why; this page is the flat, scannable what.

Note — generated table; do not hand-edit between the markers

The table below is consolidated from the per-area change history (git log --reverse --no-merges master..main for the change list, merged with the upstream-disposition tracking). Every PR that adds a fork-carried commit must add its row here (the FG-MAIN-TABLE rule — see Contributing). “Upstream status” of fork-only means no upstream PR exists; open means an upstream PR/issue existed at implementation time but was unmerged.

Fork-carried changes

PR	Change	Class	Upstream PR / issue	Upstream status
#1	feat(arm64): make Linux aarch64 build + CI-test on every fg-main push	Architecture support	bwa-mem2#288	open PR
#2	chore: configure CodeRabbit to review PRs against fg-main	Build & infrastructure	—	fork-only
#3	docs: add FG-MAIN.md documenting the fork’s relationship to upstream	Build & infrastructure	—	fork-only
#4	ci: pin GitHub Actions to full-length commit SHAs	Build & infrastructure	—	fork-only
#5	fix(hdr): align bwamem.h declarations with bwamem.cpp definitions	Correctness	—	fork-only
#6	feat(hdr): export mem_infer_dir for external consumers	Features	—	fork-only
#7	chore: move profiling globals out of main.cpp	Build & infrastructure	—	fork-only
#8	feat: expose worker_alloc/worker_free, the core worker_t pre-allocation helpers	Features	—	fork-only
#9	feat: split mem_sam_pe into mem_pair_resolve + thin emission wrapper	Features	—	fork-only
#10	ci: pin dwgsim seed (-z 42) to stop parity-test flakiness	Build & infrastructure	—	fork-only
#12	feat: --bam[=LEVEL] output flag for direct BAM emission	Features	—	fork-only
#13	feat(meth): --meth + `index --meth` — bwameth.py-equivalent bisulfite mode	Features	—	fork-only
#16	build(make): add explicit arch=avx512bw target	Architecture support	—	fork-only
#17	fix: compute no_pairing 0x2 flag from the emitted alignment	Correctness	—	fork-only
#18	[proto] NEON kswv mate-rescue — correctness + perf harness	Architecture support	—	fork-only
#19	feat: vendor mimalloc v3.3.0 and link by default	Features	—	fork-only
#20	[proto] AVX2 kswv mate-rescue — stacked on PR 18	Architecture support	—	fork-only
#21	fix(kswv): apply NEON score2-scan fixes to AVX-512BW kernel	Correctness	—	fork-only
#22	fix: zero bseq1_t in kseq2bseq1 so realloc’d entries don’t carry garbage	Correctness	—	fork-only
#23	test(ci): add unit-test harness, fixtures, and ARM build support	Build & infrastructure	—	fork-only
#24	ci: expand workflow matrix + add canonical deep-test row	Build & infrastructure	—	fork-only
#26	fix(kswv): gate AVX2 arch dispatch on !AVX512BW	Correctness	—	fork-only
#28	fix(kswv): consolidate score2 plateaus per-lane to match scalar ksw_align2	Correctness	—	fork-only
#29	fix(kswv): port score2 plateau consolidation to NEON + AVX-512BW	Correctness	—	fork-only
#30	fix(kswv): apply score2 plateau fix + missing filters to kswv_512_16	Correctness	—	fork-only
#31	fix(kswv): rewrite kswv_neon_16 — real SIMD kernel with correct table + score2	Correctness	—	fork-only
#33	perf(seed): lockstep SMEM batching across N reads	Performance	—	fork-only
#34	test: doctest-based test framework scaffolding + Codecov	Build & infrastructure	—	fork-only
#35	chore: port four nh13 lh3/bwa PRs into bwa-mem2 (-z, -u/XB, MQ, @HD order)	Features	lh3/bwa#330	merged (lh3 only)
#42	feat(mem): emit HN:i tag with total hit count per primary	Features	lh3/bwa#438	analogous to bwa aln; no direct upstream port
#49	perf(header): batch -H ingestion to fix O(n^2) header read (closes #37)	Performance	bwa-mem2#204	open PR
#50	build(make): forward user CXXFLAGS/CPPFLAGS/LDFLAGS to final link steps	Build & infrastructure	bwa-mem2#290	open upstream PR
#51	fix(kswv): guard post-loop rowMax store on nrow==0 batches	Correctness	bwa-mem2#289	open PR (upstream covers AVX-512BW only)
#52	chore(version): stamp PACKAGE_VERSION from git describe at build time	Build & infrastructure	bwa-mem2#283, bwa-mem2#284	open issue + open PR
#53	chore: normalize CRLF line endings to LF (#43)	Build & infrastructure	—	fork-only
#54	fix(sam): sanitize whitespace in -R when embedding into @PG CL: field	Correctness	bwa-mem2#293	open issue
#55	fix(smem): size SMEM buffers from observed max read length (closes #44)	Correctness	bwa-mem2#238, bwa-mem2#210	PR closed without merge; issue open
#56	feat(mapq): add --supp-rep-hard-cap opt-in supp MAPQ rescoring	Features	bwa-mem2#260	open issue
#57	feat(index): libsais-based memory-bounded FM-index construction	Performance	—	fork-only
#58	perf: consolidated mapping speedups (ksw2, SMEM, SAL, SAM)	Performance	—	fork-only
#59	feat(makefile): parameterize PGO targets by arch + profile dir	Build & infrastructure	—	fork-only
#60	feat(cli): wire up --help across commands; add -h to top-level and index	Features	—	fork-only
#63	ci(proto-neon-kswv): split into fan-out/fan-in jobs with caching	Build & infrastructure	—	fork-only
#65	feat(shm): port `bwa shm` from bwa-mem v1	Features	—	fork-only (v1 feature port)
#67	feat(shm): add `bwa-mem2 shm --meth` for symmetric meth UX	Features	—	fork-only
#68	chore: rename shell vars BWAMEM2/BWA_MEM2[] to BWAMEM3/BWA_MEM3[]	Build & infrastructure	—	fork-only
#70	perf(kswv): add per-strip L1 prefetches to all u8/16 kernels	Performance	—	fork-only
#71	docs: add comprehensive mdbook on Read the Docs	Build & infrastructure	—	fork-only
#72	fix(test/meth): alias bwa-mem2 -> bwa-mem3 on PATH for bwameth.py oracle	Build & infrastructure	—	fork-only
#73	fix(fmi): parenthesize SA_COMPX_MASK precedence in sampled-SA prefetch	Correctness	—	fork-only
#74	fix(bntseq): bound .alt parse buffer to prevent stack overflow	Correctness	—	fork-only
#75	perf(fmi): bump SMEM_LOCKSTEP_N from 8 to 16	Performance	—	fork-only
#76	feat(bns): convert mem_matesw_batch_{pre,post} to bns_fetch_seq_v2	Architecture support	—	fork-only
#77	perf(ungapped): closed-form HIT for total_mis == 0	Performance	—	fork-only
#78	perf(ksort): replace per-call malloc with on-stack buffer for small n	Performance	—	fork-only
#79	Update index.md	Build & infrastructure	—	fork-only
#80	perf(libsais_build): skip wasted zero-init on unpack + SA buffers	Performance	—	fork-only
#81	fix(profiling): clamp display_stats nthreads to LIM_C	Correctness	—	fork-only
#82	feat(shm): serialize /bwactl RMW with a POSIX named semaphore	Features	—	fork-only
#83	feat(simd): replace multi-binary execv launcher with single-binary in-process dispatch	Architecture support	—	fork-only
#84	perf(build): default x86 single-binary baseline to avx2 (was sse41)	Architecture support	—	fork-only
#85	fix(matesw): copy ref slice before ksw_align2 to avoid SIGSEGV on shm-backed ref_string	Correctness	—	fork-only
#86	perf(x86): cap avx512bw autovec at 256-bit; bwa_shm /dev/shm preflight	Features	—	fork-only
#88	perf(fmi): inline backwardExt to recover gcc 12+ wall-clock regression	Performance	—	fork-only
#89	ci: migrate parity tests from dwgsim/phiX174 to holodeck/chr22	Build & infrastructure	—	fork-only
#90	feat(meth): emit Bismark-compatible XR/XG/XM auxiliary tags	Features	—	fork-only
#93	docs(install): list autoconf/automake/libomp/zlib system prereqs	Build & infrastructure	—	fork-only
#94	docs(install): fix RHEL/Fedora package name pkgconfig → pkgconf-pkg-config	Build & infrastructure	—	fork-only
#95	feat(simd): add SIMD host-floor precheck for multi-arch deployment	Features	—	fork-only
#96	docs: pre-release documentation pass for v0.2.0-pre	Build & infrastructure	—	fork-only
#97	chore(release): prep v0.2.0 release notes and metadata	Build & infrastructure	—	fork-only
#123	Stable tie-breaks + pdqsort	Performance	—	fork-only
#128	FASTQ reader fast path	Performance	—	fork-only
#140	Recover 8-bit banded SW (≥128 bp)	Performance	—	fork-only
#141	Gotoh gaps from H	Performance	—	fork-only
#143	Drop dead `qlen[]` param	Performance	—	fork-only
#144	Long-read kernel parity test	Performance	—	fork-only
#147	Short-circuit re-baseline scan	Performance	—	fork-only
#148	Remove dead SW code paths	Performance	—	fork-only
#149	Vectorize epilogue side-channel	Performance	—	fork-only
#150	Bound getScores prefetch reads	Performance	—	fork-only
#151	Unsigned 8-bit h0-prefix seed	Performance	—	fork-only
#152	`--profile` stage timing	Performance	—	fork-only
#153	zlib-ng inflate + 3rd worker	Performance	—	fork-only
#157	Right-size SA staging buffers	Performance	—	fork-only
#158	`gtle` contract test	Performance	—	fork-only
#160	NEON SW tuning	Performance	—	fork-only
#161	AVX2 SW tuning	Performance	—	fork-only
#162	AVX2 16-bit `kswv256_16`	Performance	—	fork-only
#164	NEON `movemask` parity test	Performance	—	fork-only

Upstream issues tracked but not yet fixed

These upstream issues are tracked in the bwa-mem3 issue list but do not yet have a corresponding fix in main:

Issue	Upstream reference	Notes
Split-alignment evidence loss vs bwa 0.7.17	bwa-mem2#273	issue #47 — under investigation
MAPQ/coordinate parity vs bwa mem 0.7.18	bwa-mem2#262, bwa-mem2#246, bwa-mem2#239	issue #48 — tracking only

Glossary

Terms used throughout this book, listed alphabetically.

@HD header The first line of a SAM file header. Specifies the SAM format version (VN) and sort order (SO). Required when any other header lines are present. See Output: SAM/BAM, headers, tags.

@PG header A SAM header line recording a program that processed the file, including ID, PN, VN, and CL fields. bwa-mem3 inserts ID:bwa-mem3 (or ID:bwa-mem3-meth in methylation mode). See Output: SAM/BAM, headers, tags.

@SQ header A SAM header line describing a reference sequence (chromosome). Contains the sequence name (SN) and length (LN). In methylation mode, bwa-mem3 post-processes @SQ lines to collapse f/r-prefixed contig names back to one entry per chromosome. See Chimera QC and header rewriting.

BAM Binary Alignment Map, a compressed binary encoding of SAM. bwa-mem3 writes BAM directly when the --bam flag is given; otherwise its SAM output can be converted to BAM by piping it through samtools. See Output: SAM/BAM, headers, tags.

Banded Smith-Waterman (banded SWA) A heuristic variant of the Smith-Waterman alignment algorithm that restricts the dynamic programming to a band of width w around the main diagonal. bwa-mem3 uses banded SWA for extension alignment; the bwa-mem2 kernels are SIMD-vectorized, and bwa-mem3 adds NEON implementations for ARM64 (including Apple Silicon). See SIMD dispatch architecture.
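
The effect of the band can be seen in a scalar toy version. This is a teaching sketch, not the bwa-mem3 kernel: it uses a linear gap penalty instead of affine gaps, the scoring defaults are illustrative, and the function name banded_sw is an assumption.

```python
def banded_sw(query: str, ref: str, w: int,
              match: int = 1, mismatch: int = -4, gap: int = -6) -> int:
    """Best local-alignment score using only cells with |i - j| <= w."""
    best = 0
    prev = [0] * (len(ref) + 1)
    for i in range(1, len(query) + 1):
        cur = [0] * (len(ref) + 1)  # cells outside the band stay at 0
        lo, hi = max(1, i - w), min(len(ref), i + w)
        for j in range(lo, hi + 1):
            sub = match if query[i - 1] == ref[j - 1] else mismatch
            cur[j] = max(0, prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


print(banded_sw("ACGT", "ACGT", w=1))      # on the main diagonal: full score
print(banded_sw("ACGT", "TTTTACGT", w=1))  # true hit lies outside the band
print(banded_sw("ACGT", "TTTTACGT", w=4))  # a wider band recovers it
```

The second call shows why the method is a heuristic: an alignment that sits further than w from the diagonal is simply not visited, which is the price paid for doing O(w) work per row instead of O(len(ref)).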

c2t Cytosine-to-thymine in-silico conversion applied to reads (or reference) before methylation alignment. In --meth mode, bwa-mem3 converts R1 reads C→T and R2 reads G→A inline, without writing intermediate FASTQ files. See Conversion details (C->T, G->A).

Chimera A read alignment where the aligned portion is short relative to the read length, often indicating a mapping artefact or a true chimeric molecule. In methylation mode, bwa-mem3 offers an opt-in chimera QC heuristic (--chimera-qc, off by default): if the longest contiguous M/=/X CIGAR run is less than 44% of the read length, the alignment is flagged 0x200, the proper-pair bit is cleared, and MAPQ is capped at 1. See Chimera QC and header rewriting.
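
The rule can be expressed in a few lines. This is a sketch under stated assumptions rather than the bwa-mem3 implementation: the function names are hypothetical, adjacent M/=/X operations are assumed to merge into one run, and the read length is passed in by the caller instead of being derived from the record.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def longest_aligned_run(cigar: str) -> int:
    """Longest stretch of consecutive M/=/X operations, in bases."""
    best = run = 0
    for length, op in CIGAR_OP.findall(cigar):
        if op in "M=X":
            run += int(length)
            best = max(best, run)
        else:
            run = 0  # any other operation ends the current run
    return best


def apply_chimera_qc(flag: int, mapq: int, cigar: str, read_len: int,
                     threshold: float = 0.44) -> tuple[int, int]:
    """Return (flag, mapq) after applying the 44% rule described above."""
    if longest_aligned_run(cigar) < threshold * read_len:
        flag |= 0x200        # mark as QC-failed
        flag &= ~0x2         # clear the proper-pair bit
        mapq = min(mapq, 1)  # cap MAPQ at 1
    return flag, mapq


print(apply_chimera_qc(0x63, 60, "40M110S", 150))  # short aligned run: flagged
print(apply_chimera_qc(0x63, 60, "100M50S", 150))  # long aligned run: untouched
```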

FASTQ A text format for raw sequencing reads. Each record contains a sequence identifier, the nucleotide sequence, a separator, and per-base quality scores in ASCII-encoded Phred format. bwa-mem3 accepts gzip-compressed FASTQ as input. See Quick start: align paired-end FASTQs.

FM-index Ferragina-Manzini index — a full-text index over the Burrows-Wheeler Transform of a sequence. bwa-mem3 uses the compressed .bwt.2bit.64 FM-index for seed finding (SMEM lookup). See Indexing the reference.
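
Exact-match lookup in an FM-index works by backward search over the Burrows-Wheeler Transform. The toy below shows that mechanism only; it builds the suffix array by naive sorting and stores uncompressed occurrence tables, so it bears no resemblance to the .bwt.2bit.64 layout or to the bwa-mem3 builder. All names are chosen for illustration.

```python
def build_fm_index(text: str):
    """Build a tiny, uncompressed FM-index over text (sentinel '$' appended)."""
    text += "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive suffix array
    bwt = [text[i - 1] for i in sa]  # text[-1] is '$' when i == 0
    alphabet = sorted(set(text))
    # first[c]: number of characters in text that sort strictly before c
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][k]: number of occurrences of c in bwt[:k]
    occ = {c: [0] for c in alphabet}
    for ch in bwt:
        for c in alphabet:
            occ[c].append(occ[c][-1] + (ch == c))
    return first, occ, len(text)


def count_occurrences(index, pattern: str) -> int:
    """Count exact occurrences of pattern by backward search."""
    first, occ, n = index
    lo, hi = 0, n  # half-open suffix-array interval [lo, hi)
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


idx = build_fm_index("ACGTACGT")
print(count_occurrences(idx, "ACG"))
print(count_occurrences(idx, "GTA"))
```

Each pattern character costs two table lookups regardless of reference size, which is what makes the structure suitable for seeding.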

Hard clip A CIGAR operation (H) indicating that bases at the read end are absent from the SEQ field of the alignment record. Hard clipping is used in supplementary alignments to avoid duplicating the read sequence. See Output: SAM/BAM, headers, tags.

kswv The SIMD-vectorized Smith-Waterman kernel that bwa-mem2 and bwa-mem3 use for mate rescue. bwa-mem3 carries correctness fixes for the score2 plateau edge case across all SIMD width variants (NEON, AVX2, AVX-512BW). See Correctness fixes.

libsais A library implementing the suffix-array induced sorting (SAIS) algorithm. bwa-mem3 optionally uses libsais for FM-index construction, reducing indexing time compared to the default suffix-array builder. See Performance improvements.

LTO Link-Time Optimization — a compiler mode that defers optimization to link time, enabling cross-compilation-unit inlining. Activated via make lto-build. See Building from source.

MAPQ Mapping quality, the Phred-scaled probability that a read is mapped to the wrong position. Reported in SAM field 5. bwa-mem3 follows bwa-mem2 MAPQ semantics; in methylation mode, chimera QC (when enabled with --chimera-qc) caps MAPQ at 1 for chimeric alignments. See Output: SAM/BAM, headers, tags.
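
Phred scaling means MAPQ = -10 * log10(P), where P is the probability that the placement is wrong. A minimal sketch of the conversion in both directions follows; the function names and the cap of 60 are assumptions made for the example.

```python
import math


def mapq_to_error_prob(mapq: int) -> float:
    """Probability that the placement is wrong, given a Phred-scaled MAPQ."""
    return 10 ** (-mapq / 10)


def error_prob_to_mapq(p: float, cap: int = 60) -> int:
    """Phred-scale an error probability, clamped to an assumed maximum."""
    if p <= 0:
        return cap
    return min(cap, round(-10 * math.log10(p)))


for q in (1, 20, 60):
    print(q, mapq_to_error_prob(q))
```

A MAPQ of 1 therefore corresponds to roughly a 79% chance of misplacement, which is why capping chimeric alignments at 1 effectively removes them from downstream analyses that filter on mapping quality.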

Mate rescue A step in paired-end alignment where, if one mate lacks a confident seed, bwa-mem3 attempts to find it by performing Smith-Waterman alignment in the region near the mapped mate. bwa-mem3 adds NEON and AVX2 implementations of the mate-rescue kernel. See Architecture support.

mimalloc A high-performance memory allocator from Microsoft. bwa-mem3 vendors mimalloc and links it into every binary by default. To disable, build with USE_MIMALLOC=0. See Memory allocator (mimalloc).

PGO Profile-Guided Optimization, a two-pass build where the first pass instruments the binary, a representative workload is run to collect profiles, and the second pass uses those profiles to guide inlining and branch layout. Activated via make pgo-generate then make pgo-use. See PGO build.

Primary alignment The alignment record for a read that represents the aligner’s best placement. A read has exactly one primary alignment (or is reported as unmapped). All other alignments for the same read are marked supplementary (chimeric split read) or secondary (alternative mapping). See Output: SAM/BAM, headers, tags.

Proper-pair flag (0x2) SAM flag bit indicating that both mates of a pair are mapped in the expected orientation and insert-size range. In bwa-mem3, the mem_sam_pe function sets this flag; a correctness fix (PR #17) ensures it is propagated correctly under all conditions. See Correctness fixes.

SAM Sequence Alignment Map, a tab-delimited text format for read alignments. Each record contains mandatory fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL) plus optional tags. See Output: SAM/BAM, headers, tags.

SIMD dispatch Runtime selection of the fastest available SIMD instruction set (SSE4.1, SSE4.2, AVX, AVX2, AVX-512BW, NEON) for hot alignment kernels. On x86 this is implemented in process by src/simd_dispatch.cpp via __builtin_cpu_supports; on ARM64 a single NEON tier covers every supported CPU. See SIMD dispatch matrix.

Single-binary SIMD dispatch On x86, bwa-mem3 ships one binary that contains compiled kernels for every supported SIMD tier (sse41 / sse42 / avx / avx2 / avx512bw) and selects one in process at startup via __builtin_cpu_supports. There are no per-tier companion binaries. On ARM64 the binary contains a single NEON kernel TU. Replaces the prior multi-binary execv launcher (PR #83). See Single-binary SIMD dispatch (x86).

SMEM Super-Maximal Exact Match — a seed found by extending a read’s position in the FM-index as far as possible in both directions. SMEMs form the initial seeds for chaining and extension in the BWA-MEM algorithm. See Performance improvements.

Soft clip A CIGAR operation (S) indicating that bases at the read end were not part of the alignment, but are still present in the SEQ field. Soft clipping commonly appears at adapter-containing or low-quality read ends. See Output: SAM/BAM, headers, tags.
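
The practical difference between the two clip types is whether the clipped bases still count toward SEQ. A small sketch, with hypothetical helper names, makes that concrete:

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def seq_length(cigar: str) -> int:
    """Bases present in the SEQ field: M, I, S, = and X consume the query."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MIS=X")


def read_length(cigar: str) -> int:
    """Length of the original read: SEQ bases plus hard-clipped bases."""
    hard = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op == "H")
    return seq_length(cigar) + hard


print(seq_length("20S80M"), read_length("20S80M"))  # soft clip: bases kept in SEQ
print(seq_length("20H80M"), read_length("20H80M"))  # hard clip: bases dropped
```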

Supplementary alignment A SAM record (FLAG bit 0x800 set) representing a chimeric read split across two or more genomic loci. The segment with the longest aligned span is typically designated primary; remaining segments are supplementary. Hard clipping is used to avoid duplicating the SEQ field. See Output: SAM/BAM, headers, tags.
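
The record types defined in this glossary can be told apart from the FLAG field alone. The sketch below uses the bit values from the SAM specification; the function name classify and its return labels are chosen for illustration.

```python
UNMAPPED = 0x4
SECONDARY = 0x100
QC_FAIL = 0x200
SUPPLEMENTARY = 0x800
PROPER_PAIR = 0x2


def classify(flag: int) -> str:
    """Name the record type encoded in a SAM FLAG value."""
    if flag & UNMAPPED:
        return "unmapped"
    if flag & SUPPLEMENTARY:
        return "supplementary"
    if flag & SECONDARY:
        return "secondary"
    return "primary"


for flag in (99, 2064, 256, 4):
    print(flag, classify(flag), "proper-pair" if flag & PROPER_PAIR else "")
```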

Citation

How to cite

bwa-mem3 is a derivative of bwa-mem2. If you use bwa-mem3 in published work, please cite the original bwa-mem2 paper:

Vasimuddin Md, Sanchit Misra, Heng Li, Srinivas Aluru. Efficient Architecture-Aware Acceleration of BWA-MEM for Multicore Systems. IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2019. doi:10.1109/IPDPS.2019.00041

BibTeX:

@inproceedings{bwamem2-ipdps2019,
  author    = {Vasimuddin Md and Sanchit Misra and Heng Li and Srinivas Aluru},
  title     = {Efficient Architecture-Aware Acceleration of {BWA-MEM} for Multicore Systems},
  booktitle = {IEEE International Parallel and Distributed Processing Symposium (IPDPS)},
  year      = {2019},
  doi       = {10.1109/IPDPS.2019.00041},
  url       = {https://doi.org/10.1109/IPDPS.2019.00041}
}

Lineage

bwa-mem3 is maintained by Fulcrum Genomics as a derivative of bwa-mem2, itself derived from bwa (Li & Durbin, 2009). The BWA-MEM algorithm was originally described in:

Heng Li. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997, 2013.

The bwa-mem3-specific changes and improvements carried on top of bwa-mem2 are documented in What’s Different from bwa-mem2.

License

bwa-mem3 is licensed under the MIT License (same as upstream bwa-mem2).

                           The MIT License

   BWA-MEM3  (Sequence alignment using Burrows-Wheeler Transform),
   based on BWA-MEM2.
   Copyright (C) 2026 Fulcrum Genomics LLC.
   Copyright (C) 2019 Intel Corporation, Heng Li.

   Permission is hereby granted, free of charge, to any person obtaining
   a copy of this software and associated documentation files (the
   "Software"), to deal in the Software without restriction, including
   without limitation the rights to use, copy, modify, merge, publish,
   distribute, sublicense, and/or sell copies of the Software, and to
   permit persons to whom the Software is furnished to do so, subject to
   the following conditions:

   The above copyright notice and this permission notice shall be
   included in all copies or substantial portions of the Software.

   THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
   EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
   MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
   NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
   BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
   ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
   CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
   SOFTWARE.

Contacts: Vasimuddin Md <vasimuddin.md@intel.com>; Sanchit Misra <sanchit.misra@intel.com>;
                                Heng Li <hli@jimmy.harvard.edu>

Changelog

Release notes from 0.2.1 onward are generated automatically by release-please from the commit history (see Release process) and mirror the GitHub Releases page.

0.6.0 (2026-07-16)

Features

version: report whether mimalloc is the active allocator (#217) (2cd9fe9)

Performance

bam: cut per-record work in the --bam writer (mem_aln_to_bam) (#212) (837f40e)
bsw: drop dead per-row H1/H2 setup stores in the SW batch wrappers (#211) (039b09f)
dedup: drop provably-dead exact-duplicate passes in mem_sort_dedup_patch (#205) (c544258)
fmi: arm64 lockstep for third-pass reseeding (bwtSeedStrategy) (#215) (c5f69ec)
kswv: remove write-only Hmax scratch buffer from batched SW kernels (#214) (b14ef49)
mem: add unique-mapper fast paths to mem_chain_flt and mem_mark_primary_se (#209) (1700aab)
mem: drop redundant per-call scratch allocations in mem_gen_alt (#216) (3418525)
mem: pool per-read scratch in get_sa_entries_prefetch and mem_reg2aln (#208) (5747a13)
mem: remove dead/instrumentation work from the extension hot path (#206) (439be97)
seed: replace sortSMEMs qsort with a counting sort by rid (#207) (4356b5c)
seed: vectorize backwardExt occ-counting on arm64 (NEON) + hoist invariants (#210) (64a55a2)

Refactoring

mem: static cleanup — remove dead code, silence int64 format warnings (#213) (8c57b7e)

Documentation

retitle LICENSE for BWA-MEM3 and add Fulcrum copyright (#203) (089a956)

0.5.0 (2026-07-04)

Features

add opt-in --seed-order seed reordering (default off, byte-identical) (#186) (04749a1)
add opt-in --smem-dedup (dedup identical SMEMs before chaining) (#187) (1384972)
mem: add --adaptive-band (chain-geometry adaptive banding) for long reads (#194) (4fe92a6)
mem: add --extend-mate-concordant; fix --fast --meth placement regression (#195) (c9ffef1)
mem: add --fast speed preset (#189) (a946af8)
mem: add --max-extend-chains and bundle it into --fast (#193) (e39b3d4)
mem: add --skip-contained-ext and enable it under --fast (#192) (2d2b2b4)

Bug Fixes

bandedSWA: 8-bit SW drops query-end gscore/gtle on zero-score-row exit (#198) (611e21b)
bandedSWA: getScores{8,16} must not scribble padding past numPairs (#199) (9aae808)

Performance

bandedSWA: gate the getScores overshoot guard to sub-slice callers (#201) (162e909)

Refactoring

kswv: drop duplicate F warm-up prefetch in kswv512_16 (#191) (6e4cf2b)

Documentation

changelog: backfill the 0.4.0 breaking-change notice (#183) (e1c381a)
changelog: render the live changelog, not the frozen NEWS.md (#184) (12a46d8)
contributing: document breaking-change commit footers (#181) (4ebd122)
release: describe the release-please flow, not manual tagging (#182) (fc0c4b6)

0.4.0 (2026-06-27)

⚠ BREAKING CHANGES

index: bwa-mem3 index no longer writes the unpacked .0123 reference file by default (#177). bwa-mem3 mem now reconstructs reference bases from the packed .pac on demand (“pac-fetch”) and ignores any .0123 present, so the file is redundant for bwa-mem3 itself. External tools that read .0123 directly — most notably sharing a single index with bwa-mem2 — will break, since the expected file is now absent. To restore the old on-disk layout, re-run indexing with the new opt-in flag index --emit-unpacked-ref. Alignment output is byte-for-byte identical; the change is purely to the index artifact set.
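
A pipeline that shares one index between bwa-mem3 and bwa-mem2 can check for the unpacked file before launching bwa-mem2. A minimal sketch follows; the helper name has_unpacked_ref is hypothetical, and only the .0123 suffix is taken from the notice above.

```python
import tempfile
from pathlib import Path


def has_unpacked_ref(prefix: str) -> bool:
    """True when <prefix>.0123 exists alongside the other index files."""
    return Path(f"{prefix}.0123").is_file()


# Demonstration against a throwaway directory, not a real index.
with tempfile.TemporaryDirectory() as tmp:
    prefix = str(Path(tmp) / "ref.fa")
    before = has_unpacked_ref(prefix)  # nothing written yet
    Path(f"{prefix}.0123").touch()     # stand-in for index --emit-unpacked-ref
    after = has_unpacked_ref(prefix)

print(before, after)
```

If the check fails, re-running bwa-mem3 index with --emit-unpacked-ref restores the file.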

Features

mem: --min-ext-len opt-in filter to skip extension of short seeds (#169) (13db252)
meth: native bisulfite (BS-seq) alignment via --meth (D3) (#174) (a0296b1)

Performance

mem: pac-fetch the reference from .pac instead of loading/building .0123 (#177) (9c4bbf2)
meth: batched (SIMD) asymmetric mate rescue (closes #173) (#175) (f146a18)

Documentation

book: document the libdeflate build prerequisite (incl. AL2023) (#172) (c2b6ec7)
book: recommend -y 0 (drop 3rd-round seeding) as an opt-in speed knob (#171) (ca9ac1f)
collapse the mdbook sidebar into nested, foldable sections (#180) (c6ac47b)
deep mdbook cleanup — dedup, consolidate, and tighten (#179) (54d6d11)
meth: disclose collapsed-mode placement drift vs bwameth.py (#178) (96f29e2)
settings-profiles: note repeat-aggregating downstream caveat for -m 10 (#168) (38fd1ec)

0.3.0 (2026-06-21)

Features

bsw: make the 8-bit h0-prefix seed unsigned [0,255] (#151) (9f51c5f)
bsw: recover the 8-bit banded Smith–Waterman path for reads ≥128 bp (#140) (155a916)
kswv: AVX2 16-bit mate-rescue kernel (kswv256_16) (#162) (9107b82)
meth: carry original-reference @SQ M5/UR and @CO/@PG into --meth headers (#139) (e94ad8b)
prof: off-by-default --profile stage-timing instrumentation (#152) (83cf7ab)
reader: content-detecting FASTQ reader fast path (libdeflate BGZF) (#128) (cdd71bf)

Bug Fixes

bsw: bound getScores8/16 prefetch reads to the padding contract (#150) (87ed5d4)
fmi: widen mem_lim to int64 and guard SA-entry allocations (#156) (2d18c1e)
kthread: drive kt_for with a persistent worker pool (#154) (26b24e7)
meth: emit -R read group as @RG header in --meth mode (#137) (ccd1fc5)
seeding: widen SMEM read positions from int16_t to int32_t (#142) (037c418)
test: make meth layer-2 FAIL diagnostics reachable under set -e (#133) (d2d6688)

Performance

bsw: AVX2 SIMD tuning for the Smith-Waterman kernels (#161) (458b216)
bsw: NEON SIMD tuning for the Smith-Waterman kernels (#160) (d971ff0)
bsw: prefetch next batch’s ref/query in the AVX2 8-bit wrapper (#163) (e6082a0)
bsw: short-circuit the inert per-row re-baseline scan (#147) (8e284e0)
bsw: vectorize the per-row epilogue side-channel loop (#149) (403aeb7)
fmi: size SA-entry staging buffers to the exact write count (#157) (aa0fe33)
read: vendored zlib-ng inflate + chunk cap + 3rd pipeline worker (#153) (5cf89e3)
sw: reassociate affine-gap recurrences on NEON (kswv + bandedSWA) (#166) (a02fcb4)

Refactoring

bsw: derive extension gaps from H (standard Gotoh), not M (#141) (f715fbd)
bsw: drop the dead qlen[] parameter from the 8-bit kernels (#143) (42321df)
bsw: remove dead SW code paths (SORT_PAIRS, non-CORE macros, SSE2 polyfill) (#148) (ad8937e)

Documentation

add memory budgeting and data-type tuning guide (#145) (7127d80)
add situational --supp-rep-hard-cap 20 note for SV-aware pipelines (#134) (e20bc0e)
perf: refresh reference-architecture table to v0.3.0 (a02fcb4) (#167) (2cd7bfa)
perf: what drives the speedup, full perf-PR catalog, and fix the stale RTD build (#155) (0b01fb7)
recommend -s 0 for --meth Pass-2 re-seeding in settings profiles (#132) (45e02e0)
recommend a recent compiler on aarch64, with measured NEON numbers (#165) (b591684)
settings profiles (bwa drop-in vs recommended) (#131) (4d845c9)

0.2.2 (2026-06-08)

Bug Fixes

lto: pass explicit -flto=N on GCC to bypass jobserver under sandboxes (#122) (c6240a7)
smem: free lockstep SMEM caches at thread exit (closes #116) (#117) (9454f10)

Performance

sort: stabilize alnreg tie-breaks + drop in pdqsort at dedup-patch sort sites (#123) (85f8542)

Documentation

bench: inject generated divergence catalog + per-release concordance table (#126) (fea1c94)
correct concordance claims and document supplementary-alignment divergence (#125) (8b2dc69)
document bwa-mem3<->bwa-mem2 non-bit-identity + auditable PR list (#124) (bffae5a)

0.2.1 (2026-05-17)

Bug Fixes

changelog: strip preamble so release-please owns the file (#112) (56e580c)
mapq: propagate SMEM SA-count to seed n_hits so --supp-rep-hard-cap works (#101) (cca9d4f)
smem: track enc_qdb byte capacity separately from wsize_mem (#100) (ab922b6)

Documentation

readme: add bioconda badges and install instructions (#106) (830276c)

Earlier releases and upstream history

The 0.2.0 and 0.1.0-pre fg-labs notes, plus the upstream bwa-mem2 release history, are preserved below as a frozen archive.

Release 0.2.0 (2026-05-13)

Operational / packaging

Single-binary SIMD dispatch on x86 (#83). The previous multi-binary build (make multi producing five bwa-mem3.<tier> ISA variants plus a runsimd.cpp launcher that execv’d the matching tier) is replaced by a single binary that contains compiled kernels for every supported tier (sse41 / sse42 / avx / avx2 / avx512bw) and selects one in process at startup via __builtin_cpu_supports. Install size drops from ~120 MB to ~25 MB; per-call overhead is one indirect branch (~0.3 ns after BTB warm-up). No .<tier> companion files are produced or needed. See docs/src/developer-guide/launcher.md.
BWAMEM3_FORCE_TIER=<tier> and BWAMEM3_DEBUG_SIMD=1 env vars (#83). BWAMEM3_FORCE_TIER is downgrade-only and replaces the prior “exec the bwa-mem3.sse41 binary” A/B-testing pattern; up-tier or unrecognized requests are rejected with a stderr warning.
BASELINE_ARCH=avx2 is the new default for non-kernel translation units on x86 (#84, supersedes the SSE4.1 floor that PR #83 originally shipped with). Override via make BASELINE_ARCH=<tier>. AVX-512BW hosts using BASELINE_ARCH=avx512bw see a small additional speedup on Zen 4 with -mprefer-vector-width=256 (#86) and roughly flat results on Sapphire Rapids — see docs/src/whats-different/avx512-baseline.md for the characterization.
Host-floor precheck (#95). bwa-mem3 mem, bwa-mem3 index, and bwa-mem3 shm refuse to run with exit code 2 and an [E::bwamem3] stderr message when the host CPU does not meet the build’s compile-time SIMD floor, instead of SIGILL-ing deep in alignment. bwa-mem3 version, --help, and -h are exempt and always succeed.
bwa-mem3 version now prints SIMD floor: (build’s required minimum) and SIMD runtime: (resolved tier) lines on stdout, plus a [W::bwa-mem3] warning on stderr (exit 0) if the host is below the floor. See docs/src/getting-started/host-requirements.md.
bwa-mem3 shm performs a statvfs("/dev/shm") capacity preflight (#86). When /dev/shm is too small for the index, the stage aborts with an [E::bwa_shm_stage] message naming /dev/shm, the required size, and a mount -o remount,size=... hint — replacing the prior [fread] Bad address failure mode. statvfs failures (no /dev/shm, restricted sandbox) are non-fatal and the stage proceeds.
bwa-mem3 shm /bwactl registry RMW is now serialized via a POSIX named semaphore (#82, closes #66). Concurrent shm stage / shm drop invocations across processes no longer race when updating the registry; the prior best-effort flock was per-open and did not cover the read-modify-write window.
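The /dev/shm capacity preflight described above can be sketched in a few lines of Python using os.statvfs. This is an illustration under stated assumptions, not the bwa-mem3 code: the function names and the message text are hypothetical, and the sketch compares the required size against the space currently available to unprivileged users, which may differ from the exact quantity the real check uses.

```python
import os
import sys
from typing import Optional


def shm_has_room(required_bytes: int, path: str = "/dev/shm") -> Optional[bool]:
    """Return True/False when capacity can be checked, None when statvfs fails."""
    try:
        st = os.statvfs(path)
    except OSError:
        # No such mount or a restricted sandbox: capacity is unknown.
        return None
    available = st.f_bavail * st.f_frsize  # bytes available to unprivileged users
    return available >= required_bytes


def preflight(required_bytes: int, path: str = "/dev/shm") -> bool:
    """Decide whether staging should proceed; print a hint when it should not."""
    ok = shm_has_room(required_bytes, path)
    if ok is False:
        # Hypothetical message text; the real tool's wording may differ.
        print(
            f"{path} is too small: need {required_bytes} bytes; "
            f"consider: mount -o remount,size=... {path}",
            file=sys.stderr,
        )
        return False
    # Enough room, or capacity unknown (statvfs failure is treated as non-fatal).
    return True
```

A caller would pass the total size of the index files as required_bytes and abort the stage only when preflight returns False, which mirrors the non-fatal handling of statvfs failures noted in the entry.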

Methylation

mem --meth emits Bismark-compatible auxiliary tags XR:Z (read conversion CT/GA), XG:Z (genome strand CT/GA), and XM:Z (per-base methylation call string) (#90). These replace the prior bwameth-style YS:Z / YC:Z / YD:Z on output (still used internally for SEQ restoration). The reference-annotation XR:Z from -V is suppressed under --meth to avoid colliding with the Bismark semantics. Downstream tools that previously read YS:Z / YC:Z / YD:Z must be pointed at the corresponding XR:Z / XG:Z and the per-base XM:Z. See docs/src/methylation/tags.md.

Correctness

Fixed SIGSEGV in mem_matesw on shm-backed ref_string (#85). ksw_align2 mutates its reference slice in place; when the slice pointed into a read-only shm segment, this faulted. Now copies the slice before passing it in.
FMI_search sampled-SA prefetch: parenthesized SA_COMPX_MASK precedence so the masked offset is computed against the correct operand (#73). The unparenthesized form was silently producing wrong-but-harmless prefetch addresses; no alignment output was affected.
bntseq .alt parser bounds the line buffer to prevent a stack buffer overflow on malicious or malformed .alt files (#74).
display_stats clamps the per-thread bucket count to LIM_C so --profile with -t greater than the compiled-in limit no longer writes past the end of the stats array (#81).
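The mem_matesw fix in this list (#85) follows a general pattern: when a callee mutates its input buffer and the buffer may live in read-only memory, hand it a private copy. The Python sketch below is an analogy only. A read-only memoryview stands in for the read-only shm segment, align_in_place is a hypothetical stand-in for a kernel that temporarily rewrites its reference slice, and Python raises TypeError where the C++ code would fault.

```python
def reverse_in_place(buf) -> None:
    """Reverse a writable byte buffer in place."""
    n = len(buf)
    for i in range(n // 2):
        buf[i], buf[n - 1 - i] = buf[n - 1 - i], buf[i]


def align_in_place(ref) -> int:
    """Hypothetical kernel that temporarily mutates its reference slice."""
    reverse_in_place(ref)
    first = ref[0]          # placeholder for real work on the reversed slice
    reverse_in_place(ref)   # restore the original order before returning
    return first


# bytes wrapped in a memoryview is read-only, like a read-only shared mapping.
SHARED_REF = memoryview(bytes([0, 1, 2, 3, 3, 2, 1, 0]))


def rescue_without_copy(beg: int, end: int) -> int:
    # Slicing a read-only view gives another read-only view, so the write fails.
    return align_in_place(SHARED_REF[beg:end])


def rescue_with_copy(beg: int, end: int) -> int:
    scratch = bytearray(SHARED_REF[beg:end])  # private, writable copy of the slice
    return align_in_place(scratch)


print(rescue_with_copy(1, 5))  # 3
```

The copy costs one small allocation per call, bounded by the slice length, and leaves the shared reference untouched for other threads and processes.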

Performance

x86 wall-time improvements on the bench (vs the 0.1.0-pre baseline): AVX2 (c6a) −17 to −22%, AVX-512 AMD Zen4 (c7a) −16 to −24%, AVX-512 Intel SPR (c7i) −28 to −30% across wgs / wes / panel-twist 5M-read samples. Concordance vs upstream bwa-mem2 v2.2.1 remains 100.0000% on all non-methylation cells. arm64 (c7g / c8g) is flat (within ±2%). The wins are attributable primarily to (a) capping AVX-512BW auto-vectorization at 256-bit on the avx512bw target (#86) and (b) inlining FMI_search::backwardExt to recover a gcc 12+ wall-clock regression (#88). See docs/src/performance/overview.md for the reference numbers across architectures.
Smaller contributions in the release window: per-strip L1 prefetches across all kswv u8/u16 kernels (#70); SMEM_LOCKSTEP_N bumped from 8 to 16 (#75); closed-form ungapped HIT path when total_mis == 0 (#77); ksort switched to an on-stack buffer for small n to drop a per-call malloc (#78); libsais_build skips a wasted zero-init pass on its unpack and SA buffers, trimming index-build time (#80).

Release 0.1.0-pre (2026-04-28)

Project renamed from bwa-mem2 to bwa-mem3. The new project tracks Fulcrum Genomics’ performance and feature work on top of the upstream bwa-mem2 codebase.
Default branch renamed from fg-main to main.
Binary renamed from bwa-mem2 to bwa-mem3. Arch-suffixed variants (bwa-mem3.sse41, .sse42, .avx, .avx2, .avx512bw, .arm64, .pgo, .profile, .lto) renamed to match.
@PG SAM header tags now read ID:bwa-mem3 PN:bwa-mem3 (and bwa-mem3-meth for --meth mode).
Test binaries renamed: bwa_mem2_tests_unit → bwa_mem3_tests_unit, bwa_mem2_tests_integration → bwa_mem3_tests_integration.
.bwt.2bit.64 index file format unchanged — bwa-mem3 reads indexes built by bwa-mem2 index without re-indexing.

Release 2.2.1 (17 March 2021)

Hotfix for v2.2: Fixed the bug mentioned in #135.

Release 2.2 (8 March 2021)

Changes since the last release (2.1):

Passed the validation test on ~88 billion reads (Credits: Keiran Raine, CASM division, Sanger Institute)
Fixed bugs reported in #109 that caused mismatches between bwa-mem and bwa-mem2
Fixed the issue (#112) that caused a crash due to a corrupted thread id
Now uses all the SSE flags to build optimized SSE41 and SSE42 binaries

Release 2.1 (16 October 2020)

Release 2.1 of BWA-MEM2.

Changes since the last release (2.0):

Smaller index: the index size is down 8x on disk and 4x in memory, due to moving to a single type of FM-index (2bit.64 instead of both 2bit.64 and 8bit.32) and 8x compression of the suffix array. For example, for the human genome, the index size on disk is down to ~10GB from ~80GB and the memory footprint is down to ~10GB from ~40GB. The reduction substantially decreases index IO time, with hardly any performance impact on read mapping.
Added support for 2 more execution modes: sse4.2 and avx.
Fixed multiple bugs including those reported in Issues #71, #80 and #85.
Merged multiple pull requests.

Release 2.0 (9 July 2020)

This is the first production release of BWA-MEM2.

Changes since the last release:

Made the source code more secure with more than 300 changes all across it.
Added support for memory re-allocations in case the pre-allocated fixed memory is insufficient.
Added support for the MC tag in the SAM file and for the -5 and -q flags on the command line.
The output is now identical to the output of bwa-mem-0.7.17.
Merged index building code with FMI_Search class.
Added support for different ways to input read files; this now matches bwa-mem.
Fixed a bug in the AVX512 SAM-processing code that led to incorrect output.

Release 2.0pre2 (4 February 2020)

Miscellaneous changes:

Changed the license from GPL to MIT.
IMPORTANT: the index structure has changed since commit 6743183. Please rebuild the index if you are using a later commit or the new release.
Added charts in README.md comparing the performance of bwa-mem2 with bwa-mem.

Major code changes:

Fixed handling of variable-length reads.
Fixed a bug involving reads of length greater than 250bp.
Added support for allocation of more memory in small chunks if large pre-allocated fixed memory is insufficient. This is needed very rarely (thus, having no impact on performance) but prevents asserts from failing (code from crashing) in that scenario.
Fixed a memory leak due to not releasing the memory allocated for seeds after smem.
Fixed a segfault due to non-alignment of small allocated memory in the optimized banded Smith-Waterman.
Enabled working with genomes larger than 7-8 billion nucleotides (e.g. Wheat genome).
Fixed a segfault occurring (with the gcc compiler) while reading the index.
