Home

bwa-mem3

A faster, more correct, drop-in replacement for bwa mem and bwa-mem2.

If you align short reads with bwa or bwa-mem2 today, bwa-mem3 will give you the same answers — only quicker, with fewer rough edges, and with first-class support for things you used to need a wrapper script for.

Why bwa-mem3

Drop in, go faster. Same algorithm, same outputs, same flags as bwa-mem2 — but consolidated mapping speedups, a memory-bounded index builder, batched header ingestion, and a tuned allocator add up to measurable wall-clock wins on real workloads.
Methylation in one binary. A --meth flag turns bwa-mem3 into a drop-in replacement for the entire bwameth.py pipeline. No Python, no inline conversion script, no separate post-processing step. One bwa-mem3 index --meth ref.fa, one bwa-mem3 mem --meth ref.fa R1.fq R2.fq, done — header collapsed, tags emitted, chimeras flagged.
Stage the index once, align many. A bwa-mem3 shm subcommand pins the FM-index in shared memory so back-to-back runs on the same host skip the 28 GB read every time.
Correctness fixes upstream haven’t merged yet. Tabs in -R, 151+ bp reads, AVX-512 mate-rescue, kswv score2 plateau across NEON/AVX2/AVX-512BW, mem_sam_pe proper-pair flag — every fix tracked back to the upstream PR or issue that found it.
Architecture-aware out of the box. SSE4.1, SSE4.2, AVX, AVX2, AVX-512BW, and ARM64/NEON. One binary per platform; the dispatcher picks the right tier for your CPU in process at startup.

Get started in 30 seconds

git clone --recursive https://github.com/fg-labs/bwa-mem3
cd bwa-mem3 && make
./bwa-mem3 index ref.fa
./bwa-mem3 mem -t 16 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -@ 8 -o out.bam

Tip — Emit BAM directly

For production pipelines, add --bam=0 to skip the SAM text round-trip entirely. See Best Practices: Output format.

Where to start

Installation — Build from source (Bioconda is on the way).
Quick start: align paired-end FASTQs — Two commands to your first alignment.
Quick start: methylation — The one-binary bwameth.py replacement, in two commands.
Best Practices — The five things that actually move the needle for production runs.
What’s different from bwa-mem2 — Every fix and feature, with upstream cross-references.

What’s in this book

Getting Started — Install and run your first alignment.
User Guide — Indexing, alignment, output, threading, allocator notes.
Performance — Where the speed comes from and how to get more.
Best Practices — Build, run, and deploy recommendations.
CLI Reference — Every flag, auto-captured from --help.
Methylation Reference — --meth mode in full.
What’s Different from bwa-mem2 — The full changelog, by category.
Developer Guide — Build matrix, SIMD dispatch, regression tests, contributing.
Related Projects — bwa-mem3-bench, bwa-mem3-rs, fgumi, bwa-mem2 upstream.
Reference — Glossary, citation, license, changelog.

bwa-mem3 is a derivative of bwa-mem2 maintained by Fulcrum Genomics. MIT licensed. See License and Citation.

Installation

Bioconda (coming soon)

A Bioconda package for bwa-mem3 is in preparation. Once published, installation will be:

conda install -c bioconda bwa-mem3

This will be the recommended path for most users. Check back here or watch the fg-labs/bwa-mem3 repository for the announcement.

Build from source

Until the Bioconda package is available, build from source using the steps below.

Prerequisites

bwa-mem3 vendors several libraries as git submodules. Building from source requires the toolchain to compile bwa-mem3 itself plus the bootstrap tools each vendored library needs.

Tool	Why it’s needed	Minimum version
C++14 compiler (GCC or Clang)	bwa-mem3 itself	GCC 8+ / Clang 7+
GNU make	top-level build	3.81+
Git	submodule checkout (with `--recursive`)	any recent
autoconf, automake, autoconf-archive, libtool	`ext/htslib` runs `autoreconf -i && ./configure` during build	any recent
pkg-config	htslib’s `configure` uses it to locate zlib	any recent
zlib development headers	htslib links against zlib	any recent
OpenMP runtime	`ext/libsais` uses OpenMP for parallel suffix-array construction	see notes below
CMake 3.12+	building bundled mimalloc (default; skip if you pass `USE_MIMALLOC=0`)	3.12+

OpenMP notes.

On Linux with GCC, libgomp ships with the compiler — no extra package needed.

On Linux with Clang, install libomp-dev (Debian/Ubuntu) or libomp-devel (RHEL/Fedora).

On macOS, install Homebrew’s libomp (brew install libomp). The Makefile auto-detects the Homebrew prefix; set LIBOMP_PREFIX=/path/to/libomp if you installed it elsewhere.

Install prerequisites by platform

Debian / Ubuntu:

sudo apt-get install \
    build-essential git cmake pkg-config \
    autoconf automake autoconf-archive libtool \
    zlib1g-dev \
    libomp-dev          # only needed if building with Clang

RHEL / Fedora / Amazon Linux:

sudo dnf install \
    gcc gcc-c++ make git cmake pkgconf-pkg-config \
    autoconf automake autoconf-archive libtool \
    zlib-devel \
    libomp-devel        # only needed if building with Clang

macOS (Homebrew):

xcode-select --install   # Apple Clang + git + make
brew install \
    cmake pkg-config \
    autoconf automake autoconf-archive libtool \
    libomp

What happens if a prereq is missing. The Makefile fails fast with an actionable error: a missing libomp on macOS, a missing autoreconf, or a missing cmake each produce a one-line hint pointing at the install command above. There is no need to install everything optimistically — install only what the error message asks for if you prefer.

Clone and build

git clone --recursive https://github.com/fg-labs/bwa-mem3
cd bwa-mem3
make

The --recursive flag is required. bwa-mem3 vendors several libraries (mimalloc, sse2neon, and others) as git submodules. A shallow or non-recursive clone will fail to compile.

Warning — Shallow clone submodule pitfall

If you cloned without --recursive, initialize the submodules before running make:
git submodule update --init --recursive
Forgetting this step is the most common source of build failures.
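A pre-build check along the following lines can catch an uninitialized submodule before make runs. This is a minimal sketch, not part of the bwa-mem3 build system: the function name uninitialized_submodules is hypothetical, and it relies only on the git convention that `git submodule status` prints a leading "-" for a submodule that has not been initialized.

```shell
# Print the path of every submodule that has not been initialized.
# Reads `git submodule status --recursive` output on stdin; an uninitialized
# submodule is listed with a leading "-" in front of its commit hash.
uninitialized_submodules() {
    awk '/^-/ { print $2 }'
}

# Usage (run from the repository root):
#   missing=$(git submodule status --recursive | uninitialized_submodules)
#   [ -z "$missing" ] || git submodule update --init --recursive
```

If the function prints nothing, every submodule is checked out and make can proceed.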

Target architecture

By default, make builds a general-purpose binary that runs on any supported CPU. For maximum performance, specify the architecture that matches your deployment target:

Flag	Requires	Notes
`make`	AVX2 or better by default (x86; see Host requirements), any (ARM)	Default; selects best dispatch at runtime on x86
`make arch=avx2`	AVX2 (e.g. Haswell, Zen 2)	Recommended for modern x86 servers
`make arch=avx512bw`	AVX-512BW (e.g. Skylake-X, Ice Lake, Sapphire Rapids)	Maximum x86 performance
`make arch=arm64`	Apple Silicon / AWS Graviton	NEON-vectorized build

See Performance — SIMD dispatch matrix for the full matrix of which kernels are vectorized under each target.

Memory allocator (mimalloc)

bwa-mem3 bundles mimalloc and links it into every binary by default. mimalloc reduces allocator contention under high thread counts and lowers wall-clock time on multi-threaded alignment runs.

To build without mimalloc, pass USE_MIMALLOC=0:

make USE_MIMALLOC=0

See User Guide — Memory allocator for details on how mimalloc is linked on Linux versus macOS and when opting out is appropriate.

Smoke test

After building, run the smoke test to confirm the binary works and report which allocator is active:

./bwa-mem3 version

Expected output (with mimalloc; the SIMD floor and runtime lines described under Host requirements are omitted here):

v0.2.0-12-gabcdef1
mimalloc 3.x.x

If the mimalloc line is absent, the build linked the system allocator (expected when USE_MIMALLOC=0 was passed or when the vendor submodule was not initialized).
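For scripted builds, the allocator check can be automated. The sketch below assumes only what the expected output above shows, namely that a mimalloc build prints a line beginning with "mimalloc"; the function name active_allocator is hypothetical.

```shell
# Report which allocator a build uses, given `bwa-mem3 version` output on stdin.
# Prints "mimalloc" if a line starting with "mimalloc" is present, else "system".
active_allocator() {
    if grep -q '^mimalloc'; then
        echo mimalloc
    else
        echo system
    fi
}

# Usage:
#   ./bwa-mem3 version | active_allocator
```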

Next: host requirements

If you’re planning to deploy bwa-mem3 across a heterogeneous fleet (AWS Batch, mixed compute clusters), read Host requirements for the supported CPU floor and Best Practices → Multi-architecture deployment for the deployment recipe.

Host requirements

bwa-mem3 runs on the hosts in the table below. Verify your host with bwa-mem3 version — the SIMD floor and runtime lines tell you what the binary needs and what your host provides.

Platform	Default build floor	Earliest supported CPU	Notes
Linux x86_64	AVX2 (`BASELINE_ARCH=avx2`)	Intel Haswell (2013); AMD Zen / Naples (2017)	Auto-selects best of `sse41 / sse42 / avx / avx2 / avx512bw` at runtime
Linux x86_64 (legacy)	SSE4.1 (`BASELINE_ARCH=sse41`)	Intel Nehalem (2008); AMD Bulldozer (2011)	Opt-in rebuild; ~10-15% slower on AVX2 hosts
Linux arm64	NEON (aarch64 ABI baseline)	Any aarch64 host	Single tier; NEON is mandatory in the aarch64 ABI
macOS arm64	NEON	Apple M1 (2020)	Apple Silicon only; macOS x86_64 is unsupported

How to verify

$ bwa-mem3 version
v0.2.0-12-gabcdef1
SIMD floor: avx2 (x86-64-v3, Haswell 2013+); kernels: sse41 sse42 avx avx2 avx512bw
SIMD runtime: avx512bw (BWAMEM3_FORCE_TIER unset)
mimalloc 3.x.x

The SIMD floor: line tells you what host features the binary requires.
The SIMD runtime: line tells you what kernel tier was selected at startup.
On a host below the floor, bwa-mem3 version writes a [W::bwa-mem3] warning line to stderr (not stdout) and still exits 0, so the diagnostic command stays usable even on hosts that cannot run alignment. The floor + runtime lines remain on stdout, so bwa-mem3 version | grep '^SIMD' works in CI scripts even on too-old hosts.
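In a CI script, the two SIMD lines can be reduced to the tier names alone. This is a sketch that assumes the line layout shown in the sample output above (the tier name is the third whitespace-separated token); the function name simd_tier is hypothetical.

```shell
# Extract the tier name from the "SIMD floor:" or "SIMD runtime:" line of
# `bwa-mem3 version` output read on stdin. $1 is "floor" or "runtime".
simd_tier() {
    awk -v key="SIMD $1:" 'index($0, key) == 1 { print $3; exit }'
}

# Usage (stderr is discarded so the warning line does not interfere):
#   out=$(bwa-mem3 version 2>/dev/null)
#   floor=$(printf '%s\n' "$out" | simd_tier floor)
#   runtime=$(printf '%s\n' "$out" | simd_tier runtime)
```

Comparing the two values is left to the caller, since the ordering of tiers is defined by the binary rather than by this sketch.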

Failure mode on too-old hosts

If you run bwa-mem3 mem (or another alignment subcommand) on a host below the floor, the binary refuses with exit code 2 and a stderr message identifying the gap:

[E::bwamem3] this binary was compiled for SIMD floor avx2 and emits avx2
instructions in non-kernel translation units. The host CPU does not support
avx2 (detected: sse42). Running would SIGILL on the first avx2 instruction.

To run on this host, rebuild bwa-mem3 with BASELINE_ARCH=sse42 (or lower),
or use a binary built for a lower SIMD floor.

The version subcommand stays exit-0 so introspection still works on the same host.

Mixed-architecture fleets

For AWS Batch and other heterogeneous compute environments where the same job may schedule onto x86_64 or arm64 hosts, see Best Practices → Multi-architecture deployment.

Quick start: align paired-end FASTQs

This page walks through the two-command workflow: index the reference once, then align reads.

Index the reference

bwa-mem3 index ref.fa

This produces five index files alongside ref.fa:

File	Description
`ref.fa.bwt.2bit.64`	FM-index in 2-bit packed format
`ref.fa.0123`	2-bit packed reference sequence
`ref.fa.amb`	Ambiguous base positions
`ref.fa.ann`	Sequence name and length annotations
`ref.fa.pac`	Packed 4-bit reference sequence

Indexing hg38 takes roughly 60–90 minutes on a single core and requires approximately 60 GB of peak disk space during creation (including temporary and intermediate files); the final index stored on disk is roughly 28 GB. The index is read once per mem invocation; for workloads that align many samples, load it into shared memory first (see Quick start: shared-memory index).

Align paired-end reads

bwa-mem3 mem -t 16 ref.fa r1.fq.gz r2.fq.gz > out.sam

-t 16 sets the thread count to 16. bwa-mem3 scales well up to the number of physical CPU cores; hyperthreading provides diminishing returns above that point. See User Guide — Threading and resource use for recommendations at different core counts.

The default output is uncompressed SAM on stdout. To write compressed BAM directly, use the --bam flag:

bwa-mem3 mem --bam -t 16 ref.fa r1.fq.gz r2.fq.gz \
  | samtools sort -@ 8 -o out.bam -
samtools index out.bam

Tip — Prefer BAM output in production

Piping BAM (--bam) to samtools sort avoids the text formatting and parsing overhead of SAM on both sides of the pipe. For large cohorts this yields a measurable wall-clock reduction. See Best Practices — Output format for the recommended pipeline and a discussion of when SAM is still useful.

Read group tagging

For downstream tools that require a @RG header (most variant callers), pass -R:

bwa-mem3 mem -t 16 \
  -R '@RG\tID:sample1\tSM:sample1\tPL:ILLUMINA\tLB:lib1' \
  ref.fa r1.fq.gz r2.fq.gz > out.sam

The value is a tab-delimited string following BWA conventions. Every aligned record receives an RG:Z: tag matching the ID field of the read-group header.
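When read-group values come from a sample sheet, building the string in one place avoids quoting mistakes. The helper below is a hypothetical sketch (make_read_group is not a bwa-mem3 feature); it emits the same two-character \t separators used in the single-quoted example above.

```shell
# Build an @RG header string with literal two-character "\t" separators.
# Arguments: ID, SM, LB, PL. The values passed in are up to the caller.
make_read_group() {
    printf '@RG\\tID:%s\\tSM:%s\\tLB:%s\\tPL:%s\n' "$1" "$2" "$3" "$4"
}

# Usage:
#   bwa-mem3 mem -t 16 \
#     -R "$(make_read_group sample1 sample1 lib1 ILLUMINA)" \
#     ref.fa r1.fq.gz r2.fq.gz > out.sam
```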

Output tags

bwa-mem3 emits standard SAM tags plus the HN:i: tag introduced by the fork:

Tag	Type	Description
`NM:i`	int	Edit distance to the reference
`MD:Z`	string	Mismatch and deletion string
`AS:i`	int	Alignment score
`XS:i`	int	Suboptimal alignment score
`SA:Z`	string	Supplementary alignment chain
`MC:Z`	string	Mate CIGAR (paired-end)
`MQ:i`	int	Mate mapping quality (paired-end)
`HN:i`	int	Total number of primary alignments (reported and suppressed) found for the read, before the `-h` supplementary cap is applied

For the methylation-specific tags (XR:Z, XG:Z, XM:Z), see Methylation Reference — SAM tags.

Quick start: methylation alignment

bwa-mem3 supports bisulfite-converted (WGBS/RRBS/EM-seq) read alignment through a single --meth flag on both index and mem. No Python interpreter, no piped preprocessor, and no separate postprocessing step are required.

Note — Drop-in replacement for bwameth.py

bwa-mem3 with --meth is a single-binary drop-in replacement for the bwameth.py pipeline. The output BAM is byte-compatible for the standard tags used by methylation callers (Bismark, MethylDackel, PileOMeth, etc.).

Index the reference for methylation

Build the c2t doubled reference once:

bwa-mem3 index --meth ref.fa

This writes the doubled reference and its FM-index files next to the standard index:

File	Description
`ref.fa.bwameth.c2t`	C→T converted reference (forward strand) with G→A reverse complement interleaved
`ref.fa.bwameth.c2t.*`	FM-index files for the c2t reference

The c2t index is separate from the standard index produced by bwa-mem3 index ref.fa. You need both if you intend to run standard and methylation alignments against the same reference.

Align bisulfite-converted reads

bwa-mem3 mem --meth -t 16 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -o out.bam
samtools index out.bam

Pass the original (unconverted) reference path, not the .bwameth.c2t file. bwa-mem3 auto-appends .bwameth.c2t to the reference path when --meth is active.

What `--meth` does

--meth activates a pipeline of in-process transformations that would otherwise require external tools:

Inline c2t read conversion. R1 reads have every C converted to T before alignment; R2 reads have every G converted to A. The original unconverted sequence is restored into the BAM SEQ field on emit. The conversion direction is reported per record in the Bismark XR:Z tag (value CT for R1/SE, GA for R2).
bwameth.py-equivalent scoring defaults. --meth sets -B 2 -L 10 -U 100 -T 40 -CM automatically. These match the defaults used by bwameth.py and are optimized for bisulfite-converted reads where C→T mismatches carry no penalty. Any of these values can be overridden on the command line.
Inline BAM post-processing. After alignment, bwa-mem3 rewrites the SAM stream in-process:
- @SQ headers with f/r prefixes (e.g. fchr1, rchr1) are collapsed back to one entry per real chromosome (chr1). Read-level RNAME fields are rewritten to match.
- Each mapped record gains Bismark XG:Z (genome strand: CT for top-strand alignment, GA for bottom-strand) and XM:Z (per-base methylation call string).
- Chimera QC: reads whose longest M/=/X run is less than 44% of the read length are flagged 0x200 (QC-fail), have flag 0x2 (proper pair) cleared, and have MAPQ capped at 1.
- Pair-level QC-fail propagation: if one mate is QC-failed, the other mate is also flagged.
- A @PG ID:bwa-mem3-meth program record is appended to the header.
Uncompressed BAM output. The post-processed stream is written as uncompressed BAM (wb0) rather than SAM text. This eliminates text serialization overhead and allows downstream samtools sort to read BAM natively. The stream is still fully readable by any htslib-based tool.
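To make the read conversion in the first step concrete, the following sketch applies the same substitutions to a sequence string. It illustrates the rule only and is not the in-process implementation; the function name convert_read is hypothetical and uppercase input is assumed.

```shell
# Convert a read as described above: C->T for R1 (or single-end), G->A for R2.
# $1 = "R1" or "R2", $2 = sequence (uppercase assumed).
convert_read() {
    case $1 in
        R1) printf '%s\n' "$2" | tr 'C' 'T' ;;
        R2) printf '%s\n' "$2" | tr 'G' 'A' ;;
        *)  return 1 ;;
    esac
}

# Usage:
#   convert_read R1 ACGTC
#   convert_read R2 ACGTG
```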

For full details on each tag, the optional chimera QC heuristic, and the --set-as-failed / --chimera-qc flags, see the Methylation Reference.
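The 44% chimera rule can be worked through by hand or with a small helper. The sketch below is an illustration under one stated assumption: each individual M, = or X CIGAR operation counts as one "run". The function name chimera_check is hypothetical, and the read length is passed in explicitly rather than derived from SEQ.

```shell
# Decide whether a read would be flagged by the 44% rule described above.
# $1 = CIGAR string, $2 = read length in bases.
# Prints "chimera" if the longest single M, = or X operation is shorter than
# 44% of the read length, otherwise "ok".
chimera_check() {
    printf '%s\n' "$1" | awk -v len="$2" '{
        longest = 0
        cigar = $0
        # Walk the CIGAR one <length><op> pair at a time.
        while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
            n = substr(cigar, 1, RLENGTH - 1) + 0
            op = substr(cigar, RLENGTH, 1)
            if ((op == "M" || op == "=" || op == "X") && n > longest) longest = n
            cigar = substr(cigar, RLENGTH + 1)
        }
        # Integer form of longest / len < 0.44.
        verdict = (longest * 100 < 44 * len) ? "chimera" : "ok"
        print verdict
    }'
}

# Usage:
#   chimera_check 60M90S 150
#   chimera_check 100M50S 150
```

For a 150 bp read the threshold is 66 bases, so 60M90S falls below it and 100M50S does not.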

Quick start: shared-memory index

The bwa-mem3 FM-index for a genome like hg38 is approximately 28 GB. By default, every bwa-mem3 mem invocation reads the index from disk, which can take 30–60 seconds on a spinning disk and several seconds even on fast NVMe storage. For workloads that align many small samples in sequence on the same machine, this per-invocation overhead accumulates.

bwa-mem3 shm stages the index once into POSIX shared memory. Subsequent mem invocations attach to the in-memory segment instead of reading from disk, reducing per-sample startup time to near zero.

Stage the index

bwa-mem3 shm ref.fa

This reads the index files from disk and copies them into a POSIX shared-memory segment. The command returns when staging is complete. The index stays in memory until it is explicitly dropped or the system is rebooted.

To stage a methylation (--meth) index:

bwa-mem3 shm --meth ref.fa

A standard and a methylation index for the same reference can be staged simultaneously; they occupy separate named segments.

Align using the staged index

No extra flag is needed. When bwa-mem3 mem starts, it checks whether a matching shared-memory segment exists. If one does, it attaches automatically:

bwa-mem3 mem -t 16 ref.fa r1.fq.gz r2.fq.gz > out.sam

Inspect and drop staged segments

List all currently staged indices:

bwa-mem3 shm -l

Drop all staged segments:

bwa-mem3 shm -d

When to use shared-memory indexing

Shared-memory indexing is most beneficial when:

Aligning tens to hundreds of small samples (e.g. amplicon panels, targeted sequencing) where the per-sample index load time dominates the per-sample alignment time.
Running a batch pipeline on a single large machine where the index fits comfortably in RAM (approximately 28 GB for hg38 with the standard index).
The same reference is used for all samples in the batch; a new shm invocation is required for each distinct reference.

It provides little benefit when:

Aligning a small number of large samples (WGS), where alignment time far exceeds index load time.
The available RAM is insufficient to hold the index alongside the operating system and alignment worker processes.

Warning — No staleness check — always drop before re-indexing

bwa-mem3 shm does not detect whether the on-disk index files have changed after staging. If you run bwa-mem3 index ref.fa again (e.g. to rebuild after a reference update), the shared-memory segment is not invalidated. Subsequent mem invocations will attach to the stale segment and produce silently incorrect alignments.

Always drop the segment before re-indexing:
bwa-mem3 shm -d
bwa-mem3 index ref.fa
bwa-mem3 shm ref.fa
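Wrapping the three steps in one function makes it harder to forget the drop. This is a sketch: the function name reindex_and_restage and the BWA_MEM3 variable are hypothetical conveniences, and only the three subcommand invocations shown above are used. The && chain stops at the first failing step; whether shm -d returns non-zero when nothing is staged is not covered here, so adjust the chaining if needed.

```shell
# BWA_MEM3 names the executable; it defaults to "bwa-mem3" on PATH.
BWA_MEM3=${BWA_MEM3:-bwa-mem3}

# Re-index a reference without leaving a stale shared-memory segment behind.
# $1 = reference FASTA. Note that `shm -d` drops ALL staged segments.
reindex_and_restage() {
    ref=$1
    "$BWA_MEM3" shm -d &&
    "$BWA_MEM3" index "$ref" &&
    "$BWA_MEM3" shm "$ref"
}

# Usage:
#   reindex_and_restage ref.fa
```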

Indexing the reference

Before aligning reads, bwa-mem3 builds an FM-index from the reference FASTA. The index is built once and reused indefinitely; every mem run reads it back from disk at startup.

Basic indexing

bwa-mem3 index ref.fa

The command writes five files alongside the input FASTA:

File	Contents
`ref.fa.bwt.2bit.64`	Burrows-Wheeler Transform, 2-bit packed, 64-bit offsets
`ref.fa.0123`	Forward sequence, 2-bit packed
`ref.fa.amb`	Coordinates and counts of ambiguous (N) bases
`ref.fa.ann`	Sequence names and lengths
`ref.fa.pac`	Forward sequence, 4-bit packed

The .bwt.2bit.64 file dominates disk usage. For the human reference (hg38), expect roughly 28 GB total across all five files.
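Before launching a batch, it can be worth confirming that all five files from the table are present. The sketch below uses the suffixes listed above; the function name missing_index_files is hypothetical.

```shell
# List any of the five index files that are missing for a given prefix.
# $1 = the FASTA path that was passed to `bwa-mem3 index`.
# Prints one missing path per line, or nothing if the index is complete.
missing_index_files() {
    for suffix in bwt.2bit.64 0123 amb ann pac; do
        [ -f "$1.$suffix" ] || printf '%s\n' "$1.$suffix"
    done
}

# Usage:
#   [ -z "$(missing_index_files ref.fa)" ] || echo "index incomplete" >&2
```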

Methylation index (`--meth`)

bwa-mem3 index --meth ref.fa

Methylation mode builds a C-to-T doubled reference in addition to the standard FM-index files. The command writes a ref.fa.bwameth.c2t file (the doubled FASTA) and its own set of five index files with the .bwameth.c2t suffix:

ref.fa.bwameth.c2t
ref.fa.bwameth.c2t.bwt.2bit.64
ref.fa.bwameth.c2t.0123
ref.fa.bwameth.c2t.amb
ref.fa.bwameth.c2t.ann
ref.fa.bwameth.c2t.pac

The doubled reference is roughly twice the size of the standard one. For hg38, allow approximately 56 GB of disk space.

Tip — Pass the original FASTA to mem, not the c2t file

When running bwa-mem3 mem --meth, pass the original FASTA path (ref.fa), not ref.fa.bwameth.c2t. bwa-mem3 appends .bwameth.c2t automatically. The auto-append is skipped only when the path already ends in .bwameth.c2t, which is useful for external-c2t interop pipelines.

Output file locations

Index files are written to the same directory as the input FASTA by default. The input path is taken verbatim as a prefix — you can pass an absolute path to write into a different directory:

bwa-mem3 index /data/indexes/hg38/hg38.fa
# writes hg38.fa.bwt.2bit.64, etc. into /data/indexes/hg38/

Time and memory

Indexing hg38 takes roughly 60–90 minutes on a single core and requires about 80 GB of RAM during construction. The process is single-threaded; additional cores do not reduce wall time.

bwa-mem3 uses libsais to construct the suffix array, which is faster than the original bwa-mem2 approach. See Performance improvements for benchmark numbers.

Warning — Do not index over a live shared-memory segment

If you have previously staged the index into shared memory with bwa-mem3 shm, drop the segment first before re-indexing:
bwa-mem3 shm -d
bwa-mem3 index ref.fa
There is no staleness check. If bwa-mem3 mem finds a matching segment in shared memory it will attach to it even when the on-disk index has been updated. See Quick start: shared-memory index.

Arch flags and the index format

The FM-index format is architecture-independent. A single index works across every SIMD tier and every supported platform: the x86 binary’s AVX2 / AVX-512BW dispatch paths and the arm64 NEON binary all read the same on-disk layout.

Aligning short reads (mem)

bwa-mem3 mem aligns one or two FASTQ files against an indexed reference and writes SAM (default) or BAM (--bam) to stdout. It is a drop-in replacement for bwa-mem2 mem and supports all standard bwa-mem flags.

Basic usage

Paired-end:

bwa-mem3 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > out.sam

Single-end:

bwa-mem3 mem -t 16 ref.fa reads.fq.gz > out.sam

Pipe directly to samtools:

bwa-mem3 mem --bam=0 -t 16 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -@ 8 -o out.bam -
samtools index out.bam

Using --bam=0 (uncompressed BAM) avoids SAM text formatting on the write side and SAM parsing on the samtools side, and skips the wasted compression that samtools sort would immediately decompress; the BAM bytes flow between processes in the pipe.

Key flags

Threading: `-t`

-t INT   number of threads [1]

Performance scales well through roughly 16–32 threads on most machines. Beyond 32 threads, returns diminish on typical workloads because inter-thread locking and IO become the bottleneck. See Threading and resource use for detailed guidance.

Read-group header: `-R`

-R STR   read group header line, e.g. '@RG\tID:sample1\tSM:sample1\tLB:lib1\tPL:ILLUMINA'

Every production alignment should include a @RG header. The ID in the -R string is embedded as an RG:Z: tag on every output record.

Tip — Escape the tab correctly

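Separate the fields of -R with \t. How you quote the string determines what bwa-mem3 receives: single quotes pass the two-character sequence \t through unchanged, while $'...' quoting (bash, zsh) makes the shell replace each \t with a real tab character before bwa-mem3 sees it. The example below uses the $'...' form: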
bwa-mem3 mem -R $'@RG\tID:s1\tSM:sample1' -t 16 ref.fa R1.fq.gz R2.fq.gz

Chunk size: `-K`

-K INT   process INT input bases in each batch [10000000]

Larger -K values increase memory use but can improve throughput on very deep or very wide batches. The default is appropriate for most workloads.

SAM output control: `-S`, `-P`

-S    skip mate rescue
-P    skip pairing; mate rescue performed unless -S also in use

These flags are primarily useful for debugging or non-standard workflows. Normal paired-end alignments should leave both at their defaults.

Output modes

SAM (default)

bwa-mem3 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > out.sam

Plain-text SAM. Suitable for inspection, compatibility testing, and piping to tools that consume SAM.

BAM (`--bam=0`)

bwa-mem3 mem --bam=0 -t 16 ref.fa R1.fq.gz R2.fq.gz > out.bam

Writes BAM directly. --bam=0 is uncompressed BAM, which avoids double-compression when piping into a downstream sorter and is roughly 10–15% faster end-to-end. Pass --bam=6 to write a fully compressed BAM if the output is the final product.

Note — --bam=0 is the recommended output mode

For production pipelines, always use --bam=0 and pipe to samtools sort. See Best Practices: output format for the canonical pipeline.

Methylation alignment (`--meth`)

Pass --meth for bisulfite/RRBS samples. This activates inline C-to-T read conversion, bwameth.py-compatible flag defaults, and inline BAM post-processing. See Quick start: methylation alignment for the two-command workflow and the Methylation Reference for full detail.

Shared-memory index auto-attach

When bwa-mem3 shm has staged the index into shared memory, bwa-mem3 mem attaches automatically — no extra flag is required. The shared-memory path is transparent to users.

Cross-references

The full flag list is in the CLI Reference: mem page.

Output: SAM/BAM, headers, tags

bwa-mem3 writes output in either SAM (default) or BAM (--bam) format. This page covers the header structure and every non-standard SAM tag emitted by bwa-mem3.

Output format

By default, bwa-mem3 mem writes SAM to stdout. Pass --bam (or --bam=N for a specific compression level) to write BAM. Level 0 (uncompressed) is the default when --bam is given without an argument, which is optimal when piping to a downstream samtools sort.

# SAM (default)
bwa-mem3 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > out.sam

# Uncompressed BAM — best for piping
bwa-mem3 mem --bam -t 16 ref.fa R1.fq.gz R2.fq.gz | samtools sort -@ 8 -o out.bam -

# Compressed BAM — useful when the output is the final file
bwa-mem3 mem --bam=6 -t 16 ref.fa R1.fq.gz R2.fq.gz > out.bam

SAM header

`@HD`

A default @HD VN:1.6 SO:unsorted line is emitted unless the user supplies one via -H. The sort order is unsorted because bwa-mem3 writes records in input read order; downstream sorting is always a separate step.

`@SQ`

One @SQ line is written per reference sequence, with the sequence name (SN:) and length (LN:) derived from the FM-index. If the index was built with a .dict or .hdr file that supplies @SQ records, those records are used instead of the auto-generated ones.

In methylation mode (--meth), the doubled reference contains sequences with an f or r prefix in their names. The inline BAM post-processor collapses these back to canonical chromosome names so that the output @SQ lines match a standard non-methylation alignment. See Chimera QC and header rewriting.

`@PG`

One @PG entry is written in standard mode:

ID	Description
`bwa-mem3`	The alignment step. `VN:` is the bwa-mem3 version string; `CL:` is the full command line.

In methylation mode (--meth), a second @PG entry is appended:

ID	Description
`bwa-mem3-meth`	The inline post-processor. `VN:` carries the version with `-meth` suffix; `CL:` is the full command line.

The bwa-mem3-meth entry follows immediately after the bwa-mem3 entry and records the post-processing step as a distinct pipeline node, matching the convention of separate-tool pipelines.

Tags emitted by bwa-mem3

Standard tags

bwa-mem3 emits the same standard tags as bwa-mem2 (NM:i, MD:Z, AS:i, XS:i, SA:Z, RG:Z, XA:Z, MC:Z, etc.). These are documented in the SAM specification and are not described further here.

bwa-mem3 additionally emits MQ:i on paired-end records — the mate’s mapping quality, set alongside MC:Z (the mate’s CIGAR) so callers that key off the mate’s MAPQ don’t need to look at the mate record. Both SAM and --bam output paths emit it. Backported from lh3/bwa PR #330 in fg-labs PR #35.

The XA:Z field set widens from chr,pos,CIGAR,NM to chr,pos,CIGAR,NM,score,mapq when -u (a.k.a. the upstream “XB” toggle) is passed; the tag name itself remains XA:Z for downstream compatibility. Tools that parse XA:Z need to be aware of the two possible field widths.
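A parser can tell the two layouts apart by counting comma-separated fields per hit. The sketch below assumes the usual semicolon-terminated XA:Z hit list and the two field sets named above; the function name xa_hits and the sample values in the usage line are illustrative only.

```shell
# Split an XA:Z value (without the "XA:Z:" prefix) into one hit per line and
# prefix each hit with its field count:
#   4 = chr,pos,CIGAR,NM
#   6 = chr,pos,CIGAR,NM,score,mapq (the widened form)
xa_hits() {
    printf '%s\n' "$1" | tr ';' '\n' | awk -F, 'NF > 0 { print NF "\t" $0 }'
}

# Usage:
#   xa_hits 'chr1,+100,50M,0;chr2,-200,50M,1;'
```

Downstream code can then branch on the first column instead of assuming a fixed width.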

`HN:i` — total alignment hit count

HN:i:<count>

The total number of primary alignments (both reported and suppressed) that the aligner found for this read, before the -h supplementary cap is applied. Useful for distinguishing “uniquely mapped” from “multi-mapped” reads without relying solely on MAPQ.

HN:i is emitted on the primary alignment record only.
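As an illustration of using HN:i alongside MAPQ, the sketch below lists reads whose HN:i value is greater than 1 from SAM text. Treating "greater than 1" as the multi-mapped threshold is an assumption of this example, and the function name multi_hit_reads is hypothetical.

```shell
# Print the names of reads whose HN:i value exceeds 1, reading SAM on stdin.
# Header lines (starting with "@") are skipped; optional tags start at column 12.
multi_hit_reads() {
    awk -F'\t' '!/^@/ {
        for (i = 12; i <= NF; i++)
            if ($i ~ /^HN:i:/ && substr($i, 6) + 0 > 1) { print $1; break }
    }'
}

# Usage:
#   bwa-mem3 mem -t 16 ref.fa R1.fq.gz R2.fq.gz | multi_hit_reads > multi.txt
```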

Methylation-only tags

The following Bismark-compatible tags are emitted only when --meth is active. See SAM tags: XR, XG, XM for the full per-tag reference, including the XM:Z character alphabet and the XG:Z strand-pick semantics.

Tag	Type	Description
`XR:Z`	string	Read conversion direction: `CT` (R1 / SE) or `GA` (R2)
`XG:Z`	string	Genome strand of the alignment: `CT` (OT) or `GA` (OB)
`XM:Z`	string	Per-base methylation call string (length = `SEQ`)

The bwameth-style YS:Z / YC:Z tags exist only as an internal carrier on bseq1_t.comment for SEQ restoration and XR:Z derivation; they are suppressed at BAM emit and never appear in output. The bwameth YD:Z strand tag has been replaced by Bismark XG:Z and is not emitted.

MAPQ semantics

MAPQ semantics are inherited from bwa-mem2 and follow the same scoring model. In methylation mode, alignments identified as chimeras (longest M/=/X run covering less than 44% of the read length) have their MAPQ capped at 1 and the 0x200 (QC fail) flag set. See Chimera QC and header rewriting.

Threading and resource use

The `-t` flag

-t INT   number of threads [1]

bwa-mem3 parallelizes alignment by dividing the input into fixed-size batches (controlled by -K) and processing batches concurrently. Threads share the in-memory FM-index; there is no per-thread copy.

How threads interact with performance

Where threads help

Seed finding (SMEM enumeration) is fully parallel across reads in a batch.
Extension (banded Smith-Waterman) is fully parallel.
Pair rescue is parallel.
BAM encoding (when --bam is active) is parallel.

Where threads stop helping

Wall-clock alignment time improves with thread count up to approximately 16–32 threads on a modern CPU. Beyond that, several effects flatten the curve:

FM-index bandwidth. The index for hg38 is ~28 GB and does not fit in the L3 cache of any current server. At high thread counts, threads contend for memory bandwidth accessing the BWT.
IO contention. On spinning disk or a shared network filesystem, concurrent reads of the same large index file saturate IO bandwidth before the CPU is saturated.
Output serialization. SAM output is serialized per-record to stdout. BAM output with --bam reduces this bottleneck but does not eliminate it entirely.

Recommended thread counts

Machine	Recommended `-t`	Notes
16-core workstation	12–14	Leave 2 cores for `samtools sort`
32-core server	24–28	Leave cores for downstream and OS overhead
64-core server	40–48	Marginal returns above 48; test with your workload
Multiple parallel runs	divide evenly	See below

These are starting points. Profile with your specific data and storage configuration to find the practical optimum.

Running multiple parallel alignments

When running multiple bwa-mem3 mem processes on the same machine, divide threads so that the total does not exceed the physical core count. For example, on a 32-core machine running four concurrent samples:

# Four parallel runs, 8 alignment threads each
for sample in a b c d; do
  bwa-mem3 mem --bam -t 8 ref.fa "${sample}_R1.fq.gz" "${sample}_R2.fq.gz" \
    | samtools sort -@ 2 -o "${sample}.bam" - &
done
wait

Using shared memory (bwa-mem3 shm) amortizes the index read-in cost across all four runs. See Quick start: shared-memory index and Best Practices: multi-sample workflows.
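If the samtools sort threads are counted against the same core budget as the aligner, the per-run -t value can be derived rather than hard-coded. The helper below is a sketch of that arithmetic only; the function name align_threads_per_run is hypothetical.

```shell
# Alignment threads per run such that (align + sort threads) x runs <= cores.
# $1 = physical cores, $2 = concurrent runs, $3 = sort threads per run.
# Never prints less than 1.
align_threads_per_run() {
    t=$(( $1 / $2 - $3 ))
    [ "$t" -ge 1 ] || t=1
    echo "$t"
}

# Usage:
#   t=$(align_threads_per_run 32 4 2)
#   bwa-mem3 mem --bam -t "$t" ref.fa a_R1.fq.gz a_R2.fq.gz \
#     | samtools sort -@ 2 -o a.bam -
```

On a 32-core machine with four runs and two sort threads each, this yields 6 alignment threads per run, a stricter budget than counting alignment threads alone.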

Memory use

Peak RAM during alignment is dominated by the in-memory FM-index. For hg38, expect roughly 28 GB of resident memory per bwa-mem3 mem process. Additional memory is used per batch, roughly proportional to the batch size in bases set by -K.

With bwa-mem3 shm, the index is mapped from a shared-memory segment, so multiple concurrent mem processes share the same physical pages. Total RAM use for the index is therefore approximately one copy, not one per process.

Tip — Use shm for repeated runs on the same machine

If you run more than a few samples on the same machine without rebooting, bwa-mem3 shm pays off immediately. The index is read from disk once and stays in RAM for all subsequent mem invocations.

IO recommendations

Use local NVMe storage for the index files when possible. The ~28 GB BWT read is the dominant IO event at the start of each mem run.
Write BAM (--bam) to a fast local disk or pipe directly to samtools sort. Avoid writing uncompressed SAM to a network filesystem.
Separate read and write paths if your storage topology allows it: read the index from one volume and write sorted BAM to another.

Memory allocator (mimalloc)

bwa-mem3 vendors and links mimalloc, Microsoft’s high-performance memory allocator, into every binary by default. On multi-threaded alignment workloads, mimalloc reduces wall-clock time by replacing the system allocator with one optimized for many small, short-lived allocations — exactly the access pattern produced by the inner alignment loops.

What mimalloc replaces

The system allocator (glibc malloc on Linux, libSystem malloc on macOS) is a general-purpose allocator that protects shared heap state with locks. Under heavy multi-threaded allocation pressure, with 16+ threads each issuing thousands of short-lived allocations per batch, contention on those locks becomes a measurable bottleneck. mimalloc uses per-thread free lists and a segment-based heap to eliminate most of this contention.

Platform-specific linkage

The linkage strategy differs by OS:

Platform	Mechanism
Linux	Static linkage with `--whole-archive`. The entire mimalloc static library is embedded into the `bwa-mem3` binary; its `malloc`/`free` symbols take precedence over `glibc`’s at link time.
macOS	Dynamic linkage via dyld interposing. `libmimalloc.dylib` is built alongside the binary; dyld’s `DYLD_INSERT_LIBRARIES` interposing mechanism replaces `malloc`/`free` at load time. The dylib ships next to the binary.

Warning — macOS: keep libmimalloc.dylib next to the binary

On macOS, libmimalloc.dylib must remain in the same directory as the bwa-mem3 binary (or be reachable via the embedded rpath). If you move bwa-mem3 without also moving libmimalloc.dylib, the binary silently falls back to the system allocator. The only sign is that bwa-mem3 version no longer prints the mimalloc line that indicates the allocator is active.

Verifying that mimalloc is active

Run:

./bwa-mem3 version

When mimalloc is linked and loaded, the output includes a line like:

mimalloc 3.x.x

If that line is absent, mimalloc is not active.
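For scripted deployments the same check can be automated. The sketch below assumes the version output contains a line of the `mimalloc 3.x.x` shape shown above; the helper name `has_mimalloc` is hypothetical.

```shell
# Succeeds when the text on stdin contains a line starting with "mimalloc ".
# Assumption: the version output matches the "mimalloc 3.x.x" shape shown above.
has_mimalloc() {
  grep -q '^[[:space:]]*mimalloc '
}

# Only query the real binary when it is on PATH.
if command -v bwa-mem3 >/dev/null 2>&1; then
  if bwa-mem3 version | has_mimalloc; then
    echo "mimalloc active"
  else
    echo "mimalloc NOT active" >&2
  fi
fi
```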

Opting out

Pass USE_MIMALLOC=0 at build time to produce a binary linked against the system allocator:

make USE_MIMALLOC=0

Reasons to opt out:

AddressSanitizer (ASAN) builds. The Makefile automatically sets USE_MIMALLOC=0 when ASAN_FLAGS is detected, because ASAN and mimalloc’s malloc interposing cannot coexist cleanly.
Container environments where distributing a dylib alongside the binary is inconvenient.
Reproducibility testing to isolate whether a behavioral difference is allocator-related.

Note — Default is on

USE_MIMALLOC=1 is the default. Opt-out is not recommended for production workloads — mimalloc measurably reduces wall time on multi-threaded runs.

Build internals

The mimalloc source lives in ext/mimalloc/ as a git submodule. The Makefile target builds it via CMake before linking bwa-mem3. The relevant Makefile variables are MIMALLOC_SRC, MIMALLOC_BUILD, and MIMALLOC_LIB.

The feature was introduced in bwa-mem3 as part of the performance improvement work. See Features and Build & infrastructure for the PR history.

Tips and best practices

This page collects the most commonly useful operational tips for running bwa-mem3. Each tip is a short actionable point; the linked pages provide the full rationale.

Index once, align many times

Build the FM-index once per reference version. The on-disk index format is stable across bwa-mem3 releases and across every SIMD tier inside the single binary — the AVX2 and AVX-512BW kernel paths read the same files. You do not need to re-index when upgrading bwa-mem3 unless the release notes say otherwise.

# Build once
bwa-mem3 index ref.fa

# Align many samples
for sample in a b c d; do
  bwa-mem3 mem --bam -t 16 ref.fa ${sample}_R1.fq.gz ${sample}_R2.fq.gz \
    | samtools sort -@ 4 -o ${sample}.bam -
done

Pipe to `samtools sort -@`

Never write an intermediate unsorted BAM to disk and then sort it in a second step. bwa-mem3’s --bam mode + samtools sort in a single pipeline avoids the extra write/read cycle and is significantly faster:

bwa-mem3 mem --bam -t 16 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -@ 8 -o out.bam -
samtools index out.bam

Allocate roughly 2/3 of available threads to bwa-mem3 mem and 1/3 to samtools sort. On a 24-core machine, -t 16 for bwa-mem3 and -@ 8 for samtools is a good starting point.
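The 2/3 to 1/3 rule can be sketched as a small helper. The name `split_threads` is hypothetical and the rounding is one possible choice.

```shell
# Hypothetical helper: split a core count roughly 2/3 : 1/3.
# Prints "<mem threads> <sort threads>".
split_threads() (
  cores=$1
  mem_t=$(( cores * 2 / 3 ))
  [ "$mem_t" -lt 1 ] && mem_t=1
  sort_t=$(( cores - mem_t ))
  [ "$sort_t" -lt 1 ] && sort_t=1
  echo "$mem_t $sort_t"
)

split_threads 24   # prints "16 8"
```

The two numbers map onto `-t` for bwa-mem3 mem and `-@` for samtools sort respectively.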

Stage the index in shared memory for batch workloads

When aligning more than a few samples on the same machine, reading the ~28 GB hg38 index from disk on every mem invocation is the dominant wall-clock cost. Stage it once:

bwa-mem3 shm ref.fa

Subsequent bwa-mem3 mem invocations attach automatically. The shared-memory segment persists until explicitly dropped (bwa-mem3 shm -d) or the machine reboots.

Warning — Always drop the segment before re-indexing

There is no staleness check. If you rebuild the index without first dropping the shared-memory segment, bwa-mem3 mem will attach to the stale segment and produce incorrect alignments without any warning. Always run bwa-mem3 shm -d before bwa-mem3 index.
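One way to make the ordering hard to get wrong is to wrap it in a function. This is a minimal sketch, not part of bwa-mem3: the `BWA` override and the function name are hypothetical, and it assumes `bwa-mem3 shm -d` exits successfully when nothing is staged (if it does not, the chain stops at the first step).

```shell
# Assumption: BWA can be overridden, e.g. BWA=echo for a dry run.
BWA=${BWA:-bwa-mem3}

# Drop any staged segment, rebuild the index, then stage the fresh one.
# Stops at the first failing step.
reindex_safely() {
  ref=$1
  "$BWA" shm -d &&
    "$BWA" index "$ref" &&
    "$BWA" shm "$ref"
}

# Dry run in a subshell: prints the arguments of the three commands in order.
( BWA=echo; reindex_safely ref.fa )
```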

Pin threads when running concurrent jobs

When running multiple bwa-mem3 mem processes in parallel, divide threads explicitly so that the total does not exceed the physical core count. Avoid relying on the scheduler to balance over-subscribed threads — each process will spin waiting for CPU time, and total throughput drops.

# Good: 4 jobs × 6 threads = 24 cores, on a 24-core machine
for sample in a b c d; do
  bwa-mem3 mem --bam -t 6 ref.fa ${sample}_R1.fq.gz ${sample}_R2.fq.gz \
    | samtools sort -@ 2 -o ${sample}.bam - &
done
wait

See Threading and resource use for per-machine thread count recommendations.

Confirm the binary’s SIMD tier matches your CPU

bwa-mem3 ships one binary per platform that contains every supported x86 SIMD tier (or the single NEON path on arm64) and picks the right tier in process at startup. There are no per-tier companion binaries to copy or call directly.

CPU generation	Resolved tier
Modern Intel/AMD (2018+)	`avx512bw` or `avx2`
Older x86	`sse42` or `sse41`
Apple Silicon / AWS Graviton	`neon`

Verify the resolved tier with bwa-mem3 version (prints SIMD floor: and SIMD runtime: lines on stdout) or set BWAMEM3_DEBUG_SIMD=1 to get a startup banner from bwa-mem3 mem. If you need to force a lower tier for A/B regression testing, set BWAMEM3_FORCE_TIER=<tier> — upgrade requests above the host’s capability are rejected.

See Performance: SIMD dispatch matrix.

Include a read-group header

Always pass -R with at minimum ID: and SM: fields. Many downstream tools (GATK, fgbio, Picard) require a @RG header and will fail or warn without one.

bwa-mem3 mem \
  -R $'@RG\tID:run1\tSM:sample1\tLB:lib1\tPL:ILLUMINA' \
  -t 16 ref.fa R1.fq.gz R2.fq.gz

Performance Overview

Performance claims in this section are benchmarked, not asserted. The canonical source of truth for benchmark methodology, hardware configurations, and current numbers is bwa-mem3-bench, a reproducible benchmarking harness that runs across AWS Batch architectures (x86 AVX2, AVX-512, ARM Graviton). Consult that repository before drawing conclusions from isolated anecdotal timings.

What drives bwa-mem3’s performance

bwa-mem3 inherits the SIMD-vectorized alignment kernels of bwa-mem2 and adds several improvements of its own. The headline gains relative to a stock bwa-mem2 build fall into four categories.

Vectorized alignment kernels. The Smith-Waterman and banded-SWA kernels (kswv, bandedSWA) are compiled against the widest SIMD ISA the current CPU supports — SSE4.1 through AVX-512BW on x86, or native NEON on ARM. On Apple Silicon, native NEON intrinsics replaced the sse2neon shim in the two hottest kernels, delivering roughly 10% additional throughput over the pure-translation baseline. See SIMD dispatch matrix for the full picture.

libsais FM-index construction. The indexing step uses the linear-time suffix-array/BWT construction library libsais in place of the original quadratic-time approach. This cuts bwa-mem3 index wall time substantially on large references. See What’s Different — Performance improvements for the corresponding PR details.

mimalloc allocator. bwa-mem3 vendors mimalloc and links it into every binary, replacing the system malloc/free for all allocations. On Linux the library is linked statically via --whole-archive; on macOS it is loaded dynamically through dyld interposition. The allocator shows consistent throughput gains on multi-threaded workloads because mimalloc avoids the lock contention in glibc’s ptmalloc at high thread counts. See User Guide — Memory allocator for details.

Profile-Guided Optimization (PGO). The build system provides make pgo-generate and make pgo-use targets that compile an instrumented binary, gather branch-probability and call-frequency profiles from a representative workload, and then recompile with those profiles applied. On Apple Silicon the measured gain is approximately 3%; on x86 the gain depends on the workload mix. PGO is opt-in and is not applied to the default make output. See PGO build for the full workflow.

Consolidated mapping speedups

PR #58 and the related lockstep SMEM-batching work (#33) reduced per-read overhead in the main mapping loop beyond what upstream bwa-mem2 carries. The batch -H ingestion improvement (#49) further reduces header-processing latency for large sample sets.

Reference numbers across architectures

Wall-time medians from bwa-mem3-bench at SHA dc7fcfe (2026-05-13), 5 reps per cell, t≈16, hg38, paired-end 150 bp:

sample	c6a (AVX2, Zen3)	c7a (AVX-512, Zen4)	c7i (AVX-512, SPR)	c7g (NEON, Graviton3)	c8g (NEON, Graviton4)
wgs-5M	147.70 s	101.17 s	138.33 s	178.54 s	151.23 s
wes-5M	84.37 s	61.96 s	75.08 s	84.50 s	70.90 s
panel-twist-5M	158.49 s	106.94 s	151.78 s	194.04 s	163.38 s

Concordance vs upstream bwa-mem2 v2.2.1 on these cells: 100.0000% across 8.1M–10M reads/cell. NEON-vs-x86 cross-architecture concordance on the same builds is also 100.0000%. Spot-pool noise envelope (rep-to-rep CV): ~1% on c6a / c7a / c7g / c8g, ~8–9% on c7i. See the bench repo for the methodology, the full per-rep table, and noisier instance classes excluded from this summary.
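If you collect your own repetitions, the median and rep-to-rep CV can be computed as in the sketch below. It assumes CV is the sample standard deviation divided by the mean; whether bwa-mem3-bench uses exactly this definition is not stated here, and the helper name is hypothetical.

```shell
# Reads one wall time (seconds) per line on stdin; prints median and CV.
# Assumption: CV = sample standard deviation / mean, as a percentage.
median_cv() {
  sort -n | awk '
    { v[NR] = $1; sum += $1 }
    END {
      if (NR < 2) { print "need at least 2 values"; exit 1 }
      med = (NR % 2) ? v[(NR + 1) / 2] : (v[NR / 2] + v[NR / 2 + 1]) / 2
      mean = sum / NR
      for (i = 1; i <= NR; i++) ss += (v[i] - mean) ^ 2
      printf "median=%.2f cv=%.2f%%\n", med, 100 * sqrt(ss / (NR - 1)) / mean
    }'
}

printf '%s\n' 100 101 99 102 98 | median_cv   # prints: median=100.00 cv=1.58%
```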

Benchmarking responsibly

Alignment throughput is sensitive to read length, error rate, reference size, thread count, CPU architecture, NUMA topology, and whether the index is cold (read from disk) or warm (already in the kernel page cache). The bwa-mem3-bench harness controls for these variables by running standardized workloads on defined instance types. If you need numbers for a procurement or publication decision, run the harness against your target hardware.

SIMD Dispatch Matrix

bwa-mem3 ships one binary per platform. The x86 binary contains compiled kernels for every supported SIMD tier (sse41 / sse42 / avx / avx2 / avx512bw) and dispatches in process at startup. The arm64 binary contains a single NEON kernel path. There are no bwa-mem3.<tier> companion files on disk and no launcher binary.

Dispatch flowchart

flowchart TD
    A[bwa-mem3 mem starts] --> B{Platform?}
    B -- ARM / aarch64 --> C[NEON kernel TU, no dispatch]
    B -- x86 --> D[bwamem3_simd_init in src/simd_dispatch.cpp]
    D --> E[__builtin_cpu_supports]
    E --> F{Host capability?}
    F -- AVX-512BW --> G1[g_tier = avx512bw]
    F -- AVX2 --> G2[g_tier = avx2]
    F -- AVX --> G3[g_tier = avx]
    F -- SSE4.2 --> G4[g_tier = sse42]
    F -- SSE4.1 --> G5[g_tier = sse41]
    F -- below build floor --> H["exit(2): host below SIMD floor"]
    G1 & G2 & G3 & G4 & G5 --> I[Per-kernel factory selects matching tier]

Tier detection runs once during main(). Subsequent kernel calls pay a single indirect-call hop through a factory vtable (or an extern "C" wrapper for free-function ksw_* kernels) — about 0.3 ns per call after BTB warm-up, well below run-to-run noise on the bwa-mem3-bench corpus.

If the host CPU does not meet the build’s compile-time SIMD floor (BASELINE_ARCH, default avx2 since PR #84), the binary exits with code 2 and an [E::bwamem3] message naming the gap before any alignment work runs. bwa-mem3 version, --help, and -h are exempt and always succeed so operators can introspect a binary on a host that cannot run alignment. See Host requirements.
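For a quick pre-flight check outside the binary, the sketch below approximates the host-floor test on Linux by looking for the matching flag in /proc/cpuinfo. The helper name `host_meets_floor` is hypothetical, and the real check inside bwa-mem3 goes through __builtin_cpu_supports, so treat a pass here as a hint, not a guarantee.

```shell
# Succeeds when the cpuinfo text on stdin advertises the flag for the tier.
# The Linux flag spellings (sse4_1, sse4_2) differ slightly from the
# tier names accepted by BASELINE_ARCH.
host_meets_floor() (
  case $1 in
    sse41)    flag=sse4_1 ;;
    sse42)    flag=sse4_2 ;;
    avx)      flag=avx ;;
    avx2)     flag=avx2 ;;
    avx512bw) flag=avx512bw ;;
    *)        echo "unknown tier: $1" >&2; exit 2 ;;
  esac
  # Inspect the first "flags" line only; -w avoids matching avx inside avx2.
  grep -m1 '^flags' | grep -qw "$flag"
)

# Linux x86 only: check the default avx2 floor against this host.
if [ -r /proc/cpuinfo ]; then
  host_meets_floor avx2 < /proc/cpuinfo && echo "host meets avx2 floor"
fi
```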

Building

make                              # single multi-tier x86 binary, BASELINE_ARCH=avx2
make BASELINE_ARCH=sse41          # lower host SIMD floor / maximize portability (~10–15% slower on AVX2 hosts)
make BASELINE_ARCH=avx512bw       # AVX-512BW-only fleet (locks the host floor)
make arm64                        # single NEON binary, no dispatch table

BASELINE_ARCH controls the tier at which non-kernel translation units compile. The hand-tuned kernel TUs in KERNEL_SRCS (bandedSWA, kswv, ksw, sam_encode) are always compiled at every supported tier and dispatched at runtime, so a build at BASELINE_ARCH=avx2 still uses the AVX-512BW kernels on AVX-512BW hosts. The non-kernel TUs are not auto-vectorized above BASELINE_ARCH, which is the trade-off — see BASELINE_ARCH=avx512bw build flag for the empirical perf characterization.

Supported x86 tiers (minimum CPU for each tier’s kernel path):

Tier	Arch flags	Minimum CPU
`sse41`	`-msse4.1`	Penryn (2007) / Bulldozer (2011)
`sse42`	`-msse4.2`	Nehalem (2008) / Bulldozer (2011)
`avx`	`-mavx`	Sandy Bridge (2011) / Bulldozer (2011)
`avx2`	`-mavx2`	Haswell (2013) / Excavator (2015)
`avx512bw`	`-mavx512f -mavx512bw -mprefer-vector-width=256`	Skylake-X (2017) / Zen 4 (2022)

For arm64 builds:

Binary	Arch flags	Platform
`bwa-mem3` (arm64)	`-DAPPLE_SILICON=1` + native NEON / sse2neon shim	Any aarch64 / Apple Silicon

Kernel vectorization coverage

Kernel	SSE4.1	SSE4.2	AVX	AVX2	AVX-512BW	NEON (arm64)
`kswv` (vectorized Smith-Waterman)	8-wide int16	8-wide int16	8-wide int16	16-wide int16	32-wide int16	8-wide int16 (native)
`bandedSWA` (banded alignment / mate-rescue)	vectorized	vectorized	vectorized	vectorized	vectorized	native NEON blendv
`ksw_*` (SW extension free functions)	per-tier	per-tier	per-tier	per-tier	per-tier	per-tier (NEON)
`sam_encode` (SAM seq/qual encoder)	per-tier	per-tier	per-tier	per-tier	per-tier	per-tier (NEON)
FM-index lookup (`FMI_search`)	scalar popcount	scalar popcount	scalar popcount	scalar popcount	scalar popcount	`__builtin_popcountl`
libsais BWT construction	scalar	scalar	scalar	OpenMP parallel	OpenMP parallel	OpenMP parallel

Note — FM-index is memory-bound

The FM-index backward-extension loop is limited by pointer-chasing through the cp_occ arrays, not by computation. Additional SIMD width does not increase throughput here. See Developer Guide — Apple Silicon / NEON port for the profiling evidence.

Runtime overrides

Two environment variables tune dispatch:

Variable	Effect
`BWAMEM3_FORCE_TIER=<tier>`	Forces a specific tier (`sse41` / `sse42` / `avx` / `avx2` / `avx512bw`). Downgrade-only: requests above the host’s detected tier (which would SIGILL) and unknown names are rejected with a stderr warning. Used by `test/regression/all_tiers_parity.sh` to confirm byte-identical SAM across all tiers on AVX-512 hosts.
`BWAMEM3_DEBUG_SIMD=1`	Prints a one-line `[I::bwamem3_simd_init_body]` startup banner with the build baseline, the detected host capability, and the resolved tier. Also enables the build-baseline-vs-host gap warning.

Use bwa-mem3 version to read the resolved tier without alignment:

v0.2.0
SIMD floor:   avx2 (x86-64-v3, Haswell 2013+); kernels: sse41 sse42 avx avx2 avx512bw
SIMD runtime: avx512bw (BWAMEM3_FORCE_TIER unset)
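In a deployment script, the resolved tier can be pulled out of that output with a one-line filter. This sketch assumes the `SIMD runtime:` line keeps the shape shown above; the helper name is hypothetical.

```shell
# Prints the tier named on the "SIMD runtime:" line of the text on stdin.
simd_runtime_tier() {
  sed -n 's/^SIMD runtime:[[:space:]]*\([^[:space:]]*\).*/\1/p'
}

# Demonstration on the sample output above.
printf '%s\n' \
  'v0.2.0' \
  'SIMD floor:   avx2 (x86-64-v3, Haswell 2013+); kernels: sse41 sse42 avx avx2 avx512bw' \
  'SIMD runtime: avx512bw (BWAMEM3_FORCE_TIER unset)' | simd_runtime_tier
# prints: avx512bw
```

Against a real binary the input would be `bwa-mem3 version | simd_runtime_tier`, and the result can be compared with the tier you expect for the fleet.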

Why in-process dispatch, not separate binaries

The pre-PR-#83 design shipped six binaries (one launcher plus one per ISA tier) and used execv to launch the matching tier at startup. That worked but cost ~120 MB on disk, required all six binaries to be present in the same directory, and made BWAMEM3_FORCE_TIER impossible without re-exec’ing a different file. The current single-binary design keeps the per-tier compile granularity for the hand-tuned kernel TUs while collapsing distribution to one file (~25 MB), and adds runtime tier override and a clean host-floor precheck. Indirect-call overhead is the only trade-off, and it is below the measurement noise floor on every architecture in the bench matrix.

PGO Build

Profile-Guided Optimization (PGO) is a two-pass compiler technique. In the first pass (pgo-generate) the compiler inserts counters into every branch, call site, and loop back-edge. You run a representative training workload against the instrumented binary so those counters accumulate real branch-probability data. In the second pass (pgo-use) the compiler recompiles every translation unit using the collected profiles to make better inlining, branch-prediction, and code-layout decisions.

bwa-mem3’s Makefile provides three targets that implement this workflow.

Observed gains

On Apple Silicon (M-series), PGO delivered approximately 3% throughput improvement over the native NEON build. The gain on x86 depends on the workload — short-read paired-end alignment on avx2 or avx512bw hardware typically sees 2–5%. PGO is most useful when you will run the same binary on the same hardware against the same workload repeatedly (e.g. a production pipeline node). It is not worth the extra build time for one-off or exploratory runs.

Workflow

Step 1: Build the instrumented binary

make pgo-generate

By default PGO_ARCH is set to arm64 on Apple Silicon / aarch64 hosts and native on x86 hosts. To target a specific ISA, pass PGO_ARCH explicitly:

make pgo-generate PGO_ARCH=avx2

This produces a binary named bwa-mem3.pgo-instr (or bwa-mem3.pgo-instr.avx2 for non-default arch). Profiles are written to the directory pgo_profiles/ by default. Override with PGO_PROFILE_DIR:

make pgo-generate PGO_ARCH=avx2 PGO_PROFILE_DIR=/scratch/pgo-profiles-avx2

Step 2: Run the training workload

Run a workload that is representative of your production use. A single-end or paired-end alignment run against the same reference and similar read length is sufficient. A larger training run produces more stable profiles but 5–10 million read pairs is generally enough.

./bwa-mem3.pgo-instr mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null

The run discards the alignments because only the profile counters written during the run matter for the next step.

Tip — Training workload size

Aim for a training run that exercises the same code paths as your production workload. If you align 150 bp paired-end reads in production, train on 150 bp reads. If you use --meth, include a methylation alignment run in training. A few million read pairs is sufficient; a full WGS run provides diminishing returns.
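One simple way to cut a training subset from existing FASTQ files is to take the first N records of each mate file. The sketch below assumes standard 4-line FASTQ records and uses a hypothetical helper name. Taking the head of a file is not a random sample, so the subset may differ from the full run in quality profile.

```shell
# Prints the first N FASTQ records from stdin (4 lines per record).
# Assumption: records use the standard 4-line form, not wrapped sequences.
fastq_head() {
  head -n "$(( $1 * 4 ))"
}

# Demonstration on two tiny records: keeps only the first one.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n' | fastq_head 1
```

Applying the same N to R1 and R2 (for example `gzip -dc R1.fq.gz | fastq_head 5000000 | gzip > train_R1.fq.gz`) keeps the mates in sync, provided both files list reads in the same order.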

Step 3: Build the optimized binary

make pgo-use

Or with matching arch and profile dir:

make pgo-use PGO_ARCH=avx2 PGO_PROFILE_DIR=/scratch/pgo-profiles-avx2

This produces bwa-mem3.pgo (or bwa-mem3.pgo.avx2). The binary is ready to use in production.

Step 4: Clean up instrumentation artifacts

make pgo-clean

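This removes the profile directory and all bwa-mem3.pgo-instr* and bwa-mem3.pgo* files. The bwa-mem3.pgo* pattern also matches the optimized binary from Step 3, so copy that binary to its deployment location before cleaning.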

Multi-arch builds with PGO

Each architecture requires its own profile because the instrumentation counters are embedded in arch-specific code. Run the full three-step workflow once per arch and keep the profiles in separate directories:

# AVX2 profile
make pgo-generate PGO_ARCH=avx2 PGO_PROFILE_DIR=pgo_profiles_avx2
./bwa-mem3.pgo-instr.avx2 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null
make pgo-use PGO_ARCH=avx2 PGO_PROFILE_DIR=pgo_profiles_avx2

# AVX-512BW profile (separate host or same host with matching CPU)
make pgo-generate PGO_ARCH=avx512bw PGO_PROFILE_DIR=pgo_profiles_avx512bw
./bwa-mem3.pgo-instr.avx512bw mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null
make pgo-use PGO_ARCH=avx512bw PGO_PROFILE_DIR=pgo_profiles_avx512bw
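When more than two tiers are involved, the per-arch sequence can be generated and not typed by hand. The dry-run sketch below only prints the commands, following the file and directory naming used in the examples above; the helper name `pgo_plan` is hypothetical, and each printed sequence still has to be run on a host of the matching hardware class.

```shell
# Dry-run helper: prints the generate / train / use commands for one arch.
# File and directory names follow the naming in the examples above.
pgo_plan() (
  arch=$1
  dir=pgo_profiles_$arch
  echo "make pgo-generate PGO_ARCH=$arch PGO_PROFILE_DIR=$dir"
  echo "./bwa-mem3.pgo-instr.$arch mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null"
  echo "make pgo-use PGO_ARCH=$arch PGO_PROFILE_DIR=$dir"
)

for arch in avx2 avx512bw; do
  pgo_plan "$arch"
done
```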

Warning — Profile portability

Profile data collected on one microarchitecture is not portable to a different one. An AVX2 profile collected on a Haswell CPU will not improve — and may pessimize — an AVX-512BW build run on a Sapphire Rapids CPU. Always collect profiles on the same hardware class where the optimized binary will run.

PGO and the single-binary multi-tier build

The PGO targets produce one optimized binary for a single arch= target. They do not yet rebuild the default make single multi-tier binary’s per-tier kernel TUs. If you need PGO across more than one host class, build and profile each arch= variant separately and deploy whichever matches the target fleet — bwa-mem3 version will report the resolved tier so you can confirm. PGO for the in-process multi-tier dispatch path is tracked as a future enhancement.

Relationship to LTO

make lto-build produces a Link-Time Optimization binary; make pgo-use produces a PGO-optimized binary. Both are independent opt-in targets. You can combine them by passing -flto (or -flto=thin for clang) as part of EXTRA_CXXFLAGS during the pgo-use step, but the combination has not been systematically benchmarked. In practice, LTO and PGO each provide modest single-digit gains; their interaction is compiler-specific.

Tuning Checklist

The items below are ordered by expected impact for most workloads. Work through them in sequence; there is little point optimizing output format before confirming you are running the right binary for your CPU.

1. Confirm the resolved SIMD tier matches your CPU

The default make produces a single binary that contains every supported x86 SIMD tier and selects one in process at startup. Verify which tier is running:

bwa-mem3 version
# expect: SIMD floor: <build_floor>; SIMD runtime: <resolved_tier>

If the runtime tier is below what your CPU supports, double-check whether you accidentally built with a lower BASELINE_ARCH= or set BWAMEM3_FORCE_TIER in the environment. Set BWAMEM3_DEBUG_SIMD=1 to get a startup banner on stderr at the start of a mem run.

On ARM / Apple Silicon, the binary has one NEON tier; bwa-mem3 version reports SIMD runtime: neon.

See SIMD dispatch matrix for the full dispatch logic and the minimum CPU requirements for each tier.

Tip — Single-arch deployments

On a cluster where every node has the same CPU, build with make arch=avx2 (or the appropriate ISA). The runtime dispatch overhead is negligible, but a single-arch build trims the binary and removes any chance of BWAMEM3_FORCE_TIER accidentally downgrading throughput in production.

2. Build with PGO if you will run repeatedly

For production pipeline nodes that will process many samples against the same reference, a PGO build provides an additional 2–5% throughput at the cost of one extra build pass and a training run:

make pgo-generate PGO_ARCH=avx2
./bwa-mem3.pgo-instr.avx2 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null
make pgo-use PGO_ARCH=avx2

See PGO build for the full workflow, including multi-arch and profile portability notes.

3. Use shared memory for many small samples

When aligning many samples on one machine against the same reference, loading the index into POSIX shared memory once and reusing it across all mem invocations eliminates redundant I/O and reduces per-sample startup time significantly. The benefit grows with the number of samples and the size of the reference.

# Load the index into shared memory once
bwa-mem3 shm ref.fa

# Align each sample against the in-memory index
bwa-mem3 mem -t 16 ref.fa R1.fq.gz R2.fq.gz | samtools sort -@ 4 -o sample.bam -

# When finished with all samples, drop the shared segment
bwa-mem3 shm -d

Warning — No staleness check

bwa-mem3 shm does not detect whether the on-disk index has changed after the segment was loaded. Always run bwa-mem3 shm -d before re-indexing a reference and re-loading with bwa-mem3 shm. Failing to do so results in alignments against a stale index.

See Getting Started — Shared-memory index and Best Practices — Multi-sample workflows for complete workflows.

4. Emit BAM directly

Use --bam (or --bam=0 for uncompressed BAM) to emit BAM instead of SAM. Uncompressed BAM avoids the text-formatting cost on the aligner side and the text-parsing cost on the downstream side. samtools sort reads BAM natively and is fastest when the input is uncompressed:

bwa-mem3 mem --bam=0 -t 16 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -@ 8 -o out.bam -
samtools index out.bam

The --bam flag (without =0) produces BGZF-compressed BAM. This is useful when writing directly to disk without a downstream piped tool.

See Best Practices — Output format for guidance on when SAM is still appropriate.

5. Pipe to a multi-threaded sorter

Sorting is typically the bottleneck after alignment. Keep a separate thread budget for samtools sort:

bwa-mem3 mem --bam=0 -t 12 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -@ 8 -m 2G -o out.bam -

On a 16-core machine, allocating 12 threads to mem and 8 to samtools sort (with overlap via the pipe) is a common starting point. The aligner is generally CPU-bound; the sorter is I/O-bound during merge. Profile both stages to find the right split for your hardware.

Tip — Thread count tuning

bwa-mem3 mem scales well to 16–32 threads on most workloads. Beyond 32 threads the per-thread work unit becomes small enough that synchronization overhead starts to erode gains. See User Guide — Threading and resource use for thread-scaling data.

Summary table

Item	Action	Reference
Right SIMD tier for CPU	`bwa-mem3 version`; verify `SIMD runtime:`	SIMD dispatch matrix
PGO for production	`pgo-generate` → train → `pgo-use`	PGO build
Shared-memory index	`bwa-mem3 shm ref.fa` before batch runs	Quick start: shm
Emit uncompressed BAM	`--bam=0`	Best Practices — Output format
Multi-threaded sort	`samtools sort -@` with appropriate thread split	User Guide — Threading

Build

This page describes the recommended build configuration for production use of bwa-mem3.

Choose the right arch target

The default make invocation builds a single multi-tier binary on x86 (or a single NEON binary on arm64). For production clusters where the CPU family is uniform, you can trim further by building one tier only — the binary drops the per-tier dispatch table and ships a single kernel path:

# Most modern x86-64 servers (Haswell or later):
make arch=avx2

# Intel Cascade Lake / Sapphire Rapids, AWS c7i/m7i:
make arch=avx512bw

# Apple Silicon / AWS Graviton:
make arch=arm64

Omit arch= if the deployment target is heterogeneous or unknown; the default make produces a single binary that includes every supported x86 tier and dispatches at runtime via __builtin_cpu_supports. Tune the non-kernel TU compile baseline with BASELINE_ARCH= (default avx2) — see Single-binary SIMD dispatch (x86).

See SIMD dispatch matrix for the full list of targets and which kernels each vectorizes.

Profile-Guided Optimization (PGO)

PGO typically yields 2–5% throughput improvement on real workloads. It is opt-in (the standard make target does not use it) but is recommended for any installation that will run many alignment jobs against the same reference.

The workflow is three steps:

# Step 1: Build an instrumented binary (produces bwa-mem3.pgo-instr).
make pgo-generate

# Step 2: Run a representative training workload.
#   Use reads and a reference that reflect actual production input.
#   About 5–10 million read pairs is generally enough.
./bwa-mem3.pgo-instr mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null

# Step 3: Build the PGO-optimized binary (produces bwa-mem3.pgo).
make pgo-use

To target a specific SIMD level, pass PGO_ARCH=:

make pgo-generate PGO_ARCH=avx2
./bwa-mem3.pgo-instr.avx2 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null
make pgo-use PGO_ARCH=avx2
# Produces: bwa-mem3.pgo.avx2

Profile data is written to pgo_profiles/ by default. Pass PGO_PROFILE_DIR=<path> to change the location.

Tip — Training data matters

The training workload should resemble production input in read length, base quality distribution, and reference composition. A read set that is too short, too long, or too easy (low mismatch rate) will bias the branch predictions and may produce a build that is slower than the non-PGO baseline on real data.

mimalloc

mimalloc is compiled in by default (USE_MIMALLOC=1). The allocator improves multi-threaded throughput by reducing lock contention on malloc and free hot paths. Run bwa-mem3 version to confirm it is active:

bwa-mem3 version
# Expected output includes a line like:
#   mimalloc 3.x.x

To build without mimalloc (for example, when using AddressSanitizer or on a system with a known-incompatible allocator):

make USE_MIMALLOC=0

Summary

For a production installation on a known x86 server with AVX2:

make pgo-generate PGO_ARCH=avx2
./bwa-mem3.pgo-instr.avx2 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null
make pgo-use PGO_ARCH=avx2
# Deploy: bwa-mem3.pgo.avx2

Output Format

The choice of output format — SAM, compressed BAM, or uncompressed BAM — has a measurable effect on end-to-end pipeline wall time. This page explains why uncompressed BAM is the right default and shows the recommended pipeline.

Why uncompressed BAM is faster than SAM

When bwa-mem3 writes SAM (the default when --bam is not set), every alignment record must be serialized into ASCII text: integers are formatted as decimal strings, bases are encoded as characters, and flags are written as decimal numbers. The receiving process — typically samtools sort — then parses each field back from text into binary integers. Both conversions are pure overhead: the data is binary inside bwa-mem3 and binary inside samtools; text is only an interchange format that is immediately discarded.

Uncompressed BAM (--bam=0) bypasses this round-trip. bwa-mem3 writes binary BAM records directly via htslib’s wb0 mode. The write path performs no text formatting; the read path in samtools sort performs no text parsing. The htslib overhead of the wb0 write is negligible — it is effectively a buffered write(2) call with a small BAM block header prepended.

Compressed BAM (--bam=1) adds BGZF compression on top, which costs CPU on the write side and gains nothing: the data only passes through a kernel pipe buffer, and samtools sort will re-compress the output anyway. Compressed BAM on a pipe wastes CPU on both sides.

Recommended pipeline

bwa-mem3 mem --bam=0 -t 16 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -@ 8 -o out.bam -
samtools index out.bam

The -@ 8 flag gives samtools sort eight threads for sorting and for compressing the final sorted BAM. Tune this number based on available cores; split the total core count so that alignment threads and sort threads do not contend. A 16:8 split (bwa-mem3:samtools) works well on 24-core machines.

Tip — Thread allocation

Do not give all cores to bwa-mem3. Downstream samtools sort needs threads to compress and write the sorted BAM. Leaving 4–8 threads for samtools sort keeps the pipeline balanced and prevents a write bottleneck that would stall the aligner.

Methylation output

The --meth path always writes uncompressed BAM internally, regardless of the --bam flag. The post-processing step (header rewrite, Bismark XR:Z / XG:Z / XM:Z tag emission, opt-in chimera QC) is performed inline before the record is handed to htslib, so the same pipeline shape applies:

bwa-mem3 mem --meth --bam=0 -t 16 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -@ 8 -o out.bam -
samtools index out.bam

When SAM is appropriate

SAM (the default, equivalent to omitting --bam) remains the right choice for:

Debugging. Plain text is readable with less, grep, and any text editor, making it easy to inspect individual records without samtools view.
Ad-hoc inspection. When you need to scan a few thousand reads to diagnose a mapping problem, piping to SAM and reading the output directly is faster than writing a BAM file and then querying it.
Compatibility with tools that require SAM input. Some legacy tools do not accept BAM. If the downstream tool does not support BAM, use SAM.

For production alignment jobs that feed samtools sort, always use --bam=0.

Summary table

Format	`--bam` value	Pipe overhead	Recommended for
SAM	(default / omit)	High (text round-trip)	Debugging, ad-hoc inspection
Uncompressed BAM	`0`	Negligible	Production pipelines
Compressed BAM	`1`–`9`	High on write side	Writing directly to a file (no downstream sort)

Multi-Sample Workflows

When you need to align many samples back-to-back against the same reference on a single machine, loading the FM-index into shared memory once — and keeping it resident across all alignment jobs — eliminates the index I/O cost for every sample after the first.

The problem: repeated index loads

The bwa-mem3 FM-index for hg38 is approximately 28 GB on disk. Without shared memory, bwa-mem3 mem reads the entire index from disk on every invocation. On a fast NVMe drive this takes 30–60 seconds; on a network-attached or spinning-disk filesystem it can take several minutes. For a batch of 100 samples, that is 50–100 minutes of pure I/O overhead on NVMe and several hours on slower storage.

Staging the index once with `bwa-mem3 shm`

# Stage the index into shared memory (one-time cost, ~28 GB for hg38).
bwa-mem3 shm ref.fa

# Align each sample. bwa-mem3 mem attaches automatically — no extra flag.
bwa-mem3 mem --bam=0 -t 16 ref.fa sample1_R1.fq.gz sample1_R2.fq.gz \
  | samtools sort -@ 4 -o sample1.bam -
bwa-mem3 mem --bam=0 -t 16 ref.fa sample2_R1.fq.gz sample2_R2.fq.gz \
  | samtools sort -@ 4 -o sample2.bam -
# ...

# When done, release the segment.
bwa-mem3 shm -d

For methylation workflows, stage the c2t index instead:

bwa-mem3 shm --meth ref.fa
bwa-mem3 mem --meth --bam=0 -t 16 ref.fa sample1_R1.fq.gz sample1_R2.fq.gz \
  | samtools sort -@ 4 -o sample1.bam -
bwa-mem3 shm -d

Confirming the index is staged

bwa-mem3 shm -l
# Prints the basename and memory usage of each staged segment.

If the listing is empty, the index is not staged and bwa-mem3 mem will fall back to loading from disk.

Thread layout for parallel alignment

Running multiple bwa-mem3 mem instances in parallel is efficient when the samples are independent and the machine has enough cores. The shared-memory index eliminates disk contention, so the bottleneck becomes CPU and memory bandwidth.

Guidelines for N-core machines:

N = 32: Two instances at -t 14 each, with -@ 4 for samtools sort. The aligner threads occupy 28 cores; the remaining 4 are shared by the sort threads, the OS, and I/O.
N = 64: Two to four instances at -t 14 to -t 16, each with -@ 4 for samtools sort.
N = 128: Four to eight instances; keep at least 8–16 cores free for samtools sort threads and OS scheduling.
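
The guidelines above amount to a core budget. As a minimal sketch, the hypothetical helper below (`aligner_threads` is not part of bwa-mem3) reserves some cores for the OS, splits the rest evenly across instances, and subtracts each instance's samtools sort threads to get a `-t` value:

```shell
# Hypothetical helper, not a bwa-mem3 command.
# usage: aligner_threads TOTAL_CORES INSTANCES SORT_THREADS RESERVED
aligner_threads() {
  total=$1; instances=$2; sort_threads=$3; reserved=$4
  # Cores left after the OS reservation, split evenly across instances.
  per_instance=$(( (total - reserved) / instances ))
  # Each instance also runs samtools sort with -@ SORT_THREADS.
  t=$(( per_instance - sort_threads ))
  if [ "$t" -lt 1 ]; then
    echo "not enough cores for $instances instances" >&2
    return 1
  fi
  echo "$t"
}

aligner_threads 32 2 4 4   # prints 10
aligner_threads 64 3 4 4   # prints 16
```

A strict budget like this is more conservative than the guidelines above, which tolerate some overlap between aligner and sort threads. Treat the result as a starting point and measure on your own hardware.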

Tip — Memory bandwidth limit

The FM-index lookup is memory-bandwidth bound. On machines with NUMA topology (multi-socket or multi-chiplet), binding each bwa-mem3 instance to a NUMA node with numactl --cpunodebind=N --membind=N can improve throughput by reducing cross-node memory traffic.

Scripting a batch with a loop

bwa-mem3 shm ref.fa

for sample in sample1 sample2 sample3; do
  bwa-mem3 mem --bam=0 -t 16 ref.fa "${sample}_R1.fq.gz" "${sample}_R2.fq.gz" \
    | samtools sort -@ 4 -o "${sample}.bam" -
  samtools index "${sample}.bam"
done

bwa-mem3 shm -d

For parallel execution, replace the for loop body with a background job (or use a workflow manager such as Snakemake or Nextflow) and limit the degree of parallelism to match available cores.
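
One way to sketch the background-job variant in plain shell is shown below. The names `align_one`, `run_batch`, and `MAX_JOBS` are illustrative, and the thread counts are placeholders to tune. The loop starts up to `MAX_JOBS` samples, waits for that group to finish, then starts the next group.

```shell
# Sketch: run a per-sample worker over many samples, MAX_JOBS at a time.
MAX_JOBS=${MAX_JOBS:-2}

# The worker holds the same pipeline as the loop body above.
align_one() {
  sample=$1
  bwa-mem3 mem --bam=0 -t 14 ref.fa "${sample}_R1.fq.gz" "${sample}_R2.fq.gz" \
    | samtools sort -@ 4 -o "${sample}.bam" - \
    && samtools index "${sample}.bam"
}

# usage: run_batch WORKER sample...
run_batch() {
  worker=$1; shift
  running=0
  for s in "$@"; do
    "$worker" "$s" &            # start one sample in the background
    running=$((running + 1))
    if [ "$running" -ge "$MAX_JOBS" ]; then
      wait                      # let the current group finish
      running=0
    fi
  done
  wait                          # wait for the final partial group
}

# bwa-mem3 shm ref.fa
# run_batch align_one sample1 sample2 sample3
# bwa-mem3 shm -d
```

Waiting for a whole group is simpler than a true job pool but leaves cores idle when one sample in the group is much larger than the others. A workflow manager handles that case better.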

Warning — Stale segment footgun

If you need to re-index the reference (e.g. after updating it), always run bwa-mem3 shm -d before bwa-mem3 index. There is no automatic staleness check. See Anti-patterns for details.

Methylation Defaults

bwa-mem3 mem --meth ships with a set of scoring and filtering defaults that match the bwameth.py reference implementation. This page describes what those defaults are, when to keep them, and when to override them.

What `--meth` sets

When --meth is passed, the following flags are applied automatically in addition to enabling inline c2t conversion and BAM post-processing:

Flag	Value	Purpose
`-B`	`2`	Mismatch penalty. Reduced from the bwa-mem2 default of 4. Bisulfite-treated reads carry C→T and G→A mismatches at converted positions; a lower penalty prevents these from causing spurious soft-clipping or unmapped reads.
`-L`	`10`	Clipping penalty. Increased from the bwa-mem2 default of 5 to discourage clipping of read ends that carry converted bases at positions that look like mismatches.
`-U`	`100`	Unpaired read penalty. Raised from the default of 17; methylation libraries typically have well-defined insert sizes and anomalous pairing usually reflects a mapping artifact.
`-T`	`40`	Minimum alignment score threshold. Raised from the default of 30; this raises the bar to report an alignment, reducing spurious low-quality hits against the doubled reference.
`-CM`	—	Shorthand for `-C` and `-M` together. `-C` appends the FASTA/FASTQ comment to each output record; `-M` marks shorter split hits as secondary (0x100). Both are part of the bwameth.py-compatible defaults that --meth reproduces.

These defaults can all be overridden on the command line. The --meth flag sets them first; any explicit flag that follows overrides the --meth-set value.

When to keep the defaults

For standard whole-genome bisulfite sequencing (WGBS) workflows, the defaults are appropriate as-is. They were derived from the bwameth.py codebase and are expected by most downstream methylation calling tools. Unless you have a specific reason to deviate, use:

bwa-mem3 mem --meth --bam=0 -t 16 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -@ 4 -o out.bam -
samtools index out.bam

When to override

Low-coverage or targeted bisulfite sequencing. If your library covers a small target region and insert sizes are more variable, consider lowering -T (e.g. -T 20) to recover short or soft-clipped alignments in the target.

Amplicon bisulfite sequencing. Amplicon reads have uniform insert sizes, so the default -U 100 is appropriate. However, if your amplicons are short (< 100 bp), consider raising -L above the --meth default of 10 to further discourage clipping at read ends.

Non-standard conversion chemistry. Some library preparations convert only one strand (C→T only, not G→A). In such cases, --set-as-failed r flags alignments to the reverse strand as QC-fail (0x200), so downstream tools can filter out strand-ambiguous alignments:

bwa-mem3 mem --meth --set-as-failed r --bam=0 -t 16 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -@ 4 -o out.bam -

Chimera QC is opt-in (matches Bismark default). bwameth.py applies a chimera heuristic to reads whose longest matching run (CIGAR M/=/X) is less than 44% of the read length: it sets 0x200, clears 0x2, and caps MAPQ at 1. bwa-mem3 --meth does not apply this by default, which matches Bismark, where no such heuristic exists.

If your library is PBAT / scBS-Seq (where intra-fragment chimerism is common) or you want bwameth.py-equivalent flagging, pass --chimera-qc:

bwa-mem3 mem --meth --chimera-qc --bam=0 -t 16 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -@ 4 -o out.bam -
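
To make the 44% rule described above concrete, here is an illustrative check on a single CIGAR string. It is a sketch under two assumptions: a "run" is taken to be a single M, = or X operation, and the read length is passed in explicitly. The function name `is_chimera` is hypothetical, and this is not the code bwa-mem3 runs.

```shell
# Illustrative only: the longest-match < 44% rule for one CIGAR string.
# usage: is_chimera CIGAR READ_LENGTH   (exit status 0 = would be flagged)
is_chimera() {
  cigar=$1; read_len=$2
  # Pull out every M, = or X operation and keep the largest length.
  longest=$(printf '%s\n' "$cigar" \
    | grep -oE '[0-9]+[M=X]' \
    | tr -d 'M=X' \
    | sort -n | tail -n 1)
  longest=${longest:-0}
  # Integer form of: longest < 0.44 * read_len
  [ $(( longest * 100 )) -lt $(( read_len * 44 )) ]
}

if is_chimera 30M70S 100; then echo "30M70S: flagged"; fi
if ! is_chimera 100M 100; then echo "100M: not flagged"; fi
```

With a 100 bp read, the threshold falls at 44 matched bases, so a 30M70S alignment is flagged and a full-length 100M alignment is not.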

Note — Overrides are positional

Flags supplied after --meth on the command line override the defaults set by --meth. For example, bwa-mem3 mem --meth -B 4 ... uses -B 4 (not 2). Flags supplied before --meth are silently overwritten by --meth’s defaults, so always place overrides after --meth.
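
The ordering rule can be pictured with a toy left-to-right parser. This is a model of the behavior described above, not bwa-mem3's actual option handling, and `effective_B` is a hypothetical name.

```shell
# Toy model of "the later flag wins"; NOT bwa-mem3's real parser.
# usage: effective_B ARGS...   prints the mismatch penalty that would apply
effective_B() {
  B=4                          # default without --meth
  while [ $# -gt 0 ]; do
    case $1 in
      --meth) B=2 ;;           # --meth resets -B to its own default
      -B)     B=$2; shift ;;   # an explicit -B replaces whatever came before
    esac
    shift
  done
  echo "$B"
}

effective_B --meth -B 4   # prints 4: the override follows --meth and sticks
effective_B -B 4 --meth   # prints 2: --meth overwrites the earlier -B
```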

Downstream tool compatibility

The --meth output BAM is designed to be a drop-in replacement for the output of the bwameth.py pipeline. The following downstream tools have been used successfully with bwa-mem3 --meth output:

bismark_methylation_extractor, methylKit processBismarkAln, methtuple, DMRfinder, epialleleR — read the Bismark XR:Z, XG:Z, XM:Z tags directly from --meth output.
MethylDackel — reads XG:Z (and ignores the bwameth-convention YD:Z: if present, which --meth no longer emits).
biscuit per-read tools — read XG:Z.

Multi-architecture deployment

This page covers running bwa-mem3 in heterogeneous compute environments — AWS Batch with mixed instance families, GCP Batch with mixed CPU platforms, on-prem Slurm with mixed nodes, Kubernetes clusters with mixed node pools.

Within x86_64: one binary, dynamic dispatch

bwa-mem3 ships a single x86_64 binary that contains five SIMD kernel tiers (sse41, sse42, avx, avx2, avx512bw) and selects the best one at runtime via __builtin_cpu_supports. See src/simd_dispatch.cpp for the dispatcher and src/kernel_dispatch.h for the per-tier symbol mangling.

Build once at the BASELINE_ARCH floor that matches your fleet’s oldest x86 host. The default BASELINE_ARCH=avx2 covers Intel Haswell (2013) and AMD Zen (2017) onward. Within that floor, every host transparently uses its best available tier for the hot kernel paths.

Across x86_64 and arm64

A single ELF binary cannot span CPU families. You must build two binaries — one for x86_64, one for arm64 — and package them so the right one runs on each host.

The recommended approach is to build your own multi-architecture container image: a Docker manifest list with one image per architecture under a single tag. Example:

FROM ubuntu:24.04 AS build
RUN apt-get update && apt-get install -y \
    build-essential git cmake pkg-config \
    autoconf automake autoconf-archive libtool \
    zlib1g-dev
WORKDIR /src
RUN git clone --recursive https://github.com/fg-labs/bwa-mem3 .
RUN make -j

FROM ubuntu:24.04
COPY --from=build /src/bwa-mem3 /usr/local/bin/bwa-mem3
RUN apt-get update && apt-get install -y libgomp1 zlib1g \
 && rm -rf /var/lib/apt/lists/*
ENTRYPOINT ["bwa-mem3"]

Build for both architectures with one command:

docker buildx build --platform linux/amd64,linux/arm64 \
  -t <registry>/<image>:<tag> --push .

AWS Batch, GCP Batch, Kubernetes, and containerd all read the manifest list and pull the image that matches the host’s architecture. The submitter references one tag; the runtime picks the right binary automatically.

Verifying at runtime

bwa-mem3 version reports the build’s floor, the kernels compiled in, and the resolved runtime tier. Use this in CI or in your Batch job’s startup script to confirm the right image was pulled:

$ bwa-mem3 version
v0.2.0-12-gabcdef1
SIMD floor: avx2 (x86-64-v3, Haswell 2013+); kernels: sse41 sse42 avx avx2 avx512bw
SIMD runtime: avx512bw (BWAMEM3_FORCE_TIER unset)

Grep for SIMD runtime: to record the tier each job ran at — useful for post-mortem diagnosis of perf regressions.
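
A small filter can pull out just the tier name for logging. The sketch below assumes the output format shown above; `simd_runtime_tier` is a hypothetical helper name.

```shell
# usage: bwa-mem3 version | simd_runtime_tier
simd_runtime_tier() {
  # Keep only the word after "SIMD runtime:" (for example avx512bw).
  sed -n 's/^SIMD runtime: *\([^ ]*\).*/\1/p'
}

# Example using the sample output above, fed in as text:
printf '%s\n' \
  'v0.2.0-12-gabcdef1' \
  'SIMD floor: avx2 (x86-64-v3, Haswell 2013+); kernels: sse41 sse42 avx avx2 avx512bw' \
  'SIMD runtime: avx512bw (BWAMEM3_FORCE_TIER unset)' \
  | simd_runtime_tier          # prints avx512bw
```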

Pre-Haswell hosts

If your fleet really must include pre-Haswell x86 hosts (for example AWS c3 or m3 instances, or Sandy Bridge and Ivy Bridge Xeons), rebuild with a lower floor:

make BASELINE_ARCH=sse41

Expect roughly 10–15% slower wall time on AVX2 hosts in the same container compared to a default BASELINE_ARCH=avx2 build. This is the trade-off for broader host coverage; only do it if you actually need pre-Haswell support.

The default BASELINE_ARCH=avx2 covers virtually every modern compute environment. AWS, GCP, and Azure all default to Haswell-or-newer instance types in current-generation compute environments.

What the host-floor precheck does

If a job is scheduled onto a host that doesn’t meet the build’s floor (e.g. an avx2-baseline binary lands on a pre-Haswell host), bwa-mem3 mem refuses to run with exit code 2 and a clear stderr message:

[E::bwamem3] this binary was compiled for SIMD floor avx2 and emits avx2
instructions in non-kernel translation units. The host CPU does not support
avx2 (detected: sse42). Running would SIGILL on the first avx2 instruction.

To run on this host, rebuild bwa-mem3 with BASELINE_ARCH=sse42 (or lower),
or use a binary built for a lower SIMD floor.

This is a clean failure: the job exits before any billable alignment work starts. Compare to the alternative without the precheck (SIGILL deep inside an alignment job, opaque process death, wasted compute).

Defence-in-depth recommendation: configure your AWS Batch compute environment (or equivalent) to exclude instance families older than your binary’s floor. The precheck protects against accidental scheduling; an allowlist at the orchestrator level prevents the scheduling decision in the first place.
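
When building such an allowlist, it can help to see which tier a candidate host would reach. The sketch below maps Linux /proc/cpuinfo flag names to the tier names used on this page. The mapping is an assumption for illustration: the real dispatcher uses __builtin_cpu_supports and may require more features per tier, and `best_tier` is a hypothetical helper.

```shell
# usage: best_tier "FLAGS..."   (the words from the "flags" line of /proc/cpuinfo)
best_tier() {
  tier=none
  # Walk tiers from lowest to highest; the last match wins.
  for pair in sse4_1:sse41 sse4_2:sse42 avx:avx avx2:avx2 avx512bw:avx512bw; do
    flag=${pair%%:*}; name=${pair##*:}
    case " $1 " in
      *" $flag "*) tier=$name ;;   # whole-word match, so avx does not match avx2
    esac
  done
  echo "$tier"
}

best_tier "fpu sse4_1 sse4_2 avx"        # prints avx
best_tier "fpu sse4_1 sse4_2 avx avx2"   # prints avx2
# On a Linux x86 host:
# best_tier "$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)"
```

A host that reports a tier below the build's floor belongs off the allowlist for that image.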

Anti-Patterns

This page documents common mistakes that produce incorrect results or unnecessary failures when using bwa-mem3.

Re-indexing without dropping the shared-memory segment

Warning — Footgun

bwa-mem3 shm does not detect stale segments. If you re-run bwa-mem3 index after a shared-memory segment is already staged, the on-disk index files will not match the in-memory segment. bwa-mem3 mem will attach to the stale segment and produce incorrect alignments without any warning.

Always run bwa-mem3 shm -d before re-indexing:

bwa-mem3 shm -d           # drop all staged segments
bwa-mem3 index ref.fa     # rebuild the on-disk index
bwa-mem3 shm ref.fa       # re-stage the new index

There is no automatic staleness check in the implementation. The segment name is derived from the reference basename only; no content hash or modification timestamp is stored.

To confirm that no stale segments are staged, use bwa-mem3 shm -l before running any indexing step.
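
The check-then-drop sequence can be wrapped in a small guard. This sketch assumes that bwa-mem3 shm -l prints nothing on stdout when no segment is staged; verify that on your build before relying on it. `safe_reindex` and the `BWAMEM3` variable are hypothetical names.

```shell
# Assumption: `bwa-mem3 shm -l` prints nothing when no segment is staged.
BWAMEM3=${BWAMEM3:-bwa-mem3}    # overridable so the function can be dry-run

# usage: safe_reindex ref.fa
safe_reindex() {
  ref=$1
  if [ -n "$("$BWAMEM3" shm -l)" ]; then
    echo "staged segment found; dropping before re-index" >&2
    "$BWAMEM3" shm -d || return 1
  fi
  # Rebuild the on-disk index, then re-stage it.
  "$BWAMEM3" index "$ref" && "$BWAMEM3" shm "$ref"
}

# safe_reindex ref.fa
```

Because shm -d drops all staged segments, run this only when no alignment jobs on the host are still attached.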

Forgetting to initialize submodules

bwa-mem3 depends on several submodules (ext/htslib, ext/safestringlib, ext/libsais, ext/mimalloc, ext/sse2neon). A clone made without --recursive leaves those directories empty, and the build then fails with missing headers at compile time or missing symbols at link time.

Warning — Missing submodules

Always clone with --recursive, or initialize submodules after cloning:

git clone --recursive https://github.com/fg-labs/bwa-mem3
# or, after a non-recursive clone:
git submodule update --init --recursive

If make reports missing headers (e.g. htslib/hts.h: No such file or directory), the submodules were not initialized.

Leaving `BASELINE_ARCH` at the default on a known higher-tier CPU

The default make (no arch=) builds the multi-tier single binary with non-kernel TUs compiled at BASELINE_ARCH=avx2. On a production server with a known higher-tier CPU family, this leaves auto-vectorized non-kernel hot paths at 256-bit width when the host could go wider, or keeps the host-floor precheck at avx2 when the deployment surface is strictly AVX-512. Pass BASELINE_ARCH= (or build a single-tier binary with arch=) to align the build with the deployment:

Warning — Suboptimal build on known hardware

# Single multi-tier binary with non-kernel TUs at the host's tier:
make BASELINE_ARCH=avx512bw      # Cascade Lake / Ice Lake / Sapphire Rapids / Zen 4

# Single-tier binary (no dispatch table; smallest install) when the cluster
# is uniform and you don't need cross-tier portability:
make arch=avx2                   # Haswell and later x86
make arch=avx512bw               # Cascade Lake / Sapphire Rapids
make arch=arm64                  # Apple Silicon / AWS Graviton

The default (make with no overrides) is appropriate when the binary will be distributed across multiple CPU families or when the target CPU is genuinely unknown. Note that BASELINE_ARCH=avx512bw does not always win over avx2 even on AVX-512 hosts; see BASELINE_ARCH=avx512bw build flag for the empirical perf characterization.

See SIMD dispatch matrix for the full set of targets and the in-process dispatch architecture.

Mixing bwa-mem3 and bwa-mem2 outputs in the same pipeline

bwa-mem3 adds several custom SAM tags that bwa-mem2 does not emit: HN:i (total number of primary alignments — both reported and suppressed — that the aligner found for this read, before the -h supplementary cap is applied), and — in --meth mode — the Bismark-compatible XR:Z (read conversion direction), XG:Z (genome strand), and XM:Z (per-base methylation call string) tags. It also rewrites @SQ header lines in --meth mode (collapsing f/r strand prefixes back to one entry per chromosome).

Warning — Header and tag mismatch

Do not merge BAM files produced by bwa-mem3 and bwa-mem2 without verifying that the @PG headers and custom tags are handled correctly by the downstream tool. In methylation workflows, a bwa-mem2 BAM mixed into a bwa-mem3 --meth pipeline will be missing the XR:Z / XG:Z / XM:Z Bismark annotations, which will cause methylation callers to silently drop or misclassify those records.

If you must merge outputs from both tools, run samtools view -H on both files and confirm that @SQ lines are consistent and that the downstream tool can tolerate the tag differences.

Writing compressed BAM to a pipe

Passing --bam=1 (compressed BAM) when piping to samtools sort compresses the stream on the bwa-mem3 side and then immediately decompresses it on the samtools side. This wastes CPU on both ends with no benefit.

Use --bam=0 (uncompressed BAM) for all pipe-to-sort workflows. See Output format for the full explanation and recommended pipeline.

CLI Reference Overview

bwa-mem3 exposes four subcommands: index, mem, shm, and version. Run bwa-mem3 <subcommand> --help to see the full option list for any command.

How this section is structured

Each subcommand page follows the same layout:

Introduction — what the subcommand does and when to reach for it.
Synopsis — the verbatim --help output, auto-captured from the binary at build time and included here via mdbook’s {{#include}} directive. The snippet is regenerated by make docs-cli and CI fails if it drifts from the binary.
Common usage — two or three worked command-line examples.
Flag reference (for mem, grouped by topic) — per-flag prose covering semantics, defaults, and interaction with other flags that the --help text does not have room to explain.
Notes / Gotchas — operational warnings about non-obvious behavior.
See also — cross-links to related pages in this book.

Subcommands

index builds the FM-index from a reference FASTA. Pass --meth to produce a bwameth-style doubled c2t reference for methylation alignment.

mem aligns short reads against an indexed reference, producing SAM or BAM output. It is the primary alignment subcommand. The flag surface is large; the mem reference page groups flags by purpose to make them easier to navigate.

shm stages an FM-index into POSIX shared memory so that repeated bwa-mem3 mem invocations on the same machine skip the per-run disk read. It also lists and destroys staged segments.

version prints the bwa-mem3 release version and, when mimalloc is compiled in, the mimalloc version. It also reports the build’s SIMD floor, the kernels compiled in, and the resolved runtime tier.

index

bwa-mem3 index builds the FM-index (BWT + suffix array) that bwa-mem3 mem requires for alignment. Run it once per reference; the resulting files sit alongside the input FASTA and are reused for all subsequent alignment jobs. Pass --meth to produce a bwameth-compatible doubled c2t reference for bisulfite-seq alignment.

Synopsis

Usage: bwa-mem3 index [-p prefix] [-t N] [--max-memory SIZE] [--tmp-dir PATH] [--meth] <in.fasta>

  -p STR             output prefix (default: <in.fasta>)
  -t INT             worker threads [auto: detected cores, cgroup-aware]
  --max-memory SIZE  peak memory budget; SIZE accepts a G/M/K suffix
                     (case-insensitive) or bare bytes
                     [auto: min(50% of RAM, 32G), cgroup-aware]
  --tmp-dir PATH     scratch directory [$TMPDIR]
  --meth             build a bwameth-style doubled c2t reference + FMI.
                     Writes <in.fasta>.bwameth.c2t and the FMI alongside it.
                     Use with `bwa-mem3 mem --meth <in.fasta> R1.fq [R2.fq]`.
  -h, --help         print this help message and exit

Common usage

Build a standard index using all available cores:

bwa-mem3 index ref.fa

Build a methylation-aware index (required before bwa-mem3 mem --meth):

bwa-mem3 index --meth ref.fa

Limit peak RAM to 16 GB and write scratch data to /scratch:

bwa-mem3 index --max-memory 16G --tmp-dir /scratch ref.fa

Flag reference

`-p STR` — output prefix

By default, index files are written alongside <in.fasta> using the FASTA path as a prefix (e.g. ref.fa.bwt.2bit.64, ref.fa.0123, etc.). Use -p to write them to a different base path, such as a dedicated index directory:

bwa-mem3 index -p /idx/hg38 ref.fa
# writes /idx/hg38.bwt.2bit.64, /idx/hg38.0123, …
# align with: bwa-mem3 mem /idx/hg38 R1.fq R2.fq

`-t INT` — worker threads

Controls the number of threads used during index construction. The default auto-detects available cores and is cgroup-aware, so it behaves correctly inside containers and on shared cluster nodes. Set explicitly when you want to cap CPU usage.

`--max-memory SIZE` — peak memory budget

Limits how much RAM the indexer may use at once. SIZE accepts a G, M, or K suffix (case-insensitive) or a bare byte count. The default is min(50% of RAM, 32 GB), computed in a cgroup-aware manner.

For large references (hg38 and above) on machines with limited RAM, setting this to a value lower than the reference size causes the indexer to partition work and use --tmp-dir for intermediate files, at the cost of extra I/O.
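
The default budget formula, min(50% of RAM, 32 GB), works out as follows. This is a sketch in whole gigabytes for illustration only; the real computation is cgroup-aware, and `default_budget_gb` is a hypothetical name.

```shell
# usage: default_budget_gb RAM_GB   (whole gigabytes, for illustration)
default_budget_gb() {
  half=$(( $1 / 2 ))
  # Take the smaller of half the RAM and the 32 GB ceiling.
  if [ "$half" -lt 32 ]; then echo "$half"; else echo 32; fi
}

default_budget_gb 16    # prints 8
default_budget_gb 256   # prints 32
```

On hosts with 64 GB of RAM or more the ceiling applies, so raise --max-memory explicitly if you want the indexer to use more.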

`--tmp-dir PATH` — scratch directory

Scratch directory for intermediate files when memory is partitioned. Defaults to $TMPDIR. Point this at a fast local disk (NVMe or ramdisk) to minimize wall-clock time when --max-memory forces partitioned construction.

`--meth` — build a methylation (c2t) index

Writes a bwameth-style doubled reference — <in.fasta>.bwameth.c2t — and builds the FM-index over that file rather than the original FASTA. The c2t file and its index files are placed alongside the original FASTA.

Pass the original FASTA prefix (not the .bwameth.c2t path) to all three commands: index, shm, and mem. The .bwameth.c2t suffix is appended automatically when --meth is present.

Notes / Gotchas

Tip — Index once, align many times

Index construction for hg38 takes several minutes and ~28 GB of disk. Build the index once and store it on shared storage; all alignment jobs on the same reference share the same index files.

Warning — --meth index is not interchangeable with the standard index

A --meth index is built over the c2t reference and cannot be used for normal (non-bisulfite) alignment. Keep separate index directories if you align both standard and bisulfite samples to the same reference.

mem

bwa-mem3 mem aligns short DNA reads against an indexed reference genome using the BWA-MEM algorithm. It accepts one or two FASTQ files (single-end or paired-end) and writes alignments to stdout in SAM or BAM format. It is the primary alignment subcommand; nearly all bwa-mem3 usage flows through it.

Synopsis

Usage: bwa-mem3 mem [options] <idxbase> <in1.fq> [in2.fq]
Options:
  Algorithm options:
    -o STR        Output SAM file name
    --bam[=N]     Emit BAM instead of SAM text. N=0 (default) = uncompressed;
                  1..9 = BGZF deflate levels. Writes to stdout; redirect with `>`.
    -t INT        number of threads [1]
    -k INT        minimum seed length [19]
    -w INT        band width for banded alignment [100]
    -d INT        off-diagonal X-dropoff [100]
    -r FLOAT      look for internal seeds inside a seed longer than {-k} * FLOAT [1.5]
    -y INT        seed occurrence for the 3rd round seeding [20]
    -c INT        skip seeds with more than INT occurrences [500]
    -D FLOAT      drop chains shorter than FLOAT fraction of the longest overlapping chain [0.50]
    -W INT        discard a chain if seeded bases shorter than INT [0]
    -m INT        perform at most INT rounds of mate rescues for each read [50]
    -S            skip mate rescue
    -P            skip pairing; mate rescue performed unless -S also in use
Scoring options:
   -A INT        score for a sequence match, which scales options -TdBOELU unless overridden [1]
   -B INT        penalty for a mismatch [4]
   -O INT[,INT]  gap open penalties for deletions and insertions [6,6]
   -E INT[,INT]  gap extension penalty; a gap of size k cost '{-O} + {-E}*k' [1,1]
   -L INT[,INT]  penalty for 5'- and 3'-end clipping [5,5]
   -U INT        penalty for an unpaired read pair [17]
Input/output options:
   -p            smart pairing (ignoring in2.fq)
   -R STR        read group header line such as '@RG\tID:foo\tSM:bar' [null]
   -H STR/FILE   insert STR to header if it starts with @; or insert lines in FILE [null]
   -j            treat ALT contigs as part of the primary assembly (i.e. ignore <idxbase>.alt file)
   -5            for split alignment, take the alignment with the smallest coordinate as primary
   -q            don't modify mapQ of supplementary alignments
   -K INT        process INT input bases in each batch regardless of nThreads (for reproducibility) []
   -v INT        verbose level: 1=error, 2=warning, 3=message, 4+=debugging [3]
   -T INT        minimum score to output [30]
   -h INT[,INT]  if there are <INT hits with score >80.00% of the max score, output all in XA [5,200]
   -z FLOAT      the fraction of the max score to use with -h [0.80]
   -u            output XB instead of XA; XB is XA with the alignment score and mapping quality added
   -a            output all alignments for SE or unpaired PE
   -C            append FASTA/FASTQ comment to SAM output
   -V            output the reference FASTA header in the XR tag
   -Y            use soft clipping for supplementary alignments
   -M            mark shorter split hits as secondary
   -I FLOAT[,FLOAT[,INT[,INT]]]
                 specify the mean, standard deviation (10% of the mean if absent), max
                 (4 sigma from the mean if absent) and min of the insert size distribution.
                 FR orientation only. [inferred]
Bisulfite (--meth) options:
   --meth        enable inline bwameth-style C→T/G→A read conversion + meth-aware BAM
                 emission. Implies --bam. Requires the reference to have been built
                 with `bwa-mem3 index --meth` (emits ref.fa.bwameth.c2t).
   --set-as-failed f|r
                 flag alignments to the matching strand ('f' or 'r') as QC-fail (0x200)
   --chimera-qc
                 enable the bwameth.py-style longest-match <44% chimera heuristic
                 (sets 0x200, clears 0x2, caps MAPQ at 1). Off by default; not in Bismark.
Supplementary MAPQ rescoring (fg-labs extension):
   --supp-rep-hard-cap INT
                 force MAPQ=0 for supplementary alignments whose chain contains any seed
                 with >=INT genome occurrences (i.e. the supp region is repetitive on its
                 own). 0 disables (default). Typical values 5-20; lower = more aggressive.
                 Primary MAPQ is unaffected.
Help:
   --help        print this help message and exit
Note: Please read the man page for detailed description of the command line and options.

Common usage

Paired-end alignment, 16 threads, SAM to stdout:

bwa-mem3 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > out.sam

Paired-end alignment, emit uncompressed BAM, pipe directly to samtools sort:

bwa-mem3 mem --bam -t 16 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -@ 8 -o out.bam -
samtools index out.bam

Paired-end methylation alignment with a read group header:

bwa-mem3 mem --meth -t 16 \
  -R '@RG\tID:lib1\tSM:sample1\tPL:ILLUMINA' \
  ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -o out.bam -

Flag reference

Input / output

`-o STR` — output file

Write output to STR instead of stdout. Honored for both SAM and --bam output; the path is opened lazily so BAM mode can hand it to htslib instead of truncating it as a SAM-text file. Stdout redirection (>) remains an alternative.

`--bam[=N]` — emit BAM

Emit BAM instead of SAM. N controls BGZF compression: 0 (default when --bam is used without =) writes uncompressed BAM, which costs almost no CPU and is the recommended mode for piping to samtools sort. Values 1–9 select increasing BGZF deflate levels; use --bam=6 or --bam=9 only when writing directly to final storage without a downstream sort step.

Tip — Prefer --bam for production pipelines

Uncompressed BAM (--bam or --bam=0) eliminates the text-formatting cost on the aligner side and the text-parse cost on the samtools sort side. For any pipeline that immediately sorts or processes the output, this is faster than SAM at no quality cost.

`-R STR` — read group header

Injects a @RG header line and tags every alignment with RG:Z:<ID>. The value is a tab-separated @RG line with literal \t escapes, for example:

-R '@RG\tID:run1\tSM:HG001\tPL:ILLUMINA\tLB:lib1'

bwa-mem3 escapes any literal tab characters inside -R values before writing them to the @PG CL: field, preventing header corruption (fix for issue #45).
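
When the read group is assembled in a script, printf is a convenient way to make sure the value carries backslash-t escapes rather than real tab characters. A minimal sketch follows; `make_rg` is a hypothetical helper and the PL value is hard-coded for brevity.

```shell
# usage: make_rg ID SAMPLE LIBRARY   -> prints an -R value with literal \t escapes
make_rg() {
  # The doubled backslashes in the format produce a backslash followed by "t",
  # not a tab character; %s copies each argument verbatim.
  printf '@RG\\tID:%s\\tSM:%s\\tPL:ILLUMINA\\tLB:%s\n' "$1" "$2" "$3"
}

make_rg run1 HG001 lib1
# prints @RG\tID:run1\tSM:HG001\tPL:ILLUMINA\tLB:lib1

# bwa-mem3 mem -R "$(make_rg run1 HG001 lib1)" ref.fa R1.fq.gz R2.fq.gz
```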

`-H STR/FILE` — extra header lines

If STR begins with @, it is injected verbatim as a header line. Otherwise STR is treated as a path and every line in the file is injected. Useful for adding @CO comments or custom @RG / @PG entries.

`-p` — smart pairing

Reads interleaved paired-end data from a single FASTQ file (in1.fq) rather than two separate files. The second positional argument (in2.fq) is ignored.

`-5` — leftmost-coordinate primary

For split alignments, designates the alignment with the smallest genomic coordinate as primary, rather than the longest alignment. Useful for some downstream tools that expect the leftmost alignment to be primary.

`-q` — preserve supplementary MAPQ

By default, bwa-mem3 may downgrade the MAPQ of supplementary alignments. -q suppresses that adjustment.

`-K INT` — fixed batch size

Forces each thread batch to process exactly INT input bases regardless of the number of threads. Useful when you need bit-for-bit reproducible output across runs with different -t values: fix -K to the same value and the output is deterministic.
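
One way to check this claim on your own data is to align the same input twice with the same -K and different -t, then compare the outputs. The sketch below ignores @PG lines, on the assumption that they record the command line and so differ between the two runs. `same_alignments` is a hypothetical helper and the -K value is only an example.

```shell
# usage: same_alignments A.sam B.sam   (exit status 0 = identical apart from @PG)
same_alignments() {
  a=$(mktemp); b=$(mktemp)
  # @PG records the command line, which differs when -t differs, so drop it.
  grep -v '^@PG' "$1" > "$a" || true
  grep -v '^@PG' "$2" > "$b" || true
  if cmp -s "$a" "$b"; then rc=0; else rc=1; fi
  rm -f "$a" "$b"
  return "$rc"
}

# bwa-mem3 mem -K 100000000 -t 4  ref.fa R1.fq.gz R2.fq.gz > t4.sam
# bwa-mem3 mem -K 100000000 -t 16 ref.fa R1.fq.gz R2.fq.gz > t16.sam
# same_alignments t4.sam t16.sam && echo "identical"
```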

`-v INT` — verbosity

Controls stderr diagnostic output: 1 = errors only, 2 = warnings, 3 = informational messages (default), 4+ = debugging.

`-a` — all alignments

Output all alignments found for single-end or unpaired paired-end reads. The additional alignments are flagged as secondary.

`-C` — append FASTA/FASTQ comment

Appends the comment field from the FASTA/FASTQ header to the SAM output as an additional column. Useful when the comment carries barcodes or UMIs.

`-V` — reference header in XR tag

Emits the reference FASTA header line for each alignment position as an XR SAM tag.

`-Y` — soft-clip supplementary alignments

Uses soft clipping instead of hard clipping for supplementary alignments. Some downstream tools require this.

`-M` — mark shorter split hits as secondary

Marks the shorter alignment in a split read as secondary (sets 0x100 flag) rather than supplementary. Required for compatibility with tools that do not handle supplementary alignments (e.g. Picard’s duplicate-marking before certain versions).

`-j` — treat ALT contigs as primary

Treats ALT contigs as part of the primary assembly by ignoring the <idxbase>.alt file. Use when your workflow does not include ALT-aware postprocessing.

Scoring

All scoring flags accept integer values. Changing -A (match score) scales the penalty flags that default to multiples of -A; explicit overrides of individual flags are unaffected.

Flag	Default	Meaning
`-A INT`	1	Score for a sequence match. Scales `-T`, `-d`, `-B`, `-O`, `-E`, `-L`, `-U` unless overridden.
`-B INT`	4	Mismatch penalty.
`-O INT[,INT]`	6,6	Gap open penalty for deletions and insertions respectively.
`-E INT[,INT]`	1,1	Gap extension penalty per base. A gap of length k costs `-O + -E * k`.
`-L INT[,INT]`	5,5	Clipping penalty for 5’ and 3’ ends.
`-U INT`	17	Penalty for an unpaired read pair (affects mate-rescue scoring).
`-T INT`	30	Minimum alignment score to output. Alignments below this threshold are not reported.
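
As a quick worked example of the gap formula in the table, the following sketch evaluates `-O + -E * k` for the default penalties. `gap_cost` is a hypothetical helper name.

```shell
# usage: gap_cost OPEN EXTEND LENGTH
gap_cost() { echo $(( $1 + $2 * $3 )); }

gap_cost 6 1 1    # prints 7:  a 1-base gap with the defaults -O 6 -E 1
gap_cost 6 1 10   # prints 16: a 10-base gap with the same penalties
```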

Note — --meth overrides scoring defaults

When --meth is active, bwa-mem3 applies bwameth.py-compatible defaults: -B 2 -L 10 -U 100 -T 40 -CM. Any of these can still be overridden by passing the flag explicitly after --meth.

Paired-end

`-I FLOAT[,FLOAT[,INT[,INT]]]` — insert size distribution

Specifies the mean, standard deviation (default: 10% of mean), maximum (default: 4 sigma above mean), and minimum of the insert size distribution for FR-orientation paired-end reads. By default bwa-mem3 infers these parameters from the first batch of reads. Provide them explicitly for speed or when the reference is short and inference may be inaccurate.

`-m INT` — mate rescue rounds

Maximum number of mate-rescue attempts per read. Reduce to speed up alignment on data where the default (50) wastes time on unrescuable pairs.

`-S` — skip mate rescue

Disables mate rescue entirely. Faster but may reduce sensitivity for discordant pairs.

`-P` — skip pairing

Skips the pairing step; mate rescue still runs unless -S is also given.

Filtering

`-c INT` — skip repetitive seeds

Seeds with more than INT occurrences in the reference are skipped. Lowering this (e.g. to 50) speeds up alignment of highly repetitive reads but may reduce sensitivity. Raising it increases sensitivity in repeat-heavy regions at a cost in runtime.

`-D FLOAT` — chain length fraction

Drops chains shorter than FLOAT times the longest overlapping chain. The default (0.50) discards chains that are less than half the length of the best chain.

`-W INT` — minimum seeded bases

Discards chains with fewer than INT seeded bases. Raising this filters out very short, low-confidence chains.

`-h INT[,INT]` — secondary alignment reporting

If there are fewer than INT hits with score exceeding FLOAT (see -z) times the maximum score, all of them are output in the XA auxiliary tag. The second integer is a hard cap on the number of XA entries. Defaults: 5, 200.

`-z FLOAT` — secondary score fraction

Fraction of the maximum alignment score used as the threshold for secondary hit reporting with -h. Default: 0.80.

`-u` — emit XB instead of XA

Outputs XB in place of XA. XB is an extension of XA that also carries the alignment score and mapping quality for each secondary hit.

Methylation (`--meth`)

`--meth` — enable bisulfite alignment mode

Activates inline C→T (R1) and G→A (R2) read conversion, bwameth-compatible scoring defaults, inline BAM post-processing, and forces --bam output. The reference must have been indexed with bwa-mem3 index --meth.

Pass the original FASTA prefix as <idxbase> — the .bwameth.c2t suffix is appended automatically. If <idxbase> already ends in .bwameth.c2t (interop with an external c2t converter), the auto-append is skipped.

See Methylation Reference for the full treatment.

`--set-as-failed {f|r}` — strand QC-fail flag

Forces the QC-fail bit (0x200) on all alignments to the forward (f) or reverse (r) bisulfite strand. Used when one strand is known to be unreliable for a given library preparation.

`--chimera-qc` — opt in to bwameth.py-style chimera heuristic

Off by default (matches Bismark, which has no equivalent heuristic). When set, mapped records whose longest M/=/X CIGAR run is less than 44 % of the read length get 0x200 set, 0x2 cleared, and MAPQ capped at 1. Useful for PBAT / scBS-Seq libraries where intra-fragment chimerism is common, or when reproducing bwameth.py output bit-for-bit.

Threading

`-t INT` — number of threads

Number of worker threads. Defaults to 1. Set to the number of physical cores available to this job. Scaling is workload- and hardware-dependent: on typical machines the curve flattens around 16–32 threads (FM-index bandwidth and I/O contention dominate); on high-memory / fast-I/O servers the aligner can keep scaling toward ~64 threads on hg38 before saturating.

See User Guide — Threading and resource use for measured guidance and thread-count recommendations at various machine sizes.

Supplementary MAPQ rescoring

`--supp-rep-hard-cap INT` — cap MAPQ for repetitive supplementary alignments

Forces MAPQ=0 for supplementary alignments whose chain contains any seed with at least INT occurrences in the genome. This targets supplementary alignments anchored in repetitive regions that upstream MAPQ scoring may overestimate. 0 disables the cap (default). Typical values are 5–20; lower values are more aggressive. Primary alignment MAPQ is unaffected.
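The rule can be pictured with the small sketch below. It is a simplified model of the behavior described above, with hypothetical names (`capped_supp_mapq`, `seed_occurrences`); the real implementation works on the aligner's internal chain and seed structures.

```python
def capped_supp_mapq(mapq, seed_occurrences, hard_cap=0):
    """Return the MAPQ of a supplementary alignment after the repeat cap.

    seed_occurrences: genome-wide occurrence count of each seed in the chain.
    hard_cap: value of --supp-rep-hard-cap; 0 disables the cap.
    """
    if hard_cap > 0 and any(occ >= hard_cap for occ in seed_occurrences):
        return 0
    return mapq


print(capped_supp_mapq(37, [1, 2, 12], hard_cap=10))
```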

Debug

`-k INT` — minimum seed length

Minimum exact-match seed length. Shorter seeds increase sensitivity but raise runtime. The default (19) is calibrated for 100–150 bp Illumina reads.

`-w INT` — band width

Band width for the banded Smith-Waterman extension. Wider bands can recover alignments with long indels at greater CPU cost.

`-d INT` — X-dropoff

Off-diagonal X-dropoff for the Z-drop heuristic. Controls how far an alignment extension continues after a score drop.

`-r FLOAT` — re-seeding factor

Seeds longer than -k * FLOAT are re-seeded internally to find sub-seeds. Lowering this produces more seeds and higher sensitivity at greater cost.

`-y INT` — third-round seed occurrence threshold

Seed occurrence threshold for the third round of seeding. Rarely needs adjustment outside highly repetitive genomes.

Notes / Gotchas

Warning — --meth requires a --meth index

Running bwa-mem3 mem --meth against a standard (non-c2t) index produces incorrect alignments without an error. Confirm that the index was built with bwa-mem3 index --meth before aligning bisulfite data.

Note — SIMD variant printed to stderr at startup

When mem starts it prints a banner (Executing in AVX512 mode!! etc.) to stderr. This is informational and does not affect stdout output.

shm

bwa-mem3 shm stages an FM-index into POSIX shared memory so that subsequent bwa-mem3 mem invocations on the same machine attach to the in-memory segment instead of re-reading the index files from disk. For workloads that align many small samples back-to-back against the same reference — such as clinical panels or amplicon sequencing — this removes the dominant I/O bottleneck. shm also lists and destroys staged segments.

Synopsis


Usage: bwa-mem3 shm [-d|-l|--help] [--meth] [idxbase]

Options:
  -d        destroy all indices in shared memory (matches bwa v1 behavior)
  -l        list names of indices in shared memory
  --meth    stage a `bwa-mem3 index --meth` index — auto-appends
            `.bwameth.c2t` to <idxbase>, mirroring `mem --meth`
  -h --help print this help and exit

Stage with no flags: `bwa-mem3 shm <idxbase>` loads the index into
POSIX shared memory; subsequent `bwa-mem3 mem <idxbase> ...` runs
auto-attach instead of re-reading from disk. For meth indices, pass
the same plain `<idxbase>` to all three commands plus `--meth` on
`index`, `shm`, and `mem` (the c2t suffix is auto-appended).

Footgun: if you re-build the index, run `bwa-mem3 shm -d` first.
There is no staleness check -- a stale segment will silently mis-align.

Stuck-lock recovery: concurrent stagers are serialized by a named
       POSIX semaphore. If a stager is kill -9'd mid-stage, the lock
       persists and subsequent stages block forever. `bwa-mem3 shm -d`
       unlinks the semaphore alongside the registry; rerun afterwards.

macOS: POSIX shm has implementation-defined per-segment caps; large
       indices may simply fail to stage. Prefer Linux for production.
Linux: /dev/shm defaults to ~50% of RAM on bare metal; in containers
       it is often much smaller and may need raising via --shm-size
       (Docker) or an emptyDir tmpfs (Kubernetes).

Common usage

Stage a standard index, align two samples, then release the segment:

bwa-mem3 shm ref.fa
bwa-mem3 mem -t 16 ref.fa sample1_R1.fq sample1_R2.fq > sample1.sam
bwa-mem3 mem -t 16 ref.fa sample2_R1.fq sample2_R2.fq > sample2.sam
bwa-mem3 shm -d

Stage a methylation index and align:

bwa-mem3 shm --meth ref.fa
bwa-mem3 mem --meth -t 16 ref.fa R1.fq R2.fq | samtools sort -o out.bam -
bwa-mem3 shm -d

List all currently staged segments:

bwa-mem3 shm -l

Flag reference

(no flags) `<idxbase>` — stage an index

Loads all index files for <idxbase> into a POSIX shared-memory segment. After staging, any bwa-mem3 mem <idxbase> ... on the same machine auto-attaches and reads from memory rather than disk.

`-d` — destroy all segments

Removes every bwa-mem3 shared-memory segment on the machine. This is the correct clean-up command after a batch job and the required step before re-building the index (see the footgun warning below).

`-l` — list staged indices

Prints the names of all currently staged segments. Useful to confirm that staging succeeded before launching alignment jobs.

`--meth` — stage a methylation index

Auto-appends .bwameth.c2t to <idxbase> before staging, mirroring the behavior of bwa-mem3 index --meth and bwa-mem3 mem --meth. Pass the same plain <idxbase> to all three commands; the c2t suffix is handled transparently.

Notes / Gotchas

Warning — No staleness check — always destroy before re-indexing

There is no staleness check. If you re-run bwa-mem3 index ref.fa after staging, the on-disk index files will not match the in-memory segment, but bwa-mem3 mem will still attach to the stale segment and silently produce incorrect alignments. Always run bwa-mem3 shm -d before re-indexing.

Note — Platform limits

macOS: POSIX shared memory has implementation-defined per-segment size caps. Staging a full hg38 index (~28 GB) may fail silently or with a cryptic error. Prefer Linux for production use with large references.

Linux containers: /dev/shm typically defaults to ~50% of physical RAM on bare metal but is often much smaller inside Docker containers or Kubernetes pods. Raise the limit with --shm-size (Docker) or an emptyDir tmpfs volume with an explicit size (Kubernetes) before attempting to stage a large index.

Note — /dev/shm capacity preflight (PR #86)

Before opening the segment, bwa-mem3 shm calls statvfs("/dev/shm") and compares the available bytes against the index’s total_size. If /dev/shm is too small the stage aborts cleanly with an [E::bwa_shm_stage] message that names /dev/shm, the required size, and a mount -o remount,size=... hint. This replaces the previous failure mode where ftruncate succeeded lazily and pack_into later surfaced ENOSPC as [fread] Bad address with no indication that /dev/shm was the cause. The preflight is best-effort: a statvfs failure (no /dev/shm, restricted sandbox, ENOSYS) is non-fatal and the stage proceeds. As a rough sizing guide, hg38 stages ~17 GB; AWS instances default to RAM/2 (so c7a.4xlarge / c7i.4xlarge at 32 GB get ~16 GB of /dev/shm, which is just under the index size — a remount,size=28g is the documented fix).
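A wrapper script can run a similar check before it launches a staging job. The sketch below uses Python's `os.statvfs` and mirrors the best-effort behavior described above; `shm_has_room` is a hypothetical helper, not something bwa-mem3 ships, and the byte count you pass in is your own estimate of the staged size.

```python
import os


def shm_has_room(required_bytes, mount="/dev/shm"):
    """Best-effort capacity check: True if `mount` can hold required_bytes."""
    try:
        st = os.statvfs(mount)
    except (OSError, AttributeError):
        # Missing mount, restricted sandbox, or no statvfs on this platform:
        # treat as non-fatal and let the stage proceed.
        return True
    available = st.f_bavail * st.f_frsize  # bytes available to unprivileged users
    return available >= required_bytes


if not shm_has_room(17 * 1024**3):
    print("/dev/shm looks too small; consider remounting with a larger size")
```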

Note — Stuck-lock recovery

Concurrent bwa-mem3 shm <prefix> invocations are serialized by a named POSIX semaphore (/bwactl_lock) so the registry stays consistent. POSIX semaphores have no SEM_UNDO equivalent: if a stager segfaults or is kill -9’d while holding the lock, every subsequent stage will block in sem_wait forever. Run bwa-mem3 shm -d to recover — it unlinks the semaphore alongside the registry, freeing the next stager.

version

bwa-mem3 version prints the release version, the build’s compiled-in SIMD floor, the SIMD tier resolved at runtime, and (when mimalloc is compiled in) the mimalloc version. It is the canonical way to confirm which build is on PATH, what host class it requires, and what kernel path it will dispatch to.

bwa-mem3 version always exits 0 — even on a host below the build’s SIMD floor — so operators can introspect a binary on a host that cannot actually run alignment. bwa-mem3 <subcommand> --help and -h share the same property.

Synopsis

mimalloc 3.3.0
v<MAJOR.MINOR>-<N>-g<COMMIT>

Common usage

Confirm the installed version, SIMD floor, and resolved tier:

./bwa-mem3 version

A typical run on an AVX-512BW host with the default BASELINE_ARCH=avx2 build prints (mimalloc line on stderr, the rest on stdout — order in a merged stream is not guaranteed):

v0.2.0-12-gabcdef1
SIMD floor: avx2 (x86-64-v3, Haswell 2013+); kernels: sse41 sse42 avx avx2 avx512bw
SIMD runtime: avx512bw (BWAMEM3_FORCE_TIER unset)
mimalloc 3.3.0

version line — bwa-mem3’s release string, derived from git describe at build time and stored as PACKAGE_VERSION in the binary. When building from a tarball without git history, the fallback value is set via FG_LABS_VERSION_FALLBACK at compile time.
SIMD floor: — the compile-time minimum the binary requires. Set by BASELINE_ARCH (default avx2) and listed alongside the per-tier kernel set the binary carries.
SIMD runtime: — the tier resolved at startup by __builtin_cpu_supports (or BWAMEM3_FORCE_TIER if set in the environment). On arm64 this is always neon.
mimalloc line — present only when USE_MIMALLOC=1 (the default).

On a host below the SIMD floor, version also writes a [W::bwa-mem3] warning on stderr identifying the gap (and the alignment subcommands will refuse to run with exit code 2 — see Host requirements for the exit-2 message format and rebuild instructions).

Notes / Gotchas

Tip — version | grep is safe in CI

The version, SIMD floor:, and SIMD runtime: lines all go to stdout; the mimalloc line and any host-below-floor warning go to stderr. So bwa-mem3 version | grep '^SIMD' works in CI scripts even on hosts that cannot run alignment. Use 2>/dev/null to suppress the mimalloc and warning lines if you want stdout only.
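If a CI job needs the values rather than a grep match, the stdout lines can be parsed directly. The sketch below assumes the output format shown in the example above; `parse_version_stdout` is a hypothetical helper and the sample text is copied from that example.

```python
SAMPLE = """v0.2.0-12-gabcdef1
SIMD floor: avx2 (x86-64-v3, Haswell 2013+); kernels: sse41 sse42 avx avx2 avx512bw
SIMD runtime: avx512bw (BWAMEM3_FORCE_TIER unset)
"""


def parse_version_stdout(text):
    """Extract the version string, SIMD floor, and runtime tier from stdout."""
    info = {}
    for line in text.splitlines():
        if line.startswith("SIMD floor:"):
            info["floor"] = line.split(":", 1)[1].split()[0]
        elif line.startswith("SIMD runtime:"):
            info["runtime"] = line.split(":", 1)[1].split()[0]
        elif line.startswith("v"):
            info["version"] = line.strip()
    return info


print(parse_version_stdout(SAMPLE))
```

In a real job the text would come from something like `subprocess.run(["bwa-mem3", "version"], capture_output=True, text=True).stdout`, which keeps the stderr lines out of the parse.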

Tip — No mimalloc line means USE_MIMALLOC=0

If no mimalloc line appears, the binary was built without the bundled allocator (make USE_MIMALLOC=0). See User Guide — Memory allocator (mimalloc) for when this is appropriate.

Methylation Reference Overview

bwa-mem3 --meth is a single-binary, single-command drop-in replacement for the bwameth.py bisulfite-sequencing alignment pipeline. No Python installation, no piped preprocessing step, and no separate post-processing script are needed: one bwa-mem3 index --meth builds the reference, and one bwa-mem3 mem --meth aligns and post-processes reads from raw FASTQ to sort-ready BAM.

The output BAM is structurally equivalent to what the bwameth.py pipeline produces: consolidated @SQ headers (one entry per real chromosome rather than one per doubled-reference contig), Bismark-compatible XR:Z (read conversion CT/GA), XG:Z (genome strand CT/GA), and XM:Z (per-base methylation call string) auxiliary tags, optional chimera QC flags (--chimera-qc, off by default to match Bismark), and a @PG ID:bwa-mem3-meth provenance entry. Every Bismark-native tool (bismark_methylation_extractor, methylKit, methtuple, DMRfinder, epialleleR), MethylDackel, and biscuit’s per-read methylation tools read the BAM directly without conversion.

Pipeline at a glance

The diagram below shows the internal flow when bwa-mem3 mem --meth runs. Every step executes inside the single process; no external programs or temporary files are required.

flowchart LR
    A[Raw FASTQ\nR1 / R2] -->|inline C→T / G→A| B[c2t-converted reads\n+ internal YS/YC carrier]
    B -->|bwa mem core| C[mem_aln_t\nalignments vs doubled ref]
    C -->|chrom map\nf/r → real chr| D[header rewrite\n@SQ consolidated]
    D -->|XR/XG/XM Bismark tags\noptional --chimera-qc\nQC-fail propagation| E[BAM output\nwb0 uncompressed]

Steps:

FASTQ ingest with inline c2t conversion. R1 bases have every C replaced with T; R2 bases have every G replaced with A. The original bases and conversion direction are kept on an internal carrier on each read (in bseq1_t.comment); they are never emitted to BAM as tags themselves but feed the BAM-write step (SEQ restoration, XR:Z derivation). This conversion happens in-memory — the FASTQ is never written to disk in converted form.
Alignment against the doubled reference. The converted reads are aligned against the ref.fa.bwameth.c2t reference, which contains both a forward C→T projection (f-prefixed contigs) and a reverse G→A projection (r-prefixed contigs) of each chromosome.
Header rewriting and chrom consolidation. The f/r-prefixed contig names used internally are collapsed: every pair fchr1 / rchr1 becomes a single @SQ SN:chr1 entry in the output BAM header. RNAME and RNEXT fields in each record are rewritten to the consolidated name.
Tag emission and QC. Each aligned record receives Bismark-compatible XR:Z (read conversion direction), XG:Z (genome strand), and XM:Z (per-base methylation call string) auxiliary tags. With opt-in --chimera-qc (off by default, matching Bismark), records whose longest M/=/X CIGAR run covers less than 44 % of the read length are flagged 0x200; QC-fail flags then propagate across all records that share a query name (primary, supplementary, and mate records). The original pre-c2t sequence is copied back into the BAM SEQ field so methylation callers see real cytosines rather than the converted sequence.
BAM output. Records are written as uncompressed BAM (wb0 mode via htslib). The @PG ID:bwa-mem3-meth line records the exact command line. The caller pipes directly to samtools sort.

Quick-start commands

# Index the reference once (builds ref.fa.bwameth.c2t + FMI)
bwa-mem3 index --meth ref.fa

# Align paired-end FASTQs
bwa-mem3 mem --meth -t 16 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -o out.bam
samtools index out.bam

Note — bwameth.py compatibility

The default scoring parameters applied by --meth (-B 2 -L 10 -U 100 -T 40 -CM) match those used by bwameth.py so outputs are comparable. Any parameter can be overridden on the command line.

bwameth.py Drop-In Mapping

bwa-mem3 --meth is designed to produce output that is equivalent to the bwameth.py pipeline for the standard paired-end case. This page explains what changes between the two approaches and what stays the same.

Command comparison

bwameth.py pipeline (multi-step)

# Step 1: build a doubled reference with bwameth.py
bwameth.py index ref.fa                # writes ref.fa.bwameth.c2t + bwa-mem2 FMI

# Step 2: align (bwameth.py converts reads, calls bwa-mem2, post-processes)
bwameth.py map --bwa-mem2 -t 16 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -o out.bam
samtools index out.bam

bwa-mem3 --meth (single binary)

# Step 1: build the doubled reference with bwa-mem3
bwa-mem3 index --meth ref.fa           # same ref.fa.bwameth.c2t layout as bwameth.py

# Step 2: align (inline c2t conversion + post-processing, no Python)
bwa-mem3 mem --meth -t 16 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -o out.bam
samtools index out.bam

The index files produced by bwa-mem3 index --meth and bwameth.py index are identical in layout: the same ref.fa.bwameth.c2t doubled-reference FASTA followed by the bwa-mem2 FM-index files (.bwt.2bit.64, .0123, .pac, .amb, .ann).

What is gained

No Python or bwameth.py dependency. The entire pipeline — read conversion, alignment, and BAM post-processing — runs inside a single bwa-mem3 process. This simplifies deployment: one binary, no virtual environment, no version pinning of bwameth.py.

No intermediate files. bwameth.py writes a converted FASTQ (or pipes it) before handing off to the aligner. bwa-mem3 --meth performs the C→T / G→A conversion in-memory on each read batch before passing it to the alignment kernel. No temporary FASTQ is written and no extra pipe stage is needed.

Inline BAM post-processing. Header rewriting, Bismark XR:Z / XG:Z / XM:Z tag emission, opt-in chimera QC (--chimera-qc), and QC-fail propagation all happen inside the same process and the same pass over the alignments. There is no separate post-processing step. Output is written as uncompressed BAM (wb0) — a near-zero-cost format that downstream samtools sort reads natively.

Same flag defaults. --meth applies -B 2 -L 10 -U 100 -T 40 -CM automatically, matching bwameth.py’s default scoring. All parameters can be overridden.

What stays the same

The output BAM is field-compatible with bwameth.py output for @SQ headers, flags, chimera QC behavior, and SEQ representation. Two fields intentionally differ: the methylation aux tag set and the @PG provenance line (see below):

Field	bwameth.py	bwa-mem3 --meth
`@SQ` headers	One per real chromosome	One per real chromosome
Methylation aux tags	`YS:Z`, `YC:Z`, `YD:Z` (bwameth)	`XR:Z`, `XG:Z`, `XM:Z` (Bismark-compatible)
`@PG`	`ID:bwameth`	`ID:bwa-mem3-meth`
Chimera QC threshold	Longest M < 44% of read	Same (44%), opt-in via `--chimera-qc`
Chimera QC flags	`0x200`, clear `0x2`, MAPQ ≤ 1	Same
SEQ field	Pre-c2t bases (RC-flipped when `is_rev`)	Same

The @PG ID: is intentionally different so provenance is unambiguous. bwa-mem3 --meth emits the Bismark-compatible XR:Z / XG:Z / XM:Z tag set rather than the bwameth-style YS:Z / YC:Z / YD:Z set, which means the output is directly consumable by bismark_methylation_extractor, methylKit, methtuple, DMRfinder, and epialleleR in addition to MethylDackel and biscuit. Downstream tools that read YS:Z / YC:Z / YD:Z will not find those tags and must be pointed at the corresponding XR:Z / XG:Z (and the per-base XM:Z methylation call string) instead.

Info — End-to-end regression coverage

PR #13 includes a three-layer regression test that verifies 100% chrom+pos match, 100% CIGAR match, and byte-identical SEQ across 92,684 paired-end records compared to a bwameth.py reference run.

When to prefer bwameth.py

If your workflow requires bwameth.py-specific features (e.g. bwameth.py markduplicates or non-standard bwameth.py post-processors), continue using bwameth.py. bwa-mem3 --meth targets the indexing + alignment + standard post-processing path only.

Conversion Details (C→T, G→A)

Bisulfite sequencing relies on chemical conversion of unmethylated cytosines to uracil (read as thymine after PCR). bwa-mem3 --meth models this with an in-memory read transformation applied to every read before the alignment kernel sees the bases.

What gets converted

Paired-end bisulfite reads follow a strand convention:

R1 (read 1): every C in the base sequence is replaced with T. This models reads sequenced from the bisulfite-converted original strands, OT (original top) and OB (original bottom), which carry C→T changes.
R2 (read 2): every G in the base sequence is replaced with A. This models reads sequenced from the complementary strands, CTOT (complementary to original top) and CTOB (complementary to original bottom), which carry G→A changes.

Single-end mode uses the R1 (C→T) rule for all reads.

The doubled reference built by bwa-mem3 index --meth (or bwameth.py) contains two projections of each chromosome:

f-prefixed contigs (e.g. fchr1): the chromosome with every C replaced by T.
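r-prefixed contigs (e.g. rchr1): the chromosome, in the same forward coordinates, with every G replaced by A. This is equivalent to a C→T projection of the bottom strand.

After conversion, reads derived from the top strand match f-prefixed contigs and reads derived from the bottom strand match r-prefixed contigs, so both mates of a directional fragment align to the same projection, with one mate on the reverse strand. The contig prefix records which strand hypothesis was used and feeds the Bismark XG:Z tag directly (CT for f-prefixed / OT, GA for r-prefixed / OB).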

Where conversion happens

Read conversion runs inside src/fastmap.cpp in the meth_mode ingest block, immediately after sequence parsing and before any alignment work. The transformation is applied to the in-memory bseq1_t.seq buffer; the original FASTQ file is never rewritten.

Before the bases are modified, the original sequence is recorded in the read’s comment buffer as:

YS:Z:<l_seq bases>\tYC:Z:<direction>

where <direction> is CT for R1 (C→T) and GA for R2 (G→A). These fields pass through the alignment kernel untouched and serve two internal purposes at BAM-write time: YS:Z is the source for SEQ restoration (see next section), and YC:Z is the source for the emitted XR:Z Bismark tag. They are not emitted to the output BAM; bam_writer.cpp suppresses them under --meth. See SAM tags: XR, XG, XM for the per-tag output reference.
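The ingest-time conversion and carrier can be sketched in a few lines. This is an illustrative model only: the real code operates on `bseq1_t` buffers in C++, and `convert_read` is a hypothetical name. It assumes upper-case bases.

```python
def convert_read(seq, is_read2):
    """Return (converted_seq, carrier) for one read.

    R1 and single-end reads get C->T; R2 gets G->A. The carrier keeps the
    original bases (YS:Z) and the conversion direction (YC:Z).
    """
    direction = "GA" if is_read2 else "CT"
    carrier = "YS:Z:" + seq + "\tYC:Z:" + direction
    if is_read2:
        converted = seq.replace("G", "A")
    else:
        converted = seq.replace("C", "T")
    return converted, carrier


print(convert_read("ACGTC", is_read2=False))
```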

Sequence restoration in the BAM SEQ field

Methylation callers such as MethylDackel identify methylated cytosines by examining the BAM SEQ field at each CpG site. They need to see real C/T bases — not the uniformly-converted T/A bases that were used for alignment.

meth_mem_aln_to_bam (in src/meth_bam.cpp) restores the original sequence from the internal YS:Z carrier on bseq1_t.comment before writing the BAM record (the carrier itself is suppressed at BAM emit by bam_writer.cpp under --meth, so it never reaches the output file):

The YS:Z: payload is located at the start of the bseq1_t.comment field (offset +5 past the YS:Z: header bytes).
For forward-aligned records (!p.is_rev), the pre-c2t bases are copied directly into the BAM SEQ buffer.
For reverse-aligned records (p.is_rev), the bases are reverse-complemented using the standard TGCAN table before being placed in SEQ.
If YS:Z: is absent (e.g. when running with an external c2t converter that does not emit it), the code falls back to the converted sequence in s->seq, with the same RC flip logic.
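The restoration rules above can be sketched as follows. This is a simplified model with hypothetical names (`restore_seq`, `comment`, `converted_seq`), not the code in src/meth_bam.cpp. It restores the whole read and ignores the clip-aware [qb, qe) trimming described in the warning below.

```python
# Complement table for A, C, G, T, N (the "TGCAN" mapping).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def restore_seq(comment, converted_seq, is_rev):
    """Recover the SEQ bases to emit for one record."""
    if comment.startswith("YS:Z:"):
        # Payload starts 5 bytes in and runs up to the next tab.
        bases = comment[5:].split("\t", 1)[0]
    else:
        # No carrier (e.g. external c2t converter): fall back to converted bases.
        bases = converted_seq
    if is_rev:
        return bases.translate(_COMPLEMENT)[::-1]  # reverse complement
    return bases


print(restore_seq("YS:Z:ACGTC\tYC:Z:CT", "ATGTT", is_rev=True))
```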

Warning — Soft-clip and supplementary trimming

When computing the SEQ range for supplementary alignments, the qb/qe boundaries account for soft-clip or hard-clip operations at the CIGAR ends. The YS:Z: restoration applies over the same trimmed range so SEQ length always matches the emitted CIGAR.

QUAL field handling

The QUAL field is taken directly from the original FASTQ (bseq1_t.qual) over the same [qb, qe) range and is never modified by the c2t process. Quality scores correspond to the original base calls, not the converted ones.

Relationship to the reference index

bwa-mem3 index --meth ref.fa writes ref.fa.bwameth.c2t, which applies the same C→T / G→A projection to the reference sequence. The resulting file is compatible with what bwameth.py index produces, so the same doubled-reference FASTA can be used interchangeably with either tool across tested versions.

SAM Tags: XR, XG, XM (Bismark-compatible)

bwa-mem3 mem --meth emits three Bismark-compatible auxiliary tags on each output record: XR:Z, XG:Z, and XM:Z. These tags are read by bismark_methylation_extractor, deduplicate_bismark, methylKit processBismarkAln, methtuple, DMRfinder, epialleleR, MethylDackel, and biscuit’s per-read methylation tools.

Tag reference

`XR:Z` — read conversion direction

Property	Value
Type	`Z` (NUL-terminated string)
Values	`CT` (R1 / SE) or `GA` (R2)
Set by	`meth_mem_aln_to_bam` from FASTQ-ingest carrier (`s->comment`’s YC payload)
Emitted on	All records (mapped and unmapped)

XR:Z records which conversion was applied to the read at FASTQ ingest:

CT — C→T conversion applied; this is an R1 read or single-end read.
GA — G→A conversion applied; this is an R2 read.

`XG:Z` — genome strand of the alignment

Property	Value
Type	`Z` (NUL-terminated string)
Values	`CT` (aligned to original top, `f`-prefixed contig) or `GA` (aligned to original bottom, `r`-prefixed contig)
Set by	`meth_mem_aln_to_bam` from `meth_chrom_map_t.direction`
Emitted on	Mapped records only

XG:Z indicates which doubled-reference strand the read aligned to:

CT — read aligned to the C→T-projected forward strand (OT).
GA — read aligned to the G→A-projected forward strand (OB).

For properly paired directional reads, R1 and R2 of a fragment naturally share XG:Z. Mates of discordant or chimeric pairs may carry different XG:Z values; the 0x200 flag on such records is only set by the chimera heuristic when --chimera-qc is enabled.

`XM:Z` — methylation call string

Property	Value
Type	`Z` (NUL-terminated string)
Length	Equal to `SEQ` length
Set by	`meth_build_xm` (`src/meth_xm.cpp`) walking SEQ-orientation read against un-converted ref
Emitted on	Mapped records only

Per-base methylation call. Each character corresponds to one SEQ base:

char	meaning
`z` / `Z`	unmethylated / methylated C in CpG context
`x` / `X`	unmethylated / methylated C in CHG context
`h` / `H`	unmethylated / methylated C in CHH context
`u` / `U`	unmethylated / methylated C in unknown context (N within 1 or 2 bp downstream of the C, on the read’s source strand)
`.`	non-C reference at this position, sequencing mismatch (read base ≠ C/T at a ref C), insertion, soft clip, or N at the C position itself

The string is in SEQ orientation (matches the BAM SEQ field): for reads with the 0x10 flag set, both SEQ and XM:Z are reverse-complemented relative to FASTQ-original orientation.

Computation

Under --meth, the doubled c2t reference (<prefix>.bwameth.c2t.*) is folded once at startup into an in-memory un-converted pac (the meth_orig_ref module — src/meth_orig_ref.cpp). The fold uses (f, r) → original recovery on every position via a 5-row table:

f[P]	r[P]	original[P]
T	T	T
T	C	C
G	A	G
A	A	A
N	N	N (via `bns->ambs`)
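The fold table translates directly into a lookup. The sketch below works on plain strings for clarity; the actual module operates on the packed reference and takes N positions from `bns->ambs`, so treat `fold_reference` as a hypothetical illustration rather than the implementation.

```python
# (f base, r base) -> original base, one entry per row of the table above.
FOLD = {
    ("T", "T"): "T",
    ("T", "C"): "C",
    ("G", "A"): "G",
    ("A", "A"): "A",
    ("N", "N"): "N",
}


def fold_reference(f_seq, r_seq):
    """Recover the un-converted sequence from its C->T and G->A projections."""
    if len(f_seq) != len(r_seq):
        raise ValueError("f and r projections must have the same length")
    return "".join(FOLD[(f, r)] for f, r in zip(f_seq, r_seq))


original = "ACGTN"
f_proj = original.replace("C", "T")  # ATGTN
r_proj = original.replace("G", "A")  # ACATN
print(fold_reference(f_proj, r_proj))  # prints ACGTN
```

Any (f, r) pair outside the five rows raises `KeyError`, which would indicate that the two contigs are not projections of the same sequence.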

Per mapped record, meth_build_xm slices the un-converted forward-strand window at the read’s footprint plus 2 bp of context on either side, then walks the BAM CIGAR jointly over the restored SEQ and the ref window. The classifier matches Bismark’s methylation_call:

match position with ref[t] == 'C' (top strand) or 'G' (bottom strand):
    determine context from ref[t±1], ref[t±2]
        N in either context base    -> u/U (unknown context)
        ref[t±1] == G/C (per strand) -> z/Z (CpG)
        ref[t±2] == G/C (per strand) -> x/X (CHG)
        otherwise                    -> h/H (CHH)
    determine methylation:
        read base == C/G (per strand) -> uppercase (methylated)
        read base == T/A (per strand) -> lowercase (unmethylated)
        otherwise                      -> '.'
insertion / soft clip                  -> '.' per consumed read base
deletion / N op                         -> no XM emit
hard clip / pad                         -> no XM emit

The top vs bottom strand choice is driven by XG:Z (= cmap direction), not by the SAM 0x10 (RC) flag. CTOT reads (R2 mapped to a top-strand contig with 0x10 set) and OB reads (R1 mapped reverse-complemented to a bottom-strand contig) are both handled by reading the rule table from the strand encoded in XG. The walk runs in SEQ orientation throughout, with no reverse-complementing of the ref slice or the read.

For bottom-strand methylation, the C of interest at forward position P is encoded as a G on the forward strand (complement of bottom-strand C). The downstream context on the bottom strand corresponds to upstream positions on the forward strand; the classifier indexes ref[t-1] and ref[t-2] instead of ref[t+1] and ref[t+2], and looks for a C (forward) instead of a G to flag CpG.
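A per-position version of the classifier, covering both strands, might look like the sketch below. It follows the order of checks in the pseudocode above and handles only aligned (match) positions; `xm_call` is a hypothetical name, and treating positions beyond the reference slice as N is an assumption made here so the function is self-contained.

```python
def xm_call(ref, t, read_base, top_strand):
    """Return the XM character for one aligned position.

    ref: un-converted forward-strand reference slice.
    t: index of the aligned reference base within ref.
    read_base: restored SEQ base at this position.
    top_strand: True when XG is CT, False when XG is GA.
    """
    target = "C" if top_strand else "G"    # forward-strand base carrying the cytosine
    partner = "G" if top_strand else "C"   # forward-strand base that marks CpG / CHG
    step = 1 if top_strand else -1         # context lies downstream on the source strand
    if ref[t] != target:
        return "."

    def context(offset):
        i = t + step * offset
        return ref[i] if 0 <= i < len(ref) else "N"  # assumption: off-slice counts as N

    c1, c2 = context(1), context(2)
    if "N" in (c1, c2):
        code = "u"
    elif c1 == partner:
        code = "z"
    elif c2 == partner:
        code = "x"
    else:
        code = "h"

    unconverted = target                    # C (top) or G (bottom): methylated
    converted = "T" if top_strand else "A"  # unmethylated
    if read_base == unconverted:
        return code.upper()
    if read_base == converted:
        return code
    return "."


print(xm_call("ACGTT", 1, "C", top_strand=True))  # prints Z
```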

Inspecting tags with samtools

samtools view out.bam | head -1 | tr '\t' '\n' | grep -E '^X[RGM]:'

Expected output looks like:

XR:Z:CT
XG:Z:CT
XM:Z:..z..h..Z..x..h.....Z..

Chimera QC and Header Rewriting

After the alignment kernel produces mem_aln_t records, bwa-mem3 --meth applies a set of post-processing steps before writing BAM output. These steps are implemented in src/meth_bam.cpp and run in the same process, in the same pass over the aligned records.

`@SQ` header consolidation

The doubled reference (ref.fa.bwameth.c2t) contains two contigs for each chromosome:

fchr1, fchr2, … — C→T projections of each chromosome.
rchr1, rchr2, … — G→A projections of each chromosome.

If the raw alignment header were written directly, every downstream tool would see twice as many sequences as there are real chromosomes, with unfamiliar f/r-prefixed names. meth_bam_writer_open instead builds a consolidated header using the meth_chrom_map_t:

meth_chrom_map_build_from_bns iterates over bns->anns and strips the leading f/r from each contig name.
The first contig with a given stripped name registers that name in the output list; subsequent contigs with the same stripped name map to the same output index.
The BAM @SQ lines are written from the consolidated list — one SN: per real chromosome.

RNAME, RNEXT, and SA/XA tag contig references in every record are rewritten through cmap->out_tid and cmap->output_names so they reference the consolidated names. The mapping from internal (doubled-ref) contig index to output contig index is cmap->out_tid[p.rid].
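The consolidation logic amounts to a first-seen de-duplication of stripped names. The sketch below models it with plain Python lists; `build_chrom_map` is a hypothetical stand-in for `meth_chrom_map_build_from_bns`, and it assumes every internal contig name carries a one-character f or r prefix.

```python
def build_chrom_map(internal_names):
    """Return (output_names, out_tid) for a doubled-reference contig list."""
    output_names = []   # consolidated @SQ names, in first-seen order
    out_tid = []        # internal contig index -> output contig index
    index_of = {}
    for name in internal_names:
        stripped = name[1:] if name[:1] in ("f", "r") else name
        if stripped not in index_of:
            index_of[stripped] = len(output_names)
            output_names.append(stripped)
        out_tid.append(index_of[stripped])
    return output_names, out_tid


names, tids = build_chrom_map(["fchr1", "rchr1", "fchr2", "rchr2"])
print(names, tids)  # prints ['chr1', 'chr2'] [0, 0, 1, 1]
```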

Note — TLEN computation uses consolidated TIDs

Template length (TLEN) is computed using the consolidated output TIDs, not the internal p.rid values. Two mates that rescue onto fchr1 and rchr1 respectively both map to output chr1, so TLEN is reported as a non-zero distance rather than zero (which would happen if the mismatched internal TIDs were used).

Chimera QC heuristic (opt-in)

bwameth.py applies a heuristic to flag reads that look like chimeric fragments: if the longest contiguous alignment run (sum of M/=/X CIGAR operations) covers less than 44 % of the read length, the read is considered a potential chimera. Bismark does not apply this kind of heuristic.

bwa-mem3 --meth makes this opt-in via --chimera-qc (default off, so the runtime posture matches Bismark). When enabled, the check inside meth_mem_aln_to_bam does:

if (100 * longest_M_run < 44 * l_seq):
    flag  |=  0x200   # set QC fail
    flag  &= ~0x2     # clear proper pair
    mapq   =  min(mapq, 1)

The threshold constant is MIN_LONGEST_M_PCT = 44 (defined at the top of src/meth_bam.cpp). The longest run is computed by cigar_longest_m_mem from src/cigar_util.cpp, which counts M, =, and X operations.
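For experimenting with the threshold outside the aligner, the check can be reproduced on CIGAR strings. The sketch below is an approximation with hypothetical names; in particular it assumes that adjacent M, =, and X operations are merged into one run, which may differ from what `cigar_longest_m_mem` does.

```python
import re

MIN_LONGEST_M_PCT = 44


def longest_match_run(cigar):
    """Length of the longest run of consecutive M/=/X operations."""
    longest = run = 0
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        if op in "M=X":
            run += int(length)
            longest = max(longest, run)
        else:
            run = 0
    return longest


def apply_chimera_qc(flag, mapq, cigar, read_len):
    """Apply the opt-in chimera heuristic to one mapped record."""
    if 100 * longest_match_run(cigar) < MIN_LONGEST_M_PCT * read_len:
        flag |= 0x200        # set QC fail
        flag &= ~0x2         # clear proper pair
        mapq = min(mapq, 1)  # cap MAPQ
    return flag, mapq


print(apply_chimera_qc(99, 60, "40M60S", 100))  # prints (609, 1)
```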

The chimera heuristic is only applied to mapped records (!(flag & 0x4) && direction != 0). Unmapped records are not touched.

See Flags for when to use --chimera-qc (PBAT / scBS-Seq; bwameth.py-equivalence runs).

`--set-as-failed` strand filtering

Before the chimera check, meth_mem_aln_to_bam checks whether opt->meth_set_as_failed is set and matches the record’s strand direction:

if (meth_set_as_failed != 0 && meth_set_as_failed == direction):
    flag |= 0x200

This unconditionally marks all alignments to the specified strand (f or r) as QC-failed before chimera logic runs. The chimera check then applies on top of the already-set fail flag.

Pair-level QC-fail propagation

Once per read group (here, all records sharing the same query name, not a SAM @RG read group), after the individual records have been processed, the writer calls:

meth_bam_group_propagate_qcfail(group, n)

This function scans all records in the group. If any record has 0x200 set, it propagates that flag to every other record in the group and clears 0x2 (proper pair) on all of them. This ensures that a chimeric or strand-filtered primary alignment also marks its supplementary alignments and the mate as QC-failed, preventing inconsistent flag states in the output BAM.
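The propagation rule is small enough to sketch in Python. This models only the flag arithmetic described above; the function name and the list-of-flags representation are assumptions made for illustration.

```python
QC_FAIL = 0x200
PROPER_PAIR = 0x2


def propagate_qcfail(flags):
    """Return the group's flags after pair-level QC-fail propagation."""
    if not any(f & QC_FAIL for f in flags):
        return list(flags)  # nothing failed, nothing changes
    # At least one record failed: fail all, and clear proper-pair on all.
    return [(f | QC_FAIL) & ~PROPER_PAIR for f in flags]


# R1 primary (already QC-failed), R1 supplementary, R2 primary.
group = [0x261, 0x863, 0x93]
print([hex(f) for f in propagate_qcfail(group)])  # ['0x261', '0xa61', '0x291']

# A clean pair is left untouched.
print([hex(f) for f in propagate_qcfail([0x63, 0x93])])  # ['0x63', '0x93']
```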

`@PG ID:bwa-mem3-meth` insertion

meth_bam_writer_open appends a @PG line to the header after the original bwa-mem3 @PG entry:

@PG  ID:bwa-mem3-meth  PN:bwa-mem3-meth  VN:<version>-meth  CL:<command line>

The <command line> field is the full bwa-mem3 mem --meth ... invocation with embedded tab characters replaced by spaces (htslib does not permit literal tabs in @PG CL: fields). This records the exact parameters used for provenance and reproducibility.

Tip — Verifying the header

After alignment, confirm consolidation and provenance with:
samtools view -H out.bam | grep -E '^@SQ|^@PG'
You should see one @SQ line per chromosome (no f/r prefixes) and both @PG ID:bwa-mem3 and @PG ID:bwa-mem3-meth entries.

Flags: --set-as-failed, --chimera-qc

bwa-mem3 --meth adds two flags that control QC behavior during BAM post-processing. Both flags affect the chimera QC and strand-filtering logic inside meth_mem_aln_to_bam (src/meth_bam.cpp).

`--set-as-failed {f|r}`

Marks every alignment to the specified strand as QC-failed (0x200) regardless of alignment quality or CIGAR structure.

Accepted values:

f — flag all alignments to f-prefixed contigs (C→T top-strand projection).
r — flag all alignments to r-prefixed contigs (G→A bottom-strand projection).

Effect on records:

When --set-as-failed f (or r) is set and a mapped record’s strand matches the specified value, the record’s SAM flag has 0x200 set. If --chimera-qc is also active, the chimera heuristic runs on top, possibly clearing 0x2 and capping MAPQ. QC-fail propagation then spreads the flag to all records in the read group.

When to use it:

Some experimental designs produce reads that are expected to align exclusively to one strand. Flagging the other strand as QC-failed before downstream analysis prevents spurious methylation calls from mis-strand alignments. It is also useful for diagnosing library preparation issues: run once with --set-as-failed r and once without to compare yield on each strand.

Warning — All records on the strand are flagged

--set-as-failed is a blunt instrument. It marks every alignment to the chosen strand, including correctly aligned reads that simply happened to land on the complementary strand due to library structure. Use this flag only when your library is expected to be strand-specific.

`--chimera-qc`

Enables the bwameth.py-style longest-M chimera heuristic. Off by default; this is the Bismark-equivalent posture, since Bismark itself does not apply this kind of QC heuristic.

When --chimera-qc is set, any mapped record whose longest M/=/X CIGAR run covers less than 44 % of the read length receives:

0x200 (QC fail) set.
0x2 (proper pair) cleared.
MAPQ capped at 1.

QC-fail propagation across the read group also applies.

When to use it:

The 44 % threshold was calibrated by bwameth.py for standard mammalian whole-genome bisulfite-sequencing (WGBS) libraries with typical read lengths and is helpful on PBAT / scBS-Seq libraries where intra-fragment chimeras are common. For Bismark-equivalent output (and most directional EM-seq / WGBS workflows), leave it off.

It is also useful when benchmarking: comparing bwa-mem3 --meth output against bwameth.py output is cleaner with --chimera-qc enabled, since bwameth.py’s chimera logic always runs.

Note — Pair-level propagation still applies

--chimera-qc controls only whether the heuristic itself runs. --set-as-failed is independent: when active, those flags are still set, and meth_bam_group_propagate_qcfail propagates any 0x200 flags across the read group regardless of --chimera-qc.

Flag interaction summary

Condition	`0x200` set?	`0x2` cleared?	MAPQ capped?
Normal aligned record (default, no flags)	No	No	No
`--chimera-qc` triggers (longest M/=/X < 44%)	Yes	Yes	Yes (≤1)
`--set-as-failed` strand matches	Yes	No	No
Both `--chimera-qc` + `--set-as-failed` active	Yes	Yes	Yes (≤1)
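The rows of the table can be reproduced with a per-record sketch in Python. It follows the order described in this section (strand filter first, then the optional chimera check) and leaves out pair-level propagation. The function name, the use of 'f'/'r' strings for direction, and None for "flag not given" are assumptions for illustration. The last row assumes the chimera heuristic actually triggers on the record.

```python
QC_FAIL, PROPER_PAIR, UNMAPPED = 0x200, 0x2, 0x4


def apply_meth_qc(flag, mapq, direction, longest_run, l_seq,
                  set_as_failed=None, chimera_qc=False):
    """Per-record QC: strand filter first, then the optional chimera check."""
    if flag & UNMAPPED:
        return flag, mapq  # unmapped records are not touched
    if set_as_failed is not None and set_as_failed == direction:
        flag |= QC_FAIL
    if chimera_qc and 100 * longest_run < 44 * l_seq:
        flag |= QC_FAIL
        flag &= ~PROPER_PAIR
        mapq = min(mapq, 1)
    return flag, mapq


base = dict(flag=0x63, mapq=60, direction="f", l_seq=150)
rows = {
    "default":       apply_meth_qc(longest_run=150, **base),
    "chimera-qc":    apply_meth_qc(longest_run=60, chimera_qc=True, **base),
    "set-as-failed": apply_meth_qc(longest_run=150, set_as_failed="f", **base),
    "both":          apply_meth_qc(longest_run=60, set_as_failed="f",
                                   chimera_qc=True, **base),
}
for name, (flag, mapq) in rows.items():
    print(f"{name:14s} flag={flag:#x} mapq={mapq}")
```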

`-V` reference annotation `XR:Z` is suppressed under `--meth`

bwa-mem3 mem -V normally emits the contig annotation as an XR:Z auxiliary field. Under --meth, XR:Z carries the Bismark read-conversion direction (CT/GA) instead. The reference-annotation XR:Z is silently suppressed when --meth is active so the two uses don’t collide. There is no flag to override this — -V is a no-op for XR:Z under --meth. See tags.md.

Interop with External bwameth.py c2t

Some workflows use bwameth.py’s c2t subcommand to convert reads before passing them to an aligner. bwa-mem3 --meth supports this pattern by detecting whether the caller has already provided a pre-converted FASTQ and whether the reference path already points to the doubled-reference FASTA.

Auto-detect logic for the reference path

When --meth is active, bwa-mem3 mem ordinarily appends .bwameth.c2t to the reference path so the user can pass the original FASTA prefix:

bwa-mem3 mem --meth -t 16 ref.fa R1.fq.gz R2.fq.gz
# internally uses ref.fa.bwameth.c2t as the reference

If the reference path already ends with .bwameth.c2t, the auto-append is skipped:

bwa-mem3 mem --meth -t 16 ref.fa.bwameth.c2t R1.fq.gz R2.fq.gz
# no suffix appended; ref.fa.bwameth.c2t is used as-is

This detection is a simple suffix check on the path string. It allows callers that manage the doubled-reference path explicitly to pass it without triggering double-append.
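The suffix check amounts to the following Python sketch. The function name is hypothetical; only the .bwameth.c2t suffix and the append-unless-present behavior come from the text above.

```python
METH_SUFFIX = ".bwameth.c2t"


def resolve_meth_reference(ref_path):
    """Append the doubled-reference suffix unless the path already ends with it."""
    if ref_path.endswith(METH_SUFFIX):
        return ref_path
    return ref_path + METH_SUFFIX


print(resolve_meth_reference("ref.fa"))              # ref.fa.bwameth.c2t
print(resolve_meth_reference("ref.fa.bwameth.c2t"))  # ref.fa.bwameth.c2t
```

Because the check looks only at the path string, it does not inspect the file contents to decide whether a reference is doubled.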

Using bwameth.py c2t as the read preprocessor

If your pipeline already runs bwameth.py c2t to convert reads (for example, because it needs to reuse converted reads across multiple aligners), you can pipe the output directly to bwa-mem3 mem --meth:

bwameth.py c2t R1.fq.gz R2.fq.gz \
  | bwa-mem3 mem --meth -p -t 16 ref.fa.bwameth.c2t /dev/stdin \
  | samtools sort -o out.bam

Key points for this pattern:

Pass the .bwameth.c2t reference path explicitly so the auto-append is suppressed.
Use -p to tell bwa-mem3 mem that the input contains interleaved paired-end reads (bwameth.py c2t emits interleaved output to stdout).
Use /dev/stdin as the reads argument to read from the pipe.
The bwa-mem3 --meth inline c2t conversion is not applied when the reads arrive pre-converted. XR:Z (read conversion) and XG:Z (genome strand) are still emitted on every record; XM:Z (per-base methylation call string) is emitted on every mapped record. XR:Z is derived from the inline carrier the c2t step normally writes — when reads are pre-converted, the carrier is absent unless the external preprocessor emits it as a FASTQ comment (see warning below).

Warning — XR:Z: requires the inline carrier

XR:Z: records the read’s bisulfite-conversion direction (CT for top-strand, GA for bottom-strand R2). bwa-mem3’s inline c2t step records that direction into the FASTQ comment as YS:Z:<seq>\tYC:Z:<dir>, which the BAM emitter then reads to set XR:Z: (the YS/YC carrier itself is dropped from BAM output). When reads are pre-converted externally and piped in, the inline c2t step in src/fastmap.cpp is bypassed. If your external preprocessor does not emit a compatible YC:Z: comment field, XR:Z: will be absent from the output BAM. XG:Z: and XM:Z: are unaffected, because they are derived from the reference contig direction and the CIGAR walk, not from the carrier.
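To check whether an external preprocessor emits a usable carrier, a short Python sketch like the following can inspect a FASTQ comment. It assumes only the YS:Z:<seq>\tYC:Z:<dir> layout given above; the function name is hypothetical and bwa-mem3's own parser may be stricter or more lenient.

```python
def xr_from_comment(comment):
    """Return the conversion direction carried in a FASTQ comment, or None."""
    for field in (comment or "").split("\t"):
        if field.startswith("YC:Z:"):
            return field[len("YC:Z:"):]
    return None


print(xr_from_comment("YS:Z:ACGTTCGA\tYC:Z:CT"))  # CT
print(xr_from_comment("1:N:0:ATCACG"))            # None
```

A None result for reads in a pre-converted FASTQ suggests the output BAM would lack XR:Z:, per the warning above.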

Header rewriting and BAM post-processing with external c2t

Whether reads are converted inline or externally, all BAM post-processing steps apply identically when --meth is active:

@SQ header consolidation (f/r contigs → one entry per chromosome).
Bismark XR:Z / XG:Z / XM:Z auxiliary tag emission.
Chimera QC heuristic (only when --chimera-qc is set; off by default).
Pair-level QC-fail propagation.
@PG ID:bwa-mem3-meth insertion.

The post-processing pipeline depends only on the reference contig names (to determine XG:Z) and the alignment flags — not on whether reads were converted inline or externally.

Summary of path variants

Reference arg	Read source	Auto-append?	Inline c2t?	`XR/XG/XM` emitted?
`ref.fa`	Raw FASTQ	Yes (→ `ref.fa.bwameth.c2t`)	Yes	All three
`ref.fa.bwameth.c2t`	Raw FASTQ	No	Yes	All three
`ref.fa.bwameth.c2t`	Pre-converted (pipe)	No	No	`XG`/`XM` always; `XR` only if external preprocessor emits the `YC:Z` carrier

What’s Different from bwa-mem2

This section tracks every change that bwa-mem3 carries on top of upstream bwa-mem2/bwa-mem2’s master branch, explains why each change was made, and records its upstream disposition.

How this section is organized

Each deep-dive page covers one category of change:

Correctness fixes — bugs in upstream bwa-mem2 that are fixed in bwa-mem3, including the kswv SIMD score2 plateau series, the proper-pair flag regression, the zero-init crash, the SMEM buffer overflow, and the @PG tab-escape issue.
Performance improvements — lockstep SMEM batching, batched -H header ingestion, libsais FM-index construction, and the consolidated mapping speedup suite.
Features — --meth bisulfite mode, mimalloc allocator, --supp-rep-hard-cap, bwa-mem3 shm, shm --meth, the HN:i tag, and the --bam=LEVEL output flag.
Architecture support — the Linux ARM64/aarch64 build, the arch=avx512bw Makefile target, the NEON kswv mate-rescue kernel, and the AVX2 kswv mate-rescue kernel.
Build & infrastructure — the doctest framework, Codecov integration, PACKAGE_VERSION from git describe, PGO target parameterization, CXXFLAGS/CPPFLAGS/LDFLAGS forwarding, the unit-test harness, and the CI matrix expansion.
Upstream PR status — a single table cross-referencing every fork-carried change to its corresponding upstream PR or issue, with current upstream disposition.

Carried on top of upstream

Auto-generated from git log --reverse --no-merges master..main and the conventional-commits scope on each PR-merge title; do not edit by hand. For per-PR upstream disposition (bwa-mem2 PR / issue refs and status), see Upstream PR status.

Commit	Topic	PR
`ae73227`	Add Apple Silicon (ARM64/NEON) support with native optimizations	—
`744a9e7`	feat: add CI workflow with cross-platform build and end-to-end test	—
`490502b`	fix: drop unused global `stat` that shadows libc	—
`9364cfc`	ci: pin GitHub Actions to full-length commit SHAs	#4
`b6eaba1`	chore: configure CodeRabbit to review PRs against fg-main	#2
`db5086a`	docs: add FG-MAIN.md documenting the fork’s relationship to upstream	#3
`5132582`	feat(arm64): make Linux aarch64 build + CI-test on every fg-main push	#1
`96016a5`	ci: pin dwgsim seed (-z 42) to stop parity-test flakiness	#10
`246b528`	fix(hdr): align bwamem.h declarations with bwamem.cpp definitions	#5
`b27f374`	feat(hdr): export mem_infer_dir for external consumers	#6
`62700b1`	chore: move profiling globals out of main.cpp	#7
`6b76c7b`	feat: expose worker_alloc/worker_free, the core worker_t pre-allocation helpers	#8
`e80765b`	feat: split mem_sam_pe into mem_pair_resolve + thin emission wrapper	#9
`84defc3`	feat: --bam[=LEVEL] output flag for direct BAM emission	#12
`73907d7`	feat: vendor mimalloc v3.3.0 and link by default	#19
`7641ebf`	feat(meth): --meth + `index --meth` — bwameth.py-equivalent bisulfite mode	#13
`0165b6c`	fix: zero bseq1_t in kseq2bseq1 so realloc’d entries don’t carry garbage	#22
`e7cb763`	[proto] NEON kswv mate-rescue — correctness + perf harness	#18
`a5aab04`	test(ci): add unit-test harness, fixtures, and ARM build support	#23
`2fddafd`	[proto] AVX2 kswv mate-rescue — stacked on PR 18	#20
`8944028`	fix: compute no_pairing 0x2 flag from the emitted alignment	#17
`2fd0e96`	fix(kswv): apply NEON score2-scan fixes to AVX-512BW kernel	#21
`68adecd`	ci: expand workflow matrix + add canonical deep-test row	#24
`690914f`	build(make): add explicit arch=avx512bw target	#16
`0bb9402`	fix(kswv): gate AVX2 arch dispatch on !AVX512BW	#26
`43457e8`	fix(kswv): consolidate score2 plateaus per-lane to match scalar ksw_align2	#28
`2311f11`	fix(kswv): port score2 plateau consolidation to NEON + AVX-512BW	#29
`75c709a`	fix(kswv): apply score2 plateau fix + missing filters to kswv_512_16	#30
`61813ef`	fix(kswv): rewrite kswv_neon_16 — real SIMD kernel with correct table + score2	#31
`1f76655`	perf(seed): lockstep SMEM batching across N reads	#33
`93a79ec`	feat(mem): emit HN:i tag with total hit count per primary	#42
`dd3a82c`	chore: port four nh13 lh3/bwa PRs into bwa-mem2 (-z, -u/XB, MQ, @HD order)	#35
`98ba6ab`	build(make): forward user CXXFLAGS/CPPFLAGS/LDFLAGS to final link steps	#50
`e9302a1`	fix(kswv): guard post-loop rowMax store on nrow==0 batches	#51
`9b702ca`	fix(sam): sanitize whitespace in -R when embedding into @PG CL: field	#54
`ed63fad`	perf(header): batch -H ingestion to fix O(n^2) header read (closes #37)	#49
`595d8e5`	feat(mapq): add --supp-rep-hard-cap opt-in supp MAPQ rescoring	#56
`79628c3`	chore(version): stamp PACKAGE_VERSION from git describe at build time	#52
`e22dade`	fix(smem): size SMEM buffers from observed max read length (closes #44)	#55
`03688a0`	chore: normalize CRLF line endings to LF (#43)	#53
`57e21bd`	feat(makefile): parameterize PGO targets by arch + profile dir	#59
`79b90ce`	feat(index): libsais-based memory-bounded FM-index construction	#57
`d8d4a6d`	feat(cli): wire up --help across commands; add -h to top-level and index	#60
`7301762`	perf: consolidated mapping speedups (ksw2, SMEM, SAL, SAM)	#58
`eaf4ed6`	test: doctest-based test framework scaffolding + Codecov	#34
`bbbecd3`	ci(proto-neon-kswv): split into fan-out/fan-in jobs with caching	#63
`20f77e9`	feat(shm): port `bwa shm` from bwa-mem v1	#65
`c20f61c`	feat(shm): add `bwa-mem2 shm --meth` for symmetric meth UX	#67
`ee18a3b`	refactor: rename bwa_mem2idx to bwa_mem3idx	—
`bb919f2`	feat: rename PG header to bwa-mem3 (ID, PN, usage strings)	—
`7a56f9a`	feat: rename meth PG header to bwa-mem3-meth and drive VN from PACKAGE_VERSION	—
`95c673d`	build: rename binary to bwa-mem3, update version guard and fallback	—
`2d5ad10`	test: rename test binaries from bwa_mem2_tests_* to bwa_mem3_tests_*	—
`31f214a`	chore: sweep bwa-mem2 -> bwa-mem3 in source comments and log messages	—
`ff40f96`	chore: rename BWAMEM2_* header guards to BWAMEM3_*	—
`c2c786a`	test: sweep bwa-mem2 -> bwa-mem3 in test, bench, and scripts	—
`148e431`	ci: update workflows for bwa-mem3 rename and main branch	—
`dddd8dd`	docs: rewrite README for bwa-mem3 (lineage attribution, drop upstream-only sections)	—
`34c8ea3`	docs: rename FG-MAIN.md to docs/whats-different.md	—
`5719617`	docs: update whats-different.md for bwa-mem3 and main branch	—
`bdd67f3`	docs: drop README-ori.md (lineage preserved in README + git history)	—
`924a70f`	docs: add 0.1.0-pre release notes and update status.md	—
`85d3b3b`	ci: drop master from branch filter (master branch removed from remote)	—
`2ea69db`	fix(test/meth): alias bwa-mem2 -> bwa-mem3 on PATH for bwameth.py oracle	#72
`4f805e6`	chore: rename shell vars BWAMEM2/BWA_MEM2[] to BWAMEM3/BWA_MEM3[]	#68
`8137740`	perf(kswv): add per-strip L1 prefetches to all u8/16 kernels	#70
`41e1f3c`	docs: add comprehensive mdbook on Read the Docs	#71
`442de25`	fix(fmi): parenthesize SA_COMPX_MASK precedence in sampled-SA prefetch	#73
`000c0fd`	perf(fmi): bump SMEM_LOCKSTEP_N from 8 to 16	#75
`b3a665e`	fix(bntseq): bound .alt parse buffer to prevent stack overflow	#74
`af33cdd`	feat(bns): convert mem_matesw_batch_{pre,post} to bns_fetch_seq_v2	#76
`9bb277a`	Update index.md	#79
`fdb244d`	perf(libsais_build): skip wasted zero-init on unpack + SA buffers	#80
`ff95a4f`	perf(ksort): replace per-call malloc with on-stack buffer for small n	#78
`7caf77c`	perf(ungapped): closed-form HIT for total_mis == 0	#77
`e65ceb2`	fix(profiling): clamp display_stats nthreads to LIM_C	#81
`ddfb0da`	feat(shm): serialize /bwactl RMW with a POSIX named semaphore	#82
`b9e0b66`	feat(simd): replace multi-binary execv launcher with single-binary in-process dispatch	#83
`7d27f23`	perf(build): default x86 single-binary baseline to avx2 (was sse41)	#84
`316dba6`	fix(matesw): copy ref slice before ksw_align2 to avoid SIGSEGV on shm-backed ref_string	#85
`427c81c`	perf(fmi): inline backwardExt to recover gcc 12+ wall-clock regression	#88
`c96d31a`	perf(x86): cap avx512bw autovec at 256-bit; bwa_shm /dev/shm preflight	#86
`23f528d`	ci: migrate parity tests from dwgsim/phiX174 to holodeck/chr22	#89
`ec67b09`	feat(meth): emit Bismark-compatible XR/XG/XM auxiliary tags	#90
`652ce0f`	docs(install): list autoconf/automake/libomp/zlib system prereqs	#93
`296b1b9`	docs(install): fix RHEL/Fedora package name pkgconfig → pkgconf-pkg-config	#94
`dc7fcfe`	feat(simd): add SIMD host-floor precheck for multi-arch deployment	#95
`3bc64b0`	docs: pre-release documentation pass for v0.2.0-pre	#96
`27a60c9`	chore(release): prep v0.2.0 release notes and metadata	#97

Additional fork-level changes

Vendored mimalloc allocator: ext/mimalloc is pinned at v3.3.0 and linked into every binary by default (USE_MIMALLOC=1). Linux uses --whole-archive static linkage; macOS uses dyld-interposed shared linkage. USE_MIMALLOC=1 is the supported and recommended default on all platforms; USE_MIMALLOC=0 is provided as a best-effort opt-out and is CI-gated on Linux x86 only. See Features for details.
--supp-rep-hard-cap INT (opt-in, default disabled): forces MAPQ=0 on supplementary alignments whose chain contains a seed with >=INT genome occurrences. Addresses the long-standing bwa/bwa-mem2 issue where a supp fragment that maps to many places standalone (e.g. a short read in a CCATCC repeat) inherits a high MAPQ from its primary because the supp’s competing repetitive chains get filtered out during the full-read pipeline and therefore never contribute to its sub/sub_n. See upstream #260 for the reporter case. Primary MAPQ is unaffected; default output is byte-identical to stock bwa-mem2. Typical values are 5–20 (lower = more aggressive); the upstream #260 repro drops from MAPQ=60 to MAPQ=0 at --supp-rep-hard-cap 18.
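The --supp-rep-hard-cap rule can be summarized in a few lines of Python. This is a sketch of the decision as described above, not the scoring code itself; the function name and the representation of a chain as a list of per-seed occurrence counts are assumptions.

```python
def supp_mapq(mapq, is_supplementary, seed_occurrences, hard_cap=None):
    """MAPQ after the optional supplementary hard cap.

    seed_occurrences: genome occurrence count of each seed in the chain.
    hard_cap=None models the default (feature disabled).
    """
    if hard_cap is None or not is_supplementary:
        return mapq  # primaries and default runs are unaffected
    if any(occ >= hard_cap for occ in seed_occurrences):
        return 0
    return mapq


print(supp_mapq(60, True, [1, 25], hard_cap=18))   # 0
print(supp_mapq(60, True, [1, 5], hard_cap=18))    # 60
print(supp_mapq(60, False, [1, 25], hard_cap=18))  # 60
print(supp_mapq(60, True, [1, 25]))                # 60
```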

Version stamping

PACKAGE_VERSION (the value reported by bwa-mem3 version and written to the @PG VN: SAM header field) is generated at build time by the Makefile from git describe --tags --dirty, e.g. v2.3-30-g61813ef for a tree 30 commits past upstream tag v2.3 at commit 61813ef.

No manual bumping required: cut a fresh release by tagging the commit (git tag -a vX.Y-fg-labs.N -m ...) and the next build picks it up.
Builds where git describe --tags fails (source-tarball extractions, or shallow clones / checkouts with no tag reachable from HEAD — including CI’s default actions/checkout fetch-depth of 1) fall back to the static FG_LABS_VERSION_FALLBACK in Makefile. Bump that when cutting a release that will be consumed as a tarball, or in CI artifacts.
src/version.h is generated and .gitignored; make clean removes it.
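For tooling that needs to interpret the stamped version, the git describe --tags --dirty format can be parsed as in the Python sketch below. The helper names and the fallback string in the example are hypothetical; the real fallback is whatever FG_LABS_VERSION_FALLBACK is set to in the Makefile.

```python
import re

# <tag>[-<commits ahead>-g<abbreviated sha>][-dirty]
_DESCRIBE = re.compile(
    r"(?P<tag>.+?)(?:-(?P<ahead>\d+)-g(?P<sha>[0-9a-f]+))?(?P<dirty>-dirty)?"
)


def parse_describe(text):
    """Split `git describe --tags --dirty` output into its components."""
    m = _DESCRIBE.fullmatch(text.strip())
    if m is None:
        raise ValueError(f"unrecognized describe output: {text!r}")
    return {
        "tag": m["tag"],
        "ahead": int(m["ahead"] or 0),
        "sha": m["sha"],
        "dirty": m["dirty"] is not None,
    }


def resolve_version(describe_output, fallback):
    """Use the describe output when available, else the static fallback."""
    text = (describe_output or "").strip()
    return text if text else fallback


print(parse_describe("v2.3-30-g61813ef"))
print(resolve_version(None, "0.0.0-fallback"))  # hypothetical fallback value
```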

Branching and update policy

master tracks upstream unchanged.
main is upstream/master plus the commits above. Rebased onto upstream roughly quarterly, or sooner when an upstream release we care about lands.
Contributions go via PR targeting main. CI and CodeRabbit gate merges.
Any PR that adds or removes a fork-carried commit must update the table above in the same PR.

Consuming

Clone this repo and check out main:

git clone https://github.com/fg-labs/bwa-mem3.git
cd bwa-mem3
git checkout main

Or vendor the branch into a downstream repo by pinning to a specific commit (not the branch tip) so your build is reproducible.

Relationship to upstream

We submit the generally-useful fixes and features carried here as PRs against bwa-mem2/bwa-mem2 when the upstream maintainers are actively merging; while they are not, fixes land here first and we drop them from main once they appear upstream.

Correctness Fixes

This page documents bugs present in upstream bwa-mem2 that bwa-mem3 fixes. Each fix is isolated to a single PR so it can be reviewed independently and dropped from main once upstream merges the equivalent patch.

`@PG CL:` tab escaping (PR #54)

When a read-group string is passed via -R '@RG\tID:x\tSM:y', the tab characters in the argument were copied verbatim into the @PG CL: SAM header field. The SAM specification uses tabs as field delimiters, so the resulting header line appeared to have extra ID: and other tag fields embedded inside CL:. Lenient parsers (samtools, htsjdk) tolerated the output; strict parsers (noodles, some fgbio configurations) rejected the file as malformed.

The fix replaces each tab character with a space when building the @PG CL: value in src/main.cpp. The @RG line itself is not modified, so the read-group metadata is preserved correctly. A regression shell test (test/pg_cl_escape_test.sh) asserts that the @PG line contains exactly five tab-separated fields after the fix. Upstream issue reference: bwa-mem2#293.
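The effect on the field count can be reproduced with a short Python sketch. It does not mirror the code in src/main.cpp: the helper name, the ID/PN/VN/CL field order, and the version string are assumptions for illustration.

```python
def build_pg_line(argv, version):
    """Build an @PG line whose CL: value contains no tab characters."""
    cl = " ".join(argv).replace("\t", " ")
    return "\t".join(
        ["@PG", "ID:bwa-mem3", "PN:bwa-mem3", f"VN:{version}", f"CL:{cl}"]
    )


# The read-group argument after the tab escapes have been expanded.
argv = ["bwa-mem3", "mem", "-R", "@RG\tID:x\tSM:y", "ref.fa", "reads.fq"]

# Pre-fix shape: tabs from -R are copied verbatim into CL:.
raw = "\t".join(
    ["@PG", "ID:bwa-mem3", "PN:bwa-mem3", "VN:0.2.0", "CL:" + " ".join(argv)]
)
fixed = build_pg_line(argv, "0.2.0")

print(len(raw.split("\t")))    # 7: two extra pseudo-fields leak out of CL:
print(len(fixed.split("\t")))  # 5
```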

SMEM buffer overflow on reads longer than 151 bp (PR #55)

bwa-mem2 hardcoded READ_LEN 151 in src/macro.h to size the per-thread matchArray SMEM buffer at compile time. The FMI walk wrote past this buffer without bounds checking when reads exceeded 151 bp, causing memory corruption that manifested as segfaults or silent wrong output on 300 bp MiSeq reads, error-corrected long reads, and any run with a non-default -k that extended seed length.

A second cap, MAX_READ_LEN_FOR_LOCKSTEP 512, guarded the lockstep driver’s per-slot stack arrays with a hard assert that aborted on anything longer.

The fix eliminates both compile-time caps. Every per-thread SMEM buffer is now heap-allocated on the memory management context (mmc) and grown on demand from each batch’s observed max_readlength. The pre-walk grow in mem_collect_smem sizes matchArray[tid] to BATCH_MUL * BATCH_SIZE * max_readlength, and all array writes are bounds-checked with a structured smem_overflow_die on overflow. Regression tests cover 300 bp, 1 kbp, and 3 kbp phiX reads; all three segfaulted before the fix and produce correct NM:i:0 alignments after. Upstream references: bwa-mem2#210 (issue), bwa-mem2#238 (closed unmerged upstream PR).
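The grow-on-demand policy can be sketched in Python as follows. The class is a model of the sizing rule only: the BATCH_MUL and BATCH_SIZE values are placeholders rather than the values in src/macro.h, the grow-only behavior between batches is an assumption, and OverflowError stands in for smem_overflow_die.

```python
BATCH_SIZE = 512  # placeholder value, for illustration only
BATCH_MUL = 20    # placeholder value, for illustration only


class SmemBuffer:
    """Model of a per-thread SMEM buffer sized from the observed read length."""

    def __init__(self):
        self.capacity = 0
        self.used = 0

    def grow_for(self, max_readlength):
        """Pre-walk grow: size the buffer for this batch's longest read."""
        needed = BATCH_MUL * BATCH_SIZE * max_readlength
        if needed > self.capacity:
            self.capacity = needed  # assumed grow-only across batches
        self.used = 0

    def push(self, n=1):
        """Bounds-checked write of n entries."""
        if self.used + n > self.capacity:
            raise OverflowError("SMEM buffer overflow")
        self.used += n


buf = SmemBuffer()
buf.grow_for(151)
print(buf.capacity)  # 1546240
buf.grow_for(300)
print(buf.capacity)  # 3072000
```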

`kswv` nrow==0 guard (PR #51)

When a SIMD batch contained only padding pairs (all len1 == 0), the DP loop never executed and nrow was zero. The post-loop rowMax + (i-1) * SIMD_WIDTH store still executed, walking SIMD_WIDTH bytes before the beginning of the rowMax allocation. On glibc this produced a free(): invalid pointer abort; on macOS libc it silently corrupted the heap.

The fix wraps the post-loop store in an if (i > 0) guard on all five SIMD kswv kernels: NEON u8, NEON 16, AVX2 u8, AVX-512BW u8, and AVX-512BW 16. The upstream patch bwa-mem2#289 covered only the two AVX-512BW kernels; bwa-mem3 broadens it to the three additional kernels carried in this fork. A dedicated regression test (test/kswv_nrow_zero_test.cpp) builds all-padding batches and verifies each kernel is clean under AddressSanitizer.
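The arithmetic behind the out-of-bounds store is easy to see in a Python sketch. The SIMD_WIDTH value is assumed for illustration, and the functions model only the offset computation, not the kernels themselves.

```python
SIMD_WIDTH = 64  # assumed lane count for illustration (one byte per lane)


def unguarded_offset(nrow):
    """Byte offset of the post-loop rowMax store without the guard."""
    i = nrow  # loop counter value once the DP loop has finished
    return (i - 1) * SIMD_WIDTH


def guarded_offset(nrow):
    """Offset with the `if (i > 0)` guard; None means the store is skipped."""
    i = nrow
    if i > 0:
        return (i - 1) * SIMD_WIDTH
    return None


print(unguarded_offset(0))  # -64: one SIMD row before the allocation
print(guarded_offset(0))    # None
print(guarded_offset(3))    # 128
```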

kswv score2 plateau series (PRs #26, #27, #28, #29, #30, #31)

The batched mate-rescue Smith-Waterman path (kswv) contains a family of related bugs across its SIMD kernels that inflated the suboptimal score (score2 / XS) and consequently deflated MAPQ relative to upstream bwa-mem2.

AVX-512BW dispatch guard (PR #26). GCC with -mavx512bw automatically defines __AVX2__, so the #elif __AVX2__ branch in src/kswv.h and src/kswv.cpp matched first on every AVX-512BW build. The 256-bit AVX2 kernel produced only 32-lane results into 64-lane score[]/te1[]/qe[] arrays sized for AVX-512BW; the upper 32 lanes held uninitialized values. mem_matesw_batch_post read those bogus te values, bwa_gen_cigar2 returned NULL, and mem_reg2aln triggered an a.cigar != NULL assertion on every AVX-512BW dispatch host (AWS c7a, c7i). The fix qualifies the #elif __AVX2__ guard with !__AVX512BW__, matching the existing pattern in bandedSWA.h. Closes issue #25.

AVX2 score2 plateau fix (PR #27 closed, PR #28 merged). The AVX2 256-bit kswv kernel added in PR #20 used a dense SIMD max over every rowMax row to compute the suboptimal score. Scalar ksw_u8 instead collapses consecutive rows above minsc into a single b[] entry anchored at the max-score row, then finds the best anchor outside the primary region. The dense max pulled in tail rows from a plateau whose anchor sat inside the primary region, inflating XS by 1–4 on a minority of reads and reducing MAPQ by 2–18 on those reads. PR #27 (closed) temporarily disabled the AVX2 batched path. PR #28 fixes the kernel itself by replacing the dense scan with a per-lane scalar emulation of the b[] build-and-scan logic.

NEON and AVX-512BW 8-bit port (PR #29). The same dense-rowMax score2 scan existed in kswv_neon_u8 and kswv_512_u8. Confirmed on ARM: rebuilding smoke-1M on darwin/arm64 pre-fix produced the identical four MAPQ regressions as the AVX2 case. PR #29 ports the per-lane scalar b[]-emulation fix to both kernels.

AVX-512BW 16-bit port (PR #30). kswv_512_16 carried four bugs: the same dense-rowMax plateau pattern, aggregate maxl/minh bounds instead of per-lane bounds (a gap from PR #21), no minsc filter, and no qe mask. The per-lane scalar emulation from PR #29 fixes all four naturally.

NEON 16-bit rewrite (PR #31). kswv_neon_16 was effectively dead code before this PR. Five interacting bugs produced 20,435 BAM diffs vs the scalar reference on smoke-1M -A 2: the score table reinterpreted int16 xor indices as int8 lookups (inflating match scores by ~256 per cell), the table was too small for the 16-bit SoA encoding, rowMax was never written, the early-exit fired on row 0 for all pairs without a KSW_XSTOP target, and none of the fixes from PRs #28–#30 had been applied. The PR rewrites the kernel from scratch against kswv_neon_u8’s structure using 32-byte int8 tables indexed via vqtbl2_s8, per-lane freeze, exit0 bitmap, and per-lane scalar score2.
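The difference between the dense scan and the plateau-collapsing scan described for PR #28 can be shown with a simplified, single-lane Python sketch. It follows the prose above rather than the actual ksw_u8 or kswv code: the function names, the inclusive [lo, hi] primary region, and the example scores are all assumptions for illustration.

```python
def dense_score2(row_max, minsc, lo, hi):
    """Buggy shape: best score over every row outside the primary region."""
    cands = [s for i, s in enumerate(row_max)
             if s >= minsc and not (lo <= i <= hi)]
    return max(cands, default=0)


def plateau_score2(row_max, minsc, lo, hi):
    """Collapse each run of consecutive rows >= minsc to its max-score row,
    then take the best anchor outside the primary region."""
    anchors = []     # one (score, row) per plateau
    prev_row = None  # last row that was >= minsc
    for i, s in enumerate(row_max):
        if s < minsc:
            continue
        if prev_row is not None and prev_row + 1 == i:
            if s > anchors[-1][0]:
                anchors[-1] = (s, i)  # plateau continues; move the anchor
        else:
            anchors.append((s, i))    # new plateau
        prev_row = i
    cands = [s for s, i in anchors if not (lo <= i <= hi)]
    return max(cands, default=0)


# Rows 3..6 form one plateau anchored at row 4 (score 50), inside the
# primary region [3, 5]; row 6 is its tail. Row 9 is a separate hit.
row_max = [0, 0, 10, 30, 50, 32, 31, 0, 0, 20, 0]
print(dense_score2(row_max, minsc=15, lo=3, hi=5))    # 31 (inflated)
print(plateau_score2(row_max, minsc=15, lo=3, hi=5))  # 20
```

The dense scan picks up the plateau's tail row just outside the primary region, which is the kind of XS inflation the series of fixes removes.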

`kseq2bseq1` zero-initialization (PR #22)

bseq_read_orig grows its sequence buffer with realloc, leaving tail entries uninitialized. kseq2bseq1 populated only name, comment, seq, qual, and l_seq for each entry, leaving sam, bams, n_bams, and cap_bams at whatever values realloc happened to return. PR #13 added an unconditional free(ret->seqs[i].bams) in the output loop (fastmap.cpp:571), which turned those garbage values into a crash once input exceeded the initial 256-sequence allocation: a "pointer being freed was not allocated" abort under system malloc and a SIGSEGV under mimalloc. The crash was deterministic and reproducible with -t1.

The fix is a single memset(s, 0, sizeof(*s)) at the top of kseq2bseq1.

Proper-pair flag from emitted alignment (PR #17)

In the no_pairing emission path of mem_sam_pe and mem_sam_pe_batch_post, the proper-pair bit (0x2) was computed from a[i].a[0].rb regardless of which alignment was actually emitted. When the primary’s alignment score fell below the reporting threshold opt->T but a non-primary ALT hit cleared it, mem_reg2aln emitted a[i].a[n_pri[i]] while mem_infer_dir still read the below-threshold primary. In that case the SAM flag did not reflect the coordinates in the record.

The fix stores the selected alignment index per mate in a which[2] array and passes a[i].a[which[i]].rb to mem_infer_dir, ensuring the proper-pair flag always matches the emitted record. The bug was present in the bwa-mem2 initial commit from 2019. Upstream reference: pre-existing bug, no open upstream PR at time of merge.

Changes catalog

Item	bwa-mem3 PR	Upstream PR/issue	Status
`@PG CL:` tab escape	#54	bwa-mem2#293	fork-only (open upstream issue)
SMEM buffer overflow on >151 bp reads	#55	bwa-mem2#238, bwa-mem2#210	fork-only (upstream PR closed unmerged)
kswv nrow==0 guard	#51	bwa-mem2#289	fork-only (upstream PR open)
AVX-512BW dispatch guard	#26	—	fork-only
AVX2 score2 plateau disable (superseded)	#27	—	closed (superseded by #28)
AVX2 score2 plateau fix	#28	—	fork-only
NEON + AVX-512BW 8-bit score2 fix	#29	—	fork-only
AVX-512BW 16-bit score2 fix	#30	—	fork-only
NEON 16-bit kernel rewrite	#31	—	fork-only
kseq2bseq1 zero-initialization	#22	—	fork-only
Proper-pair flag from emitted alignment	#17	—	fork-only

Performance Improvements

This page covers the performance work carried in bwa-mem3 on top of upstream bwa-mem2. Every change listed here preserves byte-identical SAM/BAM output vs the upstream baseline it was benchmarked against.

For current benchmark numbers across architectures and workloads, see bwa-mem3-bench, the canonical source of truth for benchmark methodology and results.

Lockstep SMEM batching (PR #33)

Seeding in bwa-mem2 advances one read’s SMEM walk at a time. Because each forward/backward extension step issues a random access into the cp_occ checkpoint array (~4 GB for human genome), the CPU stalls on cache misses between steps. Lockstep batching advances SMEM_LOCKSTEP_N reads’ SMEM walks in slot-interleaved round-robin order so that the out-of-order engine can overlap the cp_occ cache-miss loads for read i+N with the compute-bound walk of read i.

Each read slot (BatchSlot) carries its own prev[] walk buffer and match_buf[] reorder buffer. A tight recycling loop assigns finished slots to the next unprocessed read immediately. The match-emit cursor enforces input-index order so output is byte-identical to scalar. SMEM_LOCKSTEP_N is compile-time tunable; N=1 dispatches to the unchanged scalar path for bisection.
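The scheduling idea (round-robin slots, immediate recycling, in-order emission) can be modeled in Python with generators. This is a toy model of the control flow only, with no FM-index or prefetch behavior: lockstep and walk are hypothetical names, and walk stands in for one read's SMEM walk.

```python
def walk(read):
    """Stand-in for one read's SMEM walk: one step per base, then a result."""
    for _ in read:
        yield              # one extension step
    return read.upper()    # stand-in for the read's SMEM list


def lockstep(reads, walk_fn, n_slots):
    """Advance up to n_slots walks round-robin; emit results in input order."""
    pending = iter(enumerate(reads))
    slots = [None] * n_slots   # each slot holds (read_index, generator)
    done = {}                  # reorder buffer: read_index -> result
    next_emit = 0              # match-emit cursor

    def assign(k):
        nxt = next(pending, None)
        slots[k] = None if nxt is None else (nxt[0], walk_fn(nxt[1]))

    for k in range(n_slots):
        assign(k)
    while any(s is not None for s in slots):
        for k in range(n_slots):
            if slots[k] is None:
                continue
            idx, gen = slots[k]
            try:
                next(gen)                 # advance this slot by one step
            except StopIteration as stop:
                done[idx] = stop.value
                assign(k)                 # recycle the slot immediately
        while next_emit in done:          # enforce input-index order
            yield done.pop(next_emit)
            next_emit += 1


reads = ["acgtacgt", "ac", "acgt", "a"]
print(list(lockstep(reads, walk, n_slots=2)))
# ['ACGTACGT', 'AC', 'ACGT', 'A']
```

Short reads finish early and wait in the reorder buffer until every earlier read has been emitted, which is what keeps the output identical to a one-read-at-a-time walk.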

Measured improvement on 150 bp NovaSeq WGS (1M pairs, hg38, Graviton3 r7g.4xlarge, 8 threads): −6.1% wall time (82 s → 77 s). The backwardExt hot cp_occ load share dropped from 65.5% to 53.3% of function time — direct evidence that the OoO engine is overlapping cross-slot loads. On 300 bp MiSeq reads the workload is SW-dominated (~85% of cycles in kswv kernels) and the SMEM improvement is within noise; parity holds.

Supersedes PR #15 (cross-read _mm_prefetch shape), which regressed on Graviton3.

Batched `-H` header ingestion (PR #49, closes issue #37)

Passing a large header file via -H <file> re-ran strlen on the growing header string and called realloc on every input line, making ingestion O(n²) in the number of header lines. For a ~70 MB / ~1.5 M-line header (reported in upstream bwa-mem2#204) this caused runtimes exceeding 10 minutes before alignment started.

The fix introduces bwa_insert_header_file, a batched helper that determines the file size with fseek/ftell, allocates a single buffer, copies all @-prefixed lines in one pass, and calls bwa_insert_header once. The fix also addresses four correctness gaps in the upstream PR #204: the return-value assignment was dropped (leaving hdr_line stale after realloc), const FILE* caused compiler warnings, empty files were not guarded, and each fgets was not bounded by remaining buffer. A regression test (test/header_insert_test.cpp) diffs the batched path against the pre-patch per-line baseline across eight edge cases.
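The single-pass shape of the batched helper can be sketched in Python. The C helper sizes one buffer from fseek/ftell; the sketch uses a list and a single join as the closest Python analogue. The function name is hypothetical.

```python
import io


def read_header_lines(stream):
    """Collect every '@'-prefixed line in one pass and join once."""
    kept = [line.rstrip("\n") for line in stream if line.startswith("@")]
    return "\n".join(kept)


header = io.StringIO("@SQ\tSN:chr1\tLN:100\n# comment\n\n@CO\tnote\n")
print(repr(read_header_lines(header)))
# '@SQ\tSN:chr1\tLN:100\n@CO\tnote'
print(repr(read_header_lines(io.StringIO(""))))  # ''
```

Joining once keeps the work linear in the total header size, whereas appending to a growing string and rescanning it on every line is what made the original path quadratic.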

libsais FM-index construction (PR #57)

bwa-mem3 index now builds the FM-index using libsais v2.9.1 (Ilya Grebnov) instead of the sais-lite (Yuta Mori saisxx) library that bwa-mem2 inherited. libsais is actively maintained, supports OpenMP-parallel induced sorting, and produces a byte-identical FM-index. No changes are required to existing indexes — bwa-mem3 reads index files built by bwa-mem2 index without re-indexing.

For a human reference (GRCh38 + decoys), libsais reduces indexing wall time and peak memory vs sais-lite. Exact numbers depend on thread count and available RAM; see the PR body for measurements on Graviton3.

Consolidated mapping speedups (PR #58)

PR #58 is a multi-phase performance audit of bwa-mem2’s hot path, squashed and rebased onto main. It incorporates improvements across five subsystems:

ksw2 banded SW — tuned the band extension loop to reduce redundant computation in the common case.
SMEM lockstep batching — additional refinements on top of PR #33.
SAL prefetch — prefetch hints for the suffix array lookup hot path.
SAM record building — reduced per-record allocation in the text formatting path.
PGO build — the opt-in profile-guided optimization target (see also Performance → PGO build) is included in this suite.

On the smoke-1M workload (1M PE 150 bp reads, hg38, Graviton3 r7g.4xlarge, 16 threads, warm page cache), this PR contributed the largest single-step wall time reduction in the main branch’s performance history. Benchmark details are maintained at bwa-mem3-bench.

Changes catalog

Item	bwa-mem3 PR	Upstream PR/issue	Status
Lockstep SMEM batching	#33	—	fork-only
Batched `-H` header ingestion	#49	bwa-mem2#204	fork-only (upstream PR open)
Large header performance (issue)	—	issue #37	closed by #49
libsais FM-index construction	#57	—	fork-only
Consolidated mapping speedups	#58	—	fork-only

Features

This page covers user-facing features added to bwa-mem3 on top of upstream bwa-mem2. None of these features change default behavior: output produced by bwa-mem3 mem without any of these flags is byte-identical to the corresponding bwa-mem2 output (except for the @PG ID: and PN: fields which now read bwa-mem3).

`--meth` bisulfite alignment mode (PR #13)

--meth turns bwa-mem3 index and bwa-mem3 mem into a single-binary drop-in replacement for the entire bwameth.py pipeline. No Python, no separate post-processing step, no bwameth.py dependency.

bwa-mem3 index --meth ref.fa          # once per reference
bwa-mem3 mem --meth ref.fa R1.fq R2.fq | samtools sort -o out.bam

index --meth writes <ref>.bwameth.c2t — a doubled reference with f/r-prefixed contigs and C→T / G→A projection, byte-identical to the index that bwameth.py index-mem2 produces.

mem --meth performs the following steps:

Inline conversion: C→T on R1 and G→A on R2 before seeding. The pre-conversion bases are stashed on an internal YS:Z / YC:Z carrier in bseq1_t.comment; both are suppressed at BAM emit.
Header consolidation: the f/r contig pairs are collapsed back to one @SQ per real chromosome.
Bismark-compatible tags: XR:Z (read conversion direction), XG:Z (genome strand), and XM:Z (per-base methylation call string) are emitted on every record.
Chimera QC (optional, enabled by --chimera-qc): when the longest M/=/X run is < 44% of read length, set 0x200, clear proper-pair 0x2, and cap MAPQ at 1.
SEQ restoration: the internal pre-conversion sequence is copied back into the BAM SEQ field for CpG-calling tools.
@PG entry: a @PG ID:bwa-mem3-meth line is written.
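The chimera QC rule described above is compact enough to sketch. The Python below is a minimal illustration, not the bwa-mem3 implementation: the function names are hypothetical, the read length is passed in explicitly, and "run" is interpreted as a single M, = or X CIGAR operation, which is an assumption.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

FLAG_PROPER_PAIR = 0x2   # SAM flag: read mapped in proper pair
FLAG_QC_FAIL = 0x200     # SAM flag: not passing quality controls


def longest_match_run(cigar):
    """Length of the longest single M, = or X operation in a CIGAR string."""
    runs = [int(n) for n, op in CIGAR_OP.findall(cigar) if op in "M=X"]
    return max(runs, default=0)


def apply_chimera_qc(flag, mapq, cigar, read_len):
    """Return (flag, mapq) after applying the 44% rule."""
    # Integer form of: longest_run < 0.44 * read_len
    if longest_match_run(cigar) * 100 < 44 * read_len:
        flag = (flag | FLAG_QC_FAIL) & ~FLAG_PROPER_PAIR
        mapq = min(mapq, 1)
    return flag, mapq
```

For example, a 100 bp read aligned as 30M70S with flag 99 and MAPQ 60 comes back as flag 609 and MAPQ 1, while a 100M alignment is returned unchanged. Handling of unmapped records (CIGAR `*`) is deliberately left out of this sketch.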

On the bwameth.py example fixture (92,684 reads), end-to-end output is byte-identical on chrom, pos, CIGAR, and SEQ vs the bwameth.py oracle. Stacks on PR #12 (--bam). See the Methylation Reference for full details.

Vendored mimalloc allocator (PR #19)

bwa-mem3 vendors mimalloc v3.3.0 as a pinned submodule at ext/mimalloc and links it into every binary by default (USE_MIMALLOC=1). On Linux, static linkage uses --whole-archive; on macOS, dyld-interposed shared linkage is used.

Measured on AWS c7g.4xlarge (Graviton3, 16 threads, 29M 150 bp paired-end exome-capture reads vs hg38, page cache dropped between iterations): −19.7% wall-clock time (528.6 s → 424.7 s, a 1.24× speedup) compared to the same build with USE_MIMALLOC=0. No user-visible interface change; no runtime configuration required.

USE_MIMALLOC=0 is a supported best-effort opt-out and is CI-gated on Linux x86. bwa-mem3 version prints the mimalloc version string when it is active.

`--supp-rep-hard-cap` supplementary MAPQ rescoring (PR #56)

Supplementary alignments for a split read inherit MAPQ from the full-read scoring pipeline. Competing repetitive chains for the supplementary fragment are filtered out during full-read chain scoring (mem_chain_flt) before Smith-Waterman, so they never contribute to sub/sub_n. A supplementary fragment landing in a CCATCC repeat that would map equally well to 50+ locations on its own can therefore carry MAPQ=60 from its primary.

--supp-rep-hard-cap INT opts into rescoring: if any seed in a supplementary alignment’s chain has >=INT genome occurrences (from the SMEM SA count), the supplementary MAPQ is forced to 0. Primary alignment MAPQ and coordinates are unaffected. Default output (no flag) is byte-identical to upstream bwa-mem2.

The SMEM SA-occurrence count is preserved on each seed as mem_seed_t.n_hits and propagated to mem_alnreg_t.chain_n_hits during chain-to-alignment conversion. Typical values for INT are 5–20; lower is more aggressive. The upstream bwa-mem2#260 reporter case drops from MAPQ=60 to MAPQ=0 at --supp-rep-hard-cap 18. Closes issue #46.
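The rescoring rule itself is a single threshold test. The sketch below models it in Python under stated assumptions: the function name and argument layout are hypothetical, and the per-seed occurrence counts are passed as a plain list rather than read from mem_seed_t.n_hits.

```python
def rescore_mapq(mapq, is_supplementary, seed_n_hits, hard_cap=None):
    """Apply the --supp-rep-hard-cap rule to one alignment.

    mapq             MAPQ assigned by the normal scoring pipeline
    is_supplementary True for supplementary alignments only
    seed_n_hits      genome occurrence count of each seed in the chain
    hard_cap         the INT given to --supp-rep-hard-cap, or None if unset
    """
    if hard_cap is None or not is_supplementary:
        return mapq  # flag absent, or a primary: MAPQ is untouched
    if any(n >= hard_cap for n in seed_n_hits):
        return 0     # at least one seed is too repetitive
    return mapq
```

With a cap of 18, a supplementary alignment whose chain contains a seed with 50 occurrences goes from MAPQ 60 to 0, while one whose most repetitive seed has 17 occurrences keeps its MAPQ.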

Shared-memory index: `bwa-mem3 shm` (PR #65)

bwa-mem3 mem reloads the FM-index from disk on every invocation. For hg38 the index is ~28 GB; for short alignment jobs (targeted panels, small sample batches) this load cost dominates runtime and makes per-invocation IOPS the bottleneck.

PR #65 ports the bwa shm command from bwa-mem v1 to bwa-mem3 with strict v1 CLI parity:

bwa-mem3 shm <index-prefix>    # load index into shared-memory segment once
bwa-mem3 mem <index-prefix> ...  # subsequent runs attach instead of re-reading
bwa-mem3 shm -d <index-prefix>  # detach and free the segment

The index lives in a POSIX shared-memory segment. Multiple bwa-mem3 mem processes on the same host share the same in-memory copy. Closes issue #64.

Warning — Stale index

bwa-mem3 shm does not detect when the on-disk index has been rebuilt. Always run bwa-mem3 shm -d <prefix> before running bwa-mem3 index and then re-stage with bwa-mem3 shm <prefix>. Using a stale shared-memory segment produces silently wrong alignments.
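One way to make the detach, rebuild, re-stage order hard to get wrong is to wrap it in a small helper. The Python sketch below is an assumption-laden illustration: the helper name rebuild_and_restage and the injectable run parameter are hypothetical, and only the three command lines come from this page.

```python
import subprocess


def rebuild_and_restage(prefix, run=subprocess.run):
    """Detach the staged index, rebuild it on disk, then stage it again."""
    steps = [
        ["bwa-mem3", "shm", "-d", prefix],  # drop the stale segment first
        ["bwa-mem3", "index", prefix],      # rebuild the on-disk index
        ["bwa-mem3", "shm", prefix],        # stage the fresh index
    ]
    for cmd in steps:
        run(cmd, check=True)                # stop at the first failure
    return steps
```

The run parameter exists so the sequence can be dry-run or logged without touching shared memory. Whether shm -d exits non-zero when nothing is staged is not covered here; if it does, the first step needs its own error handling.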

`bwa-mem3 shm --meth` (PR #67)

bwa-mem3 mem --meth <prefix> auto-appends .bwameth.c2t to locate the methylation index built by bwa-mem3 index --meth <prefix>. Before PR #67, staging a methylation index in shared memory required passing the full .bwameth.c2t-suffixed path to shm while continuing to pass the plain prefix to mem. The mismatch was easy to forget, and the failure mode — a run that silently attached the wrong segment — was difficult to diagnose.

PR #67 adds --meth support to bwa-mem3 shm so the same plain-prefix convention works end-to-end:

bwa-mem3 shm --meth ref.fa       # stages ref.fa.bwameth.c2t
bwa-mem3 mem --meth ref.fa ...   # attaches automatically
bwa-mem3 shm -d --meth ref.fa   # detaches

`HN:i` hit count tag (PR #42)

Every primary SAM/BAM record now carries an HN:i:<n> tag reporting the number of secondary alignment candidates clustered with this primary under XA_drop_ratio. This count is captured before the -h/max_XA_hits cap truncates the XA:Z: string, so HN reports the true number of alternate loci even when no XA:Z: field appears in the record.

This makes it possible to distinguish:

HN:i:0 + no XA:Z: — genuinely unique mapper.
HN:i:N + XA:Z:... (N ≤ -h) — multi-mapper with all alternates listed.
HN:i:N + no XA:Z: (N > -h) — multi-mapper whose alternates were suppressed by the cap.
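The three cases above translate directly into a small classifier over the optional fields of a SAM record. This Python sketch is illustrative only; the function name and the returned labels are hypothetical.

```python
def classify_multimapping(sam_line):
    """Classify one SAM record by its HN:i and XA:Z tags.

    Returns None when the record carries no HN tag (for example a
    non-primary record, or a run with -a).
    """
    fields = sam_line.rstrip("\n").split("\t")
    tags = {f[:2]: f for f in fields[11:]}  # optional fields start at column 12
    if "HN" not in tags:
        return None
    hn = int(tags["HN"].split(":", 2)[2])
    has_xa = "XA" in tags
    if hn == 0 and not has_xa:
        return "unique"
    if hn > 0 and has_xa:
        return "multi-mapper, alternates listed"
    if hn > 0:
        return "multi-mapper, alternates suppressed by -h cap"
    return "unexpected"  # HN:i:0 together with an XA:Z: tag
```

A downstream filter can then, for example, treat only the "unique" class as uniquely mapped instead of inferring uniqueness from a missing XA:Z: tag.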

Motivated by lh3/bwa#438, which adds HN to bwa aln. HN is emitted in both SAM (mem_aln2sam) and BAM (mem_aln_to_bam) paths and is absent when -a (MEM_F_ALL) is active.

`--bam=LEVEL` direct BAM output (PR #12)

bwa-mem3 mem --bam (or --bam=0 through --bam=9) emits BAM directly via htslib, bypassing the SAM-text-to-BAM conversion round trip that normally occurs when the output is piped to samtools view -bS.

--bam / --bam=0: uncompressed BAM (BGZF framing only) — near-zero CPU overhead, smaller than SAM text, fast downstream parsing.
--bam=1..9: BGZF deflate at the specified level.
No flag: SAM text on stdout (default, unchanged).

The implementation adds src/bam_writer.{h,cpp}, a new module that converts mem_aln_t to bam1_t via mem_aln_to_bam. htslib v1.21 is pulled in as a submodule at ext/htslib. On the bwameth.py example fixture (92,961 records), samtools view of --bam output vs SAM text produces a zero-line diff across all 11 SAM columns and all aux tags. See Best Practices → Output format for the recommended pipeline.

Changes catalog

Item	bwa-mem3 PR	Upstream PR/issue	Status
`--meth` bisulfite alignment mode	#13	—	fork-only
Vendored mimalloc allocator	#19	—	fork-only
`--supp-rep-hard-cap` MAPQ rescoring	#56	bwa-mem2#260	fork-only (upstream issue open)
`bwa-mem3 shm` shared-memory index	#65	—	fork-only
`shm --meth` symmetry	#67	—	fork-only
`HN:i` hit count tag	#42	lh3/bwa#438	fork-only (analogous to bwa aln)
`--bam=LEVEL` direct BAM output	#12	—	fork-only

Architecture Support

This page covers the architecture-specific build and runtime work carried in bwa-mem3. The goal is a single codebase that builds cleanly on all supported targets and runs the best available SIMD kernels on each.

For the full dispatch matrix and runtime selection logic, see Performance → SIMD dispatch matrix and Developer Guide → SIMD dispatch architecture.

Linux ARM64 / aarch64 build (PR #1)

The Apple Silicon work that reached the fork in commit ae73227 gated ARM behavior on $(UNAME_M) == arm64. On macOS, uname -m returns arm64. On Linux ARM64, it returns aarch64. The Makefile’s ifeq check therefore fell through to the x86 multi target on every Linux aarch64 host, failing with:

g++: error: unrecognized command-line option '-msse'

PR #1 introduces an IS_ARM variable ($(filter $(UNAME_M),arm64 aarch64)) that matches both names. All four architecture-conditional blocks in the Makefile are rewritten to use IS_ARM: the NEON/sse2neon flag block, the x86 arch-specific block, the ARM64 single-binary build block, and the multi target ARM64 short-circuit. The CI workflow is extended to trigger on pushes to fg-main (the integration branch at the time of PR #1, renamed to main in the 0.1.0-pre release) and adds an ubuntu-24.04-arm matrix row so the aarch64 path is exercised on every PR.

`arch=avx512bw` explicit build target (PR #16)

The AVX-512 Smith-Waterman kernels in bwa-mem2 are guarded by the __AVX512BW__ preprocessor macro — not __AVX512F__. The only way to build them before this PR was arch=avx512, but the (then) make multi rule emitted the dispatch binary as bwa-mem2.avx512bw. The build selector (avx512), the preprocessor guard (__AVX512BW__), and the dispatcher suffix (.avx512bw) disagreed.

PR #16 added arch=avx512bw as an explicit Makefile target with flags -mavx512f -mavx512bw and switched the multi-binary make path to use it. The legacy arch=avx512 was preserved as an alias with identical flags. No C++ was changed; the fix was 11 insertions and 2 deletions in the Makefile.

PR #83 has since replaced the multi-binary scheme with a single binary that compiles each kernel TU at every supported tier and dispatches in process; the avx512bw tier name and flag set survived the transition unchanged, and the arch=avx512bw build target remains the single-arch fallback for clusters with uniform AVX-512BW hardware. The pre-#16 mismatch between selector, guard, and suffix is therefore resolved in both the historical multi-binary layout and the current single-binary layout.

This is a pure build-correctness fix: before PR #16, arch=avx512bw and the legacy multi-binary build on AVX-512BW hardware silently compiled the wrong kernel (see Correctness → AVX-512BW dispatch guard for the downstream effect).

NEON kswv mate-rescue (PR #18)

bwa-mem2 has a batched mate-rescue Smith-Waterman path (BWAMEM_BATCHED_MATESW) that uses SIMD kswv kernels to score rescue candidates in parallel. On ARM64 the gate was __AVX512BW__, which is never true on NEON hardware. The NEON kswv::getScores8 kernel existed in the source but was unreachable in production.

PR #18 enables this path on ARM64 by replacing the __AVX512BW__ gate with a new BWAMEM_BATCHED_MATESW macro that fires on NEON/Apple Silicon as well. Along the way, four kernel bugs were found and fixed:

te split — the te (target end) value needed separate hi/lo tracking for 16-lane u8 batches.
Freeze mask — a frozen_vec mask now gates gmax/te/qe updates after KSW_XSTOP fires, preventing stale values from escaping to the score2 scan.
Per-lane score2 exclusion — len1, low/high, and qe masks were not applied per-lane in Loop 1, allowing lanes without a valid primary to contribute spurious suboptimal scores.
minsc filter on rowMax — sub-minsc plateau scores were leaking into score2 because the scalar ksw_u8 gating condition (imax >= minsc) was not replicated.

Measured on an M-series Mac (8 threads, 500k PE 100 bp reads on chr17): 1.42× speedup (−29.4% wall time) with byte-identical sorted SAM output.

AVX2 kswv mate-rescue (PR #20)

PR #18 enabled batched mate-rescue on ARM64. Most x86 production deployments (AWS c6a, c6i, older Xeons) use AVX2 without AVX-512BW and were excluded from the same gate. PR #20 extends the batched path to AVX2 by adding a 256-bit kswv256_u8 kernel and widening BWAMEM_BATCHED_MATESW to fire on __AVX2__.

The AVX2 kernel is a direct port of the corrected NEON kernel from PR #18, with an additional fix for per-lane te2 tracking (_mm256_blendv_epi8 on a sign-extended 8→16 bit mask). Verified byte-identical sorted SAM vs the pre-BWAMEM_BATCHED_MATESW scalar control on EC2 m5.xlarge (Skylake-SP, 4 threads, 500k chr17 PE pairs).

Note: PR #20 introduced a score2 plateau regression in the AVX2 kernel that was identified and fixed in the correctness series (PRs #27, #28, #29).

Changes catalog

Item	bwa-mem3 PR	Upstream PR/issue	Status
Linux ARM64 / aarch64 build + CI	#1	bwa-mem2#288	fork-only (upstream PR open)
`arch=avx512bw` explicit target	#16	—	fork-only
NEON kswv mate-rescue kernel	#18	—	fork-only
AVX2 kswv mate-rescue kernel	#20	—	fork-only

Build & Infrastructure

This page covers the build-system, testing, and CI infrastructure changes carried in bwa-mem3 on top of upstream bwa-mem2.

doctest framework and Codecov (PR #34)

PR #34 establishes the long-term test infrastructure for bwa-mem3:

doctest 2.4.11 is vendored as a single-header under ext/doctest/, with the SHA256 recorded in ext/doctest/VERSION.
A new test/framework/ static library provides shared helpers: scoring matrices, deterministic sequence-pair generators, kswv-style batch packers, scalar and SIMD runners, kswr comparators, a JUnit reporter hook, and a shared main.
Two test binaries are produced: bwa_mem3_tests_unit (runs on every CI matrix row) and bwa_mem3_tests_integration (runs on a subset of rows).
The existing kswv_selftest is ported to test/unit/test_kswv_correctness.cpp — 30,049 assertions against scalar ksw_align2 on 10k random plus curated edge pairs.
Five legacy integration sources are moved to test/integration/ via git mv; their binaries still emit at test/<name> so existing scripts keep working.
Five inline CI bash regression blocks are extracted to test/regression/*.sh (phix_parity, chr22_parity, thread_determinism, bam_roundtrip, meth_oracle).
A coverage CI job builds libbwa.a and both test binaries with COVERAGE=1 (-O0 --coverage), runs both test binaries, collects Cobertura XML via gcovr, and uploads to Codecov via codecov/codecov-action.

`PACKAGE_VERSION` from `git describe` (PR #52)

Before PR #52, src/main.cpp hardcoded PACKAGE_VERSION "2.2.1". This string appeared in bwa-mem3 version output and in the @PG VN: SAM header field but was never updated, causing every build to report an outdated version.

The Makefile now generates src/version.h from git describe --tags --dirty, falling back to a static FG_LABS_VERSION_FALLBACK when git describe cannot reach a tag (source-tarball extractions, shallow clones — e.g. CI with the default fetch-depth: 1). A write-if-changed mechanism (cmp -s + mv) regenerates the file on every invocation but only bumps its mtime when the stamped string changes, so only main.o is rebuilt when the version changes, not the entire tree. src/version.h is .gitignored and removed by make clean. Fixes issue #40. Related upstream: bwa-mem2#283, bwa-mem2#284.
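The write-if-changed idea is independent of make and can be sketched in Python. This is a model of the cmp -s + mv pattern, not the Makefile recipe itself; the function name is hypothetical and the version strings in the usage note are example values.

```python
import os
import tempfile


def write_if_changed(path, content):
    """Write content to path only when it differs from what is on disk.

    Returns True when the file was (re)written and False when it was left
    untouched, in which case its mtime is preserved.
    """
    try:
        with open(path, "r", encoding="utf-8") as fh:
            if fh.read() == content:
                return False  # same text: do not bump mtime
    except FileNotFoundError:
        pass                  # first build: fall through and create it
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        fh.write(content)
    os.replace(tmp, path)     # atomic rename, the analogue of mv
    return True
```

Calling it twice with the same `#define PACKAGE_VERSION "..."` line leaves the header's mtime alone on the second call, so a timestamp-based build tool sees nothing to rebuild; a changed string replaces the file and triggers the dependent object.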

PGO target parameterization (PR #59)

The original pgo-generate and pgo-use Makefile targets hardcoded arch=arm64 and a single shared pgo_profiles/ directory. PR #59 generalizes both:

PGO_ARCH (default: arm64 on ARM hosts, native otherwise) passes through to the recursive make invocation as arch=$(PGO_ARCH). Accepts the same values as the rest of the Makefile: arm64, sse41, avx2, avx512bw, native, etc.
PGO_PROFILE_DIR is now overridable (?= instead of =). Each (arch × training-regime) combination can capture into its own directory.
When PGO_ARCH != arm64, the output binaries are named bwa-mem3.pgo-instr.<arch> and bwa-mem3.pgo.<arch> so multiple per-arch PGO builds coexist. The default arm64 names are unchanged for backward compatibility.
pgo-clean now removes arch-suffixed PGO binaries in addition to bare names.

This enables the benchmarking workflow at bwa-mem3-bench, which requires per-arch × per-regime profile capture. See also Performance → PGO build.

`CXXFLAGS`/`CPPFLAGS`/`LDFLAGS` forwarding (PR #50)

At the time of PR #50, the Makefile’s multi: rule compiled runsimd.cpp (the x86 multi-binary launcher) without honoring CXXFLAGS, CPPFLAGS, or LDFLAGS. The $(EXE) link honored CXXFLAGS and LDFLAGS but not CPPFLAGS. PR #83 has since replaced the multi-binary scheme with a single binary that builds via the single: target (the default), and that target inherits the same flag-forwarding behavior.

PR #50 mirrored upstream bwa-mem2#290: the compile rules now honor all three variables, and $(EXE) link adds $(CPPFLAGS). This allows downstream packagers (Debian, Bioconda) and reproducible-build systems to inject hardening flags (-D_FORTIFY_SOURCE=2, -fstack-protector-strong, -Wl,-z,relro) through the environment without patching the Makefile. No functional change unless the env vars are set. Closes issue #39.

Unit-test harness and ARM CI (PR #23)

Historically, PR #23 added a local bash harness (test/run_unit_tests.sh) that built and ran the five C++ unit binaries under test/ against committed fixtures in test/fixtures/, asserting exit 0 and non-empty diff-able output (those binaries have since been consolidated into the doctest harness; see the section above). It also fixed several pre-existing issues that blocked the harness:

test/Makefile defaulted to icpc (Intel compiler, not available on GitHub runners); changed to g++ on Linux x86.
ARM flags are mirrored from the parent Makefile so cd test && make builds on macOS arm64 and Linux aarch64.
Three test sources (smem2_test, bwt_seed_strategy_test, sa2ref_test) were missing the fmiSearch->load_index() call that fmi_test.cpp has, causing immediate segfaults on run.
test/main_banded.cpp opened fksw.txt but never wrote to it; output is now written and main() returns 0 on success.
Fixtures are added under test/fixtures/ covering phiX174, 50 bp test reads, BWT seed strategy inputs, SA pairs, and SW pairs.

CI matrix expansion (PR #24)

PR #24 stacks on PR #23 and expands the GitHub workflow .github/workflows/ci.yml from 5 matrix rows to 7:

Row	Runner	Arch	Role
1	ubuntu-latest	sse41	smoke + unit tests
2	ubuntu-latest	avx2	canonical deep tests
3	ubuntu-latest	avx2 (no mimalloc)	unchanged
4	ubuntu-24.04-arm	arm64	unchanged
5	macos-latest	arm64	unchanged
6 (new)	ubuntu-latest	multi	runsimd dispatcher smoke
7 (new)	ubuntu-latest	avx2 clang++	Linux Clang smoke

The canonical row (row 2) adds: --bam=6 roundtrip record-count parity, thread-determinism (-t1 vs -t4 sorted diff), unit-test harness, chr22 pipeline parity vs bwa, SE smoke, interleaved smoke, and --meth Layers 1–3.
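The thread-determinism check boils down to "same records, any order". The Python sketch below shows that comparison on SAM text; it is not the contents of test/regression/thread_determinism.sh, the function names are hypothetical, and skipping header lines (whose @PG CL: differs with -t) is an assumption about what such a check needs.

```python
def alignment_records(sam_text):
    """Sorted alignment lines of a SAM file, header lines excluded.

    Header lines start with '@'; alignment lines cannot, because the SAM
    specification does not allow QNAME to begin with '@'.
    """
    return sorted(
        line for line in sam_text.splitlines()
        if line and not line.startswith("@")
    )


def same_alignments(sam_a, sam_b):
    """True when both SAM texts contain the same records in any order."""
    return alignment_records(sam_a) == alignment_records(sam_b)
```

Feeding it the output of a -t1 run and a -t4 run of the same input should return True if threading only changes emission order.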

Changes catalog

Item	bwa-mem3 PR	Upstream PR/issue	Status
doctest framework + Codecov	#34	—	fork-only
`PACKAGE_VERSION` from `git describe`	#52	bwa-mem2#283, bwa-mem2#284	fork-only (upstream issue + PR open)
PGO target parameterization	#59	—	fork-only
`CXXFLAGS`/`CPPFLAGS`/`LDFLAGS` forwarding	#50	bwa-mem2#290	fork-only (mirrors open upstream PR)
Unit-test harness + ARM CI	#23	—	fork-only
CI matrix expansion	#24	—	fork-only

`BASELINE_ARCH=avx512bw` build flag

This page documents the empirical perf characterization of building bwa-mem3 with BASELINE_ARCH=avx512bw and the -mprefer-vector-width=256 mitigation that ships as part of that target.

Background: BASELINE_ARCH

bwa-mem3 ships a single x86 binary with all five SIMD tiers (sse41 / sse42 / avx / avx2 / avx512bw) compiled for the hand-tuned kernel TUs (KERNEL_SRCS in the Makefile: bandedSWA, kswv, ksw, sam_encode). The runtime dispatcher in src/simd_dispatch.cpp picks the right tier per kernel call based on __builtin_cpu_supports.

Everything outside KERNEL_SRCS — bwamem.cpp, bwamem_pair.cpp, FMI_search.cpp, fastmap.cpp, bntseq.cpp, etc. — is compiled once at the tier set by the BASELINE_ARCH Makefile variable (default: avx2). The compiler can auto-vectorize loops in those TUs at up to that tier’s width.
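The per-kernel tier selection described above amounts to "highest compiled tier the host supports". The Python sketch below models only that selection rule; the real dispatcher in src/simd_dispatch.cpp is C++ built on __builtin_cpu_supports, and the pick_tier name and the set-of-strings host description are hypothetical.

```python
TIERS = ["sse41", "sse42", "avx", "avx2", "avx512bw"]  # lowest to highest


def pick_tier(host_features, compiled_tiers=TIERS):
    """Return the highest tier that is both compiled in and host-supported.

    host_features   set of tier names the CPU reports support for
    compiled_tiers  tiers the kernel translation units were built at
    Returns None when no compiled tier can run on the host.
    """
    best = None
    for tier in TIERS:  # walk upward so the last match is the highest
        if tier in compiled_tiers and tier in host_features:
            best = tier
    return best
```

An AVX2-only host resolves to avx2 and an AVX-512BW host resolves to avx512bw from the same binary, which is why only the non-kernel translation units depend on BASELINE_ARCH.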

PR #84 raised the default from sse41 to avx2 after measuring ~10-15% wall-time gains on AVX2 hosts (c6a, etc.) when the auto-vectorizer could finally widen hot non-kernel loops to 256-bit.

The naive expectation: avx2 → avx512bw should give another tier

Following PR #84’s logic, you might expect BASELINE_ARCH=avx512bw to unlock another ~10-15% on AVX-512BW hosts (c7a, c7i, m7i) by widening auto-vectorization to 512-bit. It does not. The avx2 → avx512bw transition has fundamentally different hardware economics from the sse41 → avx2 transition.

The two AVX-512 perf hazards

1. AMD Zen 4 µop-split (c7a)

AMD’s Zen 4 cores (c7a / Genoa, c8a / Bergamo) implement 512-bit AVX-512 operations by issuing 2× 256-bit µops per 512-bit op. For auto-vectorized loops:

Iteration latency doubles.
Iteration count only halves if the trip count is large enough. Short-trip loops eat the 2× latency without amortizing.
512-bit instruction encodings are larger → more I-cache pressure.

Net: loops that auto-vectorized productively at 256-bit AVX2 lose performance when the compiler widens them to 512-bit.
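A toy cost model makes the short-trip argument concrete. The Python below is a deliberately crude sketch: it counts one cycle per µop, ignores I-cache effects, and takes the 2× µop split from the description above. It is not a simulation of Zen 4, and the function name is hypothetical.

```python
import math


def loop_cycles(n_bytes, vector_bits, uops_per_op=1):
    """Cycles for a byte-wise loop under a one-cycle-per-µop toy model."""
    lanes = vector_bits // 8                  # bytes processed per iteration
    iterations = math.ceil(n_bytes / lanes)   # trip count after vectorizing
    return iterations * uops_per_op
```

In this model a 16-byte loop costs 1 cycle at 256-bit but 2 cycles at 512-bit with the split (uops_per_op=2), while a 4096-byte loop costs 128 cycles either way. Widening never wins here and loses on short trip counts, which matches the direction of the measurements even though the real magnitudes differ.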

2. Intel Sapphire Rapids transition + downclock (c7i / m7i)

Intel’s Sapphire Rapids has native 512-bit execution units, so the µop-split issue does not apply. But it pays:

~3-5% AVX-512 frequency downclock under sustained heavy 512-bit use.
AVX-512 ↔ AVX2 transition penalties when non-kernel TUs running 512-bit code call into the 256-bit hand-tuned kernel TUs (which always run at host tier via the dispatcher).

Net: small or zero gain from widening, often offset by the transition costs.

Mitigation: `-mprefer-vector-width=256`

The canonical mitigation (used by FFmpeg, libvpx, Intel ISPC) is to keep AVX-512BW capabilities available but cap auto-vectorization at 256-bit width. The flag -mprefer-vector-width=256 (gcc / clang) / -qopt-zmm-usage=low (icpc) does exactly that:

The compiler can still emit AVX-512BW instructions where it explicitly needs them (mask registers, byte/word lane permutes, gather/scatter, the 32-zmm register file).
The auto-vectorizer’s preferred SIMD width stays at 256-bit, dodging Zen 4’s µop-split and Intel’s downclock/transition costs.

The Makefile bakes this flag into arch=avx512bw directly. Hand-tuned 512-bit kernel intrinsics in KERNEL_SRCS are unaffected — the cap is about auto-vec, not intrinsics.

Empirical numbers

c7a.4xlarge (AMD EPYC 9R14, Zen 4) and c7i.4xlarge (Intel Xeon Platinum 8488C, Sapphire Rapids) running the bench’s wgs-5M sample (1kg HG00096, 5M PE reads on hg38), shm-warmed via bwa-mem3 shm, 3 reps median, timing via tricord (fg-labs/tricord):

host	avx2	avx512bw	avx512bw + pvw256 (default)
c7a (Zen 4)	105.70 s	103.40 s (−2.2%)	101.03 s (−4.4%)
c7i (Sapphire Rapids)	156.50 s	155.47 s (−0.7%)	155.41 s (−0.7%)

The gain is real but small. Defaulting BASELINE_ARCH=avx2 for x86 distribution is still correct: it’s portable across every x86 host and loses only ~2-4% to the host-locked avx512bw build on AVX-512 hosts.
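The percentages in the table follow from the medians by the usual relative-change formula. The Python sketch below recomputes them from the wall times quoted above; the dictionary layout and function names are just one convenient arrangement.

```python
MEDIAN_WALL_S = {
    "c7a": {"avx2": 105.70, "avx512bw": 103.40, "avx512bw+pvw256": 101.03},
    "c7i": {"avx2": 156.50, "avx512bw": 155.47, "avx512bw+pvw256": 155.41},
}


def pct_change(baseline_s, variant_s):
    """Relative wall-time change in percent; negative means faster."""
    return (variant_s - baseline_s) / baseline_s * 100.0


def deltas_vs_avx2(times):
    """Percent change of every non-avx2 variant against the avx2 build."""
    base = times["avx2"]
    return {
        name: round(pct_change(base, wall), 1)
        for name, wall in times.items()
        if name != "avx2"
    }
```

Running deltas_vs_avx2 on the c7a row gives −2.2 and −4.4, and on the c7i row −0.7 and −0.7, matching the table.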

Why the runtime warning was misleading

Earlier versions of src/simd_dispatch.cpp printed at startup:

[W::bwamem3_simd_init_body] build baseline avx2 < host tier avx512bw;
non-kernel TUs are not auto-vectorized at the higher width (expect
10-15% slower hot paths). Rebuild with BASELINE_ARCH=avx512bw to recover.

The “10-15%” figure was the sse41 → avx2 transition on AVX2-only hosts (PR #84’s measurement, before avx2 became the default). It did not generalize to avx2 → avx512bw for the µop-split / downclock / transition reasons above. The warning was demoted to BWAMEM3_DEBUG_SIMD gating in a follow-up commit; the recommendation now reflects the actual measurements (from under 1% on Sapphire Rapids to about 4% on Zen 4 in the table above).

When to use `BASELINE_ARCH=avx512bw`

Production fleets pinned to AVX-512BW hosts (c7a / c7i / m7i): ship a host-locked build for the small (~2-4%) extra gain. The Makefile’s arch=avx512bw includes the -mprefer-vector-width=256 cap by default, which is empirically the right choice for both Zen 4 and Sapphire Rapids. The binary will SIGILL on hosts that lack AVX-512BW, so pair it with explicit Batch queue and image plumbing that keeps it off other hosts.
Mixed fleets / generic x86 distribution: stay on the default BASELINE_ARCH=avx2. The 2-4% gap is small enough that portability is worth it.

Reproducer

The investigation harness lives at scripts/perf-diff-baseline-arch.sh. It builds N variants of bwa-mem3 with different BASELINE_ARCH and EXTRA_CXXFLAGS settings, runs each through tricord (or perf record for hot-function diffs), and emits per-variant median tables. Example usage:

scripts/perf-diff-baseline-arch.sh \
    --ref hg38.fasta --r1 r1.fq.gz --r2 r2.fq.gz \
    --out out/ --reps 3 --threads 16 \
    --variants 'avx2:,avx512bw:,avx512bw-pvw256:-mprefer-vector-width=256'

Requires tricord on the PATH (cargo install tricord). /dev/shm must be ≥18 GB to stage the hg38 FMI index — on a default EC2 instance that means mount -o remount,size=28g /dev/shm.

Bench-side caveats

The bench’s Phase C report (May 2026) reported a +14% c7a regression when comparing the bench’s portable image vs the host-locked avx512bw image inside AWS Batch. That delta does not reproduce on a single-instance bare-metal measurement. A 4-way disambiguation test (c7a.4xlarge, wgs-5M, shm-warmed, 3 reps each — see “Reproducer” above) attributes the gap to two independent bench-side factors:

variant	wall (s)	vs A
A: AL2023-built avx512bw, bare-metal	99.50	(baseline)
B: AL2023-built avx512bw, in bench Docker image	102.30	+2.8%
C: bench-image avx512bw binary, bare-metal	117.03	+17.6%
D: bench-image avx512bw binary, in bench Docker image	118.44	+19.0%

The findings:

Build-environment matters more than container. The bench’s :316dba6-avx512bw image binary, run bare-metal on the same c7a.4xlarge with the same input, is +17.6% slower than a fresh AL2023-built binary from the same SHA with the same BASELINE_ARCH.
- AL2023 ships gcc 11.5.0 and defaults to -no-pie. Output is a non-PIE ELF.
- Debian bookworm (the bench’s Dockerfile base) ships gcc 12.x and defaults to -pie. Output is a PIE ELF.
- PIE adds indirection through a GOT for every global reference and is well-known to cost 5–15% on tight CPU loops. Combined with gcc-11 vs gcc-12 codegen differences this comfortably accounts for the +17.6%.
Docker container overhead is small (~1–3%). A→B (+2.8%) and C→D (+1.2%) both show a small wall-time delta when wrapping the same binary in the bench’s image. This is consistent with the broader literature on cgroup-namespaced compute-bound workloads.
The bench’s portable :316dba6 image is built with BASELINE_ARCH=sse41, not avx2. Direct evidence from the binary’s startup banner inside the container: [W::bwamem3_simd_init_body] build baseline sse41 < host tier avx512bw. The banner is generated from compile-time macros in simd_dispatch.cpp and unambiguously testifies to the build flags used. Why it’s sse41 (rather than the post-PR-#84 default of avx2) is a bench-side mystery; possibilities include an explicit BASELINE_ARCH=sse41 build-arg, a Makefile-default change between the prior portable build and SHA 316dba6, or an environment quirk in the Docker build. The Phase C report’s “42/42 saw avx2 warning” summary likely reflects a different prior run, not the current :316dba6 portable image.
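Because variants A to D form a 2×2 grid (build environment × container), the two effects can be separated with simple ratios. The Python sketch below recomputes them from the wall times in the table; the effect labels are descriptive names chosen for this example.

```python
WALL_S = {"A": 99.50, "B": 102.30, "C": 117.03, "D": 118.44}


def pct_slower(base, other):
    """How much slower `other` is than `base`, in percent."""
    return round((WALL_S[other] / WALL_S[base] - 1.0) * 100.0, 1)


EFFECTS = {
    # Same binary, bare-metal vs inside the bench Docker image.
    "container, AL2023 binary (A->B)": pct_slower("A", "B"),
    "container, bench binary (C->D)": pct_slower("C", "D"),
    # Same execution environment, AL2023-built vs bench-image binary.
    "build env, bare-metal (A->C)": pct_slower("A", "C"),
    "build env, in Docker (B->D)": pct_slower("B", "D"),
}
```

The container effect comes out at 2.8% and 1.2%, while the build-environment effect is 17.6% bare-metal and 15.8% inside Docker, so the build environment dominates in both rows of the grid.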

The Phase C report’s headline “+14% c7a regression for avx512bw” is therefore comparing an sse41-built portable image at +17.6% build-environment penalty against a BASELINE_ARCH=avx512bw binary at the same +17.6% penalty plus Zen 4’s µop-split cost — which roughly cancels for a small absolute delta in either direction. None of it is bwa-mem3’s BASELINE_ARCH knob’s fault.

These are bench-side concerns. The bwa-mem3 fix (-mprefer-vector-width=256 for arch=avx512bw) stands on its own bare-metal merit: on Zen 4, avx512bw without the cap is −2.2% vs avx2 and the cap adds a further 2.2 percentage points (−4.4% total); on Sapphire Rapids it is a wash.

Bench-side toolchain attribution

The +17.6% bare-metal delta between the bench’s bookworm-built binary and an AL2023-built binary was decomposed via a six-variant single-host test (c7a.4xlarge, wgs-5M, shm-warmed, 5 reps each, tricord median):

variant	gcc	PIE	CET (`endbr64`)	wall (s)	vs AL2023
AL2023 default	11.5.0	no	8 (libc init only)	98.28	(baseline)
bookworm + gcc-11	11.4.0	yes	3 (libc init only)	100.16	+1.9%
bookworm default	12.2.0	yes	3 (libc init only)	117.21	+19.3%
bookworm + `-no-pie`	12.2.0	no	3	118.98	+21.1%
bookworm + `-fcf-protection=none`	12.2.0	yes	3	118.40	+20.5%
bookworm + `-no-pie -fcf-protection=none`	12.2.0	no	3	116.84	+18.9%

Findings:

The +17.6% delta is gcc-11 vs gcc-12 codegen, not any hardening flag. Switching gcc inside Debian bookworm (keeping PIE on, keeping whatever default CET there is, keeping every other Debian default) recovers the perf to within 2% of AL2023.
Disabling PIE in bookworm gcc-12 has no measurable effect. The median is 118.98 s with -no-pie vs 117.21 s for the default build, which is within rep-to-rep noise. The 5–15% PIE penalty cited in the literature doesn’t manifest for bwa-mem3’s intrinsic-heavy hot path; GOT indirection is rare on tight inner loops.
Disabling CET in bookworm gcc-12 has no measurable effect. Both bookworm-default and -fcf-protection=none builds emit only ~3 endbr64 instructions in 7.6 MB of binary (libc init code). Something — probably bwa-mem3’s -mavx512f -mavx512bw per-tier kernel flags, or a #pragma GCC somewhere — suppresses CET emission regardless of the flag. So disabling it removes nothing.
The combined -no-pie -fcf-protection=none build is statistically indistinguishable from the default. Both PIE and CET are noise.

So the actionable bench-side fix isn’t -no-pie or -fcf-protection=none — it’s “use gcc-11”. The six-variant table above is the only data we have; gcc-13 was not tested here, and the postscript below shows that gcc-13 / gcc-14 do not recover the gap without #88’s source-side fix. A one-liner for the bench Dockerfile:

RUN apt-get install -y gcc-11 g++-11
RUN cd fg-labs && CC=gcc-11 CXX=g++-11 make BASELINE_ARCH=... -j

…recovers ~17% wall-time uniformly across every Batch worker, every arch. That’s a much bigger lever than the bwa-mem3-side -mprefer-vector-width=256 mitigation here, which is ~2–4% on c7a and wash on c7i. But it’s bench infra, not bwa-mem3 source.

(The deeper question — why is gcc-12 codegen ~19% slower than gcc-11 on Zen 4 for this workload? — was followed up in #88; see the postscript below.)

The bwa-mem3-side conclusion stands on its own bare-metal merit: -mprefer-vector-width=256 for arch=avx512bw is a real ~2–4% win on c7a and a wash on c7i, independent of toolchain and container concerns.

Postscript: gcc-12 attribution closed by #88

The “use gcc-11” recommendation above was the actionable bench-side fix at the time this page was written. #88 (perf(fmi): inline backwardExt to recover gcc 12+ wall-clock regression) has since identified the underlying mechanism and closed the gap in source — so on current main no compiler pin is required.

Profile attribution on a fresh c7a.8xlarge run with perf record --no-children localized ~9 percentage points of wall-time to FMI_search::backwardExt’s self-time (12.5% on gcc-11 vs 21.5% on gcc-14). Disassembly histograms were nearly identical between compilers (~110 instructions, 8 scalar popcntq, 25 mov each); IPC fell from 1.77 to 1.60. perf annotate isolated a single instruction at 42% of the function’s samples on gcc-14: vmovdqu %ymm0, (%r8) — the 32-byte AVX store of the SMEM return value through SysV’s hidden-pointer convention. The matching argument-load (mov 0x30(%rbp), %r10 for smem.s from the caller’s stack push) was next-hottest at 17%. Together those two instructions accounted for ~60% of the function’s self-time.

The fix in #88 is a one-line attribute change: marking backwardExt __attribute__((always_inline)) removes the call boundary at all 9 hot call sites in getSMEMs* and ls_advance_*. Without a call boundary the SMEM struct stays in caller registers across the would-be call site — no struct push, no return-slot store, no vzeroupper.
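The shape of that change can be sketched as follows. This is a minimal illustration under stated assumptions: `Interval`, `extend_backward`, and `walk` are hypothetical stand-ins, not the real SMEM struct, `FMI_search::backwardExt`, or its callers, and the arithmetic is a placeholder.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the SMEM interval (illustrative fields only).
// At 24 bytes it is too large for register return under the SysV x86-64 ABI,
// so an out-of-line call returns it through a hidden pointer.
struct Interval {
    int64_t k;
    int64_t l;
    int64_t s;
};

// GCC and Clang honour always_inline on functions that are also declared
// inline. With the call boundary gone there is no return-slot store.
__attribute__((always_inline)) inline
Interval extend_backward(Interval in, int64_t shift) {
    Interval out = in;
    out.k += shift;  // placeholder arithmetic, not the real FM-index update
    out.s -= 1;
    return out;
}

// Hot-loop caller: once the callee is inlined, `cur` can stay in registers
// across iterations instead of being pushed and reloaded per call.
inline int64_t walk(Interval cur, int steps) {
    for (int i = 0; i < steps && cur.s > 0; ++i) {
        cur = extend_backward(cur, 2);
    }
    return cur.k;
}
```

Whether a compiler would have inlined a given call without the attribute depends on its cost model, which is the part that can shift between compiler releases. The attribute takes that decision away from the heuristic.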

Post-#88 numbers on c7a.8xlarge (hg38 wgs-5M, shm-warmed, -t 32, 5 reps mean, single-binary at BASELINE_ARCH=avx2):

binary	wall (s)	vs gcc-14	vs gcc-11	IPC
main + gcc-11	64.59	−6.6%	baseline	1.77
main + gcc-14	69.16	baseline	+7.07%	1.60
#88 + gcc-14	61.94	−10.4%	−4.10%	1.83

So the gcc-11 vs gcc-12 attribution above was the surface symptom of an ABI-level inefficiency that was costing cycles on gcc-11 too — the always-inline fix beats gcc-11 baseline by 4.10%, not just gcc-14. The empirical table and findings list in the previous section remain accurate as an investigation snapshot at commit 316dba6 (the pre-#88 state); the bench-side gcc-11 recommendation it produced is obsolete.
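The percentage columns in the table follow directly from the wall-clock column. A small helper (the name `pct_change` is illustrative) reproduces them:

```cpp
#include <cassert>
#include <cmath>

// Relative wall-time change of `t` against `base`, in percent.
// Negative means `t` is faster than `base`.
constexpr double pct_change(double t, double base) {
    return (t - base) / base * 100.0;
}
```

For example, `pct_change(61.94, 64.59)` is the "#88 + gcc-14 vs gcc-11" cell and `pct_change(69.16, 64.59)` is the "main + gcc-14 vs gcc-11" cell.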

Upstream PR Status

This table cross-references every change carried in bwa-mem3 main to its corresponding upstream bwa-mem2 PR or issue. “Fork-only” means no upstream PR exists; the change may be submitted upstream in the future or may be fork-specific by design. “Open” means the upstream PR or issue existed at the time of bwa-mem3’s implementation but had not been merged. Upstream status is current as of the bwa-mem3 0.2.0 release.

For prose descriptions of each change, follow the links in the “bwa-mem3 PR” column to the relevant deep-dive page section.

Full cross-reference table

Topic	bwa-mem3 PR	Upstream PR / Issue	Upstream status
Correctness
`@PG CL:` tab escaping	#54	bwa-mem2#293	open issue
SMEM buffer overflow on >151 bp reads	#55	bwa-mem2#238, bwa-mem2#210	PR closed without merge; issue open
kswv nrow==0 guard (all 5 kernels)	#51	bwa-mem2#289	open PR (upstream covers AVX-512BW only)
AVX-512BW dispatch guard (`!__AVX512BW__`)	#26	—	fork-only
AVX2 score2 plateau consolidation	#28	—	fork-only
NEON + AVX-512BW 8-bit score2 fix	#29	—	fork-only
AVX-512BW 16-bit score2 fix	#30	—	fork-only
NEON 16-bit kernel rewrite	#31	—	fork-only
kseq2bseq1 zero-initialization	#22	—	fork-only
Proper-pair flag from emitted alignment	#17	—	fork-only
`@HD` emitted before `@SQ` per SAM spec	#35	lh3/bwa#345	closed (lh3 only)
`mem_matesw` SIGSEGV on shm-backed `ref_string`	#85	—	fork-only
`SA_COMPX_MASK` precedence in sampled-SA prefetch	#73	—	fork-only
`.alt` parse buffer bounded (stack overflow)	#74	—	fork-only
`display_stats` nthreads clamp to `LIM_C`	#81	—	fork-only
Performance
Lockstep SMEM batching	#33	—	fork-only
Batched `-H` header ingestion (O(n) fix)	#49	bwa-mem2#204	open PR
libsais FM-index construction	#57	—	fork-only
Consolidated mapping speedups	#58	—	fork-only
kswv per-strip L1 prefetches (all u8/16 kernels)	#70	—	fork-only
`SMEM_LOCKSTEP_N` bumped from 8 to 16	#75	—	fork-only
Closed-form ungapped HIT when `total_mis == 0`	#77	—	fork-only
`ksort` on-stack buffer for small `n`	#78	—	fork-only
`libsais_build` skip wasted zero-init	#80	—	fork-only
Cap `avx512bw` autovec at 256-bit	#86	—	fork-only
Inline `FMI_search::backwardExt` (recover gcc 12+ regression)	#88	—	fork-only
Features
`--bam=LEVEL` direct BAM output	#12	—	fork-only
`--meth` bisulfite alignment mode	#13	—	fork-only
Vendored mimalloc allocator	#19	—	fork-only
`HN:i` hit count tag	#42	lh3/bwa#438	analogous to bwa aln; no direct upstream port
`--supp-rep-hard-cap` MAPQ rescoring	#56	bwa-mem2#260	open issue
`bwa-mem3 shm` shared-memory index	#65	—	fork-only (v1 feature port)
`shm --meth` symmetry	#67	—	fork-only
`-z FLOAT` (XA_drop_ratio CLI knob)	#35	lh3/bwa#294	merged (lh3 only)
`-u` flag — widen `XA:Z` records with `,score,mapq`	#35	lh3/bwa#293	merged (lh3 only)
`MQ:i` mate mapping quality tag	#35	lh3/bwa#330	merged (lh3 only)
Bismark-compatible `XR:Z` / `XG:Z` / `XM:Z` tags	#90	—	fork-only
`/bwactl` registry interprocess lock (POSIX named semaphore)	#82	—	fork-only
`bwa-mem3 shm` `/dev/shm` capacity preflight	#86	—	fork-only
Host-floor precheck (`SIMD floor:` / `SIMD runtime:`, exit 2 on under-floor host)	#95	—	fork-only
Architecture support
Linux ARM64 / aarch64 build + CI	#1	bwa-mem2#288	open PR
`arch=avx512bw` explicit Makefile target	#16	—	fork-only
NEON kswv mate-rescue kernel	#18	—	fork-only
AVX2 kswv mate-rescue kernel	#20	—	fork-only
`bns_fetch_seq_v2` migration of `mem_matesw_batch_{pre,post}`	#76	—	fork-only
Single-binary in-process SIMD dispatch (replaces multi-binary `execv` launcher)	#83	—	fork-only
Default x86 `BASELINE_ARCH=avx2` (was `sse41`)	#84	—	fork-only
Build & infrastructure
doctest framework + Codecov	#34	—	fork-only
`PACKAGE_VERSION` from `git describe`	#52	bwa-mem2#283, bwa-mem2#284	open issue + open PR
PGO target parameterization	#59	—	fork-only
`CXXFLAGS`/`CPPFLAGS`/`LDFLAGS` forwarding	#50	bwa-mem2#290	open upstream PR
Unit-test harness + ARM CI	#23	—	fork-only
CI matrix expansion (7 rows)	#24	—	fork-only
Shell-var rename `BWAMEM2`/`BWA_MEM2[_]` → `BWAMEM3`/`BWA_MEM3[_]` (CI/bench/test scripts)	#68	—	fork-only
Methylation oracle: alias `bwa-mem2` → `bwa-mem3` on `PATH` for `bwameth.py`	#72	—	fork-only
Migrate parity tests from dwgsim/phiX174 to holodeck/chr22	#89	—	fork-only

Upstream issues tracked but not yet fixed in bwa-mem3

The following upstream issues are tracked in the bwa-mem3 issue list but do not yet have corresponding fixes in main:

Issue	Upstream reference	Notes
Split-alignment evidence loss vs bwa 0.7.17	bwa-mem2#273	issue #47 — under investigation
MAPQ/coordinate parity vs bwa mem 0.7.18	bwa-mem2#262, bwa-mem2#246, bwa-mem2#239	issue #48 — tracking only

Building from source

This page documents every build target available in the Makefile and what each produces. For the recommended production build workflow see Best Practices → Build.

Prerequisites

A C++14-capable compiler: GCC 7+ or Clang 6+ on Linux; Clang 15+ (Xcode) on macOS.
GNU make 3.81+.
CMake 3.12+ (required only when USE_MIMALLOC=1, which is the default).
autoconf, automake, autoconf-archive, libtool, pkg-config — ext/htslib’s build runs autoreconf -i && ./configure and locates zlib via pkg-config.
zlib development headers — htslib links against zlib.
OpenMP runtime — libsais uses OpenMP for parallel suffix-array construction. Linux + GCC: libgomp ships with the compiler, nothing extra to install. Linux + Clang: libomp-dev (Debian) / libomp-devel (RHEL). macOS: brew install libomp; the Makefile auto-detects the Homebrew prefix or honours LIBOMP_PREFIX.
Git submodules initialised: git submodule update --init --recursive.

See Getting Started → Installation for the full per-platform install commands.

Warning — Submodules must be present

The build will fail with a clear error message if any of the required submodules (ext/libsais, ext/htslib, ext/safestringlib, ext/mimalloc, ext/sse2neon) are missing. Always clone with --recursive or run git submodule update --init --recursive before make.

Standard builds

Default build (host-native)

make

On x86 hosts this is equivalent to make single (see below): one binary containing all five SIMD tiers, dispatched in process at startup. On Apple Silicon and other aarch64 hosts the Makefile detects the architecture and builds a single ARM64 binary with one NEON kernel TU.

The resulting binary is bwa-mem3 in the repo root.

Single multi-tier x86 build (default on x86)

make single                       # alias of the default `make`
make BASELINE_ARCH=avx512bw       # raise non-kernel TU compile baseline
make BASELINE_ARCH=sse41          # lower it for pre-Haswell hosts

Builds one bwa-mem3 binary. The four hand-tuned kernel TUs in KERNEL_SRCS (bandedSWA.cpp, kswv.cpp, ksw.cpp, sam_encode.cpp) are compiled five times each — once per supported tier (sse41 / sse42 / avx / avx2 / avx512bw) — and dispatched at runtime via __builtin_cpu_supports. Non-kernel TUs compile once at BASELINE_ARCH (default avx2 since PR #84). See Single-binary SIMD dispatch (x86) for the full design.

Single-tier x86 builds

Pass arch=<target> to compile a single binary with kernels for one tier only (no runtime dispatch table — useful on clusters with uniform hardware):

Command	SIMD level	`ARCH_FLAGS`
`make arch=sse41`	SSE4.1	`-msse … -msse4.1`
`make arch=sse42`	SSE4.2	`-msse … -msse4.2`
`make arch=avx`	AVX	`-mavx`
`make arch=avx2`	AVX2	`-mavx2`
`make arch=avx512bw`	AVX-512BW	`-mavx512f -mavx512bw -mprefer-vector-width=256`
`make arch=native`	host CPU features	`-march=native`

For Intel compiler (icpc / icpx) the flags differ slightly; see the Makefile for the ifeq ($(CXX), icpc) branches. The avx512bw target keeps the -mprefer-vector-width=256 cap from PR #86 — see BASELINE_ARCH=avx512bw build flag for the empirical perf characterization.

ARM64 / Apple Silicon build

make arch=arm64

Compiles a single binary bwa-mem3 with one NEON kernel TU. See Apple Silicon / NEON port for background.

Tuned builds

Profile-Guided Optimization (PGO)

PGO produces the best single-binary performance. The workflow is two-phase:

# Phase 1: instrument binary
make pgo-generate                              # builds bwa-mem3.pgo-instr (arm64 default)
make pgo-generate PGO_ARCH=avx2               # or a specific x86 target

# Run your training workload with the instrumented binary
./bwa-mem3.pgo-instr mem -t 16 ref.fa r1.fq.gz r2.fq.gz > /dev/null

# Phase 2: optimised binary
make pgo-use                                   # builds bwa-mem3.pgo
make pgo-use PGO_ARCH=avx2                     # matching arch

PGO_ARCH accepts the same values as arch=. PGO_PROFILE_DIR defaults to pgo_profiles/ but can be overridden. Output binaries are named bwa-mem3.pgo (default arch) or bwa-mem3.pgo.<arch> when a non-default arch is specified, so multiple arch builds coexist.

Clean up instrumented objects and profile data:

make pgo-clean

Link-Time Optimization (LTO)

make lto-build                                 # builds bwa-mem3.lto (native arch)
make lto-build LTO_ARCH=avx2                   # explicit arch

LTO compiles bwa-mem3’s own translation units with -flto (thin LTO on Clang, full LTO on GCC) plus -fno-semantic-interposition on GCC. Third-party libraries (htslib, mimalloc, safestringlib) are linked without LTO. Clean:

make lto-clean

Compute-only profile binary

Used when profiling CPU hotspots without I/O noise. The -DDISABLE_OUTPUT flag short-circuits all BAM/SAM write paths and the file-open / header-emit step, so only alignment work contributes to wall time.

make profile-build                             # builds bwa-mem3.profile (native)
make profile-build PROFILE_ARCH=avx2          # explicit arch
./bwa-mem3.profile mem -t 16 ref.fa r1.fq.gz r2.fq.gz

make profile-clean

Build knobs

Variable	Default	Effect
`USE_MIMALLOC`	`1`	Include mimalloc; set `0` to use the system allocator
`ASAN`	(unset)	Set to any non-empty value to enable AddressSanitizer (forces `USE_MIMALLOC=0`)
`COVERAGE`	(unset)	Set to enable `--coverage` + `-O0` for gcov line-level coverage
`EXTRA_CXXFLAGS`	(empty)	Appended to `CXXFLAGS`; forwarded through PGO / LTO targets
`DISABLE_BATCHED_MATESW`	(unset)	Set to `1` to disable the batched mate-rescue SW path on ARM
`CXX`	`c++`	Compiler. Paired `CC` is auto-derived from `CXX` for libsais.

Cleaning

make clean

Removes object files, libbwa.a, all binaries, test binaries, libsais objects, safestringlib, htslib, and the mimalloc build tree.

make docs-clean

Removes only the mdbook build output (docs/book/). The docs workflow is covered in Developer Guide → Building context; the full list of docs targets is in the Documentation targets table below.

Documentation targets

Target	Action
`make docs`	Build the mdbook into `docs/book/`
`make docs-serve`	Live-preview at `http://localhost:3000`
`make docs-cli`	Capture `--help` output for each subcommand into `docs/_generated/cli/`
`make docs-clean`	Remove `docs/book/`
`make docs-install-tools`	`cargo install` mdbook + three plugins

SIMD dispatch architecture

bwa-mem3 uses two complementary mechanisms to run the best available SIMD code path at run time: in-process tier dispatch on x86 (handled separately in Single-binary SIMD dispatch (x86)) and compile-time conditional compilation inside each kernel translation unit, mediated by src/simd_compat.h and src/kernel_dispatch.h.

This page covers the compile-time layer: what the macros do, which kernels are vectorised at each ISA level, and how the dispatch decision flows from main() to a tier-specific kernel instruction.

The `simd_compat.h` abstraction layer

src/simd_compat.h is the single point where platform detection and intrinsic selection occur. It is included by every file that touches SIMD code. The header resolves to one of four paths:

Platform	Branch condition	Intrinsic headers
ARM / Apple Silicon	`__ARM_NEON` or `__aarch64__`	`sse2neon.h` (translation) + `<arm_neon.h>` (native)
x86 AVX-512BW	`__AVX512BW__`	`<immintrin.h>`
x86 AVX2	`__AVX2__`	`<immintrin.h>`
x86 SSE4.1 / SSE2	`__SSE4_1__` or `__SSE2__`	`<smmintrin.h>` + `<emmintrin.h>`

The ARM path defines APPLE_SILICON 1, sets SIMD_WIDTH8 = 16 and SIMD_WIDTH16 = 8 (the lane counts of a 128-bit NEON vector), defines a posix_memalign-backed _mm_malloc replacement that enforces the 128-byte Apple Silicon cache-line alignment, and provides two optimised NEON helpers that sse2neon does not generate efficiently:

_mm_movemask_epi16 — extracts the MSB of each 16-bit element using vshrq_n_u16 + vmovn_u16 + position-weighted vaddv_u8, replacing the _mm_movemask_epi8(v) & 0xAAAA pattern used in bandedSWA.cpp.
_mm_blendv_epi16_fast — a bitwise select on 16-bit elements via NEON vbslq_s16, replacing the OR/AND/ANDNOT sequence sse2neon emits for _mm_blendv_epi8.
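To make the first helper's contract concrete, here is a scalar reference model of the two mask forms. It is a sketch of the semantics only, not the NEON implementation; the function names are illustrative, and this page does not state whether the project's helper returns the compact 8-bit form or the spread form that `& 0xAAAA` produces.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Scalar model of _mm_movemask_epi8: bit i is the MSB of byte i.
inline uint32_t movemask_epi8_ref(const std::array<uint8_t, 16>& bytes) {
    uint32_t mask = 0;
    for (int i = 0; i < 16; ++i) mask |= uint32_t(bytes[i] >> 7) << i;
    return mask;
}

// Scalar model of a 16-bit movemask: bit i is the MSB of 16-bit lane i.
inline uint32_t movemask_epi16_ref(const std::array<uint16_t, 8>& lanes) {
    uint32_t mask = 0;
    for (int i = 0; i < 8; ++i) mask |= uint32_t(lanes[i] >> 15) << i;
    return mask;
}

// Little-endian byte view of eight 16-bit lanes, as an x86 register holds them.
inline std::array<uint8_t, 16> to_bytes(const std::array<uint16_t, 8>& lanes) {
    std::array<uint8_t, 16> bytes{};
    for (int i = 0; i < 8; ++i) {
        bytes[2 * i]     = uint8_t(lanes[i] & 0xFF);
        bytes[2 * i + 1] = uint8_t(lanes[i] >> 8);
    }
    return bytes;
}
```

The `0xAAAA` constant keeps only the odd bit positions, which are the high bytes of each 16-bit lane, so both forms carry the same eight sign bits at different bit positions.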

SIMD_WIDTH8 and SIMD_WIDTH16 control the lane counts in kswv.cpp and bandedSWA.cpp. The macros differ per ISA level:

ISA	`SIMD_WIDTH8`	`SIMD_WIDTH16`
SSE4.1	16	8
AVX2	32	16
AVX-512BW	64	32
ARM NEON	16	8
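The table values are just register width divided by element width. A compile-time sketch, with illustrative helper names that are not taken from the source tree:

```cpp
#include <cassert>

// Lanes per vector = register width in bits / element width in bits.
constexpr int lanes(int vector_bits, int element_bits) {
    return vector_bits / element_bits;
}

static_assert(lanes(128, 8) == 16 && lanes(128, 16) == 8,  "SSE4.1 and NEON");
static_assert(lanes(256, 8) == 32 && lanes(256, 16) == 16, "AVX2");
static_assert(lanes(512, 8) == 64 && lanes(512, 16) == 32, "AVX-512BW");

// Round a batch up to a whole number of vectors. This is the usual reason
// kernel code does arithmetic on the width constant rather than a literal.
constexpr int padded_count(int n, int width) {
    return (n + width - 1) / width * width;
}
```

With a width of 16, a batch of 100 pairs pads to 112; the same expression works unchanged at every ISA level.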

Per-tier compilation and symbol mangling

On x86 the four kernel translation units listed in KERNEL_SRCS (bandedSWA.cpp, kswv.cpp, ksw.cpp, sam_encode.cpp) are compiled five times each — once per supported tier (sse41 / sse42 / avx / avx2 / avx512bw) — with tier-specific -m... flags. src/kernel_dispatch.h is a preprocessor-only header that renames each exported kernel symbol per a KERNEL_VARIANT=_<tier> macro, so the five tier compiles produce non-colliding symbols that all link into one binary.

bandedSWA.h adds an abstract IBandedPairWiseSW interface; BandedPairWiseSW is final and inherits from it. kswv.h mirrors this with Ikswv. Each per-tier kernel TU exports a C-linkage factory function (make_bsw_kernel_<tier>, make_kswv_kernel_<tier>) that returns a std::unique_ptr<I*> to the tier-specific concrete class. The dispatcher in src/simd_dispatch.cpp switches on g_tier and calls the matching factory; the call sites in bwamem.cpp and bwamem_pair.cpp see only the interface. This separation keeps the dispatcher TU free of class-layout knowledge and sidesteps the ODR risk that would arise from each tier’s compile pulling in a differently-laid-out concrete class definition.
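The interface-plus-factory pattern can be sketched in one file. Everything named here (`IKernel`, `KernelImpl`, `make_kernel_*`, `make_kernel`) is hypothetical; in the real build each concrete class lives in its own per-tier TU, and a template stands in for those separate compiles. The lane counts come from the kswv row of the vectorisation table on this page.

```cpp
#include <cassert>
#include <memory>

enum class Tier { sse41, sse42, avx, avx2, avx512bw };

// Abstract interface: the only type the call sites need to see.
struct IKernel {
    virtual ~IKernel() = default;
    virtual int lanes16() const = 0;  // int16 lanes per vector
};

// Stand-in for a per-tier concrete class.
template <int Lanes>
struct KernelImpl final : IKernel {
    int lanes16() const override { return Lanes; }
};

// Hypothetical per-tier factories; each would be exported by one tier's TU.
inline std::unique_ptr<IKernel> make_kernel_sse41()    { return std::make_unique<KernelImpl<8>>(); }
inline std::unique_ptr<IKernel> make_kernel_sse42()    { return std::make_unique<KernelImpl<8>>(); }
inline std::unique_ptr<IKernel> make_kernel_avx()      { return std::make_unique<KernelImpl<8>>(); }
inline std::unique_ptr<IKernel> make_kernel_avx2()     { return std::make_unique<KernelImpl<16>>(); }
inline std::unique_ptr<IKernel> make_kernel_avx512bw() { return std::make_unique<KernelImpl<32>>(); }

// Dispatcher: switches on the tier and never names a concrete class.
inline std::unique_ptr<IKernel> make_kernel(Tier t) {
    switch (t) {
        case Tier::avx512bw: return make_kernel_avx512bw();
        case Tier::avx2:     return make_kernel_avx2();
        case Tier::avx:      return make_kernel_avx();
        case Tier::sse42:    return make_kernel_sse42();
        case Tier::sse41:    return make_kernel_sse41();
    }
    return nullptr;  // not reached for valid enum values
}
```

A call site holds only a `std::unique_ptr<IKernel>` and makes virtual calls through it, so it compiles identically whichever tier was selected.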

The free-function ksw_* family (ksw_extend2, ksw_global2, ksw_extend, ksw_global, ksw_align2, ksw_align) is dispatched through thin extern "C" wrappers in simd_dispatch.cpp that switch on g_tier and tail-call the matching mangled per-tier symbol. Internal aux helpers in ksw.cpp (ksw_qinit, ksw_u8, ksw_i16) are forced static so the five tier compiles do not multi-define them. The SAM seq/qual encoder previously inlined in bwamem.cpp was lifted into src/sam_encode.{h,cpp} so it also participates in per-tier compilation.

All non-kernel TUs (bwamem.cpp, bwamem_pair.cpp, fastmap.cpp, FMI_search.cpp, bntseq.cpp, …) compile once at the BASELINE_ARCH tier (default avx2, set by the make line). They call into the dispatcher’s tier-agnostic entry points, which fan out to the per-tier kernels at run time. See Single-binary SIMD dispatch (x86) for the runtime selection and override semantics, and BASELINE_ARCH=avx512bw build flag for why non-kernel TUs do not auto-vectorize at 512-bit by default.

On arm64 there is one NEON tier and one kernel compile per TU; the dispatch tables collapse to single-entry switches and the per-tier mangling layer is a no-op.

Dispatch diagram

The full dispatch decision, from the shell to a kernel instruction, follows this flow:

flowchart TD
    A[User runs: bwa-mem3 mem ...] --> B{Platform}

    B -- ARM / Apple Silicon --> C[bwa-mem3 main, single NEON kernel TU]
    B -- x86 --> D[bwa-mem3 main, calls bwamem3_simd_init in src/simd_dispatch.cpp]

    D --> E{__builtin_cpu_supports + BWAMEM3_FORCE_TIER}
    E -- AVX-512BW --> F1[g_tier = avx512bw]
    E -- AVX2 --> F2[g_tier = avx2]
    E -- AVX --> F3[g_tier = avx]
    E -- SSE4.2 --> F4[g_tier = sse42]
    E -- SSE4.1 --> F5[g_tier = sse41]

    F1 & F2 & F3 & F4 & F5 --> G[Non-kernel TUs run\nat BASELINE_ARCH tier]
    C --> G

    G --> H{Kernel call}

    H -- kswv\nbatched SW --> I[per-tier kswv.<tier>.o\nvia make_kswv_kernel_<tier>]
    H -- bandedSWA\nmate-rescue --> J[per-tier bandedSWA.<tier>.o\nvia make_bsw_kernel_<tier>]
    H -- ksw_align2 etc.\nfree functions --> K[per-tier ksw.<tier>.o\nvia extern-C wrapper in simd_dispatch.cpp]
    H -- sam_encode --> L[per-tier sam_encode.<tier>.o]
    H -- FMI_search\nbackward extension --> M[FMI_search.cpp\n__builtin_popcountl — not SIMD]
    H -- libsais\nBWT construction --> N[libsais.c\nOpenMP parallel SA-IS]

    I --> O[SIMD instructions\nat the dispatched tier]
    J --> O
    K --> O
    L --> O

Per-kernel vectorisation status

Kernel	SSE4.1	SSE4.2	AVX	AVX2	AVX-512BW	ARM NEON
`kswv` (batched Smith-Waterman)	8-wide int16	8-wide int16	8-wide int16	16-wide int16	32-wide int16	8-wide int16 (native)
`bandedSWA` (banded SW / mate-rescue)	vectorised	vectorised	vectorised	vectorised	vectorised	native NEON blendv
`ksw_*` free functions (SW extension)	per-tier	per-tier	per-tier	per-tier	per-tier	per-tier (NEON)
`sam_encode` (SAM seq/qual encoder)	per-tier	per-tier	per-tier	per-tier	per-tier	per-tier (NEON)
`FMI_search` (FM-index backward ext.)	scalar	scalar	scalar	scalar	scalar	scalar
`libsais` (BWT / SA construction)	OpenMP only	OpenMP only	OpenMP only	OpenMP only	OpenMP only	OpenMP only

FMI_search is memory-bound with sequential pointer-chasing dependencies; adding SIMD to it produces no measurable speedup. libsais benefits from OpenMP-parallel induced sorting but not from SIMD widening within a single thread.

Adding a new SIMD kernel

Include simd_compat.h rather than any platform intrinsic header directly.
Use SIMD_WIDTH8 / SIMD_WIDTH16 for lane-count arithmetic so the code compiles correctly across all ISA levels.
If the kernel needs per-tier compilation:
- Add the source to KERNEL_SRCS in the Makefile so the per-tier pattern rules (src/%.<tier>.o) pick it up.
- Use the KERNEL_VARIANT rename macros from src/kernel_dispatch.h to expose mangled symbols.
- Export a C-linkage factory or dispatcher entry point from the per-tier TU and add a switch on g_tier in src/simd_dispatch.cpp.
For ARM-specific optimisations, gate them with #ifdef APPLE_SILICON (or #if defined(__ARM_NEON)) and provide a simd_compat.h-routed fallback for x86.
Verify correctness on at least SSE4.1 (lowest supported x86 tier) and ARM64 using make test, then run test/regression/all_tiers_parity.sh to confirm byte-identical SAM across every x86 tier under BWAMEM3_FORCE_TIER.

Tip — Testing SIMD correctness

The kswv unit tests in test/unit/test_kswv*.cpp use synthetic sequence-pair generators that drive edge cases (empty batches, nrow==0, homopolymers) across every SIMD width. Run them with ./test/bwa_mem3_tests_unit --test-suite="unit/kswv" after modifying any vectorised kernel, then loop BWAMEM3_FORCE_TIER over all five tiers in an end-to-end smoke run to catch dispatcher-wiring regressions that the unit tests miss.

Single-binary SIMD dispatch (x86)

On x86 Linux and x86 macOS, bwa-mem3 is a single binary that contains compiled kernels for every supported SIMD tier (sse41 / sse42 / avx / avx2 / avx512bw). At startup the binary detects the host CPU’s capabilities and selects the matching tier in process, without fork or exec. There is no separate launcher binary and no bwa-mem3.<tier> variant files on disk.

ARM / Apple Silicon does not need tier dispatch at all: there is only one NEON instruction-set level across current ARM64 CPUs, so the arm64 build is a single binary with one kernel TU. The dispatch machinery described below is only meaningful on x86.

This design replaces the multi-binary execv launcher inherited from bwa-mem2. The motivation, validation, and trade-offs are tracked in PR #83; the AVX-512 auto-vectorization cap that ships alongside it is documented in BASELINE_ARCH=avx512bw build flag.

What the build produces

make            # default: single multi-tier binary, BASELINE_ARCH=avx2
make single     # explicit alias of the default target

Produces one file in the repo root:

File	Contains	Non-kernel TU compile flags
`bwa-mem3`	All 5 x86 tier kernels + dispatcher + non-kernel TUs	`BASELINE_ARCH` (default `avx2`)

The four kernel translation units listed in the Makefile’s KERNEL_SRCS (bandedSWA.cpp, kswv.cpp, ksw.cpp, sam_encode.cpp) are compiled five times each, once per tier, with tier-specific -m... flags. Every non-kernel TU is compiled once at the BASELINE_ARCH tier. BASELINE_ARCH defaults to avx2 (PR #84) and can be set on the make line:

make BASELINE_ARCH=avx512bw       # for an AVX-512BW-only fleet
make BASELINE_ARCH=sse41          # for pre-Haswell hosts (~10–15% slower on AVX2)

Lowering BASELINE_ARCH reduces the supported host floor and is the documented escape hatch for vintage hardware. Raising it locks the binary to that host class: hosts below the new floor are rejected at startup by the host-floor precheck. The bwa-mem3 version banner prints the resulting SIMD floor: line so operators can confirm the build matches the intended deployment surface; see Host requirements and BASELINE_ARCH=avx512bw build flag.

For ARM, make arch=arm64 produces a single binary with a single NEON kernel TU; no dispatch table is generated.

Runtime tier selection

src/simd_dispatch.cpp provides three pieces:

bwamem3_simd_init() — idempotent initializer called from main.cpp. Caches the host’s raw capability into a file-scope g_host_capability and the effective dispatch tier into a separate g_tier (the two differ when BWAMEM3_FORCE_TIER is set).
An enum of supported tiers (sse41 → sse42 → avx → avx2 → avx512bw, plus neon on arm64) and bwamem3_simd_tier_name() for stderr reporting.
Per-kernel factory functions (make_bsw_kernel_<tier>, make_kswv_kernel_<tier>) and free-function dispatch wrappers (ksw_extend2, ksw_global2, ksw_extend, ksw_global, ksw_align2, ksw_align, sam_encode_*) that switch on g_tier and call into the matching mangled per-tier symbol.

x86 detection uses __builtin_cpu_supports directly; arm64 reports neon unconditionally. The selection happens once at startup and the result is cached in a TU-level global — subsequent kernel calls pay a single indirect-call overhead through a vtable (for the BandedPairWiseSW / kswv factories) or an extern "C" wrapper (for the ksw_* free functions). Per PR #83 measurement, the indirect call costs ~0.3 ns after BTB warm-up, so a 1M-read alignment with ~100M kernel calls adds roughly 30 ms — well below run-to-run noise on every tested host.
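A detect-once-and-cache ladder of this kind might look like the sketch below. It assumes GCC or Clang; the enum, function names, and the exact set of feature strings checked are illustrative, and the real initializer may test additional features (for example `avx512f` alongside `avx512bw`).

```cpp
#include <cassert>

enum class Tier { sse41, sse42, avx, avx2, avx512bw, neon, unknown };

// Probe from the widest tier down and return the first one the host supports.
inline Tier detect_host_tier() {
#if defined(__x86_64__) || defined(__i386__)
    __builtin_cpu_init();  // required before __builtin_cpu_supports in early code
    if (__builtin_cpu_supports("avx512bw")) return Tier::avx512bw;
    if (__builtin_cpu_supports("avx2"))     return Tier::avx2;
    if (__builtin_cpu_supports("avx"))      return Tier::avx;
    if (__builtin_cpu_supports("sse4.2"))   return Tier::sse42;
    if (__builtin_cpu_supports("sse4.1"))   return Tier::sse41;
    return Tier::unknown;
#elif defined(__aarch64__)
    return Tier::neon;  // reported unconditionally on arm64
#else
    return Tier::unknown;
#endif
}

// Function-local static: initialised once, thread-safe since C++11.
inline Tier cached_tier() {
    static const Tier tier = detect_host_tier();
    return tier;
}

inline const char* tier_name(Tier t) {
    switch (t) {
        case Tier::sse41:    return "sse41";
        case Tier::sse42:    return "sse42";
        case Tier::avx:      return "avx";
        case Tier::avx2:     return "avx2";
        case Tier::avx512bw: return "avx512bw";
        case Tier::neon:     return "neon";
        case Tier::unknown:  break;
    }
    return "unknown";
}
```

Every later dispatch reads the cached value, so the CPUID probing cost is paid once per process rather than once per kernel call.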

Symbol mangling per tier

src/kernel_dispatch.h is a preprocessor-only header that renames kernel-exported symbols according to a KERNEL_VARIANT=_<tier> macro. Each kernel TU is compiled N times with a different -DKERNEL_VARIANT=_<tier> plus the matching -m... flags, producing per-tier mangled symbols that link cleanly into one binary without ODR collision.
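The renaming trick can be shown in a single file by defining and undefining the variant macro twice, which mimics two separate compiles of the same source. The macro names (`KD_CONCAT`, `KD_NAME`) and the function `kernel_lanes16` are hypothetical; the actual contents of kernel_dispatch.h are not reproduced here.

```cpp
#include <cassert>

// Two-level concatenation so KERNEL_VARIANT is expanded before pasting.
#define KD_CONCAT_INNER(a, b) a##b
#define KD_CONCAT(a, b) KD_CONCAT_INNER(a, b)
#define KD_NAME(base) KD_CONCAT(base, KERNEL_VARIANT)

// First "compile": what -DKERNEL_VARIANT=_sse41 would produce.
#define KERNEL_VARIANT _sse41
constexpr int KD_NAME(kernel_lanes16)() { return 8; }   // kernel_lanes16_sse41
#undef KERNEL_VARIANT

// Second "compile": what -DKERNEL_VARIANT=_avx2 would produce.
#define KERNEL_VARIANT _avx2
constexpr int KD_NAME(kernel_lanes16)() { return 16; }  // kernel_lanes16_avx2
#undef KERNEL_VARIANT
```

The same source text yields two distinct symbols, `kernel_lanes16_sse41` and `kernel_lanes16_avx2`, which is what lets the per-tier objects link into one binary without colliding.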

bandedSWA.h adds an abstract IBandedPairWiseSW interface; BandedPairWiseSW is final and inherits from it. kswv.h mirrors this with Ikswv. The dispatcher TU sees only the interface; the factory implementations in each per-tier kernel TU see the full concrete class layout via the rename. This separation sidesteps the ODR risk that would arise if the dispatcher TU and the factory TUs both included the full class definition.

Internal aux helpers in ksw.cpp (ksw_qinit, ksw_u8, ksw_i16) are forced static so the per-tier compiles don’t multi-define them. The SAM seq/qual encoder previously inlined in bwamem.cpp was lifted into a free-standing src/sam_encode.{h,cpp} translation unit so it participates in per-tier compilation and benefits from the auto-vectorizer’s tier-specific vmovdqu / VEX / EVEX encoding wins.

Environment overrides

Two environment variables are read at runtime:

Variable	Behavior
`BWAMEM3_FORCE_TIER=<tier>`	Force the dispatcher to use `<tier>` (one of `sse41`, `sse42`, `avx`, `avx2`, `avx512bw`). Downgrade-only: requests above the detected host tier (which would SIGILL on the first wider instruction) and unrecognized names are rejected with a stderr warning, and the dispatcher falls back to the detected tier. Replaces the prior “exec the `bwa-mem3.sse41` binary” pattern for A/B regression testing on AVX-512 hosts.
`BWAMEM3_DEBUG_SIMD=1`	Print a one-line `[I::bwamem3_simd_init_body]` banner at startup naming the build baseline (`g_build_tier`), the detected host capability, and the resolved dispatch tier. Also enables the build-baseline-vs-host gap warning that PR #84 originally emitted unconditionally and PR #86 demoted to debug-only.

Both are read once during bwamem3_simd_init() and ignored after that call returns.
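The downgrade-only rule for `BWAMEM3_FORCE_TIER` can be sketched as a pure function of the detected tier and the raw environment value. The names `parse_tier` and `resolve_tier` and the warning text are illustrative assumptions, not the project's code.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>

enum class Tier { sse41 = 0, sse42, avx, avx2, avx512bw };

// Returns true and writes `out` when `name` is a known tier name.
inline bool parse_tier(const char* name, Tier& out) {
    static const char* const names[] = {"sse41", "sse42", "avx", "avx2", "avx512bw"};
    for (int i = 0; i < 5; ++i) {
        if (std::strcmp(name, names[i]) == 0) {
            out = static_cast<Tier>(i);
            return true;
        }
    }
    return false;
}

// `forced` is the raw environment value, or nullptr when the variable is unset.
inline Tier resolve_tier(Tier detected, const char* forced) {
    if (forced == nullptr || *forced == '\0') return detected;
    Tier requested = detected;
    if (!parse_tier(forced, requested)) {
        std::fprintf(stderr, "warning: unrecognized tier '%s', using detected tier\n", forced);
        return detected;
    }
    if (requested > detected) {  // an upgrade would SIGILL on this host
        std::fprintf(stderr, "warning: tier '%s' exceeds host capability, ignoring\n", forced);
        return detected;
    }
    return requested;
}
```

A caller would pass `std::getenv("BWAMEM3_FORCE_TIER")` as the second argument once at startup. Keeping the rule free of global state makes the reject-and-fall-back cases easy to unit test.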

Host-floor enforcement

bwa-mem3 mem, bwa-mem3 index, and bwa-mem3 shm all call bwamem3_enforce_host_floor() early in main() (PR #95). The check compares g_host_capability against the compile-time g_build_tier (derived from compiler predefined macros, reflecting whichever BASELINE_ARCH was set at build time) and exits with code 2 and an [E::bwamem3] message naming the gap if the host cannot execute the binary’s compiled-in instructions. This converts what would otherwise be an unhelpful SIGILL deep in alignment into a clean abort at startup.

Diagnostic invocations opt out: bwa-mem3 version, bwa-mem3 <subcommand> --help, and bwa-mem3 <subcommand> -h always succeed regardless of host capability, so operators can introspect a binary on a host that cannot run alignment. The version command prints SIMD floor: (the build’s required minimum) and SIMD runtime: (the resolved tier) on stdout; on a too-old host it also emits a [W::bwa-mem3] warning on stderr.

The simd_dispatch.cpp translation unit itself is compiled at -march=x86-64 via an explicit Makefile rule, so the precheck path stays SIGILL-safe even when BASELINE_ARCH=avx2 (or higher) for the rest of the binary.
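Reduced to its decision rule, the floor check compares two tiers and lets diagnostic invocations through. The sketch below uses hypothetical names (`host_floor_status`, `kExitUnderFloor`) and returns the status instead of exiting so the rule can be tested in isolation.

```cpp
#include <cassert>

enum class Tier { sse41 = 0, sse42, avx, avx2, avx512bw };

constexpr int kExitUnderFloor = 2;  // exit code documented for an under-floor host

// Returns 0 when the host may proceed, otherwise the under-floor exit code.
// `diagnostic_only` models `version`, `--help`, and `-h`, which opt out.
constexpr int host_floor_status(Tier host, Tier build_floor, bool diagnostic_only) {
    return (diagnostic_only || host >= build_floor) ? 0 : kExitUnderFloor;
}
```

In a real entry point the caller would print the error naming the gap and then exit with the returned code. That code itself has to be compiled without wide instructions, which is the reason for the separate -march=x86-64 rule described above.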

Per-tier parity validation

test/regression/all_tiers_parity.sh runs bwa-mem3 mem with BWAMEM3_FORCE_TIER walking the full ladder (sse41 → sse42 → avx → avx2 → avx512bw) on the same input and diffs the SAM output. The expected result is byte-identical SAM across every tier; any divergence is a bug in either a kernel TU or the per-kernel factory wiring. CI runs this script on the x86 matrix row.

Trade-offs vs the prior multi-binary launcher

Property	Pre-PR-#83 (multi-binary `execv`)	Current (single binary, in-process dispatch)
Install size	~120 MB (5 ISA binaries + launcher)	~25 MB (one binary)
Build cost	5 sequential clean rebuilds + launcher	One parallel build
Process model	`bwa-mem3` (launcher) → `execv` → `bwa-mem3.<tier>`	One process, one `main()`
Per-call overhead	Direct call (tier fixed at launch via separate binary)	Indirect call through factory vtable or `extern "C"` wrapper (~0.3 ns / call)
Non-kernel auto-vectorization	At each binary’s compile tier	At `BASELINE_ARCH` (default `avx2`); raise via `BASELINE_ARCH=`
Tier override	Run the `.<tier>` binary directly	`BWAMEM3_FORCE_TIER=<tier>` (downgrade-only)
`runsimd.cpp` (220-line launcher + safestringlib)	Required	Removed

The ~0.3 ns indirect-call cost is amortized across alignment work and has not been measurable in any bench cell. Compiling the non-kernel TUs at BASELINE_ARCH is what closes the gap identified in PR #84: PR #83 had originally regressed performance by silently hardcoding the non-kernel compile to sse41.

Distribution layout

For deployment on any x86 host meeting the build’s floor:

bin/
  bwa-mem3       ← single binary, dispatches in-process

For ARM:

bin/
  bwa-mem3       ← single binary, NEON kernels only

No .<tier>-suffixed companion files are produced or needed. When shipping a Docker image intended for a mixed-microarch fleet, build at the lowest expected tier (e.g. BASELINE_ARCH=avx2 for “AVX2 and newer”) — the runtime dispatcher will still pick AVX-512BW kernels on AVX-512 hosts via the per-tier factory tables. See Multi-architecture deployment for the docker buildx manifest-list recipe.

The legacy Executing in AVX2 mode!! banner is gone. Use either:

bwa-mem3 version — prints SIMD floor: and SIMD runtime: lines on stdout (always available, no alignment required).
BWAMEM3_DEBUG_SIMD=1 bwa-mem3 mem … — prints a one-line [I::bwamem3_simd_init_body] banner on stderr at the start of the run.

Apple Silicon / NEON port

bwa-mem3 supports ARM64 (Apple Silicon and Linux aarch64) as a first-class build target. The port uses the sse2neon translation shim as a baseline and replaces the two most performance-critical SSE paths with native NEON intrinsics.

Architecture overview

The ARM build compiles a single binary with a single NEON kernel TU. There is only one NEON instruction-set level on all current ARM64 CPUs, so the per-tier dispatch table used by the x86 single-binary build (see Single-binary SIMD dispatch (x86)) collapses to a one-entry switch on aarch64, and there is effectively no dispatch overhead. make arch=arm64 builds and installs the binary under the bare bwa-mem3 name.

sse2neon shim

ext/sse2neon/sse2neon.h is a header-only library that maps Intel SSE intrinsics to their NEON equivalents. When APPLE_SILICON=1 is defined (set automatically when uname -m is arm64 or aarch64), src/simd_compat.h includes sse2neon and defines the SSE feature test macros (__SSE__ through __SSE4_2__) so that code guarded by those macros compiles without changes.

The translation is not zero-cost for all operations. Two patterns that sse2neon handles poorly are replaced with native NEON in src/simd_compat.h:

_mm_movemask_epi16 — used heavily in bandedSWA.cpp to extract the sign bit of each 16-bit lane. The native implementation shifts right by 15, narrows to 8-bit with vmovn_u16, and reduces with position-weighted vaddv_u8.
_mm_blendv_epi16_fast — a bitwise select on 16-bit lanes using vbslq_s16. Replaces the three-operation OR/AND/ANDNOT sequence sse2neon emits for _mm_blendv_epi8.
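The bitwise-select operation behind the second helper is easy to state in scalar form. This is a one-lane reference model with an illustrative name, not the NEON code. A plain bit select matches a blend only when each mask lane is all-ones or all-zeros, which holds for masks produced by vector comparisons.

```cpp
#include <cassert>
#include <cstdint>

// Bitwise select on one 16-bit lane: take the bits of `b` where `mask` is 1
// and the bits of `a` where `mask` is 0.
constexpr uint16_t bit_select(uint16_t mask, uint16_t a, uint16_t b) {
    return static_cast<uint16_t>((mask & b) | (~mask & a));
}
```

The scalar expression spells out the same AND, ANDNOT, and OR steps that the shim sequence performs per vector; a single select instruction does all three at once.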

Memory alignment

Apple Silicon uses 128-byte cache lines (versus 64 bytes on x86). simd_compat.h overrides _mm_malloc on ARM to call posix_memalign with a minimum alignment of 128 bytes for all SIMD allocations. CACHE_LINE_BYTES is set to 128 in macro.h when APPLE_SILICON=1.
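An allocator with that minimum-alignment behaviour can be sketched on top of `posix_memalign`. The names `simd_alloc`, `simd_free`, and `kCacheLineBytes` are hypothetical; the project's actual `_mm_malloc` override is not reproduced here.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>  // std::free; posix_memalign comes from <stdlib.h> on POSIX

constexpr std::size_t kCacheLineBytes = 128;

// Allocate `size` bytes aligned to at least one 128-byte cache line.
// `align` must be a power of two, as posix_memalign requires.
inline void* simd_alloc(std::size_t size, std::size_t align) {
    if (align < kCacheLineBytes) align = kCacheLineBytes;  // enforce the floor
    void* p = nullptr;
    if (posix_memalign(&p, align, size) != 0) return nullptr;
    return p;
}

// Memory from posix_memalign is released with free().
inline void simd_free(void* p) { std::free(p); }
```

Raising a caller's 16- or 32-byte request to 128 keeps each SIMD buffer from sharing a cache line with unrelated data, at the cost of some padding per allocation.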

Accelerate.framework

The Makefile links -framework Accelerate on macOS ARM builds. The framework is linked but not used for computation: bwa-mem3’s hot paths (Smith-Waterman, FM-index) do not match the large-matrix / large-vector patterns that BLAS and vDSP target. The link is retained to keep the option open and adds no overhead at runtime.

P-core / E-core detection

src/fastmap.cpp calls HTStatus() on macOS to detect the Apple Silicon microarchitecture. HTStatus() reads the hw.perflevel0.physicalcpu and hw.perflevel1.physicalcpu sysctl keys to report P-core and E-core counts and the L2 cache size (typically 4 MB on M-series chips). This information is printed at startup for diagnostic purposes. The L2 cache size is used to validate the compile-time BATCH_SIZE setting (currently 1024, which was already optimal for a 4 MB L2 cache).

Benchmark results

All measurements use 100K paired-end reads, 5% error rate, 30% indels, chr17 reference, 8 threads, on an M-series Apple Silicon machine.

Build	Wall-clock (avg, s)	Change vs. previous row
sse2neon baseline (no native NEON)	15.4	—
+ native NEON `kswv.cpp`	14.4	~7% faster
+ native NEON `bandedSWA.cpp` blendv	13.8	~4% faster
PGO on top of native NEON	~13.4	~3% further

The FM-index (FMI_search.cpp) is memory-bound with sequential pointer-chasing dependencies and does not benefit from SIMD. libsais benefits from OpenMP-parallel suffix-array construction but not from SIMD widening within a single thread.

Optimization task summary

Task	Status	Impact	Notes
Correctness verification	done	—	200,006 alignments, 0 differences vs. reference
Dynamic L2 cache detection	done	~0%	4 MB detected; compile-time `BATCH_SIZE=1024` already optimal
Native NEON `bandedSWA.cpp`	done	~4%	`vbsl`-based blendv in `simd_compat.h`
Per-tier dispatch table	N/A	0%	Collapses to one entry on ARM (single NEON level)
Accelerate.framework	done	~0%	Linked; no suitable compute patterns
M1/M2/M3/M4 detection	done	~0%	P/E-core counts and L2 cache via sysctl
Native NEON `FMI_search.cpp`	N/A	0%	Memory-bound; SIMD cannot help
Profile-Guided Optimization	done	~3%	`make pgo-generate` / `make pgo-use`

Building for Apple Silicon

# Standard arm64 build
make arch=arm64

# PGO build (recommended for production on Apple Silicon)
make pgo-generate PGO_ARCH=arm64
./bwa-mem3.pgo-instr mem -t 8 ref.fa r1.fq.gz r2.fq.gz > /dev/null
make pgo-use PGO_ARCH=arm64

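The resulting bwa-mem3.pgo binary delivers the full improvement over the pure sse2neon baseline: roughly 13% by the benchmark table above (15.4 s down to ~13.4 s), of which ~10% comes from the native NEON kernels alone.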

Tip — Recommended production build on Apple Silicon

Use PGO for production deployments. The combined improvement from native NEON kernels plus PGO (~13% over the sse2neon baseline in the benchmark table above) is consistent and verified on M-series hardware.

Files modified in the NEON port

src/kswv.cpp, src/kswv.h — native NEON batched Smith-Waterman
src/bandedSWA.h — SIMD width definitions for ARM
src/simd_compat.h — sse2neon integration, aligned allocation, _mm_blendv_epi16_fast, _mm_movemask_epi16
src/fastmap.cpp — L2 cache detection, HTStatus() for non-NUMA (macOS)
src/macro.h — BATCH_SIZE and CACHE_LINE_BYTES tuning for Apple Silicon
Makefile — arm64 target, sse2neon flags, Accelerate linkage, PGO targets

Regression test framework

bwa-mem3 has three categories of tests — unit, integration, and regression — plus a separate benchmark harness in bench/. Understanding the distinction helps you choose where to add a new test and what to expect from CI.

Test categories

Category	Binary / runner	Fixtures	CI scope
unit	`test/bwa_mem3_tests_unit`	None; all inputs synthetic	Every matrix row
integration	`test/bwa_mem3_tests_integration`	Small committed FASTAs / FMI in `test/fixtures/`	SSE4.1, AVX2, ARM64 Linux, macOS ARM
regression	`test/regression/*.sh`	Downloaded references (phiX, chr22) + bwa + dwgsim	Canonical AVX2 row only

Unit tests must use only synthetic inputs generated programmatically and complete in under 100 ms each. They exercise individual kernels in isolation: kswv scoring, banded Smith-Waterman, KSW, FM-index operations, SMEM extraction, BAM encoding, and pair handling.

Integration tests may load small committed fixtures from test/fixtures/ and have a per-test budget of 10 seconds. They exercise cross-component paths: index loading, SMEM-to-alignment pipelines, and output format validation.

Regression tests are standalone bash scripts that shell out to the bwa-mem3 binary, may diff against third-party tool output (bwa, bwa-meth, samtools), and require fixtures that are either committed to the fixtures directory or downloaded by CI at run time.

Running tests locally

# Build the aligner and test binaries
make
make -C test -j$(nproc 2>/dev/null || sysctl -n hw.ncpu)

# Run all unit tests
./test/bwa_mem3_tests_unit

# Run all integration tests
./test/bwa_mem3_tests_integration

# Run a specific test case or suite
./test/bwa_mem3_tests_unit --test-case="*kswv*"
./test/bwa_mem3_tests_unit --test-suite="unit/kswv"
./test/bwa_mem3_tests_unit --test-suite-exclude=slow

# Verbose output (also print passing assertions)
./test/bwa_mem3_tests_unit --success

The make test target is a convenience shortcut that builds and runs the unit and integration binaries plus the two legacy standalone regression tests (kswv_nrow_zero_test and shm_section_find_test):

make test

Running a regression test locally

Regression scripts expect certain environment variables to point at fixtures. The phiX parity test requires dwgsim:

mkdir -p /tmp/ci-test && cd /tmp/ci-test
curl -sL "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/819/615/GCF_000819615.1_ViralProj14015/GCF_000819615.1_ViralProj14015_genomic.fna.gz" | gunzip > phix174.fa
dwgsim -z 42 -N 500 -1 150 -2 150 -r 0.001 -S 2 phix174.fa reads
cd -
BWA_MEM2="$(pwd)/bwa-mem3" CI_TEST_DIR=/tmp/ci-test bash test/regression/phix_parity.sh

Test framework

The unit and integration binaries are built on doctest, a single-header C++ test framework. Tests are discovered by file glob: any test/unit/test_*.cpp file is compiled into the unit binary; any test/integration/test_*.cpp file is compiled into the integration binary. No Makefile edit is needed when adding a new test_*.cpp.

Test organisation

Tag each TEST_CASE with doctest::test_suite("category/module"):

TEST_CASE("nrow==0 batch does not store out of bounds"
          * doctest::test_suite("unit/kswv")) {
    // ...
}

The test_suite decorator overrides rather than accumulates, so each test case belongs to exactly one suite. Encode the category (unit or integration) and module (kswv, bandedsw, ksw, fmindex, smem, bam, pair, cigar, util) as a single slash-separated string.

Framework helpers

The test/framework/ directory provides helpers shared across test files:

Header	Provides
`scoring.h`	`ScoringMatrix`, `build_scoring_matrix`, `default_scoring_matrix`
`seqpair.h`	`TestPair` struct
`seqpair_gen.h`	Deterministic pair generators: random, exact-match, all-mismatch, homopolymer, sub-cluster, N-bases
`seqpair_batch.h`	`BatchBuffers` — flat-layout packer for kswv batch input
`ksw_runner.h`	`run_scalar_ksw`, default gap/extra parameters
`kswv_runner.h`	Two-pass `run_kswv_batch`
`kswr_cmp.h`	Score / coordinate / score2 comparators
`junit_reporter.h`	CI matrix-row banner and JUnit XML output

Debugging a failing test

# Break into debugger at the first failing assertion
./test/bwa_mem3_tests_unit --test-case="*kswv*" --break

# Run a single SUBCASE
./test/bwa_mem3_tests_unit --test-case="*foo*" --subcase="bar"

# Enable per-phase diagnostics for kswv tests
BWA_TESTS_DEBUG_PHASE0=1 BWA_TESTS_DEBUG_PHASE1=1 \
  ./test/bwa_mem3_tests_unit --test-suite="unit/kswv"

JUnit artifacts are uploaded per CI matrix row (unit-results-<name>.xml, integration-results-<name>.xml) and available on the Actions run page.

Tip — Use ASAN for memory bugs

Build with make ASAN=1 test to catch out-of-bounds writes in vectorised kernels. The kswv_nrow_zero_test specifically exercises the nrow==0 path that triggered a store to memory before the start of the allocation; ASAN reports this at the faulting store rather than at a later allocator operation.

Standalone regression tests

Three standalone regression tests live outside the doctest harness because they predated it. The two binaries are built and run by make test; the third is script-driven:

kswv_nrow_zero_test — binary; exercises the all-len1==0 batch path in every SIMD kswv variant. Catches the nrow==0 rowMax store overrun from issue #38 / upstream bwa-mem2 PR #289.
shm_section_find_test — binary; exercises the shared-memory index section-find logic.
shm_pack_round_trip_test — script-driven, invoked via test/shm_pack_round_trip_test.sh, which builds the phiX index first.

Additional integration shell scripts in test/:

Script	What it tests
`pg_cl_escape_test.sh`	`@PG CL:` tab/newline escape in SAM headers
`mimalloc_loaded_test.sh`	mimalloc override is active when `USE_MIMALLOC=1`
`shm_round_trip_test.sh`	`bwa-mem3 shm` load / list / drop cycle
`shm_meth_test.sh`	`--meth` index compatibility with `shm`
`help_prescan_test.sh`	`--help` prints without running alignment
`libsais_*.sh`	libsais index correctness vs. BWA / determinism

Benchmark harness (`bench/`)

bench/ is a separate performance measurement harness used during development to gate performance PRs. It is not part of the CI test suite.

cp bench/config.env.example bench/config.env
# Edit config.env to point at your index, reads, and binary paths
bench/run.sh baseline         # N trials; appends to bench/results.csv
bench/run.sh candidate        # N trials on the candidate binary
bench/compare.sh baseline candidate  # wall-clock / RSS / md5 delta report

Each run records: tag, host, architecture, binary path, thread count, trial index, wall-clock seconds, max RSS (KB), and a golden md5 (single-threaded, @PG-stripped SAM). The md5 verifies byte-identical output across builds; wall-clock is the primary performance metric.
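To make the two checks concrete, here is a small sketch of how recorded trials might be summarised: a median wall-clock per tag and an md5 equality check. The BenchTrial struct and both function names are hypothetical and cover only a subset of the recorded columns; this is not the logic of bench/compare.sh.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical in-memory form of a few columns of bench/results.csv.
struct BenchTrial {
    std::string tag;         // e.g. "baseline" or "candidate"
    double wall_seconds;     // wall-clock for this trial
    long max_rss_kb;         // peak resident set size
    std::string golden_md5;  // md5 of the single-threaded, @PG-stripped SAM
};

// Median wall-clock over the trials carrying `tag`; -1.0 if there are none.
inline double median_wall(const std::vector<BenchTrial>& rows,
                          const std::string& tag) {
    std::vector<double> w;
    for (const auto& r : rows) {
        if (r.tag == tag) w.push_back(r.wall_seconds);
    }
    if (w.empty()) return -1.0;
    std::sort(w.begin(), w.end());
    const std::size_t n = w.size();
    return (n % 2 == 1) ? w[n / 2] : 0.5 * (w[n / 2 - 1] + w[n / 2]);
}

// True when every trial reports the same golden md5 (byte-identical output).
inline bool md5_identical(const std::vector<BenchTrial>& rows) {
    for (const auto& r : rows) {
        if (r.golden_md5 != rows.front().golden_md5) return false;
    }
    return true;
}
```

The median is used here rather than the mean so that a single noisy trial does not dominate the comparison; the md5 check is what guards against a speedup that silently changes output.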

Release process

bwa-mem3 follows semantic versioning. Releases are driven by git tags. The version string is derived automatically from git describe and embedded in every binary at compile time.

Version stamping

The Makefile computes the version string at parse time:

FG_LABS_VERSION_FALLBACK := 0.2.0
VERSION_STRING := $(shell git describe --tags --dirty 2>/dev/null || echo $(FG_LABS_VERSION_FALLBACK))

git describe produces a string such as v0.1.0 (on a tag), v0.1.0-3-gabcdef1 (three commits past the tag), or v0.1.0-dirty (uncommitted changes). If git describe fails — for example in a source tarball or a shallow clone without tag history — the build falls back to FG_LABS_VERSION_FALLBACK.
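The three shapes of the version string can be told apart mechanically. The sketch below parses a git describe --tags --dirty string into its parts; DescribeInfo and parse_describe are hypothetical names used for illustration and are not part of the bwa-mem3 build.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical decomposition of a `git describe --tags --dirty` string.
struct DescribeInfo {
    std::string tag;    // e.g. "v0.1.0" or "v0.1.0-pre"
    int distance = 0;   // commits past the tag; 0 when exactly on the tag
    std::string sha;    // abbreviated commit hash; empty when on the tag
    bool dirty = false; // true when the working tree had uncommitted changes
};

inline bool parse_describe(const std::string& s, DescribeInfo& out) {
    // tag, then optional "-<N>-g<sha>", then optional "-dirty".
    // The lazy tag group lets tags that contain hyphens (v0.1.0-pre) survive.
    static const std::regex re(
        R"(^(.+?)(?:-([0-9]+)-g([0-9a-f]+))?(-dirty)?$)");
    std::smatch m;
    if (!std::regex_match(s, m, re)) return false;
    out.tag = m[1].str();
    out.distance = m[2].matched ? std::stoi(m[2].str()) : 0;
    out.sha = m[3].matched ? m[3].str() : std::string();
    out.dirty = m[4].matched;
    return true;
}
```

A release binary should parse with distance 0, an empty sha, and dirty set to false; anything else means the build did not come from a clean tagged commit.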

The string is written into src/version.h by the src/version.h: FORCE rule, which runs on every make invocation but only touches the file when the string changes. This minimises unnecessary recompilation of src/main.o.

PACKAGE_VERSION from src/version.h appears in:

bwa-mem3 version output (stdout).
The @PG VN: field in every SAM/BAM file produced by bwa-mem3 mem.

Verifying the version

./bwa-mem3 version
# Example output on a tagged commit:
# v0.1.0
# mimalloc 3.3.0        ← if USE_MIMALLOC=1

On an untagged commit the string includes the commit distance and short SHA:

v0.1.0-12-g3f7ab2e

Semver policy

bwa-mem3 follows semver, interpreted for an alignment tool as follows.

MAJOR (X.0.0) — bump when the change would break a downstream consumer that pinned the previous version without checking release notes. Concretely:

An on-disk index file format change (a re-index is required to use the new version).
Removal or rename of a CLI flag or subcommand.
A SAM/BAM tag is removed, renamed, or its type/value space changes incompatibly (a column-fixed downstream parser would break). Adding a new tag is not a major change.
A change to the resolved primary alignment that is intentional and affects more than a negligible fraction of reads (e.g. a MAPQ recalibration applied unconditionally). Concordance regressions attributable to bug fixes are not major changes — call them out in the release notes under “Correctness” instead.
Dropping support for a previously supported host class (e.g. raising the build’s compiled-in BASELINE_ARCH floor in a way that excludes hosts the previous release ran on).

MINOR (0.X.0) — bump for any user-visible new functionality that does not break consumers pinned to the previous minor. Examples:

A new CLI flag or subcommand.
A new SAM aux tag emitted on output (e.g. HN:i in 0.1.0, the Bismark XR:Z / XG:Z / XM:Z set in 0.2.0).
A new operational feature (e.g. bwa-mem3 shm, in-process SIMD dispatch).
A user-facing default change that is documented in release notes but does not require any consumer action (e.g. BASELINE_ARCH=avx2 as the build default).
New performance characteristics that change wall-time meaningfully.

PATCH (0.0.X) — bump for bug fixes, doc-only changes, build fixes, and internal refactors that have no user-visible behavioral delta. Pre-existing-bug fixes that incidentally shift output for a small fraction of reads are patch-level when called out in the release notes; widespread output shifts (>0.1% of reads on a typical WGS bench cell) deserve MINOR or MAJOR depending on the source.

While the project is pre-1.0, the leading 0. is treated literally — 0.2.0 may make breaking changes vs 0.1.0 if called out clearly in the release notes. After 1.0, MAJOR bumps are reserved for genuinely breaking changes.

Release-readiness checklist

Run through this list on the commit you intend to tag, before running any git tag command. Every item must pass.

Build and test

make clean && make succeeds at the default BASELINE_ARCH (avx2) on a Linux x86_64 host.
make clean && make BASELINE_ARCH=sse41 succeeds on the same host — confirms the portability floor still compiles.
make clean && make succeeds on an arm64 host (Apple Silicon or aarch64 Linux).
make test passes on both x86_64 and arm64.
test/regression/all_tiers_parity.sh produces byte-identical SAM across BWAMEM3_FORCE_TIER=sse41 → sse42 → avx → avx2 → avx512bw on an AVX-512BW host. Failures here indicate a per-tier kernel or dispatcher-wiring regression — fix before tagging.

Bench

bwa-mem3-bench run submitted on the candidate SHA via bwa_mem3_bench.cli submit --fg-labs-sha <sha> (or the local smoke path for a fast sanity check).
bench regression --prev <previous-tag-sha> reports gate PASS — concordance ≥ 99.999% on every vs-baseline.json cell except methylation (which is expected to drift vs the bwameth baseline; see the methylation carve-out below) and no cell labeled REGRESSION.
Methylation cells reviewed for expected-drift consistency: the meth-twist-emseq-5M concordance vs the bwameth baseline should sit at ~98.9% post-PR-#90, with the per-class breakdown matching the entry in bwa-mem3-bench/docs/expected-divergences.yaml (or the entry added in this release — the file is in the bench repo, not in this repo).

Docs

make docs builds cleanly with no mdbook warnings.
NEWS.md has a top-section entry for the new version with Operational / packaging, Correctness, Performance, and Methylation subheadings as applicable; every user-visible PR in the release window is listed with its number.
docs/src/whats-different/overview.md FG-MAIN-TABLE is regenerated to cover the new PRs (see Contributing for the regeneration command).
docs/src/whats-different/upstream-prs.md has rows for every user-visible PR landed since the previous tag.
docs/src/reference/changelog.md and docs/src/cli/version.md examples reference the new release string.
Spot-check the bwa-mem3-bench reference numbers in docs/src/performance/overview.md against the bench’s regression.md for the tagging SHA.

Tagging the release

Run these in order; each command depends on the previous one.

Pre-flight (confirms the readiness checklist):
```
make clean && make
make test
make docs
```
Confirm NEWS.md is current. The top entry header line must match the tag you are about to create (e.g. Release 0.2.0 (YYYY-MM-DD)).
Tag the release commit. Prefer a signed tag (-s); fall back to an annotated tag (-a) only when signing is unavailable:
```
git tag -s v0.X.Y -m "Release v0.X.Y"
```
Push the tag to the fg-labs remote:
```
git push fg-labs v0.X.Y
```
Read the Docs activates a versioned build at /v0.X.Y/ automatically when the tag appears on the remote.
Create a GitHub release from the tag via gh. The body should be the matching NEWS.md section, no preamble:
```
gh release create v0.X.Y --repo fg-labs/bwa-mem3 \
  --title "bwa-mem3 v0.X.Y" \
  --notes-file <(awk '/^Release / {p = ($0 ~ /^Release 0\.X\.Y /)} p' NEWS.md)
```
Substitute 0.X.Y with the exact version literal you are tagging, including any pre-release suffix — e.g. for v0.3.0-pre use /^Release 0\.3\.0-pre /. The trailing space in the inner pattern anchors the match to a complete version token so that 0.3.0 does not also match Release 0.3.0-pre (...). The awk script prints lines while p is true: it flips on at the matching Release line and back off at the next Release line, which gives a clean section without needing a trailing sed '$d'.
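For readers less familiar with awk, the same toggle logic can be written out explicitly. The sketch below is an illustrative C++ rendering of the one-liner, assuming NEWS.md headers of the form "Release <version> (<date>)"; extract_release_section is a hypothetical name, and the release process itself uses the awk command, not this code.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical C++ rendering of:
//   awk '/^Release / {p = ($0 ~ /^Release <version> /)} p' NEWS.md
inline std::string extract_release_section(const std::string& news,
                                           const std::string& version) {
    const std::string any_header = "Release ";
    // The trailing space keeps "0.3.0" from matching "Release 0.3.0-pre (...)".
    const std::string wanted = "Release " + version + " ";
    std::istringstream in(news);
    std::string line, out;
    bool printing = false;
    while (std::getline(in, line)) {
        // Every Release header re-evaluates the flag: on for the wanted
        // version, off for any other version.
        if (line.compare(0, any_header.size(), any_header) == 0) {
            printing = (line.compare(0, wanted.size(), wanted) == 0);
        }
        if (printing) {
            out += line;
            out += '\n';
        }
    }
    return out;
}
```

Because the flag is recomputed on every header line, the section ends exactly where the next release begins and no trailing line needs to be trimmed.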

Note — Tarball builds

Source tarballs created by GitHub (or git archive) do not include git history, so git describe fails and the version falls back to FG_LABS_VERSION_FALLBACK. For reproducible tarball builds, set VERSION_STRING explicitly on the command line: make VERSION_STRING=v0.X.Y.

Post-release verification

After the tag is pushed and the GitHub release is published:

Wait ~5 minutes for Read the Docs to build the new version, then open https://bwa-mem3.readthedocs.io/en/v0.X.Y/ and confirm:
- The version selector lists v0.X.Y.
- The home page renders with no missing-page errors.
- developer-guide/launcher.md, performance/overview.md, and methylation/tags.md all render with their mermaid diagrams and tables intact (these are the most diagram-heavy pages).

Pull the tag in a clean clone and verify bwa-mem3 version reports the bare tag string (no -N-gSHA distance suffix):

git clone -b v0.X.Y --depth 1 https://github.com/fg-labs/bwa-mem3.git
cd bwa-mem3 && make
./bwa-mem3 version | head -1
# expect: v0.X.Y

If the docs build failed on RTD or the version string is wrong, do not delete or move the tag. Tags are immutable in practice — open a follow-up v0.X.(Y+1) patch release with the fix instead.

Branch and tag conventions

All release tags are on the main branch, which carries both upstream bwa-mem2 commits and fork-carried changes. See Branch and worktree conventions for the full branching model.
Tags are prefixed with v: v0.1.0, v0.2.0, etc.
Pre-release tags use a -pre suffix: v0.1.0-pre.
Patch releases increment the third component: v0.1.1.
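The tag conventions above are simple enough to check mechanically. The following sketch encodes them as a regular expression; is_release_tag is a hypothetical helper, and it assumes -pre is the only pre-release suffix in use.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical check: "v" prefix, three numeric components, optional "-pre".
inline bool is_release_tag(const std::string& tag) {
    static const std::regex re(R"(^v[0-9]+\.[0-9]+\.[0-9]+(-pre)?$)");
    return std::regex_match(tag, re);
}
```

A git describe string such as v0.1.0-3-gabcdef1 is deliberately rejected: it identifies a commit past a tag, not a tag.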

What’s Different table update

When a release bundles new fork-carried commits that were not previously documented, update the FG-MAIN-TABLE in docs/src/whats-different/overview.md in the same PR before tagging. See Contributing for the rule.

Branch and worktree conventions

This page describes how the bwa-mem3 repository branches relate to upstream bwa-mem2, the policy for where PRs land, and the conventions for local worktrees when working on multiple branches simultaneously.

Branch model

`master` — upstream mirror

master tracks the upstream bwa-mem2 master branch verbatim. No fork-carried changes are applied here. When upstream bwa-mem2 merges new commits, master is fast-forwarded to match.

master is the starting point for upstream rebase operations. It is never the target of fork PRs.

`main` — fork integration branch

main carries all fork-carried commits on top of a rebased upstream baseline. This is the branch that:

All new feature, fix, and improvement PRs target.
All git tags (v0.X.Y) are placed on.
Read the Docs /latest/ follows.

When upstream bwa-mem2 makes significant changes, master is fast-forwarded and then main is rebased onto the new master tip. The rebase is verified by running the full test suite before the result is pushed.

Feature and fix branches

All development work happens on short-lived branches that are merged into main via pull request. Branch name conventions:

Prefix	Use
`feat/`	New features or capabilities
`fix/`	Bug fixes
`perf/`	Performance improvements
`test/`	Test additions or improvements
`docs/`	Documentation changes
`ci/`	CI / build system changes
`refactor/`	Code restructuring without behaviour change

Branch names use kebab-case after the prefix: fix/kswv-nrow-zero, perf/libsais-fm-index, test/regression-tests.
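A branch name can be validated against these conventions with a single pattern. In this sketch, is_valid_branch_name is a hypothetical helper, and kebab-case is assumed to mean lowercase letters and digits separated by single hyphens.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical check: one of the documented prefixes, then a kebab-case slug.
inline bool is_valid_branch_name(const std::string& name) {
    static const std::regex re(
        R"(^(feat|fix|perf|test|docs|ci|refactor)/[a-z0-9]+(-[a-z0-9]+)*$)");
    return std::regex_match(name, re);
}
```

The three example names above all pass, while names with an unknown prefix, uppercase letters, or underscores are rejected.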

Upstream rebase cadence

main is rebased onto master (i.e., onto upstream bwa-mem2) periodically — not on every upstream commit, but when upstream merges a batch of changes worth incorporating. The process is:

Fast-forward master to the new upstream tip.
Rebase main onto master, resolving any conflicts.
Run make && make test to confirm the rebase is correct.
Push master and main to the fg-labs remote.

Warning — Do not merge upstream into main

Always rebase rather than merge when incorporating upstream changes. Merge commits obscure the fork-carried commit history and make the What’s Different table harder to maintain.

Worktrees for parallel branches

When working on multiple branches simultaneously, use git worktrees instead of stashing or switching branches. Each worktree is a sibling directory of the main clone.

Creating a worktree for a PR branch

# Fetch the PR's head branch from the fg-labs remote
git fetch fg-labs <head-branch-name>

# Create a worktree with a local branch tracking the remote branch
git worktree add ../pr-<N> -b pr-<N> --track fg-labs/<head-branch-name>

The local branch name and directory name match the PR number (pr-N).

Creating a worktree for a new issue branch

# Fetch the latest main from fg-labs
git fetch fg-labs main

# Create a new feature branch off fg-labs/main
git worktree add ../issue-<N> -b <prefix>/issue-<N>-<short-slug> fg-labs/main

# Unset the upstream so the branch is untracked until first push
git -C ../issue-<N> branch --unset-upstream

On first push, push to fg-labs so the head branch is in the same organisation as the PR base:

git push -u fg-labs HEAD

Worktree naming conventions

Directory name	Branch type
`main/`	Primary checkout; tracks `fg-labs/main`
`pr-<N>/`	PR review; local branch `pr-N` tracks `fg-labs/<head-branch>`
`issue-<N>/`	Issue work; local branch `<prefix>/issue-N-<slug>`
Descriptive name	Feature work not yet tied to a PR or issue

Listing and removing worktrees

# List all worktrees
git worktree list

# Remove a worktree after the PR is merged
git worktree remove ../pr-<N>
git branch -D pr-<N>

# Remove an issue worktree
git worktree remove ../issue-<N>
git branch -D <prefix>/issue-<N>-<slug>

Note — Worktree directories are siblings, not nested

All worktree directories sit next to the main clone at the same directory level, not inside it. This avoids confusing git commands that walk parent directories looking for .git.

PR policy

All PRs target main.
PRs from fork contributors should be opened against fg-labs/bwa-mem3 main.
Every PR that adds a fork-carried commit must update the FG-MAIN-TABLE in docs/src/whats-different/overview.md in the same PR. See Contributing.
Merge policy: squash-merge for single-commit changes; rebase-merge for multi-commit PRs with a clean commit history.

Contributing

This page covers the mechanics of submitting changes to bwa-mem3: commit conventions, PR workflow, CI requirements, and the rule for keeping the fork-lineage table current.

Before you start

Check the open issues and existing PRs to avoid duplicate work.
For substantial changes, open an issue first to discuss scope and approach.
Fork or branch from fg-labs/bwa-mem3 main. See Branch and worktree conventions for the branching model.

Commit message conventions

bwa-mem3 follows Conventional Commits (v1.0.0). Every commit message must start with a type prefix:

Prefix	Use
`feat:`	New feature or capability
`fix:`	Bug fix
`perf:`	Performance improvement
`test:`	Test additions or changes
`docs:`	Documentation only
`ci:`	CI / build-system changes
`refactor:`	Restructuring without behaviour change
`chore:`	Maintenance (dependency bumps, version pins)

The subject line is lowercase after the prefix, imperative mood, no trailing period. Keep it under 72 characters. Body lines wrap at 100 characters.

Good:

fix: kswv nrow==0 batch skips rowMax store when i==0

Exercises the all-len1==0 path across SSE4.1, AVX2, AVX-512BW, and ARM NEON.
Without the `if (i > 0)` guard, the store writes SIMD_WIDTH* bytes before the
allocation.

Closes #38.

Not acceptable:

Fixed stuff
Updated kswv
WIP
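The mechanical parts of the subject-line rule can be checked in a few lines. The sketch below is a hypothetical linter, not a hook shipped with bwa-mem3; it checks the type prefix, the lowercase start, the trailing period, and the length limit, but cannot judge imperative mood and does not handle the optional scope syntax of Conventional Commits.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical subject-line check for the conventions described above.
inline bool is_valid_subject(const std::string& subject) {
    static const char* kTypes[] = {"feat", "fix",  "perf",     "test",
                                   "docs", "ci",   "refactor", "chore"};
    if (subject.empty() || subject.size() >= 72) return false;  // under 72 chars
    if (subject.back() == '.') return false;                    // no trailing period
    for (const char* t : kTypes) {
        const std::string prefix = std::string(t) + ": ";
        if (subject.compare(0, prefix.size(), prefix) == 0) {
            if (subject.size() == prefix.size()) return false;  // empty description
            const unsigned char first =
                static_cast<unsigned char>(subject[prefix.size()]);
            // The description must not start with an uppercase letter.
            return std::islower(first) != 0 || std::isdigit(first) != 0;
        }
    }
    return false;  // no recognised type prefix
}
```

The "Good" subject above passes this check, and each of the three "Not acceptable" subjects fails because it lacks a type prefix.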

Pull request workflow

Push your branch to fg-labs/bwa-mem3 (or your fork) and open a PR targeting fg-labs/bwa-mem3 main.
The PR description should explain the motivation, summarise the change, and note any benchmarks or test results.
All CI jobs must pass before merge. See CI matrix below.
CodeRabbitAI reviews every PR automatically. Address all comments, including inline suggestions, summary comments, and nitpicks. Do not dismiss comments without a reply explaining why the suggestion was not adopted.
A project maintainer will review and merge once CI is green and all comments are resolved.

Note — Draft PRs first

Open PRs as drafts while CI is running or while you are actively revising. Convert to ready-for-review only when the branch is stable, CI is green, and you have self-reviewed the diff.

The FG-MAIN-TABLE rule

Every PR that introduces a new fork-carried commit — a commit that is on main but not on master (the upstream bwa-mem2 mirror) — must update the FG-MAIN-TABLE block in docs/src/whats-different/overview.md in the same PR.

The table records each fork-carried change, its bwa-mem3 PR number, the corresponding upstream bwa-mem2 PR or issue (if any), and its upstream status. Keeping this table current is the primary mechanism by which the project maintains transparency about its relationship to upstream.

Warning — Do not skip the table update

A PR that adds a fork-carried commit but omits the table update will be sent back for revision. The table is reviewed as part of the standard PR checklist.

What counts as a fork-carried commit

A commit is fork-carried if:

It adds new behaviour, fixes a bug, or changes build infrastructure in a way that diverges from upstream bwa-mem2 master.
It is present on fg-labs/bwa-mem3 main but not (yet) merged upstream.

Pure documentation commits, CI-only changes, and upstream-rebase bookkeeping commits do not need a table entry.

CI matrix

CI runs on every PR and on push to main. The matrix covers:

Row	Architecture	ISA	Platform
`sse41`	x86_64	SSE4.1	Ubuntu
`avx2`	x86_64	AVX2	Ubuntu (canonical)
`avx512bw`	x86_64	AVX-512BW	Ubuntu
`arm64-linux`	aarch64	NEON	Ubuntu ARM
`arm64-macos`	arm64	NEON	macOS

The canonical row (avx2) is the only one that runs regression tests (shell scripts in test/regression/). Unit tests run on every row. Integration tests run on four of the five rows (SSE4.1, AVX2, ARM64 Linux, macOS ARM); the AVX-512BW row runs unit tests only.

A PR must pass all rows before merge.

Code style

C++14, gnu++14 dialect.
Match the style of the surrounding code. The codebase inherits the upstream bwa-mem2 style, which is C-ish C++ with minimal STL use in hot paths.
For new test code, follow the doctest patterns documented in the test framework.
New SIMD code must include src/simd_compat.h rather than platform-specific headers directly. See SIMD dispatch architecture.

Adding a test for your change

Bug fix → add a unit test or integration test that fails without the fix and passes with it.
New feature → add unit tests for the core logic and, if the feature is end-to-end testable with a shell invocation, a regression test in test/regression/.
Performance change → run the benchmark harness (bench/) to confirm the improvement and include median wall-clock numbers in the PR description.

See Regression test framework for the full guide on where to add tests and how to organise them.

bwa-mem3-bench

bwa-mem3-bench is a benchmarking suite that measures the alignment performance of bwa-mem3 against the upstream bwa-mem2 v2.2.1 baseline. It runs on AWS Batch spot instances across four dataset types — whole-genome sequencing (WGS), whole-exome sequencing (WES), panel, and bisulfite-sequencing (methylation) — all aligned against the hg38 reference. The suite covers three CPU microarchitectures: ARM Neon, x86 AVX2, and x86 AVX-512. Results are collected into a SQLite database for local analysis and reporting. The project is implemented in Python (orchestration, reporting, and CLI), Rust (BAM comparison tool), Snakemake (alignment workflow), and AWS CDK (cloud infrastructure).

When you’d use it

Use bwa-mem3-bench when you need reproducible, multi-architecture throughput numbers before committing a bwa-mem3 change to production or before deciding whether to adopt bwa-mem3 in place of bwa-mem2. It provides a structured “bless baseline, then compare” workflow: an upstream bwa-mem2 run is blessed once per upstream tag and stored in S3; subsequent bwa-mem3 runs are measured against that fixed baseline. Running a full benchmark fires a Snakemake coordinator job on AWS Batch and costs roughly $10 in spot capacity.

How it relates to bwa-mem3

bwa-mem3-bench is the authoritative source of benchmark evidence for every performance claim made in the bwa-mem3 documentation and changelog. When the Performance Overview cites speedup numbers, those numbers come from bwa-mem3-bench runs collected after the relevant PR was merged. The suite also validates that bwa-mem3 does not regress relative to bwa-mem2 on any supported architecture before a new release is tagged.

bwa-mem3-rs

bwa-mem3-rs is a Rust crate that provides idiomatic bindings to the bwa-mem family of short-read aligners — bwa (original), bwa-mem2, and bwa-mem3. It exposes a safe Rust API over the underlying C++ alignment engine, allowing Rust programs to index a reference, configure alignment parameters, and align reads without shelling out to an external process. The bindings link statically against the chosen backend, so a binary built with bwa-mem3-rs carries the aligner and its SIMD kernels as a self-contained artifact.

When you’d use it

Use bwa-mem3-rs when you are building a Rust bioinformatics tool or pipeline that needs short-read alignment as an in-process library call rather than a subprocess invocation. It is especially useful when latency between reads arriving and alignments being available matters (no process-startup overhead), or when you want tight integration between the aligner’s output and downstream Rust code such as UMI grouping, consensus calling, or duplicate marking.

How it relates to bwa-mem3

bwa-mem3-rs targets bwa-mem3 as its primary high-performance backend. It is the intended integration path for fgumi and other Fulcrum Genomics tools that need alignment as a library dependency. Changes to bwa-mem3’s public API, flag semantics, or output format are coordinated with bwa-mem3-rs to keep the bindings current.

bwa-mem2 (upstream)

bwa-mem2 is the direct predecessor of bwa-mem3 and the project from which the bwa-mem3 fork is derived. It was created at Intel’s Parallel Computing Lab by Vasimuddin Md and Sanchit Misra to accelerate the alignment algorithm originally written by Heng Li in bwa. bwa-mem2 achieves a 1.3–3.1x throughput improvement over the original bwa-mem by replacing key inner loops with vectorised implementations (SSE4.1, SSE4.2, AVX2, and AVX-512) and by switching to a more compact FM-index encoding. Its output is identical to bwa-mem at the alignment level, and it is distributed under the MIT license.

Lineage

The bwa alignment family has evolved through three generations, each building on the last:

bwa — Written by Heng Li. Established the BWA-MEM algorithm, the SAM output format conventions, and the .bwt / .pac / .ann / .amb index layout.
bwa-mem2 (Vasimuddin et al., Intel) — Replaced scalar inner loops with SIMD kernels; introduced the compact .bwt.2bit.64 and .0123 index formats; retained full output compatibility with bwa-mem.
bwa-mem3 (Fulcrum Genomics fork) — Carries correctness fixes, performance improvements, new features (bisulfite alignment, mimalloc, ARM Neon), and expanded architecture support on top of the bwa-mem2 codebase. See What’s Different from bwa-mem2 for the full change catalog.

When you’d use it

Use bwa-mem2 directly when you need a stable, widely validated aligner with precompiled binaries available via Bioconda and the project’s GitHub releases page, and when you do not require the features or fixes that bwa-mem3 adds. bwa-mem2 is also the right choice when you are working in an environment where the bwa-mem3 fork has not yet been validated against your specific reference or sequencing library type.

How it relates to bwa-mem3

bwa-mem3 tracks bwa-mem2’s master branch and periodically rebases fork-carried commits on top of upstream changes. The What’s Different section documents every divergence between the two projects, and the Upstream PR status page tracks which bwa-mem3 changes have been proposed back to bwa-mem2. The goal is to keep the fork divergence minimal and to upstream as many fixes as practical.

fgumi

fgumi (Fulcrum Genomics Unique Molecular Indexing tools) is a high-performance suite of command-line tools for processing UMI-tagged next-generation sequencing data. Written in Rust, it provides UMI extraction from FASTQ files, read grouping by UMI with configurable assignment strategies, UMI-aware deduplication, simplex and duplex consensus calling, CODEC consensus calling, quality filtering of consensus reads, and overlapping read-pair clipping. fgumi is the intended successor to the Scala-based fgbio toolkit for UMI processing, targeting significantly higher throughput on multi-core systems. It is published on Bioconda and documented at https://fgumi.readthedocs.io.

Warning — Research preview

fgumi is currently a research preview. The Fulcrum Genomics team targets June 2026 for recommending fgumi over fgbio for production use. Verify fitness for your application before deploying in a clinical or production pipeline.

When you’d use it

Use fgumi when your sequencing library includes unique molecular identifiers and you need to group reads by UMI, call simplex or duplex consensus sequences, or remove PCR duplicates in a UMI-aware manner. It handles the standard commercial UMI library preparations (IDT xGen, KAPA, Twist, QIAseq, and others) and the CODEC protocol for duplex sequencing. fgumi is designed to be run after alignment with bwa-mem3 (or bwa-mem2) and before downstream variant calling or methylation analysis.

How it relates to bwa-mem3

fgumi and bwa-mem3 are sibling projects maintained by Fulcrum Genomics and are designed to work together in the same alignment-and-consensus pipeline. bwa-mem3 provides the aligned BAM that fgumi takes as input for grouping and consensus calling. The two projects share build and documentation conventions (mdbook on Read the Docs, Fulcrum theme, conventional commits) and are benchmarked together in the fgumi-benchmarks internal dataset suite. The intended integration path for in-process alignment within fgumi is bwa-mem3-rs, the Rust bindings for bwa-mem3.

bwameth.py

bwameth.py is a Python script written by Brent Pedersen that implements bisulfite sequencing (BS-Seq) alignment using the in-silico three-letter genome approach. It converts all cytosines to thymines in both the reference and the reads (C-to-T on the forward strand, G-to-A on the reverse), aligns the converted sequences with bwa-mem (or optionally bwa-mem2), and then recovers the original read sequence from the aligner’s tag output to tabulate methylation. bwameth.py supports single-end and paired-end reads from the directional bisulfite protocol and is published at https://arxiv.org/abs/1401.1129.
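The three-letter conversion described above can be sketched in a few lines of Python. This illustrates the general technique only; it is not code from bwameth.py or bwa-mem3, and the helper names are hypothetical.

```python
# Illustrative sketch of in-silico bisulfite conversion.
# The function names are hypothetical and do not come from bwameth.py.
C_TO_T = str.maketrans("Cc", "Tt")
G_TO_A = str.maketrans("Gg", "Aa")


def convert_c2t(seq: str) -> str:
    """Replace every C with T (the forward-strand conversion)."""
    return seq.translate(C_TO_T)


def convert_g2a(seq: str) -> str:
    """Replace every G with A (the reverse-strand conversion)."""
    return seq.translate(G_TO_A)


print(convert_c2t("ACGTCCGA"))  # ATGTTTGA
print(convert_g2a("ACGTCCGA"))  # ACATCCAA
```

Because the conversion discards the original C and G positions, the unconverted read has to be carried alongside the converted one so that methylation state can be recovered after alignment.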

When you’d use it

Use bwameth.py when you need a battle-tested, community-supported bisulfite aligner that runs on top of the standard bwa-mem or bwa-mem2 you have already installed, and when you prefer a Python wrapper over a self-contained binary. It also remains the reference for downstream tabulation tools such as MethylDackel and SNP callers such as biscuit that expect the bwameth.py output format. For the actual methylation tabulation and variant calling steps, bwameth.py’s author recommends those dedicated tools rather than the tabulation utilities bundled with the original script.

How it relates to bwa-mem3

bwa-mem3 mem --meth is a single-binary drop-in replacement for the bwameth.py alignment pipeline. It inlines the C-to-T and G-to-A conversion, runs the bwa-mem3 alignment engine (with all of its correctness fixes and SIMD speedups), rewrites the @SQ headers to collapse the per-strand contig pairs back to canonical chromosome names, emits Bismark-compatible XR:Z / XG:Z / XM:Z auxiliary tags, and writes a @PG ID:bwa-mem3-meth header. The bwameth.py-style chimera QC heuristic is available via --chimera-qc; it is off by default, which matches Bismark behavior. The Methylation Reference section documents the implementation in detail, including the tag semantics and the --set-as-failed / --chimera-qc flags.
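As a rough illustration of the @SQ header collapse, the sketch below assumes that every contig in the doubled reference is named with a single-character f or r prefix (for example fchr1 and rchr1) and that both copies carry the same length. The function name is hypothetical, and the real implementation may handle more cases.

```python
def collapse_sq_lines(header_lines):
    """Collapse f/r-prefixed @SQ entries to one entry per chromosome.

    Assumption: every @SQ name is prefixed with 'f' or 'r'. A name that
    merely starts with one of those letters would be mangled by this sketch.
    """
    seen = set()
    collapsed = []
    for line in header_lines:
        if not line.startswith("@SQ"):
            collapsed.append(line)  # pass other header lines through
            continue
        fields = line.split("\t")
        for i, field in enumerate(fields):
            if field.startswith("SN:") and field[3:4] in ("f", "r"):
                fields[i] = "SN:" + field[4:]  # strip the strand prefix
        name = next((f for f in fields if f.startswith("SN:")), None)
        if name in seen:
            continue  # second strand copy of a chromosome already emitted
        seen.add(name)
        collapsed.append("\t".join(fields))
    return collapsed


header = [
    "@HD\tVN:1.6\tSO:unsorted",
    "@SQ\tSN:fchr1\tLN:1000",
    "@SQ\tSN:rchr1\tLN:1000",
]
print(collapse_sq_lines(header))
```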

Tip — Interop with the bwameth.py c2t step

If your pipeline already performs its own C-to-T conversion before alignment, see Interop with external bwameth.py c2t for how to pass pre-converted reads to bwa-mem3 mem --meth without double-conversion.

Glossary

Terms used throughout this book, listed alphabetically.

@HD header The first line of a SAM file header. Specifies the SAM format version (VN) and sort order (SO). Required when any other header lines are present. See Output: SAM/BAM, headers, tags.

@PG header A SAM header line recording a program that processed the file, including ID, PN, VN, and CL fields. bwa-mem3 inserts ID:bwa-mem3 (or ID:bwa-mem3-meth in methylation mode). See Output: SAM/BAM, headers, tags.

@SQ header A SAM header line describing a reference sequence (chromosome). Contains the sequence name (SN) and length (LN). In methylation mode, bwa-mem3 post-processes @SQ lines to collapse f/r-prefixed contig names back to one entry per chromosome. See Chimera QC and header rewriting.

BAM Binary Alignment Map — a compressed, binary encoding of SAM. Produced by bwa-mem3 when the --bam flag is given or when output is piped through samtools. See Output: SAM/BAM, headers, tags.

Banded Smith-Waterman (banded SWA) A heuristic variant of the Smith-Waterman alignment algorithm that restricts the dynamic programming to a band of width w around the main diagonal. bwa-mem3 uses banded SWA for extension alignment; the bwa-mem2 kernels are SIMD-vectorized, and bwa-mem3 adds NEON implementations for ARM64 hosts, including Apple Silicon. See SIMD dispatch architecture.
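To make the band restriction concrete, here is a minimal scalar sketch. It assumes a linear gap penalty and arbitrary scoring values chosen for illustration; the production kernels use affine gaps and SIMD, so treat this as a teaching aid rather than a description of the bwa-mem3 code.

```python
def banded_sw_score(query, target, w, match=1, mismatch=-4, gap=-6):
    """Best local alignment score using only cells with |i - j| <= w.

    Linear gap penalty for brevity (an assumption of this sketch).
    Cells outside the band stay at 0, so they can start a new local
    alignment but can never contribute a positive score.
    """
    n, m = len(query), len(target)
    prev = [0] * (m + 1)
    best = 0
    for i in range(1, n + 1):
        curr = [0] * (m + 1)
        lo = max(1, i - w)   # leftmost column inside the band
        hi = min(m, i + w)   # rightmost column inside the band
        for j in range(lo, hi + 1):
            s = match if query[i - 1] == target[j - 1] else mismatch
            score = max(0, prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
            curr[j] = score
            best = max(best, score)
        prev = curr
    return best


print(banded_sw_score("ACGT", "TTACGT", w=0))  # 0: the match lies off the diagonal
print(banded_sw_score("ACGT", "TTACGT", w=2))  # 4: a wider band reaches it
```

The two calls show the trade-off: a narrow band is cheaper but misses alignments whose diagonal is shifted by more than w.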

c2t Cytosine-to-thymine in-silico conversion applied to reads (or reference) before methylation alignment. In --meth mode, bwa-mem3 converts R1 reads C→T and R2 reads G→A inline, without writing intermediate FASTQ files. See Conversion details (C->T, G->A).

Chimera A read alignment where the aligned portion is short relative to the read length, often indicating a mapping artefact or a true chimeric molecule. In methylation mode with --chimera-qc enabled (it is off by default), bwa-mem3 applies a chimera QC heuristic: if the longest contiguous M/=/X CIGAR run is less than 44% of the read length, the alignment is flagged 0x200, the proper-pair bit is cleared, and MAPQ is capped at 1. See Chimera QC and header rewriting.
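The heuristic can be sketched as below. Two assumptions are made for illustration: each M, = or X operation is treated as its own run (adjacent operations are not merged), and the read length is passed in by the caller. The function names are hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
FLAG_PROPER_PAIR = 0x2
FLAG_QC_FAIL = 0x200


def longest_match_run(cigar):
    """Length of the longest single M, = or X operation in a CIGAR string."""
    runs = [int(n) for n, op in CIGAR_OP.findall(cigar) if op in "M=X"]
    return max(runs, default=0)


def apply_chimera_qc(flag, mapq, cigar, read_len, min_fraction=0.44):
    """Return (flag, mapq) after applying the 44% chimera rule."""
    if longest_match_run(cigar) < min_fraction * read_len:
        flag |= FLAG_QC_FAIL        # set the QC-fail bit (0x200)
        flag &= ~FLAG_PROPER_PAIR   # clear the proper-pair bit (0x2)
        mapq = min(mapq, 1)         # cap MAPQ at 1
    return flag, mapq


print(apply_chimera_qc(99, 60, "40M110S", 150))   # (609, 1)
print(apply_chimera_qc(99, 60, "100M50S", 150))   # (99, 60)
```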

FASTQ A text format for raw sequencing reads. Each record contains a sequence identifier, the nucleotide sequence, a separator, and per-base quality scores in ASCII-encoded Phred format. bwa-mem3 accepts gzip-compressed FASTQ as input. See Quick start: align paired-end FASTQs.

FM-index Ferragina-Manzini index — a full-text index over the Burrows-Wheeler Transform of a sequence. bwa-mem3 uses the compressed .bwt.2bit.64 FM-index for seed finding (SMEM lookup). See Indexing the reference.
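For readers new to the data structure, the following toy example shows backward search over a Burrows-Wheeler Transform. It builds the index naively by sorting suffixes and stores full occurrence tables, which is only practical for short strings; the compressed .bwt.2bit.64 layout used by bwa-mem3 is far more compact and is not reproduced here. All names are hypothetical.

```python
def build_fm_index(text):
    """Build a naive FM-index (C table and full Occ table) for a short text."""
    text += "$"  # sentinel that sorts before every other character
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)
    alphabet = sorted(set(text))
    # c_table[c] = number of characters in text strictly smaller than c
    c_table, total = {}, 0
    for c in alphabet:
        c_table[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return c_table, occ, len(bwt)


def count_occurrences(index, pattern):
    """Count exact matches of pattern by extending it one base at a time, right to left."""
    c_table, occ, n = index
    lo, hi = 0, n  # half-open interval of matching suffix-array rows
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_fm_index("banana")
print(count_occurrences(index, "ana"))  # 2
```

Each loop iteration narrows the suffix-array interval by one character, which is the same primitive that seed finding repeats while extending a match.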

Hard clip A CIGAR operation (H) indicating that bases at the read end are absent from the SEQ field of the alignment record. Hard clipping is used in supplementary alignments to avoid duplicating the read sequence. See Output: SAM/BAM, headers, tags.

kswv The SIMD-vectorized kernel implementing the inner loop of the Smith-Waterman extension alignment in bwa-mem2/bwa-mem3. bwa-mem3 carries correctness fixes for the score-saturation edge case across all SIMD width variants (NEON, AVX2, AVX-512BW). See Correctness fixes.

libsais A library implementing the suffix-array induced sorting (SAIS) algorithm. bwa-mem3 optionally uses libsais for FM-index construction, reducing indexing time compared to the default suffix-array builder. See Performance improvements.

LTO Link-Time Optimization — a compiler mode that defers optimization to link time, enabling cross-compilation-unit inlining. Activated via make lto-build. See Building from source.

MAPQ Mapping quality — a Phred-scaled probability that a read alignment is incorrectly mapped. Reported in SAM field 5. bwa-mem3 follows bwa-mem2 MAPQ semantics; chimera QC in methylation mode caps MAPQ at 1 for chimeric alignments. See Output: SAM/BAM, headers, tags.
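The Phred scale itself is easy to convert in both directions. The sketch below shows only that scale; it does not reproduce how bwa-mem3 derives MAPQ from alignment scores. The cap of 60 is a parameter of this example, and the function names are hypothetical.

```python
import math


def mapq_to_error_probability(mapq):
    """Probability that the placement is wrong, given a Phred-scaled MAPQ."""
    return 10 ** (-mapq / 10)


def error_probability_to_mapq(p_wrong, cap=60):
    """Phred-scale an error probability, rounding and capping the result."""
    if p_wrong <= 0:
        return cap
    return min(cap, round(-10 * math.log10(p_wrong)))


print(mapq_to_error_probability(20))     # 0.01
print(error_probability_to_mapq(0.001))  # 30
```

On this scale, a MAPQ capped at 1 corresponds to an error probability of roughly 0.79, which is why downstream tools treat such alignments as effectively unplaced.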

Mate rescue A step in paired-end alignment where, if one mate lacks a confident seed, bwa-mem3 attempts to find it by performing Smith-Waterman alignment in the region near the mapped mate. bwa-mem3 adds NEON and AVX2 implementations of the mate-rescue kernel. See Architecture support.

mimalloc A high-performance memory allocator from Microsoft. bwa-mem3 vendors mimalloc and links it into every binary by default. To disable, build with USE_MIMALLOC=0. See Memory allocator (mimalloc).

Single-binary SIMD dispatch On x86, bwa-mem3 ships one binary that contains compiled kernels for every supported SIMD tier (sse41 / sse42 / avx / avx2 / avx512bw) and selects one in process at startup via __builtin_cpu_supports. There are no per-tier companion binaries. On ARM64 the binary contains a single NEON kernel TU. Replaces the prior multi-binary execv launcher (PR #83). See Single-binary SIMD dispatch (x86).

PGO Profile-Guided Optimization — a two-pass build where the first pass instruments the binary, a representative workload is run to collect profiles, and the second pass uses those profiles to guide inlining and branch layout. Activated via make pgo-generate then make pgo-use. See PGO build.

Primary alignment The alignment record for a read that represents the aligner’s best placement. A read has exactly one primary alignment (or is reported as unmapped). All other alignments for the same read are marked supplementary (chimeric split read) or secondary (alternative mapping). See Output: SAM/BAM, headers, tags.

Proper-pair flag (0x2) SAM flag bit indicating that both mates of a pair are mapped in the expected orientation and insert-size range. In bwa-mem3, the mem_sam_pe function sets this flag; a correctness fix (PR #17) ensures it is propagated correctly under all conditions. See Correctness fixes.

SAM Sequence Alignment Map — a tab-delimited text format for read alignments. Each record contains mandatory fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL) plus optional tags. See Output: SAM/BAM, headers, tags.

SIMD dispatch Runtime selection of the fastest available SIMD instruction set (SSE4.1, SSE4.2, AVX, AVX2, AVX-512BW, NEON) for hot alignment kernels. On x86 this is implemented in process by src/simd_dispatch.cpp via __builtin_cpu_supports; on ARM64 a single NEON tier covers every supported CPU. See SIMD dispatch matrix.
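The selection logic can be modeled in a few lines. This is a conceptual sketch in Python, not a rendering of src/simd_dispatch.cpp: it assumes the host's capabilities are given as a set of tier names, and it models the downgrade-only BWAMEM3_FORCE_TIER behavior described in the changelog. The function name and warning text are hypothetical.

```python
# Tiers ordered from lowest to highest, as listed in the text.
TIERS = ["sse41", "sse42", "avx", "avx2", "avx512bw"]


def resolve_tier(cpu_tiers, forced=None):
    """Return (tier, warning) for a host that supports the given tiers.

    cpu_tiers: set of tier names the CPU supports (assumed input format).
    forced:    optional override; honored only if it is a downgrade.
    """
    supported = [t for t in TIERS if t in cpu_tiers]
    if not supported:
        raise RuntimeError("host CPU supports none of the compiled tiers")
    best = supported[-1]
    if forced is None:
        return best, None
    if forced not in TIERS:
        return best, f"unrecognized tier {forced!r}; using {best}"
    if TIERS.index(forced) > TIERS.index(best):
        return best, f"cannot force {forced} above {best}; using {best}"
    return forced, None


host = {"sse41", "sse42", "avx", "avx2"}
print(resolve_tier(host))                   # ('avx2', None)
print(resolve_tier(host, forced="sse41"))   # ('sse41', None)
```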

SMEM Super-Maximal Exact Match — a seed found by extending a read’s position in the FM-index as far as possible in both directions. SMEMs form the initial seeds for chaining and extension in the BWA-MEM algorithm. See Performance improvements.

Soft clip A CIGAR operation (S) indicating that bases at the read end were not part of the alignment, but are still present in the SEQ field. Soft clipping commonly appears at adapter-containing or low-quality read ends. See Output: SAM/BAM, headers, tags.
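The practical difference between soft and hard clipping shows up when computing lengths from a CIGAR string. The helper names below are hypothetical; the operation semantics follow the SAM specification.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def seq_length(cigar):
    """Number of bases stored in SEQ: M, I, S, = and X consume the query."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MIS=X")


def original_read_length(cigar):
    """Length of the sequenced read, including hard-clipped bases."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MIS=XH")


print(seq_length("30S70M"), original_read_length("30S70M"))  # 100 100
print(seq_length("30H70M"), original_read_length("30H70M"))  # 70 100
```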

Supplementary alignment A SAM record (FLAG bit 0x800 set) representing a chimeric read split across two or more genomic loci. The segment with the longest aligned span is typically designated primary; remaining segments are supplementary. Hard clipping is used to avoid duplicating the SEQ field. See Output: SAM/BAM, headers, tags.
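Several glossary entries refer to SAM FLAG bits. As a quick reference, the sketch below classifies a record from its FLAG value using the bit definitions in the SAM specification; the function name and the order in which bits are checked are choices made for this example.

```python
FLAG_PROPER_PAIR = 0x2
FLAG_UNMAPPED = 0x4
FLAG_SECONDARY = 0x100
FLAG_QC_FAIL = 0x200
FLAG_SUPPLEMENTARY = 0x800


def classify_alignment(flag):
    """Return 'unmapped', 'supplementary', 'secondary' or 'primary'."""
    if flag & FLAG_UNMAPPED:
        return "unmapped"
    if flag & FLAG_SUPPLEMENTARY:
        return "supplementary"
    if flag & FLAG_SECONDARY:
        return "secondary"
    return "primary"


for flag in (99, 355, 2147, 77):
    print(flag, classify_alignment(flag), bool(flag & FLAG_PROPER_PAIR))
```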

Citation

How to cite

bwa-mem3 is a derivative of bwa-mem2. If you use bwa-mem3 in published work, please cite the original bwa-mem2 paper:

Vasimuddin Md, Sanchit Misra, Heng Li, Srinivas Aluru. Efficient Architecture-Aware Acceleration of BWA-MEM for Multicore Systems. IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2019. doi:10.1109/IPDPS.2019.00041

BibTeX:

@inproceedings{bwamem2-ipdps2019,
  author    = {Vasimuddin Md and Sanchit Misra and Heng Li and Srinivas Aluru},
  title     = {Efficient Architecture-Aware Acceleration of {BWA-MEM} for Multicore Systems},
  booktitle = {IEEE International Parallel and Distributed Processing Symposium (IPDPS)},
  year      = {2019},
  doi       = {10.1109/IPDPS.2019.00041},
  url       = {https://doi.org/10.1109/IPDPS.2019.00041}
}

Lineage

bwa-mem3 is maintained by Fulcrum Genomics as a derivative of bwa-mem2, itself derived from bwa (Li & Durbin, 2009). The BWA-MEM algorithm was originally described in:

Heng Li. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997, 2013.

The bwa-mem3-specific changes and improvements carried on top of bwa-mem2 are documented in What’s Different from bwa-mem2.

License

bwa-mem3 is licensed under the MIT License (same as upstream bwa-mem2).

                           The MIT License

   BWA-MEM2  (Sequence alignment using Burrows-Wheeler Transform),
   Copyright (C) 2019 Intel Corporation, Heng Li.

   Permission is hereby granted, free of charge, to any person obtaining
   a copy of this software and associated documentation files (the
   "Software"), to deal in the Software without restriction, including
   without limitation the rights to use, copy, modify, merge, publish,
   distribute, sublicense, and/or sell copies of the Software, and to
   permit persons to whom the Software is furnished to do so, subject to
   the following conditions:

   The above copyright notice and this permission notice shall be
   included in all copies or substantial portions of the Software.

   THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
   EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
   MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
   NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
   BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
   ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
   CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
   SOFTWARE.

Contacts: Vasimuddin Md <vasimuddin.md@intel.com>; Sanchit Misra <sanchit.misra@intel.com>;
                                Heng Li <hli@jimmy.harvard.edu>

Changelog

Release 0.2.0 (2026-05-13)

Operational / packaging

Single-binary SIMD dispatch on x86 (#83). The previous multi-binary build (make multi producing five bwa-mem3.<tier> ISA variants plus a runsimd.cpp launcher that execv’d the matching tier) is replaced by a single binary that contains compiled kernels for every supported tier (sse41 / sse42 / avx / avx2 / avx512bw) and selects one in process at startup via __builtin_cpu_supports. Install size drops from ~120 MB to ~25 MB; per-call overhead is one indirect branch (~0.3 ns after BTB warm-up). No .<tier> companion files are produced or needed. See docs/src/developer-guide/launcher.md.
BWAMEM3_FORCE_TIER=<tier> and BWAMEM3_DEBUG_SIMD=1 env vars (#83). BWAMEM3_FORCE_TIER is downgrade-only and replaces the prior “exec the bwa-mem3.sse41 binary” A/B-testing pattern; up-tier or unrecognized requests are rejected with a stderr warning.
BASELINE_ARCH=avx2 is the new default for non-kernel translation units on x86 (#84, supersedes the SSE4.1 floor that PR #83 originally shipped with). Override via make BASELINE_ARCH=<tier>. AVX-512BW hosts using BASELINE_ARCH=avx512bw see a small additional speedup on Zen 4 with -mprefer-vector-width=256 (#86) and roughly flat results on Sapphire Rapids — see docs/src/whats-different/avx512-baseline.md for the characterization.
Host-floor precheck (#95). bwa-mem3 mem, bwa-mem3 index, and bwa-mem3 shm refuse to run with exit code 2 and an [E::bwamem3] stderr message when the host CPU does not meet the build’s compile-time SIMD floor, instead of SIGILL-ing deep in alignment. bwa-mem3 version, --help, and -h are exempt and always succeed.
bwa-mem3 version now prints SIMD floor: (build’s required minimum) and SIMD runtime: (resolved tier) lines on stdout, plus a [W::bwa-mem3] warning on stderr (exit 0) if the host is below the floor. See docs/src/getting-started/host-requirements.md.
bwa-mem3 shm performs a statvfs("/dev/shm") capacity preflight (#86). When /dev/shm is too small for the index, the stage aborts with an [E::bwa_shm_stage] message naming /dev/shm, the required size, and a mount -o remount,size=... hint — replacing the prior [fread] Bad address failure mode. statvfs failures (no /dev/shm, restricted sandbox) are non-fatal and the stage proceeds.
bwa-mem3 shm /bwactl registry RMW is now serialized via a POSIX named semaphore (#82, closes #66). Concurrent shm stage / shm drop invocations across processes no longer race when updating the registry; the prior best-effort flock was per-open and did not cover the read-modify-write window.
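The capacity preflight for bwa-mem3 shm can be approximated with Python's os.statvfs. This sketch assumes the check compares the required size against the bytes available to an unprivileged user; whether the real code uses available or total capacity is not stated above. The function name and message wording are hypothetical.

```python
import os


def shm_capacity_preflight(required_bytes, path="/dev/shm"):
    """Return (ok, message); statvfs failures are treated as non-fatal."""
    try:
        st = os.statvfs(path)
    except OSError:
        # No such mount or a restricted sandbox: proceed without the check.
        return True, None
    available = st.f_bavail * st.f_frsize
    if available < required_bytes:
        message = (
            f"{path} has {available} bytes available but "
            f"{required_bytes} are required; consider remounting with "
            f"mount -o remount,size=..."
        )
        return False, message
    return True, None


ok, message = shm_capacity_preflight(28 * 1024**3)
print(ok, message)
```

Failing early with a message that names the mount point is the design goal here; the alternative is a read error partway through staging that says nothing about capacity.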

Methylation

mem --meth emits Bismark-compatible auxiliary tags XR:Z (read conversion CT/GA), XG:Z (genome strand CT/GA), and XM:Z (per-base methylation call string) (#90). These replace the prior bwameth-style YS:Z / YC:Z / YD:Z on output (still used internally for SEQ restoration). The reference-annotation XR:Z from -V is suppressed under --meth to avoid colliding with the Bismark semantics. Downstream tools that previously read YS:Z / YC:Z / YD:Z must be pointed at the corresponding XR:Z / XG:Z and the per-base XM:Z. See docs/src/methylation/tags.md.
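For orientation, the sketch below builds a Bismark-style per-base call string for the simplest case: a gapless read from the original top strand (XR:Z:CT, XG:Z:CT). It follows the Bismark letter convention (z/Z for CpG, x/X for CHG, h/H for CHH, u/U for unknown context, uppercase meaning methylated) and assumes the caller supplies the reference bases under the read plus two trailing bases for context. It is not taken from the bwa-mem3 source, and the function name is hypothetical.

```python
def xm_string_forward(read, ref_window):
    """Per-base methylation calls for a gapless top-strand read.

    read:       original (unconverted) read sequence
    ref_window: reference bases under the read plus two extra trailing bases
    """
    if len(ref_window) < len(read) + 2:
        raise ValueError("ref_window must extend two bases past the read")
    calls = []
    for i, base in enumerate(read):
        if ref_window[i] != "C" or base not in ("C", "T"):
            calls.append(".")  # not a cytosine position, or not informative
            continue
        nxt, nxt2 = ref_window[i + 1], ref_window[i + 2]
        if nxt == "G":
            symbol = "z"       # CpG context
        elif nxt in "ACT" and nxt2 == "G":
            symbol = "x"       # CHG context
        elif nxt in "ACT" and nxt2 in "ACT":
            symbol = "h"       # CHH context
        else:
            symbol = "u"       # context unknown (for example N in the reference)
        # A retained C means methylated (uppercase); a T means converted.
        calls.append(symbol.upper() if base == "C" else symbol)
    return "".join(calls)


print(xm_string_forward("ACGTAGCTT", "ACGCAGCTTAA"))  # .Z.x..H..
```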

Correctness

Fixed SIGSEGV in mem_matesw on shm-backed ref_string (#85). ksw_align2 mutates its reference slice in place; when the slice pointed into a read-only shm segment, this faulted. Now copies the slice before passing it in.
FMI_search sampled-SA prefetch: parenthesized SA_COMPX_MASK precedence so the masked offset is computed against the correct operand (#73). The unparenthesized form was silently producing wrong-but-harmless prefetch addresses; no alignment output was affected.
bntseq .alt parser bounds the line buffer to prevent a stack buffer overflow on malicious or malformed .alt files (#74).
display_stats clamps the per-thread bucket count to LIM_C so --profile with -t greater than the compiled-in limit no longer writes past the end of the stats array (#81).

Performance

x86 wall-time improvements on the bench (vs the 0.1.0-pre baseline): AVX2 (c6a) −17 to −22%, AVX-512 AMD Zen4 (c7a) −16 to −24%, AVX-512 Intel SPR (c7i) −28 to −30% across wgs / wes / panel-twist 5M-read samples. Concordance vs upstream bwa-mem2 v2.2.1 remains 100.0000% on all non-methylation cells. arm64 (c7g / c8g) is flat (within ±2%). The wins are attributable primarily to (a) capping AVX-512BW auto-vectorization at 256-bit on the avx512bw target (#86) and (b) inlining FMI_search::backwardExt to recover a gcc 12+ wall-clock regression (#88). See docs/src/performance/overview.md for the reference numbers across architectures.
Smaller contributions in the release window: per-strip L1 prefetches across all kswv u8/u16 kernels (#70); SMEM_LOCKSTEP_N bumped from 8 to 16 (#75); closed-form ungapped HIT path when total_mis == 0 (#77); ksort switched to an on-stack buffer for small n to drop a per-call malloc (#78); libsais_build skips a wasted zero-init pass on its unpack and SA buffers, trimming index-build time (#80).

Release 0.1.0-pre (2026-04-28)

Project renamed from bwa-mem2 to bwa-mem3. The new project tracks Fulcrum Genomics’ performance and feature work on top of the upstream bwa-mem2 codebase.
Default branch renamed from fg-main to main.
Binary renamed from bwa-mem2 to bwa-mem3. Arch-suffixed variants (bwa-mem3.sse41, .sse42, .avx, .avx2, .avx512bw, .arm64, .pgo, .profile, .lto) renamed to match.
@PG SAM header tags now read ID:bwa-mem3 PN:bwa-mem3 (and bwa-mem3-meth for --meth mode).
Test binaries renamed: bwa_mem2_tests_unit → bwa_mem3_tests_unit, bwa_mem2_tests_integration → bwa_mem3_tests_integration.
.bwt.2bit.64 index file format unchanged — bwa-mem3 reads indexes built by bwa-mem2 index without re-indexing.

Release 2.2.1 (17 March 2021)

Hotfix for v2.2: Fixed the bug mentioned in #135.

Release 2.2 (8 March 2021)

Changes since the last release (2.1):

Passed the validation test on ~88 billion reads (Credits: Keiran Raine, CASM division, Sanger Institute)
Fixed bugs reported in #109 causing mismatch between bwa-mem and bwa-mem2
Fixed the issue (#112) causing a crash due to a corrupted thread id
Using all the SSE flags to create optimized SSE41 and SSE42 binaries

Release 2.1 (16 October 2020)

Release 2.1 of BWA-MEM2.

Changes since the last release (2.0):

Smaller index: the index is 8 times smaller on disk and 4 times smaller in memory, due to moving to only one type of FM-index (2bit.64 instead of 2bit.64 and 8bit.32) and 8x compression of the suffix array. For example, for the human genome, the index size on disk is down to ~10GB from ~80GB and the memory footprint is down to ~10GB from ~40GB. Index I/O time drops substantially as a result, with hardly any impact on read-mapping performance.
Added support for 2 more execution modes: sse4.2 and avx.
Fixed multiple bugs including those reported in Issues #71, #80 and #85.
Merged multiple pull requests.

Release 2.0 (9 July 2020)

This is the first production release of BWA-MEM2.

Changes since the last release:

Made the source code more secure with more than 300 changes all across it.
Added support for memory re-allocations in case the pre-allocated fixed memory is insufficient.
Added support for the MC tag in the SAM file and for the -5 and -q flags on the command line.
The output is now identical to the output of bwa-mem-0.7.17.
Merged index building code with FMI_Search class.
Added support for different ways to input read files; input handling is now the same as in bwa-mem.
Fixed a bug in the AVX512 SAM-processing code that was leading to incorrect output.

Release 2.0pre2 (4 February 2020)

Miscellaneous changes:

Changed the license from GPL to MIT.
IMPORTANT: the index structure has changed since commit 6743183. Please rebuild the index if you are using a later commit or the new release.
Added charts in README.md comparing the performance of bwa-mem2 with bwa-mem.

Major code changes:

Fixed handling of variable-length reads.
Fixed a bug involving reads of length greater than 250bp.
Added support for allocation of more memory in small chunks if large pre-allocated fixed memory is insufficient. This is needed very rarely (thus, having no impact on performance) but prevents asserts from failing (code from crashing) in that scenario.
Fixed a memory leak due to not releasing the memory allocated for seeds after smem.
Fixed a segfault due to non-alignment of small allocated memory in the optimized banded Smith-Waterman.
Enabled working with genomes larger than 7-8 billion nucleotides (e.g. Wheat genome).
Fixed a segfault occurring (with the gcc compiler) while reading the index.
