Home
bwa-mem3
A faster, more correct, drop-in replacement for bwa mem and bwa-mem2.
If you align short reads with bwa or bwa-mem2 today, bwa-mem3 will give you the same answers — only quicker, with fewer rough edges, and with first-class support for things you used to need a wrapper script for.
Why bwa-mem3
- Drop in, go faster. Same algorithm, same outputs, same flags as bwa-mem2 — but consolidated mapping speedups, a memory-bounded index builder, batched header ingestion, and a tuned allocator add up to measurable wall-clock wins on real workloads.
- Methylation in one binary. A --meth flag turns bwa-mem3 into a drop-in replacement for the entire bwameth.py pipeline. No Python, no inline conversion script, no separate post-processing step. One bwa-mem3 index --meth ref.fa, one bwa-mem3 mem --meth ref.fa R1.fq R2.fq, done — header collapsed, tags emitted, chimeras flagged.
- Stage the index once, align many. A bwa-mem3 shm subcommand pins the FM-index in shared memory so back-to-back runs on the same host skip the 28 GB read every time.
- Correctness fixes upstream haven’t merged yet. Tabs in -R, 151+ bp reads, AVX-512 mate-rescue, kswv score2 plateau across NEON/AVX2/AVX-512BW, mem_sam_pe proper-pair flag — every fix tracked back to the upstream PR or issue that found it.
- Architecture-aware out of the box. SSE4.1, SSE4.2, AVX, AVX2, AVX-512BW, and ARM64/NEON. One binary per platform; the dispatcher picks the right tier for your CPU in process at startup.
Get started in 30 seconds
git clone --recursive https://github.com/fg-labs/bwa-mem3
cd bwa-mem3 && make
./bwa-mem3 index ref.fa
./bwa-mem3 mem -t 16 ref.fa R1.fq.gz R2.fq.gz \
| samtools sort -@ 8 -o out.bam
Tip — Emit BAM directly
For production pipelines, add --bam=0 to skip the SAM text round-trip entirely. See Best Practices: Output format.
Where to start
- Installation — Build from source (Bioconda is on the way).
- Quick start: align paired-end FASTQs — Two commands to your first alignment.
- Quick start: methylation — The one-binary bwameth.py replacement, in two commands.
- Best Practices — The five things that actually move the needle for production runs.
- What’s different from bwa-mem2 — Every fix and feature, with upstream cross-references.
What’s in this book
- Getting Started — Install and run your first alignment.
- User Guide — Indexing, alignment, output, threading, allocator notes.
- Performance — Where the speed comes from and how to get more.
- Best Practices — Build, run, and deploy recommendations.
- CLI Reference — Every flag, auto-captured from --help.
- Methylation Reference — --meth mode in full.
- What’s Different from bwa-mem2 — The full changelog, by category.
- Developer Guide — Build matrix, SIMD dispatch, regression tests, contributing.
- Related Projects — bwa-mem3-bench, bwa-mem3-rs, fgumi, bwa-mem2 upstream.
- Reference — Glossary, citation, license, changelog.
bwa-mem3 is a derivative of bwa-mem2 maintained by Fulcrum Genomics. MIT licensed. See License and Citation.
Installation
Bioconda (coming soon)
A Bioconda package for bwa-mem3 is in preparation. Once published, installation will be:
conda install -c bioconda bwa-mem3
This will be the recommended path for most users. Check back here or watch the fg-labs/bwa-mem3 repository for the announcement.
Build from source
Until the Bioconda package is available, build from source using the steps below.
Prerequisites
bwa-mem3 vendors several libraries as git submodules. Building from source requires the toolchain to compile bwa-mem3 itself plus the bootstrap tools each vendored library needs.
| Tool | Why it’s needed | Minimum version |
|---|---|---|
| C++14 compiler (GCC or Clang) | bwa-mem3 itself | GCC 8+ / Clang 7+ |
| GNU make | top-level build | 3.81+ |
| Git | submodule checkout (with --recursive) | any recent |
| autoconf, automake, autoconf-archive, libtool | ext/htslib runs autoreconf -i && ./configure during build | any recent |
| pkg-config | htslib’s configure uses it to locate zlib | any recent |
| zlib development headers | htslib links against zlib | any recent |
| OpenMP runtime | ext/libsais uses OpenMP for parallel suffix-array construction | see notes below |
| CMake 3.12+ | building bundled mimalloc (default; skip if you pass USE_MIMALLOC=0) | 3.12+ |
OpenMP notes.
- On Linux with GCC, libgomp ships with the compiler — no extra package needed.
- On Linux with Clang, install libomp-dev (Debian/Ubuntu) or libomp-devel (RHEL/Fedora).
- On macOS, install Homebrew’s libomp (brew install libomp). The Makefile auto-detects the Homebrew prefix; set LIBOMP_PREFIX=/path/to/libomp if you installed it elsewhere.
Install prerequisites by platform
Debian / Ubuntu:
sudo apt-get install \
build-essential git cmake pkg-config \
autoconf automake autoconf-archive libtool \
zlib1g-dev \
libomp-dev # only needed if building with Clang
RHEL / Fedora / Amazon Linux:
sudo dnf install \
gcc gcc-c++ make git cmake pkgconf-pkg-config \
autoconf automake autoconf-archive libtool \
zlib-devel \
libomp-devel # only needed if building with Clang
macOS (Homebrew):
xcode-select --install # Apple Clang + git + make
brew install \
cmake pkg-config \
autoconf automake autoconf-archive libtool \
libomp
What happens if a prereq is missing. The Makefile fails fast with an actionable error: a missing libomp on macOS, a missing autoreconf, or a missing cmake each produces a one-line hint pointing at the install command above. There is no need to install everything up front; if you prefer, install only what the error message asks for.
Clone and build
git clone --recursive https://github.com/fg-labs/bwa-mem3
cd bwa-mem3
make
The --recursive flag is required. bwa-mem3 vendors several libraries (mimalloc, sse2neon, and
others) as git submodules. A shallow or non-recursive clone will fail to compile.
Warning — Shallow clone submodule pitfall
If you cloned without --recursive, initialize the submodules before running make:
git submodule update --init --recursive
Forgetting this step is the most common source of build failures.
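To check which submodules still need initializing, the output of git submodule status can be filtered. The sketch below relies on standard git behaviour (a leading "-" marks a submodule that has not been initialized). The helper name uninitialized_submodules is hypothetical and not part of bwa-mem3.

```shell
# Hypothetical helper: list submodule paths that are not yet initialized.
# Reads `git submodule status` text on stdin; git prefixes such lines with "-".
uninitialized_submodules() {
    awk '/^-/ { print $2 }'
}

# Typical use from the repository root (prints nothing when all is well):
#   git submodule status --recursive | uninitialized_submodules
```

An empty result means every submodule is checked out and make should not fail for this reason.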
Target architecture
By default, make builds a general-purpose binary that runs on any supported CPU. For maximum
performance, specify the architecture that matches your deployment target:
| Flag | Requires | Notes |
|---|---|---|
| make | SSE4.1 or better (x86), any (ARM) | Default; selects best dispatch at runtime on x86 |
| make arch=avx2 | AVX2 (e.g. Haswell, Zen 2) | Recommended for modern x86 servers |
| make arch=avx512bw | AVX-512BW (e.g. Skylake-X, Ice Lake, Sapphire Rapids) | Maximum x86 performance |
| make arch=arm64 | Apple Silicon / AWS Graviton | NEON-vectorized build |
See Performance — SIMD dispatch matrix for the full matrix of which kernels are vectorized under each target.
Memory allocator (mimalloc)
bwa-mem3 bundles mimalloc and links it into every binary by default. mimalloc reduces allocator contention under high thread counts and lowers wall-clock time on multi-threaded alignment runs.
To build without mimalloc, pass USE_MIMALLOC=0:
make USE_MIMALLOC=0
See User Guide — Memory allocator for details on how mimalloc is linked on Linux versus macOS and when opting out is appropriate.
Smoke test
After building, run the smoke test to confirm the binary works and report which allocator is active:
./bwa-mem3 version
Expected output (with mimalloc):
v0.2.0-12-gabcdef1
mimalloc 3.x.x
If the mimalloc line is absent, the build linked the system allocator (expected when
USE_MIMALLOC=0 was passed or when the vendor submodule was not initialized).
Next: host requirements
If you’re planning to deploy bwa-mem3 across a heterogeneous fleet (AWS Batch, mixed compute clusters), read Host requirements for the supported CPU floor and Best Practices → Multi-architecture deployment for the deployment recipe.
See also: Quick start: align paired-end FASTQs · User Guide — Memory allocator · Developer Guide — Building from source
Host requirements
bwa-mem3 runs on the hosts in the table below. Verify your host with bwa-mem3 version — the SIMD floor and runtime lines tell you what the binary needs and what your host provides.
| Platform | Default build floor | Earliest supported CPU | Notes |
|---|---|---|---|
| Linux x86_64 | AVX2 (BASELINE_ARCH=avx2) | Intel Haswell (2013); AMD Zen / Naples (2017) | Auto-selects best of sse41 / sse42 / avx / avx2 / avx512bw at runtime |
| Linux x86_64 (legacy) | SSE4.1 (BASELINE_ARCH=sse41) | Intel Nehalem (2008); AMD Bulldozer (2011) | Opt-in rebuild; ~10-15% slower on AVX2 hosts |
| Linux arm64 | NEON (aarch64 ABI baseline) | Any aarch64 host | Single tier; NEON is mandatory in the aarch64 ABI |
| macOS arm64 | NEON | Apple M1 (2020) | Apple Silicon only; macOS x86_64 is unsupported |
How to verify
$ bwa-mem3 version
v0.2.0-12-gabcdef1
SIMD floor: avx2 (x86-64-v3, Haswell 2013+); kernels: sse41 sse42 avx avx2 avx512bw
SIMD runtime: avx512bw (BWAMEM3_FORCE_TIER unset)
mimalloc 3.x.x
- The SIMD floor: line tells you what host features the binary requires.
- The SIMD runtime: line tells you what kernel tier was selected at startup.
- On a host below the floor, bwa-mem3 version writes a [W::bwa-mem3] warning line to stderr (not stdout) and still exits 0, so the diagnostic command stays usable even on hosts that cannot run alignment. The floor + runtime lines remain on stdout, so bwa-mem3 version | grep '^SIMD' works in CI scripts even on too-old hosts.
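For CI scripts that need the tier names rather than the whole line, the two SIMD lines can be reduced to single words. This is a minimal sketch that assumes the output layout shown in the sample above (third whitespace-separated field is the tier). The helper names simd_floor and simd_runtime are hypothetical.

```shell
# Hypothetical helpers: read `bwa-mem3 version` text on stdin and print
# the tier name from the "SIMD floor:" or "SIMD runtime:" line.
simd_floor() {
    awk '$1 == "SIMD" && $2 == "floor:" { print $3 }'
}

simd_runtime() {
    awk '$1 == "SIMD" && $2 == "runtime:" { print $3 }'
}

# Typical use:
#   floor=$(bwa-mem3 version 2>/dev/null | simd_floor)
#   runtime=$(bwa-mem3 version 2>/dev/null | simd_runtime)
```

Discarding stderr keeps the warning line emitted on too-old hosts out of the parsed text.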
Failure mode on too-old hosts
If you run bwa-mem3 mem (or another alignment subcommand) on a host below the floor, the binary refuses with exit code 2 and a stderr message identifying the gap:
[E::bwamem3] this binary was compiled for SIMD floor avx2 and emits avx2
instructions in non-kernel translation units. The host CPU does not support
avx2 (detected: sse42). Running would SIGILL on the first avx2 instruction.
To run on this host, rebuild bwa-mem3 with BASELINE_ARCH=sse42 (or lower),
or use a binary built for a lower SIMD floor.
The version subcommand stays exit-0 so introspection still works on the same host.
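A job launcher may want to make the same floor-versus-host comparison before starting an alignment. The sketch below assumes the x86 tiers are ordered as they are listed on this page (sse41, sse42, avx, avx2, avx512bw). The function names tier_rank and host_meets_floor are hypothetical, and the check is only an illustration, not the logic the binary itself uses.

```shell
# Hypothetical helper: position of a tier in the assumed x86 ordering.
tier_rank() {
    case "$1" in
        sse41)    echo 1 ;;
        sse42)    echo 2 ;;
        avx)      echo 3 ;;
        avx2)     echo 4 ;;
        avx512bw) echo 5 ;;
        *)        echo 0 ;;   # unknown tier name
    esac
}

# Succeeds when the detected host tier ($2) is at or above the floor ($1).
host_meets_floor() {
    floor_rank=$(tier_rank "$1")
    host_rank=$(tier_rank "$2")
    [ "$floor_rank" -gt 0 ] && [ "$host_rank" -ge "$floor_rank" ]
}

# Example: the situation in the error message above (floor avx2, host sse42)
#   host_meets_floor avx2 sse42 || echo "rebuild with a lower BASELINE_ARCH"
```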
Mixed-architecture fleets
For AWS Batch and other heterogeneous compute environments where the same job may schedule onto x86_64 or arm64 hosts, see Best Practices → Multi-architecture deployment.
Quick start: align paired-end FASTQs
This page walks through the two-command workflow: index the reference once, then align reads.
Index the reference
bwa-mem3 index ref.fa
This produces five index files alongside ref.fa:
| File | Description |
|---|---|
| ref.fa.bwt.2bit.64 | FM-index in 2-bit packed format |
| ref.fa.0123 | 2-bit packed reference sequence |
| ref.fa.amb | Ambiguous base positions |
| ref.fa.ann | Sequence name and length annotations |
| ref.fa.pac | Packed 4-bit reference sequence |
Indexing hg38 takes roughly 2-3 minutes and requires approximately 60 GB of peak disk space
during creation (including temporary/intermediate files); the final FM-index stored on disk
is roughly 28 GB. The index is read once per mem invocation; for workloads that align many
samples, load it into shared memory first (see Quick start: shared-memory index).
Align paired-end reads
bwa-mem3 mem -t 16 ref.fa r1.fq.gz r2.fq.gz > out.sam
-t 16 sets the thread count to 16. bwa-mem3 scales well up to the number of physical CPU
cores; hyperthreading provides diminishing returns above that point. See
User Guide — Threading and resource use for recommendations at
different core counts.
The default output is uncompressed SAM on stdout. To write compressed BAM directly, use the
--bam flag:
bwa-mem3 mem --bam -t 16 ref.fa r1.fq.gz r2.fq.gz \
| samtools sort -@ 8 -o out.bam -
samtools index out.bam
Tip — Prefer BAM output in production
Piping BAM (--bam) to samtools sort avoids the text formatting and parsing overhead of SAM on both sides of the pipe. For large cohorts this yields a measurable wall-clock reduction. See Best Practices — Output format for the recommended pipeline and a discussion of when SAM is still useful.
Read group tagging
For downstream tools that require a @RG header (most variant callers), pass -R:
bwa-mem3 mem -t 16 \
-R '@RG\tID:sample1\tSM:sample1\tPL:ILLUMINA\tLB:lib1' \
ref.fa r1.fq.gz r2.fq.gz > out.sam
The value is a tab-delimited string following BWA conventions. Every aligned record receives an
RG:Z: tag matching the ID field of the read-group header.
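When read-group values come from variables, assembling the -R string by hand is error-prone. The sketch below builds the same four-field string as the example above, with the two-character sequence \t between fields. The helper name make_rg and its argument order are hypothetical.

```shell
# Hypothetical helper: build an @RG string for -R.
# Arguments: ID SM PL LB. Fields are joined with a literal backslash-t.
make_rg() {
    printf '@RG\\tID:%s\\tSM:%s\\tPL:%s\\tLB:%s\n' "$1" "$2" "$3" "$4"
}

# Typical use:
#   bwa-mem3 mem -t 16 -R "$(make_rg sample1 sample1 ILLUMINA lib1)" \
#       ref.fa r1.fq.gz r2.fq.gz > out.sam
```

Double-quoting the command substitution keeps the string as a single argument even if a value contains spaces.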
Output tags
bwa-mem3 emits standard SAM tags plus the HN:i: tag introduced by the fork:
| Tag | Type | Description |
|---|---|---|
| NM:i | int | Edit distance to the reference |
| MD:Z | string | Mismatch and deletion string |
| AS:i | int | Alignment score |
| XS:i | int | Suboptimal alignment score |
| SA:Z | string | Supplementary alignment chain |
| MC:Z | string | Mate CIGAR (paired-end) |
| MQ:i | int | Mate mapping quality (paired-end) |
| HN:i | int | Total number of primary alignments (reported and suppressed) found for the read, before the -h supplementary cap is applied |
For the methylation-specific tags (XR:Z, XG:Z, XM:Z), see
Methylation Reference — SAM tags.
See also: Quick start: methylation alignment · Quick start: shared-memory index · User Guide — Aligning short reads · CLI Reference — mem
Quick start: methylation alignment
bwa-mem3 supports bisulfite-converted (WGBS/RRBS/EM-seq) read alignment through a single --meth
flag on both index and mem. No Python interpreter, no piped preprocessor, and no separate
postprocessing step are required.
Note — Drop-in replacement for bwameth.py
bwa-mem3 with --meth is a single-binary drop-in replacement for the bwameth.py pipeline. The output BAM is byte-compatible for the standard tags used by methylation callers (Bismark, MethylDackel, PileOMeth, etc.).
Index the reference for methylation
Build the c2t doubled reference once:
bwa-mem3 index --meth ref.fa
This writes two additional files next to the standard index:
| File | Description |
|---|---|
| ref.fa.bwameth.c2t | C→T converted reference (forward strand) with G→A reverse complement interleaved |
| ref.fa.bwameth.c2t.* | FM-index files for the c2t reference |
The c2t index is separate from the standard index produced by bwa-mem3 index ref.fa. You need
both if you intend to run standard and methylation alignments against the same reference.
Align bisulfite-converted reads
bwa-mem3 mem --meth -t 16 ref.fa R1.fq.gz R2.fq.gz \
| samtools sort -o out.bam
samtools index out.bam
Pass the original (unconverted) reference path, not the .bwameth.c2t file. bwa-mem3
auto-appends .bwameth.c2t to the reference path when --meth is active.
What --meth does
--meth activates a pipeline of in-process transformations that would otherwise require
external tools:
- Inline c2t read conversion. R1 reads have every C converted to T before alignment; R2 reads have every G converted to A. The original unconverted sequence is restored into the BAM SEQ field on emit. The conversion direction is reported per record in the Bismark XR:Z tag (value CT for R1/SE, GA for R2).
- bwameth.py-equivalent scoring defaults. --meth sets -B 2 -L 10 -U 100 -T 40 -CM automatically. These match the defaults used by bwameth.py and are optimized for bisulfite-converted reads where C→T mismatches carry no penalty. Any of these values can be overridden on the command line.
- Inline BAM post-processing. After alignment, bwa-mem3 rewrites the SAM stream in-process:
  - @SQ headers with f/r prefixes (e.g. fchr1, rchr1) are collapsed back to one entry per real chromosome (chr1). Read-level RNAME fields are rewritten to match.
  - Each mapped record gains Bismark XG:Z (genome strand: CT for top-strand alignment, GA for bottom-strand) and XM:Z (per-base methylation call string).
  - Chimera QC: reads whose longest M/=/X run is less than 44% of the read length are flagged 0x200 (QC-fail), have flag 0x2 (proper pair) cleared, and have MAPQ capped at 1.
  - Pair-level QC-fail propagation: if one mate is QC-failed, the other mate is also flagged.
  - A @PG ID:bwa-mem3-meth program record is appended to the header.
- Uncompressed BAM output. The post-processed stream is written as uncompressed BAM (wb0) rather than SAM text. This eliminates text serialization overhead and allows downstream samtools sort to read BAM natively. The stream is still fully readable by any htslib-based tool.
For full details on each tag, the optional chimera QC heuristic, and the
--set-as-failed / --chimera-qc flags, see the
Methylation Reference.
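To make the inline read conversion concrete, the sketch below applies the same two substitutions to a FASTQ stream outside bwa-mem3. It is an illustration only, not the in-process implementation. It assumes four-line FASTQ records with unwrapped, upper-case sequence lines, and the function names fastq_c2t and fastq_g2a are hypothetical.

```shell
# Hypothetical helpers: convert only the sequence line (line 2 of each
# four-line FASTQ record), leaving names and qualities untouched.

# R1 / single-end direction: every C becomes T.
fastq_c2t() {
    awk 'NR % 4 == 2 { gsub(/C/, "T") } { print }'
}

# R2 direction: every G becomes A.
fastq_g2a() {
    awk 'NR % 4 == 2 { gsub(/G/, "A") } { print }'
}

# Typical use:
#   gzip -dc R1.fq.gz | fastq_c2t | head -n 4
```

With --meth none of this is needed on the command line; the conversion happens inside the aligner and the original bases are restored in the output.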
See also: Methylation Reference — Overview · Methylation Reference — SAM tags · Best Practices — Methylation defaults · CLI Reference — mem
Quick start: shared-memory index
The bwa-mem3 FM-index for a genome like hg38 is approximately 28 GB. By default, every
bwa-mem3 mem invocation reads the index from disk, which can take 30–60 seconds on a spinning
disk and several seconds even on fast NVMe storage. For workloads that align many small samples
in sequence on the same machine, this per-invocation overhead accumulates.
bwa-mem3 shm stages the index once into POSIX shared memory. Subsequent mem invocations
attach to the in-memory segment instead of reading from disk, reducing per-sample startup time
to near zero.
Stage the index
bwa-mem3 shm ref.fa
This reads the index files from disk and copies them into a POSIX shared-memory segment. The command returns when staging is complete. The index stays in memory until it is explicitly dropped or the system is rebooted.
To stage a methylation (--meth) index:
bwa-mem3 shm --meth ref.fa
A standard and a methylation index for the same reference can be staged simultaneously; they occupy separate named segments.
Align using the staged index
No extra flag is needed. When bwa-mem3 mem starts, it checks whether a matching shared-memory
segment exists. If one does, it attaches automatically:
bwa-mem3 mem -t 16 ref.fa r1.fq.gz r2.fq.gz > out.sam
Inspect and drop staged segments
List all currently staged indices:
bwa-mem3 shm -l
Drop all staged segments:
bwa-mem3 shm -d
When to use shared-memory indexing
Shared-memory indexing is most beneficial when:
- Aligning tens to hundreds of small samples (e.g. amplicon panels, targeted sequencing) where per-sample read time dominates the per-sample alignment time.
- Running a batch pipeline on a single large machine where the index fits comfortably in RAM (approximately 28 GB for hg38 with the standard index).
- The same reference is used for all samples in the batch; a new shm invocation is required for each distinct reference.
It provides little benefit when:
- Aligning a small number of large samples (WGS), where alignment time far exceeds index load time.
- The available RAM is insufficient to hold the index alongside the operating system and alignment worker processes.
Warning — No staleness check — always drop before re-indexing
bwa-mem3 shm does not detect whether the on-disk index files have changed after staging. If you run bwa-mem3 index ref.fa again (e.g. to rebuild after a reference update), the shared-memory segment is not invalidated. Subsequent mem invocations will attach to the stale segment and produce silently incorrect alignments.
Always drop the segment before re-indexing:
bwa-mem3 shm -d
bwa-mem3 index ref.fa
bwa-mem3 shm ref.fa
See also: CLI Reference — shm · Best Practices — Multi-sample workflows · Best Practices — Anti-patterns · User Guide — Threading and resource use
Indexing the reference
Before aligning reads, bwa-mem3 builds an FM-index from the reference FASTA.
The index is read back from disk at the start of every mem run, so it is
built once and reused indefinitely.
Basic indexing
bwa-mem3 index ref.fa
The command writes five files alongside the input FASTA:
| File | Contents |
|---|---|
| ref.fa.bwt.2bit.64 | Burrows-Wheeler Transform, 2-bit packed, 64-bit offsets |
| ref.fa.0123 | Forward sequence, 2-bit packed |
| ref.fa.amb | Coordinates and counts of ambiguous (N) bases |
| ref.fa.ann | Sequence names and lengths |
| ref.fa.pac | Forward sequence, 4-bit packed |
The .bwt.2bit.64 file dominates disk usage. For the human reference (hg38),
expect roughly 28 GB total across all five files.
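Before launching a batch it can be worth confirming that all five files from the table above are present. The following is a minimal sketch; the function name index_complete is hypothetical, and it checks only for existence, not for a valid or up-to-date index.

```shell
# Hypothetical helper: succeed only if all five index files exist
# next to the given FASTA prefix. Names the first missing file on stderr.
index_complete() {
    prefix=$1
    for suffix in bwt.2bit.64 0123 amb ann pac; do
        if [ ! -f "$prefix.$suffix" ]; then
            echo "missing: $prefix.$suffix" >&2
            return 1
        fi
    done
    return 0
}

# Typical use:
#   index_complete ref.fa || bwa-mem3 index ref.fa
```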
Methylation index (--meth)
bwa-mem3 index --meth ref.fa
Methylation mode builds a C-to-T doubled reference in addition to the standard
FM-index files. The command writes a ref.fa.bwameth.c2t file (the doubled
FASTA) and its own set of five index files with the .bwameth.c2t suffix:
ref.fa.bwameth.c2t
ref.fa.bwameth.c2t.bwt.2bit.64
ref.fa.bwameth.c2t.0123
ref.fa.bwameth.c2t.amb
ref.fa.bwameth.c2t.ann
ref.fa.bwameth.c2t.pac
The doubled reference is roughly twice the size of the standard one. For hg38, allow approximately 56 GB of disk space.
Tip — Pass the original FASTA to mem, not the c2t file
When running bwa-mem3 mem --meth, pass the original FASTA path (ref.fa), not ref.fa.bwameth.c2t. bwa-mem3 appends .bwameth.c2t automatically. The auto-append is skipped only when the path already ends in .bwameth.c2t, which is useful for external-c2t interop pipelines.
Output file locations
Index files are written to the same directory as the input FASTA by default. The input path is taken verbatim as a prefix — you can pass an absolute path to write into a different directory:
bwa-mem3 index /data/indexes/hg38/hg38.fa
# writes hg38.fa.bwt.2bit.64, etc. into /data/indexes/hg38/
Time and memory
Indexing hg38 takes roughly 60–90 minutes on a single core and requires about 80 GB of RAM during construction. The process is single-threaded; additional cores do not reduce wall time.
bwa-mem3 uses libsais to construct the suffix array, which is faster than the original bwa-mem2 approach. See Performance improvements for benchmark numbers.
Warning — Do not index over a live shared-memory segment
If you have previously staged the index into shared memory with bwa-mem3 shm, drop the segment first before re-indexing:
bwa-mem3 shm -d
bwa-mem3 index ref.fa
There is no staleness check. If bwa-mem3 mem finds a matching segment in shared memory it will attach to it even when the on-disk index has been updated. See Quick start: shared-memory index.
Arch flags and the index format
The FM-index format is architecture-independent. A single index works across every SIMD tier and every supported platform: the x86 binary’s AVX2 / AVX-512BW dispatch paths and the arm64 NEON binary all read the same on-disk layout.
See also: Quick start: align paired-end FASTQs · Quick start: methylation alignment · Quick start: shared-memory index · Performance improvements · CLI Reference: index
Aligning short reads (mem)
bwa-mem3 mem aligns one or two FASTQ files against an indexed reference and
writes SAM (default) or BAM (--bam) to stdout. It is a drop-in replacement
for bwa-mem2 mem and supports all standard bwa-mem flags.
Basic usage
Paired-end:
bwa-mem3 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > out.sam
Single-end:
bwa-mem3 mem -t 16 ref.fa reads.fq.gz > out.sam
Pipe directly to samtools:
bwa-mem3 mem --bam=0 -t 16 ref.fa R1.fq.gz R2.fq.gz \
| samtools sort -@ 8 -o out.bam -
samtools index out.bam
Using --bam=0 (uncompressed BAM) avoids SAM text formatting on the write side
and SAM parsing on the samtools side, and skips the wasted compression that
samtools sort would immediately decompress; the BAM bytes flow between
processes in the pipe.
Key flags
Threading: -t
-t INT number of threads [1]
Performance scales well through 8–16 threads on most machines. Beyond 32 threads, returns diminish on typical workloads because inter-thread locking and IO become the bottleneck. See Threading and resource use for detailed guidance.
Read-group header: -R
-R STR read group header line, e.g. '@RG\tID:sample1\tSM:sample1\tLB:lib1\tPL:ILLUMINA'
Every production alignment should include a @RG header. The ID in the -R
string is embedded as an RG:Z: tag on every output record.
Tip — Escape the tab correctly
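Pass -R with \t between fields, and quote the whole string so the shell does not strip the backslash. Single quotes pass the two-character sequence \t through unchanged; $'...' quoting (bash, zsh) expands it to a real tab character before bwa-mem3 sees it:
bwa-mem3 mem -R $'@RG\tID:s1\tSM:sample1' -t 16 ref.fa R1.fq.gz R2.fq.gz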
Chunk size: -K
-K INT process INT input bases in each batch [10000000]
Larger -K values increase memory use but can improve throughput on very deep
or very wide batches. The default is appropriate for most workloads.
SAM output control: -S, -P
-S skip mate rescue
-P skip pairing; mate rescue performed unless -S also in use
These flags are primarily useful for debugging or non-standard workflows. Normal paired-end alignments should leave both at their defaults.
Output modes
SAM (default)
bwa-mem3 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > out.sam
Plain-text SAM. Suitable for inspection, compatibility testing, and piping to tools that consume SAM.
BAM (--bam=0)
bwa-mem3 mem --bam=0 -t 16 ref.fa R1.fq.gz R2.fq.gz > out.bam
Writes BAM directly. --bam=0 is uncompressed BAM, which avoids
double-compression when piping into a downstream sorter and is roughly
10–15% faster end-to-end. Pass --bam=6 to write a fully compressed BAM if
the output is the final product.
Note — --bam=0 is the recommended output mode
For production pipelines, always use --bam=0 and pipe to samtools sort. See Best Practices: output format for the canonical pipeline.
Methylation alignment (--meth)
Pass --meth for bisulfite/RRBS samples. This activates inline C-to-T read
conversion, bwameth.py-compatible flag defaults, and inline BAM post-processing.
See Quick start: methylation alignment for
the two-command workflow and the Methylation Reference
for full detail.
Shared-memory index auto-attach
When bwa-mem3 shm has staged the index into shared memory, bwa-mem3 mem
attaches automatically — no extra flag is required. The shared-memory path
is transparent to users.
Cross-references
The full flag list is in the CLI Reference: mem page.
See also: Output: SAM/BAM, headers, tags · Threading and resource use · Best Practices: output format · CLI Reference: mem · Methylation Reference: overview
Output: SAM/BAM, headers, tags
bwa-mem3 writes output in either SAM (default) or BAM (--bam) format.
This page covers the header structure and every non-standard SAM tag emitted
by bwa-mem3.
Output format
By default, bwa-mem3 mem writes SAM to stdout. Pass --bam (or --bam=N
for a specific compression level) to write BAM. Level 0 (uncompressed) is the
default when --bam is given without an argument, which is optimal when piping
to a downstream samtools sort.
# SAM (default)
bwa-mem3 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > out.sam
# Uncompressed BAM — best for piping
bwa-mem3 mem --bam -t 16 ref.fa R1.fq.gz R2.fq.gz | samtools sort -@ 8 -o out.bam -
# Compressed BAM — useful when the output is the final file
bwa-mem3 mem --bam=6 -t 16 ref.fa R1.fq.gz R2.fq.gz > out.bam
SAM header
@HD
A default @HD VN:1.6 SO:unsorted line is emitted unless the user supplies
one via -H. The sort order is unsorted because bwa-mem3 writes records in
input read order; downstream sorting is always a separate step.
@SQ
One @SQ line is written per reference sequence, with the sequence name
(SN:) and length (LN:) derived from the FM-index. If the index was built
with a .dict or .hdr file that supplies @SQ records, those records are
used instead of the auto-generated ones.
In methylation mode (--meth), the doubled reference contains sequences with
an f or r prefix in their names. The inline BAM post-processor collapses
these back to canonical chromosome names so that the output @SQ lines match
a standard non-methylation alignment. See
Chimera QC and header rewriting.
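The header collapse can be pictured as a filter over the @SQ lines. The sketch below is an illustration of the idea, not the in-process post-processor. It assumes every sequence name in the doubled header carries a one-letter f or r prefix, as in the fchr1 / rchr1 example, and the function name collapse_sq is hypothetical.

```shell
# Hypothetical helper: strip the f/r prefix from @SQ names and keep the
# first @SQ line per real chromosome. Other header lines pass through.
collapse_sq() {
    awk -F'\t' -v OFS='\t' '
        $1 == "@SQ" {
            for (i = 2; i <= NF; i++) {
                # drop the one-letter strand prefix after "SN:"
                if ($i ~ /^SN:[fr]/) $i = "SN:" substr($i, 5)
                if ($i ~ /^SN:/) name = $i
            }
            # skip the second copy of an already-seen chromosome
            if (seen[name]++) next
        }
        { print }
    '
}
```

Applying it to a header that is already collapsed would mangle real names beginning with f or r, so the filter only makes sense on the doubled-reference header.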
@PG
One @PG entry is written in standard mode:
| ID | Description |
|---|---|
| bwa-mem3 | The alignment step. VN: is the bwa-mem3 version string; CL: is the full command line. |
In methylation mode (--meth), a second @PG entry is appended:
| ID | Description |
|---|---|
| bwa-mem3-meth | The inline post-processor. VN: carries the version with -meth suffix; CL: is the full command line. |
The bwa-mem3-meth entry follows immediately after the bwa-mem3 entry and
records the post-processing step as a distinct pipeline node, matching the
convention of separate-tool pipelines.
Tags emitted by bwa-mem3
Standard tags
bwa-mem3 emits the same standard tags as bwa-mem2 (NM:i, MD:Z, AS:i,
XS:i, SA:Z, RG:Z, XA:Z, MC:Z, etc.). These are documented in the
SAM specification and are not described further here.
bwa-mem3 additionally emits MQ:i on paired-end records — the mate’s
mapping quality, set alongside MC:Z (the mate’s CIGAR) so callers that
key off the mate’s MAPQ don’t need to look at the mate record. Both SAM
and --bam output paths emit it. Backported from lh3/bwa PR #330 in
fg-labs PR #35.
The XA:Z field set widens from chr,pos,CIGAR,NM to
chr,pos,CIGAR,NM,score,mapq when -u (a.k.a. the upstream “XB” toggle)
is passed; the tag name itself remains XA:Z for downstream
compatibility. Tools that parse XA:Z need to be aware of the two
possible field widths.
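A parser can tell the two layouts apart by counting the comma-separated fields in the first alternative hit. The sketch below assumes SAM text on stdin and the usual XA:Z layout of semicolon-terminated entries. The function name xa_field_width is hypothetical.

```shell
# Hypothetical helper: for each SAM record carrying XA:Z, print the read
# name and the number of fields in its first alternative hit (4 or 6).
xa_field_width() {
    awk -F'\t' '
        !/^@/ {
            for (i = 12; i <= NF; i++) {
                if ($i ~ /^XA:Z:/) {
                    value = substr($i, 6)      # strip the "XA:Z:" prefix
                    split(value, hits, ";")    # hits[1] is the first entry
                    print $1 "\t" split(hits[1], fields, ",")
                }
            }
        }
    '
}

# Typical use:
#   samtools view out.bam | xa_field_width | cut -f2 | sort | uniq -c
```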
HN:i — total alignment hit count
HN:i:<count>
The total number of primary alignments (both reported and suppressed) that
the aligner found for this read, before the -h supplementary cap is applied.
Useful for distinguishing “uniquely mapped” from “multi-mapped” reads without
relying solely on MAPQ.
HN:i is emitted on the primary alignment record only.
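To inspect hit counts without writing a custom parser, the tag can be pulled out of SAM text with awk. This is a minimal sketch; the function name hn_counts is hypothetical and the optional fields are assumed to start at column 12, as in any SAM record.

```shell
# Hypothetical helper: print "read_name<TAB>HN" for every SAM record
# on stdin that carries an HN:i tag. Header lines are skipped.
hn_counts() {
    awk -F'\t' '
        !/^@/ {
            for (i = 12; i <= NF; i++)
                if ($i ~ /^HN:i:/) print $1 "\t" substr($i, 6)
        }
    '
}

# Typical use: list reads with more than one hit
#   samtools view out.bam | hn_counts | awk '$2 > 1'
```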
Methylation-only tags
The following Bismark-compatible tags are emitted only when --meth is
active. See SAM tags: XR, XG, XM for the full
per-tag reference, including the XM:Z character alphabet and the
XG:Z strand-pick semantics.
| Tag | Type | Description |
|---|---|---|
| XR:Z | string | Read conversion direction: CT (R1 / SE) or GA (R2) |
| XG:Z | string | Genome strand of the alignment: CT (OT) or GA (OB) |
| XM:Z | string | Per-base methylation call string (length = SEQ) |
The bwameth-style YS:Z / YC:Z tags exist only as an internal carrier
on bseq1_t.comment for SEQ restoration and XR:Z derivation; they
are suppressed at BAM emit and never appear in output. The bwameth
YD:Z strand tag has been replaced by Bismark XG:Z and is not
emitted.
MAPQ semantics
MAPQ semantics are inherited from bwa-mem2 and follow the same scoring model.
In methylation mode, alignments identified as chimeras (longest M/=/X
run covering less than 44% of the read length) have their MAPQ capped at 1 and
the 0x200 (QC fail) flag set. See
Chimera QC and header rewriting.
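The 44% rule can be reproduced from a CIGAR string alone. The sketch below rests on two assumptions that this page does not spell out: adjacent M, = and X operations count as one run, and the read length is the sum of the query-consuming operations (M, =, X, I, S). The function name chimera_call is hypothetical, and the result is an illustration rather than a guaranteed match for the built-in check.

```shell
# Hypothetical helper: print "chimera" or "ok" for one CIGAR string.
chimera_call() {
    printf '%s\n' "$1" | awk '
        {
            cigar = $0; run = 0; longest = 0; readlen = 0
            while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
                len = substr(cigar, 1, RLENGTH - 1) + 0
                op  = substr(cigar, RLENGTH, 1)
                cigar = substr(cigar, RLENGTH + 1)
                if (op ~ /[M=X]/) {
                    run += len                 # extend the current aligned run
                    readlen += len
                    if (run > longest) longest = run
                } else {
                    run = 0                    # any other operation ends the run
                    if (op ~ /[IS]/) readlen += len
                }
            }
            # longest / readlen < 0.44, written with integer arithmetic
            print ((readlen > 0 && longest * 100 < readlen * 44) ? "chimera" : "ok")
        }
    '
}

# Example: a 100 bp read with only 40 aligned bases
#   chimera_call 60S40M
```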
See also: Aligning short reads (mem) · Methylation Reference: SAM tags · Methylation Reference: post-processing · CLI Reference: mem · Best Practices: output format
Threading and resource use
The -t flag
-t INT number of threads [1]
bwa-mem3 parallelizes alignment by dividing the input into fixed-size batches
(controlled by -K) and processing batches concurrently. Threads share the
in-memory FM-index; there is no per-thread copy.
How threads interact with performance
Where threads help
- Seed finding (SMEM enumeration) is fully parallel across reads in a batch.
- Extension (banded Smith-Waterman) is fully parallel.
- Pair rescue is parallel.
- BAM encoding (when --bam is active) is parallel.
Where threads stop helping
Thread count and wall-clock alignment time scale well to approximately 16–32 threads on a modern CPU. Beyond that, several effects conspire to flatten the curve:
- FM-index bandwidth. The index for hg38 is ~28 GB and does not fit in the L3 cache of any current server. At high thread counts, threads contend for memory bandwidth accessing the BWT.
- IO contention. On spinning disk or a shared network filesystem, concurrent reads of the same large index file saturate IO bandwidth before the CPU is saturated.
- Output serialization. SAM output is serialized per-record to stdout. BAM output with --bam reduces this bottleneck but does not eliminate it entirely.
Recommended thread counts
| Machine | Recommended -t | Notes |
|---|---|---|
| 16-core workstation | 12–14 | Leave 2 cores for samtools sort |
| 32-core server | 24–28 | Leave cores for downstream and OS overhead |
| 64-core server | 40–48 | Marginal returns above 48; test with your workload |
| Multiple parallel runs | divide evenly | See below |
These are starting points. Profile with your specific data and storage configuration to find the practical optimum.
Running multiple parallel alignments
When running multiple bwa-mem3 mem processes on the same machine, divide
threads so that the total does not exceed the physical core count. For example,
on a 32-core machine running four concurrent samples:
# Four parallel runs, 8 threads each
for sample in a b c d; do
bwa-mem3 mem --bam -t 8 ref.fa ${sample}_R1.fq.gz ${sample}_R2.fq.gz \
| samtools sort -@ 2 -o ${sample}.bam - &
done
wait
Using shared memory (bwa-mem3 shm) amortizes the index read-in cost across
all four runs. See Quick start: shared-memory index
and Best Practices: multi-sample workflows.
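The thread division described above is plain integer division. A minimal sketch, assuming a hypothetical helper named threads_per_job:

```shell
# Hypothetical helper: alignment threads per job so that jobs x threads <= cores.
# Arguments: physical core count, number of concurrent jobs.
threads_per_job() (
    cores=$1
    njobs=$2
    t=$(( cores / njobs ))
    # With more jobs than cores, fall back to one thread per job.
    [ "$t" -lt 1 ] && t=1
    echo "$t"
)

# Usage: threads_per_job 32 4   prints 8, the -t value used in the loop above
```

The helper only budgets alignment threads. Any samtools sort -@ threads in the pipeline come on top of that figure.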
Memory use
Peak RAM during alignment is dominated by the in-memory FM-index. For hg38,
expect roughly 28 GB of resident memory per bwa-mem3 mem process. Additional
memory is used per batch (-K reads × read length × a small constant).
With bwa-mem3 shm, the index is mapped from a shared-memory segment, so
multiple concurrent mem processes map the same physical pages instead of
each holding a private copy. Total RAM use is approximately one index, not
one per process.
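As a back-of-the-envelope check, the index share of RAM can be estimated as below. This is a sketch under the assumptions stated in this section (about 28 GB per hg38 index); it ignores the per-batch memory, and the helper name index_ram_gb is hypothetical.

```shell
# Rough index RAM estimate in GB for concurrent mem processes.
# Arguments: process count, index size in GB, 1 if staged with shm else 0.
index_ram_gb() (
    procs=$1
    index_gb=$2
    shm=$3
    if [ "$shm" -eq 1 ]; then
        echo "$index_gb"                # one shared copy for all processes
    else
        echo $(( procs * index_gb ))    # one private copy per process
    fi
)

# Usage: index_ram_gb 4 28 0   versus   index_ram_gb 4 28 1
```

For four concurrent hg38 runs this gives 112 GB without shm and 28 GB with it, before batch buffers are counted.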
Tip — Use shm for repeated runs on the same machine
If you run more than a few samples on the same machine without rebooting,
bwa-mem3 shm pays off immediately. The index is read from disk once and stays in RAM for all subsequent mem invocations.
IO recommendations
- Use local NVMe storage for the index files when possible. The ~28 GB BWT read is the dominant IO event at the start of each mem run.
- Write BAM (--bam) to a fast local disk or pipe directly to samtools sort. Avoid writing uncompressed SAM to a network filesystem.
- Separate read and write paths if your storage topology allows it: read the index from one volume and write sorted BAM to another.
See also: Aligning short reads (mem) · Memory allocator (mimalloc) · Quick start: shared-memory index · Best Practices: multi-sample workflows · Performance: tuning checklist
Memory allocator (mimalloc)
bwa-mem3 vendors and links mimalloc, Microsoft’s high-performance memory allocator, into every binary by default. On multi-threaded alignment workloads, mimalloc reduces wall-clock time by replacing the system allocator with one optimized for many small, short-lived allocations — exactly the access pattern produced by the inner alignment loops.
What mimalloc replaces
The system allocator (glibc malloc on Linux, libSystem malloc on macOS) is
a general-purpose allocator. Under heavy multi-threaded allocation pressure
(16+ threads each issuing thousands of short-lived allocations per batch),
contention on its internal locks becomes a measurable bottleneck. mimalloc uses
per-thread free lists and a segment-based heap to eliminate most of this
contention.
Platform-specific linkage
The linkage strategy differs by OS:
| Platform | Mechanism |
|---|---|
| Linux | Static linkage with --whole-archive. The entire mimalloc static library is embedded into the bwa-mem3 binary; its malloc/free symbols take precedence over glibc’s at link time. |
| macOS | Dynamic linkage via dyld interposing. libmimalloc.dylib is built alongside the binary; dyld’s DYLD_INSERT_LIBRARIES interposing mechanism replaces malloc/free at load time. The dylib ships next to the binary. |
Warning — macOS: keep libmimalloc.dylib next to the binary
On macOS, libmimalloc.dylib must remain in the same directory as the bwa-mem3 binary (or be reachable via the embedded rpath). If you move bwa-mem3 without also moving libmimalloc.dylib, the binary silently falls back to the system allocator. The only indicator is that bwa-mem3 version no longer prints a mimalloc line.
Verifying that mimalloc is active
Run:
./bwa-mem3 version
When mimalloc is linked and loaded, the output includes a line like:
mimalloc 3.x.x
If that line is absent, mimalloc is not active.
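For scripted deployments, the check can be wrapped in a small filter. This is a sketch: the helper name mimalloc_active is hypothetical, and it assumes only that the version text contains the word mimalloc when the allocator is loaded, as described above.

```shell
# Succeeds if the text on stdin mentions mimalloc, fails otherwise.
mimalloc_active() {
    grep -q 'mimalloc'
}

# Usage (2>&1 is a precaution in case the line is printed on stderr):
#   ./bwa-mem3 version 2>&1 | mimalloc_active || echo "mimalloc is NOT active" >&2
```

Running this once after every install or relocation catches the silent fallback on macOS.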
Opting out
Pass USE_MIMALLOC=0 at build time to produce a binary linked against the
system allocator:
make USE_MIMALLOC=0
Reasons to opt out:
- AddressSanitizer (ASAN) builds. The Makefile automatically sets USE_MIMALLOC=0 when ASAN_FLAGS is detected, because ASAN and mimalloc’s malloc interposing cannot coexist cleanly.
- Container environments where distributing a dylib alongside the binary is inconvenient.
- Reproducibility testing to isolate whether a behavioral difference is allocator-related.
Note — Default is on
USE_MIMALLOC=1 is the default. Opt-out is not recommended for production workloads; mimalloc measurably reduces wall time on multi-threaded runs.
Build internals
The mimalloc source lives in ext/mimalloc/ as a git submodule. The Makefile
target builds it via CMake before linking bwa-mem3. The relevant Makefile
variables are MIMALLOC_SRC, MIMALLOC_BUILD, and MIMALLOC_LIB.
The feature was introduced in bwa-mem3 as part of the performance improvement work. See Features and Build & infrastructure for the PR history.
See also: Threading and resource use · Features: mimalloc · Getting Started: installation · Developer Guide: building from source · Performance: tuning checklist
Tips and best practices
This page collects the most commonly useful operational tips for running bwa-mem3. Each tip is a short actionable point; the linked pages provide the full rationale.
Index once, align many times
Build the FM-index once per reference version. The on-disk index format is stable across bwa-mem3 releases and across every SIMD tier inside the single binary — the AVX2 and AVX-512BW kernel paths read the same files. You do not need to re-index when upgrading bwa-mem3 unless the release notes say otherwise.
# Build once
bwa-mem3 index ref.fa
# Align many samples
for sample in a b c d; do
bwa-mem3 mem --bam -t 16 ref.fa ${sample}_R1.fq.gz ${sample}_R2.fq.gz \
| samtools sort -@ 4 -o ${sample}.bam -
done
Pipe to samtools sort -@
Never write an intermediate unsorted BAM to disk and then sort it in a second
step. bwa-mem3’s --bam mode + samtools sort in a single pipeline avoids the
extra write/read cycle and is significantly faster:
bwa-mem3 mem --bam -t 16 ref.fa R1.fq.gz R2.fq.gz \
| samtools sort -@ 8 -o out.bam -
samtools index out.bam
Allocate roughly 2/3 of available threads to bwa-mem3 mem and 1/3 to
samtools sort. On a 24-core machine, -t 16 for bwa-mem3 and -@ 8 for
samtools is a good starting point.
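The two-thirds / one-third split can be computed rather than eyeballed. A minimal sketch, assuming a hypothetical helper named split_threads:

```shell
# Hypothetical helper: split a core count roughly 2/3 : 1/3 between
# bwa-mem3 mem (-t) and samtools sort (-@). Prints "ALIGN SORT".
split_threads() (
    cores=$1
    align=$(( cores * 2 / 3 ))
    [ "$align" -lt 1 ] && align=1
    sort=$(( cores - align ))
    [ "$sort" -lt 1 ] && sort=1
    echo "$align $sort"
)

# Usage:
#   set -- $(split_threads 24)      # $1 = 16, $2 = 8
#   bwa-mem3 mem --bam -t "$1" ref.fa R1.fq.gz R2.fq.gz \
#     | samtools sort -@ "$2" -o out.bam -
```

Integer division rounds the aligner share down, so on core counts that are not multiples of three the sorter gets the extra thread.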
Stage the index in shared memory for batch workloads
When aligning more than a few samples on the same machine, reading the ~28 GB
hg38 index from disk on every mem invocation is the dominant wall-clock cost.
Stage it once:
bwa-mem3 shm ref.fa
Subsequent bwa-mem3 mem invocations attach automatically. The shared-memory
segment persists until explicitly dropped (bwa-mem3 shm -d) or the machine
reboots.
Warning — Always drop the segment before re-indexing
There is no staleness check. If you rebuild the index without first dropping the shared-memory segment,
bwa-mem3 mem will attach to the stale segment and produce incorrect alignments without any warning. Always run bwa-mem3 shm -d before bwa-mem3 index.
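One way to make the drop-before-reindex order hard to forget is to wrap it in a function. This is a sketch, not a shipped tool: the function name and the BWA_MEM3 override variable are hypothetical, and ignoring a failed shm -d (for example when nothing is staged) is an assumption about how you want that case handled.

```shell
# Hypothetical wrapper: drop the staged segment, rebuild the index, stage it again.
# BWA_MEM3 is an assumed override for the binary path; it also allows dry runs.
reindex_with_shm() (
    bwa=${BWA_MEM3:-bwa-mem3}
    ref=$1
    # Drop any staged segment first; carry on if there was nothing to drop.
    "$bwa" shm -d || true
    # Re-stage only if the index build succeeded.
    "$bwa" index "$ref" && "$bwa" shm "$ref"
)

# Dry run that only prints the three steps:
#   BWA_MEM3=echo reindex_with_shm ref.fa
```

Only the subcommands shown on this page (shm -d, index, shm) are used.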
Pin threads when running concurrent jobs
When running multiple bwa-mem3 mem processes in parallel, divide threads
explicitly so that the total does not exceed the physical core count. Avoid
relying on the scheduler to balance over-subscribed threads — each process
will spin waiting for CPU time, and total throughput drops.
# Good: 4 jobs × 6 threads = 24 cores, on a 24-core machine
for sample in a b c d; do
bwa-mem3 mem --bam -t 6 ref.fa ${sample}_R1.fq.gz ${sample}_R2.fq.gz \
| samtools sort -@ 2 -o ${sample}.bam - &
done
wait
See Threading and resource use for per-machine thread count recommendations.
Confirm the binary’s SIMD tier matches your CPU
bwa-mem3 ships one binary per platform that contains every supported x86 SIMD tier (or the single NEON path on arm64) and picks the right tier in process at startup. There are no per-tier companion binaries to copy or call directly.
| CPU generation | Resolved tier |
|---|---|
| Modern Intel/AMD (2018+) | avx512bw or avx2 |
| Older x86 | sse42 or sse41 |
| Apple Silicon / AWS Graviton | neon |
Verify the resolved tier with bwa-mem3 version (prints SIMD floor: and
SIMD runtime: lines on stdout) or set BWAMEM3_DEBUG_SIMD=1 to get a
startup banner from bwa-mem3 mem. If you need to force a lower tier
for A/B regression testing, set BWAMEM3_FORCE_TIER=<tier> — upgrade
requests above the host’s capability are rejected.
See Performance: SIMD dispatch matrix.
Include a read-group header
Always pass -R with at minimum ID: and SM: fields. Many downstream tools
(GATK, fgbio, Picard) require a @RG header and will fail or warn without one.
bwa-mem3 mem \
-R $'@RG\tID:run1\tSM:sample1\tLB:lib1\tPL:ILLUMINA' \
-t 16 ref.fa R1.fq.gz R2.fq.gz
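When sample names come from a manifest, it can help to assemble the header programmatically so the tab separators are always real tabs. A minimal sketch; make_rg is a hypothetical helper, not a bwa-mem3 feature.

```shell
# Build an @RG line with real tab separators for bwa-mem3 mem -R.
# Arguments: ID and SM (required), then optional LB and PL.
make_rg() (
    : "${1:?ID is required}" "${2:?SM is required}"
    rg=$(printf '@RG\tID:%s\tSM:%s' "$1" "$2")
    [ -n "${3:-}" ] && rg=$(printf '%s\tLB:%s' "$rg" "$3")
    [ -n "${4:-}" ] && rg=$(printf '%s\tPL:%s' "$rg" "$4")
    printf '%s\n' "$rg"
)

# Usage:
#   bwa-mem3 mem -R "$(make_rg run1 sample1 lib1 ILLUMINA)" \
#     -t 16 ref.fa R1.fq.gz R2.fq.gz
```

The quoted command substitution passes the tabs through unchanged, which matches the form used in the example above.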
Further reading
The Best Practices section covers these topics in depth:
- Best Practices: build — PGO builds, arch selection
- Best Practices: output format — the canonical pipeline
- Best Practices: multi-sample workflows — shared-memory batch jobs
- Best Practices: anti-patterns — common mistakes and how to avoid them
See also: Aligning short reads (mem) · Threading and resource use · Memory allocator (mimalloc) · Performance: tuning checklist · Best Practices: anti-patterns
Performance Overview
Performance claims in this section are benchmarked, not asserted. The canonical source of truth for benchmark methodology, hardware configurations, and current numbers is bwa-mem3-bench, a reproducible benchmarking harness that runs across AWS Batch architectures (x86 AVX2, AVX-512, ARM Graviton). Consult that repository before drawing conclusions from isolated anecdotal timings.
What drives bwa-mem3’s performance
bwa-mem3 inherits the SIMD-vectorized alignment kernels of bwa-mem2 and adds several improvements of its own. The headline gains relative to a stock bwa-mem2 build fall into four categories.
Vectorized alignment kernels. The Smith-Waterman and banded-SWA kernels (kswv, bandedSWA) are compiled against the widest SIMD ISA the current CPU supports — SSE4.1 through AVX-512BW on x86, or native NEON on ARM. On Apple Silicon, native NEON intrinsics replaced the sse2neon shim in the two hottest kernels, delivering roughly 10% additional throughput over the pure-translation baseline. See SIMD dispatch matrix for the full picture.
libsais FM-index construction. The indexing step uses the linear-time suffix-array/BWT construction library libsais in place of the original quadratic-time approach. This cuts bwa-mem3 index wall time substantially on large references. See What’s Different — Performance improvements for the corresponding PR details.
mimalloc allocator. bwa-mem3 vendors and links mimalloc, replacing the system malloc/free for all allocations. On Linux the library is linked statically via --whole-archive; on macOS it is loaded as a dylib via dyld interposition. The allocator shows consistent throughput gains on multi-threaded workloads because mimalloc avoids the lock contention in glibc’s ptmalloc at high thread counts. See User Guide — Memory allocator for details.
Profile-Guided Optimization (PGO). The build system provides make pgo-generate and make pgo-use targets that compile an instrumented binary, gather branch-probability and call-frequency profiles from a representative workload, and then recompile with those profiles applied. On Apple Silicon the measured gain is approximately 3%; on x86 the gain depends on the workload mix. PGO is opt-in and is not applied to the default make output. See PGO build for the full workflow.
Consolidated mapping speedups
PR #58 and the related lockstep SMEM-batching work (#33) reduced per-read overhead in the main mapping loop beyond what upstream bwa-mem2 carries. The batch -H ingestion improvement (#49) further reduces header-processing latency for large sample sets.
Reference numbers across architectures
Wall-time medians from bwa-mem3-bench at SHA dc7fcfe (2026-05-13), 5 reps per cell, t≈16, hg38, paired-end 150 bp:
| sample | c6a (AVX2, Zen3) | c7a (AVX-512, Zen4) | c7i (AVX-512, SPR) | c7g (NEON, Graviton3) | c8g (NEON, Graviton4) |
|---|---|---|---|---|---|
| wgs-5M | 147.70 s | 101.17 s | 138.33 s | 178.54 s | 151.23 s |
| wes-5M | 84.37 s | 61.96 s | 75.08 s | 84.50 s | 70.90 s |
| panel-twist-5M | 158.49 s | 106.94 s | 151.78 s | 194.04 s | 163.38 s |
Concordance vs upstream bwa-mem2 v2.2.1 on these cells: 100.0000% across 8.1M–10M reads/cell. NEON-vs-x86 cross-architecture concordance on the same builds is also 100.0000%. Spot-pool noise envelope (rep-to-rep CV): ~1% on c6a / c7a / c7g / c8g, ~8–9% on c7i. See the bench repo for the methodology, the full per-rep table, and noisier instance classes excluded from this summary.
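To compare two cells of the table, divide the wall times. The helper below is only a convenience for that arithmetic (the name speedup is hypothetical); it says nothing about significance, so compare the result against the noise envelope quoted above.

```shell
# Ratio of two wall times: baseline seconds divided by candidate seconds.
speedup() {
    awk -v base="$1" -v cand="$2" 'BEGIN { printf "%.2f\n", base / cand }'
}

# Usage with wgs-5M, c6a (147.70 s) against c7a (101.17 s):
#   speedup 147.70 101.17
```

For that pair the ratio is about 1.46, far outside the ~1% rep-to-rep variation on those instance classes. Ratios involving c7i deserve more caution given its 8–9% envelope.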
Benchmarking responsibly
Alignment throughput is sensitive to read length, error rate, reference size, thread count, CPU architecture, NUMA topology, and whether the index is cold (read from disk) or warm (already in the kernel page cache). The bwa-mem3-bench harness controls for these variables by running standardized workloads on defined instance types. If you need numbers for a procurement or publication decision, run the harness against your target hardware.
See also: SIMD dispatch matrix · PGO build · Tuning checklist · What’s Different — Performance improvements · bwa-mem3-bench
SIMD Dispatch Matrix
bwa-mem3 ships one binary per platform. The x86 binary contains
compiled kernels for every supported SIMD tier
(sse41 / sse42 / avx / avx2 / avx512bw) and dispatches in
process at startup. The arm64 binary contains a single NEON kernel
path. There are no bwa-mem3.<tier> companion files on disk and no
launcher binary.
Dispatch flowchart
flowchart TD
A[bwa-mem3 mem starts] --> B{Platform?}
B -- ARM / aarch64 --> C[NEON kernel TU, no dispatch]
B -- x86 --> D[bwamem3_simd_init in src/simd_dispatch.cpp]
D --> E[__builtin_cpu_supports]
E --> F{Host capability?}
F -- AVX-512BW --> G1[g_tier = avx512bw]
F -- AVX2 --> G2[g_tier = avx2]
F -- AVX --> G3[g_tier = avx]
F -- SSE4.2 --> G4[g_tier = sse42]
F -- SSE4.1 --> G5[g_tier = sse41]
F -- below build floor --> H["exit(2): host below SIMD floor"]
G1 & G2 & G3 & G4 & G5 --> I[Per-kernel factory selects matching tier]
Tier detection runs once during main(). Subsequent kernel calls pay
a single indirect-call hop through a factory vtable (or an
extern "C" wrapper for free-function ksw_* kernels) — about
0.3 ns per call after BTB warm-up, well below run-to-run noise on the
bwa-mem3-bench corpus.
If the host CPU does not meet the build’s compile-time SIMD floor
(BASELINE_ARCH, default avx2 since PR #84), the binary exits with
code 2 and an [E::bwamem3] message naming the gap before any
alignment work runs. bwa-mem3 version, --help, and -h are
exempt and always succeed so operators can introspect a binary on a
host that cannot run alignment. See
Host requirements.
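On Linux fleets it can be useful to screen hosts before deploying a binary with a given floor. The sketch below reads a CPU flag list and checks for the flag that corresponds to a tier. It is an approximation: bwa-mem3 itself decides via __builtin_cpu_supports, the helper name host_has_tier is hypothetical, and the flag names are the ones Linux prints in /proc/cpuinfo.

```shell
# Succeeds if the CPU flag list on stdin contains the flag for the given tier.
host_has_tier() (
    case $1 in
        sse41)    flag=sse4_1 ;;
        sse42)    flag=sse4_2 ;;
        avx)      flag=avx ;;
        avx2)     flag=avx2 ;;
        avx512bw) flag=avx512bw ;;
        *)        echo "unknown tier: $1" >&2; exit 2 ;;
    esac
    # One flag per line, then match whole lines so "avx" does not match "avx2".
    tr ' \t' '\n\n' | grep -qx "$flag"
)

# Usage (Linux only):
#   grep -m1 '^flags' /proc/cpuinfo | host_has_tier avx2 || echo "below avx2 floor" >&2
```

A host that fails this check for the build's BASELINE_ARCH would be expected to hit the exit-code-2 path described above.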
Building
make # single multi-tier x86 binary, BASELINE_ARCH=avx2
make BASELINE_ARCH=sse41 # lower host SIMD floor / maximize portability (~10–15% slower on AVX2 hosts)
make BASELINE_ARCH=avx512bw # AVX-512BW-only fleet (locks the host floor)
make arm64 # single NEON binary, no dispatch table
BASELINE_ARCH controls the tier at which non-kernel translation
units compile. The hand-tuned kernel TUs in KERNEL_SRCS
(bandedSWA, kswv, ksw, sam_encode) are always compiled at
every supported tier and dispatched at runtime, so a build at
BASELINE_ARCH=avx2 still uses the AVX-512BW kernels on AVX-512BW
hosts. The non-kernel TUs are not auto-vectorized above
BASELINE_ARCH, which is the trade-off — see
BASELINE_ARCH=avx512bw build flag
for the empirical perf characterization.
Supported x86 tiers (minimum CPU for each tier’s kernel path):
| Tier | Arch flags | Minimum CPU |
|---|---|---|
| sse41 | -msse4.1 | Penryn (2007) / Bulldozer (2011) |
| sse42 | -msse4.2 | Nehalem (2008) / Bulldozer (2011) |
| avx | -mavx | Sandy Bridge (2011) / Bulldozer (2011) |
| avx2 | -mavx2 | Haswell (2013) / Excavator (2015) |
| avx512bw | -mavx512f -mavx512bw -mprefer-vector-width=256 | Skylake-X (2017) / Zen 4 (2022) |
For arm64 builds:
| Binary | Arch flags | Platform |
|---|---|---|
| bwa-mem3 (arm64) | -DAPPLE_SILICON=1 + native NEON / sse2neon shim | Any aarch64 / Apple Silicon |
Kernel vectorization coverage
| Kernel | SSE4.1 | SSE4.2 | AVX | AVX2 | AVX-512BW | NEON (arm64) |
|---|---|---|---|---|---|---|
| kswv (vectorized Smith-Waterman) | 8-wide int16 | 8-wide int16 | 8-wide int16 | 16-wide int16 | 32-wide int16 | 8-wide int16 (native) |
| bandedSWA (banded alignment / mate-rescue) | vectorized | vectorized | vectorized | vectorized | vectorized | native NEON blendv |
| ksw_* (SW extension free functions) | per-tier | per-tier | per-tier | per-tier | per-tier | per-tier (NEON) |
| sam_encode (SAM seq/qual encoder) | per-tier | per-tier | per-tier | per-tier | per-tier | per-tier (NEON) |
| FM-index lookup (FMI_search) | scalar popcount | scalar popcount | scalar popcount | scalar popcount | scalar popcount | __builtin_popcountl |
| libsais BWT construction | scalar | scalar | scalar | OpenMP parallel | OpenMP parallel | OpenMP parallel |
Note — FM-index is memory-bound
The FM-index backward-extension loop is limited by pointer-chasing through the cp_occ arrays, not by computation. Additional SIMD width does not increase throughput here. See Developer Guide — Apple Silicon / NEON port for the profiling evidence.
Runtime overrides
Two environment variables tune dispatch:
| Variable | Effect |
|---|---|
| BWAMEM3_FORCE_TIER=<tier> | Forces a specific tier (sse41 / sse42 / avx / avx2 / avx512bw). Downgrade-only: requests above the host’s detected tier (which would SIGILL) and unknown names are rejected with a stderr warning. Used by test/regression/all_tiers_parity.sh to confirm byte-identical SAM across all tiers on AVX-512 hosts. |
| BWAMEM3_DEBUG_SIMD=1 | Prints a one-line [I::bwamem3_simd_init_body] startup banner with the build baseline, the detected host capability, and the resolved tier. Also enables the build-baseline-vs-host gap warning. |
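The downgrade-only rule for BWAMEM3_FORCE_TIER amounts to an ordering check on the tier names. The sketch below illustrates that rule in shell; it is not the dispatcher's code, and both function names are hypothetical.

```shell
# Map a tier name to its position in the sse41 < ... < avx512bw ordering.
tier_rank() {
    case $1 in
        sse41)    echo 1 ;;
        sse42)    echo 2 ;;
        avx)      echo 3 ;;
        avx2)     echo 4 ;;
        avx512bw) echo 5 ;;
        *)        return 1 ;;   # unknown tier name
    esac
}

# Succeeds only if the requested tier is known and not above the host tier.
# Arguments: requested tier, host's detected tier.
force_tier_allowed() (
    req=$(tier_rank "$1") || exit 1
    host=$(tier_rank "$2") || exit 1
    [ "$req" -le "$host" ]
)

# Usage: force_tier_allowed avx2 avx512bw && export BWAMEM3_FORCE_TIER=avx2
```

A wrapper script can use a check like this to fail early instead of relying on the stderr warning.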
Use bwa-mem3 version to read the resolved tier without alignment:
v0.2.0
SIMD floor: avx2 (x86-64-v3, Haswell 2013+); kernels: sse41 sse42 avx avx2 avx512bw
SIMD runtime: avx512bw (BWAMEM3_FORCE_TIER unset)
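In scripts, the resolved tier can be pulled out of that text with a one-line filter. A sketch that assumes the output format shown above; the helper name simd_runtime_tier is hypothetical.

```shell
# Print the resolved tier from `bwa-mem3 version` text supplied on stdin.
# Assumes a line of the form "SIMD runtime: <tier> (...)".
simd_runtime_tier() {
    awk '/^[[:space:]]*SIMD runtime:/ { print $3; exit }'
}

# Usage:
#   tier=$(bwa-mem3 version | simd_runtime_tier)
#   [ "$tier" = "avx512bw" ] || echo "unexpected tier: $tier" >&2
```

If the output format changes in a later release, the filter prints nothing, which the caller should treat as a failed check.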
Why in-process dispatch, not separate binaries
The pre-PR-#83 design shipped six binaries (one launcher plus one
per ISA tier) and execv'd the matching tier at startup. That worked
but cost ~120 MB on disk, required all six binaries to be present in
the same directory, and made BWAMEM3_FORCE_TIER impossible without
re-exec’ing a different file. The current single-binary design keeps
the per-tier compile granularity for the hand-tuned kernel TUs while
collapsing distribution to one file (~25 MB), and adds runtime tier
override and a clean host-floor precheck. Indirect-call overhead is
the only trade-off, and it is below the measurement noise floor on
every architecture in the bench matrix.
See also:
Performance overview ·
PGO build ·
Host requirements ·
Developer Guide — SIMD dispatch architecture ·
Developer Guide — Single-binary SIMD dispatch (x86) ·
Developer Guide — Apple Silicon / NEON port ·
BASELINE_ARCH=avx512bw build flag
PGO Build
Profile-Guided Optimization (PGO) is a two-pass compiler technique. In the first pass (pgo-generate) the compiler inserts counters into every branch, call site, and loop back-edge. You run a representative training workload against the instrumented binary so those counters accumulate real branch-probability data. In the second pass (pgo-use) the compiler recompiles every translation unit using the collected profiles to make better inlining, branch-prediction, and code-layout decisions.
bwa-mem3’s Makefile provides three targets that implement this workflow.
Observed gains
On Apple Silicon (M-series), PGO delivered approximately 3% throughput improvement over the native NEON build. The gain on x86 depends on the workload — short-read paired-end alignment on avx2 or avx512bw hardware typically sees 2–5%. PGO is most useful when you will run the same binary on the same hardware against the same workload repeatedly (e.g. a production pipeline node). It is not worth the extra build time for one-off or exploratory runs.
Workflow
Step 1: Build the instrumented binary
make pgo-generate
By default PGO_ARCH is set to arm64 on Apple Silicon / aarch64 hosts and native on x86 hosts. To target a specific ISA, pass PGO_ARCH explicitly:
make pgo-generate PGO_ARCH=avx2
This produces a binary named bwa-mem3.pgo-instr (or bwa-mem3.pgo-instr.avx2 for non-default arch). Profiles are written to the directory pgo_profiles/ by default. Override with PGO_PROFILE_DIR:
make pgo-generate PGO_ARCH=avx2 PGO_PROFILE_DIR=/scratch/pgo-profiles-avx2
Step 2: Run the training workload
Run a workload that is representative of your production use. A single-end or paired-end alignment run against the same reference and similar read length is sufficient. A larger training run produces more stable profiles but 5–10 million read pairs is generally enough.
./bwa-mem3.pgo-instr mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null
The run discards output so you are measuring the alignment work alone.
Tip — Training workload size
Aim for a training run that exercises the same code paths as your production workload. If you align 150 bp paired-end reads in production, train on 150 bp reads. If you use --meth, include a methylation alignment run in training. A few million read pairs is sufficient; a full WGS run provides diminishing returns.
Step 3: Build the optimized binary
make pgo-use
Or with matching arch and profile dir:
make pgo-use PGO_ARCH=avx2 PGO_PROFILE_DIR=/scratch/pgo-profiles-avx2
This produces bwa-mem3.pgo (or bwa-mem3.pgo.avx2). The binary is ready to use in production.
Step 4: Clean up instrumentation artifacts
make pgo-clean
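This removes the profile directory and all bwa-mem3.pgo-instr* and bwa-mem3.pgo* files. The second pattern also matches the optimized binary from step 3, so copy that binary elsewhere first if you want to keep it.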
Multi-arch builds with PGO
Each architecture requires its own profile because the instrumentation counters are embedded in arch-specific code. Run the full three-step workflow once per arch and keep the profiles in separate directories:
# AVX2 profile
make pgo-generate PGO_ARCH=avx2 PGO_PROFILE_DIR=pgo_profiles_avx2
./bwa-mem3.pgo-instr.avx2 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null
make pgo-use PGO_ARCH=avx2 PGO_PROFILE_DIR=pgo_profiles_avx2
# AVX-512BW profile (separate host or same host with matching CPU)
make pgo-generate PGO_ARCH=avx512bw PGO_PROFILE_DIR=pgo_profiles_avx512bw
./bwa-mem3.pgo-instr.avx512bw mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null
make pgo-use PGO_ARCH=avx512bw PGO_PROFILE_DIR=pgo_profiles_avx512bw
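Because the per-arch commands differ only in the arch name, they can be generated. The sketch below prints the three commands instead of running them, so the plan can be reviewed first. The function name pgo_plan is hypothetical; the file names, targets, and variables are the ones used in the examples above.

```shell
# Print (not run) the three PGO commands for one arch.
pgo_plan() (
    arch=$1
    dir=pgo_profiles_$arch
    echo "make pgo-generate PGO_ARCH=$arch PGO_PROFILE_DIR=$dir"
    echo "./bwa-mem3.pgo-instr.$arch mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null"
    echo "make pgo-use PGO_ARCH=$arch PGO_PROFILE_DIR=$dir"
)

# Usage:
#   pgo_plan avx2          # review the commands
#   pgo_plan avx2 | sh     # then execute them on a matching host
```

Keeping one profile directory per arch, as the generated commands do, avoids mixing profiles between builds.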
Warning — Profile portability
Profile data collected on one microarchitecture is not portable to a different one. An AVX2 profile collected on a Haswell CPU will not improve — and may pessimize — an AVX-512BW build run on a Sapphire Rapids CPU. Always collect profiles on the same hardware class where the optimized binary will run.
PGO and the single-binary multi-tier build
The PGO targets produce one optimized binary for a single arch= target. They do not yet rebuild the default make single multi-tier binary’s per-tier kernel TUs. If you need PGO across more than one host class, build and profile each arch= variant separately and deploy whichever matches the target fleet — bwa-mem3 version will report the resolved tier so you can confirm. PGO for the in-process multi-tier dispatch path is tracked as a future enhancement.
Relationship to LTO
make lto-build produces a Link-Time Optimization binary; make pgo-use produces a PGO-optimized binary. Both are independent opt-in targets. You can combine them by passing -flto (or -flto=thin for clang) as part of EXTRA_CXXFLAGS during the pgo-use step, but the combination has not been systematically benchmarked. In practice, LTO and PGO each provide modest single-digit gains; their interaction is compiler-specific.
See also: Performance overview · SIMD dispatch matrix · Tuning checklist · Best Practices — Build · Developer Guide — Building from source
Tuning Checklist
The items below are ordered by expected impact for most workloads. Work through them in sequence; there is little point optimizing output format before confirming you are running the right binary for your CPU.
1. Confirm the resolved SIMD tier matches your CPU
The default make produces a single binary that contains every supported
x86 SIMD tier and selects one in process at startup. Verify which tier
is running:
bwa-mem3 version
# expect: SIMD floor: <build_floor>; SIMD runtime: <resolved_tier>
If the runtime tier is below what your CPU supports, double-check
whether you accidentally built with a lower BASELINE_ARCH= or set
BWAMEM3_FORCE_TIER in the environment. Set BWAMEM3_DEBUG_SIMD=1 to
get a startup banner on stderr at the start of a mem run.
On ARM / Apple Silicon, the binary has one NEON tier; bwa-mem3 version
reports SIMD runtime: neon.
See SIMD dispatch matrix for the full dispatch logic and the minimum CPU requirements for each tier.
Tip — Single-arch deployments
On a cluster where every node has the same CPU, build with make arch=avx2 (or the appropriate ISA). The runtime dispatch overhead is negligible, but a single-arch build trims the binary and removes any chance of BWAMEM3_FORCE_TIER accidentally downgrading throughput in production.
2. Build with PGO if you will run repeatedly
For production pipeline nodes that will process many samples against the same reference, a PGO build provides an additional 2–5% throughput at the cost of one extra build pass and a training run:
make pgo-generate PGO_ARCH=avx2
./bwa-mem3.pgo-instr.avx2 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null
make pgo-use PGO_ARCH=avx2
See PGO build for the full workflow, including multi-arch and profile portability notes.
3. Use shared memory for many small samples
When aligning many samples on one machine against the same reference, loading the index into POSIX shared memory once and reusing it across all mem invocations eliminates redundant I/O and reduces per-sample startup time significantly. The benefit grows with the number of samples and the size of the reference.
# Load the index into shared memory once
bwa-mem3 shm ref.fa
# Align each sample against the in-memory index
bwa-mem3 mem -t 16 ref.fa R1.fq.gz R2.fq.gz | samtools sort -@ 4 -o sample.bam -
# When finished with all samples, drop the shared segment
bwa-mem3 shm -d
Warning — No staleness check
bwa-mem3 shm does not detect whether the on-disk index has changed after the segment was loaded. Always run bwa-mem3 shm -d before re-indexing a reference and re-loading with bwa-mem3 shm. Failing to do so results in alignments against a stale index.
See Getting Started — Shared-memory index and Best Practices — Multi-sample workflows for complete workflows.
4. Emit BAM directly
Use --bam (or --bam=0 for uncompressed BAM) to emit BAM instead of SAM. Uncompressed BAM avoids the text-formatting cost on the aligner side and the text-parsing cost on the downstream side. samtools sort reads BAM natively and is fastest when the input is uncompressed:
bwa-mem3 mem --bam=0 -t 16 ref.fa R1.fq.gz R2.fq.gz \
| samtools sort -@ 8 -o out.bam -
samtools index out.bam
The --bam flag (without =0) produces BGZF-compressed BAM. This is useful when writing directly to disk without a downstream piped tool.
See Best Practices — Output format for guidance on when SAM is still appropriate.
5. Pipe to a multi-threaded sorter
Sorting is typically the bottleneck after alignment. Keep a separate thread budget for samtools sort:
bwa-mem3 mem --bam=0 -t 12 ref.fa R1.fq.gz R2.fq.gz \
| samtools sort -@ 8 -m 2G -o out.bam -
On a 16-core machine, allocating 12 threads to mem and 8 to samtools sort (with overlap via the pipe) is a common starting point. The aligner is generally CPU-bound; the sorter is I/O-bound during merge. Profile both stages to find the right split for your hardware.
Tip — Thread count tuning
bwa-mem3 mem scales well to 16–32 threads on most workloads. Beyond 32 threads the per-thread work unit becomes small enough that synchronization overhead starts to erode gains. See User Guide — Threading and resource use for thread-scaling data.
Summary table
| Item | Action | Reference |
|---|---|---|
| Right SIMD tier for CPU | bwa-mem3 version; verify SIMD runtime: | SIMD dispatch matrix |
| PGO for production | pgo-generate → train → pgo-use | PGO build |
| Shared-memory index | bwa-mem3 shm ref.fa before batch runs | Quick start: shm |
| Emit uncompressed BAM | --bam=0 | Best Practices — Output format |
| Multi-threaded sort | samtools sort -@ with appropriate thread split | User Guide — Threading |
See also: Performance overview · SIMD dispatch matrix · PGO build · Best Practices — Build · User Guide — Threading and resource use
Build
This page describes the recommended build configuration for production use of bwa-mem3.
Choose the right arch target
The default make invocation builds a single multi-tier binary on x86
(or a single NEON binary on arm64). For production clusters where the
CPU family is uniform, you can trim further by building one tier only —
the binary drops the per-tier dispatch table and ships a single kernel
path:
# Most modern x86-64 servers (Haswell or later):
make arch=avx2
# Intel Cascade Lake / Sapphire Rapids, AWS c7i/m7i:
make arch=avx512bw
# Apple Silicon / AWS Graviton:
make arch=arm64
Omit arch= if the deployment target is heterogeneous or unknown; the
default make produces a single binary that includes every supported
x86 tier and dispatches at runtime via __builtin_cpu_supports. Tune
the non-kernel TU compile baseline with BASELINE_ARCH= (default
avx2) — see
Single-binary SIMD dispatch (x86).
See SIMD dispatch matrix for the full list of targets and which kernels each vectorizes.
Profile-Guided Optimization (PGO)
PGO typically yields a 2–5% throughput improvement on real workloads. It is
opt-in (the standard make target does not use it) but is recommended for any
installation that will run many alignment jobs against the same reference.
The workflow is three steps:
# Step 1: Build an instrumented binary (produces bwa-mem3.pgo-instr).
make pgo-generate
# Step 2: Run a representative training workload.
# Use reads and a reference that reflect actual production input.
# About 5–10 million read pairs is sufficient.
./bwa-mem3.pgo-instr mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null
# Step 3: Build the PGO-optimized binary (produces bwa-mem3.pgo).
make pgo-use
To target a specific SIMD level, pass PGO_ARCH=:
make pgo-generate PGO_ARCH=avx2
./bwa-mem3.pgo-instr.avx2 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null
make pgo-use PGO_ARCH=avx2
# Produces: bwa-mem3.pgo.avx2
Profile data is written to pgo_profiles/ by default. Pass
PGO_PROFILE_DIR=<path> to change the location.
Tip — Training data matters
The training workload should resemble production input in read length, base quality distribution, and reference composition. A read set that is too short, too long, or too easy (low mismatch rate) will bias the branch predictions and may produce a build that is slower than the non-PGO baseline on real data.
mimalloc
mimalloc is compiled in by default (USE_MIMALLOC=1). The allocator
improves multi-threaded throughput by reducing lock contention on malloc
and free hot paths. Run bwa-mem3 version to confirm it is active:
bwa-mem3 version
# Expected output includes a line like:
# mimalloc 3.x.x
To build without mimalloc (for example, when using AddressSanitizer or on a system with a known-incompatible allocator):
make USE_MIMALLOC=0
Summary
For a production installation on a known x86 server with AVX2:
make pgo-generate PGO_ARCH=avx2
./bwa-mem3.pgo-instr.avx2 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null
make pgo-use PGO_ARCH=avx2
# Deploy: bwa-mem3.pgo.avx2
See also: SIMD dispatch matrix · PGO build · Memory allocator (mimalloc) · Building from source · Anti-patterns
Output Format
The choice of output format — SAM, compressed BAM, or uncompressed BAM — has a measurable effect on end-to-end pipeline wall time. This page explains why uncompressed BAM is the right default and shows the recommended pipeline.
Why uncompressed BAM is faster than SAM
When bwa-mem3 writes SAM (the default when --bam is not set), every
alignment record must be serialized into ASCII text: integers are formatted as
decimal strings, bases are encoded as characters, and flags are written as
decimal numbers. The receiving process — typically samtools sort — then parses
each field back from text into binary integers. Both conversions are pure
overhead: the data is binary inside bwa-mem3 and binary inside samtools; text
is only an interchange format that is immediately discarded.
Uncompressed BAM (--bam=0) bypasses this round-trip. bwa-mem3 writes binary
BAM records directly via htslib’s wb0 mode. The write path performs no text
formatting; the read path in samtools sort performs no text parsing. The
htslib overhead of the wb0 write is negligible — it is effectively a
buffered write(2) call with a small BAM block header prepended.
Compressed BAM (--bam=1 through --bam=9) adds BGZF compression on top, which
costs CPU on the write side and gains nothing: the stream only crosses a kernel
pipe buffer, samtools sort must decompress it immediately, and it will
re-compress its own output anyway. Compressed BAM on a pipe wastes CPU on both
sides.
Recommended pipeline
bwa-mem3 mem --bam=0 -t 16 ref.fa R1.fq.gz R2.fq.gz \
| samtools sort -@ 8 -o out.bam -
samtools index out.bam
The -@ 8 flag gives samtools sort eight compression threads for writing the
final sorted BAM. Tune this number based on available cores; the total core
count should be split so that alignment threads and sort threads do not
contend. A 16:8 split (bwa-mem3:samtools) works well on 24-core machines.
Tip — Thread allocation
Do not give all cores to bwa-mem3. Downstream samtools sort needs threads to compress and write the sorted BAM. Leaving 4–8 threads for samtools sort keeps the pipeline balanced and prevents a write bottleneck that would stall the aligner.
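One way to derive the split from a core count is sketched below. The rule used here (one third of the cores for samtools sort, clamped to the 4–8 range from the tip above) is an illustrative assumption, not something bwa-mem3 computes itself, and split_threads is a hypothetical helper name.

```shell
# Print "<align_threads> <sort_threads>" for a given total core count.
# Assumption: one third of the cores go to samtools sort, clamped to 4..8.
split_threads() {
  total=$1
  sort_threads=$((total / 3))
  if [ "$sort_threads" -lt 4 ]; then sort_threads=4; fi
  if [ "$sort_threads" -gt 8 ]; then sort_threads=8; fi
  align_threads=$((total - sort_threads))
  if [ "$align_threads" -lt 1 ]; then align_threads=1; fi
  echo "$align_threads $sort_threads"
}

# Usage (nproc is part of GNU coreutils):
#   set -- $(split_threads "$(nproc)")
#   bwa-mem3 mem --bam=0 -t "$1" ref.fa R1.fq.gz R2.fq.gz \
#     | samtools sort -@ "$2" -o out.bam -
```

On a 24-core machine this reproduces the 16:8 split described above; treat the result as a starting point and measure on your own workload.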
Methylation output
The --meth path always writes uncompressed BAM internally, regardless of
the --bam flag. The post-processing step (header rewrite, Bismark
XR:Z / XG:Z / XM:Z tag emission, opt-in chimera QC) is performed
inline before the record is handed to htslib, so the same pipeline shape
applies:
bwa-mem3 mem --meth --bam=0 -t 16 ref.fa R1.fq.gz R2.fq.gz \
| samtools sort -@ 8 -o out.bam -
samtools index out.bam
When SAM is appropriate
SAM (the default, equivalent to omitting --bam) remains the right choice for:
- Debugging. Plain text is readable with less, grep, and any text editor, making it easy to inspect individual records without samtools view.
- Ad-hoc inspection. When you need to scan a few thousand reads to diagnose a mapping problem, piping to SAM and reading the output directly is faster than writing a BAM file and then querying it.
- Compatibility with tools that require SAM input. Some legacy tools do not accept BAM. If the downstream tool does not support BAM, use SAM.
For production alignment jobs that feed samtools sort, always use
--bam=0.
Summary table
| Format | --bam value | Pipe overhead | Recommended for |
|---|---|---|---|
| SAM | (default / omit) | High (text round-trip) | Debugging, ad-hoc inspection |
| Uncompressed BAM | 0 | Negligible | Production pipelines |
| Compressed BAM | 1–9 | High on write side | Writing directly to a file (no downstream sort) |
See also: Aligning short reads (mem) · Output: SAM/BAM, headers, tags · Threading and resource use · Tuning checklist · CLI Reference: mem
Multi-Sample Workflows
When you need to align many samples back-to-back against the same reference on a single machine, loading the FM-index into shared memory once — and keeping it resident across all alignment jobs — eliminates the index I/O cost for every sample after the first.
The problem: repeated index loads
The bwa-mem3 FM-index for hg38 is approximately 28 GB on disk. Without shared
memory, bwa-mem3 mem reads the entire index from disk on every invocation.
On a fast NVMe drive this takes 30–60 seconds; on a network-attached or
spinning-disk filesystem it can take several minutes. For a batch of 100
samples, that adds 50–100 minutes of pure I/O overhead on NVMe and several
hours on slower storage.
Staging the index once with bwa-mem3 shm
# Stage the index into shared memory (one-time cost, ~28 GB for hg38).
bwa-mem3 shm ref.fa
# Align each sample. bwa-mem3 mem attaches automatically — no extra flag.
bwa-mem3 mem --bam=0 -t 16 ref.fa sample1_R1.fq.gz sample1_R2.fq.gz \
| samtools sort -@ 4 -o sample1.bam -
bwa-mem3 mem --bam=0 -t 16 ref.fa sample2_R1.fq.gz sample2_R2.fq.gz \
| samtools sort -@ 4 -o sample2.bam -
# ...
# When done, release the segment.
bwa-mem3 shm -d
For methylation workflows, stage the c2t index instead:
bwa-mem3 shm --meth ref.fa
bwa-mem3 mem --meth --bam=0 -t 16 ref.fa sample1_R1.fq.gz sample1_R2.fq.gz \
| samtools sort -@ 4 -o sample1.bam -
bwa-mem3 shm -d
Confirming the index is staged
bwa-mem3 shm -l
# Prints the basename and memory usage of each staged segment.
If the listing is empty, the index is not staged and bwa-mem3 mem will fall
back to loading from disk.
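A batch script can refuse to start when the index is not staged. This sketch assumes that bwa-mem3 shm -l prints nothing on stdout when no segment is staged and that each listed segment's line contains the reference basename, as the comment above describes. The exact listing format is not specified here, and index_is_staged is a hypothetical helper.

```shell
# Return 0 if the listing from `bwa-mem3 shm -l` mentions the basename
# of the given reference, non-zero otherwise.
# Assumption: an empty listing means nothing is staged, and each staged
# segment is listed on a line that contains its basename.
index_is_staged() {
  ref_base=$(basename "$1")
  listing=$(bwa-mem3 shm -l 2>/dev/null) || return 1
  [ -n "$listing" ] || return 1
  printf '%s\n' "$listing" | grep -qF -- "$ref_base"
}

# Usage:
#   index_is_staged ref.fa || bwa-mem3 shm ref.fa
```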
Thread layout for parallel alignment
Running multiple bwa-mem3 mem instances in parallel is efficient when the
samples are independent and the machine has enough cores. The shared-memory
index eliminates disk contention, so the bottleneck becomes CPU and memory
bandwidth.
Guidelines for N-core machines:
- N = 32: Two instances at -t 14 each, with -@ 4 for samtools sort. Keeps 4 cores reserved for OS and I/O.
- N = 64: Two to four instances at -t 14 to -t 16, each with -@ 4 for samtools sort.
- N = 128: Four to eight instances; keep at least 8–16 cores free for samtools sort threads and OS scheduling.
Tip — Memory bandwidth limit
The FM-index lookup is memory-bandwidth bound. On machines with NUMA topology (multi-socket or multi-chiplet), binding each bwa-mem3 instance to a NUMA node with numactl --cpunodebind=N --membind=N can improve throughput by reducing cross-node memory traffic.
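As a sketch of that binding on a two-node machine, each aligner can be wrapped in numactl. The node numbers, thread counts, file naming, and the align_on_node helper are illustrative assumptions; numactl --hardware shows the real topology of a host.

```shell
# Run one alignment pinned to a single NUMA node.
# Arguments: <node> <sample-prefix>
# Assumption: FASTQs are named <sample>_R1.fq.gz / <sample>_R2.fq.gz.
align_on_node() {
  node=$1
  sample=$2
  numactl --cpunodebind="$node" --membind="$node" \
    bwa-mem3 mem --bam=0 -t 14 ref.fa "${sample}_R1.fq.gz" "${sample}_R2.fq.gz" \
    | samtools sort -@ 4 -o "${sample}.bam" -
}

# Usage: one instance per node, then wait for both.
#   align_on_node 0 sample1 &
#   align_on_node 1 sample2 &
#   wait
```

Only the aligner is bound here; samtools sort is left to the scheduler, which is a simplification.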
Scripting a batch with a loop
bwa-mem3 shm ref.fa
for sample in sample1 sample2 sample3; do
bwa-mem3 mem --bam=0 -t 16 ref.fa "${sample}_R1.fq.gz" "${sample}_R2.fq.gz" \
| samtools sort -@ 4 -o "${sample}.bam" -
samtools index "${sample}.bam"
done
bwa-mem3 shm -d
For parallel execution, replace the for loop body with a background job (or
use a workflow manager such as Snakemake or Nextflow) and limit the degree of
parallelism to match available cores.
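A minimal sketch of the background-job variant follows. It runs the samples in fixed-size batches and waits for each batch to finish, which is simpler but less efficient than a true job pool. The function names and the -t 14 / -@ 4 split are assumptions taken from the 32-core guideline above.

```shell
# Align one sample; same pipeline as the loop body above.
align_sample() {
  sample=$1
  bwa-mem3 mem --bam=0 -t 14 ref.fa "${sample}_R1.fq.gz" "${sample}_R2.fq.gz" \
    | samtools sort -@ 4 -o "${sample}.bam" - \
    && samtools index "${sample}.bam"
}

# Run align_sample for every sample, at most <max_jobs> at a time.
# Arguments: <max_jobs> <sample>...
run_parallel() {
  max_jobs=$1
  shift
  running=0
  for s in "$@"; do
    align_sample "$s" &
    running=$((running + 1))
    if [ "$running" -ge "$max_jobs" ]; then
      wait            # block until the current batch has finished
      running=0
    fi
  done
  wait                # pick up the final, possibly partial, batch
}

# Usage:
#   bwa-mem3 shm ref.fa
#   run_parallel 2 sample1 sample2 sample3
#   bwa-mem3 shm -d
```

Because each batch waits for its slowest sample, a workflow manager will usually keep the cores busier on batches with uneven sample sizes.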
Warning — Stale segment footgun
If you need to re-index the reference (e.g. after updating it), always run bwa-mem3 shm -d before bwa-mem3 index. There is no automatic staleness check. See Anti-patterns for details.
See also: Quick start: shared-memory index · CLI Reference: shm · Output format · Threading and resource use · Anti-patterns
Methylation Defaults
bwa-mem3 mem --meth ships with a set of scoring and filtering defaults that
match the bwameth.py reference implementation. This page describes what those
defaults are, when to keep them, and when to override them.
What --meth sets
When --meth is passed, the following flags are applied automatically in
addition to enabling inline c2t conversion and BAM post-processing:
| Flag | Value | Purpose |
|---|---|---|
| -B | 2 | Mismatch penalty. Reduced from the bwa-mem2 default of 4. Bisulfite-treated reads carry C→T and G→A mismatches at converted positions; a lower penalty prevents these from causing spurious soft-clipping or unmapped reads. |
| -L | 10 | Clipping penalty. Increased from the bwa-mem2 default of 5 to discourage clipping of read ends that carry converted bases at positions that look like mismatches. |
| -U | 100 | Unpaired read penalty. Higher than default; methylation libraries typically have well-defined insert sizes and anomalous pairing usually reflects a mapping artifact. |
| -T | 40 | Minimum alignment score threshold. Higher than default; raises the bar to report an alignment, reducing spurious low-quality hits against the doubled reference. |
| -CM | — | Combines -C (append the FASTA/FASTQ comment to the output record) and -M (mark shorter split hits as secondary), matching the bwa mem invocation used by bwameth.py. |
These defaults can all be overridden on the command line. The --meth flag
sets them first; any explicit flag that follows overrides the --meth-set
value.
When to keep the defaults
For standard whole-genome bisulfite sequencing (WGBS) workflows, the defaults are appropriate as-is. They were derived from the bwameth.py codebase and are expected by most downstream methylation calling tools. Unless you have a specific reason to deviate, use:
bwa-mem3 mem --meth --bam=0 -t 16 ref.fa R1.fq.gz R2.fq.gz \
| samtools sort -@ 4 -o out.bam -
samtools index out.bam
When to override
Low-coverage or targeted bisulfite sequencing. If your library covers a
small target region and insert sizes are more variable, consider lowering -T
(e.g. -T 20) to recover short or soft-clipped alignments in the target.
Amplicon bisulfite sequencing. Amplicon reads have uniform insert sizes;
the default -U 100 is appropriate. However, if your amplicons are short
(< 100 bp), consider raising -L above the --meth default of 10 to further
discourage clipping at read ends.
Non-standard conversion chemistry. Some library preparations produce reads
from only one converted strand (C→T only, not G→A). In such cases,
--set-as-failed r flags alignments to the reverse (r) strand as QC-fail
(0x200) rather than dropping them, so downstream tools can filter them out
and reduce noise from strand-ambiguous alignments:
bwa-mem3 mem --meth --set-as-failed r --bam=0 -t 16 ref.fa R1.fq.gz R2.fq.gz \
| samtools sort -@ 4 -o out.bam -
Chimera QC is opt-in (matches Bismark default). bwameth.py applies a
chimera heuristic that flags reads whose longest matching run
(CIGAR M/=/X) is less than 44 % of the read length: 0x200 set,
0x2 cleared, MAPQ capped at 1. bwa-mem3 --meth does not apply
this by default — the runtime posture matches Bismark, where no such
heuristic exists.
If your library is PBAT / scBS-Seq (where intra-fragment chimerism is
common) or you want bwameth.py-equivalent flagging, pass --chimera-qc:
bwa-mem3 mem --meth --chimera-qc --bam=0 -t 16 ref.fa R1.fq.gz R2.fq.gz \
| samtools sort -@ 4 -o out.bam -
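The chimera rule described above can be modeled in a few lines. This is a sketch of the stated rule only (longest M/=/X run below 44 % of the read length sets 0x200, clears 0x2, and caps MAPQ at 1), not bwa-mem3's implementation. The helper names are hypothetical, and the read length is passed in by the caller as an assumption.

```shell
# Print the longest single M, = or X run in a CIGAR string.
longest_match() {
  printf '%s\n' "$1" | awk '{
    best = 0
    while (match($0, /^[0-9]+[MIDNSHP=X]/)) {
      op  = substr($0, RLENGTH, 1)
      len = substr($0, 1, RLENGTH - 1) + 0
      if ((op == "M" || op == "=" || op == "X") && len > best) best = len
      $0 = substr($0, RLENGTH + 1)
    }
    print best
  }'
}

# Arguments: <flag> <mapq> <cigar> <read_len>; prints "<flag> <mapq>".
apply_chimera_qc() {
  flag=$1; mapq=$2; cigar=$3; read_len=$4
  longest=$(longest_match "$cigar")
  # "longest < 44% of read length", kept in integer arithmetic.
  if [ $((longest * 100)) -lt $((read_len * 44)) ]; then
    flag=$(( (flag | 512) & ~2 ))     # set 0x200 (QC fail), clear 0x2 (proper pair)
    if [ "$mapq" -gt 1 ]; then mapq=1; fi
  fi
  echo "$flag $mapq"
}

# Usage:
#   apply_chimera_qc 99 60 30M70S 100
```

A 100 bp read with CIGAR 30M70S fails the test (30 < 44), while 44M56S sits exactly on the threshold and is left untouched.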
Note — Overrides are positional
Flags supplied after --meth on the command line override the defaults set by --meth. For example, bwa-mem3 mem --meth -B 4 ... uses -B 4 (not 2). Flags supplied before --meth are silently overwritten by --meth’s defaults, so always place overrides after --meth.
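The last-one-wins behavior can be illustrated with a toy left-to-right parser. This models only the ordering rule for -B as described in the note; it is not how bwa-mem3 parses its arguments, and resolve_mismatch_penalty is a hypothetical name.

```shell
# Print the effective -B value for a list of arguments, processing them
# left to right so that whichever of --meth or -B comes last wins.
resolve_mismatch_penalty() {
  penalty=4                          # default for -B without --meth
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --meth) penalty=2 ;;           # --meth default for -B
      -B)
        [ "$#" -ge 2 ] || return 2   # -B needs a value
        penalty=$2
        shift
        ;;
    esac
    shift
  done
  echo "$penalty"
}

# Usage:
#   resolve_mismatch_penalty --meth -B 4    # override placed after --meth
#   resolve_mismatch_penalty -B 4 --meth    # override placed before --meth
```

The first call yields 4 and the second yields 2, which is why overrides belong after --meth.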
Downstream tool compatibility
The --meth output BAM is designed to be a drop-in replacement for the output
of the bwameth.py pipeline. The following downstream tools have been used
successfully with bwa-mem3 --meth output:
- bismark_methylation_extractor, methylKit processBismarkAln, methtuple, DMRfinder, epialleleR — read the Bismark XR:Z, XG:Z, XM:Z tags directly from --meth output.
- MethylDackel — reads XG:Z (and ignores the bwameth-convention YD:Z: if present, which --meth no longer emits).
- biscuit per-read tools — read XG:Z.
See also: Methylation Reference: Overview · SAM tags: XR, XG, XM · Flags: --set-as-failed, --chimera-qc · Quick start: methylation alignment · Output format
Multi-architecture deployment
This page covers running bwa-mem3 in heterogeneous compute environments — AWS Batch with mixed instance families, GCP Batch with mixed CPU platforms, on-prem Slurm with mixed nodes, Kubernetes clusters with mixed node pools.
Within x86_64: one binary, dynamic dispatch
bwa-mem3 ships a single x86_64 binary that contains five SIMD kernel tiers (sse41, sse42, avx, avx2, avx512bw) and selects the best one at runtime via __builtin_cpu_supports. See src/simd_dispatch.cpp for the dispatcher and src/kernel_dispatch.h for the per-tier symbol mangling.
Build once at the BASELINE_ARCH floor that matches your fleet’s oldest x86 host. The default BASELINE_ARCH=avx2 covers Intel Haswell (2013) and AMD Zen (2017) onward. Within that floor, every host transparently uses its best available tier for the hot kernel paths.
Across x86_64 and arm64
A single ELF binary cannot span CPU families. You must build two binaries — one for x86_64, one for arm64 — and package them so the right one runs on each host.
The recommended approach is a Docker manifest-list container of your own making, with one layer per architecture under a single tag. Example:
FROM ubuntu:24.04 AS build
RUN apt-get update && apt-get install -y \
build-essential git cmake pkg-config \
autoconf automake autoconf-archive libtool \
zlib1g-dev
WORKDIR /src
RUN git clone --recursive https://github.com/fg-labs/bwa-mem3 .
RUN make -j
FROM ubuntu:24.04
COPY --from=build /src/bwa-mem3 /usr/local/bin/bwa-mem3
RUN apt-get update && apt-get install -y libgomp1 zlib1g \
&& rm -rf /var/lib/apt/lists/*
ENTRYPOINT ["bwa-mem3"]
Build for both architectures with one command:
docker buildx build --platform linux/amd64,linux/arm64 \
-t <registry>/<image>:<tag> --push .
AWS Batch, GCP Batch, Kubernetes, and containerd all read the manifest list and pull the correct layer based on the host’s architecture. The submitter references one tag; the runtime picks the right binary automatically.
Verifying at runtime
bwa-mem3 version reports the build’s floor, the kernels compiled in, and the resolved runtime tier. Use this in CI or in your Batch job’s startup script to confirm the right layer was pulled:
$ bwa-mem3 version
v0.2.0-12-gabcdef1
SIMD floor: avx2 (x86-64-v3, Haswell 2013+); kernels: sse41 sse42 avx avx2 avx512bw
SIMD runtime: avx512bw (BWAMEM3_FORCE_TIER unset)
Grep for SIMD runtime: to record the tier each job ran at — useful for post-mortem diagnosis of perf regressions.
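A job startup script might extract the tier as follows. The sketch relies only on the SIMD runtime: line format shown in the sample output above; whether version writes to stdout or stderr is not stated here, so both are captured, and simd_runtime_tier is a hypothetical helper name.

```shell
# Print the resolved runtime tier (third field of the "SIMD runtime:" line).
# Assumption: the line looks like "SIMD runtime: avx512bw (...)".
simd_runtime_tier() {
  bwa-mem3 version 2>&1 | awk '/^SIMD runtime:/ { print $3 }'
}

# Usage: record the tier in the job log before aligning.
#   echo "simd_tier=$(simd_runtime_tier)" >&2
```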
Pre-Haswell hosts
If your fleet really must include pre-Haswell x86 (c4, m4, pre-Skylake Xeons), rebuild with a lower floor:
make BASELINE_ARCH=sse41
Expect roughly 10-15% slower wall time on AVX2 hosts in the same container compared to a default BASELINE_ARCH=avx2 build. This is the trade-off for broader host coverage; only do it if you actually need pre-Haswell support.
The default BASELINE_ARCH=avx2 covers virtually every modern compute environment. AWS, GCP, and Azure all default to Haswell-or-newer instance types in current-generation compute environments.
What the host-floor precheck does
If a job is scheduled onto a host that doesn’t meet the build’s floor (e.g. an avx2-baseline binary lands on a pre-Haswell host), bwa-mem3 mem refuses to run with exit code 2 and a clear stderr message:
[E::bwamem3] this binary was compiled for SIMD floor avx2 and emits avx2
instructions in non-kernel translation units. The host CPU does not support
avx2 (detected: sse42). Running would SIGILL on the first avx2 instruction.
To run on this host, rebuild bwa-mem3 with BASELINE_ARCH=sse42 (or lower),
or use a binary built for a lower SIMD floor.
This is a clean failure: the job exits before any billable alignment work starts. Compare to the alternative without the precheck (SIGILL deep inside an alignment job, opaque process death, wasted compute).
Defence-in-depth recommendation: configure your AWS Batch compute environment (or equivalent) to exclude instance families older than your binary’s floor. The precheck protects against accidental scheduling; an allowlist at the orchestrator level prevents the scheduling decision in the first place.
See also
- Getting Started → Host requirements
- Getting Started → Installation
- src/simd_dispatch.cpp — the runtime dispatcher (bwamem3_simd_init, bwamem3_enforce_host_floor, bwamem3_print_version_simd)
Anti-Patterns
This page documents common mistakes that produce incorrect results or unnecessary failures when using bwa-mem3.
Re-indexing without dropping the shared-memory segment
Warning — Footgun
bwa-mem3 shm does not detect stale segments. If you re-run bwa-mem3 index after a shared-memory segment is already staged, the on-disk index files will not match the in-memory segment. bwa-mem3 mem will attach to the stale segment and produce incorrect alignments without any warning.
Always run bwa-mem3 shm -d before re-indexing:
bwa-mem3 shm -d # drop all staged segments
bwa-mem3 index ref.fa # rebuild the on-disk index
bwa-mem3 shm ref.fa # re-stage the new index
There is no automatic staleness check in the implementation. The segment name is derived from the reference basename only; no content hash or modification timestamp is stored.
To confirm that no stale segments are staged, use bwa-mem3 shm -l before
running any indexing step.
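That check can be wrapped around the indexing step so that a stale segment blocks the rebuild instead of going unnoticed. The sketch assumes that bwa-mem3 shm -l prints nothing on stdout when no segment is staged; reindex_guarded is a hypothetical wrapper, not a bwa-mem3 feature.

```shell
# Rebuild the index only when no shared-memory segment is staged.
# Assumption: an empty `bwa-mem3 shm -l` listing means nothing is staged.
reindex_guarded() {
  ref=$1
  if [ -n "$(bwa-mem3 shm -l 2>/dev/null)" ]; then
    echo "staged segments present; run 'bwa-mem3 shm -d' first" >&2
    return 1
  fi
  bwa-mem3 index "$ref"
}

# Usage:
#   reindex_guarded ref.fa && bwa-mem3 shm ref.fa
```

The wrapper refuses rather than dropping segments itself, since other jobs on the host may still be attached to them.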
Forgetting to initialize submodules
bwa-mem3 depends on several submodules (ext/htslib, ext/safestringlib,
ext/libsais, ext/mimalloc, ext/sse2neon). A shallow clone or a clone
without --recursive will produce a build that fails at the linking step with
missing symbols, or at runtime with missing index files.
Warning — Missing submodules
Always clone with --recursive, or initialize submodules after cloning:
git clone --recursive https://github.com/fg-labs/bwa-mem3
# or, after a non-recursive clone:
git submodule update --init --recursive
If make reports missing headers (e.g. htslib/hts.h: No such file or directory), the submodules were not initialized.
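A pre-build check can catch this before make runs. The sketch uses the submodule paths listed above and assumes that an uninitialized submodule shows up as an absent or empty directory; missing_submodules is a hypothetical helper.

```shell
# Print each expected submodule directory that is absent or empty.
# Argument: repository root (defaults to the current directory).
missing_submodules() {
  root=${1:-.}
  for d in ext/htslib ext/safestringlib ext/libsais ext/mimalloc ext/sse2neon; do
    if [ -z "$(ls -A "$root/$d" 2>/dev/null)" ]; then
      echo "$d"
    fi
  done
}

# Usage:
#   [ -z "$(missing_submodules)" ] || git submodule update --init --recursive
```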
Leaving BASELINE_ARCH at the default on a known higher-tier CPU
The default make (no arch=) builds the multi-tier single binary
with non-kernel TUs compiled at BASELINE_ARCH=avx2. On a production
server with a known higher-tier CPU family, this leaves auto-vectorized
non-kernel hot paths at 256-bit width when the host could go wider, or
keeps the host-floor precheck at avx2 when the deployment surface is
strictly AVX-512. Pass BASELINE_ARCH= (or build a single-tier binary
with arch=) to align the build with the deployment:
Warning — Suboptimal build on known hardware
# Single multi-tier binary with non-kernel TUs at the host's tier:
make BASELINE_ARCH=avx512bw # Cascade Lake / Ice Lake / Sapphire Rapids / Zen 4
# Single-tier binary (no dispatch table; smallest install) when the cluster
# is uniform and you don't need cross-tier portability:
make arch=avx2 # Broadwell/Skylake and later x86
make arch=avx512bw # Cascade Lake / Sapphire Rapids
make arch=arm64 # Apple Silicon / AWS Graviton
The default (make with no overrides) is appropriate when the binary will be distributed across multiple CPU families or when the target CPU is genuinely unknown. Note that BASELINE_ARCH=avx512bw does not always win over avx2 even on AVX-512 hosts — see BASELINE_ARCH=avx512bw build flag for the empirical perf characterization.
See SIMD dispatch matrix for the full set of targets and the in-process dispatch architecture.
Mixing bwa-mem3 and bwa-mem2 outputs in the same pipeline
bwa-mem3 adds several custom SAM tags that bwa-mem2 does not emit: HN:i
(total number of primary alignments — both reported and suppressed — that the
aligner found for this read, before the -h supplementary cap is applied),
and — in --meth mode — the Bismark-compatible XR:Z (read conversion
direction), XG:Z (genome strand), and XM:Z (per-base methylation call
string) tags. It also rewrites @SQ header lines in --meth mode
(collapsing f/r strand prefixes back to one entry per chromosome).
Warning — Header and tag mismatch
Do not merge BAM files produced by bwa-mem3 and bwa-mem2 without verifying that the @PG headers and custom tags are handled correctly by the downstream tool. In methylation workflows, a bwa-mem2 BAM mixed into a bwa-mem3 --meth pipeline will be missing the XR:Z / XG:Z / XM:Z Bismark annotations, which will cause methylation callers to silently drop or misclassify those records.
If you must merge outputs from both tools, run samtools view -H on both
files and confirm that @SQ lines are consistent and that the downstream tool
can tolerate the tag differences.
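The @SQ comparison can be scripted rather than eyeballed. This sketch compares the @SQ lines of two files exactly, including their order, which is stricter than some merge tools require; sq_lines_match is a hypothetical helper.

```shell
# Return 0 if both files carry identical @SQ header lines, in the same order.
sq_lines_match() {
  a=$(samtools view -H "$1" | grep '^@SQ')
  b=$(samtools view -H "$2" | grep '^@SQ')
  [ "$a" = "$b" ]
}

# Usage:
#   sq_lines_match from_mem3.bam from_mem2.bam \
#     || echo "@SQ lines differ; do not merge" >&2
```

Matching @SQ lines say nothing about tag differences, so the downstream tool's tolerance for missing tags still has to be checked separately.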
Writing compressed BAM to a pipe
Passing --bam=1 (compressed BAM) when piping to samtools sort compresses
the stream on the bwa-mem3 side and then immediately decompresses it on the
samtools side. This wastes CPU on both ends with no benefit.
Use --bam=0 (uncompressed BAM) for all pipe-to-sort workflows. See
Output format for the full explanation and recommended
pipeline.
See also: Output format · Multi-sample workflows · Build · Quick start: shared-memory index · CLI Reference: shm
CLI Reference Overview
bwa-mem3 exposes four subcommands: index, mem, shm, and version. Run
bwa-mem3 <subcommand> --help to see the full option list for any command.
How this section is structured
Each subcommand page follows the same layout:
- Introduction — what the subcommand does and when to reach for it.
- Synopsis — the verbatim --help output, auto-captured from the binary at build time and included here via mdbook’s {{#include}} directive. The snippet is regenerated by make docs-cli and CI fails if it drifts from the binary.
- Common usage — two or three worked command-line examples.
- Flag reference (for mem, grouped by topic) — per-flag prose covering semantics, defaults, and interaction with other flags that the --help text does not have room to explain.
- Notes / Gotchas — operational warnings about non-obvious behavior.
- See also — cross-links to related pages in this book.
Subcommands
index builds the FM-index from a reference FASTA.
Pass --meth to produce a bwameth-style doubled c2t reference for
methylation alignment.
mem aligns short reads against an indexed reference, producing
SAM or BAM output. It is the primary alignment subcommand. The flag surface is
large; the mem reference page groups flags by purpose to make them easier to
navigate.
shm stages an FM-index into POSIX shared memory so that
repeated bwa-mem3 mem invocations on the same machine skip the per-run disk
read. It also lists and destroys staged segments.
version prints the bwa-mem3 release version, the SIMD floor, the compiled-in kernels, and the resolved runtime tier, plus the mimalloc version when mimalloc is compiled in.
See also: User Guide — Aligning short reads · User Guide — Indexing the reference · Getting Started — Quick start: align paired-end FASTQs · Getting Started — Quick start: shared-memory index · Performance — Tuning checklist
index
bwa-mem3 index builds the FM-index (BWT + suffix array) that bwa-mem3 mem
requires for alignment. Run it once per reference; the resulting files sit
alongside the input FASTA and are reused for all subsequent alignment jobs.
Pass --meth to produce a bwameth-compatible doubled c2t reference for
bisulfite-seq alignment.
Synopsis
Usage: bwa-mem3 index [-p prefix] [-t N] [--max-memory SIZE] [--tmp-dir PATH] [--meth] <in.fasta>
-p STR output prefix (default: <in.fasta>)
-t INT worker threads [auto: detected cores, cgroup-aware]
--max-memory SIZE peak memory budget; SIZE accepts a G/M/K suffix
(case-insensitive) or bare bytes
[auto: min(50% of RAM, 32G), cgroup-aware]
--tmp-dir PATH scratch directory [$TMPDIR]
--meth build a bwameth-style doubled c2t reference + FMI.
Writes <in.fasta>.bwameth.c2t and the FMI alongside it.
Use with `bwa-mem3 mem --meth <in.fasta> R1.fq [R2.fq]`.
-h, --help print this help message and exit
Common usage
Build a standard index using all available cores:
bwa-mem3 index ref.fa
Build a methylation-aware index (required before bwa-mem3 mem --meth):
bwa-mem3 index --meth ref.fa
Limit peak RAM to 16 GB and write scratch data to /scratch:
bwa-mem3 index --max-memory 16G --tmp-dir /scratch ref.fa
Flag reference
-p STR — output prefix
By default, index files are written alongside <in.fasta> using the FASTA
path as a prefix (e.g. ref.fa.bwt.2bit.64, ref.fa.0123, etc.). Use -p
to write them to a different base path, such as a dedicated index directory:
bwa-mem3 index -p /idx/hg38 ref.fa
# writes /idx/hg38.bwt.2bit.64, /idx/hg38.0123, …
# align with: bwa-mem3 mem /idx/hg38 R1.fq R2.fq
-t INT — worker threads
Controls the number of threads used during index construction. The default auto-detects available cores and is cgroup-aware, so it behaves correctly inside containers and on shared cluster nodes. Set explicitly when you want to cap CPU usage.
--max-memory SIZE — peak memory budget
Limits how much RAM the indexer may use at once. SIZE accepts a G, M, or
K suffix (case-insensitive) or a bare byte count. The default is
min(50% of RAM, 32 GB), computed in a cgroup-aware manner.
For large references (hg38 and above) on machines with limited RAM, setting
this to a value lower than the reference size causes the indexer to partition
work and use --tmp-dir for intermediate files, at the cost of extra I/O.
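To estimate what the default budget would be on a given host, the stated rule can be worked through by hand or with a small helper. Both functions below are hypothetical, and they assume binary multiples (1G = 1024³ bytes) and a 32 GiB cap; the text above does not say whether the suffixes are binary or decimal.

```shell
# Convert a SIZE string (G/M/K suffix, any case, or bare bytes) to bytes.
# Assumption: suffixes are binary multiples (K = 1024).
size_to_bytes() {
  size=$1
  case "$size" in
    *[Gg]) echo $(( ${size%?} * 1024 * 1024 * 1024 )) ;;
    *[Mm]) echo $(( ${size%?} * 1024 * 1024 )) ;;
    *[Kk]) echo $(( ${size%?} * 1024 )) ;;
    *)     echo "$size" ;;
  esac
}

# Model of the documented default: min(50% of RAM, 32G), in bytes.
# Argument: total RAM in bytes.
default_budget() {
  half=$(( $1 / 2 ))
  cap=$(size_to_bytes 32G)
  if [ "$half" -lt "$cap" ]; then echo "$half"; else echo "$cap"; fi
}

# Usage:
#   default_budget "$(size_to_bytes 16G)"    # host with 16G of RAM
```

Under these assumptions a 16G host gets an 8G default budget, while any host with 64G or more hits the 32G cap.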
--tmp-dir PATH — scratch directory
Scratch directory for intermediate files when memory is partitioned. Defaults
to $TMPDIR. Point this at a fast local disk (NVMe or ramdisk) to minimize
wall-clock time when --max-memory forces partitioned construction.
--meth — build a methylation (c2t) index
Writes a bwameth-style doubled reference — <in.fasta>.bwameth.c2t — and
builds the FM-index over that file rather than the original FASTA. The c2t
file and its index files are placed alongside the original FASTA.
Pass the original FASTA prefix (not the .bwameth.c2t path) to all three
commands: index, shm, and mem. The c2t suffix is appended automatically
when --meth is present.
Notes / Gotchas
Tip — Index once, align many times
Index construction for hg38 takes several minutes and ~28 GB of disk. Build the index once and store it on shared storage; all alignment jobs on the same reference share the same index files.
Warning — --meth index is not interchangeable with the standard index
A --meth index is built over the c2t reference and cannot be used for normal (non-bisulfite) alignment. Keep separate index directories if you align both standard and bisulfite samples to the same reference.
See also: User Guide — Indexing the reference · CLI Reference — mem · CLI Reference — shm · Getting Started — Quick start: methylation alignment · Methylation Reference — Overview
mem
bwa-mem3 mem aligns short DNA reads against an indexed reference genome
using the BWA-MEM algorithm. It accepts one or two FASTQ files (single-end or
paired-end) and writes alignments to stdout in SAM or BAM format. It is the
primary alignment subcommand; nearly all bwa-mem3 usage flows through it.
Synopsis
Usage: bwa-mem3 mem [options] <idxbase> <in1.fq> [in2.fq]
Options:
Algorithm options:
-o STR Output SAM file name
--bam[=N] Emit BAM instead of SAM text. N=0 (default) = uncompressed;
1..9 = BGZF deflate levels. Writes to stdout; redirect with `>`.
-t INT number of threads [1]
-k INT minimum seed length [19]
-w INT band width for banded alignment [100]
-d INT off-diagonal X-dropoff [100]
-r FLOAT look for internal seeds inside a seed longer than {-k} * FLOAT [1.5]
-y INT seed occurrence for the 3rd round seeding [20]
-c INT skip seeds with more than INT occurrences [500]
-D FLOAT drop chains shorter than FLOAT fraction of the longest overlapping chain [0.50]
-W INT discard a chain if seeded bases shorter than INT [0]
-m INT perform at most INT rounds of mate rescues for each read [50]
-S skip mate rescue
-P skip pairing; mate rescue performed unless -S also in use
Scoring options:
-A INT score for a sequence match, which scales options -TdBOELU unless overridden [1]
-B INT penalty for a mismatch [4]
-O INT[,INT] gap open penalties for deletions and insertions [6,6]
-E INT[,INT] gap extension penalty; a gap of size k cost '{-O} + {-E}*k' [1,1]
-L INT[,INT] penalty for 5'- and 3'-end clipping [5,5]
-U INT penalty for an unpaired read pair [17]
Input/output options:
-p smart pairing (ignoring in2.fq)
-R STR read group header line such as '@RG\tID:foo\tSM:bar' [null]
-H STR/FILE insert STR to header if it starts with @; or insert lines in FILE [null]
-j treat ALT contigs as part of the primary assembly (i.e. ignore <idxbase>.alt file)
-5 for split alignment, take the alignment with the smallest coordinate as primary
-q don't modify mapQ of supplementary alignments
-K INT process INT input bases in each batch regardless of nThreads (for reproducibility) []
-v INT verbose level: 1=error, 2=warning, 3=message, 4+=debugging [3]
-T INT minimum score to output [30]
-h INT[,INT] if there are <INT hits with score >80.00% of the max score, output all in XA [5,200]
-z FLOAT the fraction of the max score to use with -h [0.80]
-u output XB instead of XA; XB is XA with the alignment score and mapping quality added
-a output all alignments for SE or unpaired PE
-C append FASTA/FASTQ comment to SAM output
-V output the reference FASTA header in the XR tag
-Y use soft clipping for supplementary alignments
-M mark shorter split hits as secondary
-I FLOAT[,FLOAT[,INT[,INT]]]
specify the mean, standard deviation (10% of the mean if absent), max
(4 sigma from the mean if absent) and min of the insert size distribution.
FR orientation only. [inferred]
Bisulfite (--meth) options:
--meth enable inline bwameth-style C→T/G→A read conversion + meth-aware BAM
emission. Implies --bam. Requires the reference to have been built
with `bwa-mem3 index --meth` (emits ref.fa.bwameth.c2t).
--set-as-failed f|r
flag alignments to the matching strand ('f' or 'r') as QC-fail (0x200)
--chimera-qc
enable the bwameth.py-style longest-match <44% chimera heuristic
(sets 0x200, clears 0x2, caps MAPQ at 1). Off by default; not in Bismark.
Supplementary MAPQ rescoring (fg-labs extension):
--supp-rep-hard-cap INT
force MAPQ=0 for supplementary alignments whose chain contains any seed
with >=INT genome occurrences (i.e. the supp region is repetitive on its
own). 0 disables (default). Typical values 5-20; lower = more aggressive.
Primary MAPQ is unaffected.
Help:
--help print this help message and exit
Note: Please read the man page for detailed description of the command line and options.
Common usage
Paired-end alignment, 16 threads, SAM to stdout:
bwa-mem3 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > out.sam
Paired-end alignment, emit uncompressed BAM, pipe directly to samtools sort:
bwa-mem3 mem --bam -t 16 ref.fa R1.fq.gz R2.fq.gz \
| samtools sort -@ 8 -o out.bam -
samtools index out.bam
Paired-end methylation alignment with a read group header:
bwa-mem3 mem --meth -t 16 \
-R '@RG\tID:lib1\tSM:sample1\tPL:ILLUMINA' \
ref.fa R1.fq.gz R2.fq.gz \
| samtools sort -o out.bam -
Flag reference
Input / output
-o STR — output file
Write output to STR instead of stdout. Honored for both SAM and --bam
output; the path is opened lazily so BAM mode can hand it to htslib instead of
truncating it as a SAM-text file. Stdout redirection (>) remains an
alternative.
--bam[=N] — emit BAM
Emit BAM instead of SAM. N controls BGZF compression: 0 (default when
--bam is used without =) writes uncompressed BAM, which costs almost no
CPU and is the recommended mode for piping to samtools sort. Values 1–9
select increasing BGZF deflate levels; use --bam=6 or --bam=9 only when
writing directly to final storage without a downstream sort step.
Tip — Prefer --bam for production pipelines
Uncompressed BAM (--bam or --bam=0) eliminates the text-formatting cost on the aligner side and the text-parse cost on the samtools sort side. For any pipeline that immediately sorts or processes the output, this is faster than SAM at no quality cost.
-R STR — read group header
Injects a @RG header line and tags every alignment with RG:Z:<ID>. The
value is a tab-separated @RG line with literal \t escapes, for example:
-R '@RG\tID:run1\tSM:HG001\tPL:ILLUMINA\tLB:lib1'
bwa-mem3 escapes any literal tab characters inside -R values before writing
them to the @PG CL: field, preventing header corruption (fix for issue #45).
-H STR/FILE — extra header lines
If STR begins with @, it is injected verbatim as a header line. Otherwise
STR is treated as a path and every line in the file is injected. Useful for
adding @CO comments or custom @RG / @PG entries.
-p — smart pairing
Reads interleaved paired-end data from a single FASTQ file (in1.fq) rather
than two separate files. The second positional argument (in2.fq) is ignored.
-5 — leftmost-coordinate primary
For split alignments, designates the alignment with the smallest genomic coordinate as primary, rather than the longest alignment. Useful for some downstream tools that expect the leftmost alignment to be primary.
-q — preserve supplementary MAPQ
By default, bwa-mem3 may downgrade the MAPQ of supplementary alignments.
-q suppresses that adjustment.
-K INT — fixed batch size
Forces each thread batch to process exactly INT input bases regardless of
the number of threads. Useful when you need bit-for-bit reproducible output
across runs with different -t values: fix -K to the same value and the
output is deterministic.
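A quick way to check this on your own data is to align the same input at two thread counts and compare checksums. In this sketch the -K value and thread counts are arbitrary example choices, and @PG lines are excluded on the assumption that they record the command line (which differs in -t); same_output_across_threads is a hypothetical helper.

```shell
# Return 0 if SAM output is identical at -t 4 and -t 16 for a fixed -K.
# Arguments: <idxbase> <in1.fq> <in2.fq>
same_output_across_threads() {
  idx=$1; r1=$2; r2=$3
  a=$(bwa-mem3 mem -K 100000000 -t 4  "$idx" "$r1" "$r2" | grep -v '^@PG' | cksum)
  b=$(bwa-mem3 mem -K 100000000 -t 16 "$idx" "$r1" "$r2" | grep -v '^@PG' | cksum)
  [ "$a" = "$b" ]
}

# Usage (a small subsample keeps the check fast):
#   same_output_across_threads ref.fa sub_R1.fq.gz sub_R2.fq.gz \
#     && echo "deterministic across -t"
```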
-v INT — verbosity
Controls stderr diagnostic output: 1 = errors only, 2 = warnings,
3 = informational messages (default), 4+ = debugging.
-a — all alignments
Output all alignments for single-end or unpaired paired-end reads, including secondary alignments. Equivalent to enabling secondary-alignment reporting.
-C — append FASTA/FASTQ comment
Appends the comment field from the FASTA/FASTQ header to the SAM output as an additional column. Useful when the comment carries barcodes or UMIs.
-V — reference header in XR tag
Emits the reference FASTA header line for each alignment position as an XR
SAM tag.
-Y — soft-clip supplementary alignments
Uses soft clipping instead of hard clipping for supplementary alignments. Some downstream tools require this.
-M — mark shorter split hits as secondary
Marks the shorter alignment in a split read as secondary (sets 0x100 flag)
rather than supplementary. Required for compatibility with tools that do not
handle supplementary alignments (e.g. Picard’s duplicate-marking before
certain versions).
-j — treat ALT contigs as primary
Treats ALT contigs as part of the primary assembly by ignoring the
<idxbase>.alt file. Use when your workflow does not include ALT-aware
postprocessing.
Scoring
All scoring flags accept integer values. Changing -A (match score) scales
the penalty flags that default to multiples of -A; explicit overrides of
individual flags are unaffected.
| Flag | Default | Meaning |
|---|---|---|
| -A INT | 1 | Score for a sequence match. Scales -T, -d, -B, -O, -E, -L, -U unless overridden. |
| -B INT | 4 | Mismatch penalty. |
| -O INT[,INT] | 6,6 | Gap open penalty for deletions and insertions respectively. |
| -E INT[,INT] | 1,1 | Gap extension penalty per base. A gap of length k costs O + E * k, where O and E are the -O and -E values. |
| -L INT[,INT] | 5,5 | Clipping penalty for 5’ and 3’ ends. |
| -U INT | 17 | Penalty for an unpaired read pair (affects mate-rescue scoring). |
| -T INT | 30 | Minimum alignment score to output. Alignments below this threshold are not reported. |
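The gap cost formula in the -E row can be checked with a few lines of arithmetic. This is only a worked example of O + E * k with the table defaults; the function name gap_cost is made up for illustration.

```python
def gap_cost(length, gap_open=6, gap_extend=1):
    """Affine gap penalty O + E * k for a gap of `length` bases (table defaults)."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + gap_extend * length


for k in (1, 3, 10):
    print(k, gap_cost(k))
```

With the defaults, a 1-base gap costs 7 and a 10-base gap costs 16, so a single long gap is much cheaper than the same number of bases spread over several gaps.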
Note — --meth overrides scoring defaults
When --meth is active, bwa-mem3 applies bwameth.py-compatible defaults: -B 2 -L 10 -U 100 -T 40 -CM. Any of these can still be overridden by passing the flag explicitly after --meth.
Paired-end
-I FLOAT[,FLOAT[,INT[,INT]]] — insert size distribution
Specifies the mean, standard deviation (default: 10% of mean), maximum (default: 4 sigma above mean), and minimum of the insert size distribution for FR-orientation paired-end reads. By default bwa-mem3 infers these parameters from the first batch of reads. Provide them explicitly for speed or when the reference is short and inference may be inaccurate.
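The documented defaults for the omitted -I fields can be illustrated as follows. This sketch covers only the mean, standard deviation, and maximum; the default for the minimum is not stated here, so it is left out. The rounding to the nearest integer is an assumption, and insert_size_params is a hypothetical name.

```python
def insert_size_params(mean, std=None, max_size=None):
    """Fill in the documented -I defaults: std = 10% of mean, max = mean + 4 * std."""
    if std is None:
        std = mean / 10
    if max_size is None:
        # Rounding to the nearest integer is an assumption of this sketch.
        max_size = int(mean + 4 * std + 0.5)
    return mean, std, max_size


print(insert_size_params(350))        # only the mean given
print(insert_size_params(350, 50))    # mean and std given
```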
-m INT — mate rescue rounds
Maximum number of mate-rescue attempts per read. Reduce to speed up alignment on data where the default (50) wastes time on unrescuable pairs.
-S — skip mate rescue
Disables mate rescue entirely. Faster but may reduce sensitivity for discordant pairs.
-P — skip pairing
Skips the pairing step; mate rescue still runs unless -S is also given.
Filtering
-c INT — skip repetitive seeds
Seeds with more than INT occurrences in the reference are skipped. Lowering
this (e.g. to 50) speeds up alignment of highly repetitive reads but may
reduce sensitivity. Raising it increases sensitivity in repeat-heavy regions
at a cost in runtime.
-D FLOAT — chain length fraction
Drops chains shorter than FLOAT times the longest overlapping chain. The
default (0.50) discards chains that are less than half the length of the best
chain.
-W INT — minimum seeded bases
Discards chains with fewer than INT seeded bases. Raising this filters out
very short, low-confidence chains.
-h INT[,INT] — secondary alignment reporting
If there are fewer than INT hits with score exceeding FLOAT (see -z)
times the maximum score, all of them are output in the XA auxiliary tag.
The second integer is a hard cap on the number of XA entries. Defaults: 5, 200.
-z FLOAT — secondary score fraction
Fraction of the maximum alignment score used as the threshold for secondary
hit reporting with -h. Default: 0.80.
-u — emit XB instead of XA
Outputs XB in place of XA. XB is an extension of XA that also carries
the alignment score and mapping quality for each secondary hit.
Methylation (--meth)
--meth — enable bisulfite alignment mode
Activates inline C→T (R1) and G→A (R2) read conversion, bwameth-compatible
scoring defaults, inline BAM post-processing, and forces --bam output.
The reference must have been indexed with bwa-mem3 index --meth.
Pass the original FASTA prefix as <idxbase> — the .bwameth.c2t suffix is
appended automatically. If <idxbase> already ends in .bwameth.c2t
(interop with an external c2t converter), the auto-append is skipped.
See Methylation Reference for the full treatment.
--set-as-failed {f|r} — strand QC-fail flag
Forces the QC-fail bit (0x200) on all alignments to the forward (f) or
reverse (r) bisulfite strand. Used when one strand is known to be
unreliable for a given library preparation.
--chimera-qc — opt in to bwameth.py-style chimera heuristic
Off by default (matches Bismark, which has no equivalent heuristic).
When set, mapped records whose longest M/=/X CIGAR run is less than 44 % of
the read length get 0x200 set, 0x2 cleared, and MAPQ capped at 1. Useful
for PBAT / scBS-Seq libraries where intra-fragment chimerism is common, or
when reproducing bwameth.py output bit-for-bit.
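The heuristic described above can be sketched as a small function over the SAM flag, MAPQ, and CIGAR string. This is an illustration under stated assumptions, not the bwa-mem3 implementation: each M, = or X operation is treated as its own run, the read length is passed in by the caller, and the names longest_match_run and apply_chimera_qc are hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def longest_match_run(cigar):
    """Length of the longest single M, = or X operation in a CIGAR string."""
    runs = (int(n) for n, op in CIGAR_OP.findall(cigar) if op in "M=X")
    return max(runs, default=0)


def apply_chimera_qc(flag, mapq, cigar, read_len, min_fraction=0.44):
    """Return (flag, mapq) after applying the documented chimera rule."""
    if flag & 0x4:
        return flag, mapq            # unmapped records are left alone
    if longest_match_run(cigar) < min_fraction * read_len:
        flag |= 0x200                # set QC-fail
        flag &= ~0x2                 # clear proper-pair
        mapq = min(mapq, 1)          # cap MAPQ at 1
    return flag, mapq


print(apply_chimera_qc(99, 60, "40M110S", 150))
print(apply_chimera_qc(99, 60, "100M50S", 150))
```

For a 150 bp read the threshold is 66 bases, so the 40M record is flagged and the 100M record passes through unchanged.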
Threading
-t INT — number of threads
Number of worker threads. Defaults to 1. Set to the number of physical cores available to this job. Scaling is workload- and hardware-dependent: on typical machines the curve flattens around 16–32 threads (FM-index bandwidth and I/O contention dominate); on high-memory / fast-I/O servers the aligner can keep scaling toward ~64 threads on hg38 before saturating. See the threading guide for measured guidance and per-machine recommendations.
See User Guide — Threading and resource use for guidance on thread counts at various machine sizes.
Supplementary MAPQ rescoring
--supp-rep-hard-cap INT — cap MAPQ for repetitive supplementary alignments
Forces MAPQ=0 for supplementary alignments whose chain contains any seed with
at least INT occurrences in the genome. This targets supplementary
alignments anchored in repetitive regions that upstream MAPQ scoring may
overestimate. 0 disables the cap (default). Typical values are 5–20; lower
values are more aggressive. Primary alignment MAPQ is unaffected.
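The rule above reduces to a single predicate. The sketch below assumes the per-seed occurrence counts of the chain are available as a list; the function name and argument layout are hypothetical and only mirror the prose.

```python
def cap_supplementary_mapq(mapq, is_supplementary, seed_occurrences, hard_cap=0):
    """Return the MAPQ after applying a --supp-rep-hard-cap style rule.

    hard_cap == 0 disables the cap; primary alignments are never touched.
    """
    if hard_cap > 0 and is_supplementary:
        if any(occ >= hard_cap for occ in seed_occurrences):
            return 0
    return mapq


print(cap_supplementary_mapq(37, True, [1, 3, 12], hard_cap=10))
print(cap_supplementary_mapq(37, False, [1, 3, 12], hard_cap=10))
```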
Debug
-k INT — minimum seed length
Minimum exact-match seed length. Shorter seeds increase sensitivity but raise runtime. The default (19) is calibrated for 100–150 bp Illumina reads.
-w INT — band width
Band width for the banded Smith-Waterman extension. Wider bands can recover alignments with long indels at greater CPU cost.
-d INT — X-dropoff
Off-diagonal X-dropoff for the Z-drop heuristic. Controls how far an alignment extension continues after a score drop.
-r FLOAT — re-seeding factor
Seeds longer than -k * FLOAT are re-seeded internally to find sub-seeds.
Lowering this produces more seeds and higher sensitivity at greater cost.
-y INT — third-round seed occurrence threshold
Seed occurrence threshold for the third round of seeding. Rarely needs adjustment outside highly repetitive genomes.
Notes / Gotchas
Warning — --meth requires a --meth index
Running bwa-mem3 mem --meth against a standard (non-c2t) index produces incorrect alignments without an error. Confirm that the index was built with bwa-mem3 index --meth before aligning bisulfite data.
Note — SIMD variant printed to stderr at startup
When mem starts it prints a banner (Executing in AVX512 mode!! etc.) to stderr. This is informational and does not affect stdout output.
See also: User Guide — Aligning short reads · User Guide — Output: SAM/BAM, headers, tags · CLI Reference — index · Methylation Reference — Overview · Best Practices — Output format
shm
bwa-mem3 shm stages an FM-index into POSIX shared memory so that subsequent
bwa-mem3 mem invocations on the same machine attach to the in-memory segment
instead of re-reading the index files from disk. For workloads that align many
small samples back-to-back against the same reference — such as clinical
panels or amplicon sequencing — this removes the dominant I/O bottleneck.
shm also lists and destroys staged segments.
Synopsis
Usage: bwa-mem3 shm [-d|-l|--help] [--meth] [idxbase]
Options:
-d destroy all indices in shared memory (matches bwa v1 behavior)
-l list names of indices in shared memory
--meth stage a `bwa-mem3 index --meth` index — auto-appends
`.bwameth.c2t` to <idxbase>, mirroring `mem --meth`
-h --help print this help and exit
Stage with no flags: `bwa-mem3 shm <idxbase>` loads the index into
POSIX shared memory; subsequent `bwa-mem3 mem <idxbase> ...` runs
auto-attach instead of re-reading from disk. For meth indices, pass
the same plain `<idxbase>` to all three commands plus `--meth` on
`index`, `shm`, and `mem` (the c2t suffix is auto-appended).
Footgun: if you re-build the index, run `bwa-mem3 shm -d` first.
There is no staleness check -- a stale segment will silently mis-align.
Stuck-lock recovery: concurrent stagers are serialized by a named
POSIX semaphore. If a stager is kill -9'd mid-stage, the lock
persists and subsequent stages block forever. `bwa-mem3 shm -d`
unlinks the semaphore alongside the registry; rerun afterwards.
macOS: POSIX shm has implementation-defined per-segment caps; large
indices may simply fail to stage. Prefer Linux for production.
Linux: /dev/shm defaults to ~50% of RAM on bare metal; in containers
it is often much smaller and may need raising via --shm-size
(Docker) or an emptyDir tmpfs (Kubernetes).
Common usage
Stage a standard index, align two samples, then release the segment:
bwa-mem3 shm ref.fa
bwa-mem3 mem -t 16 ref.fa sample1_R1.fq sample1_R2.fq > sample1.sam
bwa-mem3 mem -t 16 ref.fa sample2_R1.fq sample2_R2.fq > sample2.sam
bwa-mem3 shm -d
Stage a methylation index and align:
bwa-mem3 shm --meth ref.fa
bwa-mem3 mem --meth -t 16 ref.fa R1.fq R2.fq | samtools sort -o out.bam -
bwa-mem3 shm -d
List all currently staged segments:
bwa-mem3 shm -l
Flag reference
(no flags) <idxbase> — stage an index
Loads all index files for <idxbase> into a POSIX shared-memory segment.
After staging, any bwa-mem3 mem <idxbase> ... on the same machine
auto-attaches and reads from memory rather than disk.
-d — destroy all segments
Removes every bwa-mem3 shared-memory segment on the machine. This is the correct clean-up command after a batch job and the required step before re-building the index (see the footgun warning below).
-l — list staged indices
Prints the names of all currently staged segments. Useful to confirm that staging succeeded before launching alignment jobs.
--meth — stage a methylation index
Auto-appends .bwameth.c2t to <idxbase> before staging, mirroring the
behavior of bwa-mem3 index --meth and bwa-mem3 mem --meth. Pass the
same plain <idxbase> to all three commands; the c2t suffix is handled
transparently.
Notes / Gotchas
Warning — No staleness check: always destroy before re-indexing
There is no staleness check. If you re-run bwa-mem3 index ref.fa after staging, the on-disk index files will not match the in-memory segment, but bwa-mem3 mem will still attach to the stale segment and silently produce incorrect alignments. Always run bwa-mem3 shm -d before re-indexing.
Note — Platform limits
macOS: POSIX shared memory has implementation-defined per-segment size caps. Staging a full hg38 index (~28 GB) may fail silently or with a cryptic error. Prefer Linux for production use with large references.
Linux containers: /dev/shm typically defaults to ~50% of physical RAM on bare metal but is often much smaller inside Docker containers or Kubernetes pods. Raise the limit with --shm-size (Docker) or an emptyDir tmpfs volume with an explicit size (Kubernetes) before attempting to stage a large index.
Note — /dev/shm capacity preflight (PR #86)
Before opening the segment, bwa-mem3 shm calls statvfs("/dev/shm") and compares the available bytes against the index’s total_size. If /dev/shm is too small the stage aborts cleanly with an [E::bwa_shm_stage] message that names /dev/shm, the required size, and a mount -o remount,size=... hint. This replaces the previous failure mode where ftruncate succeeded lazily and pack_into later surfaced ENOSPC as [fread] Bad address with no indication that /dev/shm was the cause. The preflight is best-effort: a statvfs failure (no /dev/shm, restricted sandbox, ENOSYS) is non-fatal and the stage proceeds. As a rough sizing guide, hg38 stages ~17 GB; AWS instances default to RAM/2 (so c7a.4xlarge / c7i.4xlarge at 32 GB get ~16 GB of /dev/shm, which is just under the index size; a remount,size=28g is the documented fix).
Note — Stuck-lock recovery
Concurrent bwa-mem3 shm <prefix> invocations are serialized by a named POSIX semaphore (/bwactl_lock) so the registry stays consistent. POSIX semaphores have no SEM_UNDO equivalent: if a stager segfaults or is kill -9’d while holding the lock, every subsequent stage will block in sem_wait forever. Run bwa-mem3 shm -d to recover: it unlinks the semaphore alongside the registry, freeing the next stager.
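A job script can run a similar capacity check before calling bwa-mem3 shm. The sketch below mirrors the statvfs comparison described in the preflight note; it is not the bwa-mem3 code. Computing the available space as f_bavail * f_frsize is an assumption, and shm_preflight with its injectable statvfs parameter is a hypothetical helper written this way so it can be exercised without a real /dev/shm.

```python
import os


def shm_preflight(required_bytes, path="/dev/shm", statvfs=None):
    """Return True if `path` has room, False if not, None if it cannot be checked."""
    statvfs = statvfs or os.statvfs
    try:
        stats = statvfs(path)
    except OSError:
        # Best-effort: a failed statvfs is treated as "unknown", not as an error.
        return None
    available = stats.f_bavail * stats.f_frsize
    return available >= required_bytes


# Example: would a 17 GB segment fit right now?
print(shm_preflight(17 * 1024**3))
```

A None result corresponds to the documented non-fatal case, where staging proceeds without the check.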
See also: Getting Started — Quick start: shared-memory index · CLI Reference — index · CLI Reference — mem · Best Practices — Multi-sample workflows · Best Practices — Anti-patterns
version
bwa-mem3 version prints the release version, the build’s compiled-in
SIMD floor, the SIMD tier resolved at runtime, and (when mimalloc is
compiled in) the mimalloc version. It is the canonical way to confirm
which build is on PATH, what host class it requires, and what kernel
path it will dispatch to.
bwa-mem3 version always exits 0 — even on a host below the build’s
SIMD floor — so operators can introspect a binary on a host that cannot
actually run alignment. bwa-mem3 <subcommand> --help and -h share
the same property.
Synopsis
mimalloc 3.3.0
v<MAJOR.MINOR>-<N>-g<COMMIT>
Common usage
Confirm the installed version, SIMD floor, and resolved tier:
./bwa-mem3 version
A typical run on an AVX-512BW host with the default BASELINE_ARCH=avx2
build prints (mimalloc line on stderr, the rest on stdout — order in a
merged stream is not guaranteed):
v0.2.0-12-gabcdef1
SIMD floor: avx2 (x86-64-v3, Haswell 2013+); kernels: sse41 sse42 avx avx2 avx512bw
SIMD runtime: avx512bw (BWAMEM3_FORCE_TIER unset)
mimalloc 3.3.0
- version line: bwa-mem3’s release string, derived from git describe at build time and stored as PACKAGE_VERSION in the binary. When building from a tarball without git history, the fallback value is set via FG_LABS_VERSION_FALLBACK at compile time.
- SIMD floor: the compile-time minimum the binary requires. Set by BASELINE_ARCH (default avx2) and listed alongside the per-tier kernel set the binary carries.
- SIMD runtime: the tier resolved at startup by __builtin_cpu_supports (or BWAMEM3_FORCE_TIER if set in the environment). On arm64 this is always neon.
- mimalloc line: present only when USE_MIMALLOC=1 (the default).
On a host below the SIMD floor, version also writes a
[W::bwa-mem3] warning on stderr identifying the gap (and the
alignment subcommands will refuse to run with exit code 2 — see
Host requirements for the
exit-2 message format and rebuild instructions).
Notes / Gotchas
Tip — version | grep is safe in CI
The version, SIMD floor:, and SIMD runtime: lines all go to stdout; the mimalloc line and any host-below-floor warning go to stderr. So bwa-mem3 version | grep '^SIMD' works in CI scripts even on hosts that cannot run alignment. Use 2>/dev/null to suppress the mimalloc and warning lines if you want stdout only.
Tip — No mimalloc line means USE_MIMALLOC=0
If no mimalloc line appears, the binary was built without the bundled allocator (make USE_MIMALLOC=0). See User Guide — Memory allocator (mimalloc) for when this is appropriate.
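For CI checks that need more than grep, the stdout lines can be parsed into a small dictionary. The sketch below assumes the line layout of the sample output shown earlier on this page and nothing more; parse_version_output is a hypothetical helper, and the sample text is copied from that example.

```python
def parse_version_output(stdout_text):
    """Extract release string, SIMD floor and runtime tier from version stdout."""
    info = {"version": None, "simd_floor": None, "simd_runtime": None}
    for line in stdout_text.splitlines():
        line = line.strip()
        if line.startswith("SIMD floor:"):
            info["simd_floor"] = line.split(":", 1)[1].split()[0]
        elif line.startswith("SIMD runtime:"):
            info["simd_runtime"] = line.split(":", 1)[1].split()[0]
        elif line.startswith("v") and info["version"] is None:
            info["version"] = line
    return info


sample = (
    "v0.2.0-12-gabcdef1\n"
    "SIMD floor: avx2 (x86-64-v3, Haswell 2013+); kernels: sse41 sse42 avx avx2 avx512bw\n"
    "SIMD runtime: avx512bw (BWAMEM3_FORCE_TIER unset)\n"
)
print(parse_version_output(sample))
```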
See also: User Guide — Memory allocator (mimalloc) · Developer Guide — Release process · Getting Started — Installation · What’s Different — Build & infrastructure
Methylation Reference Overview
bwa-mem3 --meth is a single-binary, single-command drop-in replacement for
the bwameth.py bisulfite-sequencing
alignment pipeline. No Python installation, no piped preprocessing step, and no
separate post-processing script — one bwa-mem3 index --meth builds the
reference, and one bwa-mem3 mem --meth aligns and post-processes reads from
raw FASTQ to sorted-ready BAM.
The output BAM is structurally equivalent to what the bwameth.py pipeline
produces: consolidated @SQ headers (one entry per real chromosome rather
than one per doubled-reference contig), Bismark-compatible XR:Z (read
conversion CT/GA), XG:Z (genome strand CT/GA), and XM:Z
(per-base methylation call string) auxiliary tags, optional chimera QC flags
(--chimera-qc, off by default to match Bismark), and a
@PG ID:bwa-mem3-meth provenance entry. Every Bismark-native tool
(bismark_methylation_extractor, methylKit, methtuple, DMRfinder,
epialleleR), MethylDackel, and biscuit’s per-read methylation tools read
the BAM directly without conversion.
Pipeline at a glance
The diagram below shows the internal flow when bwa-mem3 mem --meth runs.
Every step executes inside the single process; no external programs or temporary
files are required.
flowchart LR
A[Raw FASTQ\nR1 / R2] -->|inline C→T / G→A| B[c2t-converted reads\n+ internal YS/YC carrier]
B -->|bwa mem core| C[mem_aln_t\nalignments vs doubled ref]
C -->|chrom map\nf/r → real chr| D[header rewrite\n@SQ consolidated]
D -->|XR/XG/XM Bismark tags\noptional --chimera-qc\nQC-fail propagation| E[BAM output\nwb0 uncompressed]
Steps:
- FASTQ ingest with inline c2t conversion. R1 bases have every C replaced with T; R2 bases have every G replaced with A. The original bases and conversion direction are kept on an internal carrier on each read (in bseq1_t.comment); they are never emitted to BAM as tags themselves but feed the BAM-write step (SEQ restoration, XR:Z derivation). This conversion happens in-memory; the FASTQ is never written to disk in converted form.
- Alignment against the doubled reference. The converted reads are aligned against the ref.fa.bwameth.c2t reference, which contains both a forward C→T projection (f-prefixed contigs) and a reverse G→A projection (r-prefixed contigs) of each chromosome.
- Header rewriting and chrom consolidation. The f/r-prefixed contig names used internally are collapsed: every pair fchr1/rchr1 becomes a single @SQ SN:chr1 entry in the output BAM header. RNAME and RNEXT fields in each record are rewritten to the consolidated name.
- Tag emission and QC. Each aligned record receives Bismark-compatible XR:Z (read conversion direction), XG:Z (genome strand), and XM:Z (per-base methylation call string) auxiliary tags. With opt-in --chimera-qc (off by default, matching Bismark), records whose longest M/=/X CIGAR run covers less than 44 % of the read length are flagged 0x200; QC-fail flags then propagate across all records in a read group. The original pre-c2t sequence is copied back into the BAM SEQ field so methylation callers see real cytosines rather than the converted sequence.
- BAM output. Records are written as uncompressed BAM (wb0 mode via htslib). The @PG ID:bwa-mem3-meth line records the exact command line. The caller pipes directly to samtools sort.
Quick-start commands
# Index the reference once (builds ref.fa.bwameth.c2t + FMI)
bwa-mem3 index --meth ref.fa
# Align paired-end FASTQs
bwa-mem3 mem --meth -t 16 ref.fa R1.fq.gz R2.fq.gz \
| samtools sort -o out.bam
samtools index out.bam
Note — bwameth.py compatibility
The default scoring parameters applied by --meth (-B 2 -L 10 -U 100 -T 40 -CM) match those used by bwameth.py so outputs are comparable. Any parameter can be overridden on the command line.
See also: bwameth.py drop-in mapping · Conversion details · SAM tags: XR, XG, XM · Chimera QC and header rewriting · Quick start: methylation alignment
bwameth.py Drop-In Mapping
bwa-mem3 --meth is designed to produce output that is equivalent to the
bwameth.py pipeline for the standard paired-end case. This page explains what
changes between the two approaches and what stays the same.
Command comparison
bwameth.py pipeline (multi-step)
# Step 1: build a doubled reference with bwameth.py
bwameth.py index ref.fa # writes ref.fa.bwameth.c2t + bwa-mem2 FMI
# Step 2: align (bwameth.py converts reads, calls bwa-mem2, post-processes)
bwameth.py map --bwa-mem2 -t 16 ref.fa R1.fq.gz R2.fq.gz \
| samtools sort -o out.bam
samtools index out.bam
bwa-mem3 --meth (single binary)
# Step 1: build the doubled reference with bwa-mem3
bwa-mem3 index --meth ref.fa # same ref.fa.bwameth.c2t layout as bwameth.py
# Step 2: align (inline c2t conversion + post-processing, no Python)
bwa-mem3 mem --meth -t 16 ref.fa R1.fq.gz R2.fq.gz \
| samtools sort -o out.bam
samtools index out.bam
The index files produced by bwa-mem3 index --meth and bwameth.py index are
identical in layout: the same ref.fa.bwameth.c2t doubled-reference FASTA
followed by the bwa-mem2 FM-index files (.bwt.2bit.64, .0123, .pac,
.amb, .ann).
What is gained
No Python or bwameth.py dependency. The entire pipeline — read conversion,
alignment, and BAM post-processing — runs inside a single bwa-mem3 process.
This simplifies deployment: one binary, no virtual environment, no version
pinning of bwameth.py.
No intermediate files. bwameth.py writes a converted FASTQ (or pipes it)
before handing off to the aligner. bwa-mem3 --meth performs the C→T / G→A
conversion in-memory on each read batch before passing it to the alignment
kernel. No temporary FASTQ is written and no extra pipe stage is needed.
Inline BAM post-processing. Header rewriting, Bismark XR:Z / XG:Z /
XM:Z tag emission, opt-in chimera QC (--chimera-qc), and QC-fail
propagation all happen inside the same process and the same pass over the
alignments. There is no separate post-processing step. Output is
written as uncompressed BAM (wb0) — a near-zero-cost format that downstream
samtools sort reads natively.
Same flag defaults. --meth applies -B 2 -L 10 -U 100 -T 40 -CM
automatically, matching bwameth.py’s default scoring. All parameters can be
overridden.
What stays the same
The output BAM is field-compatible with bwameth.py output for headers, flags,
chimera QC behavior, and SEQ representation. Two fields intentionally differ:
the methylation aux tag set and the @PG provenance line (see below):
| Field | bwameth.py | bwa-mem3 --meth |
|---|---|---|
| @SQ headers | One per real chromosome | One per real chromosome |
| Methylation aux tags | YS:Z, YC:Z, YD:Z (bwameth) | XR:Z, XG:Z, XM:Z (Bismark-compatible) |
| @PG | ID:bwameth | ID:bwa-mem3-meth |
| Chimera QC threshold | Longest M < 44% of read | Same (44%), opt-in via --chimera-qc |
| Chimera QC flags | 0x200, clear 0x2, MAPQ ≤ 1 | Same |
| SEQ field | Pre-c2t bases (RC-flipped when is_rev) | Same |
The @PG ID: is intentionally different so provenance is unambiguous.
bwa-mem3 --meth emits the Bismark-compatible XR:Z / XG:Z /
XM:Z tag set rather than the bwameth-style YS:Z / YC:Z / YD:Z
set, which means the output is directly consumable by
bismark_methylation_extractor, methylKit, methtuple, DMRfinder, and
epialleleR in addition to MethylDackel and biscuit. Downstream tools
that read YS:Z / YC:Z / YD:Z will not find those tags and must be
pointed at the corresponding XR:Z / XG:Z (and the per-base XM:Z
methylation call string) instead.
Info — End-to-end regression coverage
PR #13 includes a three-layer regression test that verifies 100% chrom+pos match, 100% CIGAR match, and byte-identical SEQ across 92,684 paired-end records compared to a bwameth.py reference run.
When to prefer bwameth.py
If your workflow requires bwameth.py-specific features (e.g. bwameth.py markduplicates or non-standard bwameth.py post-processors), continue using
bwameth.py. bwa-mem3 --meth targets the indexing + alignment + standard
post-processing path only.
See also: Overview · Conversion details · SAM tags: XR, XG, XM · Chimera QC and header rewriting · Related Projects: bwameth.py
Conversion Details (C→T, G→A)
Bisulfite sequencing relies on chemical conversion of unmethylated cytosines to
uracil (read as thymine after PCR). bwa-mem3 --meth models this with an
in-memory read transformation applied to every read before the alignment kernel
sees the bases.
What gets converted
Paired-end bisulfite reads follow a strand convention:
- R1 (read 1): every C in the base sequence is replaced with T. In a directional library R1 derives from the OT (original top) or OB (original bottom) strand, where unmethylated cytosines read as T after bisulfite treatment and PCR.
- R2 (read 2): every G in the base sequence is replaced with A. R2 derives from the CTOT or CTOB strand (complementary to original top / bottom), where the same conversion appears as G→A.
Single-end mode uses the R1 (C→T) rule for all reads.
The doubled reference built by bwa-mem3 index --meth (or bwameth.py) contains
two projections of each chromosome:
- f-prefixed contigs (e.g. fchr1): the chromosome with every C replaced by T.
- r-prefixed contigs (e.g. rchr1): the same chromosome, in forward orientation, with every G replaced by A.
Converted R1 reads are therefore alignable to f-prefixed contigs and
converted R2 reads to r-prefixed contigs. The contig prefix records which
strand hypothesis was used and feeds the Bismark XG:Z tag directly
(CT for f-prefixed / OT, GA for r-prefixed / OB).
Where conversion happens
Read conversion runs inside src/fastmap.cpp in the meth_mode ingest block,
immediately after sequence parsing and before any alignment work. The
transformation is applied to the in-memory bseq1_t.seq buffer; the original
FASTQ file is never rewritten.
Before the bases are modified, the original sequence is recorded in the read’s comment buffer as:
YS:Z:<l_seq bases>\tYC:Z:<direction>
where <direction> is CT for R1 (C→T) and GA for R2 (G→A). These
fields pass through the alignment kernel untouched and serve two
internal purposes at BAM-write time: YS:Z is the source for SEQ
restoration (see next section), and YC:Z is the source for the
emitted XR:Z: Bismark tag. They are not emitted to the output
BAM — bam_writer.cpp suppresses them under --meth. See
SAM tags: XR, XG, XM for the per-tag output reference.
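The ingest step described above can be sketched in a few lines: convert the bases and record the original sequence plus direction in a carrier string with the documented YS:Z / YC:Z layout. This is an illustration only, assuming upper-case bases; the function name c2t_convert is hypothetical and the real code operates on the bseq1_t buffers in src/fastmap.cpp.

```python
def c2t_convert(seq, is_read2):
    """Return (converted_seq, carrier_comment) for one read.

    R1 and single-end reads get C->T; R2 gets G->A. The carrier keeps the
    original bases and the conversion direction for use at BAM-write time.
    """
    if is_read2:
        direction = "GA"
        converted = seq.replace("G", "A")
    else:
        direction = "CT"
        converted = seq.replace("C", "T")
    carrier = f"YS:Z:{seq}\tYC:Z:{direction}"
    return converted, carrier


print(c2t_convert("ACGTCG", is_read2=False))
print(c2t_convert("ACGTCG", is_read2=True))
```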
Sequence restoration in the BAM SEQ field
Methylation callers such as MethylDackel identify methylated cytosines by
examining the BAM SEQ field at each CpG site. They need to see real C/T
bases — not the uniformly-converted T/A bases that were used for alignment.
meth_mem_aln_to_bam (in src/meth_bam.cpp) restores the original
sequence from the internal YS:Z carrier on bseq1_t.comment before
writing the BAM record (the carrier itself is suppressed at BAM emit
by bam_writer.cpp under --meth, so it never reaches the output
file):
- The YS:Z: payload is located at the start of the bseq1_t.comment field (offset +5 past the YS:Z: header bytes).
- For forward-aligned records (!p.is_rev), the pre-c2t bases are copied directly into the BAM SEQ buffer.
- For reverse-aligned records (p.is_rev), the bases are reverse-complemented using the standard TGCAN table before being placed in SEQ.
- If YS:Z: is absent (e.g. when running with an external c2t converter that does not emit it), the code falls back to the converted sequence in s->seq, with the same RC flip logic.
Warning — Soft-clip and supplementary trimming
When computing the SEQ range for supplementary alignments, the qb/qe boundaries account for soft-clip or hard-clip operations at the CIGAR ends. The YS:Z: restoration applies over the same trimmed range so SEQ length always matches the emitted CIGAR.
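The restoration rules above can be sketched as follows. This is a simplified model, not meth_mem_aln_to_bam itself: it works on text rather than encoded bases, and it assumes qb and qe index the read in its original FASTQ orientation, with the reverse complement applied after slicing. The name restore_seq is hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def restore_seq(comment, converted_seq, is_rev, qb=0, qe=None):
    """Rebuild the BAM SEQ field from the YS:Z carrier, or fall back to converted bases."""
    if comment.startswith("YS:Z:"):
        # Payload starts 5 bytes in and runs to the next tab.
        source = comment[5:].split("\t", 1)[0]
    else:
        source = converted_seq
    if qe is None:
        qe = len(source)
    segment = source[qb:qe]
    if is_rev:
        segment = segment.translate(COMPLEMENT)[::-1]
    return segment


carrier = "YS:Z:ACGTCG\tYC:Z:CT"
print(restore_seq(carrier, "ATGTTG", is_rev=False))
print(restore_seq(carrier, "ATGTTG", is_rev=True))
```

Slicing before the reverse complement keeps the SEQ length equal to qe - qb in both orientations.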
QUAL field handling
The QUAL field is taken directly from the original FASTQ (bseq1_t.qual) over
the same [qb, qe) range and is never modified by the c2t process. Quality
scores correspond to the original base calls, not the converted ones.
Relationship to the reference index
bwa-mem3 index --meth ref.fa writes ref.fa.bwameth.c2t, which applies the
same C→T / G→A projection to the reference sequence. The resulting file is
compatible with what bwameth.py index produces, so the same doubled-reference
FASTA can be used interchangeably with either tool across tested versions.
See also: Overview · SAM tags: XR, XG, XM · Interop with external bwameth.py c2t · User Guide → Indexing the reference · Best Practices → Methylation defaults
SAM Tags: XR, XG, XM (Bismark-compatible)
bwa-mem3 mem --meth emits three Bismark-compatible auxiliary tags on
each output record: XR:Z, XG:Z, and XM:Z. These tags are read by
bismark_methylation_extractor, deduplicate_bismark, methylKit
processBismarkAln, methtuple, DMRfinder, epialleleR, MethylDackel, and
biscuit’s per-read methylation tools.
Tag reference
XR:Z — read conversion direction
| Property | Value |
|---|---|
| Type | Z (NUL-terminated string) |
| Values | CT (R1 / SE) or GA (R2) |
| Set by | meth_mem_aln_to_bam from FASTQ-ingest carrier (s->comment’s YC payload) |
| Emitted on | All records (mapped and unmapped) |
XR:Z records which conversion was applied to the read at FASTQ ingest:
- CT: C→T conversion applied; this is an R1 read or single-end read.
- GA: G→A conversion applied; this is an R2 read.
XG:Z — genome strand of the alignment
| Property | Value |
|---|---|
| Type | Z (NUL-terminated string) |
| Values | CT (aligned to original top, f-prefixed contig) or GA (aligned to original bottom, r-prefixed contig) |
| Set by | meth_mem_aln_to_bam from meth_chrom_map_t.direction |
| Emitted on | Mapped records only |
XG:Z indicates which doubled-reference strand the read aligned to:
- CT: read aligned to the C→T-projected forward strand (OT).
- GA: read aligned to the G→A-projected forward strand (OB).
For properly paired directional reads, R1 and R2 of a fragment naturally
share XG:Z. Discordant pairs (already flagged with 0x200 by the
chimera-QC heuristic) may see XG:Z diverge between mates.
XM:Z — methylation call string
| Property | Value |
|---|---|
| Type | Z (NUL-terminated string) |
| Length | Equal to SEQ length |
| Set by | meth_build_xm (src/meth_xm.cpp) walking SEQ-orientation read against un-converted ref |
| Emitted on | Mapped records only |
Per-base methylation call. Each character corresponds to one SEQ base:
| char | meaning |
|---|---|
| z / Z | unmethylated / methylated C in CpG context |
| x / X | unmethylated / methylated C in CHG context |
| h / H | unmethylated / methylated C in CHH context |
| u / U | unmethylated / methylated C in unknown context (N within 1 or 2 bp downstream of the C, on the read’s source strand) |
| . | non-C reference at this position, sequencing mismatch (read base ≠ C/T at a ref C), insertion, soft clip, or N at the C position itself |
The string is in SEQ orientation (matches the BAM SEQ field): for reads
with the 0x10 flag set, both SEQ and XM:Z are reverse-complemented
relative to FASTQ-original orientation.
Computation
Under --meth, the doubled c2t reference (<prefix>.bwameth.c2t.*) is
folded once at startup into an in-memory un-converted pac (the
meth_orig_ref module — src/meth_orig_ref.cpp). The fold uses
(f, r) → original recovery on every position via a 5-row table:
| f[P] | r[P] | original[P] |
|---|---|---|
| T | T | T |
| T | C | C |
| G | A | G |
| A | A | A |
| N | N | N (via bns->ambs) |
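The fold table can be exercised with a short round trip. The sketch assumes both projections are stored in forward orientation at the same position P, which is what the (f[P], r[P]) indexing implies, and it works on plain letters rather than the packed pac plus bns->ambs representation. The names project and fold are hypothetical.

```python
FOLD = {
    ("T", "T"): "T",
    ("T", "C"): "C",
    ("G", "A"): "G",
    ("A", "A"): "A",
    ("N", "N"): "N",
}


def project(seq):
    """Return the (f, r) projections: C->T and G->A at matching positions."""
    return seq.replace("C", "T"), seq.replace("G", "A")


def fold(f_seq, r_seq):
    """Recover the original bases from the two projections using the table."""
    return "".join(FOLD[(f, r)] for f, r in zip(f_seq, r_seq))


original = "ACGTNCCGGA"
f_seq, r_seq = project(original)
print(f_seq, r_seq, fold(f_seq, r_seq))
```

Each original base maps to a distinct (f, r) pair, which is why five rows are enough to invert the projection.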
Per mapped record, meth_build_xm slices the un-converted forward-strand
window at the read’s footprint plus 2 bp of context on either side, then
walks the BAM CIGAR jointly over the restored SEQ and the ref window.
The classifier matches Bismark’s methylation_call:
match position with ref[t] == 'C' (top strand) or 'G' (bottom strand):
    determine context from ref[t±1], ref[t±2]
        N in either context base      -> u/U (unknown context)
        ref[t±1] == G/C (per strand)  -> z/Z (CpG)
        ref[t±2] == G/C (per strand)  -> x/X (CHG)
        otherwise                     -> h/H (CHH)
    determine methylation:
        read base == C/G (per strand) -> uppercase (methylated)
        read base == T/A (per strand) -> lowercase (unmethylated)
        otherwise                     -> '.'
insertion / soft clip                 -> '.' per consumed read base
deletion / N op                       -> no XM emit
hard clip / pad                       -> no XM emit
The top vs bottom strand choice is driven by XG:Z (= cmap
direction), not by the SAM 0x10 (RC) flag. CTOT reads (R2 mapped
forward to a top-strand contig with 0x10 set) and OB reads (R1 mapped
RC to a bottom-strand contig) are both handled by reading the rule
table from the strand encoded in XG. The walk runs in SEQ orientation
throughout — no RC of the ref slice or the read.
For bottom-strand methylation, the C of interest at forward position P
is encoded as a G on the forward strand (complement of bottom-strand C).
The downstream context on the bottom strand corresponds to upstream
positions on the forward strand; the classifier indexes ref[t-1] and
ref[t-2] instead of ref[t+1] and ref[t+2], and looks for a C
(forward) instead of a G to flag CpG.
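The per-position rule, including the mirrored indexing for the bottom strand, might be sketched as follows. This is an illustrative Python function for CIGAR match positions only, not the meth_build_xm implementation: the name xm_char is hypothetical, the rule order follows the listing above, and treating context positions that fall outside the window as N is an assumption.

```python
# Hypothetical per-position XM classifier for CIGAR match positions only.
# Assumption: context positions that fall off the ref window count as N.

def xm_char(ref: str, t: int, read_base: str, top_strand: bool = True) -> str:
    """Return the XM character for ref position t and the aligned read base."""
    if top_strand:
        target, partner, step, meth, unmeth = "C", "G", 1, "C", "T"
    else:
        # Bottom strand: the cytosine appears as G on the forward strand and
        # its downstream context lies at t-1 and t-2.
        target, partner, step, meth, unmeth = "G", "C", -1, "G", "A"

    if ref[t] != target:
        return "."  # non-C reference position
    if read_base not in (meth, unmeth):
        return "."  # sequencing mismatch or N in the read

    def base_at(i: int) -> str:
        return ref[i] if 0 <= i < len(ref) else "N"

    c1, c2 = base_at(t + step), base_at(t + 2 * step)
    if "N" in (c1, c2):
        code = "u"
    elif c1 == partner:
        code = "z"  # CpG
    elif c2 == partner:
        code = "x"  # CHG
    else:
        code = "h"  # CHH
    return code.upper() if read_base == meth else code


ref = "ACGTCAGCTT"
top_calls = [xm_char(ref, 1, "C"), xm_char(ref, 4, "T"), xm_char(ref, 7, "C")]
bottom_calls = [
    xm_char(ref, 2, "A", top_strand=False),
    xm_char(ref, 6, "G", top_strand=False),
]
```

Selecting the rule set with a single top_strand switch mirrors the idea that the strand comes from XG:Z rather than from the 0x10 flag; neither the read nor the reference slice is reverse-complemented.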
Inspecting tags with samtools
samtools view out.bam | head -1 | tr '\t' '\n' | grep -E '^X[RGM]:'
Expected output looks like:
XR:Z:CT
XG:Z:CT
XM:Z:..z..h..Z..x..h.....Z..
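For a quick sanity check on a single record, the XM:Z string can be tallied per context. The helper below is a hypothetical Python sketch (summarize_xm is not part of bwa-mem3 or samtools) that counts methylated and unmethylated calls using the letter codes described above.

```python
from collections import Counter

# Lowercase = unmethylated, uppercase = methylated, as in the XM:Z legend.
CONTEXTS = {"z": "CpG", "x": "CHG", "h": "CHH", "u": "unknown"}


def summarize_xm(xm: str) -> dict:
    """Map each context name to (methylated, unmethylated) call counts."""
    counts = Counter(xm)
    return {
        name: (counts[code.upper()], counts[code])
        for code, name in CONTEXTS.items()
    }


summary = summarize_xm("..z..h..Z..x..h.....Z..")
```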
See also
- Overview
- Conversion details
- Chimera QC and header rewriting
- Flags: --set-as-failed, --chimera-qc
- User Guide → Output: SAM/BAM, headers, tags
Chimera QC and Header Rewriting
After the alignment kernel produces mem_aln_t records, bwa-mem3 --meth
applies a set of post-processing steps before writing BAM output. These steps
are implemented in src/meth_bam.cpp and run in the same process, in the same
pass over the aligned records.
@SQ header consolidation
The doubled reference (ref.fa.bwameth.c2t) contains two contigs for each
chromosome:
- fchr1, fchr2, …: C→T projections of each chromosome.
- rchr1, rchr2, …: G→A projections of each chromosome.
If the raw alignment header were written directly, every downstream tool would
see twice as many sequences as there are real chromosomes, with unfamiliar
f/r-prefixed names. meth_bam_writer_open instead builds a consolidated
header using the meth_chrom_map_t:
- meth_chrom_map_build_from_bns iterates over bns->anns and strips the leading f/r from each contig name.
- The first contig with a given stripped name registers that name in the output list; subsequent contigs with the same stripped name map to the same output index.
- The BAM @SQ lines are written from the consolidated list: one SN: per real chromosome.
RNAME, RNEXT, and SA/XA tag contig references in every record are rewritten
through cmap->out_tid and cmap->output_names so they reference the
consolidated names. The mapping from internal (doubled-ref) contig index to
output contig index is cmap->out_tid[p.rid].
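The consolidation described above can be sketched in a few lines. This is a hypothetical Python model of the mapping, not the meth_chrom_map_t code: build_chrom_map is an invented name, and the contig order in the example is an assumption.

```python
# Hypothetical model of the doubled-ref -> consolidated contig mapping.

def build_chrom_map(internal_names):
    """Return (output_names, out_tid) for a list of f/r-prefixed contigs."""
    output_names = []   # one entry per real chromosome, first-seen order
    index_of = {}       # stripped name -> output index
    out_tid = []        # internal contig index -> output index
    for name in internal_names:
        if name[:1] not in ("f", "r"):
            raise ValueError(f"expected f/r-prefixed contig, got {name!r}")
        stripped = name[1:]
        if stripped not in index_of:
            index_of[stripped] = len(output_names)
            output_names.append(stripped)
        out_tid.append(index_of[stripped])
    return output_names, out_tid


# Contig order here is illustrative only.
names, out_tid = build_chrom_map(["fchr1", "rchr1", "fchr2", "rchr2"])
```

With this shape, rewriting RNAME for a record on internal contig rid is a lookup of names[out_tid[rid]].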
Note — TLEN computation uses consolidated TIDs
Template length (TLEN) is computed using the consolidated output TIDs, not the internal p.rid values. Two mates that rescue onto fchr1 and rchr1 respectively both map to output chr1, so TLEN is reported as a non-zero distance rather than zero (which would happen if the mismatched internal TIDs were used).
Chimera QC heuristic (opt-in)
bwameth.py applies a heuristic to flag reads that look like chimeric fragments: if the longest contiguous alignment run (sum of M/=/X CIGAR operations) covers less than 44 % of the read length, the read is considered a potential chimera. Bismark does not apply this kind of heuristic.
bwa-mem3 --meth makes this opt-in via --chimera-qc (default off, so
the runtime posture matches Bismark). When enabled, the check inside
meth_mem_aln_to_bam does:
if (100 * longest_M_run < 44 * l_seq):
    flag |= 0x200         # set QC fail
    flag &= ~0x2          # clear proper pair
    mapq = min(mapq, 1)
The threshold constant is MIN_LONGEST_M_PCT = 44 (defined at the top
of src/meth_bam.cpp). The longest run is computed by
cigar_longest_m_mem from src/cigar_util.cpp, which counts M, =,
and X operations.
The chimera heuristic is only applied to mapped records (!(flag & 0x4) && direction != 0). Unmapped records are not touched.
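To make the threshold arithmetic concrete, here is a minimal Python sketch of the check working from a CIGAR string. It is not the cigar_longest_m_mem code: the function names are hypothetical, merging adjacent M/=/X operations into one run is an assumption about what "contiguous" means, and the direction != 0 condition is left out.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
MIN_LONGEST_M_PCT = 44  # threshold named in the text above


def longest_m_run(cigar: str) -> int:
    """Longest run of adjacent M/=/X operations (assumed run definition)."""
    best = run = 0
    for length, op in CIGAR_OP.findall(cigar):
        if op in "M=X":
            run += int(length)
            best = max(best, run)
        else:
            run = 0  # any other operation ends the current run
    return best


def apply_chimera_qc(flag: int, mapq: int, cigar: str, l_seq: int):
    """Return (flag, mapq) after the opt-in chimera heuristic."""
    if flag & 0x4:
        return flag, mapq  # unmapped records are not touched
    if 100 * longest_m_run(cigar) < MIN_LONGEST_M_PCT * l_seq:
        flag |= 0x200        # set QC fail
        flag &= ~0x2         # clear proper pair
        mapq = min(mapq, 1)
    return flag, mapq


flagged = apply_chimera_qc(99, 60, "30M5I20M95S", 150)
kept = apply_chimera_qc(99, 60, "100M50S", 150)
```

Keeping the comparison in integer form (100 * run against 44 * l_seq) avoids any floating-point rounding at the boundary.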
See Flags for when to use --chimera-qc (PBAT / scBS-Seq;
bwameth.py-equivalence runs).
--set-as-failed strand filtering
Before the chimera check, meth_mem_aln_to_bam checks whether
opt->meth_set_as_failed is set and matches the record’s strand direction:
if (meth_set_as_failed != 0 && meth_set_as_failed == direction):
    flag |= 0x200
This unconditionally marks all alignments to the specified strand (f or r)
as QC-failed before chimera logic runs. The chimera check then applies on top
of the already-set fail flag.
Pair-level QC-fail propagation
Once per read group (all records sharing the same query name), after individual records have been processed:
meth_bam_group_propagate_qcfail(group, n)
This function scans all records in the group. If any record has 0x200 set, it
propagates that flag to every other record in the group and clears 0x2
(proper pair) on all of them. This ensures that a chimeric or strand-filtered
primary alignment also marks its supplementary alignments and the mate as
QC-failed, preventing inconsistent flag states in the output BAM.
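The propagation step can be pictured as a single pass over the flags of one query-name group. The Python sketch below is illustrative only: propagate_qcfail is a hypothetical stand-in for meth_bam_group_propagate_qcfail and operates on bare integer flags rather than BAM records.

```python
QCFAIL = 0x200
PROPER_PAIR = 0x2


def propagate_qcfail(flags):
    """Spread QC-fail across a query-name group and clear proper-pair."""
    if not any(f & QCFAIL for f in flags):
        return list(flags)  # nothing to propagate
    return [(f | QCFAIL) & ~PROPER_PAIR for f in flags]


clean_group = propagate_qcfail([99, 147])
failed_group = propagate_qcfail([99 | QCFAIL, 147])
```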
@PG ID:bwa-mem3-meth insertion
meth_bam_writer_open appends a @PG line to the header after the original
bwa-mem3 @PG entry:
@PG ID:bwa-mem3-meth PN:bwa-mem3-meth VN:<version>-meth CL:<command line>
The <command line> field is the full bwa-mem3 mem --meth ... invocation with
embedded tab characters replaced by spaces (htslib does not permit literal tabs
in @PG CL: fields). This records the exact parameters used for provenance
and reproducibility.
Tip — Verifying the header
After alignment, confirm consolidation and provenance with:
samtools view -H out.bam | grep -E '^@SQ|^@PG'
You should see one @SQ line per chromosome (no f/r prefixes) and both @PG ID:bwa-mem3 and @PG ID:bwa-mem3-meth entries.
See also: Overview · SAM tags: XR, XG, XM · Flags: --set-as-failed, --chimera-qc · Conversion details · User Guide → Output: SAM/BAM, headers, tags
Flags: --set-as-failed, --chimera-qc
bwa-mem3 --meth adds two flags that control QC behavior during BAM
post-processing. Both flags affect the chimera QC and strand-filtering logic
inside meth_mem_aln_to_bam (src/meth_bam.cpp).
--set-as-failed {f|r}
Marks every alignment to the specified strand as QC-failed (0x200) regardless
of alignment quality or CIGAR structure.
Accepted values:
- f: flag all alignments to f-prefixed contigs (C→T top-strand projection).
- r: flag all alignments to r-prefixed contigs (G→A bottom-strand projection).
Effect on records:
When --set-as-failed f (or r) is set and a mapped record’s strand matches
the specified value, the record’s SAM flag has 0x200 set. If --chimera-qc
is also active, the chimera heuristic runs on top, possibly clearing 0x2 and
capping MAPQ. QC-fail propagation then spreads the flag to all records in the
read group.
When to use it:
Some experimental designs produce reads that are expected to align exclusively
to one strand. Flagging the other strand as QC-failed before downstream
analysis prevents spurious methylation calls from mis-strand alignments. It is
also useful for diagnosing library preparation issues: run once with
--set-as-failed r and once without to compare yield on each strand.
Warning — All records on the strand are flagged
--set-as-failed is a blunt instrument. It marks every alignment to the chosen strand, including correctly aligned reads that simply happened to land on the complementary strand due to library structure. Use this flag only when your library is expected to be strand-specific.
--chimera-qc
Enables the bwameth.py-style longest-M chimera heuristic. Off by default; this is the Bismark-equivalent posture, since Bismark itself does not apply this kind of QC heuristic.
When --chimera-qc is set, any mapped record whose longest M/=/X CIGAR run
covers less than 44 % of the read length receives:
- 0x200 (QC fail) set.
- 0x2 (proper pair) cleared.
- MAPQ capped at 1.
QC-fail propagation across the read group also applies.
When to use it:
The 44 % threshold was calibrated by bwameth.py for standard mammalian whole-genome bisulfite-sequencing (WGBS) libraries with typical read lengths and is helpful on PBAT / scBS-Seq libraries where intra-fragment chimeras are common. For Bismark-equivalent output (and most directional EM-seq / WGBS workflows), leave it off.
It is also useful when benchmarking: comparing bwa-mem3 --meth output
against bwameth.py output is cleaner with --chimera-qc enabled, since
bwameth.py’s chimera logic always runs.
Note — Pair-level propagation still applies
--chimera-qc controls only whether the heuristic itself runs. --set-as-failed is independent: when active, those flags are still set, and meth_bam_group_propagate_qcfail propagates any 0x200 flags across the read group regardless of --chimera-qc.
Flag interaction summary
| Condition | 0x200 set? | 0x2 cleared? | MAPQ capped? |
|---|---|---|---|
| Normal aligned record (default, no flags) | No | No | No |
| --chimera-qc triggers (longest M/=/X < 44%) | Yes | Yes | Yes (≤1) |
| --set-as-failed strand matches | Yes | No | No |
| Both: --chimera-qc triggers and --set-as-failed strand matches | Yes | Yes | Yes (≤1) |
-V reference annotation XR:Z is suppressed under --meth
bwa-mem3 mem -V normally emits the contig annotation as an XR:Z
auxiliary field. Under --meth, XR:Z carries the Bismark
read-conversion direction (CT/GA) instead. The reference-annotation
XR:Z is silently suppressed when --meth is active so the two uses
don’t collide. There is no flag to override this — -V is a no-op for
XR:Z under --meth. See tags.md.
See also: Overview · Chimera QC and header rewriting · SAM tags: XR, XG, XM · Best Practices → Methylation defaults · CLI Reference → mem
Interop with External bwameth.py c2t
Some workflows use bwameth.py’s c2t subcommand to convert reads before
passing them to an aligner. bwa-mem3 --meth supports this pattern by
detecting whether the caller has already provided a pre-converted FASTQ and
whether the reference path already points to the doubled-reference FASTA.
Auto-detect logic for the reference path
When --meth is active, bwa-mem3 mem ordinarily appends .bwameth.c2t to
the reference path so the user can pass the original FASTA prefix:
bwa-mem3 mem --meth -t 16 ref.fa R1.fq.gz R2.fq.gz
# internally uses ref.fa.bwameth.c2t as the reference
If the reference path already ends with .bwameth.c2t, the auto-append is
skipped:
bwa-mem3 mem --meth -t 16 ref.fa.bwameth.c2t R1.fq.gz R2.fq.gz
# no suffix appended; ref.fa.bwameth.c2t is used as-is
This detection is a simple suffix check on the path string. It allows callers that manage the doubled-reference path explicitly to pass it without triggering double-append.
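The suffix check amounts to the following. This is a small Python sketch of the described behavior, with resolve_meth_ref as a hypothetical name rather than a function in the bwa-mem3 source.

```python
METH_SUFFIX = ".bwameth.c2t"


def resolve_meth_ref(path: str) -> str:
    """Append the doubled-reference suffix unless it is already present."""
    return path if path.endswith(METH_SUFFIX) else path + METH_SUFFIX


from_prefix = resolve_meth_ref("ref.fa")
from_explicit = resolve_meth_ref("ref.fa.bwameth.c2t")
```

Both call styles resolve to the same path, so the function is idempotent: applying it twice never double-appends.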
Using bwameth.py c2t as the read preprocessor
If your pipeline already runs bwameth.py c2t to convert reads (for example,
because it needs to reuse converted reads across multiple aligners), you can
pipe the output directly to bwa-mem3 mem --meth:
bwameth.py c2t R1.fq.gz R2.fq.gz \
| bwa-mem3 mem --meth -p -t 16 ref.fa.bwameth.c2t /dev/stdin \
| samtools sort -o out.bam
Key points for this pattern:
- Pass the .bwameth.c2t reference path explicitly so the auto-append is suppressed.
- Use -p to tell bwa-mem3 mem that the input contains interleaved paired-end reads (bwameth.py c2t emits interleaved output to stdout).
- Use /dev/stdin as the reads argument to read from the pipe.
- The bwa-mem3 --meth inline c2t conversion is not applied when the reads arrive pre-converted. XG:Z (genome strand) is still emitted on every record and XM:Z (per-base methylation call string) on every mapped record. XR:Z (read conversion) is derived from the inline carrier the c2t step normally writes; when reads are pre-converted, the carrier is absent unless the external preprocessor emits it as a FASTQ comment (see warning below).
Warning —
XR:Z:requires the inline carrier
XR:Z: records the read’s bisulfite-conversion direction (CT for top-strand, GA for bottom-strand R2). bwa-mem3’s inline c2t step records that direction into the FASTQ comment as YS:Z:<seq>\tYC:Z:<dir>, which the BAM emitter then reads to set XR:Z: (the YS/YC carrier itself is dropped from BAM output). When reads are pre-converted externally and piped in, the inline c2t step in src/fastmap.cpp is bypassed. If your external preprocessor does not emit a compatible YC:Z: comment field, XR:Z: will be absent from the output BAM. XG:Z: and XM:Z: are unaffected; they’re derived from the reference contig direction and CIGAR walk, not from the carrier.
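To check whether an external preprocessor emits a usable carrier, the FASTQ comment can be inspected for a YC:Z: field. The Python helper below is a hypothetical sketch (xr_from_comment is not a bwa-mem3 function) and assumes the tab-separated layout shown in the warning above.

```python
def xr_from_comment(comment: str):
    """Return the YC:Z direction from a FASTQ comment, or None if absent."""
    prefix = "YC:Z:"
    for field in comment.split("\t"):
        if field.startswith(prefix):
            return field[len(prefix):]
    return None  # no carrier: XR:Z would be missing for this read


with_carrier = xr_from_comment("YS:Z:ACGTTGCA\tYC:Z:CT")
without_carrier = xr_from_comment("1:N:0:ATCACG")
```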
Header rewriting and BAM post-processing with external c2t
Whether reads are converted inline or externally, all BAM post-processing steps
apply identically when --meth is active:
- @SQ header consolidation (f/r contigs → one entry per chromosome).
- Bismark XR:Z / XG:Z / XM:Z auxiliary tag emission.
- Chimera QC heuristic (only when --chimera-qc is set; off by default).
- Pair-level QC-fail propagation.
- @PG ID:bwa-mem3-meth insertion.
The post-processing pipeline depends only on the reference contig names (to
determine XG:Z) and the alignment flags — not on whether reads were
converted inline or externally.
Summary of path variants
| Reference arg | Read source | Auto-append? | Inline c2t? | XR/XG/XM emitted? |
|---|---|---|---|---|
| ref.fa | Raw FASTQ | Yes (→ ref.fa.bwameth.c2t) | Yes | All three |
| ref.fa.bwameth.c2t | Raw FASTQ | No | Yes | All three |
| ref.fa.bwameth.c2t | Pre-converted (pipe) | No | No | XG/XM always; XR only if external preprocessor emits the YC:Z carrier |
See also: Overview · Conversion details · SAM tags: XR, XG, XM · bwameth.py drop-in mapping · Related Projects: bwameth.py
What’s Different from bwa-mem2
This section tracks every change that bwa-mem3 carries on top of upstream
bwa-mem2/bwa-mem2’s master branch,
explains why each change was made, and records its upstream disposition.
How this section is organized
Each deep-dive page covers one category of change:
- Correctness fixes: bugs in upstream bwa-mem2 that are fixed in bwa-mem3, including the kswv SIMD score2 plateau series, the proper-pair flag regression, the zero-init crash, the SMEM buffer overflow, and the @PG tab-escape issue.
- Performance improvements: lockstep SMEM batching, batched -H header ingestion, libsais FM-index construction, and the consolidated mapping speedup suite.
- Features: --meth bisulfite mode, mimalloc allocator, --supp-rep-hard-cap, bwa-mem3 shm, shm --meth, the HN:i tag, and the --bam=LEVEL output flag.
- Architecture support: the Linux ARM64/aarch64 build, the arch=avx512bw Makefile target, the NEON kswv mate-rescue kernel, and the AVX2 kswv mate-rescue kernel.
- Build & infrastructure: the doctest framework, Codecov integration, PACKAGE_VERSION from git describe, PGO target parameterization, CXXFLAGS/CPPFLAGS/LDFLAGS forwarding, the unit-test harness, and the CI matrix expansion.
- Upstream PR status: a single table cross-referencing every fork-carried change to its corresponding upstream PR or issue, with current upstream disposition.
Carried on top of upstream
Auto-generated from git log --reverse --no-merges master..main and the conventional-commits scope on each PR-merge title; do not edit by hand. For per-PR upstream disposition (bwa-mem2 PR / issue refs and status), see Upstream PR status.
| Commit | Topic | PR |
|---|---|---|
| ae73227 | Add Apple Silicon (ARM64/NEON) support with native optimizations | — |
| 744a9e7 | feat: add CI workflow with cross-platform build and end-to-end test | — |
| 490502b | fix: drop unused global stat that shadows libc | — |
| 9364cfc | ci: pin GitHub Actions to full-length commit SHAs | #4 |
| b6eaba1 | chore: configure CodeRabbit to review PRs against fg-main | #2 |
| db5086a | docs: add FG-MAIN.md documenting the fork’s relationship to upstream | #3 |
| 5132582 | feat(arm64): make Linux aarch64 build + CI-test on every fg-main push | #1 |
| 96016a5 | ci: pin dwgsim seed (-z 42) to stop parity-test flakiness | #10 |
| 246b528 | fix(hdr): align bwamem.h declarations with bwamem.cpp definitions | #5 |
| b27f374 | feat(hdr): export mem_infer_dir for external consumers | #6 |
| 62700b1 | chore: move profiling globals out of main.cpp | #7 |
| 6b76c7b | feat: expose worker_alloc/worker_free, the core worker_t pre-allocation helpers | #8 |
| e80765b | feat: split mem_sam_pe into mem_pair_resolve + thin emission wrapper | #9 |
| 84defc3 | feat: --bam[=LEVEL] output flag for direct BAM emission | #12 |
| 73907d7 | feat: vendor mimalloc v3.3.0 and link by default | #19 |
| 7641ebf | feat(meth): --meth + index --meth — bwameth.py-equivalent bisulfite mode | #13 |
| 0165b6c | fix: zero bseq1_t in kseq2bseq1 so realloc’d entries don’t carry garbage | #22 |
| e7cb763 | [proto] NEON kswv mate-rescue — correctness + perf harness | #18 |
| a5aab04 | test(ci): add unit-test harness, fixtures, and ARM build support | #23 |
| 2fddafd | [proto] AVX2 kswv mate-rescue — stacked on PR 18 | #20 |
| 8944028 | fix: compute no_pairing 0x2 flag from the emitted alignment | #17 |
| 2fd0e96 | fix(kswv): apply NEON score2-scan fixes to AVX-512BW kernel | #21 |
| 68adecd | ci: expand workflow matrix + add canonical deep-test row | #24 |
| 690914f | build(make): add explicit arch=avx512bw target | #16 |
| 0bb9402 | fix(kswv): gate AVX2 arch dispatch on !AVX512BW | #26 |
| 43457e8 | fix(kswv): consolidate score2 plateaus per-lane to match scalar ksw_align2 | #28 |
| 2311f11 | fix(kswv): port score2 plateau consolidation to NEON + AVX-512BW | #29 |
| 75c709a | fix(kswv): apply score2 plateau fix + missing filters to kswv_512_16 | #30 |
| 61813ef | fix(kswv): rewrite kswv_neon_16 — real SIMD kernel with correct table + score2 | #31 |
| 1f76655 | perf(seed): lockstep SMEM batching across N reads | #33 |
| 93a79ec | feat(mem): emit HN:i tag with total hit count per primary | #42 |
| dd3a82c | chore: port four nh13 lh3/bwa PRs into bwa-mem2 (-z, -u/XB, MQ, @HD order) | #35 |
| 98ba6ab | build(make): forward user CXXFLAGS/CPPFLAGS/LDFLAGS to final link steps | #50 |
| e9302a1 | fix(kswv): guard post-loop rowMax store on nrow==0 batches | #51 |
| 9b702ca | fix(sam): sanitize whitespace in -R when embedding into @PG CL: field | #54 |
| ed63fad | perf(header): batch -H ingestion to fix O(n^2) header read (closes #37) | #49 |
| 595d8e5 | feat(mapq): add --supp-rep-hard-cap opt-in supp MAPQ rescoring | #56 |
| 79628c3 | chore(version): stamp PACKAGE_VERSION from git describe at build time | #52 |
| e22dade | fix(smem): size SMEM buffers from observed max read length (closes #44) | #55 |
| 03688a0 | chore: normalize CRLF line endings to LF (#43) | #53 |
| 57e21bd | feat(makefile): parameterize PGO targets by arch + profile dir | #59 |
| 79b90ce | feat(index): libsais-based memory-bounded FM-index construction | #57 |
| d8d4a6d | feat(cli): wire up --help across commands; add -h to top-level and index | #60 |
| 7301762 | perf: consolidated mapping speedups (ksw2, SMEM, SAL, SAM) | #58 |
| eaf4ed6 | test: doctest-based test framework scaffolding + Codecov | #34 |
| bbbecd3 | ci(proto-neon-kswv): split into fan-out/fan-in jobs with caching | #63 |
| 20f77e9 | feat(shm): port bwa shm from bwa-mem v1 | #65 |
| c20f61c | feat(shm): add bwa-mem2 shm --meth for symmetric meth UX | #67 |
| ee18a3b | refactor: rename bwa_mem2idx to bwa_mem3idx | — |
| bb919f2 | feat: rename PG header to bwa-mem3 (ID, PN, usage strings) | — |
| 7a56f9a | feat: rename meth PG header to bwa-mem3-meth and drive VN from PACKAGE_VERSION | — |
| 95c673d | build: rename binary to bwa-mem3, update version guard and fallback | — |
| 2d5ad10 | test: rename test binaries from bwa_mem2_tests_* to bwa_mem3_tests_* | — |
| 31f214a | chore: sweep bwa-mem2 -> bwa-mem3 in source comments and log messages | — |
| ff40f96 | chore: rename BWAMEM2_* header guards to BWAMEM3_* | — |
| c2c786a | test: sweep bwa-mem2 -> bwa-mem3 in test, bench, and scripts | — |
| 148e431 | ci: update workflows for bwa-mem3 rename and main branch | — |
| dddd8dd | docs: rewrite README for bwa-mem3 (lineage attribution, drop upstream-only sections) | — |
| 34c8ea3 | docs: rename FG-MAIN.md to docs/whats-different.md | — |
| 5719617 | docs: update whats-different.md for bwa-mem3 and main branch | — |
| bdd67f3 | docs: drop README-ori.md (lineage preserved in README + git history) | — |
| 924a70f | docs: add 0.1.0-pre release notes and update status.md | — |
| 85d3b3b | ci: drop master from branch filter (master branch removed from remote) | — |
| 2ea69db | fix(test/meth): alias bwa-mem2 -> bwa-mem3 on PATH for bwameth.py oracle | #72 |
| 4f805e6 | chore: rename shell vars BWAMEM2/BWA_MEM2[*] to BWAMEM3/BWA_MEM3[*] | #68 |
| 8137740 | perf(kswv): add per-strip L1 prefetches to all u8/16 kernels | #70 |
| 41e1f3c | docs: add comprehensive mdbook on Read the Docs | #71 |
| 442de25 | fix(fmi): parenthesize SA_COMPX_MASK precedence in sampled-SA prefetch | #73 |
| 000c0fd | perf(fmi): bump SMEM_LOCKSTEP_N from 8 to 16 | #75 |
| b3a665e | fix(bntseq): bound .alt parse buffer to prevent stack overflow | #74 |
| af33cdd | feat(bns): convert mem_matesw_batch_{pre,post} to bns_fetch_seq_v2 | #76 |
| 9bb277a | Update index.md | #79 |
| fdb244d | perf(libsais_build): skip wasted zero-init on unpack + SA buffers | #80 |
| ff95a4f | perf(ksort): replace per-call malloc with on-stack buffer for small n | #78 |
| 7caf77c | perf(ungapped): closed-form HIT for total_mis == 0 | #77 |
| e65ceb2 | fix(profiling): clamp display_stats nthreads to LIM_C | #81 |
| ddfb0da | feat(shm): serialize /bwactl RMW with a POSIX named semaphore | #82 |
| b9e0b66 | feat(simd): replace multi-binary execv launcher with single-binary in-process dispatch | #83 |
| 7d27f23 | perf(build): default x86 single-binary baseline to avx2 (was sse41) | #84 |
| 316dba6 | fix(matesw): copy ref slice before ksw_align2 to avoid SIGSEGV on shm-backed ref_string | #85 |
| 427c81c | perf(fmi): inline backwardExt to recover gcc 12+ wall-clock regression | #88 |
| c96d31a | perf(x86): cap avx512bw autovec at 256-bit; bwa_shm /dev/shm preflight | #86 |
| 23f528d | ci: migrate parity tests from dwgsim/phiX174 to holodeck/chr22 | #89 |
| ec67b09 | feat(meth): emit Bismark-compatible XR/XG/XM auxiliary tags | #90 |
| 652ce0f | docs(install): list autoconf/automake/libomp/zlib system prereqs | #93 |
| 296b1b9 | docs(install): fix RHEL/Fedora package name pkgconfig → pkgconf-pkg-config | #94 |
| dc7fcfe | feat(simd): add SIMD host-floor precheck for multi-arch deployment | #95 |
| 3bc64b0 | docs: pre-release documentation pass for v0.2.0-pre | #96 |
| 27a60c9 | chore(release): prep v0.2.0 release notes and metadata | #97 |
Additional fork-level changes
- Vendored mimalloc allocator: ext/mimalloc is pinned at v3.3.0 and linked into every binary by default (USE_MIMALLOC=1). Linux uses --whole-archive static linkage; macOS uses dyld-interposed shared linkage. USE_MIMALLOC=1 is the supported and recommended default on all platforms; USE_MIMALLOC=0 is provided as a best-effort opt-out and is CI-gated on Linux x86 only. See Features for details.
- --supp-rep-hard-cap INT (opt-in, default disabled): forces MAPQ=0 on supplementary alignments whose chain contains a seed with >= INT genome occurrences. Addresses the long-standing bwa/bwa-mem2 issue where a supp fragment that maps to many places standalone (e.g. a short read in a CCATCC repeat) inherits a high MAPQ from its primary because the supp’s competing repetitive chains get filtered out during the full-read pipeline and therefore never contribute to its sub/sub_n. See upstream #260 for the reporter case. Primary MAPQ is unaffected; default output is byte-identical to stock bwa-mem2. Typical values are 5–20 (lower = more aggressive); the upstream #260 repro drops from MAPQ=60 to MAPQ=0 at --supp-rep-hard-cap 18.
Version stamping
PACKAGE_VERSION (the value reported by bwa-mem3 version and written to
the @PG VN: SAM header field) is generated at build time by the Makefile
from git describe --tags --dirty, e.g. v2.3-30-g61813ef for a tree 30
commits past upstream tag v2.3 at commit 61813ef.
- No manual bumping required: cut a fresh release by tagging the commit (git tag -a vX.Y-fg-labs.N -m ...) and the next build picks it up.
- Builds where git describe --tags fails (source-tarball extractions, or shallow clones / checkouts with no tag reachable from HEAD, including CI’s default actions/checkout fetch-depth of 1) fall back to the static FG_LABS_VERSION_FALLBACK in Makefile. Bump that when cutting a release that will be consumed as a tarball, or in CI artifacts.
- src/version.h is generated and .gitignored; make clean removes it.
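When post-processing BAM headers or build logs, it can help to split a git describe style version back into its parts. The following Python sketch is illustrative only: parse_describe is a hypothetical helper, and it assumes the standard tag-count-gSHA layout with an optional -dirty suffix.

```python
import re

# <tag>-<commits ahead>-g<abbreviated sha>, as produced by `git describe --tags`.
DESCRIBE = re.compile(r"^(?P<tag>.+)-(?P<ahead>\d+)-g(?P<sha>[0-9a-f]+)$")


def parse_describe(version: str) -> dict:
    """Split a describe-style version string into tag, ahead, sha, dirty."""
    dirty = version.endswith("-dirty")
    if dirty:
        version = version[: -len("-dirty")]
    match = DESCRIBE.match(version)
    if match is None:
        # Built exactly on a tag (or from a fallback string): no suffix.
        return {"tag": version, "ahead": 0, "sha": None, "dirty": dirty}
    return {
        "tag": match["tag"],
        "ahead": int(match["ahead"]),
        "sha": match["sha"],
        "dirty": dirty,
    }


parsed = parse_describe("v2.3-30-g61813ef")
```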
Branching and update policy
- master tracks upstream unchanged.
- main is upstream/master plus the commits above. Rebased onto upstream roughly quarterly, or sooner when an upstream release we care about lands.
- Contributions go via PR targeting main. CI and CodeRabbit gate merges.
- Any PR that adds or removes a fork-carried commit must update the table above in the same PR.
Consuming
Clone this repo and check out main:
git clone https://github.com/fg-labs/bwa-mem3.git
cd bwa-mem3
git checkout main
Or vendor the branch into a downstream repo by pinning to a specific commit (not the branch tip) so your build is reproducible.
Relationship to upstream
We submit the generally-useful fixes and features carried here as PRs against
bwa-mem2/bwa-mem2 when the upstream
maintainers are actively merging; while they are not, fixes land here first
and we drop them from main once they appear upstream.
See also: Correctness fixes · Performance improvements · Features · Upstream PR status · Developer Guide → Contributing
Correctness Fixes
This page documents bugs present in upstream bwa-mem2 that bwa-mem3 fixes. Each
fix is isolated to a single PR so it can be reviewed independently and dropped
from main once upstream merges the equivalent patch.
@PG CL: tab escaping (PR #54)
When a read-group string is passed via -R '@RG\tID:x\tSM:y', the tab
characters in the argument were copied verbatim into the @PG CL: SAM header
field. The SAM specification uses tabs as field delimiters, so the resulting
header line appeared to have extra ID: and other tag fields embedded inside
CL:. Lenient parsers (samtools, htsjdk) tolerated the output; strict parsers
(noodles, some fgbio configurations) rejected the file as malformed.
The fix replaces each tab character with a space when building the @PG CL:
value in src/main.cpp. The @RG line itself is not modified, so the
read-group metadata is preserved correctly. A regression shell test
(test/pg_cl_escape_test.sh) asserts that the @PG line contains exactly
five tab-separated fields after the fix. Upstream issue reference:
bwa-mem2#293.
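The effect of the fix can be reproduced in miniature. This Python sketch is not the src/main.cpp code: build_pg_line is a hypothetical helper, the version string is a placeholder, and the ID/PN/VN/CL field order is assumed from the @PG example elsewhere in this guide.

```python
def build_pg_line(argv, version):
    """Build an @PG line whose CL: value contains no literal tabs."""
    command_line = " ".join(argv).replace("\t", " ")
    fields = [
        "@PG",
        "ID:bwa-mem3",
        "PN:bwa-mem3",
        f"VN:{version}",
        f"CL:{command_line}",
    ]
    return "\t".join(fields)


# -R has already had its \t escapes expanded to real tabs at this point.
argv = ["bwa-mem3", "mem", "-R", "@RG\tID:x\tSM:y", "ref.fa", "reads.fq"]
pg_line = build_pg_line(argv, "0.0.0-example")
field_count = len(pg_line.split("\t"))
```

Counting tab-separated fields is the same property the regression test described above asserts: the record tag plus four fields, five in total.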
SMEM buffer overflow on reads longer than 151 bp (PR #55)
bwa-mem2 hardcoded READ_LEN 151 in src/macro.h to size the per-thread
matchArray SMEM buffer at compile time. The FMI walk wrote past this buffer
without bounds checking when reads exceeded 151 bp, causing memory corruption
that manifested as segfaults or silent wrong output on 300 bp MiSeq reads,
error-corrected long reads, and any run with a non-default -k that extended
seed length.
A second cap, MAX_READ_LEN_FOR_LOCKSTEP 512, guarded the lockstep driver’s
per-slot stack arrays with a hard assert that aborted on anything longer.
The fix eliminates both compile-time caps. Every per-thread SMEM buffer is now
heap-allocated on the memory management context (mmc) and grown on demand
from each batch’s observed max_readlength. The pre-walk grow in
mem_collect_smem sizes matchArray[tid] to BATCH_MUL * BATCH_SIZE * max_readlength, and all array writes are bounds-checked with a structured
smem_overflow_die on overflow. Regression tests cover 300 bp, 1 kbp, and
3 kbp phiX reads; all three segfaulted before the fix and produce correct
NM:i:0 alignments after. Upstream references:
bwa-mem2#210 (issue),
bwa-mem2#238
(closed unmerged upstream PR).
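The grow-on-demand and bounds-check pattern can be modeled as below. This Python sketch is only an analogy for the C++ buffers: SmemBuffer and its methods are hypothetical, the multiplier values in the example are invented, and OverflowError stands in for smem_overflow_die.

```python
class SmemBuffer:
    """Toy model of a per-thread buffer sized from the observed read length."""

    def __init__(self):
        self.slots = []
        self.used = 0

    def ensure_capacity(self, batch_mul, batch_size, max_readlength):
        """Grow (never shrink) to batch_mul * batch_size * max_readlength."""
        needed = batch_mul * batch_size * max_readlength
        if needed > len(self.slots):
            self.slots.extend([None] * (needed - len(self.slots)))

    def push(self, item):
        """Bounds-checked write; fails loudly instead of overrunning."""
        if self.used >= len(self.slots):
            raise OverflowError("SMEM buffer overflow")
        self.slots[self.used] = item
        self.used += 1


buf = SmemBuffer()
buf.ensure_capacity(2, 4, 151)   # illustrative multipliers, 151 bp batch
short_capacity = len(buf.slots)
buf.ensure_capacity(2, 4, 300)   # a later batch with 300 bp reads
long_capacity = len(buf.slots)
```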
kswv nrow==0 guard (PR #51)
When a SIMD batch contained only padding pairs (all len1 == 0), the DP loop
never executed and nrow was zero. The post-loop rowMax + (i-1) * SIMD_WIDTH
store still executed, walking SIMD_WIDTH bytes before the beginning of the
rowMax allocation. On glibc this produced a free(): invalid pointer abort;
on macOS libc it silently corrupted the heap.
The fix wraps the post-loop store in an if (i > 0) guard on all five SIMD
kswv kernels: NEON u8, NEON 16, AVX2 u8, AVX-512BW u8, and AVX-512BW 16. The
upstream patch bwa-mem2#289
covered only the two AVX-512BW kernels; bwa-mem3 broadens it to the three
additional kernels carried in this fork. A dedicated regression test
(test/kswv_nrow_zero_test.cpp) builds all-padding batches and verifies each
kernel is clean under AddressSanitizer.
kswv score2 plateau series (PRs #26, #27, #28, #29, #30, #31)
The batched mate-rescue Smith-Waterman path (kswv) contains a family of
related bugs across its SIMD kernels that inflated the suboptimal score
(score2 / XS) and consequently deflated MAPQ relative to upstream
bwa-mem2.
AVX-512BW dispatch guard (PR #26). GCC with -mavx512bw automatically
defines __AVX2__, so the #elif __AVX2__ branch in src/kswv.h and
src/kswv.cpp matched first on every AVX-512BW build. The 256-bit AVX2 kernel
produced only 32-lane results into 64-lane score[]/te1[]/qe[] arrays
sized for AVX-512BW; the upper 32 lanes held uninitialized values.
mem_matesw_batch_post read those bogus te values, bwa_gen_cigar2
returned NULL, and mem_reg2aln triggered an a.cigar != NULL assertion on
every AVX-512BW dispatch host (AWS c7a, c7i). The fix qualifies the #elif __AVX2__ guard with !__AVX512BW__, matching the existing pattern in
bandedSWA.h. Closes issue #25.
AVX2 score2 plateau fix (PR #27 closed, PR #28 merged). The AVX2 256-bit
kswv kernel added in PR #20 used a dense SIMD max over every rowMax row to
compute the suboptimal score. Scalar ksw_u8 instead collapses consecutive
rows above minsc into a single b[] entry anchored at the max-score row,
then finds the best anchor outside the primary region. The dense max pulled in
tail rows from a plateau whose anchor sat inside the primary region, inflating
XS by 1–4 on a minority of reads and reducing MAPQ by 2–18 on those reads.
PR #27 (closed) temporarily disabled the AVX2 batched path. PR #28 fixes the
kernel itself by replacing the dense scan with a per-lane scalar emulation of
the b[] build-and-scan logic.
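The difference between the two scans shows up on a small example. The Python sketch below follows the prose description above rather than the ksw source: both function names are hypothetical, the primary region is passed in as inclusive row bounds low and high, and the row scores are made up for illustration.

```python
def dense_score2(row_max, low, high):
    """Dense scan: best row score among rows outside [low, high]."""
    return max((s for i, s in enumerate(row_max) if i < low or i > high), default=0)


def plateau_score2(row_max, minsc, low, high):
    """Collapse each run of rows >= minsc to one anchor, then scan anchors."""
    anchors = []  # (score, row) of the max-score row in each run
    in_run = False
    for i, s in enumerate(row_max):
        if s >= minsc:
            if not in_run:
                anchors.append((s, i))
                in_run = True
            elif s > anchors[-1][0]:
                anchors[-1] = (s, i)
        else:
            in_run = False
    return max((s for s, i in anchors if i < low or i > high), default=0)


# One plateau (rows 2-7) peaking at row 4, plus a separate hit at row 10.
row_max = [0, 0, 10, 12, 50, 48, 47, 46, 0, 0, 44, 0]
dense = dense_score2(row_max, low=2, high=6)
plateau = plateau_score2(row_max, minsc=5, low=2, high=6)
```

Here the dense scan picks up row 7, a tail of the primary plateau that lies just outside the region, while the anchor-based scan skips it because the plateau's anchor (row 4) is inside the region. The suboptimal score is therefore 46 under the dense scan and 44 under the anchor scan.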
NEON and AVX-512BW 8-bit port (PR #29). The same dense-rowMax score2 scan
existed in kswv_neon_u8 and kswv_512_u8. Confirmed on ARM: rebuilding
smoke-1M on darwin/arm64 pre-fix produced the identical four MAPQ regressions
as the AVX2 case. PR #29 ports the per-lane scalar b[]-emulation fix to both
kernels.
AVX-512BW 16-bit port (PR #30). kswv_512_16 carried four bugs: the same
dense-rowMax plateau pattern, aggregate maxl/minh bounds instead of
per-lane bounds (a gap from PR #21), no minsc filter, and no qe mask. The
per-lane scalar emulation from PR #29 fixes all four naturally.
NEON 16-bit rewrite (PR #31). kswv_neon_16 was effectively dead code
before this PR. Five interacting bugs produced 20,435 BAM diffs vs scalar
reference on smoke-1M -A 2: the score table reinterpreted int16 xor
indices as int8 lookups (inflating match scores by ~256 per cell), the table
was too small for the 16-bit SoA encoding, rowMax was never written, the
early-exit fired on row 0 for all pairs without a KSW_XSTOP target, and all
the fix-3 class bugs from PRs #28–#30 were missing. The PR rewrites the kernel
from scratch against kswv_neon_u8’s structure using 32-byte int8 tables
indexed via vqtbl2_s8, per-lane freeze, exit0 bitmap, and per-lane scalar
score2.
kseq2bseq1 zero-initialization (PR #22)
bseq_read_orig grows its sequence buffer with realloc, leaving tail entries
uninitialized. kseq2bseq1 populated only name, comment, seq, qual,
and l_seq for each entry, leaving sam, bams, n_bams, and cap_bams at
whatever values realloc happened to return. PR #13 added an unconditional
free(ret->seqs[i].bams) in the output loop (fastmap.cpp:571), which turned
those garbage values into a crash — a pointer being freed was not allocated
abort under system malloc and a SIGSEGV under mimalloc — once input exceeded
the initial 256-sequence allocation. The crash was deterministic and
reproducible with -t1.
The fix is a single memset(s, 0, sizeof(*s)) at the top of kseq2bseq1.
Proper-pair flag from emitted alignment (PR #17)
In the no_pairing emission path of mem_sam_pe and mem_sam_pe_batch_post,
the proper-pair bit (0x2) was computed from a[i].a[0].rb regardless of
which alignment was actually emitted. When the primary’s alignment score fell
below the reporting threshold opt->T but a non-primary ALT hit cleared it,
mem_reg2aln emitted a[i].a[n_pri[i]] while mem_infer_dir still read the
below-threshold primary. In that case the SAM flag did not reflect the
coordinates in the record.
The fix stores the selected alignment index per mate in a which[2] array and
passes a[i].a[which[i]].rb to mem_infer_dir, ensuring the proper-pair
flag always matches the emitted record. The bug was present in the bwa-mem2
initial commit from 2019. Upstream reference: pre-existing bug, no open
upstream PR at time of merge.
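A simplified sketch of the idea, under stated assumptions: `Reg` is a hypothetical reduction of `mem_alnreg_t` to two fields, and `select_emitted` is an illustrative helper that mirrors the case described above (primary below the threshold, non-primary hit at `n_pri` above it). It is not the actual `mem_sam_pe` code.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical, reduced stand-in for mem_alnreg_t.
struct Reg {
    std::int64_t rb;     // reference begin of the alignment
    int          score;  // alignment score
};

// Index of the alignment that would be emitted for one mate: the primary
// (index 0) if it clears threshold T, otherwise the hit at n_pri if that
// one does, otherwise -1. Assumes n_pri >= 0.
inline int select_emitted(const std::vector<Reg> &a, int n_pri, int T) {
    if (!a.empty() && a[0].score >= T) return 0;
    if (n_pri < static_cast<int>(a.size()) && a[n_pri].score >= T) return n_pri;
    return -1;
}

// Coordinate handed to the pair-orientation check. The fix reads it from
// the emitted record; the bug always read a[0].rb. Assumes which >= 0.
inline std::int64_t rb_for_pairing(const std::vector<Reg> &a, int which) {
    return a[static_cast<std::size_t>(which)].rb;
}
```

Storing the result of the selection once per mate (the `which[2]` array in the text) keeps the flag computation and the emitted coordinates tied to the same record.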
Changes catalog
| Item | bwa-mem3 PR | Upstream PR/issue | Status |
|---|---|---|---|
| @PG CL: tab escape | #54 | bwa-mem2#293 | fork-only (open upstream issue) |
| SMEM buffer overflow on >151 bp reads | #55 | bwa-mem2#238, bwa-mem2#210 | fork-only (upstream PR closed unmerged) |
| kswv nrow==0 guard | #51 | bwa-mem2#289 | fork-only (upstream PR open) |
| AVX-512BW dispatch guard | #26 | — | fork-only |
| AVX2 score2 plateau disable (superseded) | #27 | — | closed (superseded by #28) |
| AVX2 score2 plateau fix | #28 | — | fork-only |
| NEON + AVX-512BW 8-bit score2 fix | #29 | — | fork-only |
| AVX-512BW 16-bit score2 fix | #30 | — | fork-only |
| NEON 16-bit kernel rewrite | #31 | — | fork-only |
| kseq2bseq1 zero-initialization | #22 | — | fork-only |
| Proper-pair flag from emitted alignment | #17 | — | fork-only |
See also: Performance improvements · Architecture support · Upstream PR status · Developer Guide → Regression test framework · Performance → SIMD dispatch matrix
Performance Improvements
This page covers the performance work carried in bwa-mem3 on top of upstream bwa-mem2. Every change listed here preserves byte-identical SAM/BAM output vs the upstream baseline it was benchmarked against.
For current benchmark numbers across architectures and workloads, see bwa-mem3-bench, the canonical source of truth for benchmark methodology and results.
Lockstep SMEM batching (PR #33)
Seeding in bwa-mem2 advances one read’s SMEM walk at a time. Because each
forward/backward extension step issues a random access into the cp_occ
checkpoint array (~4 GB for human genome), the CPU stalls on cache misses
between steps. Lockstep batching advances SMEM_LOCKSTEP_N reads’ SMEM walks
in slot-interleaved round-robin order so that the out-of-order engine can
overlap the cp_occ cache-miss loads for read i+N with the compute-bound
walk of read i.
Each read slot (BatchSlot) carries its own prev[] walk buffer and
match_buf[] reorder buffer. A tight recycling loop assigns finished slots to
the next unprocessed read immediately. The match-emit cursor enforces
input-index order so output is byte-identical to scalar. SMEM_LOCKSTEP_N is
compile-time tunable; N=1 dispatches to the unchanged scalar path for
bisection.
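The scheduling pattern (round-robin slots, immediate recycling, in-order emission) can be modelled without any FM-index code. The sketch below is a toy model, not the bwa-mem3 implementation: each read is reduced to a step count, and `run_lockstep`, `LockstepResult`, and `Slot` are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

struct LockstepResult {
    std::vector<int> completion_order;  // order in which walks finished
    std::vector<int> emit_order;        // order in which results were released
};

// steps[i] is the number of extension steps read i needs.
inline LockstepResult run_lockstep(const std::vector<int> &steps, std::size_t n_slots) {
    struct Slot { int read = -1; int remaining = 0; };
    if (n_slots == 0) n_slots = 1;                 // degenerate input: use one slot
    std::vector<Slot> slots(n_slots);
    std::vector<bool> done(steps.size(), false);   // stand-in for the reorder buffer
    LockstepResult out;
    std::size_t next_read = 0, emit_cursor = 0;

    // Hand the next unprocessed read to a free slot, or mark the slot idle.
    auto refill = [&](Slot &s) {
        if (next_read < steps.size()) {
            s.read = static_cast<int>(next_read);
            s.remaining = steps[next_read];
            ++next_read;
        } else {
            s.read = -1;
        }
    };
    for (Slot &s : slots) refill(s);

    while (emit_cursor < steps.size()) {
        for (Slot &s : slots) {                    // one round: one step per active slot
            if (s.read < 0) continue;
            if (--s.remaining > 0) continue;       // walk not finished yet
            done[static_cast<std::size_t>(s.read)] = true;
            out.completion_order.push_back(s.read);
            refill(s);                             // recycle the slot immediately
            while (emit_cursor < steps.size() && done[emit_cursor]) {
                out.emit_order.push_back(static_cast<int>(emit_cursor));
                ++emit_cursor;                     // emit strictly in input order
            }
        }
    }
    return out;
}
```

Reads may finish out of order (a short read in one slot completes while a long read still occupies another), but the emit cursor only advances over a contiguous prefix of finished reads, which is what keeps the output order identical to the one-read-at-a-time path.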
Measured improvement on 150 bp NovaSeq WGS (1M pairs, hg38, Graviton3 r7g.4xlarge,
8 threads): −6.1% wall time (82 s → 77 s). The backwardExt hot
cp_occ load share dropped from 65.5% to 53.3% of function time — direct
evidence that the OoO engine is overlapping cross-slot loads. On 300 bp MiSeq
reads the workload is SW-dominated (~85% of cycles in kswv kernels) and the
SMEM improvement is within noise; parity holds.
Supersedes PR #15 (cross-read _mm_prefetch shape), which regressed on
Graviton3.
Batched -H header ingestion (PR #49, closes issue #37)
Passing a large header file via -H <file> re-ran strlen on the growing
header string and called realloc on every input line, making ingestion O(n²)
in the number of header lines. For a ~70 MB / ~1.5 M-line header (reported in
upstream bwa-mem2#204) this
caused runtimes exceeding 10 minutes before alignment started.
The fix introduces bwa_insert_header_file, a batched helper that determines
the file size with fseek/ftell, allocates a single buffer, copies all
@-prefixed lines in one pass, and calls bwa_insert_header once. The fix
also addresses four correctness gaps in the upstream PR #204: the return-value
assignment was dropped (leaving hdr_line stale after realloc), const FILE*
caused compiler warnings, empty files were not guarded, and each fgets was
not bounded by remaining buffer. A regression test
(test/header_insert_test.cpp) diffs the batched path against the pre-patch
per-line baseline across eight edge cases.
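The batched approach can be sketched as follows. This is an illustration of the technique only: `collect_header_lines` is a hypothetical helper, it returns a `std::string` rather than calling `bwa_insert_header`, and it assumes the file is smaller than `INT_MAX` bytes.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Collects every '@'-prefixed line of `fp` in one pass. The output is
// reserved once from the fseek/ftell size, so appending never reallocates.
// Returns an empty string for an empty or unseekable file.
inline std::string collect_header_lines(std::FILE *fp) {
    if (!fp || std::fseek(fp, 0, SEEK_END) != 0) return {};
    long size = std::ftell(fp);
    if (size <= 0) return {};                      // empty-file guard
    std::rewind(fp);

    std::string out;
    out.reserve(static_cast<std::size_t>(size));   // single up-front reservation
    std::vector<char> line(static_cast<std::size_t>(size) + 1);
    // Each fgets is bounded by the buffer size, which covers the whole file.
    while (std::fgets(line.data(), static_cast<int>(line.size()), fp)) {
        if (line[0] != '@') continue;              // keep header lines only
        out.append(line.data());
    }
    // Drop the trailing newline so the result can be inserted as one block.
    while (!out.empty() && (out.back() == '\n' || out.back() == '\r')) out.pop_back();
    return out;
}
```

The per-line cost is now proportional to the line length rather than to the length of everything read so far, which removes the quadratic term.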
libsais FM-index construction (PR #57)
bwa-mem3 index now builds the FM-index using
libsais v2.9.1 (Ilya Grebnov)
instead of the sais-lite (Yuta Mori saisxx) library that bwa-mem2 inherited.
libsais is actively maintained, supports OpenMP-parallel induced sorting, and
produces a byte-identical FM-index. No changes are required to existing
indexes — bwa-mem3 reads index files built by bwa-mem2 index without
re-indexing.
For a human reference (GRCh38 + decoys), libsais reduces indexing wall time and peak memory vs sais-lite. Exact numbers depend on thread count and available RAM; see the PR body for measurements on Graviton3.
Consolidated mapping speedups (PR #58)
PR #58 is a multi-phase performance audit of bwa-mem2’s hot path, squashed and
rebased onto main. It incorporates improvements across five subsystems:
- ksw2 banded SW — tuned the band extension loop to reduce redundant computation in the common case.
- SMEM lockstep batching — additional refinements on top of PR #33.
- SAL prefetch — prefetch hints for the suffix array lookup hot path.
- SAM record building — reduced per-record allocation in the text formatting path.
- PGO build — the opt-in profile-guided optimization target (see also Performance → PGO build) is included in this suite.
On the smoke-1M workload (1M PE 150 bp reads, hg38, Graviton3 r7g.4xlarge, 16
threads, warm page cache), this PR contributed the largest single-step wall
time reduction in the main branch’s performance history. Benchmark details
are maintained at bwa-mem3-bench.
Changes catalog
| Item | bwa-mem3 PR | Upstream PR/issue | Status |
|---|---|---|---|
| Lockstep SMEM batching | #33 | — | fork-only |
| Batched -H header ingestion | #49 | bwa-mem2#204 | fork-only (upstream PR open) |
| Large header performance (issue) | — | issue #37 | closed by #49 |
| libsais FM-index construction | #57 | — | fork-only |
| Consolidated mapping speedups | #58 | — | fork-only |
See also: Performance → Overview · Performance → PGO build · Correctness fixes · Build & infrastructure · bwa-mem3-bench
Features
This page covers user-facing features added to bwa-mem3 on top of upstream
bwa-mem2. None of these features change default behavior: output produced by
bwa-mem3 mem without any of these flags is byte-identical to the
corresponding bwa-mem2 output (except for the @PG ID: and PN: fields
which now read bwa-mem3).
--meth bisulfite alignment mode (PR #13)
--meth turns bwa-mem3 index and bwa-mem3 mem into a single-binary
drop-in replacement for the entire
bwameth.py pipeline. No Python, no
separate post-processing step, no bwameth.py dependency.
bwa-mem3 index --meth ref.fa # once per reference
bwa-mem3 mem --meth ref.fa R1.fq R2.fq | samtools sort -o out.bam
index --meth writes <ref>.bwameth.c2t — a doubled reference with
f/r-prefixed contigs and C→T / G→A projection, byte-identical to the
index that bwameth.py index-mem2 produces.
mem --meth performs the following steps:
- Converts R1 C→T and R2 G→A inline before seeding, stashing the pre-conversion bases on an internal YS:Z / YC:Z carrier in bseq1_t.comment (both are suppressed at BAM emit).
- Consolidates the f/r contig pairs back to one @SQ per real chromosome.
- Emits Bismark-compatible XR:Z (read conversion direction), XG:Z (genome strand), and XM:Z (per-base methylation call string) auxiliary tags on every record.
- Optionally applies a chimera QC heuristic when --chimera-qc is passed: if the longest M/=/X run is < 44% of the read length, it sets 0x200, clears proper-pair 0x2, and caps MAPQ at 1.
- Copies the internal pre-conversion sequence back into the BAM SEQ field for CpG-calling tools.
- Writes a @PG ID:bwa-mem3-meth entry.
On the bwameth.py example fixture (92,684 reads), end-to-end output is
byte-identical on chrom, pos, CIGAR, and SEQ vs the bwameth.py oracle. Stacks
on PR #12 (--bam). See the
Methylation Reference for full details.
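The chimera QC rule described above is simple enough to sketch. This is an illustration under assumptions, not the bwa-mem3 code: `Rec`, `longest_match_run`, and `apply_chimera_qc` are hypothetical names, the CIGAR is taken as text, and each M, = or X operation is treated as its own run (adjacent = and X operations are not merged).

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical, reduced SAM record: just the fields the heuristic touches.
struct Rec {
    unsigned flag;
    int      mapq;
};

// Length of the longest single M, = or X operation in a CIGAR string.
inline int longest_match_run(const std::string &cigar) {
    int best = 0, len = 0;
    for (char c : cigar) {
        if (std::isdigit(static_cast<unsigned char>(c))) {
            len = len * 10 + (c - '0');
            continue;
        }
        if (c == 'M' || c == '=' || c == 'X') best = std::max(best, len);
        len = 0;
    }
    return best;
}

// If the longest match run covers less than 44% of the read, mark the
// record QC-fail (0x200), clear proper-pair (0x2), and cap MAPQ at 1.
inline void apply_chimera_qc(Rec &r, const std::string &cigar, int read_len) {
    if (longest_match_run(cigar) * 100 < 44 * read_len) {
        r.flag |= 0x200u;
        r.flag &= ~0x2u;
        r.mapq = std::min(r.mapq, 1);
    }
}
```

For a 100 bp read, `40M60S` falls below the threshold and is flagged, while `44M56S` sits exactly at 44% and is left untouched.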
Vendored mimalloc allocator (PR #19)
bwa-mem3 vendors mimalloc v3.3.0 as a
pinned submodule at ext/mimalloc and links it into every binary by default
(USE_MIMALLOC=1). On Linux, static linkage uses --whole-archive; on macOS,
dyld-interposed shared linkage is used.
Measured on AWS c7g.4xlarge (Graviton3, 16 threads, 29M 150 bp paired-end
exome-capture reads vs hg38, page cache dropped between iterations):
−19.7% wall-clock time (528.6 s → 424.7 s, a 1.24× speedup) compared to the same build
with USE_MIMALLOC=0. No user-visible interface change; no runtime
configuration required.
USE_MIMALLOC=0 is a supported best-effort opt-out and is CI-gated on Linux
x86. bwa-mem3 version prints the mimalloc version string when it is active.
--supp-rep-hard-cap supplementary MAPQ rescoring (PR #56)
Supplementary alignments for a split read inherit MAPQ from the full-read
scoring pipeline. Competing repetitive chains for the supplementary fragment
are filtered out during full-read chain scoring (mem_chain_flt) before
Smith-Waterman, so they never contribute to sub/sub_n. A supp fragment
landing in a CCATCC repeat that would map equally well to 50+ locations
standalone can therefore carry MAPQ=60 from its primary.
--supp-rep-hard-cap INT opts into rescoring: if any seed in a supplementary
alignment’s chain has >=INT genome occurrences (from the SMEM SA count), the
supplementary MAPQ is forced to 0. Primary alignment MAPQ and coordinates are
unaffected. Default output (no flag) is byte-identical to upstream bwa-mem2.
The SMEM SA-occurrence count is preserved on each seed as mem_seed_t.n_hits
and propagated to mem_alnreg_t.chain_n_hits during chain-to-alignment
conversion. Typical values for INT are 5–20; lower is more aggressive. The
upstream bwa-mem2#260
reporter case drops from MAPQ=60 to MAPQ=0 at --supp-rep-hard-cap 18.
Closes issue #46.
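The rescoring rule reduces to a single scan over the per-seed occurrence counts. The sketch below is illustrative only: `rescore_supp_mapq` is a hypothetical function, the seed counts are passed as a plain vector instead of being read from `mem_seed_t.n_hits`, and treating `hard_cap <= 0` as "option not given" is an assumption of this sketch.

```cpp
#include <vector>

// Returns the MAPQ to report. Only supplementary alignments are affected:
// if any seed in the chain has at least hard_cap genome occurrences, the
// MAPQ is forced to 0. hard_cap <= 0 disables the rule (assumed convention).
inline int rescore_supp_mapq(int mapq, bool is_supplementary,
                             const std::vector<long> &seed_n_hits, long hard_cap) {
    if (!is_supplementary || hard_cap <= 0) return mapq;
    for (long n : seed_n_hits)
        if (n >= hard_cap) return 0;
    return mapq;
}
```

A supplementary alignment with seed counts {1, 52, 3} drops to MAPQ 0 at a cap of 18, while the same counts on a primary alignment leave its MAPQ unchanged.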
Shared-memory index: bwa-mem3 shm (PR #65)
bwa-mem3 mem reloads the FM-index from disk on every invocation. For hg38
the index is ~28 GB; for short alignment jobs (targeted panels, small sample
batches) this load cost dominates runtime and makes per-invocation IOPS the
bottleneck.
PR #65 ports the bwa shm command from bwa-mem v1 to bwa-mem3 with strict v1
CLI parity:
bwa-mem3 shm <index-prefix> # load index into shared-memory segment once
bwa-mem3 mem <index-prefix> ... # subsequent runs attach instead of re-reading
bwa-mem3 shm -d <index-prefix> # detach and free the segment
The index lives in a POSIX shared-memory segment. Multiple bwa-mem3 mem
processes on the same host share the same in-memory copy. Closes
issue #64.
Warning — Stale index
bwa-mem3 shm does not detect when the on-disk index has been rebuilt. Always run bwa-mem3 shm -d <prefix> before running bwa-mem3 index and then re-stage with bwa-mem3 shm <prefix>. Using a stale shared-memory segment produces silently wrong alignments.
bwa-mem3 shm --meth (PR #67)
bwa-mem3 mem --meth <prefix> auto-appends .bwameth.c2t to locate the
methylation index built by bwa-mem3 index --meth <prefix>. Before PR #67,
staging a methylation index in shared memory required passing the full
.bwameth.c2t-suffixed path to shm while continuing to pass the plain
prefix to mem. The mismatch was easy to forget, and the failure mode — a
run that silently attached the wrong segment — was difficult to diagnose.
PR #67 adds --meth support to bwa-mem3 shm so the same plain-prefix
convention works end-to-end:
bwa-mem3 shm --meth ref.fa # stages ref.fa.bwameth.c2t
bwa-mem3 mem --meth ref.fa ... # attaches automatically
bwa-mem3 shm -d --meth ref.fa # detaches
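The symmetry comes from both subcommands resolving the prefix the same way. A minimal sketch of that idea, with `resolve_index_prefix` as a hypothetical helper name rather than a function from the bwa-mem3 source:

```cpp
#include <string>

// Maps the user-supplied prefix to the index that is actually staged or
// attached. With meth set, the .bwameth.c2t suffix from the text is appended.
inline std::string resolve_index_prefix(const std::string &prefix, bool meth) {
    return meth ? prefix + ".bwameth.c2t" : prefix;
}
```

If shm and mem both derive the index name through one shared rule like this, the staged segment and the attached segment cannot disagree for the same command-line prefix.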
HN:i hit count tag (PR #42)
Every primary SAM/BAM record now carries an HN:i:<n> tag reporting the
number of secondary alignment candidates clustered with this primary under
XA_drop_ratio. This count is captured before the -h/max_XA_hits cap
truncates the XA:Z: string, so HN reports the true number of alternate
loci even when no XA:Z: field appears in the record.
This makes it possible to distinguish:
- HN:i:0 + no XA:Z: field: genuinely unique mapper.
- HN:i:N + XA:Z:... with N ≤ -h: multi-mapper with all alternates listed.
- HN:i:N + no XA:Z: field, with N > -h: multi-mapper whose alternates were suppressed by the cap.
Motivated by lh3/bwa#438, which adds
HN to bwa aln. HN is emitted in both SAM (mem_aln2sam) and BAM
(mem_aln_to_bam) paths and is absent when -a (MEM_F_ALL) is active.
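A downstream consumer can turn the three cases into a small classifier. The sketch below is an illustration for tool authors reading these tags, not code from bwa-mem3; `HitClass` and `classify_primary` are hypothetical names, and the caller is assumed to have already parsed the HN value and checked for the presence of XA:Z.

```cpp
enum class HitClass {
    Unique,     // HN:i:0, no XA:Z
    AllListed,  // HN:i:N with XA:Z present
    Capped      // HN:i:N without XA:Z (alternates suppressed by the -h cap)
};

// hn: value of the HN:i tag on a primary record.
// has_xa: whether the record carries an XA:Z tag.
inline HitClass classify_primary(int hn, bool has_xa) {
    if (hn == 0) return HitClass::Unique;
    return has_xa ? HitClass::AllListed : HitClass::Capped;
}
```

Without HN, the first and third cases look identical in the record, which is the ambiguity the tag removes.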
--bam=LEVEL direct BAM output (PR #12)
bwa-mem3 mem --bam (or --bam=0 through --bam=9) emits BAM directly via
htslib, bypassing the SAM-text-to-BAM conversion round trip that normally
occurs when the output is piped to samtools view -bS.
- --bam / --bam=0: uncompressed BAM (BGZF framing only), with near-zero CPU overhead, smaller output than SAM text, and fast downstream parsing.
- --bam=1..9: BGZF deflate at the specified level.
- No flag: SAM text on stdout (default, unchanged).
The implementation adds src/bam_writer.{h,cpp}, a new module that converts
mem_aln_t to bam1_t via mem_aln_to_bam. htslib v1.21 is pulled in as a
submodule at ext/htslib. On the bwameth.py example fixture (92,961 records),
samtools view of --bam output vs SAM text produces a zero-line diff across
all 11 SAM columns and all aux tags. See
Best Practices → Output format for the
recommended pipeline.
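The flag-to-level mapping in the list above can be sketched as a small parser. This is a hypothetical helper, not the option handling in bwa-mem3; in particular, returning -1 for "not a valid --bam option" is a convention invented for this sketch.

```cpp
#include <string>

// Maps one command-line argument to a BAM compression level:
//   "--bam" or "--bam=0"  -> 0 (uncompressed BAM)
//   "--bam=1" .. "--bam=9" -> that deflate level
//   anything else          -> -1 (caller keeps SAM text output)
inline int bam_level_from_arg(const std::string &arg) {
    if (arg == "--bam") return 0;
    const std::string key = "--bam=";
    if (arg.size() == key.size() + 1 && arg.compare(0, key.size(), key) == 0) {
        char c = arg[key.size()];
        if (c >= '0' && c <= '9') return c - '0';
    }
    return -1;
}
```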
Changes catalog
| Item | bwa-mem3 PR | Upstream PR/issue | Status |
|---|---|---|---|
| --meth bisulfite alignment mode | #13 | — | fork-only |
| Vendored mimalloc allocator | #19 | — | fork-only |
| --supp-rep-hard-cap MAPQ rescoring | #56 | bwa-mem2#260 | fork-only (upstream issue open) |
| bwa-mem3 shm shared-memory index | #65 | — | fork-only |
| shm --meth symmetry | #67 | — | fork-only |
| HN:i hit count tag | #42 | lh3/bwa#438 | fork-only (analogous to bwa aln) |
| --bam=LEVEL direct BAM output | #12 | — | fork-only |
See also: Methylation Reference → Overview · User Guide → Memory allocator · User Guide → Output: SAM/BAM, headers, tags · Getting Started → Quick start: shared-memory index · Best Practices → Output format
Architecture Support
This page covers the architecture-specific build and runtime work carried in bwa-mem3. The goal is a single codebase that builds cleanly on all supported targets and runs the best available SIMD kernels on each.
For the full dispatch matrix and runtime selection logic, see Performance → SIMD dispatch matrix and Developer Guide → SIMD dispatch architecture.
Linux ARM64 / aarch64 build (PR #1)
The Apple Silicon work that reached the fork in commit ae73227 gated ARM
behavior on $(UNAME_M) == arm64. On macOS, uname -m returns arm64. On
Linux ARM64, it returns aarch64. The Makefile’s ifeq check therefore fell
through to the x86 multi target on every Linux aarch64 host, failing with:
g++: error: unrecognized command-line option '-msse'
PR #1 introduces an IS_ARM variable ($(filter $(UNAME_M),arm64 aarch64))
that matches both names. All four architecture-conditional blocks in the
Makefile are rewritten to use IS_ARM: the NEON/sse2neon flag block, the x86
arch-specific block, the ARM64 single-binary build block, and the multi
target ARM64 short-circuit. The CI workflow is extended to trigger on pushes
to fg-main (the integration branch at the time of PR #1, renamed to main
in the 0.1.0-pre release) and adds an ubuntu-24.04-arm matrix row so the
aarch64 path is exercised on every PR.
arch=avx512bw explicit build target (PR #16)
The AVX-512 Smith-Waterman kernels in bwa-mem2 are guarded by the
__AVX512BW__ preprocessor macro — not __AVX512F__. The only way to
build them before this PR was arch=avx512, but the (then) make multi
rule emitted the dispatch binary as bwa-mem2.avx512bw. The build
selector (avx512), the preprocessor guard (__AVX512BW__), and the
dispatcher suffix (.avx512bw) disagreed.
PR #16 added arch=avx512bw as an explicit Makefile target with flags
-mavx512f -mavx512bw and switched the multi-binary make path to use
it. The legacy arch=avx512 was preserved as an alias with identical
flags. No C++ was changed; the fix was 11 insertions and 2 deletions in
the Makefile.
PR #83 has since replaced the multi-binary scheme with a single binary
that compiles each kernel TU at every supported tier and dispatches
in process; the avx512bw tier name and flag set survived the
transition unchanged, and the arch=avx512bw build target remains the
single-arch fallback for clusters with uniform AVX-512BW hardware. The
pre-#16 mismatch between selector, guard, and suffix is therefore
resolved in both the historical multi-binary layout and the current
single-binary layout.
This is a pure build-correctness fix: before PR #16, arch=avx512bw
and the legacy multi-binary build on AVX-512BW hardware silently
compiled the wrong kernel (see
Correctness → AVX-512BW dispatch guard for the
downstream effect).
NEON kswv mate-rescue (PR #18)
bwa-mem2 has a batched mate-rescue Smith-Waterman path (BWAMEM_BATCHED_MATESW)
that uses SIMD kswv kernels to score rescue candidates in parallel. On ARM64
the gate was __AVX512BW__, which is never true on NEON hardware. The NEON
kswv::getScores8 kernel existed in the source but was unreachable in
production.
PR #18 enables this path on ARM64 by replacing the __AVX512BW__ gate with a
new BWAMEM_BATCHED_MATESW macro that fires on NEON/Apple Silicon as well.
Along the way, four kernel bugs were found and fixed:
- te split: the te (target end) value needed separate hi/lo tracking for 16-lane u8 batches.
- Freeze mask: a frozen_vec mask now gates gmax/te/qe updates after KSW_XSTOP fires, preventing stale values from escaping to the score2 scan.
- Per-lane score2 exclusion: len1, low/high, and qe masks were not applied per-lane in Loop 1, allowing lanes without a valid primary to contribute spurious suboptimal scores.
- minsc filter on rowMax: sub-minsc plateau scores were leaking into score2 because the scalar ksw_u8 gating condition (imax >= minsc) was not replicated.
Measured on an M-series Mac (8 threads, 500k PE 100 bp reads on chr17): 1.42× speedup (−29.4% wall time) with byte-identical sorted SAM output.
AVX2 kswv mate-rescue (PR #20)
PR #18 enabled batched mate-rescue on ARM64. Most x86 production deployments
(AWS c6a, c6i, older Xeons) use AVX2 without AVX-512BW and were excluded from
the same gate. PR #20 extends the batched path to AVX2 by adding a 256-bit
kswv256_u8 kernel and widening BWAMEM_BATCHED_MATESW to fire on __AVX2__.
The AVX2 kernel is a direct port of the corrected NEON kernel from PR #18,
with an additional fix for per-lane te2 tracking (_mm256_blendv_epi8 on a
sign-extended 8→16 bit mask). Verified byte-identical sorted SAM vs the
pre-BWAMEM_BATCHED_MATESW scalar control on EC2 m5.xlarge (Skylake-SP, 4
threads, 500k chr17 PE pairs).
Note: PR #20 introduced a score2 plateau regression in the AVX2 kernel that was identified and fixed in the correctness series (PRs #27, #28, #29).
Changes catalog
| Item | bwa-mem3 PR | Upstream PR/issue | Status |
|---|---|---|---|
| Linux ARM64 / aarch64 build + CI | #1 | bwa-mem2#288 | fork-only (upstream PR open) |
| arch=avx512bw explicit target | #16 | — | fork-only |
| NEON kswv mate-rescue kernel | #18 | — | fork-only |
| AVX2 kswv mate-rescue kernel | #20 | — | fork-only |
See also: Performance → SIMD dispatch matrix · Developer Guide → SIMD dispatch architecture · Developer Guide → Apple Silicon / NEON port · Correctness fixes · Performance → PGO build
Build & Infrastructure
This page covers the build-system, testing, and CI infrastructure changes carried in bwa-mem3 on top of upstream bwa-mem2.
doctest framework and Codecov (PR #34)
PR #34 establishes the long-term test infrastructure for bwa-mem3:
- doctest 2.4.11 is vendored as a single-header under ext/doctest/, with the SHA256 recorded in ext/doctest/VERSION.
- A new test/framework/ static library provides shared helpers: scoring matrices, deterministic sequence-pair generators, kswv-style batch packers, scalar and SIMD runners, kswr comparators, a JUnit reporter hook, and a shared main.
- Two test binaries are produced: bwa_mem3_tests_unit (runs on every CI matrix row) and bwa_mem3_tests_integration (runs on a subset of rows).
- The existing kswv_selftest is ported to test/unit/test_kswv_correctness.cpp, with 30,049 assertions against scalar ksw_align2 on 10k random plus curated edge pairs.
- Five legacy integration sources are moved to test/integration/ via git mv; their binaries still emit at test/<name> so existing scripts keep working.
- Five inline CI bash regression blocks are extracted to test/regression/*.sh (phix_parity, chr22_parity, thread_determinism, bam_roundtrip, meth_oracle).
- A coverage CI job builds libbwa.a and both test binaries with COVERAGE=1 (-O0 --coverage), runs both test binaries, collects Cobertura XML via gcovr, and uploads to Codecov via codecov/codecov-action.
PACKAGE_VERSION from git describe (PR #52)
Before PR #52, src/main.cpp hardcoded PACKAGE_VERSION "2.2.1". This string
appeared in bwa-mem3 version output and in the @PG VN: SAM header field
but was never updated, causing every build to report an outdated version.
The Makefile now generates src/version.h from git describe --tags --dirty,
falling back to a static FG_LABS_VERSION_FALLBACK when git describe cannot
reach a tag (source-tarball extractions, shallow clones — e.g. CI with the
default fetch-depth: 1). A write-if-changed mechanism (cmp -s + mv)
regenerates the file on every invocation but only bumps its mtime when the
stamped string changes, so only main.o is rebuilt when the version changes,
not the entire tree. src/version.h is .gitignored and removed by
make clean. Fixes
issue #40. Related upstream:
bwa-mem2#283,
bwa-mem2#284.
PGO target parameterization (PR #59)
The original pgo-generate and pgo-use Makefile targets hardcoded
arch=arm64 and a single shared pgo_profiles/ directory. PR #59 generalizes
both:
- PGO_ARCH (default: arm64 on ARM hosts, native otherwise) passes through to the recursive make invocation as arch=$(PGO_ARCH). Accepts the same values as the rest of the Makefile: arm64, sse41, avx2, avx512bw, native, etc.
- PGO_PROFILE_DIR is now overridable (?= instead of =). Each (arch × training-regime) combination can capture into its own directory.
- When PGO_ARCH != arm64, the output binaries are named bwa-mem3.pgo-instr.<arch> and bwa-mem3.pgo.<arch> so multiple per-arch PGO builds coexist. The default arm64 names are unchanged for backward compatibility.
- pgo-clean now removes arch-suffixed PGO binaries in addition to bare names.
This enables the benchmarking workflow at bwa-mem3-bench, which requires per-arch × per-regime profile capture. See also Performance → PGO build.
CXXFLAGS/CPPFLAGS/LDFLAGS forwarding (PR #50)
At the time of PR #50, the Makefile’s multi: rule compiled
runsimd.cpp (the x86 multi-binary launcher) without honoring
CXXFLAGS, CPPFLAGS, or LDFLAGS. The $(EXE) link honored
CXXFLAGS and LDFLAGS but not CPPFLAGS. PR #83 has since replaced
the multi-binary scheme with a single binary that builds via the
single: target (the default), and that target inherits the same
flag-forwarding behavior.
PR #50 mirrored upstream bwa-mem2#290:
the compile rules now honor all three variables, and $(EXE) link adds
$(CPPFLAGS). This allows downstream packagers (Debian, Bioconda) and
reproducible-build systems to inject hardening flags (-D_FORTIFY_SOURCE=2,
-fstack-protector-strong, -Wl,-z,relro) through the environment without
patching the Makefile. No functional change unless the env vars are set.
Closes issue #39.
Unit-test harness and ARM CI (PR #23)
PR #23 added a local bash harness (test/run_unit_tests.sh) that built and ran
the five C++ unit binaries under test/ against committed fixtures in
test/fixtures/, asserting exit 0 and non-empty diff-able output. Those
binaries have since been consolidated into the doctest harness (see the
doctest section above). The PR also fixed several pre-existing issues that
blocked the harness:
- test/Makefile defaulted to icpc (Intel compiler, not available on GitHub runners); changed to g++ on Linux x86.
- ARM flags are mirrored from the parent Makefile so cd test && make builds on macOS arm64 and Linux aarch64.
- Three test sources (smem2_test, bwt_seed_strategy_test, sa2ref_test) were missing the fmiSearch->load_index() call that fmi_test.cpp has, causing immediate segfaults on run.
- test/main_banded.cpp opened fksw.txt but never wrote to it; output is now written and main() returns 0 on success.
- Fixtures are added under test/fixtures/ covering phiX174, 50 bp test reads, BWT seed strategy inputs, SA pairs, and SW pairs.
CI matrix expansion (PR #24)
PR #24 stacks on PR #23 and expands the GitHub workflow .github/workflows/ci.yml from 5 matrix
rows to 7:
| Row | Runner | Arch | Role |
|---|---|---|---|
| 1 | ubuntu-latest | sse41 | smoke + unit tests |
| 2 | ubuntu-latest | avx2 | canonical deep tests |
| 3 | ubuntu-latest | avx2 (no mimalloc) | unchanged |
| 4 | ubuntu-24.04-arm | arm64 | unchanged |
| 5 | macos-latest | arm64 | unchanged |
| 6 (new) | ubuntu-latest | multi | runsimd dispatcher smoke |
| 7 (new) | ubuntu-latest | avx2 clang++ | Linux Clang smoke |
The canonical row (row 2) adds: --bam=6 roundtrip record-count parity,
thread-determinism (-t1 vs -t4 sorted diff), unit-test harness, chr22
pipeline parity vs bwa, SE smoke, interleaved smoke, and --meth Layers 1–3.
Changes catalog
| Item | bwa-mem3 PR | Upstream PR/issue | Status |
|---|---|---|---|
| doctest framework + Codecov | #34 | — | fork-only |
| PACKAGE_VERSION from git describe | #52 | bwa-mem2#283, bwa-mem2#284 | fork-only (upstream issue + PR open) |
| PGO target parameterization | #59 | — | fork-only |
| CXXFLAGS/CPPFLAGS/LDFLAGS forwarding | #50 | bwa-mem2#290 | fork-only (mirrors open upstream PR) |
| Unit-test harness + ARM CI | #23 | — | fork-only |
| CI matrix expansion | #24 | — | fork-only |
See also: Developer Guide → Regression test framework · Developer Guide → Release process · Performance → PGO build · Performance improvements · Upstream PR status
BASELINE_ARCH=avx512bw build flag
This page documents the empirical perf characterization of building
bwa-mem3 with BASELINE_ARCH=avx512bw and the
-mprefer-vector-width=256 mitigation that ships as part of that
target.
Background: BASELINE_ARCH
bwa-mem3 ships a single x86 binary with all five SIMD tiers
(sse41 / sse42 / avx / avx2 / avx512bw) compiled for the hand-tuned
kernel TUs (KERNEL_SRCS in the Makefile: bandedSWA, kswv, ksw,
sam_encode). The runtime dispatcher in src/simd_dispatch.cpp picks
the right tier per kernel call based on __builtin_cpu_supports.
Everything outside KERNEL_SRCS — bwamem.cpp, bwamem_pair.cpp,
FMI_search.cpp, fastmap.cpp, bntseq.cpp, etc. — is compiled
once at the tier set by the BASELINE_ARCH Makefile variable
(default: avx2). The compiler can auto-vectorize loops in those TUs
at up to that tier’s width.
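The selection step can be sketched as "pick the highest tier the host supports". This is a minimal illustration, not the contents of src/simd_dispatch.cpp: `Tier`, `CpuFeatures`, `pick_tier`, and `host_features` are hypothetical names, and the `Unsupported` value is an assumption for hosts below SSE4.1. The only call taken from the text is `__builtin_cpu_supports`, which GCC and Clang provide on x86.

```cpp
enum class Tier { Unsupported, SSE41, SSE42, AVX, AVX2, AVX512BW };

// Hypothetical snapshot of the CPU features the dispatcher cares about.
struct CpuFeatures {
    bool sse41, sse42, avx, avx2, avx512bw;
};

// Pick the highest tier the host supports.
inline Tier pick_tier(const CpuFeatures &f) {
    if (f.avx512bw) return Tier::AVX512BW;
    if (f.avx2)     return Tier::AVX2;
    if (f.avx)      return Tier::AVX;
    if (f.sse42)    return Tier::SSE42;
    if (f.sse41)    return Tier::SSE41;
    return Tier::Unsupported;
}

// Query the running CPU. On GCC/Clang x86 builds this uses
// __builtin_cpu_supports; on other targets it reports no x86 features.
inline CpuFeatures host_features() {
#if (defined(__x86_64__) || defined(__i386__)) && (defined(__GNUC__) || defined(__clang__))
    __builtin_cpu_init();
    return CpuFeatures{
        __builtin_cpu_supports("sse4.1") != 0,
        __builtin_cpu_supports("sse4.2") != 0,
        __builtin_cpu_supports("avx") != 0,
        __builtin_cpu_supports("avx2") != 0,
        __builtin_cpu_supports("avx512bw") != 0};
#else
    return CpuFeatures{false, false, false, false, false};
#endif
}
```

Keeping the decision in a pure function of a feature snapshot makes it testable on any host; only `host_features()` depends on the machine it runs on.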
PR #84 raised the default from sse41 to avx2 after measuring
~10-15% wall-time gains on AVX2 hosts (c6a, etc.) when the
auto-vectorizer could finally widen hot non-kernel loops to 256-bit.
The naive expectation: avx2 → avx512bw should give another tier
Following PR #84’s logic, you might expect BASELINE_ARCH=avx512bw to
unlock another ~10-15% on AVX-512BW hosts (c7a, c7i, m7i) by widening
auto-vectorization to 512-bit. It does not. The avx2 → avx512bw
transition has fundamentally different hardware economics from the
sse41 → avx2 transition.
The two AVX-512 perf hazards
1. AMD Zen 4 µop-split (c7a)
AMD’s Zen 4 cores (c7a / Genoa, c8a / Bergamo) implement 512-bit AVX-512 operations by issuing 2× 256-bit µops per 512-bit op. For auto-vectorized loops:
- Iteration latency doubles.
- Iteration count only halves if the trip count is large enough. Short-trip loops eat the 2× latency without amortizing.
- 512-bit instruction encodings are larger → more I-cache pressure.
Net: loops that auto-vectorized productively at 256-bit AVX2 lose performance when the compiler widens them to 512-bit.
2. Intel Sapphire Rapids transition + downclock (c7i / m7i)
Intel’s Sapphire Rapids has native 512-bit execution units, so the µop-split issue does not apply. But it pays:
- ~3-5% AVX-512 frequency downclock under sustained heavy 512-bit use.
- AVX-512 ↔ AVX2 transition penalties when non-kernel TUs running 512-bit code call into the 256-bit hand-tuned kernel TUs (which always run at host tier via the dispatcher).
Net: small or zero gain from widening, often offset by the transition costs.
Mitigation: -mprefer-vector-width=256
The canonical mitigation (used by FFmpeg, libvpx, Intel ISPC) is to
keep AVX-512BW capabilities available but cap auto-vectorization at
256-bit width. The flag -mprefer-vector-width=256 (gcc / clang) /
-qopt-zmm-usage=low (icpc) does exactly that:
- The compiler can still emit AVX-512BW instructions where it explicitly needs them (mask registers, byte/word lane permutes, gather/scatter, the 32-zmm register file).
- The auto-vectorizer’s preferred SIMD width stays at 256-bit, dodging Zen 4’s µop-split and Intel’s downclock/transition costs.
The Makefile bakes this flag into arch=avx512bw directly. Hand-tuned
512-bit kernel intrinsics in KERNEL_SRCS are unaffected — the cap is
about auto-vec, not intrinsics.
Empirical numbers
c7a.4xlarge (AMD EPYC 9R14, Zen 4) and c7i.4xlarge
(Intel Xeon Platinum 8488C, Sapphire Rapids) running the bench’s
wgs-5M sample (1kg HG00096, 5M PE reads on hg38), shm-warmed via
bwa-mem3 shm, 3 reps median, timing via tricord
(fg-labs/tricord):
| host | avx2 | avx512bw | avx512bw + pvw256 (default) |
|---|---|---|---|
| c7a (Zen 4) | 105.70 s | 103.40 s (−2.2%) | 101.03 s (−4.4%) |
| c7i (Sapphire Rapids) | 156.50 s | 155.47 s (−0.7%) | 155.41 s (−0.7%) |
The gain is real but small. Defaulting BASELINE_ARCH=avx2 for x86
distribution is still correct: it’s portable across every AVX2-capable x86
host and loses only ~1-4% to the host-locked avx512bw build on AVX-512 hosts.
Why the runtime warning was misleading
Earlier versions of src/simd_dispatch.cpp printed at startup:
[W::bwamem3_simd_init_body] build baseline avx2 < host tier avx512bw;
non-kernel TUs are not auto-vectorized at the higher width (expect
10-15% slower hot paths). Rebuild with BASELINE_ARCH=avx512bw to recover.
The “10-15%” figure was the sse41 → avx2 transition on AVX2-only hosts (PR #84’s measurement, before avx2 became the default). It did not generalize to avx2 → avx512bw for the µop-split / downclock / transition reasons above. The warning was demoted to BWAMEM3_DEBUG_SIMD gating in a follow-up commit, and the recommendation now reflects the actual measurements (a 0.7-4.4% wall-time gain on the AVX-512 hosts in the table above).
When to use BASELINE_ARCH=avx512bw
- Production fleets pinned to AVX-512BW hosts (c7a / c7i / m7i): ship a host-locked build for the small (~1-4%) extra gain. The Makefile’s arch=avx512bw includes the -mprefer-vector-width=256 cap by default, which is empirically the right choice for both Zen 4 and Sapphire Rapids. The binary will SIGILL on hosts below avx512bw; pair it with explicit Batch queue / image plumbing.
- Mixed fleets / generic x86 distribution: stay on the default BASELINE_ARCH=avx2. The ~1-4% gap is small enough that portability is worth it.
Reproducer
The investigation harness lives at
scripts/perf-diff-baseline-arch.sh. It builds N variants of bwa-mem3
with different BASELINE_ARCH and EXTRA_CXXFLAGS settings, runs each
through tricord (or perf record for hot-function diffs), and emits
per-variant median tables. Example usage:
scripts/perf-diff-baseline-arch.sh \
--ref hg38.fasta --r1 r1.fq.gz --r2 r2.fq.gz \
--out out/ --reps 3 --threads 16 \
--variants 'avx2:,avx512bw:,avx512bw-pvw256:-mprefer-vector-width=256'
Requires tricord on the PATH (cargo install tricord).
/dev/shm must be ≥18 GB to stage the hg38 FMI index — on a default
EC2 instance that means mount -o remount,size=28g /dev/shm.
Bench-side caveats
The bench’s Phase C report (May 2026) reported a +14% c7a regression
when comparing the bench’s portable image vs the host-locked
avx512bw image inside AWS Batch. That delta does not reproduce on a
single-instance bare-metal measurement. A 4-way disambiguation test
(c7a.4xlarge, wgs-5M, shm-warmed, 3 reps each — see “Reproducer” above)
attributes the gap to two independent bench-side factors:
| variant | wall (s) | vs A |
|---|---|---|
| A: AL2023-built avx512bw, bare-metal | 99.50 | (baseline) |
| B: AL2023-built avx512bw, in bench Docker image | 102.30 | +2.8% |
| C: bench-image avx512bw binary, bare-metal | 117.03 | +17.6% |
| D: bench-image avx512bw binary, in bench Docker image | 118.44 | +19.0% |
The findings:
- Build-environment matters more than container. The bench’s :316dba6-avx512bw image binary, run bare-metal on the same c7a.4xlarge with the same input, is +17.6% slower than a fresh AL2023-built binary from the same SHA with the same BASELINE_ARCH.
  - AL2023 ships gcc 11.5.0 and defaults to -no-pie. Output is a non-PIE ELF.
  - Debian bookworm (the bench’s Dockerfile base) ships gcc 12.x and defaults to -pie. Output is a PIE ELF.
  - PIE adds indirection through a GOT for every global reference and is well-known to cost 5–15% on tight CPU loops. Combined with gcc-11 vs gcc-12 codegen differences this comfortably accounts for the +17.6%.
- Docker container overhead is small (~3%). A→B and C→D both show ~2–3% wall-time delta when wrapping the same binary in the bench’s image. Consistent with the broader literature on cgroup-namespaced compute-bound workloads.
- The bench’s portable :316dba6 image is built with BASELINE_ARCH=sse41, not avx2. Direct evidence from the binary’s startup banner inside the container: [W::bwamem3_simd_init_body] build baseline sse41 < host tier avx512bw. The banner is generated from compile-time macros in simd_dispatch.cpp and unambiguously testifies to the build flags used. Why it’s sse41 (rather than the post-PR-#84 default of avx2) is a bench-side mystery; possibilities include an explicit BASELINE_ARCH=sse41 build-arg, a Makefile-default change between the prior portable build and SHA 316dba6, or an environment quirk in the Docker build. The Phase C report’s “42/42 saw avx2 warning” summary likely reflects a different prior run, not the current :316dba6 portable image.
The Phase C report’s headline “+14% c7a regression for avx512bw” is
therefore comparing an sse41-built portable image at +17.6%
build-environment penalty against a BASELINE_ARCH=avx512bw
binary at the same +17.6% penalty plus Zen 4’s µop-split cost —
which roughly cancels for a small absolute delta in either direction.
None of it is bwa-mem3’s BASELINE_ARCH knob’s fault.
These are bench-side concerns. The bwa-mem3 fix
(-mprefer-vector-width=256 for arch=avx512bw) stands on its own
bare-metal merit: −2.2% on Zen 4 vs avx2 vanilla, plus −2.2%
incremental from the cap; wash on Sapphire Rapids.
Bench-side toolchain attribution
The +17.6% bare-metal delta between the bench’s bookworm-built binary and an AL2023-built binary was decomposed via a six-variant single-host test (c7a.4xlarge, wgs-5M, shm-warmed, 5 reps each, tricord median):
| variant | gcc | PIE | CET (endbr64) | wall (s) | vs AL2023 |
|---|---|---|---|---|---|
| AL2023 default | 11.5.0 | no | 8 (libc init only) | 98.28 | (baseline) |
| bookworm + gcc-11 | 11.4.0 | yes | 3 (libc init only) | 100.16 | +1.9% |
| bookworm default | 12.2.0 | yes | 3 (libc init only) | 117.21 | +19.3% |
| bookworm + -no-pie | 12.2.0 | no | 3 | 118.98 | +21.1% |
| bookworm + -fcf-protection=none | 12.2.0 | yes | 3 | 118.40 | +20.5% |
| bookworm + -no-pie -fcf-protection=none | 12.2.0 | no | 3 | 116.84 | +18.9% |
Findings:
- The +17.6% delta is gcc-11 vs gcc-12 codegen, not any hardening flag. Switching gcc inside Debian bookworm (keeping PIE on, keeping whatever default CET there is, keeping every other Debian default) recovers the perf to within 2% of AL2023.
- Disabling PIE in bookworm gcc-12 has no measurable effect. Median 118.98s with -no-pie vs 117.21s default, within rep-to-rep noise. The 5–15% PIE penalty cited in the literature doesn’t manifest for bwa-mem3’s intrinsic-heavy hot path; GOT indirection is rare on tight inner loops.
- Disabling CET in bookworm gcc-12 has no measurable effect. Both bookworm-default and -fcf-protection=none builds emit only ~3 endbr64 instructions in 7.6 MB of binary (libc init code). Something (probably bwa-mem3’s -mavx512f -mavx512bw per-tier kernel flags, or a #pragma GCC somewhere) suppresses CET emission regardless of the flag. So disabling it removes nothing.
- The combined -no-pie -fcf-protection=none build is statistically indistinguishable from the default. Both PIE and CET are noise.
So the actionable bench-side fix isn’t -no-pie or
-fcf-protection=none — it’s “use gcc-11”. The six-variant table
above is the only data we have; gcc-13 was not tested here, and the
postscript below shows that gcc-13 / gcc-14 do not recover the gap
without #88’s source-side fix. A one-liner for the bench Dockerfile:
RUN apt-get install -y gcc-11 g++-11
RUN cd fg-labs && CC=gcc-11 CXX=g++-11 make BASELINE_ARCH=... -j
…recovers ~17% wall-time uniformly across every Batch worker, every
arch. That’s a much bigger lever than the bwa-mem3-side
-mprefer-vector-width=256 mitigation here, which is ~2–4% on c7a and
wash on c7i. But it’s bench infra, not bwa-mem3 source.
(The deeper question — why is gcc-12 codegen ~19% slower than gcc-11 on Zen 4 for this workload? — was followed up in #88; see the postscript below.)
The bwa-mem3-side conclusion stands on its own bare-metal merit:
-mprefer-vector-width=256 for arch=avx512bw is a real ~2–4% win on
c7a and a wash on c7i, independent of toolchain and container
concerns.
Postscript: gcc-12 attribution closed by #88
The “use gcc-11” recommendation above was the actionable bench-side
fix at the time this page was written. #88 (perf(fmi): inline backwardExt to recover gcc 12+ wall-clock regression) has since
identified the underlying mechanism and closed the gap in source — so
on current main no compiler pin is required.
Profile attribution on a fresh c7a.8xlarge run with perf record --no-children localized ~9 percentage points of wall-time to
FMI_search::backwardExt’s self-time (12.5% on gcc-11 vs 21.5% on
gcc-14). Disassembly histograms were nearly identical between
compilers (~110 instructions, 8 scalar popcntq, 25 mov each); IPC
fell from 1.77 to 1.60. perf annotate isolated a single instruction
at 42% of the function’s samples on gcc-14: vmovdqu %ymm0, (%r8) —
the 32-byte AVX store of the SMEM return value through SysV’s
hidden-pointer convention. The matching argument-load
(mov 0x30(%rbp), %r10 for smem.s from the caller’s stack push) was
next-hottest at 17%. Together those two instructions accounted for
~60% of the function’s self-time.
The fix in #88 is a one-line attribute change: marking
backwardExt __attribute__((always_inline)) removes the call
boundary at all 9 hot call sites in getSMEMs* and
ls_advance_*. Without a call boundary the SMEM struct stays in
caller registers across the would-be call site — no struct push, no
return-slot store, no vzeroupper.
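The effect of the attribute can be illustrated with a small sketch. Everything named below (DemoSmem, demo_ext_call, demo_ext_inline, demo_walk) is a hypothetical stand-in, not the real SMEM type or FMI_search::backwardExt; only the GCC/Clang `__attribute__((always_inline))` spelling comes from the text above. The two variants compute the same result, and the difference is purely whether a call boundary exists.

```cpp
#include <cstdint>

// Hypothetical 32-byte struct standing in for the SMEM value; the real
// layout in bwa-mem3 may differ.
struct DemoSmem {
    int64_t k, l, s;
    uint32_t rid, m;
};
static_assert(sizeof(DemoSmem) == 32, "demo struct is 32 bytes");

// Out-of-line version: under the SysV x86-64 ABI a struct this large is
// passed in memory and returned through a hidden pointer.
__attribute__((noinline))
DemoSmem demo_ext_call(DemoSmem in, int64_t step) {
    in.k += step;
    in.s -= 1;
    return in;
}

// always_inline version: no call boundary, so the compiler is free to keep
// the fields in registers across the would-be call site.
__attribute__((always_inline)) inline
DemoSmem demo_ext_inline(DemoSmem in, int64_t step) {
    in.k += step;
    in.s -= 1;
    return in;
}

// Caller in the style of a hot loop; both variants yield the same value.
int64_t demo_walk(int64_t steps, bool inlined) {
    DemoSmem cur{0, 0, steps, 0, 0};
    for (int64_t i = 0; i < steps; ++i)
        cur = inlined ? demo_ext_inline(cur, 2) : demo_ext_call(cur, 2);
    return cur.k + cur.s;
}
```

Comparing the disassembly of the two paths (for example with objdump -d) is one way to see whether the struct copy and return-slot store disappear on a given compiler; the sketch makes no claim about the size of the effect.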
Post-#88 numbers on c7a.8xlarge (hg38 wgs-5M, shm-warmed, -t 32, 5
reps mean, single-binary at BASELINE_ARCH=avx2):
| binary | wall (s) | vs gcc-14 | vs gcc-11 | IPC |
|---|---|---|---|---|
| main + gcc-11 | 64.59 | −6.6% | baseline | 1.77 |
| main + gcc-14 | 69.16 | baseline | +7.07% | 1.60 |
| #88 + gcc-14 | 61.94 | −10.4% | −4.10% | 1.83 |
So the gcc-11 vs gcc-12 attribution above was the surface symptom of
an ABI-level inefficiency that was costing cycles on gcc-11 too — the
always-inline fix beats gcc-11 baseline by 4.10%, not just gcc-14. The
empirical table and findings list in the previous section remain
accurate as an investigation snapshot at commit 316dba6 (the pre-#88
state); the bench-side gcc-11 recommendation it produced is
obsolete.
See also
- SIMD dispatch architecture — how the runtime kernel dispatcher picks a tier
- Build & infrastructure — the broader build layout
- Architecture support — per-host SIMD coverage
Upstream PR Status
This table cross-references every change carried in bwa-mem3 main to its
corresponding upstream bwa-mem2 PR or issue. “Fork-only” means no upstream PR
exists; the change may be submitted upstream in the future or may be
fork-specific by design. “Open” means the upstream PR or issue existed at the
time of bwa-mem3’s implementation but had not been merged. Upstream status is
current as of the bwa-mem3 0.2.0 release.
For prose descriptions of each change, follow the links in the “bwa-mem3 PR” column to the relevant deep-dive page section.
Full cross-reference table
| Topic | bwa-mem3 PR | Upstream PR / Issue | Upstream status |
|---|---|---|---|
| Correctness | | | |
| @PG CL: tab escaping | #54 | bwa-mem2#293 | open issue |
| SMEM buffer overflow on >151 bp reads | #55 | bwa-mem2#238, bwa-mem2#210 | PR closed without merge; issue open |
| kswv nrow==0 guard (all 5 kernels) | #51 | bwa-mem2#289 | open PR (upstream covers AVX-512BW only) |
| AVX-512BW dispatch guard (!__AVX512BW__) | #26 | — | fork-only |
| AVX2 score2 plateau consolidation | #28 | — | fork-only |
| NEON + AVX-512BW 8-bit score2 fix | #29 | — | fork-only |
| AVX-512BW 16-bit score2 fix | #30 | — | fork-only |
| NEON 16-bit kernel rewrite | #31 | — | fork-only |
| kseq2bseq1 zero-initialization | #22 | — | fork-only |
| Proper-pair flag from emitted alignment | #17 | — | fork-only |
| @HD emitted before @SQ per SAM spec | #35 | lh3/bwa#345 | closed (lh3 only) |
| mem_matesw SIGSEGV on shm-backed ref_string | #85 | — | fork-only |
| SA_COMPX_MASK precedence in sampled-SA prefetch | #73 | — | fork-only |
| .alt parse buffer bounded (stack overflow) | #74 | — | fork-only |
| display_stats nthreads clamp to LIM_C | #81 | — | fork-only |
| Performance | | | |
| Lockstep SMEM batching | #33 | — | fork-only |
| Batched -H header ingestion (O(n) fix) | #49 | bwa-mem2#204 | open PR |
| libsais FM-index construction | #57 | — | fork-only |
| Consolidated mapping speedups | #58 | — | fork-only |
| kswv per-strip L1 prefetches (all u8/16 kernels) | #70 | — | fork-only |
| SMEM_LOCKSTEP_N bumped from 8 to 16 | #75 | — | fork-only |
| Closed-form ungapped HIT when total_mis == 0 | #77 | — | fork-only |
| ksort on-stack buffer for small n | #78 | — | fork-only |
| libsais_build skip wasted zero-init | #80 | — | fork-only |
| Cap avx512bw autovec at 256-bit | #86 | — | fork-only |
| Inline FMI_search::backwardExt (recover gcc 12+ regression) | #88 | — | fork-only |
| Features | | | |
| --bam=LEVEL direct BAM output | #12 | — | fork-only |
| --meth bisulfite alignment mode | #13 | — | fork-only |
| Vendored mimalloc allocator | #19 | — | fork-only |
| HN:i hit count tag | #42 | lh3/bwa#438 | analogous to bwa aln; no direct upstream port |
| --supp-rep-hard-cap MAPQ rescoring | #56 | bwa-mem2#260 | open issue |
| bwa-mem3 shm shared-memory index | #65 | — | fork-only (v1 feature port) |
| shm --meth symmetry | #67 | — | fork-only |
| -z FLOAT (XA_drop_ratio CLI knob) | #35 | lh3/bwa#294 | merged (lh3 only) |
| -u flag — widen XA:Z records with ,score,mapq | #35 | lh3/bwa#293 | merged (lh3 only) |
| MQ:i mate mapping quality tag | #35 | lh3/bwa#330 | merged (lh3 only) |
| Bismark-compatible XR:Z / XG:Z / XM:Z tags | #90 | — | fork-only |
| /bwactl registry interprocess lock (POSIX named semaphore) | #82 | — | fork-only |
| bwa-mem3 shm /dev/shm capacity preflight | #86 | — | fork-only |
| Host-floor precheck (SIMD floor: / SIMD runtime:, exit 2 on under-floor host) | #95 | — | fork-only |
| Architecture support | | | |
| Linux ARM64 / aarch64 build + CI | #1 | bwa-mem2#288 | open PR |
| arch=avx512bw explicit Makefile target | #16 | — | fork-only |
| NEON kswv mate-rescue kernel | #18 | — | fork-only |
| AVX2 kswv mate-rescue kernel | #20 | — | fork-only |
| bns_fetch_seq_v2 migration of mem_matesw_batch_{pre,post} | #76 | — | fork-only |
| Single-binary in-process SIMD dispatch (replaces multi-binary execv launcher) | #83 | — | fork-only |
| Default x86 BASELINE_ARCH=avx2 (was sse41) | #84 | — | fork-only |
| Build & infrastructure | | | |
| doctest framework + Codecov | #34 | — | fork-only |
| PACKAGE_VERSION from git describe | #52 | bwa-mem2#283, bwa-mem2#284 | open issue + open PR |
| PGO target parameterization | #59 | — | fork-only |
| CXXFLAGS/CPPFLAGS/LDFLAGS forwarding | #50 | bwa-mem2#290 | open upstream PR |
| Unit-test harness + ARM CI | #23 | — | fork-only |
| CI matrix expansion (7 rows) | #24 | — | fork-only |
| Shell-var rename BWAMEM2/BWA_MEM2[_*] → BWAMEM3/BWA_MEM3[_*] (CI/bench/test scripts) | #68 | — | fork-only |
| Methylation oracle: alias bwa-mem2 → bwa-mem3 on PATH for bwameth.py | #72 | — | fork-only |
| Migrate parity tests from dwgsim/phiX174 to holodeck/chr22 | #89 | — | fork-only |
Upstream issues tracked but not yet fixed in bwa-mem3
The following upstream issues are tracked in the bwa-mem3 issue list but do
not yet have corresponding fixes in main:
| Issue | Upstream reference | Notes |
|---|---|---|
| Split-alignment evidence loss vs bwa 0.7.17 | bwa-mem2#273 | issue #47 — under investigation |
| MAPQ/coordinate parity vs bwa mem 0.7.18 | bwa-mem2#262, bwa-mem2#246, bwa-mem2#239 | issue #48 — tracking only |
See also: Correctness fixes · Performance improvements · Features · Architecture support · Build & infrastructure
Building from source
This page documents every build target available in the Makefile and what each produces. For the recommended production build workflow see Best Practices → Build.
Prerequisites
- A C++14-capable compiler: GCC 7+ or Clang 6+ on Linux; Clang 15+ (Xcode) on macOS.
- GNU make 3.81+.
- CMake 3.12+ (required only when USE_MIMALLOC=1, which is the default).
- autoconf, automake, autoconf-archive, libtool, pkg-config: ext/htslib’s build runs autoreconf -i && ./configure and locates zlib via pkg-config.
- zlib development headers: htslib links against zlib.
- OpenMP runtime: libsais uses OpenMP for parallel suffix-array construction. Linux + GCC: libgomp ships with the compiler, nothing extra to install. Linux + Clang: libomp-dev (Debian) / libomp-devel (RHEL). macOS: brew install libomp; the Makefile auto-detects the Homebrew prefix or honours LIBOMP_PREFIX.
- Git submodules initialised: git submodule update --init --recursive.
See Getting Started → Installation for the full per-platform install commands.
Warning — Submodules must be present
The build will fail with a clear error message if any of the required submodules (ext/libsais, ext/htslib, ext/safestringlib, ext/mimalloc, ext/sse2neon) are missing. Always clone with --recursive or run git submodule update --init --recursive before make.
Standard builds
Default build (host-native)
make
On x86 hosts this is equivalent to make single (see below): one binary
containing all five SIMD tiers, dispatched in process at startup. On
Apple Silicon and other aarch64 hosts the Makefile detects the
architecture and builds a single ARM64 binary with one NEON kernel TU.
The resulting binary is bwa-mem3 in the repo root.
Single multi-tier x86 build (default on x86)
make single # alias of the default `make`
make BASELINE_ARCH=avx512bw # raise non-kernel TU compile baseline
make BASELINE_ARCH=sse41 # lower it for pre-Haswell hosts
Builds one bwa-mem3 binary. The four hand-tuned kernel TUs in
KERNEL_SRCS (bandedSWA.cpp, kswv.cpp, ksw.cpp,
sam_encode.cpp) are compiled five times each — once per supported
tier (sse41 / sse42 / avx / avx2 / avx512bw) — and dispatched
at runtime via __builtin_cpu_supports. Non-kernel TUs compile once at
BASELINE_ARCH (default avx2 since PR #84). See
Single-binary SIMD dispatch (x86) for the full design.
Single-tier x86 builds
Pass arch=<target> to compile a single binary with kernels for one
tier only (no runtime dispatch table — useful on clusters with uniform
hardware):
| Command | SIMD level | ARCH_FLAGS |
|---|---|---|
| make arch=sse41 | SSE4.1 | -msse … -msse4.1 |
| make arch=sse42 | SSE4.2 | -msse … -msse4.2 |
| make arch=avx | AVX | -mavx |
| make arch=avx2 | AVX2 | -mavx2 |
| make arch=avx512bw | AVX-512BW | -mavx512f -mavx512bw -mprefer-vector-width=256 |
| make arch=native | host CPU features | -march=native |
For Intel compiler (icpc / icpx) the flags differ slightly; see the
Makefile for the ifeq ($(CXX), icpc) branches. The avx512bw target
keeps the -mprefer-vector-width=256 cap from PR #86 — see
BASELINE_ARCH=avx512bw build flag
for the empirical perf characterization.
ARM64 / Apple Silicon build
make arch=arm64
Compiles a single binary bwa-mem3 with one NEON kernel TU. See
Apple Silicon / NEON port for background.
Tuned builds
Profile-Guided Optimization (PGO)
PGO produces the best single-binary performance. The workflow is two-phase:
# Phase 1: instrument binary
make pgo-generate # builds bwa-mem3.pgo-instr (arm64 default)
make pgo-generate PGO_ARCH=avx2 # or a specific x86 target
# Run your training workload with the instrumented binary
./bwa-mem3.pgo-instr mem -t 16 ref.fa r1.fq.gz r2.fq.gz > /dev/null
# Phase 2: optimised binary
make pgo-use # builds bwa-mem3.pgo
make pgo-use PGO_ARCH=avx2 # matching arch
PGO_ARCH accepts the same values as arch=. PGO_PROFILE_DIR defaults to pgo_profiles/ but can be overridden. Output binaries are named bwa-mem3.pgo (default arch) or bwa-mem3.pgo.<arch> when a non-default arch is specified, so multiple arch builds coexist.
Clean up instrumented objects and profile data:
make pgo-clean
Link-Time Optimization (LTO)
make lto-build # builds bwa-mem3.lto (native arch)
make lto-build LTO_ARCH=avx2 # explicit arch
LTO compiles bwa-mem3’s own translation units with -flto (thin LTO on Clang, full LTO on GCC) plus -fno-semantic-interposition on GCC. Third-party libraries (htslib, mimalloc, safestringlib) are linked without LTO. Clean:
make lto-clean
Compute-only profile binary
Used when profiling CPU hotspots without I/O noise. The -DDISABLE_OUTPUT flag short-circuits all BAM/SAM write paths and the file-open / header-emit step, so only alignment work contributes to wall time.
make profile-build # builds bwa-mem3.profile (native)
make profile-build PROFILE_ARCH=avx2 # explicit arch
./bwa-mem3.profile mem -t 16 ref.fa r1.fq.gz r2.fq.gz
make profile-clean
Build knobs
| Variable | Default | Effect |
|---|---|---|
| USE_MIMALLOC | 1 | Include mimalloc; set 0 to use the system allocator |
| ASAN | (unset) | Set to any non-empty value to enable AddressSanitizer (forces USE_MIMALLOC=0) |
| COVERAGE | (unset) | Set to enable --coverage + -O0 for gcov line-level coverage |
| EXTRA_CXXFLAGS | (empty) | Appended to CXXFLAGS; forwarded through PGO / LTO targets |
| DISABLE_BATCHED_MATESW | (unset) | Set to 1 to disable the batched mate-rescue SW path on ARM |
| CXX | c++ | Compiler. Paired CC is auto-derived from CXX for libsais. |
Cleaning
make clean
Removes object files, libbwa.a, all binaries, test binaries, libsais objects, safestringlib, htslib, and the mimalloc build tree.
make docs-clean
Removes only the mdbook build output (docs/book/). Covered in Developer Guide → Building context; see the Makefile docs targets for the full list.
Documentation targets
| Target | Action |
|---|---|
| make docs | Build the mdbook into docs/book/ |
| make docs-serve | Live-preview at http://localhost:3000 |
| make docs-cli | Capture --help output for each subcommand into docs/_generated/cli/ |
| make docs-clean | Remove docs/book/ |
| make docs-install-tools | cargo install mdbook + three plugins |
See also: SIMD dispatch architecture · Single-binary SIMD dispatch (x86) · Best Practices → Build · Performance → PGO build · Apple Silicon / NEON port
SIMD dispatch architecture
bwa-mem3 uses two complementary mechanisms to run the best available
SIMD code path at run time: in-process tier dispatch on x86
(handled separately in
Single-binary SIMD dispatch (x86)) and compile-time
conditional compilation inside each kernel translation unit,
mediated by src/simd_compat.h and src/kernel_dispatch.h.
This page covers the compile-time layer: what the macros do, which
kernels are vectorised at each ISA level, and how the dispatch
decision flows from main() to a tier-specific kernel instruction.
The simd_compat.h abstraction layer
src/simd_compat.h is the single point where platform detection and
intrinsic selection occur. It is included by every file that touches
SIMD code. The header resolves to one of four paths:
| Platform | Branch condition | Intrinsic headers |
|---|---|---|
| ARM / Apple Silicon | __ARM_NEON or __aarch64__ | sse2neon.h (translation) + <arm_neon.h> (native) |
| x86 AVX-512BW | __AVX512BW__ | <immintrin.h> |
| x86 AVX2 | __AVX2__ | <immintrin.h> |
| x86 SSE4.1 / SSE2 | __SSE4_1__ or __SSE2__ | <smmintrin.h> + <emmintrin.h> |
The ARM path defines APPLE_SILICON 1, sets SIMD_WIDTH8 = 16 and
SIMD_WIDTH16 = 8 (128-bit NEON lanes), defines a
posix_memalign-backed _mm_malloc replacement that enforces the
128-byte Apple Silicon cache-line alignment, and provides two
optimised NEON helpers that sse2neon does not generate efficiently:
- _mm_movemask_epi16: extracts the MSB of each 16-bit element using vshrq_n_u16 + vmovn_u16 + position-weighted vaddv_u8, replacing the _mm_movemask_epi8(v) & 0xAAAA pattern used in bandedSWA.cpp.
- _mm_blendv_epi16_fast: a bitwise select on 16-bit elements via NEON vbslq_s16, replacing the OR/AND/ANDNOT sequence sse2neon emits for _mm_blendv_epi8.
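To make the relationship between the two mask patterns concrete, here is a scalar model in plain C++ with no intrinsics. It is a sketch under assumptions: the function names are hypothetical, and the compact one-bit-per-lane result layout for the 16-bit movemask is assumed for illustration rather than taken from simd_compat.h. It shows why masking the byte-wise movemask with 0xAAAA isolates exactly the per-lane sign bits.

```cpp
#include <array>
#include <cstdint>

using Lanes16 = std::array<uint16_t, 8>;  // one 128-bit vector of 16-bit lanes

// Scalar model of a byte-wise movemask: one result bit per byte, taken from
// that byte's most significant bit (little-endian byte order within lanes).
inline uint32_t model_movemask_epi8(const Lanes16& v) {
    uint32_t mask = 0;
    for (int lane = 0; lane < 8; ++lane) {
        uint32_t lo = (v[lane] >> 7) & 1u;   // MSB of the low byte
        uint32_t hi = (v[lane] >> 15) & 1u;  // MSB of the high byte
        mask |= lo << (2 * lane);
        mask |= hi << (2 * lane + 1);
    }
    return mask;
}

// Scalar model of a 16-bit movemask: one result bit per 16-bit lane.
// The compact 8-bit layout is an assumption made for this sketch.
inline uint32_t model_movemask_epi16(const Lanes16& v) {
    uint32_t mask = 0;
    for (int lane = 0; lane < 8; ++lane)
        mask |= static_cast<uint32_t>((v[lane] >> 15) & 1u) << lane;
    return mask;
}

// Spread a compact per-lane mask to the odd bit positions kept by 0xAAAA.
inline uint32_t spread_to_odd_bits(uint32_t lane_mask) {
    uint32_t out = 0;
    for (int lane = 0; lane < 8; ++lane)
        out |= ((lane_mask >> lane) & 1u) << (2 * lane + 1);
    return out;
}

// Fixed demo input: lanes 0, 2, 5 and 7 have their sign bit set.
inline Lanes16 demo_vector() {
    return Lanes16{{0x8000, 0x0080, 0xFFFF, 0x0000,
                    0x7FFF, 0x8001, 0x0000, 0x8080}};
}
```

For the demo vector, the per-lane mask has bits 0, 2, 5 and 7 set, and spreading it to the odd bit positions reproduces the byte-wise mask ANDed with 0xAAAA.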
SIMD_WIDTH8 and SIMD_WIDTH16 control the lane counts in kswv.cpp
and bandedSWA.cpp. The macros differ per ISA level:
| ISA | SIMD_WIDTH8 | SIMD_WIDTH16 |
|---|---|---|
| SSE4.1 | 16 | 8 |
| AVX2 | 32 | 16 |
| AVX-512BW | 64 | 32 |
| ARM NEON | 16 | 8 |
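As an illustration of lane-count arithmetic driven by these macros, the sketch below rounds a batch of sequence pairs up to a whole number of vector lanes. The helper names are hypothetical, the fallback macro definitions simply reuse the AVX2 row of the table, and whether kswv.cpp or bandedSWA.cpp pad their batches in exactly this way is not stated here.

```cpp
#include <cstddef>

// Stand-ins for the simd_compat.h macros. The values match the AVX2 row of
// the table; defining them here is only for the sake of the sketch.
#ifndef SIMD_WIDTH8
#define SIMD_WIDTH8 32
#endif
#ifndef SIMD_WIDTH16
#define SIMD_WIDTH16 16
#endif

// Number of full-width vector iterations needed to cover n sequence pairs.
constexpr std::size_t demo_num_batches(std::size_t n, std::size_t width) {
    return (n + width - 1) / width;
}

// n rounded up to a whole number of lanes (padding slots included).
constexpr std::size_t demo_padded_count(std::size_t n, std::size_t width) {
    return demo_num_batches(n, width) * width;
}

static_assert(demo_padded_count(100, 16) == 112, "7 batches of 16 lanes");
static_assert(demo_padded_count(100, 32) == 128, "4 batches of 32 lanes");
```

Writing the arithmetic against the macro rather than a literal is what lets the same source compile for 128-bit, 256-bit, and 512-bit tiers.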
Per-tier compilation and symbol mangling
On x86 the four kernel translation units listed in KERNEL_SRCS
(bandedSWA.cpp, kswv.cpp, ksw.cpp, sam_encode.cpp) are
compiled five times each — once per supported tier
(sse41 / sse42 / avx / avx2 / avx512bw) — with tier-specific
-m... flags. src/kernel_dispatch.h is a preprocessor-only header
that renames each exported kernel symbol per a
KERNEL_VARIANT=_<tier> macro, so the five tier compiles produce
non-colliding symbols that all link into one binary.
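The renaming idea can be sketched with ordinary token pasting. The macro names KD_CAT and KERNEL_SYM and the function demo_lanes16 are hypothetical; the actual macros in src/kernel_dispatch.h may be spelled differently. In the real build KERNEL_VARIANT comes from -DKERNEL_VARIANT=_<tier> on each per-tier compile; the sketch redefines it in one file only so that two "tiers" can be shown side by side.

```cpp
// Two-level concatenation so KERNEL_VARIANT is expanded before pasting.
#define KD_CAT_(a, b) a##b
#define KD_CAT(a, b) KD_CAT_(a, b)
#define KERNEL_SYM(name) KD_CAT(name, KERNEL_VARIANT)

// First "tier compile": the symbol becomes demo_lanes16_sse41.
#define KERNEL_VARIANT _sse41
extern "C" int KERNEL_SYM(demo_lanes16)() { return 8; }
#undef KERNEL_VARIANT

// Second "tier compile": the same source line yields demo_lanes16_avx2.
#define KERNEL_VARIANT _avx2
extern "C" int KERNEL_SYM(demo_lanes16)() { return 16; }
#undef KERNEL_VARIANT
```

Both definitions link into one program without colliding because the preprocessor has given them distinct names before the compiler ever sees them.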
bandedSWA.h adds an abstract IBandedPairWiseSW interface;
BandedPairWiseSW is final and inherits from it. kswv.h mirrors
this with Ikswv. Each per-tier kernel TU exports a C-linkage factory
function (make_bsw_kernel_<tier>, make_kswv_kernel_<tier>) that
returns a std::unique_ptr<I*> to the tier-specific concrete class.
The dispatcher in src/simd_dispatch.cpp switches on g_tier and
calls the matching factory; the call sites in bwamem.cpp and
bwamem_pair.cpp see only the interface. This separation keeps the
dispatcher TU free of class-layout knowledge and sidesteps the ODR
risk that would arise from each tier’s compile pulling in a
differently-laid-out concrete class definition.
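A minimal sketch of the interface-plus-factory shape follows. All names (KernelTier, IDemoKernel, make_demo_kernel_*) are hypothetical, the concrete classes share one file via namespaces instead of living in separate per-tier TUs, and the factories use plain C++ linkage for simplicity, so this illustrates the pattern rather than the code in simd_dispatch.cpp.

```cpp
#include <memory>

// Hypothetical tier enum, ordered from lowest to highest.
enum class KernelTier { sse41, sse42, avx, avx2, avx512bw };

// Abstract interface: the only type the call sites see.
struct IDemoKernel {
    virtual ~IDemoKernel() = default;
    virtual int lanes16() const = 0;
};

// In a per-tier build each concrete class would live in its own kernel TU.
namespace tier_sse41 {
struct DemoKernel final : IDemoKernel {
    int lanes16() const override { return 8; }
};
}  // namespace tier_sse41
namespace tier_avx2 {
struct DemoKernel final : IDemoKernel {
    int lanes16() const override { return 16; }
};
}  // namespace tier_avx2
namespace tier_avx512bw {
struct DemoKernel final : IDemoKernel {
    int lanes16() const override { return 32; }
};
}  // namespace tier_avx512bw

// Per-tier factories: each returns the interface, never the concrete type.
std::unique_ptr<IDemoKernel> make_demo_kernel_sse41() {
    return std::make_unique<tier_sse41::DemoKernel>();
}
std::unique_ptr<IDemoKernel> make_demo_kernel_avx2() {
    return std::make_unique<tier_avx2::DemoKernel>();
}
std::unique_ptr<IDemoKernel> make_demo_kernel_avx512bw() {
    return std::make_unique<tier_avx512bw::DemoKernel>();
}

// Dispatcher: switches on the resolved tier and calls the matching factory.
// The sketch folds sse42 and avx into the 128-bit kernel to stay short.
std::unique_ptr<IDemoKernel> make_demo_kernel(KernelTier tier) {
    switch (tier) {
        case KernelTier::avx512bw: return make_demo_kernel_avx512bw();
        case KernelTier::avx2:     return make_demo_kernel_avx2();
        default:                   return make_demo_kernel_sse41();
    }
}
```

The dispatcher needs only the factory declarations and the interface, which is the property that keeps class-layout knowledge out of the dispatcher TU.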
The free-function ksw_* family (ksw_extend2, ksw_global2,
ksw_extend, ksw_global, ksw_align2, ksw_align) is dispatched
through thin extern "C" wrappers in simd_dispatch.cpp that switch
on g_tier and tail-call the matching mangled per-tier symbol.
Internal aux helpers in ksw.cpp (ksw_qinit, ksw_u8, ksw_i16)
are forced static so the five tier compiles do not multi-define
them. The SAM seq/qual encoder previously inlined in bwamem.cpp was
lifted into src/sam_encode.{h,cpp} so it also participates in
per-tier compilation.
All non-kernel TUs (bwamem.cpp, bwamem_pair.cpp, fastmap.cpp,
FMI_search.cpp, bntseq.cpp, …) compile once at the
BASELINE_ARCH tier (default avx2, set by the make line). They
call into the dispatcher’s tier-agnostic entry points, which fan out
to the per-tier kernels at run time. See
Single-binary SIMD dispatch (x86) for the runtime
selection and override semantics, and
BASELINE_ARCH=avx512bw build flag
for why non-kernel TUs do not auto-vectorize at 512-bit by default.
On arm64 there is one NEON tier and one kernel compile per TU; the dispatch tables collapse to single-entry switches and the per-tier mangling layer is a no-op.
Dispatch diagram
The full dispatch decision, from the shell to a kernel instruction, follows this flow:
flowchart TD
A[User runs: bwa-mem3 mem ...] --> B{Platform}
B -- ARM / Apple Silicon --> C[bwa-mem3 main, single NEON kernel TU]
B -- x86 --> D[bwa-mem3 main, calls bwamem3_simd_init in src/simd_dispatch.cpp]
D --> E{__builtin_cpu_supports + BWAMEM3_FORCE_TIER}
E -- AVX-512BW --> F1[g_tier = avx512bw]
E -- AVX2 --> F2[g_tier = avx2]
E -- AVX --> F3[g_tier = avx]
E -- SSE4.2 --> F4[g_tier = sse42]
E -- SSE4.1 --> F5[g_tier = sse41]
F1 & F2 & F3 & F4 & F5 --> G[Non-kernel TUs run\nat BASELINE_ARCH tier]
C --> G
G --> H{Kernel call}
H -- kswv\nbatched SW --> I[per-tier kswv.<tier>.o\nvia make_kswv_kernel_<tier>]
H -- bandedSWA\nmate-rescue --> J[per-tier bandedSWA.<tier>.o\nvia make_bsw_kernel_<tier>]
H -- ksw_align2 etc.\nfree functions --> K[per-tier ksw.<tier>.o\nvia extern-C wrapper in simd_dispatch.cpp]
H -- sam_encode --> L[per-tier sam_encode.<tier>.o]
H -- FMI_search\nbackward extension --> M[FMI_search.cpp\n__builtin_popcountl — not SIMD]
H -- libsais\nBWT construction --> N[libsais.c\nOpenMP parallel SA-IS]
I --> O[SIMD instructions\nat the dispatched tier]
J --> O
K --> O
L --> O
Per-kernel vectorisation status
| Kernel | SSE4.1 | SSE4.2 | AVX | AVX2 | AVX-512BW | ARM NEON |
|---|---|---|---|---|---|---|
| kswv (batched Smith-Waterman) | 8-wide int16 | 8-wide int16 | 8-wide int16 | 16-wide int16 | 32-wide int16 | 8-wide int16 (native) |
| bandedSWA (banded SW / mate-rescue) | vectorised | vectorised | vectorised | vectorised | vectorised | native NEON blendv |
| ksw_* free functions (SW extension) | per-tier | per-tier | per-tier | per-tier | per-tier | per-tier (NEON) |
| sam_encode (SAM seq/qual encoder) | per-tier | per-tier | per-tier | per-tier | per-tier | per-tier (NEON) |
| FMI_search (FM-index backward ext.) | scalar | scalar | scalar | scalar | scalar | scalar |
| libsais (BWT / SA construction) | OpenMP only | OpenMP only | OpenMP only | OpenMP only | OpenMP only | OpenMP only |
FMI_search is memory-bound with sequential pointer-chasing
dependencies; adding SIMD to it produces no measurable speedup.
libsais benefits from OpenMP-parallel induced sorting but not from
SIMD widening within a single thread.
Adding a new SIMD kernel
- Include simd_compat.h rather than any platform intrinsic header directly.
- Use SIMD_WIDTH8 / SIMD_WIDTH16 for lane-count arithmetic so the code compiles correctly across all ISA levels.
- If the kernel needs per-tier compilation:
  - Add the source to KERNEL_SRCS in the Makefile so the per-tier pattern rules (src/%.<tier>.o) pick it up.
  - Use the KERNEL_VARIANT rename macros from src/kernel_dispatch.h to expose mangled symbols.
  - Export a C-linkage factory or dispatcher entry point from the per-tier TU and add a switch on g_tier in src/simd_dispatch.cpp.
- For ARM-specific optimisations, gate them with #ifdef APPLE_SILICON (or #if defined(__ARM_NEON)) and provide a simd_compat.h-routed fallback for x86.
- Verify correctness on at least SSE4.1 (lowest supported x86 tier) and ARM64 using make test, then run test/regression/all_tiers_parity.sh to confirm byte-identical SAM across every x86 tier under BWAMEM3_FORCE_TIER.
Tip — Testing SIMD correctness
The kswv unit tests in
test/unit/test_kswv*.cpp use synthetic sequence-pair generators that drive edge cases (empty batches, nrow==0, homopolymers) across every SIMD width. Run them with ./test/bwa_mem3_tests_unit --test-suite="unit/kswv" after modifying any vectorised kernel, then loop BWAMEM3_FORCE_TIER over all five tiers in an end-to-end smoke run to catch dispatcher-wiring regressions that the unit tests miss.
See also:
Single-binary SIMD dispatch (x86) ·
Apple Silicon / NEON port ·
Building from source ·
Performance → SIMD dispatch matrix ·
BASELINE_ARCH=avx512bw build flag ·
Regression test framework
Single-binary SIMD dispatch (x86)
On x86 Linux and x86 macOS, bwa-mem3 is a single binary that
contains compiled kernels for every supported SIMD tier
(sse41 / sse42 / avx / avx2 / avx512bw). At startup the binary
detects the host CPU’s capabilities and selects the matching tier in
process, without fork or exec. There is no separate launcher binary
and no bwa-mem3.<tier> variant files on disk.
ARM / Apple Silicon does not need tier dispatch at all: there is only one NEON instruction-set level across current ARM64 CPUs, so the arm64 build is a single binary with one kernel TU. The dispatch machinery described below is only meaningful on x86.
This design replaces the multi-binary execv launcher inherited from
bwa-mem2. The motivation, validation, and trade-offs are tracked in
PR #83; the AVX-512
auto-vectorization cap that ships alongside it is documented in
BASELINE_ARCH=avx512bw build flag.
What the build produces
make # default: single multi-tier binary, BASELINE_ARCH=avx2
make single # explicit alias of the default target
Produces one file in the repo root:
| File | Contains | Non-kernel TU compile flags |
|---|---|---|
bwa-mem3 | All 5 x86 tier kernels + dispatcher + non-kernel TUs | BASELINE_ARCH (default avx2) |
The four kernel translation units listed in the Makefile’s KERNEL_SRCS
(bandedSWA.cpp, kswv.cpp, ksw.cpp, sam_encode.cpp) are compiled
five times each, once per tier, with tier-specific -m... flags. Every
non-kernel TU is compiled once at the BASELINE_ARCH tier.
BASELINE_ARCH defaults to avx2 (PR #84) and can be set on the
make line:
make BASELINE_ARCH=avx512bw # for an AVX-512BW-only fleet
make BASELINE_ARCH=sse41 # for pre-Haswell hosts (~10–15% slower on AVX2)
Lowering BASELINE_ARCH reduces the supported host floor and is the
documented escape hatch for vintage hardware. Raising it locks the
binary to that host class and disables the host-floor precheck for
lower tiers. The bwa-mem3 version banner prints the resulting
SIMD floor: line so operators can confirm the build matches the
intended deployment surface — see
Host requirements and
BASELINE_ARCH=avx512bw build flag.
For ARM, make arch=arm64 produces a single binary with a single NEON
kernel TU; no dispatch table is generated.
Runtime tier selection
src/simd_dispatch.cpp provides three pieces:
- bwamem3_simd_init(): idempotent initializer called from main.cpp. Caches the host’s raw capability into a file-scope g_host_capability and the effective dispatch tier into a separate g_tier (the two differ when BWAMEM3_FORCE_TIER is set).
- An enum of supported tiers (sse41 → sse42 → avx → avx2 → avx512bw, plus neon on arm64) and bwamem3_simd_tier_name() for stderr reporting.
- Per-kernel factory functions (make_bsw_kernel_<tier>, make_kswv_kernel_<tier>) and free-function dispatch wrappers (ksw_extend2, ksw_global2, ksw_extend, ksw_global, ksw_align2, ksw_align, sam_encode_*) that switch on g_tier and call into the matching mangled per-tier symbol.
x86 detection uses __builtin_cpu_supports directly; arm64 reports
neon unconditionally. The selection happens once at startup and the
result is cached in a TU-level global — subsequent kernel calls pay a
single indirect-call overhead through a vtable (for the
BandedPairWiseSW / kswv factories) or an extern "C" wrapper (for
the ksw_* free functions). Per PR #83 measurement, the indirect call
costs ~0.3 ns after BTB warm-up, so a 1M-read alignment with ~100M
kernel calls adds roughly 30 ms — well below run-to-run noise on every
tested host.
Symbol mangling per tier
src/kernel_dispatch.h is a preprocessor-only header that renames
kernel-exported symbols according to a KERNEL_VARIANT=_<tier> macro.
Each kernel TU is compiled N times with a different
-DKERNEL_VARIANT=_<tier> plus the matching -m... flags, producing
per-tier mangled symbols that link cleanly into one binary without ODR
collision.
bandedSWA.h adds an abstract IBandedPairWiseSW interface;
BandedPairWiseSW is final and inherits from it. kswv.h mirrors
this with Ikswv. The dispatcher TU sees only the interface; the
factory implementations in each per-tier kernel TU see the full
concrete class layout via the rename. This separation sidesteps the
ODR risk that would arise if the dispatcher TU and the factory TUs
both included the full class definition.
Internal aux helpers in ksw.cpp (ksw_qinit, ksw_u8, ksw_i16)
are forced static so the per-tier compiles don’t multi-define them.
The SAM seq/qual encoder previously inlined in bwamem.cpp was lifted
into a free-standing src/sam_encode.{h,cpp} translation unit so it
participates in per-tier compilation and benefits from the
auto-vectorizer’s tier-specific vmovdqu / VEX / EVEX encoding wins.
Environment overrides
Two environment variables are exposed at runtime:
| Variable | Behavior |
|---|---|
| BWAMEM3_FORCE_TIER=<tier> | Force the dispatcher to use <tier> (one of sse41 sse42 avx avx2 avx512bw). Downgrade-only: requests above the detected host tier (which would SIGILL on the first wider instruction) and unrecognized names are rejected with a stderr warning and the dispatcher falls back to the detected tier. Replaces the prior “exec the bwa-mem3.sse41 binary” pattern for A/B regression testing on AVX-512 hosts. |
| BWAMEM3_DEBUG_SIMD=1 | Print a one-line [I::bwamem3_simd_init_body] banner at startup naming the build baseline (g_build_tier), the detected host capability, and the resolved dispatch tier. Also enables the build-baseline-vs-host gap warning that PR #84 originally emitted unconditionally and PR #86 demoted to debug-only. |
Both are read once during bwamem3_simd_init() and ignored after that
call returns.
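The downgrade-only rule for BWAMEM3_FORCE_TIER can be sketched as below. This is a simplified model, not the dispatcher's source: the Tier enum, parse_tier, resolve_tier, and the warning wording are all hypothetical; only the tier names and the "reject and fall back to the detected tier" behavior come from the table above.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>

// Ordered from narrowest to widest, matching the ladder in the table.
enum class Tier { sse41 = 0, sse42, avx, avx2, avx512bw };

static const char* const kTierNames[] = {"sse41", "sse42", "avx", "avx2", "avx512bw"};

// Returns true and sets *out when name is a recognized tier name.
bool parse_tier(const char* name, Tier* out) {
    for (int i = 0; i < 5; ++i) {
        if (std::strcmp(name, kTierNames[i]) == 0) {
            *out = static_cast<Tier>(i);
            return true;
        }
    }
    return false;
}

// Honors the request only if it names a known tier at or below detected;
// otherwise warns on stderr and keeps the detected tier.
Tier resolve_tier(Tier detected, const char* forced) {
    if (forced == nullptr || *forced == '\0') return detected;
    Tier want;
    if (!parse_tier(forced, &want)) {
        std::fprintf(stderr, "warning: unrecognized tier '%s'; using detected tier\n", forced);
        return detected;
    }
    if (want > detected) {
        std::fprintf(stderr, "warning: tier '%s' is above the host tier; using detected tier\n", forced);
        return detected;
    }
    return want;
}
```

In this model the environment would be consulted once, for example resolve_tier(detected, std::getenv("BWAMEM3_FORCE_TIER")), and the result cached for the rest of the run.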
Host-floor enforcement
bwa-mem3 mem, bwa-mem3 index, and bwa-mem3 shm all call
bwamem3_enforce_host_floor() early in main() (PR #95). The check
compares g_host_capability against the compile-time g_build_tier
(derived from compiler predefined macros, reflecting whichever
BASELINE_ARCH was set at build time) and exits with code 2 and an
[E::bwamem3] message naming the gap if the host cannot execute the
binary’s compiled-in instructions. This converts what would otherwise
be an unhelpful SIGILL deep in alignment into a clean abort at startup.
Diagnostic invocations opt out: bwa-mem3 version,
bwa-mem3 <subcommand> --help, and bwa-mem3 <subcommand> -h always
succeed regardless of host capability, so operators can introspect a
binary on a host that cannot run alignment. The version command
prints SIMD floor: (the build’s required minimum) and
SIMD runtime: (the resolved tier) on stdout; on a too-old host it
also emits a [W::bwa-mem3] warning on stderr.
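A rough sketch of the startup decision described in this section, under stated assumptions: tiers are modelled as plain integers (higher means wider), is_diagnostic_invocation and host_floor_status are hypothetical names rather than the real bwamem3_enforce_host_floor() signature, and the message text after the [E::bwamem3] prefix is invented for illustration.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>

// True for the invocations the text says always succeed:
// "version", or any subcommand followed by --help / -h.
bool is_diagnostic_invocation(int argc, const char* const* argv) {
    if (argc >= 2 && std::strcmp(argv[1], "version") == 0) return true;
    for (int i = 2; i < argc; ++i) {
        if (std::strcmp(argv[i], "--help") == 0 || std::strcmp(argv[i], "-h") == 0) {
            return true;
        }
    }
    return false;
}

// Returns 0 when startup may continue, or 2 (the exit code named above)
// when the host tier is below the tier the binary was built for.
int host_floor_status(int host_tier, int build_tier, int argc, const char* const* argv) {
    if (is_diagnostic_invocation(argc, argv)) return 0;
    if (host_tier >= build_tier) return 0;
    std::fprintf(stderr, "[E::bwamem3] host SIMD tier %d is below the build floor %d\n",
                 host_tier, build_tier);
    return 2;
}
```

A real main() would exit with the returned code when it is non-zero, before touching any code compiled above the portable baseline.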
The simd_dispatch.cpp translation unit itself is compiled at
-march=x86-64 via an explicit Makefile rule, so the precheck path
stays SIGILL-safe even when BASELINE_ARCH=avx2 (or higher) for the
rest of the binary.
Per-tier parity validation
test/regression/all_tiers_parity.sh runs bwa-mem3 mem with
BWAMEM3_FORCE_TIER walking the full ladder
(sse41 → sse42 → avx → avx2 → avx512bw) on the same input and
diffs the SAM output. The expected result is byte-identical SAM
across every tier; any divergence is a bug in either a kernel TU or
the per-kernel factory wiring. CI runs this script on the x86 matrix
row.
Trade-offs vs the prior multi-binary launcher
| Property | Pre-PR-#83 (multi-binary execv) | Current (single binary, in-process dispatch) |
|---|---|---|
| Install size | ~120 MB (5 ISA binaries + launcher) | ~25 MB (one binary) |
| Build cost | 5 sequential clean rebuilds + launcher | One parallel build |
| Process model | bwa-mem3 (launcher) → execv → bwa-mem3.<tier> | One process, one main() |
| Per-call overhead | Direct call (tier fixed at launch via separate binary) | Indirect call through factory vtable or extern "C" wrapper (~0.3 ns / call) |
| Non-kernel auto-vectorization | At each binary’s compile tier | At BASELINE_ARCH (default avx2); raise via BASELINE_ARCH= |
| Tier override | Run the .<tier> binary directly | BWAMEM3_FORCE_TIER=<tier> (downgrade-only) |
| runsimd.cpp (220-line launcher + safestringlib) | Required | Removed |
The ~0.3 ns indirect-call cost is amortized across alignment work and
has not been measurable in any bench cell. Compiling non-kernel code at
BASELINE_ARCH closes the gap that PR #84 identified: PR #83 had
originally regressed by silently hardcoding the non-kernel compile to
sse41.
Distribution layout
For deployment on any x86 host meeting the build’s floor:
bin/
bwa-mem3 ← single binary, dispatches in-process
For ARM:
bin/
bwa-mem3 ← single binary, NEON kernels only
No .<tier>-suffixed companion files are produced or needed. When
shipping a Docker image intended for a mixed-microarch fleet, build at
the lowest expected tier (e.g. BASELINE_ARCH=avx2 for “AVX2 and
newer”) — the runtime dispatcher will still pick AVX-512BW kernels on
AVX-512 hosts via the per-tier factory tables. See
Multi-architecture deployment
for the docker buildx manifest-list recipe.
The mem SIMD banner
The legacy Executing in AVX2 mode!! banner is gone. Use either:
- bwa-mem3 version: prints SIMD floor: and SIMD runtime: lines on stdout (always available, no alignment required).
- BWAMEM3_DEBUG_SIMD=1 bwa-mem3 mem …: prints a one-line [I::bwamem3_simd_init_body] banner on stderr at the start of the run.
See also:
SIMD dispatch architecture ·
Apple Silicon / NEON port ·
Building from source ·
Performance → SIMD dispatch matrix ·
Host requirements ·
BASELINE_ARCH=avx512bw build flag ·
Multi-architecture deployment
Apple Silicon / NEON port
bwa-mem3 supports ARM64 (Apple Silicon and Linux aarch64) as a first-class build target. The port uses the sse2neon translation shim as a baseline and replaces the two most performance-critical SSE paths with native NEON intrinsics.
Architecture overview
The ARM build compiles a single binary with a single NEON kernel TU. There is only one NEON instruction-set level on all current ARM64 CPUs, so the per-tier dispatch table used by the x86 single-binary build (see Single-binary SIMD dispatch (x86)) collapses to a one-entry switch on aarch64 — there is effectively no dispatch overhead. make arm64 builds and installs the binary at the bare bwa-mem3 name.
sse2neon shim
ext/sse2neon/sse2neon.h is a header-only library that maps Intel SSE intrinsics to their NEON equivalents. When APPLE_SILICON=1 is defined (set automatically when uname -m is arm64 or aarch64), src/simd_compat.h includes sse2neon and defines the SSE feature test macros (__SSE__ through __SSE4_2__) so that code guarded by those macros compiles without changes.
The translation is not zero-cost for all operations. Two patterns that sse2neon handles poorly are replaced with native NEON in src/simd_compat.h:
- _mm_movemask_epi16: used heavily in bandedSWA.cpp to extract the sign bit of each 16-bit lane. The native implementation shifts right by 15, narrows to 8-bit with vmovn_u16, and reduces with position-weighted vaddv_u8.
- _mm_blendv_epi16_fast: a bitwise select on 16-bit lanes using vbslq_s16. Replaces the three-operation OR/AND/ANDNOT sequence sse2neon emits for _mm_blendv_epi8.
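To make the two replacements concrete, here is a portable scalar model of what they compute. It is not the NEON code in src/simd_compat.h: movemask_epi16_model and bitselect16 are hypothetical names, and the model assumes the mask packs the sign of lane i into bit i, which is what the shift / narrow / position-weighted-add description implies.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of the movemask steps: one result bit per 16-bit lane.
int movemask_epi16_model(const std::int16_t lanes[8]) {
    int mask = 0;
    for (int i = 0; i < 8; ++i) {
        // Step 1: logical shift right by 15 leaves only the sign bit (0 or 1).
        const std::uint16_t sign =
            static_cast<std::uint16_t>(static_cast<std::uint16_t>(lanes[i]) >> 15);
        // Step 2: narrowing to 8 bits keeps that 0/1 value unchanged.
        const std::uint8_t narrow = static_cast<std::uint8_t>(sign);
        // Step 3: weight lane i by 2^i and accumulate (the horizontal add).
        mask += narrow << i;
    }
    return mask;
}

// Scalar model of a bitwise select: take bits from if_set where mask is 1,
// and from if_clear where mask is 0.
std::uint16_t bitselect16(std::uint16_t mask, std::uint16_t if_set, std::uint16_t if_clear) {
    return static_cast<std::uint16_t>((mask & if_set) | (~mask & if_clear));
}
```

For lanes {-1, 0, 5, -7, 0, 0, 0, -32768} the model sets bits 0, 3, and 7, because only those lanes are negative.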
Memory alignment
Apple Silicon uses 128-byte cache lines (versus 64 bytes on x86). simd_compat.h overrides _mm_malloc on ARM to call posix_memalign with a minimum alignment of 128 bytes for all SIMD allocations. CACHE_LINE_BYTES is set to 128 in macro.h when APPLE_SILICON=1.
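A sketch of an allocator with a 128-byte alignment floor, assuming a POSIX system. simd_alloc and kMinSimdAlign are hypothetical names; the real override lives in simd_compat.h and may differ in detail.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>  // posix_memalign and free (POSIX)

// 128 matches the Apple Silicon cache-line size quoted above.
static const std::size_t kMinSimdAlign = 128;

// Allocates size bytes aligned to at least kMinSimdAlign.
// Returns nullptr on failure; release the memory with free().
void* simd_alloc(std::size_t size, std::size_t align) {
    // posix_memalign requires a power-of-two alignment.
    assert(align != 0 && (align & (align - 1)) == 0);
    if (align < kMinSimdAlign) align = kMinSimdAlign;
    void* p = nullptr;
    if (posix_memalign(&p, align, size) != 0) return nullptr;
    return p;
}
```

Raising the requested alignment rather than rejecting smaller values lets existing call sites that ask for 16, 32, or 64 bytes keep working unchanged.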
Accelerate.framework
The Makefile links -framework Accelerate on macOS ARM builds. The framework is linked but not used for computation: bwa-mem3’s hot paths (Smith-Waterman, FM-index) do not match the large-matrix / large-vector patterns that BLAS and vDSP target. The link is retained to keep the option open and adds no overhead at runtime.
P-core / E-core detection
src/fastmap.cpp calls HTStatus() on macOS to detect the Apple Silicon microarchitecture. HTStatus() reads the hw.perflevel0.physicalcpu and hw.perflevel1.physicalcpu sysctl keys to report P-core and E-core counts and the L2 cache size (typically 4 MB on M-series chips). This information is printed at startup for diagnostic purposes. The L2 cache size is used to validate the compile-time BATCH_SIZE setting (currently 1024, which was already optimal for a 4 MB L2 cache).
Benchmark results
All measurements use 100K paired-end reads, 5% error rate, 30% indels, chr17 reference, 8 threads, on an M-series Apple Silicon machine.
| Build | Wall-clock (avg, s) | vs. previous row |
|---|---|---|
| sse2neon baseline (no native NEON) | 15.4 | — |
| + native NEON kswv.cpp | 14.4 | ~7% faster |
| + native NEON bandedSWA.cpp blendv | 13.8 | ~4% faster |
| PGO on top of native NEON | ~13.4 | ~3% further |
The FM-index (FMI_search.cpp) is memory-bound with sequential pointer-chasing dependencies and does not benefit from SIMD. libsais benefits from OpenMP-parallel suffix-array construction but not from SIMD widening within a single thread.
Optimization task summary
| Task | Status | Impact | Notes |
|---|---|---|---|
| Correctness verification | done | — | 200,006 alignments, 0 differences vs. reference |
| Dynamic L2 cache detection | done | ~0% | 4 MB detected; compile-time BATCH_SIZE=1024 already optimal |
| Native NEON bandedSWA.cpp | done | ~4% | vbsl-based blendv in simd_compat.h |
| Per-tier dispatch table | N/A | 0% | Collapses to one entry on ARM (single NEON level) |
| Accelerate.framework | done | ~0% | Linked; no suitable compute patterns |
| M1/M2/M3/M4 detection | done | ~0% | P/E-core counts and L2 cache via sysctl |
| Native NEON FMI_search.cpp | N/A | 0% | Memory-bound; SIMD cannot help |
| Profile-Guided Optimization | done | ~3% | make pgo-generate / make pgo-use |
Building for Apple Silicon
# Standard arm64 build
make arch=arm64
# PGO build (recommended for production on Apple Silicon)
make pgo-generate PGO_ARCH=arm64
./bwa-mem3.pgo-instr mem -t 8 ref.fa r1.fq.gz r2.fq.gz > /dev/null
make pgo-use PGO_ARCH=arm64
The resulting bwa-mem3.pgo binary combines the native NEON kernels with PGO. In the benchmark table above that is 15.4 s down to ~13.4 s, roughly 13% faster than the pure sse2neon baseline.
Tip — Recommended production build on Apple Silicon
Use PGO for production deployments. The combined improvement from native NEON kernels plus PGO (roughly 13% in the benchmark above) is consistent and verified on M-series hardware.
Files modified in the NEON port
- src/kswv.cpp, src/kswv.h: native NEON batched Smith-Waterman
- src/bandedSWA.h: SIMD width definitions for ARM
- src/simd_compat.h: sse2neon integration, aligned allocation, _mm_blendv_epi16_fast, _mm_movemask_epi16
- src/fastmap.cpp: L2 cache detection, HTStatus() for non-NUMA (macOS)
- src/macro.h: BATCH_SIZE and CACHE_LINE_BYTES tuning for Apple Silicon
- Makefile: arm64 target, sse2neon flags, Accelerate linkage, PGO targets
See also: SIMD dispatch architecture · Building from source · Performance → PGO build · Performance → SIMD dispatch matrix · What’s Different → Architecture support
Regression test framework
bwa-mem3 has three categories of tests — unit, integration, and regression — plus a separate benchmark harness in bench/. Understanding the distinction helps you choose where to add a new test and what to expect from CI.
Test categories
| Category | Binary / runner | Fixtures | CI scope |
|---|---|---|---|
| unit | test/bwa_mem3_tests_unit | None; all inputs synthetic | Every matrix row |
| integration | test/bwa_mem3_tests_integration | Small committed FASTAs / FMI in test/fixtures/ | SSE4.1, AVX2, ARM64 Linux, macOS ARM |
| regression | test/regression/*.sh | Downloaded references (phiX, chr22) + bwa + dwgsim | Canonical AVX2 row only |
Unit tests must use only synthetic inputs generated programmatically and complete in under 100 ms each. They exercise individual kernels in isolation: kswv scoring, banded Smith-Waterman, KSW, FM-index operations, SMEM extraction, BAM encoding, and pair handling.
Integration tests may load small committed fixtures from test/fixtures/ and have a per-test budget of 10 seconds. They exercise cross-component paths: index loading, SMEM-to-alignment pipelines, and output format validation.
Regression tests are standalone bash scripts that shell out to the bwa-mem3 binary, may diff against third-party tool output (bwa, bwa-meth, samtools), and require fixtures that are either committed to the fixtures directory or downloaded by CI at run time.
Running tests locally
# Build the aligner and test binaries
make
make -C test -j$(nproc 2>/dev/null || sysctl -n hw.ncpu)
# Run all unit tests
./test/bwa_mem3_tests_unit
# Run all integration tests
./test/bwa_mem3_tests_integration
# Run a specific test case or suite
./test/bwa_mem3_tests_unit --test-case="*kswv*"
./test/bwa_mem3_tests_unit --test-suite="unit/kswv"
./test/bwa_mem3_tests_unit --test-suite-exclude=slow
# Verbose output (also print passing assertions)
./test/bwa_mem3_tests_unit --success
The make test target is a convenience shortcut that builds and runs the unit and integration binaries plus the two legacy standalone regression tests (kswv_nrow_zero_test and shm_section_find_test):
make test
Running a regression test locally
Regression scripts expect certain environment variables to point at fixtures. The phiX parity test requires dwgsim:
mkdir -p /tmp/ci-test && cd /tmp/ci-test
curl -sL "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/819/615/GCF_000819615.1_ViralProj14015/GCF_000819615.1_ViralProj14015_genomic.fna.gz" | gunzip > phix174.fa
dwgsim -z 42 -N 500 -1 150 -2 150 -r 0.001 -S 2 phix174.fa reads
cd -
BWA_MEM2="$(pwd)/bwa-mem3" CI_TEST_DIR=/tmp/ci-test bash test/regression/phix_parity.sh
Test framework
The unit and integration binaries are built on doctest, a single-header C++ test framework. Tests are discovered by file glob: any test/unit/test_*.cpp file is compiled into the unit binary; any test/integration/test_*.cpp file is compiled into the integration binary. No Makefile edit is needed when adding a new test_*.cpp.
Test organisation
Tag each TEST_CASE with doctest::test_suite("category/module"):
TEST_CASE("nrow==0 batch does not store out of bounds"
* doctest::test_suite("unit/kswv")) {
// ...
}
The test_suite decorator is overriding (not additive). Encode the category (unit or integration) and module (kswv, bandedsw, ksw, fmindex, smem, bam, pair, cigar, util) as a single slash-separated string.
Framework helpers
The test/framework/ directory provides helpers shared across test files:
| Header | Provides |
|---|---|
| scoring.h | ScoringMatrix, build_scoring_matrix, default_scoring_matrix |
| seqpair.h | TestPair struct |
| seqpair_gen.h | Deterministic pair generators: random, exact-match, all-mismatch, homopolymer, sub-cluster, N-bases |
| seqpair_batch.h | BatchBuffers, a flat-layout packer for kswv batch input |
| ksw_runner.h | run_scalar_ksw, default gap/extra parameters |
| kswv_runner.h | Two-pass run_kswv_batch |
| kswr_cmp.h | Score / coordinate / score2 comparators |
| junit_reporter.h | CI matrix-row banner and JUnit XML output |
Debugging a failing test
# Break into debugger at the first failing assertion
./test/bwa_mem3_tests_unit --test-case="*kswv*" --break
# Run a single SUBCASE
./test/bwa_mem3_tests_unit --test-case="*foo*" --subcase="bar"
# Enable per-phase diagnostics for kswv tests
BWA_TESTS_DEBUG_PHASE0=1 BWA_TESTS_DEBUG_PHASE1=1 \
./test/bwa_mem3_tests_unit --test-suite="unit/kswv"
JUnit artifacts are uploaded per CI matrix row (unit-results-<name>.xml, integration-results-<name>.xml) and available on the Actions run page.
Tip — Use ASAN for memory bugs
Build with make ASAN=1 test to catch out-of-bounds writes in vectorised kernels. The kswv_nrow_zero_test specifically exercises the nrow==0 path that triggered a pre-allocation store bug; ASAN reports this immediately rather than at a later allocator operation.
Standalone regression tests
Three standalone regression tests live outside the doctest harness because they predated it. The two binaries are built and run by make test; the third is script-driven:
- kswv_nrow_zero_test: binary; exercises the all-len1==0 batch path in every SIMD kswv variant. Catches the nrow==0 rowMax store overrun from issue #38 / upstream bwa-mem2 PR #289.
- shm_section_find_test: binary; exercises the shared-memory index section-find logic.
- shm_pack_round_trip_test: script-driven, invoked via test/shm_pack_round_trip_test.sh, which builds the phiX index first.
Additional integration shell scripts in test/:
| Script | What it tests |
|---|---|
| pg_cl_escape_test.sh | @PG CL: tab/newline escape in SAM headers |
| mimalloc_loaded_test.sh | mimalloc override is active when USE_MIMALLOC=1 |
| shm_round_trip_test.sh | bwa-mem3 shm load / list / drop cycle |
| shm_meth_test.sh | --meth index compatibility with shm |
| help_prescan_test.sh | --help prints without running alignment |
| libsais_*.sh | libsais index correctness vs. BWA / determinism |
Benchmark harness (bench/)
bench/ is a separate performance measurement harness used during development to gate performance PRs. It is not part of the CI test suite.
cp bench/config.env.example bench/config.env
# Edit config.env to point at your index, reads, and binary paths
bench/run.sh baseline # N trials; appends to bench/results.csv
bench/run.sh candidate # N trials on the candidate binary
bench/compare.sh baseline candidate # wall-clock / RSS / md5 delta report
Each run records: tag, host, architecture, binary path, thread count, trial index, wall-clock seconds, max RSS (KB), and a golden md5 (single-threaded, @PG-stripped SAM). The md5 verifies byte-identical output across builds; wall-clock is the primary performance metric.
See also: Building from source · SIMD dispatch architecture · Contributing · What’s Different → Correctness fixes · What’s Different → Build & infrastructure
Release process
bwa-mem3 follows semantic versioning. Releases are driven by git tags. The version string is derived automatically from git describe and embedded in every binary at compile time.
Version stamping
The Makefile computes the version string at parse time:
FG_LABS_VERSION_FALLBACK := 0.2.0
VERSION_STRING := $(shell git describe --tags --dirty 2>/dev/null || echo $(FG_LABS_VERSION_FALLBACK))
git describe produces a string such as v0.1.0 (on a tag), v0.1.0-3-gabcdef1 (three commits past the tag), or v0.1.0-dirty (uncommitted changes). If git describe fails — for example in a source tarball or a shallow clone without tag history — the build falls back to FG_LABS_VERSION_FALLBACK.
The string is written into src/version.h by the src/version.h: FORCE rule, which runs on every make invocation but only touches the file when the string changes. This minimises unnecessary recompilation of src/main.o.
PACKAGE_VERSION from src/version.h appears in:
- bwa-mem3 version output (stdout).
- The @PG VN: field in every SAM/BAM file produced by bwa-mem3 mem.
Verifying the version
./bwa-mem3 version
# Example output on a tagged commit:
# v0.1.0
# mimalloc 3.3.0 ← if USE_MIMALLOC=1
On an untagged commit the string includes the commit distance and short SHA:
v0.1.0-12-g3f7ab2e
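The string shapes listed above can be told apart mechanically. The following is an illustrative parser, assuming the version strings follow exactly the forms shown on this page (optional v prefix, optional -pre suffix, optional -N-gSHA distance, optional -dirty); DescribeParts and parse_describe are hypothetical names and are not part of bwa-mem3.

```cpp
#include <cassert>
#include <regex>
#include <string>

struct DescribeParts {
    bool ok = false;     // true when the input matched one of the expected shapes
    std::string tag;     // e.g. "v0.1.0"
    int distance = 0;    // commits past the tag; 0 when exactly on the tag
    std::string sha;     // abbreviated commit hash, empty when on the tag
    bool dirty = false;  // true when the string ends in "-dirty"
};

DescribeParts parse_describe(const std::string& s) {
    // Groups: 1 = tag, 2 = distance, 3 = short SHA, 4 = "-dirty".
    static const std::regex re(
        R"((v?\d+\.\d+\.\d+(?:-pre)?)(?:-(\d+)-g([0-9a-f]+))?(-dirty)?)");
    DescribeParts out;
    std::smatch m;
    if (!std::regex_match(s, m, re)) return out;
    out.ok = true;
    out.tag = m[1].str();
    if (m[2].matched) {
        out.distance = std::stoi(m[2].str());
        out.sha = m[3].str();
    }
    out.dirty = m[4].matched;
    return out;
}
```

A release check could then require ok, distance == 0, and !dirty, which corresponds to a build made exactly on a clean tagged commit.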
Semver policy
bwa-mem3 follows semver, interpreted for an alignment tool as follows.
MAJOR (X.0.0) — bump when the change would break a downstream
consumer that pinned the previous version without checking release
notes. Concretely:
- An on-disk index file format change (a re-index is required to use the new version).
- Removal or rename of a CLI flag or subcommand.
- A SAM/BAM tag is removed, renamed, or its type/value space changes incompatibly (a column-fixed downstream parser would break). Adding a new tag is not a major change.
- A change to the resolved primary alignment that is intentional and affects more than a negligible fraction of reads (e.g. a MAPQ recalibration applied unconditionally). Concordance regressions attributable to bug fixes are not major changes — call them out in the release notes under “Correctness” instead.
- Dropping support for a previously supported host class (e.g. raising the build’s compiled-in BASELINE_ARCH floor in a way that excludes hosts the previous release ran on).
MINOR (0.X.0) — bump for any user-visible new functionality that
does not break consumers pinned to the previous minor. Examples:
- A new CLI flag or subcommand.
- A new SAM aux tag emitted on output (e.g. HN:i in 0.1.0, the Bismark XR:Z/XG:Z/XM:Z set in 0.2.0).
- A new operational feature (e.g. bwa-mem3 shm, in-process SIMD dispatch).
- A user-facing default change that is documented in release notes but does not require any consumer action (e.g. BASELINE_ARCH=avx2 as the build default).
- New performance characteristics that change wall-time meaningfully.
PATCH (0.0.X) — bump for bug fixes, doc-only changes, build
fixes, and internal refactors that have no user-visible behavioral
delta. Pre-existing-bug fixes that incidentally shift output for a
small fraction of reads are patch-level when called out in the
release notes; widespread output shifts (>0.1% of reads on a typical
WGS bench cell) deserve MINOR or MAJOR depending on the source.
While the project is pre-1.0, the leading 0. is treated literally —
0.2.0 may make breaking changes vs 0.1.0 if called out clearly in
the release notes. After 1.0, MAJOR bumps are reserved for genuinely
breaking changes.
Release-readiness checklist
Run through this list on the commit you intend to tag, before any git-tag command. Every item must pass.
Build and test
- make clean && make succeeds at the default BASELINE_ARCH (avx2) on a Linux x86_64 host.
- make clean && make BASELINE_ARCH=sse41 succeeds on the same host, confirming the portability floor still compiles.
- make clean && make succeeds on an arm64 host (Apple Silicon or aarch64 Linux).
- make test passes on both x86_64 and arm64.
- test/regression/all_tiers_parity.sh produces byte-identical SAM across BWAMEM3_FORCE_TIER=sse41 → sse42 → avx → avx2 → avx512bw on an AVX-512BW host. Failures here indicate a per-tier kernel or dispatcher-wiring regression; fix before tagging.
Bench
- bwa-mem3-bench run submitted on the candidate SHA via bwa_mem3_bench.cli submit --fg-labs-sha <sha> (or the local smoke path for a fast sanity check).
- bench regression --prev <previous-tag-sha> reports gate PASS: concordance ≥ 99.999% on every vs-baseline.json cell except methylation (which is expected to drift vs the bwameth baseline; see the methylation carve-out below) and no cell labeled REGRESSION.
- Methylation cells reviewed for expected-drift consistency: the meth-twist-emseq-5M concordance vs the bwameth baseline should sit at ~98.9% post-PR-#90, with the per-class breakdown matching the entry in bwa-mem3-bench/docs/expected-divergences.yaml (or the entry added in this release; the file is in the bench repo, not in this repo).
Docs
- make docs builds cleanly with no mdbook warnings.
- NEWS.md has a top-section entry for the new version with Operational / packaging, Correctness, Performance, and Methylation subheadings as applicable; every user-visible PR in the release window is listed with its number.
- docs/src/whats-different/overview.md FG-MAIN-TABLE is regenerated to cover the new PRs (see Contributing for the regeneration command).
- docs/src/whats-different/upstream-prs.md has rows for every user-visible PR landed since the previous tag.
- docs/src/reference/changelog.md and docs/src/cli/version.md examples reference the new release string.
- Spot-check the bwa-mem3-bench reference numbers in docs/src/performance/overview.md against the bench’s regression.md for the tagging SHA.
Tagging the release
Run these in order; each command depends on the previous one.
- Pre-flight (confirms the readiness checklist):
  make clean && make
  make test
  make docs
- Confirm NEWS.md is current. The top entry header line must match the tag you are about to create (e.g. Release 0.2.0 (YYYY-MM-DD)).
- Tag the release commit. Prefer a signed tag (-s); fall back to an annotated tag (-a) only when signing is unavailable:
  git tag -s v0.X.Y -m "Release v0.X.Y"
- Push the tag to the fg-labs remote:
  git push fg-labs v0.X.Y
  Read the Docs activates a versioned build at /v0.X.Y/ automatically when the tag appears on the remote.
- Create a GitHub release from the tag via gh. The body should be the matching NEWS.md section, no preamble:
  gh release create v0.X.Y --repo fg-labs/bwa-mem3 \
    --title "bwa-mem3 v0.X.Y" \
    --notes-file <(awk '/^Release / {p = ($0 ~ /^Release 0\.X\.Y /)} p' NEWS.md)
  Substitute 0.X.Y with the exact version literal you are tagging, including any pre-release suffix; e.g. for v0.3.0-pre use /^Release 0\.3\.0-pre /. The trailing space in the inner pattern anchors the match to a complete version token so that 0.3.0 does not also match Release 0.3.0-pre (...). The awk script prints lines while p is true: it flips on at the matching Release line and back off at the next Release line, which gives a clean section without needing a trailing sed '$d'.
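The toggle logic of the awk one-liner can be modelled in a few lines of C++ to show why the trailing space matters. This is only an illustration of the filter's behavior: extract_release_section is a hypothetical name, it compares literal prefixes instead of regular expressions, and the release workflow itself uses awk.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Returns the lines of news that belong to the section whose header starts
// with "Release <version> " (note the trailing space after the version).
std::string extract_release_section(const std::string& news, const std::string& version) {
    const std::string any_header = "Release ";
    const std::string wanted = "Release " + version + " ";
    std::istringstream in(news);
    std::string line;
    std::string out;
    bool printing = false;
    while (std::getline(in, line)) {
        if (line.compare(0, any_header.size(), any_header) == 0) {
            // Every header line re-evaluates the flag: on for the wanted
            // version, off for any other release.
            printing = (line.compare(0, wanted.size(), wanted) == 0);
        }
        if (printing) {
            out += line;
            out += '\n';
        }
    }
    return out;
}
```

With a file containing both a Release 0.3.0-pre (...) and a Release 0.3.0 (...) header, asking for 0.3.0 returns only the second section, because the character after 0.3.0 in the first header is a hyphen rather than a space.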
Note — Tarball builds
Source tarballs created by GitHub (or git archive) do not include git history, so git describe fails and the version falls back to FG_LABS_VERSION_FALLBACK. For reproducible tarball builds, set VERSION_STRING explicitly on the command line: make VERSION_STRING=v0.X.Y.
Post-release verification
After the tag is pushed and the GitHub release is published:
- Wait ~5 minutes for Read the Docs to build the new version, then open https://bwa-mem3.readthedocs.io/en/v0.X.Y/ and confirm:
  - The version selector lists v0.X.Y.
  - The home page renders with no missing-page errors.
  - developer-guide/launcher.md, performance/overview.md, and methylation/tags.md all render with their mermaid diagrams and tables intact (these are the most diagram-heavy pages).
- Pull the tag in a clean clone and verify bwa-mem3 version reports the bare tag string (no -N-gSHA distance suffix):
  git clone -b v0.X.Y --depth 1 https://github.com/fg-labs/bwa-mem3.git
  cd bwa-mem3 && make
  ./bwa-mem3 version | head -1   # expect: v0.X.Y
- If the docs build failed on RTD or the version string is wrong, do not delete or move the tag. Tags are immutable in practice; open a follow-up v0.X.(Y+1) patch release with the fix instead.
Branch and tag conventions
- All release tags are on the main branch, which carries both upstream bwa-mem2 commits and fork-carried changes. See Branch and worktree conventions for the full branching model.
- Tags are prefixed with v: v0.1.0, v0.2.0, etc.
- Pre-release tags use a -pre suffix: v0.1.0-pre.
- Patch releases increment the third component: v0.1.1.
What’s Different table update
When a release bundles new fork-carried commits that were not previously documented, update the FG-MAIN-TABLE in docs/src/whats-different/overview.md in the same PR before tagging. See Contributing for the rule.
See also: Branch and worktree conventions · What’s Different → Overview · Reference → Changelog · Building from source
Branch and worktree conventions
This page describes how the bwa-mem3 repository branches relate to upstream bwa-mem2, the policy for where PRs land, and the conventions for local worktrees when working on multiple branches simultaneously.
Branch model
master — upstream mirror
master tracks the upstream bwa-mem2 master branch verbatim. No fork-carried changes are applied here. When upstream bwa-mem2 merges new commits, master is fast-forwarded to match.
master is the starting point for upstream rebase operations. It is never the target of fork PRs.
main — fork integration branch
main carries all fork-carried commits on top of a rebased upstream baseline. This is the branch that:
- All new feature, fix, and improvement PRs target.
- All git tags (v0.X.Y) are placed on.
- Read the Docs /latest/ follows.
When upstream bwa-mem2 makes significant changes, master is fast-forwarded and then main is rebased onto the new master tip. The rebase is verified by running the full test suite before the result is pushed.
Feature and fix branches
All development work happens on short-lived branches that are merged into main via pull request. Branch name conventions:
| Prefix | Use |
|---|---|
| feat/ | New features or capabilities |
| fix/ | Bug fixes |
| perf/ | Performance improvements |
| test/ | Test additions or improvements |
| docs/ | Documentation changes |
| ci/ | CI / build system changes |
| refactor/ | Code restructuring without behaviour change |
Branch names use kebab-case after the prefix: fix/kswv-nrow-zero, perf/libsais-fm-index, test/regression-tests.
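The naming rule is simple enough to check mechanically. A small sketch, assuming the prefix list above is exhaustive and that kebab-case means lowercase letters and digits separated by single hyphens; is_conventional_branch is a hypothetical helper, not a tool shipped with the repository.

```cpp
#include <cassert>
#include <regex>
#include <string>

// True when name is "<prefix>/<kebab-case-slug>" with a prefix from the table.
bool is_conventional_branch(const std::string& name) {
    static const std::regex re(
        "(feat|fix|perf|test|docs|ci|refactor)/[a-z0-9]+(-[a-z0-9]+)*");
    return std::regex_match(name, re);
}
```

Under these assumptions fix/kswv-nrow-zero passes, while Fix/Kswv (uppercase) and hotfix/thing (prefix not in the table) do not.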
Upstream rebase cadence
main is rebased onto master (i.e., onto upstream bwa-mem2) periodically — not on every upstream commit, but when upstream merges a batch of changes worth incorporating. The process is:
- Fast-forward master to the new upstream tip.
- Rebase main onto master, resolving any conflicts.
- Run make && make test to confirm the rebase is correct.
- Push master and main to the fg-labs remote.
Warning — Do not merge upstream into main
Always rebase rather than merge when incorporating upstream changes. Merge commits obscure the fork-carried commit history and make the What’s Different table harder to maintain.
Worktrees for parallel branches
When working on multiple branches simultaneously, use git worktrees instead of stashing or switching branches. Each worktree is a sibling directory of the main clone.
Creating a worktree for a PR branch
# Fetch the PR's head branch from the fg-labs remote
git fetch fg-labs <head-branch-name>
# Create a worktree with a local branch tracking the remote branch
git worktree add ../pr-<N> -b pr-<N> --track fg-labs/<head-branch-name>
The local branch name and directory name match the PR number (pr-N).
Creating a worktree for a new issue branch
# Fetch the latest main from fg-labs
git fetch fg-labs main
# Create a new feature branch off fg-labs/main
git worktree add ../issue-<N> -b <prefix>/issue-<N>-<short-slug> fg-labs/main
# Unset the upstream so the branch is untracked until first push
git -C ../issue-<N> branch --unset-upstream
On first push, push to fg-labs so the head branch is in the same organisation as the PR base:
git push -u fg-labs HEAD
Worktree naming conventions
| Directory name | Branch type |
|---|---|
| main/ | Primary checkout; tracks fg-labs/main |
| pr-<N>/ | PR review; local branch pr-N tracks fg-labs/<head-branch> |
| issue-<N>/ | Issue work; local branch <prefix>/issue-N-<slug> |
| Descriptive name | Feature work not yet tied to a PR or issue |
Listing and removing worktrees
# List all worktrees
git worktree list
# Remove a worktree after the PR is merged
git worktree remove ../pr-<N>
git branch -D pr-<N>
# Remove an issue worktree
git worktree remove ../issue-<N>
git branch -D <prefix>/issue-<N>-<slug>
Note — Worktree directories are siblings, not nested
All worktree directories sit next to the main clone at the same directory level, not inside it. This avoids confusing git commands that walk parent directories looking for .git.
PR policy
- All PRs target main.
- PRs from fork contributors should be opened against fg-labs/bwa-mem3 main.
- Every PR that adds a fork-carried commit must update the FG-MAIN-TABLE in docs/src/whats-different/overview.md in the same PR. See Contributing.
- Merge policy: squash-merge for single-commit changes; rebase-merge for multi-commit PRs with a clean commit history.
See also: Contributing · Release process · What’s Different → Overview · Building from source
Contributing
This page covers the mechanics of submitting changes to bwa-mem3: commit conventions, PR workflow, CI requirements, and the rule for keeping the fork-lineage table current.
Before you start
- Check the open issues and existing PRs to avoid duplicate work.
- For substantial changes, open an issue first to discuss scope and approach.
- Fork or branch from fg-labs/bwa-mem3 main. See Branch and worktree conventions for the branching model.
Commit message conventions
bwa-mem3 follows Conventional Commits (v1.0.0). Every commit message must start with a type prefix:
| Prefix | Use |
|---|---|
| feat: | New feature or capability |
| fix: | Bug fix |
| perf: | Performance improvement |
| test: | Test additions or changes |
| docs: | Documentation only |
| ci: | CI / build-system changes |
| refactor: | Restructuring without behaviour change |
| chore: | Maintenance (dependency bumps, version pins) |
The subject line is lowercase after the prefix, imperative mood, no trailing period. Keep it under 72 characters. Body lines wrap at 100 characters.
Good:
fix: kswv nrow==0 batch skips rowMax store when i==0
Exercises the all-len1==0 path across SSE4.1, AVX2, AVX-512BW, and ARM NEON.
Without the `if (i > 0)` guard, the store writes SIMD_WIDTH* bytes before the
allocation.
Closes #38.
Not acceptable:
Fixed stuff
Updated kswv
WIP
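The subject-line rules above can be expressed as a short check. This is a sketch under stated assumptions: is_valid_subject is a hypothetical helper, it accepts only the bare type prefixes from the table (no scopes), and it reads "lowercase after the prefix" as "the first character after the prefix is not uppercase", since the good example contains the identifier rowMax.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// True when subject starts with a known "type: " prefix, is under 72
// characters, does not begin with an uppercase letter after the prefix,
// and has no trailing period.
bool is_valid_subject(const std::string& subject) {
    static const char* const kTypes[] = {"feat", "fix", "perf", "test",
                                         "docs", "ci", "refactor", "chore"};
    if (subject.empty() || subject.size() >= 72) return false;
    if (subject.back() == '.') return false;
    for (const char* type : kTypes) {
        const std::string prefix = std::string(type) + ": ";
        if (subject.compare(0, prefix.size(), prefix) != 0) continue;
        if (subject.size() == prefix.size()) return false;  // nothing after the prefix
        const unsigned char first = static_cast<unsigned char>(subject[prefix.size()]);
        return std::isupper(first) == 0;
    }
    return false;  // no recognized type prefix
}
```

A check like this could run in a commit-msg hook, though the page does not say whether the project uses one.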
Pull request workflow
- Push your branch to fg-labs/bwa-mem3 (or your fork) and open a PR targeting fg-labs/bwa-mem3 main.
- The PR description should explain the motivation, summarise the change, and note any benchmarks or test results.
- All CI jobs must pass before merge. See CI matrix below.
- CodeRabbitAI reviews every PR automatically. Address all comments, including inline suggestions, summary comments, and nitpicks. Do not dismiss comments without a reply explaining why the suggestion was not adopted.
- A project maintainer will review and merge once CI is green and all comments are resolved.
Note — Draft PRs first
Open PRs as drafts while CI is running or while you are actively revising. Convert to ready-for-review only when the branch is stable, CI is green, and you have self-reviewed the diff.
The FG-MAIN-TABLE rule
Every PR that introduces a new fork-carried commit — a commit that is on main but not on master (the upstream bwa-mem2 mirror) — must update the FG-MAIN-TABLE block in docs/src/whats-different/overview.md in the same PR.
The table records each fork-carried change, its bwa-mem3 PR number, the corresponding upstream bwa-mem2 PR or issue (if any), and its upstream status. Keeping this table current is the primary mechanism by which the project maintains transparency about its relationship to upstream.
Warning — Do not skip the table update
A PR that adds a fork-carried commit but omits the table update will be sent back for revision. The table is reviewed as part of the standard PR checklist.
What counts as a fork-carried commit
A commit is fork-carried if:
- It adds new behaviour, fixes a bug, or changes build infrastructure in a way that diverges from upstream bwa-mem2 master.
- It is present on fg-labs/bwa-mem3 main but not (yet) merged upstream.
Pure documentation commits, CI-only changes, and upstream-rebase bookkeeping commits do not need a table entry.
CI matrix
CI runs on every PR and on push to main. The matrix covers:
| Row | Architecture | ISA | Platform |
|---|---|---|---|
| sse41 | x86_64 | SSE4.1 | Ubuntu |
| avx2 | x86_64 | AVX2 | Ubuntu (canonical) |
| avx512bw | x86_64 | AVX-512BW | Ubuntu |
| arm64-linux | aarch64 | NEON | Ubuntu ARM |
| arm64-macos | arm64 | NEON | macOS |
The canonical row (avx2) is the only one that runs regression tests (shell scripts in test/regression/). Unit tests run on every row. Integration tests run on the four widened canonical rows (SSE4.1, AVX2, ARM64 Linux, macOS ARM).
A PR must pass all rows before merge.
Code style
- C++14, gnu++14 dialect.
- Match the style of the surrounding code. The codebase inherits the upstream bwa-mem2 style, which is C-ish C++ with minimal STL use in hot paths.
- For new test code, follow the doctest patterns documented in the test framework.
- New SIMD code must include src/simd_compat.h rather than platform-specific headers directly. See SIMD dispatch architecture.
Adding a test for your change
- Bug fix → add a unit test or integration test that fails without the fix and passes with it.
- New feature → add unit tests for the core logic and, if the feature is end-to-end testable with a shell invocation, a regression test in test/regression/.
- Performance change → run the benchmark harness (bench/) to confirm the improvement and include median wall-clock numbers in the PR description.
See Regression test framework for the full guide on where to add tests and how to organise them.
See also: Branch and worktree conventions · Regression test framework · Release process · What’s Different → Overview · Building from source
bwa-mem3-bench
bwa-mem3-bench is a benchmarking suite that measures the alignment performance of bwa-mem3 against the upstream bwa-mem2 v2.2.1 baseline. It runs on AWS Batch spot instances across four dataset types — whole-genome sequencing (WGS), whole-exome sequencing (WES), panel, and bisulfite-sequencing (methylation) — all aligned against the hg38 reference. The suite covers three CPU microarchitectures: ARM Neon, x86 AVX2, and x86 AVX-512. Results are collected into a SQLite database for local analysis and reporting. The project is implemented in Python (orchestration, reporting, and CLI), Rust (BAM comparison tool), Snakemake (alignment workflow), and AWS CDK (cloud infrastructure).
When you’d use it
Use bwa-mem3-bench when you need reproducible, multi-architecture throughput numbers before committing a bwa-mem3 change to production or before deciding whether to adopt bwa-mem3 in place of bwa-mem2. It provides a structured “bless baseline, then compare” workflow: an upstream bwa-mem2 run is blessed once per upstream tag and stored in S3; subsequent bwa-mem3 runs are measured against that fixed baseline. Running a full benchmark fires a Snakemake coordinator job on AWS Batch and costs roughly $10 in spot capacity.
How it relates to bwa-mem3
bwa-mem3-bench is the authoritative source of benchmark evidence for every performance claim made in the bwa-mem3 documentation and changelog. When the Performance Overview cites speedup numbers, those numbers come from bwa-mem3-bench runs collected after the relevant PR was merged. The suite also validates that bwa-mem3 does not regress relative to bwa-mem2 on any supported architecture before a new release is tagged.
Links
- GitHub: https://github.com/fg-labs/bwa-mem3-bench
- License: MIT
See also: Performance Overview · SIMD dispatch matrix · bwa-mem2 (upstream) · Release process
bwa-mem3-rs
bwa-mem3-rs is a Rust crate that provides idiomatic bindings to the bwa-mem family of short-read aligners — bwa (original), bwa-mem2, and bwa-mem3. It exposes a safe Rust API over the underlying C++ alignment engine, allowing Rust programs to index a reference, configure alignment parameters, and align reads without shelling out to an external process. The bindings link statically against the chosen backend, so a binary built with bwa-mem3-rs carries the aligner and its SIMD kernels as a self-contained artifact.
When you’d use it
Use bwa-mem3-rs when you are building a Rust bioinformatics tool or pipeline that needs short-read alignment as an in-process library call rather than a subprocess invocation. It is especially useful when latency between reads arriving and alignments being available matters (no process-startup overhead), or when you want tight integration between the aligner’s output and downstream Rust code such as UMI grouping, consensus calling, or duplicate marking.
How it relates to bwa-mem3
bwa-mem3-rs targets bwa-mem3 as its primary high-performance backend. It is the intended integration path for fgumi and other Fulcrum Genomics tools that need alignment as a library dependency. Changes to bwa-mem3’s public API, flag semantics, or output format are coordinated with bwa-mem3-rs to keep the bindings current.
Links
- GitHub: https://github.com/fg-labs/bwa-mem3-rs
- License: MIT
See also: fgumi · bwa-mem3-bench · Aligning short reads (mem) · Developer Guide — Contributing
bwa-mem2 (upstream)
bwa-mem2 is the direct predecessor of bwa-mem3 and the project from which the bwa-mem3 fork is derived. It was created at Intel’s Parallel Computing Lab by Vasimuddin Md and Sanchit Misra to accelerate the alignment algorithm originally written by Heng Li in bwa. bwa-mem2 achieves a 1.3–3.1x throughput improvement over the original bwa-mem by replacing key inner loops with vectorised implementations (SSE4.1, SSE4.2, AVX2, and AVX-512) and by switching to a more compact FM-index encoding. Its output is identical to bwa-mem at the alignment level, and it is distributed under the MIT license.
Lineage
The bwa alignment family has evolved through three generations, each building on the last:
- bwa: Written by Heng Li. Established the BWA-MEM algorithm, the SAM output format conventions, and the .bwt/.pac/.ann/.amb index layout.
- bwa-mem2 (Vasimuddin et al., Intel): Replaced scalar inner loops with SIMD kernels; introduced the compact .bwt.2bit.64 and .0123 index formats; retained full output compatibility with bwa-mem.
- bwa-mem3 (Fulcrum Genomics fork): Carries correctness fixes, performance improvements, new features (bisulfite alignment, mimalloc, ARM Neon), and expanded architecture support on top of the bwa-mem2 codebase. See What’s Different from bwa-mem2 for the full change catalog.
When you’d use it
Use bwa-mem2 directly when you need a stable, widely validated aligner with precompiled binaries available via Bioconda and the project’s GitHub releases page, and when you do not require the features or fixes that bwa-mem3 adds. bwa-mem2 is also the right choice when you are working in an environment where the bwa-mem3 fork has not yet been validated against your specific reference or sequencing library type.
How it relates to bwa-mem3
bwa-mem3 tracks bwa-mem2’s master branch and periodically rebases fork-carried
commits on top of upstream changes. The What’s Different
section documents every divergence between the two projects, and the
Upstream PR status page tracks which bwa-mem3
changes have been proposed back to bwa-mem2. The goal is to keep the fork divergence
minimal and to upstream as many fixes as practical.
Links
- GitHub: https://github.com/bwa-mem2/bwa-mem2
- Citation: Vasimuddin Md, Sanchit Misra, Heng Li, Srinivas Aluru. “Efficient Architecture-Aware Acceleration of BWA-MEM for Multicore Systems.” IEEE IPDPS 2019.
- License: MIT (with third-party components under their respective licenses)
See also: What’s Different from bwa-mem2 · Upstream PR status · bwa-mem3-bench · Citation
fgumi
fgumi (Fulcrum Genomics Unique Molecular Indexing tools) is a high-performance suite of command-line tools for processing UMI-tagged next-generation sequencing data. Written in Rust, it provides UMI extraction from FASTQ files, read grouping by UMI with configurable assignment strategies, UMI-aware deduplication, simplex and duplex consensus calling, CODEC consensus calling, quality filtering of consensus reads, and overlapping read-pair clipping. fgumi is the intended successor to the Scala-based fgbio toolkit for UMI processing, targeting significantly higher throughput on multi-core systems. It is published on Bioconda and documented at https://fgumi.readthedocs.io.
Warning — Research preview
fgumi is currently a research preview. The Fulcrum Genomics team targets June 2026 for recommending fgumi over fgbio for production use. Verify fitness for your application before deploying in a clinical or production pipeline.
When you’d use it
Use fgumi when your sequencing library includes unique molecular identifiers and you need to group reads by UMI, call simplex or duplex consensus sequences, or remove PCR duplicates in a UMI-aware manner. It handles the standard commercial UMI library preparations (IDT xGen, KAPA, Twist, QIAseq, and others) and the CODEC protocol for duplex sequencing. fgumi is designed to be run after alignment with bwa-mem3 (or bwa-mem2) and before downstream variant calling or methylation analysis.
How it relates to bwa-mem3
fgumi and bwa-mem3 are sibling projects maintained by Fulcrum Genomics and are designed to work together in the same alignment-and-consensus pipeline. bwa-mem3 provides the aligned BAM that fgumi takes as input for grouping and consensus calling. The two projects share build and documentation conventions (mdbook on Read the Docs, Fulcrum theme, conventional commits) and are benchmarked together in the fgumi-benchmarks internal dataset suite. The intended integration path for in-process alignment within fgumi is bwa-mem3-rs, the Rust bindings for bwa-mem3.
Links
- GitHub: https://github.com/fulcrumgenomics/fgumi
- Docs: https://fgumi.readthedocs.io
- License: MIT
See also: bwa-mem3-rs · Aligning short reads (mem) · Best Practices — Multi-sample workflows · bwa-mem3-bench
bwameth.py
bwameth.py is a Python script written by Brent Pedersen that implements bisulfite sequencing (BS-Seq) alignment using the in-silico three-letter genome approach. It converts all cytosines to thymines in both the reference and the reads (C-to-T on the forward strand, G-to-A on the reverse), aligns the converted sequences with bwa-mem (or optionally bwa-mem2), and then recovers the original read sequence from the aligner’s tag output to tabulate methylation. bwameth.py supports single-end and paired-end reads from the directional bisulfite protocol and is published at https://arxiv.org/abs/1401.1129.
When you’d use it
Use bwameth.py when you need a battle-tested, community-supported bisulfite aligner that runs on top of the standard bwa-mem or bwa-mem2 you have already installed, and when you prefer a Python wrapper over a self-contained binary. It also remains the reference for downstream tabulation tools such as MethylDackel and SNP callers such as biscuit that expect the bwameth.py output format. For the actual methylation tabulation and variant calling steps, bwameth.py’s author recommends those dedicated tools rather than the tabulation utilities bundled with the original script.
How it relates to bwa-mem3
bwa-mem3 mem --meth is a single-binary drop-in replacement for the bwameth.py
alignment pipeline. It inlines the C-to-T and G-to-A conversion, runs the bwa-mem3
alignment engine (with all of its correctness fixes and SIMD speedups), rewrites the
@SQ headers to collapse the per-strand contig pairs back to canonical chromosome names,
emits Bismark-compatible XR:Z / XG:Z / XM:Z auxiliary tags, and writes a
@PG ID:bwa-mem3-meth header. The bwameth.py-style chimera QC heuristic is
available via --chimera-qc (off by default — Bismark behavior).
The Methylation Reference section documents the full
implementation in detail, including the Bismark XR:Z / XG:Z / XM:Z tags and
the --set-as-failed / --chimera-qc flags.
Tip — Interop with the bwameth.py c2t step
If your pipeline already performs its own C-to-T conversion before alignment, see Interop with external bwameth.py c2t for how to pass pre-converted reads to bwa-mem3 mem --meth without double-conversion.
Links
- GitHub: https://github.com/brentp/bwa-meth
- Paper: https://arxiv.org/abs/1401.1129
- License: MIT
See also: Methylation Reference: Overview · Quick start: methylation alignment · Best Practices — Methylation defaults · Interop with external bwameth.py c2t
Glossary
Terms used throughout this book, listed alphabetically.
@HD header
The optional first line of a SAM file header. Specifies the SAM format version (VN) and sort order (SO). When present, it must be the first line of the file and may appear only once. See Output: SAM/BAM, headers, tags.
@PG header
A SAM header line recording a program that processed the file, including ID, PN, VN, and CL fields. bwa-mem3 inserts ID:bwa-mem3 (or ID:bwa-mem3-meth in methylation mode). See Output: SAM/BAM, headers, tags.
@SQ header
A SAM header line describing a reference sequence (chromosome). Contains the sequence name (SN) and length (LN). In methylation mode, bwa-mem3 post-processes @SQ lines to collapse f/r-prefixed contig names back to one entry per chromosome. See Chimera QC and header rewriting.
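As an illustration of the header collapse described above, the sketch below merges per-strand @SQ lines back into one line per chromosome. It assumes the per-strand contigs are named by prepending a single f or r to the chromosome name (for example fchr1 and rchr1), which is an assumption about the naming scheme rather than something this glossary specifies. `collapse_sq_lines` is a hypothetical helper, not a bwa-mem3 function.

```python
def collapse_sq_lines(sq_lines):
    """Collapse f/r-prefixed @SQ lines to one line per chromosome, keeping order."""
    collapsed = []
    seen = set()
    for line in sq_lines:
        fields = line.rstrip("\n").split("\t")
        name = None
        for index, field in enumerate(fields):
            if field.startswith("SN:"):
                name = field[3:]
                # Assumed naming: 'f' or 'r' prepended to the canonical name.
                if name[:1] in ("f", "r"):
                    name = name[1:]
                fields[index] = "SN:" + name
        if name is not None:
            if name in seen:
                continue  # the other strand already produced this chromosome
            seen.add(name)
        collapsed.append("\t".join(fields))
    return collapsed
```

Under the assumed naming scheme every contig carries a prefix, so the sketch does not try to distinguish a prefixed name from a canonical name that happens to start with f or r.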
BAM
Binary Alignment Map: a compressed, binary encoding of SAM. Produced directly by bwa-mem3 when the --bam flag is given, or by converting its SAM output with samtools. See Output: SAM/BAM, headers, tags.
Banded Smith-Waterman (banded SWA)
A heuristic variant of the Smith-Waterman alignment algorithm that restricts the dynamic programming to a band of width w around the main diagonal. bwa-mem3 uses banded SWA for extension alignment; bwa-mem2 kernels are SIMD-vectorized and bwa-mem3 adds NEON implementations for Apple Silicon. See SIMD dispatch architecture.
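To make the band restriction concrete, here is a small scalar sketch of a banded local alignment score. It is not the bwa-mem3 kernel: it uses a single linear gap penalty for brevity (the bwa family uses affine gaps), the scoring values are illustrative defaults, and `banded_sw_score` is a hypothetical name. Only cells with |i - j| <= w are evaluated; everything outside the band is treated as a fresh start with score 0.

```python
def banded_sw_score(query, target, w, match=1, mismatch=-4, gap=-6):
    """Best local alignment score using only DP cells within w of the diagonal."""
    prev = [0] * (len(target) + 1)
    best = 0
    for i in range(1, len(query) + 1):
        cur = [0] * (len(target) + 1)
        lo = max(1, i - w)
        hi = min(len(target), i + w)
        for j in range(lo, hi + 1):
            score = match if query[i - 1] == target[j - 1] else mismatch
            diag = prev[j - 1] + score
            up = prev[j] + gap        # out-of-band neighbours hold 0, so this
            left = cur[j - 1] + gap   # can never beat the local-alignment floor
            cur[j] = max(0, diag, up, left)
            best = max(best, cur[j])
        prev = cur
    return best
```

The band trades sensitivity for work: with query ACGT and target TTTTACGT the full match lies four cells off the main diagonal, so a band of w=1 misses it while w=4 finds it.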
c2t
Cytosine-to-thymine in-silico conversion applied to reads (or reference) before methylation alignment. In --meth mode, bwa-mem3 converts R1 reads C→T and R2 reads G→A inline, without writing intermediate FASTQ files. See Conversion details (C->T, G->A).
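The per-read conversion described above amounts to a single character substitution. A minimal sketch follows, assuming lowercase bases should be converted the same way as uppercase ones; `convert_read` is a hypothetical name and this is not how bwa-mem3 implements the step internally.

```python
# Translation tables for the two in-silico conversions.
C_TO_T = str.maketrans("Cc", "Tt")
G_TO_A = str.maketrans("Gg", "Aa")


def convert_read(seq, is_read2=False):
    """Apply C->T to R1 reads and G->A to R2 reads; other bases are unchanged."""
    return seq.translate(G_TO_A if is_read2 else C_TO_T)
```

For example, `convert_read("ACGTC")` gives `ATGTT`, and `convert_read("ACGTG", is_read2=True)` gives `ACATA`. The unconverted sequence has to be kept alongside, since the conversion itself is lossy.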
Chimera
A read alignment where the aligned portion is short relative to the read length, often indicating a mapping artefact or a true chimeric molecule. In methylation mode, bwa-mem3 can apply a chimera QC heuristic (enabled with --chimera-qc, off by default): if the longest contiguous M/=/X CIGAR run is less than 44% of the read length, the alignment is flagged 0x200, the proper-pair bit is cleared, and MAPQ is capped at 1. See Chimera QC and header rewriting.
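The heuristic above can be sketched in a few lines. This is an illustration under stated assumptions, not the bwa-mem3 source: adjacent M, = and X operations are summed into one run and any other operation ends the run, the read length is supplied by the caller, and the function names are hypothetical.

```python
import re

CIGAR_OP_RE = re.compile(r"(\d+)([MIDNSHP=X])")
FLAG_PROPER_PAIR = 0x2
FLAG_QC_FAIL = 0x200


def longest_match_run(cigar):
    """Length of the longest contiguous run of M/=/X operations."""
    longest = current = 0
    for length, op in CIGAR_OP_RE.findall(cigar):
        if op in "M=X":
            current += int(length)
            longest = max(longest, current)
        else:
            current = 0  # assumption: any other operation breaks the run
    return longest


def apply_chimera_qc(flag, mapq, cigar, read_len):
    """Return (flag, mapq) after applying the 44% rule."""
    # Integer form of: longest run < 0.44 * read length.
    if longest_match_run(cigar) * 100 < 44 * read_len:
        flag = (flag | FLAG_QC_FAIL) & ~FLAG_PROPER_PAIR
        mapq = min(mapq, 1)
    return flag, mapq
```

A 100 bp read aligned as 40M60S is flagged (40 is below 44), while 44M56S sits exactly at the threshold and is left unchanged because the comparison is strict.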
FASTQ
A text format for raw sequencing reads. Each record contains a sequence identifier, the nucleotide sequence, a separator, and per-base quality scores in ASCII-encoded Phred format. bwa-mem3 accepts gzip-compressed FASTQ as input. See Quick start: align paired-end FASTQs.
FM-index
Ferragina-Manzini index — a full-text index over the Burrows-Wheeler Transform of a sequence. bwa-mem3 uses the compressed .bwt.2bit.64 FM-index for seed finding (SMEM lookup). See Indexing the reference.
Hard clip
A CIGAR operation (H) indicating that bases at the read end are absent from the SEQ field of the alignment record. Hard clipping is used in supplementary alignments to avoid duplicating the read sequence. See Output: SAM/BAM, headers, tags.
kswv
The SIMD-vectorized kernel implementing the inner loop of the Smith-Waterman extension alignment in bwa-mem2/bwa-mem3. bwa-mem3 carries correctness fixes for the score-saturation edge case across all SIMD width variants (NEON, AVX2, AVX-512BW). See Correctness fixes.
libsais
A library implementing the suffix-array induced sorting (SAIS) algorithm. bwa-mem3 optionally uses libsais for FM-index construction, reducing indexing time compared to the default suffix-array builder. See Performance improvements.
LTO
Link-Time Optimization — a compiler mode that defers optimization to link time, enabling cross-compilation-unit inlining. Activated via make lto-build. See Building from source.
MAPQ
Mapping quality: a Phred-scaled probability that a read alignment is incorrectly mapped. Reported in SAM field 5. bwa-mem3 follows bwa-mem2 MAPQ semantics; chimera QC in methylation mode caps MAPQ at 1 for chimeric alignments. See Output: SAM/BAM, headers, tags.
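"Phred-scaled" means MAPQ = -10 · log10(P), where P is the probability that the placement is wrong. The short sketch below converts in both directions. The helper names are hypothetical, and the cap of 60 reflects the usual maximum reported by the bwa family, stated here as an assumption rather than taken from this glossary.

```python
import math


def mapq_to_error_probability(mapq):
    """Probability that the alignment is wrong, given a Phred-scaled MAPQ."""
    return 10.0 ** (-mapq / 10.0)


def error_probability_to_mapq(p_wrong, cap=60):
    """Phred-scale a mismapping probability, rounded and capped."""
    if p_wrong <= 0.0:
        return cap
    return min(cap, int(round(-10.0 * math.log10(p_wrong))))
```

So MAPQ 20 corresponds to a 1-in-100 chance of a wrong placement, MAPQ 30 to 1-in-1000, and a MAPQ capped at 1 says the placement is more likely wrong than right.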
Mate rescue
A step in paired-end alignment where, if one mate lacks a confident seed, bwa-mem3 attempts to find it by performing Smith-Waterman alignment in the region near the mapped mate. bwa-mem3 adds NEON and AVX2 implementations of the mate-rescue kernel. See Architecture support.
mimalloc
A high-performance memory allocator from Microsoft. bwa-mem3 vendors mimalloc and links it into every binary by default. To disable, build with USE_MIMALLOC=0. See Memory allocator (mimalloc).
PGO
Profile-Guided Optimization: a two-pass build where the first pass instruments the binary, a representative workload is run to collect profiles, and the second pass uses those profiles to guide inlining and branch layout. Activated via make pgo-generate then make pgo-use. See PGO build.
Primary alignment
The alignment record for a read that represents the aligner’s best placement. A read has exactly one primary alignment (or is reported as unmapped). All other alignments for the same read are marked supplementary (chimeric split read) or secondary (alternative mapping). See Output: SAM/BAM, headers, tags.
Proper-pair flag (0x2)
SAM flag bit indicating that both mates of a pair are mapped in the expected orientation and insert-size range. In bwa-mem3, the mem_sam_pe function sets this flag; a correctness fix (PR #17) ensures it is propagated correctly under all conditions. See Correctness fixes.
SAM
Sequence Alignment Map: a tab-delimited text format for read alignments. Each record contains mandatory fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL) plus optional tags. See Output: SAM/BAM, headers, tags.
SIMD dispatch
Runtime selection of the fastest available SIMD instruction set (SSE4.1, SSE4.2, AVX, AVX2, AVX-512BW, NEON) for hot alignment kernels. On x86 this is implemented in process by src/simd_dispatch.cpp via __builtin_cpu_supports; on ARM64 a single NEON tier covers every supported CPU. See SIMD dispatch matrix.
Single-binary SIMD dispatch
On x86, bwa-mem3 ships one binary that contains compiled kernels for every supported SIMD tier (sse41 / sse42 / avx / avx2 / avx512bw) and selects one in process at startup via __builtin_cpu_supports. There are no per-tier companion binaries. On ARM64 the binary contains a single NEON kernel TU. Replaces the prior multi-binary execv launcher (PR #83). See Single-binary SIMD dispatch (x86).
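The x86 selection policy can be modelled in a few lines. The sketch below is a Python model of the decision only, not the C++ in src/simd_dispatch.cpp. The set `cpu_supports` stands in for the answers that __builtin_cpu_supports would give, the downgrade-only handling of a forced tier follows the release notes for the BWAMEM3_FORCE_TIER environment variable, and falling back to the detected tier after a rejected request is an assumption. `select_tier` and its warning strings are hypothetical.

```python
# Supported x86 tiers, ordered from lowest to highest.
TIERS = ("sse41", "sse42", "avx", "avx2", "avx512bw")


def select_tier(cpu_supports, forced=None):
    """Return (tier, warning) for a CPU feature set and an optional forced tier."""
    detected = None
    for tier in TIERS:
        if tier in cpu_supports:
            detected = tier  # keep the highest supported tier seen so far
    if detected is None:
        raise RuntimeError("CPU supports none of the compiled tiers")
    if not forced:
        return detected, None
    if forced not in TIERS:
        return detected, "unrecognized tier %r, using %s" % (forced, detected)
    if TIERS.index(forced) > TIERS.index(detected):
        return detected, "cannot force %s above detected %s" % (forced, detected)
    return forced, None
```

With an AVX2-capable feature set, forcing sse41 is honoured, while forcing avx512bw is rejected and the detected avx2 tier is used.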
SMEM
Super-Maximal Exact Match: a seed found by extending a read’s position in the FM-index as far as possible in both directions. SMEMs form the initial seeds for chaining and extension in the BWA-MEM algorithm. See Performance improvements.
Soft clip
A CIGAR operation (S) indicating that bases at the read end were not part of the alignment, but are still present in the SEQ field. Soft clipping commonly appears at adapter-containing or low-quality read ends. See Output: SAM/BAM, headers, tags.
Supplementary alignment
A SAM record (FLAG bit 0x800 set) representing a chimeric read split across two or more genomic loci. The segment with the longest aligned span is typically designated primary; remaining segments are supplementary. Hard clipping is used to avoid duplicating the SEQ field. See Output: SAM/BAM, headers, tags.
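Several entries in this glossary refer to individual FLAG bits (0x2, 0x200, 0x800). The sketch below shows how those bits combine into the primary / secondary / supplementary distinction. The bit values come from the SAM specification, including 0x4 (unmapped) and 0x100 (secondary), which this glossary does not list; the function names are hypothetical.

```python
FLAG_PROPER_PAIR = 0x2
FLAG_UNMAPPED = 0x4
FLAG_SECONDARY = 0x100
FLAG_QC_FAIL = 0x200
FLAG_SUPPLEMENTARY = 0x800


def alignment_kind(flag):
    """Classify a SAM record by its FLAG field."""
    if flag & FLAG_UNMAPPED:
        return "unmapped"
    if flag & FLAG_SUPPLEMENTARY:
        return "supplementary"
    if flag & FLAG_SECONDARY:
        return "secondary"
    return "primary"


def is_proper_pair(flag):
    """True when the proper-pair bit (0x2) is set."""
    return bool(flag & FLAG_PROPER_PAIR)


def is_qc_fail(flag):
    """True when the QC-fail bit (0x200) is set."""
    return bool(flag & FLAG_QC_FAIL)
```

For instance, FLAG 99 is a primary, properly paired record, and FLAG 2064 (0x800 + 0x10) is a supplementary record on the reverse strand.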
See also: Citation · License · Changelog · Output: SAM/BAM, headers, tags · What’s Different — Overview
Citation
How to cite
bwa-mem3 is a derivative of bwa-mem2. If you use bwa-mem3 in published work, please cite the original bwa-mem2 paper:
Vasimuddin Md, Sanchit Misra, Heng Li, Srinivas Aluru. Efficient Architecture-Aware Acceleration of BWA-MEM for Multicore Systems. IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2019. doi:10.1109/IPDPS.2019.00041
BibTeX:
@inproceedings{bwamem2-ipdps2019,
author = {Vasimuddin Md and Sanchit Misra and Heng Li and Srinivas Aluru},
title = {Efficient Architecture-Aware Acceleration of {BWA-MEM} for Multicore Systems},
booktitle = {IEEE International Parallel and Distributed Processing Symposium (IPDPS)},
year = {2019},
doi = {10.1109/IPDPS.2019.00041},
url = {https://doi.org/10.1109/IPDPS.2019.00041}
}
Lineage
bwa-mem3 is maintained by Fulcrum Genomics as a derivative of bwa-mem2, itself derived from bwa (Li & Durbin, 2009). The BWA-MEM algorithm was originally described in:
Heng Li. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997, 2013.
The bwa-mem3-specific changes and improvements carried on top of bwa-mem2 are documented in What’s Different from bwa-mem2.
See also: License · Changelog · What’s Different — Overview · Related Projects: bwa-mem2
License
bwa-mem3 is licensed under the MIT License (same as upstream bwa-mem2).
The MIT License
BWA-MEM2 (Sequence alignment using Burrows-Wheeler Transform),
Copyright (C) 2019 Intel Corporation, Heng Li.
Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:
The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Contacts: Vasimuddin Md <vasimuddin.md@intel.com>; Sanchit Misra <sanchit.misra@intel.com>;
Heng Li <hli@jimmy.harvard.edu>
See also: Citation · Changelog · Related Projects: bwa-mem2 · What’s Different — Overview
Changelog
Release 0.2.0 (2026-05-13)
Operational / packaging
- Single-binary SIMD dispatch on x86 (#83). The previous multi-binary build (make multi producing five bwa-mem3.<tier> ISA variants plus a runsimd.cpp launcher that execv’d the matching tier) is replaced by a single binary that contains compiled kernels for every supported tier (sse41/sse42/avx/avx2/avx512bw) and selects one in process at startup via __builtin_cpu_supports. Install size drops from ~120 MB to ~25 MB; per-call overhead is one indirect branch (~0.3 ns after BTB warm-up). No .<tier> companion files are produced or needed. See docs/src/developer-guide/launcher.md.
- BWAMEM3_FORCE_TIER=<tier> and BWAMEM3_DEBUG_SIMD=1 env vars (#83). BWAMEM3_FORCE_TIER is downgrade-only and replaces the prior “exec the bwa-mem3.sse41 binary” A/B-testing pattern; up-tier or unrecognized requests are rejected with a stderr warning.
- BASELINE_ARCH=avx2 is the new default for non-kernel translation units on x86 (#84, supersedes the SSE4.1 floor that PR #83 originally shipped with). Override via make BASELINE_ARCH=<tier>. AVX-512BW hosts using BASELINE_ARCH=avx512bw see a small additional speedup on Zen 4 with -mprefer-vector-width=256 (#86) and roughly flat results on Sapphire Rapids; see docs/src/whats-different/avx512-baseline.md for the characterization.
- Host-floor precheck (#95). bwa-mem3 mem, bwa-mem3 index, and bwa-mem3 shm refuse to run with exit code 2 and an [E::bwamem3] stderr message when the host CPU does not meet the build’s compile-time SIMD floor, instead of SIGILL-ing deep in alignment. bwa-mem3 version, --help, and -h are exempt and always succeed.
- bwa-mem3 version now prints SIMD floor: (build’s required minimum) and SIMD runtime: (resolved tier) lines on stdout, plus a [W::bwa-mem3] warning on stderr (exit 0) if the host is below the floor. See docs/src/getting-started/host-requirements.md.
- bwa-mem3 shm performs a statvfs("/dev/shm") capacity preflight (#86). When /dev/shm is too small for the index, the stage aborts with an [E::bwa_shm_stage] message naming /dev/shm, the required size, and a mount -o remount,size=... hint, replacing the prior [fread] Bad address failure mode. statvfs failures (no /dev/shm, restricted sandbox) are non-fatal and the stage proceeds.
- bwa-mem3 shm / bwactl registry RMW is now serialized via a POSIX named semaphore (#82, closes #66). Concurrent shm stage / shm drop invocations across processes no longer race when updating the registry; the prior best-effort flock was per-open and did not cover the read-modify-write window.
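The /dev/shm capacity preflight described in the list above follows a common pattern: ask the filesystem how much room it has, fail early with an actionable message if that is not enough, and treat a failed query as "unknown" rather than as an error. The sketch below shows that pattern with Python’s `os.statvfs`. It is an illustration, not the bwa-mem3 implementation: comparing against space available to unprivileged users (`f_bavail * f_frsize`) is an assumption, and `shm_preflight` and its message wording are hypothetical.

```python
import os


def shm_preflight(index_bytes, shm_dir="/dev/shm"):
    """Return (ok, message); ok is False only when shm_dir is known to be too small."""
    try:
        stats = os.statvfs(shm_dir)
    except OSError:
        # No /dev/shm or a restricted sandbox: non-fatal, let the stage proceed.
        return True, None
    available = stats.f_bavail * stats.f_frsize
    if available < index_bytes:
        message = (
            "%s has %d bytes available but the index needs %d; "
            "consider a 'mount -o remount,size=...' of %s"
            % (shm_dir, available, index_bytes, shm_dir)
        )
        return False, message
    return True, None
```

Failing before any data is copied is what turns an opaque I/O error partway through staging into a message that names the directory and the size required.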
Methylation
- mem --meth emits Bismark-compatible auxiliary tags XR:Z (read conversion CT/GA), XG:Z (genome strand CT/GA), and XM:Z (per-base methylation call string) (#90). These replace the prior bwameth-style YS:Z / YC:Z / YD:Z on output (still used internally for SEQ restoration). The reference-annotation XR:Z from -V is suppressed under --meth to avoid colliding with the Bismark semantics. Downstream tools that previously read YS:Z / YC:Z / YD:Z must be pointed at the corresponding XR:Z / XG:Z and the per-base XM:Z. See docs/src/methylation/tags.md.
Correctness
- Fixed SIGSEGV in mem_matesw on shm-backed ref_string (#85). ksw_align2 mutates its reference slice in place; when the slice pointed into a read-only shm segment, this faulted. Now copies the slice before passing it in.
- FMI_search sampled-SA prefetch: parenthesized SA_COMPX_MASK precedence so the masked offset is computed against the correct operand (#73). The unparenthesized form was silently producing wrong-but-harmless prefetch addresses; no alignment output was affected.
- bntseq .alt parser bounds the line buffer to prevent a stack-overflow on malicious or malformed .alt files (#74).
- display_stats clamps the per-thread bucket count to LIM_C so --profile with -t greater than the compiled-in limit no longer writes past the end of the stats array (#81).
Performance
- x86 wall-time improvements on the bench (vs the 0.1.0-pre baseline): AVX2 (c6a) −17 to −22%, AVX-512 AMD Zen4 (c7a) −16 to −24%, AVX-512 Intel SPR (c7i) −28 to −30% across wgs / wes / panel-twist 5M-read samples. Concordance vs upstream bwa-mem2 v2.2.1 remains 100.0000% on all non-methylation cells. arm64 (c7g / c8g) is flat (within ±2%). The wins are attributable primarily to (a) capping AVX-512BW auto-vectorization at 256-bit on the avx512bw target (#86) and (b) inlining FMI_search::backwardExt to recover a gcc 12+ wall-clock regression (#88). See docs/src/performance/overview.md for the reference numbers across architectures.
- Smaller contributions in the release window: per-strip L1 prefetches across all kswv u8/u16 kernels (#70); SMEM_LOCKSTEP_N bumped from 8 to 16 (#75); closed-form ungapped HIT path when total_mis == 0 (#77); ksort switched to an on-stack buffer for small n to drop a per-call malloc (#78); libsais_build skips a wasted zero-init pass on its unpack and SA buffers, trimming index-build time (#80).
Release 0.1.0-pre (2026-04-28)
- Project renamed from bwa-mem2 to bwa-mem3. The new project tracks Fulcrum Genomics’ performance and feature work on top of the upstream bwa-mem2 codebase.
- Default branch renamed from fg-main to main.
- Binary renamed from bwa-mem2 to bwa-mem3. Arch-suffixed variants (bwa-mem3.sse41, .sse42, .avx, .avx2, .avx512bw, .arm64, .pgo, .profile, .lto) renamed to match.
- @PG SAM header tags now read ID:bwa-mem3 PN:bwa-mem3 (and bwa-mem3-meth for --meth mode).
- Test binaries renamed: bwa_mem2_tests_unit → bwa_mem3_tests_unit, bwa_mem2_tests_integration → bwa_mem3_tests_integration.
- .bwt.2bit.64 index file format unchanged; bwa-mem3 reads indexes built by bwa-mem2 index without re-indexing.
Release 2.2.1 (17 March 2021)
Hotfix for v2.2: Fixed the bug mentioned in #135.
Release 2.2 (8 March 2021)
Changes since the last release (2.1):
- Passed the validation test on ~88 billion reads (Credits: Keiran Raine, CASM division, Sanger Institute)
- Fixed bugs reported in #109 causing mismatch between bwa-mem and bwa-mem2
- Fixed the issue (#112) causing crash due to corrupted thread id
- Using all the SSE flags to create optimized SSE41 and SSE42 binaries
Release 2.1 (16 October 2020)
Release 2.1 of BWA-MEM2.
Changes since the last release (2.0):
- Smaller index: the index size on disk is down by 8 times and in memory by 4 times due to moving to only one type of FM-index (2bit.64 instead of 2bit.64 and 8bit.32) and 8x compression of suffix array. For example, for human genome, index size on disk is down to ~10GB from ~80GB and memory footprint is down to ~10GB from ~40GB. There is a substantial decrease in index IO time due to the reduction and hardly any performance impact on read mapping.
- Added support for 2 more execution modes: sse4.2 and avx.
- Fixed multiple bugs including those reported in Issues #71, #80 and #85.
- Merged multiple pull requests.
Release 2.0 (9 July 2020)
This is the first production release of BWA-MEM2.
Changes since the last release:
- Made the source code more secure with more than 300 changes all across it.
- Added support for memory re-allocations in case the pre-allocated fixed memory is insufficient.
- Added support for the MC tag in the SAM file and for the -5 and -q command-line flags.
- The output is now identical to the output of bwa-mem-0.7.17.
- Merged index building code with FMI_Search class.
- Added support for different ways to input read files; input handling is now the same as in bwa-mem.
- Fixed a bug in the AVX512 SAM processing part, which was leading to incorrect output.
Release 2.0pre2 (4 February 2020)
Miscellaneous changes:
- Changed the license from GPL to MIT.
- IMPORTANT: the index structure has changed since commit 6743183. Please rebuild the index if you are using a later commit or the new release.
- Added charts in README.md comparing the performance of bwa-mem2 with bwa-mem.
Major code changes:
- Fixed handling of variable-length reads.
- Fixed a bug involving reads of length greater than 250bp.
- Added support for allocation of more memory in small chunks if large pre-allocated fixed memory is insufficient. This is needed very rarely (thus, having no impact on performance) but prevents asserts from failing (code from crashing) in that scenario.
- Fixed a memory leak due to not releasing the memory allocated for seeds after smem.
- Fixed a segfault due to non-alignment of small allocated memory in the optimized banded Smith-Waterman.
- Enabled working with genomes larger than 7-8 billion nucleotides (e.g. Wheat genome).
- Fixed a segfault occurring (with gcc compiler) while reading the index.
See also: Citation · License · What’s Different — Overview · Developer Guide — Release process