Home
bwa-mem3
A faster, more correct, drop-in replacement for bwa mem and bwa-mem2.
If you align short reads with bwa or bwa-mem2 today, bwa-mem3 will give you the same answers — only quicker, with fewer rough edges, and with first-class support for things you used to need a wrapper script for.
Why bwa-mem3
- Drop in, go faster. Same algorithm, same outputs, same flags as bwa-mem2 — but consolidated mapping speedups, a memory-bounded index builder, batched header ingestion, and a tuned allocator add up to measurable wall-clock wins on real workloads.
- Methylation in one binary. A --meth flag turns bwa-mem3 into a drop-in replacement for the entire bwameth.py pipeline. No Python, no inline conversion script, no separate post-processing step. One bwa-mem3 index --meth ref.fa, one bwa-mem3 mem --meth ref.fa R1.fq R2.fq, and you are done: header collapsed, tags emitted, chimeras flagged.
- Stage the index once, align many. A bwa-mem3 shm subcommand pins the FM-index in shared memory so back-to-back runs on the same host skip the 28 GB read every time.
- Correctness fixes upstream hasn’t merged yet. Tabs in -R, 151+ bp reads, AVX-512 mate-rescue, kswv score2 plateau across NEON/AVX2/AVX-512BW, mem_sam_pe proper-pair flag: every fix is tracked back to the upstream PR or issue that found it.
- Architecture-aware out of the box. SSE4.1, SSE4.2, AVX, AVX2, AVX-512BW, and ARM64/NEON. A multi-binary launcher picks the right one for your CPU.
Get started in 30 seconds
git clone --recursive https://github.com/fg-labs/bwa-mem3
cd bwa-mem3 && make
./bwa-mem3 index ref.fa
./bwa-mem3 mem -t 16 ref.fa R1.fq.gz R2.fq.gz \
| samtools sort -@ 8 -o out.bam
Tip — Emit BAM directly
For production pipelines, add --bam=0 to skip the SAM text round-trip entirely. See Best Practices: Output format.
Where to start
- Installation — Build from source (Bioconda is on the way).
- Quick start: align paired-end FASTQs — Two commands to your first alignment.
- Quick start: methylation — The one-binary bwameth.py replacement, in two commands.
- Best Practices — The five things that actually move the needle for production runs.
- What’s different from bwa-mem2 — Every fix and feature, with upstream cross-references.
What’s in this book
- Getting Started — Install and run your first alignment.
- User Guide — Indexing, alignment, output, threading, allocator notes.
- Performance — Where the speed comes from and how to get more.
- Best Practices — Build, run, and deploy recommendations.
- CLI Reference — Every flag, auto-captured from --help.
- Methylation Reference — --meth mode in full.
- What’s Different from bwa-mem2 — The full changelog, by category.
- Developer Guide — Build matrix, SIMD dispatch, regression tests, contributing.
- Related Projects — bwa-mem3-bench, bwa-mem3-rs, fgumi, bwa-mem2 upstream.
- Reference — Glossary, citation, license, changelog.
bwa-mem3 is a derivative of bwa-mem2 maintained by Fulcrum Genomics. MIT licensed. See License and Citation.
Installation
Bioconda (coming soon)
A Bioconda package for bwa-mem3 is in preparation. Once published, installation will be:
conda install -c bioconda bwa-mem3
This will be the recommended path for most users. Check back here or watch the fg-labs/bwa-mem3 repository for the announcement.
Build from source
Until the Bioconda package is available, build from source using the steps below.
Prerequisites
- A C++17 compiler (GCC 8+ or Clang 7+)
- GNU make
- Rust toolchain (for cargo install of mdbook tools, not required for the aligner itself)
- Git (for submodule checkout)
Clone and build
git clone --recursive https://github.com/fg-labs/bwa-mem3
cd bwa-mem3
make
The --recursive flag is required. bwa-mem3 vendors several libraries (mimalloc, sse2neon, and
others) as git submodules. A shallow or non-recursive clone will fail to compile.
Warning — Shallow clone submodule pitfall
If you cloned without --recursive, initialize the submodules before running make:
git submodule update --init --recursive
Forgetting this step is the most common source of build failures.
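One way to confirm that the submodules are initialized before running make is to inspect git submodule status, which prefixes every uninitialized entry with a minus sign. The sketch below is only an illustration; the helper name count_uninit_submodules is hypothetical and not part of bwa-mem3.

```shell
# Count uninitialized submodules from `git submodule status` output on stdin.
# git marks an uninitialized submodule with a leading "-".
count_uninit_submodules() {
    # grep -c exits non-zero when the count is 0, so mask that status.
    grep -c '^-' || true
}

# Usage (run inside the bwa-mem3 checkout):
#   git submodule status --recursive | count_uninit_submodules
# A result of 0 means every submodule is checked out.
```

If the count is non-zero, run git submodule update --init --recursive as described above.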
Target architecture
By default, make builds a general-purpose binary that runs on any supported CPU. For maximum
performance, specify the architecture that matches your deployment target:
| Build command | Requires | Notes |
|---|---|---|
| make | SSE4.1 or better (x86), any (ARM) | Default; selects best dispatch at runtime on x86 |
| make arch=avx2 | AVX2 (e.g. Haswell, Zen 2) | Recommended for modern x86 servers |
| make arch=avx512bw | AVX-512BW (e.g. Skylake-X, Ice Lake, Sapphire Rapids) | Maximum x86 performance |
| make arch=arm64 | Apple Silicon / AWS Graviton | NEON-vectorized build |
See Performance — SIMD dispatch matrix for the full matrix of which kernels are vectorized under each target.
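If you script the build, the arch value can be derived from the deployment host. The following is a minimal sketch under stated assumptions: it expects Linux-style CPU flag names as they appear in /proc/cpuinfo, and the helper name pick_arch is hypothetical. It only maps onto the arch values listed in the table above.

```shell
# Suggest a value for `make arch=...` from a machine name and a CPU flag list.
# Assumes Linux-style flag names as found in /proc/cpuinfo (avx2, avx512bw).
pick_arch() {
    machine=$1
    flags=" $2 "
    case "$machine" in
        aarch64|arm64) echo arm64; return ;;
    esac
    case "$flags" in
        *" avx512bw "*) echo avx512bw ;;
        *" avx2 "*)     echo avx2 ;;
        *)              echo default ;;   # plain `make`, runtime dispatch
    esac
}

# Usage on a Linux host:
#   pick_arch "$(uname -m)" "$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)"
```

When the function prints default, run plain make; otherwise pass the printed value as make arch=VALUE.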
Memory allocator (mimalloc)
bwa-mem3 bundles mimalloc and links it into every binary by default. mimalloc reduces allocator contention under high thread counts and lowers wall-clock time on multi-threaded alignment runs.
To build without mimalloc, pass USE_MIMALLOC=0:
make USE_MIMALLOC=0
See User Guide — Memory allocator for details on how mimalloc is linked on Linux versus macOS and when opting out is appropriate.
Smoke test
After building, run the smoke test to confirm the binary works and report which allocator is active:
./bwa-mem3 version
Expected output (with mimalloc):
bwa-mem3-0.1.0-pre-12-gabcdef1
mimalloc 3.x.x
If the mimalloc line is absent, the build linked the system allocator (expected when
USE_MIMALLOC=0 was passed or when the vendor submodule was not initialized).
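In a build script, the same check can be automated by looking for the mimalloc line. This is a sketch that assumes the version output goes to stdout and that the line starts with the word mimalloc, as in the expected output above; has_mimalloc is a hypothetical helper name.

```shell
# Succeeds when `bwa-mem3 version` output (read from stdin) has a mimalloc line.
# Assumes the line starts with the word "mimalloc", as in the sample output.
has_mimalloc() {
    grep -q '^mimalloc '
}

# Usage:
#   if ./bwa-mem3 version | has_mimalloc; then
#       echo "mimalloc active"
#   else
#       echo "system allocator"
#   fi
```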
See also: Quick start: align paired-end FASTQs · User Guide — Memory allocator · Developer Guide — Building from source
Quick start: align paired-end FASTQs
This page walks through the two-command workflow: index the reference once, then align reads.
Index the reference
bwa-mem3 index ref.fa
This produces five index files alongside ref.fa:
| File | Description |
|---|---|
| ref.fa.bwt.2bit.64 | FM-index in 2-bit packed format |
| ref.fa.0123 | Reference sequence stored as one base code (0–3) per byte |
| ref.fa.amb | Ambiguous base positions |
| ref.fa.ann | Sequence name and length annotations |
| ref.fa.pac | 2-bit packed reference sequence (four bases per byte) |
Indexing hg38 takes roughly 60–90 minutes on a single core (see Indexing the reference:
Time and memory) and requires approximately 60 GB of peak disk space
during creation (including temporary/intermediate files); the final FM-index stored on disk
is roughly 28 GB. The index is read once per mem invocation; for workloads that align many
samples, load it into shared memory first (see Quick start: shared-memory index).
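Before launching a long alignment job it can help to confirm that all five index files are present. A minimal sketch, using the suffixes from the table above; the helper name missing_index_files is hypothetical.

```shell
# Print each expected index file that is missing for the given FASTA prefix.
# The suffix list is taken from the index file table.
missing_index_files() {
    prefix=$1
    for suffix in bwt.2bit.64 0123 amb ann pac; do
        [ -f "$prefix.$suffix" ] || echo "$prefix.$suffix"
    done
}

# Usage:
#   missing_index_files ref.fa
# No output means the index looks complete.
```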
Align paired-end reads
bwa-mem3 mem -t 16 ref.fa r1.fq.gz r2.fq.gz > out.sam
-t 16 sets the thread count to 16. bwa-mem3 scales well up to the number of physical CPU
cores; hyperthreading provides diminishing returns above that point. See
User Guide — Threading and resource use for recommendations at
different core counts.
The default output is uncompressed SAM on stdout. To write BAM directly (uncompressed
by default, which suits piping), use the --bam flag:
bwa-mem3 mem --bam -t 16 ref.fa r1.fq.gz r2.fq.gz \
| samtools sort -@ 8 -o out.bam -
samtools index out.bam
Tip — Prefer BAM output in production
Piping BAM (--bam) to samtools sort avoids the text formatting and parsing overhead of SAM on both sides of the pipe. For large cohorts this yields a measurable wall-clock reduction. See Best Practices — Output format for the recommended pipeline and a discussion of when SAM is still useful.
Read group tagging
For downstream tools that require a @RG header (most variant callers), pass -R:
bwa-mem3 mem -t 16 \
-R '@RG\tID:sample1\tSM:sample1\tPL:ILLUMINA\tLB:lib1' \
ref.fa r1.fq.gz r2.fq.gz > out.sam
The value is a tab-delimited string following BWA conventions. Every aligned record receives an
RG:Z: tag matching the ID field of the read-group header.
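When read-group values come from a sample sheet, building the string in one place avoids quoting mistakes. The helper below is a hypothetical sketch; it emits the two-character sequence backslash-t between fields, which is the form used in the -R example above.

```shell
# Build an @RG string with literal "\t" separators (backslash + t).
# Arguments: ID, SM, LB, PL.
make_rg() {
    # In a printf format, "\\" produces one backslash, so "\\t" prints \t.
    printf '@RG\\tID:%s\\tSM:%s\\tLB:%s\\tPL:%s\n' "$1" "$2" "$3" "$4"
}

# Usage:
#   bwa-mem3 mem -t 16 -R "$(make_rg sample1 sample1 lib1 ILLUMINA)" \
#     ref.fa r1.fq.gz r2.fq.gz > out.sam
```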
Output tags
bwa-mem3 emits standard SAM tags plus the HN:i: tag introduced by the fork:
| Tag | Type | Description |
|---|---|---|
| NM:i | int | Edit distance to the reference |
| MD:Z | string | Mismatch and deletion string |
| AS:i | int | Alignment score |
| XS:i | int | Suboptimal alignment score |
| SA:Z | string | Supplementary alignment chain |
| HN:i | int | Total number of alignment hits (reported and suppressed) found for the read, before the -h supplementary cap is applied |
For the methylation-specific tags (YS:Z, YC:Z, YD:Z), see
Methylation Reference — SAM tags.
See also: Quick start: methylation alignment · Quick start: shared-memory index · User Guide — Aligning short reads · CLI Reference — mem
Quick start: methylation alignment
bwa-mem3 supports bisulfite-converted (WGBS/RRBS/EM-seq) read alignment through a single --meth
flag on both index and mem. No Python interpreter, no piped preprocessor, and no separate
postprocessing step are required.
Note — Drop-in replacement for bwameth.py
bwa-mem3 with --meth is a single-binary drop-in replacement for the bwameth.py pipeline. The output BAM is byte-compatible for the standard tags used by methylation callers (Bismark, MethylDackel, PileOMeth, etc.).
Index the reference for methylation
Build the c2t doubled reference once:
bwa-mem3 index --meth ref.fa
This writes the c2t reference and its FM-index files next to the standard index:
| File | Description |
|---|---|
| ref.fa.bwameth.c2t | C→T converted reference (forward strand) with G→A reverse complement interleaved |
| ref.fa.bwameth.c2t.* | FM-index files for the c2t reference |
The c2t index is separate from the standard index produced by bwa-mem3 index ref.fa. You need
both if you intend to run standard and methylation alignments against the same reference.
Align bisulfite-converted reads
bwa-mem3 mem --meth -t 16 ref.fa R1.fq.gz R2.fq.gz \
| samtools sort -o out.bam
samtools index out.bam
Pass the original (unconverted) reference path, not the .bwameth.c2t file. bwa-mem3
auto-appends .bwameth.c2t to the reference path when --meth is active.
What --meth does
--meth activates a pipeline of in-process transformations that would otherwise require
external tools:
- Inline c2t read conversion. R1 reads have every C converted to T before alignment; R2 reads have every G converted to A. The original unconverted sequence is preserved in the YS:Z: SAM tag. The conversion direction for each read is recorded in YC:Z: (value CT or GA), matching the bwameth.py convention.
- bwameth.py-equivalent scoring defaults. --meth sets -B 2 -L 10 -U 100 -T 40 -CM automatically. These match the defaults used by bwameth.py and are optimized for bisulfite-converted reads where C→T mismatches carry no penalty. Any of these values can be overridden on the command line.
- Inline BAM post-processing. After alignment, bwa-mem3 rewrites the SAM stream in-process:
  - @SQ headers with f/r prefixes (e.g. fchr1, rchr1) are collapsed back to one entry per real chromosome (chr1). Read-level RNAME fields are rewritten to match.
  - Each mapped record gains a YD:Z: tag (f for forward-strand, r for reverse-strand) indicating which converted strand the read aligned to.
  - Chimera QC: reads whose longest M/=/X run is less than 44% of the read length are flagged 0x200 (QC-fail), have flag 0x2 (proper pair) cleared, and have MAPQ capped at 1.
  - Pair-level QC-fail propagation: if one mate is QC-failed, the other mate is also flagged.
  - A @PG ID:bwa-mem3-meth program record is appended to the header.
- Uncompressed BAM output. The post-processed stream is written as uncompressed BAM (wb0) rather than SAM text. This eliminates text serialization overhead and allows downstream samtools sort to read BAM natively. The stream is still fully readable by any htslib-based tool.
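The read conversion in the first step can be illustrated with tr. This is only a sketch of the rule as described above, not the in-process implementation; it handles upper-case bases only, and the helper name c2t_read is hypothetical.

```shell
# In-silico conversion of one read sequence, mirroring the rule described above:
# read 1 gets C->T, read 2 gets G->A. Only upper-case bases are handled here.
c2t_read() {
    mate=$1
    bases=$2
    if [ "$mate" = 1 ]; then
        printf '%s\n' "$bases" | tr 'C' 'T'
    else
        printf '%s\n' "$bases" | tr 'G' 'A'
    fi
}

# Usage:
#   c2t_read 1 ACGTC
#   c2t_read 2 ACGTG
```

In the real pipeline the unconverted sequence is kept in YS:Z: and the direction in YC:Z:, so nothing is lost by the conversion.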
For full details on each tag, the chimera QC heuristic, and the --set-as-failed and
--do-not-penalize-chimeras flags, see the Methylation Reference.
See also: Methylation Reference — Overview · Methylation Reference — SAM tags · Best Practices — Methylation defaults · CLI Reference — mem
Quick start: shared-memory index
The bwa-mem3 FM-index for a genome like hg38 is approximately 28 GB. By default, every
bwa-mem3 mem invocation reads the index from disk, which can take 30–60 seconds on a spinning
disk and several seconds even on fast NVMe storage. For workloads that align many small samples
in sequence on the same machine, this per-invocation overhead accumulates.
bwa-mem3 shm stages the index once into POSIX shared memory. Subsequent mem invocations
attach to the in-memory segment instead of reading from disk, reducing per-sample startup time
to near zero.
Stage the index
bwa-mem3 shm ref.fa
This reads the index files from disk and copies them into a POSIX shared-memory segment. The command returns when staging is complete. The index stays in memory until it is explicitly dropped or the system is rebooted.
To stage a methylation (--meth) index:
bwa-mem3 shm --meth ref.fa
A standard and a methylation index for the same reference can be staged simultaneously; they occupy separate named segments.
Align using the staged index
No extra flag is needed. When bwa-mem3 mem starts, it checks whether a matching shared-memory
segment exists. If one does, it attaches automatically:
bwa-mem3 mem -t 16 ref.fa r1.fq.gz r2.fq.gz > out.sam
Inspect and drop staged segments
List all currently staged indices:
bwa-mem3 shm -l
Drop all staged segments:
bwa-mem3 shm -d
When to use shared-memory indexing
Shared-memory indexing is most beneficial when:
- Aligning tens to hundreds of small samples (e.g. amplicon panels, targeted sequencing) where per-sample index load time dominates the per-sample alignment time.
- Running a batch pipeline on a single large machine where the index fits comfortably in RAM (approximately 28 GB for hg38 with the standard index).
- The same reference is used for all samples in the batch; a new shm invocation is required for each distinct reference.
It provides little benefit when:
- Aligning a small number of large samples (WGS), where alignment time far exceeds index load time.
- The available RAM is insufficient to hold the index alongside the operating system and alignment worker processes.
Warning — No staleness check — always drop before re-indexing
bwa-mem3 shm does not detect whether the on-disk index files have changed after staging. If you run bwa-mem3 index ref.fa again (e.g. to rebuild after a reference update), the shared-memory segment is not invalidated. Subsequent mem invocations will attach to the stale segment and produce silently incorrect alignments.
Always drop the segment before re-indexing:
bwa-mem3 shm -d
bwa-mem3 index ref.fa
bwa-mem3 shm ref.fa
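Wrapping the drop, re-index, and re-stage sequence in one function makes it harder to skip the drop. The sketch below uses only the three commands shown in this warning; the function name and the BWA_MEM3 override variable are hypothetical, and it assumes shm -d exits successfully when nothing is staged.

```shell
# Drop any staged segment, rebuild the index, then stage it again.
# BWA_MEM3 is a hypothetical override so the binary path can be changed.
reindex_and_restage() {
    ref=$1
    bin=${BWA_MEM3:-bwa-mem3}
    # Stop at the first failing step so a stale segment is never re-used.
    "$bin" shm -d &&
    "$bin" index "$ref" &&
    "$bin" shm "$ref"
}

# Usage:
#   reindex_and_restage ref.fa
```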
See also: CLI Reference — shm · Best Practices — Multi-sample workflows · Best Practices — Anti-patterns · User Guide — Threading and resource use
Indexing the reference
Before aligning reads, bwa-mem3 builds an FM-index from the reference FASTA.
The index is built once, stored on disk, and read back at the start of every
mem run.
Basic indexing
bwa-mem3 index ref.fa
The command writes five files alongside the input FASTA:
| File | Contents |
|---|---|
| ref.fa.bwt.2bit.64 | Burrows-Wheeler Transform, 2-bit packed, 64-bit offsets |
| ref.fa.0123 | Reference sequence, one base code (0–3) per byte |
| ref.fa.amb | Coordinates and counts of ambiguous (N) bases |
| ref.fa.ann | Sequence names and lengths |
| ref.fa.pac | Forward sequence, 2-bit packed |
The .bwt.2bit.64 file dominates disk usage. For the human reference (hg38),
expect roughly 28 GB total across all five files.
Methylation index (--meth)
bwa-mem3 index --meth ref.fa
Methylation mode builds a C-to-T doubled reference in addition to the standard
FM-index files. The command writes a ref.fa.bwameth.c2t file (the doubled
FASTA) and its own set of five index files with the .bwameth.c2t suffix:
ref.fa.bwameth.c2t
ref.fa.bwameth.c2t.bwt.2bit.64
ref.fa.bwameth.c2t.0123
ref.fa.bwameth.c2t.amb
ref.fa.bwameth.c2t.ann
ref.fa.bwameth.c2t.pac
The doubled reference is roughly twice the size of the standard one. For hg38, allow approximately 56 GB of disk space.
Tip — Pass the original FASTA to mem, not the c2t file
When running bwa-mem3 mem --meth, pass the original FASTA path (ref.fa), not ref.fa.bwameth.c2t. bwa-mem3 appends .bwameth.c2t automatically. The auto-append is skipped only when the path already ends in .bwameth.c2t, which is useful for external-c2t interop pipelines.
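The auto-append rule in this tip can be written down in a few lines, which is handy when a wrapper script needs to know which index prefix a --meth run will open. This is a sketch of the rule as described here, not the actual implementation; meth_index_prefix is a hypothetical name.

```shell
# Return the index prefix that --meth would use for a given reference path:
# append .bwameth.c2t unless the path already ends with it.
meth_index_prefix() {
    case "$1" in
        *.bwameth.c2t) printf '%s\n' "$1" ;;
        *)             printf '%s\n' "$1.bwameth.c2t" ;;
    esac
}

# Usage:
#   meth_index_prefix ref.fa
#   meth_index_prefix ref.fa.bwameth.c2t
```

Both calls print the same prefix, so passing either path leads to the same index.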
Output file locations
Index files are written to the same directory as the input FASTA by default. The input path is taken verbatim as a prefix — you can pass an absolute path to write into a different directory:
bwa-mem3 index /data/indexes/hg38/hg38.fa
# writes hg38.fa.bwt.2bit.64, etc. into /data/indexes/hg38/
Time and memory
Indexing hg38 takes roughly 60–90 minutes on a single core and requires about 80 GB of RAM during construction. The process is single-threaded; additional cores do not reduce wall time.
bwa-mem3 uses libsais to construct the suffix array, which is faster than the original bwa-mem2 approach. See Performance improvements for benchmark numbers.
Warning — Do not index over a live shared-memory segment
If you have previously staged the index into shared memory with bwa-mem3 shm, drop the segment before re-indexing:
bwa-mem3 shm -d
bwa-mem3 index ref.fa
There is no staleness check. If bwa-mem3 mem finds a matching segment in shared memory it will attach to it even when the on-disk index has been updated. See Quick start: shared-memory index.
Arch flags and the index format
The FM-index format is architecture-independent. A single index can be used
with any bwa-mem3 binary — bwa-mem3.avx2, bwa-mem3.avx512bw, and the ARM
single-binary all read the same on-disk layout.
See also: Quick start: align paired-end FASTQs · Quick start: methylation alignment · Quick start: shared-memory index · Performance improvements · CLI Reference: index
Aligning short reads (mem)
bwa-mem3 mem aligns one or two FASTQ files against an indexed reference and
writes SAM (default) or BAM (--bam) to stdout. It is a drop-in replacement
for bwa-mem2 mem and supports all standard bwa-mem flags.
Basic usage
Paired-end:
bwa-mem3 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > out.sam
Single-end:
bwa-mem3 mem -t 16 ref.fa reads.fq.gz > out.sam
Pipe directly to samtools:
bwa-mem3 mem --bam=0 -t 16 ref.fa R1.fq.gz R2.fq.gz \
| samtools sort -@ 8 -o out.bam -
samtools index out.bam
Using --bam=0 (uncompressed BAM) avoids SAM text formatting on the write side
and SAM parsing on the samtools side. It also skips compression that
samtools sort would immediately undo, so uncompressed BAM records flow
straight through the pipe.
Key flags
Threading: -t
-t INT number of threads [1]
Performance scales well through 8–16 threads on most machines. Beyond 32 threads, returns diminish on typical workloads because inter-thread locking and IO become the bottleneck. See Threading and resource use for detailed guidance.
Read-group header: -R
-R STR read group header line, e.g. '@RG\tID:sample1\tSM:sample1\tLB:lib1\tPL:ILLUMINA'
Every production alignment should include a @RG header. The ID in the -R
string is embedded as an RG:Z: tag on every output record.
Tip — Escape the tab correctly
Pass -R with \t between fields and quote the whole string. Single quotes pass the two-character sequence \t through unchanged; $'...' quoting (bash, zsh) instead expands each \t to a real tab character before bwa-mem3 sees it:
bwa-mem3 mem -R $'@RG\tID:s1\tSM:sample1' -t 16 ref.fa R1.fq.gz R2.fq.gz
Chunk size: -K
-K INT process INT input bases in each batch [10000000]
Larger -K values increase memory use but can improve throughput on very deep
or very wide batches. The default is appropriate for most workloads.
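Because -K is expressed in input bases, the number of reads per batch depends on read length. A rough back-of-the-envelope sketch, assuming every read has the same length; reads_per_batch is a hypothetical helper.

```shell
# Approximate number of reads in one -K batch for a fixed read length.
# Uses integer division, so the result is rounded down.
reads_per_batch() {
    batch_bases=$1
    read_len=$2
    echo $(( batch_bases / read_len ))
}

# Usage: default -K with 150 bp reads
#   reads_per_batch 10000000 150
```

With the default of 10000000 bases and 150 bp reads, a batch holds on the order of 66 thousand reads.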
Pairing and mate-rescue control: -S, -P
-S skip mate rescue
-P skip pairing; mate rescue performed unless -S also in use
These flags are primarily useful for debugging or non-standard workflows. Normal paired-end alignments should leave both at their defaults.
Output modes
SAM (default)
bwa-mem3 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > out.sam
Plain-text SAM. Suitable for inspection, compatibility testing, and piping to tools that consume SAM.
BAM (--bam=0)
bwa-mem3 mem --bam=0 -t 16 ref.fa R1.fq.gz R2.fq.gz > out.bam
Writes BAM directly. --bam=0 is uncompressed BAM, which avoids
double-compression when piping into a downstream sorter and is roughly
10–15% faster end-to-end. Pass --bam=6 to write a fully compressed BAM if
the output is the final product.
Note — --bam=0 is the recommended output mode
For production pipelines, always use --bam=0 and pipe to samtools sort. See Best Practices: output format for the canonical pipeline.
Methylation alignment (--meth)
Pass --meth for bisulfite/RRBS samples. This activates inline C-to-T read
conversion, bwameth.py-compatible flag defaults, and inline BAM post-processing.
See Quick start: methylation alignment for
the two-command workflow and the Methylation Reference
for full detail.
Shared-memory index auto-attach
When bwa-mem3 shm has staged the index into shared memory, bwa-mem3 mem
attaches automatically — no extra flag is required. The shared-memory path
is transparent to users.
Cross-references
The full flag list is in the CLI Reference: mem page.
See also: Output: SAM/BAM, headers, tags · Threading and resource use · Best Practices: output format · CLI Reference: mem · Methylation Reference: overview
Output: SAM/BAM, headers, tags
bwa-mem3 writes output in either SAM (default) or BAM (--bam) format.
This page covers the header structure and every non-standard SAM tag emitted
by bwa-mem3.
Output format
By default, bwa-mem3 mem writes SAM to stdout. Pass --bam (or --bam=N
for a specific compression level) to write BAM. Level 0 (uncompressed) is the
default when --bam is given without an argument, which is optimal when piping
to a downstream samtools sort.
# SAM (default)
bwa-mem3 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > out.sam
# Uncompressed BAM — best for piping
bwa-mem3 mem --bam -t 16 ref.fa R1.fq.gz R2.fq.gz | samtools sort -@ 8 -o out.bam -
# Compressed BAM — useful when the output is the final file
bwa-mem3 mem --bam=6 -t 16 ref.fa R1.fq.gz R2.fq.gz > out.bam
SAM header
@HD
A default @HD VN:1.6 SO:unsorted line is emitted unless the user supplies
one via -H. The sort order is unsorted because bwa-mem3 writes records in
input read order; downstream sorting is always a separate step.
@SQ
One @SQ line is written per reference sequence, with the sequence name
(SN:) and length (LN:) derived from the FM-index. If the index was built
with a .dict or .hdr file that supplies @SQ records, those records are
used instead of the auto-generated ones.
In methylation mode (--meth), the doubled reference contains sequences with
an f or r prefix in their names. The inline BAM post-processor collapses
these back to canonical chromosome names so that the output @SQ lines match
a standard non-methylation alignment. See
Chimera QC and header rewriting.
@PG
One @PG entry is written in standard mode:
| ID | Description |
|---|---|
| bwa-mem3 | The alignment step. VN: is the bwa-mem3 version string; CL: is the full command line. |
In methylation mode (--meth), a second @PG entry is appended:
| ID | Description |
|---|---|
| bwa-mem3-meth | The inline post-processor. VN: carries the version with -meth suffix; CL: is the full command line. |
The bwa-mem3-meth entry follows immediately after the bwa-mem3 entry and
records the post-processing step as a distinct pipeline node, matching the
convention of separate-tool pipelines.
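To check which @PG entries a file carries, the header can be filtered with awk. A minimal sketch; pg_ids is a hypothetical helper that reads SAM header text on stdin.

```shell
# List the ID of every @PG header line in SAM text read from stdin.
pg_ids() {
    awk -F '\t' '$1 == "@PG" {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^ID:/) print substr($i, 4)   # drop the "ID:" prefix
    }'
}

# Usage:
#   samtools view -H out.bam | pg_ids
```

For a --meth run, the description above implies two IDs in the output: bwa-mem3 followed by bwa-mem3-meth.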
Tags emitted by bwa-mem3
Standard tags
bwa-mem3 emits the same standard tags as bwa-mem2 (NM:i, MD:Z, AS:i,
XS:i, SA:Z, RG:Z, XA:Z, etc.). These are documented in the SAM
specification and are not described further here.
HN:i — total alignment hit count
HN:i:<count>
The total number of alignment hits (both reported and suppressed) that
the aligner found for this read, before the -h supplementary cap is applied.
Useful for distinguishing “uniquely mapped” from “multi-mapped” reads without
relying solely on MAPQ.
HN:i is emitted on the primary alignment record only.
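The tag can be pulled out of SAM text with awk for a quick look at multi-mapping. A minimal sketch; hn_per_read is a hypothetical helper, and it assumes optional tags start at column 12 as the SAM format specifies.

```shell
# Print "read-name<TAB>HN" for each SAM record on stdin that carries HN:i.
hn_per_read() {
    awk -F '\t' '!/^@/ {
        for (i = 12; i <= NF; i++)
            if ($i ~ /^HN:i:/) print $1 "\t" substr($i, 6)   # strip "HN:i:"
    }'
}

# Usage:
#   samtools view out.bam | hn_per_read | head
```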
Methylation-only tags
The following tags are emitted only when --meth is active. See
SAM tags: YS, YC, YD for the full per-tag reference.
| Tag | Type | Description |
|---|---|---|
| YS:Z | string | Original (pre-c2t) read sequence |
| YC:Z | string | Conversion direction: CT (R1, C→T) or GA (R2, G→A) |
| YD:Z | string | Methylation strand: f (forward) or r (reverse) |
MAPQ semantics
MAPQ semantics are inherited from bwa-mem2 and follow the same scoring model.
In methylation mode, alignments identified as chimeras (longest M/=/X
run covering less than 44% of the read length) have their MAPQ capped at 1 and
the 0x200 (QC fail) flag set. See
Chimera QC and header rewriting.
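The chimera rule can be sketched as a small calculation on one record. This is an illustration of the rule as stated above, not the in-process code. It assumes a "run" means a single M, = or X CIGAR operation and that the read length is supplied by the caller; both function names are hypothetical. Pair-level propagation is not covered.

```shell
# Length of the longest single M, = or X operation in a CIGAR string.
longest_match_op() {
    printf '%s\n' "$1" | awk '{
        best = 0; s = $0
        while (match(s, /^[0-9]+[MIDNSHP=X]/)) {
            len = substr(s, 1, RLENGTH - 1) + 0
            op  = substr(s, RLENGTH, 1)
            if ((op == "M" || op == "=" || op == "X") && len > best) best = len
            s = substr(s, RLENGTH + 1)
        }
        print best
    }'
}

# Apply the rule to one record. Arguments: CIGAR, read length, FLAG, MAPQ.
# Prints the resulting "FLAG MAPQ".
chimera_qc() {
    longest=$(longest_match_op "$1")
    read_len=$2; flag=$3; mapq=$4
    # Integer form of: longest < 0.44 * read_len
    if [ $(( longest * 100 )) -lt $(( read_len * 44 )) ]; then
        flag=$(( (flag | 512) & ~2 ))      # set 0x200 (QC fail), clear 0x2
        if [ "$mapq" -gt 1 ]; then mapq=1; fi   # cap MAPQ at 1
    fi
    echo "$flag $mapq"
}

# Usage:
#   chimera_qc 30M70S 100 99 60
#   chimera_qc 100M 100 99 60
```

The first call describes a 100 bp read with only 30 matched bases, so the flag gains 0x200, loses 0x2, and MAPQ drops to 1; the second record passes through unchanged.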
See also: Aligning short reads (mem) · Methylation Reference: SAM tags · Methylation Reference: post-processing · CLI Reference: mem · Best Practices: output format
Threading and resource use
The -t flag
-t INT number of threads [1]
bwa-mem3 parallelizes alignment by dividing the input into fixed-size batches
(controlled by -K) and spreading the reads of each batch across threads. Threads
share the in-memory FM-index; there is no per-thread copy.
How threads interact with performance
Where threads help
- Seed finding (SMEM enumeration) is fully parallel across reads in a batch.
- Extension (banded Smith-Waterman) is fully parallel.
- Pair rescue is parallel.
- BAM encoding (when --bam is active) is parallel.
Where threads stop helping
Wall-clock alignment time scales well up to approximately 16–32 threads on a modern CPU. Beyond that, several effects combine to flatten the curve:
- FM-index bandwidth. The index for hg38 is ~28 GB and does not fit in the L3 cache of any current server. At high thread counts, threads contend for memory bandwidth accessing the BWT.
- IO contention. On spinning disk or a shared network filesystem, concurrent reads of the same large index file saturate IO bandwidth before the CPU is saturated.
- Output serialization. SAM output is serialized per-record to stdout. BAM output with --bam reduces this bottleneck but does not eliminate it entirely.
Recommended thread counts
| Machine | Recommended -t | Notes |
|---|---|---|
| 16-core workstation | 12–14 | Leave 2 cores for samtools sort |
| 32-core server | 24–28 | Leave cores for downstream and OS overhead |
| 64-core server | 40–48 | Marginal returns above 48; test with your workload |
| Multiple parallel runs | divide evenly | See below |
These are starting points. Profile with your specific data and storage configuration to find the practical optimum.
Running multiple parallel alignments
When running multiple bwa-mem3 mem processes on the same machine, divide
threads so that the total does not exceed the physical core count. For example,
on a 32-core machine running four concurrent samples:
# Four parallel runs, 8 threads each
for sample in a b c d; do
  bwa-mem3 mem --bam -t 8 ref.fa "${sample}_R1.fq.gz" "${sample}_R2.fq.gz" \
    | samtools sort -@ 2 -o "${sample}.bam" - &
done
wait
Using shared memory (bwa-mem3 shm) amortizes the index read-in cost across
all four runs. See Quick start: shared-memory index
and Best Practices: multi-sample workflows.
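The per-job thread count can be computed rather than hard-coded. The sketch below is a hypothetical helper; unlike the example above, it can also reserve the samtools sort threads inside the core budget, which is a more conservative choice rather than a documented recommendation.

```shell
# Threads to give each concurrent `mem` job so that mem plus sort threads
# across all jobs stay within the physical core count (minimum 1).
# Arguments: physical cores, concurrent jobs, sort threads per job.
mem_threads_per_job() {
    cores=$1; jobs=$2; sort_threads=$3
    t=$(( cores / jobs - sort_threads ))
    if [ "$t" -lt 1 ]; then t=1; fi
    echo "$t"
}

# Usage:
#   mem_threads_per_job 32 4 0   # counts mem threads only
#   mem_threads_per_job 32 4 2   # also reserves 2 sort threads per job
```

Passing 0 for the sort threads reproduces the split used in the loop above; passing 2 leaves room for each samtools sort as well.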
Memory use
Peak RAM during alignment is dominated by the in-memory FM-index. For hg38,
expect roughly 28 GB of resident memory per bwa-mem3 mem process. Additional
memory is used per batch and grows with the -K batch size (input bases per
batch, times a small constant).
With bwa-mem3 shm, the index is mapped from a shared-memory segment, so
multiple concurrent mem processes share the same physical pages. Total RAM
use for the index is approximately one copy, not one per process.
Tip — Use shm for repeated runs on the same machine
If you run more than a few samples on the same machine without rebooting, bwa-mem3 shm pays off immediately. The index is read from disk once and stays in RAM for all subsequent mem invocations.
IO recommendations
- Use local NVMe storage for the index files when possible. The ~28 GB BWT read is the dominant IO event at the start of each mem run.
- Write BAM (--bam) to a fast local disk or pipe directly to samtools sort. Avoid writing uncompressed SAM to a network filesystem.
- Separate read and write paths if your storage topology allows it: read the index from one volume and write sorted BAM to another.
See also: Aligning short reads (mem) · Memory allocator (mimalloc) · Quick start: shared-memory index · Best Practices: multi-sample workflows · Performance: tuning checklist
Memory allocator (mimalloc)
bwa-mem3 vendors and links mimalloc, Microsoft’s high-performance memory allocator, into every binary by default. On multi-threaded alignment workloads, mimalloc reduces wall-clock time by replacing the system allocator with one optimized for many small, short-lived allocations — exactly the access pattern produced by the inner alignment loops.
What mimalloc replaces
The system allocator (glibc malloc on Linux, libSystem malloc on macOS) is
a general-purpose allocator that relies on locks around shared heap
structures. Under heavy multi-threaded allocation pressure (16+ threads each
issuing thousands of short-lived allocations per batch) contention on those
locks becomes a measurable bottleneck. mimalloc uses per-thread free lists
and a segment-based heap to eliminate most of this contention.
Platform-specific linkage
The linkage strategy differs by OS:
| Platform | Mechanism |
|---|---|
| Linux | Static linkage with --whole-archive. The entire mimalloc static library is embedded into the bwa-mem3 binary; its malloc/free symbols take precedence over glibc’s at link time. |
| macOS | Dynamic linkage via dyld interposing. libmimalloc.dylib is built alongside the binary; dyld’s DYLD_INSERT_LIBRARIES interposing mechanism replaces malloc/free at load time. The dylib ships next to the binary. |
Warning — macOS: keep libmimalloc.dylib next to the binary
On macOS, libmimalloc.dylib must remain in the same directory as the bwa-mem3 binary (or be reachable via the embedded rpath). If you move bwa-mem3 without also moving libmimalloc.dylib, the binary will fall back to the system allocator silently. In that case bwa-mem3 version will not print the mimalloc line that indicates the allocator is active.
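A deployment script can check the side-by-side layout before installing the binary somewhere else. A minimal sketch; dylib_beside is a hypothetical helper and it does not cover the rpath case mentioned above.

```shell
# Succeeds when libmimalloc.dylib sits in the same directory as the binary.
# Does not check the embedded-rpath case.
dylib_beside() {
    [ -f "$(dirname "$1")/libmimalloc.dylib" ]
}

# Usage:
#   dylib_beside /opt/bwa-mem3/bwa-mem3 || echo "libmimalloc.dylib is missing"
```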
Verifying that mimalloc is active
Run:
./bwa-mem3 version
When mimalloc is linked and loaded, the output includes a line like:
mimalloc 3.x.x
If that line is absent, mimalloc is not active.
Opting out
Pass USE_MIMALLOC=0 at build time to produce a binary linked against the
system allocator:
make USE_MIMALLOC=0
Reasons to opt out:
- AddressSanitizer (ASAN) builds. The Makefile automatically sets USE_MIMALLOC=0 when ASAN_FLAGS is detected, because ASAN and mimalloc’s malloc interposing cannot coexist cleanly.
- Container environments where distributing a dylib alongside the binary is inconvenient.
- Reproducibility testing to isolate whether a behavioral difference is allocator-related.
Note — Default is on
USE_MIMALLOC=1 is the default. Opt-out is not recommended for production workloads; mimalloc measurably reduces wall time on multi-threaded runs.
Build internals
The mimalloc source lives in ext/mimalloc/ as a git submodule. The Makefile
target builds it via CMake before linking bwa-mem3. The relevant Makefile
variables are MIMALLOC_SRC, MIMALLOC_BUILD, and MIMALLOC_LIB.
The feature was introduced in bwa-mem3 as part of the performance improvement work. See Features and Build & infrastructure for the PR history.
See also: Threading and resource use · Features: mimalloc · Getting Started: installation · Developer Guide: building from source · Performance: tuning checklist
Tips and best practices
This page collects the most commonly useful operational tips for running bwa-mem3. Each tip is a short actionable point; the linked pages provide the full rationale.
Index once, align many times
Build the FM-index once per reference version. The on-disk index format is
stable across bwa-mem3 releases and architecture variants — bwa-mem3.avx2
and bwa-mem3.avx512bw read the same files. You do not need to re-index when
upgrading bwa-mem3 unless the release notes say otherwise.
# Build once
bwa-mem3 index ref.fa
# Align many samples
for sample in a b c d; do
  bwa-mem3 mem --bam -t 16 ref.fa "${sample}_R1.fq.gz" "${sample}_R2.fq.gz" \
    | samtools sort -@ 4 -o "${sample}.bam" -
done
Pipe to samtools sort -@
Never write an intermediate unsorted BAM to disk and then sort it in a second
step. Running bwa-mem3’s --bam mode and samtools sort in a single pipeline
avoids the extra write/read cycle and is significantly faster:
bwa-mem3 mem --bam -t 16 ref.fa R1.fq.gz R2.fq.gz \
| samtools sort -@ 8 -o out.bam -
samtools index out.bam
Allocate roughly 2/3 of available threads to bwa-mem3 mem and 1/3 to
samtools sort. On a 24-core machine, -t 16 for bwa-mem3 and -@ 8 for
samtools is a good starting point.
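The 2/3 to 1/3 rule can be turned into a small helper. This is a minimal sketch: `split_threads` is a hypothetical function, not part of bwa-mem3, and the rounding (integer division, at least one thread per stage) is an assumption.

```bash
# split_threads CORES
# Prints "<aligner threads> <sort threads>", giving about one third of the
# cores to samtools sort and the rest to bwa-mem3 mem.
# Hypothetical helper; the rounding policy is an assumption.
split_threads() {
    local cores="$1"
    local sort_threads=$(( cores / 3 ))
    if [ "$sort_threads" -lt 1 ]; then
        sort_threads=1
    fi
    local align_threads=$(( cores - sort_threads ))
    if [ "$align_threads" -lt 1 ]; then
        align_threads=1
    fi
    echo "$align_threads $sort_threads"
}

# Example use (commented out because it needs bwa-mem3 and samtools):
#   set -- $(split_threads 24)
#   bwa-mem3 mem --bam -t "$1" ref.fa R1.fq.gz R2.fq.gz \
#     | samtools sort -@ "$2" -o out.bam -
```

For 24 cores this reproduces the 16 and 8 split suggested above; treat the result as a starting point and adjust after profiling.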
Stage the index in shared memory for batch workloads
When aligning more than a few samples on the same machine, reading the ~28 GB
hg38 index from disk on every mem invocation is the dominant wall-clock cost.
Stage it once:
bwa-mem3 shm ref.fa
Subsequent bwa-mem3 mem invocations attach automatically. The shared-memory
segment persists until explicitly dropped (bwa-mem3 shm -d) or the machine
reboots.
Warning — Always drop the segment before re-indexing
There is no staleness check. If you rebuild the index without first dropping the shared-memory segment,
bwa-mem3 mem will attach to the stale segment and produce incorrect alignments without any warning. Always run bwa-mem3 shm -d before bwa-mem3 index.
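One way to make the drop-before-reindex order hard to forget is to wrap the three commands in a single function. This is a sketch only: `reindex_safely` and the `BWA_MEM3_BIN` override are hypothetical names introduced here, and the function assumes `bwa-mem3 shm -d` exits with status 0 when it succeeds.

```bash
# reindex_safely REF
# Drops the staged segment, rebuilds the index, then stages the new index.
# Hypothetical wrapper; BWA_MEM3_BIN is an assumed hook that lets you point
# the function at a different binary (or at `echo` for a dry run).
reindex_safely() {
    local ref="$1"
    local bin="${BWA_MEM3_BIN:-bwa-mem3}"
    # Each step runs only if the previous one succeeded, so a failed drop
    # never leads to indexing underneath a live segment.
    "$bin" shm -d &&
        "$bin" index "$ref" &&
        "$bin" shm "$ref"
}

# Dry run that only prints the commands it would issue:
#   BWA_MEM3_BIN=echo reindex_safely ref.fa
```

If `shm -d` turns out to report an error when nothing is staged, the chain will stop early; in that case check the staged segments first and skip the drop.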
Pin threads when running concurrent jobs
When running multiple bwa-mem3 mem processes in parallel, divide threads
explicitly so that the total does not exceed the physical core count. Avoid
relying on the scheduler to balance over-subscribed threads — each process
will spin waiting for CPU time, and total throughput drops.
# Good: 4 jobs × (4 aligner + 2 sort) threads = 24 threads, on a 24-core machine
for sample in a b c d; do
bwa-mem3 mem --bam -t 4 ref.fa ${sample}_R1.fq.gz ${sample}_R2.fq.gz \
| samtools sort -@ 2 -o ${sample}.bam - &
done
wait
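Before launching a batch it can help to tally every thread the jobs will start. The sketch below uses a hypothetical `check_thread_budget` helper and assumes that the samtools sort threads count against the same core budget as the aligner threads.

```bash
# check_thread_budget CORES JOBS ALIGN_THREADS SORT_THREADS
# Reports the total thread count for JOBS concurrent pipelines and returns
# non-zero if it exceeds CORES. Hypothetical helper for planning only.
check_thread_budget() {
    local cores="$1" jobs="$2" align="$3" sort_threads="$4"
    local total=$(( jobs * (align + sort_threads) ))
    if [ "$total" -le "$cores" ]; then
        echo "ok: $total threads on $cores cores"
    else
        echo "oversubscribed: $total threads on $cores cores"
        return 1
    fi
}

# Example: four pipelines, each with -t 4 and -@ 2, on a 24-core machine.
check_thread_budget 24 4 4 2
```

Use the physical core count for `CORES`, not the hyper-thread count, to stay consistent with the guidance above.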
See Threading and resource use for per-machine thread count recommendations.
Use the right binary for your CPU
bwa-mem3 ships separate binaries per SIMD level. Using the highest level supported by your CPU gives the best performance:
| CPU generation | Recommended binary |
|---|---|
| Modern Intel/AMD (2018+) | bwa-mem3.avx512bw or bwa-mem3.avx2 |
| Older x86 | bwa-mem3.sse42 or bwa-mem3.sse41 |
| Apple Silicon / AWS Graviton | bwa-mem3 (single ARM binary) |
When you run bwa-mem3 (the launcher), it detects your CPU and execs the
appropriate variant automatically. If you copy only a single SIMD binary,
call it directly.
See Performance: SIMD dispatch matrix.
Include a read-group header
Always pass -R with at minimum ID: and SM: fields. Many downstream tools
(GATK, fgbio, Picard) require a @RG header and will fail or warn without one.
bwa-mem3 mem \
-R $'@RG\tID:run1\tSM:sample1\tLB:lib1\tPL:ILLUMINA' \
-t 16 ref.fa R1.fq.gz R2.fq.gz
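When read groups are generated per sample in a script, building the string with `printf` keeps the tab separators explicit. `make_read_group` is a hypothetical helper, and the `lib1` and `ILLUMINA` defaults are placeholders taken from the example above.

```bash
# make_read_group ID SAMPLE [LIBRARY] [PLATFORM]
# Prints an @RG line with real tab characters between fields.
# Hypothetical helper; the default LB and PL values are placeholders.
make_read_group() {
    local id="$1" sample="$2"
    local library="${3:-lib1}" platform="${4:-ILLUMINA}"
    printf '@RG\tID:%s\tSM:%s\tLB:%s\tPL:%s' \
        "$id" "$sample" "$library" "$platform"
}

# Example use (commented out because it needs bwa-mem3):
#   bwa-mem3 mem -R "$(make_read_group run1 sample1)" \
#     -t 16 ref.fa R1.fq.gz R2.fq.gz
```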
Further reading
The Best Practices section covers these topics in depth:
- Best Practices: build — PGO builds, arch selection
- Best Practices: output format — the canonical pipeline
- Best Practices: multi-sample workflows — shared-memory batch jobs
- Best Practices: anti-patterns — common mistakes and how to avoid them
See also: Aligning short reads (mem) · Threading and resource use · Memory allocator (mimalloc) · Performance: tuning checklist · Best Practices: anti-patterns
Performance Overview
Performance claims in this section are benchmarked, not asserted. The canonical source of truth for benchmark methodology, hardware configurations, and current numbers is bwa-mem3-bench, a reproducible benchmarking harness that runs across AWS Batch architectures (x86 AVX2, AVX-512, ARM Graviton). Consult that repository before drawing conclusions from isolated anecdotal timings.
What drives bwa-mem3’s performance
bwa-mem3 inherits the SIMD-vectorized alignment kernels of bwa-mem2 and adds several improvements of its own. The headline gains relative to a stock bwa-mem2 build fall into four categories.
Vectorized alignment kernels. The Smith-Waterman and banded-SWA kernels (kswv, bandedSWA) are compiled against the widest SIMD ISA the current CPU supports — SSE4.1 through AVX-512BW on x86, or native NEON on ARM. On Apple Silicon, native NEON intrinsics replaced the sse2neon shim in the two hottest kernels, delivering roughly 10% additional throughput over the pure-translation baseline. See SIMD dispatch matrix for the full picture.
libsais FM-index construction. The indexing step uses the linear-time suffix-array/BWT construction library libsais in place of the original quadratic-time approach. This cuts bwa-mem3 index wall time substantially on large references. See What’s Different — Performance improvements for the corresponding PR details.
mimalloc allocator. bwa-mem3 vendors and statically links mimalloc, replacing the system malloc/free for all allocations. On Linux the library is injected via --whole-archive; on macOS it uses dyld interposition. The allocator shows consistent throughput gains on multi-threaded workloads because mimalloc avoids the lock contention in glibc’s ptmalloc at high thread counts. See User Guide — Memory allocator for details.
Profile-Guided Optimization (PGO). The build system provides make pgo-generate and make pgo-use targets that compile an instrumented binary, gather branch-probability and call-frequency profiles from a representative workload, and then recompile with those profiles applied. On Apple Silicon the measured gain is approximately 3%; on x86 the gain depends on the workload mix. PGO is opt-in and is not applied to the default make output. See PGO build for the full workflow.
Consolidated mapping speedups
PR #58 and the related lockstep SMEM-batching work (#33) reduced per-read overhead in the main mapping loop beyond what upstream bwa-mem2 carries. The batch -H ingestion improvement (#49) further reduces header-processing latency for large sample sets.
Benchmarking responsibly
Alignment throughput is sensitive to read length, error rate, reference size, thread count, CPU architecture, NUMA topology, and whether the index is cold (read from disk) or warm (already in the kernel page cache). The bwa-mem3-bench harness controls for these variables by running standardized workloads on defined instance types. If you need numbers for a procurement or publication decision, run the harness against your target hardware.
See also: SIMD dispatch matrix · PGO build · Tuning checklist · What’s Different — Performance improvements · bwa-mem3-bench
SIMD Dispatch Matrix
bwa-mem3 uses a multi-binary dispatch strategy on x86: the bwa-mem3 launcher reads the CPU’s CPUID bits at startup, then execvs the highest-capability variant binary found on disk. On ARM there is only one NEON level, so the launcher execs bwa-mem3.arm64 directly without any cpuid check.
Dispatch flowchart
flowchart TD
A[bwa-mem3 launcher starts] --> B{Platform?}
B -- ARM / aarch64 --> C[exec bwa-mem3.arm64]
B -- x86 --> D[read CPUID via __cpuidex]
D --> E{AVX-512BW supported?}
E -- yes --> F[exec bwa-mem3.avx512bw]
E -- no --> G{AVX2 supported?}
G -- yes --> H[exec bwa-mem3.avx2]
G -- no --> I{AVX supported?}
I -- yes --> J[exec bwa-mem3.avx]
I -- no --> K{SSE4.2 supported?}
K -- yes --> L[exec bwa-mem3.sse42]
K -- no --> M{SSE4.1 supported?}
M -- yes --> N[exec bwa-mem3.sse41]
M -- no --> O[error: no supported variant found]
The launcher reads CPUID leaf 1 (for the SSE4.x and AVX flags) and leaf 7 (for the AVX2 and AVX-512 flags). It checks in descending capability order and stops at the first supported variant whose binary it finds on disk. If no variant binary is executable, it exits with an error.
The ARM path always tries bwa-mem3.arm64 first, then falls back to the bare bwa-mem3 name (which on ARM is a symlink to bwa-mem3.arm64 created by make arm64).
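The descending-order check can be approximated from the shell on Linux, which is handy when predicting what the launcher will pick on a given node. This sketch is not the launcher's code: `predict_variant` is a hypothetical helper, it matches the Linux kernel's flag names from /proc/cpuinfo (`sse4_1`, `sse4_2`, `avx`, `avx2`, `avx512bw`), and it ignores the on-disk lookup.

```bash
# predict_variant "FLAGS"
# FLAGS is a space-separated CPU flag list as printed by /proc/cpuinfo.
# Prints the variant name the descending-order check would select.
# Hypothetical helper; it does not check that the binary exists.
predict_variant() {
    # Pad with spaces so "avx" cannot match inside "avx2" or "avx512bw".
    local flags=" $1 "
    case "$flags" in
        *" avx512bw "*) echo "bwa-mem3.avx512bw" ;;
        *" avx2 "*)     echo "bwa-mem3.avx2" ;;
        *" avx "*)      echo "bwa-mem3.avx" ;;
        *" sse4_2 "*)   echo "bwa-mem3.sse42" ;;
        *" sse4_1 "*)   echo "bwa-mem3.sse41" ;;
        *)              echo "no supported variant"; return 1 ;;
    esac
}

# On a Linux host (commented out so the block runs anywhere):
#   predict_variant "$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)"
```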
Building the variant binaries
make multi builds all five x86 variants and the launcher in sequence:
make multi
This produces:
| Filename | Arch flags | Minimum CPU (Intel / AMD) |
|---|---|---|
| bwa-mem3.sse41 | -msse4.1 | Penryn (2007) / Bulldozer (2011) |
| bwa-mem3.sse42 | -msse4.2 | Nehalem (2008) / Bulldozer (2011) |
| bwa-mem3.avx | -mavx | Sandy Bridge (2011) / Bulldozer (2011) |
| bwa-mem3.avx2 | -mavx2 | Haswell (2013) / Excavator (2015) |
| bwa-mem3.avx512bw | -mavx512f -mavx512bw | Skylake-X (2017) / Zen 4 (2022) |
For ARM builds, make arm64 produces a single binary and creates the symlink:
| Filename | Arch flags | Platform |
|---|---|---|
| bwa-mem3.arm64 | -DAPPLE_SILICON=1 + sse2neon shim | Any aarch64 / Apple Silicon |
To build a single-arch binary for a known target (e.g. for a cluster with uniform hardware):
make arch=avx2
The resulting binary is named bwa-mem3 and contains only AVX2 code. The launcher is not built; it is not needed when the target ISA is fixed.
Kernel vectorization coverage
The table below lists the kernels that have SIMD implementations and which ISA levels they cover.
| Kernel | SSE4.1 | SSE4.2 | AVX | AVX2 | AVX-512BW | NEON (arm64) |
|---|---|---|---|---|---|---|
| kswv (vectorized Smith-Waterman) | 8-wide int16 | 8-wide int16 | 8-wide int16 | 16-wide int16 | 32-wide int16 | 8-wide int16 (native) |
| bandedSWA (banded alignment) | SSE2 baseline | SSE2 baseline | SSE2 baseline | SSE2 baseline | SSE2 baseline | native NEON blendv |
| FM-index lookup (FMI_search) | SSE popcount | SSE popcount | SSE popcount | SSE popcount | SSE popcount | __builtin_popcountl |
| libsais BWT construction | scalar | scalar | scalar | OpenMP parallel | OpenMP parallel | OpenMP parallel |
Note — FM-index is memory-bound
The FM-index backward-extension loop is limited by pointer-chasing through the
cp_occ arrays, not by computation. Additional SIMD width does not increase throughput here. See the Apple Silicon optimization log in Developer Guide — Apple Silicon / NEON port for the profiling evidence.
Why the launcher uses execv, not a function pointer
The multi-binary design was inherited from bwa-mem2. Separate compilation units mean the compiler can use the target ISA’s full instruction set throughout — not just in hand-vectorized loops but also in auto-vectorized loops, register allocation, and branch heuristics. A single-binary dispatcher that calls ISA-specific function pointers achieves the same for hand-written kernels but leaves the compiler’s auto-vectorization gated at the baseline ISA. For a workload with this many scalar loops, the execv approach yields a measurable difference. For the ARM path, all CPUs have the same NEON level so the single-binary approach is fine.
See also: Performance overview · PGO build · Developer Guide — SIMD dispatch architecture · Developer Guide — Multi-binary launcher (x86) · Developer Guide — Apple Silicon / NEON port
PGO Build
Profile-Guided Optimization (PGO) is a two-pass compiler technique. In the first pass (pgo-generate) the compiler inserts counters into every branch, call site, and loop back-edge. You run a representative training workload against the instrumented binary so those counters accumulate real branch-probability data. In the second pass (pgo-use) the compiler recompiles every translation unit using the collected profiles to make better inlining, branch-prediction, and code-layout decisions.
bwa-mem3’s Makefile provides three targets that implement this workflow.
Observed gains
On Apple Silicon (M-series), PGO delivered approximately 3% throughput improvement over the native NEON build. The gain on x86 depends on the workload — short-read paired-end alignment on avx2 or avx512bw hardware typically sees 2–5%. PGO is most useful when you will run the same binary on the same hardware against the same workload repeatedly (e.g. a production pipeline node). It is not worth the extra build time for one-off or exploratory runs.
Workflow
Step 1: Build the instrumented binary
make pgo-generate
By default PGO_ARCH is set to arm64 on Apple Silicon / aarch64 hosts and native on x86 hosts. To target a specific ISA, pass PGO_ARCH explicitly:
make pgo-generate PGO_ARCH=avx2
This produces a binary named bwa-mem3.pgo-instr (or bwa-mem3.pgo-instr.avx2 for non-default arch). Profiles are written to the directory pgo_profiles/ by default. Override with PGO_PROFILE_DIR:
make pgo-generate PGO_ARCH=avx2 PGO_PROFILE_DIR=/scratch/pgo-profiles-avx2
Step 2: Run the training workload
Run a workload that is representative of your production use. A single-end or paired-end alignment run against the same reference and similar read length is sufficient. A larger training run produces more stable profiles but 5–10 million read pairs is generally enough.
./bwa-mem3.pgo-instr mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null
The output is discarded because this run exists only to populate the profile counters, not to produce alignments.
Tip — Training workload size
Aim for a training run that exercises the same code paths as your production workload. If you align 150 bp paired-end reads in production, train on 150 bp reads. If you use
--meth, include a methylation alignment run in training. A few million read pairs is sufficient; a full WGS run provides diminishing returns.
Step 3: Build the optimized binary
make pgo-use
Or with matching arch and profile dir:
make pgo-use PGO_ARCH=avx2 PGO_PROFILE_DIR=/scratch/pgo-profiles-avx2
This produces bwa-mem3.pgo (or bwa-mem3.pgo.avx2). The binary is ready to use in production.
Step 4: Clean up instrumentation artifacts
make pgo-clean
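This removes the profile directory and all bwa-mem3.pgo-instr* and bwa-mem3.pgo* files. As written, the second pattern also matches the optimized binary from Step 3, so copy or install that binary elsewhere before cleaning.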
Multi-arch builds with PGO
Each architecture requires its own profile because the instrumentation counters are embedded in arch-specific code. Run the generate, train, and use steps once per arch and keep the profiles in separate directories:
# AVX2 profile
make pgo-generate PGO_ARCH=avx2 PGO_PROFILE_DIR=pgo_profiles_avx2
./bwa-mem3.pgo-instr.avx2 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null
make pgo-use PGO_ARCH=avx2 PGO_PROFILE_DIR=pgo_profiles_avx2
# AVX-512BW profile (separate host or same host with matching CPU)
make pgo-generate PGO_ARCH=avx512bw PGO_PROFILE_DIR=pgo_profiles_avx512bw
./bwa-mem3.pgo-instr.avx512bw mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null
make pgo-use PGO_ARCH=avx512bw PGO_PROFILE_DIR=pgo_profiles_avx512bw
Warning — Profile portability
Profile data collected on one microarchitecture is not portable to a different one. An AVX2 profile collected on a Haswell CPU will not improve — and may pessimize — an AVX-512BW build run on a Sapphire Rapids CPU. Always collect profiles on the same hardware class where the optimized binary will run.
PGO and the multi-binary layout
The PGO targets produce a single optimized binary for a single arch target. They do not rebuild the full make multi set. If you want PGO-optimized multi-binary dispatch, build and profile each arch variant separately, place them alongside the launcher, and verify with ./bwa-mem3 version.
Relationship to LTO
make lto-build produces a Link-Time Optimization binary; make pgo-use produces a PGO-optimized binary. Both are independent opt-in targets. You can combine them by passing -flto (or -flto=thin for clang) as part of EXTRA_CXXFLAGS during the pgo-use step, but the combination has not been systematically benchmarked. In practice, LTO and PGO each provide modest single-digit gains; their interaction is compiler-specific.
See also: Performance overview · SIMD dispatch matrix · Tuning checklist · Best Practices — Build · Developer Guide — Building from source
Tuning Checklist
The items below are ordered by expected impact for most workloads. Work through them in sequence; there is little point optimizing output format before confirming you are running the right binary for your CPU.
1. Run the right binary for your CPU
If you built with make multi (recommended for production x86 deployments), the bwa-mem3 launcher reads CPUID at startup and execs the highest-capability variant automatically. Verify which variant is running by checking the banner printed to stderr at the start of a mem run:
-----------------------------
Executing in AVX512 mode!!
-----------------------------
If the banner says SSE4.1 on a machine you believe supports AVX2, the variant binary may be missing from the directory. Confirm with:
ls -1 bwa-mem3.sse41 bwa-mem3.sse42 bwa-mem3.avx bwa-mem3.avx2 bwa-mem3.avx512bw 2>&1
If files are missing, rebuild with make multi.
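The `ls` check above prints errors mixed with file names. A slightly friendlier sketch lists only the variants that are absent or not executable; `missing_variants` is a hypothetical helper, and the variant names are the five x86 binaries named in this section.

```bash
# missing_variants [DIR]
# Prints one line per x86 variant binary that is missing or not executable
# in DIR (default: current directory). Prints nothing if all are present.
# Hypothetical helper, not shipped with bwa-mem3.
missing_variants() {
    local dir="${1:-.}" variant
    for variant in sse41 sse42 avx avx2 avx512bw; do
        [ -x "$dir/bwa-mem3.$variant" ] || echo "bwa-mem3.$variant"
    done
}

# Example: check the current directory.
missing_variants .
```

An empty result means the launcher has every variant available and the banner reflects the CPU rather than a gap in the install.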
For ARM / Apple Silicon, there is only one binary level. Confirm it is in use:
ls -la bwa-mem3
# expect: bwa-mem3 -> bwa-mem3.arm64
See SIMD dispatch matrix for the full dispatch logic and the minimum CPU requirements for each variant.
Tip — Single-arch deployments
On a cluster where all nodes have the same CPU, build with
make arch=avx2 (or the appropriate ISA). The launcher overhead is negligible but removing it simplifies the deployment: only one binary to distribute and no variant-lookup failures.
2. Build with PGO if you will run repeatedly
For production pipeline nodes that will process many samples against the same reference, a PGO build provides an additional 2–5% throughput at the cost of one extra build pass and a training run:
make pgo-generate PGO_ARCH=avx2
./bwa-mem3.pgo-instr.avx2 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null
make pgo-use PGO_ARCH=avx2
See PGO build for the full workflow, including multi-arch and profile portability notes.
3. Use shared memory for many small samples
When aligning many samples on one machine against the same reference, loading the index into POSIX shared memory once and reusing it across all mem invocations eliminates redundant I/O and reduces per-sample startup time significantly. The benefit grows with the number of samples and the size of the reference.
# Load the index into shared memory once
bwa-mem3 shm ref.fa
# Align each sample against the in-memory index
bwa-mem3 mem -t 16 ref.fa R1.fq.gz R2.fq.gz | samtools sort -@ 4 -o sample.bam -
# When finished with all samples, drop the shared segment
bwa-mem3 shm -d
Warning — No staleness check
bwa-mem3 shm does not detect whether the on-disk index has changed after the segment was loaded. Always run bwa-mem3 shm -d before re-indexing a reference and re-loading with bwa-mem3 shm. Failing to do so results in alignments against a stale index.
See Getting Started — Shared-memory index and Best Practices — Multi-sample workflows for complete workflows.
4. Emit BAM directly
Use --bam (or --bam=0 for uncompressed BAM) to emit BAM instead of SAM. Uncompressed BAM avoids the text-formatting cost on the aligner side and the text-parsing cost on the downstream side. samtools sort reads BAM natively and is fastest when the input is uncompressed:
bwa-mem3 mem --bam=0 -t 16 ref.fa R1.fq.gz R2.fq.gz \
| samtools sort -@ 8 -o out.bam -
samtools index out.bam
The --bam flag (without =0) produces BGZF-compressed BAM. This is useful when writing directly to disk without a downstream piped tool.
See Best Practices — Output format for guidance on when SAM is still appropriate.
5. Pipe to a multi-threaded sorter
Sorting is typically the bottleneck after alignment. Keep a separate thread budget for samtools sort:
bwa-mem3 mem --bam=0 -t 12 ref.fa R1.fq.gz R2.fq.gz \
| samtools sort -@ 8 -m 2G -o out.bam -
On a 16-core machine, allocating 12 threads to mem and 8 to samtools sort is a common starting point. The 20 threads deliberately oversubscribe the 16 cores: the two stages overlap through the pipe, and the aligner is generally CPU-bound while the sorter is I/O-bound during merge. Profile both stages to find the right split for your hardware.
Tip — Thread count tuning
bwa-mem3 mem scales well to 16–32 threads on most workloads. Beyond 32 threads the per-thread work unit becomes small enough that synchronization overhead starts to erode gains. See User Guide — Threading and resource use for thread-scaling data.
Summary table
| Item | Action | Reference |
|---|---|---|
| Right binary for CPU | make multi; verify banner | SIMD dispatch matrix |
| PGO for production | pgo-generate → train → pgo-use | PGO build |
| Shared-memory index | bwa-mem3 shm ref.fa before batch runs | Quick start: shm |
| Emit uncompressed BAM | --bam=0 | Best Practices — Output format |
| Multi-threaded sort | samtools sort -@ with appropriate thread split | User Guide — Threading |
See also: Performance overview · SIMD dispatch matrix · PGO build · Best Practices — Build · User Guide — Threading and resource use
Build
This page describes the recommended build configuration for production use of bwa-mem3.
Choose the right arch target
The default make invocation builds the multi-binary launcher on x86 (or a
single ARM64 binary on Apple Silicon). For production servers where the CPU
family is known, specify the target explicitly so the compiler can generate
tighter code and the binary does not need the launcher overhead:
# Most modern x86-64 servers (Haswell or later):
make arch=avx2
# Intel Cascade Lake / Sapphire Rapids, AWS c7i/m7i:
make arch=avx512bw
# Apple Silicon / AWS Graviton:
make arch=arm64
Omit arch= if the deployment target is heterogeneous or unknown; make
(with no arguments) builds the full multi-binary suite on x86 and selects the
fastest variant at runtime via cpuid.
See SIMD dispatch matrix for the full list of targets and which kernels each vectorizes.
Profile-Guided Optimization (PGO)
PGO typically yields 2–5% throughput improvement on real workloads (about 3% measured on Apple Silicon). It is opt-in: the standard
make target does not use it, but it is recommended for any installation that
will run many alignment jobs against the same reference.
The workflow is three steps:
# Step 1: Build an instrumented binary (produces bwa-mem3.pgo-instr).
make pgo-generate
# Step 2: Run a representative training workload.
# Use reads and a reference that reflect actual production input.
# About 5–10 million read pairs is sufficient.
./bwa-mem3.pgo-instr mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null
# Step 3: Build the PGO-optimized binary (produces bwa-mem3.pgo).
make pgo-use
To target a specific SIMD level, pass PGO_ARCH=:
make pgo-generate PGO_ARCH=avx2
./bwa-mem3.pgo-instr.avx2 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null
make pgo-use PGO_ARCH=avx2
# Produces: bwa-mem3.pgo.avx2
Profile data is written to pgo_profiles/ by default. Pass
PGO_PROFILE_DIR=<path> to change the location.
Tip — Training data matters
The training workload should resemble production input in read length, base quality distribution, and reference composition. A read set that is too short, too long, or too easy (low mismatch rate) will bias the branch predictions and may produce a build that is slower than the non-PGO baseline on real data.
mimalloc
mimalloc is compiled in by default (USE_MIMALLOC=1). The allocator
improves multi-threaded throughput by reducing lock contention on malloc
and free hot paths. Run bwa-mem3 version to confirm it is active:
bwa-mem3 version
# Expected output includes a line like:
# mimalloc 3.x.x
To build without mimalloc (for example, when using AddressSanitizer or on a system with a known-incompatible allocator):
make USE_MIMALLOC=0
Summary
For a production installation on a known x86 server with AVX2:
make pgo-generate PGO_ARCH=avx2
./bwa-mem3.pgo-instr.avx2 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null
make pgo-use PGO_ARCH=avx2
# Deploy: bwa-mem3.pgo.avx2
See also: SIMD dispatch matrix · PGO build · Memory allocator (mimalloc) · Building from source · Anti-patterns
Output Format
The choice of output format — SAM, compressed BAM, or uncompressed BAM — has a measurable effect on end-to-end pipeline wall time. This page explains why uncompressed BAM is the right default and shows the recommended pipeline.
Why uncompressed BAM is faster than SAM
When bwa-mem3 writes SAM (the default when --bam is not set), every
alignment record must be serialized into ASCII text: integers are formatted as
decimal strings, bases are encoded as characters, and flags are written as
decimal numbers. The receiving process — typically samtools sort — then parses
each field back from text into binary integers. Both conversions are pure
overhead: the data is binary inside bwa-mem3 and binary inside samtools; text
is only an interchange format that is immediately discarded.
Uncompressed BAM (--bam=0) bypasses this round-trip. bwa-mem3 writes binary
BAM records directly via htslib’s wb0 mode. The write path performs no text
formatting; the read path in samtools sort performs no text parsing. The
htslib overhead of the wb0 write is negligible: it is effectively a
buffered write(2) call with a small BGZF block header prepended.
Compressed BAM (--bam=1) adds BGZF compression on top, which costs CPU on
the write side and gains nothing: the data only crosses a kernel pipe
buffer, samtools sort has to decompress it on read, and it re-compresses its
own output anyway. Compressed BAM on a pipe wastes CPU on both sides.
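The choice between the three formats can be made automatically from where standard output points. This is a sketch under stated assumptions: `align_auto` and the `BWA_MEM3_BIN` override are hypothetical names, and the policy (SAM on a terminal, `--bam=0` on a pipe, `--bam=1` when redirected to a file) is one reading of the guidance on this page, not built-in bwa-mem3 behavior.

```bash
# align_auto ARGS...
# Runs `bwa-mem3 mem` with an output format chosen from the type of stdout:
#   terminal -> SAM (no --bam flag), for reading by eye
#   pipe     -> --bam=0, uncompressed BAM for a downstream tool
#   other    -> --bam=1, compressed BAM written to a file
# Hypothetical wrapper; BWA_MEM3_BIN is an assumed override for dry runs.
align_auto() {
    local bin="${BWA_MEM3_BIN:-bwa-mem3}"
    if [ -t 1 ]; then
        "$bin" mem "$@"
    elif [ -p /dev/stdout ]; then
        "$bin" mem --bam=0 "$@"
    else
        "$bin" mem --bam=1 "$@"
    fi
}

# Example use (commented out because it needs bwa-mem3 and samtools):
#   align_auto -t 16 ref.fa R1.fq.gz R2.fq.gz | samtools sort -@ 8 -o out.bam -
```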
Recommended pipeline
bwa-mem3 mem --bam=0 -t 16 ref.fa R1.fq.gz R2.fq.gz \
| samtools sort -@ 8 -o out.bam -
samtools index out.bam
The -@ 8 flag gives samtools sort eight worker threads for sorting and for compressing the
final sorted BAM. Tune this number based on available cores; the total core
count should be split so that alignment threads and sort threads do not
contend. A 16:8 split (bwa-mem3:samtools) works well on 24-core machines.
Tip — Thread allocation
Do not give all cores to bwa-mem3. Downstream
samtools sort needs threads to compress and write the sorted BAM. Leaving 4–8 threads for samtools sort keeps the pipeline balanced and prevents a write bottleneck that would stall the aligner.
Methylation output
The --meth path always writes uncompressed BAM internally, regardless of
the --bam flag. The post-processing step (header rewrite, chimera QC,
YD:Z: tag) is performed inline before the record is handed to htslib, so the
same pipeline shape applies:
bwa-mem3 mem --meth --bam=0 -t 16 ref.fa R1.fq.gz R2.fq.gz \
| samtools sort -@ 8 -o out.bam -
samtools index out.bam
When SAM is appropriate
SAM (the default, equivalent to omitting --bam) remains the right choice for:
- Debugging. Plain text is readable with less, grep, and any text editor, making it easy to inspect individual records without samtools view.
- Ad-hoc inspection. When you need to scan a few thousand reads to diagnose a mapping problem, piping to SAM and reading the output directly is faster than writing a BAM file and then querying it.
- Compatibility with tools that require SAM input. Some legacy tools do not accept BAM. If the downstream tool does not support BAM, use SAM.
For production alignment jobs that feed samtools sort, always use
--bam=0.
Summary table
| Format | --bam value | Pipe overhead | Recommended for |
|---|---|---|---|
| SAM | (default / omit) | High (text round-trip) | Debugging, ad-hoc inspection |
| Uncompressed BAM | 0 | Negligible | Production pipelines |
| Compressed BAM | 1 | High on write side | Writing directly to a file (no downstream sort) |
See also: Aligning short reads (mem) · Output: SAM/BAM, headers, tags · Threading and resource use · Tuning checklist · CLI Reference: mem
Multi-Sample Workflows
When you need to align many samples back-to-back against the same reference on a single machine, loading the FM-index into shared memory once — and keeping it resident across all alignment jobs — eliminates the index I/O cost for every sample after the first.
The problem: repeated index loads
The bwa-mem3 FM-index for hg38 is approximately 28 GB on disk. Without shared
memory, bwa-mem3 mem reads the entire index from disk on every invocation.
On a fast NVMe drive this takes 30–60 seconds; on a network-attached or
spinning-disk filesystem it can take several minutes. For a batch of 100
samples, that adds 50–100 minutes of pure I/O overhead on NVMe and several hours on slower storage.
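The cost is simple multiplication, and it is worth plugging in your own load time. `index_load_overhead_minutes` is a hypothetical helper; the per-load seconds are whatever you measure on your storage, and the result is rounded down to whole minutes.

```bash
# index_load_overhead_minutes SAMPLES SECONDS_PER_LOAD
# Prints the total time, in whole minutes, spent re-reading the index when
# every sample loads it from disk. Hypothetical helper for estimation.
index_load_overhead_minutes() {
    local samples="$1" seconds_per_load="$2"
    echo $(( samples * seconds_per_load / 60 ))
}

# 100 samples at 30 s and 60 s per load (the NVMe range quoted above),
# then at 180 s per load as an example of slower storage.
index_load_overhead_minutes 100 30
index_load_overhead_minutes 100 60
index_load_overhead_minutes 100 180
```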
Staging the index once with bwa-mem3 shm
# Stage the index into shared memory (one-time cost, ~28 GB for hg38).
bwa-mem3 shm ref.fa
# Align each sample. bwa-mem3 mem attaches automatically — no extra flag.
bwa-mem3 mem --bam=0 -t 16 ref.fa sample1_R1.fq.gz sample1_R2.fq.gz \
| samtools sort -@ 4 -o sample1.bam -
bwa-mem3 mem --bam=0 -t 16 ref.fa sample2_R1.fq.gz sample2_R2.fq.gz \
| samtools sort -@ 4 -o sample2.bam -
# ...
# When done, release the segment.
bwa-mem3 shm -d
For methylation workflows, stage the c2t index instead:
bwa-mem3 shm --meth ref.fa
bwa-mem3 mem --meth --bam=0 -t 16 ref.fa sample1_R1.fq.gz sample1_R2.fq.gz \
| samtools sort -@ 4 -o sample1.bam -
bwa-mem3 shm -d
Confirming the index is staged
bwa-mem3 shm -l
# Prints the basename and memory usage of each staged segment.
If the listing is empty, the index is not staged and bwa-mem3 mem will fall
back to loading from disk.
Thread layout for parallel alignment
Running multiple bwa-mem3 mem instances in parallel is efficient when the
samples are independent and the machine has enough cores. The shared-memory
index eliminates disk contention, so the bottleneck becomes CPU and memory
bandwidth.
Guidelines for N-core machines:
- N = 32: Two instances at -t 14 each, with -@ 4 for samtools sort. The aligner threads occupy 28 cores, leaving 4 for sorting, the OS, and I/O.
- N = 64: Two to four instances at -t 14 to -t 16, each with -@ 4 for samtools sort.
- N = 128: Four to eight instances; keep at least 8–16 cores free for samtools sort threads and OS scheduling.
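The guidelines above follow a pattern: reserve some cores, then divide the rest by the per-instance `-t` value. The sketch below encodes that reading; `max_instances` is a hypothetical helper and the reserve sizes are assumptions drawn from the list.

```bash
# max_instances CORES ALIGN_THREADS RESERVED
# Prints how many concurrent bwa-mem3 mem instances fit once RESERVED cores
# are set aside for samtools sort, the OS, and I/O.
# Hypothetical helper; the reserve-then-divide model is an assumption.
max_instances() {
    local cores="$1" align_threads="$2" reserved="$3"
    echo $(( (cores - reserved) / align_threads ))
}

# The three machine sizes from the guidelines, using -t 14 per instance.
max_instances 32 14 4
max_instances 64 14 4
max_instances 128 14 16
```

The results are upper bounds; memory bandwidth may make a lower instance count faster in practice.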
Tip — Memory bandwidth limit
The FM-index lookup is memory-bandwidth bound. On machines with NUMA topology (multi-socket or multi-chiplet), binding each bwa-mem3 instance to a NUMA node with
numactl --cpunodebind=N --membind=N can improve throughput by reducing cross-node memory traffic.
Scripting a batch with a loop
bwa-mem3 shm ref.fa
for sample in sample1 sample2 sample3; do
bwa-mem3 mem --bam=0 -t 16 ref.fa "${sample}_R1.fq.gz" "${sample}_R2.fq.gz" \
| samtools sort -@ 4 -o "${sample}.bam" -
samtools index "${sample}.bam"
done
bwa-mem3 shm -d
For parallel execution, replace the for loop body with a background job (or
use a workflow manager such as Snakemake or Nextflow) and limit the degree of
parallelism to match available cores.
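For a plain-shell version of bounded parallelism, `xargs -P` is one option. This is a sketch: `run_batch` is a hypothetical helper, the worker is any command or script that takes a sample name as its only argument, and `-P` is a GNU and BSD extension rather than POSIX.

```bash
# run_batch MAX_JOBS WORKER SAMPLE...
# Runs "WORKER SAMPLE" once per sample, with at most MAX_JOBS running at
# the same time. Hypothetical helper; relies on the non-POSIX xargs -P.
run_batch() {
    local max_jobs="$1" worker="$2"
    shift 2
    printf '%s\n' "$@" | xargs -P "$max_jobs" -n 1 "$worker"
}

# Example use, assuming a hypothetical script align_one.sh that runs the
# bwa-mem3 | samtools pipeline for the sample named in "$1":
#   bwa-mem3 shm ref.fa
#   run_batch 4 ./align_one.sh sample1 sample2 sample3 sample4
#   bwa-mem3 shm -d
```

Sample names containing whitespace or quotes would need a null-delimited variant (`printf '%s\0'` with `xargs -0`).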
Warning — Stale segment footgun
If you need to re-index the reference (e.g. after updating it), always run
bwa-mem3 shm -d before bwa-mem3 index. There is no automatic staleness check. See Anti-patterns for details.
See also: Quick start: shared-memory index · CLI Reference: shm · Output format · Threading and resource use · Anti-patterns
Methylation Defaults
bwa-mem3 mem --meth ships with a set of scoring and filtering defaults that
match the bwameth.py reference implementation. This page describes what those
defaults are, when to keep them, and when to override them.
What --meth sets
When --meth is passed, the following flags are applied automatically in
addition to enabling inline c2t conversion and BAM post-processing:
| Flag | Value | Purpose |
|---|---|---|
| -B | 2 | Mismatch penalty. Reduced from the bwa-mem2 default of 4. Bisulfite-treated reads carry C→T and G→A mismatches at converted positions; a lower penalty prevents these from causing spurious soft-clipping or unmapped reads. |
| -L | 10 | Clipping penalty. Increased from the bwa-mem2 default of 5 to discourage clipping of read ends that carry converted bases at positions that look like mismatches. |
| -U | 100 | Unpaired read penalty. Higher than default; methylation libraries typically have well-defined insert sizes and anomalous pairing usually reflects a mapping artifact. |
| -T | 40 | Minimum alignment score threshold. Higher than default; raises the bar to report an alignment, reducing spurious low-quality hits against the doubled reference. |
| -CM | (no value) | Two standard bwa mem switches. -C appends the FASTA/FASTQ comment to each output record, which is how the bwameth convention carries per-read tags through alignment; -M marks shorter split hits as secondary for compatibility with downstream tools. |
These defaults can all be overridden on the command line. The --meth flag
sets them first; any explicit flag that follows overrides the --meth-set
value.
When to keep the defaults
For standard whole-genome bisulfite sequencing (WGBS) workflows, the defaults are appropriate as-is. They were derived from the bwameth.py codebase and are expected by most downstream methylation calling tools. Unless you have a specific reason to deviate, use:
bwa-mem3 mem --meth --bam=0 -t 16 ref.fa R1.fq.gz R2.fq.gz \
| samtools sort -@ 4 -o out.bam -
samtools index out.bam
When to override
Low-coverage or targeted bisulfite sequencing. If your library covers a
small target region and insert sizes are more variable, consider lowering -T
(e.g. -T 20) to recover short or soft-clipped alignments in the target.
Amplicon bisulfite sequencing. Amplicon reads have uniform insert sizes, so
the default -U 100 is appropriate. However, if your amplicons are short
(< 100 bp), consider raising -L above 10 to further discourage clipping at read ends.
Non-standard conversion chemistry. Some library preparations convert only one
strand (C→T only, not G→A). In such cases, --set-as-failed r
flags alignments to the reverse strand as QC-fail (0x200), so downstream
tools can filter out strand-ambiguous alignments:
bwa-mem3 mem --meth --set-as-failed r --bam=0 -t 16 ref.fa R1.fq.gz R2.fq.gz \
| samtools sort -@ 4 -o out.bam -
Chimeric reads from long-read-length protocols. By default, --meth
applies a chimera QC heuristic: if the longest matching run (CIGAR M/=/X
operations) is less than 44% of the read length, the alignment is flagged
0x200 (QC fail), the proper-pair flag 0x2 is cleared, and MAPQ is capped at 1.
If your protocol produces legitimate long reads where this heuristic
over-aggressively flags alignments, pass --do-not-penalize-chimeras:
bwa-mem3 mem --meth --do-not-penalize-chimeras --bam=0 -t 16 ref.fa R1.fq.gz R2.fq.gz \
| samtools sort -@ 4 -o out.bam -
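To make the 44% rule concrete, here is a minimal POSIX shell sketch of the check. It is not taken from the bwa-mem3 source. It assumes that "longest matching run" means the largest single M, = or X operation and that read length is the sum of the query-consuming operations (M, I, S, =, X); the helper names cigar_stats and is_chimera are hypothetical.

```shell
# Sketch of the chimera heuristic; NOT the bwa-mem3 implementation.
# Assumptions: longest run = largest single M/=/X op,
#              read length = sum of M, I, S, = and X op lengths.

# Print "<longest_match> <read_length>" for a CIGAR string.
cigar_stats() {
  printf '%s\n' "$1" | awk '{
    longest = 0; read_len = 0; s = $0
    # Peel one "<length><op>" pair off the front per iteration.
    while (match(s, /^[0-9]+[MIDNSHP=X]/)) {
      len = substr(s, 1, RLENGTH - 1) + 0
      op  = substr(s, RLENGTH, 1)
      if (op ~ /[M=X]/ && len > longest) longest = len
      if (op ~ /[MIS=X]/) read_len += len
      s = substr(s, RLENGTH + 1)
    }
    print longest, read_len
  }'
}

# Succeed (exit 0) when the longest match is under 44% of the read length.
is_chimera() {
  set -- $(cigar_stats "$1")
  [ $(( $1 * 100 )) -lt $(( 44 * $2 )) ]
}
```

Under these assumptions a 150 bp read with CIGAR 60M90S is flagged (60 is 40% of 150), while 100M50S is not.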
Note — Overrides are positional
Flags supplied after --meth on the command line override the defaults set by --meth. For example, bwa-mem3 mem --meth -B 4 ... uses -B 4 (not 2). Flags supplied before --meth are silently overwritten by --meth’s defaults, so always place overrides after --meth.
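The ordering rule can be modelled with a toy left-to-right parser. This is only an illustration of the behavior described here, not bwa-mem3's real option handling; resolve_mismatch_penalty is a hypothetical helper that tracks just the -B value.

```shell
# Toy model of "last setting wins"; NOT bwa-mem3's option parser.
# Walks the arguments left to right and prints the -B value in effect.
# The body runs in a subshell, so its variables stay private.
resolve_mismatch_penalty() (
  b=4                              # bwa-mem2 default for -B
  while [ $# -gt 0 ]; do
    case $1 in
      --meth) b=2 ;;               # --meth installs its own default
      -B)     [ $# -ge 2 ] || break
              b=$2; shift ;;       # an explicit -B replaces the current value
    esac
    shift
  done
  echo "$b"
)

resolve_mismatch_penalty --meth -B 4   # override after --meth  -> 4
resolve_mismatch_penalty -B 4 --meth   # override before --meth -> 2
```

The second call shows why an override placed before --meth is lost: --meth is processed later and resets the value.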
Downstream tool compatibility
The --meth output BAM is designed to be a drop-in replacement for the output
of the bwameth.py pipeline. The following downstream tools have been used
successfully with bwa-mem3 --meth output:
- MethylDackel: extracts methylation calls from the YD:Z: strand tag.
- Bismark: accepts the bwameth-convention YD:Z: strand annotation.
- PileOMeth: reads the standard bisulfite BAM format.
If a tool requires the XB:Z: tag convention used by Bismark’s own aligner
rather than the YD:Z: convention, a conversion step is needed before
methylation calling.
See also: Methylation Reference: Overview · SAM tags: YS, YC, YD · Flags: --set-as-failed, --do-not-penalize-chimeras · Quick start: methylation alignment · Output format
Anti-Patterns
This page documents common mistakes that produce incorrect results or unnecessary failures when using bwa-mem3.
Re-indexing without dropping the shared-memory segment
Warning — Footgun
bwa-mem3 shm does not detect stale segments. If you re-run bwa-mem3 index after a shared-memory segment is already staged, the on-disk index files will not match the in-memory segment. bwa-mem3 mem will attach to the stale segment and produce incorrect alignments without any warning.
Always run bwa-mem3 shm -d before re-indexing:
bwa-mem3 shm -d          # drop all staged segments
bwa-mem3 index ref.fa    # rebuild the on-disk index
bwa-mem3 shm ref.fa      # re-stage the new index
There is no automatic staleness check in the implementation. The segment name is derived from the reference basename only; no content hash or modification timestamp is stored.
To confirm that no stale segments are staged, use bwa-mem3 shm -l before
running any indexing step.
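A wrapper along the following lines could enforce that check before every re-index. It is a sketch under one loud assumption: that bwa-mem3 shm -l prints one segment name per line on stdout and prints nothing when no segment is staged (the listing format is not specified here). The helper names segments_staged and safe_reindex are hypothetical.

```shell
# Hypothetical guard around re-indexing; a sketch, not a shipped script.
# ASSUMPTION: `bwa-mem3 shm -l` writes one segment name per line to stdout
# and writes nothing when no segment is staged.

# Reads a `shm -l` listing on stdin; succeeds if it has any non-blank line.
segments_staged() {
  grep -q '[^[:space:]]'
}

# Drop staged segments (if any), then rebuild the index for reference $1.
safe_reindex() {
  if bwa-mem3 shm -l | segments_staged; then
    echo "staged segments found; dropping them before re-indexing" >&2
    bwa-mem3 shm -d
  fi
  bwa-mem3 index "$1"
}
```

Calling safe_reindex ref.fa then replaces a bare bwa-mem3 index ref.fa in batch scripts; re-staging with bwa-mem3 shm ref.fa remains a separate, explicit step.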
Forgetting to initialize submodules
bwa-mem3 depends on several submodules (ext/htslib, ext/safestringlib,
ext/libsais, ext/mimalloc, ext/sse2neon). A shallow clone or a clone
without --recursive will produce a build that fails at the linking step with
missing symbols, or at runtime with missing index files.
Warning — Missing submodules
Always clone with --recursive, or initialize submodules after cloning:
git clone --recursive https://github.com/fg-labs/bwa-mem3
# or, after a plain (non-recursive) clone:
git submodule update --init --recursive
If make reports missing headers (e.g. htslib/hts.h: No such file or directory), the submodules were not initialized.
Building without an arch target on a known CPU
The default make (no arch=) builds the multi-binary launcher suite on x86.
On a production server with a known CPU family, this is unnecessary: the
launcher adds a small cpuid dispatch overhead on every invocation, and the
extra binaries consume disk space. More importantly, building without an
explicit arch= means the compiler cannot assume any ISA beyond SSE4.1, so
AVX2- and AVX-512-specific optimizations are not applied to the base binary.
Warning — Suboptimal build on known hardware
On a server with a known CPU family, always pass an explicit arch=:
make arch=avx2       # for Broadwell/Skylake and later x86
make arch=avx512bw   # for Cascade Lake, Ice Lake, Sapphire Rapids
make arch=arm64      # for Apple Silicon, AWS Graviton
The make multi target (or bare make on x86) is appropriate when you are building a binary that will be distributed and run on multiple CPU families, or when the target CPU is genuinely unknown.
See SIMD dispatch matrix for the full set of targets.
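For build scripts, the choice of arch= can be derived from the host. The sketch below covers only the three targets named on this page and falls back to the default build for anything else; pick_arch is a hypothetical helper, and the flag names avx2 and avx512bw are those Linux reports in /proc/cpuinfo.

```shell
# Hypothetical helper: map a machine type and a CPU flag list to one of the
# arch= values named on this page. Prints an empty line when none applies.
pick_arch() {
  case $1 in
    aarch64|arm64) echo arm64; return 0 ;;
  esac
  # Pad with spaces so that only whole flag names match.
  case " $2 " in
    *" avx512bw "*) echo avx512bw ;;
    *" avx2 "*)     echo avx2 ;;
    *)              echo "" ;;     # unknown: use the bare `make` build
  esac
}

# Possible use on a Linux host (left commented out so nothing is built here):
#   flags=$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)
#   arch=$(pick_arch "$(uname -m)" "$flags")
#   make ${arch:+arch=$arch}
```

The ${arch:+arch=$arch} expansion passes arch= only when a target was detected, so an unrecognized CPU still gets the multi-binary build.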
Mixing bwa-mem3 and bwa-mem2 outputs in the same pipeline
bwa-mem3 adds several custom SAM tags that bwa-mem2 does not emit: HN:i
(total number of primary alignments — both reported and suppressed — that the
aligner found for this read, before the -h supplementary cap is applied),
and — in --meth mode —
YS:Z:, YC:Z:, and YD:Z:. It also rewrites @SQ header lines in
--meth mode (collapsing f/r strand prefixes back to one entry per
chromosome).
Warning — Header and tag mismatch
Do not merge BAM files produced by bwa-mem3 and bwa-mem2 without verifying that the @PG headers and custom tags are handled correctly by the downstream tool. In methylation workflows, a bwa-mem2 BAM mixed into a bwa-mem3 --meth pipeline will be missing YD:Z: strand annotations, which will cause methylation callers to silently drop or misclassify those records.
If you must merge outputs from both tools, run samtools view -H on both
files and confirm that @SQ lines are consistent and that the downstream tool
can tolerate the tag differences.
Writing compressed BAM to a pipe
Passing --bam=1 (compressed BAM) when piping to samtools sort compresses
the stream on the bwa-mem3 side and then immediately decompresses it on the
samtools side. This wastes CPU on both ends with no benefit.
Use --bam=0 (uncompressed BAM) for all pipe-to-sort workflows. See
Output format for the full explanation and recommended
pipeline.
See also: Output format · Multi-sample workflows · Build · Quick start: shared-memory index · CLI Reference: shm
CLI Reference Overview
bwa-mem3 exposes four subcommands: index, mem, shm, and version. Run
bwa-mem3 <subcommand> --help to see the full option list for any command.
How this section is structured
Each subcommand page follows the same layout:
- Introduction: what the subcommand does and when to reach for it.
- Synopsis: the verbatim --help output, auto-captured from the binary at build time and included here via mdbook’s {{#include}} directive. The snippet is regenerated by make docs-cli, and CI fails if it drifts from the binary.
- Common usage: two or three worked command-line examples.
- Flag reference (for mem, grouped by topic): per-flag prose covering semantics, defaults, and interactions with other flags that the --help text does not have room to explain.
- Notes / Gotchas: operational warnings about non-obvious behavior.
- See also: cross-links to related pages in this book.
Subcommands
index builds the FM-index from a reference FASTA.
Pass --meth to produce a bwameth-style doubled c2t reference for
methylation alignment.
mem aligns short reads against an indexed reference, producing
SAM or BAM output. It is the primary alignment subcommand. The flag surface is
large; the mem reference page groups flags by purpose to make them easier to
navigate.
shm stages an FM-index into POSIX shared memory so that
repeated bwa-mem3 mem invocations on the same machine skip the per-run disk
read. It also lists and destroys staged segments.
version prints the bwa-mem3 release version and, when mimalloc is compiled in, the mimalloc version.
See also: User Guide — Aligning short reads · User Guide — Indexing the reference · Getting Started — Quick start: align paired-end FASTQs · Getting Started — Quick start: shared-memory index · Performance — Tuning checklist
index
bwa-mem3 index builds the FM-index (BWT + suffix array) that bwa-mem3 mem
requires for alignment. Run it once per reference; the resulting files sit
alongside the input FASTA and are reused for all subsequent alignment jobs.
Pass --meth to produce a bwameth-compatible doubled c2t reference for
bisulfite-seq alignment.
Synopsis
Usage: bwa-mem3 index [-p prefix] [-t N] [--max-memory SIZE] [--tmp-dir PATH] [--meth] <in.fasta>
-p STR output prefix (default: <in.fasta>)
-t INT worker threads [auto: detected cores, cgroup-aware]
--max-memory SIZE peak memory budget; SIZE accepts a G/M/K suffix
(case-insensitive) or bare bytes
[auto: min(50% of RAM, 32G), cgroup-aware]
--tmp-dir PATH scratch directory [$TMPDIR]
--meth build a bwameth-style doubled c2t reference + FMI.
Writes <in.fasta>.bwameth.c2t and the FMI alongside it.
Use with `bwa-mem3 mem --meth <in.fasta> R1.fq [R2.fq]`.
-h, --help print this help message and exit
Common usage
Build a standard index using all available cores:
bwa-mem3 index ref.fa
Build a methylation-aware index (required before bwa-mem3 mem --meth):
bwa-mem3 index --meth ref.fa
Limit peak RAM to 16 GB and write scratch data to /scratch:
bwa-mem3 index --max-memory 16G --tmp-dir /scratch ref.fa
Flag reference
-p STR — output prefix
By default, index files are written alongside <in.fasta> using the FASTA
path as a prefix (e.g. ref.fa.bwt.2bit.64, ref.fa.0123, etc.). Use -p
to write them to a different base path, such as a dedicated index directory:
bwa-mem3 index -p /idx/hg38 ref.fa
# writes /idx/hg38.bwt.2bit.64, /idx/hg38.0123, …
# align with: bwa-mem3 mem /idx/hg38 R1.fq R2.fq
-t INT — worker threads
Controls the number of threads used during index construction. The default auto-detects available cores and is cgroup-aware, so it behaves correctly inside containers and on shared cluster nodes. Set explicitly when you want to cap CPU usage.
--max-memory SIZE — peak memory budget
Limits how much RAM the indexer may use at once. SIZE accepts a G, M, or
K suffix (case-insensitive) or a bare byte count. The default is
min(50% of RAM, 32 GB), computed in a cgroup-aware manner.
For large references (hg38 and above) on machines with limited RAM, setting
this to a value lower than the reference size causes the indexer to partition
work and use --tmp-dir for intermediate files, at the cost of extra I/O.
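The SIZE grammar and the default budget can be sketched in a few lines of POSIX shell. This is an illustration, not the indexer's parser. It assumes binary multiples (K = 1024 bytes) and whole-number values, neither of which is confirmed here, and it ignores the cgroup-aware part of the default; parse_size and default_budget are hypothetical names.

```shell
# Sketch of SIZE parsing; NOT the bwa-mem3 implementation.
# ASSUMPTIONS: K/M/G are binary multiples (K = 1024) and values are integers.
parse_size() {
  num=${1%[GgMmKk]}                       # strip one optional suffix
  case $num in
    ''|*[!0-9]*|0?*) echo "bad size: $1" >&2; return 1 ;;
  esac
  case $1 in
    *[Kk]) echo $(( num * 1024 )) ;;
    *[Mm]) echo $(( num * 1024 * 1024 )) ;;
    *[Gg]) echo $(( num * 1024 * 1024 * 1024 )) ;;
    *)     echo "$num" ;;                 # bare byte count
  esac
}

# Default budget min(50% of RAM, 32G), given total RAM in bytes as $1.
default_budget() {
  half=$(( $1 / 2 ))
  cap=$(parse_size 32G)
  if [ "$half" -lt "$cap" ]; then echo "$half"; else echo "$cap"; fi
}
```

With these assumptions, a 16 GiB machine gets an 8 GiB default budget, and any machine with 64 GiB or more is capped at 32 GiB.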
--tmp-dir PATH — scratch directory
Scratch directory for intermediate files when memory is partitioned. Defaults
to $TMPDIR. Point this at a fast local disk (NVMe or ramdisk) to minimize
wall-clock time when --max-memory forces partitioned construction.
--meth — build a methylation (c2t) index
Writes a bwameth-style doubled reference — <in.fasta>.bwameth.c2t — and
builds the FM-index over that file rather than the original FASTA. The c2t
file and its index files are placed alongside the original FASTA.
Pass the original FASTA prefix (not the .bwameth.c2t path) to all three
commands: index, shm, and mem. The c2t suffix is appended automatically
when --meth is present.
Notes / Gotchas
Tip — Index once, align many times
Index construction for hg38 takes several minutes and ~28 GB of disk. Build the index once and store it on shared storage; all alignment jobs on the same reference share the same index files.
Warning: --meth index is not interchangeable with the standard index
A --meth index is built over the c2t reference and cannot be used for normal (non-bisulfite) alignment. Keep separate index directories if you align both standard and bisulfite samples to the same reference.
See also: User Guide — Indexing the reference · CLI Reference — mem · CLI Reference — shm · Getting Started — Quick start: methylation alignment · Methylation Reference — Overview
mem
bwa-mem3 mem aligns short DNA reads against an indexed reference genome
using the BWA-MEM algorithm. It accepts one or two FASTQ files (single-end or
paired-end) and writes alignments to stdout in SAM or BAM format. It is the
primary alignment subcommand; nearly all bwa-mem3 usage flows through it.
Synopsis
Usage: bwa-mem3 mem [options] <idxbase> <in1.fq> [in2.fq]
Options:
Algorithm options:
-o STR Output SAM file name
--bam[=N] Emit BAM instead of SAM text. N=0 (default) = uncompressed;
1..9 = BGZF deflate levels. Writes to stdout; redirect with `>`.
-t INT number of threads [1]
-k INT minimum seed length [19]
-w INT band width for banded alignment [100]
-d INT off-diagonal X-dropoff [100]
-r FLOAT look for internal seeds inside a seed longer than {-k} * FLOAT [1.5]
-y INT seed occurrence for the 3rd round seeding [20]
-c INT skip seeds with more than INT occurrences [500]
-D FLOAT drop chains shorter than FLOAT fraction of the longest overlapping chain [0.50]
-W INT discard a chain if seeded bases shorter than INT [0]
-m INT perform at most INT rounds of mate rescues for each read [50]
-S skip mate rescue
-P skip pairing; mate rescue performed unless -S also in use
Scoring options:
-A INT score for a sequence match, which scales options -TdBOELU unless overridden [1]
-B INT penalty for a mismatch [4]
-O INT[,INT] gap open penalties for deletions and insertions [6,6]
-E INT[,INT] gap extension penalty; a gap of size k cost '{-O} + {-E}*k' [1,1]
-L INT[,INT] penalty for 5'- and 3'-end clipping [5,5]
-U INT penalty for an unpaired read pair [17]
Input/output options:
-p smart pairing (ignoring in2.fq)
-R STR read group header line such as '@RG\tID:foo\tSM:bar' [null]
-H STR/FILE insert STR to header if it starts with @; or insert lines in FILE [null]
-j treat ALT contigs as part of the primary assembly (i.e. ignore <idxbase>.alt file)
-5 for split alignment, take the alignment with the smallest coordinate as primary
-q don't modify mapQ of supplementary alignments
-K INT process INT input bases in each batch regardless of nThreads (for reproducibility) []
-v INT verbose level: 1=error, 2=warning, 3=message, 4+=debugging [3]
-T INT minimum score to output [30]
-h INT[,INT] if there are <INT hits with score >80.00% of the max score, output all in XA [5,200]
-z FLOAT the fraction of the max score to use with -h [0.80]
-u output XB instead of XA; XB is XA with the alignment score and mapping quality added
-a output all alignments for SE or unpaired PE
-C append FASTA/FASTQ comment to SAM output
-V output the reference FASTA header in the XR tag
-Y use soft clipping for supplementary alignments
-M mark shorter split hits as secondary
-I FLOAT[,FLOAT[,INT[,INT]]]
specify the mean, standard deviation (10% of the mean if absent), max
(4 sigma from the mean if absent) and min of the insert size distribution.
FR orientation only. [inferred]
Bisulfite (--meth) options:
--meth enable inline bwameth-style C→T/G→A read conversion + meth-aware BAM
emission. Implies --bam. Requires the reference to have been built
with `bwa-mem3 index --meth` (emits ref.fa.bwameth.c2t).
--set-as-failed f|r
flag alignments to the matching strand ('f' or 'r') as QC-fail (0x200)
--do-not-penalize-chimeras
disable the longest-match <44% chimera heuristic (no 0x200 / MAPQ cap)
Supplementary MAPQ rescoring (fg-labs extension):
--supp-rep-hard-cap INT
force MAPQ=0 for supplementary alignments whose chain contains any seed
with >=INT genome occurrences (i.e. the supp region is repetitive on its
own). 0 disables (default). Typical values 5-20; lower = more aggressive.
Primary MAPQ is unaffected.
Help:
--help print this help message and exit
Note: Please read the man page for detailed description of the command line and options.
Common usage
Paired-end alignment, 16 threads, SAM to stdout:
bwa-mem3 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > out.sam
Paired-end alignment, emit uncompressed BAM, pipe directly to samtools sort:
bwa-mem3 mem --bam -t 16 ref.fa R1.fq.gz R2.fq.gz \
| samtools sort -@ 8 -o out.bam -
samtools index out.bam
Paired-end methylation alignment with a read group header:
bwa-mem3 mem --meth -t 16 \
-R '@RG\tID:lib1\tSM:sample1\tPL:ILLUMINA' \
ref.fa R1.fq.gz R2.fq.gz \
| samtools sort -o out.bam -
Flag reference
Input / output
-o STR — output file
Write output to STR instead of stdout. Honored for both SAM and --bam
output; the path is opened lazily so BAM mode can hand it to htslib instead of
truncating it as a SAM-text file. Stdout redirection (>) remains an
alternative.
--bam[=N] — emit BAM
Emit BAM instead of SAM. N controls BGZF compression: 0 (default when
--bam is used without =) writes uncompressed BAM, which costs almost no
CPU and is the recommended mode for piping to samtools sort. Values 1–9
select increasing BGZF deflate levels; use --bam=6 or --bam=9 only when
writing directly to final storage without a downstream sort step.
Tip: Prefer --bam for production pipelines
Uncompressed BAM (--bam or --bam=0) eliminates the text-formatting cost on the aligner side and the text-parse cost on the samtools sort side. For any pipeline that immediately sorts or processes the output, this is faster than SAM at no quality cost.
-R STR — read group header
Injects a @RG header line and tags every alignment with RG:Z:<ID>. The
value is a tab-separated @RG line with literal \t escapes, for example:
-R '@RG\tID:run1\tSM:HG001\tPL:ILLUMINA\tLB:lib1'
bwa-mem3 escapes any literal tab characters inside -R values before writing
them to the @PG CL: field, preventing header corruption (fix for issue #45).
-H STR/FILE — extra header lines
If STR begins with @, it is injected verbatim as a header line. Otherwise
STR is treated as a path and every line in the file is injected. Useful for
adding @CO comments or custom @RG / @PG entries.
-p — smart pairing
Reads interleaved paired-end data from a single FASTQ file (in1.fq) rather
than two separate files. The second positional argument (in2.fq) is ignored.
-5 — leftmost-coordinate primary
For split alignments, designates the alignment with the smallest genomic coordinate as primary, rather than the longest alignment. Useful for some downstream tools that expect the leftmost alignment to be primary.
-q — preserve supplementary MAPQ
By default, bwa-mem3 may downgrade the MAPQ of supplementary alignments.
-q suppresses that adjustment.
-K INT — fixed batch size
Forces each thread batch to process exactly INT input bases regardless of
the number of threads. Useful when you need bit-for-bit reproducible output
across runs with different -t values: fix -K to the same value and the
output is deterministic.
-v INT — verbosity
Controls stderr diagnostic output: 1 = errors only, 2 = warnings,
3 = informational messages (default), 4+ = debugging.
-a — all alignments
Output all found alignments for single-end or unpaired paired-end reads. The additional alignments are flagged as secondary.
-C — append FASTA/FASTQ comment
Appends the comment field from the FASTA/FASTQ header to the SAM output as an additional column. Useful when the comment carries barcodes or UMIs.
-V — reference header in XR tag
Emits the reference FASTA header line for each alignment position as an XR
SAM tag.
-Y — soft-clip supplementary alignments
Uses soft clipping instead of hard clipping for supplementary alignments. Some downstream tools require this.
-M — mark shorter split hits as secondary
Marks the shorter alignment in a split read as secondary (sets 0x100 flag)
rather than supplementary. Required for compatibility with tools that do not
handle supplementary alignments (e.g. Picard’s duplicate-marking before
certain versions).
-j — treat ALT contigs as primary
Treats ALT contigs as part of the primary assembly by ignoring the
<idxbase>.alt file. Use when your workflow does not include ALT-aware
postprocessing.
Scoring
All scoring flags accept integer values. Changing -A (match score) scales
the penalty flags that default to multiples of -A; explicit overrides of
individual flags are unaffected.
| Flag | Default | Meaning |
|---|---|---|
| -A INT | 1 | Score for a sequence match. Scales -T, -d, -B, -O, -E, -L, -U unless overridden. |
| -B INT | 4 | Mismatch penalty. |
| -O INT[,INT] | 6,6 | Gap open penalty for deletions and insertions respectively. |
| -E INT[,INT] | 1,1 | Gap extension penalty per base. A gap of length k costs -O + -E * k. |
| -L INT[,INT] | 5,5 | Clipping penalty for 5’ and 3’ ends. |
| -U INT | 17 | Penalty for an unpaired read pair (affects mate-rescue scoring). |
| -T INT | 30 | Minimum alignment score to output. Alignments below this threshold are not reported. |
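As a worked instance of the gap cost formula in the -E row, the snippet below evaluates -O + -E * k with the default penalties. gap_cost is a hypothetical helper used only for the arithmetic.

```shell
# gap_cost OPEN EXTEND LENGTH  ->  OPEN + EXTEND * LENGTH
gap_cost() { echo $(( $1 + $2 * $3 )); }

gap_cost 6 1 1    # defaults (-O 6 -E 1), 1-base gap  -> 7
gap_cost 6 1 10   # defaults (-O 6 -E 1), 10-base gap -> 16
```

With the default match score of 1, a single-base gap therefore costs as much as seven matching bases gain.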
Note: --meth overrides scoring defaults
When --meth is active, bwa-mem3 applies bwameth.py-compatible defaults: -B 2 -L 10 -U 100 -T 40 -CM. Any of these can still be overridden by passing the flag explicitly after --meth.
Paired-end
-I FLOAT[,FLOAT[,INT[,INT]]] — insert size distribution
Specifies the mean, standard deviation (default: 10% of mean), maximum (default: 4 sigma above mean), and minimum of the insert size distribution for FR-orientation paired-end reads. By default bwa-mem3 infers these parameters from the first batch of reads. Provide them explicitly for speed or when the reference is short and inference may be inaccurate.
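The fallback rules for the omitted fields can be worked through numerically. The sketch below only reproduces the arithmetic stated above (standard deviation of 10% of the mean, maximum of mean plus four standard deviations); insert_defaults is a hypothetical helper, and the minimum is left out because no default for it is given here.

```shell
# insert_defaults MEAN [SD]  ->  prints "mean sd max"
# Arithmetic only; not taken from the bwa-mem3 source.
insert_defaults() {
  awk -v mean="$1" -v sd="${2:-}" 'BEGIN {
    if (sd == "") sd = 0.10 * mean      # 10% of the mean if absent
    max = mean + 4 * sd                 # 4 sigma above the mean
    printf "%g %g %g\n", mean, sd, max
  }'
}

insert_defaults 350      # -I 350    -> 350 35 490
insert_defaults 350 50   # -I 350,50 -> 350 50 550
```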
-m INT — mate rescue rounds
Maximum number of mate-rescue attempts per read. Reduce to speed up alignment on data where the default (50) wastes time on unrescuable pairs.
-S — skip mate rescue
Disables mate rescue entirely. Faster but may reduce sensitivity for discordant pairs.
-P — skip pairing
Skips the pairing step; mate rescue still runs unless -S is also given.
Filtering
-c INT — skip repetitive seeds
Seeds with more than INT occurrences in the reference are skipped. Lowering
this (e.g. to 50) speeds up alignment of highly repetitive reads but may
reduce sensitivity. Raising it increases sensitivity in repeat-heavy regions
at a cost in runtime.
-D FLOAT — chain length fraction
Drops chains shorter than FLOAT times the longest overlapping chain. The
default (0.50) discards chains that are less than half the length of the best
chain.
-W INT — minimum seeded bases
Discards chains with fewer than INT seeded bases. Raising this filters out
very short, low-confidence chains.
-h INT[,INT] — secondary alignment reporting
If there are fewer than INT hits with score exceeding FLOAT (see -z)
times the maximum score, all of them are output in the XA auxiliary tag.
The second integer is a hard cap on the number of XA entries. Defaults: 5, 200.
-z FLOAT — secondary score fraction
Fraction of the maximum alignment score used as the threshold for secondary
hit reporting with -h. Default: 0.80.
-u — emit XB instead of XA
Outputs XB in place of XA. XB is an extension of XA that also carries
the alignment score and mapping quality for each secondary hit.
Methylation (--meth)
--meth — enable bisulfite alignment mode
Activates inline C→T (R1) and G→A (R2) read conversion, bwameth-compatible
scoring defaults, inline BAM post-processing, and forces --bam output.
The reference must have been indexed with bwa-mem3 index --meth.
Pass the original FASTA prefix as <idxbase> — the .bwameth.c2t suffix is
appended automatically. If <idxbase> already ends in .bwameth.c2t
(interop with an external c2t converter), the auto-append is skipped.
See Methylation Reference for the full treatment.
--set-as-failed {f|r} — strand QC-fail flag
Forces the QC-fail bit (0x200) on all alignments to the forward (f) or
reverse (r) bisulfite strand. Used when one strand is known to be
unreliable for a given library preparation.
--do-not-penalize-chimeras — disable chimera heuristic
Disables the longest-match < 44% chimera heuristic that would otherwise set
0x200, clear 0x2, and cap MAPQ at 1 for likely chimeric alignments.
Use when the default chimera filter is too aggressive for your library type.
Threading
-t INT — number of threads
Number of worker threads. Defaults to 1. Set to the number of physical cores available to this job. Scaling is workload- and hardware-dependent: on typical machines the curve flattens around 16–32 threads (FM-index bandwidth and I/O contention dominate); on high-memory / fast-I/O servers the aligner can keep scaling toward ~64 threads on hg38 before saturating.
See User Guide — Threading and resource use for guidance on thread counts at various machine sizes.
Supplementary MAPQ rescoring
--supp-rep-hard-cap INT — cap MAPQ for repetitive supplementary alignments
Forces MAPQ=0 for supplementary alignments whose chain contains any seed with
at least INT occurrences in the genome. This targets supplementary
alignments anchored in repetitive regions that upstream MAPQ scoring may
overestimate. 0 disables the cap (default). Typical values are 5–20; lower
values are more aggressive. Primary alignment MAPQ is unaffected.
Debug
-k INT — minimum seed length
Minimum exact-match seed length. Shorter seeds increase sensitivity but raise runtime. The default (19) is calibrated for 100–150 bp Illumina reads.
-w INT — band width
Band width for the banded Smith-Waterman extension. Wider bands can recover alignments with long indels at greater CPU cost.
-d INT — X-dropoff
Off-diagonal X-dropoff for the Z-drop heuristic. Controls how far an alignment extension continues after a score drop.
-r FLOAT — re-seeding factor
Seeds longer than -k * FLOAT are re-seeded internally to find sub-seeds.
Lowering this produces more seeds and higher sensitivity at greater cost.
-y INT — third-round seed occurrence threshold
Seed occurrence threshold for the third round of seeding. Rarely needs adjustment outside highly repetitive genomes.
Notes / Gotchas
Warning: --meth requires a --meth index
Running bwa-mem3 mem --meth against a standard (non-c2t) index produces incorrect alignments without an error. Confirm that the index was built with bwa-mem3 index --meth before aligning bisulfite data.
Note: SIMD variant printed to stderr at startup
When mem starts it prints a banner (Executing in AVX512 mode!! etc.) to stderr. This is informational and does not affect stdout output.
See also: User Guide — Aligning short reads · User Guide — Output: SAM/BAM, headers, tags · CLI Reference — index · Methylation Reference — Overview · Best Practices — Output format
shm
bwa-mem3 shm stages an FM-index into POSIX shared memory so that subsequent
bwa-mem3 mem invocations on the same machine attach to the in-memory segment
instead of re-reading the index files from disk. For workloads that align many
small samples back-to-back against the same reference — such as clinical
panels or amplicon sequencing — this removes the dominant I/O bottleneck.
shm also lists and destroys staged segments.
Synopsis
Usage: bwa-mem3 shm [-d|-l|--help] [--meth] [idxbase]
Options:
-d destroy all indices in shared memory (matches bwa v1 behavior)
-l list names of indices in shared memory
--meth stage a `bwa-mem3 index --meth` index — auto-appends
`.bwameth.c2t` to <idxbase>, mirroring `mem --meth`
-h --help print this help and exit
Stage with no flags: `bwa-mem3 shm <idxbase>` loads the index into
POSIX shared memory; subsequent `bwa-mem3 mem <idxbase> ...` runs
auto-attach instead of re-reading from disk. For meth indices, pass
the same plain `<idxbase>` to all three commands plus `--meth` on
`index`, `shm`, and `mem` (the c2t suffix is auto-appended).
Footgun: if you re-build the index, run `bwa-mem3 shm -d` first.
There is no staleness check -- a stale segment will silently mis-align.
macOS: POSIX shm has implementation-defined per-segment caps; large
indices may simply fail to stage. Prefer Linux for production.
Linux: /dev/shm defaults to ~50% of RAM on bare metal; in containers
it is often much smaller and may need raising via --shm-size
(Docker) or an emptyDir tmpfs (Kubernetes).
Common usage
Stage a standard index, align two samples, then release the segment:
bwa-mem3 shm ref.fa
bwa-mem3 mem -t 16 ref.fa sample1_R1.fq sample1_R2.fq > sample1.sam
bwa-mem3 mem -t 16 ref.fa sample2_R1.fq sample2_R2.fq > sample2.sam
bwa-mem3 shm -d
Stage a methylation index and align:
bwa-mem3 shm --meth ref.fa
bwa-mem3 mem --meth -t 16 ref.fa R1.fq R2.fq | samtools sort -o out.bam -
bwa-mem3 shm -d
List all currently staged segments:
bwa-mem3 shm -l
Flag reference
(no flags) <idxbase> — stage an index
Loads all index files for <idxbase> into a POSIX shared-memory segment.
After staging, any bwa-mem3 mem <idxbase> ... on the same machine
auto-attaches and reads from memory rather than disk.
-d — destroy all segments
Removes every bwa-mem3 shared-memory segment on the machine. This is the correct clean-up command after a batch job and the required step before re-building the index (see the footgun warning below).
-l — list staged indices
Prints the names of all currently staged segments. Useful to confirm that staging succeeded before launching alignment jobs.
--meth — stage a methylation index
Auto-appends .bwameth.c2t to <idxbase> before staging, mirroring the
behavior of bwa-mem3 index --meth and bwa-mem3 mem --meth. Pass the
same plain <idxbase> to all three commands; the c2t suffix is handled
transparently.
Notes / Gotchas
Warning — No staleness check — always destroy before re-indexing
There is no staleness check. If you re-run bwa-mem3 index ref.fa after staging, the on-disk index files will not match the in-memory segment, but bwa-mem3 mem will still attach to the stale segment and silently produce incorrect alignments. Always run bwa-mem3 shm -d before re-indexing.
Note: Platform limits
macOS: POSIX shared memory has implementation-defined per-segment size caps. Staging a full hg38 index (~28 GB) may fail silently or with a cryptic error. Prefer Linux for production use with large references.
Linux containers: /dev/shm typically defaults to ~50% of physical RAM on bare metal but is often much smaller inside Docker containers or Kubernetes pods. Raise the limit with --shm-size (Docker) or an emptyDir tmpfs volume with an explicit size (Kubernetes) before attempting to stage a large index.
See also: Getting Started — Quick start: shared-memory index · CLI Reference — index · CLI Reference — mem · Best Practices — Multi-sample workflows · Best Practices — Anti-patterns
version
bwa-mem3 version prints the release version of the binary and, when
mimalloc is compiled in, the mimalloc version. It is a quick way to confirm
which build is on PATH and whether the allocator is active.
Synopsis
mimalloc 3.3.0
v<MAJOR.MINOR>-<N>-g<COMMIT>
Common usage
Confirm the installed version and allocator:
./bwa-mem3 version
A typical run prints two lines (the mimalloc line goes to stderr and the version string to stdout, so the order in a merged stream is not guaranteed):
mimalloc 3.3.0
v0.1.0-pre-7-g613a4dc
The first line is the mimalloc version (present only when USE_MIMALLOC=1,
the default). The second line is the bwa-mem3 version string, derived from
git describe at build time and stored as PACKAGE_VERSION in the binary.
When building from a tarball without git history, the fallback value is set
via FG_LABS_VERSION_FALLBACK at compile time.
Notes / Gotchas
Note — mimalloc line goes to stderr, version to stdout
The mimalloc line is printed to stderr and the version string to stdout. Scripts that capture the version should redirect stderr appropriately or use bwa-mem3 version 2>/dev/null.
Tip: No mimalloc line means USE_MIMALLOC=0
If only the version string appears and no mimalloc line, the binary was built without the bundled allocator (make USE_MIMALLOC=0). See the Memory allocator (mimalloc) page in the User Guide for when this is appropriate.
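The stream split matters when a script needs one value or the other. The sketch below uses a stand-in function, fake_bwa_mem3_version (hypothetical, so the example runs without the binary), that mimics the documented behavior; in a real script, replace it with bwa-mem3 version.

```shell
# Stand-in for `bwa-mem3 version`: mimalloc line on stderr, version string
# on stdout, as described on this page. Hypothetical; for illustration only.
fake_bwa_mem3_version() {
  echo "mimalloc 3.3.0" >&2
  echo "v0.1.0-pre-7-g613a4dc"
}

# Keep only stdout (the version string).
version=$(fake_bwa_mem3_version 2>/dev/null)

# Keep only stderr (the allocator line): point stderr at the capture first,
# then discard stdout.
allocator=$(fake_bwa_mem3_version 2>&1 >/dev/null)

echo "version:   $version"
echo "allocator: ${allocator:-none reported}"
```

The order of the two redirections in the second capture is significant: writing >/dev/null 2>&1 instead would discard both streams.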
See also: User Guide — Memory allocator (mimalloc) · Developer Guide — Release process · Getting Started — Installation · What’s Different — Build & infrastructure
Methylation Reference Overview
bwa-mem3 --meth is a single-binary, single-command drop-in replacement for
the bwameth.py bisulfite-sequencing
alignment pipeline. No Python installation, no piped preprocessing step, and no
separate post-processing script — one bwa-mem3 index --meth builds the
reference, and one bwa-mem3 mem --meth aligns and post-processes reads from
raw FASTQ to sorted-ready BAM.
The output BAM is structurally equivalent to what the bwameth.py pipeline
produces: consolidated @SQ headers (one entry per real chromosome rather than
one per doubled-reference contig), YS:Z: / YC:Z: tags carrying the
original pre-conversion read sequence and the conversion direction, YD:Z:
indicating the strand hypothesis, chimera QC flags, and a @PG ID:bwa-mem3-meth
provenance entry. Downstream tools that consume bwameth.py output — including
MethylDackel and Bismark — work without change.
Pipeline at a glance
The diagram below shows the internal flow when bwa-mem3 mem --meth runs.
Every step executes inside the single process; no external programs or temporary
files are required.
flowchart LR
A[Raw FASTQ\nR1 / R2] -->|inline C→T / G→A| B[c2t-converted reads\n+ YS:Z + YC:Z in comment]
B -->|bwa mem core| C[mem_aln_t\nalignments vs doubled ref]
C -->|chrom map\nf/r → real chr| D[header rewrite\n@SQ consolidated]
D -->|YD:Z tagging\nchimera QC\nQC-fail propagation| E[BAM output\nwb0 uncompressed]
Steps:
- FASTQ ingest with inline c2t conversion. R1 bases have every C replaced with T; R2 bases have every G replaced with A. The original bases are preserved in a YS:Z: comment field and the conversion direction is stored in YC:Z:. This conversion happens in memory; the FASTQ is never written to disk in converted form.
- Alignment against the doubled reference. The converted reads are aligned against the ref.fa.bwameth.c2t reference, which contains both a forward C→T projection (f-prefixed contigs) and a reverse G→A projection (r-prefixed contigs) of each chromosome.
- Header rewriting and chrom consolidation. The f/r-prefixed contig names used internally are collapsed: every pair fchr1/rchr1 becomes a single @SQ SN:chr1 entry in the output BAM header. RNAME and RNEXT fields in each record are rewritten to the consolidated name.
- Tag emission and QC. Each aligned record receives a YD:Z:{f,r} tag indicating which strand it mapped to. The chimera QC heuristic flags records whose longest M/=/X CIGAR run covers less than 44% of the read length. QC-fail flags propagate across all records in a read group. The original pre-c2t sequence from YS:Z: is copied back into the BAM SEQ field so that methylation callers (e.g. MethylDackel) see real cytosines rather than the converted sequence.
- BAM output. Records are written as uncompressed BAM (wb0 mode via htslib). The @PG ID:bwa-mem3-meth line records the exact command line. The caller pipes directly to samtools sort.
Quick-start commands
# Index the reference once (builds ref.fa.bwameth.c2t + FMI)
bwa-mem3 index --meth ref.fa
# Align paired-end FASTQs
bwa-mem3 mem --meth -t 16 ref.fa R1.fq.gz R2.fq.gz \
| samtools sort -o out.bam
samtools index out.bam
Note — bwameth.py compatibility
The default scoring parameters applied by
--meth(-B 2 -L 10 -U 100 -T 40 -CM) match those used by bwameth.py so outputs are comparable. Any parameter can be overridden on the command line.
See also: bwameth.py drop-in mapping · Conversion details · SAM tags: YS, YC, YD · Chimera QC and header rewriting · Quick start: methylation alignment
bwameth.py Drop-In Mapping
bwa-mem3 --meth is designed to produce output that is equivalent to the
bwameth.py pipeline for the standard paired-end case. This page explains what
changes between the two approaches and what stays the same.
Command comparison
bwameth.py pipeline (multi-step)
# Step 1: build a doubled reference with bwameth.py
bwameth.py index ref.fa # writes ref.fa.bwameth.c2t + bwa-mem2 FMI
# Step 2: align (bwameth.py converts reads, calls bwa-mem2, post-processes)
bwameth.py map --bwa-mem2 -t 16 ref.fa R1.fq.gz R2.fq.gz \
| samtools sort -o out.bam
samtools index out.bam
bwa-mem3 --meth (single binary)
# Step 1: build the doubled reference with bwa-mem3
bwa-mem3 index --meth ref.fa # same ref.fa.bwameth.c2t layout as bwameth.py
# Step 2: align (inline c2t conversion + post-processing, no Python)
bwa-mem3 mem --meth -t 16 ref.fa R1.fq.gz R2.fq.gz \
| samtools sort -o out.bam
samtools index out.bam
The index files produced by bwa-mem3 index --meth and bwameth.py index are
identical in layout: the same ref.fa.bwameth.c2t doubled-reference FASTA
followed by the bwa-mem2 FM-index files (.bwt.2bit.64, .0123, .pac,
.amb, .ann).
What is gained
No Python or bwameth.py dependency. The entire pipeline — read conversion,
alignment, and BAM post-processing — runs inside a single bwa-mem3 process.
This simplifies deployment: one binary, no virtual environment, no version
pinning of bwameth.py.
No intermediate files. bwameth.py writes a converted FASTQ (or pipes it)
before handing off to the aligner. bwa-mem3 --meth performs the C→T / G→A
conversion in-memory on each read batch before passing it to the alignment
kernel. No temporary FASTQ is written and no extra pipe stage is needed.
Inline BAM post-processing. Header rewriting, YD:Z: tagging, chimera QC,
and QC-fail propagation all happen inside the same process and the same pass
over the alignments. There is no separate post-processing step. Output is
written as uncompressed BAM (wb0) — a near-zero-cost format that downstream
samtools sort reads natively.
Same flag defaults. --meth applies -B 2 -L 10 -U 100 -T 40 -CM
automatically, matching bwameth.py’s default scoring. All parameters can be
overridden.
What stays the same
The output BAM is field-compatible with bwameth.py output for the standard
methylation tag set, flags, and SEQ representation (the @PG provenance line
intentionally differs — see below):
| Field | bwameth.py | bwa-mem3 --meth |
|---|---|---|
| @SQ headers | One per real chromosome | One per real chromosome |
| YS:Z: | Pre-c2t original sequence | Same |
| YC:Z: | Conversion direction (CT or GA) | Same |
| YD:Z: | Strand (f or r) | Same |
| @PG | ID:bwameth | ID:bwa-mem3-meth |
| Chimera QC threshold | Longest M < 44% of read | Same (44%) |
| Chimera QC flags | 0x200, clear 0x2, MAPQ ≤ 1 | Same |
| SEQ field | Pre-c2t bases (RC-flipped when is_rev) | Same |
The @PG ID: is intentionally different so provenance is unambiguous. All
downstream tools that rely on YS:Z:, YC:Z:, YD:Z:, and the QC flags
behave identically.
Info — End-to-end regression coverage
PR #13 includes a three-layer regression test that verifies 100% chrom+pos match, 100% CIGAR match, and byte-identical SEQ across 92,684 paired-end records compared to a bwameth.py reference run.
When to prefer bwameth.py
If your workflow requires bwameth.py-specific features (e.g. bwameth.py markduplicates or non-standard bwameth.py post-processors), continue using
bwameth.py. bwa-mem3 --meth targets the indexing + alignment + standard
post-processing path only.
See also: Overview · Conversion details · SAM tags: YS, YC, YD · Chimera QC and header rewriting · Related Projects: bwameth.py
Conversion Details (C→T, G→A)
Bisulfite sequencing relies on chemical conversion of unmethylated cytosines to
uracil (read as thymine after PCR). bwa-mem3 --meth models this with an
in-memory read transformation applied to every read before the alignment kernel
sees the bases.
What gets converted
Paired-end bisulfite reads follow a strand convention:
- R1 (read 1): every C in the base sequence is replaced with T. This models the OT (original top) and CTOB (complementary to original bottom) strands as they appear after bisulfite treatment and PCR.
- R2 (read 2): every G in the base sequence is replaced with A. This models the OB (original bottom) and CTOT (complementary to original top) strands.
Single-end mode uses the R1 (C→T) rule for all reads.
The doubled reference built by bwa-mem3 index --meth (or bwameth.py) contains
two projections of each chromosome:
- f-prefixed contigs (e.g. fchr1): the chromosome with every C replaced by T.
- r-prefixed contigs (e.g. rchr1): the reverse complement of the chromosome with every G replaced by A.
Converted R1 reads are therefore alignable to f-prefixed contigs and
converted R2 reads to r-prefixed contigs. The contig prefix records which
strand hypothesis was used and feeds the YD:Z: tag directly.
Where conversion happens
Read conversion runs inside src/fastmap.cpp in the meth_mode ingest block,
immediately after sequence parsing and before any alignment work. The
transformation is applied to the in-memory bseq1_t.seq buffer; the original
FASTQ file is never rewritten.
Before the bases are modified, the original sequence is recorded in the read’s comment buffer as:
YS:Z:<l_seq bases>\tYC:Z:<direction>
where <direction> is CT for R1 (C→T) and GA for R2 (G→A). These fields
pass through the alignment kernel untouched and are emitted as SAM aux tags in
the output BAM. See SAM tags: YS, YC, YD for the per-tag reference.
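The ingest step described above can be sketched in a few lines of Python. This is an illustration, not the code in src/fastmap.cpp: the function name c2t_convert is hypothetical, and the sketch assumes uppercase input bases.

```python
def c2t_convert(seq, is_read2=False):
    """Return (converted_bases, comment) for one read.

    Hypothetical sketch of the ingest step: the original bases are recorded
    in the comment first, then R1 gets C->T and R2 gets G->A.
    Assumes uppercase bases; lowercase input is left unconverted here.
    """
    if is_read2:
        converted, direction = seq.replace("G", "A"), "GA"
    else:
        converted, direction = seq.replace("C", "T"), "CT"
    # Comment layout from the text above: YS:Z:<bases>\tYC:Z:<direction>
    comment = f"YS:Z:{seq}\tYC:Z:{direction}"
    return converted, comment
```

For example, c2t_convert("ACGTC") yields the converted bases ATGTT, and the comment still carries the original ACGTC.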
Sequence restoration in the BAM SEQ field
Methylation callers such as MethylDackel identify methylated cytosines by
examining the BAM SEQ field at each CpG site. They need to see real C/T
bases — not the uniformly-converted T/A bases that were used for alignment.
meth_mem_aln_to_bam (in src/meth_bam.cpp) restores the original sequence
from YS:Z: before writing the BAM record:
- The YS:Z: payload is located at the start of the bseq1_t.comment field (offset +5 past the YS:Z: header bytes).
- For forward-aligned records (!p.is_rev), the pre-c2t bases are copied directly into the BAM SEQ buffer.
- For reverse-aligned records (p.is_rev), the bases are reverse-complemented using the standard TGCAN table before being placed in SEQ.
- If YS:Z: is absent (e.g. when running with an external c2t converter that does not emit it), the code falls back to the converted sequence in s->seq, with the same RC flip logic.
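The restoration rules above can be sketched as follows. This is a simplified Python illustration rather than the code in src/meth_bam.cpp; the function name restore_seq and its signature are hypothetical, and the qb/qe arguments stand in for the query range of the record being written.

```python
# Complement table for A, C, G, T, N (the "TGCAN" table mentioned above).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def restore_seq(comment, converted_seq, is_rev, qb=0, qe=None):
    """Return the bases to place in SEQ for one record (hypothetical sketch)."""
    if comment.startswith("YS:Z:"):
        # Payload starts 5 bytes in and ends at the tab before YC:Z:.
        bases = comment[5:].split("\t", 1)[0]
    else:
        # No YS:Z: comment: fall back to the converted sequence.
        bases = converted_seq
    if qe is None:
        qe = len(bases)
    bases = bases[qb:qe]  # same trimmed range as the emitted CIGAR
    if is_rev:
        bases = bases.translate(_COMPLEMENT)[::-1]  # reverse complement
    return bases
```

With the comment "YS:Z:ACCGT\tYC:Z:CT", a forward record gets ACCGT back and a reverse record gets its reverse complement, ACGGT.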
Warning — Soft-clip and supplementary trimming
When computing the SEQ range for supplementary alignments, the qb/qe boundaries account for soft-clip or hard-clip operations at the CIGAR ends. The YS:Z: restoration applies over the same trimmed range so SEQ length always matches the emitted CIGAR.
QUAL field handling
The QUAL field is taken directly from the original FASTQ (bseq1_t.qual) over
the same [qb, qe) range and is never modified by the c2t process. Quality
scores correspond to the original base calls, not the converted ones.
Relationship to the reference index
bwa-mem3 index --meth ref.fa writes ref.fa.bwameth.c2t, which applies the
same C→T / G→A projection to the reference sequence. The resulting file is
compatible with what bwameth.py index produces, so the same doubled-reference
FASTA can be used interchangeably with either tool across tested versions.
See also: Overview · SAM tags: YS, YC, YD · Interop with external bwameth.py c2t · User Guide → Indexing the reference · Best Practices → Methylation defaults
SAM Tags: YS, YC, YD
bwa-mem3 --meth emits three methylation-specific auxiliary tags that carry
the information downstream methylation callers need. Two of these (YS:Z: and
YC:Z:) are set during FASTQ ingest and pass through the alignment kernel
unchanged. The third (YD:Z:) is set during BAM post-processing based on the
contig name of the alignment.
Tag reference
YS:Z: — original (pre-conversion) sequence
| Property | Value |
|---|---|
| Type | Z (NUL-terminated string) |
| Length | Equal to l_seq (full read length) |
| Set by | FASTQ ingest (src/fastmap.cpp meth_mode block) |
| Emitted on | All records (mapped and unmapped) |
YS:Z: holds the original base sequence of the read before the C→T or
G→A conversion. The value is the ASCII string of bases as read from the FASTQ,
in read order (not reverse-complemented).
This tag serves two purposes:
- SEQ restoration. meth_mem_aln_to_bam copies the YS:Z: payload back into the BAM SEQ field (with reverse-complement when is_rev is set) so that methylation callers see real cytosines. Without this restoration the SEQ field would show only Ts where Cs existed in the original read.
- Downstream inspection. Tools that need to examine the unconverted sequence independently of the BAM SEQ field can read YS:Z: directly.
Note — Format inside the comment buffer
Internally, the ingest code stores the value as YS:Z:<bases>\tYC:Z:<dir> starting at offset 0 of bseq1_t.comment. meth_mem_aln_to_bam locates the payload at comment + 5 (past the YS:Z: prefix). The two tags are always co-emitted in this order.
YC:Z: — conversion direction
| Property | Value |
|---|---|
| Type | Z (NUL-terminated string) |
| Values | CT (R1, C→T) or GA (R2, G→A) |
| Set by | FASTQ ingest (src/fastmap.cpp meth_mode block) |
| Emitted on | All records |
YC:Z: records which conversion was applied to the read:
- CT: C→T conversion applied; this is an R1 read (or a single-end read).
- GA: G→A conversion applied; this is an R2 read.
bwameth.py uses YC:Z: for the same purpose and with the same values. Tools
such as MethylDackel use YC:Z: to determine which cytosines to call as
methylated. YC:Z:CT records are candidates for CpG methylation on the top
strand; YC:Z:GA records are candidates on the bottom strand.
YD:Z: — strand hypothesis
| Property | Value |
|---|---|
| Type | Z (NUL-terminated string) |
| Values | f (forward / top strand) or r (reverse / bottom strand) |
| Set by | meth_mem_aln_to_bam (src/meth_bam.cpp) |
| Emitted on | Mapped records only (not unmapped) |
YD:Z: records which strand of the doubled reference the read aligned to. The
value is derived from the f/r prefix on the internal contig name via the
meth_chrom_map_t.direction array. Unmapped reads do not receive YD:Z:.
- f: the read aligned to an f-prefixed contig (the C→T projection of the top strand).
- r: the read aligned to an r-prefixed contig (the G→A projection of the bottom strand).
This tag is used by --set-as-failed (see Flags) and is also
consumed by downstream methylation callers to confirm which strand each
alignment supports.
Tag emission summary
| Tag | Records | Source |
|---|---|---|
| YS:Z: | All | FASTQ ingest (comment buffer) |
| YC:Z: | All | FASTQ ingest (comment buffer) |
| YD:Z: | Mapped only | meth_mem_aln_to_bam from chrom map |
Tip — Checking tags with samtools
To inspect these tags on a BAM file:
samtools view out.bam | cut -f12- | grep -oP 'Y[SCD]:Z:[^\t]+'
Or use samtools view -H to confirm the @PG ID:bwa-mem3-meth entry is present and the @SQ lines are consolidated (no f/r prefixes).
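For scripted checks, the same three tags can be pulled out of SAM text (for example, the output of samtools view) with a few lines of Python. This is a minimal sketch; the function name meth_tags is hypothetical, and it relies only on the SAM convention that optional fields start at column 12 in TAG:TYPE:VALUE form.

```python
def meth_tags(sam_line):
    """Return the YS/YC/YD Z-type tags found on one SAM record line."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    for aux in fields[11:]:  # optional fields follow the 11 mandatory columns
        tag, typ, value = aux.split(":", 2)
        if tag in ("YS", "YC", "YD") and typ == "Z":
            tags[tag] = value
    return tags
```

A mapped record is expected to yield all three keys; an unmapped record should lack YD.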
See also: Overview · Conversion details · Chimera QC and header rewriting · Flags: --set-as-failed, --do-not-penalize-chimeras · User Guide → Output: SAM/BAM, headers, tags
Chimera QC and Header Rewriting
After the alignment kernel produces mem_aln_t records, bwa-mem3 --meth
applies a set of post-processing steps before writing BAM output. These steps
are implemented in src/meth_bam.cpp and run in the same process, in the same
pass over the aligned records.
@SQ header consolidation
The doubled reference (ref.fa.bwameth.c2t) contains two contigs for each
chromosome:
- fchr1, fchr2, …: C→T projections of each chromosome.
- rchr1, rchr2, …: G→A projections of each chromosome.
If the raw alignment header were written directly, every downstream tool would
see twice as many sequences as there are real chromosomes, with unfamiliar
f/r-prefixed names. meth_bam_writer_open instead builds a consolidated
header using the meth_chrom_map_t:
- meth_chrom_map_build_from_bns iterates over bns->anns and strips the leading f/r from each contig name.
- The first contig with a given stripped name registers that name in the output list; subsequent contigs with the same stripped name map to the same output index.
- The BAM @SQ lines are written from the consolidated list, one SN: per real chromosome.
RNAME, RNEXT, and SA/XA tag contig references in every record are rewritten
through cmap->out_tid and cmap->output_names so they reference the
consolidated names. The mapping from internal (doubled-ref) contig index to
output contig index is cmap->out_tid[p.rid].
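The consolidation can be pictured with a short Python sketch. It mirrors the description above rather than the actual meth_chrom_map_t code; build_chrom_map is a hypothetical name, and the three returned lists stand in for output_names, out_tid, and direction.

```python
def build_chrom_map(contig_names):
    """Collapse f/r-prefixed contig names into one output entry per chromosome.

    Returns (output_names, out_tid, direction), where out_tid[i] is the
    consolidated index of internal contig i and direction[i] is 'f' or 'r'.
    """
    output_names, out_tid, direction = [], [], []
    index_of = {}
    for name in contig_names:
        prefix, stripped = name[:1], name[1:]
        if prefix not in ("f", "r"):
            raise ValueError(f"contig {name!r} lacks an f/r prefix")
        if stripped not in index_of:
            # First contig with this stripped name registers the output entry.
            index_of[stripped] = len(output_names)
            output_names.append(stripped)
        out_tid.append(index_of[stripped])
        direction.append(prefix)
    return output_names, out_tid, direction
```

With this map, rewriting RNAME for a record on internal contig rid amounts to output_names[out_tid[rid]].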
Note — TLEN computation uses consolidated TIDs
Template length (TLEN) is computed using the consolidated output TIDs, not the internal p.rid values. Two mates that rescue onto fchr1 and rchr1 respectively both map to output chr1, so TLEN is reported as a non-zero distance rather than zero (which would happen if the mismatched internal TIDs were used).
Chimera QC heuristic
bwameth.py applies a heuristic to flag reads that look like chimeric fragments: if the longest contiguous alignment run (sum of M/=/X CIGAR operations) covers less than 44% of the read length, the read is considered a potential chimera.
bwa-mem3 --meth applies the same heuristic inside meth_mem_aln_to_bam:
if (100 * longest_M_run < 44 * l_seq):
    flag |= 0x200          # set QC fail
    flag &= ~0x2           # clear proper pair
    mapq = min(mapq, 1)    # cap MAPQ at 1
The threshold constant is MIN_LONGEST_M_PCT = 44 (defined at the top of
src/meth_bam.cpp). The longest run is computed by cigar_longest_m_mem
from src/cigar_util.cpp, which counts M, =, and X operations.
The chimera heuristic is only applied to mapped records (!(flag & 0x4) && direction != 0). Unmapped records are not touched.
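A runnable Python sketch of the heuristic follows. It works on a CIGAR string rather than the packed CIGAR used in the C++ code, and it assumes that adjacent M/=/X operations are summed into one run while any other operation ends the run; that reading of "longest run" is an assumption, as are the function names. The direction != 0 condition is omitted for brevity.

```python
import re

MIN_LONGEST_M_PCT = 44  # threshold named in the text above
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def longest_match_run(cigar):
    """Longest stretch of adjacent M/=/X operations (assumed semantics)."""
    longest = run = 0
    for length, op in _CIGAR_OP.findall(cigar):
        if op in "M=X":
            run += int(length)
            longest = max(longest, run)
        else:
            run = 0  # any other operation ends the current run
    return longest


def apply_chimera_qc(flag, mapq, cigar, l_seq):
    """Return (flag, mapq) after the chimera check; unmapped records pass through."""
    if flag & 0x4:  # unmapped: not touched
        return flag, mapq
    if 100 * longest_match_run(cigar) < MIN_LONGEST_M_PCT * l_seq:
        flag |= 0x200         # set QC fail
        flag &= ~0x2          # clear proper pair
        mapq = min(mapq, 1)   # cap MAPQ at 1
    return flag, mapq
```

The integer comparison avoids floating point: a 100 bp read with a 44M run sits exactly on the threshold and is not flagged, while 40M is.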
To disable this heuristic, pass --do-not-penalize-chimeras. See
Flags for details.
--set-as-failed strand filtering
Before the chimera check, meth_mem_aln_to_bam checks whether
opt->meth_set_as_failed is set and matches the record’s strand direction:
if (meth_set_as_failed != 0 and meth_set_as_failed == direction):
    flag |= 0x200          # set QC fail on the selected strand
This unconditionally marks all alignments to the specified strand (f or r)
as QC-failed before chimera logic runs. The chimera check then applies on top
of the already-set fail flag.
Pair-level QC-fail propagation
Once per read group (all records sharing the same query name), after individual records have been processed:
meth_bam_group_propagate_qcfail(group, n)
This function scans all records in the group. If any record has 0x200 set, it
propagates that flag to every other record in the group and clears 0x2
(proper pair) on all of them. This ensures that a chimeric or strand-filtered
primary alignment also marks its supplementary alignments and the mate as
QC-failed, preventing inconsistent flag states in the output BAM.
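The propagation rule is small enough to sketch directly. The Python below operates on a plain list of SAM flag integers for one query name; propagate_qcfail is a hypothetical stand-in for meth_bam_group_propagate_qcfail, not its real signature.

```python
def propagate_qcfail(flags):
    """Spread 0x200 across a group of records that share a query name.

    If any record is QC-failed, every record gets 0x200 set and 0x2
    (proper pair) cleared; otherwise the flags are returned unchanged.
    """
    if any(flag & 0x200 for flag in flags):
        return [(flag | 0x200) & ~0x2 for flag in flags]
    return list(flags)
```

For a pair whose first mate carries 0x200, both mates end up QC-failed and neither is marked as a proper pair.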
@PG ID:bwa-mem3-meth insertion
meth_bam_writer_open appends a @PG line to the header after the original
bwa-mem3 @PG entry:
@PG ID:bwa-mem3-meth PN:bwa-mem3-meth VN:<version>-meth CL:<command line>
The <command line> field is the full bwa-mem3 mem --meth ... invocation with
embedded tab characters replaced by spaces (htslib does not permit literal tabs
in @PG CL: fields). This records the exact parameters used for provenance
and reproducibility.
Tip — Verifying the header
After alignment, confirm consolidation and provenance with:
samtools view -H out.bam | grep -E '^@SQ|^@PG'
You should see one @SQ line per chromosome (no f/r prefixes) and both @PG ID:bwa-mem3 and @PG ID:bwa-mem3-meth entries.
See also: Overview · SAM tags: YS, YC, YD · Flags: --set-as-failed, --do-not-penalize-chimeras · Conversion details · User Guide → Output: SAM/BAM, headers, tags
Flags: --set-as-failed, --do-not-penalize-chimeras
bwa-mem3 --meth adds two flags that control QC behavior during BAM
post-processing. Both flags affect the chimera QC and strand-filtering logic
inside meth_mem_aln_to_bam (src/meth_bam.cpp).
--set-as-failed {f|r}
Marks every alignment to the specified strand as QC-failed (0x200) regardless
of alignment quality or CIGAR structure.
Accepted values:
- f: flag all alignments to f-prefixed contigs (C→T top-strand projection).
- r: flag all alignments to r-prefixed contigs (G→A bottom-strand projection).
Effect on records:
When --set-as-failed f (or r) is set and a mapped record’s YD:Z: strand
matches the specified value, the record’s SAM flag has 0x200 set. The chimera
heuristic then runs on top of the already-set flag, but since 0x200 is already
present, it can only enforce additional constraints (clearing 0x2, capping
MAPQ). QC-fail propagation then spreads the flag to all records in the read
group.
When to use it:
Some experimental designs produce reads that are expected to align exclusively to
one strand. Flagging the other strand as QC-failed before downstream analysis
prevents spurious methylation calls from mis-strand alignments. It is also
useful for diagnosing library preparation issues: run once with --set-as-failed r
and once without to compare yield on each strand.
Warning — All records on the strand are flagged
--set-as-failed is a blunt instrument. It marks every alignment to the chosen strand, including correctly aligned reads that simply happened to land on the complementary strand due to library structure. Use this flag only when your library is expected to be strand-specific.
--do-not-penalize-chimeras
Disables the chimera QC heuristic entirely.
Without this flag, any mapped record whose longest M/=/X CIGAR run covers less than 44% of the read length receives:
- 0x200 (QC fail) set.
- 0x2 (proper pair) cleared.
- MAPQ capped at 1.
With --do-not-penalize-chimeras, none of these penalties are applied. Records
are written with the MAPQ and flags as determined by the alignment kernel. The
chimera check in meth_mem_aln_to_bam is skipped entirely.
When to use it:
The 44% threshold was calibrated for standard mammalian whole-genome bisulfite sequencing (WGBS) libraries with typical read lengths. For short reads (< 50 bp), reads with large insertions, or amplicon designs where short alignments are expected, the heuristic can produce excessive false-positive flagging. In those cases, disable it and apply a custom chimera filter downstream.
It is also useful when benchmarking: comparing bwa-mem3 --meth output against
bwameth.py output on a specific dataset is cleaner when chimera filtering is
disabled, since bwameth.py’s chimera logic may differ in edge cases.
Note — Pair-level propagation still applies
--do-not-penalize-chimeras only suppresses the chimera heuristic. If --set-as-failed is also active, those flags are still set, and meth_bam_group_propagate_qcfail still propagates any 0x200 flags across the read group.
Flag interaction summary
| Condition | 0x200 set? | 0x2 cleared? | MAPQ capped? |
|---|---|---|---|
| Normal aligned record | No | No | No |
| Chimera heuristic triggers (default) | Yes | Yes | Yes (≤1) |
| --set-as-failed strand matches | Yes | No | No |
| Both chimera + --set-as-failed active | Yes | Yes | Yes (≤1) |
| --do-not-penalize-chimeras only | No | No | No |
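The rows of the table can be reproduced with a small per-record sketch in Python. It models only the per-record step (strand filter first, then the chimera penalty) and not the later pair-level propagation; the function name, its parameters, and the boolean is_chimera input are all hypothetical simplifications.

```python
QC_FAIL = 0x200
PROPER_PAIR = 0x2


def apply_record_qc(flag, mapq, direction, is_chimera,
                    set_as_failed=None, penalize_chimeras=True):
    """Return (flag, mapq) for one mapped record.

    direction is 'f' or 'r'; set_as_failed is None, 'f', or 'r';
    penalize_chimeras=False models --do-not-penalize-chimeras.
    """
    # Strand filter runs first and only sets the QC-fail bit.
    if set_as_failed is not None and set_as_failed == direction:
        flag |= QC_FAIL
    # Chimera penalty then applies on top of whatever is already set.
    if penalize_chimeras and is_chimera:
        flag |= QC_FAIL
        flag &= ~PROPER_PAIR
        mapq = min(mapq, 1)
    return flag, mapq
```

Starting from flag 99 and MAPQ 60, a strand match alone gives (611, 60), while a chimera with default settings gives (609, 1).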
See also: Overview · Chimera QC and header rewriting · SAM tags: YS, YC, YD · Best Practices → Methylation defaults · CLI Reference → mem
Interop with External bwameth.py c2t
Some workflows use bwameth.py’s c2t subcommand to convert reads before
passing them to an aligner. bwa-mem3 --meth supports this pattern by
detecting whether the caller has already provided a pre-converted FASTQ and
whether the reference path already points to the doubled-reference FASTA.
Auto-detect logic for the reference path
When --meth is active, bwa-mem3 mem ordinarily appends .bwameth.c2t to
the reference path so the user can pass the original FASTA prefix:
bwa-mem3 mem --meth -t 16 ref.fa R1.fq.gz R2.fq.gz
# internally uses ref.fa.bwameth.c2t as the reference
If the reference path already ends with .bwameth.c2t, the auto-append is
skipped:
bwa-mem3 mem --meth -t 16 ref.fa.bwameth.c2t R1.fq.gz R2.fq.gz
# no suffix appended; ref.fa.bwameth.c2t is used as-is
This detection is a simple suffix check on the path string. It allows callers that manage the doubled-reference path explicitly to pass it without triggering double-append.
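The suffix check amounts to the following, shown here as a Python sketch; resolve_meth_reference is a hypothetical name used only for illustration.

```python
C2T_SUFFIX = ".bwameth.c2t"


def resolve_meth_reference(path):
    """Return the doubled-reference path used when --meth is active.

    The suffix is appended only when the path does not already end with it,
    so passing either form resolves to the same file.
    """
    if path.endswith(C2T_SUFFIX):
        return path
    return path + C2T_SUFFIX
```

Both ref.fa and ref.fa.bwameth.c2t therefore resolve to ref.fa.bwameth.c2t.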
Using bwameth.py c2t as the read preprocessor
If your pipeline already runs bwameth.py c2t to convert reads (for example,
because it needs to reuse converted reads across multiple aligners), you can
pipe the output directly to bwa-mem3 mem --meth:
bwameth.py c2t R1.fq.gz R2.fq.gz \
| bwa-mem3 mem --meth -p -t 16 ref.fa.bwameth.c2t /dev/stdin \
| samtools sort -o out.bam
Key points for this pattern:
- Pass the .bwameth.c2t reference path explicitly so the auto-append is suppressed.
- Use -p to tell bwa-mem3 mem that the input contains interleaved paired-end reads (bwameth.py c2t emits interleaved output to stdout).
- Use /dev/stdin as the reads argument to read from the pipe.
- The bwa-mem3 --meth inline c2t conversion is not applied when the reads arrive pre-converted; however, YS:Z: and YC:Z: tags are only written by the inline conversion path. If you need those tags in the output, you must either use the integrated mode (no external c2t step) or ensure your external preprocessor emits compatible comments in the FASTQ.
Warning — YS:Z: and YC:Z: require integrated mode
When reads are pre-converted by an external tool and piped in, the inline c2t step in src/fastmap.cpp is bypassed. YS:Z: and YC:Z: tags will not be present in the output BAM unless the external converter writes them as FASTQ comment fields in the format YS:Z:<seq>\tYC:Z:<dir> and those comments are passed through. MethylDackel and similar callers use YS:Z: to restore original bases for methylation calling; if the tag is absent, they fall back to reading SEQ directly, which may affect accuracy.
Header rewriting and BAM post-processing with external c2t
Whether reads are converted inline or externally, all BAM post-processing steps
apply identically when --meth is active:
- @SQ header consolidation (f/r contigs → one entry per chromosome).
- YD:Z: tag emission from the contig name prefix.
- Chimera QC heuristic (unless --do-not-penalize-chimeras is set).
- Pair-level QC-fail propagation.
- @PG ID:bwa-mem3-meth insertion.
The post-processing pipeline depends only on the reference contig names (to
determine YD:Z:) and the alignment flags — not on whether reads were
converted inline or externally.
Summary of path variants
| Reference arg | Read source | Auto-append? | Inline c2t? | YS:Z: emitted? |
|---|---|---|---|---|
| ref.fa | Raw FASTQ | Yes (→ ref.fa.bwameth.c2t) | Yes | Yes |
| ref.fa.bwameth.c2t | Raw FASTQ | No | Yes | Yes |
| ref.fa.bwameth.c2t | Pre-converted (pipe) | No | No | No (unless pre-emitted) |
See also: Overview · Conversion details · SAM tags: YS, YC, YD · bwameth.py drop-in mapping · Related Projects: bwameth.py
What’s Different from bwa-mem2
This section tracks every change that bwa-mem3 carries on top of upstream
bwa-mem2/bwa-mem2’s master branch,
explains why each change was made, and records its upstream disposition.
How this section is organized
Each deep-dive page covers one category of change:
- Correctness fixes: bugs in upstream bwa-mem2 that are fixed in bwa-mem3, including the kswv SIMD score2 plateau series, the proper-pair flag regression, the zero-init crash, the SMEM buffer overflow, and the @PG tab-escape issue.
- Performance improvements: lockstep SMEM batching, batched -H header ingestion, libsais FM-index construction, and the consolidated mapping speedup suite.
- Features: --meth bisulfite mode, mimalloc allocator, --supp-rep-hard-cap, bwa-mem3 shm, shm --meth, the HN:i tag, and the --bam=LEVEL output flag.
- Architecture support: the Linux ARM64/aarch64 build, the arch=avx512bw Makefile target, the NEON kswv mate-rescue kernel, and the AVX2 kswv mate-rescue kernel.
- Build & infrastructure: the doctest framework, Codecov integration, PACKAGE_VERSION from git describe, PGO target parameterization, CXXFLAGS/CPPFLAGS/LDFLAGS forwarding, the unit-test harness, and the CI matrix expansion.
- Upstream PR status: a single table cross-referencing every fork-carried change to its corresponding upstream PR or issue, with current upstream disposition.
Carried on top of upstream
| Commit | Topic | Upstream status |
|---|---|---|
| ae73227 | Apple Silicon / ARM64 NEON support (PR #288 work) | PR #288 open |
| 744a9e7 | ci: cross-platform build + dwgsim phiX end-to-end test | fork-only |
| 490502b | fix: drop unused global stat that shadows libc | fork-only |
Additional fork-level changes
- Vendored mimalloc allocator: ext/mimalloc is pinned at v3.3.0 and linked into every binary by default (USE_MIMALLOC=1). Linux uses --whole-archive static linkage; macOS uses dyld-interposed shared linkage. USE_MIMALLOC=1 is the supported and recommended default on all platforms; USE_MIMALLOC=0 is provided as a best-effort opt-out and is CI-gated on Linux x86 only. See Features for details.
- --supp-rep-hard-cap INT (opt-in, default disabled): forces MAPQ=0 on supplementary alignments whose chain contains a seed with >= INT genome occurrences. Addresses the long-standing bwa/bwa-mem2 issue where a supp fragment that maps to many places standalone (e.g. a short read in a CCATCC repeat) inherits a high MAPQ from its primary because the supp’s competing repetitive chains get filtered out during the full-read pipeline and therefore never contribute to its sub/sub_n. See upstream #260 for the reporter case. Primary MAPQ is unaffected; default output is byte-identical to stock bwa-mem2. Typical values are 5–20 (lower = more aggressive); the upstream #260 repro drops from MAPQ=60 to MAPQ=0 at --supp-rep-hard-cap 18.
Version stamping
PACKAGE_VERSION (the value reported by bwa-mem3 version and written to
the @PG VN: SAM header field) is generated at build time by the Makefile
from git describe --tags --dirty, e.g. v2.3-30-g61813ef for a tree 30
commits past upstream tag v2.3 at commit 61813ef.
- No manual bumping required: cut a fresh release by tagging the commit (git tag -a vX.Y-fg-labs.N -m ...) and the next build picks it up.
- Builds where git describe --tags fails (source-tarball extractions, or shallow clones / checkouts with no tag reachable from HEAD, including CI’s default actions/checkout fetch-depth of 1) fall back to the static FG_LABS_VERSION_FALLBACK in Makefile. Bump that when cutting a release that will be consumed as a tarball, or in CI artifacts.
- src/version.h is generated and .gitignored; make clean removes it.
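When a script needs to interpret the stamped version, the git describe layout (tag, commits since tag, abbreviated commit with a g prefix, optional -dirty suffix) can be split apart as in the Python sketch below. The function name parse_describe and the returned dictionary keys are hypothetical; strings that do not match the layout, such as an exact tag or the fallback value, are returned as the tag with zero commits.

```python
import re

# <tag>-<commits since tag>-g<abbreviated commit>
_DESCRIBE = re.compile(r"^(?P<tag>.+)-(?P<commits>\d+)-g(?P<commit>[0-9a-f]+)$")


def parse_describe(version):
    """Split a `git describe --tags --dirty` style string into its parts."""
    dirty = version.endswith("-dirty")
    core = version[: -len("-dirty")] if dirty else version
    match = _DESCRIBE.match(core)
    if match is None:
        # Exact tag or a static fallback string: no commit information.
        return {"tag": core, "commits": 0, "commit": None, "dirty": dirty}
    return {
        "tag": match["tag"],
        "commits": int(match["commits"]),
        "commit": match["commit"],
        "dirty": dirty,
    }
```

For v2.3-30-g61813ef this yields tag v2.3, 30 commits, and commit 61813ef, matching the example in the text.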
Branching and update policy
- master tracks upstream unchanged.
- main is upstream/master plus the commits above. Rebased onto upstream roughly quarterly, or sooner when an upstream release we care about lands.
- Contributions go via PR targeting main. CI and CodeRabbit gate merges.
- Any PR that adds or removes a fork-carried commit must update the table above in the same PR.
Consuming
Clone this repo and check out main:
git clone https://github.com/fg-labs/bwa-mem3.git
cd bwa-mem3
git checkout main
Or vendor the branch into a downstream repo by pinning to a specific commit (not the branch tip) so your build is reproducible.
Relationship to upstream
We submit the generally-useful fixes and features carried here as PRs against
bwa-mem2/bwa-mem2 when the upstream
maintainers are actively merging; while they are not, fixes land here first
and we drop them from main once they appear upstream.
See also: Correctness fixes · Performance improvements · Features · Upstream PR status · Developer Guide → Contributing
Correctness Fixes
This page documents bugs present in upstream bwa-mem2 that bwa-mem3 fixes. Each
fix is isolated to a single PR so it can be reviewed independently and dropped
from main once upstream merges the equivalent patch.
@PG CL: tab escaping (PR #54)
When a read-group string is passed via -R '@RG\tID:x\tSM:y', the tab
characters in the argument were copied verbatim into the @PG CL: SAM header
field. The SAM specification uses tabs as field delimiters, so the resulting
header line appeared to have extra ID: and other tag fields embedded inside
CL:. Lenient parsers (samtools, htsjdk) tolerated the output; strict parsers
(noodles, some fgbio configurations) rejected the file as malformed.
The fix replaces each tab character with a space when building the @PG CL:
value in src/main.cpp. The @RG line itself is not modified, so the
read-group metadata is preserved correctly. A regression shell test
(test/pg_cl_escape_test.sh) asserts that the @PG line contains exactly
five tab-separated fields after the fix. Upstream issue reference:
bwa-mem2#293.
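The effect of the fix can be illustrated with a short Python sketch that builds a @PG line and applies the same field-count check the regression test describes. This is not the code in src/main.cpp: the function name build_pg_line is hypothetical, and the ID/PN/VN/CL field layout is assumed from the five-field count stated above.

```python
def build_pg_line(version, argv, prog_id="bwa-mem3"):
    """Build a @PG header line whose CL: value contains no tab characters."""
    # Tabs inside arguments (e.g. from -R) would otherwise split CL: into
    # extra SAM fields, so each one is replaced with a space.
    command_line = " ".join(argv).replace("\t", " ")
    fields = [
        "@PG",
        f"ID:{prog_id}",
        f"PN:{prog_id}",
        f"VN:{version}",
        f"CL:{command_line}",
    ]
    return "\t".join(fields)
```

Splitting the result on tabs gives exactly five fields even when the read-group argument itself contains tabs.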
SMEM buffer overflow on reads longer than 151 bp (PR #55)
bwa-mem2 hardcoded READ_LEN 151 in src/macro.h to size the per-thread
matchArray SMEM buffer at compile time. The FMI walk wrote past this buffer
without bounds checking when reads exceeded 151 bp, causing memory corruption
that manifested as segfaults or silent wrong output on 300 bp MiSeq reads,
error-corrected long reads, and any run with a non-default -k that extended
seed length.
A second cap, MAX_READ_LEN_FOR_LOCKSTEP 512, guarded the lockstep driver’s
per-slot stack arrays with a hard assert that aborted on anything longer.
The fix eliminates both compile-time caps. Every per-thread SMEM buffer is now
heap-allocated on the memory management context (mmc) and grown on demand
from each batch’s observed max_readlength. The pre-walk grow in
mem_collect_smem sizes matchArray[tid] to BATCH_MUL * BATCH_SIZE * max_readlength, and all array writes are bounds-checked with a structured
smem_overflow_die on overflow. Regression tests cover 300 bp, 1 kbp, and
3 kbp phiX reads; all three segfaulted before the fix and produce correct
NM:i:0 alignments after. Upstream references:
bwa-mem2#210 (issue),
bwa-mem2#238
(closed unmerged upstream PR).
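As a rough illustration of the grow-on-demand plus bounds-check pattern, the sketch below uses a hypothetical `SmemBuf` type and plain `realloc`. The real code allocates on the mmc context and reports overflow through smem_overflow_die; neither is reproduced here.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Hypothetical stand-in for one per-thread SMEM buffer.
struct SmemBuf {
    int*        data = nullptr;
    std::size_t cap  = 0;  // capacity in elements
};

// Grow the buffer before the walk so it can hold `needed` elements, for
// example a multiple of the batch's observed maximum read length.
// Returns false if allocation fails; the old buffer is left intact.
bool smem_buf_reserve(SmemBuf& b, std::size_t needed) {
    if (needed <= b.cap) return true;
    void* p = std::realloc(b.data, needed * sizeof(int));
    if (p == nullptr) return false;
    b.data = static_cast<int*>(p);
    b.cap  = needed;
    return true;
}

// Bounds-checked write: stop with a diagnostic instead of corrupting memory.
void smem_buf_store(SmemBuf& b, std::size_t idx, int value) {
    if (idx >= b.cap) {
        std::fprintf(stderr, "SMEM overflow: idx=%zu cap=%zu\n", idx, b.cap);
        std::abort();
    }
    b.data[idx] = value;
}
```

Sizing from the observed batch maximum removes the need for a compile-time read-length cap, and the checked store turns any remaining sizing mistake into an immediate, diagnosable failure.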
kswv nrow==0 guard (PR #51)
When a SIMD batch contained only padding pairs (all len1 == 0), the DP loop
never executed and nrow was zero. The post-loop rowMax + (i-1) * SIMD_WIDTH
store still executed, walking SIMD_WIDTH bytes before the beginning of the
rowMax allocation. On glibc this produced a free(): invalid pointer abort;
on macOS libc it silently corrupted the heap.
The fix wraps the post-loop store in an if (i > 0) guard on all five SIMD
kswv kernels: NEON u8, NEON 16, AVX2 u8, AVX-512BW u8, and AVX-512BW 16. The
upstream patch bwa-mem2#289
covered only the two AVX-512BW kernels; bwa-mem3 broadens it to the three
additional kernels carried in this fork. A dedicated regression test
(test/kswv_nrow_zero_test.cpp) builds all-padding batches and verifies each
kernel is clean under AddressSanitizer.
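A scalar model of the guard, with hypothetical names (`store_final_row`, `kSimdWidth`), might look like the following. It is a sketch of the idea only; the real kernels perform the store with SIMD intrinsics.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr int kSimdWidth = 16;  // hypothetical lane count for this sketch

// Post-loop store of the final row of per-lane maxima. `rows_done` is the
// value of the DP loop counter after the loop exits (0 if it never ran).
// Returns true if a row was written.
bool store_final_row(std::vector<std::int8_t>& row_max, int rows_done,
                     const std::int8_t* last_row) {
    if (rows_done <= 0) return false;  // all-padding batch: skip the store
    const std::size_t off = static_cast<std::size_t>(rows_done - 1) * kSimdWidth;
    if (off + kSimdWidth > row_max.size()) return false;  // defensive bound
    std::memcpy(row_max.data() + off, last_row, kSimdWidth);
    return true;
}
```

Without the first check, `rows_done - 1` would be -1 and the store would land one vector width before the start of the allocation.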
kswv score2 plateau series (PRs #26, #27, #28, #29, #30, #31)
The batched mate-rescue Smith-Waterman path (kswv) contains a family of
related bugs across its SIMD kernels that inflated the suboptimal score
(score2 / XS) and consequently deflated MAPQ relative to upstream
bwa-mem2.
AVX-512BW dispatch guard (PR #26). GCC with -mavx512bw automatically
defines __AVX2__, so the #elif __AVX2__ branch in src/kswv.h and
src/kswv.cpp matched first on every AVX-512BW build. The 256-bit AVX2 kernel
produced only 32-lane results into 64-lane score[]/te1[]/qe[] arrays
sized for AVX-512BW; the upper 32 lanes held uninitialized values.
mem_matesw_batch_post read those bogus te values, bwa_gen_cigar2
returned NULL, and mem_reg2aln triggered an a.cigar != NULL assertion on
every AVX-512BW dispatch host (AWS c7a, c7i). The fix qualifies the #elif __AVX2__ guard with !__AVX512BW__, matching the existing pattern in
bandedSWA.h. Closes issue #25.
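The guard pattern can be sketched as follows. The branch order and the `kKswvLanes8` constant are assumptions made for illustration; only the `__AVX2__` and `!__AVX512BW__` condition comes from the description above.

```cpp
// The AVX2 test appears before the AVX-512BW test, so it has to exclude
// AVX-512BW builds explicitly (such builds also define __AVX2__).
#if defined(__ARM_NEON)
constexpr int kKswvLanes8 = 16;  // 128-bit NEON
#elif defined(__AVX2__) && !defined(__AVX512BW__)
constexpr int kKswvLanes8 = 32;  // 256-bit AVX2
#elif defined(__AVX512BW__)
constexpr int kKswvLanes8 = 64;  // 512-bit AVX-512BW
#else
constexpr int kKswvLanes8 = 16;  // SSE or scalar fallback
#endif

// Number of 8-bit lanes the selected kernel fills in its output arrays.
int kswv_lanes8() { return kKswvLanes8; }
```

An alternative is to test the widest ISA first, which makes the exclusion unnecessary; the explicit `!defined(...)` form keeps the existing branch order untouched.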
AVX2 score2 plateau fix (PR #27 closed, PR #28 merged). The AVX2 256-bit
kswv kernel added in PR #20 used a dense SIMD max over every rowMax row to
compute the suboptimal score. Scalar ksw_u8 instead collapses consecutive
rows above minsc into a single b[] entry anchored at the max-score row,
then finds the best anchor outside the primary region. The dense max pulled in
tail rows from a plateau whose anchor sat inside the primary region, inflating
XS by 1–4 on a minority of reads and reducing MAPQ by 2–18 on those reads.
PR #27 (closed) temporarily disabled the AVX2 batched path. PR #28 fixes the
kernel itself by replacing the dense scan with a per-lane scalar emulation of
the b[] build-and-scan logic.
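The difference between the two scans is easier to see in scalar form. The sketch below is a simplified model: `score2_dense` approximates the dense scan, and `score2_anchored` follows the anchor-collapsing behaviour described for scalar ksw_u8. All names are hypothetical, and the primary region is assumed to be te ± ceil(score / qmax), where qmax is the largest substitution score.

```cpp
#include <utility>
#include <vector>

struct Score2 { int score2; int te2; };  // -1 means "no suboptimal hit"

// Half-width of the primary region around te (assumed formula, see lead-in).
static int region_halfwidth(int score, int qmax) {
    return (score + qmax - 1) / qmax;
}

// Dense scan: best row >= minsc anywhere outside the primary region.
Score2 score2_dense(const std::vector<int>& row_max, int minsc,
                    int te, int score, int qmax) {
    const int w = region_halfwidth(score, qmax);
    Score2 r{-1, -1};
    for (int i = 0; i < static_cast<int>(row_max.size()); ++i) {
        const bool outside = (i < te - w || i > te + w);
        if (outside && row_max[i] >= minsc && row_max[i] > r.score2)
            r = Score2{row_max[i], i};
    }
    return r;
}

// Anchored scan: collapse rows into b[] anchors first, then test the anchors.
Score2 score2_anchored(const std::vector<int>& row_max, int minsc,
                       int te, int score, int qmax) {
    std::vector<std::pair<int, int>> b;  // (imax, row) anchors
    for (int i = 0; i < static_cast<int>(row_max.size()); ++i) {
        const int imax = row_max[i];
        if (imax < minsc) continue;
        if (b.empty() || b.back().second + 1 != i) b.emplace_back(imax, i);
        else if (b.back().first < imax) b.back() = std::make_pair(imax, i);
    }
    const int w = region_halfwidth(score, qmax);
    Score2 r{-1, -1};
    for (const std::pair<int, int>& a : b) {
        const bool outside = (a.second < te - w || a.second > te + w);
        if (outside && a.first > r.score2) r = Score2{a.first, a.second};
    }
    return r;
}
```

For a single rising plateau that starts outside the primary region and peaks at te, the anchored scan collapses the whole run into one anchor inside the region and reports no suboptimal hit. The dense scan picks up the rows that sit just outside the region and reports them as score2, which is the inflation described above.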
NEON and AVX-512BW 8-bit port (PR #29). The same dense-rowMax score2 scan
existed in kswv_neon_u8 and kswv_512_u8. Confirmed on ARM: rebuilding
smoke-1M on darwin/arm64 pre-fix produced the identical four MAPQ regressions
as the AVX2 case. PR #29 ports the per-lane scalar b[]-emulation fix to both
kernels.
AVX-512BW 16-bit port (PR #30). kswv_512_16 carried four bugs: the same
dense-rowMax plateau pattern, aggregate maxl/minh bounds instead of
per-lane bounds (a gap from PR #21), no minsc filter, and no qe mask. The
per-lane scalar emulation from PR #29 fixes all four naturally.
NEON 16-bit rewrite (PR #31). kswv_neon_16 was effectively dead code
before this PR. Five interacting bugs produced 20,435 BAM diffs vs scalar
reference on smoke-1M -A 2: the score table reinterpreted int16 xor
indices as int8 lookups (inflating match scores by ~256 per cell), the table
was too small for the 16-bit SoA encoding, rowMax was never written, the
early-exit fired on row 0 for all pairs without a KSW_XSTOP target, and all
the fix-3 class bugs from PRs #28–#30 were missing. The PR rewrites the kernel
from scratch against kswv_neon_u8’s structure using 32-byte int8 tables
indexed via vqtbl2_s8, per-lane freeze, exit0 bitmap, and per-lane scalar
score2.
kseq2bseq1 zero-initialization (PR #22)
bseq_read_orig grows its sequence buffer with realloc, leaving tail entries
uninitialized. kseq2bseq1 populated only name, comment, seq, qual,
and l_seq for each entry, leaving sam, bams, n_bams, and cap_bams at
whatever values realloc happened to return. PR #13 added an unconditional
free(ret->seqs[i].bams) in the output loop (fastmap.cpp:571), which turned
those garbage values into a crash once input exceeded the initial 256-sequence
allocation: a "pointer being freed was not allocated" abort under system
malloc, and a SIGSEGV under mimalloc. The crash was deterministic and
reproducible with -t1.
The fix is a single memset(s, 0, sizeof(*s)) at the top of kseq2bseq1.
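A stripped-down illustration of why the zeroing matters is shown below. `Seq` is a hypothetical, trimmed record (the real bseq1_t has more fields) and `fill_seq` stands in for kseq2bseq1.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical, trimmed-down record.
struct Seq {
    char* name;
    char* seq;
    int   l_seq;
    void* bams;    // freed unconditionally by the output loop
    int   n_bams;
};

// Fill one slot. Zeroing first guarantees that fields this function does not
// set (bams, n_bams) are NULL/0 rather than whatever realloc left behind.
void fill_seq(Seq* s, char* name, char* seq, int l_seq) {
    std::memset(s, 0, sizeof(*s));
    s->name  = name;
    s->seq   = seq;
    s->l_seq = l_seq;
}
```

Because `free(NULL)` is a no-op, a later unconditional free of `bams` is harmless once the slot has been zeroed.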
Proper-pair flag from emitted alignment (PR #17)
In the no_pairing emission path of mem_sam_pe and mem_sam_pe_batch_post,
the proper-pair bit (0x2) was computed from a[i].a[0].rb regardless of
which alignment was actually emitted. When the primary’s alignment score fell
below the reporting threshold opt->T but a non-primary ALT hit cleared it,
mem_reg2aln emitted a[i].a[n_pri[i]] while mem_infer_dir still read the
below-threshold primary. In that case the SAM flag did not reflect the
coordinates in the record.
The fix stores the selected alignment index per mate in a which[2] array and
passes a[i].a[which[i]].rb to mem_infer_dir, ensuring the proper-pair
flag always matches the emitted record. The bug was present in the bwa-mem2
initial commit from 2019. Upstream reference: pre-existing bug, no open
upstream PR at time of merge.
Changes catalog
| Item | bwa-mem3 PR | Upstream PR/issue | Status |
|---|---|---|---|
| @PG CL: tab escape | #54 | bwa-mem2#293 | fork-only (open upstream issue) |
| SMEM buffer overflow on >151 bp reads | #55 | bwa-mem2#238, bwa-mem2#210 | fork-only (upstream PR closed unmerged) |
| kswv nrow==0 guard | #51 | bwa-mem2#289 | fork-only (upstream PR open) |
| AVX-512BW dispatch guard | #26 | — | fork-only |
| AVX2 score2 plateau disable (superseded) | #27 | — | closed (superseded by #28) |
| AVX2 score2 plateau fix | #28 | — | fork-only |
| NEON + AVX-512BW 8-bit score2 fix | #29 | — | fork-only |
| AVX-512BW 16-bit score2 fix | #30 | — | fork-only |
| NEON 16-bit kernel rewrite | #31 | — | fork-only |
| kseq2bseq1 zero-initialization | #22 | — | fork-only |
| Proper-pair flag from emitted alignment | #17 | — | fork-only |
See also: Performance improvements · Architecture support · Upstream PR status · Developer Guide → Regression test framework · Performance → SIMD dispatch matrix
Performance Improvements
This page covers the performance work carried in bwa-mem3 on top of upstream bwa-mem2. Every change listed here preserves byte-identical SAM/BAM output vs the upstream baseline it was benchmarked against.
For current benchmark numbers across architectures and workloads, see bwa-mem3-bench, the canonical source of truth for benchmark methodology and results.
Lockstep SMEM batching (PR #33)
Seeding in bwa-mem2 advances one read’s SMEM walk at a time. Because each
forward/backward extension step issues a random access into the cp_occ
checkpoint array (~4 GB for human genome), the CPU stalls on cache misses
between steps. Lockstep batching advances SMEM_LOCKSTEP_N reads’ SMEM walks
in slot-interleaved round-robin order so that the out-of-order engine can
overlap the cp_occ cache-miss loads for read i+N with the compute-bound
walk of read i.
Each read slot (BatchSlot) carries its own prev[] walk buffer and
match_buf[] reorder buffer. A tight recycling loop assigns finished slots to
the next unprocessed read immediately. The match-emit cursor enforces
input-index order so output is byte-identical to scalar. SMEM_LOCKSTEP_N is
compile-time tunable; N=1 dispatches to the unchanged scalar path for
bisection.
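The scheduling idea (round-robin slots, immediate recycling, in-order emission) can be modelled without any FM-index. In the toy simulation below, `steps[i]` is a stand-in for the number of extension steps read i needs, and every name is hypothetical.

```cpp
#include <cstddef>
#include <vector>

struct LockstepTrace {
    std::vector<int> completed;  // order in which walks finished
    std::vector<int> emitted;    // order in which results were released
};

// n_slots plays the role of SMEM_LOCKSTEP_N and must be >= 1.
LockstepTrace run_lockstep(const std::vector<int>& steps, int n_slots) {
    struct Slot { int read; int remaining; };
    LockstepTrace t;
    std::vector<Slot> slots(static_cast<std::size_t>(n_slots), Slot{-1, 0});
    std::vector<bool> done(steps.size(), false);
    std::size_t next_read = 0, emit_cursor = 0;

    // Hand the next unprocessed read to a slot, or park the slot (-1).
    auto refill = [&](Slot& s) {
        if (next_read < steps.size()) {
            s.read = static_cast<int>(next_read);
            s.remaining = steps[next_read];
            ++next_read;
        } else {
            s.read = -1;
        }
    };
    for (Slot& s : slots) refill(s);

    for (bool active = true; active;) {
        active = false;
        for (Slot& s : slots) {  // one step per slot, round-robin
            if (s.read < 0) continue;
            active = true;
            if (s.remaining > 0) --s.remaining;
            if (s.remaining > 0) continue;
            done[static_cast<std::size_t>(s.read)] = true;
            t.completed.push_back(s.read);
            refill(s);  // recycle the slot immediately
            // Release results strictly in input order.
            while (emit_cursor < steps.size() && done[emit_cursor])
                t.emitted.push_back(static_cast<int>(emit_cursor++));
        }
    }
    return t;
}
```

Reads may finish out of order (a short walk in one slot completes while a long walk is still running in another), but the emit cursor only advances over a contiguous prefix of finished reads, which is what keeps the output order identical to the one-read-at-a-time path.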
Measured improvement on 150 bp NovaSeq WGS (1M pairs, hg38, Graviton3 r7g.4xlarge,
8 threads): −6.1% wall time (82 s → 77 s). The backwardExt hot
cp_occ load share dropped from 65.5% to 53.3% of function time — direct
evidence that the OoO engine is overlapping cross-slot loads. On 300 bp MiSeq
reads the workload is SW-dominated (~85% of cycles in kswv kernels) and the
SMEM improvement is within noise; parity holds.
Supersedes PR #15 (cross-read _mm_prefetch shape), which regressed on
Graviton3.
Batched -H header ingestion (PR #49, closes issue #37)
Passing a large header file via -H <file> re-ran strlen on the growing
header string and called realloc on every input line, making ingestion O(n²)
in the number of header lines. For a ~70 MB / ~1.5 M-line header (reported in
upstream bwa-mem2#204) this
caused runtimes exceeding 10 minutes before alignment started.
The fix introduces bwa_insert_header_file, a batched helper that determines
the file size with fseek/ftell, allocates a single buffer, copies all
@-prefixed lines in one pass, and calls bwa_insert_header once. The fix
also addresses four correctness gaps in the upstream PR #204: the return-value
assignment was dropped (leaving hdr_line stale after realloc), const FILE*
caused compiler warnings, empty files were not guarded, and each fgets was
not bounded by remaining buffer. A regression test
(test/header_insert_test.cpp) diffs the batched path against the pre-patch
per-line baseline across eight edge cases.
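The single-pass shape can be sketched with standard streams. This is not bwa_insert_header_file: the name `collect_header_lines` is hypothetical, and the sketch uses seekg/tellg and one bulk read in place of the fseek/ftell and bounded fgets calls described above.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Reads the whole input once and returns only the '@'-prefixed lines, each
// terminated by '\n'. Empty or unseekable input yields an empty string.
std::string collect_header_lines(std::istream& in) {
    in.seekg(0, std::ios::end);
    const std::streamoff size = in.tellg();
    if (!in || size <= 0) return std::string();
    in.seekg(0, std::ios::beg);

    // One allocation sized from the input, then one bulk read.
    std::string buf(static_cast<std::size_t>(size), '\0');
    in.read(&buf[0], static_cast<std::streamsize>(size));
    buf.resize(static_cast<std::size_t>(in.gcount()));

    std::string out;
    out.reserve(buf.size() + 1);
    for (std::size_t pos = 0; pos < buf.size();) {
        const std::size_t eol = buf.find('\n', pos);
        const std::size_t end = (eol == std::string::npos) ? buf.size() : eol;
        if (buf[pos] == '@') {
            out.append(buf, pos, end - pos);
            out.push_back('\n');
        }
        pos = (eol == std::string::npos) ? buf.size() : eol + 1;
    }
    return out;
}
```

Each input byte is touched a constant number of times and the output is reserved up front, so the cost is linear in the file size rather than quadratic in the number of lines.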
libsais FM-index construction (PR #57)
bwa-mem3 index now builds the FM-index using
libsais v2.9.1 (Ilya Grebnov)
instead of the sais-lite (Yuta Mori saisxx) library that bwa-mem2 inherited.
libsais is actively maintained, supports OpenMP-parallel induced sorting, and
produces a byte-identical FM-index. No changes are required to existing
indexes — bwa-mem3 reads index files built by bwa-mem2 index without
re-indexing.
For a human reference (GRCh38 + decoys), libsais reduces indexing wall time and peak memory vs sais-lite. Exact numbers depend on thread count and available RAM; see the PR body for measurements on Graviton3.
Consolidated mapping speedups (PR #58)
PR #58 is a multi-phase performance audit of bwa-mem2’s hot path, squashed and
rebased onto main. It incorporates improvements across five subsystems:
- ksw2 banded SW — tuned the band extension loop to reduce redundant computation in the common case.
- SMEM lockstep batching — additional refinements on top of PR #33.
- SAL prefetch — prefetch hints for the suffix array lookup hot path.
- SAM record building — reduced per-record allocation in the text formatting path.
- PGO build — the opt-in profile-guided optimization target (see also Performance → PGO build) is included in this suite.
On the smoke-1M workload (1M PE 150 bp reads, hg38, Graviton3 r7g.4xlarge, 16
threads, warm page cache), this PR contributed the largest single-step wall
time reduction in the main branch’s performance history. Benchmark details
are maintained at bwa-mem3-bench.
Changes catalog
| Item | bwa-mem3 PR | Upstream PR/issue | Status |
|---|---|---|---|
| Lockstep SMEM batching | #33 | — | fork-only |
| Batched -H header ingestion | #49 | bwa-mem2#204 | fork-only (upstream PR open) |
| Large header performance (issue) | — | issue #37 | closed by #49 |
| libsais FM-index construction | #57 | — | fork-only |
| Consolidated mapping speedups | #58 | — | fork-only |
See also: Performance → Overview · Performance → PGO build · Correctness fixes · Build & infrastructure · bwa-mem3-bench
Features
This page covers user-facing features added to bwa-mem3 on top of upstream
bwa-mem2. None of these features change default behavior: output produced by
bwa-mem3 mem without any of these flags is byte-identical to the
corresponding bwa-mem2 output (except for the @PG ID: and PN: fields
which now read bwa-mem3).
--meth bisulfite alignment mode (PR #13)
--meth turns bwa-mem3 index and bwa-mem3 mem into a single-binary
drop-in replacement for the entire
bwameth.py pipeline. No Python, no
separate post-processing step, no bwameth.py dependency.
bwa-mem3 index --meth ref.fa # once per reference
bwa-mem3 mem --meth ref.fa R1.fq R2.fq | samtools sort -o out.bam
index --meth writes <ref>.bwameth.c2t — a doubled reference with
f/r-prefixed contigs and C→T / G→A projection, byte-identical to the
index that bwameth.py index-mem2 produces.
mem --meth performs inline C→T conversion of R1 and G→A conversion of R2
before seeding, stashes the original bases in YS:Z:, records the conversion
direction in YC:Z:, consolidates the f/r contig pairs back to one @SQ
per real chromosome, applies a chimera QC heuristic (longest M/=/X run < 44%
of read length → set 0x200, clear proper-pair 0x2, cap MAPQ at 1), copies
YS:Z: back into the SEQ field for CpG-calling tools, and writes a @PG ID:bwa-mem3-meth entry.
On the bwameth.py example fixture (92,684 reads), end-to-end output is
byte-identical on chrom, pos, CIGAR, and SEQ vs the bwameth.py oracle. Stacks
on PR #12 (--bam). See the
Methylation Reference for full details.
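The chimera QC rule can be written down compactly. The sketch below assumes that "longest M/=/X run" means the longest single M, =, or X operation in the CIGAR string; `Rec`, `longest_match_run`, and `apply_chimera_qc` are hypothetical names, not the identifiers used in bwa-mem3.

```cpp
#include <cctype>
#include <cstdint>
#include <string>

// Length of the longest single M, =, or X operation in a CIGAR string.
int longest_match_run(const std::string& cigar) {
    int best = 0, len = 0;
    for (char c : cigar) {
        if (std::isdigit(static_cast<unsigned char>(c))) {
            len = len * 10 + (c - '0');
            continue;
        }
        if ((c == 'M' || c == '=' || c == 'X') && len > best) best = len;
        len = 0;
    }
    return best;
}

struct Rec { std::uint16_t flag; std::uint8_t mapq; };

// If the longest run is under 44% of the read length: set QC-fail (0x200),
// clear proper-pair (0x2), and cap MAPQ at 1.
void apply_chimera_qc(Rec& r, const std::string& cigar, int read_len) {
    if (longest_match_run(cigar) * 100 < 44 * read_len) {
        r.flag = static_cast<std::uint16_t>((r.flag | 0x200u) & ~0x2u);
        if (r.mapq > 1) r.mapq = 1;
    }
}
```

For a 150 bp read the threshold is 66 bases, so a 40M110S alignment is flagged while 100M50S is left untouched.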
Vendored mimalloc allocator (PR #19)
bwa-mem3 vendors mimalloc v3.3.0 as a
pinned submodule at ext/mimalloc and links it into every binary by default
(USE_MIMALLOC=1). On Linux, static linkage uses --whole-archive; on macOS,
dyld-interposed shared linkage is used.
Measured on AWS c7g.4xlarge (Graviton3, 16 threads, 29M 150 bp paired-end
exome-capture reads vs hg38, page cache dropped between iterations):
−19.7% wall-clock time (528.6 s → 424.7 s, a 1.24× speedup) compared to the
same build with USE_MIMALLOC=0. No user-visible interface change; no runtime
configuration required.
USE_MIMALLOC=0 is a supported best-effort opt-out and is CI-gated on Linux
x86. bwa-mem3 version prints the mimalloc version string when it is active.
--supp-rep-hard-cap supplementary MAPQ rescoring (PR #56)
Supplementary alignments for a split read inherit MAPQ from the full-read
scoring pipeline. Competing repetitive chains for the supplementary fragment
are filtered out during full-read chain scoring (mem_chain_flt) before
Smith-Waterman, so they never contribute to sub/sub_n. A supp fragment
landing in a CCATCC repeat that would map equally well to 50+ locations
standalone can therefore carry MAPQ=60 from its primary.
--supp-rep-hard-cap INT opts into rescoring: if any seed in a supplementary
alignment’s chain has >=INT genome occurrences (from the SMEM SA count), the
supplementary MAPQ is forced to 0. Primary alignment MAPQ and coordinates are
unaffected. Default output (no flag) is byte-identical to upstream bwa-mem2.
The SMEM SA-occurrence count is preserved on each seed as mem_seed_t.n_hits
and propagated to mem_alnreg_t.chain_n_hits during chain-to-alignment
conversion. Typical values for INT are 5–20; lower is more aggressive. The
upstream bwa-mem2#260
reporter case drops from MAPQ=60 to MAPQ=0 at --supp-rep-hard-cap 18.
Closes issue #46.
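The rescoring rule reduces to a single predicate over the per-seed occurrence counts. In the sketch below, `supp_mapq_after_cap` is a hypothetical name, and treating `hard_cap <= 0` as "flag not given" is an assumption made for the example.

```cpp
#include <vector>

// Returns the MAPQ to report for a supplementary alignment.
// seed_n_hits holds the genome occurrence count of each seed in its chain.
int supp_mapq_after_cap(int mapq, const std::vector<long>& seed_n_hits,
                        long hard_cap) {
    if (hard_cap <= 0) return mapq;      // assumed: rescoring disabled
    for (long n : seed_n_hits)
        if (n >= hard_cap) return 0;     // any repetitive seed forces MAPQ 0
    return mapq;
}
```

A chain with one seed occurring 52 times is forced to MAPQ 0 at a cap of 18, while a chain whose seeds are all near-unique keeps its original MAPQ.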
Shared-memory index: bwa-mem3 shm (PR #65)
bwa-mem3 mem reloads the FM-index from disk on every invocation. For hg38
the index is ~28 GB; for short alignment jobs (targeted panels, small sample
batches) this load cost dominates runtime and makes per-invocation IOPS the
bottleneck.
PR #65 ports the bwa shm command from bwa-mem v1 to bwa-mem3 with strict v1
CLI parity:
bwa-mem3 shm <index-prefix> # load index into shared-memory segment once
bwa-mem3 mem <index-prefix> ... # subsequent runs attach instead of re-reading
bwa-mem3 shm -d <index-prefix> # detach and free the segment
The index lives in a POSIX shared-memory segment. Multiple bwa-mem3 mem
processes on the same host share the same in-memory copy. Closes
issue #64.
Warning — Stale index
bwa-mem3 shm does not detect when the on-disk index has been rebuilt. Always run bwa-mem3 shm -d <prefix> before running bwa-mem3 index and then re-stage with bwa-mem3 shm <prefix>. Using a stale shared-memory segment produces silently wrong alignments.
bwa-mem3 shm --meth (PR #67)
bwa-mem3 mem --meth <prefix> auto-appends .bwameth.c2t to locate the
methylation index built by bwa-mem3 index --meth <prefix>. Before PR #67,
staging a methylation index in shared memory required passing the full
.bwameth.c2t-suffixed path to shm while continuing to pass the plain
prefix to mem. The mismatch was easy to forget, and the failure mode — a
run that silently attached the wrong segment — was difficult to diagnose.
PR #67 adds --meth support to bwa-mem3 shm so the same plain-prefix
convention works end-to-end:
bwa-mem3 shm --meth ref.fa # stages ref.fa.bwameth.c2t
bwa-mem3 mem --meth ref.fa ... # attaches automatically
bwa-mem3 shm -d --meth ref.fa # detaches
HN:i hit count tag (PR #42)
Every primary SAM/BAM record now carries an HN:i:<n> tag reporting the
number of secondary alignment candidates clustered with this primary under
XA_drop_ratio. This count is captured before the -h/max_XA_hits cap
truncates the XA:Z: string, so HN reports the true number of alternate
loci even when no XA:Z: field appears in the record.
This makes it possible to distinguish:
- HN:i:0 + no XA:Z: means a genuinely unique mapper.
- HN:i:N + XA:Z:... (N ≤ -h) means a multi-mapper with all alternates listed.
- HN:i:N + no XA:Z: (N > -h) means a multi-mapper whose alternates were suppressed by the cap.
Motivated by lh3/bwa#438, which adds
HN to bwa aln. HN is emitted in both SAM (mem_aln2sam) and BAM
(mem_aln_to_bam) paths and is absent when -a (MEM_F_ALL) is active.
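A downstream consumer could encode the three cases as follows. The enum and function names are hypothetical; the inputs are simply the HN:i value and whether an XA:Z: tag is present on the primary record.

```cpp
enum class HitClass { Unique, MultiListed, MultiSuppressed };

// hn: value of the HN:i tag; has_xa: whether the record carries XA:Z:.
HitClass classify_primary(int hn, bool has_xa) {
    if (hn == 0) return HitClass::Unique;
    return has_xa ? HitClass::MultiListed : HitClass::MultiSuppressed;
}
```

Without HN, the first and third cases are indistinguishable, because both appear as a primary record with no XA:Z: tag.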
--bam=LEVEL direct BAM output (PR #12)
bwa-mem3 mem --bam (or --bam=0 through --bam=9) emits BAM directly via
htslib, bypassing the SAM-text-to-BAM conversion round trip that normally
occurs when the output is piped to samtools view -bS.
- --bam / --bam=0: uncompressed BAM (BGZF framing only); near-zero CPU overhead, smaller than SAM text, fast downstream parsing.
- --bam=1..9: BGZF deflate at the specified level.
- No flag: SAM text on stdout (default, unchanged).
The implementation adds src/bam_writer.{h,cpp}, a new module that converts
mem_aln_t to bam1_t via mem_aln_to_bam. htslib v1.21 is pulled in as a
submodule at ext/htslib. On the bwameth.py example fixture (92,961 records),
samtools view of --bam output vs SAM text produces a zero-line diff across
all 11 SAM columns and all aux tags. See
Best Practices → Output format for the
recommended pipeline.
Changes catalog
| Item | bwa-mem3 PR | Upstream PR/issue | Status |
|---|---|---|---|
| --meth bisulfite alignment mode | #13 | — | fork-only |
| Vendored mimalloc allocator | #19 | — | fork-only |
| --supp-rep-hard-cap MAPQ rescoring | #56 | bwa-mem2#260 | fork-only (upstream issue open) |
| bwa-mem3 shm shared-memory index | #65 | — | fork-only |
| shm --meth symmetry | #67 | — | fork-only |
| HN:i hit count tag | #42 | lh3/bwa#438 | fork-only (analogous to bwa aln) |
| --bam=LEVEL direct BAM output | #12 | — | fork-only |
See also: Methylation Reference → Overview · User Guide → Memory allocator · User Guide → Output: SAM/BAM, headers, tags · Getting Started → Quick start: shared-memory index · Best Practices → Output format
Architecture Support
This page covers the architecture-specific build and runtime work carried in bwa-mem3. The goal is a single codebase that builds cleanly on all supported targets and runs the best available SIMD kernels on each.
For the full dispatch matrix and runtime selection logic, see Performance → SIMD dispatch matrix and Developer Guide → SIMD dispatch architecture.
Linux ARM64 / aarch64 build (PR #1)
The Apple Silicon work that reached the fork in commit ae73227 gated ARM
behavior on $(UNAME_M) == arm64. On macOS, uname -m returns arm64. On
Linux ARM64, it returns aarch64. The Makefile’s ifeq check therefore fell
through to the x86 multi target on every Linux aarch64 host, failing with:
g++: error: unrecognized command-line option '-msse'
PR #1 introduces an IS_ARM variable ($(filter $(UNAME_M),arm64 aarch64))
that matches both names. All four architecture-conditional blocks in the
Makefile are rewritten to use IS_ARM: the NEON/sse2neon flag block, the x86
arch-specific block, the ARM64 single-binary build block, and the multi
target ARM64 short-circuit. The CI workflow is extended to trigger on pushes
to fg-main (the integration branch) and adds an ubuntu-24.04-arm matrix
row so the aarch64 path is exercised on every PR.
arch=avx512bw explicit build target (PR #16)
The AVX-512 Smith-Waterman kernels in bwa-mem2 are guarded by the
__AVX512BW__ preprocessor macro — not __AVX512F__. The only way to build
them before this PR was arch=avx512, but make multi emitted the dispatch
binary as bwa-mem2.avx512bw. The build selector (avx512), the preprocessor
guard (__AVX512BW__), and the dispatcher suffix (.avx512bw) disagreed.
PR #16 adds arch=avx512bw as an explicit Makefile target with flags
-mavx512f -mavx512bw and switches make multi to invoke arch=avx512bw
when emitting bwa-mem3.avx512bw. The legacy arch=avx512 is preserved as
an alias with identical flags. No C++ is changed; the fix is 11 insertions and
2 deletions in the Makefile.
This is a pure build-correctness fix: before PR #16, arch=avx512bw and
arch=multi builds on AVX-512BW hardware silently compiled the wrong kernel
(see Correctness → AVX-512BW dispatch guard for the
downstream effect).
NEON kswv mate-rescue (PR #18)
bwa-mem2 has a batched mate-rescue Smith-Waterman path (BWAMEM_BATCHED_MATESW)
that uses SIMD kswv kernels to score rescue candidates in parallel. On ARM64
the gate was __AVX512BW__, which is never true on NEON hardware. The NEON
kswv::getScores8 kernel existed in the source but was unreachable in
production.
PR #18 enables this path on ARM64 by replacing the __AVX512BW__ gate with a
new BWAMEM_BATCHED_MATESW macro that fires on NEON/Apple Silicon as well.
Along the way, four kernel bugs were found and fixed:
- te split: the te (traceback end) value needed separate hi/lo tracking for 16-lane u8 batches.
- Freeze mask: a frozen_vec mask now gates gmax/te/qe updates after KSW_XSTOP fires, preventing stale values from escaping to the score2 scan.
- Per-lane score2 exclusion: len1, low/high, and qe masks were not applied per-lane in Loop 1, allowing lanes without a valid primary to contribute spurious suboptimal scores.
- minsc filter on rowMax: sub-minsc plateau scores were leaking into score2 because the scalar ksw_u8 gating condition (imax >= minsc) was not replicated.
Measured on an M-series Mac (8 threads, 500k PE 100 bp reads on chr17): 1.42× speedup (−29.4% wall time) with byte-identical sorted SAM output.
AVX2 kswv mate-rescue (PR #20)
PR #18 enabled batched mate-rescue on ARM64. Most x86 production deployments
(AWS c6a, c6i, older Xeons) use AVX2 without AVX-512BW and were excluded from
the same gate. PR #20 extends the batched path to AVX2 by adding a 256-bit
kswv256_u8 kernel and widening BWAMEM_BATCHED_MATESW to fire on __AVX2__.
The AVX2 kernel is a direct port of the corrected NEON kernel from PR #18,
with an additional fix for per-lane te2 tracking (_mm256_blendv_epi8 on a
sign-extended 8→16 bit mask). Verified byte-identical sorted SAM vs the
pre-BWAMEM_BATCHED_MATESW scalar control on EC2 m5.xlarge (Skylake-SP, 4
threads, 500k chr17 PE pairs).
Note: PR #20 introduced a score2 plateau regression in the AVX2 kernel that was identified and fixed in the correctness series (PRs #27, #28, #29).
Changes catalog
| Item | bwa-mem3 PR | Upstream PR/issue | Status |
|---|---|---|---|
| Linux ARM64 / aarch64 build + CI | #1 | bwa-mem2#288 | fork-only (upstream PR open) |
| arch=avx512bw explicit target | #16 | — | fork-only |
| NEON kswv mate-rescue kernel | #18 | — | fork-only |
| AVX2 kswv mate-rescue kernel | #20 | — | fork-only |
See also: Performance → SIMD dispatch matrix · Developer Guide → SIMD dispatch architecture · Developer Guide → Apple Silicon / NEON port · Correctness fixes · Performance → PGO build
Build & Infrastructure
This page covers the build-system, testing, and CI infrastructure changes carried in bwa-mem3 on top of upstream bwa-mem2.
doctest framework and Codecov (PR #34)
PR #34 establishes the long-term test infrastructure for bwa-mem3:
- doctest 2.4.11 is vendored as a single-header under ext/doctest/, with the SHA256 recorded in ext/doctest/VERSION.
- A new test/framework/ static library provides shared helpers: scoring matrices, deterministic sequence-pair generators, kswv-style batch packers, scalar and SIMD runners, kswr comparators, a JUnit reporter hook, and a shared main.
- Two test binaries are produced: bwa_mem3_tests_unit (runs on every CI matrix row) and bwa_mem3_tests_integration (runs on a subset of rows).
- The existing kswv_selftest is ported to test/unit/test_kswv_correctness.cpp: 30,049 assertions against scalar ksw_align2 on 10k random plus curated edge pairs.
- Five legacy integration sources are moved to test/integration/ via git mv; their binaries still emit at test/<name> so existing scripts keep working.
- Five inline CI bash regression blocks are extracted to test/regression/*.sh (phix_parity, chr22_parity, thread_determinism, bam_roundtrip, meth_oracle).
- A coverage CI job builds libbwa.a and both test binaries with COVERAGE=1 (-O0 --coverage), runs both test binaries, collects Cobertura XML via gcovr, and uploads to Codecov via codecov/codecov-action.
PACKAGE_VERSION from git describe (PR #52)
Before PR #52, src/main.cpp hardcoded PACKAGE_VERSION "2.2.1". This string
appeared in bwa-mem3 version output and in the @PG VN: SAM header field
but was never updated, causing every build to report an outdated version.
The Makefile now generates src/version.h from git describe --tags --dirty,
falling back to a static FG_LABS_VERSION_FALLBACK when git describe cannot
reach a tag (source-tarball extractions, shallow clones — e.g. CI with the
default fetch-depth: 1). A write-if-changed mechanism (cmp -s + mv)
regenerates the file on every invocation but only bumps its mtime when the
stamped string changes, so only main.o is rebuilt when the version changes,
not the entire tree. src/version.h is .gitignored and removed by
make clean. Fixes
issue #40. Related upstream:
bwa-mem2#283,
bwa-mem2#284.
PGO target parameterization (PR #59)
The original pgo-generate and pgo-use Makefile targets hardcoded
arch=arm64 and a single shared pgo_profiles/ directory. PR #59 generalizes
both:
- PGO_ARCH (default: arm64 on ARM hosts, native otherwise) passes through to the recursive make invocation as arch=$(PGO_ARCH). Accepts the same values as the rest of the Makefile: arm64, sse41, avx2, avx512bw, native, etc.
- PGO_PROFILE_DIR is now overridable (?= instead of =). Each (arch × training-regime) combination can capture into its own directory.
- When PGO_ARCH != arm64, the output binaries are named bwa-mem3.pgo-instr.<arch> and bwa-mem3.pgo.<arch> so multiple per-arch PGO builds coexist. The default arm64 names are unchanged for backward compatibility.
- pgo-clean now removes arch-suffixed PGO binaries in addition to bare names.
This enables the benchmarking workflow at bwa-mem3-bench, which requires per-arch × per-regime profile capture. See also Performance → PGO build.
CXXFLAGS/CPPFLAGS/LDFLAGS forwarding (PR #50)
The Makefile’s multi: rule compiled runsimd.cpp (the x86 multi-binary
launcher) without honoring CXXFLAGS, CPPFLAGS, or LDFLAGS. The $(EXE)
link honored CXXFLAGS and LDFLAGS but not CPPFLAGS.
PR #50 mirrors upstream bwa-mem2#290:
the multi: compile now honors all three variables, and $(EXE) link adds
$(CPPFLAGS). This allows downstream packagers (Debian, Bioconda) and
reproducible-build systems to inject hardening flags (-D_FORTIFY_SOURCE=2,
-fstack-protector-strong, -Wl,-z,relro) through the environment without
patching the Makefile. No functional change unless the env vars are set.
Closes issue #39.
Unit-test harness and ARM CI (PR #23)
PR #23 added a local bash harness (test/run_unit_tests.sh) that built and ran
the five C++ unit binaries under test/ against committed fixtures in
test/fixtures/, asserting exit 0 and non-empty, diff-able output. Those
binaries have since been consolidated into the doctest harness (see the
doctest framework section above). PR #23 also fixed several pre-existing
issues that blocked the harness:
- test/Makefile defaulted to icpc (Intel compiler, not available on GitHub runners); changed to g++ on Linux x86.
- ARM flags are mirrored from the parent Makefile so cd test && make builds on macOS arm64 and Linux aarch64.
- Three test sources (smem2_test, bwt_seed_strategy_test, sa2ref_test) were missing the fmiSearch->load_index() call that fmi_test.cpp has, causing immediate segfaults on run.
- test/main_banded.cpp opened fksw.txt but never wrote to it; output is now written and main() returns 0 on success.
- Fixtures are added under test/fixtures/ covering phiX174, 50 bp test reads, BWT seed strategy inputs, SA pairs, and SW pairs.
CI matrix expansion (PR #24)
PR #24 stacks on PR #23 and expands the GitHub workflow .github/workflows/ci.yml from 5 matrix
rows to 7:
| Row | Runner | Arch | Role |
|---|---|---|---|
| 1 | ubuntu-latest | sse41 | smoke + unit tests |
| 2 | ubuntu-latest | avx2 | canonical deep tests |
| 3 | ubuntu-latest | avx2 (no mimalloc) | unchanged |
| 4 | ubuntu-24.04-arm | arm64 | unchanged |
| 5 | macos-latest | arm64 | unchanged |
| 6 (new) | ubuntu-latest | multi | runsimd dispatcher smoke |
| 7 (new) | ubuntu-latest | avx2 clang++ | Linux Clang smoke |
The canonical row (row 2) adds: --bam=6 roundtrip record-count parity,
thread-determinism (-t1 vs -t4 sorted diff), unit-test harness, chr22
pipeline parity vs bwa, SE smoke, interleaved smoke, and --meth Layers 1–3.
Changes catalog
| Item | bwa-mem3 PR | Upstream PR/issue | Status |
|---|---|---|---|
| doctest framework + Codecov | #34 | — | fork-only |
| PACKAGE_VERSION from git describe | #52 | bwa-mem2#283, bwa-mem2#284 | fork-only (upstream issue + PR open) |
| PGO target parameterization | #59 | — | fork-only |
| CXXFLAGS/CPPFLAGS/LDFLAGS forwarding | #50 | bwa-mem2#290 | fork-only (mirrors open upstream PR) |
| Unit-test harness + ARM CI | #23 | — | fork-only |
| CI matrix expansion | #24 | — | fork-only |
See also: Developer Guide → Regression test framework · Developer Guide → Release process · Performance → PGO build · Performance improvements · Upstream PR status
Upstream PR Status
This table cross-references every change carried in bwa-mem3 main to its
corresponding upstream bwa-mem2 PR or issue. “Fork-only” means no upstream PR
exists; the change may be submitted upstream in the future or may be
fork-specific by design. “Open” means the upstream PR or issue existed at the
time of bwa-mem3’s implementation but had not been merged. Upstream status is
current as of the bwa-mem3 0.1.0-pre release.
For prose descriptions of each change, follow the links in the “bwa-mem3 PR” column to the relevant deep-dive page section.
Full cross-reference table
| Topic | bwa-mem3 PR | Upstream PR / Issue | Upstream status |
|---|---|---|---|
| Correctness | |||
| @PG CL: tab escaping | #54 | bwa-mem2#293 | open issue |
| SMEM buffer overflow on >151 bp reads | #55 | bwa-mem2#238, bwa-mem2#210 | PR closed without merge; issue open |
| kswv nrow==0 guard (all 5 kernels) | #51 | bwa-mem2#289 | open PR (upstream covers AVX-512BW only) |
| AVX-512BW dispatch guard (!__AVX512BW__) | #26 | — | fork-only |
| AVX2 score2 plateau consolidation | #28 | — | fork-only |
| NEON + AVX-512BW 8-bit score2 fix | #29 | — | fork-only |
| AVX-512BW 16-bit score2 fix | #30 | — | fork-only |
| NEON 16-bit kernel rewrite | #31 | — | fork-only |
| kseq2bseq1 zero-initialization | #22 | — | fork-only |
| Proper-pair flag from emitted alignment | #17 | — | fork-only |
| Performance | |||
| Lockstep SMEM batching | #33 | — | fork-only |
| Batched -H header ingestion (O(n) fix) | #49 | bwa-mem2#204 | open PR |
| libsais FM-index construction | #57 | — | fork-only |
| Consolidated mapping speedups | #58 | — | fork-only |
| Features | |||
| --bam=LEVEL direct BAM output | #12 | — | fork-only |
| --meth bisulfite alignment mode | #13 | — | fork-only |
| Vendored mimalloc allocator | #19 | — | fork-only |
| HN:i hit count tag | #42 | lh3/bwa#438 | analogous to bwa aln; no direct upstream port |
| --supp-rep-hard-cap MAPQ rescoring | #56 | bwa-mem2#260 | open issue |
| bwa-mem3 shm shared-memory index | #65 | — | fork-only (v1 feature port) |
| shm --meth symmetry | #67 | — | fork-only |
| Architecture support | |||
| Linux ARM64 / aarch64 build + CI | #1 | bwa-mem2#288 | open PR |
| arch=avx512bw explicit Makefile target | #16 | — | fork-only |
| NEON kswv mate-rescue kernel | #18 | — | fork-only |
| AVX2 kswv mate-rescue kernel | #20 | — | fork-only |
| Build & infrastructure | |||
| doctest framework + Codecov | #34 | — | fork-only |
| PACKAGE_VERSION from git describe | #52 | bwa-mem2#283, bwa-mem2#284 | open issue + open PR |
| PGO target parameterization | #59 | — | fork-only |
| CXXFLAGS/CPPFLAGS/LDFLAGS forwarding | #50 | bwa-mem2#290 | open upstream PR |
| Unit-test harness + ARM CI | #23 | — | fork-only |
| CI matrix expansion (7 rows) | #24 | — | fork-only |
Upstream issues tracked but not yet fixed in bwa-mem3
The following upstream issues are tracked in the bwa-mem3 issue list but do
not yet have corresponding fixes in main:
| Issue | Upstream reference | Notes |
|---|---|---|
| Split-alignment evidence loss vs bwa 0.7.17 | bwa-mem2#273 | issue #47 — under investigation |
| MAPQ/coordinate parity vs bwa mem 0.7.18 | bwa-mem2#262, bwa-mem2#246, bwa-mem2#239 | issue #48 — tracking only |
See also: Correctness fixes · Performance improvements · Features · Architecture support · Build & infrastructure
Building from source
This page documents every build target available in the Makefile and what each produces. For the recommended production build workflow see Best Practices → Build.
Prerequisites
- A C++14-capable compiler: GCC 7+ or Clang 6+ on Linux; Clang 15+ (Xcode) on macOS.
- GNU make 3.81+.
- CMake 3.12+ (required only when USE_MIMALLOC=1, which is the default).
- libomp (macOS only): brew install libomp. libsais uses OpenMP for parallel suffix-array construction.
- Git submodules initialised: git submodule update --init --recursive.
Warning — Submodules must be present
The build will fail with a clear error message if any of the required submodules (ext/libsais, ext/htslib, ext/safestringlib, ext/mimalloc, ext/sse2neon) are missing. Always clone with --recursive or run git submodule update --init --recursive before make.
Standard builds
Default build (host-native)
make
On x86 hosts this is equivalent to make multi (see below). On Apple Silicon and other aarch64 hosts the Makefile detects the architecture and builds a single ARM64 binary instead.
The resulting binary is bwa-mem3 in the repo root.
Single-arch x86 builds
Pass arch=<target> to compile a single binary with a specific ISA level:
| Command | SIMD level | ARCH_FLAGS |
|---|---|---|
| make arch=sse41 | SSE4.1 | -msse … -msse4.1 |
| make arch=sse42 | SSE4.2 | -msse … -msse4.2 |
| make arch=avx | AVX | -mavx |
| make arch=avx2 | AVX2 | -mavx2 |
| make arch=avx512bw | AVX-512BW | -mavx512f -mavx512bw |
| make arch=native | host CPU features | -march=native |
For the Intel compilers (icpc / icpx) the flags differ slightly; see the ifeq ($(CXX), icpc) branches in the Makefile.
Multi-binary x86 build (default on x86)
make multi
Builds five ISA-specific binaries (bwa-mem3.sse41, bwa-mem3.sse42, bwa-mem3.avx, bwa-mem3.avx2, bwa-mem3.avx512bw) plus the thin launcher bwa-mem3 that execs the best-matching binary at runtime. See Multi-binary launcher for details.
ARM64 / Apple Silicon build
make arch=arm64
Compiles a single binary bwa-mem3.arm64 and creates a symlink bwa-mem3 -> bwa-mem3.arm64. See Apple Silicon / NEON port for background.
Tuned builds
Profile-Guided Optimization (PGO)
PGO produces the best single-binary performance. The workflow is two-phase:
# Phase 1: instrument binary
make pgo-generate # builds bwa-mem3.pgo-instr (arm64 default)
make pgo-generate PGO_ARCH=avx2 # or a specific x86 target
# Run your training workload with the instrumented binary
./bwa-mem3.pgo-instr mem -t 16 ref.fa r1.fq.gz r2.fq.gz > /dev/null
# Phase 2: optimised binary
make pgo-use # builds bwa-mem3.pgo
make pgo-use PGO_ARCH=avx2 # matching arch
PGO_ARCH accepts the same values as arch=. PGO_PROFILE_DIR defaults to pgo_profiles/ but can be overridden. Output binaries are named bwa-mem3.pgo (default arch) or bwa-mem3.pgo.<arch> when a non-default arch is specified, so multiple arch builds coexist.
Clean up instrumented objects and profile data:
make pgo-clean
Link-Time Optimization (LTO)
make lto-build # builds bwa-mem3.lto (native arch)
make lto-build LTO_ARCH=avx2 # explicit arch
LTO compiles bwa-mem3’s own translation units with -flto (thin LTO on Clang, full LTO on GCC) plus -fno-semantic-interposition on GCC. Third-party libraries (htslib, mimalloc, safestringlib) are linked without LTO. Clean:
make lto-clean
Compute-only profile binary
Used when profiling CPU hotspots without I/O noise. The -DDISABLE_OUTPUT flag short-circuits all BAM/SAM write paths and the file-open / header-emit step, so only alignment work contributes to wall time.
make profile-build # builds bwa-mem3.profile (native)
make profile-build PROFILE_ARCH=avx2 # explicit arch
./bwa-mem3.profile mem -t 16 ref.fa r1.fq.gz r2.fq.gz
make profile-clean
Build knobs
| Variable | Default | Effect |
|---|---|---|
| USE_MIMALLOC | 1 | Include mimalloc; set 0 to use the system allocator |
| ASAN | (unset) | Set to any non-empty value to enable AddressSanitizer (forces USE_MIMALLOC=0) |
| COVERAGE | (unset) | Set to enable --coverage + -O0 for gcov line-level coverage |
| EXTRA_CXXFLAGS | (empty) | Appended to CXXFLAGS; forwarded through PGO / LTO targets |
| DISABLE_BATCHED_MATESW | (unset) | Set to 1 to disable the batched mate-rescue SW path on ARM |
| CXX | c++ | Compiler. Paired CC is auto-derived from CXX for libsais. |
Cleaning
make clean
Removes object files, libbwa.a, all binaries, test binaries, libsais objects, safestringlib, htslib, and the mimalloc build tree.
make docs-clean
Removes only the mdbook build output (docs/book/). Covered in Developer Guide → Building context; see the Makefile docs targets for the full list.
Documentation targets
| Target | Action |
|---|---|
| make docs | Build the mdbook into docs/book/ |
| make docs-serve | Live-preview at http://localhost:3000 |
| make docs-cli | Capture --help output for each subcommand into docs/_generated/cli/ |
| make docs-clean | Remove docs/book/ |
| make docs-install-tools | cargo install mdbook + three plugins |
See also: SIMD dispatch architecture · Multi-binary launcher · Best Practices → Build · Performance → PGO build · Apple Silicon / NEON port
SIMD dispatch architecture
bwa-mem3 uses two complementary mechanisms to select the best available SIMD code path: a multi-binary launcher that picks a binary at run time on x86 (covered separately in Multi-binary launcher) and conditional compilation inside each kernel, resolved at compile time and mediated by src/simd_compat.h.
This page covers the compile-time layer: what the macros do, which kernels are vectorised at each ISA level, and how the dispatch decision flows.
The simd_compat.h abstraction layer
src/simd_compat.h is the single point where platform detection and intrinsic selection occur. It is included by every file that touches SIMD code. The header resolves to one of four paths:
| Platform | Branch condition | Intrinsic headers |
|---|---|---|
| ARM / Apple Silicon | __ARM_NEON or __aarch64__ | sse2neon.h (translation) + <arm_neon.h> (native) |
| x86 AVX-512BW | __AVX512BW__ | <immintrin.h> |
| x86 AVX2 | __AVX2__ | <immintrin.h> |
| x86 SSE4.1 / SSE2 | __SSE4_1__ or __SSE2__ | <smmintrin.h> + <emmintrin.h> |
The ARM path defines APPLE_SILICON 1, sets SIMD_WIDTH8 = 16 and SIMD_WIDTH16 = 8 (a 128-bit NEON register holds sixteen 8-bit or eight 16-bit lanes), defines a posix_memalign-backed _mm_malloc replacement that enforces the 128-byte Apple Silicon cache-line alignment, and provides two optimised NEON helpers that sse2neon does not generate efficiently:
- _mm_movemask_epi16: extracts the MSB of each 16-bit element using vshrq_n_u16 + vmovn_u16 + position-weighted vaddv_u8, replacing the _mm_movemask_epi8(v) & 0xAAAA pattern used in bandedSWA.cpp.
- _mm_blendv_epi16_fast: a bitwise select on 16-bit elements via NEON vbslq_s16, replacing the OR/AND/ANDNOT sequence sse2neon emits for _mm_blendv_epi8.
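To make the semantics of the 16-bit movemask helper concrete, here is a minimal scalar sketch in C++. It is not the NEON code from src/simd_compat.h: the names Lanes16, movemask_epi16_ref and movemask_epi8_ref are hypothetical, and a std::array stands in for a 128-bit register.

```cpp
#include <array>
#include <cstdint>

// Scalar stand-in for a 128-bit register viewed as eight 16-bit lanes.
using Lanes16 = std::array<uint16_t, 8>;

// Reference semantics of a 16-bit movemask:
// bit i of the result is the most significant bit of lane i.
inline int movemask_epi16_ref(const Lanes16& v) {
    int mask = 0;
    for (int i = 0; i < 8; ++i) {
        mask |= ((v[i] >> 15) & 1) << i;
    }
    return mask;
}

// Scalar model of the byte-wise movemask on the same register:
// bit j of the result is the MSB of byte j (little-endian within each lane).
inline int movemask_epi8_ref(const Lanes16& v) {
    int mask = 0;
    for (int i = 0; i < 8; ++i) {
        const int lo = (v[i] >> 7) & 1;   // MSB of the low byte  -> bit 2*i
        const int hi = (v[i] >> 15) & 1;  // MSB of the high byte -> bit 2*i + 1
        mask |= (lo << (2 * i)) | (hi << (2 * i + 1));
    }
    return mask;
}
```

In this model, movemask_epi8_ref(v) & 0xAAAA keeps only the odd bit positions, which carry the same eight sign bits that movemask_epi16_ref returns packed into the low byte.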
SIMD_WIDTH8 and SIMD_WIDTH16 control the lane counts in kswv.cpp and bandedSWA.cpp. On x86 they are set by the architecture-specific header rather than here; the macros differ per ISA level:
| ISA | SIMD_WIDTH8 | SIMD_WIDTH16 |
|---|---|---|
| SSE4.1 | 16 | 8 |
| AVX2 | 32 | 16 |
| AVX-512BW | 64 | 32 |
| ARM NEON | 16 | 8 |
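The values in this table follow from dividing the register width by the element width. A small sketch of that arithmetic, where the helper name lanes is an assumption for illustration and not a macro or function from the bwa-mem3 sources:

```cpp
#include <cstddef>

// Lanes per SIMD register = register width in bits / element width in bits.
constexpr std::size_t lanes(std::size_t register_bits, std::size_t element_bits) {
    return register_bits / element_bits;
}

// Register widths: SSE4.1 and NEON are 128-bit, AVX2 is 256-bit, AVX-512BW is 512-bit.
static_assert(lanes(128, 8) == 16 && lanes(128, 16) == 8,  "SSE4.1 / ARM NEON");
static_assert(lanes(256, 8) == 32 && lanes(256, 16) == 16, "AVX2");
static_assert(lanes(512, 8) == 64 && lanes(512, 16) == 32, "AVX-512BW");
```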
Dispatch diagram
The full dispatch decision, from the shell to a kernel instruction, follows this flow:
flowchart TD
A[User runs: bwa-mem3 mem ...] --> B{Platform}
B -- ARM / Apple Silicon --> C[Single binary\nbwa-mem3.arm64]
B -- x86 --> D[Launcher: bwa-mem3\nsrc/runsimd.cpp]
D --> E{cpuid: best ISA}
E -- AVX-512BW --> F1[exec bwa-mem3.avx512bw]
E -- AVX2 --> F2[exec bwa-mem3.avx2]
E -- AVX --> F3[exec bwa-mem3.avx]
E -- SSE4.2 --> F4[exec bwa-mem3.sse42]
E -- SSE4.1 --> F5[exec bwa-mem3.sse41]
F1 & F2 & F3 & F4 & F5 --> G[main.cpp\ncompiled with matching ARCH_FLAGS]
C --> G
G --> H{Kernel call}
H -- kswv\nbatched SW --> I[kswv.cpp\nSIMD_WIDTH8/16 from simd_compat.h]
H -- bandedSWA\nmate-rescue --> J[bandedSWA.cpp\nblendv / movemask from simd_compat.h]
H -- FM-index\nbackward extension --> K[FMI_search.cpp\n__builtin_popcountl — not SIMD]
H -- libsais\nBWT construction --> L[libsais.c\nOpenMP parallel SA-IS]
I --> M[SIMD instructions\nat ISA level of this binary]
J --> M
Per-kernel vectorisation status
| Kernel | SSE4.1 | AVX2 | AVX-512BW | ARM NEON |
|---|---|---|---|---|
| kswv (batched Smith-Waterman) | vectorised | vectorised (2x width) | vectorised (4x width) | native NEON |
| bandedSWA (banded SW / mate-rescue) | vectorised | vectorised | vectorised | native NEON blendv |
| FMI_search (FM-index backward ext.) | scalar | scalar | scalar | scalar |
| libsais (BWT / SA construction) | OpenMP only | OpenMP only | OpenMP only | OpenMP only |
| bam_writer (BAM serialisation) | — | — | — | — |
FMI_search is memory-bound with sequential pointer-chasing dependencies; adding SIMD to it produces no measurable speedup. libsais benefits from OpenMP-parallel induced sorting but not from SIMD widening within a single thread.
Adding a new SIMD kernel
- Include simd_compat.h rather than any platform intrinsic header directly.
- Use SIMD_WIDTH8 / SIMD_WIDTH16 for lane-count arithmetic so the code compiles correctly across all ISA levels.
- For ARM-specific optimisations, gate them with #ifdef APPLE_SILICON (or #if defined(__ARM_NEON)) and provide a simd_compat.h-routed fallback for x86.
- Verify correctness on at least SSE4.1 (lowest supported x86 level) and ARM64 using make test.
Tip — Testing SIMD correctness
The kswv unit tests in test/unit/test_kswv*.cpp use synthetic sequence-pair generators that drive edge cases (empty batches, nrow==0, homopolymers) across every SIMD width. Run them with ./test/bwa_mem3_tests_unit --test-suite="unit/kswv" after modifying any vectorised kernel.
See also: Multi-binary launcher · Apple Silicon / NEON port · Building from source · Performance → SIMD dispatch matrix · Regression test framework
Multi-binary launcher (x86)
On x86 Linux and x86 macOS, bwa-mem3 is a thin launcher binary rather than the aligner itself. Its sole job is to detect the host CPU’s capabilities at startup and exec the best-matching ISA-specific binary in the same directory.
ARM / Apple Silicon does not use this mechanism. The make arm64 target creates a symlink bwa-mem3 -> bwa-mem3.arm64; there is only one NEON instruction-set level on all current ARM64 CPUs.
What make multi produces
make multi
Produces six files in the repo root:
| File | ISA | ARCH_FLAGS |
|---|---|---|
| bwa-mem3 | launcher (no aligner code) | compiled from src/runsimd.cpp |
| bwa-mem3.sse41 | SSE4.1 | -msse -msse2 -msse3 -mssse3 -msse4.1 |
| bwa-mem3.sse42 | SSE4.2 | adds -msse4.2 |
| bwa-mem3.avx | AVX | -mavx |
| bwa-mem3.avx2 | AVX2 | -mavx2 |
| bwa-mem3.avx512bw | AVX-512BW | -mavx512f -mavx512bw |
The six binaries must reside in the same directory for the launcher to find them.
How the launcher selects a binary
src/runsimd.cpp calls cpuid (via the __cpuid intrinsic or a hand-rolled CPUID wrapper) to read the CPU’s feature flags and picks the highest ISA level supported by the CPU:
- Check CPUID leaf 7 for AVX-512BW → exec bwa-mem3.avx512bw
- Check CPUID leaf 7 for AVX2 → exec bwa-mem3.avx2
- Check CPUID leaf 1 for AVX → exec bwa-mem3.avx
- Check CPUID leaf 1 for SSE4.2 → exec bwa-mem3.sse42
- Fallback → exec bwa-mem3.sse41
The launcher calls execv with the same argv that was passed to it. The selected binary’s main() therefore receives the original arguments unchanged. The @PG CL: tag in the output SAM/BAM records the original invocation, not the ISA-suffixed binary name.
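The selection order can be sketched as a simple priority chain. This is an illustration under stated assumptions, not the code in src/runsimd.cpp: the CpuFeatures struct and select_binary function are hypothetical, and the real launcher fills in the feature bits from CPUID before calling execv.

```cpp
#include <string>

// Hypothetical feature summary; a real launcher would populate it from CPUID.
struct CpuFeatures {
    bool avx512bw = false;
    bool avx2 = false;
    bool avx = false;
    bool sse42 = false;
};

// Returns the name of the binary to exec, checking the widest ISA first
// and falling back to the SSE4.1 build when nothing better is reported.
inline std::string select_binary(const CpuFeatures& f) {
    if (f.avx512bw) return "bwa-mem3.avx512bw";
    if (f.avx2)     return "bwa-mem3.avx2";
    if (f.avx)      return "bwa-mem3.avx";
    if (f.sse42)    return "bwa-mem3.sse42";
    return "bwa-mem3.sse41";
}
```

Checking from widest to narrowest means a CPU that reports several feature bits always gets the most capable binary it can run.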
Note — exec replaces the process
The launcher does not fork. It calls execv(), which replaces the launcher process image with the ISA-specific binary. There is no wrapper process resident in memory during alignment.
Using a specific ISA binary directly
You can bypass the launcher and invoke a specific binary directly:
./bwa-mem3.avx2 mem -t 16 ref.fa r1.fq.gz r2.fq.gz
This is useful when benchmarking a particular ISA level, testing a regression, or deploying in an environment where only one binary is installed. The ISA-specific binary behaves identically to the launcher output for that ISA — there is no functional difference.
Distribution layout
When packaging or deploying on x86, include all five ISA binaries plus the launcher in the same directory:
bin/
bwa-mem3 ← launcher
bwa-mem3.sse41
bwa-mem3.sse42
bwa-mem3.avx
bwa-mem3.avx2
bwa-mem3.avx512bw
On ARM, only bwa-mem3 (the symlink) and bwa-mem3.arm64 are needed.
The mem SIMD banner
After selecting and executing a binary, the mem subcommand prints a short banner to stderr before alignment begins:
-----------------------------
Executing in AVX2 mode!!
-----------------------------
The banner text is set at compile time via #if __AVX512BW__ / #elif __AVX2__ / … preprocessor guards in src/main.cpp. This confirms at runtime which ISA path is active.
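A preprocessor chain of this kind can be sketched as follows. The function name compiled_isa and the returned strings are illustrative assumptions, not the code or banner text in src/main.cpp; the feature macros are the ones GCC and Clang predefine for the selected target flags.

```cpp
// Returns the ISA level this translation unit was compiled for.
// The choice is fixed at compile time by the predefined feature macros,
// tested from the widest level down.
inline const char* compiled_isa() {
#if defined(__AVX512BW__)
    return "AVX512BW";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE4_2__)
    return "SSE4.2";
#elif defined(__SSE4_1__)
    return "SSE4.1";
#elif defined(__ARM_NEON) || defined(__aarch64__)
    return "NEON";
#else
    return "scalar";
#endif
}
```

Because wider ISA flags imply the narrower feature macros as well, the order of the branches matters: testing the narrowest level first would report it on every build.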
See also: SIMD dispatch architecture · Apple Silicon / NEON port · Building from source · Performance → SIMD dispatch matrix
Apple Silicon / NEON port
bwa-mem3 supports ARM64 (Apple Silicon and Linux aarch64) as a first-class build target. The port uses the sse2neon translation shim as a baseline and replaces the two most performance-critical SSE paths with native NEON intrinsics.
Architecture overview
The ARM build compiles a single binary (bwa-mem3.arm64) rather than a family of ISA-specific binaries. There is only one NEON instruction-set level on all current ARM64 CPUs, so the multi-binary launcher used on x86 is not needed. make arm64 builds the binary and creates a symlink bwa-mem3 -> bwa-mem3.arm64.
sse2neon shim
ext/sse2neon/sse2neon.h is a header-only library that maps Intel SSE intrinsics to their NEON equivalents. When APPLE_SILICON=1 is defined (set automatically when uname -m is arm64 or aarch64), src/simd_compat.h includes sse2neon and defines the SSE feature test macros (__SSE__ through __SSE4_2__) so that code guarded by those macros compiles without changes.
The translation is not zero-cost for all operations. Two patterns that sse2neon handles poorly are replaced with native NEON in src/simd_compat.h:
- _mm_movemask_epi16: used heavily in bandedSWA.cpp to extract the sign bit of each 16-bit lane. The native implementation shifts right by 15, narrows to 8-bit with vmovn_u16, and reduces with position-weighted vaddv_u8.
- _mm_blendv_epi16_fast: a bitwise select on 16-bit lanes using vbslq_s16. Replaces the three-operation OR/AND/ANDNOT sequence sse2neon emits for _mm_blendv_epi8.
Memory alignment
Apple Silicon uses 128-byte cache lines (versus 64 bytes on x86). simd_compat.h overrides _mm_malloc on ARM to call posix_memalign with a minimum alignment of 128 bytes for all SIMD allocations. CACHE_LINE_BYTES is set to 128 in macro.h when APPLE_SILICON=1.
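The alignment policy can be sketched with plain POSIX calls. This is a simplified stand-in, not the actual _mm_malloc override in simd_compat.h: the names kMinAlign and aligned_alloc_min128 are hypothetical, and the 128-byte minimum is taken from the description above.

```cpp
#include <cstddef>
#include <stdlib.h>   // posix_memalign, free (POSIX)

// Assumed minimum alignment: the 128-byte cache line described above.
constexpr std::size_t kMinAlign = 128;

// Raises the requested alignment to at least kMinAlign and allocates with
// posix_memalign. The alignment must be a power of two.
// Returns nullptr on failure; release the memory with free().
inline void* aligned_alloc_min128(std::size_t size, std::size_t align) {
    if (align < kMinAlign) align = kMinAlign;
    void* p = nullptr;
    if (posix_memalign(&p, align, size) != 0) return nullptr;
    return p;
}
```

A caller that asks for 16-byte alignment still receives a 128-byte-aligned block, so a buffer that starts on a cache-line boundary never shares its first line with a neighbouring allocation.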
Accelerate.framework
The Makefile links -framework Accelerate on macOS ARM builds. The framework is linked but not used for computation: bwa-mem3’s hot paths (Smith-Waterman, FM-index) do not match the large-matrix / large-vector patterns that BLAS and vDSP target. The link is retained to keep the option open and adds no overhead at runtime.
P-core / E-core detection
src/fastmap.cpp calls HTStatus() on macOS to detect the Apple Silicon microarchitecture. HTStatus() reads the hw.perflevel0.physicalcpu and hw.perflevel1.physicalcpu sysctl keys to report P-core and E-core counts and the L2 cache size (typically 4 MB on M-series chips). This information is printed at startup for diagnostic purposes. The L2 cache size is used to validate the compile-time BATCH_SIZE setting (currently 1024, which was already optimal for a 4 MB L2 cache).
Benchmark results
All measurements use 100K paired-end reads, 5% error rate, 30% indels, chr17 reference, 8 threads, on an M-series Apple Silicon machine.
| Build | Wall-clock (avg, s) | vs. previous row |
|---|---|---|
| sse2neon baseline (no native NEON) | 15.4 | — |
| + native NEON kswv.cpp | 14.4 | ~7% faster |
| + native NEON bandedSWA.cpp blendv | 13.8 | ~4% faster |
| PGO on top of native NEON | ~13.4 | ~3% further |
The FM-index (FMI_search.cpp) is memory-bound with sequential pointer-chasing dependencies and does not benefit from SIMD. libsais benefits from OpenMP-parallel suffix-array construction but not from SIMD widening within a single thread.
Optimization task summary
| Task | Status | Impact | Notes |
|---|---|---|---|
| Correctness verification | done | — | 200,006 alignments, 0 differences vs. reference |
| Dynamic L2 cache detection | done | ~0% | 4 MB detected; compile-time BATCH_SIZE=1024 already optimal |
| Native NEON bandedSWA.cpp | done | ~4% | vbsl-based blendv in simd_compat.h |
| Multi-binary launcher | N/A | 0% | Not applicable on ARM (single NEON level) |
| Accelerate.framework | done | ~0% | Linked; no suitable compute patterns |
| M1/M2/M3/M4 detection | done | ~0% | P/E-core counts and L2 cache via sysctl |
| Native NEON FMI_search.cpp | N/A | 0% | Memory-bound; SIMD cannot help |
| Profile-Guided Optimization | done | ~3% | make pgo-generate / make pgo-use |
Building for Apple Silicon
# Standard arm64 build
make arch=arm64
# PGO build (recommended for production on Apple Silicon)
make pgo-generate PGO_ARCH=arm64
./bwa-mem3.pgo-instr mem -t 8 ref.fa r1.fq.gz r2.fq.gz > /dev/null
make pgo-use PGO_ARCH=arm64
The resulting bwa-mem3.pgo binary delivers the full ~10% improvement over the pure sse2neon baseline.
Tip — Recommended production build on Apple Silicon
Use PGO for production deployments. The combined ~10% improvement from native NEON kernels plus PGO is consistent and verified on M-series hardware.
Files modified in the NEON port
- src/kswv.cpp, src/kswv.h: native NEON batched Smith-Waterman
- src/bandedSWA.h: SIMD width definitions for ARM
- src/simd_compat.h: sse2neon integration, aligned allocation, _mm_blendv_epi16_fast, _mm_movemask_epi16
- src/fastmap.cpp: L2 cache detection, HTStatus() for non-NUMA (macOS)
- src/macro.h: BATCH_SIZE and CACHE_LINE_BYTES tuning for Apple Silicon
- Makefile: arm64 target, sse2neon flags, Accelerate linkage, PGO targets
See also: SIMD dispatch architecture · Building from source · Performance → PGO build · Performance → SIMD dispatch matrix · What’s Different → Architecture support
Regression test framework
bwa-mem3 has three categories of tests — unit, integration, and regression — plus a separate benchmark harness in bench/. Understanding the distinction helps you choose where to add a new test and what to expect from CI.
Test categories
| Category | Binary / runner | Fixtures | CI scope |
|---|---|---|---|
| unit | test/bwa_mem3_tests_unit | None; all inputs synthetic | Every matrix row |
| integration | test/bwa_mem3_tests_integration | Small committed FASTAs / FMI in test/fixtures/ | SSE4.1, AVX2, ARM64 Linux, macOS ARM |
| regression | test/regression/*.sh | Downloaded references (phiX, chr22) + bwa + dwgsim | Canonical AVX2 row only |
Unit tests must use only synthetic inputs generated programmatically and complete in under 100 ms each. They exercise individual kernels in isolation: kswv scoring, banded Smith-Waterman, KSW, FM-index operations, SMEM extraction, BAM encoding, and pair handling.
Integration tests may load small committed fixtures from test/fixtures/ and have a per-test budget of 10 seconds. They exercise cross-component paths: index loading, SMEM-to-alignment pipelines, and output format validation.
Regression tests are standalone bash scripts that shell out to the bwa-mem3 binary, may diff against third-party tool output (bwa, bwa-meth, samtools), and require fixtures that are either committed to the fixtures directory or downloaded by CI at run time.
Running tests locally
# Build the aligner and test binaries
make
make -C test -j$(nproc 2>/dev/null || sysctl -n hw.ncpu)
# Run all unit tests
./test/bwa_mem3_tests_unit
# Run all integration tests
./test/bwa_mem3_tests_integration
# Run a specific test case or suite
./test/bwa_mem3_tests_unit --test-case="*kswv*"
./test/bwa_mem3_tests_unit --test-suite="unit/kswv"
./test/bwa_mem3_tests_unit --test-suite-exclude=slow
# Verbose output (also print passing assertions)
./test/bwa_mem3_tests_unit --success
The make test target is a convenience shortcut that builds and runs the unit and integration binaries plus the two legacy standalone regression tests (kswv_nrow_zero_test and shm_section_find_test):
make test
Running a regression test locally
Regression scripts expect certain environment variables to point at fixtures. The phiX parity test requires dwgsim:
mkdir -p /tmp/ci-test && cd /tmp/ci-test
curl -sL "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/819/615/GCF_000819615.1_ViralProj14015/GCF_000819615.1_ViralProj14015_genomic.fna.gz" | gunzip > phix174.fa
dwgsim -z 42 -N 500 -1 150 -2 150 -r 0.001 -S 2 phix174.fa reads
cd -
BWA_MEM2="$(pwd)/bwa-mem3" CI_TEST_DIR=/tmp/ci-test bash test/regression/phix_parity.sh
Test framework
The unit and integration binaries are built on doctest, a single-header C++ test framework. Tests are discovered by file glob: any test/unit/test_*.cpp file is compiled into the unit binary; any test/integration/test_*.cpp file is compiled into the integration binary. No Makefile edit is needed when adding a new test_*.cpp.
Test organisation
Tag each TEST_CASE with doctest::test_suite("category/module"):
TEST_CASE("nrow==0 batch does not store out of bounds"
* doctest::test_suite("unit/kswv")) {
// ...
}
The test_suite decorator is overriding (not additive). Encode the category (unit or integration) and module (kswv, bandedsw, ksw, fmindex, smem, bam, pair, cigar, util) as a single slash-separated string.
Framework helpers
The test/framework/ directory provides helpers shared across test files:
| Header | Provides |
|---|---|
| scoring.h | ScoringMatrix, build_scoring_matrix, default_scoring_matrix |
| seqpair.h | TestPair struct |
| seqpair_gen.h | Deterministic pair generators: random, exact-match, all-mismatch, homopolymer, sub-cluster, N-bases |
| seqpair_batch.h | BatchBuffers, a flat-layout packer for kswv batch input |
| ksw_runner.h | run_scalar_ksw, default gap/extra parameters |
| kswv_runner.h | Two-pass run_kswv_batch |
| kswr_cmp.h | Score / coordinate / score2 comparators |
| junit_reporter.h | CI matrix-row banner and JUnit XML output |
Debugging a failing test
# Break into debugger at the first failing assertion
./test/bwa_mem3_tests_unit --test-case="*kswv*" --break
# Run a single SUBCASE
./test/bwa_mem3_tests_unit --test-case="*foo*" --subcase="bar"
# Enable per-phase diagnostics for kswv tests
BWA_TESTS_DEBUG_PHASE0=1 BWA_TESTS_DEBUG_PHASE1=1 \
./test/bwa_mem3_tests_unit --test-suite="unit/kswv"
JUnit artifacts are uploaded per CI matrix row (unit-results-<name>.xml, integration-results-<name>.xml) and available on the Actions run page.
Tip — Use ASAN for memory bugs
Build with make ASAN=1 test to catch out-of-bounds writes in vectorised kernels. The kswv_nrow_zero_test specifically exercises the nrow==0 path that triggered a pre-allocation store bug; ASAN reports this immediately rather than at a later allocator operation.
Standalone regression tests
Three standalone regression tests live outside the doctest harness because they predate it. The two binaries are built and run by make test; the third is script-driven:
- kswv_nrow_zero_test: binary; exercises the all-len1==0 batch path in every SIMD kswv variant. Catches the nrow==0 rowMax store overrun from issue #38 / upstream bwa-mem2 PR #289.
- shm_section_find_test: binary; exercises the shared-memory index section-find logic.
- shm_pack_round_trip_test: script-driven, invoked via test/shm_pack_round_trip_test.sh, which builds the phiX index first.
Additional integration shell scripts in test/:
| Script | What it tests |
|---|---|
| pg_cl_escape_test.sh | @PG CL: tab/newline escape in SAM headers |
| mimalloc_loaded_test.sh | mimalloc override is active when USE_MIMALLOC=1 |
| shm_round_trip_test.sh | bwa-mem3 shm load / list / drop cycle |
| shm_meth_test.sh | --meth index compatibility with shm |
| help_prescan_test.sh | --help prints without running alignment |
| libsais_*.sh | libsais index correctness vs. BWA / determinism |
Benchmark harness (bench/)
bench/ is a separate performance measurement harness used during development to gate performance PRs. It is not part of the CI test suite.
cp bench/config.env.example bench/config.env
# Edit config.env to point at your index, reads, and binary paths
bench/run.sh baseline # N trials; appends to bench/results.csv
bench/run.sh candidate # N trials on the candidate binary
bench/compare.sh baseline candidate # wall-clock / RSS / md5 delta report
Each run records: tag, host, architecture, binary path, thread count, trial index, wall-clock seconds, max RSS (KB), and a golden md5 (single-threaded, @PG-stripped SAM). The md5 verifies byte-identical output across builds; wall-clock is the primary performance metric.
See also: Building from source · SIMD dispatch architecture · Contributing · What’s Different → Correctness fixes · What’s Different → Build & infrastructure
Release process
bwa-mem3 follows semantic versioning. Releases are driven by git tags. The version string is derived automatically from git describe and embedded in every binary at compile time.
Version stamping
The Makefile computes the version string at parse time:
FG_LABS_VERSION_FALLBACK := 0.1.0-pre
VERSION_STRING := $(shell git describe --tags --dirty 2>/dev/null || echo $(FG_LABS_VERSION_FALLBACK))
git describe produces a string such as v0.1.0 (on a tag), v0.1.0-3-gabcdef1 (three commits past the tag), or v0.1.0-dirty (uncommitted changes). If git describe fails — for example in a source tarball or a shallow clone without tag history — the build falls back to FG_LABS_VERSION_FALLBACK.
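The three shapes of the describe string can be told apart mechanically. The sketch below is an illustration only: DescribeInfo and parse_describe are hypothetical names that do not exist in the bwa-mem3 sources, and the pattern assumes the v-prefixed, three-component tags described in this guide. A bare fallback such as 0.1.0-pre has no v prefix and is reported as not matching.

```cpp
#include <regex>
#include <string>

// Hypothetical breakdown of a `git describe --tags --dirty` string.
struct DescribeInfo {
    std::string tag;      // e.g. "v0.1.0"
    int distance = 0;     // commits since the tag (0 when on the tag)
    std::string sha;      // abbreviated commit hash, empty when on the tag
    bool dirty = false;   // true when the "-dirty" suffix is present
    bool ok = false;      // false when the string does not have the expected shape
};

inline DescribeInfo parse_describe(const std::string& s) {
    // tag, then optional "-<distance>-g<sha>", then optional "-dirty"
    static const std::regex re(
        R"(^(v\d+\.\d+\.\d+(?:-pre)?)(?:-(\d+)-g([0-9a-f]+))?(-dirty)?$)");
    DescribeInfo info;
    std::smatch m;
    if (!std::regex_match(s, m, re)) return info;
    info.tag = m[1].str();
    if (m[2].matched) {
        info.distance = std::stoi(m[2].str());
        info.sha = m[3].str();
    }
    info.dirty = m[4].matched;
    info.ok = true;
    return info;
}
```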
The string is written into src/version.h by the src/version.h: FORCE rule, which runs on every make invocation but only touches the file when the string changes. This minimises unnecessary recompilation of src/main.o.
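The compare-before-write idea behind that rule can be sketched in C++. The Makefile recipe itself is not reproduced here; write_if_changed is a hypothetical helper that only illustrates why an unchanged version string leaves the header's modification time alone.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Rewrites `path` only when its current contents differ from `content`.
// Returns true if the file was (re)written and false if it was left untouched,
// in which case its modification time does not change and nothing that
// depends on it needs to be rebuilt.
inline bool write_if_changed(const std::string& path, const std::string& content) {
    {
        std::ifstream in(path, std::ios::binary);
        if (in) {
            std::ostringstream current;
            current << in.rdbuf();
            if (current.str() == content) return false;
        }
    }
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    out << content;
    return true;
}
```

A missing file counts as changed, so the first build always writes the header.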
PACKAGE_VERSION from src/version.h appears in:
- bwa-mem3 version output (stdout).
- The @PG VN: field in every SAM/BAM file produced by bwa-mem3 mem.
Verifying the version
./bwa-mem3 version
# Example output on a tagged commit:
# v0.1.0
# mimalloc 3.3.0 ← if USE_MIMALLOC=1
On an untagged commit the string includes the commit distance and short SHA:
v0.1.0-12-g3f7ab2e
Tagging a release
- Confirm main builds cleanly and all tests pass:
make clean && make
make test
make docs
- Update NEWS.md with the release notes for the new version.
- Tag the release commit:
git tag -s v0.X.Y -m "Release v0.X.Y"
Use a signed tag (-s) when your GPG key is available. Annotated tags (-a) are acceptable when signing is not possible.
- Push the tag to the fg-labs remote:
git push fg-labs v0.X.Y
Read the Docs activates a versioned build at /v0.X.Y/ automatically when the tag appears on the remote.
- Create a GitHub release from the tag via gh:
gh release create v0.X.Y --repo fg-labs/bwa-mem3 \
  --title "bwa-mem3 v0.X.Y" \
  --notes-file <(sed -n '/^## \[0\.X\.Y\]/,/^## /p' NEWS.md | head -n -1)
Note — Tarball builds
Source tarballs created by GitHub (or git archive) do not include git history, so git describe fails and the version falls back to FG_LABS_VERSION_FALLBACK. For reproducible tarball builds, set VERSION_STRING explicitly on the command line: make VERSION_STRING=v0.X.Y.
Branch and tag conventions
- All release tags are on the main branch, which carries both upstream bwa-mem2 commits and fork-carried changes. See Branch and worktree conventions for the full branching model.
- Tags are prefixed with v: v0.1.0, v0.2.0, etc.
- Pre-release tags use a -pre suffix: v0.1.0-pre.
- Patch releases increment the third component: v0.1.1.
What’s Different table update
When a release bundles new fork-carried commits that were not previously documented, update the FG-MAIN-TABLE in docs/src/whats-different/overview.md in the same PR before tagging. See Contributing for the rule.
See also: Branch and worktree conventions · What’s Different → Overview · Reference → Changelog · Building from source
Branch and worktree conventions
This page describes how the bwa-mem3 repository branches relate to upstream bwa-mem2, the policy for where PRs land, and the conventions for local worktrees when working on multiple branches simultaneously.
Branch model
master — upstream mirror
master tracks the upstream bwa-mem2 master branch verbatim. No fork-carried changes are applied here. When upstream bwa-mem2 merges new commits, master is fast-forwarded to match.
master is the starting point for upstream rebase operations. It is never the target of fork PRs.
main — fork integration branch
main carries all fork-carried commits on top of a rebased upstream baseline. This is the branch that:
- All new feature, fix, and improvement PRs target.
- All git tags (v0.X.Y) are placed on.
- Read the Docs /latest/ follows.
When upstream bwa-mem2 makes significant changes, master is fast-forwarded and then main is rebased onto the new master tip. The rebase is verified by running the full test suite before the result is pushed.
Feature and fix branches
All development work happens on short-lived branches that are merged into main via pull request. Branch name conventions:
| Prefix | Use |
|---|---|
| feat/ | New features or capabilities |
| fix/ | Bug fixes |
| perf/ | Performance improvements |
| test/ | Test additions or improvements |
| docs/ | Documentation changes |
| ci/ | CI / build system changes |
| refactor/ | Code restructuring without behaviour change |
Branch names use kebab-case after the prefix: fix/kswv-nrow-zero, perf/libsais-fm-index, test/regression-tests.
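The naming convention is simple enough to check mechanically, for example in a pre-push hook. The sketch below is hypothetical and not part of the repository: is_valid_branch_name is an illustrative name, and the pattern assumes lowercase letters and digits separated by single hyphens after one of the prefixes in the table.

```cpp
#include <regex>
#include <string>

// Returns true when `name` is "<prefix>/<kebab-case-slug>" with one of the
// documented prefixes and a slug of lowercase alphanumeric words joined by
// single hyphens.
inline bool is_valid_branch_name(const std::string& name) {
    static const std::regex re(
        R"(^(feat|fix|perf|test|docs|ci|refactor)/[a-z0-9]+(-[a-z0-9]+)*$)");
    return std::regex_match(name, re);
}
```

Under these assumptions fix/kswv-nrow-zero and perf/libsais-fm-index pass, while a name with an unlisted prefix or uppercase letters is rejected.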
Upstream rebase cadence
main is rebased onto master (i.e., onto upstream bwa-mem2) periodically — not on every upstream commit, but when upstream merges a batch of changes worth incorporating. The process is:
- Fast-forward master to the new upstream tip.
- Rebase main onto master, resolving any conflicts.
- Run make && make test to confirm the rebase is correct.
- Push master and main to the fg-labs remote.
Warning — Do not merge upstream into main
Always rebase rather than merge when incorporating upstream changes. Merge commits obscure the fork-carried commit history and make the What’s Different table harder to maintain.
Worktrees for parallel branches
When working on multiple branches simultaneously, use git worktrees instead of stashing or switching branches. Each worktree is a sibling directory of the main clone.
Creating a worktree for a PR branch
# Fetch the PR's head branch from the fg-labs remote
git fetch fg-labs <head-branch-name>
# Create a worktree with a local branch tracking the remote branch
git worktree add ../pr-<N> -b pr-<N> --track fg-labs/<head-branch-name>
The local branch name and directory name match the PR number (pr-N).
Creating a worktree for a new issue branch
# Fetch the latest main from fg-labs
git fetch fg-labs main
# Create a new feature branch off fg-labs/main
git worktree add ../issue-<N> -b <prefix>/issue-<N>-<short-slug> fg-labs/main
# Unset the upstream so the branch is untracked until first push
git -C ../issue-<N> branch --unset-upstream
On first push, push to fg-labs so the head branch is in the same organisation as the PR base:
git push -u fg-labs HEAD
Worktree naming conventions
| Directory name | Branch type |
|---|---|
| main/ | Primary checkout; tracks fg-labs/main |
| pr-<N>/ | PR review; local branch pr-N tracks fg-labs/<head-branch> |
| issue-<N>/ | Issue work; local branch <prefix>/issue-N-<slug> |
| Descriptive name | Feature work not yet tied to a PR or issue |
Listing and removing worktrees
# List all worktrees
git worktree list
# Remove a worktree after the PR is merged
git worktree remove ../pr-<N>
git branch -D pr-<N>
# Remove an issue worktree
git worktree remove ../issue-<N>
git branch -D <prefix>/issue-<N>-<slug>
Note — Worktree directories are siblings, not nested
All worktree directories sit next to the main clone at the same directory level, not inside it. This avoids confusing git commands that walk parent directories looking for .git.
PR policy
- All PRs target main.
- PRs from fork contributors should be opened against fg-labs/bwa-mem3 main.
- Every PR that adds a fork-carried commit must update the FG-MAIN-TABLE in docs/src/whats-different/overview.md in the same PR. See Contributing.
- Merge policy: squash-merge for single-commit changes; rebase-merge for multi-commit PRs with a clean commit history.
See also: Contributing · Release process · What’s Different → Overview · Building from source
Contributing
This page covers the mechanics of submitting changes to bwa-mem3: commit conventions, PR workflow, CI requirements, and the rule for keeping the fork-lineage table current.
Before you start
- Check the open issues and existing PRs to avoid duplicate work.
- For substantial changes, open an issue first to discuss scope and approach.
- Fork or branch from fg-labs/bwa-mem3 main. See Branch and worktree conventions for the branching model.
Commit message conventions
bwa-mem3 follows Conventional Commits (v1.0.0). Every commit message must start with a type prefix:
| Prefix | Use |
|---|---|
| feat: | New feature or capability |
| fix: | Bug fix |
| perf: | Performance improvement |
| test: | Test additions or changes |
| docs: | Documentation only |
| ci: | CI / build-system changes |
| refactor: | Restructuring without behaviour change |
| chore: | Maintenance (dependency bumps, version pins) |
The subject line is lowercase after the prefix, imperative mood, no trailing period. Keep it under 72 characters. Body lines wrap at 100 characters.
Good:
fix: kswv nrow==0 batch skips rowMax store when i==0
Exercises the all-len1==0 path across SSE4.1, AVX2, AVX-512BW, and ARM NEON.
Without the `if (i > 0)` guard, the store writes SIMD_WIDTH* bytes before the
allocation.
Closes #38.
Not acceptable:
Fixed stuff
Updated kswv
WIP
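The mechanical parts of the subject-line rules (type prefix, lowercase start, no trailing period, length limit) can be linted. The sketch below is illustrative only: `check_subject` is a hypothetical helper, it does not attempt to judge imperative mood, and it ignores the optional scope syntax that Conventional Commits allows. "Lowercase after the prefix" is read here as "the first character of the description is not uppercase", which is an assumption based on the accepted example containing the identifier rowMax.

```python
import re

# Type prefixes copied from the table above.
COMMIT_TYPES = ("feat", "fix", "perf", "test", "docs", "ci", "refactor", "chore")
_SUBJECT_RE = re.compile(r"(%s): (.+)" % "|".join(COMMIT_TYPES))


def check_subject(subject: str) -> list:
    """Return a list of problems found in a commit subject line."""
    match = _SUBJECT_RE.fullmatch(subject)
    if match is None:
        return ["missing or unknown type prefix"]
    problems = []
    description = match.group(2)
    if description[0].isupper():
        problems.append("description should start lowercase")
    if subject.endswith("."):
        problems.append("trailing period")
    if len(subject) >= 72:
        problems.append("subject is 72 characters or longer")
    return problems


if __name__ == "__main__":
    for line in (
        "fix: kswv nrow==0 batch skips rowMax store when i==0",
        "Fixed stuff",
        "fix: Handle tabs in -R.",
    ):
        print(repr(line), check_subject(line))
```

An empty list means the subject passed every check the sketch knows about; it does not mean the message is good.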
Pull request workflow
- Push your branch to fg-labs/bwa-mem3 (or your fork) and open a PR targeting fg-labs/bwa-mem3 main.
- The PR description should explain the motivation, summarise the change, and note any benchmarks or test results.
- All CI jobs must pass before merge. See CI matrix below.
- CodeRabbitAI reviews every PR automatically. Address all comments, including inline suggestions, summary comments, and nitpicks. Do not dismiss comments without a reply explaining why the suggestion was not adopted.
- A project maintainer will review and merge once CI is green and all comments are resolved.
Note — Draft PRs first
Open PRs as drafts while CI is running or while you are actively revising. Convert to ready-for-review only when the branch is stable, CI is green, and you have self-reviewed the diff.
The FG-MAIN-TABLE rule
Every PR that introduces a new fork-carried commit — a commit that is on main but not on master (the upstream bwa-mem2 mirror) — must update the FG-MAIN-TABLE block in docs/src/whats-different/overview.md in the same PR.
The table records each fork-carried change, its bwa-mem3 PR number, the corresponding upstream bwa-mem2 PR or issue (if any), and its upstream status. Keeping this table current is the primary mechanism by which the project maintains transparency about its relationship to upstream.
Warning — Do not skip the table update
A PR that adds a fork-carried commit but omits the table update will be sent back for revision. The table is reviewed as part of the standard PR checklist.
What counts as a fork-carried commit
A commit is fork-carried if:
- It adds new behaviour, fixes a bug, or changes build infrastructure in a way that diverges from upstream bwa-mem2 master.
- It is present on fg-labs/bwa-mem3 main but not (yet) merged upstream.
Pure documentation commits, CI-only changes, and upstream-rebase bookkeeping commits do not need a table entry.
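Because fork-carried commits are those on main but not on master, `git log --oneline master..main` lists the candidates. The sketch below parses that output and drops commits whose Conventional Commit type suggests they are exempt. Using the type prefix as a proxy for "pure documentation" or "CI-only" is an assumption, upstream-rebase bookkeeping commits cannot be recognised this way, and the sample SHAs and subjects are made up for illustration.

```python
# Commit types treated as exempt from the table (assumed proxy, see above).
EXEMPT_TYPES = ("docs", "ci")


def needs_table_entry(oneline_log: str) -> list:
    """Parse `git log --oneline master..main` output.

    Returns (sha, subject) pairs for commits that probably need a row in
    the FG-MAIN-TABLE. A human still has to review the result.
    """
    candidates = []
    for line in oneline_log.splitlines():
        line = line.strip()
        if not line:
            continue
        sha, _, subject = line.partition(" ")
        commit_type = subject.split(":", 1)[0]
        if commit_type in EXEMPT_TYPES:
            continue
        candidates.append((sha, subject))
    return candidates


if __name__ == "__main__":
    # Hypothetical log output, not real project history.
    sample = (
        "a1b2c3d fix: handle tabs in read-group strings\n"
        "e4f5a6b docs: describe the shm subcommand\n"
        "0c9d8e7 perf: batch header ingestion\n"
    )
    for sha, subject in needs_table_entry(sample):
        print(sha, subject)
```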
CI matrix
CI runs on every PR and on push to main. The matrix covers:
| Row | Architecture | ISA | Platform |
|---|---|---|---|
| sse41 | x86_64 | SSE4.1 | Ubuntu |
| avx2 | x86_64 | AVX2 | Ubuntu (canonical) |
| avx512bw | x86_64 | AVX-512BW | Ubuntu |
| arm64-linux | aarch64 | NEON | Ubuntu ARM |
| arm64-macos | arm64 | NEON | macOS |
Regression tests (shell scripts in test/regression/) run only on the canonical row (avx2). Unit tests run on every row. Integration tests run on four rows: sse41, avx2, arm64-linux, and arm64-macos.
A PR must pass all rows before merge.
Code style
- C++14, gnu++14 dialect.
- Match the style of the surrounding code. The codebase inherits the upstream bwa-mem2 style, which is C-ish C++ with minimal STL use in hot paths.
- For new test code, follow the doctest patterns documented in the test framework.
- New SIMD code must include src/simd_compat.h rather than platform-specific headers directly. See SIMD dispatch architecture.
Adding a test for your change
- Bug fix → add a unit test or integration test that fails without the fix and passes with it.
- New feature → add unit tests for the core logic and, if the feature is end-to-end testable with a shell invocation, a regression test in test/regression/.
- Performance change → run the benchmark harness (bench/) to confirm the improvement and include median wall-clock numbers in the PR description.
See Regression test framework for the full guide on where to add tests and how to organise them.
See also: Branch and worktree conventions · Regression test framework · Release process · What’s Different → Overview · Building from source
bwa-mem3-bench
bwa-mem3-bench is a benchmarking suite that measures the alignment performance of bwa-mem3 against the upstream bwa-mem2 v2.2.1 baseline. It runs on AWS Batch spot instances across four dataset types: whole-genome sequencing (WGS), whole-exome sequencing (WES), panel, and bisulfite-sequencing (methylation). All four are aligned against the hg38 reference. The suite covers three CPU microarchitectures: ARM NEON, x86 AVX2, and x86 AVX-512. Results are collected into a SQLite database for local analysis and reporting. The project is implemented in Python (orchestration, reporting, and CLI), Rust (BAM comparison tool), Snakemake (alignment workflow), and AWS CDK (cloud infrastructure).
When you’d use it
Use bwa-mem3-bench when you need reproducible, multi-architecture throughput numbers before committing a bwa-mem3 change to production or before deciding whether to adopt bwa-mem3 in place of bwa-mem2. It provides a structured “bless baseline, then compare” workflow: an upstream bwa-mem2 run is blessed once per upstream tag and stored in S3; subsequent bwa-mem3 runs are measured against that fixed baseline. Running a full benchmark fires a Snakemake coordinator job on AWS Batch and costs roughly $10 in spot capacity.
How it relates to bwa-mem3
bwa-mem3-bench is the authoritative source of benchmark evidence for every performance claim made in the bwa-mem3 documentation and changelog. When the Performance Overview cites speedup numbers, those numbers come from bwa-mem3-bench runs collected after the relevant PR was merged. The suite also validates that bwa-mem3 does not regress relative to bwa-mem2 on any supported architecture before a new release is tagged.
Links
- GitHub: https://github.com/fg-labs/bwa-mem3-bench
- License: MIT
See also: Performance Overview · SIMD dispatch matrix · bwa-mem2 (upstream) · Release process
bwa-mem3-rs
bwa-mem3-rs is a Rust crate that provides idiomatic bindings to the bwa-mem family of short-read aligners — bwa (original), bwa-mem2, and bwa-mem3. It exposes a safe Rust API over the underlying C++ alignment engine, allowing Rust programs to index a reference, configure alignment parameters, and align reads without shelling out to an external process. The bindings link statically against the chosen backend, so a binary built with bwa-mem3-rs carries the aligner and its SIMD kernels as a self-contained artifact.
When you’d use it
Use bwa-mem3-rs when you are building a Rust bioinformatics tool or pipeline that needs short-read alignment as an in-process library call rather than a subprocess invocation. It is especially useful when latency between reads arriving and alignments being available matters (no process-startup overhead), or when you want tight integration between the aligner’s output and downstream Rust code such as UMI grouping, consensus calling, or duplicate marking.
How it relates to bwa-mem3
bwa-mem3-rs targets bwa-mem3 as its primary high-performance backend. It is the intended integration path for fgumi and other Fulcrum Genomics tools that need alignment as a library dependency. Changes to bwa-mem3’s public API, flag semantics, or output format are coordinated with bwa-mem3-rs to keep the bindings current.
Links
- GitHub: https://github.com/fg-labs/bwa-mem3-rs
- License: MIT
See also: fgumi · bwa-mem3-bench · Aligning short reads (mem) · Developer Guide — Contributing
bwa-mem2 (upstream)
bwa-mem2 is the direct predecessor of bwa-mem3 and the project from which the bwa-mem3 fork is derived. It was created at Intel’s Parallel Computing Lab by Vasimuddin Md and Sanchit Misra to accelerate the alignment algorithm originally written by Heng Li in bwa. bwa-mem2 achieves a 1.3–3.1x throughput improvement over the original bwa-mem by replacing key inner loops with vectorised implementations (SSE4.1, SSE4.2, AVX2, and AVX-512) and by switching to a more compact FM-index encoding. Its output is identical to bwa-mem at the alignment level, and it is distributed under the MIT license.
Lineage
The bwa alignment family has evolved through three generations, each building on the last:
- bwa: Written by Heng Li. Established the BWA-MEM algorithm, the SAM output format conventions, and the .bwt/.pac/.ann/.amb index layout.
- bwa-mem2 (Vasimuddin et al., Intel): Replaced scalar inner loops with SIMD kernels; introduced the compact .bwt.2bit.64 and .0123 index formats; retained full output compatibility with bwa-mem.
- bwa-mem3 (Fulcrum Genomics fork): Carries correctness fixes, performance improvements, new features (bisulfite alignment, mimalloc, ARM NEON), and expanded architecture support on top of the bwa-mem2 codebase. See What’s Different from bwa-mem2 for the full change catalog.
When you’d use it
Use bwa-mem2 directly when you need a stable, widely validated aligner with precompiled binaries available via Bioconda and the project’s GitHub releases page, and when you do not require the features or fixes that bwa-mem3 adds. bwa-mem2 is also the right choice when you are working in an environment where the bwa-mem3 fork has not yet been validated against your specific reference or sequencing library type.
How it relates to bwa-mem3
bwa-mem3 tracks bwa-mem2’s master branch and periodically rebases fork-carried
commits on top of upstream changes. The What’s Different
section documents every divergence between the two projects, and the
Upstream PR status page tracks which bwa-mem3
changes have been proposed back to bwa-mem2. The goal is to keep the fork divergence
minimal and to upstream as many fixes as practical.
Links
- GitHub: https://github.com/bwa-mem2/bwa-mem2
- Citation: Vasimuddin Md, Sanchit Misra, Heng Li, Srinivas Aluru. “Efficient Architecture-Aware Acceleration of BWA-MEM for Multicore Systems.” IEEE IPDPS 2019.
- License: MIT (with third-party components under their respective licenses)
See also: What’s Different from bwa-mem2 · Upstream PR status · bwa-mem3-bench · Citation
fgumi
fgumi (Fulcrum Genomics Unique Molecular Indexing tools) is a high-performance suite of command-line tools for processing UMI-tagged next-generation sequencing data. Written in Rust, it provides UMI extraction from FASTQ files, read grouping by UMI with configurable assignment strategies, UMI-aware deduplication, simplex and duplex consensus calling, CODEC consensus calling, quality filtering of consensus reads, and overlapping read-pair clipping. fgumi is the intended successor to the Scala-based fgbio toolkit for UMI processing, targeting significantly higher throughput on multi-core systems. It is published on Bioconda and documented at https://fgumi.readthedocs.io.
Warning — Research preview
fgumi is currently a research preview. The Fulcrum Genomics team targets June 2026 for recommending fgumi over fgbio for production use. Verify fitness for your application before deploying in a clinical or production pipeline.
When you’d use it
Use fgumi when your sequencing library includes unique molecular identifiers and you need to group reads by UMI, call simplex or duplex consensus sequences, or remove PCR duplicates in a UMI-aware manner. It handles the standard commercial UMI library preparations (IDT xGen, KAPA, Twist, QIAseq, and others) and the CODEC protocol for duplex sequencing. fgumi is designed to be run after alignment with bwa-mem3 (or bwa-mem2) and before downstream variant calling or methylation analysis.
How it relates to bwa-mem3
fgumi and bwa-mem3 are sibling projects maintained by Fulcrum Genomics and are designed to work together in the same alignment-and-consensus pipeline. bwa-mem3 provides the aligned BAM that fgumi takes as input for grouping and consensus calling. The two projects share build and documentation conventions (mdbook on Read the Docs, Fulcrum theme, conventional commits) and are benchmarked together in the fgumi-benchmarks internal dataset suite. The intended integration path for in-process alignment within fgumi is bwa-mem3-rs, the Rust bindings for bwa-mem3.
Links
- GitHub: https://github.com/fulcrumgenomics/fgumi
- Docs: https://fgumi.readthedocs.io
- License: MIT
See also: bwa-mem3-rs · Aligning short reads (mem) · Best Practices — Multi-sample workflows · bwa-mem3-bench
bwameth.py
bwameth.py is a Python script written by Brent Pedersen that implements bisulfite sequencing (BS-Seq) alignment using the in-silico three-letter genome approach. It converts all cytosines to thymines in both the reference and the reads (C-to-T on the forward strand, G-to-A on the reverse), aligns the converted sequences with bwa-mem (or optionally bwa-mem2), and then recovers the original read sequence from the aligner’s tag output to tabulate methylation. bwameth.py supports single-end and paired-end reads from the directional bisulfite protocol and is published at https://arxiv.org/abs/1401.1129.
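The three-letter conversion itself is a simple character substitution. The sketch below illustrates the idea only; it is not code from bwameth.py or bwa-mem3, and the function names `c2t`, `g2a`, and `convert_pair` are hypothetical. It applies C-to-T to read 1 and G-to-A to read 2, matching the strand convention described above.

```python
# Translation tables for the two in-silico conversions (case-preserving).
_C2T = str.maketrans("Cc", "Tt")
_G2A = str.maketrans("Gg", "Aa")


def c2t(seq: str) -> str:
    """Replace every cytosine with thymine (forward-strand conversion)."""
    return seq.translate(_C2T)


def g2a(seq: str) -> str:
    """Replace every guanine with adenine (reverse-strand conversion)."""
    return seq.translate(_G2A)


def convert_pair(read1: str, read2: str) -> tuple:
    """Convert a read pair: C-to-T on read 1, G-to-A on read 2."""
    return c2t(read1), g2a(read2)


if __name__ == "__main__":
    reference = "ACGTCG"
    methylated_read = "ACGTCG"    # methylated Cs survive bisulfite treatment
    unmethylated_read = "ATGTTG"  # unmethylated Cs are read as T
    # After conversion, both reads are identical to the converted reference,
    # so methylation state no longer causes mismatches during alignment.
    print(c2t(reference), c2t(methylated_read), c2t(unmethylated_read))
```

The point of the example is that the aligner sees the same three-letter sequence regardless of methylation state, which is why the original read sequence has to be recovered afterwards to call methylation.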
When you’d use it
Use bwameth.py when you need a battle-tested, community-supported bisulfite aligner that runs on top of the standard bwa-mem or bwa-mem2 you have already installed, and when you prefer a Python wrapper over a self-contained binary. It also remains the reference for downstream tabulation tools such as MethylDackel and SNP callers such as biscuit that expect the bwameth.py output format. For the actual methylation tabulation and variant calling steps, bwameth.py’s author recommends those dedicated tools rather than the tabulation utilities bundled with the original script.
How it relates to bwa-mem3
bwa-mem3 mem --meth is a single-binary drop-in replacement for the bwameth.py
alignment pipeline. It inlines the C-to-T and G-to-A conversion, runs the bwa-mem3
alignment engine (with all of its correctness fixes and SIMD speedups), rewrites the
@SQ headers to collapse the per-strand contig pairs back to canonical chromosome names,
applies chimera QC, and emits a @PG ID:bwa-mem3-meth header. The output BAM is
compatible with the same downstream tabulation tools that consume bwameth.py output.
The Methylation Reference section documents the full
implementation in detail, including the YS:Z:, YC:Z:, and YD:Z: tags and the
--set-as-failed and --do-not-penalize-chimeras flags.
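To make the @SQ rewrite concrete, the sketch below collapses strand-prefixed header lines to one entry per chromosome. It is an illustration under assumptions, not the bwa-mem3 implementation: it assumes every @SQ name carries a single-character f or r prefix (the f/r-prefixed contig naming mentioned in the Glossary), and `collapse_sq_lines` is a hypothetical name.

```python
def collapse_sq_lines(header_lines: list) -> list:
    """Collapse f/r-prefixed @SQ lines to one line per chromosome.

    Non-@SQ lines pass through unchanged. The first occurrence of each
    chromosome is kept, so the original contig order is preserved.
    """
    collapsed = []
    seen = set()
    for line in header_lines:
        if not line.startswith("@SQ\t"):
            collapsed.append(line)
            continue
        fields = line.split("\t")
        name = None
        for i, field in enumerate(fields):
            if field.startswith("SN:"):
                name = field[3:]
                # Assumption: the strand prefix is exactly one character.
                if name[:1] in ("f", "r"):
                    name = name[1:]
                fields[i] = "SN:" + name
        if name in seen:
            continue
        seen.add(name)
        collapsed.append("\t".join(fields))
    return collapsed


if __name__ == "__main__":
    header = [
        "@HD\tVN:1.6",
        "@SQ\tSN:fchr1\tLN:1000",
        "@SQ\tSN:rchr1\tLN:1000",
    ]
    print("\n".join(collapse_sq_lines(header)))
```

A blanket prefix strip like this would mangle contigs whose real names begin with f or r if they were not strand-prefixed, so it is only safe on headers produced from a doubled reference.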
Tip — Interop with the bwameth.py c2t step
If your pipeline already performs its own C-to-T conversion before alignment, see Interop with external bwameth.py c2t for how to pass pre-converted reads to bwa-mem3 mem --meth without double-conversion.
Links
- GitHub: https://github.com/brentp/bwa-meth
- Paper: https://arxiv.org/abs/1401.1129
- License: MIT
See also: Methylation Reference: Overview · Quick start: methylation alignment · Best Practices — Methylation defaults · Interop with external bwameth.py c2t
Glossary
Terms used throughout this book, listed alphabetically.
@HD header
The first line of a SAM file header. Specifies the SAM format version (VN) and sort order (SO). Required when any other header lines are present. See Output: SAM/BAM, headers, tags.
@PG header
A SAM header line recording a program that processed the file, including ID, PN, VN, and CL fields. bwa-mem3 inserts ID:bwa-mem3 (or ID:bwa-mem3-meth in methylation mode). See Output: SAM/BAM, headers, tags.
@SQ header
A SAM header line describing a reference sequence (chromosome). Contains the sequence name (SN) and length (LN). In methylation mode, bwa-mem3 post-processes @SQ lines to collapse f/r-prefixed contig names back to one entry per chromosome. See Chimera QC and header rewriting.
BAM
Binary Alignment Map: a compressed, binary encoding of SAM. bwa-mem3 writes BAM directly when the --bam flag is given; otherwise, its SAM output can be piped through samtools to produce BAM. See Output: SAM/BAM, headers, tags.
Banded Smith-Waterman (banded SWA)
A heuristic variant of the Smith-Waterman alignment algorithm that restricts the dynamic programming to a band of width w around the main diagonal. bwa-mem3 uses banded SWA for extension alignment. The kernels inherited from bwa-mem2 are SIMD-vectorized, and bwa-mem3 adds NEON implementations for ARM64, including Apple Silicon. See SIMD dispatch architecture.
c2t
Cytosine-to-thymine in-silico conversion applied to reads (or reference) before methylation alignment. In --meth mode, bwa-mem3 converts R1 reads C→T and R2 reads G→A inline, without writing intermediate FASTQ files. See Conversion details (C->T, G->A).
Chimera
A read alignment where the aligned portion is short relative to the read length, often indicating a mapping artefact or a true chimeric molecule. In methylation mode, bwa-mem3 applies a chimera QC heuristic: if the longest contiguous M/=/X CIGAR run is less than 44% of the read length, the alignment is flagged 0x200, the proper-pair bit is cleared, and MAPQ is capped at 1. See Chimera QC and header rewriting.
FASTQ
A text format for raw sequencing reads. Each record contains a sequence identifier, the nucleotide sequence, a separator, and per-base quality scores in ASCII-encoded Phred format. bwa-mem3 accepts gzip-compressed FASTQ as input. See Quick start: align paired-end FASTQs.
FM-index
Ferragina-Manzini index — a full-text index over the Burrows-Wheeler Transform of a sequence. bwa-mem3 uses the compressed .bwt.2bit.64 FM-index for seed finding (SMEM lookup). See Indexing the reference.
Hard clip
A CIGAR operation (H) indicating that bases at the read end are absent from the SEQ field of the alignment record. Hard clipping is used in supplementary alignments to avoid duplicating the read sequence. See Output: SAM/BAM, headers, tags.
kswv
The SIMD-vectorized kernel implementing the inner loop of the Smith-Waterman extension alignment in bwa-mem2/bwa-mem3. bwa-mem3 carries correctness fixes for the score-saturation edge case across all SIMD width variants (NEON, AVX2, AVX-512BW). See Correctness fixes.
libsais
A library implementing the suffix-array induced sorting (SAIS) algorithm. bwa-mem3 optionally uses libsais for FM-index construction, reducing indexing time compared to the default suffix-array builder. See Performance improvements.
LTO
Link-Time Optimization — a compiler mode that defers optimization to link time, enabling cross-compilation-unit inlining. Activated via make lto-build. See Building from source.
MAPQ
Mapping quality: a Phred-scaled probability that a read alignment is incorrectly mapped. Reported in SAM field 5. bwa-mem3 follows bwa-mem2 MAPQ semantics; chimera QC in methylation mode caps MAPQ at 1 for chimeric alignments. See Output: SAM/BAM, headers, tags.
Mate rescue
A step in paired-end alignment where, if one mate lacks a confident seed, bwa-mem3 attempts to find it by performing Smith-Waterman alignment in the region near the mapped mate. bwa-mem3 adds NEON and AVX2 implementations of the mate-rescue kernel. See Architecture support.
mimalloc
A high-performance memory allocator from Microsoft. bwa-mem3 vendors mimalloc and links it into every binary by default. To disable, build with USE_MIMALLOC=0. See Memory allocator (mimalloc).
Multi-binary launcher
On x86, bwa-mem3 ships a thin launcher binary that detects CPUID features at runtime and execs the appropriate arch-specialized binary (bwa-mem3.sse41, .avx2, .avx512bw, etc.). On ARM64 a single bwa-mem3.arm64 binary is built. See Multi-binary launcher (x86).
PGO
Profile-Guided Optimization — a two-pass build where the first pass instruments the binary, a representative workload is run to collect profiles, and the second pass uses those profiles to guide inlining and branch layout. Activated via make pgo-generate then make pgo-use. See PGO build.
Primary alignment
The alignment record for a read that represents the aligner’s best placement. A read has exactly one primary alignment (or is reported as unmapped). All other alignments for the same read are marked supplementary (chimeric split read) or secondary (alternative mapping). See Output: SAM/BAM, headers, tags.
Proper-pair flag (0x2)
SAM flag bit indicating that both mates of a pair are mapped in the expected orientation and insert-size range. In bwa-mem3, the mem_sam_pe function sets this flag; a correctness fix (PR #17) ensures it is propagated correctly under all conditions. See Correctness fixes.
SAM
Sequence Alignment Map: a tab-delimited text format for read alignments. Each record contains mandatory fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL) plus optional tags. See Output: SAM/BAM, headers, tags.
SIMD dispatch
Runtime selection of the fastest available SIMD instruction set (SSE4.1, AVX2, AVX-512BW, NEON) for hot alignment kernels. On x86 this is implemented by the multi-binary launcher; on ARM64 a single binary covers NEON. See SIMD dispatch matrix.
SMEM
Super-Maximal Exact Match: a seed found by extending a read’s position in the FM-index as far as possible in both directions. SMEMs form the initial seeds for chaining and extension in the BWA-MEM algorithm. See Performance improvements.
Soft clip
A CIGAR operation (S) indicating that bases at the read end were not part of the alignment, but are still present in the SEQ field. Soft clipping commonly appears at adapter-containing or low-quality read ends. See Output: SAM/BAM, headers, tags.
Supplementary alignment
A SAM record (FLAG bit 0x800 set) representing a chimeric read split across two or more genomic loci. The segment with the longest aligned span is typically designated primary; remaining segments are supplementary. Hard clipping is used to avoid duplicating the SEQ field. See Output: SAM/BAM, headers, tags.
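As a worked illustration of the Chimera entry, the sketch below applies the 44% rule to a single alignment. It is a reading of the glossary text, not the bwa-mem3 source. Two details are assumptions: adjacent M, =, and X operations are merged into one run (any other operation breaks the run), and read length is taken as the sum of the M, I, S, =, X, and H operation lengths. The function names are hypothetical, and the --set-as-failed and --do-not-penalize-chimeras options are not modelled.

```python
import re

_CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

FLAG_PROPER_PAIR = 0x2   # SAM flag: read mapped in proper pair
FLAG_QC_FAIL = 0x200     # SAM flag: not passing quality controls


def longest_match_run(cigar: str) -> int:
    """Length of the longest contiguous run of M/=/X operations."""
    longest = current = 0
    for length, op in _CIGAR_RE.findall(cigar):
        if op in "M=X":
            current += int(length)
            longest = max(longest, current)
        else:
            current = 0  # assumption: any other operation ends the run
    return longest


def read_length(cigar: str) -> int:
    """Read length implied by the CIGAR (assumed to include hard clips)."""
    return sum(int(n) for n, op in _CIGAR_RE.findall(cigar) if op in "MIS=XH")


def apply_chimera_qc(flag: int, mapq: int, cigar: str, threshold: float = 0.44):
    """Return (flag, mapq) after the chimera heuristic described above."""
    if longest_match_run(cigar) < threshold * read_length(cigar):
        flag = (flag | FLAG_QC_FAIL) & ~FLAG_PROPER_PAIR
        mapq = min(mapq, 1)
    return flag, mapq


if __name__ == "__main__":
    # 40 aligned bases out of 150 is below 44%, so the record is penalised.
    print(apply_chimera_qc(99, 60, "40M110S"))
    # 100 aligned bases out of 150 passes, so the record is unchanged.
    print(apply_chimera_qc(99, 60, "100M50S"))
```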
See also: Citation · License · Changelog · Output: SAM/BAM, headers, tags · What’s Different — Overview
Citation
How to cite
bwa-mem3 is a derivative of bwa-mem2. If you use bwa-mem3 in published work, please cite the original bwa-mem2 paper:
Vasimuddin Md, Sanchit Misra, Heng Li, Srinivas Aluru. Efficient Architecture-Aware Acceleration of BWA-MEM for Multicore Systems. IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2019. doi:10.1109/IPDPS.2019.00041
BibTeX:
@inproceedings{bwamem2-ipdps2019,
author = {Vasimuddin Md and Sanchit Misra and Heng Li and Srinivas Aluru},
title = {Efficient Architecture-Aware Acceleration of {BWA-MEM} for Multicore Systems},
booktitle = {IEEE International Parallel and Distributed Processing Symposium (IPDPS)},
year = {2019},
doi = {10.1109/IPDPS.2019.00041},
url = {https://doi.org/10.1109/IPDPS.2019.00041}
}
Lineage
bwa-mem3 is maintained by Fulcrum Genomics as a derivative of bwa-mem2, itself derived from bwa (Li & Durbin, 2009). The BWA-MEM algorithm was originally described in:
Heng Li. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997, 2013.
The bwa-mem3-specific changes and improvements carried on top of bwa-mem2 are documented in What’s Different from bwa-mem2.
See also: License · Changelog · What’s Different — Overview · Related Projects: bwa-mem2
License
bwa-mem3 is licensed under the MIT License (same as upstream bwa-mem2).
The MIT License
BWA-MEM2 (Sequence alignment using Burrows-Wheeler Transform),
Copyright (C) 2019 Intel Corporation, Heng Li.
Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:
The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Contacts: Vasimuddin Md <vasimuddin.md@intel.com>; Sanchit Misra <sanchit.misra@intel.com>;
Heng Li <hli@jimmy.harvard.edu>
See also: Citation · Changelog · Related Projects: bwa-mem2 · What’s Different — Overview
Changelog
Release 0.1.0-pre (2026-04-28)
- Project renamed from bwa-mem2 to bwa-mem3. The new project tracks Fulcrum Genomics’ performance and feature work on top of the upstream bwa-mem2 codebase.
- Default branch renamed from fg-main to main.
- Binary renamed from bwa-mem2 to bwa-mem3. Arch-suffixed variants (bwa-mem3.sse41, .sse42, .avx, .avx2, .avx512bw, .arm64, .pgo, .profile, .lto) renamed to match.
- @PG SAM header tags now read ID:bwa-mem3 PN:bwa-mem3 (and bwa-mem3-meth for --meth mode).
- Test binaries renamed: bwa_mem2_tests_unit → bwa_mem3_tests_unit, bwa_mem2_tests_integration → bwa_mem3_tests_integration.
- .bwt.2bit.64 index file format unchanged: bwa-mem3 reads indexes built by bwa-mem2 index without re-indexing.
Release 2.2.1 (17 March 2021)
Hotfix for v2.2: Fixed the bug mentioned in #135.
Release 2.2 (8 March 2021)
Changes since the last release (2.1):
- Passed the validation test on ~88 billion reads (Credits: Keiran Raine, CASM division, Sanger Institute)
- Fixed bugs reported in #109 causing mismatch between bwa-mem and bwa-mem2
- Fixed the issue (#112) causing a crash due to a corrupted thread id
- Using all the SSE flags to create optimized SSE41 and SSE42 binaries
Release 2.1 (16 October 2020)
Release 2.1 of BWA-MEM2.
Changes since the last release (2.0):
- Smaller index: the index size on disk is down by 8 times and in memory by 4 times due to moving to only one type of FM-index (2bit.64 instead of 2bit.64 and 8bit.32) and 8x compression of suffix array. For example, for human genome, index size on disk is down to ~10GB from ~80GB and memory footprint is down to ~10GB from ~40GB. There is a substantial decrease in index IO time due to the reduction and hardly any performance impact on read mapping.
- Added support for 2 more execution modes: sse4.2 and avx.
- Fixed multiple bugs including those reported in Issues #71, #80 and #85.
- Merged multiple pull requests.
Release 2.0 (9 July 2020)
This is the first production release of BWA-MEM2.
Changes since the last release:
- Made the source code more secure with more than 300 changes all across it.
- Added support for memory re-allocations in case the pre-allocated fixed memory is insufficient.
- Added support for the MC flag in the SAM file and for the -5 and -q flags on the command line.
- The output is now identical to the output of bwa-mem-0.7.17.
- Merged index building code with FMI_Search class.
- Added support for different ways to input read files; input handling is now the same as in bwa-mem.
- Fixed a bug in the AVX512 SAM processing part that was leading to incorrect output.
Release 2.0pre2 (4 February 2020)
Miscellaneous changes:
- Changed the license from GPL to MIT.
- IMPORTANT: the index structure has changed since commit 6743183. Please rebuild the index if you are using a later commit or the new release.
- Added charts in README.md comparing the performance of bwa-mem2 with bwa-mem.
Major code changes:
- Fixed handling of variable-length reads.
- Fixed a bug involving reads of length greater than 250bp.
- Added support for allocation of more memory in small chunks if large pre-allocated fixed memory is insufficient. This is needed very rarely (thus having no impact on performance) but prevents asserts from failing (and the code from crashing) in that scenario.
- Fixed a memory leak due to not releasing the memory allocated for seeds after smem.
- Fixed a segfault due to non-alignment of small allocated memory in the optimized banded Smith-Waterman.
- Enabled working with genomes larger than 7-8 billion nucleotides (e.g. Wheat genome).
- Fixed a segfault occurring (with gcc compiler) while reading the index.
See also: Citation · License · What’s Different — Overview · Developer Guide — Release process