bwa-mem3

A faster, more correct, drop-in replacement for bwa mem and bwa-mem2.

If you align short reads with bwa or bwa-mem2 today, bwa-mem3 will give you the same answers — only quicker, with fewer rough edges, and with first-class support for things you used to need a wrapper script for.

Why bwa-mem3

  • Drop in, go faster. Same algorithm, same outputs, same flags as bwa-mem2 — but consolidated mapping speedups, a memory-bounded index builder, batched header ingestion, and a tuned allocator add up to measurable wall-clock wins on real workloads.
  • Methylation in one binary. A --meth flag turns bwa-mem3 into a drop-in replacement for the entire bwameth.py pipeline. No Python, no inline conversion script, no separate post-processing step. One bwa-mem3 index --meth ref.fa, one bwa-mem3 mem --meth ref.fa R1.fq R2.fq, done — header collapsed, tags emitted, chimeras flagged.
  • Stage the index once, align many. A bwa-mem3 shm subcommand pins the FM-index in shared memory so back-to-back runs on the same host skip the 28 GB read every time.
  • Correctness fixes upstream hasn’t merged yet. Tabs in -R, 151+ bp reads, AVX-512 mate-rescue, kswv score2 plateau across NEON/AVX2/AVX-512BW, mem_sam_pe proper-pair flag — every fix tracked back to the upstream PR or issue that found it.
  • Architecture-aware out of the box. SSE4.1, SSE4.2, AVX, AVX2, AVX-512BW, and ARM64/NEON. A multi-binary launcher picks the right one for your CPU.

Get started in 30 seconds

git clone --recursive https://github.com/fg-labs/bwa-mem3
cd bwa-mem3 && make
./bwa-mem3 index ref.fa
./bwa-mem3 mem -t 16 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -@ 8 -o out.bam

Tip — Emit BAM directly

For production pipelines, add --bam=0 to skip the SAM text round-trip entirely. See Best Practices: Output format.

bwa-mem3 is a derivative of bwa-mem2 maintained by Fulcrum Genomics. MIT licensed. See License and Citation.

Installation

Bioconda (coming soon)

A Bioconda package for bwa-mem3 is in preparation. Once published, installation will be:

conda install -c bioconda bwa-mem3

This will be the recommended path for most users. Check back here or watch the fg-labs/bwa-mem3 repository for the announcement.

Build from source

Until the Bioconda package is available, build from source using the steps below.

Prerequisites

  • A C++17 compiler (GCC 8+ or Clang 7+)
  • GNU make
  • Rust toolchain (optional; used only to cargo install the mdbook tools, not required to build the aligner)
  • Git (for submodule checkout)

Clone and build

git clone --recursive https://github.com/fg-labs/bwa-mem3
cd bwa-mem3
make

The --recursive flag is required. bwa-mem3 vendors several libraries (mimalloc, sse2neon, and others) as git submodules, so a non-recursive clone will fail to compile.

Warning — Missing submodules pitfall

If you cloned without --recursive, initialize the submodules before running make:

git submodule update --init --recursive

Forgetting this step is the most common source of build failures.
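
A quick pre-build check can catch this before make fails. The sketch below counts submodules that git reports as uninitialized; it relies on git submodule status prefixing such entries with a minus sign. The function name count_uninitialized is illustrative and not part of bwa-mem3.

```shell
# Count submodules that `git submodule status` reports as uninitialized.
# Input: status lines on stdin. Git prefixes an uninitialized submodule with "-".
count_uninitialized() {
    # grep -c prints 0 and exits non-zero when nothing matches; keep the exit status clean.
    grep -c '^-' || true
}
```

From the repository root, run git submodule status --recursive | count_uninitialized. Any result other than 0 means the submodule update command above still needs to run.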

Target architecture

By default, make builds a general-purpose binary that runs on any supported CPU. For maximum performance, specify the architecture that matches your deployment target:

| Build command | Requires | Notes |
|---|---|---|
| make | SSE4.1 or better (x86), any (ARM) | Default; selects best dispatch at runtime on x86 |
| make arch=avx2 | AVX2 (e.g. Haswell, Zen 2) | Recommended for modern x86 servers |
| make arch=avx512bw | AVX-512BW (e.g. Skylake-X, Ice Lake, Sapphire Rapids) | Maximum x86 performance |
| make arch=arm64 | Apple Silicon / AWS Graviton | NEON-vectorized build |

See Performance — SIMD dispatch matrix for the full matrix of which kernels are vectorized under each target.

Memory allocator (mimalloc)

bwa-mem3 bundles mimalloc and links it into every binary by default. mimalloc reduces allocator contention under high thread counts and lowers wall-clock time on multi-threaded alignment runs.

To build without mimalloc, pass USE_MIMALLOC=0:

make USE_MIMALLOC=0

See User Guide — Memory allocator for details on how mimalloc is linked on Linux versus macOS and when opting out is appropriate.

Smoke test

After building, run the smoke test to confirm that the binary works and to see which allocator is active:

./bwa-mem3 version

Expected output (with mimalloc):

bwa-mem3-0.1.0-pre-12-gabcdef1
mimalloc 3.x.x

If the mimalloc line is absent, the build linked the system allocator (expected when USE_MIMALLOC=0 was passed or when the vendor submodule was not initialized).


See also: Quick start: align paired-end FASTQs · User Guide — Memory allocator · Developer Guide — Building from source

Quick start: align paired-end FASTQs

This page walks through the two-command workflow: index the reference once, then align reads.

Index the reference

bwa-mem3 index ref.fa

This produces five index files alongside ref.fa:

| File | Description |
|---|---|
| ref.fa.bwt.2bit.64 | FM-index in 2-bit packed format |
| ref.fa.0123 | 2-bit packed reference sequence |
| ref.fa.amb | Ambiguous base positions |
| ref.fa.ann | Sequence name and length annotations |
| ref.fa.pac | Packed 4-bit reference sequence |

Indexing hg38 takes roughly 60–90 minutes on a single core and requires approximately 60 GB of peak disk space during creation (including temporary and intermediate files); the final FM-index stored on disk is roughly 28 GB. The index is read once per mem invocation; for workloads that align many samples, load it into shared memory first (see Quick start: shared-memory index).

Align paired-end reads

bwa-mem3 mem -t 16 ref.fa r1.fq.gz r2.fq.gz > out.sam

-t 16 sets the thread count to 16. bwa-mem3 scales well up to the number of physical CPU cores; hyperthreading provides diminishing returns above that point. See User Guide — Threading and resource use for recommendations at different core counts.

The default output is uncompressed SAM on stdout. To write BAM directly, use the --bam flag (uncompressed BAM unless a compression level is given):

bwa-mem3 mem --bam -t 16 ref.fa r1.fq.gz r2.fq.gz \
  | samtools sort -@ 8 -o out.bam -
samtools index out.bam

Tip — Prefer BAM output in production

Piping BAM (--bam) to samtools sort avoids the text formatting and parsing overhead of SAM on both sides of the pipe. For large cohorts this yields a measurable wall-clock reduction. See Best Practices — Output format for the recommended pipeline and a discussion of when SAM is still useful.

Read group tagging

For downstream tools that require a @RG header (most variant callers), pass -R:

bwa-mem3 mem -t 16 \
  -R '@RG\tID:sample1\tSM:sample1\tPL:ILLUMINA\tLB:lib1' \
  ref.fa r1.fq.gz r2.fq.gz > out.sam

The value is a tab-delimited string following BWA conventions. Every aligned record receives an RG:Z: tag matching the ID field of the read-group header.
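
When read-group values come from a sample sheet, assembling the -R string by hand is error-prone. The helper below is a minimal sketch that builds the string with literal \t separators, the same form used in the example above. The function name make_rg and its argument order are assumptions made for illustration.

```shell
# Build a -R value with literal "\t" separators (backslash followed by t).
# Arguments: ID SM [LB] [PL]. ID and SM are treated as mandatory.
make_rg() {
    rg_id=${1:-}
    rg_sm=${2:-}
    rg_lb=${3:-}
    rg_pl=${4:-}
    if [ -z "$rg_id" ] || [ -z "$rg_sm" ]; then
        echo "make_rg: ID and SM are required" >&2
        return 1
    fi
    # Inside double quotes, "\\t" yields the two characters backslash and t.
    rg="@RG\\tID:${rg_id}\\tSM:${rg_sm}"
    if [ -n "$rg_lb" ]; then rg="${rg}\\tLB:${rg_lb}"; fi
    if [ -n "$rg_pl" ]; then rg="${rg}\\tPL:${rg_pl}"; fi
    printf '%s\n' "$rg"
}
```

A call such as bwa-mem3 mem -t 16 -R "$(make_rg sample1 sample1 lib1 ILLUMINA)" ref.fa r1.fq.gz r2.fq.gz > out.sam then passes the assembled header in one argument.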

Output tags

bwa-mem3 emits standard SAM tags plus the HN:i: tag introduced by the fork:

| Tag | Type | Description |
|---|---|---|
| NM:i | int | Edit distance to the reference |
| MD:Z | string | Mismatch and deletion string |
| AS:i | int | Alignment score |
| XS:i | int | Suboptimal alignment score |
| SA:Z | string | Supplementary alignment chain |
| HN:i | int | Total number of primary alignments (reported and suppressed) found for the read, before the -h supplementary cap is applied |

For the methylation-specific tags (YS:Z, YC:Z, YD:Z), see Methylation Reference — SAM tags.


See also: Quick start: methylation alignment · Quick start: shared-memory index · User Guide — Aligning short reads · CLI Reference — mem

Quick start: methylation alignment

bwa-mem3 supports bisulfite-converted (WGBS/RRBS/EM-seq) read alignment through a single --meth flag on both index and mem. No Python interpreter, no piped preprocessor, and no separate postprocessing step are required.

Note — Drop-in replacement for bwameth.py

bwa-mem3 with --meth is a single-binary drop-in replacement for the bwameth.py pipeline. The output BAM is byte-compatible for the standard tags used by methylation callers (Bismark, MethylDackel, PileOMeth, etc.).

Index the reference for methylation

Build the c2t doubled reference once:

bwa-mem3 index --meth ref.fa

This writes two additional files next to the standard index:

| File | Description |
|---|---|
| ref.fa.bwameth.c2t | C→T converted reference (forward strand) with G→A reverse complement interleaved |
| ref.fa.bwameth.c2t.* | FM-index files for the c2t reference |

The c2t index is separate from the standard index produced by bwa-mem3 index ref.fa. You need both if you intend to run standard and methylation alignments against the same reference.

Align bisulfite-converted reads

bwa-mem3 mem --meth -t 16 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -o out.bam
samtools index out.bam

Pass the original (unconverted) reference path, not the .bwameth.c2t file. bwa-mem3 auto-appends .bwameth.c2t to the reference path when --meth is active.

What --meth does

--meth activates a pipeline of in-process transformations that would otherwise require external tools:

  1. Inline c2t read conversion. R1 reads have every C converted to T before alignment; R2 reads have every G converted to A. The original unconverted sequence is preserved in the YS:Z: SAM tag. The conversion direction for each read is recorded in YC:Z: (value CT or GA), matching the bwameth.py convention.

  2. bwameth.py-equivalent scoring defaults. --meth sets -B 2 -L 10 -U 100 -T 40 -CM automatically. These match the defaults used by bwameth.py and are optimized for bisulfite-converted reads where C→T mismatches carry no penalty. Any of these values can be overridden on the command line.

  3. Inline BAM post-processing. After alignment, bwa-mem3 rewrites the SAM stream in-process:

    • @SQ headers with f/r prefixes (e.g. fchr1, rchr1) are collapsed back to one entry per real chromosome (chr1). Read-level RNAME fields are rewritten to match.
    • Each mapped record gains a YD:Z: tag (f for forward-strand, r for reverse-strand) indicating which converted strand the read aligned to.
    • Chimera QC: reads whose longest M/=/X run is less than 44% of the read length are flagged 0x200 (QC-fail), have flag 0x2 (proper pair) cleared, and have MAPQ capped at 1.
    • Pair-level QC-fail propagation: if one mate is QC-failed, the other mate is also flagged.
    • A @PG ID:bwa-mem3-meth program record is appended to the header.
  4. Uncompressed BAM output. The post-processed stream is written as uncompressed BAM (wb0) rather than SAM text. This eliminates text serialization overhead and allows downstream samtools sort to read BAM natively. The stream is still fully readable by any htslib-based tool.

For full details on each tag, the chimera QC heuristic, and the --set-as-failed and --do-not-penalize-chimeras flags, see the Methylation Reference.
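
The read conversion and the chimera rule described above can be sketched in a few lines of POSIX shell. This illustrates the documented behaviour and is not the in-process implementation: the function names are hypothetical, each M, = or X CIGAR operation is measured separately (an assumption about what "run" means), and only uppercase bases are handled.

```shell
# In-silico conversion of one read: mate 1 gets C->T, mate 2 gets G->A.
convert_read() {
    case $1 in
        1) printf '%s\n' "$2" | tr 'C' 'T' ;;
        2) printf '%s\n' "$2" | tr 'G' 'A' ;;
        *) echo "convert_read: mate must be 1 or 2" >&2; return 1 ;;
    esac
}

# Length of the longest single M, = or X operation in a CIGAR string.
longest_match_op() {
    printf '%s\n' "$1" | awk '{
        max = 0; s = $0
        while (match(s, /[0-9]+[MIDNSHP=X]/)) {
            op = substr(s, RSTART + RLENGTH - 1, 1)
            n = substr(s, RSTART, RLENGTH - 1) + 0
            if ((op == "M" || op == "=" || op == "X") && n > max) max = n
            s = substr(s, RSTART + RLENGTH)
        }
        print max
    }'
}

# Exit status 0 when the longest match is under 44% of the read length.
# Arguments: CIGAR, read length.
is_chimera() {
    longest=$(longest_match_op "$1")
    [ $((longest * 100)) -lt $((44 * $2)) ]
}

# Apply the QC consequences to a FLAG and MAPQ:
# set 0x200 (512, QC fail), clear 0x2 (2, proper pair), cap MAPQ at 1.
flag_chimera() {
    new_flag=$(( ($1 | 512) & ~2 ))
    if [ "$2" -gt 1 ]; then new_mapq=1; else new_mapq=$2; fi
    printf '%s %s\n' "$new_flag" "$new_mapq"
}
```

For a 151 bp read aligned as 60M91S, the longest match (60) is below 44% of 151 (about 66.4), so is_chimera 60M91S 151 succeeds, and flag_chimera 99 60 prints 609 1.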


See also: Methylation Reference — Overview · Methylation Reference — SAM tags · Best Practices — Methylation defaults · CLI Reference — mem

Quick start: shared-memory index

The bwa-mem3 FM-index for a genome like hg38 is approximately 28 GB. By default, every bwa-mem3 mem invocation reads the index from disk, which can take 30–60 seconds on a spinning disk and several seconds even on fast NVMe storage. For workloads that align many small samples in sequence on the same machine, this per-invocation overhead accumulates.

bwa-mem3 shm stages the index once into POSIX shared memory. Subsequent mem invocations attach to the in-memory segment instead of reading from disk, reducing per-sample startup time to near zero.

Stage the index

bwa-mem3 shm ref.fa

This reads the index files from disk and copies them into a POSIX shared-memory segment. The command returns when staging is complete. The index stays in memory until it is explicitly dropped or the system is rebooted.

To stage a methylation (--meth) index:

bwa-mem3 shm --meth ref.fa

A standard and a methylation index for the same reference can be staged simultaneously; they occupy separate named segments.

Align using the staged index

No extra flag is needed. When bwa-mem3 mem starts, it checks whether a matching shared-memory segment exists. If one does, it attaches automatically:

bwa-mem3 mem -t 16 ref.fa r1.fq.gz r2.fq.gz > out.sam

Inspect and drop staged segments

List all currently staged indices:

bwa-mem3 shm -l

Drop all staged segments:

bwa-mem3 shm -d

When to use shared-memory indexing

Shared-memory indexing is most beneficial when:

  • Aligning tens to hundreds of small samples (e.g. amplicon panels, targeted sequencing) where per-sample index load time dominates per-sample alignment time.
  • Running a batch pipeline on a single large machine where the index fits comfortably in RAM (approximately 28 GB for hg38 with the standard index).
  • The same reference is used for all samples in the batch; a new shm invocation is required for each distinct reference.

It provides little benefit when:

  • Aligning a small number of large samples (WGS), where alignment time far exceeds index load time.
  • The available RAM is insufficient to hold the index alongside the operating system and alignment worker processes.
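
A rough estimate helps decide whether staging is worth it for a given batch. The sketch below assumes that staging costs about one disk load and that attaching to a staged segment costs roughly nothing; both are simplifications, and the function name shm_seconds_saved is hypothetical.

```shell
# Seconds of index loading avoided by staging once instead of loading on every run.
# Arguments: number of mem runs, seconds per index load from disk.
shm_seconds_saved() {
    echo $(( ($1 - 1) * $2 ))
}
```

With 200 amplicon samples and a 30-second load, shm_seconds_saved 200 30 gives 5970 seconds, close to 100 minutes of wall-clock time. With 3 WGS samples the same arithmetic gives 60 seconds, which is negligible next to the alignment itself.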

Warning — No staleness check — always drop before re-indexing

bwa-mem3 shm does not detect whether the on-disk index files have changed after staging. If you run bwa-mem3 index ref.fa again (e.g. to rebuild after a reference update), the shared-memory segment is not invalidated. Subsequent mem invocations will attach to the stale segment and produce silently incorrect alignments.

Always drop the segment before re-indexing:

bwa-mem3 shm -d
bwa-mem3 index ref.fa
bwa-mem3 shm ref.fa
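
Wrapping the three commands in one function makes it harder to skip the drop step. This is a minimal sketch: the function name and the BWA_MEM3 override variable are hypothetical, and it assumes shm -d exits successfully even when nothing is staged. Note that shm -d drops all staged segments, not only the one for this reference.

```shell
# Drop staged segments, rebuild the index, and stage the fresh index again.
# Stops at the first failing step. BWA_MEM3 can point at a specific binary.
reindex_and_restage() {
    ref=$1
    "${BWA_MEM3:-bwa-mem3}" shm -d &&
    "${BWA_MEM3:-bwa-mem3}" index "$ref" &&
    "${BWA_MEM3:-bwa-mem3}" shm "$ref"
}
```

Setting BWA_MEM3=echo before calling reindex_and_restage ref.fa prints the three argument lists in order without running anything, which is a cheap way to review the sequence.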

See also: CLI Reference — shm · Best Practices — Multi-sample workflows · Best Practices — Anti-patterns · User Guide — Threading and resource use

Indexing the reference

Before aligning reads, bwa-mem3 builds an FM-index from the reference FASTA. The index is built once and reused indefinitely; each mem run reads it back from disk at startup.

Basic indexing

bwa-mem3 index ref.fa

The command writes five files alongside the input FASTA:

| File | Contents |
|---|---|
| ref.fa.bwt.2bit.64 | Burrows-Wheeler Transform, 2-bit packed, 64-bit offsets |
| ref.fa.0123 | Forward sequence, 2-bit packed |
| ref.fa.amb | Coordinates and counts of ambiguous (N) bases |
| ref.fa.ann | Sequence names and lengths |
| ref.fa.pac | Forward sequence, 4-bit packed |

The .bwt.2bit.64 file dominates disk usage. For the human reference (hg38), expect roughly 28 GB total across all five files.
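
Before launching a large batch, it can be worth confirming that all five files exist and are non-empty. The sketch below derives the file names from the table above; the function names are illustrative and the check says nothing about whether the contents are valid.

```shell
# Print the five index file names documented for a given prefix.
index_files() {
    for ext in bwt.2bit.64 0123 amb ann pac; do
        printf '%s.%s\n' "$1" "$ext"
    done
}

# Print every expected index file that is missing or empty.
# Exit status is 1 if any file is reported, 0 otherwise.
missing_index_files() {
    missing=$(index_files "$1" | while IFS= read -r f; do
        [ -s "$f" ] || printf '%s\n' "$f"
    done)
    if [ -z "$missing" ]; then
        return 0
    fi
    printf '%s\n' "$missing"
    return 1
}
```

Run missing_index_files ref.fa before aligning. For a methylation index, pass ref.fa.bwameth.c2t as the prefix instead.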

Methylation index (--meth)

bwa-mem3 index --meth ref.fa

Methylation mode builds a C-to-T doubled reference in addition to the standard FM-index files. The command writes a ref.fa.bwameth.c2t file (the doubled FASTA) and its own set of five index files with the .bwameth.c2t suffix:

ref.fa.bwameth.c2t
ref.fa.bwameth.c2t.bwt.2bit.64
ref.fa.bwameth.c2t.0123
ref.fa.bwameth.c2t.amb
ref.fa.bwameth.c2t.ann
ref.fa.bwameth.c2t.pac

The doubled reference is roughly twice the size of the standard one. For hg38, allow approximately 56 GB of disk space.

Tip — Pass the original FASTA to mem, not the c2t file

When running bwa-mem3 mem --meth, pass the original FASTA path (ref.fa), not ref.fa.bwameth.c2t. bwa-mem3 appends .bwameth.c2t automatically. The auto-append is skipped only when the path already ends in .bwameth.c2t, which is useful for external-c2t interop pipelines.

Output file locations

Index files are written to the same directory as the input FASTA by default. The input path is taken verbatim as a prefix — you can pass an absolute path to write into a different directory:

bwa-mem3 index /data/indexes/hg38/hg38.fa
# writes hg38.fa.bwt.2bit.64, etc. into /data/indexes/hg38/

Time and memory

Indexing hg38 takes roughly 60–90 minutes on a single core and requires about 80 GB of RAM during construction. The process is single-threaded; additional cores do not reduce wall time.

bwa-mem3 uses libsais to construct the suffix array, which is faster than the original bwa-mem2 approach. See Performance improvements for benchmark numbers.

Warning — Do not index over a live shared-memory segment

If you have previously staged the index into shared memory with bwa-mem3 shm, drop the segment first before re-indexing:

bwa-mem3 shm -d
bwa-mem3 index ref.fa

There is no staleness check. If bwa-mem3 mem finds a matching segment in shared memory it will attach to it even when the on-disk index has been updated. See Quick start: shared-memory index.

Arch flags and the index format

The FM-index format is architecture-independent. A single index can be used with any bwa-mem3 binary — bwa-mem3.avx2, bwa-mem3.avx512bw, and the ARM single-binary all read the same on-disk layout.


See also: Quick start: align paired-end FASTQs · Quick start: methylation alignment · Quick start: shared-memory index · Performance improvements · CLI Reference: index

Aligning short reads (mem)

bwa-mem3 mem aligns one or two FASTQ files against an indexed reference and writes SAM (default) or BAM (--bam) to stdout. It is a drop-in replacement for bwa-mem2 mem and supports all standard bwa-mem flags.

Basic usage

Paired-end:

bwa-mem3 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > out.sam

Single-end:

bwa-mem3 mem -t 16 ref.fa reads.fq.gz > out.sam

Pipe directly to samtools:

bwa-mem3 mem --bam=0 -t 16 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -@ 8 -o out.bam -
samtools index out.bam

Using --bam=0 (uncompressed BAM) avoids SAM text formatting on the write side and SAM parsing on the samtools side. It also skips compression work that samtools sort would immediately undo, so BAM records flow through the pipe unmodified.

Key flags

Threading: -t

-t INT   number of threads [1]

Performance scales well through 16–32 threads on most machines. Beyond 32 threads, returns diminish on typical workloads because inter-thread locking and IO become the bottleneck. See Threading and resource use for detailed guidance.

Read-group header: -R

-R STR   read group header line, e.g. '@RG\tID:sample1\tSM:sample1\tLB:lib1\tPL:ILLUMINA'

Every production alignment should include a @RG header. The ID in the -R string is embedded as an RG:Z: tag on every output record.

Tip — Escape the tab correctly

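Separate the fields of the -R value with \t. Inside single quotes the shell passes the two-character sequence \t through unchanged. With $'...' quoting (available in bash, zsh, and ksh) the shell itself converts each \t into a real tab character before bwa-mem3 sees it. Both forms appear in this book; the example below uses $'...':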

bwa-mem3 mem -R $'@RG\tID:s1\tSM:sample1' -t 16 ref.fa R1.fq.gz R2.fq.gz

Chunk size: -K

-K INT   process INT input bases in each batch [10000000]

Larger -K values increase memory use but can improve throughput on very deep or very wide batches. The default is appropriate for most workloads.

Pairing control: -S, -P

-S    skip mate rescue
-P    skip pairing; mate rescue performed unless -S also in use

These flags are primarily useful for debugging or non-standard workflows. Normal paired-end alignments should leave both at their defaults.

Output modes

SAM (default)

bwa-mem3 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > out.sam

Plain-text SAM. Suitable for inspection, compatibility testing, and piping to tools that consume SAM.

BAM (--bam=0)

bwa-mem3 mem --bam=0 -t 16 ref.fa R1.fq.gz R2.fq.gz > out.bam

Writes BAM directly. --bam=0 is uncompressed BAM, which avoids double-compression when piping into a downstream sorter and is roughly 10–15% faster end-to-end. Pass --bam=6 to write a fully compressed BAM if the output is the final product.

Note — --bam=0 is the recommended output mode

For production pipelines, always use --bam=0 and pipe to samtools sort. See Best Practices: output format for the canonical pipeline.

Methylation alignment (--meth)

Pass --meth for bisulfite/RRBS samples. This activates inline C-to-T read conversion, bwameth.py-compatible flag defaults, and inline BAM post-processing. See Quick start: methylation alignment for the two-command workflow and the Methylation Reference for full detail.

Shared-memory index auto-attach

When bwa-mem3 shm has staged the index into shared memory, bwa-mem3 mem attaches automatically — no extra flag is required. The shared-memory path is transparent to users.

Cross-references

The full flag list is in the CLI Reference: mem page.


See also: Output: SAM/BAM, headers, tags · Threading and resource use · Best Practices: output format · CLI Reference: mem · Methylation Reference: overview

Output: SAM/BAM, headers, tags

bwa-mem3 writes output in either SAM (default) or BAM (--bam) format. This page covers the header structure and every non-standard SAM tag emitted by bwa-mem3.

Output format

By default, bwa-mem3 mem writes SAM to stdout. Pass --bam (or --bam=N for a specific compression level) to write BAM. Level 0 (uncompressed) is the default when --bam is given without an argument, which is optimal when piping to a downstream samtools sort.

# SAM (default)
bwa-mem3 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > out.sam

# Uncompressed BAM — best for piping
bwa-mem3 mem --bam -t 16 ref.fa R1.fq.gz R2.fq.gz | samtools sort -@ 8 -o out.bam -

# Compressed BAM — useful when the output is the final file
bwa-mem3 mem --bam=6 -t 16 ref.fa R1.fq.gz R2.fq.gz > out.bam

SAM header

@HD

A default @HD VN:1.6 SO:unsorted line is emitted unless the user supplies one via -H. The sort order is unsorted because bwa-mem3 writes records in input read order; downstream sorting is always a separate step.

@SQ

One @SQ line is written per reference sequence, with the sequence name (SN:) and length (LN:) derived from the FM-index. If the index was built with a .dict or .hdr file that supplies @SQ records, those records are used instead of the auto-generated ones.

In methylation mode (--meth), the doubled reference contains sequences with an f or r prefix in their names. The inline BAM post-processor collapses these back to canonical chromosome names so that the output @SQ lines match a standard non-methylation alignment. See Chimera QC and header rewriting.

@PG

One @PG entry is written in standard mode:

| ID | Description |
|---|---|
| bwa-mem3 | The alignment step. VN: is the bwa-mem3 version string; CL: is the full command line. |

In methylation mode (--meth), a second @PG entry is appended:

| ID | Description |
|---|---|
| bwa-mem3-meth | The inline post-processor. VN: carries the version with -meth suffix; CL: is the full command line. |

The bwa-mem3-meth entry follows immediately after the bwa-mem3 entry and records the post-processing step as a distinct pipeline node, matching the convention of separate-tool pipelines.

Tags emitted by bwa-mem3

Standard tags

bwa-mem3 emits the same standard tags as bwa-mem2 (NM:i, MD:Z, AS:i, XS:i, SA:Z, RG:Z, XA:Z, etc.). These are documented in the SAM specification or, for the bwa-specific XS:i and XA:Z tags, in the bwa manual, and are not described further here.

HN:i — total alignment hit count

HN:i:<count>

The total number of primary alignments (both reported and suppressed) that the aligner found for this read, before the -h supplementary cap is applied. Useful for distinguishing “uniquely mapped” from “multi-mapped” reads without relying solely on MAPQ.

HN:i is emitted on the primary alignment record only.
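
One way to use the tag is to count reads with more than one hit. The sketch below reads SAM text on stdin and counts primary records whose HN:i value exceeds 1. Treating HN:i greater than 1 as "multi-mapped" is an interpretation of the description above, and the function name count_multi_hit is hypothetical.

```shell
# Count primary records whose HN:i value is greater than 1.
# Input: SAM text on stdin (header lines are ignored).
count_multi_hit() {
    awk -F '\t' '
        /^@/ { next }                                      # skip header lines
        int($2 / 256) % 2 || int($2 / 2048) % 2 { next }   # skip secondary (0x100) and supplementary (0x800)
        {
            # optional tags start at column 12
            for (i = 12; i <= NF; i++)
                if ($i ~ /^HN:i:/ && substr($i, 6) + 0 > 1) { n++; break }
        }
        END { print n + 0 }
    '
}
```

For BAM output, decode to text first: samtools view out.bam | count_multi_hit.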

Methylation-only tags

The following tags are emitted only when --meth is active. See SAM tags: YS, YC, YD for the full per-tag reference.

| Tag | Type | Description |
|---|---|---|
| YS:Z | string | Original (pre-c2t) read sequence |
| YC:Z | string | Conversion direction: CT (R1, C→T) or GA (R2, G→A) |
| YD:Z | string | Methylation strand: f (forward) or r (reverse) |

MAPQ semantics

MAPQ semantics are inherited from bwa-mem2 and follow the same scoring model. In methylation mode, alignments identified as chimeras (longest M/=/X run covering less than 44% of the read length) have their MAPQ capped at 1 and the 0x200 (QC fail) flag set. See Chimera QC and header rewriting.


See also: Aligning short reads (mem) · Methylation Reference: SAM tags · Methylation Reference: post-processing · CLI Reference: mem · Best Practices: output format

Threading and resource use

The -t flag

-t INT   number of threads [1]

bwa-mem3 parallelizes alignment by dividing the input into fixed-size batches (controlled by -K) and processing batches concurrently. Threads share the in-memory FM-index; there is no per-thread copy.

How threads interact with performance

Where threads help

  • Seed finding (SMEM enumeration) is fully parallel across reads in a batch.
  • Extension (banded Smith-Waterman) is fully parallel.
  • Pair rescue is parallel.
  • BAM encoding (when --bam is active) is parallel.

Where threads stop helping

Wall-clock alignment time scales well with thread count up to approximately 16–32 threads on a modern CPU. Beyond that, several effects combine to flatten the curve:

  1. FM-index bandwidth. The index for hg38 is ~28 GB and does not fit in the L3 cache of any current server. At high thread counts, threads contend for memory bandwidth accessing the BWT.
  2. IO contention. On spinning disk or a shared network filesystem, concurrent reads of the same large index file saturate IO bandwidth before the CPU is saturated.
  3. Output serialization. SAM output is serialized per-record to stdout. BAM output with --bam reduces this bottleneck but does not eliminate it entirely.

| Machine | Recommended -t | Notes |
|---|---|---|
| 16-core workstation | 12–14 | Leave 2 cores for samtools sort |
| 32-core server | 24–28 | Leave cores for downstream and OS overhead |
| 64-core server | 40–48 | Marginal returns above 48; test with your workload |
| Multiple parallel runs | divide evenly | See below |

These are starting points. Profile with your specific data and storage configuration to find the practical optimum.

Running multiple parallel alignments

When running multiple bwa-mem3 mem processes on the same machine, divide threads so that the total does not exceed the physical core count. For example, on a 32-core machine running four concurrent samples:

# Four parallel runs on 32 cores: 4 × (6 aligner + 2 sort threads) = 32
for sample in a b c d; do
  bwa-mem3 mem --bam -t 6 ref.fa "${sample}_R1.fq.gz" "${sample}_R2.fq.gz" \
    | samtools sort -@ 2 -o "${sample}.bam" - &
done
wait

Using shared memory (bwa-mem3 shm) amortizes the index read-in cost across all four runs. See Quick start: shared-memory index and Best Practices: multi-sample workflows.

Memory use

Peak RAM during alignment is dominated by the in-memory FM-index. For hg38, expect roughly 28 GB of resident memory per bwa-mem3 mem process. Each in-flight batch adds working memory that grows with the -K batch size (input bases per batch).

With bwa-mem3 shm, the index is mapped from a shared-memory segment, so multiple concurrent mem processes share the same physical pages. Total RAM use for the index is approximately one copy, not one per process.
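
The difference is easy to put into numbers when sizing a machine. The sketch below is plain arithmetic on the index size only; it ignores per-batch working memory, and the function name index_ram_gb is hypothetical.

```shell
# Approximate RAM (GB) occupied by the index across concurrent mem processes.
# Arguments: process count, index size in GB, mode ("shm" or "private").
index_ram_gb() {
    case $3 in
        shm)     echo "$2" ;;                  # one shared copy
        private) echo $(( $1 * $2 )) ;;        # one copy per process
        *)       echo "index_ram_gb: mode must be shm or private" >&2; return 1 ;;
    esac
}
```

Four concurrent hg38 runs need about 112 GB for the index alone without staging (index_ram_gb 4 28 private) and about 28 GB with it (index_ram_gb 4 28 shm).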

Tip — Use shm for repeated runs on the same machine

If you run more than a few samples on the same machine without rebooting, bwa-mem3 shm pays off immediately. The index is read from disk once and stays in RAM for all subsequent mem invocations.

IO recommendations

  • Use local NVMe storage for the index files when possible. The ~28 GB BWT read is the dominant IO event at the start of each mem run.
  • Write BAM (--bam) to a fast local disk or pipe directly to samtools sort. Avoid writing uncompressed SAM to a network filesystem.
  • Separate read and write paths if your storage topology allows it: read the index from one volume and write sorted BAM to another.

See also: Aligning short reads (mem) · Memory allocator (mimalloc) · Quick start: shared-memory index · Best Practices: multi-sample workflows · Performance: tuning checklist

Memory allocator (mimalloc)

bwa-mem3 vendors and links mimalloc, Microsoft’s high-performance memory allocator, into every binary by default. On multi-threaded alignment workloads, mimalloc reduces wall-clock time by replacing the system allocator with one optimized for many small, short-lived allocations — exactly the access pattern produced by the inner alignment loops.

What mimalloc replaces

The system allocator (glibc malloc on Linux, libSystem malloc on macOS) is a general-purpose allocator whose shared heap structures are protected by locks. Under heavy multi-threaded allocation pressure (16+ threads each issuing thousands of short-lived allocations per batch), contention on those locks becomes a measurable bottleneck. mimalloc uses per-thread free lists and a segment-based heap to eliminate most of this contention.

Platform-specific linkage

The linkage strategy differs by OS:

| Platform | Mechanism |
|---|---|
| Linux | Static linkage with --whole-archive. The entire mimalloc static library is embedded into the bwa-mem3 binary; its malloc/free symbols take precedence over glibc’s at link time. |
| macOS | Dynamic linkage via dyld interposing. libmimalloc.dylib is built alongside the binary; dyld’s DYLD_INSERT_LIBRARIES interposing mechanism replaces malloc/free at load time. The dylib ships next to the binary. |

Warning — macOS: keep libmimalloc.dylib next to the binary

On macOS, libmimalloc.dylib must remain in the same directory as the bwa-mem3 binary (or be reachable via the embedded rpath). If you move bwa-mem3 without also moving libmimalloc.dylib, the binary silently falls back to the system allocator. The only sign is that bwa-mem3 version no longer prints the mimalloc line.

Verifying that mimalloc is active

Run:

./bwa-mem3 version

When mimalloc is linked and loaded, the output includes a line like:

mimalloc 3.x.x

If that line is absent, mimalloc is not active.
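
In a deployment script the same check can be automated. The sketch below tests whether the version output contains a line starting with "mimalloc ", matching the sample output above; the function name mimalloc_active is illustrative.

```shell
# Exit status 0 when the text on stdin contains a line starting with "mimalloc ".
mimalloc_active() {
    grep -q '^mimalloc '
}
```

A typical call is ./bwa-mem3 version 2>&1 | mimalloc_active || echo "warning: system allocator in use" >&2. The 2>&1 is a precaution, since this page does not say which stream the version text goes to.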

Opting out

Pass USE_MIMALLOC=0 at build time to produce a binary linked against the system allocator:

make USE_MIMALLOC=0

Reasons to opt out:

  • AddressSanitizer (ASAN) builds. The Makefile automatically sets USE_MIMALLOC=0 when ASAN_FLAGS is detected, because ASAN and mimalloc’s malloc interposing cannot coexist cleanly.
  • Container environments where distributing a dylib alongside the binary is inconvenient.
  • Reproducibility testing to isolate whether a behavioral difference is allocator-related.

Note — Default is on

USE_MIMALLOC=1 is the default. Opt-out is not recommended for production workloads — mimalloc measurably reduces wall time on multi-threaded runs.

Build internals

The mimalloc source lives in ext/mimalloc/ as a git submodule. The Makefile target builds it via CMake before linking bwa-mem3. The relevant Makefile variables are MIMALLOC_SRC, MIMALLOC_BUILD, and MIMALLOC_LIB.

The feature was introduced in bwa-mem3 as part of the performance improvement work. See Features and Build & infrastructure for the PR history.


See also: Threading and resource use · Features: mimalloc · Getting Started: installation · Developer Guide: building from source · Performance: tuning checklist

Tips and best practices

This page collects the most commonly useful operational tips for running bwa-mem3. Each tip is a short actionable point; the linked pages provide the full rationale.

Index once, align many times

Build the FM-index once per reference version. The on-disk index format is stable across bwa-mem3 releases and architecture variants — bwa-mem3.avx2 and bwa-mem3.avx512bw read the same files. You do not need to re-index when upgrading bwa-mem3 unless the release notes say otherwise.

# Build once
bwa-mem3 index ref.fa

# Align many samples
for sample in a b c d; do
  bwa-mem3 mem --bam -t 16 ref.fa ${sample}_R1.fq.gz ${sample}_R2.fq.gz \
    | samtools sort -@ 4 -o ${sample}.bam -
done

Pipe to samtools sort -@

Never write an intermediate unsorted BAM to disk and then sort it in a second step. bwa-mem3’s --bam mode + samtools sort in a single pipeline avoids the extra write/read cycle and is significantly faster:

bwa-mem3 mem --bam -t 16 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -@ 8 -o out.bam -
samtools index out.bam

Allocate roughly 2/3 of available threads to bwa-mem3 mem and 1/3 to samtools sort. On a 24-core machine, -t 16 for bwa-mem3 and -@ 8 for samtools is a good starting point.

Stage the index in shared memory for batch workloads

When aligning many small samples on the same machine, reading the ~28 GB hg38 index from disk on every mem invocation can become the dominant wall-clock cost. Stage it once:

bwa-mem3 shm ref.fa

Subsequent bwa-mem3 mem invocations attach automatically. The shared-memory segment persists until explicitly dropped (bwa-mem3 shm -d) or the machine reboots.

Warning — Always drop the segment before re-indexing

There is no staleness check. If you rebuild the index without first dropping the shared-memory segment, bwa-mem3 mem will attach to the stale segment and produce incorrect alignments without any warning. Always run bwa-mem3 shm -d before bwa-mem3 index.

Pin threads when running concurrent jobs

When running multiple bwa-mem3 mem processes in parallel, divide threads explicitly so that the total does not exceed the physical core count. Avoid relying on the scheduler to balance over-subscribed threads — each process will spin waiting for CPU time, and total throughput drops.

# 4 jobs × 6 aligner threads = 24, on a 24-core machine.
# The 4 × 2 sort threads come on top of that; lower -t if the sorters compete for CPU.
for sample in a b c d; do
  bwa-mem3 mem --bam=0 -t 6 ref.fa "${sample}_R1.fq.gz" "${sample}_R2.fq.gz" \
    | samtools sort -@ 2 -o "${sample}.bam" - &
done
wait
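If the sort threads are counted against the same core budget, the per-job aligner thread count follows from simple arithmetic. This is a sketch under that assumption; per_job_aligner_threads is a hypothetical helper, not a bwa-mem3 feature.

```bash
# Aligner threads each job can use once its sort threads are counted too.
# $1: physical cores, $2: concurrent jobs, $3: sort threads per job.
per_job_aligner_threads() {
  local cores=$1 jobs=$2 sort_t=$3 per_job
  per_job=$(( cores / jobs - sort_t ))
  if [ "$per_job" -lt 1 ]; then per_job=1; fi
  echo "$per_job"
}

per_job_aligner_threads 24 4 2   # prints: 4
```

With four jobs on 24 cores and -@ 2 per sorter, this strict accounting gives -t 4 per job, for 4 × (4 + 2) = 24 threads in total.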

See Threading and resource use for per-machine thread count recommendations.

Use the right binary for your CPU

bwa-mem3 ships separate binaries per SIMD level. Using the highest level supported by your CPU gives the best performance:

| CPU generation | Recommended binary |
| --- | --- |
| Modern Intel/AMD (2018+) | bwa-mem3.avx512bw or bwa-mem3.avx2 |
| Older x86 | bwa-mem3.sse42 or bwa-mem3.sse41 |
| Apple Silicon / AWS Graviton | bwa-mem3 (single ARM binary) |

When you run bwa-mem3 (the launcher), it detects your CPU and execs the appropriate variant automatically. If you copy only a single SIMD binary, call it directly.

See Performance: SIMD dispatch matrix.

Include a read-group header

Always pass -R with at minimum ID: and SM: fields. Many downstream tools (GATK, fgbio, Picard) require a @RG header and will fail or warn without one.

bwa-mem3 mem \
  -R $'@RG\tID:run1\tSM:sample1\tLB:lib1\tPL:ILLUMINA' \
  -t 16 ref.fa R1.fq.gz R2.fq.gz
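When the read-group fields come from variables, building the string with printf keeps the tab separators explicit and works in shells without $'...' quoting. A minimal sketch; rg_header is a hypothetical helper name and the four fields are the ones used in the example above.

```bash
# Build an @RG line with real tab characters between the fields.
# $1: ID, $2: SM, $3: LB, $4: PL
rg_header() {
  printf '@RG\tID:%s\tSM:%s\tLB:%s\tPL:%s' "$1" "$2" "$3" "$4"
}

rg=$(rg_header run1 sample1 lib1 ILLUMINA)
# Then pass it on:  bwa-mem3 mem -R "$rg" -t 16 ref.fa R1.fq.gz R2.fq.gz
```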

Further reading

The Best Practices section covers these topics in depth: Build, Output Format, Multi-Sample Workflows, Methylation Defaults, and Anti-Patterns.


See also: Aligning short reads (mem) · Threading and resource use · Memory allocator (mimalloc) · Performance: tuning checklist · Best Practices: anti-patterns

Performance Overview

Performance claims in this section are benchmarked, not asserted. The canonical source of truth for benchmark methodology, hardware configurations, and current numbers is bwa-mem3-bench, a reproducible benchmarking harness that runs across AWS Batch architectures (x86 AVX2, AVX-512, ARM Graviton). Consult that repository before drawing conclusions from isolated anecdotal timings.

What drives bwa-mem3’s performance

bwa-mem3 inherits the SIMD-vectorized alignment kernels of bwa-mem2 and adds several improvements of its own. The headline gains relative to a stock bwa-mem2 build fall into four categories.

Vectorized alignment kernels. The Smith-Waterman and banded-SWA kernels (kswv, bandedSWA) are compiled against the widest SIMD ISA the current CPU supports — SSE4.1 through AVX-512BW on x86, or native NEON on ARM. On Apple Silicon, native NEON intrinsics replaced the sse2neon shim in the two hottest kernels, delivering roughly 10% additional throughput over the pure-translation baseline. See SIMD dispatch matrix for the full picture.

libsais FM-index construction. The indexing step uses the linear-time suffix-array/BWT construction library libsais in place of the original quadratic-time approach. This cuts bwa-mem3 index wall time substantially on large references. See What’s Different — Performance improvements for the corresponding PR details.

mimalloc allocator. bwa-mem3 vendors and statically links mimalloc, replacing the system malloc/free for all allocations. On Linux the library is injected via --whole-archive; on macOS it uses dyld interposition. The allocator shows consistent throughput gains on multi-threaded workloads because mimalloc avoids the lock contention in glibc’s ptmalloc at high thread counts. See User Guide — Memory allocator for details.

Profile-Guided Optimization (PGO). The build system provides make pgo-generate and make pgo-use targets that compile an instrumented binary, gather branch-probability and call-frequency profiles from a representative workload, and then recompile with those profiles applied. On Apple Silicon the measured gain is approximately 3%; on x86 the gain depends on the workload mix. PGO is opt-in and is not applied to the default make output. See PGO build for the full workflow.

Consolidated mapping speedups

PR #58 and the related lockstep SMEM-batching work (#33) reduced per-read overhead in the main mapping loop beyond what upstream bwa-mem2 carries. The batch -H ingestion improvement (#49) further reduces header-processing latency for large sample sets.

Benchmarking responsibly

Alignment throughput is sensitive to read length, error rate, reference size, thread count, CPU architecture, NUMA topology, and whether the index is cold (read from disk) or warm (already in the kernel page cache). The bwa-mem3-bench harness controls for these variables by running standardized workloads on defined instance types. If you need numbers for a procurement or publication decision, run the harness against your target hardware.


See also: SIMD dispatch matrix · PGO build · Tuning checklist · What’s Different — Performance improvements · bwa-mem3-bench

SIMD Dispatch Matrix

bwa-mem3 uses a multi-binary dispatch strategy on x86: the bwa-mem3 launcher reads the CPU’s CPUID bits at startup, then execvs the highest-capability variant binary found on disk. On ARM there is only one NEON level, so the launcher execs bwa-mem3.arm64 directly without any cpuid check.

Dispatch flowchart

flowchart TD
    A[bwa-mem3 launcher starts] --> B{Platform?}
    B -- ARM / aarch64 --> C[exec bwa-mem3.arm64]
    B -- x86 --> D[read CPUID via __cpuidex]
    D --> E{AVX-512BW supported?}
    E -- yes --> F[exec bwa-mem3.avx512bw]
    E -- no --> G{AVX2 supported?}
    G -- yes --> H[exec bwa-mem3.avx2]
    G -- no --> I{AVX supported?}
    I -- yes --> J[exec bwa-mem3.avx]
    I -- no --> K{SSE4.2 supported?}
    K -- yes --> L[exec bwa-mem3.sse42]
    K -- no --> M{SSE4.1 supported?}
    M -- yes --> N[exec bwa-mem3.sse41]
    M -- no --> O[error: no supported variant found]

The launcher reads CPUID leaf 1 (for the SSE4.1, SSE4.2, and AVX flags) and leaf 7 (for the AVX2 and AVX-512 flags). It checks in descending capability order and execs the first variant that the CPU supports and that is present on disk. If no such variant is executable, it exits with an error.

The ARM path always tries bwa-mem3.arm64 first, then falls back to the bare bwa-mem3 name (which on ARM is a symlink to bwa-mem3.arm64 created by make arm64).
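The x86 selection order described above can be illustrated in shell. This is only a sketch of the ordering logic, not the launcher's code: the real launcher reads CPUID directly, whereas this version takes a list of feature names in the spelling Linux uses in /proc/cpuinfo (an assumption), and pick_variant is a hypothetical name.

```bash
# Pick the highest-capability variant that the CPU supports and that exists.
# $1: space-separated CPU feature flags, $2: directory holding the variants.
pick_variant() {
  local flags dir pair flag variant
  flags=" $1 "
  dir=$2
  for pair in avx512bw:avx512bw avx2:avx2 avx:avx sse4_2:sse42 sse4_1:sse41; do
    flag=${pair%%:*}
    variant=${pair##*:}
    case "$flags" in
      *" $flag "*)
        if [ -x "$dir/bwa-mem3.$variant" ]; then
          echo "bwa-mem3.$variant"
          return 0
        fi
        ;;
    esac
  done
  echo "error: no supported variant found" >&2
  return 1
}

# Linux usage:
#   pick_variant "$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)" .
```

Running it in the install directory shows which variant would be chosen there, which helps when a variant file is missing.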

Building the variant binaries

make multi builds all five x86 variants and the launcher in sequence:

make multi

This produces:

| Filename | Arch flags | Minimum CPU (Intel / AMD) |
| --- | --- | --- |
| bwa-mem3.sse41 | -msse4.1 | Penryn (2007) / Bulldozer (2011) |
| bwa-mem3.sse42 | -msse4.2 | Nehalem (2008) / Bulldozer (2011) |
| bwa-mem3.avx | -mavx | Sandy Bridge (2011) / Bulldozer (2011) |
| bwa-mem3.avx2 | -mavx2 | Haswell (2013) / Excavator (2015) |
| bwa-mem3.avx512bw | -mavx512f -mavx512bw | Skylake-X (2017) / Zen 4 (2022) |

For ARM builds, make arm64 produces a single binary and creates the symlink:

| Filename | Arch flags | Platform |
| --- | --- | --- |
| bwa-mem3.arm64 | -DAPPLE_SILICON=1 + sse2neon shim | Any aarch64 / Apple Silicon |

To build a single-arch binary for a known target (e.g. for a cluster with uniform hardware):

make arch=avx2

The resulting binary is named bwa-mem3 and contains only AVX2 code. The launcher is not built; it is not needed when the target ISA is fixed.

Kernel vectorization coverage

The table below lists the kernels that have SIMD implementations and which ISA levels they cover.

| Kernel | SSE4.1 | SSE4.2 | AVX | AVX2 | AVX-512BW | NEON (arm64) |
| --- | --- | --- | --- | --- | --- | --- |
| kswv (vectorized Smith-Waterman) | 8-wide int16 | 8-wide int16 | 8-wide int16 | 16-wide int16 | 32-wide int16 | 8-wide int16 (native) |
| bandedSWA (banded alignment) | SSE2 baseline | SSE2 baseline | SSE2 baseline | SSE2 baseline | SSE2 baseline | native NEON blendv |
| FM-index lookup (FMI_search) | SSE popcount | SSE popcount | SSE popcount | SSE popcount | SSE popcount | __builtin_popcountl |
| libsais BWT construction | scalar | scalar | scalar | OpenMP parallel | OpenMP parallel | OpenMP parallel |

Note — FM-index is memory-bound

The FM-index backward-extension loop is limited by pointer-chasing through the cp_occ arrays, not by computation. Additional SIMD width does not increase throughput here. See the Apple Silicon optimization log in Developer Guide — Apple Silicon / NEON port for the profiling evidence.

Why the launcher uses execv, not a function pointer

The multi-binary design was inherited from bwa-mem2. Because each variant is compiled in full for its target ISA, the compiler can use that ISA’s instruction set everywhere: in hand-vectorized loops, in auto-vectorized loops, and in register allocation and branch heuristics. A single binary that dispatches through ISA-specific function pointers gets the same benefit for the hand-written kernels, but everything else stays compiled for the baseline ISA. For a workload with this many scalar loops, the execv approach yields a measurable difference. On ARM every supported CPU has the same NEON level, so a single binary is sufficient.


See also: Performance overview · PGO build · Developer Guide — SIMD dispatch architecture · Developer Guide — Multi-binary launcher (x86) · Developer Guide — Apple Silicon / NEON port

PGO Build

Profile-Guided Optimization (PGO) is a two-pass compiler technique. In the first pass (pgo-generate) the compiler inserts counters into every branch, call site, and loop back-edge. You run a representative training workload against the instrumented binary so those counters accumulate real branch-probability data. In the second pass (pgo-use) the compiler recompiles every translation unit using the collected profiles to make better inlining, branch-prediction, and code-layout decisions.

bwa-mem3’s Makefile provides three targets that implement this workflow.

Observed gains

On Apple Silicon (M-series), PGO delivered approximately 3% throughput improvement over the native NEON build. The gain on x86 depends on the workload — short-read paired-end alignment on avx2 or avx512bw hardware typically sees 2–5%. PGO is most useful when you will run the same binary on the same hardware against the same workload repeatedly (e.g. a production pipeline node). It is not worth the extra build time for one-off or exploratory runs.

Workflow

Step 1: Build the instrumented binary

make pgo-generate

By default PGO_ARCH is set to arm64 on Apple Silicon / aarch64 hosts and native on x86 hosts. To target a specific ISA, pass PGO_ARCH explicitly:

make pgo-generate PGO_ARCH=avx2

This produces a binary named bwa-mem3.pgo-instr (or bwa-mem3.pgo-instr.avx2 for non-default arch). Profiles are written to the directory pgo_profiles/ by default. Override with PGO_PROFILE_DIR:

make pgo-generate PGO_ARCH=avx2 PGO_PROFILE_DIR=/scratch/pgo-profiles-avx2

Step 2: Run the training workload

Run a workload that is representative of your production use. A single-end or paired-end alignment run against the same reference and similar read length is sufficient. A larger training run produces more stable profiles but 5–10 million read pairs is generally enough.

./bwa-mem3.pgo-instr mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null

Output is discarded because only the profile counters written during the run matter, not the alignments themselves.

Tip — Training workload size

Aim for a training run that exercises the same code paths as your production workload. If you align 150 bp paired-end reads in production, train on 150 bp reads. If you use --meth, include a methylation alignment run in training. A few million read pairs is sufficient; a full WGS run provides diminishing returns.
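One simple way to get a training set of a few million pairs is to take the first N records from each production FASTQ. The sketch below assumes standard four-line FASTQ records and gzip-compressed input; take_reads is a hypothetical helper. Reads at the start of a file are not always representative of the whole run, so treat this as a starting point.

```bash
# Write the first N reads of a gzipped FASTQ to a new gzipped FASTQ.
# A FASTQ record is 4 lines, so N reads = 4 * N lines.
# $1: number of reads, $2: input .fq.gz, $3: output .fq.gz
take_reads() {
  local n=$1 src=$2 dst=$3
  gzip -dc "$src" | head -n "$(( n * 4 ))" | gzip -c > "$dst"
}

# Example (apply the same N to both mates so pairs stay in sync):
#   take_reads 5000000 R1.fq.gz train_R1.fq.gz
#   take_reads 5000000 R2.fq.gz train_R2.fq.gz
```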

Step 3: Build the optimized binary

make pgo-use

Or with matching arch and profile dir:

make pgo-use PGO_ARCH=avx2 PGO_PROFILE_DIR=/scratch/pgo-profiles-avx2

This produces bwa-mem3.pgo (or bwa-mem3.pgo.avx2). The binary is ready to use in production.

Step 4: Clean up instrumentation artifacts

make pgo-clean

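This removes the profile directory and all bwa-mem3.pgo-instr* and bwa-mem3.pgo* files. The second pattern also matches the optimized binary from Step 3, so copy or install that binary elsewhere before cleaning.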

Multi-arch builds with PGO

Each architecture requires its own profile because the instrumentation counters are embedded in arch-specific code. Run the full three-step workflow once per arch and keep the profiles in separate directories:

# AVX2 profile
make pgo-generate PGO_ARCH=avx2 PGO_PROFILE_DIR=pgo_profiles_avx2
./bwa-mem3.pgo-instr.avx2 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null
make pgo-use PGO_ARCH=avx2 PGO_PROFILE_DIR=pgo_profiles_avx2

# AVX-512BW profile (separate host or same host with matching CPU)
make pgo-generate PGO_ARCH=avx512bw PGO_PROFILE_DIR=pgo_profiles_avx512bw
./bwa-mem3.pgo-instr.avx512bw mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null
make pgo-use PGO_ARCH=avx512bw PGO_PROFILE_DIR=pgo_profiles_avx512bw
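For more than two architectures the same sequence can be generated in a loop. The function below is a dry-run sketch that only prints the commands shown above for each arch it is given; pgo_plan is a hypothetical name, and the read and reference file names are the placeholders used throughout this page.

```bash
# Print the three PGO commands for each architecture given as an argument.
pgo_plan() {
  local arch dir
  for arch in "$@"; do
    dir="pgo_profiles_$arch"
    echo "make pgo-generate PGO_ARCH=$arch PGO_PROFILE_DIR=$dir"
    echo "./bwa-mem3.pgo-instr.$arch mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null"
    echo "make pgo-use PGO_ARCH=$arch PGO_PROFILE_DIR=$dir"
  done
}

pgo_plan avx2 avx512bw
```

Review the printed plan first; each training run must still execute on a host whose CPU supports that arch.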

Warning — Profile portability

Profile data collected on one microarchitecture is not portable to a different one. An AVX2 profile collected on a Haswell CPU will not improve — and may pessimize — an AVX-512BW build run on a Sapphire Rapids CPU. Always collect profiles on the same hardware class where the optimized binary will run.

PGO and the multi-binary layout

The PGO targets produce a single optimized binary for a single arch target. They do not rebuild the full make multi set. If you want PGO-optimized multi-binary dispatch, build and profile each arch variant separately, place them alongside the launcher, and verify with ./bwa-mem3 version.

Relationship to LTO

make lto-build produces a Link-Time Optimization binary; make pgo-use produces a PGO-optimized binary. Both are independent opt-in targets. You can combine them by passing -flto (or -flto=thin for clang) as part of EXTRA_CXXFLAGS during the pgo-use step, but the combination has not been systematically benchmarked. In practice, LTO and PGO each provide modest single-digit gains; their interaction is compiler-specific.


See also: Performance overview · SIMD dispatch matrix · Tuning checklist · Best Practices — Build · Developer Guide — Building from source

Tuning Checklist

The items below are ordered by expected impact for most workloads. Work through them in sequence; there is little point optimizing output format before confirming you are running the right binary for your CPU.

1. Run the right binary for your CPU

If you built with make multi (recommended for production x86 deployments), the bwa-mem3 launcher reads CPUID at startup and execs the highest-capability variant automatically. Verify which variant is running by checking the banner printed to stderr at the start of a mem run:

-----------------------------
Executing in AVX512 mode!!
-----------------------------

If the banner says SSE4.1 on a machine you believe supports AVX2, the variant binary may be missing from the directory. Confirm with:

ls -1 bwa-mem3.sse41 bwa-mem3.sse42 bwa-mem3.avx bwa-mem3.avx2 bwa-mem3.avx512bw 2>&1

If files are missing, rebuild with make multi.
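In a deployment script, the same check can report exactly which variants are absent. A minimal sketch using the five variant names listed above; missing_variants is a hypothetical helper.

```bash
# List the x86 variant binaries that are missing or not executable in a directory.
# $1: directory to check (defaults to the current directory).
missing_variants() {
  local dir v missing
  dir=${1:-.}
  missing=""
  for v in sse41 sse42 avx avx2 avx512bw; do
    if [ ! -x "$dir/bwa-mem3.$v" ]; then
      missing="$missing bwa-mem3.$v"
    fi
  done
  echo "${missing# }"
}

# Prints an empty line when every variant is present.
missing_variants .
```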

For ARM / Apple Silicon, there is only one binary level. Confirm it is in use:

ls -la bwa-mem3
# expect: bwa-mem3 -> bwa-mem3.arm64

See SIMD dispatch matrix for the full dispatch logic and the minimum CPU requirements for each variant.

Tip — Single-arch deployments

On a cluster where all nodes have the same CPU, build with make arch=avx2 (or the appropriate ISA). The launcher overhead is negligible but removing it simplifies the deployment: only one binary to distribute and no variant-lookup failures.

2. Build with PGO if you will run repeatedly

For production pipeline nodes that will process many samples against the same reference, a PGO build provides an additional 2–5% throughput at the cost of one extra build pass and a training run:

make pgo-generate PGO_ARCH=avx2
./bwa-mem3.pgo-instr.avx2 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null
make pgo-use PGO_ARCH=avx2

See PGO build for the full workflow, including multi-arch and profile portability notes.

3. Use shared memory for many small samples

When aligning many samples on one machine against the same reference, loading the index into POSIX shared memory once and reusing it across all mem invocations eliminates redundant I/O and reduces per-sample startup time significantly. The benefit grows with the number of samples and the size of the reference.

# Load the index into shared memory once
bwa-mem3 shm ref.fa

# Align each sample against the in-memory index
bwa-mem3 mem -t 16 ref.fa R1.fq.gz R2.fq.gz | samtools sort -@ 4 -o sample.bam -

# When finished with all samples, drop the shared segment
bwa-mem3 shm -d

Warning — No staleness check

bwa-mem3 shm does not detect whether the on-disk index has changed after the segment was loaded. Always run bwa-mem3 shm -d before re-indexing a reference and re-loading with bwa-mem3 shm. Failing to do so results in alignments against a stale index.

See Getting Started — Shared-memory index and Best Practices — Multi-sample workflows for complete workflows.

4. Emit BAM directly

Use --bam (or --bam=0 for uncompressed BAM) to emit BAM instead of SAM. Uncompressed BAM avoids the text-formatting cost on the aligner side and the text-parsing cost on the downstream side. samtools sort reads BAM natively and is fastest when the input is uncompressed:

bwa-mem3 mem --bam=0 -t 16 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -@ 8 -o out.bam -
samtools index out.bam

The --bam flag (without =0) produces BGZF-compressed BAM. This is useful when writing directly to disk without a downstream piped tool.

See Best Practices — Output format for guidance on when SAM is still appropriate.

5. Pipe to a multi-threaded sorter

Sorting is typically the bottleneck after alignment. Keep a separate thread budget for samtools sort:

bwa-mem3 mem --bam=0 -t 12 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -@ 8 -m 2G -o out.bam -

On a 16-core machine, 12 threads for mem and 8 for samtools sort is a common starting point. The total exceeds the core count on purpose: the aligner is generally CPU-bound, while the sorter is I/O-bound during merge, so the two stages rarely need all of their threads at the same moment. Profile both stages to find the right split for your hardware.

Tip — Thread count tuning

bwa-mem3 mem scales well to 16–32 threads on most workloads. Beyond 32 threads the per-thread work unit becomes small enough that synchronization overhead starts to erode gains. See User Guide — Threading and resource use for thread-scaling data.

Summary table

| Item | Action | Reference |
| --- | --- | --- |
| Right binary for CPU | make multi; verify banner | SIMD dispatch matrix |
| PGO for production | pgo-generate → train → pgo-use | PGO build |
| Shared-memory index | bwa-mem3 shm ref.fa before batch runs | Quick start: shm |
| Emit uncompressed BAM | --bam=0 | Best Practices — Output format |
| Multi-threaded sort | samtools sort -@ with appropriate thread split | User Guide — Threading |

See also: Performance overview · SIMD dispatch matrix · PGO build · Best Practices — Build · User Guide — Threading and resource use

Build

This page describes the recommended build configuration for production use of bwa-mem3.

Choose the right arch target

The default make invocation builds the multi-binary launcher on x86 (or a single ARM64 binary on Apple Silicon). For production servers where the CPU family is known, specify the target explicitly so the compiler can generate tighter code and the binary does not need the launcher overhead:

# Most modern x86-64 servers (Haswell or later):
make arch=avx2

# Intel Cascade Lake / Sapphire Rapids, AWS c7i/m7i:
make arch=avx512bw

# Apple Silicon / AWS Graviton:
make arch=arm64

Omit arch= if the deployment target is heterogeneous or unknown; make (with no arguments) builds the full multi-binary suite on x86 and selects the fastest variant at runtime via cpuid.

See SIMD dispatch matrix for the full list of targets and which kernels each vectorizes.

Profile-Guided Optimization (PGO)

PGO typically yields 2–5% throughput improvement on real workloads (about 3% measured on Apple Silicon). It is opt-in, so the standard make target does not use it, but it is recommended for any installation that will run many alignment jobs against the same reference.

The workflow is three steps:

# Step 1: Build an instrumented binary (produces bwa-mem3.pgo-instr).
make pgo-generate

# Step 2: Run a representative training workload.
#   Use reads and a reference that reflect actual production input.
#   About 5–10 million read pairs is sufficient.
./bwa-mem3.pgo-instr mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null

# Step 3: Build the PGO-optimized binary (produces bwa-mem3.pgo).
make pgo-use

To target a specific SIMD level, pass PGO_ARCH=:

make pgo-generate PGO_ARCH=avx2
./bwa-mem3.pgo-instr.avx2 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null
make pgo-use PGO_ARCH=avx2
# Produces: bwa-mem3.pgo.avx2

Profile data is written to pgo_profiles/ by default. Pass PGO_PROFILE_DIR=<path> to change the location.

Tip — Training data matters

The training workload should resemble production input in read length, base quality distribution, and reference composition. A read set that is too short, too long, or too easy (low mismatch rate) will bias the branch predictions and may produce a build that is slower than the non-PGO baseline on real data.

mimalloc

mimalloc is compiled in by default (USE_MIMALLOC=1). The allocator improves multi-threaded throughput by reducing lock contention on malloc and free hot paths. Run bwa-mem3 version to confirm it is active:

bwa-mem3 version
# Expected output includes a line like:
#   mimalloc 3.x.x

To build without mimalloc (for example, when using AddressSanitizer or on a system with a known-incompatible allocator):

make USE_MIMALLOC=0

Summary

For a production installation on a known x86 server with AVX2:

make pgo-generate PGO_ARCH=avx2
./bwa-mem3.pgo-instr.avx2 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > /dev/null
make pgo-use PGO_ARCH=avx2
# Deploy: bwa-mem3.pgo.avx2

See also: SIMD dispatch matrix · PGO build · Memory allocator (mimalloc) · Building from source · Anti-patterns

Output Format

The choice of output format — SAM, compressed BAM, or uncompressed BAM — has a measurable effect on end-to-end pipeline wall time. This page explains why uncompressed BAM is the right default and shows the recommended pipeline.

Why uncompressed BAM is faster than SAM

When bwa-mem3 writes SAM (the default when --bam is not set), every alignment record must be serialized into ASCII text: integers are formatted as decimal strings, bases are encoded as characters, and flags are written as decimal numbers. The receiving process — typically samtools sort — then parses each field back from text into binary integers. Both conversions are pure overhead: the data is binary inside bwa-mem3 and binary inside samtools; text is only an interchange format that is immediately discarded.

Uncompressed BAM (--bam=0) bypasses this round-trip. bwa-mem3 writes binary BAM records directly via htslib’s wb0 mode. The write path performs no text formatting; the read path in samtools sort performs no text parsing. The htslib overhead of the wb0 write is negligible: it is effectively a buffered write(2) call with a small BGZF block header prepended.

Compressed BAM (--bam=1) adds BGZF compression on top, which costs CPU on the write side and gains nothing: the pipe is a kernel buffer in memory, not a disk or a network link, so there is no bandwidth to save, and samtools sort re-compresses its own output anyway. Compressed BAM on a pipe wastes CPU on both sides, once to compress and once to decompress.

bwa-mem3 mem --bam=0 -t 16 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -@ 8 -o out.bam -
samtools index out.bam

The -@ 8 flag gives samtools sort eight compression threads for writing the final sorted BAM. Tune this number based on available cores; the total core count should be split so that alignment threads and sort threads do not contend. A 16:8 split (bwa-mem3:samtools) works well on 24-core machines.

Tip — Thread allocation

Do not give all cores to bwa-mem3. Downstream samtools sort needs threads to compress and write the sorted BAM. Leaving 4–8 threads for samtools sort keeps the pipeline balanced and prevents a write bottleneck that would stall the aligner.

Methylation output

The --meth path always writes uncompressed BAM internally, regardless of the --bam flag. The post-processing step (header rewrite, chimera QC, YD:Z: tag) is performed inline before the record is handed to htslib, so the same pipeline shape applies:

bwa-mem3 mem --meth --bam=0 -t 16 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -@ 8 -o out.bam -
samtools index out.bam

When SAM is appropriate

SAM (the default, equivalent to omitting --bam) remains the right choice for:

  • Debugging. Plain text is readable with less, grep, and any text editor, making it easy to inspect individual records without samtools view.
  • Ad-hoc inspection. When you need to scan a few thousand reads to diagnose a mapping problem, piping to SAM and reading the output directly is faster than writing a BAM file and then querying it.
  • Compatibility with tools that require SAM input. Some legacy tools do not accept BAM. If the downstream tool does not support BAM, use SAM.

For production alignment jobs that feed samtools sort, always use --bam=0.

Summary table

| Format | --bam value | Pipe overhead | Recommended for |
| --- | --- | --- | --- |
| SAM | (default / omit) | High (text round-trip) | Debugging, ad-hoc inspection |
| Uncompressed BAM | 0 | Negligible | Production pipelines |
| Compressed BAM | 1 | High on write side | Writing directly to a file (no downstream sort) |

See also: Aligning short reads (mem) · Output: SAM/BAM, headers, tags · Threading and resource use · Tuning checklist · CLI Reference: mem

Multi-Sample Workflows

When you need to align many samples back-to-back against the same reference on a single machine, loading the FM-index into shared memory once — and keeping it resident across all alignment jobs — eliminates the index I/O cost for every sample after the first.

The problem: repeated index loads

The bwa-mem3 FM-index for hg38 is approximately 28 GB on disk. Without shared memory, bwa-mem3 mem reads the entire index from disk on every invocation. On a fast NVMe drive this takes 30–60 seconds; on a network-attached or spinning-disk filesystem it can take several minutes. For a batch of 100 samples, that adds roughly 50–100 minutes of pure I/O overhead on NVMe and several hours on slower storage.
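The batch overhead is just the per-load time multiplied by the number of samples. A small sketch of that arithmetic, using the 30 and 60 second NVMe figures quoted above; index_load_minutes is a hypothetical helper.

```bash
# Total index-load time for a batch, in whole minutes.
# $1: number of samples, $2: seconds per index load.
index_load_minutes() {
  local samples=$1 seconds_per_load=$2
  echo $(( samples * seconds_per_load / 60 ))
}

index_load_minutes 100 30   # prints: 50
index_load_minutes 100 60   # prints: 100
```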

Staging the index once with bwa-mem3 shm

# Stage the index into shared memory (one-time cost, ~28 GB for hg38).
bwa-mem3 shm ref.fa

# Align each sample. bwa-mem3 mem attaches automatically — no extra flag.
bwa-mem3 mem --bam=0 -t 16 ref.fa sample1_R1.fq.gz sample1_R2.fq.gz \
  | samtools sort -@ 4 -o sample1.bam -
bwa-mem3 mem --bam=0 -t 16 ref.fa sample2_R1.fq.gz sample2_R2.fq.gz \
  | samtools sort -@ 4 -o sample2.bam -
# ...

# When done, release the segment.
bwa-mem3 shm -d

For methylation workflows, stage the c2t index instead:

bwa-mem3 shm --meth ref.fa
bwa-mem3 mem --meth --bam=0 -t 16 ref.fa sample1_R1.fq.gz sample1_R2.fq.gz \
  | samtools sort -@ 4 -o sample1.bam -
bwa-mem3 shm -d

Confirming the index is staged

bwa-mem3 shm -l
# Prints the basename and memory usage of each staged segment.

If the listing is empty, the index is not staged and bwa-mem3 mem will fall back to loading from disk.

Thread layout for parallel alignment

Running multiple bwa-mem3 mem instances in parallel is efficient when the samples are independent and the machine has enough cores. The shared-memory index eliminates disk contention, so the bottleneck becomes CPU and memory bandwidth.

Guidelines for N-core machines:

  • N = 32: Two instances at -t 14 each, with -@ 4 for samtools sort. The 28 aligner threads leave 4 cores for the sort threads, the OS, and I/O.
  • N = 64: Two to four instances at -t 14 to -t 16, each with -@ 4 for samtools sort.
  • N = 128: Four to eight instances; keep at least 8–16 cores free for samtools sort threads and OS scheduling.

Tip — Memory bandwidth limit

The FM-index lookup is memory-bandwidth bound. On machines with NUMA topology (multi-socket or multi-chiplet), binding each bwa-mem3 instance to a NUMA node with numactl --cpunodebind=N --membind=N can improve throughput by reducing cross-node memory traffic.
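As an illustration of the binding described in the tip, the function below prints one numactl-wrapped command per NUMA node. It is a dry-run sketch: numa_plan is a hypothetical name, the s0/s1 sample file names are placeholders, and it assumes one bwa-mem3 instance per node.

```bash
# Print one numactl-wrapped alignment command per NUMA node (dry run only).
# $1: number of NUMA nodes, $2: aligner threads per instance.
numa_plan() {
  local nodes=$1 threads=$2 node=0
  while [ "$node" -lt "$nodes" ]; do
    echo "numactl --cpunodebind=$node --membind=$node bwa-mem3 mem --bam=0 -t $threads ref.fa s${node}_R1.fq.gz s${node}_R2.fq.gz"
    node=$(( node + 1 ))
  done
}

numa_plan 2 14
```

The number of nodes on a host can be read with numactl --hardware before choosing the instance count.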

Scripting a batch with a loop

bwa-mem3 shm ref.fa

for sample in sample1 sample2 sample3; do
  bwa-mem3 mem --bam=0 -t 16 ref.fa "${sample}_R1.fq.gz" "${sample}_R2.fq.gz" \
    | samtools sort -@ 4 -o "${sample}.bam" -
  samtools index "${sample}.bam"
done

bwa-mem3 shm -d

For parallel execution, replace the for loop body with a background job (or use a workflow manager such as Snakemake or Nextflow) and limit the degree of parallelism to match available cores.
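Without a workflow manager, xargs -P is one way to cap the number of concurrent jobs. The sketch below assumes an xargs that supports -P (GNU, BSD, and BusyBox do); run_batch is a hypothetical helper, and the example runs a harmless echo instead of the aligner.

```bash
# Run one job per sample name read from stdin, at most $1 at a time.
# $2 is a shell command template; the sample name is available in it as $1.
run_batch() {
  local max_jobs=$1 job=$2
  xargs -P "$max_jobs" -n 1 sh -c "$job" sh
}

# Dry run: print what would be aligned instead of calling bwa-mem3.
printf '%s\n' sample1 sample2 sample3 |
  run_batch 2 'echo "would align ${1}_R1.fq.gz ${1}_R2.fq.gz"'
```

For a real batch, the template would be the bwa-mem3 mem and samtools sort pipeline from the loop above, with -t and -@ reduced so that the concurrent jobs together stay within the core count.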

Warning — Stale segment footgun

If you need to re-index the reference (e.g. after updating it), always run bwa-mem3 shm -d before bwa-mem3 index. There is no automatic staleness check. See Anti-patterns for details.


See also: Quick start: shared-memory index · CLI Reference: shm · Output format · Threading and resource use · Anti-patterns

Methylation Defaults

bwa-mem3 mem --meth ships with a set of scoring and filtering defaults that match the bwameth.py reference implementation. This page describes what those defaults are, when to keep them, and when to override them.

What --meth sets

When --meth is passed, the following flags are applied automatically in addition to enabling inline c2t conversion and BAM post-processing:

| Flag | Value | Purpose |
| --- | --- | --- |
| -B | 2 | Mismatch penalty. Reduced from the bwa-mem2 default of 4. Bisulfite-treated reads carry C→T and G→A mismatches at converted positions; a lower penalty prevents these from causing spurious soft-clipping or unmapped reads. |
| -L | 10 | Clipping penalty. Increased from the bwa-mem2 default of 5 to discourage clipping of read ends that carry converted bases at positions that look like mismatches. |
| -U | 100 | Unpaired read penalty. Higher than the bwa-mem2 default of 17; methylation libraries typically have well-defined insert sizes and anomalous pairing usually reflects a mapping artifact. |
| -T | 40 | Minimum alignment score threshold. Higher than the bwa-mem2 default of 30; raises the bar to report an alignment, reducing spurious low-quality hits against the doubled reference. |
| -C, -M | (none) | Two separate switches with their standard bwa mem meanings: -C appends the FASTQ comment to each output record, and -M marks shorter split hits as secondary. The bwameth.py pipeline relies on -C to carry its per-read conversion tags through alignment. |

These defaults can all be overridden on the command line. The --meth flag sets them first; any explicit flag that follows overrides the --meth-set value.

When to keep the defaults

For standard whole-genome bisulfite sequencing (WGBS) workflows, the defaults are appropriate as-is. They were derived from the bwameth.py codebase and are expected by most downstream methylation calling tools. Unless you have a specific reason to deviate, use:

bwa-mem3 mem --meth --bam=0 -t 16 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -@ 4 -o out.bam -
samtools index out.bam

When to override

Low-coverage or targeted bisulfite sequencing. If your library covers a small target region and insert sizes are more variable, consider lowering -T (e.g. -T 20) to recover short or soft-clipped alignments in the target.

Amplicon bisulfite sequencing. Amplicon reads have uniform insert sizes; the default -U 100 is appropriate. However, if your amplicons are short (< 100 bp), consider raising -L further to reduce clipping at read ends.

Non-standard conversion chemistry. Some library preparations use only one strand conversion (C→T only, not G→A). In such cases, --set-as-failed r marks alignments to the reverse-complement strand as QC-failed, so downstream tools can filter them out and strand-ambiguous alignments add less noise:

bwa-mem3 mem --meth --set-as-failed r --bam=0 -t 16 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -@ 4 -o out.bam -

Chimeric reads from long-read-length protocols. By default, --meth applies a chimera QC heuristic: if the longest matching run (CIGAR M/=/X operations) is less than 44% of the read length, the alignment is flagged 0x200 (QC fail), the proper-pair flag 0x2 is cleared, and MAPQ is capped at 1. If your protocol produces legitimate long reads where this heuristic over-aggressively flags alignments, pass --do-not-penalize-chimeras:

bwa-mem3 mem --meth --do-not-penalize-chimeras --bam=0 -t 16 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -@ 4 -o out.bam -
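To see which records the 44% rule would affect, the threshold test can be reproduced on a CIGAR string. This is an illustrative sketch, not bwa-mem3’s code: it assumes each M, =, or X operation counts as its own run (adjacent operations are not merged), the caller supplies the read length, and both function names are hypothetical.

```bash
# Length of the longest single M, =, or X operation in a CIGAR string.
longest_match() {
  printf '%s\n' "$1" | awk '{
    best = 0; s = $0
    while (match(s, /^[0-9]+[MIDNSHP=X]/)) {
      len = substr(s, 1, RLENGTH - 1) + 0
      op = substr(s, RLENGTH, 1)
      if ((op == "M" || op == "=" || op == "X") && len > best) best = len
      s = substr(s, RLENGTH + 1)
    }
    print best
  }'
}

# Succeeds (exit 0) when the longest match is under 44% of the read length.
# $1: CIGAR string, $2: read length
is_chimera() {
  local longest
  longest=$(longest_match "$1")
  [ "$(( longest * 100 ))" -lt "$(( 44 * $2 ))" ]
}

longest_match 60S40M50S   # prints: 40
```

For a 150 bp read the cut-off is 66 bases, so 60S40M50S would be flagged while 100M50S would not.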

Note — Overrides are positional

Flags supplied after --meth on the command line override the defaults set by --meth. For example, bwa-mem3 mem --meth -B 4 ... uses -B 4 (not 2). Flags supplied before --meth are silently overwritten by --meth’s defaults, so always place overrides after --meth.

Downstream tool compatibility

The --meth output BAM is designed to be a drop-in replacement for the output of the bwameth.py pipeline. The following downstream tools have been used successfully with bwa-mem3 --meth output:

  • MethylDackel — extracts methylation calls from the YD:Z: strand tag.
  • Bismark — accepts the bwameth-convention YD:Z: strand annotation.
  • PileOMeth — reads the standard bisulfite BAM format.

If a tool requires the XB:Z: tag convention used by Bismark’s own aligner rather than the YD:Z: convention, a conversion step is needed before methylation calling.


See also: Methylation Reference: Overview · SAM tags: YS, YC, YD · Flags: --set-as-failed, --do-not-penalize-chimeras · Quick start: methylation alignment · Output format

Anti-Patterns

This page documents common mistakes that produce incorrect results or unnecessary failures when using bwa-mem3.

Re-indexing without dropping the shared-memory segment

Warning — Footgun

bwa-mem3 shm does not detect stale segments. If you re-run bwa-mem3 index after a shared-memory segment is already staged, the on-disk index files will not match the in-memory segment. bwa-mem3 mem will attach to the stale segment and produce incorrect alignments without any warning.

Always run bwa-mem3 shm -d before re-indexing:

bwa-mem3 shm -d           # drop all staged segments
bwa-mem3 index ref.fa     # rebuild the on-disk index
bwa-mem3 shm ref.fa       # re-stage the new index

There is no automatic staleness check in the implementation. The segment name is derived from the reference basename only; no content hash or modification timestamp is stored.

To confirm that no stale segments are staged, use bwa-mem3 shm -l before running any indexing step.
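One way to make the drop-then-rebuild order hard to forget is to wrap the three commands in a single function. The wrapper below is a sketch; the name reindex_safely and the BWA_MEM3 variable are hypothetical conveniences, and only the three documented commands are invoked.

```shell
BWA_MEM3=${BWA_MEM3:-bwa-mem3}   # hypothetical override for a specific build

reindex_safely() {               # usage: reindex_safely ref.fa
    "$BWA_MEM3" shm -d &&        # 1. drop every staged segment first
    "$BWA_MEM3" index "$1" &&    # 2. rebuild the on-disk index
    "$BWA_MEM3" shm "$1"         # 3. stage the freshly built index
}
```

Because the steps are chained with &&, a failed index build leaves no segment staged rather than a stale one. Note that shm -d drops all segments on the host, so this wrapper is only appropriate when no other reference needs to stay staged.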

Forgetting to initialize submodules

bwa-mem3 depends on several submodules (ext/htslib, ext/safestringlib, ext/libsais, ext/mimalloc, ext/sse2neon). A clone made without --recursive leaves these directories empty, and the build then fails, typically with missing headers at compile time or missing symbols at link time.

Warning — Missing submodules

Always clone with --recursive, or initialize submodules after cloning:

git clone --recursive https://github.com/fg-labs/bwa-mem3
# or, after a non-recursive clone:
git submodule update --init --recursive

If make reports missing headers (e.g. htslib/hts.h: No such file or directory), the submodules were not initialized.
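Before building, the submodule state can be checked directly. The sketch below relies on the fact that git submodule status prefixes uninitialized submodules with a minus sign; the function name uninitialized_submodules is hypothetical.

```shell
# Print the path of every submodule that git reports as uninitialized.
uninitialized_submodules() {   # reads `git submodule status` output on stdin
    sed -n 's/^-[0-9a-f]* \([^ ]*\).*/\1/p'
}
```

Run it from the repository root as `git submodule status --recursive | uninitialized_submodules`; empty output means every submodule has been initialized.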

Building without an arch target on a known CPU

The default make (no arch=) builds the multi-binary launcher suite on x86. On a production server with a known CPU family, this is unnecessary: the launcher adds a small cpuid dispatch overhead on every invocation, and the extra binaries consume disk space. More importantly, building without an explicit arch= means the compiler cannot assume any ISA beyond SSE4.1, so AVX2- and AVX-512-specific optimizations are not applied to the base binary.

Warning — Suboptimal build on known hardware

On a server with a known CPU family, always pass an explicit arch=:

make arch=avx2        # for Haswell and later x86
make arch=avx512bw    # for Cascade Lake, Ice Lake, Sapphire Rapids
make arch=arm64       # for Apple Silicon, AWS Graviton

The make multi target (or bare make on x86) is appropriate when you are building a binary that will be distributed and run on multiple CPU families, or when the target CPU is genuinely unknown.

See SIMD dispatch matrix for the full set of targets.
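For build scripts, the choice of target can be derived from the machine itself. The function below is a sketch that covers only the three arch= values shown above and falls back to make multi otherwise; the name pick_arch is hypothetical, and the flag names avx2 and avx512bw are the ones Linux reports in /proc/cpuinfo.

```shell
# Hypothetical helper: map a machine type and CPU flag list to a make command.
pick_arch() {   # usage: pick_arch MACHINE "CPU FLAGS..."
    case $1 in
        arm64|aarch64) echo "make arch=arm64"; return ;;
    esac
    case " $2 " in
        *" avx512bw "*) echo "make arch=avx512bw" ;;
        *" avx2 "*)     echo "make arch=avx2" ;;
        *)              echo "make multi" ;;   # older or unknown CPU
    esac
}
```

On Linux it could be called as `pick_arch "$(uname -m)" "$(grep -m1 '^flags' /proc/cpuinfo)"`. Build on the same CPU family that will run the binary, since the result reflects the build host only.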

Mixing bwa-mem3 and bwa-mem2 outputs in the same pipeline

bwa-mem3 adds several custom SAM tags that bwa-mem2 does not emit: HN:i (total number of primary alignments — both reported and suppressed — that the aligner found for this read, before the -h supplementary cap is applied), and — in --meth mode — YS:Z:, YC:Z:, and YD:Z:. It also rewrites @SQ header lines in --meth mode (collapsing f/r strand prefixes back to one entry per chromosome).

Warning — Header and tag mismatch

Do not merge BAM files produced by bwa-mem3 and bwa-mem2 without verifying that the @PG headers and custom tags are handled correctly by the downstream tool. In methylation workflows, a bwa-mem2 BAM mixed into a bwa-mem3 --meth pipeline will be missing YD:Z: strand annotations, which will cause methylation callers to silently drop or misclassify those records.

If you must merge outputs from both tools, run samtools view -H on both files and confirm that @SQ lines are consistent and that the downstream tool can tolerate the tag differences.
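The @SQ comparison can be scripted once the two headers have been saved to files with samtools view -H. The sketch below compares only the SN and LN fields, on the assumption that other @SQ attributes may legitimately differ between tools; the function names sq_names and same_sq are hypothetical.

```shell
# Print "SN<TAB>LN" for every @SQ line of a saved SAM header.
sq_names() {   # usage: sq_names header.sam
    awk -F '\t' '$1 == "@SQ" {
        sn = ""; ln = ""
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^SN:/) sn = substr($i, 4)
            if ($i ~ /^LN:/) ln = substr($i, 4)
        }
        print sn "\t" ln
    }' "$1"
}

# Succeeds when both headers list the same contigs, in the same order.
same_sq() {    # usage: same_sq a.header.sam b.header.sam
    [ "$(sq_names "$1")" = "$(sq_names "$2")" ]
}
```

A typical check would save each header with `samtools view -H in.bam > in.header.sam` and then run `same_sq a.header.sam b.header.sam || echo "@SQ mismatch"`. Tag differences such as a missing YD:Z: still have to be checked separately.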

Writing compressed BAM to a pipe

Passing --bam=1 (compressed BAM) when piping to samtools sort compresses the stream on the bwa-mem3 side and then immediately decompresses it on the samtools side. This wastes CPU on both ends with no benefit.

Use --bam=0 (uncompressed BAM) for all pipe-to-sort workflows. See Output format for the full explanation and recommended pipeline.


See also: Output format · Multi-sample workflows · Build · Quick start: shared-memory index · CLI Reference: shm

CLI Reference Overview

bwa-mem3 exposes four subcommands: index, mem, shm, and version. Run bwa-mem3 <subcommand> --help to see the full option list for any command.

How this section is structured

Each subcommand page follows the same layout:

  1. Introduction — what the subcommand does and when to reach for it.
  2. Synopsis — the verbatim --help output, auto-captured from the binary at build time and included here via mdbook’s {{#include}} directive. The snippet is regenerated by make docs-cli and CI fails if it drifts from the binary.
  3. Common usage — two or three worked command-line examples.
  4. Flag reference (for mem, grouped by topic) — per-flag prose covering semantics, defaults, and interaction with other flags that the --help text does not have room to explain.
  5. Notes / Gotchas — operational warnings about non-obvious behavior.
  6. See also — cross-links to related pages in this book.

Subcommands

index builds the FM-index from a reference FASTA. Pass --meth to produce a bwameth-style doubled c2t reference for methylation alignment.

mem aligns short reads against an indexed reference, producing SAM or BAM output. It is the primary alignment subcommand. The flag surface is large; the mem reference page groups flags by purpose to make them easier to navigate.

shm stages an FM-index into POSIX shared memory so that repeated bwa-mem3 mem invocations on the same machine skip the per-run disk read. It also lists and destroys staged segments.

version prints the bwa-mem3 release version and, when mimalloc is compiled in, the mimalloc version.


See also: User Guide — Aligning short reads · User Guide — Indexing the reference · Getting Started — Quick start: align paired-end FASTQs · Getting Started — Quick start: shared-memory index · Performance — Tuning checklist

index

bwa-mem3 index builds the FM-index (BWT + suffix array) that bwa-mem3 mem requires for alignment. Run it once per reference; the resulting files sit alongside the input FASTA and are reused for all subsequent alignment jobs. Pass --meth to produce a bwameth-compatible doubled c2t reference for bisulfite-seq alignment.

Synopsis

Usage: bwa-mem3 index [-p prefix] [-t N] [--max-memory SIZE] [--tmp-dir PATH] [--meth] <in.fasta>

  -p STR             output prefix (default: <in.fasta>)
  -t INT             worker threads [auto: detected cores, cgroup-aware]
  --max-memory SIZE  peak memory budget; SIZE accepts a G/M/K suffix
                     (case-insensitive) or bare bytes
                     [auto: min(50% of RAM, 32G), cgroup-aware]
  --tmp-dir PATH     scratch directory [$TMPDIR]
  --meth             build a bwameth-style doubled c2t reference + FMI.
                     Writes <in.fasta>.bwameth.c2t and the FMI alongside it.
                     Use with `bwa-mem3 mem --meth <in.fasta> R1.fq [R2.fq]`.
  -h, --help         print this help message and exit

Common usage

Build a standard index using all available cores:

bwa-mem3 index ref.fa

Build a methylation-aware index (required before bwa-mem3 mem --meth):

bwa-mem3 index --meth ref.fa

Limit peak RAM to 16 GB and write scratch data to /scratch:

bwa-mem3 index --max-memory 16G --tmp-dir /scratch ref.fa

Flag reference

-p STR — output prefix

By default, index files are written alongside <in.fasta> using the FASTA path as a prefix (e.g. ref.fa.bwt.2bit.64, ref.fa.0123, etc.). Use -p to write them to a different base path, such as a dedicated index directory:

bwa-mem3 index -p /idx/hg38 ref.fa
# writes /idx/hg38.bwt.2bit.64, /idx/hg38.0123, …
# align with: bwa-mem3 mem /idx/hg38 R1.fq R2.fq

-t INT — worker threads

Controls the number of threads used during index construction. The default auto-detects available cores and is cgroup-aware, so it behaves correctly inside containers and on shared cluster nodes. Set explicitly when you want to cap CPU usage.

--max-memory SIZE — peak memory budget

Limits how much RAM the indexer may use at once. SIZE accepts a G, M, or K suffix (case-insensitive) or a bare byte count. The default is min(50% of RAM, 32 GB), computed in a cgroup-aware manner.

For large references (hg38 and above) on machines with limited RAM, setting this to a value lower than the reference size causes the indexer to partition work and use --tmp-dir for intermediate files, at the cost of extra I/O.
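To make the SIZE grammar and the default concrete, here is a small sketch. It is not the indexer's parser: the function names are hypothetical, and it assumes binary multiples (1K = 1024 bytes), which the help text does not state.

```shell
# Hypothetical model of SIZE parsing: G/M/K suffix (any case) or bare bytes.
parse_size() {   # usage: parse_size 16G
    num=${1%[GgMmKk]}          # strip one trailing suffix letter, if present
    case $1 in
        *[Gg]) echo $(($num * 1024 * 1024 * 1024)) ;;
        *[Mm]) echo $(($num * 1024 * 1024)) ;;
        *[Kk]) echo $(($num * 1024)) ;;
        *)     echo "$num" ;;
    esac
}

# Documented default: min(50% of RAM, 32G). Cgroup limits are ignored here.
default_budget() {   # usage: default_budget RAM_BYTES
    half=$(($1 / 2))
    cap=$(parse_size 32G)
    if [ $(($half < $cap)) -eq 1 ]; then echo "$half"; else echo "$cap"; fi
}
```

Under these assumptions a host with 16 GiB of RAM gets an 8 GiB budget, while a 128 GiB host is capped at 32 GiB.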

--tmp-dir PATH — scratch directory

Scratch directory for intermediate files when memory is partitioned. Defaults to $TMPDIR. Point this at a fast local disk (NVMe or ramdisk) to minimize wall-clock time when --max-memory forces partitioned construction.

--meth — build a methylation (c2t) index

Writes a bwameth-style doubled reference — <in.fasta>.bwameth.c2t — and builds the FM-index over that file rather than the original FASTA. The c2t file and its index files are placed alongside the original FASTA.

Pass the original FASTA prefix (not the .bwameth.c2t path) to all three commands: index, shm, and mem. The c2t suffix is appended automatically when --meth is present.

Notes / Gotchas

Tip — Index once, align many times

Index construction for hg38 takes several minutes and ~28 GB of disk. Build the index once and store it on shared storage; all alignment jobs on the same reference share the same index files.

Warning — --meth index is not interchangeable with the standard index

A --meth index is built over the c2t reference and cannot be used for normal (non-bisulfite) alignment. Keep separate index directories if you align both standard and bisulfite samples to the same reference.


See also: User Guide — Indexing the reference · CLI Reference — mem · CLI Reference — shm · Getting Started — Quick start: methylation alignment · Methylation Reference — Overview

mem

bwa-mem3 mem aligns short DNA reads against an indexed reference genome using the BWA-MEM algorithm. It accepts one or two FASTQ files (single-end or paired-end) and writes alignments to stdout in SAM or BAM format. It is the primary alignment subcommand; nearly all bwa-mem3 usage flows through it.

Synopsis

Usage: bwa-mem3 mem [options] <idxbase> <in1.fq> [in2.fq]
Options:
  Algorithm options:
    -o STR        Output SAM file name
    --bam[=N]     Emit BAM instead of SAM text. N=0 (default) = uncompressed;
                  1..9 = BGZF deflate levels. Writes to stdout; redirect with `>`.
    -t INT        number of threads [1]
    -k INT        minimum seed length [19]
    -w INT        band width for banded alignment [100]
    -d INT        off-diagonal X-dropoff [100]
    -r FLOAT      look for internal seeds inside a seed longer than {-k} * FLOAT [1.5]
    -y INT        seed occurrence for the 3rd round seeding [20]
    -c INT        skip seeds with more than INT occurrences [500]
    -D FLOAT      drop chains shorter than FLOAT fraction of the longest overlapping chain [0.50]
    -W INT        discard a chain if seeded bases shorter than INT [0]
    -m INT        perform at most INT rounds of mate rescues for each read [50]
    -S            skip mate rescue
    -P            skip pairing; mate rescue performed unless -S also in use
Scoring options:
   -A INT        score for a sequence match, which scales options -TdBOELU unless overridden [1]
   -B INT        penalty for a mismatch [4]
   -O INT[,INT]  gap open penalties for deletions and insertions [6,6]
   -E INT[,INT]  gap extension penalty; a gap of size k cost '{-O} + {-E}*k' [1,1]
   -L INT[,INT]  penalty for 5'- and 3'-end clipping [5,5]
   -U INT        penalty for an unpaired read pair [17]
Input/output options:
   -p            smart pairing (ignoring in2.fq)
   -R STR        read group header line such as '@RG\tID:foo\tSM:bar' [null]
   -H STR/FILE   insert STR to header if it starts with @; or insert lines in FILE [null]
   -j            treat ALT contigs as part of the primary assembly (i.e. ignore <idxbase>.alt file)
   -5            for split alignment, take the alignment with the smallest coordinate as primary
   -q            don't modify mapQ of supplementary alignments
   -K INT        process INT input bases in each batch regardless of nThreads (for reproducibility) []
   -v INT        verbose level: 1=error, 2=warning, 3=message, 4+=debugging [3]
   -T INT        minimum score to output [30]
   -h INT[,INT]  if there are <INT hits with score >80.00% of the max score, output all in XA [5,200]
   -z FLOAT      the fraction of the max score to use with -h [0.80]
   -u            output XB instead of XA; XB is XA with the alignment score and mapping quality added
   -a            output all alignments for SE or unpaired PE
   -C            append FASTA/FASTQ comment to SAM output
   -V            output the reference FASTA header in the XR tag
   -Y            use soft clipping for supplementary alignments
   -M            mark shorter split hits as secondary
   -I FLOAT[,FLOAT[,INT[,INT]]]
                 specify the mean, standard deviation (10% of the mean if absent), max
                 (4 sigma from the mean if absent) and min of the insert size distribution.
                 FR orientation only. [inferred]
Bisulfite (--meth) options:
   --meth        enable inline bwameth-style C→T/G→A read conversion + meth-aware BAM
                 emission. Implies --bam. Requires the reference to have been built
                 with `bwa-mem3 index --meth` (emits ref.fa.bwameth.c2t).
   --set-as-failed f|r
                 flag alignments to the matching strand ('f' or 'r') as QC-fail (0x200)
   --do-not-penalize-chimeras
                 disable the longest-match <44% chimera heuristic (no 0x200 / MAPQ cap)
Supplementary MAPQ rescoring (fg-labs extension):
   --supp-rep-hard-cap INT
                 force MAPQ=0 for supplementary alignments whose chain contains any seed
                 with >=INT genome occurrences (i.e. the supp region is repetitive on its
                 own). 0 disables (default). Typical values 5-20; lower = more aggressive.
                 Primary MAPQ is unaffected.
Help:
   --help        print this help message and exit
Note: Please read the man page for detailed description of the command line and options.

Common usage

Paired-end alignment, 16 threads, SAM to stdout:

bwa-mem3 mem -t 16 ref.fa R1.fq.gz R2.fq.gz > out.sam

Paired-end alignment, emit uncompressed BAM, pipe directly to samtools sort:

bwa-mem3 mem --bam -t 16 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -@ 8 -o out.bam -
samtools index out.bam

Paired-end methylation alignment with a read group header:

bwa-mem3 mem --meth -t 16 \
  -R '@RG\tID:lib1\tSM:sample1\tPL:ILLUMINA' \
  ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -o out.bam -

Flag reference

Input / output

-o STR — output file

Write output to STR instead of stdout. Honored for both SAM and --bam output; the path is opened lazily so BAM mode can hand it to htslib instead of truncating it as a SAM-text file. Stdout redirection (>) remains an alternative.

--bam[=N] — emit BAM

Emit BAM instead of SAM. N controls BGZF compression: 0 (default when --bam is used without =) writes uncompressed BAM, which costs almost no CPU and is the recommended mode for piping to samtools sort. Values 1–9 select increasing BGZF deflate levels; use --bam=6 or --bam=9 only when writing directly to final storage without a downstream sort step.

Tip — Prefer --bam for production pipelines

Uncompressed BAM (--bam or --bam=0) eliminates the text-formatting cost on the aligner side and the text-parse cost on the samtools sort side. For any pipeline that immediately sorts or processes the output, this is faster than SAM at no quality cost.

-R STR — read group header

Injects a @RG header line and tags every alignment with RG:Z:<ID>. The value is a tab-separated @RG line with literal \t escapes, for example:

-R '@RG\tID:run1\tSM:HG001\tPL:ILLUMINA\tLB:lib1'

bwa-mem3 escapes any literal tab characters inside -R values before writing them to the @PG CL: field, preventing header corruption (fix for issue #45).

-H STR/FILE — extra header lines

If STR begins with @, it is injected verbatim as a header line. Otherwise STR is treated as a path and every line in the file is injected. Useful for adding @CO comments or custom @RG / @PG entries.

-p — smart pairing

Reads interleaved paired-end data from a single FASTQ file (in1.fq) rather than two separate files. The second positional argument (in2.fq) is ignored.

-5 — leftmost-coordinate primary

For split alignments, designates the alignment with the smallest genomic coordinate as primary, rather than the longest alignment. Useful for some downstream tools that expect the leftmost alignment to be primary.

-q — preserve supplementary MAPQ

By default, bwa-mem3 may downgrade the MAPQ of supplementary alignments. -q suppresses that adjustment.

-K INT — fixed batch size

Forces each thread batch to process exactly INT input bases regardless of the number of threads. Useful when you need bit-for-bit reproducible output across runs with different -t values: fix -K to the same value and the output is deterministic.

-v INT — verbosity

Controls stderr diagnostic output: 1 = errors only, 2 = warnings, 3 = informational messages (default), 4+ = debugging.

-a — all alignments

Output all alignments for single-end or unpaired paired-end reads, including secondary alignments. Equivalent to enabling secondary-alignment reporting.

-C — append FASTA/FASTQ comment

Appends the comment field from the FASTA/FASTQ header to the SAM output as an additional column. Useful when the comment carries barcodes or UMIs.

-V — reference header in XR tag

Emits the reference FASTA header line for each alignment position as an XR SAM tag.

-Y — soft-clip supplementary alignments

Uses soft clipping instead of hard clipping for supplementary alignments. Some downstream tools require this.

-M — mark shorter split hits as secondary

Marks the shorter alignment in a split read as secondary (sets the 0x100 flag) rather than supplementary. Required for compatibility with tools that do not handle supplementary alignments (e.g. older versions of Picard's duplicate marking).

-j — treat ALT contigs as primary

Treats ALT contigs as part of the primary assembly by ignoring the <idxbase>.alt file. Use when your workflow does not include ALT-aware postprocessing.

Scoring

All scoring flags accept integer values. Changing -A (match score) scales the penalty flags that default to multiples of -A; explicit overrides of individual flags are unaffected.

Flag          Default  Meaning
-A INT        1        Score for a sequence match. Scales -T, -d, -B, -O, -E, -L, -U unless overridden.
-B INT        4        Mismatch penalty.
-O INT[,INT]  6,6      Gap open penalty for deletions and insertions respectively.
-E INT[,INT]  1,1      Gap extension penalty per base. A gap of length k costs -O + -E * k.
-L INT[,INT]  5,5      Clipping penalty for 5’ and 3’ ends.
-U INT        17       Penalty for an unpaired read pair (affects mate-rescue scoring).
-T INT        30       Minimum alignment score to output. Alignments below this threshold are not reported.
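The gap formula in the table is easy to check by hand. The one-line function below simply evaluates -O + -E * k; the name gap_cost is hypothetical.

```shell
# Affine gap cost as stated in the table: open + extend * length.
gap_cost() {   # usage: gap_cost OPEN EXTEND LENGTH
    echo $(($1 + $2 * $3))
}
```

With the defaults, `gap_cost 6 1 3` prints 9. As a rough worked example under plain affine scoring, a read with 100 matches, 2 mismatches, and one 3-base deletion would score 100 - 2 * 4 - 9 = 83, comfortably above the default -T 30.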

Note — --meth overrides scoring defaults

When --meth is active, bwa-mem3 applies bwameth.py-compatible defaults: -B 2 -L 10 -U 100 -T 40 -CM. Any of these can still be overridden by passing the flag explicitly after --meth.

Paired-end

-I FLOAT[,FLOAT[,INT[,INT]]] — insert size distribution

Specifies the mean, standard deviation (default: 10% of mean), maximum (default: 4 sigma above mean), and minimum of the insert size distribution for FR-orientation paired-end reads. By default bwa-mem3 infers these parameters from the first batch of reads. Provide them explicitly for speed or when the reference is short and inference may be inaccurate.

-m INT — mate rescue rounds

Maximum number of mate-rescue attempts per read. Reduce to speed up alignment on data where the default (50) wastes time on unrescuable pairs.

-S — skip mate rescue

Disables mate rescue entirely. Faster but may reduce sensitivity for discordant pairs.

-P — skip pairing

Skips the pairing step; mate rescue still runs unless -S is also given.

Filtering

-c INT — skip repetitive seeds

Seeds with more than INT occurrences in the reference are skipped. Lowering this (e.g. to 50) speeds up alignment of highly repetitive reads but may reduce sensitivity. Raising it increases sensitivity in repeat-heavy regions at a cost in runtime.

-D FLOAT — chain length fraction

Drops chains shorter than FLOAT times the longest overlapping chain. The default (0.50) discards chains that are less than half the length of the longest chain they overlap.

-W INT — minimum seeded bases

Discards chains with fewer than INT seeded bases. Raising this filters out very short, low-confidence chains.

-h INT[,INT] — secondary alignment reporting

If there are fewer than INT hits with score exceeding FLOAT (see -z) times the maximum score, all of them are output in the XA auxiliary tag. The second integer is a hard cap on the number of XA entries. Defaults: 5, 200.

-z FLOAT — secondary score fraction

Fraction of the maximum alignment score used as the threshold for secondary hit reporting with -h. Default: 0.80.

-u — emit XB instead of XA

Outputs XB in place of XA. XB is an extension of XA that also carries the alignment score and mapping quality for each secondary hit.

Methylation (--meth)

--meth — enable bisulfite alignment mode

Activates inline C→T (R1) and G→A (R2) read conversion, bwameth-compatible scoring defaults, inline BAM post-processing, and forces --bam output. The reference must have been indexed with bwa-mem3 index --meth.

Pass the original FASTA prefix as <idxbase> — the .bwameth.c2t suffix is appended automatically. If <idxbase> already ends in .bwameth.c2t (interop with an external c2t converter), the auto-append is skipped.
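The suffix rule can be stated as a two-branch case statement. This is a sketch of the documented behavior, not the actual code path, and the function name meth_idxbase is hypothetical.

```shell
# Resolve the index base that --meth would use for a given <idxbase>.
meth_idxbase() {   # usage: meth_idxbase ref.fa
    case $1 in
        *.bwameth.c2t) echo "$1" ;;              # already a c2t path: unchanged
        *)             echo "$1.bwameth.c2t" ;;  # otherwise append the suffix
    esac
}
```

Both `meth_idxbase ref.fa` and `meth_idxbase ref.fa.bwameth.c2t` print ref.fa.bwameth.c2t, which is why either spelling works on the command line.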

See Methylation Reference for the full treatment.

--set-as-failed {f|r} — strand QC-fail flag

Forces the QC-fail bit (0x200) on all alignments to the forward (f) or reverse (r) bisulfite strand. Used when one strand is known to be unreliable for a given library preparation.

--do-not-penalize-chimeras — disable chimera heuristic

Disables the longest-match < 44% chimera heuristic that would otherwise set 0x200, clear 0x2, and cap MAPQ at 1 for likely chimeric alignments. Use when the default chimera filter is too aggressive for your library type.

Threading

-t INT — number of threads

Number of worker threads. Defaults to 1. Set to the number of physical cores available to this job. Scaling is workload- and hardware-dependent: on typical machines the curve flattens around 16–32 threads (FM-index bandwidth and I/O contention dominate); on high-memory / fast-I/O servers the aligner can keep scaling toward ~64 threads on hg38 before saturating.

See User Guide — Threading and resource use for guidance on thread counts at various machine sizes.

Supplementary MAPQ rescoring

--supp-rep-hard-cap INT — cap MAPQ for repetitive supplementary alignments

Forces MAPQ=0 for supplementary alignments whose chain contains any seed with at least INT occurrences in the genome. This targets supplementary alignments anchored in repetitive regions that upstream MAPQ scoring may overestimate. 0 disables the cap (default). Typical values are 5–20; lower values are more aggressive. Primary alignment MAPQ is unaffected.
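The rule reduces to a single comparison once the highest seed occurrence count in the chain is known. The function below is a hypothetical model of that decision, not the aligner's code; it takes the maximum seed occurrence as an input because "any seed with at least INT occurrences" is equivalent to "the maximum is at least INT".

```shell
# Hypothetical model: MAPQ reported for a supplementary alignment.
supp_mapq() {   # usage: supp_mapq MAPQ MAX_SEED_OCC CAP
    if [ "$3" -gt 0 ] && [ "$2" -ge "$3" ]; then
        echo 0          # a seed reaches the cap: force MAPQ to 0
    else
        echo "$1"       # cap disabled (0) or not reached: MAPQ unchanged
    fi
}
```

With a cap of 10, `supp_mapq 60 12 10` prints 0 and `supp_mapq 60 3 10` prints 60; with the default cap of 0 the input MAPQ always passes through.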

Debug

-k INT — minimum seed length

Minimum exact-match seed length. Shorter seeds increase sensitivity but raise runtime. The default (19) is calibrated for 100–150 bp Illumina reads.

-w INT — band width

Band width for the banded Smith-Waterman extension. Wider bands can recover alignments with long indels at greater CPU cost.

-d INT — X-dropoff

Off-diagonal X-dropoff for the Z-drop heuristic. Controls how far an alignment extension continues after a score drop.

-r FLOAT — re-seeding factor

Seeds longer than -k * FLOAT are re-seeded internally to find sub-seeds. Lowering this produces more seeds and higher sensitivity at greater cost.

-y INT — third-round seed occurrence threshold

Seed occurrence threshold for the third round of seeding. Rarely needs adjustment outside highly repetitive genomes.

Notes / Gotchas

Warning — --meth requires a --meth index

Running bwa-mem3 mem --meth against a standard (non-c2t) index produces incorrect alignments without an error. Confirm that the index was built with bwa-mem3 index --meth before aligning bisulfite data.

Note — SIMD variant printed to stderr at startup

When mem starts it prints a banner (Executing in AVX512 mode!! etc.) to stderr. This is informational and does not affect stdout output.


See also: User Guide — Aligning short reads · User Guide — Output: SAM/BAM, headers, tags · CLI Reference — index · Methylation Reference — Overview · Best Practices — Output format

shm

bwa-mem3 shm stages an FM-index into POSIX shared memory so that subsequent bwa-mem3 mem invocations on the same machine attach to the in-memory segment instead of re-reading the index files from disk. For workloads that align many small samples back-to-back against the same reference — such as clinical panels or amplicon sequencing — this removes the dominant I/O bottleneck. shm also lists and destroys staged segments.

Synopsis


Usage: bwa-mem3 shm [-d|-l|--help] [--meth] [idxbase]

Options:
  -d        destroy all indices in shared memory (matches bwa v1 behavior)
  -l        list names of indices in shared memory
  --meth    stage a `bwa-mem3 index --meth` index — auto-appends
            `.bwameth.c2t` to <idxbase>, mirroring `mem --meth`
  -h --help print this help and exit

Stage with no flags: `bwa-mem3 shm <idxbase>` loads the index into
POSIX shared memory; subsequent `bwa-mem3 mem <idxbase> ...` runs
auto-attach instead of re-reading from disk. For meth indices, pass
the same plain `<idxbase>` to all three commands plus `--meth` on
`index`, `shm`, and `mem` (the c2t suffix is auto-appended).

Footgun: if you re-build the index, run `bwa-mem3 shm -d` first.
There is no staleness check -- a stale segment will silently mis-align.

macOS: POSIX shm has implementation-defined per-segment caps; large
       indices may simply fail to stage. Prefer Linux for production.
Linux: /dev/shm defaults to ~50% of RAM on bare metal; in containers
       it is often much smaller and may need raising via --shm-size
       (Docker) or an emptyDir tmpfs (Kubernetes).

Common usage

Stage a standard index, align two samples, then release the segment:

bwa-mem3 shm ref.fa
bwa-mem3 mem -t 16 ref.fa sample1_R1.fq sample1_R2.fq > sample1.sam
bwa-mem3 mem -t 16 ref.fa sample2_R1.fq sample2_R2.fq > sample2.sam
bwa-mem3 shm -d

Stage a methylation index and align:

bwa-mem3 shm --meth ref.fa
bwa-mem3 mem --meth -t 16 ref.fa R1.fq R2.fq | samtools sort -o out.bam -
bwa-mem3 shm -d

List all currently staged segments:

bwa-mem3 shm -l

Flag reference

(no flags) <idxbase> — stage an index

Loads all index files for <idxbase> into a POSIX shared-memory segment. After staging, any bwa-mem3 mem <idxbase> ... on the same machine auto-attaches and reads from memory rather than disk.

-d — destroy all segments

Removes every bwa-mem3 shared-memory segment on the machine. This is the correct clean-up command after a batch job and the required step before re-building the index (see the footgun warning below).

-l — list staged indices

Prints the names of all currently staged segments. Useful to confirm that staging succeeded before launching alignment jobs.

--meth — stage a methylation index

Auto-appends .bwameth.c2t to <idxbase> before staging, mirroring the behavior of bwa-mem3 index --meth and bwa-mem3 mem --meth. Pass the same plain <idxbase> to all three commands; the c2t suffix is handled transparently.

Notes / Gotchas

Warning — No staleness check — always destroy before re-indexing

There is no staleness check. If you re-run bwa-mem3 index ref.fa after staging, the on-disk index files will not match the in-memory segment, but bwa-mem3 mem will still attach to the stale segment and silently produce incorrect alignments. Always run bwa-mem3 shm -d before re-indexing.

Note — Platform limits

macOS: POSIX shared memory has implementation-defined per-segment size caps. Staging a full hg38 index (~28 GB) may fail silently or with a cryptic error. Prefer Linux for production use with large references.

Linux containers: /dev/shm typically defaults to ~50% of physical RAM on bare metal but is often much smaller inside Docker containers (64 MB by default) or Kubernetes pods. Raise the limit with --shm-size (Docker) or an emptyDir tmpfs volume with an explicit size (Kubernetes) before attempting to stage a large index.
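A quick pre-flight check can catch an undersized /dev/shm before staging is attempted. The sketch below parses the POSIX-format output of df -Pk, where column 4 is the available space in 1024-byte blocks; the function names are hypothetical, and the index size is whatever estimate the caller supplies.

```shell
# Convert `df -Pk /dev/shm` output (on stdin) to available bytes.
shm_avail_bytes() {
    kb=$(awk 'NR == 2 { print $4 }')   # second line, "Available" column
    echo $(($kb * 1024))
}

# Succeeds when an index of INDEX_BYTES fits in AVAILABLE_BYTES.
fits_in_shm() {   # usage: fits_in_shm INDEX_BYTES AVAILABLE_BYTES
    [ $(($1 <= $2)) -eq 1 ]
}
```

For example, `avail=$(df -Pk /dev/shm | shm_avail_bytes)` followed by `fits_in_shm 30064771072 "$avail" || echo "raise the shm limit first"` guards a roughly 28 GB hg38 index.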


See also: Getting Started — Quick start: shared-memory index · CLI Reference — index · CLI Reference — mem · Best Practices — Multi-sample workflows · Best Practices — Anti-patterns

version

bwa-mem3 version prints the release version of the binary and, when mimalloc is compiled in, the mimalloc version. It is a quick way to confirm which build is on PATH and whether the allocator is active.

Synopsis

mimalloc 3.3.0
v<MAJOR.MINOR>-<N>-g<COMMIT>

Common usage

Confirm the installed version and allocator:

./bwa-mem3 version

A typical run prints two lines (the mimalloc line goes to stderr and the version string to stdout, so the order in a merged stream is not guaranteed):

mimalloc 3.3.0
v0.1.0-pre-7-g613a4dc

The first line is the mimalloc version (present only when USE_MIMALLOC=1, the default). The second line is the bwa-mem3 version string, derived from git describe at build time and stored as PACKAGE_VERSION in the binary. When building from a tarball without git history, the fallback value is set via FG_LABS_VERSION_FALLBACK at compile time.

Notes / Gotchas

Note — mimalloc line goes to stderr, version to stdout

The mimalloc line is printed to stderr and the version string to stdout. Scripts that capture the version should redirect stderr appropriately or use bwa-mem3 version 2>/dev/null.
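For scripts that record provenance, the commit hash can be pulled out of the version string. The sketch below assumes the git describe shape shown in the synopsis (a trailing -g followed by a hex commit); the function name version_commit is hypothetical, and it prints nothing for strings without that suffix.

```shell
# Extract the abbreviated commit from a git-describe style version string.
version_commit() {   # usage: version_commit v0.1.0-pre-7-g613a4dc
    printf '%s\n' "$1" | sed -n 's/^.*-g\([0-9a-f][0-9a-f]*\)$/\1/p'
}
```

Combined with the stderr redirect above, `version_commit "$(bwa-mem3 version 2>/dev/null)"` yields just the commit, for example 613a4dc for the sample output on this page.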

Tip — No mimalloc line means USE_MIMALLOC=0

If only the version string appears and no mimalloc line, the binary was built without the bundled allocator (make USE_MIMALLOC=0). See User Guide — Memory allocator (mimalloc) for when this is appropriate.


See also: User Guide — Memory allocator (mimalloc) · Developer Guide — Release process · Getting Started — Installation · What’s Different — Build & infrastructure

Methylation Reference Overview

bwa-mem3 --meth is a single-binary, single-command drop-in replacement for the bwameth.py bisulfite-sequencing alignment pipeline. No Python installation, no piped preprocessing step, and no separate post-processing script — one bwa-mem3 index --meth builds the reference, and one bwa-mem3 mem --meth aligns and post-processes reads from raw FASTQ to sorted-ready BAM.

The output BAM is structurally equivalent to what the bwameth.py pipeline produces: consolidated @SQ headers (one entry per real chromosome rather than one per doubled-reference contig), YS:Z: / YC:Z: tags carrying the original pre-conversion read sequence and the conversion direction, YD:Z: indicating the strand hypothesis, chimera QC flags, and a @PG ID:bwa-mem3-meth provenance entry. Downstream tools that consume bwameth.py output — including MethylDackel and Bismark — work without change.

Pipeline at a glance

The diagram below shows the internal flow when bwa-mem3 mem --meth runs. Every step executes inside the single process; no external programs or temporary files are required.

flowchart LR
    A[Raw FASTQ\nR1 / R2] -->|inline C→T / G→A| B[c2t-converted reads\n+ YS:Z + YC:Z in comment]
    B -->|bwa mem core| C[mem_aln_t\nalignments vs doubled ref]
    C -->|chrom map\nf/r → real chr| D[header rewrite\n@SQ consolidated]
    D -->|YD:Z tagging\nchimera QC\nQC-fail propagation| E[BAM output\nwb0 uncompressed]

Steps:

  1. FASTQ ingest with inline c2t conversion. R1 bases have every C replaced with T; R2 bases have every G replaced with A. The original bases are preserved in a YS:Z: comment field and the conversion direction is stored in YC:Z:. This conversion happens in-memory — the FASTQ is never written to disk in converted form.

  2. Alignment against the doubled reference. The converted reads are aligned against the ref.fa.bwameth.c2t reference, which contains both a forward C→T projection (f-prefixed contigs) and a reverse G→A projection (r-prefixed contigs) of each chromosome.

  3. Header rewriting and chrom consolidation. The f/r-prefixed contig names used internally are collapsed: every pair fchr1 / rchr1 becomes a single @SQ SN:chr1 entry in the output BAM header. RNAME and RNEXT fields in each record are rewritten to the consolidated name.

  4. Tag emission and QC. Each aligned record receives a YD:Z:{f,r} tag indicating which strand it mapped to. The chimera QC heuristic flags records whose longest M/=/X CIGAR run covers less than 44% of the read length. QC-fail flags then propagate across all records that share a query name (the primary, its supplementary alignments, and the mate). The original pre-c2t sequence from YS:Z: is copied back into the BAM SEQ field so that methylation callers (e.g. MethylDackel) see real cytosines rather than the converted sequence.

  5. BAM output. Records are written as uncompressed BAM (wb0 mode via htslib). The @PG ID:bwa-mem3-meth line records the exact command line. The caller pipes directly to samtools sort.

Quick-start commands

# Index the reference once (builds ref.fa.bwameth.c2t + FMI)
bwa-mem3 index --meth ref.fa

# Align paired-end FASTQs
bwa-mem3 mem --meth -t 16 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -o out.bam
samtools index out.bam

Note — bwameth.py compatibility

The default scoring parameters applied by --meth (-B 2 -L 10 -U 100 -T 40 -CM) match those used by bwameth.py so outputs are comparable. Any parameter can be overridden on the command line.


See also: bwameth.py drop-in mapping · Conversion details · SAM tags: YS, YC, YD · Chimera QC and header rewriting · Quick start: methylation alignment

bwameth.py Drop-In Mapping

bwa-mem3 --meth is designed to produce output that is equivalent to the bwameth.py pipeline for the standard paired-end case. This page explains what changes between the two approaches and what stays the same.

Command comparison

bwameth.py pipeline (multi-step)

# Step 1: build a doubled reference with bwameth.py
bwameth.py index ref.fa                # writes ref.fa.bwameth.c2t + bwa-mem2 FMI

# Step 2: align (bwameth.py converts reads, calls bwa-mem2, post-processes)
bwameth.py map --bwa-mem2 -t 16 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -o out.bam
samtools index out.bam

bwa-mem3 --meth (single binary)

# Step 1: build the doubled reference with bwa-mem3
bwa-mem3 index --meth ref.fa           # same ref.fa.bwameth.c2t layout as bwameth.py

# Step 2: align (inline c2t conversion + post-processing, no Python)
bwa-mem3 mem --meth -t 16 ref.fa R1.fq.gz R2.fq.gz \
  | samtools sort -o out.bam
samtools index out.bam

The index files produced by bwa-mem3 index --meth and bwameth.py index are identical in layout: the same ref.fa.bwameth.c2t doubled-reference FASTA followed by the bwa-mem2 FM-index files (.bwt.2bit.64, .0123, .pac, .amb, .ann).

What is gained

No Python or bwameth.py dependency. The entire pipeline — read conversion, alignment, and BAM post-processing — runs inside a single bwa-mem3 process. This simplifies deployment: one binary, no virtual environment, no version pinning of bwameth.py.

No intermediate files. bwameth.py writes a converted FASTQ (or pipes it) before handing off to the aligner. bwa-mem3 --meth performs the C→T / G→A conversion in-memory on each read batch before passing it to the alignment kernel. No temporary FASTQ is written and no extra pipe stage is needed.

Inline BAM post-processing. Header rewriting, YD:Z: tagging, chimera QC, and QC-fail propagation all happen inside the same process and the same pass over the alignments. There is no separate post-processing step. Output is written as uncompressed BAM (wb0) — a near-zero-cost format that downstream samtools sort reads natively.

Same flag defaults. --meth applies -B 2 -L 10 -U 100 -T 40 -CM automatically, matching bwameth.py’s default scoring. All parameters can be overridden.

What stays the same

The output BAM is field-compatible with bwameth.py output for the standard methylation tag set, flags, and SEQ representation (the @PG provenance line intentionally differs — see below):

| Field | bwameth.py | bwa-mem3 --meth |
| --- | --- | --- |
| @SQ headers | One per real chromosome | One per real chromosome |
| YS:Z: | Pre-c2t original sequence | Same |
| YC:Z: | Conversion direction (CT or GA) | Same |
| YD:Z: | Strand (f or r) | Same |
| @PG | ID:bwameth | ID:bwa-mem3-meth |
| Chimera QC threshold | Longest M < 44% of read | Same (44%) |
| Chimera QC flags | 0x200, clear 0x2, MAPQ ≤ 1 | Same |
| SEQ field | Pre-c2t bases (RC-flipped when is_rev) | Same |

The @PG ID: is intentionally different so provenance is unambiguous. All downstream tools that rely on YS:Z:, YC:Z:, YD:Z:, and the QC flags behave identically.

Info — End-to-end regression coverage

PR #13 includes a three-layer regression test that verifies 100% chrom+pos match, 100% CIGAR match, and byte-identical SEQ across 92,684 paired-end records compared to a bwameth.py reference run.

When to prefer bwameth.py

If your workflow requires bwameth.py-specific features (e.g. bwameth.py markduplicates or non-standard bwameth.py post-processors), continue using bwameth.py. bwa-mem3 --meth targets the indexing + alignment + standard post-processing path only.


See also: Overview · Conversion details · SAM tags: YS, YC, YD · Chimera QC and header rewriting · Related Projects: bwameth.py

Conversion Details (C→T, G→A)

Bisulfite sequencing relies on chemical conversion of unmethylated cytosines to uracil (read as thymine after PCR). bwa-mem3 --meth models this with an in-memory read transformation applied to every read before the alignment kernel sees the bases.

What gets converted

Paired-end bisulfite reads follow a strand convention:

  • R1 (read 1): every C in the base sequence is replaced with T. This models the OT (original top) and CTOB (complementary to original bottom) strands as they appear after bisulfite treatment and PCR.

  • R2 (read 2): every G in the base sequence is replaced with A. This models the OB (original bottom) and CTOT strands.

Single-end mode uses the R1 (C→T) rule for all reads.

The doubled reference built by bwa-mem3 index --meth (or bwameth.py) contains two projections of each chromosome:

  • f-prefixed contigs (e.g. fchr1): the chromosome with every C replaced by T.
  • r-prefixed contigs (e.g. rchr1): the chromosome with every G replaced by A, which is the bottom-strand projection written in the same forward coordinates as the f contig.

Converted R1 reads are therefore alignable to f-prefixed contigs and converted R2 reads to r-prefixed contigs. The contig prefix records which strand hypothesis was used and feeds the YD:Z: tag directly.
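
The two projections can be sketched in a few lines of Python. This is an illustration only: double_reference is a hypothetical helper, not a bwa-mem3 function, and it assumes uppercase bases and that both projections keep forward coordinates with only the base substitution applied (which is how bwameth.py lays out its doubled reference). Contig order and line wrapping in the real file are not modeled.

```python
def double_reference(contigs):
    """Return (name, sequence) pairs for the f- and r-prefixed projections.

    `contigs` is an iterable of (name, sequence) pairs with uppercase bases.
    Assumption: each projection keeps forward coordinates and differs from
    the source sequence only by the base substitution.
    """
    doubled = []
    for name, seq in contigs:
        doubled.append(("f" + name, seq.replace("C", "T")))  # C->T projection
        doubled.append(("r" + name, seq.replace("G", "A")))  # G->A projection
    return doubled
```

For a toy contig chr1 = ACGT this yields fchr1 = ATGT and rchr1 = ACAT.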

Where conversion happens

Read conversion runs inside src/fastmap.cpp in the meth_mode ingest block, immediately after sequence parsing and before any alignment work. The transformation is applied to the in-memory bseq1_t.seq buffer; the original FASTQ file is never rewritten.

Before the bases are modified, the original sequence is recorded in the read’s comment buffer as:

YS:Z:<l_seq bases>\tYC:Z:<direction>

where <direction> is CT for R1 (C→T) and GA for R2 (G→A). These fields pass through the alignment kernel untouched and are emitted as SAM aux tags in the output BAM. See SAM tags: YS, YC, YD for the per-tag reference.
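
A minimal Python sketch of the per-read transformation and the comment layout described above. convert_read is a hypothetical name used for illustration; the real logic lives in the C++ ingest path, and this sketch assumes uppercase bases only.

```python
def convert_read(seq, is_read2):
    """Return (converted_seq, comment) for one read.

    R1 and single-end reads get C->T; R2 reads get G->A. The comment keeps
    the original bases (YS:Z:) and the conversion direction (YC:Z:).
    """
    if is_read2:
        converted, direction = seq.replace("G", "A"), "GA"
    else:
        converted, direction = seq.replace("C", "T"), "CT"
    comment = "YS:Z:" + seq + "\tYC:Z:" + direction
    return converted, comment
```

The converted string is what the aligner sees; the comment travels with the read so the original bases can be restored later.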

Sequence restoration in the BAM SEQ field

Methylation callers such as MethylDackel identify methylated cytosines by examining the BAM SEQ field at each CpG site. They need to see real C/T bases — not the uniformly-converted T/A bases that were used for alignment.

meth_mem_aln_to_bam (in src/meth_bam.cpp) restores the original sequence from YS:Z: before writing the BAM record:

  1. The YS:Z: payload is located at the start of the bseq1_t.comment field (offset +5 past the YS:Z: header bytes).
  2. For forward-aligned records (!p.is_rev), the pre-c2t bases are copied directly into the BAM SEQ buffer.
  3. For reverse-aligned records (p.is_rev), the bases are reverse-complemented using the standard TGCAN table before being placed in SEQ.
  4. If YS:Z: is absent (e.g. when running with an external c2t converter that does not emit it), the code falls back to the converted sequence in s->seq, with the same RC flip logic.
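
The four steps above can be sketched as follows. restore_seq is a hypothetical Python stand-in for the relevant part of meth_mem_aln_to_bam; it handles a full-length record only (no soft-clip or hard-clip trimming) and assumes uppercase ACGTN bases.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def restore_seq(comment, converted_seq, is_rev):
    """Pick the bases to place in SEQ for a full-length (unclipped) record."""
    if comment.startswith("YS:Z:"):
        # The payload starts 5 characters in and ends at the tab before YC:Z:.
        bases = comment[5:].split("\t", 1)[0]
    else:
        bases = converted_seq  # fallback when no YS:Z: comment is present
    if is_rev:
        # Reverse-complement for reverse-strand alignments.
        bases = bases.translate(_COMPLEMENT)[::-1]
    return bases
```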

Warning — Soft-clip and supplementary trimming

When computing the SEQ range for supplementary alignments, the qb/qe boundaries account for soft-clip or hard-clip operations at the CIGAR ends. The YS:Z: restoration applies over the same trimmed range so SEQ length always matches the emitted CIGAR.

QUAL field handling

The QUAL field is taken directly from the original FASTQ (bseq1_t.qual) over the same [qb, qe) range and is never modified by the c2t process. Quality scores correspond to the original base calls, not the converted ones.

Relationship to the reference index

bwa-mem3 index --meth ref.fa writes ref.fa.bwameth.c2t, which applies the same C→T / G→A projection to the reference sequence. The resulting file is compatible with what bwameth.py index produces, so the same doubled-reference FASTA can be used interchangeably with either tool across tested versions.


See also: Overview · SAM tags: YS, YC, YD · Interop with external bwameth.py c2t · User Guide → Indexing the reference · Best Practices → Methylation defaults

SAM Tags: YS, YC, YD

bwa-mem3 --meth emits three methylation-specific auxiliary tags that carry the information downstream methylation callers need. Two of these (YS:Z: and YC:Z:) are set during FASTQ ingest and pass through the alignment kernel unchanged. The third (YD:Z:) is set during BAM post-processing based on the contig name of the alignment.

Tag reference

YS:Z: — original (pre-conversion) sequence

| Property | Value |
| --- | --- |
| Type | Z (NUL-terminated string) |
| Length | Equal to l_seq (full read length) |
| Set by | FASTQ ingest (src/fastmap.cpp meth_mode block) |
| Emitted on | All records (mapped and unmapped) |

YS:Z: holds the original base sequence of the read before the C→T or G→A conversion. The value is the ASCII string of bases as read from the FASTQ, in read order (not reverse-complemented).

This tag serves two purposes:

  1. SEQ restoration. meth_mem_aln_to_bam copies the YS:Z: payload back into the BAM SEQ field (with reverse-complement when is_rev is set) so that methylation callers see real cytosines. Without this restoration the SEQ field would show only Ts where Cs existed in the original read.

  2. Downstream inspection. Tools that need to examine the unconverted sequence independently of the BAM SEQ field can read YS:Z: directly.

Note — Format inside the comment buffer

Internally, the ingest code stores the value as YS:Z:<bases>\tYC:Z:<dir> starting at offset 0 of bseq1_t.comment. meth_mem_aln_to_bam locates the payload at comment + 5 (past the YS:Z: prefix). The two tags are always co-emitted in this order.

YC:Z: — conversion direction

| Property | Value |
| --- | --- |
| Type | Z (NUL-terminated string) |
| Values | CT (R1, C→T) or GA (R2, G→A) |
| Set by | FASTQ ingest (src/fastmap.cpp meth_mode block) |
| Emitted on | All records |

YC:Z: records which conversion was applied to the read:

  • CT — C→T conversion applied; this is an R1 read (or a single-end read).
  • GA — G→A conversion applied; this is an R2 read.

bwameth.py uses YC:Z: for the same purpose and with the same values. Tools such as MethylDackel use YC:Z: to determine which cytosines to call as methylated. YC:Z:CT records are candidates for CpG methylation on the top strand; YC:Z:GA records are candidates on the bottom strand.

YD:Z: — strand hypothesis

| Property | Value |
| --- | --- |
| Type | Z (NUL-terminated string) |
| Values | f (forward / top strand) or r (reverse / bottom strand) |
| Set by | meth_mem_aln_to_bam (src/meth_bam.cpp) |
| Emitted on | Mapped records only (not unmapped) |

YD:Z: records which strand of the doubled reference the read aligned to. The value is derived from the f/r prefix on the internal contig name via the meth_chrom_map_t.direction array. Unmapped reads do not receive YD:Z:.

  • f — the read aligned to an f-prefixed contig (the C→T projection of the top strand).
  • r — the read aligned to an r-prefixed contig (the G→A projection of the bottom strand).

This tag is used by --set-as-failed (see Flags) and is also consumed by downstream methylation callers to confirm which strand each alignment supports.

Tag emission summary

| Tag | Records | Source |
| --- | --- | --- |
| YS:Z: | All | FASTQ ingest (comment buffer) |
| YC:Z: | All | FASTQ ingest (comment buffer) |
| YD:Z: | Mapped only | meth_mem_aln_to_bam from chrom map |

Tip — Checking tags with samtools

To inspect these tags on a BAM file:

samtools view out.bam | cut -f12- | grep -oP 'Y[SCD]:Z:[^\t]+'

Or use samtools view -H to confirm the @PG ID:bwa-mem3-meth entry is present and the @SQ lines are consolidated (no f/r prefixes).
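
When the check needs to run inside a script rather than a shell pipeline, the same extraction can be sketched in Python over the text output of samtools view. meth_tags is a hypothetical helper; it relies only on the SAM layout, in which optional TAG:TYPE:VALUE fields start at column 12.

```python
def meth_tags(sam_line):
    """Return {tag: value} for YS/YC/YD on one tab-separated SAM record."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    for aux in fields[11:]:  # optional fields start at column 12
        tag, typ, value = aux.split(":", 2)
        if tag in ("YS", "YC", "YD") and typ == "Z":
            tags[tag] = value
    return tags
```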


See also: Overview · Conversion details · Chimera QC and header rewriting · Flags: --set-as-failed, --do-not-penalize-chimeras · User Guide → Output: SAM/BAM, headers, tags

Chimera QC and Header Rewriting

After the alignment kernel produces mem_aln_t records, bwa-mem3 --meth applies a set of post-processing steps before writing BAM output. These steps are implemented in src/meth_bam.cpp and run in the same process, in the same pass over the aligned records.

@SQ header consolidation

The doubled reference (ref.fa.bwameth.c2t) contains two contigs for each chromosome:

  • fchr1, fchr2, … — C→T projections of each chromosome.
  • rchr1, rchr2, … — G→A projections of each chromosome.

If the raw alignment header were written directly, every downstream tool would see twice as many sequences as there are real chromosomes, with unfamiliar f/r-prefixed names. meth_bam_writer_open instead builds a consolidated header using the meth_chrom_map_t:

  1. meth_chrom_map_build_from_bns iterates over bns->anns and strips the leading f/r from each contig name.
  2. The first contig with a given stripped name registers that name in the output list; subsequent contigs with the same stripped name map to the same output index.
  3. The BAM @SQ lines are written from the consolidated list — one SN: per real chromosome.

RNAME, RNEXT, and SA/XA tag contig references in every record are rewritten through cmap->out_tid and cmap->output_names so they reference the consolidated names. The mapping from internal (doubled-ref) contig index to output contig index is cmap->out_tid[p.rid].
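
A minimal Python sketch of the consolidation, assuming every internal contig name carries a one-character f or r prefix. build_chrom_map and its return shape are hypothetical; they mirror the roles of output_names, out_tid, and direction described above without reproducing the C++ structure.

```python
def build_chrom_map(contig_names):
    """Collapse f/r-prefixed contig names into one output list.

    Returns (output_names, out_tid, direction): out_tid[i] is the output
    index for internal contig i and direction[i] is "f" or "r".
    """
    output_names, index_of, out_tid, direction = [], {}, [], []
    for name in contig_names:
        prefix, stripped = name[0], name[1:]
        if prefix not in ("f", "r"):
            raise ValueError("expected an f/r-prefixed contig: " + name)
        if stripped not in index_of:  # first sighting registers the name
            index_of[stripped] = len(output_names)
            output_names.append(stripped)
        out_tid.append(index_of[stripped])
        direction.append(prefix)
    return output_names, out_tid, direction
```

With this map, a record aligned to internal contig i is written with reference index out_tid[i], and direction[i] supplies its YD:Z: value.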

Note — TLEN computation uses consolidated TIDs

Template length (TLEN) is computed using the consolidated output TIDs, not the internal p.rid values. Two mates that rescue onto fchr1 and rchr1 respectively both map to output chr1, so TLEN is reported as a non-zero distance rather than zero (which would happen if the mismatched internal TIDs were used).

Chimera QC heuristic

bwameth.py applies a heuristic to flag reads that look like chimeric fragments: if the longest contiguous alignment run (sum of M/=/X CIGAR operations) covers less than 44% of the read length, the read is considered a potential chimera.

bwa-mem3 --meth applies the same heuristic inside meth_mem_aln_to_bam:

if (100 * longest_M_run < 44 * l_seq):
    flag  |=  0x200   # set QC fail
    flag  &= ~0x2     # clear proper pair
    mapq   =  min(mapq, 1)

The threshold constant is MIN_LONGEST_M_PCT = 44 (defined at the top of src/meth_bam.cpp). The longest run is computed by cigar_longest_m_mem from src/cigar_util.cpp, which counts M, =, and X operations.

The chimera heuristic is only applied to mapped records (!(flag & 0x4) && direction != 0). Unmapped records are not touched.

To disable this heuristic, pass --do-not-penalize-chimeras. See Flags for details.
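
The heuristic can be sketched end to end in Python. Two points here are assumptions rather than facts from the source: adjacent M, =, and X operations are merged into one run, and any other CIGAR operation ends the run. longest_m_run and apply_chimera_qc are hypothetical names; only the constant name and the flag arithmetic come from the description above.

```python
import re

MIN_LONGEST_M_PCT = 44
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def longest_m_run(cigar):
    """Length of the longest run of consecutive M/=/X operations."""
    best = run = 0
    for length, op in _CIGAR_OP.findall(cigar):
        if op in "M=X":
            run += int(length)
            best = max(best, run)
        else:
            run = 0  # assumption: any other operation breaks the run
    return best

def apply_chimera_qc(flag, mapq, cigar, l_seq):
    """Return (flag, mapq) after the 44% check on one record."""
    if flag & 0x4:  # unmapped records are left alone
        return flag, mapq
    if 100 * longest_m_run(cigar) < MIN_LONGEST_M_PCT * l_seq:
        flag |= 0x200   # set QC fail
        flag &= ~0x2    # clear proper pair
        mapq = min(mapq, 1)
    return flag, mapq
```

A 100 bp read aligned as 30M70S fails the check (30% < 44%), while 44M56S sits exactly on the threshold and passes because the comparison is strict.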

--set-as-failed strand filtering

Before the chimera check, meth_mem_aln_to_bam checks whether opt->meth_set_as_failed is set and matches the record’s strand direction:

if (meth_set_as_failed != 0 && meth_set_as_failed == direction):
    flag |= 0x200

This unconditionally marks all alignments to the specified strand (f or r) as QC-failed before chimera logic runs. The chimera check then applies on top of the already-set fail flag.

Pair-level QC-fail propagation

Once per read group (all records sharing the same query name), after the individual records have been processed, the following call runs:

meth_bam_group_propagate_qcfail(group, n)

This function scans all records in the group. If any record has 0x200 set, it propagates that flag to every other record in the group and clears 0x2 (proper pair) on all of them. This ensures that a chimeric or strand-filtered primary alignment also marks its supplementary alignments and the mate as QC-failed, preventing inconsistent flag states in the output BAM.
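
The propagation rule is small enough to state as a Python sketch over the SAM flags of one query-name group. propagate_qcfail is a hypothetical stand-in for meth_bam_group_propagate_qcfail and operates on plain integers rather than BAM records.

```python
def propagate_qcfail(flags):
    """Given the SAM flags of one query-name group, return the adjusted list."""
    if any(f & 0x200 for f in flags):
        # One failure marks the whole group and clears proper-pair everywhere.
        return [(f | 0x200) & ~0x2 for f in flags]
    return list(flags)
```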

@PG ID:bwa-mem3-meth insertion

meth_bam_writer_open appends a @PG line to the header after the original bwa-mem3 @PG entry:

@PG  ID:bwa-mem3-meth  PN:bwa-mem3-meth  VN:<version>-meth  CL:<command line>

The <command line> field is the full bwa-mem3 mem --meth ... invocation with embedded tab characters replaced by spaces (htslib does not permit literal tabs in @PG CL: fields). This records the exact parameters used for provenance and reproducibility.

Tip — Verifying the header

After alignment, confirm consolidation and provenance with:

samtools view -H out.bam | grep -E '^@SQ|^@PG'

You should see one @SQ line per chromosome (no f/r prefixes) and both @PG ID:bwa-mem3 and @PG ID:bwa-mem3-meth entries.


See also: Overview · SAM tags: YS, YC, YD · Flags: --set-as-failed, --do-not-penalize-chimeras · Conversion details · User Guide → Output: SAM/BAM, headers, tags

Flags: --set-as-failed, --do-not-penalize-chimeras

bwa-mem3 --meth adds two flags that control QC behavior during BAM post-processing. Both flags affect the chimera QC and strand-filtering logic inside meth_mem_aln_to_bam (src/meth_bam.cpp).

--set-as-failed {f|r}

Marks every alignment to the specified strand as QC-failed (0x200) regardless of alignment quality or CIGAR structure.

Accepted values:

  • f — flag all alignments to f-prefixed contigs (C→T top-strand projection).
  • r — flag all alignments to r-prefixed contigs (G→A bottom-strand projection).

Effect on records:

When --set-as-failed f (or r) is set and a mapped record’s YD:Z: strand matches the specified value, the record’s SAM flag has 0x200 set. The chimera heuristic then runs on top of the already-set flag, but since 0x200 is already present, it can only enforce additional constraints (clearing 0x2, capping MAPQ). QC-fail propagation then spreads the flag to all records in the read group.

When to use it:

Some experimental designs produce reads that are expected to align exclusively to one strand. Flagging the other strand as QC-failed before downstream analysis prevents spurious methylation calls from mis-strand alignments. It is also useful for diagnosing library preparation issues: run once with --set-as-failed r and once without to compare yield on each strand.

Warning — All records on the strand are flagged

--set-as-failed is a blunt instrument. It marks every alignment to the chosen strand, including correctly aligned reads that simply happened to land on the complementary strand due to library structure. Use this flag only when your library is expected to be strand-specific.

--do-not-penalize-chimeras

Disables the chimera QC heuristic entirely.

Without this flag, any mapped record whose longest M/=/X CIGAR run covers less than 44% of the read length receives:

  • 0x200 (QC fail) set.
  • 0x2 (proper pair) cleared.
  • MAPQ capped at 1.

With --do-not-penalize-chimeras, none of these penalties are applied. Records are written with the MAPQ and flags as determined by the alignment kernel. The chimera check in meth_mem_aln_to_bam is skipped entirely.

When to use it:

The 44% threshold was calibrated for standard mammalian whole-genome bisulfite sequencing (WGBS) libraries with typical read lengths. For short reads (< 50 bp), reads with large insertions, or amplicon designs where short alignments are expected, the heuristic can produce excessive false-positive flagging. In those cases, disable it and apply a custom chimera filter downstream.

It is also useful when benchmarking: comparing bwa-mem3 --meth output against bwameth.py output on a specific dataset is cleaner when chimera filtering is disabled, since bwameth.py’s chimera logic may differ in edge cases.

Note — Pair-level propagation still applies

--do-not-penalize-chimeras only suppresses the chimera heuristic. If --set-as-failed is also active, those flags are still set, and meth_bam_group_propagate_qcfail still propagates any 0x200 flags across the read group.

Flag interaction summary

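| Condition | 0x200 set? | 0x2 cleared? | MAPQ capped? |
| --- | --- | --- | --- |
| Normal aligned record | No | No | No |
| Chimera heuristic triggers (default) | Yes | Yes | Yes (≤1) |
| --set-as-failed strand matches | Yes | No | No |
| Both chimera + --set-as-failed active | Yes | Yes | Yes (≤1) |
| --do-not-penalize-chimeras only | No | No | No |

The table shows each record's state before pair-level propagation. If any record in a read group carries 0x200, propagation then sets 0x200 and clears 0x2 on every record in that group.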
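
The rows of the table can be reproduced with a small decision function. This is a sketch of the per-record logic only, before group propagation; qc_outcome and its parameters are hypothetical names, and is_chimera stands for the result of the 44% check.

```python
def qc_outcome(direction, is_chimera, set_as_failed=None, penalize_chimeras=True):
    """Return (qcfail_set, proper_pair_cleared, mapq_capped) for a mapped record.

    direction: "f" or "r" (the record's YD:Z: value).
    set_as_failed: None when the flag is unused, else "f" or "r".
    penalize_chimeras: False models --do-not-penalize-chimeras.
    """
    qcfail = cleared = capped = False
    if set_as_failed is not None and set_as_failed == direction:
        qcfail = True  # strand filter sets 0x200 only
    if penalize_chimeras and is_chimera:
        qcfail = cleared = capped = True  # chimera penalties stack on top
    return qcfail, cleared, capped
```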

See also: Overview · Chimera QC and header rewriting · SAM tags: YS, YC, YD · Best Practices → Methylation defaults · CLI Reference → mem

Interop with External bwameth.py c2t

Some workflows use bwameth.py’s c2t subcommand to convert reads before passing them to an aligner. bwa-mem3 --meth supports this pattern by detecting whether the caller has already provided a pre-converted FASTQ and whether the reference path already points to the doubled-reference FASTA.

Auto-detect logic for the reference path

When --meth is active, bwa-mem3 mem ordinarily appends .bwameth.c2t to the reference path so the user can pass the original FASTA prefix:

bwa-mem3 mem --meth -t 16 ref.fa R1.fq.gz R2.fq.gz
# internally uses ref.fa.bwameth.c2t as the reference

If the reference path already ends with .bwameth.c2t, the auto-append is skipped:

bwa-mem3 mem --meth -t 16 ref.fa.bwameth.c2t R1.fq.gz R2.fq.gz
# no suffix appended; ref.fa.bwameth.c2t is used as-is

This detection is a simple suffix check on the path string. It allows callers that manage the doubled-reference path explicitly to pass it without triggering double-append.
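
In Python terms the rule amounts to one suffix test. resolve_meth_reference is a hypothetical name for illustration; only the .bwameth.c2t suffix comes from the text above.

```python
METH_SUFFIX = ".bwameth.c2t"

def resolve_meth_reference(path):
    """Append the doubled-reference suffix unless it is already present."""
    return path if path.endswith(METH_SUFFIX) else path + METH_SUFFIX
```

Both ref.fa and ref.fa.bwameth.c2t therefore resolve to the same index prefix.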

Using bwameth.py c2t as the read preprocessor

If your pipeline already runs bwameth.py c2t to convert reads (for example, because it needs to reuse converted reads across multiple aligners), you can pipe the output directly to bwa-mem3 mem --meth:

bwameth.py c2t R1.fq.gz R2.fq.gz \
  | bwa-mem3 mem --meth -p -t 16 ref.fa.bwameth.c2t /dev/stdin \
  | samtools sort -o out.bam

Key points for this pattern:

  • Pass the .bwameth.c2t reference path explicitly so the auto-append is suppressed.
  • Use -p to tell bwa-mem3 mem that the input contains interleaved paired-end reads (bwameth.py c2t emits interleaved output to stdout).
  • Use /dev/stdin as the reads argument to read from the pipe.
  • The bwa-mem3 --meth inline c2t conversion is not applied when the reads arrive pre-converted; however, YS:Z: and YC:Z: tags are only written by the inline conversion path. If you need those tags in the output, you must either use the integrated mode (no external c2t step) or ensure your external preprocessor emits compatible comments in the FASTQ.

Warning — YS:Z: and YC:Z: require integrated mode

When reads are pre-converted by an external tool and piped in, the inline c2t step in src/fastmap.cpp is bypassed. YS:Z: and YC:Z: tags will not be present in the output BAM unless the external converter writes them as FASTQ comment fields in the format YS:Z:<seq>\tYC:Z:<dir> and those comments are passed through. MethylDackel and similar callers use YS:Z: to restore original bases for methylation calling; if the tag is absent, they fall back to reading SEQ directly, which may affect accuracy.

Header rewriting and BAM post-processing with external c2t

Whether reads are converted inline or externally, all BAM post-processing steps apply identically when --meth is active:

  • @SQ header consolidation (f/r contigs → one entry per chromosome).
  • YD:Z: tag emission from the contig name prefix.
  • Chimera QC heuristic (unless --do-not-penalize-chimeras is set).
  • Pair-level QC-fail propagation.
  • @PG ID:bwa-mem3-meth insertion.

The post-processing pipeline depends only on the reference contig names (to determine YD:Z:) and the alignment flags — not on whether reads were converted inline or externally.

Summary of path variants

| Reference arg | Read source | Auto-append? | Inline c2t? | YS:Z: emitted? |
| --- | --- | --- | --- | --- |
| ref.fa | Raw FASTQ | Yes (→ ref.fa.bwameth.c2t) | Yes | Yes |
| ref.fa.bwameth.c2t | Raw FASTQ | No | Yes | Yes |
| ref.fa.bwameth.c2t | Pre-converted (pipe) | No | No | No (unless pre-emitted) |

See also: Overview · Conversion details · SAM tags: YS, YC, YD · bwameth.py drop-in mapping · Related Projects: bwameth.py

What’s Different from bwa-mem2

This section tracks every change that bwa-mem3 carries on top of upstream bwa-mem2/bwa-mem2’s master branch, explains why each change was made, and records its upstream disposition.

How this section is organized

Each deep-dive page covers one category of change:

  • Correctness fixes — bugs in upstream bwa-mem2 that are fixed in bwa-mem3, including the kswv SIMD score2 plateau series, the proper-pair flag regression, the zero-init crash, the SMEM buffer overflow, and the @PG tab-escape issue.
  • Performance improvements — lockstep SMEM batching, batched -H header ingestion, libsais FM-index construction, and the consolidated mapping speedup suite.
  • Features: --meth bisulfite mode, mimalloc allocator, --supp-rep-hard-cap, bwa-mem3 shm, shm --meth, the HN:i tag, and the --bam=LEVEL output flag.
  • Architecture support — the Linux ARM64/aarch64 build, the arch=avx512bw Makefile target, the NEON kswv mate-rescue kernel, and the AVX2 kswv mate-rescue kernel.
  • Build & infrastructure — the doctest framework, Codecov integration, PACKAGE_VERSION from git describe, PGO target parameterization, CXXFLAGS/CPPFLAGS/LDFLAGS forwarding, the unit-test harness, and the CI matrix expansion.
  • Upstream PR status — a single table cross-referencing every fork-carried change to its corresponding upstream PR or issue, with current upstream disposition.

Carried on top of upstream

| Commit | Topic | Upstream status |
| --- | --- | --- |
| ae73227 | Apple Silicon / ARM64 NEON support (PR #288 work) | PR #288 open |
| 744a9e7 | ci: cross-platform build + dwgsim phiX end-to-end test | fork-only |
| 490502b | fix: drop unused global stat that shadows libc | fork-only |

Additional fork-level changes

  • Vendored mimalloc allocator: ext/mimalloc is pinned at v3.3.0 and linked into every binary by default (USE_MIMALLOC=1). Linux uses --whole-archive static linkage; macOS uses dyld-interposed shared linkage. USE_MIMALLOC=1 is the supported and recommended default on all platforms; USE_MIMALLOC=0 is provided as a best-effort opt-out and is CI-gated on Linux x86 only. See Features for details.

  • --supp-rep-hard-cap INT (opt-in, disabled by default): forces MAPQ=0 on supplementary alignments whose chain contains a seed with >=INT genome occurrences. This addresses a long-standing bwa/bwa-mem2 issue: a supplementary fragment that would map to many places on its own (e.g. a short read in a CCATCC repeat) inherits a high MAPQ from its primary, because the supplementary’s competing repetitive chains are filtered out during the full-read pipeline and so never contribute to its sub/sub_n. See upstream #260 for the reporter case. Primary MAPQ is unaffected, and default output is byte-identical to stock bwa-mem2. Typical values are 5–20 (lower is more aggressive); the upstream #260 repro drops from MAPQ=60 to MAPQ=0 at --supp-rep-hard-cap 18.
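
The --supp-rep-hard-cap rule can be modeled in a few lines of Python. This is a sketch of the stated behavior, not the fork's code: supp_mapq is a hypothetical function, and seed_occurrences stands for the genome occurrence count of each seed in the alignment's chain.

```python
def supp_mapq(mapq, is_supplementary, seed_occurrences, hard_cap=None):
    """Return the MAPQ to report under the hard-cap rule.

    hard_cap=None models the default (option disabled), which leaves
    every MAPQ untouched.
    """
    if hard_cap is None or not is_supplementary:
        return mapq  # primaries and default runs are unaffected
    if any(occ >= hard_cap for occ in seed_occurrences):
        return 0     # a sufficiently repetitive seed zeroes the MAPQ
    return mapq
```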

Version stamping

PACKAGE_VERSION (the value reported by bwa-mem3 version and written to the @PG VN: SAM header field) is generated at build time by the Makefile from git describe --tags --dirty, e.g. v2.3-30-g61813ef for a tree 30 commits past upstream tag v2.3 at commit 61813ef.

  • No manual bumping required: cut a fresh release by tagging the commit (git tag -a vX.Y-fg-labs.N -m ...) and the next build picks it up.
  • Builds where git describe --tags fails (source-tarball extractions, or shallow clones / checkouts with no tag reachable from HEAD — including CI’s default actions/checkout fetch-depth of 1) fall back to the static FG_LABS_VERSION_FALLBACK in Makefile. Bump that when cutting a release that will be consumed as a tarball, or in CI artifacts.
  • src/version.h is generated and .gitignored; make clean removes it.
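
For tooling that needs to interpret the version string, the git describe layout (tag, commit count, abbreviated hash, optional -dirty) can be split with a regular expression. parse_describe is a hypothetical helper, not something the build provides; strings that do not follow this layout, such as an exact tag or the static fallback, return None.

```python
import re

_DESCRIBE = re.compile(
    r"^(?P<tag>.+)-(?P<commits>\d+)-g(?P<sha>[0-9a-f]+)(?P<dirty>-dirty)?$"
)

def parse_describe(version):
    """Split '<tag>-<n>-g<sha>[-dirty]' into its parts, or return None."""
    m = _DESCRIBE.match(version)
    if m is None:
        return None  # exact tag or a fallback string
    return {
        "tag": m["tag"],
        "commits": int(m["commits"]),
        "sha": m["sha"],
        "dirty": m["dirty"] is not None,
    }
```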

Branching and update policy

  • master tracks upstream unchanged.
  • main is upstream/master plus the commits above. Rebased onto upstream roughly quarterly, or sooner when an upstream release we care about lands.
  • Contributions go via PR targeting main. CI and CodeRabbit gate merges.
  • Any PR that adds or removes a fork-carried commit must update the table above in the same PR.

Consuming

Clone this repo and check out main:

git clone https://github.com/fg-labs/bwa-mem3.git
cd bwa-mem3
git checkout main

Or vendor the branch into a downstream repo by pinning to a specific commit (not the branch tip) so your build is reproducible.

Relationship to upstream

We submit the generally-useful fixes and features carried here as PRs against bwa-mem2/bwa-mem2 when the upstream maintainers are actively merging; while they are not, fixes land here first and we drop them from main once they appear upstream.


See also: Correctness fixes · Performance improvements · Features · Upstream PR status · Developer Guide → Contributing

Correctness Fixes

This page documents bugs present in upstream bwa-mem2 that bwa-mem3 fixes. Each fix is isolated to a single PR so it can be reviewed independently and dropped from main once upstream merges the equivalent patch.

@PG CL: tab escaping (PR #54)

When a read-group string is passed via -R '@RG\tID:x\tSM:y', the tab characters in the argument were copied verbatim into the @PG CL: SAM header field. The SAM specification uses tabs as field delimiters, so the resulting header line appeared to have extra ID: and other tag fields embedded inside CL:. Lenient parsers (samtools, htsjdk) tolerated the output; strict parsers (noodles, some fgbio configurations) rejected the file as malformed.

The fix replaces each tab character with a space when building the @PG CL: value in src/main.cpp. The @RG line itself is not modified, so the read-group metadata is preserved correctly. A regression shell test (test/pg_cl_escape_test.sh) asserts that the @PG line contains exactly five tab-separated fields after the fix. Upstream issue reference: bwa-mem2#293.
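
The effect of the fix can be illustrated with a Python sketch that builds a @PG line and counts its fields. The field layout (@PG, ID, PN, VN, CL) is an assumption made to match the five-field assertion, and build_pg_line is a hypothetical function; the actual change is in C++.

```python
def build_pg_line(program_id, name, version, argv):
    """Build an @PG header line whose CL: value contains no tab characters."""
    command_line = " ".join(argv).replace("\t", " ")
    fields = ["@PG", "ID:" + program_id, "PN:" + name,
              "VN:" + version, "CL:" + command_line]
    return "\t".join(fields)
```

With a read-group argument containing real tabs, the line still splits into exactly five tab-separated fields, which is the property the regression test checks.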

SMEM buffer overflow on reads longer than 151 bp (PR #55)

bwa-mem2 hardcoded READ_LEN 151 in src/macro.h to size the per-thread matchArray SMEM buffer at compile time. The FMI walk wrote past this buffer without bounds checking when reads exceeded 151 bp, causing memory corruption that manifested as segfaults or silent wrong output on 300 bp MiSeq reads, error-corrected long reads, and any run with a non-default -k that extended seed length.

A second cap, MAX_READ_LEN_FOR_LOCKSTEP 512, guarded the lockstep driver’s per-slot stack arrays with a hard assert that aborted on anything longer.

The fix eliminates both compile-time caps. Every per-thread SMEM buffer is now heap-allocated on the memory management context (mmc) and grown on demand from each batch’s observed max_readlength. The pre-walk grow in mem_collect_smem sizes matchArray[tid] to BATCH_MUL * BATCH_SIZE * max_readlength, and all array writes are bounds-checked with a structured smem_overflow_die on overflow. Regression tests cover 300 bp, 1 kbp, and 3 kbp phiX reads; all three segfaulted before the fix and produce correct NM:i:0 alignments after. Upstream references: bwa-mem2#210 (issue), bwa-mem2#238 (closed unmerged upstream PR).
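
The grow-on-demand pattern with checked writes can be sketched in Python. GrowableBuffer and required_slots are hypothetical names, and the multiplier values used with them are placeholders rather than the real BATCH_MUL and BATCH_SIZE; the sketch shows only the shape of the fix, in which capacity is derived from the observed maximum read length and an out-of-range write fails loudly instead of corrupting memory.

```python
class GrowableBuffer:
    """Heap buffer that grows on demand and refuses out-of-range writes."""

    def __init__(self):
        self.slots = []

    def ensure_capacity(self, needed):
        # Grow only; never shrink between batches.
        if needed > len(self.slots):
            self.slots.extend([None] * (needed - len(self.slots)))

    def write(self, index, value):
        if not 0 <= index < len(self.slots):
            raise OverflowError(
                f"write at {index} exceeds capacity {len(self.slots)}"
            )
        self.slots[index] = value

def required_slots(batch_mul, batch_size, max_readlength):
    """Capacity needed for one batch, following the sizing formula above."""
    return batch_mul * batch_size * max_readlength
```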

kswv nrow==0 guard (PR #51)

When a SIMD batch contained only padding pairs (all len1 == 0), the DP loop never executed and nrow was zero. The post-loop rowMax + (i-1) * SIMD_WIDTH store still executed, walking SIMD_WIDTH bytes before the beginning of the rowMax allocation. On glibc this produced a free(): invalid pointer abort; on macOS libc it silently corrupted the heap.

The fix wraps the post-loop store in an if (i > 0) guard on all five SIMD kswv kernels: NEON u8, NEON 16, AVX2 u8, AVX-512BW u8, and AVX-512BW 16. The upstream patch bwa-mem2#289 covered only the two AVX-512BW kernels; bwa-mem3 broadens it to the three additional kernels carried in this fork. A dedicated regression test (test/kswv_nrow_zero_test.cpp) builds all-padding batches and verifies each kernel is clean under AddressSanitizer.
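The shape of the guard can be sketched as follows. This is an illustration only: store_last_row is a hypothetical function, SIMD_WIDTH is fixed at 16 for the example, and the real kernels perform the store with SIMD intrinsics rather than memcpy.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

constexpr int SIMD_WIDTH = 16;  // assumed lane count for this sketch

// Hypothetical post-loop store. i is the DP loop counter after the loop has
// finished, so i == 0 means no row was ever computed (an all-padding batch).
// Returns true if the last row was stored.
bool store_last_row(std::uint8_t* row_max, const std::uint8_t* last_row, int i) {
    assert(i >= 0);
    if (i > 0) {
        std::memcpy(row_max + (i - 1) * SIMD_WIDTH, last_row, SIMD_WIDTH);
        return true;
    }
    // Without the guard, (i - 1) * SIMD_WIDTH would be negative here and the
    // store would land SIMD_WIDTH bytes before the start of row_max.
    return false;
}
```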

kswv score2 plateau series (PRs #26, #27, #28, #29, #30, #31)

The batched mate-rescue Smith-Waterman path (kswv) contains a family of related bugs across its SIMD kernels that inflated the suboptimal score (score2 / XS) and consequently deflated MAPQ relative to upstream bwa-mem2.

AVX-512BW dispatch guard (PR #26). GCC with -mavx512bw automatically defines __AVX2__, so the #elif __AVX2__ branch in src/kswv.h and src/kswv.cpp matched first on every AVX-512BW build. The 256-bit AVX2 kernel produced only 32-lane results into 64-lane score[]/te1[]/qe[] arrays sized for AVX-512BW; the upper 32 lanes held uninitialized values. mem_matesw_batch_post read those bogus te values, bwa_gen_cigar2 returned NULL, and mem_reg2aln triggered an a.cigar != NULL assertion on every AVX-512BW dispatch host (AWS c7a, c7i). The fix qualifies the #elif __AVX2__ guard with !__AVX512BW__, matching the existing pattern in bandedSWA.h. Closes issue #25.
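The guard ordering problem can be reproduced in a few lines. The sketch below is not the code from src/kswv.h; kswv_u8_lanes is a hypothetical function, and the 16-lane fallback is an assumption made for the example.

```cpp
#include <cassert>

// Number of 8-bit lanes the selected kernel would process per batch.
// The AVX2 branch is tested first, as in the situation described above, so it
// must be qualified with !__AVX512BW__: an AVX-512BW build also defines
// __AVX2__ and would otherwise select the 256-bit kernel.
constexpr int kswv_u8_lanes() {
#if defined(__AVX2__) && !defined(__AVX512BW__)
    return 32;  // 256-bit kernel
#elif defined(__AVX512BW__)
    return 64;  // 512-bit kernel
#else
    return 16;  // 128-bit kernel; assumed fallback for this sketch
#endif
}
```

Dropping the !defined(__AVX512BW__) qualifier makes the function return 32 when compiled with -mavx512bw, which mirrors the 32-lane results being written into arrays sized for 64 lanes.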

AVX2 score2 plateau fix (PR #27 closed, PR #28 merged). The AVX2 256-bit kswv kernel added in PR #20 used a dense SIMD max over every rowMax row to compute the suboptimal score. Scalar ksw_u8 instead collapses consecutive rows above minsc into a single b[] entry anchored at the max-score row, then finds the best anchor outside the primary region. The dense max pulled in tail rows from a plateau whose anchor sat inside the primary region, inflating XS by 1–4 on a minority of reads and reducing MAPQ by 2–18 on those reads. PR #27 (closed) temporarily disabled the AVX2 batched path. PR #28 fixes the kernel itself by replacing the dense scan with a per-lane scalar emulation of the b[] build-and-scan logic.
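The difference between the two scans is easiest to see on a small example. The sketch below is modeled on the b[] logic of scalar ksw_u8 as described above, applied to a plain vector of per-row maxima; the function names are hypothetical and the bwa-mem3 kernels apply the same idea per SIMD lane.

```cpp
#include <cassert>
#include <vector>

struct Anchor { int score; int row; };

// Collapse rows scoring >= minsc into anchors. A row directly after the
// current anchor row is merged into it: it replaces the anchor only if it
// scores higher, otherwise it is dropped.
std::vector<Anchor> build_anchors(const std::vector<int>& row_max, int minsc) {
    std::vector<Anchor> b;
    for (int i = 0; i < static_cast<int>(row_max.size()); ++i) {
        const int imax = row_max[i];
        if (imax < minsc) continue;
        if (b.empty() || b.back().row + 1 != i) b.push_back(Anchor{imax, i});
        else if (b.back().score < imax) b.back() = Anchor{imax, i};
    }
    return b;
}

// Suboptimal score as the best anchor outside the primary region [low, high].
int score2_anchored(const std::vector<int>& row_max, int minsc, int low, int high) {
    assert(low <= high);
    int score2 = -1;
    for (const Anchor& a : build_anchors(row_max, minsc))
        if ((a.row < low || a.row > high) && a.score > score2) score2 = a.score;
    return score2;
}

// Dense scan: the maximum over every row outside [low, high], with no
// collapsing. This is the behavior the text identifies as the bug.
int score2_dense(const std::vector<int>& row_max, int minsc, int low, int high) {
    assert(low <= high);
    int score2 = -1;
    for (int i = 0; i < static_cast<int>(row_max.size()); ++i)
        if ((i < low || i > high) && row_max[i] >= minsc && row_max[i] > score2)
            score2 = row_max[i];
    return score2;
}
```

For row maxima {0, 0, 12, 20, 30, 25, 18, 14, 11, 0} with minsc = 10 and a primary region of rows 2 to 6, the dense scan picks up the tail row scoring 14, while the anchored scan reports 11.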

NEON and AVX-512BW 8-bit port (PR #29). The same dense-rowMax score2 scan existed in kswv_neon_u8 and kswv_512_u8. Confirmed on ARM: rebuilding smoke-1M on darwin/arm64 pre-fix produced the identical four MAPQ regressions as the AVX2 case. PR #29 ports the per-lane scalar b[]-emulation fix to both kernels.

AVX-512BW 16-bit port (PR #30). kswv_512_16 carried four bugs: the same dense-rowMax plateau pattern, aggregate maxl/minh bounds instead of per-lane bounds (a gap from PR #21), no minsc filter, and no qe mask. The per-lane scalar emulation from PR #29 fixes all four naturally.

NEON 16-bit rewrite (PR #31). kswv_neon_16 was effectively dead code before this PR. Five interacting bugs produced 20,435 BAM diffs vs scalar reference on smoke-1M -A 2: the score table reinterpreted int16 xor indices as int8 lookups (inflating match scores by ~256 per cell), the table was too small for the 16-bit SoA encoding, rowMax was never written, the early-exit fired on row 0 for all pairs without a KSW_XSTOP target, and all the fix-3 class bugs from PRs #28–#30 were missing. The PR rewrites the kernel from scratch against kswv_neon_u8’s structure using 32-byte int8 tables indexed via vqtbl2_s8, per-lane freeze, exit0 bitmap, and per-lane scalar score2.

kseq2bseq1 zero-initialization (PR #22)

bseq_read_orig grows its sequence buffer with realloc, leaving tail entries uninitialized. kseq2bseq1 populated only name, comment, seq, qual, and l_seq for each entry, leaving sam, bams, n_bams, and cap_bams at whatever values realloc happened to return. PR #13 added an unconditional free(ret->seqs[i].bams) in the output loop (fastmap.cpp:571), which turned those garbage values into a crash — a pointer being freed was not allocated abort under system malloc and a SIGSEGV under mimalloc — once input exceeded the initial 256-sequence allocation. The crash was deterministic and reproducible with -t1.

The fix is a single memset(s, 0, sizeof(*s)) at the top of kseq2bseq1.
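A minimal sketch of the pattern, using a trimmed-down record type. SeqRecord and init_record are hypothetical; only the field names are taken from the description above. The sketch relies on all-bits-zero being a null pointer, which holds on the platforms bwa-mem3 targets.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical, trimmed-down record with a few of the fields named in the text.
struct SeqRecord {
    char* name;
    char* sam;
    void* bams;
    int   n_bams;
    int   cap_bams;
    int   l_seq;
};

// Clear the whole slot before filling in any field, so that fields not
// populated here end up null or zero instead of whatever realloc returned.
void init_record(SeqRecord* s, char* name, int l_seq) {
    assert(s != nullptr);
    std::memset(s, 0, sizeof(*s));
    s->name = name;
    s->l_seq = l_seq;
}
```

After init_record, an unconditional free of bams is a harmless free of a null pointer.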

Proper-pair flag from emitted alignment (PR #17)

In the no_pairing emission path of mem_sam_pe and mem_sam_pe_batch_post, the proper-pair bit (0x2) was computed from a[i].a[0].rb regardless of which alignment was actually emitted. When the primary’s alignment score fell below the reporting threshold opt->T but a non-primary ALT hit cleared it, mem_reg2aln emitted a[i].a[n_pri[i]] while mem_infer_dir still read the below-threshold primary. In that case the SAM flag did not reflect the coordinates in the record.

The fix stores the selected alignment index per mate in a which[2] array and passes a[i].a[which[i]].rb to mem_infer_dir, ensuring the proper-pair flag always matches the emitted record. The bug was present in the bwa-mem2 initial commit from 2019. Upstream reference: pre-existing bug, no open upstream PR at time of merge.


Changes catalog

| Item | bwa-mem3 PR | Upstream PR/issue | Status |
|---|---|---|---|
| @PG CL: tab escape | #54 | bwa-mem2#293 | fork-only (open upstream issue) |
| SMEM buffer overflow on >151 bp reads | #55 | bwa-mem2#238, bwa-mem2#210 | fork-only (upstream PR closed unmerged) |
| kswv nrow==0 guard | #51 | bwa-mem2#289 | fork-only (upstream PR open) |
| AVX-512BW dispatch guard | #26 | | fork-only |
| AVX2 score2 plateau disable (superseded) | #27 | | closed (superseded by #28) |
| AVX2 score2 plateau fix | #28 | | fork-only |
| NEON + AVX-512BW 8-bit score2 fix | #29 | | fork-only |
| AVX-512BW 16-bit score2 fix | #30 | | fork-only |
| NEON 16-bit kernel rewrite | #31 | | fork-only |
| kseq2bseq1 zero-initialization | #22 | | fork-only |
| Proper-pair flag from emitted alignment | #17 | | fork-only |

See also: Performance improvements · Architecture support · Upstream PR status · Developer Guide → Regression test framework · Performance → SIMD dispatch matrix

Performance Improvements

This page covers the performance work carried in bwa-mem3 on top of upstream bwa-mem2. Every change listed here preserves byte-identical SAM/BAM output vs the upstream baseline it was benchmarked against.

For current benchmark numbers across architectures and workloads, see bwa-mem3-bench, the canonical source of truth for benchmark methodology and results.

Lockstep SMEM batching (PR #33)

Seeding in bwa-mem2 advances one read’s SMEM walk at a time. Because each forward/backward extension step issues a random access into the cp_occ checkpoint array (~4 GB for human genome), the CPU stalls on cache misses between steps. Lockstep batching advances SMEM_LOCKSTEP_N reads’ SMEM walks in slot-interleaved round-robin order so that the out-of-order engine can overlap the cp_occ cache-miss loads for read i+N with the compute-bound walk of read i.

Each read slot (BatchSlot) carries its own prev[] walk buffer and match_buf[] reorder buffer. A tight recycling loop assigns finished slots to the next unprocessed read immediately. The match-emit cursor enforces input-index order so output is byte-identical to scalar. SMEM_LOCKSTEP_N is compile-time tunable; N=1 dispatches to the unchanged scalar path for bisection.
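The scheduling idea can be sketched without any FM-index code. In the toy model below each read simply needs a given number of extension steps; Slot, Trace, and lockstep are hypothetical names, and finished slots are recycled at the start of the next round rather than immediately.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One in-flight walk: which read it belongs to and how many steps remain.
struct Slot { std::size_t read = 0; int steps_left = 0; bool busy = false; };

struct Trace {
    std::vector<std::size_t> completed;  // order in which walks finished
    std::vector<std::size_t> emitted;    // order in which results were emitted
};

// Advances up to n_slots walks one step per round, round-robin, and emits
// results strictly in input order through a cursor.
Trace lockstep(const std::vector<int>& steps, std::size_t n_slots) {
    assert(n_slots > 0);
    std::vector<Slot> slots(n_slots);
    std::vector<bool> done(steps.size(), false);
    Trace t;
    std::size_t next_read = 0, cursor = 0;
    while (cursor < steps.size()) {
        for (Slot& s : slots) {
            if (!s.busy && next_read < steps.size()) {  // recycle a free slot
                s.read = next_read;
                s.steps_left = steps[next_read];
                s.busy = true;
                ++next_read;
            }
            if (!s.busy) continue;
            if (s.steps_left > 0) --s.steps_left;       // one extension step
            if (s.steps_left <= 0) {
                done[s.read] = true;
                t.completed.push_back(s.read);
                s.busy = false;
            }
        }
        // Emit cursor: only release results whose predecessors are all done.
        while (cursor < steps.size() && done[cursor]) t.emitted.push_back(cursor++);
    }
    return t;
}
```

With steps {5, 1, 1, 3} and two slots, walks finish in the order 1, 2, 0, 3 but are emitted as 0, 1, 2, 3, which is the property that keeps output identical to the one-read-at-a-time path.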

Measured improvement on 150 bp NovaSeq WGS (1M pairs, hg38, Graviton3 r7g.4xlarge, 8 threads): −6.1% wall time (82 s → 77 s). The backwardExt hot cp_occ load share dropped from 65.5% to 53.3% of function time — direct evidence that the OoO engine is overlapping cross-slot loads. On 300 bp MiSeq reads the workload is SW-dominated (~85% of cycles in kswv kernels) and the SMEM improvement is within noise; parity holds.

Supersedes PR #15 (cross-read _mm_prefetch shape), which regressed on Graviton3.

Batched -H header ingestion (PR #49, closes issue #37)

Passing a large header file via -H <file> re-ran strlen on the growing header string and called realloc on every input line, making ingestion O(n²) in the number of header lines. For a ~70 MB / ~1.5 M-line header (reported in upstream bwa-mem2#204) this caused runtimes exceeding 10 minutes before alignment started.

The fix introduces bwa_insert_header_file, a batched helper that determines the file size with fseek/ftell, allocates a single buffer, copies all @-prefixed lines in one pass, and calls bwa_insert_header once. The fix also addresses four correctness gaps in the upstream PR #204: the return-value assignment was dropped (leaving hdr_line stale after realloc), const FILE* caused compiler warnings, empty files were not guarded, and each fgets was not bounded by remaining buffer. A regression test (test/header_insert_test.cpp) diffs the batched path against the pre-patch per-line baseline across eight edge cases.
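The single-pass filtering step might look like the sketch below. It operates on file contents already held in memory so that it stays self-contained; collect_header_lines is a hypothetical name and is not the signature of bwa_insert_header_file.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Keeps only the lines that start with '@', joined by '\n', using one
// up-front reservation instead of a reallocation per input line.
std::string collect_header_lines(const std::string& text) {
    std::string out;
    out.reserve(text.size());  // output can never exceed the input size
    std::size_t pos = 0;
    while (pos < text.size()) {
        std::size_t eol = text.find('\n', pos);
        if (eol == std::string::npos) eol = text.size();
        if (eol > pos && text[pos] == '@') {
            if (!out.empty()) out.push_back('\n');
            out.append(text, pos, eol - pos);
        }
        pos = eol + 1;
    }
    assert(out.size() <= text.size());
    return out;
}
```

Each input byte is visited a constant number of times, so the cost is linear in the file size. An empty input yields an empty string, which corresponds to the empty-file guard mentioned above.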

libsais FM-index construction (PR #57)

bwa-mem3 index now builds the FM-index using libsais v2.9.1 (Ilya Grebnov) instead of the sais-lite (Yuta Mori saisxx) library that bwa-mem2 inherited. libsais is actively maintained, supports OpenMP-parallel induced sorting, and produces a byte-identical FM-index. No changes are required to existing indexes — bwa-mem3 reads index files built by bwa-mem2 index without re-indexing.

For a human reference (GRCh38 + decoys), libsais reduces indexing wall time and peak memory vs sais-lite. Exact numbers depend on thread count and available RAM; see the PR body for measurements on Graviton3.

Consolidated mapping speedups (PR #58)

PR #58 is a multi-phase performance audit of bwa-mem2’s hot path, squashed and rebased onto main. It incorporates improvements across five subsystems:

  • ksw2 banded SW — tuned the band extension loop to reduce redundant computation in the common case.
  • SMEM lockstep batching — additional refinements on top of PR #33.
  • SAL prefetch — prefetch hints for the suffix array lookup hot path.
  • SAM record building — reduced per-record allocation in the text formatting path.
  • PGO build — the opt-in profile-guided optimization target (see also Performance → PGO build) is included in this suite.

On the smoke-1M workload (1M PE 150 bp reads, hg38, Graviton3 r7g.4xlarge, 16 threads, warm page cache), this PR contributed the largest single-step wall time reduction in the main branch’s performance history. Benchmark details are maintained at bwa-mem3-bench.


Changes catalog

| Item | bwa-mem3 PR | Upstream PR/issue | Status |
|---|---|---|---|
| Lockstep SMEM batching | #33 | | fork-only |
| Batched -H header ingestion | #49 | bwa-mem2#204 | fork-only (upstream PR open) |
| Large header performance (issue) | issue #37 | | closed by #49 |
| libsais FM-index construction | #57 | | fork-only |
| Consolidated mapping speedups | #58 | | fork-only |

See also: Performance → Overview · Performance → PGO build · Correctness fixes · Build & infrastructure · bwa-mem3-bench

Features

This page covers user-facing features added to bwa-mem3 on top of upstream bwa-mem2. None of these features change default behavior: output produced by bwa-mem3 mem without any of these flags is byte-identical to the corresponding bwa-mem2 output (except for the @PG ID: and PN: fields which now read bwa-mem3).

--meth bisulfite alignment mode (PR #13)

--meth turns bwa-mem3 index and bwa-mem3 mem into a single-binary drop-in replacement for the entire bwameth.py pipeline. No Python, no separate post-processing step, no bwameth.py dependency.

bwa-mem3 index --meth ref.fa          # once per reference
bwa-mem3 mem --meth ref.fa R1.fq R2.fq | samtools sort -o out.bam

index --meth writes <ref>.bwameth.c2t — a doubled reference with f/r-prefixed contigs and C→T / G→A projection, byte-identical to the index that bwameth.py index-mem2 produces.

mem --meth performs inline C→T conversion of R1 and G→A conversion of R2 before seeding, stashes the original bases in YS:Z:, records the conversion direction in YC:Z:, consolidates the f/r contig pairs back to one @SQ per real chromosome, applies a chimera QC heuristic (longest M/=/X run < 44% of read length → set 0x200, clear proper-pair 0x2, cap MAPQ at 1), copies YS:Z: back into the SEQ field for CpG-calling tools, and writes a @PG ID:bwa-mem3-meth entry.
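The chimera QC rule in particular is compact enough to sketch. The code below assumes the rule is applied to the longest single M, = or X CIGAR operation and that the 44% comparison can be done in integer arithmetic; longest_match_op, FlagMapq, and apply_chimera_qc are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Longest single M, = or X operation in a CIGAR string (0 for "*").
int longest_match_op(const std::string& cigar) {
    int longest = 0, len = 0;
    for (char c : cigar) {
        if (std::isdigit(static_cast<unsigned char>(c))) {
            len = len * 10 + (c - '0');
        } else {
            if (c == 'M' || c == '=' || c == 'X') longest = std::max(longest, len);
            len = 0;
        }
    }
    return longest;
}

struct FlagMapq { unsigned flag; int mapq; };

// If the longest match run is shorter than 44% of the read, mark the record
// as QC-fail (0x200), clear proper-pair (0x2), and cap MAPQ at 1.
FlagMapq apply_chimera_qc(const std::string& cigar, int read_len,
                          unsigned flag, int mapq) {
    assert(read_len > 0);
    if (longest_match_op(cigar) * 100 < 44 * read_len) {
        flag |= 0x200u;
        flag &= ~0x2u;
        mapq = std::min(mapq, 1);
    }
    return FlagMapq{flag, mapq};
}
```

A 100 bp read aligned as 30M70S with flag 0x63 and MAPQ 60 comes out with flag 0x261 and MAPQ 1, while a 100M alignment is left unchanged.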

On the bwameth.py example fixture (92,684 reads), end-to-end output is byte-identical on chrom, pos, CIGAR, and SEQ vs the bwameth.py oracle. Stacks on PR #12 (--bam). See the Methylation Reference for full details.

Vendored mimalloc allocator (PR #19)

bwa-mem3 vendors mimalloc v3.3.0 as a pinned submodule at ext/mimalloc and links it into every binary by default (USE_MIMALLOC=1). On Linux, static linkage uses --whole-archive; on macOS, dyld-interposed shared linkage is used.

Measured on AWS c7g.4xlarge (Graviton3, 16 threads, 29M 150 bp paired-end exome-capture reads vs hg38, page cache dropped between iterations): −19.7% wall-clock time (528.6 s → 424.7 s, a 1.24× speedup) compared to the same build with USE_MIMALLOC=0. No user-visible interface change; no runtime configuration required.

USE_MIMALLOC=0 is a supported best-effort opt-out and is CI-gated on Linux x86. bwa-mem3 version prints the mimalloc version string when it is active.

--supp-rep-hard-cap supplementary MAPQ rescoring (PR #56)

Supplementary alignments for a split read inherit MAPQ from the full-read scoring pipeline. Competing repetitive chains for the supplementary fragment are filtered out during full-read chain scoring (mem_chain_flt) before Smith-Waterman, so they never contribute to sub/sub_n. A supp fragment landing in a CCATCC repeat that would map equally well to 50+ locations standalone can therefore carry MAPQ=60 from its primary.

--supp-rep-hard-cap INT opts into rescoring: if any seed in a supplementary alignment’s chain has >=INT genome occurrences (from the SMEM SA count), the supplementary MAPQ is forced to 0. Primary alignment MAPQ and coordinates are unaffected. Default output (no flag) is byte-identical to upstream bwa-mem2.

The SMEM SA-occurrence count is preserved on each seed as mem_seed_t.n_hits and propagated to mem_alnreg_t.chain_n_hits during chain-to-alignment conversion. Typical values for INT are 5–20; lower is more aggressive. The upstream bwa-mem2#260 reporter case drops from MAPQ=60 to MAPQ=0 at --supp-rep-hard-cap 18. Closes issue #46.
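The decision rule itself reduces to a threshold test. The sketch below takes the per-seed occurrence counts as a plain vector; rescore_supp_mapq is a hypothetical name, and treating a non-positive cap as "flag not given" is an assumption made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// seed_hits holds the SA occurrence count of each seed in the chain behind
// one supplementary alignment. Returns the MAPQ to report for it.
int rescore_supp_mapq(int mapq, const std::vector<long>& seed_hits, long hard_cap) {
    assert(mapq >= 0);
    if (hard_cap <= 0) return mapq;  // assumed meaning: rescoring not requested
    const bool repetitive =
        std::any_of(seed_hits.begin(), seed_hits.end(),
                    [hard_cap](long n) { return n >= hard_cap; });
    return repetitive ? 0 : mapq;
}
```

Because the comparison is >=, a seed with exactly INT occurrences already triggers the cap, so lowering INT can only zero out more supplementary records.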

Shared-memory index: bwa-mem3 shm (PR #65)

bwa-mem3 mem reloads the FM-index from disk on every invocation. For hg38 the index is ~28 GB; for short alignment jobs (targeted panels, small sample batches) this load cost dominates runtime and makes per-invocation IOPS the bottleneck.

PR #65 ports the bwa shm command from bwa-mem v1 to bwa-mem3 with strict v1 CLI parity:

bwa-mem3 shm <index-prefix>    # load index into shared-memory segment once
bwa-mem3 mem <index-prefix> ...  # subsequent runs attach instead of re-reading
bwa-mem3 shm -d <index-prefix>  # detach and free the segment

The index lives in a POSIX shared-memory segment. Multiple bwa-mem3 mem processes on the same host share the same in-memory copy. Closes issue #64.

Warning — Stale index

bwa-mem3 shm does not detect when the on-disk index has been rebuilt. Always run bwa-mem3 shm -d <prefix> before running bwa-mem3 index and then re-stage with bwa-mem3 shm <prefix>. Using a stale shared-memory segment produces silently wrong alignments.

bwa-mem3 shm --meth (PR #67)

bwa-mem3 mem --meth <prefix> auto-appends .bwameth.c2t to locate the methylation index built by bwa-mem3 index --meth <prefix>. Before PR #67, staging a methylation index in shared memory required passing the full .bwameth.c2t-suffixed path to shm while continuing to pass the plain prefix to mem. The mismatch was easy to forget, and the failure mode — a run that silently attached the wrong segment — was difficult to diagnose.

PR #67 adds --meth support to bwa-mem3 shm so the same plain-prefix convention works end-to-end:

bwa-mem3 shm --meth ref.fa       # stages ref.fa.bwameth.c2t
bwa-mem3 mem --meth ref.fa ...   # attaches automatically
bwa-mem3 shm -d --meth ref.fa   # detaches

HN:i hit count tag (PR #42)

Every primary SAM/BAM record now carries an HN:i:<n> tag reporting the number of secondary alignment candidates clustered with this primary under XA_drop_ratio. This count is captured before the -h/max_XA_hits cap truncates the XA:Z: string, so HN reports the true number of alternate loci even when no XA:Z: field appears in the record.

This makes it possible to distinguish:

  • HN:i:0 + no XA:Z: — genuinely unique mapper.
  • HN:i:N + XA:Z:... (N ≤ -h) — multi-mapper with all alternates listed.
  • HN:i:N + no XA:Z: (N > -h) — multi-mapper whose alternates were suppressed by the cap.

Motivated by lh3/bwa#438, which adds HN to bwa aln. HN is emitted in both SAM (mem_aln2sam) and BAM (mem_aln_to_bam) paths and is absent when -a (MEM_F_ALL) is active.

--bam=LEVEL direct BAM output (PR #12)

bwa-mem3 mem --bam (or --bam=0 through --bam=9) emits BAM directly via htslib, bypassing the SAM-text-to-BAM conversion round trip that normally occurs when the output is piped to samtools view -bS.

  • --bam / --bam=0: uncompressed BAM (BGZF framing only) — near-zero CPU overhead, smaller than SAM text, fast downstream parsing.
  • --bam=1..9: BGZF deflate at the specified level.
  • No flag: SAM text on stdout (default, unchanged).

The implementation adds src/bam_writer.{h,cpp}, a new module that converts mem_aln_t to bam1_t via mem_aln_to_bam. htslib v1.21 is pulled in as a submodule at ext/htslib. On the bwameth.py example fixture (92,961 records), samtools view of --bam output vs SAM text produces a zero-line diff across all 11 SAM columns and all aux tags. See Best Practices → Output format for the recommended pipeline.


Changes catalog

| Item | bwa-mem3 PR | Upstream PR/issue | Status |
|---|---|---|---|
| --meth bisulfite alignment mode | #13 | | fork-only |
| Vendored mimalloc allocator | #19 | | fork-only |
| --supp-rep-hard-cap MAPQ rescoring | #56 | bwa-mem2#260 | fork-only (upstream issue open) |
| bwa-mem3 shm shared-memory index | #65 | | fork-only |
| shm --meth symmetry | #67 | | fork-only |
| HN:i hit count tag | #42 | lh3/bwa#438 | fork-only (analogous to bwa aln) |
| --bam=LEVEL direct BAM output | #12 | | fork-only |

See also: Methylation Reference → Overview · User Guide → Memory allocator · User Guide → Output: SAM/BAM, headers, tags · Getting Started → Quick start: shared-memory index · Best Practices → Output format

Architecture Support

This page covers the architecture-specific build and runtime work carried in bwa-mem3. The goal is a single codebase that builds cleanly on all supported targets and runs the best available SIMD kernels on each.

For the full dispatch matrix and runtime selection logic, see Performance → SIMD dispatch matrix and Developer Guide → SIMD dispatch architecture.

Linux ARM64 / aarch64 build (PR #1)

The Apple Silicon work that reached the fork in commit ae73227 gated ARM behavior on $(UNAME_M) == arm64. On macOS, uname -m returns arm64. On Linux ARM64, it returns aarch64. The Makefile’s ifeq check therefore fell through to the x86 multi target on every Linux aarch64 host, failing with:

g++: error: unrecognized command-line option '-msse'

PR #1 introduces an IS_ARM variable ($(filter $(UNAME_M),arm64 aarch64)) that matches both names. All four architecture-conditional blocks in the Makefile are rewritten to use IS_ARM: the NEON/sse2neon flag block, the x86 arch-specific block, the ARM64 single-binary build block, and the multi target ARM64 short-circuit. The CI workflow is extended to trigger on pushes to fg-main (the integration branch) and adds an ubuntu-24.04-arm matrix row so the aarch64 path is exercised on every PR.

arch=avx512bw explicit build target (PR #16)

The AVX-512 Smith-Waterman kernels in bwa-mem2 are guarded by the __AVX512BW__ preprocessor macro — not __AVX512F__. The only way to build them before this PR was arch=avx512, but make multi emitted the dispatch binary as bwa-mem2.avx512bw. The build selector (avx512), the preprocessor guard (__AVX512BW__), and the dispatcher suffix (.avx512bw) disagreed.

PR #16 adds arch=avx512bw as an explicit Makefile target with flags -mavx512f -mavx512bw and switches make multi to invoke arch=avx512bw when emitting bwa-mem3.avx512bw. The legacy arch=avx512 is preserved as an alias with identical flags. No C++ is changed; the fix is 11 insertions and 2 deletions in the Makefile.

This is a pure build-correctness fix: before PR #16, arch=avx512bw and arch=multi builds on AVX-512BW hardware silently compiled the wrong kernel (see Correctness → AVX-512BW dispatch guard for the downstream effect).

NEON kswv mate-rescue (PR #18)

bwa-mem2 has a batched mate-rescue Smith-Waterman path (BWAMEM_BATCHED_MATESW) that uses SIMD kswv kernels to score rescue candidates in parallel. The path was gated on __AVX512BW__, which is never defined on ARM64, so the NEON kswv::getScores8 kernel existed in the source but was unreachable in production.

PR #18 enables this path on ARM64 by replacing the __AVX512BW__ gate with a new BWAMEM_BATCHED_MATESW macro that fires on NEON/Apple Silicon as well. Along the way, four kernel bugs were found and fixed:

  1. te split: the te (target end) value needed separate hi/lo tracking for 16-lane u8 batches.
  2. Freeze mask: a frozen_vec mask now gates gmax/te/qe updates after KSW_XSTOP fires, preventing stale values from escaping to the score2 scan.
  3. Per-lane score2 exclusion: the len1, low/high, and qe masks were not applied per-lane in Loop 1, allowing lanes without a valid primary to contribute spurious suboptimal scores.
  4. minsc filter on rowMax: sub-minsc plateau scores were leaking into score2 because the scalar ksw_u8 gating condition (imax >= minsc) was not replicated.

Measured on an M-series Mac (8 threads, 500k PE 100 bp reads on chr17): 1.42× speedup (−29.4% wall time) with byte-identical sorted SAM output.

AVX2 kswv mate-rescue (PR #20)

PR #18 enabled batched mate-rescue on ARM64. Most x86 production deployments (AWS c6a, c6i, older Xeons) use AVX2 without AVX-512BW and were excluded from the same gate. PR #20 extends the batched path to AVX2 by adding a 256-bit kswv256_u8 kernel and widening BWAMEM_BATCHED_MATESW to fire on __AVX2__.

The AVX2 kernel is a direct port of the corrected NEON kernel from PR #18, with an additional fix for per-lane te2 tracking (_mm256_blendv_epi8 on a sign-extended 8→16 bit mask). Verified byte-identical sorted SAM vs the pre-BWAMEM_BATCHED_MATESW scalar control on EC2 m5.xlarge (Skylake-SP, 4 threads, 500k chr17 PE pairs).

Note: PR #20 introduced a score2 plateau regression in the AVX2 kernel. In the correctness series, PR #27 temporarily disabled the AVX2 batched path, PR #28 fixed the kernel, and PR #29 ported the same fix to the NEON and AVX-512BW 8-bit kernels.


Changes catalog

| Item | bwa-mem3 PR | Upstream PR/issue | Status |
|---|---|---|---|
| Linux ARM64 / aarch64 build + CI | #1 | bwa-mem2#288 | fork-only (upstream PR open) |
| arch=avx512bw explicit target | #16 | | fork-only |
| NEON kswv mate-rescue kernel | #18 | | fork-only |
| AVX2 kswv mate-rescue kernel | #20 | | fork-only |

See also: Performance → SIMD dispatch matrix · Developer Guide → SIMD dispatch architecture · Developer Guide → Apple Silicon / NEON port · Correctness fixes · Performance → PGO build

Build & Infrastructure

This page covers the build-system, testing, and CI infrastructure changes carried in bwa-mem3 on top of upstream bwa-mem2.

doctest framework and Codecov (PR #34)

PR #34 establishes the long-term test infrastructure for bwa-mem3:

  • doctest 2.4.11 is vendored as a single-header under ext/doctest/, with the SHA256 recorded in ext/doctest/VERSION.
  • A new test/framework/ static library provides shared helpers: scoring matrices, deterministic sequence-pair generators, kswv-style batch packers, scalar and SIMD runners, kswr comparators, a JUnit reporter hook, and a shared main.
  • Two test binaries are produced: bwa_mem3_tests_unit (runs on every CI matrix row) and bwa_mem3_tests_integration (runs on a subset of rows).
  • The existing kswv_selftest is ported to test/unit/test_kswv_correctness.cpp — 30,049 assertions against scalar ksw_align2 on 10k random plus curated edge pairs.
  • Five legacy integration sources are moved to test/integration/ via git mv; their binaries still emit at test/<name> so existing scripts keep working.
  • Five inline CI bash regression blocks are extracted to test/regression/*.sh (phix_parity, chr22_parity, thread_determinism, bam_roundtrip, meth_oracle).
  • A coverage CI job builds libbwa.a and both test binaries with COVERAGE=1 (-O0 --coverage), runs both test binaries, collects Cobertura XML via gcovr, and uploads to Codecov via codecov/codecov-action.

PACKAGE_VERSION from git describe (PR #52)

Before PR #52, src/main.cpp hardcoded PACKAGE_VERSION "2.2.1". This string appeared in bwa-mem3 version output and in the @PG VN: SAM header field but was never updated, causing every build to report an outdated version.

The Makefile now generates src/version.h from git describe --tags --dirty, falling back to a static FG_LABS_VERSION_FALLBACK when git describe cannot reach a tag (source-tarball extractions, shallow clones — e.g. CI with the default fetch-depth: 1). A write-if-changed mechanism (cmp -s + mv) regenerates the file on every invocation but only bumps its mtime when the stamped string changes, so only main.o is rebuilt when the version changes, not the entire tree. src/version.h is .gitignored and removed by make clean. Fixes issue #40. Related upstream: bwa-mem2#283, bwa-mem2#284.

PGO target parameterization (PR #59)

The original pgo-generate and pgo-use Makefile targets hardcoded arch=arm64 and a single shared pgo_profiles/ directory. PR #59 generalizes both:

  • PGO_ARCH (default: arm64 on ARM hosts, native otherwise) passes through to the recursive make invocation as arch=$(PGO_ARCH). Accepts the same values as the rest of the Makefile: arm64, sse41, avx2, avx512bw, native, etc.
  • PGO_PROFILE_DIR is now overridable (?= instead of =). Each (arch × training-regime) combination can capture into its own directory.
  • When PGO_ARCH != arm64, the output binaries are named bwa-mem3.pgo-instr.<arch> and bwa-mem3.pgo.<arch> so multiple per-arch PGO builds coexist. The default arm64 names are unchanged for backward compatibility.
  • pgo-clean now removes arch-suffixed PGO binaries in addition to bare names.

This enables the benchmarking workflow at bwa-mem3-bench, which requires per-arch × per-regime profile capture. See also Performance → PGO build.

CXXFLAGS/CPPFLAGS/LDFLAGS forwarding (PR #50)

The Makefile’s multi: rule compiled runsimd.cpp (the x86 multi-binary launcher) without honoring CXXFLAGS, CPPFLAGS, or LDFLAGS. The $(EXE) link honored CXXFLAGS and LDFLAGS but not CPPFLAGS.

PR #50 mirrors upstream bwa-mem2#290: the multi: compile now honors all three variables, and $(EXE) link adds $(CPPFLAGS). This allows downstream packagers (Debian, Bioconda) and reproducible-build systems to inject hardening flags (-D_FORTIFY_SOURCE=2, -fstack-protector-strong, -Wl,-z,relro) through the environment without patching the Makefile. No functional change unless the env vars are set. Closes issue #39.

Unit-test harness and ARM CI (PR #23)

PR #23 added a local bash harness (test/run_unit_tests.sh) that built and ran the five C++ unit binaries under test/ against committed fixtures in test/fixtures/, asserting exit 0 and non-empty, diff-able output. Those binaries have since been consolidated into the doctest harness described above. The PR also fixed several pre-existing issues that blocked the harness:

  • test/Makefile defaulted to icpc (Intel compiler, not available on GitHub runners); changed to g++ on Linux x86.
  • ARM flags are mirrored from the parent Makefile so cd test && make builds on macOS arm64 and Linux aarch64.
  • Three test sources (smem2_test, bwt_seed_strategy_test, sa2ref_test) were missing the fmiSearch->load_index() call that fmi_test.cpp has, causing immediate segfaults on run.
  • test/main_banded.cpp opened fksw.txt but never wrote to it; output is now written and main() returns 0 on success.
  • Fixtures are added under test/fixtures/ covering phiX174, 50 bp test reads, BWT seed strategy inputs, SA pairs, and SW pairs.

CI matrix expansion (PR #24)

PR #24 stacks on PR #23 and expands the GitHub workflow .github/workflows/ci.yml from 5 matrix rows to 7:

| Row | Runner | Arch | Role |
|---|---|---|---|
| 1 | ubuntu-latest | sse41 | smoke + unit tests |
| 2 | ubuntu-latest | avx2 | canonical deep tests |
| 3 | ubuntu-latest | avx2 (no mimalloc) | unchanged |
| 4 | ubuntu-24.04-arm | arm64 | unchanged |
| 5 | macos-latest | arm64 | unchanged |
| 6 (new) | ubuntu-latest | multi | runsimd dispatcher smoke |
| 7 (new) | ubuntu-latest | avx2 clang++ | Linux Clang smoke |

The canonical row (row 2) adds: --bam=6 roundtrip record-count parity, thread-determinism (-t1 vs -t4 sorted diff), unit-test harness, chr22 pipeline parity vs bwa, SE smoke, interleaved smoke, and --meth Layers 1–3.


Changes catalog

| Item | bwa-mem3 PR | Upstream PR/issue | Status |
|---|---|---|---|
| doctest framework + Codecov | #34 | | fork-only |
| PACKAGE_VERSION from git describe | #52 | bwa-mem2#283, bwa-mem2#284 | fork-only (upstream issue + PR open) |
| PGO target parameterization | #59 | | fork-only |
| CXXFLAGS/CPPFLAGS/LDFLAGS forwarding | #50 | bwa-mem2#290 | fork-only (mirrors open upstream PR) |
| Unit-test harness + ARM CI | #23 | | fork-only |
| CI matrix expansion | #24 | | fork-only |

See also: Developer Guide → Regression test framework · Developer Guide → Release process · Performance → PGO build · Performance improvements · Upstream PR status

Upstream PR Status

This table cross-references every change carried in bwa-mem3 main to its corresponding upstream bwa-mem2 PR or issue. “Fork-only” means no upstream PR exists; the change may be submitted upstream in the future or may be fork-specific by design. “Open” means the upstream PR or issue existed at the time of bwa-mem3’s implementation but had not been merged. Upstream status is current as of the bwa-mem3 0.1.0-pre release.

For prose descriptions of each change, follow the links in the “bwa-mem3 PR” column to the relevant deep-dive page section.

Full cross-reference table

| Topic | bwa-mem3 PR | Upstream PR / Issue | Upstream status |
|---|---|---|---|
| Correctness | | | |
| @PG CL: tab escaping | #54 | bwa-mem2#293 | open issue |
| SMEM buffer overflow on >151 bp reads | #55 | bwa-mem2#238, bwa-mem2#210 | PR closed without merge; issue open |
| kswv nrow==0 guard (all 5 kernels) | #51 | bwa-mem2#289 | open PR (upstream covers AVX-512BW only) |
| AVX-512BW dispatch guard (!__AVX512BW__) | #26 | | fork-only |
| AVX2 score2 plateau consolidation | #28 | | fork-only |
| NEON + AVX-512BW 8-bit score2 fix | #29 | | fork-only |
| AVX-512BW 16-bit score2 fix | #30 | | fork-only |
| NEON 16-bit kernel rewrite | #31 | | fork-only |
| kseq2bseq1 zero-initialization | #22 | | fork-only |
| Proper-pair flag from emitted alignment | #17 | | fork-only |
| Performance | | | |
| Lockstep SMEM batching | #33 | | fork-only |
| Batched -H header ingestion (O(n) fix) | #49 | bwa-mem2#204 | open PR |
| libsais FM-index construction | #57 | | fork-only |
| Consolidated mapping speedups | #58 | | fork-only |
| Features | | | |
| --bam=LEVEL direct BAM output | #12 | | fork-only |
| --meth bisulfite alignment mode | #13 | | fork-only |
| Vendored mimalloc allocator | #19 | | fork-only |
| HN:i hit count tag | #42 | lh3/bwa#438 | analogous to bwa aln; no direct upstream port |
| --supp-rep-hard-cap MAPQ rescoring | #56 | bwa-mem2#260 | open issue |
| bwa-mem3 shm shared-memory index | #65 | | fork-only (v1 feature port) |
| shm --meth symmetry | #67 | | fork-only |
| Architecture support | | | |
| Linux ARM64 / aarch64 build + CI | #1 | bwa-mem2#288 | open PR |
| arch=avx512bw explicit Makefile target | #16 | | fork-only |
| NEON kswv mate-rescue kernel | #18 | | fork-only |
| AVX2 kswv mate-rescue kernel | #20 | | fork-only |
| Build & infrastructure | | | |
| doctest framework + Codecov | #34 | | fork-only |
| PACKAGE_VERSION from git describe | #52 | bwa-mem2#283, bwa-mem2#284 | open issue + open PR |
| PGO target parameterization | #59 | | fork-only |
| CXXFLAGS/CPPFLAGS/LDFLAGS forwarding | #50 | bwa-mem2#290 | open upstream PR |
| Unit-test harness + ARM CI | #23 | | fork-only |
| CI matrix expansion (7 rows) | #24 | | fork-only |

Upstream issues tracked but not yet fixed in bwa-mem3

The following upstream issues are tracked in the bwa-mem3 issue list but do not yet have corresponding fixes in main:

| Issue | Upstream reference | Notes |
|---|---|---|
| Split-alignment evidence loss vs bwa 0.7.17 | bwa-mem2#273 | issue #47 (under investigation) |
| MAPQ/coordinate parity vs bwa mem 0.7.18 | bwa-mem2#262, bwa-mem2#246, bwa-mem2#239 | issue #48 (tracking only) |

See also: Correctness fixes · Performance improvements · Features · Architecture support · Build & infrastructure

Building from source

This page documents every build target available in the Makefile and what each produces. For the recommended production build workflow see Best Practices → Build.

Prerequisites

  • A C++14-capable compiler: GCC 7+ or Clang 6+ on Linux; Clang 15+ (Xcode) on macOS.
  • GNU make 3.81+.
  • CMake 3.12+ (required only when USE_MIMALLOC=1, which is the default).
  • libomp (macOS only): brew install libomp. libsais uses OpenMP for parallel suffix-array construction.
  • Git submodules initialised: git submodule update --init --recursive.

Warning — Submodules must be present

The build will fail with a clear error message if any of the required submodules (ext/libsais, ext/htslib, ext/safestringlib, ext/mimalloc, ext/sse2neon) are missing. Always clone with --recursive or run git submodule update --init --recursive before make.
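As a rough illustration of the check described above, the following sketch tests for the five submodule directories before invoking make. The function name check_submodules is hypothetical and not part of the Makefile; only the directory list comes from the warning above. It treats an absent or empty directory as an uninitialised submodule.

```shell
#!/bin/sh
# Hypothetical pre-build helper (not part of the bwa-mem3 Makefile).
# check_submodules ROOT: prints each missing or empty submodule directory
# under ROOT and returns non-zero if any is missing.
check_submodules() {
    root=$1
    missing=0
    for dir in ext/libsais ext/htslib ext/safestringlib ext/mimalloc ext/sse2neon; do
        # An uninitialised submodule is an empty directory, or absent altogether.
        if [ ! -d "$root/$dir" ] || [ -z "$(ls -A "$root/$dir" 2>/dev/null)" ]; then
            echo "missing submodule: $dir"
            missing=1
        fi
    done
    return "$missing"
}

# Usage: run from the repo root before `make`.
check_submodules . || echo "run: git submodule update --init --recursive"
```

The build's own error message remains the authoritative check; a helper like this is only useful for failing early in wrapper scripts.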

Standard builds

Default build (host-native)

make

On x86 hosts this is equivalent to make multi (see below). On Apple Silicon and other aarch64 hosts the Makefile detects the architecture and builds a single ARM64 binary instead.

The resulting binary is bwa-mem3 in the repo root.

Single-arch x86 builds

Pass arch=<target> to compile a single binary with a specific ISA level:

| Command | SIMD level | ARCH_FLAGS |
| --- | --- | --- |
| make arch=sse41 | SSE4.1 | -msse … -msse4.1 |
| make arch=sse42 | SSE4.2 | -msse … -msse4.2 |
| make arch=avx | AVX | -mavx |
| make arch=avx2 | AVX2 | -mavx2 |
| make arch=avx512bw | AVX-512BW | -mavx512f -mavx512bw |
| make arch=native | host CPU features | -march=native |

For Intel compiler (icpc / icpx) the flags differ slightly; see the Makefile for the ifeq ($(CXX), icpc) branches.

Multi-binary x86 build (default on x86)

make multi

Builds five ISA-specific binaries (bwa-mem3.sse41, bwa-mem3.sse42, bwa-mem3.avx, bwa-mem3.avx2, bwa-mem3.avx512bw) plus the thin launcher bwa-mem3 that execs the best-matching binary at runtime. See Multi-binary launcher for details.

ARM64 / Apple Silicon build

make arch=arm64

Compiles a single binary bwa-mem3.arm64 and creates a symlink bwa-mem3 -> bwa-mem3.arm64. See Apple Silicon / NEON port for background.

Tuned builds

Profile-Guided Optimization (PGO)

PGO produces the best single-binary performance. The workflow is two-phase:

# Phase 1: instrument binary
make pgo-generate                              # builds bwa-mem3.pgo-instr (arm64 default)
make pgo-generate PGO_ARCH=avx2               # or a specific x86 target

# Run your training workload with the instrumented binary
./bwa-mem3.pgo-instr mem -t 16 ref.fa r1.fq.gz r2.fq.gz > /dev/null

# Phase 2: optimised binary
make pgo-use                                   # builds bwa-mem3.pgo
make pgo-use PGO_ARCH=avx2                     # matching arch

PGO_ARCH accepts the same values as arch=. PGO_PROFILE_DIR defaults to pgo_profiles/ but can be overridden. Output binaries are named bwa-mem3.pgo (default arch) or bwa-mem3.pgo.<arch> when a non-default arch is specified, so multiple arch builds coexist.

Clean up instrumented objects and profile data:

make pgo-clean

Link-Time Optimization (LTO)

make lto-build                                 # builds bwa-mem3.lto (native arch)
make lto-build LTO_ARCH=avx2                   # explicit arch

LTO compiles bwa-mem3’s own translation units with -flto (thin LTO on Clang, full LTO on GCC) plus -fno-semantic-interposition on GCC. Third-party libraries (htslib, mimalloc, safestringlib) are linked without LTO. Clean:

make lto-clean

Compute-only profile binary

Used when profiling CPU hotspots without I/O noise. The -DDISABLE_OUTPUT flag short-circuits all BAM/SAM write paths and the file-open / header-emit step, so only alignment work contributes to wall time.

make profile-build                             # builds bwa-mem3.profile (native)
make profile-build PROFILE_ARCH=avx2          # explicit arch
./bwa-mem3.profile mem -t 16 ref.fa r1.fq.gz r2.fq.gz

make profile-clean

Build knobs

| Variable | Default | Effect |
| --- | --- | --- |
| USE_MIMALLOC | 1 | Include mimalloc; set 0 to use the system allocator |
| ASAN | (unset) | Set to any non-empty value to enable AddressSanitizer (forces USE_MIMALLOC=0) |
| COVERAGE | (unset) | Set to enable --coverage + -O0 for gcov line-level coverage |
| EXTRA_CXXFLAGS | (empty) | Appended to CXXFLAGS; forwarded through PGO / LTO targets |
| DISABLE_BATCHED_MATESW | (unset) | Set to 1 to disable the batched mate-rescue SW path on ARM |
| CXX | c++ | Compiler. Paired CC is auto-derived from CXX for libsais. |

Cleaning

make clean

Removes object files, libbwa.a, all binaries, test binaries, libsais objects, safestringlib, htslib, and the mimalloc build tree.

make docs-clean

Removes only the mdbook build output (docs/book/). It is mentioned here for completeness; the Makefile's docs targets are summarised under Documentation targets.

Documentation targets

| Target | Action |
| --- | --- |
| make docs | Build the mdbook into docs/book/ |
| make docs-serve | Live-preview at http://localhost:3000 |
| make docs-cli | Capture --help output for each subcommand into docs/_generated/cli/ |
| make docs-clean | Remove docs/book/ |
| make docs-install-tools | cargo install mdbook + three plugins |

See also: SIMD dispatch architecture · Multi-binary launcher · Best Practices → Build · Performance → PGO build · Apple Silicon / NEON port

SIMD dispatch architecture

bwa-mem3 uses two complementary mechanisms to select the best available SIMD code path: a run-time multi-binary launcher on x86 (handled separately in Multi-binary launcher) and compile-time conditional compilation inside each kernel, mediated by src/simd_compat.h.

This page covers the compile-time layer: what the macros do, which kernels are vectorised at each ISA level, and how the dispatch decision flows.

The simd_compat.h abstraction layer

src/simd_compat.h is the single point where platform detection and intrinsic selection occur. It is included by every file that touches SIMD code. The header resolves to one of four paths:

| Platform | Branch condition | Intrinsic headers |
| --- | --- | --- |
| ARM / Apple Silicon | __ARM_NEON or __aarch64__ | sse2neon.h (translation) + <arm_neon.h> (native) |
| x86 AVX-512BW | __AVX512BW__ | <immintrin.h> |
| x86 AVX2 | __AVX2__ | <immintrin.h> |
| x86 SSE4.1 / SSE2 | __SSE4_1__ or __SSE2__ | <smmintrin.h> + <emmintrin.h> |

The ARM path defines APPLE_SILICON as 1, sets SIMD_WIDTH8 = 16 and SIMD_WIDTH16 = 8 (the lane counts of a 128-bit NEON register), defines a posix_memalign-backed _mm_malloc replacement that enforces the 128-byte Apple Silicon cache-line alignment, and provides two optimised NEON helpers for patterns that sse2neon does not translate efficiently:

  • _mm_movemask_epi16 — extracts the MSB of each 16-bit element using vshrq_n_u16 + vmovn_u16 + position-weighted vaddv_u8, replacing the _mm_movemask_epi8(v) & 0xAAAA pattern used in bandedSWA.cpp.
  • _mm_blendv_epi16_fast — a bitwise select on 16-bit elements via NEON vbslq_s16, replacing the OR/AND/ANDNOT sequence sse2neon emits for _mm_blendv_epi8.

SIMD_WIDTH8 and SIMD_WIDTH16 control the lane counts in kswv.cpp and bandedSWA.cpp. On x86 they are set by the architecture-specific header rather than here; the macros differ per ISA level:

| ISA | SIMD_WIDTH8 | SIMD_WIDTH16 |
| --- | --- | --- |
| SSE4.1 | 16 | 8 |
| AVX2 | 32 | 16 |
| AVX-512BW | 64 | 32 |
| ARM NEON | 16 | 8 |
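The values in the table follow from dividing the vector register width by the element width. A minimal sketch of that arithmetic, assuming the usual register sizes (128 bits for SSE and NEON, 256 for AVX2, 512 for AVX-512); the helper name simd_lanes is illustrative only:

```shell
#!/bin/sh
# simd_lanes REGISTER_BITS ELEMENT_BITS: number of elements per vector register.
simd_lanes() {
    echo $(( $1 / $2 ))
}

# Register widths are the standard ones for each ISA, not read from the sources.
for isa in "SSE4.1 128" "AVX2 256" "AVX-512BW 512" "ARM-NEON 128"; do
    set -- $isa   # split "NAME BITS" into $1 and $2
    printf '%-10s SIMD_WIDTH8=%-3s SIMD_WIDTH16=%s\n' \
        "$1" "$(simd_lanes "$2" 8)" "$(simd_lanes "$2" 16)"
done
```

Writing kernel loops in terms of these two macros, rather than literal lane counts, is what lets the same source compile at every ISA level.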

Dispatch diagram

The full dispatch decision, from the shell to a kernel instruction, follows this flow:

flowchart TD
    A[User runs: bwa-mem3 mem ...] --> B{Platform}

    B -- ARM / Apple Silicon --> C[Single binary\nbwa-mem3.arm64]
    B -- x86 --> D[Launcher: bwa-mem3\nsrc/runsimd.cpp]

    D --> E{cpuid: best ISA}
    E -- AVX-512BW --> F1[exec bwa-mem3.avx512bw]
    E -- AVX2 --> F2[exec bwa-mem3.avx2]
    E -- AVX --> F3[exec bwa-mem3.avx]
    E -- SSE4.2 --> F4[exec bwa-mem3.sse42]
    E -- SSE4.1 --> F5[exec bwa-mem3.sse41]

    F1 & F2 & F3 & F4 & F5 --> G[main.cpp\ncompiled with matching ARCH_FLAGS]
    C --> G

    G --> H{Kernel call}

    H -- kswv\nbatched SW --> I[kswv.cpp\nSIMD_WIDTH8/16 from simd_compat.h]
    H -- bandedSWA\nmate-rescue --> J[bandedSWA.cpp\nblendv / movemask from simd_compat.h]
    H -- FM-index\nbackward extension --> K[FMI_search.cpp\n__builtin_popcountl — not SIMD]
    H -- libsais\nBWT construction --> L[libsais.c\nOpenMP parallel SA-IS]

    I --> M[SIMD instructions\nat ISA level of this binary]
    J --> M

Per-kernel vectorisation status

| Kernel | SSE4.1 | AVX2 | AVX-512BW | ARM NEON |
| --- | --- | --- | --- | --- |
| kswv (batched Smith-Waterman) | vectorised | vectorised (2x width) | vectorised (4x width) | native NEON |
| bandedSWA (banded SW / mate-rescue) | vectorised | vectorised | vectorised | native NEON blendv |
| FMI_search (FM-index backward ext.) | scalar | scalar | scalar | scalar |
| libsais (BWT / SA construction) | OpenMP only | OpenMP only | OpenMP only | OpenMP only |
| bam_writer (BAM serialisation) | | | | |

FMI_search is memory-bound with sequential pointer-chasing dependencies; adding SIMD to it produces no measurable speedup. libsais benefits from OpenMP-parallel induced sorting but not from SIMD widening within a single thread.

Adding a new SIMD kernel

  1. Include simd_compat.h rather than any platform intrinsic header directly.
  2. Use SIMD_WIDTH8 / SIMD_WIDTH16 for lane-count arithmetic so the code compiles correctly across all ISA levels.
  3. For ARM-specific optimisations, gate them with #ifdef APPLE_SILICON (or #if defined(__ARM_NEON)) and provide a simd_compat.h-routed fallback for x86.
  4. Verify correctness on at least SSE4.1 (lowest supported x86 level) and ARM64 using make test.

Tip — Testing SIMD correctness

The kswv unit tests in test/unit/test_kswv*.cpp use synthetic sequence-pair generators that drive edge cases (empty batches, nrow==0, homopolymers) across every SIMD width. Run them with ./test/bwa_mem3_tests_unit --test-suite="unit/kswv" after modifying any vectorised kernel.


See also: Multi-binary launcher · Apple Silicon / NEON port · Building from source · Performance → SIMD dispatch matrix · Regression test framework

Multi-binary launcher (x86)

On x86 Linux and x86 macOS, bwa-mem3 is a thin launcher binary rather than the aligner itself. Its sole job is to detect the host CPU’s capabilities at startup and exec the best-matching ISA-specific binary in the same directory.

ARM / Apple Silicon does not use this mechanism. The make arm64 target creates a symlink bwa-mem3 -> bwa-mem3.arm64; there is only one NEON instruction-set level on all current ARM64 CPUs.

What make multi produces

make multi

Produces six files in the repo root:

| File | ISA | ARCH_FLAGS |
| --- | --- | --- |
| bwa-mem3 | launcher (no aligner code) | compiled from src/runsimd.cpp |
| bwa-mem3.sse41 | SSE4.1 | -msse -msse2 -msse3 -mssse3 -msse4.1 |
| bwa-mem3.sse42 | SSE4.2 | adds -msse4.2 |
| bwa-mem3.avx | AVX | -mavx |
| bwa-mem3.avx2 | AVX2 | -mavx2 |
| bwa-mem3.avx512bw | AVX-512BW | -mavx512f -mavx512bw |

The six binaries must reside in the same directory for the launcher to find them.

How the launcher selects a binary

src/runsimd.cpp calls cpuid (via the __cpuid intrinsic or a hand-rolled CPUID wrapper) to read the CPU’s feature flags and picks the highest ISA level supported by the CPU:

  1. Check CPUID leaf 7 for AVX-512BW → exec bwa-mem3.avx512bw
  2. Check CPUID leaf 7 for AVX2 → exec bwa-mem3.avx2
  3. Check CPUID leaf 1 for AVX → exec bwa-mem3.avx
  4. Check CPUID leaf 1 for SSE4.2 → exec bwa-mem3.sse42
  5. Fallback → exec bwa-mem3.sse41
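The priority order above can be mimicked in a few lines of shell. This is only a sketch of the selection logic, not the launcher's code: the real launcher queries CPUID directly, whereas this version assumes the feature names Linux exposes in the flags line of /proc/cpuinfo (avx512bw, avx2, avx, sse4_2). The function name pick_isa_binary is hypothetical.

```shell
#!/bin/sh
# pick_isa_binary "FLAGS": prints the binary the priority order would select,
# given a space-separated list of CPU feature names.
pick_isa_binary() {
    flags=" $1 "   # pad so every flag is delimited by spaces on both sides
    case $flags in
        *" avx512bw "*) echo bwa-mem3.avx512bw ;;
        *" avx2 "*)     echo bwa-mem3.avx2 ;;
        *" avx "*)      echo bwa-mem3.avx ;;
        *" sse4_2 "*)   echo bwa-mem3.sse42 ;;
        *)              echo bwa-mem3.sse41 ;;   # fallback, as in step 5
    esac
}

# On an x86 Linux host the flags can be taken from /proc/cpuinfo.
if [ -r /proc/cpuinfo ]; then
    pick_isa_binary "$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)"
fi
```

Comparing this output with the banner printed by mem is a quick way to confirm that the launcher picked the binary you expected.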

The launcher calls execv with the same argv that was passed to it. The selected binary’s main() therefore receives the original arguments unchanged. The @PG CL: tag in the output SAM/BAM records the original invocation, not the ISA-suffixed binary name.

Note — exec replaces the process

The launcher does not fork. It calls execv(), which replaces the launcher process image with the ISA-specific binary. There is no wrapper process resident in memory during alignment.

Using a specific ISA binary directly

You can bypass the launcher and invoke a specific binary directly:

./bwa-mem3.avx2 mem -t 16 ref.fa r1.fq.gz r2.fq.gz

This is useful when benchmarking a particular ISA level, testing a regression, or deploying in an environment where only one binary is installed. The ISA-specific binary behaves identically to the launcher output for that ISA — there is no functional difference.

Distribution layout

When packaging or deploying on x86, include all five ISA binaries plus the launcher in the same directory:

bin/
  bwa-mem3           ← launcher
  bwa-mem3.sse41
  bwa-mem3.sse42
  bwa-mem3.avx
  bwa-mem3.avx2
  bwa-mem3.avx512bw

On ARM, only bwa-mem3 (the symlink) and bwa-mem3.arm64 are needed.
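For x86 packaging scripts, a sanity check along the following lines can catch an incomplete layout before deployment. It is a sketch under the assumption that all six files listed above must be present and executable; check_layout is a hypothetical helper, not something shipped with bwa-mem3.

```shell
#!/bin/sh
# check_layout DIR: reports any of the six x86 files that is missing from DIR
# or not executable; returns non-zero if the layout is incomplete.
check_layout() {
    dir=$1
    status=0
    for name in bwa-mem3 bwa-mem3.sse41 bwa-mem3.sse42 \
                bwa-mem3.avx bwa-mem3.avx2 bwa-mem3.avx512bw; do
        if [ ! -x "$dir/$name" ]; then
            echo "missing or not executable: $name"
            status=1
        fi
    done
    return "$status"
}

# Usage: point it at the install directory.
check_layout ./bin || echo "layout incomplete"
```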

The mem SIMD banner

After selecting and executing a binary, the mem subcommand prints a short banner to stderr before alignment begins:

-----------------------------
Executing in AVX2 mode!!
-----------------------------

The banner text is set at compile time via #if __AVX512BW__ / #elif __AVX2__ / … preprocessor guards in src/main.cpp. This confirms at runtime which ISA path is active.


See also: SIMD dispatch architecture · Apple Silicon / NEON port · Building from source · Performance → SIMD dispatch matrix

Apple Silicon / NEON port

bwa-mem3 supports ARM64 (Apple Silicon and Linux aarch64) as a first-class build target. The port uses the sse2neon translation shim as a baseline and replaces the two most performance-critical SSE paths with native NEON intrinsics.

Architecture overview

The ARM build compiles a single binary (bwa-mem3.arm64) rather than a family of ISA-specific binaries. There is only one NEON instruction-set level on all current ARM64 CPUs, so the multi-binary launcher used on x86 is not needed. make arm64 builds the binary and creates a symlink bwa-mem3 -> bwa-mem3.arm64.

sse2neon shim

ext/sse2neon/sse2neon.h is a header-only library that maps Intel SSE intrinsics to their NEON equivalents. When APPLE_SILICON=1 is defined (set automatically when uname -m is arm64 or aarch64), src/simd_compat.h includes sse2neon and defines the SSE feature test macros (__SSE__ through __SSE4_2__) so that code guarded by those macros compiles without changes.

The translation is not zero-cost for all operations. Two patterns that sse2neon handles poorly are replaced with native NEON in src/simd_compat.h:

  • _mm_movemask_epi16 — used heavily in bandedSWA.cpp to extract the sign bit of each 16-bit lane. The native implementation shifts right by 15, narrows to 8-bit with vmovn_u16, and reduces with position-weighted vaddv_u8.
  • _mm_blendv_epi16_fast — a bitwise select on 16-bit lanes using vbslq_s16. Replaces the three-operation OR/AND/ANDNOT sequence sse2neon emits for _mm_blendv_epi8.

Memory alignment

Apple Silicon uses 128-byte cache lines (versus 64 bytes on x86). simd_compat.h overrides _mm_malloc on ARM to call posix_memalign with a minimum alignment of 128 bytes for all SIMD allocations. CACHE_LINE_BYTES is set to 128 in macro.h when APPLE_SILICON=1.

Accelerate.framework

The Makefile links -framework Accelerate on macOS ARM builds. The framework is linked but not used for computation: bwa-mem3’s hot paths (Smith-Waterman, FM-index) do not match the large-matrix / large-vector patterns that BLAS and vDSP target. The link is retained to keep the option open and adds no overhead at runtime.

P-core / E-core detection

src/fastmap.cpp calls HTStatus() on macOS to detect the Apple Silicon microarchitecture. HTStatus() reads the hw.perflevel0.physicalcpu and hw.perflevel1.physicalcpu sysctl keys to report P-core and E-core counts, and it also reports the L2 cache size (typically 4 MB on M-series chips). This information is printed at startup for diagnostic purposes. The L2 cache size is used to validate the compile-time BATCH_SIZE setting (currently 1024, which was already optimal for a 4 MB L2 cache).

Benchmark results

All measurements use 100K paired-end reads, 5% error rate, 30% indels, chr17 reference, 8 threads, on an M-series Apple Silicon machine.

| Build | Wall-clock (avg, s) | Change vs. previous row |
| --- | --- | --- |
| sse2neon baseline (no native NEON) | 15.4 | |
| + native NEON kswv.cpp | 14.4 | ~7% faster |
| + native NEON bandedSWA.cpp blendv | 13.8 | ~4% faster |
| PGO on top of native NEON | ~13.4 | ~3% further |
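The percentages can be reproduced from the wall-clock column, each step being measured against the row before it. A small sketch of that arithmetic using awk for the floating-point division (pct_faster is an illustrative name):

```shell
#!/bin/sh
# pct_faster OLD NEW: relative wall-clock reduction from OLD to NEW seconds, in percent.
pct_faster() {
    awk -v old="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (old - new) / old * 100 }'
}

pct_faster 15.4 14.4   # native NEON kswv.cpp vs. the sse2neon baseline
pct_faster 14.4 13.8   # native NEON bandedSWA.cpp blendv vs. the previous row
pct_faster 13.8 13.4   # PGO vs. the previous row
```

With the table's figures this gives roughly 6.5, 4.2 and 2.9 percent, which the table rounds to ~7%, ~4% and ~3%.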

The FM-index (FMI_search.cpp) is memory-bound with sequential pointer-chasing dependencies and does not benefit from SIMD. libsais benefits from OpenMP-parallel suffix-array construction but not from SIMD widening within a single thread.

Optimization task summary

| Task | Status | Impact | Notes |
| --- | --- | --- | --- |
| Correctness verification | done | | 200,006 alignments, 0 differences vs. reference |
| Dynamic L2 cache detection | done | ~0% | 4 MB detected; compile-time BATCH_SIZE=1024 already optimal |
| Native NEON bandedSWA.cpp | done | ~4% | vbsl-based blendv in simd_compat.h |
| Multi-binary launcher | N/A | 0% | Not applicable on ARM (single NEON level) |
| Accelerate.framework | done | ~0% | Linked; no suitable compute patterns |
| M1/M2/M3/M4 detection | done | ~0% | P/E-core counts and L2 cache via sysctl |
| Native NEON FMI_search.cpp | N/A | 0% | Memory-bound; SIMD cannot help |
| Profile-Guided Optimization | done | ~3% | make pgo-generate / make pgo-use |

Building for Apple Silicon

# Standard arm64 build
make arch=arm64

# PGO build (recommended for production on Apple Silicon)
make pgo-generate PGO_ARCH=arm64
./bwa-mem3.pgo-instr mem -t 8 ref.fa r1.fq.gz r2.fq.gz > /dev/null
make pgo-use PGO_ARCH=arm64

The resulting bwa-mem3.pgo binary delivers the full ~10% improvement over the pure sse2neon baseline.

Tip — Recommended production build on Apple Silicon

Use PGO for production deployments. The combined ~10% improvement from native NEON kernels plus PGO is consistent and verified on M-series hardware.

Files modified in the NEON port

  • src/kswv.cpp, src/kswv.h — native NEON batched Smith-Waterman
  • src/bandedSWA.h — SIMD width definitions for ARM
  • src/simd_compat.h — sse2neon integration, aligned allocation, _mm_blendv_epi16_fast, _mm_movemask_epi16
  • src/fastmap.cpp — L2 cache detection, HTStatus() for non-NUMA (macOS)
  • src/macro.h — BATCH_SIZE and CACHE_LINE_BYTES tuning for Apple Silicon
  • Makefile — arm64 target, sse2neon flags, Accelerate linkage, PGO targets

See also: SIMD dispatch architecture · Building from source · Performance → PGO build · Performance → SIMD dispatch matrix · What’s Different → Architecture support

Regression test framework

bwa-mem3 has three categories of tests — unit, integration, and regression — plus a separate benchmark harness in bench/. Understanding the distinction helps you choose where to add a new test and what to expect from CI.

Test categories

| Category | Binary / runner | Fixtures | CI scope |
| --- | --- | --- | --- |
| unit | test/bwa_mem3_tests_unit | None; all inputs synthetic | Every matrix row |
| integration | test/bwa_mem3_tests_integration | Small committed FASTAs / FMI in test/fixtures/ | SSE4.1, AVX2, ARM64 Linux, macOS ARM |
| regression | test/regression/*.sh | Downloaded references (phiX, chr22) + bwa + dwgsim | Canonical AVX2 row only |

Unit tests must use only synthetic inputs generated programmatically and complete in under 100 ms each. They exercise individual kernels in isolation: kswv scoring, banded Smith-Waterman, KSW, FM-index operations, SMEM extraction, BAM encoding, and pair handling.

Integration tests may load small committed fixtures from test/fixtures/ and have a per-test budget of 10 seconds. They exercise cross-component paths: index loading, SMEM-to-alignment pipelines, and output format validation.

Regression tests are standalone bash scripts that shell out to the bwa-mem3 binary, may diff against third-party tool output (bwa, bwa-meth, samtools), and require fixtures that are either committed to the fixtures directory or downloaded by CI at run time.

Running tests locally

# Build the aligner and test binaries
make
make -C test -j$(nproc 2>/dev/null || sysctl -n hw.ncpu)

# Run all unit tests
./test/bwa_mem3_tests_unit

# Run all integration tests
./test/bwa_mem3_tests_integration

# Run a specific test case or suite
./test/bwa_mem3_tests_unit --test-case="*kswv*"
./test/bwa_mem3_tests_unit --test-suite="unit/kswv"
./test/bwa_mem3_tests_unit --test-suite-exclude=slow

# Verbose output (also print passing assertions)
./test/bwa_mem3_tests_unit --success

The make test target is a convenience shortcut that builds and runs the unit and integration binaries plus the two legacy standalone regression tests (kswv_nrow_zero_test and shm_section_find_test):

make test

Running a regression test locally

Regression scripts expect certain environment variables to point at fixtures. The phiX parity test requires dwgsim:

mkdir -p /tmp/ci-test && cd /tmp/ci-test
curl -sL "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/819/615/GCF_000819615.1_ViralProj14015/GCF_000819615.1_ViralProj14015_genomic.fna.gz" | gunzip > phix174.fa
dwgsim -z 42 -N 500 -1 150 -2 150 -r 0.001 -S 2 phix174.fa reads
cd -
BWA_MEM2="$(pwd)/bwa-mem3" CI_TEST_DIR=/tmp/ci-test bash test/regression/phix_parity.sh

Test framework

The unit and integration binaries are built on doctest, a single-header C++ test framework. Tests are discovered by file glob: any test/unit/test_*.cpp file is compiled into the unit binary; any test/integration/test_*.cpp file is compiled into the integration binary. No Makefile edit is needed when adding a new test_*.cpp.

Test organisation

Tag each TEST_CASE with doctest::test_suite("category/module"):

TEST_CASE("nrow==0 batch does not store out of bounds"
          * doctest::test_suite("unit/kswv")) {
    // ...
}

The test_suite decorator overrides any enclosing suite name rather than adding to it, so a test case belongs to exactly one suite. Encode the category (unit or integration) and module (kswv, bandedsw, ksw, fmindex, smem, bam, pair, cigar, util) as a single slash-separated string.

Framework helpers

The test/framework/ directory provides helpers shared across test files:

| Header | Provides |
| --- | --- |
| scoring.h | ScoringMatrix, build_scoring_matrix, default_scoring_matrix |
| seqpair.h | TestPair struct |
| seqpair_gen.h | Deterministic pair generators: random, exact-match, all-mismatch, homopolymer, sub-cluster, N-bases |
| seqpair_batch.h | BatchBuffers — flat-layout packer for kswv batch input |
| ksw_runner.h | run_scalar_ksw, default gap/extra parameters |
| kswv_runner.h | Two-pass run_kswv_batch |
| kswr_cmp.h | Score / coordinate / score2 comparators |
| junit_reporter.h | CI matrix-row banner and JUnit XML output |

Debugging a failing test

# Break into debugger at the first failing assertion
./test/bwa_mem3_tests_unit --test-case="*kswv*" --break

# Run a single SUBCASE
./test/bwa_mem3_tests_unit --test-case="*foo*" --subcase="bar"

# Enable per-phase diagnostics for kswv tests
BWA_TESTS_DEBUG_PHASE0=1 BWA_TESTS_DEBUG_PHASE1=1 \
  ./test/bwa_mem3_tests_unit --test-suite="unit/kswv"

JUnit artifacts are uploaded per CI matrix row (unit-results-<name>.xml, integration-results-<name>.xml) and available on the Actions run page.

Tip — Use ASAN for memory bugs

Build with make ASAN=1 test to catch out-of-bounds writes in vectorised kernels. The kswv_nrow_zero_test specifically exercises the nrow==0 path that triggered a pre-allocation store bug; ASAN reports this immediately rather than at a later allocator operation.

Standalone regression tests

Three standalone regression tests live outside the doctest harness because they predated it. The two binaries are built and run by make test; the third is script-driven:

  • kswv_nrow_zero_test — binary; exercises the all-len1==0 batch path in every SIMD kswv variant. Catches the nrow==0 rowMax store overrun from issue #38 / upstream bwa-mem2 PR #289.
  • shm_section_find_test — binary; exercises the shared-memory index section-find logic.
  • shm_pack_round_trip_test — script-driven, invoked via test/shm_pack_round_trip_test.sh, which builds the phiX index first.

Additional integration shell scripts in test/:

| Script | What it tests |
| --- | --- |
| pg_cl_escape_test.sh | @PG CL: tab/newline escape in SAM headers |
| mimalloc_loaded_test.sh | mimalloc override is active when USE_MIMALLOC=1 |
| shm_round_trip_test.sh | bwa-mem3 shm load / list / drop cycle |
| shm_meth_test.sh | --meth index compatibility with shm |
| help_prescan_test.sh | --help prints without running alignment |
| libsais_*.sh | libsais index correctness vs. BWA / determinism |

Benchmark harness (bench/)

bench/ is a separate performance measurement harness used during development to gate performance PRs. It is not part of the CI test suite.

cp bench/config.env.example bench/config.env
# Edit config.env to point at your index, reads, and binary paths
bench/run.sh baseline         # N trials; appends to bench/results.csv
bench/run.sh candidate        # N trials on the candidate binary
bench/compare.sh baseline candidate  # wall-clock / RSS / md5 delta report

Each run records: tag, host, architecture, binary path, thread count, trial index, wall-clock seconds, max RSS (KB), and a golden md5 (single-threaded, @PG-stripped SAM). The md5 verifies byte-identical output across builds; wall-clock is the primary performance metric.
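The reason for stripping @PG before hashing is that this header line carries the version and command line, which legitimately differ between builds. How bench/ implements this is not shown here; the following is only a sketch of the idea, with strip_pg as a hypothetical helper.

```shell
#!/bin/sh
# strip_pg: copies SAM text from stdin to stdout, dropping @PG header lines.
strip_pg() {
    grep -v '^@PG'
}

# Demo on a tiny hand-written SAM fragment: the @PG line disappears.
printf '@HD\tVN:1.6\n@PG\tID:bwa-mem3\tVN:v0.1.0\nread1\t4\t*\n' | strip_pg

# A golden hash could then be computed as, for example:
#   ./bwa-mem3 mem -t 1 ref.fa r1.fq.gz r2.fq.gz | strip_pg | md5sum
# (md5sum is the GNU coreutils tool; macOS ships `md5` instead.)
```

Two builds that differ only in version string or invocation then hash identically, so any md5 change points at a real difference in alignment output.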


See also: Building from source · SIMD dispatch architecture · Contributing · What’s Different → Correctness fixes · What’s Different → Build & infrastructure

Release process

bwa-mem3 follows semantic versioning. Releases are driven by git tags. The version string is derived automatically from git describe and embedded in every binary at compile time.

Version stamping

The Makefile computes the version string at parse time:

FG_LABS_VERSION_FALLBACK := 0.1.0-pre
VERSION_STRING := $(shell git describe --tags --dirty 2>/dev/null || echo $(FG_LABS_VERSION_FALLBACK))

git describe produces a string such as v0.1.0 (on a tag), v0.1.0-3-gabcdef1 (three commits past the tag), or v0.1.0-dirty (uncommitted changes). If git describe fails — for example in a source tarball or a shallow clone without tag history — the build falls back to FG_LABS_VERSION_FALLBACK.
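Scripts that consume the version string sometimes need to tell these shapes apart. A minimal sketch using shell pattern matching, with describe_kind as a hypothetical helper; note that the fallback string has no distance or dirty suffix, so it falls into the same bucket as an exact tag.

```shell
#!/bin/sh
# describe_kind STRING: classifies a `git describe --tags --dirty` style string.
describe_kind() {
    case $1 in
        *-dirty)             echo dirty ;;      # uncommitted changes present
        *-[0-9]*-g[0-9a-f]*) echo past-tag ;;   # <tag>-<distance>-g<short sha>
        *)                   echo exact ;;      # on a tag, or the fallback string
    esac
}

describe_kind v0.1.0
describe_kind v0.1.0-3-gabcdef1
describe_kind v0.1.0-dirty
```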

The string is written into src/version.h by the src/version.h: FORCE rule, which runs on every make invocation but only touches the file when the string changes. This minimises unnecessary recompilation of src/main.o.
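The "run always, touch only on change" pattern works because make decides what to rebuild from file modification times. A sketch of the idea in shell; the function name and the #define line are assumptions for illustration, and the Makefile's actual recipe may differ.

```shell
#!/bin/sh
# write_if_changed FILE CONTENT: rewrites FILE only when its content would
# differ, leaving the modification time alone otherwise.
write_if_changed() {
    file=$1
    content=$2
    if [ -f "$file" ] && [ "$(cat "$file")" = "$content" ]; then
        return 0   # unchanged: do not touch the file
    fi
    printf '%s\n' "$content" > "$file"
}

# Demo on a temporary file; the header line format is hypothetical.
tmp=$(mktemp)
write_if_changed "$tmp" '#define PACKAGE_VERSION "v0.1.0"'
write_if_changed "$tmp" '#define PACKAGE_VERSION "v0.1.0"'   # second call is a no-op
cat "$tmp"
rm -f "$tmp"
```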

PACKAGE_VERSION from src/version.h appears in:

  • bwa-mem3 version output (stdout).
  • The @PG VN: field in every SAM/BAM file produced by bwa-mem3 mem.

Verifying the version

./bwa-mem3 version
# Example output on a tagged commit:
# v0.1.0
# mimalloc 3.3.0        ← if USE_MIMALLOC=1

On an untagged commit the string includes the commit distance and short SHA:

v0.1.0-12-g3f7ab2e

Tagging a release

  1. Confirm main builds cleanly and all tests pass:

    make clean && make
    make test
    make docs
    
  2. Update NEWS.md with the release notes for the new version.

  3. Tag the release commit:

    git tag -s v0.X.Y -m "Release v0.X.Y"
    

    Use a signed tag (-s) when your GPG key is available. Annotated tags (-a) are acceptable when signing is not possible.

  4. Push the tag to the fg-labs remote:

    git push fg-labs v0.X.Y
    

    Read the Docs activates a versioned build at /v0.X.Y/ automatically when the tag appears on the remote.

  5. Create a GitHub release from the tag via gh:

    gh release create v0.X.Y --repo fg-labs/bwa-mem3 \
      --title "bwa-mem3 v0.X.Y" \
      --notes-file <(sed -n '/^## \[0\.X\.Y\]/,/^## /p' NEWS.md | sed '$d')
    

Note — Tarball builds

Source tarballs created by GitHub (or git archive) do not include git history, so git describe fails and the version falls back to FG_LABS_VERSION_FALLBACK. For reproducible tarball builds, set VERSION_STRING explicitly on the command line: make VERSION_STRING=v0.X.Y.

Branch and tag conventions

  • All release tags are on the main branch, which carries both upstream bwa-mem2 commits and fork-carried changes. See Branch and worktree conventions for the full branching model.
  • Tags are prefixed with v: v0.1.0, v0.2.0, etc.
  • Pre-release tags use a -pre suffix: v0.1.0-pre.
  • Patch releases increment the third component: v0.1.1.

What’s Different table update

When a release bundles new fork-carried commits that were not previously documented, update the FG-MAIN-TABLE in docs/src/whats-different/overview.md in the same PR before tagging. See Contributing for the rule.


See also: Branch and worktree conventions · What’s Different → Overview · Reference → Changelog · Building from source

Branch and worktree conventions

This page describes how the bwa-mem3 repository branches relate to upstream bwa-mem2, the policy for where PRs land, and the conventions for local worktrees when working on multiple branches simultaneously.

Branch model

master — upstream mirror

master tracks the upstream bwa-mem2 master branch verbatim. No fork-carried changes are applied here. When upstream bwa-mem2 merges new commits, master is fast-forwarded to match.

master is the starting point for upstream rebase operations. It is never the target of fork PRs.

main — fork integration branch

main carries all fork-carried commits on top of a rebased upstream baseline. This is the branch that:

  • All new feature, fix, and improvement PRs target.
  • All git tags (v0.X.Y) are placed on.
  • Read the Docs /latest/ follows.

When upstream bwa-mem2 makes significant changes, master is fast-forwarded and then main is rebased onto the new master tip. The rebase is verified by running the full test suite before the result is pushed.

Feature and fix branches

All development work happens on short-lived branches that are merged into main via pull request. Branch name conventions:

| Prefix | Use |
| --- | --- |
| feat/ | New features or capabilities |
| fix/ | Bug fixes |
| perf/ | Performance improvements |
| test/ | Test additions or improvements |
| docs/ | Documentation changes |
| ci/ | CI / build system changes |
| refactor/ | Code restructuring without behaviour change |

Branch names use kebab-case after the prefix: fix/kswv-nrow-zero, perf/libsais-fm-index, test/regression-tests.
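A naming rule like this is easy to check mechanically, for example in a local pre-push hook. The sketch below is one possible check, not an existing project script: valid_branch is a hypothetical name, and kebab-case is interpreted as lowercase letters and digits in groups separated by single hyphens.

```shell
#!/bin/sh
# valid_branch NAME: succeeds when NAME is <prefix>/<kebab-case-slug>
# with one of the documented prefixes.
valid_branch() {
    case $1 in
        feat/*|fix/*|perf/*|test/*|docs/*|ci/*|refactor/*) ;;
        *) return 1 ;;
    esac
    slug=${1#*/}   # everything after the first slash
    printf '%s\n' "$slug" | grep -Eq '^[a-z0-9]+(-[a-z0-9]+)*$'
}

for name in fix/kswv-nrow-zero perf/libsais-fm-index feature/foo fix/Kswv_Zero; do
    if valid_branch "$name"; then echo "ok:  $name"; else echo "bad: $name"; fi
done
```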

Upstream rebase cadence

main is rebased onto master (i.e., onto upstream bwa-mem2) periodically — not on every upstream commit, but when upstream merges a batch of changes worth incorporating. The process is:

  1. Fast-forward master to the new upstream tip.
  2. Rebase main onto master, resolving any conflicts.
  3. Run make && make test to confirm the rebase is correct.
  4. Push master and main to the fg-labs remote.

Warning — Do not merge upstream into main

Always rebase rather than merge when incorporating upstream changes. Merge commits obscure the fork-carried commit history and make the What’s Different table harder to maintain.

Worktrees for parallel branches

When working on multiple branches simultaneously, use git worktrees instead of stashing or switching branches. Each worktree is a sibling directory of the main clone.

Creating a worktree for a PR branch

# Fetch the PR's head branch from the fg-labs remote
git fetch fg-labs <head-branch-name>

# Create a worktree with a local branch tracking the remote branch
git worktree add ../pr-<N> -b pr-<N> --track fg-labs/<head-branch-name>

The local branch name and directory name match the PR number (pr-N).

Creating a worktree for a new issue branch

# Fetch the latest main from fg-labs
git fetch fg-labs main

# Create a new feature branch off fg-labs/main
git worktree add ../issue-<N> -b <prefix>/issue-<N>-<short-slug> fg-labs/main

# Unset the upstream so the branch is untracked until first push
git -C ../issue-<N> branch --unset-upstream

On first push, push to fg-labs so the head branch is in the same organisation as the PR base:

git push -u fg-labs HEAD

Worktree naming conventions

| Directory name | Branch type |
| --- | --- |
| main/ | Primary checkout; tracks fg-labs/main |
| pr-<N>/ | PR review; local branch pr-N tracks fg-labs/<head-branch> |
| issue-<N>/ | Issue work; local branch <prefix>/issue-N-<slug> |
| Descriptive name | Feature work not yet tied to a PR or issue |

Listing and removing worktrees

# List all worktrees
git worktree list

# Remove a worktree after the PR is merged
git worktree remove ../pr-<N>
git branch -D pr-<N>

# Remove an issue worktree
git worktree remove ../issue-<N>
git branch -D <prefix>/issue-<N>-<slug>

Note — Worktree directories are siblings, not nested

All worktree directories sit next to the main clone at the same directory level, not inside it. This avoids confusing git commands that walk parent directories looking for .git.

PR policy

  • All PRs target main.
  • PRs from fork contributors should be opened against fg-labs/bwa-mem3 main.
  • Every PR that adds a fork-carried commit must update the FG-MAIN-TABLE in docs/src/whats-different/overview.md in the same PR. See Contributing.
  • Merge policy: squash-merge for single-commit changes; rebase-merge for multi-commit PRs with a clean commit history.

See also: Contributing · Release process · What’s Different → Overview · Building from source

Contributing

This page covers the mechanics of submitting changes to bwa-mem3: commit conventions, PR workflow, CI requirements, and the rule for keeping the fork-lineage table current.

Before you start

  1. Check the open issues and existing PRs to avoid duplicate work.
  2. For substantial changes, open an issue first to discuss scope and approach.
  3. Fork or branch from fg-labs/bwa-mem3 main. See Branch and worktree conventions for the branching model.

Commit message conventions

bwa-mem3 follows Conventional Commits (v1.0.0). Every commit message must start with a type prefix:

| Prefix | Use |
|---|---|
| feat: | New feature or capability |
| fix: | Bug fix |
| perf: | Performance improvement |
| test: | Test additions or changes |
| docs: | Documentation only |
| ci: | CI / build-system changes |
| refactor: | Restructuring without behaviour change |
| chore: | Maintenance (dependency bumps, version pins) |

The subject line is lowercase after the prefix, imperative mood, no trailing period. Keep it under 72 characters. Body lines wrap at 100 characters.

Good:

fix: kswv nrow==0 batch skips rowMax store when i==0

Exercises the all-len1==0 path across SSE4.1, AVX2, AVX-512BW, and ARM NEON.
Without the `if (i > 0)` guard, the store writes SIMD_WIDTH* bytes before the
allocation.

Closes #38.

Not acceptable:

Fixed stuff
Updated kswv
WIP
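
The mechanical parts of these rules (known type prefix, lowercase first letter, no trailing period, length) can be checked with a few lines of POSIX shell. This is a sketch under assumptions, not a hook shipped with bwa-mem3: check_commit_subject is a hypothetical name, the optional (scope) and ! markers are taken from the Conventional Commits 1.0.0 grammar rather than from this page, and imperative mood is not checked.

```shell
# Hypothetical check for a commit subject line; prints "ok" or the first problem found.
check_commit_subject() (
  subject=$1
  types='feat|fix|perf|test|docs|ci|refactor|chore'
  # Type prefix, optional (scope) and "!", then ": " and a description.
  if ! printf '%s\n' "$subject" | grep -Eq "^($types)(\([a-z0-9_-]+\))?!?: .+"; then
    echo "missing or unknown type prefix"; exit 1
  fi
  desc=${subject#*: }   # text after the first ": "
  case $desc in
    [A-Z]*) echo "description must start with a lowercase letter"; exit 1 ;;
    *.)     echo "no trailing period"; exit 1 ;;
  esac
  # "Under 72 characters" is read here as strictly fewer than 72.
  if [ "${#subject}" -ge 72 ]; then
    echo "subject must be under 72 characters"; exit 1
  fi
  echo ok
)
```

The good example above passes this check, and each of the three unacceptable subjects fails at the prefix test.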

Pull request workflow

  1. Push your branch to fg-labs/bwa-mem3 (or your fork) and open a PR targeting fg-labs/bwa-mem3 main.
  2. The PR description should explain the motivation, summarise the change, and note any benchmarks or test results.
  3. All CI jobs must pass before merge. See CI matrix below.
  4. CodeRabbitAI reviews every PR automatically. Address all comments, including inline suggestions, summary comments, and nitpicks. Do not dismiss comments without a reply explaining why the suggestion was not adopted.
  5. A project maintainer will review and merge once CI is green and all comments are resolved.

Note — Draft PRs first

Open PRs as drafts while CI is running or while you are actively revising. Convert to ready-for-review only when the branch is stable, CI is green, and you have self-reviewed the diff.

The FG-MAIN-TABLE rule

Every PR that introduces a new fork-carried commit — a commit that is on main but not on master (the upstream bwa-mem2 mirror) — must update the FG-MAIN-TABLE block in docs/src/whats-different/overview.md in the same PR.

The table records each fork-carried change, its bwa-mem3 PR number, the corresponding upstream bwa-mem2 PR or issue (if any), and its upstream status. Keeping this table current is the primary mechanism by which the project maintains transparency about its relationship to upstream.

Warning — Do not skip the table update

A PR that adds a fork-carried commit but omits the table update will be sent back for revision. The table is reviewed as part of the standard PR checklist.

What counts as a fork-carried commit

A commit is fork-carried if:

  • It adds new behaviour, fixes a bug, or changes build infrastructure in a way that diverges from upstream bwa-mem2 master.
  • It is present on fg-labs/bwa-mem3 main but not (yet) merged upstream.

Pure documentation commits, CI-only changes, and upstream-rebase bookkeeping commits do not need a table entry.
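
The "on main but not on master" condition maps directly onto git's two-dot range syntax. A minimal sketch, assuming local refs named master and main exist (fork_carried_commits is a hypothetical helper name, not a script in the repository):

```shell
# Hypothetical helper: list commits reachable from the fork branch but not
# from the upstream mirror. Ref names default to those used in this guide;
# pass others (for example fg-labs/main) if your local refs differ.
fork_carried_commits() (
  upstream=${1:-master}
  fork=${2:-main}
  git log --oneline "$upstream..$fork"
)
```

The list is a starting point only: documentation-only, CI-only, and rebase bookkeeping commits that appear in it still need no table entry.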

CI matrix

CI runs on every PR and on push to main. The matrix covers:

| Row | Architecture | ISA | Platform |
|---|---|---|---|
| sse41 | x86_64 | SSE4.1 | Ubuntu |
| avx2 | x86_64 | AVX2 | Ubuntu (canonical) |
| avx512bw | x86_64 | AVX-512BW | Ubuntu |
| arm64-linux | aarch64 | NEON | Ubuntu ARM |
| arm64-macos | arm64 | NEON | macOS |

Regression tests (shell scripts in test/regression/) run only on the canonical row (avx2). Unit tests run on every row. Integration tests run on four rows: sse41, avx2, arm64-linux, and arm64-macos.

A PR must pass all rows before merge.
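
The suite-to-row rules can be summarised as a lookup. The function below is only an illustration of the text above (ci_suites_for_row is a hypothetical name, and the real CI configuration is not reproduced here):

```shell
# Hypothetical lookup: which test suites run on a given CI matrix row.
ci_suites_for_row() {
  case $1 in
    avx2)                          echo "unit integration regression" ;;  # canonical row
    sse41|arm64-linux|arm64-macos) echo "unit integration" ;;
    avx512bw)                      echo "unit" ;;
    *) echo "unknown row: $1" >&2; return 1 ;;
  esac
}
```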

Code style

  • C++14, gnu++14 dialect.
  • Match the style of the surrounding code. The codebase inherits the upstream bwa-mem2 style, which is C-ish C++ with minimal STL use in hot paths.
  • For new test code, follow the doctest patterns documented in the test framework.
  • New SIMD code must include src/simd_compat.h rather than platform-specific headers directly. See SIMD dispatch architecture.

Adding a test for your change

  • Bug fix → add a unit test or integration test that fails without the fix and passes with it.
  • New feature → add unit tests for the core logic and, if the feature is end-to-end testable with a shell invocation, a regression test in test/regression/.
  • Performance change → run the benchmark harness (bench/) to confirm the improvement and include median wall-clock numbers in the PR description.

See Regression test framework for the full guide on where to add tests and how to organise them.


See also: Branch and worktree conventions · Regression test framework · Release process · What’s Different → Overview · Building from source

bwa-mem3-bench

bwa-mem3-bench is a benchmarking suite that measures the alignment performance of bwa-mem3 against the upstream bwa-mem2 v2.2.1 baseline. It runs on AWS Batch spot instances across four dataset types, all aligned against the hg38 reference: whole-genome sequencing (WGS), whole-exome sequencing (WES), panel, and bisulfite sequencing (methylation). The suite covers three CPU targets: ARM NEON, x86 AVX2, and x86 AVX-512. Results are collected into a SQLite database for local analysis and reporting. The project is implemented in Python (orchestration, reporting, and CLI), Rust (BAM comparison tool), Snakemake (alignment workflow), and AWS CDK (cloud infrastructure).

When you’d use it

Use bwa-mem3-bench when you need reproducible, multi-architecture throughput numbers before committing a bwa-mem3 change to production or before deciding whether to adopt bwa-mem3 in place of bwa-mem2. It provides a structured “bless baseline, then compare” workflow: an upstream bwa-mem2 run is blessed once per upstream tag and stored in S3; subsequent bwa-mem3 runs are measured against that fixed baseline. Running a full benchmark fires a Snakemake coordinator job on AWS Batch and costs roughly $10 in spot capacity.

How it relates to bwa-mem3

bwa-mem3-bench is the authoritative source of benchmark evidence for every performance claim made in the bwa-mem3 documentation and changelog. When the Performance Overview cites speedup numbers, those numbers come from bwa-mem3-bench runs collected after the relevant PR was merged. The suite also validates that bwa-mem3 does not regress relative to bwa-mem2 on any supported architecture before a new release is tagged.


See also: Performance Overview · SIMD dispatch matrix · bwa-mem2 (upstream) · Release process

bwa-mem3-rs

bwa-mem3-rs is a Rust crate that provides idiomatic bindings to the bwa-mem family of short-read aligners — bwa (original), bwa-mem2, and bwa-mem3. It exposes a safe Rust API over the underlying C++ alignment engine, allowing Rust programs to index a reference, configure alignment parameters, and align reads without shelling out to an external process. The bindings link statically against the chosen backend, so a binary built with bwa-mem3-rs carries the aligner and its SIMD kernels as a self-contained artifact.

When you’d use it

Use bwa-mem3-rs when you are building a Rust bioinformatics tool or pipeline that needs short-read alignment as an in-process library call rather than a subprocess invocation. It is especially useful when latency between reads arriving and alignments being available matters (no process-startup overhead), or when you want tight integration between the aligner’s output and downstream Rust code such as UMI grouping, consensus calling, or duplicate marking.

How it relates to bwa-mem3

bwa-mem3-rs targets bwa-mem3 as its primary high-performance backend. It is the intended integration path for fgumi and other Fulcrum Genomics tools that need alignment as a library dependency. Changes to bwa-mem3’s public API, flag semantics, or output format are coordinated with bwa-mem3-rs to keep the bindings current.


See also: fgumi · bwa-mem3-bench · Aligning short reads (mem) · Developer Guide — Contributing

bwa-mem2 (upstream)

bwa-mem2 is the direct predecessor of bwa-mem3 and the project from which the bwa-mem3 fork is derived. It was created at Intel’s Parallel Computing Lab by Vasimuddin Md and Sanchit Misra to accelerate the alignment algorithm originally written by Heng Li in bwa. bwa-mem2 achieves a 1.3–3.1x throughput improvement over the original bwa-mem by replacing key inner loops with vectorised implementations (SSE4.1, SSE4.2, AVX2, and AVX-512) and by switching to a more compact FM-index encoding. Its output is identical to bwa-mem at the alignment level, and it is distributed under the MIT license.

Lineage

The bwa alignment family has evolved through three generations, each building on the last:

  1. bwa — Written by Heng Li. Established the BWA-MEM algorithm, the SAM output format conventions, and the .bwt / .pac / .ann / .amb index layout.
  2. bwa-mem2 (Vasimuddin et al., Intel) — Replaced scalar inner loops with SIMD kernels; introduced the compact .bwt.2bit.64 and .0123 index formats; retained full output compatibility with bwa-mem.
  3. bwa-mem3 (Fulcrum Genomics fork) — Carries correctness fixes, performance improvements, new features (bisulfite alignment, mimalloc, ARM Neon), and expanded architecture support on top of the bwa-mem2 codebase. See What’s Different from bwa-mem2 for the full change catalog.

When you’d use it

Use bwa-mem2 directly when you need a stable, widely validated aligner with precompiled binaries available via Bioconda and the project’s GitHub releases page, and when you do not require the features or fixes that bwa-mem3 adds. bwa-mem2 is also the right choice when you are working in an environment where the bwa-mem3 fork has not yet been validated against your specific reference or sequencing library type.

How it relates to bwa-mem3

bwa-mem3 tracks bwa-mem2’s master branch and periodically rebases fork-carried commits on top of upstream changes. The What’s Different section documents every divergence between the two projects, and the Upstream PR status page tracks which bwa-mem3 changes have been proposed back to bwa-mem2. The goal is to keep the fork divergence minimal and to upstream as many fixes as practical.

  • GitHub: https://github.com/bwa-mem2/bwa-mem2
  • Citation: Vasimuddin Md, Sanchit Misra, Heng Li, Srinivas Aluru. “Efficient Architecture-Aware Acceleration of BWA-MEM for Multicore Systems.” IEEE IPDPS 2019.
  • License: MIT (with third-party components under their respective licenses)

See also: What’s Different from bwa-mem2 · Upstream PR status · bwa-mem3-bench · Citation

fgumi

fgumi (Fulcrum Genomics Unique Molecular Indexing tools) is a high-performance suite of command-line tools for processing UMI-tagged next-generation sequencing data. Written in Rust, it provides UMI extraction from FASTQ files, read grouping by UMI with configurable assignment strategies, UMI-aware deduplication, simplex and duplex consensus calling, CODEC consensus calling, quality filtering of consensus reads, and overlapping read-pair clipping. fgumi is the intended successor to the Scala-based fgbio toolkit for UMI processing, targeting significantly higher throughput on multi-core systems. It is published on Bioconda and documented at https://fgumi.readthedocs.io.

Warning — Research preview

fgumi is currently a research preview. The Fulcrum Genomics team targets June 2026 for recommending fgumi over fgbio for production use. Verify fitness for your application before deploying in a clinical or production pipeline.

When you’d use it

Use fgumi when your sequencing library includes unique molecular identifiers and you need to group reads by UMI, call simplex or duplex consensus sequences, or remove PCR duplicates in a UMI-aware manner. It handles the standard commercial UMI library preparations (IDT xGen, KAPA, Twist, QIAseq, and others) and the CODEC protocol for duplex sequencing. fgumi is designed to be run after alignment with bwa-mem3 (or bwa-mem2) and before downstream variant calling or methylation analysis.

How it relates to bwa-mem3

fgumi and bwa-mem3 are sibling projects maintained by Fulcrum Genomics and are designed to work together in the same alignment-and-consensus pipeline. bwa-mem3 provides the aligned BAM that fgumi takes as input for grouping and consensus calling. The two projects share build and documentation conventions (mdbook on Read the Docs, Fulcrum theme, conventional commits) and are benchmarked together in the fgumi-benchmarks internal dataset suite. The intended integration path for in-process alignment within fgumi is bwa-mem3-rs, the Rust bindings for bwa-mem3.


See also: bwa-mem3-rs · Aligning short reads (mem) · Best Practices — Multi-sample workflows · bwa-mem3-bench

bwameth.py

bwameth.py is a Python script written by Brent Pedersen that implements bisulfite sequencing (BS-Seq) alignment using the in-silico three-letter genome approach. It converts all cytosines to thymines in both the reference and the reads (C-to-T on the forward strand, G-to-A on the reverse), aligns the converted sequences with bwa-mem (or optionally bwa-mem2), and then recovers the original read sequence from the aligner’s tag output to tabulate methylation. bwameth.py supports single-end and paired-end reads from the directional bisulfite protocol and is published at https://arxiv.org/abs/1401.1129.
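
The conversion itself is a plain character substitution. The sketch below illustrates the idea with tr; it is not bwameth.py's code, and the function names c2t and g2a are chosen here for illustration.

```shell
# In-silico bisulfite conversion of a single sequence (either letter case).
c2t() { printf '%s\n' "$1" | tr 'Cc' 'Tt'; }   # forward strand: C -> T
g2a() { printf '%s\n' "$1" | tr 'Gg' 'Aa'; }   # reverse strand: G -> A
```

c2t ACGTCCGA prints ATGTTTGA, and g2a ACGTCCGA prints ACATCCAA. A real pipeline must keep the unconverted sequence as well, since it is needed later to call methylation.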

When you’d use it

Use bwameth.py when you need a battle-tested, community-supported bisulfite aligner that runs on top of the standard bwa-mem or bwa-mem2 you have already installed, and when you prefer a Python wrapper over a self-contained binary. It also remains the reference for downstream tabulation tools such as MethylDackel and SNP callers such as biscuit that expect the bwameth.py output format. For the actual methylation tabulation and variant calling steps, bwameth.py’s author recommends those dedicated tools rather than the tabulation utilities bundled with the original script.

How it relates to bwa-mem3

bwa-mem3 mem --meth is a single-binary drop-in replacement for the bwameth.py alignment pipeline. It inlines the C-to-T and G-to-A conversion, runs the bwa-mem3 alignment engine (with all of its correctness fixes and SIMD speedups), rewrites the @SQ headers to collapse the per-strand contig pairs back to canonical chromosome names, applies chimera QC, and emits a @PG ID:bwa-mem3-meth header. The output BAM is compatible with the same downstream tabulation tools that consume bwameth.py output. The Methylation Reference section documents the full implementation in detail, including the YS:Z:, YC:Z:, and YD:Z: tags and the --set-as-failed and --do-not-penalize-chimeras flags.

Tip — Interop with the bwameth.py c2t step

If your pipeline already performs its own C-to-T conversion before alignment, see Interop with external bwameth.py c2t for how to pass pre-converted reads to bwa-mem3 mem --meth without double-conversion.


See also: Methylation Reference: Overview · Quick start: methylation alignment · Best Practices — Methylation defaults · Interop with external bwameth.py c2t

Glossary

Terms used throughout this book, listed alphabetically.


@HD header
The optional first line of a SAM file header. Specifies the SAM format version (VN) and sort order (SO). If present, it must be the first line of the file. See Output: SAM/BAM, headers, tags.

@PG header
A SAM header line recording a program that processed the file, including ID, PN, VN, and CL fields. bwa-mem3 inserts ID:bwa-mem3 (or ID:bwa-mem3-meth in methylation mode). See Output: SAM/BAM, headers, tags.

@SQ header
A SAM header line describing a reference sequence (chromosome). Contains the sequence name (SN) and length (LN). In methylation mode, bwa-mem3 post-processes @SQ lines to collapse f/r-prefixed contig names back to one entry per chromosome. See Chimera QC and header rewriting.

BAM
Binary Alignment Map: a compressed, binary encoding of SAM. Produced directly by bwa-mem3 when the --bam flag is given, or by piping SAM output through samtools. See Output: SAM/BAM, headers, tags.

Banded Smith-Waterman (banded SWA)
A heuristic variant of the Smith-Waterman alignment algorithm that restricts the dynamic programming to a band of width w around the main diagonal. bwa-mem3 uses banded SWA for extension alignment; bwa-mem2 kernels are SIMD-vectorized and bwa-mem3 adds NEON implementations for Apple Silicon. See SIMD dispatch architecture.

c2t
Cytosine-to-thymine in-silico conversion applied to reads (or reference) before methylation alignment. In --meth mode, bwa-mem3 converts R1 reads C→T and R2 reads G→A inline, without writing intermediate FASTQ files. See Conversion details (C->T, G->A).

Chimera
A read alignment where the aligned portion is short relative to the read length, often indicating a mapping artefact or a true chimeric molecule. In methylation mode, bwa-mem3 applies a chimera QC heuristic: if the longest contiguous M/=/X CIGAR run is less than 44% of the read length, the alignment is flagged 0x200, the proper-pair bit is cleared, and MAPQ is capped at 1. See Chimera QC and header rewriting.

FASTQ
A text format for raw sequencing reads. Each record contains a sequence identifier, the nucleotide sequence, a separator, and per-base quality scores in ASCII-encoded Phred format. bwa-mem3 accepts gzip-compressed FASTQ as input. See Quick start: align paired-end FASTQs.

FM-index
Ferragina-Manzini index: a full-text index over the Burrows-Wheeler Transform of a sequence. bwa-mem3 uses the compressed .bwt.2bit.64 FM-index for seed finding (SMEM lookup). See Indexing the reference.

Hard clip
A CIGAR operation (H) indicating that bases at the read end are absent from the SEQ field of the alignment record. Hard clipping is used in supplementary alignments to avoid duplicating the read sequence. See Output: SAM/BAM, headers, tags.

kswv
The SIMD-vectorized kernel implementing the inner loop of the Smith-Waterman extension alignment in bwa-mem2/bwa-mem3. bwa-mem3 carries correctness fixes for the score-saturation edge case across all SIMD width variants (NEON, AVX2, AVX-512BW). See Correctness fixes.

libsais
A library implementing the suffix-array induced sorting (SAIS) algorithm. bwa-mem3 optionally uses libsais for FM-index construction, reducing indexing time compared to the default suffix-array builder. See Performance improvements.

LTO
Link-Time Optimization: a compiler mode that defers optimization to link time, enabling cross-compilation-unit inlining. Activated via make lto-build. See Building from source.

MAPQ
Mapping quality: a Phred-scaled probability that a read alignment is incorrectly mapped. Reported in SAM field 5. bwa-mem3 follows bwa-mem2 MAPQ semantics; chimera QC in methylation mode caps MAPQ at 1 for chimeric alignments. See Output: SAM/BAM, headers, tags.

Mate rescue
A step in paired-end alignment where, if one mate lacks a confident seed, bwa-mem3 attempts to find it by performing Smith-Waterman alignment in the region near the mapped mate. bwa-mem3 adds NEON and AVX2 implementations of the mate-rescue kernel. See Architecture support.

mimalloc
A high-performance memory allocator from Microsoft. bwa-mem3 vendors mimalloc and links it into every binary by default. To disable, build with USE_MIMALLOC=0. See Memory allocator (mimalloc).

Multi-binary launcher
On x86, bwa-mem3 ships a thin launcher binary that detects CPUID features at runtime and execs the appropriate arch-specialized binary (bwa-mem3.sse41, .avx2, .avx512bw, etc.). On ARM64 a single bwa-mem3.arm64 binary is built. See Multi-binary launcher (x86).

PGO
Profile-Guided Optimization: a two-pass build where the first pass instruments the binary, a representative workload is run to collect profiles, and the second pass uses those profiles to guide inlining and branch layout. Activated via make pgo-generate then make pgo-use. See PGO build.

Primary alignment
The alignment record for a read that represents the aligner’s best placement. A read has exactly one primary alignment (or is reported as unmapped). All other alignments for the same read are marked supplementary (chimeric split read) or secondary (alternative mapping). See Output: SAM/BAM, headers, tags.

Proper-pair flag (0x2)
SAM flag bit indicating that both mates of a pair are mapped in the expected orientation and insert-size range. In bwa-mem3, the mem_sam_pe function sets this flag; a correctness fix (PR #17) ensures it is propagated correctly under all conditions. See Correctness fixes.

SAM
Sequence Alignment Map: a tab-delimited text format for read alignments. Each record contains mandatory fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL) plus optional tags. See Output: SAM/BAM, headers, tags.

SIMD dispatch
Runtime selection of the fastest available SIMD instruction set (SSE4.1, AVX2, AVX-512BW, NEON) for hot alignment kernels. On x86 this is implemented by the multi-binary launcher; on ARM64 a single binary covers NEON. See SIMD dispatch matrix.

SMEM
Super-Maximal Exact Match: a seed found by extending a read’s position in the FM-index as far as possible in both directions. SMEMs form the initial seeds for chaining and extension in the BWA-MEM algorithm. See Performance improvements.

Soft clip
A CIGAR operation (S) indicating that bases at the read end were not part of the alignment, but are still present in the SEQ field. Soft clipping commonly appears at adapter-containing or low-quality read ends. See Output: SAM/BAM, headers, tags.

Supplementary alignment
A SAM record (FLAG bit 0x800 set) representing a chimeric read split across two or more genomic loci. The segment with the longest aligned span is typically designated primary; remaining segments are supplementary. Hard clipping is used to avoid duplicating the SEQ field. See Output: SAM/BAM, headers, tags.
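
The chimera QC rule from the Chimera entry can be sketched in shell and awk. This illustrates the stated rule only and is not bwa-mem3's implementation: the function names are hypothetical, adjacent M/=/X operations are assumed to count as one run, and the read length is passed in by the caller rather than derived from the CIGAR.

```shell
# Longest contiguous run of M/=/X operations in a CIGAR string.
# Assumption: adjacent M/=/X operations are summed into a single run.
longest_match_run() {
  printf '%s\n' "$1" | awk '{
    s = $0; run = 0; best = 0
    while (match(s, /^[0-9]+[MIDNSHP=X]/)) {
      len = substr(s, 1, RLENGTH - 1) + 0   # operation length
      op  = substr(s, RLENGTH, 1)           # operation letter
      if (op == "M" || op == "=" || op == "X") run += len; else run = 0
      if (run > best) best = run
      s = substr(s, RLENGTH + 1)
    }
    print best
  }'
}

# Apply the rule to one alignment; prints "<flag> <mapq>".
# Arguments: <cigar> <read-length> <flag> <mapq>
chimera_qc() (
  cigar=$1
  read_len=$2
  flag=$3
  mapq=$4
  run=$(longest_match_run "$cigar")
  # Chimeric when the run is shorter than 44% of the read length.
  if [ $((run * 100)) -lt $((read_len * 44)) ]; then
    flag=$(( (flag | 512) & ~2 ))            # set 0x200 (QC fail), clear 0x2 (proper pair)
    if [ "$mapq" -gt 1 ]; then mapq=1; fi    # cap MAPQ at 1
  fi
  echo "$flag $mapq"
)
```

For a 100 bp read, chimera_qc 40M60S 100 99 60 prints 609 1 (0x200 set, 0x2 cleared, MAPQ capped), while chimera_qc 100M 100 99 60 leaves the record as 99 60.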


See also: Citation · License · Changelog · Output: SAM/BAM, headers, tags · What’s Different — Overview

Citation

How to cite

bwa-mem3 is a derivative of bwa-mem2. If you use bwa-mem3 in published work, please cite the original bwa-mem2 paper:

Vasimuddin Md, Sanchit Misra, Heng Li, Srinivas Aluru. Efficient Architecture-Aware Acceleration of BWA-MEM for Multicore Systems. IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2019. doi:10.1109/IPDPS.2019.00041

BibTeX:

@inproceedings{bwamem2-ipdps2019,
  author    = {Vasimuddin Md and Sanchit Misra and Heng Li and Srinivas Aluru},
  title     = {Efficient Architecture-Aware Acceleration of {BWA-MEM} for Multicore Systems},
  booktitle = {IEEE International Parallel and Distributed Processing Symposium (IPDPS)},
  year      = {2019},
  doi       = {10.1109/IPDPS.2019.00041},
  url       = {https://doi.org/10.1109/IPDPS.2019.00041}
}

Lineage

bwa-mem3 is maintained by Fulcrum Genomics as a derivative of bwa-mem2, itself derived from bwa (Li & Durbin, 2009). The BWA-MEM algorithm was originally described in:

Heng Li. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997, 2013.

The bwa-mem3-specific changes and improvements carried on top of bwa-mem2 are documented in What’s Different from bwa-mem2.


See also: License · Changelog · What’s Different — Overview · Related Projects: bwa-mem2

License

bwa-mem3 is licensed under the MIT License (same as upstream bwa-mem2).

                           The MIT License

   BWA-MEM2  (Sequence alignment using Burrows-Wheeler Transform),
   Copyright (C) 2019 Intel Corporation, Heng Li.

   Permission is hereby granted, free of charge, to any person obtaining
   a copy of this software and associated documentation files (the
   "Software"), to deal in the Software without restriction, including
   without limitation the rights to use, copy, modify, merge, publish,
   distribute, sublicense, and/or sell copies of the Software, and to
   permit persons to whom the Software is furnished to do so, subject to
   the following conditions:

   The above copyright notice and this permission notice shall be
   included in all copies or substantial portions of the Software.

   THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
   EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
   MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
   NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
   BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
   ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
   CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
   SOFTWARE.

Contacts: Vasimuddin Md <vasimuddin.md@intel.com>; Sanchit Misra <sanchit.misra@intel.com>;
                                Heng Li <hli@jimmy.harvard.edu>

See also: Citation · Changelog · Related Projects: bwa-mem2 · What’s Different — Overview

Changelog

Release 0.1.0-pre (2026-04-28)

  • Project renamed from bwa-mem2 to bwa-mem3. The new project tracks Fulcrum Genomics’ performance and feature work on top of the upstream bwa-mem2 codebase.
  • Default branch renamed from fg-main to main.
  • Binary renamed from bwa-mem2 to bwa-mem3. Arch-suffixed variants (bwa-mem3.sse41, .sse42, .avx, .avx2, .avx512bw, .arm64, .pgo, .profile, .lto) renamed to match.
  • @PG SAM header tags now read ID:bwa-mem3 PN:bwa-mem3 (and bwa-mem3-meth for --meth mode).
  • Test binaries renamed: bwa_mem2_tests_unit → bwa_mem3_tests_unit, bwa_mem2_tests_integration → bwa_mem3_tests_integration.
  • .bwt.2bit.64 index file format unchanged — bwa-mem3 reads indexes built by bwa-mem2 index without re-indexing.

Release 2.2.1 (17 March 2021)

Hotfix for v2.2: Fixed the bug mentioned in #135.

Release 2.2 (8 March 2021)

Changes since the last release (2.1):

  • Passed the validation test on ~88 billion reads (Credits: Keiran Raine, CASM division, Sanger Institute)
  • Fixed bugs reported in #109 causing mismatches between bwa-mem and bwa-mem2
  • Fixed the issue (#112) causing a crash due to a corrupted thread id
  • Using all the SSE flags to create optimized SSE41 and SSE42 binaries

Release 2.1 (16 October 2020)

Release 2.1 of BWA-MEM2.

Changes since the last release (2.0):

  • Smaller index: the index size on disk is down by 8 times and in memory by 4 times due to moving to only one type of FM-index (2bit.64 instead of 2bit.64 and 8bit.32) and 8x compression of suffix array. For example, for human genome, index size on disk is down to ~10GB from ~80GB and memory footprint is down to ~10GB from ~40GB. There is a substantial decrease in index IO time due to the reduction and hardly any performance impact on read mapping.

  • Added support for 2 more execution modes: sse4.2 and avx.

  • Fixed multiple bugs including those reported in Issues #71, #80 and #85.

  • Merged multiple pull requests.

Release 2.0 (9 July 2020)

This is the first production release of BWA-MEM2.

Changes since the last release:

  • Made the source code more secure with more than 300 changes all across it.

  • Added support for memory re-allocations in case the pre-allocated fixed memory is insufficient.

  • Added support for the MC tag in the SAM file and for the -5 and -q flags on the command line.

  • The output is now identical to the output of bwa-mem-0.7.17.

  • Merged index building code with FMI_Search class.

  • Added support for different ways to supply input read files; this now matches bwa-mem.

  • Fixed a bug in AVX512 sam processing part, which was leading to incorrect output.

Release 2.0pre2 (4 February 2020)

Miscellaneous changes:

  • Changed the license from GPL to MIT.

  • IMPORTANT: the index structure has changed since commit 6743183. Please rebuild the index if you are using a later commit or the new release.

  • Added charts in README.md comparing the performance of bwa-mem2 with bwa-mem.

Major code changes:

  • Fixed handling of variable-length reads.

  • Fixed a bug involving reads of length greater than 250bp.

  • Added support for allocation of more memory in small chunks if large pre-allocated fixed memory is insufficient. This is needed very rarely (thus, having no impact on performance) but prevents asserts from failing (code from crashing) in that scenario.

  • Fixed a memory leak due to not releasing the memory allocated for seeds after smem.

  • Fixed a segfault due to non-alignment of small allocated memory in the optimized banded Smith-Waterman.

  • Enabled working with genomes larger than 7-8 billion nucleotides (e.g. Wheat genome).

  • Fixed a segfault occurring (with the gcc compiler) while reading the index.


See also: Citation · License · What’s Different — Overview · Developer Guide — Release process