Alignment (BWA/Bowtie2) and What the Numbers Mean
Alignment places reads onto a reference genome or transcriptome. Treat each placement as a probabilistic statement rather than a fact, especially in repeats and low-complexity regions, where a read may fit several locations equally well.
Key concepts
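- Primary vs. secondary alignments: a multi-mapping read gets one primary record, and its alternative placements are flagged as secondary
- Soft clipping marks read ends that did not align to the reference (adapters, SVs, errors)
- MAPQ reflects placement ambiguity, not base-call quality
- A proper pair is one whose mates map with the expected orientation and insert size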
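The concepts above are encoded in two SAM fields: the bitwise FLAG and the Phred-scaled MAPQ. The following is a minimal sketch for decoding them by hand. The function names (`flag_has`, `describe_flag`, `mapq_to_perr`) are hypothetical helpers, not part of samtools. The bit values come from the SAM specification. Note that the MAPQ scale is aligner-specific (BWA-MEM tops out at 60, Bowtie2 at 42), so the error probability is best read as a rough guide.

```shell
# Return success if SAM FLAG value $1 has bit $2 set.
flag_has() {
    [ $(( $1 & $2 )) -ne 0 ]
}

# Print a few of the properties encoded in a SAM FLAG value.
describe_flag() {
    flag=$1
    out=""
    flag_has "$flag" 2    && out="$out proper_pair"
    flag_has "$flag" 4    && out="$out unmapped"
    flag_has "$flag" 256  && out="$out secondary"
    flag_has "$flag" 1024 && out="$out duplicate"
    flag_has "$flag" 2048 && out="$out supplementary"
    printf '%s\n' "${out# }"
}

# Convert a Phred-scaled MAPQ into the nominal probability
# that the reported position is wrong: P = 10^(-MAPQ/10).
mapq_to_perr() {
    awk -v q="$1" 'BEGIN { printf "%.4g\n", 10 ^ (-q / 10) }'
}
```

For example, `describe_flag 355` reports a secondary alignment in a proper pair, and `mapq_to_perr 20` corresponds to a nominal 1-in-100 chance of misplacement.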
Typical commands
# Index reference (BWA)
bwa index reference.fa
# Align paired-end reads
bwa mem -t 8 reference.fa trimmed_R1.fastq.gz trimmed_R2.fastq.gz \
| samtools sort -@ 4 -o sample.bam
# Quick stats
samtools flagstat sample.bam
samtools stats sample.bam | grep ^SN | cut -f 2-
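To turn the `samtools flagstat` report into numbers that can be compared across samples, the text output can be parsed. This is a rough sketch that assumes the usual `N + M <label>` line layout of flagstat. `flagstat_summary` is a hypothetical helper name, and the duplicate percentage is only meaningful if duplicates were marked beforehand.

```shell
# Read `samtools flagstat` text on stdin and print mapping and duplicate rates.
# Lines starting with "primary" (newer samtools versions) are skipped so that
# the totals are computed over all records.
flagstat_summary() {
    awk '
        / in total /                 { total = $1 }
        / duplicates$/ && !/primary/ { dup = $1 }
        / mapped \(/ && !/primary/   { mapped = $1 }
        END {
            if (total == 0) { print "no reads"; exit 1 }
            printf "mapped_pct=%.2f dup_pct=%.2f\n", 100 * mapped / total, 100 * dup / total
        }'
}
```

Typical use would be `samtools flagstat sample.bam | flagstat_summary`.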
Sanity checks
| Metric | Interpretation |
|---|---|
| % mapped | A low mapping rate may indicate contamination, the wrong reference, or low-quality reads |
| % duplicates | A high duplicate rate suggests low library complexity or over-amplification |
| Insert size | An unexpected distribution can signal library prep issues |
| Coverage uniformity | Uneven coverage suggests GC or capture bias, or mapping problems |
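As a quick check on the insert size row, the template length (TLEN, column 9 of a SAM record) can be summarized directly. This is a minimal sketch; `insert_size_mean` is a hypothetical helper name. It counts each proper pair once by keeping only primary records with a positive TLEN. A mean is sensitive to outliers, so treat it as a first look rather than a full distribution.

```shell
# Read SAM text on stdin and print the number of proper pairs and their mean TLEN.
insert_size_mean() {
    awk -F '\t' '
        /^@/ { next }   # skip header lines
        # proper pair (0x2), not secondary (0x100), not supplementary (0x800),
        # and positive TLEN so each pair contributes exactly once
        int($2 / 2) % 2 == 1 && int($2 / 256) % 2 == 0 && int($2 / 2048) % 2 == 0 && $9 > 0 {
            sum += $9; n++
        }
        END {
            if (n) printf "n=%d mean=%.1f\n", n, sum / n
            else   print "n=0"
        }'
}
```

Typical use would be `samtools view sample.bam | insert_size_mean`.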
Healthy
High mapping rate, insert sizes in the expected range, and even coverage.
Investigate
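Low-MAPQ reads pile up in repeats. Filter carefully: a strict MAPQ cutoff removes ambiguous placements but also discards real signal from repetitive regions.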
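One way to apply such a filter is sketched below. It works on SAM text so that the logic stays visible. `filter_sam` is a hypothetical helper name and the default threshold of 20 is an arbitrary illustration, not a recommendation. The sketch keeps header lines and drops unmapped, secondary, and supplementary records as well as anything below the MAPQ cutoff.

```shell
# Read SAM text on stdin; keep headers and primary, mapped records with
# MAPQ >= $1 (defaults to 20 when no argument is given).
filter_sam() {
    awk -F '\t' -v min_mapq="${1:-20}" '
        /^@/ { print; next }
        {
            flag = $2
            unmapped      = int(flag / 4) % 2      # bit 0x4
            secondary     = int(flag / 256) % 2    # bit 0x100
            supplementary = int(flag / 2048) % 2   # bit 0x800
            if (!unmapped && !secondary && !supplementary && $5 >= min_mapq) print
        }'
}
```

Typical use would be `samtools view -h sample.bam | filter_sam 30`. The same selection can be expressed natively as `samtools view -b -q 30 -F 0x904 sample.bam`, which is the faster choice on real data.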
Mapping summary (example)
Insert size distribution (example)
A note on reference choice
Always align to the correct genome build and use the matching annotation version. For RNA-seq, either align to the genome with a splice-aware aligner (STAR/HISAT2) or pseudoalign to a transcriptome (Salmon/Kallisto). Mixing builds invalidates coordinates and downstream interpretation.
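A build mismatch can often be caught by comparing the contig names and lengths in the BAM header against the reference index. The sketch below assumes a header saved with `samtools view -H sample.bam > header.sam` and an index created with `samtools faidx reference.fa`. `check_contigs` is a hypothetical helper name. Matching names and lengths do not prove the builds are identical, but a mismatch is a clear warning sign.

```shell
# Usage: check_contigs header.sam reference.fa.fai
# Prints one line per @SQ contig that is missing from the .fai or has a
# different length, and returns non-zero if any problem was found.
check_contigs() {
    awk -F '\t' '
        NR == FNR { ref_len[$1] = $2; next }   # first file: .fai (name, length, ...)
        $1 == "@SQ" {
            name = ""; len = ""
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^SN:/) name = substr($i, 4)
                if ($i ~ /^LN:/) len  = substr($i, 4)
            }
            if (!(name in ref_len)) {
                print "MISSING " name; bad = 1
            } else if (ref_len[name] != len) {
                print "LENGTH " name " bam=" len " ref=" ref_len[name]; bad = 1
            }
        }
        END { exit bad }' "$2" "$1"
}
```

A typical symptom of mixed builds or naming schemes is a `MISSING` line for every contig, for instance when one side uses `chr1` and the other uses `1`.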