Genome Assembly: from reads to contigs, scaffolds, and validation
Assembly is the art of reconstructing longer sequences from shorter reads. It is tempting to celebrate contiguity metrics, but a "better" assembly is only better if it is also biologically plausible, clean, and well validated.
Assembly strategies
Strategy | Best when | Main challenge
Short-read de novo | Small genomes, limited budget | Repeats collapse easily
Long-read assembly | Structural resolution and repeat-rich genomes matter | Raw error rate may require polishing
Hybrid assembly | You want long-range contiguity and short-read polishing | Managing multiple error models and file types
Metrics that actually matter
Metric | What it tells you
N50 | Contiguity only, not correctness: the contig length at which contigs of that length or longer cover half the assembly
L50 | The smallest number of contigs whose combined length reaches half the assembly
Completeness | Whether expected conserved content is present
Contamination | Whether foreign sequence is mixed in
Read back-mapping | Whether the assembly is supported by the original reads
An assembly with a very high N50 but poor completeness or contamination problems is not a good assembly.
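To make the two contiguity metrics concrete, here is a minimal sketch that computes N50 and L50 from a list of contig lengths. The function name `n50_l50` and the example lengths are illustrative assumptions, not part of any particular assembler's output.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths.

    N50: length of the contig at which the running total, taken from
         longest to shortest, first reaches half the assembly size.
    L50: how many contigs were needed to get there.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("no contigs supplied")
    half = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half:
            return length, count


# Hypothetical assembly of five contigs, 300 bp in total.
print(n50_l50([100, 80, 60, 40, 20]))  # (80, 2)
```

Neither number says anything about whether the contigs are correct, which is why they are reported alongside completeness, contamination, and read support.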
Interactive assembly graph explorer
Each concept below shows what assemblers are trying to resolve when repeats, bubbles, and coverage patterns compete with one another.
Reads: the assembly only knows what was sequenced
Coverage depth, fragment length, read accuracy, and contamination all shape the graph before any algorithmic magic happens.
Low coverage creates fragmentation
Mixed samples or contamination create misleading branches
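As a rough illustration of why low coverage fragments an assembly, the sketch below estimates mean coverage and the fraction of bases expected to receive no reads at all. It assumes an idealised model in which reads land uniformly at random (a Poisson approximation); real libraries deviate from this, and the function names are hypothetical.

```python
import math


def expected_coverage(n_reads, read_length, genome_size):
    """Mean depth if reads were spread evenly over the genome."""
    return n_reads * read_length / genome_size


def fraction_uncovered(coverage):
    """Expected fraction of bases with zero reads under a Poisson model."""
    return math.exp(-coverage)


# Hypothetical run: one million 150 bp reads on a 5 Mb genome.
depth = expected_coverage(1_000_000, 150, 5_000_000)
print(depth)  # 30.0

# Compare how much of the genome is expected to be missed at low depth.
for c in (2, 5, 30):
    print(c, fraction_uncovered(c))
```

Under this simplified model, every uncovered stretch is a place where the graph must break, so the expected number of gaps falls quickly as depth rises.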
k-mers: assemblies infer connectivity from overlapping sequence words
Short-read assemblers often use de Bruijn graphs. The choice of k affects repeat resolution, memory use, and sensitivity to sequencing errors.
A k that is too small merges unrelated sequence too easily
A k that is too large fragments low-coverage regions
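The effect of k on graph structure can be seen in a toy de Bruijn graph. This is a minimal sketch, assuming error-free reads and ignoring reverse complements and k-mer counts, both of which real assemblers handle; the reads and function names are made up for illustration.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Nodes with more than one successor, i.e. places the path is ambiguous."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)


# Two unrelated toy sequences that happen to share the motif "GTT".
reads = ["ACGTTCA", "TGGTTAG"]

print(branching_nodes(de_bruijn(reads, 3)))  # ['TT']
print(branching_nodes(de_bruijn(reads, 5)))  # []
```

With k = 3 the shared motif fuses the two sequences at a common node, creating a branch; with k = 5 they stay separate. The cost of a larger k is that each read yields fewer k-mers, so thinly covered regions are more likely to lose the overlaps that connect them.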
Repeats: the main reason contigs break or collapse
When two genomic regions look the same to the data, the graph can join them incorrectly or fail to separate them. Long reads are often valuable here because they span repeats.
Collapsed repeats inflate apparent coverage
Repeat resolution often requires long-range information
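Because a collapsed repeat attracts reads from every copy, its depth relative to the rest of the assembly hints at how many copies were merged. The sketch below is a simplified heuristic, assuming most contigs are single-copy and coverage is roughly uniform; the contig names and depths are invented.

```python
from statistics import median


def estimate_copy_number(contig_depths, contig):
    """Estimate copies collapsed into `contig` from its depth relative to
    the median depth across all contigs (taken as the single-copy baseline)."""
    baseline = median(contig_depths.values())
    return round(contig_depths[contig] / baseline)


# Hypothetical mean read depths per contig.
depths = {"contig_1": 30, "contig_2": 31, "contig_3": 29, "contig_4": 92}

print(estimate_copy_number(depths, "contig_4"))  # 3
print(estimate_copy_number(depths, "contig_1"))  # 1
```

A depth about three times the baseline suggests three copies collapsed into one contig, although contamination or plasmid sequence can produce the same signal.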
Bubbles: not every branch is an error
A bubble can represent sequencing error, heterozygosity, or strain variation. Assemblers need heuristics to decide which branches to keep or collapse.
Error bubbles usually have low read support and are short
Biological variation can create real alternative paths
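The kind of heuristic described above can be sketched as a toy rule that looks at a branch's share of the read support and its length. The thresholds (`min_ratio`, and "short" meaning at most 2k nodes) are illustrative assumptions, not values taken from any specific assembler.

```python
def classify_bubble_branch(branch_cov, alt_cov, branch_len, k, min_ratio=0.1):
    """Label one branch of a two-branch bubble.

    branch_cov / alt_cov: read support for this branch and the other one.
    branch_len: number of nodes on this branch.
    A single substitution error typically disturbs about k consecutive
    k-mers, so error branches tend to be short and weakly supported.
    """
    total = branch_cov + alt_cov
    if total == 0:
        raise ValueError("bubble has no read support")
    ratio = branch_cov / total
    is_short = branch_len <= 2 * k  # hypothetical cutoff
    if ratio < min_ratio and is_short:
        return "likely error"
    return "possible variant"


print(classify_bubble_branch(2, 40, 21, k=21))   # likely error
print(classify_bubble_branch(18, 22, 21, k=21))  # possible variant
```

Two branches with roughly balanced support look more like heterozygosity or strain variation than like sequencing error, so a rule of this kind keeps them for later inspection rather than collapsing them.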
Contigs and polishing: assembly is not done when the graph is traversed
After contigs are produced, polishing, contamination screening, and completeness assessment are essential. Otherwise, highly contiguous output may still be wrong.
Use read back-mapping to confirm support
Validate with BUSCO, contamination checks, and manual inspection
k-mer spectrum (example)
Multiple peaks can reflect heterozygosity, repeats, contamination, or a ploidy that differs from what was assumed.
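For readers who want to see what such a spectrum is made of, here is a minimal sketch that counts k-mers in a set of reads and then tallies how many distinct k-mers occur at each multiplicity. It does not merge reverse-complement k-mers and would be far too slow for real data, where dedicated k-mer counters are used; the function name and the toy read are assumptions.

```python
from collections import Counter


def kmer_spectrum(reads, k):
    """Return {multiplicity: number of distinct k-mers seen that many times}."""
    kmer_counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return Counter(kmer_counts.values())


# Toy read in which "ACGT" occurs twice and three other 4-mers occur once.
spectrum = kmer_spectrum(["ACGTACGT"], k=4)
print(sorted(spectrum.items()))  # [(1, 3), (2, 1)]
```

Plotting multiplicity on the x-axis against the number of distinct k-mers on the y-axis gives the spectrum; the peaks discussed above are bumps in that histogram.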
Cumulative contig curve (example)
This view helps you see whether assembly size is dominated by a few long contigs or many short fragments.
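The data behind such a curve is simple to derive from contig lengths. A minimal sketch, assuming lengths are already available as a list (the function name and values are hypothetical):

```python
from itertools import accumulate


def cumulative_contig_curve(contig_lengths):
    """Running total of assembly size, adding contigs from longest to shortest."""
    return list(accumulate(sorted(contig_lengths, reverse=True)))


curve = cumulative_contig_curve([20, 100, 40, 80, 60])
print(curve)  # [100, 180, 240, 280, 300]

# Share of the assembly held by the two longest contigs.
print(curve[1] / curve[-1])  # 0.6
```

A curve that climbs steeply and then flattens means a few long contigs carry most of the assembly; a slow, steady climb means it is spread across many short fragments.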
Validation stack you should report
Completeness: conserved gene content or lineage-specific expectations