Bioinformatics Tutorial

Genome Assembly: from reads to contigs, scaffolds, and validation

Assembly is the art of reconstructing longer sequences from shorter reads. It is tempting to celebrate contiguity metrics, but a "better" assembly is only better if it is also biologically plausible, clean, and well validated.

Assembly strategies
| Strategy | Best when | Main challenge |
|---|---|---|
| Short-read de novo | Small genomes, limited budget | Repeats collapse easily |
| Long-read assembly | Structural resolution and repeat-rich genomes matter | Raw error rate may require polishing |
| Hybrid assembly | You want long-range contiguity and short-read polishing | Managing multiple error models and file types |
Metrics that actually matter
| Metric | What it tells you |
|---|---|
| N50 | Contiguity only; not correctness |
| L50 | The fewest contigs, taken longest first, that together cover half the assembly |
| Completeness | Whether expected conserved content is present |
| Contamination | Whether foreign sequence is mixed in |
| Read back-mapping | Whether the assembly is supported by the original reads |

An assembly with a very high N50 but poor completeness or contamination problems is not a good assembly.
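To make the two contiguity metrics concrete, here is a minimal sketch of how N50 and L50 can be computed from a list of contig lengths. The function name `n50_l50` and the toy lengths are illustrative assumptions, not part of any particular assembler's output.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths in bp."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("no contigs supplied")
    half = sum(lengths) / 2
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half:
            return length, count


# Hypothetical toy assembly: total 300 bp, so half is 150 bp.
lengths = [100, 80, 60, 40, 20]
print(n50_l50(lengths))  # (80, 2)
```

Neither number says anything about whether the contigs are correct, which is why the other rows in the metrics table still have to be checked.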

Assembly graph concepts

Each concept below describes what assemblers are trying to resolve when repeats, bubbles, and coverage patterns compete with one another.

Reads: the assembly only knows what was sequenced

Coverage depth, fragment length, read accuracy, and contamination all shape the graph before any algorithmic magic happens.

  • Low coverage creates fragmentation
  • Mixed samples or contamination create misleading branches
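A rough way to reason about whether coverage is likely to be the limiting factor is the idealized Lander-Waterman model, sketched below. It assumes reads are sampled uniformly at random from the genome, which real libraries only approximate, so treat the result as a lower bound on fragmentation rather than a prediction. The function names and the example numbers are hypothetical.

```python
import math


def expected_depth(n_reads, read_length, genome_size):
    """Mean coverage depth C = N * L / G."""
    return n_reads * read_length / genome_size


def expected_uncovered_fraction(depth):
    """Under uniform random sampling, a base is missed with probability e^(-C)."""
    return math.exp(-depth)


# Hypothetical run: one million 150 bp reads on a 5 Mb genome.
depth = expected_depth(1_000_000, 150, 5_000_000)
print(depth)  # 30.0
print(expected_uncovered_fraction(depth))
print(expected_uncovered_fraction(3.0))  # low coverage leaves a few percent uncovered
```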

k-mers: assemblies infer connectivity from overlapping sequence words

Short-read assemblers often use de Bruijn graphs. The choice of k affects repeat resolution, memory use, and sensitivity to sequencing errors.

  • If k is too small, unrelated sequences merge too easily because short words recur by chance
  • If k is too large, low-coverage regions fragment because fewer reads share an exact k-mer overlap
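The effect of k on graph structure can be seen in a toy de Bruijn graph. The sketch below is a simplified illustration, not how a production assembler stores its graph: it ignores reverse complements, coverage, and error correction, and the read and function names are made up for the example.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Nodes with more than one successor, i.e. places where the path is ambiguous."""
    return sorted(node for node, nxt in graph.items() if len(nxt) > 1)


# Hypothetical read in which the word "TGC" occurs twice.
reads = ["ATGCCTGCA"]
for k in (3, 5):
    print(k, branching_nodes(de_bruijn_edges(reads, k)))
# 3 ['GC']
# 5 []
```

With k = 3 the repeated word creates an ambiguous node; with k = 5 each node is unique and the path is unbranched. The cost of a larger k, not shown here, is that every k-mer overlap must be covered by at least one error-free read.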

Repeats: the main reason contigs break or collapse

When two genomic regions look the same to the data, the graph can join them incorrectly or fail to separate them. Long reads are often valuable here because they span repeats.

  • Collapsed repeats inflate apparent coverage
  • Repeat resolution often requires long-range information
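Because collapsed repeats inflate apparent coverage, a quick screen is to compare each contig's mean depth with the median depth across contigs. The sketch below assumes the median contig is single-copy and that coverage is otherwise fairly uniform; the `min_ratio` threshold, the function name, and the depth values are hypothetical.

```python
from statistics import median


def estimate_copy_number(contig_depths, min_ratio=1.5):
    """Return {contig: rounded depth ratio} for contigs well above median depth."""
    baseline = median(contig_depths.values())
    flagged = {}
    for name, depth in contig_depths.items():
        ratio = depth / baseline
        if ratio >= min_ratio:
            flagged[name] = round(ratio)
    return flagged


# Hypothetical mean depths per contig from read back-mapping.
depths = {"ctg1": 30.0, "ctg2": 31.0, "ctg3": 29.0, "ctg4": 92.0, "ctg5": 30.5}
print(estimate_copy_number(depths))  # {'ctg4': 3}
```

A flagged contig is only a candidate: elevated depth can also come from contamination or organellar sequence, so it is a prompt for inspection rather than a verdict.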

Bubbles: not every branch is an error

A bubble can represent sequencing error, heterozygosity, or strain variation. Assemblers need heuristics to decide which branches to keep or collapse.

  • Error bubbles are usually short and have low read support
  • Biological variation can create real alternative paths
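The kind of heuristic an assembler might apply to a bubble can be sketched as follows. This is an illustrative toy rule, not the logic of any specific assembler: each branch is described by a hypothetical `(length_bp, mean_coverage)` tuple, and both thresholds are arbitrary assumptions chosen for the example.

```python
def classify_bubble(branch_a, branch_b, max_error_length=50, min_minor_fraction=0.2):
    """Label a two-branch bubble using branch length and relative read support."""
    minor = min(branch_a, branch_b, key=lambda branch: branch[1])
    total = branch_a[1] + branch_b[1]
    if total == 0:
        return "unsupported"
    minor_fraction = minor[1] / total
    # Short, weakly supported branches look like sequencing errors.
    if minor_fraction < min_minor_fraction and minor[0] <= max_error_length:
        return "likely error"
    # Balanced support suggests a real alternative path (e.g. heterozygosity).
    return "possible variation"


print(classify_bubble((31, 2.0), (31, 38.0)))   # likely error
print(classify_bubble((31, 18.0), (31, 21.0)))  # possible variation
```

In a diploid sample a heterozygous site would be expected to split support roughly evenly, whereas strain mixtures can produce unbalanced but still real branches, so a fixed fraction threshold will misclassify some cases.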

Contigs and polishing: assembly is not done when the graph is traversed

After contigs are produced, polishing, contamination screening, and completeness assessment are essential. Otherwise, highly contiguous output may still be wrong.

  • Use read back-mapping to confirm support
  • Validate with BUSCO, contamination checks, and manual inspection
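Read back-mapping support is often summarized as a mapping rate. The sketch below computes one from plain SAM text using only the FLAG column: bit 0x4 marks an unmapped read, and bits 0x100 and 0x800 mark secondary and supplementary records, which are skipped so each read is counted once per primary record. The function name and the toy records are assumptions for illustration; real data would normally be handled with a dedicated alignment toolkit.

```python
def mapping_rate(sam_lines):
    """Fraction of primary SAM records that are mapped."""
    mapped = total = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])
        if flag & 0x100 or flag & 0x800:
            continue  # secondary or supplementary alignment
        total += 1
        if not flag & 0x4:
            mapped += 1
    return mapped / total if total else 0.0


# Hypothetical records: two mapped reads, one unmapped, one secondary hit.
records = [
    "@SQ\tSN:ctg1\tLN:100",
    "r1\t0\tctg1\t1\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t16\tctg1\t5\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r1\t256\tctg2\t9\t0\t4M\t*\t0\t0\t*\t*",
]
print(round(mapping_rate(records), 3))  # 0.667
```

A high global rate does not rule out local problems, which is why suspicious joins still need to be inspected individually.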
k-mer spectrum (example)

Multiple peaks can reflect heterozygosity, repeats, contamination, or a ploidy that differs from the one assumed.
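A k-mer spectrum is a histogram of how many distinct k-mers occur once, twice, and so on. The sketch below builds one from raw reads in memory; it is a simplification that does not merge reverse-complement k-mers and would not scale to real datasets, where a dedicated k-mer counter is the usual choice. The function name and reads are hypothetical.

```python
from collections import Counter


def kmer_spectrum(reads, k):
    """Return {multiplicity: number of distinct k-mers with that multiplicity}."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return dict(sorted(Counter(counts.values()).items()))


# Hypothetical reads; the third differs from the first at one position.
reads = ["ACGTAC", "CGTACG", "ACGTTC"]
print(kmer_spectrum(reads, 3))  # {1: 2, 2: 2, 3: 2}
```

In real data the low-multiplicity end of the spectrum is dominated by k-mers that contain sequencing errors, and the main peak sits near the k-mer coverage of single-copy sequence.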

Cumulative contig curve (example)

This view helps you see whether assembly size is dominated by a few long contigs or many short fragments.
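The data behind such a curve is just a running sum over contigs sorted from longest to shortest. A minimal sketch, with hypothetical function names and toy lengths:

```python
from itertools import accumulate


def cumulative_curve(contig_lengths):
    """Cumulative assembly size after adding contigs from longest to shortest."""
    return list(accumulate(sorted(contig_lengths, reverse=True)))


def fraction_in_top(contig_lengths, n):
    """Fraction of total assembly size held by the n longest contigs."""
    curve = cumulative_curve(contig_lengths)
    if not curve or n <= 0:
        return 0.0
    return curve[min(n, len(curve)) - 1] / curve[-1]


lengths = [40, 100, 20, 80, 60]  # hypothetical contig lengths in bp
print(cumulative_curve(lengths))    # [100, 180, 240, 280, 300]
print(fraction_in_top(lengths, 2))  # 0.6
```

A curve that rises steeply and then flattens indicates a few long contigs plus a tail of short fragments; a nearly straight line indicates many contigs of similar size.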

Validation stack you should report
  • Completeness: conserved gene content or lineage-specific expectations
  • Contamination: taxonomy-aware screening, GC/coverage outlier inspection
  • Support: read back-mapping rate and local inspection of suspicious joins
  • Biological plausibility: total size, GC%, chromosome count, and expected content
  • Polishing history: which reads and tools were used to correct residual errors
Common assembly failure modes

Coverage too low

Low-support regions fragment the graph and leave sequence missing from the assembly.

Contamination mixed in

Foreign reads can create extra contigs with different GC or coverage behavior.
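A simple first-pass screen for this failure mode is to flag contigs whose GC fraction or mean depth departs from the bulk of the assembly. The sketch below is an assumption-laden illustration: the thresholds `max_gc_diff` and `max_depth_fold`, the function names, and the toy contigs are all hypothetical, and it is no substitute for taxonomy-aware screening.

```python
from statistics import median


def gc_fraction(seq):
    """GC fraction over unambiguous bases (A, C, G, T) only."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def flag_outliers(contigs, max_gc_diff=0.10, max_depth_fold=2.0):
    """contigs maps name -> (sequence, mean_depth). Returns names of outliers."""
    gc = {name: gc_fraction(seq) for name, (seq, _) in contigs.items()}
    med_gc = median(gc.values())
    med_depth = median(depth for _, depth in contigs.values())
    flagged = []
    for name, (_, depth) in contigs.items():
        # Fold difference from the median depth, in either direction.
        fold = max(depth, med_depth) / max(min(depth, med_depth), 1e-9)
        if abs(gc[name] - med_gc) > max_gc_diff or fold > max_depth_fold:
            flagged.append(name)
    return flagged


# Hypothetical contigs: ctg3 is GC-rich and has much lower depth.
contigs = {
    "ctg1": ("ATGCATGCAT", 30.0),
    "ctg2": ("ATGCATATGC", 28.0),
    "ctg3": ("GCGCGCGGCC", 6.0),
    "ctg4": ("ATATGCGCAT", 31.0),
}
print(flag_outliers(contigs))  # ['ctg3']
```

Flagged contigs are candidates for follow-up, not automatic removal: collapsed repeats, plasmids, and organellar sequence can also differ in GC or depth while belonging to the sample.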

Chasing N50 only

Aggressive scaffolding can improve contiguity while introducing misjoins that invalidate downstream interpretation.