Genome Assembly: from reads to contigs, scaffolds, and validation

Assembly is the art of reconstructing longer sequences from shorter reads. It is tempting to celebrate contiguity metrics, but a "better" assembly is only better if it is also biologically plausible, clean, and well validated.

Assembly strategies

Strategy	Best when	Main challenge
Short-read de novo	Small genomes, limited budget	Repeats collapse easily
Long-read assembly	Repeat-rich genomes, or when structural resolution matters	Raw error rate may require polishing
Hybrid assembly	You want long-range contiguity and short-read polishing	Managing multiple error models and file types

Metrics that actually matter

Metric	What it tells you
`N50`	Contiguity only; not correctness
`L50`	The smallest number of contigs whose combined length reaches half of the total assembly size
Completeness	Whether expected conserved content is present
Contamination	Whether foreign sequence is mixed in
Read back-mapping	Whether the assembly is supported by the original reads

An assembly with a very high N50 but poor completeness or contamination problems is not a good assembly.
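Because N50 and L50 are easy to misread, a small worked example can help. The sketch below assumes the common definition: sort contigs from longest to shortest and walk down the list until the running total reaches half of the assembly size. The function name `n50_l50` and the example lengths are made up for illustration.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig")
    half = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half:
            # `length` is the N50 contig; `count` contigs were needed (L50).
            return length, count


example_lengths = [100, 80, 60, 40, 20]  # made-up contig lengths
print(n50_l50(example_lengths))
```

Neither number says anything about whether the contigs are correct, which is why the other metrics in the table still have to be reported.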

Assembly graph concepts

The concepts below describe what assemblers are trying to resolve when repeats, bubbles, and coverage patterns compete with one another.

Reads: the assembly only knows what was sequenced

Coverage depth, fragment length, read accuracy, and contamination all shape the graph before any algorithmic magic happens.

Low coverage creates fragmentation
Mixed samples or contamination create misleading branches
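To make the link between coverage and fragmentation concrete, here is a rough sketch of the expected depth and the fraction of bases that receive no read at all. It assumes an idealized Poisson model in which reads land uniformly at random; real libraries have biases, so treat the result as an optimistic estimate. The function names and example numbers are illustrative only.

```python
import math


def expected_coverage(num_reads, read_length, genome_size):
    """Mean depth C = (number of reads * read length) / genome size."""
    return num_reads * read_length / genome_size


def fraction_uncovered(coverage):
    """Expected fraction of bases with zero reads under a Poisson model."""
    return math.exp(-coverage)


# Made-up example: one million 150 bp reads on a 5 Mb genome.
depth = expected_coverage(1_000_000, 150, 5_000_000)
print(depth, fraction_uncovered(depth))
# The same genome at much lower depth leaves far more bases unsampled.
print(fraction_uncovered(2.0))
```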

k-mers: assemblies infer connectivity from overlapping sequence words

Short-read assemblers often use de Bruijn graphs. The choice of k affects repeat resolution, memory use, and sensitivity to sequencing errors.

A k that is too small merges unrelated sequences too easily
A k that is too large fragments low-coverage regions
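The effect of k can be seen in a toy de Bruijn graph. The sketch below is a simplification, not how any particular assembler is implemented: it ignores reverse complements, coverage, and error correction. Nodes are (k-1)-mers and each k-mer contributes one edge. The two reads are made up and share only the short word `GCT`.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Nodes with more than one successor, where a path is ambiguous."""
    return sorted(node for node, nxt in graph.items() if len(nxt) > 1)


reads = ["ATGCTA", "CCGCTT"]  # two unrelated toy sequences
print(branching_nodes(build_de_bruijn(reads, 3)))  # small k: paths merge
print(branching_nodes(build_de_bruijn(reads, 5)))  # larger k: kept apart
```

With k = 3 the shared word joins the two reads into one tangled path; with k = 5 they stay separate, at the cost of needing longer exact overlaps between reads.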

Repeats: the main reason contigs break or collapse

When two genomic regions look the same to the data, the graph can join them incorrectly or fail to separate them. Long reads are often valuable here because they span repeats.

Collapsed repeats inflate apparent coverage
Repeat resolution often requires long-range information
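Inflated coverage can be turned into a rough copy-number guess. The sketch below assumes that most contigs are single-copy, so the median depth is a reasonable baseline; that assumption fails for highly repetitive or polyploid genomes. The function name and depth values are hypothetical.

```python
import statistics


def estimate_copy_number(contig_depths):
    """contig_depths: dict of contig name -> mean read depth.

    Returns each contig's depth as a rounded multiple of the median depth.
    """
    baseline = statistics.median(contig_depths.values())
    if baseline <= 0:
        raise ValueError("median depth must be positive")
    return {name: round(depth / baseline)
            for name, depth in contig_depths.items()}


# Made-up depths: ctg4 sits at roughly three times the typical depth.
depths = {"ctg1": 30, "ctg2": 31, "ctg3": 29, "ctg4": 92}
print(estimate_copy_number(depths))
```

A contig estimated at three copies is a candidate collapsed repeat, worth checking against long reads or other long-range data before trusting the assembly in that region.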

Bubbles: not every branch is an error

A bubble can represent sequencing error, heterozygosity, or strain variation. Assemblers need heuristics to decide which branches to keep or collapse.

Error bubbles are usually short and have low read support
Biological variation can create real alternative paths
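The two observations above can be expressed as a toy decision rule. This is only an illustration of the kind of heuristic involved: the function name, the depth ratio of 0.2, and the length cutoff of 100 bp are all assumptions made up for this sketch and are not the defaults of any real assembler.

```python
def classify_bubble(major_depth, minor_depth, minor_length,
                    min_ratio=0.2, max_error_length=100):
    """Label the weaker branch of a simple two-path bubble.

    Thresholds are illustrative assumptions, not tuned values.
    """
    if major_depth <= 0:
        raise ValueError("major_depth must be positive")
    ratio = minor_depth / major_depth
    if ratio < min_ratio and minor_length <= max_error_length:
        return "likely error"      # short and poorly supported
    return "possible variant"      # balanced support or a long branch


print(classify_bubble(major_depth=40, minor_depth=2, minor_length=31))
print(classify_bubble(major_depth=22, minor_depth=18, minor_length=31))
```

In a heterozygous diploid sample, the second case (two branches with similar support) is what a real alternative path tends to look like, so collapsing it would discard biological signal.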

Contigs and polishing: assembly is not done when the graph is traversed

After contigs are produced, polishing, contamination screening, and completeness assessment are essential. Otherwise, a highly contiguous assembly may still be wrong.

Use read back-mapping to confirm support
Validate with BUSCO, contamination checks, and manual inspection
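As a sketch of the back-mapping check, the function below computes a mapping rate from SAM-format alignment records using only the FLAG column. It assumes the reads have already been aligned to the assembly with a mapper of your choice, and it relies on the standard SAM flag bits: 0x4 (unmapped), 0x100 (secondary), and 0x800 (supplementary). The function name and example records are made up.

```python
def mapping_rate(sam_lines):
    """Fraction of primary SAM records that are mapped."""
    total = mapped = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        if flag & 0x100 or flag & 0x800:
            continue  # count each read once: skip secondary/supplementary
        total += 1
        if not flag & 0x4:
            mapped += 1
    return mapped / total if total else 0.0


example = [
    "@HD\tVN:1.6",
    "r1\t0\tctg1\t1\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r3\t16\tctg1\t5\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t2064\tctg2\t9\t60\t4M\t*\t0\t0\tACGT\tIIII",
]
print(mapping_rate(example))
```

A high overall rate is necessary but not sufficient; local drops in coverage or clusters of clipped reads around a join still deserve manual inspection.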

k-mer spectrum (example)

Multiple peaks can reflect heterozygosity, repeats, contamination, or a ploidy that differs from what was assumed.
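For readers who have not built one, a k-mer spectrum is simply a histogram of how many distinct k-mers occur once, twice, three times, and so on. The sketch below computes it for a handful of toy reads. It is a simplification: it does not merge a k-mer with its reverse complement, which dedicated k-mer counters normally do, and the function name and reads are made up.

```python
from collections import Counter


def kmer_spectrum(reads, k):
    """Return {multiplicity: number of distinct k-mers seen that often}."""
    counts = Counter(
        read[i:i + k]
        for read in reads
        for i in range(len(read) - k + 1)
    )
    return dict(sorted(Counter(counts.values()).items()))


toy_reads = ["ACGTAC", "CGTACG", "ACGAAC"]
print(kmer_spectrum(toy_reads, 3))
```

On real data the multiplicity-1 bin is usually dominated by sequencing errors, and the peaks at higher multiplicities are the ones interpreted as described above.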

Cumulative contig curve (example)

This view helps you see whether assembly size is dominated by a few long contigs or many short fragments.
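The data behind such a curve is a running total over contigs sorted from longest to shortest. A minimal sketch, with a made-up function name and lengths:

```python
from itertools import accumulate


def cumulative_curve(contig_lengths):
    """Cumulative assembly size after adding contigs longest-first."""
    return list(accumulate(sorted(contig_lengths, reverse=True)))


print(cumulative_curve([20, 100, 40, 80, 60]))
```

A curve that climbs steeply and then flattens means a few long contigs hold most of the assembly; a long, slow tail means much of the sequence sits in short fragments.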

Validation stack you should report

Completeness: conserved gene content or lineage-specific expectations
Contamination: taxonomy-aware screening, GC/coverage outlier inspection
Support: read back-mapping rate and local inspection of suspicious joins
Biological plausibility: total size, GC%, chromosome count, and expected content
Polishing history: which reads and tools were used to correct residual errors

Common assembly failure modes

Coverage too low

Low-support regions fragment the graph and leave gaps where sequence is missing.

Contamination mixed in

Foreign reads can create extra contigs with different GC or coverage behavior.
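A first-pass screen for this can be as simple as flagging contigs whose GC content sits far from the rest. The sketch below uses an unweighted median across contigs and a fixed deviation of 0.10; both choices, like the function names and sequences, are assumptions for illustration. It does not replace taxonomy-aware screening, and genuine genomic regions (for example organellar sequence) can also have unusual GC.

```python
import statistics


def gc_fraction(seq):
    """GC fraction over unambiguous bases (A, C, G, T) only."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def gc_outliers(contigs, max_deviation=0.10):
    """Names of contigs whose GC differs from the median by more than
    max_deviation (an illustrative threshold, not a standard)."""
    gc = {name: gc_fraction(seq) for name, seq in contigs.items()}
    centre = statistics.median(gc.values())
    return sorted(name for name, value in gc.items()
                  if abs(value - centre) > max_deviation)


toy_contigs = {
    "ctg1": "ATGCATGCAT",
    "ctg2": "ATATGCGCAT",
    "ctg3": "GCGCGCGCGG",  # much higher GC than the others
}
print(gc_outliers(toy_contigs))
```

Combining this with per-contig read depth gives the GC/coverage view, in which contaminant contigs often form a separate cluster.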

Chasing N50 only

Aggressive scaffolding can improve contiguity while introducing misjoins that invalidate downstream interpretation.