Bioinformatics Tutorial

Read QC & Trimming

The goal of QC is not to make plots; it is to decide whether the data is usable and what preprocessing it needs. Good QC catches sample swaps, adapter contamination, low-complexity reads, and systematic quality decay.

What to look at first
  • Total reads and yield per sample/lane
  • Per-base quality profiles
  • Adapter/primer contamination signatures
  • Per-read GC distribution and outliers
  • Overrepresented sequences (possible contamination)
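
The first and fourth checks above can be approximated without any QC tool. The following is a minimal Python sketch, assuming a well-formed FASTQ with exactly four lines per record; the function names (read_fastq, gc_fraction, summarize) are illustrative and not part of any library.

```python
from collections import Counter


def read_fastq(handle):
    """Yield (name, sequence, quality) from a text handle.

    Assumption: strict 4-line records (no wrapped sequence lines).
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual


def gc_fraction(seq):
    """Fraction of G/C bases in one read; 0.0 for an empty read."""
    if not seq:
        return 0.0
    s = seq.upper()
    return (s.count("G") + s.count("C")) / len(s)


def summarize(handle):
    """Return (read count, total bases, histogram of per-read GC percent)."""
    n_reads = 0
    n_bases = 0
    gc_hist = Counter()
    for _name, seq, _qual in read_fastq(handle):
        n_reads += 1
        n_bases += len(seq)
        gc_hist[round(100 * gc_fraction(seq))] += 1
    return n_reads, n_bases, gc_hist
```

For a gzipped file, a handle opened with gzip.open(path, "rt") can be passed to summarize. A GC histogram with a second peak, or a read count far from the expected yield, is the kind of outlier worth following up.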

Typical tools

  • fastqc / multiqc for reporting
  • cutadapt, fastp, trimmomatic for trimming

Example commands

# QC
fastqc -t 4 sample_R1.fastq.gz sample_R2.fastq.gz
multiqc .

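# Adapter trimming (illustrative; ADAPTER_FWD and ADAPTER_REV are placeholders
# for your library's actual adapter sequences)
#   -j 4                 use 4 CPU cores
#   -a / -A              3' adapter for R1 / R2
#   -q 20,20             quality-trim the 5' and 3' ends at Q20
#   --minimum-length 50  drop pairs where a read is shorter than 50 nt after trimming
cutadapt -j 4 -a ADAPTER_FWD -A ADAPTER_REV \
  -q 20,20 --minimum-length 50 \
  -o trimmed_R1.fastq.gz -p trimmed_R2.fastq.gz \
  sample_R1.fastq.gz sample_R2.fastq.gz

How to think about Phred scores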

Phred scores convert error probabilities into an additive scale:

Q = -10 * log10(p_error)

This means going from Q20 to Q30 is a 10× reduction in error probability (from 1 error in 100 bases to 1 in 1,000). But trimming too aggressively can remove real signal, especially for low-input RNA-seq or ancient DNA.
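
To make the scale concrete, here is a small Python sketch of the conversion in both directions. It assumes quality strings use the Phred+33 ASCII encoding (the common convention for current Illumina FASTQ files); the function names are illustrative.

```python
import math


def phred_to_error(q):
    """Error probability for a Phred score: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Phred score for an error probability: Q = -10 * log10(p)."""
    return -10 * math.log10(p)


def decode_phred33(quality_string):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(ch) - 33 for ch in quality_string]


def expected_errors(quality_string):
    """Sum of per-base error probabilities: the expected number of
    miscalled bases in the read."""
    return sum(phred_to_error(q) for q in decode_phred33(quality_string))
```

The expected-error sum is often a more honest summary than the mean quality, because averaging Q values directly hides the occasional very poor base behind many good ones.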

Rule of thumb

Trim adapters confidently; trim qualities conservatively unless you have a reason.
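
For intuition about what a quality cutoff does, the sketch below implements a partial-sum 3' trimming rule modeled on the approach described in the cutadapt documentation: subtract the cutoff from every score, sum from the 3' end, and cut where that running sum is lowest. It is a simplified Python illustration, not the tool's actual code, and trim_point_3prime is a made-up name.

```python
def trim_point_3prime(qualities, cutoff):
    """Return how many bases to keep from the 5' end.

    qualities: list of integer Phred scores for one read.
    cutoff: quality threshold (for example 20).
    """
    running = 0                # sum of (q - cutoff) from the 3' end
    lowest = 0                 # lowest running sum seen so far
    keep = len(qualities)      # default: keep the whole read
    for i in range(len(qualities) - 1, -1, -1):
        running += qualities[i] - cutoff
        if running < lowest:
            lowest = running
            keep = i           # cutting here removes bases i..end
    return keep


# Example: a read whose tail decays, with one slightly better base in it.
scores = [42, 40, 26, 27, 8, 7, 11, 4, 2, 3]
kept = trim_point_3prime(scores, 10)  # keeps the first 4 bases
```

Because the rule works on sums rather than on single bases, one isolated base just above the cutoff inside a poor tail (the 11 in the example) does not stop the trimming. A lower cutoff keeps more of each read, which is the conservative choice the rule of thumb recommends.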

Watch out

If most reads require heavy trimming, check library prep and run quality.

Per-base quality summary (example)

Quality often decays toward the end of reads. The exact shape depends on platform and run conditions.
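
A per-cycle profile like this can be sketched in a few lines of Python. The example assumes Phred+33 quality strings and reports a plain mean per cycle, which is a simplification: reporting tools typically show medians and quartiles as well. The function name is illustrative.

```python
def mean_quality_per_cycle(quality_strings, offset=33):
    """Mean Phred score at each cycle (read position).

    Reads may differ in length; each cycle is averaged over the
    reads that are long enough to reach it.
    """
    totals = []
    counts = []
    for qs in quality_strings:
        for i, ch in enumerate(qs):
            if i == len(totals):   # first read to reach this cycle
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


# 'I' encodes Q40 and '5' encodes Q20 under Phred+33.
profile = mean_quality_per_cycle(["II5", "I5"])
```

A profile that falls steadily toward the last cycles is the usual end-of-read decay; a sudden drop at one cycle is more likely a run problem.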

Adapter signal across cycles (example)

A rising adapter fraction at later cycles suggests short inserts relative to read length.
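
The curve described here can be approximated as the cumulative fraction of reads in which the adapter has already started by a given cycle. The Python sketch below assumes exact substring matching only, so it will undercount: real trimmers tolerate mismatches and detect partial adapters at the read end. The function name and the short adapter used in the example are placeholders.

```python
def adapter_fraction_per_cycle(reads, adapter, read_length):
    """Cumulative fraction of reads containing the adapter at or
    before each cycle (0-based), using exact matching only."""
    total = len(reads)
    if total == 0:
        return [0.0] * read_length
    starts = [0] * read_length
    for seq in reads:
        pos = seq.find(adapter)    # -1 when the adapter is absent
        if 0 <= pos < read_length:
            starts[pos] += 1
    fractions = []
    seen = 0
    for count in starts:
        seen += count
        fractions.append(seen / total)
    return fractions


# Placeholder data: two of three reads run into the adapter.
example_reads = ["ACGTAGATCGGA", "ACAGATCGGAAG", "ACGTACGTACGT"]
curve = adapter_fraction_per_cycle(example_reads, "AGATCGGA", 12)
```

The position where the adapter starts equals the insert length for that read, so a curve that rises early means many inserts are much shorter than the read length.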

Decision matrix

Observation                      | Likely cause                    | Action
Sharp quality drop after cycle N | End-of-read decay               | Consider trimming last cycles or use a quality cutoff
Adapter peaks                    | Short inserts                   | Trim adapters; evaluate minimum length
Unexpected GC peak               | Contamination or biased capture | Screen contamination; check sample metadata
Many identical reads             | PCR duplicates / low complexity | Consider deduplication; assess library complexity

QC Nightmare Scenarios (When to re-sequence)

1. The "Sawtooth" Pattern

If per-base quality oscillates every few bases, it suggests a mechanical or optical failure during the run. This data is often unreliable for variant calling.

2. >90% Duplication Rate

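If you have 50M reads but fewer than 5M unique sequences, the duplication rate exceeds 90%: most of the sequencing effort went to re-reading the same molecules rather than sampling new ones. This often means the input DNA was too low or the library was over-cycled in PCR.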
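
The arithmetic behind that figure is simply one minus the ratio of unique to total reads. A rough Python sketch follows, assuming duplicates are defined by exact sequence identity; that is a simplification, since alignment-based tools define duplicates by mapping position, and holding every sequence in memory will not scale to a full run. The function name is illustrative.

```python
def duplication_rate(sequences):
    """Fraction of reads that are exact copies of an earlier read.

    Defined as 1 - unique / total; returns 0.0 for empty input.
    """
    total = 0
    unique = set()
    for seq in sequences:
        total += 1
        unique.add(seq)
    if total == 0:
        return 0.0
    return 1 - len(unique) / total


# Same ratio as 5M unique sequences in 50M reads, scaled down.
rate = duplication_rate(["ACGT"] * 9 + ["TTGA"] * 1 + ["ACGT"] * 10)
```

Here 20 reads contain 2 distinct sequences, so the rate is 1 - 2/20 = 0.9, the same 90% as in the 50M versus 5M case.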