Bioinformatics Tutorial
Interactive | practical | reference-safe

Bioinformatics, from raw reads to interpretable biology

This tutorial site is designed to help learners move beyond copy-pasting commands. Each section explains what the files mean, why the metrics change, where errors enter a workflow, and how to report results responsibly.

10+ guided lessons and sandboxes
What this site teaches
  • How FASTQ, BAM, VCF, GTF, and count matrices connect to one another
  • How to diagnose QC problems before they poison downstream analysis
  • How to choose between alignment, assembly, quantification, and filtering strategies
  • How to turn outputs into defensible biological conclusions
Best way to study
  1. Read one concept page.
  2. Open the charts and interactive panels.
  3. Try the related browser exercise.
  4. Write one paragraph explaining the outputs in your own words.
Core file types
File      Role
FASTQ     Raw reads with qualities
BAM/CRAM  Reads placed on a reference
VCF       Observed sequence variation
GTF/GFF   Gene/transcript models
Matrix    Gene/cell/sample summaries for statistics
Interactive workflow atlas

Click a stage to see what question it answers, what file usually comes out, and which later lesson builds on it. This gives you a visual map of the entire learning path.

Stages shown in the atlas: FASTQ, metadata, QC, trimming, alignment, assembly, VCF, counts, statistics + biology

Inputs & formats

Every workflow begins with validating what you actually received: correct sample sheet, consistent filenames, paired-end structure, and the right reference files. Many downstream mistakes are really metadata mistakes.
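The paired-end check described above can be sketched in a few lines of shell. This is a minimal illustration, not a full validator: the scratch directory, the one-column `samples.tsv`, and the `<id>_R1.fastq.gz` / `<id>_R2.fastq.gz` naming scheme are all assumptions made for the example.

```shell
# Hypothetical layout: a scratch directory with empty toy files standing in for real data.
workdir=$(mktemp -d)
printf 'sample_id\nS1\nS2\n' > "$workdir/samples.tsv"
touch "$workdir/S1_R1.fastq.gz" "$workdir/S1_R2.fastq.gz" "$workdir/S2_R1.fastq.gz"

# Walk the sample sheet (skipping the header line) and collect any missing mate file.
# Assumes sample IDs contain no whitespace.
missing=""
for id in $(tail -n +2 "$workdir/samples.tsv"); do
  for mate in R1 R2; do
    f="$workdir/${id}_${mate}.fastq.gz"
    [ -e "$f" ] || missing="$missing ${id}_${mate}"
  done
done

echo "missing files:${missing:- none}"
```

Here S2 has no R2 file, so the report lists `S2_R2`. With real data you would point the loop at your own sample sheet and naming convention.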

Continue with Data formats and Getting started.

QC & trimming

At this stage you ask whether the data are usable, contaminated, adapter-rich, low-complexity, or damaged. The goal is not to beautify reads; it is to make principled preprocessing decisions.

Continue with Read QC & Trimming.

Mapping / assembly

Once reads are clean enough, you either place them on a reference or reconstruct longer sequences from the data. The right choice depends on the biological question and the availability of trustworthy references.

Continue with Alignment, Assembly, and Pipelines.

Features & counts

Mapped reads become counts, variants, transcripts, taxa, or structural evidence. This is where raw sequence data start turning into data tables suitable for modeling and interpretation.

Continue with Variant calling, RNA-seq, and Metagenomics.

Interpretation

Statistics does not replace biological reasoning. The last step is explaining what the outputs mean, what assumptions were made, and which limitations still matter.

Use Resources & Practice to structure reports and keep learning.

Three recommended learning tracks
Track B

Expression and transcriptomics

If your goal is gene regulation, treatment response, or cell states, focus on RNA-seq and single-cell interpretation after mastering QC basics.

QC -> RNA-seq -> Single-cell RNA-seq

Track C

Reproducible workflows

For lab infrastructure, focus on project structure, workflow engines, versioning, and reporting so analyses remain reusable and auditable.

Pipelines -> Resources & Practice -> Tools reference

Read length distribution (example)

Read length distributions after trimming help you see whether preprocessing removed only a tail of low-quality bases or whether a large fraction of reads was heavily shortened.
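One way to produce such a length table without extra tools is a short awk tally. The sketch below builds a three-read toy FASTQ so it is self-contained; the file and the variable names are illustrative, and it assumes standard four-line records.

```shell
# Toy FASTQ with three reads (lengths 8, 8, and 4); real input would be your trimmed file.
fq=$(mktemp)
printf '@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nGGGGCCCC\n+\nIIIIIIII\n@r3\nACGT\n+\nIIII\n' > "$fq"

# Sequence lines are line 2 of every 4-line record; tally how often each length occurs.
length_table=$(awk 'NR % 4 == 2 { n[length($0)]++ }
                    END { for (l in n) print l "\t" n[l] }' "$fq" | sort -n)

# Columns: read length, number of reads.
echo "$length_table"
```

For a gzipped file, the same awk program can read from `gzip -dc sample_R1.fastq.gz` on standard input. Comparing the table before and after trimming shows how many reads were shortened and by how much.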

GC content distribution (example)

GC shifts can indicate contamination, protocol bias, or expected organism-specific composition. The lesson is not the shape alone, but whether it matches what you expect for your organism and library type.
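Per-read GC fractions, the raw material for a plot like this, can be computed with awk alone. This is a rough sketch on a toy file, assuming four-line FASTQ records; dedicated QC tools report the same quantity with proper binning.

```shell
# Toy FASTQ: one all-GC read, one all-AT read, one balanced read.
fq=$(mktemp)
printf '@r1\nGGCC\n+\nIIII\n@r2\nATAT\n+\nIIII\n@r3\nACGT\n+\nIIII\n' > "$fq"

# On each non-empty sequence line, gsub() deletes G/C bases and returns how many it removed,
# so dividing by the original length gives the GC fraction of that read.
gc_table=$(LC_ALL=C awk 'NR % 4 == 2 && length($0) > 0 {
                           len = length($0)
                           gc = gsub(/[GCgc]/, "")
                           printf "%.2f\n", gc / len
                         }' "$fq")

echo "$gc_table"
```

Piping the output through `sort | uniq -c` gives a crude histogram that you can compare against the GC content you expect.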

Where time goes in a typical DNA pipeline

This overview helps learners understand why workflow engines and checkpointing matter. Expensive steps deserve clear logs, versioning, and reusable outputs.
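The idea behind checkpointing can be shown with a small shell helper. This is a hand-rolled sketch, not how any particular workflow engine works: `run_step`, `slow_step`, and the `.done` marker files are hypothetical names chosen for the example.

```shell
workdir=$(mktemp -d)
runs=0

# run_step NAME CMD...: run CMD only if NAME has no ".done" marker yet.
# Output goes to NAME.log; the marker is written only when CMD succeeds.
run_step() {
  name=$1; shift
  if [ -e "$workdir/$name.done" ]; then
    echo "skip $name (already done)"
    return 0
  fi
  "$@" > "$workdir/$name.log" 2>&1 && touch "$workdir/$name.done"
}

# Stand-in for an expensive step such as alignment; it only counts its own invocations.
slow_step() { runs=$((runs + 1)); echo "finished"; }

run_step align slow_step   # first call runs the step and writes the marker
run_step align slow_step   # second call finds the marker and skips the work
echo "slow_step ran $runs time(s)"
```

Workflow engines add dependency tracking, input change detection, and environment versioning on top of this basic skip-if-done pattern, which is why they are worth learning for anything beyond a few steps.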

Good habits from day one
  • Keep a sample sheet with stable IDs and exact file paths.
  • Write down reference genome and annotation versions.
  • Save the plots that drove each QC decision.
  • Use one notebook or report per project, not scattered screenshots.
  • Prefer small reproducible workflows over one giant undocumented command history.
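As a small illustration of the first habit, the sketch below checks a sample sheet for duplicated IDs. The tab-separated layout, the column names, and the paths are invented for the example.

```shell
# Hypothetical sample sheet: ID in column 1, FASTQ path in column 2; S1 appears twice.
sheet=$(mktemp)
printf 'sample_id\tfastq_r1\nS1\t/data/S1_R1.fastq.gz\nS2\t/data/S2_R1.fastq.gz\nS1\t/data/S1b_R1.fastq.gz\n' > "$sheet"

# Skip the header, take the ID column, and list every ID that occurs more than once.
dup_ids=$(tail -n +2 "$sheet" | cut -f1 | sort | uniq -d)

if [ -n "$dup_ids" ]; then
  echo "duplicate sample IDs: $dup_ids"
else
  echo "all sample IDs are unique"
fi
```

Running a check like this before any analysis starts is cheap, and it catches a class of errors that is hard to spot once results from two samples have been merged under one ID.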
First commands to memorize
# Inspect FASTQ header patterns
zcat sample_R1.fastq.gz | head

# Count reads (4 lines per read; assumes standard, unwrapped FASTQ records)
echo $(( $(zcat sample_R1.fastq.gz | wc -l) / 4 ))

# Look at BAM summary metrics
samtools flagstat sample.bam

# Inspect VCF header and first records
bcftools view -h sample.vcf.gz | head
bcftools view -H sample.vcf.gz | head
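The read count above is only meaningful if the file really consists of complete four-line records. A quick structural check can be sketched with awk; the toy file and the report format are illustrative, and this is not a substitute for a dedicated validator.

```shell
# Toy FASTQ in which the second read has 3 bases but 4 quality characters.
fq=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nACG\n+\nIIII\n' > "$fq"

# Check three things: headers start with "@", sequence and quality lengths agree,
# and the total line count is a multiple of 4 (otherwise the file is likely truncated).
fastq_report=$(awk '
  NR % 4 == 1 && substr($0, 1, 1) != "@" { bad_header++ }
  NR % 4 == 2 { seq_len = length($0) }
  NR % 4 == 0 && length($0) != seq_len { bad_len++ }
  END {
    printf "lines=%d truncated=%d bad_header=%d length_mismatch=%d\n",
           NR, (NR % 4 != 0), bad_header, bad_len
  }' "$fq")

echo "$fastq_report"
```

For compressed input, feed the same awk program from `gzip -dc sample_R1.fastq.gz`. Any non-zero value after `lines=` deserves a look before you continue to QC.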