Bioinformatics Tutorial

Pipelines & Reproducibility

A workflow is more than a list of commands. Good bioinformatics pipelines protect you from sample mix-ups, reference mismatches, environment drift, and undocumented threshold changes. If you cannot rerun it confidently, you do not fully understand it yet.

Why analyses become irreproducible
  • Sample names are edited manually instead of driven by a manifest.
  • Reference FASTA and annotation versions are not pinned together.
  • Tool versions drift over time because environments are not recorded.
  • QC thresholds are changed in ad hoc ways and never written down.
  • Intermediate files are regenerated without notes about which options changed.

Recommended project layout
project/
|-- config/
|   |-- samples.tsv
|   `-- references.yaml
|-- raw/
|-- trimmed/
|-- align/
|-- counts/
|-- qc/
|-- results/
`-- workflow/
    |-- Snakefile
    `-- envs/

The exact folder names do not matter as much as consistency and clarity. A new collaborator should be able to guess where each file belongs.
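One way to keep the layout consistent across projects is to script it. A minimal sketch, using the directory names from the tree above (adjust them to your own conventions):

```python
from pathlib import Path

# Directories from the layout above; edit to match your own conventions
LAYOUT = [
    "config",
    "raw",
    "trimmed",
    "align",
    "counts",
    "qc",
    "results",
    "workflow/envs",
]

def scaffold(project_root):
    """Create the standard project skeleton and return the created paths."""
    root = Path(project_root)
    created = []
    for rel in LAYOUT:
        d = root / rel
        d.mkdir(parents=True, exist_ok=True)
        created.append(d)
    return created
```

Because `mkdir` uses `exist_ok=True`, the script is safe to re-run on an existing project.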

The reproducibility stack

Reproducibility usually fails because one of these layers was skipped:

  • Sample sheet / manifest
  • Reference + annotation
  • Conda / container environment
  • Snakemake / Nextflow workflow
  • Logs + reports + checksums

Each layer is described below, along with what it protects you from.

Sample sheet: one source of truth for sample identity

A manifest turns filenames into structured metadata. If sample names, lanes, condition labels, and read paths are not centralized, manual mistakes spread quickly.

  • Track sample ID, condition, lane, read1/read2, batch, and notes
  • Keep machine-readable tables such as TSV/YAML, not prose alone
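A manifest loader can enforce these rules at the start of every run. A sketch assuming a tab-separated `samples.tsv` with the columns listed above (the column names here are illustrative):

```python
import csv

# Columns every row must carry; adapt to your own manifest schema
REQUIRED = ["sample_id", "condition", "batch", "read1", "read2"]

def load_manifest(path):
    """Load samples.tsv and fail fast on missing columns or duplicate IDs."""
    with open(path, newline="") as fh:
        rows = list(csv.DictReader(fh, delimiter="\t"))
    if not rows:
        raise ValueError("empty manifest")
    missing = [c for c in REQUIRED if c not in rows[0]]
    if missing:
        raise ValueError(f"manifest missing columns: {missing}")
    ids = [r["sample_id"] for r in rows]
    dupes = {s for s in ids if ids.count(s) > 1}
    if dupes:
        raise ValueError(f"duplicate sample_id values: {sorted(dupes)}")
    return rows
```

Failing loudly here is the point: a duplicated sample ID caught at load time never becomes a silently overwritten BAM file.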

Reference versions: genome build and annotation must travel together

A result is not reproducible if the reference sequence or gene model changed. Record exact FASTA, GTF, database, and index versions, plus checksums if possible.

  • Never write only "human genome" in a report
  • Document build, annotation release, and decoy / transcriptome choices
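Checksums turn "exact FASTA and GTF" into something verifiable. A minimal sketch using SHA-256 (the file names are placeholders, and `references.yaml` is the config file from the layout above):

```python
import hashlib
from pathlib import Path

def sha256sum(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large FASTA/GTF files fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def checksum_references(paths):
    """Return {filename: sha256} for a reference bundle, ready to record."""
    return {Path(p).name: sha256sum(p) for p in paths}
```

Recording these digests alongside the build and annotation release lets anyone confirm later that the same reference bundle was used.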

Environment: the same command can behave differently across versions

Pinning packages with Conda, Mamba, or containers reduces drift. A clean environment is often the difference between a pipeline that re-runs and one that mysteriously changes behavior.

  • Save environment YAMLs or container tags
  • Record --version for core tools in reports
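Recording `--version` output can itself be scripted. A sketch that shells out to each tool in a list (the tool list is an example; anything not on PATH is recorded as missing rather than crashing the report):

```python
import shutil
import subprocess

def record_versions(tools):
    """Capture `tool --version` for each tool; note tools that are absent."""
    versions = {}
    for tool in tools:
        if shutil.which(tool) is None:
            versions[tool] = "NOT FOUND"
            continue
        result = subprocess.run(
            [tool, "--version"], capture_output=True, text=True
        )
        # Some tools print their version to stderr instead of stdout
        output = (result.stdout or result.stderr).strip()
        versions[tool] = output.splitlines()[0] if output else "unknown"
    return versions
```

Dumping this dictionary into the final report makes "which samtools was this?" answerable a year later.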

Workflow engine: encode dependencies instead of relying on memory

Snakemake and Nextflow make inputs, outputs, and dependencies explicit. That helps you re-run only what changed and reduces accidental order-of-operations errors.

  • Rules / processes should have clear inputs, outputs, and logs
  • QC gates belong inside the workflow, not only in notebook comments
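At its core, "re-run only what changed" is a timestamp comparison along the file graph. A toy sketch of the idea both engines build on:

```python
from pathlib import Path

def is_stale(output, inputs):
    """An output needs rebuilding if it is missing or older than any input."""
    out = Path(output)
    if not out.exists():
        return True
    out_mtime = out.stat().st_mtime
    return any(Path(i).stat().st_mtime > out_mtime for i in inputs)
```

Real engines add much more (content hashing, params tracking, cluster dispatch), but this check is why editing one FASTQ re-triggers only its downstream rules.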

Audit trail: if a result cannot be explained, it cannot be trusted

Final reports should include sample manifests, tool versions, references, QC summaries, and reasoning for filters. Logs and checksums make reruns and troubleshooting much easier.

  • Archive logs, MultiQC, counts, VCF summary tables, and code revisions
  • Explain why thresholds were chosen, not just what they were
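The audit trail can be collected into one machine-readable record per run. A sketch (the field names and example values are illustrative, not a standard format):

```python
import json
import time

def write_run_report(path, manifest, versions, ref_checksums, filters):
    """Persist one JSON record tying results to inputs, versions, and filters."""
    report = {
        "generated": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "manifest": manifest,
        "tool_versions": versions,
        "reference_checksums": ref_checksums,
        # Each filter should record the threshold AND the reasoning behind it
        "filters": filters,
    }
    with open(path, "w") as fh:
        json.dump(report, fh, indent=2)
    return report
```

A report like this, archived next to the results, answers "which manifest, which reference, which thresholds, and why" in one file.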

Sample sheet anatomy

Column          Why it matters
sample_id       Stable ID used across all outputs
condition       Defines the biological comparison
batch           Lets you model or inspect technical structure
read1, read2    Exact raw input paths
notes           Room for anomalies, reruns, or exclusions

Tooling comparison

Approach        Strength                                        Weakness
Bash scripts    Simple and transparent for small jobs           Harder to scale and recover after changes
Snakemake       Python-friendly, explicit file graph            Can become messy if rules are not modular
Nextflow        Good for cloud/HPC and containerized workflows  More abstraction up front
Containers      Lock environment versions                       Still need clear references and manifests

Example QC gate waterfall

A good pipeline does not just run tools. It measures what survives each stage so you can explain where data were lost and whether that loss was expected.
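Tracking what survives each gate is a running tally. A sketch that computes per-stage retention from read counts (the stage names and numbers are made up for illustration):

```python
def qc_waterfall(stage_counts):
    """Given ordered (stage, reads) pairs, report retention relative to raw."""
    if not stage_counts:
        return []
    raw = stage_counts[0][1]  # first stage is taken as the raw input
    table = []
    prev = raw
    for stage, count in stage_counts:
        table.append({
            "stage": stage,
            "reads": count,
            "pct_of_raw": round(100.0 * count / raw, 1),
            "lost_here": prev - count,  # reads dropped at this gate
        })
        prev = count
    return table
```

Printed or written to the QC folder, a table like this makes "where did the reads go?" a one-glance question.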

Minimal Snakemake rule (illustrative)
rule fastqc:
    input:
        "raw/{sample}_R1.fastq.gz",
        "raw/{sample}_R2.fastq.gz"
    output:
        # FastQC names each report after its input file, so declare both
        "qc/{sample}_R1_fastqc.html",
        "qc/{sample}_R2_fastqc.html"
    log:
        "logs/fastqc/{sample}.log"
    shell:
        "fastqc -o qc {input} > {log} 2>&1"

Minimal Nextflow process (illustrative)
process FASTQC {
  tag "$sample_id"

  input:
    tuple val(sample_id), path(reads)

  output:
    // FastQC names each report after its read file, so match with a glob
    path "*_fastqc.html"

  script:
  """
  fastqc ${reads.join(' ')}
  """
}

Release checklist for a pipeline-driven analysis
  • Every output can be traced back to one sample sheet and one reference bundle.
  • Logs, MultiQC reports, and environment definitions are archived.
  • Filters and exclusions are described in plain English.
  • Checksums or version tags are stored for key reference files.
  • Final result tables are linked to the exact workflow revision that produced them.