Bioinformatics Tutorial

Pipelines & Reproducibility

A workflow is more than a list of commands. Good bioinformatics pipelines protect you from sample mix-ups, reference mismatches, environment drift, and undocumented threshold changes. If you cannot rerun it confidently, you do not fully understand it yet.

Why analyses become irreproducible
  • Sample names are edited manually instead of driven by a manifest.
  • Reference FASTA and annotation versions are not pinned together.
  • Tool versions drift over time because environments are not recorded.
  • QC thresholds are changed in ad hoc ways and never written down.
  • Intermediate files are regenerated without notes about which options changed.

Recommended project layout
project/
|-- config/
|   |-- samples.tsv
|   `-- references.yaml
|-- raw/
|-- trimmed/
|-- align/
|-- counts/
|-- qc/
|-- results/
`-- workflow/
    |-- Snakefile
    `-- envs/

The exact folder names do not matter as much as consistency and clarity. A new collaborator should be able to guess where each file belongs.
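One way to keep the layout consistent across projects is to script it. A minimal sketch, using the directory names from the tree above (adjust them to your own conventions):

```python
from pathlib import Path

# Directories from the layout above; edit to match your own conventions
LAYOUT = [
    "config",
    "raw",
    "trimmed",
    "align",
    "counts",
    "qc",
    "results",
    "workflow/envs",
]

def scaffold(project_root):
    """Create the standard project skeleton and return the created paths."""
    root = Path(project_root)
    created = []
    for rel in LAYOUT:
        d = root / rel
        d.mkdir(parents=True, exist_ok=True)
        created.append(d)
    return created
```

Because `mkdir` uses `exist_ok=True`, the script is safe to re-run on an existing project.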

The reproducibility stack

Reproducibility usually fails because one of these layers was skipped:

  • Sample sheet / manifest
  • Reference + annotation
  • Conda / container environment
  • Snakemake / Nextflow workflow
  • Logs + reports + checksums

Each layer is described below, along with what it protects you from.

Sample sheet: one source of truth for sample identity

A manifest turns filenames into structured metadata. If sample names, lanes, condition labels, and read paths are not centralized, manual mistakes spread quickly.

  • Track sample ID, condition, lane, read1/read2, batch, and notes
  • Keep machine-readable tables such as TSV/YAML, not prose alone
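A manifest loader can enforce these rules at the start of every run. A sketch assuming a tab-separated `samples.tsv` with the columns listed above (the column names here are illustrative):

```python
import csv

# Columns every row must carry; adapt to your own manifest schema
REQUIRED = ["sample_id", "condition", "batch", "read1", "read2"]

def load_manifest(path):
    """Load samples.tsv and fail fast on missing columns or duplicate IDs."""
    with open(path, newline="") as fh:
        rows = list(csv.DictReader(fh, delimiter="\t"))
    if not rows:
        raise ValueError("empty manifest")
    missing = [c for c in REQUIRED if c not in rows[0]]
    if missing:
        raise ValueError(f"manifest missing columns: {missing}")
    ids = [r["sample_id"] for r in rows]
    dupes = {s for s in ids if ids.count(s) > 1}
    if dupes:
        raise ValueError(f"duplicate sample_id values: {sorted(dupes)}")
    return rows
```

Failing loudly here is the point: a duplicated sample ID caught at load time never becomes a silently overwritten BAM file.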

Reference versions: genome build and annotation must travel together

A result is not reproducible if the reference sequence or gene model changed. Record exact FASTA, GTF, database, and index versions, plus checksums if possible.

  • Never write only "human genome" in a report
  • Document build, annotation release, and decoy / transcriptome choices
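Checksums turn "exact FASTA and GTF" into something verifiable. A minimal sketch using SHA-256 (the file names are placeholders, and `references.yaml` is the config file from the layout above):

```python
import hashlib
from pathlib import Path

def sha256sum(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large FASTA/GTF files fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def checksum_references(paths):
    """Return {filename: sha256} for a reference bundle, ready to record."""
    return {Path(p).name: sha256sum(p) for p in paths}
```

Recording these digests alongside the build and annotation release lets anyone confirm later that the same reference bundle was used.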

Environment: the same command can behave differently across versions

Pinning packages with Conda, Mamba, or containers reduces drift. A clean environment is often the difference between a pipeline that re-runs and one that mysteriously changes behavior.

  • Save environment YAMLs or container tags
  • Record --version for core tools in reports
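Recording `--version` output can itself be scripted. A sketch that shells out to each tool in a list (the tool list is an example; anything not on PATH is recorded as missing rather than crashing the report):

```python
import shutil
import subprocess

def record_versions(tools):
    """Capture `tool --version` for each tool; note tools that are absent."""
    versions = {}
    for tool in tools:
        if shutil.which(tool) is None:
            versions[tool] = "NOT FOUND"
            continue
        result = subprocess.run(
            [tool, "--version"], capture_output=True, text=True
        )
        # Some tools print their version to stderr instead of stdout
        output = (result.stdout or result.stderr).strip()
        versions[tool] = output.splitlines()[0] if output else "unknown"
    return versions
```

Dumping this dictionary into the final report makes "which samtools was this?" answerable a year later.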

Workflow engine: encode dependencies instead of relying on memory

Snakemake and Nextflow make inputs, outputs, and dependencies explicit. That helps you re-run only what changed and reduces accidental order-of-operations errors.

  • Rules / processes should have clear inputs, outputs, and logs
  • QC gates belong inside the workflow, not only in notebook comments
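At its core, "re-run only what changed" is a timestamp comparison along the file graph. A toy sketch of the idea both engines build on:

```python
from pathlib import Path

def is_stale(output, inputs):
    """An output needs rebuilding if it is missing or older than any input."""
    out = Path(output)
    if not out.exists():
        return True
    out_mtime = out.stat().st_mtime
    return any(Path(i).stat().st_mtime > out_mtime for i in inputs)
```

Real engines add much more (content hashing, params tracking, cluster dispatch), but this check is why editing one FASTQ re-triggers only its downstream rules.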

Audit trail: if a result cannot be explained, it cannot be trusted

Final reports should include sample manifests, tool versions, references, QC summaries, and reasoning for filters. Logs and checksums make reruns and troubleshooting much easier.

  • Archive logs, MultiQC, counts, VCF summary tables, and code revisions
  • Explain why thresholds were chosen, not just what they were
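The audit trail can be collected into one machine-readable record per run. A sketch (the field names and example values are illustrative, not a standard format):

```python
import json
import time

def write_run_report(path, manifest, versions, ref_checksums, filters):
    """Persist one JSON record tying results to inputs, versions, and filters."""
    report = {
        "generated": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "manifest": manifest,
        "tool_versions": versions,
        "reference_checksums": ref_checksums,
        # Each filter should record the threshold AND the reasoning behind it
        "filters": filters,
    }
    with open(path, "w") as fh:
        json.dump(report, fh, indent=2)
    return report
```

A report like this, archived next to the results, answers "which manifest, which reference, which thresholds, and why" in one file.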

Sample sheet anatomy

Column          Why it matters
sample_id       Stable ID used across all outputs
condition       Defines the biological comparison
batch           Lets you model or inspect technical structure
read1, read2    Exact raw input paths
notes           Room for anomalies, reruns, or exclusions

Tooling comparison

Approach        Strength                                        Weakness
Bash scripts    Simple and transparent for small jobs           Harder to scale and recover after changes
Snakemake       Python-friendly, explicit file graph            Can become messy if rules are not modular
Nextflow        Good for cloud/HPC and containerized workflows  More abstraction up front
Containers      Lock environment versions                       Still need clear references and manifests

Example QC gate waterfall

A good pipeline does not just run tools. It measures what survives each stage so you can explain where data were lost and whether that loss was expected.
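Tracking what survives each gate is a running tally. A sketch that computes per-stage retention from read counts (the stage names and numbers are made up for illustration):

```python
def qc_waterfall(stage_counts):
    """Given ordered (stage, reads) pairs, report retention relative to raw."""
    if not stage_counts:
        return []
    raw = stage_counts[0][1]  # first stage is taken as the raw input
    table = []
    prev = raw
    for stage, count in stage_counts:
        table.append({
            "stage": stage,
            "reads": count,
            "pct_of_raw": round(100.0 * count / raw, 1),
            "lost_here": prev - count,  # reads dropped at this gate
        })
        prev = count
    return table
```

Printed or written to the QC folder, a table like this makes "where did the reads go?" a one-glance question.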

Minimal Snakemake rule (illustrative)
rule fastqc:
    input:
        "raw/{sample}_R1.fastq.gz",
        "raw/{sample}_R2.fastq.gz"
    output:
        # FastQC names each report after its input file, so declare both
        "qc/{sample}_R1_fastqc.html",
        "qc/{sample}_R2_fastqc.html"
    log:
        "logs/fastqc/{sample}.log"
    shell:
        "fastqc -o qc {input} > {log} 2>&1"

Minimal Nextflow process (illustrative)
process FASTQC {
  tag "$sample_id"

  input:
    tuple val(sample_id), path(reads)

  output:
    // FastQC names each report after its read file, so match with a glob
    path "*_fastqc.html"

  script:
  """
  fastqc ${reads.join(' ')}
  """
}

Release checklist for a pipeline-driven analysis
  • Every output can be traced back to one sample sheet and one reference bundle.
  • Logs, MultiQC reports, and environment definitions are archived.
  • Filters and exclusions are described in plain English.
  • Checksums or version tags are stored for key reference files.
  • Final result tables are linked to the exact workflow revision that produced them.