Bioinformatics Tutorial

Data Formats: FASTA, FASTQ, SAM/BAM/CRAM, VCF

Bioinformatics is file-format driven. If you understand what each file represents, every tool becomes easier to reason about.

FASTA (reference or assembled sequences)

FASTA stores sequences without quality scores. Each record is:

>sequence_id optional description
ACGTTGCA...

Common uses: reference genomes, transcriptomes, contigs, protein sequences.

Common gotchas

  • Line breaks inside a sequence carry no meaning; one record may be wrapped across many lines.
  • Ambiguous bases appear as N or as other IUPAC codes (for example R for A/G, Y for C/T).
  • Headers are not standardized; pipelines often parse IDs.
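The gotchas above can be made concrete with a minimal reader sketch. It joins wrapped sequence lines and assumes that the ID is the header text up to the first whitespace, which is a common convention rather than a rule of the format. The function name `parse_fasta` and the toy records are hypothetical.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {sequence_id: sequence} from FASTA-formatted lines."""
    chunks: Dict[str, List[str]] = {}
    current_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # Assumption: ID = header text up to the first whitespace.
            fields = line[1:].split(None, 1)
            current_id = fields[0] if fields else ""
            chunks[current_id] = []
        elif current_id is not None:
            # Wrapped sequence lines are simply concatenated.
            chunks[current_id].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in chunks.items()}


# Toy input (made up for illustration): seq1 is wrapped over two lines.
example = """>seq1 toy contig
ACGT
TGCA
>seq2
NNAC
"""
sequences = parse_fasta(example.splitlines())
print(sequences)
```

For real genomes an indexed reader is usually preferable to loading everything into a dictionary; this sketch only illustrates the record structure.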

FASTQ (reads + per-base quality)

FASTQ has 4 lines per read:

@read_id
ACGT...
+
FFF...  (ASCII-encoded Phred qualities)
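As a rough sketch of the four-line layout, the reader below yields one tuple per read. It assumes strictly four lines per record (no wrapped sequences), and the name `read_fastq` and the toy reads are hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality_string) from 4-line FASTQ records."""
    it = iter(lines)
    while True:
        header = next(it, None)
        if header is None:
            return  # end of input
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        rest = [next(it, None) for _ in range(3)]
        if any(part is None for part in rest):
            raise ValueError(f"truncated record: {header}")
        seq, plus, qual = (part.rstrip("\r\n") for part in rest)
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record: {header}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch: {header}")
        yield header[1:], seq, qual


# Toy reads, made up for illustration.
toy = ["@read1\n", "ACGT\n", "+\n", "FFF#\n", "@read2\n", "GG\n", "+\n", "II\n"]
reads = list(read_fastq(toy))
print(reads)
```

The length check matters in practice: a quality string that does not match its sequence usually means the file was truncated or corrupted.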

Phred score $Q$ relates to error probability $p$ by $Q=-10\log_{10}(p)$.

Example: $Q=30$ means $p=10^{-3}$ (≈0.1% error).
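The formula can be checked with a few lines of code. This sketch assumes the Phred+33 ASCII offset used by Sanger and current Illumina FASTQ files; older encodings used other offsets, so treat `offset=33` as an assumption. The function names are hypothetical.

```python
from typing import List


def decode_qualities(qual: str, offset: int = 33) -> List[int]:
    """Convert an ASCII quality string to Phred scores (assumes Phred+33)."""
    return [ord(char) - offset for char in qual]


def phred_to_error_prob(q: float) -> float:
    """Invert Q = -10 * log10(p) to get p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


scores = decode_qualities("?F#")  # '?' is ASCII 63, 'F' is 70, '#' is 35
probs = [phred_to_error_prob(q) for q in scores]
for q, p in zip(scores, probs):
    print(f"Q={q:2d}  p={p:.5f}")
```

Note that averaging quality over a read is more meaningful on the probability scale than on the Q scale, because Q is logarithmic.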

SAM/BAM/CRAM (alignments)

SAM is text; BAM/CRAM are compressed binary equivalents. The key idea: reads are aligned to a reference with coordinates and a CIGAR string.
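To illustrate how a CIGAR string ties a read to reference coordinates, here is a minimal sketch that parses the operations and computes where an alignment ends on the reference. The operation letters and which of them consume reference bases follow the SAM specification; the helper names and the example CIGAR are hypothetical.

```python
import re
from typing import List, Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that advance along the reference (per the SAM specification).
REF_CONSUMING = set("MDN=X")


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string such as '5S10M2I3D20M' into (length, op) pairs."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Reject strings with unexpected characters (including '*', i.e. no CIGAR).
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"unparseable CIGAR: {cigar!r}")
    return ops


def reference_end(pos: int, cigar: str) -> int:
    """1-based inclusive end coordinate of an alignment starting at pos."""
    span = sum(length for length, op in parse_cigar(cigar) if op in REF_CONSUMING)
    return pos + span - 1


ops = parse_cigar("5S10M2I3D20M")
end = reference_end(100, "5S10M2I3D20M")
print(ops, end)
```

In the example, soft clips (S) and insertions (I) do not move along the reference, so only 10 + 3 + 20 = 33 reference bases are covered, starting at position 100.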

Field        Meaning
RNAME, POS   Reference contig and 1-based start position
MAPQ         Mapping quality (alignment confidence)
CIGAR        Match/insert/delete/clip operations
FLAG         Bitwise flags (paired, reverse, secondary, duplicate, …)
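Because FLAG packs several yes/no properties into one integer, it helps to decode it bit by bit. The bit values below follow the SAM specification; the short labels and the function name `decode_flag` are my own and hypothetical.

```python
from typing import List

# Bit values from the SAM specification; labels are informal.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag: int) -> List[str]:
    """Return the labels of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


print(decode_flag(99))    # 99 = 1 + 2 + 32 + 64
print(decode_flag(1040))  # 1040 = 16 + 1024
```

Filtering on a property is then a single bitwise test, for example `flag & 0x400` to check for a duplicate.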

Practical tip: base quality answers “is this base call reliable?” while mapping quality answers “is this alignment placement reliable?”. They are different failure modes.

VCF (variants)

VCF is a table of genomic positions with alleles plus annotations and per-sample genotypes.

#CHROM  POS  ID  REF  ALT  QUAL  FILTER  INFO  FORMAT  SAMPLE1
chr1    123  .   A    G    99    PASS    DP=42  GT:DP   0/1:40
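The record above can be unpacked with a short sketch. Real VCF files are tab-delimited (the spaces in the example are only for display), so the toy line is rebuilt here with tabs. The function names and the output dictionary layout are hypothetical, and values are left as strings because their types are declared in the header.

```python
from typing import Dict, List


def parse_info(info: str) -> Dict[str, object]:
    """Split 'K1=V1;K2;...' into a dict; keys without '=' are flags (True)."""
    if info == ".":
        return {}
    parsed: Dict[str, object] = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


def parse_record(line: str, sample_names: List[str]) -> dict:
    """Parse one tab-delimited VCF data line into a nested dict."""
    cols = line.rstrip("\r\n").split("\t")
    chrom, pos, _id, ref, alt, _qual, filt, info = cols[:8]
    record = {
        "CHROM": chrom,
        "POS": int(pos),
        "REF": ref,
        "ALT": alt.split(","),  # multiple alternate alleles are comma-separated
        "FILTER": filt,
        "INFO": parse_info(info),
        "samples": {},
    }
    if len(cols) > 9:
        keys = cols[8].split(":")  # FORMAT column names the per-sample fields
        for name, value in zip(sample_names, cols[9:]):
            record["samples"][name] = dict(zip(keys, value.split(":")))
    return record


line = "\t".join(["chr1", "123", ".", "A", "G", "99", "PASS", "DP=42", "GT:DP", "0/1:40"])
rec = parse_record(line, ["SAMPLE1"])
print(rec)
```

Here the site-level DP in INFO (42) and the per-sample DP in FORMAT (40) are separate fields that happen to share a name.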

If you only remember one thing: always read the INFO and FORMAT definitions in the VCF header; fields differ by caller.
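One way to follow this advice programmatically is to collect the `##INFO` and `##FORMAT` definitions before touching any records. The sketch below uses simple regular expressions rather than a full VCF parser, and the header lines in the example are made up for illustration; a real caller's header will differ.

```python
import re
from typing import Dict, Iterable

META = re.compile(r"^##(INFO|FORMAT)=<(.*)>$")
# key=value pairs; quoted values may contain commas.
PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|[^,]*)')


def header_definitions(lines: Iterable[str]) -> Dict[str, Dict[str, dict]]:
    """Collect INFO/FORMAT field definitions from VCF meta-information lines."""
    defs: Dict[str, Dict[str, dict]] = {"INFO": {}, "FORMAT": {}}
    for line in lines:
        if not line.startswith("##"):
            break  # meta-information lines end at the #CHROM line
        match = META.match(line.strip())
        if not match:
            continue
        kind, body = match.groups()
        attrs = {key: value.strip('"') for key, value in PAIR.findall(body)}
        if "ID" in attrs:
            defs[kind][attrs["ID"]] = attrs
    return defs


# Hypothetical header, for illustration only.
header = [
    "##fileformat=VCFv4.2",
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth, all samples">',
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1",
]
defs = header_definitions(header)
print(defs["INFO"]["DP"]["Type"], "|", defs["INFO"]["DP"]["Description"])
```

The `Type` and `Number` attributes say how to convert the string values found in each record, which is why reading them first avoids guessing.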

k-mer spectrum (toy example)

k-mers help detect contamination and estimate genome characteristics (coverage/heterozygosity) in assembly workflows.
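A k-mer spectrum is simply a histogram of k-mer counts: how many distinct k-mers occur once, twice, and so on. The sketch below is deliberately simplified. It does not merge reverse complements into canonical k-mers, it skips any k-mer containing N, and the toy reads are made up; dedicated k-mer counters are used for real data.

```python
from collections import Counter
from typing import Iterable


def kmer_counts(seqs: Iterable[str], k: int) -> Counter:
    """Count every k-mer across the sequences, skipping k-mers that contain N."""
    counts: Counter = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return counts


def kmer_spectrum(counts: Counter) -> Counter:
    """Map multiplicity -> number of distinct k-mers seen that many times."""
    return Counter(counts.values())


toy_reads = ["ACGTACGT", "ACGTTT"]
counts = kmer_counts(toy_reads, k=4)
spectrum = kmer_spectrum(counts)
for multiplicity in sorted(spectrum):
    print(multiplicity, spectrum[multiplicity])
```

On real sequencing data, the low-multiplicity end of the spectrum is typically dominated by sequencing errors, while the main peak sits near the k-mer coverage.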

Coverage depth profile (toy example)

Coverage dips can indicate repeats, GC bias, or mapping ambiguity.
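A toy depth profile can be built from alignment intervals with a difference array, and dips can then be flagged relative to the median. The intervals below are made up, coordinates are assumed to be 1-based inclusive and within the contig, and the cutoff of half the median is an arbitrary assumption chosen for illustration.

```python
from statistics import median
from typing import Iterable, List, Tuple


def depth_profile(intervals: Iterable[Tuple[int, int]], length: int) -> List[int]:
    """Per-base depth for 1-based inclusive (start, end) alignment intervals."""
    diff = [0] * (length + 2)
    for start, end in intervals:
        diff[start] += 1    # coverage begins at start
        diff[end + 1] -= 1  # and stops after end
    depth, running = [], 0
    for pos in range(1, length + 1):
        running += diff[pos]
        depth.append(running)
    return depth


def low_coverage_positions(depth: List[int], fraction: float = 0.5) -> List[int]:
    """1-based positions whose depth is below fraction * median depth."""
    cutoff = fraction * median(depth)
    return [i + 1 for i, d in enumerate(depth) if d < cutoff]


# Toy alignments on a 10 bp contig, made up for illustration.
toy_intervals = [(1, 4), (1, 4), (2, 5), (7, 10), (7, 10), (6, 9)]
depth = depth_profile(toy_intervals, length=10)
dips = low_coverage_positions(depth)
print(depth, dips)
```

A flagged dip only says that depth is low; deciding between a repeat, GC bias, or mapping ambiguity needs additional evidence such as mapping quality in the same region.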