Variant Calling (GATK/BCFtools): From Alignments to VCF
Variant calling converts aligned reads into hypotheses about how a sample differs from the reference. The hard part is not producing a VCF; it is distinguishing true variants from artifacts (mapping bias, PCR errors and duplicates, strand bias, low depth, contamination).
Typical DNA variant pipeline (high level)
- QC & trim
- Align to reference
- Mark duplicates (if applicable)
- Call variants
- Filter (hard filters or VQSR)
- Annotate + interpret
Example commands (illustrative)
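# Basic calling with bcftools (example)
# bcftools mpileup computes genotype likelihoods; -Ou streams uncompressed BCF to the caller
# bcftools call: -m multiallelic caller, -v variant sites only, -Oz bgzipped VCF output
bcftools mpileup -Ou -f reference.fa sample.bam \
| bcftools call -mv -Oz -o sample.vcf.gz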
bcftools index sample.vcf.gz
# Basic filtering idea (thresholds depend on experiment!)
bcftools filter -e 'DP<10 || QUAL<30' sample.vcf.gz -Oz -o sample.filtered.vcf.gz
How to read a VCF genotype
Genotypes in diploid samples are commonly:
- 0/0: homozygous reference
- 0/1: heterozygous
- 1/1: homozygous alternate
- ./.: missing
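As a minimal sketch of reading these genotype strings programmatically, the shell function below maps a diploid GT value to a label. The name `classify_gt` is hypothetical, and the handling of phased genotypes (`|`) and of sites with two different ALT alleles (for example 1/2) is an assumption that goes beyond the four cases listed above.

```shell
# classify_gt GT: print a label for a diploid genotype string.
# Hypothetical helper; treats phased (|) and unphased (/) genotypes alike.
classify_gt() {
  gt=$(printf '%s' "$1" | tr '|' '/')   # normalise the separator
  a=${gt%%/*}                           # first allele index
  b=${gt##*/}                           # second allele index
  if [ "$a" = "." ] || [ "$b" = "." ]; then
    echo "missing"
  elif [ "$a" = "$b" ]; then
    if [ "$a" = "0" ]; then
      echo "homozygous reference"
    else
      echo "homozygous alternate"
    fi
  elif [ "$a" = "0" ] || [ "$b" = "0" ]; then
    echo "heterozygous"
  else
    echo "heterozygous (two different ALT alleles)"
  fi
}

classify_gt '0/1'
```

Partially missing calls such as ./1 are reported as missing here, which is a simplification rather than a rule from the VCF format.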
Key fields
| Field | Meaning |
|---|---|
| DP | Total depth |
| AD | Allelic depths (REF, ALT…) |
| GQ | Genotype quality |
| AF | Allele fraction (caller-defined) |
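Per-sample fields such as AD, DP, and GQ live in the sample columns, in the order given by the FORMAT column (column 9). In practice `bcftools query` is the usual way to extract them; the awk sketch below only illustrates the lookup on an uncompressed, single-sample VCF. The function name `format_field` and the one-line test record are made up for illustration.

```shell
# format_field KEY: read uncompressed VCF on stdin and print CHROM, POS and the
# value of KEY for the first sample, located by its position in FORMAT.
format_field() {
  awk -F'\t' -v key="$1" '
    /^#/ { next }                        # skip header lines
    {
      n = split($9, keys, ":")           # column 9 = FORMAT keys
      split($10, vals, ":")              # column 10 = first sample values
      value = "."                        # "." if the key is absent or empty
      for (i = 1; i <= n; i++)
        if (keys[i] == key && vals[i] != "") value = vals[i]
      print $1 "\t" $2 "\t" value
    }'
}

# Made-up record: GT 0/1 with 14 REF reads and 16 ALT reads.
printf 'chr1\t100\t.\tA\tG\t50\tPASS\tDP=30\tGT:AD:DP:GQ\t0/1:14,16:30:99\n' \
  | format_field AD
```

Looking the key up by name rather than by a fixed position matters because callers do not all emit FORMAT keys in the same order.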
Ti/Tv ratio across samples (example)
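Ti/Tv (the ratio of transitions to transversions) is a quick plausibility metric, especially for exomes. For human germline calls it is typically around 2.0 to 2.1 genome-wide and closer to 3 in exomes; values drifting toward 0.5, the expectation for random errors, can indicate calling issues or contamination.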
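To make the metric concrete, here is a rough awk sketch that counts transitions (A<->G, C<->T) and transversions among biallelic SNPs in an uncompressed VCF. `bcftools stats` reports a ts/tv summary and is the better choice for real data; the function name `titv` is hypothetical, and skipping indels and multiallelic sites is a simplifying assumption.

```shell
# titv: read uncompressed VCF on stdin; print transitions, transversions and
# their ratio (tab-separated) for biallelic single-base substitutions.
titv() {
  awk -F'\t' '
    /^#/ { next }                                  # skip header lines
    $4 ~ /^[ACGT]$/ && $5 ~ /^[ACGT]$/ && $4 != $5 {
      pair = $4 $5
      if (pair == "AG" || pair == "GA" || pair == "CT" || pair == "TC")
        ti++                                       # purine<->purine or pyrimidine<->pyrimidine
      else
        tv++                                       # purine<->pyrimidine
    }
    END {
      if (tv > 0) printf "%d\t%d\t%.2f\n", ti, tv, ti / tv
      else        printf "%d\t%d\tNA\n", ti, tv
    }'
}
```

Usage would look like `zcat sample.vcf.gz | titv`, run per sample so that outliers stand out against the rest of the cohort.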
Variant allele fraction histogram (example)
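VAF shapes differ across experiments. In a clean diploid germline sample, VAFs cluster near 0.5 (heterozygous) and 1.0 (homozygous alternate); somatic calls shift with tumor purity, copy number, and subclonality, and filters change the shape further.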
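A text histogram is often enough to see the shape of the VAF distribution. The sketch below assumes a single-sample, uncompressed VCF whose FORMAT column contains AD, computes VAF as ALT reads / (REF + ALT reads) at biallelic sites, and bins it in steps of 0.1. The function name `vaf_hist` and the bin width are illustrative choices.

```shell
# vaf_hist: read uncompressed single-sample VCF on stdin and print a
# 10-bin text histogram of VAF = ALT / (REF + ALT), taken from FORMAT/AD.
vaf_hist() {
  awk -F'\t' '
    /^#/ { next }                                  # skip header lines
    {
      n = split($9, keys, ":"); split($10, vals, ":")
      ad = ""
      for (i = 1; i <= n; i++) if (keys[i] == "AD") ad = vals[i]
      if (split(ad, d, ",") != 2) next             # need exactly REF,ALT depths
      total = d[1] + d[2]
      if (total == 0) next                         # no informative reads
      bin = int(10 * d[2] / total)                 # 0..10
      if (bin == 10) bin = 9                       # VAF of exactly 1.0 goes in the top bin
      count[bin]++
    }
    END {
      for (b = 0; b < 10; b++) {
        bar = ""
        for (j = 0; j < count[b]; j++) bar = bar "#"
        printf "%.1f-%.1f\t%d\t%s\n", b / 10, (b + 1) / 10, count[b], bar
      }
    }'
}
```

Each output line holds the bin range, the site count, and a bar of `#` characters, so a germline sample should show visible mass in the 0.4-0.6 and 0.9-1.0 rows.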
Common artifact checklist
Technical
- Low complexity regions, repeats, segmental duplications
- Strand bias (ALT supported mostly on one strand)
- Position bias (ALT mostly at read ends)
- Read-mapping ambiguity (low MAPQ)
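As one illustration of the strand-bias item, the sketch below flags sites where nearly all ALT-supporting reads come from one strand. It assumes the INFO column carries a DP4 tag (ref-forward, ref-reverse, alt-forward, alt-reverse counts, as written by bcftools); the function name `flag_strand_bias` and the default thresholds (0.9 of ALT reads on one strand, at least 5 ALT reads) are arbitrary illustrative values, not recommended cutoffs. A statistical test such as Fisher's exact test is the more principled approach.

```shell
# flag_strand_bias [MAX_FRAC] [MIN_ALT]: read uncompressed VCF on stdin and
# print CHROM, POS, alt-forward and alt-reverse counts for sites where one
# strand holds at least MAX_FRAC of the ALT reads.
flag_strand_bias() {
  awk -F'\t' -v max_frac="${1:-0.9}" -v min_alt="${2:-5}" '
    /^#/ { next }                                  # skip header lines
    {
      n = split($8, info, ";")                     # column 8 = INFO
      for (i = 1; i <= n; i++) {
        if (info[i] !~ /^DP4=/) continue
        split(substr(info[i], 5), d, ",")          # d[3]=alt-forward, d[4]=alt-reverse
        fwd = d[3] + 0; rev = d[4] + 0
        alt = fwd + rev
        if (alt == 0 || alt < min_alt) break       # too few ALT reads to judge
        top = (fwd > rev) ? fwd : rev
        if (top / alt >= max_frac) print $1 "\t" $2 "\t" fwd "\t" rev
        break
      }
    }'
}
```

Flagged sites are candidates for inspection rather than automatic removal, since amplicon and some capture designs produce one-sided coverage by construction.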
Biological / design
- Sample contamination / swaps
- Unexpected ploidy / sex chromosomes
- Tumor purity and subclonality (somatic)
- Batch effects in capture/coverage