Single-cell RNA-seq: filtering cells, defining clusters, and avoiding false stories
Single-cell workflows are powerful because they separate heterogeneous cell states, but they are also easy to over-interpret. QC thresholds, batch handling, doublets, and annotation strategy shape the final biological story.
| Stage | Main output |
|---|---|
| Barcode / UMI processing | Cell-by-gene count matrix |
| Cell QC filtering | High-quality cell subset |
| Normalization + HVG selection | Comparable expression matrix |
| Dimensionality reduction | PCA / UMAP coordinates |
| Clustering + annotation | Cell states / identities |
| Marker analysis | Genes that separate clusters or conditions |
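
To make the normalization and HVG (highly variable gene) selection stage concrete, here is a minimal, dependency-free sketch. It assumes a tiny invented count matrix, a `target_sum` of 10,000, and plain variance ranking. Real pipelines usually model variance against mean expression, so treat this as an illustration of the idea rather than a recommended method. The function names are hypothetical.

```python
import math
from statistics import pvariance

def normalize_log(counts, target_sum=10_000):
    """Scale each cell to target_sum total counts, then apply log1p."""
    normalized = []
    for cell in counts:
        total = sum(cell)
        if total == 0:
            # Empty barcode: keep a row of zeros instead of dividing by zero.
            normalized.append([0.0] * len(cell))
            continue
        normalized.append([math.log1p(c * target_sum / total) for c in cell])
    return normalized

def top_variable_genes(matrix, n_top):
    """Return sorted indices of the n_top genes with the highest variance."""
    n_genes = len(matrix[0])
    variances = [pvariance([cell[g] for cell in matrix]) for g in range(n_genes)]
    ranked = sorted(range(n_genes), key=lambda g: variances[g], reverse=True)
    return sorted(ranked[:n_top])

# Toy matrix (invented): rows are cells, columns are genes.
# Cells 0 and 1 share one profile at different depths, as do cells 2 and 3.
counts = [
    [10, 0, 5, 5],
    [20, 0, 10, 10],
    [0, 8, 6, 6],
    [0, 16, 12, 12],
]

log_norm = normalize_log(counts)
hvg = top_variable_genes(log_norm, n_top=2)
print("highly variable gene indices:", hvg)
```

After depth normalization, cells that differ only in sequencing depth become identical, and the genes that distinguish the two profiles rank as most variable.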
- Ambient RNA: free-floating RNA contaminates droplets and distorts weakly expressed markers
- Doublets: two cells captured together can look like a fake hybrid cell type
- Mitochondrial RNA: the fraction of mitochondrial reads often increases in stressed or damaged cells
- Batch effects: chemistry, run date, and donor handling can dominate clustering
- Over-clustering: fine clusters are not always distinct biological cell types
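
The doublet pitfall in the list above can be illustrated with a small sketch. It assumes two invented count profiles (the counts are made up; the gene symbols are commonly used T-cell and B-cell markers) and models a doublet as the gene-wise sum of both cells. The helper names are hypothetical, and dedicated doublet-detection tools use more robust statistics than this check.

```python
# Commonly used lineage markers; the counts below are invented.
T_MARKERS = ("CD3D", "CD3E")
B_MARKERS = ("MS4A1", "CD79A")

t_cell = {"CD3D": 9, "CD3E": 12, "MS4A1": 0, "CD79A": 0}
b_cell = {"CD3D": 0, "CD3E": 0, "MS4A1": 10, "CD79A": 8}

def merge_profiles(cell_a, cell_b):
    """Sum UMI counts gene by gene, as if both cells shared one droplet."""
    genes = set(cell_a) | set(cell_b)
    return {g: cell_a.get(g, 0) + cell_b.get(g, 0) for g in genes}

def expresses_all(cell, markers, min_count=1):
    """True if every marker reaches at least min_count in this cell."""
    return all(cell.get(g, 0) >= min_count for g in markers)

def looks_like_hybrid(cell):
    """True if the cell carries both marker programs at once."""
    return expresses_all(cell, T_MARKERS) and expresses_all(cell, B_MARKERS)

doublet = merge_profiles(t_cell, b_cell)
for name, cell in [("T cell", t_cell), ("B cell", b_cell), ("doublet", doublet)]:
    print(name, "hybrid-like:", looks_like_hybrid(cell))
```

Only the merged profile co-expresses both marker sets, which is how a doublet can pass for a new hybrid cell type.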
Adjusting QC thresholds on a set of synthetic cells and counting how many remain helps build intuition for the tradeoff between removing low-quality cells and accidentally discarding rare but real populations.
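The threshold tradeoff can be explored with a short simulation. This is a sketch under invented assumptions: three synthetic groups (healthy cells, damaged cells, and a rare population with few detected genes), each with made-up means and spreads. The group labels, parameters, and function names are all hypothetical.

```python
import random
from collections import Counter

# (label, n_cells, mean genes detected, mean mito fraction): all invented.
GROUPS = [
    ("healthy", 300, 2500, 0.05),
    ("damaged", 60, 600, 0.30),
    ("rare_small", 20, 900, 0.06),
]

def make_synthetic_cells(seed=0):
    """Draw per-cell QC metrics from simple Gaussian distributions."""
    rng = random.Random(seed)
    cells = []
    for label, n_cells, genes_mean, mito_mean in GROUPS:
        for _ in range(n_cells):
            n_genes = max(0, round(rng.gauss(genes_mean, 0.2 * genes_mean)))
            mito_frac = min(1.0, max(0.0, rng.gauss(mito_mean, 0.03)))
            cells.append({"label": label, "n_genes": n_genes, "mito_frac": mito_frac})
    return cells

def apply_qc(cells, min_genes, max_mito):
    """Keep cells with enough detected genes and a low enough mito fraction."""
    return [c for c in cells if c["n_genes"] >= min_genes and c["mito_frac"] <= max_mito]

cells = make_synthetic_cells()
# Sweep from lenient to strict thresholds and report survivors per group.
for min_genes, max_mito in [(200, 0.50), (500, 0.20), (1500, 0.10)]:
    kept = apply_qc(cells, min_genes, max_mito)
    print(min_genes, max_mito, dict(Counter(c["label"] for c in kept)))
```

With these invented parameters, the strictest setting should remove the damaged cells but also nearly all of the rare, intact population, which is the failure mode to watch for when thresholds are copied between datasets.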
Cluster separation in 2D is helpful for intuition, but cluster boundaries are still shaped by preprocessing, resolution choice, and batch handling.
Very tiny clusters can represent rare populations, but they can also reflect doublets, broken cells, or over-clustering.
Use marker combinations
One marker gene is rarely enough. Look for coherent marker sets and pathway logic, not a single famous gene.
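One simple way to express this is to score a cluster by how much of a marker set it expresses, instead of checking one gene. The sketch below assumes invented cluster-mean expression values and an arbitrary detection cutoff of 1.0. The function name and the cutoff are hypothetical, and the marker list is a small illustrative T-cell set, not a curated panel.

```python
def marker_set_score(mean_expr, markers, min_expr=1.0):
    """Fraction of markers whose cluster-mean expression reaches min_expr."""
    detected = [g for g in markers if mean_expr.get(g, 0.0) >= min_expr]
    return len(detected) / len(markers)

# Illustrative T-cell marker set.
T_CELL_MARKERS = ["CD3D", "CD3E", "CD3G", "TRAC"]

# Invented cluster-mean expression values.
cluster_a = {"CD3D": 2.4, "CD3E": 2.9, "CD3G": 1.8, "TRAC": 2.1}
cluster_b = {"CD3D": 0.1, "CD3E": 2.0, "CD3G": 0.0, "TRAC": 0.2}

score_a = marker_set_score(cluster_a, T_CELL_MARKERS)
score_b = marker_set_score(cluster_b, T_CELL_MARKERS)
print("cluster A:", score_a)  # coherent marker set
print("cluster B:", score_b)  # one marker only
```

Both clusters express `CD3E`, but only cluster A supports the annotation across the whole set; cluster B would need another explanation, such as ambient RNA or doublets.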
Check QC overlays
If one cluster is mostly high-mito cells or low-count cells, it may be technical, not biological.
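A QC overlay can also be summarized numerically per cluster. The following sketch assumes each cell is a dictionary with hypothetical keys `cluster`, `mito_frac`, and `total_counts`, and uses invented cutoffs (median mito fraction above 0.2, median counts below 1000) that would need tuning for any real dataset.

```python
from collections import defaultdict
from statistics import median

def flag_technical_clusters(cells, max_median_mito=0.2, min_median_counts=1000):
    """Return {cluster: [reasons]} for clusters whose medians look technical."""
    by_cluster = defaultdict(list)
    for cell in cells:
        by_cluster[cell["cluster"]].append(cell)

    flags = {}
    for cluster, members in by_cluster.items():
        reasons = []
        if median(m["mito_frac"] for m in members) > max_median_mito:
            reasons.append("high mito")
        if median(m["total_counts"] for m in members) < min_median_counts:
            reasons.append("low counts")
        flags[cluster] = reasons
    return flags

# Invented example cells.
cells = [
    {"cluster": "0", "mito_frac": 0.04, "total_counts": 5200},
    {"cluster": "0", "mito_frac": 0.06, "total_counts": 4800},
    {"cluster": "0", "mito_frac": 0.05, "total_counts": 6100},
    {"cluster": "1", "mito_frac": 0.35, "total_counts": 700},
    {"cluster": "1", "mito_frac": 0.28, "total_counts": 900},
    {"cluster": "1", "mito_frac": 0.41, "total_counts": 650},
]

flags = flag_technical_clusters(cells)
print(flags)
```

A flagged cluster is a prompt for closer inspection, not an automatic exclusion: some cell types have genuinely high mitochondrial content or low RNA content.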
Respect donor / batch structure
A cluster that appears in only one batch may be real, or it may be just a processing artifact.
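Batch composition per cluster is straightforward to tabulate. This sketch assumes a list of hypothetical `(cluster, batch)` pairs and an arbitrary 0.8 dominance cutoff; the names, data, and cutoff are invented for illustration.

```python
from collections import Counter, defaultdict

def dominant_batch(assignments):
    """Map each cluster to (most common batch, fraction of its cells in it)."""
    per_cluster = defaultdict(Counter)
    for cluster, batch in assignments:
        per_cluster[cluster][batch] += 1

    result = {}
    for cluster, batch_counts in per_cluster.items():
        batch, n = batch_counts.most_common(1)[0]
        result[cluster] = (batch, n / sum(batch_counts.values()))
    return result

# Invented assignments: cluster A is mixed, B and C are batch-dominated.
assignments = (
    [("A", "batch1")] * 6 + [("A", "batch2")] * 4
    + [("B", "batch1")] * 9 + [("B", "batch2")] * 1
    + [("C", "batch2")] * 4
)

composition = dominant_batch(assignments)
suspicious = [c for c, (_, frac) in composition.items() if frac >= 0.8]
print(composition)
print("clusters dominated by one batch:", suspicious)
```

A batch-dominated cluster still needs biological judgment: if the batches correspond to different tissues or conditions, single-batch clusters may be exactly what is expected.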
- Report QC thresholds and justify why they were chosen.
- Describe how doublets and ambient RNA were handled.
- State whether integration / batch correction was applied and why.
- Use multiple markers and known biology when naming clusters.
- Be careful when treating cluster-level differential expression (DE) as cell-type discovery without validation.