GraphPop: graph-native computation decouples population genomics complexity from sample count
Estaji, E.; Zhao, S.-W.; Chen, Z.-Y.; Nie, S.; Mao, J.-F.
Abstract
Matrix-based population genomics tools scale as O(V × N) in the number of variants V and samples N, re-reading the full genotype matrix for every analysis. Here we present GraphPop, a graph database engine that reduces summary-statistic complexity to O(V × K), where K is the number of populations, by computing on pre-aggregated allele-count arrays stored as graph node properties; query cost is therefore independent of sample count. The same architecture enables annotation-conditioned queries via edge traversal, persistent analytical records, and multi-statistic composition. Applied to rice 3K (29.6M SNPs, 3,024 accessions) and human 1000 Genomes (3,202 samples, 22 autosomes), GraphPop reveals that all 12 rice subpopulations show piN/piS > 1.0, uncovers opposite consequence-level Fst regimes between the two species, and identifies KCNE1 as a candidate pre-Out-of-Africa sweep via convergence of five stored statistics. GraphPop achieves 146-327x query-time speedup for pre-aggregated statistics and 63-179x for bit-packed haplotype computation, at a constant memory footprint of ~160 MB. This complexity reduction makes systematic, annotation-integrated population genomics practical for the crop, livestock, conservation, and ecological datasets that constitute the majority of the field.
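The complexity claim can be illustrated with a minimal sketch in Python. This is not GraphPop's code or API: the function names `aggregate_counts` and `pi_per_population`, the haploid 0/1 genotype encoding, and the use of NumPy arrays in place of graph node properties are all assumptions made for illustration. The sketch shows a one-time O(V × N) aggregation into per-population allele counts, after which a summary statistic (here, nucleotide diversity summed over sites) is computed from the (V, K) count arrays alone.

```python
import numpy as np

def aggregate_counts(genotypes, pop_index, n_pops):
    """One-time O(V x N) pass over a (V, N) matrix of haploid 0/1 alleles.

    Returns alternate-allele counts `ac` and total allele counts `an`,
    each of shape (V, K). Hypothetical helper, not part of GraphPop.
    """
    n_variants = genotypes.shape[0]
    ac = np.zeros((n_variants, n_pops), dtype=np.int64)
    an = np.zeros((n_variants, n_pops), dtype=np.int64)
    for k in range(n_pops):
        cols = genotypes[:, pop_index == k]
        ac[:, k] = cols.sum(axis=1)
        an[:, k] = cols.shape[1]
    return ac, an

def pi_per_population(ac, an):
    """O(V x K) nucleotide diversity from counts only (no genotype access).

    Per site: 2 * ac * (an - ac) / (an * (an - 1)), i.e. the expected
    number of differences between two distinct sampled alleles.
    """
    ac = ac.astype(float)
    an = an.astype(float)
    pairs = an * (an - 1)
    safe_pairs = np.where(pairs > 0, pairs, 1.0)  # avoid division by zero
    per_site = np.where(pairs > 0, 2 * ac * (an - ac) / safe_pairs, 0.0)
    return per_site.sum(axis=0)

# Toy data: 3 variants, 4 haploid samples, 2 populations (assumed values).
genotypes = np.array([[0, 1, 1, 0],
                      [1, 1, 0, 0],
                      [0, 0, 0, 1]])
pop_index = np.array([0, 0, 1, 1])

ac, an = aggregate_counts(genotypes, pop_index, n_pops=2)
pi = pi_per_population(ac, an)
print(pi)
```

In this sketch, adding more samples to a population changes only the values stored in `ac` and `an`, not their shape, so the cost of `pi_per_population` depends on V and K only.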