Science Cast

Solving the Diagnostic Odyssey with Synthetic Phenotype Data

Gianlucca Colangelo, March 23, 2026 3:57pm



This paper is a preprint and has not been certified by peer review.


bioRxiv, March 23, 2026

Authors

Colangelo, G.; Marti, M.

Abstract

The space of possible phenotype profiles over the Human Phenotype Ontology (HPO) is combinatorially vast, whereas the space of candidate disease genes is far smaller. Phenotype-driven diagnosis is therefore highly non-bijective: many distinct symptom profiles can correspond to the same gene, but only a small fraction of the theoretical phenotype space is biologically and clinically plausible. When a structured ontology exists, this constraint can be exploited to generate realistic synthetic cases. We introduce GraPhens, a simulation framework that uses gene-local HPO structure together with two empirically motivated soft priors, over the number of observed phenotypes per case and phenotype specificity, to generate synthetic phenotype-gene pairs that are novel yet clinically plausible. We use these synthetic cases to train GenPhenia, a graph neural network that reasons over patient-specific phenotype subgraphs rather than flat phenotype sets. Despite being trained entirely on synthetic data, GenPhenia generalizes to real, previously unseen clinical cases and outperforms existing phenotype-driven gene-prioritization methods on two real-world datasets. These results show that when patient-level data are scarce but a structured ontology is available, principled simulation can provide effective training data for end-to-end neural diagnosis models.
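To make the simulation idea concrete, here is a minimal sketch of sampling one synthetic phenotype-gene pair under two soft priors, one over case size and one over term specificity. It is not the GraPhens implementation. The toy ontology, the gene annotations, the function names, the Poisson-shaped size prior, and the depth-based specificity weighting are all assumptions made for illustration. The real HPO is a directed acyclic graph in which a term can have several parents, whereas this toy uses a single-parent tree.

```python
import math
import random

# Hypothetical toy ontology (term -> parent). Not real HPO identifiers.
PARENT = {
    "HP:root": None,
    "HP:neuro": "HP:root",
    "HP:seizure": "HP:neuro",
    "HP:focal_seizure": "HP:seizure",
    "HP:ataxia": "HP:neuro",
    "HP:eye": "HP:root",
    "HP:cataract": "HP:eye",
}

# Hypothetical gene -> annotated phenotype terms.
GENE_TERMS = {
    "GENE_A": ["HP:focal_seizure", "HP:ataxia"],
    "GENE_B": ["HP:cataract", "HP:ataxia"],
}


def ancestors(term):
    """Return the term and all of its ancestors, excluding the root."""
    out = []
    while PARENT[term] is not None:
        out.append(term)
        term = PARENT[term]
    return out


def depth(term):
    """Number of edges from the term up to the root (a proxy for specificity)."""
    d = 0
    while PARENT[term] is not None:
        term = PARENT[term]
        d += 1
    return d


def gene_local_terms(gene):
    """Annotated terms plus their ancestors: the gene-local slice of the ontology."""
    local = set()
    for t in GENE_TERMS[gene]:
        local.update(ancestors(t))
    return sorted(local)


def sample_case(gene, rng, mean_size=2.0, specificity_weight=1.0):
    """Sample one synthetic (phenotype set, gene) pair."""
    candidates = gene_local_terms(gene)

    # Soft prior 1 (assumed form): Poisson-shaped weights over the case size,
    # truncated to 1..len(candidates).
    sizes = list(range(1, len(candidates) + 1))
    size_w = [mean_size**k * math.exp(-mean_size) / math.factorial(k) for k in sizes]
    k = rng.choices(sizes, weights=size_w)[0]

    # Soft prior 2 (assumed form): deeper, more specific terms are more likely,
    # but general ancestors keep nonzero probability.
    term_w = [math.exp(specificity_weight * depth(t)) for t in candidates]
    chosen = set()
    while len(chosen) < k:
        chosen.add(rng.choices(candidates, weights=term_w)[0])
    return sorted(chosen), gene


rng = random.Random(0)
terms, gene = sample_case("GENE_A", rng)
print(gene, terms)
```

Because both priors are soft, every gene-local term and every case size keeps a nonzero probability. Sampled cases can therefore differ from any annotated profile while staying inside the part of the ontology associated with the gene.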
