A Generalizable Machine Learning Model for Early Detection of Hepatocellular Carcinoma Using Bisulfite Sequencing Data
Subharam, M.; Koehler, R.
Abstract
Hepatocellular carcinoma (HCC) is a leading cause of cancer-related mortality worldwide, and early detection significantly improves survival outcomes. However, both current screening methods (e.g., ultrasound and alpha-fetoprotein, AFP) and emerging molecular diagnostics have struggled to achieve the sensitivity and specificity required for reliable early-stage detection. In this study, we present a generalizable machine learning framework for HCC detection using cell-free DNA (cfDNA) methylation profiles derived from bisulfite sequencing data. We trained separate XGBoost models on three cfDNA datasets from previously published studies, each based on a different bisulfite sequencing methodology: MCTA-Seq, targeted bisulfite sequencing (using methylation-correlated blocks, MCBs), and low-pass whole-genome bisulfite sequencing (WGBS). For each model, we selected features from the biologically relevant methylation markers reported in the original study, so that feature selection was grounded in experimentally validated signals. A meta-classifier was trained on shared CpG site features to route incoming samples to the most appropriate model, eliminating the need for forced feature harmonization. To evaluate generalizability, we used an independent moderate-depth WGBS liver tissue dataset for blind validation. The dataset was filtered separately to match the feature set of each training model. Models trained on MCTA-Seq and WGBS data generalized well (blind validation accuracy up to 92%), while the padlock-probe-based MCB model failed to transfer, underscoring the importance of platform-aware modeling. Across datasets, the methylation patterns separating cancerous from non-cancerous samples remained stable even when absolute values shifted, as between tissue and blood. Threshold adjustment compensated for this shift without compromising classification.
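The threshold adjustment mentioned above can be illustrated with a minimal sketch. The abstract does not state how the cutoff was chosen, so the criterion used here (maximizing Youden's J on a small labeled calibration set) is an assumption, and the function name `pick_threshold` and the toy scores are hypothetical rather than taken from the study.

```python
from typing import Sequence


def pick_threshold(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Return the cutoff that maximizes Youden's J (sensitivity + specificity - 1).

    Assumption: this criterion is illustrative; the study's actual
    threshold-selection rule is not described in the abstract.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one sample of each class")

    best_j, best_cut = -1.0, 0.5
    # Every observed score is a candidate cutoff; a sample is called
    # positive when its score is >= the cutoff.
    for cut in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cut)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cut)
        j = tp / positives + tn / negatives - 1.0
        if j > best_j:
            best_j, best_cut = j, cut
    return best_cut


# Hypothetical model scores for a tissue-like calibration set in which every
# value sits above the default 0.5 cutoff, while the class ordering is intact.
tissue_scores = [0.62, 0.66, 0.71, 0.83, 0.88, 0.93]
tissue_labels = [0, 0, 0, 1, 1, 1]

default_calls = [int(s >= 0.5) for s in tissue_scores]
cutoff = pick_threshold(tissue_scores, tissue_labels)
adjusted_calls = [int(s >= cutoff) for s in tissue_scores]
```

In this toy case the default cutoff calls every sample positive, whereas the adjusted cutoff recovers the labels because the ranking of samples is unchanged by the shift. In practice the cutoff would be chosen on samples held out from the ones used to report accuracy.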
Intra-dataset k-fold accuracy remained consistently high, confirming the robustness of methylation-based classification across patient cohorts and disease stages. Overall, we demonstrate a generalizable machine learning framework capable of predicting HCC with over 90% accuracy using cfDNA methylation data from any of three distinct bisulfite sequencing assays, including off-the-shelf WGBS at moderate depth. By pairing assay-specific models with a meta-classifier for sample routing, we provide a modular strategy that avoids harmonization pitfalls while preserving accuracy. The next logical step is clinical validation: biobank-derived plasma samples sequenced with WGBS and passed through this framework to confirm real-world performance.
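The modular routing strategy can be sketched as follows. This is a structural illustration only, under stated assumptions: the CpG identifiers, the `ASSAY_FEATURES` table, the stub models, and the coverage-based `route` rule are all hypothetical placeholders. In the study the assay-specific models are trained XGBoost classifiers and the router is a trained meta-classifier on shared CpG features, neither of which is reproduced here.

```python
from typing import Callable, Dict, Mapping, Sequence, Tuple

Sample = Mapping[str, float]  # CpG identifier -> methylation fraction in [0, 1]

# Hypothetical per-assay feature sets; the real marker lists come from the
# original studies and are not given in the abstract.
ASSAY_FEATURES: Dict[str, Sequence[str]] = {
    "mcta_seq": ["cpg_a", "cpg_b", "cpg_shared"],
    "mcb_targeted": ["cpg_c", "cpg_d", "cpg_shared"],
    "low_pass_wgbs": ["cpg_e", "cpg_f", "cpg_shared"],
}


def make_stub_model(sites: Sequence[str], cutoff: float = 0.5) -> Callable[[Sample], int]:
    """Placeholder for a trained assay-specific classifier (assumption)."""
    def predict(sample: Sample) -> int:
        values = [sample[s] for s in sites if s in sample]
        if not values:
            raise ValueError("sample has none of this model's features")
        # Call HCC (1) when mean methylation over the model's sites
        # reaches the cutoff, otherwise non-HCC (0).
        return int(sum(values) / len(values) >= cutoff)
    return predict


MODELS: Dict[str, Callable[[Sample], int]] = {
    name: make_stub_model(sites) for name, sites in ASSAY_FEATURES.items()
}


def route(sample: Sample) -> str:
    """Placeholder router: pick the assay whose feature set the sample covers best.

    Assumption: this heuristic stands in for the trained meta-classifier.
    """
    def coverage(name: str) -> float:
        sites = ASSAY_FEATURES[name]
        return sum(1 for s in sites if s in sample) / len(sites)
    return max(ASSAY_FEATURES, key=coverage)


def classify(sample: Sample) -> Tuple[str, int]:
    """Route a sample, filter it to the chosen model's features, and predict."""
    assay = route(sample)
    features = {s: sample[s] for s in ASSAY_FEATURES[assay] if s in sample}
    return assay, MODELS[assay](features)


wgbs_like = {"cpg_e": 0.9, "cpg_f": 0.7, "cpg_shared": 0.8}
mcta_like = {"cpg_a": 0.1, "cpg_b": 0.2, "cpg_shared": 0.3}
wgbs_result = classify(wgbs_like)
mcta_result = classify(mcta_like)
```

The point of this layout is that each sample is filtered to the feature set of a single model rather than being projected onto a harmonized feature space, so a new assay could be supported by adding one entry to the feature table and one model.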