CombinGym: a benchmark platform for machine learning-assisted design of combinatorial protein variants
Chen, Y.; Fu, L.; Lu, X.; Li, W.; Gao, Y.; Wang, Y.; Ruan, Z.; Si, T.
Abstract
Combinatorial mutagenesis is essential for exploring protein sequence-function landscapes in engineering applications. However, the large-scale machine learning benchmarks that exist for protein function prediction are primarily limited to single-mutant libraries, leaving a critical gap for combinatorial mutagenesis. Here we introduce CombinGym, a benchmarking platform featuring 14 curated combinatorial mutagenesis datasets spanning nine proteins with diverse functional properties, including binding affinity, fluorescence, and enzymatic activity. We evaluated nine machine learning algorithms from five methodological categories (alignment-based, protein language, structure-based, sequence-label, and substitution-based) across multiple prediction tasks, assessing both zero-shot and supervised learning performance using Spearman's ρ and Normalized Discounted Cumulative Gain (NDCG). Our analysis reveals the substantial impact of measurement noise and data processing strategies on model performance. By implementing hierarchical dataset splits (0-vs-rest, 1-vs-rest, 2-vs-rest, and 3-vs-rest scenarios), we demonstrate the value of lower-order mutation data for empowering machine learning models to predict higher-order mutant properties. We validated this capacity both in silico (improving the fluorescence brightness of an oxygen-independent fluorescent protein) and experimentally (engineering enzyme substrate specificity), achieving a substantial increase in specific activity. All datasets, benchmarks, and metrics are available through an interactive website (https://www.combingym.org), facilitating collaborative dataset expansion and model development through integration with automated biofoundry platforms.
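The evaluation setup named in the abstract (hierarchical k-vs-rest splits scored with Spearman's ρ and NDCG) can be illustrated with a minimal sketch. This is not the CombinGym implementation: the function names, the toy sequences, and the fitness values are hypothetical, and two readings are assumed here. First, "k-vs-rest" is taken to mean training on variants with at most k substitutions relative to the wild type and testing on the rest. Second, NDCG is taken in its common form, with the measured values as non-negative gains and a log2 position discount. The platform's exact definitions may differ.

```python
import math

def count_mutations(variant, wild_type):
    """Number of positions at which a variant differs from the wild type."""
    return sum(a != b for a, b in zip(variant, wild_type))

def k_vs_rest_split(variants, wild_type, k):
    """Assumed reading of 'k-vs-rest': train on <= k mutations, test on the rest."""
    train = [v for v in variants if count_mutations(v, wild_type) <= k]
    test = [v for v in variants if count_mutations(v, wild_type) > k]
    return train, test

def average_ranks(values):
    """1-based ranks; tied values receive the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(y_true, y_pred):
    """Spearman's rho as the Pearson correlation of the two rank vectors.
    Undefined (division by zero) if either input is constant."""
    rx, ry = average_ranks(y_true), average_ranks(y_pred)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

def ndcg(y_true, y_pred):
    """NDCG with measured values as gains (assumed non-negative, not all zero)."""
    def dcg(gains):
        return sum(g / math.log2(pos + 2) for pos, g in enumerate(gains))
    pairs = sorted(zip(y_pred, y_true), key=lambda p: p[0], reverse=True)
    by_pred = [t for _, t in pairs]
    ideal = sorted(y_true, reverse=True)
    return dcg(by_pred) / dcg(ideal)

# Toy library (hypothetical sequences and fitness values).
wild_type = "AAAA"
fitness = {"AAAA": 1.0, "CAAA": 1.2, "ACAA": 0.8, "CCAA": 1.5, "CCCA": 2.0}
train, test = k_vs_rest_split(list(fitness), wild_type, k=1)
```

In this toy case the 1-vs-rest split trains on the wild type and the two single mutants and holds out the double and triple mutants. A model's predictions for the held-out variants would then be passed, together with the measured values, to `spearman_rho` and `ndcg`. Spearman's ρ weighs the whole ranking equally, whereas NDCG rewards placing the best variants near the top, which is closer to how a screening campaign selects candidates.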