NativeReady: an open benchmark and sequence-based triage model for native mass spectrometry suitability
Znabu, B. F.; Atif, Z.
Abstract
Native mass spectrometry is a central analytical method for characterizing intact proteins, antibody-drug conjugates, and non-covalent assemblies, and it is increasingly the deciding measurement in biotherapeutic development pipelines. A single screening attempt requires days of expression, purification, and buffer exchange into ammonium acetate, followed by 30 to 60 minutes of optimization on a Q-Exactive UHMR or comparable instrument. To our knowledge, no published sequence-based predictor currently estimates native MS suitability before experimental screening. We curated 634 unique proteins with documented native MS outcomes: a 232-protein hand-curated base set, 358 entries recovered from RCSB PDB by full-text searching for native MS terminology, and 44 evidence-based extractions from supplementary tables across 80 EuropePMC papers. We trained four model variants on this benchmark: a 36-feature BioPython physicochemical baseline, an ESM-2 linear probe, an ESM-2 PCA-256 random forest, and a combined model that concatenates ESM-2 PCA components with BioPython features. All variants were evaluated under cluster-aware 5-fold cross-validation (GroupKFold over ESM-2 embedding-similarity clusters, our defense against homology leakage) with isotonic calibration; standard stratified 5-fold cross-validation is reported as a sensitivity analysis. Under cluster-aware splitting, the combined model achieved an AUC of 0.869 ± 0.036, essentially unchanged from the original stratified-CV value (0.873), compared with 0.852 for the BioPython baseline. The ESM-2-only variants showed AUC drops of 0.024 to 0.046 between stratified and cluster-aware splits, indicating that some of the apparent ESM-2 contribution under standard CV reflects homology leakage.
Negative recall was 9.4% under cluster-aware splitting versus 26.0% under stratified splitting, confirming that the model's apparent failure-detection capability was substantially inflated by homology between training and test folds. We report both numbers and treat the cluster-aware values as the primary results. We release the curated dataset, the trained model, and an interactive web tool at nativeready.netlify.app. In its current form, NativeReady should be interpreted primarily as a positive-suitability triage tool; failure prediction remains limited by the scarcity of experimentally documented negative cases. We propose a user-contribution mechanism to accumulate real failure data over time. To our knowledge, NativeReady is the first open benchmark and triage model specifically designed for this task.
Keywords: native mass spectrometry; protein language models; ESM-2; biotherapeutic characterization; benchmark dataset; sequence-based prediction.
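The evaluation protocol described in the abstract can be illustrated with a minimal scikit-learn sketch. It is not the released NativeReady pipeline. The data are synthetic, and KMeans stands in for the ESM-2 embedding-similarity clustering, whose actual method is not stated here. The PCA dimension (16 rather than 256), the forest size, the 0.5 decision threshold, and the helper names `make_model` and `evaluate` are all assumptions made for illustration. The sketch shows only how PCA-reduced embeddings are concatenated with physicochemical features, calibrated isotonically, and scored under GroupKFold versus StratifiedKFold.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import GroupKFold, StratifiedKFold
from sklearn.pipeline import make_pipeline


def make_model(n_emb, n_phys, n_components=16):
    """Combined model: PCA on embedding columns + raw physicochemical columns."""
    combine = ColumnTransformer([
        ("pca", PCA(n_components=n_components, random_state=0), list(range(n_emb))),
        ("phys", "passthrough", list(range(n_emb, n_emb + n_phys))),
    ])
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    # Isotonic calibration is fitted on internal folds of the training data only.
    return CalibratedClassifierCV(make_pipeline(combine, forest), method="isotonic", cv=3)


def evaluate(X, y, splits, n_emb, n_phys):
    """Return (mean AUC, AUC std, mean negative-class recall) over the given splits."""
    aucs, neg_recalls = [], []
    for train, test in splits:
        model = make_model(n_emb, n_phys).fit(X[train], y[train])
        prob = model.predict_proba(X[test])[:, 1]
        if len(np.unique(y[test])) == 2:  # AUC is undefined for one-class folds
            aucs.append(roc_auc_score(y[test], prob))
        pred = (prob >= 0.5).astype(int)  # assumed threshold
        neg_recalls.append(recall_score(y[test], pred, pos_label=0, zero_division=0))
    return float(np.mean(aucs)), float(np.std(aucs)), float(np.mean(neg_recalls))


# Synthetic stand-ins: 64-dim "embeddings" and 36 physicochemical features.
rng = np.random.default_rng(0)
n, n_emb, n_phys = 300, 64, 36
emb = rng.normal(size=(n, n_emb))
phys = rng.normal(size=(n, n_phys))
X = np.hstack([emb, phys])
y = (phys[:, 0] + emb[:, 0] + rng.normal(size=n) > -1.0).astype(int)  # mostly positives

# Assumed clustering step: KMeans on embeddings defines the CV groups.
groups = KMeans(n_clusters=30, n_init=10, random_state=0).fit_predict(emb)

cluster_splits = list(GroupKFold(n_splits=5).split(X, y, groups))
strat_splits = list(StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y))

cluster_auc, cluster_sd, cluster_neg = evaluate(X, y, cluster_splits, n_emb, n_phys)
strat_auc, strat_sd, strat_neg = evaluate(X, y, strat_splits, n_emb, n_phys)

print(f"cluster-aware: AUC {cluster_auc:.3f} ± {cluster_sd:.3f}, negative recall {cluster_neg:.3f}")
print(f"stratified:    AUC {strat_auc:.3f} ± {strat_sd:.3f}, negative recall {strat_neg:.3f}")
```

Because the synthetic clusters carry no real homology signal, the two splitting schemes are not expected to reproduce the gap reported above. What the sketch does show is that both PCA and calibration are refit inside each training fold, and that no cluster contributes samples to both the training and test side of a GroupKFold split.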