Interpreting biochemical text with language models: a machine learning framework for reaction extraction and cheminformatic validation
Lim, D.; Badrinarayanan, S.; Sterling, K. C.; Rajesh, G.; Mistry, E.; Yang, D.; Lee, M.; Hsu, K. B.; Manjrekar, M.; Areff, C.; Xie, P.; Kristanto, I. A.; Chandran, A.; Anderson, J. C.
Abstract
Recent advances in large language models (LLMs) offer new opportunities to automate the curation of biochemical reaction databases from the scientific literature, a task that is currently performed manually. In this study, we present an integrated pipeline that augments LLM-based extraction of enzymatic reactions with machine learning and cheminformatics-informed validation. Using BRENDA-linked PubMed articles, we evaluate GPT-4's ability to extract reactions and to infer chemical entities that are missing from textual descriptions of enzymatic reactions. Extracted reactions are converted to SMILES and InChI notation, from which we compute molecular fingerprint similarity scores and atom-mapping metrics. These cheminformatics metrics are then used as features to train machine learning classifiers that validate the GPT extractions. Because only valid reactions are available as labeled examples, we adopt a Positive-Unlabeled learning approach with synthetic invalid reactions, train several classifiers, and compare their performance. The best classifier is then benchmarked on the GPT extractions. Our findings show that GPT can accurately complete incompletely described reactions and that cheminformatics tools can serve as effective predictors of reaction validity. This work demonstrates a scalable framework for automated curation of enzymatic reaction databases and highlights the potential of combining LLMs with cheminformatics and machine learning for reliable extraction of scientific knowledge.
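As an illustration of the kind of fingerprint similarity feature mentioned above, the following minimal sketch computes a Tanimoto coefficient between two fingerprints and aggregates it into a single reaction-level score. It rests on several assumptions that are not taken from the paper: fingerprints are represented as sets of on-bit indices, Tanimoto is the similarity measure, and the aggregation (mean best-match similarity of each substrate against the products) is a hypothetical choice made only for this example. The function names `tanimoto` and `reaction_similarity` are likewise illustrative.

```python
from typing import Iterable, Sequence, Set


def tanimoto(fp_a: Iterable[int], fp_b: Iterable[int]) -> float:
    """Tanimoto coefficient |A & B| / |A | B| of two sets of on-bit indices."""
    a, b = frozenset(fp_a), frozenset(fp_b)
    union = a | b
    if not union:
        # Two empty fingerprints: define the similarity as 0.0 (assumption).
        return 0.0
    return len(a & b) / len(union)


def reaction_similarity(
    substrate_fps: Sequence[Set[int]],
    product_fps: Sequence[Set[int]],
) -> float:
    """Hypothetical reaction-level feature: for each substrate, take its
    best Tanimoto match among the products, then average over substrates."""
    if not substrate_fps or not product_fps:
        return 0.0
    best_matches = [
        max(tanimoto(s, p) for p in product_fps) for s in substrate_fps
    ]
    return sum(best_matches) / len(best_matches)


if __name__ == "__main__":
    # Toy fingerprints; real ones would come from a cheminformatics toolkit.
    substrates = [{1, 2, 3, 4}, {10, 11}]
    products = [{1, 2, 3, 5}, {10, 11, 12}]
    print(reaction_similarity(substrates, products))
```

A score of this kind could serve as one column in the feature matrix passed to a validity classifier. In practice the fingerprints would be generated from the SMILES strings by a cheminformatics library rather than written by hand.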