CPP2Vec: a Representation Learning Approach for Cell-Penetrating Peptides Prediction
Svolou, S.; Konstantakos, V.; Krithara, A.; Paliouras, G.
Abstract

Background: Cell-penetrating peptides (CPPs) facilitate the delivery of a variety of therapeutic molecules across the plasma membrane, from small chemical substances to nucleic acid-based macromolecules such as antisense oligonucleotides (ASOs). Among neutral ASOs, peptide nucleic acids (PNAs) and phosphorodiamidate morpholino oligomers (PMOs) have been extensively studied as potential treatments for Duchenne Muscular Dystrophy (DMD), a severe genetic disease that causes progressive muscle degeneration. Over the last few decades, many in silico methods have emerged to detect novel CPPs and offset the cost of wet-lab experiments.

Results: In this study, we propose CPP2Vec, a Word2Vec-based CPP prediction method in which the Word2Vec technique is used to represent the amino acid sequences of peptides. We developed three task-specific supervised machine learning models: CPP-Classification, Uptake-Efficiency and PMO-Delivery. The first two models determine whether an unseen peptide is a CPP and predict its uptake efficiency, respectively, while the PMO-Delivery model predicts whether a peptide could enhance the cellular delivery of a PMO complex compared to the naked PMO. Furthermore, we explored an alternative approach that uses pretrained protein-based Large Language Models (LLMs), namely T5, BERT, and ESM-2, to generate the embeddings, resulting in three task-specific models collectively named CPP2LLM. A comparison of CPP2Vec and CPP2LLM with state-of-the-art CPP prediction tools is included and indicates strong predictive performance for both.

Conclusion: In this research, we present a Machine Learning (ML)-based tool that introduces the Word2Vec technique to the field of CPP prediction. Notably, it requires no manual a priori feature engineering and generalizes across the studied tasks without modification. CPP2Vec is available for use at: https://github.com/SSvolou/CPP2Vec.
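
To make the idea of representing a peptide with Word2Vec-style vectors more concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not state how sequences are tokenized or pooled, so the overlapping k-mer "words", the value k = 3, the mean pooling, and the names `kmer_tokens`, `embed_peptide` and `toy_table` are all assumptions made for illustration. The two-dimensional `toy_table` is a hypothetical stand-in for vectors that would normally come from a Word2Vec model trained on a peptide corpus.

```python
from typing import Dict, List


def kmer_tokens(sequence: str, k: int = 3) -> List[str]:
    """Split a peptide into overlapping k-mers, treated as 'words'.

    Assumption: overlapping k-mers with k=3; the abstract does not
    specify the tokenization scheme.
    """
    seq = sequence.strip().upper()
    if len(seq) < k:
        # Peptides shorter than k become a single token (or none if empty).
        return [seq] if seq else []
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def embed_peptide(
    sequence: str, table: Dict[str, List[float]], k: int = 3
) -> List[float]:
    """Mean-pool the vectors of all k-mers found in the lookup table.

    Assumption: mean pooling; k-mers missing from the table are skipped,
    and a zero vector is returned when no k-mer is known.
    """
    dim = len(next(iter(table.values())))
    known = [t for t in kmer_tokens(sequence, k) if t in table]
    if not known:
        return [0.0] * dim
    return [sum(table[t][d] for t in known) / len(known) for d in range(dim)]


# Hypothetical 2-dimensional vectors standing in for trained embeddings.
toy_table = {"RRR": [1.0, 0.0], "RRW": [0.0, 1.0]}

# Fixed-length feature vector that a downstream classifier or regressor
# could consume, regardless of peptide length.
vector = embed_peptide("RRRW", toy_table)
```

Because every peptide maps to a vector of the same dimension, the same representation could in principle feed each of the three task-specific models without hand-crafted features, which is the property the abstract highlights.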