TogoMCP: Natural Language Querying of Life-Science Knowledge Graphs via Schema-Guided LLMs and the Model Context Protocol
Kinjo, A. R.; Yamamoto, Y.; Bustamante-Larriet, S.; Labra-Gayo, J. E.; Fujisawa, T.
Abstract

Querying the RDF Portal knowledge graph maintained by DBCLS, which aggregates more than 70 life-science databases, requires proficiency in both SPARQL and database-specific RDF schemas, placing this resource beyond the reach of most researchers. Large Language Models (LLMs) can, in principle, translate natural-language questions into executable SPARQL, but without schema-level context, they frequently fabricate non-existent predicates or fail to resolve entity names to database-specific identifiers. We present TogoMCP, a system that recasts the LLM as a protocol-driven inference engine orchestrating specialized tools via the Model Context Protocol (MCP). Two mechanisms are essential to its design: (i) the MIE (Metadata-Interoperability-Exchange) file, a concise YAML document that dynamically supplies the LLM with each target database's structural and semantic context at query time; and (ii) a two-stage workflow separating entity resolution via external REST APIs from schema-guided SPARQL generation.

On a benchmark of 50 biologically grounded questions spanning five types and 23 databases, TogoMCP achieved a large improvement over an unaided baseline (Cohen's d = 0.92, Wilcoxon p < 10⁻⁶), with win rates exceeding 80% for question types with precise, verifiable answers. An ablation study identified MIE files as the single indispensable component: removing them reduced the effect to a non-significant level (d = 0.08), while a one-line instruction to load the relevant MIE file recovered the full benefit of an elaborate behavioral protocol. These results suggest a general design principle: concise, dynamically delivered schema context is more valuable than complex orchestration logic.
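The two-stage workflow described in the abstract can be illustrated with a minimal Python sketch. It is not TogoMCP's implementation: the dictionary standing in for a parsed MIE file, its keys, the `example.org` IRIs, and the functions `resolve_entity`, `build_sparql`, and `answer` are all hypothetical, and a local lookup table replaces the external REST APIs so that the example runs offline. The point it tries to show is that the query builder only emits predicates that appear in the schema context, so an unknown predicate is rejected rather than invented.

```python
# Hypothetical stand-in for the parsed content of an MIE file.
# The real MIE format is YAML; its actual keys are not given in the abstract.
MIE_CONTEXT = {
    "database": "exampledb",
    "prefixes": {"ex": "http://example.org/schema/"},
    "predicates": {"label": "ex:label", "organism": "ex:organism"},
}

# Hypothetical local index replacing the external REST APIs used for
# entity resolution (stage 1).
ENTITY_INDEX = {"example protein": "http://example.org/entity/E0001"}


def resolve_entity(name: str) -> str:
    """Stage 1: map a free-text entity name to a database-specific IRI."""
    key = name.strip().lower()
    if key not in ENTITY_INDEX:
        raise KeyError(f"no identifier found for {name!r}")
    return ENTITY_INDEX[key]


def build_sparql(entity_iri: str, property_name: str, mie: dict) -> str:
    """Stage 2: build a query using only predicates listed in the schema context."""
    predicates = mie["predicates"]
    if property_name not in predicates:
        # Refuse to guess: the schema context does not define this property.
        raise ValueError(f"{property_name!r} is not defined in the schema context")
    prefix_lines = "\n".join(
        f"PREFIX {prefix}: <{iri}>" for prefix, iri in sorted(mie["prefixes"].items())
    )
    return (
        f"{prefix_lines}\n"
        "SELECT ?value WHERE {\n"
        f"  <{entity_iri}> {predicates[property_name]} ?value .\n"
        "}"
    )


def answer(name: str, property_name: str, mie: dict = MIE_CONTEXT) -> str:
    """Run both stages in order: resolve the entity, then generate SPARQL."""
    return build_sparql(resolve_entity(name), property_name, mie)
```

Calling `answer("Example Protein", "organism")` returns a SPARQL string that uses the resolved IRI and the `ex:organism` predicate, while `answer("Example Protein", "function")` raises `ValueError` because that property is absent from the schema context. In the system described above, an LLM rather than a fixed template performs the generation step; the sketch only illustrates the separation of the two stages and the role of schema context.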