PromptBio-Bench: Benchmarking LLM-based Bioinformatics Agents for End-to-End Data Analysis
Guo, W.; Zhang, M.; Han, B.; Ma, Y.; Leng, Y.; Hebbar, S.; Zhou, X.; Gu, W.; Yang, X.; Dhar, S.
Abstract

Large language model (LLM)-based agents hold transformative potential for automating bioinformatics workflows; however, systematic evaluations of their capabilities remain limited, hindering a clear assessment of their readiness for real-world application. We introduce PromptBio-Bench, a comprehensive evaluation suite of 194 expert-curated tasks spanning bioinformatics and data science at varied difficulty levels, together with an evaluation framework for structured file comparison and scoring against expert reference answers. Benchmarking three state-of-the-art agents revealed that Biomni and ToolsGenie achieved comparable performance, while accuracy declined markedly at higher difficulty levels across all agents. As foundation models and agent frameworks continue to evolve, PromptBio-Bench provides benchmark infrastructure for the community to systematically track the progress of agentic bioinformatics.
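The abstract describes scoring agent outputs by structured file comparison against expert reference answers. As a minimal illustrative sketch (not the paper's actual metric), one common approach is a cell-wise comparison of tabular output against a reference table, matching numeric cells within a relative tolerance and other cells exactly; the function name and scoring rule below are hypothetical:

```python
import csv
import io
import math

def score_structured_output(candidate_csv: str, reference_csv: str,
                            rel_tol: float = 1e-6) -> float:
    """Return the fraction of reference cells reproduced by the candidate.

    Numeric cells match within a relative tolerance; all other cells must
    match exactly after whitespace stripping. Missing rows or cells in the
    candidate count as mismatches. Illustrative sketch only.
    """
    cand_rows = list(csv.reader(io.StringIO(candidate_csv)))
    ref_rows = list(csv.reader(io.StringIO(reference_csv)))
    total = sum(len(row) for row in ref_rows)  # every reference cell is scored
    matched = 0
    for i, ref_row in enumerate(ref_rows):
        cand_row = cand_rows[i] if i < len(cand_rows) else []
        for j, ref_cell in enumerate(ref_row):
            if j >= len(cand_row):
                continue  # missing cell: counted in total, not in matched
            cand_cell = cand_row[j]
            try:
                if math.isclose(float(ref_cell), float(cand_cell),
                                rel_tol=rel_tol):
                    matched += 1
            except ValueError:
                # non-numeric cell: exact string comparison
                if ref_cell.strip() == cand_cell.strip():
                    matched += 1
    return matched / total if total else 0.0

# Example: a near-exact numeric value passes the tolerance check,
# while a genuinely different value is penalized.
reference = "gene,logFC\nTP53,1.5\nBRCA1,-0.2"
candidate = "gene,logFC\nTP53,1.5000001\nBRCA1,0.3"
score = score_structured_output(candidate, reference)  # 5 of 6 cells match
```

In practice a benchmark framework would layer format detection, column alignment, and per-task scoring rules on top of a primitive like this; the sketch only shows the core comparison step.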