1st Place, Bio-AI Co-Scientist Competition (Dec 2025). Hosted by the AI-Bio Institute, SNU and IITP (Ministry of Science and ICT). Judges: Peter Frazier (Cornell), Masashi Sugiyama (RIKEN/UTokyo), Hyeshik Chang (SNU), Soo Seok Hwang (SNU), Chaok Seok (SNU/Galux).
We built a multi-agent AI system designed to address complex biomedical research questions through autonomous collaboration. A Principal Investigator (PI) agent coordinates specialized agents that handle computational analysis, literature search, and database queries, all under the adversarial oversight of a Critic agent. The competition had two categories: Server (short time limit) and Human (longer time limit, with human intervention allowed). For the Server Category, we submitted fully automated responses generated by the framework, using diverse LLMs and an LLM-as-a-judge protocol for quality selection. For the Human Category, we used a cross-model validation protocol: outputs from multiple LLMs were systematically cross-referenced, checked for hallucinations against raw data, and iteratively refined. Human intervention focused on orchestrating verification, not on direct content modification.
Two modes:
Each agent has access to: execute_python (data analysis, plotting); search_literature (RAG over local PDFs plus PubMed, arXiv, Semantic Scholar via PaperQA); query_database (DrugBank, BindingDB, STRING, GWAS Catalog); and read_file / find_files for navigating intermediate results. Agents choose tools autonomously until satisfied with the answer.
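The tool loop described above can be sketched roughly as follows. This is a minimal illustration, not the framework's actual code: the tool names come from the text, but the registry, the stubbed tool bodies, the `run_agent` function, and the `choose_action` callback (standing in for the LLM's decision) are all hypothetical. Reusing `max_iterations=30` as the loop cap is also an assumption.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical registry mapping tool names (from the text) to callables.
TOOLS: Dict[str, Callable[..., str]] = {}


def tool(name: str):
    """Decorator that registers a function under a tool name."""
    def register(fn: Callable[..., str]) -> Callable[..., str]:
        TOOLS[name] = fn
        return fn
    return register


# Stub implementations: the real tools run code, search papers, and
# query databases. These only return placeholder strings.
@tool("execute_python")
def execute_python(code: str) -> str:
    return f"[stub] would run {len(code)} characters of Python"


@tool("search_literature")
def search_literature(query: str) -> str:
    return f"[stub] literature hits for {query!r}"


@tool("query_database")
def query_database(source: str, query: str) -> str:
    return f"[stub] {source} rows for {query!r}"


Action = Optional[Tuple[str, dict]]


def run_agent(choose_action: Callable[[List[Tuple[str, str]]], Action],
              max_iterations: int = 30) -> List[Tuple[str, str]]:
    """Call tools until the agent returns None (satisfied) or the cap is hit."""
    history: List[Tuple[str, str]] = []
    for _ in range(max_iterations):
        action = choose_action(history)
        if action is None:  # agent considers the answer complete
            break
        name, kwargs = action
        if name in TOOLS:
            result = TOOLS[name](**kwargs)
        else:
            result = f"error: unknown tool {name!r}"
        history.append((name, result))
    return history
```

In a real agent, `choose_action` would wrap an LLM call that sees the tool history and either picks the next tool or stops. Returning errors as ordinary results, rather than raising, lets the agent recover from a bad tool name on the next step.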
An LLM-as-a-judge system (FastChat-style) scores answers on a 1–10 scale across six criteria: Scientific Accuracy (30%), Evidence Quality (20%), Methodological Rigor (15%), Completeness (15%), Clarity (10%), Critical Thinking (10%). Pairwise comparison picks winners between answer variants, enabling systematic selection across pipeline modes and model providers.
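To make the scoring arithmetic concrete, here is a small sketch of how the six weighted criteria could be combined and how pairwise comparison could pick a winner. The weights are the ones stated above. Everything else is an assumption for illustration: the function names, the round-robin tournament, and the `prefer` callback, which stands in for the LLM judge's pairwise verdict.

```python
from itertools import combinations
from typing import Callable, Dict

# Criterion weights as stated in the text; they sum to 1.0.
WEIGHTS: Dict[str, float] = {
    "scientific_accuracy": 0.30,
    "evidence_quality": 0.20,
    "methodological_rigor": 0.15,
    "completeness": 0.15,
    "clarity": 0.10,
    "critical_thinking": 0.10,
}


def weighted_score(scores: Dict[str, float]) -> float:
    """Combine per-criterion scores (each on the 1-10 scale) into one number."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    for name in WEIGHTS:
        if not 1 <= scores[name] <= 10:
            raise ValueError(f"{name} must be in [1, 10], got {scores[name]}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def select_best(answers: Dict[str, Dict[str, float]],
                prefer: Callable[[str, str], str]) -> str:
    """Round-robin pairwise comparison; the answer with the most wins is chosen.

    `prefer(a, b)` must return either `a` or `b`. Ties in win count go to
    the answer that appears first in `answers`.
    """
    wins = {name: 0 for name in answers}
    for a, b in combinations(answers, 2):
        wins[prefer(a, b)] += 1
    return max(answers, key=lambda name: wins[name])


if __name__ == "__main__":
    # Hypothetical scores for two answer variants (not competition data).
    answers = {
        "variant_a": dict(scientific_accuracy=8, evidence_quality=7,
                          methodological_rigor=6, completeness=9,
                          clarity=8, critical_thinking=7),
        "variant_b": dict(scientific_accuracy=9, evidence_quality=8,
                          methodological_rigor=7, completeness=6,
                          clarity=7, critical_thinking=8),
    }

    # Stand-in judge: prefers the higher weighted score.
    def by_score(a: str, b: str) -> str:
        sa, sb = weighted_score(answers[a]), weighted_score(answers[b])
        return a if sa >= sb else b

    print(select_best(answers, by_score))
```

Because Scientific Accuracy carries 30% of the weight, a one-point gain there offsets a two-point loss in Completeness, which is why the two stages are kept separate here: rubric scores explain an answer's strengths, while the pairwise verdict makes the final pick.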
team-size=3, max-iterations=30, rounds=2. When multiple outputs were generated, pairwise evaluation selected the highest-quality answer.

Highlights per problem:
GM42878, DUXBL3; incorrect DEG counts) were caught and removed, while valid conceptual insights, e.g. the “biphasic exhaustion” hypothesis, were preserved.

Structured workflows plus adversarial review let multi-agent systems answer research-level questions. LLM web assistants used as mutual fact-checkers reliably caught hallucinations while preserving valid insights, suggesting AI can serve as a research collaborator when the verification machinery is in place.

The main difficulty was balancing novelty and accuracy. For problems like P5, the competition judges commented that none of the competitors provided a truly novel insight. We are working on improving the system’s ability to generate novel hypotheses while maintaining scientific rigor (keep an eye out for our upcoming workshop proposal on this topic!).