Ploid AI on BixBench: benchmarking our agent against the field
AI agents are playing a growing role in biological research, and proper evaluation needs to keep up. BixBench [1] tackles this by pairing real bioinformatics datasets with research questions that test whether an agent can actually analyze data and produce correct answers across a range of domains.
We ran Ploid AI on both the full BixBench benchmark and the curated BixBench-Verified-50 subset [2]. This post presents our results in detail.
Why BixBench
We observed a growing tendency in the field to benchmark bioinformatics agents against BixBench, and we believe it is a strong way to compare our agent against the rest of the field. Agents receive real biological datasets and short-answer research questions spanning domains from genomics to proteomics. The benchmark tests biological knowledge together with the ability to load data, choose appropriate methods, execute analysis pipelines, and interpret results.
However, not all failures on BixBench reflect agent limitations. Some questions are ambiguous, underspecified, or have incorrect ground truth. To address this, Phylo curated BixBench-Verified-50 [2], a subset of 50 questions reviewed by domain experts for correctness and clarity. We evaluated Ploid AI on both versions.
Overall results
| Agent | BixBench | BixBench-Verified-50 |
|---|---|---|
| Ploid AI | 71.2% | 92.0% |
| K-Dense Web** | not reported | 90.0% |
| BIOS*** | 64.4% | 90.0% |
| Biomni Lab (2026-02-03)* | 52.2% | 88.7% |
| Edison Analysis* | 42.4% | 78.0% |
| Claude Code (Opus 4.6)* | 39.5% | 65.3% |
| OpenAI Agents SDK (GPT-5.2)* | 38.5% | 61.3% |
*Results for Claude Code, OpenAI Agents SDK, Edison Analysis, and Biomni Lab sourced from [3].
**Results for K-Dense Web sourced from [4]. K-Dense Web does not report a score on the full BixBench benchmark; only BixBench-Verified-50 results are available.
***Results for BIOS sourced from [5] (BixBench-Verified-50) and [6] (full BixBench).
Ploid AI achieves 71.2% on the full BixBench benchmark and 92.0% on the verified subset, leading both general-purpose and domain-specific agents on each. K-Dense Web [4] and BIOS [5] also report strong results on the verified subset (90.0% each), and BIOS additionally reports 64.4% on the full benchmark [6]; K-Dense Web has not published a full-benchmark score.
All reported results use MCQ-based evaluation without a refusal option. This modality removes ambiguity introduced by formatting mismatches, rounding differences, and uncertainty inherent in LLM-based answer grading.
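As a sketch (not BixBench's actual evaluation harness), MCQ grading reduces to an exact option-letter match, which is why no LLM judge or answer normalization is involved; the question IDs and helper below are illustrative:

```python
def grade_mcq(predictions, answer_key):
    """Score multiple-choice answers by exact option-letter match.

    predictions and answer_key map question IDs to option letters
    (e.g. "A"). Because comparison is a single-character match, no
    LLM judge, string normalization, or numeric tolerance is needed,
    which is what removes grading ambiguity.
    """
    correct = sum(
        1 for qid, gold in answer_key.items()
        if predictions.get(qid, "").strip().upper() == gold
    )
    return 100.0 * correct / len(answer_key)

# Hypothetical three-question run: two of three correct.
key = {"q1": "A", "q2": "C", "q3": "B"}
preds = {"q1": "A", "q2": "D", "q3": "b"}
print(grade_mcq(preds, key))
```

Note that without a refusal option, an agent must commit to an answer on every question, so accuracy is computed over the full task set.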
BixBench-Verified-50: results by subject
Total accuracy: 92.0% across 50 verified tasks. (A task can fall under multiple subjects, so the per-subject task counts below sum to more than 50.)
| Subject | Tasks | Accuracy | Mean Response Time (s) | Mean Cost ($) |
|---|---|---|---|---|
| Antimicrobial Resistance | 2 | 100.0% | 251.35 | 0.3059 |
| Differential Expression Analysis | 14 | 92.9% | 403.94 | 0.4040 |
| Epigenomics | 3 | 100.0% | 35.68 | 0.1121 |
| Functional Genomics | 4 | 50.0% | 70.49 | 0.1894 |
| Genomic Variant Analysis | 4 | 100.0% | 118.45 | 0.3781 |
| Genomics | 20 | 95.0% | 254.36 | 0.4670 |
| Imaging | 4 | 100.0% | 72.68 | 0.1950 |
| Machine Learning and AI | 1 | 100.0% | 37.83 | 0.1029 |
| Other | 2 | 100.0% | 35.86 | 0.1097 |
| Phylogenetics | 1 | 100.0% | 417.23 | 0.3094 |
| Phylogenetics and Evolutionary Analysis | 13 | 92.3% | 284.12 | 0.5312 |
| Proteomics | 2 | 100.0% | 31.57 | 0.1085 |
| RNA-seq | 16 | 93.8% | 353.75 | 0.3723 |
| SNP Analysis | 2 | 100.0% | 251.35 | 0.3059 |
| Sequence Analysis | 9 | 88.9% | 147.58 | 0.2354 |
| Transcriptomics | 18 | 83.3% | 328.16 | 0.3665 |
| Whole Genome Sequencing (WGS) | 14 | 92.9% | 324.90 | 0.5291 |
Ploid AI achieves over 90% accuracy in 14 out of 17 subject areas. The most challenging categories are Functional Genomics (50.0%) and Transcriptomics (83.3%), which often involve multi-step analytical workflows where the benchmark expects specific methodological choices among several valid alternatives.
BixBench full: results by subject
Total accuracy: 71.2% across 205 tasks. (A task can fall under multiple subjects, so the per-subject task counts below sum to more than 205.)
| Subject | Tasks | Accuracy | Mean Response Time (s) | Mean Cost ($) |
|---|---|---|---|---|
| Antimicrobial Resistance | 6 | 100.0% | 550.64 | 0.6839 |
| Differential Expression Analysis | 67 | 61.2% | 501.97 | 0.5689 |
| Epigenomics | 12 | 75.0% | 84.83 | 0.2189 |
| Functional Genomics | 10 | 50.0% | 460.87 | 0.4948 |
| Genomic Variant Analysis | 16 | 50.0% | 157.68 | 0.4742 |
| Genomics | 74 | 77.0% | 272.18 | 0.6059 |
| Imaging | 21 | 90.5% | 44.74 | 0.1352 |
| Integrative Omics | 2 | 100.0% | 72.27 | 0.2456 |
| Machine Learning and AI | 5 | 100.0% | 415.86 | 0.4325 |
| Network Biology | 4 | 25.0% | 837.72 | 0.7370 |
| Other | 25 | 84.0% | 153.51 | 0.4250 |
| Phylogenetics | 4 | 50.0% | 206.86 | 0.2485 |
| Phylogenetics and Evolutionary Analysis | 47 | 85.1% | 286.78 | 0.6877 |
| Proteomics | 4 | 100.0% | 31.52 | 0.1094 |
| RNA-seq | 69 | 60.9% | 500.10 | 0.5747 |
| SNP Analysis | 6 | 100.0% | 550.64 | 0.6839 |
| Sequence Analysis | 30 | 56.7% | 418.52 | 0.4541 |
| Single-Cell Analysis | 2 | 100.0% | 511.43 | 0.1698 |
| Transcriptomics | 69 | 56.5% | 507.49 | 0.5929 |
| Whole Genome Sequencing (WGS) | 48 | 83.3% | 358.65 | 0.7413 |
On the full benchmark, performance remains strong across most categories. Areas with lower scores (Differential Expression Analysis, 61.2%; RNA-seq, 60.9%; Transcriptomics, 56.5%; Sequence Analysis, 56.7%) coincide with domains where BixBench contains a higher proportion of ambiguous or underspecified questions, as documented in the BixBench-Verified-50 curation process [2]. The jump from full to verified accuracy in these categories suggests that a significant fraction of "failures" stem from benchmark quality issues rather than agent limitations.
What the gap between full and verified tells us
The difference between BixBench (71.2%) and BixBench-Verified-50 (92.0%) is informative. Across all agents in the comparison, scores increase substantially on the verified subset. This pattern confirms what Phylo identified [3]: a meaningful portion of failures on the original benchmark come from ambiguous questions, underspecified context, or incorrect ground truth, not from genuine agent capability gaps.
For Ploid AI, the 20.8 percentage point improvement on the verified subset is actually the smallest among agents that report both scores. This indicates that our agent already handles ambiguous and underspecified questions relatively well on the full benchmark, while still posting the highest score on the curated, expert-verified tasks.
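The full-to-verified gains can be recomputed directly from the overall results table; this short script (scores copied from this post) sorts agents by their gain:

```python
# Full-benchmark vs. verified-subset accuracy (%) for agents reporting
# both scores, taken from the overall results table above.
scores = {
    "Ploid AI": (71.2, 92.0),
    "BIOS": (64.4, 90.0),
    "Biomni Lab": (52.2, 88.7),
    "Edison Analysis": (42.4, 78.0),
    "Claude Code (Opus 4.6)": (39.5, 65.3),
    "OpenAI Agents SDK (GPT-5.2)": (38.5, 61.3),
}

# Percentage-point gain from the full benchmark to the verified subset.
gains = {agent: round(verified - full, 1)
         for agent, (full, verified) in scores.items()}

# Ploid AI shows the smallest gain, i.e. the least sensitivity to
# ambiguous or underspecified questions in the full benchmark.
for agent, gain in sorted(gains.items(), key=lambda kv: kv[1]):
    print(f"{agent}: +{gain} pp")
```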
References
[1] BixBench, FutureHouse. A benchmark for bioinformatics agents. huggingface.co/datasets/futurehouse/BixBench
[2] BixBench-Verified-50, Phylo. A curated subset of 50 expert-verified BixBench questions. huggingface.co/datasets/phylobio/BixBench-Verified-50
[3] Phylo. Evaluating AI Agents in Biology. February 2026. phylo.bio/blog/evaluating-ai-agents-in-biology
[4] K-Dense. K-Dense Web Scores 90.0% on BixBench-Verified-50. March 2026. k-dense.ai/blog/bixbench-verified-50
[5] Bio Protocol AI. BixBench Verified 50: Evaluating BIOS Biological Agents. March 2026. ai.bio.xyz/blog/bixbench-verified-50-evaluating-bios-biological-agents
[6] Bio Protocol AI. BIOS Benchmark Results. April 2026. bio-xyz.github.io/bio-benchmark