Ploid AI on BixBench: benchmarking our agent against the field
AI agents are playing a growing role in biological research, and proper evaluation needs to keep up. BixBench [1] tackles this by pairing real bioinformatics datasets with research questions that test whether an agent can actually analyze data and produce correct answers across a range of domains.
We ran Ploid AI on both the full BixBench benchmark and the curated BixBench-Verified-50 subset [2]. This post presents our results in detail.
Why BixBench
We observed a growing tendency in the field to benchmark bioinformatics agents against BixBench, and we believe it is a strong way to compare our agent against the rest of the field. Agents receive real biological datasets and short-answer research questions spanning domains from genomics to proteomics. The benchmark tests biological knowledge together with the ability to load data, choose appropriate methods, execute analysis pipelines, and interpret results.
However, not all failures on BixBench reflect agent limitations. Some questions are ambiguous, underspecified, or have incorrect ground truth. To address this, Phylo curated BixBench-Verified-50 [2], a subset of 50 questions reviewed by domain experts for correctness and clarity. We evaluated Ploid AI on both versions.
Overall results
| Agent | BixBench | BixBench-Verified-50 |
|---|---|---|
| Ploid AI | 71.2% | 92.0% |
| K-Dense Web** | not reported | 90.0% |
| BIOS*** | 64.4% | 90.0% |
| Biomni Lab (2026-02-03)* | 52.2% | 88.7% |
| Edison Analysis* | 42.4% | 78.0% |
| Claude Code (Opus 4.6)* | 39.5% | 65.3% |
| OpenAI Agents SDK (GPT-5.2)* | 38.5% | 61.3% |
*Results for Claude Code, OpenAI Agents SDK, Edison Analysis, and Biomni Lab sourced from [3].
**Results for K-Dense Web sourced from [4]. K-Dense Web does not report a score on the full BixBench benchmark; only BixBench-Verified-50 results are available.
***Results for BIOS sourced from [5] (BixBench-Verified-50) and [6] (full BixBench).
Ploid AI achieves 71.2% on the full BixBench benchmark and 92.0% on the verified subset, leading both general-purpose and domain-specific agents on each. K-Dense Web [4] and BIOS [5] also report strong results on the verified subset (90.0% each), and BIOS additionally reports 64.4% on the full benchmark [6]; K-Dense Web has not published a full-benchmark score.
All reported results use MCQ-based evaluation without a refusal option. This modality removes ambiguity introduced by formatting mismatches, rounding differences, and uncertainty inherent in LLM-based answer grading.
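As a sketch (not BixBench's actual evaluation harness), MCQ grading reduces to an exact option-letter match, which is why no LLM judge or answer normalization is involved; the question IDs and helper below are illustrative:

```python
def grade_mcq(predictions, answer_key):
    """Score multiple-choice answers by exact option-letter match.

    predictions and answer_key map question IDs to option letters
    (e.g. "A"). Because comparison is a single-character match, no
    LLM judge, string normalization, or numeric tolerance is needed,
    which is what removes grading ambiguity.
    """
    correct = sum(
        1 for qid, gold in answer_key.items()
        if predictions.get(qid, "").strip().upper() == gold
    )
    return 100.0 * correct / len(answer_key)

# Hypothetical three-question run: two of three correct.
key = {"q1": "A", "q2": "C", "q3": "B"}
preds = {"q1": "A", "q2": "D", "q3": "b"}
print(grade_mcq(preds, key))
```

Note that without a refusal option, an agent must commit to an answer on every question, so accuracy is computed over the full task set.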
BixBench-Verified-50: results by subject
Total accuracy: 92.0% across 50 verified tasks. (A task can fall under multiple subjects, so the per-subject task counts below sum to more than 50.)
| Subject | Tasks | Accuracy | Mean Response Time (s) | Mean Cost ($) |
|---|---|---|---|---|
| Antimicrobial Resistance | 2 | 100.0% | 251.35 | 0.3059 |
| Differential Expression Analysis | 14 | 92.9% | 403.94 | 0.4040 |
| Epigenomics | 3 | 100.0% | 35.68 | 0.1121 |
| Functional Genomics | 4 | 50.0% | 70.49 | 0.1894 |
| Genomic Variant Analysis | 4 | 100.0% | 118.45 | 0.3781 |
| Genomics | 20 | 95.0% | 254.36 | 0.4670 |
| Imaging | 4 | 100.0% | 72.68 | 0.1950 |
| Machine Learning and AI | 1 | 100.0% | 37.83 | 0.1029 |
| Other | 2 | 100.0% | 35.86 | 0.1097 |
| Phylogenetics | 1 | 100.0% | 417.23 | 0.3094 |
| Phylogenetics and Evolutionary Analysis | 13 | 92.3% | 284.12 | 0.5312 |
| Proteomics | 2 | 100.0% | 31.57 | 0.1085 |
| RNA-seq | 16 | 93.8% | 353.75 | 0.3723 |
| SNP Analysis | 2 | 100.0% | 251.35 | 0.3059 |
| Sequence Analysis | 9 | 88.9% | 147.58 | 0.2354 |
| Transcriptomics | 18 | 83.3% | 328.16 | 0.3665 |
| Whole Genome Sequencing (WGS) | 14 | 92.9% | 324.90 | 0.5291 |
Ploid AI achieves over 90% accuracy in 14 out of 17 subject areas. The most challenging categories are Functional Genomics (50.0%) and Transcriptomics (83.3%), which often involve multi-step analytical workflows where the benchmark expects specific methodological choices among several valid alternatives.
BixBench full: results by subject
Total accuracy: 71.2% across 205 tasks. (A task can fall under multiple subjects, so the per-subject task counts below sum to more than 205.)
| Subject | Tasks | Accuracy | Mean Response Time (s) | Mean Cost ($) |
|---|---|---|---|---|
| Antimicrobial Resistance | 6 | 100.0% | 550.64 | 0.6839 |
| Differential Expression Analysis | 67 | 61.2% | 501.97 | 0.5689 |
| Epigenomics | 12 | 75.0% | 84.83 | 0.2189 |
| Functional Genomics | 10 | 50.0% | 460.87 | 0.4948 |
| Genomic Variant Analysis | 16 | 50.0% | 157.68 | 0.4742 |
| Genomics | 74 | 77.0% | 272.18 | 0.6059 |
| Imaging | 21 | 90.5% | 44.74 | 0.1352 |
| Integrative Omics | 2 | 100.0% | 72.27 | 0.2456 |
| Machine Learning and AI | 5 | 100.0% | 415.86 | 0.4325 |
| Network Biology | 4 | 25.0% | 837.72 | 0.7370 |
| Other | 25 | 84.0% | 153.51 | 0.4250 |
| Phylogenetics | 4 | 50.0% | 206.86 | 0.2485 |
| Phylogenetics and Evolutionary Analysis | 47 | 85.1% | 286.78 | 0.6877 |
| Proteomics | 4 | 100.0% | 31.52 | 0.1094 |
| RNA-seq | 69 | 60.9% | 500.10 | 0.5747 |
| SNP Analysis | 6 | 100.0% | 550.64 | 0.6839 |
| Sequence Analysis | 30 | 56.7% | 418.52 | 0.4541 |
| Single-Cell Analysis | 2 | 100.0% | 511.43 | 0.1698 |
| Transcriptomics | 69 | 56.5% | 507.49 | 0.5929 |
| Whole Genome Sequencing (WGS) | 48 | 83.3% | 358.65 | 0.7413 |
On the full benchmark, performance remains strong across most categories. Areas with lower scores (Differential Expression Analysis, 61.2%; RNA-seq, 60.9%; Transcriptomics, 56.5%; Sequence Analysis, 56.7%) coincide with domains where BixBench contains a higher proportion of ambiguous or underspecified questions, as documented in the BixBench-Verified-50 curation process [2]. The jump from full to verified accuracy in these categories suggests that a significant fraction of "failures" stem from benchmark quality issues rather than agent limitations.
What the gap between full and verified tells us
The difference between BixBench (71.2%) and BixBench-Verified-50 (92.0%) is informative. Across all agents in the comparison, scores increase substantially on the verified subset. This pattern confirms what Phylo identified [3]: a meaningful portion of failures on the original benchmark come from ambiguous questions, underspecified context, or incorrect ground truth, not from genuine agent capability gaps.
For Ploid AI, the 20.8 percentage point improvement on the verified subset is actually the smallest among agents that report both scores. This indicates that our agent already handles ambiguous and underspecified questions relatively well on the full benchmark, while still posting the highest score on the curated, expert-verified tasks.
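The full-to-verified gains can be recomputed directly from the overall results table; this short script (scores copied from this post) sorts agents by their gain:

```python
# Full-benchmark vs. verified-subset accuracy (%) for agents reporting
# both scores, taken from the overall results table above.
scores = {
    "Ploid AI": (71.2, 92.0),
    "BIOS": (64.4, 90.0),
    "Biomni Lab": (52.2, 88.7),
    "Edison Analysis": (42.4, 78.0),
    "Claude Code (Opus 4.6)": (39.5, 65.3),
    "OpenAI Agents SDK (GPT-5.2)": (38.5, 61.3),
}

# Percentage-point gain from the full benchmark to the verified subset.
gains = {agent: round(verified - full, 1)
         for agent, (full, verified) in scores.items()}

# Ploid AI shows the smallest gain, i.e. the least sensitivity to
# ambiguous or underspecified questions in the full benchmark.
for agent, gain in sorted(gains.items(), key=lambda kv: kv[1]):
    print(f"{agent}: +{gain} pp")
```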
References
[1] BixBench, FutureHouse. A benchmark for bioinformatics agents. huggingface.co/datasets/futurehouse/BixBench
[2] BixBench-Verified-50, Phylo. A curated subset of 50 expert-verified BixBench questions. huggingface.co/datasets/phylobio/BixBench-Verified-50
[3] Phylo. Evaluating AI Agents in Biology. February 2026. phylo.bio/blog/evaluating-ai-agents-in-biology
[4] K-Dense. K-Dense Web Scores 90.0% on BixBench-Verified-50. March 2026. k-dense.ai/blog/bixbench-verified-50
[5] Bio Protocol AI. BixBench Verified 50: Evaluating BIOS Biological Agents. March 2026. ai.bio.xyz/blog/bixbench-verified-50-evaluating-bios-biological-agents
[6] Bio Protocol AI. BIOS Benchmark Results. April 2026. bio-xyz.github.io/bio-benchmark