Ploid AI on BixBench: benchmarking our agent against the field

6 min read · Ploid Founding Team

AI Agents · Benchmarks · Bioinformatics

AI agents are playing a growing role in biological research, and proper evaluation needs to keep up. BixBench [1] tackles this by pairing real bioinformatics datasets with research questions that test whether an agent can actually analyze data and produce correct answers across a range of domains.

We ran Ploid AI on both the full BixBench benchmark and the curated BixBench-Verified-50 subset [2]. This post presents our results in detail.

Why BixBench

BixBench has become a common benchmark for bioinformatics agents, which makes it a natural way to compare our agent against the field. Agents receive real biological datasets and short-answer research questions spanning domains from genomics to proteomics. The benchmark tests biological knowledge together with the ability to load data, choose appropriate methods, execute analysis pipelines, and interpret results.

However, not all failures on BixBench reflect agent limitations. Some questions are ambiguous, underspecified, or have incorrect ground truth. To address this, Phylo curated BixBench-Verified-50 [2], a subset of 50 questions reviewed by domain experts for correctness and clarity. We evaluated Ploid AI on both versions.

Overall results

| Agent | BixBench | BixBench-Verified-50 |
| --- | --- | --- |
| Ploid AI | 71.2% | 92.0% |
| K-Dense Web** | not reported | 90.0% |
| BIOS*** | 64.4% | 90.0% |
| Biomni Lab (2026-02-03)* | 52.2% | 88.7% |
| Edison Analysis* | 42.4% | 78.0% |
| Claude Code (Opus 4.6)* | 39.5% | 65.3% |
| OpenAI Agents SDK (GPT-5.2)* | 38.5% | 61.3% |

*Results for Claude Code, OpenAI Agents SDK, Edison Analysis, and Biomni Lab sourced from [3].

**Results for K-Dense Web sourced from [4]. K-Dense Web does not report a score on the full BixBench benchmark; only BixBench-Verified-50 results are available.

***Results for BIOS sourced from [5] (BixBench-Verified-50) and [6] (full BixBench).

Agent accuracy comparison across BixBench and BixBench-Verified-50

Ploid AI achieves 71.2% on the full BixBench and 92.0% on the verified subset, leading both general-purpose and domain-specific agents on both benchmarks. K-Dense Web [4] and BIOS [5] also report strong results on the verified subset (90.0% each). BIOS additionally reports 64.4% on the full BixBench benchmark [6], while K-Dense Web does not publish a score on the full benchmark in recent publications.
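The per-agent gain from the full benchmark to the verified subset can be computed directly from the table above; a minimal sketch in Python (scores transcribed from the table; K-Dense Web is omitted because its full-benchmark score is not reported):

```python
# Full-benchmark and Verified-50 accuracies (%), transcribed from the table above.
scores = {
    "Ploid AI": (71.2, 92.0),
    "BIOS": (64.4, 90.0),
    "Biomni Lab": (52.2, 88.7),
    "Edison Analysis": (42.4, 78.0),
    "Claude Code (Opus 4.6)": (39.5, 65.3),
    "OpenAI Agents SDK (GPT-5.2)": (38.5, 61.3),
}

# Percentage-point gain when moving from the full benchmark to the verified subset.
gains = {agent: round(verified - full, 1) for agent, (full, verified) in scores.items()}

for agent, gain in sorted(gains.items(), key=lambda kv: kv[1]):
    print(f"{agent}: +{gain} pts")
```

Every agent gains on the verified subset; Ploid AI's gain (+20.8 pts) is the smallest of the six, and Biomni Lab's (+36.5 pts) is the largest.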

BixBench-Verified-50: results by subject

Total accuracy: 92.0% across 50 verified tasks.

BixBench-Verified-50 accuracy by subject area

| Subject | Tasks | Accuracy | Mean Response Time (s) | Mean Cost ($) |
| --- | --- | --- | --- | --- |
| Antimicrobial Resistance | 2 | 100.0% | 251.35 | 0.3059 |
| Differential Expression Analysis | 14 | 92.9% | 403.94 | 0.4040 |
| Epigenomics | 3 | 100.0% | 35.68 | 0.1121 |
| Functional Genomics | 4 | 50.0% | 70.49 | 0.1894 |
| Genomic Variant Analysis | 4 | 100.0% | 118.45 | 0.3781 |
| Genomics | 20 | 95.0% | 254.36 | 0.4670 |
| Imaging | 4 | 100.0% | 72.68 | 0.1950 |
| Machine Learning and AI | 1 | 100.0% | 37.83 | 0.1029 |
| Other | 2 | 100.0% | 35.86 | 0.1097 |
| Phylogenetics | 1 | 100.0% | 417.23 | 0.3094 |
| Phylogenetics and Evolutionary Analysis | 13 | 92.3% | 284.12 | 0.5312 |
| Proteomics | 2 | 100.0% | 31.57 | 0.1085 |
| RNA-seq | 16 | 93.8% | 353.75 | 0.3723 |
| SNP Analysis | 2 | 100.0% | 251.35 | 0.3059 |
| Sequence Analysis | 9 | 88.9% | 147.58 | 0.2354 |
| Transcriptomics | 18 | 83.3% | 328.16 | 0.3665 |
| Whole Genome Sequencing (WGS) | 14 | 92.9% | 324.90 | 0.5291 |

Ploid AI achieves over 90% accuracy in 14 out of 17 subject areas. The most challenging categories are Functional Genomics (50.0%) and Transcriptomics (83.3%), which often involve multi-step analytical workflows where the benchmark expects specific methodological choices among several valid alternatives.

BixBench full: results by subject

Total accuracy: 71.2% across 205 tasks.

BixBench Full accuracy by subject area

| Subject | Tasks | Accuracy | Mean Response Time (s) | Mean Cost ($) |
| --- | --- | --- | --- | --- |
| Antimicrobial Resistance | 6 | 100.0% | 550.64 | 0.6839 |
| Differential Expression Analysis | 67 | 61.2% | 501.97 | 0.5689 |
| Epigenomics | 12 | 75.0% | 84.83 | 0.2189 |
| Functional Genomics | 10 | 50.0% | 460.87 | 0.4948 |
| Genomic Variant Analysis | 16 | 50.0% | 157.68 | 0.4742 |
| Genomics | 74 | 77.0% | 272.18 | 0.6059 |
| Imaging | 21 | 90.5% | 44.74 | 0.1352 |
| Integrative Omics | 2 | 100.0% | 72.27 | 0.2456 |
| Machine Learning and AI | 5 | 100.0% | 415.86 | 0.4325 |
| Network Biology | 4 | 25.0% | 837.72 | 0.7370 |
| Other | 25 | 84.0% | 153.51 | 0.4250 |
| Phylogenetics | 4 | 50.0% | 206.86 | 0.2485 |
| Phylogenetics and Evolutionary Analysis | 47 | 85.1% | 286.78 | 0.6877 |
| Proteomics | 4 | 100.0% | 31.52 | 0.1094 |
| RNA-seq | 69 | 60.9% | 500.10 | 0.5747 |
| SNP Analysis | 6 | 100.0% | 550.64 | 0.6839 |
| Sequence Analysis | 30 | 56.7% | 418.52 | 0.4541 |
| Single-Cell Analysis | 2 | 100.0% | 511.43 | 0.1698 |
| Transcriptomics | 69 | 56.5% | 507.49 | 0.5929 |
| Whole Genome Sequencing (WGS) | 48 | 83.3% | 358.65 | 0.7413 |

On the full benchmark, performance remains strong across most categories. The areas with lower scores (Differential Expression Analysis, 61.2%; RNA-seq, 60.9%; Sequence Analysis, 56.7%; Transcriptomics, 56.5%) correspond to domains where BixBench contains a higher proportion of ambiguous or underspecified questions, as documented in the BixBench-Verified-50 curation process [2]. The jump from full to verified accuracy in these categories suggests that a significant fraction of the "failures" stem from benchmark quality issues rather than agent limitations.
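The size of that jump can be read off the two subject tables; a small sketch in Python for the four categories named above (accuracies transcribed from the tables, full benchmark first, verified subset second):

```python
# Ploid AI accuracy (%) on the full benchmark vs. the Verified-50 subset,
# transcribed from the two subject tables above.
subject_scores = {
    "Differential Expression Analysis": (61.2, 92.9),
    "RNA-seq": (60.9, 93.8),
    "Sequence Analysis": (56.7, 88.9),
    "Transcriptomics": (56.5, 83.3),
}

# Percentage-point improvement per subject on the verified subset.
deltas = {s: round(verified - full, 1) for s, (full, verified) in subject_scores.items()}

for subject, delta in deltas.items():
    print(f"{subject}: +{delta} pts")
```

Each of these categories improves by more than 25 percentage points once the ambiguous questions are filtered out.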

What the gap between full and verified tells us

The difference between BixBench (71.2%) and BixBench-Verified-50 (92.0%) is informative. Across all agents in the comparison, scores increase substantially on the verified subset. This pattern confirms what Phylo identified [3]: a meaningful portion of failures on the original benchmark come from ambiguous questions, underspecified context, or incorrect ground truth, not from genuine agent capability gaps.

For Ploid AI, the gap between full and verified scores is 20.8 percentage points, the smallest in the comparison: the other agents gain between 22.8 and 36.5 points on the verified subset. In other words, our agent holds up comparatively well even on the ambiguous and underspecified questions that the curation process removed.

References

[1] BixBench, FutureHouse. A benchmark for bioinformatics agents. huggingface.co/datasets/futurehouse/BixBench

[2] BixBench-Verified-50, Phylo. A curated subset of 50 expert-verified BixBench questions. huggingface.co/datasets/phylobio/BixBench-Verified-50

[3] Phylo. Evaluating AI Agents in Biology. February 2026. phylo.bio/blog/evaluating-ai-agents-in-biology

[4] K-Dense. K-Dense Web Scores 90.0% on BixBench-Verified-50. March 2026. k-dense.ai/blog/bixbench-verified-50

[5] Bio Protocol AI. BixBench Verified 50: Evaluating BIOS Biological Agents. March 2026. ai.bio.xyz/blog/bixbench-verified-50-evaluating-bios-biological-agents

[6] Bio Protocol AI. BIOS Benchmark Results. April 2026. bio-xyz.github.io/bio-benchmark