OpenAI Unveils EVMbench – A Benchmark for AI Agents Tackling Smart‑Contract Security
San Francisco, 19 Feb 2026 – OpenAI announced a new evaluation suite, EVMbench, designed to measure how effectively artificial‑intelligence agents can locate, remediate, and even exploit vulnerabilities in Ethereum‑compatible smart contracts. The benchmark, released in a joint paper with crypto‑investment firm Paradigm and security outfit OtterSec, pits a range of leading language‑model‑based agents against 120 curated flaws drawn from real‑world audits.
What EVMbench Measures
EVMbench simulates an “economically meaningful” environment: for each vulnerability the AI agent receives a monetary reward proportional to the amount it could theoretically extract if the flaw were exploited. The test therefore gauges not only technical detection skills but also the potential financial impact of the agents’ actions.
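The scoring idea can be pictured as a simple sum. Below is a minimal sketch, assuming a hypothetical `Vulnerability` record and `detect_award` function (these names and figures are illustrative and do not come from the EVMbench release): each correctly flagged flaw is worth the funds it theoretically puts at risk.

```python
from dataclasses import dataclass

@dataclass
class Vulnerability:
    """One curated flaw; field names are illustrative, not the EVMbench schema."""
    contract: str
    category: str                  # e.g. "re-entrancy", "integer overflow"
    exploitable_value_usd: float   # funds theoretically extractable via this flaw

def detect_award(vulns: list[Vulnerability], detected_ids: set[int]) -> float:
    """Sum the value at risk for every flaw the agent correctly flagged."""
    return sum(
        v.exploitable_value_usd
        for i, v in enumerate(vulns)
        if i in detected_ids
    )

# Toy run: an agent that finds the first two of three flaws.
suite = [
    Vulnerability("Vault", "re-entrancy", 120_000.0),
    Vulnerability("Token", "integer overflow", 45_000.0),
    Vulnerability("Bridge", "access control", 300_000.0),
]
print(detect_award(suite, detected_ids={0, 1}))  # -> 165000.0
```

Under this kind of weighting, missing one high-value bug can outweigh finding several minor ones, which is exactly the "economically meaningful" emphasis the paper describes.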
The vulnerabilities span typical categories such as re‑entrancy, integer overflow, and access‑control errors, and were sourced from 40 publicly disclosed audit reports and open‑source competition entries.
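To make one of those categories concrete, here is a minimal re-entrancy sketch, written in Python as a stand-in for Solidity; `VulnerableVault`, `attack`, and all amounts are invented for illustration and are not drawn from the benchmark. The bug is the classic ordering mistake: the vault pays out before zeroing the caller's balance, so a malicious payment callback can re-enter `withdraw` and drain other depositors' funds.

```python
class VulnerableVault:
    """Toy model of a contract that sends funds *before* updating balances."""

    def __init__(self) -> None:
        self.balances: dict[str, int] = {}
        self.total_funds = 0

    def deposit(self, user: str, amount: int) -> None:
        self.balances[user] = self.balances.get(user, 0) + amount
        self.total_funds += amount

    def withdraw(self, user: str, receive_callback) -> None:
        amount = self.balances.get(user, 0)
        if amount == 0:
            return
        # BUG: the external call happens before the balance is zeroed,
        # so the callback can re-enter withdraw() and be paid again.
        self.total_funds -= amount
        receive_callback(amount)
        self.balances[user] = 0


def attack(vault: VulnerableVault, attacker: str = "mallory") -> int:
    stolen = 0

    def on_receive(amount: int) -> None:
        nonlocal stolen
        stolen += amount
        if vault.total_funds >= amount:   # keep re-entering while funds remain
            vault.withdraw(attacker, on_receive)

    vault.withdraw(attacker, on_receive)
    return stolen


vault = VulnerableVault()
vault.deposit("alice", 90)
vault.deposit("mallory", 10)
print(attack(vault))  # prints 100: far more than mallory's 10-unit deposit
```

The standard remedy is the checks-effects-interactions pattern: update internal balances before making any external call, or guard the function with a re-entrancy lock.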
Head‑to‑Head Results
| Rank | AI Agent | Average “Detect Award”* |
|---|---|---|
| 1 | Claude Opus 4.6 (Anthropic) | $37,824 |
| 2 | OC‑GPT‑5.2 (OpenAI) | $31,623 |
| 3 | Gemini 3 Pro (Google) | $25,112 |
*The “detect award” reflects the hypothetical earnings an agent could claim by successfully exploiting a given flaw.
Anthropic’s Claude Opus 4.6 emerged as the most lucrative exploiter, topping the leaderboard by a margin of roughly $6,200 over OpenAI’s own OC‑GPT‑5.2. Google’s Gemini 3 Pro placed third, confirming that the leading commercial models are already competitive in this niche.
Why This Matters
Smart contracts now lock billions of dollars in decentralized finance (DeFi) protocols, stablecoins, and tokenized assets. In 2025, cyber‑criminals siphoned an estimated $3.4 billion from crypto wallets—a modest rise from the previous year but a stark reminder of the sector’s exposure to code‑level weaknesses.
OpenAI’s launch of EVMbench underscores a growing consensus that AI agents could become double‑edged swords: powerful allies for auditors and defenders, yet potentially formidable tools for attackers. As one of the paper’s authors remarked, “Smart contracts secure billions of dollars in assets, and AI agents are likely to be transformative for both attackers and defenders.”
The Bigger Picture: AI‑Powered Stablecoin Payments
The benchmark arrives amid speculation that AI agents will soon handle routine financial transactions on behalf of users. Circle CEO Jeremy Allaire has forecast that billions of autonomous agents could be transacting with stablecoins within the next five years. Likewise, former Binance chief Changpeng Zhao recently suggested that cryptocurrency may become the “native currency for AI agents.”
If agents can reliably audit and remediate contract vulnerabilities at scale, their deployment in payment‑routing, self‑driving wallets, and automated market‑making could accelerate, mitigating the very risks the benchmark highlights.
Industry Reactions
- Haseeb Qureshi, managing partner at venture firm Dragonfly, noted on X that smart contracts' original promise of replacing complex legal agreements has been hampered by human-centered design. He sees AI-mediated wallets as the missing component that could finally bring "GPS-like" precision to crypto transactions.
- Security firms are watching the results closely. OtterSec’s co‑founder emphasized that benchmarks like EVMbench provide a yardstick for measuring progress in automated security tooling, a field that has traditionally relied on manual audits.
Key Takeaways
- EVMbench offers the first systematic, financially weighted assessment of AI agents’ ability to spot and exploit smart‑contract bugs.
- Claude Opus 4.6 currently leads the field, but OpenAI’s own OC‑GPT‑5.2 and Google’s Gemini 3 Pro are not far behind, indicating rapid convergence among top models.
- The benchmark highlights a dual‑use risk: the same AI capabilities that can accelerate defensive audits could also be weaponized by malicious actors.
- Securing AI‑driven finance may depend on integrating such agents into the auditing pipeline, reducing the attack surface before autonomous agents begin handling high‑value stablecoin payments.
- Stakeholders—from developers to regulators—should monitor AI‑derived security metrics as part of broader risk‑management strategies for DeFi and beyond.
Looking Forward
OpenAI hopes that EVMbench will become a living repository, updated with new vulnerabilities and reflecting the evolving threat landscape. By quantifying AI performance in an economically relevant context, the benchmark may guide both the development of safer smart‑contract code and the responsible deployment of AI agents across the crypto ecosystem.
The full EVMbench paper and dataset are available on OpenAI’s website.
Source: https://cointelegraph.com/news/openai-benchmark-ai-agents-detect-smart-contract-flaws
















