Skip to main content
Industry

Agent Platform Wars: How May 2026 Settled Three Open Questions About AI Software Development

Google launched Antigravity 2.0 and built an operating system on stage in twelve hours. OpenAI's reasoning model autonomously disproved an 80-year-old math conjecture that Erdős himself never cracked. Anthropic crossed into its first profitable quarter at a $900 billion valuation. Three weeks, three answers. Here's what they mean for anyone shipping software in 2026.

24 min readAndrey Sorokin
Three AI agent platforms competing on stage during a May 2026 product keynote, with multi-agent orchestration diagrams overhead

Agent Platform Wars: How May 2026 Settled Three Open Questions About AI Software Development

We wrote about agent teams in April. Six weeks ago it felt like the most important shift in a year. Then May 2026 happened, and three of the loudest unresolved questions in this industry got answered inside of a single week.

Question one: is Anthropic going to own multi-agent coding forever, or is Google actually going to fight for it? Answered May 19 at I/O, when Google relaunched Antigravity as an agent platform and built an operating system on stage in twelve hours.

Question two: can a reasoning model actually do original work, or is it always going to be a very good autocomplete? Answered May 20, when OpenAI's internal reasoning model autonomously disproved a conjecture Paul Erdős posed in 1946 and seven of the world's top mathematicians signed off on the proof.

Question three: is any of this actually a business, or is it the same hype cycle we lived through with crypto and metaverse? Answered the same day, when Anthropic told investors it expects $10.9 billion in Q2 revenue, its first operating profit, and a funding round at a $900 billion valuation.

Three answers in seven days. That is the kind of compression you usually only see in retrospect. We watched it live, and we changed our internal workflow because of it. This is what happened, what we made of it, and what we think it means if you are paying someone to write software for you right now.

The Week That Picked the Fights for the Rest of 2026

Here is the timeline, sourced from the vendors and the major analyst coverage. None of this is speculation.

  • May 19 - Google opened I/O 2026 with the launch of Gemini 3.5 Flash and a relaunch of Antigravity 2.0 as an agent-first development platform, complete with a new desktop application, an Antigravity CLI, an Antigravity SDK, and an agent harness with new primitives for subagents, hooks, and asynchronous task management. The keynote included a live demo where the platform built a functioning operating system from a blank prompt in roughly twelve hours of autonomous work. Pricing landed at a $100/month AI Ultra tier and $200/month Premium, with up to 20x the request limits of the standard Pro plan.
  • May 19 - Google also revealed Gemini 3.5 as a frontier model targeting agentic and coding workloads, with the company claiming it outperforms Gemini 3.1 Pro on almost every benchmark while running four times faster. The agent harness was co-developed using Antigravity itself, which is the kind of detail you put in your launch slides when you want the message to land.
  • May 20 - OpenAI published an internal reasoning model's autonomous disproof of an 80-year-old conjecture in discrete geometry first posed by Paul Erdős in 1946. The proof, roughly 125 pages of original work using infinite class field towers and Golod-Shafarevich theory, was verified by Timothy Gowers (Cambridge), Daniel Litt (Toronto), Will Sawin (Princeton), Jacob Tsimerman (Toronto), Sébastien Bubeck (OpenAI), Thomas Bloom, and Melanie Matchett Wood (Harvard). Sawin refined the polynomial improvement to δ = 0.014. Gowers, who is famously hard to impress on this topic, told Scientific American that "no previous AI-generated proof has come close" to publication standards for a top math journal.
  • May 20 - Anthropic told investors via channels that surfaced in CNBC and PYMNTS that Q2 2026 revenue is on track for $10.9 billion, more than double Q1's $4.8 billion, with an expected first-ever operating profit of around $559 million. The funding round, expected to close the week of May 22, would value Anthropic at $900 billion and push the annualized run rate above $50 billion by the end of June. That valuation, if it lands, leapfrogs OpenAI as the most valuable AI startup in the world.

Three vendors. Three answers. One week. And then the rest of us went back to work on a Monday.

Antigravity 2.0 Is Google Saying "We Are Coming For This"

For six months we have been telling clients that if they want a serious multi-agent development setup, the realistic choice was Claude Code with subagents or one of the smaller bespoke harnesses (Crew, AutoGPT, the various Cursor configurations). Google's first Antigravity release in late 2025 was interesting but felt like a coding-IDE play, not a platform. That is no longer true.

Antigravity 2.0 is now squarely a multi-agent orchestration platform. It has the same three things that made Claude Code work: a real harness with named primitives, support for long-running autonomous tasks, and an SDK so you can extend it instead of fighting it. The differences are where Google has structural advantages: native integration with Firebase, Android, AI Studio and Google Cloud, plus a $100 ultra tier that is competitive on price-per-agent-hour against anything else on the market.

Screenshot from Google's Antigravity 2.0 product page showing the multi-agent development platform with parallel agents working on code, design, and architecture tasks.

Antigravity 2.0's product page positions it as a hub for orchestrating multiple agents in parallel - one writing code, another producing assets, a third planning architecture. Source: antigravity.google/product/antigravity-2.

The twelve-hour OS build was not a stunt, even though it played like one on stage. The point of the demo was that Antigravity now does the same thing Anthropic's three-agent harness does (a planner, a builder, an evaluator), but with the kind of sub-agent and async-task primitives that make multi-hour autonomous work less of a coin flip. We tried it in our own evaluation environment within forty-eight hours of the launch. It is real. It is also not yet our daily driver, for reasons we will get into.

The Three-Way Comparison, Without the Marketing

Here is how the three platforms actually stack up after we have spent real time with each one. This is from inside our workflow, not from a vendor slide deck.

The honest read: Claude Code remains our daily driver because of the maturity of the harness, the quality of the code-reviewer and security-auditor subagents we have invested months in tuning, and because Opus 4.7 still leads on the coding benchmarks that map to our actual work. Antigravity is the most credible challenger that has shown up since Cursor first landed. We expect to be running real client work on it by Q3 if the trajectory holds. GPT-5.5 sits in our toolbelt for the work it is genuinely best at (tool use, terminal automation, omnimodal), but we do not use it as a primary coder.

The thing the comparison table does not capture: the second a serious competitor showed up, every team that was already on Claude Code got a much bigger lever. Vendor lock-in is a real cost. Knowing you could move your harness to Antigravity if the price-performance flipped is part of why we can plan twelve months ahead instead of three.

The Erdős Proof Is Why You Should Take "Autonomous Agent" Seriously Now

A lot of the agent conversation in 2025 was about whether the model was really reasoning or just doing very good pattern matching over a large enough corpus. That is a fair question. It is also, as of May 20, a harder question to answer with "just pattern matching."

OpenAI's announcement is, on the surface, a math story. Underneath, it is the most important capability story of the year. A general-purpose reasoning model, not one trained specifically for mathematics, connected an open problem in combinatorial geometry to a deep result in algebraic number theory that human mathematicians who knew both fields had never put together in eight decades of trying. The model did this autonomously. Seven mathematicians who do not work for OpenAI checked the proof and signed off on it. Will Sawin then went and refined the polynomial improvement to δ = 0.014.

If you have followed AI math claims for the last two years, you know how often these stories have collapsed under close reading. October 2025's GPT-5 Erdős announcement collapsed when independent mathematicians found that the "solved" problems already had answers in the literature. Bubeck has been honest about that history. The reason this one is different is that the proof is original, the methods are deep enough to be hard to fake, and the verifiers are people who do not get fooled.

No previous AI-generated proof has come close to meeting publication standards for a top mathematics journal.

Timothy GowersFields Medallist, University of Cambridge

We are not building math software, and the reader probably is not either. So why does this matter for a Calgary business deciding who to hire to build their next product?

Because the same model architecture that did this autonomously is the architecture that powers the agents writing our code. The Erdős result is a clean external proof that the reasoning has gotten good enough to find paths that are not in the training data, hold the structure of a long argument across hundreds of pages, and self-correct when an intermediate step does not check out. Every one of those capabilities matters in production software. The agent that can keep a 125-page proof coherent is the same agent that can refactor your fifty-file authentication module without losing the thread halfway through.

We are not saying the model is "thinking." We are saying that the gap between "this is glorified autocomplete" and "this is a system that can do original reasoning over long horizons" got materially narrower in May. People paying for code in mid-2026 should price that in.

The $900 Billion Question: Is Any Of This Sustainable?

The third answer is the boring one and the most consequential. For most of the last eighteen months, the argument against going all in on agentic AI was a financial one. The models were impressive. The costs were astronomical. Nobody was making money. Therefore the whole stack was a bubble that would deflate the second the next funding round looked harder.

That story got harder to tell on May 20.

Two things to read into these numbers if you are deciding whether to bet your roadmap on this stack. First, the revenue is enterprise, not consumer. The Q2 jump is driven by API and Claude Code adoption inside companies that are paying because the tool is actually saving them money on engineering, not because their CEO read an article. Second, the operating profit number is real even if you bracket out training costs (which Anthropic does) and stock comp (which it also does). It is not the bubble math people use to dress up a money-losing quarter. This is the model business, at scale, starting to clear its own variable costs.

Anthropic is not the only one. Google's Gemini revenue is up. OpenAI is on a similar trajectory, just with a different cost structure. The capital that has been deployed across this category is starting to come back, which is the thing that determines whether the platforms we are building on this year are still here in 2028.

For our purposes the implication is direct. We can sign multi-year client contracts now that assume the tooling and the vendors are still going to be around. We could not honestly do that in mid-2025.

What We Actually Changed In May

We are going to be specific, because the only thing more useful than vendor news is somebody telling you how they reorganized their work because of it. Here is what changed at Rocky Soft this month.

1. We added Antigravity 2.0 to our internal evaluation rotation

Every month we pick one internal tool, one non-client project, and rebuild it end-to-end in a new agent stack. May's build was a portfolio-traffic dashboard, originally shipped with Claude Code in March. We rebuilt it in Antigravity 2.0 using the new subagent and async-task primitives. The build was faster than we expected (about six hours including a full review pass), and the multi-agent parallelism (one agent on the dashboard, one on the data pipeline, one on the tests) handled the coordination better than we have ever seen out of a Google product. It is not replacing Claude Code for client work yet. It is now plausible that it will by autumn.

2. We started writing our specs for two platforms, not one

This is the practical effect of the platform-war shift. The spec we hand to an agent is now structured to be portable. Less Claude-specific syntax in our tasks.md files. More tool-neutral acceptance criteria. The result is that we can route the same feature spec at Claude Code and at Antigravity, compare the diffs, and pick the one we ship. We caught a subtle SSR-hydration bug last week because the two agents made different assumptions about a shared component, and the disagreement surfaced the bug before either of us did.

3. We are running longer autonomous sessions, with more discipline

The Erdős proof, more than anything else this month, gave us the confidence to extend our autonomous-run leash. We are now routinely letting an agent work for two to three hours unattended on well-specified tasks (refactors, test backfills, migration scripts), with the evaluator agent gating any output before a human ever sees it. We have a hard cap on the leash for anything touching auth, payments, or data migrations. The longer the leash gets, the harder we have to push on the spec quality and the evaluator harness, which is exactly what every responsible team in this category has been saying out loud since April.

4. We doubled the size of our evaluator agent prompts

This is unglamorous and easy to underrate. When the implementation agent gets stronger, the evaluator has to get stronger or the gap closes and you stop catching mistakes. We rewrote the code-reviewer and security-auditor subagent prompts this month, with explicit checks for the kinds of "looks-plausible-but-wrong" output that stronger models produce more often than weaker ones. The most useful change was forcing the evaluator to read three to five real similar files in the repo before rendering its verdict, instead of evaluating in isolation. Our reviewer-caught-bug rate is up materially since we made the change.

5. We told two clients to wait six weeks on a model decision

This is the most boring change and probably the most valuable. We had two early-stage product builds where we would have committed to a model and a harness in April. We held off. The May releases changed enough about the price-performance math that the right call on both projects is now different than it was a month ago. Telling a client to wait is not what they pay you to do. Telling them to wait because the right answer is six weeks away is exactly what they pay you to do.

The Hard Numbers From the Production Side

While the vendors were doing keynotes, the analyst side of the house was publishing the first audited enterprise productivity numbers from the agent rollouts that started in late 2025. The headline numbers from the 2026 enterprise data lined up with what we are seeing on our own projects, but the granularity is what matters.

  • Knowledge workers on production AI agents recover a median of 6.4 hours per week per seat, with senior engineers at the top of the band saving 10 to 12 hours.
  • Routine PR review costs $0.72 with an agent versus $48 of senior-engineer time for the same review, a 66x reduction. (Caveat: agents do not replace the review, they triage it. The 66x number is for the first pass, not the final sign-off.)
  • 80% of enterprise applications shipped in Q1 2026 embed at least one AI agent, up from 33% in 2024. If you are shipping software in mid-2026 without one, you are now the outlier.
  • 41% of agent deployments reach positive ROI in year one. 19% never pay back. The difference is not the model. It is "evaluation drift, governance gaps, and unmeasured rework." Which is exactly what we have been saying since the vibe-coding article in March.
  • Median engineering payback period is 9.3 months, with top-quartile shops getting there in five to six. Customer service is faster (4.1 months) because the work is more uniform. Legal is slower (14.8 months) for the same reason in reverse.

The number we keep coming back to internally is the 19% of programs that never pay back. None of those 19% failed because the model could not do the work. They failed because the team did not build the spec discipline, the evaluator gate, or the documentation infrastructure that the model needs in order to do the work reliably. The model is the easy part now. Everything around the model is the work.

What This Means If You Are Calgary And You Need Software Built

A practical translation, since this blog is read mostly by founders, ops people, and business owners in Calgary and across Canada who are evaluating who to trust with their next build.

  • Pick a partner who is platform-portable, not platform-loyal. May 2026 should be the last month where it was reasonable to be "the Claude shop" or "the Google shop." The serious teams now write specs that are portable across at least two agent platforms and use the disagreements between them as a quality check.
  • Ask about the evaluator agent. Not the implementer. The evaluator. The implementation has gotten cheap. The second-opinion layer is what determines whether what gets shipped is actually correct. If a vendor cannot describe theirs in detail, walk.
  • Discount any quoted timeline by 30 to 50%. Compared to a 2025 build, the same feature is genuinely faster to ship now. If your vendor is quoting in 2025 weeks, ask what their multi-agent stack looks like. The honest ones will give you a real number and a real reason it is lower than it used to be.
  • Take the cost savings as runway, not as a discount. The teams that win the second half of 2026 are not the ones that lay off their engineers and pocket the agent savings. They are the ones that take the same engineering org and ship two or three times the roadmap. If a vendor is selling the agent shift as a cost cut, they have missed the actual leverage.
  • The vendors are going to be here next year. This is the boring point that matters most. The financial answer in May is what makes the technical answer worth committing to. A $900B AI lab that just turned a profit is not going to vanish on you mid-project. Sign the contract.

The Bottom Line

For two years the agent conversation has been carried by three open questions: is anyone going to challenge Anthropic, can the models actually reason over long horizons, and is the money real. In May 2026 all three got concrete answers. Google showed up. The Erdős proof landed. Anthropic crossed into profit at a valuation nobody priced six months ago.

The implication for the rest of the year is straightforward. The agent platforms are now genuinely competitive, which means the tools you use are going to keep getting better at a faster cadence than any other software category in living memory. The reasoning is good enough that the bottleneck has fully moved up the stack to specifications, governance, and second-opinion discipline. And the businesses underneath the platforms are real enough that you can plan a project against them in a way you could not a year ago.

We started this article saying May 2026 settled three open questions. It also opened a new one, which is the question that matters most going forward: now that the platforms are credible and the reasoning is real and the money is there, what does your team do with it? That part is on us. All of us.


At Rocky Soft, we build production-grade web and mobile applications using Next.js, React, Node.js, NestJS, and React Native, with a multi-platform agent workflow that runs Claude Code, Antigravity 2.0, and Codex against the same specifications and ships whichever produces the cleaner diff. Based in Calgary, Alberta, we work with clients across Canada who need software that ships, works, and stays maintainable after we hand over the keys. Let's talk about your project.

Frequently Asked Questions

What is Google Antigravity 2.0 and how does it compare to Claude Code?

Antigravity 2.0 is Google's relaunched agent-first development platform, announced at Google I/O on May 19, 2026. It includes a desktop app, a CLI, an SDK, and an agent harness with new primitives for subagents, hooks, and asynchronous task management - the same architectural pattern Anthropic's Claude Code popularized. Antigravity's structural advantages are native Firebase, Android, and Google Cloud integration plus a $100/month AI Ultra pricing tier. Claude Code currently has the more mature harness and a deeper third-party ecosystem. For most production teams in mid-2026, Claude Code remains the daily driver while Antigravity 2.0 is the first credible challenger to seriously evaluate.

Did OpenAI really solve a math problem Erdős could not?

OpenAI announced on May 20, 2026 that a general-purpose reasoning model autonomously disproved a 1946 conjecture by Paul Erdős on unit-distance configurations in the plane. The model produced roughly 125 pages of original number-theory proof using infinite class field towers and Golod-Shafarevich theory. Seven external mathematicians, including Timothy Gowers (Cambridge), Daniel Litt and Jacob Tsimerman (Toronto), and Will Sawin (Princeton), verified the proof. Sawin then refined the polynomial improvement to δ = 0.014. Gowers stated that no previous AI-generated proof has come close to publication standards for a top math journal. The result is the strongest external evidence to date that frontier reasoning models can do original autonomous work over long horizons.

Is Anthropic really worth $900 billion?

According to investor briefings reported by CNBC, PYMNTS, and others on May 20, 2026, Anthropic is in active talks to raise more than $30 billion at a valuation of $900 billion, which would make it the world's most valuable AI startup ahead of OpenAI. Q2 2026 revenue is projected at $10.9 billion, more than double Q1 and more than all of 2025 combined, with an expected first ever operating profit of around $559 million. Annualized run rate is projected to surpass $50 billion by the end of June 2026. The profitability number excludes training costs and stock-based compensation, which is standard for the category. Whether the valuation holds will depend on whether the closing round actually clears the numbers.

Should we wait to start our project until the agent platforms stabilize?

No. The platforms are stabilizing faster than any project timeline. Waiting six months to start your build means waiting on the wrong axis. The right move is to pick a partner who writes platform-portable specifications, uses evaluator agents on every implementation, and can move work across Claude Code, Antigravity 2.0, and other platforms as the price-performance math shifts. That way the platform improvements between now and the end of your build show up as a quality and speed bonus, not as a stalled project.

What is the single biggest workflow change we should make right now?

If you are using AI coding tools at all, invest in your evaluator agent before you invest in anything else. The implementation models have gotten cheap. The second-opinion layer that grades the implementation is now the single highest-leverage piece of your stack. Write a dedicated code-reviewer prompt, give it access to read three to five similar files in your repo before it judges, and require its sign-off before any human review. The 19% of agent programs that never reach payback fail here. The teams that win do not.

How are these May 2026 changes relevant to a small business in Calgary?

Directly. The cost of building software has dropped enough in the last six weeks that features which would have been "we cannot afford that" in 2025 are now in the budget for a 12-week build. The risk has also shifted: it is no longer about whether the technology works, it is about whether your vendor knows how to use it responsibly. Ask any partner you are considering about their evaluator agent setup, their platform-portability strategy, and their position on multi-agent orchestration. The good answers are concrete and detailed. The bad answers will tell you a lot.

Sources

  • TechCrunch. (2026, May 19). "Google launches Antigravity 2.0 with an updated desktop app and CLI tool at IO 2026." Read article
  • Google Developers Blog. (2026, May 19). "I/O 2026 developer highlights: Antigravity, Gemini API, AI Studio." Read article
  • Google. (2026, May 19). "Gemini 3.5: frontier intelligence with action." Read announcement
  • TechCrunch. (2026, May 19). "With Gemini 3.5 Flash, Google bets its next AI wave on agents, not chatbots." Read article
  • TechCrunch. (2026, May 20). "OpenAI claims it solved an 80-year-old math problem - for real this time." Read article
  • Scientific American. (2026, May). "OpenAI announces AI's biggest math breakthrough yet." Read article
  • CNBC. (2026, May 20). "Anthropic set to hit $10.9 billion in revenue in Q2, source says." Read article
  • PYMNTS. (2026, May). "Anthropic Eyes $900 Billion Valuation as Quarterly Revenue Doubles." Read article
  • Digital Applied. (2026). "AI Agent Productivity Statistics 2026: 100+ ROI Data Points." Read report
  • Larridin. (2026). "Developer Productivity Benchmarks 2026: AI-Native Engineering Data." Read report

Related Articles

Have a project in mind? Let's build it.

From web apps to mobile solutions — we turn ideas into production software.