AI agents are becoming increasingly sophisticated. They have evolved from answering questions to autonomously performing complex multi-step tasks.
But before these agents can be trusted to book travel or perform financial analysis on behalf of users, model providers and the startups building such agents want to ensure that their agents perform reliably across a wide range of scenarios.
AI labs often use benchmarks to show off the capabilities of their models, but high scores, even for agent-oriented benchmarks, don’t actually prove that the AI can successfully perform a variety of complex jobs in the real world.
Patronus AI, a startup founded in 2023 by former Meta AI researchers Anand Kannappan and Rebecca Qian, helps model makers and companies fine-tune their models to do just that by building simulated digital environments to evaluate agent performance.
The San Francisco-based startup appears to be solving an important problem. Glenn Solomon, managing director at Notable Capital, says demand for the company’s simulated environments is nearly insatiable, with customers including virtually every frontier AI lab and many emerging startups.
Patronus’ revenue has increased 15x over the past year, driving significant investor interest. The company announced Thursday a $50 million Series B round led by Greenfield Partners with participation from Notable Capital, Lightspeed, Datadog, and Samsung. This round brings the company’s total funding to $70 million.
Patronus uses what it calls a “digital world model” to create replicas of websites and internal systems. In these environments, agents are trained using reinforcement learning, which repeatedly rewards successful task completion and penalizes errors, and are then stress tested.
AI labs see great value in these digital simulations because they let agents try out varied and sometimes unpredictable scenarios. The company compares its approach to how Waymo first trained its self-driving cars: by building synthetic worlds and testing the vehicles against rare hazards, such as bad weather or children chasing balls.
The difference with AI agents is that they tend to take shortcuts and fail to complete tasks correctly. “Patronus is very good at detecting hacks and making sure models are held accountable,” Solomon said.
Kannappan said Patronus currently offers simulated digital worlds for software engineering and finance, but these are just the beginning.
“Today we are very focused on verifiable problems, problems that we can quickly see and verify, but there are many more areas that cannot be verified or are very difficult to verify,” he said.
Just because these problems are verifiable doesn’t mean they’re simple. “We want to be able to actually build an environment where we can run agents that can run for 10 hours, 10 days, or 10 weeks,” Kannappan said.
As for competition, Patronus believes it primarily competes with the in-house teams that AI labs have built to evaluate agent behavior. While human-data companies like Mercor and Surge rely on people to supply the reinforcement learning feedback used to build models, Patronus operates differently by evaluating how agents behave without human involvement.
