AI researchers and labs have made significant advances in evaluating AI models for everything from safety and compliance to sycophantism and cooperativeness. However, companies and developers appear to be facing new and specific needs to ensure that their AI systems work as intended for their specific products and services.
To simplify its testing process, Microsoft on Tuesday completely lifted ASSERT, which stands for Adaptive Spec-driven Scoring for Assessment and Regression Testing.
According to Microsoft, this open-source framework uses AI to transform high-level natural language descriptions of goals, policies, or intended behavior into thorough, explorable, scored tests, making it easier to evaluate application-specific AI behavior.
ASSERT takes a plain description of an AI model’s expected behavior and policies, transforms them into a structured set of acceptable and unacceptable behaviors, generates problem scenarios and test cases, runs them against the target system, and scores the results. It can also record the path taken by the AI system, including intermediate actions and tool calls, allowing developers to inspect where failures occur.
Developers can also provide system context, tools, and constraints if they wish to further customize what is evaluated.
For example, developers can specify that the document investigation AI agent should not send emails to people outside the company, limit sensitive information to executive level, and provide concise summaries with up-front context in mind. ASSERT uses these rules to generate test cases that continuously check whether the system follows those rules.

According to Microsoft, this framework fills a gap that broader, general assessments cannot, when AI models are intended to behave in ways that are shaped by the context, policies, and tools of an application or product.
“One of the things we learned is that evaluation is absolutely critical to making good decisions,” said Sarah Bird, chief product officer for responsible AI at Microsoft. “Because if you don’t understand how an AI system works, it’s very difficult to know whether it meets your organization’s standards…What we found is that if you really want to have a system you can trust, you need to evaluate many more application-specific aspects.”
Bird said ASSERT can be used during system construction, after deployment, and even for continuous monitoring.
This release comes amid gradual but broader changes in the AI industry. As models improve in functionality, researchers are focusing on repeatable tests and regression checks, and evaluation groups such as Stanford’s HELM, MLCommons’ AILuminate, and METR are deploying benchmarks that measure how models perform under different conditions.
If you buy through links in our articles, we may earn a small commission. This does not affect editorial independence.
