Since 2024, Anthropic’s Performance Optimization team has given job candidates a take-home test to assess their skills. But as AI coding tools have improved, the test has had to change significantly to stay ahead of AI-assisted cheating.
Team lead Tristan Hume explained the history of the challenge in a blog post Wednesday. “Each time a new Claude model appeared, the tests had to be redesigned,” Hume writes. “Given the same time limit, Claude Opus 4 outperformed most human applicants. The test could still distinguish the strongest candidates, but then Claude Opus 4.5 matched even those applicants.”
This creates a serious problem for evaluating candidates. Without in-person proctoring, there is no way to tell whether someone used AI on the test, and a candidate who cheats will quickly rise to the top of the pool. “Under the constraints of the take-home test,” Hume writes, “there was no longer any way to distinguish between the accomplishments of the best candidates and the most competent models.”
AI cheating is already causing havoc in schools and universities around the world, so there is some irony in an AI lab having to grapple with it too. Then again, Anthropic is uniquely equipped to address the problem.
Ultimately, Hume designed a new test that relies less on hardware-optimization knowledge and is novel enough to stump current AI tools. He also shared the original test in the post, inviting readers to see whether they can come up with a better solution.
“If you can beat Opus 4.5, we’d love to hear from you,” the post reads.
