Benchmark Testing and Analysis Lead

Job summary

Work model: Fully remote (United States only)
Posted: 1 week ago
Job description

About the Role

We're hiring a technical researcher to own how we evaluate frontier models on the ARC-AGI benchmarks. This person will run new models end-to-end, mine the data exhaust from every run, and translate what we learn into reports and public communication that shape the conversation on where model capability is heading. This is a remote, full-time role.

What You'll Do

  • Own our model benchmarking and testing process, and run new frontier models against ARC-AGI-1, ARC-AGI-2, and ARC-AGI-3 as they ship
  • Build and own the ARC Prize Analysis Package: a repeatable report produced for every new frontier model, turning raw logs into insight on capability, failure modes, and gaps
  • Own the official and community leaderboards end-to-end, from scoring pipeline to public page
  • Serve as the primary contact for new labs testing on ARC-AGI, and communicate findings externally via Twitter, newsletter, and policy and partner briefings

What We're Looking For

  • Research background with hands-on model evaluation experience: you've run evals before and know how to read the results (model training experience not required)
  • Deep understanding of how modern models work and fail, and comfort building your own tooling and analysis to answer the questions you care about
  • Strong ownership instincts and clear technical communication

Example outputs this role would produce: a model score announcement and a model analysis blog post.