Already filled
Benchmark Testing and Analysis Lead
About the Role
We are looking for a technical researcher to own how we evaluate frontier models on the ARC-AGI benchmarks. This person will run new models end-to-end, mine the data exhaust from every run, and translate what we learn into reports and public communication that shape the conversation on where model capability is heading. This is a remote, full-time role.
What You'll Do
- Own our model benchmarking and testing process, and run new frontier models against ARC-AGI-1, ARC-AGI-2, and ARC-AGI-3 as they ship
- Build and own the ARC Prize Analysis Package - a repeatable report produced for every new frontier model, turning raw logs into insight on capability, failure modes, and gaps
- Own the official and community leaderboards end-to-end - from scoring pipeline to public page
- Serve as primary contact for new labs testing on ARC-AGI, and communicate findings externally via Twitter, newsletter, and policy and partner briefings
What We're Looking For
- Research background with hands-on model evaluation experience - you've run evals before and know how to read the results (model training experience not required)
- Deep understanding of how modern models work and fail, and comfort building your own tooling and analysis to answer the questions you care about
- Strong ownership instinct and clear technical communication
Example outputs this role would produce: a model score announcement and a model analysis blog post.