Benchmark Testing and Analysis Lead
Job summary
- Location: Remote (US)
- Work model: Fully remote
- Eligibility: United States only
Job description
About the Role
We're looking for a technical researcher to own how we evaluate frontier models on the ARC-AGI benchmarks. This person will run new models end-to-end, mine the data exhaust from every run, and translate what we learn into reports and public communication that shape the conversation on where model capability is heading. This is a remote, full-time role.
What You'll Do
- Own our model benchmarking and testing process, and run new frontier models against ARC-AGI-1, ARC-AGI-2, and ARC-AGI-3 as they ship
- Build and own the ARC Prize Analysis Package - a repeatable report produced for every new frontier model, turning raw logs into insight on capability, failure modes, and gaps
- Own the official and community leaderboards end-to-end - from scoring pipeline to public page
- Serve as primary contact for new labs testing on ARC-AGI, and communicate findings externally via Twitter, newsletter, and policy and partner briefings
What We're Looking For
- Research background with hands-on model evaluation experience - you've run evals before and know how to read the results (model training experience not required)
- Deep understanding of how modern models work and fail, and comfort building your own tooling and analysis to answer the questions you care about
- Strong ownership instincts and clear technical communication
Example outputs from this role: a model score announcement and a model analysis blog post.