Benchmark Testing and Analysis Lead

Job summary

Work model: Fully remote (United States only)
Posted: 1 week ago
Job description

About the Role

We're hiring a technical researcher to own how we evaluate frontier models on the ARC-AGI benchmarks. This person will run new models end-to-end, mine the data exhaust from every run, and translate what we learn into reports and public communication that shape the conversation on where model capability is heading. This is a remote, full-time role.

What You'll Do

  • Own our model benchmarking and testing process, and run new frontier models against ARC-AGI-1, ARC-AGI-2, and ARC-AGI-3 as they ship
  • Build and own the ARC Prize Analysis Package: a repeatable report produced for every new frontier model, turning raw logs into insight on capability, failure modes, and gaps
  • Own the official and community leaderboards end-to-end, from scoring pipeline to public page
  • Serve as the primary contact for new labs testing on ARC-AGI, and communicate findings externally via Twitter, newsletter, and policy and partner briefings

What We're Looking For

  • Research background with hands-on model evaluation experience: you've run evals before and know how to read the results (model training experience not required)
  • Deep understanding of how modern models work and fail, and comfort building your own tooling and analysis to answer the questions you care about
  • Strong ownership instincts and clear technical communication

Example outputs this role would produce: a model score announcement and a model analysis blog post.