About Atla

Atla is committed to engineering safe, beneficial AI systems that will have a massive positive impact on the future of humanity. We are a London-based startup building the most capable AI evaluation models. Become part of our growing world-class team, backed by Y Combinator, Creandum, and the founders of Reddit, Cruise, Rappi, Instacart and more.

Role

As Atla’s research intern, you will collaborate with our researchers and obtain deep experience in a growing AI startup. As part of your role, you will:

Conduct cutting-edge machine learning research, contributing to research initiatives that have practical applications in our product development.
Disseminate your research results through the production of publications, datasets, and code.

Our ongoing research projects include, but are not limited to:

Iterative Self-Improvement

This project applies iterative self-improvement to enhance our general-purpose evaluator. This involves using the model’s outputs to refine its training data iteratively, rather than relying on fixed datasets. Prior work [1, 2, 3, 4] demonstrates the effectiveness of this approach, and we aim to extend it to evaluation systems.

We will leverage our internal training data, infrastructure, and benchmarks to iteratively refine the evaluator. You will collaborate with engineers to build infrastructure for iteratively generating better and more informative data. Techniques from our research on multi-stage synthetic data generation will be incorporated to improve data quality.

Key challenges include addressing bias amplification, semantic drift, and maintaining diversity of data to ensure model stability and alignment. This project aims to advance safe iterative training methodologies and deliver a more capable evaluator, with findings targeted for a top-tier conference. The scope can be tailored to your skills and interests.

[1] Wang, Y., et al. (2023). SELF-INSTRUCT: Aligning Language Models with Self-Generated Instructions.
[2] Yuan, W., et al. (2024). Self-Rewarding Language Models.
[3] Wang, T., et al. (2024). Self-Taught Evaluators.
[4] Li, X., et al. (2024). MONTESSORI-INSTRUCT: Generate Influential Training Data Tailored for Student Learning.

Inference-Time Compute

This project explores inference-time compute scaling to enhance our general-purpose evaluator, particularly for complex tasks like coding, which benefit from longer reasoning chains. Recent research [1, 2] has shown the effectiveness of inference-time compute in improving performance on reasoning and mathematical tasks by leveraging more tokens during inference.

We will investigate methods to train models capable of utilising additional tokens effectively for reasoning. This involves experimenting with reinforcement learning (RL) approaches, such as Group Relative Policy Optimisation (GRPO), to encourage self-verification and reasoning strategies. You will work with engineers to develop the necessary training infrastructure.

Key challenges include addressing trade-offs between token efficiency and performance while mitigating common issues. The project aims to develop robust methods for inference-time compute scaling and contribute findings to a top-tier conference. The scope can be tailored to your skills and interests.

[1] Guo, D., et al. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.
[2] Snell, C., et al. (2024). Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters.

Agentic Evaluation

This project investigates how to evaluate agentic systems using an LLM-as-a-Judge framework. Agents introduce new challenges due to their ability to reason, plan, and interact with external tools [1, 2]. Evaluating their capabilities and safety requires new approaches, with potential directions including:

Agent-as-a-Judge: Using agentic systems to evaluate other agentic systems, reducing reliance on human judgment and enabling automated, scalable evaluation frameworks [3].
Task-driven and multi-step evaluation: Moving beyond single-action accuracy to assess long-horizon reasoning, adaptability, and decision-making in dynamic environments [4].

AI agents are becoming the next major AI paradigm, with 2025 set to be a pivotal year for their development. As models evolve from passive assistants to autonomous agents, rigorous evaluation is essential to ensure their reliability and safety [5, 6].

This project aims to develop a framework for evaluating agents, create benchmarks, and contribute findings to a top-tier conference. The scope can be tailored to your skills and interests.

[1] Yao, S., et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models.
[2] Deng, X., et al. (2023). MIND2WEB: Towards a Generalist Agent for the Web.
[3] Zhuge, M., et al. (2024). Agent-as-a-Judge: Evaluate Agents with Agents.
[4] Nathani, D., et al. (2025). MLGym: A New Framework and Benchmark for Advancing AI Research Agents.
[5] Altman, S. (2024). The Intelligence Age.
[6] Heikkilä, M., & Heaven, W. D. (2025). Anthropic’s Chief Scientist on 4 Ways Agents Will Be Even Better in 2025. MIT Technology Review.

Qualifications

Evidence of exceptional research engineering ability:

Currently pursuing a PhD in Machine Learning, NLP, Artificial Intelligence, or a related discipline. We will also consider exceptional non-PhD candidates.
Proven track record in empirical research, including designing and executing experiments, and effectively writing up and communicating findings.
Publications in top AI conferences.
Aptitude for distilling and applying ideas from complex research papers.

Nice to have

Previous internship experience at elite AI research labs (OpenAI, DeepMind, Meta, Anthropic, etc.).
Experience using large-scale distributed training strategies, building data annotation and evaluation pipelines, or implementing state-of-the-art ML models.
Interest in, and thoughtfulness about, the impacts of AI technology.

About you

You'll work by and thrive through our core principles:

Own the Outcome

Create real value: Every action should deliver tangible, meaningful value for the people who use what we build.
Drive to completion: Do the second 90%.
Do fewer things, better: Prioritize focus over breadth.

Back the Team

Collaborate for excellence: The whole is greater than the sum of its parts.
Seek truth: Let the best ideas win, no matter where they come from, and let go of ego.
Argue passionately, then commit fully: Debate fiercely, but once a decision is made, own it like it’s yours.

Drive the Mission

Advance AI safety: Every action should contribute towards the safe development of AI.
Go big or go home: “The people who are crazy enough to think they can change the world are the ones who do.”

Compensation

Highly competitive
