
Data Scientist, AI Evaluation
Checkbox Technology
Posted about 4 hours ago
Full-time | Hybrid, Sydney
The Company
Checkbox is a Series A, Sequoia-backed technology company on a mission to enable meaningful work for all.
We help in-house legal teams capture, triage, manage and resolve work through no-code automation, matter management and AI-powered legal intake. Our customers include leading global organisations such as SAP, Disney, Coca-Cola, BMW, Allianz and Stryker.
We’re now building toward an AI-native Legal Service Hub: a new way for legal teams to manage work with systems that can understand context, reason through next steps, orchestrate workflows and keep humans in control.
As part of this, we’re building a new AI platform and greenfield data platform layer to support the next generation of Checkbox products.
The Role
We’re looking for a Data Scientist to help define how we measure, evaluate and improve the quality of AI systems across Checkbox.
This is not a traditional dashboarding or reporting role. You’ll help us answer a more important question:
Is our AI doing the right thing, reliably, for real customer use cases?
You’ll design evaluation frameworks, build benchmark datasets, analyse AI behaviour in production and create feedback loops that help our AI systems improve over time.
As Checkbox moves toward agentic AI, you'll evaluate more than model outputs. You'll help us measure whether AI systems are using the right context, selecting the right tools, following the right workflow, taking the right action and knowing when to involve a human.
What You’ll Do
Define what “good” looks like for AI-powered legal workflows, including intake classification, triage, matter routing, summarisation, extraction and workflow execution.
Design evaluation frameworks that measure accuracy, consistency, grounding, usefulness, safety, task completion, tool usage and reliability.
Build and maintain benchmark datasets, including labelled examples, edge cases, customer scenarios, ambiguous requests and human-reviewed test sets.
Analyse production AI behaviour to identify failure modes, workflow breakdowns, weak retrieval, poor confidence signals and product improvement opportunities.
Develop scoring methods for objective and subjective AI outputs using human review, automated evaluation and statistical analysis.
Partner with Product and Engineering on model, prompt, retrieval, workflow and orchestration improvements.
Help design feedback loops that turn user corrections, human review and workflow outcomes into measurable AI improvements.
Contribute to our greenfield data platform by defining the event, feedback, evaluation and observability data needed to measure AI performance over time.
What We’re Looking For
Experience in data science, applied machine learning, product analytics, AI evaluation or a similar role.
Strong Python and SQL skills, with the ability to work across structured and unstructured data.
Strong statistical thinking, including experimentation, performance comparison and separating signal from noise.
Understanding of modern AI systems, including LLMs, prompts, embeddings, RAG, evaluation methods and AI product quality challenges.
Experience defining metrics and evaluation criteria for ambiguous or complex product behaviours.
Strong product sense and a genuine interest in whether AI systems are useful, trusted and valuable to end users.
Ability to work closely with engineers, product managers, designers and customer-facing teams to turn AI quality problems into clear action.
Clear communication skills, with the ability to explain technical findings in a practical and accessible way.
Nice to Have
Experience with GenAI, agentic AI systems, AI copilots, RAG systems, tool-using LLM applications or AI workflow automation.
Experience with AI evaluation, LLM observability or AI quality tooling such as LangSmith, Weights & Biases, Arize, Phoenix, Humanloop or similar.
Experience designing human-in-the-loop review, annotation or feedback workflows.
Experience with legal tech, enterprise SaaS, workflow automation or document-heavy products.
Experience contributing to data platform design, event tracking, data modelling or analytics infrastructure.
Why Join Us
You’ll join at a rare moment: early enough to shape the foundations, but with real customers, real usage and real enterprise problems to solve.
You’ll help build the evaluation and data layer behind AI systems that need to operate in complex, high-trust environments where quality and reliability genuinely matter.
You’ll play a key role in helping Checkbox move toward its broader vision: becoming the Legal Service Hub for modern organisations.
Benefits
Competitive salary and equity.
Hybrid working from our Sydney office.
High ownership and direct impact on a strategic AI product area.
Opportunity to help shape a greenfield data platform and AI evaluation function from the ground up.
A collaborative, ambitious team building category-defining software for legal teams.