GROWTH PATH

This is an individual contributor role with strong ownership expectations. High performers may be considered for workstream lead or functional lead responsibilities after approximately 12 months, based on demonstrated ownership, delivery, technical judgment, mentoring, cross-functional influence, and ability to reduce dependency on the Director of ML.

ABOUT THE ROLE

We are looking for an ML Evaluation Engineer to own model quality, regression testing, release validation, and production impact analysis for clinical AI systems. This role sits between applied ML, clinical data, MLOps, and production operations.

Your job is to ensure that every model or workflow release is measurable and stable, and that it does not degrade important clinical behavior. You will maintain evaluation datasets, create hidden test sets, run regression checks, analyze production issues, and produce release-readiness reports.

WHAT YOU WILL DO

Build and maintain evaluation frameworks for clinical NLP, LLM, RAG, information extraction, and structured abstraction systems.
Create and manage hidden test datasets that are not directly visible to model developers, reducing overfitting risk.
Define release metrics, regression thresholds, slice-based evaluation, failure-mode tracking, and release/blocker criteria.
Compare model versions and identify performance degradation across clinical segments, document types, clients, data sources, labels, and edge cases.
Work with Clinical AI Data Specialists to design gold sets, hidden test sets, adjudication workflows, and label quality checks.
Work with Research Engineers to understand model changes, expected behavior, and evaluation risks without compromising test-set independence.
Work with MLOps/Data Engineering to monitor production behavior, triage bugs, analyze incident impact, and prioritize fixes.
Create release-readiness reports before production deployment.
Build dashboards, scripts, and automated checks for evaluation, monitoring, regression testing, and model comparison.
Prioritize model bugs based on clinical severity, user impact, frequency, regression risk, and operational urgency.

WHAT WE EXPECT

3–6+ years of experience in ML engineering, data science, model evaluation, ML QA, applied NLP evaluation, or data-heavy quality engineering.
Strong Python and data analysis skills.
Strong understanding of precision/recall/F1, calibration, confidence thresholds, dataset splits, leakage, overfitting, statistical testing, and error analysis.
Experience building evaluation pipelines, benchmark suites, test harnesses, dashboards, or regression frameworks.
Ability to work with imperfect labels, annotation disagreement, clinical ambiguity, and hidden evaluation sets.
Strong independence and judgment; ability to challenge releases when evidence is weak.
Clear written communication for release reports, incident analysis, and quality decisions.

NICE TO HAVE

Experience with LLM evaluation, RAG evaluation, extraction evaluation, clinical NLP, or healthcare ML.
Experience with model monitoring, production incident analysis, data drift, or observability.
Experience with MLflow, Weights & Biases, Evidently, Great Expectations, DeepEval, Ragas, pytest, Airflow, Prefect, or similar tools.
Clinical or biomedical NLP exposure.

SUCCESS IN 6 MONTHS

Establishes a repeatable release validation process.
Maintains hidden evaluation datasets and prevents overfitting to test data.
Produces release reports that leadership, ML, and engineering can trust.
Catches meaningful regressions before release.
Provides reliable impact analysis for production issues and helps prioritize fixes.

ABOUT TRIOMICS

Triomics is building the agentic AI layer for oncology EHRs. Cancer hospitals spend billions on highly trained staff manually reading unstructured patient records - pathology reports, clinical notes, genomic panels - to power workflows like trial matching, registry curation, visit prep, and quality reporting. We replace that manual work with task-driven AI agents that sit inside the EMR and process records at scale, in real time.

Our platform is trusted by leading cancer centers including Memorial Sloan Kettering, Mount Sinai, and Yale Cancer Center. We have grown 10x in the last year and process millions of oncology medical documents monthly.

Our investors include Battery Ventures, Lightspeed, General Catalyst, Nexus Venture Partners, and Y Combinator.

WHY JOIN TRIOMICS

Impact at scale. The systems you build directly power AI workflows that accelerate cancer research and improve patient outcomes.
Cutting-edge problems. Hard, data-intensive systems at the intersection of AI, healthcare, and scale - in a highly regulated industry where reliability is non-negotiable.
World-class team. Work alongside top talent across AI, engineering, and product, with best-in-industry compensation.
Culture that ships. Fast-paced, ownership-driven, with company-sponsored workations.

PERKS & BENEFITS

Lunch provided at the office - one less daily decision.
Flexible working hours - we care about output, not clock-ins.
Comprehensive health insurance for you and your family.
Zomato meal benefits for early starts and late nights.

