Senior Member of Technical Staff, AI Quality

Harper

Posted about 1 hour ago

The Problem

36 million businesses in America need insurance. 77% are underinsured. 40% have no coverage at all. We're building commercial insurance distribution that is 90%+ AI-led: ~1,000 new customers/month, 100x growth in a year.

Our agents handle intake, sales, service, voice, and submission packaging. They get better every week - but "better" is only true if we can prove it. Today, AI engineers ship a prompt change, a tool change, or a new model and judge it by vibe: "feels worse," "feels better," "the demo passed." Vibes don't survive Series B.

Build the evals that turn agent quality from a vibe into a number. Catch every regression before it ships.

The Thesis

AI only compounds when the company can tell whether it is getting better. Demos do not count. Vibes do not count. The bar is a real customer case, a real transcript, a real failure mode, and a regression suite that catches the same mistake forever.

You will build the evals that make Harper's agents trustworthy. When the agent improves, we know. When it regresses, we know before the customer does. That is how we scale judgment without scaling headcount.

The Role

Harper operates like a factory with a series of modules spanning the full lifecycle from intake through renewals. Across them we run a stack of internal AI systems covering operator guidance, the operational backbone that matches risks to underwriters, autonomous communications, and voice AI for customer interactions.

Every one of those agents needs to be evaluated, regression-tested, and monitored in production. You'll work alongside the engineer setting the AI-quality direction and own a specific agent surface end-to-end.

What You'll Do

  • Build capability + regression eval suites for assigned agents - intake, submissions, placements, renewals, CRM, or voice

  • Curate golden datasets - Real failure modes from real customer transcripts, real underwriter back-and-forth, real call recordings. 20–50 quality cases per agent, not thousands of synthetic ones.

  • Design graders - Deterministic first (string match, state check, tool-call assertions). LLM-as-judge where deterministic fails. Human calibration on samples.

  • Ship pre-merge eval gates - Every PR touching an agent / prompt / tool runs the relevant suite in CI. Below threshold → blocked.

  • Wire production trajectory monitoring - Online evaluators score live trajectories. Drift detection within hours.

  • Convert ops findings into tests - Critique's flagged failures become regression cases. Every repeat issue becomes a permanent test.
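
To make the grader and gate bullets above concrete, here is a minimal Python sketch of a deterministic-first grading pass with a judge fallback and a pass-rate threshold. It is illustrative only, not Harper's actual stack: every name in it (`Trajectory`, `GoldenCase`, `reply_contains`, `called_tool`, `state_equals`, `grade`, `gate`) and the 0.9 default threshold are hypothetical, and the `judge` argument stands in for whatever LLM-as-judge call a real suite would make.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Trajectory:
    """Recorded output of one agent run (hypothetical shape)."""
    reply: str
    tool_calls: list[str]
    state: dict[str, str]


# A deterministic check inspects a trajectory and returns pass/fail.
Check = Callable[[Trajectory], bool]


def reply_contains(phrase: str) -> Check:
    """String match: the reply must mention the phrase (case-insensitive)."""
    return lambda t: phrase.lower() in t.reply.lower()


def called_tool(name: str) -> Check:
    """Tool-call assertion: the agent must have invoked the named tool."""
    return lambda t: name in t.tool_calls


def state_equals(key: str, value: str) -> Check:
    """State check: the final state must hold the expected value."""
    return lambda t: t.state.get(key) == value


@dataclass
class GoldenCase:
    case_id: str
    trajectory: Trajectory
    checks: list[Check] = field(default_factory=list)


def grade(case: GoldenCase, judge: Callable[[Trajectory], bool]) -> bool:
    """Deterministic checks first; fall back to the judge only when none exist."""
    if case.checks:
        return all(check(case.trajectory) for check in case.checks)
    return judge(case.trajectory)


def gate(cases: list[GoldenCase],
         judge: Callable[[Trajectory], bool],
         threshold: float = 0.9) -> tuple[float, bool]:
    """Return (pass rate, allowed). A rate below threshold means blocked."""
    if not cases:
        raise ValueError("eval suite has no cases")
    passed = sum(grade(case, judge) for case in cases)
    rate = passed / len(cases)
    return rate, rate >= threshold
```

In CI, a script along these lines would load the golden cases for the agent a PR touches, call `gate`, and exit non-zero when the second return value is `False`. Human calibration would then sample the cases the judge graded and compare its verdicts against a reviewer's.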

You Might Be a Fit If…

  • You've built or operated eval frameworks for production LLM systems

  • You can describe a specific regression caught by an eval suite you built - and how it would have leaked otherwise

  • You've designed an LLM-as-judge rubric that survived human calibration

  • You can debug a hallucination by reading transcripts, not aggregate dashboards

  • You write code with AI daily and have strong opinions on which agent behaviors matter

  • You're 3–6 years into your career

Requirements

  • 3–6 years of software engineering experience

  • Production LLM / agent eval experience - capability + regression suite design, LLM-as-judge graders, golden datasets

  • Familiarity with at least one major eval framework

  • Strong written communication - eval rubric docs, failure-mode taxonomies

  • Based in San Francisco or willing to relocate

Nice to Have

  • Open-source contribution to eval frameworks

  • Red-team / adversarial-testing experience for LLM systems

  • Voice AI eval experience (latency, interruption handling, transcription accuracy)

  • ML eval / observability background

Compensation

  • OTE: $176,000–$253,000 cash compensation (base salary + target performance bonus)

  • Equity: competitive equity, so you share in the company you are helping build

  • Location: San Francisco, in-office

Benefits

  • Health, dental, and vision insurance

  • Commuter benefits

  • Team meals and snacks

The Process

  1. Founder call (15 min) - Mission, pace, scope

  2. Tech Lead deep-dive (60 min) - Eval architecture, grader design, real failure modes

  3. Super Day on-site - full-day simulation of working at Harper: live eval-suite design, code review, team context, and founder/CTO time

Job details

Workplace: Office

Location: San Francisco

Experience: SE

Salary: 176k - 253k USD per year