The Problem

36 million businesses in America need insurance. 77% are underinsured. 40% have no coverage at all. We're building 90%+ AI-led commercial insurance distribution. ~1,000 new customers/month, 100x growth in a year.

Our agents handle intake, sales, service, voice, and submission packaging. They get better every week - but "better" is only true if we can prove it. Today, AI engineers ship a prompt change, a tool change, or a new model and judge it by vibe: "feels worse," "feels better," "the demo passed." Vibes don't survive Series B.

Build the evals that turn agent quality from a vibe into a number. Catch every regression before it ships.

The Thesis

AI only compounds when the company can tell whether it is getting better. Demos do not count. Vibes do not count. The bar is a real customer case, a real transcript, a real failure mode, and a regression suite that catches the same mistake forever.

You will build the evals that make Harper's agents trustworthy. When the agent improves, we know. When it regresses, we know before the customer does. That is how we scale judgment without scaling headcount.

The Role

Harper operates like a factory with a series of modules spanning the full lifecycle from intake through renewals. Across them we run a stack of internal AI systems covering operator guidance, the operational backbone that matches risks to underwriters, autonomous communications, and voice AI for customer interactions.

Every one of those agents needs to be evaluated, regression-tested, and monitored in production. You'll work alongside the engineer setting the AI-quality direction and own a specific agent surface end-to-end.

What You'll Do

Build capability + regression eval suites for assigned agents - intake, submissions, placements, renewals, CRM, or voice
Curate golden datasets - Real failure modes from real customer transcripts, real underwriter back-and-forth, real call recordings. 20–50 quality cases per agent, not thousands of synthetic ones.
Design graders - Deterministic first (string match, state check, tool-call assertions). LLM-as-judge where deterministic fails. Human calibration on samples.
Ship pre-merge eval gates - Every PR touching an agent / prompt / tool runs the relevant suite in CI. Below threshold → blocked.
Wire production trajectory monitoring - Online evaluators score live trajectories. Drift detection within hours.
Convert ops findings into tests - Critique's flagged failures become regression cases. Every repeat issue becomes a permanent test.
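
To make "deterministic first" and "below threshold → blocked" concrete, here is a minimal Python sketch of the idea. It is an illustration under stated assumptions, not Harper's actual stack: the Trajectory record, the grader helpers, the tool names, and the 0.9 threshold are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Trajectory:
    """Hypothetical record of one agent run against a golden case."""
    reply: str
    final_state: Dict[str, Any]
    tool_calls: List[str] = field(default_factory=list)  # tool names, in call order


Grader = Callable[[Trajectory], bool]


def contains(text: str) -> Grader:
    """String match: the reply must mention `text` (case-insensitive)."""
    return lambda t: text.lower() in t.reply.lower()


def state_equals(key: str, expected: Any) -> Grader:
    """State check: the final state must hold `expected` under `key`."""
    return lambda t: t.final_state.get(key) == expected


def called_tool(name: str) -> Grader:
    """Tool-call assertion: the agent must have called `name` at least once."""
    return lambda t: name in t.tool_calls


def case_passes(trajectory: Trajectory, graders: List[Grader]) -> bool:
    """A case passes only if every deterministic grader passes."""
    return all(g(trajectory) for g in graders)


def pass_rate(results: List[bool]) -> float:
    """Fraction of passing cases; an empty suite counts as 0.0, not a pass."""
    return sum(results) / len(results) if results else 0.0


def gate(results: List[bool], threshold: float) -> bool:
    """Pre-merge gate: True means the change may merge."""
    return pass_rate(results) >= threshold


# Two made-up trajectories for the same golden case (illustrative data only).
graders = [
    contains("general liability"),
    state_equals("status", "quoted"),
    called_tool("create_quote"),
]
good = Trajectory(
    reply="Your general liability quote is ready.",
    final_state={"status": "quoted"},
    tool_calls=["lookup_business", "create_quote"],
)
bad = Trajectory(reply="I can't help with that.", final_state={"status": "open"})

results = [case_passes(good, graders), case_passes(bad, graders)]
merge_allowed = gate(results, threshold=0.9)
```

In CI, a wrapper script would exit non-zero when the gate returns False, which is what blocks the PR. An LLM-as-judge grader can plug in as one more Grader callable for the cases these deterministic checks cannot express.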

You Might Be a Fit If…

You've built or operated eval frameworks for production LLM systems
You can describe a specific regression that an eval suite you built caught - and how it would have slipped into production otherwise
You've designed an LLM-as-judge rubric that survived human calibration
You can debug a hallucination by reading transcripts, not aggregate dashboards
You write code with AI daily and have strong opinions on which agent behaviors matter
You're 3–6 years into your career

Requirements

3–6 years of software engineering experience
Production LLM / agent eval experience - capability + regression suite design, LLM-as-judge graders, golden datasets
Familiarity with at least one major eval framework
Strong written communication - eval rubric docs, failure-mode taxonomies
Based in San Francisco or willing to relocate

Nice to Have

Open-source contribution to eval frameworks
Red-team / adversarial-testing experience for LLM systems
Voice AI eval experience (latency, interruption handling, transcription accuracy)
ML eval / observability background

Compensation

OTE: $176,000–$253,000 cash compensation (base salary + target performance bonus)
Equity: competitive equity, so you share in the company you are helping build
Location: San Francisco, in-office

Benefits

Health, dental, and vision insurance
Commuter benefits
Team meals and snacks

The Process

Founder call (15 min) - Mission, pace, scope
Tech Lead deep-dive (60 min) - Eval architecture, grader design, real failure modes
Super Day on-site - full-day simulation of working at Harper: live eval-suite design, code review, team context, and founder/CTO time
Founder + Tech Lead offer conversation - No committee. Best offer, first.

To Apply

If you've turned vibes into a number - built an eval suite that caught a regression a model upgrade silently introduced - send your resume, the eval framework you built or used, and a transcript of a failure you found that nobody else did.

Senior Member of Technical Staff, AI Quality