Founding ML Infrastructure Engineer

uRun

Posted 1 day ago

The problem we saw

Most AI infrastructure is built for batch: send a query, wait, get a response, reset. Powerful, but transactional. AI is becoming interactive — sessions that hold state, models that stay alive between turns, generation that responds as it runs — and the infrastructure to deliver that at scale doesn't really exist yet.

The bottleneck isn't the models anymore. It's the infrastructure underneath them.

What we're building to fix it

uRun is the inference cloud for interactive AI: the compute layer that makes real-time, stateful inference possible at scale. We came out of stealth in April 2026, are backed by top-tier investors, and are founded by Keegan McCallum, who scaled inference infrastructure for some of the most demanding generative AI workloads in production.

We're an infrastructure company. We build the layer that model labs, builders, and research teams ship on top of.

Where you come in

We are building the next generation of AI inference infrastructure. As our Founding ML Infrastructure Engineer, you will own the architecture and scaling of our GPU compute platform from the ground up.

This is a founding technical hire with end-to-end ownership across the full infrastructure stack, from bare metal to model serving. You will work directly with the founding team and define how we build.

What you'll actually be doing day-to-day

  • Design and scale our GPU compute platform to support clusters of 1,000+ GPUs, ensuring high availability and low-latency inference across the fleet

  • Build and maintain the infrastructure layer for our compute marketplace, including multi-tenant scheduling, isolation, and billing-aware resource allocation

  • Own production reliability for ML systems end-to-end: observability, incident response, and SLA achievement across model serving and infrastructure

  • Architect feature stores and model registry systems that support rapid iteration and reproducibility at scale

  • Design experiment-tracking infrastructure capable of handling thousands of concurrent runs with full auditability

  • Build resource orchestration and scheduling systems that optimise for throughput, cost, and latency across heterogeneous hardware

  • Set engineering standards for infrastructure reliability, capacity planning, and operational excellence as an early technical leader

What skills you need for the journey

  • Proven experience designing and operating large-scale distributed infrastructure at 1,000+ nodes or equivalent complexity, in any domain

  • Deep expertise in distributed systems, cluster orchestration (Kubernetes, Slurm, or custom schedulers), and large-scale resource scheduling

  • Strong production reliability instincts: observability, incident response, capacity planning, and SLA ownership across complex systems

  • Experience building infrastructure that other engineers build on top of, not just operating it

  • Ability to operate as a technical lead: set direction, make tradeoffs under uncertainty, and raise the bar for the team around you

  • Startup orientation. You are energised by ambiguity, move fast, and build for scale from day one

Things that will give you an edge

  • Exposure to ML infrastructure concepts: GPU networking (NCCL, InfiniBand, RoCE), model serving frameworks (vLLM, SGLang, TensorRT-LLM), or hardware-aware performance tuning (CuTe, Triton, TileLang)

  • Experience with multi-cloud GPU procurement and capacity management across AWS, GCP, Azure, and bare metal providers

  • Familiarity with inference marketplace architectures, dynamic routing, or spot/preemptible workload management

  • Prior experience at a Series A or earlier stage company scaling from early infrastructure to production

What you'll get in return

Competitive salary and meaningful equity in an early-stage AI infrastructure company. The listed salary band is our target; for an exceptional candidate we'll go higher. Equity is real: you're early, and the grant reflects that.

  • Health, dental, and vision — full coverage

  • 401(k) — company-supported retirement savings

  • FSA/HSA — flexible spending accounts for healthcare costs

  • Paid time off — we trust you to manage your time

  • Top-tier tooling — access to the best AI tools available: Claude, Codex, Kimi, and whatever else helps you move faster

  • MacBook Pro and AirPods — the hardware you need, on us

How we work (and what that feels like day-to-day)

We build the stage, not the show. We're an infrastructure company, a developer-tools company, and a production partner for model labs — and focus is a deliberate choice we've made and hold to.

Day-to-day, that means a small team, a high bar, and real ownership. You won't wait for permission or inherit a backlog of someone else's decisions. In a founding infrastructure role, the function is what you make it.

It also means ambiguity: priorities shift, not everything is documented, and you'll often be the person who decides what "good enough for now" means. That suits some people and not others, and we'd rather you know that before you apply.

Job details

Workplace: Hybrid

Location: San Francisco

Salary: 200k - 350k USD per year
