
AI Platform Engineer
Noon
Posted about 14 hours ago
About Noon
We are on a mission to reinvent how designers work in the AI era. We’re backed by top investors including First Round, Chemistry, Homebrew, Scribble and senior leaders from OpenAI, Meta, Google, Ramp, Stripe and more. We’re building the next-generation AI design tool for product teams.
About the Role
We’re hiring an AI Platform Engineer to own how our models run in production. You’ll build the inference stack that delivers sub-second responses to designers at scale, optimize latency and cost, and own the reliability of every AI capability in the product. This is the role for someone who lives in serving infrastructure and treats GPU utilization like a craft.
You’ll own the platform layer end-to-end: serving, autoscaling, observability, deployment, and the cost-and-latency economics of running models at scale.
What You’ll Do
Architect and operate the inference platform: serving stack, autoscaling, multi-tenancy, observability
Optimize end-to-end latency (TTFT, TPOT, p95) with quantization, batching, KV-cache management, and speculative decoding
Design multi-GPU parallelism strategies (DP / TP / PP) and own GPU utilization and cost economics
Build a hybrid local + cloud serving architecture — small models on the user’s Mac for fast paths, larger models in the cloud for slow paths
Own canary deployment, rollback automation, and SLO/SLA-driven reliability for all AI features
Build production observability: latency, drift, quality, and cost dashboards
Evaluate and integrate inference engines (vLLM, Triton, TGI, TensorRT, MLX) for cloud and on-device paths
Take fine-tuned models from research artifacts to production traffic
Must-Have Requirements
8+ years of software engineering experience
2+ years deploying ML or LLM systems at production scale
Deep, demonstrable experience with one or more inference serving systems (vLLM, Triton, TGI, TensorRT, ONNX Runtime)
Concrete production wins on latency and throughput engineering (p50/p95/p99, GPU utilization, cost-per-token)
Reliability engineering depth: canary deployment, rollback, SLO-driven ops, on-call readiness
Cloud and Kubernetes-based ML deployment experience
Multi-GPU parallelism experience (FSDP, DDP, TP, PP) is a strong plus
Nice to Have
On-device inference experience (MLX, Core ML, ONNX Runtime on consumer hardware)
Production experience with quantization, distillation, and mixed-precision inference
Experience with streaming inference and real-time AI UX
Background running inference at startup scale — comfortable with cost-per-user economics, not just raw throughput
What You’ll Build
The inference platform powering every AI feature in the product
Sub-second response paths for high-frequency design actions
A hybrid local + cloud serving architecture, with intelligent routing between fast and slow paths
Observability infrastructure: latency, drift, quality, and cost
Multi-model orchestration with on-device fast paths and cloud slow paths
Reliable, measurable, real-time streaming AI experiences
Benefits
Base salary: $300,000-$400,000