AI Platform Engineer

Noon

Posted about 14 hours ago

About Noon

We are on a mission to reinvent how designers work in the AI era. We’re backed by top investors including First Round, Chemistry, Homebrew, Scribble and senior leaders from OpenAI, Meta, Google, Ramp, Stripe and more. We’re building the next-generation AI design tool for product teams.

About the Role

We’re hiring an AI Platform Engineer to own how our models run in production. You’ll build the inference stack that delivers sub-second responses to designers at scale, optimize latency and cost, and own the reliability of every AI capability in the product. This is the role for someone who lives in serving infrastructure and treats GPU utilization like a craft.

You’ll own the platform layer end-to-end: serving, autoscaling, observability, deployment, and the cost-and-latency economics of running models at scale.

What You’ll Do

  • Architect and operate the inference platform: serving stack, autoscaling, multi-tenancy, observability

  • Optimize end-to-end latency (TTFT, TPOT, p95) with quantization, batching, KV-cache management, and speculative decoding

  • Design multi-GPU parallelism strategies (DP / TP / PP) and own GPU utilization and cost economics

  • Build a hybrid local + cloud serving architecture — small models on the user’s Mac for fast paths, larger models in the cloud for slow paths

  • Own canary deployment, rollback automation, and SLO/SLA-driven reliability for all AI features

  • Build production observability: latency, drift, quality, and cost dashboards

  • Evaluate and integrate inference engines (vLLM, Triton, TGI, TensorRT, MLX) for cloud and on-device paths

  • Take fine-tuned models from research artifacts to production traffic

Must-Have Requirements

  • 8+ years software engineering experience

  • 2+ years deploying ML or LLM systems at production scale

  • Deep, demonstrable experience with one or more inference serving systems (vLLM, Triton, TGI, TensorRT, ONNX Runtime)

  • Concrete production wins on latency and throughput engineering (p50/p95/p99, GPU utilization, cost-per-token)

  • Reliability engineering depth: canary deployment, rollback, SLO-driven ops, on-call readiness

  • Cloud and Kubernetes-based ML deployment experience

  • Multi-GPU parallelism experience (FSDP, DDP, TP, PP) is a strong plus

Nice to Have

  • On-device inference experience (MLX, Core ML, ONNX Runtime on consumer hardware)

  • Production experience with quantization, distillation, and mixed-precision inference

  • Experience with streaming inference and real-time AI UX

  • Background running inference at startup scale — comfortable with cost-per-user economics, not just raw throughput

What You’ll Build

  • The inference platform powering every AI feature in the product

  • Sub-second response paths for high-frequency design actions

  • A hybrid local + cloud serving architecture, with intelligent routing between fast and slow paths

  • Observability infrastructure: latency, drift, quality, and cost

  • Multi-model orchestration with on-device fast paths and cloud slow paths

  • Reliable, measurable, real-time streaming AI experiences

Benefits

  • Base salary: $300,000-$400,000

Job details

  • Workplace: Hybrid

  • Location: San Francisco

  • Salary: 300k - 400k USD per year
