Perplexity

Member of Technical Staff (AI Infrastructure Engineer)

Posted 2 months ago

Office | San Francisco | 220k - 405k USD

We are looking for an AI Infrastructure Engineer to join our growing team. We work with Kubernetes, Slurm, Python, C++, and PyTorch, primarily on AWS. As an AI Infrastructure Engineer, you will partner closely with our Inference and Research teams to build, deploy, and optimize our large-scale AI training and inference clusters.

Responsibilities

  • Design, deploy, and maintain scalable Kubernetes clusters for AI model inference and training workloads

  • Manage and optimize Slurm-based HPC environments for distributed training of large language models

  • Develop robust APIs and orchestration systems for both training pipelines and inference services

  • Implement resource scheduling and job management systems across heterogeneous compute environments

  • Benchmark system performance, diagnose bottlenecks, and implement improvements across both training and inference infrastructure

  • Build monitoring, alerting, and observability solutions tailored to ML workloads running on Kubernetes and Slurm

  • Respond swiftly to system outages and collaborate across teams to maintain high uptime for critical training runs and inference services

  • Optimize cluster utilization and implement autoscaling strategies for dynamic workload demands

Qualifications

  • Strong expertise in Kubernetes administration, including custom resource definitions, operators, and cluster management

  • Hands-on experience with Slurm workload management, including job scheduling, resource allocation, and cluster optimization

  • Experience deploying and managing distributed training systems at scale

  • Deep understanding of container orchestration and distributed systems architecture

  • High-level familiarity with LLM architecture and training processes (multi-head attention, multi-query and grouped-query attention, distributed training strategies)

  • Experience managing GPU clusters and optimizing compute resource utilization

Required Skills

  • Expert-level Kubernetes administration and YAML configuration management

  • Proficiency with Slurm job scheduling, resource management, and cluster configuration

  • Python and C++ programming with a focus on systems and infrastructure automation

  • Hands-on experience with ML frameworks such as PyTorch in distributed training contexts

  • Strong understanding of networking, storage, and compute resource management for ML workloads

  • Experience developing APIs and managing distributed systems for both batch and real-time workloads

  • Solid debugging and monitoring skills with expertise in observability tools for containerized environments

Preferred Skills

  • Experience with Kubernetes operators and custom controllers for ML workloads

  • Advanced Slurm administration including multi-cluster federation and advanced scheduling policies

  • Familiarity with GPU cluster management and CUDA optimization

  • Experience with other ML frameworks like TensorFlow or distributed training libraries

  • Background in HPC environments, parallel computing, and high-performance networking

  • Knowledge of infrastructure as code (Terraform, Ansible) and GitOps practices

  • Experience with container registries, image optimization, and multi-stage builds for ML workloads

Required Experience

  • Demonstrated experience managing large-scale Kubernetes deployments in production environments

  • Proven track record with Slurm cluster administration and HPC workload management

  • Previous roles in SRE, DevOps, or Platform Engineering with a focus on ML infrastructure

  • Experience supporting both long-running training jobs and high-availability inference services

  • Ideally, 3-5 years of relevant experience in ML systems deployment, with a specific focus on cluster orchestration and resource management

Job details
Workplace: Office
Location: San Francisco
Salary: 220k - 405k USD per year

Perplexity is a free AI-powered answer engine that provides accurate, trusted, and real-time answers to any question.

Key team members

Vitaly Golomb

Ben Bloch Roc

Fabio Bottacci

Byron Deeter
