Infrastructure Engineer – AI HPC Compute

CloudRaft.com

Remote

Full Time

About CloudRaft
CloudRaft is a dynamic and forward-thinking company specializing in cutting-edge AI and cloud-native solutions. We thrive on creativity, collaboration, and innovation, empowering our team to solve complex challenges and deliver impactful results. Join us to be part of a team that values growth, excellence, and a passion for technology.
Overview
We are seeking an Infrastructure Engineer to lead the lifecycle management of cutting-edge GPU-based platforms powering AI, ML, and HPC workloads. You will design, automate, and optimize high-performance compute clusters, ensuring robust provisioning, observability, and operational excellence at scale.
Key Responsibilities
NOTE: This position is part of a 24x7 on-call rotation to provide operational support as needed. This is a remote position open to candidates currently residing in India.
GPU & Bare Metal Lifecycle Management
  • Own end-to-end lifecycle management of high-performance GPU servers (NVIDIA/AMD), optimizing for AI/ML training, inference, and HPC workloads.
  • Install, upgrade, and maintain operating systems (Ubuntu, RHEL), GPU drivers (NVIDIA), DCGM, firmware, BIOS, and BMC, with a focus on system stability and performance.
  • Perform advanced diagnostics, health checks, and RMA/hardware swaps in collaboration with data center vendors (see the health-check sketch after this list).
  • Standardize and maintain golden images tuned for AI/ML performance, ensuring fast and reproducible deployments.
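
For illustration, a minimal Python health-check sketch in the spirit of the diagnostics bullet above, using NVIDIA's NVML bindings (pynvml); the temperature and memory thresholds are hypothetical values chosen for the example, not CloudRaft standards:

  # Requires the NVML Python bindings (e.g. the nvidia-ml-py package).
  import pynvml

  TEMP_LIMIT_C = 85      # hypothetical alert threshold
  MEM_USE_LIMIT = 0.98   # hypothetical memory-pressure threshold

  def gpu_health_report():
      """Return a list of (gpu_index, issue) tuples for GPUs that need attention."""
      issues = []
      pynvml.nvmlInit()
      try:
          for i in range(pynvml.nvmlDeviceGetCount()):
              handle = pynvml.nvmlDeviceGetHandleByIndex(i)
              temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
              mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
              if temp >= TEMP_LIMIT_C:
                  issues.append((i, f"temperature {temp}C >= {TEMP_LIMIT_C}C"))
              if mem.used / mem.total >= MEM_USE_LIMIT:
                  issues.append((i, f"memory usage at {mem.used / mem.total:.0%}"))
      finally:
          pynvml.nvmlShutdown()
      return issues

  if __name__ == "__main__":
      for idx, issue in gpu_health_report():
          print(f"GPU {idx}: {issue}")

In practice, checks like this would typically feed DCGM-based tooling or the alerting stack described under Monitoring & Observability below.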

Cluster Networking & Storage
  • Work with high-speed network fabrics (InfiniBand, 400G Ethernet) and distributed storage (e.g., Ceph, Lustre, Vast Data) to deliver scalable, low-latency compute clusters.
  • Liaise with network engineering to diagnose network bottlenecks and optimize for distributed AI workloads.

Provisioning & Automation
  • Design and implement automation for GPU node imaging, provisioning, access control, and scheduling in multi-tenant environments (Bash, Python, Ansible); a provisioning sketch follows this list.
  • Ensure rapid and consistent provisioning of GPU resources tailored to workload-specific requirements (AI training, simulation, etc.).
  • Integrate real-time GPU metrics into dashboards and billing systems for transparency and usage reporting.
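
As a rough illustration of the provisioning automation described above, the Python wrapper below drives an Ansible playbook to reimage a set of GPU nodes; the inventory path, playbook name, and golden_image variable are hypothetical placeholders, not an existing CloudRaft workflow:

  # Sketch only: playbook, inventory, and variable names are assumptions.
  import subprocess

  def provision_gpu_nodes(nodes, image,
                          inventory="inventory/gpu_fleet.ini",
                          playbook="playbooks/provision_gpu_node.yml"):
      """Run the (hypothetical) provisioning playbook against the given nodes."""
      cmd = [
          "ansible-playbook", playbook,
          "-i", inventory,
          "--limit", ",".join(nodes),
          "-e", f"golden_image={image}",
      ]
      subprocess.run(cmd, check=True)  # raises CalledProcessError if provisioning fails

  if __name__ == "__main__":
      provision_gpu_nodes(["gpu-node-01", "gpu-node-02"], image="ai-train-golden-2025.09")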

Monitoring & Observability
  • Deploy and manage observability stacks (Prometheus, Grafana) for deep monitoring of GPU/server performance, thermal profiles, and system health; a minimal exporter sketch follows this list.
  • Build dashboards and intelligent alerting systems for proactive incident response (e.g., thermal throttling, resource contention).
  • Deliver actionable insights to engineering teams and end users on resource utilization and efficiency.
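
A minimal sketch of a GPU exporter that could feed such dashboards, using the prometheus_client library together with pynvml; the metric names, port, and polling interval are arbitrary example choices, and a production setup would more likely rely on an existing exporter such as NVIDIA's dcgm-exporter:

  # Sketch: expose per-GPU temperature and utilization for Prometheus to scrape.
  import time
  import pynvml
  from prometheus_client import Gauge, start_http_server

  GPU_TEMP = Gauge("gpu_temperature_celsius", "GPU core temperature", ["gpu"])
  GPU_UTIL = Gauge("gpu_utilization_percent", "GPU utilization", ["gpu"])

  def collect():
      for i in range(pynvml.nvmlDeviceGetCount()):
          handle = pynvml.nvmlDeviceGetHandleByIndex(i)
          GPU_TEMP.labels(gpu=str(i)).set(
              pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU))
          GPU_UTIL.labels(gpu=str(i)).set(
              pynvml.nvmlDeviceGetUtilizationRates(handle).gpu)

  if __name__ == "__main__":
      pynvml.nvmlInit()
      start_http_server(9400)  # arbitrary port for the /metrics endpoint
      while True:
          collect()
          time.sleep(15)       # arbitrary polling interval

Alert rules for conditions such as thermal throttling or resource contention would then be defined in Prometheus on top of these series.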

DevOps and Infrastructure Engineering
  • Manage and operate Kubernetes clusters with GPU scheduling (see the scheduling sketch after this list).
  • Champion Infrastructure-as-Code (Terraform, Ansible) for reproducible, scalable deployment of compute platforms.
  • Develop and maintain CI/CD pipelines for infrastructure workflows supporting AI Factory or GPUaaS operations.
  • Apply best practices for security, resiliency, and scaling in a distributed, high-throughput environment.
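
As a sketch of GPU scheduling on Kubernetes, the snippet below uses the official kubernetes Python client to submit a pod requesting one GPU through the nvidia.com/gpu resource; the pod name, namespace, and container image are illustrative placeholders, and the example assumes the NVIDIA device plugin is already running on the cluster:

  # Sketch: submit a single-GPU smoke-test pod; names and image are placeholders.
  from kubernetes import client, config

  def submit_gpu_pod(name="gpu-smoke-test", namespace="default",
                     image="nvidia/cuda:12.4.1-base-ubuntu22.04"):
      config.load_kube_config()  # use load_incluster_config() when running inside the cluster
      pod = client.V1Pod(
          metadata=client.V1ObjectMeta(name=name),
          spec=client.V1PodSpec(
              restart_policy="Never",
              containers=[client.V1Container(
                  name="cuda",
                  image=image,
                  command=["nvidia-smi"],
                  resources=client.V1ResourceRequirements(
                      limits={"nvidia.com/gpu": "1"},  # lands the pod on a node with a free GPU
                  ),
              )],
          ),
      )
      client.CoreV1Api().create_namespaced_pod(namespace=namespace, body=pod)

  if __name__ == "__main__":
      submit_gpu_pod()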


Qualifications & Skills
  • 3+ years of experience managing large-scale, bare metal compute infrastructure with NVIDIA GPU specialization.
  • Expert-level Linux system administration (Ubuntu/RHEL) in production HPC/AI environments.
  • In-depth hands-on experience with NVIDIA drivers, DCGM, network fabrics, BIOS/firmware, and BMC management.
  • Proficiency in automation and scripting (Python, Bash, Ansible); advanced debugging and troubleshooting skills.
  • Experience in deploying and operating monitoring solutions (Prometheus, Grafana, or equivalent).
  • Proven track record implementing Infrastructure as Code (Terraform, Ansible, etc.).
  • Solid grasp of HPC/AI workload patterns, complex job schedulers (SLURM, Kubernetes), and best practices for distributed training/inferencing.
  • Strong communication skills and ability to document operational runbooks, images, and workflows.

Bonus Points
  • Experience operating Kubernetes clusters in GPU-accelerated, bare metal environments.
  • Familiarity with distributed storage (Ceph, Lustre, Vast Data, DDN or similar), GPU virtualization, or multi-tenant hardware isolation.
  • Background in building an AI cloud or a large-scale private cloud.


September 15, 2025
