Senior Systems Engineer HPC - R-21841
Rackspace.com
Office
Gurgaon
Full Time
Responsibilities:
- System Administration & Maintenance: Install, configure, and maintain HPC clusters (hardware, software, operating systems), perform regular updates/patching, manage user accounts and permissions, and troubleshoot/resolve hardware or software issues.
- Performance & Optimization: Monitor and analyse system and application performance, identify bottlenecks, implement tuning solutions, and profile workloads to improve efficiency.
- Cluster & Resource Management: Manage and optimize job scheduling, resource allocation, and cluster operations using tools such as Slurm, LSF, Bright Cluster Manager / Base Command Manager, OpenHPC, and Warewulf.
- Networking & Interconnects: Configure, manage, and tune Linux networking (TCP/IP, DNS, routing) and high-speed HPC interconnects (InfiniBand, Ethernet) to ensure low-latency, high-bandwidth communication.
- Storage & Data Management: Implement and maintain large-scale storage and parallel file systems (Lustre, Ceph, GPFS), ensure data integrity, manage backups, and support disaster recovery.
- Security & Authentication: Implement security controls, ensure compliance with policies, and manage authentication and directory services such as LDAP and Active Directory.
- DevOps & Automation: Use configuration management and DevOps practices (Ansible, Terraform, Jenkins, Git) to automate deployments, application packaging (RPM/DEB), and system configurations.
- User Support & Collaboration: Provide technical support, documentation, and training to researchers; collaborate with scientists, HPC architects, and engineers to align infrastructure with research needs.
- Planning & Innovation: Contribute to the design and planning of HPC infrastructure upgrades, evaluate and recommend hardware/software solutions, and explore cloud-based HPC solutions where applicable.
Qualifications:
- Bachelor’s degree in Computer Science, Engineering, or a related field (equivalent experience may substitute for degree).
- Minimum of 10 years of systems experience, including at least 5 years working specifically with HPC.
- Strong knowledge of Linux operating systems (e.g., Rocky Linux, Ubuntu) with a fundamental understanding of Linux internals, system administration, and performance tuning.
- Experience building and managing RPM and DEB packages.
- Experience with cluster management tools such as Bright Cluster Manager, OpenHPC stack, or Warewulf.
- Proficiency with job schedulers and resource managers such as Slurm and LSF.
- Strong understanding of Linux networking (e.g., TCP/IP, DNS, routing) and HPC interconnects (e.g., InfiniBand, Ethernet) including performance tuning.
- Knowledge of parallel file systems such as Lustre, Ceph, or GPFS.
- Working knowledge of Linux authentication and directory services such as LDAP and Active Directory.
- Proficiency in scripting languages (e.g., Python, Bash, R); familiarity with MPI libraries for parallel and distributed computing is nice to have.
- Strong experience with DevOps and configuration management tools, including Ansible, Terraform, Jenkins, and Git.
- Knowledge of HPC in cloud environments (e.g., AWS, Azure, GCP HPC offerings) is a plus.
- Strong knowledge of Linux security, compliance standards, and data protection best practices.
- Excellent communication, interpersonal, and problem-solving skills.
October 6, 2025