Site Reliability Engineer (Vancouver)

Gauss Labs

Office

Vancouver

Full Time

Gauss Labs is seeking a highly skilled Site Reliability Engineer to join our team in Vancouver. As an SRE at Gauss Labs, you will play a critical role in ensuring our industrial AI platform's reliability, performance, and scalability. You will be responsible for building and maintaining a robust solution that supports our growing business at customer sites. This role requires a high level of technical expertise, a collaborative mindset, and a strong desire to continuously improve systems and processes.

Responsibilities

  • Monitoring and Alerting: Creating and maintaining robust monitoring systems to proactively identify and resolve issues before they impact customers. Implementing effective alerting mechanisms to ensure timely response to critical events.
  • Incident Response: Participating in on-call rotations and leading incident response efforts to minimize downtime and restore service quickly.
  • Automation: Developing and implementing automation tools and scripts to streamline operations, reduce manual effort, and improve efficiency.
  • Capacity Planning: Forecasting resource needs, optimizing resource utilization, and ensuring customers' infrastructure can handle increasing workloads.
  • Performance Optimization: Identifying and resolving performance bottlenecks, optimizing system performance, and improving response times.
  • Collaboration: Partnering with software engineers, data scientists, and other teams to ensure alignment and efficient operations.
  • Customer Focus: Working closely with the AI Program Manager and Technical Account Manager to understand customer issues, provide technical support, and improve customer satisfaction.
  • Continuous Improvement: Driving a culture of continuous improvement by identifying opportunities to enhance system reliability, performance, and efficiency.

Basic Qualifications

  • Bachelor's degree in computer science, engineering, or a related discipline
  • 5+ years of industry experience as a Site Reliability Engineer
  • Experience with cloud platforms (AWS, GCP, Azure), containerization technologies (Docker, Kubernetes), and observability and alerting tools (Prometheus, Grafana, Elasticsearch, Jaeger)
  • Experience with scripting languages (Python, Bash)
  • Working knowledge of GitHub, GitHub Actions, and CI/CD concepts
  • Experience in ticket management, issue resolution, and troubleshooting
  • Strong problem-solving and troubleshooting skills
  • Excellent customer communication and interpersonal skills, with fluency in spoken and written English

Preferred Qualifications

  • Knowledge of AI/ML infrastructure and workloads
  • Knowledge of big data technologies (Kafka, Flink)
  • Knowledge of database technologies (MongoDB, PostgreSQL)

Hiring Process

Application review - Phone interview - Virtual onsite interview - VP interview/Core Value interview

June 30, 2025
