ROLE OVERVIEW

The Junior Site Reliability Engineer (SRE) is responsible for ensuring the availability, performance, and reliability of production systems hosted on Google Cloud Platform (GCP), with a strong focus on voice and real-time communication services. This role provides L2 production support, actively manages incidents, and drives root cause analysis to prevent recurrence. You will work closely with engineering, network, and operations teams to improve system resilience, automate operational tasks, and meet SLA commitments. The ideal candidate brings a strong mix of cloud reliability engineering and voice/VoIP technical expertise in a live production environment.

SPECIFIC DUTIES AND RESPONSIBILITIES

Monitor Production Systems: Use monitoring tools (e.g., Cloud Monitoring) to ensure the health and performance of cloud-based production systems on Google Cloud Platform (GCP).
Incident Management: Respond to production incidents, triage issues, and ensure timely resolution. Perform root cause analysis (RCA) and document findings.
Performance Tuning: Analyze system performance, identify bottlenecks, and make recommendations for improvements to optimize service reliability, scalability, and speed.
System Alerts and Incident Escalation: Set up and maintain system alerts to proactively detect issues. Escalate critical issues to appropriate teams and ensure swift resolution.
Collaboration with Engineering: Work closely with development and operations teams to ensure smooth production releases, provide feedback on system performance, and implement monitoring solutions for new services.
System Documentation: Maintain documentation of system configurations, monitoring setups, and incident resolutions to support knowledge sharing across teams.
Service Level Agreements (SLAs): Track and report on SLA performance, ensuring that production services meet predefined availability and reliability standards.
Proactive System Health Checks: Conduct routine system health checks, reviewing logs and performance metrics, to ensure system uptime.
Disaster Recovery and Backup: Monitor backup systems and ensure that disaster recovery procedures are in place and tested.

COMPETENCIES

Core Competencies

3+ years of experience in cloud production support, Site Reliability Engineering, or System Reliability roles
3+ years of hands-on experience with Google Cloud Platform (GCP), including Compute Engine, GKE, Cloud Monitoring, Logging, and Storage
3+ years of experience using monitoring and observability tools to track system health and performance
3+ years of experience analyzing system performance metrics (CPU, memory, disk, network) and diagnosing issues
3+ years of experience managing incidents and troubleshooting live production systems
3+ years of experience in scripting or automation using Bash, Python, or similar languages

Complementary Competencies

Strong experience with VoIP and UC technologies including SIP, RTP/SRTP, WebRTC, SBCs (Ribbon, Oracle, AudioCodes), SIP trunks, gateways, and voice codecs (G.711, G.729)
Proven ability to troubleshoot IP telephony and real-time communications using tools such as Wireshark and network analyzers
Solid understanding of network fundamentals (TCP/IP, VLANs, routing, switching, QoS) and voice security best practices (TLS, SRTP, firewalls)
Experience integrating voice, contact center (ACD/IVR), and UC platforms within cloud-native and hybrid environments
Proficiency in automation and scripting for voice and system management (Python, Bash, PowerShell)

Experience with observability and monitoring tools (Prometheus, Grafana, Zabbix, Elastic Stack)
Hands-on exposure to network and VoIP analysis tools such as Netscout NG1 and Wireshark
Familiarity with automation and CI/CD tools (Ansible, N8N, Jenkins, GitLab CI/CD)
Exposure to multi-cloud environments (AWS, Azure)

Certifications (Preferred)

CCNA (Collaboration) or CompTIA Network+
Cloud certifications (GCP, AWS, or Azure)

QUALIFICATIONS

Educational Qualifications

Bachelor’s degree in Computer Science, Information Technology, or a related field.

Work Conditions

Work From Home Set-up
Night Shift (8 PM to 5 AM), rotating weekend shifts

Site Reliability Engineer (GCP) (Work From Home)
