Site Reliability Engineering Specialist I

BlackBerry.com

Office

Waterloo, Ontario, Canada

Full Time

Worker Sub-Type:

Regular

Job Description:

Are you the person we're looking for?

Are you fascinated by cloud-native technologies? Do you feel a nagging disquiet when you see something that isn't automated well, or worse, not at all? Are you intrigued by the challenges presented by running globally distributed, multi-cloud infrastructure that enterprises and governments around the world depend on?

As a Site Reliability Engineer (SRE) on the BlackBerry Service Engineering & Operations team, you'll be responsible for keeping BlackBerry services running smoothly and securely, with the availability that our customers expect. You'll do this by blending operational discipline with systems engineering principles, emphasizing robust automation and rigorous observability.

The kind of SRE we want is comfortable with Kubernetes and containers, particularly in public clouds such as AWS and Azure. You're no stranger to Git workflows and GitOps practices. Need that merge request rebased? No problem. GitLab pipeline failing? You've got it covered. You've used Terraform enough to know how to navigate its idiosyncrasies. You're able to wrangle PromQL to build the perfect Grafana dashboards to monitor your services and craft high-quality, actionable alerts. You might even enjoy spending your free time going for long walks on the beach while thinking about all the ways complex systems can fail -- who are we to judge!

Toil is your enemy. You consider it a good day when you can fire up Vim and whip up a Bash or Python script to automate an annoying task you do regularly. VS Code? That works too, we're equal opportunity at BlackBerry. (Unless you use Emacs, of course.) Polyglot? Even better. We've also got tools written in Go, Ruby, and even C++.

A broader knowledge of programming languages and software development practices is a strong asset: it helps you build and manage world-class services, and it makes you a better partner to BlackBerry's development teams, which is a fundamental part of an SRE's job. Software architects, developers, and product owners will look to you for the infrastructure and operational insights that shape the solutions our customers use every day.

If this sounds like you, come join our team of SREs and help us solve interesting and challenging problems!

Responsibilities

Partner with product development teams to design, build and maintain secure, highly available cloud-based services
Support CI/CD pipelines
Ensure the services you support have essential metrics, high-quality dashboards and alerts, and well-documented runbooks
Maintain existing services by measuring overall system health and ensuring platforms and related software are current and patched
Be a member of an on-call rotation (includes additional compensation) in a global 24x7 environment, responding to escalations, performing root cause analyses, and striving to ensure the same incident never occurs twice
With support from BlackBerry, apply for Canadian Secret security clearance to support certain infrastructure within our portfolio
Help maintain our catalog of reusable, cross-service automation and build custom automation as needed for your services
Find inventive ways of reducing costs and improving the performance of existing systems
Plan for infrastructure and services to meet targeted SLOs and capacity
Document as much as possible, and automate everything else

Skills And Qualifications

Post-secondary degree in Computer Science or related technical discipline, or equivalent practical experience
Three or more years of experience working with cloud technologies, systems administration, or related fields in a production environment
Extremely comfortable using Linux and navigating around the shell
Experience with public clouds such as AWS or Azure
Experience automating infrastructure deployments using tools like Terraform, Ansible, Chef, Puppet, Salt, etc.
Sound knowledge of observability principles and experience using solutions such as Prometheus, Grafana, Zabbix, or related SaaS
Experience using container orchestration platforms such as Kubernetes or Docker Swarm
Good understanding of the full infrastructure stack: networks and network protocols, block and object storage, virtualization and operating systems, traffic steering (especially load balancers and DNS), and databases
Experience with CI/CD pipelines using solutions such as GitLab CI/CD, GitHub Actions, or Jenkins
Competent and preferably fluent in at least one programming language (Bash counts, but something like Python, JavaScript, Ruby, or Go is preferred)

Projects you could be part of

Plan, design and migrate globally deployed services from private to public clouds
Improve and build upon our existing observability platform by researching and prototyping new monitoring, logging or tracing technologies
Partner with a product development team to help bring new secure and resilient services from idea to production
Work with a task force to help improve the performance and availability of business-critical service deployments
Become part of a DevOps group to build custom solutions, establish deployment patterns and help standardize operational practices

Scheduled Weekly Hours: