Senior Software Engineer/SRE - Core Communications
Location: New York
Business Area: Engineering and CTO
Ref #: 10045384

Description & Requirements

About Core Communications (CC):

We build the core messaging products that power Bloomberg’s internal and client communication: IB (Instant Bloomberg), MSG (Message), and other collaboration platforms. These systems are used by the financial industry to exchange billions of messages daily, from trade ideas and pricing quotes to mission-critical communications. We're building the backbone of financial dialogue, operating at massive scale and high stakes.

About our Team:

The Core Communications SRE team are the guardians of reliability and stability for all CC products. Our focus is on enabling teams to build and operate resilient, observable, and scalable systems. We define standards, provide tools, and lead reliability-focused initiatives across all stages of the development lifecycle. Our scope spans infrastructure, application health, and incident response, working closely with over 100 developers and multiple product and platform teams.

We view our systems holistically, from application code and cluster provisioning to monitoring pipelines and reliability governance. As our platforms evolve and scale, we proactively identify architectural and operational risks, and partner with teams to mitigate them. This includes defining meaningful SLOs with Product, strengthening our observability stack, and developing cross-cutting tools that improve diagnosis and response.

We’ll Trust You To:

Define and promote reliability-focused standards and best practices across observability, alerting, incident response, and provisioning
Build and maintain troubleshooting tools leveraging distributed tracing and health signals to accelerate root cause analysis
Partner with Product teams to define and measure meaningful SLOs aligned with user experience
Lead initiatives to identify and mitigate reliability risks across CC systems — spanning performance, capacity, and resiliency
Collaborate with developers to embed reliability into the software development lifecycle, from design through deployment
Help build a culture of reliability by advocating for failure-aware design and sharing best practices across teams
Develop automation to reduce manual operational effort and support scalable, safe growth of our infrastructure

What’s in it for you:

You’ll have a direct and visible impact on the stability, resilience, and scalability of IB and MSG, Bloomberg’s most fundamental and critical products, which the global financial industry relies on daily for essential decision-making and communication. The work you do will directly shape the reliability experience of our clients and internal users alike.

This role gives you the autonomy to drive reliability initiatives end-to-end, from infrastructure design and tooling to rollout and adoption across engineering teams. You’ll play a key role in fostering a culture of reliability within Core Communications, influencing how systems are built, monitored, and maintained.

In your day-to-day, you’ll help create tooling and frameworks to define and track reliability metrics that guide long-term stability efforts across our platforms. You’ll collaborate with teams to implement distributed tracing and end-to-end health monitoring, enabling faster debugging and deeper visibility into system behavior. You’ll contribute to the development of libraries, dashboards, and automation that bring consistency to alerting, provisioning, and incident response across the broader CC organization. And you’ll help lead the adoption of chaos testing and failure injection practices to validate how our systems perform under real-world stress.

You’ll work closely with engineers, product managers, and SREs across multiple teams and regions — building deep technical expertise and a strong cross-functional network. We also support ongoing learning through conference attendance, industry engagement, and knowledge-sharing, so you can continue to grow and bring fresh perspectives back into the team.

You’ll need to have:

4+ years of experience in software engineering, and experience working on an SRE team
Proficiency in Python and proven experience with C++
Strong understanding of distributed systems and system reliability
Familiarity with SLOs, SLIs, and SLAs, and how to relate system performance back to client impact
Strong collaboration and communication skills
A degree in Computer Science, Engineering, or equivalent practical experience

We’d love to see:

Hands-on experience with monitoring and alerting tools (e.g., Grafana, Splunk, distributed tracing)
Experience with Kafka and Java
Experience with chaos engineering, failure injection, or resilience testing frameworks
Exposure to capacity planning and scaling analysis
An interest in treating security as part of reliability
Contributions to open source or involvement in SRE communities
Awareness of industry compliance frameworks (e.g., DORA, SOC 2) and how they relate to system reliability
Experience with big data technologies such as Apache Spark and Amazon S3

Salary Range: 160,000 - 240,000 USD annually + benefits + bonus
The referenced salary range is based on the Company's good faith belief at the time of posting. Actual compensation may vary based on factors such as geographic location, work experience, market conditions, education/training and skill level.

We offer one of the most comprehensive and generous benefits plans available, along with a range of total rewards that may include merit increases, incentive compensation (exempt roles only), paid holidays, paid time off, medical, dental, vision, short- and long-term disability benefits, 401(k) + match, life insurance, and various wellness programs, among others. The Company does not provide benefits directly to contingent workers/contractors and interns.