This job was posted more than 40 days ago and might be expired.
Remote - United Kingdom
Key Responsibilities
Incident & Problem Management
- Lead major incident (MI) bridges and restore service with minimum business impact.
- Handle all L3 escalations and perform deep diagnostics across Java, JVM, middleware, OS, and infrastructure.
- Own technical RCAs and drive long‑term, systemic remediation.
- Identify recurring failure patterns and risks.
Reliability Engineering
- Apply SRE principles: SLIs/SLOs, error budgets, resilience patterns.
- Tune JVM parameters, analyze thread/heap dumps, and improve performance.
- Influence application architecture for fault tolerance, scalability, and recoverability.
- Validate DR readiness, failover behavior, and resilience testing outcomes.
Change, Release & Risk
- Provide technical approval and risk assessment for high-risk changes.
- Enforce operational readiness for new apps and major releases.
- Ensure changes meet audit, compliance, and regulatory expectations.
Automation, Monitoring & Observability
- Build advanced automation using Shell/Python/PowerShell.
- Develop frameworks for health validation, automated recovery, and compliance checks.
- Define observability standards; optimize alerts and improve MTTR.
Leadership & Mentorship
- Mentor L1/L2 teams; review and approve runbooks, SOPs, and KB articles.
- Act as a trusted technical advisor to stakeholders and leadership.
Skills & Qualifications
Technical (Mandatory)
- Strong knowledge of application architecture, distributed systems, and middleware.
- Java expertise: JVM internals, GC, memory management, thread/heap dump analysis, performance tuning.
- .NET expertise: CLR internals, garbage collection, memory management, thread/dump analysis, performance tuning.
- Strong Unix/Linux, networking basics, and advanced scripting (Shell/Python/PowerShell/VBS).
- Advanced SQL and understanding of databases; Autosys (or equivalent scheduler).
- Hands-on experience with observability tools: Splunk, AppDynamics/Dynatrace, ELK, Grafana, Prometheus.
Reliability & Operations
- Major incident leadership, deep RCA, change/release readiness, DR & resilience engineering.
- Experience in regulated production environments.
Soft Skills
- Strong technical leadership and decision‑making.
- Clear communication during high‑pressure incidents.
- Ownership mindset and business awareness.
Experience & Education
- 7–12+ years in Application Reliability, Production Support, SRE, or platform operations.
- Bachelor’s degree in Computer Science/Engineering or equivalent.
- ITIL, cloud, or industry certifications (preferred).
- Banking/financial domain experience (preferred).
Working Conditions
- On‑call and after‑hours support as required.
- Fast‑paced environment with multiple priorities.
- Hybrid working model.
Ensono
Ensono helps enterprises modernize and manage mission-critical IT with flexibility, expertise, and innovation, spanning mainframe, cloud, and AI to deliver better business outcomes.