[Remote] Site Reliability Engineering Manager
Note: This is a remote position open to candidates in the USA. Dice is seeking a Senior Manager of Site Reliability Engineering (SRE) to strengthen SRE practices within the Financial Services & Innovation (FS&I) organization. The role involves establishing operational discipline, driving SRE standards, and aligning teams to improve reliability and performance.
Responsibilities
- Drive adoption of the SRE operating model across application teams
- Establish clear role boundaries between:
  - SRE
  - Production Support Engineering (PSE)
  - Application teams
- Ensure SRE practices are embedded into the development lifecycle, not treated as post-production activities
- Define and enforce:
  - SLIs, SLOs, and Error Budgets
  - Production readiness criteria
  - Reliability best practices
- Lead SLO adoption and compliance reviews across the organization
- Establish governance frameworks to ensure consistent application of standards
- Partner with:
  - Application product teams
  - Production Support Engineering (MG team)
  - Platform / Infrastructure / Observability teams
- Drive alignment and reduce friction between engineering and operations
- Ensure clear handoffs, escalation models, and operational ownership
- Lead adoption of centralized observability standards across:
  - Metrics
  - Logging
  - Tracing
- Align tooling (AppDynamics, Splunk, Prometheus, etc.)
- Ensure monitoring and alerting are SLO-driven and actionable, not noise-based
- Partner with PSE to strengthen:
  - Incident management processes
  - RCA (Root Cause Analysis) standards
- Drive identification of patterns and systemic issues
- Ensure learnings translate into engineering improvements and automation
- Identify opportunities to:
  - Reduce manual operational work
  - Improve system resilience
  - Enable self-healing capabilities
- Promote a culture of engineering over reaction
- Define and track reliability metrics across FS&I
- Build reporting that provides visibility into:
  - System health
  - Incident trends
  - SLO performance
- Translate technical data into actionable business insights
Skills
- 10+ years in engineering, operations, or SRE roles
- 5+ years leading SRE, platform, or reliability-focused teams
- Proven experience implementing SRE practices at scale (SLIs, SLOs, error budgets)
- Strong background in cloud environments (AWS, Azure, Google Cloud Platform)
- Hands-on experience with observability tools (Splunk, AppDynamics, Prometheus, etc.)
- Experience in incident management and production operations at scale
- Ability to operate effectively in high-pressure and complex enterprise environments
- Experience driving organizational transformation (not just technical implementation)
- Strong understanding of CI/CD, DevOps, and automation practices
- Experience working in regulated or large enterprise environments
- Familiarity with AIOps or advanced automation strategies