[Remote] Sr. Site Reliability Engineer (AI Platforms)
Note: This is a remote position open to candidates in the USA.
Optomi, in partnership with a premier client in the financial services industry, is seeking a Site Reliability Engineer to establish and scale reliability practices for AI-powered applications and services in production. This role will drive production readiness, observability, incident management, and automation while partnering closely with engineering teams to ensure highly available, resilient systems.
Responsibilities
- Define and enforce production readiness standards for AI services and agent-based applications prior to deployment
- Establish and manage SLIs, SLOs, and error budgets, including burn-rate monitoring and alerting
- Ensure services have appropriate runbooks, rollback procedures, monitoring, and on-call ownership
- Track reliability metrics and enforce operational standards across engineering teams
- Instrument AI services and agent pipelines using structured JSON logging, custom metrics, and distributed tracing
- Build dashboards and alerting for service health, latency, error rates, dependency performance, and agent execution metrics
- Identify and address observability gaps unique to AI systems, including context limitations, model timeouts, tool invocation failures, and partial task execution
- Develop monitoring strategies that surface reliability risks before production impact occurs
- Build and maintain automation that supports production readiness reviews, incident analysis, SLO monitoring, and reliability validation
- Develop tooling and workflows that automate operational checks and reliability enforcement
- Maintain reliability standards, operational documentation, runbooks, and service ownership mappings
- Continuously evolve reliability controls as new failure patterns emerge across AI-powered systems
- Lead incident response and post-incident review efforts for production services
- Perform root cause analysis and drive remediation efforts through completion
- Identify recurring failure patterns and implement systemic reliability improvements
- Support on-call operations and validate escalation processes for critical services
- Review application architectures, infrastructure designs, and code changes through a reliability lens
- Evaluate resiliency patterns such as retries, circuit breakers, health checks, graceful degradation, and rollback strategies
- Partner with engineering teams to address reliability risks before production deployment
Skills
- 4+ years of experience in Site Reliability Engineering, Platform Engineering, DevOps, or Production Operations
- Hands-on experience managing production services and reliability programs
- Strong understanding of SLI/SLO frameworks, error budgets, and operational excellence practices
- Experience building monitoring, alerting, and observability solutions using platforms such as Datadog, Dynatrace, New Relic, Grafana, or similar
- Strong scripting or programming experience with Python, TypeScript, or comparable languages
- Experience with distributed systems observability, including structured logging, metrics, and tracing
- Experience supporting AI/ML, automation, or data-driven platforms in production
- Strong background leading incident response and post-incident review processes
- Experience integrating operational workflows with ticketing and documentation platforms
- Experience working within regulated or highly available production environments