[Remote] Sr. Site Reliability Engineer (AI Platforms)

Remote · USA · Full-time

Note: This is a remote position open to candidates in the USA.

Optomi, in partnership with a premier client in the financial services industry, is seeking a Site Reliability Engineer to establish and scale reliability practices for AI-powered applications and services in production. This role will drive production readiness, observability, incident management, and automation while partnering closely with engineering teams to ensure highly available, resilient systems.

Responsibilities

Define and enforce production readiness standards for AI services and agent-based applications prior to deployment
Establish and manage SLIs, SLOs, and error budgets, including burn-rate monitoring and alerting
Ensure services have appropriate runbooks, rollback procedures, monitoring, and on-call ownership
Track reliability metrics and enforce operational standards across engineering teams
Instrument AI services and agent pipelines using structured JSON logging, custom metrics, and distributed tracing
Build dashboards and alerting for service health, latency, error rates, dependency performance, and agent execution metrics
Identify and address observability gaps unique to AI systems, including context limitations, model timeouts, tool invocation failures, and partial task execution
Develop monitoring strategies that surface reliability risks before production impact occurs
Build and maintain automation that supports production readiness reviews, incident analysis, SLO monitoring, and reliability validation
Develop tooling and workflows that automate operational checks and reliability enforcement
Maintain reliability standards, operational documentation, runbooks, and service ownership mappings
Continuously evolve reliability controls as new failure patterns emerge across AI-powered systems
Lead incident response and post-incident review efforts for production services
Perform root cause analysis and drive remediation efforts through completion
Identify recurring failure patterns and implement systemic reliability improvements
Support on-call operations and validate escalation processes for critical services
Review application architectures, infrastructure designs, and code changes through a reliability lens
Evaluate resiliency patterns such as retries, circuit breakers, health checks, graceful degradation, and rollback strategies
Partner with engineering teams to address reliability risks before production deployment

Skills

4+ years of experience in Site Reliability Engineering, Platform Engineering, DevOps, or Production Operations
Hands-on experience managing production services and reliability programs
Strong understanding of SLI/SLO frameworks, error budgets, and operational excellence practices
Experience building monitoring, alerting, and observability solutions using platforms such as Datadog, Dynatrace, New Relic, Grafana, or similar
Strong scripting or programming experience with Python, TypeScript, or comparable languages
Experience with distributed systems observability, including structured logging, metrics, and tracing
Experience supporting AI/ML, automation, or data-driven platforms in production
Strong background leading incident response and post-incident review processes
Experience integrating operational workflows with ticketing and documentation platforms
Experience working within regulated or highly available production environments

Company Overview

Optomi is an IT staffing firm that takes a consultant-focused approach to serving its consultants, clients, and employees. It was founded in 2012 and is headquartered in Roswell, Georgia, USA, with a workforce of 501-1000 employees. Its website is http://www.optomi.com/.

Company H1B Sponsorship

Optomi has a track record of offering H1B sponsorships, with 7 in 2025, 6 in 2024, 2 in 2023, 5 in 2022, 8 in 2021, and 7 in 2020. Please note that this does not guarantee sponsorship for this specific role.

