[Remote] Site Reliability Engineer

Remote · USA · Full-time

Note: This is a remote position open to candidates in the USA. Runpod is a rapidly growing company that provides a foundational platform for developers to build and run custom AI systems. As a Site Reliability Engineer, you will ensure the stability and resilience of Runpod’s distributed platform by partnering with engineering teams, improving system design, and enhancing observability to prevent incidents.

Responsibilities

  • Define and implement SLIs/SLOs for critical services
  • Lead incident response and coordinate cross-team mitigation efforts
  • Conduct blameless postmortems and ensure corrective actions are completed
  • Perform production readiness reviews for new services and features
  • Identify systemic risks and drive preventative improvements
  • Design and improve monitoring, alerting, and dashboards (Prometheus, Grafana, etc.)
  • Improve signal-to-noise ratio in alerts and reduce alert fatigue
  • Build internal tooling for reliability tracking and reporting
  • Improve visibility into GPU performance and distributed systems health
  • Automate recurring operational workflows
  • Build tools and scripts (Python, Go, Bash) to eliminate manual processes
  • Improve deployment safety through automation and guardrails
  • Strengthen CI/CD reliability and release processes
  • Partner with engineering teams to improve system resilience
  • Provide guidance on fault tolerance, scalability, and failure handling
  • Contribute to architectural discussions with a reliability-first mindset

Skills

  • 5+ years of experience in SRE, Reliability Engineering, or Production Engineering
  • Strong Linux systems and networking expertise
  • Experience managing containerized production systems
  • Strong understanding of distributed systems and failure modes
  • Experience defining and managing SLIs/SLOs
  • Proven incident response and postmortem leadership experience
  • Strong scripting or programming skills
  • Experience with monitoring and alerting systems
  • Excellent written communication skills
  • Successful completion of a background check
  • Experience with GPU infrastructure or AI/ML platforms
  • Experience improving reliability in high-growth or large-scale environments
  • Familiarity with GPU observability tooling
  • Experience with Infrastructure as Code
  • Experience working in startup environments
  • Experience building internal reliability platforms or frameworks

Benefits

  • Meaningful equity in a fast-growing company: everyone on the team receives stock options, so your impact drives our growth and you share in the upside.
  • Generous medical, dental & vision plans
  • Flexible PTO: take the time you need to recharge
  • Most roles are remote-first, with inclusive, collaborative teams that use Slack as the main form of internal communication
  • Join a passionate team on the cutting edge of AI infrastructure — where culture, learning, and ownership are at the heart of how we scale.

Company Overview

  • Runpod is a cloud platform designed for GPUs, enabling developers to deploy customized full-stack AI applications. Founded in 2022, it is headquartered in Mount Laurel, New Jersey, USA, and has 51-200 employees. Its website is https://www.runpod.io.

Company H1B Sponsorship

  • Runpod has a track record of offering H1B sponsorships, with 4 in 2025 and 3 in 2024. Please note that this does not guarantee sponsorship for this specific role.