[Remote] Senior Site Reliability Engineer, Core Cloud Engineering
Note: This is a remote position open to candidates in the USA. Vultr is on a mission to make high-performance cloud infrastructure easy to use, affordable, and locally accessible for enterprises and AI innovators around the world. The company is seeking a Senior Site Reliability Engineer to ensure the reliability and performance of Vultr's cloud services for its 1.5 million users, with a focus on large-scale distributed systems and infrastructure automation.
Responsibilities
- Production Control Plane Operations: Operate and scale Vultr’s control plane, ensuring availability, correctness, and performance across global datacenters
- Hypervisor & Infrastructure Reliability: Design, implement, and maintain automation to manage hypervisor fleets (KVM, QEMU, libvirt) and supporting infrastructure at scale
- Networking & Systems Automation: Develop tooling and automation for Open vSwitch (OVS), BGP routing, and other networking components to ensure resilient and self-healing network operations
- Performance & Reliability Tuning: Continuously analyze and improve system performance across compute, storage, and network layers, with an emphasis on reducing toil and eliminating single points of failure
- Observability & Incident Response: Implement advanced monitoring, logging, and tracing solutions (Grafana, Sentry, SumoLogic) while leading incident response to minimize impact and drive postmortem culture
- CI/CD & Configuration Management: Maintain and evolve infrastructure pipelines (GitLab CI/CD, Puppet) to enable safe, fast, and reliable changes to both control plane and hypervisor infrastructure
- Collaboration: Work closely with Software Engineers, Network Engineers, and Product teams to align platform reliability with business and user needs
- Documentation & Standards: Produce clear technical documentation for runbooks, operational procedures, and automation frameworks to improve team efficiency and reliability standards
- Mentorship & Leadership: Coach and mentor team members in best practices for site reliability, incident handling, automation, and low-level Linux systems debugging
Skills
- Proficiency in PHP with strong scripting and automation skills
- Experience running large-scale distributed systems and control plane infrastructure in production
- Strong background in hypervisor technologies (libvirt, QEMU, KVM) and Linux systems administration
- Expertise in networking protocols and tools, particularly BGP and Open vSwitch (OVS), with automation experience
- Deep knowledge of observability and monitoring frameworks (Grafana, Sentry, SumoLogic) and incident management
- Advanced troubleshooting skills across compute, networking, and storage subsystems
- Experience building and maintaining CI/CD pipelines (GitLab) and configuration management (Puppet)
- Familiarity with MySQL or similar databases, with an understanding of operational considerations for reliability and scale
- Strong problem-solving abilities and the drive to tackle complex, low-level reliability challenges
- Effective cross-team communication and collaboration skills
- A commitment to continuous improvement and fostering a culture of operational excellence
Benefits
- Excellent Medical Benefits with 100% company-paid premiums for the employee-only plan + 100% company-paid dental & vision premiums
- 401(k) plan that matches 100% up to 4% with immediate vesting
- Professional Development Reimbursement of $2,500 each year
- 11 Holidays + Paid Time Off Accrual + Rollover Plan
- Commitment matters to Vultr! Increased PTO at 3-year & 10-year anniversaries + 1-month paid sabbatical every 5 years + Anniversary Bonus each year
- $500 first year remote office setup + $400 each following year for new equipment
- Internet reimbursement up to $75 per month
- Gym membership reimbursement up to $50 per month
- Company paid Wellable subscription