[Remote] Site Reliability Engineer | $70/hr
Note: This is a remote position open to candidates in the USA. Crossing Hurdles is seeking a Site Reliability Engineer to enhance its AI training environments. The role focuses on deploying and managing containerized systems while optimizing performance and keeping systems stable.
Responsibilities
- Deploy, monitor, and recover containerized AI training environments
- Troubleshoot infrastructure bottlenecks and resolve system failures in real time
- Build and manage resilient systems for stability and performance optimization
- Collaborate with engineering teams to improve CI/CD pipelines and automation
- Manage filesystem structures, storage, and process scheduling in containerized environments
- Replan dynamically when runtime issues or system failures occur
- Document system processes, solutions, and best practices
Skills
- Strong experience with terminal-based system administration and troubleshooting
- Expertise in containerized environments such as Docker or Kubernetes
- Strong Python skills for scripting, automation, and debugging
- Proficiency in Bash and familiarity with additional programming languages
- Strong understanding of infrastructure, build systems, and version control
- Ability to manage dynamic infrastructure recovery in high-pressure scenarios
- Excellent written and verbal communication skills
Company Overview