[Remote] Senior Compute Platform Engineer
Note: This is a remote position open to candidates in the USA. Stack AV is developing revolutionary AI and advanced autonomous systems to enhance safety and efficiency in the trucking transportation industry. The Senior Compute Platform Engineer will be responsible for designing and operating high-scale batch compute systems and workflow orchestration systems, ensuring reliability and efficiency for complex workloads.
Responsibilities
- Design and operate distributed systems for scheduling and executing large-scale batch workloads across Kubernetes clusters
- Build and maintain compute platform abstractions
- Optimize utilization of compute resources
- Develop and improve multi-tenant scheduling strategies
- Improve reliability and fault tolerance of large-scale distributed jobs and platform components
- Collaborate with teams across the company to understand workload requirements and improve platform capabilities
- Contribute to platform tooling, automation, and CI/CD workflows
Skills
- 7+ years of experience building and operating distributed systems or infrastructure platforms
- Strong experience with Kubernetes and container orchestration in production-grade environments
- Proficiency in Go (Golang) and Python development
- Experience designing and operating large-scale batch compute systems
- Strong debugging and problem-solving skills in complex distributed systems
- Ability to collaborate across teams and communicate technical concepts clearly
- Experience with at least one batch scheduling system such as Kueue, Armada, Volcano, or Slurm
Company Overview