[Remote] Software Engineer - Site Reliability Engineering
Note: This is a remote position open to candidates in the USA. Noctua Technology is a software engineering and consulting corporation focused on data engineering, machine learning, and cloud technologies. They are seeking a motivated Site Reliability Engineer (SRE) to apply software engineering principles to operations, ensuring the reliability, scalability, and performance of production systems.
Responsibilities
- Define, measure, and report on Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to ensure system reliability and uptime
- Develop and deploy Infrastructure as Code (IaC) using Terraform, CloudFormation, or similar tools, with an emphasis on repeatability and change management
- Implement and manage containerized and serverless architectures using Docker, Kubernetes, and cloud-native services, focusing on performance and error budgets
- Build and maintain reliable and self-healing CI/CD pipelines to automate deployments and improve development workflows
- Implement and refine comprehensive monitoring, alerting, and logging to detect and address performance and availability issues proactively
- Eliminate toil by extensively automating operational tasks, including provisioning, patching, and deployments, using scripting and configuration management tools such as Python, Bash, or Ansible
- Conduct post-incident reviews (blameless postmortems) to drive continuous improvement in system reliability and operational processes
- Implement cloud security best practices, including identity and access management (IAM), encryption, and compliance controls
- Proactively identify and address system weaknesses and ensure performance under stress
- Support disaster recovery and high availability strategies through backup and failover planning
- Collaborate with development teams to improve the operability and production readiness of applications from design through deployment
- Create and maintain documentation for cloud architectures, deployment processes, and best practices
- Contribute to internal knowledge-sharing initiatives, ensuring continuous learning within the team
- Provide technical guidance and support to clients and internal teams on cloud infrastructure and reliability best practices, with a focus on defining Service Level Agreements (SLAs)
- Act on client feedback to refine and enhance cloud solutions
- Conduct training and knowledge-sharing sessions to help clients manage their cloud environments effectively
- Stay current with developments in cloud infrastructure and related technologies
- Drive innovation by proposing and implementing new techniques and technologies
Skills
- 1-5 years of experience in site reliability engineering, cloud engineering, or related fields
- Strong software engineering skills with an emphasis on writing clean, modular, and maintainable code, specifically for automation and system management
- Proficiency in Infrastructure as Code (IaC) tools like Terraform or CloudFormation
- Experience with containerization and orchestration tools like Docker and Kubernetes
- Knowledge of networking concepts, cloud security best practices, and identity management
- Experience with programming or scripting languages such as Python, Bash, or Go
- Familiarity with CI/CD pipelines and DevOps methodologies
- Strong problem-solving skills and the ability to troubleshoot complex cloud environments
- Effective communication skills and a willingness to learn and collaborate
- Bachelor's or advanced degree in Computer Science or a related field
- Google Cloud Professional Cloud Architect
- Google Cloud Professional Cloud DevOps Engineer
- AWS Certified Solutions Architect
- AWS Certified Developer
- AWS Certified SysOps Administrator
- Azure Solutions Architect Expert
- CompTIA Security+ certification or an equivalent DoD 8140/8570 IAT Level II baseline certification
Company Overview