Andromeda
Where technology meets empathy – pioneering the future of human-robot interaction.
Site Reliability Engineer – AI Infrastructure
Location
California
Posted
6 days ago
Salary
Not specified
Bachelor Degree, 5 yrs exp, English, Ansible, Grafana, Kubernetes, Linux, Prometheus, Python, Terraform, Go
Job Description
• Provision, configure, and operate Kubernetes-based clusters for customers across multiple providers
• Build automation and tooling to streamline cluster deployments and integrations
• Debug customer issues across networking, storage, scheduling, and system layers
• Improve reliability and scalability of both training and inference infrastructure
• Design and implement monitoring, alerting, and observability for critical systems
• Collaborate with engineering and product teams to plan and deliver infrastructure for new services
• Participate in on-call and incident response, leading postmortems and reliability improvements
Job Requirements
- 5+ years of experience in SRE, DevOps, or infrastructure engineering roles
- Strong Linux systems and networking fundamentals
- Deep experience with Kubernetes and container orchestration at scale
- Proficiency with Infrastructure-as-Code (Terraform, Helm, Ansible, etc.)
- Strong automation and scripting skills (Python, Go, or Bash)
- Experience with observability stacks (Prometheus, Grafana, Loki, Datadog, etc.)
- Track record of operating production systems and leading incident response
Benefits
- Ownership and autonomy to shape systems
- Opportunities to work directly with customers and providers