Andromeda

Where technology meets empathy – pioneering the future of human-robot interaction.

Site Reliability Engineer – AI Infrastructure

Full Time, Remote, Team 11-50

Location

California

Posted

6 days ago

Salary

Not specified

Bachelor Degree, 5 yrs exp, English, Ansible, Grafana, Kubernetes, Linux, Prometheus, Python, Terraform, Go

Job Description

  • Provision, configure, and operate Kubernetes-based clusters for customers across multiple providers
  • Build automation and tooling to streamline cluster deployments and integrations
  • Debug customer issues across networking, storage, scheduling, and system layers
  • Improve reliability and scalability of both training and inference infrastructure
  • Design and implement monitoring, alerting, and observability for critical systems
  • Collaborate with engineering and product teams to plan and deliver infrastructure for new services
  • Participate in on-call and incident response, leading postmortems and reliability improvements

Job Requirements

  • 5+ years of experience in SRE, DevOps, or infrastructure engineering roles
  • Strong Linux systems and networking fundamentals
  • Deep experience with Kubernetes and container orchestration at scale
  • Proficiency with Infrastructure-as-Code (Terraform, Helm, Ansible, etc.)
  • Strong automation and scripting skills (Python, Go, or Bash)
  • Experience with observability stacks (Prometheus, Grafana, Loki, Datadog, etc.)
  • Track record of operating production systems and leading incident response

Benefits

  • Ownership and autonomy to shape systems
  • Opportunities to work directly with customers and providers
