Where technology meets empathy – pioneering the future of human-robot interaction.

Site Reliability Engineer – AI Infrastructure

Full Time, Remote, Team 11-50

Location

California

Posted

6 days ago

Salary

Not specified

Bachelor Degree, 5 yrs exp, English, Ansible, Grafana, Kubernetes, Linux, Prometheus, Python, Terraform, Go

Job Description

• Provision, configure, and operate Kubernetes-based clusters for customers across multiple providers
• Build automation and tooling to streamline cluster deployments and integrations
• Debug customer issues across networking, storage, scheduling, and system layers
• Improve reliability and scalability of both training and inference infrastructure
• Design and implement monitoring, alerting, and observability for critical systems
• Collaborate with engineering and product teams to plan and deliver infrastructure for new services
• Participate in on-call and incident response, leading postmortems and reliability improvements

Job Requirements

5+ years of experience in SRE, DevOps, or infrastructure engineering roles
Strong Linux systems and networking fundamentals
Deep experience with Kubernetes and container orchestration at scale
Proficiency with Infrastructure-as-Code (Terraform, Helm, Ansible, etc.)
Strong automation and scripting skills (Python, Go, or Bash)
Experience with observability stacks (Prometheus, Grafana, Loki, Datadog, etc.)
Track record of operating production systems and leading incident response

Benefits

Ownership and autonomy to shape systems
Opportunities to work directly with customers and providers
