Connectivity Everywhere

Staff Site Reliability Engineer

DevOps Engineer, Full Time, Remote, Team 51-200, H1B No Sponsor

Location

United States

Posted

123 days ago

Salary

$160K - $200K / year

Bachelor Degree, 7 yrs exp, English, AWS, Cloud, Distributed Systems, Google Cloud Platform, Grafana, Java, Kubernetes, Prometheus, Python, Terraform, Go

Job Description

• Design, build, and own the technical roadmap for Aalyria's centralized observability platform, integrating and scaling tools for metrics (Prometheus), logging (Loki), and distributed tracing (Tempo/OpenTelemetry)
• Define, implement, and manage a robust framework of Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets for our core products, ensuring we are launch-ready
• Establish and evangelize observability best practices, providing standards, documentation, and tooling (e.g., OpenTelemetry libraries) to empower our Go and Java application teams to instrument their services effectively
• Partner with core software engineers to provide the tools and insights needed to debug performance, optimize computational pipelines (including CPU/GPU workloads), and ensure the reliability of large-scale distributed systems
• Automate the deployment, scaling, and management of the entire observability stack using Infrastructure as Code (Terraform) and GitOps principles (ArgoCD)
• Partner closely with the core infrastructure team to ensure deep visibility into our Kubernetes clusters and underlying GCP and AWS environments
• Develop and lead the company's monitoring, alerting, and incident response strategy, driving a culture of proactive reliability and blameless post-mortems

Job Requirements

7+ years of experience in an SRE or platform engineering role
Deep, hands-on expertise building, scaling, and managing observability platforms (e.g., Prometheus, Grafana, Loki/ELK, OpenTelemetry, Tempo/Jaeger, Honeycomb, etc.)
Strong production-level experience with Google Cloud Platform (GCP) and Kubernetes
Proven mastery of Infrastructure as Code (IaC) with Terraform and GitOps principles (e.g., ArgoCD)
Proficiency in a systems programming language, with a strong preference for Go and Python for debugging and writing tooling
Demonstrable experience defining, implementing, and managing SLOs, SLIs, and error budgets for production services