Aalyria
Connectivity Everywhere
Staff Site Reliability Engineer
Location
United States
Posted
123 days ago
Salary
$160K - $200K / year
Bachelor Degree, 7 yrs exp, English, AWS, Cloud, Distributed Systems, Google Cloud Platform, Grafana, Java, Kubernetes, Prometheus, Python, Terraform, Go
Job Description
• Design, build, and own the technical roadmap for Aalyria's centralized observability platform, integrating and scaling tools for metrics (Prometheus), logging (Loki), and distributed tracing (Tempo/OpenTelemetry)
• Define, implement, and manage a robust framework of Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets for our core products, ensuring we are launch-ready
• Establish and evangelize observability best practices, providing standards, documentation, and tooling (e.g., OpenTelemetry libraries) to empower our Go and Java application teams to instrument their services effectively
• Partner with core software engineers to provide the tools and insights needed to debug performance, optimize computational pipelines (including CPU/GPU workloads), and ensure the reliability of large-scale distributed systems
• Automate the deployment, scaling, and management of the entire observability stack using Infrastructure as Code (Terraform) and GitOps principles (ArgoCD)
• Partner closely with the core infrastructure team to ensure deep visibility into our Kubernetes clusters and underlying GCP and AWS environments
• Develop and lead the company's monitoring, alerting, and incident response strategy, driving a culture of proactive reliability and blameless post-mortems
Job Requirements
- 7+ years of experience in an SRE or platform engineering role
- Deep, hands-on expertise building, scaling, and managing observability platforms (e.g., Prometheus, Grafana, Loki/ELK, OpenTelemetry, Tempo/Jaeger, Honeycomb)
- Strong production-level experience with Google Cloud Platform (GCP) and Kubernetes
- Proven mastery of Infrastructure as Code (IaC) with Terraform and GitOps principles (e.g., ArgoCD)
- Proficiency in a systems programming language, with a strong preference for Go, plus Python for debugging and writing tooling
- Demonstrable experience defining, implementing, and managing SLOs, SLIs, and error budgets for production services
Benefits
- Competitive salary
- Comprehensive benefits (401(k), dental, vision, health, life insurance)
- Paid time off
- Equity options
- Flexible working arrangements including hybrid remote/in-office schedules