Aalyria
Connectivity Everywhere
Site Reliability Engineer
Location: United States
Posted: 123 days ago
Salary: $115K - $135K / year
Bachelor Degree, 4 yrs exp, English, AWS, Cloud, Distributed Systems, Google Cloud Platform, Grafana, Kubernetes, Prometheus, Python, Terraform, Go
Job Description
• Help design and build Aalyria's centralized observability platform, integrating and scaling tools for metrics (e.g. Prometheus), logging (e.g. Loki), and distributed tracing (e.g. Tempo/OpenTelemetry).
• Define, implement, and manage a robust framework of Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets for our core products, ensuring we are launch-ready.
• Partner with SWEs to implement observability best practices, develop standard templates and documentation, and configure tooling (e.g., OpenTelemetry libraries).
• Automate the deployment, scaling, and management of the entire observability stack using Infrastructure as Code (e.g. Terraform) and GitOps principles (e.g. ArgoCD).
• Partner closely with the core infrastructure team to ensure deep visibility into our Kubernetes clusters and underlying GCP and AWS environments.
• Develop and lead the company's monitoring, alerting, and incident response strategy, driving a culture of proactive reliability and blameless post-mortems.
Job Requirements
- 4+ years of experience in an SRE or platform engineering role, with a focus on observability for large-scale, distributed compute or network systems.
- Deep, hands-on expertise building, scaling, and managing observability platforms (e.g., Prometheus, Grafana, Loki/ELK, OpenTelemetry, Tempo/Jaeger, Honeycomb, etc.).
- Proven experience using these tools to support performance analysis and debugging of complex distributed systems.
- Strong production-level experience with Google Cloud Platform (GCP) and Kubernetes.
- Experience using Infrastructure as Code (IaC) and GitOps principles (e.g., ArgoCD).
- Proficiency in a systems programming language for debugging and writing tooling, with a strong preference for Go and Python.
- Demonstrable experience defining, implementing, and managing SLOs, SLIs, and error budgets for production services in high-availability distributed systems.
Benefits
- Innovative Environment: Work at a cutting-edge company shaping the future of aerospace communications.
- Impactful Work: Directly contribute to critical national security programs and initiatives.
- Growth Opportunities: Expand your career with opportunities for professional development and advancement.
- Inclusive Culture: Be part of a collaborative, supportive, and inclusive workplace where your contributions matter.
- Flexibility: Flexible working arrangements including hybrid remote/in-office schedules.
- Competitive salary, comprehensive benefits (401(k), dental, vision, health, life insurance), paid time off, and equity options.