Senior Staff Site Reliability Engineer

Full Time, Remote

Location

United States

Posted

10 hours ago

Salary

Not specified

Job Description

Role Description

We are hiring a highly experienced Senior Staff Site Reliability Engineer to act as a senior technical authority within our reliability function. This is a deeply hands-on individual contributor role focused on building and operating SRE practices at scale. You will:

  • Design and evolve resilient infrastructure
  • Drive reliability across multiple engineering streams
  • Ensure our AI-driven products operate with high availability, performance, and security
  • Work across platform, product, data, and ML teams
  • Help productionise models and standardise customer environments
  • Strengthen Kubernetes-based architecture
  • Mature our CI/CD pipelines end-to-end
  • Collaborate with other Staff engineers and Architects to shape the global product architecture and technology vision

Responsibilities

  • Architect, deploy, and operate scalable, secure production environments (AWS preferred)
  • Lead reliability improvements across multiple engineering streams
  • Design and evolve Kubernetes-based infrastructure, including migration and optimisation initiatives
  • Build and enforce strong Infrastructure-as-Code standards
  • Define and operationalise SLIs, SLOs, and error budgets
  • Strengthen observability across applications, infrastructure, data pipelines, and ML systems
  • Work closely with product and data teams to integrate model analytics and product telemetry into reliability insights
  • Work across and optimise the entire CI/CD pipeline, from build to deploy to rollback
  • Improve release safety, deployment frequency, and predictability of SLAs
  • Lead incident response for complex cross-system failures and drive postmortems
  • Reduce operational toil through automation and platform engineering improvements
  • Design processes and tooling to absorb, standardise, and troubleshoot customer environments
  • Support and productionise ML workloads (MLOps practices including model deployment, monitoring, and retraining workflows)
  • Ensure infrastructure aligns with enterprise-grade security and regulatory requirements
  • Mentor engineers and raise the overall reliability bar across teams

Qualifications

  • Extensive hands-on experience in SRE or Production Engineering roles
  • Demonstrated experience building or scaling SRE practices in high-growth or complex environments
  • Deep expertise in AWS or Azure-based cloud infrastructure
  • Strong experience with Kubernetes (including migration, scaling, and production hardening)
  • Advanced Infrastructure-as-Code experience (Terraform or equivalent)
  • End-to-end CI/CD pipeline design and optimisation experience
  • Strong experience with observability tooling across distributed systems
  • Experience troubleshooting complex multi-tenant or customer-hosted environments
  • Experience supporting production data platforms and ML systems
  • MLOps experience, including model deployment and monitoring
  • Strong understanding of distributed systems, scalability, and fault tolerance
  • Systems thinker who understands interactions across infrastructure, product, data, and ML
  • Excellent communication skills and ability to work cross-functionally

Preferred Experience

  • Experience with large-scale global B2B/B2C products
  • Experience working with AI/ML systems, NLP, or LLM-based products
  • Experience integrating product analytics and model performance metrics into operational monitoring
  • Background in enterprise environments with strong security and compliance requirements
  • Experience implementing regulatory controls within cloud infrastructure
  • Experience scaling infrastructure during rapid growth phases
  • Experience evaluating infrastructure tooling and vendors
  • Experience collaborating with large-scale enterprise customers to deploy and operate environments within their accounts and VPCs

Personal Characteristics

  • Strong problem solver who anticipates failure modes
  • High ownership mentality and accountability
  • Comfortable working across streams and influencing without formal authority
  • Learning-oriented with a drive for continuous improvement
