Senior Staff Site Reliability Engineer

Full Time, Remote

Location

United States

Posted

10 hours ago

Salary

Not specified

Job Description

Role Description

We are hiring a highly experienced Senior Staff Site Reliability Engineer to act as a senior technical authority within our reliability function. This is a deeply hands-on individual contributor role focused on building and operating SRE practices at scale. You will:

Design and evolve resilient infrastructure
Drive reliability across multiple engineering streams
Ensure our AI-driven products operate with high availability, performance, and security
Work across platform, product, data, and ML teams
Help productionise models and standardise customer environments
Strengthen Kubernetes-based architecture
Mature our CI/CD pipelines end-to-end
Collaborate with other Staff engineers and Architects to shape the global product architecture and technology vision

Responsibilities

Architect, deploy, and operate scalable, secure production environments (AWS preferred)
Lead reliability improvements across multiple engineering streams
Design and evolve Kubernetes-based infrastructure, including migration and optimisation initiatives
Build and enforce strong Infrastructure-as-Code standards
Define and operationalise SLIs, SLOs, and error budgets
Strengthen observability across applications, infrastructure, data pipelines, and ML systems
Work closely with product and data teams to integrate model analytics and product telemetry into reliability insights
Work across and optimise the entire CI/CD pipeline, from build to deploy to rollback
Improve release safety, deployment frequency, and predictability of SLAs
Lead incident response for complex cross-system failures and drive postmortems
Reduce operational toil through automation and platform engineering improvements
Design processes and tooling to absorb, standardise, and troubleshoot customer environments
Support and productionise ML workloads (MLOps practices including model deployment, monitoring, and retraining workflows)
Ensure infrastructure aligns with enterprise-grade security and regulatory requirements
Mentor engineers and raise the overall reliability bar across teams

Qualifications

Extensive hands-on experience in SRE or Production Engineering roles
Demonstrated experience building or scaling SRE practices in high-growth or complex environments
Deep expertise in AWS or Azure-based cloud infrastructure
Strong experience with Kubernetes (including migration, scaling, and production hardening)
Advanced Infrastructure-as-Code experience (Terraform or equivalent)
End-to-end CI/CD pipeline design and optimisation experience
Strong experience with observability tooling across distributed systems
Experience troubleshooting complex multi-tenant or customer-hosted environments
Experience supporting production data platforms and ML systems
MLOps experience, including model deployment and monitoring
Strong understanding of distributed systems, scalability, and fault tolerance
Systems thinker who understands interactions across infrastructure, product, data, and ML
Excellent communication skills and ability to work cross-functionally

Preferred Experience

Experience in large-scale global B2B/B2C products
Experience working with AI/ML systems, NLP, or LLM-based products
Experience integrating product analytics and model performance metrics into operational monitoring
Background in enterprise environments with strong security and compliance requirements
Experience implementing regulatory controls within cloud infrastructure
Experience scaling infrastructure during rapid growth phases
Experience evaluating infrastructure tooling and vendors
Experience collaborating with large-scale enterprise customers to deploy and operate environments within their accounts and VPCs

Personal Characteristics

Strong problem solver who anticipates failure modes
High ownership mentality and accountability
Comfortable working across streams and influencing without formal authority
Learning-oriented with a drive for continuous improvement
