Senior Staff Site Reliability Engineer
Full Time, Remote
Location: United States
Posted: 10 hours ago
Salary: Not specified
Job Description
Role Description
We are hiring a highly experienced Senior Staff Site Reliability Engineer to act as a senior technical authority within our reliability function. This is a deeply hands-on individual contributor role focused on building and operating SRE practices at scale. You will:
- Design and evolve resilient infrastructure
- Drive reliability across multiple engineering streams
- Ensure our AI-driven products operate with high availability, performance, and security
- Work across platform, product, data, and ML teams
- Help productionise models and standardise customer environments
- Strengthen Kubernetes-based architecture
- Mature our CI/CD pipelines end-to-end
- Collaborate with other Staff engineers and Architects to shape the global product architecture and technology vision
Responsibilities
- Architect, deploy, and operate scalable, secure production environments (AWS preferred)
- Lead reliability improvements across multiple engineering streams
- Design and evolve Kubernetes-based infrastructure, including migration and optimisation initiatives
- Build and enforce strong Infrastructure-as-Code standards
- Define and operationalise SLIs, SLOs, and error budgets
- Strengthen observability across applications, infrastructure, data pipelines, and ML systems
- Work closely with product and data teams to integrate model analytics and product telemetry into reliability insights
- Work across and optimise the entire CI/CD pipeline, from build to deploy to rollback
- Improve release safety, deployment frequency, and predictability of SLAs
- Lead incident response for complex cross-system failures and drive postmortems
- Reduce operational toil through automation and platform engineering improvements
- Design processes and tooling to absorb, standardise, and troubleshoot customer environments
- Support and productionise ML workloads (MLOps practices including model deployment, monitoring, retraining workflows)
- Ensure infrastructure aligns with enterprise-grade security and regulatory requirements
- Mentor engineers and raise the overall reliability bar across teams
Qualifications
- Extensive hands-on experience in SRE or Production Engineering roles
- Demonstrated experience building or scaling SRE practices in high-growth or complex environments
- Deep expertise in AWS or Azure-based cloud infrastructure
- Strong experience with Kubernetes (including migration, scaling, and production hardening)
- Advanced Infrastructure-as-Code experience (Terraform or equivalent)
- End-to-end CI/CD pipeline design and optimisation experience
- Strong experience with observability tooling across distributed systems
- Experience troubleshooting complex multi-tenant or customer-hosted environments
- Experience supporting production data platforms and ML systems
- MLOps experience, including model deployment and monitoring
- Strong understanding of distributed systems, scalability, and fault tolerance
- Systems thinker who understands interactions across infrastructure, product, data, and ML
- Excellent communication skills and ability to work cross-functionally
Preferred Experience
- Experience in large-scale global B2B/B2C products
- Experience working with AI/ML systems, NLP, or LLM-based products
- Experience integrating product analytics and model performance metrics into operational monitoring
- Background in enterprise environments with strong security and compliance requirements
- Experience implementing regulatory controls within cloud infrastructure
- Experience scaling infrastructure during rapid growth phases
- Experience evaluating infrastructure tooling and vendors
- Experience collaborating with large-scale enterprise customers to deploy and operate environments within their accounts and VPCs
Personal Characteristics
- Strong problem solver who anticipates failure modes
- High ownership mentality and accountability
- Comfortable working across streams and influencing without formal authority
- Learning-oriented with a drive for continuous improvement