Senior Tech Lead – SRE

DevOps Engineer, Full Time, Remote, Team 10,001+, Since 1961, H1B Sponsor

Location

California, Illinois, Montana, South Dakota

Posted

28 days ago

Salary

$106.9K - $147K / year

Bachelor Degree, 7 yrs exp, English, AWS, Azure, Cloud, Distributed Systems, Google Cloud Platform, Kafka, Oracle, Postgres, PySpark, Python, SQL, Go

Job Description

  • Lead SRE team initiatives focused on system reliability, automation, and operational excellence.
  • Architect and implement solutions to enhance availability, performance, and scalability of cloud and on-premises services.
  • Oversee incident management processes, ensuring timely response and thorough root cause analysis.
  • Develop and refine monitoring, alerting, and reporting frameworks; ensure actionable insights for service health.
  • Guide adoption of Infrastructure as Code (IaC) and CI/CD pipelines to streamline deployments and reduce risk.
  • Collaborate with software engineering and product teams to integrate reliability requirements into design and development.
  • Mentor engineers on SRE principles, fostering a culture of continuous improvement and operational rigor.
  • Establish service level objectives (SLOs), service level indicators (SLIs), and error budgets in partnership with stakeholders.
  • Manage on-call rotations, ensuring effective coverage and knowledge sharing.
  • Document architectural decisions, operational procedures, and incident retrospectives.
  • Operational excellence for AI systems: identify AI/ML use cases, and influence and implement SRE best practices, including SLIs/SLOs for ML workloads, automated remediation, and capacity modeling.
  • Observability and monitoring for ML: define and implement monitoring strategies for model drift, data anomalies, pipeline failures, system performance, and user experience.
  • Proactively identify and mitigate risks during deployments to ensure system stability.
  • Ensure long-term stability by managing technical debt and maintaining observability and performance of critical pharmacy applications.
  • Support timely restoration of services during outages, with 24/7 coverage to meet enterprise Service Level Agreements (SLAs).
  • Drive incident response and root cause analysis to prevent recurrence and improve system resilience.

Job Requirements

  • Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent experience).
  • 7+ years of relevant experience in SRE, DevOps, or software engineering, including 2+ years in a technical leadership role.
  • Minimum 5 years' relevant experience with Python, PySpark, Azure Databricks, Snowflake, SQL, Oracle, PostgreSQL, file transfer, REST APIs, and Kafka.
  • Proficiency with cloud platforms (AWS, Azure, GCP), container orchestration, and automation tools.
  • Strong scripting and programming skills (e.g., Python, Go, Bash).
  • Deep understanding of distributed systems, networking, and security principles.
  • Proven experience leading large-scale incident response and postmortem processes.
  • Excellent communication and stakeholder management skills.
  • Experience building automation around CI/CD (Azure DevOps YAML pipelines), testing, and validation.

Benefits

  • Medical, dental, and vision benefits
  • 401(k) retirement savings plan
  • Time off (including paid time off, company and personal holidays, volunteer time off, and paid parental and caregiver leave)
  • Short-term and long-term disability
  • Life insurance and many other opportunities
