Senior Site Reliability Engineer – SRE

DevOps Engineer, Full Time, Remote, Senior, Team 1-10

Location: Illinois
Posted: 70 days ago
Salary: $165K - $225K / year
Seniority: Senior

Bachelor Degree, 5 yrs exp, English, Ansible, DNS, Grafana, Kubernetes, Linux, Prometheus, Python, Terraform, Go

Job Description

  • Design, build, and operate production Kubernetes clusters on bare-metal infrastructure.
  • Implement and operate custom Kubernetes networking solutions.
  • Develop and maintain custom Kubernetes operators and controllers.
  • Deploy and optimize NVIDIA GPU operators and custom scheduling logic for GPU workloads.
  • Build deep integrations between Kubernetes and underlying infrastructure.
  • Design and implement automation using Terraform, Ansible, Helm, and custom operators.
  • Manage production bare-metal infrastructure across multiple regions, ensuring high availability, fault tolerance, and graceful degradation.
  • Build comprehensive monitoring, logging, and alerting using Prometheus, Grafana, and the ELK stack.
  • Identify and resolve performance bottlenecks across infrastructure domains.

Job Requirements

  • 5+ years in SRE, DevOps, or infrastructure engineering roles with proven experience operating production infrastructure at scale.
  • Deep hands-on experience building and operating production Kubernetes clusters on bare-metal infrastructure.
  • Strong understanding of Kubernetes internals including custom resource definitions (CRDs), operators, controllers, admission webhooks, and scheduling.
  • Strong fundamentals in Linux systems administration, performance tuning, troubleshooting, and automation in production environments.
  • Proficiency with infrastructure-as-code tools (Terraform, Ansible, Helm) and building automation to reduce operational overhead.
  • Solid understanding of networking concepts, including IPAM, DNS, DHCP, VLAN/VXLAN, routing, and load balancing, plus experience troubleshooting network issues in production.
  • Experience building and maintaining comprehensive monitoring solutions using tools like Prometheus, Grafana, and centralized logging systems.
  • Understanding of SRE principles including SLIs/SLOs/SLAs, error budgets, incident management, and blameless postmortems.
  • Strong scripting skills in Go, Python, or Bash for automation, tooling development, and operational efficiency.
  • Demonstrated ability to troubleshoot complex issues under pressure, manage incidents effectively, and communicate clearly during outages.
  • Excellent communication skills and ability to work across teams including systems engineers, network engineers, and software developers.

Benefits

  • 6% 401(k) match
  • Fully covered health insurance premiums
  • Other comprehensive offerings to support your well-being and success as we grow together.
