Devsu

Join Devsu and discover a workplace that values your growth, supports your well-being, and empowers you to make a global impact.

Site Reliability Engineer

DevOps Engineer | Full Time | Remote

Location

United States, Dominican Republic

Posted

8 days ago

Salary

Not specified

Job Description

We are seeking a Site Reliability Engineer (SRE) with deep expertise in monitoring, observability, and reliability engineering to support systems running across on-premises infrastructure and Google Cloud Platform (GCP).

This role is primarily responsible for designing, operating, and improving monitoring, alerting, and observability platforms, with a strong focus on Grafana and Kubernetes environments.

As a secondary responsibility, this role provides backup coverage for the Application Support team during periods of resource constraints or major incidents, offering L2/L3 technical support when required.

Responsibilities

Monitoring & Observability (Core Focus)

  • Own and operate the monitoring and observability stack across on-prem and GCP environments
  • Design, build, and maintain Grafana dashboards for infrastructure, Kubernetes, and applications
  • Define, tune, and maintain alerts to ensure high signal-to-noise ratio
  • Establish observability standards and best practices across teams
  • Improve visibility into system health, performance, and reliability

Site Reliability Engineering

  • Apply SRE principles to improve availability, performance, and resilience
  • Define and track SLIs, SLOs, and error budgets
  • Participate in on-call rotations and SEV incident response
  • Lead or contribute to incident investigations and root cause analysis (RCA)
  • Drive preventative actions to reduce repeat incidents

Kubernetes & Platform Reliability

  • Support and monitor Kubernetes environments (GKE and on-prem clusters)
  • Monitor cluster health, capacity, and resource utilization
  • Troubleshoot platform-level issues impacting application reliability
  • Collaborate with Platform and Engineering teams on reliability improvements

Secondary Responsibilities (Backup Application Support)

These responsibilities are activated as needed and are not part of day-to-day operations.

  • Provide L2/L3 application support coverage during:
    • Support team resource shortages
    • High-severity incidents (SEVs)
    • Peak support periods or escalations
  • Triage and troubleshoot application issues using existing runbooks and dashboards
  • Collaborate with Application Support and Engineering teams during incidents
  • Ensure all actions, findings, and resolutions are documented in ServiceNow (SNOW)

Requirements

  • Strong experience as a Site Reliability Engineer or Reliability Engineer
  • Deep hands-on expertise with Grafana (dashboards, alerting, troubleshooting)
  • Solid experience with monitoring and observability systems
  • Production experience operating Kubernetes environments
  • Experience supporting systems in GCP and on-prem environments
  • Strong Linux systems and troubleshooting skills
  • Fluent English (written and spoken)
  • Ability to work in the PST time zone
  • Ability to participate in an on-call rotation that includes coverage for one weekend day. Time worked during the weekend is compensated with one day off during the week, in accordance with the established work schedule.

Technology Stack:

  • Observability: Grafana, Prometheus, logging platforms
  • Containers: Kubernetes (GKE and on-prem)
  • Cloud: Google Cloud Platform (GCP)
  • Operations: Linux, networking, infrastructure monitoring
  • Incident Tools: PagerDuty, ServiceNow, Slack (or equivalents)

Nice to have: 

  • Experience supporting application teams during SEV incidents
  • Knowledge of capacity planning and performance tuning
  • Scripting skills (Python, Bash, etc.)
  • Experience with hybrid infrastructure environments

At Devsu, we believe in creating an environment where you can thrive both personally and professionally. By joining our team, you’ll enjoy:

  • A stable, long-term contract with opportunities for career growth
  • Private health insurance
  • A remote-friendly culture that promotes work-life balance
  • Continuous training, mentorship, and learning programs to keep you at the forefront of the industry
  • Free access to AI training resources and state-of-the-art AI tools to elevate your daily work
  • A flexible Paid Time Off (PTO) policy, as well as paid holidays
  • Challenging, world-class software projects for clients in the US and LatAm
  • Collaboration with some of the most talented software engineers in Latin America and the US, in a diverse work environment
