Search Atlas

We are committed to fostering a healthy work-life balance, innovation, and a collaborative, inclusive culture, no matter where you work. We host monthly virtual game days and events, and our team enjoys the flexibility of contributing to charity initiatives of their choice. We believe in supporting both personal growth and professional success, ensuring that remote work doesn’t mean disconnected work.

  • Collaborative & Engaged: We’re a tight-knit team that supports each other and shares knowledge.
  • Excellence Driven: We aim for the highest standards, always raising the bar.
  • Self-Starter Mentality: We take initiative and problem-solve independently.
  • Innovative: We embrace change, experiment, and think outside the box.
  • Student Mentality: We learn from our mistakes and constantly evolve.

Platform Reliability Engineer (Agentic AI)

Platform Engineer | Full Time | Remote | Team 81 | Since 2021

Location

United States

Posted

11 hours ago

Salary

$70K - $120K / month

Kubernetes, Terraform, Argo CD, Go, Python, AWS, GKE, EKS, OpenTelemetry, Prometheus, Grafana, Karpenter, KEDA, GitOps, Container Orchestration, MLOps, Distributed Tracing, Infrastructure as Code

Job Description

The Mission: Building the Autonomous Nervous System

Search Atlas is moving beyond suggestions to full execution.

Our agent, Atlas Brain, handles SEO, AEO, Google Ads, and AI Content Generation autonomously—zero manual intervention.

While Platform Engineers build self-service tools for developers, you ensure those tools enable autonomous AI execution with 99.99% reliability. You're not keeping dashboards alive; you're building the engine that allows an AI Agent to replace manual marketing execution. If the platform is reliable, the agent is unstoppable.



What You Will Do:

Architect the Autonomous Backbone

Design and maintain the Kubernetes-based platform (EKS/GKE) that hosts Atlas Brain and its distributed agentic workers—handling millions of requests across SEO crawling, content generation, and ad optimization pipelines.

Engineer for Zero-Touch

Automate every aspect of infrastructure using Terraform, ArgoCD, and Go/Python. If you have to do it twice, it must be a script. Enable true "zero manual execution" at the infrastructure level.

Scale Agentic Workflows
  • Optimize ML inference pipelines for real-time agent decision-making

  • Architect high-concurrency crawling systems that feed Atlas Brain's intelligence

  • Ensure sub-second latency for agent task execution (SEO, Content, AI Builder)

  • Handle high-frequency data pipelines: real-time bidding, SERP monitoring, content generation at scale

Define Radical Reliability for AI

Establish SLOs/SLIs specifically for AI execution success rates and agent task completion, not just "uptime." Design self-healing systems that preemptively resolve failures before they impact autonomous workflows.

Observability for Agent Decisions

Build distributed tracing and monitoring for complex agentic interactions—trace agent decision trees across SEO/AEO/Ads workflows, enabling rapid diagnosis of "why the agent made that choice." Implement OpenTelemetry, Prometheus, and Grafana for full visibility into autonomous execution.

Safety & Guardrails

Implement guardrails and safety controls for autonomous agent execution in marketing contexts—ensuring AI actions align with business rules, budget constraints, and compliance requirements. Design human-in-the-loop escalation paths for edge cases.

Cost & Performance Governance

Proactively optimize cloud spend and resource allocation (Karpenter/KEDA) as we scale to thousands of agencies. Balance performance with cost efficiency for unpredictable AI workloads.


Technical Requirements

Experience: 6+ years in Platform Engineering, SRE, or Infrastructure roles within high-growth SaaS environments—with proven experience supporting AI/ML systems at scale.

Infrastructure as Code: Mastery of Terraform, ArgoCD, and GitOps workflows.

Container Orchestration: Expert-level Kubernetes (EKS/GKE) networking, scaling, security, and multi-tenancy patterns.

MLOps for Agents (Must-Have):

  • Hands-on experience with MLOps pipelines for autonomous agents

  • Model versioning and deployment strategies for continuous agent improvement

  • Prompt management and A/B testing of agent behaviors

  • Guardrails for safe tool execution and decision boundaries

  • Scaling AI inference services (LLMs, embeddings, classification models)

Languages: Proficiency in Python for building custom platform tools and automation.

Observability: Deep expertise in distributed tracing and monitoring for complex, event-driven systems—specifically for debugging AI agent decision chains.

Data-Intensive Systems: Experience with high-frequency data pipelines, web crawling at scale, real-time processing, and low-latency requirements.



Why This Is Different

Unlike traditional SRE roles focused on keeping services up, you're building the infrastructure that enables autonomous AI to execute business-critical marketing tasks. Every millisecond of latency you eliminate, every self-healing mechanism you deploy, directly impacts whether Atlas Brain can truly replace manual agency work.

This is not traditional SRE—you're building the autonomous nervous system for AI execution.



What Success Looks Like

  • Atlas Brain executes millions of marketing tasks daily with <0.1% failure rate

  • Zero infrastructure-related incidents requiring manual intervention during business hours

  • Platform scales from hundreds to thousands of agency clients without reliability degradation

  • Complete observability into agent behavior: "We know not just that the agent acted, but why"


Ready to build the platform that makes autonomous marketing execution a reality?
