Aspire Software
We never stop building. A vertical acquisition software company that owns, operates and manages a diverse portfolio.
Senior Site Reliability Engineer
Location: Maryland
Posted: 33 days ago
Salary: Not specified
5 yrs exp, English, Azure, Cloud, Docker, Kubernetes, Linux, Terraform, Vault
Job Description
• Own and operate a production cloud platform running on Microsoft Azure and Cloud Foundry (or comparable platforms)
• Ensure availability, performance, and reliability across infrastructure and platform components
• Serve as the primary escalation point for platform-level incidents
• Lead incident response, root cause analysis, and post-incident remediation
• Use modern monitoring, alerting, and AI-assisted observability tools to improve detection, diagnosis, and resolution of incidents
• Drive continuous improvements to reduce operational risk, after-hours incidents, and manual intervention
• Own certificate and secrets lifecycle management, including TLS automation and secure secrets handling (e.g., CredHub, Vault)
• Ensure secure and compliant practices around identity, access, and credential management
• Partner with engineering teams to embed security and reliability best practices into platform workflows
• Automate common operational tasks using Bash and/or PowerShell
• Support and extend infrastructure-as-code using Terraform and/or Bicep
• Improve platform consistency and repeatability through Git-driven, automation-first workflows
• Leverage AI-assisted tooling to support scripting, troubleshooting, and operational documentation
• Support PCI and other compliance activities, including technical control implementation, audit support, and remediation tracking
• Maintain clear runbooks, diagrams, and documentation to enable repeatable operations and knowledge transfer
• Partner with internal teams and external auditors to support compliance requirements
• Work closely with application engineers, junior SRE/support staff, and vendor partners
• Provide technical guidance and mentorship to junior teammates
• Act as a trusted partner to engineering teams on reliability, performance, and operational readiness
Job Requirements
- 5+ years of experience in SRE, DevOps, or infrastructure engineering roles supporting production environments
- Hands-on experience with Cloud Foundry, Kubernetes, or Docker in production (Cloud Foundry preferred)
- Strong experience with Microsoft Azure, including networking, compute, IAM, and monitoring
- Strong Linux systems administration experience (RHEL preferred); comfort with Windows Server environments
- Proficiency in PowerShell and/or Bash scripting
- Solid understanding of TLS/PKI workflows, including certificate management and rotation
- Proven experience managing incidents end-to-end and performing root cause analysis
- Strong written communication skills and a disciplined approach to documentation
- Experience using modern automation, observability, or AI-enabled operational tools to improve reliability and efficiency