- Company Name
- Grain
- Job Title
- Site Reliability Engineer
- Job Description
-
Job Title: Site Reliability Engineer
Role Summary: Lead the design, implementation, and operation of a high‑availability, SOC 2‑compliant infrastructure for a payments platform built on AWS and Kubernetes. Own observability, security posture, CI/CD, IaC, and cost optimization while maintaining development velocity.
Expectations: 4+ years of production DevOps/infrastructure experience; expertise in AWS CDK (or Terraform/Pulumi) and Kubernetes (EKS preferred); hands‑on CI/CD pipeline development; experience implementing comprehensive observability; and a strong security and compliance background in finance or payments. Exposure to SOC 2, PCI‑DSS, zero‑trust architecture, fintech, or crypto is a plus.
Key Responsibilities: • Design and deploy robust logging, metrics, tracing, and alerting systems.
• Establish SOC 2‑compliant access controls, secrets management, and audit logging; build zero‑trust architecture.
• Create and maintain CI/CD pipelines with automated tests, staged rollouts, and rollbacks.
• Manage AWS infrastructure with CDK, ensuring reproducibility and auditability.
• Optimize cost and performance without compromising reliability.
• Collaborate with engineering to improve deployment, debugging, and operational workflows.
• Ensure high availability, rapid incident response, and minimal downtime.
Required Skills: • AWS (CDK, EKS, ECS, Lambda, EC2)
• Kubernetes cluster management (production)
• CI/CD tooling (GitHub Actions, Jenkins, etc.)
• End‑to‑end observability (log aggregation, metrics, tracing)
• Security best practices, particularly in financial/payment contexts
• Live‑site incident management and PagerDuty-style alerts
• IaC fundamentals with CDK or equivalent Terraform/Pulumi
• Strong scripting (Python, Bash, Go) and automation mindset
Required Education & Certifications: • Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience.
• SOC 2 or similar compliance certifications are a plus.