Skills

Communication, Leadership, Incident Response, Encryption, Cloud Security, CI/CD, DevOps, Kubernetes, Monitoring, Jenkins, Ansible, Stakeholder Management, Strategic thinking, Sales, Networking, Organization, AWS, CI/CD Pipelines, TCP/IP, Terraform, Prometheus, Grafana

Job Specifications

CarltonOne is a global B2B technology leader, and part of the Goldman Sachs portfolio, helping organizations around the world reward and inspire exceptional people. Our solutions empower employees to be more productive, sales teams to perform at their best, and customers to stay engaged and loyal.

Our platform powers the global engagement industry, enabling companies to deliver impactful employee recognition, customer loyalty, rewards, sales, and channel incentive programs. We partner with over 450 clients, 500 vendors, and serve 14 million members across 185 countries.

Beyond engagement, every CarltonOne solution drives our eco-action mission: funding tree planting to help restore the planet. To date, we’ve funded over 20 million trees and are on track to plant millions more each year. Learn more at carltonone.com.

About the Opportunity:

We are seeking a strategic and technically adept SRE Manager to lead our Site Reliability Engineering team. This role is pivotal in ensuring the reliability, scalability, and performance of our cloud-native infrastructure and services. You will guide a team of SREs, collaborate cross-functionally with DevOps, Security, and Engineering, and champion best practices in observability, incident response, and automation.

Responsibilities:

Leadership & Strategy

Lead, mentor, and grow a team of Site Reliability Engineers, fostering a culture of ownership, continuous learning, and operational excellence
Define and drive SRE strategy, including Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budget management
Collaborate with cross-functional teams (Engineering, DevOps, Security, Product) to align reliability goals with business objectives
Build and maintain strong relationships with stakeholders across the organization

Reliability & Incident Management

Establish and continuously improve the end-to-end incident management lifecycle, from detection through post-incident review
Lead coordination of incident response efforts across engineering, DevOps, and support teams during major outages
Implement and maintain runbooks and playbooks for common incident scenarios
Facilitate blameless postmortems to identify root causes, document findings, and ensure follow-up actions are completed
Track and report on incident metrics (MTTR, MTTD, frequency, severity) to identify trends and drive continuous improvement
Drive automation initiatives to reduce toil, eliminate manual effort, and improve system resilience

Monitoring, Observability & Performance

Design and implement comprehensive monitoring and observability strategies using industry-leading tools including Datadog, Grafana, CloudWatch, and Prometheus
Deploy and optimize cloud security monitoring using Rapid7 InsightCloudSec and Wiz for threat detection and compliance
Leverage Cloudflare for edge performance monitoring and DDoS protection
Establish actionable alerting systems with proper thresholds and escalation paths
Analyze performance, availability metrics, and capacity trends to proactively identify and resolve issues
Create and maintain dashboards that provide visibility into system health and business-critical metrics

Operational Excellence & Cloud Infrastructure

Lead root cause analysis for recurring issues and implement long-term preventative solutions
Optimize cloud resource usage and costs through automation, right-sizing, and performance tuning
Oversee disaster recovery planning and testing to meet Recovery Point Objective (RPO) and Recovery Time Objective (RTO) requirements
Implement and maintain Infrastructure-as-Code (IaC) practices using Terraform, CloudFormation, and Helm
Champion security best practices including RBAC, IAM policies, encryption, and vulnerability management
Drive capacity planning initiatives to ensure infrastructure scales with business growth

Qualifications

Bachelor’s degree in Computer Science, Engineering, or a related field
7+ years of experience in cloud infrastructure, DevOps, or SRE roles, with 2+ years in a leadership role
Proven experience managing incident response and reliability programs at scale
Deep expertise in AWS services (EKS, EC2, S3, VPC, IAM, RDS Aurora, Lambda)
Strong background in Kubernetes, container orchestration, and service meshes
Proficiency in Infrastructure-as-Code (Terraform, CloudFormation, Helm)
Experience with CI/CD pipelines and automation (Bamboo, Jenkins, Ansible)
Solid understanding of networking concepts (TCP/IP, DNS, load balancing, CDN)
Familiarity with monitoring and observability platforms (Datadog, Grafana, CloudWatch)
Excellent communication, stakeholder management, and cross-functional collaboration skills
Strong incident management and crisis leadership capabilities
Strategic thinking with a focus on long-term reliability and scalability goals

Nice to Have

AWS Certified Solutions Architect or SRE-related certifications (SRE Practitioner, CKA, CKAD)
Experience with ITIL or other incident management frameworks
Solid understan

About the Company

CarltonOne offers the world's most powerful eCommerce and Engagement platform for creating B2B employee recognition, customer loyalty, rewards, and sales/channel incentive programs. Recognized as one of the top 50 most inspiring workplaces in North America, CarltonOne helps our partners and clients operate programs with over 10 million rewards in over 185 countries. Every transaction on our platform fuels our Evergrow sustainability mission to fight climate change with a unique eco-action business model that is funding the p...