ClientMind Recruiting Inc.

Site Reliability Engineer

Hybrid

Bethesda, United States

Mid level

Full Time

01-12-2025

Skills

Python, CI/CD, Kubernetes, Monitoring, Networking, AWS, Spark, Kafka, Prometheus, Grafana

Job Specifications

ClientMind Recruiting is searching for a Site Reliability Engineer for a growing tech company based in the Bethesda, MD area. The role is hybrid, with one onsite day per week (Tuesday).

This role centers on maintaining the “common” IaC constructs (Python-based abstractions in AWS CDK and CDK8s) that define their platform. These include networking, EKS configuration, data stores, observability, autoscaling patterns, and deployment primitives. You’ll work closely with backend engineers to make infrastructure safe, consistent, and easy to adopt.

Responsibilities

Design, implement, and evolve shared CDK and CDK8s constructs used by multiple services and teams.

Maintain base infrastructure components: VPC, EKS, node groups, RDS, OpenSearch, and MSK.

Operate and extend Kubernetes cluster add-ons: ingress controllers, cert-manager, autoscaler, and monitoring/logging stacks.

Ensure high reliability through well-structured alerting (Prometheus, CloudWatch), autoscaling, and recovery patterns.

Manage and publish baseline templates, configuration schemas, and documentation for infrastructure usage.

Own the CI/CD processes for IaC codebases and platform component releases.

Collaborate with engineering teams to diagnose infrastructure issues and propose robust solutions.

Apply SRE principles (SLIs/SLOs, observability, fault tolerance) to all shared platform services.

Support IAM roles, secrets management, and tenant isolation patterns.

Required Experience

5+ years of infrastructure or SRE experience, including AWS (VPC, IAM, RDS, MSK, S3) and Kubernetes (Helm, RBAC, ServiceAccounts).

Fluency in Python and experience with Infrastructure-as-Code using AWS CDK, CDK8s, or equivalent frameworks.

Strong understanding of Prometheus, Grafana, and alert routing practices.

Experience designing reusable infrastructure patterns or internal developer platforms.

Proven ability to improve reliability through automation, monitoring, and operational best practices.

Nice to Have

Experience supporting Spark on Kubernetes, Argo, or Kafka-based batch pipelines.

Awareness of cost-efficiency strategies across EC2, storage, and autoscaling.

About the Company

If you're a startup, small business, non-profit organization, or a company that doesn't hire on a regular basis, we have a solution for you. Since 2012, we've provided professional recruiting services to our clients when they need them for a fraction of what typical recruiting firms charge and without the hassle of doing it yourself. Beyond skills and experience, we make sure that candidates are fit for your culture, work environment, and the "X" factor for your company. We know that every minute you have to wait for the ri...