Job Specifications
We are a rapidly growing organization focused on delivering scalable, safe, and sustainable energy storage solutions for critical infrastructure, including data centers, industrial facilities, and the grid. Our mission is to pioneer innovative technologies that enable long-duration, non-toxic energy storage systems made in the U.S.
Role Overview
We’re seeking a DevOps Engineer to build and operate the cloud and on-prem infrastructure that powers our Energy Management System (EMS). You’ll use Infrastructure as Code (IaC) with Terraform and configuration management with Ansible to provision, secure, and scale services across Amazon Web Services (AWS) and Microsoft Azure (experience with both preferred). You will partner closely with backend, ML, and controls engineers to deliver reliable, observable, and cost-efficient platforms for our applications.
You will be responsible for designing, automating, and maintaining production-grade infrastructure; creating secure CI/CD pipelines; implementing robust observability (metrics, logs, traces); enforcing security baselines and secrets management; and driving operational excellence (SRE practices, incident response, cost optimization).
Key Responsibilities
Design, provision, and manage cloud and on-prem environments using Terraform (IaC) and Ansible (configuration management).
Build and maintain CI/CD pipelines (e.g., GitHub Actions, Azure DevOps) for services, data pipelines, and ML workloads.
Operate containerized workloads with Docker and Kubernetes (AKS/EKS or equivalent), including cluster add-ons, ingress, autoscaling, and upgrades.
Implement observability: metrics (e.g., Prometheus), logging (e.g., OpenSearch/ELK), tracing (e.g., OpenTelemetry), and actionable alerts (e.g., Grafana/CloudWatch/Azure Monitor).
Enforce cloud security baselines (IAM/role design, least privilege, network segmentation, TLS, key management) and manage secrets (e.g., HashiCorp Vault, AWS Secrets Manager, Azure Key Vault).
Automate backups, disaster recovery (RTO/RPO goals), blue/green or canary releases, and infrastructure testing (e.g., Terratest).
Collaborate with engineers to design scalable APIs, data services, and event pipelines; champion reliability (SLOs/SLIs), performance, and cost efficiency.
Support audits and compliance readiness (e.g., SOC 2 practices) through policy-as-code and strong documentation.
Participate in on-call/incident response, root-cause analysis, and post-mortems; drive continuous improvement.
Qualifications
Bachelor’s degree in Computer Science, Software/Systems Engineering, or a related field (or equivalent experience).
5+ years in DevOps/SRE/Platform Engineering roles operating production systems.
Proven expertise with Terraform and Ansible in production.
Hands-on experience designing secure, scalable architectures on AWS and/or Azure (both preferred).
Strong CI/CD experience (GitHub Actions, Azure DevOps, or similar) and proficiency with Docker and Kubernetes.
Solid understanding of networking (VPC/VNet, subnets, routing, load balancers, DNS) and Linux administration.
Experience implementing observability stacks (metrics/logs/traces) and actionable alerting.
Strong grasp of security best practices (IAM, secrets, encryption, compliance basics) and cost optimization.
Excellent communication skills; able to work independently and manage multiple initiatives.
Preferred Qualifications
Experience with .NET / C# build and deployment workflows, and with Python for tooling/automation.
Experience with message and data systems (e.g., Kafka, RabbitMQ, PostgreSQL, SQL Server, Redis).
Familiarity with energy/industrial environments (e.g., EMS, SCADA, BMS/DERMS).
Experience with policy-as-code (e.g., Open Policy Agent), infrastructure testing (Terratest), and GitOps (e.g., Argo CD/Flux).
Knowledge of backup/DR architecture, multi-account/multi-subscription patterns, and cross-cloud networking.
Contributions to internal platforms, golden paths/templates, or developer experience initiatives.