Gradle Technologies

gradle.com

2 Jobs

154 Employees

About the Company

Gradle Technologies is the award-winning developer productivity company behind Gradle Build Tool—one of the most used build systems in the world—and Develocity®, the leading developer observability platform. Develocity provides comprehensive toolchain observability, build and test acceleration technologies, continuous GRC automation, and rapid troubleshooting features for Apache Maven, Android, sbt, npm, Python, and Gradle Build Tool. Top companies like Netflix, LinkedIn, ASML, Airbnb, Microsoft, Nasdaq, and SAP use Develocity to deliver critical software faster at scale.

Listed Jobs

Company Name
Gradle Technologies
Job Title
Staff Site Reliability Engineer
Job Description
**Job Title** Staff Site Reliability Engineer

**Role Summary** Lead technical and operational reliability for the Develocity SaaS platform. Channel influence across engineering, cloud operations, and customers to shape an SRE vision, standards, and tools in a distributed, remote‑first environment.

**Expectations**
- Founding member of a new SRE team, establishing practices and culture.
- Hands‑on technical operator and mentor for junior SREs.
- Own responsibility for reliability, performance, and availability of customer‑facing services and supporting infrastructure.
- Drive automation, observability, and improvement of cloud application platform.
- Collaborate across time zones with clear, asynchronous communication.

**Key Responsibilities**
- Operate and maintain all Develocity production instances and supporting services.
- Define, evolve, and enforce SRE standards: on‑call, incident response, post‑mortems, SLOs, error budgets.
- Participate in follow‑the‑sun on‑call rotation and act as escalation point for high‑severity incidents.
- Lead incident resolution, blameless retrospectives, and drive measurable reliability improvements.
- Set reliability priorities based on risk, customer impact, and business goals.
- Identify systemic risks and evolve SaaS operations as the platform grows.
- Lead architectural/design reviews to ensure scalability, operability, and reliability.
- Automate deployment, upgrade, monitoring, self‑healing, recovery, and operational workflows.
- Build comprehensive observability: logging, metrics, tracing, alerting.
- Own disaster recovery, backups, and business continuity.
- Partner with engineering leadership to balance feature delivery and operational excellence.
- Mentor and onboard new SREs; contribute to hiring and assessment.
- Communicate with customers during incidents and maintenance.
- Optimize performance, resource utilization, and operational cost.

**Required Skills**
- Proven SRE or DevOps experience (5+ years).
- Deep knowledge of AWS, Kubernetes, Docker, and cloud-native infrastructure.
- Expertise in incident response, post‑mortem analysis, SLO management, and error budgeting.
- Strong automation skills: scripting (Python, Bash, PowerShell), infrastructure‑as‑code (Terraform, CloudFormation).
- Experience with CI/CD pipelines, build/test acceleration tools.
- Observability tools: Prometheus, Grafana, ELK/EFK, OpenTelemetry, distributed tracing.
- Disaster recovery and business continuity planning.
- Excellent written communication and documentation; adept at asynchronous collaboration.
- Leadership ability to mentor, influence, and shape team practices.

**Required Education & Certifications**
- Bachelor’s degree in Computer Science, Engineering, or related technical field (or equivalent work experience).
- Certifications such as Certified Kubernetes Administrator (CKA), AWS Certified Solutions Architect, or comparable cloud‑platform/vendor credentials are highly desirable.
New York, United States
Remote
09-02-2026
Company Name
Gradle Technologies
Job Title
Senior Site Reliability Engineer
Job Description
**Job Title** Senior Site Reliability Engineer

**Role Summary** Lead the creation and ongoing operations of a new, distributed SRE team for a cloud‑native SaaS platform. Manage reliability, performance, and availability of production services, including Kubernetes clusters on AWS, artifact registries, and related infrastructure. Own incident response, automation, observability, and disaster recovery while driving best‑practice SRE culture across engineering teams.

**Expectations**
- 5+ years in SRE, DevOps, or equivalent, operating large‑scale production services.
- Proven expertise in Kubernetes (EKS, on‑prem), AWS (EC2, RDS, S3, EKS), and Infrastructure as Code (Terraform).
- Strong incident‑management track record, including on‑call participation and post‑mortem ownership.
- Deep understanding of SRE principles: SLAs, SLOs, error budgets, and reliability tooling.
- Advanced scripting skills (Python, Bash) and automation mindset.
- Excellent written and verbal English for asynchronous cross‑time‑zone communication.

**Key Responsibilities**
- Operate and maintain all production instances and supporting services.
- Participate in follow‑the‑sun on‑call rotation, leading incident detection, triage, resolution, and post‑mortem.
- Design, implement, and maintain end‑to‑end observability (logs, metrics, traces, alerts).
- Automate deployments, upgrades, monitoring, self‑healing, and recovery workflows.
- Build reliability into new features from inception in collaboration with engineering.
- Own disaster‑recovery plans, backups, and business‑continuity exercises.
- Communicate incident status and planned maintenance with customers.
- Optimize performance, resource usage, and operational costs.
- Evolve SaaS operations as scale grows, establishing SRE practices in new teams.

**Required Skills**
- Kubernetes administration (deployment, scaling, troubleshooting).
- AWS operations (EKS, EC2, RDS, S3).
- Terraform, CloudFormation, or equivalent IaC.
- Prometheus, Grafana, ELK/EFK stacks, distributed tracing.
- Incident‑response tooling (PagerDuty, Opsgenie, or similar).
- Scripting: Python, Bash, or Go for automation.
- Knowledge of SLO/SLA definition and monitoring.
- Disaster‑recovery design and execution.
- Strong documentation and asynchronous communication.

**Required Education & Certifications**
- Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent technical experience.
- Optional certifications: Certified Kubernetes Administrator (CKA), AWS Certified Solutions Architect – Associate, or equivalent SRE‑focused credentials.
San Francisco Bay, United States
Remote
Senior
09-02-2026