- Company Name
- Tundra Technical Solutions
- Job Title
- Site Reliability Engineer - Data Services
- Job Description
-
Job Title: Site Reliability Engineer – Data Services
Role Summary: Deliver reliability, performance, and stability for enterprise data pipelines (Autosys, Informatica/ETL, Airflow) on AWS, applying SRE best practices (SLIs/SLOs, error budgets, observability).
Expectations: 5+ years in data engineering or IT operations with a strong SRE focus; proficiency in AWS and data services; excellent debugging, communication, and documentation skills.
Key Responsibilities:
• Design, build, and maintain reliable, scalable data pipelines.
• Define and monitor SLIs/SLOs, manage error budgets, and drive incident response.
• Reduce toil by automating monitoring, alerting, and infrastructure-as-code.
• Collaborate with data engineering, platform, and DevOps teams to integrate observability and configuration management.
• Perform root-cause analysis, run post-mortems, and continuously improve data workflows.
Required Skills:
• SRE fundamentals (SLIs, SLOs, error budgets, incident response).
• AWS cloud services (EC2, S3, RDS, Redshift, Glue, EMR).
• Autosys job scheduling and monitoring.
• Scripting in Python or Shell.
• Advanced SQL and data modeling.
• Experience with ETL tools (Informatica, DataStage, Talend).
• Observability platforms (Datadog, Dynatrace, Prometheus).
• Strong problem-solving, documentation, and communication skills.
Required Education & Certifications:
• Bachelor’s degree in Information Technology, Computer Science, or related field.
• ITIL Foundation (preferred).
• Informatica and AWS Data certifications (preferred).