TechDoQuest

Site Reliability Engineer (SRE)

On-site

Toronto, Canada

Freelance

22-01-2026


Skills

Python, PowerShell, SQL, NoSQL, Cassandra, Incident Response, Encryption, ServiceNow, GitHub, CI/CD, DevOps, Docker, Kubernetes, Monitoring, Configuration Management, Jenkins, Ansible, Networking, Windows, Git, SQL Server, Azure, AWS, Cloud Platforms, GCP, Redis, CI/CD Pipelines, OpenShift, GitHub Actions

Job Specifications

Job Summary

We are seeking a skilled Site Reliability Engineer (SRE) to enhance the reliability, scalability, and performance of our systems and applications. The ideal candidate will have strong experience in automation, cloud platforms, observability, incident management, and DevOps practices. This role involves working closely with cross‑functional teams to ensure high availability, continuous improvement, and efficient service delivery.

Key Responsibilities

Design, build, and maintain automation for infrastructure provisioning and configuration management.
Implement and manage monitoring, observability, and alerting systems to ensure service reliability.
Collaborate with development and operations teams to enhance CI/CD pipelines and deployment automation.
Lead incident response, root‑cause analysis, and continuous improvement initiatives.
Manage cloud infrastructure, container orchestration platforms, and distributed systems at scale.
Ensure security, compliance, and governance across systems and processes.
Optimize application performance and conduct capacity planning and load testing.
Maintain documentation, runbooks, SLOs/SLAs, and operational processes.

Required Skills & Experience

1. Automation & Configuration Management

Ansible: Writing playbooks, roles, and modules.
Python: Scripting for automation, monitoring, and API integration.
PowerShell: Automation for Windows, AD, and cloud resources.

2. Monitoring & Observability

Dynatrace: Synthetic & real user monitoring, alerting, performance analysis.
Moogsoft: Event correlation, alert management, incident orchestration.
Elasticsearch Stack: Log aggregation & querying; familiarity with Kibana/Logstash.

3. Incident & Service Management

ServiceNow: Ticket lifecycle, CMDB, workflow automation.

4. Infrastructure & Platforms

Cloud: AWS, Azure, or GCP (compute, storage, serverless, networking).
Containers: Kubernetes/OpenShift, Docker, Helm.

5. Database & Storage

SQL Server: Query tuning, replication, HA/DR setups.
Distributed DBs: Cassandra, Redis, NoSQL systems.
Backup & disaster recovery planning.

6. Security & Compliance

IAM, encryption, secrets management (e.g., HashiCorp Vault).
Vulnerability scanning and compliance frameworks (e.g., SOC 2).

7. CI/CD & DevOps

CI/CD tools: Jenkins, GitHub Actions, UrbanCode Deploy (UCD).
Git workflows and branching strategies.
Artifact management: Artifactory, Nexus.

8. Performance Engineering

Load testing using JMeter.
Capacity planning & performance optimization.
Defining and measuring SLIs, SLOs, SLAs.

About the Company

TechDoQuest is a modern IT consulting and delivery partner helping businesses scale faster with smarter technology and global execution. With operations across Canada, the U.S., and India, we specialize in:

IT Consulting & Advisory
Custom Software Development
Cloud & DevOps Services
Building and Managing Global Capability Centers (GCCs)

At TechDoQuest, we combine strategic insight with hands-on execution, delivering lean, cost-conscious, and scalable solutions that drive measurable business outcomes. Whether you're a start...