AMD

Software Development Engineer – Software DevOps & Continuous Integration Team

On site

Calgary, Canada

$189,240/year

Full Time

21-02-2026

Skills

Python, Go, Bash, MySQL, GitHub, CI/CD, DevOps, Kubernetes, Monitoring, Jenkins, Ansible, Test, Quality Assurance, Training, Architecture, PyTorch, TensorFlow, Programming, Databases, Software Development, Cloud Platforms, C++, Embedded Systems, CI/CD Pipelines, Grafana, CMake, GitHub Actions

Job Specifications

WHAT YOU DO AT AMD CHANGES EVERYTHING

At AMD, our mission is to build great products that accelerate next-generation computing experiences—from AI and data centers, to PCs, gaming and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity and a shared passion to create something extraordinary. When you join AMD, you’ll discover the real differentiator is our culture. We push the limits of innovation to solve the world’s most important challenges—striving for execution excellence, while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond. Together, we advance your career.

The Role

The AI/ML Frameworks team is hiring an MTS Software Development Engineer to build and maintain scalable DevOps infrastructure that accelerates AMD’s AI software development.

You will design and own CI/CD pipelines, manage Kubernetes‑based GPU environments, and automate systems using Python, Go, and Ansible. The role involves creating and maintaining production‑grade automation and tooling that enables fast, reliable software delivery across teams.

The Person

The ideal candidate is a skilled DevOps/infrastructure engineer with strong programming abilities. They write clean, maintainable code in Python or Go, and can navigate ML framework source code (PyTorch, TensorFlow, ROCm) to debug issues, optimize build processes, or contribute fixes. They have solid knowledge of build systems and toolchains—understanding how CMake, Bazel, and compiler toolchains work is critical for effective issue triaging and root cause analysis. They are proficient in Kubernetes, CI/CD tools, and infrastructure automation frameworks such as Ansible. Familiarity with C++ is valuable for navigating lower-level framework components. This person thrives in collaborative, fast-paced environments, can drive technical execution with minimal oversight, and is passionate about knowledge sharing and upleveling their team.

Key Responsibilities

Build System Expertise & Issue Triaging: Develop deep expertise in build tools and flows (CMake, Bazel, Make, compiler toolchains). Triage complex build failures by understanding the full build pipeline—from source to binary. Identify root causes across infrastructure, toolchain, and code-level issues.
Team Training & Knowledge Sharing: Train and mentor team members on build systems, CI/CD workflows, and debugging techniques. Create documentation, runbooks, and training sessions to ensure the team can effectively triage issues independently. Foster a culture of continuous learning around build infrastructure.
ML Framework Integration & Code Contribution: Understand the architecture and codebase of ML frameworks (PyTorch, TensorFlow, ROCm stack). Review, debug, and contribute code changes as needed to resolve build issues, improve CI reliability, or support new features.
Tooling & Automation Development: Design and develop internal tools, automation scripts, and services primarily in Python and Go. Write well-tested, production-grade code to solve infrastructure and workflow challenges.
CI/CD Pipeline Development: Design, implement, and manage efficient continuous integration and delivery pipelines using Buildkite, GitHub Actions, and Jenkins to enable rapid and reliable software deployment for ML workloads.
Kubernetes Infrastructure Management: Deploy and maintain robust Kubernetes-based environments across both on-premises and cloud platforms to support scalable service orchestration.
Infrastructure Automation: Automate provisioning, configuration, and management of infrastructure using Ansible, Python, and Bash to improve system consistency and reduce manual intervention.
Service Deployment with Helm: Administer application and service deployment in Kubernetes using Helm charts for consistent and repeatable release processes.
GPU Server Support: Configure, manage, and maintain GPU-based compute environments including lifecycle automation and hardware-level test integration for ML training and inference workloads.
Database and Observability Integration: Interact with MySQL databases to support dynamic data updates and integrate data sources into Grafana dashboards for monitoring and insights.
Cross-Functional Collaboration: Work closely with ML framework developers, SREs, and project stakeholders to ensure system-level alignment and high-impact delivery.
Quality Assurance Enablement: Integrate automated testing frameworks into CI pipelines to ensure code quality, stability, and performance across development cycles.

Preferred Experience

Build Systems & Toolchains: Strong understanding of CMake, Bazel, Make, and compiler toolchains (GCC, Clang, LLVM). Ability to debug complex build failures, understand dependency resolution, and optimize build performance.
Programming Languages: Strong proficiency in Python and Go for building tools, services, and automation. The abili

About the Company

We care deeply about transforming lives with AMD technology to enrich our industry, our communities, and the world. Our mission is to build great products that accelerate next-generation computing experiences – the building blocks for the data center, artificial intelligence, PCs, gaming and embedded. Underpinning our mission is the AMD culture. We push the limits of innovation to solve the world’s most important challenges. We strive for execution excellence while being direct, humble, collaborative, and inclusive of divers...