- Company Name
- WNTD
- Job Title
- Solutions Architect
- Job Description
**Job Title:** Solutions Architect – NVIDIA GPU Cluster Design & Validation
**Role Summary:**
Senior technical architect responsible for end‑to‑end design, validation, and deployment of NVIDIA GPU clusters in enterprise and hyperscale environments. Owns the full architecture lifecycle, from requirements gathering to operational readiness, ensuring high‑performance, scalable, and resilient infrastructure for AI, HPC, and ML workloads.
**Expectations:**
- Deliver architecture and design documents (HLD/LLD) aligned with NVIDIA reference models (NVAIE, DGX, SuperPOD).
- Validate hardware selections, interconnects, and software stacks to meet customer requirements.
- Provide expert guidance to engineering, networking, DevOps, security, and datacenter operations teams.
**Key Responsibilities:**
- Lead the architecture of NVIDIA H100/H200 GPU clusters, including NVLink/NVSwitch fabrics and DGX/HGX/SuperPOD designs.
- Produce and maintain high‑level and low‑level design artefacts including compute, network, storage, power, and cooling considerations.
- Define and execute validation test plans for performance, resilience, networking throughput, and workload behavior.
- Oversee integration of GPU nodes, networking, storage, and orchestration (Kubernetes, Slurm, Bright Cluster Manager).
- Manage deployment lifecycle: rack layout, cabling, airflow, power, factory and site acceptance testing, operational readiness.
- Collaborate with internal stakeholders and external vendors (NVIDIA, Mellanox, Supermicro, Dell, HPE) to ensure design feasibility and compliance.
- Produce detailed architecture documents, diagrams, acceptance criteria, and operational runbooks.
- Conduct knowledge transfer and training for internal teams.
**Required Skills:**
- Proven experience architecting multi‑node NVIDIA GPU clusters for AI/ML/HPC.
- Deep knowledge of NVLink/NVSwitch, InfiniBand (200/400 Gb/s), RoCE, and high‑performance fabric designs.
- Hands‑on experience with cluster orchestration (Kubernetes, Slurm, PBS).
- Strong understanding of CUDA, NCCL, Docker/OCI containers, and NVIDIA software stacks.
- Proficiency with Linux systems engineering, hardware validation, and multi‑layer troubleshooting.
- Excellent communication, documentation, and stakeholder‑management skills.
- Ability to own architectural decisions and lead cross‑functional delivery.
**Required Education & Certifications:**
- Bachelor’s degree in Computer Science, Electrical Engineering, or related field (equivalent experience acceptable).
- Preferred certifications: NVIDIA Certified Associate/Expert, Kubernetes CKA/CKS, or equivalent vendor accreditation.