- Company Name
- Boson AI
- Job Title
- Network Engineer, AI/ML Infrastructure
- Job Description
-
Role Summary:
Design, implement, and maintain high‑performance InfiniBand and ultra‑high‑speed Ethernet fabrics that support AI/ML workloads. Manage the end‑to‑end network lifecycle, from planning and deployment to monitoring and optimization, ensuring low latency and high throughput for GPU‑to‑GPU traffic, Ceph storage connectivity, and multi‑site data center interconnects. Collaborate with HPC and ML teams to scale capacity and evaluate emerging networking technologies.
Expectations:
- 4+ years of production network engineering experience.
- Proven knowledge of L2/L3 protocols (TCP/IP, BGP, OSPF, VLANs).
- Hands‑on expertise with 100Gb+ Ethernet and InfiniBand, including RDMA, RoCE, and IPoIB.
- Strong background in network security (firewalls, ACLs, segmentation).
- Experience with HPC network topologies and GPU‑centric bandwidth optimization.
- Problem‑solving mindset with the ability to troubleshoot performance bottlenecks and latency issues.
Key Responsibilities:
- Configure, maintain, and upgrade InfiniBand and high‑speed Ethernet fabrics (NVIDIA/Mellanox, Micas Networks).
- Optimize RDMA and GPU‑to‑GPU communication paths for maximum throughput.
- Manage network switches and implement secure segmentation using VLANs and ACLs.
- Identify and resolve bottlenecks, latency issues, and packet loss.
- Lead network expansion, capacity planning, and technology evaluations.
- Develop automation scripts/tools to streamline operations (e.g., configuration, monitoring).
- Oversee infrastructure monitoring using Prometheus, Grafana, or similar.
- Collaborate with storage teams to optimize Ceph cluster networking.
- Design and implement multi‑site connectivity (VPN, WAN, direct interconnect).
- Maintain cloud networking integrations across AWS, GCP, and Azure (VPC/VNet, Direct Connect, ExpressRoute).
Required Skills:
- L2/L3 networking, BGP, OSPF, VLAN, ACL, firewall configuration.
- InfiniBand (RDMA, IPoIB) and 100Gb+ Ethernet (RoCE) expertise.
- Networking hardware: Broadcom Tomahawk, NVIDIA/Mellanox switches.
- Network security design and implementation.
- Troubleshooting and performance tuning.
- Automation (Ansible, Python, shell scripting).
- Monitoring & observability (Prometheus, Grafana).
- Distributed storage network understanding (Ceph).
- Cloud networking (VPC, VPN, Direct Connect/ExpressRoute).
- Familiarity with multi‑site WAN optimization.
Required Education & Certifications:
- Bachelor’s degree in Computer Science, Information Technology, Electrical Engineering, or related field.
- Relevant networking certifications preferred: CCNA, CCNP, JNCIP, or equivalent.
- Certifications in HPC or cloud networking (e.g., AWS Certified Advanced Networking, GCP Professional Cloud Network Engineer) are a plus.