AI/HPC Network Development Engineer - Networking: xAI

xAI is seeking a highly motivated engineer with deep experience in large-scale networking to help build and optimize its massive GPU clusters. The company has a track record of rapidly deploying 100k-GPU clusters on an Ethernet network.

This role requires the ability to develop at hyper-scale, focusing on optimizing performance and availability of the network used for AI training and inference. You will be instrumental in designing the next iteration of the network to seamlessly scale new GPU infrastructure.

Key Responsibilities
Performance Optimization: Spend time deep inside network components like NCCL to optimize configurations and ensure no performance is left on the table for training models and customer inference queries.

Development & Automation: Develop at hyper-scale and use Python to automate repetitive tasks and to analyze large datasets.

Infrastructure Design: Help design the next iteration of the backend and front-end networks to enable seamless, rapid build-out of new GPU infrastructure.

Monitoring: Build metric dashboards and develop a portfolio of performance and operations metrics to optimize fleet traffic.

Operations: Participate in a team on-call rotation and assist in scaling and maintenance efforts, with the goal of reducing repetitive operational tasks through automation.

Required Qualifications
Experience: A minimum of 10 years designing and operating large-scale networks, with 5 years specifically in the Ethernet AI/HPC space.

Networking Expertise: Deep understanding of congestion control on Ethernet with RoCEv2 (Remote Direct Memory Access over Converged Ethernet, version 2); InfiniBand experience is a bonus.

AI Workloads: Deep understanding of AI training and inference workloads and how they operate on the network. This includes the ability to use and debug NCCL (NVIDIA Collective Communications Library) and potentially contribute code to the library.

Metrics: Expertise in creating a portfolio of metrics for performance and operations.

Automation: Experience with Python to automate tasks and analyze large sets of data.

Interview Process
The main interview process, following a successful CV review and initial phone interview, consists of five stages:

Coding assessment in a language of your choice.

Data center network technologies and RoCEv2 technical interview.

Manager Interview.

Meet and greet with the wider team, where you will present a body of work you are proud of.

A final interview stage (not described; implied by the five-stage structure).
