Software Engineer, SystemML - Scaling / Performance

Meta

Menlo Park, CA
Full Time
Mid Level
147k-208k
10 days ago

Job Description

About the Role

In this role, you will be a member of the Network.AI Software team, part of the broader data center (DC) networking organization. The team develops and owns the software stack around NCCL (NVIDIA Collective Communications Library), which enables multi-GPU and multi-node data communication through HPC-style collectives. NCCL is integrated into PyTorch and sits on the critical path of multi-GPU distributed training. The team's goal is to let ML products and innovations across Meta leverage large-scale GPU training and inference through an observable, reliable, and high-performance distributed AI/GPU communication stack. The current focus is on building customized features, benchmarks, performance tuners, and stacks around NCCL and PyTorch to improve the reliability and performance of distributed ML, especially large-scale GenAI/LLM training.
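
To make the NCCL-PyTorch relationship described above concrete, here is a minimal sketch (an illustration, not Meta's internal stack or code from the posting) of a multi-GPU all_reduce running over PyTorch's NCCL backend; the script name, launch command, and variable names are assumptions.

    # Rough sketch: PyTorch's distributed package using NCCL as the backend
    # for a multi-GPU all_reduce collective. Assumes launch via
    # `torchrun --nproc_per_node=<num_gpus> demo.py`, which sets RANK,
    # WORLD_SIZE, and LOCAL_RANK, with one GPU per process.
    import os

    import torch
    import torch.distributed as dist


    def main() -> None:
        dist.init_process_group(backend="nccl")  # NCCL handles GPU-to-GPU collectives
        torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

        # Each rank contributes its rank id; all_reduce sums the values across GPUs.
        t = torch.tensor([float(dist.get_rank())], device="cuda")
        dist.all_reduce(t, op=dist.ReduceOp.SUM)
        print(f"rank {dist.get_rank()}: sum of ranks = {t.item()}")

        dist.destroy_process_group()


    if __name__ == "__main__":
        main()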

Key Responsibilities

  • Enabling reliable and highly scalable distributed ML training on Meta's large-scale GPU training infrastructure with a focus on GenAI/LLM scaling.

Requirements

  • Bachelor's degree in Computer Science, Computer Engineering, a relevant technical field, or equivalent practical experience.
  • Specialized experience in one or more of the following: Distributed ML Training, GPU architecture, ML systems, AI infrastructure, high performance computing, performance optimizations, or Machine Learning frameworks (e.g., PyTorch).

Nice to Have

  • Knowledge of GPU architectures and CUDA programming.
  • Experience working with DL frameworks like PyTorch, Caffe2, or TensorFlow.
  • Experience in AI framework and trainer development for large-scale distributed deep learning models.
  • PhD in Computer Science, Computer Engineering, or relevant technical field.
  • Experience with both data parallel and model parallel training, such as Distributed Data Parallel (DDP), Fully Sharded Data Parallel (FSDP), Tensor Parallel, and Pipeline Parallel (a brief DDP/FSDP sketch follows this list).
  • Experience in HPC and parallel computing.
  • Knowledge of ML, deep learning, and LLMs.
  • Experience with NCCL and with improving distributed GPU reliability/performance over RoCE/InfiniBand.
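
The data-parallel techniques named in the list above map directly onto PyTorch wrappers. Below is a minimal sketch (an illustration, not the team's code) of wrapping a toy model with DDP and FSDP; it assumes a NCCL process group has already been initialized as in the earlier sketch, and the model itself is a placeholder.

    # Rough sketch of the parallelism strategies named above, assuming a NCCL
    # process group is already initialized and there is one GPU per rank.
    # The toy model is a placeholder, not from the posting.
    import torch
    import torch.nn as nn
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
    from torch.nn.parallel import DistributedDataParallel as DDP


    def build_model() -> nn.Module:
        return nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))


    # Distributed Data Parallel: every rank holds a full model replica and
    # gradients are all-reduced across ranks (over NCCL) after each backward pass.
    ddp_model = DDP(build_model().cuda(), device_ids=[torch.cuda.current_device()])

    # Fully Sharded Data Parallel: parameters, gradients, and optimizer state are
    # sharded across ranks and gathered only when needed, trading extra
    # communication for a much smaller per-GPU memory footprint.
    fsdp_model = FSDP(build_model().cuda())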

Benefits & Perks

  • Competitive compensation ranging from $70.67/hour to $208,000/year (roughly $147k-$208k annualized), plus bonus, equity, and benefits.
  • Additional benefits offered by Meta.

Working at Meta

Meta builds technologies that help people connect, find communities, and grow businesses. The company values innovation, inclusivity, and building immersive experiences beyond screens. Meta is committed to equal employment opportunity and providing accommodations for candidates with disabilities.

Job Details

Posted At: Jul 13, 2025
Salary: 147k-208k
Job Type: Full Time
Experience: Mid Level

About Meta

Website: meta.com
Location: Menlo Park, CA
Industry: Web Search Portals and All Other Information Services
