Software Engineer, SystemML - Scaling / Performance

Meta

Menlo Park, CA
Full Time
Mid Level
147k-208k
10 days ago

Job Description

About the Role

In this role, you will be a member of the Network.AI Software team, part of the broader data center (DC) networking organization. The team develops and owns the software stack around NCCL (NVIDIA Collective Communications Library), which enables multi-GPU and multi-node data communication through HPC-style collectives. NCCL is integrated into PyTorch and sits on the critical path of multi-GPU distributed training. The team's goal is to let ML products and innovations across Meta leverage large-scale GPU training and inference through an observable, reliable, and high-performance distributed AI/GPU communication stack. The current focus is on building customized features, benchmarks, performance tuners, and stacks around NCCL and PyTorch to improve the reliability and performance of distributed ML, especially large-scale GenAI/LLM training.
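
To make the NCCL-PyTorch relationship described above concrete, here is a minimal sketch (an illustration, not Meta's internal stack or code from the posting) of a multi-GPU all_reduce running over PyTorch's NCCL backend; the script name, launch command, and variable names are assumptions.

    # Rough sketch: PyTorch's distributed package using NCCL as the backend
    # for a multi-GPU all_reduce collective. Assumes launch via
    # `torchrun --nproc_per_node=<num_gpus> demo.py`, which sets RANK,
    # WORLD_SIZE, and LOCAL_RANK, with one GPU per process.
    import os

    import torch
    import torch.distributed as dist


    def main() -> None:
        dist.init_process_group(backend="nccl")  # NCCL handles GPU-to-GPU collectives
        torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

        # Each rank contributes its rank id; all_reduce sums the values across GPUs.
        t = torch.tensor([float(dist.get_rank())], device="cuda")
        dist.all_reduce(t, op=dist.ReduceOp.SUM)
        print(f"rank {dist.get_rank()}: sum of ranks = {t.item()}")

        dist.destroy_process_group()


    if __name__ == "__main__":
        main()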

Key Responsibilities

  • Enabling reliable and highly scalable distributed ML training on Meta's large-scale GPU training infrastructure with a focus on GenAI/LLM scaling.

Requirements

  • Bachelor's degree in Computer Science, Computer Engineering, a relevant technical field, or equivalent practical experience.
  • Specialized experience in one or more of the following: Distributed ML Training, GPU architecture, ML systems, AI infrastructure, high performance computing, performance optimizations, or Machine Learning frameworks (e.g., PyTorch).

Nice to Have

  • Knowledge of GPU architectures and CUDA programming.
  • Experience working with DL frameworks like PyTorch, Caffe2, or TensorFlow.
  • Experience in AI framework and trainer development for large-scale distributed deep learning models.
  • PhD in Computer Science, Computer Engineering, or relevant technical field.
  • Experience with both data parallel and model parallel training, such as Distributed Data Parallel (DDP), Fully Sharded Data Parallel (FSDP), Tensor Parallel, and Pipeline Parallel (a brief DDP/FSDP sketch follows this list).
  • Experience in HPC and parallel computing.
  • Knowledge of ML, deep learning, and LLMs.
  • Experience with NCCL and with improving distributed GPU reliability/performance over RoCE/InfiniBand.
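
The data-parallel techniques named in the list above map directly onto PyTorch wrappers. Below is a minimal sketch (an illustration, not the team's code) of wrapping a toy model with DDP and FSDP; it assumes a NCCL process group has already been initialized as in the earlier sketch, and the model itself is a placeholder.

    # Rough sketch of the parallelism strategies named above, assuming a NCCL
    # process group is already initialized and there is one GPU per rank.
    # The toy model is a placeholder, not from the posting.
    import torch
    import torch.nn as nn
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
    from torch.nn.parallel import DistributedDataParallel as DDP


    def build_model() -> nn.Module:
        return nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))


    # Distributed Data Parallel: every rank holds a full model replica and
    # gradients are all-reduced across ranks (over NCCL) after each backward pass.
    ddp_model = DDP(build_model().cuda(), device_ids=[torch.cuda.current_device()])

    # Fully Sharded Data Parallel: parameters, gradients, and optimizer state are
    # sharded across ranks and gathered only when needed, trading extra
    # communication for a much smaller per-GPU memory footprint.
    fsdp_model = FSDP(build_model().cuda())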

Benefits & Perks

  • Competitive compensation ranging from $70.67/hour to $208,000/year (roughly $147k-$208k annualized), plus bonus, equity, and benefits.
  • Additional benefits offered by Meta.

Working at Meta

Meta builds technologies that help people connect, find communities, and grow businesses. The company values innovation, inclusivity, and building immersive experiences beyond screens. Meta is committed to equal employment opportunity and providing accommodations for candidates with disabilities.

Job Details

Posted At: Jul 13, 2025
Salary: 147k-208k
Job Type: Full Time
Experience: Mid Level

About Meta

Website: meta.com
Location: Menlo Park, CA
Industry: Web Search Portals and All Other Information Services
