Meta
In this role, you will be a member of the Network.AI Software team, part of the broader DC networking organization. The team develops and owns the software stack around NCCL (NVIDIA Collective Communications Library), which enables multi-GPU and multi-node data communication through HPC-style collectives. NCCL is integrated into PyTorch and sits on the critical path of multi-GPU distributed training. The team aims to enable Meta-wide ML products and innovations to leverage large-scale GPU training and inference through an observable, reliable, and high-performance distributed AI/GPU communication stack. The current focus is on building customized features, benchmarks, performance tuners, and software stacks around NCCL and PyTorch to improve distributed ML reliability and performance, especially for large-scale GenAI/LLM training.
Meta builds technologies that help people connect, find communities, and grow businesses. The company values innovation, inclusivity, and building immersive experiences beyond screens. Meta is committed to equal employment opportunity and providing accommodations for candidates with disabilities.
Website: meta.com
Location: Menlo Park, CA
Industry: Web Search Portals and All Other Information Services