MLOps Engineer

AMD

Atlanta, GA

Full Time

Mid Level

143k-215k

6 days ago

Job Description

About the Role

AMD is seeking a driven and collaborative MLOps Engineer to join our Engineering Operations team in Atlanta. You will support and optimize large-scale, multi-GPU/CPU ML infrastructure to enable world-class AI and rendering research. Collaborating with teams across North America and Europe, you will design robust, automated pipelines and help push the boundaries of machine learning and high-performance compute in a production data center environment.

We care deeply about transforming lives with AMD technology to enrich our industry, communities, and the world. Our mission is to build great products that accelerate next-generation computing experiences across data centers, artificial intelligence, PCs, gaming, and embedded systems. AMD values innovation, execution excellence, collaboration, humility, and inclusivity.

Key Responsibilities

Architect, deploy, and maintain high-availability Linux/GPU/CPU server clusters for ML workloads, ensuring optimal performance, security, and scalability.
Collaborate cross-functionally with data science, research, and IT teams across North America and Europe to streamline ML model training, testing, deployment, and monitoring pipelines.
Build and automate end-to-end CI/CD workflows for ML using tools such as MLflow, DVC, Kubeflow, or Airflow.
Configure, monitor, and optimize large-scale NAS storage and data transfer for sharing models, datasets, and training results.
Proactively monitor infrastructure and application health using tools like Prometheus and Grafana, addressing performance bottlenecks, failures, and incidents.
Implement robust security, user management, and access controls in line with international data protection regulations such as GDPR.
Document processes, workflows, and troubleshooting guides for global teams; support remote debugging and rapid incident response.
Stay abreast of trends in AI infrastructure, MLOps toolchains, and AMD hardware accelerators.

Requirements

Strong programming/scripting background in Python, Bash, or Go.
Proven experience with Linux server administration.
Practical experience managing GPU/CPU clusters and Kubernetes orchestration.
Experience with infrastructure automation tools such as Ansible and Terraform.
Familiarity with MLOps stacks including MLflow, DVC, Kubeflow, Flyte, or Airflow.
Experience monitoring and troubleshooting distributed workloads for ML/AI, HPC, or rendering.
Knowledge of configuring and managing NAS or other distributed file systems for large data.
Understanding of networking concepts such as TCP/IP, VLANs, and firewalls, as well as data privacy and compliance requirements.

Nice to Have

Previous support experience with render farms or real-time graphics pipelines.
Experience supporting large-scale data transfer and storage solutions.
Knowledge of AMD hardware accelerators.

Qualifications

Degree in Computer Science, Computer Engineering, Electrical Engineering, or a closely related field.

Benefits & Perks

Benefits offered are described in AMD benefits at a glance.

Working at AMD

AMD values innovation, collaboration, humility, and inclusivity. We push the limits of technology to solve important challenges and strive for execution excellence while embracing diverse perspectives.
