MLOps Engineer

AMD

Atlanta, GA
Full Time
Mid Level
143k-215k
6 days ago

Job Description

About the Role

AMD is seeking a driven and collaborative MLOps Engineer to join our Engineering Operations team in Atlanta. You will support and optimize large-scale, multi-GPU/CPU ML infrastructure to enable world-class AI and rendering research. Collaborating with teams across North America and Europe, you will design robust, automated pipelines and help push the boundaries of machine learning and high-performance compute in a production data center environment.

We care deeply about transforming lives with AMD technology to enrich our industry, communities, and the world. Our mission is to build great products that accelerate next-generation computing experiences across data centers, artificial intelligence, PCs, gaming, and embedded systems. AMD values innovation, execution excellence, collaboration, humility, and inclusivity.

Key Responsibilities

  • Architect, deploy, and maintain high-availability Linux/GPU/CPU server clusters for ML workloads, ensuring optimal performance, security, and scalability.
  • Collaborate cross-functionally with data science, research, and IT teams across North America and Europe to streamline ML model training, testing, deployment, and monitoring pipelines.
  • Build and automate end-to-end CI/CD workflows for ML using tools such as MLflow, DVC, Kubeflow, Airflow, or similar.
  • Configure, monitor, and optimize large-scale NAS and data transfer for sharing models, datasets, and training results.
  • Proactively monitor infrastructure and application health using tools like Prometheus and Grafana, addressing performance bottlenecks, failures, and incidents.
  • Implement robust security, user management, and access protocols in line with international compliance standards such as GDPR.
  • Document processes, workflows, and troubleshooting guides for global teams; support remote debugging and rapid incident response.
  • Stay abreast of trends in AI infrastructure, MLOps toolchains, and AMD hardware accelerators.

Requirements

  • Strong programming/scripting background in Python, Bash, or Go.
  • Proven experience with Linux server administration.
  • Practical experience managing GPU/CPU clusters and Kubernetes orchestration.
  • Experience with infrastructure automation tools such as Ansible and Terraform.
  • Familiarity with MLOps stacks including MLflow, DVC, Kubeflow, Flyte, or Airflow.
  • Experience monitoring and troubleshooting distributed workloads for ML/AI, HPC, or rendering.
  • Knowledge of configuring and managing NAS or other distributed file systems for large data.
  • Understanding of networking concepts such as TCP/IP, VLANs, firewalls, and data privacy and compliance.

Nice to Have

  • Previous support experience with render farms or real-time graphics pipelines.
  • Experience supporting large-scale data transfer and storage solutions.
  • Knowledge of AMD hardware accelerators.

Qualifications

  • Degree in Computer Science, Computer Engineering, Electrical Engineering, or a closely related field.

Benefits & Perks

  • Benefits offered are described in "AMD benefits at a glance."

Working at AMD

AMD values innovation, collaboration, humility, and inclusivity. We push the limits of technology to solve important challenges and strive for execution excellence while embracing diverse perspectives.

Job Details

Posted At: Jul 18, 2025
Job Category: DevOps
Salary: 143k-215k
Job Type: Full Time
Work Mode: Onsite
Experience: Mid Level

About AMD

Website: amd.com
Location: Atlanta, GA
Industry: Semiconductor and Related Device Manufacturing