DevOps Engineer II
Summary
I am a DevOps Engineer with a strong focus on setting up and managing HPC clusters for high-performance AI/ML workloads. My day-to-day responsibilities include:
- Build infrastructure across multiple CSPs such as AWS, Azure, GCP, and emerging providers like Nebius, based on customer requirements.
- Manage networking (EFA, IB, TCPXO), deploy monitoring components on the cluster, and build dashboards in Grafana.
- Validate clusters and communication between GPU nodes using nccl-tests and Nemotron training models.
- Automate remediation of known issues using scripts and Ansible playbooks.
- Manage users and customer authentication via OIDC on the clusters.
- Provide support during customer engagement periods and work closely with AI/ML developers to fulfill their requirements on the cluster.
Experienced in working with multiple CSPs, managing K8s and VM-based (Slurm) clusters, handling customer communication, and exploring new operators and tools like Soperator and Slinky.
Expectations
I am looking for learning opportunities, attractive pay, and growth. I am comfortable working after office hours and putting in hard work to meet product guidelines, as long as the pay scale supports it.
Employment Preferences
Spoken Languages
- English - Fluent
- Hindi - Native
Expected Base Salary
*,*00,000 INR
Academic Degree
Experience
Total Professional Experience
Startup Experience
Big-Tech Companies
Enterprise Experience
Skills
