DevOps Engineer II

Summary

I am a DevOps Engineer focused on setting up and managing HPC clusters for AI/ML workloads. My day-to-day responsibilities include:
- Build infrastructure across multiple CSPs such as AWS, Azure, GCP, and emerging providers like Nebius, based on customer requirements.
- Manage networking (EFA, IB, TCPXO), deploy monitoring components on the cluster, and build dashboards in Grafana.
- Validate clusters and communication between GPU nodes using nccl-tests and Nemotron training models.
- Automate remediation of known issues using scripts and Ansible playbooks.
- Manage users and customer authentication via OIDC on the clusters.
- Provide support during customer engagement periods and work closely with AI/ML developers to fulfill their requirements on the cluster.

I have experience working with multiple CSPs, managing K8s and VM-based (Slurm) clusters, handling customer communication, and exploring new operators and tools like Soperator and Slinky.

Expectations

I am looking for learning opportunities, attractive pay, and growth. I have no issues working after office hours or putting in hard work to meet product guidelines, as long as the pay scale supports it.

Employment Preferences

Spoken Languages

  • English - Fluent
  • Hindi - Native

Expected Base Salary

*,*00,000 INR

Academic Degree

Experience

Total Professional Experience

2 years

Startup Experience

2 years

Big-Tech Companies

no experience

Enterprise Experience

2 years