Manage and oversee the HPC (High Performance Computing) AI infrastructure, including GPU nodes, MLOps platforms, and data injection. This role involves ensuring the reliability, scalability, and efficiency of the infrastructure while optimizing resource utilization, improving deployment efficiency, and ensuring data accuracy. It also covers technical implementation, updates, and coordination with the AI development team so that infrastructure capabilities align with ongoing project needs and technological advancements.
Key Accountabilities
Follow all relevant policies, processes, standard operating procedures, and instructions and ensure work is carried out in a controlled and consistent manner. Contribute to the identification of opportunities for continuous improvement of systems, processes, and practices, considering leading practices, improvement of business processes, cost reduction, and productivity improvement. Promote the implementation of and adherence to relevant policies, processes, and operating procedures to others. Adhere to risk management rules and regulations in the delivery of own work to ensure compliance at all times.
HPC Infrastructure Management
Manage and optimize GPU nodes, MLOps platforms, and data injection systems to ensure high availability and performance. Oversee the installation, configuration, and maintenance of HPC infrastructure, ensuring optimal scalability and performance. Perform system health checks, upgrades, and optimizations to maintain efficient computational resources. Support advanced AI workloads by ensuring HPC resources are aligned with project demands and computational requirements.
Collaboration with AI Development Team
Work closely with AI, data science, and other technical teams to understand computational requirements and optimize HPC systems to meet these needs. Provide feedback on infrastructure requirements and actively participate in resource planning sessions to ensure effective support for AI model training and other computational tasks. Ensure HPC system configurations are tailored to handle evolving AI and big data requirements efficiently.
Requirements
6-8 years of experience managing HPC infrastructure, with hands-on experience in GPU nodes, AI, and MLOps systems. Experience with cluster and workload management systems such as SLURM, PBS, or Torque. Knowledge of parallel file systems (e.g., Lustre, GPFS) and storage management. Understanding of high-speed interconnects such as InfiniBand and of network configuration for HPC environments.
Qualifications
Seniority level: Mid-Senior level
Employment type: Full-time
Job function: Information Technology
Industries: IT Services and IT Consulting, IT System Data Services, and Data Infrastructure and Analytics
Manager Infrastructure • Riyadh, Saudi Arabia