Data Center Hardware & Equipment Specialist
Company: King Abdullah University of Science & Technology
Position Summary
The Data Center Hardware & Equipment Specialist at KAUST's flagship supercomputer facility plays a crucial role in ensuring the security and operational continuity of the physical infrastructure. This position oversees the installation, configuration, and maintenance of servers, including GPUs, within the CRAY supercomputer environment and other hosted equipment. The Specialist also ensures compliance with hardware tracking processes and updates the asset register, collaborating with various teams to meet procurement and regulatory requirements while maintaining high standards of security and operational efficiency.
Working within the Scientific Computing Center (SCC) data center and other data centers that host HPC systems, including Shaheen III, the Data Center Hardware & Equipment Specialist manages data center hardware and equipment, ensuring that all components are properly installed and configured to support operations. The Specialist maintains detailed records of hardware assets, tracks equipment status, and ensures compliance with established protocols. This role is essential to maintaining the integrity and performance of the HPE / CRAY supercomputer's infrastructure.
Job Description
Major Responsibilities
- Oversee the installation, configuration, and maintenance of servers, including GPUs
- Ensure compliance with hardware tracking processes and update the asset register
- Perform regular inspections and maintenance of data center hardware
- Monitor physical conditions of servers and other IT infrastructure
- Manage network cables, server hardware and other equipment
- Collaborate with vendors to proactively manage backup part inventory to ensure uptime
- Work with compliance officers to ensure accurate accounting of controlled equipment
- Follow data center access protocols to ensure the security of the system and prevent unauthorized removal of parts
- Respond to alarms and incidents, providing immediate resolution
- Collaborate with procurement teams to acquire necessary equipment
- Ensure adherence to export control regulations and NIST SP 800-53 standards
- Maintain detailed records of hardware assets and track equipment status
- Oversee / perform complex software / hardware troubleshooting, patches, and re-installations
- Manage infrastructure capacity and performance, verifying application logs and monitoring activity
Personal Requirements
- Demonstrates expertise in managing and maintaining HPE / CRAY supercomputer hardware infrastructure, including servers, storage systems, and networking equipment
- Shows proficiency in handling GPUs and optimizing the infrastructure performance within the supercomputer environment
- Exhibits strong understanding of HPC systems and architectures
- Communicates effectively with stakeholders
- Uses monitoring tools to optimize supercomputer infrastructure performance
- Demonstrates analytical skills to troubleshoot and resolve hardware issues
- Shows flexibility to adapt to new technologies and changing business needs
- Ensures compliance with hardware tracking processes
- Collaborates across teams to achieve goals
- Ability to manage logistics of heavy equipment and work in confined spaces
Experience
- Detailed knowledge of HPE / CRAY supercomputer hardware infrastructure and NVIDIA supercomputer GPUs, including installation, maintenance, and troubleshooting of the supporting infrastructure
- Experience working in data centers, managing large-scale hardware deployments, and ensuring uptime and reliability
- Proven track record in overseeing the installation, configuration, and maintenance of servers and data center equipment
- Familiarity with hardware tracking processes and asset register management
Qualifications
- Bachelor’s degree in Computer Science, Information Technology, Electrical Engineering, Electronics Engineering or related field
- Relevant certifications preferred (e.g., CDCTP, DCCA, CCNP Data Center, RCDD)
- Minimum of 7 years of experience in managing HPC hardware and data center equipment / infrastructure
- Knowledge of data center infrastructure and operations
- Understanding of IT asset management for controlled equipment