Staff HPC Infrastructure Engineer

Company: Guardant Health
Job type: Full-time

Job Description
About the Role:
Guardant’s HPC team builds and operates the computational technology backbone of the company. 
This includes scalable data storage that holds petabytes of genomics data, high-performance compute clusters running a custom bioinformatics pipeline in production and R&D environments, and the software infrastructure that hosts an ecosystem of services for internal data processing and external data integration. To support Guardant Health’s fast growth over the next few years, the HPC team is looking for a strong technical engineer who can help maintain and grow the HPC infrastructure during its aggressive expansion, while working with corporate IT, SQA, and DevOps/SRE teams.
We prefer a candidate who is local to the San Francisco Bay Area and can work on premises in Redwood City and Palo Alto, but this role can be performed mostly remotely. On-site presence at the work location is required while on rotation, during maintenance windows, and during cluster deployments.
Essential Duties and Responsibilities:
Act as a technical lead in day-to-day operations
Help manage the HPC interconnects
Help integrate the HPC systems with the bandwidth-on-demand system
Help integrate the HPC system with the single-namespace storage system
Help integrate cloud bursting as part of the HPC abstraction work
Work with the networking infrastructure team to manage and optimize the connectivity to and from the HPC systems and locales
Help manage multiple HPC clusters and cluster file systems
Help research, develop, and implement the next-generation HPC solution
Troubleshoot the production system stack down to the source-code level (e.g., shell scripts, Python, and others)
Maintain, monitor, and support the infrastructure environment and/or facilities
Use and maintain enhanced production monitoring and additional capabilities
Support improvements for increased system reliability and performance
Support multiple systems or applications of medium to high complexity (complexity is defined by size, technology used, and system feeds and interfaces) with multiple concurrent users, ensuring control, integrity, and accessibility
Support systems at remote locations, including internationally
Work with offsite consultants to maintain the infrastructure
Work with vendors to troubleshoot, upgrade, and repair systems as needed
Participate in a 24/7 on-call rotation
