The High Performance Computing (HPC) Development group develops technologies to improve HPC operations and efficiency. We focus on systems for extreme HPC architectures, HPC operational analysis, and high-performance data storage and transfer.
Our Heterogeneous Advanced Architecture Platforms (HAAPs) team supports and develops innovative solutions for the operation and efficient utilization of leading-edge technology systems with the latest processors, GPUs, and network fabrics. These testbed systems are used to deploy and analyze node- and rack-scale preproduction and prototype architectures to assess their suitability for future HPC platform acquisitions. In close partnership with the Scalable Computer Architecture group, part of the Center for Computing Research (CCR), the HAAPs team’s testbed support enables CCR’s exploration of application performance, programming models, memory subsystems, power/energy research, and other areas. Our Advanced Technology Systems (ATS) testbeds enable application porting within Sandia’s environment in preparation for extreme-scale production calculations on the ATS platforms Trinity (LANL) and Sierra (LLNL).
Our HPC Operational Analysis team develops new operational methodologies based on advanced system analytics. Centered on Sandia’s R&D 100 Award-winning Lightweight Distributed Metric Service (LDMS) monitoring software, the team develops monitoring, analysis, and response software and methodologies that enable new insights into the performance and utilization of extreme-scale HPC platforms and the applications that run on them. We leverage our domain knowledge from operating the HAAPs systems to enable performance understanding for leading-edge architectures.
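As an illustration of the kind of analysis such monitoring data supports, the sketch below summarizes per-node free-memory samples from a CSV metric store. It is a minimal example only: the file path and column names ("hostname", "MemFree") are assumptions for illustration, not the actual LDMS deployment or schema used at Sandia.

```python
import csv
from collections import defaultdict

def summarize_memfree(path="meminfo.csv"):
    """Print min/mean free memory per node from a CSV metric store.

    Sketch only: the path and the "hostname"/"MemFree" column names are
    hypothetical, not the production LDMS store configuration.
    """
    samples = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            samples[row["hostname"]].append(float(row["MemFree"]))
    for host, vals in sorted(samples.items()):
        print(f"{host}: min={min(vals):.0f} kB, "
              f"mean={sum(vals) / len(vals):.0f} kB over {len(vals)} samples")

if __name__ == "__main__":
    summarize_memfree()
```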
Our HPC Data Management team develops and deploys the High Performance Storage System (HPSS) as part of a DOE collaboration with IBM. We support over 85 systems, including four production tape libraries. Our team deploys and supports the data transfer tools for moving HPC data both within Sandia and across the wide area network between the tri-labs (SNL, LANL, LLNL). We are currently developing a new data transfer tool for extreme-scale computing, which can generate data sets in the 500 TB range, single files in the 100 TB range, and billions of files. Upcoming work includes collaboration on developing new methodologies for user metadata tagging in our complex environment and supporting the large-data storage and retrieval needs of our HPC Operational Analysis work.
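To give a rough sense of the scale involved, the back-of-the-envelope calculation below estimates how long a 500 TB data set takes to move at several sustained rates. The rates are assumed values chosen for illustration, not measured tri-lab network figures.

```python
# Illustrative arithmetic only: wall-clock time to move a 500 TB data set
# at assumed sustained transfer rates (hypothetical, not measured values).
DATASET_TB = 500

for rate_gbps in (10, 40, 100):  # assumed sustained rates in Gb/s
    seconds = DATASET_TB * 8e12 / (rate_gbps * 1e9)
    print(f"{rate_gbps:>4} Gb/s sustained -> {seconds / 3600:.1f} hours")
```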
For additional information:
- Heterogeneous Advanced Architecture Platforms (HAAPs) testbeds at http://www.sandia.gov/asc/computational_systems/HAAPS.html
- Lightweight Distributed Metric Service (LDMS) at ovis.sandia.gov