National Cyberinfrastructure Prototype Moves into Full System Operation
Note: This article originally appeared on www.sdsc.edu.
Published May 10, 2023
A multimillion-dollar cyberinfrastructure resource for the national research community has reached a milestone. The Prototype National Research Platform (PNRP)—an innovative system, funded by the National Science Foundation and created to advance scientific discoveries—has entered formal operations as a testbed for exploring a wide range of hardware and new approaches for moving data over high-performance content delivery networks.
Having successfully completed its acquisition review, PNRP will operate as a testbed for the next three years, during which researchers will explore the system’s design and hardware for use in science and engineering research. Innovative features include field programmable gate arrays (FPGAs), composable infrastructure and graphics processing units (GPUs). Following the testbed phase, PNRP will become broadly available through a formal allocations process.
“Reaching this milestone is a culmination of a multi-year process from proposal, through acquisition, deployment, early user operations and formal review. It means the attainment of our goal to provide the research community with an open system created for growth and inclusion; a way for academic institutions to join and participate in a national system to enlarge and enrich the national cyberinfrastructure ecosystem,” said Frank Würthwein, director of the San Diego Supercomputer Center (SDSC) at the University of California San Diego.
As a distributed system, PNRP features hardware at three primary sites—SDSC, the University of Nebraska-Lincoln (UNL) and the Massachusetts Green High Performance Computing Center (MGHPCC). In addition to the computing hardware at each of the primary sites, the system includes five data caches co-located with and distributed along the Internet2 network backbone. The data caches provide data replication and movement services that reduce round-trip latencies from anywhere in the U.S. to about 10 milliseconds (0.01 seconds).
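As a rough back-of-envelope check (an illustration, not a figure from the project): signals in optical fiber propagate at about two-thirds the speed of light, roughly 200 km per millisecond, so a 10 ms round trip corresponds to a one-way fiber path of at most about

$$d \approx \frac{10\,\text{ms}}{2} \times 200\,\frac{\text{km}}{\text{ms}} = 1000\,\text{km},$$

and less in practice once routing and switching overheads are counted. Distributing a handful of caches along the Internet2 backbone is what keeps most U.S. institutions within that latency budget of a cache.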
“The PNRP collaboration represents the future of distributed research computing, where the sources and users of data are part of an integrated fabric. We are excited to support this next phase of the project and look forward to working with the PNRP team over the coming years to realize a vision of enabling research data, anywhere, any time,” said James Deaton, vice president of Network Services at Internet2.
Reliability testing of the system was conducted to identify problems that occur only in rare instances or that become apparent only when running at scale. According to PNRP administrators, the tests showed that the PNRP hardware at each of the facility sites performed well.
“One of the most interesting features of the PNRP is the distributed systems management model,” said Derek Weitzel, who leads systems administration for the new platform at UNL. “PNRP was integrated into existing infrastructure that had been developed over the past several years. The Kubernetes-based approach substantially reduced the time needed to deploy and integrate hardware. UNL received the cluster on a Monday and had jobs running on Friday of that same week, something that would be nearly impossible with a traditional HPC cluster.”
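The article does not detail PNRP’s deployment tooling beyond naming Kubernetes, but a minimal sketch may help illustrate why that model shortens stand-up time: once a node joins the cluster and advertises its resources, work can be submitted as declarative job objects rather than through site-specific scheduler setup. The sketch below uses the official Kubernetes Python client; the job name, namespace and container image are hypothetical placeholders, not PNRP specifics.

```python
# Minimal sketch: submitting a containerized GPU job to a Kubernetes
# cluster with the official Python client ("kubernetes" on PyPI).
# The job name, namespace and image are hypothetical placeholders.
from kubernetes import client, config

config.load_kube_config()  # reads credentials from ~/.kube/config

job = client.V1Job(
    metadata=client.V1ObjectMeta(name="demo-gpu-job"),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="demo",
                        image="nvidia/cuda:12.2.0-base-ubuntu22.04",
                        command=["nvidia-smi"],
                        # Request one GPU via the standard device-plugin
                        # resource name used by Kubernetes GPU scheduling.
                        resources=client.V1ResourceRequirements(
                            limits={"nvidia.com/gpu": "1"}
                        ),
                    )
                ],
            )
        )
    ),
)

# Hand the declarative job object to the cluster API; the scheduler
# places it on any node that can satisfy the GPU request.
client.BatchV1Api().create_namespaced_job(namespace="demo-namespace", body=job)
```

Because the job is pure data handed to the cluster API, the same specification runs unchanged on any newly added node that can satisfy the resource request, which is consistent with the Monday-to-Friday hardware turnaround Weitzel describes.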
John Goodhue is executive director of MGHPCC, which is operated by a consortium of universities in the Northeast, serves thousands of researchers locally and around the world, and houses one of the PNRP GPU resources, providing a full complement of data center facilities, networking, security and 24/7 operations. “We are pleased to be collaborating on PNRP, which, like MGHPCC, seeks to strengthen the national CI ecosystem through regionally based partnerships,” Goodhue said. “PNRP is innovative in technological and organizational dimensions, both of which are essential ingredients to advancing research.”
Early-user feedback
PNRP underwent a 30-day Early User Operations phase, during which the system was put through its paces on real-world applications in preparation for formal operations. Early use cases ranged from studies on autonomous agents (e.g., robots, drones and cars) and cerebral organoids to synthesizing textures for 3D shapes and estimating sea surface temperature in cloudy conditions. Early users included researchers from University of California campuses, MIT, the International Gravitational Wave Network and others. Following are examples of early use cases:
IceCube Neutrino Observatory
IceCube is located at the South Pole and consists of 5,160 digital optical modules (DOMs) distributed over one cubic kilometer (km³) of ice. Determining the direction of incoming neutrinos depends critically on accurately modeling the optical properties of the ice. This numerically intensive process requires up to 400 GPU-years, and a new model must be constructed annually to account for ice flow.
The observatory’s computing director, Benedikt Riedel, said, “PNRP’s usability was very good and porting efforts were minimal, with only storage needing to be accessed differently and the computation appearing like any other Open Science Grid (OSG) site,” adding that performance of the A10 GPUs was excellent.
Genomics processing and analysis
UC San Diego’s Tianqi Zhang and Tajana Rosing, one of the PNRP co-principal investigators, developed applications that run on FPGA accelerators for basic genomics processing components, like sequence trimming and alignment, and integrated them with the pipelines for COVID-19 phylogenetic inference, microbial metagenome analysis and cancer variant detection.
“It’s pretty easy to migrate the previous programs to the new U55C cluster [PNRP]. The development platform is also similar to the local environment, with only a few board configurations needing administrator intervention. We are currently scaling up and optimizing the accelerators on the multi-FPGA nodes. If successful, it will provide O(10x) speedup and O(100x) power savings compared to CPU,” said Zhang.
According to Robert Sinkovits, an expert in scientific applications at SDSC, with the variety and scale of applications and use cases, SDSC “feels confident the [scientific] community will be able to make excellent use of PNRP.”
Support from Industry
Industry partners provide key technical features of the HPC subsystem, including a mix of FPGAs, GPUs, memory and storage in a fully integrated, extremely low-latency fabric from GigaIO, which supplies the new platform’s composable architecture. PNRP’s high-performance, low-latency cluster, integrated by Applied Data Systems (ADS), features composable PCIe fabric technology along with FPGAs and FP64 GPUs; two additional A10-based GPU clusters, integrated by Supermicro, are located at UNL and MGHPCC.
According to GigaIO, composability provides users flexibility and the ability to use accelerators such as GPUs and FPGAs in an easy-to-orchestrate, reconfigurable system that saves time and makes optimal use of the resources. “The ability to build formerly impossible computing configurations and seamlessly transform systems to match workloads enables customers like SDSC to do more science for less money. We are proud to have worked closely with SDSC, ADS and Gigabyte to bring this revolutionary system online and make it available to all PNRP researchers,” said Alan Benjamin, CEO of GigaIO.
ADS President Craig Swanson said that it was an honor to be selected as the integration vendor partner to build, configure and support the cutting-edge composable infrastructure. “It’s only through our ability to execute and work closely with our partners that we are able to stand up such bleeding-edge technology to aid the research community’s quest to push the boundaries of science,” he said.
PNRP is supported by the National Science Foundation (award no. 2112167).
View source version on www.sdsc.edu: https://www.sdsc.edu/News%20Items/PR20230510_PNRP.html