GigaIO introduces single-node AI supercomputer
Note: This article originally appeared on networkworld.com.
The SuperNODE system combines 32 AMD Instinct MI210 accelerators in a single server using GigaIO’s FabreX low-latency PCIe memory fabric.
Installing and configuring high-performance computing (HPC) systems can be a considerable challenge: skilled IT pros are needed to set up the software stack, for example, and tune it for maximum performance. It isn’t like building a PC with parts bought off Newegg.
GigaIO, which specializes in infrastructure for AI and technical computing, is looking to simplify the task. The vendor recently announced a self-contained, single-node system with 32 configured GPUs in the box to offer simplified deployment of AI and supercomputing resources.
Until now, harnessing 32 GPUs has required four servers with eight GPUs apiece. There would be latency to contend with, as the servers communicate over networking protocols, and all that hardware would consume floor space.
What makes GigaIO’s device, called SuperNODE, notable is that it offers a choice of GPUs: up to 32 AMD Instinct MI210 GPUs or 24 Nvidia A100s, plus up to 1PB of storage, all connected to a single off-the-shelf server. The MI210 is a step down in performance from the top-of-the-line MI250 card (at least for now) that’s used in the Frontier exaFLOP supercomputer. It has fewer cores and less memory but is still based on AMD’s Radeon GPU technology.
“AMD collaborates with startup innovators like GigaIO in order to bring unique solutions to the evolving workload demands of AI and HPC,” said Andrew Dieckmann, corporate vice president and general manager of the data center and accelerated processing group at AMD, in a statement. “The SuperNODE system created by GigaIO and powered by AMD Instinct accelerators offers compelling TCO for both traditional HPC and generative AI workloads.”
SuperNODE is built on GigaIO’s FabreX custom fabric technology, a memory-centric fabric that cuts the latency of communication between one server’s system memory and the other servers in the system to just 200ns. This enables the FabreX Gen4 implementation to scale up to 512Gbits/sec of bandwidth.
FabreX can connect a variety of resources, including accelerators such as GPUs, DPUs, TPUs, FPGAs and SoCs; storage devices, such as NVMe and PCIe-native storage; and other I/O resources connected to compute nodes. Basically, anything that uses a PCI Express bus can be connected to FabreX for direct device-to-device communication across the same fabric.
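To put the 200ns and 512Gbits/sec figures in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes a simple latency-plus-bandwidth transfer model, treats both figures as peak values, reads "Gbits" as 10^9 bits, and ignores protocol overhead. The function names are illustrative and do not come from GigaIO.

```python
# Back-of-the-envelope model: transfer time = fixed latency + size / bandwidth.
# Assumption: the quoted 200ns and 512Gbits/sec are peak figures; real
# transfers would see protocol overhead that this sketch ignores.

FABRIC_LATENCY_S = 200e-9        # 200ns, as quoted for FabreX
FABRIC_BANDWIDTH_BPS = 512e9     # 512Gbits/sec, read as 512 * 10^9 bits/sec


def gbits_to_gbytes(gbits_per_sec):
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbits_per_sec / 8


def transfer_time_seconds(num_bytes,
                          latency_s=FABRIC_LATENCY_S,
                          bandwidth_bps=FABRIC_BANDWIDTH_BPS):
    """Estimate the time to move num_bytes across the fabric."""
    return latency_s + (num_bytes * 8) / bandwidth_bps


if __name__ == "__main__":
    print(f"Peak bandwidth: {gbits_to_gbytes(512):.0f} GB/sec")
    for size in (4_096, 1_000_000, 64_000_000_000):
        print(f"{size:>14,} bytes: {transfer_time_seconds(size):.9f} s")
```

Under these assumptions, small messages are dominated by the fixed latency, while a transfer of 64GB is dominated by bandwidth and takes about one second.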
SuperNODE has three modes of operation: beast mode, for applications that make the most of many or all GPUs; freestyle mode, where each user gets a dedicated GPU; and swarm mode, where applications run on multiple servers.
SuperNODE can run existing applications written with popular AI frameworks such as PyTorch and TensorFlow without requiring modification. It uses Nvidia’s Bright Cluster Manager Data Science software to manage and configure the environment and handle scheduling as well as container management.
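One way to picture the three modes is as different policies for handing out the same pool of GPUs. The Python sketch below is purely illustrative: the `Mode` enum, the `assign_gpus` function and the round-robin split used for swarm mode are assumptions made for this example, not GigaIO's actual interface or scheduling logic.

```python
from enum import Enum


class Mode(Enum):
    """Hypothetical labels for the three modes described in the article."""
    BEAST = "beast"
    FREESTYLE = "freestyle"
    SWARM = "swarm"


def assign_gpus(mode, num_gpus, consumers):
    """Map each consumer (an application, user or server) to GPU indices.

    Illustrative only: the real system's allocation policy is not described
    in the article.
    """
    gpus = list(range(num_gpus))
    if mode is Mode.BEAST:
        # One application gets every GPU in the pool.
        if len(consumers) != 1:
            raise ValueError("beast mode expects exactly one application")
        return {consumers[0]: gpus}
    if mode is Mode.FREESTYLE:
        # Every user gets one GPU of their own.
        if len(consumers) > num_gpus:
            raise ValueError("more users than GPUs")
        return {user: [gpu] for user, gpu in zip(consumers, gpus)}
    if mode is Mode.SWARM:
        # Assumption: GPUs are split round-robin across the servers.
        return {server: gpus[i::len(consumers)]
                for i, server in enumerate(consumers)}
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    print(assign_gpus(Mode.BEAST, 32, ["training-job"]))
    print(assign_gpus(Mode.FREESTYLE, 32, ["alice", "bob"]))
    print(assign_gpus(Mode.SWARM, 8, ["server0", "server1"]))
```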
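As an illustration of what "without requiring modification" could look like in practice, the sketch below uses PyTorch's standard device-enumeration calls, which ROCm builds of PyTorch expose for AMD GPUs under the same `torch.cuda` namespace. Whether every GPU in a SuperNODE shows up to a single process this way is an assumption; the article does not say. The helper name `list_accelerators` is hypothetical.

```python
def list_accelerators():
    """Return the names of the GPUs visible to PyTorch.

    Returns an empty list if PyTorch is not installed or no GPU is visible.
    """
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    # The same calls work on CUDA and ROCm builds of PyTorch.
    return [torch.cuda.get_device_name(i)
            for i in range(torch.cuda.device_count())]


if __name__ == "__main__":
    names = list_accelerators()
    print(f"{len(names)} accelerator(s) visible")
    for index, name in enumerate(names):
        print(f"  cuda:{index} -> {name}")
```

Code written this way does not hard-code a GPU count, so in principle the same script would report eight devices on a conventional server and more on a larger single-node system.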
SuperNODE is available now from GigaIO.
View source version on networkworld.com: https://www.networkworld.com/article/3703128/gigaio-introduces-single-node-ai-supercomputer.html