GigaIO Secures Largest Order for SuperNODE Featuring AMD Instinct MI300X Accelerators
Note: This article originally appeared on hpcwire.com; the original piece is linked at the end of this post.
December 6, 2023
SAN JOSE, Calif., Dec. 6, 2023 — GigaIO, an award-winning provider of open workload-defined infrastructure for AI and accelerated computing, today announced the largest order yet for its flagship SuperNODE, which will utilize tens of thousands of AMD Instinct MI300X accelerators launching today. GigaIO’s novel infrastructure will form the backbone of a bare-metal specialized AI cloud codenamed “TensorNODE,” to be built by cloud provider TensorWave to supply access to AMD data center GPUs, especially for use in LLMs.
The SuperNODE was the world’s first 32-GPU single-node supercomputer when it launched in June, and it won two coveted HPCwire Editors’ Choice Awards at Supercomputing 2023 in Denver last month: Best AI Product or Technology, and Top 5 New Products or Technologies to Watch. The TensorNODE deployment will build on this architecture at a far greater scale, leveraging GigaIO’s PCIe Gen 5 memory fabric to provide simpler workload setup and deployment than is possible with legacy networks, while eliminating the associated performance tax.
“TensorWave is excited to bring this innovative solution to market with GigaIO and AMD,” said Darrick Horton, CEO of TensorWave. “We selected the GigaIO platform because of its superior capabilities, in addition to GigaIO’s alignment with our values and commitment to open standards. We’re leveraging this novel infrastructure to support large-scale AI workloads and we are proud to be collaborating with AMD as one of the first cloud providers to deploy the MI300X accelerator solutions.”
Maximizing GPU utilization is key in this era of GPU scarcity, but large AI workloads also demand significant VRAM and memory bandwidth. TensorWave will use FabreX to create the first petabyte-scale GPU memory pool without the performance impact of non-memory-centric networks. The first installment of TensorNODE is expected to be operational in early 2024, with an architecture that will support up to 5,760 GPUs across a single FabreX memory fabric domain. Workloads will be able to access more than a petabyte of VRAM in a single job from any node, enabling even the largest jobs to be completed in record time. Throughout 2024, multiple TensorNODEs will be deployed.
The composable nature of GigaIO’s dynamic infrastructure gives TensorWave far greater flexibility and agility than standard static infrastructure; as LLMs and AI user needs evolve, the infrastructure can be tuned on the fly to meet both current and future demands. Additionally, TensorWave’s cloud will be greener than alternatives by eliminating redundant servers and associated networking equipment, yielding savings in cost, complexity, space, water, and power.
“We are thrilled to power TensorWave’s infrastructure at scale by combining the power of the revolutionary AMD Instinct MI300X accelerators with GigaIO’s AI infrastructure architecture, including our unique memory fabric, FabreX. This deployment validates the pioneering approach we have taken to reimagining data center infrastructure,” said Alan Benjamin, CEO of GigaIO. “The TensorWave team brings both a visionary approach to cloud computing and a deep expertise in standing up and deploying very sophisticated accelerated data centers.”
TensorNODE is an all-AMD solution featuring both 4th Gen AMD EPYC CPUs and MI300X accelerators. The unprecedented expected performance of the TensorNODE is made possible by the MI300X, which delivers 192GB of HBM3 memory per accelerator. The leadership memory capacity of these accelerators, combined with GigaIO’s memory fabric — which allows for near-perfect scaling with no compromise to performance — solves the challenge of underutilized or idle GPU cores.
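As a quick sanity check, the two figures stated in this release — up to 5,760 GPUs per FabreX fabric domain, and 192 GB of HBM3 per MI300X accelerator — can be multiplied to verify the "more than a petabyte of VRAM" claim. The short sketch below uses only those published numbers:

```python
# Back-of-the-envelope check of the petabyte-scale VRAM claim,
# using figures stated in the release.
GPUS_PER_DOMAIN = 5760      # max GPUs in one FabreX memory fabric domain
HBM3_PER_GPU_GB = 192       # HBM3 capacity of one AMD Instinct MI300X

total_gb = GPUS_PER_DOMAIN * HBM3_PER_GPU_GB
total_pb = total_gb / 1_000_000  # decimal petabytes (1 PB = 10^6 GB)

print(f"{total_gb:,} GB = {total_pb:.2f} PB")  # 1,105,920 GB = 1.11 PB
```

At roughly 1.1 PB of aggregate HBM3 per fabric domain, the pooled capacity does indeed exceed a petabyte.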
“We are excited about our collaboration with GigaIO and TensorWave to bring unique solutions to the evolving workload demands of AI and HPC,” said Andrew Dieckmann, Corporate Vice President and General Manager, Data Center and Accelerated Processing, AMD. “GigaIO’s SuperNODE architecture, powered by AMD Instinct accelerators and AMD EPYC CPUs, is expected to deliver impressive performance and flexibility.”
About GigaIO
GigaIO provides workload-defined infrastructure through its universal dynamic memory fabric, FabreX, which seamlessly composes rack-scale resources and integrates natively into industry-standard tools. The SuperNODE and the SuperDuperNODE are “impossible servers,” fully engineered to “Just Work” for AI and accelerated computing. These solutions allow users to deploy systems in hours instead of months and run more workloads at lower cost through higher utilization of resources and more agile deployment.
About TensorWave
TensorWave is a deep tech company on a mission to develop an advanced cloud computing platform for AI workloads and beyond. Its upcoming TensorNODE deployment is ushering in the Next Wave of AI Compute, leveraging the AMD MI300X accelerator at scale. TensorWave is optimized for large-scale enterprises and platforms with LLM workloads that leverage PyTorch, including training and fine-tuning. Visit www.tensorwave.com to learn more and sign up for early access.
View source version on hpcwire.com: https://www.hpcwire.com/off-the-wire/gigaio-secures-largest-order-for-supernode-featuring-amd-instinct-mi300x-accelerators/