GigaIO’s SuperNODE to Power TensorWave Deployment with AMD MI300X
December 7, 2023
San Jose, California, December 6, 2023 – GigaIO, provider of open workload-defined infrastructure for AI and accelerated computing, has announced what the company said is the largest order yet for its SuperNODE, utilizing tens of thousands of AMD Instinct MI300X accelerators.
GigaIO’s infrastructure will form the backbone of a bare-metal specialized AI cloud code-named “TensorNODE,” to be built by cloud provider TensorWave for supplying access to AMD data center GPUs, especially for use in LLMs.
The company said the SuperNODE, launched in June 2023, was the world’s first 32-GPU single-node supercomputer. The TensorNODE deployment will build on this architecture at far greater scale, leveraging GigaIO’s PCIe Gen-5 memory fabric to provide simpler workload setup and deployment than is possible with legacy networks while eliminating the associated performance tax, according to GigaIO.
“TensorWave is excited to bring this innovative solution to market with GigaIO and AMD,” said Darrick Horton, CEO of TensorWave. “We selected the GigaIO platform because of its superior capabilities, in addition to GigaIO’s alignment with our values and commitment to open standards. We’re leveraging this novel infrastructure to support large-scale AI workloads and we are proud to be collaborating with AMD as one of the first cloud providers to deploy the MI300X accelerator solutions.”
GPU utilization is key in this era of GPU scarcity, but maximizing it requires significant VRAM and memory bandwidth. TensorWave will use FabreX to create the first petabyte-scale GPU memory pool without the performance impact of non-memory-centric networks. The first installment of TensorNODE is expected to be operational in early 2024, with an architecture that will support up to 5,760 GPUs across a single FabreX memory fabric domain. Workloads will be able to access more than a petabyte of VRAM in a single job from any node, enabling even the largest jobs to be completed in record time. Multiple TensorNODEs will be deployed throughout 2024.
The composable nature of GigaIO’s dynamic infrastructure provides TensorWave with tremendous flexibility and agility over standard static infrastructure; as LLMs and AI user needs evolve over time, the infrastructure can be tuned on the fly to meet both current and future needs.
TensorWave’s cloud will be greener than alternatives, eliminating redundant servers and associated networking equipment and delivering savings in cost, complexity, space, water, and power.
“We are thrilled to power TensorWave’s infrastructure at scale by combining the power of the revolutionary AMD Instinct MI300X accelerators with GigaIO’s AI infrastructure architecture, including our unique memory fabric, FabreX. This deployment validates the pioneering approach we have taken to reimagining data center infrastructure,” said Alan Benjamin, CEO of GigaIO. “The TensorWave team brings both a visionary approach to cloud computing and a deep expertise in standing up and deploying very sophisticated accelerated data centers.”
TensorNODE is an all-AMD solution featuring both 4th Gen AMD EPYC CPUs and MI300X accelerators. The expected performance of the TensorNODE is made possible by the MI300X, which delivers 192GB of HBM3 memory per accelerator. The leadership memory capacity of these accelerators, combined with GigaIO’s memory fabric, which allows for near-perfect scaling without compromising performance, solves the challenge of underutilized or idle GPU cores.
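For readers who want to sanity-check the petabyte claim, here is a minimal back-of-the-envelope sketch in Python, using only the figures quoted in this announcement (up to 5,760 accelerators per FabreX fabric domain and 192GB of HBM3 per MI300X); the exact deployed configuration is an assumption:

```python
# Back-of-the-envelope check of the "more than a petabyte of VRAM" claim,
# based only on the figures stated in the announcement.
GPUS_PER_FABRIC_DOMAIN = 5_760   # MI300X accelerators in one FabreX domain (announced maximum)
HBM3_PER_GPU_GB = 192            # GB of HBM3 memory per MI300X

total_gb = GPUS_PER_FABRIC_DOMAIN * HBM3_PER_GPU_GB
total_pb = total_gb / 1_000_000  # decimal petabytes (1 PB = 10^6 GB)

print(f"Total pooled VRAM: {total_gb:,} GB ≈ {total_pb:.2f} PB")
# -> Total pooled VRAM: 1,105,920 GB ≈ 1.11 PB
```

At the stated maximum configuration, the pooled capacity works out to roughly 1.1 PB of VRAM addressable across a single fabric domain, consistent with the “more than a petabyte” figure above.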
“We are excited about our collaboration with GigaIO and TensorWave to bring unique solutions to the evolving workload demands of AI and HPC,” said Andrew Dieckmann, Corporate Vice President and General Manager, Data Center and Accelerated Processing at AMD. “GigaIO’s SuperNODE architecture, powered by AMD Instinct accelerators and AMD EPYC CPUs, is expected to deliver impressive performance and flexibility.”
View source version on insidehpc.com: https://insidehpc.com/2023/12/gigaios-supernode-to-power-tensorwave-deployment-with-amd-mi300x/