Note: This article was originally published on InsideHPC.
Recently we were approached by Robert, a system architect for a public health research institute, with a problem that may resonate with many of you within (and outside) the genomics field. Robert’s users, having received an upgraded version of their analysis software that now supported GPUs instead of just CPUs, were “literally eating RAM, CPU, storage and GPU.”
The good news: on a server connected to 8 GPUs, a use case that took 40 days with the previous CPU-only version completed in 65 minutes.
The bad news: now every single work group in the institute wanted its own server and GPU bank. Robert knew full well that this massive new hardware investment would be used only a fraction of the time: Intel’s research shows that GPUs located inside a server are mostly used less than 15% of the time, and rarely if ever more than 25%. With such a low utilization rate, plus the added staff time to transition from server to server (a process of several days), it was clear to him that he needed a solution where resources could be pooled and shared. Ideally, the servers could also be networked to run even larger banks of GPUs.
As luck would have it, after reading an article about GigaIO’s solution for composable infrastructure, Robert decided to share the GPUs across servers via a flexible infrastructure using GigaIO’s FabreX network fabric over PCIe.
The ability to attach a group of resources to one server, run the job(s), and then reallocate the same resources to other servers is the obvious solution to a growing problem: AI and HPC applications are changing at an accelerating pace, triggering the need for ever faster GPUs and FPGAs to take advantage of new software updates and newly developed applications.
Another GigaIO customer is a national lab facing a similar issue: individual scientists had one or two GPUs in their own high-end desktop systems. With the latest software updates now enabling more GPUs per server, everyone is clamoring for more of this expensive – and underutilized – resource. The solution again was to remove the GPUs and locate them all in a central pool connected by GigaIO’s FabreX network fabric, and have individuals check out however many GPUs they need to run their use case, then release them to others.
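The check-out-and-release pattern described above can be sketched as a small bookkeeping model. This is only an illustration of the idea, not GigaIO’s software: the `GpuPool` class, its method names, and the server and GPU names are all hypothetical.

```python
class GpuPool:
    """Toy model of a shared GPU pool (hypothetical, for illustration only)."""

    def __init__(self, gpu_ids):
        self.free = list(gpu_ids)   # GPUs not attached to any server
        self.assigned = {}          # server name -> list of GPU ids it holds

    def checkout(self, server, count):
        """Attach `count` free GPUs to `server`; raise if too few are free."""
        if count > len(self.free):
            raise RuntimeError(
                f"only {len(self.free)} GPUs free, {count} requested"
            )
        taken, self.free = self.free[:count], self.free[count:]
        self.assigned.setdefault(server, []).extend(taken)
        return taken

    def release(self, server):
        """Return every GPU held by `server` to the free pool."""
        returned = self.assigned.pop(server, [])
        self.free.extend(returned)
        return returned


# One pool of 8 GPUs serves two servers in turn.
pool = GpuPool([f"gpu{i}" for i in range(8)])
first = pool.checkout("server-a", 6)    # server-a runs its job on 6 GPUs
pool.release("server-a")                # job done, GPUs go back to the pool
second = pool.checkout("server-b", 8)   # server-b can now take all 8
```

The point of the sketch is that the same eight GPUs serve both jobs in sequence, where dedicated hardware would have required fourteen.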
AI and Big Data applications are migrating from small specialized departments to every part of your company: marketing, the more sophisticated design and modeling tools used in product development, and Manufacturing 4.0 process control and automation. As they do, you may well face a similar conundrum: how to unleash accelerated computing across your organization while also maximizing utilization rates and minimizing TCO.
The solution could be composable infrastructure with GigaIO’s FabreX, which integrates into your existing racks without locking you into a proprietary system. Because FabreX uses PCIe, the same language every device in your rack already speaks, latency plummets and even servers, not just JBOGs and JBOFs, can be connected into one network fabric, with resources shared across servers. Being totally software and hardware agnostic, FabreX allows you to utilize your existing investment and deploy as you see fit with minimal risk.
Enterprises of all sizes have common needs for this new IT infrastructure:
- They need systems that are reconfigurable so that every group does not need to buy dedicated resources that are frequently underutilized. Different groups have to be able to share expensive accelerators and storage, grouping together large pools of resources for quick processing, but easily being able to break these pools up so that they are in constant use.
- They require systems that are easy to expand in any direction – more storage, more CPUs, more GPUs, more FPGAs, more memory, more servers – when necessary, without having to duplicate all parts of the system.
- Their systems must be simple to transition as the solution changes. Their AI systems – from the software, to the algorithms, to the business problem being addressed – are going through rapid transformation. No one wants to be stuck with an IT infrastructure that is obsolete way before the depreciation schedule is complete. Individual components need to be easily accessible and on their own obsolescence and upgrade schedules, to minimize the system total cost of ownership.
- They need systems that are standards based and support the huge variety of accelerators and storage, and of the vendors that make them. Support for a single accelerator, or even a single accelerator or storage vendor, is simply not good enough for these systems. Today’s hot GPU will quickly be replaced by a newer and faster one, or their software will pick up support for FPGAs in addition to GPUs, or a new storage medium will become essential. In all cases, the infrastructure being put in place has to be able to accommodate all of these without limiting the future potential of the system.
It’s time for a new approach.
GigaIO’s FabreX is a fundamentally new network architecture that integrates computing, storage, accelerators and other communication I/O into a single-system cluster fabric, using industry standard PCI Express technology. This new architecture offers the lowest system latency and highest bandwidth in rack systems today, and creates a unified, software-defined, composable infrastructure.
Because of its exceptionally low latency (as low as 500 ns, end-to-end) and high bandwidth, FabreX makes the disaggregation of storage and accelerators possible. And with FabreX, this same breakthrough performance applies to server-to-server traffic as well, meaning you can both scale up server capabilities and then scale those solutions out, to build systems that will delight your user groups. FabreX also supports standard Redfish APIs, so leading third-party container environments and orchestration and composition software can easily run on the network, making a true software-defined infrastructure that dynamically assigns resources to match changing workloads. Gone are the days of making massive and costly changes to large numbers of servers to incorporate new technology.
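To give a feel for what composition through Redfish can look like, here is a minimal sketch of the request body used by the "specific composition" pattern in the DMTF Redfish composability model, where a client POSTs a list of resource blocks to the Systems collection. The helper name, the system name, and the resource block URIs below are hypothetical, and whether a given FabreX deployment exposes exactly these resources is an assumption to check against its own documentation.

```python
import json


def build_compose_request(name, resource_block_uris):
    """Build a JSON body asking a Redfish service to compose a system
    from specific resource blocks (hypothetical helper, not a vendor API)."""
    return {
        "Name": name,
        "Links": {
            "ResourceBlocks": [
                {"@odata.id": uri} for uri in resource_block_uris
            ]
        },
    }


# Hypothetical resource block URIs; a real service lists its own under
# /redfish/v1/CompositionService/ResourceBlocks.
blocks = [
    "/redfish/v1/CompositionService/ResourceBlocks/ComputeBlock1",
    "/redfish/v1/CompositionService/ResourceBlocks/GpuBlock1",
    "/redfish/v1/CompositionService/ResourceBlocks/GpuBlock2",
]

body = build_compose_request("genomics-run", blocks)
print(json.dumps(body, indent=2))
```

In the DMTF model, a body like this would be sent as a POST to `/redfish/v1/Systems`, and deleting the composed system would return its blocks to the pool. The sketch deliberately stops at building the payload and makes no network calls.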
So, like Robert, you too can upgrade or add compute, storage and accelerators at the component level that plug and play with your environment, with every major subsystem now operating on its own upgrade cycle. Your users will enjoy access to the computing power they crave, and your boss will love you for optimizing TCO as FabreX drives much higher utilization of the resources: from 15% to 90% in Robert’s case.
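To see what a jump from 15% to 90% utilization means in hardware terms, here is a back-of-the-envelope calculation. The two utilization figures come from Robert’s case; the workload size (nine GPUs’ worth of continuous work) is a made-up number chosen only to keep the arithmetic clean.

```python
def gpus_needed(busy_gpus, utilization_pct):
    """GPUs to provision so that `busy_gpus` worth of continuous work fits
    at the given utilization (integer ceiling division avoids float error)."""
    return -(-busy_gpus * 100 // utilization_pct)


workload = 9  # hypothetical: work that keeps 9 GPUs busy around the clock

dedicated = gpus_needed(workload, 15)  # GPUs stuck inside individual servers
pooled = gpus_needed(workload, 90)     # GPUs shared through a common pool

print(dedicated, pooled, dedicated // pooled)
```

Under these assumptions the same workload fits on one sixth of the GPUs, which is simply the ratio 90 / 15.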