TAKING THE PCI EXPRESS TO MALLEABLE SYSTEMS
Note: This article was originally published on The Next Platform.
It took decades for server virtualization to go mainstream, making its way from hardware and software partitions on mainframes three decades ago, down to proprietary and Unix systems two decades ago, and then to X86 servers with VMware, XenServer, Microsoft, and Red Hat all doing their part. We are at the very front end of a different kind of server virtualization now, one built on disaggregation and composability, and hopefully this time around it will not take three decades to go mainstream.
We are guessing it is probably more like five to ten years, and all it would take is a good recession (if there ever is such a thing) to speed up adoption. The reason disaggregation and composability are going to take off is simple enough. Each server has too much stranded capacity, and in a world where software stacks can be built up and torn down in the blink of an eye, static hardware – with everything locked down inside a node, at best virtualized within a box but not configurable across boxes – cannot keep pace with the needs of the software. Something has to give, and that something is the motherboard and the metal skin of the server.
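To make the stranded capacity point concrete, here is a minimal sketch using a made-up job mix: four static servers with four GPUs apiece versus the same sixteen GPUs treated as one composable pool. The node counts, GPU counts, and job sizes are hypothetical illustrations, not measurements from any real cluster.

```python
# Illustrative only: a made-up job mix showing how static nodes strand GPUs.
# Four fixed servers with 4 GPUs each versus the same 16 GPUs as one pool.

jobs = [3, 3, 3, 3, 2, 2]          # GPUs requested per job (hypothetical)
NODES, GPUS_PER_NODE = 4, 4

# Static nodes: each job must fit inside a single box, first-fit in arrival order.
free = [GPUS_PER_NODE] * NODES
placed_static = 0
for need in jobs:
    for i, avail in enumerate(free):
        if avail >= need:
            free[i] -= need
            placed_static += 1
            break

# Composable pool: the fabric can hand any free GPUs to any server.
pool = NODES * GPUS_PER_NODE
placed_pooled = 0
for need in jobs:
    if pool >= need:
        pool -= need
        placed_pooled += 1

print("static:", placed_static, "jobs placed,", sum(free), "GPUs stranded")
print("pooled:", placed_pooled, "jobs placed,", pool, "GPUs left over")
```

With the same total demand and the same total supply, the static layout leaves four GPUs idle while two jobs wait, whereas the pooled layout places everything. That gap, multiplied across a data center, is the rigidity the fabric vendors are going after.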
That is what GigaIO, one of several makers of fabrics that bring composability to systems and across clusters, believes, and it is the world the company’s hardware and software engineers are trying to build, starting with PCI-Express switches and its homegrown FabreX switch fabric, which does the disaggregating and composing of the stuff that hangs off the peripheral bus – non-volatile storage, GPUs, FPGAs, and the like. Scott Taylor, director of software development at GigaIO, talked about this stranded capacity issue at length at our recent HPC Day event, which we hosted just before the SC19 supercomputing conference kicked off in Denver.
There is no reason why someone could not build a giant, wonking PCI-Express switch with loads of aggregate bandwidth that could drive lots of ports running at various PCI-Express speeds, but thus far no one has built a PCI-Express switch with more than 96 lanes, which supports 24 ports of PCI-Express 3.0 with four lanes per port.
“What is interesting about the Gen 4 roadmap is that Gen 4 is coming out, but we still have Gen 3 GPUs and Gen 3 servers,” says Taylor. “We are going to roll out a Gen 4 switch here at the beginning of 2020, and we will not initially be plugging in Gen 4 servers or Gen 4 GPUs. So we will initially have a higher radix count, so we can use a x8 Gen 4 link to connect to a x16 Gen 3 server or x16 Gen 3 GPUs. So we will see a virtual increase in radix and people will be able to do more interesting things with composition.”
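The radix arithmetic Taylor is describing is simple enough to sketch. Assuming the usual rule of thumb that a Gen 4 lane carries roughly twice the bandwidth of a Gen 3 lane – so a x8 Gen 4 switch port can feed a x16 Gen 3 device – a fixed budget of 96 lanes stretches across more attached devices:

```python
# Back-of-the-envelope port math for a fixed-lane-budget PCI-Express switch.
# Rule-of-thumb assumption: a Gen 4 lane carries roughly twice the bandwidth
# of a Gen 3 lane, so a x8 Gen 4 switch port can feed a x16 Gen 3 device.

SWITCH_LANES = 96   # the largest switch lane count cited above

def ports(lanes_per_port: int, total_lanes: int = SWITCH_LANES) -> int:
    """How many ports the switch can expose at a given link width."""
    return total_lanes // lanes_per_port

print(ports(4))     # 24 ports of x4, the Gen 3 configuration described above
print(ports(16))    # 6 devices if every Gen 3 x16 device gets a full x16 port
print(ports(8))     # 12 devices if x8 Gen 4 ports feed x16 Gen 3 devices,
                    # i.e. the "virtual increase in radix" Taylor describes
```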
The interesting thing is not so much the hardware as the mix-and-match nature of HPC compute, AI frameworks, and cloud computing atop the composed infrastructure. Early adopter customers are looking at mashing up MPI distributed computing with SLURM, Bright Cluster Manager, or Adaptive Computing Moab job schedulers and with OpenStack cloud controllers using the Valence composability extensions, and at plugging into existing rack-scale infrastructure from Intel, Dell, and others.
“We have a framework and a REST API that will support that, so we can plug into these ecosystems,” says Taylor. “You have got to plug into somebody else’s stack. People are using OpenStack, they are using these workload managers, and we wanted to come in and seamlessly plug in this composition element into those frameworks. And we are now seeing that come to fruition.”
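As a rough sketch of what plugging composition into somebody else’s stack could look like, the snippet below shows a workload manager prolog posting a compose request to a REST endpoint before a job starts. The host name, URL path, and payload fields are placeholders invented for illustration; they are not GigaIO’s actual FabreX API.

```python
# Hypothetical sketch of a job prolog calling a composition REST endpoint.
# The host, URL path, and payload fields are placeholders made up for this
# illustration; they are NOT GigaIO's actual FabreX/REST API.

import requests

COMPOSER_URL = "http://composer.example.local/api/v1/compose"   # placeholder

def compose_node(node: str, gpus: int, nvme_drives: int) -> None:
    """Ask the (hypothetical) composition service to attach devices to a node."""
    payload = {
        "target_node": node,
        "resources": {"gpu": gpus, "nvme": nvme_drives},
    }
    resp = requests.post(COMPOSER_URL, json=payload, timeout=30)
    resp.raise_for_status()   # fail the prolog, and hence the job, if composition fails

if __name__ == "__main__":
    # Example: compose four GPUs and two NVMe drives onto node001 before the job runs.
    compose_node("node001", gpus=4, nvme_drives=2)
```

A real integration would also release the devices in the job epilog so they return to the pool for the next composition.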
Some people think that hyperconverged infrastructure – the merging of virtual compute and virtual SAN storage on the same cluster – is diametrically opposed to disaggregated and composable systems. But these are actually complementary technologies. The only way to get true disaggregation and composability, says Taylor, is over PCI-Express, which is how peripherals and servers link to each other under the covers. But Mini-SAS HD links carrying the PCI-Express protocol are limited to tens of meters over copper cable and 100 meters over optical cable, and that means pools of components eventually have to be linked in some hyperconverged fashion over traditional InfiniBand and Ethernet networking.
Ultimately, we may end up with the kind of rack-scale infrastructure that Egenera dreamed of with its BladeFrame systems, conceived by former Goldman Sachs chief technology officer Vern Brownell two decades ago, but without all of the blade chassis lock-in and blade incompatibility issues.