As digital transformation has brought virtualization and then hardware disaggregation to data centers and servers, the one stubborn resource that still resists being shared is memory, specifically DRAM. What if it were possible to build an elastic memory pool shared across servers to break that last hurdle towards a completely disaggregated compute model, where CPUs may no longer be prevalent, or even relevant?
The vision of a wholly disaggregated rack, and beyond the rack, of disaggregation at data center scale, is a goal heralded by such luminaries as the CEO and founder of NVIDIA, Jensen Huang, who proclaimed in 2020 that “The new style of data centers will be disaggregated, composable, and accelerated.”
A bit of history
First, there was storage.
The march towards freeing resources from the servers in which they were trapped, and towards pooling them in composable resource blocks to enable sharing across users and workflows, started with storage, around 2015. Composing storage was the easy part because storage latency is measured in milliseconds, not nanoseconds, so it is easy to achieve over Ethernet or InfiniBand. For the vast majority of so-called “composable vendors” today, that is still the only portion of the compute operation that they make available to be dynamically allocated across the data center. This constraint is what we refer to at GigaIO as the “storage/not storage” construct: storage gets composed, but all the other elements of the compute operation stay fixed, trapped behind the CPU, whether they are GPUs, FPGAs, ASICs or other non-storage resources.
Many AI workflows generate mostly East/West rather than North/South network traffic, which is why NVIDIA created NVLink and GPUDirect RDMA (GDR) to accommodate constant, low-latency, high-performance East/West communication.
With these advances came the next step in further sharing server resources, that of disaggregating accelerators outside the servers by locating them in expansion chassis called JBOGs (Just a Bunch of GPUs) and flash memory in JBOFs (Just a Bunch of Flash). Those enclosures then typically communicate with the server over the PCIe bus, delivering more compute capability and more diverse computation options at native PCIe latency. A couple of vendors, including GigaIO, introduced these types of solutions in 2018, and they have found a welcome reception in HPC and other compute-intensive and latency-sensitive applications.
The next step in disaggregation is harder: to address North/South network traffic, or to compose GPUs to more than a single server, and still to do so over native PCIe (Peripheral Component Interconnect Express), the language all rack resources “speak.” Today, only GigaIO offers the ability to transform the entire rack (and beyond) into rack-scale compute units, instead of the server being the unit of computing, with all resources in the rack communicating at PCIe bandwidth and latency.
The next step in disaggregation, which is yet to be solved, is pooled memory because with memory, low latency is paramount.
The state of the memory pooling technology
One way to pool memory is a JBOM (Just a Bunch of Memory), for example using Intel Optane™ or Samsung’s new offering as a substitute for memory. The value proposition is to get the same performance as if the Optane persistent memory were located inside the server. However, when this is done over InfiniBand or Ethernet, the latency advantages of Optane are completely lost.
Today: Overprovisioning DRAM
Most commonly, this is the solution many IT managers are being forced to pursue. Since they never know how much DRAM they might need in the future, every time they purchase a server, they also buy the maximum DRAM available. This has obvious Capex implications but is a way to future-proof against new applications that might require more memory in their next update. Many organizations have built smaller private cloud data centers, overprovisioned just in case.
Tomorrow: Lean provisioning
Another approach could be to thin-provision each server, say with 64GB per server, since you might only need 256GB of DRAM about 5% of the time. Then, what if you could purchase a PCIe card with a bunch of DRAM, which would cost the same as if it were located in each server, but instead host 1TB of DRAM in a JBOM? The lean memory packaging could then be composed independently and dynamically to whatever server needs the memory, whenever it needs it.
This solution can be achieved with a server consisting of a CPU plus thin DRAM, packaged as compute modules small enough that many of them could fit in a 1U form factor, provisioned at will. This could be accomplished either with a many-socket or a simple 1P socket design.
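To make the arithmetic concrete, here is a rough sketch comparing the total DRAM purchased under the two strategies. The 64GB, 256GB, and 1TB figures come from the text above; the rack size of 16 servers and the function name are assumptions chosen only for illustration.

```python
def dram_purchased_gb(servers, per_server_gb, pool_gb=0):
    """Total DRAM bought for a rack: per-server DRAM plus an optional shared pool."""
    return servers * per_server_gb + pool_gb


SERVERS = 16      # assumption: hypothetical rack size, not from the article
THIN_GB = 64      # thin-provisioned DRAM per server (from the text)
MAX_GB = 256      # fully provisioned DRAM per server (from the text)
POOL_GB = 1024    # 1TB of DRAM hosted in a JBOM (from the text)

overprovisioned = dram_purchased_gb(SERVERS, MAX_GB)
lean = dram_purchased_gb(SERVERS, THIN_GB, pool_gb=POOL_GB)

# How many servers could grow to the full 256GB at the same moment
# by borrowing the difference from the shared pool.
max_bursting = POOL_GB // (MAX_GB - THIN_GB)

print(f"overprovisioned rack: {overprovisioned} GB")
print(f"lean rack plus pool:  {lean} GB")
print(f"servers that can burst to {MAX_GB} GB at once: {max_bursting}")
```

Under these assumed numbers the lean rack buys half the DRAM, while still letting a handful of servers reach their peak footprint at the same time, which is consistent with needing the full amount only a small fraction of the time.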
This approach would drastically reduce Capex for users and hyperscalers, … Magic, right?
The technology behind the magic
Let’s start with a description of the two ways that the compute modules we just described can get data out of memory. First, a few concepts.
- Registers inside the CPU are the fastest storage because they are part of the CPU itself.
- L1 cache is very fast but small, typically 32 KB or 64 KB, far too small to pool memory. When the data a program needs is not in L1, the access produces a cache miss and falls through to L2.
- L2 cache is slightly slower yet, larger (usually 128 KB or 256 KB), and also embedded in the CPU. If L2 doesn’t hold the value either, it will produce another cache miss.
- L3 cache is slower yet, but bigger, typically 4 MB, and growing in size as process nodes shrink (7 nm, 5 nm, etc.).
- When the last level of cache misses, the request goes to DRAM, marking the first time a DRAM read occurs. DRAM is slow compared to the CPU, so the CPU inserts wait or spin states.
- Sometimes the data needed is larger than what DRAM can accommodate, so only a portion of it is resident in DRAM. A miss can then occur in DRAM itself. When that happens, a memory swap comes into play, moving a 4 KB or 8 KB page at a time between swap space and memory.
- At that point, the operating system performs a context switch: it puts the waiting application to sleep and time-slices another application on the CPU while it waits for the pages to arrive. Argh, and then we wait…
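The hierarchy above can be sketched as a simple expected-latency model. Every latency and hit rate below is an illustrative assumption, not a measurement of any particular CPU; the point is only to show how each level is tried in turn and how rare trips to swap space still weigh on the average.

```python
# (level name, latency in ns, hit rate once the access reaches this level)
# All numbers are assumptions chosen for illustration.
LEVELS = [
    ("L1", 1, 0.90),
    ("L2", 4, 0.80),
    ("L3", 15, 0.70),
    ("DRAM", 100, 0.999),
    ("swap", 100_000, 1.0),   # a page fault always finds the data eventually
]


def average_access_ns(levels):
    """Expected latency when each level is tried in order until one hits."""
    expected = 0.0
    reach_prob = 1.0  # probability that an access gets as far as this level
    for _name, latency_ns, hit_rate in levels:
        expected += reach_prob * latency_ns   # every access reaching a level pays its latency
        reach_prob *= 1.0 - hit_rate          # the misses continue to the next level
    return expected


print(f"average access time: {average_access_ns(LEVELS):.2f} ns")
```

With these assumed numbers, the handful of accesses that fall all the way through to swap contribute about as much to the average as every DRAM access combined, which is why the latency of whatever sits behind a page fault matters so much.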
With the basics of memory management covered above, there are broadly two ways memory gets shared:
- In method #1, a cache miss comes into play, with latencies ranging from under 1 ns to less than 15 ns. A very fast cache is a copy of memory; it is not memory. This management is handled in hardware, which is the only thing fast enough.
- In method #2, the caches have been exhausted and the DRAM does not hold the data either; it is in the swap space, so the access triggers a page fault. Memory pages have a fixed size, either 4 KB or 8 KB each.
GigaIO’s staged approach to solving the holy grail
- Lean PCIe memory packaging solution.
We are demonstrating FPGAs (Alveo 250) with 256k RAM, where we add a PCIe interface to the DRAM, plug it into a JBOX, and compose it to a server. PCIe has access to DRAM because it is a memory-based protocol anyway. Now the server has extra DRAM, but over PCIe, which is useful as this is faster than Optane – about 500 ns to the CPU. The CPU will time out at 1000 ns.
The packaging can use traditional I/O to service page faults out of swap space, and be the fastest on the planet at doing so, using the Persistent Memory Development Kit (PMDK) created by Intel. The memory transfer occurs from the card to the memory of the server using a DMA engine.
This approach gives the ability to compose memory through page faults.
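As a rough illustration of composing memory through page faults, the toy model below keeps a small number of 4 KB pages in local DRAM and pulls the rest from a larger remote pool on demand. It is a sketch under stated assumptions, not GigaIO's implementation: the class and method names are hypothetical, the pool is an ordinary Python dictionary, and the `_fetch` method merely stands in for the DMA transfer from the memory card.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # 4 KB pages


class PagedMemory:
    """Toy model: small local DRAM backed by a larger remote pool, filled on page faults."""

    def __init__(self, local_frames, pool_bytes):
        self.local = OrderedDict()   # page number -> bytearray, kept in LRU order
        self.local_frames = local_frames
        self.pool = {}               # remote pages, created lazily
        self.pool_pages = pool_bytes // PAGE_SIZE
        self.faults = 0

    def _fetch(self, page):
        """Stand-in for the DMA transfer from the memory card into server DRAM."""
        if not 0 <= page < self.pool_pages:
            raise IndexError("address outside the composed pool")
        self.faults += 1
        if len(self.local) >= self.local_frames:
            victim, data = self.local.popitem(last=False)  # evict least recently used page
            self.pool[victim] = data                       # write it back to the pool
        self.local[page] = self.pool.pop(page, bytearray(PAGE_SIZE))

    def _page(self, addr):
        page = addr // PAGE_SIZE
        if page not in self.local:
            self._fetch(page)          # page fault: bring the page in from the pool
        self.local.move_to_end(page)   # mark as most recently used
        return self.local[page]

    def read(self, addr):
        return self._page(addr)[addr % PAGE_SIZE]

    def write(self, addr, value):
        self._page(addr)[addr % PAGE_SIZE] = value


mem = PagedMemory(local_frames=2, pool_bytes=4 * PAGE_SIZE)
mem.write(0, 7)               # touches page 0
mem.write(PAGE_SIZE, 8)       # touches page 1
mem.write(2 * PAGE_SIZE, 9)   # touches page 2, evicting page 0 to the pool
print(mem.read(0), mem.faults)
```

The application only ever sees reads and writes; whether a page was local or had to be brought in from the pool shows up solely as extra latency and a higher fault count.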
- Memory server.
This alternative route to memory pooling can be achieved without any hardware development, just software. The basis is one regular server provisioned with the maximum amount of DRAM, alongside other servers that are thin-provisioned. Any server connected via the same fabric can then be granted access to a portion of the memory using Non-Transparent Bridging (NTB) over the fabric. As far as the memory server is concerned, it treats all memory as its own, as the transfer could be under 500 ns. The memory could be carved up logically to make it available in whatever configuration is needed by the application.
One limitation is that it typically would not be dynamic unless changes are made to the operating system. To allocate a portion of the memory to a different server would require a reboot.
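The logical carving described above, along with its static nature, can be sketched as simple bookkeeping. This is a hypothetical illustration only: the class, its methods, and the node names are invented for the example and do not represent any NTB driver or GigaIO API.

```python
class MemoryServer:
    """Toy bookkeeping for carving one server's DRAM into per-client windows."""

    def __init__(self, total_gb):
        self.total_gb = total_gb
        self.next_offset_gb = 0
        self.windows = {}  # client name -> (offset_gb, size_gb)

    def grant(self, client, size_gb):
        """Assign a fixed window of the memory server's DRAM to one client server."""
        if client in self.windows:
            # Mirrors the limitation in the text: allocations are not dynamic.
            raise RuntimeError(f"{client} already has a window; changing it needs a reboot")
        if self.next_offset_gb + size_gb > self.total_gb:
            raise MemoryError("memory server exhausted")
        window = (self.next_offset_gb, size_gb)
        self.windows[client] = window
        self.next_offset_gb += size_gb
        return window


server = MemoryServer(total_gb=1024)
print(server.grant("node-a", 256))   # hypothetical client names
print(server.grant("node-b", 512))
```

Each client receives a contiguous region it can treat as its own; what this simple scheme cannot do is shrink or move a window once it has been granted, which is the constraint the operating system changes would have to lift.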
- The true holy grail: CXL + PCIe Gen5
To really get to memory pooling without limitations, the world needs to wait for CXL and PCIe Gen5. For GigaIO this will be an easy migration, which we expect to be faster than the Gen 3 to Gen 4 transition. We are active members of the CXL consortium and stand ready to implement it as soon as the silicon becomes available.
With CXL, rather than being satisfied through page faults, memory requests can be satisfied as cache misses over PCIe Gen5 using CXL.cache and CXL.mem, because CXL introduces cache coherency. We can then extend memory space completely outside the server – we could have for example zero memory on the CPU and have it all on PCIe. Because we don’t have to go through the operating system, the transfers from the memory card to the compute module are automatically generated, in DRAM L4. Then we become like L5 cache, with all elements of the compute operation including memory disaggregated using CXL. The Holy Grail.
GigaIO is the only player with the intellectual property resources to pull off a memory server as we wait for CXL to become reality. Competitors in the PCIe composable space might be able to design a PCIe memory card, but the ability to share a resource owned by one server with another server over native PCIe is protected by GigaIO patents. Other vendors can only communicate server to server with NVMe-oF over InfiniBand, which introduces too much latency to ever be viable as a memory pooling solution.
The holy grail is within sight, and GigaIO is leading the quest.
For more details on just how bright the future is with CXL, download the white paper “The Future of Composability with CXL”.
 Gartner’s July 2020 report on Composable Infrastructure includes the following vendors: Dell EMC, DriveScale, GigaIO, HPE, Intel, Liqid and Western Digital.