Nvidia and IBM unveil Big Accelerator Memory

Recently, Microsoft announced that the DirectStorage API has landed on PCs. The technology lets NVMe SSDs bypass the CPU and system memory and transfer data directly to video memory, which can greatly reduce loading times in Windows games. Nvidia, IBM, and Cornell University have now found a way to make GPUs work seamlessly with SSDs without the need for a proprietary API. The feature is called Big Accelerator Memory (BaM), and it could be applied to a wide range of computing tasks in the future, which would be a major help for large-scale data workloads.

As GPUs have continued to evolve, they are no longer limited to graphics and are now used for all kinds of heavy workloads. They are also getting closer to CPUs in programmability, and therefore increasingly need direct access to large storage devices. To process data efficiently, GPUs are generally equipped with high-speed, high-capacity video memory, which is very expensive. For example, the A100 compute card based on the Ampere architecture carries 80GB of HBM2e running at 3.2 Gbps per pin, for a memory bandwidth of roughly 2 TB/s. However, the amount of data GPUs are asked to process is growing so quickly that existing methods can no longer keep up, making it ever more urgent to optimize the interoperability between the GPU and storage devices.
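As a quick sanity check on those figures, the calculation below reproduces the roughly 2 TB/s number from the quoted 3.2 Gbps per-pin rate. The 5120-bit HBM2e interface width is filled in here as an assumption; it is not stated above.

// Back-of-the-envelope check of the bandwidth figure quoted above.
// The 5120-bit bus width is an assumption, not a figure from the article.
#include <cstdio>

int main()
{
    const double pin_rate_gbps  = 3.2;   // per-pin data rate quoted above
    const double bus_width_bits = 5120;  // assumed HBM2e interface width
    const double gbytes_per_s   = pin_rate_gbps * bus_width_bits / 8.0;
    printf("peak bandwidth ~= %.0f GB/s (~2 TB/s)\n", gbytes_per_s);  // ~2048 GB/s
    return 0;
}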

A few key factors are involved in improving interoperability between GPUs and SSDs. First, NVMe calls and data transfers place a heavy burden on the CPU, which is inefficient in terms of overall performance and power. Second, the constant back-and-forth between the CPU and GPU also sharply limits the effective storage bandwidth available to applications that need large amounts of data. According to Tom's Hardware, the goal of Big Accelerator Memory is to expand GPU memory capacity and increase effective storage access bandwidth, while making it easier for GPUs to access massive amounts of data in that expanded memory.

Big Accelerator Memory essentially allows Nvidia GPUs to fetch data directly from system memory and storage devices without involving the CPU. The GPU uses its own video memory as a software-managed cache and moves data over the PCIe interface, RDMA, and a custom Linux kernel driver, allowing the SSD to read and write the GPU's video memory directly when needed. If the required data is not available locally, GPU threads queue the commands for the SSD themselves.
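To make the idea concrete, here is a minimal CUDA sketch of a software-managed cache of this kind. It is only an illustration built on assumptions: the "SSD" is simulated by pinned host memory reached over PCIe, the cache is direct-mapped, and a single GPU thread stands in for the many thousands the real design runs, so none of the names or structures below reflect the actual BaM driver interface.

// Toy sketch of the BaM idea: GPU code treating its own memory as a
// software-managed cache and fetching blocks on demand across PCIe.
// Illustrative only: the "SSD" is a pinned host buffer, and a single
// thread is used so the sketch avoids the synchronization the real
// design needs when thousands of threads share the cache.
#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

constexpr int      BLOCK_BYTES = 512;    // one "SSD block" in this sketch
constexpr int      CACHE_SLOTS = 8;      // capacity of the GPU-resident cache
constexpr uint64_t SSD_BLOCKS  = 1024;   // size of the simulated SSD

struct CacheSlot {
    uint64_t tag;                 // which SSD block this slot currently holds
    int      valid;               // 1 once the payload is resident in VRAM
    uint8_t  data[BLOCK_BYTES];   // the cached payload itself
};

// Resolve one block: a hit is served from GPU memory, a miss is fetched
// on demand from the simulated SSD (zero-copy access over PCIe).
__device__ const uint8_t* read_block(CacheSlot* cache, const uint8_t* ssd,
                                     uint64_t block, int* misses)
{
    CacheSlot& s = cache[block % CACHE_SLOTS];   // direct-mapped for brevity
    if (!(s.valid && s.tag == block)) {          // miss: fetch on demand
        for (int i = 0; i < BLOCK_BYTES; ++i)
            s.data[i] = ssd[block * BLOCK_BYTES + i];
        s.tag = block;
        s.valid = 1;
        ++*misses;
    }
    return s.data;
}

// A single thread walks a request stream; the real system would run
// thousands of threads, each queuing its own NVMe command on a miss.
__global__ void demo(CacheSlot* cache, const uint8_t* ssd,
                     const uint64_t* requests, int n,
                     unsigned long long* checksum, int* misses)
{
    unsigned long long sum = 0;
    for (int r = 0; r < n; ++r) {
        const uint8_t* p = read_block(cache, ssd, requests[r], misses);
        sum += p[0];                             // touch the cached data
    }
    *checksum = sum;
}

int main()
{
    // Simulated SSD: pinned, mapped host memory the GPU can read directly.
    cudaSetDeviceFlags(cudaDeviceMapHost);
    uint8_t* ssd;
    cudaHostAlloc(&ssd, SSD_BLOCKS * BLOCK_BYTES, cudaHostAllocMapped);
    for (uint64_t i = 0; i < SSD_BLOCKS * BLOCK_BYTES; ++i)
        ssd[i] = (uint8_t)(i / BLOCK_BYTES);     // each block filled with its id

    // Request stream with repeats so both cache hits and misses occur.
    const int n = 16;
    uint64_t h_req[n] = {3, 7, 3, 42, 7, 3, 500, 42, 3, 7, 900, 3, 42, 7, 3, 500};
    uint64_t* d_req;
    cudaMalloc(&d_req, sizeof(h_req));
    cudaMemcpy(d_req, h_req, sizeof(h_req), cudaMemcpyHostToDevice);

    CacheSlot* d_cache;
    cudaMalloc(&d_cache, CACHE_SLOTS * sizeof(CacheSlot));
    cudaMemset(d_cache, 0, CACHE_SLOTS * sizeof(CacheSlot));

    unsigned long long* d_sum;  int* d_miss;
    cudaMalloc(&d_sum, sizeof(*d_sum));   cudaMemset(d_sum, 0, sizeof(*d_sum));
    cudaMalloc(&d_miss, sizeof(*d_miss)); cudaMemset(d_miss, 0, sizeof(*d_miss));

    uint8_t* d_ssd;
    cudaHostGetDevicePointer(&d_ssd, ssd, 0);
    demo<<<1, 1>>>(d_cache, d_ssd, d_req, n, d_sum, d_miss);
    cudaDeviceSynchronize();

    unsigned long long sum; int miss;
    cudaMemcpy(&sum, d_sum, sizeof(sum), cudaMemcpyDeviceToHost);
    cudaMemcpy(&miss, d_miss, sizeof(miss), cudaMemcpyDeviceToHost);
    printf("checksum=%llu, misses=%d of %d requests\n", sum, miss, n);
    return 0;
}

Repeated requests for the same block are served out of the GPU-resident cache, while new blocks are pulled in on demand, which is the access pattern BaM is built around.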

Because Big Accelerator Memory does not rely on virtual memory address translation, serialization events such as TLB misses do not occur. Nvidia and its partners plan to open-source the driver so that others can adopt the BaM concept.

Big Accelerator Memory

“BaM mitigates the I/O traffic amplification by enabling the GPU threads to read or write small amounts of data on-demand, as determined by the compute,” Nvidia’s document reads. “We show that the BaM infrastructure software running on GPUs can identify and communicate the fine-grain accesses at a sufficiently high rate to fully utilize the underlying storage devices, that even with consumer-grade SSDs, a BaM system can support application performance that is competitive against a much more expensive DRAM-only solution, and that the reduction in I/O amplification can yield significant performance benefit.”
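The "I/O traffic amplification" in that quote is simply the ratio of bytes moved to bytes the computation actually needs; the toy calculation below shows how large it can become when a coarse, CPU-orchestrated transfer serves a fine-grain access. The page and record sizes are example values, not figures from the paper.

// Illustrative arithmetic for I/O amplification: moving a whole page to
// satisfy a small record wastes most of the transferred bytes.
// The sizes below are assumed example values, not figures from the paper.
#include <cstdio>

int main()
{
    const double page_bytes   = 128.0 * 1024;  // coarse, CPU-orchestrated transfer
    const double record_bytes = 512.0;         // data the GPU thread actually needs
    printf("I/O amplification: %.0fx\n", page_bytes / record_bytes);  // 256x
    return 0;
}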