Fast Checkpoint Restore for AMD GPUs with CRIU

Rajneesh Bhardwaj, Felix Kuehling and David Yat Sin

XDC 2021

CRIU, a.k.a. Checkpoint/Restore in Userspace, is the de-facto choice for checkpointing and restoring processes on Linux. One of its major limitations is handling tasks that have device state associated with them: that state must be managed by a driver CRIU cannot control, though CRIU provides a flexible plugin mechanism to accommodate it. So far there has been no serious plugin (at least in the public domain) that deals with a complex real device such as a GPU. We would like to discuss our work to support CRIU with AMD ROCm, AMD's fully open-source solution for the machine learning and HPC compute space. This could potentially be extended to support video decode/encode using render nodes.

CRIU already has a plugin architecture to support processes using device files. Using this architecture, we added a plugin that supports CRIU for GPU compute applications running on the AMD ROCm software stack. This requires new ioctls in the KFD kernel-mode driver to save and restore hardware and kernel-mode driver state, such as memory mappings, VRAM contents, user-mode queues, and signals. We also needed a few new plugin hooks in CRIU itself: to remap device files and mmap offsets within them, and to finalize GPU virtual memory mappings and resume execution of the GPU after all VMAs have been restored by the PIE code.

The result is the first real-world plugin and the first example of GPU support in CRIU.

While we faced several new challenges in enabling this work, we were finally able to support real TensorFlow/PyTorch workloads across multi-GPU nodes using CRIU, and were also able to migrate containers running GPU-bound workloads. In this talk, we'd like to describe our journey, which started with a small 64KB buffer object in GPU VRAM and grew to gigabytes of single VRAM buffer objects across GPUs. We initially used the /proc/pid/mem interface, then switched to a faster direct approach that only worked with large-PCIe-BAR GPUs but was still slow: copying 16GB of VRAM took ~15 minutes with the direct approach on large BARs, and more than 45 minutes with small BARs. We then switched to the system DMA engines built into most AMD GPUs, which brought very significant improvements: we can now checkpoint the same amount of data within 5 seconds. For this we initially modified libdrm, but the maintainers didn't agree to change a private API to expose GEM handles to userspace, so we ended up making a kernel change that exports the buffer objects in VRAM as DMA-BUF objects, which our plugin then imports using libdrm.

We are going to present the architecture of our plugin and how it interacts with CRIU and our GPU driver during the checkpoint and restore flows. We can also discuss some security considerations, initial test results, and performance numbers.

Further reading:
Our work-in-progress code: