As artificial intelligence (AI) and machine learning (ML) workloads continue to grow in complexity and scale, the need for powerful and efficient computational resources becomes more critical. Running these workloads on Kubernetes lets you take advantage of its scalability and self-healing capabilities; however, managing and optimizing GPU resources on Kubernetes brings its own challenges. This is where GPU operators and device plugins step in: they provide a way to deploy and manage GPUs in Kubernetes.
There are a few GPU operators, such as the Intel Device Plugin Operator, the AMD GPU Operator, and the NVIDIA GPU Operator. The NVIDIA GPU Operator is one of the most popular, providing a comprehensive solution to streamline the deployment, management, and optimization of GPUs in Kubernetes environments.
In this post, we’ll dive into the NVIDIA GPU Operator, explore its features, and understand some basic constructs that enable you to use those features. Let’s begin!
Managing GPUs in Kubernetes clusters can be a daunting task. Traditional methods often require manual installation and configuration of GPU drivers, which can be time-consuming and error-prone. Additionally, leveraging advanced GPU features and ensuring efficient data transfer between GPUs and other system components requires specialized knowledge and tools. Without a streamlined approach, these challenges can hinder the performance and scalability of AI/ML workloads.
The NVIDIA GPU Operator offers multiple features. It makes setting up GPU drivers and their configuration on Kubernetes effortless. The ability to use advanced features like vGPU, Multi-Instance GPU (MIG), and GPU Time-Slicing is crucial when running multiple AI workloads on the same nodes. GPUs also need fast communication paths to other applications and GPUs, as well as to the storage setup; GPUDirect RDMA, GPUDirect Storage, and GDR Copy play an important role in enabling this. The GPU Operator makes it easy to bring all of these features, and more, to your Kubernetes cluster.
Automated installation and maintenance of GPU drivers: The NVIDIA GPU Operator automates the installation and upkeep of GPU drivers, eliminating the need for manual intervention. This automation ensures that the drivers are always up-to-date and correctly configured, allowing AI/ML workloads to run smoothly and efficiently.
Configuration of advanced GPU features: The operator makes it easy to turn on GPU sharing technologies such as vGPU, Multi-Instance GPU (MIG), and GPU Time-Slicing, which allow multiple workloads to share the GPUs on a node.
Configuring GPUDirect RDMA and GPUDirect Storage: GPUDirect RDMA enables direct GPU-to-GPU communication across nodes, and GPUDirect Storage enables direct data transfers between GPUs and storage devices, both bypassing the CPU; how they are typically enabled is sketched in the technology sections later in this post.
Configuring GDR Copy: GDR Copy (GPUDirect RDMA Copy) is a low-latency GPU memory copy library built on GPUDirect RDMA that allows the CPU to directly map and access GPU memory. It reduces the overhead of memory copy operations and improves overall performance.
Sandboxed Workloads: Enables applications to run in an isolated environment leveraging virtual machines (VMs) or containers with security restrictions. This helps with enhanced security, better resource management and reproducibility of the models.
To leverage the power of the NVIDIA GPU Operator for managing GPU resources in your Kubernetes clusters, you need to follow a structured installation process and meet certain prerequisites.
Before installing the NVIDIA GPU Operator, ensure that the following prerequisites are met:
A Kubernetes cluster running version 1.18 or later, with worker nodes that have NVIDIA GPUs, and Helm v3 available to install the operator
Follow these steps to install the NVIDIA GPU Operator on your Kubernetes cluster.
Set up Helm Repository. Add the NVIDIA Helm repository to your Helm configuration.
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
Create a dedicated namespace for the GPU Operator:
kubectl create namespace gpu-operator
Use Helm to install the GPU Operator in the created namespace:
helm install --namespace gpu-operator gpu-operator nvidia/gpu-operator
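If your worker nodes already have the NVIDIA driver installed (for example, baked into the OS image), you can install the operator with driver management disabled. driver.enabled is a standard chart value, but confirm the options against the documentation for your operator version:
helm install --namespace gpu-operator gpu-operator nvidia/gpu-operator \
  --set driver.enabled=false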
Verify installation. Check the status of the deployed resources to ensure the GPU Operator is running correctly:
kubectl get pods -n gpu-operator
You should see the GPU Operator and its components running in the gpu-operator namespace.
Check Node Configuration: Verify that the GPU Operator has configured the nodes correctly by checking for NVIDIA drivers and other necessary components.
$ kubectl describe nodes
Name:               infracloud01
Roles:              control-plane
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    ...
                    nvidia.com/gpu.deploy.container-toolkit=true
                    nvidia.com/gpu.deploy.dcgm=true
                    nvidia.com/gpu.deploy.dcgm-exporter=true
                    nvidia.com/gpu.deploy.device-plugin=true
                    nvidia.com/gpu.deploy.driver=true
                    nvidia.com/gpu.deploy.gpu-feature-discovery=true
                    nvidia.com/gpu.deploy.node-status-exporter=true
                    nvidia.com/gpu.deploy.operator-validator=true
                    nvidia.com/gpu.present=true
Annotations:        node.alpha.kubernetes.io/ttl: 0
                    nvidia.com/gpu-driver-upgrade-enabled: true
                    projectcalico.org/IPv4Address: 192.168.0.52/24
...
Capacity:
  cpu:                32
  ephemeral-storage:  982345180Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             131600376Ki
  pods:               110
Allocatable:
  cpu:                32
  ephemeral-storage:  905329316390
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             131497976Ki
  pods:               110
...
Look for labels and annotations, such as nvidia.com/gpu.present=true, indicating that the nodes are GPU-enabled and configured. You can also check the allocatable GPU resources directly, as shown below.
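For a quick check that GPUs are actually being advertised to the scheduler, list the allocatable nvidia.com/gpu resources per node. The node label used in the selector comes from the output above:
kubectl get nodes -l nvidia.com/gpu.present=true \
  -o custom-columns='NODE:.metadata.name,GPUs:.status.allocatable.nvidia\.com/gpu'
Each GPU node should report a non-zero GPU count once the driver and device plugin pods are running.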
You can now test your setup by deploying a sample CUDA VectorAdd application in your cluster. The sample application adds two vectors together and requests one nvidia.com/gpu resource. The allocation request is satisfied by the device plugin that the GPU Operator deploys.
Run the CUDA VectorAdd application.
cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvidia/samples:vectoradd-cuda11.2.1"
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
pod/cuda-vectoradd created
Check the pod logs.
$ kubectl logs cuda-vectoradd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
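Once you have confirmed the output, you can remove the test pod:
kubectl delete pod cuda-vectoradd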
Let’s look at some of the key technologies enabled by the GPU Operator in detail.
GPU concurrency is the ability of a GPU to execute multiple operations simultaneously by utilizing its parallel processing capabilities. This feature is essential for enhancing the performance, efficiency, and scalability of AI/ML workloads. With parallel processing, GPUs can significantly speed up training and inference, manage larger and more complex datasets, and provide real-time responses.
vGPU (Virtual GPU), Multi-Instance GPU (MIG), and GPU Time-Slicing are key technologies that enable GPU concurrency in different scenarios and through different mechanisms. Here’s a brief overview of each: vGPU virtualizes a physical GPU so that multiple virtual machines each get their own virtual GPU; MIG partitions supported GPUs into smaller, hardware-isolated GPU instances with dedicated compute and memory; and GPU Time-Slicing shares a single GPU among multiple workloads by interleaving their execution in time.
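As an example of what configuring one of these looks like, here is a sketch of enabling time-slicing through the GPU Operator, based on its documented ConfigMap mechanism. The ConfigMap name, the "any" key, and the replica count below are illustrative:
# Sketch: advertise each physical GPU as 4 shareable replicas via time-slicing.
cat << EOF | kubectl create -n gpu-operator -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
EOF
# Point the device plugin at this config through the ClusterPolicy.
kubectl patch clusterpolicies.nvidia.com/cluster-policy --type merge \
  -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "any"}}}}'
MIG is enabled in a similar declarative way: on supported GPUs (A100-class and newer), labeling a node with a MIG profile, for example nvidia.com/mig.config=all-1g.5gb, lets the operator’s MIG manager partition the GPU accordingly.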
NVIDIA GPUDirect RDMA (Remote Direct Memory Access) and GPUDirect Storage (GDS) are advanced technologies designed to optimize data transfer for high-performance computing applications. GPUDirect RDMA enables direct communication between GPUs across different nodes, bypassing the CPU and reducing latency. This direct data path is crucial for applications requiring fast, low-latency communication, such as distributed AI training and real-time data processing. By minimizing CPU involvement, GPUDirect RDMA significantly enhances performance and efficiency, enabling faster computations and more scalable AI workloads.
(GPUDirect RDMA: directly connecting GPUs over the network)
Similarly, GPUDirect Storage facilitates direct data transfer between GPUs and storage devices, bypassing the CPU and system memory. This direct access to storage devices, such as NVMe SSDs, accelerates data throughput and reduces latency, which is essential for data-intensive applications. By streamlining the data flow, GPUDirect Storage ensures that GPUs can quickly access and process large datasets, leading to faster and more efficient computations.
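Enabling these data paths is largely a matter of operator configuration. As a rough sketch, assuming a recent operator version and hosts prepared with the required network (MOFED) and storage drivers, GPUDirect RDMA and GPUDirect Storage can be switched on with Helm chart values at install or upgrade time (recent operator versions also expose a GDR Copy toggle; check the chart values for your release):
# Sketch: enable GPUDirect RDMA and GPUDirect Storage in the GPU Operator.
# driver.rdma.enabled deploys the RDMA-capable driver stack;
# gds.enabled deploys the nvidia-fs driver used by GPUDirect Storage.
helm upgrade --install gpu-operator nvidia/gpu-operator \
  -n gpu-operator \
  --set driver.rdma.enabled=true \
  --set gds.enabled=true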
(Storage access pattern with and without GPUDirect Storage)
GDR Copy stands for GPUDirect RDMA Copy, a low-latency GPU memory copy library based on GPUDirect RDMA. One prominent use case for this library is transferring data between the CPU (host) and GPU when the GPU is waiting to receive data and a signal from the host to start processing. GDR Copy uses the GPUDirect RDMA capability of exposing part of the GPU memory to external devices, such as NICs and CPUs, to perform CPU-driven data copy operations. This lets GDR Copy execute copies between CPU and GPU memory with lower latency and higher throughput than GPU-driven memory copy operations, such as cudaMemcpy, for small data sizes.
(Host-Device memory copy operation using cudaMemcpy and GDR Copy)
The figure above shows the difference between host-device memory copy operations using GDR Copy and cudaMemcpy. cudaMemcpy is a GPU-driven operation that uses the GPU’s DMA engine to move data, which adds DMA engine overhead for small data sizes. GDR Copy instead allows the CPU to access GPU memory directly via BAR mappings, enabling low-latency data transfers.
The NVIDIA GPU Operator uses several Custom Resource Definitions (CRDs) to manage the lifecycle of GPU drivers and associated components on Kubernetes. Here are some of the main CRDs.
The ClusterPolicy custom resource definition (CRD) is central to the NVIDIA GPU Operator. It serves as a single point of configuration for deploying all necessary NVIDIA software components on a Kubernetes cluster. The ClusterPolicy CRD allows administrators to specify and manage the entire lifecycle of GPU-related components, including drivers, runtime, device plugins, and monitoring tools.
The custom resource allows managing the configuration of important properties such as the GPU driver (whether it is deployed at all, plus its image and version), the container toolkit, the device plugin and its configuration, the MIG strategy, and monitoring components such as the DCGM exporter. A trimmed example is shown below.
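To see what this looks like in practice, you can inspect the ClusterPolicy that the Helm chart created. The excerpt below is abbreviated and illustrative; the field names follow the operator’s CRD, but the exact values will differ per cluster:
$ kubectl get clusterpolicy cluster-policy -o yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  driver:
    enabled: true          # install and manage the GPU driver
  toolkit:
    enabled: true          # NVIDIA container toolkit
  devicePlugin:
    enabled: true          # advertises nvidia.com/gpu to the scheduler
  dcgmExporter:
    enabled: true          # GPU metrics for monitoring
  mig:
    strategy: single       # how MIG devices are exposed (single or mixed)
  ...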
The NvidiaDriver custom resource definition (CRD) specifically manages the deployment and lifecycle of NVIDIA drivers on the Kubernetes nodes. It ensures that the correct driver version is installed and running, compatible with the GPU hardware and Kubernetes environment. While the driver configuration can also be controlled through the ClusterPolicy CR, the driver CR allows you to specify the driver type and version per node.
Some of the configurations that can be managed with it include the driver type, the driver image, repository, and version, and a node selector that limits which nodes the driver is rolled out to, as illustrated in the sketch below.
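As a minimal sketch, assuming an operator version that ships the NVIDIADriver CRD and has it enabled via the chart (check your version’s values for the corresponding toggle), a per-node-pool driver configuration could look like the following. The driver version and node selector here are illustrative:
cat << EOF | kubectl create -f -
apiVersion: nvidia.com/v1alpha1
kind: NVIDIADriver
metadata:
  name: gpu-driver-demo
spec:
  driverType: gpu                      # standard data-center GPU driver
  repository: nvcr.io/nvidia
  image: driver
  version: "550.54.15"                 # illustrative driver version
  nodeSelector:
    nvidia.com/gpu.present: "true"     # limit rollout to GPU nodes
EOF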
In this post, we saw how the NVIDIA GPU Operator is a crucial tool for optimizing and managing GPU resources in Kubernetes clusters, specifically tailored to meet the demanding needs of AI and ML workloads. It automates the installation and maintenance of GPU drivers, simplifies the configuration of advanced GPU features like Virtual GPU (vGPU), Multi-Instance GPU (MIG), GPU Time Slicing, GPUDirect RDMA, and GPUDirect Storage, and significantly enhances performance and efficiency.
We examined the key technologies supported by the GPU Operator, such as GPUDirect RDMA and GPUDirect Storage, which are essential for low-latency, high-speed data transfers. We also discussed GPU sharing technologies, vGPU, MIG, and GPU Time-Slicing, and how all three aim to enable shared GPU access, improve efficiency, and reduce costs, each targeting different use cases and hardware configurations.
In summary, the NVIDIA GPU Operator is vital for managing GPU resources efficiently in Kubernetes environments, supporting advanced technologies, and simplifying complex configurations, thereby driving superior performance and scalability for AI and ML workloads.
Now that you have learned about NVIDIA GPU Operator, its features and key concepts, you can bring in AI & GPU Cloud experts to help you build your own AI cloud.
If you found this post useful and informative, subscribe to our weekly newsletter for more posts like this. We’d love to hear your thoughts on this post, so do start a conversation on LinkedIn with Sameer and Sanket.