Raptisv Blog

This article is a guide to setting up the NVIDIA GPU Operator on a Kubernetes cluster running on an Azure virtual machine.

VM setup with Nvidia image

Set up a virtual machine on Azure using the NVIDIA GPU-Optimized VMI image. The image contains:

  • Ubuntu Server OS
  • NVIDIA Driver
  • Docker-ce
  • NVIDIA Container Toolkit
  • Azure CLI, NGC CLI
  • Miniconda, JupyterLab, Git
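Before going further, it is worth confirming that the preinstalled driver and container toolkit actually work on the VM. A quick sanity check (output varies by GPU model):

```shell
# Verify the preinstalled NVIDIA driver is loaded and the GPU is visible
nvidia-smi

# Confirm the NVIDIA Container Toolkit CLI is present
nvidia-ctk --version
```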

Install helm

sudo curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 \
    && sudo chmod 700 get_helm.sh \
    && sudo ./get_helm.sh
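A quick check that helm landed on the PATH:

```shell
# Print the installed helm client version
helm version --short
```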

Install GPU Operator

The image already has the NVIDIA driver installed, so we pass --set driver.enabled=false to stop the operator from deploying its own driver.

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

Make sure you install the operator using the correct --kubeconfig path.

helm install --kubeconfig ~/.kube/config --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator --set driver.enabled=false
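Once the install completes, you can verify that the operator pods came up and that the node now advertises a GPU resource (the custom-columns expression below is one way to surface it):

```shell
# All operator pods should reach Running or Completed state
kubectl get pods -n gpu-operator

# The node should report an allocatable nvidia.com/gpu resource
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
```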

Set containerd as the default runtime

sudo nvidia-ctk runtime configure --runtime=containerd

Restart containerd

sudo systemctl restart containerd
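The nvidia-ctk command above edits /etc/containerd/config.toml. After the restart, you can confirm the nvidia runtime was registered and containerd came back up:

```shell
# The configure step should have added an nvidia runtime entry
grep -n "nvidia" /etc/containerd/config.toml

# containerd should be active again after the restart
systemctl is-active containerd
```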

Use the GPU from a k8s deployment

Set the node selector so the pod is scheduled on a node where a GPU is available.

nodeSelector:
    nvidia.com/gpu.present: "true"

Set the resource limits as shown below to request the GPU.

limits:
    nvidia.com/gpu: 1
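Putting the two snippets together, a minimal Deployment might look like the sketch below. The name is a placeholder; the image shown is one of NVIDIA's CUDA sample images, which runs a simple vector-add on the GPU.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-test                         # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-test
  template:
    metadata:
      labels:
        app: gpu-test
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"   # label applied by the GPU Operator
      containers:
        - name: cuda-vectoradd
          image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
          resources:
            limits:
              nvidia.com/gpu: 1          # request one GPU
```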

Extras

Uninstall the NVIDIA GPU Operator

helm uninstall -n gpu-operator $(helm list -n gpu-operator | grep gpu-operator | awk '{print $1}')
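If you also want to remove the namespace that the install created with --create-namespace:

```shell
# Delete the now-empty gpu-operator namespace
kubectl delete namespace gpu-operator
```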