
This article is a guide to setting up the NVIDIA GPU Operator on a Kubernetes cluster running on an Azure virtual machine.

VM setup with NVIDIA image

Set up a virtual machine on Azure using the NVIDIA GPU-Optimized VMI (Virtual Machine Image); a CLI sketch follows the list below. The image contains:

  • Ubuntu Server OS
  • NVIDIA Driver
  • Docker-ce
  • NVIDIA Container Toolkit
  • Azure CLI, NGC CLI
  • Miniconda, JupyterLab, Git
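
If you prefer the Azure CLI over the portal, the VM creation could look roughly like the sketch below. The resource group, VM name and size are placeholders, and the image URN must be looked up in the Azure Marketplace listing for the NVIDIA GPU-Optimized VMI.

# Marketplace images may require accepting the image terms first:
# az vm image terms accept --urn <nvidia-gpu-optimized-vmi-urn>
az vm create \
    --resource-group my-gpu-rg \
    --name gpu-vm-01 \
    --image <nvidia-gpu-optimized-vmi-urn> \
    --size Standard_NC6s_v3 \
    --admin-username azureuser \
    --generate-ssh-keys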

Install Helm

sudo curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 \
    && sudo chmod 700 get_helm.sh \
    && sudo ./get_helm.sh
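
To confirm that Helm was installed correctly:

helm version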

Install GPU Operator

The image already has the NVIDIA driver installed, so we disable driver deployment with --set driver.enabled=false
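
The chart is served from the NVIDIA Helm repository, so add it first if it is not already configured:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
    && helm repo update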

helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator \
    --set driver.enabled=false
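
Once the install completes, verify that the operator pods are running:

kubectl get pods -n gpu-operator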

Configure the NVIDIA runtime for containerd

sudo nvidia-ctk runtime configure --runtime=containerd

Restart containerd

sudo systemctl restart containerd
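
With containerd reconfigured, you can sanity-check GPU access with a one-off pod. This is a minimal sketch; the CUDA image tag is only an example, so substitute any CUDA base image available to you.

# gpu-smoke-test.yaml - one-off pod that prints the GPU status and exits
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      # example tag; any CUDA base image works with the NVIDIA runtime
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1

Apply it and read the output once the pod has completed:

kubectl apply -f gpu-smoke-test.yaml
kubectl logs gpu-smoke-test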

Use the GPU from a k8s deployment

Set the node selector so the pod is scheduled on a node where a GPU is available:

nodeSelector:
    nvidia.com/gpu.present: "true"

Set the resource limits as shown below to request the GPU:

limits:
    nvidia.com/gpu: 1
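
Putting both snippets together, a minimal Deployment could look like the sketch below; the name, labels and image are placeholders, with the CUDA image tag used purely as an example.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-demo
  template:
    metadata:
      labels:
        app: gpu-demo
    spec:
      # schedule only on nodes labelled as having a GPU
      nodeSelector:
        nvidia.com/gpu.present: "true"
      containers:
        - name: cuda
          image: nvidia/cuda:12.2.0-base-ubuntu22.04
          command: ["sleep", "infinity"]
          resources:
            limits:
              # request one GPU from the device plugin
              nvidia.com/gpu: 1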

Extras

Uninstall the NVIDIA GPU Operator

helm uninstall -n gpu-operator $(helm list -n gpu-operator | grep gpu-operator | awk '{print $1}')
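
If you also want to remove the namespace that was created during installation:

kubectl delete namespace gpu-operator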