NVIDIA GPU Device Plugin
Name: nomad-device-nvidia
Use the NVIDIA device plugin to expose NVIDIA GPUs to Nomad. The plugin automatically supports Multi-Instance GPU (MIG).
The NVIDIA device plugin uses NVML bindings to query the available
NVIDIA devices and exposes them to Nomad through the Fingerprint RPC. The
plugin detects whether a GPU has Multi-Instance GPU (MIG) enabled and, if so,
fingerprints each instance as an individual GPU. To exclude GPUs from
fingerprinting, set the ignored_gpu_ids field.
Fingerprinted Attributes
Attribute | Unit |
---|---|
memory | MiB |
power | W (Watt) |
bar1 | MiB |
driver_version | string |
cores_clock | MHz |
memory_clock | MHz |
pci_bandwidth | MB/s |
display_state | string |
persistence_mode | string |
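Fingerprinted attributes can be referenced in a job's device block as ${device.attr.<attribute>}. The following is a minimal sketch rather than an official example: the job name gpu-constraint-example and the 4 GiB threshold are hypothetical, and the image is the one used elsewhere on this page.

```hcl
# Hypothetical job: only place the task on a GPU with at least 4 GiB of memory.
job "gpu-constraint-example" {
  datacenters = ["dc1"]
  type        = "batch"

  group "smi" {
    task "smi" {
      driver = "docker"

      config {
        image   = "nvidia/cuda:11.0-base"
        command = "nvidia-smi"
      }

      resources {
        device "nvidia/gpu" {
          count = 1

          # "memory" is one of the fingerprinted attributes listed above.
          # The threshold is an illustrative value, not a recommendation.
          constraint {
            attribute = "${device.attr.memory}"
            operator  = ">="
            value     = "4 GiB"
          }
        }
      }
    }
  }
}
```

A constraint is a hard requirement, so the job stays pending if no GPU qualifies. Use an affinity block instead when the attribute is only a preference.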
Runtime Environment
The nvidia-gpu device plugin exposes the following environment variables:
- NVIDIA_VISIBLE_DEVICES - List of NVIDIA GPU IDs available to the task.
Additional Task Configurations
A task can set additional environment variables to influence the runtime environment. See NVIDIA's documentation.
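As an illustration, a task could set such a variable in its env block. This sketch assumes the NVIDIA container runtime's NVIDIA_DRIVER_CAPABILITIES variable and uses a hypothetical job name; check NVIDIA's documentation for the variables and values your runtime version supports.

```hcl
# Hypothetical job: restrict the driver capabilities exposed to the container.
job "gpu-env-example" {
  datacenters = ["dc1"]
  type        = "batch"

  group "smi" {
    task "smi" {
      driver = "docker"

      config {
        image   = "nvidia/cuda:11.0-base"
        command = "nvidia-smi"
      }

      # Assumption: the NVIDIA container runtime reads this variable.
      # "compute,utility" exposes CUDA and nvidia-smi style tooling.
      env {
        NVIDIA_DRIVER_CAPABILITIES = "compute,utility"
      }

      resources {
        device "nvidia/gpu" {
          count = 1
        }
      }
    }
  }
}
```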
Installation Requirements
To use the nomad-device-nvidia device plugin, the following prerequisites must be met:
- GNU/Linux x86_64 with kernel version > 3.10
- NVIDIA GPU with Architecture > Fermi (2.1)
- NVIDIA drivers >= 340.29 with the nvidia-smi binary
- Docker v19.03+
Container Toolkit Installation
Follow the NVIDIA Container Toolkit installation instructions from NVIDIA to prepare a machine to use Docker containers with NVIDIA GPUs. To test the environment, run the following command; it should print the nvidia-smi table listing the machine's GPUs.
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
Plugin Configuration
    plugin "nomad-device-nvidia" {
      config {
        enabled            = true
        ignored_gpu_ids    = ["GPU-fef8089b", "GPU-ac81e44d"]
        fingerprint_period = "1m"
      }
    }
The nomad-device-nvidia device plugin supports the following configuration in the agent config:
- enabled (bool: true) - Controls whether the plugin is enabled and running.
- ignored_gpu_ids (array<string>: []) - Specifies the set of GPU UUIDs that should be ignored when fingerprinting.
- fingerprint_period (string: "1m") - The interval at which the plugin fingerprints for device changes.
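The plugin block lives in the Nomad client's agent configuration, and the plugin binary must be placed in the agent's plugin directory. The following sketch shows how the pieces might fit together; the two directory paths and the 5m period are hypothetical values chosen for illustration.

```hcl
# Hypothetical paths: adjust to match your installation.
data_dir   = "/opt/nomad/data"
plugin_dir = "/opt/nomad/plugins" # directory holding the nomad-device-nvidia binary

client {
  enabled = true
}

plugin "nomad-device-nvidia" {
  config {
    enabled = true

    # Hide one GPU from Nomad (UUID taken from the example above).
    ignored_gpu_ids = ["GPU-fef8089b"]

    # Illustrative value: fingerprint less often than the 1m default.
    fingerprint_period = "5m"
  }
}
```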
Limitations
The NVIDIA integration only works with task drivers that natively integrate with NVIDIA's container runtime library.
Nomad has tested support with the docker driver.
Source Code & Compiled Binaries
The source code for this plugin can be found at hashicorp/nomad-device-nvidia. You can also find pre-built binaries on the releases page.
Examples
Inspect a node with a GPU:
    $ nomad node status 4d46e59f

    // ...TRUNCATED...

    Device Resource Utilization
    nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB
Display detailed statistics on a node with a GPU:
    $ nomad node status -stats 4d46e59f

    // ...TRUNCATED...

    Device Resource Utilization
    nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

    // ...TRUNCATED...

    Device Stats
    Device              = nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]
    BAR1 buffer state   = 2 / 16384 MiB
    Decoder utilization = 0 %
    ECC L1 errors       = 0
    ECC L2 errors       = 0
    ECC memory errors   = 0
    Encoder utilization = 0 %
    GPU utilization     = 0 %
    Memory state        = 0 / 11441 MiB
    Memory utilization  = 0 %
    Power usage         = 37 / 149 W
    Temperature         = 34 C
Run the following example job to see that the GPU was mounted in the container:
    job "gpu-test" {
      datacenters = ["dc1"]
      type        = "batch"

      group "smi" {
        task "smi" {
          driver = "docker"

          config {
            image   = "nvidia/cuda:11.0-base"
            command = "nvidia-smi"
          }

          resources {
            device "nvidia/gpu" {
              count = 1

              # Add an affinity for a particular model
              affinity {
                attribute = "${device.model}"
                value     = "Tesla K80"
                weight    = 50
              }
            }
          }
        }
      }
    }
    $ nomad run example.nomad.hcl
    ==> Monitoring evaluation "21bd7584"
        Evaluation triggered by job "gpu-test"
        Allocation "d250baed" created: node "4d46e59f", group "smi"
        Evaluation status changed: "pending" -> "complete"
    ==> Evaluation "21bd7584" finished with status "complete"

    $ nomad alloc status d250baed

    // ...TRUNCATED...

    Task "smi" is "dead"
    Task Resources
    CPU        Memory       Disk     Addresses
    0/100 MHz  0 B/300 MiB  300 MiB

    Device Stats
    nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

    Task Events:
    Started At     = 2019-01-23T18:25:32Z
    Finished At    = 2019-01-23T18:25:34Z
    Total Restarts = 0
    Last Restart   = N/A

    Recent Events:
    Time                  Type        Description
    2019-01-23T18:25:34Z  Terminated  Exit Code: 0
    2019-01-23T18:25:32Z  Started     Task started by client
    2019-01-23T18:25:29Z  Task Setup  Building Task Directory
    2019-01-23T18:25:29Z  Received    Task received by client

    $ nomad alloc logs d250baed
    Wed Jan 23 18:25:32 2019
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla K80           On   | 00004477:00:00.0 Off |                    0 |
    | N/A   33C    P8    37W / 149W |      0MiB / 11441MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+