Modern processors
Every Nomad node has a Central Processing Unit (CPU) that provides the computational power needed for running operating system processes. Nomad uses the CPU to run tasks defined by the Nomad job submitter. So that Nomad knows which nodes have sufficient capacity for running a given task, each node in the cluster is fingerprinted to gather information about the performance characteristics of its CPU. Two CPU metrics are tracked for each Nomad node: its bandwidth (how much it can compute) and its number of cores.
Modern CPUs may contain heterogeneous core types. Apple introduced the M1 CPU in 2020, which contains both performance (P-Core) and efficiency (E-Core) core types. Each core type operates at a different base frequency. Intel introduced a similar topology in its Raptor Lake chips in 2022. When fingerprinting the characteristics of a CPU, Nomad is capable of taking these advanced CPU topologies into account.
Calculating CPU resources
The total CPU bandwidth of a Nomad node is the sum, over each core type, of the product of that core type's frequency and the number of cores of that type in the CPU.

```
bandwidth = (p_cores * p_frequency) + (e_cores * e_frequency)
```
The total number of cores is computed by summing the number of P-Cores and the number of E-Cores.
```
cores = p_cores + e_cores
```
Nomad does not distinguish between logical and physical CPU cores. One of the defining differences between the P-Core and E-Core types is that E-Cores do not support hyperthreading, whereas P-Cores do. As such, a single physical P-Core is presented as 2 logical cores, and a single E-Core is presented as 1 logical core.
The example below is from a Nomad node with an Intel i9-13900 CPU. It is made up of mixed core types, with a P-Core base frequency of 2 GHz and an E-Core base frequency of 1.5 GHz.
These characteristics are reflected in the `cpu.frequency.performance` and `cpu.frequency.efficiency` node attributes, respectively.
```
cpu.arch                  = amd64
cpu.frequency.efficiency  = 1500
cpu.frequency.performance = 2000
cpu.modelname             = 13th Gen Intel(R) Core(TM) i9-13900
cpu.numcores              = 32
cpu.numcores.efficiency   = 16
cpu.numcores.performance  = 16
cpu.reservablecores       = 32
cpu.totalcompute          = 56000
cpu.usablecompute         = 56000
```
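Applying the formulas above to these attributes reproduces the fingerprinted totals:

```
bandwidth = (16 * 2000) + (16 * 1500) = 56000 MHz   # cpu.totalcompute
cores     = 16 + 16                   = 32          # cpu.numcores
```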
Reserving CPU resources
In the fingerprinted node attributes, `cpu.totalcompute` indicates the total amount of CPU bandwidth the processor is capable of delivering. In some cases it may be beneficial to reserve some amount of a node's CPU resources for use by the operating system and other non-Nomad processes. This can be done in client configuration.
The amount of reserved CPU can be specified in terms of bandwidth via `cpu`:
```hcl
client {
  reserved {
    cpu = 3000 # mhz
  }
}
```
Or as a specific set of `cores` on which to disallow the scheduling of Nomad tasks. This capability is available on Linux systems only.
```hcl
client {
  reserved {
    cores = "0-3"
  }
}
```
When the CPU is constrained by one of the above configurations, the node attribute `cpu.usablecompute` indicates the total amount of CPU bandwidth available for scheduling of Nomad tasks.
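For example, combining the `reserved { cpu = 3000 }` configuration above with the i9-13900 node fingerprinted earlier, the usable bandwidth would be expected to drop by the reserved amount:

```
cpu.totalcompute  = 56000
cpu.usablecompute = 53000   # 56000 - 3000 MHz reserved for the OS
```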
Allocating CPU resources
When scheduling jobs, a task must specify how much CPU resource should be allocated on its behalf. This can be done in terms of bandwidth in MHz with the `cpu` attribute. This MHz value is translated directly into cgroup CPU shares on Linux systems.
```hcl
task {
  resources {
    cpu = 2000 # mhz
  }
}
```
Note that the isolation mechanism around CPU resources depends on each task driver and its configuration. The standard behavior is that Nomad ensures a task has access to at least as much CPU bandwidth as it has been allocated; if the node has idle CPU capacity, the task may use additional CPU resources. Some task drivers can limit a task to only the bandwidth allocated to it, as described in the CPU hard limits section below.
On Linux systems, Nomad supports reserving whole CPU cores specifically for a task. No task will be allowed to run on a CPU core reserved for another task.
```hcl
task {
  resources {
    cores = 4
  }
}
```
Nomad Enterprise supports NUMA aware scheduling, which enables operators to more finely control which CPU cores may be reserved for tasks.
CPU hard limits
Some task drivers support the configuration option `cpu_hard_limit`. If enabled, this option restricts tasks from bursting above their CPU limit even when there is idle capacity on the node. The tradeoff is consistency versus utilization.
A task with too few CPU resources may operate fine until another task is placed on the node, reducing the available CPU bandwidth and potentially disrupting the underprovisioned task.
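For example, the Docker task driver exposes `cpu_hard_limit` in its `config` block. The sketch below is illustrative; the image name and resource values are placeholders:

```hcl
task "app" {
  driver = "docker"

  config {
    image          = "redis:7" # placeholder image
    cpu_hard_limit = true      # prevent bursting above the allocated bandwidth
  }

  resources {
    cpu = 2000 # mhz, treated as a ceiling when cpu_hard_limit is enabled
  }
}
```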
CPU environment variables
To help tasks understand the resources available to them, Nomad sets the following environment variables in their runtime environment.
- `NOMAD_CPU_LIMIT` - The amount of CPU bandwidth allocated on behalf of the task.
- `NOMAD_CPU_CORES` - The set of cores, in cpuset notation, reserved for the task. This value is only set if `resources.cores` is configured.
```
NOMAD_CPU_CORES=3-5
NOMAD_CPU_LIMIT=9000
```
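As an illustration, a task could also surface these values to its application through a `template` block, which can read them from the runtime environment. This is a sketch; the destination path and rendered variable names are arbitrary:

```hcl
task {
  resources {
    cores = 3
  }

  # Render the CPU environment variables into a file the application
  # can read at startup.
  template {
    destination = "local/cpu.env"
    data        = <<EOF
CPU_LIMIT_MHZ={{ env "NOMAD_CPU_LIMIT" }}
CPU_CORES={{ env "NOMAD_CPU_CORES" }}
EOF
  }
}
```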
NUMA
Nomad clients are commonly provisioned on real hardware in an on-premise environment or in the cloud on large `.metal` instance types. In either case, it is likely the underlying server is designed around a non-uniform memory access (NUMA) topology.
Servers that contain multiple CPU sockets or multiple RAM banks per CPU socket
are characterized by the non-uniform access times involved in accessing system
memory.
Consider a simplified example machine with the following topology:
- 2 physical CPU sockets
- 4 system memory banks, 2 per socket
- 8 physical CPU cores (4 per socket)
- 2 logical CPU cores per physical core
- 4 PCI devices, 1 per memory bank
Optimizing performance
Operating system processes take longer to access memory across a NUMA boundary.
Using the example above, if a task is scheduled on Core 0, accessing memory in Mem 1 might take 20% longer than accessing memory in Mem 0, and accessing memory in Mem 2 might take 300% longer.
These extreme differences are due to various physical hardware limitations. A core accessing memory in its own NUMA node is optimal. Programs that perform a high throughput of reads or writes to or from system memory will have their performance substantially hindered if they do not optimize their spatial locality with regard to the system's NUMA topology.
SLIT tables
Modern machines will define System Locality Distance Information (SLIT) tables in their firmware. These tables are understood and made referenceable by the Linux kernel. There are two key pieces of information provided by SLIT tables:
- Which CPU cores belong to which NUMA nodes
- The penalty incurred for accessing each NUMA node from a core in every other NUMA node
The `lscpu` command can be used to describe the core associativity on a machine. For example, on an `r6a.metal` EC2 instance:
```
$ lscpu | grep NUMA
NUMA node(s):        4
NUMA node0 CPU(s):   0-23,96-119
NUMA node1 CPU(s):   24-47,120-143
NUMA node2 CPU(s):   48-71,144-167
NUMA node3 CPU(s):   72-95,168-191
```
And the associated performance degradations are available via `numactl`:
```
$ numactl -H
available: 4 nodes (0-3)
...
node distances:
node   0   1   2   3
  0:  10  12  32  32
  1:  12  10  32  32
  2:  32  32  10  12
  3:  32  32  12  10
```
These SLIT table node distance values are approximate relative ratios. The value 10 represents the optimal case, where a memory access is occurring from a CPU that is part of the same NUMA node. The other values scale proportionally: a distance of 20 indicates an access takes roughly twice as long as a local access, a distance of 30 roughly three times as long, and so on.
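Applying this interpretation to the `r6a.metal` distances above gives approximately:

```
distance 10 -> 1.0x local access latency (same NUMA node)
distance 12 -> ~1.2x local access latency
distance 32 -> ~3.2x local access latency
```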
Node attributes
Nomad clients will fingerprint the machine's NUMA topology and export the core associativity as node attributes. This data can provide a Nomad operator with a better understanding of when it might be useful to enable NUMA aware scheduling for certain workloads.
```
numa.node.count  = 4
numa.node0.cores = 0-23,96-119
numa.node1.cores = 24-47,120-143
numa.node2.cores = 48-71,144-167
numa.node3.cores = 72-95,168-191
```
NUMA aware scheduling (Enterprise)
Nomad Enterprise is capable of scheduling tasks in a way that is optimized for the NUMA topology of a client node. Nomad is able to correlate CPU cores with memory nodes and assign tasks to run on specific CPU cores so as to minimize any cross-memory node access patterns. Additionally, Nomad is able to correlate devices to memory nodes and enable NUMA-aware scheduling to take device associativity into account when making scheduling decisions.
A task may specify a `numa` block indicating its NUMA optimization preference. This example allocates a 1080ti GPU and ensures it is on the same NUMA node as the 4 CPU cores reserved for the task.
```hcl
task {
  resources {
    cores  = 4
    memory = 2048

    device "nvidia/gpu/1080ti" {
      count = 1
    }

    numa {
      affinity = "require"
      devices  = [
        "nvidia/gpu/1080ti",
      ]
    }
  }
}
```
affinity options
This is a required field. There are three supported affinity options: `none`, `prefer`, and `require`, each with their own advantages and tradeoffs.
option none
In the `none` mode, the Nomad scheduler leverages the apathy of jobs with no preference for NUMA affinity to help reduce core fragmentation within NUMA nodes. It does so by bin-packing the core requests of these jobs onto the NUMA nodes with the fewest unused cores available.
The `none` mode is the default if the `numa` block is not specified.
```hcl
resources {
  cores = 4

  numa {
    affinity = "none"
  }
}
```
option prefer
In the `prefer` mode, the Nomad scheduler uses the hardware topology of a node to calculate an optimized selection of available cores, but does not limit those cores to come from a single NUMA node.
```hcl
resources {
  cores = 4

  numa {
    affinity = "prefer"
  }
}
```
option require
In the `require` mode, the Nomad scheduler uses the topology of each potential client to find a set of available CPU cores that belong to the same NUMA node. If no such set of cores can be found, that node is marked exhausted for the resource of `numa-cores`.
```hcl
resources {
  cores = 4

  numa {
    affinity = "require"
  }
}
```
devices options
`devices` is an optional list of devices that must be colocated on the NUMA node along with allocated CPU cores.
Devices can be correlated to CPU cores and memory nodes in the same way. The following example declares three devices and configures two of them in the `numa` block.
```hcl
task {
  resources {
    cores  = 8
    memory = 16384

    device "nvidia/gpu/H100" {
      count = 2
    }

    device "intel/net/XXVDA2" {
      count = 1
    }

    device "xilinx/fpga/X7" {
      count = 1
    }

    numa {
      affinity = "require"
      devices  = [
        "nvidia/gpu/H100",
        "intel/net/XXVDA2",
      ]
    }
  }
}
```
Virtual CPU fingerprinting
When running on a virtualized host such as Amazon EC2, Nomad makes use of the `dmidecode` tool to detect CPU performance data. Some Linux distributions will require installing the `dmidecode` package manually.
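For example, on common Linux distributions the package can typically be installed with the system package manager:

```
# Debian/Ubuntu
$ sudo apt-get install dmidecode

# RHEL/Fedora
$ sudo dnf install dmidecode
```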