Prevent priority inversion with preemption
Preemption allows Nomad to evict running allocations in order to place allocations of a higher priority. Allocations of a blocked job temporarily go into "pending" status until the cluster has additional capacity to run them. This is useful when operators need higher-priority tasks to run promptly, even under resource contention across the cluster.
Nomad v0.9.0 added preemption for system jobs. Nomad Enterprise v0.9.3 added preemption for service and batch jobs, and Nomad v0.12.0 made preemption an open source feature for all three job types.
Preemption is enabled by default for system jobs. It can be enabled for service and batch jobs by sending a payload with the appropriate options specified to the scheduler configuration API endpoint.
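As an aside, recent Nomad releases (1.2 and later, so newer than the versions this guide targets) also expose this configuration through an operator CLI command; verify availability on your version, since the HTTP API calls used in this guide work on every version that supports preemption.

```shell-session
# Available on recent Nomad releases (1.2+); older versions use the HTTP API shown later.
$ nomad operator scheduler get-config
```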
Prerequisites
To perform the tasks described in this guide, you need a Nomad environment with Consul installed. You can use this repository to provision a sandbox environment; however, you need Nomad v0.12.0 or higher, or Nomad Enterprise v0.9.3 or higher.

You need a cluster with one server node and three client nodes. To simulate resource contention, each node in this environment has 1 GB of RAM (for AWS, you can choose the t2.micro instance type).
Tip: This tutorial is for demo purposes and uses only a single server node. Three or five server nodes are recommended for a production cluster.
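For reference, a single-server sandbox of this shape typically runs the Nomad server agent with a configuration along the following lines. This is an illustrative sketch; the file name, `data_dir`, and values are assumptions rather than part of the sandbox repository.

```hcl
# server.hcl - illustrative single-server agent configuration (values are assumptions)
datacenter = "dc1"
data_dir   = "/opt/nomad/data"

server {
  enabled          = true
  bootstrap_expect = 1 # demo only; use 3 or 5 servers in production
}
```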
Create a job with low priority
Start by creating a job with a relatively low priority in your Nomad cluster. One of the allocations from this job will be preempted in a subsequent deployment when there is resource contention in the cluster. Copy the following job into a file and name it `webserver.nomad.hcl`.
job "webserver" { datacenters = ["dc1"] type = "service" priority = 40 group "webserver" { count = 3 network { port "http" { to = 80 } } service { name = "apache-webserver" port = "http" check { name = "alive" type = "http" path = "/" interval = "10s" timeout = "2s" } } task "apache" { driver = "docker" config { image = "httpd:latest" ports = ["http"] } resources { memory = 600 } } }}
Note that the count is 3 and that each allocation specifies 600 MB of memory. Remember that each node has only 1 GB of RAM.
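Optionally, you can check the job file for syntax errors before submitting it with the `nomad job validate` command.

```shell-session
$ nomad job validate webserver.nomad.hcl
```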
Run the low priority job
Use the `nomad job run` command to start the `webserver.nomad.hcl` job.
```shell-session
$ nomad job run webserver.nomad.hcl
==> Monitoring evaluation "35159f00"
    Evaluation triggered by job "webserver"
    Evaluation within deployment: "278b2e10"
    Allocation "0850e103" created: node "cf8487e2", group "webserver"
    Allocation "551a7283" created: node "ad10ba3b", group "webserver"
    Allocation "8a3d7e1e" created: node "18997de9", group "webserver"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "35159f00" finished with status "complete"
```
Check the status of the `webserver` job using the `nomad job status` command at this point and verify that an allocation has been placed on each client node in the cluster.
```shell-session
$ nomad job status webserver
ID            = webserver
Name          = webserver
Submit Date   = 2021-02-11T19:18:29-05:00
Type          = service
Priority      = 40
...

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
0850e103  cf8487e2  webserver   0        run      running  8s ago   2s ago
551a7283  ad10ba3b  webserver   0        run      running  8s ago   2s ago
8a3d7e1e  18997de9  webserver   0        run      running  8s ago   1s ago
```
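If you want to cross-check the Node ID values above against your clients, the `nomad node status` command lists every node in the cluster along with its status.

```shell-session
$ nomad node status
```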
Create a job with high priority
Create another job with a priority greater than the "webserver" job. Copy the following into a file named `redis.nomad.hcl`.
job "redis" { datacenters = ["dc1"] type = "service" priority = 80 group "cache1" { count = 1 network { port "db" { to = 6379 } } service { name = "redis-cache" port = "db" check { name = "alive" type = "tcp" interval = "10s" timeout = "2s" } } task "redis" { driver = "docker" config { image = "redis:latest" ports = ["db"] } resources { memory = 700 } } }}
Note that this job has a priority of 80 (greater than the priority of the `webserver` job from earlier) and requires 700 MB of memory. This allocation will create resource contention in the cluster: each node has only 1 GB of memory and already holds a 600 MB allocation, leaving roughly 400 MB free, which is less than the 700 MB requested.
Observe a run before and after enabling preemption
Try to run redis.nomad.hcl
Remember that preemption for service and batch jobs is not enabled by default. This means that the `redis` job will be queued due to resource contention in the cluster. You can verify the resource contention before actually registering your job by running the `nomad job plan` command.
```shell-session
$ nomad job plan redis.nomad.hcl
+ Job: "redis"
+ Task Group: "cache1" (1 create)
  + Task: "redis" (forces create)

Scheduler dry-run:
- WARNING: Failed to place all allocations.
  Task Group "cache1" (failed to place 1 allocation):
    * Resources exhausted on 3 nodes
    * Dimension "memory" exhausted on 3 nodes
...
```
Run the `redis.nomad.hcl` job with the `nomad job run` command. Observe that the allocation was queued.
```shell-session
$ nomad job run redis.nomad.hcl
==> Monitoring evaluation "3c6593b4"
    Evaluation triggered by job "redis"
    Evaluation within deployment: "ae55a4aa"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "3c6593b4" finished with status "complete" but failed to place all allocations:
    Task Group "cache1" (failed to place 1 allocation):
      * Resources exhausted on 3 nodes
      * Dimension "memory" exhausted on 3 nodes
    Evaluation "249fd21b" waiting for additional capacity to place remainder
```
You can also verify that the allocation has been queued by fetching the status of the job with the `nomad job status` command.
```shell-session
$ nomad job status redis
ID            = redis
Name          = redis
Submit Date   = 2021-02-11T19:22:55-05:00
Type          = service
Priority      = 80
...

Placement Failure
Task Group "cache1":
  * Resources exhausted on 3 nodes
  * Dimension "memory" exhausted on 3 nodes
...

Allocations
No allocations placed
```
Stop the `redis` job for now. In the next steps, you will enable service job preemption and re-deploy. Use the `nomad job stop` command with the `-purge` flag set.
```shell-session
$ nomad job stop -purge redis
==> Monitoring evaluation "a9c9945d"
    Evaluation triggered by job "redis"
    Evaluation within deployment: "ae55a4aa"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "a9c9945d" finished with status "complete"
```
Enable service job preemption
Get the current scheduler configuration using the Nomad API. Setting an environment variable with your cluster address makes the `curl` commands more reusable. Substitute in the proper address for your Nomad cluster.
```shell-session
$ export NOMAD_ADDR=http://127.0.0.1:4646
```
If you are enabling preemption in an ACL-enabled Nomad cluster, you will also need to authenticate to the API with a Nomad token via the `X-Nomad-Token` header. In this case, you can use an environment variable to add the header option and your token value to each command. If you don't use tokens, skip this step; the `curl` commands will run correctly when the variable is unset.

```shell-session
$ export NOMAD_AUTH='--header X-Nomad-Token:«replace with your token»'
```

Note that there is no space after the colon in the header value, so the unquoted `${NOMAD_AUTH}` expansion passes the header to `curl` without the shell splitting it apart.
To learn more about ACLs, consult the Nomad ACL documentation.
Now, fetch the configuration with the following `curl` command.
```shell-session
$ curl --silent ${NOMAD_AUTH} \
    ${NOMAD_ADDR}/v1/operator/scheduler/configuration?pretty
```
{ "SchedulerConfig": { "SchedulerAlgorithm": "binpack", "PreemptionConfig": { "SystemSchedulerEnabled": true, "BatchSchedulerEnabled": false, "ServiceSchedulerEnabled": false }, "CreateIndex": 5, "ModifyIndex": 5 }, "Index": 5, "LastContact": 0, "KnownLeader": true}
Note that `BatchSchedulerEnabled` and `ServiceSchedulerEnabled` are both set to `false` by default.
Since you are preempting service jobs in this guide, you need to set `ServiceSchedulerEnabled` to `true`. Do this by directly interacting with the API.
Create the following JSON payload and place it in a file named `scheduler.json`:
{ "PreemptionConfig": { "SystemSchedulerEnabled": true, "BatchSchedulerEnabled": false, "ServiceSchedulerEnabled": true }}
Note that `ServiceSchedulerEnabled` has been set to `true`.
Run the following command to update the scheduler configuration:
```shell-session
$ curl --silent ${NOMAD_AUTH} \
    --request POST --data @scheduler.json \
    ${NOMAD_ADDR}/v1/operator/scheduler/configuration
```
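The update endpoint also supports check-and-set semantics through a `cas` query parameter, which is useful if you want to guard against concurrent configuration changes. The sketch below assumes the `ModifyIndex` of 5 from the earlier read:

```shell-session
# Optional check-and-set variant; 5 is the ModifyIndex returned by the earlier GET.
$ curl --silent ${NOMAD_AUTH} \
    --request POST --data @scheduler.json \
    "${NOMAD_ADDR}/v1/operator/scheduler/configuration?cas=5"
```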
You should now be able to inspect the scheduler configuration again and verify that preemption has been enabled for service jobs (output below is abbreviated):
```shell-session
$ curl --silent ${NOMAD_AUTH} \
    ${NOMAD_ADDR}/v1/operator/scheduler/configuration?pretty
```
... "PreemptionConfig": { "SystemSchedulerEnabled": true, "BatchSchedulerEnabled": false, "ServiceSchedulerEnabled": true },...
Try running the redis job again
Now that you have enabled preemption on service jobs, deploying your `redis` job should evict one of the lower-priority `webserver` allocations and place it into a queue. You can run `nomad job plan` to output a preview of what will happen:
```shell-session
$ nomad job plan redis.nomad.hcl
+ Job: "redis"
+ Task Group: "cache1" (1 create)
  + Task: "redis" (forces create)

Scheduler dry-run:
- All tasks successfully allocated.

Preemptions:

Alloc ID                              Job ID     Task Group
8a3d7e1e-40ee-f731-5135-247d8b7c2901  webserver  webserver
...
```
The preceding plan output shows that one of the `webserver` allocations will be evicted in order to place the requested `redis` instance.
Now use the `nomad job run` command to run the `redis.nomad.hcl` job file.
```shell-session
$ nomad job run redis.nomad.hcl
==> Monitoring evaluation "fef3654f"
    Evaluation triggered by job "redis"
    Evaluation within deployment: "37b37a63"
    Allocation "6ecc4bbe" created: node "cf8487e2", group "cache1"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "fef3654f" finished with status "complete"
```
Run the `nomad job status` command on the `webserver` job to verify that one of the allocations has been evicted.
```shell-session
$ nomad job status webserver
ID            = webserver
Name          = webserver
Submit Date   = 2021-02-11T19:18:29-05:00
Type          = service
Priority      = 40
...

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
webserver   1       0         2        0       1         0

Placement Failure
Task Group "webserver":
  * Resources exhausted on 3 nodes
  * Dimension "memory" exhausted on 3 nodes
...

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created     Modified
0850e103  cf8487e2  webserver   0        evict    complete  21m17s ago  18s ago
551a7283  ad10ba3b  webserver   0        run      running   21m17s ago  20m55s ago
8a3d7e1e  18997de9  webserver   0        run      running   21m17s ago  20m57s ago
```
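To dig into the preempted allocation itself, pass its ID to the `nomad alloc status` command; the output includes its desired status ("evict") and recent task events.

```shell-session
$ nomad alloc status 0850e103
```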
Stop the job
Use the `nomad job stop` command on the `redis` job. This will provide the capacity necessary to unblock the third `webserver` allocation.
```shell-session
$ nomad job stop redis
==> Monitoring evaluation "df368cb1"
    Evaluation triggered by job "redis"
    Evaluation within deployment: "37b37a63"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "df368cb1" finished with status "complete"
```
Run the `nomad job status` command on the `webserver` job. The output should now indicate that a new third allocation was created to replace the one that was preempted.
```shell-session
$ nomad job status webserver
ID            = webserver
Name          = webserver
Submit Date   = 2021-02-11T19:18:29-05:00
Type          = service
Priority      = 40
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
webserver   0       0         3        0       1         0

Latest Deployment
ID          = 278b2e10
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
webserver   3        3       3        0          2021-02-12T00:28:51Z

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created     Modified
4e212aec  cf8487e2  webserver   0        run      running   21s ago     3s ago
0850e103  cf8487e2  webserver   0        evict    complete  22m48s ago  1m49s ago
551a7283  ad10ba3b  webserver   0        run      running   22m48s ago  22m26s ago
8a3d7e1e  18997de9  webserver   0        run      running   22m48s ago  22m28s ago
```
Next steps
The process you learned in this tutorial can be applied to batch jobs as well. Read more about preemption in the Nomad documentation.
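For example, enabling preemption for batch jobs follows the exact same pattern against the scheduler configuration endpoint, with `BatchSchedulerEnabled` flipped to `true`:

```json
{
  "PreemptionConfig": {
    "SystemSchedulerEnabled": true,
    "BatchSchedulerEnabled": true,
    "ServiceSchedulerEnabled": true
  }
}
```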