Define restart behavior in your jobs
By default, Nomad attempts to restart a failed task locally, on the node where it is running or scheduled to run. The defaults vary by the scheduler type in use for the job: system, service, or batch.
To customize this behavior, annotate the task group with the restart stanza. Nomad restarts the failed task up to the number of times set by the attempts parameter, within the window set by the interval parameter. Operators can also use the mode parameter to choose whether to keep attempting restarts on the same node or to fail the task so that it can be rescheduled on another node.
Setting mode to fail in the restart stanza is best practice, because it allows rescheduling to occur, potentially moving the task to another node.
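As a minimal sketch, a restart stanza along these lines might look like the following. The job and group names, the count, and the specific values (three attempts, a five-minute interval, a ten-second delay) are assumptions chosen for illustration, and the task definition is left out.

```hcl
job "demo" {
  datacenters = ["dc1"]
  type        = "service"

  group "demo" {
    count = 3 # assumed: three allocations of this group

    # Restart policy applied to the tasks in this group.
    restart {
      attempts = 3      # maximum restarts allowed within the interval
      interval = "5m"   # window in which restart attempts are counted
      delay    = "10s"  # wait before each restart; Nomad adds random jitter
      mode     = "fail" # once attempts are exhausted, fail instead of retrying locally
    }

    # Task definitions for the group go here (omitted in this sketch).
  }
}
```

With mode set to fail, exhausting the allowed attempts marks the allocation as failed, which is what makes it eligible for rescheduling on another node.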
The following CLI example shows job status and allocation status for a failed task that is being restarted by Nomad. Allocations are in the pending state while restarts are attempted. The Recent Events section in the CLI shows ongoing restart attempts.
$ nomad job status demo
ID            = demo
Name          = demo
Submit Date   = 2018-04-12T14:37:18-05:00
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
demo        0       3         0        0       0         0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
ce5bf1d1  8a184f31  demo        0        run      pending  27s ago  5s ago
d5dee7c8  8a184f31  demo        0        run      pending  27s ago  5s ago
ed815997  8a184f31  demo        0        run      pending  27s ago  5s ago
In the following example, the allocation ce5bf1d1 is restarted by Nomad approximately every ten seconds, with a small random jitter. It eventually reaches its limit of three attempts and transitions into a failed state, after which it becomes eligible for rescheduling.
$ nomad alloc status ce5bf1d1
ID                  = ce5bf1d1
Eval ID             = 64e45d11
Name                = demo.demo[1]
Node ID             = a0ccdd8b
Job ID              = demo
Job Version         = 0
Client Status       = failed
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created             = 56s ago
Modified            = 22s ago

Task "demo" is "dead"
Task Resources
CPU      Memory   Disk     Addresses
100 MHz  300 MiB  300 MiB

Task Events:
Started At     = 2018-04-12T22:29:08Z
Finished At    = 2018-04-12T22:29:08Z
Total Restarts = 3
Last Restart   = 2018-04-12T17:28:57-05:00

Recent Events:
Time                       Type            Description
2018-04-12T17:29:08-05:00  Not Restarting  Exceeded allowed attempts 3 in interval 5m0s and mode is "fail"
2018-04-12T17:29:08-05:00  Terminated      Exit Code: 127
2018-04-12T17:29:08-05:00  Started         Task started by client
2018-04-12T17:28:57-05:00  Restarting      Task restarting in 10.364602876s
2018-04-12T17:28:57-05:00  Terminated      Exit Code: 127
2018-04-12T17:28:57-05:00  Started         Task started by client
2018-04-12T17:28:47-05:00  Restarting      Task restarting in 10.666963769s
2018-04-12T17:28:47-05:00  Terminated      Exit Code: 127
2018-04-12T17:28:47-05:00  Started         Task started by client
2018-04-12T17:28:35-05:00  Restarting      Task restarting in 11.777324721s