Rolling updates
Nomad supports rolling updates as a first-class feature. To enable rolling updates, a job or task group is annotated with a high-level description of the update strategy using the update stanza. Under the hood, Nomad handles limiting parallelism, interfacing with Consul to determine service health, and even automatically reverting to an older, healthy job when a deployment fails.
Enable rolling updates for job
Rolling updates are enabled by adding the update stanza to the job specification. The update stanza may be placed at the job level or in an individual task group. When placed at the job level, the update strategy is inherited by all task groups in the job. When placed at both the job and group level, the update stanzas are merged, with group stanzas taking precedence over job-level stanzas (a short sketch of this layering follows the example below). There is more information about inheritance in the update stanza documentation.
job "geo-api-server" { # ... group "api-server" { count = 6 # Add an update stanza to enable rolling updates of the service update { max_parallel = 2 min_healthy_time = "30s" healthy_deadline = "10m" } task "server" { driver = "docker" config { image = "geo-api-server:0.1" } # ... } }}
In this example, by adding the simple update stanza to the "api-server" task group, you inform Nomad that updates to the group should be handled with a rolling update strategy.
Thus, when a change is made to the job file that requires new allocations, Nomad will deploy 2 allocations at a time and require that those allocations be running in a healthy state for 30 seconds before deploying more instances of the new group.
By default, Nomad determines allocation health by ensuring that all tasks in the group are running and that any service checks the tasks register are passing.
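As described above, the update stanza can also be set at the job level and then refined per task group. The following is a minimal sketch of that layering with illustrative values (it is not part of the example job above): the group-level stanza is merged with the job-level one, so only the parameters it sets are overridden.

```hcl
job "geo-api-server" {
  # Job-level defaults, inherited by every task group in the job.
  update {
    max_parallel     = 1
    min_healthy_time = "30s"
    healthy_deadline = "10m"
  }

  group "api-server" {
    # Merged with the job-level stanza: only max_parallel is overridden here;
    # min_healthy_time and healthy_deadline are inherited from the job level.
    update {
      max_parallel = 2
    }
    # ...
  }
}
```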
Check the planned changes
Suppose you make a change to the job file to update the version of the Docker container that is configured with the same rolling update strategy from above.
```diff
@@ -2,6 +2,8 @@ job "geo-api-server" {
   group "api-server" {
     task "server" {
       driver = "docker"
       config {
-        image = "geo-api-server:0.1"
+        image = "geo-api-server:0.2"
```
The nomad job plan
command allows you to visualize the series of steps the
scheduler would perform. You can analyze this output to confirm it is correct:
```shell-session
$ nomad job plan geo-api-server.nomad.hcl
+/- Job: "geo-api-server"
+/- Task Group: "api-server" (2 create/destroy update, 4 ignore)
  +/- Task: "server" (forces create/destroy update)
    +/- Config {
      +/- image: "geo-api-server:0.1" => "geo-api-server:0.2"
        }

Scheduler dry-run:
- All tasks successfully allocated.

Job Modify Index: 7
To submit the job with version verification run:

nomad job run -check-index 7 geo-api-server.nomad.hcl

When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
```
Here you can observe that Nomad will begin the rolling update by creating and destroying two allocations first and, for the time being, ignoring four of the old allocations, consistent with the configured max_parallel count.
Inspect a deployment
After running the plan, you can submit the updated job with nomad job run.
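For example, using the modify index reported by the plan output above to guard against concurrent job modifications:

```shell-session
$ nomad job run -check-index 7 geo-api-server.nomad.hcl
```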
Once the job is submitted, Nomad will begin the rolling update of the service by placing two allocations of the new job version at a time and stopping two of the old allocations. You can inspect the current state of a rolling deployment using nomad status:
```shell-session
$ nomad status geo-api-server
ID            = geo-api-server
Name          = geo-api-server
Submit Date   = 07/26/17 18:08:56 UTC
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
api-server  0       0         6        0       4         0

Latest Deployment
ID          = c5b34665
Status      = running
Description = Deployment is running

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy
api-server  6        4       2        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created At
14d288e8  f7b1ee08  api-server  1        run      running   07/26/17 18:09:17 UTC
a134f73c  f7b1ee08  api-server  1        run      running   07/26/17 18:09:17 UTC
a2574bb6  f7b1ee08  api-server  1        run      running   07/26/17 18:08:56 UTC
496e7aa2  f7b1ee08  api-server  1        run      running   07/26/17 18:08:56 UTC
9fc96fcc  f7b1ee08  api-server  0        run      running   07/26/17 18:04:30 UTC
2521c47a  f7b1ee08  api-server  0        run      running   07/26/17 18:04:30 UTC
6b794fcb  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
9bc11bd7  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
691eea24  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
af115865  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
```
The output indicates that Nomad has created a deployment to conduct the rolling update from job version 0 to 1. It has placed four instances of the new job version and has stopped four of the old instances. Consult the list of allocations, and note that Nomad has placed four instances of job version 1 but only considers two of them healthy. This is because the two newest placed allocations haven't been healthy for the required 30 seconds yet.
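If you want to follow the deployment itself rather than the job, you can also pass the deployment ID from the output above to the nomad deployment status command:

```shell-session
$ nomad deployment status c5b34665
```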
Wait for the deployment to complete, and then re-issue nomad status. You will receive output similar to the following:
```shell-session
$ nomad status geo-api-server
ID            = geo-api-server
Name          = geo-api-server
Submit Date   = 07/26/17 18:08:56 UTC
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
api-server  0       0         6        0       6         0

Latest Deployment
ID          = c5b34665
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy
api-server  6        6       6        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created At
d42a1656  f7b1ee08  api-server  1        run      running   07/26/17 18:10:10 UTC
401daaf9  f7b1ee08  api-server  1        run      running   07/26/17 18:10:00 UTC
14d288e8  f7b1ee08  api-server  1        run      running   07/26/17 18:09:17 UTC
a134f73c  f7b1ee08  api-server  1        run      running   07/26/17 18:09:17 UTC
a2574bb6  f7b1ee08  api-server  1        run      running   07/26/17 18:08:56 UTC
496e7aa2  f7b1ee08  api-server  1        run      running   07/26/17 18:08:56 UTC
9fc96fcc  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
2521c47a  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
6b794fcb  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
9bc11bd7  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
691eea24  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
af115865  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
```
Nomad has successfully transitioned the group to running the updated job version and did so with no downtime to your service by ensuring that only two allocations were changed at a time and that the newly placed allocations ran successfully. Had any of the newly placed allocations failed their health check, Nomad would have aborted the deployment and stopped placing new allocations. If configured, Nomad can automatically revert back to the old job definition when the deployment fails.
Use auto-revert on failed deployments
If you do a deployment in which the new allocations are unhealthy, Nomad will fail the deployment and stop placing new instances of the job. It optionally supports automatically reverting back to the last stable job version on deployment failure. Nomad keeps a history of submitted jobs and whether each job version was stable. A job is considered stable if all its allocations are healthy.
To enable this, add the auto_revert parameter to the update stanza:
```hcl
update {
  max_parallel     = 2
  min_healthy_time = "30s"
  healthy_deadline = "10m"

  # Enable automatically reverting to the last stable job on a failed
  # deployment.
  auto_revert = true
}
```
Now imagine you want to update your image to "geo-api-server:0.3", but you instead mistakenly set it to the version below and run the job:
```diff
@@ -2,6 +2,8 @@ job "geo-api-server" {
   group "api-server" {
     task "server" {
       driver = "docker"
       config {
-        image = "geo-api-server:0.2"
+        image = "geo-api-server:0.33"
```
Running nomad job deployments will show that the deployment failed and that Nomad auto-reverted to the last stable job:
```shell-session
$ nomad job deployments geo-api-server
ID        Job ID          Job Version  Status      Description
0c6f87a5  geo-api-server  3            successful  Deployment completed successfully
b1712b7f  geo-api-server  2            failed      Failed due to unhealthy allocations - rolling back to job version 1
3eee83ce  geo-api-server  1            successful  Deployment completed successfully
72813fcf  geo-api-server  0            successful  Deployment completed successfully
```
Nomad job versions increment monotonically, so even though Nomad reverted to the job specification at version 1, the revert created a new job version. You can observe the differences between a job's versions and how Nomad auto-reverted the job using the job history command:
```shell-session
$ nomad job history -p geo-api-server
Version     = 3
Stable      = true
Submit Date = 07/26/17 18:44:18 UTC
Diff        =
+/- Job: "geo-api-server"
+/- Task Group: "api-server"
  +/- Task: "server"
    +/- Config {
      +/- image: "geo-api-server:0.33" => "geo-api-server:0.2"
        }

Version     = 2
Stable      = false
Submit Date = 07/26/17 18:45:21 UTC
Diff        =
+/- Job: "geo-api-server"
+/- Task Group: "api-server"
  +/- Task: "server"
    +/- Config {
      +/- image: "geo-api-server:0.2" => "geo-api-server:0.33"
        }

Version     = 1
Stable      = true
Submit Date = 07/26/17 18:44:18 UTC
Diff        =
+/- Job: "geo-api-server"
+/- Task Group: "api-server"
  +/- Task: "server"
    +/- Config {
      +/- image: "geo-api-server:0.1" => "geo-api-server:0.2"
        }

Version     = 0
Stable      = true
Submit Date = 07/26/17 18:43:43 UTC
```
This output describes the process of a reverted deployment. Starting at the end of the output and working backwards, Nomad shows that version 0 was submitted. Next, version 1 was an image change from 0.1 to 0.2 of geo-api-server and was flagged as stable. Version 2 of the job attempted to update geo-api-server from 0.2 to 0.33; however, the deployment failed and never became stable. Finally, version 3 of the job was created when Nomad automatically reverted the failed deployment and redeployed the last healthy version, geo-api-server 0.2.
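The rollback in this example happened automatically because auto_revert was set. If it were not, you could achieve the same result manually with the nomad job revert command, passing the version you want to restore (version 1, the last stable version, in this example):

```shell-session
$ nomad job revert geo-api-server 1
```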