Blue/Green and canary deployments
Sometimes rolling updates do not offer the required flexibility for updating an application in production. Organizations often prefer to put a "canary" build into production or use a technique known as a "blue/green" deployment to ensure a safe application rollout to production while minimizing downtime.
Blue/Green deployments
Blue/Green deployments have several other names including Red/Black or A/B, but the concept is generally the same. In a blue/green deployment, there are two application versions. Only one application version is active at a time, except during the transition phase from one version to the next. The term "active" tends to mean "receiving traffic" or "in service".
Imagine a hypothetical API server with five instances deployed to production at version 1.3, which you want to update safely to version 1.4. You would create five new instances at version 1.4 and, once they are operating correctly, promote them and take down the five instances running 1.3. In the event of failure, you can quickly roll back to 1.3.
To start, examine your job that is running in production:
job "docs" { # ... group "api" { count = 5 update { max_parallel = 1 canary = 5 min_healthy_time = "30s" healthy_deadline = "10m" auto_revert = true auto_promote = false } task "api-server" { driver = "docker" config { image = "api-server:1.3" } } }}
Notice that the job has an `update` stanza with the `canary` count equal to the desired count. This allows a Nomad job to model blue/green deployments. When you change the job to run the "api-server:1.4" image, Nomad will create five new allocations while leaving the original "api-server:1.3" allocations running.
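Because `auto_promote` is `false` here, promotion is a manual step, which you will perform later in this guide. If you would rather have Nomad promote the canaries on its own once they are all healthy, the `update` stanza's `auto_promote` parameter supports that; a minimal sketch of the variant:

```hcl
update {
  max_parallel     = 1
  canary           = 5
  min_healthy_time = "30s"
  healthy_deadline = "10m"
  auto_revert      = true
  # Promote the deployment automatically once every canary is healthy,
  # removing the manual "nomad deployment promote" step.
  auto_promote     = true
}
```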
Observe how this works by changing the image to run the new version:
```diff
@@ -2,6 +2,8 @@ job "docs" {
   group "api" {
     task "api-server" {
       config {
-        image = "api-server:1.3"
+        image = "api-server:1.4"
```
Next, save the modified jobspec with the new version of api-server to a file named docs.nomad.hcl, and plan the changes.
```shell-session
$ nomad job plan docs.nomad.hcl
+/- Job: "docs"
+/- Task Group: "api" (5 canary, 5 ignore)
  +/- Task: "api-server" (forces create/destroy update)
    +/- Config {
      +/- image: "api-server:1.3" => "api-server:1.4"
        }

Scheduler dry-run:
- All tasks successfully allocated.

Job Modify Index: 7
To submit the job with version verification run:

nomad job run -check-index 7 docs.nomad.hcl

When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
```
Run the changes.
```shell-session
$ nomad job run docs.nomad.hcl
# ...
```
The plan output states that Nomad will create five canaries running the "api-server:1.4" image and ignore all the allocations running the older image. Now, if you examine the status of the job, you will note that both the blue ("api-server:1.3") and green ("api-server:1.4") sets are running.
```shell-session
$ nomad status docs
ID            = docs
Name          = docs
Submit Date   = 07/26/17 19:57:47 UTC
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
api         0       0         10       0       0         0

Latest Deployment
ID          = 32a080c1
Status      = running
Description = Deployment is running but requires manual promotion

Deployed
Task Group  Auto Revert  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy
api         true         false     5        5         5       5        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created At
6d8eec42  087852e2  api         1        run      running  07/26/17 19:57:47 UTC
7051480e  087852e2  api         1        run      running  07/26/17 19:57:47 UTC
36c6610f  087852e2  api         1        run      running  07/26/17 19:57:47 UTC
410ba474  087852e2  api         1        run      running  07/26/17 19:57:47 UTC
85662a7a  087852e2  api         1        run      running  07/26/17 19:57:47 UTC
3ac3fe05  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
4bd51979  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
2998387b  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
35b813ee  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
b53b4289  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
```
Now that the new version is running in production, you can route traffic to it and validate that it is working properly. If so, you would promote the deployment and Nomad would stop allocations running the older version. If not, you would either troubleshoot one of the running containers or destroy the new containers by failing the deployment.
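How you shift traffic is up to your load balancer or service mesh, but if the group registers a Consul service, the `service` stanza's `canary_tags` parameter lets canary allocations advertise different tags than the stable set, which a tag-aware load balancer can key on. A minimal sketch; the service name, tag values, and the "http" port label are illustrative and assume a matching `network` stanza:

```hcl
service {
  name = "api"
  port = "http" # assumes a port labeled "http" in the group's network stanza

  # Stable allocations advertise "live"; canary allocations advertise
  # "canary" instead, so a load balancer can route traffic selectively.
  tags        = ["live"]
  canary_tags = ["canary"]
}
```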
Promote the deployment
After deploying the new image alongside the old version, you have determined it is functioning properly and want to transition fully to the new version. Doing so is as simple as promoting the deployment:
```shell-session
$ nomad deployment promote 32a080c1
==> Monitoring evaluation "61ac2be5"
    Evaluation triggered by job "docs"
    Evaluation within deployment: "32a080c1"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "61ac2be5" finished with status "complete"
```
If you inspect the job's status, you can observe that after promotion, Nomad stopped the older allocations and is running only the new ones. This completes the blue/green deployment.
```shell-session
$ nomad status docs
ID            = docs
Name          = docs
Submit Date   = 07/26/17 19:57:47 UTC
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
api         0       0         5        0       5         0

Latest Deployment
ID          = 32a080c1
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Auto Revert  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy
api         true         true      5        5         5       5        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created At
6d8eec42  087852e2  api         1        run      running   07/26/17 19:57:47 UTC
7051480e  087852e2  api         1        run      running   07/26/17 19:57:47 UTC
36c6610f  087852e2  api         1        run      running   07/26/17 19:57:47 UTC
410ba474  087852e2  api         1        run      running   07/26/17 19:57:47 UTC
85662a7a  087852e2  api         1        run      running   07/26/17 19:57:47 UTC
3ac3fe05  087852e2  api         0        stop     complete  07/26/17 19:53:56 UTC
4bd51979  087852e2  api         0        stop     complete  07/26/17 19:53:56 UTC
2998387b  087852e2  api         0        stop     complete  07/26/17 19:53:56 UTC
35b813ee  087852e2  api         0        stop     complete  07/26/17 19:53:56 UTC
b53b4289  087852e2  api         0        stop     complete  07/26/17 19:53:56 UTC
```
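If you prefer to inspect the deployment record itself rather than the full job status, the `nomad deployment status` command reports the same promotion state (output elided here):

```shell-session
$ nomad deployment status 32a080c1
# ...
```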
Fail a deployment
After deploying the new image alongside the old version, you have determined it is not functioning properly and want to roll back to the old version. Doing so is as simple as failing the deployment:
```shell-session
$ nomad deployment fail 32a080c1
Deployment "32a080c1-de5a-a4e7-0218-521d8344c328" failed. Auto-reverted to job version 0.

==> Monitoring evaluation "6840f512"
    Evaluation triggered by job "docs"
    Evaluation within deployment: "32a080c1"
    Allocation "0ccb732f" modified: node "36e7a123", group "api"
    Allocation "64d4f282" modified: node "36e7a123", group "api"
    Allocation "664e33c7" modified: node "36e7a123", group "api"
    Allocation "a4cb6a4b" modified: node "36e7a123", group "api"
    Allocation "fdd73bdd" modified: node "36e7a123", group "api"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "6840f512" finished with status "complete"
```
After failing the deployment, check the job's status. Confirm that Nomad has stopped the new allocations and is only running the old ones, and that the working copy of the job has reverted to the original specification running "api-server:1.3".
```shell-session
$ nomad status docs
ID            = docs
Name          = docs
Submit Date   = 07/26/17 19:57:47 UTC
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
api         0       0         5        0       5         0

Latest Deployment
ID          = 6f3f84b3
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Auto Revert  Desired  Placed  Healthy  Unhealthy
api         true         5        5       5        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created At
27dc2a42  36e7a123  api         1        stop     complete  07/26/17 20:07:31 UTC
5b7d34bb  36e7a123  api         1        stop     complete  07/26/17 20:07:31 UTC
983b487d  36e7a123  api         1        stop     complete  07/26/17 20:07:31 UTC
d1cbf45a  36e7a123  api         1        stop     complete  07/26/17 20:07:31 UTC
d6b46def  36e7a123  api         1        stop     complete  07/26/17 20:07:31 UTC
0ccb732f  36e7a123  api         2        run      running   07/26/17 20:06:29 UTC
64d4f282  36e7a123  api         2        run      running   07/26/17 20:06:29 UTC
664e33c7  36e7a123  api         2        run      running   07/26/17 20:06:29 UTC
a4cb6a4b  36e7a123  api         2        run      running   07/26/17 20:06:29 UTC
fdd73bdd  36e7a123  api         2        run      running   07/26/17 20:06:29 UTC
```
```shell-session
$ nomad job deployments docs
ID        Job ID  Job Version  Status      Description
6f3f84b3  docs    2            successful  Deployment completed successfully
32a080c1  docs    1            failed      Deployment marked as failed - rolling back to job version 0
c4c16494  docs    0            successful  Deployment completed successfully
```
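The job's version history tells the same story from the jobspec side. The `nomad job history` command lists each version, and its `-p` flag displays the diff between versions (output elided here):

```shell-session
$ nomad job history -p docs
# ...
```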
Deploy with canaries
Canary updates are a useful way to test a new version of a job before beginning a rolling update. The `update` stanza supports setting the number of canaries the job operator would like Nomad to create when the job changes, via the `canary` parameter. When the job specification is updated, Nomad creates the canaries without stopping any allocations from the previous job.
This pattern allows operators to achieve higher confidence in the new job version because they can route traffic to it, examine its logs, and so on, to determine that the new application is performing properly.
job "docs" { # ... group "api" { count = 5 update { max_parallel = 1 canary = 1 min_healthy_time = "30s" healthy_deadline = "10m" auto_revert = true auto_promote = false } task "api-server" { driver = "docker" config { image = "api-server:1.3" } } }}
In the example above, the `update` stanza tells Nomad to create a single canary when the job specification is changed.
Observe how this behaves by changing the image to run the new version:
```diff
@@ -2,6 +2,8 @@ job "docs" {
   group "api" {
     task "api-server" {
       config {
-        image = "api-server:1.3"
+        image = "api-server:1.4"
```
Next, plan these changes.
```shell-session
$ nomad job plan docs.nomad.hcl
+/- Job: "docs"
+/- Task Group: "api" (1 canary, 5 ignore)
  +/- Task: "api-server" (forces create/destroy update)
    +/- Config {
      +/- image: "api-server:1.3" => "api-server:1.4"
        }

Scheduler dry-run:
- All tasks successfully allocated.

Job Modify Index: 7
To submit the job with version verification run:

nomad job run -check-index 7 docs.nomad.hcl

When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
```
Run the changes.
```shell-session
$ nomad job run docs.nomad.hcl
# ...
```
The plan output notes that Nomad will create one canary running the "api-server:1.4" image and ignore all the allocations running the older image. After running the job, the `nomad status` command output shows that the canary is running alongside the older version of the job:
```shell-session
$ nomad status docs
ID            = docs
Name          = docs
Submit Date   = 07/26/17 19:57:47 UTC
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
api         0       0         6        0       0         0

Latest Deployment
ID          = 32a080c1
Status      = running
Description = Deployment is running but requires manual promotion

Deployed
Task Group  Auto Revert  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy
api         true         false     5        1         1       1        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created At
85662a7a  087852e2  api         1        run      running  07/26/17 19:57:47 UTC
3ac3fe05  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
4bd51979  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
2998387b  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
35b813ee  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
b53b4289  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
```
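Before promoting, you may want to check the canary's behavior directly. A minimal sketch using `nomad alloc logs` with the canary allocation ID from the status output above; the task name `api-server` comes from the jobspec:

```shell-session
# Fetch the canary's stdout logs (add -stderr for the error stream)
$ nomad alloc logs 85662a7a api-server
```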
Now if you promote the canary, this will trigger a rolling update to replace the remaining allocations running the older image. The rolling update will happen at a rate of `max_parallel`, so in this case, one allocation at a time:
```shell-session
$ nomad deployment promote ed28f6c2
==> Monitoring evaluation "37033151"
    Evaluation triggered by job "docs"
    Evaluation within deployment: "ed28f6c2"
    Allocation "f5057465" created: node "f6646949", group "api"
    Allocation "f5057465" status changed: "pending" -> "running"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "37033151" finished with status "complete"
```
Check the status.
```shell-session
$ nomad status docs
ID            = docs
Name          = docs
Submit Date   = 07/26/17 20:28:59 UTC
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
api         0       0         5        0       2         0

Latest Deployment
ID          = ed28f6c2
Status      = running
Description = Deployment is running

Deployed
Task Group  Auto Revert  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy
api         true         true      5        1         2       1        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created At
f5057465  f6646949  api         1        run      running   07/26/17 20:29:23 UTC
b1c88d20  f6646949  api         1        run      running   07/26/17 20:28:59 UTC
1140bacf  f6646949  api         0        run      running   07/26/17 20:28:37 UTC
1958a34a  f6646949  api         0        run      running   07/26/17 20:28:37 UTC
4bda385a  f6646949  api         0        run      running   07/26/17 20:28:37 UTC
62d96f06  f6646949  api         0        stop     complete  07/26/17 20:28:37 UTC
f58abbb2  f6646949  api         0        stop     complete  07/26/17 20:28:37 UTC
```
Alternatively, if the canary was not performing properly, you could abandon the change using the `nomad deployment fail` command, similar to the blue/green example.