Disaster Recovery for Consul on Kubernetes
Disaster recovery planning is an essential element of developing any business continuity plan. This document provides you with the information you need to design a disaster recovery plan that will allow you to recover from a primary datacenter loss or outage when running Consul on Kubernetes, and is intended for operators that are managing either single datacenters or multi-datacenter federations. The tutorial assumes you are operating a fully secured Consul on Kubernetes datacenter, which is the default when installing Consul on Kubernetes using the official Consul Helm chart. In this scenario, you will have TLS, ACLs, and gossip encryption enabled.
In this tutorial you will:
- Review the essential data and secrets you must back up and secure in order to recover from a datacenter loss or lengthy cloud provider outage
- Optionally, set up a lab environment to practice performing a recovery
- Review the manual recovery steps you will take to recover a lost primary datacenter
To complete the optional lab for this tutorial you will need:
Plan for disaster
To recover a Consul on Kubernetes primary datacenter from a disaster or during a long term outage you will need:
- A recent backup of Consul's internal state store
- A current backup of the Consul secrets
Snapshots
Consul refers to a point in time backup of its internal state store as a snapshot.
The Consul CLI consul snapshot save command exports a backup on demand. Consul Enterprise users can instead run consul snapshot agent as a daemon process that periodically saves snapshots.
In either scenario, it is your responsibility to both automate periodic snapshots and export the snapshot(s) to some form of long term storage that can survive the loss of the Kubernetes cluster.
With a valid snapshot, you can use the consul snapshot restore command to restore a newly created Consul datacenter to the last known good state of your lost datacenter.
Similar to any other database recovery operation, without a valid backup you cannot recover from a disaster or long term outage.
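One way to approach the periodic snapshot described above is a small script run on a schedule. The sketch below is an illustration only. It assumes a server pod named consul-server-0 (the name used in the lab later in this tutorial), a CONSUL_TOKEN environment variable holding a token allowed to take snapshots, and a local directory standing in for long term storage. The function names and the file naming scheme are hypothetical and not part of Consul.

```shell
#!/bin/sh
# Sketch: save a timestamped Consul snapshot and copy it out of the cluster.
# Assumptions (hypothetical): pod name, file naming scheme, and backup directory.

# Build a snapshot file name from a datacenter name and a UTC timestamp.
snapshot_name() {
  printf 'consul-%s-%s.snap\n' "$1" "$2"
}

# Save a snapshot inside the server pod, then copy it to a local directory.
# Usage: take_snapshot <datacenter> <backup-dir>
take_snapshot() {
  dc=$1
  backup_dir=$2
  file=$(snapshot_name "$dc" "$(date -u +%Y%m%dT%H%M%SZ)")
  kubectl exec consul-server-0 -- consul snapshot save -token "$CONSUL_TOKEN" "/tmp/$file" &&
    kubectl cp "consul-server-0:tmp/$file" "$backup_dir/$file"
}

# Example (not run here): take_snapshot dc1 ./dc1/backup
```

A timestamp in the file name keeps older snapshots available in case the most recent one turns out to be unusable. Copying the file to storage outside the cluster remains a separate step that depends on your environment.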
Secrets
There are four essential secrets that you must back up and secure in order to recover from the loss of your secured primary Consul datacenter.
- The last active Consul ACL bootstrap token
- The last active Consul CA cert
- The last active Consul CA key
- The last active gossip encryption key
Without access to these four secrets you cannot recover from a disaster or long term outage. It is your responsibility to manage these secrets in some form of long term storage external to the Kubernetes secrets engine, so that they can survive the loss of the Kubernetes cluster.
Do not forget to update these values and take a new snapshot backup whenever you rotate your secrets. We recommend that you automate the secrets rotation process, and include a backup to an external secrets management solution as part of that automation.
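As a pre-flight check for a backup job, you might confirm that all four secrets exist before exporting them. The sketch below assumes the Kubernetes secret names used in the lab later in this tutorial, which follow a Helm release named consul. The function name is hypothetical, so adjust the list if your release uses different names.

```shell
#!/bin/sh
# Sketch: report which of the four essential secrets are missing from a list.
# Assumption: the secret names follow the Helm release name "consul" used in the lab.
REQUIRED_SECRETS="consul-bootstrap-acl-token consul-ca-cert consul-ca-key consul-gossip-encryption-key"

# Reads secret names on stdin, one per line; prints each required name that is absent.
missing_secrets() {
  present=$(cat)
  for name in $REQUIRED_SECRETS; do
    printf '%s\n' "$present" | grep -qxF -- "$name" || printf '%s\n' "$name"
  done
}

# Example (not run here), checking the live cluster:
#   kubectl get secrets -o name | sed 's|^secret/||' | missing_secrets
```

Empty output means every required secret was found. Any name printed is one you would not be able to restore later.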
Setup lab environment (optional)
In this next section, you will create a lab environment that you can use to practice the steps necessary to perform a primary datacenter recovery. If you do not wish or need to create a lab environment, feel free to skip ahead to the recovery steps section later in the document.
Note
You will need to make use of multiple terminal session windows, and you will set session specific environment variables in your primary terminal session window. Unless otherwise instructed, you will be using this primary session window so that the environment variables will be available to the sample code.
Clone repository
We have provided the following Git repository, which contains Terraform files to set up an AWS environment, but these instructions should work for any Kubernetes distribution or cloud platform.
Clone the repository.
$ git clone https://github.com/hashicorp/learn-consul-kubernetes
Change directory into the newly cloned repository. This must be your working directory for the rest of the tutorial.
$ cd learn-consul-kubernetes/disaster-recovery
Check out the specific tag of the repository tested with this tutorial.
$ git checkout tags/v0.0.5
Initialize Kubernetes
Note
This tutorial assumes you have both the AWS CLI and AWS IAM Authenticator installed, and that you have currently authenticated using the AWS CLI. Setting up the optional lab on AWS will result in additional charges.
Issue the following command to initialize the dc1 terraform working directory.
$ terraform -chdir=dc1 init
Initializing modules...
Downloading terraform-aws-modules/eks/aws 13.2.1 for dc1.eks...
...TRUNCATED...
If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.
Apply the terraform configuration for the Kubernetes cluster that will host the primary Consul datacenter.
$ terraform -chdir=dc1 apply -auto-approve
module.dc1.module.eks.aws_iam_role.cluster[0]: Creating...
...TRUNCATED...
Apply complete! Resources: 42 added, 0 changed, 0 destroyed.
Issue the following command to initialize the dc2 terraform working directory.
$ terraform -chdir=dc2 init
Initializing modules...
Downloading terraform-aws-modules/eks/aws 13.2.1 for dc2.eks...
...TRUNCATED...
If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.
Apply the terraform configuration for the Kubernetes cluster that will host the secondary Consul datacenter.
$ terraform -chdir=dc2 apply -auto-approve
module.dc2.module.eks.aws_iam_role.cluster[0]: Creating...
...TRUNCATED...
Apply complete! Resources: 42 added, 0 changed, 0 destroyed.
Set the KUBECONFIG environment variable from your primary terminal session window to the output from the terraform module. If you change to a new shell session this value will not be available.
$ export KUBECONFIG=~/.kube/dc1:~/.kube/dc2
EKS will create custom cluster names. Run the following script to modify your KUBECONFIG so that the cluster names conform to the names used by the tutorial.
$ sh ./eks-init.sh
Context "eks_dc1" renamed to "dc1".
Context "eks_dc2" renamed to "dc2".
Switched to context "dc1".
Configure Vault
To recover from the loss of your primary datacenter, you must store your Consul secrets in some secure location that will survive the loss of the Kubernetes cluster. While you may use any secrets management and CA provisioning strategy you like, this tutorial will use Vault as both the Consul CA as well as the external secrets storage engine. The Vault configuration in this tutorial is not valid for production. It is meant to provide an example of the different concerns you will need to address when designing your disaster recovery plan. For more information on operating Vault in production, refer to the Vault area of the HashiCorp Learn platform.
In a new terminal session window start a Vault server in dev mode. This will be a long-running session that you should leave open until the end of the tutorial. To simplify the instructions, you will launch Vault with a root token set to "education".
$ vault server -dev -dev-root-token-id="education"
...TRUNCATED...
You may need to set the following environment variable:
    $ export VAULT_ADDR='http://127.0.0.1:8200'
The unseal key and root token are displayed below in case you want to
seal/unseal the Vault or re-authenticate.
Unseal Key: <generated-unseal-key>
Root Token: education
Development mode should NOT be used in production installations!
The Vault output instructs you to set the VAULT_ADDR environment variable to localhost; however, your Kubernetes environment won't be able to reach that address. Instead, you will use ngrok, a secure tunnelling solution that allows you to create a secure URL to your localhost server.
In another new terminal session window, start ngrok and instruct it to expose port 8200 using the http protocol. This will also be a long-running session that you should not terminate until the end of the tutorial.
$ ngrok http 8200
ngrok by @inconshreveable                          (Ctrl+C to quit)
Session Status    online
Account           Derek Strickland (Plan: Free)
Version           2.3.35
Region            United States (us)
Web Interface     http://127.0.0.1:4040
Forwarding        http://<generated-subdomain>.ngrok.io -> http://localhost:8200
Forwarding        https://<generated-subdomain>.ngrok.io -> http://localhost:8200
Connections       ttl     opn     rt1     rt5     p50     p90
                  0       0       0.00    0.00    0.00    0.00
From your primary terminal session window, set the VAULT_ADDR environment variable to the URL ngrok created for you in the previous step. Replace the <generated-subdomain> text with the value output by ngrok.
$ export VAULT_ADDR="https://<generated-subdomain>.ngrok.io"
From your primary terminal session window, set the VAULT_TOKEN environment variable to the root token value you passed to the vault server -dev command.
$ export VAULT_TOKEN="education"
Unseal Vault using the unseal key included in the output from the vault server -dev command.
$ vault operator unseal <generated-unseal-key>
Key             Value
---             -----
Seal Type       shamir
Initialized     true
Sealed          false
Total Shares    1
Threshold       1
Version         1.6.1
Storage Type    inmem
Cluster Name    vault-cluster-b8860bf8
Cluster ID      27c6d3f8-554e-9dac-6f7f-bdf7ccafd270
HA Enabled      false
Enable Vault's secrets engine.
$ vault secrets enable -version=2 kv
Success! Enabled the kv secrets engine at: kv/
The Vault lab setup is now complete and can be used as both a CA for Consul, as well as for storage of secrets. Again, this configuration is not appropriate for production environments, but can provide inspiration for how you might design your own secrets management solution.
Configure the service mesh
Install and configure the primary Consul Kubernetes datacenter using the following script. Make sure to use your primary terminal session window where you set the Vault environment variables, and to pass the arguments in the specified order.
$ sh ./dc1/dc1-init.sh $VAULT_ADDR $VAULT_TOKEN
Hang tight while we grab the latest from your chart repositories...
...TRUNCATED...
  $ helm status consul
  $ helm get all consul
Deploy the example application backend services to the primary datacenter.
$ sh ./dc1/dc1-deployment.sh
service/postgres created
...TRUNCATED...
deployment.apps/product-api created
Created: product-api => postgres (allow)
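Export the Consul ACL bootstrap token to the CONSUL_TOKEN environment variable. You will need this throughout the tutorial. Make sure to set this in your primary session window. The base64 -D flag in the command below is the macOS form; with GNU coreutils on Linux, use base64 -d instead.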
$ export CONSUL_TOKEN=$(kubectl get secrets/consul-bootstrap-acl-token --template={{.data.token}} | base64 -D)
Install and configure the secondary Consul Kubernetes datacenter using the following command. Make sure to use your primary session window where you set the Vault environment variables, and to pass the arguments in the specified order.
$ sh ./dc2/dc2-init.sh $VAULT_ADDR $VAULT_TOKEN
Switched to context "dc1".
Switched to context "dc2".
secret/consul-federation created
...TRUNCATED...
  $ helm status consul
  $ helm get all consul
Deploy the example application frontend services to the secondary datacenter.
$ sh ./dc2/dc2-deployment.sh $CONSUL_TOKEN
Switched to context "dc2".
service/public-api created
serviceaccount/public-api created
servicedefaults.consul.hashicorp.com/public-api created
deployment.apps/public-api created
service/frontend created
serviceaccount/frontend created
servicedefaults.consul.hashicorp.com/frontend created
configmap/nginx-configmap created
deployment.apps/frontend created
Created: frontend => public-api (allow)
Created: public-api => product-api (allow)
Switched to context "dc1".
When these commands complete, you will have a four tier app deployed across two federated datacenters. The primary datacenter will have a Postgres and REST API pod. The secondary datacenter will have a frontend pod, and a public API pod. The public API will have the REST API defined as an upstream, and will be configured to communicate with it over the WAN using a mesh gateway.
Validate federation
Next you will issue a series of commands to validate that the lab setup worked as planned, and that the federated datacenters are able to communicate with each other.
Issue the following command from the primary terminal session window to exec into the consul-server statefulset and inspect the results of the consul members -wan command. All servers in both datacenters should be returned with a status of alive.
$ kubectl exec statefulset/consul-server -- consul members -wan
Node                 Address          Status  Type    Build   Protocol  DC   Segment
consul-server-0.dc1  10.0.4.73:8302   alive   server  1.10.0  2         dc1  <all>
consul-server-0.dc2  10.0.5.209:8302  alive   server  1.10.0  2         dc2  <all>
consul-server-1.dc1  10.0.5.53:8302   alive   server  1.10.0  2         dc1  <all>
consul-server-1.dc2  10.0.6.170:8302  alive   server  1.10.0  2         dc2  <all>
consul-server-2.dc1  10.0.6.106:8302  alive   server  1.10.0  2         dc1  <all>
consul-server-2.dc2  10.0.4.17:8302   alive   server  1.10.0  2         dc2  <all>
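If you want to script this check rather than read the table by eye, a small filter can list any member that is not alive. This is a sketch that assumes the column layout shown above (Node, Address, Status, and so on, with one header row). The function name is hypothetical.

```shell
#!/bin/sh
# Sketch: list WAN members whose status is not "alive".
# Reads `consul members -wan` output on stdin; assumes the Status value is the
# third whitespace-separated column and that the first line is a header.
not_alive() {
  awk 'NR > 1 && $3 != "alive" { print $1 }'
}

# Example (not run here):
#   kubectl exec statefulset/consul-server -- consul members -wan | not_alive
```

Empty output means every server reported alive.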
Issue the following command to exec into the consul-server statefulset and inspect the results of the consul catalog services command. All services should be returned.
$ kubectl exec statefulset/consul-server -- consul catalog services -datacenter dc1 \
    && kubectl exec statefulset/consul-server -- consul catalog services -datacenter dc2
Example output:
consul
mesh-gateway
postgres
postgres-sidecar-proxy
product-api
product-api-sidecar-proxy
consul
frontend
frontend-sidecar-proxy
mesh-gateway
public-api
public-api-sidecar-proxy
Switch your kubectl context to dc2.
$ kubectl config use-context dc2
Switched to context "dc2".
Issue the following command in the primary terminal session window to start a tunnel from the local development host to the application UI pod so that you can validate that the services are able to communicate across the federation.
$ kubectl port-forward deploy/frontend 8080:80
Forwarding from 127.0.0.1:8080 -> 80
Forwarding from [::1]:8080 -> 80
Handling connection for 8080
Open localhost:8080 in a new browser tab to view the example application UI. If the page displays without error, it proves that communication is occurring between the federated datacenters across the WAN using the configured mesh gateways.
Enter CTRL-C to close the tunnel.
Backup primary and export snapshot
Now you will take steps to ensure you have the necessary Consul state backup and secrets exported from the primary datacenter. This emulates what your automated backup and secrets management processes should be handling to ensure that you are always in a recoverable state.
From the primary datacenter, exec into the Consul server, and use consul snapshot save to back up the current state of the Consul datacenter.
Switch your kubectl context to dc1.
$ kubectl config use-context dc1
Switched to context "dc1".
Save a snapshot to the tmp directory in the pod.
$ kubectl exec consul-server-0 -- consul snapshot save -token $CONSUL_TOKEN /tmp/backup.snap
Saved and verified snapshot to index 5866
Use kubectl cp to export a copy of the snapshot to your local development host. In a production scenario, this backup should be stored in some form of long term storage that is itself backed up. You should include testing your backups and disaster recovery steps as part of your disaster recovery plan.
$ kubectl cp consul-server-0:tmp/backup.snap ./dc1/backup/backup.snap
Export secrets to Vault
Now you will export the required secrets from the Kubernetes secrets engine into the Vault secrets engine for external storage.
Use kubectl get secret to export the secrets that will be required to restore the Consul datacenter. Each line pipes them directly to vault kv put so that the secrets are never stored to disk. If you are not using Vault for external storage, you will have to adapt these scripts to fit your situation. If your implementation includes temporarily exporting the secrets to disk as part of the process, don't forget to delete the secrets from disk.
$ kubectl get secret consul-bootstrap-acl-token -o yaml | vault kv put secret/consul-recovery/consul-bootstrap-acl-token value=- \
    && kubectl get secret consul-ca-cert -o yaml | vault kv put secret/consul-recovery/consul-ca-cert value=- \
    && kubectl get secret consul-ca-key -o yaml | vault kv put secret/consul-recovery/consul-ca-key value=- \
    && kubectl get secret consul-gossip-encryption-key -o yaml | vault kv put secret/consul-recovery/consul-gossip-encryption-key value=- \
    && kubectl get secret vault-config -o yaml | vault kv put secret/consul-recovery/vault-config value=-
Example output:
Key              Value
---              -----
...TRUNCATED...
created_time     2021-01-19T14:15:52.969276Z
deletion_time    n/a
destroyed        false
version          1
Notice that in addition to the four secrets listed previously, the Vault CA configuration is also being exported. This lab requires it because it uses Vault as a CA, and the tutorial references it several times, but it is not required if you use Consul as your CA. If you use any 3rd party CA, Vault or otherwise, you must ensure you back up the connect.ca_config stanza you provided to Consul during the Helm install. The dc1/dc1-init.sh script generated this configuration and secret for this tutorial, and the dc1/dc1-values.yaml file configured the Consul datacenter to use that secret. Review those files for an example of how you could configure your own 3rd party CA.
Simulate loss of primary datacenter
To simulate the loss of your primary datacenter, you will delete the primary datacenter using the platform specific instructions below.
The Consul Helm chart deploys a load balancer to support the mesh gateway that was configured during the Consul installation to enable multi-dc federation. Since you did not create this resource using terraform, terraform is not aware of it. You must perform a helm uninstall before you issue the terraform destroy command to delete the primary datacenter.
$ helm uninstall consul
release "consul" uninstalled
Use kubectl get svc to see if the LoadBalancer still exists.
$ kubectl get svc
NAME                  TYPE           CLUSTER-IP       EXTERNAL-IP                                                               PORT(S)         AGE
consul-mesh-gateway   LoadBalancer   172.20.244.35    a5bf7ac338ad04facb4047c9af3b0126-1412956651.us-east-2.elb.amazonaws.com   443:31523/TCP   10m
kubernetes            ClusterIP      172.20.0.1       <none>                                                                    443/TCP         21m
postgres              ClusterIP      172.20.14.172    <none>                                                                    5432/TCP        7m
product-api           ClusterIP      172.20.128.119   <none>                                                                    9090/TCP        6m58s
Keep checking until the consul-mesh-gateway service is gone.
$ kubectl get svc
NAME          TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
kubernetes    ClusterIP   172.20.0.1       <none>        443/TCP    21m
postgres      ClusterIP   172.20.14.172    <none>        5432/TCP   7m
product-api   ClusterIP   172.20.128.119   <none>        9090/TCP   6m58s
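Rather than re-running the command by hand, you could poll until the service disappears. The sketch below is one possible approach. The function names and the 10 second interval are arbitrary choices for illustration. Note that an unreachable cluster also produces no matching row, so it would be treated as "gone".

```shell
#!/bin/sh
# Sketch: poll until a Kubernetes service no longer appears in `kubectl get svc`.

# Reads `kubectl get svc` output on stdin; succeeds if the named service is listed
# in the first column.
service_present() {
  awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# Poll every 10 seconds (an arbitrary interval) until the service is gone.
wait_for_service_removal() {
  while kubectl get svc | service_present "$1"; do
    echo "Waiting for $1 to be removed..."
    sleep 10
  done
}

# Example (not run here): wait_for_service_removal consul-mesh-gateway
```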
Now, use terraform to destroy the dc1 Kubernetes cluster in EKS.
$ terraform -chdir=./dc1 destroy -auto-approve
...TRUNCATED...
Destroy complete! Resources: 42 destroyed.
At this point, the optional lab setup is complete, and you can proceed with the recovery process.
Recovery steps
The remainder of the tutorial outlines the manual recovery steps you will take to restore service. If you skipped the optional lab setup, this section assumes you are starting with the necessary data and secrets backups and a functional secondary datacenter. It also assumes your primary datacenter is offline and completely unavailable.
Create new primary
If your primary datacenter is lost or experiencing a long term outage, you will need to create a new Kubernetes cluster to host your primary datacenter, and then install and configure Consul on that cluster.
To continue working with the lab environment, issue the following command to initialize the new-dc1 terraform working directory.
$ terraform -chdir=new-dc1 init
Initializing modules...
...TRUNCATED...
If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.
If you are following along with the lab, apply the terraform configuration for the new Kubernetes cluster that will host the new primary Consul datacenter.
$ terraform -chdir=new-dc1 apply -auto-approve
module.new-dc1.module.eks.aws_iam_role.cluster[0]: Creating...
...TRUNCATED...
Apply complete! Resources: 42 added, 0 changed, 0 destroyed.
Reset the KUBECONFIG environment variable to merge the contents of the new dc1 Kubernetes cluster configuration.
$ export KUBECONFIG=~/.kube/dc1:~/.kube/dc2
Rename your EKS cluster so that it will match the generic name used throughout this tutorial.
$ kubectl config rename-context eks_dc1 dc1
Context "eks_dc1" renamed to "dc1".
Recover secrets
Now that you have created a new Kubernetes cluster to host your primary datacenter, you will load the secrets from your offline secrets management solution into the Kubernetes secrets engine running in your cluster. If you are using something other than Vault for your external secrets management solution, you will need to adapt the example instructions to fit your scenario. When you adapt to your scenario, the outcome must be that you create this set of secrets with these names in your new cluster's Kubernetes secrets engine, and you must ensure the proper values are set from whatever external store you used.
Switch your kubectl context to your new primary datacenter.
$ kubectl config use-context dc1
Switched to context "dc1".
Earlier in the tutorial you exported secrets stored in the Kubernetes secrets engine to the Vault secrets engine. Now, you will reverse that process by exporting the secrets from the Vault secrets engine and importing them into the Kubernetes runtime secrets engine.
$ vault kv get -field=value secret/consul-recovery/consul-bootstrap-acl-token | kubectl apply -f- \
    && vault kv get -field=value secret/consul-recovery/consul-ca-cert | kubectl apply -f- \
    && vault kv get -field=value secret/consul-recovery/consul-ca-key | kubectl apply -f- \
    && vault kv get -field=value secret/consul-recovery/consul-gossip-encryption-key | kubectl apply -f- \
    && vault kv get -field=value secret/consul-recovery/vault-config | kubectl apply -f-
Example output:
secret/consul-bootstrap-acl-token created
secret/consul-ca-cert created
secret/consul-ca-key created
secret/consul-gossip-encryption-key created
secret/vault-config created
The secrets are now loaded into the Kubernetes runtime secrets engine, and are ready to be consumed by the Consul Helm chart during the installation of Consul to the new primary datacenter.
Install Consul to the new primary datacenter
Now you will install Consul to the newly created Kubernetes cluster. During this installation it is important that the Helm values file be configured with the following configuration.
- The global.tls.caCert and global.tls.caKey entries are set to reference the secrets you have restored to the Kubernetes secrets engine
- ACLs must be disabled by setting both global.acls.manageSystemACLs and global.acls.createReplicationToken to false
- The global.acls.bootstrapToken.secretName must be set to reference the secret you have restored to the Kubernetes secrets engine; although ACLs are currently disabled, this will be used in the next step
We have provided the new-dc1-values-step1.yaml file that is configured correctly for this phase of the recovery process, and can be used as a reference.
Install the Consul Helm chart using the new-dc1-values-step1.yaml file in the new-dc1 folder to install Consul with this initial configuration.
$ helm install consul hashicorp/consul -f ./new-dc1/new-dc1-values-step1.yaml --version "0.32.0" --wait
NAME: consul
...TRUNCATED...
helm status consul
helm get all consul
Restore snapshot
Now that you have loaded the secrets and installed Consul, you will restore the snapshot backup to the new primary datacenter dc1.
Copy the backup file to a server in the new dc1 datacenter.
$ kubectl cp ./dc1/backup/backup.snap consul-server-0:tmp/backup.snap
Exec into the server and use the consul snapshot restore command to restore Consul's internal state to the backup taken before the disaster occurred.
$ kubectl exec consul-server-0 -- consul snapshot restore /tmp/backup.snap
Restored snapshot
After this completes, you should observe EnsureRegistration failed errors in the logs similar to what is shown below.
$ kubectl logs consul-server-0 | grep error
...TRUNCATED...
2021-01-18T12:07:56.248Z [WARN] agent.fsm: EnsureRegistration failed: error="failed inserting node: Error while renaming Node ID: "bce8158a-6c72-8ccd-68ed-7268a5272276": Node name consul-server-1 is reserved by node d72c9202-3a87-6a68-b2e5-80a09063e9e8 with name consul-server-1 (10.0.5.53)"
2021-01-18T12:08:16.199Z [WARN] agent.fsm: EnsureRegistration failed: error="failed inserting node: Error while renaming Node ID: "e5a31be1-9d85-ed2f-2cc7-271e6aa8cb87": Node name consul-server-0 is reserved by node ca1fda52-6468-0c81-8e87-619837afcc9e with name consul-server-0 (10.0.4.73)"
2021-01-18T12:08:16.207Z [WARN] agent.fsm: EnsureRegistration failed: error="failed inserting node: Error while renaming Node ID: "d1428d4c-ec80-3254-d5e2-bafaa6154280": Node name consul-server-2 is reserved by node 17f1d108-dd34-cca6-f6e3-6b2024625d9a with name consul-server-2 (10.0.6.106)"
...TRUNCATED...
To resolve these errors, you need to perform a consul leave on each server. When each server restarts, the node ID issue for the servers will be resolved. This can be done most quickly via kubectl exec. You could issue a kubectl rollout restart statefulset/consul-server command as well, but the kubectl exec method is faster because Kubernetes doesn't need to re-attach the persistent volume. Whichever option you choose, this may take a few minutes.
$ kubectl exec consul-server-0 -- consul leave && sleep 2 && kubectl rollout status statefulset/consul-server --watch \
    && kubectl exec consul-server-1 -- consul leave && sleep 2 && kubectl rollout status statefulset/consul-server --watch \
    && kubectl exec consul-server-2 -- consul leave && sleep 2 && kubectl rollout status statefulset/consul-server --watch
Example output:
Graceful leave complete
Waiting for 1 pods to be ready...
partitioned roll out complete: 3 new pods have been updated...
Graceful leave complete
Waiting for 1 pods to be ready...
partitioned roll out complete: 3 new pods have been updated...
Graceful leave complete
Waiting for 1 pods to be ready...
Waiting for 2 pods to be ready...
Waiting for 3 pods to be ready...
Waiting for 2 pods to be ready...
Waiting for 1 pods to be ready...
partitioned roll out complete: 3 new pods have been updated...
Check the logs to ensure the node id issue has been resolved for all servers.
$ kubectl logs consul-server-0
If the EnsureRegistration failed errors have not resolved for the servers after the restart, perform the restart on each server again until they do. Repeat the log check above for every server in the dc1 datacenter to ensure the errors have resolved.
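The restart-and-check cycle above can be wrapped in a few helper functions so it is easy to repeat. This is a sketch only. It assumes three server pods named consul-server-0 through consul-server-2, as in the lab, and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch: restart each server with `consul leave`, then count node ID errors in the logs.
# Assumption: server pods are named consul-server-0 .. consul-server-(N-1), as in the lab.

# Print the server pod names for a given replica count.
server_pods() {
  i=0
  while [ "$i" -lt "$1" ]; do
    printf 'consul-server-%s\n' "$i"
    i=$((i + 1))
  done
}

# Reads log text on stdin; prints how many lines report the registration error.
registration_errors() {
  grep -c 'EnsureRegistration failed' || true
}

# Leave on each server in turn, waiting for the statefulset to settle in between.
restart_servers() {
  for pod in $(server_pods "$1"); do
    kubectl exec "$pod" -- consul leave &&
      sleep 2 &&
      kubectl rollout status statefulset/consul-server --watch || return 1
  done
}

# Example (not run here):
#   restart_servers 3
#   kubectl logs consul-server-0 | registration_errors
```

A count of 0 means the restarted server no longer logs the error. A nonzero count does not say which node an entry refers to, so read the matching lines before deciding whether another restart is needed.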
Note
You may still see errors for the client agents. This is to be expected and does not mean the previous steps have not completed successfully.
Enable ACLs
Once you have restored the backup and performed a consul leave on each server, it is time to enable ACLs by upgrading the Consul Helm installation using an updated Helm values file. The Consul Helm values file should be modified to set both global.acls.manageSystemACLs and global.acls.createReplicationToken to true.
We have provided the ./new-dc1/new-dc1-values-step2.yaml file, which is properly configured and can be used as a reference.
Issue the following command to start the Consul Helm upgrade.
$ helm upgrade consul hashicorp/consul -f ./new-dc1/new-dc1-values-step2.yaml --wait
NAME: consul
...TRUNCATED...
helm status consul
helm get all consul
When the Helm upgrade completes, you will now observe blocked by ACLs log entries on the servers.
$ kubectl logs consul-server-0
...TRUNCATED...
2021-01-18T12:38:03.480Z [WARN] agent: Coordinate update blocked by ACLs: accessorID=00000000-0000-0000-0000-000000000002
2021-01-18T12:38:16.282Z [INFO] agent.server.gateway_locator: new cached locations of mesh gateways: primary=[ac1afc99b9a6a4a3096c69fbe1256a11-1212886934.us-west-2.elb.amazonaws.com:443] local=[10.0.4.252:8443]
2021-01-18T12:38:23.698Z [WARN] agent: Coordinate update blocked by ACLs: accessorID=00000000-0000-0000-0000-000000000002
2021-01-18T12:38:47.633Z [WARN] agent: Coordinate update blocked by ACLs: accessorID=00000000-0000-0000-0000-000000000002
2021-01-18T12:38:52.290Z [WARN] agent: Node info update blocked by ACLs: node=e5a31be1-9d85-ed2f-2cc7-271e6aa8cb87 accessorID=00000000-0000-0000-0000-000000000002
2021-01-18T12:39:09.400Z [WARN] agent: Coordinate update blocked by ACLs: accessorID=00000000-0000-0000-0000-000000000002
...TRUNCATED...
To resolve the ACL errors, you will have to:
- Perform a consul leave again on each server
- Set the ACL token on each server
Restart servers
Since the ACL config is set in the server configmap, it won't take effect until the Consul servers restart. Use consul leave again on each server to apply the new configmap.
$ kubectl exec consul-server-0 -- consul leave -token $CONSUL_TOKEN && sleep 2 && kubectl rollout status statefulset/consul-server --watch \
    && kubectl exec consul-server-1 -- consul leave -token $CONSUL_TOKEN && sleep 2 && kubectl rollout status statefulset/consul-server --watch \
    && kubectl exec consul-server-2 -- consul leave -token $CONSUL_TOKEN && sleep 2 && kubectl rollout status statefulset/consul-server --watch
Example output:
Graceful leave complete
Waiting for 1 pods to be ready...
Waiting for 2 pods to be ready...
Waiting for 1 pods to be ready...
partitioned roll out complete: 3 new pods have been updated...
Graceful leave complete
Waiting for 1 pods to be ready...
partitioned roll out complete: 3 new pods have been updated...
Graceful leave complete
Waiting for 1 pods to be ready...
partitioned roll out complete: 3 new pods have been updated...
Set server ACL tokens
The next task is to set ACL tokens for each of the servers. The servers will currently still be logging ACL errors. That is because the Consul state store currently has the tokens from the restored snapshot backup, but each server has a newly created ACL token that was generated during the Helm upgrade.
Review the logs to observe the ACL errors.
$ kubectl logs consul-server-0 | grep ACL \
    && kubectl logs consul-server-1 | grep ACL \
    && kubectl logs consul-server-2 | grep ACL
Example output:
[WARN] agent: Coordinate update blocked by ACLs: accessorID=00000000-0000-0000-0000-000000000002
[WARN] agent: Coordinate update blocked by ACLs: accessorID=00000000-0000-0000-0000-000000000002
[WARN] agent: Coordinate update blocked by ACLs: accessorID=00000000-0000-0000-0000-000000000002
[WARN] agent: Coordinate update blocked by ACLs: accessorID=00000000-0000-0000-0000-000000000002
[WARN] agent: Node info update blocked by ACLs: node=d1428d4c-ec80-3254-d5e2-bafaa6154280 accessorID=00000000-0000-0000-0000-000000000002
You must retrieve the new token from each server agent, and then set it on each server. This is a multi-step process that must be performed on each server. The multiple steps are as follows:
- Use the Consul bootstrap token to retrieve the server token AccessorID for each server using the consul acl token list command
- Pass the server token AccessorID to the consul acl token read command to retrieve the SecretID from the restored Consul control plane; this is the ACL token you need for each new server
- Use the Consul bootstrap token and the SecretID you've retrieved for each server to set the agent token using the consul acl set-agent-token command on each server
Use consul acl token list to retrieve the AccessorID for each server. Make note of the AccessorIDs in the output somewhere as you will need this information to perform the next step.
$ kubectl exec consul-server-0 -- consul acl token list -token $CONSUL_TOKEN | grep consul-server -B 1
AccessorID: 8c6da5b2-a9af-205c-7779-924548f37a27
Description: Server Token for consul-server-1.consul-server.default.svc
--
AccessorID: e054b1ce-474c-6479-bd4a-e2f895448167
Description: Server Token for consul-server-2.consul-server.default.svc
--
AccessorID: 53c70d15-304d-e30a-7a66-9ff67df621e2
Description: Server Token for consul-server-0.consul-server.default.svc
Retrieve the SecretID, which is the ACL token, for each server using the
consul acl token read command. Provide the AccessorID that belongs to each
server in the corresponding command.
$ kubectl exec consul-server-0 -- consul acl token read -token $CONSUL_TOKEN -id <consul-server-0-token-AccessorID> | grep SecretID -A 1 -B 1 \
  && kubectl exec consul-server-1 -- consul acl token read -token $CONSUL_TOKEN -id <consul-server-1-token-AccessorID> | grep SecretID -A 1 -B 1 \
  && kubectl exec consul-server-2 -- consul acl token read -token $CONSUL_TOKEN -id <consul-server-2-token-AccessorID> | grep SecretID -A 1 -B 1
Example output:
AccessorID: 53c70d15-304d-e30a-7a66-9ff67df621e2
SecretID: c220a977-f682-d7fc-27c9-f3902e55d207
Description: Server Token for consul-server-0.consul-server.default.svc
AccessorID: 8c6da5b2-a9af-205c-7779-924548f37a27
SecretID: 260cc041-4d34-07ed-b245-cabce2eb3965
Description: Server Token for consul-server-1.consul-server.default.svc
AccessorID: e054b1ce-474c-6479-bd4a-e2f895448167
SecretID: 88ebd1fe-43f9-824f-1776-d2164f39667f
Description: Server Token for consul-server-2.consul-server.default.svc
Use the consul acl set-agent-token
command to set the current agent
token on each server to the SecretID, or token, you retrieved in the previous
step.
$ kubectl exec consul-server-0 -- consul acl set-agent-token -token $CONSUL_TOKEN agent <consul-server-0-token-SecretID> \
  && kubectl exec consul-server-1 -- consul acl set-agent-token -token $CONSUL_TOKEN agent <consul-server-1-token-SecretID> \
  && kubectl exec consul-server-2 -- consul acl set-agent-token -token $CONSUL_TOKEN agent <consul-server-2-token-SecretID>
Example output:
ACL token "agent" set successfully
ACL token "agent" set successfully
ACL token "agent" set successfully
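The read and set steps can also be combined per server so that the SecretID never has to be copied by hand. The following is a sketch only, assuming CONSUL_TOKEN holds the bootstrap token and that consul acl token read prints a SecretID line as in the example output above. The function name set_server_agent_token is hypothetical.

```shell
# Hypothetical helper: read one server's SecretID and set it as that
# server's agent token. $1 is the pod name, $2 is its token AccessorID.
# Assumes CONSUL_TOKEN holds the bootstrap token.
set_server_agent_token() {
  pod="$1"
  accessor_id="$2"

  # Pull the SecretID out of the `consul acl token read` output.
  secret_id=$(kubectl exec "$pod" -- consul acl token read \
    -token "$CONSUL_TOKEN" -id "$accessor_id" |
    awk '$1 == "SecretID:" { print $2; exit }')

  # Stop rather than set an empty token if the lookup failed.
  if [ -z "$secret_id" ]; then
    echo "no SecretID found for $pod" >&2
    return 1
  fi

  kubectl exec "$pod" -- consul acl set-agent-token \
    -token "$CONSUL_TOKEN" agent "$secret_id"
}
```

You would call it once per server, for example set_server_agent_token consul-server-0 <consul-server-0-token-AccessorID>. Keeping the token in a shell variable rather than on screen also avoids leaving SecretIDs in terminal scrollback.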
Review the logs on each server to ensure all the ACL errors have been resolved.
$ kubectl logs consul-server-0
2021-01-18T13:12:39.535Z [WARN] agent: Coordinate update blocked by ACLs: accessorID=00000000-0000-0000-0000-000000000002
2021-01-18T13:13:04.835Z [WARN] agent: Coordinate update blocked by ACLs: accessorID=00000000-0000-0000-0000-000000000002
2021-01-18T13:13:18.051Z [WARN] agent: Node info update blocked by ACLs: node=bce8158a-6c72-8ccd-68ed-7268a5272276 accessorID=00000000-0000-0000-0000-000000000002
2021-01-18T13:13:21.739Z [INFO] agent: Updated agent's ACL token: token=agent
2021-01-18T13:13:22.985Z [INFO] agent: Synced node info
Notice that the ACL errors stop after the node info is synced. Repeat this step for each server to ensure all ACL errors have resolved.
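Scanning long logs by eye can miss a late warning. As a rough sketch, assuming the log messages look like the example output above, the following function counts the ACL warnings that appear after the most recent token update. The function name acl_warnings_after_update is hypothetical.

```shell
# Hypothetical helper: count "blocked by ACLs" warnings logged after the
# most recent "Updated agent's ACL token" line. Reads server logs on stdin.
acl_warnings_after_update() {
  awk '
    # Each token update resets the counter (the dot matches the apostrophe).
    /Updated agent.s ACL token/ { count = 0; seen = 1; next }
    seen && /blocked by ACLs/   { count++ }
    END { print (seen ? count : "token-not-updated") }
  '
}
```

Piping kubectl logs consul-server-0 into acl_warnings_after_update prints 0 for the example log above, because no warning follows the token update. It prints token-not-updated if the update line never appears in the log.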
Synchronize the secondary datacenter
At this point the secondary datacenter has some stale information that needs to
be synchronized with the primary datacenter. Specifically, the consul-federation
secret has a gateway address for a gateway that no longer exists. Synchronizing
the secondary datacenter is a multi-step process. To synchronize the datacenters,
you must:
- Extract the new consul-federation secret from the new primary datacenter
- Delete the stale consul-federation secret from the secondary datacenter
- Apply the consul-federation secret from the new primary datacenter to the secondary datacenter
- Perform a helm upgrade to apply the new secrets to the Helm installation
- Restart the servers in the secondary datacenter
Once you finish this process, the federation will be completely restored and operable.
Export the consul-federation secret from the new primary datacenter.
$ kubectl get secret consul-federation -o yaml > ./new-dc1/consul-federation-secret.yaml
Switch your kubectl context to dc2.
$ kubectl config use-context dc2
Switched to context "dc2".
Delete the existing stale consul-federation
secret from the secondary datacenter.
$ kubectl delete secret/consul-federation
Add the current secret that you just exported from the new primary datacenter.
$ kubectl apply -f ./new-dc1/consul-federation-secret.yaml
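Before continuing, you may want to confirm that the secret in the secondary datacenter now points at the new gateway. The sketch below assumes that the consul-federation secret stores the primary datacenter settings as base64-encoded JSON under a key named serverConfigJSON. Verify that key name against your own secret, since it is an assumption here and not something this tutorial states. The function name federation_server_config is hypothetical.

```shell
# Hypothetical helper: print the decoded server configuration stored in the
# consul-federation secret of the current kubectl context.
# Assumes the secret keeps it under the serverConfigJSON key.
federation_server_config() {
  kubectl get secret consul-federation \
    -o jsonpath='{.data.serverConfigJSON}' | base64 --decode
}
```

Run federation_server_config in the dc2 context and check that the gateway address it prints belongs to the new primary datacenter and not to the lost one.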
Perform a helm upgrade in the secondary datacenter.
$ helm upgrade consul hashicorp/consul
Note: Depending on your machine load, the Helm upgrade may time out before
it completes. If it does, use kubectl get pods --watch to observe the pods.
Wait until the consul-acl-init-cleanup job finishes before proceeding.
Finally, use consul leave one more time to restart the servers in the secondary
datacenter.
$ kubectl exec consul-server-0 -- consul leave -token $CONSUL_TOKEN && sleep 2 && kubectl rollout status statefulset/consul-server --watch \
  && kubectl exec consul-server-1 -- consul leave -token $CONSUL_TOKEN && sleep 2 && kubectl rollout status statefulset/consul-server --watch \
  && kubectl exec consul-server-2 -- consul leave -token $CONSUL_TOKEN && sleep 2 && kubectl rollout status statefulset/consul-server --watch
Example output:
Graceful leave complete
Waiting for 1 pods to be ready...
Waiting for 2 pods to be ready...
Waiting for 1 pods to be ready...
partitioned roll out complete: 3 new pods have been updated...
Graceful leave complete
Waiting for 1 pods to be ready...
partitioned roll out complete: 3 new pods have been updated...
Graceful leave complete
Waiting for 1 pods to be ready...
partitioned roll out complete: 3 new pods have been updated...
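The chained command above repeats the same three steps for every server. As a sketch, assuming CONSUL_TOKEN holds the bootstrap token, the same sequence can be written as a loop that stops at the first failure. The function name rolling_leave is hypothetical.

```shell
# Hypothetical helper: gracefully restart each server in turn, waiting for
# the statefulset rollout to settle before moving on to the next server.
# Assumes CONSUL_TOKEN holds the bootstrap token.
rolling_leave() {
  for pod in "$@"; do
    kubectl exec "$pod" -- consul leave -token "$CONSUL_TOKEN" || return 1
    sleep 2
    kubectl rollout status statefulset/consul-server --watch || return 1
  done
}
```

Calling rolling_leave consul-server-0 consul-server-1 consul-server-2 performs the same restarts as the chained command, one server at a time.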
Validate recovery
Now issue the same series of commands as you issued earlier to validate that the datacenter recovery worked as planned, and that the federated datacenters are able to communicate with each other.
Switch your kubectl context to dc1.
$ kubectl config use-context dc1
Switched to context "dc1".
Issue the following command to exec into the consul-server statefulset and inspect
the results of the consul members -wan
command. All servers in both datacenters
should be returned.
$ kubectl exec statefulset/consul-server -- consul members -wan
Node                 Address          Status  Type    Build   Protocol  DC   Segment
consul-server-0.dc1  10.0.4.73:8302   alive   server  1.10.0  2         dc1  <all>
consul-server-0.dc2  10.0.5.209:8302  alive   server  1.10.0  2         dc2  <all>
consul-server-1.dc1  10.0.5.53:8302   alive   server  1.10.0  2         dc1  <all>
consul-server-1.dc2  10.0.6.170:8302  alive   server  1.10.0  2         dc2  <all>
consul-server-2.dc1  10.0.6.106:8302  alive   server  1.10.0  2         dc1  <all>
consul-server-2.dc2  10.0.4.17:8302   alive   server  1.10.0  2         dc2  <all>
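To make this check repeatable, the member list can be summarized per datacenter. The sketch below assumes the column order shown in the example output above (Status in the third column, Type in the fourth, DC in the seventh). The function name wan_alive_by_dc is hypothetical.

```shell
# Hypothetical helper: count alive servers per datacenter in
# `consul members -wan` output read from stdin. Skips the header row.
wan_alive_by_dc() {
  awk 'NR > 1 && $3 == "alive" && $4 == "server" { n[$7]++ }
       END { for (dc in n) print dc, n[dc] }' | sort
}
```

Piping kubectl exec statefulset/consul-server -- consul members -wan into wan_alive_by_dc prints dc1 3 and dc2 3 for the example output above. A datacenter with no alive servers does not appear in the result at all.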
Deploy the example application backend services to the new primary datacenter.
$ kubectl apply -f ./dc1/postgres.yaml \
  && kubectl apply -f ./dc1/product-api.yaml
Example output:
service/postgres created
serviceaccount/postgres created
deployment.apps/postgres created
service/product-api created
serviceaccount/product-api created
servicedefaults.consul.hashicorp.com/product-api created
configmap/db-configmap created
deployment.apps/product-api created
Issue the following command to exec into the consul-server stateful set and inspect
the results of the consul catalog services
command for each datacenter. All
services from both datacenters should be returned.
$ kubectl exec statefulset/consul-server -- consul catalog services -datacenter dc1 \
  && kubectl exec statefulset/consul-server -- consul catalog services -datacenter dc2
Example output:
consul
mesh-gateway
postgres
postgres-sidecar-proxy
product-api
product-api-sidecar-proxy
consul
frontend
frontend-sidecar-proxy
mesh-gateway
public-api
public-api-sidecar-proxy
Switch your kubectl context to dc2.
$ kubectl config use-context dc2
Switched to context "dc2".
Issue the following command to start a tunnel from the local development host to the application UI pod so that you can validate that the services are able to communicate across the federation.
$ kubectl port-forward deploy/frontend 8080:80
Forwarding from 127.0.0.1:8080 -> 80
Forwarding from [::1]:8080 -> 80
Handling connection for 8080
Open localhost:8080 to view the example application UI. If the page displays, it proves that communication is occurring between the federated datacenters across the WAN using the configured mesh gateways.
Destroy the lab (optional)
If you chose to create a lab, destroy it now so that it consumes no further resources and, in the case of EKS, incurs no further charges.
Perform a helm uninstall to destroy the load balancer in dc2.
$ helm uninstall consul
Use kubectl get svc to see if the LoadBalancer still exists.
$ kubectl get svc
NAME                  TYPE           CLUSTER-IP       EXTERNAL-IP                                                               PORT(S)         AGE
consul-mesh-gateway   LoadBalancer   172.20.244.35    a5bf7ac338ad04facb4047c9af3b0126-1412956651.us-east-2.elb.amazonaws.com   443:31523/TCP   10m
kubernetes            ClusterIP      172.20.0.1       <none>                                                                    443/TCP         21m
postgres              ClusterIP      172.20.14.172    <none>                                                                    5432/TCP        7m
product-api           ClusterIP      172.20.128.119   <none>                                                                    9090/TCP        6m58s
Keep checking until the consul-mesh-gateway service is gone.
$ kubectl get svc
NAME          TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
kubernetes    ClusterIP   172.20.0.1       <none>        443/TCP    21m
postgres      ClusterIP   172.20.14.172    <none>        5432/TCP   7m
product-api   ClusterIP   172.20.128.119   <none>        9090/TCP   6m58s
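Instead of rerunning the command by hand, the check can be polled. This is a minimal sketch that relies on kubectl get svc with a service name exiting with a non-zero status once that service is not found. Note that it also exits non-zero on other failures, such as an unreachable cluster, so treat a quick success with some caution. The function name wait_for_service_gone and the five-second interval are arbitrary choices for illustration.

```shell
# Hypothetical helper: poll until the named service no longer exists in the
# current kubectl context, or give up after a number of attempts.
# $1 is the service name, $2 the maximum attempts (default 60).
wait_for_service_gone() {
  svc="$1"
  attempts="${2:-60}"
  while [ "$attempts" -gt 0 ]; do
    # `kubectl get svc NAME` exits non-zero once the service is not found.
    if ! kubectl get svc "$svc" >/dev/null 2>&1; then
      return 0
    fi
    attempts=$((attempts - 1))
    sleep 5
  done
  echo "service $svc still exists" >&2
  return 1
}
```

For example, wait_for_service_gone consul-mesh-gateway returns once the service has been removed, after which it is safe to continue with the cluster teardown.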
Use Terraform to destroy the dc2 Kubernetes cluster in EKS.
$ terraform -chdir=./dc2 destroy -auto-approve
...TRUNCATED...
Destroy complete! Resources: 42 destroyed.
Switch your kubectl context to dc1.
$ kubectl config use-context dc1
Switched to context "dc1".
Perform a helm uninstall to destroy the load balancer in dc1.
$ helm uninstall consul
Use kubectl get svc to see if the LoadBalancer still exists.
$ kubectl get svc
NAME                  TYPE           CLUSTER-IP       EXTERNAL-IP                                                               PORT(S)         AGE
consul-mesh-gateway   LoadBalancer   172.20.244.35    a5bf7ac338ad04facb4047c9af3b0126-1412956651.us-east-2.elb.amazonaws.com   443:31523/TCP   10m
kubernetes            ClusterIP      172.20.0.1       <none>                                                                    443/TCP         21m
postgres              ClusterIP      172.20.14.172    <none>                                                                    5432/TCP        7m
product-api           ClusterIP      172.20.128.119   <none>                                                                    9090/TCP        6m58s
Keep checking until the consul-mesh-gateway service is gone.
$ kubectl get svc
NAME          TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
kubernetes    ClusterIP   172.20.0.1       <none>        443/TCP    21m
postgres      ClusterIP   172.20.14.172    <none>        5432/TCP   7m
product-api   ClusterIP   172.20.128.119   <none>        9090/TCP   6m58s
Use Terraform to destroy the dc1 Kubernetes cluster in EKS.
$ terraform -chdir=./new-dc1 destroy -auto-approve
...TRUNCATED...
Destroy complete! Resources: 42 destroyed.
Next steps
This tutorial focused on providing you with the information you need to design a disaster recovery plan that will allow you to recover from a primary datacenter loss or outage when running Consul on Kubernetes.
Specifically, you:
- Reviewed the essential data and secrets you must back up and secure in order to recover from a datacenter loss or lengthy cloud provider outage
- Optionally, set up a lab environment to practice performing a recovery
- Reviewed the manual recovery steps you will take to recover a lost primary datacenter
Visit Backup Consul Data and State to learn more about Consul disaster recovery planning.
Visit Secure Consul with Vault Integrations to learn more ways you can integrate Consul with Vault.
Visit Deploy Consul and Vault on Kubernetes with Run Triggers to learn about using HCP Terraform to run Consul with Vault on Google Kubernetes Engine.