Disaster recovery for federated primary datacenter
Operating a multi-datacenter federated environment requires you to prepare for many possibilities, including a complete outage of one of your physical datacenters, or a cloud provider outage that might make one of the components of your environment temporarily or permanently unavailable.
This can have different implications depending on your configuration, but it is something you want to be prepared for, in case the unthinkable happens.
This tutorial guides you through the process of creating a recovery strategy from a quorum loss to the complete loss of your primary datacenter.
The recommended logical steps are listed in three main sections: the first helps you prepare for an outage event, the second helps you restore your primary datacenter, and the third helps you validate the restore and ensure federation is re-established.
Prerequisites
Base configuration
To complete this tutorial, you will need two wide area network (WAN) joined Consul datacenters with access control list (ACL) replication enabled. If you are starting from scratch, follow the relevant setup tutorials to deploy and federate your datacenters, or use them to check that you have the proper configuration in place.
You will also need to enable ACL replication, which you can do by following the steps in the ACL Replication for Multiple Datacenters tutorial.
Mesh gateways
If you want to make use of the mesh gateway functionality, you can follow Connect Services Across Datacenters with Mesh Gateways to set them up for the two datacenters.
Central service configuration
Lastly, you should set enable_central_service_config = true on your Consul clients, which will allow you to centrally configure the sidecar and mesh gateway proxies.
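For example, a client agent configuration file might include the following (the file name and datacenter value are illustrative assumptions):

# client.hcl -- minimal sketch of a client agent configuration
datacenter = "primary"
enable_central_service_config = true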
Disaster recovery process
The process can be summarized in the following phases:
- Formulate a disaster preparation strategy
- Outage event occurs for the primary datacenter
- Restore primary datacenter
- Restore and validate federation
Disaster preparation strategy
Use a recommended architecture
Not every outage has the same level of impact. A lot of the resiliency of your Consul federation will rely on proper configuration and the adoption of a recommended architecture.
Refer to Consul Reference Architecture to learn the recommended configurations for your Consul datacenter.
Automate your deployment
Re-deploying an entire datacenter after an outage is a non-trivial operation that might require a considerable amount of time. You can reduce the time to recover by following Infrastructure as Code (IaC) principles and using tools such as Terraform and Vault to help you in the deployment and recovery process.
The amount of downtime you experience from the loss of your primary datacenter is directly proportional to the amount of time it takes you to deploy a new primary datacenter. Reducing the deployment time via automation will not only help you standardize the process and reduce errors, but also reduce the downtime you will experience after a datacenter loss.
Have a backup strategy
A Consul datacenter's state is more than just the initial configuration; it includes data generated during normal operations, such as KV entries, ACL tokens, and intentions. When your datacenter fails, this information is lost and cannot be manually recreated without a backup. For example, you may keep your Consul agent and service configuration in version control; however, you will not be able to recover security credentials as easily, since the credentials generated when you redeploy are different from the ones that were present in the datacenter that was "lost" in the disaster.
Updating every secret in Consul to reflect the new values is tedious, error prone, and hard to debug.
Restoring from a snapshot ensures all the intentions, KV entries and ACL tokens are reintroduced.
You can follow Backup Consul Data and State to learn how to perform a snapshot of your Consul datacenter to use in case of disaster.
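For example, you can take a snapshot of the primary datacenter with the Consul CLI; the token value and file name below are placeholders:

$ consul snapshot save -token=<value> backup.snap

Running this command on a schedule keeps the snapshot you would restore from as recent as possible.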
Have a TLS certificate distribution process in place
Certificates are stored on the agent disk and not saved in a snapshot. This means you will have to re-generate them.
Consul comes equipped with a command, consul tls, that permits you to generate TLS certificates for the agents. This simplifies the automation of deployment by giving you the ability to generate a CA and TLS certificates as part of the process.
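For example, you can generate a CA and a server certificate for the primary datacenter with commands similar to the following (the datacenter name is an assumption based on this tutorial's naming):

$ consul tls ca create
$ consul tls cert create -server -dc primary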
As an alternative, we suggest you use Vault as a CA and TLS certificate generator to help you automate the process. You can follow Generate mTLS Certificates for Consul with Vault to learn how to automate certificate generation and distribution for your Consul server agents.
Adopt an adequate ACL down policy
When your primary datacenter is down you lose your ability to validate ACL policies. To mitigate this, Consul has a configuration parameter, down_policy, that tells Consul which strategy to follow if ACLs cannot be validated against the primary datacenter.
By default, Consul adopts the extend-cache approach, meaning that in case of an outage Consul will allow cached ACL objects to be used, ignoring their TTL values. If a non-cached ACL is used, extend-cache acts like deny.
If you changed the down_policy to the more restrictive value of deny, you will be impacted more severely by the outage, since all ACL-protected operations in the secondary datacenter will be denied until the primary datacenter is restored.
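A minimal sketch of the relevant agent configuration block follows; down_policy is shown with its default value, and the other ACL settings are illustrative assumptions for a secondary datacenter with token replication enabled:

acl {
  enabled                  = true
  default_policy           = "deny"
  down_policy              = "extend-cache"
  enable_token_replication = true
}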
Outage event in the primary datacenter
Depending on your configuration, multiple levels of outage can occur.
This tutorial covers how to handle outages that render the primary datacenter unable to serve requests, including service discovery requests to Consul DNS and intention validation. This can happen in many possible ways, but the two extremes of the spectrum are the following:
- Loss of quorum in the primary datacenter: an outage where the datacenter has less than (N/2)+1 servers available, where N is the total number of servers. While some of the nodes are unaffected, not enough are available to form a quorum.
- Complete loss of the primary datacenter: either a disaster event that completely wipes an entire facility or a major outage of your cloud provider.
Loss of quorum in the primary datacenter
In the scenario where you have lost enough servers in your primary datacenter that a quorum cannot be reached, you will experience the same level of service failure as if the whole datacenter were unavailable.
If the outage is limited to the server nodes, or did not severely impact your client fleet, it might not be practical to rebuild the whole datacenter only to re-establish a quorum.
In this scenario, you can follow the Outage Recovery tutorial to restore your server nodes and make sure they are able to re-form the Raft cluster and elect a leader.
Once the Raft cluster is re-formed and the datacenter is operational again, you can restart any client agents that might have been affected by the outage.
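Once the servers are back, you can confirm that the Raft cluster has re-formed by listing the peers (a token with sufficient privileges is required if ACLs are enforced):

$ consul operator raft list-peers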
Complete loss of the primary datacenter
The worst case scenario in a federated environment is the loss of the primary datacenter.
In this scenario, the course of action should aim to restore the lost datacenter as quickly as possible. The following sections in the tutorial will guide you through the necessary steps to perform a restore and to make sure all functionality is re-established.
Restore primary datacenter
The steps necessary to restore your environment after a full outage of your primary datacenter are the following:
- Restore your primary datacenter
- Restore the last snapshot to the newly recovered datacenter
- Apply agent tokens to the servers
- Perform a rolling restart of the servers in the primary datacenter
- Perform a rolling restart of the clients in the primary datacenter
- Restore and validate federation
Restore datacenter nodes
You can use the same process you used for the initial deployment of your Consul datacenter.
To make sure the new deployment is able to communicate with the secondary datacenter, you will need to reuse the following material from the pre-existing deployment:
- CA certificate
- CA key
- Gossip encryption key
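In terms of agent configuration, this means pointing the restored servers at the original key material, for example (the paths and key value are placeholders):

# Reuse the original gossip encryption key and CA material
encrypt   = "<original-gossip-encryption-key>"
ca_file   = "/etc/consul.d/certs/consul-agent-ca.pem"
cert_file = "/etc/consul.d/certs/primary-server-consul-0.pem"
key_file  = "/etc/consul.d/certs/primary-server-consul-0-key.pem"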
Restore snapshot
After the datacenter has been restored, you can restore the snapshot to it using the consul snapshot command.
$ consul snapshot restore -token=<value> backup.snap
Restored snapshot
Note
The token used for the snapshot restore procedure needs to be a token valid for the newly restored datacenter. If your restored datacenter does not have ACLs enabled, you can restore the snapshot without a token.
Set Consul ACL agent tokens
The newly restarted nodes will now be missing the ACL tokens needed to successfully join the datacenter. Once you have restored the ACL system from the snapshot, you can set the tokens for the different nodes using the consul acl command.
Note
The following operations require a token with acl = write privileges. Also, after the snapshot is restored, the ACL system contains the tokens that existed before the outage, so you can use any management token you had in your datacenter. To set up the token for the request, you can either use the CONSUL_HTTP_TOKEN environment variable or pass it directly using the -token= command parameter.
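For example, to make a management token available to the following commands through the environment variable (the value is a placeholder):

$ export CONSUL_HTTP_TOKEN=<your-management-token>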
First retrieve the tokens available in the datacenter.
$ consul acl token list
...
AccessorID:       694f15e2-b8f9-c5dd-cb92-8d7bc529df9f
Description:      server-1 agent token
Local:            false
Create Time:      2021-01-07 21:38:25.331288761 +0000 UTC
Legacy:           false
Policies:
   a660e45f-3f6e-501c-1303-3d39a83f6ff9 - acl-policy-server-node
...
Then retrieve the token using the AccessorID.
$ consul acl token read -id 694f15e2-b8f9-c5dd-cb92-8d7bc529df9f
AccessorID:       694f15e2-b8f9-c5dd-cb92-8d7bc529df9f
SecretID:         237f1a27-3399-ebf0-29f1-9828d89159b1
Description:      server-1 agent token
Local:            false
Create Time:      2021-01-07 21:38:25.331288761 +0000 UTC
Policies:
   a660e45f-3f6e-501c-1303-3d39a83f6ff9 - acl-policy-server-node
Note
You will have to set the token on all the server nodes in order for them to re-join the datacenter successfully. Depending on your configuration, you might have a different token for each server or reuse the same token for all server agents; in both cases, the token must be set on every agent, otherwise it will not be able to join the datacenter.
Finally apply the token to the server.
$ consul acl set-agent-token agent 237f1a27-3399-ebf0-29f1-9828d89159b1
ACL token "agent" set successfully
Perform a rolling restart of the servers
After the snapshot is restored and the tokens have been set on the nodes, you will observe errors in the server logs about duplicate node IDs. This happens because the servers received new node IDs when they were reinstalled, and these IDs are different from the ones stored in the snapshot.
...
[WARN] agent.fsm: EnsureRegistration failed: error="failed inserting node: Error while renaming Node ID: "835aa73d-78e9-ba63-d2da-8bbfba1329a7": Node name server-1 is reserved by node 4d2d0bfa-6d2d-c373-fff2-16ac612dca7a with name server-1 (172.19.0.3)"
[WARN] agent.fsm: EnsureRegistration failed: error="failed inserting node: Error while renaming Node ID: "49be2708-2f0b-0a1b-0069-9e1420fa251b": Node name server-2 is reserved by node ea086564-8d58-2918-682d-9092651f2157 with name server-2 (172.19.0.4)"
[WARN] agent.fsm: EnsureRegistration failed: error="failed inserting node: Error while renaming Node ID: "3398cdca-dc18-a786-e3a6-6f7deb5dcd0d": Node name server-3 is reserved by node d7529d1e-6924-41d8-fc40-fa0f131cc558 with name server-3 (172.19.0.5)"
...
To resolve these errors, you need to perform a consul leave on each server and then start the server again.
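For example, on each server in turn (the restart command assumes Consul runs as a systemd service; adapt it to your environment):

$ consul leave
$ sudo systemctl start consul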
Once the servers are restarted, the node IDs will be set to the expected values, which resolves the errors in the logs.
Perform a rolling restart of the clients
The same log errors will be present on the clients. After completing the server restarts, you must perform the same operations on the clients to resolve the log errors.
Restore and validate federation
Once restored, it is possible that the new servers will have different IP addresses than the ones used to WAN join them from the secondary datacenter. In that scenario, the federation will be broken.
To verify the federation, you can use the consul members command on the primary datacenter.
$ consul members -wan
Node              Address          Status  Type    Build  Protocol  DC       Segment
server-1.primary  172.19.0.3:8302  alive   server  1.9.0  2         primary  <all>
server-2.primary  172.19.0.4:8302  alive   server  1.9.0  2         primary  <all>
server-3.primary  172.19.0.5:8302  alive   server  1.9.0  2         primary  <all>
To restore the federation, you must perform a rolling restart of the secondary datacenter servers using the new primary datacenter servers' IP addresses in the retry_join_wan parameter.
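A sketch of the relevant setting in each secondary server's agent configuration, using the primary server addresses shown in the output above:

# On every server in the secondary datacenter
retry_join_wan = ["172.19.0.3", "172.19.0.4", "172.19.0.5"]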
After the rolling restart, you should be able to observe all the servers in the consul members command output.
$ consul members -wan
Node                Address          Status  Type    Build  Protocol  DC         Segment
server-1.secondary  172.19.0.7:8302  alive   server  1.9.0  2         secondary  <all>
server-2.secondary  172.19.0.8:8302  alive   server  1.9.0  2         secondary  <all>
server-3.secondary  172.19.0.9:8302  alive   server  1.9.0  2         secondary  <all>
server-1.primary    172.19.0.3:8302  alive   server  1.9.0  2         primary    <all>
server-2.primary    172.19.0.4:8302  alive   server  1.9.0  2         primary    <all>
server-3.primary    172.19.0.5:8302  alive   server  1.9.0  2         primary    <all>
Next steps
In this tutorial, you learned how to restore your federated environment in the event of a full outage of your primary datacenter.
You can follow Provide Fault Tolerance with Redundancy Zones to learn how Consul Enterprise helps you in making the deployment more robust by using different zones (such as AWS Availability Zones) for your Consul deployment.