This adds a guide with recommendations on how to set up
managers and keep the Swarm cluster healthy.
Signed-off-by: Alexandre Beslic <alexandre.beslic@gmail.com>
(cherry picked from commit 24f87f26e73a49383e0606813a86ed96da7f5a18)
Signed-off-by: Tibor Vass <tibor@docker.com>

new file mode 100644
@@ -0,0 +1,241 @@
<!--[metadata]>
aliases = [
"/engine/swarm/manager-administration-guide/"
]
title = "Swarm Manager Administration Guide"
description = "Manager administration guide"
keywords = ["docker, container, cluster, swarm, manager, raft"]
advisory = "rc"
[menu.main]
identifier="manager_admin_guide"
parent="engine_swarm"
weight="12"
<![end-metadata]-->

# Administer and maintain a swarm of Docker Engines

When you run a swarm of Docker Engines, **manager nodes** are the key components
for managing the cluster and storing the cluster state. It is important to understand
some key features of manager nodes in order to properly deploy and maintain the
swarm.

This article covers the following swarm administration tasks:

* [Add manager nodes for fault tolerance](#add-manager-nodes-for-fault-tolerance)
* [Distributing manager nodes](#distributing-manager-nodes)
* [Running manager-only nodes](#run-manager-only-nodes)
* [Backing up the cluster state](#back-up-the-cluster-state)
* [Monitoring the swarm health](#monitor-swarm-health)
* [Recovering from disaster](#recover-from-disaster)

Refer to [How swarm mode nodes work](how-swarm-mode-works/nodes.md)
for a brief overview of Docker Swarm mode and the difference between manager and
worker nodes.

## Operating manager nodes in a swarm

Swarm manager nodes use the [Raft Consensus Algorithm](raft.md) to manage the
cluster state. You only need to understand some general concepts of Raft in
order to manage a swarm.

There is no limit on the number of manager nodes. The decision about how many
manager nodes to implement is a trade-off between performance and
fault-tolerance. Adding manager nodes to a swarm makes the swarm more
fault-tolerant. However, additional manager nodes reduce write performance
because more nodes must acknowledge proposals to update the cluster state.
This means more network round-trip traffic.

Raft requires a majority of managers, also called a quorum, to agree on proposed
updates to the cluster. A quorum of managers must also agree on node additions
and removals. Membership operations are subject to the same constraints as state
replication.

## Add manager nodes for fault tolerance

You should maintain an odd number of managers in the swarm to support manager
node failures. Having an odd number of managers ensures that if the network is
partitioned into two sets, there is a higher chance that a quorum remains
available to process requests. Keeping a quorum is not guaranteed if you
encounter more than two network partitions.

| Cluster Size | Majority | Fault Tolerance |
|:------------:|:--------:|:---------------:|
|      1       |    1     |        0        |
|      2       |    2     |        0        |
|    **3**     |    2     |      **1**      |
|      4       |    3     |        1        |
|    **5**     |    3     |      **2**      |
|      6       |    4     |        2        |
|    **7**     |    4     |      **3**      |
|      8       |    5     |        3        |
|    **9**     |    5     |      **4**      |

For example, in a swarm with *5 nodes*, if you lose *3 nodes*, you don't have a
quorum. Therefore you can't add or remove nodes until you recover one of the
unavailable manager nodes or recover the cluster with disaster recovery
commands. See [Recover from disaster](#recover-from-disaster).

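If you need the numbers for a cluster size that is not in the table, the
following shell sketch computes them. It is plain arithmetic rather than a
Docker command, and the script name is only an example:

```bash
#!/bin/sh
# Usage: sh quorum.sh <number-of-managers>
# The majority (quorum) is floor(N/2)+1; fault tolerance is whatever is left.
N=$1
MAJORITY=$(( N / 2 + 1 ))
TOLERANCE=$(( N - MAJORITY ))
echo "managers=$N majority=$MAJORITY fault-tolerance=$TOLERANCE"
```

Running `sh quorum.sh 5` prints `managers=5 majority=3 fault-tolerance=2`,
matching the table above.
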
While it is possible to scale a swarm down to a single manager node, it is
impossible to demote the last manager node. This ensures you maintain access to
the swarm and that the swarm can still process requests. Scaling down to a
single manager is an unsafe operation and is not recommended. If
the last node leaves the cluster unexpectedly during the demote operation, the
swarm becomes unavailable until you reboot the node or restart with
`--force-new-cluster`.

You manage cluster membership with the `docker swarm` and `docker node`
subsystems. Refer to [Add nodes to a swarm](join-nodes.md) for more information
on how to add worker nodes and promote a worker node to be a manager.

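For example, assuming a worker named `worker1` (a placeholder hostname), you
could promote it to a manager, and later demote it back, with:

```bash
# Promote an existing worker so it joins the manager set
docker node promote worker1

# Demote it back to a plain worker when it no longer needs to be a manager
docker node demote worker1
```

Both commands must be run from a manager node.
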
## Distributing manager nodes

In addition to maintaining an odd number of manager nodes, pay attention to
datacenter topology when placing managers. For optimal fault-tolerance, distribute
manager nodes across a minimum of 3 availability zones to support failures of an
entire set of machines or common maintenance scenarios. If you suffer a failure
in any of those zones, the swarm should maintain a quorum of manager nodes
available to process requests and rebalance workloads.

| Swarm manager nodes | Distribution across 3 availability zones |
|:-------------------:|:-----------------------------------------:|
|          3          |                   1-1-1                   |
|          5          |                   2-2-1                   |
|          7          |                   3-2-2                   |
|          9          |                   3-3-3                   |

## Run manager-only nodes

By default manager nodes also act as worker nodes. This means the scheduler
can assign tasks to a manager node. For small and non-critical clusters
assigning tasks to managers is relatively low-risk as long as you schedule
services using **resource constraints** for *cpu* and *memory*.

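As a rough sketch of what such constraints can look like, the following creates
a service with CPU and memory reservations and limits. The service name, image,
and values are only examples:

```bash
# Reserve and cap resources so tasks cannot starve the manager's own processes
docker service create --name redis \
  --reserve-cpu 1 --limit-cpu 2 \
  --reserve-memory 256MB --limit-memory 512MB \
  redis:3.0.6
```
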
However, because manager nodes use the Raft consensus algorithm to replicate data
in a consistent way, they are sensitive to resource starvation. You should
isolate managers in your swarm from processes that might block cluster
operations such as the cluster heartbeat or leader elections.

To avoid interference with manager node operation, you can drain manager nodes
to make them unavailable as worker nodes:

```bash
docker node update --availability drain <NODE-ID>
```

When you drain a node, the scheduler reassigns any tasks running on the node to
other available worker nodes in the cluster. It also prevents the scheduler from
assigning tasks to the node.

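To confirm the change took effect, you can check the node's availability. This
is a small verification sketch using the same `docker node inspect` format
syntax shown later in this guide:

```bash
docker node inspect <NODE-ID> --format "{{ .Spec.Availability }}"
drain
```
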
## Back up the cluster state

Docker manager nodes store the cluster state and manager logs in the following
directory:

`/var/lib/docker/swarm/raft`

Back up the raft data directory often so that you can use it in case of disaster
recovery.

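A minimal backup sketch, assuming the manager runs on a systemd host where you
can briefly stop the Engine (the archive path is only an example):

```bash
# Stop the Engine so the raft data is not written to while it is copied
systemctl stop docker

# Archive the raft data directory somewhere safe
tar -czvf /tmp/swarm-raft-backup.tar.gz /var/lib/docker/swarm/raft

# Start the Engine again
systemctl start docker
```
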
You should never restart a manager node with the data directory from another
node (for example, by copying the `raft` directory from one node to another).
The data directory is unique to a node ID and a node can only use a given node
ID once to join the swarm (that is, node IDs are globally unique).

To cleanly re-join a manager node to a cluster (see the sketch after these steps):

1. Run `docker node demote <id-node>` to demote the node to a worker.
2. Run `docker node rm <id-node>` before adding a node back with a fresh state.
3. Re-join the node to the cluster using `docker swarm join`.

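Put together, the sequence looks roughly like the following. `<id-node>` and
`<MANAGER-IP>` are placeholders, and the exact `docker swarm join` arguments
(such as a join token) depend on how your cluster was set up:

```bash
# On a healthy manager: demote the target node, then remove its entry
docker node demote <id-node>
docker node rm <id-node>

# On the node itself: re-join the swarm with a fresh state
docker swarm join <MANAGER-IP>:2377
```
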
In case of [disaster recovery](#recover-from-disaster), you can use the raft data
directory of one of the manager nodes to restore a new swarm cluster.

## Monitor swarm health

You can monitor the health of manager nodes by querying the docker `nodes` API
in JSON format through the `/nodes` HTTP endpoint. Refer to the [nodes API documentation](../reference/api/docker_remote_api_v1.24.md#36-nodes)
for more information.

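For example, assuming the Engine listens on its default Unix socket and `curl`
7.40 or newer is installed, you could hit the endpoint directly:

```bash
# Dump all nodes, including their manager status, as JSON
curl --unix-socket /var/run/docker.sock http://localhost/v1.24/nodes
```
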
From the command line, run `docker node inspect <id-node>` to query the nodes.
For instance, to query the reachability of the node as a manager:

```bash
docker node inspect manager1 --format "{{ .ManagerStatus.Reachability }}"
reachable
```

To query the status of the node as a worker that accepts tasks:

```bash
docker node inspect manager1 --format "{{ .Status.State }}"
ready
```

From those commands, we can see that `manager1` has the status `reachable` as a
manager and `ready` as a worker.

An `unreachable` health status means that this particular manager node is unreachable
from other manager nodes. In this case you need to take action to restore the unreachable
manager:

- Restart the daemon and see if the manager comes back as reachable.
- Reboot the machine.
- If neither restarting nor rebooting works, you should add another manager node
  or promote a worker to be a manager node. You also need to cleanly remove the
  failed node entry from the manager set with `docker node demote <id-node>` and
  `docker node rm <id-node>`.

Alternatively, you can get an overview of the cluster health with `docker node ls`:

```bash
# From a manager node
docker node ls
ID                           HOSTNAME  MEMBERSHIP  STATUS  AVAILABILITY  MANAGER STATUS
1mhtdwhvsgr3c26xxbnzdc3yp    node05    Accepted    Ready   Active
516pacagkqp2xc3fk9t1dhjor    node02    Accepted    Ready   Active        Reachable
9ifojw8of78kkusuc4a6c23fx *  node01    Accepted    Ready   Active        Leader
ax11wdpwrrb6db3mfjydscgk7    node04    Accepted    Ready   Active
bb1nrq2cswhtbg4mrsqnlx1ck    node03    Accepted    Ready   Active        Reachable
di9wxgz8dtuh9d2hn089ecqkf    node06    Accepted    Ready   Active
```

## Manager advertise address

When initializing or joining a Swarm cluster, you have to specify the `--listen-addr`
flag to advertise your address to the other manager nodes in the cluster.

We recommend that you use a *fixed IP address* for the advertised address, otherwise
the cluster could become unstable when machines reboot.

If the whole cluster restarts and every manager gets a new IP address on
restart, there is no way for any of those nodes to contact an existing manager, and
the cluster stays stuck trying to contact the other nodes through their old addresses.
While dynamic IP addresses are acceptable for worker nodes, managers are meant to
be a stable piece of the infrastructure, so it is highly recommended to deploy
these critical nodes with static IPs.

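For example, on a manager host that owns the static address `192.168.0.10`
(an example address), initialization might look like this:

```bash
# Advertise a fixed IP address so other nodes can always reach this manager
docker swarm init --listen-addr 192.168.0.10:2377
```
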
## Recover from disaster

Swarm is resilient to failures and the cluster can recover from any number
of temporary node failures (machine reboots or crashes followed by a restart).

In a swarm of `N` managers, there must be a quorum of manager nodes greater than
50% of the total number of managers (or `(N/2)+1`) in order for the swarm to
process requests and remain available. This means the swarm can tolerate up to
`(N-1)/2` permanent failures, beyond which requests involving cluster management
cannot be processed. These types of failures include data corruption or hardware
failures.

Even if you follow the guidelines here, it is possible to lose a quorum of
manager nodes. If you can't recover the quorum by conventional means such as
restarting faulty nodes, you can recover the cluster by running
`docker swarm init --force-new-cluster` on a manager node.

```bash
# From the node to recover
docker swarm init --force-new-cluster --listen-addr node01:2377
```

The `--force-new-cluster` flag puts the Docker Engine into swarm mode as a
manager node of a single-node cluster. It discards cluster membership information
that existed before the loss of the quorum, but it retains data necessary to the
swarm such as services, tasks and the list of worker nodes.

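Once the single-manager swarm is running again, you can check its state and, if
you previously ran several managers, promote other nodes to restore fault
tolerance. This is only a sketch and the node names are placeholders:

```bash
# Verify that the recovered node is the leader of a healthy single-manager swarm
docker node ls

# Promote additional nodes back to managers to regain fault tolerance
docker node promote node02 node03
```
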
new file mode 100644
@@ -0,0 +1,47 @@
<!--[metadata]>
title = "Raft consensus in swarm mode"
description = "Raft consensus algorithm in swarm mode"
keywords = ["docker, container, cluster, swarm, raft"]
advisory = "rc"
[menu.main]
identifier="raft"
parent="engine_swarm"
weight="13"
<![end-metadata]-->

## Raft consensus algorithm

When the Docker Engine runs in swarm mode, manager nodes implement the
[Raft Consensus Algorithm](http://thesecretlivesofdata.com/raft/) to manage the global cluster state.

*Docker swarm mode* uses a consensus algorithm to make sure that all the manager
nodes in charge of managing and scheduling tasks in the cluster store the same
consistent state.

Having the same consistent state across the cluster means that in case of a failure,
any manager node can pick up the tasks and restore the services to a stable state.
For example, if the *leader manager*, which is responsible for scheduling tasks in the
cluster, dies unexpectedly, any other manager can pick up the task of scheduling and
re-balance tasks to match the desired state.

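For instance, if you want to see whether a given manager currently holds the
leader role, you can inspect it. This sketch uses the `docker node inspect`
format syntax shown in the administration guide; `manager1` is a placeholder
name:

```bash
# Prints true when the inspected manager is the current Raft leader
docker node inspect manager1 --format "{{ .ManagerStatus.Leader }}"
true
```
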
Systems using consensus algorithms to replicate logs in a distributed system
require special care. They ensure that the cluster state stays consistent
in the presence of failures by requiring a majority of nodes to agree on values.

Raft tolerates up to `(N-1)/2` failures and requires a majority or quorum of
`(N/2)+1` members to agree on values proposed to the cluster. This means that in
a cluster of 5 managers running Raft, if 3 nodes are unavailable, the system
cannot process any more requests to schedule additional tasks. The existing
tasks keep running, but the scheduler is not able to rebalance tasks to
cope with failures while the manager set is not healthy.

The implementation of the consensus algorithm in swarm mode means it features
the properties inherent to distributed systems:

- *agreement on values* in a fault-tolerant system (refer to the [FLP impossibility theorem](http://the-paper-trail.org/blog/a-brief-tour-of-flp-impossibility/)
  and the [Raft Consensus Algorithm paper](https://www.usenix.org/system/files/conference/atc14/atc14-paper-ongaro.pdf))
- *mutual exclusion* through the leader election process
- *cluster membership* management
- *globally consistent object sequencing* and CAS (compare-and-swap) primitives