Galera replication uses regular network connectivity over the main instance interface for all cluster communication. This means that cluster nodes do not have to be physically close to each other, as long as they have good network connectivity between them.
However, this also means that a node cannot distinguish between the failure of another node and broken network connectivity to that node. To avoid a situation where cluster nodes operate independently and end up with diverging data sets (a split-brain situation), the nodes take a vote and cease to operate unless they are part of the majority of connected nodes. This ensures that only one data set is allowed to be updated at a time. In the case of a temporary network failure, the disconnected nodes can easily synchronize with the majority's data set and continue to operate.
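As a rough illustration of the majority rule (not the actual Galera implementation), a partition of the cluster only keeps serving requests if it holds strictly more than half of the total votes. The cluster and partition sizes below are hypothetical:

```python
def partition_has_quorum(partition_size: int, cluster_size: int) -> bool:
    """A partition may continue to operate only if it holds a strict
    majority of the cluster's votes (illustrative, one vote per node)."""
    return partition_size > cluster_size / 2

# A 3-node cluster split into partitions of 2 and 1 nodes:
print(partition_has_quorum(2, 3))  # True  -> this side keeps accepting updates
print(partition_has_quorum(1, 3))  # False -> this side stops operating

# A 2-node cluster split 1/1: neither side has a majority,
# which is why two-node clusters cannot survive a network failure on their own.
print(partition_has_quorum(1, 2))  # False
```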
Two Node Clusters
Galera recommends at least three nodes to avoid a split-brain situation. If only two instances are used, which is not recommended, make sure that only one of them is written to and acts as the primary, while the other is kept for disaster recovery (DR) purposes. Alternatively, an arbitrator can be configured to avoid split brain; a two-node cluster plus an arbitrator has three votes, so the loss of a single node still leaves a majority. For more information, refer to Galera Documentation on Galera Arbitrator.
There is no real high availability in a two-node cluster. If one of the nodes leaves the cluster ungracefully, the database is taken offline on the remaining node as well. Two-node clusters provide redundancy rather than availability and require manual intervention to become functional again after a failure.
For more information, refer to Galera Documentation on Two-node Clusters.
High Availability
This setup requires three or more nodes. In case of a node failure, the remaining nodes will still be able to form a cluster through a majority quorum vote and continue to operate.
The first cluster node always has a slightly higher quorum vote than the rest of the nodes. In a setup with an even number of nodes (4 or more) divided over two sites, the site that contains the first node will therefore continue to operate if the connectivity between the sites fails.
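The effect of the higher vote can be sketched as a weighted variant of the majority rule above. This is only an illustration; how the extra weight is assigned in practice (for example via Galera's pc.weight provider option) depends on the deployment, and the weights below are hypothetical:

```python
def weighted_partition_has_quorum(partition_weights, all_weights):
    """A partition keeps operating only if its combined vote weight is a
    strict majority of the cluster's total vote weight (illustrative)."""
    return sum(partition_weights) > sum(all_weights) / 2

# Four nodes over two sites; the first node carries a slightly higher weight.
site_a = [2, 1]      # first node (weight 2) plus one regular node
site_b = [1, 1]      # two regular nodes
total = site_a + site_b

print(weighted_partition_has_quorum(site_a, total))  # True  -> 3 of 5 votes, keeps operating
print(weighted_partition_has_quorum(site_b, total))  # False -> 2 of 5 votes, stops operating
```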
Continuous Service Availability
To ensure that service clients always connect to an operational node in the cluster, an external load-balancer should be used for automatic fail-over and/or load distribution.
If a custom application is being developed to consume the services provided by SignServer Cloud's external interfaces, failover could also be handled in the application itself, by making it connect to any node that is found to be operational.
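A minimal sketch of that approach is shown below. The node addresses, port, and health-check method are all hypothetical; a real application would typically call a proper health or status endpoint rather than just opening a TCP connection:

```python
import socket

# Hypothetical cluster node addresses and service port.
NODES = ["node1.example.com", "node2.example.com", "node3.example.com"]
PORT = 443

def find_operational_node(nodes=NODES, port=PORT, timeout=2.0):
    """Return the first node that accepts a TCP connection on the service
    port, or None if no node is reachable."""
    for host in nodes:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return host
        except OSError:
            continue  # node is down or unreachable, try the next one
    return None

node = find_operational_node()
if node is None:
    raise RuntimeError("No operational cluster node found")
print(f"Sending requests to https://{node}/")
```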
If lower availability and manual intervention are acceptable in the case of a node failure, this could also be solved by redirecting a DNS name for the service to an operational node.