Scope of Availability

For the Hardware Appliance, availability is defined as the ability to keep the service running with full data integrity for the applications on the Hardware Appliance that use the internal SQL database.

How it works

The cluster implementation on the Hardware Appliance uses regular network connectivity over the Application Interface for all cluster communication. This means that cluster nodes don’t have to be placed physically close to each other, as long as they have good network connectivity.

However, this also means that a node cannot distinguish between the failure of another node and broken network connectivity to that node. To avoid a situation where the cluster nodes operate independently and accumulate diverging data sets (a so-called split-brain situation), the cluster nodes take a vote and will cease to operate unless they are part of the majority of connected nodes. This ensures that only one data set is allowed to be updated at a time. In the case of a temporary network failure, disconnected nodes can easily synchronize their data to the majority’s data set and continue to operate.
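The majority rule described above can be sketched as a simple vote count. This is a minimal illustration of the quorum principle, not the appliance’s actual cluster implementation:

```python
def has_quorum(reachable_votes: int, total_votes: int) -> bool:
    """A partition may continue operating only if it holds a strict
    majority of all votes in the cluster."""
    return reachable_votes * 2 > total_votes

# Three-node cluster, one node cut off by a network failure:
assert has_quorum(2, 3)        # the two connected nodes keep serving
assert not has_quorum(1, 3)    # the isolated node ceases to operate

# An even split in a four-node cluster: neither side has a strict
# majority, so neither side may update the data set.
assert not has_quorum(2, 4)
```

Note that the majority must be strict: in an even split, both halves stop, which is exactly what prevents two diverging data sets.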

Synchronization of key material

Key material stored in the HSM is not automatically synchronized after the cluster has been set up. Manual synchronization is however possible.

Pre-cluster setup generation of keys

If suitable for your use-case, you can generate all keys that will be used during the installation’s lifetime after installing the first node, but before starting the cluster configuration for the additional nodes. This way, all additional cluster nodes are provisioned with the complete key material on installation, and no further manual key synchronization is necessary.

Post-cluster setup generation of keys

When generating new keys (or modifying the key material in any other way) after the cluster has been set up, you need to manually synchronize the key material. Note that applications connected to the shared database may malfunction if they try to use references to keys that are not yet synchronized. For example, if a Certificate Authority in EJBCA is renewed with new key generation, the other cluster nodes will try to use the new key shortly after the renewal. This fails because the key generation was local to the node where it was performed.
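The failure mode can be modeled in a few lines: the database replicates immediately, while each node’s HSM contents stay local until a synchronization package is applied. The dictionaries and the key alias below are purely illustrative, not actual EJBCA schema:

```python
# Hypothetical model: the SQL database replicates to every node, but each
# node's HSM key store is local until key material is synchronized.
shared_db = {}                     # replicated to all nodes
hsm_node1, hsm_node2 = {}, {}      # per-node HSM contents

# Node 1 renews a CA with new key generation:
hsm_node1["signKey-new"] = "<private key material>"   # stays local
shared_db["ca.signKeyAlias"] = "signKey-new"          # replicates at once

# Node 2 sees the replicated alias but lacks the key locally:
alias = shared_db["ca.signKeyAlias"]
assert alias in hsm_node1
assert alias not in hsm_node2      # signing on node 2 fails until sync

# After applying the synchronization package on node 2:
hsm_node2.update(hsm_node1)
assert alias in hsm_node2          # node 2 can now use the new key
```

The window between the database write replicating and the key material being synchronized is exactly when other nodes can malfunction.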

Use-Case: Synchronize key material

  1. On Node 1: Generate the key pair(s) on the first node.
  2. On Node 1: Go to the HSM tab of the Hardware Appliance WebConf and download a Cluster Key Synchronization Package by clicking Download protected HSM backup.
  3. On Node n: Go to the HSM tab of the Hardware Appliance WebConf and upload the package.
  4. Repeat step 3 for each node (n>1).
  5. Configure the application to start using the new key pair(s).

Since node 1 has a higher database quorum vote weight, it is generally advisable to generate the keys there, to avoid a reboot and potential downtime in a two-node setup.
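The effect of the asymmetric vote weight can be sketched with weighted votes. The exact weights below are illustrative assumptions, not the appliance’s actual configuration:

```python
def has_quorum(reachable_weight: int, total_weight: int) -> bool:
    """A partition survives only with a strict majority of vote weight."""
    return reachable_weight * 2 > total_weight

# Hypothetical two-node setup: node 1 carries 2 votes, node 2 carries 1.
weights = {"node1": 2, "node2": 1}
total = sum(weights.values())

# Node 2 goes down (e.g. rebooted during key provisioning):
assert has_quorum(weights["node1"], total)       # node 1 keeps operating

# Node 1 goes down instead:
assert not has_quorum(weights["node2"], total)   # node 2 stops operating
```

With equal weights, losing either node in a two-node cluster would stop both; the extra weight on node 1 is what lets it survive alone, which is why generating keys there avoids downtime.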

Network topology

All cluster nodes should have a dedicated connection to every other node in the cluster. However, the cluster can still propagate the data as long as the nodes form a connected graph, that is, every node can reach every other node through at least one chain of connections.
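Whether data can propagate thus reduces to a graph connectivity check. A minimal sketch of that check, with made-up node names:

```python
from collections import deque

def can_propagate(nodes: list, links: list) -> bool:
    """Data can reach every node only if the link graph is connected."""
    if not nodes:
        return True
    adjacency = {n: set() for n in nodes}
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    # Breadth-first search from an arbitrary node:
    seen, queue = {nodes[0]}, deque([nodes[0]])
    while queue:
        for peer in adjacency[queue.popleft()]:
            if peer not in seen:
                seen.add(peer)
                queue.append(peer)
    return seen == set(nodes)

nodes = ["n1", "n2", "n3"]
# Full mesh is recommended, but a chain still propagates the data:
assert can_propagate(nodes, [("n1", "n2"), ("n2", "n3")])
# A node with no link at all cannot receive any updates:
assert not can_propagate(nodes, [("n1", "n2")])
```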

The network connection uses the GRE protocol (IP protocol number 47; for more information, refer to Wikipedia: List of IP protocol numbers). Since GRE is an IP protocol in its own right, it is not based on either TCP or UDP and has no concept of ports. This means that it cannot simply be made available with a port forwarding behind a NAT (Network Address Translation). A fully transparent VPN solution is required if the cluster is to be installed across different locations.
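To see why port forwarding cannot help, it is enough to look at the IPv4 header: GRE is identified by the protocol field in the IP header itself, and there is no port field anywhere. A sketch building a minimal header (the addresses are placeholders, and most fields are zeroed for brevity):

```python
import struct

IPPROTO_GRE = 47  # GRE's assigned IP protocol number

# Minimal 20-byte IPv4 header carrying GRE.
version_ihl = (4 << 4) | 5         # IPv4, 5 x 32-bit header words
header = struct.pack(
    "!BBHHHBBH4s4s",
    version_ihl,
    0,                              # TOS
    20,                             # total length (header only)
    0, 0,                           # identification, flags/fragment
    64,                             # TTL
    IPPROTO_GRE,                    # protocol: 47, not TCP (6) or UDP (17)
    0,                              # checksum (omitted in this sketch)
    bytes([10, 0, 0, 1]),           # placeholder source address
    bytes([10, 0, 0, 2]),           # placeholder destination address
)

# The protocol identifier is byte 9 of the IP header. Ports would live in
# the TCP/UDP header that follows -- GRE simply has none, so a NAT device
# has no port number to forward on.
assert header[9] == IPPROTO_GRE
```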

If you do have network equipment that is able to encapsulate the protocol, you might still run into network address complications. The easiest workaround is to set up the systems in a simpler network configuration (e.g. at the same site) and ship or reconfigure them afterwards.

A cluster node will never forward traffic between two other nodes, to avoid networking loops. Compared to using the Spanning Tree Protocol (STP), this means that a broken network connection between two nodes will not trigger downtime of any other connection.

If you prefer dynamic loop prevention behavior, you can add managed switches in front of the Application Interfaces of the Hardware Appliances. Note that if a network topology change prevents network traffic between the nodes for too long, your cluster nodes might stop operating and require manual intervention. Rapid Spanning Tree Protocol (RSTP) might be an interesting alternative to STP in this case.

Cluster traffic security considerations

The current version of the Hardware Appliance does not protect the cluster traffic. IPsec will be used in a later release, but for now, you need to ensure that this sensitive traffic is protected by other means.