High Availability and Clustering

For any PKI with availability requirements, whether on the CA or VA end, some form of redundancy needs to be factored in to ensure that the failure of a single instance does not result in downtime. There are two methods usually employed to solve this: Failover/Disaster Recovery Site and Clustering.

Failover/Disaster Recovery Site

This configuration is the most straightforward to implement, though it does require a manual failover and only provides a single layer of redundancy. 

In this configuration:

  • All traffic is directed to the Primary site while the Secondary site is online but idle.
  • Failover is handled manually, by redirecting traffic to the Secondary site if the Primary site fails.
  • The sites (Primary/Secondary) are geographically separated.
  • The HSMs on each site are functionally identical: the HSM on the Secondary site is restored from a backup of the HSM on the Primary site as part of the key ceremony.
  • The databases on each site are connected and configured as Master/Slave, meaning that writes made to the master database on the Primary site are replicated to the slave database on the Secondary site. If the Primary site fails while the two sites are still interconnected and the databases are still online, service can be resumed with a minimum of delay by re-configuring the slave database on the Secondary site to be the master.
  • The failover mechanism can be partly automated but is usually not fully automated.
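
The manual failover described above can be sketched as a small decision routine. This is a minimal illustration only, not EJBCA or database functionality: the Site class, its fields and the fail_over function are hypothetical names introduced here, and a real failover would involve DNS or load balancer changes and database-specific promotion commands.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Site:
    """Hypothetical model of one site; not an EJBCA or database API."""
    name: str
    healthy: bool           # does the site answer health checks?
    db_role: str            # "master" or "slave"
    receives_traffic: bool  # is client traffic currently routed here?


def fail_over(primary: Site, secondary: Site) -> list[str]:
    """Return the operator steps performed; updates both sites to match."""
    steps: list[str] = []
    if primary.healthy:
        return steps  # nothing to do while the Primary site is serving
    if not secondary.healthy:
        raise RuntimeError("both sites are down; no failover target")
    # Promote the database first so that the Secondary site accepts writes
    # before any client traffic reaches it.
    if secondary.db_role != "master":
        secondary.db_role = "master"
        steps.append(f"promote database on {secondary.name} to master")
    # Only then move the traffic over.
    primary.receives_traffic = False
    secondary.receives_traffic = True
    steps.append(f"redirect traffic to {secondary.name}")
    return steps


if __name__ == "__main__":
    primary = Site("Primary", healthy=False, db_role="master", receives_traffic=True)
    secondary = Site("Secondary", healthy=True, db_role="slave", receives_traffic=False)
    for step in fail_over(primary, secondary):
        print(step)
```

The ordering is the point of the sketch: the slave database is promoted before traffic is redirected, so that the Secondary site never receives requests it cannot write to its database.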

Clustering 

A more advanced, but also more redundant and secure, solution is Master/Master clustering, which (through the use of a load balancer) also provides performance benefits.

  • The database used by each node is a Master/Master database cluster, meaning that writes can be done on all database nodes in the cluster.
    • A commonly used database cluster technology is Galera.
    • Galera is a quorum-based clustering solution that works with MariaDB and MySQL. From MariaDB 10.1 onwards, the Galera clustering plugin is shipped with the database distribution by default.
  • If one node fails, traffic can be handled by the remaining nodes without re-configuration. The load balancer can automatically stop sending traffic to the failed node.
  • The HSMs used by each EJBCA node are functionally identical, with the HSM key material being synced between HSMs (manually or automatically depending on what the HSM supports).
  • Master/Master clustering can be combined with a Primary/Secondary setup, with two nodes in the Primary site and one in the Secondary site. The Secondary site may then be online and used in parallel with the Primary site, although this is rather unusual.
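
To illustrate what "quorum-based" means for the node layouts above, the following minimal sketch checks whether a group of nodes that can still reach each other holds a strict majority of the cluster. It assumes every node carries equal weight, which is a simplification (Galera also supports weighted quorum), and has_quorum is a hypothetical helper, not a Galera API.

```python
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """Return True if the reachable nodes form a strict majority.

    Assumes all nodes have equal weight; a real cluster may be
    configured with per-node weights.
    """
    if cluster_size < 1 or not 0 <= reachable_nodes <= cluster_size:
        raise ValueError("reachable_nodes must be between 0 and cluster_size")
    return 2 * reachable_nodes > cluster_size


if __name__ == "__main__":
    # Three-node layout: two nodes in the Primary site, one in the Secondary.
    print(has_quorum(2, 3))  # Primary site cut off from Secondary: True
    print(has_quorum(1, 3))  # Secondary site on its own: False
    # Two-node cluster split in half: neither side has a majority.
    print(has_quorum(1, 2))  # False
```

Under these assumptions a three-node cluster keeps accepting writes after losing any single node, while a two-node cluster cannot survive a split. It also shows that if the whole Primary site is lost, the single node in the Secondary site does not hold a majority on its own and would typically need operator intervention before it can serve writes.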

For a solution that is straightforward to set up, the EJBCA Appliance comes with built-in clustering that is easy to configure.