Out-Of-Sync Cluster Nodes


If the master node is stopped while other cluster nodes are still running, it sometimes cannot re-join the cluster normally. This happens when a write operation was in progress while the master node was stopped, or when a write operation occurred a few seconds before the master node was stopped or killed. In these cases, the slave nodes did not receive all changes from the master node.

When this cluster node is then re-started while another cluster node is still running, CRX will detect that this cluster node is out of sync, and the repository will not start. Instead, an error message is written to server.log stating that the repository is not available, and the following (or a similar) error message appears in the files crx-quickstart/logs/crx/error.log and crx-quickstart/logs/stdout.log:

ClusterTarSet: Could not open (ClusterTarSet.java, line 710)
java.io.IOException: This cluster node and the master are out of sync. Operation stopped.
Please ensure the repository is configured correctly.
To continue anyway, please delete the index and data tar files on this cluster node and restart.
Please note the Lucene index may still be out of sync unless it is also deleted.
java.io.IOException: Init failed
RepositoryImpl: failed to start Repository: Cannot instantiate persistence manager 
RepositoryStartupServlet: RepositoryStartupServlet initializing failed 

Avoiding Out-Of-Sync Cluster Nodes

To avoid this problem, please ensure the slave cluster nodes are stopped before the master is stopped. If you are not sure which cluster node is currently the master, open the page http://localhost:port/crx/config/cluster.jsp. The master identity must match the contents of the file <repository>/cluster_node.id on the respective cluster node.
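The check described above can also be done from the command line; a quick ops sketch, assuming a locally running instance on port 7402 and the default crx-quickstart layout (adjust host, port and paths to your setup):

```shell
# This node's cluster identity (path assumes the default layout)
cat crx-quickstart/repository/cluster_node.id

# The cluster status page reports the current master; grep for it
curl -s "http://localhost:7402/crx/config/cluster.jsp" | grep -i master
```

If the id printed by the first command matches the master reported by the second, this node is the current master.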

Automatic Out-Of-Sync Prevention

Note: this feature is part of CRX hotfix pack or greater

To prevent cluster nodes from getting out of sync, a marker file 'clustered.txt' is created in the repository root directory of each cluster node as soon as multiple cluster nodes are connected to each other.

This file is deleted on the master node as soon as that cluster node runs alone (as the only cluster node). It is not deleted when a slave is stopped normally, or when a cluster node process is killed while the file existed.

If the file exists, a cluster node only starts as a slave; it will not start as a master. The cluster node will wait at most 1 minute (by default) until it can connect to the master, and if it cannot, startup will fail.

The maximum number of seconds to wait for a master can be changed by setting the system property "com.day.crx.core.cluster.WaitForMasterRetries" (default: 60). This is the number of retries while waiting for a master to appear, with a one-second delay after each try, so by default the node waits at most 60 seconds before giving up.
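For example, to wait up to 120 seconds, the property can be passed as a standard JVM system property when starting the instance (the jar name below is a placeholder; use your actual quickstart jar):

```shell
java -Dcom.day.crx.core.cluster.WaitForMasterRetries=120 -jar <quickstart jar>
```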

Recovering an Out-Of-Sync Cluster Node

To re-join a cluster node that is out of sync, the repository needs to be brought back in sync. There are multiple ways to achieve this:

- Create a new repository and join the cluster node as normal.

- Use the Online Backup feature to create a cluster node (see the section below). In many cases this is the fastest way to add a cluster node.

- Restore a backup of this cluster node and start.

- As described in the error message, delete the index and data tar files that are out of sync on this cluster node and restart. Please note the Lucene index may still be out of sync unless it is also deleted. This procedure is discouraged because it requires more knowledge of the repository, and it may be slower than using the online backup feature (especially if the Lucene index needs to be re-built).
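For the last option, a destructive sketch of the steps (stop the instance and take a backup first; the paths below assume a default TarPM layout for the crx.default workspace and are an assumption — the files that are actually out of sync are named in crx-quickstart/logs/crx/error.log):

```shell
# DESTRUCTIVE: run only on the stopped, out-of-sync cluster node, after a backup.
WS=crx-quickstart/repository/workspaces/crx.default   # assumed default layout
rm "$WS"/data_*.tar "$WS"/index_*.tar                 # tar data and index files
rm -r "$WS"/index                                     # Lucene index, re-built on start
```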

Using Online Backup to Create a Cluster Node

To speed up creating a cluster node, the online backup can be used to create a new cluster node. Unlike joining a cluster node, this will not cause the index to be re-built.

On the master node:

    If you have not already done so, install the master cluster node.

    If the new cluster node is on a different machine: stop the master cluster node, and append the IP addresses of both the master and the new cluster node to the cluster.properties file. If the property "addresses" already exists, manually append the IP addresses (the value of the "addresses" property is a comma-separated list of the IP addresses of all cluster nodes). If it does not yet exist in the properties file, this config option can be appended as follows (replace the empty list after "=" with the correct comma-separated list of IP addresses):

    echo "addresses=," >> crx-quickstart/repository/cluster.properties
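    The append shown above can be made slightly more defensive, handling both the existing and missing property cases. A sketch, demonstrated on a temporary file standing in for cluster.properties; the IP addresses are placeholders (assumptions), not values from a real setup:

```shell
#!/bin/sh
# Stand-in for crx-quickstart/repository/cluster.properties
PROPS=$(mktemp)
NEW_ADDRS="192.168.0.10,192.168.0.11"   # placeholder IPs of all cluster nodes

if grep -q '^addresses=' "$PROPS"; then
  # Property already present: append the new IPs to the comma-separated list
  sed -i "s/^addresses=.*/&,$NEW_ADDRS/" "$PROPS"
else
  # Property missing: append it
  echo "addresses=$NEW_ADDRS" >> "$PROPS"
fi
cat "$PROPS"
```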

    Create an online backup. This can be automated using curl or wget:

    curl -c login.txt "http://localhost:7402/crx/login.jsp?UserId=admin&Password=xyz&Workspace=crx.default"
    curl -b login.txt -f -o progress.txt "http://localhost:7402/crx/config/backup.jsp?action=add&zipFileName=&targetDir=<targetDir>"

    If required, copy / move this backup to the target machine(s).

    Please note that preferredMaster should only be set on one cluster instance. If necessary, change the flag in the copy.

    Start the new instance.

    Verify the cluster nodes are joined by opening the page http://localhost:port/crx/config/cluster.jsp
