If the master node is stopped while other cluster nodes are still running, it sometimes cannot re-join the cluster normally. This happens when a write operation was in progress while the master node was stopped, or when a write operation occurred a few seconds before the master node was stopped or killed. In these cases, the slave nodes did not receive all changes from the master node.
When this cluster node is then restarted while another cluster node is still running, CRX detects that the node is out of sync and the repository does not start. Instead, an error message saying the repository is not available is written to server.log, and the following or a similar error message appears in the files crx-quickstart/logs/crx/error.log and crx-quickstart/logs/stdout.log:
ClusterTarSet: Could not open (ClusterTarSet.java, line 710)
java.io.IOException: This cluster node and the master are out of sync. Operation stopped. Please ensure the repository is configured correctly. To continue anyway, please delete the index and data tar files on this cluster node and restart. Please note the Lucene index may still be out of sync unless it is also deleted.
...
java.io.IOException: Init failed
...
RepositoryImpl: failed to start Repository: Cannot instantiate persistence manager
...
RepositoryStartupServlet: RepositoryStartupServlet initializing failed
To avoid this problem, ensure that the slave cluster nodes are stopped before the master is stopped. If you are not sure which cluster node is currently the master, open the page http://localhost:port/crx/config/cluster.jsp. The master identity shown there matches the contents of the file <repository>/cluster_node.id on the cluster node that is the master.
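The comparison described above can be scripted. The following is a minimal sketch, not part of CRX: the class name MasterCheck and the method idsMatch are hypothetical, and it assumes that cluster_node.id is a plain text file containing only the node identity, and that the master identity has been copied by hand from cluster.jsp.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not shipped with CRX.
public class MasterCheck {

    // Compares the contents of cluster_node.id with the master identity.
    // Surrounding whitespace (for example a trailing newline) is ignored.
    static boolean idsMatch(String localIdFileContents, String masterId) {
        return localIdFileContents.trim().equals(masterId.trim());
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java MasterCheck <repository-dir> <master-id>");
            return;
        }
        // Assumption: the file holds the node identity as plain text.
        Path idFile = Paths.get(args[0], "cluster_node.id");
        String localId = new String(Files.readAllBytes(idFile), StandardCharsets.UTF_8);
        System.out.println(idsMatch(localId, args[1])
                ? "this node is the master"
                : "this node is not the master");
    }
}
```

Run it on each cluster node with the repository directory and the master identity as arguments; the node that reports a match should be stopped last.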
Note: this feature is part of CRX hotfix pack 188.8.131.52 or greater
To prevent cluster nodes from getting out of sync, a marker file 'clustered.txt' is created in the repository root directory of each cluster node as soon as multiple cluster nodes are connected to each other.
This file is deleted on the master node as soon as the master runs alone (as the only cluster node). It is not deleted when a slave is stopped normally, or when a cluster node process is killed while the file exists.
If the file exists, a cluster node only starts as a slave. It will not start as a master. The cluster node waits at most 1 minute (by default) until it can connect to the master; if it cannot, startup fails.
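The marker file rule can be summarized in a few lines of code. This is only an illustrative sketch of the behavior described above, not the CRX implementation: the class StartupRole, the enum Role, and both method names are hypothetical.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical illustration of the marker file rule; not CRX code.
public class StartupRole {

    enum Role { SLAVE_ONLY, MASTER_OR_SLAVE }

    // If the marker exists, the node was last part of a multi-node cluster
    // and may only start as a slave.
    static Role allowedRole(boolean markerFileExists) {
        return markerFileExists ? Role.SLAVE_ONLY : Role.MASTER_OR_SLAVE;
    }

    // Checks for 'clustered.txt' in the repository root directory.
    static Role allowedRole(Path repositoryRoot) {
        return allowedRole(Files.exists(repositoryRoot.resolve("clustered.txt")));
    }

    public static void main(String[] args) {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println(root.toAbsolutePath() + ": " + allowedRole(root));
    }
}
```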
The maximum time to wait for a master can be changed by setting the system property "com.day.crx.core.cluster.WaitForMasterRetries" (default: 60). The value is the number of attempts to connect to a master; there is a one-second delay after each attempt, so by default the node waits at most 60 seconds before giving up.
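The retry behavior can be pictured as a simple loop. The sketch below is an assumption about the general shape of such a loop, not the actual CRX code: only the property name and its default of 60 come from the text above, while the class WaitForMaster, the method waitForMaster, and the tryConnect callback are hypothetical.

```java
import java.util.function.BooleanSupplier;

// Hypothetical illustration of a bounded wait-for-master loop; not CRX code.
public class WaitForMaster {

    static final String RETRIES_PROPERTY = "com.day.crx.core.cluster.WaitForMasterRetries";

    // Tries to connect up to 'retries' times, sleeping 'delayMillis' after each
    // failed attempt. Returns true as soon as one attempt succeeds.
    static boolean waitForMaster(BooleanSupplier tryConnect, int retries, long delayMillis) {
        for (int attempt = 0; attempt < retries; attempt++) {
            if (tryConnect.getAsBoolean()) {
                return true;
            }
            try {
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Reads the system property, falling back to the documented default of 60.
        int retries = Integer.getInteger(RETRIES_PROPERTY, 60);
        // Simulated master that becomes reachable on the third attempt.
        int[] attempts = {0};
        boolean connected = waitForMaster(() -> ++attempts[0] >= 3, retries, 1000L);
        System.out.println(connected ? "connected to master" : "startup failed: no master found");
    }
}
```

On a real installation the property would be passed to the JVM on startup, for example with -Dcom.day.crx.core.cluster.WaitForMasterRetries=120 to wait up to two minutes.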
To re-join a cluster node that is out of sync, the repository needs to be brought back in sync. There are multiple ways to achieve this:
- Create a new repository and join the cluster node as normal.
- Use the Online Backup feature to create a cluster node (see paragraph below). In many cases this is the fastest way to add a cluster node.
- Restore a backup of this cluster node and start.
- As described in the error message, delete the index and data tar files that are out of sync on this cluster node and restart. Please note the Lucene index may still be out of sync unless it is also deleted. This procedure is discouraged because it requires more knowledge of the repository and may be slower than using the online backup feature (especially if the Lucene index needs to be rebuilt).
Using Online Backup to Create a Cluster Node