At first, replicating EhCache seems like a very easy task: just configure ehcache.xml with RMI and you are done. But is it really that simple?
RMI
At the beginning it seems so: the cache is replicated and everything works. At some point, however, you notice something like this in the log:
Exception on replication of putNotification. Error unmarshaling return header; nested exception is:
java.net.SocketTimeoutException: Read timed out. Continuing...
java.rmi.UnmarshalException: Error unmarshaling return header; nested exception is:
java.net.SocketTimeoutException: Read timed out
at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:209)
at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:142)
...
at net.sf.ehcache.Cache.put(Cache.java:1339)
at net.sf.ehcache.hibernate.EhCache.put(EhCache.java:141)
at org.hibernate.cache.ReadWriteCache.put(ReadWriteCache.java:159)
at org.hibernate.engine.loading.CollectionLoadContext.addCollectionToCache(CollectionLoadContext.java:313)
This usually happens when two nodes of the cluster try to put or update the same element at the same time. The usual response to timeout problems is to increase socketTimeoutMillis, but in this case that only made the problem worse: the response never came, and the node became much slower. In the end, after days and days of Wireshark dumps, we concluded that the problem was most likely in the EhCache RMI implementation and decided to switch to JGroups.
JGroups
At first, again, JGroups seems very easy to use: just add the jgroups jar and ehcache-jgroupsreplication-1.x.jar, where "x" depends on the JGroups version. For the latest EhCache (2.7.5) and the latest JGroups (3.4.2) you need version 1.7. Again, everything seems to work with the default configuration (UDP). The nodes are communicating and everybody is happy. In most cases that is the end of the story, but you might have a virtualized cluster, say Linux guests on KVM, where a few minutes after the nodes have started and begun communicating, the communication stops working. This is quite baffling, and again you might spend days and days searching. Is it EhCache, is it JGroups, is it the (virtual) network? Then by chance you find out that this is KVM-related and a known issue. The only thing to do is to run, on the host:
echo 1 > /sys/class/net/virbr0/bridge/multicast_querier
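Before changing anything, it can help to check what the flag is currently set to. The following is only a minimal sketch: the function name `multicast_querier_enabled` and its `sysfs_root` parameter are made up for illustration, and it assumes the bridge is called virbr0, as in the command above.

```python
import os


def multicast_querier_enabled(bridge="virbr0", sysfs_root="/sys/class/net"):
    """Return True or False for the bridge's multicast_querier flag.

    Returns None if the file cannot be read (no such bridge, not Linux,
    missing permissions). The helper name and the sysfs_root parameter
    are illustrative assumptions; only the sysfs path comes from the
    echo command above.
    """
    path = os.path.join(sysfs_root, bridge, "bridge", "multicast_querier")
    try:
        with open(path) as fh:
            return fh.read().strip() == "1"
    except (IOError, OSError):
        return None
```

Calling `multicast_querier_enabled()` on the KVM host should return False before the echo command is run and True afterwards; None means the bridge was not found under that name.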
Still, until you find this out you might end up updating EhCache, updating JGroups, rewriting an EhCache JGroups configuration that does not work with the new version, and all in all losing lots and lots of time. Here is a very simple Python script that converts the JGroups .xml configs into EhCache-style configurations:
<pre lang="python">
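#!/usr/bin/env python
# Convert an XML protocol stack config into an EhCache "connect=" string.
import sys
import xml.etree.ElementTree as ET

tree = ET.parse(sys.argv[1])
root = tree.getroot()
eh = 'connect='
first_protocol = True
for child in root:
    # Protocols are separated by ':' (one protocol per line)
    if not first_protocol:
        eh += ':\n'
    first_protocol = False
    # Strip the '{namespace}' prefix from the tag, if there is one
    eh += child.tag.split('}')[-1]
    if child.attrib:
        # Attributes become 'key=value' pairs separated by ';'
        pairs = [k + '=' + v for k, v in child.attrib.items()]
        eh += '(' + ';'.join(pairs) + ')'
print(eh)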