Replicated EhCache, the uneasy road

At first, replicating EhCache seems a very easy task: just configure ehcache.xml with RMI and you are ready. Is it so?

RMI

At the beginning it seems so: the cache is replicated and everything works. However, at some point you notice something like this in the log:

Exception on replication of putNotification. Error unmarshaling return header; nested exception is:
java.net.SocketTimeoutException: Read timed out. Continuing...
java.rmi.UnmarshalException: Error unmarshaling return header; nested exception is:
java.net.SocketTimeoutException: Read timed out
at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:209)
at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:142)
...
at net.sf.ehcache.Cache.put(Cache.java:1339)
at net.sf.ehcache.hibernate.EhCache.put(EhCache.java:141)
at org.hibernate.cache.ReadWriteCache.put(ReadWriteCache.java:159)
at org.hibernate.engine.loading.CollectionLoadContext.addCollectionToCache(CollectionLoadContext.java:313)

This usually happens when two nodes of the cluster try to put or update the same element at the same time. The usual response to timeout problems is to increase socketTimeoutMillis, but in this case it only made the problem worse: the response never came, so the node just became much slower. In the end, after days and days of Wireshark dumps, we realized the problem was most likely in EhCache's RMI implementation and decided to switch to JGroups.

JGroups

At first, again, JGroups seems very easy to use: just add the JGroups jar and ehcache-jgroupsreplication-1.x.jar, where “x” depends on the JGroups version. For the latest EhCache (2.7.5) and the latest JGroups (3.4.2) you need version 1.7. Once more, everything seems to work with the default configuration (UDP). Nodes are communicating, everybody is happy. In most cases that is the end of it, but you might have a virtualized cluster, say Linux guests on KVM, where communication stops working a few minutes after the nodes have started and found each other. This is quite baffling, and again you might spend days and days searching. Is it EhCache, is it JGroups, is it the (virtual) network? Then by chance you find out that this is KVM related and a known issue. The only thing to do is to run, on the host:

echo 1 > /sys/class/net/virbr0/bridge/multicast_querier
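
If you want to check the flag from a script before (or instead of) blindly echoing into it, something along these lines should do. It is only a sketch: the helper names are made up for this post, and it assumes the bridge is called virbr0 as in the command above.

```python
from pathlib import Path


def querier_file(bridge="virbr0"):
    """Path of the sysfs flag for a bridge (virbr0 is the one used above)."""
    return Path("/sys/class/net") / bridge / "bridge" / "multicast_querier"


def querier_enabled(flag_file):
    """True if the flag file contains 1."""
    return Path(flag_file).read_text().strip() == "1"


def enable_querier(flag_file):
    """Same effect as: echo 1 > flag_file (needs root on a real host)."""
    Path(flag_file).write_text("1\n")


if __name__ == "__main__":
    flag = querier_file()
    if flag.exists():
        print(flag, "enabled" if querier_enabled(flag) else "disabled")
    else:
        print(flag, "not found (no such bridge on this machine)")
```

Run it on the KVM host, not in the guests.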

Still, until you find this out you might end up updating EhCache, updating JGroups, and changing the EhCache JGroups config, which no longer works with the new version, and all in all losing lots and lots of time. Here is a very simple Python script which converts JGroups .xml configs into EhCache-style configurations:

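<pre lang="python">
#!/usr/bin/env python3
# Convert a JGroups XML protocol stack into an EhCache-style "connect=" string.
# Pass the path to the JGroups XML file as the first argument.

import sys
import xml.etree.ElementTree as ET

root = ET.parse(sys.argv[1]).getroot()

protocols = []
for child in root:
    # Drop the XML namespace prefix ("{urn:...}UDP" -> "UDP"), if there is one.
    name = child.tag.split('}')[-1]
    if child.attrib:
        # Attributes become "key=value" pairs separated by ';'.
        attrs = ';'.join('%s=%s' % (k, v) for k, v in child.attrib.items())
        name = '%s(%s)' % (name, attrs)
    protocols.append(name)

# Protocols are separated by ':' and printed one per line.
print('connect=' + ':\n'.join(protocols))
</pre>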
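
To see what the conversion produces, here is the same logic wrapped in a function and fed a tiny inline config. The three-protocol sample is invented purely for illustration and is not a working JGroups stack; the function name is likewise just for this example.

```python
import xml.etree.ElementTree as ET

# Made-up, deliberately tiny JGroups-style config (not a usable stack).
SAMPLE = """<config xmlns="urn:org:jgroups">
    <UDP mcast_port="45588"/>
    <PING/>
    <pbcast.NAKACK2 use_mcast_xmit="false"/>
</config>"""


def to_connect_string(xml_text):
    """Turn a JGroups XML protocol stack into an EhCache-style connect string."""
    root = ET.fromstring(xml_text)
    protocols = []
    for child in root:
        # Strip the namespace prefix, if any.
        name = child.tag.split('}')[-1]
        if child.attrib:
            attrs = ';'.join(f'{k}={v}' for k, v in child.attrib.items())
            name = f'{name}({attrs})'
        protocols.append(name)
    return 'connect=' + ':\n'.join(protocols)


result = to_connect_string(SAMPLE)
print(result)
# connect=UDP(mcast_port=45588):
# PING:
# pbcast.NAKACK2(use_mcast_xmit=false)
```

The output is what goes into the properties of the JGroups peer provider in ehcache.xml.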
