Tuesday, April 18, 2006

What does 'reliable' mean?

Bobby Woolf discusses WS-ReliableMessaging here. His post reminded me of a conversation I had with a colleague a couple of weeks ago. He was specifying some standards to be used on a large project and wanted to turn his intuitive understanding of the word "reliable" into a prescription for what technology to use.

The scenario was a synchronous request/response interaction. SOAP/HTTP isn't reliable. The request might not get there. The request might get there but the response might get lost in transit (e.g. one or other party crashes at just the wrong time). Maybe the response would get back if it were allowed to but you have timed out. In short, how do you know whether your request has been actioned and whether it has been actioned successfully?
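To make the ambiguity concrete, here is a small simulation. It is only a sketch: the `Failure` enum, the `Server` class and the `call` function are names I've made up for illustration, and no real SOAP or HTTP stack is involved. It shows that the caller sees the same thing in every failure case, even though the server may or may not have actioned the request.

```python
from enum import Enum


class Failure(Enum):
    """Hypothetical failure modes for one request/response exchange."""
    NONE = "none"
    REQUEST_LOST = "request lost"
    RESPONSE_LOST = "response lost"
    CLIENT_TIMEOUT = "client timed out"


class Server:
    """Hypothetical server that counts how often it actioned a request."""

    def __init__(self):
        self.actioned = 0

    def handle(self):
        self.actioned += 1
        return "ok"


def call(server, failure):
    """Return what the caller observes: "ok" or "unknown"."""
    if failure is Failure.REQUEST_LOST:
        # The request never arrived, so the server did nothing.
        return "unknown"
    response = server.handle()
    if failure in (Failure.RESPONSE_LOST, Failure.CLIENT_TIMEOUT):
        # The server did the work, but the caller never hears about it.
        return "unknown"
    return response


for failure in Failure:
    server = Server()
    seen = call(server, failure)
    print(f"{failure.value}: caller sees {seen!r}, actioned {server.actioned} time(s)")
```

A lost request and a lost response both look like "unknown" to the caller, yet in one case nothing happened and in the other the operation went through.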

An obvious answer to this is to use asynchronous messaging: put the request on a queue, trust the messaging infrastructure to deliver it (or tell you that it can't). Wait for the response. If you or the recipient die at any point or if the network goes down at an inopportune moment, who cares? The messaging layer will take care of it all for you.

The colleague's problem was that he needed to be vendor-agnostic and "internet-friendly". JMS wasn't on the cards and WS-RM doesn't cut it yet. So, what can we do?

The answer in this case was quite nice: what we want in this situation isn't really "reliability"... it's "certainty". We just want to know if it worked or not. The correct tool in our kit bag for this problem isn't asynchronous messaging. It's transactionality.

Consider the problem again:

There's an operation we want to perform. We really need to know if it happened or not (we can't risk it happening twice or not at all). Unfortunately, we have to assume that the communication could fail at any point and that the application at either end could crash at any time.

The problem with the non-transactional, non-queued request-response pattern is that, as soon as we make the request, we have to pray. The operation may happen or it may not. We may get to find out or we may not.

Now consider the transactional case: we make the request... but we're implicitly saying: "try to do this but don't make it permanent yet." If we get a response then everything's great: we can go ahead and ask the transaction coordinator to make it permanent ("commit"), or just forget about it if there was a problem. But now consider what happens if something goes wrong after we've made the initial request. We don't know if the operation was successful or not. But we no longer care... we know it hasn't happened because we never committed it. When everything comes back up, the transaction coordinator will take care of tidying up all the junk that's lying around and we can just go ahead and try again.
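Here is a toy version of that flow, as a sketch only. It is an in-memory stand-in, not WS-AtomicTransaction or any real coordinator, and `Participant`, `Coordinator`, `attempt` and the `lose_response` flag are all hypothetical names invented for the example. The point it tries to show is that work which was never committed can always be thrown away and retried.

```python
import itertools


class Participant:
    """Hypothetical service that keeps work pending until told to commit."""

    def __init__(self):
        self.pending = {}    # txid -> (key, value), not yet permanent
        self.committed = {}  # key -> value, permanent

    def prepare(self, txid, key, value):
        # "Try to do this but don't make it permanent yet."
        self.pending[txid] = (key, value)

    def commit(self, txid):
        key, value = self.pending.pop(txid)
        self.committed[key] = value

    def rollback(self, txid):
        self.pending.pop(txid, None)


class Coordinator:
    """Hypothetical in-memory transaction coordinator."""

    def __init__(self, participant):
        self.participant = participant
        self.open = set()  # transactions begun but not yet committed
        self._ids = itertools.count(1)

    def begin(self):
        txid = next(self._ids)
        self.open.add(txid)
        return txid

    def commit(self, txid):
        self.participant.commit(txid)
        self.open.discard(txid)

    def recover(self):
        # Tidy up: anything still open was never committed, so discard it.
        for txid in list(self.open):
            self.participant.rollback(txid)
            self.open.discard(txid)


def attempt(coordinator, key, value, lose_response=False):
    """Return True if the operation was committed, False if the outcome is unknown."""
    txid = coordinator.begin()
    coordinator.participant.prepare(txid, key, value)
    if lose_response:
        # Simulated failure: we never hear back, so we never commit.
        return False
    coordinator.commit(txid)
    return True


participant = Participant()
coordinator = Coordinator(participant)
attempt(coordinator, "order-1", "paid", lose_response=True)  # outcome unknown
coordinator.recover()                                        # junk tidied up
attempt(coordinator, "order-1", "paid")                      # safe to retry
```

After the failed attempt nothing is in `participant.committed`, so the retry can't cause the operation to happen twice. A real coordinator would also have to survive its own crashes by logging its decisions durably, which this sketch ignores.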

It wasn't WS-ReliableMessaging he wanted; it was WS-AtomicTransaction.

Maybe the old-timers (if it's safe to call Danny Sabbah an old-timer...) are right... there really is nothing new in IT.
