Tuesday, November 23, 2004

Convoy processing

Those of you who subscribe to the various microsoft.public.biztalk.* newsgroups may have seen a number of posts from me about convoy processing, and the perils of correlation set subscriptions. The end result of all of this has been quite dramatic in terms of our understanding of convoys, and has resulted in us removing them from our design. This is obviously quite a serious step, so I thought it might be worth passing on some of our recent experience.

First, a quick recap on the messaging architecture. The pub-sub model is described pretty well here, amongst others, so I won't repeat it; suffice to say that messages are delivered to the messagebox through receive locations, and then matched to subscriptions using a combination of message type (as defined by the schema), and any applied filters. Subscribers include send ports (for content-based routing), orchestrations, and orchestration instances (in the case of correlation, and therefore convoys.)

If a subscription exists, then any matching message that arrives in the messagebox will be consumed.
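To make the matching behaviour concrete, here is a minimal sketch of the messagebox pub-sub model described above. The class and function names are hypothetical - this is not the BizTalk API, just the "message type plus filter" matching logic:

```python
# Illustrative sketch of messagebox subscription matching.
# Names (Subscription, deliver, etc.) are hypothetical, not BizTalk's.

class Subscription:
    def __init__(self, name, message_type, filter_fn=None):
        self.name = name
        self.message_type = message_type   # matches the schema type
        self.filter_fn = filter_fn         # optional property filter

    def matches(self, message):
        if message["type"] != self.message_type:
            return False
        return self.filter_fn is None or self.filter_fn(message)

def deliver(message, subscriptions):
    """Route a message to every matching subscriber, as the messagebox does."""
    # an unmatched message would be suspended rather than consumed
    return [s for s in subscriptions if s.matches(message)]

subs = [
    Subscription("SendPort_Invoices", "Invoice"),
    Subscription("Orch_BigOrders", "Order",
                 filter_fn=lambda m: m.get("Total", 0) > 1000),
]

print([s.name for s in deliver({"type": "Order", "Total": 5000}, subs)])
# -> ['Orch_BigOrders']
```

The point to hold on to is that matching is entirely declarative: once the subscription exists, any matching message is consumed, whatever state the subscriber happens to be in.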

So - how are subscriptions created, and when? The easiest way to check this is to use the BTSSubscriptionViewer utility, found in the SDK\Utilities directory. Using this you can see that in the case of send ports and orchestrations, the subscriptions exist all the time that the artefact is enlisted. For correlation sets things are a little more complex, and this is where it started to unravel for us.

Correlation subscriptions are created when the correlation set is initialised - through a Receive or Send shape within an orchestration. In a common sequential convoy scenario, the set is initialised, and then a Listen shape is used to pick up the correlated messages (see Alan Smith's sample here.) It's easy to assume that until the Listen shape is reached, messages will be discarded regardless of correlation matches - however this is not so! Once the subscription is created, messages will be consumed by the orchestration regardless of whether the Listen shape has been reached.

There are two scenarios that we have come across where this causes problems:

1. When there is a delay of some sort after the initialisation, during which new messages might reasonably be expected to be discarded, not consumed.
2. When receiving correlated messages at a faster rate than the orchestration is capable of processing them.

We hit the first scenario when using an orchestration to manage a publication schedule. We were receiving an initial message, sleeping until a given date, then sending the first message on and entering a Listen-Loop, which was used to hoover up updates to the initial message. We found that updates received during the initial delay were being published rather than discarded. There is a workaround for this: use a fake Send shape just before the Listen-Loop to initialise the correlation set at the last minute. It's ugly, but it works.
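The effect of moving the initialisation is easy to see in a toy model. The sketch below (timestamps and names are made up) treats the subscription as active from the moment it is created, and counts how many updates get swallowed before the Listen-Loop starts:

```python
# Sketch of early vs. late correlation initialisation.
# Updates are *consumed* from the moment the subscription exists,
# even though the orchestration only *handles* them from listen_from.

def consumed_updates(update_times, subscribe_at, listen_from):
    consumed = [t for t in update_times if t >= subscribe_at]
    # consumed before the loop starts: swallowed, never handled
    wasted = [t for t in consumed if t < listen_from]
    return consumed, wasted

updates = [5, 30, 60, 90]   # updates arriving during the schedule delay
listen_from = 80            # the Listen-Loop starts after the delay

# Correlation initialised up front: delay-period updates are consumed.
_, wasted_early = consumed_updates(updates, subscribe_at=0,
                                   listen_from=listen_from)
# Initialised by a fake Send just before the loop: nothing is wasted.
_, wasted_late = consumed_updates(updates, subscribe_at=listen_from,
                                  listen_from=listen_from)

print(wasted_early)   # -> [5, 30, 60]
print(wasted_late)    # -> []
```

With late initialisation the delay-period updates simply fail to match any subscription, which is exactly the "discarded, not consumed" behaviour we wanted.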

The second scenario is much more serious, and has proved a show-stopper for us. Consider the situation where an orchestration is not only using a convoy to batch up messages, but processing the messages as well. As in Alan's sample, the batch limits ("completeness conditions") are set by two parameters: a batch size and a timeout value. Either the number of messages processed reaches the set limit, at which point the batch is delivered and the orchestration dies, or there is a sufficiently long gap between incoming messages for the orchestration to deliver the batch as it currently stands and then die. (e.g. if the convoy picks up 10 messages, output them; if it picks up 5 messages and then sits for 10 minutes waiting for the next, output the batch of 5 only.)
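For clarity, those completeness conditions can be sketched in a few lines. This is a hypothetical helper, not BizTalk itself - just the "full batch or quiet period" rule:

```python
# Minimal sketch of the convoy's completeness conditions: a batch is
# released when it reaches max_size, or when the gap since the last
# message exceeds the timeout. (Hypothetical helper, not BizTalk.)

def batch_messages(arrivals, max_size, timeout):
    """arrivals: list of (time, message) pairs. Returns released batches."""
    batches, current, last_time = [], [], None
    for t, msg in arrivals:
        if current and last_time is not None and t - last_time > timeout:
            batches.append(current)      # timed out waiting: ship what we have
            current = []
        current.append(msg)
        last_time = t
        if len(current) == max_size:     # batch full: ship it
            batches.append(current)
            current = []
    if current:
        batches.append(current)          # flush the tail at end of input
    return batches

# 5 messages in quick succession, then a long silence, then 2 more:
arrivals = [(i, f"m{i}") for i in range(5)] + [(700, "m5"), (701, "m6")]
print([len(b) for b in batch_messages(arrivals, max_size=10, timeout=600)])
# -> [5, 2]
```

In the sketch the first five messages never fill the batch, so it is the 600-second silence that ships them - the same "batch of 5 only" case described above.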

In our test orchestration we had the following setup:

- A receive location delivering messages at a rate of 2/second.
- An orchestration picking up the messages in a convoy, and processing each message.
- The processing of each message takes 2 seconds (simulated using a delay.)
- A batch limit of 10 messages, output to a flat file, after which the orchestration dies.

We then sent 100 messages in to the receive location, expecting to see 10 flat files appear.

What we actually saw was 3 files appearing, with no sign of the missing messages. The explanation appears to be as follows:

The correlation set is initialised when the first message is received, at which point a subscription is created for all further messages (all 100 messages matched the correlation.)
The orchestration takes 20 seconds to process the ten messages it requires. However, as the messages are being received at a rate of 2/sec, 40 messages have been delivered to the messagebox in this time, and all of them match the correlation set's subscription. They are therefore consumed, not discarded (yet), and no new orchestration instance is created. This means that a net 30 messages are consumed, but NOT processed. At the end of the 20 seconds, the orchestration dies, and the outstanding 30 messages are discarded.
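The arithmetic above can be checked with a crude discrete-time simulation. The model is a simplification (instances consume every matching message for exactly their 20-second lifetime, and the next instance activates on the first leftover message), but the numbers mirror the test:

```python
# Crude simulation of the failing test: 100 messages at 2/sec, each
# orchestration instance consumes every matching message while it lives,
# but only processes 10 (at 2 seconds each) before completing.

RATE = 2            # messages per second
TOTAL = 100
BATCH = 10
PROC_TIME = 2       # seconds to process one message

arrival = [i / RATE for i in range(TOTAL)]   # message i arrives at i/2 sec

files = processed = discarded = i = 0
while i < TOTAL:
    start = arrival[i]                 # instance activated by message i
    end = start + BATCH * PROC_TIME    # 10 messages * 2s = 20s lifetime
    # every message arriving while the instance lives matches its
    # correlation subscription and is consumed by THIS instance:
    consumed = [t for t in arrival[i:] if t < end]
    files += 1
    processed += min(BATCH, len(consumed))
    discarded += max(0, len(consumed) - BATCH)
    i += len(consumed)                 # next instance starts afterwards

print(files, processed, discarded)     # -> 3 30 70
```

Three files, thirty messages processed, seventy discarded - which matches what we saw on disk.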

If you then look at the services report in HAT, you should see that the orchestration is marked as "Completed with discarded messages". The missing messages should be visible in the messages report, again with the status "Suspended", "Completed with discarded messages". You could, of course, save these messages and manually resubmit them, but obviously in a production environment this is not an option.

The lesson from all of this seems to be that you should always think of convoys in light of the subscriptions that they use to consume messages, and understand when these subscriptions are created, and what might happen to the messages that fall in between the gaps.

Caveat convoy, as they say.

UPDATE: see this for a very informative posting on the background to this problem.


Anonymous said...

Is it an option to keep your orchestration running continuously so that it never loses the subscription?

Hugo Rodger-Brown said...

Nice idea. Unfortunately from a management point of view it's not so practical (in our particular scenario.)

(Update - we've just been told that the entire business process has been modelled incorrectly, so the convoy is no longer relevant :-S )

Anonymous said...

Don't you just love IT! How many times have we all been there before, when the rug is whipped out from underneath your feet!

At least it's a solution to your problem. Good luck with the revised business process.

Anonymous said...

Not sure why you'd use an orchestration to do the batching. Orchestrations are typically used to sequence a set of operations on a pub-sub bus that does not ensure message ordering (well, unless you use MSMQT); to provide compensating transactions; and to increase reliability by using a durable store for the bus.

Hugo Rodger-Brown said...

The easiest answer to your question is simply, why not? Message order is irrelevant, I simply want to collate a fixed number of messages of the same type, using a common identifier. As far as I know this is a fairly standard implementation of the "Aggregator" pattern - using a counter as the completion condition. It actually works rather well in cases where the input rate is reasonably low, and the batch size is small. (See Alan Smith's blog or this whitepaper for an example http://eaipatterns.com/docs/integrationpatterns_biztalk.pdf)

An orchestration is the easiest place to do the batching, as any alternative will involve creating custom persistence (e.g. new SQL db / tables) and some process for determining batch completeness. As it turns out we've had to go down this route for the reasons I've described, but it adds considerably to the complexity of the project, as we have had to engage the internal SQL team and other developers.

I'm not sure I agree with your definition of an orchestration either - surely an orchestration is a means of processing messages of a particular type, according to a well-defined (?!) business process.

Things like compensating transactions are internal implementation considerations that are used to ensure that processes are reliable (or at least consistent).

In our case, the business process involves the individual processing of messages, coupled with the need to present the output to the end-user as a batch - something which convoys are well-suited to, subject to the realities of actual implementation (i.e. not in our situation).