Why BPEL process instances may appear on the Manual Recovery List

- August 12, 2007

Applies to:
Oracle(R) BPEL Process Manager - Version: 10.1.2 to 10.1.3.1

Solution
Messages can appear on the Recovery Console for a number of reasons. To see why, start with how BPEL handles an incoming message.

BPEL has a Delivery Service that intercepts incoming messages. When it intercepts an incoming message, the Delivery Service does two things:

1) It puts a very short JMS message on the in-memory queue. This small JMS message triggers work further downstream in the process.
2) It saves the BPEL message to tables in the dehydration store.

In the same thread, the BPEL engine instantiates or continues the BPEL instance. When the engine finishes processing the message (that is, when it reaches the end of the process or the first dehydration point), it updates the message tables to mark the message as "HANDLED".
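The two-step delivery and the later "HANDLED" update can be pictured with the toy model below. It is a sketch under stated assumptions, not Oracle BPEL's implementation. The names DeliveryService, Engine, intercept, handle_next and run_until_dehydration are all hypothetical. A deque stands in for the in-memory JMS queue and a dict stands in for the invoke_message / dlv_message tables. The state labels are borrowed from this article, not taken from the real schema.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class BpelMessage:
    msg_id: str
    payload: str
    # Label taken from the article; the real tables use their own state codes.
    state: str = "UNRESOLVED"


class DeliveryService:
    """Hypothetical stand-in for the BPEL Delivery Service (not Oracle's API)."""

    def __init__(self):
        self.jms_queue = deque()   # volatile, in-memory queue of tiny trigger messages
        self.message_table = {}    # durable store, like invoke_message / dlv_message

    def intercept(self, msg_id, payload):
        # 1) Put a very short trigger (only the id) on the in-memory queue.
        self.jms_queue.append(msg_id)
        # 2) Save the full BPEL message to the dehydration store.
        self.message_table[msg_id] = BpelMessage(msg_id, payload)


class Engine:
    def __init__(self, delivery):
        self.delivery = delivery

    def run_until_dehydration(self, msg):
        # Placeholder: instantiate or continue the process instance until it
        # ends or reaches its first dehydration point.
        pass

    def handle_next(self):
        msg_id = self.delivery.jms_queue.popleft()   # the trigger is consumed here
        msg = self.delivery.message_table[msg_id]
        self.run_until_dehydration(msg)
        msg.state = "HANDLED"                        # mark the stored message as done
        return msg_id


delivery = DeliveryService()
engine = Engine(delivery)
delivery.intercept("inv-1", "<order/>")
before = delivery.message_table["inv-1"].state   # "UNRESOLVED"
engine.handle_next()
after = delivery.message_table["inv-1"].state    # "HANDLED"
```

The model keeps the trigger and the stored message separate on purpose. The queue entry is only a wake-up signal, and the durable row carries the actual message and its state.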

The Recovery page shows the invoke or callback messages that are NOT in the HANDLED state. If you click fast enough, you will see messages come and go on the Recovery page. So seeing messages on the Recovery page does not necessarily mean there is a problem, unless ...
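The filter the Recovery page applies can be sketched as a simple query over message states. The sketch assumes the table is modeled as a dict of message id to state label. The function name recovery_list is hypothetical. It shows why a message is briefly visible and then disappears: it shows up only during the window before the engine marks it HANDLED.

```python
def recovery_list(message_table):
    """Return ids of invoke/callback messages that are not yet HANDLED (hypothetical)."""
    return sorted(mid for mid, state in message_table.items() if state != "HANDLED")


table = {"inv-1": "UNRESOLVED"}
snapshot_during = recovery_list(table)   # ['inv-1'] while the engine is still working
table["inv-1"] = "HANDLED"               # the engine finishes and updates the table
snapshot_after = recovery_list(table)    # [] once the message is HANDLED
```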

Unless they stay there permanently. A message stays there permanently when no JMS message remains to trigger the WorkerBean to process the unhandled BPEL message in the invoke_message or dlv_message table. It follows that these scenarios can leave BPEL messages in the UNRESOLVED state:

* The server shuts down or crashes before it finishes processing the BPEL message. When the server restarts, the in-memory JMS message associated with the invoke or callback message is already gone.
* The engine cannot finish processing the message before the transaction timeout specified in server.xml expires. In that case it rolls the message back to the message table, but the JMS message associated with the invoke message has already been consumed.

In these cases, no further events (that is, JMS messages) remain to trigger the engine to process the BPEL messages in the invoke_message or dlv_message tables. A manual recovery is therefore needed.
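The model below is a toy sketch of both stuck scenarios and of what a manual recovery amounts to conceptually. It assumes a volatile trigger queue that is lost on restart and a durable message table that survives. The transaction timeout is reduced to hypothetical abstract units (time_budget, work_needed). ToyServer and its methods are illustrative names, not Oracle BPEL APIs. "Recovery" here simply re-creates a trigger for every message that is not HANDLED.

```python
from collections import deque


class ToyServer:
    """Hypothetical model: volatile trigger queue plus durable message table."""

    def __init__(self, message_table=None):
        self.triggers = deque()   # in-memory, lost on crash or restart
        self.table = message_table if message_table is not None else {}  # durable

    def receive(self, msg_id):
        self.triggers.append(msg_id)        # short in-memory trigger
        self.table[msg_id] = "UNRESOLVED"   # persisted message

    def process_next(self, time_budget=None, work_needed=1):
        msg_id = self.triggers.popleft()    # the trigger is consumed up front
        if time_budget is not None and work_needed > time_budget:
            # Simulated transaction timeout: the work rolls back, so the row keeps
            # its UNRESOLVED state, but the trigger is already gone.
            return False
        self.table[msg_id] = "HANDLED"
        return True

    def restart(self):
        # Crash/restart: only the durable table survives; the queue starts empty.
        return ToyServer(self.table)

    def manual_recover(self):
        # Re-create a trigger for every message not in HANDLED, then drain the queue.
        stuck = [m for m, s in self.table.items() if s != "HANDLED"]
        self.triggers.extend(stuck)
        while self.triggers:
            self.process_next()
        return stuck


server = ToyServer()
server.receive("m1")        # scenario 1: the server crashes before m1 is processed
server = server.restart()   # m1's trigger is lost; its row survives
server.receive("m2")
server.process_next(time_budget=1, work_needed=5)   # scenario 2: m2 times out
stuck_before = sorted(m for m, s in server.table.items() if s != "HANDLED")
recovered = server.manual_recover()
```

After the two failures, both messages sit in the table as UNRESOLVED and the trigger queue is empty, so nothing would ever pick them up on its own. In this toy, recovery always succeeds because it runs without a time budget. A real resubmission of a message that timed out could simply time out again.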

References
Bug 4859293 - REQUEST AN AUTOMATED RECOVERY ACTIVITY INTO BPEL CODE FOR FAILED PROCESSES

Comments

javachic said…

I have a problem in production where responses are "stuck" in a callback. We have asynchronous services. When a response is permanently "stuck" in callback, we see messages in the log like "subscriber not found". But if we attempt to "Perform manual recovery", I get a message that multiple instances were found with the same ID.
Those instances are invoked by BPEL itself and the ID is assigned by BPEL, so I don't quite understand why there are multiple instances.

We have Oracle BPEL 10.1.3.3.1 clustered with RAC. We have not seen this issue in DEV, where it is not clustered.

What is working for recovery:

1. Manually retrieve the response message that is stuck in "callback" using the APIs.
2. Manually post the response to the open instance.
3. That keeps it going.

Do you have any idea what is going on, and why responses are getting "stuck" in "callback"?

Thank you,
Tanya.

November 9, 2009 at 12:57 PM