Monday, May 6, 2013

Interpreting JAX-WS client connection errors on WebSphere

The present article describes the exceptions thrown by the WebSphere runtime for various connection related error conditions that may occur during a Web service call performed using a JAX-WS client. This information is useful in determining the root cause of a connection problem when only the exception message in the WebSphere logs is available.

Errors that occur during the establishment of the TCP connection trigger exceptions with informative error messages (such as Connection refused) and are easy to debug. Things are less trivial for exceptions caused by timeouts and premature connection closures, and we will concentrate on this type of error conditions here.

Note that the scenarios described in this article were tested on a WAS 7.0 server running on Linux. Since WebSphere's HTTP transport is based on a proprietary native asynchronous I/O library (and not the Java NIO framework), the error messages may depend on the operating system. Also note that the information in this article doesn't apply to the JAX-WS thin client (com.ibm.jaxws.thinclient_7.0.0.jar) used in standalone applications. The reason is that the JAX-WS runtime only uses the native async I/O library inside the application server, but falls back to Java NIO when running in a standalone application.

Client read timeout

The first case to consider is a client read timeout. This occurs if the server doesn't reply within the configured timeout. In this case, the WebSphere channel framework internally generates the following exception:

com.ibm.io.async.AsyncTimeoutException(Async operation timed out, [Timeout, rc=0])

This exception will be wrapped in another exception:

java.net.SocketTimeoutException: Async operation timed out

This means that client read timeouts are easy to recognize.

Premature connection termination on server side

Another case to consider is the premature termination of the connection on the server side. There are various reasons why this may occur: failure of the target server, failure of a proxy/gateway between the client and the server, a proxy/gateway that forcefully closes the connection because of a timeout, etc.

If an HTTP client connection is closed by the server during a request (i.e. unless the connection is idle), the channel framework internally generates one of the following exceptions:

java.io.IOException: Async IO operation failed (1), reason: RC: 107 Transport endpoint is not connected
java.io.IOException: Async IO operation failed (3), reason: RC: 107 Transport endpoint is not connected

This exception is generated regardless of whether the connection closure is expected or not (as can be seen by activating traces for com.ibm.io.async.AbstractAsyncFuture). According to the HTTP protocol, there is a single case where the server side is expected to close the connection during a request, namely an HTTP response that has no Content-Length header and that is not chunked. In that case, once the HTTP status line and headers have been sent, connection closure is used to signal the end of the response body. Since the connection closure is expected, the exception generated by the channel framework is obviously not thrown to the application. If the connection closure by the server side was actually premature, then the response body will appear truncated (or empty) to the client. This will result in an XML parser error, or in the following exception if the response is completely empty:

org.apache.axis2.AxisFault: The input stream for an incoming message is null.

In all other cases, connection closure by the server side is not expected, and the exception generated by the channel framework is thrown to the application. This includes the following error conditions:

  • The connection is closed before sending the HTTP response code.

  • The HTTP response has a Content-Length header, but the server closes the connection before the specified number of bytes have been sent.

  • The HTTP response is chunked and the server closes the connection in the middle of a chunk or before the final (empty) chunk has been sent.

One important thing to know about the java.io.IOException: Async IO operation failed exception is that it is created in a very peculiar way. If one looks closely at the stack trace, one can see that the IOException object is actually taken from a cache:

java.io.IOException: Async IO operation failed (3), reason: RC: 107  Transport endpoint is not connected
        at com.ibm.io.async.AsyncLibrary$IOExceptionCache.(AsyncLibrary.java:891)
        at com.ibm.io.async.AsyncLibrary$IOExceptionCache.get(AsyncLibrary.java:904)
        at com.ibm.io.async.AsyncLibrary.getIOException(AsyncLibrary.java:918)

The likely reason for this unusual (to say the least...) design is that, as noted above, the exception is generated internally even if the connection closure is actually expected. The cache is therefore required to avoid the large overhead associated with instantiating the exception in those cases where it is not thrown to the application (and where the service invocation actually completes successfully). The issue with that strategy is that the stack trace associated with the exception is not meaningful and has nothing to do with the location where the error actually occurred. It should be completely ignored when debugging java.io.IOException: Async IO operation failed errors.

As shown above, for connection termination the reason code appearing in the error message is always 107, but the code appearing in parentheses just after the Async IO operation failed part may differ from case to case. A value of 1 indicates that the socket was already closed before the read request was processed by the channel framework. This would e.g. occur if the server closes the connection immediately. A value of 3 indicates that the server closed the connection while the read request was pending. This typically occurs if the connection is closed because of a timeout.

Termination of idle persistent connections

To avoid the overhead of creating a new TCP connection for each HTTP request, HTTP 1.1 introduces the concept of persistent connections. According to section 8.1.4 of the HTTP 1.1 specification,

A client, server, or proxy MAY close the transport connection at any time. For example, a client might have started to send a new request at the same time that the server has decided to close the "idle" connection. From the server's point of view, the connection is being closed while it was idle, but from the client's point of view, a request is in progress.

The existence of this race condition is a well known problem with the HTTP protocol and was already identified in early proposals for persistent connections (See section 5.1 of https://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-4.pdf).

It should be noted that section 8.1.2.1 of the HTTP 1.1 specification also requires that

If the server chooses to close the connection immediately after sending the response, it SHOULD send a Connection header including the connection-token close.

Normally, an HTTP server only closes a persistent connection after it has been idle for a certain time or if a certain number of requests have been received on that connection. As shown above, in the latter case, the server is expected to use the Connection: close header to inform the client (since it knows in advance that it will close the connection). That means that in practice, with a well behaved HTTP server, the race condition will only occur if the keep-alive timeout configured on the client side is higher than the keep-alive timeout configured on the server side.

For JAX-WS clients on WebSphere, the default keep-alive timeout is 5 seconds, which is smaller than the keep-alive timeout usually seen on HTTP servers. Therefore with a well behaved HTTP server, the race condition should never occur. If the HTTP server is not so well behaved and closes the persistent connection too quickly (or closes the connection immediately after sending a response without a Connection: close header), then the JAX-WS client may trigger the following error:

java.io.IOException: Async IO operation failed (1), reason: RC: 107  Transport endpoint is not connected

This is consistent with the observations made in the previous section, because the connection closure is initiated by the server side and the socket is already closed when the read request for the response is processed by the channel framework (which explains the code 1).

Section 8.1.4 of the HTTP 1.1 specification describes one possible way to handle this problem:

[C]lients [...] MUST be able to recover from asynchronous close events. Client software SHOULD reopen the transport connection and retransmit the [request] so long as [it] is idempotent [...]. Non-idempotent [requests] MUST NOT be automatically retried.

This strategy is not generally usable because SOAP requests cannot be assumed to be idempotent.

Another strategy is to use the 100 (Continue) status as described in section 8.2.3: the client sends an Expect: 100-continue header and then waits for the intermediate 100 (Continue) response before sending the request body. This allows the client to detect a broken connection before sending the SOAP request. The drawback of this strategy is that it requires an additional network roundtrip, which to some extend defeats the purpose of using persistent connections.

WebSphere supports both mechanisms, but they are not enabled by default (for obvious reasons: they should not be required with well behaved HTTP servers and they both are problematic). It is possible to enable one of these strategies on specific JAX-WS clients by configuring a HTTP transport policy.

Unfortunately, as of WebSphere 7.0.0.27 (and 8.5.0.2) none of these settings actually allow the JAX-WS runtime to recover from a premature termination of an idle connection. If Send 'Expect 100-continue' request header is enabled in the policy, then the following error occurs if the JAX-WS runtime attempts to send a request at the same time the server closes the idle connection:

java.io.IOException: Request not finished yet

With Resend enabled, another exception is triggered:

java.lang.RuntimeException: Invalid call to WsByteBuffer method.  Buffer has already been released.

This means that WebSphere's JAX-WS runtime is not able to handle the premature termination of an idle HTTP connection at all, and the only workaround is to disable persistent connections.

1 comment:

  1. Andreas!

    Friend, you are a boss. No doubts about it. Love this post, clear, and useful, a extrange combination when we talk about websphere. Congratulations.

    And after win your simpathy, then, the cuestion: i'm working hard with jax-ws and websphere, and i have a lot of problems with performance. Incredible problems. We had made some changes, most important maybe cacheClassList.persist=true, becouse when GC deletes jax instantiation, performance problems return again. Definitely, big question: after a lot of investigations, we think that the principal problem is declares WS like bared instaed of wrapped, and extra job that websphere needs in this instatiation.. ¿did you have any problem similar working with Websphere?

    Thanks a lot, friend! And congrats again.

    ReplyDelete