Sunday, November 2, 2014

Running WebSphere Application Server (full profile) in a Docker container

This article describes how to run WebSphere Application Server in a Docker container. We are going to use the developer version to create a full profile, but the instructions can easily be adapted to a regular WebSphere version (provided you have an appropriate license). To create the Docker image, download IBM Installation Manager for Linux x86_64 and use the following Dockerfile, after replacing the -userName and -userPassword arguments with your IBM ID:

FROM centos:centos6

RUN yum install -q -y unzip

ADD agent.installer.linux.gtk.x86_64_*.zip /tmp/

RUN unzip -qd /tmp/im /tmp/agent.installer.linux.gtk.x86_64_*.zip && \
 /tmp/im/installc \
   -acceptLicense \
   -showProgress \
   -installationDirectory /usr/lib/im \
   -dataLocation /var/im && \
 rm -rf /tmp/agent.installer.linux.gtk.x86_64_*.zip /tmp/im

RUN /usr/lib/im/eclipse/tools/imutilsc saveCredential \
   -url \
   -userName myuserid \
   -userPassword mypassword \
   -secureStorageFile /root/credentials && \
 /usr/lib/im/eclipse/tools/imcl install \
   -repositories \
   -acceptLicense \
   -showProgress \
   -secureStorageFile /root/credentials \
   -sharedResourcesDirectory /var/cache/im \
   -preferences com.ibm.cic.common.core.preferences.preserveDownloadedArtifacts=false \
   -installationDirectory /usr/lib/was && \
 rm /root/credentials

RUN useradd --system -s /sbin/nologin -d /var/was was

RUN hostname=$(hostname) && \
 /usr/lib/was/bin/manageprofiles.sh -create \
   -templatePath /usr/lib/was/profileTemplates/default \
   -profileName default \
   -profilePath /var/was \
   -cellName test -nodeName node1 -serverName server1 \
   -hostName $hostname && \
 echo -n $hostname > /var/was/.hostname && \
 chown -R was:was /var/was

USER was

# Helper script (named configure.sh here) that updates the configuration when the
# container hostname changes and (re)generates the launch script when necessary
RUN echo -en '#!/bin/bash\n\
set -e\n\
hostname=$(hostname)\n\
node_dir=/var/was/config/cells/test/nodes/node1\n\
launch_script=/var/was/bin/start_server1.sh\n\
old_hostname=$(cat /var/was/.hostname)\n\
if [ $old_hostname != $hostname ]; then\n\
  echo "Updating configuration with new hostname..."\n\
  sed -i -e "s/\"$old_hostname\"/\"$hostname\"/" $node_dir/serverindex.xml\n\
  echo -n $hostname > /var/was/.hostname\n\
fi\n\
if [ ! -e $launch_script ] ||\n\
   [ $node_dir/servers/server1/server.xml -nt $launch_script ]; then\n\
  echo "Generating launch script..."\n\
  /var/was/bin/startServer.sh server1 -script $launch_script\n\
fi\n\
' > /var/was/bin/configure.sh && chmod a+x /var/was/bin/configure.sh

# Speed up the first start of a new container
RUN /var/was/bin/configure.sh

RUN echo -en '#!/bin/bash\n\
set -e\n\
/var/was/bin/configure.sh\n\
echo "Starting server..."\n\
exec /var/was/bin/start_server1.sh\n\
' > /var/was/bin/run.sh && chmod a+x /var/was/bin/run.sh

CMD ["/var/was/bin/run.sh"]

Note that by executing this Dockerfile you accept the license agreement for IBM Installation Manager and WebSphere Application Server for Developers.

Here are some more details about the Dockerfile:

  • Only IBM Installation Manager needs to be downloaded before creating the image. The product itself (WebSphere Application Server for Developers 8.5.5) is downloaded by Installation Manager during image creation. Note that this may take a while. The preserveDownloadedArtifacts=false preference instructs Installation Manager to remove the downloaded packages. This reduces the size of the image.

  • The Dockerfile creates a default application server profile that is configured to run as a non-root user. The HTTP port is 9080 and the URL of the admin console is http://...:9060/ibm/console. New containers should be created with the following options: -p 9060:9060 -p 9080:9080.

    To see the WebSphere server logs, use the following command (requires Docker 1.3):

    docker exec container_id tail -F /var/was/logs/server1/SystemOut.log
  • Docker assigns a new hostname to every newly created container. This is a problem because the serverindex.xml file in the configuration of the WebSphere profile contains the hostname. That is to say, WebSphere implicitly assumes that the hostname is static and will not change after the profile has been created. To overcome this problem the Dockerfile adds a configuration script to the image. That script is executed before the server is started and (among other things) updates the hostname in serverindex.xml when necessary.

  • Docker expects the CMD command to run the server process in the foreground (instead of allowing it to detach) and to gracefully stop the server when it receives a TERM signal. WebSphere's startServer command doesn't meet these requirements. This issue is solved by using the -script option, which tells startServer to generate a launch script instead of starting the server. This launch script has the desired properties and is used by the CMD command. This has an additional benefit: the startServer command itself takes a significant amount of time (it's a Java process that reads the configuration and then starts a separate process for the actual WebSphere server), and skipping it reduces the startup time.

    There is however a problem with this approach. The content of the launch script generated by startServer depends on the server configuration, in particular the JVM settings specified in server.xml. When they change, the launch script needs to be regenerated. This condition is easy to detect, and the script added by the Dockerfile is designed to take care of it.

    The CMD command runs a script that first executes the configuration script and then the launch script. In addition, the configuration script is also executed once during image creation. This speeds up the first start of a new container created from that image, not only because the launch script will already exist, but also because the very first execution of the script typically takes much longer to complete.

Sunday, October 12, 2014

How to change the StAX implementation in an OSGi environment

The StAX API uses the so called JDK 1.3 service provider discovery mechanism to locate providers of its three factory classes (XMLInputFactory, XMLOutputFactory and XMLEventFactory). This mechanism uses the thread context class loader to look up resources under META-INF/services/. Switching to a StAX implementation other than the one shipped with the JRE therefore requires deploying the JAR containing that implementation in such a way that it becomes visible to the thread context class loader.

In a simple J2SE application, the thread context class loader is set to the application class loader and the right way to do this is to add the JAR to the class path. In a JavaEE environment, each application (EAR) and each Web module (WAR) has its own class loader and the specification requires that the application server sets the thread context class loader to the corresponding application or module class loader before invoking a component (servlet, bean, etc.). To change the StAX implementation used by a given application or module, it is therefore enough to add the JAR to the relevant EAR or WAR.
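The lookup described above can be observed in a plain J2SE program; a minimal sketch that simply prints whichever implementation wins the discovery (with no extra JARs on the class path, this is the implementation shipped with the JRE):

```java
import javax.xml.stream.XMLInputFactory;

public class StaxDiscovery {
    public static void main(String[] args) {
        // newInstance() performs the JDK 1.3 style lookup: it checks the
        // javax.xml.stream.XMLInputFactory system property, then searches for
        // META-INF/services/javax.xml.stream.XMLInputFactory resources through
        // the thread context class loader, and finally falls back to the
        // implementation shipped with the JRE.
        XMLInputFactory factory = XMLInputFactory.newInstance();
        System.out.println("StAX implementation: " + factory.getClass().getName());
    }
}
```

With Woodstox on the class path the printed class name would come from the com.ctc.wstx packages; without it, from the JRE's default implementation.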

Things are different in an OSGi environment because each bundle has its own class loader, but the thread context class loader is undefined. JDK 1.3 service provider discovery will therefore not be able to discover a StAX implementation deployed as a bundle. The solution to this problem is to modify the StAX API to replace the JDK 1.3 service provider discovery with an OSGi aware mechanism not relying on the thread context class loader. That modified StAX API would then be deployed as an OSGi bundle itself so that it will be used in place of the StAX API from the JRE. At least two such StAX API bundles exist: one from the Apache Geronimo project and one from Apache ServiceMix.

In the following we will discuss how they work, and in particular how they can be used to switch to Woodstox as StAX implementation. Note that the Woodstox JAR (Maven: org.codehaus.woodstox:woodstox-core-asl) as well as the StAX2 API JAR (Maven: org.codehaus.woodstox:stax2-api) on which it depends already have OSGi manifests and therefore can be deployed as bundles without the need to repackage them.

To use the StAX support from Apache Geronimo, two bundles need to be installed: the Geronimo OSGi registry (Maven: org.apache.geronimo.specs:geronimo-osgi-registry:1.1) and the Geronimo StAX API bundle (Maven: org.apache.geronimo.specs:geronimo-stax-api_1.2_spec:1.2). The OSGi registry tracks bundles that have the SPI-Provider attribute set to true in their manifests. It scans these bundles for resources under META-INF/services/. That information is then used by the StAX API bundle to locate the StAX implementation. This means that Geronimo uses the same metadata as the JDK 1.3 service provider discovery, but requires an additional (non standard) bundle manifest attribute. The stock Woodstox bundle doesn't have this attribute and therefore will not be recognized. Instead, a repackaged version of Woodstox is required. The Geronimo project provides this kind of bundles (Maven: org.apache.geronimo.bundles:woodstox-core-asl), albeit not for the most recent Woodstox versions.

The StAX support from Apache ServiceMix comes as a single bundle to deploy (Maven: org.apache.servicemix.specs:org.apache.servicemix.specs.stax-api-1.2:2.4.0). It scans all bundles for StAX related resources under META-INF/services/, i.e. it uses exactly the same metadata as the JDK 1.3 service provider discovery. This means that it will recognize the vanilla Woodstox bundle and no repackaging is required.

To summarize, the most effective way to switch to Woodstox as the StAX implementation in an OSGi environment is to deploy the following three bundles (identified by their Maven coordinates):

  • org.apache.servicemix.specs:org.apache.servicemix.specs.stax-api-1.2:2.4.0
  • org.codehaus.woodstox:stax2-api
  • org.codehaus.woodstox:woodstox-core-asl

Wednesday, September 10, 2014

Quote of the day

What is to be achieved is the founding of a Central European economic association through common customs agreements, including France, Belgium, Holland, Denmark, Austria-Hungary, Poland and possibly Italy, Sweden and Norway. This association, probably without a common constitutional head, with outward equality among its members but in fact under German leadership, must stabilize Germany's economic dominance over Central Europe.

Any resemblance to actually existing institutions is purely coincidental...

Sunday, July 6, 2014

Chart of the day

Inequality and unionization rates in the USA:


Monday, February 24, 2014

Quote of the day

In those parts of the world where learning and science have prevailed, miracles have ceased; but in those parts of it as are barbarous and ignorant, miracles are still in vogue.

Ethan Allen, Reason the Only Oracle of Man, pamphlet, 1784

Saturday, February 15, 2014

Using byteman to locate deadlock-prone data source access patterns on WebSphere

The other day I came across a very interesting deadlock situation in an application deployed on a production WebSphere server. The deadlock occurred because for certain requests, the application requires more than one concurrent connection from the same JDBC data source in a single thread. This situation arises e.g. when an application uses transaction suspension (e.g. by calling an EJB method declared with REQUIRES_NEW) and the new transaction uses a data source that has already been accessed in the suspended transaction. In that case, the container is required to retrieve a new connection from the pool, resulting in two connections from the same data source being held by the same thread at the same time. Since the size of the connection pool is bounded, this may indeed lead to a deadlock if multiple such requests are processed concurrently. There is a very well written explanation of that problem in the WebSphere documentation:

Deadlock can occur if the application requires more than one concurrent connection per thread, and the database connection pool is not large enough for the number of threads. Suppose each of the application threads requires two concurrent database connections and the number of threads is equal to the maximum connection pool size. Deadlock can occur when both of the following conditions are true:

  • Each thread has its first database connection, and all are in use.
  • Each thread is waiting for a second database connection, and none would become available since all threads are blocked.

To prevent the deadlock in this case, increase the maximum connections value for the database connection pool by at least one. This ensures that at least one of the waiting threads obtains a second database connection and avoids a deadlock scenario.

For general prevention of connection deadlock, code your applications to use only one connection per thread. If you code the application to require C concurrent database connections per thread, the connection pool must support at least the following number of connections, where T is the maximum number of threads:

T * (C - 1) + 1
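The sizing rule quoted above is easy to encode; a trivial sketch (class and method names are mine, not from the WebSphere documentation):

```java
public class PoolSizing {
    /**
     * Minimum connection pool size needed to avoid connection deadlock, given
     * T threads each requiring up to C concurrent connections: T * (C - 1) + 1.
     * With that many connections, at least one thread can always obtain its
     * last connection and make progress.
     */
    static int minPoolSize(int maxThreads, int connectionsPerThread) {
        return maxThreads * (connectionsPerThread - 1) + 1;
    }

    public static void main(String[] args) {
        // The scenario from the text: 4 threads with C = 2 can deadlock on a
        // pool of 4; 4 * (2 - 1) + 1 = 5 connections are needed.
        System.out.println(minPoolSize(4, 2)); // prints 5
        // With C = 1, a single connection is formally sufficient.
        System.out.println(minPoolSize(50, 1)); // prints 1
    }
}
```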

The deadlock situation can be visualized using a resource allocation diagram. With 4 threads, a maximum connection pool size of 4 and C=2, the diagram would look as follows:

Note that since blocked connection requests eventually time out (by default after 3 minutes), the situation is not a real (permanent) deadlock. However, after a given thread is unblocked by a timeout (and the connection held by that thread released), the system will typically reach another deadlock state very quickly because of application requests that have been queued in the meantime (by the Web container if requests come in via HTTP).

What makes this problem so nasty is that it is a threshold phenomenon. Under increasing load the system will at first behave gently: as long as the maximum pool size is not reached, it is not possible for the deadlock to occur and the system will respond in a normal way. If the load increases further, the number of active connections will eventually reach the limit and the probability for the deadlock to occur will become non zero. Once the deadlock materializes, the behavior of the system drastically changes, and the impact is not limited to requests that require multiple concurrent connections per thread: any request depending on the data source (even with C=1) will be blocked. This will rapidly lead to a thread pool starvation, blocking all incoming requests, even ones that don't use the data source. As noted above, connection request timeouts will not necessarily improve the situation, even if the load (in terms of number of incoming requests per unit of time) decreases below the level that initially triggered the deadlock.

To illustrate the last point, assume that the normal response time of the service is of order 100ms and that the maximum connection pool size is 10. In this scenario the threshold above which the deadlock may occur is of order 100 req/s. Once the deadlock occurs, the average response time drastically changes. It will be determined by the connection request timeout configured on the data source, which is 3 minutes by default. The actual average response time will be lower because once a timeout occurs and a connection becomes available in the pool, a certain number of requests may go through without triggering the deadlock again. Let's be optimistic and assume that in that state the average response time will be of order 10 seconds. Then the new threshold will be of order 1 req/s, i.e. for the deadlock to clear there would have to be a drastic decrease in load.
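The back-of-the-envelope thresholds above follow from Little's law: the pool saturates when arrival rate times connection hold time reaches the pool size. A quick sanity check of the numbers in the text:

```java
public class DeadlockThreshold {
    // Little's law: the pool saturates when arrivalRate * holdTimeSeconds
    // reaches poolSize, so the rough deadlock threshold is
    // poolSize / holdTimeSeconds requests per second.
    static double thresholdReqPerSec(int poolSize, double holdTimeSeconds) {
        return poolSize / holdTimeSeconds;
    }

    public static void main(String[] args) {
        // Healthy system: pool of 10, ~100 ms per request -> ~100 req/s.
        System.out.println(thresholdReqPerSec(10, 0.1));
        // Deadlocked system: average response time ~10 s -> ~1 req/s.
        System.out.println(thresholdReqPerSec(10, 10.0));
    }
}
```

This makes the hysteresis visible: once the average response time jumps from 100 ms to 10 s, the load must fall by two orders of magnitude before the deadlock can clear.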

As noted in the WebSphere documentation quoted above, there are two options to avoid the problem. One is to set the maximum pool size for the data source to a sufficiently high value. Note that as long as there are requests with C>1, the maximum connection pool size must be larger than the thread pool size. There are some problems with this option:

  • In many environments there is a limit on the total number of open connections allowed by the database. Configuring large connection pools may cause a problem at that level.
  • Increasing connection pool sizes also increases the maximum number of SQL statements that may be executed concurrently. This may cause problems for the database server in other scenarios.

The other option is to review the application and to make sure that C≤1 for all requests. This raises another interesting question, namely how to identify code for which C>1 without the need to carry out specific load tests that attempt to trigger the actual deadlock or to implement costly code reviews (that would probably miss some scenarios anyway). Ideally one would like to identify such code by simply monitoring the application in a test environment. In principle this should be feasible because the algorithm to detect this at runtime is trivial: if a thread requests a new connection from a pool while it already owns one, take a stack trace and log the event.

It appears that WebSphere Application Server doesn't have any feature that would allow one to do that. On the other hand, this is a typical use case for tools such as BTrace. Unfortunately BTrace is known not to work on IBM JREs because it uses an undocumented feature that only exists in Oracle JREs. There is however a similar tool called Byteman that works on IBM JREs.

The following Byteman script indeed achieves the goal (note that it was written for WAS 8.5 and targets the internal connection pool manager class; it may need some changes to work on earlier versions):

HELPER helper.Helper

RULE connection reservation
CLASS com.ibm.ejs.j2c.PoolManager
METHOD reserve(javax.resource.spi.ManagedConnectionFactory, javax.security.auth.Subject, javax.resource.spi.ConnectionRequestInfo, java.lang.Object, boolean, boolean, int, int, boolean)
AT EXIT
IF true
DO reserved($0, $!)
ENDRULE

RULE connection release
CLASS com.ibm.ejs.j2c.PoolManager
METHOD release(com.ibm.ejs.j2c.MCWrapper, java.lang.Object)
IF true
DO released($0, $1)
ENDRULE

The script simply intercepts the relevant calls to the connection pool manager that are used for reserving and releasing connections. It then extracts the MCWrapper object (MC stands for managed connection; there is one wrapper for each physical connection in the pool) and passes it to a helper class that takes care of the bookkeeping. If the helper detects that two wrappers from the same pool are used by a single thread, it will log that event. The class looks as follows:

package helper;

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class Helper {
  // Note: keys are PoolManager instances and values are MCWrapper instances. We need to use Object
  // because the helper is added to the classpath of the server. If we wanted to use the actual
  // classes, we would have to load the helper as a fragment into the bundle.
  private static final ThreadLocal<Map<Object,Set<Object>>> threadLocal
      = new ThreadLocal<Map<Object,Set<Object>>>() {
    protected Map<Object,Set<Object>> initialValue() {
      return new HashMap<Object,Set<Object>>();
    }
  };

  public void reserved(Object poolManager, Object mcWrapper) {
    Map<Object,Set<Object>> map = threadLocal.get();
    Set<Object> mcWrappers = map.get(poolManager);
    if (mcWrappers == null) {
      mcWrappers = new HashSet<Object>();
      map.put(poolManager, mcWrappers);
    }
    // Note that the same MCWrapper may be returned twice if the connection is sharable and requested
    // multiple times in the same transaction (which is OK); we don't need to track that separately
    // because "released" is only called once per MCWrapper.
    mcWrappers.add(mcWrapper);
    if (mcWrappers.size() > 1) {
      System.out.println("Detected concurrent connection requests for the same pool in the same thread!");
      new Throwable().printStackTrace(System.out);
    }
  }

  public void released(Object poolManager, Object mcWrapper) {
    // Note that this method is called only once per MCWrapper for shared connections (i.e. when
    // the MCWrapper is really put back into the pool).
    Set<Object> mcWrappers = threadLocal.get().get(poolManager);
    if (mcWrappers != null) {
      mcWrappers.remove(mcWrapper);
    }
  }
}
That class needs to be added to the class path of the server. The Byteman script itself is enabled by adding the following argument to the JVM command line of the WebSphere server (adapting the paths to the Byteman JAR and the rule script):

-javaagent:/path/to/byteman.jar=script:/path/to/rules.btm
When the detection mechanism is triggered, it will output a dump of the connection pool as well as a stack trace for the code that requests the concurrent connection. The connection pool dump will show at least one connection with a managed connection wrapper linked to a transaction in state SUSPENDED. It is easy to improve the helper class to collect the stack trace for the first connection request as well. Note however that this change requires the helper to save a stack trace for every connection request (even for code with C=1), which would have an impact on performance.
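The suggested improvement could look like the following sketch. This is not IBM code, and the bookkeeping is deliberately simplified (it tracks only the first reservation per pool and returns a flag so the behavior is observable):

```java
import java.util.HashMap;
import java.util.Map;

public class TracingHelper {
    // Maps each pool to the stack trace captured when the current thread
    // took its first connection from that pool.
    private static final ThreadLocal<Map<Object, Throwable>> firstRequest =
            ThreadLocal.withInitial(HashMap::new);

    /** Returns true if a concurrent request on the same pool was detected. */
    public boolean reserved(Object poolManager, Object mcWrapper) {
        Map<Object, Throwable> map = firstRequest.get();
        Throwable first = map.get(poolManager);
        if (first == null) {
            // Capturing a stack trace on every reservation is the cost of this
            // feature; on a busy server this is measurable overhead.
            map.put(poolManager, new Throwable("first connection request"));
            return false;
        }
        System.out.println("Concurrent connection requests in the same thread; first request was:");
        first.printStackTrace(System.out);
        new Throwable("second connection request").printStackTrace(System.out);
        return true;
    }

    public void released(Object poolManager, Object mcWrapper) {
        firstRequest.get().remove(poolManager);
    }
}
```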

Wednesday, February 5, 2014

Reducing the impact of DB2 client reroutes on applications deployed on WebSphere

In a previous blog post I discussed a couple of common pitfalls when using HADR and automatic client reroute with DB2 and WebSphere. In the present post I will analyze another closely related topic, namely how WebSphere and applications deployed on WebSphere react to a client reroute and what can be done to minimize the impact of a failover.

There are a couple of things one needs to be aware of in order to analyze these questions:

  • The failover of a database always causes all active transactions on that database to be rolled back. The fundamental reason is that HADR doesn’t replicate locks to the standby database, as mentioned here. Note that, on the other hand, HADR does ship log records for uncommitted operations (which means that transactions that are rolled back on the primary also cause a roll back on the standby). The standby therefore has enough information to reconstruct the data in an active transaction, but the fact that locks are not replicated implies that it cannot fully reconstruct the state of the active transactions during a failover. It therefore cannot allow these transactions to continue and is forced to perform a rollback.

  • By default, when the JDBC driver performs a client reroute after detecting that a database has failed over, it will trigger a ClientRerouteException (with ERRORCODE=-4498 and SQLSTATE=08506). This exception is mapped by WebSphere to a StaleConnectionException before it is received by the application.

    Note that this occurs during the first attempt to reuse an existing connection after the failover. Since connections are pooled, there may be a significant delay between the failover and the occurrence of the ClientRerouteException/StaleConnectionException.

The correct way to react to a ClientRerouteException/StaleConnectionException would therefore be to reexecute the entire transaction. Obviously there is a special case, namely a reroute occurring while attempting to execute the first query in a transaction. In this situation, only a single operation needs to be reexecuted. Note that this is actually the most common case because it occurs for transactions started after the failover, but that attempt to reuse a connection established before the failover. Typically this is more likely than a failover in the middle of a transaction (except of course on very busy systems or applications that use long running transactions).
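A framework-agnostic sketch of handling that special case (retrying work whose very first statement hit a rerouted connection) might look as follows. TransactionBody and the method names are my own; real code would additionally have to verify that no work had been done in the transaction before the failure:

```java
import java.sql.SQLException;

public class RetryOnStaleConnection {

    /** Work to run inside a single transaction. */
    interface TransactionBody<T> {
        T run() throws SQLException;
    }

    /** Crude stand-in for WebSphere's StaleConnectionException mapping. */
    static boolean isStale(SQLException e) {
        return "08506".equals(e.getSQLState()); // client reroute SQLSTATE
    }

    static <T> T executeTransaction(TransactionBody<T> body) throws SQLException {
        try {
            return body.run();
        } catch (SQLException e) {
            if (isStale(e)) {
                // Only safe if the failure happened on the first statement of
                // the transaction; otherwise the whole transaction must be redone.
                return body.run();
            }
            throw e;
        }
    }

    public static void main(String[] args) throws SQLException {
        // Simulate a pooled connection that fails once with the reroute SQLSTATE.
        final int[] attempts = {0};
        String result = executeTransaction(() -> {
            if (attempts[0]++ == 0) {
                throw new SQLException("connection re-routed", "08506", -4498);
            }
            return "committed";
        });
        System.out.println(result + " after " + attempts[0] + " attempt(s)");
    }
}
```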

The JDBC data source can be configured to automatically handle that special case. This feature is called seamless failover. The DB2 documentation describes the conditions that need to be satisfied for seamless failover to be effective:

If seamless failover is enabled, the driver retries the transaction on the new server, without notifying the application.

The following conditions must be satisfied for seamless failover to occur:

  • The enableSeamlessFailover property is set to DB2BaseDataSource.YES. [...]
  • The connection is not in a transaction. That is, the failure occurs when the first SQL statement in the transaction is executed.
  • There are no global temporary tables in use on the server.
  • There are no open, held cursors.

This still leaves the case where the failover occurs in the middle of a transaction. The DB2 documentation has an example that shows how an application could react in this situation by reexecuting the entire transaction. However, the approach suggested by that example is not realistic for real world applications. There are multiple reasons for that:

  • It requires a lot of boilerplate error handling code to be added to the application. That code would be much more complex than what is suggested by the example. Just to name a few complications that may occur: reuse of the same data access code in different transactions, container managed transactions, distributed transactions, the option to join an existing transaction, transactions started by and imported from remote clients, etc.
  • Writing and maintaining that code is very error-prone. It is very easy to get it wrong, so that instead of reexecuting the current transaction, the code would only partially reexecute the transaction or reexecute queries that are part of a previous transaction that has already been committed. Since the code is not executed during normal program flow, such bugs will not be noticed immediately.
  • It is virtually impossible to test this code. One would have to find a way to trigger or simulate a database failover at a well defined moment during code execution. One would then have to apply this technique to every possible partially executed transaction that can occur in the application. This is simply not realistic.

A more realistic option would be to handle this at the framework level. E.g. it is likely that Spring could be set up or extended to support automatic transaction reexecution in case of a client reroute. If this support is designed carefully and tested thoroughly, then one can reasonably assume that it just works transparently for any transaction, removing the need to test it individually for every transaction.

However, before embarking on this endeavor, you should ask yourself if the return on investment is actually high enough. You should take into account the following aspects in your evaluation:

  • There may be multiple frameworks in use in your organization (e.g. EJB and Spring). Automatic transaction reexecution would have to be implemented for each of these frameworks separately. For some frameworks, it may be impossible to implement this in a way that is transparent for applications.
  • Database failovers are expected to be rare events. If seamless failover is enabled, then only transactions that are active at the time of the failover will be impacted. This means that the failure rate may be very low.
  • When the primary DB2 instance goes down because of a crash, it will take some time before the standby takes over. Even if the application successfully reexecutes the transaction, the client of the application may still receive an error because of timeouts. On the other hand, in case of a manual takeover for maintenance reasons, one can usually reduce the impact on clients by carefully scheduling the takeover.
  • There are lots of reasons why a client request may fail, and database failovers are only one possible cause. Other causes include application server crashes and network issues. It is likely that implementing automatic transaction reexecution would reduce the overall failure rate only marginally. It may actually be more interesting to implement a mechanism that retries requests on the client side for any kind of failure.
  • Message driven beans already provide a retry mechanism that is transactionally safe. In some cases this may be a better option than implementing a custom solution.

The conclusion is that while it is in general a good idea to enable seamless failover, in most cases it is not worth trying to intercept ClientRerouteException/StaleConnectionException and to automatically reexecute transactions.

Sunday, January 19, 2014

Understanding socket bind failures on WebSphere Application Server

If you are familiar with WebSphere Application Server, then you probably have already noticed that sometimes WebSphere seems to have problems reopening TCP ports after a restart. When this occurs, the following kind of error message is logged repeatedly:

TCPC0003E: TCP Channel TCP_2 initialization failed. The socket bind failed for host * and port 9080. The port may already be in use.

Eventually WebSphere will succeed in binding to the port, as indicated by the following message:

TCPC0001I: TCP Channel TCP_2 is listening on host * (IPv6) port 9080.

IBM recently published a technote about this problem. It concludes by claiming that there is no solution for this limitation and [WebSphere] is working as designed. That statement however misses some key points and is actually not quite accurate.

First of all, it is important to understand why the issue occurs. The problem is that on some of the TCP sockets it attempts to put into listen mode, WebSphere doesn't set the SO_REUSEADDR socket option. On Linux you can check that using the tool I presented in a previous blog post. You will see that the SO_REUSEADDR option is set on the sockets used by IIOP (BOOTSTRAP_ADDRESS and ORB_LISTENER_ADDRESS) and the SOAP JMX connector (SOAP_CONNECTOR_ADDRESS), but not on the sockets used by the Web container (WC_defaulthost, WC_defaulthost_secure, WC_adminhost and WC_adminhost_secure), the SIB service (SIB_ENDPOINT_ADDRESS and SIB_ENDPOINT_SECURE_ADDRESS) and the core group service (DCS_UNICAST_ADDRESS).

The problem with not setting SO_REUSEADDR is that the bind to a port will fail as long as there are TCP connections in state TIME_WAIT on that port. A connection will enter this state if the connection termination is initiated by the local end. E.g. when WebSphere decides to close an idle HTTP connection, then that connection will end up in state TIME_WAIT and prevent WebSphere from reopening the HTTP port after a restart. As can be seen in the TCP state diagram, a connection can only leave the TIME_WAIT state after the corresponding timeout is reached. At this point the connection transitions to the CLOSED state and the system discards all information about the connection. That is the reason why after several attempts, WebSphere eventually succeeds in binding to the port. It is also the reason why IBM's technote suggests modifying the timeout value.

More background on the TIME_WAIT state and the SO_REUSEADDR option can be found in section 18.6, "TCP State Transition Diagram" of W. Richard Stevens' classic textbook "TCP/IP Illustrated, Vol. 1":

The TIME_WAIT state is also called the 2MSL wait state. Every implementation must choose a value for the maximum segment lifetime (MSL). It is the maximum amount of time any segment can exist in the network before being discarded. [...]

Given the MSL value for an implementation, the rule is: when TCP performs an active close, and sends the final ACK, that connection must stay in the TIME_WAIT state for twice the MSL. This lets TCP resend the final ACK in case this ACK is lost (in which case the other end will time out and retransmit its final FIN).

Another effect of this 2MSL wait is that while the TCP connection is in the 2MSL wait, the socket pair defining that connection (client IP address, client port number, server IP address, and server port number) cannot be reused. That connection can only be reused when the 2MSL wait is over.

Unfortunately most implementations (i.e., the Berkeley-derived ones) impose a more stringent constraint. By default a local port number cannot be reused while that port number is the local port number of a socket pair that is in the 2MSL wait. [...]

Some implementations and APIs provide a way to bypass this restriction. With the sockets API, the SO_REUSEADDR socket option can be specified. It lets the caller assign itself a local port number that's in the 2MSL wait, but we'll see that the rules of TCP still prevent this port number from being part of a connection that is in the 2MSL wait.

The question is now why WebSphere doesn't use SO_REUSEADDR. The technote suggests that this is "by design", but fails to explain the rationale behind that "design". In addition, the quote from Stevens' book clearly shows that setting SO_REUSEADDR does no harm, and it is what most other server processes do.
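With the Java sockets API (on which WebSphere's listening channels are ultimately built), opting in is a one-liner; a minimal sketch using an ephemeral port:

```java
import java.net.InetSocketAddress;
import java.net.ServerSocket;

public class ReuseAddrDemo {
    public static void main(String[] args) throws Exception {
        // Create the socket unbound so the option can be set before bind():
        // SO_REUSEADDR only influences the bind itself.
        ServerSocket server = new ServerSocket();
        server.setReuseAddress(true);
        server.bind(new InetSocketAddress(0)); // port 0 = any free port

        System.out.println("reuseAddress=" + server.getReuseAddress()
                + ", port=" + server.getLocalPort());
        server.close();
    }
}
```

With reuseAddress set, a restarted process can bind the same port even while old connections linger in TIME_WAIT, which is exactly the behavior WebSphere's affected channels lack by default.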

There is another interesting aspect. I mentioned earlier that the problem only affects some ports opened by WebSphere, in particular the ones used by the Web container, the SIB service and the core group services. Interestingly, these ports are all managed by the WebSphere channel framework. They can easily be identified by looking at the "Ports" page for the server in the admin console:

Ports that have a "View associated transports" link are managed by the channel framework and don't use SO_REUSEADDR. (Note that this suggests that the problem also exists for the ports used by the SIP service, which we didn't consider earlier.)

It turns out that the channel framework actually supports a custom property soReuseAddr that can be used to specify the value of the SO_REUSEADDR option. The corresponding documentation in the infocenter is interesting because it explicitly presents that custom property as a solution for the bind problem, contradicting the statement made in the technote that there is no solution:

Use the soReuseAddr custom property to control bind behavior. When the WebSphere Application Server is restarted, if the inbound TCP channels have problems trying to bind the listening socket, errors are printed into the SystemOut file until either the bind is successful or the number of allowed bind attempts has been passed. This custom property helps to avoid repeated error messages during the bind process.

For inbound TCP channel binding environments, you can avoid the repeated error messages by using the soReuseAddr custom property to affect TCP inbound channel processing. When soReuseAddr is set to 1, the TCP channel is forced to try each bind attempt with the re-use option set to true on the socket, which allows a restarted WebSphere Application Server process to bind on its first attempt, despite sockets remaining in the TIME_WAIT state.

By setting the soReuseAddr property to 1 on all TCP inbound channels, it is indeed possible to avoid the TCPC0003E error entirely. To configure that property on a given port using the admin console, start from the corresponding "View associated transports" link and look for "TCP inbound channel". If you prefer to use admin scripting, write a script that locates all configuration objects of type TCPInboundChannel and adds the soReuseAddr property to these objects.
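Such a script could look as follows in wsadmin Jython. This is a sketch: it assumes the standard AdminConfig API and does not check whether a soReuseAddr property already exists on a channel (running it twice would create duplicates):

```jython
# Run with: wsadmin -lang jython -f setSoReuseAddr.py
# Locate every TCPInboundChannel configuration object in the cell...
for channel in AdminConfig.list('TCPInboundChannel').splitlines():
    # ...and add the soReuseAddr=1 custom property to it.
    AdminConfig.create('Property', channel, [['name', 'soReuseAddr'], ['value', '1']])
AdminConfig.save()
```

In a Network Deployment cell, the nodes still need to be synchronized and the servers restarted for the change to take effect.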

This is still not the complete story though. IBM's technote makes the following statement about the environments in which the problem occurs:

This problem may occur on Red Hat Enterprise Linux Server Release 5.9 with WebSphere Application Server versions -

Obviously the problem is not specific to a particular Linux distribution or version, but that is not the point here. What is more interesting is the range of WebSphere versions. The technote doesn't make it clear whether the problem only exists up to the last version in that range, or whether that was simply the current WAS version when the technote was written. It turns out that it is actually the former: in more recent WAS versions the behavior is indeed no longer the same. Now all listening sockets are created with the SO_REUSEADDR option set by default, and the problem no longer exists.

This would suggest that IBM simply changed the default value of the soReuseAddr custom property to 1. However, a simple test with soReuseAddr=0 shows that WebSphere actually completely ignores the property and always enables SO_REUSEADDR, although the current version of the corresponding infocenter page still mentions the property and specifies that its default value is 0.

Saturday, January 18, 2014

How to suspend HTTP traffic to a WebSphere Application Server

One of the annoying things with the WebSphere plug-in for IBM HTTP Server is that there is no straightforward way to suspend traffic to a given application server. The problem is that the plug-in is not aware of the runtime weights of the members in a WebSphere cluster. The only way to suspend HTTP traffic to a given server is to set the configured weight of the cluster member to zero and then to regenerate and propagate the plug-in configuration file. The plug-in will automatically reread that file and stop sending HTTP requests to the server. Alternatively, one can also edit the plugin-cfg.xml file manually to temporarily set the LoadBalanceWeight to zero.

Obviously this method is cumbersome, especially compared to how this kind of operation is done on other load balancers. On the other hand, one of the advantages of the WebSphere plug-in is that it is able to detect a stopped member and fail over the connections without losing requests: as soon as it detects that the HTTP port on the WebSphere server is closed, it will redirect requests (including the request that caused the attempt to establish the connection to the application server) to other cluster members. Therefore another approach would be to instruct the application server to (temporarily) close its HTTP port(s) in order to force the WebSphere plug-in to route requests to other members.

It turns out that this is indeed possible. Each application server has an MBean of type WebContainer with operations stopTransports and startTransports. The first operation stops all HTTP transports and closes the corresponding ports, i.e. WC_defaulthost and WC_defaulthost_secure (as well as WC_adminhost and WC_adminhost_secure on stand-alone servers and deployment managers). The second operation restores normal operation.

As noted in PK96239, the WebContainer MBean was deprecated in WAS 6.1 and has been replaced by the TransportChannelService MBean. However, the latter is much more difficult to use and as of WAS 8.5.5, the WebContainer MBean is still supported. Therefore using the WebContainer MBean remains the preferred method to do this.
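With wsadmin, suspending and resuming HTTP traffic then comes down to a couple of lines of Jython. This is a sketch: the server name is a placeholder, and it assumes that exactly one WebContainer MBean matches the query:

```jython
# Run inside wsadmin (-lang jython); 'server1' is a placeholder server name
mbean = AdminControl.queryNames('type=WebContainer,process=server1,*').splitlines()[0]
AdminControl.invoke(mbean, 'stopTransports')    # close all HTTP ports

# ...later, to resume normal operation:
AdminControl.invoke(mbean, 'startTransports')
```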

It should also be noted that using the stopTransports operation to suspend HTTP traffic to a WebSphere server may have some drawbacks in certain situations. The most important one is that since the HTTP ports are closed, it is no longer possible to send any kind of HTTP request to the server; in particular, it is no longer possible to send test requests directly to the server. One should also be careful if there are applications deployed on the server that may send HTTP requests to the server itself (via localhost) in response to requests received via protocols other than HTTP, such as IIOP (remote EJB calls) or JMS.

Thursday, January 9, 2014

Quote of the day

Option three is Assad wins. And I must tell you at the moment, as ugly as it sounds, I'm kind of trending toward option three as the best out of three very, very ugly possible outcomes.

Michael Hayden, former head of the CIA

Wednesday, January 1, 2014

Graphic of the day

When a picture is worth a thousand words...

Source: (German).

How TCP backlog works in Linux

When an application puts a socket into LISTEN state using the listen syscall, it needs to specify a backlog for that socket. The backlog is usually described as the limit for the queue of incoming connections.

Because of the 3-way handshake used by TCP, an incoming connection goes through an intermediate state SYN RECEIVED before it reaches the ESTABLISHED state and can be returned by the accept syscall to the application (see the TCP state diagram). This means that a TCP/IP stack has two options to implement the backlog queue for a socket in LISTEN state:

  1. The implementation uses a single queue, the size of which is determined by the backlog argument of the listen syscall. When a SYN packet is received, it sends back a SYN/ACK packet and adds the connection to the queue. When the corresponding ACK is received, the connection changes its state to ESTABLISHED and becomes eligible for handover to the application. This means that the queue can contain connections in two different states: SYN RECEIVED and ESTABLISHED. Only connections in the latter state can be returned to the application by the accept syscall.
  2. The implementation uses two queues, a SYN queue (or incomplete connection queue) and an accept queue (or complete connection queue). Connections in state SYN RECEIVED are added to the SYN queue and later moved to the accept queue when their state changes to ESTABLISHED, i.e. when the ACK packet in the 3-way handshake is received. As the name implies, the accept call is then implemented simply to consume connections from the accept queue. In this case, the backlog argument of the listen syscall determines the size of the accept queue.
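
The second strategy can be sketched as a toy model in Python. This is purely illustrative: real kernel limits differ slightly (for example, Linux actually allows backlog+1 entries in the accept queue), but the structure is the same:

```python
from collections import deque

class TwoQueueListener:
    """Toy model of option 2 (the Linux approach): a SYN queue for
    connections in state SYN RECEIVED and an accept queue for
    ESTABLISHED connections waiting to be handed to the application."""

    def __init__(self, backlog, max_syn_backlog):
        self.backlog = backlog              # size of the accept queue
        self.max_syn = max_syn_backlog      # size of the SYN queue
        self.syn_queue = deque()
        self.accept_queue = deque()

    def on_syn(self, conn):
        """A SYN arrived: queue the half-open connection and reply SYN/ACK."""
        if len(self.syn_queue) >= self.max_syn:
            return 'drop'                   # SYN queue full: drop the SYN
        self.syn_queue.append(conn)         # state: SYN RECEIVED
        return 'SYN/ACK'

    def on_ack(self, conn):
        """The final ACK of the 3-way handshake arrived."""
        if conn in self.syn_queue and len(self.accept_queue) < self.backlog:
            self.syn_queue.remove(conn)
            self.accept_queue.append(conn)  # state: ESTABLISHED
            return 'established'
        return 'ignored'                    # accept queue full: do nothing

    def accept(self):
        """The accept syscall consumes from the accept queue only."""
        return self.accept_queue.popleft() if self.accept_queue else None
```

With backlog=1, a second handshake's final ACK is simply ignored and that connection stays in SYN RECEIVED, which is exactly the behavior analyzed in the rest of this post.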

Historically, BSD derived TCP implementations use the first approach. That choice implies that when the maximum backlog is reached, the system will no longer send back SYN/ACK packets in response to SYN packets. Usually the TCP implementation will simply drop the SYN packet (instead of responding with a RST packet) so that the client will retry. This is what is described in section 14.5, listen Backlog Queue in W. Richard Stevens' classic textbook TCP/IP Illustrated, Volume 3.

Note that Stevens actually explains that the BSD implementation does use two separate queues, but they behave as a single queue with a fixed maximum size determined by (but not necessarily exactly equal to) the backlog argument, i.e. BSD logically behaves as described in option 1:

The queue limit applies to the sum of [...] the number of entries on the incomplete connection queue [...] and [...] the number of entries on the completed connection queue [...].

On Linux, things are different, as mentioned in the man page of the listen syscall:

The behavior of the backlog argument on TCP sockets changed with Linux 2.2. Now it specifies the queue length for completely established sockets waiting to be accepted, instead of the number of incomplete connection requests. The maximum length of the queue for incomplete sockets can be set using /proc/sys/net/ipv4/tcp_max_syn_backlog.

This means that current Linux versions use the second option with two distinct queues: a SYN queue with a size specified by a system-wide setting and an accept queue with a size specified by the application.
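This is easy to observe with a short Python experiment: a listener that never calls accept lets only a limited number of handshakes complete, and further connection attempts stall until they time out. The exact count depends on the kernel version (older kernels let a few extra handshakes complete from the client's point of view; see the discussion of SYN rate limiting below), so treat this as an illustration:

```python
import socket

def completed_connects(backlog, attempts, timeout=0.5):
    """Connect up to `attempts` clients to a listener that never calls
    accept, and count how many 3-way handshakes complete (from the
    client's point of view) before a connect attempt times out."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(backlog)
    port = srv.getsockname()[1]
    clients, done = [], 0
    for _ in range(attempts):
        c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        c.settimeout(timeout)
        try:
            c.connect(("127.0.0.1", port))
            done += 1
            clients.append(c)
        except OSError:
            # The SYN was dropped and not answered within the timeout.
            c.close()
            break
    for c in clients:
        c.close()
    srv.close()
    return done
```

With backlog=1 and attempts=10, only a handful of connects complete; the rest never get a SYN/ACK in time because the accept queue is full.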

The interesting question is now how such an implementation behaves if the accept queue is full and a connection needs to be moved from the SYN queue to the accept queue, i.e. when the ACK packet of the 3-way handshake is received. This case is handled by the tcp_check_req function in net/ipv4/tcp_minisocks.c. The relevant code reads:

        child = inet_csk(sk)->icsk_af_ops->syn_recv_sock(sk, skb, req, NULL);
        if (child == NULL)
                goto listen_overflow;

For IPv4, the first line of code will actually call tcp_v4_syn_recv_sock in net/ipv4/tcp_ipv4.c, which contains the following code:

        if (sk_acceptq_is_full(sk))
                goto exit_overflow;

We see here the check for the accept queue. The code after the exit_overflow label will perform some cleanup, update the ListenOverflows and ListenDrops statistics in /proc/net/netstat and then return NULL. This will trigger the execution of the listen_overflow code in tcp_check_req:

        if (!sysctl_tcp_abort_on_overflow) {
                inet_rsk(req)->acked = 1;
                return NULL;
        }

This means that unless /proc/sys/net/ipv4/tcp_abort_on_overflow is set to 1 (in which case the code right after the code shown above will send a RST packet), the implementation basically does... nothing!

To summarize, if the TCP implementation in Linux receives the ACK packet of the 3-way handshake and the accept queue is full, it will basically ignore that packet. At first, this sounds strange, but remember that there is a timer associated with the SYN RECEIVED state: if the ACK packet is not received (or if it is ignored, as in the case considered here), then the TCP implementation will resend the SYN/ACK packet (with a certain number of retries specified by /proc/sys/net/ipv4/tcp_synack_retries and using an exponential backoff algorithm).

This can be seen in the following packet trace for a client attempting to connect (and send data) to a socket that has reached its maximum backlog:

  0.000 ->  TCP 74 53302 > 9999 [SYN] Seq=0 Len=0
  0.000 ->  TCP 74 9999 > 53302 [SYN, ACK] Seq=0 Ack=1 Len=0
  0.000 ->  TCP 66 53302 > 9999 [ACK] Seq=1 Ack=1 Len=0
  0.000 ->  TCP 71 53302 > 9999 [PSH, ACK] Seq=1 Ack=1 Len=5
  0.207 ->  TCP 71 [TCP Retransmission] 53302 > 9999 [PSH, ACK] Seq=1 Ack=1 Len=5
  0.623 ->  TCP 71 [TCP Retransmission] 53302 > 9999 [PSH, ACK] Seq=1 Ack=1 Len=5
  1.199 ->  TCP 74 9999 > 53302 [SYN, ACK] Seq=0 Ack=1 Len=0
  1.199 ->  TCP 66 [TCP Dup ACK 6#1] 53302 > 9999 [ACK] Seq=6 Ack=1 Len=0
  1.455 ->  TCP 71 [TCP Retransmission] 53302 > 9999 [PSH, ACK] Seq=1 Ack=1 Len=5
  3.123 ->  TCP 71 [TCP Retransmission] 53302 > 9999 [PSH, ACK] Seq=1 Ack=1 Len=5
  3.399 ->  TCP 74 9999 > 53302 [SYN, ACK] Seq=0 Ack=1 Len=0
  3.399 ->  TCP 66 [TCP Dup ACK 10#1] 53302 > 9999 [ACK] Seq=6 Ack=1 Len=0
  6.459 ->  TCP 71 [TCP Retransmission] 53302 > 9999 [PSH, ACK] Seq=1 Ack=1 Len=5
  7.599 ->  TCP 74 9999 > 53302 [SYN, ACK] Seq=0 Ack=1 Len=0
  7.599 ->  TCP 66 [TCP Dup ACK 13#1] 53302 > 9999 [ACK] Seq=6 Ack=1 Len=0
 13.131 ->  TCP 71 [TCP Retransmission] 53302 > 9999 [PSH, ACK] Seq=1 Ack=1 Len=5
 15.599 ->  TCP 74 9999 > 53302 [SYN, ACK] Seq=0 Ack=1 Len=0
 15.599 ->  TCP 66 [TCP Dup ACK 16#1] 53302 > 9999 [ACK] Seq=6 Ack=1 Len=0
 26.491 ->  TCP 71 [TCP Retransmission] 53302 > 9999 [PSH, ACK] Seq=1 Ack=1 Len=5
 31.599 ->  TCP 74 9999 > 53302 [SYN, ACK] Seq=0 Ack=1 Len=0
 31.599 ->  TCP 66 [TCP Dup ACK 19#1] 53302 > 9999 [ACK] Seq=6 Ack=1 Len=0
 53.179 ->  TCP 71 [TCP Retransmission] 53302 > 9999 [PSH, ACK] Seq=1 Ack=1 Len=5
106.491 ->  TCP 71 [TCP Retransmission] 53302 > 9999 [PSH, ACK] Seq=1 Ack=1 Len=5
106.491 ->  TCP 54 9999 > 53302 [RST] Seq=1 Len=0

Since the TCP implementation on the client side gets multiple SYN/ACK packets, it will assume that the ACK packet was lost and resend it (see the lines with TCP Dup ACK in the above trace). If the application on the server side reduces the backlog (i.e. consumes an entry from the accept queue) before the maximum number of SYN/ACK retries has been reached, then the TCP implementation will eventually process one of the duplicate ACKs, transition the state of the connection from SYN RECEIVED to ESTABLISHED and add it to the accept queue. Otherwise, the client will eventually get a RST packet (as in the sample shown above).
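The timestamps of the retransmitted SYN/ACKs in the trace above follow the exponential backoff: with the default tcp_synack_retries of 5, the retransmissions occur roughly 1, 3, 7, 15 and 31 seconds after the initial SYN/ACK (the interval doubles each time), which matches the trace entries at ~1.2, 3.4, 7.6, 15.6 and 31.6 seconds. A tiny model of that schedule:

```python
def synack_retransmit_offsets(retries=5, initial_interval=1.0):
    """Offsets (in seconds, relative to the initial SYN/ACK) at which
    the SYN/ACK retransmissions are sent, doubling the interval each
    time (idealized: no timer jitter)."""
    offsets, t, interval = [], 0.0, initial_interval
    for _ in range(retries):
        t += interval
        offsets.append(t)
        interval *= 2
    return offsets

print(synack_retransmit_offsets())  # → [1.0, 3.0, 7.0, 15.0, 31.0]
```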

The packet trace also shows another interesting aspect of this behavior. From the point of view of the client, the connection will be in state ESTABLISHED after reception of the first SYN/ACK. If it sends data (without waiting for data from the server first), then that data will be retransmitted as well. Fortunately TCP slow-start should limit the number of segments sent during this phase.

On the other hand, if the client first waits for data from the server and the server never reduces the backlog, then the end result is that on the client side, the connection is in state ESTABLISHED, while on the server side, the connection is considered CLOSED. This means that we end up with a half-open connection!

There is one other aspect that we didn't discuss yet. The quote from the listen man page suggests that every SYN packet would result in the addition of a connection to the SYN queue (unless that queue is full). That is not exactly how things work. The reason is the following code in the tcp_v4_conn_request function (which does the processing of SYN packets) in net/ipv4/tcp_ipv4.c:

        /* Accept backlog is full. If we have already queued enough
         * of warm entries in syn queue, drop request. It is better than
         * clogging syn queue with openreqs with exponentially increasing
         * timeout.
         */
        if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1) {
                NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
                goto drop;
        }

What this means is that if the accept queue is full, then the kernel will impose a limit on the rate at which SYN packets are accepted. If too many SYN packets are received, some of them will be dropped. In this case, it is up to the client to retry sending the SYN packet and we end up with the same behavior as in BSD derived implementations.

To conclude, let's try to see why the design choice made by Linux would be superior to the traditional BSD implementation. Stevens makes the following interesting point:

The backlog can be reached if the completed connection queue fills (i.e., the server process or the server host is so busy that the process cannot call accept fast enough to take the completed entries off the queue) or if the incomplete connection queue fills. The latter is the problem that HTTP servers face, when the round-trip time between the client and server is long, compared to the arrival rate of new connection requests, because a new SYN occupies an entry on this queue for one round-trip time. [...]

The completed connection queue is almost always empty because when an entry is placed on this queue, the server's call to accept returns, and the server takes the completed connection off the queue.

The solution suggested by Stevens is simply to increase the backlog. The problem with this is that it assumes that an application is expected to tune the backlog not only taking into account how it intends to process newly established incoming connections, but also as a function of traffic characteristics such as the round-trip time. The implementation in Linux effectively separates these two concerns: the application is only responsible for tuning the backlog such that it can call accept fast enough to avoid filling the accept queue; a system administrator can then tune /proc/sys/net/ipv4/tcp_max_syn_backlog based on traffic characteristics.