Wednesday, May 22, 2013

Customizing WebSphere access logs to include the response time

WebSphere Application Server always had the option to enable HTTP access logging. In the admin console, this feature is called "NCSA access logging" because the log format is compatible with the old NCSA HTTPd. Quite naturally, being based on the original NCSA code, Apache HTTPD also supports the same log format. In addition, Apache HTTPD allows to customize the access log. Until recently, this was not the case in WebSphere and the administrator only had the choice between two standard log formats: "Common" and "Combined".

Customized access logs were introduced by the APARs PM46717 and PM68250 in WAS 7.0.0.25 and 8.0.0.2. They are enabled by a custom property called accessLogFormat that needs to be configured on the HTTP channel. As noted in PM46717 this will only work if the "Use chain-specific logging" option is enabled on the HTTP Channel where the previously mentioned property is enabled. The accessLogFormat property uses a syntax similar to the LogFormat directive in HTTPD. In particular, the %D format string can be used to add the response time to the logs. However, there are a couple of issues with that:

  • As described in the previously mentioned APARs, the time is measured with millisecond accuracy [and] microsecond precision. The last three digits are always 0. That is a bit surprising considering that Java has a method that gives access to a timer with sub-millisecond resolution. The reason may be that in some Java versions that method relies on a per-CPU timer and using it to calculate time differences on a multi-processor system may yield incorrect results. Fortunately, millisecond accuracy should be enough in most cases.
  • According to APAR PM81277, the %D format string was not correctly implemented before WAS 7.0.0.29.
  • On WAS 8.5.0.1 and 8.5.0.2 (other versions have not been tested but are likely affected too) %D actually doesn't represent the time elapsed between the reception of the request and the transmission of the response, but the time elapsed between the establishment of the HTTP connection and the transmission of the response. This is fine if the client uses a new connection for each request, but it obviously makes the feature totally useless with clients that use persistent HTTP connections.

Finally, it is important to note that for obvious reasons the elapsed time measured on the server side always differs from the actual response time seen by the client. If one assumes that network delays are negligible, then one can expect that both are approximately equal. However, there is one case where there can be a large discrepancy between the value reported by WebSphere and the actual response time, namely if the Web container thread pool becomes saturated. The reason is that the HTTP transport in WebSphere uses a single thread pool to process asynchronous I/O events and to let applications handle HTTP requests. That means that if all available threads are blocked in application code (e.g. waiting for the execution of data base requests), then this will also prevent WebSphere from processing I/O events. These I/O events will then be queued in the operating system until more threads become available again. That effectively protects the server from overload, but it introduces a bias into the response time measurement because the timer only starts when WebSphere begins processing the request.

That problem doesn't occur with the access logs produced by a Web server (such as IBM HTTP Server with the WebSphere plug-in) in front of WAS. Therefore, in that kind of setup it is always better to rely on the response time measured by the Web server (which will be much closer to the actual response time seen by the client) and not to use the WAS access logs as a substitute for the Web server access logs.

6 comments:

  1. Hi Andreas, just a note that the same problem with response time accuracy can occur in httpd/IHS because the request can be in the ListenBackLog (some customers use MaxConnections & MaxClients to even out workload distribution and use the ListenBackLog as a queue, although it might also be more normally utilized under heavy load).

    ReplyDelete
    Replies
    1. Thanks for your comment. The point I wanted to make is that when a WebContainer thread pool starvation occurs, this may go completely unnoticed in the WAS access logs, but that in the IHS access logs, one can actually see the increased response time. You are right that when the load/concurrency increases further, then a similar issue will occur on the IHS and the response times in the IHS access logs will also become inaccurate.

      Anyway, right now the most important problem is the bug described in bullet 3 which makes the response time measurement on WAS completely useless if clients use keep-alive connections.

      Delete
  2. Hi Andreas, I believe the persistent connection response time issue has been resolved with APAR PM86708

    ReplyDelete
    Replies
    1. Thanks for letting me know. We will test that when 8.5.5.1 is out.

      Delete
  3. Thanks for a great article on various ways to measure response time.

    Is it possible for us to print the JDBC SQL statements also using Request Metrics?
    Please advise how to achieve this.

    This will help us in pinpointing which SQL is taking how much time.

    Thanks for your guidance,

    Regards
    Arabinda

    ReplyDelete
    Replies
    1. The article describes a single method to measure response time, namely by customizing the WebSphere access logs. This is a feature of the HTTP transport in WebSphere and has nothing to do with Request Metrics.

      Delete