A call came in about slow response time for a new release just deployed into the test environment. The login page was taking over 5 minutes compared to less than 8 seconds in production.
As it as a web based application. The first step was to get on a webex with the client and record the login process using fiddler. While watching the webex I could see that one of the HTTP request was taking a very long time to return from the server (this is a great feature of fiddler is that you can see the icons change as the status changes for each HTTP request).
One the fiddler stop, I went to the HTTP request in fiddler and looked at the statistics. Af few key things can be seen the returned payload was 8.5M so large but not huge for a corporate network. The overall the elapsed time was 5m 20 sec of which
One thing was also to check if this payload was being cached. Looking at Response Header for the returned request then caching is enabled. They have followed the google page speed recommendation and set to Vary.
While on the webex I asked for a wireshark trace to be taken so I could get some visibility into how the network was performing. A wireshark trace only captures the packets at the workstation but it can be useful to see if certain TCP/IP type issues are occurring. What I looked at in the wireshark trace was first the throughput from the client. IT seems that much is capped at just over 0.5 Mbit/s although there is a spike and then nothing.
I also wanted to check if there are any problems reported such as the TCP/IP window being undersized or lots of transmissions etc. A quick check is to use the Expert Information option. Remember to select to use the display filter that concentrates on the traffic between client and server and all the numbers are relative. In this case we are transmitting 16.5K packet so 5 out of order packets is not alot.
Next as I had access to the weblogs for I did a quick analysis to see if any particular client or access point has slower than expected response time. I imported these into Excel an created a pivot table and then for each IP calculated the average response time. When doing this remember to filter on status code being HTTP 200 so that the cache check 304 do not skew the response time calculations.
I had a table similar to the below, (I have changed the IPs and changed the response times to seconds. What is most noticable is that the 10.1.62.11 IP has the slow response times while the others are fine.
|IP||Response Time (sec)||Count|
Talking to the network guys the 10.1.62.11 is the proxy server used by the client to access both the production and test environments. The production users where not reporting any problems and when we tested production with a clear cache the response time was as less than 5 seconds. The initial thought was some packet shaping was being undertaken to prioritise the production traffic. However. the client confirmed this was not enabled on the client network.
Going further into the route between the proxy server it was noticed that traffic for the test environment was going across the Data Centre link as they work in a different data center. Although capacity was available on the link there was actually packet shaping enabled which meant the traffic was being constrained to sharing a 1Gb portion of the link.
With a quick network change to the packet shaper all was well..