They do say the exception proves the rule. In this example, I deviated from how I normally troubleshoot a problem. But it was a bit of an odd problem!
One of the account managers I work with was getting worried. They use data from Dynatrace to calculate an SLA metric for end-user response times. In recent months the response time for a key transaction had jumped significantly, pushing the value into the red. The SLA reports the 95th percentile. So, I was asked to have a look to see what the problem could be.
Looking at the response time for the transaction, the median was very good at 350 ms, so I switched to the slowest 10% (90th percentile), and that was just over one minute. So, the next step was to look at the waterfall diagram for one of the slow pages. The waterfall is shown in the graphic below:
There are a few things of interest:
(1) The user is only using the problem page/transaction.
(2) The page is called every minute. (This suggests that it is called automatically rather than initiated by the user.)
(3) Response time is good and then goes bad, after which it stays at a pretty consistent value.
I was surprised that a page this slow didn’t already have users complaining about it. So I decided to see if I knew any of the users, so that I could just check this with them. As it happens, I recognized one of the users, gave them a call, and the mystery was explained.
The config.aspx page is a status page used by certain users. It transpires that it is accessed from a desktop device within the data center, which means users have to remote desktop onto it from their laptops (don’t ask why). We did some live tests and discovered that the slowdown occurred when the browser was left open and the remote desktop went into sleep mode! Therefore, when we get slow pages there isn’t a real user at the end waiting for a response.
A quick chat with the account manager and we agreed that the page should be excluded from the SLA calculation. Problem sorted!
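For anyone wondering why one unattended status page can wreck a 95th percentile while the median still looks fine, here is a minimal sketch in Python. It is not how Dynatrace calculates anything. The sample data is made up, `orders.aspx` is a hypothetical second page, and the `percentile` and `sla_percentile` helpers are my own names, using the simple nearest-rank method.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples to evaluate")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def sla_percentile(samples, pct=95, excluded_pages=()):
    """samples is a list of (page, response_time_ms) tuples; excluded pages are ignored."""
    times = [ms for page, ms in samples if page not in excluded_pages]
    return percentile(times, pct)


# Made-up data: most config.aspx calls are fast, but the ones fired while
# the remote desktop is asleep take about a minute.
samples = (
    [("config.aspx", 350)] * 85
    + [("config.aspx", 61000)] * 15
    + [("orders.aspx", 400)] * 100  # hypothetical page with real users
)

print("median, all pages:", sla_percentile(samples, pct=50))
print("95th, all pages:", sla_percentile(samples, pct=95))
print("95th, config.aspx excluded:",
      sla_percentile(samples, pct=95, excluded_pages={"config.aspx"}))
```

With these invented numbers the slow calls are only 7.5% of all samples, which is enough to drag the 95th percentile up to the one-minute value while the median stays under half a second. Excluding the page brings the SLA figure back to what real users actually experience.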