Dynatrace Performance Troubleshooting Example

They say the exception proves the rule. In this example I deviate from how I normally troubleshoot a problem. But then, it was a bit of an odd problem!

One of the account managers I work with was getting worried. They use data from Dynatrace to calculate an SLA metric for end user response times, and in recent months the response time for a key transaction had jumped significantly, pushing the value into the red. The SLA reports the 95th percentile. So, I was asked to have a look to see what the problem could be.
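
Before digging in, it is worth remembering what a percentile SLA actually measures. The tiny sketch below (with made-up numbers, not the real Dynatrace data) shows how a modest fraction of very slow requests can push the 95th percentile into the red while the median still looks perfectly healthy:

    import numpy as np

    # Hypothetical response times in seconds: most requests are fast,
    # but ~12% of them take around a minute.
    fast = np.random.normal(0.35, 0.05, 880)   # ~350 ms requests
    slow = np.random.normal(62.0, 2.0, 120)    # ~1 minute requests
    times = np.concatenate([fast, slow])

    print("median:", round(np.percentile(times, 50), 2))  # ~0.35 s, looks fine
    print("90th  :", round(np.percentile(times, 90), 2))  # just over a minute
    print("95th  :", round(np.percentile(times, 95), 2))  # well into the red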

Looking at the response time for the transaction, the median was very good at 350ms, so I switched to the slowest 10% (90th percentile) and that was just over one minute. So, the next step was to look at the waterfall diagram for one of the slow pages. The waterfall is shown in the graphic below:

The waterfall shows the delay is in the OnLoad processing. Typically, OnLoad processing is the execution of JavaScript after the page has loaded.

Normally, I would then try to recreate the problem myself and profile the JavaScript with the browser developer tools, but that takes time and I still had some more digging to do in Dynatrace. Next, I wanted to see how this looked across a user's session, and I noticed something odd. Here is the session for one user.

There are a few things of interest:

(1) The user is just using the problem page/transaction

(2) The page is called every minute (this suggests it is being called automatically rather than initiated by the user).

(3) Response time is good and then goes bad, and the bad value is pretty consistent.

I was surprised that a page this slow didn't already have users complaining about it. So, I decided to see if I knew any of the users so I could simply check with them. As it happens I recognized one of the users, gave them a call, and the mystery was explained.

The config.aspx page is a status page used by certain users. It transpires that it is accessed from a desktop device within the data center, which means users have to remote desktop onto that machine from their laptops (don't ask why). We did some live tests and discovered the slow responses occurred when the browser was left open and the remote desktop went into sleep mode! Therefore, when we see slow pages there isn't a real user at the other end waiting for a response.

A quick chat with the account manager and we agreed that the page should be excluded from the SLA calculation. Problem sorted!

Performance Monitoring: Choosing a Sampling Period

I was recently visiting the Tate St Ives art gallery (I know, I am so cultured / it was raining) and came across an artwork called Measuring Niagara with a Teaspoon by Cornelia Parker. The blurb explains that the work is based on her interest in the absurd task of measuring an enormous object with such a small one, in this case Niagara Falls with a teaspoon.

So, as a performance engineer this got me thinking about how we choose the sampling period when collecting performance data. We generally believe the more granular (the shorter the sampling period) the better, as detail is lost when data is sampled over longer periods. For example, the graphic below shows CPU utilisation sampled at different periods, and as you can see the odd behaviour of the CPU does not become apparent until it is sampled every second.

However, the shorter the sampling period, the more data we collect and have to analyse. There are often times when we don't have the luxury of being able to collect data with a really short sampling period. In which case, how do we choose the best sampling period?

For me, it comes down to what I am looking for in the data. For example, suppose users are complaining about an intermittent problem where, for about 15 minutes during the day, response time is really slow across all user transactions. The delay in incident reporting means I can't rely on a circular buffer holding a couple of hours of highly sampled data that could be flushed just after an incident. So, I need to collect data for the whole day, and I would live with a sampling period of about 5 to 10 minutes. This gives me enough data to correlate changes in resource usage with when the users reported performance issues.
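
As a rough sketch of that trade-off (all the numbers below are invented), here is what a 15 minute CPU spike looks like when a day of per-second data is averaged over different sampling periods:

    import numpy as np

    # A day of per-second CPU data with a 15 minute incident at 14:00.
    seconds_per_day = 24 * 60 * 60
    cpu = np.full(seconds_per_day, 30.0)            # baseline ~30% busy
    spike_start = 14 * 3600
    cpu[spike_start:spike_start + 15 * 60] = 95.0   # the incident

    def worst_sample(data, period_s):
        """Average into buckets of period_s seconds and return the worst bucket."""
        usable = data[: len(data) // period_s * period_s]
        return usable.reshape(-1, period_s).mean(axis=1).max()

    for period in (60, 300, 600, 1800, 3600):
        print(f"{period // 60:>3d} min samples -> peak {worst_sample(cpu, period):.0f}%")
    # 5 and 10 minute samples still show the incident at ~95%; at 30 and 60
    # minutes it is averaged down to ~62% and ~46% and is much easier to miss.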

Another example would be looking for the reasons behind slow response times. Say we are looking for the reasons response time has increased to over 6 seconds. In that case the minimum sampling period I would accept would be 3 seconds.

You can see that in most cases the rule I apply is to sample at no more than 1/2 the duration of the period of interest. Ideally I would go to 1/3 of the period of interest. As they say, you need 3 data points to draw a curve!

There is possibly some science behind this. The Nyquist theorem, used in signal processing, says you need to sample at at least twice the frequency of the analogue signal you are converting. There is a bit of maths to prove the theorem, but intuitively it makes sense that you need at least two data points within the period of interest to detect a noticeable change in the system you are sampling.
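
If you want to see the effect in numbers, here is a small sketch (the 10 minute period and the sampling intervals are purely illustrative) of what happens when you sample above and below that limit:

    import numpy as np

    # A signal that repeats every 10 minutes, sampled every 4 minutes
    # (above the Nyquist rate) and every 12 minutes (below it).
    period_min = 10

    def signal(t):
        return np.sin(2 * np.pi * t / period_min)

    for sample_every in (4, 12):
        t = np.arange(0, 60, sample_every)          # one hour of samples
        print(f"every {sample_every:2d} min:", np.round(signal(t), 2))
    # Sampled every 4 minutes the 10 minute oscillation is still visible;
    # sampled every 12 minutes the points alias into a much slower pattern
    # that looks nothing like the real behaviour.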

How to Manage Complexity (Martin Thompson) Interview

While some people provide the odd cartoon at Christmas, Dave Farley provides something more substantial this festive season. I always like to listen to Martin Thompson on any of his performance work. This interview is a bit more general than purely performance related work, but it is still worth taking the time to listen. Also, I found it very interesting to hear that his consulting engagements are moving away from improving response time towards improving efficiency.

To me, the takeaways from this interview are listed below:

(1) Keep things simple

(2) Concentrate on the business logic

(3) Work in small steps, get feedback and make small incremental changes

(4) Most performance problems he sees are systemic design flaws

(5) Measure and understand (model) the system you are trying to improve, and compare the measurements against the model of what should be achievable

(6) Considering separation of concerns in the design will lead to simpler code (“one class, one thing; one method, one thing”)

(7) If you haven’t heard of it, have a think about Mechanical Sympathy https://dzone.com/articles/mechanical-sympathy

Small vs Large Scale Performance Test Environments

I have just added to the website a presentation that looks at sizing and extrapolation techniques for people considering building a small scale performance test environment instead of a large, full scale one. The presentation considers several approaches.

Factoring – This is where the architecture is easily scaled and therefore the performance test can be undertaken on a subset of the hardware (a trivial factoring calculation is sketched just after this list of approaches).

Dimensioning – The architecture has known bottlenecks that drive the performance, such as a central DB. The performance test environment must contain the bottleneck component, but other components may not need to be representative of a full sized environment.

Modelling – This examines the use of modelling to take results from a small scale environment and predict the results for a larger scale environment.

Flipping – This looks at creating a test environment that can have the correct amount of resources allocated to it for a “full scale” performance test, for example during off hours, and then revert to a smaller scale performance test environment at other times.

Full Scale – Finally the advantages and disadvantages of a full scale performance test environment are discussed.
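
As a trivial illustration of the factoring approach, here is a sketch of scaling the load target down to match a subset of the hardware; the node counts and user numbers are invented for illustration:

    # Hypothetical factoring calculation: the full environment has 8 identical,
    # load balanced web servers handling a 12,000 user peak; the test
    # environment only has 2 of them.
    prod_nodes = 8
    test_nodes = 2
    prod_peak_users = 12_000

    scale_factor = test_nodes / prod_nodes
    test_target_users = int(prod_peak_users * scale_factor)

    print(f"Test at {test_target_users} users "
          f"({scale_factor:.0%} of the production peak) on {test_nodes} nodes")
    # This only works for tiers that genuinely scale out; shared components
    # (e.g. a single central DB) still need to be handled as per the
    # Dimensioning approach above.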

Finally, the caveat for these techniques is that testing on a small scale performance test environment does not guarantee that all performance problems will be discovered, due to application/scalability constraints that may only appear in the full sized environment!

You can download the presentation from here.

How do you know if your Load Test has a bottleneck

The bottleneck in a system may not be obvious. (Life would be easier, but less fun, if they were always easy to find.) This is because there are two types: “hard” and “soft”. Hard bottlenecks are the ones where a resource such as a CPU is working flat out, which limits the ability of the system to process more transactions. A soft bottleneck is some internal limit, such as the number of threads or connections, that once all used limits the ability to process more transactions. So how do you know if you have a bottleneck? If you are looking at the results from a single load test you may not be able to tell; you will need to run multiple load tests at different numbers of virtual users and then see if the number of transactions per second increases with each increase in virtual users. The results can be seen in the two graphs below. The first shows how the throughput (transactions per second) increases and levels off when saturated, and the second shows the response time. You will probably have heard the expression “below the knee of the curve”; this refers to operating at a point to the left of the bend in the response time graph.

Throughput Graph

Response Time Graph

The graphs above were actually generated using a spreadsheet model of a closed loop system. This is like LoadRunner and other testing tools, where there are a fixed number of users that use the system, then wait and return to the system. In reality, the performance graphs may look different from the expected norm. An example from a LoadRunner test is shown below: the first graph shows how the number of VUsers was increased during the test and the second graph shows the increase in response times. In this case the jump in response time is dramatic. However, in some cases the increase in response time will be less dramatic, as the system will start to error at high loads, which distorts the response time figures.

Example LoadRunner VUser Graph

Example LoadRunner Graph Showing Increasing Response Times
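
As an aside, the kind of closed loop spreadsheet model used to generate the first pair of graphs can be sketched in a few lines. The sketch below is a single station exact MVA with a made-up service demand and think time, not the actual spreadsheet:

    # Minimal closed loop model: N users cycling between a think time and one
    # queueing station. Service demand and think time are invented figures.
    def closed_loop(max_users, service_demand_s=0.2, think_time_s=5.0):
        queue = 0.0                                  # mean queue at the station
        for n in range(1, max_users + 1):
            resp = service_demand_s * (1 + queue)    # arrival theorem
            tput = n / (resp + think_time_s)         # interactive response time law
            queue = tput * resp                      # Little's law
            yield n, tput, resp

    for users, tput, resp in closed_loop(100):
        if users % 20 == 0:
            print(f"{users:3d} users: {tput:4.1f} tps, response {resp:5.2f} s")
    # Throughput climbs roughly linearly then flattens near the station limit
    # (1 / 0.2 = 5 tps); past that point response time grows almost linearly
    # with each extra user, giving the classic knee seen in the graphs above.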

Having discovered there is a bottleneck in the system, you then have to start looking for it.

Scalability

Scalability can be defined in many ways. However, in general it describes how an output changes as an input increases. Typically we may think of how throughput changes as we increase the number of CPUs. In a perfect world we would like to have linear scaling. I came across a good example of non-linear scaling in a presentation by Peter Hughes. Imagine you are having a dinner party and have 1 meter square tables, each of which seats 4 people. As it is a dinner party you want to have everybody facing each other as much as possible. So with one table you can seat 4 people.

One table seating 4 guests

To increase the number of guests you need 4 tables, but you can now only seat 8 people.

Four tables seating 8 guests

To increase the number of guests again you need 9 tables, but you can now only seat 12 people.

Nine tables seating 12 guests

If you plot the relationship between guests and tables on a graph it looks like the one below.

Guests vs Tables Graph
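
If you prefer numbers to pictures, here is a tiny sketch of the same relationship; with the tables pushed together into an n x n block, only the seats around the perimeter of the block can be used:

    # Dinner party scalability: an n x n block of 1 meter square tables has
    # n*n tables but only the 4*n perimeter seats are usable.
    for n in range(1, 6):
        tables = n * n
        guests = 4 * n                     # equivalently 4 * sqrt(tables)
        print(f"{tables:2d} tables -> {guests:2d} guests "
              f"({guests / tables:.2f} guests per table)")
    # The guests-per-table ratio keeps falling: a clear example of
    # non-linear (sub-linear) scaling.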

Extrapolation of Load Test Results: is it worth it?

It is a common problem that performance testing is often carried out on smaller scale test environments, but project managers want to know that the system will scale and that response times will not degrade. So can the performance test results be extrapolated? My view on extrapolation is that it is a great technique when used properly, but it does not guarantee that the system you tested will work well on the full sized production environment. The two main reasons for failure are:

1) You have made a mistake in the creation of your model. These mistakes could be simply a poorly built model or a bad assumption. However, with enough time and expertise you can overcome some of these limitations by building a good model.

2) There are “soft” bottlenecks in the system that are only detected at high load. A common example might be a piece of software that is limited to a certain number of threads which, once all used, limit scalability. Some of these “soft” limits might be known by developers beforehand and can be investigated with the model and the test environment, but it is the unknown unknowns that will be the problem on go-live day.

However, this does not mean that extrapolation is bad or should be avoided. While it cannot guarantee that the system will work in production, it can be used to show that the system will fail, and as we all know, avoiding a costly failure is often worth the effort. Using modelling techniques you can estimate the hardware configuration needed for the production system, which can be compared with what is expected to be deployed; if the deployed hardware is undersized you have made a friend of the project manager.
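
To make that last point concrete, here is a minimal sketch of the kind of back-of-the-envelope extrapolation I mean; every figure in it is invented for illustration:

    # Extrapolating a hardware estimate from a small scale test using the
    # Utilisation Law (utilisation = throughput x service demand).
    test_tps = 50            # transactions/sec achieved in the test
    test_cpu_util = 0.40     # average CPU utilisation during the test
    test_cores = 8           # cores in the test environment

    # CPU-seconds consumed per transaction.
    cpu_sec_per_txn = (test_cpu_util * test_cores) / test_tps

    target_tps = 400         # expected production peak
    headroom = 0.70          # don't plan to run cores above 70% busy

    cores_needed = (cpu_sec_per_txn * target_tps) / headroom
    print(f"Estimated production cores needed: {cores_needed:.1f}")
    # If production is sized at, say, 24 cores, this simple model already
    # flags it as undersized (~37 cores estimated), a useful early warning
    # even though it cannot prove the system will work.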

Welcome to my Performance Engineering Blog

Hi, this is a blog that I have started for “fun” about my work as a performance engineer. For some, a performance engineer is a performance tester who can help fix performance problems. A wider definition of a performance engineer is one who helps achieve the performance goals of a project throughout its lifecycle, through development and into production. I suppose I like to feel I am more the latter type of performance engineer. I particularly like performance modelling and prediction. However, we must remember “performance prediction is easy, getting it right is the hard part” (thanks to Dr Ed Upchurch for that quote).

You may also wonder why I have called this blog 1202 Performance. This is because of the 1202 error code that was generated by the computer overload during the Apollo 11 descent to the moon. Want to know more? Try watching this: https://youtu.be/z4cn93H6sM0