Performance Monitoring choosing a Sampling Period

I was recently visiting the St Ives Tate Art Gallery (I know I am so cultured/it was raining) and came across an art work called Measuring Niagara with a Teaspoon by Cornelia Parker. The blurb about the work is based on her interest in the absurd task of measuring an enormous object with such a small object. In this case Niagara falls using a tea spoon.

So, as a performance engineer this got me thinking about how we choose the sampling period when collecting performance data. We generally believe the more granular (shortest sampling period) the better as detail is lost when data is sampled over longer periods. For example, the graphic below is for CPU utilisation sampled at different periods and as you can see the odd behavior of the CPU does not become apparent until sampling every second.

However, the shorter the sampling period the more data we collect and have to anaylse. There are often times when we don’t have the luxury of being able to collect data with a really short sampling period. In which case how do we choose the best sample period?

For me, it comes down to what am I looking for in the data. For example, users are complaining about an intermittent problem where for about 15 minutes during the day response time is really slow across all user transactions. The time delay in the incident reporting means I can’t rely on using a circular buffer to store a couple of hours of highly sampled data which could be flushed just after an incident. So, I need to collect day, I would live with a sampling period of about 5 to 10 minutes. This means I would have enough data to correlate changes in resource usage with when the users reported performance issues.

Another example would be for looking for reasons behind slow response times. For example we are looking for reasons the response time has increase to over 6 seconds. In that case the minimum I would accept would be 3 seconds.

You can see in most cases the rule I apply is at least sample at 1/2 the duration of the period of interest. Ideally I would go 1/3 the period of interest. As they say you need 3 data points to draw a curve!

There is possibly some science behind this. There is the Nyquist Theorem which is used in signal processing. It says you need to sample twice as much as the frequency of the analogue signal you are converting. There is a bit of maths to prove the theorem but basically it makes sense to think you need to have at least two data points to detect a noticeable change in the system you are sampling.

How to Manage Complexity (Martin Thompson) Interview

While some people provide the odd cartoon at Christmas! Dave Farley provides something more substantial this festive season. I always like to listen to Martin Thompson on any of his performance work. This interview is a bit more general than purely performance related work, but it is still worth taking the time to listen. Also, I found it very interesting to hear that his consulting engagements are moving away from improving response time to improving efficiency.

To me the take away from this interview are listed below:

(1) Keep things simple

(2) Concentrate on the business logic

(3) Work in small steps, feedback and make small incremental changes

(4) Most performance problems he sees are systemic design flaws

(5) Measure and under stand (model) the system you are trying to improve and compare that model of what should be achieved.

(6) Consider separation of concerns in the design will lead to simpler code (“one class, one thing; one method, one thing”)

(7) If you haven’t heard about it think about Mechanical Sympathy https://dzone.com/articles/mechanical-sympathy

Response time blip when the system is quiet

Working with a customer using Dynatrace there where some noticeable peaks in response times for a application that wasn’t highly used. As can be seen in the graph below:

The application was web based and hosted on IIS. I had come across problems like this before and suspected it was due to the idle timer setting for an application pool. By default IIS will shutdown worker processes after being idle for 20 minutes. However, for some applications the time to restart a worker process can be noticeable slow.

If you set the idle Time-out to 0 then it will not terminate and you won’t have to suffer the restart overhead. I have made this change several times and not seen any adverse affects. But remember all applications are different!

Troubleshooting system.OutOfMemoryException

It happened when I got an email from the service manager “User for an application we host are intermittently failing to login during peak login period in the morning. The same users have no problem to login at other times and no particular group of users or locations are identified. The users are accessing via Citrix.

The support team have been trying to reproduce the error by login in multiple times and they cannot reproduce consistently. They even had 30-40 session active on the test box and they could not reproduce and the test box has available memory”.

The first thing I wanted to to see the actual error seen by the users and any logs. The support folks sent this through and the error came with a stack trace which had this at the bottom:

at System.Text.StringBuilder..ctor(String value, Int32 startIndex, Int32 length, Int32 capacity)

Exception of type ‘System.OutOfMemoryException’ was thrown.

What string builder is doing is allocating space for holding a string. You can get an OutOfMemoryException if you try to allocate a string length greater than the capacity but this level of programming error would probably occur at any time not at the peak. As this is the login process it should be pretty consistent.

I think the issue here is to do with the limit on virtual memory rather than physical memory. I mocked up a .NET 4.5 app on my PC that grabs 0.5G of string storage after each key press. Here is a code snippet below:

int capacity = 250000000;
int maxCapacity = capacity + 1024;
StringBuilder stringBuilder1 =
new StringBuilder(capacity, maxCapacity);
Console.WriteLine(GC.GetTotalMemory(false));

Console.ReadKey(true);

There where two scenarios I found that generated the OutOfMemory exception

Exceeding the Process Address Space

The test app process is limited to 4G of memory and when that is exceeded I get the out of memory error. However, I don’t think this is your issue as the problem occurs only during heavy usage.

Committed bytes is exceeded

Committed bytes is the total amount of memory available on the Physical RAM and the Page File. I get the error when the perfmon counter % Committed byes in Use reaches 100% as the process cannot find any spare memory for the next 0.5G of string storage. Below you can see when the issue occurs:

The answer was simply to increase the committed bytes limit which means an increase in the paging file size.

Performance Trouble Shooting Example (#2)

Sometimes in my job I will have to troubleshoot some old technology. This is a case where I had a call that an application that run in CGI process spawned from an IIS website was running a lot slower that the same application in a test environment. This was for a single iteration of a request to the application. The customer was keen to find out why that although the test environment was identical to production there was such a performance difference. Luckily, we could take out one of the IIS servers from production so that we could us it for testing.

To start with we did some benchmark tests on the machine using tools like IOmeter and zCPU just to confirm the raw compute power was the same. As a separate thread various teams where checking to make sure the two environment where identical not only in terms of hardware and OS but security, IIS config etc. They all came back to say the production and test environments where identical. Next we wanted to see if the problem was in the application itself or the invocation of the application. Someone knocked up a CGI hello world application which we tested in test and production. This showed that the production environment was slower running the “Hello World” application and therefore the problem was around the invocation of the CGI process. Next I wrote a C# program that invoked a simple process and timed it. Running this in both test and prod showed no difference between the two environment. This lead to the problem being specific to the IIS/CGI.

The next step was to take a Process Monitor trace (Process Monitor – Windows Sysinternals | Microsoft Docs). The first thing we noticed was the trace size was significantly larger for the production trace. On the production server there are multiple calls made to Symantec. With the production trace showing multiple calls made to Symantec end point security before the start of the LoadImage event. At the start it only add 1 second to the trace before the image load but there are latter calls to the registry. With this information we asked the security team to check again the security settings and they discovered they where different this the exceptions set in test not set in production. The settings where reviewed and it was decided that the same exceptions rules could be applied across both environment and after this was done the retesting showed that both environment provided the same level of performance.

What is the lesson for this. Let data be your truth! I am not saying the people checking the security settings lied but comparing lots of settings is complex and often not automated. Add in the biases that people think they have configured the system correctly to start. leads to early conclusions that can be flawed.

Performance Trouble Shooting Example (#1)

A call came in about slow response time for a new release just deployed into the test environment. The login page was taking over 5 minutes compared to less than 8 seconds in production.


As it as a web based application. The first step was to get on a webex with the client and record the login process using fiddler. While watching the webex I could see that one of the HTTP request was taking a very long time to return from the server (this is a great feature of fiddler is that you can see the icons change as the status changes for each HTTP request).

One the fiddler stop, I went to the HTTP request in fiddler and looked at the statistics. Af few key things can be seen the returned payload was 8.5M so large but not huge for a corporate network. The overall the elapsed time was 5m 20 sec of which

One thing was also to check if this payload was being cached. Looking at Response Header for the returned request then caching is enabled. They have followed the google page speed recommendation and set to Vary.

While on the webex I asked for a wireshark trace to be taken so I could get some visibility into how the network was performing. A wireshark trace only captures the packets at the workstation but it can be useful to see if certain TCP/IP type issues are occurring. What I looked at in the wireshark trace was first the throughput from the client. IT seems that much is capped at just over 0.5 Mbit/s although there is a spike and then nothing.

I also wanted to check if there are any problems reported such as the TCP/IP window being undersized or lots of transmissions etc. A quick check is to use the Expert Information option. Remember to select to use the display filter that concentrates on the traffic between client and server and all the numbers are relative. In this case we are transmitting 16.5K packet so 5 out of order packets is not alot.

Next as I had access to the weblogs for I did a quick analysis to see if any particular client or access point has slower than expected response time. I imported these into Excel an created a pivot table and then for each IP calculated the average response time. When doing this remember to filter on status code being HTTP 200 so that the cache check 304 do not skew the response time calculations.

I had a table similar to the below, (I have changed the IPs and changed the response times to seconds. What is most noticable is that the 10.1.62.11 IP has the slow response times while the others are fine.

IPResponse Time (sec)Count
10.1.62.1136524
10.1.9.953
10.1.9.737

Talking to the network guys the 10.1.62.11 is the proxy server used by the client to access both the production and test environments. The production users where not reporting any problems and when we tested production with a clear cache the response time was as less than 5 seconds. The initial thought was some packet shaping was being undertaken to prioritise the production traffic. However. the client confirmed this was not enabled on the client network.

Going further into the route between the proxy server it was noticed that traffic for the test environment was going across the Data Centre link as they work in a different data center. Although capacity was available on the link there was actually packet shaping enabled which meant the traffic was being constrained to sharing a 1Gb portion of the link.

With a quick network change to the packet shaper all was well..

Things to check when you cannot replicate an end user performance problem

Here are a few things to consider when confronted with users complaining about poor performance and you cannot replicate the issue with your configuration. This is looking at comparing your tests verses a users. Most of the points consider the set up on the end users device.

1) Same Hardware/Software? OK may sound obvious but often over looked particularly around the details of software versions.

2) Are you doing a like for like transaction. I once heard of a support call that complained that it took ages to find a car part when doing a search on the inventory system. It was later discovered that the user was doing a wild card search against the inventory and then scroll through the results to find the part!

3) Same start and end points? Users often think in completing processes rather than transactions. When in doubt ask the users to screenshot the start and end points. For windows systems they can use Microsoft Step Recorder.

4) Do you have the same data sets? Your test user account on the system is most likely not to have the history of claims/sales/orders of a real users.

5) Security settings? This is common when support staff are using different desktop builds to the customer. Common differences beyond the hardware differences are encrypted disks and exclusion rules for anti-virus

6) Browser version and setting? There is a significant difference in performance across browser types and versions. Always check you are testing like for like. Plus don’t forget any plugins.

7) Connectivity? I have even seen problems with two users side be side having different connectivity characteristics even on the same subnet. The problem was one user has wifi enabled and it would connect to there wifi hotspot on their phone! rather than the corporate network. More importantly end users will have difference latency and bandwidth availably than you. Browsers like chrome allow you to emulate some network characteristics.

8) Observer effect? Your debugging tools on your machine may change the outcome of the test. I have seen this with debugging proxies like Fiddler. No please don’t use this as a reason not to use those tool, life would be very difficult without them! Also, don’t expect to RDP into the machine and the results to the best same here is an example RDP example

9) Screen size can make difference, I once had a system where users with larger displays has some transaction take longer as it was rendering more items to the screen for each transaction.

10) Background processes can consume CPU and network bandwidth that can impact the end users machine performance.