Performance Trouble Shooting Example (#1)

A call came in about slow response time for a new release just deployed into the test environment. The login page was taking over 5 minutes compared to less than 8 seconds in production.


As it as a web based application. The first step was to get on a webex with the client and record the login process using fiddler. While watching the webex I could see that one of the HTTP request was taking a very long time to return from the server (this is a great feature of fiddler is that you can see the icons change as the status changes for each HTTP request).

One the fiddler stop, I went to the HTTP request in fiddler and looked at the statistics. Af few key things can be seen the returned payload was 8.5M so large but not huge for a corporate network. The overall the elapsed time was 5m 20 sec of which

One thing was also to check if this payload was being cached. Looking at Response Header for the returned request then caching is enabled. They have followed the google page speed recommendation and set to Vary.

While on the webex I asked for a wireshark trace to be taken so I could get some visibility into how the network was performing. A wireshark trace only captures the packets at the workstation but it can be useful to see if certain TCP/IP type issues are occurring. What I looked at in the wireshark trace was first the throughput from the client. IT seems that much is capped at just over 0.5 Mbit/s although there is a spike and then nothing.

I also wanted to check if there are any problems reported such as the TCP/IP window being undersized or lots of transmissions etc. A quick check is to use the Expert Information option. Remember to select to use the display filter that concentrates on the traffic between client and server and all the numbers are relative. In this case we are transmitting 16.5K packet so 5 out of order packets is not alot.

Next as I had access to the weblogs for I did a quick analysis to see if any particular client or access point has slower than expected response time. I imported these into Excel an created a pivot table and then for each IP calculated the average response time. When doing this remember to filter on status code being HTTP 200 so that the cache check 304 do not skew the response time calculations.

I had a table similar to the below, (I have changed the IPs and changed the response times to seconds. What is most noticable is that the 10.1.62.11 IP has the slow response times while the others are fine.

IPResponse Time (sec)Count
10.1.62.1136524
10.1.9.953
10.1.9.737

Talking to the network guys the 10.1.62.11 is the proxy server used by the client to access both the production and test environments. The production users where not reporting any problems and when we tested production with a clear cache the response time was as less than 5 seconds. The initial thought was some packet shaping was being undertaken to prioritise the production traffic. However. the client confirmed this was not enabled on the client network.

Going further into the route between the proxy server it was noticed that traffic for the test environment was going across the Data Centre link as they work in a different data center. Although capacity was available on the link there was actually packet shaping enabled which meant the traffic was being constrained to sharing a 1Gb portion of the link.

With a quick network change to the packet shaper all was well..

Things to check when you cannot replicate an end user performance problem

Here are a few things to consider when confronted with users complaining about poor performance and you cannot replicate the issue with your configuration. This is looking at comparing your tests verses a users. Most of the points consider the set up on the end users device.

1) Same Hardware/Software? OK may sound obvious but often over looked particularly around the details of software versions.

2) Are you doing a like for like transaction. I once heard of a support call that complained that it took ages to find a car part when doing a search on the inventory system. It was later discovered that the user was doing a wild card search against the inventory and then scroll through the results to find the part!

3) Same start and end points? Users often think in completing processes rather than transactions. When in doubt ask the users to screenshot the start and end points. For windows systems they can use Microsoft Step Recorder.

4) Do you have the same data sets? Your test user account on the system is most likely not to have the history of claims/sales/orders of a real users.

5) Security settings? This is common when support staff are using different desktop builds to the customer. Common differences beyond the hardware differences are encrypted disks and exclusion rules for anti-virus

6) Browser version and setting? There is a significant difference in performance across browser types and versions. Always check you are testing like for like. Plus don’t forget any plugins.

7) Connectivity? I have even seen problems with two users side be side having different connectivity characteristics even on the same subnet. The problem was one user has wifi enabled and it would connect to there wifi hotspot on their phone! rather than the corporate network. More importantly end users will have difference latency and bandwidth availably than you. Browsers like chrome allow you to emulate some network characteristics.

8) Observer effect? Your debugging tools on your machine may change the outcome of the test. I have seen this with debugging proxies like Fiddler. No please don’t use this as a reason not to use those tool, life would be very difficult without them! Also, don’t expect to RDP into the machine and the results to the best same here is an example RDP example

9) Screen size can make difference, I once had a system where users with larger displays has some transaction take longer as it was rendering more items to the screen for each transaction.

10) Background processes can consume CPU and network bandwidth that can impact the end users machine performance.

WAN emulator

This post talks about making a WAN emulator from a Raspberry Pi. As the Pi runs a derivative of the Debian Linux operating system which has native packet shaping features it was an ideal choice for making a WAN emulator. The aim was to connect the Pi between a client PC and the network and allow me to simulate things like packet delay or loss. I could then get the customer to repeat transaction on the client PC and we could observe/time the effect of different network characteristics. Not only a good tool for investigation but an ideal tool for demonstrating the effect of network latency for people considering data center moves or dismissive about complaints about poor performance for users in the “regions”

I had a RaspberryPI Model B but need a few things for the WAN emulation.

  • Additional Ethernet port – the Pi has only 1 ethernet port so I needed a Ethernet to USB device. A simple eBay purchase for a few quid
  • Screen – Again another ebay purchase for a 7 inch screen with separate PSU. The screen is a bit bulky compared to the rest of the kit and I have noticed recently that there are 5 inch screen that connect direct to the Pi with no PSU needed.
  • Keyboard – Another ebay purchase of a 7″ keyboard with MicroUSB and tablet case.
  • Finally I need a HDMI connector and a MicroUSB to USB converter

The kit is all connected together and can be seen in the picture below.

 

Next you have to create a bridge between the two Ethernet adapters. This is done with the follow commands, which I have in a .sh file and run once the Pi is booted. This turns the Pi into a transparent bridge between WAN and Client PC.
ifconfig eth0 0.0.0.0
ifconfig eth1 0.0.0.0
brctl addbr bridge0
brctl addif bridge0 eth0
brctl addif bridge0 eth1
ifconfig bridge0 up

Next you can use tc to inject delay and packet loss. For example to add a 50ms delay

tc qdisc add dev eth0 root netem delay 50ms

It has been a few months since I built this and I apologies if I have forgotten any steps around installation but I found a quick google solved any problems.

Extracting data from a LoadRunner results DB

I collegue wanted to record the resource usage from the weekly performance test automatically into an excel spreadsheet.  So this is how you do it. From excel choose Data -> Import External -> New Data Query and select a MS Access database from the dialogue box. Remember this is the access database created from running and saving the analysis, the default output.mdb produced at the end of a load test if for errors. Next you will need to open the access DB that was created when you saved the results analysis. After you have done this you will be able to use MS query to create the query.

You will need to join the Host, Event_map and Monitor_meter tables to construct the query. The equiry used is shown below, where it provides the resource average for an 1 hour of the test after the first 10 minutes.

SELECT Host.`Host Name`, Event_map.`Event Name`, Avg(Monitor_meter.Aminimum) AS ‘Avg ‘
FROM `C:\….\filename`.Event_map Event_map, `C:\….\filename`.Host Host, `C:\….\filename`.Monitor_meter Monitor_meter
WHERE Host.`Host ID` = Monitor_meter.`Host ID` AND Event_map.`Event ID` = Monitor_meter.`Event ID` AND ((Monitor_meter.`End Time`>=600 And Monitor_meter.`End Time`<=4200))
GROUP BY Host.`Host Name`, Event_map.`Event Name`

Once this is working in the query editor you can return back to excel and the data will be added to the spreadsheet

Python Packet Inspector for Network Captures

I recently was involved on a performance trouble shooting exercise for a company that uses Citrix to access their core ERP application. As part of the exercise individual users has taken packet captures using WireShark for key business transactions. The transactions types where the same for all the users but the users where are different locations. The common metric from the captures was the amount of data exchanged for the transaction and the session length. As there where many locations, transaction types and users the best way to do this was to automate the analysis.

I decided that I would use python as I was fairly familiar with the language and there where several libraries for WireShark analysis.

The chosen library was http://kiminewt.github.io/pyshark/ and a good tutorial on how to use this can be found here http://thepacketgeek.com/pyshark-using-the-packet-object/

The code imports, iterates through the files in the given directory (which should point to the wireshark capture files). When it finds a capture all the citrix packets are loaded into a capture array using a filter using the citrix port number
cap = pyshark.FileCapture(pcap_file,display_filter="tcp.port == 2598")

Once the cap array is populated then a loop iterates through the array summing the size of the payload data

#Iterate through the cap array
for i in cap:
# If the packet has payload data add that to the size counter
try:
if i.data:
size = size + int(i.data.len)
except:
pass

Finally, once the array has been processed then the payload size and session time is printed out. The session time is the timestamp of the last frame in the array which is held in the iteration variable i minus the timestamp for the first frame in the array. I have used epoch time so the result is in seconds. print file, ": is :",size,": size and :",float(i.frame_info.time_epoch)-float(cap[0].frame_info.time_epoch)

The complete code with a bit of error processing is here:


import pyshark
import os
directory = "/home/andrew/Documents/CaptureDirectory"

#Iterate through every file in the directory
for file in os.listdir(directory):
#Analyse if it a wireshark capture files
if file.endswith(".pcapng"):
#Populate cap array with packets matching using the citrix port 2598
size = 0
pcap_file = (directory + "/" + file)
cap = pyshark.FileCapture(pcap_file,display_filter="tcp.port == 2598")

#Iterate through the capute array
for i in cap:
# If the packet has payload data add that to the size counter
try:
if i.data:
size = size + int(i.data.len)
except:
pass
# Print out payload size and session duration
if size > 0:
print file, ": is :",size,": size and :",float(i.frame_info.time_epoch)-float(cap[0].frame_info.time_epoch)
else:
print file," No Citrix"

An example of the output is

andrew@debian:~$ python summary.py
smith-tran12.pcapng : is : 89636 : size and : 47.6105160713
davies-tran1.pcapng : is : 267292 : size and : 62.6023669243
smith-tran11.pcapng : is : 242545 : size and : 37.8602318764
kirby-tran1.pcapng No Citrix

Using R to detect growth in perfmon resource metrics

I have started using the statistical package R to detect any trends in performance test data. In this example I am looking to detect windows perfmon metrics that increase over the duration of a performance test.

You can install R an open source statistical analysis package from here http://www.r-project.org/

The code below can be cut and pasted into the R gui command line but you will have to change the first line of the script to use the directory holding the data file (procs.csv).

The comments should give you an idea what it is doing..

setwd("C:\\Documents and Settings\\alee\\My Documents\\Projects\\youProject\\")

# Load the Data isetednto a Data Frame

pData <- read.csv("procs.csv",sep=",",header=TRUE)

NumOfCols <- length(names(pData))

# Create a metrix to hold the gradients

slope <- 1:NumOfCols

# Loop through the metrics and calculate the slope

for(i in 2:NumOfCols ) {
     x <- 1:length(pData[[i]])
     y <- pData[[i]]
     # Ignore any blank columns
     if ( is.na(y[[1]]) )
          {slope[[i]] <- NA }
     else
     {
          fit <- lm( y ~ x)
          slope[[i]] <- fit[[1]][[2]]
     }
}

results=data.frame(metric=names(pData),Co=slope,OrgOrder=1:NumOfCols)

OrderResults <- results[order(-results$Co),]

#plot top 5 growing metrics

for(i in 1:5)
{
     OrderResults[i,3]
     plot(pData[[OrderResults[i,3]]],type="o",col="blue")
     title(main=names(pData[OrderResults[i,3]]), col.main="black", font.main=4)
     # Prompt for Enter so plot stays on screen long enough to be read
     readline(prompt = "Pause. Press to continue...")
}