Top 10 Tips on Getting Observability Right

This was featured as the Capacitas Thought of the Week and, as I was the author, I thought I would capture it on my own blog! See the original here:

  1. Buying a tool is not enough: build an ecosystem of people and integrations around it to get observability that provides value.
  2. Make sure the tool fits your technology stack (ok, this may seem obvious but be careful as the tool needs to work with legacy tech as well as the new and shiny tech).
  3. Spend time configuring the tools to make them readable and relevant to the users, e.g. rename IIS application pool AGKP_04520756 to UKAGKPPortal.
  4. Remove noise from the tool, e.g. if volume /dev/u001/fred is always full, configure the tool not to show it in red!
  5. Try to avoid underlying infrastructure alerting; it doesn’t matter what your hardware is doing, it is your users’ experience that matters. Set alerts at the UX level.
  6. Configure alerts that align with the business. For example, set alerts for critical user transactions such as time to generate a quote or complete payment rather than a generic alert across all user transactions.
  7. Decide who the consumers of the tool are and work with them to make sure they are trained; the tool should be configured for their needs. I am not talking about just pointing them at a dashboard, but properly training them. That would be teaching them how to fish!
  8. These tools do love data sources. The more you give them, the more likely you can correlate the source of problems, e.g. poor performance could be due to waiting for a VM to be scheduled on the hypervisor. You may never know this unless you are monitoring the hypervisor. Of course, the more you monitor, the more you pay!
  9. Don’t be afraid to have a dedicated monitoring team, but ensure they are skilled in using the tool and not just configuring it, i.e. when production goes down, they are in the thick of it trying to resolve the issue.
  10. Keep an eye on the costs!

Performance Troubleshooting Example (#2)

Sometimes in my job I have to troubleshoot some old technology. This is a case where I had a call that an application running as a CGI process spawned from an IIS website was a lot slower than the same application in a test environment, for a single iteration of a request to the application. The customer was keen to find out why, given that the test environment was supposedly identical to production, there was such a performance difference. Luckily, we could take one of the IIS servers out of production so that we could use it for testing.

To start with we ran some benchmark tests on the machine using tools like IOmeter and zCPU just to confirm the raw compute power was the same. As a separate thread, various teams were checking to make sure the two environments were identical not only in terms of hardware and OS but also security, IIS config etc. They all came back to say the production and test environments were identical. Next we wanted to see if the problem was in the application itself or in the invocation of the application. Someone knocked up a CGI “Hello World” application which we tested in test and production. This showed that the production environment was slower running the “Hello World” application, and therefore the problem was around the invocation of the CGI process. Next I wrote a C# program that invoked a simple process and timed it. Running this in both test and production showed no difference between the two environments. This led us to conclude the problem was specific to the IIS/CGI invocation.
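The actual timer was a small C# program on Windows, but the same idea can be sketched in shell: spawn a trivial child process many times and report the average cost, so two environments can be compared like-for-like (this sketch assumes GNU date for nanosecond timestamps; it is an illustration, not the original test harness).

```shell
#!/bin/sh
# Sketch of the process-spawn timing test: repeatedly spawn a trivial
# child process and report the average cost per spawn.
runs=100
start=$(date +%s%N)            # nanoseconds since epoch (GNU date)
i=0
while [ "$i" -lt "$runs" ]; do
  /bin/true                    # trivial child, stands in for "Hello World"
  i=$((i + 1))
done
end=$(date +%s%N)
echo "average spawn cost: $(( (end - start) / runs / 1000 )) us"
```

Running the same script on both environments gives a single comparable number, which is exactly what exposed the IIS/CGI invocation as the culprit.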

The next step was to take a Process Monitor trace (Process Monitor – Windows Sysinternals | Microsoft Docs). The first thing we noticed was that the trace was significantly larger for production. The production trace showed multiple calls to Symantec endpoint security before the start of the LoadImage event. At the start these only added about 1 second to the trace before the image load, but there were later calls to the registry as well. With this information we asked the security team to check the security settings again, and they discovered they were different: the exceptions set in test were not set in production. The settings were reviewed and it was decided that the same exception rules could be applied across both environments. After this was done, retesting showed that both environments provided the same level of performance.

What is the lesson from this? Let data be your truth! I am not saying the people checking the security settings lied, but comparing lots of settings is complex and often not automated. Add in the bias that people assume they have configured the system correctly to start with, and you get early conclusions that can be flawed.

WAN emulator

This post talks about making a WAN emulator from a Raspberry Pi. As the Pi runs a derivative of the Debian Linux operating system, which has native packet shaping features, it was an ideal choice for making a WAN emulator. The aim was to connect the Pi between a client PC and the network and allow me to simulate things like packet delay or loss. I could then get the customer to repeat transactions on the client PC and we could observe and time the effect of different network characteristics. It is not only a good tool for investigation but also ideal for demonstrating the effect of network latency to people considering data center moves, or to those dismissive of complaints about poor performance from users in the “regions”.

I had a Raspberry Pi Model B but needed a few things for the WAN emulation.

  • Additional Ethernet port – the Pi has only 1 Ethernet port, so I needed an Ethernet-to-USB adapter. A simple eBay purchase for a few quid.
  • Screen – again another eBay purchase, a 7 inch screen with a separate PSU. The screen is a bit bulky compared to the rest of the kit, and I have noticed recently that there are 5 inch screens that connect directly to the Pi with no PSU needed.
  • Keyboard – another eBay purchase of a 7″ keyboard with MicroUSB and tablet case.
  • Finally, I needed an HDMI connector and a MicroUSB-to-USB converter.

The kit is all connected together and can be seen in the picture below.


Next you have to create a bridge between the two Ethernet adapters. This is done with the following commands, which I have in a .sh file and run once the Pi is booted. This turns the Pi into a transparent bridge between the WAN and the client PC.
# bring both physical interfaces up with no IP address assigned
ifconfig eth0 0.0.0.0 up
ifconfig eth1 0.0.0.0 up
# create the bridge and attach both interfaces to it
brctl addbr bridge0
brctl addif bridge0 eth0
brctl addif bridge0 eth1
# bring the bridge itself up
ifconfig bridge0 up
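On newer Raspberry Pi OS images, brctl (bridge-utils) and ifconfig may not be installed by default; the same transparent bridge can be built with iproute2 alone. This is a sketch of the equivalent setup, not something I ran on the original kit:

```shell
# iproute2 equivalent of the brctl/ifconfig bridge
ip link add name bridge0 type bridge      # create the bridge device
ip link set eth0 master bridge0           # attach both Ethernet ports
ip link set eth1 master bridge0
ip link set eth0 up                       # bring everything up
ip link set eth1 up
ip link set bridge0 up
```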

Next you can use tc to inject delay and packet loss. For example, to add a 50ms delay:

tc qdisc add dev eth0 root netem delay 50ms
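The same netem qdisc can also add jitter and packet loss, and can be removed when you are done. The values below are just examples, not settings from the original build:

```shell
# add +/-10ms of jitter on top of the 50ms delay
tc qdisc change dev eth0 root netem delay 50ms 10ms
# also drop 1% of packets
tc qdisc change dev eth0 root netem delay 50ms 10ms loss 1%
# remove the shaping entirely when finished
tc qdisc del dev eth0 root
```

Note that a qdisc only shapes traffic leaving an interface, so to degrade both directions of the conversation apply the same rule to eth1 as well.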

It has been a few months since I built this and I apologise if I have forgotten any steps around installation, but I found a quick Google solved any problems.