Smokeping: How a dedicated server provider is using the monitoring tool

Smokeping, which pings targets periodically and graphs the results, has been invaluable. Here's the link to our smokeping page:

When we first started using smokeping, some of the results were totally off. The graphs periodically recorded high latency for pings to the firewall, when we were expecting under 1ms. It turned out that having syslog enabled on our firewall was overloading its CPU.

We added more locations over time, e.g. Hong Kong, Indonesia, Malaysia, Vietnam, USA, UK. When there are problems, we report them to our datacentre provider, and the network guys have been responsive in fixing them.

How do you interpret the graphs? Why do some of the graphs show a lot of deviation?

Choosing targets for the graphs takes a bit of trial and error. Some of the targets are servers, some are routers. Many routers slow down their responses to ICMP packets when the CPU is busy, and that shows up as higher latency in the graphs even though normal traffic is not affected. Sometimes the servers chosen have low bandwidth, and the latency shoots up during peak usage. As far as possible, we try not to use these (busy routers, bandwidth-starved servers) as targets.
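To make "deviation" concrete: smokeping sends a burst of pings to each target every round and plots the median, with the spread of the remaining samples drawn as the "smoke" around it. The sketch below is not smokeping's own code, just a rough illustration of the same idea in Python. The function name summarise, the dictionary keys and the sample values are all made up for the example.

```python
import statistics


def summarise(rtts_ms):
    """Summarise one round of pings to a single target.

    rtts_ms is a list of round-trip times in milliseconds, with None
    standing in for a ping that got no reply. (Hypothetical helper,
    not part of smokeping.)
    """
    if not rtts_ms:
        raise ValueError("need at least one ping result")

    answered = [r for r in rtts_ms if r is not None]
    loss_pct = 100 * (len(rtts_ms) - len(answered)) / len(rtts_ms)

    if not answered:
        # Every ping was lost, so there is no latency to report.
        return {"loss_pct": loss_pct, "median_ms": None, "spread_ms": None}

    return {
        "loss_pct": loss_pct,
        "median_ms": statistics.median(answered),
        # A wide gap between fastest and slowest reply is what shows
        # up as thick "smoke" on the graph.
        "spread_ms": max(answered) - min(answered),
    }


# Made-up samples: a quiet target versus a router that is busy
# and slow to answer ICMP.
quiet_target = [0.8, 0.9, 0.8, 1.0, 0.9]
busy_router = [0.9, 12.5, 0.8, 30.2, None]

print(summarise(quiet_target))
print(summarise(busy_router))
```

A target whose spread is large while real traffic through it is fine is probably one of the busy routers described above, and is a candidate for replacement.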

Why do we want to show customers these graphs when some of them look downright ugly?

1. "Debugging". Can you think of times when you ran into problems with the network, sent an email to your service provider, and were told later that there were no problems? We need to keep a record, so that when customers tell us there were problems at a particular time, we can get our provider to check, and give the provider the graphs to help with that. This reduces the time taken to resolve problems. We also check the graphs periodically, and if we notice problems, we get our provider to fix them. What we have not yet done, but would like to, is active monitoring with thresholds, and SMS alerts when those thresholds are breached.

2. Customer expectations and customer turnover. Our target market is sysadmins: people who understand that the Internet is a best-effort network, who choose to pay more for SCSI, and who carry around a UMPC with 3G. We do not mind scaring away potential customers who expect a perfect network with 0% packet loss 24x7 to all locations around the world, or customers who ask why the ping round-trip latency from Singapore to the USA is 200ms and not 30ms. We also want customers to know what they are paying for; if the network is not good enough, they should know beforehand, not after paying.
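On the 200ms versus 30ms question, a back-of-the-envelope calculation shows why 30ms is not on offer. The figures below are assumptions for illustration, not measurements from our network: light in optical fibre travels at roughly 200,000 km/s (about two thirds of its speed in a vacuum), and the great-circle distance from Singapore to the US west coast is somewhere around 14,000 km.

```python
# Assumed round figures, for illustration only.
SPEED_IN_FIBRE_KM_PER_S = 200_000        # roughly 2/3 of the speed of light
SINGAPORE_TO_US_WEST_COAST_KM = 14_000   # approximate great-circle distance


def min_rtt_ms(distance_km, speed_km_per_s=SPEED_IN_FIBRE_KM_PER_S):
    """Lowest possible round-trip time in ms over a fibre of this length."""
    # There and back is twice the distance; multiply by 1000 to get ms.
    return 2 * distance_km * 1000 / speed_km_per_s


floor_ms = min_rtt_ms(SINGAPORE_TO_US_WEST_COAST_KM)
print(f"Best-case Singapore to USA round trip: about {floor_ms:.0f} ms")
```

Under those assumptions the floor is already well above 30ms before counting anything else. Real cables do not follow the great circle, and every router along the way adds delay, so a measured 200ms is about what one should expect.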

08 Jun 08

