Smokeping, which pings targets periodically and graphs the results, has been invaluable. Here's the link to our smokeping page:
When we first started using smokeping, some of the results were totally off. The graphs periodically recorded high latency for pings to
the firewall, when we were expecting under 1ms. It turned out that having syslog enabled on our firewall was overloading its CPU.
We added more locations over time, e.g. Hong Kong, Indonesia, Malaysia, Vietnam, USA, UK. When there are problems, we report them to our
datacentre provider, and the network guys have been responsive in fixing them.
How do we interpret the graphs? Why do some of the graphs show a lot of deviation?
There's a bit of trial and error in choosing the targets for the graphs. Some of the targets are servers, some are routers. Many
routers slow down their responses to ICMP packets when the CPU is busy, and this shows up as higher latency in the graphs although normal
traffic is not affected. Sometimes the servers chosen have low bandwidth, and the latency shoots up during peak usage.
As far as possible, we try not to use these (busy routers, bandwidth-starved servers) as targets.
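To make the idea of a "noisy" target concrete, here is a rough sketch of how one could screen candidate targets from raw ping samples. It is not how smokeping works internally and not something we run in production; the function names, the sample values, and the 0.5 spread ratio are all assumptions made up for illustration. Each round is a list of round-trip times in milliseconds, with None standing for a lost ping.

```python
from statistics import median


def summarize_round(rtts_ms):
    """Summarize one round of ping round-trip times (ms); None means a lost ping."""
    answered = sorted(r for r in rtts_ms if r is not None)
    loss_pct = 100.0 * (len(rtts_ms) - len(answered)) / len(rtts_ms)
    if not answered:
        return {"median_ms": None, "spread_ms": None, "loss_pct": loss_pct}
    return {
        "median_ms": median(answered),
        # Spread between slowest and fastest reply: the "deviation" seen on a graph.
        "spread_ms": answered[-1] - answered[0],
        "loss_pct": loss_pct,
    }


def is_noisy_target(rounds, max_spread_ratio=0.5):
    """Hypothetical rule: noisy if the spread exceeds max_spread_ratio * median
    in more than half of the rounds that got at least one reply."""
    usable = [s for s in map(summarize_round, rounds) if s["median_ms"] is not None]
    if not usable:
        return False
    noisy = sum(1 for s in usable if s["spread_ms"] > max_spread_ratio * s["median_ms"])
    return noisy > len(usable) / 2


# Made-up samples: a steady server and a router that deprioritizes ICMP when busy.
steady_server = [[10.1, 10.2, 10.0, 10.3, 10.1]] * 3
busy_router = [
    [10.1, 55.0, 10.2, 120.4, 10.0],
    [10.0, 80.2, 10.1, 10.3, 95.0],
    [10.2, 10.1, 10.0, 10.1, 10.2],
]

print("steady server noisy?", is_noisy_target(steady_server))
print("busy router noisy?", is_noisy_target(busy_router))
```

Note that the median barely moves for the busy router in this made-up data; it is the spread that gives it away, which is why such a target looks ugly on a graph even when normal traffic is fine.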
Why do we want to show customers these graphs when some of them look downright ugly?
1. "Debugging". Can you think of times when you encountered problems with the network, sent an email to your service provider, and they
responded later that there were no problems? We need to keep a record so that when customers tell us there were problems at a particular
time, we can get our provider to check, and give them the graphs to help them check. This reduces the time taken to resolve problems. Also,
we check the graphs periodically; if we notice problems, we get our provider to fix them. What we have not yet done but would like to do is
active monitoring with thresholds, with SMS alerts when those thresholds are breached.
2. Customer expectations and customer turnover. Our target market is sysadmins: people who understand that the Internet is a best-effort
network, who choose to pay more for SCSI, and who carry around a UMPC with 3G. We do not mind scaring away potential customers who expect a
perfect network with 0% packet loss 24x7 to all locations around the world, or customers who ask why the ping round-trip latency from
Singapore to the USA is 200ms and not 30ms. We also want customers to know what they are paying for, and if the network is not good enough,
they should know beforehand, and not after paying.
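For anyone wondering why 30ms from Singapore to the USA is not a reasonable expectation, here is a back-of-the-envelope sketch. It assumes light in fibre travels at roughly 200,000 km/s (about two thirds of its speed in a vacuum), picks Los Angeles as the US endpoint, and uses approximate city coordinates; all of these are our assumptions for illustration. It also assumes the cable follows the great-circle path, which real cables do not, so the true floor is higher.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0          # mean Earth radius
FIBRE_SPEED_KM_PER_S = 200_000.0  # assumption: about 2/3 of the speed of light in a vacuum


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def min_rtt_ms(distance_km):
    """Lower bound on round-trip time: there and back at fibre speed, nothing else."""
    return 2 * distance_km / FIBRE_SPEED_KM_PER_S * 1000


# Approximate coordinates, for illustration only.
SINGAPORE = (1.35, 103.82)
LOS_ANGELES = (34.05, -118.24)

distance = great_circle_km(*SINGAPORE, *LOS_ANGELES)
print(f"great-circle distance: {distance:.0f} km")
print(f"theoretical minimum RTT: {min_rtt_ms(distance):.0f} ms")
```

Under these assumptions the floor comes out on the order of 140ms before any router, queueing, or detour is counted, so a measured 200ms is in the expected range and 30ms is physically out of reach.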
08 Jun 08