WonderProxy Blog

April 26, 2011

Monitoring the WonderProxy Network

Filed under: Uncategorized — Will Roberts @ 12:53 pm

It’s important to know the state of your network and to be quickly informed when something goes wrong. Several independent pieces on each machine have to work for our proxies to function, and we need to be able to confirm that all of them are running. We currently use three tools to monitor our network: Cacti, Nagios, and Smokeping.

  Cacti

We’ve been using Cacti for quite a while, and it provides us with historical information about our proxies: CPU, load, RAM, and bandwidth. This data lets us see how well each of our proxies is handling its customer load and plan upgrades if necessary. Squid isn’t a very CPU-intensive process, so even when our proxies are shuttling data at 1 MB/s in and out, the CPU isn’t particularly taxed.

  Nagios

Nagios provides near real-time updates about the status of our network; it is the main source of status updates for our proxies. The information provided by Nagios is used to update the network status bar on our main page. We believe that our customers should be able to easily determine if something’s not working on their end or our end, so our main page is updated every minute with the current status of our proxies.

Our checking is currently centralized on one machine, so whether it can reach a proxy and properly authenticate may unfortunately not line up with what our customers see. However, it does easily catch the cases where a machine is down or unreachable due to network issues at our host. There are also times when our monitoring machine decides it’s not happy and stops resolving hostnames, which really makes our front page look sad. It doesn’t happen very often (4 times total, I believe), and since the machine is identical to one of our proxies with the same host, I’m at a bit of a loss as to what’s going wrong. If it happens again and I’m able to isolate the problem, you can bet there will be a post about it!

When we first started using Nagios I installed the Debian package nagios-statd-server on all the proxies which would be periodically interrogated by the client on the monitoring machine. This worked fine for the most part, but every now and then the server on the proxy would get hung up and stop answering requests. As I upgraded our proxies to Debian Squeeze the problem got worse, and not being familiar with Python I didn’t really have any idea what could be going wrong. So I filed a bug against the Debian package since I couldn’t find an upstream.

Having the monitoring process regularly hang on our proxies wasn’t really sustainable so I went searching for another client/server program for Nagios and ended up with the nagios-nrpe-server which directly executes standard Nagios plugins. It has worked flawlessly since we switched many months ago. It appears that this is the preferred method of retrieving server statuses so I’m not sure how I stumbled across nagios-statd-server.

  Smokeping

Smokeping provides us with a visualization of the latency between our monitoring host and our proxies around the world. We expect latency to grow with the distance from the monitor to the proxy, but it doesn’t follow that packet loss should be expected. Smokeping can show us which of our proxies are likely experiencing network issues based on packet loss and any spikes in latency. Unfortunately, some of our hosts have a base level of packet loss, so reading the graphs isn’t always clear-cut.

For most of our proxies, the graph above would be a relatively flat green line, but every now and again one of them has a bad day and we get colors!

February 18, 2011

The Problem With Time

Filed under: Uncategorized — Will Roberts @ 5:15 pm

A frequent issue we have with our hosts is that the proxies will have the incorrect time. This makes billing unreliable and can make tracking down issues difficult since timestamps in logs will be wrong. The best part is that with the problem providers, they rarely seem to care. Depending on the type of VPS we have with the provider there are two solutions.

The first option is to bug the provider. If they’re running an NTP daemon then all of the nodes will inherit the correct time and everyone wins. Most of the time we can’t seem to get them to do this though.

The second option is to run our own NTP daemon, which we’re fine with, but which sometimes still requires help from the provider. If our VPS is based on OpenVZ or Virtuozzo then the provider can set an option which allows our node to maintain its own wallclock. However, if our VPS is based on Xen then we can do it ourselves like so:

echo 1 > /proc/sys/xen/independent_wallclock

And more permanently by adding xen.independent_wallclock = 1 to /etc/sysctl.conf
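
For the permanent setting, a minimal sketch of an idempotent edit might look like the following. The helper name ensure_line is hypothetical (it is not part of our setup scripts), and it assumes that the exact line shown above is what should end up in the file; it only appends when that exact line is missing.

```shell
#!/bin/sh
# Hypothetical helper: append a line to a file only if that exact line
# is not already present, so the script can be re-run safely.
ensure_line() {
  file=$1
  line=$2
  # -q: quiet, -x: match the whole line, -F: fixed string, not a regex
  if ! grep -qxF -- "$line" "$file" 2> /dev/null; then
    printf '%s\n' "$line" >> "$file"
  fi
}
```

It would be called as ensure_line /etc/sysctl.conf "xen.independent_wallclock = 1" and followed by sysctl -p to load the setting without a reboot.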

We use SNMP to query the time of our machines using a custom OID that just returns the current UTC seconds since epoch. We then compare that number to the same value from our monitoring machine, which is synched to the global NTP servers. Differences of up to 5 seconds either way are considered acceptable, though obviously we’d prefer if it were always dead on. We consider an offset of 5-60 seconds to be a warning state, and anything beyond 60 seconds to be an error state.
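
The thresholds above can be sketched as a small shell function. This is an illustration rather than our actual check: the name classify_offset is hypothetical, the two arguments are assumed to be epoch seconds already fetched from the proxy and the monitor, and the boundary handling (exactly 5 seconds counts as OK, exactly 60 as a warning) is an assumption. The return values follow the usual Nagios plugin convention of 0 for OK, 1 for WARNING and 2 for CRITICAL.

```shell
#!/bin/sh
# Hypothetical helper: compare a proxy's epoch seconds with the monitor's
# and print a Nagios-style state for the difference.
classify_offset() {
  proxy_epoch=$1
  monitor_epoch=$2
  diff=$((proxy_epoch - monitor_epoch))
  # Absolute value: an offset in either direction is treated the same.
  if [ "$diff" -lt 0 ]; then
    diff=$((0 - diff))
  fi
  if [ "$diff" -le 5 ]; then
    echo "OK: offset ${diff}s"
    return 0
  elif [ "$diff" -le 60 ]; then
    echo "WARNING: offset ${diff}s"
    return 1
  else
    echo "CRITICAL: offset ${diff}s"
    return 2
  fi
}
```

The first argument would come from the SNMP query and the second from date +%s on the monitoring machine.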

If the time is properly synched and we get a complaint that something wasn’t working for a customer at a given time, I know exactly what time I should be looking at in the logs on the machine. When the time is wrong I have to start doing some mental math to figure out when that proxy thought the error event was, and if it communicates with another machine whose time is off I then have to do another calculation to figure out where the logs would be. So each machine in the chain that has the incorrect time makes it more difficult for us to track down what might have gone wrong.

January 28, 2011

Xen, OpenVZ, Virtuozzo, and You!

Filed under: Uncategorized — Will Roberts @ 11:32 pm


There are three main virtualization technologies out there being used to provide VPSs: Xen, OpenVZ, and Virtuozzo. Of our 38 proxies, 12 are on Xen, 12 are on OpenVZ, 10 are on Virtuozzo, 3 are physical machines, and we have a single VMware machine.

Physical machines give us absolute control over all the settings. The only limitation on the software we can run is what is available in the Debian repositories and what I’m willing to compile. The other benefit is that no one else has access to the box, so there’s no concern about hostname and resolver settings being changed, which can be an issue with our VPSs. The hostname of the box isn’t overly important except when I’ve got a terminal open and I need to know where I am on our network (sorry, I don’t actually know where vz2542 is). We run our own DNS resolver on each box so that it gets responses similar to other machines in the same geographic area, so our resolv.conf is fairly simple and points at our local resolver. If dedicated servers weren’t so expensive, we’d use them everywhere.

Xen is our preferred virtualization technology mainly because it also allows us to run an OpenSwan based IPSec VPN. Xen also allows us to opt in to an independent wall-clock, so if the host’s clock isn’t properly synced we can fix our clock without their help. When a Xen VPS is rebooted, some files are automatically overwritten to ensure that the VPS will work properly; in our case this actually changes our configuration and is very undesirable. Thankfully we can use the immutable bit, just as on a physical machine, to prevent the change without any side effects.

We can use chattr to set the immutable bit so that files aren’t accidentally modified:

chattr +i /etc/hostname /etc/resolv.conf

OpenVZ doesn’t currently allow for IPSec VPNs using OpenSwan though it does appear that it will change in the future; no telling how long that change will take to show up on production systems. Like Xen we can use the immutable bit to preserve files from undesired modification without side effects. Unlike Xen, we do need to ask the provider for help if our clock is wrong. OpenVZ tends to be popular among VPS providers because it is free; most customers don’t care which technology is being used and will buy based on price.

Virtuozzo similarly doesn’t allow for IPSec VPNs, and for the longest time I didn’t realize there was an actual difference between OpenVZ and Virtuozzo (oops!). The more unfortunate “feature” I found after I’d already made the change to all our proxies is that if you have the immutable bit set on certain important files (like we do) then the VPS will refuse to boot! Thankfully only one proxy was taken offline by this mistake, and even more fortunately Virtuozzo provides a fairly decent VPS management panel that allows you to reboot into a recovery mode and “fix” the problem. The control panel can be accessed by going to https://hostname.example.com:4643/vz/cp and putting in the root username/password. I haven’t had to use it for anything other than fixing my mess with the immutable bit, but it is nice to know it’s there.

Since we’re incredibly geographically sensitive, we don’t always get to choose our hosting providers based on their virtualization technology. When we do have the choice, we strongly prefer Xen over the competition. Between OpenVZ and Virtuozzo it generally comes down to other criteria like cost.

January 14, 2011

HOWTO: Managing 30+ Servers

Filed under: Uncategorized — Will Roberts @ 2:19 pm

When we started out we only had a handful of servers so I was doing each setup by hand, and manually applying each change to each server’s configuration. That meant I was spending an hour or more for each new setup, then probably 30-45 minutes for each change depending on its complexity. The setup time doesn’t really have a scalability issue, though it does mean that I can’t be doing something else at the same time. The bigger issue is rolling out a change to all our active servers; a 5 minute change suddenly becomes a 2.5 hour chore when you’ve got 30 servers.

After about 15 or so servers we reached a tipping point where I realized I was going to need a more automated mechanism for setting up the servers and for rolling out new changes. I don’t know all the ins and outs of Bourne Shell scripting, but I’ve managed to create some pretty creative scripts over time so that’s where I started. Pushing trivial updates out to existing machines now becomes a matter of running a script once for each server, and since we know all our hostnames, we can just loop over the hosts running each one in turn. I’ve toyed with the idea of running the scripts in parallel (there shouldn’t be an issue), but for the moment I’ve left them in serial so I can see the result of each box in turn.

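#!/bin/sh
# Run a script against every server returned by $QUERY.
# QUERY is expected in the environment, set by the calling script.

MYSQL_HOST="mysql.database"
SERVERS=`mysql --host=$MYSQL_HOST -u oursupersecretuser -s --skip-column-names -e "$QUERY" wonder_proxy`

export HOST_STATUS=/home/lilypad/billing/host_status.txt

SCRIPT=$1
shift

for i in $SERVERS
do
  if [ -S "/home/lilypad/.ssh/master-wproxy@$i:22" ]; then
    # A master connection socket already exists; nothing to do.
    :
  elif ! grep -q "$i 2" "$HOST_STATUS" 2> /dev/null; then
    # The host isn't listed as down, so try to start a master connection.
    nohup ssh -MNf $i > /dev/null 2> /dev/null
  fi
  # Run the script with the hostname followed by any extra arguments.
  $SCRIPT $i "$@"
done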

The script above is the basic loop I use to run my other scripts on our machines. The MYSQL_HOST variable allows us to more easily migrate from one box to another which has already happened (and was an absolute pain the first time). The custom query allows this script to be called by other scripts to only select certain portions of our network. Once we have the list of hosts, we then ensure that each host has an active SSH tunnel or attempt to start one if the host isn’t known to be down. Then the script is executed with the hostname and all extra arguments.

The scripts are all fairly simple, and I tend to reuse/mangle them for other uses as needed, but here’s an example:

#!/bin/sh
# Push the IPSec config to the host given as $1 and restart the service.

scp /data/proxy-setup/ipsec/etc/ipsec.conf "$1":~/
ssh "$1" sudo cp ipsec.conf /etc/
ssh "$1" rm ipsec.conf
ssh "$1" sudo /etc/init.d/ipsec restart

Pretty simple, but it’s nice not to repeat those 4 commands 37 times when I make a tiny change. So in order to push that tiny change I’d end up just running:

./run_all_vpn.sh ./push_ipsec_conf.sh

The ssh command we use in the first script allows multiple SSH sessions to flow over the same TCP connection. This avoids repeating the TCP handshake and the SSH key exchange for every command we run. The flags as explained by the man page:

-M
Places the ssh client into “master” mode for connection sharing. Multiple -M options places ssh into “master” mode with
confirmation required before slave connections are accepted. Refer to the description of ControlMaster in ssh_config(5)
for details.
-N
Do not execute a remote command. This is useful for just forwarding ports (protocol version 2 only).
-f
Requests ssh to go to background just before command execution. This is useful if ssh is going to ask for passwords or
passphrases, but the user wants it in the background.
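
The socket test in the first script can be pulled out into a pair of helpers, sketched below. The function names are hypothetical, and the defaults (user wproxy, port 22, directory /home/lilypad/.ssh) are simply read off the path that script checks. That path suggests a ControlPath setting along the lines of ~/.ssh/master-%r@%h:%p in the ssh configuration, but that is an inference and not something shown here.

```shell
#!/bin/sh
# Hypothetical helpers mirroring the socket check in the first script.
# Assumption: the control socket lives at <dir>/master-<user>@<host>:<port>.
master_socket() {
  # $1 = host, $2 = user (default wproxy), $3 = port (default 22),
  # $4 = socket directory (default /home/lilypad/.ssh)
  echo "${4:-/home/lilypad/.ssh}/master-${2:-wproxy}@$1:${3:-22}"
}

has_master() {
  # True when a socket file exists at the expected path.
  [ -S "$(master_socket "$@")" ]
}
```

A socket file can outlive the connection behind it, so a stricter check would be ssh -O check followed by the hostname, which asks the running master process whether it is still alive.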

January 11, 2011

HOWTO: Speedy Server Setup

Filed under: Uncategorized — Will Roberts @ 2:06 pm

We tend to expand in bursts, so it’s helpful if I can configure multiple servers at once instead of dedicating an hour to one server, then another hour to the next. The most difficult part is removing all the unneeded packages from the boxes; installing the packages we want and configuring them is barely a quarter of the current setup script. Since we deal with so many hosts, producing an image for each of them to build the server from isn’t exactly convenient; it’s been easier to take what they give us and work from there.

The first thing we need to do is set up SSH key access to the new machine so that the rest of the install doesn’t need someone entering passwords. There might be a simpler way, but this is what we’ve got at the moment:

cat /home/lilypad/.ssh/id_rsa.pub | ssh root@$HOST "cat > foo; mkdir .ssh 2> /dev/null; chmod 700 .ssh; chmod 600 foo; mv foo .ssh/authorized_keys"

So we pipe the SSH key over the SSH connection, write it to a file, make the .ssh directory and then move it to the correct location. At this point we now have easy SSH access to the machine, and we actually maintain active SSH master tunnels to all the machines on the network to reduce the connection lag when running scripts. More on how we do that in my next post.
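
As for a simpler way: where it is installed, ssh-copy-id -i /home/lilypad/.ssh/id_rsa.pub root@$HOST does much the same job. The remote half of the one-liner can also be written as a function, sketched below so it can be tried against a scratch directory. The name install_key is hypothetical, and unlike the command above it appends to authorized_keys instead of overwriting it and skips keys that are already present.

```shell
#!/bin/sh
# Hypothetical variant of the remote half of the one-liner: read a public
# key on stdin and add it to <home>/.ssh/authorized_keys.
# Differences from the original: appends instead of overwriting, and
# does nothing if the key is already listed.
install_key() {
  home_dir=${1:-$HOME}
  key=$(cat)
  mkdir -p "$home_dir/.ssh"
  chmod 700 "$home_dir/.ssh"
  touch "$home_dir/.ssh/authorized_keys"
  chmod 600 "$home_dir/.ssh/authorized_keys"
  if ! grep -qxF -- "$key" "$home_dir/.ssh/authorized_keys"; then
    printf '%s\n' "$key" >> "$home_dir/.ssh/authorized_keys"
  fi
}
```

Appending matters on a box that already has other keys installed; on a freshly provisioned VPS the two approaches end up in the same place.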

Our first step on the new machine is to remove any software we explicitly know we don’t want and that will cause issues for our configuration. Things like Apache get nuked so that they don’t collide with the ports on which we run Squid. Then we update all the software on the box to the newest available versions in Debian 5 (a few of our boxes still start as Debian 4), then make the transition to Debian 6. At this point we still don’t have any of “our” packages installed so we start removing unneeded packages with a fairly simple set of rules:

  1. If the package is on our whitelist of known needed packages, leave it.
  2. If the package is on our blacklist of known unneeded packages, remove it.
  3. If removing the package will only remove it and no other packages, remove it.
  4. Ask!

Here’s the part of the script that handles those rules. The packages and packages-blacklist files are just lists of package names.

for i in `dpkg -l | sed -n 's/^ii  \([^ ]*\).*/\1/p'`
do
  # Rule 1: whitelisted packages stay (-x whole line, -F fixed string).
  if grep -qxF "$i" setup/packages; then
    echo KEEPING: $i
    continue
  fi

  # Rule 2: blacklisted packages are purged without asking.
  if grep -qxF "$i" setup/packages-blacklist; then
    echo PURGING: $i
    apt-get -y purge $i
    continue
  fi

  # Package names containing "linux" always get a prompt.
  if echo $i | grep -q linux; then
    echo ASKING: $i
    apt-get purge $i
    continue
  fi

  # Rule 3: purge if a simulated removal takes out only this package.
  if [ `apt-get -s -qq remove $i | grep -c ^Remv` -eq 1 ]; then
    echo PURGING: $i
    apt-get -y purge $i
    continue
  fi

  # Rule 4: ask.
  echo ASKING: $i
  apt-get purge $i
done

At this point it’s fairly rare that I get asked whether a package should be removed since I update the lists anytime a new package is encountered. Once that’s done we start copying our custom config files for each package and restart the program as needed. The install can run unattended and takes anywhere from 30-60 minutes depending on the speed of the downloads and the power of the machine, and I can be running multiple at once with little trouble.
