WonderProxy Blog

January 28, 2011

Xen, OpenVZ, Virtuozzo, and You!

Filed under: Uncategorized — Will Roberts @ 11:32 pm


There are three main virtualization technologies out there being used to provide VPSs: Xen, OpenVZ, and Virtuozzo. Of our 38 proxies, 12 are on Xen, 12 are on OpenVZ, 10 are on Virtuozzo, 3 are physical machines, and we have a single VMware machine.

Physical machines give us absolute control over all the settings. The only limitation on the software we can run is what’s available in the Debian repositories and what I’m willing to compile. The other benefit is that no one else has access to the box, so there’s no concern about hostname and resolver settings being changed, which can be an issue with our VPSs. The hostname of the box isn’t overly important except when I’ve got a terminal open and need to know where I am on our network (sorry, I don’t actually know where vz2542 is). We run our own DNS resolvers on each box so that it gets responses similar to other boxes in the geographic area; our resolv.conf is fairly simple and points at our local resolver. If dedicated servers weren’t so expensive, we’d use them everywhere.
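
Roughly speaking, resolv.conf on each box boils down to something like this (a sketch, assuming the local resolver listens on 127.0.0.1; the exact contents on our proxies may vary):

cat > /etc/resolv.conf <<'EOF'
nameserver 127.0.0.1
EOF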

Xen is our preferred virtualization technology, mainly because it also allows us to run an OpenSwan-based IPSec VPN. Xen also allows us to opt in to an independent wall clock, so if the host’s clock isn’t properly synced we can fix our clock without their help. When a Xen VPS is rebooted, some files are automatically overwritten to ensure that the VPS will work properly; in our case this actually changes our configuration and is very undesirable. Thankfully we can use the immutable bit, just as on a physical machine, to prevent the change without any side effects.
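
The clock opt-in looks roughly like this on a paravirtualized guest whose kernel exposes the Xen sysctl (a sketch; the exact path can vary by kernel, and the ntpdate call is just illustrative):

echo 1 > /proc/sys/xen/independent_wallclock               # stop tracking the host's clock
echo "xen.independent_wallclock = 1" >> /etc/sysctl.conf   # persist across reboots
ntpdate pool.ntp.org                                       # then sync our own clock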

We can use chattr to set the immutable bit so that files aren’t accidentally modified:

chattr +i /etc/hostname /etc/resolv.conf

OpenVZ doesn’t currently allow for IPSec VPNs using OpenSwan, though it does appear that will change in the future; there’s no telling how long that change will take to show up on production systems. Like Xen, we can use the immutable bit to protect files from undesired modification without side effects. Unlike Xen, we do need to ask the provider for help if our clock is wrong. OpenVZ tends to be popular among VPS providers because it is free; most customers don’t care which technology is being used and will buy based on price.

Virtuozzo similarly doesn’t allow for IPSec VPNs, and for the longest time I didn’t realize there was an actual difference between OpenVZ and Virtuozzo (oops!). The more unfortunate “feature,” which I found after I’d already made the change to all our proxies, is that if you have the immutable bit set on certain important files (like we do), the VPS will refuse to boot! Thankfully only one proxy was taken offline by this mistake, and even more fortunately Virtuozzo provides a fairly decent VPS management panel that lets you reboot into a recovery mode and “fix” the problem. The control panel can be accessed by going to https://hostname.example.com:4643/vz/cp and entering the root username and password. I haven’t had to use it for anything other than fixing my mess with the immutable bit, but it’s nice to know it’s there.

Since we’re incredibly geographically sensitive, we don’t always get to choose our hosting providers based on their virtualization technology. When we do have the choice, we strongly prefer Xen over the competition. Between OpenVZ and Virtuozzo it generally comes down to other criteria like cost.

January 20, 2011

Improving Site Performance

Filed under: Uncategorized — Paul Reinheimer @ 9:48 am

Our site hasn’t really been our focus over the past months; instead I’ve been concentrating on acquiring new network locations, while Will has been improving our server setup and maintenance architecture (we’ve blogged about Setting up Proxy Servers and Managing 30+ Servers previously). More recently we’ve been taking a harder look at how the site performs, both for us and for our users, and found it lacking.

Examining our main page’s performance with XHGui quickly revealed that a considerable amount of time was being spent generating the information displayed in the footer (server list, server country list, and proxied traffic). This data was shuffled off to APC’s user storage mechanism, removing it from the average page load entirely. Google Webmaster Tools still reported a startlingly high average page load time:

Google Webmaster Tools analysis of site performance

This was quite surprising, as the site seemed pretty snappy overall. Further investigation showed that server-specific pages loaded much more slowly (3-5 seconds!). Since our goal is to provide proxies for GeoIP testing, having server-specific pages load slowly is sub-optimal. Looking at the pages with YSlow and Page Speed revealed that the real culprit was the embedded Google Map. Switching to a static map greatly reduced page load time (to ~800ms). This also reduced functionality, as the map is no longer dynamic, but we plan on switching to a combined static & dynamic system in the future.

If you’re interested in front end performance, High Performance Web Sites and Even Faster Web Sites are invaluable.

Reading through the suggestions from YSlow a bit more closely, then diving into the Apache documentation, I also managed to find a few quick gains by configuring our web server to do a bit more work for us:

  ExpiresActive On
  ExpiresByType image/png "access plus 1 month"
  ExpiresByType text/css "access plus 1 month"
  ExpiresByType image/jpeg "access plus 1 month"
  <Location />
    SetOutputFilter DEFLATE
    BrowserMatch ^Mozilla/4 gzip-only-text/html
    BrowserMatch ^Mozilla/4\.0[678] no-gzip
    BrowserMatch \bMSI[E] !no-gzip !gzip-only-text/html
    SetEnvIfNoCase Request_URI \.(?:gif|jpe?g|png)$ no-gzip dont-vary
  </Location>

YSlow will tell you to turn off ETags under the default rule set. If you’re running a single web server, this is bad advice; you may want to select the Small Site or Blog rule set to get the most out of the tool. Moving forward we may decide to re-organize our JavaScript code to make expiry rules for it easy (we can’t set a distant expiry on all JavaScript files, as our live status bar relies on them); for now we’ll leave them as is. We’re happy with our new scores:

Screenshot showing our A grade with YSlow

YSlow - A Grade

Screenshot showing our grade of 93 within Page Speed

Page Speed - 93/100

Having sorted out the low-hanging fruit on the front end, I looked at the account pages and the administrative tools we’re using. Performance there was abysmal, with some pages taking at least 10 seconds to load. The pages with the worst performance were the ones displaying any sort of usage statistic, the very worst being those that displayed aggregate statistics for all users. Our usage table, built on the Squid logs, has nearly a million rows; despite being indexed, that’s still a lot of data to aggregate and sum.

With an eye toward improving performance I decided to build some summary tables. The first one aggregates usage by user, by server, by day. This summary table is roughly 1/23rd the size of the original usage table, which makes sense since it rolls the 24 hourly rows up into one. The summary table is considerably quicker to query, and I started rolling it out to various portions of the admin section immediately.

Table indicating much higher performance using summary tables

While these numbers are still rather pathetic, remember that these are admin actions, not forward facing pages. Optimizing for these would be folly; the time would be much better spent working on outward facing pages read by users and search engines alike. The significant increase here will simply make managing the system a speedier task.

Knowing that the summary table is going to be useful, we need to keep it up to date. To accomplish this task we’re running this query after every log rotation (a process described in our post Squid log parsing for proxy billing). Luckily I’ve got friends at Percona (authors of the MySQL Performance Blog) who gave me a hand with crafting the query:

INSERT INTO sum_usage_daily SELECT
	`user_id`,
	`group_id`,
	`server_id`,
	date(`timestamp`) AS `date`,
	sum(`bytes`) AS `bytesSum`
FROM
	`usage`
WHERE
	`server_id` IS NOT NULL
AND `timestamp` BETWEEN date_sub(date(NOW()), INTERVAL 2 DAY) AND date(NOW())
GROUP BY
	`date`,
	`user_id`,
	`server_id`,
	`group_id`
ORDER BY NULL
ON DUPLICATE KEY UPDATE bytes = VALUES(bytes);

Note: ON DUPLICATE KEY UPDATE had numerous bugs prior to MySQL 5.0.38, ensure you’re on a recent version before mimicking this query with your own data.

This query looks at the past two days of traffic, either inserting new records or updating existing ones. The AND timestamp BETWEEN date_sub(date(NOW()), INTERVAL 2 DAY) AND date(NOW()) clause ensures we’re looking at full days rather than the last 48 hours (the latter would produce incorrect summaries for whole days). This keeps the summary table up to date throughout the day and ensures that yesterday’s data is correct as well.
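
For the ON DUPLICATE KEY UPDATE to kick in, sum_usage_daily needs a unique key covering the grouping columns. Roughly, the table is shaped like this (a sketch inferred from the query above rather than our actual DDL; the column types and engine are guesses):

mysql wonder_proxy <<'SQL'
-- Sketch only: shape inferred from the INSERT above, not the real schema.
CREATE TABLE sum_usage_daily (
  `user_id`   INT UNSIGNED    NOT NULL,
  `group_id`  INT UNSIGNED    NOT NULL,
  `server_id` INT UNSIGNED    NOT NULL,
  `date`      DATE            NOT NULL,
  `bytes`     BIGINT UNSIGNED NOT NULL,
  UNIQUE KEY `daily` (`user_id`, `group_id`, `server_id`, `date`)
) ENGINE=InnoDB;
SQL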

My only regret was changing some of the column names in the summary table. While “date” represents the contents of the column better than “timestamp”, it did mean that column references in code had to be changed rather than just switching the table reference. Other than that the conversion has been quite quick and painless.

Having reined in the performance of the site, it’s time to look at adding new site features, and a few new products. More on those later.

January 14, 2011

HOWTO: Managing 30+ Servers

Filed under: Uncategorized — Will Roberts @ 2:19 pm

When we started out we only had a handful of servers, so I was doing each setup by hand and manually applying each change to each server’s configuration. That meant I was spending an hour or more on each new setup, then probably 30-45 minutes for each change depending on its complexity. The setup time doesn’t really have a scalability issue, though it does mean I can’t be doing something else at the same time. The bigger issue is rolling out a change to all our active servers; a 5 minute change suddenly becomes a 2.5 hour chore when you’ve got 30 servers.

After about 15 or so servers we reached a tipping point where I realized I was going to need a more automated mechanism for setting up servers and rolling out changes. I don’t know all the ins and outs of Bourne shell scripting, but I’ve managed to create some pretty creative scripts over time, so that’s where I started. Pushing trivial updates out to existing machines is now a matter of running a script once for each server, and since we know all our hostnames, we can just loop over the hosts, running it on each one in turn. I’ve toyed with the idea of running the scripts in parallel (there shouldn’t be an issue), but for the moment I’ve left them in serial so I can see the result for each box in turn.

#!/bin/sh
# Expects $QUERY to be set by the calling script; it selects which hostnames to act on.

MYSQL_HOST="mysql.database"
SERVERS=`mysql --host=$MYSQL_HOST -u oursupersecretuser -s --skip-column-names -e "$QUERY" wonder_proxy`

# Host status file from our monitoring; a "2" after a hostname marks it as down.
export HOST_STATUS=/home/lilypad/billing/host_status.txt

SCRIPT=$1
shift

for i in $SERVERS
do
  if [ -S /home/lilypad/.ssh/master-wproxy@$i:22 ]; then
    : # master tunnel is already up, nothing to do
  elif [ `grep -c "$i 2" $HOST_STATUS 2> /dev/null` -eq 0 ]; then
    # No tunnel yet and the host isn't marked down: start a background master connection.
    nohup ssh -MNf $i > /dev/null 2> /dev/null
  fi
  # Run the requested script against this host, passing along any extra arguments.
  $SCRIPT $i $*
done

The script above is the basic loop I use to run my other scripts on our machines. The MYSQL_HOST variable lets us migrate the database from one box to another more easily, which has already happened (and was an absolute pain the first time). The custom query allows this script to be called by other scripts that only select certain portions of our network. Once we have the list of hosts, we ensure that each host has an active SSH tunnel, or attempt to start one if the host isn’t known to be down. Then the script is executed with the hostname and all extra arguments.
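
A caller like run_all_vpn.sh then only needs to pick its slice of the network and hand things off; something along these lines would do it (a hypothetical sketch: the generic script’s name, and the table and column names in the query, are made up for illustration):

#!/bin/sh
# Hypothetical caller: select only the VPN-capable hosts, then reuse the loop above.
QUERY="SELECT hostname FROM servers WHERE vpn_capable = 1"   # table/columns are illustrative
export QUERY
exec ./run_all.sh "$@"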

The scripts are all fairly simple, and I tend to reuse/mangle them for other uses as needed, but here’s an example:

#!/bin/sh

scp /data/proxy-setup/ipsec/etc/ipsec.conf $1:~/
ssh $1 sudo cp ipsec.conf /etc/
ssh $1 rm ipsec.conf
ssh $1 sudo /etc/init.d/ipsec restart

Pretty simple, but it’s nice not to repeat those 4 commands 37 times when I make a tiny change. So in order to push that tiny change I’d end up just running:

./run_all_vpn.sh ./push_ipsec_conf.sh

The ssh command we use in the first script allows multiple SSH connections to flow over the same TCP connection. This reduces the cost of initiating the TCP handshake as well as the SSH handshake for exchanging keys. The flags as explained by the man page:

-M
    Places the ssh client into “master” mode for connection sharing. Multiple -M options places ssh into “master” mode with confirmation required before slave connections are accepted. Refer to the description of ControlMaster in ssh_config(5) for details.
-N
    Do not execute a remote command. This is useful for just forwarding ports (protocol version 2 only).
-f
    Requests ssh to go to background just before command execution. This is useful if ssh is going to ask for passwords or passphrases, but the user wants it in the background.
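
The master sockets themselves come from a ControlPath setting: the socket name the loop script looks for (master-wproxy@host:22) suggests something along these lines in ~/.ssh/config (a sketch of such a setup rather than a copy of our actual config):

cat >> /home/lilypad/.ssh/config <<'EOF'
Host *
  # ssh -M creates the shared socket at this path (%r = remote user, %h = host,
  # %p = port); later ssh and scp runs with the same ControlPath reuse it.
  ControlPath ~/.ssh/master-%r@%h:%p
EOF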

January 11, 2011

HOWTO: Speedy Server Setup

Filed under: Uncategorized — Will Roberts @ 2:06 pm

We tend to expand in bursts, so it’s helpful if I can be configuring multiple servers at once instead of dedicating an hour to one server, then another hour to the next. The most difficult part is removing all the unneeded packages from the boxes; installing the packages we want and configuring them is barely a quarter of the current setup script. Since we deal with so many hosts, producing an image they could create the server from isn’t exactly convenient; it’s been easier to take what they give us and work from there.

The first thing we need to do is set up SSH key access to the new machine so that the rest of the install doesn’t need someone entering passwords. There might be a simpler way, but this is what we’ve got at the moment:

cat /home/lilypad/.ssh/id_rsa.pub | ssh root@$HOST "tee /dev/null > foo; mkdir .ssh 2> /dev/null; chmod 700 .ssh; chmod 600 foo; mv foo .ssh/authorized_keys"

So we pipe the SSH key over the SSH connection, write it to a file, make the .ssh directory, and then move the file to the correct location. At this point we have easy SSH access to the machine, and we actually maintain active SSH master tunnels to all the machines on the network to reduce connection lag when running scripts. More on how we do that in my next post.

Our first step on the new machine is to remove any software we explicitly know we don’t want and that will cause issues for our configuration. Things like Apache get nuked so they don’t collide with the ports on which we run Squid. Then we update all the software on the box to the newest versions available in Debian 5 (a few of our boxes still start as Debian 4), then make the transition to Debian 6. At this point we still don’t have any of “our” packages installed, so we start removing unneeded packages with a fairly simple set of rules:

  1. If the package is on our whitelist of known needed packages, leave it.
  2. If the package is on our blacklist of known unneeded packages, remove it.
  3. If removing the package will only remove it and no other packages, remove it.
  4. Ask!

Here’s the part of the script that handles those rules. The packages and packages-blacklist files are just lists of package names.

for i in `dpkg -l | sed -n 's/^ii  \([^ ]*\).*/\1/p'`
do
  # Rule 1: whitelisted packages are kept.
  grep "^$i$" setup/packages > /dev/null
  if [ $? -eq 0 ]; then
    echo KEEPING: $i
    continue
  fi

  # Rule 2: blacklisted packages are purged without asking.
  grep "^$i$" setup/packages-blacklist > /dev/null
  if [ $? -eq 0 ]; then
    echo PURGING: $i
    apt-get -y purge $i
    continue
  fi

  # Anything with "linux" in its name (e.g. the kernel) is never removed automatically.
  echo $i | grep -v linux > /dev/null
  if [ $? -ne 0 ]; then
    echo ASKING: $i
    apt-get purge $i
    continue
  fi

  # Rule 3: if a dry run shows the removal only touches this one package, purge it.
  if [ `apt-get -s -qq remove $i | grep ^Remv | wc -l` -eq 1 ]; then
    echo PURGING: $i
    apt-get -y purge $i
    continue
  fi

  # Rule 4: anything else, ask.
  echo ASKING: $i
  apt-get purge $i
done

At this point it’s fairly rare that I get asked whether a package should be removed, since I update the lists any time a new package is encountered. Once that’s done, we copy in our custom config files for each package and restart the programs as needed. The install can run unattended and takes anywhere from 30-60 minutes depending on the speed of the downloads and the power of the machine, and I can run several at once with little trouble.

January 6, 2011

Squid Log Parsing for Proxy Billing

Filed under: Uncategorized — Paul Reinheimer @ 7:31 pm

Parsing logs from Squid is a routine task for us; we do it on a regular basis (currently hourly) to bill our customers for used traffic. The process involves a script on our central server connecting to each remote server in turn, rotating the logs on that machine, then pulling the old log down to the local system. The system silently ignores servers that are down; we’re already kept up to date about outages by Nagios. The script will warn if it’s able to connect but anything goes awry (permission errors, premature disconnect, etc.). By rotating the logs with a central script, rather than automatically on each machine, we mitigate the case of a log being rotated but not transferred back to the central server (due to a down or unreachable server).
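
Given a list of proxy hostnames in $SERVERS, each pass looks roughly like this (a simplified sketch rather than the actual script; the log path and the use of squid -k rotate are assumptions based on a standard Squid setup):

for host in $SERVERS; do
  # Ask Squid on the proxy to rotate its logs, then pull the rotated file back
  # for parsing; the real script also warns when a reachable host misbehaves.
  ssh $host sudo squid -k rotate && \
    scp $host:/var/log/squid/access.log.0 logs/access-$host-`date +%Y%m%d%H`.log
done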

WonderProxy Log Handling

The log parser is called with a single parameter, the path to the log in question; from that it determines the name of the server the logs originated from. Keeping most of the scripts largely oblivious to the source of their data has allowed us to expand our network significantly without any changes to the scripts (the most recent change was moving from a 4 hour rotation cycle to a 1 hour cycle). The logs we receive from Squid look like this:

1294011828.214   1309 184.163.123.123 TCP_MISS/200 10032 GET http://wonderproxyblog.com/ paul DIRECT/76.74.254.120 text/html
1294011828.414    581 184.163.123.123 TCP_MISS/404 5564 GET http://wonderproxyblog.com/.well-known/host-meta paul DIRECT/72.233.2.58 text/html
1294011828.784      0 184.163.123.123 TCP_MEM_HIT/200 943 GET http://s0.wp.com/wp-content/themes/h4/global.css? paul NONE/- text/css
1294011828.787      0 184.163.123.123 TCP_MEM_HIT/200 791 GET http://s2.wp.com/wp-includes/js/l10n.js? paul NONE/- application/x-javascript
1294011828.795      0 184.163.123.123 TCP_HIT/200 9024 GET http://s.gravatar.com/js/gprofiles.js? paul NONE/- text/javascript
1294011828.858     66 184.163.123.123 TCP_MISS/200 2158 GET http://b.scorecardresearch.com/beacon.js paul DIRECT/96.17.156.19 application/x-javascript
1294011828.871      0 184.163.123.123 TCP_MEM_HIT/200 1511 GET https://ssl-stats.wordpress.com/w.js? paul NONE/- application/x-javascript
1294011828.928    139 184.163.123.123 TCP_MISS/200 2749 GET http://edge.quantserve.com/quant.js paul DIRECT/64.94.107.11 application/x-javascript
1294011828.997    205 184.163.123.123 TCP_HIT/200 27230 GET http://s1.wp.com/wp-includes/js/jquery/jquery.js? paul NONE/- application/x-javascript
1294011829.091    334 184.163.123.123 TCP_MISS/200 414 GET http://wordpress.com/remote-login.php? paul DIRECT/74.200.247.60 text/html

We then parse that output with a regular expression (regex):

(?P<timestamp>\d+)\.\d{3}\s+-?\d+ (?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) (?P<cache>\w+)\/(?P<httpresponse>\d+) (?P<size>\d+) (?P<uri>.+) (?P<user>\S+) (?P<method>[A-Z]+\/\S+).+!\n

The regular expression names its parameters, giving us an easy set of data to work with inside our billing script. The regex isn’t perfect: the system occasionally runs into a line it doesn’t parse properly, which it emails back in a report. Originally this happened frequently, so the regex was tweaked (inside the RxToolkit of Komodo IDE); it’s now quite rare. As you can see, there’s an incredibly large number of lines for just a partial page load (me loading the blog front page), and that doesn’t include the numerous requests required to populate the Google Maps widget from a recent post, or any of the images used on the page.

We don’t require this level of granularity for billing, nor do we necessarily want it to be easily accessible. To turn this into a slightly more manageable log entry we merge requests: all requests by a given user through a given proxy server, for a given hour, are merged into a single row to be inserted into the database, containing the user’s ID, the user’s account number, the server ID (from our network), the traffic in bytes, and the timestamp of the hour. This gives us a more manageable usage table; the raw log files are merged with other files from the same server on the same day and archived. While we’d like to just get rid of them (they’re just gathering dust and occupying hard drive space), many jurisdictions require record keeping from providers, and we’d like to be able to fully account for usage for our customers upon request.
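
Conceptually the hourly merge boils down to something like this (an awk illustration rather than our actual billing code; the field numbers follow the log lines above: $1 is the Unix timestamp, $5 the size in bytes, $8 the username, and the server is known from the file being parsed):

awk '{
  hour = int($1 / 3600) * 3600   # truncate the request time to the top of the hour
  bytes[$8 "\t" hour] += $5      # sum bytes per user per hour
}
END {
  for (key in bytes) print key "\t" bytes[key]
}' access.log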

To come back to the log file for a moment, I’d like to look at one piece of data in particular:

1294011828.871      0 184.163.123.123 TCP_MEM_HIT/200 1511 GET https://ssl-stats.wordpress.com/w.js? paul NONE/- application/x-javascript
1294011828.928    139 184.163.123.123 TCP_MISS/200 2749 GET http://edge.quantserve.com/quant.js paul DIRECT/64.94.107.11 application/x-javascript
1294011828.997    205 184.163.123.123 TCP_HIT/200 27230 GET http://s1.wp.com/wp-includes/js/jquery/jquery.js? paul NONE/- application/x-javascript

What our regular expression terms cache, the Squid docs term Squid Result Codes; it generally indicates where Squid got the resource. In the case of a miss, Squid had to retrieve the resource from the URL in question. A TCP_HIT indicates the resource was cached, while a TCP_MEM_HIT indicates it was in the cache and still in memory, avoiding a trip to the disk. In our experience the list in the Squid docs is non-exhaustive (TCP_REFRESH_UNMODIFIED doesn’t appear, for example), so some research and testing was necessary as we put the system into use. We use this information to determine whether to record the size at its original value or to double it (as a server proxying a 100KB resource must download it from the source server, then upload it to the end user).
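
As a rough illustration of that last rule (again awk rather than our billing code; treating any result code containing HIT as served from cache is a simplification here):

awk '{
  split($4, result, "/")              # e.g. "TCP_MISS/200" -> "TCP_MISS"
  hit = (result[1] ~ /HIT/)           # crude hit test for the sketch
  billed[$8] += (hit ? $5 : $5 * 2)   # double anything fetched from the origin
}
END {
  for (user in billed) print user "\t" billed[user]
}' access.log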

We’ve considered technologies like Spread to give us near real time logs from across the network, but we haven’t really seen the need. None of our customers are asking for it, and the need to go read old log files for detailed information has only come up twice, both times well after the fact.

January 2, 2011

Buying VPS Systems

Filed under: Uncategorized — Paul Reinheimer @ 1:44 pm

One thing we’ve managed to accrue a lot of experience with is dealing with VPS providers. In an ideal world we’d probably be able to buy thirty servers from a single provider, but that’s not the case. Most of our providers give us a single server to work with, so we’re managing lots of accounts in parallel. We’ve got some great providers that month after month give us zero issues to deal with, sketchy providers where we had to get PayPal or our credit card issuer involved, and some odd cases, like the provider that went down between our deciding to order and finding the credit card to pay with.

Finding Candidates

Finding a VPS in a specific location can be somewhat challenging. The hosting market is extremely competitive and flooded with Google Ads on any relevant term; a few sites have been so well optimized for search engines that they appear even when irrelevant. Searching for generic phrases like “hosting in denmark” is likely to yield a lot of basically-spam sites that link to companies paying a referral fee, regardless of where those companies actually host. The most effective method we’ve found is to use Google Maps to search for hosting providers in the appropriate location.

Due Diligence

Having already been burned numerous times, we then do a few searches to look for reviews of the provider. Web Hosting Talk has a popular set of forums and a lot of active users; it’s a great place to start. There are also a lot of… less experienced users there posting based more on their own incompetence than on the provider, so beware. One thing I do watch out for is hosting providers posting details of a user’s account in response to a complaint. While I understand their desire to defend themselves, I generally feel this shows a lack of professionalism on the part of the provider, so I steer clear.

It’s also a good idea to look at how long a company has been in business. While every company does need to start somewhere, we’ve already been around long enough to see several hosts come and go. Finding some history of posts on WHT is helpful; finding a relatively short period of heavily discounted coupon posts is probably an indicator that they’re desperate to get their first customers (and revenue) through the door.

Looking for Coupons

Lots of providers run sales nearly constantly, so we do a few quick searches on the web as a whole and on Web Hosting Talk specifically. We’ve managed to knock some costs down significantly by signing up with coupons.

Signing Up

It’s a pretty basic process, though a few providers do manage to make it difficult (with forms that don’t re-fill after an error, multi-step processes, or email-based validation mid-registration). We’ll take a bare-bones Debian 5 64-bit machine whenever we can get it.

Favorite Providers

It doesn’t take much for us to like you: give us the box quickly, give us the operating system we asked for, keep an accurate clock, and don’t have serious downtime. That said, our list of favourite providers isn’t terribly long (we’re only listing providers we’ve been with for at least 6 months and have had no serious issues with).

VPSVille – Our VPS in Toronto has been a solid part of the network, no issues since we procured it.
GPLHost – We have several locations with GPLHost, they’re all reliable, we appreciate being able to set up our box with a minimal Debian install.
Slicehost – Solid provider, easy to set up and get going.
Cool Housing – Reliable provider.
Gandi – Their payment system is tricky and leaves a bunch to be desired, but the VPS has been solid.
MyHost.ie – Reliable provider in Ireland.

Providers We’ve Left with Prejudice

Spry – Migrated our systems to a different city unapologetically and without notice. They then blocked ports that we used to monitor our systems’ health, rendering our monitoring infrastructure useless.
Delimiter – Stopped answering support tickets; cities would just go dark, either for days at a time or permanently. If you compare their offered locations six months ago to the present, you’ll notice a lot of omissions.
Enotch – Completely failed to actually set up our VPS; we ended up requesting a refund and using a different provider.
