Building McLaren.com – Part 4: Load testing

I’ve just finished working on McLaren’s new F1 site, http://mclaren.com/home, for the 2010 season, at Pirata London, for Work Club.

I’ll be writing up what we’ve done here in several parts. Sign up for my RSS feed to keep updated.

Part four covers load testing the telemetry data broadcast by the nginx HTTP push module.

Using Apache Bench (ab)

Apache Bench (ab) is our preferred load-testing tool.

Load testing is a stressful exercise for any server. I strongly recommend creating a completely new Amazon EC2 server to run load tests from.

Do NOT run ab from your local machine. I know you're trying to save effort, but it's not about power. PCs and Macs are desktop machines, and will just throw errors at a pathetically small load, even when testing against themselves!

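ab may not be installed by default. On Linux, it can be installed with one of the following commands, depending on your distribution (on newer Red Hat-style systems, the package that provides ab is httpd-tools):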

yum install httpd-devel

or

apt-get install apache2-utils

Now type ab for help.

You'll need to increase your open files limit, as we did on the feed server.

ulimit -n 999999
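
It may be worth confirming that the new limit actually took effect, since a non-root shell usually cannot raise its limit above the hard limit. This is only a minimal sketch: the helper name check_fd_limit is my own invention, not part of ab.

```shell
# check_fd_limit is a hypothetical helper: it reports whether an open-files
# limit is high enough for the planned concurrency. Each concurrent ab
# connection needs one file descriptor, plus a few for ab itself.
check_fd_limit() {
    wanted=$1                      # planned ab concurrency (the -c value)
    limit=$2                       # limit to check, e.g. "$(ulimit -n)"
    if [ "$limit" = "unlimited" ]; then
        echo "ok"
    elif [ "$limit" -gt "$wanted" ]; then
        echo "ok"
    else
        echo "too low: limit=$limit, need more than $wanted"
    fi
}

# Compare this shell's current soft limit against a planned -c 5000.
check_fd_limit 5000 "$(ulimit -n)"
```

If this prints "too low", run the ulimit command as root (or raise the hard limit) before starting ab.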

A good test (in my experience) is:

ab -n 100000 -c 5000 -k -r http://server/feed/subscribe

Some important notes before you do this:

  1. If you're not posting to the publish URL on that box, all the ab connections will just hang and wait. Make sure you're publishing.
  2. ab is requesting a lot of documents, and it'll get them. This will take a lot of bandwidth. Check your charges.
  3. The "-r" switch is not supported by the RedHat version of ab. Just take it away if it complains.
  4. Don't run this against other people's servers! It's mean and stupid. And they could conceivably take you to court for a denial-of-service attack.
  5. Don't run this against a live site! Not only is that stupid, but you could also be prosecuted under terrorism legislation for denial-of-service attacks. Your IP will get blacklisted, and you will be shunned by the wider Internet community, forced to eke out some sort of existence in lower IRC chatrooms for the rest of time. Don't do it, kids.
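
For note 1, something needs to keep publishing while ab is running. A rough sketch of how that could be done with curl follows. The function name publish_loop and the URL in the usage comment are assumptions of mine; use whatever publish location you configured on your own feed server.

```shell
# publish_loop is a hypothetical helper that POSTs a small message to the
# publish URL at a fixed interval, so that waiting subscribers are released.
# Arguments: URL, number of messages, seconds to sleep between messages.
publish_loop() {
    url=$1
    count=$2
    interval=$3
    i=1
    while [ "$i" -le "$count" ]; do
        # -s: quiet, -o /dev/null: discard the reply, -d: send a POST body.
        curl -s -o /dev/null -d "test message $i" "$url"
        i=$((i + 1))
        sleep "$interval"
    done
}

# Example (assumed URL; run it in the background, then start ab):
# publish_loop http://server/feed/publish 600 1 &
```

One message per second roughly matches the intervals in the sample report further down, but pick whatever rate your real publisher uses.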

ab uses the following parameters here:

  • n – number of requests
  • c – number of concurrent requests
  • k – enable 'keep-alive' (which avoids the overhead of establishing a new connection for every request)
  • url – the page we're requesting. You can always run curl <url> to see what this is.
  • r – ignore errors – can help when using very high numbers
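
Since -r is missing from some builds, one option is to add it only when ab's own help text lists it. This is a sketch under the assumption that the help output describes each switch on its own indented line (as the ab 2.3 builds I've seen do); ab_flags is a made-up helper name.

```shell
# ab_flags prints the optional flags to pass to ab: always -k, plus -r only
# when the supplied help text has a line describing a -r switch.
# Pass in the output of: ab -h 2>&1
ab_flags() {
    help_text=$1
    flags="-k"
    if printf '%s\n' "$help_text" | grep -q '^ *-r '; then
        flags="$flags -r"
    fi
    echo "$flags"
}

# Usage (this only prints the command line, it does not run a test):
# echo ab -n 100000 -c 5000 $(ab_flags "$(ab -h 2>&1)") http://server/feed/subscribe
```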

I’ve found the keep-alive is necessary, but I’m not sure if it’s fair to have it in there.

ab will present a summary report.

Failures and Errors

The 'failures' due to length shown below are common, and not really failures. They merely say that the length of the responses varies, and since our POSTs are of different lengths, this is OK.

Should you receive an apr_socket_connect(): Cannot assign requested address (99) or similar, wait a bit and start again. There's probably a router in the way trying to prevent DoS attacks, and I half suspect that it takes a lot longer than you think before the ports get properly closed.
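
One way to check the "ports not yet closed" suspicion on the load-testing machine is to count sockets still sitting in the TIME_WAIT state. This is my own addition rather than something we relied on at the time, and count_time_wait is a hypothetical helper name. It reads the output of ss or netstat on standard input, so it makes no assumptions about which tool is installed.

```shell
# count_time_wait reads `ss -tan` or `netstat -tan` style output on stdin and
# prints how many lines mention the TIME-WAIT (ss) or TIME_WAIT (netstat) state.
count_time_wait() {
    # grep -c exits non-zero when the count is 0, so mask that exit status.
    grep -c -E 'TIME[-_]WAIT' || true
}

# Usage on Linux (assumes ss is available; netstat -tan also works):
# ss -tan | count_time_wait
# cat /proc/sys/net/ipv4/ip_local_port_range   # range of local ports available
```

If the count is close to the size of the local port range, waiting for it to drop before the next run seems sensible.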

What happens when it gets hit too hard?

Two important factors seem to come into play.

One, the POST action starts taking longer, and can start to take longer than a second.

Two, requests seem to be served at slower intervals. It may be that users miss packets (missing telemetry is not so important, but missing comments are bad). If a user is not actively waiting when the new POST comes through, they will not get the update.

Example output:

This is ApacheBench, Version 2.3 <$Revision: 655654 $>

Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking feed.mclaren.com (be patient).....done


Server Software:        nginx/0.8.33
Server Hostname:        feed.servername.com
Server Port:            80

Document Path:          /feed/subscribe
Document Length:        776 bytes

Concurrency Level:      5
Time taken for tests:   29.421 seconds
Complete requests:      100
Failed requests:        70
   (Connect: 0, Receive: 0, Length: 70, Exceptions: 0)
Write errors:           0
Keep-Alive requests:    95
Total transferred:      100395 bytes
HTML transferred:       73605 bytes
Requests per second:    3.40 [#/sec] (mean)
Time per request:       1471.046 [ms] (mean)
Time per request:       294.209 [ms] (mean, across all concurrent requests)
Transfer rate:          3.33 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    2   5.5      0      21
Processing:    11 1469 2126.8   1041   10642
Waiting:        0 1469 2127.2   1041   10642
Total:         11 1471 2130.3   1041   10659

Percentage of the requests served within a certain time (ms)
  50%   1041
  66%   1043
  75%   1046
  80%   1047
  90%   1049
  95%  10657
  98%  10659
  99%  10659
 100%  10659 (longest request)
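
When comparing many runs, it can help to boil a report like this down to one line and to separate the harmless length 'failures' from everything else. The following is a sketch only; summarise_ab is a hypothetical helper, and it relies on the report layout shown above.

```shell
# summarise_ab reads an ab report on stdin and prints the headline figures,
# splitting failures into length-only mismatches and all other failures.
summarise_ab() {
    awk '
        /^Complete requests:/   { complete = $3 }
        /^Failed requests:/     { failed = $3 }
        /Connect:/ && /Length:/ {
            # Breakdown line: (Connect: 0, Receive: 0, Length: 70, Exceptions: 0)
            gsub(/[(),]/, "")
            for (i = 1; i < NF; i++) if ($i == "Length:") lenfail = $(i + 1)
        }
        /^Requests per second:/ { rps = $4 }
        END {
            printf "complete=%d failed=%d length_only=%d other=%d rps=%s\n",
                   complete, failed, lenfail, failed - lenfail, rps
        }
    '
}

# Usage: ab ... | summarise_ab
```

For the report above this gives other=0, meaning every reported failure was only a length mismatch.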
 

Data from NGINX

NGINX also provides figures, and these are likely to be more accurate than figures gathered by other means. When a POST is made, the response shows the number of active subscribers.
Our publishing system records these figures and displays them in the admin console.

queued messages: 0
last requested: 1 sec. ago (-1=never)
active subscribers: 60001
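
If you want that figure in a script rather than an admin console, the number can be pulled out of the response text. A small sketch, assuming the response looks exactly like the three lines above; active_subscribers is a made-up helper name and the URL in the usage comment is a placeholder.

```shell
# active_subscribers reads a publish response on stdin and prints the number
# that follows "active subscribers:". It prints nothing if the line is absent.
active_subscribers() {
    sed -n 's/^active subscribers: *\([0-9][0-9]*\).*$/\1/p'
}

# Usage (assumed URL; note that this POST is itself delivered to subscribers):
# curl -s -d "ping" http://server/feed/publish | active_subscribers
```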

Test Results

ab has a maximum limit on concurrent requests, which is a mere 20,000.

Since cloud servers are so easy to set up, and you can rely on their network connectivity, they make good machines to load-test with. When testing against our fairly seriously specced Rackspace feed server, I could run 3 machines, each making 20,000 concurrent requests, before I started losing packets at a serious rate. It could serve about 45,000 requests per second, each around a kilobyte.

Turning that around and testing against a cloud server (using the c1.xlarge instance type), we could reach only about a third of that before we saw similar delays/losses.

So, our final decision on load was the following:

  • Our main server can handle the first 20,000 concurrent viewers.
  • A new c1.xlarge server to be brought up for every subsequent 10,000 visitors.
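
That rule is simple enough to put in a script. Here is a minimal sketch of the arithmetic; extra_servers is a hypothetical helper name, and the two thresholds are the ones from the decision above.

```shell
# extra_servers prints how many additional c1.xlarge servers are needed for a
# given number of concurrent viewers, under the rule described above.
extra_servers() {
    viewers=$1
    base=20000        # viewers handled by the main feed server
    per_server=10000  # viewers handled by each additional c1.xlarge
    if [ "$viewers" -le "$base" ]; then
        echo 0
    else
        # Integer ceiling of (viewers - base) / per_server.
        echo $(( (viewers - base + per_server - 1) / per_server ))
    fi
}

# Usage: extra_servers 60000
```

For example, 60,000 concurrent viewers would need the main server plus four c1.xlarge servers.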

The client-side JavaScript is supplied with the URLs of every feed server, and picks one at random. We’ve seen an even distribution across several servers by this means.

The c1.xlarge servers cost 96 cents/hour. Given the kind of bandwidth we’re serving and the relatively short race duration, this is a very economical way to serve the load.
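
To put a rough number on that, the instance cost is just servers × hours × 0.96. The sketch below covers instance hours only and ignores bandwidth charges; race_cost is a made-up helper name, and the four servers and two hours in the usage line are purely illustrative.

```shell
# race_cost prints the instance cost in dollars for a number of c1.xlarge
# servers running for a number of hours, at the 96 cents/hour quoted above.
# Bandwidth charges are NOT included.
race_cost() {
    servers=$1
    hours=$2
    awk -v s="$servers" -v h="$hours" 'BEGIN { printf "%.2f\n", s * h * 0.96 }'
}

# Usage (illustrative figures): race_cost 4 2
```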

In the next post, I’ll show you the scripts I use to bring these cloud servers online.