Building McLaren.com – Part 2: Serving Telemetry

I’ve just finished working on McLaren’s new F1 site, http://mclaren.com/home, for the 2010 season, at Pirata London, for Work Club.

I’ll be writing up what we’ve done here in several parts. Sign up for my RSS feed to keep updated.

Part two covers the telemetry panel, known as “The Race 1.0b”. Technically, I think this is the most interesting section of the site.

Choosing a solution

Before I started on this job, I spent a week scouting around for technologies and platforms to support what we needed.

We needed to distribute our data feed on a second-by-second basis to thousands of viewers. Although we weren’t sure exactly how many visitors we’d need to support, there was a chance we’d get a mention on television, so we had to expect visitors in the tens of thousands. We’d also need to keep the latency down: there’s no point broadcasting stats that lag noticeably behind the television signal. This would very much be a companion to the television programme.

The most sensible way to receive a live feed is using Comet. This is similar to the familiar Ajax, with one big difference. When you request a data packet with Ajax, it will return as soon as it can. It might say “here’s data”, or it might say “I have no data yet”. The second of these cases is useless – you’d have to go and ask again. The process of continually requesting data on a timed basis is known as polling. It puts a big load on the server, which needs to waste execution time dealing with “I have no data” requests.

Comet is different, because the client never receives an “I have no data” response. In that case it doesn’t get a response at all. It just sits and waits, and only gets a response when data comes through. This action is called a “push”, because the server triggers an action at the client, which is an unusual operation for HTTP. This is Comet’s key difference.

Comet has two flavours. If the response terminates the connection when data comes through, then the client needs to request the next packet straight away. This is known as “long-polling” Comet. It has the familiar polling action that we had with Ajax, but far lower latency, because the packet comes through as soon as it’s available. There is no waiting out a one-second polling interval. The other flavour of Comet is “streaming”, because it drip-feeds the data into an open connection. The new HTML5 web sockets would probably fall under this heading, though as I understand it they open with an HTTP handshake and then run as a persistent two-way connection over TCP. I would love to have streamed the data, but reading about it online showed that I would have numerous problems with proxies and hubs that wait for a response to complete before passing on results. Essentially, streaming is not thought to be ready for production yet.
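To make the long-polling loop concrete, here is a minimal Python sketch of a client. It is an illustration rather than the code running on the site: the URL is the local test endpoint used later in this post, and `fetch_once` and `long_poll` are names I’ve made up for the example.

```python
import urllib.request
from typing import Callable, Optional

SUBSCRIBE_URL = "http://localhost/feed/subscribe"  # assumed local test endpoint


def fetch_once(url: str = SUBSCRIBE_URL, timeout: float = 60.0) -> Optional[bytes]:
    """Block until the server answers; return None if the wait fails or times out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except OSError:  # URLError and socket timeouts are both OSError subclasses
        return None


def long_poll(fetch: Callable[[], Optional[bytes]],
              handle: Callable[[bytes], None],
              max_requests: int) -> int:
    """Issue requests back to back, passing each packet to `handle`."""
    delivered = 0
    for _ in range(max_requests):
        packet = fetch()   # sits and waits; there is no "I have no data" reply
        if packet is None:
            continue       # the wait expired or failed: just reconnect
        handle(packet)
        delivered += 1
    return delivered
```

Calling `long_poll(fetch_once, print, max_requests=10)` would print the next ten packets as they arrive. A production client would probably also want a short back-off after errors, so a dead server isn’t hammered with reconnects.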

I chose long-polling Comet for McLaren, and that simply left the question of implementation.

Long-polling Comet: software

If you’re dealing with a server under high load, you’re bombarded with thousands of requests a second. It was clear that I’d need the lightest possible solution to the problem: a web server which can handle the C10k problem, and hopefully do significantly better than that. Apache, as typically configured, ties up a thread or process for every open connection, so in anyone’s book it won’t be able to handle this kind of load.

The nginx (engine-x) webserver is an ideal choice for performance. Webfaction have some oft-quoted stats on its performance, and it can even show improvement when proxying to Apache. Last year I’d read on Simon Willison’s blog that nginx has a push module.

The nginx http push module (nhpm) is a publisher/subscriber module for nginx. It serves as a hub which can receive POST requests, and then distribute them to a large number of subscribers waiting on GET requests. Naturally, the POSTs can be filtered by IP. The logical demo for any Comet system is a chatroom, which is duly presented on the module’s site. What I needed was something far more basic, because everyone receives the same data, but it would do.
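The publisher/subscriber idea is easy to model in a few lines. The sketch below is a toy in-process hub in Python, not how the nginx module is implemented: `BroadcastHub` and its methods are hypothetical names, and it only keeps the latest message. It does show the important behaviour, which is that subscribers block until a publish wakes all of them at once.

```python
import threading
from typing import Optional


class BroadcastHub:
    """Keeps the latest message and wakes every waiting subscriber on publish."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._version = 0                      # bumped once per published message
        self._message: Optional[bytes] = None

    def publish(self, message: bytes) -> None:
        with self._cond:
            self._message = message
            self._version += 1
            self._cond.notify_all()            # everyone gets the same data

    def subscribe(self, seen_version: int, timeout: float = 30.0) -> Optional[bytes]:
        """Block until a message newer than `seen_version` exists, or time out."""
        with self._cond:
            arrived = self._cond.wait_for(
                lambda: self._version > seen_version, timeout
            )
            return self._message if arrived else None
```

A subscriber that has seen nothing yet calls `hub.subscribe(0)` and hangs until someone calls `hub.publish(b"...")`, which mirrors a GET waiting on a POST.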

Another option was to use node.js, which has apparently now been used to support a million Comet users for Plurk. Despite it performing reasonably well in tests, I decided against node because I didn’t have the time to build and test a webserver myself before launch. I tend to think there’s more at work in a robust webserver than first appears. Nginx is mature, and the push module is several iterations in.

Let’s just go ahead and install nginx on our Linux variant. This will install on a Mac natively, but if you want to do the later performance testing chapters, I recommend getting another Linux box going. You can set one up with Amazon cheaply, or with Rackspacecloud if you find all that certificate stuff confusing.

Installing nginx with the push module:
If you’re not connecting as root, switch to a root shell

  sudo bash

Download and untar nginx and the nginx push module

  curl -O http://www.nginx.org/download/nginx-0.7.65.tar.gz
  curl -O http://pushmodule.slact.net/downloads/nginx_http_push_module-0.692.tar.gz
  tar -xzvf nginx-0.7.65.tar.gz 
  tar -xvzf nginx_http_push_module-0.692.tar.gz

nginx requires PCRE (regular expressions library) and OpenSSL

  curl -L -O http://downloads.sourceforge.net/pcre/pcre-8.01.tar.gz
  tar -xzf pcre-8.01.tar.gz
  apt-get install libssl-dev  #ubuntu
  yum install openssl-devel  #redhat/fedora

Now we compile and install nginx

  cd nginx-0.7.65
  ./configure --add-module=../nginx_http_push_module-0.692 --with-http_flv_module --user=apache --group=apache --with-http_gzip_static_module --with-pcre=../pcre-8.01
  make && make install

Now nginx is installed, we can pop off and edit the nginx.conf file.

  vi /usr/local/nginx/conf/nginx.conf

The nginx config file is a thing of beauty. If you’re used to Apache configs, this will be like upgrading from PC to Mac. Difficult to switch but well worth the effort.

For now, just grab my simple version of the nginx conf.

If you haven’t got a user already, let’s create one, somewhat confusingly named “apache”, though you can choose your own name if you want.

  groupadd apache
  useradd -c "Apache Server" -d /dev/null -g apache -s /bin/false apache

And start the server:

  /usr/local/nginx/sbin/nginx

(you can stop with):

  /usr/local/nginx/sbin/nginx -s stop

Remember it’s listening on port 80, so make sure it’s not conflicting with Skype or Apache.

If you send a POST to /feed/publish, it will be broadcast out to anyone hanging on /feed/subscribe.
Let’s test it in a browser.
Open http://localhost/feed/subscribe
It will hang on and wait.

Now create a small HTML page that POSTs to /feed/publish

<form action="http://127.0.0.1/feed/publish" method="POST">
  <textarea name="body">content</textarea>
  <input type="submit"/>
</form>

Hit Submit, and … nothing will happen.

Closer inspection with Firebug shows that you’ve made an OPTIONS request instead of a POST.
This is the browser’s restriction on cross-domain posting at work.

Fortunately, I allowed for this in the conf file. Move your HTML file to /usr/local/nginx/html/test.html
You can now view it on your local server as http://localhost/test.html
Hit Submit, and the GET request that was hanging on will now instantly respond with a packet of data for you to download.

We have a pub/sub hub running!

Another option is to POST using PHP or similar scripting language, which doesn’t have the browser cross-domain scripting security block.
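As a rough illustration of publishing from a script, here is a Python sketch using only the standard library. The URL is the same local publish path used in the test form, and `build_publish_request` and `publish` are hypothetical helper names. I’m also assuming the module will accept a plain text body, so check that against your own config.

```python
import urllib.request

PUBLISH_URL = "http://127.0.0.1/feed/publish"  # assumed: same path as the test form


def build_publish_request(payload: bytes, url: str = PUBLISH_URL) -> urllib.request.Request:
    """Build the POST separately so it can be inspected without a running server."""
    return urllib.request.Request(
        url,
        data=payload,
        method="POST",
        headers={"Content-Type": "text/plain"},
    )


def publish(payload: bytes, url: str = PUBLISH_URL, timeout: float = 5.0) -> int:
    """Send the payload to the hub and return the HTTP status code."""
    request = build_publish_request(payload, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

With nginx running locally, `publish(b"speed=312")` should release any browser tab that is hanging on /feed/subscribe. Because the script is not a browser, no cross-domain check gets in the way.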

You will want to lock down your POST url, because you don’t want everyone being able to broadcast to your users!
Simply change this bit of the conf file:

	location /feed/publish {
		allow 127.0.0.1;  # deny public posting - only allow from this IP
		deny all;
		...
	}

All the nginx http push module needs is a little memory to work. It barely scratches the CPU at all. Let me show you the CPU usage graph from the first race using the telemetry.

Long-polling Comet: system

When configuring a Comet server for performance, everyone turns first to Richard Jones’s awesome posts, A Million-user Comet Application with Mochiweb. I followed his advice on server configuration.

Specifically, I do the following:

ulimit -n 999999 # increase the number of available file handles
echo "1024 65535" > /proc/sys/net/ipv4/ip_local_port_range # add more ephemeral ports

Of course, your ulimit change won’t persist, so you need to edit the limits conf:

vi /etc/security/limits.conf

Insert these lines in the document somewhere:

*      hard nofile 999999
*      hard nproc 999999
*      soft nofile 999999
*      soft nproc 999999

(ensure there are no spaces before the asterisks).
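Once you’ve logged in again, it’s worth checking that the new limits actually took effect. Here is a small Python sketch that does so, using the standard `resource` module (Unix only). The helper names are my own, and the port range string is just the one written to /proc above.

```python
import resource


def describe_limit(value: int) -> str:
    """Render an rlimit value, which may be the special 'no limit' marker."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def ephemeral_port_count(range_text: str) -> int:
    """Count the ports in an ip_local_port_range string such as '1024 65535'."""
    low, high = (int(part) for part in range_text.split())
    return high - low + 1


# Limits for the current process; a server started from this shell inherits them.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open files: soft =", describe_limit(soft), "hard =", describe_limit(hard))
print("ephemeral ports:", ephemeral_port_count("1024 65535"))
```

If the soft limit still shows a low default such as 1024, the limits.conf change hasn’t been picked up by your session yet.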

Long-polling Comet: hardware

We had three heavy boxes set up to run the site. However, we were very keen to keep the site running regardless of visitors accessing our data feed. If anything, we’d rather lose the telemetry than the site itself, so it didn’t make sense to mix hardware between the two functions.

Two boxes were kept to run the site then, under a load balancer. The third box was moved out, and runs the feed server alone.

The next problem was the firewall. Our firewall is hardware-limited to 10,000 concurrent users, which is great for a standard site running simple subsecond requests, but utterly useless for any kind of Comet connection. For another 300 UKP the firewall could go to 5,000 more concurrents, but it was still a huge limiting factor. So we juggled it to get the best connection possible for the feed server.

Since we’re running one box, we could also move out from behind the load balancer. We now can’t run on the same domain, so we created a new subdomain for the server.

Heading back to original calculations, we then looked at the next perceived limit: bandwidth.

Suppose our packet size is 1 kilobyte, including any HTTP headers, and 1 packet is delivered to each user per second.
Our network card is 100Mbps (megabits per second), which works out as roughly 10MBps (megabytes per second) of usable throughput.
How many users can we support?

Simple maths: 10MBps / 1kB = 10,000 packets per second.

So, we can’t move beyond 10,000 concurrent users anyway, simply because the bandwidth is a limiting factor, even with our small packets.
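The estimate above is easy to replay for other link speeds. This Python sketch reproduces it; the function name is made up, and the 80% usable-throughput figure is my assumption, chosen because it turns the raw 12.5MBps of a 100Mbps link into the round 10MBps used here.

```python
def max_concurrent_users(link_mbps: int,
                         packet_bytes: int = 1000,
                         packets_per_second: int = 1,
                         efficiency_percent: int = 80) -> int:
    """Estimate how many users a link can feed at a steady packet rate.

    Integer arithmetic keeps the result exact. The efficiency figure is an
    assumed allowance for protocol overhead, not a measured value.
    """
    usable_bytes_per_second = link_mbps * 1_000_000 * efficiency_percent // (8 * 100)
    return usable_bytes_per_second // (packet_bytes * packets_per_second)


print(max_concurrent_users(100))   # the 100Mbps card
print(max_concurrent_users(1000))  # a Gigabit card
```

Under these assumptions the 100Mbps card tops out at 10,000 users, while a full Gigabit link would allow ten times that before something else becomes the bottleneck.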

So we now have a Gigabit card, and Gigabit switch. I’ve tested that up to 333Mbps, which is enough for at least 30,000 concurrents. Any more and we’ll need cloud servers to support the load (hopefully detailed in a later part).

Possible Alternatives

Having thought about all this, the system could be replaced with a simple memory lookup. The telemetry details could be stored in an array that is constantly overwritten, and the commentary stored in a queue in a way that you could ensure each requestor reliably received them all. I had a quick peek at using the nginx memcached plugin, which would let me serve values straight from memcached. It definitely had an appeal, though obviously performance wouldn’t get any better – since our limiting factor is the hardware supplying the bandwidth.
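To show what I mean by a memory lookup, here is a rough Python sketch. It is not something we built: `FeedStore` and its methods are hypothetical. Telemetry values are overwritten in place, while commentary lines get a sequence number so each requestor can ask for everything after the last one it saw.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


class FeedStore:
    """Latest telemetry plus a bounded, numbered queue of commentary."""

    def __init__(self, max_commentary: int = 100) -> None:
        self.telemetry: Dict[str, float] = {}
        self._commentary: Deque[Tuple[int, str]] = deque(maxlen=max_commentary)
        self._next_seq = 1

    def update_telemetry(self, values: Dict[str, float]) -> None:
        self.telemetry.update(values)   # old readings are simply overwritten

    def add_commentary(self, text: str) -> int:
        seq = self._next_seq
        self._commentary.append((seq, text))
        self._next_seq += 1
        return seq

    def commentary_since(self, last_seen: int) -> List[Tuple[int, str]]:
        """Return every stored commentary line newer than `last_seen`."""
        return [(seq, text) for seq, text in self._commentary if seq > last_seen]
```

A client would remember the highest sequence number it has received and pass it back on the next request. Anything older than the queue length is lost, so the queue needs to be sized for the slowest client you care about.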

In the next part, I’ll look at the feed itself, and how it’s processed on the front-end.