A day in the life of atest failure

I love my job. Fact.

But some days are easier than others. If you’ve been reading my tweets this week, you’ll have sensed a certain frustration.

Don’t worry. None of this is your fault.

But here’s what happened.

Over the weekend it had occurred to me, while I was asleep at about 6am – seriously, this happens – it occurred to me that I’d forgotten to submit my latest test fixes before I merged my branch. This would screw up the branch for the weekend, for the whole team, and delay the deploy on Monday. Bad news.

As soon as I woke up, I popped open the laptop, preemptively apologised to all and sundry – as is the English way – and proceeded the fix the problem. It turned out my fixes only fixed half the problem. And it’s the weekend.

So, I cheated. I knew this code was in a hidden section, and wouldn’t be live when deployed. I could just comment out the tests, ensuring the branch would pass even if the code didn’t work properly.

First thing Monday morning, I returned to the problem. I looked back over the logs to find the faulty test and dutifully kicked off a test run.

And this was the start of my problem.

It passed.

I was a little surprised. There were failures last week, I could see there were failures. Maybe somebody else fixed it. I check the logs. Nope. I run the tests again.

It passed.

I should mention that running the tests isn’t instantaneous. It’s a complex test suite we’re running here. We’re currently migrating our server code away from Ruby (we’re always migrating to faster systems). In the interim, we’re running tests against both systems. The various compilation and startup sequences, combined with the real browser used, and the queues of other developers waiting on the system, gives us an average runtime of about 15 minutes.

It’s optimised to run a suite of tests, rather than individual tests, so both cases take around the same amount of time to run. In fact, we have some excellently comprehensive test suites with thousands of tests, and they can all be run in about 15 mins. Running single tests is still slow, because the task can’t be distributed.

From this point on, I’ll indicate the test timings like this: (15).

I can of course run a test locally. Again, this can take time. Usually less time – let’s say 5 minutes or so. It takes less time because I already have the servers running locally.

So my tests are passing. Usually, a cause for joy. In this case, a worry. Why did they fail last week? I dig out the code from last week, and run the tests (15).

It fails, as expected.

Well, that’s something. I diff the code. And here I win the Goof of the Week award. Prize chump right here. The tests are passing because I commented them out.

Duh.

I go to lunch and try to apologise to my own brain cells for letting them down.

I kick off a real test run (15). It fails. Good. But weird – I swear these were working once. I test the Ruby code instead (15). It passes.

Interesting. So I’ve found a difference. This is weird because I haven’t actually changed any of the server code. It shouldn’t be possible to be different here.

What’s worse is that we can’t easily test the new code locally. It takes longer. I run them (10). They pass. I run the Ruby tests (5). They pass. I run the new tests again, to be sure it’s not random. Pass (10). Pass (10).

I spike the code full of debugger statements. Run the tests (15). More debugging (15). Add screenshots (15). Everything is the same between systems, except that the new tests are failing.

” is not true”. Thanks, error messages.

Hmm. I look back at the screenshots. While the systems show the same screens, there’s something I didn’t expect. When you’re a new user on Twitter, we show a little helper module at the top of the page. I’d forgotten we did that. It also loads asynchronously.

Aha. Here it is. The new code runs so much faster that this module loads earlier. This explains the difference. I go home, happy that I’ve established the problem.

Next day, I make allowances for the helper module. I pull out the debugging code, happily run the tests (15) one last time, and prep the branch for shipping.

It fails.

Bollocks. I unprep the branch and put the debugging code back in. I run the tests (15).

It fails. Same debugging result as before.

That’s weird. I was so convinced the module was the problem. I hack the code and remove the module entirely. Another run (15).

It passes.

Ok, so it’s the module. No question. I just got the allowances wrong. I tweak them, and run (15). Fail. I add debugging (15). Nothing. And again (15). No.

This is crazypants. I am so not happy.

I try to reproduce locally. Passes (10) every (10) time (10). No amount of debugging and breakpoints will help if I can’t reproduce the error.

I look back to the screenshots. Aha. No images in the module.

I knew the error was caused by an bad height calculation. And the missing images are (whips out an onscreen ruler), yes, the images are 137px – exactly the difference.

Eureka.

Our images don’t load on the testing machines because they’re insulated against the outside network. That makes sense. Locally, we can access the images, which must be why they pass. It’s perfect.

Furthermore, our testing machines are using Firefox, which won’t render broken images, while I’m using Chrome locally – which does. This would account for the height difference.

I curse and laugh, since this quirk of Firefox has bitten me many times in the distant past.

To fix the test, I simply break all the images – to be consistent. I prep the branch for shipping, throw it out to the test cluster one last time, and go and get a fifteenth cup of coffee.

It fails (15).

What. The. Fuck.
What. Thefuck.
Whatthefuck.
Fuck.

How can this be? I had this nailed.

Worse than that, the tests is now passing occasionally. When I run the tests on the test cluster, I run them about sixty times in parallel. This ensures we don’t get “flaky” tests, which sometimes pass and sometimes don’t. Flaky tests are notoriously disruptive and hard to fix.

Sometimes, all you can do is try it yourself. I open a browser. I downgrade it to the exact same version as our test cluster. I’ve done this before, but I’m doing it again. I run the tests (10). I click around. I fiddle with the code. I break the images, to match. Refresh, refresh, refresh.

Hey, what was that?

I refresh again. And again. There it is.

Firefox won’t render a broken image – it hides it away completely. I knew that. But what I’m seeing here is a flash of broken image. There’s a flash of a broken image in this helper module, before Firefox hides it away.

If my code is making calculations just at that moment, it’ll get a different result than, say, half a second later.

I don’t get my hopes up again. I tweak the test. There’s actually a little css property:
-moz-force-broken-image-icon:1;
This will prevent Firefox from hiding the broken images. I run the tests. All the tests (15).

Pass.

I prep. I ship. I smile.
I move on to the next thing.

(btw, if testing tools is your kind of thing, and you want to help me make things faster, we’re hiring and I will buy you cookies)