Experiments

One of the great perks of working at a place like Twitter is the ability to run A/B tests against a large user base. Such experimentation is one of our core functions: every change is tested with a small subset of users before it's launched to everyone. I thought it would be interesting to describe how that works.

Let's start with a basic example.

Back in 2015, Twitter replaced "Favorites" with "Likes". This was a cosmetic change: the word changed in a number of phrases across the site and the icon changed from a star to a heart. They are still labelled as "favorites" on the server. This is a great example of a simple change that can be tested through experimentation.

When you interact with Twitter, the client you use (the website or an app) will send logs back to the servers describing your behaviour. If you tap a button, we'll log that. If you see a Tweet, we'll log that too. We have two sources of data for the "favorites": we know from the client logs how many times you pushed the button; and we can check on the server how many tweets you've favorited. These don't entirely match, because sometimes your network will fail, but they're pretty close.
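
To make that concrete, here's a rough sketch of what a client-side log event for a favorite tap might look like, in TypeScript. The event names, fields and helper are made up for illustration; they aren't Twitter's actual logging schema.

```typescript
// A minimal sketch of client-side interaction logging, assuming a hypothetical
// event schema and helper -- not Twitter's actual logging format.
type ClientEvent = {
  event: string;     // e.g. "favorite_button_tap" or "tweet_impression"
  tweetId: string;
  timestamp: number; // client clock, in milliseconds
};

const pendingEvents: ClientEvent[] = [];

function logEvent(event: ClientEvent): void {
  // Batched and flushed to the server periodically; events can be lost when
  // the network fails, which is why client counts and server counts differ.
  pendingEvents.push(event);
}

// When the user taps the favorite button:
logEvent({ event: "favorite_button_tap", tweetId: "12345", timestamp: Date.now() });
```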

For an experiment like this, we would be aiming to maintain the number of favorites, and perhaps even see an increase. If the number of favorites drops, we probably wouldn't move ahead, because we already know that favorites are tightly coupled to the number of users that use Twitter.

Why are favorites coupled to users? There are several reasons. At the most basic level, our users post Tweets and the more favorites they receive, the more likely they are to visit Twitter and Tweet again. On a second level, when someone favorites your Tweet you receive a notification, drawing you back to the service. Third, we can use the favorite as a signal of preference and quality, which helps us to model our ranking algorithms to present you and others with better Tweets, again increasing your likelihood of returning to the service.

So, we have our Product Change: replace "Favorites" (the star) with "Likes" (the heart).

And we have our testable hypothesis: the number of favorites will hold steady, and perhaps even increase.

Assuming we are armed with design and copy, we can begin implementation. Rather than deleting the references to Favorite, we add branches to our code so that we can present either experience to users, with the selection based on a "feature switch". Each client regularly checks its user's feature switches against our server, so we can change the experience whenever we're ready. This is incredibly useful for testing, where we can check that the service works properly in both conditions, and it also de-risks our deploys: we can safely ship the new code without any user-facing change, and then flip the switch independently of any other experiments or code changes.
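
As a rough illustration, a feature-switch branch in client code might look something like this TypeScript sketch. The switch name, endpoint and helpers are hypothetical, not Twitter's real configuration system.

```typescript
// A minimal sketch of branching on a feature switch. The switch name, the
// endpoint and the helpers are hypothetical -- illustrative only.
type FeatureSwitches = Record<string, boolean>;

// Fetched and refreshed regularly from the server, so a user's experience can
// change without shipping new client code.
async function fetchFeatureSwitches(userId: string): Promise<FeatureSwitches> {
  const res = await fetch(`/api/feature_switches?user=${encodeURIComponent(userId)}`);
  return res.json();
}

function favoriteLabel(switches: FeatureSwitches): string {
  // Both branches ship in the same build; the switch decides which one
  // a given user actually sees.
  return switches["likes_instead_of_favorites"] ? "Like" : "Favorite";
}
```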

Once our code is done, we get it reviewed, tested, merged and deployed. We then "dogfood" the experience. This term is derived from the horrible expression "eating your own dogfood", an apparently important need for ... people who make dog foods. We do not eat dog food. It just means that we enable the experience for employees. We have pre-release versions of the app called "earlybird".

We have learned to enable experiments for only half of employees, so that we can catch regressions in the previous version. It's very easy to break something that isn't being tested by employees any more. I have done this twice in the past year.

We can then link our feature switch to our experimentation tool. At Twitter, this is known as DDG, short for Duck-Duck-Goose, an American school playground ritual. DDG allows us to choose what percentage of users will see the "treatment" (Likes instead of favorites) and what percentage will see the "control" (no change, for comparison). This is generally a subset of users because we don't want to overwhelm the DDG data pipelines.
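
To give a flavour of how that allocation can work, here's a hedged TypeScript sketch of hash-based bucketing. It's illustrative only, not DDG's real internals.

```typescript
import { createHash } from "crypto";

// A minimal sketch of how an experimentation tool might bucket users into
// treatment and control, assuming deterministic hashing of the user id.
type Bucket = "treatment" | "control" | "unsampled";

function assignBucket(
  userId: string,
  experiment: string,
  treatmentPct: number, // e.g. 5
  controlPct: number    // e.g. 5
): Bucket {
  // Hash the user id together with the experiment name so assignment is
  // stable across sessions and independent between experiments.
  const hash = createHash("sha256").update(`${experiment}:${userId}`).digest();
  const value = hash.readUInt32BE(0) % 100; // 0..99

  if (value < treatmentPct) return "treatment";
  if (value < treatmentPct + controlPct) return "control";
  return "unsampled"; // everyone else keeps the default experience
}
```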

Enabling a change for 5% of web users will usually get you a million users in a few days. And that's why it's great to work on such a popular service: there are millions of engaged users with whom we can quickly test out new ideas. Combined with web's daily deployment schedule, I have been known to design, develop, test, ship and enable experiments for real users within two days of conceiving the idea.

The data takes a few days to come in and then we usually continue to collect data for 2 weeks. At this point, we can look to see if we've validated our hypothesis.
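
The simplest version of that check is comparing favorites per user in treatment against control, as in this illustrative TypeScript sketch. The numbers and the `BucketStats` shape are invented, and the real analysis involves proper significance testing.

```typescript
// A minimal sketch of the kind of comparison we look at once the data is in:
// favorites per user in treatment vs. control.
type BucketStats = { users: number; favorites: number };

function relativeChange(treatment: BucketStats, control: BucketStats): number {
  const treatmentRate = treatment.favorites / treatment.users;
  const controlRate = control.favorites / control.users;
  return (treatmentRate - controlRate) / controlRate;
}

// e.g. 4.2 vs. 4.0 favorites per user would be a +5% relative change.
const change = relativeChange(
  { users: 1_000_000, favorites: 4_200_000 },
  { users: 1_000_000, favorites: 4_000_000 }
);
console.log(`Relative change in favorites per user: ${(change * 100).toFixed(1)}%`);
```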

We may choose to iterate on the experiment, shipping a small code change and reversioning the experiment (essentially wiping the results clean and starting again). In some cases, we'll resample users (randomly choose a new set of users rather than reusing the old ones) to avoid carryover bias; other times we consider the appearance and disappearance of features to be too disruptive, and we'll keep the users the same.

If our hypothesis is validated, we have permission to launch to everyone! This is a good day. We'll normally plan with comms and marketing to announce the new feature as we launch it.

While we're shipping to everyone (more recently, this is known as "GA", for General Availability), we will keep our experiment running. We'll keep 5% in the treatment, 5% in control, and we'll only launch for the other 90%. This is known as a "holdback". We don't turn the treatment up to 95% because we don't want to overwhelm the data pipelines, so we just change the default value of the feature switch for those outside the experiment.
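
Putting that together, resolving a user's experience during a holdback launch might look something like this sketch, which reuses the hypothetical assignBucket() helper from above.

```typescript
// A minimal sketch of a launch with a holdback: the experiment buckets stay
// pinned at 5% + 5%, and everyone else gets the new default experience.
function showLikes(userId: string): boolean {
  const bucket = assignBucket(userId, "likes_instead_of_favorites", 5, 5);
  if (bucket === "treatment") return true;  // still measured as the treatment group
  if (bucket === "control") return false;   // the 5% holdback keeps the old UI
  // The remaining 90%: we flip the feature switch's default value instead of
  // raising the experiment's sampling, so the data pipelines aren't overwhelmed.
  return true;
}
```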

The holdback has two benefits: one, we can see the long-term effects of the change; and two, we can combine a number of experiments in a holdback to make sure the results (increases in favorites, perhaps) are additive. It's no good running two experiments that each show a 10% increase in favorites if, combined, they only produce 10%. This happens.

After a quarter or two, we can close out the holdbacks and log the results. These are known as "causal effects". Note: this is not the same word as "casual". While users of Twitter will change behaviour due to all kinds of external effects (presidents, pandemics, and so forth), we know this "causal effect" is change that we have caused. These are usually the Key Metrics we define for our OKRs.

In this case, we found that favorites increased dramatically, and so we shipped it. For some, this was an unwelcome change, but the metrics improvement was enough to launch.

Hey, that was a lot of words and I haven't even started on the complex cases yet. As they say on TikTok, follow for Part 2.

Thanks for reading! I guess you could now share this post on TikTok or something. That'd be cool.
Or if you had any comments, you could find me on Threads.
