Experiments, part 2

Yesterday we talked about a/b testing at large companies like Twitter, and we ran through a simple example. Today, I want to talk about more complex examples and the problems associated with them.

So let's pick a more complex example.

Back in 2014, we made a significant update to users' profile pages and, like any other change, we tested it through experimentation.

Our hypothesis was that more features on the profile page would lead to more time spent on site, more followers for users, and more favorites on tweets. So, who will be affected by the experiment? How can we test our hypothesis?

Problem: choosing users for the experiment

Certainly the profile owner themselves would be affected. There's a certain satisfaction to configuring your own page (remember Myspace?). I can imagine they would spend more time on the site initially. But how would it affect other Twitter users who can now see the new profile?

I can think of some options:

When users in the treatment bucket look at any profile, they see the new profile. The control group sees the old profile. A downside of this option is that users wouldn't know how their own profile looks to other users. Another is that users might not know how to configure the extra settings.
Users in the treatment bucket get their profiles converted to the new format for all viewers. This lets us measure the increase in time they spend on the site. However, we would not be able to directly measure (using the client events we described yesterday) the behavioural change of viewers.

Both of these options have pros and cons, and these are strongly dependent on the abilities of your analytics tools. For example, in the second case, we may be able to compare "likes received". While this isn't a client event, it could be measured through proxy client events (e.g. a notification that a like was received), or it could be drawn from a database table instead.
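To make the difference between the two options concrete, here is a minimal sketch. The hash-based `in_treatment` split and the experiment name `new_profile` are my own assumptions for illustration, not how the real system worked. The only thing that changes between the options is whose bucket decides which profile gets rendered.

```python
import hashlib


def in_treatment(user_id: int, experiment: str = "new_profile") -> bool:
    """Hypothetical deterministic 50/50 split from a hash of experiment name and user id."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    return digest[0] % 2 == 0


def profile_version_viewer_based(viewer_id: int, owner_id: int) -> str:
    # Option 1: the viewer's bucket decides, whoever owns the profile.
    return "new" if in_treatment(viewer_id) else "old"


def profile_version_owner_based(viewer_id: int, owner_id: int) -> str:
    # Option 2: the owner's bucket decides, whoever is looking.
    return "new" if in_treatment(owner_id) else "old"
```

Under option 1 a given viewer sees one version everywhere; under option 2 a given profile looks the same to everyone, which is why the viewers' behaviour no longer lines up with their own bucket.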

If memory serves, we chose the second of these options, and the analysis was extremely difficult.

Problem: network effects

The power of a service like Twitter is entirely derived from the userbase, and the fact that users can interact with each other. There are features of Twitter that are only successful because of these network effects, and this affects how we can test them.

If we rolled out a feature like Direct Messages now, and allowed standard randomization of users into treatment and control buckets, the experiment would surely fail. You can't have a back and forth conversation if only one side can join.
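A back-of-the-envelope sketch of why: if every user is bucketed independently, a conversation only works when everyone in it landed in treatment. The function below is a toy calculation under that independence assumption (real conversation partners are not randomly paired), and the name is mine.

```python
def share_fully_treated(treatment_share: float, group_size: int = 2) -> float:
    """Probability that every member of a group is in treatment,
    assuming each member is bucketed independently."""
    return treatment_share ** group_size


for p in (0.01, 0.1, 0.5):
    pairs = share_fully_treated(p)
    print(f"{p:.0%} of users in treatment: {pairs:.2%} of pairs can both use the feature")
```

Even at a 50% rollout, only a quarter of two-person conversations would have both sides able to join, and experiments usually start far smaller than that.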

This even affects smaller features in the conversation product. There is much discussion at the moment around the problems of Android users joining group messages with iPhone users. The "green bubble" effect degrades the experience of the whole group.

The general solution to such problems is to manually curate your control and treatment groups; usually we pick countries. Ideal country candidates have an active but insular community. An example of this is Turkey, where the language helps keep conversation insular. Another example is Japan.
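As a rough sketch, bucketing by country instead of by user id could look like the following. The country lists and bucket names are purely illustrative, not the markets or labels we actually used for any experiment.

```python
# Hypothetical, hand-curated assignment of whole markets to buckets.
TREATMENT_COUNTRIES = {"TR"}
CONTROL_COUNTRIES = {"JP"}


def bucket_for_country(country_code: str) -> str:
    """Assign everyone in a country to the same bucket, so that users who
    mostly talk to each other mostly share a bucket."""
    code = country_code.upper()
    if code in TREATMENT_COUNTRIES:
        return "treatment"
    if code in CONTROL_COUNTRIES:
        return "control"
    return "not_in_experiment"
```

The unit of randomization is now the country, not the user, so there are far fewer independent units than the user counts suggest.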

Care must be taken in such cases, however, as local cultural differences can add significant bias to results. As Japan is one of our biggest markets, we have numerous internal research documents around the different usage patterns in Japan. Another problem with choosing such markets is feedback: as an English speaker, it's easy to find and read user complaints in English, but much harder in Turkish.

Manually curated treatment groups like this are generally more complex to analyse, and this is a case where it's important to get help from a Data Scientist instead of just reading the graphs off a standard tool.

Problem: too many changes at once

The new web profile lets you use a larger profile photo, customize your header, show off your best Tweets and more.
Coming soon: a whole new you, in your Twitter profile, 2014

When your feature involves a number of different changes, the analysis can be quite complex. Let's say favorites are up (good) but time on site is down (bad) - how can we iterate on our feature to improve it? This is impossible to know without more data.

The obvious solution is to break up the big change into a series of small testable changes - and indeed that was one of the most significant "lessons learned" from the retrospective of this feature. However, it didn't stop the same thing happening to features over and over again in the years since. The lesson hasn't been learned.
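One way to picture "small testable changes" is one experiment per change, each with its own independent split. The sketch below is an assumption on my part, not the tooling we had: the experiment names are made up from the announcement quoted above, and the bucketing is a plain hash. Running the experiments side by side like this is a factorial design; running them one after another is the other obvious choice.

```python
import hashlib

# Hypothetical experiment names, one per change in the announcement.
EXPERIMENTS = ("larger_profile_photo", "custom_header", "best_tweets")


def bucket(user_id: int, experiment: str) -> str:
    """Hash the experiment name together with the user id, so that each
    experiment splits users independently of the others."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    return "treatment" if digest[0] % 2 == 0 else "control"


def enabled_changes(user_id: int) -> list[str]:
    """The subset of changes this particular user sees."""
    return [name for name in EXPERIMENTS if bucket(user_id, name) == "treatment"]
```

Because the splits are independent, a drop in time on site can be attributed to a specific change instead of to the bundle as a whole.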

Delivering in small chunks is unpopular: you don't get the "big launch" or "big win" that marketing and executives love. It's also harder to justify features to users. What's the point in a redesign of the profile page if there are no new features with it?

A good alternative to the breakup is user research, using surveys and interviews to determine which features users like and which they don't. We refer to this as "qual" - qualitative research, rather than "quant" - quantitative research (the a/b experiment).

Personally, I deeply distrust the "qual". I have great respect for the people who do it, but I don't believe users are able to tell us honestly about their opinion of features. Often, as users, we must overcome our resistance to change. We aren't aware of the problems underlying our complaints (or compliments), and we cannot judge or suggest reasonable solutions.

If you want to run user research, make sure you have a strong product driver to push the feature through the resistance to change. Otherwise, break up your change into testable chunks.

Problem: self-selection

Beta testing and opt-in are incredibly tempting, and something our users often request. For product managers, they're appealing. Users feel valued when they're part of an exclusive preview and are asked for feedback. Product managers get early feedback and relationships with journalists to boost their ego.

Often the feedback is misleading. If users are able to volunteer for a preview, they will. These self-selecting users will give biased and skewed feedback. It's often easier to go straight to a public a/b test where treatment users can be randomly chosen.
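A toy simulation shows the skew. Everything here is made up for illustration: each simulated user gets an "enthusiasm" score between 0 and 1, rates the feature at that level, and is more likely to volunteer the keener they are.

```python
import random
from statistics import mean


def simulate(seed: int = 0, population: int = 10_000) -> tuple[float, float]:
    """Return (average rating from volunteers, average rating from a random sample)."""
    rng = random.Random(seed)
    # Toy model: a user's rating of the feature equals their enthusiasm.
    enthusiasm = [rng.random() for _ in range(population)]
    # Opt-in preview: the keener the user, the likelier they volunteer.
    volunteers = [e for e in enthusiasm if rng.random() < e]
    # Randomized test: every user has the same 10% chance of being picked.
    randomized = [e for e in enthusiasm if rng.random() < 0.1]
    return mean(volunteers), mean(randomized)


opt_in_rating, random_rating = simulate()
print(f"volunteers: {opt_in_rating:.2f}, random sample: {random_rating:.2f}")
```

In this model the volunteers rate the feature noticeably higher than the random sample does, even though both groups saw exactly the same thing.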

Examples of problems we've seen:

Updating to the new version of the website: the first users were thrilled, the majority of users accepted it, the last users resented the change, and thought we should offer the old version forever.
We ran a beta version of the app called "twttr" that tested a bunch of changes, but when we took them to real users, they weren't popular. We shut down the beta.

Problem: logged-out users

Profile pages on Twitter are available to both logged-in and logged-out users.

So far we've bucketed our users based on their user id, but if our users haven't logged in, then how can we consistently bucket them? We would want to make sure that our users see a consistent treatment on each visit, because sometimes the changes in UI are significant, and changing buckets would be very confusing.

Where legally possible, we issue a cookie with an identifier to help with bucketing users. This solves most of the problem, but still leaves us with:

When a user is selected for an experiment bucket while logged out, and then logs in, do we switch the experiment bucket?
When a user is selected for an experiment bucket while logged out, and then signs up, do we switch the experiment bucket?
If a logged-in user uses our multi-account feature to switch accounts, do we switch their experiment bucket?
If a logged-in user logs out, do we switch their experiment bucket?
Lots of logged-out traffic comes from crawlers and bots. How do we detect them, and how do we handle bucketing for them?
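To show what is at stake, here is a sketch of one possible policy, not a recommendation and not what we actually did. All names are hypothetical, and the user-agent check is deliberately naive. This version excludes anything that looks like a bot, buckets on the account id when there is one, and falls back to the cookie identifier otherwise.

```python
import hashlib
import uuid
from typing import Optional

# Naive, incomplete markers; real crawler detection needs much more than this.
BOT_MARKERS = ("bot", "crawler", "spider")


def new_visitor_id() -> str:
    """Random identifier to store in a cookie for a logged-out visitor."""
    return uuid.uuid4().hex


def looks_like_bot(user_agent: str) -> bool:
    lowered = user_agent.lower()
    return any(marker in lowered for marker in BOT_MARKERS)


def bucketing_key(user_id: Optional[int], visitor_id: Optional[str],
                  user_agent: str) -> Optional[str]:
    """Pick the identifier to bucket on, or None to stay out of the experiment."""
    if looks_like_bot(user_agent):
        return None
    if user_id is not None:
        return f"user:{user_id}"
    if visitor_id is not None:
        return f"visitor:{visitor_id}"
    return None  # no stable identifier, so no consistent treatment is possible


def bucket(key: str, experiment: str, treatment_percent: int = 50) -> str:
    """Deterministic bucket: the same key always gets the same treatment."""
    digest = hashlib.sha256(f"{experiment}:{key}".encode()).digest()
    slot = int.from_bytes(digest[:8], "big") % 100
    return "treatment" if slot < treatment_percent else "control"
```

Note what this policy answers by accident: the key changes on login, logout, and account switch, so the bucket may change too. Keeping the logged-out bucket after sign-up would need the cookie identifier to be linked to the new account, which is exactly where the privacy questions start.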

Lots of these questions have answers that are complicated by care for the privacy of our users, as well as following local regulations on cookies and tracking.

Currently, each experiment has to consider these issues and come up with its own solutions. We don't have clear guidelines.

Thanks for reading! I guess you could now share this post on TikTok or something. That'd be cool.
Or if you had any comments, you could find me on Threads.

Published 17 January 2022