Thursday, March 16, 2017

Accounting for the look-elsewhere effect in A/B testing

If we perform an A/B test, we are trying to reject (or fail to reject) a null hypothesis. Such a test will never be perfect; it always has an error associated with it. For example, if we manage to rule out our null hypothesis at $2\sigma$ ($95\%$ confidence level), there is a $5\%$ chance that we just measured a random fluctuation and our null hypothesis is actually true (a false positive, or type 1 error). If that risk is too high, we can collect more data and reduce it, but we will never manage to get rid of it entirely.

So what happens if we perform a double A/B test, meaning we compare our control sample to two modifications? Now we have a $5\%$ chance of seeing a random fluctuation outside the $95\%$ confidence level for each of our test samples. If we test both cases at the $95\%$ confidence level, the chance of correctly confirming both null hypotheses is 
\[
(1 - 0.05)^2 = 0.9025,
\] meaning $\approx 90\%$. In other words, we have a $10\%$ chance of getting at least one $2\sigma$ event, even if the null hypothesis is true. If we compare $20$ samples we get \[
(1 - 0.05)^{20} = 0.358.
\]So now we already have a $64.2\%$ chance of seeing at least one random $2\sigma$ fluctuation. This is called the look-elsewhere effect: by looking at many experiments, we increase the chance of seeing a significant fluctuation away from the null hypothesis, even if the null hypothesis is true.
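To make this concrete, here is a quick calculation of the probability of at least one false positive as a function of the number of tests (plain Python, just evaluating $1 - (1-\alpha)^k$):
# Probability of at least one false positive when running k independent
# tests, each at significance level alpha
alpha = 0.05
for k in (1, 2, 20):
    fwer = 1 - (1 - alpha)**k
    print("k = %2d tests: P(at least one 2 sigma fluctuation) = %.3f" % (k, fwer))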

This means that performing one test and encountering a $2\sigma$ fluctuation is more significant than performing many tests and encountering one $2\sigma$ fluctuation, because if you perform enough tests you are practically guaranteed to find a $2\sigma$ fluctuation somewhere.

How can we account for the look-elsewhere effect? 


The easiest way to correct for it is to keep the number of variations in an A/B test small. Beyond that, we have to collect more data to get back to a higher confidence level. The most conservative solution is the Bonferroni correction: divide the accepted false positive rate by $k$, the number of tests. This means that if we accepted a $5\%$ false positive rate for a one-way A/B test, we have to change that to $2.5\%$ for a two-way A/B test and to $0.25\%$ for a $20$-way A/B test.

One should be aware that tightening the required confidence level too aggressively can increase our false negative rate (type 2 error). There are many less conservative corrections, such as the Holm-Bonferroni method or the Šidák correction, but in all cases we end up having to collect more data.
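For illustration, here is how the per-test thresholds for the conservative $1/k$ (Bonferroni) correction and for the Šidák correction can be computed; the choice of $k = 20$ is just the example from above:
# Per-test false positive rate that keeps the overall (family-wise)
# false positive rate at alpha when running k tests
alpha = 0.05
k = 20
bonferroni = alpha / k                # conservative 1/k correction
sidak = 1 - (1 - alpha)**(1.0 / k)    # Sidak correction, exact for independent tests
print("Bonferroni: %.5f, Sidak: %.5f" % (bonferroni, sidak))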

An alternative approach: Analysis of Variance (ANOVA) 


For an ANOVA F-test we start with the null hypothesis that all sample means are equal 
\[
H_0: \mu_0 = \mu_A = \mu_B = \dots
\]The alternative hypothesis is that at least two of the means are not equal. The ANOVA F-test statistic is the ratio of the average variability among groups to the average variability within groups: \[
F = \frac{\frac{1}{k-1}\sum^k_{i=1}N_i(\mu_i - \overline{\mu})^2}{\frac{1}{N-k}\sum^k_{i=1}\sum^{N_i}_{j=1}(x_{ji} - \mu_i)^2},
\]where $\mu_i = \frac{1}{N_i}\sum^{N_i}_{j=1}x_{ji}$ is the mean of sample $i$, which has $N_i$ data points, $N = \sum^k_{i=1}N_i$ is the total number of data points, and $\overline{\mu} = \frac{1}{k}\sum^k_{i=1}\mu_i$ is the mean of the means of all $k$ samples. The numerator of this equation describes the variation of the sample means around the global mean, while the denominator describes the average variance within the individual samples.
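As a quick sketch (with made-up Gaussian data rather than the download-rate example below), SciPy provides this test as scipy.stats.f_oneway, which returns the F value and the corresponding p-value:
import numpy as np
from scipy.stats import f_oneway

# Three made-up samples drawn with identical true means; the F-test
# should then (usually) fail to reject the null hypothesis
rng = np.random.RandomState(42)
control = rng.normal(loc=1.0, scale=0.5, size=1000)
variation_A = rng.normal(loc=1.0, scale=0.5, size=1000)
variation_B = rng.normal(loc=1.0, scale=0.5, size=1000)

F_value, p_value = f_oneway(control, variation_A, variation_B)
print("F = %.2f, p = %.3f" % (F_value, p_value))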

One should note that the ANOVA method assumes that the variances are approximately equal between the different treatments. This is usually a good approximation if the different samples are reasonably similar. 

After performing the F-test, one uses the F value together with the degrees of freedom to get the p-value from the F distribution. If this fails to reject the null hypothesis, the test is over and we cannot claim that any of the variations does better than the control. If the null hypothesis is rejected, we know that at least two of the means are significantly different from one another. In this case, one has to proceed with post-hoc tests to identify where the difference occurred. 

The advantage of using ANOVA over multiple t-tests is that the ANOVA F-test will identify if any of the group means are significantly different from at least one of the other group means with a single test. If the significance level is set at $95\%$, the probability of a Type I error for the ANOVA F-test is $5\%$ regardless of the number of groups being compared. 

One disadvantage is that the F-test does not look for deviations from the control sample specifically, but for outliers anywhere in the set of samples. There is, for example, no way to exclude negative outliers: a layout which is significantly worse than the control will also produce a large F value, even though you are probably not interested in layouts that do worse than your control. 

Example 


We have 3 website layouts (variations $A$ and $B$ and the control sample $0$) and we want to study whether any of them increases the download rate. First, we state the null hypothesis, which in this case is that the mean download rate is the same for the new samples and the control 
\[
H_0: \mu_0 = \mu_A = \mu_B.
\]Next we have to state what improvement we are aiming for and with what significance we want to test the null hypothesis. Let's assume we are looking for a $10\%$ improvement and are content with a $95\%$ confidence level.

Since we are working in a business environment, the next question is how much data we have to collect, because collecting data comes at a cost. To estimate the required sample size, we just need to rewrite our equation for the F-test above 
\[
N-k = \frac{F\sum^N_{j=1}(x_{ji} - \mu_i)^2}{\frac{1}{k-1}(0.1\mu_0)^2},
\]where I replaced $\sum^k_{i=1}(\mu_i - \overline{\mu})^2$ with $(0.1\mu_0)^2$, which basically states that one sample shows a $10\%$ deviation in the mean download rate. I also assumed that all samples have the same size $N_i = N_j$.

Now we go to the F-statistic and check which F value corresponds to a $95\%$ confidence level. For that, we need the degrees of freedom: $k-1$ for the numerator and $N-k$ for the denominator. Note that the F-distribution is not symmetric, so the order matters here. 

We can assume that $N-k$ is very large (here I assume $N = 10\,000$) and integrate the F distribution numerically
from scipy.stats import f

# Input parameters
Nminusk = 10000   # denominator degrees of freedom (N - k)
kminus1 = 2       # numerator degrees of freedom (k - 1)
step = 0.001

# Integrate the F-distribution pdf until we reach the 95% level
F = 0.
integrate = 0.
while integrate < 0.95:
    F += step
    integrate += f.pdf(F, kminus1, Nminusk)*step

print("F value at 95%% confidence level is %0.1f" % F)
which gives

>> F value at 95% confidence level is 3.0

As long as $N$ is large ($>30$), the result does not depend much on its exact value. SciPy's percent point function f.ppf should return this critical value directly (a snippet is shown after the figure), and there are also online tools available. Here is a plot of the F distribution for our case
Figure 1: F-distribution for the case of $N = 10\,000$ and $2$ degrees of freedom. The dashed line shows the $2\sigma$ significance level, meaning any F value larger than that indicates a significant result.
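For reference, the same critical value can be obtained directly from the percent point function (the inverse of the cumulative F distribution) in SciPy:
from scipy.stats import f

# Critical F value straight from the inverse CDF (percent point function)
F_crit = f.ppf(0.95, 2, 10000)   # 95% quantile for (2, 10000) degrees of freedom
print("F value at 95%% confidence level is %0.1f" % F_crit)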

Finally, we need an estimate of the variance in the distribution. If we assume we had a download rate of $2\%$ in the past, we get $\sigma^2_s = 0.02(1 - 0.02) = 0.0196$, assuming a Bernoulli distribution for the downloads. Putting it all together we get
\[
N = \frac{3\sigma_s^2}{\frac{1}{2}(0.1\mu_0)^2} = 29400.
\]Great! Now we have a number to work with.
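The same arithmetic in Python, using the assumptions from the text (a $2\%$ baseline download rate, a $10\%$ relative improvement, $k = 3$ groups); the table below treats the result as the number of visitors per variation:
from scipy.stats import f

# Rough sample size estimate from the rearranged F-test formula above
F_crit = f.ppf(0.95, 2, 10000)            # ~3.0 for large N
mu_0 = 0.02                               # assumed baseline download rate of 2%
sigma2 = mu_0 * (1 - mu_0)                # Bernoulli variance, 0.0196
delta = 0.1 * mu_0                        # target: 10% relative improvement
N = F_crit * sigma2 / (0.5 * delta**2)    # 0.5 = 1/(k-1) for k = 3
print("Required sample size: about %d" % round(N))   # close to the 29400 above, which rounds F to 3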

Let's assume after a few days we get a table back showing

sample    visitors    downloads    fraction
control   29400       500          0.0190
A         29400       620          0.0235
B         29400       490          0.0186

The first step is to check whether the F-test assumptions actually hold in this case, meaning we have to check whether the sample variances $\sigma^2_s$ of the different samples are approximately equal (a quick check is sketched below). Let's assume that this is satisfied in our case.
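A minimal sketch of that check, using the Bernoulli variance $p(1-p)$ for each download fraction from the table (the raw per-visitor data would of course allow a more formal test):
# Bernoulli variance p*(1 - p) for each observed download fraction;
# the F-test assumption is that these are roughly comparable
fractions = {"control": 0.0190, "A": 0.0235, "B": 0.0186}
for name, p in fractions.items():
    print("%-7s  p = %.4f  variance = %.5f" % (name, p, p * (1 - p)))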

Next we calculate the $F$ value, which in this case gives $F = 3.38 > 3$. This means we can reject the null hypothesis: the scatter of the means is not consistent with them all being equal (at our chosen confidence level). Note that this does not necessarily mean that one of the new layouts does better than the control.

So where do we go from here?

Post-hoc tests 


The problem after rejecting the F-test null hypothesis is that we now want to know where the difference actually occurred. This requires looking at the individual samples, which brings back all the look-elsewhere issues we discussed before. To protect ourselves from the look-elsewhere effect, we now have to ask for a higher confidence level; in our case, with $k = 3$, we should require $1 - 0.05/3 = 0.983 \rightarrow 98.3\%$ confidence.

Next we have to state the t-test null hypotheses. In our case these are
\[
\begin{align}
H^A_0&: \mu_A - \mu_0 \leq 0,\\
H^B_0&: \mu_B - \mu_0 \leq 0
\end{align}
\] and we still look for a $10\%$ improvement over the control sample, now with the required confidence of $98.3\%$. The details of t-tests have been discussed in my last blog post.

The required t-value can be obtained from a t-distribution: for $98.3\%$ confidence we need a t-value of $2.12$. Our t-values are $t = 3.26$ for layout $A$ and $t = -0.32$ for layout $B$, indicating a significant improvement of the download rate with layout $A$ compared to the control. It is of course possible that the post-hoc t-tests are inconclusive and that more data needs to be collected.
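The critical value quoted above can be reproduced with SciPy's t distribution (a one-sided test at the Bonferroni-corrected confidence level; I assume the usual $2N-2$ degrees of freedom for two samples of this size, although the exact number hardly matters here):
from scipy.stats import t

# One-sided critical t-value at the Bonferroni-corrected 98.3% confidence level
dof = 2 * 29400 - 2                # assumed degrees of freedom for two large samples
t_crit = t.ppf(0.983, dof)         # 0.983 = 1 - 0.05/3 (rounded)
print("Required t value at 98.3%% confidence: %.2f" % t_crit)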

As a final note, I would like to point out that looking at your experiment several times can be beneficial. Some layouts might be so good that you can already see a significant improvement over the control after looking at a small sample $N$, even though your smaller target improvement (in our case $10\%$) requires a much larger $N$.

So one strategy could be to look at your data at certain intervals and check whether a significant improvement is already present. Of course, every look at your data comes with a price tag, since each additional look requires increasing the confidence level as described above. However, one might catch significant improvements early and be able to stop the trial sooner, or route more traffic to the good layouts and less to the bad ones. Google, for example, follows such an approach.
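A crude way to put a number on that price tag is to apply the same conservative $1/k$ correction to the number of planned looks (just a sketch; the number of looks here is made up):
# Bonferroni-style correction when peeking at the data several times:
# each interim look gets a stricter per-look significance threshold
alpha = 0.05
n_looks = 5                               # assumed number of interim looks
alpha_per_look = alpha / n_looks
print("Per-look false positive rate: %.3f (i.e. %.1f%% confidence per look)"
      % (alpha_per_look, 100 * (1 - alpha_per_look)))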

The Python code used above is available on Github. Let me know if you have any questions/comments below.
cheers
Florian
