Thursday, March 30, 2017

The relative velocity effect and its impact on the galaxy power spectrum

I recently published a paper and I thought it would be a good idea to write a blog post about it for a more general audience. If you are interested, you can find the paper on arXiv under extragalactic cosmology. I wrote the paper together with two of my collaborators, Prof. Uros Seljak from UC Berkeley and Dr. Zvonimir Vlah from Stanford.

In this paper, we looked at the so-called relative velocity effect in the large-scale structure of the Universe. To understand this effect we need to go through three areas of cosmology:

(1) The building blocks and evolution of the Universe
(2) Baryon Acoustic Oscillations (BAO)
(3) The basics of galaxy redshift surveys

I will go through all three of these points in turn and bring them together in the end. But before I start, here is a general overview of how these three points fit together.

If you want to test a theory, you have to look at what the theory predicts and see whether you can confirm or disprove that prediction with data. Such tests are fundamental to the progress of science. For many cosmological theories, the most powerful tests we currently have involve Baryon Acoustic Oscillations (BAO) in the photon and matter distribution. Below I will discuss exactly what BAO are, but for now, just think of them as a tool to distinguish between cosmological models.

In the paper I want to discuss here, we looked at the relative velocity effect. As the name suggests, it describes a relative velocity, in this case between cold dark matter and baryonic matter. This relative velocity can impact galaxy formation, meaning the types of galaxies which form and the places where they form. These changes can affect our measurement of Baryon Acoustic Oscillations from galaxy redshift surveys. Given the importance of BAO for cosmological studies, we need to understand the impact of this relative velocity, otherwise we might draw the wrong conclusions from such measurements.

In the rest of this blog post, I will describe the origin of the relative velocity, its impact on galaxy formation and how this could influence what we measure with galaxy redshift surveys.

The building blocks and evolution of the Universe 


The Universe is made up of baryonic matter (protons, electrons, etc.), cold dark matter (we don't know what particle cold dark matter is made of) and dark energy (we don't know much about that either). Dark energy causes the expansion of the Universe to accelerate at late times, but it is not relevant for this paper, so let's focus on baryonic matter and cold dark matter.

The main difference between baryons and cold dark matter is that cold dark matter carries no electric charge. This is crucial: if it had an electric charge, it would interact with photons, and interaction with photons would mean that we could see it. The reason cold dark matter is called dark is that we cannot see it directly (no photons scatter off it). The only reason we know it is there is that we can see its gravitational effects (for example through gravitational lensing experiments).

Ok, let us go through the different stages of the Universe's evolution and see how they impact the two main components we are interested in (baryons and cold dark matter).

(1) The beginning, inflation: 


The Universe starts off with a period of very rapid expansion, which is driven by the so-called inflaton field and only lasts for a tiny fraction of a second. The theory of inflation is observationally not very well tested and fairly speculative at this point, but it provides the right initial conditions for the next steps of the Universe's evolution.

For this discussion, the most important aspect of inflation is that it introduced tiny density fluctuations into the matter density field, which are going to be the places where galaxies will form later on.

(2) Plasma phase: 


After inflation, the Universe is so hot and dense that atoms cannot form. For some time even quarks, the building blocks of protons and neutrons, are free. Since the Universe expands, it cools over time. After about $300\,000$ years the Universe has cooled enough for atoms to form, mostly neutral hydrogen.

The period before the formation of neutral hydrogen is the crucial one for this discussion, because in those $300\,000$ years dark matter and baryonic matter followed very different paths of evolution. The difference is caused by the electric charge. The baryons interact with each other through the exchange of photons, which is possible because of their electric charge. Wherever you have an overdensity of baryons, you also have an overdensity of photons, and this creates a photon pressure which pushes the baryons out of the overdensity.

Cold dark matter does not interact with photons, because it has no electric charge. Therefore, wherever there is a higher density of cold dark matter, gravity pulls the matter together, causing the density to increase further. So while photon pressure prevents the baryon density from growing anywhere in the Universe, gravity causes the cold dark matter density to grow.

(3) The dark ages: 


After $300\,000$ years the Universe has cooled enough for protons to capture electrons and form neutral hydrogen. As the name suggests, this form of hydrogen is neutral: it is made up of one proton and one electron, whose electric charges cancel out. This means that baryonic matter now stops interacting with photons, because in the form of neutral hydrogen the baryons no longer carry an electric charge.

Without the photon interaction there is no photon pressure, and without photon pressure the baryon density can grow, just like the cold dark matter density did all along. This period of the Universe's evolution is called the dark ages because none of the matter interacts with photons. Since photons are what we see with telescopes, we cannot see anything in that era.

(4) Reionisation: 


During the dark ages the density of baryons and dark matter grows, and at some point the high-density regions become hot enough for the hydrogen to ionize again (protons and electrons separate). In principle we would now expect to go back to the physics we described during the plasma phase; however, the photon density is much lower by now. Therefore the photon pressure isn't a big deal anymore and the evolution of the baryons is determined by other processes. These processes lead to the formation of stars and galaxies.

Cold dark matter does not interact and hence it can't form stars. Therefore cold dark matter concentrates in so-called haloes and galaxies form in the center of these haloes.

The important phase for the subject discussed in our paper is phase (2) the plasma phase. During this time the baryon density cannot grow, while the cold dark matter density does grow. This means that baryons and cold dark matter start off in phase (3) with a different density distribution.

Besides the difference in the density, there is also a difference in the velocity because the cold dark matter was moving towards high-density regions (driven by gravity), while baryons were moving out of high-density regions (driven by the photon pressure).

So the main point we have to keep in mind is that there is a relative velocity between cold dark matter and baryons due to the physical processes which happened in the plasma phase of the Universe.

Next, we need to discuss Baryon Acoustic Oscillations.

Baryon Acoustic Oscillations 


Baryon Acoustic Oscillations are a signal in the matter distribution which can be used to test cosmological models. So what are Baryon Acoustic Oscillations?

As I mentioned before, in the plasma phase the baryon density cannot grow. Instead, wherever there is an over-density, the photon pressure drives the baryons out in a spherical wave. This wave does not stop but travels for the entire $300\,000$ years until the photons and the baryons decouple. Only at this point does the wave stop.

This, however, leads to a peculiar distribution of matter. The initial over-density, out of which the baryons were pushed by the photon pressure, grew through the infall of cold dark matter. So now we have an overdensity in the center and a shell of baryons around it. The radius of this shell is set by how far the wave could travel before decoupling.

These processes happened at every over-density in the Universe and hence at the end of the plasma phase, the Universe is a superposition of such structures. The main point here is that these over-densities (the overdensity in the center, which grew through the infall of cold dark matter and the baryonic shell) are the seeds where galaxies will form later on.

However, this introduces a special scale in the distribution of galaxies, namely the radius of the shell. So if you look at the separation between galaxies, you will find an excess of galaxies separated by the radius of this shell. This special scale is what we call Baryon Acoustic Oscillations, and this scale is one of the most important observational tools we have in Cosmology.

Next, we need to discuss the principles of galaxy redshift surveys.

Galaxy redshift surveys 


One important tool to study the Universe is galaxy redshift surveys. Such surveys measure the 3D positions of galaxies. I am currently working in the BOSS (Baryon Oscillation Spectroscopic Survey) collaboration, which over the last 6 years compiled a galaxy redshift survey of about 1 million galaxies, the biggest such survey currently available.

With such a survey we can measure Baryon Acoustic Oscillations, since we have the positions of the galaxies. As I mentioned, the BAO scale is just a special separation scale between galaxies, so we look at all galaxy pairs, count how many we have at each separation and search for the scale where we suddenly get more pairs. This is the Baryon Acoustic scale, and we can use it to learn about the expansion history of the Universe.
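To make the pair-counting idea a bit more concrete, here is a minimal sketch in Python (with randomly placed points, so no actual BAO bump will show up; the real analysis also has to correct for the survey geometry):
import numpy as np
from scipy.spatial.distance import pdist

# Hypothetical galaxy positions in a 1000 Mpc/h box
np.random.seed(42)
positions = np.random.uniform(0., 1000., size=(2000, 3))

# Separation of every galaxy pair, histogrammed in 5 Mpc/h bins;
# a BAO feature would appear as an excess of pairs near ~100 Mpc/h
separations = pdist(positions)
pair_counts, bin_edges = np.histogram(separations, bins=np.arange(0., 200., 5.))
print("pairs per separation bin:", pair_counts)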

When we construct a galaxy redshift survey we pick a certain kind of galaxy, in BOSS so-called Luminous Red Galaxies (LRGs), and observe them across the sky. We can only observe baryons (galaxies), since cold dark matter does not emit any photons which we could observe with a telescope.

However, if we want to learn something about the Universe we need to know the entire matter distribution, not just the distribution of galaxies. Therefore we use galaxies to trace the underlying matter density field. The idea behind this is that wherever we have more baryons, we will also have more cold dark matter.

The statistical assumptions we make when using LRGs to trace the matter density field are:

(1) We assume that galaxies are related to the underlying matter density field by a bias relation
\[
 \text{galaxy density} = b(r)\times\text{matter density},
\]where $b(r)$ is the so-called bias parameter. This relation allows us to study the matter distribution, even though we only observe galaxies.
(2) We assume that the galaxies we use as tracers have the same relation $b(r)$ to the matter density field everywhere in the Universe, meaning the probability to find an LRG at any point in the Universe depends only on the matter density field and nothing else.

We will get back to these assumptions very soon because the relative velocity between baryons and cold dark matter potentially violates these assumptions, which is the reason we wrote the paper.

Bringing it all together 


The relative velocity between cold dark matter and baryons is expected to be around $30\,$km/s at the end of the plasma phase (when baryons and photons decouple), but it decreases over time, meaning the baryons catch up with the cold dark matter; today the relative velocity is expected to be only about $0.03\,$km/s.

A relative velocity of $0.03\,$km/s today is completely negligible; for example, the velocities in galaxy groups are typically $100$-$1000\,$km/s. However, at the time when the first galaxies form, this relative velocity is around $1$-$3\,$km/s. If the baryons have a relative velocity with respect to the cold dark matter, they might be able to escape the gravitational potential of small cold dark matter halos.

This means, within such dark matter halos, there are fewer baryons available for galaxy formation. This might impact what kind of galaxies form in areas of the Universe where the relative velocity is large.

Now you can understand why the relative velocity is potentially interesting for people who study the formation of galaxies. The problem is that we do not understand galaxy formation very well. There are physical processes like supernova feedback, AGN feedback, gas cooling and many more, which are very difficult to simulate. But let's ignore the implications for galaxy formation for now. Why is this interesting for galaxy redshift surveys?

As stated above, we assume that the probability to form an LRG depends only on the matter density. The relative velocity effect means that the probability to observe an LRG also depends on the value of the relative velocity at that point in the Universe. So the relative velocity violates assumption (2) made above.

What did we do in the paper? 


Even though the relative velocity violates assumption (2) of our galaxy redshift survey analysis above, not everything is lost. We can account for this effect by modifying the relation $b(r)$ given in assumption (1). So we basically account for the fact that the probability to find an LRG might depend on where we look in the Universe.
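Schematically, such a modification adds a term to the bias relation that depends on the local relative velocity. A minimal sketch (not necessarily the exact parametrization used in the paper) looks like
\[
\text{galaxy density} = b(r)\times\text{matter density} + b_v\left(\frac{v^2_{\rm rel}}{\langle v^2_{\rm rel}\rangle} - 1\right),
\]where $b_v$ quantifies how strongly galaxy formation responds to the local relative velocity. Fitting for $b_v$ is then a way to test whether the effect is present in the data.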

The paper we published develops a model for the relative velocity effect, which we fit to the BOSS data. Long story short, we did not detect the relative velocity effect. However, with this test we ensured that the potential impact of the relative velocity effect on our BAO measurements in BOSS is smaller than our current measurement uncertainties. This increases our confidence in the BOSS analysis and in any cosmological implications one might draw from this measurement. I expect that all future galaxy survey studies of BAO will need to check for this additional effect, to ensure that they are not systematically biased.

Finally, I am very interested to learn more about what this effect could mean for galaxy formation itself. If it were possible to understand what impact the relative velocity effect has on galaxy formation, we might get another signal to look for to measure this effect. And maybe we could even improve our galaxy formation models.

Ok I hope that was helpful. If you have any questions feel free to leave a comment below.
cheers
Florian

Thursday, March 16, 2017

Accounting for the look-elsewhere effect in A/B testing

If we perform an A/B test we are trying to confirm or disprove a null hypothesis. Such a test will never be perfect, but will always have an error associated with it. For example, if we managed to rule out our null hypothesis with $2\sigma$ ($95\%$ confidence level), there is a $5\%$ chance that we just measured a random fluctuation and our null hypothesis is actually true (false positive or type 1 error). If that risk is too high, we can collect more data and reduce the risk, but we will never manage to entirely get rid of it.

So what happens if we perform a double A/B test, meaning we compare our control sample to two modifications? Well, now we have a $5\%$ chance to see a random fluctuation outside the $95\%$ confidence level for each of our test samples. If we test both at a $95\%$ confidence level, the chance that neither shows such a fluctuation (i.e. that we confirm both null hypotheses) is
\[
(1 - 0.05)^2 = 0.9025,
\] meaning $\approx 90\%$. So we have a roughly $10\%$ chance to get at least one $2\sigma$ event, even if the null hypothesis is true. If we compare $20$ samples we get \[
(1 - 0.05)^{20} = 0.358.
\]So we now have a $64.2\%$ chance to see at least one random $2\sigma$ fluctuation. This is called the look-elsewhere effect: by looking at many experiments, we increase the chance of seeing a significant fluctuation away from the null hypothesis, even if the null hypothesis is true.

This means that performing one test and encountering a $2\sigma$ fluctuation is more significant than performing many tests and encountering one $2\sigma$ fluctuation because if you perform enough tests you are basically guaranteed to find a $2\sigma$ fluctuation. 

How can we account for the look-elsewhere effect? 


The easiest way to protect against it is to keep the number of variations in an A/B test small. Beyond that, we have to collect more data to reach a higher confidence level. The most conservative solution is the Bonferroni correction, which divides the accepted false positive rate by $k$, the number of tests. This means that if we accepted a $5\%$ false positive rate for a one-way A/B test, we have to change that to $2.5\%$ for a two-way A/B test and to $0.25\%$ for a $20$-way A/B test.
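As a quick sanity check of these numbers, here is a minimal sketch computing the chance of at least one false positive among $k$ independent tests, with and without the $1/k$ correction:
alpha = 0.05
for k in [1, 2, 20]:
    # chance of at least one false positive if every test uses alpha
    uncorrected = 1. - (1. - alpha)**k
    # the same chance if every test uses the corrected rate alpha/k
    corrected = 1. - (1. - alpha/k)**k
    print("k = %2d: %0.3f uncorrected, %0.3f corrected" % (k, uncorrected, corrected))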

One should be aware that increasing the confidence level too conservatively might increase our false negative rate (type 2 error). There are many other more relaxed corrections like the Holm-Bonferroni method or the Sidak correction, but in all cases, we end up collecting more data. 

An alternative approach: Analysis of Variances (ANOVA) 


For an ANOVA F-test we start with the null hypothesis that all sample means are equal 
\[
H_0: \mu_0 = \mu_A = \mu_B = \dots
\]The alternative hypothesis is that at least two of the means are not equal. The ANOVA F-test statistic is the ratio of the average variability among groups to the average variability within groups: \[
F = \frac{\frac{1}{k-1}\sum^k_{i=1}N_i(\mu_i - \overline{\mu})^2}{\frac{1}{N-k}\sum^k_{i=1}\sum^{N_i}_{j=1}(x_{ji} - \mu_i)^2},
\]where $\mu_i = \frac{1}{N_i}\sum^{N_i}_{j=1}x_{ji}$ is the mean of sample $i$, $\overline{\mu} = \frac{1}{k}\sum^k_{i=1}\mu_i$ is the mean of the means of all $k$ samples, $N_i$ is the size of sample $i$ and $N$ is the total number of data points. The numerator describes the variation of the sample means around the global mean, while the denominator describes the average variance within the individual samples.

One should note that the ANOVA method assumes that the variances are approximately equal between the different treatments. This is usually a reasonable approximation if the different samples are fairly similar.
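Here is a minimal sketch of this statistic in Python, using made-up download data for three equally sized samples; for equal sample sizes it agrees with scipy.stats.f_oneway:
import numpy as np
from scipy.stats import f_oneway

def anova_f(samples):
    # F statistic as defined above, for a list of equal-sized 1D arrays
    k = len(samples)
    N = sum(len(s) for s in samples)
    means = [s.mean() for s in samples]
    grand_mean = np.mean(means)  # mean of the sample means
    between = sum(len(s)*(m - grand_mean)**2 for s, m in zip(samples, means))/(k - 1)
    within = sum(((s - m)**2).sum() for s, m in zip(samples, means))/(N - k)
    return between/within

# Hypothetical download indicators (1 = download, 0 = no download) for 3 layouts
rng = np.random.default_rng(1)
samples = [rng.binomial(1, p, size=5000).astype(float) for p in (0.020, 0.022, 0.019)]
F_manual = anova_f(samples)
F_scipy, p_value = f_oneway(*samples)
print("F = %0.3f (by hand), F = %0.3f (scipy), p = %0.3f" % (F_manual, F_scipy, p_value))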

After performing the F-test, one uses the F value together with the degrees of freedom to get the p-value from the F-distribution. If this fails to reject the null hypothesis, the test is over and none of the variations does better than the control. If the null hypothesis is rejected, we know that at least two means are significantly different from one another. In this case, one has to proceed with post-hoc tests to identify where the difference occurred.

The advantage of using ANOVA over multiple t-tests is that the ANOVA F-test identifies whether any of the group means is significantly different from at least one of the other group means with a single test. If the confidence level is set at $95\%$, the probability of a type 1 error for the ANOVA F-test is $5\%$, regardless of the number of groups being compared.

One disadvantage is that the F-test does not look for deviations from the control sample specifically, but for outliers within the whole set of samples. There is, for example, no way to exclude negative outliers: a layout which is significantly worse than the control will also produce a large F value, even though you might not be interested in layouts which do worse than your control.

Example 


We have 3 website layouts (variations $A$ and $B$ and the control sample $0$) and we want to study whether any of them increases the download rate. First, we state the null hypothesis, which in this case is that the mean download rate is the same for the new layouts and the control
\[
H_0: \mu_0 = \mu_A = \mu_B.
\]Next we have to state what improvement we are aiming for and with what significance we want to test the null hypothesis. Let's assume we are looking for a $10\%$ improvement and are content with a $95\%$ confidence level.

Since we are working in a business environment, we now want to ask how much data we have to collect, since collecting data comes at a cost. To estimate the sample size, we just need to re-write our equation for the F-test above
\[
N-k = \frac{F\sum^k_{i=1}\sum^{N_i}_{j=1}(x_{ji} - \mu_i)^2}{\frac{1}{k-1}N_i(0.1\mu_0)^2},
\]where I replaced $\sum^k_{i=1}N_i(\mu_i - \overline{\mu})^2$ with $N_i(0.1\mu_0)^2$, which states that one of the samples deviates by $10\%$ in its mean download rate, and I assumed that all samples have the same size $N_i$. Approximating the within-sample sum by $(N-k)\sigma^2_s$, where $\sigma^2_s$ is the variance within a sample, then gives $N_i = F(k-1)\sigma^2_s/(0.1\mu_0)^2$ for the required size of each sample.

Now we go to the F-statistic and check what F value would represent a $95\%$ confidence level. For that, we need the degrees of freedom. We have $k-1$ degrees of freedom in the numerator quantity and $N-k$ degrees of freedom in the denominator. Note that the F-distribution is not symmetric, so you have to use the correct order here. 

We can assume that $N-k$ is very large (here I set $N - k = 10\,000$) and integrate the F-distribution numerically:
from scipy.stats import f

# Input parameters
Nminusk = 10000   # denominator degrees of freedom (assumed large)
kminus1 = 2       # numerator degrees of freedom
step = 0.001

# Integrate the F-distribution pdf until 95% of the probability is enclosed
F = 0.
integrate = 0.
while integrate < 0.95:
    F += step
    integrate += f.pdf(F, kminus1, Nminusk)*step
    if integrate > 0.95:
        print("F value at 95%% confidence level is %0.1f" % F)
        break
which gives

>> F value at 95% confidence level is 3.0

As long as $N$ is large ($>30$), the result does not depend much on its exact value (there are also online tools available to look up this critical value). Here is a plot of the F-distribution for our case:
Figure 1: F-distribution for the case of $N = 10\,000$ and $2$ degrees of freedom. The dashed line shows the $2\sigma$ significance level, meaning any F value larger than that indicates a significant result.
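For reference, SciPy can also return this critical value directly via the percent-point function (the inverse of the cumulative distribution function), which should reproduce the value obtained with the loop above:
from scipy.stats import f

# Critical F value at the 95% level for (2, 10000) degrees of freedom
print("F value at 95%% confidence level is %0.1f" % f.ppf(0.95, 2, 10000))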

Finally, we need an estimate of the variance within each sample. If we assume we had a download rate of $2\%$ in the past, we get $\sigma^2_s = 0.02(1 - 0.02) = 0.0196$, treating each visit as a Bernoulli trial. Putting it all together we get
\[
N_i = \frac{3\sigma_s^2}{\frac{1}{2}(0.1\mu_0)^2} = 29400.
\]Great! Now we have a number to work with: we need about $29\,400$ visitors in each of the three samples.
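As a quick check of the arithmetic, the same number follows from $N_i = F(k-1)\sigma_s^2/(0.1\mu_0)^2$:
# Plugging in the numbers of this example
F_crit, k, sigma2_s, mu_0 = 3.0, 3, 0.0196, 0.02
N_per_sample = F_crit*(k - 1)*sigma2_s/(0.1*mu_0)**2
print("visitors per sample = %d" % round(N_per_sample))  # prints 29400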

Let's assume after a few days we get a table back showing

sample     visitors    downloads    fraction
control    29400       500          0.0190
A          29400       620          0.0235
B          29400       490          0.0186

The first step is to test whether the F-test assumptions actually held in this case, meaning we have to check whether the sample variances $\sigma^2_s$ for the different samples are approximately equal. Let's assume that this is true in our case above.

Next we have to calculate the $F$ value, which in the above case gives $F = 3.38 > 3$. This means we can rule out the null hypothesis and know that the scatter of the means is not consistent with them all being equal (at our chosen confidence level). Note that this does not necessarily mean that one of the new layouts does better than the control.

So where do we go from here?

Post-hoc tests 


The problem after ruling out the F-test null hypothesis is that we now want to know where the difference actually occurred. This requires looking at the individual samples, and that brings back all the look-elsewhere issues we discussed before. So to protect us from the look-elsewhere effect, we now have to ask for a higher confidence level; in our case, with $k = 3$, we should ask for $1 - 0.05/3 = 0.983 \rightarrow 98.3\%$ confidence.

Next we have to state the t-test null hypotheses. In our case these are
\[
\begin{align}
H^A_0 &: \mu_A - \mu_0 \leq 0,\\
H^B_0 &: \mu_B - \mu_0 \leq 0
\end{align}
\] and we still look for a $10\%$ improvement over the control sample with the now required confidence of $98.3\%$. The details about t-tests have been discussed in my last blog post.

The required t-value can be obtained from a t-distribution; for a $98.3\%$ confidence level we need a t-value of $2.12$. Our measured values are $t = 3.26$ for layout $A$ and $t = -0.32$ for layout $B$, indicating a significant improvement of the download rate with layout $A$ compared to the control. It is of course possible that the post-hoc t-tests are inconclusive and that more data needs to be collected.
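For large samples, the critical value quoted above can be obtained from the normal approximation to the t-distribution:
from scipy.stats import norm

# One-sided critical value at the 98.3% confidence level (large-sample limit)
print("t value at 98.3%% confidence level is %0.2f" % norm.ppf(0.983))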

As a final note, I would like to point out that looking at your experiment many times can be beneficial. Some layouts might be so good, that you can already see a significant improvement over control after looking at a small sample $N$, even though your smaller target improvement (in our case $10\%$) requires a much larger $N$.

So one strategy could be to look at your data at certain intervals and check whether a significant improvement is already present. Of course, every look at your data comes with a price tag, since now one has to increase the required confidence level as described above. However, one might catch significant improvements early and be able to stop the trial earlier. Or one could send more data into the good layouts and less into the bad layouts. Google for example follows such an approach.

The Python code used above is available on Github. Let me know if you have any questions/comments below.
cheers
Florian

Monday, March 6, 2017

Hypothesis testing: Using the t-test to evaluate an A/B test

A/B testing is a common tool to improve the customer engagement with a website. The objectives are often to increase the sign up or download rate.

Let's assume we have a website and we want to test it against a modification $A$ of that website (we changed the download button for example). How do we determine whether the new layout does better than the old?

Well, we divide the website traffic into two groups, each directed to one of the layouts. We then check whether the new layout increases our measure of interest (the download rate).

Before starting the A/B test we have to clearly specify:

(1) What is the null hypothesis?
(2) What is the minimum/expected improvement we are looking for?
(3) What is the level of significance we are content with?

Let's start with (1): what is the null hypothesis that we want to disprove? Usually the null hypothesis is that the new layout does not do any better than the original layout, which can be written as
\[
H_0: \mu_A - \mu_0 \leq 0,
\]where $\mu$ is the mean download rate. The alternative hypothesis is that $\mu_A > \mu_0$, meaning that there is an actual improvement. However, we can't just look for any improvement; we need to look for a clearly specified goal. That is the point of (2).

Let's assume we are looking for an improvement of at least $10\%$ in the download rate. This is a clear goal and we can write the alternative hypothesis as $\mu_A - \mu_0 > 0.1\mu_0$.

Finally (3), we have to specify a level of confidence which is enough for us to make a decision. A common value people like to choose is $95\%$ confidence level, meaning that if we claim at the end that the null hypothesis has been ruled out, there is a $5\%$ chance that we are wrong and we just observed a random fluctuation in the data (type 1 error). This might be too risky for some applications, e.g. clinical trials. In that case one can require a higher confidence level like $99.9\%$, which reduces the chance for a type 1 error to $0.1\%$. Note that statistics will never give you $100\%$ certainty, but if you are willing to collect more and more data, you can approach $100\%$ as closely as you want.

After specifying the setup of the A/B test we can now calculate the number of users required to rule out or confirm the null hypothesis. This is usually required in an industry environment, since our A/B test is always connected to a cost.

Before we calculate the number of users we have to specify how we want to evaluate the A/B test. There are at least two options here, a Z-test or a t-test, so what is the difference between the two?

Difference between Z-test and t-test


The Z-test and the t-test are both used to test the hypothesis that a population mean equals a particular value or that two population means are equal. In both cases we only approximate the confidence level, since we calculate the confidence under certain assumptions about the probability distribution. The Z-test assumes that the distribution of the test statistic under the null hypothesis is a Normal distribution, while the t-test assumes that it follows a Student-t distribution. Note that in both cases we assume that the null hypothesis is true.

Both the Z-test and the t-test rely on other assumptions, which are often broken in real data, so it is important to be aware of these assumptions (and test them):

(1) The samples we look at are random selections from the population
(2) The samples are independent
(3) The sample distribution is approximately Normal

The t-test accounts for additional variations because of small number statistics by using the degrees of freedom (dof) in the probability distribution. For a large number of degrees of freedom the t-test becomes very similar to the Z-test, since the Student-t distribution approximates a Normal distribution. So in general one should use the t-test if the sample is small, while the Z-test can be used if the sample is large.

Here is the definition for the Z-test:
\[
Z = \frac{x - \mu_p}{\frac{\sigma_p}{\sqrt{N}}},
\]where $\sigma_p$ is the population standard deviation, $\mu_p$ is the population mean and $N$ is the sample size. The Z-value calculated with this equation has to be compared to a Normal distribution to get the p-value.

The t-test is defined as
\[
t = \frac{x - \mu_s}{\frac{\sigma_s}{\sqrt{N}}},
\]where $\mu_s$ is the mean of the sample and $\sigma_s$ is the standard deviation of the sample. The t-test has to deal with the additional uncertainty from the fact that the standard deviation $\sigma_s$ has been derived from the sample itself. It does this by using the Student-t distribution instead of the Normal distribution, which is generally broader than a Normal distribution, but approximates the normal distribution for large samples $N$.
Figure 1: Several Student-t distributions with different degrees of freedom (black lines) compared to a normal distribution (blue line).

So the claim here is that if we pick many samples $X_i$ from the population, compute the t value for each of them and plot the distribution of these values, it will follow a t-distribution if the null hypothesis is true.

The number of degrees of freedom is given by the number of data points minus the number of constraints. Since for these tests we usually use the data to determine two parameters (the mean and the standard deviation), we get $N-2$. However, you should check in your particular case whether this is still true. We will talk more about the exact treatment later.
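Once the degrees of freedom are fixed, the critical t-value for a given confidence level can be read off the Student-t distribution, for example (with a hypothetical 50 degrees of freedom):
from scipy.stats import t as student_t

# One-sided critical t-value at 95% confidence for 50 degrees of freedom
print("t value at 95%% confidence level is %0.2f" % student_t.ppf(0.95, df=50))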


Example


Here we will look at the download rate of a website, meaning the ratio of users who download our product, relative to the total number of users who come to the site. We have two layouts for our website, the new layout $A$ and the old (control) layout. First we specify the setup of our test:

(1) The null hypothesis is that the new layout is not any better than the control: $H_0: \mu_A - \mu_0 \leq 0$
(2) We are looking for a $10\%$ improvement in the download rate
(3) We want to see a confidence level of $95\%$

The next question is: do we use a Z-test or a t-test? Well, first of all, do we know the standard deviation of the population? Not really... we know that each visit is a Bernoulli trial with a probability $p$ to click on the download button and a probability $(1 - p)$ not to click. This means the standard deviation per visitor is
\[
\sigma_s = \sqrt{p(1-p)}.
\]And usually $p$ is estimated from the sample itself. So the t-test seems more appropriate, since we do not have much information about the total population.

Before we move on, we should note that $\sigma_s$ for this test is not really given by the equation above. In our case, both the sample testing the new layout $A$ and the control sample have an error associated with them, and hence the combined error of the difference is given by
\[
\sigma_{\mu_A - \mu_0} = \sqrt{\frac{(N_A - 1)\sigma_{s,A}^2 + (N_0 - 1)\sigma_{s,0}^2}{N_A + N_0 - 2}}\sqrt{\frac{1}{N_A} + \frac{1}{N_0}} \approx \sqrt{\frac{\sigma^2_{s,A}}{N_A} + \frac{\sigma^2_{s,0}}{N_0}},
\]where the approximation on the right is justified for samples with $N>30$.

Using what we have established so far, we can formulate our t-test as
\[
t = \frac{(\mu_A - \mu_0)}{\sqrt{\frac{\sigma^2_{s,A}}{N_A} + \frac{\sigma^2_{s,0}}{N_0}}}.
\]Here we want to test a specific case: we want to know whether layout $A$ provides a $10\%$ improvement over the control sample. So we want to know when the statistics are good enough to detect such an improvement, i.e. to rule out the null hypothesis if the improvement is really there.

We can re-write the t-test to answer that question
\[
t = \frac{(0.1\mu_0)}{\sqrt{2\sigma_s^2/N}},
\]where we assumed $N_A = N_0$ (because any other choice is sub-optimal) and $\sigma^2_{s,A} = \sigma^2_{s,0}=\sigma^2_{s}$ for simplicity. Now we solve this equation for the sample size
\[
N = \frac{2t^2\sigma_s^2}{(0.1\mu_0)^2}.
\]Before we can calculate $N$ we have to know $t$ corresponding to a $2\sigma$ ($95\%$) confidence level. To get this from the Student-$t$ distribution we need the number of degrees of freedom, which can be calculated as
\[
dof = \frac{\left(\frac{\sigma_{s,A}^2}{N_A}+ \frac{\sigma_{s,0}^2}{N_0}\right)^2}{ \left[ \frac{\left(\frac{\sigma_{s,A}^2}{N_A}\right)^2}{(N_A - 1)} \right] + \left[ \frac{\left(\frac{\sigma_{s,0}^2}{N_0}\right)^2}{(N_0 - 1)} \right] }.
\]If we assume $N_A = N_0$ and $\sigma_{s,A} = \sigma_{s,0}$, this simplifies to
\[
dof = 2(N - 1).
\]This means there is a problem here. To calculate $N$ we need to know $t$ and to calculate $t$ we need to know the degrees of freedom, which depend on $N$.

However, we can just assume that we are going to deal with a sample $N > 30$, which means the Normal distribution is a good approximation to the Student-$t$ distribution. For the normal distribution we do not need the degrees of freedom, hence
print("The 90\% confidence interval is = (%0.2f %0.2f)" % (norm.interval(0.9, loc=0, scale=1))
>> The 90% confidence interval is = (-1.64 1.64)

We calculated the $90\%$ confidence interval instead of the $95\%$ one, since we are interested in a one-tailed test. We do not care about the probability that the new layout $A$ is worse than the control; we just want to know when we can rule out, with $95\%$ confidence, that layout $A$ is not doing any better than the control. Since the distribution is symmetric, $5\%$ of the probability lies above $1.64$ and $5\%$ below $-1.64$. So we ignore the lower limit and set $t = 1.64$.

Putting this information into our equation for the number of users, we get
\[
N \geq \frac{5.3792 \sigma_s^2}{(0.1\mu_0)^2}.
\]We still need to estimate the standard deviation $\sigma_s$ somehow. One could use the download rate in the past, but one should be aware that this might have a time dependence... so think carefully where you get this estimate from. Let's assume we estimate the download rate to be $2\%$ so
\[
\sigma_s^2 = 0.02(1 - 0.02) = 0.0196.
\]With this we can calculate $N \geq 26359$.

Ok so let's run the test with this many users... imagine that after a day or so you get this table back:

sample     visitors    downloads    fraction    t
control    26359       500          0.0187      N/A
A          26359       560          0.0210      1.893

This looks great! Since $t_A > 1.64$, we have ruled out the null hypothesis at the $95\%$ confidence level and we should probably switch to the new layout $A$.

Even though there wasn't much programming involved in this exercise, below I attach a small Python script producing the plot above and the numbers used in this example (also available on Github).
cheers
Florian

'''
This program goes through the different steps of an A/B test analysis
'''
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import t as student_t
from scipy.stats import norm


def main():

    mu = 0
    variance = 1
    sigma = np.sqrt(variance)
    x = np.linspace(-10, 10, 1000)
    linestyles = ['-', '--', ':', '-.']
    dofs = [1, 2, 4, 30]

    # plot the different student t distributions
    for dof, ls in zip(dofs, linestyles):
        dist = student_t(dof, mu)
        label = r'$\mathrm{t}(dof=%1.f, \mu=%1.f)$' % (dof, mu)
        plt.plot(x, dist.pdf(x), ls=ls, color="black", label=label)

    plt.plot(x,norm.pdf(x, mu, sigma), color="green", linewidth=3, label=r'$\mathrm{N}(\mu=%1.f,\sigma=%1.f)$' % (mu, sigma))
    
    plt.xlim(-5, 5)
    plt.xlabel('$x$')
    plt.ylabel(r'$p(x|k)$')
    plt.title("Student's $t$ Distribution approximates Normal")

    plt.legend()
    plt.show()

    print "The 90%% confidence interval is = (%0.2f %0.2f)" % (norm.interval(0.9, loc=0, scale=1))
    download_rate_estimate = 0.02
    sigma_s = download_rate_estimate*(1. - download_rate_estimate)
    N = 5.3792*sigma_s/(0.1*download_rate_estimate)**2
    print "estimate of N = %d" % round(N)

    # Calculate the t value given the measured download fractions 
    download_fractions = [0.0187, 0.0210]
    print "t_A = %0.2f (for a measured download rate of 2.1%%)" % t_test(N, download_fractions[1], N, download_fractions[0])

    return 


# function to calculate the t value
def t_test(N_A, p_A, N_0, p_0):
    variance_A = p_A*(1. - p_A)
    variance_0 = p_0*(1. - p_0)
    return (p_A - p_0)/np.sqrt(variance_A/N_A + variance_0/N_0)

if __name__ == '__main__':
  main()