Saturday, March 31, 2018

About the assumption of a Gaussian likelihood in a data model comparison

This is a blog post about a recent publication led by ChangHoon Hahn, which I co-authored (Hahn et al. 2018). The paper looks at the impact of non-Gaussian likelihoods for cosmological parameter estimation. This blog post attempts to explain the content of this paper for a general audience.

When finding the best model to describe a given dataset, most studies employ the $\chi^2$ statistic, which is a measure of how well the model and the data agree
\[
\chi^2 = (\mathrm{data} - \mathrm{model})^T C^{-1}(\mathrm{data} - \mathrm{model}),
\]where the matrix product implies a sum over the $d$ data points in the dataset, and the covariance matrix $C$ describes the errors of the data points (and their correlations). This statistic is directly related to the so-called likelihood, which can be obtained as
\[
p(x|\theta) = \frac{1}{(2\pi)^{d/2}\sqrt{\det(C)}} \exp\left(-\tfrac{1}{2}\chi^2\right).
\]This likelihood describes the probability that one could have observed the data, given that the underlying truth is described by the model with the parameters $\theta$. So whenever somebody observes some data, the final analysis step is usually to calculate this likelihood and find the parameters $\theta$ which have the largest likelihood.
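
As a minimal sketch (with made-up numbers, purely to illustrate the formulas above), the $\chi^2$ and the corresponding Gaussian log-likelihood can be evaluated like this:

import numpy as np

def gaussian_loglike(data, model, cov):
    # chi^2 = (data - model)^T C^{-1} (data - model)
    diff = data - model
    chi2 = diff @ np.linalg.solve(cov, diff)
    # log of the Gaussian likelihood, including the normalization
    sign, logdetC = np.linalg.slogdet(cov)
    return -0.5 * (chi2 + logdetC + len(data) * np.log(2.0 * np.pi))

# toy example: 3 data points with uncorrelated 10% errors
data = np.array([1.0, 2.1, 2.9])
model = np.array([1.0, 2.0, 3.0])
cov = np.diag((0.1 * data)**2)
print(gaussian_loglike(data, model, cov))

In practice one evaluates this for many different parameter values $\theta$ (each producing a different model vector) and picks the $\theta$ with the largest likelihood.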

However, the likelihood above has a Gaussian shape, and it is not guaranteed that the true probability distribution has that form. So the questions we address in this publication are: how close is the likelihood to a Gaussian, and if there is a deviation, how does it impact the analysis?

The first question to answer is, why do we assume a Gaussian likelihood in the first place?

The Gaussian distribution has a central role in statistics because when averaging many independent measurements of the same quantity, the average often follows a Gaussian distribution. For example, if you throw a coin 100 times, record heads as 1 and tails as 0, and calculate the mean, this mean will probably be somewhere close to 0.5.

If you repeat that experiment many times, meaning you throw the coin 100 times, calculate the mean of the 100 throws and do this over and over again, the distribution of means will peak close to 0.5 and will show a Gaussian shape (see Figure 1). This is quite surprising, because the probability distribution of a single throw (heads or tails) is flat and looks not at all like a Gaussian. So adding up many flat distributions somehow leads to a Gaussian distribution.
Figure 1: Probability distribution for the mean of many throws of a coin. The number of throws is shown in the top gray
box. With an increasing number of throws, the distribution looks more and more Gaussian (source: Wikipedia).
Figure 1 shows the probability distribution when looking at the mean of many coin flips. The distribution looks more and more Gaussian if the number of coin flips is increased.

In fact, this happens quite often in nature: the distribution of the mean (or sum) of a large number of events often looks Gaussian, even if the distribution of a single event does not. This is known in statistics as the central limit theorem.

However, given the example above you can already see that the assumption of a Gaussian likelihood only works for large numbers. Figure 1 shows clearly that the probability distribution is far from Gaussian if one calculates the mean from only a small number of coin flips.
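
You can convince yourself of this with a short simulation (a quick sketch, not from the paper): generate many repetitions of the coin experiment and compare the distribution of the means for a small and a large number of flips.

import numpy as np

rng = np.random.default_rng(42)
n_experiments = 100000

for n_flips in (5, 100):
    # each flip is 0 (tails) or 1 (heads)
    flips = rng.integers(0, 2, size=(n_experiments, n_flips))
    means = flips.mean(axis=1)
    print(n_flips, means.mean(), means.std())
    # histogramming `means` shows a discrete, clearly non-Gaussian shape for
    # n_flips = 5 and a distribution very close to a Gaussian for n_flips = 100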

Going back to our paper: we looked at the power spectrum analysis of the Baryon Oscillation Spectroscopic Survey (BOSS) published in Beutler et al. (2016) and the group multiplicity function analysis of Sinha et al. (2017).

Let's start with the power spectrum analysis. I discussed this analysis in several blog posts (e.g. here) already so I will skip the details about it. Here we do not care too much about the details of the measurement anyway, we just want to test whether the assumption of a Gaussian likelihood is justified.

In the power spectrum case one calculates the mean of many Fourier modes, and the question is whether we have enough modes to make the likelihood distribution Gaussian. In general, there are many modes on small scales (large wavevector $k$) but only a small number of modes on large scales (small wavevector $k$). E.g. at $k=0.01\,h/$Mpc only a few dozen modes contribute to the value of the power spectrum, while at $k=0.2\,h/$Mpc many thousands of modes contribute. Whether this matters or not depends on how it impacts the parameters of the model, which is not straightforward to predict.
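
The scaling of the number of modes with $k$ is easy to check numerically. The sketch below counts the Fourier modes of a cubic box that fall into a shell of one fundamental frequency around a given $k$ (the box size and grid are made-up numbers, just for illustration; each mode and its complex conjugate are both counted here):

import numpy as np

box_size = 1500.0              # box side in Mpc/h (illustrative value)
n_grid = 128                   # grid cells per side
k_f = 2.0 * np.pi / box_size   # fundamental mode of the box

k1d = 2.0 * np.pi * np.fft.fftfreq(n_grid, d=box_size / n_grid)
kmag = np.sqrt(k1d[:, None, None]**2 + k1d[None, :, None]**2 + k1d[None, None, :]**2)

for k_center in (0.01, 0.05, 0.2):
    in_shell = (kmag > k_center - k_f / 2) & (kmag <= k_center + k_f / 2)
    print(f"k = {k_center} h/Mpc: {in_shell.sum()} modes in the shell")

The number of modes grows roughly like $k^2$, so the large-scale (small $k$) part of the power spectrum is averaged over far fewer modes than the small-scale part.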

There is a very simple way to check whether the likelihood is Gaussian. We have produced mock catalogs for the BOSS survey, which mimic the BOSS dataset. Each of these mock catalogs represents a different possible realization of the measurement of BOSS. Therefore we just have to measure the power spectrum of all these mock catalogs and see whether those power spectra scatter around their mean in the shape of a Gaussian.

Whatever the distribution is going to be, we then use that distribution rather than assuming a Gaussian when searching for the model with the maximum likelihood.
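
As an illustration of such a check (this is just a sketch, not the exact procedure used in the paper; the file name and array layout are hypothetical), one can test the scatter of the mocks in each $k$-bin for Gaussianity:

import numpy as np
from scipy import stats

# hypothetical array of shape (n_mocks, n_kbins): one power spectrum per mock catalog
mock_pk = np.load('mock_power_spectra.npy')

for i in range(mock_pk.shape[1]):
    skew = stats.skew(mock_pk[:, i])
    stat, pval = stats.normaltest(mock_pk[:, i])   # D'Agostino-Pearson normality test
    print(f"k-bin {i}: skewness = {skew:+.3f}, p-value for Gaussianity = {pval:.3f}")

A small p-value or a skewness far from zero flags $k$-bins where the mock-to-mock scatter deviates from a Gaussian.
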
Figure 2: This plot shows the likelihood distribution for the different cosmological parameters derived from the BOSS power spectrum. The blue distribution assumes a Gaussian likelihood, while the orange distribution uses the likelihood distribution seen in the mock catalogs. You can see that there is overall good agreement, except for a small shift in $f\sigma_8$.
Figure 2 shows the likelihood distribution for each cosmological parameter. Note that this is a slightly different quantity than the likelihood we talked about before. Before, the likelihood described the probability of observing the data vector given that the underlying model is the true model, while the distributions in Figure 2 show that probability as a function of each individual parameter value.

The blue distribution assumes a Gaussian likelihood, while in orange we use the likelihood from the mocks. The two likelihoods seem to agree well, except for the parameter $f\sigma_8$, where we found a small shift to larger values.

Next, the group multiplicity function analysis of Sinha et al. (2017). The likelihood distributions are again shown in blue and orange in Figure 3. For this analysis, you can see that the two likelihoods agree less well, especially in terms of how peaked the distributions are. Removing the assumption of a Gaussian likelihood seems to increase the error on almost all derived parameters.
Figure 3: Comparison of the original group multiplicity function analysis of Sinha et al. (2017) with the new analysis dropping the assumption of a Gaussian likelihood. The two distributions show significant deviations for most parameters.
In conclusion, the assumed form of the likelihood can impact cosmological parameter constraints and should be accounted for. With larger datasets the statistics usually improve, and the issue discussed here will become smaller. However, observables like primordial non-Gaussianity live on the largest scales, where only few modes are available, and hence such studies must check for such biases carefully.

I hope this was useful. Let me know if you have any comments/questions in the comment section below.
best
Florian 

Thursday, March 29, 2018

Measuring Neutrinos with the distribution of galaxies in the Universe

In this blog post, I will summarise the paper Baumann et al. (2018) which my collaborators and I just published. The paper reports a measurement of the phase shift in the Baryon Acoustic Oscillation signal due to free streaming particles (neutrinos) in the early universe. So what does that mean?

1. What is the Baryon Acoustic Oscillation (BAO) scale? The BAO scale is a special scale in the distribution of galaxies. If you look at the separation of galaxies in the Universe, you will find that at around 150 Megaparsecs (500 000 000 light years) there is a small but distinct excess in the number of galaxy pairs. This special scale in the distribution of galaxies was imprinted when the Universe was just 300 000 years old, at the so-called decoupling era. If you measure the correlation function of a galaxy distribution, meaning you count the number of galaxy pairs as a function of scale, the BAO signal will show up as a peak at 150 Megaparsecs. If you look at the Fourier transform of the correlation function, the power spectrum, it will show up as an oscillation signal. This signal is what we use in this analysis.

2. What is the decoupling era? In the early Universe photons and baryons were coupled together due to the high pressure. 300 000 years after the Big Bang the temperature and pressure dropped enough so that photons and baryons decoupled. From that point forward most of the photons in the Universe just traveled through the Universe without any interactions with anything. For that reason, we can look at these Cosmic Microwave Background (CMB) photons and basically see a snapshot of the Universe as it was 300 000 years after the Big Bang.

3. What is the effective number of relativistic species? This number describes all particles which had a relativistic velocity at the decoupling era. Currently, this number is assumed to be three, given the three known species of neutrinos ($\nu_{e}$, $\nu_{\mu}$ and $\nu_{\tau}$). We know that these particles must have been produced during the Big Bang and since they are very light, they should have had relativistic velocities at decoupling.

Figure 1 shows the impact of the effective number of free-streaming relativistic species ($N_{\rm eff}$) on the BAO signal and you can see that it leads to a shift in the phase.

Figure 1: This figure shows the impact of the effective number of relativistic species ($N_{\rm eff}$)
on the Baryon Acoustic Oscillation (BAO) scale in the matter power spectrum. It is the shift in the
phase of the BAO which we measured for the first time in this analysis (reference: This plot is from
Baumann et al. 2017).
This phase shift is very small, but we now managed to constrain the BAO scale to 1% using the Baryon Oscillation Spectroscopic Survey (BOSS), which makes us sensitive to such small effects.

So why are we interested in this effect?

Well, it could be that there are additional particles which propagated in the early Universe, but which we have not seen yet in particle accelerators. In this sense, we are using the Big Bang as the ultimate high energy collider and test whether the remains of the Big Bang show any evidence for new particles.

Besides that, the neutrinos generated by the Big Bang are also a very interesting probe of the early Universe. As mentioned before we can use the Cosmic Microwave Background (CMB) to get a snapshot of the Universe when it was only 300 000 years old. But we cannot actually look any closer to the Big Bang. If we could see the distribution of neutrinos produced by the Big Bang, we would actually see the Universe when it was only 1 second old, because neutrinos stopped interacting with other components of the Universe much earlier than photons, 1 second after the Big Bang.

So if we want to investigate the Universe at its very early age, the neutrino background produced by the Big Bang would be one way to do that. Another would be the gravitational wave background produced by the Big Bang. However, right now we do not have experiments sensitive enough to detect the neutrino or gravitational wave background. The detection of the phase shift in the BAO is an indirect detection of the neutrino background.

The shift in the BAO signal is not quite scale independent. Hence, the first step is to create a template for the shift, which we then include in the standard BAO fitting pipeline that I developed for the BOSS analysis back in 2017 (Beutler et al. 2017). The template for the phase shift is shown in Figure 2.
Figure 2: Template describing the shift in the BAO phase as a function of wavenumber. We include this template in our analysis, but leave its amplitude as a free parameter.
The figure shows that the shift caused by neutrinos $f(k)$ is larger at larger wavenumber $k$.

The amplitude of this template is a free parameter in the fit, which we call $\beta$. We then use the BOSS dataset to constrain this parameter.
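
Schematically, the fit looks like the sketch below (a toy model for illustration only, not the actual BOSS pipeline; the wiggle amplitude, damping scale, sound horizon and the template function f_template are placeholders):

import numpy as np

def bao_wiggles(k, alpha, beta, f_template, r_d=147.0, amp=0.05, sigma_nl=8.0):
    # Toy BAO wiggle component: an oscillation in k*r_d whose frequency is
    # rescaled by the usual BAO parameter alpha and whose phase is shifted by
    # beta times the neutrino template f(k) (the curve shown in Figure 2).
    phase = alpha * k * r_d + beta * f_template(k)
    damping = np.exp(-0.5 * (k * sigma_nl)**2)   # non-linear damping of the wiggles
    return 1.0 + amp * np.sin(phase) * damping

In the real analysis this wiggle term multiplies a smooth power spectrum model, and $\alpha$, $\beta$ and a set of nuisance parameters are varied until the model best matches the measured BOSS power spectrum.
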
Figure 3: Constraints on the amplitude of the phase shift ($\beta$) obtained from the BOSS dataset
in two redshift bins ($z_1$ and $z_3$).
Figure 3 shows the constraints on the amplitude of the template ($\beta$) obtained from the BOSS dataset. In BOSS we have two redshift bins, $z_1$ ($0.2 < z < 0.5$) and $z_3$ ($0.5 < z < 0.75$). We use both redshift bins for our constraint, which yields $\beta = 1.2\pm 1.8$. This constraint can then be converted into a constraint on the effective number of relativistic species. $\beta = 0$ would correspond to $N_{\rm eff} = 0$, while $\beta = 2.45$ would correspond to $N_{\rm eff} = \infty$.
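
The numbers quoted above can be reproduced with a simple mapping in which $\beta$ traces the fractional energy density in free-streaming species, normalized so that the standard model gives $\beta = 1$ (my own illustration, consistent with the limiting values in the text, not code from the paper):

# per-neutrino-to-photon energy density conversion factor
a_nu = (7.0 / 8.0) * (4.0 / 11.0)**(4.0 / 3.0)

def eps_nu(neff):
    # fraction of the radiation energy density in free-streaming species
    return a_nu * neff / (1.0 + a_nu * neff)

def beta_from_neff(neff, neff_fid=3.046):
    # normalized so that the standard model (N_eff = 3.046) corresponds to beta = 1
    return eps_nu(neff) / eps_nu(neff_fid)

print(beta_from_neff(0.0))              # 0.0:  no free-streaming species
print(beta_from_neff(3.046))            # 1.0:  the standard model
print(round(1.0 / eps_nu(3.046), 2))    # 2.45: the beta value reached as N_eff -> infinity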

You can see that the BOSS data alone can't really constrain $N_{\rm eff}$. However, Figure 3 also shows that there is a strong degeneracy between the parameter $\beta$ and the BAO scale parameter $\alpha$. If we fix the cosmological model, we can get strong priors on the $\alpha$ parameters from the Cosmic Microwave Background. The constraints including these priors are also included in Figure 3 (red and orange contours on the left and red solid line on the right), resulting in $\beta = 2.05\pm0.81$. This excludes $\beta=0$ (corresponding to $N_{\rm eff} = 0$) at more than $2\sigma$. Note that the expected value of $\beta$ in a standard cosmological model with $N_{\rm eff} = 3.046$ is $\beta = 1$, while our data seem to prefer larger values. However, $\beta = 1$ is well within our likelihood distribution (red line in Figure 3, right) and hence we do not see any significant deviation from the standard model.

A similar measurement of the phase shift in the photon distribution (CMB) using the Planck data has been reported in Follin et al. 2015. They measured $N_{\rm eff} = 2.3^{+1.1}_{-0.4}$, which is consistent with our measurement.

In summary, we have, for the first time, measured the phase shift in the BAO signal in the galaxy power spectrum, representing an (indirect) detection of the cosmic neutrino background produced by the Big Bang. Our measurement confirms the current standard model with three neutrino species. While the detection significance of our measurement is fairly low, future galaxy surveys like DESI and Euclid will improve these measurements significantly.

Our paper about this measurement was published in Nature Physics (see https://www.nature.com/articles/s41567-019-0435-6).

Let me know if you have any comments or questions below.
best
Florian

Monday, March 26, 2018

Context managers in Python

A context manager in Python is a way to manage resources. Typical uses of context managers include locking and unlocking resources such as database connections, and opening and closing files.

Context managers are used together with the with statement. You probably encountered the with statement at some point when coding in Python; the most common use case is file handling
with open('data.dat', 'r') as file: 
   contents = file.read()
However, files do not have to be opened like this, instead, you could also use
file = open('data.dat', 'r')
contents = file.read()
file.close()
The problem with this code is that the file.close() method might never be called if an exception happens after the file has been opened. If the file is kept open, you have a resource leak, and the system may slow down or crash since the number of available file handles is finite.

A more robust version of dealing with a file could use the try... finally statement like this
file = open('data.dat', 'r')
try:
   contents = file.read()
finally:
   file.close()
So even if there is an exception while reading the file, the finally clause always ensures that the file is closed again. This is basically how the with statement is implemented.

The with statement has two minor advantages over the try... finally statement above: (1) it is a little bit shorter and (2) it does not require calling the close() method explicitly, since the with statement calls it automatically, whenever the program leaves the with statement (indented block below the with statement).

So the main purpose of the with statement is to guarantee the execution of startup and cleanup actions around a block of code, with the main application being resource management.

So how would we write a context manager? We can generate this functionality using a class
class MyOpen(object):
   def __init__(self, filename):
      # called when the object is created (not necessarily at the start of the with block)
      self.filename = filename

   def __enter__(self):
      # called at the beginning of the with block; returns the managed resource
      self.file = open(self.filename)
      return self.file

   def __exit__(self, ctx_type, ctx_value, ctx_traceback):
      # called when leaving the with block, even if an exception was raised;
      # returning False (or None) lets any exception propagate to the caller
      self.file.close()
      return False
This class has two methods which are specifically used by the with statement, the __enter__() and __exit__() methods. The __enter__() method is called at the beginning of the with statement. The __exit__() method is called when leaving the with statement and could handle exceptions, even though this is ignored in the above example.

You might wonder what the difference is between __init__() and __enter__(). The __init__() method is called when the class object is created, which does not need to be at the beginning of the with statement. This difference allows us to produce reusable context managers like this
file_handler = MyOpen('data.dat')
with file_handler as file:
    contents = file.read()
In this case, the __init__() method is called in line one, when the class object is created; the __enter__() method is called in line two, when the with statement begins; and the __exit__() method is called in line three, when the with statement ends.

So to produce a context manager you just have to create a class with two special methods: __enter__() and __exit__(). The __enter__() method returns the resource to be managed, like a file object in the case of open(). The __exit__() method does any cleanup work.
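
To see the cleanup guarantee in action, we can raise an exception inside the with block of the MyOpen class defined above (assuming a file data.dat exists):

try:
    with MyOpen('data.dat') as file:
        contents = file.read()
        raise RuntimeError('something went wrong while processing the file')
except RuntimeError:
    print('exception was propagated to the caller')

print(file.closed)   # True: __exit__() closed the file despite the exception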

Hope that was useful and please let me know if you have any comments or questions in the comment section below.
cheers
Florian

Wednesday, March 7, 2018

Measuring the galaxy bispectrum of the large scale structure of the Universe

This blog post summarizes the publication Sugiyama, Saito, Beutler & Seo (2018), which my collaborators and I published recently. It describes a new way to measure the galaxy bispectrum. Even though I have attempted here to explain the content of this paper for a general audience, the subject is very technical and some knowledge of higher mathematics is required to follow this post.

What is the Bispectrum?

The density field of the Universe, meaning the distribution of dark matter and baryonic matter, can be expressed as the density at point $\vec{x}$ relative to the mean density $\overline{n}$ of the Universe
\[
\delta(\vec{x}) = \frac{n(\vec{x})}{\overline{n}} - 1,
\]which is also known as the overdensity field. This density field is an important cosmological probe since e.g. a different amount of dark energy would change the density field. So we can use the density field to study dark energy.

Beside dark energy, the density field also contains information about dark matter, the neutrino masses and the inflationary period of the Universe. We can also use it to test general relativity or modified gravity theories.

If the distribution of $\delta(\vec{x})$ at different positions $\vec{x}$ follows a Gaussian distribution, all information contained in it can be captured by the 2-point function, which in configuration-space is called the correlation function, $\xi(r)$, and in Fourier-space is called the power spectrum, $P(k)$. The 2-point function is the ensemble average of correlations between all possible pairs within the density field.
\begin{align}
\xi(\vec{r}) &= \langle \delta(\vec{x} + \vec{r})\delta(\vec{x})\rangle\\
P(\vec{k}) &= \langle |\delta(\vec{k})|^2\rangle,
\end{align}where $\delta(\vec{k})$ is the Fourier transform of the density field $\delta(\vec{x})$.
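
As a side note, this definition translates almost directly into how the power spectrum is measured in practice. Below is a minimal sketch for a periodic simulation box, assuming the galaxies have already been assigned to a density grid (no shot-noise, mass-assignment or window corrections):

import numpy as np

def measure_pk(n_grid, box_size, n_bins=30):
    # overdensity field: delta(x) = n(x)/n_bar - 1
    delta = n_grid / n_grid.mean() - 1.0
    n = delta.shape[0]
    cell = box_size / n

    # Fourier transform of the density field and |delta(k)|^2
    delta_k = np.fft.rfftn(delta) * cell**3
    power = np.abs(delta_k)**2 / box_size**3

    # wavenumber of every grid cell
    kx = 2.0 * np.pi * np.fft.fftfreq(n, d=cell)
    kz = 2.0 * np.pi * np.fft.rfftfreq(n, d=cell)
    kmag = np.sqrt(kx[:, None, None]**2 + kx[None, :, None]**2 + kz[None, None, :]**2)

    # average |delta(k)|^2 in spherical k-shells
    edges = np.linspace(2.0 * np.pi / box_size, kmag.max(), n_bins + 1)
    k_mean, pk = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (kmag >= lo) & (kmag < hi)
        if mask.any():
            k_mean.append(kmag[mask].mean())
            pk.append(power[mask].mean())
    return np.array(k_mean), np.array(pk)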

It turns out that the density field of the Universe is close to a Gaussian field on large scales, but on smaller scales ($\sim 10\,$Mpc) it has non-Gaussian properties. To capture all information in a density field which does not behave like a Gaussian random field, we need to use all possible n-point functions, not just the 2-point function.

Including the 3-point correlation, which in Fourier-space is called the bispectrum, will increase the signal-to-noise ratio, and improve cosmological parameter constraints. Figure 1 shows how much information the bispectrum can add as a function of the smallest scale included ($k_{\rm max}$).
Figure 1: Signal-to-noise ratio as a function of $k_{max}$. Adding the bispectrum allows getting
closer to the Gaussian limit, which represents the maximum amount of information which can be
retrieved from this clustering dataset.
The bispectrum is defined as the ensemble average of the product of the density field at three points
\[
B_{123} = \langle \delta(\vec{k}_1)\delta(\vec{k}_2)\delta(\vec{k}_3)\rangle.
\]This, however, is a 9-dimensional quantity, which is very difficult to analyze. The question is, how can we compress the information contained in this quantity into some lower dimensional space without losing information?

First, we have to figure out how many parameters one needs to describe this three-point configuration.

The bispectrum essentially identifies all triangle configurations in the density field and averages them. Any triangle can be described by 3 parameters. However, with galaxy surveys, we are also interested in the orientation of the triangle with respect to the line-of-sight, which requires 2 additional parameters, so 5 in total. Just as a comparison, the power spectrum can be described with only 2 parameters, the magnitude of the wavevector $k$ and the cosine of the angle to the line-of-sight $\mu$.

Other estimators

One important aspect of the bispectrum is its computational complexity. Since one needs to identify all possible triangles in the density field, the complexity of such a procedure is naively given by $\mathcal{O}(N^3)$. However, in 2015 we learned that the bispectrum can actually be estimated in $\mathcal{O}(N\ln N)$ using Fast Fourier Transforms (FFTs). For the rest of this post I will only consider FFT based estimators for the bispectrum.

The first FFT-based bispectrum estimator was proposed by Roman Scoccimarro (Scoccimarro 2015) and looks like this
\[
B^m_{\ell}(k_1, k_2, k_3) = \frac{(2\ell + 1)}{N^T_{123}} \prod^3_{i=1} \int_{k_i} d^3 q_i \delta_{123}\delta_{\ell}(q_1)\delta_0(q_2)\delta_0(q_3).
\]This estimator is based on the approach suggested in Scoccimarro (2015) and Bianchi et al. (2015), which calculates the $\delta_{\ell}(q_1)$ using FFTs.

However, there is a fundamental issue with this estimator. Any Fourier-space quantity like the power spectrum or bispectrum needs to account for the survey window function. For example, the power spectrum we measure in a galaxy survey is given by
\[
P^{\rm conv}(\vec{k}) = \int d\vec{k}'P^{\rm true}(\vec{k}')W^2(\vec{k} - \vec{k}'),
\]where $W(\vec{k})$ represents the survey window function. The window function basically refers to the power spectrum one would measure with a completely random distribution of points within a certain geometric configuration. Even if all galaxies in our galaxy survey were at completely random positions, we would still measure a power spectrum. This power spectrum of random positions is $W(\vec{k})$. So in principle, we know the window function very well, but accounting for it in a power spectrum analysis can be quite tricky.

The approach we take in the power spectrum analysis goes through the following steps:

(1) Fourier transform the power spectrum model into configuration-space (this gives the correlation function)
(2) Multiply it by the window function (in configuration-space)
(3) Fourier transform the product back into Fourier-space

After we have convolved the power spectrum model with the window function in this way, we can compare it to the power spectrum measurement from the data.
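
For the isotropic part of the power spectrum these three steps can be written down in a few lines (a toy sketch using direct spherical Bessel integrals, not the code used in the actual analysis; window_r stands for the window correlation function evaluated on the r grid):

import numpy as np
from scipy.integrate import simpson
from scipy.special import spherical_jn

def pk_to_xi(k, pk, r):
    # step (1): Fourier transform the model P(k) to the correlation function xi(r)
    return np.array([simpson(k**2 * pk * spherical_jn(0, k * ri), x=k)
                     for ri in r]) / (2.0 * np.pi**2)

def xi_to_pk(r, xi, k):
    # step (3): transform back to Fourier space
    return np.array([4.0 * np.pi * simpson(r**2 * xi * spherical_jn(0, ki * r), x=r)
                     for ki in k])

def convolve_with_window(k, pk_model, r, window_r):
    xi = pk_to_xi(k, pk_model, r)
    xi_convolved = xi * window_r      # step (2): multiply by the window correlation function
    return xi_to_pk(r, xi_convolved, k)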

To do the same with the bispectrum, we need to define a Hankel transform for the bispectrum. The Hankel transform is the (radial) Fourier transform that relates the ensemble-averaged Fourier-space quantity (power spectrum/bispectrum) to its configuration-space counterpart (correlation function/three-point function).

However, a Hankel transform for Roman's bispectrum has not been found yet (and might not exist). This does not mean that there is no way to include the window function in Roman's estimator, it just means that so far no formalism has been published and we found it very hard to find such a formalism. For that reason, we developed a new bispectrum estimator, which has a Hankel relation to its configuration-space equivalent and therefore can include the survey window function effect in the same way as described above for the power spectrum.

Our new estimator

We use the 3D bispectrum in the form $B(\vec{k}_1, \vec{k}_2, \hat{n})$, where we chose the $\vec{k}_3$ axis as the line-of-sight $\hat{n}$. We now use spherical harmonic expansion in all three vectors
\begin{align}
B(\vec{k}_1,\vec{k}_2,\hat{n}) &= \sum_{\ell_1\ell_2 L} \sum_{m_1m_2M}B_{\ell_1\ell_2L}^{m_1m_2M}(k_1,k_2) y_{\ell_1}^{m_1}(\hat{k}_1) y_{\ell_2}^{m_2}(\hat{k}_2) y_{L}^{M}(\hat{n}),
\end{align}with the coefficients
\begin{align}
B_{\ell_1\ell_2L}^{m_1m_2M}(k_1,k_2) &=
N_{\ell_1\ell_2L} \int \frac{d^2\hat{k}_1}{4\pi}\int \frac{d^2\hat{k}_2}{4\pi}\int \frac{d^2\hat{n}}{4\pi} \\
&\times y^{m_1*}_{\ell_1}(\hat{k}_1) y^{m_2*}_{\ell_2}(\hat{k}_2)y^{M*}_{L}(\hat{n}) B(\vec{k}_1,\vec{k}_2,\hat{n}),
\end{align}and the normalized spherical harmonics $y_{\ell}^m = \sqrt{4\pi/(2\ell+1)}\, Y_{\ell}^m$. The step above mapped the 3D bispectrum into the 8 dimensional coefficients $B_{\ell_1\ell_2L}^{m_1m_2M}(k_1,k_2)$. However, as we established above, the bispectrum can be captured with only 5 parameters. We found that the 8 dimensional quantity above can be further compressed into 5 parameters using
\begin{align}
B_{\ell_1\ell_2L}(k_1,k_2) &=
H_{\ell_1\ell_2L} \sum_{m_1m_2M} \left( \begin{smallmatrix} \ell_1 & \ell_2 & L \\ m_1 & m_2 & M \end{smallmatrix} \right)
B_{\ell_1\ell_2L}^{m_1m_2M}(k_1,k_2),
\end{align}where $H_{\ell_1\ell_2L}=\left( \begin{smallmatrix} \ell_1 & \ell_2 & L \\ 0 & 0 & 0 \end{smallmatrix} \right)$. This last step assumes statistical isotropy and parity-symmetry as well as homogeneity of the density field.
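
This $m$-summation translates almost directly into code, for example using sympy's Wigner 3j symbols (a sketch; B_m is a hypothetical dictionary holding the coefficients $B_{\ell_1\ell_2L}^{m_1m_2M}(k_1,k_2)$ for one pair $k_1, k_2$):

from sympy.physics.wigner import wigner_3j

def compress(B_m, ell1, ell2, L):
    # H_{l1 l2 L} is the Wigner 3j symbol with all m's set to zero
    H = float(wigner_3j(ell1, ell2, L, 0, 0, 0))
    total = 0.0
    for m1 in range(-ell1, ell1 + 1):
        for m2 in range(-ell2, ell2 + 1):
            M = -m1 - m2            # the 3j symbol vanishes unless m1 + m2 + M = 0
            if abs(M) > L:
                continue
            total += float(wigner_3j(ell1, ell2, L, m1, m2, M)) * B_m.get((m1, m2, M), 0.0)
    return H * total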

Properties of our estimator

We found a few interesting properties for this particular decomposition:

1. All anisotropies with respect to the line of sight are compressed into one multipole, labeled $L$. These anisotropies are caused by redshift-space distortions and the Alcock-Paczynski effect, which represent two of the most important observables in cosmology.

2. The shot noise term associated with this decomposition is $k$-dependent. This is rather surprising since for the power spectrum we only have to deal with a constant shot noise (not considering halo exclusion). The non-constant shot noise has very interesting implications, e.g. it does not vanish for the configuration-space quantity (the three-point function).

But the most important point is that for this estimator we can write down a Hankel transform
\begin{align}
B_{\ell_1\ell_2L}(k_1,k_2) &= (-i)^{\ell_1+\ell_2}(4\pi)^2 \int dr_1 r_1^2 \int dr_2 r_2^2 \\
&\times j_{\ell_1}(k_1r_1) j_{\ell_2}(k_2r_2) \zeta_{\ell_1\ell_2L}(r_1,r_2) \\
\zeta_{\ell_1\ell_2L}(r_1,r_2) &=
i^{\ell_1+\ell_2}\int \frac{dk_1k_1^2}{2\pi^2} \int \frac{dk_2k_2^2}{2\pi^2} \\
&\times j_{\ell_1}(r_1k_1)j_{\ell_2}(r_2k_2)B_{\ell_1\ell_2L}(k_1,k_2),
\end{align}which allows us to apply the window function to any bispectrum model before comparing it to the measurement, just as outlined above for the power spectrum. The window function effects are shown in Figure 2.

Figure 2: This plot shows the impact of the survey window function on the bispectrum multipoles. The black dashed line shows the measurement of the bispectrum in periodic box simulations, while the grey shaded region shows the measurement in lightcones following the geometry of the Baryon Oscillation Spectroscopic Survey (North Galactic Cap). You can see that there is a suppression of power on almost all scales due to the window function. The red dashed and red solid lines show the impact of the window function on a second-order perturbation theory model using our window function formalism. You can see that we can reproduce the effect measured in the simulations.
Finally, we used this analysis pipeline to measure the highest order anisotropic bispectrum component $B_{202}$ which we detect with $14\sigma$ significance. This represents the first detection of the anisotropic bispectrum signal (to our knowledge).

To summarize, the new bispectrum decomposition outlined in this paper allows us to write down a clear analysis pipeline which self-consistently accounts for the survey window function. The next steps will be to apply this analysis to galaxy survey datasets.

I hope this summary was useful and please let me know if you have any comments/questions in the comment section below.
cheers
Florian