Monday, October 22, 2018

Shrinkage estimators for stable covariance matrices

Obtaining reliable error bars is often the most difficult aspect of a data analysis project. In cosmology, we usually derive a covariance matrix from simulations as
\[
C_{ij} = \frac{1}{N-1}\sum^{N}_{k=1}\left(X^{(k)}_i - \overline{X}_i\right)\left(X^{(k)}_j - \overline{X}_j\right),
\]where $X^{(k)}_i$ is the value of bin $i$ of whatever statistic you use in the analysis, measured in simulation $k$, and $\overline{X}_i$ is its mean over all $N$ simulations.
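In practice, this estimate is a couple of lines of numpy; here is a minimal sketch, where mock_measurements is a hypothetical array holding the measured statistic of each simulation
import numpy as np

# mock_measurements: hypothetical array of shape (N_sims, n_bins),
# one row per simulation, one column per bin of the data vector
mock_measurements = np.random.rand(100, 20)  # placeholder data

mean = np.mean(mock_measurements, axis=0)
diff = mock_measurements - mean
# covariance matrix with the 1/(N-1) normalization from the formula above
C = np.dot(diff.T, diff) / (len(mock_measurements) - 1)

# equivalent one-liner using numpy's built-in estimator
C_check = np.cov(mock_measurements, rowvar=False)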

The covariance matrix as estimated above carries noise due to the finite number of simulations. The level of noise depends on the number of bins $n$ in the data vector (which sets the dimension of the matrix) and the number of simulations $N$. If $N \gg n$ the matrix is usually fine, while for $N \approx n$ the noise can play a significant role, and for $N < n$ the matrix cannot even be inverted. In the last two cases, we can use shrinkage estimators to obtain a covariance matrix that avoids propagating the noise due to the small $N$ into the parameter estimates.

The shrinkage estimator $C'$ for a covariance matrix $C$ is usually given as a linear combination of the matrix itself and a target matrix $T$
\[
C' = \lambda T + (1 - \lambda)C,
\]where $T$ could, for example, be the diagonal matrix $T = \text{diag}(C)\mathcal{I}$ with $\mathcal{I}$ being the identity matrix.
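As a minimal sketch (assuming $C$ is the noisy covariance matrix from above and using the diagonal target), the shrinkage combination can be written as
import numpy as np

def shrink_covariance(C, lam):
    # target matrix T = diag(C) * identity
    T = np.diag(np.diag(C))
    # linear combination C' = lambda * T + (1 - lambda) * C
    return lam * T + (1.0 - lam) * C

# example: mild shrinkage towards the diagonal target
# C_prime = shrink_covariance(C, lam=0.2)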

The question now is how to obtain $\lambda$. For $N \gg n$ we would like $\lambda \approx 0$, so that we recover the covariance matrix $C$, since in this case it carries little noise and should be reliable. For $N \approx n$ we would like $\lambda \approx 1$, so that we get something close to $T$, avoiding the noise in $C$.

Using $T$ instead of $C$, of course, has consequences since it will ignore the correlation between bins in the data vector, which can significantly underestimate the parameter uncertainties. We therefore can't obtain the parameter error from the likelihood analysis itself but need to use an alternative. For example, we could perform the same analysis on many mock realizations and obtain the parameter error from the variance amongst the mock realizations.

So how can we obtain $\lambda$ in practice? One common approach is cross-validation; another is the Ledoit-Wolf estimator, which determines $\lambda$ analytically from the data. Using the Ledoit-Wolf implementation in the Python package sklearn, we can obtain the shrinkage parameter and the inverse of the covariance matrix (also known as the precision matrix), and check the fit on a held-out set of mocks, with
from sklearn.covariance import LedoitWolf

# list_of_mock_pk: array of shape (N_mocks, n_bins) holding the power spectrum
# measured in each mock realization (one row per mock)
N = len(list_of_mock_pk)
X_train = list_of_mock_pk[:int(N*0.75)]  # 75% of the mocks to fit the estimator
X_test = list_of_mock_pk[int(N*0.75):]   # 25% held out to check the fit

lw = LedoitWolf()
loglik_lw = lw.fit(X_train).score(X_test)  # log-likelihood of the held-out mocks
print("loglik_lw = ", loglik_lw)
print("lw.shrinkage_ = ", lw.shrinkage_)   # the estimated shrinkage parameter lambda
Cinv = lw.get_precision()                  # inverse covariance (precision) matrix
Figure 1 shows an example taken from arXiv:0711.2509, where a shrinkage estimator is used to measure $\sigma_8$ from a galaxy power spectrum. The upper panel shows the distribution of $\sigma_8$ from a set of mock realizations, while the lower panel shows the estimated error on $\sigma_8$. The black solid line shows the true result while the blue dashed line shows the result with the original covariance matrix $C$, which carries a lot of noise since $N\approx n$. Using the shrinkage estimator they get the red dashed line, which now agrees very well with the reference distribution. However, the error estimate in the lower panel is far away from the reference case, but so is the original covariance matrix (blue dashed line). So while we still can't trust the error from the analysis, at least the distribution of estimated $\sigma_8$ is now correct and we can estimate the error on $\sigma_8$ from that distribution.
Figure 1: This Figure shows the distribution of $\sigma_8$ measurements in mock realizations using a
covariance matrix based on a finite set of simulations (blue dashed line), and using the shrinkage estimator
(red dashed line). The shrinkage estimator gives the correct distribution of $\sigma_8$ but
provides the wrong error estimate (lower panel). Reference arXiv:0711.2509
So shrinkage estimators usually perform better than $C$ alone and performing a shrinkage estimation is a good approach to get an idea of how reliable the covariance matrix is. If the shrinkage estimate results in a large $\lambda$, one should be cautious about any result based on this covariance matrix.

Let me know if you have any comments/questions below.
best
Florian

Monday, April 30, 2018

Jacobian transformation and uniform priors in Bayesian inference

This blog post describes the Jacobian transformation and how a uniform prior on a parameter can turn into a non-uniform prior for related parameters.

The Jacobian transformation is an algebraic method for determining the probability distribution of a variable $y$ when we know the probability distribution for $x$ and some transformation which relates $x$ and $y$.

Let $x$ be a variable with probability density function $f(x)$ and cumulative distribution
\[
F(x) = \int f(x)dx,
\]and similarly let $y$ have density $f(y)$ with
\[
F(y) = \int f(y)dy.
\]Conservation of probability requires $f(y)|dy| = f(x)|dx|$, so we can determine $f(y)$ as
\[
f(y) = J(x, y)f(x),
\]where $J(x, y)$ is the Jacobian. For a one-dimensional transformation, the Jacobian is given by
\[
J(x, y) = \left|\frac{\partial x}{\partial y}\right|.
\]Now assume you perform a likelihood analysis and put a uniform prior on a parameter $\alpha$. The uniform prior on $\alpha$ between $\alpha_{\rm low}$ and $\alpha_{\rm high}$ means that (ignoring the normalization constant $1/(\alpha_{\rm high} - \alpha_{\rm low})$)
\[
f(\alpha) =
\begin{cases}
   1 & \text{ for } \alpha_{\rm low} < \alpha < \alpha_{\rm high},\\
   0 & \text{ otherwise }.
\end{cases}
\]The result of the analysis will be some posterior distribution for $\alpha$. Now assume you want to use that analysis to infer something about another parameter $\beta = \frac{1}{\alpha}$. The prior which was uniform in $\alpha$ is now distorted in $\beta$, and we can calculate its shape using the Jacobian transformation above.

First, we calculate the Jacobian
\[
\left|\frac{\partial \alpha}{\partial \beta}\right| = \frac{1}{\beta^2},
\]which together with $f(\alpha)$ results in
\[ f(\beta) = \frac{1}{\beta^2} \quad \text{for } 1/\alpha_{\rm high} < \beta < 1/\alpha_{\rm low}, \]and zero otherwise.
Figure 1: A flat prior on $\alpha$ transforms into a strongly scale-dependent prior on $\beta = 1/\alpha$.
While the prior on $\alpha$ gives the same weight to all values between $1.5$ and $10$, the prior
density in $\beta$ behaves as $1/\beta^2$ and therefore gives much more weight to small values
of $\beta$ (corresponding to large values of $\alpha$).

Figure 1 shows the prior for $\alpha$ and the corresponding prior for $\beta$ using $1.5 < \alpha < 10$. What was a uniform prior on $\alpha$ looks nowhere near uniform in $\beta$. Given that Bayesian inference always assumes some prior, one has to take into account how priors change under parameter transformations.
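A quick numerical check of this transformation (a sketch assuming the range $1.5 < \alpha < 10$ from the figure): sample $\alpha$ uniformly, map it to $\beta = 1/\alpha$ and compare the histogram of $\beta$ with the $1/\beta^2$ prediction.
import numpy as np

alpha_low, alpha_high = 1.5, 10.0
alpha = np.random.uniform(alpha_low, alpha_high, size=1000000)  # uniform prior on alpha
beta = 1.0 / alpha                                              # transformed parameter

# histogram of beta, normalized to a probability density
hist, edges = np.histogram(beta, bins=50, density=True)
centers = 0.5 * (edges[1:] + edges[:-1])

# analytic prediction: f(beta) = 1/beta^2 times the normalization of the flat prior
prediction = 1.0 / (alpha_high - alpha_low) / centers**2

print(np.max(np.abs(hist - prediction) / prediction))  # small fractional deviation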

Please leave comments/questions below.
best
Florian

Saturday, March 31, 2018

About the assumption of a Gaussian likelihood in a data model comparison

This is a blog post about a recent publication led by ChangHoon Hahn, which I co-authored (Hahn et al. 2018). The paper looks at the impact of non-Gaussian likelihoods for cosmological parameter estimation. This blog post attempts to explain the content of this paper for a general audience.

When finding the best model to describe a given dataset, most studies employ the $\chi^2$ statistic, which is a measure of how well the model and the data agree
\[
\chi^2 = \sum^{d}_{i,j} (\mathrm{data}_i - \mathrm{model}_i)\,C^{-1}_{ij}\,(\mathrm{data}_j - \mathrm{model}_j),
\]where $d$ is the number of data points in the dataset, and the covariance matrix $C$ describes the error of the data points. This statistic is directly related to the so-called likelihood, which can be obtained as
\[
p(x|\theta) = \frac{1}{(2\pi)^{d/2}\sqrt{\det C}} \exp{\left(-\tfrac{1}{2}\chi^2\right)}.
\]This likelihood describes the probability that one could have observed the data, given that the underlying truth is described by the model with the parameters $\theta$. So whenever somebody observes some data, the final analysis step is usually to calculate this likelihood and find the parameters $\theta$ which have the largest likelihood.
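As a small sketch of these two equations (with hypothetical data, model and C arrays), the Gaussian log-likelihood can be computed like this
import numpy as np

def gaussian_loglike(data, model, C):
    # chi^2 = (data - model)^T C^{-1} (data - model)
    diff = data - model
    chi2 = np.dot(diff, np.dot(np.linalg.inv(C), diff))
    # log of the Gaussian likelihood, including the normalization
    d = len(data)
    sign, logdetC = np.linalg.slogdet(C)
    return -0.5 * (chi2 + logdetC + d * np.log(2.0 * np.pi))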

However, the likelihood above has a Gaussian shape, and it is not guaranteed that the true probability distribution has that form. So the question we address in this publication is: how close is the likelihood to a Gaussian form, and if there is a deviation, how does it impact the analysis?

The first question to answer is, why do we assume a Gaussian likelihood in the first place?

The Gaussian distribution has a central role in statistics because it has been shown that when one collects many measurements of the same quantity, these measurements often follow a Gaussian distribution. For example, if you throw a coin 100 times and calculate the mean outcome (counting heads as 1 and tails as 0), this mean will probably be somewhere close to 0.5.

Now repeat that experiment many times: throw the coin 100 times, calculate the mean of those 100 throws, and do this over and over. The distribution of the means will peak close to 0.5 and will have a Gaussian shape (see Figure 1). This is quite surprising, because the distribution for a single throw, just two equally likely outcomes, looks nothing like a Gaussian. So adding up many such distributions somehow leads to a Gaussian distribution.
Figure 1: Probability distribution for the mean of many throws of a coin. The number of throws is shown in the top gray
box. With an increasing number of throws, the distribution looks more and more Gaussian (source: Wikipedia).
Figure 1 shows the probability distribution when looking at the mean of many coin flips. The distribution looks more and more Gaussian if the number of coin flips is increased.
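This behaviour is easy to reproduce numerically; here is a minimal sketch of the coin-flip experiment described above
import numpy as np

n_throws = 100         # number of coin flips per experiment
n_experiments = 50000  # how often the experiment is repeated

# each flip is 0 (tails) or 1 (heads); take the mean of each experiment
flips = np.random.randint(0, 2, size=(n_experiments, n_throws))
means = flips.mean(axis=1)

print("mean of the means =", means.mean())    # close to 0.5
print("scatter of the means =", means.std())  # close to 0.5/sqrt(n_throws)
# a histogram of `means` looks more and more Gaussian as n_throws increases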

In fact, this happens quite often in nature: the distribution of the mean (or sum) of a large number of events often looks Gaussian, even if the distribution for a single event does not. This is known in statistics as the central limit theorem.

However, given the example above you can already see that the assumption of a Gaussian likelihood only works for large numbers. Figure 1 shows clearly that the probability distribution is far from Gaussian if one calculates the mean from only a small number of coin flips.

Going back to our paper: we looked at the power spectrum analysis of the Baryon Oscillation Spectroscopic Survey (BOSS) published in Beutler et al. (2016) and the group multiplicity function analysis of Sinha et al. (2017).

Let's start with the power spectrum analysis. I discussed this analysis in several blog posts (e.g. here) already so I will skip the details about it. Here we do not care too much about the details of the measurement anyway, we just want to test whether the assumption of a Gaussian likelihood is justified.

In the power spectrum case one calculates the mean of many Fourier modes and the question is whether we have enough modes to make the likelihood distribution Gaussian. In general, there are many modes on small scales (large wavevector $k$) but only a small number of modes on large scales (small wavevector $k$). E.g. at $k=0.01\,h/$Mpc only a few dozen modes contribute to the value of the power spectrum, while at $k=0.2\,h/$Mpc many thousand modes contribute. Whether this matters or not depends on how this impacts the parameters of the model, which is not straightforward.

There is a very simple way to check whether the likelihood is Gaussian. We have produced mock catalogs for the BOSS survey, which mimic the BOSS dataset. Each of these mock catalogs represents a different possible realization of the measurement of BOSS. Therefore we just have to measure the power spectrum of all these mock catalogs and see whether those power spectra scatter around their mean in the shape of a Gaussian.
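One simple (and by no means exhaustive) way to quantify this, assuming the mock power spectra are stored in an array of shape (number of mocks, number of k-bins), is to compute the skewness and kurtosis of each k-bin, both of which vanish for a Gaussian distribution
import numpy as np
from scipy import stats

# mock_pk: hypothetical array of shape (N_mocks, n_kbins) holding one power
# spectrum measurement per mock catalog
mock_pk = np.random.randn(1000, 30)  # placeholder data

skew = stats.skew(mock_pk, axis=0)       # 0 for a Gaussian
kurt = stats.kurtosis(mock_pk, axis=0)   # excess kurtosis, 0 for a Gaussian

print("max |skewness| over k-bins:", np.abs(skew).max())
print("max |kurtosis| over k-bins:", np.abs(kurt).max())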

Whatever the distribution is going to be, we then use that distribution rather than assuming a Gaussian when searching for the model with the maximum likelihood.
Figure 2: This plot shows the likelihood distribution for the different cosmological parameters derived from
the BOSS power spectrum. The blue distribution assumes a Gaussian likelihood, while the orange
distribution uses the likelihood distribution seen in the mock catalogs. There is overall good
agreement, except for a small shift in $f\sigma_8$.
Figure 2 shows the likelihood distribution for each cosmological parameter. Note that this is a different quantity than the likelihood we talked about before: there, the likelihood described the probability of observing the data vector given one particular model, while the distributions in Figure 2 show how that probability changes as a function of each parameter value.

The blue distribution assumes a Gaussian likelihood, while in orange we use the likelihood from the mocks. The two likelihoods seem to agree well, except for the parameter $f\sigma_8$, where we found a small shift to larger values.

Next, the group multiplicity function analysis of Sinha et al. (2017). The likelihood distributions are again shown in blue and orange in Figure 3. For this analysis, you can see that the two likelihoods agree less well, especially in terms of how peaked the distributions are. Removing the assumption of a Gaussian likelihood seems to increase the error on almost all derived parameters.
Figure 3: Comparison between the original group multiplicity function analysis of Sinha et al (2017) with the new analysis dropping the assumption of a Gaussian likelihood. The two distributions show significant
deviations for most parameters.
In conclusion, the assumed form of the likelihood can impact cosmological parameter constraints and should be accounted for. With larger datasets the statistics usually improve and the issue discussed here will become smaller. However, observables like primordial non-Gaussianity live in the regime of small mode statistics (the largest scales), and hence such studies must check for these biases carefully.

I hope this was useful. Let me know if you have any comments/questions in the comment section below.
best
Florian 

Thursday, March 29, 2018

Measuring Neutrinos with the distribution of galaxies in the Universe

In this blog post, I will summarise the paper Baumann et al. (2018) which my collaborators and I just published. The paper reports a measurement of the phase shift in the Baryon Acoustic Oscillation signal due to free streaming particles (neutrinos) in the early universe. So what does that mean?

1. What is the Baryon Acoustic Oscillation (BAO) scale? The BAO scale is a special scale in the distribution of galaxies. If you look at the separation of galaxies in the Universe, you will find that at around 150 Megaparsecs (500 000 000 light years) there is a sudden increase in the number of galaxy pairs. This special scale in the distribution of galaxies was imprinted when the Universe was just 300 000 years old at the so-called decoupling era. If you measure the correlation function of a galaxy distribution, meaning you count the number of galaxy pairs as a function of scale, the BAO signal will show up as a peak at 150 Megaparsecs. If you look at the Fourier transform of the correlation function, the power spectrum, it will show up as an oscillation signal. This signal is what we use in this analysis.

2. What is the decoupling era? In the early Universe photons and baryons were coupled together due to the high pressure. 300 000 years after the Big Bang the temperature and pressure dropped enough so that photons and baryons decoupled. From that point forward most of the photons in the Universe just traveled through the Universe without any interactions with anything. For that reason, we can look at these Cosmic Microwave Background (CMB) photons and basically see a snapshot of the Universe as it was 300 000 years after the Big Bang.

3. What is the effective number of relativistic species? This number describes all particles which had a relativistic velocity at the decoupling era. Currently, this number is assumed to be three, given the three known species of neutrinos ($\nu_{e}$, $\nu_{\mu}$ and $\nu_{\tau}$). We know that these particles must have been produced during the Big Bang and since they are very light, they should have had relativistic velocities at decoupling.

Figure 1 shows the impact of the effective number of free-streaming relativistic species ($N_{\rm eff}$) on the BAO signal and you can see that it leads to a shift in the phase.

Figure 1: This figure shows the impact of the effective number of relativistic species ($N_{\rm eff}$)
on the Baryon Acoustic Oscillation (BAO) scale in the matter power spectrum. It is the shift in the
phase of the BAO which we measured for the first time in this analysis. (Reference: Baumann et al. 2017)
This phase shift is very small, but we now managed to constrain the BAO scale to 1% using the Baryon Oscillation Spectroscopic Survey (BOSS), which makes us sensitive to such small effects.

So why are we interested in this effect?

Well, it could be that there are additional particles which propagated in the early Universe, but which we have not seen yet in particle accelerators. In this sense, we are using the Big Bang as the ultimate high energy collider and test whether the remains of the Big Bang show any evidence for new particles.

Besides that, the neutrinos generated by the Big Bang are also a very interesting probe of the early Universe. As mentioned before we can use the Cosmic Microwave Background (CMB) to get a snapshot of the Universe when it was only 300 000 years old. But we cannot actually look any closer to the Big Bang. If we could see the distribution of neutrinos produced by the Big Bang, we would actually see the Universe when it was only 1 second old, because neutrinos stopped interacting with other components of the Universe much earlier than photons, 1 second after the Big Bang.

So if we want to investigate the Universe at its very early age, the neutrino background produced by the Big Bang would be one way to do that. Another would be the gravitational wave background produced by the Big Bang. However, right now we do not have experiments sensitive enough to detect the neutrino or gravitational wave background. The detection of the phase shift in the BAO is an indirect detection of the neutrino background.

The shift in the BAO signal is not quite scale-independent, hence the first step is to create a template for the shift, which we then include in the standard BAO fitting pipeline I developed for the BOSS analysis back in 2017 (Beutler et al. 2017). The template for the phase shift is shown in Figure 2.
Figure 2: Template describing the shift in the BAO phase as a function of wavenumber. We
include this template in our analysis, with a free parameter for its amplitude.
The figure shows that the shift caused by neutrinos $f(k)$ is larger at larger wavenumber $k$.

The amplitude of this template is a free parameter in the fit, which we call $\beta$. We then use the BOSS dataset to constrain this parameter.
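To give a rough idea of what "a template with a free amplitude" means in practice, here is a toy sketch; both the wiggle model and the functional form of f(k) below are placeholders for illustration, not the actual template or fitting pipeline used in the paper.
import numpy as np

def f_template(k):
    # placeholder for the phase-shift template f(k) shown in Figure 2; in the real
    # analysis this is taken from the fitting formula of Baumann et al. (2017)
    return k / (0.05 + k)

def toy_bao_wiggles(k, alpha=1.0, beta=1.0, r_s=147.0):
    # toy oscillatory part of the power spectrum: alpha rescales the BAO scale,
    # beta scales the amplitude of the phase shift (beta = 1 for the standard model)
    return np.sin(alpha * k * r_s + beta * f_template(k)) * np.exp(-(k * 8.0)**2)

k = np.linspace(0.01, 0.3, 300)         # wavenumbers in h/Mpc
wiggles = toy_bao_wiggles(k, beta=1.2)  # beta is the free parameter in the fit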
Figure 3: Constraints on the amplitude of the phase shift ($\beta$) obtained from the BOSS dataset
in two redshift bins ($z_1$ and $z_3$).
Figure 3 shows the constraints on the amplitude of the template ($\beta$) obtained from the BOSS dataset. In BOSS we have two redshift bins, $z_1$ ($0.2 < z < 0.5$) and $z_3$ ($0.5 < z < 0.75$). We use both redshift bins for our constraint, which yields $\beta = 1.2\pm 1.8$. This constraint can then be converted into a constraint on the effective number of relativistic species. $\beta = 0$ would correspond to $N_{\rm eff} = 0$, while $\beta = 2.45$ would correspond to $N_{\rm eff} = \infty$.

You can see that the BOSS data alone can't really constrain $N_{\rm eff}$. However, Figure 3 also shows that there is a strong degeneracy between the parameter $\beta$ and the BAO scale parameter $\alpha$. If we fix the cosmological model, we can get strong priors on the $\alpha$ parameters from the Cosmic Microwave Background. The constraints including these priors are also included in Figure 3 (red and orange contours on the left and red solid line on the right), resulting in $\beta = 2.05\pm0.81$. This excludes $\beta=0$ (corresponding to $N_{\rm eff} = 0$) at more than $2\sigma$. Note that the expected value of $\beta$ in a standard cosmological model with $N_{\rm eff} = 3.046$ is $\beta = 1$, while our data seem to prefer larger values. However, $\beta = 1$ is well within our likelihood distribution (red line in Figure 3, right) and hence we do not see any significant deviation from the standard model.

A similar measurement of the phase shift in the photon distribution (CMB) using the Planck data has been reported in Follin et al. 2015. They measured $N_{\rm eff} = 2.3^{+1.1}_{-0.4}$, which is consistent with our measurement.

In summary, we have, for the first time, measured the phase shift in the BAO signal in the galaxy power spectrum, representing an (indirect) detection of the cosmic neutrino background produced by the Big Bang. Our measurement confirms the current standard model with three neutrino species. While the detection significance of our measurement is fairly low, future galaxy surveys like DESI and Euclid will improve these measurements significantly.

Our paper about this measurement was published in Nature Physics (see https://www.nature.com/articles/s41567-019-0435-6).

Let me know if you have any comments or questions below.
best
Florian

Monday, March 26, 2018

Context managers in Python

A context manager in Python is a way to manage resources. Typical uses of context managers include locking and unlocking resources, managing database connections, and opening and closing files.

Context managers are used together with the with statement. You probably encountered the with statement at some point when coding in Python; the most common use case is file handling
with open('data.dat', 'r') as file: 
   contents = file.read()
However, files do not have to be opened like this; instead, you could also use
file = open('data.dat', 'r')
contents = file.read()
file.close()
The problem with this code is that the file.close() method might never be called if an exception occurs after the file has been opened. If the file is kept open, you have a resource leak, and the system may slow down or crash since the number of available file handles is finite.

A more robust version of dealing with a file could use the try... finally statement like this
file = open('data.dat', 'r')
try:
   contents = file.read()
finally:
   file.close()
So even if there is an exception while reading the file, the finally clause always ensures that the file is closed again. This is basically how the with statement is implemented.

The with statement has two minor advantages over the try... finally statement above: (1) it is a little bit shorter and (2) it does not require calling the close() method explicitly, since the with statement calls it automatically, whenever the program leaves the with statement (indented block below the with statement).

So the main purpose of the with statement is to guarantee the execution of startup and cleanup actions around a block of code, with the main application being resource management.

So how would we write a context manager? We can generate this functionality using a class
class MyOpen(object):
   def __init__(self, filename):
      # called when the object is created, not necessarily when the with block starts
      self.filename = filename

   def __enter__(self):
      # called at the start of the with block; returns the managed resource
      self.file = open(self.filename)
      return self.file

   def __exit__(self, ctx_type, ctx_value, ctx_traceback):
      # called when leaving the with block, even if an exception occurred
      self.file.close()
      # returning False means any exception is not suppressed but propagated
      return False
This class has two methods which are specifically used by the with statement, the __enter__() and __exit__() methods. The __enter__() method is called at the beginning of the with statement. The __exit__() method is called when leaving the with statement and could handle exceptions there; in the example above it simply closes the file and lets any exception propagate (by returning False).

You might wonder what the difference is between __init__() and __enter__(). The __init__() method is called when the class object is created, which does not need to be the beginning of the with statement. This difference allows us to produce reusable context managers like this
file_handler = MyOpen('data.dat')
with file_handler as file:
    contents = file.read()
In this case, the __init__() method is called in line one, when the class object is created, the __enter__() method is called in line two when the with statement begins, and the __exit__() method is called in line three when the with statement ends.

So to produce a context manager you just have to create a class with two special methods: __enter__() and __exit__(). The __enter__() method returns the resource to be managed, like a file object in the case of open(). The __exit__() method does any cleanup work.
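As a side note, the standard library also offers a shorter, generator-based way to write the same kind of context manager using the contextlib module; here is a sketch equivalent to the MyOpen class above
from contextlib import contextmanager

@contextmanager
def my_open(filename):
    file = open(filename)
    try:
        # everything before the yield plays the role of __enter__()
        yield file
    finally:
        # everything after the yield plays the role of __exit__()
        file.close()

# usage is identical to the class-based version
with my_open('data.dat') as file:
    contents = file.read()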

Hope that was useful and please let me know if you have any comments or questions in the comment section below.
cheers
Florian

Wednesday, March 7, 2018

Measuring the galaxy bispectrum of the large scale structure of the Universe

This blog post summarizes the publication Sugiyama, Saito, Beutler & Seo (2018), which my collaborators and I published recently. It describes a new way to measure the galaxy bispectrum. Even though I have attempted to explain the content of this paper for a general audience, the subject is very technical and some knowledge of higher mathematics is required to follow this post.

What is the Bispectrum?

The density field of the Universe, meaning the distribution of dark matter and baryonic matter, can be expressed as the density at point $\vec{x}$ relative to the mean density of the Universe
\[
\delta(\vec{x}) = \frac{n(\vec{x})}{\overline{n}} - 1,
\]which is also known as the overdensity field. This density field is an important cosmological probe since e.g. a different amount of dark energy would change the density field. So we can use the density field to study dark energy.

Beside dark energy, the density field also contains information about dark matter, the neutrino masses and the inflationary period of the Universe. We can also use it to test general relativity or modified gravity theories.

If the distribution of $\delta(\vec{x})$ at different positions $\vec{x}$ follows a Gaussian distribution, all information contained in it can be captured by the 2-point function, which in configuration-space is called the correlation function, $\xi(r)$, and in Fourier-space is called the power spectrum, $P(k)$. The 2-point function is the ensemble average of correlations between all possible pairs within the density field.
\begin{align}
\xi(\vec{r}) &= \langle \delta(\vec{x} + \vec{r})\delta(\vec{x})\rangle\\
P(\vec{k}) &= \langle |\delta(\vec{k})|^2\rangle,
\end{align}where $\delta(\vec{k})$ is the Fourier transform of the density field $\delta(\vec{x})$.

It turns out that the density field of the Universe is close to a Gaussian field on large scales, but on smaller scales ($\sim 10\,$Mpc) it has non-Gaussian properties. To capture all information in a density field which does not behave like a Gaussian random field, we need to use all possible n-point functions, not just the 2-point function.

Including the 3-point correlation, which in Fourier-space is called the bispectrum, will increase the signal-to-noise ratio, and improve cosmological parameter constraints. Figure 1 shows how much information the bispectrum can add as a function of the smallest scale included ($k_{\rm max}$).
Figure 1: Signal-to-noise ratio as a function of $k_{max}$. Adding the bispectrum allows getting
closer to the Gaussian limit, which represents the maximum amount of information which can be
retrieved from this clustering dataset.
The bispectrum is defined as the ensemble average of the product of the density field at three points
\[
B_{123} = \langle \delta(\vec{k}_1)\delta(\vec{k}_2)\delta(\vec{k}_3)\rangle.
\]This, however, is a 9-dimensional quantity, which is very difficult to analyze. The question is, how can we compress the information contained in this quantity into some lower dimensional space without losing information?

First, we have to figure out how many parameters one needs to describe this three-point configuration.

The bispectrum essentially identifies all triangle configurations in the density field and averages them. Any triangle can be described by 3 parameters. However, with galaxy surveys, we are also interested in the orientation of the triangle with respect to the line-of-sight, which requires 2 additional parameters, so 5 in total. Just as a comparison, in the power spectrum this can be done with only 2 parameters, the magnitude of the wavevector $k$ and the cosine of the angle to the line-of-sight $\mu$.

Other estimators

One important aspect of the bispectrum is its computational complexity. Since one needs to identify all possible triangles in the density field, the complexity of such a procedure is naively given by $\mathcal{O}(N^3)$. However, in 2015 we learned that the bispectrum can actually be estimated in $\mathcal{O}(N\ln N)$ using Fast Fourier Transforms (FFTs). For the rest of this post I will only consider FFT based estimators for the bispectrum.

The first FFT-based bispectrum estimator was proposed by Roman Scoccimarro (Scoccimarro 2015) and it looks like this
\[
B^m_{\ell}(k_1, k_2, k_3) = \frac{(2\ell + 1)}{N^T_{123}} \prod^3_{i=1} \int_{k_i} d^3 q_i \delta_{123}\delta_{\ell}(q_1)\delta_0(q_2)\delta_0(q_3).
\]This estimator is based on the approach suggested in Scoccimarro (2015) and Bianchi et al. (2015), which calculates the $\delta_{\ell}(q_1)$ using FFTs.

However, there is a fundamental issue with this estimator. Any Fourier-space quantity like the power spectrum or bispectrum needs to account for the survey window function. For example, the power spectrum we measure in a galaxy survey is given by
\[
P^{\rm conv}(\vec{k}) = \int d\vec{k}'\,P^{\rm true}(\vec{k}')W^2(\vec{k} - \vec{k}'),
\]where $W(\vec{k})$ represents the survey window function. The window function basically refers to the power spectrum one would measure with a completely random distribution of points within a certain geometric configuration. Even if all galaxies in our galaxy survey were at completely random positions, we would still measure a power spectrum. This power spectrum of random positions is $W(\vec{k})$. So in principle, we know the window function very well, but accounting for it in a power spectrum analysis can be quite tricky.

The approach we take in the power spectrum analysis goes through the following steps:

(1) Fourier transform the power spectrum model into configuration space (giving the correlation function),
(2) multiply it by the Fourier transform of the window function,
(3) Fourier transform the product back into Fourier-space.

After we have convolved the power spectrum model with the window function in this way, we can compare it to the power spectrum measurement from the data.
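As a toy illustration of these three steps in one dimension (using made-up Gaussian shapes and ignoring normalization factors; this is not the actual anisotropic window pipeline used for the survey), the convolution could look like this
import numpy as np

k = np.linspace(0.0, 1.0, 512)          # toy wavenumber grid
P_true = np.exp(-(k - 0.2)**2 / 0.002)  # toy model power spectrum
W2 = np.exp(-k**2 / 0.01)               # toy window function power |W(k)|^2

# (1) Fourier transform the model into configuration space
xi_true = np.fft.irfft(P_true)
# (2) multiply by the configuration-space window
xi_conv = xi_true * np.fft.irfft(W2)
# (3) transform the product back into Fourier space
P_conv = np.fft.rfft(xi_conv).real

# up to normalization, P_conv is the window-convolved model which can be
# compared to the measurement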

To do the same with the bispectrum we need to define a Hankel transform for the bispectrum. A Hankel transform is a Fourier transform of the ensemble average quantity (bispectrum/power spectrum).

However, a Hankel transform for Roman's bispectrum has not been found yet (and might not exist). This does not mean that there is no way to include the window function in Roman's estimator, it just means that so far no formalism has been published and we found it very hard to find such a formalism. For that reason, we developed a new bispectrum estimator, which has a Hankel relation to its configuration-space equivalent and therefore can include the survey window function effect in the same way as described above for the power spectrum.

Our new estimator

We use the 3D bispectrum in the form $B(\vec{k}_1, \vec{k}_2, \hat{n})$, where we chose the $\vec{k}_3$ axis as the line-of-sight $\hat{n}$. We now use spherical harmonic expansion in all three vectors
\begin{align}
B(\vec{k}_1,\vec{k}_2,\hat{n}) &= \sum_{\ell_1\ell_2 L} \sum_{m_1m_2M}B_{\ell_1\ell_2L}^{m_1m_2M}(k_1,k_2) y_{\ell_1}^{m_1}(\hat{k}_1) y_{\ell_2}^{m_2}(\hat{k}_2) y_{L}^{M}(\hat{n}),
\end{align}with the coefficients
\begin{align}
B_{\ell_1\ell_2L}^{m_1m_2M}(k_1,k_2) &=
N_{\ell_1\ell_2L} \int \frac{d^2\hat{k}_1}{4\pi}\int \frac{d^2\hat{k}_2}{4\pi}\int \frac{d^2\hat{n}}{4\pi} \\
&\times y^{m_1*}_{\ell_1}(\hat{k}_1) y^{m_2*}_{\ell_2}(\hat{k}_2)y^{M*}_{L}(\hat{n}) B(\vec{k}_1,\vec{k}_2,\hat{n}),
\end{align}and the normalized spherical harmonics $y_{\ell}^m = \sqrt{4\pi/(2\ell+1)}\, Y_{\ell}^m$. The step above mapped the 3D bispectrum into the 8 dimensional coefficients $B_{\ell_1\ell_2L}^{m_1m_2M}(k_1,k_2)$. However, as we established above, the bispectrum can be captured with only 5 parameters. We found that the 8 dimensional quantity above can be further compressed into 5 parameters using
\begin{align}
B_{\ell_1\ell_2L}(k_1,k_2) &=
H_{\ell_1\ell_2L} \sum_{m_1m_2M} \left( \begin{smallmatrix} \ell_1 & \ell_2 & L \\ m_1 & m_2 & M \end{smallmatrix} \right)
B_{\ell_1\ell_2L}^{m_1m_2M}(k_1,k_2),
\end{align}where $H_{\ell_1\ell_2L}=\left( \begin{smallmatrix} \ell_1 & \ell_2 & L \\ 0 & 0 & 0 \end{smallmatrix} \right)$. This last step assumes statistical isotropy and parity-symmetry as well as homogeneity of the density field.

Properties of our estimator

We found a few interesting properties for this particular decomposition:

1. All anisotropies with respect to the line of sight are compressed into one multipole, labeled $L$. These anisotropies are caused by redshift-space distortions and the Alcock-Paczynski effect, which represent two of the most important observables in cosmology.

2. The shot noise term associated with this decomposition is k dependent. This is rather surprising since for the power spectrum we only have to deal with a constant shot noise (not considering halo exclusion). The non-constant shot noise has very interesting implications, e.g. it does not vanish for the configuration-space quantity (the three-point function).

But the most important point is that for this estimator we can write down a Hankel transform
\begin{align}
B_{\ell_1\ell_2L}(k_1,k_2) &= (-i)^{\ell_1+\ell_2}(4\pi)^2 \int dr_1 r_1^2 \int dr_2 r_2^2 \\
&\times j_{\ell_1}(k_1r_1) j_{\ell_2}(k_2r_2) \zeta_{\ell_1\ell_2L}(r_1,r_2) \\
\zeta_{\ell_1\ell_2L}(r_1,r_2) &=
i^{\ell_1+\ell_2}\int \frac{dk_1k_1^2}{2\pi^2} \int \frac{dk_2k_2^2}{2\pi^2} \\
&\times j_{\ell_1}(r_1k_1)j_{\ell_2}(r_2k_2)B_{\ell_1\ell_2L}(k_1,k_2),
\end{align}which allows us to apply the window function to any bispectrum model before comparing it to the measurement, just as outlined above for the power spectrum. The window function effects are shown in Figure 2.

Figure 2: This plot shows the impact of the survey window function onto the bispectrum multipoles. Compare the black dashed line, which shows the measurement of the bispectrum in periodic box simulations with the grey shaded region, which shows the measurement of the bispectrum in lightcones, following the geometry of the Baryon Oscillation Spectroscopic Survey, North Galactic Cap. You can see that there is a suppression of power on almost all scales, due to the window function. The red dashed line and red solid line show the impact of the window function onto a second order perturbation theory model using our window function formalism. You can see that we can reproduce the effect measured in the simulations.
Finally, we used this analysis pipeline to measure the highest order anisotropic bispectrum component $B_{202}$ which we detect with $14\sigma$ significance. This represents the first detection of the anisotropic bispectrum signal (to our knowledge).

To summarize, the new bispectrum decomposition outlined in this paper allows us to write down a clear analysis pipeline which self-consistently accounts for the survey window function. The next steps will be to apply this analysis to galaxy survey datasets.

I hope this summary was useful and please let me know if you have any comments/questions in the comment section below.
cheers
Florian

Tuesday, February 20, 2018

Error handling within a flask application

Any website should somehow be able to deal with unexpected errors. In this blog post, I will describe how I handle errors within my Flask application, which follows this great tutorial.

In Flask, it is quite easy to register error handlers which re-direct to a custom error page like this
from flask import render_template

@app.errorhandler(404)
def page_not_found(e):
    return render_template('404.html'), 404
However, rather than going through the list of possible errors and creating a route and page for each, one can also create a general error handler which handles all errors. To do that you have to register a general error handler in your app like this
from werkzeug.exceptions import default_exceptions

def init_app(app):
    ''' Function to register error_handler in app '''
    for exception in default_exceptions:
        app.register_error_handler(exception, error_handler)

    app.register_error_handler(Exception, error_handler) 
Calling the init_app(app) function registers the error_handler() function for all exceptions in the default_exceptions list of the werkzeug package, and additionally for the generic Exception class, so that even exotic errors are covered. Now we just have to write the error_handler() function.
from flask import Markup, current_app, render_template, request
from flask_login import current_user  # assuming Flask-Login provides the current_user object
from werkzeug.exceptions import HTTPException

def error_handler(error):
    ''' Catches all errors in the default_exceptions list '''
    msg = "Request resulted in {}".format(error)
    if current_user.is_authenticated:
        current_app.logger.error(msg, exc_info=error)
    else:
        current_app.logger.warning(msg, exc_info=error)

    if isinstance(error, HTTPException):
        description = error.get_description(request.environ)
        code = error.code
        name = error.name
    else:
        description = ("We encountered an error "
                       "while trying to fulfill your request")
        code = 500
        name = 'Internal Server Error'

    templates_to_try = ['errors/error{}.html'.format(code), 'errors/generic_error.html']
    return render_template(templates_to_try,
                           code=code,
                           name=Markup(name),
                           description=Markup(description),
                           error=error), code
This function first checks whether the error has been caused by an authenticated user, and if so it writes an error message to the logfile. If the user was not authenticated, it only writes a warning message. The reason for this distinction is that robots cause all sorts of 404 errors in my app, which I don't care much about, but if a user causes an exception, I definitely want to know about it.

In the next step, the handler extracts all information it can get from the error and then renders a template. The render_template function will go through the templates_to_try list until it finds an existing template, so this way you can register custom pages for certain errors, but if a custom error page does not exist, it will just render the general error page, which in my case looks like this
{% extends "master.html" %}

{% block title %}Error{% endblock %}

{% block description %}
<meta name="description" content="Error message.">
{% endblock %}

{% block body %}

    <div id="navbar_wrapper">
        <div id="site_content">
            <div class="container">
                <div class="col-xs-12 col-sm-9 col-md-10 col-lg-10">
                    <br>
                    <h1>{{ code }}:{{ name }}</h1>
                    <p>{{description}}</p>
                    <p>The administrator has been notified. Sorry for the inconvenience!</p>
                    <button class="btn btn-primary" onclick="history.back(-1)">Go Back</button> 
                </div>
            </div>
        </div>
    </div>

{% endblock %}
The advantage of such a custom error page is that you can add a back button, to get the users back on your page if you haven't alienated them yet with the error.

I put the entire example on Github. Let me know if you have any comments or questions.
cheers
Florian


Friday, February 16, 2018

Understanding the super() function in Python

In this blog post, I will explain how the super() function in Python works. The super() function is a built-in Python function and can be used within a class to gain access to inherited methods from a parent class that have been overwritten.

So let's look at an example. Assume we want to build a dictionary class which has all properties of dict, but additionally allows us to write to a logger. This can be done by defining a class which inherits from the dict class and overwrites the relevant methods with new ones which do the same as the old, but additionally make the logging call. The new dict class would look like this
import logging

class LoggingDict(dict):
    def __setitem__(self, key, value):
        logging.info("Setting key %s to %s" % (key, value))
        super(LoggingDict, self).__setitem__(key, value)
    def __getitem__(self, key):
        logging.info("Access key %s" % key)
        return super(LoggingDict, self).__getitem__(key)
Here we overwrite the __getitem__ and __setitem__ methods with new ones, but we just want to add a logging functionality and keep the functionality of the original functions. Note that we do not need super() to do this since we could get the same result with
class LoggingDict(dict): 
    def __setitem__(self, key, value): 
        logging.info("Setting key %s to %s" % (key, value))
        dict.__setitem__(self, key, value)
    def __getitem__(self, key): 
        logging.info("Access key %s" % key)
        return dict.__getitem__(self, key)
The advantage of super() is that, should you decide that you want to inherit from a different class, you would only need to change the first line of the class definition, while the explicit use of the parent class requires you to go through the entire class and change the parent class name everywhere, which can become quite cumbersome for large classes. So super() makes your code more maintainable.

However, the functionality of super() gets more complicated if you inherit from multiple classes and if the function you refer to is present in more than one of these parent classes. Since the parent class is not explicitly declared, which parent class is addressed by super()?

The super() function considers an order of the inherited classes and goes through that ordered list until it finds the first match. The ordered list is known as the Method Resolution Order (MRO). You can see the MRO of any class as
>>> dict.__mro__
(<type 'dict'>, <type 'object'>)
The use of the MRO in the super() function can lead to very different results in the case of multiple inheritances, compared to the explicit declaration of the parent class. Let's go through another example where we define a Bird class which represents the parent class for the Parrot class and the Hummingbird class:
class Bird(object): 
    def __init__(self): 
        print("Bird init") 

class Parrot(Bird):
    def __init__(self):
        print("Parrot init")
        Bird.__init__(self) 

class Hummingbird(Bird):
    def __init__(self): 
        print("Hummingbird init")
        super(Hummingbird, self).__init__()
Here we used the explicit declaration of the parent class in the Parrot class, while in the Hummingbird class we use super(). From this, I will now construct an example where the Parrot and Hummingbird classes will behave differently because of the super() function.

Let's create a FlyingBird class which handles all properties of flying birds. Non-flying birds like ostriches would not inherit from this class:
class FlyingBird(Bird):
    def __init__(self):
        print("FlyingBird init")
        super(FlyingBird, self).__init__()
Now we produce child classes of Parrot and Hummingbird, which specify specific types of these animals. Remember, Hummingbird uses super, Parrot does not:
class Cockatoo(Parrot, FlyingBird):
    def __init__(self):
        print("Cockatoo init")
        super(Cockatoo, self).__init__()

class BeeHummingbird(Hummingbird, FlyingBird):
    def __init__(self):
        print("BeeHummingbird init")
        super(BeeHummingbird, self).__init__()
If we now create an instance of Cockatoo we will find that it does not call the __init__ function of the FlyingBird class
>>> Cockatoo() 
Cockatoo init 
Parrot init 
Bird init 
while creating a BeeHummingbird instance does
>>> BeeHummingbird()
BeeHummingbird init 
Hummingbird init 
FlyingBird init 
Bird init 
To understand the order of calls you might want to look at the MRO
>>> print(BeeHummingbird.__mro__)
(<class 'BeeHummingbird'>, <class 'Hummingbird'>, <class 'FlyingBird'>,
<class 'Bird'>, <type 'object'>)
This is an example where not using super() causes a bug in the class initialization, since all our Cockatoo instances will miss the initialization provided by the FlyingBird class. It clearly demonstrates that the use of super() goes beyond just avoiding explicit declarations of a parent class within another class.
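To make this concrete (a sketch, not part of the original example), changing Parrot to use super() as well restores the cooperative behaviour, because the call is then dispatched along Cockatoo's MRO instead of jumping straight to Bird
class Parrot(Bird):
    def __init__(self):
        print("Parrot init")
        super(Parrot, self).__init__()  # follow the MRO instead of calling Bird directly

# with this change:
# >>> Cockatoo()
# Cockatoo init
# Parrot init
# FlyingBird init
# Bird init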

Just as a side note before we finish, the syntax for the super() function has changed between Python 2 and Python 3. While the Python 2 version requires an explicit declaration of the arguments, as used in this post, Python 3 now does all this implicitly, which changes the syntax from (Python 2)
super(class, self).method(args)
to
super().method(args)
I hope that was useful. Let me know if you have any comments/questions. Note that there are very useful discussions of this topic on stack-overflow and in this blog post.
cheers
Florian

Wednesday, February 7, 2018

Python best coding practices: Seven tips for better code

The Python software foundation makes detailed recommendations about naming conventions and styles in the PEP 8 style guide. Here is a short summary of that fairly long document, picking seven points I found useful.

The suggested coding and naming conventions sometimes make the code more robust, but often they are just for the purpose of making the code more readable. Keep in mind that readability is usually the most critical aspect of a business-level codebase, since any improvement first requires understanding the existing code.

1. A good Python coding practice is to use the floor division operator // wherever possible. For example, I used to index arrays and lists like this
nums[int(N/2)]
but this can be done much faster with the floor operator
nums[N//2]
You can test that with
import time
time0 = time.time()
for i in range(0, 10000):
    5//2
print(time.time()-time0)
which in my case gave $0.0006051$, while my old approach
import time
time0 = time.time()
for i in range(0, 10000):
    int(5/2)
print(time.time()-time0)
takes $0.002234$. The reason why the floor operator is faster here is that the constant expression 5//2 is pre-calculated (constant-folded) when the code is compiled, while int(5/2) still requires a function call at run time. Read more about it here.
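For micro-benchmarks like this, the standard-library timeit module is usually a more reliable choice than timing a hand-written loop; a quick sketch:
import timeit

# time both variants with the same number of repetitions
print(timeit.timeit("5//2", number=10000000))
print(timeit.timeit("int(5/2)", number=10000000))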

2. Next, how can we test whether a variable x is None or not? The following four cases could be used
# bad
if x:

# better
if x != None:

# better
if not x is None:

# best
if x is not None:
The recommended case is the last version. Case 1 is dangerous: while None is mapped to false in a boolean context, many other values also convert to false (like 0 or empty strings). Therefore, if it is really None which you want to test for, you should write it explicitly. So while avoiding case 1 makes your code more robust, avoiding cases 2 and 3 is just following coding conventions in Python.

The difference between the unequal (!=) and 'is not' operators is subtle, but important. The != operator tests whether the variable has a value different from None, while the 'is not' operator tests whether the two variables point to different objects. In the case above it makes no difference, but the best practice in this case is still to use the 'is not' operator.

Following from the example above there are also recommendations for how to test boolean values. If x is True or False, test for it like this
if x:
but not 
if x == True: 
or
if x is True:

3. Another case which makes your code more readable is the use of startswith() and endswith() instead of string slicing. For example use
if foo.startswith('bar'):
instead of
if foo[:3] == 'bar':
Both versions give the same result, and both versions should be robust, but the first version is deemed more readable.

4. To check for types use the isinstance() function
if isinstance(x, int):
rather than the 'is' or == operator
if type(x) == int:

if type(x) is int:
The type() check can easily give you the wrong answer. Take the following example
class MyDict(dict): 
  pass
x = MyDict()
print("type = ", type(x))
print(type(x) == dict)
print(isinstance(x, dict))
which gives the output
('type = ', <class '__main__.MyDict'>)
False
True
Even though the MyDict class behaves just like a dictionary, since it inherited all its properties, the type() comparison does not see it that way, while isinstance() correctly recognizes that MyDict is a subclass of dict and therefore has all the functionality of a dict.

5. When using try-except statements, always catch exceptions by name
try:
    x = a/b
except ZeroDivisionError:
    # do something
You can use a bare try-except statement to catch all possible exceptions, for example to ensure that the user gets clean feedback, but general except clauses are usually a sign of sloppy code, since you have not thought about which exceptions can actually occur. If you do use a general except statement, catch the exception object and write it to a log file.
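If you do need such a catch-all, a minimal sketch of catching the exception object and logging it could look like this
import logging

logging.basicConfig(filename='app.log', level=logging.ERROR)

try:
    result = do_something()  # hypothetical function that may fail
except Exception as error:
    # log the full traceback instead of silently swallowing the error
    logging.exception("Unexpected error: %s", error)
    raise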

6. To break up long lines, the preferred method is to use parentheses and to line up the continuation lines
if first_condition and second_condition and third_condition and fourth_condition:
should be
if (first_condition and second_condition and
    third_condition and fourth_condition):

7. Finally, you should follow naming conventions:
  • Class and Exception names should use capitalized words without underscore like GroupName() or ExceptionName
  • Function, module, variable and method names should be lowercase, with words separated by underscores like process_text() and global_var_name
  • Constants are written in all capital letters with underscores separating words like MAX_OVERFLOW and TOTAL
A good way to enforce PEP style standards in your project is to use a tool like pylint. Pylint can easily be hooked into a git project, preventing commits which do not follow these standards. In my experience, this is a very good idea when working in a big team with very different experience levels.

I hope this summary was useful. Let me know if you have any questions/comments below.
cheers
Florian