Friday, May 12, 2023

The difference between a partial and total derivative

Imagine that the price of a can of baked beans is a function of two variables: the cost of the beans and the cost of hiring workers for canning and distribution. Naively, as the cost of beans increases, the final price of the can will increase by the same amount. This is the partial derivative: we look at the change in price as a function of just one variable (the price of beans) while keeping everything else fixed.

But is that the entire story? If the price of beans increases, the workers who are supposed to can the product and distribute it to the supermarkets might need to pay more for their breakfast baked beans as well. Consequently, these workers might ask for higher salaries. This means the cost of workers might also depend on the price of beans.

Mathematically, for a function $f(x,y)$ the partial derivative with respect to $x$, written $\frac{\partial f}{\partial x}$, is computed by holding $y$ constant. 

We find the total derivative $\frac{d f}{d x}$ by adding together the direct effect of $x$ and the indirect effect through $y$. Thus $\frac{d f}{d x} = \frac{\partial f}{\partial x} + \frac{\partial f}{\partial y}\frac{d y}{d x}$.
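As a toy illustration with made-up numbers: suppose the can price is $f(x,y) = x + y$, where $x$ is the cost of beans and $y$ the cost of labour, and suppose wages react to bean prices as $y(x) = y_0 + 0.1\,x$. Then
\[
\begin{align}
\frac{\partial f}{\partial x} &= 1,\\
\frac{d f}{d x} &= \frac{\partial f}{\partial x} + \frac{\partial f}{\partial y}\frac{d y}{d x} = 1 + 1\cdot 0.1 = 1.1,
\end{align}
\]so a one-cent increase in the price of beans raises the can price by 1.1 cents once the wage response is included.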

Hope this was useful. If you have any questions or comments, please leave them below.
cheers
Florian

Monday, July 22, 2019

What is a Boltzmann brain?

The second law of thermodynamics states that the entropy, or disorder, of a closed system always increases over time. A hot cup of tea, for example, represents an ordered state, since the atoms in the teacup have significantly more energy than their surroundings. The second law tells us that the tea will cool down, meaning the energy will be dispersed over the surroundings, making the entire system less ordered. 

After this law was established in the late 19th century, people started to apply it to the Universe as a whole and to ask: why is the Universe in such an ordered state? The Universe mainly consists of stars, planets, and galaxies, separated by huge expanses of empty space, so it represents a very ordered state. Given that the Universe was considered to be eternal, the second law of thermodynamics would suggest that we should live in a very disordered Universe, which would not allow life to form.

There are two solutions to this problem. One is that the Universe is not eternal and started with very low entropy. Why the Universe started in a low entropy state is unclear and remains a key question in cosmology. The second solution makes use of the fact that the laws of thermodynamics are statistical laws: while the entropy should always increase over time, there is a small (but non-zero) probability that it decreases. One could use this to argue that the Universe just happened to randomly fluctuate into a low entropy state. For example, atoms could randomly assemble in a tiny region of space, which would give rise to the Big Bang and create the Universe we observe today. Even though such a fluctuation is extremely unlikely, one can invoke the anthropic principle, which states that we can only live in those parts of the eternal Universe which are subject to such an unlikely fluctuation. 

Given that the probabilities for such thermal fluctuations decrease exponentially with the number of atoms involved, it is far more likely to have small fluctuations in small regions of the Universe compared to creating our observed Universe. 
The Boltzmann brain is a thought experiment in which such a random fluctuation would form a fully functional brain without any life-supporting environment. It is vastly more likely that a thermal fluctuation creates a Boltzmann brain rather than an entire Universe which can harbor life, meaning that if life really is formed by thermal fluctuations, most of the life in the Universe consists of Boltzmann brains.

Boltzmann brains are linked to the measure problem in cosmology, where one often has to deal with infinities in time and space, which means that even tiny probabilities become relevant given enough time or space. If a scenario predicts Boltzmann brains as the most common form of life, this is usually taken as an indication that the scenario is not viable. However, the question of why we live in our Universe rather than existing as isolated brains floating through space is not really solved yet (here is a recent paper on the subject: https://arxiv.org/pdf/1702.00850.pdf).

Let me know if you have any questions in the comment section below.
best
Florian

Sunday, April 28, 2019

Gelman & Rubin convergence criteria for MCMC

Markov Chain Monte Carlo (MCMC) is a very popular method to sample the likelihood of a large parameter space, and often it is the only computationally feasible way to do so. However, one disadvantage of MCMC is that it is not clear when the chain actually represents the likelihood, i.e. when it has converged. Many convergence criteria have been proposed, but none of them can guarantee convergence. In this blog post, I will explain the Gelman & Rubin criterion, which is one of the most popular indicators of convergence. I will also present a Python implementation of it, which can be used with the MCMC ensemble sampler emcee.

The Gelman & Rubin criterion consists of the following 4 steps:
  1. Run multiple MCMC chains and remove the first half of each chain.
  2. Calculate the mean and variance for each parameter in each chain.
  3. Take the mean value of each parameter from the different chains and determine the variance between the mean values of the different chains (between chain variance).
  4. Compare the between chain variance with the variance of each parameter within one chain. If the between chain and within chain variances are sufficiently close, the chains can be assumed to have converged. 
The basic concept is that the variance of each parameter determined within a chain should be equal to the variance of that parameter between the different chains. The mean and variance of parameter $\theta_j$ within chain $m$ of length $N$ are given by
\[
\begin{align}
\bar{\theta}_{jm} &= \frac{1}{N}\sum^N_{i=1} \theta_{ijm}\\
\sigma^2_m(\theta_j) &= \frac{1}{N-1}\sum^N_{i=1}(\theta_{ijm} - \bar{\theta}_{jm})^2,
\end{align}
\]where $\theta_{ijm}$ is the $i$-th sample of parameter $j$ in chain $m$. To calculate the Gelman & Rubin criterion we have to run multiple chains and calculate $\bar{\bar{\theta}}_j$ (the mean of the chain means) for each parameter $j$, as well as the mean within-chain variance $\bar{\sigma}^2(\theta_j)$ and the variance of the chain means $\sigma^2(\bar{\theta}_j)$:
\[
\begin{align}
\bar{\bar{\theta}}_j &= \frac{1}{N_{\rm chains}}\sum^{N_{\rm chains}}_{m=1}\bar{\theta}_{jm}\\
\bar{\sigma}^2(\theta_j) &= \frac{1}{N_{\rm chains}}\sum^{N_{\rm chains}}_{m=1}\sigma^2_m(\theta_j)\\
\sigma^2(\bar{\theta}_j) &= \frac{1}{N_{\rm chains}-1}\sum^{N_{\rm chains}}_{m=1}(\bar{\theta}_{jm} - \bar{\bar{\theta}}_j)^2.
\end{align}
\]The variance of the chain means, $\sigma^2(\bar{\theta}_j)$, will converge to zero for $N\rightarrow\infty$, but for any finite chain it will have a value which can be estimated as $\frac{1}{N}\bar{\sigma}^2(\theta_j)$. If the chains started with widely dispersed initial conditions, the term
\[
B_j = \sigma^2(\bar{\theta}_j) - \frac{1}{N}\bar{\sigma}^2(\theta_j)
\]will converge to zero from above. The Gelman & Rubin convergence criterion therefore defines the so-called scale reduction parameters as
\[
\hat{R}_j = \sqrt{1 + \frac{B_j}{\bar{\sigma}^2(\theta_j)}},
\]which will converge to $1$ from above. Convergence is reached if these scale reduction parameters are close to 1 for each free parameter of the fit.

How close do the scale reduction parameters have to be to 1? There is no clear answer here, but a good idea is to plot these parameters as a function of the number of steps, which should show monotonic convergence to 1. Another criterion can be motivated by your use case: if you want to use these likelihoods to distinguish between different models, keep in mind that a scale reduction parameter different from 1 represents an error on the likelihood, so you should make sure that this error is small enough not to affect the model comparison.
Given the equations above we can now implement this convergence criterion. Here I will present a Python implementation which can directly be used with the MCMC ensemble sampler emcee. We can run emcee with a few simple lines of code:
import emcee
import numpy as np
import para  # parameter file containing the fit settings (see below)

rstate = np.random.get_state()
pos = set_initial_position()
# func is the log-probability function to be sampled
sampler = emcee.EnsembleSampler(para.nwalkers, para.ndim, func)
for result in sampler.sample(pos, iterations=para.chainstep, rstate0=rstate):
    pos = result[0]      # current position of each walker
    rstate = result[2]   # current state of the random number generator
Here I used a parameter file para, which contains all relevant parameters of the fit. In my case it currently looks like this
# MCMC parameters
Nchains = 8
ithin = 100
minlength = 40000 
ichaincheck = 1000 
nwalkers = 100 
epsilon = 0.04 
where nwalkers defines the number of walkers (emcee parallelizes the MCMC using multiple 'walkers' which talk to each other), ndim gives the number of free parameters, and chainstep defines the number of steps in the MCMC chain.

An important parameter in the above code is pos, which defines the initial values for each parameter (and each walker in emcee). A possible choice for the function to get the initial position is
def set_initial_position():
    pos = []
    for walker in range(0, para.nwalkers):
        # Uniform random scatter in [-0.5, 0.5) for each parameter
        ran_scatter = np.random.random_sample((para.ndim,)) - 0.5
        # Offset of up to +/- one expected standard deviation
        start_sig = 2.*ran_scatter*para.expected_sig
        start_pos = para.max_like_para + start_sig
        pos.append(start_pos)
    return pos
where para.max_like_para represents an array of maximum likelihood values for each parameter and para.expected_sig represents an array of expected standard deviations for each parameter. Of course, these arrays might not be available and hence it could be quite difficult to define the initial positions for the chain. Often a conservative guess is the best one can do.

The code above produces a chain of fixed length, while we want to run the chain until the Gelman & Rubin criterion indicates convergence. Therefore we extend the above code to
# Note that sampler, pos and rstate are now lists with one entry per chain
withinchainvar = np.zeros((para.Nchains, para.ndim), dtype=float)
meanchain = np.zeros((para.Nchains, para.ndim), dtype=float)

chainstep = para.minlength
itercounter = para.minlength
loopcriteria = True
# Run the chains until they have converged
while loopcriteria:
    for jj in range(0, para.Nchains):
        print("process chain %d" % jj)
        for result in sampler[jj].sample(pos[jj],
                                         iterations=chainstep,
                                         rstate0=rstate[jj]):
            pos[jj] = result[0]
            rstate[jj] = result[2]
        # Use the second half of the chain for the convergence test
        chain_length_per_walker = int(sampler[jj].chain.shape[1])
        chainsamples = sampler[jj].chain[:, int(chain_length_per_walker/2):, :]\
                                  .reshape((-1, para.ndim))
        chain_length = len(chainsamples)
        # Variance of each parameter within this chain
        withinchainvar[jj] = np.var(chainsamples, axis=0)
        # Mean of each parameter within this chain
        meanchain[jj] = np.mean(chainsamples, axis=0)

    R = gelman_rubin_convergence(withinchainvar,
                                 meanchain,
                                 chain_length,
                                 para.Nchains)
    loopcriteria = not all(np.absolute(1 - R) < para.epsilon)
    chainstep = para.ichaincheck
    itercounter += chainstep
where the gelman_rubin_convergence() function is defined as
def gelman_rubin_convergence(withinchainvar, meanchain, chain_length, N):
    # meanchain and withinchainvar have shape (N, ndim)
    ndim = meanchain.shape[1]
    # Mean of the chain means and mean within-chain variance for each parameter
    meanall = np.mean(meanchain, axis=0)
    mean_wcv = np.mean(withinchainvar, axis=0)
    # Variance of the chain means
    vom = np.zeros(ndim, dtype=float)
    for jj in range(0, N):
        vom += (meanall - meanchain[jj])**2/(N-1.)
    B = vom - mean_wcv/chain_length
    return np.sqrt(1. + B/mean_wcv)
where the two parameters ichaincheck and epsilon define the number of steps between which the scale reduction parameters are calculated, and the threshold for the scale reduction parameters, respectively.

The biggest danger with this convergence criterion is that one starts all chains with the same or very similar initial conditions, in which case the procedure described above might indicate convergence even for completely unconverged chains. Hence one should generally use multiple convergence criteria. Some additional signs of convergence are
  1. Any sub-segment of the chain gives a similar result to the entire chain.
  2. The acceptance rate is close to the target acceptance rate of the algorithm used, e.g. roughly 20-50% for Metropolis-Hastings, depending on the dimensionality of the problem.
  3. The correlation length of the chain is far smaller than the length of the chain (emcee can report both the acceptance fraction and the autocorrelation time, see the sketch after this list).
  4. Multiple chains initialized from different initial conditions give similar results.
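As a complement to the Gelman & Rubin test, here is a minimal sketch of how points 2 and 3 can be checked with emcee (assuming the sampler list and parameter file from above; the acceptable thresholds are left to the reader, and depending on the emcee version get_autocorr_time() may require a fairly long chain):
# Check acceptance fraction and autocorrelation time for each chain
for jj in range(0, para.Nchains):
    # Mean acceptance fraction across all walkers of this chain
    acc_frac = np.mean(sampler[jj].acceptance_fraction)
    print("chain %d: mean acceptance fraction = %.2f" % (jj, acc_frac))
    try:
        # Autocorrelation time for each parameter (emcee raises an
        # error if the chain is too short for a reliable estimate)
        tau = sampler[jj].get_autocorr_time()
        print("chain %d: autocorrelation times = %s" % (jj, str(tau)))
    except Exception as err:
        print("chain %d: autocorrelation estimate failed: %s" % (jj, str(err)))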
Hope this was useful. If you have any questions or comments, please leave them below.
cheers
Florian

Monday, October 22, 2018

Shrinkage estimators for stable covariance matrices

Obtaining reliable error bars is often the most difficult aspect of a data analysis project. In cosmology, we usually derive a covariance matrix from simulations as
\[
C_{ij} = \frac{1}{N-1}\sum^N_{k=1}(X^{(k)}_i - \overline{X}_i)(X^{(k)}_j - \overline{X}_j),
\]where $X^{(k)}_i$ is the value in bin $i$ of whatever statistic you want to use in the analysis, measured in the $k$-th simulation, and $\overline{X}_i$ is the mean over all $N$ simulations.
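As a minimal sketch of this estimate in numpy (assuming mock_statistics is an array with one measured data vector per simulation; the name is just a placeholder):
import numpy as np

# mock_statistics: shape (N, n), one data vector of n bins per simulation
N, n = mock_statistics.shape
mean_vector = np.mean(mock_statistics, axis=0)
diff = mock_statistics - mean_vector
# Covariance between bins i and j, summed over the N simulations
C = diff.T @ diff / (N - 1.)
# Equivalent shortcut: C = np.cov(mock_statistics, rowvar=False)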

The covariance matrix as estimated above carries noise due to the finite number of simulations. The level of noise depends on the number of bins $n$ in the data vector (which determines the dimension of the matrix) and the number of simulations used $N$. If $N \gg n$ the matrix is usually fine, while for $N \approx n$ the noise can play a significant role, and in the case of $N < n$ the matrix cannot even be inverted. In the last two cases, we can use shrinkage estimators to obtain a covariance matrix which avoids propagating the noise due to the small $N$ into the parameter estimates.

The shrinkage estimator $C'$ for a covariance matrix $C$ is usually given as a linear combination of the matrix itself and a target matrix $T$
\[
C' = \lambda T + (1 - \lambda)C,
\]where $T$ could, for example, be the diagonal matrix $T = \text{diag}(C)\mathcal{I}$ with $\mathcal{I}$ being the identity matrix.
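As a minimal numpy sketch of this combination (C is the matrix from above; the shrinkage parameter lam is a placeholder value here, see below for how to estimate it):
# Diagonal target matrix: keep the variances, drop all off-diagonal terms
T = np.diag(np.diag(C))
lam = 0.5  # placeholder value for the shrinkage parameter
C_shrunk = lam*T + (1. - lam)*C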

The question now is how to obtain $\lambda$. For $N \gg n$ we would like $\lambda \approx 0$, so that we recover the covariance matrix $C$ itself, since in this case it has little noise and is reliable. For $N \approx n$ we would like $\lambda \approx 1$, so that we get something close to $T$ and avoid the noise in $C$.

Using $T$ instead of $C$, of course, has consequences since it will ignore the correlation between bins in the data vector, which can significantly underestimate the parameter uncertainties. We therefore can't obtain the parameter error from the likelihood analysis itself but need to use an alternative. For example, we could perform the same analysis on many mock realizations and obtain the parameter error from the variance amongst the mock realizations.

So how can we obtain $\lambda$ in practice? One common choice is the Ledoit & Wolf estimator, which determines $\lambda$ analytically from the data; its performance can then be checked on held-out data. Using the Python package sklearn we can get the inverse of the covariance matrix (also known as the precision matrix) with
from sklearn.covariance import LedoitWolf

# Split the mock measurements into a training and a test set
N = len(list_of_mock_pk)
X_train = list_of_mock_pk[:int(N*0.75)]
X_test = list_of_mock_pk[int(N*0.75):]

# Fit the Ledoit-Wolf shrinkage estimator and score it on the test set
lw = LedoitWolf()
loglik_lw = lw.fit(X_train).score(X_test)
print("loglik_lw = ", loglik_lw)
print("lw.shrinkage_ = ", lw.shrinkage_)
Cinv = lw.get_precision()
Figure 1 shows an example taken from arXiv:0711.2509, where a shrinkage estimator is used to measure $\sigma_8$ from a galaxy power spectrum. The upper panel shows the distribution of $\sigma_8$ from a set of mock realizations, while the lower panel shows the estimated error on $\sigma_8$. The black solid line shows the true result while the blue dashed line shows the result with the original covariance matrix $C$, which carries a lot of noise since $N\approx n$. Using the shrinkage estimator they get the red dashed line, which now agrees very well with the reference distribution. However, the error estimate in the lower panel is far from the reference case, but so is the one from the original covariance matrix (blue dashed line). So while we still can't trust the error from the analysis, at least the distribution of estimated $\sigma_8$ is now correct and we can estimate the error on $\sigma_8$ from that distribution.
Figure 1: This Figure shows the distribution of $\sigma_8$ measurements in mock realizations using a
covariance matrix based on a finite set of simulations (blue dashed line), and using the shrinkage estimator
(red dashed line). The shrinkage estimator gives the correct distribution of $\sigma_8$ but
provides the wrong error estimate (lower panel). Reference arXiv:0711.2509
So shrinkage estimators usually perform better than $C$ alone and performing a shrinkage estimation is a good approach to get an idea of how reliable the covariance matrix is. If the shrinkage estimate results in a large $\lambda$, one should be cautious about any result based on this covariance matrix.

Let me know if you have any comments/questions below.
best
Florian

Monday, April 30, 2018

Jacobian transformation and uniform priors in Bayesian inference

This blog post describes the Jacobian transformation and how a uniform prior on a parameter can turn into a non-uniform prior for related parameters.

The Jacobian transformation is an algebraic method for determining the probability distribution of a variable $y$ when we know the probability distribution for $x$ and some transfer function which relates $x$ and $y$.

Let $x$ be a variable with probability density function $f(x)$ and cumulative distribution
\[
F(x) = \int f(x)\,dx,
\]and similarly for $y$
\[
F(y) = \int f(y)\,dy.
\]Since the transformation only relabels outcomes, probability must be conserved, $f(y)\,|dy| = f(x)\,|dx|$, and we can therefore determine $f(y)$ as
\[
f(y) = J(x, y)f(x),
\]where $J(x, y)$ is the Jacobian. In the one-dimensional case it is given by
\[
J(x, y) = \left|\frac{\partial x}{\partial y}\right|.
\]Now assume you perform a likelihood analysis and you put a uniform prior on a parameter $\alpha$. The uniform prior on $\alpha$ between $\alpha_{\rm low}$ and  $\alpha_{\rm high}$ means that
\[
f(\alpha) =
\begin{cases}
   1 & \text{ for } \alpha_{\rm low} < \alpha < \alpha_{\rm high},\\
   0 & \text{ otherwise }.
\end{cases}
\]The result of the analysis will be some posterior distribution for $\alpha$. Now assume you want to use that likelihood to infer something about another parameter $\beta = \frac{1}{\alpha}$. The uniform prior which was applied to $\alpha$ corresponds to a non-uniform prior on $\beta$, which we can calculate using the Jacobian transformation above.

First, we calculate the Jacobian
\[
\left|\frac{\partial \alpha}{\partial \beta}\right| = \frac{1}{\beta^2},
\]which together with $f(\alpha)$ results in
\[ f(\beta) = \frac{1}{\beta^2} \quad \text{for } \tfrac{1}{\alpha_{\rm high}} < \beta < \tfrac{1}{\alpha_{\rm low}}. \]
Figure 1: A flat prior on $\alpha$ transfers into a strongly scale-dependent prior on $\beta = 1/\alpha$. While the prior on $\alpha$ gives the same weight to all scales between $1.5$ and $10$, in $\beta$ much more weight is given to large values of $\beta$.

Figure 1 shows the prior for $\alpha$ and $\beta$ using $1.5 < \alpha < 10$. The uniform prior on $\alpha$ thus looks nothing like a uniform prior in $\beta$. Given that Bayesian inference always assumes some prior, one has to take into account how priors change under parameter transformations.
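A minimal numerical check of this (just a sketch; numpy and matplotlib are assumed to be available):
import numpy as np
import matplotlib.pyplot as plt

# Draw samples from the uniform prior 1.5 < alpha < 10 and transform to beta = 1/alpha
alpha = np.random.uniform(1.5, 10., size=100000)
beta = 1./alpha

# Histogram of the transformed samples
plt.hist(beta, bins=100, density=True, label="samples of beta = 1/alpha")

# Prediction from the Jacobian: f(beta) = 1/beta^2, normalized over 1/10 < beta < 1/1.5
beta_grid = np.linspace(1./10., 1./1.5, 200)
plt.plot(beta_grid, (1./beta_grid**2)/(10. - 1.5), label="Jacobian prediction")
plt.xlabel("beta")
plt.legend()
plt.show()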

Please leave comments/questions below.
best
Florian

Saturday, March 31, 2018

About the assumption of a Gaussian likelihood in a data model comparison

This is a blog post about a recent publication led by ChangHoon Hahn, which I co-authored (Hahn et al. 2018). The paper looks at the impact of non-Gaussian likelihoods for cosmological parameter estimation. This blog post attempts to explain the content of this paper for a general audience.

When finding the best model to describe a given dataset, most studies employ the $\chi^2$ statistic, which is a measure of how well the model and the data agree:
\[
\chi^2 = (data - model)^TC^{-1}(data - model),
\]where data and model are vectors containing the $d$ data points of the dataset and the corresponding model predictions, and the covariance matrix $C$ describes the error of the data points. This statistic is directly related to the so-called likelihood, which can be obtained as
\[
p(x|\theta) = \frac{1}{(2\pi)^{d/2}\sqrt{\det(C)}} \exp{\left(-0.5\chi^2\right)}.
\]This likelihood describes the probability that one could have observed the data, given that the underlying truth is described by the model with the parameters $\theta$. So whenever somebody observes some data, the final analysis step is usually to calculate this likelihood and find the parameters $\theta$ which have the largest likelihood.
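As a minimal sketch of this step (the arrays data, model and cov are hypothetical placeholders for a data vector, a model prediction and a covariance matrix):
import numpy as np

def gaussian_loglike(data, model, cov):
    # Chi-squared: (data - model)^T C^-1 (data - model)
    diff = data - model
    chi2 = diff @ np.linalg.solve(cov, diff)
    # Log of the Gaussian likelihood, including the normalization
    d = len(data)
    sign, logdet = np.linalg.slogdet(cov)
    return -0.5*(chi2 + logdet + d*np.log(2.*np.pi))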

However, the likelihood above has a Gaussian shape, and it is not guaranteed that the probability distribution has that form. So the question we are addressing in this publication is, how close is the likelihood to a Gaussian form and if there is a deviation, how does it impact the analysis.

The first question to answer is, why do we assume a Gaussian likelihood in the first place?

The Gaussian distribution has a central role in statistics because it has been shown that when averaging many independent measurements of the same kind, the result often follows a Gaussian distribution. For example, if you throw a coin 100 times, count heads as 1 and tails as 0, and calculate the mean, this mean will probably be somewhere close to 0.5.

Now repeat that experiment many times, meaning you throw the coin 100 times, calculate the mean of those 100 throws, and do this over and over again. The distribution of the means will peak close to 0.5 and will show a Gaussian shape (see Figure 1). This is quite surprising, because the distribution for a single throw (heads or tails with equal probability) looks nothing like a Gaussian. So adding up many decidedly non-Gaussian distributions somehow leads to a Gaussian distribution.
Figure 1: Probability distribution for the mean of many throws of a coin. The number of throws is shown in the top gray
box. With an increasing number of throws, the distribution looks more and more Gaussian (source: Wikipedia).
Figure 1 shows the probability distribution when looking at the mean of many coin flips. The distribution looks more and more Gaussian if the number of coin flips is increased.

In fact, this happens quite often in nature: the distribution of the mean of a large number of events often looks like a Gaussian, even if the distribution of a single event does not. This is known in statistics as the central limit theorem.

However, given the example above you can already see that the assumption of a Gaussian likelihood only works for large numbers. Figure 1 shows clearly that the probability distribution is far from Gaussian if one calculates the mean from only a small number of coin flips.
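This is easy to verify numerically; here is a small sketch of the coin-flip experiment (numpy and matplotlib assumed):
import numpy as np
import matplotlib.pyplot as plt

# For each number of throws, simulate many experiments and histogram the means
for n_throws in [2, 10, 100]:
    throws = np.random.randint(0, 2, size=(50000, n_throws))  # 0 = tails, 1 = heads
    means = throws.mean(axis=1)
    plt.hist(means, bins=50, density=True, alpha=0.5, label="%d throws" % n_throws)

plt.xlabel("mean of throws")
plt.legend()
plt.show()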

Going back to our paper: we looked at the power spectrum analysis of the Baryon Oscillation Spectroscopic Survey (BOSS) published in Beutler et al. (2016) and the group multiplicity function analysis of Sinha et al. (2017).

Let's start with the power spectrum analysis. I discussed this analysis in several blog posts (e.g. here) already so I will skip the details about it. Here we do not care too much about the details of the measurement anyway, we just want to test whether the assumption of a Gaussian likelihood is justified.

In the power spectrum case one calculates the mean of many Fourier modes, and the question is whether we have enough modes to make the likelihood distribution Gaussian. In general, there are many modes on small scales (large wavevector $k$) but only a small number of modes on large scales (small wavevector $k$). For example, at $k=0.01\,h/$Mpc only a few dozen modes contribute to the value of the power spectrum, while at $k=0.2\,h/$Mpc many thousand modes contribute. Whether this matters or not depends on how it impacts the parameters of the model, which is not straightforward.

There is a very simple way to check whether the likelihood is Gaussian. We have produced mock catalogs for the BOSS survey, which mimic the BOSS dataset. Each of these mock catalogs represents a different possible realization of the measurement of BOSS. Therefore we just have to measure the power spectrum of all these mock catalogs and see whether those power spectra scatter around their mean in the shape of a Gaussian.
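A minimal sketch of such a test (mock_pk is a hypothetical array with one measured power spectrum per mock catalog, and the use of scipy's normality test is my own choice for illustration, not necessarily what was done in the paper):
import numpy as np
from scipy import stats

# mock_pk: shape (N_mocks, N_kbins), one power spectrum measurement per mock
# Test, bin by bin, whether the scatter across the mocks is consistent with a Gaussian
for ibin in range(mock_pk.shape[1]):
    stat, pvalue = stats.normaltest(mock_pk[:, ibin])
    print("k-bin %d: skewness = %.2f, p-value for Gaussianity = %.3f"
          % (ibin, stats.skew(mock_pk[:, ibin]), pvalue))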

Whatever the distribution is going to be, we then use that distribution rather than assuming a Gaussian when searching for the model with the maximum likelihood.
Figure 2. This plot shows the likelihood distribution for the different cosmological parameters derived from
the BOSS power spectrum. The blue distribution assumes a Gaussian likelihood, while the orange
distribution assumes the likelihood distribution seen in the mock catalogs. You see there is overall good
agreement except for a small shift in $f\sigma_8$.
Figure 2 shows the likelihood distribution for each cosmological parameter. Note that this is a different likelihood from the one we talked about before: the likelihood above describes the probability of observing the full data vector given that the underlying model is the true model, while the likelihood in Figure 2 describes the probability of observing the data given each individual parameter value.

The blue distribution assumes a Gaussian likelihood, while in orange we use the likelihood from the mocks. The two likelihoods seem to agree well, except for the parameter $f\sigma_8$, where we found a small shift to larger values.

Next, we turn to the group multiplicity function analysis of Sinha et al. (2017). The likelihood distributions are again shown in blue and orange in Figure 3. For this analysis, you can see that the two likelihoods agree less well, especially in terms of how peaked the distributions are. Removing the assumption of a Gaussian likelihood seems to increase the error on almost all derived parameters.
Figure 3: Comparison between the original group multiplicity function analysis of Sinha et al (2017) with the new analysis dropping the assumption of a Gaussian likelihood. The two distributions show significant
deviations for most parameters.
In conclusion, the assumed form of the likelihood can impact cosmological parameter constraints and should be accounted for. With larger datasets the statistics usually improve and the issue discussed here becomes smaller. However, observables like primordial non-Gaussianity live in the regime of few modes, and hence such studies must check for such biases carefully.

I hope this was useful. Let me know if you have any comments/questions in the comment section below.
best
Florian 

Thursday, March 29, 2018

Measuring Neutrinos with the distribution of galaxies in the Universe

In this blog post, I will summarise the paper Baumann et al. (2018), which my collaborators and I just published. The paper reports a measurement of the phase shift in the Baryon Acoustic Oscillation signal due to free-streaming particles (neutrinos) in the early Universe. So what does that mean?

1. What is the Baryon Acoustic Oscillation (BAO) scale? The BAO scale is a special scale in the distribution of galaxies. If you look at the separation of galaxies in the Universe, you will find that at around 150 Megaparsecs (500 000 000 light years) there is a sudden increase in the number of galaxy pairs. This special scale in the distribution of galaxies was imprinted when the Universe was just 300 000 years old at the so-called decoupling era. If you measure the correlation function of a galaxy distribution, meaning you count the number of galaxy pairs as a function of scale, the BAO signal will show up as a peak at 150 Megaparsecs. If you look at the Fourier transform of the correlation function, the power spectrum, it will show up as an oscillation signal. This signal is what we use in this analysis.

2. What is the decoupling era? In the early Universe photons and baryons were coupled together due to the high pressure. 300 000 years after the Big Bang the temperature and pressure dropped enough so that photons and baryons decoupled. From that point forward most of the photons in the Universe just traveled through the Universe without any interactions with anything. For that reason, we can look at these Cosmic Microwave Background (CMB) photons and basically see a snapshot of the Universe as it was 300 000 years after the Big Bang.

3. What is the effective number of relativistic species? This number describes all particles which had a relativistic velocity at the decoupling era. Currently, this number is assumed to be three, given the three known species of neutrinos ($\nu_{e}$, $\nu_{\mu}$ and $\nu_{\tau}$). We know that these particles must have been produced during the Big Bang and since they are very light, they should have had relativistic velocities at decoupling.

Figure 1 shows the impact of the effective number of free-streaming relativistic species ($N_{\rm eff}$) on the BAO signal and you can see that it leads to a shift in the phase.

Figure 1: This figure shows the impact of the effective number of relativistic species ($N_{\rm eff}$)
on the Baryon Acoustic Oscillation (BAO) scale in the matter power spectrum. It is the shift in the
phase of the BAO which we measured for the first time in this analysis (reference: This plot is from
Baumann et al. 2017).
This phase shift is very small, but we now managed to constrain the BAO scale to 1% using the Baryon Oscillation Spectroscopic Survey (BOSS), which makes us sensitive to such small effects.

So why are we interested in this effect?

Well, it could be that there are additional particles which propagated in the early Universe, but which we have not seen yet in particle accelerators. In this sense, we are using the Big Bang as the ultimate high energy collider and test whether the remains of the Big Bang show any evidence for new particles.

Besides that, the neutrinos generated by the Big Bang are also a very interesting probe of the early Universe. As mentioned before we can use the Cosmic Microwave Background (CMB) to get a snapshot of the Universe when it was only 300 000 years old. But we cannot actually look any closer to the Big Bang. If we could see the distribution of neutrinos produced by the Big Bang, we would actually see the Universe when it was only 1 second old, because neutrinos stopped interacting with other components of the Universe much earlier than photons, 1 second after the Big Bang.

So if we want to investigate the Universe at its very early age, the neutrino background produced by the Big Bang would be one way to do that. Another would be the gravitational wave background produced by the Big Bang. However, right now we do not have experiments sensitive enough to detect the neutrino or gravitational wave background. The detection of the phase shift in the BAO is an indirect detection of the neutrino background.

The shift in the BAO signal is not quite scale independent, so the first step is to create a template for the shift, which we then include in the standard BAO fitting pipeline I developed for the BOSS analysis back in 2017 (Beutler et al. 2017). The template for the phase shift is shown in Figure 2.
Figure 2: Template describing the shift in the BAO phase as a function of wavenumber. We
include this template in our analysis, with a free parameter for its amplitude. 
The figure shows that the shift caused by neutrinos $f(k)$ is larger at larger wavenumber $k$.

The amplitude of this template is a free parameter in the fit, which we call $\beta$. We then use the BOSS dataset to constrain this parameter.
Figure 3: Constraints on the amplitude of the phase shift ($\beta$) obtained from the BOSS dataset
in two redshift bins ($z_1$ and $z_3$).
Figure 3 shows the constraints on the amplitude of the template ($\beta$) obtained from the BOSS dataset. In BOSS we have two redshift bins, $z_1$ ($0.2 < z < 0.5$) and $z_3$ ($0.5 < z < 0.75$). We use both redshift bins for our constraint, which yields $\beta = 1.2\pm 1.8$. This constraint can then be converted into a constraint on the effective number of relativistic species. $\beta = 0$ would correspond to $N_{\rm eff} = 0$, while $\beta = 2.45$ would correspond to $N_{\rm eff} = \infty$.

You can see that the BOSS data alone can't really constrain $N_{\rm eff}$. However, Figure 3 also shows that there is a strong degeneracy between the parameter $\beta$ and the BAO scale parameter $\alpha$. If we fix the cosmological model, we can get strong priors on the $\alpha$ parameters from the Cosmic Microwave Background. The constraints including these priors are also included in Figure 3 (red and orange contours on the left and red solid line on the right), resulting in $\beta = 2.05\pm0.81$. This excludes $\beta=0$ (corresponding to $N_{\rm eff} = 0$) at more than $2\sigma$. Note that the expected value of $\beta$ in a standard cosmological model with $N_{\rm eff} = 3.046$ is $\beta = 1$, while our data seem to prefer larger values. However, $\beta = 1$ is well within our likelihood distribution (red line in Figure 3, right) and hence we do not see any significant deviation from the standard model. 

A similar measurement of the phase shift in the photon distribution (CMB) using the Planck data has been reported in Follin et al. 2015. They measured $N_{\rm eff} = 2.3^{+1.1}_{-0.4}$, which is consistent with our measurement.

In summary, we have, for the first time, measured the phase shift in the BAO signal in the galaxy power spectrum, representing an (indirect) detection of the cosmic neutrino background produced by the Big Bang. Our measurement confirms the current standard model with three neutrino species. While the detection significance of our measurement is fairly low, future galaxy surveys like DESI and Euclid will improve these measurements significantly.

Our paper about this measurement was published in Nature Physics (see https://www.nature.com/articles/s41567-019-0435-6).

Let me know if you have any comments or questions below.
best
Florian