Extending the Fully Bayesian Unfolding with Regularization Using a Combined Sampling Method

Regularization extensions to Fully Bayesian Unfolding are implemented and studied together with a combined-sampling algorithm that finds, in a reasonable computational time, an optimal value of the regularization strength parameter, so that the unfolded result has a desired property such as smoothness. Three regularization conditions, based on curvature, entropy, and derivatives, are applied as a model example to several simulated spectra of top quark pairs produced in high-energy pp collisions. The existence of a minimum of the χ² between the unfolded and particle-level spectra is discussed, with recommendations on the checks and validity of using the regularization feature in Fully Bayesian Unfolding (FBU).


Introduction
This study is motivated by unstable unfolding results in the case of spectra with a large number of bins or with a migration matrix that is only weakly diagonal. In order to obtain the expected smooth results, a bias is carefully introduced via a regularization function. We focus on the Fully Bayesian Unfolding (FBU) method [1].
Unfolding methods other than FBU, including methods with regularization features, have already been introduced. Because FBU provides the full probability posterior of the result, and thus stands out from other methods, it is interesting to equip FBU with a regularization option as well.
This study uses several over-binned spectra, so that unfolding without regularization results in fluctuating, high-variance unfolded spectra. These were constructed with fine binning, so that the migration matrix is under-represented on the diagonal. Five typical spectra related to the production of top quark pairs in pp collisions were chosen, namely the pseudorapidity η_t,had and transverse momentum p_T^t,had of the hadronically decaying top quark, and the invariant mass m_tt, transverse momentum p_T^tt, and pseudorapidity η_tt of the top quark pair.

Unfolding
The unfolding procedure corrects measured spectra for the resolution, efficiency, and acceptance effects of the detector, usually to the particle level. The Fully Bayesian Unfolding method is based on the Bayes theorem,

P(A|B) = P(B|A) · P(A) / P(B), (1)

where the conditional probability P(A|B) reads as the probability of A, given B. Fully Bayesian Unfolding provides the full probability posterior P(A|B) of the result, which gives extra information when compared to other unfolding methods [2]. Rewriting Equation (1), the probability of the truth spectrum T, given the data D, is

P(T|D) = P(D|T) · π(T) / Norm., (2)

where Norm. is a normalization factor, P(D|T) is the likelihood function, and π(T) is a prior probability density for T. In the case of π(T) = 1, the algorithm is non-regularized. By inserting an arbitrary regularization function instead of π(T), one can intentionally bias the final result towards a suitable property, e.g., smoothness, and the unfolding is then called regularized. A convenient choice is an exponential function with a regularization strength parameter τ and an inner function S(T),

P(T|D) = P(D|T) · e^(−τS(T)) / Norm. (3)

If τ = 0, the prior is π(T) = 1 and no regularization is applied; on the other hand, the higher the value of τ, the more dominant the regularization term.
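As an illustration of the regularized posterior, the following minimal numerical sketch evaluates log P(T|D) up to a constant; this is not the authors' implementation, and the migration matrix, spectra, and the curvature penalty used as S(T) are invented for the example.

```python
import numpy as np

def log_posterior(T, D, M, tau, S):
    """Unnormalized log P(T|D) = log P(D|T) - tau * S(T).

    T   : trial truth spectrum (1-D array)
    D   : observed detector-level counts (1-D array)
    M   : migration matrix, M[i, j] = P(reco bin j | truth bin i)
    tau : regularization strength; tau = 0 reproduces the flat prior pi(T) = 1
    S   : regularization (inner) function S(T)
    """
    r = T @ M                                # folded detector-level prediction
    log_like = np.sum(D * np.log(r) - r)     # Poisson terms, up to a T-independent constant
    return log_like - tau * S(T)

# Curvature used as an example inner function S(T)
curvature = lambda T: np.sum(np.diff(T, 2) ** 2)

M = np.array([[0.8, 0.2, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8]])
D = np.array([90.0, 120.0, 80.0])
T = np.array([100.0, 120.0, 75.0])
print(log_posterior(T, D, M, tau=1e-3, S=curvature))
```

With tau = 0 the value reduces to the pure log-likelihood; increasing tau penalizes curved trial spectra, which is exactly the intentional bias described above.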

Simulated Spectra
The input pseudo-data spectrum D_sim is simulated for the process pp → tt at √s = 14 TeV using the MadGraph5 software [3], the Pythia8 generator [4], and the Delphes detector simulation [5] with an ATLAS card, applying basic kinematic cuts in order to obtain a realistic sample of top quark pairs decaying semi-leptonically, tt → bWbW → ℓ + jets; more details are described in [6].

Defining Equations of Unfolding
The necessary ingredients for unfolding are the data spectrum D and the migration matrix M, which is normalized so that it gives the probability of an event in the truth (particle) level bin i being reconstructed (measured) in the detector-level bin j.
The schematic formula describing unfolding in the case of a simple migration matrix inversion is

T̂_i = (1/ε_i) Σ_j (M⁻¹)_ji · η_j · (D_j − B_j). (4)

In this paper, the background spectrum B is not taken into account. The efficiency correction ε and the acceptance correction η are derived from projections of the (un-normalized, event-count) migration matrix divided by the simulated particle-level spectrum T or the data spectrum D, respectively, as

ε_i = (Σ_j M_ij) / T_i ,   η_j = (Σ_i M_ij) / D_j .

Because our data are simulated, we set D = D_sim and Equation (4) becomes

T̂_i = (1/ε_i) Σ_j (M⁻¹)_ji · η_j · (D_sim)_j .

The production of spectra was divided into two statistically independent sets A and B, where set A is used for the unfolding components and set B is taken as the pseudo-data input D_simB to unfolding:

T̂_B,i = (1/ε_i^A) Σ_j (M_A⁻¹)_ji · η_j^A · (D_simB)_j .

The result of the unfolding, T̂_B, is compared to the corresponding particle-level spectrum T_B (green line in Figure 1a). The χ² between the unfolded and particle spectrum, divided by the number of degrees of freedom (the number of bins), χ²/ndf, is calculated in the standard way as

χ²/ndf = (1/N) Σ_i (T̂_i − T_i)² / σ_{T̂_i}² ,

where σ_{T̂_i} is the uncertainty of the unfolded content of bin i. The input components needed for the unfolding process are visualized for the case of the η_tt spectrum in Figure 1. In essence, the FBU technique uses the response matrix to fold a probe truth-level spectrum into a detector-level spectrum and computes a likelihood between this folded spectrum and the actual data. This elegantly avoids the issues of singularities in explicit matrix inversion, or of their regularization, as in, e.g., the SVD technique [7]. The tricky part of FBU is how to effectively sample the truth space in the region giving large likelihood values.
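The matrix-inversion formula can be checked in a toy closure test. The un-normalized migration matrix N[i, j] below (events generated in truth bin i and reconstructed in reco bin j) and the truth spectrum are invented for the example; the efficiency and acceptance projections follow the definitions in the text.

```python
import numpy as np

# Invented toy truth spectrum and un-normalized migration matrix N[i, j]
T_true = np.array([1000.0, 800.0, 600.0])
N = np.array([[700.0, 100.0,   0.0],
              [ 80.0, 560.0,  80.0],
              [  0.0,  90.0, 420.0]])

eff = N.sum(axis=1) / T_true               # efficiency: reconstructed fraction per truth bin
M = N / N.sum(axis=1, keepdims=True)       # row-normalized migration probabilities

D = N.sum(axis=0)                          # detector-level pseudo-data (no background)
acc = N.sum(axis=0) / D                    # acceptance per reco bin (trivially 1 here)

# Matrix-inversion unfolding: undo migrations, then correct for acceptance and efficiency
T_hat = (acc * D) @ np.linalg.inv(M) / eff
print(T_hat)                               # recovers T_true in this closure test
```

In a closure test like this, where the pseudo-data are built from the same matrix, the inversion recovers the truth exactly; with independent pseudo-data and a weakly diagonal matrix, the same formula produces the fluctuating results that motivate regularization.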

Likelihood Function P(D|T)
The likelihood function is given by the product of Poisson distributions of the counts,

P(D|T) = Π_j r_j^(d_j) e^(−r_j) / d_j! ,

where r_j(T) = Σ_i T_i M_ij is the folded detector-level prediction and d_j = η_j D_j are the acceptance-corrected data counts. After unfolding, the probability P(T|D) is divided by the efficiency correction ε. The acceptance correction η shown in Figure 1b takes values in [0, 1]; thus, d_j is in general non-integer and the factorial is computed using the Gamma function as x! = Γ(x + 1).
In the simplified case of no background (B = 0) and pseudo-data D = D_sim, and introducing the natural logarithm, since the sampling of the functions is then numerically more stable, the regularized posterior becomes

log P(T|D_sim) = Σ_j [ (D_sim)_j log r_j(T) − r_j(T) − log Γ((D_sim)_j + 1) ] − τ S(T) + const,

where r_j(T) is the folded detector-level prediction, which can be rewritten as

log P(T|D_sim) = log L(T) − τ S(T) + const.

This expression has the advantage of allowing one to find the maximum of log P(T|D_sim) by sampling the log-likelihood function log L(T) and the regularization function S(T) separately and then tuning the regularization parameter τ, so that the optimal τ can be found in a much shorter time than by sampling the full likelihood for every possible value of τ.
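The per-bin Poisson log-term with the factorial evaluated via the Gamma function can be sketched as follows; the function names are illustrative, and the acceptance scaling of the data is included as described above.

```python
import numpy as np
from math import lgamma

def log_poisson(d, r):
    """log( r^d e^{-r} / d! ) with d! = Gamma(d + 1), valid for non-integer d."""
    return d * np.log(r) - r - lgamma(d + 1.0)

def log_L(T, D, M, acc):
    """Sum of Poisson log-terms over detector-level bins.

    r_j(T) = sum_i T_i M_ij is the folded prediction; the data counts are
    scaled by the acceptance correction acc, which can make them non-integer.
    """
    r = T @ M
    d = acc * D
    return float(sum(log_poisson(dj, rj) for dj, rj in zip(d, r)))
```

Using the log-Gamma function keeps the evaluation numerically stable even for large counts, where the factorial itself would overflow.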
In this study, both approaches were used: one where sampling is applied to the function P(T|D_sim) as a whole, denoted as full sampling, and one where the two components log L(T) and S(T) are sampled separately, denoted as combined (fast) sampling. These methods are described in detail in Section 2.6. Figure 2a shows an example of the sampling of the log-likelihood function log L(T) for the η_tt spectrum and its projection onto the 6th and 9th bins, log L(T_6, T_9). Figure 2b is created using the highest points of log L(T_6, T_9) and thus represents an envelope of log L(T_6, T_9), while Figure 2c represents the exponential of the envelope of the original L(T_6, T_9), with the posteriors plotted along the x and y axes (red lines). These are the marginalized binned probability densities, analytically given as

P(T_i|D) = ∫ P(T|D) Π_{k≠i} dT_k .

In this study, the mean and standard deviation of the posteriors are taken as the unfolded result and its statistical uncertainty in each bin, in order to construct the unfolded spectrum. Another approach is to fit the posteriors and use the parameters from the fit, see Figure 3, but this method is not used in this paper, since the regularization can change the shape of the posterior to a non-Gaussian one. The advantage of FBU in comparison to other unfolding methods is that it provides the whole probability distribution. By marginalizing this probability to one-dimensional posteriors, the user gains the probability distribution of the unfolded result for each bin and can choose a custom statistical property to define the unfolding result and its uncertainty in that particular bin, while other methods usually provide only a single number with some uncertainty. Thus, FBU gives the user more control over the interpretation of the unfolded output.
Furthermore, marginalizing the whole probability into two dimensions enables one to study cross-bin correlations, which in other methods are accessible only via dedicated pseudo-experiments. Another advantage is the possibility of naturally including systematic uncertainty variations by introducing prior distributions for the corresponding nuisance parameters directly into the likelihood function and performing a simultaneous fit, a procedure known as profiling. Figure 4 illustrates the unfolding process, proceeding from the (blue) pseudo-data line to the (red) unfolded line in Figure 4a, which fluctuates due to the sparsely populated diagonal of the migration matrix M (Figure 1c). One can improve the variance of the result by introducing a regularization term aiming to diminish the curvature of the unfolding result (Figure 4b), with the optimal strength parameter τ_opt = 2089 derived using the full sampling algorithm discussed in Section 2.3 (Figure 5), so that the resulting χ²/ndf between the particle and unfolded spectra is minimal.

Motivation for the Regularization
On the other hand, if one sets the strength of the regularization to a very large number, e.g., τ ≈ 10⁵, the corresponding χ²/ndf is high and the unfolded result departs from the particle-level spectrum, see Figure 4c.
This implies that χ²(τ)/ndf may have a minimum as a function of τ (see Figure 5), and the aim of this study is to provide a way to find such a minimum in a reasonable computational time. In the next section, we describe the three regularization functions used in this study.

Figure 5: Relative χ²(τ)/ndf as a function of the regularization strength parameter τ and its minimum at τ_opt = 2089. The parameter τ is normalized such that τ = τ_rel, see Section 3. The vertical line represents the minimum of χ²/ndf.

Regularization Functions S(T)
While the regularization function can be an arbitrary function, we motivate several choices [1]. The aim of the negative entropy, curvature, and derivative regularizations is to smoothen the unfolded spectra.

• Entropy regularization:

S(T) = Σ_{t=1}^{N} p_t log p_t ,   p_t = T_t / Σ_{s=1}^{N} T_s .

• Curvature regularization:

S(T) = Σ_{t=2}^{N−1} (Δ_t)² ,   where Δ_t = T_{t+1} − 2T_t + T_{t−1} .

• First-derivative regularization:

S(T) = Σ_{t=1}^{N−1} [ (T_{t+1}/W_{t+1} − T_t/W_t) / (C_{t+1} − C_t) ]² ,

where N is the number of bins, W_t is the width of the t-th bin, and C_t is the center of the t-th bin.
Each of the regularization functions has a different effect on the full likelihood, see Figure 6. While sampling the likelihood according to the gradient of log L_reg(T) = log L(T) − S(T), i.e., leaving τ = 1 for the moment, the values of S(T) are stored.
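The three inner functions can be sketched as follows. The exact normalizations used in the paper may differ; the forms below are standard choices (negative entropy of the normalized bin contents, squared second differences, and squared slopes of the bin densities).

```python
import numpy as np

def entropy(T):
    """Negative entropy of the normalized bin contents; minimal for a flat spectrum."""
    p = T / T.sum()
    return float(np.sum(p * np.log(p)))

def curvature(T):
    """Sum of squared discrete second differences over the inner bins."""
    return float(np.sum((T[2:] - 2.0 * T[1:-1] + T[:-2]) ** 2))

def first_derivative(T, W, C):
    """Sum of squared slopes between neighbouring bin densities T_t / W_t,
    evaluated across the bin centers C_t."""
    dens = T / W
    return float(np.sum((np.diff(dens) / np.diff(C)) ** 2))
```

All three are minimized by smooth (or flat) spectra, so the prior e^(−τS(T)) favors smoothness, with τ setting how strongly.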

Sampling the Likelihood Function
Effective sampling plays a crucial role in finding the extremes of the likelihood function. In this paper, a private implementation of the Markov chain Hamiltonian Monte Carlo method [8] is used, where each point of a function is derived from the previous point, thus forming a chain. The problem of finding the maximum of the likelihood function is transformed into generating a sample via the study of the motion of a virtual particle in the potential represented by the likelihood function in an N-dimensional space. The Hamiltonian of a particle with momentum p in a potential V(x), depending on the position vector x,

H(x, p) = ½ |p|² + V(x),

can be reformulated for the studied problem, where the position vector x is substituted by the vector of particle-level truth values T and the potential V(x) by the negative log-likelihood function − log L(T), so that

H(T, p) = ½ |p|² − log L(T).

The equations of motion are integrated numerically in leapfrog steps, where ε is the time interval. Each point in momentum space is updated in a half-step as

p(t + ε/2) = p(t) + (ε/2) ∇_T log L(T(t)).

Similarly, from the second equation of motion, the truth (position) value is updated in time according to

T(t + ε) = T(t) + ε p(t + ε/2),

and the momentum as

p(t + ε) = p(t + ε/2) + (ε/2) ∇_T log L(T(t + ε)).

These equations provide the one full-step update for T and the two half-step updates for p, respectively. The time interval ε is adapted in the first few steps and then stays constant for the rest of the sampling. The initial momentum p is chosen randomly according to a normal distribution, and the initial truth vector T is chosen randomly using a uniform distribution within limits set by the user, or using a default expected range in the i-th bin of [0, 3 · (D_sim)_i η_i / ε_i]. These limits create a hyper-cube in which the sampling is performed. The algorithm details are described in [8].
The introduction of the canonical momentum, shifting the N-dimensional problem to 2N dimensions, is one of the key features of the Hamiltonian MC chain generator. The Hamiltonian equations of motion of classical mechanics are then employed, with the negative log-likelihood playing the role of a potential for a particle moving in the generalized phase space under the conservation of total energy. The invariance of the system with respect to time translations is the motivation for exploring the trajectory of such a particle for different initial conditions, which leads to effectively sampling a larger phase space and more of the classical trajectories, thus increasing the probability of reaching the global minimum of the potential, i.e., the likelihood maximum.
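A compact sketch of a Hamiltonian Monte Carlo chain with the leapfrog updates above; this is a generic implementation, not the authors' private code. The step-size adaptation is omitted and a Metropolis accept/reject step is added to correct for the discretization error of the leapfrog integrator.

```python
import numpy as np

def hmc_chain(neg_log_L, grad_neg_log_L, T0, eps=0.1, n_leap=20, n_samples=500, seed=1):
    """Sample exp(-neg_log_L) with H(T, p) = |p|^2 / 2 + V(T), V = -log L."""
    rng = np.random.default_rng(seed)
    T = np.asarray(T0, dtype=float).copy()
    chain = []
    for _ in range(n_samples):
        p = rng.normal(size=T.size)                    # fresh momentum each trajectory
        Tn, pn = T.copy(), p.copy()
        pn -= 0.5 * eps * grad_neg_log_L(Tn)           # half-step momentum update
        for _ in range(n_leap):
            Tn += eps * pn                             # full-step position update
            pn -= eps * grad_neg_log_L(Tn)             # full-step momentum update
        pn += 0.5 * eps * grad_neg_log_L(Tn)           # turn last full step into a half-step
        # accept/reject on the total-energy change (exactly conserved only for eps -> 0)
        dH = (neg_log_L(Tn) + 0.5 * pn @ pn) - (neg_log_L(T) + 0.5 * p @ p)
        if np.log(rng.uniform()) < -dH:
            T = Tn
        chain.append(T.copy())
    return np.array(chain)
```

For a standard normal target (V(T) = |T|²/2, gradient T), the chain reproduces mean 0 and unit width, which is a convenient sanity check before applying the sampler to a likelihood.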

Regularization
This section provides an overview of unfolding characteristics as a function of the regularization strength parameter τ. Throughout this paper, the original τ_abs parameter is normalized using the number of bins N and the curvature, entropy, or derivatives of the particle spectrum, generically denoted S(T_part), as

τ_rel = τ_abs · S(T_part) / N ,

and in the τ-dependence figures τ ≡ τ_rel is used, so that different spectra and regularization functions acquire similar values of τ, which are then more comparable.

Evolution of Posteriors with τ
Two effects come into play in shaping the posteriors as stronger regularization is applied. The first is the shifting of the peak, which is welcome for obtaining the expected modified unfolded spectrum. The second effect decreases the width of the posteriors, which reduces the uncertainty of the mean in each bin, sometimes down to an artificially infinitesimal width at extremely high τ, and thus undesirably increases the χ² when comparing the unfolded and truth spectra.
Because these are two competing effects, a minimum of χ²(τ)/ndf may or may not appear. Figure 7 shows the posterior in the 15th bin of the η_tt spectrum for three scenarios: first with no regularization applied (τ = 0); then with regularization at the optimal τ = 2089, for which the final χ²/ndf is smallest; and finally with an extremely high τ ≈ 10⁴, for which the posterior becomes almost a Dirac δ-function.

Evolution of Curvature, Entropy, and Derivatives with τ
One of the first checks of whether the regularization works properly is to study the curvature, entropy, and derivatives of the unfolded spectra with respect to the increasing regularization strength τ. The expected result is that with higher τ the curvature decreases and the unfolded spectrum is smoother. The same behavior also holds for the entropy and derivatives. Figure 8 shows the decrease of the curvature, entropy, and derivatives with increasing τ, as expected.

Evolution of Bin Cross-Correlations with τ
Using the likelihood projection onto two different bins, the cross-bin correlations can be derived. Figure 9 shows the overall matrix built from these correlation factors. While in case (a) (τ = 0) the bin correlations are minimal, a progressing trend in case (b) (τ = 912) and finally in case (c) (τ ≈ 10⁴) is evident. Thus, the stronger the regularization, the more the bins of the unfolded spectrum are correlated with each other. In order to quantify the increase of the correlations, the averaged sum of the off-diagonal elements of the bin correlation matrix A is shown for various τ, given by

C̄ = (1/(N(N−1))) Σ_{i≠j} A_ij ,

and also its absolute-value version, to study the sum of correlations and anti-correlations,

C̄_abs = (1/(N(N−1))) Σ_{i≠j} |A_ij| .

These bin-averaged cross-correlations C̄ and C̄_abs are shown in Figure 10, and the increasing trend proves that the stronger the regularization, the more correlated the unfolding result is across bins.
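The bin-averaged quantities can be computed from a correlation matrix, or the matrix itself can be estimated directly from the posterior samples, as in this sketch (function names are illustrative).

```python
import numpy as np

def mean_cross_correlation(A):
    """Averaged off-diagonal elements C_bar of a bin correlation matrix A,
    together with the absolute-value version C_bar_abs."""
    off = ~np.eye(A.shape[0], dtype=bool)          # mask selecting i != j
    return float(A[off].mean()), float(np.abs(A[off]).mean())

def bin_correlation_matrix(samples):
    """Correlation matrix estimated from posterior samples (rows = draws, cols = bins)."""
    return np.corrcoef(samples, rowvar=False)
```

Comparing C_bar with C_bar_abs distinguishes a net correlation from correlations and anti-correlations that cancel on average.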

Evolution of χ²/ndf and Bin Uncertainties with τ
The χ²/ndf is calculated as

χ²/ndf = (1/N) Σ_i (T̂_i − T_i)² / σ_{T̂_i}² ,

where each T̂_i for one particular bin is taken as the mean of the posterior and the uncertainty σ_{T̂_i} as the standard deviation of the posterior. Because the width of the posterior is observed to decrease with rising τ, the χ²/ndf also has to rise. In order to naively separate the effects of spectrum smoothing and posterior narrowing, two variables, χ_num and χ_denom, were studied, defined as

χ_num = (1/N) Σ_i (T̂_i − T_i)²   and   χ_denom = (1/N) Σ_i σ_{T̂_i}² .

If the effects of shifting and narrowing the posterior are independent, then χ²/ndf and the ratio χ_num/χ_denom should be comparable, which can be seen in Figure 13. The problem with finding the optimal τ via the minimal χ²/ndf is the time needed to create a graph such as that in Figure 13, since each point of the graph is obtained by running the full unfolding. In the next section, an algorithm is described that enables one to estimate the minimum of χ²/ndf in a much faster way.
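A sketch of the three quantities, assuming a per-bin averaged normalization (the exact normalization in the paper is an assumption here):

```python
import numpy as np

def chi2_ndf(T_hat, sigma, T_part):
    """chi^2 per bin between unfolded posterior means and the particle-level spectrum."""
    return float(np.mean((T_hat - T_part) ** 2 / sigma ** 2))

def chi_num(T_hat, T_part):
    """Smoothing effect alone: average squared distance of the means from the truth."""
    return float(np.mean((T_hat - T_part) ** 2))

def chi_denom(sigma):
    """Narrowing effect alone: average squared posterior width."""
    return float(np.mean(sigma ** 2))
```

When the posterior widths are similar across bins, chi2_ndf is close to chi_num / chi_denom, which is the comparison made in Figure 13.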

Combined Sampling as a Faster Algorithm
In order to speed up the estimation of the minimum χ²/ndf value, the algorithm of combined sampling is proposed. The main idea is to sample the function log L_0(T) and the regularization function S(T) separately and then vary the τ parameter to obtain the overall likelihood. A problem arises with sampling the regularization function S(T), since this function is not smooth and the sampling algorithm can end up mapping only local minima. To avoid this, an auxiliary function

log L_1(T) = log L_0(T) − S(T),

i.e., with τ = 1, is sampled, so that the global minimum is dominated by L_0(T). The function log L_1(T) serves in the sampling only to provide a correct gradient, while the values of log L_0(T) and S(T) are stored separately, so that after the unfolding these values can be used to compute, in a fast way, the full desired posterior of

log L_comb(T) = log L_0(T) − τ S(T)

for any τ. This faster combined-sampling approach is shown in blue in Figure 14, while the red points stand for the full sampling, where for each point in τ a new sampling had to be performed. Figure 14 shows very good agreement between the combined and full sampling methods in terms of their minima.
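One way to reuse the stored per-sample values is importance reweighting: the chain was drawn from exp(log L_0 − S), i.e., τ = 1, so each stored sample can be reweighted to exp(log L_0 − τS) with weights proportional to exp(−(τ − 1)S). The sketch below illustrates that idea and is not necessarily the paper's exact recombination; note that the stored log L_0 values cancel in the weight ratio.

```python
import numpy as np

def tau_scan(samples, S_values, taus):
    """Posterior bin means for each tau, from a single tau = 1 chain.

    samples  : array of shape (n_draws, n_bins), the stored truth vectors
    S_values : S(T) evaluated and stored for every draw
    taus     : regularization strengths to scan
    """
    means = []
    for tau in taus:
        logw = -(tau - 1.0) * S_values          # weight ratio: target / sampling density
        w = np.exp(logw - logw.max())           # subtract the max for numerical stability
        w /= w.sum()
        means.append(w @ samples)               # weighted mean in each bin
    return np.array(means)
```

Each τ point then costs only a reweighting of stored numbers instead of a full re-sampling, which is what makes the χ²(τ)/ndf scan fast.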
In the case of the derivative regularization in Figure 14c, the full sampling method (red) does not cover the highest values of τ, in the range of approximately [5 × 10³, 10⁵]. The reason is that the regularization function of the derivatives, compared to the curvature and entropy, is very narrow, approaching almost a δ-function for high τ, see Figure 6c; thus, the sampling either takes a practically infinite time to find such a peak in the wide domain of the function, or it is stopped by reaching the maximum number of iterations, and the resulting χ²/ndf is very high, as the global maximum is not found.

Accidental Minima of χ²(τ)/ndf
In the case where the unfolding without regularization is already effective enough, so that no smoothing is actually needed, the narrowing of the posteriors can cause an accidental minimum in the χ²(τ)/ndf function. The example of the m_tt spectrum shows the variable χ_num as a flat function with an increasing tail at high τ values, see Figure 15a. At the same time, χ_denom decreases rapidly, and the ratio of these components, as well as χ²/ndf, exhibits a minimum at τ ≈ 7000 solely due to the narrowing of the posteriors. This accidental minimum does not improve the smoothness of the spectrum, as illustrated in Figure 16. The unfolded spectrum using regularization in Figure 16b only has decreased uncertainties (red line) compared to Figure 16a, while not providing a smoother spectrum, as can be seen in the ratio, with only a minor decrease of χ²/ndf. The regularization is not effective because the tail of the m_tt spectrum has a lower number of events than the first half of the spectrum, so the contribution of the tail to the curvature is small. The solution would be to switch on the regularization only for the second half of the spectrum, as shown in Figure 16c, where χ²(τ)/ndf(tail), computed in the tail region of the spectrum, decreases compared to Figure 16a,b, also affecting the full χ²/ndf via a change of the global maximum of the full likelihood.

Hidden Minima of χ²(τ)/ndf
If the smoothing effect expressed by χ_num in Figure 17a has a minimum, and if the narrowing effect described by χ_denom in Figure 17b starts to decrease with a similar or sharper slope than χ_num (τ ≈ 10²), then the real minimum in χ²/ndf in Figure 17c vanishes and flattens, so the optimal τ for the p_T^t,had spectrum in Figure 18 cannot be identified by minimizing χ²/ndf as a function of τ. Figure 18b shows the improved unfolded spectrum obtained using the minimum of χ_num. However, due to the narrowing of the posteriors, the χ²/ndf increases compared to Figure 18a (no regularization).

Figure 18: Unfolded spectrum using the derivatives in the case of a hidden minimum in χ²(τ)/ndf, for one representative random seed. The spectrum becomes smoother, but χ²/ndf does not improve due to the narrowing of the posteriors.

Real Minima of χ²(τ)/ndf
If the smoothing effect expressed by χ_num in Figure 19a has a global minimum, and if the narrowing effect described by χ_denom in Figure 19b starts to decrease around the same region of τ values as χ_num (τ ≈ 10²), then a real minimum in χ²/ndf in Figure 19c is observed, and it helps to improve the desired property of the unfolded spectrum, as shown for the η_tt spectrum in Figure 20.

Results
The results for the m_tt, p_T^t,had, and η_tt spectra were already shown in Figures 16, 18 and 20. In this section, the two remaining studied spectra, η_t,had and p_T^tt, are shown in Figures 21 and 22, where, for η_t,had, the regularization is unnecessary. The results of all five spectra and their characteristics are summarized in Tables 1-3, in an attempt to find a common feature that could predict the existence of real minima. Firstly, a characteristic of the migration matrix M was studied, defined as

f_diag = (1/N) Σ_{i=1}^{N} M_ii ,

where f_diag has the meaning of the averaged diagonal elements. As a second characteristic, the correlation factor ρ is computed using ROOT's TH2D class and the GetCorrelationFactor() method [9].

Figure 22: Unfolded p_T^tt spectrum using entropy regularization for one representative random seed, and the variable χ²(τ)/ndf using full (red) and combined (blue) sampling of the p_T^tt spectrum, with a minimum at τ = 2089 (entropy regularization). The vertical dotted lines indicate the positions of the χ²/ndf minima for each sampling case.

Table 1. Results of minima type and basic characteristics of the migration matrix M: averaged on-diagonal factor f_diag and correlation factor ρ.