Obviously, the histogram grid is defined by the first bin origin
${\overline{\mathit{y}}}_{1}$ and each histogram bin width
${\mathit{h}}_{j}$. It is usually taken that the first bin origin
${\overline{\mathit{y}}}_{1}={\mathit{y}}_{min}+{\mathit{h}}_{1}/2=\{{y}_{1min}+{h}_{11}/2,\dots ,{y}_{dmin}+{h}_{d1}/2\}$, where
${\mathit{y}}_{min}$ is a vector of the smallest component values from the dataset
$\mathit{Y}$. Additionally, the histogram grid is constructed so that all the observations from the dataset
$\mathit{Y}$ are contained. Usually, that is made by ensuring that the last bin origin
${\overline{\mathit{y}}}_{V}={\mathit{y}}_{max}-{\mathit{h}}_{V}/2=\{{y}_{1max}-{h}_{1V}/2,\dots ,{y}_{dmax}-{h}_{dV}/2\}$, where
${\mathit{y}}_{max}$ is a vector of the largest component values from the dataset
$\mathit{Y}$. Finally, the
$\mathit{y}\in {V}_{j}$ operation from Equation (
8) is defined as
and the observation
${\mathit{y}}_{max}$ is counted into the last bin
V. Let
${k}_{j}$ be the frequency of the
jth bin. The frequency
${k}_{j}$ defines how many observations the
jth bin holds. For the dataset
$\mathit{Y}$ it is defined as
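As a concrete illustration, the bin frequencies can be counted with a minimal Python sketch (not the authors' implementation; `histogram_frequencies` is our name). It assumes the marginal bin widths ${h}_{i}=({y}_{imax}-{y}_{imin})/{v}_{i}$, which follows from the first and last bin origins given above, and it counts observations on the upper edge (such as ${\mathit{y}}_{max}$) into the last bin, as stated:

```python
def histogram_frequencies(Y, v):
    """Count the bin frequencies k_j for a dataset Y (a list of d-dimensional
    points) on a regular grid with v = (v_1, ..., v_d) marginal bins."""
    d = len(v)
    y_min = [min(y[i] for y in Y) for i in range(d)]
    y_max = [max(y[i] for y in Y) for i in range(d)]
    # Bin widths chosen so that the grid spans [y_min, y_max] exactly.
    h = [(y_max[i] - y_min[i]) / v[i] for i in range(d)]
    k = {}  # sparse frequencies: multi-index of the bin -> count
    for y in Y:
        idx = tuple(
            # Observations on the upper edge (e.g., y_max) go into the last bin.
            min(int((y[i] - y_min[i]) / h[i]), v[i] - 1)
            for i in range(d)
        )
        k[idx] = k.get(idx, 0) + 1
    return k
```

Storing only the non-empty bins keeps the memory usage proportional to the number of observations rather than to the full grid size.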
3.1. Estimation of Optimum Histogram-Bin Widths
The main problem is estimating the number of bins
V and the bin widths
${\mathit{h}}_{j}$ for each bin
$j\in \{1,\dots ,V\}$. Most of the time, an equal bin width for every bin can be safely assumed [
22]; i.e.,
Nonetheless, it should be pointed out that there is a lot of research on variable-bin-width nonparametric density estimators [
23], of which histograms themselves are an example [
22]. Thus, the
jth bin’s origin, from Equation (
9), becomes
and the first bin’s origin
${\overline{\mathit{y}}}_{1}$ can be safely estimated with
If the origin of the last bin is
then the histogram-bin widths can be estimated as
where
${v}_{i}\in \mathbb{Z}\wedge {v}_{i}\ge 1$ is the number of cuts in the
ith dimension, which we referred to as the marginal number of bins in
Section 2.4. In Algorithm 4 it is assumed that
${v}_{1}={v}_{2}=\dots ={v}_{i}=\dots ={v}_{d}=v$, thereby making sure the loop in line 3 executes only
${v}_{max}-{v}_{min}+1$ times. Although this restriction seems strong, it often does not lead to a large deterioration in the estimates, especially after the EM algorithm is utilized. Suppose, however, that this restriction is removed and the set
$K=\{{v}_{min},\dots ,{v}_{max}\}$ is applied to each
${v}_{i}\in \{{v}_{1},\dots ,{v}_{d}\}$. This would lead to
${({v}_{max}-{v}_{min}+1)}^{d}$ possible combinations and hence many more loop executions. It is hard to find a practical example in which this large increase in execution time would be justified. Nevertheless, we ask ourselves whether, if
$\mathit{v}=\{{v}_{1},{v}_{2},\dots ,{v}_{d}\}$ could be inexpensively estimated, it would result in improvements in the performance of the parameter estimation.
It is clear that the estimation of the histogram-bin width
$\mathit{h}$, as described above, can be translated into an estimation of the histogram’s number of bins
$\mathit{v}$. In the following explanations, we will refer to a different number of bins
$\mathit{v}$ as a different binning
$\mathit{v}$. The histogram estimation for a binning
$\mathit{v}$ is the problem of estimating the frequency
${k}_{j}$ for each bin
j, as defined in Equation (
11). The problem of estimating the optimum binning
$\mathit{v}$ can thus be formulated as the following constrained integer-optimization problem
where
$\mathit{v}=\{{v}_{1},\dots ,{v}_{d}\}\in {\mathbb{Z}}^{d}\wedge {v}_{i}\ge 1\phantom{\rule{2.0pt}{0ex}}\forall {v}_{i}\in \mathit{v}$ and
$H\left(\mathit{v}\right)$ is the optimization function that calculates the goodness of the histogram estimation with binning
$\mathit{v}$. Generally speaking, as was already stated in [
24], this is an NP-hard problem, which cannot be solved in polynomial time. Additionally, most approaches for solving an integer-optimization problem remove the integer constraint and solve the equivalent real-valued problem [
24]. As the evaluation of the optimization function
$H\left(\mathit{v}\right)$ requires the histogram estimation, finding the optimum binning
$\mathit{v}$ by removing the integer constraint and, for example, using gradient methods is not favorable, since the derivatives of the above-presented histogram estimation, and consequently of the optimization function
$H\left(\mathit{v}\right)$ need to be numerically estimated. Hence, the optimization method will require multiple repetitions of the histogram estimations with different numbers of bins
$\mathit{v}$ and consequently evaluations of the function
$H\left(\mathit{v}\right)$. To ease the optimization problem slightly, we will also adopt another constraint in the form of the maximum marginal number of bins
${v}_{max}$ and the minimum marginal number of bins
${v}_{min}$, so that
holds. The most straightforward implementation, which we will refer to as the exhaustive search, would be to evaluate the optimization function
$H\left(\mathit{v}\right)$ for each and every binning
$\mathit{v}$ from the set
which is in fact the
d-ary Cartesian product of the set
$K=\{{v}_{min},{v}_{min}+1,\dots ,{v}_{max}\}$.
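For illustration, the d-ary Cartesian product ${K}^{d}$ can be generated directly (a minimal Python sketch; `candidate_binnings` is our name, not from the paper):

```python
from itertools import product

def candidate_binnings(v_min, v_max, d):
    """Construct the candidate set C = K^d, where K = {v_min, ..., v_max}."""
    K = range(v_min, v_max + 1)
    return list(product(K, repeat=d))
```

The size of the returned list is $(v_{max}-v_{min}+1)^d$, which is exactly the number of candidates the exhaustive search must evaluate.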
The algorithm implementation is presented in Algorithm 5. As was already stated, the evaluation of the function
$H\left(\mathit{v}\right)$ requires a histogram estimation with binning
$\mathit{v}$, which can be made in
$O\left(n\right)$, where
n is the number of observations in the input dataset
$\mathit{Y}$, so it can be assumed that the efficient implementation of the
$H\left(\mathit{v}\right)$ estimation can be made in
$O\left(n\right)$. However, the number of candidates in the constructed set
C from line 2 of Algorithm 5 is
${({v}_{max}-{v}_{min}+1)}^{d}$. If
${v}_{min}=1$ and
${v}_{max}=n$ are assumed, the computational complexity of Algorithm 5 becomes
$O\left({n}^{d+1}\right)$, which is definitely not desirable as the datasets with large numbers of dimensions are now quite common [
25]. However, if the set
C contains the optimum candidate, that value is guaranteed to be selected.
Algorithm 5: Exhaustive search for optimum histogram binning $\mathit{v}$. 
 1:
Input dataset $\mathit{Y}$, initialize set $K=\{{v}_{\mathrm{min}},{v}_{\mathrm{min}}+1,\dots ,{v}_{\mathrm{max}}\}$;  2:
Construct set $C={K}^{d}$ with Equation ( 21);  3:
Set ${\widehat{H}}_{\mathrm{opt}}=-\infty $, ${\mathit{v}}_{\mathrm{opt}}=\left\{\right\}$;  4:
foreach $\mathit{v}\in C$ do:  5:
Evaluate $\widehat{H}=H\left(\mathit{v}\right)$;  6:
if $\widehat{H}>{\widehat{H}}_{\mathrm{opt}}$:  7:
${\widehat{H}}_{\mathrm{opt}}=\widehat{H}$, ${\mathit{v}}_{\mathrm{opt}}=\mathit{v}$;  8:
end  9:
end

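Algorithm 5 can be sketched in Python as follows (a hedged sketch with hypothetical helper names; $H\left(\mathit{v}\right)$ is assumed to be maximized, so the running best is initialized to $-\infty$):

```python
from itertools import product

def exhaustive_search(H, v_min, v_max, d):
    """Algorithm 5 sketch: evaluate H(v) for every candidate binning in K^d
    and keep the best one (maximization)."""
    H_opt, v_opt = float("-inf"), None
    for v in product(range(v_min, v_max + 1), repeat=d):
        H_hat = H(v)
        if H_hat > H_opt:  # strictly better candidate found
            H_opt, v_opt = H_hat, v
    return v_opt, H_opt
```

With an $O(n)$ evaluation of $H$, this performs $(v_{max}-v_{min}+1)^d$ evaluations, which makes the overall cost explicit.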
To solve the above problem more efficiently than with the exhaustive search, we used a derivation of the coordinate-descent optimization algorithm. Coordinate descent and its variants are types of hill-climbing optimization algorithms [
26]. Hence, given some initial starting value of the optimized parameters, it may converge to a local optimum if the optimization function is not convex. The solution is obtained by cyclically switching the coordinates, i.e., the dimensions, of the optimization problem. For each dimension, it searches for the optimum value of the parameter in that dimension while keeping the parameter values in the other dimensions constant. When it finishes, it moves to the next dimension and repeats the process until convergence. The search for the optimum parameter value in each dimension can be made by calculating the gradient or using a line search. Given that we already stated that a line search is preferred over a gradient calculation, the algorithm implementation is given in Algorithm 6.
Algorithm 6: Coordinate-descent algorithm for optimum histogram binning $\mathit{v}$. 
 1:
Input dataset $\mathit{Y}$, initialize set $K=\{{v}_{\mathrm{min}},{v}_{\mathrm{min}}+1,\dots ,{v}_{\mathrm{max}}\}$;  2:
Set ${\widehat{H}}_{\mathrm{opt}}=-\infty $, ${v}_{\mathrm{opt}}=0$, ${\mathit{v}}_{\mathrm{opt}}=\left\{\right\}$;  3:
Set initial starting position $\mathit{v}=\{{v}_{1},\dots ,{v}_{i},\dots ,{v}_{d}\}$ so that ${v}_{1}=\dots ={v}_{i}=\dots ={v}_{d}=1$;  4:
Set Converged = False;  5:
while Converged is not True do:  6:
foreach $i\in \{1,\dots ,d\}$ do:  7:
Set ${v}_{i\mathrm{old}}={v}_{i}$, ${v}_{\mathrm{opt}}={v}_{i\mathrm{old}}$;  8:
foreach $v\in K$ do:  9:
Update ${v}_{i}\in \mathit{v}$ so that ${v}_{i}=v$;  10:
Evaluate $\widehat{H}=H\left(\mathit{v}\right)$;  11:
if $\widehat{H}>{\widehat{H}}_{\mathrm{opt}}$:  12:
${\widehat{H}}_{\mathrm{opt}}=\widehat{H}$, ${v}_{\mathrm{opt}}=v$, ${\mathit{v}}_{\mathrm{opt}}=\mathit{v}$;  13:
end  14:
end  15:
if $i=1\wedge {v}_{\mathrm{opt}}={v}_{i\mathrm{old}}$:  16:
Set Converged = True;  17:
end  18:
end  19:
end

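The cyclic line search of Algorithm 6 can be sketched in Python (a hedged sketch; `coordinate_descent` is our name, and $H\left(\mathit{v}\right)$ is again assumed to be maximized, starting from $-\infty$):

```python
def coordinate_descent(H, v_min, v_max, d):
    """Algorithm 6 sketch: cyclic line search over each dimension, maximizing
    H; convergence is declared when dimension 1 stops changing."""
    v = [1] * d  # hard-coded starting position, as in line 3 of Algorithm 6
    H_opt = float("-inf")
    converged = False
    while not converged:
        for i in range(d):
            v_i_old = v[i]
            v_i_best = v_i_old
            # Line search over K = {v_min, ..., v_max} in dimension i,
            # keeping the other dimensions fixed.
            for cand in range(v_min, v_max + 1):
                v[i] = cand
                H_hat = H(tuple(v))
                if H_hat > H_opt:
                    H_opt, v_i_best = H_hat, cand
            v[i] = v_i_best
            # If the first dimension did not change, nothing else will.
            if i == 0 and v_i_best == v_i_old:
                converged = True
    return tuple(v), H_opt
```

Note that, exactly as in Algorithm 6, the convergence flag is checked only after the full sweep over all d dimensions completes.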
Essentially, as stated above, in each dimension i, the marginal number of bins ${v}_{i}$ is optimized by performing a line search from the supplied minimum value ${v}_{min}$ to the maximum value ${v}_{max}$ with a fixed step size of 1 (the construction of the ordered set K). When estimating the optimum parameter value ${v}_{i}$, the parameter values in the other dimensions are kept constant. The parameter values for the other dimensions (the ones not currently being estimated) are obtained from a previously conducted line search or from the initial starting value of $\mathit{v}$. The convergence is assessed as follows. The number of iterations I of Algorithm 6 is counted as the number of while-loop executions (lines 5–19 of Algorithm 6). Each iteration of the while loop starts by updating the parameter value in the first dimension of the problem; i.e., $i=1$ and ${v}_{i}={v}_{1}$. Hence, if this parameter value has not changed, it is an obvious clue that the parameter values in the other dimensions will not change either, given that at least one iteration of Algorithm 6 has passed; i.e., $I>1$. To ensure that Algorithm 6 performs at least one iteration, i.e., the condition $I>1$, we chose to hard-code the initial starting parameter values to 1 (line 3 of Algorithm 6). This is clearly an unwanted binning $\mathit{v}$, as it results in only one bin; hence, the estimated value ${\widehat{H}}_{\mathrm{opt}}$ of the optimization function $H\left(\mathit{v}\right)$ can be safely set to the worst value; e.g., in the maximization case, ${\widehat{H}}_{\mathrm{opt}}=-\infty $.
It is hard to predict the necessary number of iterations
I for the convergence of Algorithm 6, as it depends on the input dataset
$\mathit{Y}$. With an increase in the number of iterations
I, the number of optimization-function
$H\left(\mathit{v}\right)$ evaluations increases. In the worstcase scenario, the number of
$H\left(\mathit{v}\right)$ evaluations can even surpass the number of evaluations in the exhaustive search variant given in Algorithm 5, due to the fact that the same value of
$\mathit{v}$ could be revisited in the optimization procedure. As the optimization function
$H\left(\mathit{v}\right)$ evaluation is the most computationally demanding part of Algorithm 6, this is clearly unwanted. To cap the number of
$H\left(\mathit{v}\right)$ evaluations at the number of evaluations in the exhaustive search variant (Algorithm 5), we used a simple memoization technique [
27]. Memoization is often used in dynamic programming and recently also to improve the performance of metaheuristic algorithms [
28]; however, it should be noted that it leads to an increase in memory usage. Still, as pointed out in [
28], it is hard to imagine this being an issue, given that advances in hardware technology have led to massive capacities of inexpensive memory. The memoized version of the coordinate-descent algorithm for optimum histogram binning
$\mathit{v}$ is given in Algorithm 7.
Algorithm 7: Coordinate-descent algorithm for optimum histogram binning $\mathit{v}$ with memoization. 
 1:
Input dataset $\mathit{Y}$, initialize set $K=\{{v}_{\mathrm{min}},{v}_{\mathrm{min}}+1,\dots ,{v}_{\mathrm{max}}\}$;  2:
Set ${\widehat{H}}_{\mathrm{opt}}=-\infty $, ${v}_{\mathrm{opt}}=0$, ${\mathit{v}}_{\mathrm{opt}}=\left\{\right\}$;  3:
Set initial starting position $\mathit{v}=\{{v}_{1},\dots ,{v}_{i},\dots ,{v}_{d}\}$ so that ${v}_{1}=\dots ={v}_{i}=\dots ={v}_{d}=1$;  4:
Initialize memoization dictionary $D=\left\{\right\}$;  5:
Set Converged = False;  6:
while Converged is not True do:  7:
foreach $i\in \{1,\dots ,d\}$ do:  8:
Set ${v}_{i\mathrm{old}}={v}_{i}$, ${v}_{\mathrm{opt}}={v}_{i\mathrm{old}}$;  9:
foreach $v\in K$ do:  10:
Update ${v}_{i}\in \mathit{v}$ so that ${v}_{i}=v$;  11:
if $H\left(\mathit{v}\right)$ already evaluated and stored in dictionary D:  12:
Retrieve stored value of $\widehat{H}$ stored for value $\mathit{v}$;  13:
else:  14:
Evaluate $\widehat{H}=H\left(\mathit{v}\right)$;  15:
Store value $\widehat{H}$ in dictionary D for $\mathit{v}$;  16:
end  17:
if $\widehat{H}>{\widehat{H}}_{\mathrm{opt}}$:  18:
${\widehat{H}}_{\mathrm{opt}}=\widehat{H}$, ${v}_{\mathrm{opt}}=v$, ${\mathit{v}}_{\mathrm{opt}}=\mathit{v}$;  19:
end  20:
end  21:
if $i=1\wedge {v}_{\mathrm{opt}}={v}_{i\mathrm{old}}$:  22:
Set Converged = True;  23:
end  24:
end  25:
end

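The dictionary D in Algorithm 7 amounts to a standard memoization wrapper around the optimization function, which can be sketched as (`memoize` is our illustrative name):

```python
def memoize(H):
    """Wrap H(v) with a dictionary cache (the dictionary D of Algorithm 7):
    a binning v that has already been evaluated reuses the stored value
    instead of triggering a new histogram estimation."""
    D = {}
    def H_memo(v):
        if v not in D:          # line 11 of Algorithm 7: not yet evaluated
            D[v] = H(v)         # lines 14-15: evaluate and store
        return D[v]             # line 12: retrieve the stored value
    H_memo.cache = D            # expose D for later reuse (e.g., Algorithm 8)
    return H_memo
```

Wrapping $H$ once before running the coordinate descent bounds the number of true evaluations by the number of distinct binnings, i.e., by the exhaustive-search count $(v_{max}-v_{min}+1)^d$.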
The intuition behind using the coordinate-descent algorithm for our optimization purposes comes from the histogram being the simplest nonparametric probability density estimator [
29]. Most of the time, it is used on a univariate (one-dimensional) random variable; however, as in our case, it can also be used for a multivariate (multidimensional) random variable. If, for example, we assume independence between the dimensions of a random variable
$\mathit{y}\in {\mathbb{R}}^{d}$, the problem of estimating each
${v}_{i}\in \mathit{v}$ could be broken into
d univariate problems, and the estimation of each marginal number of bins
${v}_{i}$ could be conducted separately. This could result in a large overestimation of each marginal number of bins
${v}_{i}$, due to the fact that in a
ddimensional problem the actual number of bins in the histogram grid is
$V={\prod}_{i=1}^{d}{v}_{i}$. Given that a larger number of bins leads to more empty bins in the histogram grid (e.g., when
$V\gg n$), most choices of the optimization function
$H\left(\mathit{v}\right)$ will penalize such a solution. When using the coordinatedescent algorithm, we act analogously; however, we do so by taking into account the actual number of bins in the
d-dimensional grid. Nonetheless, as already stated, the coordinate-descent algorithm can be trapped in a spurious local optimum. To ensure that the best possible solution is estimated, we use the results from the coordinate-descent algorithm to conduct an exhaustive search in a narrowed parameter space, as shown in Algorithm 8. To clarify why this could be helpful, let us imagine the following scenario. Let some marginal number of bins
${v}_{i}$ be underestimated, whereas some other marginal number of bins
${v}_{\tilde{\imath}}$ is overestimated (e.g., in some dimension
i, the estimated value of
${v}_{i}$ is smaller than the optimum, and in some other dimension
$\tilde{\imath}\ne i$, the estimated value of
${v}_{\tilde{\imath}}$ is higher than the optimum). A new set
K can be constructed from the minimum estimated value to the maximum estimated value, as shown in lines 3 and 4 of Algorithm 8, and constructing all the possible combinations from the set
K (constructing the set
C from line 5 of Algorithm 8) will ensure that the optimum solution is acquired. Although there is no clear evidence that the utilization of an additional exhaustive search will be beneficial, we presume that the newly constructed set
K and consequently set
C (lines 5 and 6 of Algorithm 8) will hold many fewer candidates for the optimum solution. Additionally, due to the use of the memoization technique, many possible solutions have already been visited; thus, an additional exhaustive search should not lead to an exceptional increase in the computational time.
Algorithm 8: Narrowed exhaustive search with a coordinate-descent algorithm for optimum histogram binning $\mathit{v}$ with memoization. 
 1:
Input dataset $\mathit{Y}$, initialize set $K=\{{v}_{\mathrm{min}},{v}_{\mathrm{min}}+1,\dots ,{v}_{\mathrm{max}}\}$;  2:
Obtain ${\mathit{v}}_{\mathrm{opt}}$ using Algorithm 7;  3:
Set ${v}_{\mathrm{min}}=min\left({\mathit{v}}_{\mathrm{opt}}\right)$;  4:
Set ${v}_{\mathrm{max}}=max\left({\mathit{v}}_{\mathrm{opt}}\right)$;  5:
Construct narrowed set $K=\{{v}_{\mathrm{min}},{v}_{\mathrm{min}}+1,\dots ,{v}_{\mathrm{max}}\}$;  6:
Construct set $C={K}^{d}$ with Equation ( 21);  7:
foreach $\mathit{v}\in C$ do:  8:
if $H\left(\mathit{v}\right)$ already evaluated and stored in dictionary D:  9:
Retrieve stored value of $\widehat{H}$ stored for value $\mathit{v}$;  10:
else:  11:
Evaluate $\widehat{H}=H\left(\mathit{v}\right)$;  12:
Store value $\widehat{H}$ in dictionary D for $\mathit{v}$;  13:
end  14:
if $\widehat{H}>{\widehat{H}}_{\mathrm{opt}}$:  15:
${\widehat{H}}_{\mathrm{opt}}=\widehat{H}$, ${\mathit{v}}_{\mathrm{opt}}=\mathit{v}$;  16:
end  17:
end

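The narrowing step (lines 3–5 of Algorithm 8) simply rebuilds K from the extremes of the coordinate-descent result, which can be sketched as (`narrowed_K` is our illustrative name):

```python
def narrowed_K(v_opt):
    """Rebuild the set K from the smallest and largest marginal bin counts
    found by the coordinate descent (lines 3-5 of Algorithm 8)."""
    v_min, v_max = min(v_opt), max(v_opt)
    return list(range(v_min, v_max + 1))
```

The narrowed exhaustive search then iterates over the d-ary Cartesian product of this (typically much smaller) set, reusing any values of $H$ already stored in the memoization dictionary.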
3.2. The Knuth Rule as the Optimization Function
Finally, let us address the elephant in the room; i.e., the optimization function
$H\left(\mathit{v}\right)$. For the optimization function
$H\left(\mathit{v}\right)$ the Knuth rule defined in [
22] is used. The Knuth rule yielded good estimates when used in [
4], and additionally, due to its usability in a multivariate setting, it was chosen here. The Knuth rule is defined as
where
$\mathit{Y}$ is the dataset,
n is the number of observations in the dataset,
V is the number of bins,
$log\Gamma $ is the natural logarithm of the gamma function and
$const$ is an integration constant. It is clear that different binning
$\mathit{v}$ should produce different results for
$H\left(\mathit{v}\right)$, even though it is not obvious from the definition in Equation (
22). Different binning
$\mathit{v}$ leads to different frequencies
${k}_{j}\left(\mathit{Y}\right)$, if we recall Equation (
11), and additionally, a different number of bins
V, ultimately giving different values to the
${\sum}_{j=1}^{V}log\Gamma ({k}_{j}\left(\mathit{Y}\right)+1/2)$ term in Equation (
22). Nonetheless, as the number of bins
V increases (with the increase in the marginal number of bins
${v}_{i}$, the bin width shrinks
${h}_{i}\to 0$), it will lead to the state where each populated bin contains only one observation,
${k}_{j}\left(\mathit{Y}\right)=1\phantom{\rule{2.0pt}{0ex}}\forall j\in \{1,\dots ,n\}$ and the rest of the bins will have a frequency of zero. Since computers digitize the data, as was stated in [
22], this is a most unlikely scenario for every dataset
$\mathit{Y}$; however, after passing some final binning
${\mathit{v}}_{\mathrm{final}}$, as the order of the bins in Equation (
22) is not important, the term
${\sum}_{j=1}^{V}log\Gamma ({k}_{j}\left(\mathit{Y}\right)+1/2)$ will not change, and the change in
$H\left(\mathit{v}\right)$ will depend only on the number of bins
V. If we assume that for the final binning
${\mathit{v}}_{\mathrm{final}}$ we have the scenario in which each populated bin holds one observation, Equation (
22) transforms into
The value of
$H\left(V\right)$ from Equation (
23) is monotonically increasing as
V increases. This could jeopardize the estimation process for the optimum solution
$\mathit{v}$ in Algorithm 8. Therefore, to determine when the evaluation of the Knuth rule (Equation (
22)) is no longer usable, due to this possible degeneration, we have chosen to constrain the maximum number of nonempty bins
${v}_{max}^{*}$ in the histogram grid; in other words
Equation (
24) for
$d=1$ yields the famous Root-N rule
${v}_{max}^{*}=2\sqrt{n}$, and as the dimension increases,
$d\to \infty $ the limiting number of populated bins is
${v}_{max}^{*}\to n$. Hence, after each histogram estimation the number of nonempty bins in the histogram grid is checked and if the number of nonempty bins
${v}^{*}$ is higher than
${v}_{max}^{*}$, then the solution is rejected.
Remark 7. Although ${v}_{max}^{*}$ should be an integer, the result of Equation (24) is real; i.e., ${v}_{max}^{*}\in \mathbb{R}$. Since ${v}_{max}^{*}$ is only used for comparison, the need for it to be an integer is avoided.