2. Related Work
In most practical studies on NTF, the value of the rank is decided either by trial and error or by specialists' insights (see e.g., [6,14]). In addition, AIC and BIC have also been used for model selection in NTF [15].
Unlike NTF, there are a number of rank selection methods for NMF. As a special case of NTF, NMF aims to factorize a nonnegative data matrix $X\in \mathbb{R}_{+}^{I\times J}$ into factor matrices $W\in \mathbb{R}_{+}^{I\times R}$ and $H\in \mathbb{R}_{+}^{R\times J}$ as $X=WH+E$, where $E$ is the approximation error matrix. In addition to general methods such as AIC, BIC and cross-validation [16], more involved criteria have been developed to select the rank for NMF, such as the MDL criterion with latent variable completion [17] and the nonparametric Bayes method [18]. The MDL codelength under the assumption of model regularity is studied in [19].
Squires et al. [20] proposed two rank selection methods for NMF based on the MDL principle. In their study, the data matrix $X$ is regarded as a message from a transmitter to a receiver: the transmitter sends $W$, $H$ and $E$, and the receiver reconstructs $X$ from them. When the rank $R$ is low, encoding $W$ and $H$ requires short codelengths while encoding $E$ requires a long one; conversely, when $R$ is high, encoding $W$ and $H$ requires long codelengths while encoding $E$ requires a short one. The MDL principle is used to find the best solution of this tradeoff between accuracy and complexity. Squires et al. proposed two methods for calculating codelengths with the Shannon information $-\log p(x)$, where the probability $p$ is known in advance. Note that the NML codelength we use can be thought of as an extension of the Shannon information to the situation where the parameter $\theta$ of the probability distribution $p(x\mid\theta)$ is unknown in advance [10].
3. Proposed Method
This section proposes an MDL-based rank estimation method for NTF. To do this, we extend the study on NMF by [20] to NTF in a non-trivial way. As noted in Section 1.2, in the case of tensors we may suffer from the imbalance problem, that is, the number of elements in the factor matrices is much smaller than the number of elements in the data tensor. For instance, when $I=J=K=50$ and the rank is $R=20$, the error tensor $\mathcal{E}$ has $50\times 50\times 50 = 125{,}000$ elements, while the total number of elements in the three factor matrices $T,U,V$ is only $20\times (50+50+50)=3000$. In such cases, the codelengths of the factor matrices are too small, compared with that of the error tensor, to influence the total result. NTF-based rank selection methods therefore tend to choose the model that best fits the data and largely ignore the model complexity. As a consequence, the tradeoff between complexity and errors cannot be well formalized.
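The element counts in the imbalance example above can be checked directly; a minimal arithmetic sketch:

```python
# Quick arithmetic behind the imbalance example (I = J = K = 50, R = 20):
# the error tensor dwarfs the factor matrices in element count.
I = J = K = 50
R = 20

tensor_elements = I * J * K        # elements in the error tensor
factor_elements = R * (I + J + K)  # elements in T, U and V combined
ratio = tensor_elements / factor_elements  # roughly 42 : 1
```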
Our key idea is to take a tensor-slice-based approach. The overall flow of our method is summarized as follows: we first produce a number of tensor slices from a nonnegative data tensor, then treat those slices as nonnegative data matrices and employ NMF to factorize them. Next, we select a rank for each tensor slice so that the total codelength is minimized, and finally select the largest among the selected slice ranks as the rank of the original tensor. Note that we calculate the codelength with the NML codelength, rather than the Shannon information used in [20].
First of all, for a third-order nonnegative tensor $\mathcal{X}\in \mathbb{R}_{+}^{I\times J\times K}$, the three kinds of two-dimensional slices are: horizontal slices $X_{i}\in \mathbb{R}_{+}^{J\times K}\ (i=1,\ldots,I)$, lateral slices $X_{j}\in \mathbb{R}_{+}^{I\times K}\ (j=1,\ldots,J)$, and frontal slices $X_{k}\in \mathbb{R}_{+}^{I\times J}\ (k=1,\ldots,K)$. Each tensor slice can be treated as a matrix to be factorized as follows:

$X_{a} = W_{a,R} H_{a,R} + E_{a,R},$

where $X_{a}$ represents a tensor slice of the nonnegative data tensor $\mathcal{X}$, $W_{a,R}$ and $H_{a,R}$ are the two nonnegative factor matrices, both of which consist of $R$ factors, and $E_{a,R}$ denotes the error matrix. When the data tensor has true rank $R$, any $X_{a}$ has rank no more than $R$. Therefore, we can select an appropriate rank for each $X_{a}$ and take the maximum value among them as the rank of the tensor.
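The three kinds of slices can be extracted directly by indexing; a minimal numpy sketch (the tensor dimensions here are arbitrary):

```python
import numpy as np

# Horizontal, lateral, and frontal slices of a third-order tensor,
# as defined in the text.
I, J, K = 4, 5, 6
X = np.random.rand(I, J, K)

horizontal = [X[i, :, :] for i in range(I)]  # each slice is J x K
lateral    = [X[:, j, :] for j in range(J)]  # each slice is I x K
frontal    = [X[:, :, k] for k in range(K)]  # each slice is I x J
```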
Next, we select the rank of $X_{a}$ so that the total codelength is minimized. The total codelength of $X_{a}$ with rank $R$ is given by:

$\mathcal{L}(X_{a}) = \mathcal{L}(W_{a,R}) + \mathcal{L}(H_{a,R}) + \mathcal{L}(E_{a,R}),$

where $\mathcal{L}(x)$ is the codelength required for encoding $x$ under the prefix condition. In order to calculate the codelengths of the elements in $W_{a,R}$, $H_{a,R}$, and $E_{a,R}$, we need to discretize them with an appropriate precision $\delta$, since they are real-valued numbers with unlimited precision, which cannot be encoded. For a given precision $\delta$, the elements of each matrix are first vectorized to build a vector, then discretized into bins of width $\delta$ to create a histogram. Letting the minimum and maximum of the elements be $v_{min}$ and $v_{max}$, respectively, the histogram has bins:

$[v_{min},\, v_{min}+\delta),\ [v_{min}+\delta,\, v_{min}+2\delta),\ \ldots,\ [v_{min}+(s-1)\delta,\, v_{max}],$

where $s=\lceil (v_{max}-v_{min})/\delta \rceil$ is the number of bins.
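The discretization step above can be sketched as follows; the example value $\delta = 0.25$ is chosen arbitrarily for illustration:

```python
import numpy as np

# Vectorize a matrix, bin its entries with precision delta, and count
# the elements per bin, as described in the text.
def to_histogram(Y, delta):
    v = Y.ravel()
    v_min, v_max = v.min(), v.max()
    s = max(int(np.ceil((v_max - v_min) / delta)), 1)  # number of bins
    # each element falls into the bin [v_min + i*delta, v_min + (i+1)*delta)
    idx = np.minimum(((v - v_min) / delta).astype(int), s - 1)
    counts = np.bincount(idx, minlength=s)
    return counts, s

counts, s = to_histogram(np.array([[0.0, 0.3], [0.7, 1.0]]), delta=0.25)
# counts.sum() equals the number of matrix elements
```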
Then we employ the NML codelength in order to compute the codelength of $x^{n}$, each element of which denotes an element assigned to a bin. The NML codelength is a theoretically reasonable coding method when the parameter value is unknown. Let $p(x^{n}\mid\theta,\mathcal{M})$ be the probability distribution with parameter $\theta$ under the model $\mathcal{M}$. According to [12], given a data sequence $x^{n}$, the NML codelength $\mathcal{L}_{\mathrm{NML}}(x^{n}\mid\mathcal{M})$ for $x^{n}$ under the model $\mathcal{M}$ is given by:

$\mathcal{L}_{\mathrm{NML}}(x^{n}\mid\mathcal{M}) = -\log p\bigl(x^{n}\mid\widehat{\theta}(x^{n}),\mathcal{M}\bigr) + \log \sum_{y^{n}} p\bigl(y^{n}\mid\widehat{\theta}(y^{n}),\mathcal{M}\bigr),$ (3)

where $\widehat{\theta}(x^{n})$ denotes the maximum likelihood estimator of $\theta$ from $x^{n}$. The second term in Equation (3) is generally difficult to calculate. Rissanen [12] derived its asymptotic approximation formula as follows:

$\log \sum_{y^{n}} p\bigl(y^{n}\mid\widehat{\theta}(y^{n}),\mathcal{M}\bigr) = \frac{k}{2}\log\frac{n}{2\pi} + \log\int\sqrt{|I(\theta)|}\,d\theta + o(1),$ (4)

where $k$ is the number of parameters in the model $\mathcal{M}$, $|I(\theta)|$ denotes the determinant of the Fisher information matrix, and $o(1)$ satisfies $\lim_{n\to\infty} o(1)=0$ uniformly over $x^{n}$.
For $W_{a,R}$ and $H_{a,R}$, the zero terms in the factor matrices generally far outnumber the nonzero terms. Thus we separately compute the codelength of the zero terms, namely the first bin in the histogram, and the nonzero terms as follows:

$\mathcal{L}(Y_{a,R}) = \mathcal{L}_{\mathrm{NML}}(Y_{a,R}^{0}) + \mathcal{L}_{\mathrm{NML}}(Y_{a,R}^{+}),$

where $Y_{a,R}$ is either $W_{a,R}$ or $H_{a,R}$, $Y_{a,R}^{0}$ represents the zero terms in $Y_{a,R}$, and $Y_{a,R}^{+}$ denotes the nonzero terms in $Y_{a,R}$.
For the zero terms, i.e., the first bin in the histogram, applying Equations (3) and (4) to the Bernoulli model yields the NML codelength:

$\mathcal{L}_{\mathrm{NML}}(Y_{a,R}^{0}) = -n_{0}\log\frac{n_{0}}{n} - (n-n_{0})\log\frac{n-n_{0}}{n} + \frac{1}{2}\log\frac{n}{2\pi} + \log\pi + o(1),$ (5)

where $n$ is the total number of elements in $W_{a,R}$ or $H_{a,R}$, and $n_{0}$ denotes the number of zero values in the matrix.
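As a concrete check of the Bernoulli model for the zero/nonzero indicators, the NML codelength can also be computed exactly, without the asymptotic approximation, by brute-force enumeration of the normalizing sum; a sketch (codelengths in bits):

```python
import math

def bernoulli_nml_codelength(n, n0):
    """Exact NML codelength (bits) for a binary zero/nonzero indicator
    sequence of length n containing n0 zeros (brute-force sketch)."""
    def max_lik(m):
        # maximized Bernoulli likelihood of one sequence with m zeros
        if m == 0 or m == n:
            return 1.0
        p = m / n
        return p ** m * (1.0 - p) ** (n - m)
    # parametric complexity: sum over all sequences of their maximized likelihood
    complexity = sum(math.comb(n, m) * max_lik(m) for m in range(n + 1))
    return -math.log2(max_lik(n0)) + math.log2(complexity)
```

This is only feasible for small $n$; the asymptotic formula above is what makes the method practical.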
For the binned data in $W_{a,R}^{+}$, $H_{a,R}^{+}$ and $E_{a,R}$, applying Equations (3) and (4) to histogram densities with $s$ bins gives the NML codelength:

$\mathcal{L}_{\mathrm{NML}}(x^{n}) = -\sum_{i=1}^{s} n_{i}\log\frac{n_{i}}{n} + \frac{s-1}{2}\log\frac{n}{2\pi} + \log\frac{\pi^{s/2}}{\Gamma(s/2)} + \mathcal{L}_{\mathrm{int}}(s-1) + o(1),$ (6)

where $n_{i}$ is the number of elements in the $i$th bin, and $\Gamma(\cdot)$ is the gamma function. $\mathcal{L}_{\mathrm{int}}(s-1)$ is the codelength of an integer [10], which can be computed as:

$\mathcal{L}_{\mathrm{int}}(k) = \log c_{0} + \log k + \log\log k + \cdots,$
where the summation is taken over all the positive iterates and $c_{0}$ is a normalizing constant ($c_{0}\approx 2.865$). Using Equations (5) and (6), the total description length of $X_{a}$ can be calculated as follows:

$\mathcal{L}(X_{a}; R) = \mathcal{L}_{\mathrm{NML}}(W_{a,R}^{0}) + \mathcal{L}_{\mathrm{NML}}(W_{a,R}^{+}) + \mathcal{L}_{\mathrm{NML}}(H_{a,R}^{0}) + \mathcal{L}_{\mathrm{NML}}(H_{a,R}^{+}) + \mathcal{L}_{\mathrm{NML}}(E_{a,R}).$ (8)
After applying the MDL principle to select the rank with the shortest total codelength for each tensor slice, we select the largest among all slice ranks as the rank of the tensor. This estimate can be seen as a lower bound on the rank of the tensor, since the rank of each slice is no greater than that of the data tensor: if the rank of the tensor $\mathcal{X}$ is $R$ and the decomposition is represented as ${x}_{ijk}={\sum}_{r=1}^{R}{t}_{ir}{u}_{jr}{v}_{kr}$, each slice ${X}_{a}$ can be represented as ${W}_{a}^{\prime}{H}_{a}^{\prime}$ with nonnegative matrices ${W}_{a}^{\prime}$ and ${H}_{a}^{\prime}$ of $R$ columns. For example, each element of ${X}_{i}={\left({x}_{ijk}\right)}_{jk}$ can be represented as ${x}_{ijk}={\sum}_{r=1}^{R}{w}_{jr}{h}_{kr}$, where ${w}_{jr}={t}_{ir}{u}_{jr}$ and ${h}_{kr}={v}_{kr}$.
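The integer codelength $\mathcal{L}_{\mathrm{int}}(\cdot)$ used above (Rissanen's universal code for integers) can be computed by summing the positive iterates of the logarithm; a minimal sketch in bits, using Rissanen's constant $c_{0}\approx 2.865$:

```python
import math

C0 = 2.865064  # Rissanen's normalizing constant for the integer code

def integer_codelength(k):
    """Codelength (bits) of a positive integer k: log2(c0) plus the
    sum of the positive iterates log2 k + log2 log2 k + ..."""
    total = math.log2(C0)
    term = math.log2(k)
    while term > 0:
        total += term
        term = math.log2(term)
    return total
```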
We show the entire procedure in Algorithm 1.
Algorithm 1 Rank Selection with Tensor Slices

Slice a nonnegative third-order data tensor $\mathcal{X}\in {\mathbb{R}}_{+}^{I\times J\times K}$ into tensor slices ${X}_{a}$
for $a=1$ to $I+J+K$ do
  for $R=1$ to $\min(I,J,K)$ do
    Perform NMF on ${X}_{a}$ to obtain ${W}_{a,R}$ and ${H}_{a,R}$
    Calculate ${E}_{a,R}={X}_{a}-{W}_{a,R}{H}_{a,R}$
    Compute the total codelength using Equation (8)
  end for
  Select the rank of the tensor slice: ${R}_{a} \leftarrow \arg\min_{R} \mathcal{L}({X}_{a}; R)$
end for
Select the rank of the tensor: ${R}_{tensor} \leftarrow \max_{a} {R}_{a}$
return ${R}_{tensor}$
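Algorithm 1 can be sketched end-to-end as follows. This is a simplified illustration of the control flow only: NMF is implemented with plain multiplicative updates, the codelength is a histogram-entropy stand-in for the full NML criterion of Equation (8), and for brevity only the frontal slices are factorized.

```python
import numpy as np

rng = np.random.default_rng(0)

def nmf(X, R, n_iter=200, eps=1e-9):
    # Multiplicative-update NMF (Frobenius loss), a minimal sketch
    I, J = X.shape
    W = rng.random((I, R)) + eps
    H = rng.random((R, J)) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

def codelength(M, delta=0.05):
    # Stand-in criterion: empirical-entropy codelength of delta-binned entries
    v = M.ravel()
    s = max(int(np.ceil((v.max() - v.min()) / delta)), 1)
    idx = np.minimum(((v - v.min()) / delta).astype(int), s - 1)
    counts = np.bincount(idx, minlength=s)
    p = counts[counts > 0] / len(v)
    return -(counts[counts > 0] * np.log2(p)).sum()

def select_rank(X_tensor, max_R):
    ranks = []
    I, J, K = X_tensor.shape
    for k in range(K):                      # frontal slices only, for brevity
        Xa = X_tensor[:, :, k]
        best_R, best_L = 1, np.inf
        for R in range(1, max_R + 1):
            W, H = nmf(Xa, R)
            E = Xa - W @ H
            L = codelength(W) + codelength(H) + codelength(E)
            if L < best_L:                  # shortest total codelength wins
                best_R, best_L = R, L
        ranks.append(best_R)
    return max(ranks)                       # largest slice rank = tensor rank
```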
4. Comparison Method: MDL2stage
We further developed a novel algorithm, MDL2stage, as a comparison method. MDL2stage is based on tensor slices, like the proposed method, but it encodes the factorized results of NMF via the two-stage codelength. All of its calculation is exactly the same as in our proposed method except for the way of encoding the error matrix and the nonzero terms in the factor matrices.
In MDL2stage, we fit a parametric probability distribution to the histogram to estimate the probability of each bin. We assume that the elements in the histogram generated by the error matrix $E_{a,R}$ follow a normal distribution, and that the nonzero elements in the histograms of the two factor matrices $W_{a,R}$ and $H_{a,R}$ are gamma-distributed. Then we use the two-stage codelength to calculate the description length of $E_{a,R}$ and of the nonzero terms in $W_{a,R}$ and $H_{a,R}$ as follows:

$\mathcal{L}_{2\text{-stage}}(Y_{a,R}^{+}) = -\sum_{i=1}^{s} n_{i}\log\rho_{i} + \frac{k}{2}\log n,$

where $Y_{a,R}^{+}$ is $E_{a,R}$ or the nonzero terms in $W_{a,R}$ or $H_{a,R}$, $\rho_{i}$ denotes the estimated probability of an element falling in the $i$th bin, and $k$ is the number of parameters of the fitted distribution.
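The two-stage encoding of the error matrix can be sketched as follows: fit a normal distribution by maximum likelihood, convert it to per-bin probabilities $\rho_i$, then add the parameter cost. The gamma case for the factor matrices is analogous. The bin width `delta` and the $\frac{k}{2}\log n$ parameter cost with $k=2$ (for the mean and standard deviation) are modeling assumptions of this sketch; codelengths are in bits.

```python
import numpy as np
from math import erf, sqrt, log2

def norm_cdf(x, mu, sigma):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

def two_stage_codelength(E, delta=0.1):
    v = E.ravel()
    n = len(v)
    mu, sigma = v.mean(), max(v.std(), 1e-9)   # ML fit of the normal model
    s = max(int(np.ceil((v.max() - v.min()) / delta)), 1)
    edges = v.min() + delta * np.arange(s + 1)
    idx = np.minimum(((v - v.min()) / delta).astype(int), s - 1)
    counts = np.bincount(idx, minlength=s)
    # per-bin probabilities rho_i from the fitted distribution
    rho = np.array([max(norm_cdf(edges[i + 1], mu, sigma)
                        - norm_cdf(edges[i], mu, sigma), 1e-12)
                    for i in range(s)])
    data_bits = float(-(counts * np.log2(rho)).sum())
    param_bits = (2 / 2) * log2(n)  # two-stage cost of mu and sigma (k = 2)
    return data_bits + param_bits
```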
The total codelength for a tensor slice $X_{a}$ with rank $R$ is:

$\mathcal{L}(X_{a}; R) = \mathcal{L}_{\mathrm{NML}}(W_{a,R}^{0}) + \mathcal{L}_{2\text{-stage}}(W_{a,R}^{+}) + \mathcal{L}_{\mathrm{NML}}(H_{a,R}^{0}) + \mathcal{L}_{2\text{-stage}}(H_{a,R}^{+}) + \mathcal{L}_{2\text{-stage}}(E_{a,R}).$

Again, we apply the MDL principle to choose the rank with the shortest total codelength for each slice, and select the greatest rank among all slice ranks as the rank of the tensor.
As for the computational complexity of our proposed method and MDL2stage, the cost of factorizing a tensor slice of size $I\times J$ is $O(IJR)$ per iteration. Since there are $K$ such tensor slices, the total computational complexity of the two tensor-slice-based methods is $O(IJKR)$ per iteration. By similar arguments, performing NMF on all $I+J+K$ tensor slices costs $O(IJKR\,T_{\mathrm{NMF}})$, where $T_{\mathrm{NMF}}$ denotes the number of iterations in NMF.
Although in theory we have to perform NMF on $I+J+K$ slices, numerical experiments have shown that in practice we only need to factorize the $K$ tensor slices of the largest size $I\times J$, where we assume that $K$ is the smallest of $I$, $J$ and $K$. This is because smaller tensor slices usually yield comparatively low ranks. Therefore, when we choose the largest rank over all slice ranks, whether or not we use the smaller tensor slices does not influence the final result.