Next Article in Journal
A Fast Particle-Locating Method for the Arbitrary Polyhedral Mesh
Previous Article in Journal
Simple K-Medoids Partitioning Algorithm for Mixed Variable Data
Previous Article in Special Issue
Aiding Dictionary Learning Through Multi-Parametric Sparse Representation

Algorithms 2019, 12(9), 178;

Adaptive-Size Dictionary Learning Using Information Theoretic Criteria
Department of Automatic Control and Computers, University Politehnica of Bucharest, 313 Spl. Independenţei, 060042 Bucharest, Romania
Department of Statistics, University of Auckland, Auckland 1142, New Zealand
Author to whom correspondence should be addressed.
Received: 14 July 2019 / Accepted: 22 August 2019 / Published: 25 August 2019


Finding the size of the dictionary is an open issue in dictionary learning (DL). We propose an algorithm that adapts the size during the learning process by using Information Theoretic Criteria (ITC) specialized to the DL problem. The algorithm is built on top of Approximate K-SVD (AK-SVD) and periodically removes the less used atoms or adds new random atoms, based on ITC evaluations for a small number of candidate sub-dictionaries. Numerical experiments on synthetic data show that our algorithm not only finds the true size with very good accuracy, but is also able to improve the representation error in comparison with AK-SVD knowing the true size.
dictionary learning; sparse representation; information theoretic criteria; dictionary size

1. Introduction

Dictionary learning (DL) is now a mature field [1,2,3], with several efficient algorithms for solving the basic problem or its variants and with numerous applications in image processing (denoising and inpainting), classification, compressed sensing and others. The basic DL problem is: given N training signals gathered as the columns of the matrix Y R m × N and the sparsity level s, find the dictionary D R m × n by solving
min D , X Y D X F 2 s . t . x 0 s , = 1 : N d j 2 = 1 , j = 1 : n
Here, · F is the Frobenius norm; and x and d j are columns of X and D , respectively. The first constraint says that the matrix X has at most s nonzeros on each column. The second constraint is the normalization of the atoms (columns of the dictionary).
An alternative view is to relate Equation (1) to the sparse matrix factorization—or dictionary recovery (DR)—problem. The different assumption is that there indeed exist a dictionary D and a sparse matrix X such that Y DX ; the purpose is to recover them from the data Y . Such a ground truth is usually not available in practical applications. However, investigating DR is useful for theoretical and practical developments. Significant recent work on this matter can be found in [4,5,6,7]. Our contribution belongs more to the DR line of thought.
In most DR algorithms, the dictionary size n (the number of atoms) is assumed to be known. In DL applications, n is chosen with a rather informal trial-and-error procedure: a few sizes are used and that giving the best performance in the application at hand is selected. A dictionary with more atoms typically gives a smaller error: if D 1 R m × n 1 and D 2 R m × n 2 , with n 1 < n 2 , we expect that Y D 1 X 1 F 2 > Y D 2 X 2 F 2 . However, this does not mean that D 2 is necessarily better. We aim here at an automatic choice of the size that is the most appropriate to the data, based on Information Theoretic Criteria (ITC).
The only previous work based on ITC [8] uses Minimum Description Length for the choice of the size and the sparsity level. Their algorithm implements a virtual coder and searches exhaustively over all considered dictionary sizes. A recent method [9] reporting promising results has a similar purpose, but is based on geometric properties of the DR problem.
All other methods are based on heuristics that essentially aim to obtain a dictionary having the same representation error as that obtained with a standard method, but with smaller size. We list a few of the successful approaches. In [10,11], a small dictionary is grown by adding representative atoms from time to time in the iterative DL process. Clustering ideas are used in [12,13,14] to reduce the size of a large dictionary. A direct attempt of optimizing the size is employed in [15] by introducing in the objective a proxy for the size as penalization. An Indian Buffet Process is the tool in [16]. Other techniques based on Bayesian learning are [17,18].
Unlike the method in [8] but similarly to that in [9], our approach tries to adapt the size during the DL process. It also has a much simpler implementation, using standard ITC that do not intervene in the learning itself, but only in the selection of the atoms. Hence, the complexity is not much higher than that of the underlying DL algorithm.
Section 2 presents the ITC, specialized to the DL problem, and describes the pool of candidate dictionaries during the DL algorithm. Section 3 gives the details of our algorithm for adapting the dictionary size. Section 4 is dedicated to experimental results on synthetic data that show that our algorithm is able to recover the size of the true dictionary, in various noise and sparsity level conditions and with various initializations. Our algorithm also gives, almost always, recovery errors on test data that are better than those given by the underlying DL algorithm in possession of the true dictionary size. Comparison with the method in [9] is also favorable.

2. Ingredients

2.1. Information Theoretic Criteria

ITC serve for assessing the adequacy of a model to a process described by experimental data, by combining its goodness of fit (approximation error) with its complexity. The underlying model in Equation (1) is Y = DX + U , where U is a matrix with entries that follow a Gaussian distribution with zero mean and an unknown variance (the same for all entries) [19]. Denoting T = m N , the goodness of fit is expressed via the Root Mean Square Error
RMSE = 1 T Y D X F .
The complexity depends primarily on the number of parameters; in the DL or DR case, this is
NoP = s N + ( m 1 ) n .
The first term corresponds to the number of nonzeros in X . The second term is the number of independent elements of the dictionary; we subtract n from the total of m n elements to account for the atom normalization constraints.
After preliminary investigation with several ITC, we kept only two for our DL approach. The first is Bayesian Information Criterion (BIC) [20], extended to the form
EBIC = 2 log RMSE + log T T NoP + 2 N T log n s .
The first two terms are the standard ones; we have added a third term to account for all possible positions of the nonzero entries in the matrix X , inspired by [21].
The second ITC is Renormalized Maximum Likelihood (RNML) [22,23], adapted as in [24] to our context, to which we have added the same combinatorial term as above:
ERNML 1 = ( T NoP ) log RMSE 2 T NoP + NoP log DX F 2 T · NoP + log [ NoP ( T NoP ) ] + 2 N log n s .
We note that a different version, ERNML 2 , derived in [25], further analyzed in [26] and adapted as Equation (5), gave results similar to those of ERNML 1 , thus we do not report it here. The ITC dismissed after preliminary (and thus possibly insufficient) investigation are those from [27,28].

2.2. Candidate Dictionaries

Application of ITC needs several candidate dictionaries from which selection is made with the minimum ITC value. Since we aim to run a single instance of the learning algorithm and, as such, we have a single dictionary, the only possibility is to compare smaller dictionaries made of a subset of the atoms. Even so, there are too many possible combinations. To reduce their number, we order the atoms based on their importance in the representations. Since
DX = j = 1 n d j x j T ,
where x j T is the j-th row of X , we sort the atoms in decreasing order of their “power”
P ( d j ) = x j T 2 .
We still name D the sorted dictionary. For selection, we consider dictionaries D ν R m × ν that are made of the first ν n atoms of D . Thus, there are at most n candidates. However, since small dictionaries are certainly not useful, we can also impose a lower bound n min and take ν n min . One can choose n min = m or even larger values.

2.3. DL Algorithm

Many DL algorithms are suited to the framework that we propose. We confine the discussion to standard algorithms, which aim to iteratively improve the dictionary and whose iterations have two stages: (i) sparse coding, in which the representation matrix X is computed for fixed dictionary D ; and (ii) dictionary update, in which D is improved, possibly together with the nonzero elements of X , but without changing the nonzero positions. Such algorithms are impervious to dictionary size changes; atoms can be removed or added between iterations without any change in the algorithm. Since they give good results and are fast, we adopt some of the simplest algorithms: Orthogonal Matching Pursuit (OMP) [29] for sparse coding and Approximate K-SVD (AK-SVD) [30] for dictionary update.

3. Algorithm

We assume first that the sparsity level is known. Our strategy is implemented by Algorithm 1, named ITC-ADL. Starting with an initial dictionary of size n init , we run the DL algorithm. We change the size only every c iterations, to allow the current set of atoms to be sufficiently trained together. Thus, the selection based on ITC described in Section 2.2 can be meaningful.
Although we could compute ITC values for all dictionary sizes between n min and the current n, it is more efficient to consider only a smaller number of candidates n cand ; note that the representations have to be recomputed for each sub-dictionary; although this can be done economically (see below the discussion on complexity), it may become a significant burden. Let us denote n ITC the size with minimum ITC value among the n cand largest possible dictionaries (those with sizes n, n 1 , …, n n cand + 1 ).
The main question is now how to use this value. If n ITC < n , we might be tempted to continue the learning process with only n ITC atoms. However, this seems to be (assertion confirmed by numerical experiments) a too drastic decision, that can easily lead to premature shrinking of the dictionary. Instead, we take n ITC as an indicator of the direction where n must evolve. If n ITC is much smaller than the current n, we decrease the size by δ (this number is 5 in our experiments); if n ITC is only slightly smaller, than we decrease the size by one; finally, we interpret n ITC = n as a sign that the size needs to be increased and add δ + atoms to the dictionary (we take δ + = 5 ). There are several methods for generating new atoms [3] (Section 3.9); we choose the simplest: random atoms.
The whole algorithm is run, as typical in DL, for a preset number K of iterations. Only at the end of these iterations we take n ITC as a true size information. With this size, we run c more DL iterations, as a final refinement.
Algorithm 1: ITC-ADL: DL with ITC-adapted dictionary size.
Algorithms 12 00178 i001
It is relatively hard to estimate the complexity of ITC-ADL, due to the dictionary size variations. We describe only the extra operations with respect to a standard DL algorithm, disregarding the size. There are two main categories of operations that increase the complexity. The first is the total number of iterations, that has to be larger than for standard DL, in order to let the dictionary size converge. The ITC give reliable information if the dictionary is well trained, hence neither K nor c can be small. In the tests reported below, we took K 200 and c = 5 .
The second extra operation is the computation of ITC, which involves the recomputation of the representations for each dictionary. Since we already have the representations X for the full dictionary of size n, we can progressively recompute only the representations that change as the size decreases. For example, the dictionary of size n 1 lacks atom d n , but is otherwise identical with D . We need to recompute only the representations that contain d n . Their number should be less than ( N s ) / n , since atom d n , which is the least used, should appear in less representations than the average. Thus, overall, the number of recomputed representations is likely bounded by ( n cand s / n ) N , which is comparable with N (the number of signals represented at each iteration); since this happens only every cth iteration, the extra complexity is relatively small.
In the experimental conditions described in the next section, ITC-ADL is about 3–5 times slower than the underlying AK-SVD, which is one of the fastest DL algorithms. This is not an excessive computational burden.
If the sparsity level s is not known, we simply run ITC-ADL for several candidate values; the best ITC value decides the best ( n , s ) pair. Adapting also s during the algorithm may be possible, but seems more difficult than adapting only the size and was left for future work.

4. Numerical Results

We tested our algorithm on synthetic data, obtained with “true” dictionaries D true whose unit norm atoms are generated randomly following a Gaussian distribution. Given the sparsity level s, the representations X are also generated randomly, with nonzeros on random positions. The data are Y = D true X + U , where the entries of U are statistically independent, Gaussian distributed, with zero mean. The variance of the additive noise is chosen to have four different values for the signal-to-noise ratio (SNR): 10 dB, 20 dB, 30 dB and 40 dB. The dictionary sizes are n { 128 , 192 , 256 } ; the overcompleteness factor n / m has moderate values, as in most applications.
Some of the input data for Algorithm 1 are constant throughout all the experiments. The number of signals is N = 4000 and their size is m = 64 . The number of iterations is K { 200 , 300 , 400 } , increasing with n. The dictionary size changes are made every c = 5 iterations. The size steps are δ + = δ = 5 . The number of size candidates is n cand = 20 . All the results are obtained with 50 runs for the same data, but with different realizations of D true and U .
We report in this section only a few representative results. More results can be found in Appendix A. Representative Matlab sources are given at

4.1. Experiments with Known Sparsity Level

In all the experiments reported in this section, ITC-ADL is in possession of the true value of s, which takes even values from 4 to 12. In the first round of experiments, the dictionary D true has n true = 128 atoms. The size n init of the initial dictionary takes random values between 80 and 180, in order to test the robustness of the algorithm to initialization.
Table 1 reports the average, minimum and maximum size n computed by ITC-ADL over the 50 runs. Our algorithm is able to find very good estimates of the size, for all considered noise and sparsity level values. An important conclusion is that the size of the initial dictionary has no impact on the results.
We evaluate the performance of the dictionaries given by ITC-ADL on 1000 test signals generated similar to the training ones. For comparison, we compute the RMSE for two dictionaries: (i) D true , for which the representations are computed with OMP; and (ii) the dictionary computed by AK-SVD with the true size n = n true and the same number of iterations ( K = 200 , in this case). We denote RMSEt and RMSEn the RMSE obtained with these dictionaries. Both approaches have an advantage over ours: the first has the true dictionary, so it is in fact an oracle; the second uses the same DL algorithm that we use, but knows the size.
Table 2 shows the average value of the ratios RMSE/RMSEt and RMSE/RMSEn. Values below 1 mean that our algorithm is better. Our algorithm always gives worse results than the oracle, which is expected, but the difference is often very small. Compared with AK-SVD, our algorithm is superior in all cases, the advantage growing with the SNR. Another conclusion that could be drawn is that the problem becomes harder as s and the SNR grow. The first part is natural; for the second part, an explanation can be that, as the SNR grows, there are fewer local minima with values close to the global one; our algorithm appears to be able to find them, while AK-SVD may be trapped in poor local minima; running it with several initializations can improve the results.
The results for n true = 192 are only slightly worse, thus we jump to those for n true = 256 , shown in Table 3. The size n init takes random values between 160 and 360. Now, some of the harder problems with large sparsity level ( s 10 ) are no longer well solved: some size estimations are wrong. However, the good behavior of the RMSE persists: most results are near-oracle and clearly better than AK-SVD.
We can see now some differences between the two ITC: ERNML 1 tends to overestimate n, but hardly ever underestimates it, while EBIC is more prone to underestimation. However, in most setups, both ITC give sizes that are near from the true one. Regarding the RMSE (see Table 4), the situation is somewhat reversed: ERNML 1 is slightly better than EBIC.

4.2. Execution Times and Discussion of Parameter Values

We present here some characteristics of ITC-ADL based on experimental evidence.
The average running times of ITC-ADL, in the configurations described above, are shown in Table 5, together with those of AK-SVD. Over the considered n and s values, the ratio between the execution times of ITC-ADL and AK-SVD varies between 3.34 and 4.12 .
Without reporting any actual times, we note that the ITC and the SNR have almost no influence. In addition, the execution time is roughly proportional with the number of iterations K and the number of signals N. The other parameters have obvious influences: more frequent dictionary size changes (smaller c) increase the time; same effect has a larger number of candidates n cand ; the size steps δ + and δ only slightly affect the time. We note that our implementation is not fully optimized, but it is based on a very efficient implementation of AK-SVD [30].
We chose the parameter values for the experiments reported in the previous section with the aim to show that a single set of values ensures good results for all considered dictionary sizes and sparsity levels. In fact, ITC-ADL is quite robust to the parameter values. Nevertheless, fine tuning is possible when n and s have a more limited range of values. We present below a few results that show the effect of the parameters on the outcome of ITC-ADL. We considered only the cases n { 128 , 256 } , s { 6 , 10 } , SNR = 30 dB; the ITC is ERNML 1 . Since we have run again the algorithm, some results may be different from those from the previous section, due to the random factors involved.
We gave the size steps δ + and δ values between 2 and 8. For n = 128 , the results are very similar. For n = 256 , a trend is (barely) visible in Table 6: larger size steps lead to poorer results. This is natural, since a large dictionary increase is a perturbing factor when the algorithm is near convergence. On the other hand, a small size step is not useful in the first iterations of the algorithm; if the initial size is far from the true one, the convergence can be very slow. Thus, although we have obtained good results with constant δ + = δ = 5 , some refinements are certainly possible. It makes sense to decrease δ + , δ as the algorithm evolves. This allows fine tuning of the size when the algorithm approaches convergence.
The number of candidates n cand has more influence on the results but again this is visible especially for n = 256 . Table 7 shows the average dictionary sizes given by ITC-ADL for n cand { 10 , 15 , 20 , 25 } . It is clear that a larger number of candidates is beneficial; since this leads to a larger execution time, a compromise is necessary. This was our reason for taking n cand = 20 . However, for n = 128 , even n cand = 10 seems sufficient, as the sizes (not shown here) are virtually the same for all considered n cand values.
We turn now to the parameter c, the number of iterations between size changes. Intuitively, small values of c lead to better results, but with more computational effort. Table 8 shows the results for n = 256 , for c { 3 , 4 , 5 , 6 , 8 , 10 } . The effect of c is visible for the more difficult problems. For small n and s, a larger c can give faster good results.
Finally, we partially justify our choices for the number of iterations. Generally, the rule is simple: more iterations lead to better results. However, there are many factors that influence the convergence speed and there are many random elements in the algorithm (the initial dictionary, the added atoms, the initial size and, of course, the noise). We illustrate in Figure 1 the evolution of the main variables, the current dictionary size (as given by the ITC) and RMSE (on the training data), for n = 128 , s = 10 and 10 runs of ITC-ADL with random initial sizes. The values are shown for iterations that are multiple of c; the size is constant between them and the RMSE typically decreases. It is visible that, after 100 iterations, the size is close to the true one and the RMSE is at about its final (and almost optimal, as we have seen) value. Typically, the size becomes larger than the true one when the RMSE reaches the lowest value. Then, ITC-ADL gradually trims the dictionary almost without changing the RMSE. Thus, if our purpose is dictionary learning (and perfect recovery is not sought because there is actually no true dictionary), the number of iterations can be smaller than for dictionary recovery.

4.3. Experiments with Unknown Sparsity Level

Using the same setup as in Section 4.1, we run ITC-ADL with several values of s, not only the true one, and choose the sparsity level that gives the best ITC value. Otherwise, the algorithm is unchanged and estimates the dictionary size as usual. Table 9 presents the average values of the sparsity level and of the dictionary size given by the above procedure for n true = 128 .
The variance of the estimated s is very low: typically, only two values are obtained. For example, when s = 8 , for SNR = 10 dB, only values of 6 and 7 are obtained; when SNR=40 dB, only values of 8 and 9 are obtained; etc. One can see that the sparsity level estimations are rather accurate; the only deviation is at low SNR, when s is underestimated. The dictionary size is well estimated, in accordance with the previous results.
A competing algorithm is Adaptive ITKrM [9] (Matlab sources available at, reported to give very good results in DR problems. However, it seems that this algorithm needs more signals than ours. In our setup with N = 4000 , A-ITKrM gives very poor results. Table 10 gives results for N = 20,000, where A-ITKrM becomes competitive. Due to time constraints, the results are averaged over only 20 runs. The table shows the obtained average sparsity level, dictionary size, and ratio RMSE/RMSEi, where RMSE is the error of our algorithm (using ERNML 1 ) and RMSEi is the error of A-ITKrM. Since A-ITKrM severely underestimates s, we have computed RMSEi (and RMSE) using OMP with the true s. Even so, our algorithm finds a more accurate version of the dictionary. While ITKrM gives very good estimates of n and a good approximation of the dictionary, it seems unable to refine the dictionary as well as our algorithm.
In the current unoptimized implementations, the execution times of ITC-ADL and A-ITKrM are comparable, our algorithm being faster for small s and slower for large s. Limited trials with N = 50,000 suggest that the above remarks continue to stand true.

5. Conclusions

We present a dictionary learning algorithm that works for unknown dictionary size. Using specialized Information Theoretic Criteria, based on BIC and RNML, the size is adapted during the evolution of a standard DL algorithm (AK-SVD in our case). Experimental results show that the algorithm is able to discover the true size of the dictionary used to generate data and, somewhat surprisingly, can give better results than AK-SVD run with the true size.
There are several possible directions for future research. Testing ITC-ADL in various applications is a first aim; they can range from direct applications, such as denoising or missing data estimation (inpainting), to more complex ones, e.g. classification. We also plan to combine the ITC idea with other DL algorithms, not only AK-SVD, especially with algorithms for which the sparsity level is not the same for all signals. Another direction is to find ITC that are suited for other types of noise, not Gaussian as here, with the final purpose of obtaining a single tool that analyzes data, finds the appropriate sparse representation model and designs the optimal dictionary.

Author Contributions

Conceptualization, B.D.; methodology, B.D. and C.D.G.; software, B.D. and C.D.G.; validation, B.D. and C.D.G.; formal analysis, B.D. and C.D.G.; investigation, B.D. and C.D.G.; writing—original draft preparation, B.D.; writing—review and editing, B.D. and C.D.G.; and visualization, B.D. and C.D.G.


This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

We present here more numerical results for supporting our algorithm. The setup is that already described in Section 4.
1. Known sparsity level. For each dictionary size n true { 128 , 192 , 256 } , we present three tables with results obtained with the two criteria, ERNML 1 and EBIC. (Some of the tables have already been presented in the main body of this paper and are only mentioned again here for full reference.) The first gives the average, minimum and maximum size n computed by ITC-ADL. The second shows RMSE information on the training data and the third shows RMSE information on test data. These tables are as follows: for n true = 128 , Table 1, Table A1 and Table 2; for n true = 192 , Table A2, Table A3 and Table A4 (the size n init takes random values between 120 and 280); and for n true = 256 , Table 3, Table A5 and Table 4.
2. Unknown sparsity level. Results for n true = 192 are given in Table A6.
Table A1. Ratios RMSE/RMSE t and RMSE/RMSE n on the training data, when n true = 128 . Left: ERNML 1 . Right: EBIC.
Table A1. Ratios RMSE/RMSE t and RMSE/RMSE n on the training data, when n true = 128 . Left: ERNML 1 . Right: EBIC.
SNR (dB) s = 4 681012SNR (dB) s = 4 681012
Table A2. Minimum (top), average (middle) and maximum (bottom) sizes computed when n true = 192 . Left: ERNML 1 . Right: EBIC.
Table A2. Minimum (top), average (middle) and maximum (bottom) sizes computed when n true = 192 . Left: ERNML 1 . Right: EBIC.
SNR (dB) s = 4 681012SNR (dB) s = 4 681012
192192192191193 192192192185170
203199195198471 203201193193200
192192192192192 192192192192192
199195194194194 217195195194195
192192192192192 192192192192192
204194202219219 200194197202211
192192192192192 192192192192192
268260217271329 269260208222230
Table A3. Ratios RMSE/RMSE t and RMSE/RMSE n on the training data, when n true = 192 . Left ERNML 1 . Right: EBIC.
Table A3. Ratios RMSE/RMSE t and RMSE/RMSE n on the training data, when n true = 192 . Left ERNML 1 . Right: EBIC.
SNR (dB) s = 4 681012SNR (dB) s = 4 681012
Table A4. Ratios RMSE/RMSE t and RMSE/RMSE n on test data, when n true = 192 . Left: ERNML 1 . Right: EBIC.
Table A4. Ratios RMSE/RMSE t and RMSE/RMSE n on test data, when n true = 192 . Left: ERNML 1 . Right: EBIC.
SNR (dB) s = 4 681012SNR (dB) s = 4 681012
Table A5. Ratios RMSE/RMSE t and RMSE/RMSE n on the training data, when n true = 256 . Left: ERNML 1 . Right: EBIC.
Table A5. Ratios RMSE/RMSE t and RMSE/RMSE n on the training data, when n true = 256 . Left: ERNML 1 . Right: EBIC.
SNR (dB) s = 4 681012SNR (dB) s = 4 681012
Table A6. Average values of s (top) and n (bottom), when n true = 192 . Left: ERNML 1 . Right: EBIC.
Table A6. Average values of s (top) and n (bottom), when n true = 192 . Left: ERNML 1 . Right: EBIC.
SNR (dB) s = 4 681012SNR (dB) s = 4 681012


  1. Rubinstein, R.; Bruckstein, A.; Elad, M. Dictionaries for Sparse Representations Modeling. Proc. IEEE 2010, 98, 1045–1057. [Google Scholar] [CrossRef]
  2. Tosic, I.; Frossard, P. Dictionary Learning. IEEE Signal Proc. Mag. 2011, 28, 27–38. [Google Scholar] [CrossRef]
  3. Dumitrescu, B.; Irofti, P. Dictionary Learning Algorithms and Applications; Springer: Berlin, Germany, 2018. [Google Scholar]
  4. Agarwal, A.; Anandkumar, A.; Jain, P.; Netrapalli, P.; Tandon, R. Learning sparsely used overcomplete dictionaries. In Proceedings of the Conference on Learning Theory, Barcelona, Spain, 13–15 June 2014; pp. 123–137. [Google Scholar]
  5. Arora, S.; Ge, R.; Moitra, A. New algorithms for learning incoherent and overcomplete dictionaries. In Proceedings of the Conference on Learning Theory, Barcelona, Spain, 13–15 June 2014; pp. 779–806. [Google Scholar]
  6. Hillar, C.; Sommer, F. When can dictionary learning uniquely recover sparse data from subsamples? IEEE Trans. Inf. Theory 2015, 61, 6290–6297. [Google Scholar] [CrossRef]
  7. Sun, J.; Qu, Q.; Wright, J. Complete dictionary recovery over the sphere I: Overview and the geometric picture. IEEE Trans. Inf. Theory 2017, 63, 853–884. [Google Scholar] [CrossRef]
  8. Ramirez, I.; Sapiro, G. An MDL framework for sparse coding and dictionary learning. IEEE Trans. Signal Proc. 2012, 60, 2913–2927. [Google Scholar] [CrossRef]
  9. Schnass, K. Dictionary learning-from local towards global and adaptive. arXiv 2018, arXiv:1804.07101. [Google Scholar]
  10. Rusu, C.; Dumitrescu, B. Stagewise K-SVD to Design Efficient Dictionaries for Sparse Representations. IEEE Signal Proc. Lett. 2012, 19, 631–634. [Google Scholar] [CrossRef]
  11. Marsousi, M.; Abhari, K.; Babyn, P.; Alirezaie, J. An Adaptive Approach to Learn Overcomplete Dictionaries With Efficient Numbers of Elements. IEEE Trans. Signal Proc. 2014, 62, 3272–3283. [Google Scholar] [CrossRef]
  12. Feng, J.; Song, L.; Yang, X.; Zhang, W. Sub clustering K-SVD: Size variable dictionary learning for sparse representations. In Proceedings of the 2009 16th IEEE International Conference on Image Processing (ICIP), Cairo, Egypt, 7–10 November 2009; pp. 2149–2152. [Google Scholar]
  13. Mazhar, R.; Gader, P.D. EK-SVD: Optimized Dictionary Design for Sparse Representations. In Proceedings of the 2008 19th International Conference on Pattern Recognition, Tampa, FL, USA, 8–11 December 2008; pp. 1–4. [Google Scholar]
  14. Rao, N.; Porikli, F. A Clustering Approach to Optimize Online Dictionary Learning. In Proceedings of the International Conference on Acoustics Speech Signal Proc. (ICASSP), Kyoto, Japan, 25–30 March 2012; pp. 1293–1296. [Google Scholar]
  15. Yaghoobi, M.; Blumensath, T.; Davies, M. Dictionary Learning for Sparse Approximations with the Majorization Method. IEEE Trans. Signal Proc. 2009, 57, 2178–2191. [Google Scholar] [CrossRef]
  16. Dang, H.; Chainais, P. Towards dictionaries of optimal size: a Bayesian non parametric approach. J. Signal Proc. Syst. 2018, 90, 221–232. [Google Scholar] [CrossRef]
  17. Zhou, M.; Chen, H.; Paisley, J.; Ren, L.; Li, L.; Xing, Z.; Dunson, D.; Sapiro, G.; Carin, L. Nonparametric Bayesian Dictionary Learning for Analysis of Noisy and Incomplete Images. IEEE Trans. Image Proc. 2012, 21, 130–144. [Google Scholar] [CrossRef] [PubMed]
  18. Huang, Y.; Paisley, J.; Lin, Q.; Ding, X.; Fu, X.; Zhang, X. Bayesian nonparametric dictionary learning for compressed sensing MRI. IEEE Trans. Image Proc. 2014, 23, 5007–5019. [Google Scholar] [CrossRef] [PubMed]
  19. Jung, A.; Eldar, Y.; Görtz, N. On the minimax risk of dictionary learning. IEEE Trans. Inf. Theory 2016, 62, 1501–1515. [Google Scholar] [CrossRef]
  20. Schwarz, G. Estimating the Dimension of a Model. Ann. Stat. 1978, 6, 461–464. [Google Scholar] [CrossRef]
  21. Chen, J.; Chen, Z. Extended Bayesian information criteria for model selection with large model spaces. Biometrika 2008, 95, 759–771. [Google Scholar] [CrossRef]
  22. Rissanen, J. MDL denoising. IEEE Trans. Inf. Theory 2000, 46, 2537–2543. [Google Scholar] [CrossRef]
  23. Rissanen, J. Information and Complexity in Statistical Modeling; Springer: Heidelberg, Germany, 2007. [Google Scholar]
  24. Roos, T.; Myllymäki, P.; Rissanen, J. MDL denoising revisited. IEEE Trans. Signal Proc. 2009, 57, 3347–3360. [Google Scholar] [CrossRef]
  25. Liski, E.; Liski, A. Minimum Description Length Model Selection in Gaussian Regression under Data Constraints. In Statistical Inference, Econometric Analysis and Matrix Algebra; Schipp, B., Kräer, W., Eds.; Physica-Verlag: Heidelberg, Germany, 2009; pp. 201–208. [Google Scholar]
  26. Giurcăneanu, C.; Razavi, S.; Liski, A. Variable selection in linear regression: Several approaches based on normalized maximum likelihood. Signal Process. 2011, 91, 1671–1692. [Google Scholar] [CrossRef]
  27. Akaike, H. Autoregressive model fitting for control. Ann. Inst. Stat. Math. 1971, 23, 163–180. [Google Scholar] [CrossRef]
  28. Hannan, E.J.; Quinn, B.G. The Determination of the Order of an Autoregression. J. R. Stat. Soc. Ser. B (Methodol.) 1979, 41, 190–195. [Google Scholar] [CrossRef]
  29. Pati, Y.; Rezaiifar, R.; Krishnaprasad, P. Orthogonal Matching Pursuit: Recursive Function Approximation with Applications to Wavelet Decomposition. In Proceedings of the 27th Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, USA, 1–3 November 1993; Volume 1, pp. 40–44. [Google Scholar]
  30. Rubinstein, R.; Zibulevsky, M.; Elad, M. Efficient Implementation of the K-SVD Algorithm Using Batch Orthogonal Matching Pursuit; Technical Report CS-2008-08; Technion University: Haifa, Israel, 2008. [Google Scholar]
Figure 1. Evolution of the dictionary size (left) and of the RMSE (right) during 10 runs of ITC-ADL, with n = 128 , s = 10 .
Figure 1. Evolution of the dictionary size (left) and of the RMSE (right) during 10 runs of ITC-ADL, with n = 128 , s = 10 .
Algorithms 12 00178 g001
Table 1. Minimum (top), average (middle) and maximum (bottom) sizes computed when n true = 128 . Left: ERNML 1 . Right: EBIC.
Table 1. Minimum (top), average (middle) and maximum (bottom) sizes computed when n true = 128 . Left: ERNML 1 . Right: EBIC.
SNR (dB) s = 4 681012SNR (dB) s = 4 681012
128128128128128 128128128128124
143132130130143 143135128129130
128128128128128 128128128128128
130129129131137 130131132129131
128128128128128 128128128128128
129133130131138 130129132133135
128128128128128 128128128128128
129129143132140 129132133137132
Table 2. Ratios RMSE/RMSE t and RMSE/RMSE n on test data, when n true = 128 . Left: ERNML 1 . Right: EBIC.
Table 2. Ratios RMSE/RMSE t and RMSE/RMSE n on test data, when n true = 128 . Left: ERNML 1 . Right: EBIC.
SNR (dB) s = 4 681012SNR (dB) s = 4 681012
Table 3. Minimum (top), average (middle) and maximum (bottom) sizes computed when n true = 256 . Left: ERNML 1 . Right: EBIC.
Table 3. Minimum (top), average (middle) and maximum (bottom) sizes computed when n true = 256 . Left: ERNML 1 . Right: EBIC.
SNR (dB) s = 4 681012SNR (dB) s = 4 681012
256256256254330 256256222206225
268270261566616 277260257257386
255256256256256 256256256256239
288261260260444 275261260260273
256256256256256 256256256256236
312335357374477 312329267287291
256256256256256 256256256256256
343343357422489 343336283320322
Table 4. Ratios RMSE/RMSE t and RMSE/RMSE n on test data, when n true = 256 . Left: ERNML 1 . Right: EBIC.
Table 4. Ratios RMSE/RMSE t and RMSE/RMSE n on test data, when n true = 256 . Left: ERNML 1 . Right: EBIC.
SNR (dB) s = 4 681012SNR (dB) s = 4 681012
Table 5. Average execution times (in seconds) of ITC-ADL (with ERNML 1 ) and AK-SVD.
Table 5. Average execution times (in seconds) of ITC-ADL (with ERNML 1 ) and AK-SVD.
nAlgorithm s = 4 681012
Table 6. Average dictionary sizes given by ITC-ADL, for n = 256 , s = 6 , when δ + = δ vary.
Table 6. Average dictionary sizes given by ITC-ADL, for n = 256 , s = 6 , when δ + = δ vary.
δ + , δ 2345678
Table 7. Average dictionary sizes given by ITC-ADL, for n = 256 , for several values of n cand .
Table 7. Average dictionary sizes given by ITC-ADL, for n = 256 , for several values of n cand .
n cand = 10 152025
s = 6 263.86258.34257.30256.80
s = 10 288.18286.68280.66274.84
Table 8. Average dictionary sizes given by ITC-ADL, for n = 256 , for several values of c.
Table 8. Average dictionary sizes given by ITC-ADL, for n = 256 , for several values of c.
c = 3 456810
s = 6 257.16258.04258.32256.92257.28256.96
s = 10 265.36273.00280.36285.46282.64284.02
Table 9. Average values of s (top) and n (bottom), when n true = 128 . Left: ERNML 1 . Right: EBIC.
Table 9. Average values of s (top) and n (bottom), when n true = 128 . Left: ERNML 1 . Right: EBIC.
SNR (dB) s = 4 681012SNR (dB) s = 4 681012
Table 10. Results with A-ITKrM: average values of s (top), n (middle) and RMSE/RMSE i (bottom).
Table 10. Results with A-ITKrM: average values of s (top), n (middle) and RMSE/RMSE i (bottom).
SNR (dB) s = 4 681012

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (
Back to TopTop