Estimating the Rank of a Nonnegative Matrix Factorization Model for Automatic Music Transcription Based on Stein’s Unbiased Risk Estimator

Seokjin Lee

doi:10.3390/app10082911

School of Electronics Engineering, Kyungpook National University, Daegu 41566, Korea

Appl. Sci. 2020, 10(8), 2911; https://doi.org/10.3390/app10082911

This article belongs to the Special Issue Intelligent Speech and Acoustic Signal Processing

Abstract

In this paper, methods to estimate the number of basis vectors of the nonnegative matrix factorization (NMF) of automatic music transcription (AMT) systems are proposed. Previous studies on NMF-based AMT have demonstrated that the number of basis vectors affects the performance and that the number of note events can be a good choice for the rank of NMF. However, many NMF-based AMT methods do not provide a way to estimate the appropriate number of basis vectors; instead, the number is assumed to be given in advance, even though it significantly affects the algorithm’s performance. Recently, certain Bayesian estimation algorithms for the number of basis vectors have been proposed; however, they are not designed for music transcription and are components of specific NMF methods, and thus cannot be used generally with NMF-based transcription algorithms. Our proposed estimation algorithms are based on eigenvalue decomposition and Stein’s unbiased risk estimator (SURE). Because the SURE method requires the variance of the undesired components as a priori knowledge, the proposed algorithms estimate this value using random matrix theory and the first and second onset information of the input music signal. Experiments were conducted for the AMT task using the MIDI-aligned piano sounds (MAPS) database, and the proposed algorithms were compared with variational NMF, gamma process NMF, and NMF with automatic relevance determination. In the experiments, the conventional NMF-based transcription algorithm combined with the proposed rank estimation algorithms achieved F1 scores 2–3% higher than those of the compared algorithms. Although the performance advantage is not large, the results are meaningful because the proposed algorithms are lightweight, are easy to combine with any other NMF method that requires an a priori rank parameter, and have no tuning parameters that considerably affect the performance.

Keywords:

automatic music transcription; nonnegative matrix factorization; number of basis vectors; Stein’s unbiased risk estimator

1. Introduction

Nonnegative matrix factorization (NMF) has been developed for use with signal processing systems, including audio and music signal processing systems [1]. Lee and Seung [2] introduced the NMF problem with a parts-based representation; moreover, an algorithm to solve the problem has been developed [3]. Because the NMF algorithm can be applied in various engineering fields, diverse algorithms have been developed to implement it [4]. The NMF algorithm has been applied to various signal- and image-processing fields such as signal separation [5,6] and hyperspectral unmixing [7,8].

Furthermore, the NMF algorithm is useful for music signal processing systems [9]. Because the NMF method is a parts-based representation, it can be applied to reduce the redundancy of spectrogram data and parse them into musical notes. Using this property, the NMF algorithm can be applied to various music-signal-processing systems, including automatic music transcription (AMT) [10,11].

NMF-based AMT algorithms perform relatively well for dealing with polyphonic music transcription (MT) tasks; however, the performance of such algorithms is affected by the number of basis vectors [9]. If the number of basis vectors is set to a value less than the actual number of note events, then the algorithm can fail and miss certain notes. If the number of basis vectors is set to a value greater than the number of actual note events, additional processing is required to select or group the correct answer from the excessively numerous basis vectors [9,12]. NMF-based algorithms demonstrate the best performance when the number of basis vectors and the number of note events are equal [9].

While setting the number of basis vectors is important for producing good results, most NMF-based music transcription systems assume that the correct value is available. Recently, multiple NMF algorithms that include estimation of the number of bases have been proposed, using Bayesian inference [13], gamma process statistics [14], and automatic relevance determination (ARD) [15]. However, these algorithms have very high computational complexity, lack robustness to parameter settings, or show degraded performance because of a mismatch between the algorithm’s basic assumption and the music signal. Furthermore, because these conventional algorithms make the proper number of bases available only after the NMF bases have been estimated, they cannot be combined with other NMF methods.

Recently, a model order estimation algorithm [16] has been developed for noisy principal component analysis (PCA) based on Stein’s unbiased risk estimator (SURE) [17]. Inspired by this study, we used the SURE method to perform rank estimation for NMF-based music transcription algorithms, focusing on the similarity between order estimation for noisy PCA and rank estimation for NMF problems. A similar approach has been attempted in [18], but the method developed in [18] requires that all of the data be available simultaneously. Considering that several on-line NMF algorithms [19,20] have been developed to sequentially process large amounts of data or to be used in real-time systems, the application of that method might be significantly limited by this requirement. Thus, we aim to develop a rank estimation method that can be combined easily with existing NMF methods; consequently, the proposed algorithm uses as small a portion of the input music data as possible. Another SURE-based algorithm has been developed [21], but it is not suitable for music signal processing systems. That algorithm assumes that the product of the optimal NMF basis matrices equals the original data (i.e., there is no noise matrix), but music signals cannot satisfy that assumption.

In this study, a rank estimation algorithm, which is based on SURE, random matrix theory (RMT), and music onset information, for AMT is presented. To acquire estimation results that are suitable for a music signal, the proposed algorithm uses first and second onset information. The proposed method is similar to those of [16,18] wherein all algorithms are based on SURE; however, the algorithm for estimating the noise variance (which is crucial for properly estimating the model order) is different. The method in [16] performs a rough estimation of model order on the sole basis of RMT to estimate the noise variance, and the method in [18] uses a median filter to extract noise components without employing a probability model for noise eigenvalues. However, the proposed method uses the characteristics of NMF-based AMT methods (in particular, the fact that the NMF model has a rank of one at the first musical note or chord) and the Marchenko–Pastur (MP) probability model to enhance the NMF-based AMT methods, unlike conventional techniques.

The remainder of this paper is organized as follows. In Section 2, we further define the problem; in Section 3, we derive the proposed algorithm; and in Section 4, we exhibit and discuss simulation results of the proposed algorithm.

2. Problem Description

2.1. NMF Algorithm

The NMF algorithm decomposes a nonnegative matrix

V \in R_{+}^{K \times N}

into two nonnegative matrices

W \in R_{+}^{K \times R}

and

H \in R_{+}^{R \times N}

as

V \approx WH,

(1)

where

R_{+}^{K \times N}

is the set of matrices with K rows and N columns that contain nonnegative elements. Alternatively, the NMF model can be expressed in terms of a noise matrix

E \in R^{K \times N}

as follows:

V = WH + E .

(2)

Algorithms for solving the NMF problem and for estimating the matrices

W

and

H

from the target matrix

V

were developed in [3]. They comprise alternating estimation equations for each matrix:

W \leftarrow \arg\min_{W} C (V ‖ WH)

(3)

H \leftarrow \arg\min_{H} C (V ‖ WH)

(4)

with the constraints that

\forall W_{k r} \geq 0

and

\forall H_{r n} \geq 0

, where

C (A ‖ B)

is a distance measure between the matrices

A

and

B

,

W_{k r}

is the element in the kth row and rth column of matrix

W

, and

\forall W_{k r}

means all of the elements of matrix

W

.

Several “distance” measures, including the Euclidean distance and the Kullback–Leibler divergence, can be used to calculate

C (V ‖ WH)

. If the Euclidean distance is used,

C (V ‖ WH)

is defined as follows:

C (V ‖ WH) = \frac{1}{2} {∥ V - (WH) ∥}_{F}^{2},

(5)

where

{∥ ∥}_{F}

is the Frobenius norm [3,22]. Through the application of the multiplicative update rule [3],

W

and

H

can be iteratively estimated as follows:

W \leftarrow W \otimes [(V H^{T}) ⊘ (W H H^{T})],

(6)

H \leftarrow H \otimes [(W^{T} V) ⊘ (W^{T} WH)],

(7)

where ⊗ and ⊘ indicate element-wise multiplication and division, respectively.
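The update rules (6) and (7) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in this study: the function names, the random initialization, the iteration count, and the small constant `eps` added to the denominators to avoid division by zero are all assumptions introduced here.

```python
import numpy as np

def euclidean_cost(V, W, H):
    """Cost (5): half the squared Frobenius norm of V - WH."""
    return 0.5 * np.linalg.norm(V - W @ H, "fro") ** 2

def nmf_multiplicative(V, R, n_iter=200, eps=1e-12, seed=0):
    """Iterate the multiplicative updates (6) and (7) from a random start."""
    rng = np.random.default_rng(seed)
    K, N = V.shape
    W = rng.random((K, R)) + eps   # strictly positive initialization (assumption)
    H = rng.random((R, N)) + eps
    for _ in range(n_iter):
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update (6), element-wise
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update (7), element-wise
    return W, H

# Toy nonnegative matrix with known rank 3 (illustrative data only).
rng = np.random.default_rng(1)
V = rng.random((40, 3)) @ rng.random((3, 60))
W, H = nmf_multiplicative(V, R=3)
```

Because every factor in the updates is nonnegative, W and H remain nonnegative throughout without any explicit projection step.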

2.2. Application of NMF to AMT

The NMF algorithm is used in AMT systems to decompose a music signal into multiple note events. The input music signal is transformed into a magnitude spectrogram, expressed as a nonnegative matrix

V

. The NMF algorithm decomposes the magnitude spectrogram

V

into two nonnegative basis matrices,

W

and

H

, representing the frequency and time bases, respectively.

Figure 1 and Figure 2 show examples of the application of the NMF algorithm for music signal processing. Figure 1 shows the spectrogram of a selection from Cello Suite No. 1 Prelude by J. S. Bach (BWV 1007) comprising the notes F2, C3, A3, and G3. An example showing the application of the NMF algorithm to the music signal is shown in Figure 2. Figure 2a shows a set of column vectors of

W

, and Figure 2b shows a set of row vectors of

H

. As noted in previous studies [9,23], good estimates of the matrices

W

and

H

indicate the frequency structures and time envelopes of the musical notes in the signal, respectively. Figure 2 shows that the column vectors of

W

indicate the frequency structures of C3, F2, G3, and A3, and that the row vectors of

H

indicate the time envelopes of the same notes, respectively.

Figure 1. Spectrogram of a selection from Cello Suite No. 1 Prelude by J. S. Bach (BWV 1007) consisting of the notes F2, C3, A3, and G3. Vertical dashed lines indicate the starting points of every tenth frame.

Figure 2. Example of nonnegative matrix factorization (NMF) analysis results in terms of (a) the column vectors of

W

and (b) the row vectors of

H

.

If R is not optimal, then the desired result cannot be obtained (Figure 2). Figure 3 shows the results obtained when R is greater than the number of notes (

R = 6

); the target signal is the same as that in Figure 2. Compared with Figure 2, Figure 3 contains redundant information, namely, the third and fourth basis vectors in both

W

and

H

. Similarly, Figure 4 shows the results obtained when R is excessively small (

R = 2

). The decomposed basis cannot describe the characteristics of each musical note, with the consequence that the results do not yield any musically useful information. Based on these examples, an appropriately selected number of basis vectors is essential for obtaining useful results.

Figure 3. Example of NMF analysis for (a) the column vectors of

W

and (b) the row vectors of

H

, when the number of basis vectors is excessively large.

Figure 4. Example of the NMF analysis results for (a) the column vectors of

W

and (b) the row vectors of

H

when the number of basis vectors is excessively small.

3. Method for Estimating the NMF Rank

3.1. Rank Estimation Using Stein’s Unbiased Risk Estimator

If we define

V_{0} ≜ WH

, NMF Model (2) can be rewritten as follows:

V = V_{0} + E

(8)

According to the work of Smaragdis and Brown [9] and the example shown in Section 2, a well-decomposed basis corresponds to a musical note event. Because different musical notes have different fundamental frequencies (

f_{0}

) and because the spectrum of a note event forms a harmonic line spectrum comprising

n f_{0}

(n is an integer), the columns of the matrix

W

are linearly independent (Figure 5). Hence, the nonnegative ranks of all of the matrices

W

,

H

, and

V_{0} = WH

are equal to R. Therefore, estimating the number of basis vectors is the same as estimating a nonnegative rank in a noisy environment. Unfortunately, estimating the nonnegative rank is an NP-hard problem [24]. Because the rank and the nonnegative rank are related (the matrix rank is a lower bound of the nonnegative rank [25]), we decided to estimate the rank instead of the nonnegative rank.

Figure 5. Illustrative diagram for the NMF model and matrix rank (

R = 4

).

If we assume that the rank of

V_{0}

is R,

R_{V_{0}} = (1 / N) V_{0} V_{0}^{T}

has R non-zero eigenvalues but

R_{V} = (1 / N) V V^{T}

has more than R non-zero eigenvalues because of

E

. Therefore, the rank estimation problem is the same as the problem of identifying R signal eigenvalues in a noisy environment, which can be handled using noisy principal component analysis (PCA). The SURE algorithm is a well-known method for model order estimation in the noisy PCA problem; it selects the model order as the value that minimizes a risk function. To apply the SURE method to the NMF problem, Model (2) can be rewritten as follows:

v (n) = Wh (n) + e (n) = v_{0} (n) + e (n)

(9)

where

v (n)

,

v_{0} (n)

,

h (n)

, and

e (n)

are the nth columns of

V

,

V_{0}

,

H

, and

E

, respectively. n denotes both the column index and the discrete-time index. The SURE method selects the model order R, which is the same as the number of column vectors of the matrix

W

that minimizes the risk function [16,18]

\begin{aligned} R_{R} &= E {∥ v_{0} (n) - {\hat{v}}_{0} (n) ∥}^{2} \\ &= E {∥ e (n) - \hat{e} (n) ∥}^{2} \\ &= E {∥ \hat{e} (n) ∥}^{2} - 2 E ({\hat{e}}^{T} (n) e (n)) + K σ_{e}^{2} \end{aligned}

(10)

where

\hat{e} (n) = v (n) - {\hat{v}}_{0} (n)

,

{\hat{v}}_{0} (n)

provides the estimation of

v_{0} (n)

, and

σ_{e}^{2}

is the variance of

e (n)

.

E (\cdot)

and

∥ \cdot ∥

represent the expectation and the L2-norm, respectively, K is the length of the vector

v (n)

, and

{\hat{e}}^{T} (n)

indicates the transpose of

\hat{e} (n)

. According to Stein’s lemma [26], the second term has the relationship [18]

\begin{matrix} E ({\hat{e}}^{T} (n) e (n)) = σ_{e}^{2} E \{tr (\frac{\partial \hat{e} (n)}{\partial e^{T} (n)})\} \end{matrix}

(11)

where

tr (\cdot)

represents the trace operator of a matrix, if we assume that

e (n)

has a Gaussian distribution. Because

e (n) = v (n) - v_{0} (n)

and

\hat{e} (n) = v (n) - {\hat{v}}_{0} (n)

, (11) can be expressed as follows:

\begin{matrix} E ({\hat{e}}^{T} (n) e (n)) = σ_{e}^{2} [K - E \{tr (\frac{\partial {\hat{v}}_{0} (n)}{\partial v^{T} (n)})\}] . \end{matrix}

(12)

Therefore, the risk function

R_{R}

becomes

R_{R} = E {∥ \hat{e} (n) ∥}^{2} + 2 σ_{e}^{2} E \{tr (\frac{\partial {\hat{v}}_{0} (n)}{\partial v^{T} (n)})\} - K σ_{e}^{2} .

(13)

If the stochastic processes are assumed to be ergodic processes [18], the ensemble average can then be converted to the time average. If we consider the time average of the whole data

(n = 1, \dots, N)

, rather than the ensemble average, then (13) would be

R_{R} = \frac{1}{N} \sum_{n = 1}^{N} {∥ v (n) - {\hat{v}}_{0} (n) ∥}^{2} + \frac{2 σ_{e}^{2}}{N} \sum_{n = 1}^{N} tr (\frac{\partial {\hat{v}}_{0} (n)}{\partial v^{T} (n)}) - K σ_{e}^{2}

(14)

According to previous studies of noisy PCA, if

V_{0}

has the matrix rank R in Model (2), then

V_{0}

can be expressed by R eigenvalues and eigenvectors of

V

as [16]

{\hat{v}}_{0} (n) = \sum_{j = 1}^{R} p_{j} \frac{l_{j} - {\hat{σ}}_{R}^{2}}{l_{j}} p_{j}^{T} v (n)

(15)

where

l_{j}

is the jth eigenvalue and

p_{j}

is the jth eigenvector of

(1 / N) \sum_{n = 1}^{N} v (n) v^{T} (n)

. Furthermore,

{\hat{σ}}_{R}^{2}

can be calculated using the maximum likelihood estimator as [16]

{\hat{σ}}_{R}^{2} = \frac{1}{K - R} \sum_{j = R + 1}^{K} l_{j} .

(16)
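As a hedged illustration of (15) and (16), the sketch below computes the maximum likelihood noise estimate and the shrinkage reconstruction for a candidate rank R. The function name and the synthetic test data are assumptions made for this example; only the two formulas come from the text.

```python
import numpy as np

def shrinkage_estimate(V, R):
    """Return (V0_hat, sigma2_hat) following (15) and (16) for a candidate rank R."""
    K, N = V.shape
    l, P = np.linalg.eigh(V @ V.T / N)   # eigenvalues in ascending order
    l, P = l[::-1], P[:, ::-1]           # reorder to descending
    sigma2_hat = l[R:].sum() / (K - R)   # (16): mean of the K - R smallest eigenvalues
    d = (l[:R] - sigma2_hat) / l[:R]     # shrinkage weights d_j
    # (15) applied to every column v(n) of V at once
    V0_hat = P[:, :R] @ (d[:, None] * (P[:, :R].T @ V))
    return V0_hat, sigma2_hat

# Synthetic rank-2 signal plus Gaussian noise of variance 0.01 (illustrative only).
rng = np.random.default_rng(0)
V0 = rng.random((20, 2)) @ rng.random((2, 2000))
V = V0 + 0.1 * rng.standard_normal((20, 2000))
V0_hat, sigma2_hat = shrinkage_estimate(V, R=2)
```

With the correct rank, `sigma2_hat` should lie close to the true noise variance, and `V0_hat` should be closer to `V0` than the noisy `V` is.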

Because we require the expressions of the partial derivatives in the second term of (14), we calculate the partial derivative of (15) with respect to

v^{T} (n)

as follows:

\begin{aligned} \frac{\partial {\hat{v}}_{0} (n)}{\partial v^{T} (n)} = \sum_{j = 1}^{R} \Big( & \frac{\partial p_{j}}{\partial v^{T} (n)} d_{j} p_{j}^{T} v (n) + p_{j} \frac{\partial d_{j}}{\partial v^{T} (n)} p_{j}^{T} v (n) \\ & + p_{j} d_{j} p_{j}^{T} + p_{j} d_{j} v^{T} (n) \frac{\partial p_{j}}{\partial v^{T} (n)} \Big) \end{aligned}

(17)

where

d_{j} = (l_{j} - {\hat{σ}}_{R}^{2}) / l_{j}

.

Based on the theorem for the derivatives of eigenvectors [16], the partial derivative of the eigenvector can be expressed as [18]

\frac{\partial p_{j}}{\partial v^{T} (n)} = \frac{1}{N} \sum_{i \neq j} \frac{(p_{j}^{T} v (n)) p_{i} p_{i}^{T} + (p_{i}^{T} v (n)) p_{i} p_{j}^{T}}{l_{j} - l_{i}},

(18)

with the result that the trace of the partial derivative is

tr (\frac{\partial p_{j}}{\partial v^{T} (n)}) = \frac{1}{N} \sum_{i \neq j} \frac{p_{j}^{T} v (n)}{l_{j} - l_{i}} .

(19)

Moreover, the partial derivative of the eigenvalue is as follows:

\frac{\partial l_{j}}{\partial v^{T} (n)} = \frac{2}{N} v^{T} (n) p_{j} p_{j}^{T} .

(20)

Through the application of (18)–(20) to (14), the risk function can be expressed as follows:

\begin{aligned} R_{R} = {} & (K - R) {\hat{σ}}_{R}^{2} + \sum_{j = 1}^{R} \frac{{\hat{σ}}_{R}^{4}}{l_{j}} + 2 σ_{e}^{2} R - 2 σ_{e}^{2} {\hat{σ}}_{R}^{2} \sum_{j = 1}^{R} \frac{1}{l_{j}} \\ & + \frac{4 σ_{e}^{2} {\hat{σ}}_{R}^{2}}{N} \sum_{j = 1}^{R} \frac{1}{l_{j}} + \frac{2 σ_{e}^{2}}{N} \sum_{j = 1}^{R} d_{j} \sum_{i \neq j} \frac{l_{j} + l_{i}}{l_{j} - l_{i}} - K σ_{e}^{2} . \end{aligned}

(21)

The detailed derivation of (21) is the same as that in the Appendix of [18]. Finally, dropping the parts that do not depend on R,

\begin{aligned} R_{R} = {} & (K - R) {\hat{σ}}_{R}^{2} + \sum_{j = 1}^{R} \frac{{\hat{σ}}_{R}^{4}}{l_{j}} + 2 σ_{e}^{2} R - 2 σ_{e}^{2} {\hat{σ}}_{R}^{2} \sum_{j = 1}^{R} \frac{1}{l_{j}} \\ & + \frac{4 σ_{e}^{2} {\hat{σ}}_{R}^{2}}{N} \sum_{j = 1}^{R} \frac{1}{l_{j}} + \frac{2 σ_{e}^{2}}{N} \sum_{j = 1}^{R} d_{j} \sum_{i \neq j} \frac{l_{j} + l_{i}}{l_{j} - l_{i}} . \end{aligned}

(22)

The risk function can be calculated via (22) using the eigenvalues of the data if the noise variance

σ_{e}^{2}

is known. Therefore, determining the noise variance exactly is required to apply the SURE method to our problem. In the next section, we develop the noise variance estimation method based on random matrix theory and the properties of the NMF-based AMT method.
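The following sketch evaluates the risk (22) term by term for each candidate rank and picks the minimizer, assuming the noise variance is already known. Several details are assumptions of this example rather than statements from the text: the inner sum is taken over all K eigenvalues with i ≠ j, the eigenvalues are assumed to be distinct and the leading ones nonzero, the search is limited to a hypothetical upper bound `r_max`, and the synthetic data are purely illustrative. On small toy problems the minimizer is not guaranteed to coincide with the true rank.

```python
import numpy as np

def sure_risk(l, R, sigma2_e, N):
    """Risk (22) for candidate rank R; l holds all K eigenvalues, descending."""
    K = l.size
    s2 = l[R:].sum() / (K - R)            # sigma_hat_R^2 from (16)
    lr = l[:R]
    inv_sum = np.sum(1.0 / lr)
    d = (lr - s2) / lr                    # d_j as defined after (17)
    risk = ((K - R) * s2 + s2 ** 2 * inv_sum + 2 * sigma2_e * R
            - 2 * sigma2_e * s2 * inv_sum
            + 4 * sigma2_e * s2 / N * inv_sum)
    for j in range(R):                    # last term of (22)
        others = np.delete(l, j)          # assumption: i runs over all K, i != j
        risk += 2 * sigma2_e / N * d[j] * np.sum((l[j] + others) / (l[j] - others))
    return risk

def estimate_rank(V, sigma2_e, r_max):
    """Return the rank in 1..r_max minimizing (22), plus all risk values."""
    K, N = V.shape
    l = np.linalg.eigvalsh(V @ V.T / N)[::-1]   # eigenvalues, descending
    risks = np.array([sure_risk(l, R, sigma2_e, N) for R in range(1, r_max + 1)])
    return int(np.argmin(risks)) + 1, risks

# Synthetic rank-3 signal plus Gaussian noise of variance 0.01 (illustrative only).
rng = np.random.default_rng(0)
V = rng.random((30, 3)) @ rng.random((3, 500)) + 0.1 * rng.standard_normal((30, 500))
R_hat, risks = estimate_rank(V, sigma2_e=0.01, r_max=10)
```

Only the eigenvalues of the data covariance matrix and the noise variance enter the computation, which is why the risk can be evaluated before any NMF iteration is run.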

There is one more fact to take note of. Stein’s lemma (11) assumes that

e

has a Gaussian distribution; however, the actual distribution of

e

is not exactly Gaussian but is more similar to the Laplace distribution. Unfortunately, developing the SURE method for the Laplace distribution is very difficult, so we decide to use the Gaussian assumption here and to leave the development of the algorithm for the Laplace distribution for a future work.

3.2. Noise Variance Estimation

While the detailed derivation process differs from the method in [16], the result (22) is similar to those of the SURE-based algorithms in [16,18]. However, (22) requires the noise variance, which has a significant impact on performance. To estimate the noise variance, the method in [16] crudely estimates the rank based on the probability model, and that in [18] extracts the noise component by median filtering. Neither of these methods uses any characteristics of NMF-based AMT techniques; this study proposes a new noise variance estimation method that considers such characteristics.

As per random matrix theory, noise eigenvalues follow a Marchenko–Pastur (MP) density [16,27]. The MP density for

a \leq x \leq b

is

f_{γ, σ_{e}^{2}} (x) = \frac{γ}{2 π σ_{e}^{2} x} \sqrt{(b - x) (x - a)}, (a \leq x \leq b)

(23)

and

f_{γ, σ_{e}^{2}} (x) = 0

for

x < a

or

x > b

, where

\begin{matrix} a = σ_{e}^{2} {(1 - γ^{- 1 / 2})}^{2}, \end{matrix}

(24)

\begin{matrix} b = σ_{e}^{2} {(1 + γ^{- 1 / 2})}^{2}, \end{matrix}

(25)

and

γ = N / K

. According to [16,27], the largest noise eigenvalue converges as

l_{R + 1} \to σ_{e}^{2} {(1 + γ^{- 1 / 2})}^{2}, \quad \text{almost surely} .

(26)

Furthermore, the median value satisfies the relationship [16]

\mathrm{med} (l_{R + 1}, \dots, l_{ρ}) \to F_{γ, σ_{e}^{2}}^{- 1} (\frac{1}{2}), \quad \text{almost surely},

(27)

where

\mathrm{med} (C)

is the median of a set C,

F_{γ, σ_{e}^{2}}^{- 1}

is the inverse of the cumulative distribution function for the MP distribution, and

ρ

is the smaller of K and N.
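The density (23) and the edge convergence (26) can be checked numerically on pure Gaussian noise. The sketch below is only an illustration under assumed settings: the matrix sizes, the noise variance, the random seed, and the helper names `mp_edges` and `mp_density` are all choices made for this example.

```python
import numpy as np

def mp_edges(gamma, sigma2=1.0):
    """Support edges a and b from (24) and (25)."""
    a = sigma2 * (1 - gamma ** -0.5) ** 2
    b = sigma2 * (1 + gamma ** -0.5) ** 2
    return a, b

def mp_density(x, gamma, sigma2=1.0):
    """MP density (23); evaluates to zero outside [a, b]."""
    a, b = mp_edges(gamma, sigma2)
    x = np.asarray(x, dtype=float)
    inside = np.clip((b - x) * (x - a), 0.0, None)
    return gamma * np.sqrt(inside) / (2 * np.pi * sigma2 * x)

# Pure-noise experiment (illustrative sizes): K = 100 rows, N = 800 columns.
K, N, sigma2 = 100, 800, 0.25
rng = np.random.default_rng(0)
E = np.sqrt(sigma2) * rng.standard_normal((K, N))
l = np.linalg.eigvalsh(E @ E.T / N)[::-1]   # noise eigenvalues, descending
a, b = mp_edges(N / K, sigma2)

# Midpoint-rule check that the density carries unit mass when gamma >= 1.
x = a + (b - a) * (np.arange(100_000) + 0.5) / 100_000
mass = np.sum(mp_density(x, N / K, sigma2)) * (b - a) / 100_000
```

For matrices of this size, the largest eigenvalue `l[0]` is expected to fall close to the upper edge `b`, which is the property that (26) exploits.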

Consequently, if we know the maximum noise eigenvalue, we can estimate

σ_{e}^{2}

from that eigenvalue. Generally, it is necessary to know the value of R to determine the maximum noise eigenvalue

l_{R + 1}

. Thus, model order estimation algorithms based on SURE and RMT use the following strategy:

1. Perform a crude estimation of R.
2. Identify the parameters of the MP distribution for the noise eigenvalues.
3. Estimate $σ_{e}^{2}$.
4. Perform a precise estimation of R using the SURE method [16].

In the proposed algorithm, the MP distribution parameters are determined using the note onset information. As previously mentioned, the number of NMF basis vectors and the number of note events are the same, and there is only one note event between the first onset and the second onset. Thus,

R = 1

for the data before the second onset. If we define the partial correlation matrix as

R_{v} (N_{p}) = \frac{1}{N_{p}} \sum_{n = 1}^{N_{p}} v (n) v^{T} (n)

(28)

then

R_{v} (N_{p})

has only one signal eigenvalue (

l_{1}

), and the remaining eigenvalues (

l_{2}

,

l_{3}

, ⋯) correspond to the remaining components

e (n)

when

N_{p}

is a frame number between the first and second onsets. Therefore, we estimate the MP distribution parameters from this period, assuming that the parameters (the maximum and median values of the noise eigenvalues) do not change quickly over time.

To verify this assumption, we consider the example shown in Figure 1, where the first onset of the input music signal occurs in the tenth frame and the second onset is in the 22nd frame. Therefore, the appropriate number of basis vectors is

R = 1

for the data from the first to 21st frames because there is only one note event. Figure 6 and Figure 7 show the eigenvalues of the data matrix obtained from the music signal in the example where “x” and “o” denote the signal and noise eigenvalues, respectively; the horizontal axis shows the index of an eigenvalue; and the vertical axis shows its magnitude. Figure 6 shows the eigenvalues of the data from the first frame to the 21st frame (0.46 s), and Figure 7 shows the eigenvalues of the overall data (2.00 s). Using the partial correlation matrix mentioned above, Figure 6 and Figure 7 show the eigenvalues of

R_{v} (21)

and

R_{v} (87)

, respectively.

Figure 6. Eigenvalues of

R_{v}

for the data from

N_{p} = 21

: (a) Full-scale display, (b) scaled display for noise eigenvalues, where “x” denotes a signal eigenvalue and “o” denotes a noise eigenvalue.

Figure 7. Eigenvalues of

R_{v}

for all of the data (

N_{p} = 87

): (a) Full-scale display, (b) scaled display for noise eigenvalues, where “x” denotes a signal eigenvalue and “o” denotes a noise eigenvalue.

Figure 6a shows that a large eigenvalue

(j = 1)

corresponds to the signal

v_{0}

, and 20 small eigenvalues

(2 \leq k \leq 21)

correspond to the remaining components

e (n)

. The remaining eigenvalues

(k \geq 22)

are not shown in the graph because their values are zero, agreeing with the theory in [27]. Figure 7a shows that four large eigenvalues

(1 \leq k \leq 4)

correspond to the signal

v_{0}

, and 83 small eigenvalues

(5 \leq k \leq 87)

correspond to the remaining noisy components. Figure 6b and Figure 7b show the eigenvalues within the range of 0–100 for comparison with the noisy signal properties. Comparing the “o” marks and the “+” marks in Figure 7b, the graph shows that the noise eigenvalues of the data for

1 \leq n \leq 21

and

1 \leq n \leq 87

have similar properties, i.e., maximum value, median value, and shape.

Because

γ

is different when

N_{p} = 21

and

N_{p} = 87

, the maximum noise eigenvalue is different even if

σ_{e}^{2}

is the same. If we compensate for the difference in

γ

between the two cases, the magnitude of

l_{2} (N_{p} = 21)

, 78.04, is equivalent to a value of 26.02 when

N_{p} = 87

. While the maxima of the noise eigenvalues are not exactly the same, the difference between the noise eigenvalues of the data for

(N_{p} = 21)

and

(N_{p} = 87)

is very small (9.4 if we compensate for the change in

γ

) compared to the differences between the magnitudes of the signal eigenvalues and the noise eigenvalues for each data case. The difference between the median values is also small. Thus, these data can be used to make a crude estimation of the MP distribution parameters and

σ_{e}^{2}

.

We define

N_{\mathrm{rank1}}

as the frame number of an arbitrary frame between the first and the second onset frames. If there is only one note in

1 \leq n \leq N_{\mathrm{rank1}}

, as shown in the example,

R_{v} (N_{\mathrm{rank1}})

is a rank-1 matrix. Even if multiple notes are played simultaneously at the first onset (generally as a chord),

R_{v} (N_{\mathrm{rank1}})

can be expected to be a rank-1 matrix because the notes share the same time basis vector, as shown in Figure 8 [9]. Therefore,

R_{v} (N_{\mathrm{rank1}})

has

(N_{\mathrm{rank1}} - 1)

noise eigenvalues with indices

2 \leq k \leq N_{\mathrm{rank1}}

, which correspond to the remaining components.

Figure 8. Illustrative diagram of the NMF model for the case in which a chord is played at the first onset.

From the relationship of (26) or (27), we can estimate the noise variance by using either the maximum or the median values of the noise eigenvalues. Consequently, we propose two algorithms. In the first, the noise variance

σ_{e}^{2}

is estimated as follows:

σ_{e}^{2} = \frac{l_{2}}{{(1 + γ^{- 1 / 2})}^{2}},

(29)

using the maximum value of the noise eigenvalues. In the second version, the noise variance

σ_{e}^{2}

is estimated as follows:

σ_{e}^{2} = \frac{\mathrm{med} (l_{2}, \dots, l_{ρ})}{F_{γ, 1}^{- 1} (1 / 2)},

(30)

using the median value, because the MP distribution has the property of

F_{γ, σ_{e}^{2}}^{- 1} (x) = σ_{e}^{2} F_{γ, 1}^{- 1} (x)

[16]. Algorithm 1 shows a summary of the proposed algorithm.
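Both estimators, (29) and (30), can be sketched as follows. The text does not specify how the inverse cumulative distribution function of the MP law is evaluated, so this example assumes a simple numerical inversion: the density (23) with unit variance is integrated with the midpoint rule on a fine grid, normalized, and interpolated at 1/2. The function names, grid size, and synthetic rank-1 data are likewise assumptions made for illustration.

```python
import numpy as np

def mp_median_unit(gamma, n_grid=200_000):
    """Numerical stand-in for F^{-1}_{gamma,1}(1/2): median of the MP law (23)
    with unit variance, restricted to its continuous part on [a, b]."""
    a = (1 - gamma ** -0.5) ** 2
    b = (1 + gamma ** -0.5) ** 2
    x = a + (b - a) * (np.arange(n_grid) + 0.5) / n_grid   # grid midpoints
    f = gamma * np.sqrt((b - x) * (x - a)) / (2 * np.pi * x)
    cdf = np.cumsum(f)
    cdf /= cdf[-1]                                         # normalize to 1
    return float(np.interp(0.5, cdf, x))

def noise_variance_from_rank1(V, n_rank1, version=2):
    """Estimate sigma_e^2 from the first n_rank1 frames via (29) or (30)."""
    K = V.shape[0]
    Vp = V[:, :n_rank1]
    Rv = Vp @ Vp.T / n_rank1                     # partial correlation matrix (28)
    l = np.linalg.eigvalsh(Rv)[::-1]             # eigenvalues, descending
    gamma = n_rank1 / K
    rho = min(K, n_rank1)
    if version == 1:
        return float(l[1] / (1 + gamma ** -0.5) ** 2)           # (29), uses l_2
    return float(np.median(l[1:rho]) / mp_median_unit(gamma))   # (30), l_2..l_rho

# Synthetic rank-1 segment plus Gaussian noise of variance 0.01 (illustrative only).
rng = np.random.default_rng(0)
V = np.outer(rng.random(200), rng.random(100)) + 0.1 * rng.standard_normal((200, 100))
s2_v1 = noise_variance_from_rank1(V, n_rank1=100, version=1)
s2_v2 = noise_variance_from_rank1(V, n_rank1=100, version=2)
```

Normalizing the integrated density means the same routine also covers the case γ < 1, where the MP law places part of its mass at zero and only the nonzero eigenvalues enter the median in (30).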

Because the proposed methods were developed based on the characteristics of the AMT method, they may not be directly applicable to other NMF-related tasks. However, the key idea is to determine the noise variance using part of the data, which is possible when the task’s characteristics can be used to find a suitable rank for part of the signal. Therefore, if suitable characteristics of a target task can be found, the proposed methods can be extended to other NMF-related tasks.

Algorithm 1 Rank Estimation with SURE (RESURE)

Require: Spectrogram $V$, second onset frame number N.

1. Set the rank-1 frame number $N_{\mathrm{rank1}}$ to a value smaller than the second onset frame number.
2. Calculate the rank-1 covariance matrix $R_{v} (N_{\mathrm{rank1}})$ using (28).
3. Determine $l_{2}$, the second largest eigenvalue of $R_{v} (N_{\mathrm{rank1}})$, and estimate the noise variance using (29) (version 1) or (30) (version 2).
4. Calculate the covariance matrix $R_{V} = V V^{T} / N$.
5. Calculate the SURE risk function $R_{R}$ with the eigenvalues of $R_{V}$ using (22).
6. Determine the number of bases $\hat{R}$ in accordance with

$\hat{R} = \arg\min_{R} R_{R} .$

An onset detection algorithm must be used to determine the first and second onset frames. Many studies have proposed successful onset detection algorithms [28,29], but developing such an algorithm is beyond this study’s scope.

4. Simulation

4.1. Experimental Settings

To evaluate the proposed algorithm, the numbers of basis vectors were estimated in AMT experiments performed with piano music signals. The test dataset consisted of signals from the “AkPnCGdD” dataset of the MIDI-aligned piano sounds (MAPS) database [30]. The test signals comprised 30 piano-music signals with a sampling frequency of 44,100 Hz and 16-bit quantization, truncated at 30 s. The signals were transformed into two-dimensional data through a constant-Q transform [31] with 24 frequency bins per octave from 27.5 Hz to half the sampling frequency.

We compared the proposed algorithm with conventional NMF algorithms that estimate the number of basis vectors, including variational NMF (VAR-NMF) [13], NMF with automatic relevance determination (ARD-NMF) [15], and gamma process NMF (GaP-NMF) [14] algorithms. VAR-NMF was performed with the initial hyperparameters

a = 1

and

b = 0.005

, and the hyperparameters were optimized using the algorithm proposed in [13] with tying parameters. The parameters of the ARD-NMF algorithm were

a = 25

,

β = 0

, and

ϕ = 1

; and the parameters of the GaP-NMF algorithm were set at

α = 1

,

a = 0.5

, and

b = 0.1

, which yielded the best performance. To determine the proper parameter values, we started from the values in the original papers and then searched for the optimal parameters around those values.

The compared algorithms possessed their own estimation algorithms for the NMF basis vectors, but the proposed algorithm could be combined with any algorithm to estimate the NMF basis vectors. In this experiment, the NMF algorithm with temporal continuity and sparseness criteria, proposed by Virtanen [1], was used to estimate the NMF basis vectors after applying the proposed algorithm, with the parameters

α = 1

and

β = 0.1

. Information on the first and second onset was extracted from the MIDI data included in the database.

The estimation algorithms were evaluated by measuring the AMT performance in terms of the segment-based F1-score. All clips were divided into 0.1 s segments, and the F1-score was calculated using the following equations:

P = \frac{N_{T P}}{N_{T P} + N_{F P}}

(31)

R = \frac{N_{T P}}{N_{T P} + N_{F N}}

(32)

F_{1} = \frac{2 \times P \times R}{P + R},

(33)

where

N_{T P}

is the number of correct pitches (true positives),

N_{F P}

is the number of incorrect pitches (false positives), and

N_{F N}

is the number of missing pitches (false negatives). The pitch information of an NMF frequency basis was estimated using the subharmonic summation [32], and the onset and offset were calculated from the NMF temporal basis using a threshold that is defined by the average plus standard deviation of each basis.
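Equations (31)–(33) can be illustrated with a short sketch that counts true positives, false positives, and false negatives over binary piano rolls (pitches × 0.1 s segments). The piano-roll representation, the function name, and the convention of returning zero when a denominator vanishes are assumptions of this example.

```python
import numpy as np

def segment_f1(reference, estimate):
    """Precision (31), recall (32), and F1 (33) from boolean piano rolls."""
    ref = np.asarray(reference, dtype=bool)
    est = np.asarray(estimate, dtype=bool)
    n_tp = int(np.sum(ref & est))     # pitch active in both
    n_fp = int(np.sum(~ref & est))    # pitch reported but not played
    n_fn = int(np.sum(ref & ~est))    # pitch played but missed
    precision = n_tp / (n_tp + n_fp) if n_tp + n_fp else 0.0
    recall = n_tp / (n_tp + n_fn) if n_tp + n_fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Two pitches over four segments: 3 true positives, 1 false positive, 1 false negative.
reference = [[1, 1, 0, 0],
             [0, 1, 1, 0]]
estimate = [[1, 0, 0, 0],
            [0, 1, 1, 1]]
precision, recall, f1 = segment_f1(reference, estimate)   # each equals 0.75
```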

4.2. Experimental Results

As per previous studies [9], the optimal number of bases might be related to the number of note events, as mentioned in Section 2. Therefore, we first compare the number of bases estimated by each algorithm with the number of actual note events. To compare the overall performance, we examine the distribution of the absolute values of the errors, as shown in Figure 9. The vertical lines indicate the range between the smallest and largest errors, and the boxes indicate the quartiles: the top and bottom of the box indicate the third and first quartiles, and the horizontal bar in the middle of the box indicates the median value. Smaller values indicate better performance. The graph demonstrates that the ARD-NMF algorithm has the best performance among the conventional algorithms. The result of the proposed RESUREv2 algorithm (1st quartile: 3.25, median: 8.5, 3rd quartile: 13) is slightly better than that of the ARD-NMF algorithm (1st quartile: 4.25, median: 9.0, 3rd quartile: 16).

Figure 9. Distribution of absolute values of the estimation errors based on the number of notes. The top and bottom of the box indicate the third and first quartiles, and the horizontal bar within the box indicates the median value. The ends of the vertical bar indicate the maximum and minimum values.
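The quartiles quoted above can be checked against the rows of Table 1. The following Python sketch copies the "Number of Notes", RESUREv2, and ARD-NMF ($τ = 2.5 \times 10^{-2}$) rows and computes the quartiles of the absolute errors. The interpolation rule (linear interpolation between order statistics, `method="inclusive"`) is an assumption, since the article does not state which rule was used; with this choice the sketch reproduces the values quoted in the text.

```python
from statistics import quantiles

# Rows copied from Table 1 (clips 1-30).
NOTES = [28, 30, 34, 27, 33, 42, 27, 33, 33, 26, 32, 20, 43, 25, 59,
         52, 23, 49, 19, 39, 46, 21, 23, 27, 28, 29, 45, 36, 43, 18]
RESURE_V2 = [31, 39, 30, 14, 35, 34, 40, 18, 17, 23, 62, 18, 40, 35, 51,
             46, 4, 37, 26, 44, 53, 21, 39, 24, 29, 18, 57, 18, 62, 5]
ARD_NMF_TAU_0025 = [35, 27, 38, 12, 28, 15, 30, 19, 20, 18, 12, 27, 26, 24, 29,
                    23, 15, 34, 15, 55, 23, 22, 39, 40, 34, 31, 14, 26, 35, 22]


def abs_error_quartiles(estimated, notes):
    """Return (Q1, median, Q3) of |estimated rank - number of notes|."""
    errors = [abs(e - n) for e, n in zip(estimated, notes)]
    q1, median, q3 = quantiles(errors, n=4, method="inclusive")
    return q1, median, q3


print(abs_error_quartiles(RESURE_V2, NOTES))         # (3.25, 8.5, 13.0)
print(abs_error_quartiles(ARD_NMF_TAU_0025, NOTES))  # (4.25, 9.0, 16.0)
```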

The detailed results of the estimated NMF ranks for the database are shown in Table 1. The underlined bold number indicates the result closest to the number of musical notes for each clip, and the green highlighted value indicates a result with an error within $\pm 15\%$ of the number of musical notes. The results demonstrate that ARD-NMF (with $τ = 2.5 \times 10^{-2}$) performs best among the conventional algorithms, and the performance of the RESUREv2 algorithm is similar to or slightly better than that of the ARD-NMF algorithm. While the ARD-NMF performance is affected by the parameter $τ$ (compare the $τ = 0.005$ and $τ = 0.025$ cases), the proposed algorithm does not contain any parameters. This is a further advantage of the proposed algorithm.

Table 1. Estimated NMF ranks for 30 test signals. The underlined bold number indicates the result closest to the number of musical notes for each clip, and the green highlighted value indicates a result with an error within $\pm 15\%$ of the number of musical notes.
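The $\pm 15\%$ criterion used for the highlighting in Table 1 can be expressed as a short check. The Python sketch below applies it to the RESUREv2 row. The function name and the inclusive boundary are assumptions made for illustration; no clip in this row falls exactly on the boundary, so that choice does not affect the result.

```python
# Rows copied from Table 1 (clips 1-30).
NOTES = [28, 30, 34, 27, 33, 42, 27, 33, 33, 26, 32, 20, 43, 25, 59,
         52, 23, 49, 19, 39, 46, 21, 23, 27, 28, 29, 45, 36, 43, 18]
RESURE_V2 = [31, 39, 30, 14, 35, 34, 40, 18, 17, 23, 62, 18, 40, 35, 51,
             46, 4, 37, 26, 44, 53, 21, 39, 24, 29, 18, 57, 18, 62, 5]


def within_tolerance(estimated, notes, percent=15):
    """True if |estimated - notes| is at most `percent` % of `notes`.

    The comparison uses integers to avoid floating-point edge cases.
    """
    return 100 * abs(estimated - notes) <= percent * notes


hits = [clip
        for clip, (rank, notes) in enumerate(zip(RESURE_V2, NOTES), start=1)
        if within_tolerance(rank, notes)]
print(len(hits), hits)
```

For the RESUREv2 row, 12 of the 30 clips satisfy the criterion under these assumptions.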

While the RESUREv2 algorithm works well for many audio clips, it greatly overestimates the rank for some clips, such as #11, #23, and #27. These clips have in common that the interval between the first and second onset times is excessively short (100 ms). In those cases, RESUREv1 performs better than RESUREv2. While we cannot generalize from only three cases, we expect that RESUREv1 might complement RESUREv2 when the interval between the first and second onsets is small, despite the poorer overall performance of RESUREv1.

While several studies have demonstrated that the actual number of notes might be related to the optimal NMF rank for the matrix-factorization-based AMT method [9], they cannot guarantee that the number of notes is the best selection for the NMF rank. Therefore, we measure the AMT performance to evaluate the rank estimation performance. The detailed settings for evaluating the AMT performance are discussed in Section 4.1. The F1-score results are shown in Figure 10. While the differences in F1-score between the algorithms are not as noticeable as the differences in the estimation of the number of notes, the algorithms whose estimated rank is close to the number of notes show better F1-scores. ARD-NMF performs best (1st quartile: 40.89, median: 43.23, 3rd quartile: 48.63) among the conventional algorithms, and the RESUREv2 algorithm performs slightly better (1st quartile: 41.20, median: 45.61, 3rd quartile: 51.25) than ARD-NMF. Just as in the case of the rank estimation results, the ARD-NMF F1-score results are similar to those of RESUREv2, but the ARD-NMF performance is noticeably affected by the parameter $τ$. The F1-score advantage of the proposed algorithm (an improvement of approximately 2–3% over the conventional algorithms) is not large. However, the proposed algorithm neither depends on a particular NMF method nor requires sensitive design parameters; therefore, it is meaningful that its performance is improved rather than degraded.

Figure 10. Box plots for the F1-score performances. The top and bottom of the box indicate the third and first quartiles, respectively, and the horizontal bar inside the box indicates the median value. The ends of the vertical bar indicate the maximum and minimum values.

To analyze the rank estimation and F1-score results, we plot the precision and recall of each algorithm (RESUREv2, ARD-NMF, GaP-NMF) versus the difference between the estimated rank and the number of musical notes in each audio clip, as shown in Figure 11. The x-axis of each graph indicates the estimated rank minus the number of notes. The graphs demonstrate that small estimated ranks tend to yield high precision and low recall, and vice versa. However, certain extremely small ranks (indicated by the dashed red circle in Figure 11a) lead to both low precision and low recall. In our view, the proposed algorithm performs better because it avoids the severe underestimation of the rank that causes both low precision and low recall.

Figure 11. Scatter plots for (a) precision and (b) recall versus the difference between the estimated rank and the number of notes (estimated rank minus number of notes).

To compare the performance of the proposed algorithm with that of the conventional SURE-based rank estimation method, we measured the estimation error of the number of notes and the F1-score performance of the conventional SURE method [16]. The input data of the conventional SURE method were the same as those of the proposed method. Like the proposed method, the conventional SURE method does not require parameter settings. Figure 12a shows the absolute values of the estimation errors (in the same format as Figure 9), and Figure 12b shows the F1-score results (in the same format as Figure 10). The graphs show that the conventional SURE method fails to determine the proper number of notes, and its F1-score results are also degraded.

Figure 12. Performance comparisons of the proposed method with the conventional SURE-based method [16] by (a) the absolute values of estimation errors for the number of notes and (b) the F1-score performances. The top and bottom of the box indicate the third and first quartiles, respectively, and the horizontal bar inside the box indicates the median value. The ends of the vertical bar indicate the maximum and minimum values.

Figure 13 shows the elapsed times of the algorithms for processing a 30-second-long audio clip. The algorithms were executed using MATLAB on a laptop with a 3.5-GHz dual-core CPU and 16-GB memory. The RESUREv1 and RESUREv2 algorithms show nearly the same elapsed times. The conventional NMF algorithm combined with the proposed RESUREv2 algorithm requires the least time (85.12 s), which makes the combination worthwhile in terms of computation time. Furthermore, the RESURE algorithms use much less computation time than the NMF methods, making them easy to combine with various NMF methods. The measured elapsed time of the RESURE algorithm does not include the onset detection process; however, including it should make no significant difference because many lightweight onset detection techniques already exist.

Figure 13. Comparison chart for elapsed times of algorithms. The elapsed times were measured for processing a 30-second-long audio clip using MATLAB on a laptop with a 3.5-GHz dual-core CPU and 16-GB memory.

To compare the complexity of the proposed algorithm with the rank-estimation parts of the conventional methods, we measured the elapsed times for the rank-selection parts of each algorithm (i.e., the parts for calculating the lower bounds of marginal likelihoods in the VAR-NMF and GaP-NMF algorithms, and the part for calculating relevance in the ARD-NMF algorithm) while varying the input-signal length from 10 s to 40 s. Figure 14 shows the comparison results. As in Figure 13, the elapsed times of RESUREv1 and RESUREv2 are nearly identical; therefore, we display the elapsed time of RESUREv2 only. The graph shows that the proposed method requires the least time to estimate the rank regardless of the input-signal length.

Figure 14. Chart comparing the elapsed times of the parts of various algorithms related to model order selection. The elapsed times were measured using MATLAB on a laptop with a 3.5-GHz dual-core CPU and 16-GB memory.

5. Conclusions

This paper developed an algorithm for estimating the NMF rank in music signal processing systems. As has been mentioned in previous studies, music signal processing systems based on NMF algorithms perform well when the order is the same as the number of independent notes or chords. Thus, for NMF-based music signal processing systems, estimating the proper NMF rank is important.

The proposed NMF rank estimation method was based on the SURE method. The SURE method assumes that the noise variance is known, so an estimate of the noise variance is required. The conventional method estimates the noise variance using RMT, but the SURE-RMT algorithm is designed to estimate the rank for general signal analysis rather than specifically for music signal processing systems. The proposed algorithm uses the onset information to estimate the number of basis vectors. The input data from the beginning to the second onset are assumed to contain one basis; thus, the noise variance was estimated accurately using the eigenvalues of the correlation matrix of the partial data. We proposed two rank estimation algorithms, i.e., RESUREv1 and RESUREv2, which use the maximum and median values, respectively, of the noise eigenvalues.
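The idea summarized above, taking the maximum (RESUREv1) or median (RESUREv2) of the noise eigenvalues of the partial data, can be sketched in Python with NumPy. This is only an illustration under stated assumptions, not the author's implementation: the correlation matrix is assumed to be the sample average of outer products of the spectrogram columns, the single largest eigenvalue is treated as the signal eigenvalue, and the function name and synthetic data are hypothetical. The subsequent SURE-based rank selection is not shown.

```python
import numpy as np


def noise_level_candidates(v_partial, n_signal=1):
    """Return (max, median) of the noise eigenvalues of the partial data.

    v_partial: array of shape (frequency_bins, frames) covering the input
    from the beginning to the second onset (assumed to hold one basis).
    n_signal: number of leading eigenvalues treated as signal (assumed 1).
    """
    v_partial = np.asarray(v_partial, dtype=float)
    n_frames = v_partial.shape[1]
    # Sample correlation matrix (assumed form: average outer product).
    r_v = v_partial @ v_partial.T / n_frames
    # eigvalsh returns ascending eigenvalues; reverse to descending order.
    eigenvalues = np.linalg.eigvalsh(r_v)[::-1]
    noise = eigenvalues[n_signal:]
    return float(noise.max()), float(np.median(noise))


# Synthetic example: one dominant spectral template plus weak noise.
rng = np.random.default_rng(0)
template = rng.random((8, 1))
activation = rng.random((1, 50))
v_demo = 10.0 * template @ activation + 0.1 * rng.random((8, 50))
v1_level, v2_level = noise_level_candidates(v_demo)
print(v1_level, v2_level)
```

Because the maximum of the noise eigenvalues is never smaller than their median, the RESUREv1-style value is always at least as large as the RESUREv2-style value in this sketch.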

Simulations for estimating the NMF rank were performed to evaluate the proposed algorithm. The simulations were performed using 30 test signals taken from the MAPS database. The evaluation results demonstrated that the proposed algorithm estimated the NMF rank to be closer to the number of musical notes than the conventional NMF rank estimation algorithms such as VAR-NMF, ARD-NMF, and GaP-NMF. Furthermore, music transcription evaluations were performed for the proposed rank estimation method combined with the conventional NMF algorithm, and the results showed that the proposed method is effective compared with the conventional NMF-based music transcription algorithms. Although the F1-score advantages over ARD-NMF and GaP-NMF are not large, the results are still meaningful because the proposed algorithms are lightweight, are easy to combine with any other NMF method that requires the rank a priori, and have no setting parameters that substantially affect the performance.

Funding

This research was supported by Kyungpook National University Research Fund, 2018.

Conflicts of Interest

The author declares no conflict of interest.

References

1. Virtanen, T. Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria. IEEE Trans. Audio Speech Lang. Process. 2007, 15, 1066–1074.
2. Lee, D.D.; Seung, H.S. Learning the parts of objects by non-negative matrix factorization. Nature 1999, 401, 788–791.
3. Lee, D.D.; Seung, H.S. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2001; pp. 556–562.
4. Berry, M.W.; Browne, M.; Langville, A.N.; Pauca, V.P.; Plemmons, R.J. Algorithms and applications for approximate nonnegative matrix factorization. Comput. Stat. Data Anal. 2007, 52, 155–173.
5. Wang, H.; Wang, M.; Li, J.; Song, L.; Hao, Y. A Novel Signal Separation Method Based on Improved Sparse Non-Negative Matrix Factorization. Entropy 2019, 21, 445.
6. Fu, X.; Huang, K.; Sidiropoulos, N.D.; Ma, W.K. Nonnegative Matrix Factorization for Signal and Data Analytics: Identifiability, Algorithms, and Applications. IEEE Signal Process. Mag. 2019, 36, 59–80.
7. He, W.; Zhang, H.; Zhang, L. Total variation regularized reweighted sparse nonnegative matrix factorization for hyperspectral unmixing. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3909–3921.
8. Tsinos, C.G.; Rontogiannis, A.A.; Berberidis, K. Distributed blind hyperspectral unmixing via joint sparsity and low-rank constrained non-negative matrix factorization. IEEE Trans. Comput. Imaging 2017, 3, 160–174.
9. Smaragdis, P.; Brown, J.C. Non-negative matrix factorization for polyphonic music transcription. In Proceedings of the 2003 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY, USA, 19–22 October 2003; pp. 177–180.
10. Vincent, E.; Bertin, N.; Badeau, R. Harmonic and inharmonic nonnegative matrix factorization for polyphonic pitch transcription. In Proceedings of the 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, Las Vegas, NV, USA, 31 March–4 April 2008; pp. 109–112.
11. Benetos, E.; Dixon, S.; Duan, Z.; Ewert, S. Automatic music transcription: An overview. IEEE Signal Process. Mag. 2018, 36, 20–30.
12. Park, S.H.; Lee, S.; Sung, K.M. Polyphonic Music Transcription by Nonnegative Matrix Factorization with Harmonicity and Temporality Criteria. IEICE Trans. Fundam. Electron. Commun. Comput. Sci. 2012, 95, 1610–1614.
13. Cemgil, A.T. Bayesian inference for nonnegative matrix factorisation models. Comput. Intell. Neurosci. 2009, 2009, 1–17.
14. Hoffman, M.D.; Blei, D.M.; Cook, P.R. Bayesian nonparametric matrix factorization for recorded music. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, 21–24 June 2010.
15. Tan, V.Y.; Févotte, C. Automatic relevance determination in nonnegative matrix factorization with the β-divergence. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 1592–1605.
16. Ulfarsson, M.O.; Solo, V. Dimension estimation in noisy PCA with SURE and random matrix theory. IEEE Trans. Signal Process. 2008, 56, 5804–5816.
17. Stein, C.M. Estimation of the mean of a multivariate normal distribution. Ann. Stat. 1981, 9, 1135–1151.
18. Lee, S. Estimation of the Matrix Rank of Harmonic Components of a Spectrogram in a Piano Music Signal Based on the Stein’s Unbiased Risk Estimator and Median Filter. IEICE Trans. Inf. Syst. 2019, 102, 2276–2279.
19. Guan, N.; Tao, D.; Luo, Z.; Yuan, B. Online nonnegative matrix factorization with robust stochastic approximation. IEEE Trans. Neural Netw. Learn. Syst. 2012, 23, 1087–1099.
20. Lee, S. RLS-based on-line sparse nonnegative matrix factorization method for acoustic signal processing systems. IEICE Trans. Fundam. Electron. Commun. Comput. Sci. 2013, 96, 980–985.
21. Ulfarsson, M.O.; Solo, V. Tuning parameter selection for nonnegative matrix factorization. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 6590–6594.
22. Cichocki, A.; Zdunek, R.; Phan, A.H.; Amari, S.I. Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-Way Data Analysis and Blind Source Separation; John Wiley & Sons: Hoboken, NJ, USA, 2009.
23. Wang, B.; Plumbley, M.D. Musical audio stream separation by non-negative matrix factorization. In Proceedings of the DMRN Summer Conference, Glasgow, UK, 23–24 July 2005; pp. 23–24.
24. Shitov, Y. The nonnegative rank of a matrix: Hard problems, easy solutions. SIAM Rev. 2017, 59, 794–800.
25. Cohen, J.E.; Rothblum, U.G. Nonnegative Ranks, Decompositions, and Factorizations of Nonnegative Matrices. Linear Algebra Appl. 1993, 190, 149–168.
26. Eldar, Y.C. Generalized SURE for exponential families: Applications to regularization. IEEE Trans. Signal Process. 2008, 57, 471–481.
27. Baik, J.; Silverstein, J.W. Eigenvalues of large sample covariance matrices of spiked population models. J. Multivar. Anal. 2006, 97, 1382–1408.
28. Klapuri, A. Sound onset detection by applying psychoacoustic knowledge. In Proceedings of the 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing, Phoenix, AZ, USA, 15–19 March 1999; Volume 6, pp. 3089–3092.
29. Dixon, S. Onset detection revisited. In Proceedings of the 9th International Conference on Digital Audio Effects, Montreal, QC, Canada, 18–20 September 2006; Volume 120, pp. 133–137.
30. Emiya, V.; Badeau, R.; David, B. Multipitch estimation of piano sounds using a new probabilistic spectral smoothness principle. IEEE Trans. Audio Speech Lang. Process. 2009, 18, 1643–1654.
31. Schörkhuber, C.; Klapuri, A.; Holighaus, N.; Dörfler, M. A Matlab toolbox for efficient perfect reconstruction time-frequency transforms with log-frequency resolution. In Proceedings of the Audio Engineering Society Conference: 53rd International Conference: Semantic Audio, London, UK, 27–29 January 2014.
32. Hermes, D.J. Measurement of pitch by subharmonic summation. J. Acoust. Soc. Am. 1988, 83, 257–264.

Figure 1. Spectrogram of a selection from Cello Suite No. 1 Prelude by J. S. Bach (BWV 1007) consisting of the notes F2, C3, A3, and G3. Vertical dashed lines indicate the starting points of every tenth frame.

Figure 2. Example of nonnegative matrix factorization (NMF) analysis results in terms of (a) the column vectors of $W$ and (b) the row vectors of $H$.

Figure 3. Example of NMF analysis for (a) the column vectors of $W$ and (b) the row vectors of $H$, when the number of basis vectors is excessively large.

Figure 4. Example of the NMF analysis results for (a) the column vectors of $W$ and (b) the row vectors of $H$ when the NMF rank is excessively small.

Figure 5. Illustrative diagram for the NMF model and matrix rank ($R = 4$).

Figure 6. Eigenvalues of $R_{v}$ for the data from $N_{p} = 21$: (a) Full-scale display, (b) scaled display for noise eigenvalues, where “x” denotes a signal eigenvalue and “o” denotes a noise eigenvalue.

Figure 7. Eigenvalues of $R_{v}$ for all of the data ($N_{p} = 87$): (a) Full-scale display, (b) scaled display for noise eigenvalues, where “x” denotes a signal eigenvalue and “o” denotes a noise eigenvalue.

Figure 8. Illustrative diagram of the NMF model for the case in which a chord is played at the first onset.


Table 1. Estimated NMF ranks for 30 test signals. The underlined bold number indicates the result closest to the number of musical notes for each clip, and the green highlighted value indicates a result with an error within $\pm 15\%$ of the number of musical notes.

Clip No.	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
Number of Notes	28	30	34	27	33	42	27	33	33	26	32	20	43	25	59
VAR-NMF	21	15	23	9	18	9	5	24	23	15	59	11	33	9	57
ARD-NMF $(τ = 5 \times 10^{- 3})$	23	17	24	14	20	20	17	12	13	14	22	32	14	23	14
ARD-NMF $(τ = 2.5 \times 10^{- 2})$	35	27	38	12	28	15	30	19	20	18	12	27	26	24	29
GaP-NMF	22	20	23	20	22	22	19	20	23	18	17	21	23	22	14
RESUREv1	25	22	15	9	13	19	25	58	58	12	47	13	14	15	30
RESUREv2	31	39	30	14	35	34	40	18	17	23	62	18	40	35	51
Clip No.	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30
Number of Notes	52	23	49	19	39	46	21	23	27	28	29	45	36	43	18
VAR-NMF	5	11	30	11	16	15	17	21	15	16	13	22	16	6	6
ARD-NMF $(τ = 5 \times 10^{- 3})$	17	16	12	41	19	15	20	35	19	24	22	14	15	15	53
ARD-NMF $(τ = 2.5 \times 10^{- 2})$	23	15	34	15	55	23	22	39	40	34	31	14	26	35	22
GaP-NMF	15	17	18	24	16	19	19	21	20	22	20	17	21	17	21
RESUREv1	12	35	56	20	19	20	17	32	19	15	14	36	18	12	12
RESUREv2	46	4	37	26	44	53	21	39	24	29	18	57	18	62	5

© 2020 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
