Chord Recognition Based on Temporal Correlation Support Vector Machine

In this paper, we propose a method called temporal correlation support vector machine (TCSVM) for automatic major-minor chord recognition in audio music. We first use robust principal component analysis to separate the singing voice from the music to reduce the influence of the singing voice and consider the temporal correlations of the chord features. Using robust principal component analysis, we expect the low-rank component of the spectrogram matrix to contain the musical accompaniment and the sparse component to contain the vocal signals. Then, we extract a new logarithmic pitch class profile (LPCP) feature called enhanced LPCP from the low-rank part. To exploit the temporal correlation among the LPCP features of chords, we propose an improved support vector machine algorithm called TCSVM. We perform this study using the MIREX’09 (Music Information Retrieval Evaluation eXchange) Audio Chord Estimation dataset. Furthermore, we conduct comprehensive experiments using different pitch class profile feature vectors to examine the performance of TCSVM. The results of our method are comparable to the state-of-the-art methods that entered the MIREX in 2013 and 2014 for the MIREX’09 Audio Chord Estimation task dataset.


Introduction
A musical chord can be defined as a set of notes played simultaneously. A succession of chords over time forms the harmonic core of a piece of music. Hence, a compact representation of the overall harmonic content and structure of a song often requires labeling every chord in the song. Chord recognition has been applied in many applications such as the segmentation of pieces into characteristic segments, the selection of similar pieces, and the semantic analysis of music [1,2]. With its many applications, automatic chord recognition has been one of the main fields of interest in music information retrieval in the last few years.
The basic chord recognition system has two main steps: feature extraction and chord classification. In the first step, the features used in chord recognition are typically variants of the pitch class profile (PCP) introduced by Fujishima (1999) [1]. Many publications have improved PCP features for chord recognition by addressing potentially negative influences such as percussion [2], mistuning [3,4], harmonics [5–7], or timbre dependency [5,8]. In particular, harmonic content is abundant in both musical instrument sounds and the human singing voice. However, harmonic patterns in instrument sounds are more regular than those of the singing voice, which often includes ornamental features such as vibrato that cause the frequencies of the partials to deviate significantly from perfectly harmonic values [9]. To attenuate the effect of the singing voice and consider the temporal correlations of music, we separate the singing voice from the accompaniment before obtaining the PCP.
Chord classification is performed once the features have been extracted. Modeling techniques typically use template-fitting methods [1,3,10–13], the hidden Markov model (HMM) [14–20], and dynamic Bayesian networks [21,22] for this recognition process. The template-based method has some advantages: it does not require annotated data and has a low computational cost. However, its drawbacks include the difficulty of designing the chroma-vector templates and of selecting a distance measure. The HMM method is a statistical model and its parameter estimation requires substantial training data. The recognition rate of the HMM method is typically relatively high, but it only considers the role of positive training samples without addressing the impact of negative training samples, which greatly limits its discriminative ability. The support vector machine (SVM) method can achieve a high recognition rate, but it has difficulty when cross-aliasing between classes prevents an accurate decision. The main difference between HMM and SVM is in the principle of risk minimization [23]. HMM uses empirical risk minimization, which is the simplest induction principle. In contrast, SVM uses structural risk minimization as its induction principle. This difference leads to the better generalization performance of SVM compared with HMM [24]. We present a new method for chord estimation based on a hybrid model of HMM and SVM.
The remainder of this paper is organized as follows: Section 2 reviews related chord estimation work; Section 3 describes our PCP feature vector construction method; Section 4 explains our approach; Section 5 presents the results on the MIREX'09 (Music Information Retrieval Evaluation eXchange) dataset and provides a comparison with other methods; and Section 6 concludes our work and suggests directions for future work.

Related Work
PCP is also called the chroma vector, which is often a 12-dimensional vector whereby each component represents the spectral energy or salience of a semitone on the chromatic scale regardless of the octave. The computation of the chroma representation of an audio recording is typically based either on the short-time Fourier transform (STFT) in combination with binning strategies [18,25–27] or on the constant Q transform [3,6,15,28,29]. The succession of these chroma vectors over time is often called the chromagram and this forms a suitable representation of the musical content of a piece.
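As a minimal sketch of the octave-independent binning described above, the following maps spectral peaks to the 12 pitch classes. It assumes equal temperament with A4 = 440 Hz and an input of (frequency, energy) pairs; the function names are hypothetical and the binning strategies of the cited systems may differ.

```python
import math

def pitch_class(freq_hz, a4_hz=440.0):
    """Return the pitch-class index (0 = C, ..., 11 = B) nearest to freq_hz."""
    # MIDI note number of the nearest equal-tempered semitone (A4 = 69).
    midi = round(69 + 12 * math.log2(freq_hz / a4_hz))
    return midi % 12

def chroma_from_peaks(peaks):
    """Accumulate (frequency, energy) pairs into a 12-bin chroma vector."""
    chroma = [0.0] * 12
    for freq_hz, energy in peaks:
        chroma[pitch_class(freq_hz)] += energy
    return chroma
```

Because the octave is discarded by the modulo operation, C4 (261.63 Hz) and C5 (523.25 Hz) contribute to the same bin.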
Many features have been used for chord recognition including non-negative least squares [30], chroma DCT (Discrete Cosine Transform)-reduced log pitch (CRP) [31], loudness-based chromagram (LBC) [22], and Mel PCP (MPCP) [32]. In [1], Fujishima developed a real-time chord recognition system using a 12-dimensional pitch class profile derived from the discrete Fourier transform (DFT) of the audio signal, and performed pattern matching using binary chord-type templates. Lee [6] introduced a new input feature called the enhanced pitch class profile (EPCP) using the harmonic product spectrum. Gómez and Herrera [33] used the harmonic pitch class profile (HPCP) as the feature vector, which is based on Fujishima's PCP, and correlated it with a chord or key model adapted from Krumhansl's cognitive study.
Variants of the pitch class profile (PCP) first introduced by Fujishima (1999) [1] address the potentially negative influences of percussion [2], mistuning [3,4], harmonics [5–7], or timbre dependency [5,8]. In addition to these factors, we explore ways to attenuate the influence of the singing voice. Weil introduced an additional pre-processing step for main melody attenuation [34]. To attenuate the negative influence of singing voices, we consider the amplitude similarity of neighboring musical frames belonging to the same chord and obtain the enhanced PCP. By adding a pre-processing step consisting of robust principal component analysis (RPCA), we expect the low-rank matrix to contain the musical accompaniment and the sparse matrix to contain the vocal signals. Then, the low-rank matrix can be used to calculate the features. The pre-processing considers the temporal correlations of music.
The two most popular chord estimation methods are the template-based model and the hidden Markov model. For the audio chord estimation task of MIREX 2013 and 2014, one of the most popular methods is HMM [35–41]. A binary chord template with three harmonics was also presented [42].
Template-based chord recognition methods use the chord definition to extract chord labels from a musical piece. Neither training data nor extensive music theory knowledge is used for this purpose [43]. To smooth the resulting representation and exploit the temporal correlation, low-pass and median filters are used to filter the chromagram in the time domain [10]. The template-based method only outputs fragmented transcriptions without considering the temporal correlations of music.
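To make the template-based idea concrete, the sketch below builds 24 binary major and minor triad templates and labels a chroma vector with the best match. The use of cosine similarity and the label format are assumptions made for illustration; the cited systems use various templates and distance measures.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def chord_templates():
    """Build 24 binary templates: 12 major and 12 minor triads."""
    templates = {}
    for root in range(12):
        for quality, third in (("maj", 4), ("min", 3)):
            template = [0.0] * 12
            for interval in (0, third, 7):  # root, third, fifth
                template[(root + interval) % 12] = 1.0
            templates[f"{NOTE_NAMES[root]}:{quality}"] = template
    return templates

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0

def match_chord(chroma):
    """Return the label of the template most similar to the chroma vector."""
    templates = chord_templates()
    return max(templates, key=lambda name: cosine(chroma, templates[name]))
```

Each frame is labeled independently here, which is why such a method yields fragmented transcriptions unless the chromagram is smoothed beforehand.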
The HMM can model sequences of events in a temporal grid considering hidden and visible variables. In the case of chord recognition, the hidden states correspond to the real chords that are being played, while the observations are the chroma vectors. In [35], Cho and Bello use a K-stream HMM which is then decoded using the standard Viterbi algorithm. Khadkevich and Omologo also use a multi-stream HMM, but the feature is a time-frequency reassignment spectrogram [36]. Steenbergen and Burgoyne present a chord estimation method based on the combination of a neural network and an HMM [40]. In [39], the probabilities of the HMM are not trained through expectation-maximization (EM) or any other machine learning technique, but are instead derived from a number of knowledge-based sub-models. Ni and McVicar proposed a harmony progression HMM topology that consists of three hidden and two observed variables [37]. The hidden variables correspond to the key K, the chord C, and the bass annotations B. These methods usually require a reference dataset for the learning period and entail more parameters to be trained.
In contrast, our method has only one parameter trained from the reference dataset, which is the state transition probability, and the other parameters are obtained from the SVM. The hybrid HMM and SVM model uses the respective advantages of these methods. Our novel method is called the temporal correlation support vector machine (TCSVM).
Our system is composed of two main steps: feature extraction and chord classification, as shown in Figure 1. We pre-process the audio to separate the singing voice. The system then tracks beat intervals in the music and extracts a set of vectors for the PCP. In the chord classification step, our method uses SVM classification and the Viterbi algorithm. Because we employ the temporal correlation of chords, the system can combine the SVM with the Viterbi algorithm, leading to a TCSVM. The Viterbi algorithm uses the transitions between chords to estimate the chords.

Normalized Logarithmic PCP Feature
Our system begins by extracting suitable feature vectors from the raw audio. Like most chord recognition systems, we use a chromagram or a PCP vector as the feature vector. Müller and Ewert propose a 12-dimensional quantized PCP feature vector [8,29] that determines the proper frequency resolution and is sufficient for separating musical notes by low-frequency components.
The calculation of PCP feature vectors can be divided into the following steps: (1) using the constant Q transform to calculate the 36-bin chromagram; (2) mapping the spectral chromagram to a particular semitone; (3) median filtering; (4) segmenting the audio signal with a beat-tracking algorithm; (5) reducing the 36-bin chromagram to a 12-bin chromagram based on beat-synchronous segmentation; (6) normalizing the 12-bin chromagram. The reader is referred to [15] for more detailed PCP calculation steps.
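Steps (4) and (5) can be sketched as follows. This is an illustrative simplification, not the procedure of [15]: it assumes the 36 bins are ordered so that bins 3p, 3p+1, and 3p+2 belong to pitch class p, that beat boundaries are given as frame indices, and that frames are simply averaged within each beat. The function names are hypothetical, and the median filtering of step (3) is omitted.

```python
def beat_average(frames, beat_bounds):
    """Average the chroma frames inside each [start, end) beat segment."""
    segments = []
    for start, end in zip(beat_bounds[:-1], beat_bounds[1:]):
        chunk = frames[start:end]
        if not chunk:
            continue  # skip empty segments
        n_bins = len(chunk[0])
        segments.append(
            [sum(frame[b] for frame in chunk) / len(chunk) for b in range(n_bins)]
        )
    return segments

def fold_36_to_12(vec36):
    """Sum the three sub-semitone bins of each semitone into one bin."""
    return [sum(vec36[3 * p:3 * p + 3]) for p in range(12)]
```

Applying `fold_36_to_12` to every vector returned by `beat_average` gives one 12-bin vector per beat, which is then normalized in step (6).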
In the beat-synchronous (tactus) segmentation, we use the beat-tracking algorithm proposed by Ellis [44]. This method has proven successful for a wide variety of signals. Using beat-synchronous segments has the added advantage that the resulting representation is a function of the beat, or tactus, rather than time.
Unlike most of the traditional PCP methods, we determine the normalized value using the p-norm and logarithm. The formula is as follows:
$$QPCP_{\log}(p) = \log_{10}\left[C \cdot QPCP_{12}(p) + 1\right] \qquad (1)$$
After applying the logarithm and normalization, the chromagram is called the LPCP.
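The logarithmic compression of Equation (1), followed by p-norm normalization, can be sketched as below. The default values of the constant `c` and the norm order `p` are assumptions chosen for illustration, since the text does not state them, and dividing by the p-norm is one plausible reading of the normalization step.

```python
import math

def lpcp(pcp12, c=100.0, p=2.0):
    """Log-compress a 12-bin PCP vector and scale it to unit p-norm."""
    # Equation (1): log10(C * QPCP_12(p) + 1) for each pitch class.
    logged = [math.log10(c * value + 1.0) for value in pcp12]
    norm = sum(abs(value) ** p for value in logged) ** (1.0 / p)
    # Leave an all-zero (silent) vector unchanged to avoid dividing by zero.
    return [value / norm for value in logged] if norm > 0 else logged
```

The logarithm reduces the dynamic range, so weaker chord tones are raised relative to the strongest pitch class before normalization.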
The left image of Figure 2 shows a PCP of the C major triad and the right image shows its LPCP. The strongest peaks are found at C, E, and G because the C major triad comprises three notes at C (root), E (third), and G (fifth). Figure 2 demonstrates that LPCP more clearly approximates the underlying fundamental frequencies than PCP.

Enhanced PCP with Singing-Voice Separation
To attenuate the effect of the singing voice and consider the temporal correlations of the chord, we first separate the singing voice from the accompaniment before calculating the PCP or LPCP. The resulting features are denoted as enhanced PCP (EPCP) and enhanced logarithmic PCP (ELPCP). The framework of singing voice separation is shown in Figure 3.
In general, because of the underlying repeated musical structure, we assume that music is a low-rank signal. Singing voices offer more variation and have a higher rank but are relatively sparse in the frequency domain. We assume that the low-rank matrix A represents the music accompaniment and the sparse matrix E represents the vocal signals [45]. Then, we perform the separation in two steps. First, we compute the spectrogram of the music signal, matrix D, from the STFT. Second, we use the inexact augmented Lagrange multiplier (ALM) method [45], which is an efficient algorithm for solving the RPCA problem, to solve A + E = |D|, given the input magnitude of D. Using RPCA, we can then separate matrices A and E. The low-rank matrix A can be exactly recovered from |D| = A + E by solving the following convex optimization problem:
$$\min_{A,E} \; \|A\|_* + \lambda \|E\|_1 \quad \text{subject to} \quad A + E = |D|,$$
where $\|A\|_*$ is the nuclear norm of A (the sum of its singular values), $\|E\|_1$ is the sum of the absolute values of the entries of E, and λ is a positive weighting parameter. The inexact ALM method is given in [45].
Figure 4 shows the PCP and EPCP of an audio music piece ("Baby It's You", from the Beatles' album Please Please Me). Figure 5 shows the LPCP and ELPCP of the same musical piece.
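For readers who want to experiment with the low-rank plus sparse decomposition, the following is a minimal NumPy sketch of an inexact ALM iteration for RPCA. It is not the authors' implementation: the initialization, the growth factor `rho`, the default `lam = 1/sqrt(max(m, n))`, and the stopping rule are commonly used choices assumed here, and the input `d` stands for the magnitude spectrogram |D|, whose STFT computation is not shown.

```python
import numpy as np

def soft_threshold(x, tau):
    """Element-wise shrinkage: move each entry toward zero by tau."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def rpca_inexact_alm(d, lam=None, tol=1e-7, max_iter=500, rho=1.5):
    """Split d into a low-rank part a and a sparse part e with a + e ~= d."""
    d = np.asarray(d, dtype=float)
    m, n = d.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))  # assumed default weighting
    norm_fro = np.linalg.norm(d, "fro")
    if norm_fro == 0.0:
        return np.zeros_like(d), np.zeros_like(d)
    norm_two = np.linalg.norm(d, 2)  # largest singular value
    y = d / max(norm_two, np.abs(d).max() / lam)  # Lagrange multiplier
    mu = 1.25 / norm_two  # penalty parameter, increased every iteration
    a = np.zeros_like(d)
    e = np.zeros_like(d)
    for _ in range(max_iter):
        # Low-rank update: shrink the singular values of (d - e + y / mu).
        u, s, vt = np.linalg.svd(d - e + y / mu, full_matrices=False)
        a = (u * soft_threshold(s, 1.0 / mu)) @ vt
        # Sparse update: shrink the entries of (d - a + y / mu).
        e = soft_threshold(d - a + y / mu, lam / mu)
        residual = d - a - e
        y = y + mu * residual
        mu = rho * mu
        if np.linalg.norm(residual, "fro") / norm_fro < tol:
            break
    return a, e
```

In the setting described above, `a` would play the role of the accompaniment matrix A used for feature extraction and `e` the role of the vocal matrix E.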
Figure 4 shows that the EPCP has improved continuity and is more distinct than the PCP. Figure 5 shows that the ELPCP is likewise clearer than the LPCP. Thus, PCP features for chord recognition are improved by singing voice separation before calculating the PCP or LPCP.

Automatic Chord Recognition
Our chord recognition system entails two parts: support vector machine classification and the Viterbi algorithm. For SVM classification, we use LIBSVM (Library for Support Vector Machines) to obtain the chord probability estimates [46]. Then the Viterbi algorithm uses the probability estimates and the trained state transition probability to estimate the chord of the music. Because of the temporal correlation of chords, we combine the SVM classification with the Viterbi algorithm and call the system TCSVM (Temporal Correlation Support Vector Machine).

Support Vector Machine Classification
SVM is a popular machine learning method for classification, regression, and other learning tasks. LIBSVM is currently one of the most widely used SVM software packages. A classification task usually involves a training set where each instance contains the class labels and the features. The goal of SVM is to produce a model to predict the target labels of the test data given only the test data features.
Many methods are available for multi-class SVM classification [47,48]. LIBSVM uses the "one-against-one" approach for multiclass classification. The classification assumes the use of the radial basis function (RBF) kernel of the form $K(x_i, x_j) = e^{-\gamma \|x_i - x_j\|^2}$. The two parameters of an RBF-kernel SVM, the penalty parameter C and the kernel parameter γ, must be determined by a parameter search as the optimal values vary between tasks. We use the grid-search method to obtain the C and γ parameters.
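The kernel and the candidate grid for the parameter search can be sketched as below. This only evaluates the RBF formula and enumerates (C, γ) pairs on a log2 scale; the exponent ranges in the usage note are illustrative assumptions, not the values used in the paper, and training and cross-validation with LIBSVM are not shown.

```python
import math

def rbf_kernel(x_i, x_j, gamma):
    """RBF kernel K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * sq_dist)

def log2_grid(c_exponents, gamma_exponents):
    """Enumerate candidate (C, gamma) pairs as powers of two."""
    return [(2.0 ** c, 2.0 ** g) for c in c_exponents for g in gamma_exponents]
```

For example, `log2_grid(range(-5, 16, 2), range(-15, 4, 2))` yields 110 candidate pairs, each of which would be scored by cross-validation on the training set before the best pair is kept.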
Once the C and γ parameters are set, the class label and probability information can be predicted. This section discusses the LIBSVM implementation for extending the SVM to output probability estimates. Given K chord classes, for any x, the goal is to estimate $p_i = P(y = i \mid x)$, $i = 1, \ldots, K$.

Viterbi Algorithm in SVM
The SVM method recognizes the chord based on frame-level classification without considering the inter-frame temporal correlation of chord features. For multiple frames corresponding to the same chord, the recognition results of traditional SVM are independent and fluctuate. Accounting for the inter-frame temporal correlation in the recognition procedure can improve the overall chord recognition rate. Our system combines SVM with the Viterbi algorithm to introduce a temporal correlation prior on the chords. Suppose the system has K hidden states, and we denote each state as $S_i$, $i \in [1:K]$, where the state refers to the chord type. The observed events are $Q_t$, $t \in [1:T]$, which are PCP features, so the observed chord feature sequence is $Q = \{Q_1, Q_2, \ldots, Q_T\}$. $A(S_i, S_j)$, also written $A_{ij}$, represents the transition probability from chord $S_i$ to chord $S_j$. At an arbitrary time point t, for each of the states $S_i$, a partial probability $\delta_t(S_i)$ indicates the probability of the most probable path ending at the state $S_i$, given the current observed events $Q_1, Q_2, \ldots, Q_t$:
$$\delta_t(S_i) = \max_j \left( \delta_{t-1}(S_j) \cdot A(S_j, S_i) \cdot P(Q_t \mid S_i) \right).$$
Here, we assume that we already know the probability $\delta_{t-1}(S_j)$ for any of the previous states $S_j$ at time $t-1$. $P(Q_t \mid S_i)$ is $p_i(t)$, the current probability estimate of the SVM. Once we have all of the objective probabilities for each state at each time point, the algorithm traces back from the end to the beginning to find the most probable path of states for the given sequence of observation events:
$$\psi_t(i) = \arg\max_{1 \le j \le K} \left( \delta_{t-1}(S_j) \cdot A(S_j, S_i) \right),$$
where $\psi_t(i)$ records the optimal predecessor of state $S_i$ at time t, based on the probabilities computed in the first stage.
In our method, we set the initial probability $\Pi_i$ of each chord to 1/24. The observed events are PCP features $y_t$, where $y_t$ is the PCP feature of the t-th frame. Generally, SVM predicts only the class label without probability information. The LIBSVM implementation extends SVM to output probability estimates. The current observation probability corresponds to the probability estimates of SVM and replaces $P(Q_t \mid S_i)$ in the Viterbi algorithm. $S_i$ represents chord $i \in [1:K]$, where K is the number of chords and is set to 24.
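The decoding step can be sketched as follows: a Viterbi pass in which the per-frame SVM probability estimates play the role of the observation probabilities. This is a minimal illustration under assumptions, not the authors' code: it works in the log domain for numerical stability, defaults to a uniform initial probability (1/24 when there are 24 chords), and takes hypothetical inputs `probs[t][i]` (SVM estimate for chord i at frame t) and `trans[j][i]` (transition probability from chord j to chord i).

```python
import math

def _log(value):
    """Logarithm that maps zero probability to negative infinity."""
    return math.log(value) if value > 0 else float("-inf")

def viterbi(probs, trans, init=None):
    """Return the most probable chord index sequence for the given frames."""
    n_states = len(probs[0])
    if init is None:
        init = [1.0 / n_states] * n_states  # uniform initial probability
    delta = [_log(init[i]) + _log(probs[0][i]) for i in range(n_states)]
    back = []  # back[t - 1][i] is the best predecessor of state i at frame t
    for t in range(1, len(probs)):
        new_delta, psi = [], []
        for i in range(n_states):
            best_j = max(range(n_states),
                         key=lambda j: delta[j] + _log(trans[j][i]))
            psi.append(best_j)
            new_delta.append(delta[best_j] + _log(trans[best_j][i])
                             + _log(probs[t][i]))
        delta = new_delta
        back.append(psi)
    # Trace back from the best final state to the first frame.
    state = max(range(n_states), key=lambda i: delta[i])
    path = [state]
    for psi in reversed(back):
        state = psi[state]
        path.append(state)
    return path[::-1]
```

With a transition matrix that favors staying on the same chord, an isolated frame whose SVM estimate briefly prefers another chord is absorbed into the surrounding chord, which is the stabilizing effect attributed to TCSVM.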
Figure 6 compares the ground truth chord and the estimated chord for the Beatles song "Baby It's You". The top figure shows the result of using the SVM method to recognize the chord and the bottom figure uses TCSVM. The ground truth chord is represented in pink and the estimated chord labels are in blue. Figure 6 indicates that the estimation is more stable when using TCSVM.

Experimental Results and Analysis
In this section, we compare the results of chord estimation using different features and methods. We compare our method with the methods that were submitted to MIREX 2013 and MIREX 2014 on the MIREX'09 dataset.

Corpus and Evaluation Results
For evaluation, we use the MIREX'09 Audio Chord Estimation task dataset, which consists of 12 Beatles albums (180 songs, PCM, 44,100 Hz, 16 bits, mono). Besides the Beatles albums, in 2009, an extra dataset was donated by Matthias Mauch, which consists of 38 songs from Queen and Zweieck [21].
This database has been used extensively for the evaluation of many chord recognition systems, in particular those presented at MIREX 2013 and 2014 for the Audio Chord Estimation task. The evaluation is conducted based on the chord annotations of the Beatles albums provided by Harte and Sandler [49], and the chord annotations of Queen and Zweieck provided by Matthias Mauch [21].
According to [50], chord symbol recall (CSR) is a suitable metric to evaluate chord estimation performance. Since 2013, MIREX has used CSR to estimate how well the predicted chords match the ground truth:
$$CSR = \frac{t_E}{t_A},$$
where $t_E$ is the total duration of segments where annotation equals estimation, and $t_A$ is the total duration of the annotated segments.
Because pieces of music vary substantially in length, we weight the CSR by the length of the song when computing the average for a given corpus. This final number is referred to as the weighted chord symbol recall. In this paper, the recognition rate and CSR are equivalent for a song, and the recognition rate and weighted chord symbol recall are equivalent for a given corpus or dataset.
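The two metrics can be illustrated with a short sketch. It assumes each song is summarized by a pair of durations in seconds, the correctly estimated time $t_E$ and the annotated time $t_A$, and that the song length used as the weight equals $t_A$; the function names are hypothetical.

```python
def csr(t_correct, t_annotated):
    """Chord symbol recall of one song: correct time over annotated time."""
    return t_correct / t_annotated if t_annotated > 0 else 0.0

def weighted_csr(songs):
    """Length-weighted average CSR over (t_correct, t_annotated) pairs."""
    total = sum(t_a for _, t_a in songs)
    if total <= 0:
        return 0.0
    return sum(csr(t_e, t_a) * t_a for t_e, t_a in songs) / total
```

For two songs with (90 s, 120 s) and (150 s, 300 s), the per-song CSR values are 0.75 and 0.5; the weighted value is 240/420, about 0.571, lower than the unweighted mean of 0.625 because the longer song counts for more.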
In the training stage, we randomly selected 25% of the songs from the Beatles albums to determine the parameters C and γ for the SVM kernel and the state transition probability matrix A. For SVM, the training dataset is composed of the PCP features of labeled musical fragments, which are selected from the training songs.The average estimation accuracies or recognition rates are reported.
First, we compare the recognition rates of SVM with PCP and EPCP features.The comparison between the ground truth and estimated chord is shown in Figure 7 for the example song (the Beatles song "Baby It's You").The top and bottom figures show the results using the SVM method with PCP features and EPCP features, respectively.The ground truth chord is represented in pink and the estimated chord labels are in blue.Figure 7 indicates that EPCP improves the recognition rate.In Figure 8, the recognition rate using SVM with PCP features is 70.15%, while that of TCSVM with the same features is 75.07%.The top image of Figure 6 shows less reliable estimated chords at the times when the chords change.The bottom image considers the inter-frame temporal correlation of chord features and shows more stable estimated chords even when the chords change.
Second, we compare the recognition rates of SVM and TCSVM with different features.The recognition results of the TCSVM method with ELPCP are superior, as shown in Figure 8.Because EPCP and ELPCP consider the temporal correlation of music, the rates show few differences between the SVM and TCSVM.‚ JR2: Jean-Baptiste Rolland [52] More details about these methods can be found from the corresponding MIREX websites [53].The results of this comparison are presented in Figure 9.

Experimental Results
determine the parameters C and γ for the SVM kernel and the state transition probability matrix A .For SVM, the training dataset is composed of the PCP features of labeled musical fragments, which are selected from the training songs.The average estimation accuracies or recognition rates are reported.
First, we compare the recognition rates of SVM with PCP and EPCP features.The comparison between the ground truth and estimated chord is shown in Figure 7 for the example song (the Beatles song "Baby It's You").The top and bottom figures show the results using the SVM method with PCP features and EPCP features, respectively.The ground truth chord is represented in pink and the estimated chord labels are in blue.Figure 7 indicates that EPCP improves the recognition rate.In Figure 8, the recognition rate using SVM with PCP features is 70.15%, while that of TCSVM with the same features is 75.07%.The top image of Figure 6 shows less reliable estimated chords at the times when the chords change.The bottom image considers the inter-frame temporal correlation of chord features and shows more stable estimated chords even when the chords change.
Second, we compare the recognition rates of SVM and TCSVM with different features. The recognition results of the TCSVM method with ELPCP are superior, as shown in Figure 8. Because EPCP and ELPCP already consider the temporal correlation of music, the rates for these features differ little between SVM and TCSVM.

Compared with State-of-the-Art Methods
We also compare our method with state-of-the-art methods that entered MIREX in 2013 and 2014, among them JR2 by Jean-Baptiste Rolland [52]. More details about these methods can be found on the corresponding MIREX websites [53]. The results of this comparison are presented in Figure 9. Our TCSVM method with ELPCP achieves a recognition rate of 82.98%, which is similar to that of the best-scoring method (KO1) in MIREX 2014.
The results of the 2015 edition of the MIREX automatic chord estimation task can be found on the corresponding websites [54]. Table 1 compares the recognition rates of the 2015 edition on the Isophonics 2009 datasets. The chord classes are as follows: major and minor (MajMin); seventh chords (Sevenths); major and minor with inversions (MajMinInv); and seventh chords with inversions (SeventhsInv). The recognition rate of our TCSVM method is higher than those of the other methods for all chord classes except MajMinInv.
Figure 10 shows the confusion between the chords using SVM, and Figure 11 shows the confusion between the chords using TCSVM. The x-axis is the ground truth chord and the y-axis is the estimated chord. Comparing the two images suggests that TCSVM reduces the rate of erroneous identifications and gives a more reliable result.
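A confusion matrix of the kind described above, together with a recognition rate derived from its diagonal, can be computed as in the following minimal sketch. It assumes frame-wise labels of equal duration, so the rate is the fraction of correctly labeled frames; the MIREX evaluation itself is duration-weighted. The function names and the toy label sequences are hypothetical.

```python
def confusion_matrix(truth, estimate, labels):
    """Count (estimated, ground truth) label pairs.

    Rows index the estimated chord and columns the ground truth chord,
    matching the axis convention in the text (x: truth, y: estimate).
    """
    index = {label: k for k, label in enumerate(labels)}
    n = len(labels)
    matrix = [[0] * n for _ in range(n)]
    for t, e in zip(truth, estimate):
        matrix[index[e]][index[t]] += 1
    return matrix


def recognition_rate(matrix):
    """Fraction of frames on the diagonal, i.e., correctly recognized frames."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    return correct / total if total else 0.0


# Toy frame-wise labels (hypothetical).
labels = ["C", "G", "Am"]
truth = ["C", "C", "C", "G", "G", "Am", "Am", "Am"]
estimate = ["C", "C", "Am", "G", "G", "Am", "C", "Am"]

cm = confusion_matrix(truth, estimate, labels)
print(cm)                    # [[2, 0, 1], [0, 2, 0], [1, 0, 2]]
print(recognition_rate(cm))  # 0.75
```

Off-diagonal mass concentrated in a few cells, such as C confused with Am here, points to systematic confusions between harmonically related chords rather than random errors.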

Conclusions
We present a new feature called ELPCP and a machine learning model called TCSVM for chord estimation. We separate the singing voice from the accompaniment to improve the features and consider the temporal correlation of music. The temporal correlation SVM is then used to estimate the chords. This system results in more accurate chord recognition and eliminates many spurious chord estimates that appear in the conventional recognition procedure.
Future work should address some limitations. First, this paper covers only the estimation of common chords within the audio chord estimation task. Future work will involve the recognition of more complex chords to increase the applicability of this work in the field of music information retrieval, including song identification, query by similarity, and structure analysis. Second, we consider the effect of the singing voice, and the results in Figure 8 show that the recognition rate with singing voice separation is better than without it. There is still room for further improvement of the PCP features to make them more suitable for chord recognition. Finally, we will evaluate the TCSVM method for cases in which the audio is corrupted by noise.
Acknowledgments: This work was supported by the National Natural Science Foundation of China (Grant


Figure 6. Comparison of ground truth and estimated chords using SVM (top) and TCSVM (bottom).


Figure 9. Comparison of recognition rates with state-of-the-art methods on the MIREX'09 dataset.


Table 1. Comparison of recognition rates of the 2015 edition on the Isophonics 2009 datasets.