A Robust Cover Song Identification System with Two-Level Similarity Fusion and Post-Processing

Abstract: Similarity measurement plays an important role in various information retrieval tasks. In this paper, a music information retrieval scheme based on two-level similarity fusion and post-processing is proposed. At the similarity fusion level, to take full advantage of the common and complementary properties among different descriptors and different similarity functions, the track-by-track similarity graphs generated from the same descriptor but different similarity functions are first fused with the similarity network fusion (SNF) technique. Then, the first-level fused similarities obtained from different descriptors are further fused with the mixture Markov model (MMM) technique. At the post-processing level, diffusion is first performed on the two-level fused similarity graph to exploit the underlying track manifold contained within it. Then, a mutual proximity (MP) algorithm is adopted to refine the diffused similarity scores, which helps to reduce the negative influence of the “hubness” phenomenon contained in the scores. The performance of the proposed scheme is tested in the cover song identification (CSI) task on three cover song datasets (Covers80, Covers40, and Second Hand Songs (SHS)). The experimental results demonstrate that the proposed scheme outperforms state-of-the-art CSI schemes based on single similarity or similarity fusion.


Introduction
A huge increase in the number of digital music tracks promotes the development of content-based music information retrieval (MIR) technology. As a part of MIR, cover song identification (CSI, also called cover version identification) has received increasing attention due to its potential real-world applications in copyright protection and the management of online music products. Additionally, the study of CSI techniques helps to understand how the human auditory system measures and models the similarity between music.
How to measure and model the similarity between music items is one of the most fundamental components of MIR applications, and it remains an important yet challenging research question [1]. Various similarity functions have been proposed in recent years [2][3][4][5]. Because the similarity between two tracks can be calculated from different descriptors and different similarity functions, relying on a single similarity function neglects their complementary properties. It has been verified [6][7][8] that different descriptors and similarity functions are complementary to each other in the CSI task. To take full advantage of the common as well as complementary information contained in different descriptors and similarity functions in describing the similarity between tracks, some researchers began to study similarity fusion algorithms for CSI. In [9], the main melody and accompaniment of the music were extracted first. Then, the maximum of the similarities obtained separately from the main melody, the accompaniment, and the mixture signal was taken as the final similarity. In [6], the standard classification-based fusion strategy [10] was adopted to fuse the similarities of three related yet different descriptors (harmony, melody, and bass line). In [11], the fusion of different similarities was achieved by projecting them into a multi-dimensional space, where the dimensionality of the space was the number of similarities considered. However, this scheme was easily disturbed by bad descriptors because of the diluted signal-to-noise ratio. In [12], the similarity graphs obtained from different descriptors and corresponding similarity functions were fused by the similarity network fusion (SNF) technique [13]. Then, the track-by-track similarities in the fused similarity graph were adopted for version identification. Due to the merits of the SNF technique, this fusion scheme could reduce the noise existing in each similarity graph and take advantage of the common as well as complementary information across the similarity graphs. A similar strategy was adopted in [8] to fuse the similarities obtained from the same descriptor and different similarity functions (Qmax [4] and Dmax [5]). This achieved the highest identification accuracy in the CSI task of MIREX 2016 (http://www.music-ir.org/mirex/wiki/2016:Audio_Cover_Song_Identification_Results). Some researchers proposed multi-stage similarity fusion schemes to take advantage of the common and complementary information provided by different musical descriptors and different similarity functions at the same time [7,14]. In [14], the SNF technique was applied to both the descriptor-level fusion and the similarity-level fusion. It achieved the highest identification accuracy on the Covers80 dataset. In [7], in the early fusion, the similarities obtained from the same descriptor and different similarity functions were integrated by SNF. In the late fusion, the learning method selected by the sparse group LASSO algorithm was applied to the early fused similarity to obtain the probability that the input track pair was a reference/cover pair. The final similarity was then obtained by averaging the probability-based similarities obtained for each descriptor.
However, some important factors that may seriously influence the identification accuracy are not considered in the available fusion schemes: (i) The complementarity among different descriptors and that among different similarity functions is not considered simultaneously [6,8] or not fused efficiently [7]. (ii) The track manifold of the fused similarity graph, which will affect retrieval accuracy greatly, is not taken into consideration [15] (refer to Section 2.4.1 for specific examples). (iii) The negative influence caused by the "hubness" phenomenon contained in the fused similarity graph is seldom considered, which may increase the false positive rate [16].
To address these shortcomings of the available similarity fusion algorithms and to further enhance CSI performance, a new CSI scheme based on two-level similarity fusion and post-processing is put forward in this paper. At the fusion level, a nonlinear graph fusion technique [13] is first adopted to fuse the similarity graphs constructed from the same descriptor and different similarity functions. Then, a mixture Markov model (MMM) [17] is introduced to integrate the first-level fused similarity graphs generated from two complementary descriptors. At the post-processing level, diffusion [16] is first applied to the obtained two-level fused similarity graph to take full advantage of the underlying structure of the tracks contained within it, which reduces the noise and further enhances the identification. Then, the mutual proximity (MP) technique [15] is performed on the diffused similarity scores to reduce the negative influence caused by the "hubness" phenomenon existing in the diffused track community. It should be noted that the proposed scheme differs from our previously proposed scheme [7] in the following respects: (i) Unlike the scheme in [7], the proposed scheme is fully unsupervised. (ii) The track manifold contained in the two-level fused similarity graph is not considered in [7]. (iii) The negative influence of the "hubness" phenomenon, which is not considered in [7], is reduced by the MP technique in the proposed scheme. Extensive experiments conducted on three cover song datasets (Covers80 (https://labrosa.ee.columbia.edu/projects/coversongs/covers80/), Covers40, and SHS (https://labrosa.ee.columbia.edu/millionsong/secondhand)) demonstrate the necessity and effectiveness of each step included in the proposed model (Section 3.3.1) and the superiority of the proposed scheme over state-of-the-art CSI schemes in terms of identification accuracy (Section 3.3.2) and computational complexity, especially as the size of the dataset increases (Section 3.3.3).
The rest of this paper is organized as follows. The proposed model is presented in Section 2. Section 3 reports the experimental results. Finally, conclusions are drawn and future work is discussed in Section 4.

Proposed Model
A block diagram of the proposed model, which is illustrated by an example of results obtained on Covers40 (see Section 3.1), is shown in Figure 1.
Let $V = \{v_q \mid q = 1, \dots, N\}$ denote a music collection. Two function lists are defined as follows:
• Function list $f = \{f_i \mid i = 1, \dots, M\}$, where $f_i(v_q)$ extracts the $i$-th kind of descriptor from the track $v_q$.
• Function list $s = \{s_j \mid j = 1, \dots, R\}$, where $s_j(f_i(v_q), f_i(v_p))$ computes the $j$-th similarity score between the $i$-th descriptors of the input tracks $v_q$ and $v_p$.

Descriptor Extraction
For each track $v_q$, $q = 1, \dots, N$, in the music collection, $M$ kinds of descriptors (denoted as $f_i(v_q)$, $i = 1, \dots, M$) are extracted. In the proposed scheme, the harmonic pitch class profile (HPCP) [18] and main melody (MLD) [19] descriptors are extracted from each track.

First-Level Fusion
For each pair of tracks ($v_q$ and $v_p$), the $j$-th similarity function is applied to their $i$-th descriptors to obtain the similarity score $s_j^{(i)}(q, p)$:
$$s_j^{(i)}(q, p) = s_j\big(f_i(v_q), f_i(v_p)\big). \quad (1)$$
Thus, the track-by-track similarity matrix obtained from the $i$-th descriptor and the $j$-th similarity function can be represented as a graph, denoted as $G_j^{(i)} = \{V, E, s_j^{(i)}\}$, where the vertices $V$ correspond to the tracks in the collection, and the edges $E$ are weighted by the corresponding similarity scores $s_j^{(i)}$.
To take advantage of the complementarity between the Qmax and Dmax similarity functions in representing the similarity between cover versions, the similarity graphs based on the same descriptor (HPCP or MLD) and two different similarity functions (Qmax [4] and Dmax [5]) are fused with the SNF technique [13]. The specific details of the SNF technique can be found in [13] and [7]. The first-level fused similarity graph for the $i$-th descriptor is denoted as $G^{(i)}(V, E, A^{(i)})$, $i = 1, 2$, where the fused similarity matrix $A^{(i)}$ is obtained by applying SNF to the Qmax-based and Dmax-based similarity graphs of that descriptor (Equation (2)).
To test the validity of the first-level fusion, the three cover sets shown in Table 1 are studied here. Their six tracks were used both as the queries and as the targets. The corresponding 6 × 6 similarity matrices obtained by MLD-Qmax, MLD-Dmax, and the first-level fused version of them (denoted as SNF-MLD-QD) are shown in Figure 2a-c, respectively. The cells corresponding to the query/cover pairs are marked with white boxes. It can be seen that MLD-Qmax and MLD-Dmax did not work on the No. 1 and No. 3 cover sets, respectively. However, after first-level fusion, this problem was solved.
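As a rough illustration of how two similarity graphs of the same descriptor might be fused, the following is a minimal sketch of the cross-diffusion idea behind SNF for exactly two graphs. It is not the authors' implementation: the function names, the neighborhood size `k`, the iteration count, and the omission of the per-iteration renormalization used in [13] are all simplifying assumptions, and the input matrices stand in for Qmax-based and Dmax-based similarities.

```python
def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    """Return the transpose of a dense matrix."""
    return [list(col) for col in zip(*a)]


def full_kernel(w):
    """Global kernel: off-diagonal row mass scaled to 1/2, diagonal set to 1/2."""
    n = len(w)
    p = [[0.0] * n for _ in range(n)]
    for i in range(n):
        off = sum(w[i][k] for k in range(n) if k != i)
        for j in range(n):
            if i == j:
                p[i][j] = 0.5
            elif off > 0:
                p[i][j] = w[i][j] / (2.0 * off)
    return p


def local_kernel(w, k):
    """Local kernel: keep each row's k strongest neighbours, row-normalised."""
    n = len(w)
    s = [[0.0] * n for _ in range(n)]
    for i in range(n):
        nbrs = sorted((j for j in range(n) if j != i),
                      key=lambda j: w[i][j], reverse=True)[:k]
        total = sum(w[i][j] for j in nbrs)
        for j in nbrs:
            if total > 0:
                s[i][j] = w[i][j] / total
    return s


def snf_two_graphs(w1, w2, k=2, iterations=10):
    """Cross-diffuse two similarity matrices and return their average."""
    p1, p2 = full_kernel(w1), full_kernel(w2)
    s1, s2 = local_kernel(w1, k), local_kernel(w2, k)
    for _ in range(iterations):
        # Each graph's status is updated through the *other* graph's status.
        new_p1 = matmul(matmul(s1, p2), transpose(s1))
        new_p2 = matmul(matmul(s2, p1), transpose(s2))
        p1, p2 = new_p1, new_p2
    n = len(w1)
    return [[(p1[i][j] + p2[i][j]) / 2.0 for j in range(n)] for i in range(n)]
```

The local kernels restrict each update to strong neighbors, which is what lets one graph's reliable edges reinforce the other's while weak, noisy edges fade.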

Second-Level Fusion
To make full use of the common and complementary properties of different descriptors (HPCP and MLD), the first-level fused similarity graphs for each descriptor are further fused with the MMM technique [17] as follows.
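For a walker sitting at vertex $v_q \in V$ in graph $G^{(i)}(V, E, A^{(i)})$, she first decides which graph to land in, jumps to that graph, and then decides which neighboring vertex to go to according to that graph's similarity matrix. The procedure of walking from $v_q$ to $v_p$ across all graphs can be represented with Equation (3):
$$\xi(v_p \mid v_q) = \sum_i \xi^{(i)}(v_q)\, \xi^{(i)}(v_p \mid v_q), \quad (3)$$
where $\xi(v_p \mid v_q)$ is the transition probability of walking from $v_q$ to $v_p$ in the second-level fused similarity graph, $\xi^{(i)}(v_q)$ is the probability of switching to (or staying in) graph $G^{(i)}$ when the walker is at vertex $v_q$, and $\xi^{(i)}(v_p \mid v_q)$ is the transition probability from $v_q$ to $v_p$ within $G^{(i)}$.
The degree of $v_q$ in $G^{(i)}$, denoted as $d^{(i)}(v_q)$, is defined as the sum of the strengths of all edges connected to $v_q$, i.e., $d^{(i)}(v_q) = \sum_p A^{(i)}(q, p)$. The volume of graph $G^{(i)}$, denoted as $\theta^{(i)}$, is defined as the sum of all edge strengths in it, which can be calculated as
$$\theta^{(i)} = \sum_q d^{(i)}(v_q) = \sum_{q,p} A^{(i)}(q, p).$$
Then, $\xi^{(i)}(v_p \mid v_q)$ can be rewritten as
$$\xi^{(i)}(v_p \mid v_q) = \frac{A^{(i)}(q, p)}{d^{(i)}(v_q)}. \quad (4)$$
When the random walk model reaches a stationary state, the stationary probability at vertex $v_q$ is defined as
$$\pi^{(i)}(v_q) = \frac{d^{(i)}(v_q)}{\theta^{(i)}}. \quad (5)$$
Suppose the stationary probability of the second-level fused graph, denoted as $\Pi(v_q)$, can be represented by a linear combination of the stationary probabilities of all first-level fused graphs as follows:
$$\Pi(v_q) = \sum_i \alpha_i\, \pi^{(i)}(v_q), \quad (6)$$
where $\alpha_i \geq 0$ and $\sum_i \alpha_i = 1$. Then, $\xi^{(i)}(v_q)$ in Equation (3) can be calculated as follows:
$$\xi^{(i)}(v_q) = \frac{\alpha_i\, \pi^{(i)}(v_q)}{\Pi(v_q)}. \quad (7)$$
By plugging (4), (5), and (7) into (3), we obtain
$$\xi(v_p \mid v_q) = \sum_i \frac{\alpha_i}{\theta^{(i)}\, \Pi(v_q)}\, A^{(i)}(q, p) = \sum_i w_i(v_q)\, A^{(i)}(q, p), \qquad w_i(v_q) = \frac{\alpha_i}{\theta^{(i)}\, \Pi(v_q)}. \quad (8)$$
Then $A(q, p) = \sum_i w_i(v_q)\, A^{(i)}(q, p)$ is adopted as the second-level fused similarity score. The corresponding similarity graph is denoted as $G(V, E, A)$, where $A = \{A(q, p) \mid q, p = 1, \dots, N\}$.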
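The random-walk mixture described above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: it assumes the standard random-walk quantities (within-graph transition probability = edge weight divided by vertex degree, stationary probability = degree divided by graph volume), and the mixing coefficients `alphas` and the function name `mmm_fuse` are hypothetical. Every vertex is assumed to have a positive degree in every graph.

```python
def mmm_fuse(graphs, alphas):
    """Fuse similarity matrices as a mixture of random walks.

    graphs: list of N x N similarity matrices (lists of rows), one per descriptor.
    alphas: non-negative mixing coefficients that sum to 1 (an assumption here).
    Returns the N x N matrix of fused transition probabilities.
    """
    n = len(graphs[0])
    degrees = [[sum(row) for row in a] for a in graphs]  # vertex degrees per graph
    volumes = [sum(d) for d in degrees]                  # graph volumes
    fused = [[0.0] * n for _ in range(n)]
    for q in range(n):
        # Stationary probability of vertex q in each graph: degree / volume.
        pis = [degrees[i][q] / volumes[i] for i in range(len(graphs))]
        mix = sum(alpha * pi for alpha, pi in zip(alphas, pis))
        for i, a in enumerate(graphs):
            # Probability of switching to (or staying in) graph i at vertex q.
            switch = alphas[i] * pis[i] / mix
            for p in range(n):
                # Within-graph transition probability: weight / degree.
                fused[q][p] += switch * a[q][p] / degrees[i][q]
    return fused
```

Because the switching probabilities at each vertex sum to one and each within-graph transition row sums to one, every row of the fused matrix is again a probability distribution.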

Post-Processing
At the post-processing level, the locally constrained diffusion process (LCDP) [16] is first performed on the second-level fused similarity graph to make full use of the underlying track manifold structure contained within it and thereby enhance the retrieval performance. Then, the MP technique is applied to the obtained diffused similarity to reduce the negative influence caused by the "hubness" phenomenon contained in the diffused track community.

Diffusion Processing
For diffusion processing, we adopt the LCDP technique proposed in [16]. The central concept of LCDP is to restrict a random walk to the K nearest neighbors of the data points by replacing the original graph $G$ in the traditional diffusion process with a K nearest neighbor (K-NN) graph $G_K$, which can effectively reduce the influence of noisy data points. Figure 3 shows the classification results of double moon data before and after applying diffusion on the distance values. It can be seen that diffusion can utilize the structure of the underlying data manifold to enhance classification performance.
Given the second-level fused similarity matrix $A$, the transition matrix, denoted as $U = \{U(q, p) \mid q, p = 1, \dots, N\}$, can be calculated as follows:
$$U = D^{-1} A,$$
where $D$ is a diagonal matrix and the $q$-th diagonal element $D(q, q)$ is the degree of $v_q$ in graph $G$.
Assume that the K-NN graph of $G$ is $G_K$, which is generated by keeping only the similarity scores between each node and its K nearest neighbors in $G$. The transition matrix corresponding to $G_K$ is $U_K$. We generate a diffused similarity matrix, denoted as $F_t = \{f_q^t \mid q = 1, \dots, N\}$, where $f_q^t$ is a column vector indicating the probability of being at each vertex after $t$ steps when starting from vertex $v_q$. Then, LCDP [16] is employed to iteratively update $F$ as follows:
$$F_{t+1} = U_K\, F_t\, U_K^{\top},$$
where $F_0 = U_K$, and the diffusion terminates after a pre-defined number of iterations or when $F$ no longer changes. The obtained diffused similarity graph is denoted as $G^{(d)}(V, E, F)$.
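The diffusion step can be illustrated with a small sketch. It assumes that the K-NN transition matrix is obtained by keeping each row's K largest similarities and row-normalizing them, and that one LCDP iteration multiplies the current matrix by the K-NN transition matrix on the left and by its transpose on the right, which is our reading of [16]; the function names and default parameter values are hypothetical, and no convergence check is performed.

```python
def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    """Return the transpose of a dense matrix."""
    return [list(col) for col in zip(*a)]


def knn_transition(a, k):
    """Row-stochastic transition matrix restricted to each row's k largest entries.

    The vertex itself counts as a candidate neighbour (an assumption), so a
    dominant diagonal entry is kept.
    """
    n = len(a)
    u = [[0.0] * n for _ in range(n)]
    for q in range(n):
        nbrs = sorted(range(n), key=lambda p: a[q][p], reverse=True)[:k]
        total = sum(a[q][p] for p in nbrs)
        for p in nbrs:
            if total > 0:
                u[q][p] = a[q][p] / total
    return u


def lcdp(a, k=3, iterations=5):
    """Run a fixed number of locally constrained diffusion iterations."""
    u_k = knn_transition(a, k)
    u_k_t = transpose(u_k)
    f = u_k  # initial state: the K-NN transition matrix itself
    for _ in range(iterations):
        f = matmul(matmul(u_k, f), u_k_t)
    return f
```

Restricting the walk to K nearest neighbors means that similarity only propagates along strong edges, so two tracks become close after diffusion when they share well-connected neighbors.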

Hubness Reduction
To reduce the negative influence caused by the "hubness" phenomenon existing in the track community, we adopt the MP algorithm [15] to transform the obtained arbitrary similarity scores into probability-based similarity scores. MP is a global scaling method, and its general idea is to reinterpret the original similarity space so that two objects sharing similar nearest neighbors are more closely tied to each other. Under the assumption that all similarity scores in a data set follow a certain distribution, any similarity $s_{x,y}$ can be reinterpreted as the probability of $v_y$ being the nearest neighbor of $v_x$ given the distribution $P(X)$, which is defined by the similarities of $v_x$ to all other objects in the collection:
$$P(X < s_{x,y}) = F_x(s_{x,y}),$$
where $F_x$ denotes the cumulative distribution function (CDF) assumed for the distribution of the similarity scores $s_{x,i}$, $i = 1, \dots, N$. Then, the MP-based similarity between $v_x$ and $v_y$, denoted as $MP(x, y)$, is defined as the probability that $v_y$ is the nearest neighbor of $v_x$ given $P(X)$ and $v_x$ is the nearest neighbor of $v_y$ given $P(Y)$:
$$MP(x, y) = P\big(X < s_{x,y} \cap Y < s_{y,x}\big).$$
By visualizing the joint similarity score distribution of $X$ and $Y$, computing MP for a given similarity score $s_{x,y}$ in a collection of $N$ objects boils down to counting the number of objects $j$ that have a smaller similarity score to both $v_x$ and $v_y$ than $s_{x,y}$:
$$MP(x, y) = \frac{\big|\{j : s_{x,j} < s_{x,y}\} \cap \{j : s_{y,j} < s_{y,x}\}\big|}{N}.$$
Figure 4 shows the probability distribution of the diffused similarities on Covers40 before and after applying the MP algorithm to them. It can be seen that the MP algorithm helps to enlarge the difference between inter tracks (unrelated tracks), which helps to reduce the false positive rate.
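The counting form of MP lends itself to a short sketch. The version below is an illustrative assumption rather than the authors' code: it works on a similarity matrix, excludes the two tracks being compared from the count, and divides by the collection size; the function name is hypothetical.

```python
def mutual_proximity(s):
    """Empirical mutual proximity for a similarity matrix s (list of rows).

    mp[x][y] is the fraction of tracks j that are less similar to x than y is
    AND less similar to y than x is. The diagonal is left at 0.
    """
    n = len(s)
    mp = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            if x == y:
                continue
            count = sum(
                1 for j in range(n)
                if j != x and j != y
                and s[x][j] < s[x][y]
                and s[y][j] < s[y][x]
            )
            mp[x][y] = count / n
    return mp


# Toy collection: tracks 0/1 and 2/3 form two cover pairs.
scores = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
mp_scores = mutual_proximity(scores)
print(mp_scores[0][1], mp_scores[0][2])  # 0.5 0.0
```

In the toy example, the cover pair keeps a positive score while the unrelated pair drops to zero, because no third track is less similar to both members of the unrelated pair. A hub track that is merely "somewhat similar to everything" is penalized in the same way.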

Experiments
In this section, we evaluate the performance of the proposed scheme. The cover song datasets used in the experiments and the experimental settings are described in Sections 3.1 and 3.2, respectively. The experimental results, which include the necessity and importance of each step in the proposed scheme, the performance comparison with state-of-the-art CSI schemes, and the computational complexity comparison with other fusion-based CSI schemes, are discussed in Section 3.3.

Datasets
To evaluate the performance of the proposed model, we used three different cover song datasets (see Table 2) in the experiments.
Covers80, denoted as DB160 in this paper, is provided by Ellis from LabROSA. It contains 80 cover sets with 2 tracks in each set. Most of the tracks in this database have significant differences in rhythm.
Covers40, denoted as DB400 here, is composed of 400 tracks in 40 cover sets collected by us. There are 9 cover versions, which include both popular songs and classical music, for each original track. A complete list of this collection can be obtained by contacting us by email.
SHS is a part of the Second Hand Songs cover song dataset and consists of 12,730 tracks. There are 4235 original tracks and 8495 covers in this collection. The average number of covers in each cover set is 3.01, ranging from 2 to 42. This collection spans a variety of genres, including pop, rock, electronic, jazz, blues, and classical music. As shown in Table 2, we split it sequentially into four non-overlapping subsets, denoted as DB3172, DB3183, DB3187, and DB3188, respectively.

Experiment Settings
To reduce the computation time and the memory requirements, each track was converted into a mono, 22.5 kHz, 16 bits per sample version. Then the pre-processed signal was segmented into non-overlapping frames of 464 ms using a Hamming window. For each frame, the HPCP and MLD descriptors were extracted. Qmax and Dmax were adopted to measure the similarity between HPCP or MLD descriptors. As for the evaluation measures, the mean of average precision (MAP) [4], the mean averaged reciprocal rank (MaRR) [20], and the total number of covers identified in the top 10 (TOP-10) were adopted to evaluate the performance of the CSI schemes. The larger the value of MAP, MaRR, or TOP-10, the better the performance.
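For readers who want to reproduce the evaluation measures, the sketch below computes them from ranked result lists. It is an illustration under stated assumptions: the function name and input layout are hypothetical, and the reciprocal-rank measure is taken here as the mean reciprocal rank of the first correctly identified cover, which may differ in detail from the MaRR definition in [20].

```python
def evaluate(ranked_lists, relevant_sets, top_n=10):
    """Return (MAP, mean reciprocal rank, number of covers found in the top_n).

    ranked_lists: one ranked list of track ids per query (query itself excluded).
    relevant_sets: one set of true cover ids per query.
    """
    ap_sum = 0.0
    rr_sum = 0.0
    top_hits = 0
    for ranking, relevant in zip(ranked_lists, relevant_sets):
        hits = 0
        precision_sum = 0.0
        first_rank = None
        for rank, track in enumerate(ranking, start=1):
            if track in relevant:
                hits += 1
                precision_sum += hits / rank  # precision at this hit
                if first_rank is None:
                    first_rank = rank
                if rank <= top_n:
                    top_hits += 1
        ap_sum += (precision_sum / len(relevant)) if relevant else 0.0
        rr_sum += (1.0 / first_rank) if first_rank else 0.0
    n_queries = len(ranked_lists)
    return ap_sum / n_queries, rr_sum / n_queries, top_hits


# Two toy queries: the first has covers "a" and "c", the second has cover "y".
map_score, mrr, top10 = evaluate(
    [["a", "b", "c"], ["x", "y", "z"]],
    [{"a", "c"}, {"y"}],
)
print(round(map_score, 4), mrr, top10)  # 0.6667 0.75 3
```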

Experimental Results
First, we demonstrate the necessity and importance of each step included in the proposed model by comparing the identification accuracy obtained at each step. Second, we compare the performance of the proposed model with those of state-of-the-art CSI schemes, in terms of MAP, MaRR, and TOP-10, on all three datasets. Finally, we compare the computational complexity of the proposed model with those of other similarity-fusion-based CSI schemes.

Necessity and Importance of Each Step Included in the Proposed Model
To verify the necessity and validity of each step in the proposed model (see Figure 1), the identification accuracies in terms of MAP, MaRR, and TOP-10 achieved at each step are compared in Figure 5, where baseline (BL) is the fusion object (HPCP-Qmax, HPCP-Dmax, MLD-Qmax, MLD-Dmax) that achieved the best performance, and SNF denotes the first-level fused similarity for the HPCP descriptor. For SHS, only the results on DB3172 are included in Figure 5. Similar results could be obtained on the other three SHS subsets.

Comparison with State-Of-The-Art CSI Schemes
To verify the effectiveness of the proposed scheme in comparison with other CSI schemes that are based on a single similarity function or on similarity fusion, the MAP, MaRR, and TOP-10 achieved by each scheme are listed in Table 3. The CSI schemes included in this experiment were the proposed model (denoted as TLSFP: two-level similarity fusion and post-processing); HPCP-Qmax [4]; HPCP-Dmax [5]; a particle swarm optimization (PSO)-based scheme [21]; a high space (HS) mapping-based scheme [11]; the scheme proposed in [8] (denoted as SNF-2); the scheme proposed in [12] (denoted as SNF-3); SNF-4, which fuses the similarities based on HPCP-Qmax, HPCP-Dmax, MLD-Qmax, and MLD-Dmax with SNF; and a two-layer fusion based scheme [7]. For the HS and PSO schemes, the same similarity types as those in SNF-4 were adopted.
The experimental results shown in Table 3 demonstrate that the proposed TLSFP scheme outperformed the other CSI schemes (based on a single similarity function or on similarity fusion) in terms of MAP, MaRR, and TOP-10 on all six datasets, except for the MAP value on DB3187. The gap there was 0.0069, which is very small and can be neglected.

Computational Complexity Comparison
In this experiment, the computational complexity of the proposed model in terms of average computing time is compared with those obtained by PSO-, HS-, and SNF-4-based fusion schemes.
All the experiments were carried out on a desktop machine with an Intel(R) Core(TM) i7 CPU (4.0 GHz) and 32 GB memory. Given the total fusion computing time T, we obtained the average computing time with AvgT = T/( N 2 ) 2 , where N is the total number of tracks in the dataset. The experimental results shown in Figure 6 demonstrate that: (i) The PSO scheme cost much more time than the other three. (ii) HS achieved the lowest computational complexity among the four schemes. However, its identification performance may be unsatisfactory (see Table 3). (iii) The proposed TLSFP scheme needed slightly more time than SNF-4 when the dataset was small. However, as the dataset size increased, the difference became smaller and smaller. On SHS, the computational complexity of TLSFP was lower than that of SNF-4. The proposed model is therefore well suited to large music collections.

Conclusions and Future Work
In this paper, we propose a music information retrieval scheme based on two-level similarity fusion and post-processing. It adopts different strategies (SNF and MMM) to combine the merits of different similarity functions and those of different descriptors in two fusion levels. In addition, it introduces diffusion and MP techniques to refine the fused similarity scores and enhance cover version identification accuracy. Extensive experiments on three cover song datasets (including Covers80 and SHS) demonstrated the effectiveness and efficiency of the proposed model in comparison with state-of-the-art CSI schemes.
TLSFP can be modified and applied to other important tasks in different fields, such as image classification, visual object tracking, cancer subtype identification, and drug taxonomy. We leave all these problems for future work.

Figure 1. Block diagram and illustrative example of the proposed model, using partial results on Covers40 as an example. (a) Extract the harmonic pitch class profile (HPCP) descriptor and main melody (MLD) descriptor from each track in the music collection. (b) A track-by-track similarity graph is constructed based on each descriptor and corresponding similarity function. The similarity graphs based on the same descriptor and different similarity functions are fused with similarity network fusion (SNF). (c) The first-level fused similarity graphs for each descriptor are integrated with the mixture Markov model (MMM) technique to obtain a second-level fused similarity graph. (d) Post-processing. First, diffusion is performed on the second-level fused similarity graph to take advantage of the structure of the underlying track manifold contained within it to reduce noise and enhance retrieval accuracy; then, mutual proximity (MP) is adopted to modify the diffused similarity to reduce the "hubness" phenomenon.

Figure 3. Illustration of the effectiveness of the diffusion process in classification. Pentagrams represent two queries from different groups. Each element is assigned to one of the two queries according to its distances to the query samples: (a) without diffusion; (b) with diffusion.

Figure 4. Probability distribution of the diffused similarities on DB400 (a) before and (b) after applying MP to them.


Figure 5. Identification accuracy achieved in each step of the proposed model on (first column) DB160, (second column) DB400, and (last column) DB3172. BL: baseline; DIFF: diffusion processing; MAP: mean of average precision; MaRR: mean averaged reciprocal rank; TOP-10: total number of covers identified in TOP 10.

Figure 6. The comparison of average computing time achieved by different similarity fusion schemes on four datasets.

Table 1. The tracks in the selected cover sets.

Table 2. Cover song datasets used.

Table 3. Identification accuracy comparison among different cover song identification (CSI) schemes. HS: high space; PSO: particle swarm optimization; TLSFP: two-level similarity fusion and post-processing.