Learning Low-Dimensional Embeddings of Audio Shingles for Cross-Version Retrieval of Classical Music

Abstract: Cross-version music retrieval aims at identifying all versions of a given piece of music using a short query audio fragment. One previous approach, which is particularly suited for Western classical music, is based on a nearest neighbor search using short sequences of chroma features, also referred to as audio shingles. From the viewpoint of efficiency, indexing and dimensionality reduction are important aspects. In this paper, we extend previous work by adapting two embedding techniques; one is based on classical principal component analysis, and the other is based on neural networks with triplet loss. Furthermore, we report on systematically conducted experiments with Western classical music recordings and discuss the trade-off between retrieval quality and embedding dimensionality. As one main result, we show that, using neural networks, one can reduce the audio shingles from 240 to fewer than 8 dimensions with only a moderate loss in retrieval accuracy. In addition, we present extended experiments with databases of different sizes and different query lengths to test the scalability and generalizability of the dimensionality reduction methods. We also provide a more detailed view into the retrieval problem by analyzing the distances that appear in the nearest neighbor search.


Introduction
Large amounts of digitally available music data require efficient retrieval strategies. In recent decades, many systems for music retrieval based on the query-by-example paradigm have been suggested. Given a fragment of a music representation as a query, the task is to automatically retrieve documents from a music database containing parts or aspects that are similar to the query [1–3]. One such retrieval scenario is known as audio identification or fingerprinting [4–7], where the user specifies a query using an excerpt of an audio recording, and the task is to identify the particular audio recording that is the source of the query. A more challenging scenario is cross-version retrieval, including tasks such as audio matching, version identification, or cover song retrieval. Here, given an excerpt of an audio recording as a query, the goal is to automatically retrieve all recordings in a database that correspond to the same piece of music as the query. Relevant documents may include various interpretations, arrangements, and cover songs of the piece underlying the recording of the query fragment. We focus on such a retrieval scenario in the context of Western classical music, where one typically has many different performances (referred to as versions) of the same piece of music. For example, given a 10 to 30 s fragment of a recording of Beethoven's Fifth Symphony performed by the Berlin Philharmonic conducted by Karajan, the task is to identify all versions of this symphony in a database, including an interpretation by the New York Philharmonic conducted by Bernstein and an interpretation by the Vienna Philharmonic conducted by Abbado.
Figure 1 illustrates a typical retrieval procedure, where query and database recordings are compared employing chroma-based audio representations [7] (Chapter 3), resulting in a ranked list of database documents. For comparison, a temporal alignment procedure (e.g., subsequence dynamic time warping [7] (Chapter 7)) is often used to compensate for non-linear tempo differences between the query and relevant database documents [24,31]. However, for huge data collections, the resulting runtime of such approaches is prohibitive. As a more efficient alternative, previous work [1,9,12] introduced shingling approaches, where short feature sequences are used for indexing. In this paper, we build on a study presented by Grosche and Müller [12], who approached this task using chroma-based audio shingles (see the left part of Figure 1 for a visualization of such a shingle). Retrieval was performed via locality-sensitive hashing (LSH) applied to entire shingles. LSH is a random indexing technique for approximate nearest neighbor search [32]. The authors investigated the feature design, the length of the shingles, and the effect of dimensionality reduction applied to individual feature vectors. In this paper, we propose approaches to further increase the efficiency of the retrieval. We concentrate on the aspect of dimensionality reduction applied to entire shingles so that standard tree-based indexing techniques for nearest neighbor search can be used for retrieval.
As one contribution of this paper, we first use an approach based on principal component analysis (PCA) applied to entire shingles, rather than to individual chroma vectors as in previous work [12]. As another contribution, we then adapt convolutional neural networks with triplet loss [33] to further reduce the shingles' dimensionality without losing their discriminative power. We conduct basic experiments with a medium-sized collection of music recordings to study the benefits and limitations of the dimensionality reduction methods. As our main result, we show that the shingle dimension can be reduced from 240 to below 8 with only a moderate loss in retrieval quality. Furthermore, we report on extended experiments with a larger data set, using different query lengths to study the scalability and generalizability of the embedding approaches. In our context, scalability refers to the data set size and generalizability refers to the diversity of the data set. We also provide detailed insights into the challenges of the retrieval task and their musical reasons by analyzing the distance distributions that form the basis for the nearest neighbor search.
The structure of this paper is as follows. We give an overview of related work in Section 2 and formalize our retrieval scenario in Section 3. We describe our embedding approaches in Section 4. Then, in Section 5, we report on our basic experiments based on a medium-sized data set. Finally, in Section 6, we give further insights based on our extended experiments using a larger and more diverse data set, and conclude in Section 7 with a short summary.


Related Work
On a rough level, we can categorize music retrieval scenarios into metadata- and content-based retrieval tasks [1]. Metadata-based systems use textual information for searching in music databases. In contrast, content-based systems use actual music data such as sheet music images, symbolic music representations, or audio data. Content-based systems can be further categorized according to the modalities involved. For an overview of multi-modal music retrieval scenarios, we refer to a survey by Müller et al. [34]. In our contribution, we focus on retrieval scenarios where both query and database documents are audio recordings. In such a setting, we have a query that is either a segment of a music recording or a complete recording. The goal is then to retrieve music recordings that are similar to the query, based on some notion of similarity. Following Casey et al. [1] and Grosche et al. [2], we can categorize such retrieval scenarios according to two properties: Specificity and granularity. Specificity refers to the degree of similarity between the query and the database documents. High specificity is related to a strict notion of similarity, whereas low specificity refers to a rather vague one. The granularity refers to the length of the query, which can range from a short audio snippet (a couple of seconds) to an entire recording (several minutes).
A typical task of high specificity and low granularity is audio identification or audio fingerprinting, where the task is to identify the particular audio recording that is the source of the query [6,35]. At the lower end of the specificity scale are tasks such as genre recognition [36]. A medium-level specificity is associated with tasks such as audio matching [12,14], version identification [31,37], live song detection [28,38], and cover song retrieval [8,10,13,15,16,21–27,29,30]. In all of these tasks, one allows for variations as they typically occur in different performances and arrangements of a piece of music. The tasks differ in their granularity (e.g., shorter queries for audio matching and longer queries for version identification) and the specific types of music recordings of interest (e.g., live versions by the same performers for live song detection, or popular music with different performers for cover song retrieval). Cover song retrieval is a well-established research task, where one considers variations as they occur in different performances of the same piece of popular music. Such variations concern many different musical facets, including timbre, tempo, timing, structure, key, harmony, and lyrics [25]. A task that is similar to cover song retrieval is version identification for Western classical music, where one allows for variations as they occur in different performances of the same piece of Western classical music [12,14,18–20,39–41]. This scenario is associated with a higher specificity than cover song retrieval because we expect fewer variations in Western classical music than in popular music. For example, the rough harmonic progression is the same among different performances of the same classical piece, which is not always the case for cover songs of popular music. For that reason, cover song retrieval can be considered a more difficult task compared to version identification for classical music. For more details on this, we refer to the overview article by Serrà et al. [25]. In this paper, we focus on audio matching or version identification for Western classical music.
Miotto and Orio address version identification for classical music by modeling each musical work with a hidden Markov model (HMM) [18,19]. In these studies, a query is identified by choosing the HMM that models the query with the highest probability. To avoid the time-consuming evaluation of all HMMs, the authors propose to first select a small subset of potential candidates [19] and then to evaluate only the HMMs for the most promising candidates. Instead of HMMs, other audio alignment algorithms such as particle filtering have been used in similar settings [20]. Classical music retrieval was also approached as a multi-modal scenario [39,40], in particular using audio and symbolic representations [42]. Arzt et al. [39,40] present an approach that uses symbolic music representations (as the database) to identify a query audio snippet of classical piano music. The query is first transcribed into a series of symbolic events by a neural network. Then, a symbolic fingerprinting algorithm can be applied. This system performs well for music where the automatic transcription step achieves good results, e.g., piano music. However, the approach is problematic for kinds of music where automatic transcription is more difficult, such as complex orchestral music. Another line of work uses chroma feature sequences of short audio fragments for audio matching of classical music [12,14]. Our contribution builds upon this line of work and we refer to these studies in the following sections.

Shingle-Based Retrieval Scenario
Closely following Grosche and Müller [12], we now formalize the shingle-based retrieval strategy used in this paper. Given a short fragment of a music recording as a query, the goal is to retrieve all versions (documents) of the same piece of music underlying the query. To this end, we compare the database and query recordings based on a particular feature representation. The retrieval result for a query is given as a ranked list of documents. Figure 1 illustrates this general procedure. In the following, we explain the feature computation as well as the retrieval approach.
Our approach is based on so-called "shingles" [9], which are short sequences of feature vectors. We denote such a shingle of feature dimension $F \in \mathbb{N}$ and fixed length $L \in \mathbb{N}$ by $S \in \mathbb{R}^{F \times L}$. In general, we generate such shingles from audio recordings, which are represented by longer feature sequences of variable length. The feature sequence of an audio recording is denoted by $C = (c_1, \ldots, c_N)$ of length $N \in \mathbb{N}$ and consists of feature vectors $c_n \in \mathbb{R}^F$ for $n \in \{1, \ldots, N\}$. We use chroma-based audio features, which measure local energy distributions of the audio recording in the $F = 12$ chromatic pitch class bands [7] (Chapter 3). More precisely, we use a variant called CENS (chroma energy distribution normalized statistics) [43], i.e., chroma features with a post-processing that makes them more suited for retrieval: First, each chroma vector is $\ell^1$-normalized. Then, the resulting values of the chroma features are quantized in a logarithmic way (by mapping logarithmically spaced value ranges to integer values, e.g., values between 0.05 and 0.1 are mapped to 1, values between 0.1 and 0.2 to 2, etc.). Next, the chroma feature sequence is temporally smoothed (using a smoothing length of 4 s) and downsampled (from 10 Hz to 1 Hz). Finally, each chroma vector is $\ell^2$-normalized. The most important aspect of this post-processing is the temporal smoothing, because it makes the features more robust against tempo differences. This chroma variant is state-of-the-art for the given task [12]. The upper part of Figure 2 illustrates such a feature sequence. Shingles are generated from a feature sequence $C$ as subsequences of length $L$:

$$C_n^L = (c_n, \ldots, c_{n+L-1}) \qquad (1)$$

for $n \in \{1, \ldots, N - L + 1\}$.

We now describe the retrieval approach for a fixed query (denoted as $Q$) and database document (denoted as $D$). For now, the query consists of a single shingle $S_Q \in \mathbb{R}^{F \times L}$. Previous investigations of the query length found that a length of 20 s is well suited for performing the retrieval with a single audio shingle [12]. In our study, we use such a shingle length, resulting in a shingle dimensionality of $F \times L = 12 \times 20 = 240$. The document $D$ is represented by a set of shingles

$$\mathcal{S}_D \subset \mathbb{R}^{F \times L}. \qquad (2)$$

This set $\mathcal{S}_D$ consists of all subsequences $C_n^L$ from the audio recording of document $D$ (as defined in Equation (1)), generated with a hop size $H = 1$. In the next step, $S_Q$ is compared with all shingles from the set $\mathcal{S}_D$. The comparison between $Q$ and $D$ is achieved by first transforming the shingles to vectors by a function

$$f: \mathbb{R}^{F \times L} \to \mathbb{R}^K \qquad (3)$$

for some $K \in \mathbb{N}$. In the brute-force case, $f$ just flattens a matrix by concatenating all columns (i.e., $K = FL$). Using shingle embedding methods, as explained later, $f$ performs a dimensionality reduction (typically $K \ll FL$). Given two shingles $S^{(1)}$ and $S^{(2)}$, we compare them in the embedding space using a distance function

$$d: \mathbb{R}^K \times \mathbb{R}^K \to \mathbb{R}. \qquad (4)$$

In the following, we use the squared Euclidean distance:

$$d(x^{(1)}, x^{(2)}) = \left\| x^{(1)} - x^{(2)} \right\|_2^2, \qquad (5)$$

where $x^{(1)} = f(S^{(1)})$ and $x^{(2)} = f(S^{(2)})$. Given a query $Q$ (in the form of a shingle $S_Q$) and a database document $D$ (in the form of a set of shingles $\mathcal{S}_D$), we compute the distance between $Q$ and $D$ by

$$\delta_D(S_Q) = \min_{S \in \mathcal{S}_D} d\big(f(S_Q), f(S)\big). \qquad (6)$$

Finally, for $Q$ and a data collection $\mathcal{D}$ containing $|\mathcal{D}|$ documents, we compute $\delta_D(S_Q)$ between $Q$ and all $D \in \mathcal{D}$ and rank the results by ascending $\delta_D(S_Q)$.
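The retrieval procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in the experiments: the function names (`make_shingles`, `flatten`, `document_distance`, `rank_documents`) and the random toy "chroma" sequences are assumptions made for this example, and the embedding is the brute-force flattening.

```python
import numpy as np

def make_shingles(C, L=20, H=1):
    """Cut a feature sequence C (shape F x N) into all shingles of shape F x L."""
    N = C.shape[1]
    return np.stack([C[:, n:n + L] for n in range(0, N - L + 1, H)])

def flatten(S):
    """Brute-force embedding f: concatenate the columns of a shingle (K = F * L)."""
    return S.reshape(-1, order="F")

def document_distance(S_Q, shingles_D, f=flatten):
    """Minimum squared Euclidean distance between the query and all document shingles."""
    x_q = f(S_Q)
    X_d = np.stack([f(S) for S in shingles_D])
    return float(np.min(np.sum((X_d - x_q) ** 2, axis=1)))

def rank_documents(S_Q, database, f=flatten):
    """Return the document names sorted by ascending distance to the query."""
    dists = {name: document_distance(S_Q, sh, f) for name, sh in database.items()}
    return sorted(dists, key=dists.get)

# Toy data (hypothetical): random sequences with l2-normalized columns at 1 Hz.
rng = np.random.default_rng(0)

def random_chroma(N):
    C = rng.random((12, N))
    return C / np.linalg.norm(C, axis=0, keepdims=True)

doc_a, doc_b = random_chroma(60), random_chroma(80)
database = {"doc_a": make_shingles(doc_a), "doc_b": make_shingles(doc_b)}
query = doc_b[:, 30:50]          # a 20 s excerpt taken from doc_b
ranking = rank_documents(query, database)
print(ranking)
```

In this toy setting, the query is an exact excerpt of `doc_b`, so its distance to `doc_b` is zero. In the actual scenario, the document containing the query is excluded and the relevant documents are other versions of the same piece.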
Previous work [12] used maximum cosine similarity instead of minimum Euclidean distance. We use the squared Euclidean distance because this distance naturally occurs in both dimensionality reduction methods, as explained in Section 4. Note that, since each feature vector of the shingles is normalized, Euclidean distance and cosine similarity lead to similar retrieval results. The relation between the squared Euclidean distance of $\ell^2$-normalized vectors and their cosine similarity is given by:

$$d(x^{(1)}, x^{(2)}) = 2 \left( 1 - \cos(x^{(1)}, x^{(2)}) \right).$$
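The stated relation can be checked numerically. The following sketch uses random vectors as stand-ins for real features (an assumption of this example). It also illustrates a consequence that follows from the same algebra: if each of the $L$ columns of a shingle has unit norm, the flattened shingle has squared norm $L$, and the relation holds with an additional factor $L$, which does not change the ranking.

```python
import numpy as np

def sq_euclidean(x, y):
    return float(np.sum((x - y) ** 2))

def cosine_similarity(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

rng = np.random.default_rng(0)

# Case 1: two l2-normalized vectors.
x1 = rng.random(12)
x2 = rng.random(12)
x1 /= np.linalg.norm(x1)
x2 /= np.linalg.norm(x2)
lhs = sq_euclidean(x1, x2)
rhs = 2 * (1 - cosine_similarity(x1, x2))

# Case 2: flattened shingles whose L columns are each l2-normalized.
L = 20
S1 = rng.random((12, L))
S2 = rng.random((12, L))
S1 /= np.linalg.norm(S1, axis=0, keepdims=True)
S2 /= np.linalg.norm(S2, axis=0, keepdims=True)
f1 = S1.reshape(-1, order="F")
f2 = S2.reshape(-1, order="F")
lhs_shingle = sq_euclidean(f1, f2)
rhs_shingle = 2 * L * (1 - cosine_similarity(f1, f2))

print(lhs, rhs)
print(lhs_shingle, rhs_shingle)
```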

Shingle Embedding
In this section, we explain the brute-force approach and the two shingle embedding techniques. The first uses standard PCA and the second is based on siamese neural networks, which are trained with the so-called triplet loss.

Brute Force
As a baseline approach, we just flatten each audio shingle $S \in \mathbb{R}^{12 \times 20}$ to a vector of size $K = 12 \times 20 = 240$ by concatenating the feature vectors. In other words, we do not apply any dimensionality reduction. Note that no training is involved in this approach, contrary to the embedding methods that are explained in the following subsections. This baseline approach of using the full-dimensional audio shingles was also used in the study by Grosche and Müller [12].

PCA
As a first embedding approach, we use PCA [44] to reduce the dimensionality of the audio shingles. Each audio shingle $S \in \mathbb{R}^{12 \times 20}$ is seen as a vector of size $12 \times 20 = 240$. PCA learns a basis that is used to linearly project these audio shingles. Dimensionality reduction is then performed by considering only the first $K$ basis vectors instead of the complete set of basis vectors. This basis is learned such that the squared Euclidean distance between the original and the dimensionality-reduced shingles of a given training set is minimized. As a result, one obtains a linear transformation that maps an audio shingle $S$ to an embedding vector $x \in \mathbb{R}^K$. This mapping $f(S) = x$ is then used in the experiments with a separate test set by embedding the entire database as well as the queries before performing the retrieval (see Section 5.4). The authors of [12] also used PCA for dimensionality reduction, but they applied PCA only to individual chroma vectors of dimension $F$. Thus, their approach can only make use of the chroma dimension, without any temporal information. In our approach, we apply PCA to entire shingles of dimension $F \times L$ and can, therefore, exploit redundancies in the temporal sequence of chroma vectors for dimensionality reduction.
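A PCA-based shingle embedding of this kind can be sketched with a singular value decomposition. The sketch below is a generic PCA, not necessarily the exact variant used in the experiments: the mean-centering step, the function names `fit_pca` and `embed`, and the random stand-in data are assumptions of this example.

```python
import numpy as np

def fit_pca(X_train, K):
    """Learn a K-dimensional PCA basis from flattened training shingles (rows of X_train)."""
    mean = X_train.mean(axis=0)
    # Rows of Vt are the principal directions, sorted by decreasing variance.
    _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return mean, Vt[:K]

def embed(X, mean, basis):
    """Project flattened shingles onto the first K principal directions."""
    return (X - mean) @ basis.T

# Toy data (hypothetical): 500 training and 50 test shingles, each flattened to 240 values.
rng = np.random.default_rng(0)
X_train = rng.random((500, 240))
X_test = rng.random((50, 240))

mean, basis = fit_pca(X_train, K=8)
X_emb = embed(X_test, mean, basis)
print(X_emb.shape)
```

Once the basis is learned on the training set, the same linear map is applied to all database shingles and to the queries, so that the nearest neighbor search operates on $K$-dimensional vectors.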

Neural Network with Triplet Loss
As a second embedding approach, we use neural networks to reduce the dimensionality of the audio shingles. Such networks can be used to learn non-linear functions that map high-dimensional input representations to low-dimensional output representations. We now show how the dimensionality of the audio shingles can be reduced using a convolutional neural network trained with the triplet loss [33]. During training, our network embeds three shingles for computing the loss, which is then used for updating the parameters of the network. This will be explained in detail later in this section. Figure 3 shows an illustration of this process.
Figure 3. The shingles $(S_a, S_p, S_n)$ are embedded into the vectors $(x_a, x_p, x_n)$ by a neural network.
For learning embeddings, a loss function that enforces specified similarities and dissimilarities in the embedding space is used. An example of such a loss function is the hinge loss, which has been used to learn representations for cross-modal score-to-audio embeddings [45] and for artist similarity [46]. A related loss function is the triplet loss, which was developed for the task of face recognition [33] and then adapted for audio and music processing tasks such as speech retrieval [47], sound event classification [48], audio fingerprinting [49], artist clustering [50], music similarity [51], and cover song retrieval [37]. The latter study is conceptually similar to our approach, but presents results that are worse than those achieved by more traditional approaches [21].
To define the triplet loss in our scenario, we select three shingles: An anchor shingle $S_a$, a positive shingle $S_p$, and a negative shingle $S_n$ (see the upper part of Figure 3 for an example). With respect to the anchor shingle, the positive shingle is musically similar, while the negative shingle is musically dissimilar. More precisely, the anchor and positive shingles originate from different versions of the same piece of music and correspond to the same musical position within that piece. Anchor and negative shingles do not correspond to each other.
The goal is to find an embedding function $f: \mathbb{R}^{F \times L} \to \mathbb{R}^K$ such that $f(S_a)$ is numerically close to $f(S_p)$ and far from $f(S_n)$. The lower part of Figure 3 visualizes embeddings with this property. Let

$$x_a = f(S_a), \quad x_p = f(S_p), \quad x_n = f(S_n).$$

Then, following Schroff et al. [33], we define the loss as

$$\mathcal{L}(x_a, x_p, x_n) = \max\big(0,\; d(x_a, x_p) - d(x_a, x_n) + \alpha\big),$$

with $\alpha \in \mathbb{R}_{\geq 0}$ being a margin parameter and $d$ being a distance function. In our experiments, we use the squared Euclidean distance, as defined in Equation (5). The cost function $J$ to be optimized during batch gradient descent is the average of losses over a batch $\mathcal{B}$ of triplets:

$$J = \frac{1}{|\mathcal{B}|} \sum_{(x_a, x_p, x_n) \in \mathcal{B}} \mathcal{L}(x_a, x_p, x_n).$$

Our neural network architecture is summarized in Table 1. It comprises two blocks, each consisting of two convolutional layers and a max-pooling layer. Finally, a dense layer reduces the network's internal representation to the embedding dimensionality $K$; an $\ell^2$-normalization layer ensures that the embedding vectors do not become arbitrarily large or small. This topology is inspired by a network for multi-modal music embeddings [45]. We simplified the architecture (e.g., using 2 blocks instead of 4) for two reasons: First, our scenario is mono-modal and therefore less complex compared to the multi-modal task [45], and second, our input dimension is smaller (using chroma features instead of spectral features). Our network has relatively few parameters compared with many other deep learning systems. For example, for $K = 10$, the network has about 6000 parameters.
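To make the loss concrete, the following NumPy sketch evaluates the triplet loss and the batch cost for a few hand-picked, $\ell^2$-normalized two-dimensional embeddings. The function names and the toy vectors are assumptions of this example; the margin $\alpha = 1.3$ is the value reported in the experiments. The sketch only evaluates the loss and does not train a network.

```python
import numpy as np

def sq_euclidean(x, y):
    """Squared Euclidean distance along the last axis."""
    return np.sum((x - y) ** 2, axis=-1)

def triplet_loss(x_a, x_p, x_n, alpha=1.3):
    """Hinge-style triplet loss: zero once the negative is farther than the positive by the margin."""
    return np.maximum(0.0, sq_euclidean(x_a, x_p) - sq_euclidean(x_a, x_n) + alpha)

def batch_cost(X_a, X_p, X_n, alpha=1.3):
    """Average triplet loss over a batch (one triplet per row)."""
    return float(np.mean(triplet_loss(X_a, X_p, X_n, alpha)))

# Three toy triplets with unit-norm embeddings (K = 2).
X_a = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
X_p = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
X_n = np.array([[-1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])

losses = triplet_loss(X_a, X_p, X_n)
cost = batch_cost(X_a, X_p, X_n)
print(losses, cost)
```

The first triplet is already well separated and contributes no loss. In the second, anchor, positive, and negative coincide, so the loss equals the margin. In the third, the negative is closer than the positive, which yields the largest loss of the three.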

Basic Experiments
In this section, we report on our experiments with medium-sized data sets for training and testing. Later, in Section 6, we will also report on experiments with an extended data set. For now, we use medium-sized data sets similar to those used in previous studies [12] to be comparable with this work. First, we describe the data sets and our evaluation measures. Second, we discuss the evaluation results obtained using the brute-force approach, the PCA-based embeddings, and the approach using neural networks. Third, we analyze the influence of the margin parameter $\alpha$ (used in the loss function) on the retrieval results. Finally, we report on a runtime experiment that indicates the impact of the embeddings' dimensionality on the retrieval time.

Training and Testing Data Sets
In our experiments, we used audio recordings of Western classical music. In particular, we used pieces from three composers: Symphonies by Beethoven, Mazurkas by Chopin, and pieces from Vivaldi's The Four Seasons. These composers cover three different musical eras, namely the Baroque, Classical, and Romantic periods. Table 2 shows a list of the musical pieces underlying the recordings. For each musical piece, our database contains several versions that are performed by different orchestras, conductors, and soloists. To make our results comparable to prior work, we use audio data sets that are similar to the one used in the study by Grosche and Müller [12]. The data sets comprise recordings of some of Frédéric Chopin's Mazurkas, which have been collected within the Mazurka Project [52]. There are two disjoint sets: $\mathcal{D}_1$ (357 recordings, 62,867 shingles) was used for training the dimensionality reduction methods, and $\mathcal{D}_2$ (330 recordings, 52,332 shingles) was used for evaluating the retrieval quality based on the embedded shingles. Furthermore, circular chroma shifts were applied to the training set, which simulate musical transpositions and increase the number of shingles used for training by a factor of twelve. This process can be seen as a type of data augmentation. Both training and test sets are musically related, as they contain the same composers and the same music genres. However, the musical pieces in both sets are different.
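The augmentation by circular chroma shifts can be illustrated as follows. This is a sketch under the assumption that shingles are stored as an array of shape (number of shingles, 12, $L$) with the chroma bands along the second axis; the function name and the random toy shingles are hypothetical.

```python
import numpy as np

def augment_with_transpositions(shingles):
    """Return all 12 circular chroma shifts of each shingle (shape: 12 * M x 12 x L)."""
    # Rolling along the chroma axis by k simulates a transposition by k semitones.
    return np.concatenate([np.roll(shingles, k, axis=1) for k in range(12)])

rng = np.random.default_rng(0)
shingles = rng.random((5, 12, 20))        # toy stand-in for training shingles
augmented = augment_with_transpositions(shingles)
print(augmented.shape)

# Number of training shingles after augmenting the 62,867 shingles of the training set:
num_augmented = 62867 * 12
print(num_augmented)
```

A shift by 12 semitones maps each shingle onto itself, so exactly twelve distinct shifts exist per shingle.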

Evaluation Procedure
In our testing stage, we used the data set $\mathcal{D}_2$, which is independent of the training set $\mathcal{D}_1$. When performing retrieval using a query $Q$, we computed $\delta_D(S_Q)$ for all $D \in \mathcal{D}_2$ and obtained a ranked list of documents, as described in Section 3. We excluded the document containing the query from the database $\mathcal{D}_2$ so that there are no trivial retrieval results.
For evaluating the results, we considered three evaluation measures. First, we used precision at one (P@1), which is 1 if the top-ranked document is relevant, and 0 otherwise. However, not only the top rank is of relevance in our retrieval scenario. This was taken into account by our second evaluation measure, called R-precision ($P_R$). Here, $R \in \mathbb{N}$ denotes the number of relevant documents for a given query. Note that this number may be different for different queries. $P_R$ is defined as the proportion of relevant documents among the first $R$ ranks. Third, we used average precision, which is a standard evaluation measure for information retrieval that takes the entire list of ranks into account. It is defined as the mean of the precision scores for the ranks with retrieved relevant documents. This measure is less easily interpretable than P@1 or $P_R$, but it is the most comprehensive evaluation measure that we used. For a more detailed explanation, we refer to the book by Manning et al. [53] (Chapter 8).
For our experiments, we used a set of queries, which we created by equidistantly sampling 10 queries from each recording of our test set $\mathcal{D}_2$, resulting in 3300 queries. The evaluation results were then averaged over all queries. In the case of average precision, the averaged measure is referred to as mean average precision (MAP).
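The three evaluation measures can be computed from a ranked list of relevance flags. The following plain-Python sketch follows the definitions given above; the function names and the example ranking are assumptions of this illustration.

```python
def precision_at_one(relevance):
    """1.0 if the top-ranked document is relevant, else 0.0."""
    return 1.0 if relevance[0] else 0.0

def r_precision(relevance):
    """Proportion of relevant documents among the first R ranks (R = number of relevant documents)."""
    R = sum(relevance)
    return sum(relevance[:R]) / R if R > 0 else 0.0

def average_precision(relevance):
    """Mean of the precision values at the ranks of the relevant documents."""
    hits, precisions = 0, []
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Example ranking for one query: True marks a relevant document.
relevance = [True, False, True, True, False]
p1 = precision_at_one(relevance)
pr = r_precision(relevance)
ap = average_precision(relevance)
print(p1, pr, ap)
```

For this example, three documents are relevant, two of them appear within the first three ranks, and the relevant documents are found at ranks 1, 3, and 4. Averaging `average_precision` over all queries gives the MAP.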

Brute Force
As a baseline, we performed a first retrieval experiment based on the original audio shingles without dimensionality reduction (see Section 4.1). Since no training was involved, we report the results in Table 3 (upper rows) for both the training set $\mathcal{D}_1$ and the test set $\mathcal{D}_2$. For example, in the case of the test set $\mathcal{D}_2$, we achieved a P@1 value of 0.996, which means that only 13 of the 3300 queries did not yield a relevant document on the top rank. Furthermore, the MAP value of 0.972 indicates that almost all relevant documents appear at the beginning of the ranked list. The results for the training set are similar, indicating a comparable complexity of both data sets.
The parameters and feature design of the shingles were chosen in such a way that the brute-force approach yielded close to perfect results for the given task. For a comparison, we also performed an experiment using a temporal alignment procedure, similar to the classical state-of-the-art approaches for cover song retrieval by Serrà et al. [24,25,54]. In essence, these approaches are based on the combination of enhanced chroma representations with non-linear temporal alignment procedures. In our experiments, we performed subsequence dynamic time warping (SDTW) [7] (Chapter 7) to align the feature sequences of a query and a database document. The approaches of Serrà et al. [24,54] used local alignment procedures that aligned subsequences of the query to subsequences of the database document (e.g., Smith-Waterman, or $Q_{\max}$ algorithm). This was motivated by the task of popular music cover song retrieval where the query is a complete recording. In this case, query and relevant database documents typically have a different structure. However, in our case, we are dealing with Western classical music, and the query is only a 20 s excerpt. Therefore, we can expect that the query is entirely represented as a musically corresponding subsequence in the relevant database documents. Under this assumption, SDTW is more or less equivalent to the Smith-Waterman algorithm. For SDTW, we used the Euclidean distance, the step size condition {(2, 1), (1, 2), (1, 1)}, and the weights (2, 1, 1) for vertical, horizontal, and diagonal steps, respectively. As a result of SDTW, we obtained a matching function; the minimum of this matching function was used as a distance measure for ranking the documents. Table 3 (middle rows) shows the retrieval results for this experiment, using the CENS features described in Section 3. These results are very close to the results of the shingle-based brute-force approach. This confirms that in our music scenario, no alignment procedure is needed when using CENS processing.
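The SDTW-based distance can be sketched as follows. This is a simplified illustration rather than the implementation used in the experiments. In particular, it is an assumption of this sketch that the step (2, 1), which advances two frames in the query and one in the document, is the "vertical" step with weight 2, and that (1, 2) is the "horizontal" step. The function name and the toy data are hypothetical.

```python
import numpy as np

def sdtw_matching_function(X, Y, weights=(2.0, 1.0, 1.0)):
    """Matching function of subsequence DTW for a query X (F x N) and a document Y (F x M).

    Uses the steps (2, 1), (1, 2), (1, 1) with weights (vertical, horizontal, diagonal).
    """
    w_v, w_h, w_d = weights
    N, M = X.shape[1], Y.shape[1]
    # Pairwise Euclidean distances between all query and document frames (N x M).
    C = np.linalg.norm(X[:, :, None] - Y[:, None, :], axis=0)
    # Accumulated cost matrix, padded with two rows/columns of infinity.
    D = np.full((N + 2, M + 2), np.inf)
    D[2, 2:] = C[0, :]                      # the match may start anywhere in the document
    for n in range(1, N):
        for m in range(M):
            i, j = n + 2, m + 2
            D[i, j] = min(
                D[i - 1, j - 1] + w_d * C[n, m],   # diagonal step (1, 1)
                D[i - 2, j - 1] + w_v * C[n, m],   # "vertical" step (2, 1), assumed mapping
                D[i - 1, j - 2] + w_h * C[n, m],   # "horizontal" step (1, 2), assumed mapping
            )
    return D[N + 1, 2:]                      # last query row: cost of matches ending at each m

# Toy example: the query is an exact excerpt of the document (frames 5 to 9).
rng = np.random.default_rng(0)
Y = rng.random((12, 40))
X = Y[:, 5:10]
delta = sdtw_matching_function(X, Y)
distance = float(delta.min())
end_index = int(delta.argmin())
print(distance, end_index)
```

The minimum of the matching function serves as the distance between query and document. Its position indicates where the best match ends in the document.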
One main motivation for using CENS smoothing is to introduce robustness to local tempo variations. When an alignment procedure is used, such smoothing is not needed. Therefore, we also conducted an alignment experiment, using the original chroma features without CENS post-processing. In this setting, the feature rate was 10 Hz instead of 1 Hz. Table 3 (lower rows) shows the retrieval results using these features. In the case of the test set $\mathcal{D}_2$, we achieved a P@1 value of 0.999, which means that almost all queries yielded a relevant document on the top rank. In all evaluation measures, we see small improvements over the shingle-based brute-force approach. However, this goes along with a dramatic increase in runtime. The runtime for the overall retrieval experiment in our setting increased from about a minute for the shingle-based brute-force approach (1 Hz features) to several hours for the alignment-based approach using 10 Hz features. We describe further aspects related to runtime in Section 5.7.
In summary, we showed that we can get close-to-perfect results with the brute-force approaches. In other words, when only looking at retrieval quality, the problem of cross-version retrieval for Western classical music can be regarded as being largely solved. However, brute-force approaches are time-consuming. The main focus of this paper is efficiency, and we want to see to what extent we can keep the retrieval quality while reducing the shingle dimensionality. Therefore, in the following, we are not aiming to improve the brute-force approaches, but to keep a comparable result while using low-dimensional embeddings of the audio shingles. Unless mentioned otherwise, brute-force always refers to the shingle-based brute-force approach in the following.

Table 3. Retrieval results for various brute-force approaches: The shingle-based brute-force approach ($K = 240$), an alignment-based brute-force approach using chroma energy distribution normalized statistics (CENS) features (1 Hz), and an alignment-based brute-force approach using chroma features (10 Hz) without CENS post-processing.

PCA
As the second approach, we applied dimensionality reduction with PCA as described in Section 4.2. We used the training set $\mathcal{D}_1$ to learn the PCA basis and evaluated the approach with the test set $\mathcal{D}_2$. Table 4 shows the evaluation results for two different PCA-based reduction strategies: The left columns (GRO) refer to the reduction of individual chroma vectors as done by Grosche and Müller [12], and the right columns (PCA) refer to the proposed reduction of entire shingles. The rows correspond to the considered dimensionalities 40, 60, and 80. Let us take the case of $K = 40$ as an example. In the first approach (GRO), each chroma vector was reduced to two dimensions, leading to the dimensionality of $K = 2 \times 20 = 40$. This strategy resulted in an MAP value of 0.832. In the shingle-based reduction (PCA), an entire 240-dimensional feature sequence was reduced altogether to a 40-dimensional vector. This approach led to an MAP value of 0.964. In general, our experiments showed that a shingle-based reduction leads to better retrieval results. This is not surprising, because this approach can exploit temporal redundancies for dimensionality reduction. In the following, we aim to reduce the dimensionality to a degree that would not have been possible with the first approach (GRO). Columns 2-4 of Table 5 show the evaluation results obtained with our shingle-based approach for much lower dimensionalities. The retrieval quality consistently increases with an increase of dimensionality from an MAP value of 0.580 for $K = 3$ to 0.952 for $K = 30$. Let us consider the dimensionalities of 6 and 12 as exemplary cases. For $K = 6$, the P@1 value is 0.857, i.e., for 472 of the 3300 queries, the top-ranked document was not relevant. For $K = 12$ (P@1: 0.957), this is the case for only 142 queries.

Table 4. Retrieval results for principal component analysis (PCA)-based dimensionality reduction of individual chroma vectors (GRO) [12] and entire shingles (PCA, proposed) using the test set $\mathcal{D}_2$.

Neural Network with Triplet Loss
As the third approach, we applied dimensionality reduction with a deep neural network (DNN) as described in Section 4.3. For training the neural network, we used triplets of shingles from the training set $\mathcal{D}_1$. They were generated with the constraint that the central time positions of the anchor and positive shingles corresponded to the same musical position in different versions of the same piece. The negative shingle did not musically correspond to the anchor shingle. To generate such musically meaningful triplets, we needed to compute musically corresponding time positions in all versions of the same pieces in a pre-processing stage. We used a dynamic-time-warping-based music synchronization approach [7] (Chapter 3) for this purpose. Furthermore, random circular shifts along the chroma axis were applied to avoid biasing the network towards the musical keys in our data set. The shifts applied to the anchor and the positive examples were the same, while the shift applied to the negative example was chosen independently.

This triplet generation procedure led to a combinatorial explosion of possible triplets. For that reason, not all possible triplets were provided to the network during training. We defined an "epoch" to consist of 2000 batches with a batch size of 128 triplets, used for mini-batch gradient descent with the Adam optimizer [55]. Other triplet loss studies sometimes control the triplet generation by a specific procedure called "semi-hard triplet mining" [33]. Preliminary experiments (not reported here) showed no improvements with this method in our case. Therefore, we did not apply this procedure in the experiments reported in this paper. In our first experiments, we fixed α = 1.3 (see Section 5.6 for a discussion) as well as a learning rate of $10^{-3}$ and trained a neural network for 10 epochs. It turned out that a larger number of epochs did not improve the retrieval results.
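The triplet generation described above can be sketched as follows. This is only an illustration under stated assumptions: the aligned center positions are taken as given (in the paper they come from music synchronization), the negative is drawn from a hypothetical pool of shingles of other pieces, and the names `shift_chroma` and `sample_triplet` are not from the paper.

```python
import numpy as np

BATCHES_PER_EPOCH = 2000
BATCH_SIZE = 128
TRIPLETS_PER_EPOCH = BATCHES_PER_EPOCH * BATCH_SIZE   # triplets seen in one "epoch"

def shift_chroma(shingle, shift):
    """Circularly shift a (12, L) shingle along the chroma (pitch class) axis."""
    return np.roll(shingle, shift, axis=0)

def sample_triplet(version_a, version_b, center_a, center_b, negative_pool, rng, L=20):
    """Draw one (anchor, positive, negative) triplet of (12, L) shingles.

    center_a and center_b are assumed to be musically corresponding frame
    indices in two versions of the same piece.
    """
    start_a, start_b = center_a - L // 2, center_b - L // 2
    anchor = version_a[:, start_a:start_a + L]
    positive = version_b[:, start_b:start_b + L]
    negative = negative_pool[rng.integers(len(negative_pool))]
    shift_ap = int(rng.integers(12))   # same shift for anchor and positive
    shift_n = int(rng.integers(12))    # independent shift for the negative
    return (shift_chroma(anchor, shift_ap),
            shift_chroma(positive, shift_ap),
            shift_chroma(negative, shift_n))

rng = np.random.default_rng(0)
version_a = rng.random((12, 100))
version_b = version_a.copy()              # toy "second version" with identical features
negative_pool = rng.random((5, 12, 20))   # toy shingles from other pieces
a, p, n = sample_triplet(version_a, version_b, 50, 50, negative_pool, rng)
```

With these settings, one epoch covers 256,000 triplets, so 10 epochs expose the network to about 2.56 million triplets, which is still a small sample of all possible combinations.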
Columns 5-7 of Table 5 show the evaluation results for a range of different dimensionalities from 3 to 30 using the test set $\mathcal{D}_2$. In general, the retrieval quality increases with increasing K, from an MAP value of 0.683 for K = 3 to 0.959 for K = 30. Let us consider some cases as examples. For K = 6, the P@1 value is 0.890, i.e., for 363 of the 3300 queries, the top-ranked document was not relevant. For K = 12 (P@1: 0.964), this is the case for only 119 queries. Compared to the PCA-based approach, the neural network especially improved the retrieval results for smaller dimensionalities like 6 or 8, where the MAP value is greater by more than 0.08 (e.g., K = 8, MAP for PCA: 0.806 and for DNN: 0.898). We observed rather small improvements in P@1, but there was a considerable increase of $P_R$ (e.g., K = 8, $P_R$ for PCA: 0.754 and for DNN: 0.856).
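For readers who want to relate the reported measures to a ranked list, here is a minimal sketch. It assumes the standard definitions of P@1, R-precision ($P_R$), and average precision over a full ranking of all documents; the paper defines its measures in a section not reproduced here, and the helper names are hypothetical.

```python
def precision_at_1(relevance):
    """1.0 if the top-ranked document is relevant, else 0.0."""
    return float(relevance[0])

def r_precision(relevance):
    """Fraction of relevant documents among the top R ranks (R = number of relevant docs)."""
    r = sum(relevance)
    return sum(relevance[:r]) / r

def average_precision(relevance):
    """Mean of the precision values at the ranks of the relevant documents."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits

def failed_queries(p_at_1, num_queries=3300):
    """Number of queries whose top-ranked document is not relevant."""
    return round(num_queries * (1.0 - p_at_1))

# Toy ranked list: 1 marks a relevant document, 0 a non-relevant one.
ranked = [1, 0, 1, 0, 0]
ap = average_precision(ranked)
```

The helper `failed_queries` reproduces the counts quoted in the text, for example 363 failed queries for a P@1 value of 0.890 and 119 for 0.964.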

Influence of α
In the following, we analyze the influence of the parameter α, which is used in the loss function as defined in Equation (8). This parameter can be interpreted as the margin between $d(x_a, x_p)$ and $d(x_a, x_n)$. Figure 4 shows the evaluation results (MAP) for various α values. For this experiment, we only used the dimensionalities of K = 6 and K = 12 as examples to illustrate general tendencies. For a given α and K, 25 neural networks were trained for 10 epochs with different random initializations. Then, they were used for dimensionality reduction in the retrieval scenario, resulting in 25 MAP values. From these, we computed the mean (µ) and the standard deviation (σ). The solid lines show µ and the light areas show ±σ around µ. For both K = 6 and K = 12, we see similar trends: α = 0 yields rather poor results. In this case, no margin is enforced, and the loss is zero as soon as $d(x_a, x_p) \leq d(x_a, x_n)$. However, increasing α only slightly leads to a clear improvement in MAP. Any α in the range of [0.1, 1.7] produces results of similar quality. For α > 1.7, the results deteriorate strongly. This can be explained as follows: Only a small positive margin is needed for retrieving the correct versions. Using a large margin brings no benefit for retrieval, but makes the training much harder. In summary, for 0.1 ≤ α ≤ 1.7, we see a stable overall behavior of the results with a small standard deviation, showing robustness to the initialization used.
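The role of the margin can be made concrete with a small sketch. It assumes the usual hinge form of the triplet loss, max(0, d(x_a, x_p) − d(x_a, x_n) + α), with d being the squared Euclidean distance; Equation (8) is not reproduced in this section, so the exact form used in the paper may differ in details.

```python
import numpy as np

def sq_dist(x, y):
    """Squared Euclidean distance along the last axis."""
    return np.sum((x - y) ** 2, axis=-1)

def triplet_loss(x_a, x_p, x_n, alpha):
    """Hinge-type triplet loss with margin alpha (assumed standard form)."""
    return np.maximum(0.0, sq_dist(x_a, x_p) - sq_dist(x_a, x_n) + alpha)

# Toy unit-norm embeddings: the positive is closer to the anchor than the negative.
x_a = np.array([1.0, 0.0])
x_p = np.array([0.6, 0.8])   # d(x_a, x_p) = 0.8
x_n = np.array([0.0, 1.0])   # d(x_a, x_n) = 2.0

loss_no_margin = triplet_loss(x_a, x_p, x_n, alpha=0.0)
loss_margin = triplet_loss(x_a, x_p, x_n, alpha=1.3)
```

In this toy case the triplet is already ordered correctly, so the loss vanishes for α = 0, whereas α = 1.3 still yields a small positive loss (about 0.1) that keeps pushing the negative away from the anchor.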

Runtime Experiment
We showed that we can substantially reduce the dimensionality of the audio shingles while keeping their discriminative power. Such low-dimensional embeddings are beneficial when using indexing techniques for efficient nearest neighbor retrieval. To show this property, we conducted an experiment where we computed the distances of the 3300 queries to the documents of our test set $\mathcal{D}_2$ (as done for the previous experiments). We measured the runtimes for the alignment-based approaches (described in Section 5.3) and for the shingle-based approaches. In the case of the shingle-based approaches, we employed three different nearest neighbor search strategies. The first search strategy is a full search that simply computes all distances between the shingles of the database documents and the query shingles. For the second and third search strategies, we used k-d trees, which are standard data structures for searching in multidimensional spaces [56]. In the second strategy (Doc-Trees), we built one tree for each of the 330 documents of the test set and searched for the nearest neighbor to the query in each tree. As a consequence, each document occurred precisely once in the ranked list (as in the previous experiments). In the third strategy (Db-Tree), we built a single tree for all documents of the database and searched for the 330 nearest embeddings to the query. With this strategy, we were not able to rank all documents of the database because some of the returned embeddings originated from the same document. Note that the reported runtimes depend on the implementations used. So, rather than focusing on the absolute times, we want to emphasize the relative tendencies as well as the orders of magnitude.
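The three search strategies can be sketched on toy data as follows. Note the assumptions: random embeddings stand in for real ones, and `scipy.spatial.cKDTree` is used here as a convenient k-d tree, which is a substitution for the scikit-learn implementation mentioned in the text.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
K = 6
# Toy database: 8 documents, each a matrix of K-dimensional shingle embeddings.
docs = [rng.standard_normal((int(rng.integers(50, 80)), K)) for _ in range(8)]
query = rng.standard_normal(K)

# Strategy 1 (full search): all squared distances, minimum per document.
full = np.array([cdist(query[None, :], d, metric="sqeuclidean").min() for d in docs])

# Strategy 2 (Doc-Trees): one tree per document, nearest neighbor in each tree.
doc_trees = [cKDTree(d, leafsize=30) for d in docs]
# cKDTree returns Euclidean distances, so they are squared for comparison.
doc_tree_dist = np.array([t.query(query, k=1)[0] ** 2 for t in doc_trees])

# Strategy 3 (Db-Tree): one tree over all embeddings, as many neighbors as documents.
all_emb = np.vstack(docs)
owner = np.concatenate([np.full(len(d), i) for i, d in enumerate(docs)])
db_tree = cKDTree(all_emb, leafsize=30)
_, idx = db_tree.query(query, k=len(docs))
db_tree_docs = owner[idx]   # may contain the same document several times
```

The first two strategies give identical document-wise distances and therefore the same complete ranking. The Db-Tree result agrees on the top-ranked document but can return several embeddings of the same document, which is why not every document necessarily appears in its result list.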
We performed our experiments using Python 3.6.5 on a computer with an Intel Xeon CPU E5-2620 v4 (2.10 GHz) and 31 GiB RAM. For the alignment-based approaches, we used the SDTW implementation of librosa 0.7.1 [57], which is written in Python and accelerated by the just-in-time compiler numba. For the full search, we used the efficient pairwise-distance calculation of scipy 1.0.1 [58], which calls a highly optimized implementation in C. For the k-d trees (using a default leaf size of 30), we used the implementation of scikit-learn 0.20.1 [59], which is written in Cython.
Table 6 presents the runtimes for selected settings, averaged over several iterations of the retrieval experiment. The first column specifies the retrieval approach, the second column lists the runtimes for the full search strategy, and the third column lists the runtimes for the Db-Tree search strategy. For the alignment-based approach using 10 Hz features (first row), the runtime was about 6.5 h. When using 1 Hz features (second row), the runtime decreased to about 6 min. It is not surprising that the first alignment-based approach was much slower, since the feature rate was ten times higher and the alignment algorithms have quadratic complexity. For the brute-force shingle approach (K = 240, third row), the runtime was significantly lower than for both alignment-based approaches. It took 23.0 s for the full search strategy and 76.9 s for the Db-Tree search strategy. In our setting and with the implementations used, the Db-Tree strategy was slower than the full search for K = 240. It is well known that k-d trees degenerate for high dimensions [60]. With dimensionality reduction to K = 30 or K = 12 (fourth and fifth row), both search strategies were in a similar range (e.g., for K = 12, full search: 1.8 s, Db-Tree: 1.1 s). For lower dimensions (K = 6, sixth row), the Db-Tree search strategy substantially accelerated the nearest neighbor search (full search: 1.2 s, Db-Tree: 0.4 s).
Figure 5 shows the runtime (µ and ±σ for 100 iterations of the retrieval experiment) for the dimensionality reduction approaches. For the full search strategy, we see an almost linear relationship between the dimensionality K and the runtime. This strategy is independent of the underlying data distribution. For that reason, we do not distinguish between PCA- and DNN-based dimensionality reduction for the full search strategy. This is different for the tree-based search strategies (Db-Tree and Doc-Trees), where the data distributions of the PCA- and DNN-based embeddings lead to different search times. We also see an almost linear relationship for the Doc-Trees search strategy for both PCA- and DNN-based retrieval approaches. In general, for this strategy, the runtime is higher than for the full search. In our case, the data size of a single document is too small for the Doc-Trees strategy to give any benefit over the full search strategy. For the Db-Tree strategy, we see a slightly exponential growth of runtime with growing K. When the dimensionality falls below 15, the Db-Tree strategy starts to give benefits for the fast nearest neighbor search.

We want to emphasize again that the absolute runtimes are implementation-dependent. Therefore, we only highlight some general tendencies: The shingle-based approaches are significantly faster than the alignment-based approaches. In general, the experiments confirm that lower dimensionalities accelerate the nearest neighbor search. In particular, when using small dimensions (below 15 in our setting), we can further speed up the search by standard multidimensional indexing strategies.

Extended Experiments
In this section, we investigate the scalability and generalizability of our approach by evaluating the embedding methods on a larger and more diverse data set. In this way, we also provide deeper insights into the benefits and limitations of our dimensionality reduction approaches. First, in Section 6.1, we describe our extended data set, which is only used for testing. Then, in Section 6.2, we discuss the evaluation results obtained using the embedding methods trained on the smaller training set described in Section 5.1. Next, in Section 6.3, we investigate the discriminatory capacity of the low-dimensional embeddings by using a longer query length employing multiple shingles per query. Finally, in Section 6.4, we analyze the distances that appear in the nearest neighbor search to better understand the complexity of the retrieval problem depending on the composers and genres of classical music.

Extended Data Set
To test the scalability and generalizability of our approach, we compiled an extended data set $\mathcal{D}_3$, which is listed in Table 7. It includes the test set $\mathcal{D}_2$ (see Table 2) and additionally comprises a variety of further composers and genres, including piano and violin concertos (Brahms, Schumann, Tchaikovsky), symphonies (Mahler, Mozart), opera music (Wagner), and character pieces in piano solo and orchestral versions (Mussorgsky). The data set $\mathcal{D}_3$ consists of 535 recordings (205,522 shingles) and contains about 60 h of audio material, compared to the 16 h of the previous test set $\mathcal{D}_2$.

Evaluation
For our extended experiments, we kept the settings from the previous experiments and applied the same embedding methods as described in Section 4. In particular, the embedding methods were trained on the smaller training set $\mathcal{D}_1$ (see Table 2) as before, and were then evaluated with the extended data set $\mathcal{D}_3$, containing composers and genres of classical music that are not contained in the training set. The retrieval for a larger and more diverse data set obviously constitutes a harder task. For this reason, we can expect the retrieval results to decrease.
Figure 6 shows the MAP evaluation results for different embedding dimensionalities K on the smaller test set $\mathcal{D}_2$ (a), as reported in Section 5, and on the extended data set $\mathcal{D}_3$ (b). As expected, we see a decrease in retrieval quality for the larger data set. The results for the brute-force approach decrease from an MAP of 0.996 for $\mathcal{D}_2$ to 0.924 for $\mathcal{D}_3$. The results based on the embedding approaches decrease even more. In particular, the smaller dimensionalities result in an MAP of less than 0.6; e.g., for K = 6, the MAP is 0.441 and 0.500 for PCA and DNN, respectively. This confirms our assumption that the extended data set constitutes a harder task and leads to an increased probability for false negatives. One reason for the increased difficulty of the task is that there is a higher potential for confusion between the documents due to the increased data set size. Another reason is the increased diversity of database documents. A possible explanation for the poorer results of the embedding methods could be that the embeddings are simply overfitted to the composers and styles of the training set and are not very discriminative in general. In the next section, we will question this argument by considering longer query audio fragments.

Dependency on Query Length
A possible explanation for the results of the previous section is that the query length of 20 s is not discriminative enough to identify all versions of the same piece. To analyze this hypothesis, we increased the query length in the following experiments. An obvious way to do this is to increase the query shingle length. However, we want to keep our shingle size fixed to keep the same database structure for different query lengths. Therefore, instead of increasing the query shingle length, we used multiple successive shingles for each query.
Recall that, in our previous approach (see Section 3), we compared a query (Q) and a document (D) by performing a nearest neighbor search between a single query shingle $S_Q$ and all shingles from the set $\mathcal{S}_D$ of document submatrices (see Equation (2)). The squared Euclidean distance to the nearest neighbor, $\delta_D(S_Q) \in \mathbb{R}_{\geq 0}$, is regarded as the document-wise distance between Q and D and was used for ranking all documents of the database. Here, instead of a single query shingle $S_Q$, we collected a set of successive shingles $\mathcal{S}_Q$ from the query recording, as done for the documents (see Equation (2)). Instead of using a hop size H = 1 (as for the documents), we used a hop size of H = L/2 = 10. Denoting the number of shingles per query by $\lambda := |\mathcal{S}_Q|$, a query covers (λ − 1) × 10 + 20 s of audio content. For each of these shingles, we computed the document-wise distance $\delta_D$, as done previously (see Equation (6)). To obtain a single distance value between the query and the database document, the resulting shingle-wise distances were simply averaged:

$$\delta_D(\mathcal{S}_Q) := \frac{1}{\lambda} \sum_{S_Q \in \mathcal{S}_Q} \delta_D(S_Q).$$

Finally, all database documents $D \in \mathcal{D}$ were ranked by their averaged distances $\delta_D(\mathcal{S}_Q) \in \mathbb{R}_{\geq 0}$ in ascending order. Note that our previous experiments are the special case of λ = 1 (query length: 20 s).
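The multi-shingle query procedure can be summarized in a short sketch. It assumes that shingles are already embedded as vectors, that each database document is a matrix with one embedded shingle per row (hop size 1), and that the names `doc_distance`, `averaged_distance`, and `query_seconds` are purely illustrative.

```python
import numpy as np

def doc_distance(query_shingle, doc_shingles):
    """Squared Euclidean distance from one query shingle to its nearest document shingle."""
    return np.min(np.sum((doc_shingles - query_shingle) ** 2, axis=1))

def averaged_distance(query_shingles, doc_shingles):
    """Average of the document-wise distances over all shingles of the query."""
    return float(np.mean([doc_distance(q, doc_shingles) for q in query_shingles]))

def query_seconds(lam, L=20, H=10):
    """Audio duration covered by lam shingles of length L with hop size H (in seconds)."""
    return (lam - 1) * H + L

rng = np.random.default_rng(0)
# Toy database: three documents with 200 embedded shingles (12 dimensions) each.
database = {f"doc{i}": rng.standard_normal((200, 12)) for i in range(3)}

lam, H = 5, 10
# Toy query: 5 successive shingles (hop size 10) taken directly from doc0.
query = database["doc0"][30:30 + lam * H:H]

# Rank all documents by averaged distance in ascending order.
ranking = sorted(database, key=lambda name: averaged_distance(query, database[name]))
```

Since the toy query is cut from `doc0` itself, its averaged distance to that document is zero and `doc0` is ranked first. For λ = 5, `query_seconds` gives the 60 s query length used in the experiments.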
For our experiments, we sampled 10 queries from each recording in an equidistant way, as before. Note that each query now consists of multiple shingles. Figure 7 shows the MAP evaluation results for K = 12 using different query lengths on the extended data set. The retrieval quality considerably improves with increasing query length, e.g., the brute-force approach improves from an MAP value of 0.924 for λ = 1 to 0.976 for λ = 5. Similarly, the MAP values for the PCA- and DNN-based approaches increase strongly with the query length. Table 9 shows the results for each of the composers of the data set for λ = 5 (query length: 60 s). For Chopin, the DNN (MAP: 0.937) outperforms PCA (MAP: 0.813) and comes close to brute force (MAP: 0.974). For Mahler, PCA (MAP: 0.935) is slightly better than the DNN (MAP: 0.891). For Schumann's piano concerto, the DNN (MAP: 0.840) substantially outperforms PCA (MAP: 0.681). In contrast to the Chopin results, this cannot be explained by overfitting, because neither Schumann nor any piano concerto was part of the training set. Furthermore, we achieved an MAP value of 0.967 for the pieces by Brahms using the DNN approach. This is the best result among all composers for this approach, even better than for the composers of the training set. This shows a certain generalizability of the DNN embedding method.
In summary, our experiments show that low-dimensional embeddings need a longer query length to be discriminative enough when using a larger data set. Note that a query length of 60 s is still a medium duration compared to studies for related tasks. For example, in popular music cover song retrieval, an entire recording is often used as a query, e.g., in the approach by Serrà et al. [24] or the work by Casey et al. [9] (using entire recordings as queries, though with a short shingle length of 3 s). Furthermore, we showed that, in general, composers outside the training set do not necessarily lead to worse retrieval results compared to composers contained in the training set. Thus, we can assume a certain generalizability of our approach within the common practice period of Western classical music.

Distance Analysis
In the previous section, we addressed issues of scalability and generalizability in relation to the query length. Now we want to gain further insights into the challenges that occur in the retrieval scenario. For example, beyond rank-based evaluation measures, we want to find out whether particular compositions are easier or harder for retrieval than others, e.g., due to harmonic or melodic characteristics. Recall from Section 3 that the ranks are computed on the basis of the distances that appear between queries and documents. The retrieval problem is well behaved if the distances between queries and documents of the same piece of music (relevant documents) are substantially smaller than the distances between queries and documents of different pieces (non-relevant documents). To understand how well behaved our problem is, we now analyze the distances that appear in the nearest neighbor search.
Recall that we have $R \in \mathbb{N}$ relevant documents for a given query. We want to compare the distances to the relevant documents with the distances to the non-relevant documents. Since most of the non-relevant documents should be easily distinguishable from the relevant ones, we restrict our analysis to the most difficult non-relevant documents for the task. To balance the numbers of relevant and non-relevant documents, we only consider the R non-relevant documents with the smallest distances to the given query. Small distances mean that these non-relevant documents have the greatest confusion potential with the relevant documents. In the following, we analyze the relation between the distributions of distances for relevant documents and non-relevant documents, respectively. The relation between these distributions indicates how difficult the retrieval problem is. However, this analysis has to be interpreted with care, because queries and versions of different difficulty are pooled into a single distribution. For that reason, a strong separation of the distributions is a sufficient condition for perfect retrieval results, but not a necessary condition. In other words, even if the overlap between the distributions is large, the retrieval may still give excellent results. In this sense, we regard the distributions only as weak indicators and only evaluate them by visual inspection in the following. For a more statistically rigorous analysis of such distributions, we refer to the study by Casey et al. [9].
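The selection of distances for this analysis, and the reason why separated distributions are sufficient but not necessary for perfect retrieval, can be illustrated with a minimal sketch. The function name `split_distances` and all distance values are made up for illustration.

```python
import numpy as np

def split_distances(distances, relevant_mask):
    """Return the distances to the R relevant documents and to the R closest non-relevant ones."""
    distances = np.asarray(distances, dtype=float)
    relevant_mask = np.asarray(relevant_mask, dtype=bool)
    relevant = distances[relevant_mask]
    r = len(relevant)
    hardest_non_relevant = np.sort(distances[~relevant_mask])[:r]
    return relevant, hardest_non_relevant

# Two toy queries with R = 2 relevant documents each (distance values are invented).
rel_a, non_a = split_distances([0.1, 0.2, 0.3, 0.4, 0.9], [1, 1, 0, 0, 0])
rel_b, non_b = split_distances([0.5, 0.6, 0.7, 0.8, 1.5], [1, 1, 0, 0, 0])

# Each query on its own is ranked perfectly ...
perfect_a = rel_a.max() < non_a.min()
perfect_b = rel_b.max() < non_b.min()

# ... yet the pooled distributions overlap, because query B is "harder" overall.
pooled_rel = np.concatenate([rel_a, rel_b])
pooled_non = np.concatenate([non_a, non_b])
pooled_overlap = pooled_rel.max() > pooled_non.min()
```

Here both queries retrieve all relevant documents before any non-relevant one, although the pooled distributions overlap. This is why the distributions are treated only as weak indicators.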
Figure 8 shows such distributions for the brute-force approach, where the distributions of relevant document distances appear in orange and the distributions of non-relevant document distances appear in blue. The three rows show distributions for the composers that are part of both $\mathcal{D}_2$ and $\mathcal{D}_3$ (Beethoven, Chopin, and Vivaldi); the three columns correspond to different evaluation settings. The left column refers to the smaller test set $\mathcal{D}_2$ with λ = 1 (query length: 20 s), the middle column refers to the extended test set $\mathcal{D}_3$ with λ = 1 (query length: 20 s), and the right column refers to the extended test set $\mathcal{D}_3$ with λ = 5 (query length: 60 s). Considering Chopin in the left column, we see that the center of the orange distribution is much further to the left than that of the blue distribution. This means that the distances to the relevant documents are generally smaller than the distances to the non-relevant documents. We also see a small overlap between both distributions, which is caused by the fact that some distances to non-relevant documents are smaller than the greatest distances to relevant documents. However, since the distances that lead to the overlap could relate to different queries, we cannot conclude that this necessarily leads to confusion in the retrieval step. In general, since the overlap is small, the distributions indicate that the retrieval problem is well behaved for Chopin. For Beethoven and Vivaldi (left column), there is more overlap. This suggests that the Chopin pieces are "easier" in the sense that they are more discriminative than other pieces. For the extended data set (middle column), we see similar tendencies for all three composers. Compared to the smaller data set (left column), the blue distributions come a bit closer to the orange ones. The orange distributions are identical to the ones for the smaller data set because the distances to the relevant documents are the same. Only the distances to non-relevant documents decrease, because the extended data set contains more non-relevant documents with possibly smaller distances. When using a longer query length (λ = 5, right column), the orange and blue distributions are better separated for all three composers. The distances for the relevant documents only slightly increase because of the longer query length, but the distances to the non-relevant documents increase strongly, which leads to a better separation between the distributions.

So far, we analyzed distributions for the brute-force approach to gain insights into the musical complexity of the retrieval task. In the following, we want to understand the effect of the embedding on the distributions. Figure 9 shows further distributions for the extended test set $\mathcal{D}_3$ and five selected composers (Beethoven, Chopin, Vivaldi, Mahler, and Schumann). The left column refers to the brute-force approach (K = 240) with λ = 1 (query length: 20 s), the middle column refers to the DNN embedding approach (K = 12) with λ = 1 (query length: 20 s), and the right column refers to the DNN embedding approach (K = 12) with λ = 5 (query length: 60 s). Note that the absolute distances of the different approaches are not comparable, but the relations between orange and blue distributions are meaningful. The distance measure is always the squared Euclidean distance, cf. Equation (5). However, for the brute-force approach, we have sequences of 20 non-negative $\ell_2$-normalized vectors of size 12, which are then flattened (distance range [0, 40]). For the DNN approach, we have real-valued $\ell_2$-normalized vectors of size 12 (distance range [0, 4]).
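The two distance ranges follow from the normalization and can be checked numerically. The sketch below assumes only what is stated in the text: shingles of 20 non-negative unit-norm chroma vectors for the brute-force approach, and real-valued unit-norm embeddings for the DNN approach.

```python
import numpy as np

def l2_normalize(x, axis):
    """Scale vectors along the given axis to unit Euclidean norm."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Brute-force extreme case: all 20 chroma vectors are orthogonal between the shingles.
a = np.zeros((12, 20))
a[0, :] = 1.0
b = np.zeros((12, 20))
b[1, :] = 1.0
max_brute = np.sum((a - b) ** 2)    # 20 frames x squared distance 2

# DNN extreme case: two unit vectors pointing in opposite directions.
u = np.zeros(12)
u[0] = 1.0
max_dnn = np.sum((u - (-u)) ** 2)   # squared distance (2)^2

# Random examples stay within the respective ranges.
rng = np.random.default_rng(0)
x = l2_normalize(rng.random((12, 20)), axis=0)
y = l2_normalize(rng.random((12, 20)), axis=0)
d_brute = np.sum((x - y) ** 2)
e = l2_normalize(rng.standard_normal(12), axis=0)
f = l2_normalize(rng.standard_normal(12), axis=0)
d_dnn = np.sum((e - f) ** 2)
```

Non-negative unit vectors are at most orthogonal (squared distance 2 per frame, hence 40 for 20 frames), whereas real-valued unit vectors can be antipodal (squared distance 4).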
For the brute-force approach (left column), the new composers of the extended data set (Mahler, Schumann) behave similarly to the previous ones. The middle column reflects the weaker results for the low-dimensional embeddings with a shorter query length. As expected, there are strong overlaps between the orange and blue distributions. We can also see that there are some distances close to zero in the orange distributions. This can be explained by the neural network training, where the anchor and positive embeddings are pulled close together. However, the anchor and negative embeddings do not seem to be pushed apart to the same degree. For Chopin, the most prominent composer of the training set $\mathcal{D}_1$, the separation between the distributions is clearer. When using longer queries (λ = 5, right column), all orange and blue distributions are better separated. This holds for Chopin, where the initial situation (λ = 1) is better, as well as for composers with heavily overlapping distributions for λ = 1, e.g., Schumann.

Conclusions
In this paper, we proposed two dimensionality reduction methods for learning compact embeddings of audio shingles (short sequences of chroma vectors) for a cross-version retrieval scenario in the context of Western classical music. We showed that our strategy of reducing entire shingles results in better retrieval quality and faster retrieval than the previous approach of reducing individual chroma vectors [12]. In our experiments, we greatly reduced the dimensionality of shingles (from 240 to below 12) with only little loss in retrieval quality. Both PCA and neural networks with triplet loss turned out to be effective for this task. In particular, we found that neural networks are beneficial for small dimensionalities between 6 and 12. Such small dimensions allow for indexing by simple nearest neighbor trees, which could be the foundation of fast content-based audio retrieval in large classical music databases where efficiency is an important issue. We also showed that the database size has a strong impact on the retrieval results, especially when using dimensionality reduction methods. When applying such techniques, one needs a longer query length to be discriminative enough on a larger data set. Increasing the query length from 20 to 60 s results in a high retrieval accuracy, even for low-dimensional embeddings. We also showed that our approaches generalize to composers of the common practice period of Western classical music that are not contained in the training set. Furthermore, we analyzed the distances that appear in the nearest neighbor search to gain insights into the challenges of the retrieval scenario. For example, we showed that the distance distributions for different composers can differ strongly, which indicates that certain composers or pieces are more difficult for retrieval than others. In future work, we will investigate whether training on larger data sets can make the embeddings more robust for shorter query lengths. Furthermore, up to now, we used CENS features, which are state-of-the-art for the given task, as the input. We want to investigate whether good embeddings can be learned from less specialized input features such as "raw" chroma features, spectrograms, or even waveforms.

Figure 1. Overview of the retrieval procedure: A query (Q) is compared with a set $\mathcal{D}$ of database documents, resulting in a ranked list of documents.
shows a visualization of the feature type used in this paper. To generate shingles of length L < N from the feature sequence C, we use a hop size $H \in \mathbb{N}$ to define the subsequences

$$C^{L,H}_n := (c_{(n-1)H+1}, \ldots, c_{(n-1)H+L}) \qquad (1)$$

for $n \in \{1, \ldots, \lfloor (N-L)/H \rfloor + 1\}$, where $\lfloor \cdot \rfloor$ denotes the floor operation. The resulting subsequences can also be regarded as matrices or shingles $C^{L,H}_n \in \mathbb{R}^{F \times L}$. For brevity, we will omit the superscript H in the case of H = 1. See the lower part of Figure 2 for such a sequence of overlapping shingles.
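Equation (1) translates directly into array slicing. The following minimal sketch assumes a feature sequence stored as an (F, N) array with zero-based indexing; the function name `make_shingles` is illustrative only.

```python
import numpy as np

def make_shingles(C, L, H=1):
    """Cut an (F, N) feature sequence into shingles of length L with hop size H.

    Returns an array of shape (num, F, L) with num = floor((N - L) / H) + 1.
    """
    F, N = C.shape
    num = (N - L) // H + 1
    # The n-th shingle (zero-based) starts at frame n * H.
    return np.stack([C[:, n * H:n * H + L] for n in range(num)])

# Toy feature sequence with F = 12 chroma bands and N = 50 frames.
C = np.arange(12 * 50, dtype=float).reshape(12, 50)

doc_shingles = make_shingles(C, L=20, H=1)      # document shingles (hop size 1)
query_shingles = make_shingles(C, L=20, H=10)   # successive query shingles (hop size 10)

# Flattening a shingle yields the 12 x 20 = 240-dimensional vector used for retrieval.
flat = doc_shingles.reshape(len(doc_shingles), -1)
```

For N = 50 and L = 20, this gives 31 shingles with H = 1 and 4 shingles with H = 10, in line with the index range of Equation (1).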

Feature
Sequence C of document D < l a t e x i t s h a 1 _ b a s e 6 4 = " V b o 9 i 7 b 1 U K S Y D / z b

Figure 2. Schematic overview of the shingle generation process: A set of overlapping shingles $S_D$ is generated from the feature sequence of document $D$.

[Figure 3 graphic labels: (12 × 20 = 240 dimensions); (K = 6 dimensions)]

Figure 3. Schematic overview of the triplet embedding: Anchor, positive, and negative shingles
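To make the role of anchor, positive, and negative shingles concrete, the following sketch computes a triplet loss on already embedded vectors. It assumes the common hinge formulation with squared Euclidean distances and a margin α; the exact formulation used in the paper is not given in this section, and the name `triplet_loss` is illustrative.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, alpha):
    """Hinge-style triplet loss (assumed formulation, not confirmed by the paper).

    All inputs have shape (..., K), where K is the embedding dimensionality.
    The loss is zero once the negative is farther from the anchor than the
    positive by at least the margin alpha (in squared Euclidean distance).
    """
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(0.0, d_pos - d_neg + alpha)


# Toy embeddings with K = 2 (illustrative values only).
a = np.array([0.0, 0.0])
p = np.array([1.0, 0.0])
n = np.array([0.0, 1.0])
loss = triplet_loss(a, p, n, alpha=0.5)  # margin violated, so the loss is positive
```

A larger α demands a wider gap between positive and negative distances before a triplet stops contributing to the loss.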

Figure 4. Mean average precision (MAP) evaluation results for dimensionality reduction with various neural networks using the triplet loss with different α values.
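As a reminder of how a MAP value is obtained from ranked lists, here is a minimal sketch. It assumes that each ranked list contains all documents relevant to its query and that relevance is binary; the function names are illustrative and the paper's evaluation code may differ in detail.

```python
import numpy as np


def average_precision(ranked_relevance):
    """Average precision for one query.

    ranked_relevance: sequence of 0/1 flags in ranking order (best match first).
    Assumes every relevant document occurs somewhere in the list.
    """
    rel = np.asarray(ranked_relevance, dtype=bool)
    if not rel.any():
        return 0.0
    hits = np.cumsum(rel)                 # number of relevant items up to each rank
    ranks = np.arange(1, len(rel) + 1)
    return float(np.sum(hits[rel] / ranks[rel]) / rel.sum())


def mean_average_precision(rankings):
    """Mean of the average precision values over all queries."""
    return float(np.mean([average_precision(r) for r in rankings]))


# Two toy queries: relevant documents at ranks 1 and 3, and at rank 2.
map_value = mean_average_precision([[1, 0, 1, 0], [0, 1]])
```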

Figure 5. Time (in seconds) needed for searching 3300 queries using the shingle-based approaches depending on the embedding dimensionality K with various nearest neighbor search strategies.
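The simplest of the nearest neighbor search strategies, an exhaustive (brute-force) search, can be sketched as follows. This is only an illustration under the assumption of Euclidean distances between embedded shingles; the name `nearest_neighbors` is hypothetical, and the index-based strategies compared in the figure are not reproduced here.

```python
import numpy as np


def nearest_neighbors(query, database, k):
    """Exhaustive k-nearest-neighbor search.

    query: vector of shape (K,); database: array of shape (M, K).
    Returns (indices, distances) of the k closest database items,
    sorted by increasing Euclidean distance.
    """
    dists = np.sqrt(np.sum((database - query) ** 2, axis=1))
    order = np.argsort(dists)[:k]
    return order, dists[order]


# Toy database of three 2-dimensional embeddings (illustrative values only).
database = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
idx, dist = nearest_neighbors(np.array([0.9, 0.0]), database, k=2)
```

The cost of this search grows linearly with both the database size M and the dimensionality K, which is one reason why a smaller K shortens the search time.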

Figure 7. Evaluation results with varying λ on the extended data set $D_3$.

Figure 8. Distance distributions for three composers (Beethoven, Chopin, and Vivaldi) for the test set $D_2$ with λ = 1 (left column), the extended test set $D_3$ with λ = 1 (middle column), and the extended test set $D_3$ with λ = 5 (right column). All distributions are computed for the brute-force approach (K = 240). The orange color refers to distances for relevant documents and the blue color refers to distances for non-relevant documents.

Figure 9. Distance distributions for composers (Beethoven, Chopin, Vivaldi, Mahler, and Schumann) for the brute-force approach (K = 240) with λ = 1 (left column), the deep neural network (DNN) approach (K = 12) with λ = 1 (middle column), and the DNN approach (K = 12) with λ = 5 (right column). All distributions are computed for the extended test set $D_3$. The orange color refers to distances for relevant documents and the blue color refers to distances for non-relevant documents.

Table 1. Neural network architecture. For all convolutional operations, we use a zero-padding size of 1. Conv2D* indicates a circular padding of the chroma axis instead of zero-padding.
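The difference between zero-padding and circular padding of the chroma axis can be shown with a small NumPy sketch. It assumes that a shingle is stored with the chroma axis first and the time axis second, and that the time axis is still zero-padded in the Conv2D* case; the name `pad_shingle` is illustrative and independent of any deep learning framework.

```python
import numpy as np


def pad_shingle(X, pad=1):
    """Pad a shingle X of shape (chroma, time) before a convolution.

    Chroma axis: circular padding, so the top row wraps around to the bottom
    row and vice versa. Time axis: zero-padding (an assumption of this sketch).
    """
    X = np.pad(X, ((pad, pad), (0, 0)), mode="wrap")      # circular along chroma
    X = np.pad(X, ((0, 0), (pad, pad)), mode="constant")  # zeros along time
    return X


# Toy shingle with 12 chroma bands and 20 frames.
X = np.arange(12 * 20, dtype=float).reshape(12, 20)
P = pad_shingle(X)  # shape (14, 22)
```

Circular padding reflects the cyclic nature of the twelve pitch classes: the band above the highest chroma band is again the lowest one.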

Table 2. The music collections $D_1$ used for training and $D_2$ used for testing. Duration format hh:mm:ss.

Table 5. Retrieval results for our proposed PCA- and DNN-based dimensionality reduction methods using the test set $D_2$.
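For orientation, a PCA-based reduction of flattened shingles can be sketched in a few lines of NumPy. This is a generic PCA via the singular value decomposition, not the paper's exact pipeline; the function names, the random stand-in data, and the choice to fit on a training set and apply to separate test data are assumptions of this sketch.

```python
import numpy as np


def fit_pca(X, K):
    """Fit PCA on the rows of X (shape (M, 240)) and keep K components.

    Returns the mean vector and a (K, 240) matrix of principal directions.
    """
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:K]


def apply_pca(X, mean, components):
    """Project the rows of X onto the fitted principal directions."""
    return (X - mean) @ components.T


# Random stand-in data (not real shingles): fit on one set, embed another.
rng = np.random.default_rng(0)
train = rng.normal(size=(200, 240))
test = rng.normal(size=(10, 240))
mean, W = fit_pca(train, K=6)
embedded = apply_pca(test, mean, W)  # shape (10, 6)
```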

Table 6. Time (in seconds) needed for searching 3300 queries, for selected approaches.

Table 7. Extended data set $D_3$. This data set includes the test set $D_2$. Duration format hh:mm:ss.

Table 9. Composer-wise evaluation results using the extended data set $D_3$ for λ = 5 (query length: 60 s).