Efficient Retrieval of Music Recordings Using Graph-Based Index Structures

Flexible retrieval systems are required for conveniently browsing through large music collections. In a particular content-based music retrieval scenario, the user provides a query audio snippet, and the retrieval system returns music recordings from the collection that are similar to the query. In this scenario, a fast response from the system is essential for a positive user experience. For realizing low response times, one requires index structures that facilitate efficient search operations. One such index structure is the K-d tree, which has already been used in music retrieval systems. As an alternative, we propose to use a modern graph-based index, denoted as Hierarchical Navigable Small World (HNSW) graph. As our main contribution, we explore its potential in the context of a cross-version music retrieval application. In particular, we report on systematic experiments comparing graph- and tree-based index structures in terms of the retrieval quality, disk space requirements, and runtimes. Despite the fact that the HNSW index provides only an approximate solution to the nearest neighbor search problem, we demonstrate that it has almost no negative impact on the retrieval quality in our application. As our main result, we show that the HNSW-based retrieval is several orders of magnitude faster. Furthermore, the graph structure also works well with high-dimensional index items, unlike the tree-based structure. Given these merits, we highlight the practical relevance of the HNSW graph for music information retrieval (MIR) applications.


Introduction
Ongoing digitization efforts lead to increasingly large music collections. With growing dataset sizes, it can become challenging to find relevant audio documents in such a collection. A paradigm for searching music databases is known as query-by-example, where the user provides an audio query, and the task is to find audio recordings from the database containing parts or aspects similar to the query [1][2][3].
An example of a public query-by-example music retrieval service is the audio fingerprinting application Shazam [4,5], where the user supplies a query audio snippet, which is then identified by comparing the snippet's fingerprint with fingerprints from a reference database. One reason for Shazam's popularity is the ability to provide music identification results close to instantly. This short response time is possible because of the strict notion of similarity (identity of recordings) between the query and relevant database recordings in the audio identification task, combined with clever indexing techniques [1,2,6]. In our paper, we address a cross-version retrieval scenario, where we aim to find all performances or versions of a given piece of music, which is specified by a query audio snippet. In related tasks, the user may provide the query by singing or humming a melody [7,8]. In such cross-version retrieval scenarios, the notion of similarity is less strict (identity of the musical piece underlying different recordings), which leads to higher response times compared to fingerprinting services. In a previous study [9], the authors showed that the runtime of the search procedure in a cross-version retrieval scenario is in the order of a few seconds, even for a small database of about 16 h. With growing database sizes, this runtime also increases and becomes prohibitive for usage outside academia. In a query-by-example setting, it makes a dramatic difference whether the user has to wait a fraction of a second or a couple of seconds for the results after specifying the query. Efficiency is an important aspect of usability and a critical requirement of information retrieval systems for being practically relevant [10]. In this article, we show how to increase the efficiency for our cross-version retrieval scenario by using a modern indexing approach.
Indexing procedures increase the speed of search operations in a database, using specialized data structures, such as inverted file indices [11]. In our context, the nearest neighbor search is essential, where one aims to find the closest item in a database to a given query item. We can classify nearest neighbor search procedures into exact and approximate search approaches. Exact nearest neighbor search procedures (such as K-d trees [12,13]) guarantee to find the item in the database that is closest to the query. Other approaches relax this requirement. Instead of finding exact nearest neighbors, they only aim to find sufficiently nearby neighbors. In general, this is referred to as approximate nearest neighbor search. A well-known approach of this category is, e.g., locality-sensitive hashing (LSH) [14]. Beyond the distinction of approximate and exact solutions, nearest neighbor search procedures can be categorized according to their algorithmic approach [15,16]. The main categories of this distinction are hashing-based, partition-based, and graph-based search approaches. An example of hashing-based procedures is LSH [14], which has already been used in music retrieval studies [17][18][19][20]. Grosche and Müller [17] found that LSH can increase the retrieval efficiency compared to an exhaustive search. However, the retrieval results can be negatively affected (because LSH is an approximate search approach), and the LSH settings must be adjusted carefully to avoid a strong decrease in the retrieval quality. Examples of partition-based approaches include K-d trees [12,13], which have also been used in MIR [9,21,22]. For example, McFee and Lanckriet compared several K-d tree variants in the context of music similarity search [22]. They found that the combination of a tree variant for approximate search, known as spill trees [23], with principal component analysis (PCA) [24] gives a favorable trade-off between accuracy and complexity.
As an alternative, in our contribution, we explore a modern graph-based index structure called Hierarchical Navigable Small World (HNSW) graph [25], which provides an approximate search solution, and already has been successfully used for, e.g., image retrieval [26].
In our paper, we use a cross-version music retrieval task [9] as an example application to explore the HNSW graph in practice. We conduct systematic experiments using databases of different sizes to analyze the impact of the graph-based index on the quality and speed of our retrieval system. As our main contribution, we show that we can increase the efficiency of our music retrieval application by several orders of magnitude through the usage of an HNSW graph as an index structure. Although the graph-based search belongs to the category of approximate nearest neighbor search approaches, our experiments show that, by and large, the retrieval quality is not negatively affected by using the graph in our scenario. We also analyze several further aspects in our retrieval system, such as the feature computation and the construction, saving, and loading of the index structure. We want to emphasize that our results demonstrate huge advantages of the graph-based indexing approach compared to previously used index structures. We expect similar benefits for other MIR problems involving nearest neighbor search. Beyond cross-version music retrieval, such search problems occur in diverse MIR tasks, such as cover song retrieval [27,28], music similarity estimation [19], query-by-humming [20], or symbolic music genre classification [29].
To make our results reproducible [30], we use open-source implementations of the discussed index structures and provide code that shows how we use them, along with pre-computed features for an example dataset (https://www.audiolabs-erlangen.de/resources/MIR/2020_signals-indexing, accessed on 12 May 2021). In this way, we enable a straightforward usage of the index structures in future MIR applications.
The remainder of this paper is organized as follows. In Section 2, we outline our music retrieval application. Next, in Section 3, we describe graph-based search procedures in general and the HNSW graph in particular. Then, in Section 4, we present our systematic experiments where we apply the HNSW graph for our music retrieval task. Finally, we summarize our main findings in the concluding Section 5.

Music Retrieval Application
In this section, we describe our motivating retrieval scenario, our cross-version retrieval approach, and our datasets, which are used later in our experiments. Readers with a primary interest in indexing and our experimental findings may skip this section at first reading.

Motivating Retrieval Scenario
In our article, we deal with a query-by-example retrieval scenario using a real-world music collection from a music publisher. This collection consists of the complete audio catalog of the Carus publishing house, a leading sheet music publisher of sacred and secular choral music. Beyond sheet music, Carus also produces and publishes audio recordings, mainly for choral pieces of Western classical music. Carus' complete audio catalog is a medium-sized music collection of nearly 400 h (more details in Section 2.3). Internally, we implemented a web-based interface for browsing this dataset, as illustrated by the screenshot shown in Figure 1. Here, the user can specify a query in the form of a YouTube link of a music recording, e.g., an interpretation of Mendelssohn's Verleih uns Frieden (Grant us peace) by an amateur choir. Such YouTube videos are often poorly annotated. Therefore, in our scenario, we use a content-based retrieval approach, where we take a 20-s audio snippet from the YouTube recording as a query. Then, the system retrieves recordings from the Carus collection that are based on the same musical piece as the query. Following [9,17], we denote different recordings of the same piece of music as "versions." In the case of the Mendelssohn piece, the retrieval system returns two CD articles from the Carus catalog, which both include a version of that piece by professional musicians (the Kammerchor Stuttgart under the direction of Frieder Bernius). The user may listen to the retrieved versions or click on the cover images to access more information on the linked webpage of the publisher. We do not describe this web-based tool in further detail; rather, it serves as a motivating scenario for our retrieval experiments and indicates our study's practical relevance.

Cross-Version Retrieval System
We now describe our retrieval approach, closely following [9]. Given a database of music recordings and a short query audio fragment, the aim is to identify all recordings (versions) in the dataset based on the same musical piece as the query. To this end, we compare the database and query recordings using chroma-based audio features, which measure local energy distributions in the 12 chromatic pitch class bands [31,32]. Our chroma features are computed by suitably pooling the frequency bins of a time-frequency representation with a logarithmic frequency axis, where a frequency bin corresponds to a semitone [33]. Following [9,17], we use a chroma variant called CENS (chroma energy distribution normalized statistics) [34], which is adapted for the retrieval task using a post-processing strategy involving logarithmic quantization, temporal smoothing, and frame-wise normalization.
All database recordings are transformed into chroma sequences. We use a shingling approach [1,9,17,35], where the database's chroma sequences are subdivided into subsequences (also referred to as "shingles") of L = 20 chroma vectors, using a hop size of H = 1 frame. The length of a shingle corresponds to 20 s of audio (using a feature rate of 1 Hz). As for the retrieval, the query (in the form of a single shingle) is compared with all database shingles. Figure 2 illustrates such a query (left) for our Mendelssohn example and the set of overlapping shingles (right) for a database document (a track of a CD from the Carus collection). As the first option to compare two shingles, we reshape each shingle of dimension 12 × 20 to a vector of dimensionality K = 240 and apply a distance function. Throughout this paper, we use the squared Euclidean distance d(x_1, x_2) = ‖x_1 − x_2‖² between two vectors x_1, x_2 ∈ R^K as the distance function. Alternatively, we reduce the dimensionality (K < 240) of the shingles before computing the distance. In particular, following Zalkow and Müller [9], we use classical PCA [24] or deep neural networks (DNNs) trained with the triplet loss function [36] for dimensionality reduction. The DNN is a convolutional neural network with a relatively small number of parameters (e.g., about 6000 parameters for K = 10). The last network layer performs an ℓ2-normalization of the embedding. For more details, we refer to [9]. The authors showed that the shingle dimension can be reduced from 240 to 15 without substantial loss in retrieval quality for the given application (while PCA- and DNN-based embeddings yield similar results). The DNN-based approach is beneficial for even smaller dimensionalities (below K = 12), where we only have a moderate loss in retrieval quality.
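To make the shingle comparison concrete, the following minimal sketch (using NumPy; the function name `shingle_distance` is ours, not from the paper's accompanying code) reshapes two 12 × 20 shingles into 240-dimensional vectors and computes their squared Euclidean distance:

```python
import numpy as np

def shingle_distance(s1, s2):
    """Squared Euclidean distance between two 12 x 20 chroma shingles,
    reshaped to vectors of dimensionality K = 12 * 20 = 240."""
    diff = s1.reshape(-1) - s2.reshape(-1)
    return float(np.dot(diff, diff))

# Toy example with random stand-ins for chroma shingles
rng = np.random.default_rng(0)
s1 = rng.random((12, 20))
s2 = rng.random((12, 20))
d = shingle_distance(s1, s2)
```

The same function works unchanged for dimensionality-reduced shingles (K < 240), since reshaping a vector is a no-op.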
The retrieval task is then solved by finding the database's shingles with the smallest distance to the query shingle, which is a nearest neighbor search problem. Zalkow and Müller [9] compared the runtimes for this retrieval task using an exhaustive search and a search approach using K-d trees [12,13]. Using a small dataset of 16 h of music, they found that K-d trees are only beneficial for smaller dimensionalities (below K = 15). This finding agrees with the fact that K-d trees are inappropriate for high-dimensional data [21,37]. In our paper, building upon these findings, we want to explore the potential of a graph-based index structure for this music retrieval problem.

Datasets
We use two databases of different sizes in our experiments (see Table 1). Our first set D sm , which was already used as an evaluation set in the previous study [9] (there denoted by D 2 ), consists of 330 audio files and comprises about 16 h of music (corresponding to more than 50,000 shingles). These recordings contain interpretations of orchestral and piano pieces by Beethoven, Chopin, and Vivaldi. In our cross-version retrieval evaluation, we consider a recording relevant for a query if it represents a version of the piece underlying the query (e.g., Mendelssohn's Verleih uns Frieden or the first movement of Beethoven's Third Symphony). Since the dataset D sm contains twelve "cliques" (works or movements), each having a different number of versions (4 to 67), a query may correspond to 4 to 67 relevant documents in our cross-version retrieval scenario. This dataset is well annotated and corresponds to a controlled retrieval scenario. We make the annotations and shingles of the dataset available on our accompanying website for reproducibility. As a second dataset D lg , we use the entire audio catalog of recordings offered by the Carus label. The dataset D lg contains 7115 audio files and about 390 h of professionally produced music (corresponding to more than 1.25 million shingles), mainly of vocal Western classical music. In contrast to D sm , this dataset is less well annotated and corresponds to a rather uncontrolled scenario "in the wild," having real practical relevance.

Graph-Based Nearest Neighbor Search
We now outline the nearest neighbor search problem underlying our retrieval task and a search procedure using graph-based data structures [38]. Then, we describe the HNSW graph, which we use later as an index in our experiments.

Graph-Based Data Structures
In our search problem, we have a database D of items x ∈ D. The items are K-dimensional vectors, thus D ⊂ R^K. Figure 3 (initial situation) illustrates a dataset, where the items are gray points. Given a query x_q ∈ R^K (colored in orange in Figure 3) and some distance measure d (see Equation (1)), the aim is to find the ν ∈ N database items that are nearest to this query. As an example, let us now assume ν = 1. Then, the aim is to find the closest database item x* = argmin_{x ∈ D} d(x, x_q). A naive solution to this problem is to compute the distances between all database items and the query and to select the item with the smallest distance (exhaustive search). With graph-based nearest neighbor search, we aim to find x* without evaluating all these distances, but only a subset of them. In general, we have no guarantee of finding x* with a graph-based search. For this reason, this strategy belongs to the category of approximate nearest neighbor search methods.
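The exhaustive baseline can be sketched in a few lines of NumPy (a minimal illustration; the function name `exhaustive_nn` is our own):

```python
import numpy as np

def exhaustive_nn(database, x_q, nu=1):
    """Exact nearest neighbor search: compute the squared Euclidean
    distance from the query to every database item and return the
    indices and distances of the nu closest items."""
    diffs = database - x_q                    # broadcast over all N items
    dists = np.einsum('ij,ij->i', diffs, diffs)
    order = np.argsort(dists)[:nu]
    return order, dists[order]

# Toy example: 1000 random items; the query is a perturbed copy of item 42
rng = np.random.default_rng(1)
database = rng.random((1000, 240))
x_q = database[42] + 1e-3 * rng.random(240)
idx, dist = exhaustive_nn(database, x_q, nu=3)
```

This computes all N distances per query, which is exactly the cost that graph-based search avoids.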

Figure 3. Illustration of graph-based approximate nearest neighbor search.
In our graph-based approach, the database is organized as an undirected graph structure, where the database items are nodes. Nearby nodes are connected by edges, which are represented by connecting lines in Figure 3. In essence, the search in the database consists in traversing along the edges of the graph. In the first step of the search procedure, we select an entry point x e ∈ D of the database (colored in dark blue in Figure 3, first step), e.g., by random choice. This entry point is our first active search node in the procedure. We then compute the distance of the query to this node. Next, we compute the distance between the query and all items that are connected to the active search node (colored in light blue in Figure 3). If any of these distances are smaller than the distance between the query and the active search node, we continue with the next step, where the node with the shortest distance is the next active search node in the procedure. Otherwise, we terminate, and our active search node is the final candidate for the nearest neighbor to the query. In Figure 3, we perform four steps until the algorithm terminates. In our example, the final candidate corresponds to the exact solution x * .
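The greedy traversal just described can be sketched as follows (a simplified illustration in plain Python; the adjacency structure and function name are ours):

```python
import numpy as np

def greedy_graph_search(vectors, edges, x_q, entry):
    """Greedy search on an undirected proximity graph: starting from the
    entry node, move to the neighbor closest to the query until no
    neighbor improves the distance, then return the final candidate."""
    def d(i):
        diff = vectors[i] - x_q
        return float(np.dot(diff, diff))
    active, steps = entry, 0
    while True:
        best, best_d = active, d(active)
        for nb in edges[active]:           # inspect all connected nodes
            if d(nb) < best_d:
                best, best_d = nb, d(nb)
        if best == active:                 # no neighbor is closer: terminate
            return active, steps
        active = best
        steps += 1

# Toy example: five points on a line, connected as a chain
vectors = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
edges = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
node, steps = greedy_graph_search(vectors, edges, np.array([3.9]), entry=0)
```

In this toy example, the search walks along the chain in four steps and terminates at the node closest to the query, mirroring the four steps of Figure 3.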

HNSW Graphs
To date, we have described a search procedure using a data structure with a single graph. Building upon such a structure, Malkov and Yashunin [25] introduced an improved data structure with multiple levels, called Hierarchical Navigable Small World (HNSW) graph, for approximate nearest neighbor search. Compared to search procedures for single-layer structures (as described in the previous section), the multi-layer search procedure has various benefits, such as improved search quality, higher efficiency (with a runtime that increases logarithmically with the dataset size), and greater stability with respect to the dimensionality K. Figure 4 illustrates an HNSW graph with three layers. The bottom level (first layer) is a graph containing the full database, similar to the graph structure described in the previous section. In the figure, this layer contains 16 database items. The middle level (second layer) contains a subset of these items (eight items). The top level (third layer) contains a subset of the second layer's items (four items). For illustration purposes, the dashed red lines in the figure indicate the items that are available at all layers. Items that are available only at the first two layers are indicated with dashed gray lines.
We now want to outline the search procedure for HNSW graphs. Given a query point x_q, searching in an HNSW graph starts at the top layer with a suitably selected entry point x_e (e.g., random selection, see [25] for details). A search procedure is then applied, similar to the approach described in the previous Section 3.1, to find a candidate for the nearest neighbor to x_q in the top layer. Then, the search continues at the next layer, where the entry point is the item corresponding to the nearest neighbor candidate from the upper layer. This way, the search continues until we arrive at the bottom layer, where all database items are available. The aim is to find the ν nearest neighbors in this layer as the search procedure's result. To stabilize the approximate search results, we may first search for more than ν approximate nearest neighbors (using a graph-based search procedure as before). We denote the number of "intermediate" candidates by ν′ ∈ N, where ν ≤ ν′ ≤ |D|. Among the ν′ candidates, we then select the ν nearest neighbors (by exhaustive search) as the final result. The number ν′ is a free parameter that can be increased to improve the search results (at the cost of an increased runtime).
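The layered descent can be illustrated with a deliberately simplified sketch (our own toy code: it performs a purely greedy search per layer with a single candidate, whereas the actual HNSW algorithm maintains a candidate list of size ν′):

```python
import numpy as np

def hierarchical_search(layers, vectors, x_q, entry):
    """Simplified HNSW-style descent: run a greedy search in each layer
    and use the result as the entry point of the layer below.
    `layers` is a list of adjacency dicts, top layer first."""
    def d(i):
        diff = vectors[i] - x_q
        return float(np.dot(diff, diff))
    active = entry
    for edges in layers:
        improved = True
        while improved:
            improved = False
            for nb in edges.get(active, []):
                if d(nb) < d(active):      # move to a closer neighbor
                    active, improved = nb, True
                    break
    return active

# Toy example: eight points on a line; the top layer is coarse, the bottom dense
vectors = np.array([[float(i)] for i in range(8)])
top = {0: [4], 4: [0]}
bottom = {i: [j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)}
result = hierarchical_search([top, bottom], vectors, np.array([6.2]), entry=0)
```

The coarse top layer lets the search jump over large distances quickly; the dense bottom layer then refines the result locally.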

Figure 4. Illustration of an HNSW graph with three layers.

Malkov and Yashunin [25] also proposed an algorithm for constructing HNSW graphs, which we summarize briefly. During the construction process, the database items are consecutively inserted into the graph. For each new item x ∈ D, we randomly decide on its upper-most layer ℓ(x) ∈ N according to an exponentially decaying probability distribution. If the random process selects a higher layer ℓ(x) than the highest existing layer in the graph, the number of layers in the graph increases dynamically. The item x is then inserted in all layers [1 : ℓ(x)] := {1, . . . , ℓ(x)}. Next, we want to connect the new node x to M database items in each layer of [1 : ℓ(x)] by edges. The parameter M ∈ N controls the minimum number of edges for each node of the data structure. We apply a top-down search procedure, similar to the approach described above, to search for suitable items in each layer of [1 : ℓ(x)]. To stabilize the search results, we may first search for more than M candidates for inserting edges (using a graph-based search procedure as before). Similar to the number ν′ of intermediate neighbor candidates in the search procedure, we denote the number of intermediate candidates for inserting edges by M′ ∈ N. We select M database items among the M′ candidates for inserting edges. In their paper [25], the authors propose two options for this selection. As a first option, they select the nearest neighbors to x by exhaustive search. As a second option, the authors propose a heuristic to create connections to x from diverse directions, which is beneficial for highly clustered data (see the original paper [25] for more details). In our experiments, we use the second option. If any of the nodes turn out to have more edges than a predefined maximum (set to 2M for the bottom layer and M for all other layers), the nodes' surplus edges with the highest distance are removed.
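The random layer assignment can be sketched as follows (our own illustration; the normalization m_L = 1/ln(M) is the choice recommended in the original HNSW paper [25]):

```python
import numpy as np

def draw_level(m_L, rng):
    """Draw the top layer of a new item from an exponentially decaying
    distribution, as in the HNSW construction algorithm.
    With m_L = 1/ln(M), each layer holds roughly a 1/M fraction of the
    items of the layer below. Layers are 1-based here."""
    u = 1.0 - rng.random()                 # u in (0, 1]
    return int(np.floor(-np.log(u) * m_L)) + 1

# Draw levels for 10,000 items with M = 5 (as in our experiments)
rng = np.random.default_rng(0)
m_L = 1.0 / np.log(5)
levels = [draw_level(m_L, rng) for _ in range(10000)]
```

With M = 5, about 80% of all items live only in the bottom layer, so the upper layers stay sparse and fast to traverse.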
In summary, we have various important parameters for the construction and search procedure of an HNSW graph. Beyond the distance function d, these parameters are ν (the number of neighbors to search for), ν′ (the number of neighbor candidates during search), M (the minimum number of edges for each node), and M′ (the number of edge candidates during construction). The authors recommend a range of M ∈ [5 : 48], where higher M values imply better search results and higher memory consumption. Furthermore, they recommend increasing M for higher dimensionalities K. In our experiments described in the next section, we fix the squared Euclidean distance (see Equation (2)) as the distance function d and use the default parameter settings of ν′ = M′ = 100 and M = 5. As we will see later, fine-tuning these parameters is not necessary for obtaining good results in our application.

Experiments
We now use the HNSW graph as an index structure in our music retrieval application.
Here, a node of the graph corresponds to a database shingle with or without dimensionality reduction.

Experimental Setup
In the following, we analyze the possible decrease in retrieval quality and the improvements of the retrieval runtime introduced by the HNSW graph. To this end, we consider quantitative performance measures, which we list in Table 2. To evaluate the retrieval quality, we use standard precision-based measures (more details in Section 4.2). To analyze the impact of the index structures on the retrieval speed, we consider several steps that are involved in our retrieval scenario. Some steps need to be computed offline when processing the database documents, and other steps need to be computed online when processing a query. In the offline phase, we need to compute features for all audio files of the database, construct an index and save the index file to disk. These steps can be carried out at any time and on any system (offline). When applying our index, we first need to load the index into the computer's main memory (RAM). This loading step can be considered as being in-between the online and offline stages. The index loading needs to be performed on the actual system where the retrieval service is provided. When the index structure can be kept in the main memory, it does not have to be reloaded for each query. Therefore, we still consider it as part of the offline stage. In the actual online phase, we need to compute the query features and perform the nearest neighbor search procedure using our index structure. In the following sections, we analyze these steps in the order given in Table 2. If not mentioned otherwise, we always report on average time measurements (µ ± σ) for 100 iterations of the experiment. Note that runtime evaluation is a delicate topic on its own [39]. For example, one may argue that it is more meaningful to report minimum instead of average runtime measurements because other processes running in parallel affect the mean more than the minimum. In our case, this is not a major issue because the standard deviation σ is always relatively low. 
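Our timing protocol (mean µ and standard deviation σ over 100 iterations) can be sketched with a small helper (our own illustration using only the Python standard library):

```python
import time
import statistics

def time_operation(fn, iterations=100):
    """Run fn repeatedly and report the mean and standard deviation
    of its wall-clock runtime in milliseconds."""
    times = []
    for _ in range(iterations):
        t0 = time.perf_counter()
        fn()
        times.append((time.perf_counter() - t0) * 1000.0)
    return statistics.mean(times), statistics.stdev(times)

# Example: time a cheap stand-in operation
mu, sigma = time_operation(lambda: sum(range(10000)), iterations=100)
```

Reporting the minimum instead of the mean would suppress interference from parallel processes, but as noted above, σ is small in all our measurements, so the mean is informative here.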
We want to highlight that we take a practical perspective by measuring runtimes using distinct implementations of the respective index structures, implemented in different programming languages. The absolute runtimes obtained may vary when using different implementations or hardware systems. Our study gives practical insights into the runtimes obtained by specific implementations on specific platforms for our specific application. In general, we are interested in the orders of magnitude, the relative differences between the time measurements, and the relationships between index size and runtime.
We compare three different search approaches: an exhaustive search approach (full search, exact search solution), an indexing strategy using K-d trees (KD, exact search solution), and the graph-based index structure (HNSW, approximate search solution). We perform our experiments using Python 3.6.5 on a computer with an Intel Xeon CPU E5-2620 v4 (2.10 GHz) and 31 GiB RAM. We use the efficient pairwise-distance calculation of scipy 1.0.1 [40] for the full search, which calls a highly optimized implementation in C. For the K-d trees (using a default leaf size of 30), we use the implementation of scikit-learn 0.20.1 [41], which is written in Cython. For the HNSW graph, we use the efficient hnswlib implementation in C++ by the authors of the original paper [25] (https://github.com/nmslib/hnswlib, accessed on 12 May 2021), using the Python wrapper version 0.4.0. We use librosa 0.7.1 [42] for the audio processing pipelines and TensorFlow 1.7.0 [43] for the deep neural network implementation.

Retrieval Quality
In contrast to the K-d tree approach, the HNSW graph only provides an approximate search solution. To understand the impact of this approximation within our retrieval scenario, we measure our retrieval system's quality using the dataset D sm , closely following [9]. We consider a document-level rather than a shingle-level retrieval. Here, the distance between a query shingle and a document is given by the minimum distance between the query and all document shingles. We construct a single index structure for the entire dataset D sm (using either a K-d tree or an HNSW graph) and search for the 10,000 nearest items in the database to a given query. Using the distances of the returned items, we create a ranked list of documents, ordered by ascending distances. Note that we were not able to rank all database documents because some documents may not have a corresponding item among the items returned (this did not affect the evaluation measures in our experiments). For evaluating the ranked list, we consider three standard retrieval evaluation measures [44]. First, we use precision at one (P@1), which is 1 if the top-ranked document is relevant (i.e., being a version of the same musical work as that of the query), and 0 otherwise. Note that, for exact nearest neighbor searches, the top match is always identical to the query because, in our experiments, the query is part of the database (which leads to a P@1-value of 1). We still use this measure to check whether the approximate search approach is able to find the "trivial" match. Second, we use precision at three (P@3), which is the proportion of relevant documents among the top 3 documents of the ranked list. Third, we use R-precision (P R ), which is the proportion of relevant documents among the first R ranks, where R ∈ N denotes the number of relevant documents for the given query (which may differ for each query, between 4 and 67).
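The three evaluation measures can be computed with a few lines of Python (a minimal sketch; the function names are our own):

```python
def precision_at_k(ranked_relevance, k):
    """Proportion of relevant documents among the top k of a ranked list.
    `ranked_relevance` is a list of booleans, best match first."""
    return sum(ranked_relevance[:k]) / k

def r_precision(ranked_relevance, num_relevant):
    """Precision at rank R, where R is the number of relevant documents
    for the given query."""
    return precision_at_k(ranked_relevance, num_relevant)

# Toy ranked list: the query's own version first, then two further hits
ranked = [True, True, False, True, False]
p1 = precision_at_k(ranked, 1)   # P@1
p3 = precision_at_k(ranked, 3)   # P@3
pr = r_precision(ranked, 3)      # P_R with R = 3 relevant documents
```

In our experiments, each measure is computed per query and then averaged over all query shingles.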
We generate 3300 query shingles from D sm by equidistantly sampling ten queries from each of the 330 recordings. Each evaluation measure is finally averaged over the 3300 query shingles used. Table 3 shows the resulting evaluation measures. A row in this table specifies the dimensionality reduction approach (no reduction, PCA-, or DNN-based embedding), the dimensionality K, and the search strategy (Full Search or HNSW). The retrieval results for the exhaustive search (Full Search) and the K-d tree strategy (KD) are identical because both approaches are exact nearest neighbor searches (with different properties in terms of runtime, as described in Section 4.5). For example, without dimensionality reduction, we obtain a P@1-value of 1.0, a P@3-value of 0.9965, and a P R -value of 0.9434. This result shows that the shingle-based retrieval approach is able to identify most of the versions correctly, but there are a few false positives. Reducing the shingle dimensions (which is important for some indexing approaches, such as K-d trees) leads to further degradations of the retrieval quality, as already shown in previous work [9]. For example, reducing the dimensionality from 240 to 30 with the PCA-based embedding, we obtain a P@3-value of 0.9910. For smaller dimensionalities, the DNN is beneficial over PCA for embedding the shingles, e.g., resulting in P R -values of 0.7350 (PCA) and 0.8333 (DNN) for K = 6. Using the HNSW graph as an index structure, we obtain more or less the same evaluation metrics for all settings. This finding demonstrates that the approximate search approach of the HNSW graph has almost no negative impact on the retrieval results within our application scenario. When we analyze the runtime improvements in the following sections, we can bear in mind that they come without substantial loss in retrieval quality.

Feature Computation
In this section, we report on the runtimes for the various steps involved in the feature computation. This computation procedure has to be performed in the offline stage (for the whole database) and online stage (for the query). In contrast to the document-based analysis of the retrieval quality (evaluating a ranked list of documents), we now use an item-based evaluation (runtime to process a database item). To compute our measurements, we first load 20 s of an audio file (using librosa.load). In general, our audio files are longer, but for our runtime experiments, we only use a 20-second segment, corresponding to the length of a shingle. Then, we compute the spectral features [33] (with librosa.iirt). Next, we compute the CENS features [34] (using librosa.feature.chroma_cens). This step concludes the feature computation if no dimensionality reduction is applied. An additional step is performed in the embedding-based retrieval approaches (PCA-based embedding using sklearn.decomposition.PCA or DNN-based embedding as described in [9]). Table 4 shows the time measurements. The loading of the audio segment only takes 44.8 ms. The next step is the spectral feature computation, which needs more than a second (1171.5 ms). The time required for the CENS computation is not significant (0.9 ms). In the table, we list the times to embed the shingle for some selected dimensionalities. In general, the PCA-based embedding is faster than 0.1 ms, and the DNN-based embedding does not take more than 2 ms. The numbers in the table show that the major bottleneck of the feature computation is the spectral transform. Compared to this, the times of the other steps are not significant. The runtime for the query feature transform is not our focus in this paper. Obviously, it only scales linearly with the query length, which is usually short (i.e., no scalability issue).
The runtime for computing the database documents' features is also not critical because it can be computed offline. A possible future research direction could be to compute embeddings from spectral representations that are less expensive to compute, e.g., using the short-time Fourier transform (STFT) with the FFT. Using the same window and hop length settings as the spectral transform, computing the magnitude STFT for the 20-s audio snippet takes only 19.2 ms on average. However, using STFT-based features may go along with a decrease in feature quality, which may affect the retrieval results.
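To illustrate why the magnitude STFT is cheap to compute, the following sketch implements it via frame slicing and the FFT using NumPy only. The window and hop settings as well as the random input signal are illustrative assumptions and do not reproduce our exact feature configuration.

```python
import numpy as np

def magnitude_stft(x, win_length=2205, hop_length=2205, window=None):
    """Magnitude STFT via frame slicing and the FFT (illustrative settings)."""
    if window is None:
        window = np.hanning(win_length)
    num_frames = 1 + (len(x) - win_length) // hop_length
    frames = np.stack([x[i * hop_length: i * hop_length + win_length]
                       for i in range(num_frames)])
    # rfft over windowed frames; transpose to (freq_bins, num_frames)
    return np.abs(np.fft.rfft(frames * window, axis=1)).T

sr = 22050
x = np.random.randn(20 * sr)  # stand-in for a 20-second audio snippet
S = magnitude_stft(x)         # shape: (win_length // 2 + 1, num_frames)
```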

Constructing, Saving, and Loading the Index
We now address various performance measures for the offline stage, i.e., for constructing, saving, and loading the index structures. For these steps, we restrict our analysis to one embedding technique (PCA) because the specific embedding strategy used has only a minor influence on these measures. The first step is to construct the index structure (either a K-d tree or an HNSW graph). We report on the times required to construct the index, given that the data to be indexed is already in the computer's main memory (i.e., pre-computed shingle embeddings, without having any distances pre-computed). When this data needs to be read from disk, it will cause some additional overhead. For example, loading all pre-computed shingle embeddings (K = 12) of D sm takes 0.9 ms on average. Loading the full shingles (K = 240) of D lg requires one second on average. Columns 4 and 5 of Table 5 show the time needed to construct the index structures for various dimensionalities. We include time measurements for the smaller dataset D sm and the larger dataset D lg . The first row in the table refers to the K-d tree index for shingles without dimensionality reduction (K = 240). This setting leads to construction times of 0.54 s for D sm and 43.03 s for D lg . For lower dimensions, this time decreases. For example, constructing the K-d tree index for D lg takes 3.89 s, 1.66 s, and 0.99 s for the dimensionalities of 30, 12, and 6, respectively. Constructing an HNSW graph for K = 240 requires 0.65 s for the smaller dataset D sm and 26.66 s for the larger dataset D lg . For this large dataset and a high dimensionality of K = 240, constructing an HNSW graph is faster (26.66 s) than constructing a K-d tree (43.03 s) in the implementations used. However, this is not the case for lower dimensions. For example, constructing the index structures for the larger dataset D lg using K = 30 requires 3.89 s for the KD and 16.38 s for the HNSW approach. 
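A minimal sketch of the construction step with scikit-learn's K-d tree, assuming the embeddings are already in main memory; the dataset size and dimensionality are synthetic, not those of D sm or D lg . An HNSW graph would be constructed analogously with hnswlib (init_index followed by add_items).

```python
import time
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
K = 30                                         # embedding dimensionality (illustrative)
data = rng.standard_normal((10_000, K))        # stand-in for pre-computed shingle embeddings

# Time the index construction from in-memory data
t0 = time.perf_counter()
tree = KDTree(data)
build_time = time.perf_counter() - t0
print(f"K-d tree construction: {build_time * 1000:.1f} ms")
```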
In general, the construction time grows approximately linearly with the dimensionality K for lower dimensions. Only for large dimensionalities does the time for constructing a K-d tree explode, which agrees with the fact that K-d trees are not suited for high-dimensional data [21,37]. In contrast, the HNSW approach behaves stably for all considered dimensionalities. In all our settings, constructing an index takes less than a minute. We do not consider this duration critical in our application because the step is performed offline. The next step is to save the index structure to the hard disk. In the case of the K-d tree, we use scikit-learn's [41] recommended default persistence format based on the Python package joblib without compression. In the case of the HNSW graph, we use the default storage format of hnswlib, which is a custom binary format (also without compression). Columns 6 and 7 of Table 5 show the disk space required for storing the index structures. Without dimensionality reduction (K = 240), storing the K-d tree requires 209.7 MB and 5161.2 MB of disk space for D sm and D lg , respectively. The HNSW graph takes 53.9 MB and 1310.9 MB for the same data. In general, the required disk space scales roughly linearly with the dataset size as well as with the dimensionality in both indexing approaches. Furthermore, the graph-based index is generally more space-efficient than the tree-based structure in the given formats. To apply a pre-computed index for retrieval, we need to load it into the computer's main memory. We perform this loading step with the functions required for the respective file formats used in the previous step (using functions from joblib and hnswlib, respectively). Columns 8 and 9 of Table 5 show the time required to load the index files. Loading a K-d tree without dimensionality reduction requires 131.5 ms for D sm and 3209.1 ms for D lg .
Using the same data, loading an HNSW graph takes 108.1 ms and 2680.0 ms, respectively. For smaller dimensions, loading a K-d tree is faster than loading an HNSW graph (e.g., K = 12 and D lg : 167.0 ms for KD and 2071.5 ms for HNSW). The time to load an index scales linearly with the dimensionality in both KD and HNSW approaches (with a much flatter slope for HNSW).
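The saving and loading steps for the K-d tree can be sketched as follows with the joblib-based persistence; the data and file name are illustrative stand-ins. For the HNSW graph, hnswlib provides save_index and load_index instead.

```python
import os
import joblib
import numpy as np
from sklearn.neighbors import KDTree

data = np.random.default_rng(1).standard_normal((5_000, 12))  # stand-in embeddings
tree = KDTree(data)

# Save with scikit-learn's recommended joblib-based persistence (no compression)
path = "kdtree_index.joblib"
joblib.dump(tree, path, compress=0)
size_mb = os.path.getsize(path) / 1e6

# Load the index back into main memory before retrieval
tree_loaded = joblib.load(path)
dist, ind = tree_loaded.query(data[:1], k=1)
```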
With our system (see Section 4.1 for specifications), we did not have any issues with loading the index structures into the main memory (31 GiB RAM) in any setting. On other systems with less memory (8 GiB RAM), we could not fully load the index structures into the main memory for K = 240. In this case, dimensionality reduction becomes crucial for practical reasons.

Retrieval Time
In this section, we report on our runtime experiments for searching the nearest neighbors in our datasets. Regarding efficiency, these experiments refer to the most critical part of the retrieval pipeline because this step is part of the online stage (where the runtime affects the user experience). An efficient search is of major importance for scalability because, in general, the search runtime scales with the dataset size. The aim of our index structures is to improve the efficiency of this step.
We can also consider the complexity of the search approaches from a theoretical perspective (which is not the focus of this paper, being a practice report). The runtime of the full search increases linearly with the dataset size. In other words, its search complexity is in the order of O(|D|). The expected search complexity for K-d trees is in the order of O(log |D|) [13]. However, it is well known [45] that the actual performance may be equivalent to or even worse than that of an exhaustive search, depending on the data distribution (which influences the tree structure). Especially for high-dimensional data, the search performance degenerates. According to [25], the overall complexity scaling of the search for the HNSW graph is O(log |D|), which does not degenerate for high-dimensional data.
For the experiments of this section, we search for the ν nearest items in our dataset (either D sm or D lg ) to a given query, where we consider ν ∈ {1, 10, 100, 1000}. We again use 3300 queries (as in Section 4.2) and perform several repetitions of this experiment. Then, we normalize the measured runtimes with respect to the number of queries and repetitions, such that the reported measures refer to the time needed for a single query. Table 6 shows the results of our experiments. Let us first consider the PCA-based dimensionality reduction using K = 30 for the smaller dataset D sm . The exhaustive search requires 4.803 ms. The indexing approaches are much faster, taking 0.009 ms using a K-d tree and 0.007 ms using an HNSW graph. In this setting, there is no large difference between the indexing approaches. When we search for more neighbors (increasing ν), the KD approach slows down substantially (0.621 ms, 1.001 ms, and 2.025 ms for ν values of 10, 100, and 1000, respectively). The runtime does not increase to the same extent for the HNSW approach (0.007 ms, 0.007 ms, and 0.056 ms). Searching in the larger dataset D lg shows the potential of the HNSW index even more clearly. For example, using the PCA-based embedding (K = 30), for ν = 100, the K-d tree requires 66.499 ms and the HNSW graph needs only 0.019 ms. While the increased dataset size only has a minor effect on the runtime for the HNSW approach, it dramatically increases the KD strategy's runtime. For lower dimensionalities, the runtime differences between the K-d tree and the HNSW graph are less extreme. For example, the KD approach requires 0.201 ms for K = 6 (D lg , ν = 100). Still, the HNSW graph is much faster (0.011 ms). Without dimensionality reduction, the KD approach breaks down (more than 700 ms for D lg ), which is a known fact [21,37]. However, the HNSW graph still facilitates fast retrieval (e.g., 0.021 ms for ν = 100). 
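The per-query runtime normalization can be sketched as follows, comparing the full search against a K-d tree on synthetic data; the dataset size, ν, and the repetition count are illustrative and much smaller than in our experiments.

```python
import time
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(2)
data = rng.standard_normal((5_000, 30))     # stand-in database embeddings
queries = rng.standard_normal((50, 30))     # stand-in query shingles
nu = 10                                     # number of nearest neighbors to retrieve
reps = 3

# Full search: all pairwise distances, then sorting by distance
t0 = time.perf_counter()
for _ in range(reps):
    d = np.linalg.norm(data[None, :, :] - queries[:, None, :], axis=2)
    full_ind = np.argsort(d, axis=1)[:, :nu]
t_full = (time.perf_counter() - t0) / (reps * len(queries))  # seconds per query

# K-d tree search (index built once, offline)
tree = KDTree(data)
t0 = time.perf_counter()
for _ in range(reps):
    _, kd_ind = tree.query(queries, k=nu)
t_kd = (time.perf_counter() - t0) / (reps * len(queries))

print(f"full: {t_full * 1e3:.3f} ms/query, kd: {t_kd * 1e3:.3f} ms/query")
```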
This substantial decrease in retrieval time shows the power of the graph-based search approach. In general, the tendencies discussed for the PCA reduction are similar when using the DNN-based embedding. Note that the exhaustive search involves computing all pairwise distances between the query and the database items. As a consequence, the parameter ν does not influence the search time. Furthermore, we did not perform the full search for D lg because of excessive memory requirements. Figure 5 shows the search runtimes for various dimensionalities K. We observe that the runtime increases more than linearly with the dimensionality K for the K-d tree. In contrast, the runtime for the HNSW graph increases only slightly with increasing K. Note the different scales of the vertical axes, which again underline the substantial search time improvements achieved with the HNSW graph. We see that the HNSW approach requires nearly the same time for searching 1, 10, or 100 database items (resulting in overlapping curves). The reason for this is the parameter setting of 100 intermediate neighbor candidates (described in Section 3.2), which leads the index to search internally for 100 neighbors anyway. To summarize, the dimensionality reduction approach (PCA or DNN) has only a minor influence on the runtime, which is expected. The dimensionality K of the index items has a substantial impact on the runtime. Still, the indexing approach (KD or HNSW) has the most important effect on the runtime. Our experiments show that, compared to K-d trees, the HNSW index is much faster and more stable with respect to the dimensionality and the number of items to be indexed. This increase in retrieval efficiency comes without substantial loss in retrieval quality (as shown in Section 4.2), which makes the HNSW graph a powerful tool for music retrieval.

Conclusions
In our study, we compared various search techniques for a cross-version music retrieval task, where we aim to find the closest shingles in a database to a given query shingle. In particular, using datasets of different sizes, we applied indexing approaches with exact (exhaustive search, K-d tree) and approximate (HNSW graph) solutions. Our results showed that the approximate solution of the graph-based index has almost no negative impact on the retrieval quality in our music scenario. As our main finding, we obtained dramatic speed-ups by several orders of magnitude for the search operations involved in our retrieval system. We verified that HNSW graphs are robust with respect to the dimensionality of the items to be indexed, unlike K-d trees. As another contribution, we explored the impact of the HNSW index on several further steps in our pipeline, such as constructing, saving, and loading the index structure.
We based our work on a previous study [9] that used shingles with highly specialized features for a cross-version music retrieval task. The authors aimed to reduce the shingle dimensionality with different embedding strategies to make the retrieval application more efficient. They found that it is possible to substantially reduce the shingle dimensionality with only a moderate loss in retrieval quality, where a DNN-based embedding is beneficial over a PCA-based reduction for small dimensionalities below K = 12. This reduction in dimensionality was essential for using K-d trees. Our experiments with HNSW graphs demonstrated that, for efficiency, a good indexing approach matters more than the shingle dimensionality: the runtimes obtained with the HNSW approach are generally lower than those obtained with the other search approaches, even when the latter use a strong dimensionality reduction. Still, dimensionality reduction may be important, e.g., to reduce disk space and memory requirements. Given our results, the remaining bottleneck of the retrieval pipeline is the feature computation for the query. In future research, one may employ feature representations that are less expensive to compute, e.g., using the STFT. Here, the embedding techniques may be useful to adapt and further enhance these "raw" representations for the retrieval task.
As for future work, one may also explore alternative approaches to accelerate the nearest neighbor search in our cross-version retrieval scenario. One possibility is to apply an appropriate prototype selection method [46,47], where the dataset is reduced by selecting representative prototypes. Another option is the use of pivot-based methods [48,49], where pre-computed distances between the database items to some selected items (the pivots) are exploited for excluding some database items during the search.
As a basis for further research in this direction, we make our work reproducible by using open-source implementations in all steps and by providing example code that shows how to apply these implementations, along with feature representations for an example dataset. Given our strong improvements in retrieval runtime without quality loss, we consider the HNSW graph a powerful tool that deserves more attention in the MIR community.
Author Contributions: All authors substantially contributed to this work by developing the main ideas and concepts. F.Z. and M.M. wrote the paper. F.Z. and J.B. implemented the approaches and conducted the experiments. Some results of this paper emerged from work for J.B.'s Master thesis [50], supervised by F.Z. and M.M. All authors have read and agreed to the published version of the manuscript.

Data Availability Statement:
We provide code that shows the usage of the index structures for music retrieval and pre-computed features for an example dataset in a publicly accessible repository (https://www.audiolabs-erlangen.de/resources/MIR/2020_signals-indexing, accessed on 12 May 2021).