1. Introduction
One of the longstanding problems of interest in the multimedia community is large-scale retrieval from databases of multimedia data. One area of focus in the last decade has been on building scalable retrieval systems that can handle the ever-increasing amount of data that is available. Much of this work has revolved around retrieval of natural images (e.g., [
1,
2,
3,
4]), spurred on by standardized retrieval benchmarks such as the Oxford Buildings [
5] and Paris [
6] datasets. A second area that has drawn significant interest in recent years is cross-modal retrieval, in which a query in one modality is used to retrieve an item in a different modality. Many works have focused on image-text retrieval (e.g., [
7,
8,
9,
10]), again spurred on by standardized benchmarks such as MIR Flickr [
11] and NUS-WIDE [
12].
This article studies a large-scale multimodal retrieval problem in the domain of music. Specifically, we investigate the problem of finding the matching sheet music for a short excerpt of piano music, where the excerpt may be in the form of an audio recording, MIDI file, digital scan of sheet music, or cell phone picture of sheet music. In our study, we consider all solo piano sheet music in the International Music Score Library Project (IMSLP) database, which contains a large collection of sheet music in the public domain along with corresponding metadata. This is the large-scale multimodal piano music identification task.
Previous work on sheet music retrieval falls into three areas. The first area of related works is audio-sheet image retrieval. The earliest works use optical music recognition (OMR) to first convert sheet music images into a symbolic music representation such as MIDI, extract chromatic features from both the audio and estimated MIDI representation, and then use dynamic time warping to perform alignment or retrieval. This approach has been used on a variety of audio-image retrieval and alignment tasks [
13,
14,
15,
16]. More recent approaches use convolutional neural networks (CNNs) to embed small fragments of sheet music and audio into a shared latent space that captures semantic similarity. This approach has been similarly applied to a variety of audio-image alignment and retrieval tasks [
17,
18,
19,
20]. This approach has since been extended in a number of ways, including using deep reinforcement learning to perform online audio-image score following [
21,
22] or performing position estimation on full page images [
23,
24].
The second area of related works is MIDI-sheet image retrieval. Several works bridge the modality gap by using a mid-level feature representation called a bootleg score that describes the positions of noteheads relative to staff lines in sheet music [
25]. This approach has been used for retrieval of MIDI passages using images of sheet music [
25,
26] as well as large-scale retrieval between datasets of sheet music and MIDI [
27,
28]. Other works use statistical models or OMR to estimate the sequence of symbols in sheet music, and then perform retrieval through string matching [
29], keyword spotting [
30], or n-gram lookups [
31,
32].
The third area of related works is image-image retrieval of sheet music. Because sheet music images are very high-dimensional, the general approach is to first extract a sequence of low-dimensional features, and then use dynamic time warping or fingerprinting methods to perform alignment or retrieval. Recent works have explored image-image retrieval tasks using the bootleg score representation [
28,
33], the predicted outputs of an OMR system [
34], and CNN features extracted from sheet music images and trained with triplet loss [
35].
Our problem of interest differs from previous work in three ways. First, we consider the piano music identification problem across multiple input modalities, not just one. Whereas previous works have generally focused on a single unimodal or cross-modal retrieval task (e.g., audio-audio retrieval [
36,
37], audio-image retrieval [
18,
19], image-MIDI retrieval [
26,
27]), we study the problem across a range of input modalities, including audio recordings, MIDI files, digital scans of sheet music, and cell phone pictures of sheet music. Second, we study the piano music identification problem using a searchable database of
sheet music. This has a very practical implication: whereas large databases of audio recordings are expensive to assemble and necessarily private due to copyright restrictions, there are large databases of sheet music that are publicly available and open to the research community. Third, we study the piano music identification problem at a very large scale, using all solo piano sheet music in the International Music Score Library Project (IMSLP), a database containing more than 600,000 sheet music scores in the public domain. Whereas most previous works have studied the piano music identification problem using only hundreds of searchable items in the database, we use the largest available dataset that is available to the research community.
Our approach is based on a novel fingerprinting scheme, which we call marketplace fingerprinting. The key defining characteristic of marketplace fingerprinting is choice: rather than trying to find the single best global fingerprint design and processing all queries with the same fingerprint design, we instead consider a range of different fingerprint designs and select the fingerprint designs at runtime that are best for a specific query. We show that this setup is equivalent to an economics scenario in which a consumer is given multiple options in a store and needs to decide which items to purchase. As in many economics scenarios, the consumer and the store have competing interests, which leads to a type of minimax optimization problem. By framing the fingerprinting problem in this manner, we are able to design a system that corresponds to a very efficient marketplace.
This paper has three main contributions. First, we propose a novel fingerprinting scheme called marketplace fingerprinting that operates on a generalization of standard n-grams. This approach frames the fingerprinting problem as an economics problem, and it yields a solution which corresponds to an efficient marketplace. This approach provides a way to design a fingerprint database and select fingerprint designs at runtime in order to maximize performance. Second, we provide experimental validation of the marketplace approach on four different sheet music retrieval tasks involving multiple input modalities, including sheet music, audio, and MIDI. Across all four retrieval tasks, our results show that the marketplace approach substantially outperforms previous methods in retrieval accuracy while simultaneously reducing average runtime. Third, we present extensive analyses to provide deeper insight and intuition into system behavior.
We note that this article is a journal extension to an earlier conference paper [
38]. It extends the conference paper by (a) proposing modifications to the original image-image retrieval system to enable cross-modal retrieval (
Section 2.5); (b) presenting experimental results on four different retrieval tasks involving multiple input modalities (
Section 3.2,
Section 3.3,
Section 3.4 and
Section 3.5), whereas the conference paper only considers a single input modality; and (c) conducting new or expanded analyses to provide deeper intuition into system behavior (
Section 4.1,
Section 4.2,
Section 4.3 and
Section 4.4). Code for this project can be found at
https://github.com/HMC-MIR/Marketplace (accessed on 23 April 2022).
This article is organized as follows.
Section 2 describes the proposed approach.
Section 3 explains the experimental setup and presents our empirical results.
Section 4 provides several analyses of the proposed system.
Section 5 concludes the work.
2. Materials and Methods
Figure 1 shows an overview of our proposed marketplace fingerprinting system for piano music identification. There are four main components: calculating a bootleg score (
Section 2.1), constructing generalized n-gram fingerprints (
Section 2.2), creating the marketplace database (
Section 2.3), and performing a runtime search (
Section 2.4). We first describe the system in its entirety assuming that the input is a sheet music image (
Section 2.1–
Section 2.4), and then we describe ways to modify the system to handle cross-modal searches (
Section 2.5). We point out that the bootleg score representation is adopted from previous work, but the remaining parts of the proposed system are novel contributions.
2.1. Bootleg Score
The first component (red box in
Figure 1) is converting sheet music images into a feature representation called a bootleg score [
25,
26]. The bootleg score is a mid-level feature representation that encodes the positions of filled noteheads in sheet music. It is represented as a
binary matrix, where 62 indicates the total number of possible staff line positions in both the left- and right-hand staves and where
L indicates the number of grouped note events (e.g., a chord containing three notes played simultaneously would constitute a single grouped note event). The first 28 staffline positions correspond to notes A0 through G4 in the left-hand staff, and the last 34 staffline positions correspond to notes E3 through C8 in the right-hand staff. Note that the range E3 through G4 is represented in both the left-hand and right-hand sections of a bootleg score column since notes in this middle range commonly appear in both staves. The upper half of
Figure 2 shows a short snippet of sheet music and its corresponding bootleg score representation. Clearly, this representation discards a lot of information contained in the sheet music, including non-filled noteheads (e.g., half notes or whole notes), note duration, accidentals, key signature, time signature, clef changes, and octave markings. Nonetheless, it has been used effectively for a wide range of tasks such as sheet-MIDI retrieval [
25,
26,
27], sheet music identification [
28,
33], and audio-sheet synchronization [
39,
40].
2.2. Generalized n-Gram Fingerprints
The second component (blue box in
Figure 1) is constructing generalized n-gram fingerprints from the bootleg score. This consists of two sub-steps, which are described in the next two paragraphs. The bottom half of
Figure 2 also illustrates the process.
The first sub-step is to convert the bootleg score into a list of integers. Since the bootleg score is a binary matrix, we can encode each 62-bit column as a 64-bit integer. The result is a sequence of L integers .
The second sub-step is to construct a set of fingerprints that we refer to as generalized n-grams. The concept of an n-gram comes from the language modeling literature [
41], where the probability of observing a sequence of
n consecutive words is estimated based on its frequency of occurrence in a large set of data. Whereas a standard n-gram considers
n consecutive elements, our generalized n-gram considers a subset of
n elements within a fixed context window. In order to avoid duplicates at different offsets, we always assume that the first “anchor” element is selected. For example, with a context window of length 4, there are a total of 8 generalized n-grams at each offset
i: one 1-gram
, three 2-grams
, three 3-grams
, and one 4-gram
. In our experiments, we use a context window of length 6 and consider up to 3-grams, resulting in a total of 16 generalized n-grams at each offset. By considering combinations of non-consecutive elements, this representation is more robust to errors in the bootleg score representation.
2.3. Marketplace Database
The third component (orange box in
Figure 1) is creating a database containing a subset of generalized n-gram fingerprints. Ideally, we would like to include all generalized n-grams from every page of sheet music in the dataset. However, this would take up too much RAM for a dataset as large as IMSLP. Therefore, we need to choose a subset of fingerprints to include in a database of finite size.
We use principles of economics to decide which fingerprints to include in the database. This problem is analogous to selecting which products to put in a store given finite shelf space. In this analogy, there are two agents with competing interests: the consumer and the store manager. The store manager wishes to maximize profit, whereas the consumer wishes to maximize utility given a finite budget. Note that simply offering the products with the highest profit margin will not result in the highest total profits, since the consumer may be less inclined to purchase expensive items. There is some optimal equilibrium between the competing interests of the store manager and the consumer. This scenario occurs frequently in economics and often leads to solutions like a Nash equilibrium or pareto-optimal solution.
We frame our database design problem based on the economics scenario described above. In our analogy, the store is the database, the store manager is the database designer, and the limited amount of shelf space is the amount of RAM on a server. Likewise, the consumer is a single query, purchased items are n-gram fingerprints, and the consumer’s budget is the amount of latency we are willing to tolerate to do a search. Given this analogy, the database design problem is akin to deciding which items to put on store shelves.
The defining characteristic of the marketplace fingerprinting approach is choice. Rather than trying to find the single best fingerprint design and forcing every query to use the same fingerprint design, we instead offer a variety of fingerprint designs and let each query select the fingerprint designs that are most effective for that particular query. This is analogous to a store offering multiple choices to a consumer and letting the consumer choose the items that provide the most utility given the consumer’s spending budget. This economics approach requires a consideration of both the utility and the cost of a fingerprint, which we describe in the next two paragraphs.
We define a fingerprint’s utility as the probability of a correct lookup. This definition has a very clear probabilistic interpretation. If the fingerprint lookup is correct, it will add 1 point to the score of the true matching item in the database. If fingerprint lookup is incorrect, it will add 0 points to the score of the true matching item in the database. The probability of a correct lookup can therefore be interpreted as the expected number of points that a fingerprint will add to the true matching item in the database. We estimate this probability of correctness based on the training queries in the following way. First, we perform subsequence DTW between each query bootleg score and its true matching reference bootleg score in the IMSLP database. The subsequence DTW provides an (estimated) alignment between each column of the query bootleg score and its matching column in the reference bootleg score. Second, for each element
in the estimated subsequence alignment path, we determine if the generalized n-gram of type
T at offset
i in the query exactly matches the generalized n-gram of type
T at offset
j in the reference. Third, we aggregate statistics across all training queries to estimate a probability of correctness for each n-gram type
T.
Figure 3 shows the estimated probability of correctness for all 16 generalized n-gram types on the training data. The numbers on the x-axis indicate the generalized n-gram type (e.g., “02” indicates the 2-gram
and “025” indicates the 3-gram
). Unsurprisingly, shorter n-grams and n-grams with smaller context have a higher probability of correctness.
Each fingerprint has two conceptually different types of costs: a memory cost and a runtime cost. The memory cost is the cost to the store and is analogous to the shelf space that an item takes up. We use the number of matches a fingerprint has in the database as a proxy for memory cost because it correlates strongly with the actual amount of memory required to store its occurrences in a reverse index. The runtime cost is the cost to the consumer and is analogous to the price that the consumer pays for an item. We use the number of matches a fingerprint has in the database as a proxy for runtime cost, since processing more matches requires more runtime.
The database design problem has two competing interests: the database’s (i.e., store’s) priorities and the query’s (i.e., consumer’s) priorities. The
database prefers items that have a high utility-to-memory-cost ratio. From the database’s perspective, the best fingerprints are very common 1-grams, which have a high probability of correctness (see
Figure 3) and low memory overhead. Note that adding a new fingerprint to a reverse index incurs a small amount of memory overhead, and this overhead is minimized proportionally when the fingerprint occurs many times. By selecting fingerprints that occur frequently, it is possible to store more total matches in the database. The
query prefers items that have a high utility-to-runtime-cost ratio. From the query’s perspective, the best fingerprints are those that occur very rarely in the database and thus require processing very few database matches. Note that very common 1-grams are a poor tradeoff for the query: they require processing potentially millions of matches in the database in order to add 1 point to the match score of the true matching item. Similarly, very rare 1-grams are a poor tradeoff for the database: they require the maximum (proportional) amount of memory overhead, which limits the total number of matches that can be fit into a fixed amount of memory. The best database design should therefore have fingerprints with a high probability of correctness that are neither extremely common nor extremely rare.
Based on these considerations, we select which fingerprints to include in the database in the following manner. First, we sort all unique fingerprints by their total utility, which we calculate as the utility multiplied by the number of occurrences in the database. The total utility tends to favor shorter n-grams (higher utility) and very common n-grams (more occurrences). Next, we discard any fingerprints that occur more than times. The hyperparameter is determined empirically to remove items that are so expensive that they are never purchased by the consumer and therefore provide no benefit in practice. For example, common 1-grams would require processing millions of matches in the database and will never be selected because of their extremely high cost. Finally, we put the remaining fingerprints into the database in decreasing order of their total utility until we run out of RAM.
The marketplace database is a reverse index of locations of the selected fingerprints. The keys are the fingerprints, which are tuples of integers. The values are a list of all instances where the fingerprint occurs in the database. These instances are specified as a list of (id, offset) tuples, where id identifies the IMSLP PDF and offset indicates where in the bootleg score it occurs. In practice, we implement this as a collection of databases where we we first select the appropriate database based on the n-gram type and then do a lookup in the appropriate database based on the actual n-gram value. In our experiments, the largest database has about 100 million unique fingerprints. These fingerprints have approximately 715 million total matches and take up 120 GB of RAM.
2.4. Image-Image Search
The fourth component (green box in
Figure 1) is searching the database at runtime. At an application level, we would like to design the search to maximize retrieval accuracy while staying below a maximum allowable runtime. Accordingly, our search mechanism uses a greedy strategy to maximize utility while remaining under a fixed total budget. We describe the strategy in the consumer-store analogy to provide intuition, and then provide technical details in the next paragraph. The consumer (1) decides on a total budget to spend in the store, (2) divides the total budget by the number of aisles in the store to determine a budget per aisle, (3) rank orders items in each aisle by their utility-to-cost ratio, and then (4) purchases items in each aisle in decreasing order of utility-to-cost ratio until the aisle budget is spent. Any unused funds carry over to the next aisle, and the process repeats for all aisles in the store. This strategy ensures that the consumer never spends more than the total budget, spreads out the spending across all aisles, and approximately maximizes utility in each aisle.
The purchasing strategy described above corresponds to the following steps. We first compute the bootleg score features from a query, where L indicates the number of bootleg score columns in the query. For each offset , we compute 16 generalized n-gram fingerprints. We can denote each fingerprint as , where i indicates the offset and j indicates the n-gram type. Note that each offset corresponds to an aisle and each n-gram type corresponds to a different item. The consumer decides on the total runtime budget (same for all queries) and determines a budget-per-aisle (unique for each query). Note that is the maximum number of database matches we are willing to process for each query, which correlates strongly with runtime. At each offset, the consumer purchases items (i.e., selects fingerprints) in decreasing order of utility-to-cost ratio until the budget-per-aisle has been depleted. Utility-to-cost ratio is calculated as , where indicates the probability of correctness (based only on the n-gram type j) and indicates the number of times occurs in the database. Note that both and can be computed from lookups in an inverted file index without having to actually process a list of matches in the database. Any remaining budget is carried over to the next aisle, and the process is repeated for all offsets . The end result of this purchasing strategy is a list of fingerprints whose matches in the database will be processed.
Match scores are calculated using a histogram of offsets method [
42]. First, each selected fingerprint
is used to query the inverted file index to retrieve a list of database matches of the form (
id,
offset), where
id specifies the IMSLP PDF and
offset indicates the location in the bootleg score where the fingerprint
occurs. A histogram is constructed for each IMSLP PDF, and each database match (
id,
offset) is processed by adding the relative offset (
offset) to the histogram for PDF
id, where the histograms use a bin size of 10. Note that a sequence of matching fingerprints will result in many matches falling into the same histogram bin, resulting in a peak in the histogram. Therefore, after processing all database matches in the query, we compute a PDF’s match score by the maximum bin count in the histogram. Since there may be multiple PDF versions of a single piece, we compute the piece match score as the maximum PDF score of its constituent PDFs. We then rank all pieces in IMSLP by their piece match score. This ranked list of pieces is the final prediction of the system.
2.5. Cross-Modal Search
The same general approach shown in
Figure 1 can be used for cross-modal search. In this subsection, we describe modifications to enable sheet music retrieval based on MIDI queries and audio queries. For audio recordings of piano music, we first convert the audio to MIDI using the Onsets & Frames automatic music transcription (AMT) system [
43]. Therefore, in the remainder of this section, we focus on how to enable sheet music retrieval based on a MIDI query. This requires changes to the bootleg score representation and the search mechanism.
The bootleg score representation needs to be modified to handle two types of ambiguities between MIDI and sheet music data. The first type of ambiguity is enharmonic ambiguity. For example, and are enharmonically equivalent but appear in different staff line positions in the sheet music (and therefore different positions in the bootleg score representation). The second type of ambiguity is clef ambiguity. Note that a MIDI note onset does not specify which hand is used to play the note. For high or low notes, the hand can easily be inferred, but it can be unclear which hand should play notes in the middle register. For example, if the left hand played a middle C, there would be a notehead one ledger line above the left hand staff of the bootleg score. If the right hand played it instead, the notehead would appear one ledger line below the right-hand staff. These two types of ambiguity are handled in different ways, which we describe next.
Enharmonic ambiguities are handled by considering two different versions of the bootleg score. The first version is the sharp version, in which all black notes on a piano are interpreted as sharps. For example, the MIDI note 85
would be interpreted as a
that appears two ledger lines above the right-hand staff. The second version is the flat version, in which all black notes on a piano are interpreted as flats. In the flat version, the MIDI note 85 would appear as a
. Both the sharp and flat versions of the bootleg score can be used to query the database, and the version that achieves a higher match score is used. This approach to handling enharmonic ambiguity was proposed in [
27] and adopted here without change.
We experimented with two different ways to handle clef ambiguities. The first is adopted from a previously proposed method, and the second is a novel experimental method.
The first way to handle clef ambiguity is to simply place noteheads at all possible positions in the bootleg score. For example, a MIDI note 60 (
) would appear in two positions in the bootleg score: one ledger line above the left-hand staff and one ledger line below the right-hand staff. This approach effectively duplicates all noteheads in the bootleg score that appear in the middle register. Note that this requires duplicating the middle register in the MIDI query bootleg score as well as all database sheet music bootleg scores. This way of handling clef ambiguity was first proposed in [
27] and has been shown to be effective in enabling cross-modal retrieval.
The second way to handle clef ambiguity is to predict at runtime which MIDI notes appear in the right-hand staff and which appear in the left-hand staff. For each column of the MIDI bootleg score, we first generate a list of candidate bootleg score columns. This list of candidates is generated by considering all left/right hand attribution combinations for notes in the middle register, which we define as notes appearing in the bootleg score between staff positions 18 and 38. For example, if there are N notes in the middle register, there will be candidates. Any notes appearing above position 38 are assumed to be in the right-hand staff, and any notes appearing below position 18 are assumed to be in the left-hand staff. Next, we determine which of the candidates is most likely. We do this by simply treating each candidate as a 1-gram fingerprint, and selecting the candidate that occurs the most times in the database. This process is repeated for each column of the MIDI bootleg score until the entire bootleg score is generated. Note that this method of handling clef ambiguity is done at runtime and does not require changing the bootleg scores in the database.
By incorporating the above changes to handle enharmonic and clef ambiguities, the marketplace system described in
Section 2.1,
Section 2.2,
Section 2.3 and
Section 2.4 can be easily adapted to handle cross-modal retrieval.
3. Results
In this section, we describe our experimental setup (
Section 3.1) and present experimental results on the four sheet music retrieval tasks: camera-based sheet music retrieval (
Section 3.2), scan-based sheet music retrieval (
Section 3.3), MIDI-based sheet music retrieval (
Section 3.4), and audio-based sheet music retrieval (
Section 3.5).
3.1. Experimental Setup
We perform experiments on four different retrieval tasks. Below, we describe the data used in our experiments and the evaluation methodology.
All four tasks use the same searchable database. The database consists of all solo piano sheet music in IMSLP. The database has 31,384 PDFs, 29,310 unique pieces, and 374,758 sheet music pages. This is the same database as [
28].
The first retrieval task uses cell phone pictures of physical sheet music as queries. We use the Sheet-MIDI Retrieval dataset [
26]. This dataset contains 2000 different cell-phone images from 25 different composers and 200 different piano pieces. The cell phone images are taken of physical pages of piano sheet music placed in various locations. There are 40 train and 160 test pieces. The dataset includes 10 images per piece, resulting in 400 train queries and 1600 test queries. We use the same train/test split as in [
33].
The second retrieval task uses scanned sheet music images as queries. This dataset is generated from the corresponding PDFs from the Sheet-MIDI Retrieval dataset. Each PDF page is converted to an image and used as a query. The PDFs are converted to PNG files at 300 dpi and resized so that the width of the image is 2550 pixels. The train queries consist of 217 single page scans from the 40 train pieces, and the test queries consist of 863 single page scans from the 160 test pieces.
The third retrieval task uses MIDI files as queries. The data is taken from the MAESTRO dataset [
44]. This dataset consists of audio and MIDI pairs. The audio files are recordings from a piano competition where the pianos were equipped to record MIDI note events. For each (audio, MIDI) pair, we manually found the matching piece in the IMSLP dataset. There are a total of 1276 MIDI files, and we use
(256) of the MIDI files for training and
(1020) for testing. From each MIDI file, we take
random samples of duration
L seconds. We consider
. The end result is a collection of 6 datasets, one for each value of
L.
The fourth retrieval task uses audio recordings as queries. The audio dataset is also taken from the MAESTRO dataset and is prepared similarly to the MIDI data. There are a total of 1276 audio files, and we use (256) of the audio files for training and (1020) for testing. From each audio file, we take samples of duration L seconds. We consider .
All tasks are evaluated along two different axes: retrieval accuracy and runtime. Each query matches a single unique piece in the IMSLP database. Therefore, we evaluate retrieval accuracy using mean reciprocal rank (MRR). The MRR is calculated as , where indicates the rank of the true matching item for the query and N indicates the total number of queries. We also measure the runtime required to process each query and report the average and standard deviation of query runtimes.
3.2. Camera-Based Sheet Music Retrieval
We compare the results of seven different systems. The first five systems use fixed n-gram fingerprints based on bootleg score features. This approach comes from [
27], which proposes using each column of a bootleg score representation as a fingerprint (i.e., a fixed 1-gram), and we also include the straightforward generalization of this strategy to 2-grams, 3-grams, 4-grams, and 5-grams. The sixth system uses dynamic n-gram fingerprints [
28,
33] based on bootleg score features, which is the current state of the art for sheet music retrieval. In the dynamic n-gram approach, the database contains multiple types of fixed n-gram fingerprints (e.g., 1-grams, 2-grams, 3-grams, and 4-grams), and at runtime, the size of each individual n-gram (query) fingerprint is selected dynamically to ensure that the fingerprint is sufficiently discriminative to warrant a table lookup. The seventh system is our proposed marketplace fingerprinting system.
Table 1 shows results on the camera-based sheet music retrieval task. To gain further insight, we evaluate performance under two different conditions. Condition 1 is when the original PDF (that the cell phone image is taken of) exists in the database. Condition 2 is when the original PDF is not in the database but an alternate edition of the same piece exists in the database. We can simulate the second condition by simply removing the exact matching PDF from the database. This means that queries for which there is only a single PDF available in IMSLP are removed from the test set under condition 2. For experiments under the second condition, there are a total of 370 train queries and 1560 test queries. Note that each row of the table corresponds to a single experiment, and the MRRs for the two conditions are determined in the evaluation phase.
There are three things to notice about
Table 1. First, the marketplace system substantially outperforms all other systems in retrieval accuracy. For example, the marketplace approach achieves
and
MRR under conditions 1 and 2, respectively, while the dynamic n-gram approach achieves
and
MRR. Second, the marketplace system has the lowest average runtime of all systems. The benefit of the marketplace approach is not simply a tradeoff between runtime and retrieval accuracy: it significantly improves retrieval accuracy while simultaneously reducing or maintaining average runtime. It is therefore a unilaterally better solution than the dynamic n-gram approach. Third, there is a small decrease in MRR if we remove the original PDF from the database (Condition 2). This indicates that the system is relatively insensitive to differences in various printed editions.
3.3. Scan-Based Sheet Music Retrieval
The baseline systems for scan-based sheet music retrieval are the same as those for the camera-based task. The only modification to the systems is that the bootleg score feature extraction uses slightly different hyperparameters to account for the differences between cell-phone pictures and scanned images. All systems share the same (modified) set of hyperparameters.
Table 2 shows the results on the scan-based sheet music retrieval task. There are three things to notice. First, the marketplace system unilaterally outperforms all other systems in both retrieval accuracy and runtime. For example, the marketplace approach achieves
MRR with an average runtime of
seconds, while the dynamic n-gram approach achieves
MRR with an average runtime of
s. Second, all systems show substantial improvement in retrieval accuracy compared to the camera-based retrieval task (
Table 1). This performance gap can be interpreted as the detrimental effect of non-ideal perspective and lighting conditions when using cell phone images. Third, there is a small decrease in MRR if we remove the original PDF from the database (Condition 2). Overall, we see the same trends and patterns for scan-based and cell-phone-based sheet music retrieval tasks.
3.4. MIDI-Based Sheet Music Retrieval
The baseline systems for MIDI-based sheet music retrieval are the same as those for the previous two tasks. However, there are two modifications to the baseline systems that are needed to bridge the modality gap between MIDI and sheet music. The first modification is to handle enharmonic ambiguities for black notes on a piano. We handle this ambiguity by generating both sharp and flat versions of the MIDI bootleg score (as described in
Section 2.5, paragraph 3), using both bootleg scores to search the database, and keeping the one with the higher match score. The second modification is to handle ambiguity about whether a MIDI note in the middle register appears in the left- or right-hand staff. We handle this ambiguity by placing all notes in the middle register in both the right and left-hand staves in the bootleg score (as described in
Section 2.5, paragraph 5). By applying this change to both the database bootleg scores and the query bootleg score, this ambiguity can be resolved. The two modifications described above were used in [
27,
28] to enable cross-modal search between MIDI and sheet music images.
Figure 4 shows results on the MIDI-based sheet music retrieval task. The different groups of bars show the performance with queries of different duration. For example, the
group indicates the retrieval accuracy with MIDI queries that are 30 s long. The individual bars within each group show the performance of different systems. For deeper insight, we evaluate the marketplace approach with two different feature representations. The purple bar (“Marketplace 1”) represents the marketplace system that uses bootleg score features with a duplicated middle register, as described in the previous paragraph. The pink bar (“Marketplace 2”) represents the marketplace system using the experimental feature representation described in paragraph 6 of
Section 2.5, which involves predicting the clef of middle register notes. The gap between the purple and pink bars provides insight into how important the cross-modal feature representation is to the effectiveness of the marketplace approach.
There are three things to notice about
Figure 4. First, the proposed marketplace system (pink) substantially outperforms all others systems by a wide margin. For example, for
s queries, the marketplace approach achieves
MRR, while the dynamic n-gram approach achieves
MRR. Second, longer duration queries yield steady improvement in retrieval accuracy across all systems. While it is no surprise that longer queries have better performance, it is useful to note that none of the systems have reached asymptotic performance up to
s queries but continue to show improvement as query length increases. Third, the proposed cross-modal feature representation is critical to the strong performance of the marketplace fingerprinting approach. When the marketplace approach is used with a bootleg score representation with duplicated middle notes (purple), the performance is on par with the dynamic n-gram fingerprinting approach (dark blue). But when MIDI notes in the middle register are assigned to the left- or right-hand clefs, the marketplace approach (pink) shows substantially better performance than the dynamic n-gram approach. We might expect that this novel cross-modal feature representation would also improve performance if used with the dynamic n-gram or fixed n-gram approaches.
3.5. Audio-Based Sheet Music Retrieval
The baseline systems for audio-based sheet music retrieval are the same as those for the MIDI-based task. The only difference is that the audio queries are first preprocessed with the Onsets & Frames automatic music transcription system [
43], which converts the audio query into a MIDI query. The MIDI query is then processed by the system exactly as before.
Figure 5 shows results on the audio-based sheet music retrieval task. Each of the groups of bars represent the performance with queries of different duration. Within each group, the individual bars show the performance of the baseline systems and the proposed marketplace approach. As before, we show the performance of the marketplace approach with two different feature representations: a bootleg score with duplicated notes in the middle register (“Marketplace 1”) and a bootleg score where notes in the middle register are classified into left- and right-hand staves (“Marketplace 2”).
Figure 5 shows the same trends as in
Figure 4. The proposed marketplace approach (“Marketplace 2”) outperforms all other systems by a substantial margin. Longer query durations up to
s show steady improvement in retrieval accuracy. The cross-modal feature representation is again critical to the strong performance of the marketplace approach. Additionally, we observe a slight but consistent decrease in MRRs with audio queries compared to MIDI queries. This accuracy decrease is due to errors in the automatic music transcription preprocessing.
5. Conclusions
This article describes a novel method for piano music identification called marketplace fingerprinting. The key defining characteristic of marketplace fingerprinting is choice: we define a broad range of fingerprint designs based on a generalization of n-grams, and then select the fingerprint designs at runtime that are best for a specific query. We show that this setup is equivalent to an economics scenario in which a consumer and a store have competing interests, and we use basic principles of economics to design a system that is an efficient marketplace. We present experimental results on four different retrieval tasks using various forms of piano music, including sheet music images, MIDI files, and audio recordings. Our results indicate that marketplace fingerprinting substantially outperforms previous methods in retrieval accuracy, while simultaneously reducing average runtime.
Future work includes applying the marketplace fingerprinting method to other retrieval tasks and other domains. Marketplace fingerprinting can be interpreted as a retrieval method specifically designed for data that is sequential, sparse, and binary (which in our piano music identification task was a bootleg score representation). In these situations, we might expect a traditional hashing method like locality-sensitive hashing [
45] to be inefficient, since most of the elements are zeros. Likewise, we would expect marketplace fingerprinting to be much more effective in finding specific sequence matches than retrieval methods, such as term frequency-inverse document frequency [
46], that consider only frequency of occurrence. We leave controlled comparisons of these approaches on various types of data for future work.