A Deeper Look at Sheet Music Composer Classification Using Self-Supervised Pretraining

Abstract: This article studies a composer style classification task based on raw sheet music images. While previous works on composer recognition have relied exclusively on supervised learning, we explore the use of self-supervised pretraining methods that have been recently developed for natural language processing. We first convert sheet music images to sequences of musical words, train a language model on a large set of unlabeled musical "sentences", initialize a classifier with the pretrained language model weights, and then finetune the classifier on a small set of labeled data. We conduct extensive experiments on International Music Score Library Project (IMSLP) piano data using a range of modern language model architectures. We show that pretraining substantially improves classification performance and that Transformer-based architectures perform best. We also introduce two data augmentation strategies and present evidence that the model learns generalizable and semantically meaningful information.


Introduction
While much of the research involving sheet music images revolves around optical music recognition (OMR) [1], many other interesting applications and problems can be addressed without requiring OMR as an initial preprocessing step. Some recent examples include piece identification based on a cell phone picture of a physical page of sheet music [2], audio-sheet image score following and alignment [3,4], and finding matches between the Lakh MIDI dataset and the International Music Score Library Project (IMSLP) database [5]. This article explores a composer style classification task based on raw sheet music images. A description of compositional style would be useful in applications such as music recommendation and generation, and could serve as an additional tool for musicological analysis of historical compositions.
Many previous works have studied various forms of the composer classification or style recognition task, generally adopting one of three broad approaches. The first approach is to extract manually designed features from the music and then apply a classifier. Some features that have been explored include chroma information [6,7], expert musicological features [8,9,10], low-level information such as musical intervals or counterpoint characteristics [11,12], high-level information such as piece-level statistics or global features that describe piece structure [7,13], and pre-defined feature sets such as the jSymbolic toolbox [14,15]. Many different classifiers have been explored for this task, including more interpretable models like decision trees [12,14] and KNN [16], simple classifiers like logistic regression [10] or SVMs [8,14], and more complex neural network models [7,17]. The second approach is to feed low-level features such as notes or note intervals into a sequence-based model. The two most common sequence-based models are N-gram language models [13,18,19,20] and Markov models [21,22].
Beyond the main classification experiments, this article also visualizes selected bootleg score features in order to gain intuition about what the model is learning (Section 4.3), explores the importance of left- and right-hand information in classifying queries (Section 4.2), and conducts experiments on unseen composers using the trained models as feature extractors in order to show that the models learn a generalizable notion of compositional style that extends beyond the composers in the labeled dataset (Section 4.4).
Figure 1 shows an overview of our proposed approach to the composer style classification task. There are three main stages: pretraining, finetuning, and inference. We describe each of these in detail in the next three subsections.
Figure 1. Overview of approach to composer style classification of sheet music images. The input x is a raw sheet music image, and the label y indicates the composer.

Pretraining
The first stage is to pretrain the model using self-supervised language modeling. Given a set of unlabeled sheet music images, the output of the pretraining stage is a trained language model. The pretraining consists of four steps, as shown at the top of Figure 1.
The first step of pretraining is to convert each sheet music image into a bootleg score representation. The bootleg score [41] is a mid-level feature representation that encodes the sequence and positions of filled noteheads in sheet music. It was originally proposed for a sheet image-MIDI retrieval task, but has been successfully applied to other tasks such as sheet music identification [2] and audio-sheet image synchronization [43]. The bootleg score itself is a 62 × N binary matrix, where 62 indicates the total number of possible staff line positions in both the left- and right-hand staves and N indicates the total number of grouped note events after collapsing simultaneous note onsets (e.g., a chord containing four notes would constitute a single grouped note event). Figure 2 shows a short excerpt of sheet music and its corresponding bootleg score, which contains N = 23 grouped note events. The bootleg score is computed by detecting three types of musical objects in the sheet music: staff lines, bar lines, and filled noteheads. Because these objects all have simple geometric shapes (i.e., straight lines and circles), they can be detected reliably and efficiently using classical computer vision techniques such as erosion, dilation, and blob detection. The feature extraction has no trainable weights and only about 30 hyperparameters, so it is very robust to overfitting. Indeed, the feature representation has been successfully used with very different types of data, including synthetic sheet music, scanned sheet music, and cell phone pictures of physical sheet music. For more details on the bootleg score feature representation, the reader is referred to [41].
Figure 2. A short excerpt of sheet music and its corresponding bootleg score. Filled noteheads in the sheet music are identified and encoded as a sequence of binary columns in the bootleg score representation. The staff lines in the bootleg score are shown as a visual aid, but are not present in the actual representation.
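As a concrete (and simplified) illustration, the bootleg score can be stored as a list of 62-dimensional binary columns, one per grouped note event. This is a sketch of the data structure only; the notehead detection pipeline is not shown, and `make_bootleg_score` is a hypothetical helper name, not the authors' code:

```python
# Sketch of the bootleg score data structure (illustrative, not the original
# extraction code). Each grouped note event is a set of staff-line indices in
# [0, 62); the 62 x N binary matrix is stored as a list of N column vectors.

NUM_POSITIONS = 62  # staff-line positions across both hands

def make_bootleg_score(note_events):
    """Build a bootleg score (list of N binary columns of length 62)."""
    score = []
    for event in note_events:          # event: iterable of staff-line indices
        col = [0] * NUM_POSITIONS
        for pos in event:
            if 0 <= pos < NUM_POSITIONS:
                col[pos] = 1
        score.append(col)
    return score

# e.g., a 4-note chord is a single grouped note event -> one column
score = make_bootleg_score([{10, 14, 17, 21}, {30}, {30, 34}])
assert len(score) == 3 and sum(score[0]) == 4
```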
The second step in pretraining is to augment the data by pitch shifting. One interesting characteristic of the bootleg score representation is that key transpositions correspond to shifts in the feature representation. Therefore, one easy way to augment the data is to consider shifted versions of each bootleg score up to a maximum of ±K shifts. For example, when K = 2, this augmentation strategy will result in five times more data: the original data plus versions with shifts of +2, +1, −1, and −2. When pitch shifting causes a notehead to "fall off" the bootleg score canvas (e.g., a notehead near the top of the left-hand staff gets shifted up beyond the staff's range), the notehead is simply eliminated. In our experiments, we considered values of K up to 3.
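The shift-and-drop behavior described above might be sketched as follows (function names are ours; the original implementation may differ):

```python
def pitch_shift(score, shift):
    """Shift every column of a bootleg score by `shift` staff-line positions.
    Noteheads shifted beyond the 62-position canvas are simply dropped."""
    shifted = []
    for col in score:
        new_col = [0] * len(col)
        for pos, bit in enumerate(col):
            new_pos = pos + shift
            if bit and 0 <= new_pos < len(col):
                new_col[new_pos] = 1
        shifted.append(new_col)
    return shifted

def augment(score, K):
    """Return all shifts in [-K, K], i.e., 2K + 1 versions including the original."""
    return [pitch_shift(score, s) for s in range(-K, K + 1)]

versions = augment([[0] * 62], 2)
assert len(versions) == 5  # K = 2 -> five times more data
```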
The third step in pretraining is to tokenize each bootleg score into a sequence of words or subwords. This is done in one of two ways, depending on the underlying language model. For word-based language models (e.g., AWD-LSTM [44]), each column of the bootleg score is treated as a single word. In practice, this can be done by simply representing each column of 62 bits as a single 64-bit integer. Words that occur fewer than three times in the data are mapped to a special unknown token <UNK>. For subword-based language models (e.g., GPT-2 [36], RoBERTa [45]), each column of the bootleg score is represented as a sequence of 8 bytes, where each byte constitutes a single character. The list of unique characters forms the initial vocabulary set. We then apply a byte pair encoding (BPE) algorithm [46] to iteratively merge the most frequently occurring pairs of adjacent vocabulary items until we reach a desired subword vocabulary size. In our experiments, we use a vocabulary size of 30,000 subwords. The (trained) byte pair encoder can then transform a sequence of 8-bit characters into a sequence of subwords. This type of subword tokenization is commonly used in modern language models. At the end of this third step, each bootleg score is represented as a sequence of word or subword units.
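A minimal sketch of the word-level packing and the byte splitting that precedes BPE (the BPE merge loop itself is omitted, and all helper names are ours):

```python
from collections import Counter

def column_to_word(col):
    """Pack a 62-bit bootleg score column into a single integer 'word'."""
    word = 0
    for pos, bit in enumerate(col):
        if bit:
            word |= 1 << pos
    return word

def build_vocab(words, min_count=3, unk="<UNK>"):
    """Words occurring fewer than `min_count` times map to <UNK>."""
    counts = Counter(words)
    return {w: (w if c >= min_count else unk) for w, c in counts.items()}

def word_to_bytes(word):
    """For subword models: split the 64-bit word into 8 one-byte 'characters',
    which then feed a byte pair encoder (BPE training not shown)."""
    return [(word >> (8 * i)) & 0xFF for i in range(8)]

col = [0] * 62
col[3] = 1
w = column_to_word(col)
assert w == 8 and len(word_to_bytes(w)) == 8
```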
The fourth step in pretraining is to train a language model on the sequence of words or subwords. In this work, we consider three different language models: AWD-LSTM [44], GPT-2 [36], and RoBERTa [45]. The AWD-LSTM is a 3-layer LSTM model that incorporates several different types of dropout throughout the model. The model is trained to predict the next word at each time step. The GPT-2 model is a 6-layer Transformer decoder model that is trained to predict the next subword at each time step. The subword units are fed into an embedding layer, combined with positional word embeddings, and then fed to a sequence of six Transformer decoder layers. Though larger versions of GPT-2 have been studied [36], we only consider a modestly sized GPT-2 architecture for computational and practical reasons. The RoBERTa model is a 6-layer Transformer encoder model that is trained on a masked language modeling task. During training, a certain fraction of subwords in the input are randomly masked, and the model is trained to predict the identity of the masked tokens. Note that, unlike the Transformer decoder model, the encoder model uses information from the entire sequence and is not limited to information from the past. RoBERTa is an optimized version of BERT [37], which has been the basis of many recent language models (e.g., [47][48][49]). For both GPT-2 and RoBERTa, the context window of the language model was set very conservatively (1024) to ensure that all sequences of interest would be considered in their entirety during the classification stage (i.e., no inputs would be shortened due to being too long). Note that, though we focus only on these three language model architectures in this work, our general approach would be compatible with any model that can process sequential data (e.g., GPT-3 [38], temporal convolutional networks [50], quasi-recurrent neural networks [51]).
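For concreteness, the two Transformer configurations described above could be instantiated roughly as follows, assuming the Hugging Face `transformers` library. Only the layer count, context window, and vocabulary size come from the text; all other hyperparameters are left at library defaults and are our assumption, not the authors' exact setup:

```python
# Config sketch only (assumes the Hugging Face `transformers` library).
from transformers import GPT2Config, GPT2LMHeadModel, RobertaConfig, RobertaForMaskedLM

gpt2 = GPT2LMHeadModel(GPT2Config(
    vocab_size=30000,   # BPE subword vocabulary (Section 2.1)
    n_positions=1024,   # conservative context window
    n_layer=6,          # 6 Transformer decoder layers
))

roberta = RobertaForMaskedLM(RobertaConfig(
    vocab_size=30000,
    max_position_embeddings=1024 + 2,  # RoBERTa reserves extra position slots
    num_hidden_layers=6,               # 6 Transformer encoder layers
))
```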

Finetuning
The second stage is to finetune the pretrained classifier on labeled data. The finetuning process consists of five steps, as shown in the middle row of Figure 1.
The first step is to compute bootleg score features on the labeled sheet music images. Because each piece (i.e., PDF) contains multiple pages, we concatenated the bootleg scores from all labeled pages in the PDF into a single, global bootleg score. At the end of this first step, we have a set of M bootleg scores of variable length, where M corresponds to the number of pieces in the labeled dataset.
The second step is to randomly select fixed-length fragments from the bootleg scores. This can be interpreted as a data augmentation strategy analogous to taking random crops from images. This approach has two significant benefits. First, it allows us to train on a much larger set of data. Rather than being limited to the number of actual pages in the labeled data set, we can aggressively sample the data to generate a much larger set of unique training samples. Second, it allows us to construct a balanced dataset. By sampling the same number of fragments from each classification category, we can avoid many of the problems and challenges of imbalanced datasets [52]. In our experiments, we considered fragments of length 64, 128, and 256. Figure 3 shows several example fragments from different composers.
The third step is to augment the labeled fragment data through pitch shifting. Similar to before, we consider multiple shifted versions of each fragment up to ±K shifts. This is an additional data augmentation that increases the number of labeled training samples by a factor of 2K + 1.
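The balanced random-crop sampling described above can be sketched as follows (the pitch-shift augmentation is omitted here; `sample_fragments` is a hypothetical helper name):

```python
import random

def sample_fragments(bootleg_scores, labels, frag_len, n_per_class, seed=0):
    """Randomly crop fixed-length fragments, the same number per composer,
    yielding a balanced labeled dataset (a sketch of the sampling strategy)."""
    rng = random.Random(seed)
    by_class = {}
    for score, y in zip(bootleg_scores, labels):
        by_class.setdefault(y, []).append(score)
    fragments = []
    for y, scores in by_class.items():
        eligible = [s for s in scores if len(s) >= frag_len]
        for _ in range(n_per_class):
            s = rng.choice(eligible)
            start = rng.randrange(len(s) - frag_len + 1)
            fragments.append((s[start:start + frag_len], y))
    return fragments

data = sample_fragments([list(range(300)), list(range(200))], ["Bach", "Chopin"],
                        frag_len=64, n_per_class=10)
assert len(data) == 20 and all(len(f) == 64 for f, _ in data)
```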
The fourth step is to tokenize each fragment into a sequence of words or subwords. The tokenization is done in the same manner as described in Section 2.1. For word-based models, each fragment will produce sequences of the same length. For subword-based models, each fragment will produce sequences of (slightly) different lengths based on the output of the byte pair encoder. At the end of the fourth step, we have a set of labeled data that has been augmented, where each data sample is a sequence of words or subwords.
The fifth step is to finetune the pretrained classifier. We take the pretrained language model, add a classification head on top, and train the classification model on the labeled data using standard cross-entropy loss. Because the language model weights have been pretrained, the only part of the model that needs to be trained from scratch is the classification head. For the AWD-LSTM model, the classification head consists of the following: (a) it performs max pooling along the sequence dimension of the outputs of the last LSTM layer, (b) it performs average pooling along the sequence dimension of the outputs of the last LSTM layer, (c) it concatenates the results of these two pooling operations, and (d) it passes the result to two dense layers with batch normalization and dropout. For the GPT-2 model, the classification head is a simple linear layer applied to the output of the last Transformer layer at the last time step. For the RoBERTa model, the classification head is a single linear layer applied to the output of the last Transformer layer at the first time step, which corresponds to the special beginning-of-sequence token <s>. This is the approach recommended in the original paper [45].
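The concat-pooling input to the AWD-LSTM head can be illustrated in plain Python; the two dense layers, batch normalization, and dropout that follow are omitted, and the function name is ours:

```python
def concat_pool_head(hidden_states):
    """Sketch of the AWD-LSTM classification head input: max pooling and
    average pooling along the sequence dimension, concatenated."""
    seq_len = len(hidden_states)           # list of T hidden-state vectors
    dim = len(hidden_states[0])
    max_pool = [max(h[d] for h in hidden_states) for d in range(dim)]
    avg_pool = [sum(h[d] for h in hidden_states) / seq_len for d in range(dim)]
    return max_pool + avg_pool             # feature vector of length 2 * dim

feats = concat_pool_head([[1.0, -2.0], [3.0, 0.0]])
assert feats == [3.0, 0.0, 2.0, -1.0]
```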
The finetuning is done using the methods described in [33]. First, we use a learning rate range finding test [53] to determine a suitable learning rate. Next, the classifier is trained with the pretrained language model weights frozen, so that only the weights in the classification head are updated. This prevents catastrophic forgetting in the pretrained weights. As training reaches convergence, we unfreeze more and more layers of the model backbone using discriminative learning rates, in which earlier layers use smaller learning rates than later layers. All training is done using 1 cycle training [54], in which the global learning rate is varied cyclically. These methods were found to be effective for finetuning text classifiers in [33].
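The discriminative learning rate schedule described above might look like the following sketch. The decay factor of 2.6 is the ULMFiT heuristic; it is our assumption that a similar value applies here:

```python
def discriminative_lrs(n_layers, base_lr, factor=2.6):
    """Discriminative learning rates: earlier layers use smaller rates.
    Each layer's rate is the next layer's rate divided by `factor`
    (2.6 is the ULMFiT default, assumed here for illustration)."""
    return [base_lr / (factor ** (n_layers - 1 - i)) for i in range(n_layers)]

lrs = discriminative_lrs(3, 1e-3)
assert lrs[-1] == 1e-3 and lrs[0] < lrs[1] < lrs[2]
```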
At the end of this stage, we have a finetuned classifier that can classify fixed-length fragments of bootleg score data.

Inference
The third stage is to classify unseen sheet music images using our finetuned classifier. This inference consists of six steps, as shown at the bottom of Figure 1. The first step is to compute bootleg score features on the sheet music image. This is done in the same manner as described in Section 2.1. The second step is to perform test time augmentation using pitch shifting. Multiple shifted versions of the bootleg score are generated up to ±L shifts, where L is a hyperparameter. Each of these bootleg scores is processed by the model (as described in the remainder of this paragraph), and we average the predictions to generate an ensembled prediction. The third step is to extract fixed-length fragments from the bootleg score. Because the bootleg score for a single page of sheet music will have variable length, we extract a sequence of fixed-length fragments with 50% overlap. Each fragment is processed independently by the model and the results are averaged. Note that the second and third steps are both forms of test time augmentation, which is a widely used technique in computer vision (e.g., taking multiple crops from a test image and averaging the results) [55,56]. The fourth step is to tokenize each fixed-length fragment into a sequence of words or subwords. This is done using the same tokenization process described in Section 2.1. The fifth step is to process each fragment's sequence of words or subwords with the finetuned classifier. The sixth step is to compute the average of the pre-softmax outputs from all fragments across all pitch-shifted versions to produce a single ensembled pre-softmax distribution for the entire page of sheet music.
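The overlapping fragment extraction (step three) and logit averaging (step six) can be sketched as follows (helper names are ours):

```python
def extract_fragments(score, frag_len):
    """Fixed-length fragments with 50% overlap (hop = frag_len // 2)."""
    hop = frag_len // 2
    if len(score) <= frag_len:
        return [score]
    return [score[i:i + frag_len]
            for i in range(0, len(score) - frag_len + 1, hop)]

def ensemble_logits(logits_list):
    """Average pre-softmax outputs across fragments and pitch-shifted versions."""
    n, k = len(logits_list), len(logits_list[0])
    return [sum(l[j] for l in logits_list) / n for j in range(k)]

frags = extract_fragments(list(range(256)), 64)
assert len(frags) == 7 and frags[1][0] == 32   # hop of 32 columns
assert ensemble_logits([[1.0, 3.0], [3.0, 1.0]]) == [2.0, 2.0]
```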

Results
In this section we describe our experimental setup and present our results on the composer style classification task.

Experimental Setup
There are two sets of data that we use in our experiments: an extremely large unlabeled dataset and a smaller, carefully curated labeled dataset.
The unlabeled dataset consists of all solo piano sheet music images in IMSLP. We used the instrument metadata to identify all solo piano pieces. Because popular pieces often have many different sheet music editions, we selected one sheet music version (i.e., PDF) per piece to avoid overrepresentation of a small number of popular pieces. We computed bootleg score features for all sheet music images and discarded any pages containing fewer than a threshold number of bootleg score features. This threshold was selected as a simple heuristic to remove non-music pages such as title pages, front matter, and other filler pages. The resulting set of data contains 29,310 PDFs, 255,539 pages, and 48.5 million bootleg score features. We will refer to this unlabeled dataset as the IMSLP data. We used 90% of the unlabeled data for language model training and 10% for validation.
The labeled dataset is a carefully curated subset of the IMSLP data. We first identified nine composers who had a substantial amount of piano sheet music. (The limit of nine was chosen to avoid extreme imbalance of data among composers.) We constructed an exhaustive list of all solo piano pieces composed by these nine composers, and then selected one sheet music version per piece. We then manually identified filler pages in the resulting set of PDFs to ensure that every labeled image contains actual sheet music. The resulting set of data contains 787 PDFs, 7151 pages, and 1.47 million bootleg score features. Table 1 shows a breakdown of the labeled dataset by composer. The labeled data was split by piece into training, validation, and test sets. In total, there are 4347 training, 1500 validation, and 1304 test images. This dataset will be referred to as the full-page data.
The full-page data was further preprocessed into fragments as described in Section 2.2. In our experiments, we considered three different fragment sizes: 64, 128, and 256. We sampled the same number of fragments from each composer to ensure balanced classes. For a fragment size of 64, we sampled a total of 32,400 training, 10,800 validation, and 10,800 test fragments across all composers; for a fragment size of 128, we sampled 16,200, 5400, and 5400 fragments, respectively; and for a fragment size of 256, we sampled 8100, 2700, and 2700 fragments. This sampling strategy ensures the same "coverage" of the data regardless of the fragment size. These datasets will be referred to as the fragment data.
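A quick arithmetic check confirms the "same coverage" claim: the product of fragment count and fragment length is constant across the three fragment sizes.

```python
# Training fragment counts from the text, keyed by fragment size.
train_counts = {64: 32400, 128: 16200, 256: 8100}

# fragments x fragment_length is the same in every case (2,073,600 columns),
# and each count divides evenly over the nine composers.
coverage = {size: size * n for size, n in train_counts.items()}
assert len(set(coverage.values())) == 1
assert all(n % 9 == 0 for n in train_counts.values())
```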

Fragment Classification Results
We compare the performance of four different models: AWD-LSTM, GPT-2, RoBERTa, and a baseline CNN model. The CNN model follows the approach described by [24] for a composer classification task using **kern scores. Their approach is to feed a piano roll-like representation into two convolutional layers, perform average pooling of the activations along the sequence dimension, and then apply a final linear classification layer. This CNN model can be interpreted as the state-of-the-art approach as of 2019.
For each of the four models, we compare the performance under three different pretraining conditions. The first condition is no pretraining, in which the model is trained from scratch on the labeled fragment data. The second condition is target pretraining, in which a language model is first pretrained on the labeled (full-page) dataset, and then the pretrained classifier is finetuned on the labeled fragment data. The third condition is IMSLP pretraining, which consists of three steps: (1) training a language model on the full IMSLP dataset, (2) finetuning the language model on the labeled dataset, and (3) finetuning the pretrained classifier on the labeled fragment data. The third condition corresponds to the ULMFiT [33] method. Comparing the performance under these three conditions allows us to measure how much pretraining improves the performance of the classifier.
Figure 4 shows the performance of all models on the fragment classification task. While our original goal is to classify full-page images, the performance on the test fragment data is useful because the dataset is balanced and more reliable due to its much larger size (i.e., number of samples). The three groups in the figure correspond to fragment sizes of 64 (left), 128 (middle), and 256 (right). For a given fragment size, the bars indicate the classification accuracy of all models, where different colors correspond to different pretraining conditions. Note that the CNN model only has results for the no pretraining condition, since it is not a language model. The bars indicate the performance of each model with training data augmentation K = 3 and test time augmentation L = 2. These settings were found to be best for the best-performing model (see Section 4.1) and were applied to all models in Figure 4 for fair comparison. The performance with no data augmentation (K = 0, L = 0) is also overlaid as black rectangles for comparison. There are four things to notice about Figure 4.
First, pretraining makes a big difference. For all language model architectures, we see a large and consistent improvement from pretraining. For example, for the RoBERTa model with fragment size 64, the accuracy improves from 50.9% to 72.0% to 81.2% across the three pretraining conditions. Second, the augmentation strategies make a big difference. Regardless of model architecture or pretraining condition, we see a very large improvement in classification performance from augmentation. For example, the AWD-LSTM model with IMSLP pretraining and fragment size 64 improves from 53.0% to 75.9% when incorporating data augmentation. The effect of the training and test-time augmentation will be studied in more depth in Section 4.1. Third, the Transformer-based models outperform the LSTM and CNN models. The best model (GPT-2 with IMSLP pretraining) achieves a classification accuracy of 90.2% for fragment size 256. Fourth, the classification performance improves as fragment size increases. This is to be expected, since having more context information should improve classification performance.
Figure 5 shows the performance of all models on the full-page classification task. This is the original task that we set out to solve. Note that the y-axis is now macro F1, since accuracy is only an appropriate metric when the dataset is approximately balanced. These results should be interpreted with caution, keeping in mind that the test (full-page) dataset is relatively small (1304 images) and also has class imbalance. There are a few things to point out about Figure 5. Most of the trends observed in the fragment classification task hold true for the full-page classification task: pretraining helps significantly, data augmentation helps significantly, and the Transformer-based models perform best. However, one trend is reversed: longer fragment sizes yield worse full-page classification performance.
This suggests a mismatch between the fragment dataset and the fragments encountered in the full-page data. Indeed, when we investigated this issue more closely, we found that many sheet music images had fewer than 128 or 256 bootleg score features on the whole page. This means that the fragment datasets with sizes 128 and 256 are biased towards sheet music containing a very large number of note events on a single page, and are not representative of single pages of sheet music. For this reason, the model trained on fragments of length 64 is most effective for the full-page classification task.

Discussion
In this section we perform four additional analyses to gain deeper insight and intuition into the best-performing model: GPT-2 with IMSLP pretraining and fragment size 64.

Effect of Data Augmentation
The first analysis is to characterize the effect of the training and test time augmentation strategies. Recall that the bootleg scores are shifted by up to ±K shifts during training, and that shifted versions of each query up to ±L shifts are ensembled at test time. Figure 6 shows the fragment classification performance of the GPT-2 model with IMSLP pretraining across a range of K and L values. Figure 7 shows the performance of the same models on the full-page classification task.
There are two notable things to point out in Figures 6 and 7. First, higher values of K lead to significant increases in model performance. For example, the performance of the GPT-2 model with L = 0 increases from 0.67 to 0.88 macro F1 as K increases from 0 to 3. Based on the trends shown in these figures, we would expect even better performance for values of K greater than 3. We only considered values up to 3 due to the extremely long training times. Second, test time augmentation helps significantly but only when used in moderation. For most values of K, the optimal value of L is 2 or 3. As L continues to increase, the performance begins to degrade. Combining both training and test time augmentation, we see extremely large improvements in model performance: the fragment classification accuracy increases from 57.3% (K = 0, L = 0) to 84.2% (K = 3, L = 2), and the full-page classification performance increases from 0.67 macro F1 (K = 0, L = 0) to 0.92 macro F1 (K = 3, L = 2).
The takeaway from our first analysis is clear: training and test time augmentation improve model performance significantly and should always be used.
In Figures 6 and 7, K specifies the amount of training data augmentation and L specifies the amount of test time data augmentation. Within each grouping, the bars correspond to values of L ranging from 0 (leftmost) to 4 (rightmost).

Single Hand Models
The second analysis is to quantify how important the right hand and left hand are in classification. The bootleg score contains 62 distinct staff line positions, of which 34 come from the right hand staff (E3 to C8) and 28 come from the left hand staff (A0 to G4). To study the importance of each hand, we trained two single-hand models: a right hand model in which any notes in the left hand staff are zeroed out, and a left hand model in which any notes in the right hand staff are zeroed out. We trained a BPE for each hand separately, pretrained the GPT-2 language model on the appropriately masked IMSLP data, finetuned the classifier on the masked training fragment data, and then evaluated the performance on the masked test fragment data. Figure 8 compares the results of the right hand and left hand models against the full model containing both hands. We see that, regardless of the pretraining condition, the right hand model outperforms the left hand model by a large margin. This matches our intuition, since the right hand tends to contain the melody and is more distinctive than the left hand accompaniment part. We also see a big gap in performance between the single hand models and the full model containing both hands. This suggests that a lot of the information needed for classification comes from the polyphonic, harmonic component of the music. If our approach were applied to monophonic music, for example, we would expect performance to be worse than the right hand model.
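The hand masking can be sketched as follows. The exact mapping of staff-line indices to hands below is our assumption about the bootleg score layout, chosen only to match the stated counts (28 left-hand and 34 right-hand positions):

```python
# Assumed index layout: lower 28 positions = left hand, upper 34 = right hand.
LEFT_HAND = set(range(0, 28))    # 28 left-hand staff positions (A0 to G4)
RIGHT_HAND = set(range(28, 62))  # 34 right-hand staff positions (E3 to C8)

def mask_hand(score, keep):
    """Zero out all noteheads outside the `keep` set of staff-line positions."""
    return [[bit if pos in keep else 0 for pos, bit in enumerate(col)]
            for col in score]

col = [0] * 62
col[5] = col[40] = 1                       # one note in each hand
right_only = mask_hand([col], RIGHT_HAND)
assert right_only[0][40] == 1 and right_only[0][5] == 0
```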

t-SNE
The third analysis is to visualize a t-SNE plot [57] of selected bootleg score features. Specifically, we extracted the learned embeddings for all bootleg score columns containing a single notehead, and then visualized them in a t-SNE plot. Because byte pair encoding (in the subword-based language models) makes it difficult to examine a single bootleg score column, we focus on the word-based AWD-LSTM model (trained on fragments of length 64) for this analysis. Note that our model does not explicitly encode any musical domain knowledge, so any relationships that we discover are an indication that the model has learned semantically meaningful information from the data.
Figure 9 shows the t-SNE plot for the embeddings of single-notehead bootleg score features. We can see two distinct clusters: one cluster containing noteheads in the right-hand staff (lower left), and another cluster containing noteheads in the left-hand staff (upper right). Within each cluster, we see that noteheads that appear close to one another in the sheet music are close to one another in the t-SNE plot. This results in an approximately linear progression (with many zigzags) from low noteheads to high noteheads. This provides strong evidence that the model is able to learn semantically meaningful relationships between different bootleg score features, even though each word is considered a distinct entity.

Unseen Composer Classification
The fourth analysis is to determine if the models learn a more generalizable notion of style that extends beyond the nine composers in the classification task. To answer this question, we perform the following experiment. First, we randomly sample C composers from the full IMSLP dataset, making sure to exclude the nine composers in the training data. Next, we assemble all sheet music images for solo piano works composed by these C composers. For each page of sheet music in this newly constructed data set, we (a) pass the sheet music image through our full-page classification model, (b) take the penultimate layer activations in the model as a feature representation of the page, (c) calculate the average Euclidean distance to the k = 5 nearest neighbors (each corresponding to a single page of sheet music) for each of the C composers, and (d) rank the C composers according to their KNN distance scores. For step (c), we exclude all other pages from the same piece, so that the similarity to the true matching composer must be computed against other pieces written by the composer. Finally, we repeat the above experiment S = 10 times and report the mean reciprocal rank of all predictions.
Figure 10 shows the results of these experiments. We evaluate all four model architectures with fragment size 64 (with data augmentation and IMSLP pretraining), along with a random guessing model as a reference. The figure compares the results of all five models for values of C ranging from 10 to 200. We can see that all four trained models perform much better than random guessing, and that the GPT-2 model continues to perform the best. This provides evidence that the models are learning a notion of compositional style that generalizes beyond the nine composers in the labeled dataset. The trained models can thus be used as a feature extractor to project a page of sheet music into this compositional style feature space.
Figure 10. Comparison of models on a ranking task involving unseen composers. C composers are randomly selected from IMSLP (excluding the original nine composers in the labeled dataset), each page of piano sheet music for these C composers is considered as a query, and the KNN distance (excluding pages from the same piece) is used to rank the C composers. From left to right, the bars in each group correspond to random guessing, CNN, AWD-LSTM, RoBERTa, and GPT-2 models. Each bar shows the average of 10 such experiments.
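The KNN ranking and reciprocal rank computation described above can be sketched as follows (feature extraction from the model is not shown, same-piece pages are assumed pre-excluded from the gallery, and helper names are ours):

```python
import math

def rank_composers(query_feat, gallery, k=5):
    """Rank candidate composers by average Euclidean distance from the query
    page's feature vector to the k nearest pages of each composer."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    scores = {}
    for composer, pages in gallery.items():
        d = sorted(dist(query_feat, p) for p in pages)
        scores[composer] = sum(d[:k]) / min(k, len(d))
    return sorted(scores, key=scores.get)   # best (smallest distance) first

def reciprocal_rank(ranking, true_composer):
    return 1.0 / (ranking.index(true_composer) + 1)

gallery = {"A": [[0.0, 0.1], [0.1, 0.0]], "B": [[5.0, 5.0], [6.0, 6.0]]}
ranking = rank_composers([0.05, 0.05], gallery)
assert ranking[0] == "A" and reciprocal_rank(ranking, "A") == 1.0
```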

Conclusions
We have proposed an approach to the composer style classification task based on raw sheet music images. Our approach converts the sheet music images to a bootleg score representation, represents the bootleg score features as sequences of words, and then treats the problem as a text classification task. Compared to previous work, our approach is novel in that it utilizes self-supervised pretraining on unlabeled data, which allows effective classifiers to be trained even with limited amounts of labeled data. We perform extensive experiments on a range of modern language model architectures, and we show that pretraining substantially improves classification performance and that Transformer-based architectures perform best. We also introduce two data augmentation strategies and conduct various analyses to gain deeper intuition into the model. Future work includes exploring additional data augmentation and regularization strategies, as well as applying this approach to non-piano sheet music.
Funding: This research was made possible through the Brian Butler '89 HMC Faculty Enhancement Fund. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562. Large-scale computations on IMSLP data were performed with XSEDE Bridges at the Pittsburgh Supercomputing Center through allocation TG-IRI190019.
Data Availability Statement: Code and data for replicating the results in this article can be found at https://github.com/HMC-MIR/ComposerID.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript: