SymCHM — An Unsupervised Approach for Pattern Discovery in Symbolic Music with a Compositional Hierarchical Model

This paper presents a compositional hierarchical model for pattern discovery in symbolic music. The model can be regarded as a deep architecture with a transparent structure. It can learn a set of repeated patterns within individual works or larger corpora in an unsupervised manner, relying on statistics of pattern occurrences, and robustly infer the learned patterns in new, unknown works. A learned model contains representations of patterns on different layers, from the simple short structures on lower layers to the longer and more complex music structures on higher layers. A pattern selection procedure can be used to extract the most frequent patterns from the model. We evaluate the model on the publicly available JKU Patterns Datasets and compare the results to other approaches.


Introduction
In music, hierarchical representations are intuitive when one considers its spectral and temporal structures. In an analytical sense, the Generative Theory of Tonal Music (GTTM) by Lerdahl and Jackendoff [1] offers an approach of explicit hierarchical music modelling in musicology, well known in contemporary music theory. Although GTTM mostly relies on expert rules, the concept of hierarchical structuring seems reasonable, as it derives from the human search for structure in consciously perceived surroundings. There have been several attempts to build a system capable of automatic analysis supported by the GTTM and Schenkerian analysis [2][3][4]. Several other rule-based models were also researched in Music Information Retrieval (MIR) and related fields [5,6]. Furthermore, hierarchical models abound in the analysis of music perception from the point of view of computational biology and neuroscience [7,8].
In parallel to explicit hierarchical representations, a variety of new approaches emerged under a common name of deep learning [9]. Several neural-network-based approaches have been proposed for melody transcription (e.g., [10]), genre classification (e.g., [11]), onset detection (e.g., [12]), drum pattern analysis (e.g., [13]) and chord estimation (e.g., [14]). The idea behind a deep learning algorithm is to construct multiple levels of data abstraction: a hierarchy of features. The high-level representations in the training data are reflected in the hierarchy. However, the encoded knowledge is implicit and is difficult to explain in a transparent (non black-box) way. Therefore, although deep learning enables unsupervised learning of features and achieves good results on a variety of tasks, it is not very appropriate for pattern discovery in music where explicit explanations of input are desired.
The discovery of repeated patterns is a known problem in different domains, including computer vision (e.g., [15]), bioinformatics (e.g., [16]) and music information retrieval (MIR). Although the problem is common, its definition and the pattern discovery algorithms differ significantly across these fields. In music, the importance of repetition has been addressed and discussed by a number of music theorists (e.g., [17]) and, more recently, also by researchers who develop algorithms for semi-automatic music analysis, such as the one described by Marsden [4]. In the MIR field, an initiative for a common definition of different tasks was formalized into the Music Information Retrieval Evaluation eXchange (MIREX), in an attempt to compare different approaches. MIREX is a community-based framework for formal evaluation of algorithms and techniques related to MIR [18]. The MIREX community established several tasks dealing with patterns and structures in music, including structural segmentation, symbolic melodic similarity and pattern matching, and pattern discovery.
The aim of the discovery of repeated themes and sections task is to find repetitions which represent one of the more significant aspects of a music piece [19]. The MIREX task definition states "the algorithms take a piece of music as input, and output a list of patterns repeated within that piece" [20]. The task may seem similar to the well-known pattern matching task [21]. However, while a pattern matching algorithm aims to find the place of a searched pattern within a dataset and usually has a clear quantitative relation between a query and a match, the discovery of repeated patterns finds locations of multiple similar sequences of data in the dataset, without any information about the searched pattern. The definition of a pattern has been troubling researchers since the beginning; while a pattern may come as an intuitive representation with a repetitive substance, patterns in music are more difficult to define and are usually formalized using theoretical rules, specific to the music era and genre. In the discovery of repeated themes and sections task, a pattern is defined as "a set of on-time-pitch pairs that occurs at least twice (i.e., is repeated at least once) in a piece of music. The second, third, etc. occurrences of the pattern will likely be shifted in time and perhaps also transposed, relative to the first occurrence." [20]. As noted by Wang et al. [22], the pattern discovery task differs from the structural segmentation task, where segments cover the whole music piece and represent disjoint sets of events. In the pattern discovery task, patterns may partially overlap or be subsets of another pattern. However, some of the approaches mentioned in this section (e.g., [23,24]) perform pattern discovery by calculating a set of non-overlapping patterns.
A variety of approaches has been proposed for pattern discovery in music in the past years. Conklin and Anagnostopoulou [25] proposed a multiple viewpoint pattern discovery algorithm based on a suffix tree. For a selected viewpoint (a transformation of a musical event into an abstract feature) the algorithm builds a suffix tree of viewpoint sequences (transformed music pieces). After selecting patterns which meet specified frequency and significance thresholds, the leaves of the suffix tree are reported as the longest significant patterns in the corpus. Conklin and Bergeron [24] apply two algorithms, using viewpoints which represent abstract properties of musical notes for statistical modelling of melody [26]. A viewpoint is thus a function that computes values for events in a sequence; a pattern is a sequence of such feature sets, where the latter represent a logical conjunction of multiple viewpoints. The authors present a complete algorithm which can find all 'maximal frequent patterns' and an optimization algorithm using a faster heuristic approach, where the found patterns may not always be the maximal frequent patterns. The maximal frequent pattern represents a pattern whose component feature set cannot be further specialized without the pattern becoming infrequent. Rolland [27] presents the FlExPat (Flexible Extraction of Patterns) algorithm for extracting sequential patterns from sequences of data. The algorithm first identifies equipollent passage pairs and produces a similarity graph, representing the relations between each two passages; patterns are extracted from the similarity graph. The author evaluated the approach on a set of ten Charlie Parker solos from the subset of Owens' corpus [28] and reported a satisfactory pattern extraction of a large number of the annotated patterns.
Cambouropoulos et al. [23] introduced an approach for extraction of patterns from abstract strings of symbols, allowing for a partial overlap of various abstract symbolic classes. They also focused on the time complexity of their solution and addressed the problem of approximate pattern matching. Based on their previous work [29], they presented the PAT algorithm for segmentation based on maximal repeated patterns. Besides discovering the patterns, and subject and counter-subject entries in fugues, Meredith [30] described multiple point-set compression algorithms, including several COSIATEC and COSIATECCompress approaches and Forth's algorithm. The author evaluated these approaches on three music analysis tasks: the classification of folk song melodies into tune families, the discovery of entries of subjects and counter-subjects in fugues, and the discovery of repeated themes and sections in polyphonic works. Meredith [31] also evaluated his SIATECCompressSegment algorithm for the task, which is a greedy compression algorithm based on the previously introduced SIATEC approach [19]. The algorithm evaluates patterns based on the assumption that perceptually interesting patterns correspond to Maximal Translatable Patterns (MTP). The approach produces a compact encoding of a musical piece, defined by a point-set representation, in the form of a set of Translational Equivalence Classes (TEC) of MTPs. The MTP for a particular vector is the set of points which can be translated by that vector to give other points in the point-set representation. The authors observed that the MTPs often correspond to perceptually significant repeated patterns in music. The TEC defines a set of all patterns which are translationally equivalent to a pattern defining the specific TEC. The SIATECCompressSegment approach generates an ordered list of TECs which may overlap (in contrast to other related versions such as COSIATEC).
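The notion of a maximal translatable pattern can be illustrated with a minimal sketch. The function name and the toy point set below are hypothetical, and the code is a naive quadratic grouping of point pairs by their difference vector, not Meredith's implementation of SIATEC or its compression variants.

```python
from collections import defaultdict

def maximal_translatable_patterns(points):
    """Map each translation vector to the points it carries onto other points.

    points: iterable of (onset, pitch) pairs. Only vectors from a point to a
    lexicographically later point are considered (a simplifying assumption).
    """
    pts = sorted(set(points))
    mtps = defaultdict(list)
    for i, (t1, p1) in enumerate(pts):
        for t2, p2 in pts[i + 1:]:
            mtps[(t2 - t1, p2 - p1)].append((t1, p1))
    return dict(mtps)

# Toy example: a three-note motif repeated 4 time units later, 5 semitones higher.
notes = [(0, 60), (1, 62), (2, 64), (4, 65), (5, 67), (6, 69)]
mtps = maximal_translatable_patterns(notes)
motif = mtps[(4, 5)]
```

In this toy input the vector (4, 5) collects the whole first statement of the motif, which is the kind of repetition the MTP assumption is meant to capture.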
Recently, Velarde and Meredith [32] extended a previously introduced approach to melodic segmentation [33] for melodic classification and segmentation, where the symbolic input is first segmented, then compared and hierarchically clustered. Finally, the clusters are ranked, taking into account the cumulative length of all occurrences within each cluster. Based on their results, it can be assumed that the output is additionally filtered by a threshold defining the number of output patterns. Lartillot [34] introduced the PatMinr algorithm [35] which uses an incremental one-pass approach to identify pattern occurrences. To avoid redundancy, the author addresses two issues: closed pattern mining, which retains only the patterns that have more occurrences than their more specific patterns, thus providing more robust patterns, and pattern cyclicity, which removes redundant matches for successive occurrences of a single underlying pattern. The most recent approach submitted to the MIREX task by Ren [36] also employs a closed pattern approach commonly used in data mining. Nieto and Farbood [37] proposed the MotivesExtractor, which obtains a harmonic representation of the audio or symbolic input and extracts patterns based on a produced self-similarity matrix. Using a score-based greedy algorithm [38], the approach extracts repeated segments, allowing the patterns to overlap. Finally, the segments are grouped into clusters and provided in the algorithm's output as patterns.
In contrast to the existing hierarchical and deep approaches, the Compositional Hierarchical Model (CHM) presented in this paper is a transparent deep architecture. The model provides an explicit (transparent) encoding of concepts, learned in an unsupervised manner, thus merging the benefits of explicit and deep hierarchical models in MIR. The CHM is built around the premise that the repetitive nature of patterns can be captured by observing statistics of occurrences of their sub-patterns, thus providing a hierarchy of the analysed symbolic music representation(s) [39]. Similar to other approaches that build a tree of patterns based on their subsumption (e.g., [25]), the CHM first extracts small atomic patterns and builds complex patterns as compositions of these atomic patterns. Its ability to concurrently provide multiple pattern hypotheses on several levels of complexity and their transparent descriptions makes it very suitable for pattern extraction, as patterns may overlap or be mutually included.
The compositional hierarchical model was first introduced by Pesek et al. [40] and was evaluated for several MIR tasks, including automated chord estimation and multiple fundamental frequency estimation [41]. In this paper, we present an adaptation of the model for analysis of Symbolic music (SymCHM) applied to the task of finding repeated patterns and sections. Instead of finding compositions in a frequency-magnitude audio representation, the adjusted model searches for compositions of symbolic events in the time-pitch-onset domain. The model learns a hierarchy of patterns; the transparent nature of the model allows the user to explore and analyse a music piece by observing the hierarchy of pattern occurrences. For the automatic discovery of repeated patterns, the patterns represented in the hierarchy are extracted. We analyse the model output and propose an extension of the model named SymCHMMerge, which refines the extracted patterns.
The contributions of this paper are as follows: a compositional hierarchical model for symbolic music analysis that can learn hierarchical melodic structures in an unsupervised manner is presented. An application of the model to the task of finding repeated patterns and sections is evaluated. An improved approach to extracting and merging patterns from the knowledge encoded in the model (SymCHMMerge) is proposed and analysed.
The paper is structured as follows: we present the SymCHM in Section 2, describe its application and extension to pattern extraction in Section 3 and present its evaluation and error analysis in Section 4. We conclude the paper with an overview of other possible applications of the presented model and outline future work in Section 5.

The Symbolic Compositional Hierarchical Model
The Symbolic Compositional Hierarchical Model (SymCHM) is derived from the CHM [40,41], which in turn was inspired by an approach for object categorization in computer vision, named the learned Hierarchy of Parts (lHoP) [42]. The SymCHM provides a hierarchical representation of a symbolic music piece, from individual notes on the lowest layer, up to complex musical patterns on higher layers. It is based on a hierarchical decomposition of music into atomic blocks, denoted as parts (not to be confused with 'voice' or 'vocal/instrumental part'; this denomination is used to retain consistency with the lHoP). According to their musical complexity, parts are structured across several layers, whereby parts on higher layers form compositions of parts on lower layers. A part can therefore describe a simple individual event as well as a complex composition of events. While events in the original compositional hierarchical model represent spectral audio features (frequencies, pitch partials and pitches), the SymCHM models notes and their compositions into melodic patterns.

Compositional Layers
The SymCHM consists of an input layer $L_0$ and several compositional layers $\{L_1, \ldots, L_N\}$. Each compositional layer $L_n$ contains a set of parts $\{P^n_1, \ldots, P^n_{M_n}\}$, which are formed as compositions of parts from the previous layer $L_{n-1}$. The parts on the layer $L_{n-1}$ may form any number of compositions on the layer $L_n$, which enables their effective reuse and thus learning of compact models, as shown later in this paper. A hierarchy of parts is illustrated in Figure 1.
The SymCHM retains part definitions of the original CHM model. The $i$-th composition on the layer $L_n$, denoted $P^n_i$, is defined as:

$$P^n_i = \left\{ P^{n-1}_{k_0}, \left\{ P^{n-1}_{k_j}, (\mu_j, \sigma_j) \right\}_{j=1}^{K-1} \right\}. \qquad (1)$$

$P^n_i$ is a composition of $K$ parts from the layer $L_{n-1}$, called subparts. The composition is governed by parameters $\mu_1, \ldots, \mu_{K-1}$ and $\sigma_1, \ldots, \sigma_{K-1}$, which model relationships between the subparts. In contrast to most existing hierarchical and deep approaches, the CHM encodes compositions in a relative rather than absolute manner. This is achieved by encoding the relative distance (offset) of each subpart $P^{n-1}_{k_j}$ from the first subpart $P^{n-1}_{k_0}$, called the central part. The offset is encoded as a Gaussian with parameters $\mu_j$ and $\sigma_j$. In SymCHM, offsets are modelled in semitones in the pitch domain (a semitone is the smallest musical interval commonly used in Western tonal music), thus a composition encodes the semitone distance between patterns represented by various subparts. Currently, the standard deviation $\sigma_j$ is set to a small fixed value, which does not allow for deviations from the offset encoded by $\mu_j$. In future work we may relax this condition to potentially achieve similar robustness as in chromatic to morphetic pitch translation [43]. As an example, the part $P^3_2$ in Figure 1 represents a composition of two subparts with offset 2 ($\mu = 2$), meaning its pattern is a concatenation of two sub-patterns spaced two semitones apart. All compositions and their parameters ($\mu$, $\sigma$) are learned in an unsupervised manner as explained in Section 2.2.
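The relative encoding can be illustrated with a minimal sketch of a part as a data structure. The class name `Part`, its fields and the method `relative_pitches` are hypothetical names chosen for illustration; the expansion simply concatenates sub-patterns and ignores overlaps between them, which is a simplifying assumption.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Part:
    layer: int
    subparts: List["Part"] = field(default_factory=list)
    # One (mu_j, sigma_j) pair per non-central subpart, j = 1..K-1.
    offsets: List[Tuple[int, float]] = field(default_factory=list)

    def relative_pitches(self) -> List[int]:
        """Expand the part into pitches relative to its central part."""
        if self.layer == 0:
            return [0]  # the atomic part represents a single note event
        pitches = list(self.subparts[0].relative_pitches())
        for sub, (mu, _sigma) in zip(self.subparts[1:], self.offsets):
            pitches += [mu + p for p in sub.relative_pitches()]
        return pitches

note = Part(layer=0)
unison = Part(1, [note, note], [(0, 0.1)])        # two notes of equal pitch
minor_second = Part(1, [note, note], [(1, 0.1)])  # two notes one semitone apart
motif = Part(2, [unison, minor_second], [(1, 0.1)])
structure = motif.relative_pitches()
```

Because only offsets are stored, the same `motif` object describes the pattern at any transposition; an absolute pitch is attached only when the part activates on concrete input.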
Such relative encoding of knowledge enables the model to learn position-independent concepts, which in turn enables learning of compact models from small datasets, which still generalize well [41]. This is an advantage over most neural network deep approaches, which encode concepts in an absolute manner and therefore need very large datasets to train properly.

Figure 1. The symbolic compositional hierarchical model. The input layer corresponds to a symbolic music representation (a sequence of pitches). Parts on higher layers are compositions of lower-layer parts (depicted as connections between parts; the parameter $\mu$ is given in semitones). The structure of a part is displayed above each part in the figure, represented by a sequence of pitch values relative to the first subpart (e.g., [0,0,1] for the part $P^2_1$). A part may be contained in several compositions, e.g., $P^1_{M_1}$ is a part of compositions $P^2_2$ and $P^2_3$. The entire structure is transparent, thus we can observe the entire sub-tree of the part $P^4_1$. A part activates when (a part of) the pattern it represents is found in the input. As an example, $P^4_1$ activates twice (Inputs A and B); however, there are differences in the found patterns. Pattern A is positioned five semitones higher than B; Pattern B is missing one event (dotted green rectangle); and the pitch of one event (blue rectangle) differs between the two patterns.

Activations: Occurrences of Patterns
An activation of a part corresponds to the presence of the concept it encodes (melodic pattern in SymCHM) in the model input. An activation has three components: location and onset time, which map the relative pattern representation onto a specific MIDI (Musical Instrument Digital Interface technical standard) pitch and a time position within the input sequence of events (thus making it absolute), and magnitude, representing its strength.
A part will activate at a given location if all of its subparts are activated with magnitude greater than zero (this condition is relaxed with hallucination, which we introduce later in this section). A part can concurrently activate at different locations and times, which indicates multiple occurrences of its concept in the input representation. In terms of the repeated pattern discovery task, each activation of a part can be observed as a pattern occurrence: a repetition of the pattern encoded by the observed part.
More formally, the activation $A$ is defined as a triplet $\langle A_L, A_T, A_M \rangle$ of location, time and magnitude. The activation location $A_L$ and the time $A_T$ of the part $P^n_i$ are defined as:

$$A_L(P^n_i) = A_L(P^{n-1}_{k_0}), \qquad A_T(P^n_i) = A_T(P^{n-1}_{k_0}). \qquad (2)$$

The compositions therefore propagate their locations and onset times upwards through the hierarchy. Such propagation can be usefully employed as an indexing mechanism and allows for a top-down analysis of activations.
The activation magnitude represents the strength of the composition's match with the input and is defined as a weighted sum of subpart magnitudes: where the weights $w_j$ are defined by the match between the learned and the observed relative subpart pitch locations and bounded by the difference in their activation times: The motivation behind the use of the tanh function introduced in Equation (3) is retained from neural-network-based architectures: it provides a saturated output with the maximum limited to one. Any other function could be used to calculate the magnitude of the activation, but the hyperbolic tangent function possesses several useful properties: it is a monotonically increasing function with a smooth gradient, and its value approaches one as its argument tends to infinity. Since the activation magnitudes are directly used to calculate activations on a higher layer, the output of the function needs to be normalized.
The parameter $\tau_W$ represents the maximal difference between activation times of two subparts (time distance of two patterns) which still produces an activation. Such a limit must be imposed in order to avoid a combinatorial explosion of possible compositions. If subpart activations fall within this time window, their activation magnitude is calculated according to the match between their observed ($\delta L_j$) and their learned ($\mu_j$, $\sigma_j$) relative pitch distances. A part will activate with maximal magnitude when its subparts activate at pitch distances according to the learned representation encoded by $\mu_j$ and $\sigma_j$. Note that onset times do not directly influence the activation magnitude. Thus, the activation strength of a pattern is not dependent on the temporal distance between its sub-patterns (within $\tau_W$) and remains the same whether they are adjacent or separated by other events, allowing for gaps between sub-patterns.
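The behaviour described above can be sketched as follows. The function names are hypothetical, and three details are assumptions rather than the paper's definitions: the weight is an unnormalised Gaussian of the pitch-offset mismatch, the central part is given weight one, and the weighted sum is averaged over the subparts before the tanh is applied.

```python
import math

def gaussian_weight(delta_l, mu, sigma):
    # Unnormalised Gaussian (assumption): 1.0 when the observed offset equals mu.
    return math.exp(-((delta_l - mu) ** 2) / (2.0 * sigma ** 2))

def compose_activation(central, others, params, tau_w):
    """Combine subpart activations into one activation of a composition.

    central, others: (location, time, magnitude) triplets of subpart activations.
    params: one (mu, sigma) pair per non-central subpart.
    Returns None when a subpart falls outside the time window tau_w.
    """
    loc, time, mag = central
    terms = [mag]  # central part taken with weight 1 (assumption)
    for (l, t, m), (mu, sigma) in zip(others, params):
        if abs(t - time) > tau_w:
            return None  # too far apart in time: no activation is produced
        terms.append(gaussian_weight(l - loc, mu, sigma) * m)
    # tanh keeps the magnitude below one; the averaging is an assumption.
    magnitude = math.tanh(sum(terms) / len(terms))
    # Location and time are taken from the central part.
    return (loc, time, magnitude)

adjacent = compose_activation((60, 0.0, 1.0), [(62, 0.5, 1.0)], [(2, 0.1)], tau_w=2.0)
gapped = compose_activation((60, 0.0, 1.0), [(62, 1.5, 1.0)], [(2, 0.1)], tau_w=2.0)
too_late = compose_activation((60, 0.0, 1.0), [(62, 5.0, 1.0)], [(2, 0.1)], tau_w=2.0)
```

The `adjacent` and `gapped` calls differ only in the onset of the second subpart and yield the same magnitude, reflecting that onset times within the window do not influence activation strength.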

The Input Representation and Input Layer
A symbolic music representation encoding note pitches and onset times represents input to the SymCHM. Any symbolic encoding that includes these values can be used, such as MusicXML, MIDI or text-based representations; the latter two are also available for the MIREX pattern discovery task.
We can thus define the input representation as a set of note onset (e.g., in seconds) and note pitch (e.g., MIDI pitch) tuples $S = \{(N_o, N_p)\}$.
The input layer $L_0$ of the SymCHM models such a symbolic music representation. It consists of a single atomic part $P^0_1$, which activates for all note events as:

$$A(P^0_1) = \langle N_p, N_o, 1 \rangle \quad \text{for each } (N_o, N_p) \in S.$$

Thus, the activation locations $A_L$ are equal to note pitches, the onset times $A_T$ to note onsets, while the magnitude $A_M$ is assumed to be 1 for all events (it can also represent note dynamics, if greater importance is to be put on accented notes).
An example of a learned hierarchy is shown in Figure 1. The part $P^0_1$ is activated for each input note event. The parts on the first layer represent intervals, e.g., $P^1_4$ represents a minor second (offset one semitone) and is activated for all such intervals in the input regardless of gaps, as long as the notes are spaced at most $\tau_W$ apart. $P^4_1$ represents a sequence of note events defined by a series of offsets [0,0,1,2,−7,−12,4,4,5,−3,−12,7] and is activated at MIDI locations 65 and 70.
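A minimal sketch of the input layer and of a first-layer interval part is given below. The function names and the toy note list are hypothetical; the sketch assumes exact offset matching (a very small $\sigma$) and that the second note of an interval starts after the first.

```python
def input_layer(notes):
    """Turn (onset, pitch) tuples into (location, time, magnitude) activations."""
    return [(pitch, onset, 1.0) for onset, pitch in notes]

def interval_activations(layer0, mu, tau_w):
    """Locations and times at which an interval of mu semitones is found."""
    found = []
    for loc0, t0, _ in layer0:
        for loc1, t1, _ in layer0:
            # The second note must follow the first within the time window.
            if 0 < t1 - t0 <= tau_w and loc1 - loc0 == mu:
                found.append((loc0, t0))
    return found

notes = [(0.0, 60), (0.5, 61), (1.0, 64), (1.5, 65), (4.0, 66)]
layer0 = input_layer(notes)
narrow = interval_activations(layer0, mu=1, tau_w=1.0)
wide = interval_activations(layer0, mu=1, tau_w=3.0)
```

Widening the window from 1.0 to 3.0 lets the interval between the notes at 1.5 and 4.0 activate as well, which shows how $\tau_W$ bounds the gaps a part may span.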

Constructing a Hierarchy of Parts
The model is built layer-by-layer with unsupervised learning on a single or multiple musical pieces. In the 'intra-opus' pattern discovery task experiment described in this paper, we build a model for each musical piece separately.
The learning process is an optimization problem, where for each layer a set of all possible part compositions of the layer is searched for a minimal subset of compositions that covers a maximal amount of events in the training set. The learning process is driven by statistics of part activations that capture regularities in the input data. It consists of two main steps: (1) finding a set of all possible compositions, denoted candidate compositions, and (2) selecting compositions that explain a maximal amount of events in the training set.
To construct a new layer $L_n$, a set of new candidate compositions $C$, which will be considered for inclusion in the new layer, is first formed (Step 1). This set of candidate compositions is obtained by inferring the hierarchy with the training data and generating activations of parts layer-by-layer from $L_0$ to $L_{n-1}$, as explained in Section 2.3. The candidate compositions for layer $L_n$ are generated from histograms of co-occurrences of $L_{n-1}$ part activations within the time window $\tau_W$ (see also Equation (4)). Frequent co-occurrences indicate the presence of underlying patterns. New compositions are formed from combinations of $L_{n-1}$ parts where the number of co-occurrences exceeds the learning threshold $\tau_C$. The composition parameter $\mu$ is estimated from the corresponding histogram.
The $L_1$ candidate compositions are thus constructed as a relative structure of two co-occurring $L_0$ part activations, both occurring within the time window $\tau_W$. This procedure is repeated on all consecutive layers, where activations of parts co-occurring within the time window on a previous layer $L_{n-1}$ compose new part candidates on the next layer $L_n$. Since the model allows for partial overlapping of the covered structure (e.g., $P^2_1$ in Figure 1), the structures on the layer $L_2$ represent 3-4 music events. Consequently, the $L_N$ candidate compositions include all combinations of $L_{N-1}$ part pairs representing structures of $2^{N-1}$ to $2^N$ music events.
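The first learning step can be sketched as a histogram over co-occurring activations. The function name and the tuple layout are hypothetical; treating each histogram bin (a pair of parts with one integer pitch offset) as its own candidate with $\mu$ equal to that offset is an assumption about how $\mu$ is read from the histogram.

```python
from collections import Counter

def candidate_compositions(activations, tau_w, tau_c):
    """Count co-occurring pairs of part activations within the time window.

    activations: list of (part_id, location, time) from the previous layer.
    Returns {(first_part, second_part, mu): count} for counts above tau_c.
    """
    histogram = Counter()
    for part_a, loc_a, time_a in activations:
        for part_b, loc_b, time_b in activations:
            if 0 < time_b - time_a <= tau_w:
                histogram[(part_a, part_b, loc_b - loc_a)] += 1
    return {key: n for key, n in histogram.items() if n > tau_c}

# Input-layer activations of a short melody, one note per time unit.
melody = [60, 62, 64, 62, 64, 66]
acts = [("P0", pitch, float(t)) for t, pitch in enumerate(melody)]
candidates = candidate_compositions(acts, tau_w=1.0, tau_c=2)
```

In this toy melody the rising whole tone occurs four times and the falling one only once, so only the former passes the threshold and becomes a candidate for the next layer.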
In the second step, a subset of compositions from $C$ that covers a maximum number of events in the input data is selected. As the problem of selecting a set of compositions from $C$ which optimally cover the input data is NP-complete, a greedy approach, which selects a subset of compositions and leaves a minimal amount of events in the input uncovered, was introduced in [41].
The composition selection uses part coverage as a measure of the part's suitability for selection. The coverage of the part $P^n_i$ can be obtained by projecting its activations to the input layer and observing the covered events. For a single activation of the part $P^n_i$ at the time $T$ and the location $L$, coverage is defined as the union of coverages of its subparts: When the input layer is reached, the coverage is defined by the presence of an event at the given location and time as: Based on coverage, the greedy composition selection approach is defined as follows:
• the coverage of each part from $C$ is calculated as a union of events in the training data covered by all activations of the part,
• parts are iteratively added to the new layer $L_n$ by choosing the part that adds most to the coverage of the entire training set in each iteration. This ensures that only compositions that provide enough coverage of new data with regard to the currently selected set of parts will be added,
• the algorithm stops when the additional coverage falls below the learning threshold $\tau_L$.
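The greedy selection listed above can be sketched as follows, assuming that the coverage of each candidate has already been projected to a set of input events. The function name is hypothetical, and expressing $\tau_L$ as an absolute number of newly covered events (rather than a proportion) is an assumption.

```python
def greedy_select(candidate_coverage, tau_l):
    """Pick candidates that add the most new coverage until the gain drops below tau_l.

    candidate_coverage: dict mapping a candidate name to the set of covered events.
    Returns the selected names (in selection order) and the covered events.
    """
    covered, selected = set(), []
    remaining = dict(candidate_coverage)
    while remaining:
        best = max(remaining, key=lambda name: len(remaining[name] - covered))
        gain = len(remaining[best] - covered)
        if gain < tau_l:
            break  # additional coverage fell below the learning threshold
        selected.append(best)
        covered |= remaining.pop(best)
    return selected, covered

coverage = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {5, 6}, "D": {1, 2}}
selected, covered = greedy_select(coverage, tau_l=2)
```

Candidate B covers three events but is never selected, because once A is chosen it contributes only one new event; this is the sense in which selection depends on the parts already added.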
The learning procedure is repeated for each layer until a desired number of layers is reached. The reader should note that the number of layers governs the maximal length of encoded patterns, as discussed in the evaluation.

Inferring Patterns
A learned model captures the repetitive patterns in the training data, which are relatively encoded and may be observed through an inspection of the model's parts on its various layers. When a trained model is presented with new input data, the learned patterns may be located in the input through the process of inference. Inference calculates part activations on the input data (and thus absolute pattern positions) according to Equations (2) and (3). They are calculated bottom-up layer-by-layer, whereby the input data activates the layer $L_0$. As already mentioned, the activation of a part represents a specific occurrence of the pattern it represents in the input. An activation has three components: location and onset time, which map the relative pattern onto a specific set of pitches within the input sequence of events (thus making it absolute), and magnitude, representing its strength. A part can concurrently activate at different locations, which indicates multiple occurrences of the represented pattern in the input representation.
Inference may be exact or approximate, where in the latter case two additional mechanisms, hallucination and inhibition, enable the model to find patterns with deletions, changes or insertions, thus increasing its predictive power and robustness.

Hallucination
As described in Section 2.1, a part activation is produced only if all subparts activate with magnitude greater than zero at locations which approximately correspond to the structure encoded by the part. This conservative behaviour may be relaxed by hallucination. It enables a part to produce activations even when the structure it represents is incomplete or modified in the input (e.g., missing notes, added notes, changed pitch, changed note order). Hallucination is important, as it enables the model to find variations of patterns represented by individual parts. The missing information is obtained from knowledge acquired during learning and encoded in the model structure. Using hallucination, the model generates activations of parts most fittingly covering the input representation, where notes which are not present, but are encoded in the model, are hallucinated. It is implemented by changing the conditions under which a part may activate. With hallucination, a part may activate even if not all of its subparts are activated, provided that the percentage of events it represents that are covered in the input reaches a hallucination threshold $\tau_H$. Thus, if we set $\tau_H$ to one, the default behaviour is obtained, while lowering its value leads to increased hallucination and tolerance to changes in patterns.
The hallucination threshold $\tau_H$ influences the number of discovered patterns and identified pattern occurrences. When it is lowered, the number of activations increases, as parts may activate on incomplete matches, thus producing activations which would otherwise not be generated. Additionally, if hallucination is used during learning, the number of parts on lower layers will decrease, as parts added to a layer will have higher coverage due to more activations.
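The relaxed activation condition can be sketched as a simple coverage test. The function name and the event sets are hypothetical, and the comparison is taken as inclusive so that $\tau_H = 1$ reproduces exact matching; this is an assumption about the boundary case.

```python
def may_activate(expected_events, input_events, tau_h):
    """Decide whether a part may activate given the share of its events found in the input.

    expected_events: set of (onset, pitch) events the part would cover at this position.
    input_events: set of (onset, pitch) events actually present in the input.
    """
    covered = len(expected_events & input_events)
    return covered / len(expected_events) >= tau_h

expected = {(0, 60), (1, 62), (2, 64), (3, 65)}
observed = {(0, 60), (1, 62), (3, 65), (4, 67)}  # the event (2, 64) is missing

strict = may_activate(expected, observed, tau_h=1.0)
relaxed = may_activate(expected, observed, tau_h=0.75)
```

With three of four expected events present, the part stays silent under exact inference and activates once the threshold is lowered to 0.75, with the missing event hallucinated.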

Inhibition
Inhibition in our model is a hypothesis refinement mechanism, which reduces the number of redundant activations. An activation of a part $P^n_i$ is inhibited (removed) when one or multiple parts $P^n_{j_1}, \ldots, P^n_{j_K}$ cover a large part of the same events in the input, but with stronger magnitude. More formally, the activation of the part $P^n_i$ is inhibited when there exists a set of parts $\{P^n_{j_1}, \ldots, P^n_{j_K}\}$ whose activations cover a large part of the events covered by the activation of $P^n_i$ and have larger magnitudes. Here, $C(A)$ represents activation coverage (Equation (6)), $A_M$ activation magnitude (Equation (3)) and $\tau_I$ controls the strength of inhibition. If $\tau_I$ is set to zero, no inhibition occurs; the larger its value, the more activations are inhibited and the fewer are propagated between model layers. Notably, only activations with magnitude larger than that of the part $P^n_i$ are considered in the inhibition process. Besides reducing the number of activations and output patterns, the inhibition mechanism can also be used for producing alternative explanations of the input. If activations of the strongest pattern which inhibits other competing hypotheses are removed from the model, the next best hypothesis is selected during inference, thus providing an alternative explanation in which different pattern occurrences appear in the model's output.
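One way to realise such a mechanism is sketched below. The function name is hypothetical and the exact condition is an assumption: an activation is removed when the share of its coverage that is not explained by stronger activations falls below $\tau_I$, which is consistent with no inhibition at $\tau_I = 0$ and more inhibition for larger values.

```python
def inhibit(activations, tau_i):
    """Remove activations whose events are mostly covered by stronger activations.

    activations: list of (name, magnitude, coverage_set) on a single layer.
    Returns the names of the activations that survive inhibition.
    """
    kept = []
    for name, magnitude, coverage in activations:
        stronger = set()
        for other, other_magnitude, other_coverage in activations:
            if other != name and other_magnitude > magnitude:
                stronger |= other_coverage
        unexplained = len(coverage - stronger) / len(coverage)
        if unexplained >= tau_i:
            kept.append(name)
    return kept

acts = [
    ("P1", 0.9, {1, 2, 3, 4}),
    ("P2", 0.6, {2, 3, 4, 5}),  # mostly overlaps the stronger P1
    ("P3", 0.5, {7, 8}),        # weak, but covers events nobody else explains
]
no_inhibition = inhibit(acts, tau_i=0.0)
moderate = inhibit(acts, tau_i=0.5)
```

The weakest activation P3 survives because it explains events that no stronger hypothesis covers, while P2 is removed as largely redundant with P1.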

Pattern Selection with SymCHM
The SymCHM model can be trained on a single or multiple symbolic music representations. It learns a hierarchical representation of patterns occurring in the input, where patterns encoded by parts on higher layers are compositions of patterns on lower layers. The inference produces part activations which expose the learned patterns (and their variations) in the input data. Shorter and more trivial patterns naturally occur more frequently, longer patterns less frequently. On the other hand, longer patterns may entirely subsume shorter patterns. Occurrences of melodic patterns in a given piece are discovered by observing activations of the learned model's parts, where each activation of a part is interpreted as an occurrence of the pattern encoded by the part.
To use the model for the discovery of repeated patterns and sections task, we need to select which of the found patterns will be provided in the model's output. In this section, we present two approaches to pattern selection.

Basic Selection
In the basic pattern selection, we output all patterns of sufficient complexity, as encoded by parts starting from the layer $L$ up to the highest layer $N$. First, we select all parts from the layers $\mathcal{L}_L, \ldots, \mathcal{L}_N$. Since parts on higher layers are compositions of parts on lower layers, we exclude all parts which are subparts of a composition on a higher layer to avoid redundancy. The final selection of parts can be formulated as:
$$\{P^n_i \in \mathcal{L}_n : L \le n \le N \;\wedge\; \nexists\, P^m_j \in \mathcal{L}_m,\ m > n : P^n_i \text{ is a subpart of } P^m_j\}.$$
Inference is then performed on a music piece, and the activations of the selected parts represent the found patterns and their locations in the piece. Hallucination and inhibition are applied during inference to balance the production of hypotheses which only partially match the input representation (hallucination) against the number of competing hypotheses produced (inhibition).
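The selection rule can be sketched in a few lines of Python. The dictionary-based hierarchy, the `(layer, index)` part labels and the function name are hypothetical and chosen only for illustration; the sketch keeps every part on the lowest selected layer or above that is not used as a subpart by any composition on a higher layer.

```python
from typing import Dict, List, Set, Tuple

Part = Tuple[int, int]  # (layer n, index i), standing in for P^n_i


def basic_selection(subparts: Dict[Part, List[Part]], lowest_layer: int) -> Set[Part]:
    """Select parts on layers >= lowest_layer that no higher-layer composition uses."""
    used = {
        sub
        for composition, subs in subparts.items()
        for sub in subs
        if composition[0] > sub[0]
    }
    return {part for part in subparts if part[0] >= lowest_layer and part not in used}


# Toy three-layer hierarchy: each part maps to the subparts it is composed of.
hierarchy = {
    (1, 0): [], (1, 1): [], (1, 2): [],
    (2, 0): [(1, 0), (1, 1)],
    (2, 1): [(1, 1), (1, 2)],
    (2, 2): [(1, 0), (1, 2)],
    (3, 0): [(2, 0), (2, 1)],
}
selected = basic_selection(hierarchy, lowest_layer=2)
print(sorted(selected))
```

In the toy hierarchy, the parts `(2, 0)` and `(2, 1)` are dropped because they are subsumed by the composition `(3, 0)`, while `(2, 2)` is retained because no higher-layer part uses it.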

SymCHMMerge: Improved Pattern Selection
An analysis of the basic pattern selection algorithm showed a lack of diversity in the found patterns, as the patterns were often very similar and overlapping. We improved the algorithm by merging redundant patterns and adjusting the learning and inference parameters, and named the resulting model SymCHMMerge.

Merging Redundant Patterns
Since parts in our model are learned in an unsupervised manner, several parts may represent similar and overlapping patterns (e.g., patterns shifted by a few notes). Inhibition reduces redundant activations of such parts; however, it is usually not enforced strongly, as it could overly reduce the number of activations and found patterns. To reduce the number of such overlapping patterns, we merge them into single, longer patterns.
Let $\pi(A(P^n_i))$ represent a pattern occurrence defined by the projection $\pi$ of the activation $A$ of the part $P^n_i$ onto the layer $\mathcal{L}_0$. The set $\Psi^n_i$ contains all such pattern occurrences discovered by the activations of the part:
$$\Psi^n_i = \{\pi(A(P^n_i))\}.$$
Two pattern occurrences $a_i \in \Psi^n_i$ and $a_j \in \Psi^m_j$, produced by the parts $P^n_i$ and $P^m_j$, are taken to be redundant if they overlap significantly. We express this by calculating the Jaccard similarity coefficient and comparing it to a threshold $\tau_R$:
$$\frac{|a_i \cap a_j|}{|a_i \cup a_j|} \ge \tau_R.$$
We aim to merge redundant pattern occurrences of two parts if they frequently produce overlapping patterns. Therefore, we calculate the proportion of redundant pattern occurrences among the occurrences produced by the two parts. If the proportion exceeds a threshold $\tau_M$, all redundant pattern occurrences of the two parts are merged.
For evaluation, the thresholds $\tau_R$ and $\tau_M$ were both set to 0.5, meaning that pattern occurrences produced by two parts had to share at least 50% of events in the input layer and appear together in at least 50% of cases in order to be merged.
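The merging step can be sketched in Python as follows. Several details are assumptions made only for illustration: occurrences are represented as sets of input-layer event indices, the proportion is computed as the share of occurrences of both parts that take part in at least one redundant pair, and each redundant pair is merged by a simple union. The function names are hypothetical.

```python
from typing import FrozenSet, List, Set

Occurrence = FrozenSet[int]  # input-layer events covered by one pattern occurrence


def jaccard(a: Occurrence, b: Occurrence) -> float:
    """Jaccard similarity coefficient of two event sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_if_redundant(occ_i: List[Occurrence], occ_j: List[Occurrence],
                       tau_r: float = 0.5, tau_m: float = 0.5) -> Set[Occurrence]:
    """Merge redundant occurrences of two parts if they overlap often enough."""
    pairs = [(a, b) for a in occ_i for b in occ_j if jaccard(a, b) >= tau_r]
    involved_i = {a for a, _ in pairs}
    involved_j = {b for _, b in pairs}
    total = len(set(occ_i)) + len(set(occ_j))
    # Assumed definition of the proportion of redundant occurrences.
    proportion = (len(involved_i) + len(involved_j)) / total if total else 0.0
    if proportion <= tau_m:
        return set(occ_i) | set(occ_j)  # not redundant enough, keep everything
    merged = {a | b for a, b in pairs}
    untouched = (set(occ_i) - involved_i) | (set(occ_j) - involved_j)
    return merged | untouched


part_i = [frozenset({0, 1, 2, 3}), frozenset({10, 11, 12, 13})]
part_j = [frozenset({1, 2, 3, 4}), frozenset({11, 12, 13, 14}), frozenset({20, 21, 22})]
for occurrence in sorted(merge_if_redundant(part_i, part_j), key=min):
    print(sorted(occurrence))
```

In this example two pairs of occurrences are shifted by a single event, giving a Jaccard coefficient of 0.6, and four of the five occurrences are redundant, so the shifted pairs are merged into longer occurrences while the isolated one is kept as is.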

Increasing Diversity
To address the problem of pattern diversity, we needed to increase the number of patterns found by the model. This was achieved with three simple adjustments. First, we lowered the candidate selection thresholds in the greedy phase of the learning process to add more parts to each layer (evaluation showed that on average 16% more parts were added). Second, more layers were considered when searching for pattern occurrences. Third, hallucination was increased during inference. All these modifications could also be made with the basic pattern selection approach; however, they would result in an even higher number of redundant patterns. With SymCHMMerge, redundant occurrences are merged and thus the diversity of the found patterns increases.

Evaluation
We evaluated the proposed model on the discovery of repeated themes and sections task for symbolic monophonic music pieces. Since we are searching for patterns within a given piece (and not across the entire corpus), the model was built independently for each piece and inferred on the same piece. All model parameters were kept constant during all evaluations and were not tuned to each specific case. The parameter values and their short descriptions are listed in Table 1. The $\tau_W$ parameter limiting the time span of activations was set to $\tau_W = 2^{n+2}$ events on layer $\mathcal{L}_n$. The values of the $\tau_H$ and $\tau_I$ parameters are based on the stable performance achieved in the range around 0.5 (see the Sensitivity to Parameter Values subsection). The $\tau_R$ and $\tau_M$ values were set to the majority thresholds of 50% and were not tuned. The $\tau_L$ parameter value was retained from the original spectral CHM, where it was evaluated empirically. Table 2 shows the performance of SymCHM on the MIREX 2015 discovery of repeated themes and sections task. To compare SymCHM to SymCHMMerge, Table 2 also includes the results of their evaluation on the publicly available JKU Patterns Development Dataset (PDD) [44]. Detailed results of SymCHMMerge on this dataset are shown in Table 3.

Evaluation Metrics
We used the evaluation metrics of the MIREX discovery of repeated themes and sections task. This subsection provides a short description and formalization of the definitions found in the MIREX task definition [20]. The establishment measures (precision $P_{est}$, recall $R_{est}$ and F score $F_{1est}$) evaluate the algorithm's ability to find at least one occurrence of each pattern, shifted in time and pitch. The two occurrence measures $F_{1occ}$ evaluate the extent of the model's ability to find all pattern occurrences, where the factor $c \in \{0.5, 0.75\}$ represents the inexactness tolerance threshold. Meredith [30] proposed an additional three-layer metric ($P_3$, $R_3$, $TLF_1$) that provides a balance between the establishment and the occurrence measures. The exact precision, recall and F score measures ($P$, $R$, $F_1$) show the algorithm's performance in matching the found patterns with the reference annotations in an exact manner.
The metrics are formally defined using the following set of symbols:
• $n_P$: the number of patterns in a ground truth
• $\Pi = \{P_1, P_2, \ldots, P_{n_P}\}$: a set of ground truth patterns
• $P = \{P_1, P_2, \ldots, P_{m_P}\}$: occurrences of pattern $P$
• $n_Q$: the number of patterns in the algorithm's output
• $\Xi = \{Q_1, Q_2, \ldots, Q_{n_Q}\}$: a set of the algorithm's output patterns
• $Q = \{Q_1, Q_2, \ldots, Q_{m_Q}\}$: occurrences of pattern $Q$
• $k$: the number of ground truth patterns identified by the algorithm
The standard precision is defined as $P = k/n_Q$, the recall as $R = k/n_P$, and the $F_1$ score as $F_1 = 2PR/(P + R)$. Due to the extreme difficulty of discovering strictly exact patterns, more robust versions of the metrics are provided: the occurrence and the establishment scores. First, the cardinality score is used to determine the music similarity between the annotated and the discovered patterns:
$$s_c(P_i, Q_j) = \frac{|P_i \cap Q_j|}{\max\{|P_i|, |Q_j|\}}.$$
A score matrix is calculated based on the similarity as follows:
$$s(P, Q) = \big[\, s_c(P_i, Q_j) \,\big], \quad i = 1, \ldots, m_P, \; j = 1, \ldots, m_Q.$$
Based on the score matrix, the establishment matrix is calculated from the set of annotated patterns $\Pi$ and the set of the algorithm's output patterns $\Xi$:
$$S(\Pi, \Xi) = \big[\, \max s(P_i, Q_j) \,\big], \quad i = 1, \ldots, n_P, \; j = 1, \ldots, n_Q,$$
where each entry is the largest element of the score matrix of the corresponding pair of patterns. The establishment precision is thus defined as:
$$P_{est} = \frac{1}{n_Q} \sum_{j=1}^{n_Q} \max_{i} S(i, j).$$
The establishment recall is defined as:
$$R_{est} = \frac{1}{n_P} \sum_{i=1}^{n_P} \max_{j} S(i, j).$$
Additionally, the establishment $F_1$ score is calculated as:
$$F_{1est} = \frac{2 P_{est} R_{est}}{P_{est} + R_{est}}.$$
The establishment metrics reward a single match between the annotated and the algorithm's patterns. To counterbalance this bias, the occurrence metrics are used. The occurrence metrics reward the algorithm's ability to find all occurrences of a single pattern. To loosen the exactness requirement, the found patterns may be inexact. This inexactness is implemented using a threshold $c$ (the default values are 0.5 and 0.75). The indices $I$ of the establishment matrix with values greater than or equal to this threshold $c$ are considered discovered. The occurrence matrix $O(\Pi, \Xi)$ is calculated starting with an empty $n_P \times n_Q$ matrix and the establishment indices $I$: the entries at the indices in $I$ are filled from the score matrices of the corresponding pattern pairs, and all other entries remain zero. The occurrence precision score is consequently calculated using the occurrence matrix as follows:
$$P_{occ} = \frac{1}{n_{col}} \sum_{j=1}^{n_Q} \max_{i} O(i, j),$$
where $n_{col}$ represents the number of non-zero columns in the occurrence matrix $O$. The occurrence recall score is analogously calculated as:
$$R_{occ} = \frac{1}{n_{row}} \sum_{i=1}^{n_P} \max_{j} O(i, j),$$
where $n_{row}$ represents the number of non-zero rows in the occurrence matrix $O$.
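The establishment measures can be illustrated with a short Python sketch. It assumes the cardinality score $|P_i \cap Q_j| / \max\{|P_i|, |Q_j|\}$ from the MIREX task definition, represents each occurrence as a set of hypothetical (onset, pitch) pairs, and requires non-empty pattern lists and occurrences; the function names are illustrative and are not taken from the MIREX evaluation code.

```python
from typing import FrozenSet, List, Tuple

Note = Tuple[float, int]      # (onset, pitch), an assumed event representation
Occurrence = FrozenSet[Note]
Pattern = List[Occurrence]    # all occurrences of one pattern


def cardinality_score(p: Occurrence, q: Occurrence) -> float:
    """Share of common events, normalised by the larger occurrence."""
    return len(p & q) / max(len(p), len(q))


def establishment_matrix(truth: List[Pattern], found: List[Pattern]) -> List[List[float]]:
    """Entry (i, j) is the best score over all occurrence pairs of P_i and Q_j."""
    return [[max(cardinality_score(p, q) for p in P for q in Q) for Q in found]
            for P in truth]


def establishment_scores(truth: List[Pattern], found: List[Pattern]) -> Tuple[float, float, float]:
    S = establishment_matrix(truth, found)
    # Precision averages the column maxima, recall averages the row maxima.
    p_est = sum(max(row[j] for row in S) for j in range(len(found))) / len(found)
    r_est = sum(max(row) for row in S) / len(truth)
    f1_est = 2 * p_est * r_est / (p_est + r_est) if p_est + r_est else 0.0
    return p_est, r_est, f1_est


truth = [[frozenset({(0.0, 60), (1.0, 62), (2.0, 64)}),
          frozenset({(8.0, 60), (9.0, 62), (10.0, 64)})]]
found = [[frozenset({(0.0, 60), (1.0, 62), (2.0, 64), (3.0, 65)})],
         [frozenset({(20.0, 70), (21.0, 72)})]]
print(establishment_scores(truth, found))
```

Here the first discovered pattern contains one extra event and scores 0.75 against the annotation, whereas the second discovered pattern matches nothing, which halves the establishment precision but leaves the recall at 0.75.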

Performance
The SymCHM with the basic pattern selection algorithm was submitted to the MIREX 2015 discovery of repeated themes and sections task. The results are shown in Table 2. The submitted model learned a six-layer hierarchy, where activations of parts on Layers 4-6 were output as the found pattern occurrences.
In the MIREX 2015 evaluation [20], the two state-of-the-art approaches by Velarde and Meredith (VM2) [32] and Lartillot (OL1) [34] achieved better overall results. However, the SymCHM outperformed other algorithms on the first piece in the MIREX evaluation dataset and achieved better results than VM2 in pattern occurrence measures, which indicated the model's ability to robustly identify the occurrences of the identified patterns. Compared to other approaches proposed in previous MIREX evaluations, such as NF1'14 [37] and DM1'13 [45], SymCHM found more pattern occurrences, as well as a higher number of exact matches. The SymCHM also achieved a higher $TLF_1$ score than the NF1'14 submission. To increase diversity and decrease redundancy, we introduced the SymCHMMerge with an improved pattern selection algorithm. Activations of parts on Layers 2-6 were considered for finding pattern occurrences, where each layer included 16% more parts on average due to the more relaxed learning conditions.
A comparison between both models on the JKU PDD dataset showed that the SymCHMMerge achieved significantly better results (Friedman's test: $\chi^2 = 7.2$, $p < 0.01$). It mostly improved in the establishment measures, which indicated an improvement in the algorithm's ability to discover at least one occurrence of a pattern, tolerating time shift and transposition [20]. On the other hand, the occurrence measures $F_{1occ(c=0.75)}$ and $F_{1occ(c=0.5)}$, which evaluate the algorithm's ability to find all occurrences of the established patterns, dropped by 5%. We attribute this drop to the higher number of established patterns for which the occurrence measure is calculated. Finally, the absolute precision, recall and F scores significantly increased due to the SymCHMMerge's pattern merging procedure and increased pattern diversity.

Sensitivity to Parameter Values
To assess the sensitivity of SymCHMMerge to changes of model parameters, we analysed its performance by varying the inhibition and hallucination parameters $\tau_I$ and $\tau_H$, which affect inference.
We observed the behaviour of the occurrence and establishment measures in order to estimate the balance between the two. Due to the large number of possible parameter combinations, we evaluated how changes in one parameter (set for all layers) affect performance when all other parameters are fixed.

Inhibition
The top part of Figure 2 shows how changes in the inhibition parameter $\tau_I$ affect the results. An increase of $\tau_I$ increases inhibition and removes activations which are only partially covered by others, while a decrease allows more overlapping activations to propagate to higher layers. The plots show that reduced inhibition has a positive effect on occurrence recall, which is expected, as more activations are produced. More interestingly, it also positively affects the precision of the found occurrences, which might be explained by the fact that overlapping activations are successfully merged by the merging algorithm of SymCHMMerge. For the establishment metrics, the effect of changes in inhibition is less obvious, and apart from extreme values, performance is stable.

Hallucination
The bottom part of Figure 2 shows how changes in the hallucination parameter $\tau_H$ affect performance. As described in Section 2.3.1, larger $\tau_H$ values decrease hallucination and thus the number of activations. Decreased hallucination affects both the occurrence and the establishment of patterns, as there is little tolerance for pattern variations. With more hallucination, both measures increase and then remain stable; again, precision is not affected significantly, as the merging algorithm of SymCHMMerge reduces the growing number of activations on higher layers.

Error Analysis
To increase our understanding of the model's performance, we performed an analysis of its most common types of errors.

Incomplete Matches
We observed that the occurrence metrics increase when we allow for partially incomplete patterns to be discovered (hallucination); however, the exact $F_1$ scores do not always increase. After observing the pattern occurrences which do not contribute to the rise in the $F_1$ score, we discovered that these patterns do not completely match the reference annotations, as shown in Figure 3.
The difference between a reference annotation and a pattern proposed by the model usually presents itself at the edges of an occurrence, where the model assumes that one or more preceding or succeeding events belong to the pattern. These events frequently occur at the same locations (relative to the pattern), with similar time and pitch offsets. Thus, the model adds these events to the pattern occurrence, causing a mismatch with the reference annotation. Such errors could be resolved by incorporating theoretical rules governing the beginnings and endings of patterns, e.g., the gap rule ([46], p. 68), into the pattern selection algorithm.
Section patterns, such as in Mozart's Piano Sonata in E flat major, K. 282, 2nd movement, remain unidentified. These section patterns represent large segments of music (50-137 events). The six layers in our model have the potential of encoding patterns of up to 64 events. While some of the reference patterns could be identified, the model did not contain a sufficient number of layers to cover the largest patterns. We consequently focused on observing the absence of the shorter section patterns (between 50 and 64 events). While incomplete (often overlapping) matches of these patterns were found on the $\mathcal{L}_5$ and $\mathcal{L}_6$ layers (sub-patterns), there were no complete matches between the found patterns and the reference annotations. Furthermore, the overlap was not high enough for these sub-patterns to be merged during pattern merging.
The second subgroup, the short patterns, also frequently occurs in the evaluation datasets. These patterns are 4-5 events long. They are identified by the model on the layers $\mathcal{L}_2$ and $\mathcal{L}_3$, and also form compositions on higher layers. If such larger compositions are present, the pattern selection procedure excludes the short patterns from the final output.
The discovery of larger patterns could be improved by building additional compositional layers while learning the model, and by adjusting the merging rules for long patterns. To find more short patterns, we could add criteria that would counterbalance the promotion of longer patterns during pattern selection. For example, the event duration could be used when considering the importance of short events.

Drawbacks of the Evaluation
To establish the effectiveness of the proposed model in the symbolic domain, we evaluated the model on the pattern discovery task, where the comparison between the SymCHM and other approaches is based on the JKU PDD and JKU PTD datasets. To respect MIREX's position as an evaluation exchange rather than a benchmarking framework, we focused our evaluation on the two variants of the compositional model we developed, the SymCHM and SymCHMMerge, as shown in Table 2.
As thoroughly discussed by Meredith [30], this MIREX task has many drawbacks and thus might not be the optimal tool for algorithm comparison. However, it is rather difficult to create an experiment which would provide a clearer evaluation of an algorithm's performance. First, the definition of a pattern is vague; the JKU datasets gather annotations from several sources. Some of the patterns in the ground truth represent themes, while others represent entire sections. Without any prior knowledge about the goal (the length of a pattern, perhaps a ratio between the length and the variation within the pattern occurrences), the metrics naturally lean towards rewarding the approach which finds the most occurrences of the discovered pattern. It seems impossible to design an algorithm capable of finding a "pattern" when the definition of a pattern varies among the annotators. The three-layer F score proposed by Meredith is a step towards a metric which balances the establishment and the occurrence metrics. Second, the size of the dataset presents a limitation: the combined JKU PDD and JKU PTD datasets contain ten (classical) musical pieces in total. It is thus difficult to claim that the datasets provide a representative sample of any kind of music or genre. However, we acknowledge the incredible effort put into the creation of the datasets and the tasks; we believe the size of the datasets is affected by the effort needed. Nevertheless, we believe the MIREX discovery of repeated themes and sections task is the best currently available approximation of a performance evaluation for pattern discovery in music.

Conclusions
In this paper, we presented the compositional hierarchical model for pattern discovery in symbolic music representations. The model calculates a hierarchical representation of melodic patterns in a music corpus with a statistically-based learning algorithm. It can be viewed as a transparent deep architecture, combining the ability of unsupervised learning of multi-layer hierarchies with a transparent structure that enables insight into the learned concepts. The inference process with hallucination and inhibition mechanisms enables the search for pattern variations.
We evaluated the model in the MIREX evaluation campaign and its improved pattern selection algorithm on the JKU PDD dataset, where we showed that we can obtain favourable results with the improved version of the model. We showed that the model can be used for finding patterns in symbolic music and that it can learn to extract patterns in an unsupervised manner without hard-coding the rules of music theory. We have also demonstrated the transfer of the model from classification tasks based on audio representations to pattern extraction in the symbolic domain. The results obtained by the model are not on par with the two best-performing algorithms. Nevertheless, the proposed model performs better than several other proposed approaches. As discussed in Section 4.5, this evaluation contains many potential drawbacks, but it is currently the best approximation for pattern discovery evaluation. The definition of a 'pattern' itself is elusive and admits many different explanations, varying from the strictly music-theoretical to mathematical formalization. The human perception of patterns in music is too difficult to explain and incorporate in a single formalized task. However, with the proposed model, we have demonstrated that a deep transparent architecture can tackle pattern discovery by employing unsupervised learning and may thus better approximate how listeners recognize patterns than rule-based systems. Due to its transparency, the model is not only applicable to tasks where a single output is provided, but can also be used for exploration and pattern discovery by an expert. The model produces multiple hypotheses on several layers, which can be used as reference points in a deeper semi-automatic music analysis. We believe this further strengthens the model's usefulness to the wider MIR community.
In our future work, we will focus on improving the model. We plan to include event duration in pattern selection and merging, and to adapt the model for polyphonic pattern discovery. We could also introduce pattern ranking, similar to [32], and add music theory rules, as discussed in Section 4.4. The model's output could further be optimized by supervised training of model parameters, especially the number of layers in the hierarchy and the layers in the model's output. However, a sufficiently large annotated dataset is needed for such an optimization, significantly larger than the datasets currently used to evaluate the pattern discovery task.
The proposed approach can also be applied to identify similar and inexact patterns across larger corpora. We plan to evaluate the model in an inter-opus pattern discovery task, aiding the current research in tune family identification and folk music analysis. To tackle classification tasks, the model can be used as a feature generator; thus, its output can be employed as an input to tune family analysis, similarity comparison or composer identification.

Figure 1. The symbolic compositional hierarchical model. The input layer corresponds to a symbolic music representation (a sequence of pitches). Parts on higher layers are compositions of lower-layer parts (depicted as connections between parts; the parameter $\mu$ is given in semitones). The structure of a part is displayed above each part in the figure, represented by a sequence of pitch values relative to the first subpart (e.g., [0,0,1] for the part $P^2_1$). A part may be contained in several compositions, e.g., $P^1_{M_1}$ is a part of the compositions $P^2_2$ and $P^2_3$. The entire structure is transparent, thus we can observe the entire sub-tree of the part $P^4_1$. A part activates when (a part of) the pattern it represents is found in the input. As an example, $P^4_1$ activates twice (Inputs A and B); however, there are differences in the found patterns. Pattern A is positioned five semitones higher than B; Pattern B is missing one event (dotted green rectangle); and the pitch of one event (blue rectangle) differs between the two patterns.
| Parameter | Description | Value |
|---|---|---|
| $\tau_H$ | Hallucination parameter retaining the activation of a part in an incomplete presence of the events in the input signal | 0.5 |
| $\tau_I$ | Inhibition parameter reducing the number of competing activations | 0.4 |
| $\tau_R$ | Redundancy parameter determining the necessary amount of overlap between pattern occurrences in order for the occurrences to be merged | 0.5 |
| $\tau_M$ | Merging parameter determining the amount of redundant pattern occurrences needed for two patterns to be merged into one | 0.5 |
| $\tau_L$ | Learning threshold for added coverage which needs to be exceeded in order for a candidate composition to be retained while learning the model | 0.005 |
| $\tau_W$ | Window limiting the time span of activations, defined per layer $\mathcal{L}_n$ | $2^{n+2}$ |

Figure 3. An incomplete pattern match of two pattern occurrences in Bach's BWV 889 Fugue in A minor (from the JKU PDD dataset). Two pattern occurrences are presented in the figure (top and bottom). A piano roll representation is shown, where the reference annotation is coloured in grey and the identified pattern occurrences are outlined with red borders. Although the occurrences are similar, the events on the right side (shown in light blue) are not part of the reference annotations; however, they are included in the model's patterns due to their co-occurrence with other events.

Table 1. Model's parameter settings for the experiment.

Table 2. Evaluation of SymCHM, SymCHMMerge and Music Information Retrieval Evaluation eXchange (MIREX) results of other proposed approaches for the discovery of repeated themes and sections task on the JKU Patterns Development Dataset (PDD) and JKU Patterns Testing Dataset (PTD), denoted as MIREX 2015.

Table 3. A detailed list of JKU Patterns Development Dataset results for the SymCHMMerge approach. The $n_P$ and $n_Q$ columns represent the number of annotated patterns and the number of discovered patterns, respectively. Song names are shortened, using a four-letter abbreviation of the composer's name.