# Stay True to the Sound of History: Philology, Phylogenetics and Information Engineering in Musicology


## 1. Introduction

- constructing the stemma codicum (recension, or the Latin recensio), starting from the set of sources (all the different witnesses of that musical work);
- selection (or selectio), where the original source is determined by examining the variants and selecting the best ones [9].

## 2. Related Works

## 3. Problem Description

## 4. Algorithms

#### 4.1. Audio Pre-Processing

#### 4.2. Leader Tape Detection

- From two sets of keypoints $({\mathcal{K}}_{i},{\mathcal{K}}_{j})$, find a subset of matched pairs by comparing the related descriptors. Given the matched pairs $\left(({u}_{k},{v}_{k}),({u}_{h}^{\prime},{v}_{h}^{\prime})\right)$, estimate the optimum geometric transform mapping ${P}_{i}$ onto ${P}_{j}$ with the RANSAC algorithm [36]. If a leader-tape is present, the set of inlier points returned by the algorithm will converge to a subset of keypoints belonging to only one of the two portions of the spectrogram separated by the leader (Figure 3).
- Define a function ${g}_{i}\left(v\right)$ counting the number of keypoints detected in ${P}_{i}(u,v)$ for each image column v (to avoid strong oscillations, ${g}_{i}\left(v\right)$ is processed with a moving-average low-pass filter). Then, define ${g}_{i}^{\prime}\left(v\right)$ as the number of inlier points left on ${P}_{i}(u,v)$ after the RANSAC algorithm. In the presence of a leader insertion, the distance $|{g}_{i}\left(v\right)-{g}_{i}^{\prime}\left(v\right)|$ shows an evident step that can be detected by looking for gradient peaks.
- Let ${v}_{l}$ be the coordinate associated with the detected step. Define the following sets:$$\begin{array}{c}{\mathcal{K}}_{i}^{\left(L\right)}=\left\{({u}_{k},{v}_{k})\in {\mathcal{K}}_{i}|{v}_{k}<{v}_{l}\right\},\\ {\mathcal{K}}_{i}^{\left(R\right)}=\left\{({u}_{k},{v}_{k})\in {\mathcal{K}}_{i}|{v}_{k}>{v}_{l}\right\},\end{array}$$
- Perform a new geometric transform estimation, on the left and right portion of the images separately, according to the subdivision defined in (2). The estimated models come in the form of $3\times 3$ homography matrices, ${H}^{\left(L\right)}$ and ${H}^{\left(R\right)}$, from which it is possible to extract the translation components along the v direction, ${t}^{\left(L\right)}$ and ${t}^{\left(R\right)}$. The length of the candidate leader is then given by:$${w}_{l}=|{t}^{\left(L\right)}-{t}^{\left(R\right)}|.$$
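
The step search on $|{g}_{i}\left(v\right)-{g}_{i}^{\prime}\left(v\right)|$ can be illustrated with a minimal sketch. It assumes the keypoint and inlier column coordinates are already available (keypoint detection and RANSAC are not reproduced here), smooths both counts with the same window, and simply takes the strongest gradient as the step; the function names, the window size, and the synthetic data are illustrative assumptions, not the authors' implementation.

```python
def column_counts(columns, width):
    """g(v): number of keypoints falling in each image column v."""
    counts = [0] * width
    for v in columns:
        counts[v] += 1
    return counts


def moving_average(values, window=5):
    """Moving-average low-pass filter; windows are truncated at the borders."""
    half = window // 2
    smoothed = []
    for k in range(len(values)):
        lo, hi = max(0, k - half), min(len(values), k + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def detect_step(keypoint_cols, inlier_cols, width, window=5):
    """Column with the largest gradient of |g(v) - g'(v)|.

    Assumption: the single strongest gradient is taken as the step; a real
    detector would also need a threshold to decide whether a leader exists.
    """
    g_all = moving_average(column_counts(keypoint_cols, width), window)
    g_in = moving_average(column_counts(inlier_cols, width), window)
    distance = [abs(a - b) for a, b in zip(g_all, g_in)]
    gradient = [abs(distance[k + 1] - distance[k]) for k in range(width - 1)]
    return max(range(width - 1), key=lambda k: gradient[k])


def leader_width(t_left, t_right):
    """w_l = |t^(L) - t^(R)|, from the two horizontal translations."""
    return abs(t_left - t_right)


# Synthetic example: two keypoints per column, inliers only for v >= 60.
keypoints = [v for v in range(100) for _ in range(2)]
inliers = [v for v in keypoints if v >= 60]
v_l = detect_step(keypoints, inliers, width=100)
```

Because of the smoothing, the detected column falls within a few columns of the true boundary rather than exactly on it.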

#### 4.3. Spectrogram Registration

- If a leader-tape has been detected in ${P}_{j}$, compensate for it in ${P}_{i}$ by inserting a band of black pixels centered at ${v}_{l}$ with width ${w}_{l}$.
- Estimate the global geometric transform H by running RANSAC on all keypoints.
- Warp ${P}_{i}$ towards ${P}_{j}$ according to H, obtaining ${P}_{i}^{\prime}$.
- Compute the dissimilarity value ${d}_{i,j}$ as the MSE of ${P}_{i}^{\prime}$ and ${P}_{j}$:$${d}_{i,j}=\frac{1}{U\cdot V}\sum _{u,v}{|{P}_{j}(u,v)-{P}_{i}^{\prime}(u,v)|}^{2},$$ where $U$ and $V$ are the numbers of rows and columns of the spectrogram images.
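
As a rough illustration of the first and last steps, the sketch below inserts a black band into a spectrogram stored as a list of rows and computes the MSE between two equally sized images. The warping step is assumed to have been done elsewhere; the helper names and the convention that the band starts at ${v}_{l}-{w}_{l}/2$ are assumptions made for this example.

```python
def insert_black_band(image, v_l, w_l):
    """Insert w_l black (zero) columns so that the band is centered at v_l."""
    start = max(0, v_l - w_l // 2)  # assumed centering convention
    return [row[:start] + [0.0] * w_l + row[start:] for row in image]


def mse(p_warped, p_ref):
    """d_ij = 1/(U*V) * sum over (u, v) of |P_j(u, v) - P_i'(u, v)|^2."""
    rows, cols = len(p_ref), len(p_ref[0])
    if len(p_warped) != rows or any(len(r) != cols for r in p_warped):
        raise ValueError("images must have the same size")
    total = sum(
        (a - b) ** 2
        for row_a, row_b in zip(p_warped, p_ref)
        for a, b in zip(row_a, row_b)
    )
    return total / (rows * cols)


# Toy 2x2 images differing in a single pixel by 2.
d_ij = mse([[1.0, 2.0], [3.0, 4.0]], [[1.0, 2.0], [3.0, 6.0]])
banded = insert_black_band([[1.0, 2.0, 3.0, 4.0]], v_l=2, w_l=2)
```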

#### 4.4. Overdub Detection

- Compute the residual spectrogram as the pixel-wise absolute difference of ${P}_{i}^{\prime}$ and ${P}_{j}$ (Figure 4a).$${P}_{r}(u,v)=|{P}_{i}^{\prime}(u,v)-{P}_{j}(u,v)|$$
- Define the function $e\left(v\right)$ representing the energy content of the residual spectrogram over time.$$e\left(v\right)=\sum _{u}{P}_{r}(u,v),\phantom{\rule{2.em}{0ex}}v=1,\dots ,V$$
- Look for strong variations in the residual energy by computing the first derivative ${e}^{\prime}\left(v\right)$ and applying an outlier detector (three scaled MAD from the median, where MAD denotes the median absolute deviation), obtaining a set of points $\mathcal{O}=\left\{{v}_{k}\right\}$ (Figure 4b).
- Process the points ${v}_{k}\in \mathcal{O}$ to obtain the interval $[{v}_{1},{v}_{2}]$ corresponding to the candidate overdub. The criterion is to select the pair of points that maximizes the ratio between the average energy inside and outside those points:$$({v}_{1},{v}_{2})=\underset{({v}_{a},{v}_{b})\in {\mathcal{O}}^{2}}{\arg\max}\frac{\mathbb{E}{\left[e\left(v\right)\right]}_{{v}_{a}<v<{v}_{b}}}{\mathbb{E}{\left[e\left(v\right)\right]}_{v<{v}_{a}\vee v>{v}_{b}}}$$

Given a detected overdub spanning from ${v}_{1}$ to ${v}_{2}$, the algorithm tries to infer the phylogenetic relation between the two tracks. Again, we compare energy statistics inside and outside the overdub region, but in this case, we consider ${P}_{i}^{\prime}$ and ${P}_{j}$ instead of ${P}_{r}$.

- Scan through the spectrogram rows $u=1,\dots ,U$. For each u, compute:$$\begin{array}{c}{c}_{i}\left(u\right)=\left|\mathbb{E}{\left[{P}_{i}(u,v)\right]}_{{v}_{1}<v<{v}_{2}}-\mathbb{E}{\left[{P}_{i}(u,v)\right]}_{v<{v}_{1}\vee v>{v}_{2}}\right|\\ {c}_{j}\left(u\right)=\left|\mathbb{E}{\left[{P}_{j}(u,v)\right]}_{{v}_{1}<v<{v}_{2}}-\mathbb{E}{\left[{P}_{j}(u,v)\right]}_{v<{v}_{1}\vee v>{v}_{2}}\right|\end{array}$$
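
The overdub search can be sketched as follows. This is a simplified illustration: spectrograms are plain lists of rows, the scaled MAD uses the usual 1.4826 consistency factor (an assumption about what "scaled" means here), pairs are restricted to ${v}_{a}<{v}_{b}$, and a small constant guards against division by zero. How the row contrasts ${c}_{i}\left(u\right)$ and ${c}_{j}\left(u\right)$ are turned into a decision is not covered.

```python
from itertools import combinations
from statistics import mean, median


def residual_energy(p_warped, p_ref):
    """e(v): column-wise sum of the residual |P_i'(u, v) - P_j(u, v)|."""
    cols = len(p_ref[0])
    return [
        sum(abs(row_a[v] - row_b[v]) for row_a, row_b in zip(p_warped, p_ref))
        for v in range(cols)
    ]


def mad_outliers(values, n_mads=3.0):
    """Indices farther than n_mads scaled MADs from the median."""
    med = median(values)
    scaled_mad = 1.4826 * median([abs(x - med) for x in values])
    return [k for k, x in enumerate(values) if abs(x - med) > n_mads * scaled_mad]


def overdub_interval(energy):
    """Pair (v_a, v_b) of outliers maximizing mean(inside) / mean(outside)."""
    derivative = [energy[k + 1] - energy[k] for k in range(len(energy) - 1)]
    candidates = mad_outliers(derivative)
    best, best_ratio = None, 0.0
    for v_a, v_b in combinations(candidates, 2):
        inside = energy[v_a + 1:v_b]              # v_a < v < v_b
        outside = energy[:v_a] + energy[v_b + 1:]  # v < v_a or v > v_b
        if not inside or not outside:
            continue
        ratio = mean(inside) / (mean(outside) + 1e-12)
        if ratio > best_ratio:
            best, best_ratio = (v_a, v_b), ratio
    return best


def row_contrast(image, v1, v2):
    """c(u): |mean inside (v1, v2) - mean outside| for every row u."""
    return [
        abs(mean(row[v1 + 1:v2]) - mean(row[:v1] + row[v2 + 1:]))
        for row in image
    ]


# Synthetic residual energy: a bright block between columns 20 and 29.
energy = [0.0] * 20 + [10.0] * 10 + [0.0] * 20
interval = overdub_interval(energy)
```

On this synthetic profile the only non-zero derivative values sit at the two block edges, so they are the only outliers and form the selected pair.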

#### 4.5. Tree Estimation

- Starting from the matrix D, build an undirected graph $\mathcal{G}=\{\mathcal{V},\mathcal{E}\}$ with N nodes, where the i-th node is associated with the audio track ${x}_{i}\left(n\right)$ and each edge $(i,j)$ exists if and only if ${d}_{i,j}<+\infty $ and ${d}_{j,i}<+\infty $.
- Run a maximal clique algorithm on $\mathcal{G}$, obtaining ${\mathcal{C}}_{1},\dots ,{\mathcal{C}}_{K}\subseteq \mathcal{V}$.
- Compute the $K\times K$ clique-dissimilarity matrix ${D}_{\mathcal{C}}$ as:$${D}_{\mathcal{C}}(p,q)=\frac{1}{|{\mathcal{C}}_{p}||{\mathcal{C}}_{q}|}\sum _{i\in {\mathcal{C}}_{p},j\in {\mathcal{C}}_{q}}{d}_{i,j}$$
- Starting from the matrix ${D}_{\mathcal{C}}$, build a complete directed graph ${\mathcal{G}}_{\mathcal{C}}=\{{\mathcal{V}}_{\mathcal{C}},{\mathcal{E}}_{\mathcal{C}}\}$ with K nodes, where every node is a clique of the undirected graph $\mathcal{G}$ and each edge $(p,q)$ has a weight equal to ${D}_{\mathcal{C}}(p,q)$, i.e., the average dissimilarity between the audio tracks belonging to the p-th and the q-th cliques.
- Compute the phylogenetic tree as the minimum spanning arborescence ${\widehat{\mathcal{G}}}_{\mathcal{C}}=\{{\mathcal{V}}_{\mathcal{C}},{\widehat{\mathcal{E}}}_{\mathcal{C}}\}$, i.e., the directed rooted spanning tree with minimum weight:$${\widehat{\mathcal{E}}}_{\mathcal{C}}=\underset{{\mathcal{E}}^{s}\subset {\mathcal{E}}_{\mathcal{C}}}{\arg\min}\sum _{(p,q)\in {\mathcal{E}}^{s}}{D}_{\mathcal{C}}(p,q)$$
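
The whole tree-estimation chain can be sketched on a toy dissimilarity matrix. The choices below are assumptions made for illustration only: maximal cliques are enumerated with a plain Bron-Kerbosch recursion, and the minimum spanning arborescence is found by brute force over parent assignments, which is only viable for a handful of cliques (a practical implementation would more likely use the Chu-Liu/Edmonds algorithm). Edges are read as (parent, child).

```python
from itertools import product
from math import inf


def build_graph(d):
    """Undirected graph: edge (i, j) iff d[i][j] and d[j][i] are both finite."""
    n = len(d)
    return {
        i: {j for j in range(n) if j != i and d[i][j] < inf and d[j][i] < inf}
        for i in range(n)
    }


def maximal_cliques(adj):
    """Bron-Kerbosch enumeration (no pivoting) of all maximal cliques."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(sorted(r))
            return
        for v in sorted(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(cliques)


def clique_dissimilarity(d, cliques):
    """D_C(p, q): average of d[i][j] over i in clique p and j in clique q."""
    return [
        [sum(d[i][j] for i in cp for j in cq) / (len(cp) * len(cq)) for cq in cliques]
        for cp in cliques
    ]


def _reaches_root(node, parent, root):
    """Follow parent links; True if the root is reached (no cycle)."""
    for _ in range(len(parent) + 1):
        if node == root:
            return True
        node = parent[node]
    return False


def min_arborescence(w):
    """Brute-force minimum spanning arborescence over all roots (small K only)."""
    k = len(w)
    best_cost, best_edges = inf, None
    for root in range(k):
        others = [q for q in range(k) if q != root]
        for choice in product(range(k), repeat=len(others)):
            parent = dict(zip(others, choice))
            if any(p == q for q, p in parent.items()):
                continue
            if not all(_reaches_root(q, parent, root) for q in others):
                continue
            cost = sum(w[p][q] for q, p in parent.items())
            if cost < best_cost:
                best_cost = cost
                best_edges = sorted((p, q) for q, p in parent.items())
    return best_edges


# Toy example: track 0 is the original; tracks 1 and 2 share a modification.
D = [[0.0, 1.0, 1.0],
     [inf, 0.0, 0.2],
     [inf, 0.2, 0.0]]
cliques = maximal_cliques(build_graph(D))
tree = min_arborescence(clique_dissimilarity(D, cliques))
```

In this example the two modified tracks end up in one clique, and the only finite-cost arborescence places the clique of the original track at the root.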

## 5. Dataset

- addition of a leader-tape within the tape;
- overdub with silence or with another track;
- addition of a splice within the tape.

## 6. Results and Discussion

- In 50% of the cases, the estimated tree perfectly reproduces the ground truth. Specifically, all the tracks sharing the same tape modifications (leader-tape and/or overdub) are collected in the same clique, and the resulting cliques are correctly ordered in the phylogenetic sense.
- In 40% of the cases, the estimated tree is not identical to the ground truth, but still makes sense in phylogenetic terms. For instance, some cliques turn out to be over-clustered: tracks that should belong to the same meta-node are split into several nodes, which can be siblings or in a parent-child relationship. However, the relative depths in the tree structure are maintained, and the overall phylogenetic sense is preserved. Figure 5 reports two examples of this scenario.
- In 10% of the cases, the estimated tree shows some wrong phylogenetic relations (ancestor-descendant swaps) with respect to the ground truth.

## 7. Conclusions

## Acknowledgments

## Author Contributions

## Conflicts of Interest

**Figure 1.** Example of near-duplicates (witnesses). A piece of leader-tape has been added in the middle of the tape (**a**), obtaining the modified version (**b**). The difference between the two versions can be clearly observed by comparing the corresponding spectrograms (**c**,**d**).

**Figure 2.** Block diagram of the proposed algorithm. The input consists of the digitized audio tracks ${x}_{i}$, $i=1,\dots ,N$, and the output is the estimated audio phylogeny tree (APT).

**Figure 3.** Spectrogram image ${P}_{i}(u,v)$ of an audio track ${x}_{i}\left(n\right)$, with green asterisks representing the detected SURF keypoints. Subfigures show the SURF keypoints (**a**) and inlier keypoints after RANSAC (**b**). Note that the remaining inlier points are located to the right of the leader-tape.

**Figure 4.** Residual spectrogram and related energy-over-time associated with a track pair $(i,j)$ containing an overdub, which appears in (**a**) as a bright region with clean edges. The red circles in (**b**) represent the detected outliers ${v}_{k}\in \mathcal{O}$, and the two points marked with green asterisks are the selected edges $({v}_{1},{v}_{2})$.

**Figure 5.** Examples of tree reconstruction with over-clustering errors. Datasets consist of seven audio tracks, $\{\mathbf{a},\mathbf{b},\dots ,\mathbf{g}\}$. In (**a**), cluster $\{\mathbf{b},\mathbf{e},\mathbf{g}\}$ is split into the parent-child pair $\left(\left\{\mathbf{e}\right\},\{\mathbf{b},\mathbf{g}\}\right)$; in (**b**), cluster $\{\mathbf{d},\mathbf{e},\mathbf{f},\mathbf{g}\}$ is split into the sibling pair $\left(\{\mathbf{d},\mathbf{e},\mathbf{g}\},\left\{\mathbf{f}\right\}\right)$.

**Table 1.** Equalization standards supported by the Studer A810, described by their time constants (in µs). Source: [39].

30 ips | 15 ips | 7.5 ips | 3.75 ips |
---|---|---|---|
AES: 17.5/∞ | CCIR: 35/∞ | 70/∞ | 90/3180 |
AES: 17.5/∞ | NAB: 50/3180 | 50/3180 | 90/3180 |

**Table 2.** Samples of electroacoustic music recorded on experimental tapes with the related recording parameters.

No. | Composer | Title | Year(s) | Speed (ips) | Equalization | DBX |
---|---|---|---|---|---|---|
1 | Luciano Berio | Differences | 1958–1959 | 7.5 | CCIR | yes |
2 | Pierre Boulez | Dialogue de l’ombre double | 1985 | 7.5 | CCIR | yes |
3 | Brian Ferneyhough | Mnemosyne | 1986 | 7.5 | CCIR | no |
4 | Brian Ferneyhough | Mnemosyne | 1986 | 15 | CCIR | yes |
5 | Bruno Maderna | Continuo | 1958 | 15 | CCIR | no |
6 | Bruno Maderna | Dimensioni II—invenzione su una voce | 1960 | 7.5 | NAB | yes |
7 | Bruno Maderna | Notturno | 1956 | 7.5 | NAB | no |
8 | Luigi Nono | ...sofferte onde serene... | 1976 | 15 | NAB | yes |
9 | Gruppo NPS | Interferenze II | 1965–1968 | 15 | NAB | yes |
10 | Gruppo NPS | Ricerca 4 | 1965–1968 | 15 | NAB | no |

Leader: $p(L \mid L)$ | Leader: $p(L \mid \neg L)$ | Overdub: $p(O \mid O)$ | Overdub: $p(O \mid \neg O)$ |
---|---|---|---|
90.0% | 0.0% | 75.0% | 3.3% |

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

