Optical Medieval Music Recognition Using Background Knowledge
Abstract
1. Introduction
2. Related Work
3. Datasets
4. Workflow
5. Post-Processing Pipeline
1. Correct symbols in wrong layout blocks: Sometimes FPs are recognized inside incorrect layout blocks, such as drop capitals, lyrics, or paragraphs. Symbols inside drop capitals are removed from both the baseline symbols and the uncertain symbols. Symbols in text regions are also removed if their vertical distance to the nearest staff line exceeds a fixed threshold.
2. Overlapping symbols: In this post-processing step, symbols that overlap are removed. Since only the centers are recognized by our symbol recognition, an outline for each symbol is approximated. Because notes appear as squares in the original documents, a box of fixed width and height can be drawn around the center. Because clefs are larger and also differ in height and width, this box is scaled slightly: C-clefs and F-clefs each use their own box width and height. If a clef overlaps a symbol, the clef is prioritized and the symbol is removed. If two symbols with the same PIS overlap each other, one of them is removed.
3. Position in staff of clefs: In the datasets, clefs appear only on a staff line and never in a space between two lines. If the algorithm has detected that the PIS of a clef lies between two staff lines, the clef is moved to the closest staff line.
4. Missing clefs at the beginning of a staff: Normally, a staff always begins with a clef. If this is not the case, the system tries to fix it:
    - (a) Additional symbols: If no clef is recognized by the algorithm at the beginning, the system checks whether a clef was recognized among the uncertain symbols. If so, this clef is added to the baseline symbols.
    - (b) Merging symbols: C- and F-clefs each consist of two boxes, so a clef can be mistaken for two notes. Therefore, two vertically stacked symbols at the beginning of a staff are replaced by a clef. This is only performed when the two symbols are no more than a fixed distance away from the start of the staff lines.
    - (c) Prior knowledge: If the above steps did not help, a clef is inserted at the beginning based on the clefs of previous staves. This can lead to FPs for the segmentation task, even if a clef has to be inserted at this position, because the exact center position can only be guessed. Nevertheless, it has a positive effect on the recognized melody.
5. Looped graphical connection: This post-processing step aims to fix errors in the graphical connection, i.e., corrections among the neume start, gapped, and looped classes.
    - (a) Additional symbols: The algorithm looks for consecutive notes that have a graphical connection between them. If the horizontal distance between these symbols exceeds a threshold, it is corrected: if there is an uncertain symbol in between, it is added to the baseline symbols.
    - (b) Replace class: If there is none, the class of the symbol is changed from “looped” to “neume start”.
    - (c) Stacked symbols: In addition, the system also recognizes stacked symbols. If the horizontal distance between them is below a threshold, a graphical connection is inserted between them.
6. PIS of stacked notes: The PIS of notes is often distorted due to the limited space on the staffs, because the writers of the handwritten manuscripts wanted to preserve the structural characteristics of certain neumes (e.g., Podatus or Clivis neumes, https://music-encoding.org/guidelines/v3/content/neumes.html, accessed on 30 May 2022). Due to this lack of space and in order to preserve these characteristics, the notes cannot be placed directly where they belong on the staff, but are slightly shifted. Since the system uses the distance of a symbol to the staff lines and spaces to calculate its position in staff (PIS), this shift can lead to FPs. The system corrects the PIS in borderline cases based on the confidence of the PIS algorithm for the notes of the respective neumes.
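The overlap-removal rule in step 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Symbol` class, the clef box scale factors, and the size parameter `d` are assumptions, since the paper's exact size constants are not reproduced here.

```python
from dataclasses import dataclass

@dataclass
class Symbol:
    x: float    # center x in px
    y: float    # center y in px
    kind: str   # "note", "c_clef", or "f_clef"
    pis: int    # position in staff (PIS)

def box(sym, d):
    """Approximate a symbol outline as a box around its center.
    Notes appear as squares; the clef scale factors below are
    illustrative placeholders, not the paper's constants."""
    if sym.kind == "note":
        w = h = d
    elif sym.kind == "c_clef":
        w, h = d, 2 * d
    else:  # "f_clef"
        w, h = 1.5 * d, 2 * d
    return (sym.x - w / 2, sym.y - h / 2, sym.x + w / 2, sym.y + h / 2)

def overlaps(a, b):
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def remove_overlapping(symbols, d):
    """Clefs win over notes; of two overlapping notes with the
    same PIS, only one is kept."""
    kept = []
    for sym in symbols:
        clash = next((k for k in kept if overlaps(box(k, d), box(sym, d))), None)
        if clash is None:
            kept.append(sym)
        elif sym.kind != "note" and clash.kind == "note":
            kept.remove(clash)  # prioritize the clef, drop the note
            kept.append(sym)
        elif sym.kind == "note" and clash.kind == "note" and sym.pis != clash.pis:
            kept.append(sym)    # different PIS: both may stay
        # otherwise keep the earlier symbol and drop sym
    return kept
```

A clef that overlaps an already kept note thus replaces it, while a duplicate note with the same PIS is silently dropped.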
6. Architecture
7. Layout Analysis
1. Music-Lyric: The bounding box for each encoded stave on a page is calculated and padded by a fixed margin on the top and bottom. These boxes are marked as music regions. Regions that lie between two music regions are marked as lyric regions. Since the bottom lyric line is not between two music regions, it is added separately: the average distance between two music regions (staffs) is calculated and afterwards used to determine the lowest music region.
2. Drop-Capital: Drop capitals (compare Figure 5) often overlap music regions, so they must be recognized separately. The pipeline uses a Mask R-CNN [21] with a ResNet-50 [19] encoder to detect drop capitals on the page.
    - (a) Training: Since only limited data are available, the weights of a model trained on the COCO dataset [22] are used as the starting point for fine-tuning. For training, only horizontal flip augmentations are used. The network is fine-tuned on the “Latin 14819” dataset, which contains a total of 320 instances of drop capitals. The model is trained with positive (images with drop capitals) and negative (images without drop capitals) examples; inputs are the raw documents. SGD was used as the optimizer, with a learning rate of 0.005, a momentum of 0.9, and a weight decay of 0.0005.
    - (b) Prediction: The raw documents are the input, and a threshold of 0.5 is applied to the output of the model. The output is then used to calculate the concave hulls of the drop capitals on the document. If hulls overlap, only the hull with the smallest area is kept.
    - (c) Qualitative Evaluation: A qualitative analysis was conducted on the Pa904 dataset, whose pages contain a total of 43 drop capitals. These can be subdivided into large, ornate ones and smaller ones that still clearly protrude into the music regions. The recognized drop capitals were then counted. The algorithm rarely recognized holes or inkblots as initials; these errors were ignored because the system only uses the drop capitals for the post-processing pipeline, and they do not cause errors for the symbol detection task. Results of the drop capital detection are given in Table 3. The error analysis showed that no FPs, such as neumes, were detected. Further experiments demonstrated similar results on other datasets without further fine-tuning, which is important because otherwise errors could be induced by the post-processing pipeline.
8. Results
1. Evaluation of different encoder/decoder architectures within our complete OMR pipeline, with data from the same dataset as the training data, using 5-fold cross validation (i.e., 80% of the data used for training, the remaining 20% for evaluation) on a large dataset (Nevers Part 1–3 + Latin 14819 with 50,607 symbols; compare Table 1); results in Table 4.
2. Evaluation of the OMR pipeline, similar to experiment 1, on a smaller dataset (Nevers Part 1–3 with 15,845 symbols) with 5-fold cross validation, repeating an evaluation from the literature (Table 5).
3. Evaluation of the OMR pipeline on a different, challenging dataset (Pa904 with 9025 symbols on 20 pages) as a real-world use case:
    - (a) With a pretrained mixed model and document-specific fine-tuning: starting with the mixed model of experiment 1 and using 16 pages of the dataset for fine-tuning while evaluating on the remaining 4 pages (5-fold cross validation; Table 6).
    - (b) With document-specific training without a pretrained model: using solely the Pa904 dataset for training and evaluation in a 5-fold cross validation similar to experiment 1. Here, 4 architectures (FCN, U-Net, Eff-b3, Eff-b7; Table 6) are compared.
    - (c) With a mixed model without fine-tuning: using the mixed model of experiment 1 and evaluating on the 20 pages of the Pa904 dataset (Table 7).
4. Evaluation of how effective the different parts of the post-processing pipeline are for the mixed model with and without fine-tuning on the Pa904 dataset (Table 8).
5. Evaluation of the frequency of remaining error types with and without post-processing on the Pa904 dataset using the mixed model with fine-tuning (Table 9).
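The 5-fold cross-validation splits used throughout these experiments (e.g., 16 training and 4 evaluation pages per fold for the 20-page Pa904 set) can be generated with a simple round-robin partition. This helper is a sketch for illustration, not the authors' code.

```python
def k_fold_splits(items, k=5):
    """Yield (train, test) partitions for k-fold cross validation.
    Each item appears in the test set of exactly one fold."""
    folds = [items[i::k] for i in range(k)]  # round-robin assignment
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

pages = list(range(20))             # e.g., the 20 pages of Pa904
splits = list(k_fold_splits(pages)) # five 16/4 train/test partitions
```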
8.1. Metrics
- dSAR: The diplomatic transcription compares the staff positions, the types of notes and clefs, and their order; the actual horizontal position is ignored [12].
- hSAR: Similar to dSAR, but only evaluates the correctness of the harmonic properties by ignoring the graphical connections of all note components (NCs) [12].
- mAR: Evaluates the melody sequence, which is generated from the predicted symbols by calculating the pitch for each note symbol.
- Symbol detection accuracy: A predicted symbol is counted as correctly detected (TP) if the distance to its respective ground truth symbol is less than 5 px.
- Type accuracy: Only TP pairs are considered. Here, the correctness of the predicted graphical connection is evaluated.
- Position in staff accuracy: Only TP pairs are considered as well. Here, the correctness of the predicted position in staff is evaluated.
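The symbol detection metric above pairs predictions with ground truth by center distance. A greedy sketch of this matching (the function name and greedy strategy are our assumptions; the paper does not specify the matching procedure beyond the 5 px threshold) might be:

```python
import math

def match_symbols(predicted, ground_truth, max_dist=5.0):
    """Greedily match predicted to ground-truth symbol centers.
    A prediction is a TP if it lies within max_dist px of a still
    unmatched ground-truth symbol; leftovers are FPs and FNs."""
    remaining = list(ground_truth)
    tp, fp = [], []
    for p in predicted:
        best = min(remaining, key=lambda g: math.dist(p, g), default=None)
        if best is not None and math.dist(p, best) < max_dist:
            remaining.remove(best)  # each GT symbol is matched at most once
            tp.append((p, best))
        else:
            fp.append(p)
    return tp, fp, remaining  # remaining ground truth symbols are the FNs
```

Type accuracy and position-in-staff accuracy would then be computed only over the returned TP pairs.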
8.2. Preliminary Experiments
8.3. Document-Specific Fine-Tuning
8.4. Mixed Model Training without Fine-Tuning
8.5. Evaluation of the Contribution of the Different Post-Processing Steps
8.6. Error Analysis
9. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
TP | True Positive |
FP | False Positive |
FN | False Negative |
GT | Ground Truth |
OMR | Optical Music Recognition |
CNN | Convolutional Neural Network |
FCN | Fully Convolutional Network |
dSAR | Diplomatic Symbol Accuracy Rate |
dSER | Diplomatic Symbol Error Rate |
hSAR | Harmonic Symbol Accuracy Rate |
hSER | Harmonic Symbol Error Rate |
NC | Note Component |
mAR | Melody Accuracy Rate |
mER | Melody Error Rate |
SGD | Stochastic Gradient Descent |
PIS | Position in Staff |
GCN | Graphical Connection between Notes |
References
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv 2015, arXiv:1505.04597. [Google Scholar]
- Pacha, A.; Calvo-Zaragoza, J. Optical Music Recognition in Mensural Notation with Region-Based Convolutional Neural Networks. In Proceedings of the 19th International Society for Music Information Retrieval Conference, Paris, France, 23–27 September 2018; pp. 240–247. [Google Scholar]
- Pacha, A.; Choi, K.Y.; Coüasnon, B.; Ricquebourg, Y.; Zanibbi, R.; Eidenberger, H.M. Handwritten Music Object Detection: Open Issues and Baseline Results. In Proceedings of the 2018 13th IAPR International Workshop on Document Analysis Systems (DAS), Vienna, Austria, 24–27 April 2018; pp. 163–168. [Google Scholar]
- Ren, S.; He, K.; Girshick, R.B.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv 2015, arXiv:1506.01497. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Redmon, J.; Divvala, S.K.; Girshick, R.B.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. arXiv 2015, arXiv:1506.02640. [Google Scholar]
- van der Wel, E.; Ullrich, K. Optical Music Recognition with Convolutional Sequence-to-Sequence Models. arXiv 2017, arXiv:1707.04877. [Google Scholar]
- Calvo-Zaragoza, J.; Hajič, J., Jr.; Pacha, A. Understanding Optical Music Recognition. ACM Comput. Surv. 2021, 53, 1–35. [Google Scholar] [CrossRef]
- Baró-Mas, A. Optical Music Recognition by Long Short-Term Memory Recurrent Neural Networks. Master’s Thesis, Universitat Autònoma de Barcelona, Bellaterra, Spain, 2017. [Google Scholar]
- Calvo-Zaragoza, J.; Rizo, D. End-to-End Neural Optical Music Recognition of Monophonic Scores. Appl. Sci. 2018, 8, 606. [Google Scholar] [CrossRef] [Green Version]
- Graves, A.; Fernández, S.; Gomez, F.; Schmidhuber, J. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. In Proceedings of the 23rd International Conference on Machine Learning (ICML ’06), Pittsburgh, PA, USA, 25–29 June 2006; Association for Computing Machinery: New York, NY, USA, 2006; pp. 369–376. [Google Scholar] [CrossRef]
- Calvo-Zaragoza, J.; Castellanos, F.J.; Vigliensoni, G.; Fujinaga, I. Deep Neural Networks for Document Processing of Music Score Images. Appl. Sci. 2018, 8, 654. [Google Scholar] [CrossRef] [Green Version]
- Wick, C.; Hartelt, A.; Puppe, F. Staff, Symbol and Melody Detection of Medieval Manuscripts Written in Square Notation Using Deep Fully Convolutional Networks. Appl. Sci. 2019, 9, 2646. [Google Scholar] [CrossRef] [Green Version]
- Hajic, J.; Dorfer, M.; Widmer, G.; Pecina, P. Towards Full-Pipeline Handwritten OMR with Musical Symbol Detection by U-Nets. In Proceedings of the ISMIR, Paris, France, 23–27 September 2018. [Google Scholar]
- d’Andecy, V.; Camillerapp, J.; Leplumey, I. Kalman filtering for segment detection: Application to music scores analysis. In Proceedings of the 12th International Conference on Pattern Recognition, Jerusalem, Israel, 9–13 October 1994; Volume 1, pp. 301–305. [Google Scholar] [CrossRef]
- Fujinaga, I. Optical Music Recognition Using Projections; Faculty of Music, McGill University: Montreal, QC, Canada, 1988. [Google Scholar]
- Bellini, P.; Bruno, I.; Nesi, P. Optical music sheet segmentation. In Proceedings of the First International Conference on WEB Delivering of Music. WEDELMUSIC 2001, Florence, Italy, 23–24 November 2001; pp. 183–190. [Google Scholar] [CrossRef]
- Chang, W.Y.; Chiu, C.C.; Yang, J.H. Block-based connected-component labeling algorithm using binary decision trees. Sensors 2015, 15, 23763–23787. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015, arXiv:1512.03385. [Google Scholar]
- Tan, M.; Le, Q.V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv 2019, arXiv:1905.11946. [Google Scholar]
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R.B. Mask R-CNN. arXiv 2017, arXiv:1703.06870. [Google Scholar]
- Lin, T.; Maire, M.; Belongie, S.J.; Bourdev, L.D.; Girshick, R.B.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. arXiv 2014, arXiv:1405.0312. [Google Scholar]
- Eipert, T.; Herrman, F.; Wick, C.; Puppe, F.; Haug, A. Editor Support for Digital Editions of Medieval Monophonic Music. In Proceedings of the 2nd International Workshop on Reading Music Systems, Delft, The Netherlands, 2 November 2019; pp. 4–7. [Google Scholar]
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
- Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
Dataset | Pages | Symbols | Symbols/Page | Clefs | Accidentals | GCN Annotated? |
---|---|---|---|---|---|---|
Nevers P1 | 14 | 3911 | 279 | 152 | 24 | Yes |
Nevers P2 | 27 | 10,265 | 380 | 345 | 37 | Yes |
Nevers P3 | 8 | 1669 | 209 | 83 | 1 | Yes |
Latin14819 | 182 | 34,762 | 191 | 2108 | 291 | Yes |
Pa904 | 20 | 9025 | 451 | 264 | 28 | No |
Id | Encoder | Decoder | Parameters | ImageNet-Weights |
---|---|---|---|---|
FCN | [12] | [12] | 603,649 | None |
U-Net | [1] | [1] | 31,032,265 | None |
MobileNet | [18] | Own | 6,584,033 | Encoder |
ResNet | [19] | Own | 34,345,545 | Encoder |
Eff-b1 (https://github.com/qubvel/efficientnet, accessed on 30 May 2022) | [20] | Own | 6,237,253 | Encoder |
Eff-b3 | [20] | Own | 7,851,275 | Encoder |
Eff-b5 | [20] | Own | 12,005,305 | Encoder |
Eff-b7 | [20] | Own | 20,141,881 | Encoder |
Big Drop Capital | Small Drop Capital | |
---|---|---|
TP | 18 | 15 |
FP | 0 | 0 |
FN | 1 | 9 |
Detection | Type | Position in Staff | Sequence | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Archit. | ||||||||||||
Non Post-proc. | FCN | 97.9 | 98.1 | 93.0 | 23.0 | 97.7 | 99.0 | 98.8 | 99.8 | 94.4 | 92.4 | 84.6 |
MobileNet | 98.4 | 98.5 | 96.4 | 75.0 | 97.8 | 99.1 | 98.9 | 99.7 | 95.4 | 93.5 | 89.6 | |
ResNet | 98.8 | 98.9 | 97.3 | 80.2 | 98.0 | 99.2 | 98.9 | 99.8 | 96.2 | 94.5 | 91.9 | |
Eff-b1 | 98.7 | 98.8 | 97.0 | 87.6 | 98.3 | 99.1 | 98.9 | 99.8 | 96.3 | 94.7 | 91.7 | |
Eff-b3 | 99.0 | 99.0 | 97.5 | 89.8 | 98.1 | 99.5 | 98.9 | 99.8 | 96.7 | 95.1 | 92.8 | |
Eff-b5 | 99.0 | 99.0 | 97.8 | 89.7 | 98.2 | 99.5 | 98.9 | 99.7 | 96.7 | 95.1 | 93.6 | |
Eff-b7 | 99.1 | 99.1 | 98.0 | 90.0 | 98.3 | 99.4 | 99.0 | 99.8 | 96.7 | 95.3 | 93.8 | |
U-Net | 99.2 | 99.3 | 98.2 | 86.0 | 98.4 | 99.1 | 99.0 | 99.8 | 97.2 | 95.8 | 94.1 | |
Mean | 98.6 | 98.8 | 96.9 | 77.7 | 98.1 | 99.2 | 98.9 | 99.8 | 96.2 | 94.5 | 91.5 | |
Post-proc. | FCN | 98.1 | 98.5 | 95.0 | 23.0 | 97.5 | 98.1 | 98.8 | 99.6 | 95.1 | 93.0 | 89.2 |
MobileNet | 98.5 | 98.6 | 97.0 | 74.4 | 97.4 | 99.0 | 99.1 | 99.8 | 95.8 | 93.8 | 92.5 | |
ResNet | 98.9 | 99.0 | 97.8 | 80.2 | 97.7 | 99.1 | 99.1 | 99.8 | 96.7 | 94.8 | 94.2 | |
Eff-b1 | 98.8 | 98.9 | 97.4 | 86.8 | 98.0 | 99.2 | 99.1 | 99.8 | 96.5 | 94.9 | 94.0 | |
Eff-b3 | 99.1 | 99.1 | 97.7 | 88.6 | 97.8 | 99.3 | 99.1 | 99.8 | 96.7 | 95.0 | 94.4 | |
Eff-b5 | 99.0 | 99.1 | 98.0 | 89.9 | 97.9 | 99.6 | 99.1 | 99.9 | 97.0 | 95.2 | 95.3 | |
Eff-b7 | 99.1 | 99.2 | 98.1 | 90.0 | 98.0 | 99.1 | 99.0 | 99.9 | 96.8 | 95.2 | 94.7 | |
U-Net | 99.3 | 99.4 | 98.2 | 87.0 | 98.0 | 99.3 | 99.2 | 99.9 | 97.3 | 95.7 | 95.1 | |
Mean | 98.9 | 99.0 | 97.4 | 77.5 | 97.9 | 99.1 | 95.0 | 99.8 | 96.5 | 94.7 | 93.7 |
Detection | Type | Position in Staff | Sequence | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Post-Processing | Archit. | |||||||||||
no | FCN | 97.8 | 97.9 | 89.5 | 30 | 96.1 | 98.7 | 97.9 | 99.3 | 92.9 | 89.6 | 79.7 |
Eff-b3 | 98.3 | 98.4 | 94.8 | 65 | 96.3 | 99.2 | 97.8 | 99.4 | 94.1 | 90.9 | 87.7 | |
yes | FCN | 98.2 | 98.3 | 94.3 | 30 | 95.4 | 98.0 | 98.0 | 99.2 | 93.7 | 90.2 | 88.0 |
Eff-b3 | 98.6 | 98.6 | 96.4 | 65 | 95.6 | 99.3 | 97.9 | 99.8 | 94.5 | 91.1 | 91.3 |
Detection | Type | Position in Staff | Sequence | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Archit. | ||||||||||||
No Post-proc. | FCN | 99.1 | 99.2 | 93.2 | 22.2 | n/a | 96.9 | 95.8 | 1 | 93.3 | n/a | 84.8 |
Eff-b3 | 99.6 | 99.6 | 98.0 | 39.9 | n/a | 97.5 | 95.7 | 1 | 94.0 | n/a | 91.2 | |
Eff-b7 | 99.4 | 99.5 | 97.4 | 23.8 | n/a | 97.6 | 95.7 | 1 | 93.8 | n/a | 91.4 | |
U-Net | 99.3 | 99.3 | 96.0 | 36.2 | n/a | 96.5 | 95.5 | 1 | 93.3 | n/a | 87.8 | |
Mean | 99.3 | 99.4 | 96.2 | 29.5 | n/a | 97.1 | 95.7 | 1 | 93.6 | n/a | 88.3 | |
Post-proc. | FCN | 99.2 | 99.3 | 97.0 | 22.2 | n/a | 95.3 | 98.8 | 1 | 96.4 | n/a | 91.3 |
Eff-b3 | 99.6 | 99.7 | 98.0 | 39.9 | n/a | 97.0 | 98.8 | 1 | 97.0 | n/a | 94.0 | |
Eff-b7 | 99.5 | 99.6 | 98.7 | 23.8 | n/a | 97.7 | 98.5 | 1 | 96.7 | n/a | 92.4 | |
U-Net | 99.5 | 99.6 | 98.0 | 36.2 | n/a | 95.4 | 98.6 | 1 | 96.6 | n/a | 94.7 | |
Mean | 99.4 | 99.6 | 97.9 | 29.9 | n/a | 96.4 | 98.7 | 1 | 96.7 | n/a | 93.1 | |
Fine-tuning + No Post-proc. | FCN | 99.1 | 99.2 | 96.0 | 0.03 | n/a | 98.7 | 98.8 | 1 | 93.8 | n/a | 88.8 |
Eff-b3 | 99.5 | 99.6 | 98.0 | 33.3 | n/a | 98.0 | 98.8 | 1 | 94.5 | n/a | 90.5 | |
Eff-b7 | 99.5 | 99.6 | 98.7 | 34.7 | n/a | 95.3 | 98.5 | 1 | 94.4 | n/a | 90.3 | |
U-Net | 99.5 | 99.5 | 97.0 | 32.3 | n/a | 97.0 | 98.6 | 1 | 94.4 | n/a | 90.3 | |
Mean | 99.4 | 99.5 | 97.4 | 25.1 | n/a | 97.3 | 98.7 | 1 | 94.3 | n/a | 90.0 | |
Fine-tuning + Post-proc. | FCN | 99.2 | 99.2 | 98.3 | 26.7 | n/a | 97.9 | 98.7 | 1 | 96.6 | n/a | 94.6 |
Eff-b3 | 99.5 | 99.6 | 98.8 | 33.3 | n/a | 98.0 | 99.0 | 1 | 97.4 | n/a | 95.8 | |
Eff-b7 | 99.5 | 99.6 | 98.0 | 34.7 | n/a | 95.3 | 98.9 | 1 | 97.3 | n/a | 93.0 | |
U-Net | 99.5 | 99.6 | 98.5 | 32.3 | n/a | 97.0 | 99.0 | 1 | 97.3 | n/a | 94.3 | |
Mean | 99.4 | 99.5 | 98.4 | 31.8 | n/a | 97.1 | 98.9 | 1 | 97.2 | n/a | 94.4 |
Detection | Type | Position in Staff | Sequence | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Post-Processing | Archit. | |||||||||||
no | FCN | 98.3 | 98.4 | 93.3 | 3.0 | n/a | 96.4 | 96.7 | 1 | 93.0 | n/a | 84.4 |
Eff-b3 | 99.2 | 99.4 | 98.3 | 48.0 | n/a | 96.4 | 96.9 | 99.9 | 95.0 | n/a | 90.4 | |
yes | FCN | 98.5 | 98.6 | 96.2 | 3.0 | n/a | 94.9 | 98.9 | 1 | 95.3 | n/a | 89.4 |
Eff-b3 | 99.3 | 99.4 | 99.1 | 46 | n/a | 96.1 | 98.9 | 1 | 97.0 | n/a | 93.2 |
Model | Document Specific (Section 8.3) | Mixed Model (Section 8.4) | ||
---|---|---|---|---|
Architecture | FCN | Eff-b3 | FCN | Eff-b3 |
Post-processing step (compare Section 5) | ||||
None | 88.8 | 90.5 | 84.4 | 90.4 |
(1) Correct symbols in wrong layout blocks | 88.8 | 90.5 | 84.5 | 90.4
(2) Correct overlapping symbols | 89.3 | 90.6 | 84.7 | 90.5 |
(3) Correct position in staff of clefs | 88.8 | 90.5 | 84.6 | 90.4 |
(4) Correct missing clefs | 91.2 | 91.8 | 86.3 | 91.3 |
(5) Correct looped graphical connection | n/a | n/a | 85.9 | 90.7 |
(6) Correct PIS of stacked notes | 91.3 | 93.2 | 87.8 | 92.5 |
All combined | 94.6 | 95.8 | 89.4 | 93.2 |
FCN | +/− | Eff-b3 | +/− | ||||
---|---|---|---|---|---|---|---|
Post-proc. | no | yes | no | yes | |||
TP (hSAR) | 93.78 | 96.59 | 94.55 | 97.42 | | |
FN | Missing Notes | 1.10 | 1.10 | 0% | 0.93 | 0.74 | 20% |
Wrong PiS | 2.08 | 0.74 | 64% | 1.80 | 0.63 | 65% | |
Missing Clef | 0.14 | 0.10 | 28% | 0.08 | 0.05 | 29% | |
Missing Accid | 0.16 | 0.16 | 0% | 0.09 | 0.09 | 0% | |
Sum | 3.48 | 2.10 | 39% | 2.90 | 1.51 | 48% | |
FP | Add. Notes | 0.63 | 0.52 | 17% | 0.71 | 0.39 | 45% |
Wrong PiS | 2.08 | 0.74 | 64% | 1.80 | 0.63 | 65% | |
Add. Clef | 0.04 | 0.03 | 9% | 0.04 | 0.04 | 0% | |
Add. Accid | 0 | 0 | 0% | 0 | 0 | 0% | |
Sum | 2.75 | 1.30 | 53% | 2.55 | 1.06 | 59% | |
FP + FN | Overall Sum | 6.23 | 3.40 | 45% | 5.45 | 2.57 | 53% |
Error Type | FN | FP
---|---|---
Symbols close to the text/text mistaken as symbols | 23% | 57%
Horizontally dense symbols | 23% | 0%
Vertically dense symbols | 6% | 0%
Faded symbols or noise | 12% | 0%
Rare symbols | 16% | 0%
Clef mistaken as symbols/symbols mistaken as clef | 5% | 29%
No apparent reason | 4% | 0%
Outside of staff lines | 4% | 0%
Drop capitals | 3% | 14.2%
Citation
Hartelt, A.; Puppe, F. Optical Medieval Music Recognition Using Background Knowledge. Algorithms 2022, 15, 221. https://doi.org/10.3390/a15070221