Few-Shot Object Detection: Application to Medieval Musicological Studies
Abstract
1. Introduction
2. Related Works
2.1. Object Detection
2.2. Few-Shot Object Detection
2.3. Computer Vision in the Humanities
3. Dataset Annotation and Ground Truth Creation
- Create the image dataset: The first step was to build a dataset of images for training. Three experts in musicology and professional singers searched for and manually selected images of manuscript pages containing singing representations.
- Annotate the objects: The domain experts annotated the dataset using the Supervisely tool (https://supervise.ly/, accessed on 1 November 2021), a collaborative online tool for image annotation that allows users to create bounding boxes and object masks. As a result, the objects in each image (such as books, lecterns, and altars) are highlighted by defining their borders.
- Classify the objects: In the third step, the previously annotated objects were manually classified by the musicologists as book, folio, phylactery, lectern, or altar. Thus, we can detect not only the objects themselves but also their exact positions within the image.
- Reach a consensus on the classification: As explained previously, detecting singing performances is not easy. We are working with images of artworks, so the singing representations are not real: they are paintings or drawings by an artist who did not necessarily know about vocal practices. Therefore, the fourth and last step consists of reaching a consensus among all the experts to create the ground truth.
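The annotation step above produces per-image JSON files that must be converted into (class, bounding box) records before training. The sketch below illustrates this conversion; the field names (`objects`, `classTitle`, `points.exterior`) follow the Supervisely export format as we understand it and should be treated as an assumption to be checked against the actual export.

```python
# Sketch: convert a Supervisely-style annotation into (class, bbox) records.
# Field names are assumed from the Supervisely rectangle export format.
import json

CLASSES = {"book", "folio", "phylactery", "lectern", "altar"}

def parse_annotation(ann_json: str):
    """Return a list of (class_name, (xmin, ymin, xmax, ymax)) tuples."""
    ann = json.loads(ann_json)
    records = []
    for obj in ann.get("objects", []):
        cls = obj["classTitle"].lower()
        if cls not in CLASSES:
            continue  # skip objects outside the agreed five-class taxonomy
        # A rectangle is stored as two corner points; normalize the order.
        (x1, y1), (x2, y2) = obj["points"]["exterior"]
        bbox = (min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))
        records.append((cls, bbox))
    return records

# Hypothetical single-object annotation for illustration:
example = (
    '{"objects": [{"classTitle": "Book",'
    ' "points": {"exterior": [[10, 20], [110, 220]]}}]}'
)
```

Normalizing the corner order makes the loader robust to annotations drawn in any direction, which matters when several annotators contribute boxes.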
3.1. Image Dataset of Illuminations Representing Medieval Singing
3.2. Annotation of Written Supports in Illuminations
- Phylactery: a medieval speech scroll, which contains or depicts speech, song, or other sounds (for example, Figure 2d (Bodmer ms. 91 f. 39r: http://www.e-codices.unifr.ch/loris/fmb/fmb-cb-0091/fmb-cb-0091_039r.jp2/full/full/0/default/jpg, accessed on 1 November 2021)).
- Folio: a thin and flat surface that can be used for writing or drawing. (Figure 2b (Abbeville ms. 016 f. 15: https://iiif.irht.cnrs.fr/iiif/France/Abbeville/B800016201/DEPOT/IRHT_106357_2/1000,500,800,1500/full/0/default.jpg, accessed on 1 November 2021)).
- Book: a collection of sheets bound together containing printed or written texts, pictures, etc. (Figure 2a (BnF ms. fr. 166 f. 119v: https://gallica.bnf.fr/iiif/ark:/12148/btv1b105325870/f252/1500,3100,1000,1200/full/0/native.jpg, accessed on 1 November 2021)).
- Altar: a sacred table used for ritual sacrifice or offerings in a religious building. (Figure 2e (Initiale ms. 3028 f. 082: http://initiale.irht.cnrs.fr/decor/58000, accessed on 1 November 2021)).
- Lectern: a reading desk with a slanted top, on which books are placed for reading aloud. (Figure 2c (Avignon ms. 0121, f. 162v: https://bvmm.irht.cnrs.fr/consult/consult.php?mode=ecran&reproductionId=15460&VUE_ID=1393519, accessed on 1 November 2021)).
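The five written-support classes above form the label set used throughout the experiments. A minimal label map might look as follows; the integer ids are our own illustrative choice, not prescribed by the paper:

```python
# Label map for the five written-support classes annotated in the dataset.
# The integer ids are illustrative; any consistent assignment works.
WRITTEN_SUPPORTS = {
    "phylactery": 0,  # medieval speech scroll depicting speech or song
    "folio": 1,       # a single thin, flat writing surface
    "book": 2,        # bound sheets with printed or written text
    "altar": 3,       # sacred table for offerings in a religious building
    "lectern": 4,     # slanted reading desk holding books read aloud
}

def label_id(name: str) -> int:
    """Map a (case-insensitive) class name to its integer label."""
    return WRITTEN_SUPPORTS[name.lower()]
```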
4. Methodology
4.1. Bi-Training Few-Shot Object Detection
Algorithm 1: Bi-Training Few-Shot Object Detection
4.1.1. Total Model Improvement
4.1.2. Classifier Fine-Tuning
5. Results
5.1. Global Few-Shot Object Detection Benchmark
5.2. Worst-Case Few-Shot Object Detection Benchmark
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
| Class | Count |
|---|---|
| Book | 338 |
| Phylactery | 204 |
| Lectern | 87 |
| Altar | 37 |
| Folio | 27 |
| Total | 693 |
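The class counts above are strongly imbalanced, which is what motivates the few-shot setting: a short computation of the per-class shares and the imbalance ratio, using only the counts from the table, makes this concrete.

```python
# Per-class share of the 693 annotated objects, from the table above.
counts = {"book": 338, "phylactery": 204, "lectern": 87, "altar": 37, "folio": 27}

total = sum(counts.values())
shares = {cls: round(100 * n / total, 1) for cls, n in counts.items()}

# Ratio between the most frequent (book) and rarest (folio) class.
imbalance = max(counts.values()) / min(counts.values())
```

Books account for nearly half of all annotations, while folios make up under 4%, a roughly 12.5:1 ratio between the most common and rarest class.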
Global few-shot benchmark (values in %). UB denotes the upper bound trained on 100% of the data; Ours and LB (lower bound) are evaluated at 80% and 50% of the data; Base/Novel denote base and novel classes.

| Model | 100% UB Base | 100% UB Novel | 80% Ours Base | 80% Ours Novel | 80% LB Base | 80% LB Novel | 50% Ours Base | 50% Ours Novel | 50% LB Base | 50% LB Novel |
|---|---|---|---|---|---|---|---|---|---|---|
| YOLOv4-s | 83.62 | 81.86 | 61.88 | 62.24 | 50.24 | 55.66 | 54.78 | 48.57 | 52.85 | 51.30 |
| YOLOv4-m | 72.86 | 77.19 | 65.27 | 58.87 | 58.28 | 58.66 | 36.94 | 52.54 | 33.76 | 36.77 |
| Faster R-CNN | 68.90 | 72.04 | 61.55 | 58.44 | 52.39 | 54.54 | 46.77 | 47.19 | 42.04 | 37.08 |
| Mask R-CNN | 79.13 | 77.47 | 71.31 | 62.49 | 59.77 | 62.17 | 55.58 | 57.03 | 49.15 | 54.03 |
| ViT | 57.13 | 54.96 | 45.27 | 42.72 | 36.84 | 50.12 | 33.05 | 35.55 | 28.53 | 23.26 |
| Swin-T | 72.08 | 57.49 | 65.18 | 45.35 | 48.75 | 45.25 | 44.30 | 46.77 | 38.06 | 44.49 |
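As a quick sanity check on the 50% setting in the table above, the novel-class gain of the bi-training approach ("Ours") over the LB baseline can be averaged across the six detectors. The values below are copied directly from the table (order: YOLOv4-s, YOLOv4-m, Faster R-CNN, Mask R-CNN, ViT, Swin-T).

```python
# Novel-class mAP at 50% of the data: "Ours" vs. the LB baseline,
# values copied from the benchmark table.
ours_novel_50 = [48.57, 52.54, 47.19, 57.03, 35.55, 46.77]
lb_novel_50 = [51.30, 36.77, 37.08, 54.03, 23.26, 44.49]

gains = [o - l for o, l in zip(ours_novel_50, lb_novel_50)]
mean_gain = sum(gains) / len(gains)  # average improvement in mAP points
```

Bi-training improves the novel-class score for five of the six detectors (YOLOv4-s is the exception), with an average gain of about 6.8 mAP points.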
Worst-case benchmark on the novel class with the lowest annotation count (values in %). Full UB uses all available data; Ours and LB are evaluated with 10 and 5 examples.

| Model | Full UB | 10 Ours | 10 LB | 5 Ours | 5 LB |
|---|---|---|---|---|---|
| YOLOv4-s | 76.15 | 59.78 | 52.13 | 49.71 | 43.79 |
| YOLOv4-m | 71.89 | 53.17 | 54.40 | 48.86 | 31.82 |
| Faster R-CNN | 68.17 | 52.85 | 53.35 | 42.08 | 38.57 |
| Mask R-CNN | 72.65 | 57.92 | 57.06 | 51.14 | 48.60 |
| ViT | 52.69 | 46.18 | 41.92 | 35.97 | 24.05 |
| Swin-T | 53.61 | 47.99 | 40.20 | 41.88 | 39.28 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Ibrahim, B.I.E.; Eyharabide, V.; Le Page, V.; Billiet, F. Few-Shot Object Detection: Application to Medieval Musicological Studies. J. Imaging 2022, 8, 18. https://doi.org/10.3390/jimaging8020018