Search Results (35)

Search Parameters:
Keywords = music transcription

15 pages, 978 KB  
Article
SpectTrans: Joint Spectral–Temporal Modeling for Polyphonic Piano Transcription via Spectral Gating Networks
by Rui Cao, Yan Liang, Lei Feng and Yuanzi Li
Electronics 2026, 15(3), 665; https://doi.org/10.3390/electronics15030665 - 3 Feb 2026
Abstract
Automatic Music Transcription (AMT) plays a fundamental role in Music Information Retrieval (MIR) by converting raw audio signals into symbolic representations such as MIDI or musical scores. Despite advances in deep learning, accurately transcribing piano performances remains challenging due to dense polyphony, wide dynamic range, sustain pedal effects, and harmonic interactions between simultaneous notes. Existing approaches using convolutional and recurrent architectures, or autoregressive models, often fail to capture long-range temporal dependencies and global harmonic structures, while conventional Vision Transformers overlook the anisotropic characteristics of audio spectrograms, leading to harmonic neglect. In this work, we propose SpectTrans, a novel piano transcription framework that integrates a Spectral Gating Network with a multi-head self-attention Transformer to jointly model spectral and temporal dependencies. Latent CNN features are projected into the frequency domain via a Real Fast Fourier Transform, enabling adaptive filtering of overlapping harmonics and suppression of non-stationary noise, while deeper layers capture long-term melodic and chordal relationships. Experimental evaluation on polyphonic piano datasets demonstrates that this architecture produces acoustically coherent representations, improving the robustness and precision of transcription under complex performance conditions. These results suggest that combining frequency-domain refinement with global temporal modeling provides an effective strategy for high-fidelity AMT. Full article
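
The frequency-domain refinement described in this abstract can be pictured with a short PyTorch sketch: latent features are taken to the frequency domain with a real FFT, multiplied by a learned gate, and transformed back. This is a minimal, hypothetical spectral-gating layer under assumed tensor shapes (the FFT is taken along the frame axis of the latent features); it is not the published SpectTrans module.

```python
import torch
import torch.nn as nn


class SpectralGate(nn.Module):
    """Illustrative spectral-gating block: rFFT -> learned complex gate -> inverse rFFT."""

    def __init__(self, num_frames: int, channels: int):
        super().__init__()
        freq_bins = num_frames // 2 + 1  # rFFT output length along the frame axis
        # Learnable complex-valued gate, one weight per (channel, frequency bin).
        self.gate = nn.Parameter(torch.randn(channels, freq_bins, dtype=torch.cfloat) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, num_frames) latent CNN features along time.
        spec = torch.fft.rfft(x, dim=-1)                      # project into the frequency domain
        spec = spec * self.gate                               # adaptive filtering of harmonics/noise
        return torch.fft.irfft(spec, n=x.shape[-1], dim=-1)   # back to the frame domain


# Toy usage: 2 clips, 64 feature channels, 400 frames.
x = torch.randn(2, 64, 400)
y = SpectralGate(num_frames=400, channels=64)(x)
print(y.shape)  # torch.Size([2, 64, 400])
```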

19 pages, 2385 KB  
Article
Multitrack Music Transcription Based on Joint Learning of Onset and Frame Streams
by Tomoki Matsunaga and Hiroaki Saito
Signals 2026, 7(1), 12; https://doi.org/10.3390/signals7010012 - 2 Feb 2026
Abstract
Multitrack music transcription is the task of converting music recordings into symbolic music representations that are assigned to individual instruments. This task requires simultaneous transcription of note onset and offset events for individual instruments. In addition, the limited resources of many transcription datasets make multitrack music transcription challenging. Thus, even state-of-the-art transcription systems are inadequate for applications requiring high accuracy. In this paper, we propose a framework to jointly transcribe onsets and frames for multiple instruments by integrating a deep learning architecture based on U-Net with an architecture based on Perceiver, which is a variant of the Transformer architecture. The proposed framework effectively detects the pitches of different instruments by employing the multi-layer combined frequency and periodicity (ML-CFP) with multilayered frequency-domain and quefrency-domain features as the input data representation. Our experiments demonstrate that the proposed multitrack music transcription system outperforms existing systems on five transcription datasets, including low-resource datasets. Furthermore, we evaluate the proposed system in terms of instrument type and show that the system provides high-quality transcription results for the predominant instruments. Full article
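
The combined frequency and periodicity (CFP) idea that ML-CFP builds on can be illustrated in a few lines: a spectral representation and a cepstral (periodicity) representation are mapped onto one pitch axis and multiplied, so a pitch candidate scores highly only when both domains support it. The sketch below is a single-layer version with assumed parameters (pitch range, bin count); the paper's multi-layer ML-CFP differs in detail.

```python
import numpy as np
import librosa


def cfp_frame(mag, sr, n_fft, fmin=55.0, fmax=1760.0, bins=176):
    """Combined frequency and periodicity (CFP) feature for one magnitude-spectrum frame."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)                        # Hz per spectral bin
    ceps = np.maximum(np.fft.rfft(np.log1p(mag), n=n_fft).real, 0.0)[1:]  # periodicity via cepstrum
    quefs = np.arange(1, len(ceps) + 1) / sr                          # seconds per cepstral bin
    cand = fmin * 2.0 ** np.linspace(0, np.log2(fmax / fmin), bins)   # log-spaced pitch grid
    spec_part = np.interp(cand, freqs, mag)                           # spectrum on the pitch grid
    cep_freqs = 1.0 / quefs                                           # quefrency -> fundamental (Hz)
    order = np.argsort(cep_freqs)
    cep_part = np.interp(cand, cep_freqs[order], ceps[order])         # periodicity on the pitch grid
    return spec_part * cep_part                                       # "AND" of both domains


sr = 16000
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 330 * t)   # A3 + E4 dyad
S = np.abs(librosa.stft(y, n_fft=2048))
cfp = np.stack([cfp_frame(S[:, k], sr, 2048) for k in range(S.shape[1])], axis=1)
print(cfp.shape)   # (176 pitch candidates, n_frames)
```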

18 pages, 6947 KB  
Article
Introducing Gregorian Chant to a Malaysian Methodist Congregation: A Case Study
by Cecilia Ting, Eleanor J. Giraud and Helen Phelan
Religions 2026, 17(2), 151; https://doi.org/10.3390/rel17020151 - 28 Jan 2026
Viewed by 93
Abstract
This study explores the feasibility of introducing Gregorian chant into contemporary Chinese Methodist worship in Malaysia. Using ethnographic methods including participant observation, interviews, and focus groups, this article documents a pilot study conducted at Sing Ang Tong Methodist Church in Sibu, Sarawak, where seven singers learned and performed the communion chant Gustate et videte. Three different transcription editions were created to bridge the gap between medieval square notation and modern Western notation, which is more familiar to the participants. The chant was translated into Chinese alongside the original Latin text. The majority preferred the quaver-crotchet notation edition and supported performing the chant in both Latin and Chinese to balance authenticity with accessibility. Participants found the modal melodic structure and free rhythm challenging initially but developed appreciation for the chant’s meditative qualities. The performance during Holy Communion services in October 2022 received mixed congregational responses, with many describing it as creating a “calm and prayerful atmosphere” while some expressed discomfort with the unfamiliar musical style. The study demonstrates that Gregorian chant can be successfully integrated into Chinese Methodist worship contexts, particularly during solemn liturgical occasions, when approached with appropriate liturgical sensitivity and cultural adaptation. Full article
(This article belongs to the Special Issue Sacred Music: Creation, Interpretation, Experience)

16 pages, 6746 KB  
Article
Cross-Attentive CNNs for Joint Spectral and Pitch Feature Learning in Predominant Instrument Recognition from Polyphonic Music
by Lekshmi Chandrika Reghunath, Rajeev Rajan, Christian Napoli and Cristian Randieri
Technologies 2026, 14(1), 3; https://doi.org/10.3390/technologies14010003 - 19 Dec 2025
Viewed by 339
Abstract
Identifying instruments in polyphonic audio is challenging due to overlapping spectra and variations in timbre and playing styles. This task is central to music information retrieval, with applications in transcription, recommendation, and indexing. We propose a dual-branch Convolutional Neural Network (CNN) that processes Mel-spectrograms and binary pitch masks, fused through a cross-attention mechanism to emphasize pitch-salient regions. On the IRMAS dataset, the model achieves competitive performance with state-of-the-art methods, reaching a micro F1 of 0.64 and a macro F1 of 0.57 with only 0.878M parameters. Ablation studies and t-SNE analyses further highlight the benefits of cross-modal attention for robust predominant instrument recognition. Full article
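
A minimal PyTorch sketch of the dual-branch idea is shown below: Mel-spectrogram features act as queries that attend over pitch-mask features, so pitch-salient regions are emphasised before classification. The branch depths, pooling, and 11-class output (the IRMAS instrument count) are assumptions chosen for illustration, not the authors' exact network.

```python
import torch
import torch.nn as nn


class CrossAttentiveFusion(nn.Module):
    """Illustrative dual-branch fusion: Mel-spectrogram queries attend to pitch-mask keys."""

    def __init__(self, dim: int = 64, n_heads: int = 4, n_classes: int = 11):
        super().__init__()
        self.mel_branch = nn.Sequential(nn.Conv2d(1, dim, 3, padding=1), nn.ReLU(),
                                        nn.AdaptiveAvgPool2d((1, 32)))
        self.pitch_branch = nn.Sequential(nn.Conv2d(1, dim, 3, padding=1), nn.ReLU(),
                                          nn.AdaptiveAvgPool2d((1, 32)))
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, mel: torch.Tensor, pitch_mask: torch.Tensor) -> torch.Tensor:
        q = self.mel_branch(mel).squeeze(2).transpose(1, 2)            # (batch, 32, dim)
        kv = self.pitch_branch(pitch_mask).squeeze(2).transpose(1, 2)  # (batch, 32, dim)
        fused, _ = self.cross_attn(q, kv, kv)                          # emphasise pitch-salient regions
        return self.head(fused.mean(dim=1))                            # clip-level instrument logits


mel = torch.randn(2, 1, 128, 300)                        # (batch, 1, mel bins, frames)
pitch = torch.randint(0, 2, (2, 1, 128, 300)).float()    # binary pitch mask
print(CrossAttentiveFusion()(mel, pitch).shape)          # torch.Size([2, 11])
```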

15 pages, 1366 KB  
Article
Multi-Feature Fusion for Automatic Piano Transcription Based on Mel Cyclic and STFT Spectrograms
by Jinliang Dai, Qiuyue Zheng, Yang Wang, Qihuan Shan, Jie Wan and Weiwei Zhang
Electronics 2025, 14(23), 4720; https://doi.org/10.3390/electronics14234720 - 29 Nov 2025
Viewed by 421
Abstract
Automatic piano transcription (APT) is a challenging problem in music information retrieval. In recent years, most APT approaches have been based on neural networks and have demonstrated higher performance. However, most previous works utilize a short-time Fourier transform (STFT) spectrogram as input, which results in a noisy spectrogram due to the mixing of harmonics from concurrent notes. To address this issue, a novel APT network based on two spectrograms is proposed. Firstly, the Mel cyclic and Mel STFT spectrograms of the piano musical signal are computed to represent the mixed audio. Next, separate modules for onset, offset, and frame-level note detection are constructed to achieve distinct objectives. To capture the temporal dynamics of notes, an axial attention mechanism is incorporated into the frame-level note detection modules. Finally, a multi-feature fusion module is introduced to aggregate different features and generate the piano note sequences. In this work, the two spectrograms provide complementary information, the axial attention mechanism enhances the temporal relevance of notes, and the multi-feature fusion module incorporates frame-level note, note onset, and note offset features together to deduce final piano notes. Experimental results demonstrate that the proposed approach achieves higher accuracies with lower error rates in automatic piano transcription compared with other reference approaches. Full article
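
Axial attention, as used in the frame-level detection modules described above, factorises 2-D attention into one pass along the time axis and one along the frequency axis, which is much cheaper than attending over all time–frequency positions at once. The sketch below is a generic illustration of that mechanism with assumed feature sizes; it does not reproduce the paper's detection modules or the Mel cyclic spectrogram front end.

```python
import torch
import torch.nn as nn


class AxialAttention2D(nn.Module):
    """Illustrative axial attention: self-attention along time, then along frequency."""

    def __init__(self, dim: int = 64, n_heads: int = 4):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.freq_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, freq, time, dim) feature map from a spectrogram encoder.
        b, f, t, d = x.shape
        xt = x.reshape(b * f, t, d)                       # attend across frames, one frequency row at a time
        xt, _ = self.time_attn(xt, xt, xt)
        x = x + xt.reshape(b, f, t, d)                    # residual connection
        xf = x.permute(0, 2, 1, 3).reshape(b * t, f, d)   # attend across frequencies, one frame at a time
        xf, _ = self.freq_attn(xf, xf, xf)
        return x + xf.reshape(b, t, f, d).permute(0, 2, 1, 3)


x = torch.randn(2, 88, 100, 64)       # 88 pitch bins, 100 frames, 64-dim features
print(AxialAttention2D()(x).shape)    # torch.Size([2, 88, 100, 64])
```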

35 pages, 24993 KB  
Article
Sensory Heritage Is Vital for Sustainable Cities: A Case Study of Soundscape and Smellscape at Wong Tai Sin
by PerMagnus Lindborg, Lok Him Lam, Yui Chung Kam and Ran Yue
Sustainability 2025, 17(16), 7564; https://doi.org/10.3390/su17167564 - 21 Aug 2025
Viewed by 2431
Abstract
Sensory heritage encompasses culturally valued practices, rituals, and everyday activities experienced through the senses. While sight often dominates, hearing and smelling are generally more immersive and pervasive. Soundscape research is a well-established field within urban studies; however, smellscape remains insufficiently recognised. This study is part of Multimodal Hong Kong, a project aimed at documenting sensory cultural heritage across the city by capturing the complex interplay between soundscape, smellscape, urban experiences, everyday activities, and memory. We investigated the multisensory environment at Wong Tai Sin Temple through acoustic measurements and perceptual ratings of soundscape and smellscape across 197 locations within and around the site. Additionally, semi-structured interviews were conducted with visitors (N = 54, 15,015 words of transcript), which were analysed using content analysis and natural language processing. The results indicate that elevated noise levels mainly arise from human voices and pipe music within the temple compound, as well as traffic noise in the surrounding area. The smell of incense dominates near the temple altars, whereas natural, grassy odours prevail in the adjacent park. Interview responses confirm that incense burning constitutes a traditional religious practice forming a distinctive olfactory marker for Chinese temples, but it is also perceived as having adverse health implications. This study contributes to the growing body of sensory heritage research, underscoring the importance of both soundscape and smellscape in fostering culturally inclusive, vibrant, and sustainable urban environments. Full article
(This article belongs to the Special Issue Urban Noise Control, Public Health and Sustainable Cities)

21 pages, 564 KB  
Article
Sounding Identity: A Technical Analysis of Singing Styles in the Traditional Music of Sub-Saharan Africa
by Alfred Patrick Addaquay
Arts 2025, 14(3), 68; https://doi.org/10.3390/arts14030068 - 16 Jun 2025
Viewed by 5273
Abstract
This article presents an in-depth examination of the technical and cultural dimensions of singing practices within the traditional music of sub-Saharan Africa. Utilizing an extensive body of theoretical and ethnomusicological research, comparative transcription, and culturally situated observation, it presents a comprehensive framework for understanding the significance of the human voice in various performance contexts. The study revolves around a tripartite model—auditory clarity, ambiguous auditory clarity, and occlusion—that delineates the varying levels of audibility of vocal lines amidst intricate instrumental arrangements. The article examines case studies from West, East, and Southern Africa, highlighting essential vocal techniques such as straight tone, nasal resonance, ululation, and controlled (or delayed) vibrato. It underscores the complex interplay between language, melody, and rhythm in tonal languages. The analysis delves into the influence of sound reinforcement technologies on vocal presence and cultural authenticity, positing that PA systems have the capacity to either enhance or disrupt the equilibrium between traditional aesthetics and modern requirements. This research is firmly rooted in a blend of African and Western theoretical frameworks, drawing upon the contributions of Nketia, Agawu, Chernoff, and Kubik. It proposes a nuanced methodology that integrates technical analysis with cultural significance. It posits that singing in African traditional music transcends mere expression, serving as a vessel for collective memory, identity, and the socio-musical framework. The article concludes by emphasizing the enduring strength and flexibility of African vocal traditions, illustrating their capacity for evolution while preserving fundamental communicative and artistic values. Full article
10 pages, 451 KB  
Article
PF2N: Periodicity–Frequency Fusion Network for Multi-Instrument Music Transcription
by Taehyeon Kim, Man-Je Kim and Chang Wook Ahn
Mathematics 2025, 13(11), 1708; https://doi.org/10.3390/math13111708 - 23 May 2025
Viewed by 1387
Abstract
Automatic music transcription in multi-instrument settings remains a highly challenging task due to overlapping harmonics and diverse timbres. To address this, we propose the Periodicity–Frequency Fusion Network (PF2N), a lightweight and modular component that enhances transcription performance by integrating both spectral and periodicity-domain representations. Inspired by traditional combined frequency and periodicity (CFP) methods, the PF2N reformulates CFP as a neural module that jointly learns harmonically correlated features across the frequency and cepstral domains. Unlike handcrafted alignments in classical approaches, the PF2N performs data-driven fusion using a learnable joint feature extractor. Extensive experiments on three benchmark datasets (Slakh2100, MusicNet, and MAESTRO) demonstrate that the PF2N consistently improves transcription accuracy when incorporated into state-of-the-art models. The results confirm the effectiveness and adaptability of the PF2N, highlighting its potential as a general-purpose enhancement for multi-instrument AMT systems. Full article
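
Where classical CFP combines the two domains with a fixed product, the abstract describes a learnable, data-driven fusion. The sketch below shows one possible form of a joint feature extractor over frequency-domain and cepstral-domain inputs; the channel counts and layer layout are assumptions for illustration, not the published PF2N design.

```python
import torch
import torch.nn as nn


class JointFreqPeriodicityFusion(nn.Module):
    """Illustrative learnable CFP-style fusion: concatenate frequency-domain and
    cepstral (periodicity) features and learn the combination instead of a fixed product."""

    def __init__(self, channels: int = 32):
        super().__init__()
        self.spec_enc = nn.Conv2d(1, channels, kernel_size=3, padding=1)
        self.ceps_enc = nn.Conv2d(1, channels, kernel_size=3, padding=1)
        self.joint = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, spec: torch.Tensor, ceps: torch.Tensor) -> torch.Tensor:
        # spec, ceps: (batch, 1, pitch_bins, frames), both resampled to a shared pitch axis.
        z = torch.cat([self.spec_enc(spec), self.ceps_enc(ceps)], dim=1)
        return self.joint(z)   # harmonically correlated joint features


spec = torch.randn(2, 1, 176, 200)
ceps = torch.randn(2, 1, 176, 200)
print(JointFreqPeriodicityFusion()(spec, ceps).shape)  # torch.Size([2, 32, 176, 200])
```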
(This article belongs to the Section E1: Mathematics and Computer Science)

22 pages, 1596 KB  
Article
Fuzzy Frequencies: Finding Tonal Structures in Audio Recordings of Renaissance Polyphony
by Mirjam Visscher and Frans Wiering
Heritage 2025, 8(5), 164; https://doi.org/10.3390/heritage8050164 - 6 May 2025
Viewed by 1359
Abstract
Understanding tonal structures in Renaissance music has been a long-standing musicological problem. Computational analysis on a large scale could shed new light on this. Encoded scores provide easy access to pitch content, but the availability of such data is low. This paper addresses this shortage of data by exploring the potential of audio recordings. Analysing audio, however, is challenging due to the presence of harmonics, reverb and noise, which may obscure the pitch content. We test several multiple pitch estimation models on audio recordings, using encoded scores from the Josquin Research Project (JRP) as a benchmark for evaluation. We present a dataset of multiple pitch estimations from 611 compositions in the JRP. We use the pitch estimations to create pitch profiles and pitch class profiles, and to estimate the lowest final pitch of each recording. Our findings indicate that the Multif0 model yields pitch profiles, pitch class profiles and finals most closely aligned with symbolic encodings. Furthermore, we found no effect of year of recording, number of voices and ensemble composition on the accuracy of pitch estimations. Finally, we demonstrate how these models can be applied to gain insight into tonal structures in early polyphony. Full article
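
Turning multiple pitch estimates into pitch profiles and pitch class profiles is conceptually simple; the sketch below builds a minimal 12-bin pitch class profile from per-frame F0 estimates. The weighting scheme and the final-pitch estimation used in the paper are not reproduced here.

```python
import numpy as np


def pitch_class_profile(frame_pitches_hz):
    """Build a 12-bin pitch class profile from per-frame multiple pitch estimates.

    frame_pitches_hz: iterable of arrays, one array of estimated F0s (Hz) per frame.
    Returns a normalised histogram over pitch classes (C = 0 ... B = 11).
    """
    profile = np.zeros(12)
    for pitches in frame_pitches_hz:
        for f0 in np.atleast_1d(pitches):
            if f0 <= 0:
                continue
            midi = 69 + 12 * np.log2(f0 / 440.0)     # Hz -> MIDI note number
            profile[int(round(midi)) % 12] += 1
    total = profile.sum()
    return profile / total if total > 0 else profile


# Toy example: two frames, each with two estimated pitches.
frames = [np.array([146.83, 220.0]), np.array([146.83, 174.61])]
print(np.round(pitch_class_profile(frames), 2))   # D is most frequent (0.5); F and A get 0.25 each
```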

16 pages, 2532 KB  
Article
Towards Automatic Expressive Pipa Music Transcription Using Morphological Analysis of Photoelectric Signals
by Yuancheng Wang, Xuanzhe Li, Yunxiao Zhang and Qiao Wang
Sensors 2025, 25(5), 1361; https://doi.org/10.3390/s25051361 - 23 Feb 2025
Viewed by 1154
Abstract
The musical signal produced by plucked instruments often exhibits non-stationarity due to variations in the pitch and amplitude, making pitch estimation a challenge. In this paper, we assess different transcription processes and algorithms applied to signals captured by optical sensors mounted on a pipa—a traditional Chinese plucked instrument—played using a range of techniques. The captured signal demonstrates a distinctive arched feature during plucking. This facilitates onset detection to avoid the impact of the spurious energy peaks within vibration areas that arise from pitch-shift playing techniques. Subsequently, we developed a novel time–frequency feature, known as continuous time-period mapping (CTPM), which contains pitch curves. The proposed process can also be applied to playing techniques that mix pitch shifts and tremolo. When evaluated on four renowned pipa music pieces of varying difficulty levels, our fully time-domain-based onset detectors outperformed four short-time methods, particularly during tremolo. Our zero-crossing-based pitch estimator achieved a performance comparable to short-time methods with a far better computational efficiency, demonstrating its suitability for use in a lightweight algorithm in future work. Full article
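
The appeal of a zero-crossing-based pitch estimator is that it works entirely in the time domain with very little computation. The sketch below shows the basic idea on a synthetic tone; the authors' estimator and the CTPM feature are more elaborate, so this is only an illustration of why the approach is lightweight.

```python
import numpy as np


def zero_crossing_pitch(x: np.ndarray, sr: int) -> float:
    """Estimate F0 of a quasi-periodic segment from the spacing of positive-going zero crossings."""
    x = x - np.mean(x)                                    # remove DC offset
    signs = np.signbit(x)
    crossings = np.flatnonzero(signs[:-1] & ~signs[1:])   # negative -> non-negative transitions
    if len(crossings) < 2:
        return 0.0
    periods = np.diff(crossings) / sr                     # candidate periods in seconds
    return 1.0 / np.median(periods)                       # median is robust to spurious crossings


sr = 48000
t = np.arange(sr) / sr
segment = np.sin(2 * np.pi * 196.0 * t)                   # G3 test tone
print(round(zero_crossing_pitch(segment, sr), 1))          # ~195.9 Hz (limited by sample-grid quantisation)
```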
(This article belongs to the Special Issue Recent Advances in Smart Mobile Sensing Technology)

13 pages, 208 KB  
Article
Language of the Heart: Creating Digital Stories and Found Poetry to Understand Patients’ Experiences Living with Advanced Cancer
by Kathleen C. Sitter, Jessame Gamboa and Janet Margaret de Groot
Curr. Oncol. 2025, 32(2), 61; https://doi.org/10.3390/curroncol32020061 - 23 Jan 2025
Cited by 1 | Viewed by 1905
Abstract
In this article, we share our findings on patients’ experiences creating digital stories about living with advanced cancer, represented through found poetry. Over a period of 12 months, patients from the program “Managing Cancer and Living Meaningfully” (CALM) completed digital stories about their experiences living with cancer. Digital stories are short, personalized videos that combine photographs, imagery, narration, and music to communicate a personal experience about a topic of inquiry. Patient interviews were conducted about the digital storytelling process. Found poetry guided the analysis technique. It is a form of arts-based research that involves using words and phrases found in interview transcripts to create poems that represent research themes. This article begins with a brief overview of the psychosocial intervention CALM, arts in healthcare, and found poetry, followed by the project background. The found poems represent themes of emotional impact, legacy making, and support and collaboration. Findings also indicate the inherently relational aspect of digital storytelling as participants emphasized the integral role of the digital storytelling facilitator. What follows is a discussion on digital storytelling, which considers the role of found poetry in representing patient voices in the research process. Full article
(This article belongs to the Special Issue Transdisciplinary Holistic Psychosocial Oncology and Palliative Care)
15 pages, 856 KB  
Article
DAFE-MSGAT: Dual-Attention Feature Extraction and Multi-Scale Graph Attention Network for Polyphonic Piano Transcription
by Rui Cao, Zushuang Liang, Zheng Yan and Bing Liu
Electronics 2024, 13(19), 3939; https://doi.org/10.3390/electronics13193939 - 5 Oct 2024
Cited by 1 | Viewed by 1724
Abstract
Automatic music transcription (AMT) aims to convert raw audio signals into symbolic music. This is a highly challenging task in the fields of signal processing and artificial intelligence, and it holds significant application value in music information retrieval (MIR). Existing methods based on convolutional neural networks (CNNs) often fall short in capturing the time-frequency characteristics of audio signals and tend to overlook the interdependencies between notes when processing polyphonic piano with multiple simultaneous notes. To address these issues, we propose a dual attention feature extraction and multi-scale graph attention network (DAFE-MSGAT). Specifically, we design a dual attention feature extraction module (DAFE) to enhance the frequency and time-domain features of the audio signal, and we utilize a long short-term memory network (LSTM) to capture the temporal features within the audio signal. We introduce a multi-scale graph attention network (MSGAT), which leverages the various implicit relationships between notes to enhance the interaction between different notes. Experimental results demonstrate that our model achieves high accuracy in detecting the onset and offset of notes on public datasets. In both frame-level and note-level metrics, DAFE-MSGAT achieves performance comparable to the state-of-the-art methods, showcasing exceptional transcription capabilities. Full article
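
One way to picture attention over a note graph is masked self-attention in which each note attends only to notes it is connected to (for example, temporally overlapping or harmonically related ones). The sketch below uses plain masked dot-product attention as a stand-in; the multi-scale design and GAT-style scoring of MSGAT are not reproduced, and the adjacency here is purely illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NoteGraphAttention(nn.Module):
    """Illustrative attention over candidate notes restricted by a relation mask."""

    def __init__(self, dim: int = 32):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, notes: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # notes: (n_notes, dim) embeddings; adj: (n_notes, n_notes) 0/1 relation mask.
        scores = self.q(notes) @ self.k(notes).T / notes.shape[-1] ** 0.5
        scores = scores.masked_fill(adj == 0, float("-inf"))   # only attend along graph edges
        return F.softmax(scores, dim=-1) @ self.v(notes)


notes = torch.randn(5, 32)                      # 5 candidate notes
adj = torch.eye(5)                              # self-loops ...
adj[0, 1] = adj[1, 0] = 1                       # ... plus one harmonic/temporal relation
print(NoteGraphAttention()(notes, adj).shape)   # torch.Size([5, 32])
```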
(This article belongs to the Section Artificial Intelligence)

26 pages, 12966 KB  
Article
Optical Medieval Music Recognition—A Complete Pipeline for Historic Chants
by Alexander Hartelt, Tim Eipert and Frank Puppe
Appl. Sci. 2024, 14(16), 7355; https://doi.org/10.3390/app14167355 - 20 Aug 2024
Cited by 5 | Viewed by 2305
Abstract
Manual transcription of music is tedious work that can be greatly facilitated by optical music recognition (OMR) software. However, OMR software is error-prone, in particular for older handwritten documents. This paper introduces and evaluates a pipeline that automates the entire OMR workflow in the context of the Corpus Monodicum project, enabling the transcription of historical chants. In addition to typical OMR tasks such as staff line detection, layout detection, and symbol recognition, the rarely addressed tasks of text and syllable recognition and assignment of syllables to symbols are tackled. For quantitative and qualitative evaluation, we use documents written in square notation developed in the 11th–12th century, but the methods apply to many other notations as well. Quantitative evaluation measures the number of interventions necessary for correction, which is about 0.4% for layout recognition including the division of text into chants, 2.4% for symbol recognition including pitch and reading order, and 2.3% for syllable alignment with correct text and symbols. Qualitative evaluation showed an efficiency gain by a factor of about 9 compared to manual transcription with an elaborate tool. In a second use case with printed chants in similar notation from the “Graduale Synopticum”, the evaluation results for symbols are much better, except for syllable alignment, indicating the difficulty of this task. Full article

15 pages, 3038 KB  
Article
Korean Pansori Vocal Note Transcription Using Attention-Based Segmentation and Viterbi Decoding
by Bhuwan Bhattarai and Joonwhoan Lee
Appl. Sci. 2024, 14(2), 492; https://doi.org/10.3390/app14020492 - 5 Jan 2024
Cited by 1 | Viewed by 2381
Abstract
In this paper, first, we delved into the experiment by comparing various attention mechanisms in the semantic pixel-wise segmentation framework to perform frame-level transcription tasks. Second, the Viterbi algorithm was utilized by transferring the knowledge of the frame-level transcription model to obtain the vocal notes of Korean Pansori. We considered a semantic pixel-wise segmentation framework for frame-level transcription as the source task and a Viterbi algorithm-based Korean Pansori note-level transcription as the target task. The primary goal of this paper was to transcribe the vocal notes of Pansori music, a traditional Korean art form. To achieve this goal, the initial step involved conducting the experiments with the source task, where a trained model was employed for vocal melody extraction. To achieve the desired vocal note transcription for the target task, the Viterbi algorithm was utilized with the frame-level transcription model. By leveraging this approach, we sought to accurately transcribe the vocal notes present in Pansori performances. The effectiveness of our attention-based segmentation methods for frame-level transcription in the source task has been compared with various algorithms using the vocal melody task of the MedleyDB dataset, enabling us to measure the voicing recall, voicing false alarm, raw pitch accuracy, raw chroma accuracy, and overall accuracy. The results of our experiments highlight the significance of attention mechanisms for enhancing the performance of frame-level music transcription models. We also conducted a visual and subjective comparison to evaluate the results of the target task for vocal note transcription. Since there was no ground truth vocal note for Pansori, this analysis provides valuable insights into the preservation and appreciation of this culturally rich art form. Full article
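
The role of the Viterbi algorithm in the target task is to smooth noisy frame-level posteriors into stable note segments. The sketch below decodes the most likely pitch path with a simple switch-cost transition model; the transition model and penalty value are assumptions chosen only to show the smoothing effect, not the paper's configuration.

```python
import numpy as np


def viterbi_notes(posteriors: np.ndarray, switch_penalty: float = 4.0) -> np.ndarray:
    """Decode a smooth pitch sequence from frame-level posteriors.

    posteriors: (n_frames, n_pitches) probabilities per frame.
    switch_penalty: log-domain cost for changing pitch between frames;
    larger values give longer, more stable notes.
    """
    n_frames, n_pitches = posteriors.shape
    log_post = np.log(posteriors + 1e-9)
    trans = -switch_penalty * (1.0 - np.eye(n_pitches))   # staying is free, switching is penalised
    score = log_post[0].copy()
    backptr = np.zeros((n_frames, n_pitches), dtype=int)
    for t in range(1, n_frames):
        cand = score[:, None] + trans                     # (from_pitch, to_pitch) scores
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_post[t]
    path = np.zeros(n_frames, dtype=int)
    path[-1] = score.argmax()
    for t in range(n_frames - 1, 0, -1):                  # backtrack the best path
        path[t - 1] = backptr[t, path[t]]
    return path


# Toy example: noisy posteriors that mostly favour pitch 3, then pitch 7.
rng = np.random.default_rng(0)
post = rng.random((20, 12)) * 0.2
post[:10, 3] += 1.0
post[10:, 7] += 1.0
post /= post.sum(axis=1, keepdims=True)
print(viterbi_notes(post))   # a clean run of 3s followed by a run of 7s
```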
(This article belongs to the Section Computing and Artificial Intelligence)

20 pages, 686 KB  
Article
High-Quality and Reproducible Automatic Drum Transcription from Crowdsourced Data
by Mickaël Zehren, Marco Alunno and Paolo Bientinesi
Signals 2023, 4(4), 768-787; https://doi.org/10.3390/signals4040042 - 10 Nov 2023
Cited by 7 | Viewed by 4689
Abstract
Within the broad problem known as automatic music transcription, we considered the specific task of automatic drum transcription (ADT). This is a complex task that has recently shown significant advances thanks to deep learning (DL) techniques. Most notably, massive amounts of labeled data obtained from crowds of annotators have made it possible to implement large-scale supervised learning architectures for ADT. In this study, we explored the untapped potential of these new datasets by addressing three key points: First, we reviewed recent trends in DL architectures and focused on two techniques, self-attention mechanisms and tatum-synchronous convolutions. Then, to mitigate the noise and bias that are inherent in crowdsourced data, we extended the training data with additional annotations. Finally, to quantify the potential of the data, we compared many training scenarios by combining up to six different datasets, including zero-shot evaluations. Our findings revealed that crowdsourced datasets outperform previously utilized datasets, and regardless of the DL architecture employed, they are sufficient in size and quality to train accurate models. By fully exploiting this data source, our models produced high-quality drum transcriptions, achieving state-of-the-art results. Thanks to this accuracy, our work can be more successfully used by musicians (e.g., to learn new musical pieces by reading, or to convert their performances to MIDI) and researchers in music information retrieval (e.g., to retrieve information from the notes instead of audio, such as the rhythm or structure of a piece). Full article
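
The final step of converting frame-level drum activations into MIDI-style events is typically a peak-picking pass. The sketch below shows a minimal version with an assumed threshold and minimum inter-onset gap; it is independent of the specific models trained in the paper.

```python
import numpy as np


def pick_drum_onsets(activations: np.ndarray, fps: float, threshold: float = 0.5,
                     min_gap_s: float = 0.05):
    """Turn frame-level drum activations into onset events by simple peak picking.

    activations: (n_frames, n_drums) network outputs in [0, 1].
    Returns a time-sorted list of (time_in_seconds, drum_index) events.
    """
    events = []
    min_gap = int(round(min_gap_s * fps))
    for drum in range(activations.shape[1]):
        act = activations[:, drum]
        last = -min_gap - 1
        for t in range(1, len(act) - 1):
            is_peak = act[t] >= threshold and act[t] >= act[t - 1] and act[t] > act[t + 1]
            if is_peak and t - last > min_gap:            # suppress double triggers
                events.append((t / fps, drum))
                last = t
    return sorted(events)


# Toy activations at 100 frames per second: kick at 0.10 s and 0.60 s, snare at 0.35 s.
act = np.zeros((100, 2))
act[10, 0] = act[60, 0] = 0.9
act[35, 1] = 0.8
print(pick_drum_onsets(act, fps=100))   # [(0.1, 0), (0.35, 1), (0.6, 0)]
```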
(This article belongs to the Topic Research on the Application of Digital Signal Processing)
