1. Introduction
Automatic piano transcription (APT) is a crucial task in music information retrieval, which extracts the note sequences from a given piece of piano music. The primary objective of APT is multi-pitch estimation, which remains challenging for polyphonic music as it requires identifying multiple notes played simultaneously. APT has been widely used in various applications, including computer-generated music [1], piano music education [2], music auto-tagging [3], genre classification [4], and so on.
Early works for APT relied on spectrogram factorization, including non-negative matrix factorization (NMF) [5], sparse factorization [6], and so on. The NMF-based approaches assumed that the amplitude spectrogram of the piano audio was the product of two non-negative matrices: one for frequency templates and the other for pitch intensities. The pitch intensity matrix could then be estimated by minimizing the distance between the amplitude spectrogram matrix and the product matrix. Although these approaches can directly obtain the transcribed notes by decomposing the spectrogram, they cannot capture the temporal dynamics of the notes. Building upon this approach, Benetos et al. [7] introduced probabilistic latent component analysis into spectrogram factorization. More recently, Zhang et al. [8] presented a novel 2-D spectrogram for multi-pitch estimation, which calculated the cross-correlation between the pseudo 2-D spectrum and a predefined 2-D harmonic template.
Recently, deep learning has achieved great progress in the field of APT, including the recurrent neural network (RNN) [9], convolutional neural network (CNN) [10], deep convolutional network combined with long short-term memory (LSTM) and its variants [11,12], Transformer [13], graph convolutional network [14], and so on. Besides exploring different neural models, some studies devoted their efforts to evaluating various signal representations [15], deducing precise times of critical instants (such as onsets and offsets) [16], or eliminating overfitting [17]. As a consequence, APT accuracy has improved greatly in the past decade. Still, some problems remain unresolved.
As far as signal representation is concerned, various time-frequency transforms have been utilized in the existing works, such as the STFT [15,18], constant-Q transform [19], and log-Mel transform [17]. From [15], it can be observed that the log-Mel and constant-Q transforms have certain advantages in feature extraction, as they exhibit richer harmonic information at higher frequencies. However, these two spectrograms are still quite noisy: several notes may sound simultaneously, and each note contains the fundamental and multiple harmonics, making it difficult to accurately extract the concurrent notes.
To overcome this issue, Su et al. [20] combined the final two layers of the multi-layered cepstrum (MLC) and the combined frequency and periodicity (CFP) representations. This fusion mitigates the harmonics while strengthening the fundamental frequency components, so it has also been introduced into other related works [21]. Then, a CNN was used to extract the pitches. This approach reduces the complexity of the spectrogram, but it does not take into account the note-specific features in the subsequent network architecture. The advantages and disadvantages of these signal representations are listed in Table 1.
Pitched sounds are composed of fundamental and harmonic components. The fundamental frequency, also known as pitch, is the output of piano music transcription, and harmonic components help estimate the fundamental frequency, especially when the fundamental frequency is absent or weak. However, the interleaved harmonics of multiple concurrent notes increase the difficulty of fundamental frequency estimation, which constitutes a contradiction. In order to utilize the complementary information between fundamental frequency and harmonics, an APT network based on Mel cyclic and Mel STFT spectrograms is proposed in this work.
Specifically, the Mel cyclic spectrogram is constructed to suppress harmonic components and enhance fundamental frequencies. On the other hand, the STFT spectrogram is also utilized to provide complete spectral information. Additionally, to make use of note-specific features, Mel cyclic and Mel STFT spectrograms are fed into a dual-stream structure for specific note feature extraction. Finally, a multi-feature fusion module is proposed for combining the extracted note features and obtaining the final note sequences.
The main contributions of this work include the following: (1) To the best of our knowledge, the Mel cyclic spectrogram is utilized as the signal representation of polyphonic music for the first time, which significantly reduces the spectral complexity of piano music and facilitates pitch estimation. (2) An axial attention mechanism is incorporated into the frame-level note detection modules to model the temporal relevance of piano notes. (3) A multi-feature fusion module is proposed to aggregate the information of note onsets, offsets, and frame-level notes.
The rest of this paper is organized as follows. The cyclic spectrum is presented in Section 2. The proposed APT network based on the two spectrograms is elaborated in Section 3. The experimental results and discussions are provided in Section 4. Finally, some conclusions are drawn in Section 5.
2. Cyclic Spectrum
A cyclic spectrum is commonly used for analyzing cyclostationary signals [22]. Both the mean value $m_x(t)$ and the autocorrelation function $R_x(t, \tau)$ of a cyclostationary signal $x(t)$ exhibit periodicity with respect to a given time difference $\tau$, as shown in Equations (1) and (2):

$$m_x(t) = m_x(t + T), \quad (1)$$

$$R_x(t, \tau) = R_x(t + T, \tau), \quad (2)$$

where $T$ is the signal period, and $t$ represents the time index.
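As a quick numerical illustration of this periodicity (not part of the original derivation), the following NumPy sketch estimates the ensemble autocorrelation of a synthetic amplitude-modulated signal and checks that it repeats with the modulation period; the signal construction and all parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50                    # cyclostationary period in samples (illustrative)
n_real = 4000             # number of realizations for the ensemble average
t = np.arange(3 * T)

# Amplitude-modulated carrier with a random phase per realization, plus noise.
# The random phase makes each realization different, but the second-order
# statistics repeat with the modulation period T.
phase = rng.uniform(0, 2 * np.pi, size=(n_real, 1))
x = np.cos(2 * np.pi * t / T) * np.cos(2 * np.pi * t / 7 + phase)
x += 0.1 * rng.standard_normal(x.shape)

# Ensemble estimate of the autocorrelation R(t, tau) = E[x(t) x(t + tau)]
tau = 3
R = np.mean(x[:, : 2 * T] * x[:, tau : 2 * T + tau], axis=0)

# Periodicity check: R(t, tau) should approximately equal R(t + T, tau)
err = np.max(np.abs(R[:T] - R[T : 2 * T]))
print(err < 0.1)
```

The residual `err` is nonzero only because the ensemble average is estimated from finitely many realizations.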
Since the musical signal can be viewed as a cyclostationary signal in a short time interval, it satisfies approximate cyclostationarity. According to Fourier analysis theory, a periodic function can be represented by a Fourier series. Thus, for the autocorrelation function $R_x(t, \tau)$ of a cyclostationary signal $x(t)$, its Fourier coefficients $R_x^{\alpha}(\tau)$ can be expressed as

$$R_x^{\alpha}(\tau) = \frac{1}{T} \int_{0}^{T} R_x(t, \tau)\, e^{-j 2 \pi \alpha t} \, dt, \quad (3)$$

where $\alpha = i/T$ represents the cyclic frequency, and $i$ is an integer.
Equation (3) denotes the cyclic autocorrelation function for a given time difference $\tau$. According to Fourier analysis theory, the power spectrum of a signal can be obtained by performing a Fourier transform on the cyclic autocorrelation function with respect to the time difference $\tau$. Therefore, the cyclic spectral correlation function is defined as

$$S_x^{\alpha}(f) = \int_{-\infty}^{+\infty} R_x^{\alpha}(\tau)\, e^{-j 2 \pi f \tau} \, d\tau, \quad (4)$$

where $S_x^{\alpha}(f)$ denotes a bivariate function with respect to the frequency $f$ and the cyclic frequency $\alpha$.
The STFT and cyclic spectrograms of one segment of piano music are shown in Figure 1. As can be seen from this figure, the cyclic spectrogram is much clearer than the STFT spectrogram, with harmonics suppressed greatly, while the STFT spectrogram provides more detailed information in the lower frequency range.
3. Proposed Approach
To mitigate the interference of harmonic components inherent in the STFT and accentuate the lower-frequency elements within the cyclic spectrogram, a novel dual-stream architecture for automatic piano transcription is proposed in this paper. The network architecture is depicted in Figure 2. First, the Mel-scaled STFT and Mel-scaled cyclic spectrograms are computed. Subsequently, the Mel cyclic spectrogram is fed into three modules for onset, offset, and frame-level note detection. In parallel, the Mel STFT spectrogram is fed into another frame-level note detection module. Additionally, to capture the temporal relevance of piano notes, an axial attention mechanism is incorporated into both frame-level note detection modules. Finally, considering the natural energy decay characteristic of piano notes, the outputs of the onset, offset, and frame-level note detection modules are aggregated. The composite feature is subsequently input into the multi-feature fusion module to derive the note sequence. This section will provide a detailed explanation of the proposed network.
3.1. Computation of Mel Cyclic Spectrogram
Each note contains the fundamental and harmonic components, which are located at integral multiples of the fundamental frequency. The concurrent notes are ubiquitous in piano music, resulting in a rather complex spectrogram. Fortunately, the cyclic spectrum can suppress the harmonic components, making the fundamental frequency information more significant. Therefore, it is introduced herein for signal representation.
In order to improve the computational efficiency, we discretize the calculation process of the cyclic spectrogram. Firstly, the piano musical signal is split into frames. Then, the discretized frame-level cyclic autocorrelation function is calculated as

$$R_x^{\alpha}[m] = \frac{1}{N} \sum_{n=0}^{N-1} x[n+m]\, x^{*}[n]\, e^{-j 2 \pi \alpha n / N}, \quad (5)$$

where $x[n]$ represents the time-domain signal of one frame, $k$ denotes the frequency index, $N$ is the frame length, $m$ is the time difference, and $(\cdot)^{*}$ denotes the conjugate operator.

Subsequently, by taking the discrete Fourier transform of $R_x^{\alpha}[m]$ over the time difference $m$, the cyclic spectral correlation function can be further simplified as

$$S_x^{\alpha}[k] = \sum_{m=0}^{N-1} R_x^{\alpha}[m]\, e^{-j 2 \pi k m / N} = \frac{1}{N}\, X[k]\, X^{*}[k - \alpha], \quad (6)$$

where $X[k]$ represents the Fourier transform of $x[n]$. $S_x^{\alpha}[k]$ is a bivariate function that relates to the cyclic frequency $\alpha$ and frequency $k$. The frame-wise cyclic spectrum can be obtained by fixing the frequency index $k$ in Equation (6), yielding a one-dimensional function of the cyclic frequency $\alpha$. Similarly, the cyclic spectrogram of one piece of audio signal can be obtained frame by frame.
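The FFT shortcut in Equation (6) can be checked against the direct double-sum definition; below is a minimal NumPy sketch, assuming circular indexing within the frame (which is how the modular shift k − α is handled here).

```python
import numpy as np

def cyclic_spectrum(frame: np.ndarray, alpha: int) -> np.ndarray:
    """Frame-level cyclic spectral correlation S[k] = X[k] X*[k - alpha] / N,
    computed directly from the FFT with circular frequency indexing."""
    N = len(frame)
    X = np.fft.fft(frame)
    return X * np.conj(np.roll(X, alpha)) / N   # roll by alpha gives index k - alpha

def cyclic_spectrum_direct(frame: np.ndarray, alpha: int) -> np.ndarray:
    """Reference evaluation: discretized cyclic autocorrelation over the lag m,
    followed by a DFT with respect to m."""
    N = len(frame)
    n = np.arange(N)
    R = np.array([
        np.mean(frame[(n + m) % N] * np.conj(frame) * np.exp(-2j * np.pi * alpha * n / N))
        for m in range(N)
    ])
    return np.fft.fft(R)

rng = np.random.default_rng(1)
x = rng.standard_normal(64)
# Both routes agree; at alpha = 0 the result reduces to the ordinary power spectrum.
assert np.allclose(cyclic_spectrum(x, 5), cyclic_spectrum_direct(x, 5))
```

The FFT form replaces the O(N^2) double sum with two FFTs and an element-wise product, which is what makes frame-by-frame computation of the cyclic spectrogram practical.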
In order to improve the computational efficiency of the model, the linear frequencies corresponding to the cyclic spectrum are mapped to the Mel-scale frequencies by using the Mel filter banks. The mapping function from linear frequency $f$ to Mel-scale frequency $f_{\mathrm{mel}}$ is defined as

$$f_{\mathrm{mel}} = 2595 \log_{10}\left(1 + \frac{f}{700}\right). \quad (7)$$

The Mel filter banks are a series of triangular filters whose amplitude is 1 at the center frequency point and gradually decreases to 0 on both sides of the filter. The amplitude transfer functions of the Mel filter banks can be defined as

$$H_b[k] = \begin{cases} 0, & k < f(b-1) \\ \dfrac{k - f(b-1)}{f(b) - f(b-1)}, & f(b-1) \le k \le f(b) \\ \dfrac{f(b+1) - k}{f(b+1) - f(b)}, & f(b) < k \le f(b+1) \\ 0, & k > f(b+1) \end{cases} \quad (8)$$

where $f(b)$ represents the linear frequency of the $b$th Mel frequency, and $H_b[k]$ represents the amplitude function of the $b$th filter.

Correspondingly, the Mel cyclic spectrogram $S_{\mathrm{mel}}^{\alpha}[b]$ is obtained through the series of filter banks with the transfer functions defined in Equation (8), i.e.,

$$S_{\mathrm{mel}}^{\alpha}[b] = \sum_{k} H_b[k]\, \big| S_x^{\alpha}[k] \big|. \quad (9)$$
Similarly, the Mel spectrogram can also be derived by filtering the STFT spectrogram with the same Mel filter banks. Through the Mel filter banks, the dimensions of both the STFT and cyclic spectrograms are reduced greatly.
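For concreteness, the filter-bank construction and its application can be sketched as follows; the 229-band, 2048-point, 16 kHz configuration matches the experimental setup in Section 4.2, while the HTK-style break-point placement is an assumption.

```python
import numpy as np

def hz_to_mel(f):
    """HTK-style linear-to-Mel frequency mapping."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_banks(n_mels=229, n_fft=2048, sr=16000):
    """Triangular filters H_b[k]: unit gain at the bth Mel center frequency,
    falling linearly to zero at the two neighboring centers."""
    n_bins = n_fft // 2 + 1
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = mel_to_hz(mel_pts) / (sr / 2) * (n_bins - 1)   # fractional FFT bins
    H = np.zeros((n_mels, n_bins))
    k = np.arange(n_bins)
    for b in range(1, n_mels + 1):
        left, center, right = bins[b - 1], bins[b], bins[b + 1]
        rising = (k - left) / (center - left)
        falling = (right - k) / (right - center)
        H[b - 1] = np.clip(np.minimum(rising, falling), 0.0, None)
    return H

H = mel_filter_banks()
spec = np.abs(np.fft.rfft(np.random.randn(2048)))   # |spectrum| of one frame
mel_spec = H @ spec                                 # 1025 bins -> 229 Mel bands
print(H.shape, mel_spec.shape)
```

The same matrix `H` can filter either the magnitude STFT or the magnitude cyclic spectrum of a frame, which is how both Mel spectrograms share one bank.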
3.2. Frame-Level Note Detection Module
The two frame-level note detection modules with either the Mel STFT spectrogram or the Mel cyclic spectrogram as input share the same structure. They both consist of four layers: convolutional, axial attention, dense, and bidirectional gated recurrent unit (BiGRU) layers, as illustrated in Figure 2.
In more detail, the Mel STFT spectrogram or the Mel cyclic spectrogram is first fed into the CNN to extract the local spatial information and obtain the feature map $F$. Then, to model the temporal relevance of piano notes across frames, we incorporate an axial attention layer [23] after the CNN to enhance the correlation among different frames along the axial direction.
The structure of axial attention is shown in Figure 3. There are four steps in the axial attention mechanism. First, three separate 1 × 1 convolution layers are utilized to obtain three feature matrices: $Q$, $K$, and $V$, respectively. To reduce the computational complexity, the channel number $C'$ of $Q$ and $K$ is smaller than $C$, while the size of $V$ remains unchanged. Second, the $Q$ and $K$ feature matrices are multiplied and followed by a softmax function to obtain the weight matrix $A$, which reflects the correlation of notes at the same position of each frame in the feature map. Thirdly, to capture compact contextual information, the weight matrix $A$ is multiplied by the feature matrix $V$ to derive $O$. Finally, $O$ is added to $F$ to obtain the output matrix $F'$.
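A schematic NumPy sketch of these four steps is given below, with attention computed along the frame (time) axis; the matrix names, the choice of axis, and the plain-matmul stand-ins for the 1 × 1 convolutions are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def axial_attention(F, Wq, Wk, Wv):
    """Self-attention along the frame axis of a (C, T, M) feature map.
    Wq, Wk project C -> C' channels (C' < C to save computation); Wv keeps C.
    A 1x1 convolution is just a channel-mixing matrix, so the three
    projections are plain matrix multiplications here."""
    C, T, M = F.shape
    out = np.empty_like(F)
    for j in range(M):                    # attend within each frequency column
        S = F[:, :, j]                    # (C, T) slice
        Q, K, V = Wq @ S, Wk @ S, Wv @ S  # step 1: three projections
        A = softmax(Q.T @ K, axis=-1)     # step 2: (T, T) frame-to-frame weights
        O = V @ A.T                       # step 3: weighted context per frame
        out[:, :, j] = S + O              # step 4: residual connection
    return out

C, C_red, T, M = 8, 4, 7, 14
rng = np.random.default_rng(0)
F = rng.standard_normal((C, T, M))
Wq, Wk = rng.standard_normal((C_red, C)), rng.standard_normal((C_red, C))
Wv = rng.standard_normal((C, C))
out = axial_attention(F, Wq, Wk, Wv)
print(out.shape)
```

Restricting attention to one axis keeps the weight matrix at T × T per column instead of (T·M) × (T·M) for full 2-D attention, which is the complexity saving axial attention is designed for.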
After axial attention, the result is fed into a dense layer and BiGRU to derive piano note sequences.
The loss function is crucial for APT. As a multi-label classification problem, the labels are set to 1 at the positions where the notes are played, and to 0 at the other positions. The frame-level note detection module predicts the probability of each note. Therefore, following [11], the loss function of the frame-level note detection module $L_{\mathrm{frame}}$ is defined as

$$L_{\mathrm{frame}} = -\sum_{t=1}^{T} \sum_{n=1}^{N} \left[ y_{t,n} \log p_{t,n} + (1 - y_{t,n}) \log(1 - p_{t,n}) \right], \quad (10)$$

where $T$ refers to the number of frames, $N$ is 88 (equal to the number of piano notes), $y_{t,n}$ represents the label value of the $n$th note at the $t$th frame, and $p_{t,n}$ represents the predicted probability value of the $n$th note at the $t$th frame.
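As a sanity check, the frame-level loss can be evaluated on a random piano roll; the sketch below assumes the standard binary cross-entropy form with a small clipping constant added for numerical stability.

```python
import numpy as np

def frame_loss(y, p, eps=1e-7):
    """Binary cross-entropy summed over T frames and N = 88 notes."""
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
    return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

T, N = 100, 88
rng = np.random.default_rng(0)
y = (rng.random((T, N)) < 0.05).astype(float)   # sparse multi-label targets
p = rng.random((T, N))                          # predicted note probabilities
print(frame_loss(y, p))
```

A perfect prediction (p = y) drives the loss toward zero, while random probabilities give a large positive value, matching the intended training signal.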
3.3. Onset and Offset Detection Modules
The frame-level note detection module does not utilize the note onsets and offsets, which contain additional information. In order to make full use of the onset and offset information to enhance the accuracy of piano transcription, the note onset and offset modules are applied to the Mel STFT spectrogram, as illustrated in Figure 2. They consist of three layers: convolutional, dense, and BiGRU layers. The corresponding loss functions are defined as

$$L_{\mathrm{on}} = -\sum_{t=1}^{T} \sum_{n=1}^{N} \left[ y_{t,n}^{\mathrm{on}} \log p_{t,n}^{\mathrm{on}} + (1 - y_{t,n}^{\mathrm{on}}) \log(1 - p_{t,n}^{\mathrm{on}}) \right], \quad (11)$$

$$L_{\mathrm{off}} = -\sum_{t=1}^{T} \sum_{n=1}^{N} \left[ y_{t,n}^{\mathrm{off}} \log p_{t,n}^{\mathrm{off}} + (1 - y_{t,n}^{\mathrm{off}}) \log(1 - p_{t,n}^{\mathrm{off}}) \right], \quad (12)$$

where $L_{\mathrm{on}}$ and $L_{\mathrm{off}}$ are the loss functions of the onset and offset detection modules, respectively; $y_{t,n}^{\mathrm{on}}$ and $y_{t,n}^{\mathrm{off}}$ correspond to the label values for the onset and offset of the $n$th note at the $t$th frame, respectively; and $p_{t,n}^{\mathrm{on}}$ and $p_{t,n}^{\mathrm{off}}$ denote the predicted probability values for the onset and offset of the $n$th note at the $t$th frame, respectively.
3.4. Multi-Feature Fusion Module
The above-mentioned three modules conduct piano music transcription from different perspectives. To utilize the multi-feature information, a multi-feature fusion scheme is proposed. As illustrated in Figure 4, the four matrices generated from the note onset, offset, and frame-level note detection modules are stacked to construct $X$. Afterwards, $X$ is average-pooled and max-pooled along the channel dimension to generate the average pooling matrix $F_{\mathrm{avg}}$ and the maximum pooling matrix $F_{\mathrm{max}}$, respectively, i.e.,

$$F_{\mathrm{avg}} = \mathrm{AvgPool}(X), \quad F_{\mathrm{max}} = \mathrm{MaxPool}(X). \quad (13)$$

Then, a convolutional layer is used to associate the interaction features of $F_{\mathrm{avg}}$ and $F_{\mathrm{max}}$ to obtain an enhanced feature matrix of size $T \times N$. The enhanced feature matrix integrates the note onset, offset, and sustain information.

Subsequently, the enhanced feature matrix is dot-produced with each slice of $X$ to obtain the fused feature matrix $Y$, defined as

$$Y = f\left( \left[ F_{\mathrm{avg}}; F_{\mathrm{max}} \right] \right) \odot X, \quad (14)$$

where $f(\cdot)$ represents the convolutional layer, and ⊙ denotes the slice-wise dot product operator.
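The pooling-and-reweighting step can be sketched as follows; a fixed average of the two pooled maps stands in for the learned convolutional layer, so this is an illustrative simplification rather than the trained module.

```python
import numpy as np

def fuse(stack):
    """Multi-feature fusion sketch: stack is (4, T, N) -- the onset, offset,
    and two frame-level outputs. Channel-wise average and max pooling give
    two (T, N) maps; a fixed 2->1 mix (stand-in for the learned conv layer)
    produces one enhanced map, which re-weights every slice of the stack."""
    avg = stack.mean(axis=0)            # (T, N) average pooling over channels
    mx = stack.max(axis=0)              # (T, N) max pooling over channels
    enhanced = 0.5 * avg + 0.5 * mx     # placeholder for the conv layer f(.)
    return enhanced[None] * stack       # slice-wise (Hadamard) product, (4, T, N)

rng = np.random.default_rng(0)
stack = rng.random((4, 100, 88))        # 100 frames x 88 notes per module
fused = fuse(stack)
print(fused.shape)
```

The enhanced map acts as a shared gate: positions where onset, offset, and sustain evidence agree are amplified in all four slices before the final BiGRU.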
Finally, the BiGRU is used to model the timing information, after which the note probability matrix is obtained using the sigmoid activation function. To obtain note sequences, a simple threshold strategy is employed herein. Specifically, the note probability matrix is binarized with a threshold of 0.5. As a consequence, consecutive frames of the same pitch with probabilities larger than 0.5 form a note.
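A minimal sketch of this thresholding step, converting the binarized probability matrix into (MIDI number, onset, offset) events; the 32 ms hop matches Section 4.2, and mapping roll index 0 to MIDI 21 (the lowest piano key) is an assumption.

```python
import numpy as np

def to_notes(prob, threshold=0.5, hop_s=0.032):
    """Binarize the (T frames x 88 notes) probability matrix and merge runs of
    consecutive active frames into (midi, onset_s, offset_s) note events."""
    active = prob >= threshold
    notes = []
    for n in range(active.shape[1]):
        col = active[:, n].astype(int)
        # Rising/falling edges of the padded activity column mark note bounds.
        edges = np.diff(np.concatenate(([0], col, [0])))
        starts = np.flatnonzero(edges == 1)
        ends = np.flatnonzero(edges == -1)
        for s, e in zip(starts, ends):
            notes.append((n + 21, s * hop_s, e * hop_s))  # MIDI 21 = A0
    return sorted(notes, key=lambda note: note[1])

prob = np.zeros((10, 88))
prob[2:5, 39] = 0.9        # roll index 39 -> MIDI 60, active frames 2..4
print(to_notes(prob))
```
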
4. Experimental Results and Discussions
4.1. Dataset and Metrics
The MAPS dataset [10], with approximately 60 h of piano music recordings, is utilized for performance evaluation. This dataset encompasses 270 piano music recordings of 9 categories, and each category contains 30 piano recordings. Notably, 7 of these categories are generated using piano synthesis software, while the other 2 categories are obtained from a Yamaha Disklavier. The sampling rate is 44.1 kHz. Each recording is labeled with the sustain and onset information for different notes. In our experiments, the 30 audio recordings from the AkPnBcht category were used as the test set, and 15 randomly selected audio recordings from the other 8 categories were used as the validation set for adjusting the hyperparameters, while the remaining 225 recordings were used for model training. That is, the percentages of the training, validation, and test sets are 83.33%, 5.56%, and 11.11%, respectively.
Three metrics are used to assess the accuracy of the pitch sequences, i.e., precision, recall, and F-measure [24]. Four kinds of error rates are also evaluated, i.e., the substitution error rate ($E_{\mathrm{subs}}$), missing error rate ($E_{\mathrm{miss}}$), false alarm error rate ($E_{\mathrm{fa}}$), and total error rate ($E_{\mathrm{tot}}$) [25].
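For reference, frame-level versions of these metrics can be computed as below; the error-rate formulas follow the standard Poliner–Ellis definitions, which is an assumption about how [25] defines them.

```python
import numpy as np

def frame_metrics(ref, est):
    """Frame-level precision/recall/F-measure plus the total error rate.
    ref, est: boolean piano rolls of shape (T, 88)."""
    tp = np.sum(ref & est)
    p = tp / max(est.sum(), 1)
    r = tp / max(ref.sum(), 1)
    f = 2 * p * r / max(p + r, 1e-12)
    n_ref = ref.sum(axis=1)             # reference pitches per frame
    n_sys = est.sum(axis=1)             # transcribed pitches per frame
    n_corr = (ref & est).sum(axis=1)    # correct pitches per frame
    denom = max(n_ref.sum(), 1)
    e_subs = np.minimum(n_ref - n_corr, n_sys - n_corr).sum() / denom
    e_miss = np.maximum(n_ref - n_sys, 0).sum() / denom
    e_fa = np.maximum(n_sys - n_ref, 0).sum() / denom
    return p, r, f, e_subs + e_miss + e_fa

ref = np.zeros((4, 88), bool)
ref[:, 40] = True                       # one sustained reference note
est = ref.copy()                        # perfect transcription
print(frame_metrics(ref, est))
```

With a perfect transcription all three accuracy metrics reach 1 and the total error rate is 0; an empty transcription yields zero recall and a total error rate of 1.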
4.2. Experimental Setup
The piano music in the dataset is processed in the following way. Firstly, the original stereo piano music signal is converted into a monophonic signal by taking the arithmetic mean, and then downsampled to a sampling rate of 16 kHz. The duration of each frame for both the cyclic and STFT spectrograms is 128 ms (with a window length of 2048 points), and the time interval is 32 ms. To improve the computational efficiency, the cyclic spectrogram and STFT spectrogram are filtered using 229 Mel filter banks [26] to obtain the Mel cyclic spectrogram and Mel spectrogram, respectively. In this way, the frequency dimension of the spectrograms is reduced from 1025 to 229.
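These preprocessing steps can be sketched as follows; the linear-interpolation resampler is a crude stand-in for a proper anti-aliased polyphase resampler and is used here only to keep the example self-contained.

```python
import numpy as np

def preprocess(stereo, sr_in=44100, sr_out=16000, win=2048, hop=512):
    """Preprocessing sketch: stereo -> mono by arithmetic mean, naive linear
    resampling to 16 kHz, then framing with a 128 ms window (2048 samples)
    and a 32 ms hop (512 samples)."""
    mono = stereo.mean(axis=0)
    t_out = np.arange(int(len(mono) * sr_out / sr_in)) / sr_out
    mono16 = np.interp(t_out, np.arange(len(mono)) / sr_in, mono)
    n_frames = 1 + (len(mono16) - win) // hop
    return np.stack([mono16[i * hop : i * hop + win] for i in range(n_frames)])

stereo = np.random.randn(2, 44100)   # 1 s of fake stereo audio at 44.1 kHz
frames = preprocess(stereo)
print(frames.shape)
```

One second of audio yields 16,000 mono samples and hence 28 overlapping frames, each of which is then transformed and Mel-filtered as described above.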
Four detection modules share the same structure of convolutional layers. There are 4 convolutional layers for each Conv2D block, and the number of output feature maps is 16, 32, 64, and 128, respectively. The size of the convolutional kernel is with a random dropout rate of 0.5. The activation function of convolutional layers is ReLU. The input batch size is 200. Seven consecutive frames of the Mel cyclic spectrogram are concatenated together to form the feature of the current frame, so the input of the detection module is a tensor in shape (200, 1, 7, 229). The feature maps after the convolutional layers are of dimensions (200, 128, 7, 14). Then, the feature maps are flattened along the frequency and channel dimensions to (200, 7, 1792), and the shapes after dense layers are (200, 7, 512). These tensors are then input into the BiGRU, whose output layer size is 256.
The input channel number of the frame-level note detection module, C, is 128, and the reduced channel number C′ in the axial attention layer is set to 64. The output layers of the onset, offset, and frame detection modules are activated by the sigmoid function. The threshold after the last sigmoid function for determining whether a note is active is 0.5.
The label of each frame is a vector of length 88. An entry is set to 1 when the corresponding note is active, and 0 otherwise. The loss function is optimized by the Adam optimizer with a learning rate of 0.0006 for training. The experiments are conducted on Ubuntu 18.04.6 LTS with an NVIDIA RTX 3080 GPU. The code is written in Python 3.9 in the PyTorch framework.
4.3. Reference Approaches
Six typical APT approaches are used for performance comparison, including spectral peaks and non-peak regions modeling (SPNRM) [18], the sound space-based spectral factorization (S3F) [7], the pseudo 2-D spectrum-based approach (P2SB) [8], note transcription using an RNN (RNN) [9], joint onset and frame estimation using a CNN and bidirectional long short-term memory networks (CBLSTM) [11], and high-resolution piano transcription (HPT) [16].
These approaches are chosen since they cover all categories. For instance, S3F belongs to the classical spectrogram factorization category. P2SB is based on a two-dimensional spectral representation. SPNRM is devoted to modeling spectral peaks. RNN, CBLSTM, and HPT are typical deep learning-based approaches. All parameters of the reference methods are the same as their original settings. To be fair, all deep learning-based methods are trained and tested using the same datasets as the proposed one.
4.4. Evaluation Results and Discussions
4.4.1. Ablation Experiment
As mentioned before, the proposed approach is based on a dual-stream structure using both the Mel cyclic and Mel STFT spectrograms. To verify the effectiveness of the dual-stream structure, we performed an ablation experiment on the validation dataset. In more detail, the proposed approach is trained on the same training set, and its performance on the validation dataset is tested with either the Mel cyclic spectrogram branch or the Mel STFT spectrogram branch omitted. The experimental results are provided in Table 2.
As shown in this table, the proposed approach based on both the Mel cyclic and STFT spectrograms achieves the highest accuracy and the lowest total error rate. This observation indicates that both streams contribute to the performance. In terms of F-measure, the multi-feature fusion scheme based on both spectrograms is 4.1% higher than the Mel cyclic spectrogram branch alone and 2.9% higher than the Mel spectrogram branch alone, and its $E_{\mathrm{tot}}$ is 6.2% lower and 4.3% lower, respectively.
In addition, the branch with only the Mel cyclic spectrogram obtains a relatively lower recall rate and a higher total error rate. This phenomenon might be due to its lower frequency resolution. As illustrated in Figure 1, the lower frequency components of the cyclic spectrogram are weak. If the frequency resolution of the cyclic spectrum were increased, the accuracy could be improved, but the computation load would also be heavier.
4.4.2. Comparison with Reference Approaches
The piano transcription results on MAPS-AkPnBcht, with respect to precision, recall, and F-measure, are shown in Figure 5. To provide more intuitive results, the mean accuracies of the different methods are listed in Table 3.
It can be found from Figure 5 and Table 3 that the proposed approach achieves the highest F-measure and recall, and the third highest precision. The precision of the proposed approach is lower than those of HPT and CBLSTM, indicating that there is still work to do to reduce the false positives.
Additionally, among the compared approaches, S3F achieves a higher precision. However, its recall is the lowest, resulting in a lower F-measure. From this observation, we can infer that S3F estimates relatively few notes with a low false alarm rate, but misses many notes. Similar conclusions can be drawn for SPNRM.
The corresponding error rates for all approaches are shown in Table 4. It can be observed that the proposed approach also obtains the second lowest total error rate. Its false alarm rate is higher than those of HPT and CBLSTM, which confines its precision. Moreover, S3F consistently achieves a lower substitution error rate and false alarm error rate, but it obtains the highest missing error rate. This observation confirms that a significant number of actual notes are missed.
In order to see whether the differences between the approaches are significant, we conducted a statistical significance analysis on all test recordings. A paired-sample t-test is performed between the proposed approach and each reference approach in terms of precision, recall, and F-measure. The results are listed in Table 5, Table 6 and Table 7.
It can be observed from Table 5 that HPT and CBLSTM achieve higher precision values than the proposed approach, but the precision differences between the proposed approach and CBLSTM, HPT, and S3F are not significant, with p-values higher than 0.2. It can also be observed from Table 6 that the proposed approach obtains the highest recall; its superiority over HPT is not statistically significant (p = 0.191), while it significantly outperforms the other approaches in terms of recall. Finally, it can be found from Table 7 that the proposed approach obtains the highest F-measure, which is consistent with Figure 5. The difference between the proposed approach and HPT with respect to F-measure is insignificant, with a p-value of 0.574, while the proposed approach significantly outperforms the other approaches.
4.4.3. Piano Transcription Example
To provide a more intuitive illustration of the piano transcription results, the ground truth labels and the transcription results of these approaches on one excerpt are shown in Figure 6. This excerpt is 'MAPS_MUS_chpn-p4_AkPnBcht.wav', taken from the evaluation dataset.
From Figure 6, it can be observed that the piano transcription results of RNN, CBLSTM, HPT, and the proposed approach are much better than those of S3F, P2SB, and SPNRM, which is consistent with the previous evaluation results. Only HPT and the proposed approach correctly estimate the short note segments with MIDI number 35 during the interval between 2.5 and 4 s. That is because another note sounds exactly one octave above this one at the same time; discriminating notes exactly one octave apart thus remains an unsolved problem. Additionally, the notes produced by the deep learning approaches are more continuous, indicating that they model temporal dependency better.
4.4.4. Computational Complexity Comparison
It can be seen from Table 5, Table 6 and Table 7 that the performance of the deep learning-based approaches is diverse. Their parameter amounts and FLOPs are therefore also reported, with the results given in Table 8. Obviously, RNN is the most lightweight, with the fewest parameters and FLOPs, much fewer than the other reference methods. On the other hand, CBLSTM requires the most parameters and FLOPs.
Furthermore, it can be seen from Table 8 that RNN is the most computationally efficient, while the proposed approach ranks second. The statistical significance results indicate that the proposed approach performs on par with HPT, yet its model complexity and parameter amount are much smaller than those of HPT. Therefore, the proposed approach achieves comparable or better performance with a much smaller model size than the reference methods.
5. Conclusions
In this work, an APT approach based on a dual-stream structure is proposed. The Mel cyclic spectrogram is introduced herein to effectively mitigate the interference of the harmonic components of notes, while the Mel spectrogram is also utilized to supplement the information in the lower frequency range. Specifically, the cyclic and STFT spectrograms are first computed frame by frame and filtered using Mel filter banks to obtain the Mel cyclic and Mel spectrograms, respectively. Then, the frame-level note detection, onset, and offset detection modules are utilized to extract specific piano note features in the dual-stream structure. Considering the temporal dependency of notes, axial attention is incorporated into the frame-level note detection modules. Finally, the features of the different detection modules are fused together to deduce the piano notes. To the best of our knowledge, this is the first work that models the piano transcription problem using these two spectrograms as input representations. Experimental results demonstrate that the two spectrograms provide complementary information to each other, and the proposed approach achieves promising performance.
In the future, we would try to explore a more sophisticated feature fusion strategy to improve the information interaction between the two branches.