1. Introduction
Automatic piano transcription (APT) is a crucial task in music information retrieval, which extracts the note sequences from a given piece of piano music. The primary objective of APT is multi-pitch estimation, which remains challenging for polyphonic music as it requires identifying multiple notes played simultaneously. APT has been widely used in various applications, including computer-generated music [1], piano music education [2], music auto-tagging [3], genre classification [4], and so on.
Early works for APT relied on spectrogram factorization, including non-negative matrix factorization (NMF) [5], sparse factorization [6], and so on. The NMF-based approaches assumed that the amplitude spectrogram of the piano audio was the product of two non-negative matrices: one for frequency templates and the other for pitch intensities. The pitch intensity matrix could then be estimated by minimizing the distance between the amplitude spectrogram matrix and the product matrix. Although these approaches can directly obtain the transcribed notes by decomposing the spectrogram, they cannot capture the temporal dynamics of the notes. Building upon this approach, Benetos et al. [7] introduced probabilistic latent component analysis into spectrogram factorization. More recently, Zhang et al. [8] presented a novel 2-D spectrogram for multi-pitch estimation, which calculated the cross-correlation between the pseudo 2-D spectrum and a predefined 2-D harmonic template.
Recently, deep learning has achieved great progress in the field of APT, including the recurrent neural network (RNN) [9], convolutional neural network (CNN) [10], deep convolutional network combined with long short-term memory (LSTM) and its variants [11,12], Transformer [13], graph convolutional network [14], and so on. Besides exploring different neural models, some studies devoted their efforts to evaluating various signal representations [15], deducing precise times of critical instants (such as onsets and offsets) [16], or eliminating overfitting [17]. As a consequence, APT accuracy has improved greatly in the past decade. Still, some problems remain unresolved.
As far as signal representation is concerned, various time-frequency transforms have been utilized in the existing works, such as the STFT [15,18], constant-Q transform [19], and log-Mel transform [17]. From [15], it can be observed that the log-Mel and constant-Q transforms have certain advantages in feature extraction, as they exhibit richer harmonic information at higher frequencies. However, these two spectrograms are still quite noisy: several notes may sound simultaneously, and each note contains the fundamental and multiple harmonics, making it difficult to accurately extract the concurrent notes.
To overcome this issue, Su et al. [20] combined the final two layers of the multi-layered cepstrum (MLC) and the combined frequency and periodicity (CFP) representations. This fusion mitigates the harmonics while strengthening the fundamental frequency components, so it has also been introduced into other related works [21]. Then, a CNN was used to extract the pitches. This approach reduces the complexity of the spectrogram, but it does not take into account the note-specific features in the subsequent network architecture. The advantages and disadvantages of these signal representations are listed in Table 1.
Pitched sounds are composed of fundamental and harmonic components. The fundamental frequency, also known as pitch, is the output of piano music transcription, and harmonic components help estimate the fundamental frequency, especially when the fundamental frequency is absent or weak. However, the interleaved harmonics of multiple concurrent notes increase the difficulty of fundamental frequency estimation, which constitutes a contradiction. In order to utilize the complementary information between fundamental frequency and harmonics, an APT network based on Mel cyclic and Mel STFT spectrograms is proposed in this work.
Specifically, the Mel cyclic spectrogram is constructed to suppress harmonic components and enhance fundamental frequencies. On the other hand, the STFT spectrogram is also utilized to provide complete spectral information. Additionally, to make use of note-specific features, Mel cyclic and Mel STFT spectrograms are fed into a dual-stream structure for specific note feature extraction. Finally, a multi-feature fusion module is proposed for combining the extracted note features and obtaining the final note sequences.
The main contributions of this work include the following: (1) To the best of our knowledge, the Mel cyclic spectrogram is utilized as the signal representation of polyphonic music for the first time, which significantly reduces the spectral complexity of piano music and facilitates pitch estimation. (2) An axial attention mechanism is incorporated into the frame-level note detection modules to model the temporal relevance of piano notes. (3) A multi-feature fusion module is proposed to aggregate the information of note onsets, offsets, and frame-level notes.
The rest of this paper is organized as follows. The cyclic spectrum is presented in Section 2. The proposed APT network based on the two spectrograms is elaborated in Section 3. The experimental results and discussions are provided in Section 4. Finally, some conclusions are drawn in Section 5.
2. Cyclic Spectrum
A cyclic spectrum is commonly used for analyzing cyclostationary signals [22]. Both the mean value $m_x(t)$ and the autocorrelation function $R_x(t, \tau)$ of a cyclostationary signal $x(t)$ exhibit periodicity with respect to a given time difference $\tau$, as shown in Equations (1) and (2):

$$m_x(t) = m_x(t + T), \quad (1)$$

$$R_x(t, \tau) = R_x(t + T, \tau), \quad (2)$$

where $T$ is the signal period, and $t$ represents the time index.
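As a quick numerical illustration of this periodicity (not part of the original derivation), the following NumPy sketch estimates the ensemble autocorrelation of a synthetic amplitude-modulated signal and checks that it repeats with the modulation period; the signal construction and all parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50                    # cyclostationary period in samples (illustrative)
n_real = 4000             # number of realizations for the ensemble average
t = np.arange(3 * T)

# Amplitude-modulated carrier with a random phase per realization, plus noise.
# The random phase makes each realization different, but the second-order
# statistics repeat with the modulation period T.
phase = rng.uniform(0, 2 * np.pi, size=(n_real, 1))
x = np.cos(2 * np.pi * t / T) * np.cos(2 * np.pi * t / 7 + phase)
x += 0.1 * rng.standard_normal(x.shape)

# Ensemble estimate of the autocorrelation R(t, tau) = E[x(t) x(t + tau)]
tau = 3
R = np.mean(x[:, : 2 * T] * x[:, tau : 2 * T + tau], axis=0)

# Periodicity check: R(t, tau) should approximately equal R(t + T, tau)
err = np.max(np.abs(R[:T] - R[T : 2 * T]))
print(err < 0.1)
```

The residual `err` is nonzero only because the ensemble average is estimated from finitely many realizations.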
Since the musical signal can be viewed as a cyclostationary signal in a short time interval, it satisfies approximate cyclostationarity. According to Fourier analysis theory, a periodic function can be represented by a Fourier series. Thus, for the autocorrelation function $R_x(t, \tau)$ of a cyclostationary signal $x(t)$, its Fourier coefficients $R_x^{\alpha}(\tau)$ can be expressed as

$$R_x^{\alpha}(\tau) = \frac{1}{T} \int_{0}^{T} R_x(t, \tau)\, e^{-j 2 \pi \alpha t} \, dt, \quad (3)$$

where $\alpha = i/T$ represents the cyclic frequency, and $i$ is an integer.
Equation (3) denotes the cyclic autocorrelation function for a given time difference $\tau$. According to Fourier analysis theory, the power spectrum of a signal can be obtained by performing a Fourier transform on the cyclic autocorrelation function with respect to the time difference $\tau$. Therefore, the cyclic spectral correlation function is defined as

$$S_x^{\alpha}(f) = \int_{-\infty}^{+\infty} R_x^{\alpha}(\tau)\, e^{-j 2 \pi f \tau} \, d\tau, \quad (4)$$

where $S_x^{\alpha}(f)$ denotes a bivariate function with respect to the frequency $f$ and the cyclic frequency $\alpha$.
The STFT and cyclic spectrograms of one segment of piano music are shown in Figure 1. As can be seen from this figure, the cyclic spectrogram is much clearer than the STFT spectrogram, with harmonics suppressed greatly, while the STFT spectrogram provides more detailed information in the lower frequency range.
3. Proposed Approach
To mitigate the interference of harmonic components inherent in the STFT and accentuate the lower-frequency elements within the cyclic spectrogram, a novel dual-stream architecture for automatic piano transcription is proposed in this paper. The network architecture is depicted in Figure 2. First, the Mel-scaled STFT and Mel-scaled cyclic spectrograms are computed. Subsequently, the Mel cyclic spectrogram is fed into three modules for onset, offset, and frame-level note detection. In parallel, the Mel STFT spectrogram is fed into another frame-level note detection module. Additionally, to capture the temporal relevance of piano notes, an axial attention mechanism is incorporated into both frame-level note detection modules. Finally, considering the natural energy decay characteristic of piano notes, the outputs of the onset, offset, and frame-level note detection modules are aggregated. The composite feature is subsequently input into the multi-feature fusion module to derive the note sequence. This section will provide a detailed explanation of the proposed network.
3.1. Computation of Mel Cyclic Spectrogram
Each note contains the fundamental and harmonic components, which are located at integral multiples of the fundamental frequency. The concurrent notes are ubiquitous in piano music, resulting in a rather complex spectrogram. Fortunately, the cyclic spectrum can suppress the harmonic components, making the fundamental frequency information more significant. Therefore, it is introduced herein for signal representation.
In order to improve the computational efficiency, we discretize the calculation process of the cyclic spectrogram. Firstly, the piano musical signal is split into frames. Then, the discretized frame-level cyclic autocorrelation function is calculated as

$$R_x^{\alpha}[m] = \frac{1}{N} \sum_{n=0}^{N-1} x[n+m]\, x^{*}[n]\, e^{-j 2 \pi \alpha n / N}, \quad (5)$$

where $x[n]$ represents the time-domain signal of one frame, $k$ denotes the frequency index, $N$ is the frame length, $m$ is the time difference, and $(\cdot)^{*}$ denotes the conjugate operator.

Subsequently, by taking the discrete Fourier transform of $R_x^{\alpha}[m]$ over the time difference $m$, the cyclic spectral correlation function can be further simplified as

$$S_x^{\alpha}[k] = \sum_{m=0}^{N-1} R_x^{\alpha}[m]\, e^{-j 2 \pi k m / N} = \frac{1}{N}\, X[k]\, X^{*}[k - \alpha], \quad (6)$$

where $X[k]$ represents the Fourier transform of $x[n]$. $S_x^{\alpha}[k]$ is a bivariate function that relates to the cyclic frequency $\alpha$ and frequency $k$. The frame-wise cyclic spectrum can be obtained by fixing the frequency index $k$ in Equation (6), yielding a one-dimensional function of the cyclic frequency $\alpha$. Similarly, the cyclic spectrogram of one piece of audio signal can be obtained frame by frame.
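The FFT shortcut in Equation (6) can be checked against the direct double-sum definition; below is a minimal NumPy sketch, assuming circular indexing within the frame (which is how the modular shift k − α is handled here).

```python
import numpy as np

def cyclic_spectrum(frame: np.ndarray, alpha: int) -> np.ndarray:
    """Frame-level cyclic spectral correlation S[k] = X[k] X*[k - alpha] / N,
    computed directly from the FFT with circular frequency indexing."""
    N = len(frame)
    X = np.fft.fft(frame)
    return X * np.conj(np.roll(X, alpha)) / N   # roll by alpha gives index k - alpha

def cyclic_spectrum_direct(frame: np.ndarray, alpha: int) -> np.ndarray:
    """Reference evaluation: discretized cyclic autocorrelation over the lag m,
    followed by a DFT with respect to m."""
    N = len(frame)
    n = np.arange(N)
    R = np.array([
        np.mean(frame[(n + m) % N] * np.conj(frame) * np.exp(-2j * np.pi * alpha * n / N))
        for m in range(N)
    ])
    return np.fft.fft(R)

rng = np.random.default_rng(1)
x = rng.standard_normal(64)
# Both routes agree; at alpha = 0 the result reduces to the ordinary power spectrum.
assert np.allclose(cyclic_spectrum(x, 5), cyclic_spectrum_direct(x, 5))
```

The FFT form replaces the O(N^2) double sum with two FFTs and an element-wise product, which is what makes frame-by-frame computation of the cyclic spectrogram practical.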
In order to improve the computational efficiency of the model, the linear frequencies corresponding to the cyclic spectrum are mapped to the Mel-scale frequencies by using the Mel filter banks. The mapping function from linear frequency $f$ to Mel-scale frequency $f_{\mathrm{mel}}$ is defined as

$$f_{\mathrm{mel}} = 2595 \log_{10}\left(1 + \frac{f}{700}\right). \quad (7)$$

The Mel filter banks are a series of triangular filters whose amplitude is 1 at the center frequency point and gradually decreases to 0 on both sides of the filter. The amplitude transfer functions of the Mel filter banks can be defined as

$$H_b[k] = \begin{cases} 0, & k < f(b-1) \\ \dfrac{k - f(b-1)}{f(b) - f(b-1)}, & f(b-1) \le k \le f(b) \\ \dfrac{f(b+1) - k}{f(b+1) - f(b)}, & f(b) < k \le f(b+1) \\ 0, & k > f(b+1) \end{cases} \quad (8)$$

where $f(b)$ represents the linear frequency of the $b$th Mel frequency, and $H_b[k]$ represents the amplitude function of the $b$th filter.

Correspondingly, the Mel cyclic spectrogram $S_{\mathrm{mel}}^{\alpha}[b]$ is obtained through the series of filter banks with the transfer functions defined in Equation (8), i.e.,

$$S_{\mathrm{mel}}^{\alpha}[b] = \sum_{k} H_b[k]\, \big| S_x^{\alpha}[k] \big|. \quad (9)$$
Similarly, the Mel spectrogram can also be derived by filtering the STFT spectrogram with the same Mel filter banks. Through the Mel filter banks, the dimensions of both the STFT and cyclic spectrograms are reduced greatly.
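For concreteness, the filter-bank construction and its application can be sketched as follows; the 229-band, 2048-point, 16 kHz configuration matches the experimental setup in Section 4.2, while the HTK-style break-point placement is an assumption.

```python
import numpy as np

def hz_to_mel(f):
    """HTK-style linear-to-Mel frequency mapping."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_banks(n_mels=229, n_fft=2048, sr=16000):
    """Triangular filters H_b[k]: unit gain at the bth Mel center frequency,
    falling linearly to zero at the two neighboring centers."""
    n_bins = n_fft // 2 + 1
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = mel_to_hz(mel_pts) / (sr / 2) * (n_bins - 1)   # fractional FFT bins
    H = np.zeros((n_mels, n_bins))
    k = np.arange(n_bins)
    for b in range(1, n_mels + 1):
        left, center, right = bins[b - 1], bins[b], bins[b + 1]
        rising = (k - left) / (center - left)
        falling = (right - k) / (right - center)
        H[b - 1] = np.clip(np.minimum(rising, falling), 0.0, None)
    return H

H = mel_filter_banks()
spec = np.abs(np.fft.rfft(np.random.randn(2048)))   # |spectrum| of one frame
mel_spec = H @ spec                                 # 1025 bins -> 229 Mel bands
print(H.shape, mel_spec.shape)
```

The same matrix `H` can filter either the magnitude STFT or the magnitude cyclic spectrum of a frame, which is how both Mel spectrograms share one bank.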
3.2. Frame-Level Note Detection Module
The two frame-level note detection modules with either the Mel STFT spectrogram or the Mel cyclic spectrogram as input share the same structure. They both consist of four layers: convolutional, axial attention, dense, and bidirectional gated recurrent unit (BiGRU) layers, as illustrated in Figure 2.
In more detail, the Mel STFT spectrogram or the Mel cyclic spectrogram is first fed into the CNN to extract the local spatial information and obtain the feature map $F$. Then, to model the temporal relevance of piano notes across frames, we incorporate an axial attention layer [23] after the CNN to enhance the correlation among different frames along the axial direction.
The structure of axial attention is shown in Figure 3. There are four steps in the axial attention mechanism. First, three separate 1 × 1 convolution layers are utilized to obtain three feature matrices: $Q$, $K$, and $V$, respectively. To reduce the computational complexity, the channel number $C'$ of $Q$ and $K$ is smaller than $C$, while the size of $V$ remains unchanged. Second, the $Q$ and $K$ feature matrices are multiplied and followed by a softmax function to obtain the weight matrix $A$, which reflects the correlation of notes at the same position of each frame in the feature map. Thirdly, to capture compact contextual information, the weight matrix $A$ is multiplied by the feature matrix $V$ to derive $O$. Finally, $O$ is added to $F$ to obtain the output matrix $F'$.
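A schematic NumPy sketch of these four steps is given below, with attention computed along the frame (time) axis; the matrix names, the choice of axis, and the plain-matmul stand-ins for the 1 × 1 convolutions are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def axial_attention(F, Wq, Wk, Wv):
    """Self-attention along the frame axis of a (C, T, M) feature map.
    Wq, Wk project C -> C' channels (C' < C to save computation); Wv keeps C.
    A 1x1 convolution is just a channel-mixing matrix, so the three
    projections are plain matrix multiplications here."""
    C, T, M = F.shape
    out = np.empty_like(F)
    for j in range(M):                    # attend within each frequency column
        S = F[:, :, j]                    # (C, T) slice
        Q, K, V = Wq @ S, Wk @ S, Wv @ S  # step 1: three projections
        A = softmax(Q.T @ K, axis=-1)     # step 2: (T, T) frame-to-frame weights
        O = V @ A.T                       # step 3: weighted context per frame
        out[:, :, j] = S + O              # step 4: residual connection
    return out

C, C_red, T, M = 8, 4, 7, 14
rng = np.random.default_rng(0)
F = rng.standard_normal((C, T, M))
Wq, Wk = rng.standard_normal((C_red, C)), rng.standard_normal((C_red, C))
Wv = rng.standard_normal((C, C))
out = axial_attention(F, Wq, Wk, Wv)
print(out.shape)
```

Restricting attention to one axis keeps the weight matrix at T × T per column instead of (T·M) × (T·M) for full 2-D attention, which is the complexity saving axial attention is designed for.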
After axial attention, the result is fed into a dense layer and BiGRU to derive piano note sequences.
The loss function is crucial for APT. As a multi-label classification problem, the labels are set to 1 at the positions where the notes are played, and to 0 at the other positions. The frame-level note detection module predicts the probability of each note. Therefore, following [11], the loss function of the frame-level note detection module $L_{\mathrm{frame}}$ is defined as

$$L_{\mathrm{frame}} = -\sum_{t=1}^{T} \sum_{n=1}^{N} \left[ y_{t,n} \log p_{t,n} + (1 - y_{t,n}) \log(1 - p_{t,n}) \right], \quad (10)$$

where $T$ refers to the number of frames, $N$ is 88 (equal to the number of piano notes), $y_{t,n}$ represents the label value of the $n$th note at the $t$th frame, and $p_{t,n}$ represents the predicted probability value of the $n$th note at the $t$th frame.
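As a sanity check, the frame-level loss can be evaluated on a random piano roll; the sketch below assumes the standard binary cross-entropy form with a small clipping constant added for numerical stability.

```python
import numpy as np

def frame_loss(y, p, eps=1e-7):
    """Binary cross-entropy summed over T frames and N = 88 notes."""
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
    return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

T, N = 100, 88
rng = np.random.default_rng(0)
y = (rng.random((T, N)) < 0.05).astype(float)   # sparse multi-label targets
p = rng.random((T, N))                          # predicted note probabilities
print(frame_loss(y, p))
```

A perfect prediction (p = y) drives the loss toward zero, while random probabilities give a large positive value, matching the intended training signal.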
3.3. Onset and Offset Detection Modules
The frame-level note detection module does not utilize the note onsets and offsets, which contain additional information. In order to make full use of the onset and offset information to enhance the accuracy of piano transcription, the note onset and offset modules are applied to the Mel STFT spectrogram, as illustrated in Figure 2. They consist of three layers: convolutional, dense, and BiGRU layers. The corresponding loss functions are defined as

$$L_{\mathrm{on}} = -\sum_{t=1}^{T} \sum_{n=1}^{N} \left[ y_{t,n}^{\mathrm{on}} \log p_{t,n}^{\mathrm{on}} + (1 - y_{t,n}^{\mathrm{on}}) \log(1 - p_{t,n}^{\mathrm{on}}) \right], \quad (11)$$

$$L_{\mathrm{off}} = -\sum_{t=1}^{T} \sum_{n=1}^{N} \left[ y_{t,n}^{\mathrm{off}} \log p_{t,n}^{\mathrm{off}} + (1 - y_{t,n}^{\mathrm{off}}) \log(1 - p_{t,n}^{\mathrm{off}}) \right], \quad (12)$$

where $L_{\mathrm{on}}$ and $L_{\mathrm{off}}$ are the loss functions of the onset and offset detection modules, respectively; $y_{t,n}^{\mathrm{on}}$ and $y_{t,n}^{\mathrm{off}}$ correspond to the label values for the onset and offset of the $n$th note at the $t$th frame, respectively; and $p_{t,n}^{\mathrm{on}}$ and $p_{t,n}^{\mathrm{off}}$ denote the predicted probability values for the onset and offset of the $n$th note at the $t$th frame, respectively.
3.4. Multi-Feature Fusion Module
The above-mentioned three modules conduct piano music transcription from different perspectives. To utilize the multi-feature information, a multi-feature fusion scheme is proposed. As illustrated in Figure 4, the four matrices generated from the note onset, offset, and frame-level note detection modules are stacked to construct $X$. Afterwards, $X$ is average-pooled and max-pooled along the channel dimension to generate the average pooling matrix $F_{\mathrm{avg}}$ and the maximum pooling matrix $F_{\mathrm{max}}$, respectively, i.e.,

$$F_{\mathrm{avg}} = \mathrm{AvgPool}(X), \quad F_{\mathrm{max}} = \mathrm{MaxPool}(X). \quad (13)$$

Then, a convolutional layer is used to associate the interaction features of $F_{\mathrm{avg}}$ and $F_{\mathrm{max}}$ to obtain an enhanced feature matrix of size $T \times N$. The enhanced feature matrix integrates the note onset, offset, and sustain information.

Subsequently, the enhanced feature matrix is dot-produced with each slice of $X$ to obtain the fused feature matrix $Y$, defined as

$$Y = f\left( \left[ F_{\mathrm{avg}}; F_{\mathrm{max}} \right] \right) \odot X, \quad (14)$$

where $f(\cdot)$ represents the convolutional layer, and ⊙ denotes the slice-wise dot product operator.
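The pooling-and-reweighting step can be sketched as follows; a fixed average of the two pooled maps stands in for the learned convolutional layer, so this is an illustrative simplification rather than the trained module.

```python
import numpy as np

def fuse(stack):
    """Multi-feature fusion sketch: stack is (4, T, N) -- the onset, offset,
    and two frame-level outputs. Channel-wise average and max pooling give
    two (T, N) maps; a fixed 2->1 mix (stand-in for the learned conv layer)
    produces one enhanced map, which re-weights every slice of the stack."""
    avg = stack.mean(axis=0)            # (T, N) average pooling over channels
    mx = stack.max(axis=0)              # (T, N) max pooling over channels
    enhanced = 0.5 * avg + 0.5 * mx     # placeholder for the conv layer f(.)
    return enhanced[None] * stack       # slice-wise (Hadamard) product, (4, T, N)

rng = np.random.default_rng(0)
stack = rng.random((4, 100, 88))        # 100 frames x 88 notes per module
fused = fuse(stack)
print(fused.shape)
```

The enhanced map acts as a shared gate: positions where onset, offset, and sustain evidence agree are amplified in all four slices before the final BiGRU.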
Finally, the BiGRU is used to model the timing information, after which the note probability matrix is obtained using the sigmoid activation function. To obtain note sequences, a simple threshold strategy is employed herein. Specifically, the note probability matrix is binarized with a threshold of 0.5. As a consequence, consecutive frames of the same pitch with probabilities larger than 0.5 form a note.
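A minimal sketch of this thresholding step, converting the binarized probability matrix into (MIDI number, onset, offset) events; the 32 ms hop matches Section 4.2, and mapping roll index 0 to MIDI 21 (the lowest piano key) is an assumption.

```python
import numpy as np

def to_notes(prob, threshold=0.5, hop_s=0.032):
    """Binarize the (T frames x 88 notes) probability matrix and merge runs of
    consecutive active frames into (midi, onset_s, offset_s) note events."""
    active = prob >= threshold
    notes = []
    for n in range(active.shape[1]):
        col = active[:, n].astype(int)
        # Rising/falling edges of the padded activity column mark note bounds.
        edges = np.diff(np.concatenate(([0], col, [0])))
        starts = np.flatnonzero(edges == 1)
        ends = np.flatnonzero(edges == -1)
        for s, e in zip(starts, ends):
            notes.append((n + 21, s * hop_s, e * hop_s))  # MIDI 21 = A0
    return sorted(notes, key=lambda note: note[1])

prob = np.zeros((10, 88))
prob[2:5, 39] = 0.9        # roll index 39 -> MIDI 60, active frames 2..4
print(to_notes(prob))
```
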
4. Experimental Results and Discussions
4.1. Dataset and Metrics
The MAPS dataset [10], with approximately 60 h of piano music recordings, is utilized for performance evaluation. This dataset encompasses 270 piano music recordings of 9 categories, and each category contains 30 piano recordings. Notably, 7 of these categories are generated using piano synthesis software, while the other 2 categories are obtained from a Yamaha Disklavier. The sampling rate is 44.1 kHz. Each recording is labeled with the sustain and onset information for different notes. In our experiments, the 30 audio recordings from the AkPnBcht category were used as the test set, and 15 randomly selected audio recordings from the other 8 categories were used as the validation set for adjusting the hyperparameters, while the remaining 225 recordings were used for model training. That is, the percentages of the training, validation, and test sets are 83.33%, 5.56%, and 11.11%, respectively.
Three metrics are used to assess the accuracy of the pitch sequences, i.e., precision, recall, and F-measure [24]. Four kinds of error rates are also evaluated, i.e., the substitution error rate ($E_{\mathrm{subs}}$), missing error rate ($E_{\mathrm{miss}}$), false alarm error rate ($E_{\mathrm{fa}}$), and total error rate ($E_{\mathrm{tot}}$) [25].
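For reference, frame-level versions of these metrics can be computed as below; the error-rate formulas follow the standard Poliner–Ellis definitions, which is an assumption about how [25] defines them.

```python
import numpy as np

def frame_metrics(ref, est):
    """Frame-level precision/recall/F-measure plus the total error rate.
    ref, est: boolean piano rolls of shape (T, 88)."""
    tp = np.sum(ref & est)
    p = tp / max(est.sum(), 1)
    r = tp / max(ref.sum(), 1)
    f = 2 * p * r / max(p + r, 1e-12)
    n_ref = ref.sum(axis=1)             # reference pitches per frame
    n_sys = est.sum(axis=1)             # transcribed pitches per frame
    n_corr = (ref & est).sum(axis=1)    # correct pitches per frame
    denom = max(n_ref.sum(), 1)
    e_subs = np.minimum(n_ref - n_corr, n_sys - n_corr).sum() / denom
    e_miss = np.maximum(n_ref - n_sys, 0).sum() / denom
    e_fa = np.maximum(n_sys - n_ref, 0).sum() / denom
    return p, r, f, e_subs + e_miss + e_fa

ref = np.zeros((4, 88), bool)
ref[:, 40] = True                       # one sustained reference note
est = ref.copy()                        # perfect transcription
print(frame_metrics(ref, est))
```

With a perfect transcription all three accuracy metrics reach 1 and the total error rate is 0; an empty transcription yields zero recall and a total error rate of 1.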
4.2. Experimental Setup
The piano music in the dataset is processed in the following way. Firstly, the original stereo piano music signal is converted into a monophonic signal by taking the arithmetic mean, and then downsampled to a sampling rate of 16 kHz. The duration of each frame for both the cyclic and STFT spectrograms is 128 ms (with a window length of 2048 points), and the time interval is 32 ms. To improve the computational efficiency, the cyclic spectrogram and STFT spectrogram are filtered using 229 Mel filter banks [26] to obtain the Mel cyclic spectrogram and Mel spectrogram, respectively. In this way, the frequency dimension of the spectrograms is reduced from 1025 to 229.
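These preprocessing steps can be sketched as follows; the linear-interpolation resampler is a crude stand-in for a proper anti-aliased polyphase resampler and is used here only to keep the example self-contained.

```python
import numpy as np

def preprocess(stereo, sr_in=44100, sr_out=16000, win=2048, hop=512):
    """Preprocessing sketch: stereo -> mono by arithmetic mean, naive linear
    resampling to 16 kHz, then framing with a 128 ms window (2048 samples)
    and a 32 ms hop (512 samples)."""
    mono = stereo.mean(axis=0)
    t_out = np.arange(int(len(mono) * sr_out / sr_in)) / sr_out
    mono16 = np.interp(t_out, np.arange(len(mono)) / sr_in, mono)
    n_frames = 1 + (len(mono16) - win) // hop
    return np.stack([mono16[i * hop : i * hop + win] for i in range(n_frames)])

stereo = np.random.randn(2, 44100)   # 1 s of fake stereo audio at 44.1 kHz
frames = preprocess(stereo)
print(frames.shape)
```

One second of audio yields 16,000 mono samples and hence 28 overlapping frames, each of which is then transformed and Mel-filtered as described above.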
Four detection modules share the same structure of convolutional layers. There are 4 convolutional layers for each Conv2D block, and the number of output feature maps is 16, 32, 64, and 128, respectively. The size of the convolutional kernel is with a random dropout rate of 0.5. The activation function of convolutional layers is ReLU. The input batch size is 200. Seven consecutive frames of the Mel cyclic spectrogram are concatenated together to form the feature of the current frame, so the input of the detection module is a tensor in shape (200, 1, 7, 229). The feature maps after the convolutional layers are of dimensions (200, 128, 7, 14). Then, the feature maps are flattened along the frequency and channel dimensions to (200, 7, 1792), and the shapes after dense layers are (200, 7, 512). These tensors are then input into the BiGRU, whose output layer size is 256.
The input channel number of the frame-level note detection module, C, is 128, and the reduced channel number C′ in the axial attention layer is set to 64. The output layers of the onset, offset, and frame detection modules are activated by the sigmoid function. The threshold after the last sigmoid function for determining whether a note is active is 0.5.
The label of each frame is a vector of length 88. An entry is set to 1 when the corresponding note is active, and 0 otherwise. The loss function is optimized by the Adam optimizer with a learning rate of 0.0006 for training. The experiments are conducted on Ubuntu 18.04.6 LTS with an NVIDIA RTX 3080 GPU. The code is written in Python 3.9 in the PyTorch framework.
4.3. Reference Approaches
Six typical APT approaches are used for performance comparison, including spectral peaks and non-peak regions modeling (SPNRM) [18], the sound space-based spectral factorization (S3F) [7], the pseudo 2-D spectrum-based approach (P2SB) [8], note transcription using an RNN (RNN) [9], joint onset and frame estimation using a CNN and bidirectional long short-term memory networks (CBLSTM) [11], and high-resolution piano transcription (HPT) [16].
These approaches are chosen since they cover all categories. For instance, S3F belongs to the classical spectrogram factorization category. P2SB is based on a two-dimensional spectral representation. SPNRM is devoted to modeling spectral peaks. RNN, CBLSTM, and HPT are typical deep learning-based approaches. All parameters of the reference methods are the same as their original settings. To be fair, all deep learning-based methods are trained and tested using the same datasets as the proposed one.
4.4. Evaluation Results and Discussions
4.4.1. Ablation Experiment
As mentioned before, the proposed approach is based on a dual-stream structure using both the Mel cyclic and Mel STFT spectrograms. To verify the effectiveness of the dual-stream structure, we performed an ablation experiment on the validation dataset. In more detail, the proposed approach is trained on the same training set, and its performance on the validation dataset is tested with either the Mel cyclic spectrogram branch or the Mel STFT spectrogram branch omitted. The experimental results are provided in Table 2.
As shown in this table, the proposed approach based on both the Mel cyclic and STFT spectrograms achieves the highest accuracy and the lowest total error rate. This observation indicates that both streams contribute to the performance. In terms of F-measure, the multi-feature fusion scheme based on both spectrograms is 4.1% higher than the Mel cyclic spectrogram branch alone and 2.9% higher than the Mel spectrogram branch alone, and its $E_{\mathrm{tot}}$ is 6.2% lower and 4.3% lower, respectively.
In addition, the branch with only the Mel cyclic spectrogram obtains a relatively lower recall rate and a higher total error rate. This phenomenon might be due to its lower frequency resolution. As illustrated in Figure 1, the lower frequency components of the cyclic spectrogram are weak. If the frequency resolution of the cyclic spectrum were increased, the accuracy could be improved, but the computation load would also be heavier.
4.4.2. Comparison with Reference Approaches
The piano transcription results on MAPS-AkPnBcht, with respect to precision, recall, and F-measure, are shown in Figure 5. To provide more intuitive results, the mean accuracies of the different methods are listed in Table 3.
It can be found from Figure 5 and Table 3 that the proposed approach achieves the highest F-measure and recall, and the third highest precision. The precision of the proposed approach is lower than those of HPT and CBLSTM, indicating that there is still work to do to reduce the false positives.
Additionally, among the compared approaches, S3F achieves a higher precision. However, its recall is the lowest, resulting in a lower F-measure. From this observation, we can infer that S3F estimates relatively few notes with a low false alarm rate, but misses many notes. Similar conclusions can be drawn for SPNRM.
The corresponding error rates for all approaches are shown in Table 4. It can be observed that the proposed approach also obtains the second lowest total error rate. Its false alarm rate is higher than those of HPT and CBLSTM, which confines its precision. Moreover, S3F consistently achieves a lower substitution error rate and false alarm error rate, but it obtains the highest missing error rate. This observation confirms that a significant number of actual notes are missed.
In order to see whether the differences between the approaches are significant, we conducted a statistical significance analysis on all test recordings. A paired-sample t-test is performed between the proposed approach and each reference approach in terms of precision, recall, and F-measure. The results are listed in Table 5, Table 6 and Table 7.
It can be observed from Table 5 that HPT and CBLSTM achieve higher precision values than the proposed approach, but the precision differences between the proposed approach and CBLSTM, HPT, and S3F are not significant, with p-values higher than 0.2. It can also be observed from Table 6 that the proposed approach obtains the highest recall; its superiority over HPT is not statistically significant (p = 0.191), while it significantly outperforms the other approaches in terms of recall. Finally, it can be found from Table 7 that the proposed approach obtains the highest F-measure, which is consistent with Figure 5. The difference between the proposed approach and HPT with respect to F-measure is insignificant, with a p-value of 0.574, while the proposed approach significantly outperforms the other approaches.
4.4.3. Piano Transcription Example
To provide a more intuitive illustration of the piano transcription results, the ground truth labels and the transcription results of these approaches on one excerpt are shown in Figure 6. This excerpt is 'MAPS_MUS_chpn-p4_AkPnBcht.wav', taken from the evaluation dataset.
From Figure 6, it can be observed that the piano transcription results of RNN, CBLSTM, HPT, and the proposed approach are much better than those of S3F, P2SB, and SPNRM, which is consistent with the previous evaluation results. Only HPT and the proposed approach correctly estimate the short note segments with MIDI number 35 during the interval between 2.5 and 4 s. That is because another note sounds exactly one octave above this one at the same time; discriminating notes exactly one octave apart thus remains an unsolved problem. Additionally, the notes produced by the deep learning approaches are more continuous, indicating that they model temporal dependency better.
4.4.4. Computational Complexity Comparison
It can be seen from Table 5, Table 6 and Table 7 that the performance of the deep learning-based approaches is diverse. Their parameter amounts and FLOPs are therefore also reported, with the results given in Table 8. Obviously, RNN is the most lightweight, with the fewest parameters and FLOPs, much fewer than the other reference methods. On the other hand, CBLSTM requires the most parameters and FLOPs.
Furthermore, it can be seen from Table 8 that RNN is the most computationally efficient, while the proposed approach ranks second. The statistical significance results indicate that the proposed approach performs on par with HPT, yet its model complexity and parameter amount are much smaller than those of HPT. Therefore, the proposed approach achieves comparable or better performance with a much smaller model size than the reference methods.
5. Conclusions
In this work, an APT approach based on a dual-stream structure is proposed. The Mel cyclic spectrogram is introduced herein to effectively mitigate the interference of the harmonic components of notes, while the Mel spectrogram is also utilized to supplement the information in the lower frequency range. Specifically, the cyclic and STFT spectrograms are first computed frame by frame and filtered using Mel filter banks to obtain the Mel cyclic and Mel spectrograms, respectively. Then, the frame-level note detection, onset, and offset detection modules are utilized to extract specific piano note features in the dual-stream structure. Considering the temporal dependency of notes, axial attention is incorporated into the frame-level note detection modules. Finally, the features of the different detection modules are fused together to deduce the piano notes. To the best of our knowledge, this is the first work that models the piano transcription problem using these two spectrograms as input representations. Experimental results demonstrate that the two spectrograms provide complementary information to each other, and the proposed approach achieves promising performance.
In the future, we would try to explore a more sophisticated feature fusion strategy to improve the information interaction between the two branches.