1. Introduction
The automated assessment of skills involves quantifying how proficient a person is at a given task. Automated skill assessment can be applied in many areas, such as physical movement assessment and music education. Owing to the high cost of one-on-one instruction and limitations on teaching space, face-to-face piano lessons between teacher and student have become difficult to arrange, which makes automated skill assessment increasingly important.
Piano performances comprise both visual and auditory aspects, so the assessment of piano players’ skills can likewise be conducted from both perspectives. On the visual side, a judge can score a performance by observing the player’s fingering; on the aural side, the judge scores it by assessing the rhythm of the music being played.
At present, most studies on piano performance evaluation are based on the audio modality alone [1,2,3], disregarding the information contained in the video modality, such as playing technique and posture. This results in a one-sided assessment of the player’s skill level and an inability to evaluate performance comprehensively from multiple aspects.
Meanwhile, the few existing studies on multimodal piano skill assessment rely on shallow networks that struggle to extract complex spatio-temporal and time-frequency features. There is also a significant imbalance between the numbers of extracted video and audio features at the fusion stage, so these models do not fully exploit the feature information from both modalities.
Consequently, we propose an audio-visual fusion model that extracts feature information from both video and audio data and combines them into multimodal features, enabling a more comprehensive and accurate assessment of piano players’ skill levels. We adopt ResNet as the backbone of the learning model because its residual connections allow the network to be extended to greater depths and thus to better extract complex features. ResNet-3D [4] and ResNet-2D [5] are used to extract the video features and audio features, respectively. Specifically, ResNet-3D models both the spatial and temporal relationships between video frames, enabling it to capture motion and dynamic changes in videos and to extract spatio-temporal features. Furthermore, for audio data converted into a time-frequency spectrogram, ResNet-2D can effectively extract time-frequency features, capturing both the frequency and time information in the audio. To fully utilize both the video and audio features, we employ ResNet18-3D and ResNet18-2D for feature extraction, ensuring that the numbers of video and audio features are equal.
The main contributions of this paper are as follows:
(1) We present a novel ResNet-based audio-visual fusion model for evaluating piano players’ proficiency by effectively utilizing both video and audio information. This addresses the shortcomings of unimodal approaches, which discard video information, and of prior multimodal approaches, which do not fully utilize both video and audio information. First, we extract video and audio features using ResNet-3D and ResNet-2D, respectively; the extracted features are then fused to form the multimodal features used for piano skills evaluation.
(2) We propose an effective method that fully utilizes information from both the visual and aural modalities to improve the accuracy and comprehensiveness of piano skills assessment. By incorporating visual and aural information, we can obtain richer and more comprehensive features, enabling a more precise assessment of the skill levels of the performers. Moreover, we maintain a balance between the number of visual and aural features, thereby ensuring that the information from both modalities is fully utilized.
(3) We conduct experiments on the PISA dataset, and the results show that our proposed ResNet-based audio-visual fusion model, with an accuracy of 70.80% and an average training time of 74.02 s, outperforms the state-of-the-art models in the evaluation of piano players’ skill levels, as well as in computational efficiency.
3. Methodology
In this section, we detail the audio-visual fusion model used for assessing the skill levels of piano performers.
Figure 1 shows the framework of our proposal. It consists of three main components: data pre-processing, feature extraction and fusion, and performance evaluation. First, the video data are framed and cropped to serve as the input for the visual branch. The raw audio is then converted to the corresponding Mel-spectrogram using signal processing techniques and spectral analysis methods. Second, we feed the processed video and audio data into the audio-visual fusion model to extract their respective features and fuse the extracted features to form multimodal representations. Finally, we pass the multimodal features as input to the fully connected layer and then perform prediction.
3.1. Data Pre-Processing
For the visual input, we discard the background, the player’s face, and other irrelevant information contained in the video by cropping the video data before feeding it into the model. The resulting visual region contains the player’s forearms, hands, and the piano, as shown in Figure 2.
For the auditory input, we convert the raw audio to the corresponding Mel-spectrogram, which helps us to better extract information such as pitch that is embedded in the audio data. Firstly, we convert the raw audio into the corresponding spectrogram using the STFT (Short-Time Fourier Transform):

$$X(t, f) = \int_{-\infty}^{+\infty} x(\tau)\, w(\tau - t)\, e^{-j 2 \pi f \tau}\, \mathrm{d}\tau$$

where $X(t, f)$ represents the outcome of the STFT, $w(\cdot)$ represents the window function, $x(\tau)$ refers to the time-domain waveform of the original signal, $f$ represents the frequency, $t$ indicates the time, and $j$ is the imaginary unit, which satisfies $j^2 = -1$. Then, the obtained spectrogram is mapped to the corresponding Mel-spectrogram using the Mel scale [23]:

$$m = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right)$$

where $m$ represents the Mel frequency and $f$ is the original frequency. Finally, the Mel-spectrogram magnitudes are expressed in decibels.
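As a concrete illustration, the Hz-to-Mel mapping can be sketched in a few lines of Python. This is a minimal sketch assuming the common HTK-style Mel-scale constants (2595 and 700); library implementations such as librosa also offer an alternative Slaney-style scale.

```python
import math

def hz_to_mel(f):
    """Map a frequency in Hz to the Mel scale (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping: Mel value back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# The mapping is roughly linear below ~1 kHz and logarithmic above,
# mirroring human pitch perception.
print(hz_to_mel(440.0))                          # A4 concert pitch
print(round(mel_to_hz(hz_to_mel(1000.0)), 2))    # round trip -> 1000.0
```

In a full pipeline, this mapping defines the triangular filter bank that converts each STFT frame into Mel bins before the decibel conversion.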
3.2. Feature Extraction and Fusion
Visual branch: Some of a performer’s skills can only be observed and evaluated visually rather than aurally, such as fingering technique. For example, a professional pianist playing a piece at high speed may use the index and ring fingers to play octaves, which is difficult for an average pianist. The absence of such skills does not necessarily indicate a low proficiency level, but possessing them demonstrates that the pianist has reached a high level of technical ability.
The finger movements captured in videos involve both the appearance and the temporal dynamics of the video sequences. Efficiently modeling the spatio-temporal dynamics of these sequences is crucial for extracting robust features, which, in turn, improves the performance of the model. Three-dimensional CNNs are efficient at capturing the spatio-temporal dynamics in videos. Specifically, we adopt ResNet-3D [4] to extract the spatio-temporal features of the performance clips from a video sequence. Compared to conventional 3D CNNs, ResNet-3D captures the spatio-temporal dynamics of the video modality with higher computational efficiency. In ResNet-3D, multiple 3D convolutional layers are stacked to model motion features in the temporal dimension, and 3D pooling layers and fully connected layers are used for feature dimensionality reduction and combination. In this way, we can extract rich visual features from the video data, including information such as shape and color, to capture finger motion patterns. In addition, we can use a pre-trained model to improve performance, as shown in Algorithm 1. Finally, we employ the averaging method as our aggregation scheme (see Aggregation option).
Algorithm 1 Model Initialization Algorithm
Input: model_dict: the dictionary of model parameters; pretrained_dict: the dictionary of pre-trained model parameters
Output: model_dict: the model dictionary after completing the update
1: function ModelInit(model_dict, pretrained_dict)
2:   for each (key, value) in pretrained_dict do
3:     if key in model_dict then
4:       model_dict[key] ← value
5:     end if
6:   end for
7:   return model_dict
8: end function
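Algorithm 1 amounts to copying every pre-trained parameter whose name also appears in the new model. A minimal Python sketch follows; it uses plain dictionaries in place of actual tensor state dicts, and the parameter names are illustrative, not taken from the paper.

```python
def model_init(model_dict, pretrained_dict):
    """Overwrite model parameters with pre-trained values for matching keys.

    Keys present only in the pre-trained dictionary (e.g. an old
    classification head) are ignored; keys present only in the model
    keep their fresh initialization.
    """
    for key, value in pretrained_dict.items():
        if key in model_dict:
            model_dict[key] = value
    return model_dict

# Illustrative example: only the shared backbone parameter is transferred.
model = {"conv1.weight": "random_init", "fc.weight": "random_init"}
pretrained = {"conv1.weight": "kinetics_weights", "old_fc.weight": "unused"}
model = model_init(model, pretrained)
print(model)  # {'conv1.weight': 'kinetics_weights', 'fc.weight': 'random_init'}
```

The same key-matching pattern is the usual way to load a pre-trained backbone into a model whose final layers differ from the pre-training task.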
Aural branch: A great deal of information about a piano performance can also be obtained from the audio. The rhythm, notes, and pitch of the piano can be perceived through listening, which is a common, simple, and practical way to evaluate a piano piece. In fact, different pieces vary greatly in style, rhythm, and so on. In a piano competition or recital, this requires the judges to be highly familiar with the piece being played; judges who do not know the piece may find it difficult to evaluate the performer’s skills.
Information such as the pitch and rhythm of a piano performance is contained in the audio data, and both raw audio waveforms [24,25] and spectrograms [26,27] can be used to extract auditory features. However, the spectrogram provides more detailed and accurate audio features. Specifically, we convert the raw audio data into the corresponding Mel-spectrogram, which, owing to its two-dimensional matrix form, can be regarded as image data. We then feed the obtained Mel-spectrogram to the auditory network. ResNet-2D [5] outperforms the traditional 2D CNN in terms of computational efficiency and feature extraction, and it can also leverage a pre-trained model to improve performance. Therefore, we use ResNet-2D for feature extraction from the Mel-spectrogram. By stacking 2D convolutional layers, we can capture the patterns and variations of the audio data in the frequency and time dimensions. The extracted auditory features include information such as the spectral shape, pitch, and rhythm, which reflect the content and characteristics of the audio data. Finally, we employ the averaging method as our aggregation scheme (see Aggregation option).
Multimodal branch: By utilizing the ResNet-3D and ResNet-2D networks, we obtain visual and aural features. To better capture the semantic associations and complementary information between the video and audio modalities, we adopt a joint representation of the features extracted from the two modalities, which yields a more comprehensive and accurate feature representation. Let $V = \{v_1, v_2, \ldots, v_M\}$ and $A = \{a_1, a_2, \ldots, a_N\}$ represent the two sets of deep feature vectors extracted for the visual and aural modalities, where $v_i \in \mathbb{R}^d$ and $a_j \in \mathbb{R}^d$. Here, $d$ represents the dimension of the visual and aural feature representations, and $M$ and $N$ denote the numbers of extracted visual and aural features, respectively. The multimodal features, $F = \{f_1, f_2, \ldots, f_L\}$, are obtained by splicing $V$ with $A$ in the corresponding dimensions using Algorithm 2:

$$F = V \oplus A = \{v_1, \ldots, v_M, a_1, \ldots, a_N\}$$

where $L = M + N$ denotes the number of fused features, as shown in Figure 3.
Algorithm 2 Feature Fusion Algorithm
Input: V: the video features extracted by ResNet-3D; A: the audio features extracted by ResNet-2D
Output: F: multimodal features
1: function FeatureFusion(V, A)
2:   M ← V.shape[0]
3:   N ← A.shape[0]
4:   if M = N then
5:     F ← concat(V, A)
6:   end if
7:   return F
8: end function
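Assuming both branches yield feature vectors of the same dimension, the fusion step reduces to a concatenation along the feature axis. The following numpy sketch illustrates this; the shapes (M = N = 4 vectors of dimension 128) are illustrative stand-ins, not the exact sizes used in the paper.

```python
import numpy as np

def feature_fusion(v_feats, a_feats):
    """Concatenate visual and aural feature sets into one multimodal set.

    v_feats: (M, d) array of visual features from the ResNet-3D branch.
    a_feats: (N, d) array of aural features from the ResNet-2D branch.
    Returns an (M + N, d) array of fused features.
    """
    if v_feats.shape[1] != a_feats.shape[1]:
        raise ValueError("visual and aural feature dimensions must match")
    return np.concatenate([v_feats, a_feats], axis=0)

# Illustrative shapes: equal numbers of visual and aural feature vectors.
V = np.random.randn(4, 128)
A = np.random.randn(4, 128)
F = feature_fusion(V, A)
print(F.shape)  # (8, 128)
```

Keeping M equal to N, as the paper emphasizes, ensures that neither modality dominates the fused representation by sheer feature count.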
Aggregation option: In a piano performance, the overall score obtained by a player can be viewed as accumulating additively over the course of the performance. It is therefore advantageous to perform linear operations on the learned features, which enhances their interpretability and expressiveness. Linear operations can also reduce the dimensionality of the features, improving the efficiency and generalization ability of the model. Consequently, we adopt linear averaging as the aggregation scheme. The application of linear averaging is detailed in Algorithm 3 and depicted in Figure 4.
Algorithm 3 Feature Average Algorithm
Input: feat_list: the list of features obtained from the network
Output: avg_feat: the features after the averaging process
1: Initialize sum as a zero tensor
2: for each feat in feat_list do
3:   sum ← sum + feat
4: end for
5: avg_feat ← sum / len(feat_list)
6: return avg_feat
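The linear averaging above is an element-wise mean over the list of feature vectors. A numpy sketch with toy two-dimensional features (the names and sizes are illustrative):

```python
import numpy as np

def feature_average(feat_list):
    """Aggregate a list of feature vectors by linear (element-wise) averaging."""
    total = np.zeros_like(feat_list[0])
    for feat in feat_list:
        total = total + feat
    return total / len(feat_list)

# Toy example: three clip-level feature vectors of dimension 2.
clips = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
print(feature_average(clips))  # [3. 4.]
```

Because the mean is a linear operation, it preserves additive structure in the features, which matches the additive view of performance scoring motivating this choice.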
3.3. Performance Evaluation
In the visual and aural branches, we pass the features through a linear layer to reduce their dimensionality to 128, as shown in Figure 3, and then feed them into the prediction layer. The multimodal branch is handled similarly, except that we do not back-propagate gradients from the multimodal branch into the individual modality backbones, in order to avoid cross-modal contamination.
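The dimensionality reduction described above is a single affine projection. The following numpy sketch illustrates it; the 512-dimensional input assumes ResNet18-style backbone features, and the random weights are stand-ins for learned parameters.

```python
import numpy as np

def linear_layer(x, weight, bias):
    """Affine projection y = x @ W^T + b, as in a fully connected layer."""
    return x @ weight.T + bias

rng = np.random.default_rng(0)
in_dim, out_dim = 512, 128          # assumed backbone feature size -> 128

W = rng.standard_normal((out_dim, in_dim)) * 0.01   # stand-in weights
b = np.zeros(out_dim)                                # stand-in bias

features = rng.standard_normal((1, in_dim))          # one feature vector
reduced = linear_layer(features, W, b)
print(reduced.shape)  # (1, 128)
```

The reduced 128-dimensional vector is what the prediction layer consumes in each branch.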
5. Conclusions
In this work, we propose a ResNet-based audio-visual fusion model for evaluating piano skills. The main focus of our study is to use feature information from both the video and audio modalities to comprehensively assess the skill levels of piano players. Our study leverages the features of both modalities by fusing visual and aural information, which enhances the overall assessment of piano players’ skills.
The fusion of visual and auditory features enables the discovery of correlations and complementarities between the audio and visual information, resulting in a more comprehensive and accurate feature representation. With ResNet as the backbone network, the proposed model uses ResNet-3D to extract visual features from the performer’s finger motions and ResNet-2D to extract auditory features from the Mel-spectrogram. The visual and auditory features are then combined by feature splicing to form multimodal features, which are finally fed to the linear layer to predict the piano player’s skill level.
We conducted experiments on the PISA dataset and achieved an accuracy of 70.80% in the assessment of piano skills, surpassing the performance of state-of-the-art methods. We also explored the impact of different layers in ResNet on the model, providing a reference for the further optimization of the model. Our work provides piano learners with a more accurate and comprehensive skill evaluation, while also providing a deeper understanding of the piano playing process and performance.