1. Introduction
Automatic Music Transcription (AMT) aims to infer symbolic score representations from acoustic signals and underpins diverse applications such as music information retrieval, practice support, automatic accompaniment, and digital archiving. Among instruments, the piano remains particularly challenging due to dense polyphony and strong overtone interference, which complicate note separation and the detection of simultaneous key presses. Consequently, purely audio-based AMT models still struggle to robustly estimate onsets and offsets with high accuracy under varying noise levels and recording conditions.
In contrast, visual information obtained from performance videos provides direct cues about key locations and hand/finger motion. Recent studies have explored tracking hand skeletons over time and integrating them with acoustic features [1,2,3,4]. However, visual signals suffer from detection dropouts and uncertainty due to occlusions, illumination changes, and viewpoint dependence; therefore, naive integration can increase both false positives and false negatives. In particular, mismatches in temporal alignment between audio and visual streams can lead to event-level errors that are not easily resolved by simple merging strategies.
In this paper, we propose a multimodal pipeline that integrates probability maps derived from piano audio with key-press probability maps inferred from hand-skeleton trajectories. To obtain reliable vision-derived probabilities, we introduce HandSkeletonNet, a graph neural network that estimates per-key onset probabilities from hand-skeleton trajectories, and we investigate the effect of temporal aggregation using temporal convolution. These hand-skeleton probabilities are merged with the audio-derived probability maps via a parameterized weighting-and-masking scheme based on a nonlinear gain and a probability floor, rather than hard rule-based suppression or rescue heuristics. We also investigate a learning-based merging approach in which a compact Convolutional Neural Network (CNN) directly combines the audio- and vision-derived probability maps to produce the final merged representation. The merged output is subsequently converted to Musical Instrument Digital Interface (MIDI) messages for quantitative evaluation. Our approach differs from raw image-based methods [1,2,3] in that it leverages skeleton representations to naturally incorporate positional constraints imposed by the human hand structure. It also differs from the CRNN-GCN model [4] in that, rather than extracting uninterpretable latent features from the skeleton, it estimates per-key probabilities and then merges them. This design improves explainability and is expected to facilitate analysis of how undesirable outputs in audio-derived results are suppressed.
The main contributions of this paper are summarized as follows: (i) the design of HandSkeletonNet, a Graph Neural Network (GNN)-based model that estimates per-key onset probabilities from hand-skeleton trajectories, with an investigation of temporal aggregation; (ii) a parameterized weighting-and-masking framework that merges audio and visual probabilities through a continuous weight map, a probability floor, and a nonlinear gain; (iii) a learning-based merging model based on a compact CNN that integrates the two probability streams in a data-driven manner; and (iv) extensive experiments on internal and external datasets demonstrating that multimodal fusion improves transcription accuracy over a state-of-the-art audio-only baseline. The proposed approach is expected to be most effective when hand tracking is sufficiently stable and synchronization errors are limited, whereas robust onset-level improvements under severe domain shift remain challenging.
2. Background and Related Work
2.1. Fundamentals of Piano AMT
Given a waveform $x$ of a piano performance, the goal of AMT is to estimate a time-discretized piano roll as follows:
$$Y \in \{0,1\}^{T \times 88},$$
where $T$ denotes the number of time frames after discretizing the waveform into frames with hop size $h$, i.e., $T \approx D/h$ for an audio duration $D$, and $Y_{t,k} = 1$ indicates that key $k$ is active at time frame $t$. In practice, alternative formulations predict separated representations for onset, duration, and offset channels, or directly infer MIDI notes as onset–length pairs (Figure 1). Piano AMT is challenging mainly due to dense polyphony, resonance, and the use of the sustain pedal. In addition, a piano produces strong harmonic interference, making analysis of the audio waveform difficult (Figure 2).
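As an illustration of the piano-roll target described above, the following minimal sketch rasterizes a list of notes into a binary frame-by-key matrix. The function name, the `(onset, offset, pitch)` tuple layout, the use of a ceiling for the frame count, and the mapping of MIDI pitch 21 to the lowest key are assumptions made for this example, not details taken from the paper.

```python
import math
import numpy as np

def notes_to_piano_roll(notes, duration, hop, n_keys=88, lowest_midi=21):
    """Rasterize notes into a binary (frames x keys) piano roll.

    notes: iterable of (onset_sec, offset_sec, midi_pitch) tuples (assumed layout).
    duration, hop: audio duration and hop size, both in seconds.
    """
    n_frames = math.ceil(duration / hop)  # assumed rounding convention
    roll = np.zeros((n_frames, n_keys), dtype=np.uint8)
    for onset, offset, pitch in notes:
        k = pitch - lowest_midi
        if not 0 <= k < n_keys:
            continue  # skip pitches outside the keyboard range
        start = int(onset / hop)
        end = max(start + 1, math.ceil(offset / hop))  # at least one active frame
        roll[start:min(end, n_frames), k] = 1
    return roll

roll = notes_to_piano_roll([(0.0, 0.5, 60), (0.25, 0.75, 64)], duration=1.0, hop=0.25)
```

Onset and offset channels can be derived in the same way by marking only the first or last frame of each note.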
Standard evaluation metrics include note-level and frame-level precision, recall, and F1 score. For onset matching, a fixed onset-time tolerance is typically adopted. While our system estimates duration and offset probabilities to stabilize the transcription, we primarily focus on the evaluation of onset detection accuracy.
2.2. Audio-Only AMT
Early AMT systems relied on signal-processing approaches such as spectral-peak tracking and Non-negative Matrix Factorization (NMF) [5]. Subsequent advances have been driven by deep learning models that take time–frequency representations (e.g., Short-Time Fourier Transform (STFT) and Combined Frequency and Periodicity (CFP)) as input. Representative sequence models output probability maps for onsets, durations (or frame-level activation), and offsets, which are converted to MIDI via thresholding, peak picking, and temporal connection. Omnizart [6] follows this line and is widely used as a practical baseline. Nevertheless, under conditions of high polyphony, heavy reverberation, extensive pedal usage, or recording-condition mismatch, controlling false positives and false negatives remains a significant challenge [7].
2.3. Visual Cues and Multimodal AMT
A complementary direction leverages visual cues from performance videos, such as keyboard layout and hand/fingertip motion, to assist audio models. Existing approaches can be broadly grouped into: (i) key- or fingertip-detection and alignment methods for estimating key activity [8], (ii) learned image-feature-based schemes [1,2,3], and (iii) temporal modeling of hand-skeleton trajectories [4,9]. Visual information is particularly effective for narrowing down onset candidates; however, it is susceptible to detection dropouts, uncertainty, occlusions, illumination changes, and viewpoint dependence. Consequently, naive merging of audio and visual streams may inadvertently increase both false positives and false negatives. Furthermore, practical systems must handle audio–video synchronization issues, such as frame-rate mismatch and latency.
3. Proposed Method
3.1. Overall Pipeline
Our system consists of the following four stages: (i) estimating audio-derived probability maps using Omnizart (Section 3.2), (ii) obtaining key-press probability maps based on hand-skeleton cues extracted from performance videos (Section 3.3), (iii) merging these two modalities, and (iv) converting the merged output into MIDI format for evaluation (Figure 3).
For the merging stage (iii), we propose two distinct approaches: (iii-1) a merging method based on thresholding and weighting (Section 3.4.3) and (iii-2) an approach using a CNN-based integration model (Section 3.5).
3.2. Audio Branch (Omnizart)
From the audio waveform, we compute a time–frequency representation (e.g., CFP/STFT) and feed it to the Omnizart piano model to obtain frame-wise, per-pitch probability maps for onsets, sustains (duration), and offsets. Specifically, for time frame $t$ and key $k \in \{1, \dots, 88\}$, we denote the probabilities as
$$P^{\mathrm{on}}_{t,k}, \quad P^{\mathrm{dur}}_{t,k}, \quad P^{\mathrm{off}}_{t,k} \in [0,1],$$
where $P^{\mathrm{on}}_{t,k}$ represents the probability that a note begins, $P^{\mathrm{dur}}_{t,k}$ represents the probability that the frame falls within the duration of a note, and $P^{\mathrm{off}}_{t,k}$ denotes the probability that a note ends. These maps are later aligned with the visual branch and used by the merging module (Figure 4).
3.3. Hand Skeleton Branch
We use MediaPipe Hands [10] to estimate 21 2D landmarks per hand in each video frame and treat the two hands as a 42-node graph [11] (Figure 5). Our proposed model, HandSkeletonNet, uses only the 2D coordinates as node features. When only one hand is detected, the frame is padded with zeros to 42 nodes.
MediaPipe Hands generally has good robustness to hand occlusion; however, in cases of severe occlusion, it may fail to detect one or both hands. Fortunately, when MediaPipe Hands produces low-confidence estimates, it refrains from outputting unreliable results, thereby preventing significant disruption to audio-based transcription. Rapid hand movements can also lead to reduced accuracy, particularly in pieces with frequent leaps. When audio and visual signals are well synchronized, this effect is limited; however, when they are temporally misaligned, its impact becomes more pronounced. Such synchronization is discussed in Section 3.4.2.
Let $X_t \in \mathbb{R}^{42 \times 2}$ denote the stacked keypoint coordinates at frame $t$. The model outputs per-MIDI-note key-press probabilities,
$$H \in [0,1]^{T_h \times 88},$$
where $T_h$ is the number of video frames and $H_{t,k}$ represents the probability for frame $t$ and key $k$. The sequence is later aligned to the unified time base and synchronized with the audio branch (Section 3.4.2).
3.4. Weighting and Masking of Probabilities
3.4.1. Notation
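Since the frame rate optimal for audio analysis, which affects frequency resolution, may differ from that used in video-based skeleton analysis, we use separate frame rates for each analysis and resample the probability maps to ensure consistency. Let $T_a$ and $T_h$ denote the original frame lengths of the audio and hand streams, respectively, and let $T$ denote the length of the unified time base. The audio probabilities are $P^{\mathrm{on}}, P^{\mathrm{dur}}, P^{\mathrm{off}} \in [0,1]^{T_a \times 88}$, and the hand probabilities are $H \in [0,1]^{T_h \times 88}$. After resampling (and alignment compensation), we obtain $\tilde{P}^{\mathrm{on}}, \tilde{P}^{\mathrm{dur}}, \tilde{P}^{\mathrm{off}} \in [0,1]^{T \times 88}$ and $\tilde{H} \in [0,1]^{T \times 88}$.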
3.4.2. Temporal Alignment
We resample the audio-derived probability maps onto the merged time axis of length $T$ by linear interpolation. Similarly, the hand-derived probabilities are linearly interpolated to the same length $T$. After resampling, we compensate for residual audio–video latency by searching for a global alignment offset. For each stream, we compute the frame-wise mean sequence and normalize it to a z-score by subtracting its temporal mean and dividing by its temporal standard deviation. We then search over a bounded range of integer frame offsets, select the offset that maximizes the cross-correlation between the two normalized sequences, and shift the hand stream accordingly.
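The resampling and offset search can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the frame-wise mean is taken over the key axis, the cross-correlation is normalized by the overlap length, and the helper names (`resample_time`, `find_offset`, `shift_time`) and the search bound `max_shift` are hypothetical rather than taken from the authors' implementation.

```python
import numpy as np

def resample_time(prob, n_out):
    """Linearly interpolate a (T_in, K) probability map to (n_out, K) along time."""
    t_in = np.linspace(0.0, 1.0, prob.shape[0])
    t_out = np.linspace(0.0, 1.0, n_out)
    cols = [np.interp(t_out, t_in, prob[:, k]) for k in range(prob.shape[1])]
    return np.stack(cols, axis=1)

def zscore(x, eps=1e-8):
    """Normalize a 1D sequence by its temporal mean and standard deviation."""
    return (x - x.mean()) / (x.std() + eps)

def find_offset(audio, hand, max_shift):
    """Return the integer shift tau (in frames) of the hand stream that maximizes
    the cross-correlation of the frame-wise mean z-scores of both streams."""
    a = zscore(audio.mean(axis=1))
    h = zscore(hand.mean(axis=1))
    best_tau, best_score = 0, -np.inf
    for tau in range(-max_shift, max_shift + 1):
        # Compare a[t] with h[t - tau] on the overlapping frames only.
        if tau >= 0:
            score = np.dot(a[tau:], h[:len(h) - tau])
        else:
            score = np.dot(a[:tau], h[-tau:])
        score /= len(a) - abs(tau)
        if score > best_score:
            best_tau, best_score = tau, score
    return best_tau

def shift_time(prob, tau):
    """Shift a (T, K) map by tau frames along time, zero-filling the gap."""
    out = np.zeros_like(prob)
    if tau > 0:
        out[tau:] = prob[:-tau]
    elif tau < 0:
        out[:tau] = prob[-tau:]
    else:
        out[:] = prob
    return out

# Synthetic check: the hand stream leads the audio stream by 3 frames.
base = np.random.default_rng(0).random((210, 4))
audio, hand = base[5:205], base[8:208]
tau = find_offset(audio, hand, max_shift=10)
aligned = shift_time(hand, tau)
```

A single global offset of this kind cannot correct drift that varies over the course of a piece, which is one reason residual misalignment may remain after compensation.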
3.4.3. Construction of the Weight Map
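We nonlinearly enhance the contrast of the hand-derived confidence map using an exponent $\gamma$, and impose a lower bound $w_{\mathrm{floor}}$ on the result. Writing $\tilde{H}$ for the resampled and aligned hand-derived probability map, we define the weight map as
$$W_{t,k} = \max\left(w_{\mathrm{floor}},\ \tilde{H}_{t,k}^{\gamma}\right).$$
Here, $\gamma$ controls the emphasis on high-confidence regions, and $w_{\mathrm{floor}}$ avoids excessive suppression in regions where the hand evidence is sparse.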
3.4.4. Application to Audio Probabilities and MIDI Inference
The continuous weight map $W$ is applied element-wise to the audio onset and duration probabilities. Note that we exclude the offset probability from this weighting to preserve release timings. Writing $\tilde{P}^{\mathrm{on}}$, $\tilde{P}^{\mathrm{dur}}$, and $\tilde{P}^{\mathrm{off}}$ for the resampled audio maps, the merged probabilities are given by
$$\hat{P}^{\mathrm{on}} = W \odot \tilde{P}^{\mathrm{on}}, \qquad \hat{P}^{\mathrm{dur}} = W \odot \tilde{P}^{\mathrm{dur}},$$
where ⊙ denotes element-wise multiplication. In this way, the audio probabilities are largely preserved at time–pitch positions where the hand-presence probability is high, while they are attenuated in regions with low hand-presence probability.
Finally, we convert the merged probability maps into a MIDI sequence using the Omnizart decoding API. Specifically, we feed the merged onset and duration maps $\hat{P}^{\mathrm{on}}$ and $\hat{P}^{\mathrm{dur}}$, together with the unmodified offset map $\tilde{P}^{\mathrm{off}}$, to the decoder. The decoder performs note tracking to obtain note events (onset–duration pairs) and writes them to a standard MIDI file.
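A minimal sketch of the weighting step is given below. It assumes that the weight map is the hand probability raised to an exponent and clipped from below by a floor, as the text describes; the function names and the default values of `gamma` and `floor` are placeholders for illustration, not the values used in the paper.

```python
import numpy as np

def weight_map(hand_prob, gamma, floor):
    """Contrast-enhance the hand probabilities and clip them from below."""
    return np.maximum(floor, np.clip(hand_prob, 0.0, 1.0) ** gamma)

def merge_by_weighting(onset, duration, offset, hand_prob, gamma=2.0, floor=0.1):
    """Apply the weight map to the onset and duration maps; keep the offset map."""
    w = weight_map(hand_prob, gamma, floor)
    return w * onset, w * duration, offset

hand = np.array([[1.0, 0.5, 0.0]])
onset = np.full((1, 3), 0.8)
duration = np.full((1, 3), 0.6)
offset = np.full((1, 3), 0.3)
m_on, m_dur, m_off = merge_by_weighting(onset, duration, offset, hand)
```

Because the weights never exceed 1, this scheme can only attenuate audio activations; any gain in recall has to come from the decoding thresholds chosen downstream.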
3.5. Learning-Based Merging Module
In addition to the parameterized weighting-and-masking scheme in Section 3.4, we implemented a Learning-based Merging Module that predicts the onset probability map from both audio- and hand-derived probability maps using a CNN.
3.5.1. Inputs
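We construct a 4-channel tensor $X$ by stacking the Omnizart probabilities (onset, duration, offset) and the hand-skeleton probability, all resampled to the unified time base, as follows:
$$X = \left[\tilde{P}^{\mathrm{on}};\ \tilde{P}^{\mathrm{dur}};\ \tilde{P}^{\mathrm{off}};\ \tilde{H}\right].$$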
3.5.2. Time Alignment
Before constructing the 4-channel input, we apply the same resampling and alignment compensation procedure as in Section 3.4.2 (i.e., linear interpolation to the unified time base and alignment offset search by maximizing cross-correlation of frame-wise mean z-scores).
3.5.3. Model and Training
A compact 2D CNN over the time–pitch plane is trained with ground-truth onset labels using a binary classification loss (BCEWithLogitsLoss with positive weighting) and the Adam optimizer. The network outputs onset logits, and the merged onset probability is obtained by applying the sigmoid function to these logits.
3.5.4. Inference and MIDI Decoding
At inference time, only the onset stream is replaced by the CNN output (the sigmoid of the predicted onset logits), while the duration and offset streams are taken unchanged from the resampled audio branch.
The resulting probability maps are converted to MIDI using Omnizart’s note-inference API with thresholds and hop size consistent with the unified time base.
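The input construction and the inference-time replacement of the onset stream can be sketched as follows. The channel-first layout, the helper names, and the `model` callable (a stand-in for the trained CNN that maps a `(4, T, 88)` array to `(T, 88)` logits) are assumptions made for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def build_input(onset, duration, offset, hand):
    """Stack four aligned (T, 88) maps into a (4, T, 88) array (assumed layout)."""
    return np.stack([onset, duration, offset, hand], axis=0)

def merge_with_model(onset, duration, offset, hand, model):
    """Replace only the onset stream with the model output; pass the rest through."""
    logits = model(build_input(onset, duration, offset, hand))
    return sigmoid(logits), duration, offset

# Hypothetical stand-in for the trained network: returns zero logits everywhere.
dummy_model = lambda x: np.zeros(x.shape[1:])

T = 6
maps = [np.full((T, 88), v) for v in (0.9, 0.7, 0.2, 0.4)]
x = build_input(*maps)
m_on, m_dur, m_off = merge_with_model(*maps, model=dummy_model)
```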
4. Dataset and Experimental Setup
4.1. Internal Dataset
To train our models and conduct a preliminary evaluation of our architecture, we compiled a small dataset of performance videos and their corresponding MIDI files for nine piano pieces. These include Flight of the Bumblebee by N. Rimsky-Korsakov (arranged by S. Rachmaninoff); seven other classical pieces composed by F. Liszt, F. Chopin, C. Debussy, and S. Rachmaninoff; and one arrangement of a popular song. The total duration of the audio clips in the dataset is 24.5 min. The videos are recorded from an overhead viewpoint, with all 88 keys visible. The audio sampling rate is 44.1 kHz, while the video frame rate is 25 frames per second (fps).
4.2. Partitioning Strategy
For each musical piece, its audio waveform was processed with Omnizart version 0.5.0 to obtain probability maps for onset, duration, and offset events. From the video, we extracted the two-dimensional coordinates of 21 keypoints per hand using MediaPipe Hands and converted them into PyTorch Geometric graphs with 42 nodes following the procedure described in Section 4.3.1. Since the main focus of this paper is the post hoc evaluation of the merging mechanism, we first train HandSkeletonNet once on the internal dataset and then keep its hyperparameters fixed for all experiments. We therefore do not perform cross-validation over HandSkeletonNet itself; instead, we evaluate the performance of the merging strategy on specific target pieces.
4.3. Training Procedure of HandSkeletonNet
HandSkeletonNet is trained as a binary classification model that estimates, for each frame, the onset activation probabilities of the 88 piano keys based on the sequence of hand-skeleton graphs. Supervision is provided by the onset piano roll derived from the MIDI annotation.
4.3.1. Preprocessing and Input Generation
For each frame, the 2D coordinates $(x, y)$ of the 21 keypoints per hand estimated by MediaPipe Hands are normalized by the image width and height, rescaling them to the range $[0, 1]$. If a hand is missing in a frame, it is filled by zero padding so that each frame contains 42 nodes in total. We use a sliding window of 9 frames with a stride of 1. We further consider an ablation setting with a window of 1 frame. In this case, the input reduces to the center frame only, and the model cannot exploit any temporal neighborhood. We use this variant to quantify the contribution of temporal context in the hand-skeleton branch. For each central time step $t$, we construct the sequence of frame-wise graphs covered by the window centered at $t$.
The edges are defined by the skeletal topology of each hand and remain fixed within each frame.
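The per-frame node features and the centered sliding window can be sketched as below. The ordering of the two hands within the 42 nodes, the zero padding at the sequence boundaries, and the helper names are assumptions for this example; the graph edges are omitted.

```python
import numpy as np

N_LANDMARKS = 21  # landmarks per hand

def frame_nodes(left, right, width, height):
    """Build (42, 2) node features from two optional (21, 2) pixel-coordinate arrays.

    Coordinates are divided by the image width and height; a missing hand
    (None) leaves its 21 rows at zero.
    """
    nodes = np.zeros((2 * N_LANDMARKS, 2), dtype=np.float32)
    for i, hand in enumerate((left, right)):
        if hand is not None:
            rows = slice(i * N_LANDMARKS, (i + 1) * N_LANDMARKS)
            nodes[rows] = np.asarray(hand, dtype=np.float32) / (width, height)
    return nodes

def sliding_windows(frames, window):
    """Return (T, window, 42, 2) windows centered on each frame, stride 1.

    Assumes an odd window; the sequence is zero-padded at both ends.
    """
    half = window // 2
    padded = np.pad(frames, ((half, half), (0, 0), (0, 0)))
    return np.stack([padded[t:t + window] for t in range(frames.shape[0])])

left = np.full((N_LANDMARKS, 2), (64.0, 36.0))
nodes = frame_nodes(left, None, width=128, height=72)

frames = np.arange(5, dtype=np.float32).reshape(5, 1, 1) * np.ones((5, 42, 2), dtype=np.float32)
win3 = sliding_windows(frames, 3)
win1 = sliding_windows(frames, 1)
```

With a window of one frame, each window contains only its center frame, which corresponds to the ablation without temporal context.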
4.3.2. Model Architecture
For each frame-wise graph, we first apply two layers of Graph Convolutional Network (GCN) with Rectified Linear Unit (ReLU) activation to extract node-level features. To enable the model to jointly recognize the shapes of both hands, we then aggregate node features into a single graph-level embedding by mean pooling over the node dimension. This produces one feature vector per frame, forming a temporal sequence over the frames of the window. The temporal dimension is then aggregated by a depthwise temporal convolution [12] over this sequence of graph-level embeddings. From the resulting feature sequence, we extract the vector corresponding to the central time step and feed it into a fully connected layer to obtain an 88-dimensional logit vector for that frame (Figure 6).
4.3.3. Loss Function
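We train the network with a composite loss that combines a class-imbalance-aware binary cross-entropy (BCE) with logits and a differentiable surrogate of the macro F1 score (soft-F1) [13]. Let $Z \in \mathbb{R}^{B \times 88}$ and $Y \in \{0,1\}^{B \times 88}$ denote the logit matrix and the ground-truth label matrix in a mini-batch of size $B$, respectively, and let $P = \sigma(Z)$ be the predicted probabilities after the sigmoid function $\sigma(\cdot)$.
Soft-F1
For each key $c \in \{1, \dots, 88\}$, we define soft counts over the mini-batch as follows:
$$\mathrm{TP}_c = \sum_{b=1}^{B} P_{b,c}\, Y_{b,c}, \qquad \mathrm{FP}_c = \sum_{b=1}^{B} P_{b,c}\,(1 - Y_{b,c}), \qquad \mathrm{FN}_c = \sum_{b=1}^{B} (1 - P_{b,c})\, Y_{b,c}.$$
Using these, the soft-F1 score for key $c$ is given by
$$\mathrm{F1}_c = \frac{2\,\mathrm{TP}_c}{2\,\mathrm{TP}_c + \mathrm{FP}_c + \mathrm{FN}_c + \epsilon},$$
where $\epsilon$ is a small constant for numerical stability. We then compute the macro score across 88 keys as
$$\mathrm{F1}_{\mathrm{macro}} = \frac{1}{88} \sum_{c=1}^{88} \mathrm{F1}_c.$$
Composite Loss
The final training objective is
$$\mathcal{L} = \mathrm{BCE}(Z, Y; w) + \lambda \left(1 - \mathrm{F1}_{\mathrm{macro}}\right),$$
where $w$ is the class-wise positive weight vector and $\lambda$ controls the contribution of the soft-F1 term.
Positive Class Weights
The class-wise positive weights $w$ are estimated from the training set to mitigate class imbalance [14]. For each key $c$, let $N^{+}_c$ and $N^{-}_c$ be the number of positive and negative labels in the training set, respectively. We define
$$w_c = \mathrm{clip}\left(\frac{N^{-}_c}{N^{+}_c}\right),$$
where $\mathrm{clip}(\cdot)$ truncates the ratio to a fixed interval.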
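The composite objective can be sketched in NumPy as follows. This is one common formulation of a positively weighted BCE combined with a soft-F1 term, written under the assumption that the loss is the BCE plus a weighted (1 − macro soft-F1) penalty; the clipping bounds `w_min` and `w_max`, the value of `eps`, and the weight `lam` are placeholders rather than the values used in the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pos_weights(labels, w_min, w_max):
    """Per-key ratio of negative to positive labels, clipped to [w_min, w_max]."""
    n_pos = labels.sum(axis=0)
    n_neg = labels.shape[0] - n_pos
    ratio = n_neg / np.maximum(n_pos, 1)  # guard against keys with no positives
    return np.clip(ratio, w_min, w_max)

def weighted_bce_with_logits(logits, labels, pos_weight):
    """Mean BCE on logits, with positive terms scaled per key by pos_weight."""
    softplus = lambda x: np.logaddexp(0.0, x)  # stable log(1 + exp(x))
    loss = pos_weight * labels * softplus(-logits) + (1 - labels) * softplus(logits)
    return loss.mean()

def soft_f1_macro(logits, labels, eps=1e-8):
    """Macro average over keys of F1 computed from soft TP/FP/FN counts."""
    p = sigmoid(logits)
    tp = (p * labels).sum(axis=0)
    fp = (p * (1 - labels)).sum(axis=0)
    fn = ((1 - p) * labels).sum(axis=0)
    return (2 * tp / (2 * tp + fp + fn + eps)).mean()

def composite_loss(logits, labels, pos_weight, lam):
    bce = weighted_bce_with_logits(logits, labels, pos_weight)
    return bce + lam * (1.0 - soft_f1_macro(logits, labels))

labels = np.array([[1, 0], [1, 0], [0, 0], [0, 1]], dtype=float)
w = pos_weights(labels, w_min=1.0, w_max=2.0)
confident = (2 * labels - 1) * 20.0  # large logits that agree with the labels
loss_good = composite_loss(confident, labels, w, lam=0.5)
loss_flat = composite_loss(np.zeros_like(labels), labels, w, lam=0.5)
```

Because the soft counts are sums of probabilities rather than thresholded decisions, the F1 term stays differentiable with respect to the logits.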
4.4. Training Procedure of Learning-Based Merging Module
Section 3.5 introduced a Learning-based Merging Module that predicts the onset probability map from both audio- and hand-derived probability maps.
4.4.1. Architecture
We use a compact 2D CNN over the time–pitch plane. The network consists of four convolution layers with ReLU activations, including a dilated temporal convolution to widen the receptive field in time (Figure 7). The model outputs onset logits, and the merged onset probability is obtained by applying the sigmoid function to these logits.
4.4.2. Loss and Optimization
We train the network using BCEWithLogitsLoss with a scalar positive weight, and optimization is performed using the Adam algorithm; the values used are summarized in Table 2. To avoid variable-length batching, each batch contains a single piece, and pieces are shuffled at each epoch.
4.5. Baselines
To ensure a fair comparison, all methods share the same evaluation pipeline and temporal alignment.
1. Audio Only (Omnizart): The audio-derived probability maps are converted to MIDI using Omnizart’s note-inference API.
2. Audio + Hand Skeleton Region: MediaPipe hand coordinates are mapped onto the keyboard positions. Notes that fall outside the physically reachable range of the hands are pruned based on a deterministic rule.
3. Proposed(i) Omnizart + GNN: The GNN-estimated hand-skeleton probabilities are merged with the audio probabilities using the proposed weighting and masking mechanism. The final piano roll is converted to MIDI using the Omnizart API.
4. Proposed(ii) Omnizart + GNN + CNN: The audio and hand-skeleton probabilities are merged using the CNN-based merging model. At inference time, only the onset stream is replaced by the CNN output, while the duration and offset streams are taken from the audio branch (Section 3.5).
4.6. Hyperparameters
4.6.1. Merging Hyperparameters
The merging process (Proposed(i) Omnizart + GNN) is controlled by the hyperparameters summarized in Table 1 and explained below.
Weight Floor: A lower bound imposed on the continuous weight map W. Even when the hand-skeleton probabilities are low or missing, W is clipped from below by this floor so that the audio stream retains a minimum contribution.
Weight Gamma ($\gamma$): An exponent used when constructing W from the hand-skeleton probabilities, emphasizing their high-confidence regions in the continuous weight map.
Onset Threshold and Duration Threshold: Decision thresholds used to binarize the onset and duration streams in the final MIDI conversion stage. These thresholds are applied to the score values used by the Omnizart decoding stage rather than to normalized probabilities, so their candidate values are not restricted to the interval $[0, 1]$.
For the pieces contained in the training dataset of HandSkeletonNet, these hyperparameters were tuned independently for each song by an exhaustive grid search over the candidate sets (Table 1). For each song, we selected the configuration yielding the highest onset-only F1 score under a latency-compensated evaluation with a fixed onset tolerance. Since this per-song tuning utilizes ground-truth annotations, the results should be regarded as an upper bound on the achievable performance.
In addition, we defined a single global configuration, reported in the “Adapted Value” column of Table 1. These values were calculated by averaging the optimal hyperparameters obtained for each piece in the training dataset. This global setting is intended for evaluating pieces outside the training dataset (e.g., external test pieces) where ground-truth tuning is not feasible.
4.6.2. HandSkeletonNet Hyperparameters
HandSkeletonNet utilizes specific architectural and optimization hyperparameters, which were kept fixed across all experiments. Unless otherwise noted, we used a temporal window of 9 frames with a stride of 1. The network consists of a GCN block with a hidden dimension of 128, followed by a linear projection to a 256-dimensional feature space. The temporal aggregation is performed by a two-layer depthwise Temporal Convolution Network (TCN) with a kernel size of 3 and a dropout rate of 0.1. The final fully connected layer outputs 88 logits per frame.
The model was trained using the Adam optimizer with weight decay and a batch size of 32. Gradient clipping was applied with a maximum global norm of 5.0. The weight of the soft-F1 loss term was likewise kept fixed. These hyperparameters were determined empirically based on preliminary experiments.
4.6.3. Learning-Based Merging Module Hyperparameters
Table 2 summarizes the training parameters for the Learning-based Merging Module, whose training procedure is described in Section 4.4.
4.7. Evaluation Metrics
We compute frame-level metrics using mir_eval.multipitch [15]. For each frame, we obtain the predicted and reference pitch sets and count true positives (TP), false positives (FP), and false negatives (FN). We then aggregate TP/FP/FN over all frames and report the following:
$$\mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \quad \mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \quad \mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \quad \mathrm{Accuracy} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}.$$
Note that this accuracy is the multi-pitch set-based accuracy defined in mir_eval.multipitch, not a per-key classification accuracy.
We also report onset-level precision/recall/F1 by evaluating note-onset events. From the predicted MIDI and the ground-truth MIDI, we extract onset events as pairs of (onset time, pitch). A predicted onset is counted as a true positive if there exists a ground-truth onset with the same pitch whose onset time is within a fixed tolerance; otherwise, it is a false positive. Unmatched ground-truth onsets are counted as false negatives. We then compute Precision/Recall/F1 from the aggregated TP/FP/FN counts.
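The onset-level counting can be sketched in plain Python as below. The greedy one-to-one matching (each ground-truth onset is used at most once, nearest in time first) and the tolerance of 0.05 s in the usage example are assumptions of this sketch; the text does not state whether the authors' matching is one-to-one.

```python
def match_onsets(pred, ref, tol):
    """Count TP/FP/FN for onset events given as (onset_sec, pitch) pairs.

    Each predicted onset is matched to the nearest unused reference onset
    of the same pitch within +/- tol seconds (greedy, one-to-one).
    """
    used = [False] * len(ref)
    tp = 0
    for t, pitch in sorted(pred):
        best, best_dist = None, None
        for j, (ref_t, ref_pitch) in enumerate(ref):
            if used[j] or ref_pitch != pitch:
                continue
            dist = abs(ref_t - t)
            if dist <= tol and (best is None or dist < best_dist):
                best, best_dist = j, dist
        if best is not None:
            used[best] = True
            tp += 1
    return tp, len(pred) - tp, len(ref) - tp

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

ref = [(1.0, 60), (2.0, 64)]
pred = [(1.03, 60), (2.2, 64), (3.0, 67)]
counts = match_onsets(pred, ref, tol=0.05)
p, r, f1 = precision_recall_f1(*counts)
```

In this example, the first prediction matches, the second is too late, and the third has no reference onset of the same pitch.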
5. Results
We report frame-level and onset-level metrics defined in Section 4.7. We also report an ablation of the hand-skeleton branch with a single-frame window, which removes temporal context and uses only the center-frame GNN features.
5.1. Evaluation with the Internal Dataset
We first evaluate the performance on the dataset used for training HandSkeletonNet. Table 3 compares the precision, recall, and F1 scores of the baselines and the proposed merging methods.
The audio baseline (Omnizart) achieves an onset-level F1 of 64.09% (precision = 58.74%, recall = 71.27%) and a frame-level F1 of 39.38% (precision = 69.98%, recall = 27.4%, accuracy = 24.17%).
The Hand Skeleton Region baseline yields very high onset-level precision (83.89%) but low recall (42.23%), resulting in a reduced onset-level F1 of 53.02%. At the frame level, it also degrades performance (frame-level F1 = 27.87%), indicating that hard spatial restriction can suppress correct activations.
For rule-based merging, Naive product attains onset-level F1 = 56.97% and frame-level F1 = 31.96%. With a single Global params configuration, the temporal-window comparison shows that the single-frame window performs better than the 9-frame window on both metrics (onset-level F1: 63.95% vs. 57.37%; frame-level F1: 39.99% vs. 31.91%). The Optimized params setting yields onset-level F1 = 60.73% and frame-level F1 = 31.74%, and thus does not exceed the best global configuration in this re-evaluation.
Finally, the learning-based merging model (Omnizart + GNN + CNN) achieves the best performance on the internal dataset, with onset-level F1 = 66.38% (precision = 66.43%, recall = 67.44%) and frame-level F1 = 42.96% (precision = 64.8%, recall = 32.14%, accuracy = 26.69%).
5.2. Evaluation with External Datasets
Next, we evaluate the generalization performance on external datasets that were not used for training HandSkeletonNet. Table 4 and Table 5 present a comparison between the baseline method and the proposed approach. For rule-based merging, we used the global parameter values shown in Table 1, which were obtained through adaptation using the internal dataset. The audio sampling rate used in the experiments is 44.1 kHz, while the video frame rate is 25 fps for the PianoYT dataset and 60 fps for the PianoVAM dataset.
5.2.1. External Evaluation on PianoYT
Table 4 summarizes the results on the PianoYT dataset [1]. The audio baseline achieves onset-level precision = 81.23%, recall = 96.56%, and F1 = 87.68%, while the frame-level F1 is 75.12%. For rule-based merging with Global params, both temporal-window settings improve the onset-level metrics, yielding onset-level F1 = 90.06% and 90.07%. Their frame-level performance is also similar (frame-level F1 = 75.69% for both settings; accuracy = 59.23%). The learning-based merging model (Omnizart + GNN + CNN) achieves onset-level F1 = 86.93% and the best frame-level F1 = 75.76%.
5.2.2. External Evaluation on PianoVAM
Table 5 reports the results on the PianoVAM dataset [9]. The audio baseline attains onset-level precision = 62.51%, recall = 77.05%, and F1 = 67.94% and frame-level F1 = 54.68%. Rule-based merging improves the onset-level results for both temporal-window settings, which yield onset-level F1 = 70.3% and 68.89%. Their frame-level performance remains close (frame-level F1 = 55.28% and 55.30%, respectively). The learning-based merging model (Omnizart + GNN + CNN) achieves onset-level F1 = 69.78% and the best frame-level F1 = 57.57%.
5.3. Qualitative Analysis
To better understand the behavior of the proposed method, we visually inspect the generated piano rolls. Figure 8, Figure 9, Figure 10, and Figure 11 compare the ground truth, the audio-only baseline output, and the proposed method’s output.
5.3.1. Recovery of Missed Notes
Figure 8 shows a typical success case demonstrating the recovery of false negatives. In the audio-only baseline (middle row), the chord progressions in the lower region played by the left hand, as well as the fast chromatic passages in the latter half, are largely missing. However, as shown in the piano roll of the proposed method (bottom row), these notes—which were absent in the baseline—have been successfully restored. By using the hand-skeleton probabilities as a soft spatial guide, the proposed method preserves plausible audio activations in played regions while attenuating implausible regions, bringing the transcription closer to the ground truth.
5.3.2. Suppression of Noise Artifacts
Figure 9 illustrates the suppression of false positives. As indicated by the red arrows in the baseline piano roll (middle row), scattered noise artifacts are observed in both the lower and higher regions. It is evident that such spurious notes are effectively suppressed in the proposed method (bottom row). This demonstrates that the weighting mechanism based on hand-skeleton probabilities correctly filters out audio activations in regions where no hand activity is detected.
5.3.3. Failure Cases: Introduction of Noise and Suppression of Valid Notes
While the proposed method generally improves performance, there are instances where the integration of visual cues degrades the transcription quality. Figure 10 illustrates a case where the proposed method introduces noise artifacts that were absent in the audio-only baseline. As indicated by the red arrows, spurious notes appear in close proximity to the ground-truth notes, as well as in regions far removed from the main melody line. This suggests that, in some frames, the hand skeleton model may output false-positive probabilities due to visual ambiguity or motion blur, which are then erroneously merged with the audio predictions.
Conversely, Figure 11 demonstrates an instance where correct notes, originally captured by the audio baseline, are inadvertently suppressed by the proposed method (indicated by the red arrows in the central region). Similar to Figure 10, the emergence of unnecessary notes is also observed in this example. These cases highlight that integrating probability values from the hand-skeleton model does not always guarantee improvement and can, in certain scenarios, lead to a reduction in transcription accuracy due to visual tracking errors or misalignment.
6. Discussion
6.1. Effect of Temporal Context in the Hand-Skeleton Branch
A key question in our hand-skeleton branch is whether temporal context improves the overall transcription quality. The ablation provides evidence that temporal context does not consistently yield gains.
As shown in Table 3, on the internal dataset, rule-based merging with Global params performs better with the single-frame window (onset-level F1 = 63.95%; frame-level F1 = 39.99%) than with the 9-frame window (onset-level F1 = 57.37%; frame-level F1 = 31.91%). This suggests that, under the current training and alignment conditions, temporal aggregation in HandSkeletonNet can be detrimental. One plausible explanation is that onset events are temporally sharp and sparse, and temporal modeling may blur or shift peaks. If residual audio–video misalignment remains after global offset compensation, such peak shifts can propagate to the downstream weighting or masking behavior used in rule-based merging, leading to reduced detection quality.
In contrast, on external datasets, the difference between the two window settings becomes small (Table 4 and Table 5). On PianoYT, onset-level F1 is 90.06% versus 90.07% and frame-level F1 is identical (75.69%). On PianoVAM, onset-level F1 is 70.3% versus 68.89%, while frame-level F1 is nearly identical (55.28% vs. 55.30%). These results indicate that temporal context in the hand branch provides limited additional benefit under domain shift, and that the generalization bottleneck is likely dominated by other factors such as hand detection reliability, viewpoint/scale differences, and residual alignment errors.
6.2. Learned Merging vs. Rule-Based Merging
Across datasets, the learned merging model (Omnizart + GNN + CNN) improves at least one of the two evaluation levels over the audio baseline. On the internal dataset, it achieves the best overall performance with onset-level F1 = 66.38% and frame-level F1 = 42.96% (Table 3). This suggests that the CNN can learn nonlinear decision rules that are not captured by fixed-form rule-based merging.
On external datasets, the learned model improves frame-level performance but does not consistently exceed the best rule-based onset-level score. On PianoYT, the CNN achieves the best frame-level F1 (75.76%), but its onset-level F1 (86.93%) is lower than the rule-based results (90.06%/90.07%). On PianoVAM, the CNN again yields the best frame-level F1 (57.57%), while onset-level F1 (69.78%) is slightly below the best rule-based result (70.3%). This indicates that the learned model can shift the balance between frame-wise activity and onset-event detection differently across domains. A practical implication is that the optimal merging strategy may depend on the target dataset and the downstream use-case (e.g., onset-sensitive evaluation versus sustained frame activation).
6.3. Error Taxonomy on External Datasets
To better understand error sources on external datasets, we analyze false positives (FP) and lost true positives (Lost-TP) using an explicit taxonomy (Table 6). In our implementation, FP events are defined as unmatched onset events in the merged (rule-based) MIDI output, and Lost-TP events are defined as Ground Truth (GT) onsets that are matched by the audio baseline within the onset tolerance but are not matched by the merged output. FP-A corresponds to timing/alignment-related cases, FP-B to hand-driven spurious activations, and FP-C to audio-driven hallucinations not suppressed by the hand cue. Lost-A captures timing shifts in the merged output, while Lost-B (dropout) captures cases where the hand cue is weak or absent despite strong audio evidence.
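The FP and Lost-TP definitions above can be sketched in a few lines. This is a minimal illustration rather than the evaluation code used in this study: the greedy same-pitch matcher, the function names, and the 50 ms tolerance are all assumptions made for the example.

```python
from typing import List, Tuple

Onset = Tuple[float, int]  # (onset time in seconds, MIDI pitch)


def match_onsets(ref: List[Onset], est: List[Onset], tol: float = 0.05):
    """Greedily match each reference onset to the closest unused estimate
    of the same pitch within +/- tol seconds (tol is an assumed value).

    Returns (matched_ref_indices, matched_est_indices) as sets.
    """
    matched_ref, matched_est = set(), set()
    for i, (t_ref, pitch_ref) in enumerate(ref):
        best_j, best_dt = None, None
        for j, (t_est, pitch_est) in enumerate(est):
            if j in matched_est or pitch_est != pitch_ref:
                continue
            dt = abs(t_est - t_ref)
            if dt <= tol and (best_dt is None or dt < best_dt):
                best_j, best_dt = j, dt
        if best_j is not None:
            matched_ref.add(i)
            matched_est.add(best_j)
    return matched_ref, matched_est


def fp_and_lost_tp(gt, audio, merged, tol=0.05):
    """FP: merged onsets that match no GT onset.
    Lost-TP: GT onsets matched by the audio baseline but not by the merged output."""
    gt_hit_by_audio, _ = match_onsets(gt, audio, tol)
    gt_hit_by_merged, merged_hit = match_onsets(gt, merged, tol)
    fp = [merged[j] for j in range(len(merged)) if j not in merged_hit]
    lost_tp = [gt[i] for i in sorted(gt_hit_by_audio - gt_hit_by_merged)]
    return fp, lost_tp


# Toy example: the merged output contains one extra fragment and one shifted onset.
gt = [(0.00, 60), (1.00, 64)]
audio = [(0.01, 60), (1.02, 64)]
merged = [(0.01, 60), (0.20, 60), (1.12, 64)]
fp, lost_tp = fp_and_lost_tp(gt, audio, merged)
```

In this toy case the onset shifted to 1.12 s is counted both as an FP and as a Lost-TP, which is the kind of paired timing error that the A categories are meant to capture.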
Table 6 indicates that FP events are dominated by category A on both datasets. In particular, FP-A accounts for 74.67% (PianoYT) and 62.01% (PianoVAM) for the rule-based merged output, while FP-B and FP-C are near zero in all settings. This suggests that the dominant FP mechanism is not purely “hand-only” or “audio-only” hallucination, but rather timing/alignment inconsistencies or locally shifted peaks.
Similarly, Lost-TP is dominated by category A (95.20% on PianoYT and 90.00% on PianoVAM for the rule-based GNN setting with global parameters), indicating that many missed events are attributable to timing shifts rather than complete disappearance of acoustic evidence. These observations align with the limited benefit of temporal context in the hand branch on external datasets: when timing-related instability dominates, adding temporal aggregation does not necessarily resolve the underlying mismatch.
Comparing settings also highlights that learned merging changes the balance between FP and Lost-TP. On PianoYT, the FP count decreases from 17,436 to 14,155, but Lost-TP increases from 2479 to 8843. On PianoVAM, FP increases from 33,966 to 37,229 and Lost-TP changes from 2900 to 3065. These shifts imply that improving one error source (e.g., suppressing spurious onsets) can introduce another (e.g., over-suppression leading to missed events), which helps explain why the best method depends on the evaluation level and the target domain (Table 4 and Table 5).
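To make the size of this trade-off explicit, the short script below recomputes the absolute and relative changes from the counts in Table 6. The dictionary keys (`rule`, `cnn`) and the helper name are illustrative choices, not identifiers from our implementation.

```python
# Counts copied from Table 6: (FP, Lost-TP) for rule-based and learned merging.
counts = {
    "PianoYT": {"rule": (17436, 2479), "cnn": (14155, 8843)},
    "PianoVAM": {"rule": (33966, 2900), "cnn": (37229, 3065)},
}


def change(before: int, after: int):
    """Return (absolute change, relative change in percent)."""
    return after - before, 100.0 * (after - before) / before


summary = {}
for name, c in counts.items():
    fp = change(c["rule"][0], c["cnn"][0])
    lost = change(c["rule"][1], c["cnn"][1])
    summary[name] = {"fp": fp, "lost_tp": lost}
    print(f"{name}: FP {fp[0]:+d} ({fp[1]:+.1f}%), "
          f"Lost-TP {lost[0]:+d} ({lost[1]:+.1f}%)")
```

On PianoYT, learned merging removes 3281 FP events but adds 6364 Lost-TP events, whereas on PianoVAM both counts rise (by 3263 and 165, respectively).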
6.4. Note Fragmentation in MIDI Decoding
We frequently observe note fragmentation during MIDI decoding, where a single GT note is represented as multiple short notes in the predicted MIDI output. This phenomenon increases the number of onset events and can directly affect onset-level precision: even if one fragment matches a GT onset within the onset tolerance, additional fragments are counted as false positives. Moreover, fragmentation can exacerbate timing-related errors that fall into FP-A and Lost-A, because short fragments often produce jittered or shifted onset times around the true onset.
From the perspective of error analysis, note fragmentation can inflate FP counts and make timing-related categories appear even more dominant, especially on external datasets where probability peaks are less stable. Therefore, improving the MIDI decoding and post-processing stage is complementary to improving the merging strategy. Practical countermeasures include enforcing a minimum inter-onset interval per pitch, merging adjacent note segments separated by short gaps, or applying hysteresis thresholds when converting probability maps to binary events. Such post-processing can reduce spurious fragments and yield a more faithful event representation, potentially improving onset-level metrics without changing the upstream probability estimation.
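Two of the countermeasures listed above, gap merging and hysteresis thresholding, can be sketched as follows. This is a hedged illustration and not part of the evaluated pipeline: the function names, the 30 ms gap, and the 0.6/0.3 thresholds are assumed values chosen only for the example.

```python
from typing import List, Tuple

Note = Tuple[float, float]  # (onset, offset) in seconds, for a single pitch


def merge_fragments(notes: List[Note], max_gap: float = 0.03) -> List[Note]:
    """Merge consecutive same-pitch notes whose silent gap is <= max_gap."""
    merged: List[Note] = []
    for onset, offset in sorted(notes):
        if merged and onset - merged[-1][1] <= max_gap:
            prev_on, prev_off = merged[-1]
            merged[-1] = (prev_on, max(prev_off, offset))
        else:
            merged.append((onset, offset))
    return merged


def hysteresis_decode(probs: List[float], high: float = 0.6, low: float = 0.3):
    """Turn one pitch's frame probabilities into (start, end) frame intervals.

    A note starts when the probability reaches `high` and continues until it
    drops below `low`; the end index is exclusive.
    """
    intervals, start = [], None
    for i, p in enumerate(probs):
        if start is None and p >= high:
            start = i
        elif start is not None and p < low:
            intervals.append((start, i))
            start = None
    if start is not None:
        intervals.append((start, len(probs)))
    return intervals


# Three fragments of one pitch; the first two are separated by a 20 ms gap.
fragments = [(0.00, 0.10), (0.12, 0.30), (0.50, 0.70)]
merged_notes = merge_fragments(fragments)

# A probability dip in the middle of a sustained note.
probs = [0.1, 0.7, 0.5, 0.4, 0.65, 0.2, 0.1]
single = hysteresis_decode(probs)
```

With a single threshold of 0.6, the probability sequence above would be decoded as two separate notes (frames 1 and 4), whereas the two-threshold rule keeps it as one interval.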
While note fragmentation is problematic in generating sheet music, precise note segmentation is not necessary in certain applications, for example, when performing harmonic analysis or when resynthesizing with instruments with weak attack characteristics, such as wind instruments. In such scenarios, our method is expected to be effective.
7. Conclusions
In this paper, we investigated multimodal automatic piano transcription by combining an audio-based transcription model with visual hand-skeleton cues extracted from performance videos. The underlying motivation was that hand motion provides spatial information about which pitches are likely to be played and when, potentially complementing audio-based predictions in acoustically ambiguous situations.
To explore this idea, we proposed HandSkeletonNet to estimate pitch-wise probability representations from hand-skeleton sequences and examined multiple merging strategies to integrate these visual probabilities with the audio probabilities produced by Omnizart. Both rule-based merging and learning-based merging using a CNN were evaluated on our self-compiled dataset and on two external datasets (PianoYT and PianoVAM).
On our dataset, incorporating hand-skeleton cues improved onset-level transcription compared with the audio-only baseline, and the learning-based merging model achieved the best overall performance. This indicates that, within the training domain, the model can learn useful relationships between audio and visual modalities. However, experiments with different temporal window sizes revealed that temporal context in the hand-skeleton branch does not consistently improve performance: using a single-frame representation outperformed a wider temporal window on our dataset.
On the external datasets, merging improved frame-level performance in some cases, but consistent onset-level improvements were not observed; in other words, onset-level metrics were sensitive to domain shift. Error analysis showed that false positives and missed events were dominated by timing- and alignment-related errors rather than by pure audio- or hand-driven hallucinations. These results suggest that residual audio–video misalignment and temporal instability limit the effectiveness of hand-skeleton cues on unseen data, and that adding temporal context alone is insufficient to address this issue. For these reasons, the applicability of our method may be limited to specific scenarios, such as harmonic analysis.
We also observed frequent note fragmentation during MIDI decoding, where a single ground-truth note is represented as multiple short notes in the predicted output. This phenomenon inflates onset-level false positives and further amplifies timing-related errors, highlighting the importance of post-processing and event-level consistency in addition to probability estimation.
In conclusion, hand-skeleton cues can serve as a complementary signal for piano transcription, particularly within the training domain. However, achieving robust onset-level improvements on unseen datasets remains challenging due to timing instability, alignment errors, and note fragmentation. As future work, it would be beneficial to expand the dataset, particularly by incorporating videos captured from diverse viewpoints and covering a wider range of composers’ time periods, and to explore estimation using multiple cameras simultaneously, in order to improve accuracy, robustness to visual variations, and generalization across datasets. In addition, explicit handling of temporal alignment and MIDI post-processing within the network should be investigated to ensure more stable and faithful event representations.
Author Contributions
Conceptualization, K.Y., S.N. and J.S.; methodology, K.Y. and S.N.; software, K.Y.; validation, K.Y. and S.N.; formal analysis, K.Y.; investigation, K.Y.; resources, K.Y. and S.N.; data curation, K.Y.; writing—original draft preparation, K.Y.; writing—review and editing, K.Y., S.N., and J.S.; visualization, K.Y. and S.N.; supervision, S.N. and J.S.; project administration, S.N. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Data Availability Statement
The datasets presented in this article are not readily available due to copyright reasons. Requests to access the datasets should be directed to the corresponding author.
Acknowledgments
We sincerely thank Julián Villegas at the University of Aizu for his careful review and constructive feedback, which greatly improved this research.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Koepke, A.S.; Weng, O.; Leman, D.; Zisserman, A.; Sewell, C. Sight to Sound: An End-to-End Approach for Visual Piano Transcription. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: Piscataway, NJ, USA, 2020; pp. 1838–1842.
- Wang, X.; Xu, W.; Liu, J.; Yang, W.; Cheng, W. An Audio-Visual Fusion Piano Transcription Approach Based on Strategy. In Proceedings of the 2021 24th International Conference on Digital Audio Effects (DAFx); IEEE: Piscataway, NJ, USA, 2022.
- Li, Y.; Wang, X.; Wu, R.; Xu, W.; Cheng, W. A Two-Stage Audio-Visual Fusion Piano Transcription Model Based on the Attention Mechanism. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 32, 3618–3630.
- Li, Y.; Wang, X.; Wu, R.; Xu, W.; Chen, W. A CRNN-GCN Piano Transcription Model Based On Audio And Skeleton Features. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW); IEEE: Piscataway, NJ, USA, 2023; pp. 1–5.
- Benetos, E.; Dixon, S.; Giannoulis, D.; Kirchhoff, H.; Klapuri, A. Automatic Music Transcription: Challenges and Future Directions. J. Intell. Inf. Syst. 2013, 41, 407–434.
- Wu, Y.-T.; Chen, B.; Su, L. Omnizart: A General Toolbox for Automatic Music Transcription. J. Open Source Softw. 2021, 6, 3391.
- Hawthorne, C.; Elsen, E.; Song, J.; Roberts, A.; Simon, I.; Raffel, C.; Engel, J.; Oore, S.; Eck, D. Onsets and Frames: Dual-Objective Piano Transcription. arXiv 2017, arXiv:1710.11153.
- Zivanovic, U.; Pilkov, I.; Cancino-Chacón, C.E. Pay Attention to the Keys: Visual Piano Transcription Using Transformers. In Proceedings of the 34th International Joint Conference on Artificial Intelligence (IJCAI), Montreal, QC, Canada, 8–16 August 2025.
- Kim, Y.; Park, J.; Bae, J.; Kim, K.; Kwon, T.; Lerch, A.; Nam, J. PianoVAM: A Multimodal Piano Performance Dataset. In Proceedings of the 26th International Society for Music Information Retrieval Conference (ISMIR), Daejeon, Republic of Korea, 21–25 September 2025.
- Zhang, F.; Bazarevsky, V.; Vakunov, A.; Tkachenka, A.; Sung, G.; Chang, C.-L.; Grundmann, M. MediaPipe Hands: On-device Real-time Hand Tracking. arXiv 2020, arXiv:2006.10214.
- Yan, S.; Xiong, Y.; Lin, D. Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI: Washington, DC, USA, 2018.
- Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2016; pp. 1800–1807.
- Bénédict, G.; Koops, V.; Odijk, D.; de Rijke, M. sigmoidF1: A Smooth F1 Score Surrogate Loss for Multilabel Classification. arXiv 2021, arXiv:2108.10566.
- Lin, T.-Y.; Goyal, P.; Girshick, R.B.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 42, 318–327.
- Raffel, C.; McFee, B.; Humphrey, E.J.; Salamon, J.; Nieto, O.; Liang, D.; Ellis, D.P.W. mir_eval: A Transparent Implementation of Common MIR Metrics. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Taipei, Taiwan, 27–31 October 2014.
Figure 1.
Illustration of Audio-based AMT.
Figure 2.
Spectrogram illustrating overtones detected from C4.
Figure 3.
Overview of the proposed method.
Figure 4.
Probabilities estimated by Omnizart (Onset, Duration, Offset).
Figure 5.
Hand-skeleton graph derived from MediaPipe Hands.
Figure 6.
Structure of HandSkeletonNet.
Figure 7.
Architecture of Learning-based Merging Module.
Figure 8.
Comparison of note recovery. Ground Truth (Top), Audio Only (Middle), and Proposed Method (Bottom). The horizontal axis represents time, while the vertical axis represents pitch. Each note is shown as a blue bar. The proposed method successfully recovers the lower chords and fast chromatic scales that were missed by the baseline.
Figure 9.
Comparison of noise suppression. The horizontal axis represents time, and the vertical axis represents pitch. The red arrows in the baseline indicate scattered noise artifacts in the low and high regions, which are filtered out by the proposed method.
Figure 10.
Failure case showing noise introduction. Ground Truth (Top), Audio Only (Middle), and Proposed Method (Bottom). The horizontal axis represents time, and the vertical axis represents pitch. The proposed method generates spurious notes near correct notes and in disjoint regions (red arrows), which were not present in the baseline.
Figure 11.
Failure case showing note suppression. The horizontal axis represents time, and the vertical axis represents pitch. The red arrows in the center indicate valid notes that were correctly detected by the baseline but suppressed by the proposed method. Additionally, some spurious noise is introduced.
Table 1.
Merging hyperparameters used in the proposed method.
| Name | Symbol | Candidate Values | Adopted Value |
|---|---|---|---|
| Weight Floor | | | |
| Weight Gamma | | | |
| Onset Thresh | | | |
| Duration Thresh | | | |
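For orientation, one possible reading of how the weight floor and gamma in Table 1 could act on the probability maps is sketched below. The functional form (floor plus a gamma-shaped hand weight), the function names, and all numeric values are assumptions made for illustration; the exact formula and the adopted values are not reproduced here. The form was chosen only because it reduces to the naive product when the floor is 0 and gamma is 1.

```python
def merge_probability(p_audio: float, p_hand: float,
                      floor: float = 0.2, gamma: float = 2.0) -> float:
    """Scale an audio probability by a hand-derived weight.

    Assumed form: weight = floor + (1 - floor) * p_hand ** gamma
    """
    weight = floor + (1.0 - floor) * (p_hand ** gamma)
    return p_audio * weight


def decode_onsets(audio_row, hand_row, onset_thresh=0.5, **kw):
    """Threshold the merged probabilities of one pitch into onset flags."""
    return [merge_probability(a, h, **kw) >= onset_thresh
            for a, h in zip(audio_row, hand_row)]


# Strong audio with full hand support, no hand support, and partial support.
audio_row = [0.9, 0.9, 0.6]
hand_row = [1.0, 0.0, 0.5]
flags = decode_onsets(audio_row, hand_row)

# floor = 0 and gamma = 1 recover the naive product p_audio * p_hand.
naive = merge_probability(0.8, 0.5, floor=0.0, gamma=1.0)
```

Under these assumed values, a complete hand dropout scales the audio probability to 20% of its value instead of zeroing it, which is the role a probability floor would play in limiting over-suppression.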
Table 2.
Training hyperparameters for the Learning-based Merging Module (implementation setting).
| Name | Symbol | Value |
|---|---|---|
| Input channels | | 4 (onset, offset, duration, hand) |
| Base channels | – | 64 |
| Dropout | – | 0.1 |
| Dilated conv | – | kernel , dilation |
| Loss | – | BCEWithLogitsLoss |
| Positive weight | – | |
| Optimizer | – | Adam |
| Learning rate | | |
| Batch size | | 1 |
Table 3.
Evaluation results on the internal dataset. “Naive product” refers to simple multiplication without the weight floor and gamma parameters. “Optimized params” represents the performance when the hyperparameters shown in Table 1 are tuned per song.
| Method | Onset Precision (%) | Onset Recall (%) | Onset F1 (%) | Frame Precision (%) | Frame Recall (%) | Frame Accuracy (%) | Frame F1 (%) |
|---|---|---|---|---|---|---|---|
| Omnizart (Audio baseline) | 58.74 | 71.27 | 64.09 | 69.98 | 27.40 | 24.17 | 39.38 |
| Omnizart + Hand Skeleton Region | 83.89 | 42.23 | 53.02 | 68.61 | 17.49 | 16.50 | 27.87 |
| Omnizart + GNN (Naive product) | 68.50 | 54.55 | 56.97 | 64.77 | 21.21 | 19.49 | 31.96 |
| Omnizart + GNN (Global params, window 1) | 66.41 | 62.64 | 63.95 | 70.29 | 27.94 | 24.69 | 39.99 |
| Omnizart + GNN (Global params, window 9) | 65.52 | 54.23 | 57.37 | 68.81 | 20.77 | 19.37 | 31.91 |
| Omnizart + GNN (Optimized params) | 74.09 | 56.11 | 60.73 | 68.57 | 20.65 | 19.29 | 31.74 |
| Omnizart + GNN + CNN | 66.43 | 67.44 | 66.38 | 64.80 | 32.14 | 26.69 | 42.96 |
Table 4.
Evaluation results on the PianoYT dataset.
| Method | Onset Precision (%) | Onset Recall (%) | Onset F1 (%) | Frame Precision (%) | Frame Recall (%) | Frame Accuracy (%) | Frame F1 (%) |
|---|---|---|---|---|---|---|---|
| Omnizart (Audio baseline) | 81.23 | 96.56 | 87.68 | 71.69 | 78.89 | 58.72 | 75.12 |
| Omnizart + GNN (Global params, window 1) | 86.41 | 94.83 | 90.06 | 71.55 | 80.34 | 59.23 | 75.69 |
| Omnizart + GNN (Global params, window 9) | 86.42 | 94.84 | 90.07 | 71.54 | 80.34 | 59.23 | 75.69 |
| Omnizart + GNN + CNN | 83.37 | 91.32 | 86.93 | 73.27 | 78.43 | 59.28 | 75.76 |
Table 5.
Evaluation results on the PianoVAM dataset.
| Method | Onset Precision (%) | Onset Recall (%) | Onset F1 (%) | Frame Precision (%) | Frame Recall (%) | Frame Accuracy (%) | Frame F1 (%) |
|---|---|---|---|---|---|---|---|
| Omnizart (Audio baseline) | 62.51 | 77.05 | 67.94 | 75.34 | 43.13 | 31.54 | 54.68 |
| Omnizart + GNN (Global params, window 1) | 67.47 | 75.14 | 70.30 | 74.58 | 43.90 | 31.50 | 55.28 |
| Omnizart + GNN (Global params, window 9) | 66.22 | 73.66 | 68.89 | 74.60 | 43.93 | 31.60 | 55.30 |
| Omnizart + GNN + CNN | 66.38 | 75.45 | 69.78 | 71.61 | 48.14 | 32.14 | 57.57 |
Table 6.
Error taxonomy on external datasets. The total number of false-positive onset events (FP) and lost true positives (Lost-TP), along with the percentage breakdown by cause.
| Dataset | Setting | FP (count) | FP-A | FP-B | FP-C | FP-Other | Lost-TP (count) | Lost-A | Lost-B (Drop) | Lost-Other |
|---|---|---|---|---|---|---|---|---|---|---|
| PianoYT | GNN (Global params) | 17,436 | 74.67% | 0.01% | 1.21% | 24.11% | 2479 | 95.20% | 0.48% | 4.32% |
| PianoYT | GNN + CNN | 14,155 | 82.68% | 0.00% | 0.92% | 16.40% | 8843 | 80.04% | 2.26% | 17.70% |
| PianoVAM | GNN (Global params) | 33,966 | 62.01% | 0.00% | 0.07% | 37.91% | 2900 | 90.00% | 2.28% | 7.72% |
| PianoVAM | GNN + CNN | 37,229 | 63.01% | 0.00% | 0.05% | 36.94% | 3065 | 91.55% | 1.21% | 7.24% |