Multimodal Automatic Music Transcription Using Piano Audio and Hand-Skeleton Information

Yamada, Kosuke; Nishimura, Satoshi; Shin, Jungpil

doi:10.3390/electronics15102005

Open Access Article

by Kosuke Yamada, Satoshi Nishimura * and Jungpil Shin

School of Computer Science and Engineering, The University of Aizu, Fukushima 965-8580, Japan

* Author to whom correspondence should be addressed.

Electronics 2026, 15(10), 2005; https://doi.org/10.3390/electronics15102005

Submission received: 10 March 2026 / Revised: 30 April 2026 / Accepted: 1 May 2026 / Published: 8 May 2026

(This article belongs to the Special Issue Next-Generation Machine Learning and Deep Learning Models for Complex Data, Vision, and Intelligent Applications)

Abstract

Automatic Music Transcription (AMT) for piano is difficult for audio-only systems due to dense polyphony, resonance, and reverberation, which lead to false positives and unstable onset decisions. We present a multimodal AMT framework that fuses Omnizart audio probability maps with visual cues from hand-skeleton tracking. A graph-based model called HandSkeletonNet estimates per-key onset probabilities from hand trajectories, and the two modalities are merged via a weighting-and-masking scheme or a compact CNN-based merger. Experiments show consistent improvements over the audio-only baseline on our self-compiled dataset, while evaluations with external datasets primarily improve frame-level sensitivity. The frame-level F1 score improved from 75.12% to 75.76% for the PianoYT dataset and from 54.68% to 57.57% for the PianoVAM dataset compared with the audio-only baseline. Our experiments also reveal limited onset-level gains under domain shift. Remaining errors are largely explained by timing/misalignment and note fragmentation in MIDI decoding, suggesting that robustness to missing hand detections and explicit temporal alignment are key directions.

Keywords:

automatic music transcription; hand pose analysis; graph neural networks; multimodal recognition

1. Introduction

Automatic Music Transcription (AMT) aims to infer symbolic score representations from acoustic signals and underpins diverse applications such as music information retrieval, practice support, automatic accompaniment, and digital archiving. Among instruments, the piano remains particularly challenging due to dense polyphony and strong overtone interference, which complicate note separation and the detection of simultaneous key presses. Consequently, purely audio-based AMT models still struggle to robustly estimate onsets and offsets with high accuracy under varying noise levels and recording conditions.

In contrast, visual information obtained from performance videos provides direct cues about key locations and hand/finger motion. Recent studies have explored tracking hand skeletons over time and integrating them with acoustic features [1,2,3,4]. However, visual signals suffer from detection dropouts and uncertainty due to occlusions, illumination changes, and viewpoint dependence; therefore, naive integration can increase both false positives and false negatives. In particular, mismatches in temporal alignment between audio and visual streams can lead to event-level errors that are not easily resolved by simple merging strategies.

In this paper, we propose a multimodal pipeline that integrates probability maps derived from piano audio with key-press probability maps inferred from hand-skeleton trajectories. To obtain reliable vision-derived probabilities, we introduce HandSkeletonNet, a graph neural network that estimates per-key onset probabilities from hand-skeleton trajectories, and we investigate the effect of temporal aggregation using temporal convolution. These hand-skeleton probabilities are merged with the audio-derived probability maps via a parameterized weighting-and-masking scheme based on a nonlinear gain and a probability floor, rather than hard rule-based suppression or rescue heuristics. We also investigate a learning-based merging approach in which a compact Convolutional Neural Network (CNN) directly combines the audio- and vision-derived probability maps to produce the final merged representation. The merged output is subsequently converted to Musical Instrument Digital Interface (MIDI) messages for quantitative evaluation. Our approach differs from raw image-based methods [1,2,3] in that it leverages skeleton representations to naturally incorporate positional constraints imposed by the human hand structure. It also differs from the CRNN-GCN model [4] in that, rather than extracting uninterpretable latent features from the skeleton, it estimates per-key probabilities and then merges them. This design improves explainability and is expected to facilitate analysis of how undesirable outputs in the audio-derived results are suppressed.

The main contributions of this paper are summarized as follows: (i) the design of HandSkeletonNet, a Graph Neural Network (GNN)-based model that estimates per-key onset probabilities from hand-skeleton trajectories, with an investigation of temporal aggregation; (ii) a parameterized weighting-and-masking framework that merges audio and visual probabilities through a continuous weight map, a probability floor, and a nonlinear gain; (iii) a learning-based merging model based on a compact CNN that integrates the two probability streams in a data-driven manner; and (iv) extensive experiments on internal and external datasets demonstrating that multimodal fusion improves transcription accuracy over a state-of-the-art audio-only baseline. The proposed approach is expected to be most effective when hand tracking is sufficiently stable and synchronization errors are limited, whereas robust onset-level improvements under severe domain shift remain challenging.

2. Background and Related Work

2.1. Fundamentals of Piano AMT

Given a waveform $x(t)$ of a piano performance, the goal of AMT is to estimate a time-discretized piano roll as follows:

$$Y \in \{0, 1\}^{T \times 88}, \qquad (1)$$

where $T$ denotes the number of time frames after discretizing the waveform into frames with hop size $h$, i.e., $T = \lceil D / h \rceil$ for an audio duration $D$, and $Y_{t, k} = 1$ indicates that key $k$ is active at time frame $t$. In practice, alternative formulations predict separate representations for onset, duration, and offset channels, or directly infer MIDI notes as onset–length pairs (Figure 1). Piano AMT is challenging mainly due to dense polyphony, resonance, and the use of the sustain pedal. In addition, a piano produces strong harmonic interference, making analysis of the audio waveform difficult (Figure 2).
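As an illustration of Equation (1), the following minimal sketch rasterizes a list of notes into a binary piano roll with $T = \lceil D / h \rceil$ frames. The function name, the 0-based key index, and the rounding convention for onsets and offsets are assumptions made for this example; they are not taken from the paper.

```python
import math
import numpy as np

def notes_to_piano_roll(notes, duration_s, hop_s):
    """Rasterize (onset_s, offset_s, key) notes into Y in {0,1}^(T x 88).

    Assumptions: key is 0-based (0..87); a note is active from the frame
    containing its onset up to, but not including, the frame at its offset.
    """
    n_frames = math.ceil(duration_s / hop_s)  # T = ceil(D / h)
    roll = np.zeros((n_frames, 88), dtype=np.uint8)
    for onset_s, offset_s, key in notes:
        start = int(onset_s / hop_s)
        # Every note occupies at least one frame and never runs past T.
        end = min(n_frames, max(start + 1, math.ceil(offset_s / hop_s)))
        roll[start:end, key] = 1
    return roll

# Hypothetical example: a 1 s clip with a 0.25 s hop and one note on key 39
# sounding from 0.25 s to 0.75 s.
roll = notes_to_piano_roll([(0.25, 0.75, 39)], duration_s=1.0, hop_s=0.25)
```

Here the roll has four frames, and key 39 is active in the second and third frames only.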

Standard evaluation metrics include note-level and frame-level precision, recall, and F₁ score. For onset matching, a tolerance of $\pm 50$ ms is typically adopted. While our system estimates duration and offset probabilities to stabilize the transcription, we primarily focus on the evaluation of onset detection accuracy.

2.2. Audio-Only AMT

Early AMT systems relied on signal-processing approaches such as spectral-peak tracking and Non-negative Matrix Factorization (NMF) [5]. Subsequent advances have been driven by deep learning models that take time–frequency representations (e.g., the Short-Time Fourier Transform (STFT) or Combined Frequency and Periodicity (CFP)) as input. Representative sequence models output probability maps for onsets, durations (or frame-level activation), and offsets, which are converted to MIDI via thresholding, peak picking, and temporal connection. Omnizart [6] follows this line and is widely used as a practical baseline. Nevertheless, under conditions of high polyphony, heavy reverberation, extensive pedal usage, or recording-condition mismatch, controlling false positives and false negatives remains a significant challenge [7].

2.3. Visual Cues and Multimodal AMT

A complementary direction leverages visual cues from performance videos—such as keyboard layout and hand/fingertip motion—to assist audio models. Existing approaches can be broadly grouped into: (i) key- or fingertip-detection and alignment methods for estimating key activity [8], (ii) learned image-feature-based schemes [1,2,3], and (iii) temporal modeling of hand-skeleton trajectories [4,9]. Visual information is particularly effective for narrowing down onset candidates; however, it is susceptible to detection dropouts, uncertainty, occlusions, illumination changes, and viewpoint dependence. Consequently, naive merging of audio and visual streams may inadvertently increase both false positives and false negatives. Furthermore, practical systems must handle audio–video synchronization issues, such as frame-rate mismatch and latency.

3. Proposed Method

3.1. Overall Pipeline

Our system consists of the following four stages: (i) estimating audio-derived probability maps using Omnizart (Section 3.2), (ii) obtaining key-press probability maps based on hand-skeleton cues extracted from performance videos (Section 3.3), (iii) merging these two modalities, and (iv) converting the merged output into MIDI format for evaluation (Figure 3).

For the merging stage (iii), we propose two distinct approaches: (iii-1) a merging method based on thresholding and weighting (Section 3.4.3) and (iii-2) an approach using a CNN-based integration model (Section 3.5).

3.2. Audio Branch (Omnizart)

From the audio waveform, we compute a time–frequency representation (e.g., CFP/STFT) and feed it to the Omnizart piano model to obtain frame-wise, per-pitch probability maps for onsets, sustains (duration), and offsets. Specifically, for time frame $t$ and key $k \in \{1, \dots, 88\}$, we denote the probabilities as

$$P_{a}^{(on)}(t, k) \in [0, 1], \quad P_{a}^{(dur)}(t, k) \in [0, 1], \quad P_{a}^{(off)}(t, k) \in [0, 1], \qquad (2)$$

where $P_{a}^{(on)}$ represents the probability that a note begins, $P_{a}^{(dur)}$ represents the probability that the frame falls within the duration of a note, and $P_{a}^{(off)}$ denotes the probability that a note ends. These maps are later aligned with the visual branch and used by the merging module (Figure 4).

3.3. Hand Skeleton Branch

We use MediaPipe Hands [10] to estimate 21 2D landmarks $(x, y)$ per hand in each video frame and treat the two hands as a 42-node graph [11] (Figure 5). Our proposed model, HandSkeletonNet, uses only the 2D coordinates as node features. When only one hand is detected, the nodes of the missing hand are zero-padded so that the frame still contains 42 nodes.

MediaPipe Hands generally has good robustness to hand occlusion; however, in cases of severe occlusion, it may fail to detect one or both hands. Fortunately, when MediaPipe Hands produces low-confidence estimates, it refrains from outputting unreliable results, thereby preventing significant disruption to audio-based transcription. Rapid hand movements can also lead to reduced accuracy, particularly in pieces with frequent leaps. When audio and visual signals are well synchronized, this effect is limited; however, when they are temporally misaligned, its impact becomes more pronounced. Such synchronization will be discussed in Section 3.4.2.

Let $S_{t} \in \mathbb{R}^{42 \times 2}$ denote the stacked keypoint coordinates at frame $t$ ($t = 1, \dots, T_{hand}$). The model outputs per-MIDI-note key-press probabilities,

$$P_{hand} \in [0, 1]^{T_{hand} \times 88}, \qquad (3)$$

where $P_{hand}(t, k)$ represents the probability for frame $t$ and key $k$. The sequence is later aligned to the unified time base and synchronized with the audio branch (Section 3.4.2).

3.4. Weighting and Masking of Probabilities

3.4.1. Notation

Since the frame rate optimal for audio analysis, which affects frequency resolution, may differ from that used in video-based skeleton analysis, we use separate frame rates for each analysis and resample the probability maps to ensure consistency. Let $T_{a}$ and $T_{hand}$ denote the original frame lengths of the audio and hand streams, respectively, and let $T$ denote the length of the unified time base. The audio probabilities are $P_{a}^{(\cdot)} \in [0, 1]^{T_{a} \times 88}$, and the hand probabilities are $P_{hand} \in [0, 1]^{T_{hand} \times 88}$. After resampling (and alignment compensation), we obtain $\tilde{P}_{a}^{(\cdot)} \in [0, 1]^{T \times 88}$ and $\tilde{P}_{hand} \in [0, 1]^{T \times 88}$.

3.4.2. Temporal Alignment

We resample the audio-derived probability maps $P_{a}^{(\cdot)}$ onto the merged time axis of length $T$ by linear interpolation, obtaining $\tilde{P}_{a}^{(\cdot)} \in [0, 1]^{T \times 88}$. Similarly, the hand-derived probabilities $P_{hand}$ are linearly interpolated to the same length $T$, yielding $\tilde{P}_{hand}$. After resampling, we compensate for residual audio–video latency by searching for a global alignment offset. We compute frame-wise mean sequences and normalize them as follows:

$$m_{a}(t) = \frac{1}{88} \sum_{k} \tilde{P}_{a}^{(on)}(t, k), \quad m_{h}(t) = \frac{1}{88} \sum_{k} \tilde{P}_{hand}(t, k),$$

$$z_{a}(t) = \frac{m_{a}(t) - \mu_{a}}{\sigma_{a}}, \quad z_{h}(t) = \frac{m_{h}(t) - \mu_{h}}{\sigma_{h}},$$

where $\mu_{a}$ and $\sigma_{a}$ denote the temporal mean and standard deviation of $m_{a}(t)$, respectively, while $\mu_{h}$ and $\sigma_{h}$ denote those of $m_{h}(t)$. We search $\ell \in [-L, L]$ (we use $L = 12$) and select $\ell^{\star} = \arg\max_{\ell} \sum_{t} z_{a}(t)\, z_{h}(t - \ell)$, and then shift the hand stream accordingly.
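The resampling and lag search described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the helper names, the endpoint-aligned interpolation grid, the restriction of the sum to overlapping frames, and the zero-fill after shifting are all assumptions.

```python
import numpy as np

def resample_linear(prob, n_out):
    """Linearly interpolate a (T_in, 88) map to (n_out, 88) along time."""
    src = np.linspace(0.0, 1.0, prob.shape[0])
    dst = np.linspace(0.0, 1.0, n_out)
    cols = [np.interp(dst, src, prob[:, k]) for k in range(prob.shape[1])]
    return np.stack(cols, axis=1)

def zscore(x, eps=1e-8):
    # eps guards against a constant sequence (an addition for this sketch).
    return (x - x.mean()) / (x.std() + eps)

def best_lag(p_audio_on, p_hand, max_lag=12):
    """Return the lag l in [-max_lag, max_lag] maximizing sum_t z_a(t) z_h(t - l)."""
    z_a = zscore(p_audio_on.mean(axis=1))
    z_h = zscore(p_hand.mean(axis=1))
    n = len(z_a)
    best, best_score = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        # Keep only frames t for which both t and t - lag are valid indices.
        t = np.arange(max(0, lag), min(n, n + lag))
        score = float(np.sum(z_a[t] * z_h[t - lag]))
        if score > best_score:
            best, best_score = lag, score
    return best

def shift_in_time(prob, lag):
    """Return out with out[t] = prob[t - lag]; exposed frames are zero-filled."""
    out = np.zeros_like(prob)
    if lag > 0:
        out[lag:] = prob[:-lag]
    elif lag < 0:
        out[:lag] = prob[-lag:]
    else:
        out[:] = prob
    return out

# Hypothetical example: the hand stream leads the audio stream by 3 frames.
audio_on = np.zeros((100, 88))
hand = np.zeros((100, 88))
audio_on[[20, 50, 80], 10] = 1.0
hand[[17, 47, 77], 10] = 1.0
lag = best_lag(audio_on, hand)
hand_aligned = shift_in_time(hand, lag)
```

In this toy case the search selects a lag of 3 frames, and the shifted hand map coincides with the audio onsets.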

3.4.3. Construction of the Weight Map

We nonlinearly enhance the contrast of the hand-derived confidence map using an exponent $\gamma$, and impose a lower bound $P_{floor}$ to avoid excessive suppression. We define the weight map as

$$W(t, k) = \max\left(\tilde{P}_{hand}(t, k)^{\gamma},\, P_{floor}\right), \quad W \in [P_{floor}, 1]^{T \times 88}. \qquad (4)$$

Here, $\gamma \geq 1$ controls the emphasis on high-confidence regions, and $P_{floor} \in [0, 1]$ is a lower bound introduced to avoid excessive suppression in regions where the hand evidence is sparse.

3.4.4. Application to Audio Probabilities and MIDI Inference

The continuous weight map $W$ is applied element-wise to the audio onset and duration probabilities. Note that we exclude the offset probability $\tilde{P}_{a}^{(off)}$ from this weighting to preserve release timings. The merged probabilities are given by

$$\hat{P}^{(on)} = \tilde{P}_{a}^{(on)} \odot W, \qquad (5)$$

$$\hat{P}^{(dur)} = \tilde{P}_{a}^{(dur)} \odot W, \qquad (6)$$

where $\odot$ denotes element-wise multiplication. In this way, the audio probabilities are largely preserved at time–pitch positions where the hand-presence probability is high, while they are attenuated in regions with low hand-presence probability.

Finally, we convert the merged probability maps into a MIDI sequence using the Omnizart decoding API. Specifically, we feed the merged onset and duration maps $\hat{P}^{(on)}$ and $\hat{P}^{(dur)}$, together with the unmodified offset map $\tilde{P}_{a}^{(off)}$, to the decoder. The decoder performs note tracking to obtain note events (onset–duration pairs) and writes them to a standard MIDI file.
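Equations (4) to (6) amount to a few array operations. The sketch below shows them in NumPy. The function names and the values $\gamma = 2$ and $P_{floor} = 0.1$ are illustrative assumptions, not the tuned settings of Table 1, and the Omnizart decoding step is deliberately left out.

```python
import numpy as np

def weight_map(p_hand, gamma, p_floor):
    """Eq. (4): W = max(P_hand ** gamma, P_floor), element-wise."""
    return np.maximum(p_hand ** gamma, p_floor)

def merge_weighted(p_on, p_dur, p_off, p_hand, gamma=2.0, p_floor=0.1):
    """Eqs. (5)-(6): scale onset and duration maps; leave the offset map untouched."""
    w = weight_map(p_hand, gamma, p_floor)
    return p_on * w, p_dur * w, p_off

# Hypothetical single frame with two keys: strong and weak hand evidence.
p_hand = np.array([[0.9, 0.1]])
p_on = np.array([[0.8, 0.8]])
p_dur = np.array([[0.6, 0.6]])
p_off = np.array([[0.3, 0.3]])
on_hat, dur_hat, off_out = merge_weighted(p_on, p_dur, p_off, p_hand)
```

With these numbers the weights are 0.81 and 0.1 (the second is set by the floor, since $0.1^2 = 0.01$), so the onset probabilities become 0.648 and 0.08 while the offset map is returned unchanged.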

3.5. Learning-Based Merging Module

In addition to the parameterized weighting-and-masking scheme in Section 3.4, we implemented a Learning-based Merging Module that predicts the onset probability map from both audio- and hand-derived probability maps using a CNN.

3.5.1. Inputs

We construct a 4-channel tensor $X \in \mathbb{R}^{4 \times T \times 88}$ by stacking the Omnizart probabilities (onset, duration, offset) and the hand-skeleton probability as follows:

$$X(t, k) = \left[\tilde{P}_{a}^{(on)}(t, k),\ \tilde{P}_{a}^{(off)}(t, k),\ \tilde{P}_{a}^{(dur)}(t, k),\ \tilde{P}_{hand}(t, k)\right].$$

3.5.2. Time Alignment

Before constructing the 4-channel input, we apply the same resampling and alignment compensation procedure as in Section 3.4.2 (i.e., linear interpolation to the unified time base and alignment offset search by maximizing cross-correlation of frame-wise mean z-scores).

3.5.3. Model and Training

A compact 2D CNN over the time–pitch plane is trained with ground-truth onset labels using a binary classification loss (BCEWithLogitsLoss with positive weighting) and the Adam optimizer. The network outputs onset logits $Z_{\theta}(t, k)$, and the merged onset probability is obtained by $\sigma(Z_{\theta}(t, k))$.

3.5.4. Inference and MIDI Decoding

At inference time, only the onset stream is replaced by the CNN output, while the duration and offset streams are taken from the audio branch as follows:

$$P_{merge}^{(on)} = \hat{P}_{cnn}^{(on)}, \quad P_{merge}^{(dur)} = \tilde{P}_{a}^{(dur)}, \quad P_{merge}^{(off)} = \tilde{P}_{a}^{(off)}.$$

The resulting $(T, 88, 3)$ probability maps are converted to MIDI using Omnizart’s note-inference API with thresholds and hop size consistent with the unified time base.

4. Dataset and Experimental Setup

4.1. Internal Dataset

To train our models and conduct a preliminary evaluation of our architecture, we compiled a small dataset of performance videos and their corresponding MIDI files for nine piano pieces. These include Flight of the Bumblebee by N. Rimsky-Korsakov (arranged by S. Rachmaninoff); seven other classical pieces composed by F. Liszt, F. Chopin, C. Debussy, and S. Rachmaninoff; and one arrangement of a popular song. The total duration of the audio clips in the dataset is 24.5 min. The videos are recorded from an overhead viewpoint, with all 88 keys visible. The audio sampling rate is 44.1 kHz, while the video frame rate is 25 frames per second (fps).

4.2. Partitioning Strategy

For each musical piece, its audio waveform was processed with Omnizart version 0.5.0 to obtain probability maps for onset, duration, and offset events. From the video, we extracted the two-dimensional coordinates of 21 keypoints per hand using MediaPipe Hands and converted them into PyTorch Geometric graphs with 42 nodes following the procedure described in Section 4.3.1. Since the main focus of this paper is the post hoc evaluation of the merging mechanism, we first train HandSkeletonNet once on the internal dataset and then keep its hyperparameters fixed for all experiments. We therefore do not perform cross-validation over HandSkeletonNet itself; instead, we evaluate the performance of the merging strategy on specific target pieces.

4.3. Training Procedure of HandSkeletonNet

HandSkeletonNet is trained as a binary classification model that estimates, for each frame, the onset activation probabilities of the 88 piano keys based on the sequence of hand-skeleton graphs. Supervision is provided by the onset piano roll derived from the MIDI annotation, denoted as $y_{t} \in \{0, 1\}^{88}$.

4.3.1. Preprocessing and Input Generation

For each frame, the 2D coordinates $(x, y)$ of the 21 keypoints per hand estimated by MediaPipe Hands are normalized by the image width and height, rescaling them to the range $[0, 1]$. If a hand is missing in a frame, it is filled by zero padding so that each frame contains 42 nodes in total. We use a sliding window of width $T_{win} = 9$ with a stride of 1. We further consider an ablation setting with $T_{win} = 1$. In this case, the input reduces to the center frame only, and the model cannot exploit any temporal neighborhood. We use this variant to quantify the contribution of temporal context in the hand-skeleton branch. For each central time step $t$, we construct a sequence of graphs as follows:

$$\left\{G_{t - \lfloor T_{win} / 2 \rfloor}, \dots, G_{t + \lfloor T_{win} / 2 \rfloor}\right\}. \qquad (7)$$

The edges are defined by the skeletal topology of each hand and remain fixed within each frame.
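The preprocessing above (normalization, zero padding of a missing hand, and centered windows of Equation (7)) can be sketched as follows. The helper names, the left/right slot order, and the zero padding at sequence boundaries are assumptions made for this illustration; the paper does not specify how the first and last frames are handled.

```python
import numpy as np

def frame_nodes(left, right, width, height):
    """Stack two hands into a (42, 2) array of coordinates in [0, 1].

    left / right are (21, 2) pixel coordinates or None when undetected;
    a missing hand stays all-zero (zero padding).
    """
    nodes = np.zeros((42, 2), dtype=np.float32)
    for slot, hand in enumerate((left, right)):
        if hand is not None:
            xy = np.asarray(hand, dtype=np.float32)
            nodes[21 * slot: 21 * (slot + 1)] = xy / np.array([width, height])
    return nodes

def sliding_windows(frames, t_win=9):
    """(T, 42, 2) -> (T, t_win, 42, 2): one centered window per frame, stride 1.

    Frames before the start and after the end are zero-padded (assumption).
    """
    half = t_win // 2
    padded = np.pad(frames, ((half, half), (0, 0), (0, 0)))
    return np.stack([padded[t: t + t_win] for t in range(frames.shape[0])])

# Hypothetical example: 5 frames of 640x480 video, right hand never detected.
left_hand = np.full((21, 2), (320.0, 240.0))
frames = np.stack([frame_nodes(left_hand, None, 640, 480) for _ in range(5)])
windows = sliding_windows(frames, t_win=9)
center_only = sliding_windows(frames, t_win=1)  # the T_win = 1 ablation
```

Each window keeps its center frame at index `t_win // 2`, which is the position later used to read out the per-frame prediction.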

4.3.2. Model Architecture

For each frame-wise graph, we first apply two Graph Convolutional Network (GCN) layers with Rectified Linear Unit (ReLU) activation to extract node-level features. To enable the model to jointly recognize the shapes of both hands, we then aggregate node features into a single graph-level embedding by mean pooling over the node dimension. This produces one feature vector per frame, forming a temporal sequence over a $T_{win}$-frame window (center frame $\pm \lfloor T_{win} / 2 \rfloor$). The temporal dimension is then aggregated by a depthwise temporal convolution [12] over this sequence of graph-level embeddings. From the resulting feature sequence, we extract the vector corresponding to the central time step and feed it into a fully connected layer to obtain an 88-dimensional logit vector for that frame (Figure 6).
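To make the data flow concrete, here is a simplified forward pass in NumPy with random weights. It is a sketch under several assumptions rather than the trained model: the edge list is the commonly used 21-landmark hand topology (the paper only states that edges follow the skeletal topology), the GCN layers use the standard symmetric normalization with self-loops, and the sketch uses a single depthwise temporal layer with no projection layer, biases, or dropout.

```python
import numpy as np

# Assumed per-hand edges over landmarks 0..20 (wrist = 0, four joints per finger).
HAND_EDGES = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 5), (5, 6), (6, 7), (7, 8),
              (5, 9), (9, 10), (10, 11), (11, 12), (9, 13), (13, 14), (14, 15),
              (15, 16), (13, 17), (0, 17), (17, 18), (18, 19), (19, 20)]

def normalized_adjacency(n_hands=2):
    """D^(-1/2) (A + I) D^(-1/2) for n_hands disjoint hand skeletons."""
    a = np.eye(21 * n_hands)
    for h in range(n_hands):
        for i, j in HAND_EDGES:
            a[21 * h + i, 21 * h + j] = 1.0
            a[21 * h + j, 21 * h + i] = 1.0
    deg = a.sum(axis=1)
    return a / np.sqrt(np.outer(deg, deg))

def init_params(rng, hidden=128, scale=0.1):
    return {
        "w1": rng.normal(0.0, scale, (2, hidden)),
        "w2": rng.normal(0.0, scale, (hidden, hidden)),
        "k": rng.normal(0.0, scale, (3, hidden)),  # one 3-tap filter per channel
        "w_out": rng.normal(0.0, scale, (hidden, 88)),
        "b_out": np.zeros(88),
    }

def forward(window, params, a_hat):
    """window: (t_win, 42, 2) -> 88 logits for the center frame."""
    h = np.maximum(a_hat @ window @ params["w1"], 0.0)  # GCN layer 1 + ReLU
    h = np.maximum(a_hat @ h @ params["w2"], 0.0)       # GCN layer 2 + ReLU
    g = h.mean(axis=1)                                  # mean pooling over nodes
    gp = np.pad(g, ((1, 1), (0, 0)))                    # "same" padding in time
    # Depthwise temporal convolution: channels are filtered independently.
    s = sum(gp[i: i + g.shape[0]] * params["k"][i] for i in range(3))
    center = s[g.shape[0] // 2]                         # read out the center frame
    return center @ params["w_out"] + params["b_out"]

rng = np.random.default_rng(0)
params = init_params(rng)
a_hat = normalized_adjacency()
logits = forward(rng.random((9, 42, 2)), params, a_hat)
probs = 1.0 / (1.0 + np.exp(-logits))  # per-key onset probabilities
```

Because the two hands form disconnected components, information is shared between them only at the mean-pooling step, which matches the stated motivation for pooling over all 42 nodes.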

4.3.3. Loss Function

We train the network with a composite loss that combines a class-imbalance-aware binary cross-entropy (BCE) with logits and a differentiable surrogate of the macro F₁ score (soft-F₁) [13]. Let $Z \in \mathbb{R}^{B \times 88}$ and $Y \in \{0, 1\}^{B \times 88}$ denote the logit matrix and the ground-truth label matrix in a mini-batch of size $B$, respectively, and let $P = \sigma(Z) \in (0, 1)^{B \times 88}$ be the predicted probabilities after the sigmoid function $\sigma(\cdot)$.

Soft-F₁

For each key $c \in \{1, \dots, 88\}$, we define soft counts over the mini-batch as follows:

$$TP^{(c)} = \sum_{b = 1}^{B} P_{b, c}\, Y_{b, c}, \qquad (8)$$

$$FP^{(c)} = \sum_{b = 1}^{B} P_{b, c}\, (1 - Y_{b, c}), \qquad (9)$$

$$FN^{(c)} = \sum_{b = 1}^{B} (1 - P_{b, c})\, Y_{b, c}. \qquad (10)$$

Using these, the soft-F₁ score for key $c$ is given by

$$F_{1, soft}^{(c)}(P, Y) = \frac{2\, TP^{(c)} + \epsilon}{2\, TP^{(c)} + FP^{(c)} + FN^{(c)} + \epsilon}, \qquad (11)$$

where $\epsilon$ is a small constant for numerical stability (we use $\epsilon = 10^{-8}$). We then compute the macro score across 88 keys as

$$\bar{F}_{1}(P, Y) = \frac{1}{88} \sum_{c = 1}^{88} F_{1, soft}^{(c)}(P, Y). \qquad (12)$$

Composite Loss

The final training objective is

$$\mathcal{L} = \mathrm{BCEWithLogits}(Z, Y; w_{pos}) + \lambda \left(1 - \bar{F}_{1}(\sigma(Z), Y)\right), \qquad (13)$$

where $w_{pos} \in \mathbb{R}^{88}$ is the class-wise positive weight vector and $\lambda$ controls the contribution of the soft-F₁ term. We set $\lambda = 0.2$.
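Equations (8) to (13) can be checked numerically with a short NumPy sketch. The function names are hypothetical, and averaging the BCE term over all batch elements and keys is an assumption (it mirrors the default reduction of PyTorch's `BCEWithLogitsLoss`); a training implementation would use an autograd framework instead.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_with_logits(z, y, w_pos):
    """Positively weighted BCE on logits, averaged over all entries (assumed)."""
    # -log(sigmoid(z)) = logaddexp(0, -z) and -log(1 - sigmoid(z)) = logaddexp(0, z).
    loss = w_pos * y * np.logaddexp(0.0, -z) + (1.0 - y) * np.logaddexp(0.0, z)
    return float(loss.mean())

def soft_f1_macro(p, y, eps=1e-8):
    """Eqs. (8)-(12): per-key soft counts over the batch, then the mean over keys."""
    tp = (p * y).sum(axis=0)
    fp = (p * (1.0 - y)).sum(axis=0)
    fn = ((1.0 - p) * y).sum(axis=0)
    return float(((2.0 * tp + eps) / (2.0 * tp + fp + fn + eps)).mean())

def composite_loss(z, y, w_pos, lam=0.2):
    """Eq. (13): weighted BCE plus lam * (1 - macro soft-F1)."""
    return bce_with_logits(z, y, w_pos) + lam * (1.0 - soft_f1_macro(sigmoid(z), y))

# Hypothetical mini-batch of 4 frames with two active keys.
y = np.zeros((4, 88))
y[0, 3] = 1.0
y[2, 40] = 1.0
w_pos = np.ones(88)
confident_right = 50.0 * (2.0 * y - 1.0)  # large logits with the correct sign
loss_right = composite_loss(confident_right, y, w_pos)
loss_wrong = composite_loss(-confident_right, y, w_pos)
```

Note that the $\epsilon$ in the numerator makes the soft-F₁ of a key with no positives approach 1 when its predicted probabilities approach 0, so silent keys do not dominate the macro average.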

Positive Class Weights

The class-wise positive weights $w_{pos}$ are estimated from the training set to mitigate class imbalance [14]. For each key $c$, let $N_{pos}^{(c)}$ and $N_{neg}^{(c)}$ be the number of positive and negative labels in the training set, respectively. We define

$$w_{pos}[c] = \mathrm{clip}\left(\frac{N_{neg}^{(c)}}{\max(N_{pos}^{(c)}, 1)},\ 1,\ 20\right), \qquad (14)$$

where $\mathrm{clip}(\cdot, 1, 20)$ truncates the ratio to the interval $[1, 20]$.
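Equation (14) can be sketched as follows, assuming the training labels are available as a single binary onset roll of shape (frames, 88); the function name is hypothetical.

```python
import numpy as np

def positive_class_weights(onset_roll, lo=1.0, hi=20.0):
    """Eq. (14): clip(N_neg / max(N_pos, 1), lo, hi) for each key."""
    n_pos = onset_roll.sum(axis=0)
    n_neg = onset_roll.shape[0] - n_pos
    return np.clip(n_neg / np.maximum(n_pos, 1), lo, hi)

# Hypothetical training roll with 100 frames.
roll = np.zeros((100, 88))
roll[:50, 0] = 1  # ratio 50/50 = 1
roll[:10, 1] = 1  # ratio 90/10 = 9
roll[:1, 2] = 1   # ratio 99/1 = 99, clipped to 20
roll[:80, 4] = 1  # ratio 20/80 = 0.25, raised to 1
w_pos = positive_class_weights(roll)
```

Keys that never occur in the training set (key 3 in this example) give a ratio of 100/1 and are therefore also capped at 20.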

4.4. Training Procedure of Learning-Based Merging Module

Section 3.5 introduced a Learning-based Merging Module that predicts the onset probability map from both audio- and hand-derived probability maps.

4.4.1. Architecture

We use a compact 2D CNN over the time–pitch plane. The network consists of four convolution layers with ReLU activations, including a dilated temporal convolution to widen the receptive field in time (Figure 7). The model outputs onset logits $Z_{\theta}(t, k)$, and the merged onset probability is obtained by $\sigma(Z_{\theta}(t, k))$.

4.4.2. Loss and Optimization

We train the network using BCEWithLogitsLoss with a scalar positive weight pos_weight = 3. Optimization is performed using the Adam algorithm with learning rate $\eta_{cnn} = 1 \times 10^{-3}$. To avoid variable-length batching, we use batch size $B_{cnn} = 1$ and shuffle pieces at each epoch.

4.5. Baselines

To ensure a fair comparison, all methods share the same evaluation pipeline and temporal alignment.

1. Audio Only (Omnizart): The audio-derived probability maps are converted to MIDI using Omnizart’s note-inference API.
2. Audio + Hand Skeleton Region: MediaPipe hand coordinates are mapped onto the keyboard positions. Notes that fall outside the physically reachable range of the hands are pruned based on a deterministic rule.
3. Proposed (i) Omnizart + GNN: The GNN-estimated hand-skeleton probabilities $P_{hand}$ are merged with the audio probabilities using the proposed weighting and masking mechanism. The final piano roll is converted to MIDI using the Omnizart API.
4. Proposed (ii) Omnizart + GNN + CNN: The audio and hand-skeleton probabilities are merged using the CNN-based merging model. At inference time, only the onset stream is replaced by the CNN output, while the duration and offset streams are taken from the audio branch (Section 3.5).

4.6. Hyperparameters

4.6.1. Merging Hyperparameters

The merging process (Proposed (i) Omnizart + GNN) is controlled by the hyperparameters summarized in Table 1 and explained below.

Weight Floor ($P_{floor}$): A lower bound imposed on the continuous weight map $W$. Even when the hand-skeleton probabilities are low or missing, $W$ is clipped below by $P_{floor}$ so that the audio stream retains a minimum contribution.
Weight Gamma ($\gamma$): An exponent used when constructing $W$ from the hand-skeleton probabilities $P_{hand}$, emphasizing high-confidence regions of $P_{hand}$ in the continuous weight map.
Onset Threshold ($P_{th}^{(on)}$) and Duration Threshold ($P_{th}^{(dur)}$): Decision thresholds used to binarize the onset and duration streams in the final MIDI conversion stage. These thresholds are applied to the score values used by the Omnizart decoding stage rather than to the $[0, 1]$-normalized probabilities, so their candidate values are not restricted to the interval $[0, 1]$.

For the pieces contained in the training dataset of HandSkeletonNet, these hyperparameters were tuned independently for each song by an exhaustive grid search over the candidate sets (Table 1). For each song, we selected the configuration yielding the highest onset-only F₁ score under a latency-compensated evaluation with a tolerance of $\delta = 0.05$ s. Since this per-song tuning utilizes ground-truth annotations, the results should be regarded as an upper bound on the achievable performance.

In addition, we defined a single global configuration, reported in the “Adapted Value” column of Table 1. These values were calculated by averaging the optimal hyperparameters obtained for each piece in the training dataset. This global setting is intended for evaluating pieces outside the training dataset (e.g., external test pieces) where ground-truth tuning is not feasible.

4.6.2. HandSkeletonNet Hyperparameters

HandSkeletonNet utilizes specific architectural and optimization hyperparameters, which were kept fixed across all experiments. Unless otherwise noted, we used a temporal window of $T_{win} = 9$ frames with a stride of 1. The network consists of a GCN block with a hidden dimension of 128, followed by a linear projection to a 256-dimensional feature space. The temporal aggregation is performed by a depthwise Temporal Convolution Network (TCN) with a kernel size of 3, consisting of two layers and a dropout rate of 0.1. The final fully connected layer outputs 88 logits per frame.

The model was trained using the Adam optimizer with a learning rate of $1 \times 10^{-4}$, a weight decay of $1 \times 10^{-4}$, and a batch size of 32. Gradient clipping was applied with a maximum global norm of 5.0. The weight of the soft-F₁ loss term was set to $\lambda = 0.2$. These hyperparameters were determined empirically based on preliminary experiments.

4.6.3. Learning-Based Merging Module Hyperparameters

Table 2 summarizes the training parameters for the Learning-based Merging Module proposed in Section 4.4.

4.7. Evaluation Metrics

We compute frame-level metrics using mir_eval.multipitch [15]. For each frame, we obtain the predicted and reference pitch sets and count true positives (TP), false positives (FP), and false negatives (FN). We then aggregate TP/FP/FN over all frames and report the following:

$$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad Acc = \frac{TP}{TP + FP + FN}.$$

Note that this accuracy is the multi-pitch set-based accuracy defined in mir_eval.multipitch, not a per-key classification accuracy.
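For intuition, the same three quantities can be computed directly from binary piano rolls. This is a simplified stand-in written for illustration and not a call into mir_eval, whose interface operates on per-frame frequency lists; the function name is hypothetical.

```python
import numpy as np

def frame_metrics(pred_roll, ref_roll):
    """Precision, recall, and set-based accuracy aggregated over all frames."""
    pred = np.asarray(pred_roll).astype(bool)
    ref = np.asarray(ref_roll).astype(bool)
    tp = int(np.sum(pred & ref))
    fp = int(np.sum(pred & ~ref))
    fn = int(np.sum(~pred & ref))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, accuracy

# Hypothetical two-frame, three-key example: TP = 2, FP = 1, FN = 1.
pred_roll = [[1, 1, 0], [0, 1, 0]]
ref_roll = [[1, 0, 0], [0, 1, 1]]
precision, recall, accuracy = frame_metrics(pred_roll, ref_roll)
```

In this example precision and recall are both 2/3, while the accuracy is 2/4 = 0.5 because true negatives never enter the denominator.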

We also report onset-level precision/recall/F1 by evaluating note-onset events. From the predicted MIDI and the ground-truth MIDI, we extract onset events as pairs of (onset time, pitch). A predicted onset is counted as a true positive if there exists a ground-truth onset with the same pitch whose onset time is within a tolerance of $\delta = 0.05$ s; otherwise, it is a false positive. Unmatched ground-truth onsets are counted as false negatives. We then compute Precision/Recall/F1 from the aggregated TP/FP/FN counts.
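The onset matching rule can be sketched as below. One detail is an assumption of this sketch: each ground-truth onset is allowed to match at most one prediction (greedy nearest-neighbor matching per pitch), which keeps the TP, FP, and FN counts mutually consistent; the text only requires that a matching ground-truth onset exists.

```python
def onset_prf(pred, ref, tol=0.05):
    """pred, ref: lists of (onset_s, pitch). Returns (precision, recall, f1)."""
    unmatched = sorted(ref)
    tp = 0
    for t, pitch in sorted(pred):
        # Candidate references: same pitch, within the tolerance, not yet used.
        cands = [(abs(t - rt), i) for i, (rt, rp) in enumerate(unmatched)
                 if rp == pitch and abs(t - rt) <= tol]
        if cands:
            _, i = min(cands)  # take the closest candidate
            unmatched.pop(i)
            tp += 1
    fp = len(pred) - tp
    fn = len(ref) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical example: one hit, one late onset (200 ms off), one extra note.
ref_onsets = [(1.00, 60), (2.00, 64)]
pred_onsets = [(1.03, 60), (2.20, 64), (3.00, 67)]
precision, recall, f1 = onset_prf(pred_onsets, ref_onsets)
```

Here TP = 1, FP = 2, and FN = 1, giving a precision of 1/3, a recall of 1/2, and an F1 of 0.4.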

5. Results

We report frame-level and onset-level metrics defined in Section 4.7. We also report an ablation of the hand-skeleton branch with $T_{win} = 1$, which removes temporal context and uses only the center-frame GNN features.

5.1. Evaluation with the Internal Dataset

We first evaluate the performance on the dataset used for training HandSkeletonNet. Table 3 compares the precision, recall, and F₁ scores of the baselines and the proposed merging methods.

Table 3 shows the evaluation results on the internal dataset. The audio baseline (Omnizart) achieves an onset-level F1 of 64.09% (precision = 58.74%, recall = 71.27%) and a frame-level F1 of 39.38% (precision = 69.98%, recall = 27.4%, accuracy = 24.17%).

The Hand Skeleton Region baseline yields very high onset-level precision (83.89%) but low recall (42.23%), resulting in a reduced onset-level F1 of 53.02%. At the frame level, it also degrades performance (frame-level F1 = 27.87%), indicating that hard spatial restriction can suppress correct activations.

For rule-based merging, Naive product attains onset-level F1 = 56.97% and frame-level F1 = 31.96%. With a single Global params configuration, the temporal-window comparison shows that $T_{win} = 1$ performs better than $T_{win} = 9$ on both metrics (onset-level F1: 63.95% vs. 57.37%; frame-level F1: 39.99% vs. 31.91%). The Optimized params setting reaches onset-level F1 = 60.73% and frame-level F1 = 31.74%, but does not exceed the best global configuration in this re-evaluation.

Finally, the learning-based merging model (Omnizart + GNN + CNN) achieves the best performance on the internal dataset, with onset-level F1 = 66.38% (precision = 66.43%, recall = 67.44%) and frame-level F1 = 42.96% (precision = 64.8%, recall = 32.14%, accuracy = 26.69%).

5.2. Evaluation with External Datasets

Next, we evaluate the generalization performance on external datasets that were not used for training HandSkeletonNet. Table 4 and Table 5 present a comparison between the baseline method and the proposed approach. For rule-based merging, we used the global parameter values shown in Table 1, which were obtained through adaptation using the internal dataset. The audio sampling rate used in the experiments is 44.1 kHz, while the video frame rate is 25 fps for the PianoYT dataset and 60 fps for the PianoVAM dataset.

5.2.1. External Evaluation on PianoYT

Table 4 summarizes the results on the PianoYT dataset [1]. The audio baseline achieves onset-level precision = 81.23%, recall = 96.56%, and F1 = 87.68%, while the frame-level F1 is 75.12%. For rule-based merging with Global params, both temporal-window settings improve the onset-level metrics: $T_{win} = 1$ yields onset-level F1 = 90.06% and $T_{win} = 9$ yields onset-level F1 = 90.07%. Their frame-level performance is also similar (frame-level F1 = 75.69% for both settings; accuracy = 59.23%). The learning-based merging model (Omnizart + GNN + CNN) achieves onset-level F1 = 86.93% and the best frame-level F1 = 75.76%.

5.2.2. External Evaluation on PianoVAM

Table 5 reports the results on the PianoVAM dataset [9]. The audio baseline attains onset-level precision = 62.51%, recall = 77.05%, and F1 = 67.94% and frame-level F1 = 54.68%. Rule-based merging improves the onset-level results, especially with $T_{win} = 1$: $T_{win} = 1$ yields onset-level F1 = 70.30%, while $T_{win} = 9$ yields onset-level F1 = 68.89%. Their frame-level performance remains close (frame-level F1 = 55.28% for $T_{win} = 1$ and 55.30% for $T_{win} = 9$). The learning-based merging model (Omnizart + GNN + CNN) achieves onset-level F1 = 69.78% and the best frame-level F1 = 57.57%.

5.3. Qualitative Analysis

To better understand the behavior of the proposed method, we visually inspect the generated piano rolls. Figure 8, Figure 9, Figure 10, and Figure 11 compare the ground truth, the audio-only baseline output, and the proposed method’s output.

5.3.1. Recovery of Missed Notes

Figure 8 shows a typical success case demonstrating the recovery of false negatives. In the audio-only baseline (middle row), the chord progressions in the lower region played by the left hand, as well as the fast chromatic passages in the latter half, are largely missing. However, as shown in the piano roll of the proposed method (bottom row), these notes—which were absent in the baseline—have been successfully restored. By using the hand-skeleton probabilities as a soft spatial guide, the proposed method preserves plausible audio activations in played regions while attenuating implausible regions, bringing the transcription closer to the ground truth.

5.3.2. Suppression of Noise Artifacts

Figure 9 illustrates the suppression of false positives. As indicated by the red arrows in the baseline piano roll (middle row), scattered noise artifacts are observed in both the lower and higher regions. It is evident that such spurious notes are effectively suppressed in the proposed method (bottom row). This demonstrates that the weighting mechanism based on hand-skeleton probabilities correctly filters out audio activations in regions where no hand activity is detected.

5.3.3. Failure Cases: Introduction of Noise and Suppression of Valid Notes

While the proposed method generally improves performance, there are instances where the integration of visual cues degrades the transcription quality.

Figure 10 illustrates a case where the proposed method introduces noise artifacts that were absent in the audio-only baseline. As indicated by the red arrows, spurious notes appear in close proximity to the ground-truth notes, as well as in regions far removed from the main melody line. This suggests that, in some frames, the hand skeleton model may output false-positive probabilities due to visual ambiguity or motion blur, which are then erroneously merged with the audio predictions.

Conversely, Figure 11 demonstrates an instance where correct notes, originally captured by the audio baseline, are inadvertently suppressed by the proposed method (indicated by the red arrows in the central region). Similar to Figure 10, the emergence of unnecessary notes is also observed in this example. These cases highlight that integrating probability values from the hand-skeleton model does not always guarantee improvement and can, in certain scenarios, lead to a reduction in transcription accuracy due to visual tracking errors or misalignment.

6. Discussion

6.1. Effect of Temporal Context in the Hand-Skeleton Branch

A key question in our hand-skeleton branch is whether temporal context improves the overall transcription quality. The $T_{win} = 1$ ablation provides evidence that temporal context does not consistently yield gains.

As shown in Table 3, on the internal dataset, rule-based merging with Global params performs better with $T_{win} = 1$ (onset-level F1 = 63.95%; frame-level F1 = 39.99%) than with $T_{win} = 9$ (onset-level F1 = 57.37%; frame-level F1 = 31.91%). This suggests that, under the current training and alignment conditions, temporal aggregation in HandSkeletonNet can be detrimental. One plausible explanation is that onset events are temporally sharp and sparse, and temporal modeling may blur or shift peaks. If residual audio–video misalignment remains after global offset compensation, such peak shifts can propagate to the downstream weighting or masking behavior used in rule-based merging, leading to reduced detection quality.
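The peak-blurring argument can be illustrated with a toy example. The sketch below is not HandSkeletonNet: a 9-frame moving average stands in for temporal aggregation, and the weight map `floor + (1 - floor) * p ** gamma` is an assumed form that merely borrows the weight floor and gamma values from Table 1. It shows how a blurred or shifted hand peak lowers the weighted audio onset value, which could then fall below a decoding threshold.

```python
def moving_average(signal, window):
    """Centered moving average with zero padding (stand-in for temporal aggregation)."""
    half = window // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [sum(padded[i:i + window]) / window for i in range(len(signal))]

def shift(signal, frames):
    """Delay a signal by a non-negative number of frames, padding with zeros."""
    return [0.0] * frames + list(signal[:len(signal) - frames])

def weight_map(hand_prob, floor=0.667, gamma=1.189):
    """Assumed weighting form: values in [floor, 1], rising with hand probability."""
    return [floor + (1.0 - floor) * (p ** gamma) for p in hand_prob]

n_frames = 21
audio_onset = [0.0] * n_frames
audio_onset[10] = 1.0            # sharp audio onset peak at frame 10
hand_sharp = [0.0] * n_frames
hand_sharp[10] = 1.0             # single-frame hand cue (T_win = 1 analogue)

hand_blurred = moving_average(hand_sharp, 9)   # 9-frame aggregation analogue
hand_shifted = shift(hand_sharp, 2)            # residual 2-frame misalignment

for name, hand in [("sharp", hand_sharp),
                   ("blurred", hand_blurred),
                   ("shifted", hand_shifted)]:
    weighted = [a * w for a, w in zip(audio_onset, weight_map(hand))]
    print(f"{name:8s} hand peak = {max(hand):.3f}, "
          f"weighted audio peak = {max(weighted):.3f}")
```

In this toy setting the sharp, aligned cue leaves the audio peak untouched, whereas the blurred and the shifted cues both pull it down toward the weight floor.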

In contrast, on external datasets, the difference between $T_{win} = 1$ and $T_{win} = 9$ becomes small (Table 4 and Table 5). On PianoYT, onset-level F1 is 90.06% ($T_{win} = 1$) versus 90.07% ($T_{win} = 9$) and frame-level F1 is identical (75.69%). On PianoVAM, onset-level F1 is 70.3% ($T_{win} = 1$) versus 68.89% ($T_{win} = 9$), while frame-level F1 is nearly identical (55.28% vs. 55.30%). These results indicate that temporal context in the hand branch provides limited additional benefit under domain shift, and that the generalization bottleneck is likely dominated by other factors such as hand detection reliability, viewpoint/scale differences, and residual alignment errors.

6.2. Learned Merging vs. Rule-Based Merging

Across datasets, the learned merging model (Omnizart + GNN + CNN) improves at least one of the two evaluation levels relative to the audio baseline. On the internal dataset, it achieves the best overall performance with onset-level F1 = 66.38% and frame-level F1 = 42.96% (Table 3). This suggests that the CNN can learn nonlinear decision rules that are not captured by fixed-form rule-based merging.

On external datasets, the learned model improves frame-level performance but does not consistently exceed the best rule-based onset-level score. On PianoYT, the CNN achieves the best frame-level F1 (75.76%), but its onset-level F1 (86.93%) is lower than the rule-based results (90.06%/90.07%). On PianoVAM, the CNN again yields the best frame-level F1 (57.57%), while onset-level F1 (69.78%) is slightly below the best rule-based result (70.3%). This indicates that the learned model can shift the balance between frame-wise activity and onset-event detection differently across domains. A practical implication is that the optimal merging strategy may depend on the target dataset and the downstream use-case (e.g., onset-sensitive evaluation versus sustained frame activation).

6.3. Error Taxonomy on External Datasets

To better understand error sources on external datasets, we analyze false positives (FP) and lost true positives (Lost-TP) using an explicit taxonomy (Table 6). In our implementation, FP events are defined as unmatched onset events in the merged (rule-based) MIDI output, and Lost-TP events are defined as ground-truth (GT) onsets that are matched by the audio baseline within $δ$ but are not matched by the merged output. FP-A corresponds to timing/alignment-related cases, FP-B to hand-driven spurious activations, and FP-C to audio-driven hallucinations not suppressed by the hand cue. Lost-A captures timing shifts in the merged output, while Lost-B (dropout) captures cases where the hand cue is weak or absent despite strong audio evidence.
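The FP and Lost-TP definitions can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: onsets are (time, pitch) pairs, matching is greedy and one-to-one on the same pitch within a tolerance `delta`, and the 50 ms default as well as all function names are hypothetical. Assigning events to the sub-categories A, B, and C requires further rules that are not spelled out here, so they are omitted.

```python
def match_onsets(reference, estimated, delta=0.05):
    """Greedy one-to-one matching of (time_sec, pitch) onsets within +/- delta seconds.

    Returns (matched_reference_indices, matched_estimated_indices) as sets.
    """
    matched_ref, matched_est = set(), set()
    est_order = sorted(range(len(estimated)), key=lambda j: estimated[j][0])
    for i in sorted(range(len(reference)), key=lambda i: reference[i][0]):
        ref_time, ref_pitch = reference[i]
        for j in est_order:
            if j in matched_est:
                continue
            est_time, est_pitch = estimated[j]
            if est_pitch == ref_pitch and abs(est_time - ref_time) <= delta:
                matched_ref.add(i)
                matched_est.add(j)
                break
    return matched_ref, matched_est

def fp_and_lost_tp(ground_truth, audio_only, merged, delta=0.05):
    """Return (FP events in the merged output, GT onsets lost relative to the audio baseline)."""
    gt_hit_by_audio, _ = match_onsets(ground_truth, audio_only, delta)
    gt_hit_by_merged, merged_hits = match_onsets(ground_truth, merged, delta)
    false_positives = [merged[j] for j in range(len(merged)) if j not in merged_hits]
    lost_true_positives = [ground_truth[i]
                           for i in sorted(gt_hit_by_audio - gt_hit_by_merged)]
    return false_positives, lost_true_positives

# Toy example (invented data): two GT notes, both found by the audio baseline.
gt = [(1.00, 60), (2.00, 64)]
audio = [(1.01, 60), (2.02, 64)]
merged = [(1.01, 60), (2.12, 64)]   # second onset shifted by 120 ms after merging
fps, lost = fp_and_lost_tp(gt, audio, merged)
print("FP:", fps)
print("Lost-TP:", lost)
```

In the toy data a single 120 ms shift produces one FP and one Lost-TP at the same pitch, which is the kind of pair that the timing-related categories FP-A and Lost-A are meant to capture.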

Table 6 indicates that FP events are dominated by category A on both datasets. In particular, FP-A accounts for 74.67% (PianoYT) and 62.01% (PianoVAM) of the FP events in the rule-based merged output, while FP-B and FP-C together stay below 1.3% in all settings; the remainder falls into the Other category (16.40–37.91%). This suggests that the dominant FP mechanism is not purely “hand-only” or “audio-only” hallucination, but rather timing/alignment inconsistencies or locally shifted peaks.

Similarly, Lost-TP is dominated by category A (95.20% on PianoYT and 90.00% on PianoVAM for the rule-based setting with Global params), indicating that many missed events are attributable to timing shifts rather than complete disappearance of acoustic evidence. These observations align with the limited benefit of temporal context in the hand branch on external datasets: when timing-related instability dominates, adding temporal aggregation does not necessarily resolve the underlying mismatch.

Comparing settings also highlights that learned merging changes the balance between FP and Lost-TP. On PianoYT, the FP count decreases from 17,436 to 14,155, but Lost-TP increases from 2479 to 8843. On PianoVAM, FP increases from 33,966 to 37,229 and Lost-TP changes from 2900 to 3065. These shifts imply that improving one error source (e.g., suppressing spurious onsets) can introduce another (e.g., over-suppression leading to missed events), which helps explain why the best method depends on the evaluation level and the target domain (Table 4 and Table 5).
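To make this trade-off concrete, the percentages in Table 6 can be converted back to approximate event counts. The sketch below uses only the totals and category-A shares reported in Table 6; the resulting category counts are approximate because the published percentages are rounded, and the labels `rule-based` and `learned` are shorthand introduced here for the two settings.

```python
# (dataset, setting): (FP total, FP-A share %, Lost-TP total, Lost-A share %), from Table 6.
table6 = {
    ("PianoYT", "rule-based"): (17436, 74.67, 2479, 95.20),
    ("PianoYT", "learned"): (14155, 82.68, 8843, 80.04),
    ("PianoVAM", "rule-based"): (33966, 62.01, 2900, 90.00),
    ("PianoVAM", "learned"): (37229, 63.01, 3065, 91.55),
}

def relative_change(before, after):
    """Relative change in percent from `before` to `after`."""
    return 100.0 * (after - before) / before

for dataset in ("PianoYT", "PianoVAM"):
    fp_rule, fp_a_rule, lost_rule, lost_a_rule = table6[(dataset, "rule-based")]
    fp_cnn, fp_a_cnn, lost_cnn, lost_a_cnn = table6[(dataset, "learned")]
    print(dataset)
    print(f"  FP change:      {relative_change(fp_rule, fp_cnn):+.1f}%")
    print(f"  Lost-TP change: {relative_change(lost_rule, lost_cnn):+.1f}%")
    print(f"  FP + Lost-TP:   {fp_rule + lost_rule} -> {fp_cnn + lost_cnn}")
    print(f"  approx. FP-A events:   {round(fp_rule * fp_a_rule / 100)}"
          f" -> {round(fp_cnn * fp_a_cnn / 100)}")
    print(f"  approx. Lost-A events: {round(lost_rule * lost_a_rule / 100)}"
          f" -> {round(lost_cnn * lost_a_cnn / 100)}")
```

On PianoYT the summed FP and Lost-TP count rises from 19,915 to 22,998 under learned merging even though FP alone falls. Note that Lost-TP only counts onsets that the audio baseline had matched, so this sum is not a total error count.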

6.4. Note Fragmentation in MIDI Decoding

We frequently observe note fragmentation during MIDI decoding, where a single GT note is represented as multiple short notes in the predicted MIDI output. This phenomenon increases the number of onset events and can directly affect onset-level precision: even if one fragment matches a GT onset within the tolerance $δ$, additional fragments are counted as false positives. Moreover, fragmentation can exacerbate timing-related errors that fall into FP-A and Lost-A, because short fragments often produce jittered or shifted onset times around the true onset.

From the perspective of error analysis, note fragmentation can inflate FP counts and make timing-related categories appear even more dominant, especially on external datasets where probability peaks are less stable. Therefore, improving the MIDI decoding and post-processing stage is complementary to improving the merging strategy. Practical countermeasures include enforcing a minimum inter-onset interval per pitch, merging adjacent note segments separated by short gaps, or applying hysteresis thresholds when converting probability maps to binary events. Such post-processing can reduce spurious fragments and yield a more faithful event representation, potentially improving onset-level metrics without changing the upstream probability estimation.
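Two of the countermeasures listed above can be sketched in a few lines. This is an illustrative post-processing sketch, not the decoder used in the paper: the thresholds `high` and `low`, the gap `max_gap`, and the function names are hypothetical choices, and the input is a probability sequence for a single pitch on a frame grid.

```python
def hysteresis_decode(probs, high=0.6, low=0.3):
    """Convert one pitch's frame probabilities to (start, end) frame segments.

    A note starts when the probability reaches `high` and ends only when it
    drops below `low`, so brief dips between the two thresholds do not split it.
    `end` is exclusive.
    """
    segments, start = [], None
    for t, p in enumerate(probs):
        if start is None and p >= high:
            start = t
        elif start is not None and p < low:
            segments.append((start, t))
            start = None
    if start is not None:
        segments.append((start, len(probs)))
    return segments

def merge_fragments(segments, max_gap=2):
    """Merge same-pitch segments separated by at most `max_gap` frames."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Toy probability track for one key: a sustained note with two brief dips.
probs = [0.1, 0.7, 0.8, 0.45, 0.75, 0.8, 0.2, 0.7, 0.7, 0.1]

single_threshold = hysteresis_decode(probs, high=0.6, low=0.6)  # plain thresholding
with_hysteresis = hysteresis_decode(probs, high=0.6, low=0.3)
after_merging = merge_fragments(with_hysteresis, max_gap=2)

print(single_threshold)   # three fragments
print(with_hysteresis)    # two fragments
print(after_merging)      # one note
```

Gap merging would also fuse genuine fast repetitions of the same key, so `max_gap` trades fragmentation against repeated-note detection and would need tuning per tempo and repertoire.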

While note fragmentation is problematic in generating sheet music, precise note segmentation is not necessary in certain applications, for example, when performing harmonic analysis or when resynthesizing with instruments with weak attack characteristics, such as wind instruments. In such scenarios, our method is expected to be effective.

7. Conclusions

In this paper, we investigated multimodal automatic piano transcription by combining an audio-based transcription model with visual hand-skeleton cues extracted from performance videos. The underlying motivation was that hand motion provides spatial information about which pitches are likely to be played and when, potentially complementing audio-based predictions in acoustically ambiguous situations.

To explore this idea, we proposed HandSkeletonNet to estimate pitch-wise probability representations from hand-skeleton sequences and examined multiple merging strategies to integrate these visual probabilities with the audio probabilities produced by Omnizart. Both rule-based merging and learning-based merging using a CNN were evaluated on our self-compiled dataset and on two external datasets (PianoYT and PianoVAM).

On our dataset, incorporating hand-skeleton cues improved onset-level transcription compared with the audio-only baseline, and the learning-based merging model achieved the best overall performance. This indicates that, within the training domain, the model can learn useful relationships between audio and visual modalities. However, experiments with different temporal window sizes revealed that temporal context in the hand-skeleton branch does not consistently improve performance: using a single-frame representation ($T_{win} = 1$) outperformed a wider temporal window on our dataset.

On the external datasets, merging improved frame-level performance in some cases, but consistent onset-level improvements were not observed; in other words, onset-level metrics were sensitive to domain shift. Error analysis showed that most false positives and missed events were attributable to timing- and alignment-related errors rather than to pure audio- or hand-driven hallucinations. These results suggest that residual audio–video misalignment and temporal instability limit the effectiveness of hand-skeleton cues on unseen data, and that adding temporal context alone is insufficient to address this issue. For these reasons, the applicability of our method may be limited to specific scenarios, such as harmonic analysis.

We also observed frequent note fragmentation during MIDI decoding, where a single ground-truth note is represented as multiple short notes in the predicted output. This phenomenon inflates onset-level false positives and further amplifies timing-related errors, highlighting the importance of post-processing and event-level consistency in addition to probability estimation.

In conclusion, hand-skeleton cues can serve as a complementary signal for piano transcription, particularly within the training domain. However, achieving robust onset-level improvements on unseen datasets remains challenging due to timing instability, alignment errors, and note fragmentation. As future work, it would be beneficial to expand the dataset, particularly by incorporating videos captured from diverse viewpoints and covering a wider range of composers’ time periods, and to explore estimation using multiple cameras simultaneously, in order to improve accuracy, robustness to visual variations, and generalization across datasets. In addition, explicit handling of temporal alignment and MIDI post-processing within the network should be investigated to ensure more stable and faithful event representations.

Author Contributions

Conceptualization, K.Y., S.N. and J.S.; methodology, K.Y. and S.N.; software, K.Y.; validation, K.Y. and S.N.; formal analysis, K.Y.; investigation, K.Y.; resources, K.Y. and S.N.; data curation, K.Y.; writing—original draft preparation, K.Y.; writing—review and editing, K.Y., S.N., and J.S.; visualization, K.Y. and S.N.; supervision, S.N. and J.S.; project administration, S.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets presented in this article are not readily available due to copyright reasons. Requests to access the datasets should be directed to the corresponding author.

Acknowledgments

We sincerely thank Julián Villegas at the University of Aizu for his careful review and constructive feedback, which greatly improved this research.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Koepke, A.S.; Weng, O.; Leman, D.; Zisserman, A.; Sewell, C. Sight to Sound: An End-to-End Approach for Visual Piano Transcription. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: Piscataway, NJ, USA, 2020; pp. 1838–1842.
2. Wang, X.; Xu, W.; Liu, J.; Yang, W.; Cheng, W. An Audio-Visual Fusion Piano Transcription Approach Based on Strategy. In Proceedings of the 2021 24th International Conference on Digital Audio Effects (DAFx); IEEE: Piscataway, NJ, USA, 2022.
3. Li, Y.; Wang, X.; Wu, R.; Xu, W.; Cheng, W. A Two-Stage Audio-Visual Fusion Piano Transcription Model Based on the Attention Mechanism. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 32, 3618–3630.
4. Li, Y.; Wang, X.; Wu, R.; Xu, W.; Chen, W. A CRNN-GCN Piano Transcription Model Based On Audio And Skeleton Features. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW); IEEE: Piscataway, NJ, USA, 2023; pp. 1–5.
5. Benetos, E.; Dixon, S.; Giannoulis, D.; Kirchhoff, H.; Klapuri, A. Automatic Music Transcription: Challenges and Future Directions. J. Intell. Inf. Syst. 2013, 41, 407–434.
6. Wu, Y.-T.; Chen, B.; Su, L. Omnizart: A General Toolbox for Automatic Music Transcription. J. Open Source Softw. 2021, 6, 3391.
7. Hawthorne, C.; Elsen, E.; Song, J.; Roberts, A.; Simon, I.; Raffel, C.; Engel, J.; Oore, S.; Eck, D. Onsets and Frames: Dual-Objective Piano Transcription. arXiv 2017, arXiv:1710.11153.
8. Zivanovic, U.; Pilkov, I.; Cancino-Chacón, C.E. Pay Attention to the Keys: Visual Piano Transcription Using Transformers. In Proceedings of the 34th International Joint Conference on Artificial Intelligence (IJCAI), Montreal, QC, Canada, 8–16 August 2025.
9. Kim, Y.; Park, J.; Bae, J.; Kim, K.; Kwon, T.; Lerch, A.; Nam, J. PianoVAM: A Multimodal Piano Performance Dataset. In Proceedings of the 26th International Society for Music Information Retrieval Conference (ISMIR), Daejeon, Republic of Korea, 21–25 September 2025.
10. Zhang, F.; Bazarevsky, V.; Vakunov, A.; Tkachenka, A.; Sung, G.; Chang, C.-L.; Grundmann, M. MediaPipe Hands: On-device Real-time Hand Tracking. arXiv 2020, arXiv:2006.10214.
11. Yan, S.; Xiong, Y.; Lin, D. Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI: Washington, DC, USA, 2018.
12. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2016; pp. 1800–1807.
13. Bénédict, G.; Koops, V.; Odijk, D.; de Rijke, M. sigmoidF1: A Smooth F1 Score Surrogate Loss for Multilabel Classification. arXiv 2021, arXiv:2108.10566.
14. Lin, T.-Y.; Goyal, P.; Girshick, R.B.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 42, 318–327.
15. Raffel, C.; McFee, B.; Humphrey, E.J.; Salamon, J.; Nieto, O.; Liang, D.; Ellis, D.P.W. mir_eval: A Transparent Implementation of Common MIR Metrics. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Taipei, Taiwan, 27–31 October 2014.

Figure 1. Illustration of Audio-based AMT.

Figure 2. Spectrogram illustrating overtones detected from C4.

Figure 3. Overview of the proposed method.

Figure 4. Probabilities estimated by Omnizart (Onset, Duration, Offset).

Figure 5. Hand-skeleton graph derived from MediaPipe Hands.

Figure 6. Structure of HandSkeletonNet.

Figure 7. Architecture of Learning-based Merging Module.

Figure 8. Comparison of note recovery. Ground Truth (Top), Audio Only (Middle), and Proposed Method (Bottom). The horizontal axis represents time, while the vertical axis represents pitch. Each note is shown as a blue bar. The proposed method successfully recovers the lower chords and fast chromatic scales that were missed by the baseline.

Figure 9. Comparison of noise suppression. The horizontal axis represents time, and the vertical axis represents pitch. The red arrows in the baseline indicate scattered noise artifacts in the low and high regions, which are filtered out by the proposed method.

Figure 10. Failure case showing noise introduction. Ground Truth (Top), Audio Only (Middle), and Proposed Method (Bottom). The horizontal axis represents time, and the vertical axis represents pitch. The proposed method generates spurious notes near correct notes and in disjoint regions (red arrows), which were not present in the baseline.

Figure 11. Failure case showing note suppression. The horizontal axis represents time, and the vertical axis represents pitch. The red arrows in the center indicate valid notes that were correctly detected by the baseline but suppressed by the proposed method. Additionally, some spurious noise is introduced.

Table 1. Merging hyperparameters used in the proposed method.

Name	Symbol	Candidate Values	Adapted Value
Weight Floor	$P_{floor}$	$\{0.8, 0.6, 0.5, 0.3\}$	$0.667$
Weight Gamma	$γ$	$\{1.0, 1.2, 1.5\}$	$1.189$
Onset Thresh	$P_{th}^{(on)}$	$\{0.5, 1.0, 1.5, 2.0\}$	$1.167$
Duration Thresh	$P_{th}^{(dur)}$	$\{0.5, 1.0, 1.5\}$	$1.278$

Table 2. Training hyperparameters for the Learning-based Merging Module (implementation setting).

Name	Symbol	Value
Input channels	$C_{in}$	4 (onset, offset, duration, hand)
Base channels	–	64
Dropout	–	0.1
Dilated conv	–	kernel $3 \times 1$, dilation $(2, 1)$
Loss	–	BCEWithLogitsLoss
Positive weight	–	$\mathrm{pos\_weight} = 3$
Optimizer	–	Adam
Learning rate	$η_{cnn}$	$1 \times 10^{-3}$
Batch size	$B_{cnn}$	1

Table 3. Evaluation results on the internal dataset. “Naive product” refers to simple multiplication without the weight floor and gamma parameters. “Optimized params” denotes the setting in which the hyperparameters shown in Table 1 are tuned per song.

Method	Onset Precision (%)	Onset Recall (%)	Onset F₁ (%)	Frame Precision (%)	Frame Recall (%)	Frame Accuracy (%)	Frame F₁ (%)
Omnizart (Audio baseline)	58.74	71.27	64.09	69.98	27.4	24.17	39.38
Omnizart + Hand Skeleton Region	83.89	42.23	53.02	68.61	17.49	16.50	27.87
Omnizart + GNN (Naive product)	68.5	54.55	56.97	64.77	21.21	19.49	31.96
Omnizart + GNN (Global params, $T_{win} = 1$)	66.41	62.64	63.95	70.29	27.94	24.69	39.99
Omnizart + GNN (Global params, $T_{win} = 9$)	65.52	54.23	57.37	68.81	20.77	19.37	31.91
Omnizart + GNN (Optimized params, $T_{win} = 9$)	74.09	56.11	60.73	68.57	20.65	19.29	31.74
Omnizart + GNN + CNN ($T_{win} = 9$)	66.43	67.44	66.38	64.8	32.14	26.69	42.96

The bold numbers indicate the highest score in each metric.

Table 4. Evaluation results on the PianoYT dataset.

Method	Onset Precision (%)	Onset Recall (%)	Onset F₁ (%)	Frame Precision (%)	Frame Recall (%)	Frame Accuracy (%)	Frame F₁ (%)
Omnizart (Audio baseline)	81.23	96.56	87.68	71.69	78.89	58.72	75.12
Omnizart + GNN (Global params, $T_{win} = 1$)	86.41	94.83	90.06	71.55	80.34	59.23	75.69
Omnizart + GNN (Global params, $T_{win} = 9$)	86.42	94.84	90.07	71.54	80.34	59.23	75.69
Omnizart + GNN + CNN ($T_{win} = 9$)	83.37	91.32	86.93	73.27	78.43	59.28	75.76

The bold numbers indicate the highest score in each metric.

Table 5. Evaluation results on the PianoVAM dataset.

Method	Onset Precision (%)	Onset Recall (%)	Onset F₁ (%)	Frame Precision (%)	Frame Recall (%)	Frame Accuracy (%)	Frame F₁ (%)
Omnizart (Audio baseline)	62.51	77.05	67.94	75.34	43.13	31.54	54.68
Omnizart + GNN (Global params, $T_{win} = 1$)	67.47	75.14	70.3	74.58	43.9	31.5	55.28
Omnizart + GNN (Global params, $T_{win} = 9$)	66.22	73.66	68.89	74.6	43.93	31.6	55.30
Omnizart + GNN + CNN ($T_{win} = 9$)	66.38	75.45	69.78	71.61	48.14	32.14	57.57

The bold numbers indicate the highest score in each metric.

Table 6. Error taxonomy on external datasets. The total number of false-positive onset events (FP) and lost true positives (Lost-TP), along with the percentage breakdown by cause.

Dataset	Setting	FP (count)	FP-A	FP-B	FP-C	FP-Other	Lost-TP (count)	Lost-A	Lost-B (Drop)	Lost-Other
PianoYT	GNN (Global params, $T_{win} = 9$)	17,436	74.67%	0.01%	1.21%	24.11%	2479	95.20%	0.48%	4.32%
PianoYT	GNN + CNN	14,155	82.68%	0.00%	0.92%	16.40%	8843	80.04%	2.26%	17.70%
PianoVAM	GNN (Global params, $T_{win} = 9$)	33,966	62.01%	0.00%	0.07%	37.91%	2900	90.00%	2.28%	7.72%
PianoVAM	GNN + CNN	37,229	63.01%	0.00%	0.05%	36.94%	3065	91.55%	1.21%	7.24%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
