Generating Fingerings for Piano Music with Model-Based Reinforcement Learning

Abstract: The piano fingering annotation task refers to assigning finger labels to notes in piano sheet music. Good fingering helps improve the smoothness and musicality of piano performance. In this paper, we propose a method for automatically generating piano fingering using a model-based reinforcement learning algorithm. We treat fingering annotation as a partial constraint combinatorial optimization problem and establish an environment model for the piano performance process based on prior knowledge. We design a reward function based on the principle of minimal motion and use reinforcement learning algorithms to decide the optimal fingering combinations. Our innovation lies in establishing a more realistic environment model and adopting a model-based reinforcement learning approach, compared to model-free methods, to enhance the utilization of samples. We also propose a music score segmentation method to parallelize the fingering annotation task. The experimental section shows that our method achieves good results in eliminating physically impossible fingerings and reducing the amount of finger motion required in piano performance.


Introduction
Piano composers and arrangers add fingering to sheet music to indicate which finger should be used to play each note. However, most piano sheet music only provides detailed notation of the musical notes themselves, without specific fingering instructions. Human performers rely on their professional knowledge, personal physical conditions, the actual sequence of notes, and multiple rehearsals to determine the fingering. This process involves a significant amount of decision-making. A scientifically and logically arranged fingering not only reduces the difficulty of performance and relieves mental burden for the pianist but also enhances the musicality and fluidity of the performance. Generally, when novice performers receive a piece of music they are about to practice, their first task is to annotate their own fingering on the sheet music. In theory, any key on the piano can be played with any finger. However, the number of possible fingering paths increases exponentially with the length of the note sequence, and the majority of these fingerings are difficult to execute or even physically impossible.
The fingering annotation task is mainly approached through two categories: rule-based and data-driven methods. Rule-based methods [1][2][3][4][5][6][7] aim to summarize the anatomical and kinematic principles of piano playing, model the piano keyboard and human hand as realistically as possible, and scientifically quantify the motion cost of fingering transitions. They make use of the state information in simulated performances to decide on fingerings. Data-driven methods [8][9][10][11][12][13], on the other hand, utilize piano scores annotated with human fingerings to extract implicit patterns of how humans arrange fingerings. These patterns are then used to predict fingerings for new music scores.
In earlier works, rule-based methods were often used because datasets annotated by human experts were lacking. However, in recent years, the availability of the PIG dataset [10] has shifted the research focus towards data-driven methods. Nevertheless, in recent research [12], we have found that the accuracy of the PIG dataset needs improvement, as it contains fingerings like "crossed chords" or "thumbless cross" that are not realistically playable by humans. This directly affects the effectiveness of data-driven methods, which often generate fingerings that include such unplayable patterns.
In this research, we employed a rule-based approach. We model the generation of piano fingerings as a Markov decision process (MDP). The agent interacts with its environment and evaluates the fingerings using a reward function. We introduce the prioritized sweeping algorithm [14] to implement model-based reinforcement learning. In order to execute the algorithm without enumerating the state space and action space of the piano fingering problem, we replaced the traditional Q-table used in tabular reinforcement learning with a hash table based on key-value storage. Additionally, we propose a method for segmenting the music score to enable the parallel execution of the algorithm. The complete processing steps are shown in Figure 1. The main contributions are summarized as follows:
• We constructed a more realistic environment model for piano performance, and invalid action masking is used to maximally constrain physically impossible fingerings.
• We used model-based reinforcement learning to address the fingering annotation task, which has improved the sampling efficiency compared to the previous model-free approach.
• We introduce a Q-table based on key-value storage to achieve tabular reinforcement learning without enumerating action and state spaces.
• We propose a piano sheet music segmentation method to parallelize problem-solving.
This paper is divided into seven sections. Section 1 provides a brief overview of the problem. Section 2 presents related work in the field. Section 3 describes the approach used to model the environment for the problem. Section 4 describes the learning algorithm we used. Section 5 discusses the experimental details and result evaluation. In Section 6, a method for segmenting the music score is presented. Finally, Section 7 concludes the paper and provides suggestions for future research.

Related Work
Methods for automatic piano fingering annotation can be broadly divided into two categories: rule-based and data-driven.
The earliest rule-based method for piano fingering annotation was proposed by R. Parncutt and others [1]. Hart [2] introduced a method that utilizes dynamic programming to compute optimal fingerings for single-note melodies. Balliauw [3,4] applied two different search algorithms to tackle this problem. Ramoneda [5] and Koornstra [6] explored model-free reinforcement learning algorithms to find fingerings with minimal hand movement; however, since these approaches are based on a very simple environment model, they can only handle single-note melodies. Xu [7] conducted research on the application of reinforcement learning to the piano performance of humanoid robotic hands with only four fingers. Applying reinforcement learning to other combinatorial optimization problems has also been the subject of extensive research [15][16][17]. Huang et al. [18] conducted research on invalid action masking in reinforcement learning.
E. Nakamura and Y. Yonebayashi [8,9] proposed hidden Markov models (HMMs) for piano fingering, respectively, for single-handed and double-handed outputs, laying the foundation for subsequent data-driven approaches using statistical learning to predict fingerings. In their subsequent work [10], E. Nakamura formalized the task as a statistical learning problem and achieved good fingering consistency by modeling fingerings with a third-order HMM and a long short-term memory (LSTM) network, which is similar to the part-of-speech tagging task in natural language processing. They also introduced the chord hidden Markov model (CHMM) for chord fingering modeling and published the PIG dataset. Guan et al. [11] proposed the pitch difference model, which efficiently utilized the PIG dataset by using relative pitch instead of absolute pitch. Srivatsan et al. [12] proposed checklist models, which introduced soft constraints and used the REINFORCE algorithm to optimize evaluation metrics, improving the overall fluency of fingerings. However, the experimental results show that the soft constraint rules used in that work do not fundamentally eliminate unplayable fingerings such as crossed chords, and the optimization objectives of the reinforcement learning part also need to be improved. Randolph et al. [13] treated the fingering annotation as an information retrieval problem and used the Czerny corpus as the dataset.
The quality of data determines the upper limit of data-driven approaches. Existing data-driven methods often struggle to directly eliminate physically impossible fingering predictions. Additionally, during data preprocessing, they often create unrealistic fingerings. For example, due to limited data, data-driven methods adopt a technique of flipping left-hand finger labels and pitches to create a second "right-hand portion" in order to augment the training data. However, although the human left and right hands are mirror images of each other, the arrangement of black and white keys on the keyboard is not mirror-symmetric. Simply flipping the left-hand pitches and finger labels cannot accurately represent a right-hand performance. Rule-based methods, on the other hand, have difficulties describing the performance process with fixed paradigms, and their modeling accuracy needs improvement, as many studies are limited to fingering for single-note melodies.
As a continuation of rule-based method research, we aim to formalize the piano performance process as an MDP and propose a quantifiable method for calculating motion to enhance the accuracy of the environment model. Our research aims to achieve a certain level of matching with human fingerings while restricting physically impossible fingerings as far as possible and minimizing motion, even seeking fingerings superior to human annotations. Our action constraints and modeling methods can also be applied to future research combining data-driven methods.

Overview of Model Definitions
In reinforcement learning, an environment model is a simulation model that interacts with an agent. It allows the agent to observe the state, receives the agent's action signals, returns rewards to the agent, and transitions to the next state according to probability. The environment model is an MDP, which can be described by a 4-tuple $(S, A, P_a(s, s'), R_a(s, s'))$. $S$ is the state space, $A$ is the action space, $P_a$ is the state transition probability, and $R_a$ is the reward function for state transition; $s$ and $s'$ represent the current state and the new state to which it transitions after executing action $a$, respectively.
In this problem, the state $s$ is composed of a triplet $(i, f_g, n_{next})$, where $i$ represents the position of a note or chord in the musical piece in terms of its chronological occurrence; $f_g$ represents the current fingering action, which is a sequence composed of note-finger pairs, denoted by $p_{nf} = (n, f)$, so that $f_g = (p_{nf1}, p_{nf2}, \ldots, p_{nf5})$; $f$ is a finger number (1 = thumb, 2 = index finger, $\ldots$, 5 = little finger) on both hands; and $n_{next}$ represents the sequence of notes to be played in the next time step.
Due to the difficulty of fully enumerating the state space, we employ the tabular reinforcement learning method with key-value storage described in Section 4.1 to avoid enumerating the state space $S$ and action space $A$. Since the state transitions are deterministic, $P_a$ is always 1. The reward function $R_a$ quantifies the difficulty of fingering transitions: for more challenging fingerings it returns lower rewards, and conversely, for easier fingerings it returns higher rewards. The objective of the reinforcement learning algorithm is to maximize the accumulated reward, thereby determining the optimal fingering decision path.

Simulation of Keyboard and Hand Interaction
This study utilizes the "white key distance" metric to describe the distance between two keys on a keyboard. It converts the distance between any two keys on the keyboard into a count of white keys, with black keys counting as 0.5 white keys.
Here, $k_w$ represents the enumeration of MIDI note numbers for all white keys, $d_{A0}$ represents the distance from the leftmost A0 key on the piano to a specific note, and $d_k$ refers to the white key distance between two piano keys, $note_1$ and $note_2$.
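As an illustration of the white key distance, the following Python sketch gives one possible reading of the description above. The function names `d_a0` and `white_key_distance` and the convention that a black key sits 0.5 white keys above the white key to its left are assumptions made for this example; MIDI note 21 is taken as A0.

```python
# Pitch classes (MIDI number mod 12) of the white keys C, D, E, F, G, A, B.
WHITE_PITCH_CLASSES = {0, 2, 4, 5, 7, 9, 11}
A0 = 21  # MIDI number of the leftmost piano key


def d_a0(note: int) -> float:
    """White key distance from A0 to `note` (assumed convention, see lead-in)."""
    whites_below = sum(1 for m in range(A0, note) if m % 12 in WHITE_PITCH_CLASSES)
    if note % 12 in WHITE_PITCH_CLASSES:
        return float(whites_below)
    # A black key lies halfway between its two neighbouring white keys.
    return whites_below - 0.5


def white_key_distance(note1: int, note2: int) -> float:
    """Distance d_k between two keys, measured in white keys."""
    return abs(d_a0(note1) - d_a0(note2))
```

Under this convention an octave spans 7 white keys, a semitone between two white keys (such as E4 to F4) spans 1, and a step from a white key to the adjacent black key spans 0.5.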
Regarding hand features, we can sample the performer's hand and represent it with a matrix $M_f$.
White key distance $d_k$ and hand feature matrix $M_f$ are illustrated in Figure 2, and Figure 3 illustrates the correspondence between finger numbers and specific fingers: $f_1$ to $f_5$ represent the thumb to the little finger. The values in the matrix are measured in white key distance and record the maximum expansion distance or maximum crossing distance of each finger. A value of $-1$ means that the corresponding action cannot be performed. Taking the right hand as an example, let $f_l$ represent the finger with the lower label, $f_h$ the finger with the higher label, and $f_c$ the finger used for cross fingering. The maximum expansion distance $d_{smax}$ and maximum crossing distance $d_{cmax}$ between two fingers are then obtained from the corresponding entries of $M_f$.

Invalid Action Masking
In an MDP, we want the agent to always perform anatomically possible actions, which we enforce through invalid action masking. This way, we do not need to penalize physically impossible fingerings in the design of the reward function (invalid action penalty). Instead, we can quantitatively evaluate the difficulty of all possible fingering transitions. Therefore, we need to define a mask function (Algorithm 1) that returns the set of physically possible actions for each state $s$. Invalid action masking also increases the sampling efficiency of the agent by reducing the search space. When the number of notes requiring finger allocation at the next moment is $n$, enumerating all potential fingerings can be seen as a simple combination problem: selecting $n$ fingers from the five available fingers. The number of combinations is $C_5^n$. However, not all potential fingerings are playable. Some fingerings may have a distance between two fingers that exceeds the maximum expansion distance $d_{smax}$, or may violate finger contraction rules. To determine whether a fingering exceeds the maximum expansion distance, we compare the white key distance $d_k$ between two notes in the fingering with the maximum expansion distance $d_{smax}$. If $d_k > d_{smax}$, it is considered physically impossible for the two fingers to stretch that far. To identify violations of finger contraction, we compare the number of piano keys between two notes with the number of fingers involved. If $|note_1 - note_2| < |finger_1 - finger_2|$, where neither $finger_1$ nor $finger_2$ is the thumb, it is considered a violation of finger contraction. Furthermore, transfer constraints should also take into account physically impossible scenarios in 1-to-1 fingerings, such as the thumb not being able to transition from a white key to a black key during cross fingering.
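The two chord-level checks described above can be sketched in Python as follows. This is not Algorithm 1 itself, which is not reproduced in the text, but a minimal illustration for a right-hand chord in which fingers are assigned in ascending order of pitch. The table `MAX_STRETCH` holds hypothetical stretch limits standing in for the entries of $M_f$, and the white key distance helper follows the assumed convention that a black key lies 0.5 white keys above its left neighbour.

```python
from itertools import combinations

WHITE_PITCH_CLASSES = {0, 2, 4, 5, 7, 9, 11}

# Hypothetical maximum stretch (in white keys) per finger pair; illustrative only.
MAX_STRETCH = {
    (1, 2): 5.0, (1, 3): 6.0, (1, 4): 7.0, (1, 5): 8.0, (2, 3): 2.5,
    (2, 4): 3.5, (2, 5): 5.0, (3, 4): 2.0, (3, 5): 3.5, (4, 5): 2.5,
}


def d_a0(note):
    whites_below = sum(1 for m in range(21, note) if m % 12 in WHITE_PITCH_CLASSES)
    return whites_below if note % 12 in WHITE_PITCH_CLASSES else whites_below - 0.5


def pair_is_playable(note_a, finger_a, note_b, finger_b):
    """Check one pair with note_a < note_b and finger_a < finger_b."""
    d_k = abs(d_a0(note_a) - d_a0(note_b))
    if d_k > MAX_STRETCH[(finger_a, finger_b)]:
        return False  # stretch exceeds the maximum expansion distance
    no_thumb = finger_a != 1 and finger_b != 1
    if no_thumb and abs(note_a - note_b) < abs(finger_a - finger_b):
        return False  # finger contraction violation
    return True


def playable_chord_fingerings(notes):
    """Return the finger tuples that survive the mask for a right-hand chord."""
    notes = sorted(notes)
    valid = []
    for fingers in combinations(range(1, 6), len(notes)):  # C(5, n) candidates
        pairs = list(zip(notes, fingers))
        if all(pair_is_playable(na, fa, nb, fb)
               for (na, fa), (nb, fb) in combinations(pairs, 2)):
            valid.append(fingers)
    return valid
```

With these illustrative limits, `playable_chord_fingerings([60, 72])` keeps only the thumb paired with finger 4 or 5 for the octave C4-C5, so the agent never has to evaluate the other eight candidates.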

Quantification of Fingering Transition Difficulty
Below are some metrics for quantifying the fingering transition difficulty, which will be involved in the computation of the reward function.

Stretching Rate
During single-note playing, finger stretching and contracting frequently occur, and the stretching abilities of different fingers are not equal. This study proposes a quantification method called the "finger stretching rate" to measure finger extension, as shown in Figure 4a.
We first define the finger extension between two fingers: when two fingers are placed on the keyboard, the white key distance between their finger labels $f_a$ and $f_b$ is defined as the natural distance $d_{nature}$. The maximum extension distance can be obtained from the matrix $M_f$ mentioned earlier, denoted by $d_{max}$, which represents the maximum extension ability between the two fingers. The actual distance between the two fingers on the keyboard is the white key distance $d_k$ between the two pressed notes.
The stretching rate can be defined as $r_s = (d_k - d_{nature})/(d_{max} - d_{nature})$. A value close to 1 indicates that the fingers are approaching their maximum stretching ability, resulting in greater discomfort and a lower reward. Similarly, for finger contraction, we can define the contraction rate $r_s = (d_{nature} - d_k)/d_{nature}$. A value close to 1 indicates that the fingers are approaching their maximum contraction ability, resulting in stronger discomfort and a lower reward.
During chord playing, we allocate $n$ fingers on the keyboard. The number of finger pairs is $C_n^2$. The overall stretching rate of the chord fingering is obtained by averaging the stretching rates $r_s$ of the finger pairs; specifically, $r_{all} = \left(\sum_{k=1}^{c} (r_{sk})^a\right)/c$, where $c$ is the number of finger pairs. Low stretching rates cause little discomfort, and the discomfort becomes more noticeable when approaching maximum stretching. Therefore, a parameter $a > 1$ is introduced to slow down the growth of the stretching rate at lower values of stretching.
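A small Python sketch of these rates is given below. The function names are chosen for this example, the natural distance is assumed to be $|f_a - f_b|$ (one white key per finger), and $d_{max} > d_{nature}$ is assumed so that the denominator is non-zero; the value $a = 2$ is only a placeholder, since the tuned value is not stated in the text.

```python
def natural_distance(f_a, f_b):
    # Assumption: fingers resting on adjacent white keys, one key per finger.
    return abs(f_a - f_b)


def stretching_rate(d_k, d_nature, d_max):
    """r_s for extension: 0 at the natural distance, 1 at the maximum stretch."""
    return (d_k - d_nature) / (d_max - d_nature)


def contraction_rate(d_k, d_nature):
    """r_s for contraction: 0 at the natural distance, 1 when the keys coincide."""
    return (d_nature - d_k) / d_nature


def pair_rate(d_k, d_nature, d_max):
    """Pick the extension or contraction formula depending on d_k."""
    if d_k >= d_nature:
        return stretching_rate(d_k, d_nature, d_max)
    return contraction_rate(d_k, d_nature)


def overall_rate(rates, a=2.0):
    """r_all: mean of the pair rates raised to the power a > 1."""
    return sum(r ** a for r in rates) / len(rates)
```

For example, with fingers 1 and 5 ($d_{nature} = 4$) and a hypothetical $d_{max} = 8$, an octave ($d_k = 7$) gives $r_s = 0.75$; raising the rates to the power $a$ before averaging keeps small stretches cheap while still penalizing near-maximum ones.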

Hand Movement Distance
Hand position refers to the position of the hand on the keyboard. Since the hand covers an area of multiple piano keys, it is not convenient to be precise about specific keys. However, by observing the anatomical diagrams of the human hand, we can approximate the hand position with the position of the middle finger when the hand is naturally relaxed on the keys. In cases where multiple fingers are involved in playing chords and other techniques, the hand position can be approximated by averaging the positions of the thumb and little finger. The hand displacement distance can be calculated by considering the changes in hand position before and after fingerings, as shown in Figure 4b.
For the left hand, $h = -1$, and for the right hand, $h = 1$. In the case of single-note fingerings, the hand position is calculated from the note $n$ and the finger $f_n$ pressing it. For chord fingerings with multiple notes, the hand position is determined from the finger $f_l$ and note $n_l$ of the lowest pitch in the fingering, together with the finger $f_h$ and note $n_h$ of the highest pitch. If the positions before and after the fingering transition are denoted by $p_{pre}$ and $p_{next}$, respectively, the hand position movement distance can be calculated as $d_h = p_{pre} - p_{next}$. A larger hand position movement indicates stronger discomfort during the fingering transition, resulting in a lower reward.

Cross Fingering Distance
Cross fingering is a special finger technique where one finger crosses over the other fingers, as shown in Figure 4c; it is commonly used in fast-paced performances that cover a wide range of musical notes. Cross fingering refers to fingers 2, 3, and 4 crossing over finger 1, or finger 1 passing under fingers 2, 3, and 4. When calculating the cross fingering distance, let the note played by finger 1 (the thumb) be $n_a$, and the note played by the finger $f_c$ participating in the finger crossing be $n_c$. The cross fingering distance is calculated from $n_a$ and $n_c$. The greater the cross fingering distance, the stronger the discomfort in finger movement and the lower the reward.

Fingering Mismatches Count
The term "fingering mismatch count" $c_m$ refers to the number of occurrences, in a pair of adjacent fingerings, where the same note is played using different fingers or where a single finger plays two different notes, as shown in Figure 4d. In such cases, finger transitions can also pose certain difficulties.

Inverse Fingering Count
The "inverse fingering count" $c_i$ involves rearranging all the notes within a pair of adjacent fingerings from low to high and calculating the inversion number of finger labels corresponding to each note, as shown in Figure 4e. If an inverse fingering exists, it makes fingering transitions more challenging, and thus lower rewards should be assigned.
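The two counts $c_m$ and $c_i$ can be sketched in Python as below. The representation of a fingering as a list of `(midi_note, finger)` pairs and both function names are assumptions for this example, and the inversion count follows the right-hand convention in which higher notes are expected on higher finger labels.

```python
from itertools import combinations


def mismatch_count(prev, nxt):
    """c_m for two adjacent fingerings given as lists of (midi_note, finger)."""
    count = 0
    for n1, f1 in prev:
        for n2, f2 in nxt:
            same_note_other_finger = n1 == n2 and f1 != f2
            same_finger_other_note = f1 == f2 and n1 != n2
            if same_note_other_finger or same_finger_other_note:
                count += 1
    return count


def inversion_count(prev, nxt):
    """c_i: inversions among finger labels after sorting all notes low to high."""
    fingers = [f for _, f in sorted(prev + nxt)]
    return sum(1 for a, b in combinations(fingers, 2) if a > b)
```

For instance, playing C4 with finger 3 and then D4 with finger 2 yields one inversion, whereas C4-E4 with fingers 1-3 followed by G4 with finger 5 yields none.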

Reward Function
The reward function represents the evaluation of fingering difficulty. We quantify the motion information of fingering transitions mentioned above and assign higher rewards to fingering transitions that are more comfortable and easier to play. The maximum reward for each fingering transition is set to around 50. The parameters of the reward function have been carefully balanced through iterative experiments.
For single-note to single-note fingering transitions (i.e., 1-to-1 transitions as shown in Table 1), there are three possibilities: finger expansion/contraction, cross fingering, and hand position movement. We prioritize finger expansion and contraction as the first choice, followed by cross fingering, and finally consider hand position movement.
For single-note to multiple-note or multiple-note to multiple-note fingering transitions (i.e., 1-to-n and n-to-n transitions, as shown in Table 1), when the hand position movement distance is significant, we prioritize fingering options with lower expansion rates. For multiple-note to single-note fingering transitions (i.e., n-to-1 transitions as shown in Table 1), the main consideration is to minimize the hand movement distance.

reward = (50
In this task, attempting to directly enumerate the discrete state space and action space to construct a $|S| \times |A|$ table is very cumbersome, especially since the number of combinations of 88 piano keys and 5 finger states is very large. Moreover, for any state $s$, it is impossible to access the complete action space. To address this, we designed a key-value storage tabular reinforcement learning method, as shown in Figure 5, replacing the underlying access of the Q-table with an arbitrary key-value storage system, such as a hash table. This method avoids the complete enumeration of the state space and action space. Like classical tabular reinforcement learning, this method is only applicable to MDPs with discrete state spaces and discrete action spaces, and requires that the state-action pairs in the MDP be finite. After the agent performs action $a$ in state $s$, it uses a hash function to calculate the hash value of the state-action pair $(s, a)$ as the address in the hash table. This address stores the Q-value of the corresponding state-action pair $(s, a)$. After performing the action and obtaining the reward $r$, the updated Q-value is written back to the corresponding address in the hash table, and then the agent enters the next state.
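The idea of a Q-table backed by key-value storage can be illustrated with a Python dictionary, which is itself a hash table. The class below is a minimal sketch rather than the implementation used in the experiments: the class name, the method names, and the one-step Q-learning update are choices made for this example, and states and actions are assumed to be hashable tuples such as $(i, f_g, n_{next})$.

```python
class KeyValueQTable:
    """Q-table backed by a dict, so only visited (state, action) pairs are stored."""

    def __init__(self, default=0.0):
        self._values = {}
        self._default = default

    def get(self, state, action):
        # The dict hashes the (state, action) key to locate the stored Q-value.
        return self._values.get((state, action), self._default)

    def best_value(self, state, actions):
        # An empty action list (terminal state) contributes no future reward.
        return max((self.get(state, a) for a in actions), default=0.0)

    def update(self, state, action, reward, next_state, next_actions,
               alpha=0.99, gamma=1.0):
        """One-step Q-learning update; returns the TD-error before the update."""
        target = reward + gamma * self.best_value(next_state, next_actions)
        td_error = target - self.get(state, action)
        self._values[(state, action)] = self.get(state, action) + alpha * td_error
        return td_error

    def __len__(self):
        return len(self._values)
```

Because entries are created only on write, memory grows with the number of state-action pairs actually visited rather than with $|S| \times |A|$.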

Model-Based Reinforcement Learning in Fingerings Annotation
In earlier studies on piano fingering annotation using reinforcement learning, model-free algorithms such as Q-learning [19] or deep Q network (DQN) [20] were commonly employed. Model-free algorithms do not estimate the probability distribution of state transitions and the reward function of the MDP. They can only learn through interaction with the environment.
Instead, model-based algorithms can learn by predicting the rewards and state transitions after executing actions. This process is referred to as model learning [17]. In this task, since the environment model is deterministic, meaning the state transition probability $P_a(s, s')$ is always 1, utilizing model-based reinforcement learning algorithms can reduce the interaction between the agent and the environment model compared to model-free algorithms, thereby improving the sampling efficiency. Therefore, a model-based reinforcement learning approach like prioritized sweeping is more suitable for this task.
Prioritized sweeping is essentially an improvement to the Dyna architecture [21]. During the planning step of Dyna, it introduces a priority queue that uses the experience gathered in model learning to prioritize state-action pairs with a larger TD-error. In Algorithm 2, we replaced the Q-table and Model(s, a) in the original algorithm with implementations using hash tables instead of 2D matrices. This is intended to avoid the enumeration of the state space and action space.
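To make the procedure concrete, the following Python sketch implements textbook prioritized sweeping for a deterministic environment, with dictionaries playing the role of the hash-table Q-table and model. It is not Algorithm 2 itself: the function signature, the planning budget `n_planning`, and the toy chain environment are assumptions for illustration, while the default values of `alpha`, `gamma`, `epsilon`, and `theta` mirror those reported in the implementation details.

```python
import heapq
import random
from collections import defaultdict


def prioritized_sweeping(start, actions_fn, step_fn, episodes=200, alpha=0.99,
                         gamma=1.0, epsilon=0.8, theta=3.0, n_planning=10, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)            # hash-table Q: (state, action) -> value
    model = {}                        # learned model: (state, action) -> (reward, next_state)
    predecessors = defaultdict(set)   # next_state -> {(state, action)}
    queue = []                        # max-priority queue via negated priorities

    def best_value(state):
        return max((q[(state, a)] for a in actions_fn(state)), default=0.0)

    def push_if_large(state, action):
        reward, nxt = model[(state, action)]
        priority = abs(reward + gamma * best_value(nxt) - q[(state, action)])
        if priority > theta:
            heapq.heappush(queue, (-priority, state, action))

    for _ in range(episodes):
        state = start
        while actions_fn(state):                     # empty action list = terminal state
            actions = actions_fn(state)
            if rng.random() < epsilon:               # epsilon-greedy exploration
                action = rng.choice(actions)
            else:
                action = max(actions, key=lambda a: q[(state, a)])
            reward, nxt = step_fn(state, action)     # real interaction
            model[(state, action)] = (reward, nxt)   # model learning
            predecessors[nxt].add((state, action))
            push_if_large(state, action)
            for _ in range(n_planning):              # planning from the learned model
                if not queue:
                    break
                _, s, a = heapq.heappop(queue)
                r, s2 = model[(s, a)]
                q[(s, a)] += alpha * (r + gamma * best_value(s2) - q[(s, a)])
                for ps, pa in predecessors[s]:       # propagate the change backwards
                    push_if_large(ps, pa)
            state = nxt
    return q


# Toy deterministic chain (hypothetical): states 0..4, state 4 is terminal.
def chain_actions(state):
    return ["good", "bad"] if state < 4 else []


def chain_step(state, action):
    return (50.0 if action == "good" else 10.0), state + 1
```

Calling `prioritized_sweeping(0, chain_actions, chain_step)` on the toy chain yields Q-values under which the greedy policy prefers the higher-reward action in every state; in the fingering task, `actions_fn` would be the mask function and `step_fn` the environment model.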

Implementation Details
This experiment utilizes music scores in MusicXML format as experimental materials and incorporates the PIG dataset for algorithm comparison. The process begins with parsing the MusicXML into a music21 [22] stream. Then, all chord symbols are removed, and the resulting treble clef and bass clef parts are separately exported into two MIDI files. The MIDI.jl [23] library is employed to parse the notes from the MIDI files, sorting them in chronological order, and combining them into a two-dimensional array based on their play time. The MDP modeling is implemented using POMDPs.jl [24], and a corresponding prioritized sweeping solver is developed.
The agent's exploration strategy utilizes the epsilon-greedy algorithm with an exploration rate of $\epsilon = 0.8$. For the reinforcement learning algorithm in a fully deterministic environment, a learning rate of 0.99 is employed, emphasizing the learning of new Q-values to facilitate faster convergence. Since the constructed MDP for this task does not possess cyclic structures, the agent can reach the terminal state within a finite number of steps. Hence, a discount factor of $\gamma = 1$ is utilized, so that the predicted future reward, i.e., $\max_a Q(s_{t+1}, a)$, is fully absorbed. In the planning process of prioritized sweeping, a threshold of $\theta = 3$ is chosen for the priority queue. Any state-action pair with a TD-error greater than $\theta$ is added to the priority queue. During the reinforcement learning algorithm's iterations, the total reward is calculated based on the optimal value function every 10 iterations. If the change in total reward is within the range of $\pm 0.5$, we consider that the algorithm has converged, and the iteration process is stopped.
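The stopping rule described above can be sketched as a small Python loop. The function `train_until_converged` and its two callback parameters are hypothetical names introduced for this example; `run_iteration` stands for one learning iteration and `evaluate_total_reward` for the total reward of the current greedy fingering path.

```python
def train_until_converged(run_iteration, evaluate_total_reward,
                          check_every=10, tolerance=0.5, max_iterations=10_000):
    """Run iterations until the total reward changes by at most `tolerance`
    between two consecutive checks; returns the number of iterations used."""
    previous = None
    for iteration in range(1, max_iterations + 1):
        run_iteration()
        if iteration % check_every == 0:
            total = evaluate_total_reward()
            if previous is not None and abs(total - previous) <= tolerance:
                return iteration
            previous = total
    return max_iterations
```

The `max_iterations` cap is an added safeguard for the sketch so that the loop always terminates, and is not part of the procedure described in the text.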
Finally, the experiment was conducted on a regular computer with an i5-9500 CPU and 8 GB RAM. The algorithm was deployed on a Linux 6.1 LTS operating system, and the experimental code was run with the Python 3.11 and Julia 1.9 interpreters.

Influence of Model-Based Method on Sampling Efficiency
We used model-free Q-learning and model-based prioritized sweeping to annotate the same music score, and recorded the total rewards obtained based on the optimal actions after each iteration, as shown in Figure 6. It can be observed that model-based methods have a higher sampling efficiency compared to model-free methods. This is because the prioritized sweeping agent utilizes the experience from the learned model during the planning step to iteratively update the state-action pairs where the TD-error exceeds a threshold, and propagates this process forward.

Analysis of Fingering Results
To facilitate comparison, we conducted experiments using the PIG dataset. Due to differences in input representations, we performed some preprocessing steps. For instance, we treated the passages between longer rests as separate sequences and considered arpeggios as chords rather than continuous single notes. We compared our algorithm with the statistical learning algorithm [10] and the DQN reinforcement learning algorithm [5]. Since the DQN implementation only supports monophonic melodies and is based on a simpler environment model, we removed most of the melodies containing chords and used less data for comparison.
We evaluated the results using the match rate $M_{gen}$ [10] for annotation accuracy and the metrics crossed chord, thumbless cross, hop, and step spread [12] for fingering quality. The dataset ground truth is presented in Table 2. The experimental results are presented in Table 3.
We made an improvement in evaluating the overall hand stretching using the step spread metric. Previously, the method used the semitone difference $\Delta p$ divided by the difference in finger labels $\Delta f$. In our approach, we replaced the semitone difference with the difference in white key distances $\Delta d_k$, resulting in $StepSpread = \Delta d_k / \Delta f$. This modification provides a better reflection of the physical distances on the keyboard. Our method exhibited a lower match rate $M_{gen}$ compared to the statistical approach, which was expected considering that reinforcement learning algorithms do not acquire any experience from the PIG dataset. However, our method achieved significant improvement in several metrics that measure poor fingering, such as crossed chord, thumbless cross, and hop. We were able to eliminate nearly all instances of physically impossible fingerings and difficult-to-play fingerings.
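The modified step spread for a single transition can be sketched in Python as follows. The helper `d_a0` uses the assumed convention that a black key lies 0.5 white keys above its left neighbour, the function name `step_spread` is chosen for this example, and returning `None` when both notes use the same finger is a choice of the sketch, since the text does not say how that case is handled or how per-transition values are aggregated.

```python
WHITE_PITCH_CLASSES = {0, 2, 4, 5, 7, 9, 11}


def d_a0(note):
    """White key distance from A0 (MIDI 21) to `note`."""
    whites_below = sum(1 for m in range(21, note) if m % 12 in WHITE_PITCH_CLASSES)
    return whites_below if note % 12 in WHITE_PITCH_CLASSES else whites_below - 0.5


def step_spread(note1, finger1, note2, finger2):
    """White key distance per finger step between two consecutive notes."""
    delta_f = abs(finger1 - finger2)
    if delta_f == 0:
        return None  # undefined when the finger does not change
    return abs(d_a0(note1) - d_a0(note2)) / delta_f
```

For example, C4 to G4 played with fingers 1 and 5 gives 4 white keys over 4 finger steps, that is, a step spread of 1.0, whereas the octave C4 to C5 with the same fingers gives 1.75.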
Our method achieved the best results in the evaluation of the overall hand stretching metric, step spread. This indicates that our reinforcement learning algorithm has been optimized for motion efficiency, leading to superior performance in terms of hand stretching.

More Comfortable Fingering
Because our method optimizes motion cost, it exhibits smaller finger movement compared to statistical learning approaches. For instance, Figure 7 shows the fingering results for Masaru Yokoyama's "Syotengaru": the reinforcement learning approach selects finger 4 to play G4, while the statistical approach selects fingers 3 and 2. Clearly, the statistical learning method involves greater hand stretching, whereas our method produces fingerings with reduced stretching, offering enhanced comfort.

Explore Better Fingering
Our method has the potential to discover fingerings that are better than human-annotated fingerings, as shown in Figure 8. The human-annotated fingering is in line with first intuition: as there are two octaves, A4-A5 and B4-B5, choosing to directly move the hand position is intuitive. However, the fingering derived by the algorithm does not involve a hand position change. It selects the index finger to play B4 and then crosses the thumb to play C5. This approach not only avoids hand movement but also reduces the discomfort caused by the significant stretch of using the finger transition 2-5 to play C5-B5, as indicated in the human-annotated fingering.

Human Evaluation
We conducted a small-scale human evaluation of the fingering generated by the algorithm. We recruited a music master's degree holder with several years of experience playing piano and provided them with ten excerpts of sheet music annotated with fingering generated by the algorithm.
The pianist provided feedback on the fingering segments we provided. He noted that our method ensured fundamental playability, consistent with our expectations. Our method was beneficial in optimizing finger stretches in most cases, especially in piano arrangements of some popular music pieces, which typically place fewer demands on finger agility. However, the pianist also observed that in a very few instances, our method did not perform well on melodic segments with specific sequential patterns, where each sequential repetition requires the same fingerings for consistent fingering memorization and smooth performance rather than simply minimizing motion. Clearly, our method cannot account for musical semantics. This suggests that our future work must involve a combination of data-driven and rule-based methods.

Segmentation Method
Due to the exponential growth of the number of note combinations with sequence length, heuristic search methods or reinforcement learning methods often require more time when annotating long sequences of music scores.In previous studies, there have been no attempts to parallelize the task of automatic piano fingering annotation.Due to the presence of Markovian properties in the fingering allocation process, it is often believed that the notes to be annotated must be modeled as a complete HMM or a MDP.However, based on E. Nakamura's statistical analysis [10] of human-annotated piano fingering scores, they discovered that an apex note in a monophonic passage has an almost unique choice of finger, i.e., the little finger.We found that this property is not only global but also holds true for the apex single notes in local melodic fragments.
A local region is defined as a gradually expanding range centered around a single note.The notes within this range must be played in quick succession, with no significant gap in the time between two notes (usually not exceeding the length of a half note), and the pitch range should exceed a perfect fifth.If within this range, there are at least five different pitches of notes or, if the pitch range exceeds an octave, and the pitch of the center single note is the highest within the range, then that single note is considered a local apex note within the local region, as shown in Figure 9.The same applies to the lowest single note for the left hand.We analyzed the fingering data in the PIG dataset that were annotated by at least four individuals to verify the probability of local apex notes being played with the little finger.The results indicate an accuracy of 98.41% for right-hand notes and 99.34% for left-hand notes.This indicates that the fingering selection for local apex notes is also nearly unique.
The uniqueness of the fingering for local apex notes allows us to partition the entire music score into smaller sequential segments, which decomposes the task into parallel subtasks and reduces the depth of the search. By scanning the complete unannotated score, we can identify the local apex notes in the right hand and the local lowest notes in the left hand. These notes serve as segmentation positions that divide the score into smaller segments, and each segment is treated as an independent annotation task, as shown in Figure 10.
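As a rough sketch of the partitioning step, the helper below turns a list of apex positions into index ranges. The name `split_at_apexes` is hypothetical, and letting neighbouring segments share the apex note as a common boundary (with its finger fixed in both) is an assumption of this example rather than a detail stated above.

```python
from typing import List, Tuple


def split_at_apexes(n_notes: int, apex_indices: List[int]) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) index ranges that partition a note sequence.

    Neighbouring ranges share the apex note, whose finger is assumed fixed,
    so each range can be annotated without looking at the others.
    """
    if n_notes == 0:
        return []
    # Apexes at the very first or last note do not create a new segment.
    cuts = sorted({i for i in apex_indices if 0 < i < n_notes - 1})
    bounds = [0] + cuts + [n_notes - 1]
    return list(zip(bounds, bounds[1:]))
```

For example, a ten-note sequence with apex notes at positions 3 and 7 yields the ranges (0, 3), (3, 7), and (7, 9).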

Influence of Music Score Segmentation on Processing Speed
To measure the influence of music score segmentation on the speed of fingering annotation, we conducted a set of comparative experiments. We segmented the right-hand part of Bach's Invention No. 1 in C into six melody segments of roughly equal length. We then constructed an MDP for each melody segment and ran the reinforcement learning algorithm in parallel on six different CPU cores. In another group of experiments, we did not segment the music score and instead constructed a single MDP from the complete right-hand part, using the same CPU to execute the algorithm. Tables 4 and 5 record the performance of these experiments, averaged over multiple runs. Performance is poorest when a model-free reinforcement learning algorithm is used without any processing of the music score. Both using model-based algorithms and segmenting the music score for parallel processing improve performance. However, during the experiments we also found that, when parallelizing the reinforcement learning algorithm, it is not advisable to divide the sequence into very small segments, because the overhead of thread scheduling may exceed the algorithm's execution cost.
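The parallel setup can be illustrated with Python's standard `concurrent.futures` module. This is only a sketch under stated assumptions: `annotate_in_parallel` and `placeholder_solver` are hypothetical names, the placeholder stands in for the reinforcement learning solver and returns no real fingering, and the experiments above were not necessarily implemented this way.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from typing import Callable, List, Sequence

Segment = Sequence[int]   # MIDI pitches of one score segment (assumed format)
Fingering = List[int]     # one finger label per note


def placeholder_solver(segment: Segment) -> Fingering:
    # Stand-in for the per-segment RL solver; 0 means "not yet assigned".
    return [0] * len(segment)


def annotate_in_parallel(
    segments: Sequence[Segment],
    solver: Callable[[Segment], Fingering] = placeholder_solver,
    executor_cls=ProcessPoolExecutor,
    max_workers: int = 6,
) -> List[Fingering]:
    """Solve every segment independently and return results in score order."""
    with executor_cls(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with the segments.
        return list(pool.map(solver, segments))
```

With the default process pool, the solver must be a module-level function and the call should be made under an `if __name__ == "__main__":` guard. Passing `executor_cls=ThreadPoolExecutor` is convenient for quick checks, although threads do not speed up CPU-bound Python code. Every submitted segment carries a fixed scheduling cost, which is consistent with the observation that very small segments are not worthwhile.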

Conclusions
In this study, we continued the exploration of rule-based automatic fingering annotation by using model-based reinforcement learning algorithms to find fingerings with minimal motion. We constructed a more realistic environment model, improved sampling efficiency through model-based reinforcement learning, and proposed a method for segmenting long music score sequences. Our method achieves good results in eliminating physically impossible fingerings and reducing the amount of finger motion required in piano performance.
Since our algorithm is built upon a simulation model of the performance process and generates physically playable fingerings, it is particularly suitable for future application in piano performance by humanoid robots. The constraint methods we propose can also be applied in statistical learning approaches.
However, we must acknowledge that a purely rule-based fingering annotation method is not the ultimate solution. Firstly, the parameters in the reward function are difficult to determine directly and require repeated trial and adjustment. Secondly, the intricate patterns of hand movements are difficult to describe comprehensively with detailed rules, making it challenging to fully characterize the problem. Thirdly, because our reinforcement learning algorithm must optimize multiple objectives simultaneously (such as reducing finger stretches and movements at the same time), it is difficult to assign clear priorities to each optimization objective in all states, which leads to conflicts between rules.
We believe that a combination of rule-based and data-driven approaches will be the direction of future work, and that a machine learning system with true hard constraints is needed to effectively improve the quality of fingerings. In addition, modern end-to-end constrained optimization learning is also a solution worth exploring. However, balancing runtime and accuracy will be a challenge.
Moreover, considering that the classical music in PIG has entered the public domain, we hope that subsequent research can establish an open-source fingering dataset so that researchers can collaborate to correct errors in the dataset and improve the quality of fingering annotations for data-driven methods. We also anticipate that future datasets can adopt more common formats and include metadata such as tempo and time signature. Additionally, playing techniques such as ornaments and staccato could also be marked.
Author Contributions: W.G. designed and conceived the experiments; N.Z. provided professional piano fingering suggestions; X.X. parsed the musical score data; W.G. and X.X. performed the experiments; W.G. wrote the paper; S.Z., Z.S. and K.S. reviewed the paper and gave some suggestions for improvement. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by the National Natural Science Foundation of China, grant number 62262043, and the Science and Technology Project of Jiangxi Province Education Department, grant number GJJ170575.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.
Data Availability Statement: The PIG dataset is available at https://beam.kisarazu.ac.jp/~saito/research/PianoFingeringDataset/ (accessed on 20 August 2023). As part of our research, we transcribed the scores of commercially copyrighted piano music for analysis and investigation. Due to the proprietary nature of the transcribed copyrighted music, we are unable to provide public access to these materials.

Figure 1. Processing steps of automatic piano fingering annotation.

Figure 2. White key distance and hand feature matrix. The distance d_k between keys can be obtained by subtracting the numbers on the keyboard. The numbers in the matrix M_f represent the maximum stretchable distance for the hands.

Figure 4. Quantification of motion in fingering transfer. The blue arrows represent the transitions of fingerings. The numbers on the keys represent fingering labels, while the numbers enclosed in circles represent the fingering at the previous moment.

Figure 5. Traditional tabular Q-learning and key-value storage Q-learning.

Figure 6. Comparison between the model-free and model-based RL methods.

Figure 7. The red line represents the reinforcement learning fingering result, while the blue line represents the statistical learning result. Lower fluctuations in the line imply greater finger stretches. The green blocks below represent the MIDI visualization of the melody.

Figure 8. Comparison of human-annotated fingering and algorithm-annotated fingering. The numbers on the musical notes indicate fingering labels.

Figure 9. Local apex note. The red arrow represents the current local apex note. The blue arrow indicates a gradually expanding local range. The numbers represent the count of expansions.

Figure 10. The fingering annotation tasks between two local apex notes (red squares) can be considered independent.

Algorithm 1. Invalid action masking for fingering.
procedure MASK(notes)
    all_fingerings ← enumerate all potential fingerings by note count
    if length(notes) == 1 then    ▷ Only a single note at the next time step
        valid_fingerings ← remove thumbless crossing, hop, invalid cross fingering, invalid expansion and contraction
    else    ▷ More than one note at the next time step
        for fingering ∈ all_fingerings do
            note_finger_pairs ← pair(fingering, notes)
            for finger_combine ∈ combination(note_finger_pairs, 2) do
                if invalid expansion and contraction in this finger_combine then

Table 1. The motion metrics required for different types of fingering transitions.

Algorithm 2. Prioritized sweeping (key-value storage).
procedure PRIORITIZED SWEEPING
    Initialize empty hash table Q(s, a), Model(s, a) and empty priority queue PQueue.
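To make the data structures in Algorithm 2 concrete, the following is a minimal sketch of textbook prioritized sweeping for a deterministic environment, with Q(s, a) and Model(s, a) stored in hash tables (Python dictionaries) and PQueue as a binary heap. It is not the implementation evaluated in this paper: the function signature, the hyperparameter defaults, and the five-state toy chain are assumptions made purely for illustration, and the toy chain is unrelated to the piano environment.

```python
import heapq
import random
from collections import defaultdict


def prioritized_sweeping(start, actions, step, episodes=50, n_planning=20,
                         alpha=1.0, gamma=0.95, epsilon=0.1, theta=1e-4, seed=0):
    """Textbook prioritized sweeping; assumes a deterministic, episodic task."""
    rng = random.Random(seed)
    Q = defaultdict(float)            # hash table: (s, a) -> value
    model = {}                        # hash table: (s, a) -> (reward, next state, done)
    predecessors = defaultdict(set)   # next state -> {(s, a)} known to lead to it
    pqueue, counter = [], 0           # heap of (-priority, tie-breaker, (s, a))

    def target(r, s2, done):
        return r if done else r + gamma * max(Q[(s2, b)] for b in actions(s2))

    def push(s, a):
        nonlocal counter
        priority = abs(target(*model[(s, a)]) - Q[(s, a)])
        if priority > theta:
            heapq.heappush(pqueue, (-priority, counter, (s, a)))
            counter += 1

    for _ in range(episodes):
        s, done = start, False
        while not done:
            acts = actions(s)
            if rng.random() < epsilon:
                a = rng.choice(acts)                       # explore
            else:
                a = max(acts, key=lambda b: Q[(s, b)])     # exploit
            s2, r, done = step(s, a)
            model[(s, a)] = (r, s2, done)                  # learn the model
            predecessors[s2].add((s, a))
            push(s, a)
            for _ in range(n_planning):                    # planning from the model
                if not pqueue:
                    break
                _, _, (ps, pa) = heapq.heappop(pqueue)
                Q[(ps, pa)] += alpha * (target(*model[(ps, pa)]) - Q[(ps, pa)])
                for qs, qa in predecessors[ps]:            # sweep backwards
                    push(qs, qa)
            s = s2
    return Q


# Toy chain 0..4 (illustrative only): each move costs 1, state 4 is terminal.
def chain_actions(s):
    return [-1, +1]


def chain_step(s, a):
    s2 = min(max(s + a, 0), 4)
    return s2, -1.0, s2 == 4
```

Calling `prioritized_sweeping(0, chain_actions, chain_step)` returns the dictionary of learned values. Only visited state-action pairs are ever stored, which is the motivation for key-value storage when the state space is too large to tabulate.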

Table 3. Experimental results. HMM represents a previous data-driven statistical learning method. DQN is a model-free reinforcement learning algorithm, implemented based on a simpler environmental model. Prioritized sweeping is the model-based reinforcement learning algorithm utilized in this research.
1 This implementation cannot label chords.

Table 4. Model-free method (Q-learning) execution time and memory usage.

Table 5. Model-based method (prioritized sweeping) execution time and memory usage.