Proceeding Paper

Design and Control of a 32-DoF Robot for Music Performance Using AI and Motion Planning †

by Ilie Indreica, Mihnea Dimitrie Doloiu, Ioan-Alexandru Spulber, Gigel Măceșanu, Bogdan Sibișan and Tiberiu-Teodor Cociaș *

Faculty of Electrical Engineering and Computer Science, Transilvania University of Brasov, 500036 Brașov, Romania

* Author to whom correspondence should be addressed.
Presented at the Sustainable Mobility and Transportation Symposium 2025, Győr, Hungary, 16–18 October 2025.
Eng. Proc. 2025, 113(1), 53; https://doi.org/10.3390/engproc2025113053
Published: 11 November 2025
(This article belongs to the Proceedings of The Sustainable Mobility and Transportation Symposium 2025)

Abstract

This paper presents the development of a 32-degree-of-freedom (DoF) humanoid robotic system designed for autonomous piano performance. The system integrates a vision-based music sheet reader with a YOLOv8 neural network for real-time detection and classification of musical symbols, achieving a mean average precision (mAP) of 96% at IoU 0.5. A heuristic-based synchronization and motion planning module computes optimal finger trajectories and hand placements, enabling expressive and temporally accurate performances. The robotic hardware comprises two anthropomorphic hands mounted on linear rails, each with independently actuated fingers capable of vertical, horizontal, and rotational movements. Experimental validation demonstrates the system’s ability to execute complex musical passages with precision and synchronization. Limitations related to dynamic expressiveness and symbol generalization are discussed, along with proposed enhancements for future iterations. The results highlight the potential of AI-driven robotic systems in musical applications and contribute to the broader field of intelligent robotic performance.

1. Introduction

The real-time recognition of printed music notation using computer vision and deep learning generally addresses two core aspects: optical music recognition and real-time symbol detection. In Ref. [1], the authors present an end-to-end deep convolutional neural network model based on the Darknet-53 backbone (from YOLO [2]) for music symbol detection. The use of Long Short-Term Memory (LSTM) Recurrent Neural Networks (RNNs) for sequential music score reading and contextual retention is discussed in Ref. [3]. Similarly, Ref. [4] introduces a Connectionist Temporal Classification loss function to improve the system’s ability to generalize across various types of monophonic sheet music. Since RNNs compute the loss function in a serial manner, often resulting in low training efficiency and convergence difficulties, Ref. [5] proposes a sequence-to-sequence framework based on a transformer architecture with a masked language model, which yields improved accuracy.
The computation of optimal finger positions and movements has become an increasingly prominent topic in robotics. For instance, the robotic marimba player “Shimon” employs Brushless Direct Current (BLDC) motors to enhance both speed and dynamic range. This system achieves a performance level comparable to that of human musicians and surpasses solenoid-based systems in striking speed, thereby enabling more expressive robotic musical performances [6]. In Ref. [7], the authors introduce a real-time motion planning approach for multi-fingered robotic hands operating in constrained environments. Their method uses neural networks to model collision-free spaces, facilitating dynamic obstacle avoidance and improving dexterity in in-hand manipulation tasks. A comprehensive review of motion planning algorithms is provided in Refs. [8,9], offering insights into the performance, strengths, and limitations of various planners, along with guidance for selecting appropriate algorithms based on specific application requirements.
Controller-driven robotic actuators have advanced the field of robotic musical performance by addressing the challenges of precise actuation and expressive control. For example, Ghost Play [10] is designed to emulate human violin performances using seven electromagnetic linear actuators: three for bowing and four for fingering. Another implementation of Shimon, which also utilizes BLDC motors for enhanced speed, is described in Ref. [6]. The development of anthropomorphic robots capable of playing the flute and saxophone, which mimic human physiology and control mechanisms, is discussed in Ref. [11].
Microcontroller-based implementations for piano-playing robots, which focus on driving actuators for precise key presses and dynamic articulation, are introduced in Refs. [12,13,14]. A more advanced platform is described in Ref. [15], where an Arduino Mega board is employed. These projects collectively demonstrate the versatility of microcontrollers such as Arduino in automating piano performances.
Recent OMR frameworks such as Audiveris, OpenOMR, and Refs. [16,17,18] offer robust pipelines for symbol recognition and semantic reconstruction. In the context of robotic music performance, advanced systems like the two-hand robot [19] represent the frontier of human-like articulation. Moreover, the broader field of Music Information Retrieval (MIR) [20] explores expressive performance modeling, audio-to-score alignment, and emotion-aware playback, areas that align closely with the goals of our robotic system.
This work contributes to the field of intelligent robotic performance by proposing a fully integrated end-to-end system that combines computer vision, AI-driven music interpretation, motion planning, and mechanical actuation. Unlike previous systems that addressed individual components (e.g., OMR or actuation), our pipeline translates visual sheet music into executable trajectories with temporal synchronization and mechanical accuracy. The novelty lies in the real-time heuristic motion planner that dynamically groups notes and optimally assigns fingers using a cost-based formulation. Additionally, we present a modular 32-DoF robotic hand design capable of three-dimensional motion with formal latency and precision validation.

2. Materials and Methods

Robotic systems that integrate computer vision and machine learning techniques have demonstrated significant potential across a wide range of applications, from autonomous navigation to human–computer interaction. In this study, we propose a novel approach to musical performance automation: a robotic platform capable of interpreting and executing musical scores through image-based analysis and precise actuation.
The developed robotic system integrates advanced computer vision techniques, artificial intelligence-driven algorithms, and precise robotic actuation into a structured workflow, as illustrated in Figure 1. The workflow is divided into distinct processing modules. Initially, the system performs image acquisition by capturing a high-resolution image of the musical sheet. This captured image serves as the input for subsequent modules.
Following acquisition, the image-processing module employs a combination of sophisticated techniques. A YOLO-based neural network detects and classifies musical symbols such as notes, rests, clefs, and accidentals. In parallel, advanced image processing algorithms are utilized to remove staff lines and accurately identify the note positions on the processed sheet music.
Subsequently, a dedicated finger position control module calculates optimal finger placements and movement trajectories. This module employs heuristic algorithms to optimize both finger and hand motions while synchronizing the temporal sequence of the performance. The motion planning strategy incorporates dynamic positioning and collision avoidance to ensure both precision and efficiency.
Finally, the computed control commands from the finger position control module are transmitted to a microcontroller that drives the robotic actuators, enabling precise key presses and dynamic articulations, ultimately resulting in a coherent, expressive, and realistic musical performance.

2.1. The Image Processing Module

The accurate representation of musical notes requires the determination of two critical attributes: duration and pitch. To achieve this, the image processing module integrates advanced detection techniques, employing a YOLO-based neural network for symbol detection and image processing algorithms for pitch estimation [15].

2.1.1. AI Symbol Detection

The YOLOv8 neural network model was utilized to detect musical symbols within the sheet music images. The model was trained on a comprehensive dataset obtained from publicly available sources, specifically selected to include relevant and frequently represented musical notation classes. The resulting model identifies 14 distinct classes, encompassing note durations, rests, clefs, and accidentals. The performance of the model was evaluated using the mean Average Precision (mAP), reaching 96% at an Intersection over Union (IoU) threshold of 0.5 (mAP50) and 63% across multiple IoU thresholds ranging from 0.5 to 0.95 (mAP50–95). Some results can be seen in Figure 2.
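For readers who want a concrete starting point, the snippet below sketches how a YOLOv8 symbol detector of this kind can be trained and queried with the Ultralytics Python API. The dataset configuration file, pretrained checkpoint, and hyperparameters are illustrative assumptions, not the exact setup used in this work.

```python
# Minimal sketch: training and querying a YOLOv8 symbol detector with the
# Ultralytics API. The dataset YAML, image path, and hyperparameters are
# illustrative placeholders, not the configuration used by the authors.
from ultralytics import YOLO

# Start from a pretrained checkpoint and fine-tune on the notation dataset.
model = YOLO("yolov8n.pt")
model.train(data="music_symbols.yaml",  # assumed dataset config with 14 classes
            epochs=100, imgsz=640)

# Validation reports mAP50 and mAP50-95, the metrics quoted in the text.
metrics = model.val()
print(metrics.box.map50, metrics.box.map)

# Inference on a scanned sheet: each box carries a class id and confidence.
results = model("sheet_page.png")
for box in results[0].boxes:
    cls_id = int(box.cls)
    conf = float(box.conf)
    x1, y1, x2, y2 = box.xyxy[0].tolist()
    print(results[0].names[cls_id], conf, (x1, y1, x2, y2))
```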

2.1.2. Image Processing-Based Pitch Detection

The pitch of a note is determined by the relative position of its notehead on the staff lines; therefore, accurately locating the center of the notehead is a critical step. However, for the improved detection of noteheads, it is necessary to first remove the staff lines. To achieve this, a row-wise histogram is computed by counting the number of black pixels in each row of the inverted binary image, in which black pixels are represented by a value of 1 and white pixels by 0. This ensures that the histogram reflects the density of black pixels per row. The histogram H(y) is computed as follows:

H(y) = \sum_{x=1}^{W} I(x, y),

where H(y) denotes the histogram value at row y, W is the width of the image, and I(x, y) represents the binary pixel intensity at coordinates (x, y) in the inverted image. The peaks in the histogram correspond to the positions of staff lines (see Figure 3). Only groups of five closely spaced peaks are considered as valid staff candidates.
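A minimal sketch of this step, assuming an OpenCV/NumPy pipeline, is shown below: the row histogram H(y) is obtained by summing the inverted binary image along each row, and rows whose ink count approaches the image width are grouped into runs of five nearly equidistant peaks. The binarization method and spacing tolerance are assumptions, not the tuned values of the system.

```python
# Sketch of the row-wise histogram H(y) and the staff-line candidate search.
# The Otsu binarization and the peak-spacing tolerance are assumed choices.
import numpy as np
import cv2
from scipy.signal import find_peaks

def staff_line_rows(gray, spacing_tol=3):
    # Invert and binarize so ink pixels become 1 and background 0.
    _, binary = cv2.threshold(gray, 0, 1, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # H(y) = sum over x of I(x, y): count of ink pixels in each row.
    hist = binary.sum(axis=1)
    # Staff lines show up as rows whose ink count approaches the image width.
    peaks, _ = find_peaks(hist, height=0.5 * binary.shape[1])
    # Keep only groups of five nearly equidistant peaks (one staff).
    staves = []
    for start in range(len(peaks) - 4):
        group = peaks[start:start + 5]
        gaps = np.diff(group)
        if np.all(np.abs(gaps - gaps.mean()) <= spacing_tol):
            staves.append(group.tolist())
    return staves
```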
The vertical positions (y-coordinates) of the detected staff lines are retained for subsequent processing. Staff line removal is performed by detecting and eliminating short vertical runs of black pixels. Following the removal of staff lines, noteheads are detected using a structured image processing pipeline. A copy of the image is prepared to address potential fragmentation of noteheads caused during line removal. Morphological closing with an elliptical structuring element is applied to reconnect fragmented components, followed by horizontal filtering and hole filling to restore complete notehead shapes. Morphological opening is then used to enhance elliptical features.
From the resulting shapes, only those with a circularity above a certain threshold are retained. Circularity is computed according to the following equation:
\mathrm{Circularity} = \frac{4\pi \times \mathrm{Area}}{\mathrm{Perimeter}^{2}}.
Finally, contours corresponding to elliptical structures are extracted and merged, and their centroids are computed and sorted to precisely determine the notehead positions.
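The sketch below illustrates this notehead stage under the same assumptions (OpenCV, an elliptical structuring element, a circularity threshold of 0.6); the kernel size and threshold are placeholders rather than the values tuned for the system.

```python
# Sketch of the notehead pipeline after staff-line removal: morphological
# closing/opening with an elliptical kernel, then a circularity filter on
# the remaining contours. Kernel size and threshold are assumptions.
import cv2
import numpy as np

def notehead_centroids(no_staff_binary, circularity_thresh=0.6):
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    # Closing reconnects noteheads fragmented by staff-line removal;
    # opening then suppresses thin stems and leaves elliptical blobs.
    closed = cv2.morphologyEx(no_staff_binary, cv2.MORPH_CLOSE, kernel)
    opened = cv2.morphologyEx(closed, cv2.MORPH_OPEN, kernel)

    contours, _ = cv2.findContours(opened, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    centroids = []
    for c in contours:
        area = cv2.contourArea(c)
        perimeter = cv2.arcLength(c, True)
        if perimeter == 0:
            continue
        circularity = 4 * np.pi * area / (perimeter ** 2)
        if circularity >= circularity_thresh:
            m = cv2.moments(c)
            if m["m00"] > 0:
                centroids.append((m["m10"] / m["m00"], m["m01"] / m["m00"]))
    # Sort left to right so the centroids follow reading order.
    return sorted(centroids)
```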
Bar lines are identified as vertical structures with consistent height and alignment using rectangular filters. Unlike note stems or flags, they maintain uniform vertical positioning. Bar lines are essential for segmenting music into measures, enabling accurate temporal synchronization, especially in multi-hand performances.

2.2. Finger Commands and Synchronization Module

The finger command and synchronization module is responsible for calculating the precise timing, finger movements, and synchronization required for musically accurate robotic performances. Utilizing the detected musical information (pitch and duration), heuristic algorithms first establish a coherent temporal sequence, defining exact timings for note initiation and duration.
A mathematical framework synchronizes each note’s execution timing, represented as
T_i = T_{i-1} + D_{i-1},

where T_i is the start time of the current note, T_{i-1} is the start time of the preceding note, and D_{i-1} is the duration of the preceding note.
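In code, this recurrence is simply a cumulative sum of the note durations, as in the short illustration below (the duration values are invented examples).

```python
# Sketch: onset times from note durations, T_i = T_{i-1} + D_{i-1}.
# Durations are in seconds; the example values are illustrative only.
import numpy as np

durations = np.array([0.5, 0.5, 1.0, 0.25])                  # D_0 .. D_3
onsets = np.concatenate(([0.0], np.cumsum(durations)[:-1]))  # T_0 .. T_3
print(onsets)  # [0.  0.5 1.  2. ]
```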
Once the temporal sequence is established, the module calculates optimal finger positioning and hand placements. The algorithm is based on the central hypothesis that notes group together within the span of a single hand. As a result, the subsequent positions of the hand are calculated as
P_{\mathrm{hand}}^{k} = \operatorname{median}\{P_i \mid i \in G_k\},

where P_hand^k denotes the optimal position of the hand for the k-th group of notes, calculated as the median of the note positions P_i within the group G_k. A new group is formed whenever a note exceeds the reach of the hand. The grouping process ensures that hand movements are optimized while avoiding collisions between hands.
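A possible implementation of this grouping heuristic is sketched below; the hand-span limit of eight keys is an assumed parameter, not the value used by the system.

```python
# Sketch of the note-grouping heuristic: notes are appended to the current
# group while they stay within the assumed hand span; the hand target for
# each group is the median of its note positions.
import numpy as np

def group_notes(note_positions, hand_span=8):
    """note_positions: key indices in playing order."""
    groups, current = [], [note_positions[0]]
    for p in note_positions[1:]:
        if max(current + [p]) - min(current + [p]) <= hand_span:
            current.append(p)
        else:
            groups.append(current)
            current = [p]
    groups.append(current)
    # One hand target per group: the median note position of that group.
    hand_targets = [float(np.median(g)) for g in groups]
    return groups, hand_targets

groups, targets = group_notes([40, 42, 44, 43, 55, 57, 56])
print(groups, targets)  # [[40, 42, 44, 43], [55, 57, 56]] [42.5, 56.0]
```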
Finger assignment is posed as an optimization problem, where the goal is to minimize the total cost of assigning fingers to notes based on reachability, distance, and collision constraints. This is solved using the Jonker–Volgenant algorithm, applied to a cost matrix C = [c_{ij}], where c_{ij} represents the cost of assigning finger i to note j. The optimization objective is

\min_{x_{ij}} \sum_{i=1}^{F} \sum_{j=1}^{N} c_{ij} x_{ij},

where F is the number of available fingers, N is the number of notes in the current group, and x_{ij} = 1 if finger i is assigned to note j and 0 otherwise.
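In practice this assignment can be computed with SciPy's linear_sum_assignment, which implements a modified Jonker–Volgenant algorithm. The cost function in the sketch below (absolute key distance plus a large penalty for keys outside the assumed reach) is an illustrative choice, not the exact cost formulation used here.

```python
# Sketch of the finger-to-note assignment via a cost matrix C = [c_ij].
# The distance-plus-penalty cost is an assumed, illustrative formulation.
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_fingers(finger_positions, note_positions, reach=2.0):
    F, N = len(finger_positions), len(note_positions)
    cost = np.zeros((F, N))
    for i, f in enumerate(finger_positions):
        for j, n in enumerate(note_positions):
            dist = abs(f - n)
            cost[i, j] = dist if dist <= reach else dist + 1e3  # penalize unreachable keys
    rows, cols = linear_sum_assignment(cost)  # minimizes sum of c_ij * x_ij
    return {int(cols[k]): int(rows[k]) for k in range(len(rows))}  # note -> finger

# Five fingers at key offsets 0..4, three notes in the current group.
print(assign_fingers([0, 1, 2, 3, 4], [0.5, 2.5, 4.0]))
```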
The computed trajectories are converted into commands sent to the microcontroller, where a state machine manages transitions between phases such as pressing, holding, releasing, and repositioning. This structured control ensures coordinated and deterministic hand and finger movements, enabling accurate and expressive playback.
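The control firmware itself runs on the microcontroller, but the phase logic can be illustrated as a small state machine, as in the Python sketch below; the state names and transition conditions are assumptions about the actual controller, not its firmware.

```python
# Illustration (in Python, not firmware) of the per-finger phase logic the
# text describes: press, hold, release, reposition. State names and the
# duration-driven transitions are assumed, not taken from the controller.
from enum import Enum, auto

class Phase(Enum):
    IDLE = auto()
    PRESS = auto()
    HOLD = auto()
    RELEASE = auto()
    REPOSITION = auto()

def next_phase(phase, t, note_on, note_off, move_done):
    if phase is Phase.IDLE and t >= note_on:
        return Phase.PRESS
    if phase is Phase.PRESS:
        return Phase.HOLD            # solenoid energized, key is down
    if phase is Phase.HOLD and t >= note_off:
        return Phase.RELEASE         # note duration has elapsed
    if phase is Phase.RELEASE:
        return Phase.REPOSITION      # free the key, then allow rail/servo moves
    if phase is Phase.REPOSITION and move_done:
        return Phase.IDLE
    return phase
```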

2.3. Robotic Hardware Design

The robotic system comprises two anthropomorphic hands, each designed to replicate the human dexterity required for nuanced piano performance. Each hand consists of five fingers, with each finger providing three distinct degrees of freedom (DoFs). Mathematically, the system’s degrees of freedom can be expressed as
\text{Total DoFs} = 2\ \text{(hands)} \times 5\ \text{(fingers)} \times 3\ \text{(DoFs per finger)} + 2\ \text{(linear rails)} = 32\ \text{DoFs}.
Each finger’s three degrees of freedom encompass vertical displacement for key pressing, horizontal extension to reach black keys (Figure 4a), and lateral rotation for positioning on adjacent keys (Figure 4b). Vertical and horizontal movements are actuated through high-precision solenoids operating at 24 V and drawing 0.75 A, ensuring robust and responsive actuation. Lateral rotational movements are controlled by servo motors, providing accurate angular positioning critical for precise key selection.
Each robotic hand is mounted on a belt-driven linear rail system, allowing for translational movement along the full range of the keyboard. This motion is facilitated by stepper motors, providing precise positioning of the hands across the piano range. All actuators, including the solenoids, servos, and stepper motors, are controlled by a microcontroller.

3. Results

The robotic system was evaluated using digitally typeset Western music scores under controlled conditions to assess its performance across three primary dimensions: symbol recognition, motion planning, and mechanical execution. The image processing module, powered by a YOLOv8 neural network, demonstrated high accuracy in detecting musical symbols. The model achieved a mean Average Precision of 96% at an Intersection over Union threshold of 0.5 and 63% across a broader range of thresholds from 0.5 to 0.95. These results indicate strong performance in recognizing common musical elements such as note durations, rests, clefs, and accidentals. However, the model’s accuracy declined when encountering rare or complex symbols, a limitation attributed to the underrepresentation of such classes in the training dataset. The confusion matrix in Figure 5 reveals that misclassifications were most frequent among visually similar note types, underscoring the need for more diverse training data.
The motion-planning module was verified using simulated playback of known musical pieces. The output timing sequences and finger-to-note assignments were compared against manually annotated ground truth data derived from MIDI files corresponding to the same score. Timing accuracy was evaluated by measuring deviations in note onset times, which consistently remained within a ±60 ms window, sufficient for perceptual musical coherence.
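A simple version of this timing check is sketched below: planned onsets are compared element-wise against reference onsets and flagged when the deviation exceeds 60 ms. The arrays are illustrative; extraction of the reference onsets from the MIDI file is not shown.

```python
# Sketch of the onset-timing check: compare planned onsets with MIDI-derived
# ground truth and count deviations within a +/-60 ms window.
# The two arrays below are illustrative example values.
import numpy as np

planned = np.array([0.000, 0.512, 1.030, 2.041])     # planner output (s)
reference = np.array([0.000, 0.500, 1.000, 2.000])   # from the reference MIDI (s)

deviation_ms = np.abs(planned - reference) * 1000.0
within_window = deviation_ms <= 60.0
print(deviation_ms, within_window.mean())  # fraction of onsets inside +/-60 ms
```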
For the mechanical execution module, a structured test suite was designed to assess repeatability, positional accuracy, and response time. Key positions across the full keyboard range were targeted sequentially, and finger actuation latency was measured using a microcontroller-based timing mechanism. Specifically, external interrupts were used to capture the time interval between command issuance and key depression, as detected by contact sensors mounted beneath the keys. The measured average actuation delay was 85 ms, with a standard deviation of 12 ms. Mechanical repeatability tests showed less than 2 mm positional deviation over 20 trials per finger. These evaluations confirm the system’s ability to generate reliable and precise motions for real-time piano performances.

4. Discussion

The results of this study highlight the potential of integrating computer vision, artificial intelligence, and robotic actuation in the domain of autonomous musical performance. The system’s modular architecture allowed for the independent development and optimization of each subsystem, contributing to its overall robustness and adaptability. The high accuracy of the symbol detection module, particularly for standard notation, validates the use of YOLO-based neural networks in real-time music interpretation. The motion planning algorithms, while heuristic in nature, proved to be effective in generating synchronized and efficient finger trajectories, enabling the robot to perform musical passages with a degree of fluency and coordination.
While the hardware limits dynamic key force, it must also be noted that expressive performance elements—such as dynamics (e.g., forte/piano), articulation (e.g., staccato, legato), and tempo variation—are not yet extracted during the OMR phase. Therefore, the current pipeline performs a rhythmically accurate but expressively neutral playback. Future work will involve enhancing the symbol detection module to recognize expressive annotations and extending the motion planner to include dynamic key velocity and tempo control, thus improving musicality beyond structural accuracy.
While generally effective, the motion planning module does not guarantee globally optimal solutions. Rapid note sequences and chromatic runs sometimes led to suboptimal finger assignments and inefficient hand movements. Occasional desynchronization between hands, e.g., during horizontal translations, further underscores the need for more advanced algorithms that consider execution latency and anticipate future movements.
Future work should address the simplifying assumptions of the current system, such as reliance on digitally typeset scores, fixed clefs and tempo, and the omission of ornaments and extended techniques. While these constraints eased implementation, they limit real-world applicability. Enhancing the system to handle handwritten scores, dynamic tempo, and expressive articulations would greatly improve its versatility.
Several improvement opportunities were identified during development. Replacing solenoids with proportional actuators would enable dynamic key control for enhanced expressiveness. Expanding the training dataset would improve recognition accuracy, while incorporating predictive models or reinforcement learning could yield more adaptive and efficient motion planning, especially for fast or complex passages.

5. Conclusions

The developed 32-DoF robotic system successfully demonstrated its ability to autonomously interpret and perform piano music on selected pieces, based on quantitative and qualitative evaluations. The system achieved timing deviations within ±60 ms and classification accuracy above 96% for standard symbols.
The YOLOv8-based model showed high accuracy in recognizing standard notation, while the heuristic motion planner produced coherent, synchronized finger trajectories. Experimental validation confirmed its reliable timing and mechanical precision in executing complex passages. Despite some limitations in expressiveness and symbol generalization, the system effectively delivers structured and musically coherent performances.

Author Contributions

Conceptualization, T.-T.C.; methodology, I.I.; software, I.I.; validation, I.-A.S. and M.D.D.; formal analysis, G.M.; investigation, G.M.; writing—original draft preparation, B.S.; writing—review and editing, B.S.; supervision, T.-T.C.; project administration, T.-T.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Huang, Z.; Jia, X.; Guo, Y. State-of-the-Art Model for Music Object Recognition with Deep Learning. Appl. Sci. 2019, 9, 2645. [Google Scholar] [CrossRef]
  2. Rejin, M.; Sambath, M. YOLOv8: A Novel Object Detection Algorithm with Enhanced Performance and Robustness. In Proceedings of the 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), Chennai, India, 6–7 March 2024. [Google Scholar] [CrossRef]
  3. Baró, A.; Riba, P.; Calvo-Zaragoza, J.; Fornés, A. Optical Music Recognition by Recurrent Neural Networks. In Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 13–15 November 2017; Volume 2, pp. 25–26. [Google Scholar]
  4. Calvo-Zaragoza, J.; Rizo, D. End-to-End Neural Optical Music Recognition of Monophonic Scores. Appl. Sci. 2018, 8, 606. [Google Scholar] [CrossRef]
  5. Wen, C.; Zhu, L. A Sequence-to-Sequence Framework Based on Transformer With Masked Language Model for Optical Music Recognition. IEEE Access 2022, 10, 118243–118252. [Google Scholar] [CrossRef]
  6. Yang, N.; Savery, R.; Sankaranarayanan, R.; Zahray, L.; Weinberg, G. Mechatronics-Driven Musical Expressivity for Robotic Percussionists. In Proceedings of the International Conference on New Interfaces for Musical Expression (NIME-20), Birmingham, UK, 21–25 July 2020; pp. 133–138. [Google Scholar] [CrossRef]
  7. Gao, X.; Yao, K.; Khadivar, F.; Billard, A. Enhancing Dexterity in Confined Spaces: Real-Time Motion Planning for Multifingered In-Hand Manipulation. IEEE Robot. Autom. Mag. 2024, 31, 100–112. [Google Scholar] [CrossRef]
  8. Zhou, C.; Huang, B.; Fränti, P. A Review of Motion Planning Algorithms for Intelligent Robots. J. Intell. Manuf. 2022, 33, 387–424. [Google Scholar] [CrossRef]
  9. Orthey, A.; Chamzas, C.; Kavraki, L.E. Sampling-Based Motion Planning: A Comparative Review. Annu. Rev. Control Robot. Auton. Syst. 2024, 7, 285–310. [Google Scholar] [CrossRef]
  10. Kamatani, T.; Sato, Y.; Fujino, M. Ghost Play—A Violin-Playing Robot Using Electromagnetic Linear Actuators. In Proceedings of the International Conference on New Interfaces for Musical Expression, Auckland, New Zealand, 28 June–1 July 2022. [Google Scholar]
  11. Solis, J.; Ozawa, K.; Takeuchi, M.; Kusano, T.; Ishikawa, S.; Petersen, K.; Takanishi, A. Biologically-Inspired Control Architecture for Musical Performance Robots. Int. J. Adv. Robot. Syst. 2014, 11, 172. [Google Scholar] [CrossRef]
  12. Day, L. Robot Pianist Runs on Arduino Nano. Available online: https://hackaday.com/2023/12/01/robot-pianist-runs-on-arduino-nano/ (accessed on 12 May 2025).
  13. Khaled, S. Build Log of Prima: The Piano Playing Robot. Available online: https://shayonkhaled.com/portfolio/prima-the-piano-playing-robot/ (accessed on 12 May 2025).
  14. Arduino Controlled Piano Robot: PiBot. Available online: https://www.hackster.io/MoKo9/arduino-controlled-piano-robot-pibot-641a06 (accessed on 12 May 2025).
  15. Kevins, G. Arduino Piano Player Robot. Available online: https://www.instructables.com/Arduino-Piano-Player-Robot/ (accessed on 12 May 2025).
  16. Audiveris OMR Project. Available online: https://github.com/Audiveris/audiveris (accessed on 12 May 2025).
  17. OpenOMR. Available online: https://github.com/greenjava/OpenOMR (accessed on 7 November 2025).
  18. Elezi, I.; Tuggener, L.; Pelillo, M.; Stadelmann, T. DeepScores and Deep Watershed Detection: Current state and open issues. arXiv 2018, arXiv:1810.05423. [Google Scholar] [CrossRef]
  19. Chin-Shyurng, F.; Tsai, C.-F.; Lin, Y.-W. Development of a Novel Two-Hand Playing Piano Robot. In Proceedings of the 4th International Conference of Control, Dynamic Systems, and Robotics (CDSR’17), Toronto, ON, Canada, 31 May–1 June 2017. [Google Scholar]
  20. Downie, J.S. Music Information Retrieval. Annu. Rev. Inf. Sci. Technol. 2003, 37, 295–340. [Google Scholar] [CrossRef]
Figure 1. System architecture overview.
Figure 2. Examples of detections using YOLO-based neural network; (a) simple example; (b) complex example.
Figure 3. Row histogram. (a) Input image; (b) histogram H(y) computed along the y image axis.
Figure 4. (a) Horizontal extension. (b) Lateral rotation.
Figure 5. Confusion matrix illustrating the detection accuracy of the YOLO model for different musical symbols. Values indicate the number of predictions per class, highlighting the classification performance and common misclassifications between closely related note types.
