Article

Precise Recognition of Gong-Che Score Characters Based on Deep Learning: Joint Optimization of YOLOv8m and SimAM/MSCAM

1 School of Computer Science, Xi’an Shiyou University, Xi’an 710065, China
2 Technical Research Institute, QinChuan Machine Tool & Tool Group Co., Ltd., Baoji 721000, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(14), 2802; https://doi.org/10.3390/electronics14142802
Submission received: 4 June 2025 / Revised: 10 July 2025 / Accepted: 10 July 2025 / Published: 11 July 2025
(This article belongs to the Special Issue New Trends in AI-Assisted Computer Vision)

Abstract

In the field of music notation recognition, technology for common systems such as staff notation has become quite mature, whereas techniques for traditional Chinese notation systems such as guqin tablature (jianzipu) and Kunqu opera gongchepu remain relatively underdeveloped. As an important carrier of China’s millennia-old musical culture, the digital preservation and inheritance of Kunqu opera’s Gong-che notation hold significant cultural and practical value. Addressing the unique characteristics of Gong-che notation, this study moves beyond the limitations of Western staff-notation recognition technologies. By constructing a deep learning model adapted to the morphology of Chinese character-style notation symbols, it provides technical support for an intelligent processing system for Chinese musical documents, thereby promoting the inheritance and innovative development of traditional music in the era of artificial intelligence. This paper constructs the LGRC2024 dataset (Gong-che notation based on Lilu Qu Pu) and employs data augmentation operations such as image translation, rotation, and noise injection to enhance dataset diversity. For the recognition of Gong-che notation, the YOLOv8 model was adopted, and the lightweight (n) and medium-weight (m) versions were compared; the better-performing YOLOv8m was selected as the base model. To further improve performance, SimAM, Triplet Attention, and the Multi-scale Convolutional Attention Module (MSCAM) were introduced to optimize the model. The experimental results show that recognition accuracy increased from 65.9% with YOLOv8n to 78.2% with the baseline YOLOv8m, and the improved models based on YOLOv8m achieved recognition accuracies of 80.4%, 81.8%, and 83.6%, respectively. Among them, the improved model with the MSCAM module demonstrated the best performance in all respects.

1. Introduction

Music scores, as important carriers for preserving, communicating, and inheriting music, record key musical information. However, paper scores are prone to damage and loss and are inconvenient to store and query, which restricts exchange and sharing among musicians and hinders the inheritance and development of musical culture. To enable computers to understand music scores, digitization has become a development trend. To produce digital scores more conveniently, optical music recognition (OMR), a technology analogous to optical character recognition (OCR), has been introduced: it reads optical score images directly, recognizes the musical symbols, and saves them in formats such as MusicXML or MIDI [1]. Compared with OCR, OMR is better suited to information with complex structure and to preserving the coherence of contextual symbol sequences.
The development of score recognition technology has promoted the inheritance and development of musical culture by enabling precious paper scores to be preserved digitally. Many traditional musical works survive only on paper and are vulnerable to damage and loss caused by time, environment, and other factors. Once converted into digital form, score information can be accessed anytime and anywhere through electronic devices and the Internet, which benefits music learners and performers and enables wider dissemination and sharing of musical works. Digital preservation thus effectively protects traditional musical works: as the precious cultural heritage of a nation, traditional music can be continuously inherited and appreciated by future generations through score recognition technology. The technology also allows students to access scores through electronic devices anytime and anywhere, improving learning efficiency and convenience and promoting the development of music education.
Within score recognition, the challenges posed by Chinese traditional scores, such as Gong-che notation, reduced-character notation, and half-character (半字谱) notation, are particularly significant. These scores have diverse forms and unique styles, and the paper originals may be distorted, damaged, or blurred; their different notation conventions confront recognition algorithms with more complex situations. To overcome these challenges, further research into algorithms that adapt to scores of different styles and forms is needed, filling the gap in the digitization of Chinese traditional notation. This will provide the methods and technical support required to inherit and promote Chinese traditional music and culture in the digital and networked era, thereby promoting the inheritance and development of Chinese traditional scores.

2. Related Work

The development of computer music has transformed the way musical activities are produced, making the digital preservation and dissemination of paper scores vitally important. Many scholars and institutions have studied music score recognition technology [2] and produced numerous results. However, owing to the unique structural and semantic attributes of music scores, most studies focus on staff notation and handwritten scores. In China, only a few professionals engaged in traditional music, together with personnel from some universities and research institutes, understand and use traditional Chinese notation methods. This has led to the limited application of these scores, the ineffective inheritance of a large number of traditional musical works, and a scarcity of research on their forms of expression. As a result, music score recognition faces numerous difficulties and obstacles across these fields.
With the development of computer technology, deep learning has been introduced into music score recognition. Traditional multi-stage methods are prone to error propagation, which degrades accuracy, so researchers have explored end-to-end recognition, in which a single model completes the entire process and avoids error propagation between stages. In the data preprocessing stage, Calvo-Zaragoza et al. [3] used a selectional autoencoder to learn the binarization of music score images, outperforming traditional methods, albeit with potential errors at the edges of foreground pixels. In the staff-removal stage, CNNs were used to treat staff detection as a classification task, surpassing traditional methods; the selectional autoencoder proposed by Antonio-Javier Gallego et al. [4] can remove staff lines and process grayscale images. In the note detection and recognition stage, Alexander Pacha et al. [5] fine-tuned a pre-trained Faster R-CNN on the MUSCIMA++ dataset, achieving an mAP of 80%, but only for staff scores cropped into single lines.
In end-to-end music score recognition, researchers mainly adopt two approaches: object detection and sequence recognition. Object detection first locates notes and then classifies them [6]. Jan Hajič jr. et al. [7] used a U-Net architecture for note segmentation, combined with a connected-component detector to recognize note heads; Olaf Ronneberger et al. [8] proposed a deep learning model combining contracting and expanding paths to enable training from few samples; Lukas Tuggener et al. [9] combined a deep watershed detector with bounding-box detection, but with unsatisfactory accuracy for uncommon notes. The sequence recognition approach treats score images as sequences and uses recurrent neural networks to model and predict note results. Eelco van der Wel et al. [10] were the first to use a CNN with a seq2seq model to recognize single-voice scores; Jorge Calvo-Zaragoza et al. [11] used a CRNN structure to recognize printed scores; Arnau Baró et al. [12] used CNNs to study handwritten scores; Antonio Ríos-Vila et al. [13] researched neural-network-based OMR methods; Francisco J. Castellanos et al. [14] combined staff-region extraction with symbol-sequence recognition to improve efficiency; Sachinda Edirisooriya et al. [15] introduced a process for generating large-scale OMR datasets and two decoding strategies; Maria Alfaro-Contreras et al. [16] proposed an end-to-end neural network model using score shape and height information; Carlos Garrido-Munoz et al. [17] proposed a new image-to-graph model. Chinese scholars have also contributed, such as a low-quality score recognition algorithm with an attention mechanism and simplified recurrent units [18] and a score recognition algorithm based on residual gated recurrent convolution and an attention mechanism [19].
In research on Chinese traditional music scores, most studies have been conducted from a musical perspective, and few scholars have explored applying computer technologies to process them. Guqin (Chinese zither) music is recorded in reduced-character notation, through which the instrument is gradually learned. However, the inheritance of Guqin art has faced significant challenges over the past few decades: due to insufficient protection, many precious Guqin books have been lost or damaged, making digital research on Guqin reduced-character symbols imperative. Over the years, deep learning models have achieved remarkable success in text detection and recognition tasks. For example, Mengting Liu et al. [20] introduced a new deep learning model for recognizing oracle bone inscriptions. Nagender Aneja et al. [21] proposed a deep convolutional neural network for handwritten Devanagari character recognition with 98% accuracy. Alex Graves et al. [22] leveraged trained neural networks and probabilistic language models to build a system capable of directly transcribing raw online handwriting data. Marcus Liwicki et al. [23] introduced a unique approach using a specialized recurrent neural network to recognize raw strokes or pixel data directly, without preprocessing. Additionally, Najwa Altwaijry et al. [24] proposed a convolutional neural network (CNN)-based model for automatic recognition of handwritten Arabic characters. Despite advances in text encoding systems, recognizing Guqin reduced characters remains a formidable challenge. Current research employs manual feature extraction for Guqin notation: Enzhi Ni et al. [25] proposed a technique for extracting strokes from reduced-character images using feature decomposition, but it struggles with recognizing individual reduced characters; Wenqian Li et al. [26] introduced a four-layer CNN based on an adapted CaffeNet model for classifying handwritten symbol images, but with low recognition accuracy.
Regarding Gong-che notation, research has focused on developing typesetting software for Nanyin Gong-che notation, enabling synthesis and output with numbered musical notation or staff notation [27]. Researchers have used MFC (Microsoft Foundation Classes) to overcome technical barriers in the serialization and editing of Gong-che notation, integrating user-friendly notation input mechanisms and flexible layout configuration. However, these efforts remain focused on translation and layout design. Subsequent studies on Gong-che notation recognition and digitization include Wu Ruimin [28], who analyzed notation rules and image characteristics of Pipa Gong-che scores, solved key problems in extracting and recognizing musical information from handwritten brush Gong-che scores, and established an experimental system with high recognition accuracy. Chen Genfang [29] integrated multidisciplinary knowledge to research Gong-che notation digitization, metadata schemes, image segmentation, symbol recognition, musical information storage, and score reconstruction, laying a theoretical and methodological foundation for digital Gong-che notation.
Current research lacks exploration of datasets for traditional scores. Existing datasets generally suffer from small scale, rough annotation, and limited scene coverage. For example, Guqin reduced-character notation datasets have few public resources, with some containing only a few samples and simple classification annotations, lacking details on character structures and writing variations. Gong-che notation datasets face similar challenges; complex notation rules and diverse font styles make it difficult to cover all symbol forms, and inconsistent annotation standards weaken model generalization. In contrast, other character recognition fields (e.g., Chinese and English handwriting) benefit from million-scale, finely annotated datasets, highlighting the urgency and importance of building traditional music score datasets [30].
Although deep learning has made breakthroughs in score recognition, many challenges remain. Existing methods often misrecognize complex scores. Future research could explore more efficient algorithms to enhance recognition accuracy and robustness. Meanwhile, with expanding dataset scales and increasing model complexity, optimizing training strategies will become an important research direction.

3. The LGRC2024 Gong-Che Notation Dataset

3.1. Dataset Acquisition

Gong-che notation, as shown in Figure 1, is an important notation system in Chinese traditional music with a profound historical background. Its origins can be traced back to the late Tang Dynasty and the Five Dynasties period, evolving from the vulgar character notation of the Northern Song Dynasty and gradually forming a general form after the Southern Song Dynasty. It has preserved a vast amount of musical heritage and made significant contributions to the inheritance of ethnic music.
The notation system of Kunqu Opera mainly uses Gong-che notation, among which the three most common types are Yuzhu Score, Yizi Score, and Suoyi Score, as shown in Figure 2. Yuzhu Score, as a model of early Kunqu notation, takes the lyrics as its core, with Gong-che characters neatly arranged on the right side of the lyrics and annotated vertically. This notation method was quite popular in scores before the Qianlong period; classic works such as Jiugong Dacheng Nanbei Ci Gongpu and Nashuying Qupu adopted the Yuzhu form.
Yizi Score is a notation method in which Gong-che characters are horizontally arranged to the right of the lyrics. As a result, it is relatively laborious to read, and this notation method is generally rarely used. Only a few scores such as Eyunge Qupu adopt this unique notation form.
Suoyi Score is the most commonly used notation method in modern handwritten and printed scores. It cleverly notes Gong-che characters diagonally beside the lyrics, which is not only easy to read but also effectively avoids confusion caused by too many Gong-che characters. Therefore, scores such as Jicheng Qupu, Liuye Qupu, and Sulu Qupu all use this notation method.
Most common Gong-che notation uses ten basic characters to represent pitch [31]: 合(Hé), 四(Sì), 一(Yī), 上(Shàng), 尺(Chǐ), 工(Gōng), 凡(Fán), 六(Liù), 五(Wǔ), 乙(Yǐ). Additionally, different symbols and combinations are used to indicate duration and rhythm. In a score, determining the actual pitch of each character requires knowledge of the musical key. If the character 上(shàng) represents “middle C,” the pitch range covered by these ten characters spans from g to b¹, as shown in Table 1.
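To make the mapping in Table 1 concrete, the following small Python dictionary illustrates the pitch reading of the ten basic characters under the assumption stated above that 上 (shàng) sounds as middle C (written C4 here); in a real score the sounding pitches depend on the key of the piece.

```python
# Illustrative only: pitch reading of the ten basic Gong-che characters
# under the assumption that 上 (shàng) = middle C (C4). The actual
# sounding pitch of each character depends on the key of the piece.
GONGCHE_TO_PITCH = {
    "合": "G3",  # Hé
    "四": "A3",  # Sì
    "一": "B3",  # Yī
    "上": "C4",  # Shàng (middle C in this example)
    "尺": "D4",  # Chǐ
    "工": "E4",  # Gōng
    "凡": "F4",  # Fán
    "六": "G4",  # Liù
    "五": "A4",  # Wǔ
    "乙": "B4",  # Yǐ
}

print(GONGCHE_TO_PITCH["工"])  # -> E4
```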
Although concise in form, Gong-che notation contains rich musical information and is crucial for the inheritance and dissemination of Chinese traditional music. Existing Gong-che scores such as the Tenpyo Biwa-ho are precious materials for the study of ancient music. Despite the changes of the times, Gong-che notation still occupies an irreplaceable position in folk music, opera, and instrumental performance.
Existing Gong-che notation datasets are generally small in scale, mostly focusing on specific scores or regions. Some datasets are derived from digital scans of a few precious ancient books. Due to limited acquisition channels, they cover a small number of tracks and types of notation characters, making it difficult to fully reflect the diversity of Gong-che notation. This paper selects the first two volumes of the Kunqu Gong-che notation book Lilu Qu Pu, which was compiled by the Kunqu master Yu Zhenfei from the representative plays of the Yu family’s handed-down repertoire, totaling 212 pages. According to the collected samples, 194 valid and universal score images were extracted as the Gong-che notation character sample set. Each score image contains more than ten categories. After data augmentation, the sample set was expanded to 2134 images in total, of which 1496 images were selected as training samples and 638 as test samples, thus forming the LGRC2024 dataset [32].

3.2. Labeling LGRC2024 Dataset

Gong-che notation primarily consists of ten fundamental characters which form a heptatonic scale with a natural range spanning a major tenth. However, this range is insufficient for the typical pitch capabilities of musical instruments and human voices. Therefore, the notation system expands its range through modifications to character structure, specifically using left-sided radical augmentation and final stroke morphological variation, analogous to how staff notation adds stems and flags to indicate note duration.
The Gong-che notation type used in this study is Suoyi notation, featuring handwritten characters. To raise a note by one octave, a specific radical (e.g., 亻(Rén) or 彳(Chì)) is added to the left side of the character, as shown in Table 2, which illustrates characters transposed up by one octave.
Lowering the pitch by one octave is achieved by adding a left-downward decorative stroke to the end of the final stroke of the character. A single decorative stroke corresponds to a one-octave reduction, while a double stroke corresponds to a two-octave reduction. All decorative strokes maintain structural continuity with the end of the final stroke, as shown in Table 3, which illustrates the notation characters lowered by one octave.
The Gong-che notation system predominantly employs a movable-do solfège system, so its actual pitch depends on the tonality of the piece. Tonality is typically determined by the key name, and the key-naming method of Gong-che notation is closely related to wind instruments. The seven most commonly used keys in folk music are sì key, liù key, fán key, xiǎo gōng key, chǐ key, shàng key, and yī key, corresponding to G major, F major, E♭ major, D major, C major, B♭ major, and A major in Western notation [33].
Before training the model, manual annotation of the notation characters in the images is required. The model learns from these annotated features and gains object detection capabilities through extensive training. Based on knowledge of Gong-che notation characters, the annotation is performed as follows. The main notation characters are listed in Table 4 and Table 5, with characters of different pitches annotated sequentially; a total of 28 categories are used in this study. Labels such as keySignature-C, note-C3, and note-C4 in the tables follow the naming conventions used for computer-based music representation and processing. In C major, note-C3 denotes middle C, corresponding to the central “do”.

3.3. Data Augmentation

The data annotation method described in this paper is applied directly to the entire music score image containing lyrics and other information. The aim is to make it effectively applicable in various complex background environments, ensuring that music score data can maintain a high level of recognition accuracy in diverse application scenarios, and achieving precise identification of the musical symbol corresponding to each notation character in the score. This paper selects LabelImg, a widely used data annotation tool praised for its intuitive interface and precise, efficient annotation functions. Each key element in the music score is annotated using LabelImg, and the annotation information is saved as XML files. As structured tag files, the XML files clearly record the label name, position coordinates, and other relevant annotation information of each target object, providing a solid foundation for subsequent data processing and analysis. The annotation interface of LabelImg is shown in Figure 3.
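LabelImg saves annotations in Pascal VOC-style XML, whereas YOLOv8 expects one normalized `class x_center y_center width height` line per object. The paper does not show this conversion step, so the following is only a minimal sketch of such a conversion; the class list and file paths are illustrative placeholders.

```python
import xml.etree.ElementTree as ET

# Hypothetical class list; the real LGRC2024 dataset uses 28 categories.
CLASSES = ["note-C3", "note-C4", "keySignature-C"]  # ...extend to all 28

def voc_xml_to_yolo(xml_path):
    """Convert one LabelImg (Pascal VOC) XML file to YOLO label lines."""
    root = ET.parse(xml_path).getroot()
    w = float(root.find("size/width").text)
    h = float(root.find("size/height").text)
    lines = []
    for obj in root.iter("object"):
        cls_id = CLASSES.index(obj.find("name").text)
        box = obj.find("bndbox")
        xmin, ymin = float(box.find("xmin").text), float(box.find("ymin").text)
        xmax, ymax = float(box.find("xmax").text), float(box.find("ymax").text)
        # YOLO format: class x_center y_center width height, normalized to [0, 1]
        xc, yc = (xmin + xmax) / 2 / w, (ymin + ymax) / 2 / h
        bw, bh = (xmax - xmin) / w, (ymax - ymin) / h
        lines.append(f"{cls_id} {xc:.6f} {yc:.6f} {bw:.6f} {bh:.6f}")
    return lines

# Example: print the YOLO label lines for one annotated score page
print("\n".join(voc_xml_to_yolo("annotations/score_001.xml")))
```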
Image data augmentation is a technique that can perform various transformations on multiple images to generate more diverse image samples with common features. This technique has been widely applied in multiple tasks within the field of computer vision, covering key aspects such as image classification, object detection, and image segmentation. The advantage of data augmentation lies in its ability to significantly enhance the generalization performance of models and effectively mitigate the problem of overfitting, especially when facing the challenges of a limited dataset size or a lack of data diversity. There are various methods of data augmentation, including traditional image processing methods, active learning enhancement methods, and model enhancement methods. Traditional data augmentation methods are relatively common and easy to understand. In this paper, traditional data augmentation methods are adopted, which include methods such as cropping, flipping transformation, reflection transformation, color transformation, geometric transformation, noise injection, and shifting.
In this paper, due to the relatively limited number of annotated images, the generalization ability of the model faces challenges, and it is prone to fall into the dilemma of overfitting, which seriously affects the detection performance. Although convolutional neural networks perform excellently in feature extraction, the target features they extract are relatively fixed, which further exacerbates the problems caused by data scarcity. To effectively address this challenge, data augmentation techniques were employed to expand the dataset.
Specifically, in the object detection method for Gong-che notation character recognition, a series of data augmentation methods were implemented. These methods apply various transformations to the labeled dataset, such as rotation, scaling, cropping, flipping, and color adjustment, to generate more diverse training samples. Through data augmentation, the model obtains richer and more varied training resources from the originally limited data, improving training efficiency and effectiveness. It also enables the model to learn more robust and generalizable feature representations, so that it adapts better to complex scenarios and changes. Figure 4 shows some of the augmented forms of the Gong-che score image dataset.
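The paper does not name the tool used for these augmentations; as one possible realization of the transformations listed above, the sketch below applies rotation, flipping, noise injection, color adjustment, and shifting with the albumentations library, which keeps the Pascal VOC bounding boxes aligned with the transformed image. All parameter values are illustrative.

```python
import albumentations as A
import cv2

# A sketch of the traditional augmentations described above; limits and
# probabilities are illustrative, not the paper's actual settings.
transform = A.Compose(
    [
        A.Rotate(limit=10, p=0.5),                      # small tilts
        A.HorizontalFlip(p=0.5),                        # flipping
        A.GaussNoise(p=0.3),                            # noise injection
        A.RandomBrightnessContrast(p=0.3),              # color transformation
        A.ShiftScaleRotate(shift_limit=0.05, scale_limit=0.1,
                           rotate_limit=0, p=0.5),      # shifting / scaling
    ],
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["labels"]),
)

image = cv2.imread("images/score_001.jpg")
boxes = [[120, 40, 160, 90]]      # [xmin, ymin, xmax, ymax], illustrative
labels = ["note-C4"]

augmented = transform(image=image, bboxes=boxes, labels=labels)
aug_image, aug_boxes = augmented["image"], augmented["bboxes"]
```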

4. Recognition of Gong-Che Notation Characters

In this paper, the lightweight (n) and medium-weight (m) versions of YOLOv8 were used to build the models for this experiment. YOLOv8n is a lightweight model in the Ultralytics YOLO framework. It consists of a backbone network composed of Conv layers and C2f layers, and a detection head that includes Upsample, Concat, Conv, and Detect layers. The backbone extracts multi-scale feature maps, and the detection head performs object detection on them. The network structure diagram is shown in Figure 5. Compared with YOLOv8n, the YOLOv8m model is more accurate and powerful: YOLOv8n has 225 model layers, while YOLOv8m has 295. With a deeper backbone network and more parameters, YOLOv8m can provide better detection performance and accuracy.
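As a rough way to reproduce this size comparison, the Ultralytics API can report the layer and parameter counts of both variants (a sketch; the exact numbers printed depend on the library version):

```python
from ultralytics import YOLO

# Load the two pretrained variants and print their layer/parameter counts.
for name in ("yolov8n.pt", "yolov8m.pt"):
    model = YOLO(name)
    model.info()  # prints layers, parameters, gradients, GFLOPs
```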

4.1. Improved Model Based on YOLOv8

4.1.1. Improved YOLOv8 Model with Lightweight SimAM

SimAM [34] is a simple and parameter-free attention module proposed for convolutional neural networks. Its design concept is deeply inspired by neuroscience theories, successfully achieving parameter-free and efficient performance enhancement. The core advantage of this module lies in its ability to effectively generate true 3D weights. To implement the attention mechanism effectively, the key lies in accurately evaluating the importance of individual neurons.
In the field of visual neuroscience, neurons rich in information often exhibit firing patterns distinct from those of neighboring neurons. Additionally, active neurons may inhibit the activity of surrounding neurons, a phenomenon known as spatial suppression. In other words, during visual processing, neurons exhibiting significant spatial suppression effects should be assigned higher priority or importance. To identify these critical neurons, a simple yet effective method is to measure the linear separability between the target neuron and others. Based on these neuroscience findings, researchers defined a specific energy function for each neuron, as shown in Equation (1), to quantify its importance. These weights are then used to adjust the neuron values in the feature maps, thereby enhancing the network’s focus on important features. Owing to its closed-form solution, SimAM can be efficiently implemented without introducing additional parameters, making it lightweight and easy to integrate.
$$ e_t\left(\omega_t, b_t, \mathbf{y}, x_i\right) = \left(y_t - \hat{t}\right)^2 + \frac{1}{M-1}\sum_{i=1}^{M-1}\left(y_o - \hat{x}_i\right)^2 \qquad (1) $$
In the formula, $\hat{t} = \omega_t t + b_t$ and $\hat{x}_i = \omega_t x_i + b_t$ are linear transformations of $t$ and $x_i$, where $t$ and $x_i$ denote the target neuron and the other neurons within a single channel of the input feature $X \in \mathbb{R}^{C \times H \times W}$. Here, $i$ indexes the spatial dimension, $M = H \times W$ is the number of neurons in the channel, and $\omega_t$ and $b_t$ are the weight and bias of the transformation.
In the feature refinement stage, SimAM ingeniously employs a scaling operator to simulate the attention modulation process in the mammalian brain, implementing a gain effect on neuronal responses. This process comprehensively covers all channels and spatial dimensions of the feature map, ensuring the accuracy and rationality of feature refinement. In contrast, existing attention modules often only refine either the channel or spatial dimensions, as shown in Figure 6a,b, generating one-dimensional or two-dimensional weights. Moreover, they apply uniform operations to neurons in each channel or spatial location, which may limit their ability to learn more discriminative features. Therefore, the SimAM attention mechanism proposes using full three-dimensional weights to improve feature representation and capture more abundant information, as illustrated in Figure 6c.
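Because the energy function in Equation (1) has a closed-form solution, SimAM reduces to a few tensor operations. The following is a minimal PyTorch sketch of the commonly used parameter-free implementation; the regularization constant e_lambda is a conventional default rather than a value taken from this paper.

```python
import torch
import torch.nn as nn

class SimAM(nn.Module):
    """Parameter-free SimAM attention: scales features by the sigmoid of the
    inverse energy obtained from the closed-form solution of Eq. (1)."""

    def __init__(self, e_lambda: float = 1e-4):
        super().__init__()
        self.e_lambda = e_lambda

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        n = h * w - 1
        # Squared distance of every neuron from its channel mean
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
        # Channel variance term in the closed-form energy
        v = d.sum(dim=(2, 3), keepdim=True) / n
        # Inverse energy: distinctive (informative) neurons get larger weights
        e_inv = d / (4 * (v + self.e_lambda)) + 0.5
        return x * torch.sigmoid(e_inv)

# Usage: attend over a 1024-channel feature map, as at the end of the backbone
feat = torch.randn(1, 1024, 20, 20)
print(SimAM()(feat).shape)  # torch.Size([1, 1024, 20, 20])
```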
In this section, the improvement is to introduce the SimAM attention module into the backbone network of the original YOLOv8m medium-sized version model, which has shown good performance, as shown in Figure 7. SimAM calculates the attention weights of the feature maps in a parameter-free manner, and then enhances important features while suppressing secondary features, thus improving the detection performance of the model.
While keeping the original backbone network structure and configuration unchanged, the SimAM attention module is introduced after the last convolutional layer of the backbone network. This improvement aims to take advantage of the increased number of channels and reduced resolution in the deep feature maps, and more effectively capture key features through the attention mechanism, thereby enhancing the basic feature extraction ability. The configuration parameter of the SimAM module is (1024), which matches the number of channels in the feature map output by the last convolutional layer in the backbone network. The output feature map is adjusted through the attention mechanism to enhance the ability to capture key features.
The SimAM module adaptively enhances important features and suppresses secondary features by calculating the attention weights of the feature maps. This mechanism helps improve the model’s ability to capture key information, thus enhancing the performance of object detection. After the introduction of the SimAM module, the SPPF module receives its output feature map, performs feature fusion and dimensionality reduction, extracts useful information, and prepares for the object detection task.

4.1.2. Improved YOLOv8 Model with Triplet Attention Mechanism

Attention mechanisms have been extensively studied and applied in computer vision tasks, including Squeeze-and-Excitation Networks (SENet) and Convolutional Block Attention Module (CBAM) [35]. However, these methods exhibit limitations such as requiring numerous learnable parameters, lacking spatial attention, disjointed spatial and channel attention, or insufficient cross-channel interactions. To address these drawbacks, Triplet Attention [36] introduces a novel and intuitive approach for computing attention weights, termed cross-dimensional interaction.
Cross-dimensional interaction is a straightforward concept that enables the module to compute attention weights for each dimension relative to every other dimension, namely $C \times W$, $C \times H$, and $H \times W$. This allows spatial and channel attention to be computed simultaneously within a single module. Cross-dimensional interaction is achieved by rearranging the dimensions of the input tensor, followed by a residual transformation that generates the attention weights. This mechanism establishes strong interdependencies between the dimensions of the input tensor, which is critical for precisely delineating attention regions in feature maps.
Triplet Attention is a three-branch structure [37], with each branch responsible for capturing cross-dimensional interactions between the spatial dimensions and the channel dimension of the input. Given an input tensor of shape $C \times H \times W$, each branch aggregates the interaction features between the channel dimension C and one of the spatial dimensions H or W, or between the two spatial dimensions. Each branch first performs a Z-Pool operation, a pooling step that applies average pooling and max pooling to the input tensor along the channel dimension and concatenates the two results, thereby reducing the depth of the tensor while preserving a rich representation. The tensor is then processed by a convolutional layer; although the kernel size is not fixed by the definition, the spatial shape of the tensor is preserved after the convolution. A Sigmoid activation compresses the output to values between 0 and 1 to generate attention weights, which are applied to the permuted input tensor; the result is finally permuted back to its original shape, achieving the aggregation and weighting of cross-dimensional interaction features. The entire process emphasizes cross-dimensional interaction and performs no dimensionality reduction, eliminating the indirect correspondence between channels and weights.
The structural diagram of the three branches of Triplet Attention is shown in Figure 8. The top branch is responsible for calculating the attention weights of the channel dimension C and the spatial dimension W. The middle branch is responsible for the channel dimension C and the spatial dimension H. The bottom branch is used to capture spatial dependencies (H and W). In the first two branches, rotation operations are used to establish connections between the channel dimension and either of the spatial dimensions. Finally, the weights are aggregated by simple averaging.
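A compact PyTorch sketch of this three-branch structure is given below; the 7 × 7 kernel of the attention gates follows the description in this section, while the remaining details are assumptions and may differ from the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ZPool(nn.Module):
    """Concatenate channel-wise max and mean, reducing the depth to 2."""
    def forward(self, x):
        return torch.cat(
            (x.max(dim=1, keepdim=True).values, x.mean(dim=1, keepdim=True)), dim=1
        )

class AttentionGate(nn.Module):
    """Z-Pool -> 7x7 conv + BN -> sigmoid, applied as a gate on the input."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.pool = ZPool()
        self.conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(1),
        )
    def forward(self, x):
        return x * torch.sigmoid(self.conv(self.pool(x)))

class TripletAttention(nn.Module):
    """Three branches capturing C-W, C-H and H-W interactions, averaged."""
    def __init__(self):
        super().__init__()
        self.gate_cw = AttentionGate()
        self.gate_ch = AttentionGate()
        self.gate_hw = AttentionGate()
    def forward(self, x):
        # Branch 1: rotate so H takes the channel position (C-W interaction)
        x_cw = x.permute(0, 2, 1, 3).contiguous()
        out_cw = self.gate_cw(x_cw).permute(0, 2, 1, 3).contiguous()
        # Branch 2: rotate so W takes the channel position (C-H interaction)
        x_ch = x.permute(0, 3, 2, 1).contiguous()
        out_ch = self.gate_ch(x_ch).permute(0, 3, 2, 1).contiguous()
        # Branch 3: plain spatial attention over H-W
        out_hw = self.gate_hw(x)
        return (out_cw + out_ch + out_hw) / 3.0

print(TripletAttention()(torch.randn(1, 256, 40, 40)).shape)
```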
The improvement in this section is based on the original YOLOv8m medium-sized model, which has demonstrated excellent performance. To further enhance the model’s ability to refine features, the Triplet Attention module is specifically introduced after the C2f structures at different levels in the Head (detection head) module of the model. This module can effectively capture the interactive relationships between the channel and spatial dimensions of features. By enhancing the representation of key features, it optimizes the model’s perception and discrimination of target features, ultimately improving the accuracy and effectiveness of the detection task. The specific model architecture is shown in Figure 9.
The Triplet Attention structure changes the dimensional order of the input tensor through dimension permutation and performs a Z-Pool operation. This operation combines the advantages of average pooling and max pooling to extract features from the channel dimension. A 7 × 7 convolution, Batch Norm, and Sigmoid function are used to generate attention weights. By permuting the dimensions again and using a 1 × 1 convolution, the generated weights are applied to the original input tensor to realize the reweighting of input features, as shown in TripletAttention of Figure 10.
In the head part of YOLOv8, Triplet Attention modules are introduced on feature maps of three different scales, as shown in Figure 9. After certain convolution and feature fusion operations on the feature maps of each scale, the Triplet Attention module is inserted to enhance feature representation.
In the processing flow of feature maps for each scale, the original C2f module is modified into a C2f_TripletAttention module with Triplet Attention. This module not only includes the functions of the original C2f module but also adds an additional Triplet Attention mechanism, as shown in Figure 10. In this way, the network can further enhance the capture and utilization of key features while maintaining the original feature extraction capability.
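The paper does not show how C2f_TripletAttention is implemented; one straightforward reading of Figure 10 is to append a Triplet Attention gate to the output of the standard Ultralytics C2f block, as in the sketch below (which reuses the TripletAttention class from the previous listing). In practice the new block would also need to be registered with the model parser and referenced in the model YAML, which is omitted here.

```python
from ultralytics.nn.modules import C2f
# TripletAttention is defined in the previous code sketch.

class C2f_TripletAttention(C2f):
    """C2f block whose output is re-weighted by Triplet Attention (sketch)."""
    def __init__(self, c1, c2, n=1, shortcut=False, g=1, e=0.5):
        super().__init__(c1, c2, n, shortcut, g, e)
        self.attention = TripletAttention()

    def forward(self, x):
        return self.attention(super().forward(x))
```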
The functional mechanism of the Triplet Attention module involves calculating the correlations between different positions and channels in the feature map to generate attention weights. These weights can adaptively enhance important features and suppress secondary features, thereby improving the quality of the feature map. In YOLOv8, this mechanism is applied to feature maps of three different scales, which helps enhance the network’s object detection capabilities at various scales.

4.1.3. Improved YOLOv8 Model with Multi-Scale Convolutional Attention Module (MSCAM)

The MSCAM is an innovative and efficient module that performs depthwise convolution operations across multiple scales to optimize feature maps generated by the Vision Encoder. By suppressing irrelevant regions, MSCAM successfully captures multi-scale salient features [38]. The module consists of three main components [39]: a Channel Attention Block (CAB) for emphasizing relevant channels, a Spatial Attention Block (SAB) for capturing local contextual information, and a Multi-Scale Convolution Block (MSCB) for enhancing feature maps that preserve contextual relationships, as shown in Figure 11.
By dynamically adjusting channel and spatial weights, it enables the network to be more flexible in handling complex scenes, better adapting to diverse objects and backgrounds. This design concept provides a new direction for improving object detection models, demonstrating the potential of combining multi-scale convolution with attention mechanisms. MSCAM is shown in Figure 11 and mathematically defined by Equation (2):
$$ \mathrm{MSCAM}(x) = \mathrm{MSCB}\big(\mathrm{SAB}\big(\mathrm{CAB}(x)\big)\big) \qquad (2) $$
where x is the input feature map, and MSCB, SAB, and CAB are the Multi-Scale Convolution Block, Spatial Attention Block, and Channel Attention Block, respectively.
Because it uses depthwise convolutions at multiple scales, MSCAM is more effective than the earlier convolutional attention module (CAM) [40], while having a significantly lower computational cost.
(1)
Channel Attention Block (CAB)
The core objective of the Channel Attention Block (CAB) is to enhance the network’s focus on key channels while reducing the impact of secondary channels. This effectively strengthens the representation of important feature channels, increases the model’s sensitivity to critical information, and ultimately improves detection accuracy [38].
In CAB, adaptive max-pooling ($P_m$) and adaptive average-pooling ($P_a$) are applied over the spatial dimensions (height and width) of the input feature map to extract the most salient features of each channel. For each pooled descriptor, a point-wise convolution ($C_1$) followed by a ReLU activation ($R$) reduces the number of channels by a ratio $r = 1/16$ to lower computational complexity, and another point-wise convolution ($C_2$) restores the original number of channels. The two restored feature maps are added together, and a Sigmoid activation ($\sigma$) estimates the attention weights, which are combined with the original input feature map $x$ using the Hadamard product ($\odot$) to achieve channel-wise weighting; a combined code sketch of CAB, SAB, and MSCB is given after Equation (5). CAB is shown in Figure 11 and is mathematically defined by Equation (3):
$$ \mathrm{CAB}(x) = \sigma\Big(C_2\big(R\big(C_1\big(P_m(x)\big)\big)\big) + C_2\big(R\big(C_1\big(P_a(x)\big)\big)\big)\Big) \odot x \qquad (3) $$
(2)
Spatial Attention Block (SAB)
The Spatial Attention Block (SAB) aims to enhance the model’s ability to assess the importance of key spatial regions in feature maps, guiding the model to focus on these critical areas. By introducing the spatial attention mechanism, the network can dynamically focus on important spatial locations in feature maps, thereby improving localization accuracy in complex scenes.
In SAB, the maximum ($\mathrm{Ch}_{max}$) and average ($\mathrm{Ch}_{avg}$) values across the channel dimension are aggregated to focus on local features. A convolutional layer with a large kernel ($\mathrm{LKC}$, 7 × 7) enhances the local contextual relationships between features. A Sigmoid activation ($\sigma$) computes the attention weights, which are combined with the original input feature map $x$ using the Hadamard product ($\odot$) to direct attention more precisely. This design mimics the attention process of the human brain, enabling the model to concentrate on specific parts of the input image. By emphasizing the key positions in the feature map, SAB improves the model’s ability to recognize and respond to relevant spatial features, where the contextual and positional information of objects has a significant impact on the output results [38]. SAB is shown in Figure 11 and is mathematically defined by Equation (4):
$$ \mathrm{SAB}(x) = \sigma\Big(\mathrm{LKC}\big(\big[\mathrm{Ch}_{max}(x),\, \mathrm{Ch}_{avg}(x)\big]\big)\Big) \odot x \qquad (4) $$
(3)
Multi-Scale Convolution Block (MSCB)
The Multi-Scale Convolution Block (MSCB) is a component for enhancing the model’s ability to extract features of objects of different sizes. It realizes multi-scale feature extraction by executing multiple depthwise convolutional layers in parallel, with each layer adopting convolutional kernels of different sizes (e.g., $1 \times 1$, $3 \times 3$, $5 \times 5$). This parallel convolutional operation can capture the features of diverse objects, providing the model with rich multi-scale information.
In MSCB, a convolutional layer $\mathrm{PWC}_1$ with a $1 \times 1$ kernel, a batch normalization layer (BN), and a ReLU6 activation ($\mathrm{R6}$) are used to expand the number of channels (expansion factor = 2). A multi-scale depthwise convolution (MSDC) captures multi-scale, multi-resolution context. Since depthwise convolution ignores the relationships between channels, a channel-shuffle operation (CS), which shuffles the original channel order of the feature map, is carried out to incorporate inter-channel relationships. Finally, a convolutional layer $\mathrm{PWC}_2$ and a BN layer convert the tensor back to the original number of channels, which also encodes the dependencies between channels. MSCB is shown in Figure 11 and is mathematically defined by Equation (5):
$$ \mathrm{MSCB}(x) = \mathrm{BN}\Big(\mathrm{PWC}_2\big(\mathrm{CS}\big(\mathrm{MSDC}\big(\mathrm{R6}\big(\mathrm{BN}\big(\mathrm{PWC}_1(x)\big)\big)\big)\big)\big)\Big) \qquad (5) $$
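The authors' exact layer hyperparameters are not given, so the following is a simplified, self-contained PyTorch sketch of Equations (2)-(5). The reduction ratio r = 16, the 7 × 7 spatial kernel, the 1 × 1 / 3 × 3 / 5 × 5 depthwise kernels, and the channel expansion factor of 2 follow the description above; everything else (for example, the channel-shuffle group count) is an illustrative assumption.

```python
import torch
import torch.nn as nn

class CAB(nn.Module):
    """Channel Attention Block, Eq. (3): pooled descriptors -> shared point-wise convs -> sigmoid gate."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        hidden = max(channels // r, 1)
        self.c1 = nn.Conv2d(channels, hidden, 1)   # point-wise reduction (C1)
        self.c2 = nn.Conv2d(hidden, channels, 1)   # point-wise restoration (C2)
        self.relu = nn.ReLU(inplace=True)
    def forward(self, x):
        p_max = torch.amax(x, dim=(2, 3), keepdim=True)   # adaptive max-pool to 1x1
        p_avg = torch.mean(x, dim=(2, 3), keepdim=True)   # adaptive avg-pool to 1x1
        w = torch.sigmoid(self.c2(self.relu(self.c1(p_max))) +
                          self.c2(self.relu(self.c1(p_avg))))
        return w * x

class SAB(nn.Module):
    """Spatial Attention Block, Eq. (4): channel max/mean -> 7x7 conv -> sigmoid gate."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)
    def forward(self, x):
        desc = torch.cat((torch.amax(x, dim=1, keepdim=True),
                          torch.mean(x, dim=1, keepdim=True)), dim=1)
        return torch.sigmoid(self.conv(desc)) * x

def channel_shuffle(x, groups):
    """Mix channel groups so that depthwise features exchange information."""
    b, c, h, w = x.shape
    return x.view(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)

class MSCB(nn.Module):
    """Multi-Scale Convolution Block, Eq. (5), simplified: expand channels,
    parallel depthwise convs at several kernel sizes, channel shuffle, project back."""
    def __init__(self, channels: int, kernel_sizes=(1, 3, 5), expansion: int = 2, groups: int = 4):
        super().__init__()
        mid = channels * expansion
        self.pwc1 = nn.Sequential(nn.Conv2d(channels, mid, 1),
                                  nn.BatchNorm2d(mid), nn.ReLU6(inplace=True))
        self.msdc = nn.ModuleList(
            [nn.Conv2d(mid, mid, k, padding=k // 2, groups=mid) for k in kernel_sizes]
        )
        self.groups = groups
        self.pwc2 = nn.Sequential(nn.Conv2d(mid, channels, 1), nn.BatchNorm2d(channels))
    def forward(self, x):
        y = self.pwc1(x)
        y = sum(conv(y) for conv in self.msdc)        # multi-scale depthwise context
        y = channel_shuffle(y, self.groups)           # incorporate inter-channel relations
        return self.pwc2(y)

class MSCAM(nn.Module):
    """MSCAM, Eq. (2): MSCB(SAB(CAB(x)))."""
    def __init__(self, channels: int):
        super().__init__()
        self.cab, self.sab, self.mscb = CAB(channels), SAB(), MSCB(channels)
    def forward(self, x):
        return self.mscb(self.sab(self.cab(x)))

print(MSCAM(256)(torch.randn(1, 256, 40, 40)).shape)  # torch.Size([1, 256, 40, 40])
```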
This improvement involves the ingenious integration of the Multi-Scale Convolutional Attention Module (MSCAM) into the already high-performing YOLOv8m medium-sized model, as illustrated in Figure 11. It is able to enhance the model’s detection accuracy, enabling it to identify targets more precisely in complex and dynamic scenarios. The core advantage of the MSCAM module lies in its ability to strengthen the model’s focus on key channels and spatial regions. Through multi-scale convolution operations, the model can capture more detailed and comprehensive feature information, thereby demonstrating enhanced target recognition capabilities in intricate environments.
In the backbone network, the MSCAM is incorporated after the SPPF layer (as illustrated in Figure 12) to enhance attention weighting for high-level feature maps, thereby improving the quality of feature representation. In the detection head, MSCAM modules are individually introduced prior to the upsampling and concatenation of feature maps at each scale, aiming to strengthen the feature representation of each scale and facilitate cross-scale feature fusion.
The MSCAM module receives feature maps from either the backbone or the detection head. It first computes channel attention weights via the CAB, then enhances spatial key region information through the SAB, and captures multi-scale features using the MSCB. These processed outputs are then fused to generate an enhanced feature representation. By capturing multi-scale features and dynamically adjusting channel and spatial attention weights, MSCAM strengthens feature representation and improves the accuracy and robustness of object detection.
The three proposed improved methods are adapted to the unique task of Gong-che notation character recognition. The SimAM module, embedded after the last convolutional layer of the backbone network, can more effectively capture key features through semantic-aware attention and enhance basic feature extraction capability, as the feature maps at this position contain richer semantic information. The Triplet Attention module, introduced into the C2f structures at different levels in the Head module of YOLOv8, can better capture the interactive correlations between channel and spatial dimensions in feature maps of different scales and strengthen cross-scale and cross-dimensional feature interactions, considering that the Head module is responsible for target detection and fuses feature maps of different scales at this stage. The Multi-Scale Convolutional Attention Module (MSCAM), introduced after the SPPF layer of the backbone network and before the upsampling and concatenation of each scale feature map in the Head, is used to enhance the attention weighting of high-level feature maps and promote cross-scale feature fusion, respectively. Its integrated multi-scale convolutional kernels can extract hierarchical features of Gong-che characters of various sizes to cope with size fluctuations and differences in stroke adhesion. Combined with the weighting enhancement of MSCAM channel attention on key stroke features, it effectively reduces the misjudgment rate of similarly shaped characters.

5. Experiments

5.1. Dataset Splitting

A total of 2134 images were used as the sample set in this experiment. To effectively train and validate the model, these images were initially split at a ratio of 7:3. The training set consists of 1496 images, accounting for the majority of the total samples, which aims to provide abundant data for the model to learn sufficiently. The remaining 638 images are allocated to the validation set, which is used to evaluate and adjust the model performance during training to ensure the model has good generalization ability. This splitting strategy is designed to balance the needs of model training and the effectiveness of validation.

5.2. Parameter Settings

The following parameter configuration was adopted to train the model, aiming to improve stability, accelerate convergence, and suppress overfitting. For the optimizer, AdamW (a variant of Adam with decoupled weight decay applied in the weight updates) was automatically selected as the most suitable algorithm for the current task. AdamW not only optimizes the training process but also enhances the model’s generalization ability and reduces the risk of overfitting. The initial learning rate was set to 0.001, which ensures stability while achieving relatively fast convergence and a low overfitting risk. The batch size was set to 16, which further improves training efficiency through reasonable sample grouping. The experiment was run for 150 epochs to ensure that the model can fully learn the data features. The experimental parameters are shown in Table 6.
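Under these settings, a typical Ultralytics training call would look like the sketch below; the dataset configuration file name and the input image size are placeholders not specified in the paper.

```python
from ultralytics import YOLO

model = YOLO("yolov8m.pt")  # medium-weight baseline selected in Section 4

# Hyperparameters from Table 6; "lgrc2024.yaml" is a placeholder dataset config.
results = model.train(
    data="lgrc2024.yaml",
    epochs=150,
    batch=16,
    lr0=0.001,
    optimizer="AdamW",   # or "auto" to let Ultralytics screen the optimizer
    imgsz=640,           # assumption: default input size, not stated in the paper
)

metrics = model.val()    # precision, recall, mAP50, mAP50-95 on the validation split
```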

5.3. Experimental Design

In designing the experiments, the insertion positions of these attention mechanisms were chosen based on their characteristics and performance. Specifically, the SimAM module was introduced after the last convolutional layer of the backbone network because the feature maps at this position contain richer semantic information. SimAM can more effectively capture key features on this basis, enhancing the basic feature extraction ability. Triplet Attention was incorporated into the C2f structures at different hierarchical levels within the Head module of YOLOv8. Considering that the Head module is responsible for target detection and feature maps of different scales are fused at this stage, Triplet Attention can better capture the interactive correlations between channel and spatial dimensions across multi-scale feature maps, thereby improving the model’s perception and discrimination of target features. MSCAM was introduced after the SPPF layer in the backbone network and before the upsampling and concatenation of feature maps at each scale in the head to enhance the attention weighting of high-level feature maps and promote cross-scale feature fusion, respectively.
To ensure that the results are primarily influenced by the model rather than the insertion positions, preliminary experiments were conducted. The type of attention mechanism module was fixed, while only its insertion position in YOLOv8 was varied. The positions where the attention mechanism modules exhibited better overall performance were selected for comparison with other models.

5.4. Experimental Analysis of YOLOv8n and YOLOv8m

In this experiment, YOLOv8n and YOLOv8m were used for training and testing on the Gong-che notation symbols. During training, as the number of iterations increased, the values of the three YOLOv8 loss functions dropped sharply and finally stabilized, reaching convergence. Taking the comparison of cls_loss curves as an example, as shown in Figure 13, YOLOv8m exhibits better convergence of the loss function than YOLOv8n: it converges faster, with smaller fluctuations and a smoother curve, indicating that it is more stable and efficient and has better feature extraction capabilities. This shows that YOLOv8m has higher efficiency and stability in recognizing Gong-che notation symbols, making it more suitable as the recognition model for this task.
As shown in Figure 14, the symbol recognition accuracy of the YOLOv8n and YOLOv8m models is compared. From the curves, the accuracy of YOLOv8m stabilizes between 70% and 80%, settling at nearly 80% after 100 epochs. In contrast, the accuracy of YOLOv8n is unstable, oscillating between 40% and 70%. It is evident that YOLOv8m significantly outperforms YOLOv8n, making it more suitable for recognition training on this dataset.
As shown in Table 7, the mAP50 and accuracy of the YOLOv8 models used in this study are compared. YOLOv8m outperforms YOLOv8n in both metrics, with significant improvements in accuracy. Specifically, the mAP50 increases from 62.90% to 74.26%, and the accuracy rises from 65.95% to 78.19%.
As shown in Figure 15, by comprehensively evaluating the metrics of precision, recall, mAP50, and mAP50-95, it can be clearly observed that the YOLOv8m model demonstrates advantages in all aspects compared to the YOLOv8n model, indicating its more excellent detection capability in complex scenarios. In addition, YOLOv8m has achieved a significant improvement in the accuracy of the Gong-che notation recognition task, being able to more accurately identify and parse more musical notation symbols, which further enhances its reliability and practicality in practical applications.
As shown in Figure 16, when using YOLOv8n for recognition, some characters cannot be accurately identified. In contrast, YOLOv8m has much stronger recognition ability; it can not only recognize almost all characters in the image, but also has a higher accuracy rate and a lower error rate. YOLOv8m can effectively recognize characters not only under original image conditions, but also maintain high recognition ability in the face of complex situations such as adding different noises, tilting and flipping at different angles. This robust recognition performance makes the YOLOv8m model more adaptive and reliable in handling various image deformations and interferences that may be encountered in practical applications.

5.5. Experimental Analysis of Improved YOLOv8 Models

In this experiment, three different attention mechanism modules (SimAM, Triplet Attention, and MSCAM) were introduced into the YOLOv8m model, as detailed in Section 4.1. The improved models based on YOLOv8m were used for training and testing on the Gong-che notation symbols. During training, as the number of iterations increased, the loss values dropped sharply and finally stabilized, reaching convergence.
Taking the comparison of cls_loss curves as an example, Figure 17 shows the change in classification loss during training for the four models: MSCAM, SimAM, Triplet Attention, and the baseline YOLOv8m. It can be clearly seen that as training progresses (i.e., as the number of epochs increases), the classification loss of all models decreases steadily, indicating that each model is continuously learning and its predictions are gradually improving. Among them, the improved YOLOv8 model with the MSCAM attention module shows the best convergence behavior: it converges faster with smaller fluctuations, indicating higher efficiency and stability in recognizing Gong-che symbols and making it the most suitable recognition model for this task.
As shown in Figure 18, the accuracy of the three improved models (MSCAM, SimAM, and Triplet Attention) and the original YOLOv8m model in recognizing notation symbols is compared. From the precision curves, the improved model with the MSCAM module performs best, maintaining high recognition accuracy throughout the training stages. The improved models with SimAM and Triplet Attention also perform well; their accuracy curves are higher than that of the original YOLOv8m model, indicating that introducing these attention mechanisms enhances the recognition ability of the model to a certain extent.
As shown in Table 8, the accuracy of these four models is compared. YOLOv8m, as the baseline model, achieves an accuracy of 78.2%, demonstrating its basic performance in the object detection task. The models improved by introducing attention mechanisms (SimAM, Triplet Attention, MSCAM) all significantly enhance the accuracy. Among them, MSCAM reaches the highest accuracy of 83.6%, whose advantage may stem from the dynamic fusion capability of multi-scale channel attention mechanisms for complex features. Triplet Attention achieves an accuracy of 81.8%, enhancing feature representation through cross-dimensional interaction, while SimAM achieves an accuracy of 80.4%, optimizing computational efficiency by simplifying the attention module. Overall, the three introduced attention mechanism modules can effectively improve the model’s accuracy, with MSCAM showing the best accuracy performance.
A comprehensive evaluation of precision, recall, mAP50, and mAP50-95 is conducted. The bar chart in Figure 19 compares the YOLOv8m model and its three improved versions across these four metrics. Overall, the three improved models outperform the original model on all indicators. In terms of mAP50-95, the improved model with Triplet Attention is slightly inferior to the one with SimAM; however, the improved model with MSCAM performs best in all respects. The Triplet Attention variant also shows relatively good results, while the SimAM variant is slightly weaker but still outperforms the original YOLOv8m model, achieving notable improvements and recognizing more characters correctly. In summary, attention mechanisms effectively enhance the model’s ability to capture target details, and the improvement brought by MSCAM confirms the importance of multi-scale feature fusion for boosting performance across the board.
As shown in Figure 19, the recognition results of the improved YOLOv8 model with the Multi-Scale Convolutional Attention Module (MSCAM) on Gong-che notation are displayed. The model recognizes almost all musical symbols with high accuracy, with most confidence scores exceeding 80%; compared with the baseline YOLOv8m model, its accuracy is higher, and many cases reach over 90%. Therefore, this model not only recognizes almost all characters in the image but also achieves higher recognition accuracy and a lower error rate. Meanwhile, it maintains strong recognition ability in complex scenarios, such as added noise, tilting at various angles, translation, and flipping. Such robust recognition performance makes the model more adaptive and reliable in handling the image deformations and interferences that may be encountered in practical applications.

6. Conclusions

This study focuses on deep learning-based recognition of Chinese traditional musical scores, centered on the identification of Gong-che notation characters, and introduces targeted solutions. With the maturity of optical music recognition (OMR) and optical character recognition (OCR) technologies, researching and developing efficient algorithms adaptable to musical symbols of different styles and forms is crucial for the digitization of Chinese traditional score characters. Addressing the particularities of Gong-che notation, this paper employs the YOLOv8 method to process Gong-che symbol images efficiently; it eliminates the need for complex segmentation and preprocessing, simplifying the workflow, improving efficiency, and facilitating readers’ understanding of Gong-che notation. The study constructs the LGRC2024 dataset from the Gong-che notation in Lilu Qupu and uses data augmentation techniques to enhance dataset diversity. Through comparative experiments, the performance of YOLOv8n and YOLOv8m on this task is analyzed; the results show that YOLOv8m significantly improves recognition accuracy and the other performance metrics. To further enhance performance, three different attention mechanism modules are introduced into the YOLOv8m model, and these improved models are analyzed and evaluated through comparative experiments. The final results demonstrate that integrating the Multi-Scale Convolutional Attention Module (MSCAM) with YOLOv8 significantly enhances performance in Gong-che symbol recognition, outperforming the other attention mechanism modules. This study still has limitations, such as the small sample size and the fact that duration and rhythm in Gong-che notation are not yet recognized, which require further research. Future work will extend the modeling of complex degradation patterns encountered in practical scanning scenarios, such as ink fading and text blurring caused by paper aging, to enhance the model’s robustness and applicability in the digitization of real historical scores.

Author Contributions

Conceptualization, Y.Z. and Z.H.; methodology, Y.Z.; software, L.Z.; validation, Y.Z., Z.H. and L.Z.; formal analysis, Y.H.; investigation, Y.Z.; resources, Z.H.; data curation, L.Z.; writing—original draft preparation, Y.Z.; writing—review and editing, Z.H.; visualization, L.Z.; supervision, Z.H.; project administration, Z.H.; funding acquisition, Z.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The dataset can be obtained from https://www.kaggle.com/datasets/liumeizhang/lgrc2024 (accessed on 3 June 2025).

Conflicts of Interest

Author Yuqian Zhang was employed by the Technical Research Institute, QinChuan Machine Tool & Tool Group Co., Ltd., Baoji. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AdamW: Adam with Weight Decay
BN: Batch Normalization
C2f: Cross Stage Partial 2 with Focus
CAB: Channel Attention Block
CBAM: Convolutional Block Attention Module
CNN: Convolutional Neural Network
LGRC2024: Lilu Qu Pu Gong-che Notation Dataset 2024
MIDI: Musical Instrument Digital Interface
MSCAM: Multi-Scale Convolutional Attention Module
MSCB: Multi-Scale Convolution Block
MSDC: Multi-Scale Depthwise Convolution
MusicXML: Music Extensible Markup Language
mAP50: mean Average Precision at 50 IoU
mAP50-95: mean Average Precision from 50 to 95 IoU
OMR: Optical Music Recognition
OCR: Optical Character Recognition
PWC: Point-Wise Convolution
ReLU6: Rectified Linear Unit 6
SAB: Spatial Attention Block
SENet: Squeeze-and-Excitation Networks
SimAM: Simple Attention Module
SPPF: Spatial Pyramid Pooling-Fast

References

  1. Ríos-Vila, A.; Rizo, D.; Iñesta, J.M.; Calvo-Zaragoza, J. End-to-end optical music recognition for pianoform sheet music. Int. J. Doc. Anal. Recognit. (IJDAR) 2023, 26, 347–362. [Google Scholar] [CrossRef]
  2. Zhang, Y.; Zhang, L.; Hu, Y.; Ju, T. Recognition of Chinese Gong-Che Notation Characters Based on YOLOv8. In Proceedings of the 2024 9th International Conference on Intelligent Computing and Signal Processing (ICSP), Xi’an, China, 19–21 April 2024; pp. 768–771. [Google Scholar] [CrossRef]
  3. Calvo-Zaragoza, J.; Gallego, A.J. A selectional auto-encoder approach for document image binarization. Pattern Recognit. 2019, 86, 37–47. [Google Scholar] [CrossRef]
  4. Gallego, A.J.; Calvo-Zaragoza, J. Staff-line removal with selectional auto-encoders. Expert Syst. Appl. 2017, 89, 138–148. [Google Scholar] [CrossRef]
  5. Pacha, A.; Choi, K.Y.; Coüasnon, B.; Ricquebourg, Y.; Zanibbi, R.; Eidenberger, H. Handwritten Music Object Detection: Open Issues and Baseline Results. In Proceedings of the 2018 13th IAPR International Workshop on Document Analysis Systems (DAS), Vienna, Austria, 24–27 April 2018; pp. 163–168. [Google Scholar] [CrossRef]
  6. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  7. Hajic, J.; Dorfer, M.; Widmer, G.; Pecina, P. Towards Full-Pipeline Handwritten OMR with Musical Symbol Detection by U-Nets. In Proceedings of the International Society for Music Information Retrieval Conference, Paris, France, 23–27 September 2018. [Google Scholar]
  8. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, Munich, Germany, 5–9 October 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  9. Tuggener, L.; Elezi, I.; Schmidhuber, J.; Stadelmann, T. Deep Watershed Detector for Music Object Recognition. arXiv 2018, arXiv:1805.10548. [Google Scholar]
  10. van der Wel, E.; Ullrich, K. Optical Music Recognition with Convolutional Sequence-to-Sequence Models. arXiv 2017, arXiv:1707.04877. [Google Scholar]
  11. Calvo-Zaragoza, J.; Rizo, D. Camera-PrIMuS: Neural End-to-End Optical Music Recognition on Realistic Monophonic Scores. In Proceedings of the International Society for Music Information Retrieval Conference, Paris, France, 23–27 September 2018. [Google Scholar]
  12. Baró, A.; Riba, P.; Calvo-Zaragoza, J.; Fornés, A. From Optical Music Recognition to Handwritten Music Recognition: A baseline. Pattern Recognit. Lett. 2019, 123, 1–8. [Google Scholar] [CrossRef]
  13. Ríos-Vila, A.; Calvo-Zaragoza, J.; Rizo, D. Evaluating Simultaneous Recognition and Encoding for Optical Music Recognition. In Proceedings of the DLfM’20: 7th International Conference on Digital Libraries for Musicology, Montreal, QC, Canada, 16 October 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 10–17. [Google Scholar] [CrossRef]
  14. Castellanos, F.J.; Calvo-Zaragoza, J.; Iñesta, J.M. A Neural Approach for Full-Page Optical Music Recognition of Mensural Documents. In Proceedings of the International Society for Music Information Retrieval Conference, Montreal, QC, Canada, 11–16 October 2020. [Google Scholar]
  15. Edirisooriya, S.; Dong, H.W.; McAuley, J.; Berg-Kirkpatrick, T. An Empirical Evaluation of End-to-End Polyphonic Optical Music Recognition. arXiv 2021, arXiv:2108.01769. [Google Scholar]
  16. Alfaro-Contreras, M.; Valero-Mas, J.J. Exploiting the Two-Dimensional Nature of Agnostic Music Notation for Neural Optical Music Recognition. Appl. Sci. 2021, 11, 3621. [Google Scholar] [CrossRef]
  17. Garrido-Munoz, C.; Rios-Vila, A.; Calvo-Zaragoza, J. A holistic approach for image-to-graph: Application to optical music recognition. Int. J. Doc. Anal. Recognit. 2022, 25, 293–303. [Google Scholar] [CrossRef]
  18. Yuan, S. Research on Low-Quality Music Score Recognition Algorithm with Attention Mechanism and Simplified Recurrent Unit. Master’s Thesis, Tianjin University, Tianjin, China, 2023. [Google Scholar]
  19. Hongyang, S.; Shang, W. End-to-End Optical Music Score Recognition Method Based on Residual Gated Recurrent Convolution and Attention Mechanism. Comput. Mod. 2022, 85–90. [Google Scholar]
  20. Liu, M.; Liu, G.; ge Liu, Y.; Jiao, Q. Oracle Bone Inscriptions Recognition Based on Deep Convolutional Neural Network. J. Image Graph. 2020, 8, 114–119. [Google Scholar] [CrossRef]
  21. Aneja, N.; Aneja, S. Transfer Learning using CNN for Handwritten Devanagari Character Recognition. In Proceedings of the 2019 1st International Conference on Advances in Information Technology (ICAIT), Chikmagalur, India, 24–27 July 2019; pp. 293–296. [Google Scholar] [CrossRef]
  22. Graves, A.; Liwicki, M.; Bunke, H.; Schmidhuber, J.; Fernández, S. Unconstrained On-line Handwriting Recognition with Recurrent Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems; Platt, J., Koller, D., Singer, Y., Roweis, S., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2007; Volume 20. [Google Scholar]
  23. Liwicki, M.; Graves, A.; Bunke, H. Neural Networks for Handwriting Recognition. In Computational Intelligence Paradigms in Advanced Pattern Classification; Springer: Berlin/Heidelberg, Germany, 2012; pp. 5–24. [Google Scholar] [CrossRef]
  24. Altwaijry, N.; Al-Turaiki, I. Arabic handwriting recognition system using convolutional neural network. Neural Comput. Appl. 2021, 33, 2249–2261. [Google Scholar] [CrossRef]
  25. Ni, E.; Jiang, M.; Ding, X.; Zhou, C. Handwriting input system of chinese guqin notation. J. Comput. Cult. Herit. 2011, 3, 9. [Google Scholar] [CrossRef]
  26. Li, W.; Lee, H. The Influence of Guqin Music on Loneliness and Psychological Alienation in Society. J. Educ. Res. Policies 2024, 6, 155–156. [Google Scholar] [CrossRef] [PubMed]
  27. Rongxin, C.; Weibin, C. Design and Implementation of Typesetting Software for Nanyin Gongche Score. Comput. Eng. Des. 2005, 8, 2246–2248. [Google Scholar] [CrossRef]
  28. Ruimin, W. Research on Recognition Method of Pipa Gongche Score Characters. Master’s Thesis, Shanghai University, Shanghai, China, 2011. [Google Scholar]
  29. Genfang, C. Research on Digital Implementation of Chinese Gongche Score. Ph.D. Thesis, Shanghai University, Shanghai, China, 2011. [Google Scholar]
  30. Tan, F.; Zhai, M.; Zhai, C. Foreign object detection in urban rail transit based on deep differentiation segmentation neural network. Heliyon 2024, 10, e37072. [Google Scholar] [CrossRef] [PubMed]
  31. Chen, Z. Introduction to Gongche Score; Huayue Publishing House: Beijing, China, 2004. [Google Scholar]
  32. Zhang, Y.; Zhang, L. LGRC2024. 2025. Available online: https://www.kaggle.com/datasets/liumeizhang/lgrc2024 (accessed on 3 June 2025).
  33. Dan, W. Research on Chinese Opera Gongche Scores from the Qing Dynasty to the Modern Era. Ph.D. Thesis, Zhejiang University, Hangzhou, China, 2016. [Google Scholar]
  34. Yang, L.; Zhang, R.Y.; Li, L.; Xie, X. SimAM: A Simple, Parameter-Free Attention Module for Convolutional Neural Networks. In Proceedings of the 38th International Conference on Machine Learning, PMLR, Online, 18–24 July 2021; Meila, M., Zhang, T., Eds.; Volume 139, pp. 11863–11874. [Google Scholar]
  35. Tan, F.; Tang, Y.; Yi, J. Multi-pose face recognition method based on improved depth residual network. Int. J. Biom. 2024, 16, 514–532. [Google Scholar] [CrossRef]
  36. Hartelt, A.; Eipert, T.; Puppe, F. Optical Medieval Music Recognition—A Complete Pipeline for Historic Chants. Appl. Sci. 2024, 14, 7355. [Google Scholar] [CrossRef]
  37. Misra, D.; Nalamada, T.; Arasanipalai, A.U.; Hou, Q. Rotate to Attend: Convolutional Triplet Attention Module. In Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2021; pp. 3138–3147. [Google Scholar] [CrossRef]
  38. Rahman, M.M.; Munir, M.; Marculescu, R. EMCAD: Efficient Multi-scale Convolutional Attention Decoding for Medical Image Segmentation. arXiv 2024, arXiv:2405.06880. [Google Scholar]
  39. Torras, P.; Biswas, S.; Fornés, A. A unified representation framework for the evaluation of Optical Music Recognition systems. Int. J. Doc. Anal. Recognit. (IJDAR) 2024, 27, 379–393. [Google Scholar] [CrossRef]
  40. Gao, H. Improving piano music signal recognition through enhanced frequency domain analysis. J. Meas. Eng. 2024, 12, 312–323. [Google Scholar] [CrossRef]
Figure 1. Gong-che notation examples. (a) Gong-che notation excerpted from Gong-che Notation of Southern and Northern Ci-Poetry in Jiucheng Dacheng. It presents traditional Chinese character-based musical notation, reflecting the writing style and musical recording characteristics of ancient Chinese scores. The text contains classical Chinese characters with rhythm-related annotations, demonstrating the integration of literature and music in traditional Chinese art. (b) Gong-che notation excerpted from Lilu Qupu, a representative work of traditional Chinese Kunqu opera scores. The notation combines Chinese characters and special musical symbols, showing the singing rhythm and lyrics of Kunqu opera. It reflects the artistic expression form of traditional Chinese opera music and is of great significance for studying the inheritance and characteristics of ancient Chinese vocal music.
Figure 2. Gong-che notation examples. (a) shows Yuzhu Notation from Gong-che Notation of Southern and Northern Ci-Poetry in Jiugong Dacheng, reflecting the musical writing style of ancient Chinese ci-poetry. (b) presents Yizi Notation from Eyun Ge Qupu, demonstrating the integration of Kunqu opera singing rhythms and text. (c) displays Suoyi Notation from Liu Ye Qupu, illustrating the unique symbolic expression of traditional Chinese vocal music. These notations are significant for studying the inheritance, evolution, and cultural connotations of ancient Chinese music.
Figure 3. Labeling interface of LabelImg, showing the annotation process for Gong-che notation (a traditional Chinese musical notation system using character-like symbols to record music). The interface is used to mark Gong-che characters in images (e.g., the handwritten Chinese text and musical symbols in the figure) with labels like "note-C5", "note-D4", etc., preparing datasets for deep learning-based recognition of Gong-che notation.
Figure 4. Different forms of data augmentation. (a) Original image, (b) translation, (c) noise injection + flipping, (d) noise injection + flipping, (e) noise injection, (f) rotation.
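For illustration, the augmentation operations shown in Figure 4 could be implemented with OpenCV and NumPy roughly as follows; the shift, angle, and noise parameters below are examples and not necessarily the exact settings used to build LGRC2024.

```python
# Illustrative augmentations of a score image: translation, rotation,
# horizontal flipping, and Gaussian noise injection (parameter values are examples only).
import cv2
import numpy as np

def translate(img, dx=20, dy=10):
    h, w = img.shape[:2]
    m = np.float32([[1, 0, dx], [0, 1, dy]])
    return cv2.warpAffine(img, m, (w, h), borderValue=(255, 255, 255))

def rotate(img, angle=8):
    h, w = img.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(img, m, (w, h), borderValue=(255, 255, 255))

def add_noise(img, sigma=15):
    noise = np.random.normal(0, sigma, img.shape)
    return np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)

img = cv2.imread("gongche_sample.jpg")  # placeholder path
augmented = [translate(img), rotate(img), add_noise(cv2.flip(img, 1)), add_noise(img)]
```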
Figure 5. YOLOv8 model structure.
Figure 6. Weights of three types of attention. Different colors are used as visual aids to distinguish operational logics across dimensions (channels/spatial locations) and to simplify understanding of attention weight distributions. Letters (X, H, W, C) denote dimensions: X = input feature, H = height, W = width (spatial dimensions), C = channel dimension. (a) One-dimensional channel attention weights (refines only the channel dimension, generating uniform operations per channel). (b) Two-dimensional spatial attention weights (refines only the spatial dimension, applying uniform operations per spatial location). (c) Full 3D attention weights (covers all channels and spatial dimensions for comprehensive feature refinement, as proposed by SimAM to capture richer information). This visualization aligns with the study: existing modules (a,b) limit feature learning by focusing on single dimensions, while 3D weights (c) enable more discriminative feature representation.
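As a concrete reference for the full 3D weighting in (c), a minimal PyTorch re-implementation of SimAM following the energy-based formulation of [34] is sketched below; it is an illustrative sketch rather than the exact code used in the experiments.

```python
# Minimal PyTorch sketch of SimAM (parameter-free 3D attention), following [34].
import torch
import torch.nn as nn

class SimAM(nn.Module):
    def __init__(self, e_lambda: float = 1e-4):
        super().__init__()
        self.e_lambda = e_lambda  # regularization term in the energy function

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); n is the number of neighbours per channel map
        b, c, h, w = x.shape
        n = h * w - 1
        d = (x - x.mean(dim=[2, 3], keepdim=True)).pow(2)   # (t - mu)^2
        v = d.sum(dim=[2, 3], keepdim=True) / n              # per-channel variance
        e_inv = d / (4 * (v + self.e_lambda)) + 0.5          # inverse energy
        return x * torch.sigmoid(e_inv)                      # full 3D weights

# Example: weight a feature map from the YOLOv8 neck (shape is illustrative).
feat = torch.randn(1, 256, 40, 40)
refined = SimAM()(feat)
```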
Figure 7. Network structure model combining SimAM and YOLOv8.
Figure 8. Structure of the Triplet Attention module. Letters in the figure represent dimensions and operations: H = height, W = width (spatial dimensions of feature maps); C = channel dimension; "Z-Pool", "Conv", "Sigmoid" = key operations (pooling, convolution, activation); "Identity" = residual connection (direct feature mapping). This figure illustrates how Triplet Attention refines features across spatial and channel dimensions for improved representation.
Figure 9. Network structure model combining Triplet Attention and YOLOv8.
Figure 10. Network structure details of YOLOv8 improved by introducing the Triplet Attention module. Arrows in the figure represent data/feature flow directions: solid arrows indicate sequential processing of feature maps through operations (e.g., Conv, Split, Concat); dashed arrows (with red borders) highlight the integration of Triplet Attention sub-modules into the main network; circular operators (e.g., ⨀, ⨁) denote element-wise operations (multiplication, addition) for feature fusion. This visualization shows how Triplet Attention refines features across channels and spatial dimensions within the YOLOv8 architecture.
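The branch logic referenced in Figures 8–10 can be summarized in a compact PyTorch sketch of Triplet Attention in the spirit of [37]; this is a simplified illustration, not necessarily identical to the module inserted into YOLOv8m.

```python
# Simplified PyTorch sketch of Triplet Attention [37]: three branches capture
# (C,H), (C,W) and (H,W) interactions via Z-Pool + 7x7 conv, then are averaged.
import torch
import torch.nn as nn

class ZPool(nn.Module):
    def forward(self, x):
        # Concatenate channel-wise max and mean maps: (B, 2, H, W)
        return torch.cat([x.max(dim=1, keepdim=True).values,
                          x.mean(dim=1, keepdim=True)], dim=1)

class AttentionGate(nn.Module):
    def __init__(self):
        super().__init__()
        self.pool = ZPool()
        self.conv = nn.Sequential(nn.Conv2d(2, 1, 7, padding=3, bias=False),
                                  nn.BatchNorm2d(1))

    def forward(self, x):
        return x * torch.sigmoid(self.conv(self.pool(x)))

class TripletAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.cw = AttentionGate()   # channel-width branch
        self.ch = AttentionGate()   # channel-height branch
        self.hw = AttentionGate()   # plain spatial branch (identity rotation)

    def forward(self, x):
        x_cw = self.cw(x.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)  # rotate H<->C
        x_ch = self.ch(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)  # rotate W<->C
        x_hw = self.hw(x)
        return (x_cw + x_ch + x_hw) / 3.0

out = TripletAttention()(torch.randn(1, 256, 40, 40))  # illustrative shape
```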
Figure 11. MSCAM network structure. Arrows in the figure represent feature/data flow and module associations: solid arrows (e.g., between "Channel Max" → "Conv7x7" → "Sigmoid") indicate sequential processing of feature maps through operations (convolution, activation, etc.); colored arrows (red, pink) highlight the integration of sub-modules (SAB, CAB, MSCB) into the overall MSCAM architecture, showing how they connect and interact; circular operators (e.g., ⨀) denote element-wise operations (e.g., feature fusion via multiplication) to refine feature representations. This visualization details the hierarchical design of MSCAM, including channel/spatial attention branches and multi-scale feature processing.
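To relate the blocks named in Figure 11 (CAB, SAB, and MSCB) to code, a simplified PyTorch sketch in the style of MSCAM [38] is given below; it follows the channel-attention, spatial-attention, and multi-scale depthwise-convolution pattern but is not the reference EMCAD implementation.

```python
# Simplified sketch of an MSCAM-style block [38]: channel attention (CAB),
# spatial attention (SAB), then a multi-scale convolution block (MSCB) built
# from point-wise convs (PWC) and parallel multi-scale depthwise convs (MSDC).
import torch
import torch.nn as nn

class CAB(nn.Module):
    def __init__(self, c, r=16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Conv2d(c, c // r, 1), nn.ReLU(inplace=True),
                                 nn.Conv2d(c // r, c, 1))

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        return x * torch.sigmoid(avg + mx)

class SAB(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, 7, padding=3, bias=False)

    def forward(self, x):
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(s))

class MSCB(nn.Module):
    def __init__(self, c, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.pw_in = nn.Sequential(nn.Conv2d(c, c, 1), nn.BatchNorm2d(c), nn.ReLU6(inplace=True))
        self.msdc = nn.ModuleList([nn.Conv2d(c, c, k, padding=k // 2, groups=c, bias=False)
                                   for k in kernel_sizes])
        self.pw_out = nn.Sequential(nn.Conv2d(c, c, 1), nn.BatchNorm2d(c))

    def forward(self, x):
        x = self.pw_in(x)
        x = sum(conv(x) for conv in self.msdc)   # fuse multi-scale depthwise outputs
        return self.pw_out(x)

class MSCAM(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.cab, self.sab, self.mscb = CAB(c), SAB(), MSCB(c)

    def forward(self, x):
        return self.mscb(self.sab(self.cab(x)))

out = MSCAM(256)(torch.randn(1, 256, 40, 40))  # illustrative shape
```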
Figure 12. Network structure of MSCAM combined with YOLOv8.
Figure 13. Comparison diagram of the cls_loss loss function.
Figure 14. Model accuracy comparison.
Figure 15. Metrics for YOLOv8n and YOLOv8m models.
Figure 16. Prediction results of Gong-che notation recognition. (a,b) Prediction results of YOLOv8n. (c,d) Prediction results of YOLOv8m. Different colors in the figure represent various predicted pitch classes (e.g., “note-E4”, “note-D4”, etc.). Each color-coded label corresponds to a specific pitch prediction, with the associated numerical value indicating the prediction confidence score.
Figure 17. Classification loss function (cls_loss).
Figure 18. Model precision comparison results.
Figure 19. Metrics of four improved models.
Table 1. Pitch correspondence in Gong-che notation. Such notation is a traditional Chinese musical notation system, where "Gong-che Character" represents Chinese characters used in this notation to denote pitches. The superscript "1" (e.g., c¹, d¹) follows the international pitch notation convention, indicating the octave range (here the one-lined octave in scientific pitch notation). This table maps Gong-che characters to modern pitch terms, fixed pitches, and solfege for cross-cultural music analysis.
Gong-che Character: 合, 四, 一, 上, 尺, 工, 凡, 六, 五, 乙
Pitch (assuming 上 = C4): G3, A3, B3, C4 (middle C), D4, E4, F4, G4, A4, B4
International Fixed Pitch: g, a, b, c¹, d¹, e¹, f¹, g¹, a¹, b¹
Solfege: sol, la, si, do, re, mi, fa, sol, la, si
Table 2. Correspondence between basic Gong-che characters and their one-octave-higher versions.
Category: Notation Characters
Basic Characters:
One-Octave Higher: 𠆩, 𠆾, 亿
Table 3. Correspondence between basic Gong-che characters and their one-octave-lowered versions.
Category: Notation Characters
Basic Characters:
One-Octave Lowered: 合, 四, 一, 上, 尺, 工, 凡,
Table 4. Category labels for key signatures.
Annotation Target: Label
上字调 (Shàng Key): keySignature-bB
尺字调 (Chǐ Key): keySignature-C
小工调 (Xiǎo Gōng Key): keySignature-D
凡字调 (Fán Key): keySignature-bE
六字调 (Liù Key): keySignature-F
五字调 (Wǔ Key): keySignature-G
乙字调 (Yǐ Key): keySignature-A
Table 5. Category labels for ten notation characters.
Annotation Target: Label
合 (Hé): note-G3
四 (Sì): note-A3
一 (Yī): note-B3
上 (Shàng): note-C4
尺 (Chǐ): note-D4
工 (Gōng): note-E4
凡 (Fán): note-F4
六 (Liù): note-G4
五 (Wǔ): note-A4
乙 (Yǐ): note-B4
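For downstream encoding (for example, exporting detections to MIDI), the label-to-pitch mapping in Table 5 can be expressed directly as a lookup table; the MIDI note numbers below follow the standard convention (C4 = 60) and are an illustrative addition rather than part of the released dataset files.

```python
# Lookup from detector labels (Table 5) to MIDI note numbers (C4 = 60).
LABEL_TO_MIDI = {
    "note-G3": 55, "note-A3": 57, "note-B3": 59,
    "note-C4": 60, "note-D4": 62, "note-E4": 64, "note-F4": 65,
    "note-G4": 67, "note-A4": 69, "note-B4": 71,
}

def labels_to_midi(labels):
    """Map a sequence of predicted class labels to MIDI note numbers."""
    return [LABEL_TO_MIDI[label] for label in labels if label in LABEL_TO_MIDI]

print(labels_to_midi(["note-C4", "note-D4", "note-E4"]))  # [60, 62, 64]
```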
Table 6. Experimental parameters.
Parameter Name: Value
Optimizer: AdamW
Initial Learning Rate: 0.001
Batch Size: 16
Number of Epochs: 150
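A minimal training call matching the settings in Table 6 could look like the following sketch using the Ultralytics API; the dataset configuration file name is a placeholder.

```python
# Training with the hyperparameters listed in Table 6 (AdamW, lr 0.001,
# batch 16, 150 epochs); "lgrc2024.yaml" is a placeholder dataset config.
from ultralytics import YOLO

model = YOLO("yolov8m.pt")
model.train(
    data="lgrc2024.yaml",
    optimizer="AdamW",
    lr0=0.001,
    batch=16,
    epochs=150,
)
```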
Table 7. Comparison of YOLOv8 model performance.
Network: mAP50 (%), Accuracy (%)
YOLOv8n: 62.90, 65.95
YOLOv8m: 74.26, 78.19
Table 8. Model accuracy comparison.
Model: Accuracy (%)
YOLOv8m: 78.2
YOLOv8m + SimAM: 80.4
YOLOv8m + Triplet Attention: 81.8
YOLOv8m + MSCAM: 83.6
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
