Article

Recognition of Japanese Finger-Spelled Characters Based on Finger Angle Features and Their Continuous Motion Analysis

1
Tokyo Electric Power Company, Tokyo 100-8560, Japan
2
Graduate School of Engineering, Tokyo Polytechnic University, Atsugi 243-0218, Japan
3
Department of Engineering, Tokyo Polytechnic University, Atsugi 243-0218, Japan
*
Authors to whom correspondence should be addressed.
Electronics 2025, 14(15), 3052; https://doi.org/10.3390/electronics14153052
Submission received: 4 June 2025 / Revised: 15 July 2025 / Accepted: 29 July 2025 / Published: 30 July 2025

Abstract

To improve the accuracy of Japanese finger-spelled character recognition using an RGB camera, we focused on feature design and refinement of the recognition method. By leveraging angular features extracted via MediaPipe, we proposed a method that effectively captures subtle motion differences while minimizing the influence of background and surrounding individuals. We constructed a large-scale dataset that includes not only the basic 50 Japanese syllables but also those with diacritical marks, such as voiced sounds (e.g., “ga”, “za”, “da”) and semi-voiced sounds (e.g., “pa”, “pi”, “pu”), to enhance the model’s ability to recognize a wide variety of characters. In addition, the application of a change-point detection algorithm enabled accurate segmentation of sign language motion boundaries, improving word-level recognition performance. These efforts laid the foundation for a highly practical recognition system. However, several challenges remain, including the limited size and diversity of the dataset and the need for further improvements in segmentation accuracy. Future work will focus on enhancing the model’s generalizability by collecting more diverse data from a broader range of participants and incorporating segmentation methods that consider contextual information. Ultimately, the outcomes of this research should contribute to the development of educational support tools and sign language interpretation systems aimed at real-world applications.

1. Introduction

In recent years, the rapid advancement of deep learning technologies has facilitated the widespread adoption of translation systems and chat-based interactive applications. However, these systems predominantly rely on text and voice input, creating significant accessibility barriers for individuals with hearing and speech impairments. According to the JapanTrak survey [1], approximately 10% of the Japanese population identified themselves as deaf or “probably deaf” in 2022, underscoring the urgent need for assistive technologies tailored to the deaf and hard-of-hearing community.
These challenges are even more pronounced in Japanese finger-spelling, which involves a larger set of characters and phonetic variations—such as voiced, semi-voiced, and geminate consonants—expressed through subtle differences in hand shape and wrist angle. As a result, visually similar gestures (e.g., “あ (a),” “さ (sa),” and “た (ta)”) are easily confused, further reinforcing the need for extremely fine-grained motion analysis.
To address these issues, we propose a word-level recognition system for Japanese finger-spelling, built on character-level training. Our system extracts multimodal features, namely hand joint angles and upper-body posture obtained via MediaPipe together with optical flow-based motion vectors, and evaluates their effectiveness against hand-only input. For training, we adopt a hybrid ViT-CNN architecture, combining ViT's global spatial modeling with CNN's local feature extraction. Input video is segmented into character-level temporal units using a change point detection algorithm, and each unit is then classified individually. The recognized characters are concatenated to reconstruct entire words, enabling word-level recognition within the Isolated Sign Language Recognition (ISLR) framework.
Furthermore, to improve model performance, we expanded the training dataset by collecting additional sign language data, including expert-level signing that captures nuanced phonetic features such as voiced and semi-voiced consonants. A portion of this expert-annotated dataset was used for fine-tuning the model to enhance its adaptability to realistic signing variations.
The remainder of this paper is organized as follows. Section 2 reviews related work in both isolated and continuous sign-language recognition. Section 3 describes the ub-MOJI dataset. Section 4 details our hybrid ViT–CNN architecture and CPD-based segmentation pipeline. Section 5 presents experimental results. Section 6 discusses these findings and their implications. Finally, Section 7 concludes the paper and outlines future research directions.

2. Related Work

Sign language recognition has become a key area in human–machine interaction research and is generally categorized into Isolated Sign Language Recognition (ISLR) and Continuous Sign Language Recognition (CSLR) [2,3]. ISLR focuses on discrete signs or syllables and is relatively simpler due to the lack of temporal dependencies, while CSLR handles continuous signing sequences, requiring more complex temporal modeling. In ISLR, recent approaches include the SAM-SLR-v2 framework using multi-modal inputs [4], ViT-based models achieving near-perfect accuracy on ASL datasets [5], and Transformer models with self-supervised learning [6]. For CSLR, ConSignformer—a Conformer-based model adapted from speech recognition—has shown state-of-the-art performance on PHOENIX-2014 datasets [7]. A two-stream graph convolutional network with attention mechanisms also demonstrated strong performance on large-scale CSLR tasks [8].
In parallel with vision-based methods, research has also explored device-based approaches for sign recognition. For instance, Miku et al. [9] utilized the Leap Motion Controller to recognize Japanese Sign Language, achieving recognition rates of 78% for static signs and 75% for dynamic signs. However, their system's accuracy deteriorated in cases of finger overlap. In another study, Syosaku et al. [10] proposed a feature representation method by combining OpenPose and MediaPipe [11] to construct vector embeddings of sign gestures, discovering novel motor synonyms through similarity analysis. In our previous study, we compared Vision Transformer (ViT) [12] and Convolutional Neural Network (CNN) models for sign language recognition. While ViT achieved a high recognition accuracy of 99.4% for static Japanese finger-spelled characters, its performance dropped significantly to around 25% when recognizing word-level time-series sequences that included characters requiring motion [13].
While ISLR and CSLR focus on recognizing isolated or continuous sign expressions at the word or sentence level, comprehensive sign language understanding requires accurately capturing fine-grained hand and wrist movements. In this regard, Dynamic Hand Gesture Recognition (DHGR) plays a vital foundational role, as detailed hand and finger motions are essential components of sign language expression. Nevertheless, DHGR presents several challenges. One major difficulty lies in the nature of video-based hand gesture data, which often lacks visual diversity—such as variations in background, lighting, and hand appearance—providing fewer visual cues for deep learning models to distinguish between gestures. This makes it especially challenging to recognize fine-grained motion patterns, highlighting the need for highly precise and robust recognition systems.

3. The ub-MOJI Database

We used three distinct datasets in this study, each curated to capture unique aspects of Japanese Sign Language (JSL) finger-spelling. Collectively referred to as the ub-MOJI datasets, these resources encompass a wide range of signer backgrounds and phonetic variations to enhance model generalizability. To promote reproducibility and support further research in the field, all ub-MOJI datasets have been released on Hugging Face [14].

3.1. Dataset A: Professional Sign Language Instructors

We recorded 46 basic syllables of Japanese Sign Language in Full HD (1920 × 1080, 60 fps) resolution using two experienced signers. Five syllables with the same consonant sound were recorded consecutively, following the order of the Japanese gojūon (50-sound chart). Each data unit consists of ten videos, with each video containing five syllables, collectively covering all 46 syllables. Each unit was recorded three times by both signers, resulting in three times more data than Datasets B and C.
For the word-level test, the male expert was instructed to perform seven words using natural signs in a fast and concise manner, while the female expert signed the same words in a more standardized style. Among the seven words, five are Japanese place names, while “のりもの (no-ri-mo-no)” means “vehicle” and “かんこく (kan-ko-ku)” means “Korea”.
The videos captured the upper body, including both hands and arms, to preserve natural signing posture and reflect realistic motion patterns. Representative examples of the recorded syllables are shown in Figure 1.

3.2. Dataset B: Beginner Signers (No Prior Experience)

Dataset B was constructed using participants with no prior experience in sign language. Twelve university students (10 males and 2 females), aged 21–23, participated in the recordings. To ensure consistency, all participants were recorded under identical conditions, including a white background, matched camera distance, and lighting settings. During filming, participants observed their gestures on a monitor placed beside the camera, and no detailed instructions on motion precision were given to capture natural, spontaneous signing. The dataset includes recordings of the 46 basic Japanese syllables, following the same procedure as Dataset A, with 10 videos recorded per participant. Examples of beginner signers performing various syllables are presented in Figure 2.

3.3. Dataset C: Experienced Signers with Voiced and Semi-Voiced Sounds

Dataset C was constructed to include not only the 46 basic syllables of Japanese finger-spelling but also voiced, semi-voiced, long, plosive, and palatalized sounds (see Figure 3). Before filming, participants completed a questionnaire covering age, gender, dominant hand, hearing ability, and experience with finger-spelling and sign language.
Filming was conducted in two sessions. The first involved four participants (three women, one man) aged 52–64, two of whom were moderately familiar with sign language, while the other two had limited prior experience. All had normal hearing. Participants signed naturally without specific instructions, in front of a white background. Those unfamiliar with some finger alphabets were shown reference examples on a monitor and asked to imitate them.
The second session involved thirteen participants (twelve women, one man), aged 46–74. All were right-handed and had varied levels of finger-spelling experience: six had 1–9 years, four had 10–20 years, and three had over 21 years. Most had no difficulty hearing, though a few reported hearing challenges. Filming conditions were standardized to match the first session.
To maintain consistency, recordings captured the upper body with a maximum tilt of 15 degrees, and participants signed five characters with the same consonant in succession. All videos were recorded in Full HD (1920 × 1080) at 30 fps and saved in MP4 format.
Dataset C includes standard finger-spelled characters as well as various linguistically and kinematically challenging sequences. These include semantic words (e.g., “いいやま(i-i-ya-ma)” and “かまくら(ka-ma-ku-ra)”), shape-based patterns with similar handshapes (e.g., “つちたあ (tsu-chi-ta-a)” and “みよわゆ (mi-yo-wa-yu)”), and motion-intensive combinations (e.g., “しゅんび (si-yu-n-bi;/ɕɯ̃mbi/)” and “べつもー (be-tsu-mo-o)”), offering rich variation in both shape and movement. A detailed overview of each dataset is provided in Table 1.
In this study, model training and evaluation were initially conducted using 18 units from 16 participants (6 units from Dataset A and 12 from Dataset B). To examine the effects of data size and participant diversity, we also trained and evaluated the model using all datasets (A, B, and C) and compared the results.

4. Methods

4.1. Problem Formulation

We consider the task of character-level recognition in continuous Japanese finger-spelling as follows. Let each input sample be a temporal sequence of feature vectors
$$X = (x_1, x_2, x_3, \ldots, x_T) \in \mathbb{R}^{T \times d},$$
where $T$ is the number of frames (after interpolation and sampling) and $d$ is the feature dimensionality (e.g., 40 or 2337). The goal is to predict the corresponding character label
$$y \in \{1, 2, 3, \ldots, C\},$$
where $C$ is the total number of target classes. We model the mapping from sequence to class probabilities by a parameterized function
$$\hat{p} = f_\theta(X), \qquad \hat{p} \in [0, 1]^C, \qquad \sum_{c=1}^{C} \hat{p}_c = 1,$$
and obtain the predicted label $\hat{y} = \arg\max_c \hat{p}_c$.
The training objective is to minimize the categorical cross-entropy loss over $N$ samples:
$$\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log \hat{p}_{i,c},$$
where $y_{i,c}$ is the one-hot indicator for sample $i$. We optimize $\theta$ using the Adam optimizer with a decaying learning rate schedule, as detailed in Section 4.3.

4.2. Feature Extraction

To develop a robust recognition model for Japanese finger-spelled characters, we investigated various combinations of hand and body motion features. Since it was unclear which specific modalities would most effectively support learning, we experimented with multiple feature types extracted from video sequences. Among them, we selected two distinct input vectors for model training.
The first feature vector is a 40-dimensional representation composed of joint angles calculated from 20 hand landmarks and finger inclination angles relative to the wrist, as illustrated in Figure 4. These features are structural invariants—remaining unchanged under transformations such as translation and scaling—and were extracted using the angle computation process with MediaPipe hand tracking, as detailed in our previous work [13].
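As a concrete illustration of this 40-dimensional representation, the sketch below computes joint angles and wrist-relative inclination angles from a single set of MediaPipe hand landmarks. It is a minimal sketch under stated assumptions, not the authors' implementation: the exact landmark triplets, the reference axis for the inclination angles, and the use of degrees are assumptions.

```python
# Minimal sketch (not the authors' code) of the 40-dimensional angular feature:
# 20 joint angles plus 20 wrist-relative inclination angles computed from the
# 21 MediaPipe hand landmarks (wrist = index 0). The landmark triplets and the
# reference axis used for the inclination angles are assumptions.
import numpy as np

def joint_angle(a, b, c):
    """3D angle in degrees at point b, formed by the vectors b->a and b->c."""
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# Assumed triplets: three per finger plus five spanning the palm (20 in total).
HAND_TRIPLETS = [
    (0, 1, 2), (1, 2, 3), (2, 3, 4),          # thumb
    (0, 5, 6), (5, 6, 7), (6, 7, 8),          # index
    (0, 9, 10), (9, 10, 11), (10, 11, 12),    # middle
    (0, 13, 14), (13, 14, 15), (14, 15, 16),  # ring
    (0, 17, 18), (17, 18, 19), (18, 19, 20),  # little
    (5, 0, 9), (9, 0, 13), (13, 0, 17), (2, 0, 5), (1, 0, 17),  # palm spread
]

def angular_features_40(landmarks):
    """landmarks: (21, 3) array of MediaPipe hand landmarks. Returns a (40,) vector."""
    lm = np.asarray(landmarks, dtype=np.float32)
    joints = [joint_angle(lm[a], lm[b], lm[c]) for a, b, c in HAND_TRIPLETS]
    # Inclination of each non-wrist landmark relative to the wrist, measured
    # against a fixed reference direction (assumed here to be the image x-axis).
    ref = lm[0] + np.array([1.0, 0.0, 0.0], dtype=np.float32)
    incl = [joint_angle(ref, lm[0], lm[i]) for i in range(1, 21)]
    return np.asarray(joints + incl, dtype=np.float32)
```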
The second feature vector is a high-dimensional 2337-length vector that combines three types of information: finger angles, upper-body posture (including shoulder and elbow joints), and dense optical flow features capturing motion dynamics. This multi-modal approach was designed to assess the effectiveness of integrating fine-grained finger movements with broader contextual motion cues for improved recognition performance. This second, high-dimensional feature vector is composed of three distinct feature sets, each capturing different aspects of the signer’s motion. We describe the extraction process for each set below.
The first feature set consists solely of hand angle features. As illustrated in Figure 4a, we extracted 3D coordinates (x, y, z) of 20 hand landmarks (A1–A20) using the MediaPipe Hands module. The z-coordinate represents a relative depth value normalized to the camera plane. For each joint, we selected three consecutive points and computed the 3D angle at the middle point using the cosine similarity between adjacent vectors. This resulted in 20 joint angle values per frame. Additionally, a binary flag indicating whether the detected hand was left or right was appended, yielding a 21-dimensional feature vector per frame.
The second feature set focuses on upper-body posture features. As illustrated in Figure 5a, we utilized the MediaPipe Pose module to extract 3D coordinates (x, y, z) of 14 upper-body landmarks—including both shoulders, elbows, wrists, and the bases of the thumbs, index fingers, and little fingers—from which 12 joint angles were computed. These joint angles, highlighted in red in the figure, represent the relative orientations of key upper-body segments involved in sign language articulation. Similarly to the hand features, we selected three consecutive points for each joint and computed the 3D angle at the middle point using the cosine similarity between adjacent vectors. This yielded 12 posture angle values per frame, forming a 12-dimensional feature vector that effectively captures upper-body motion patterns.
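For reference, a hedged sketch of how such posture angles could be obtained with the MediaPipe Pose module is given below. The 12 landmark triplets are illustrative assumptions, since the paper does not enumerate which angles were used, and the angle helper mirrors the cosine-based computation used for the hand features.

```python
# Hedged sketch of the 12 upper-body posture angles using the MediaPipe Pose
# module. The triplets below (indices follow MediaPipe Pose numbering) are
# illustrative; the paper does not enumerate which 12 angles were used.
import numpy as np
import mediapipe as mp

def _angle(a, b, c):  # same cosine-based angle as in the hand-feature sketch
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

POSE_TRIPLETS = [                       # (a, b, c): angle measured at b (assumed)
    (13, 11, 12), (14, 12, 11),         # shoulders
    (11, 13, 15), (12, 14, 16),         # elbows
    (13, 15, 17), (13, 15, 19), (13, 15, 21),  # left wrist vs. little/index/thumb bases
    (14, 16, 18), (14, 16, 20), (14, 16, 22),  # right wrist vs. little/index/thumb bases
    (15, 19, 21), (16, 20, 22),         # index base vs. wrist and thumb base (assumed)
]

pose_tracker = mp.solutions.pose.Pose(static_image_mode=False)

def posture_angles(frame_rgb):
    """frame_rgb: HxWx3 RGB image. Returns a (12,) posture-angle vector (zeros if no body)."""
    result = pose_tracker.process(frame_rgb)
    if result.pose_landmarks is None:
        return np.zeros(len(POSE_TRIPLETS), dtype=np.float32)
    lm = np.array([[p.x, p.y, p.z] for p in result.pose_landmarks.landmark],
                  dtype=np.float32)
    return np.array([_angle(lm[a], lm[b], lm[c]) for a, b, c in POSE_TRIPLETS],
                    dtype=np.float32)
```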
The third feature set consists of global motion features extracted using optical flow with OpenCV's Farneback method. Full HD video frames were downsampled to 1/30 of their original width and height (64 × 36 pixels) and converted to grayscale to reduce computational cost. Optical flow magnitudes were then computed for each pixel between consecutive frames. The resulting flow maps were normalized and flattened into 2304-dimensional vectors for each frame. Figure 5b presents an example frame from a sign language video with the corresponding optical flow visualization.
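The following sketch shows one way to obtain such a 2304-dimensional flow feature with OpenCV; the Farneback parameters and the min-max normalization are assumptions, as the paper does not specify them.

```python
# Sketch of the dense optical-flow feature: each frame is downscaled to
# 64 x 36 pixels (1/30 of 1920 x 1080 per axis, i.e. 2304 pixels), converted to
# grayscale, and Farneback flow magnitudes to the next frame are flattened.
# The Farneback parameters and min-max normalization are assumptions.
import cv2
import numpy as np

def flow_feature(prev_bgr, curr_bgr, size=(64, 36)):
    """Returns a (2304,) vector of normalized flow magnitudes between two frames."""
    prev = cv2.cvtColor(cv2.resize(prev_bgr, size), cv2.COLOR_BGR2GRAY)
    curr = cv2.cvtColor(cv2.resize(curr_bgr, size), cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev, curr, None, pyr_scale=0.5, levels=3,
                                        winsize=15, iterations=3, poly_n=5,
                                        poly_sigma=1.2, flags=0)
    mag, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    mag = cv2.normalize(mag, None, 0.0, 1.0, cv2.NORM_MINMAX)
    return mag.flatten().astype(np.float32)
```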
The three types of feature sets—derived from the MediaPipe Hands module, the MediaPipe Pose module, and optical flow—were concatenated to form a unified 2337-dimensional feature vector for each frame. By incorporating the x, y, and z components included in these features, the model was able to more effectively capture spatial motion patterns, including depth-related information.
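Putting the sketches above together, the per-frame vector is simply their concatenation (21 + 12 + 2304 = 2337 values); the helper names below refer to the illustrative functions above rather than to the authors' code.

```python
# Assembling the per-frame vector: 21 (hand angles + left/right flag) + 12
# (posture angles) + 2304 (flow magnitudes) = 2337 values. Helper names refer
# to the illustrative sketches above; `handedness_flag` is 0 or 1 from MediaPipe.
import numpy as np

def frame_feature_2337(hand_angles_20, handedness_flag, pose_angles_12, flow_2304):
    vec = np.concatenate([hand_angles_20, [handedness_flag], pose_angles_12, flow_2304])
    assert vec.shape == (2337,), vec.shape
    return vec.astype(np.float32)
```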

4.3. Temporal Modeling and Input Interpolation for ViT

To utilize Vision Transformers (ViT) for finger-spelling recognition, the extracted feature vectors must be transformed into a fixed-size matrix format suitable for model input. In this subsection, we describe the temporal interpolation and sequence sampling process used to achieve this. All extracted feature vectors—both the 40-dimensional and 2337-dimensional representations—were first normalized to a range of 0 to 255, treating the values analogously to grayscale pixel intensities. This normalization allowed us to treat the temporal sequences as 2D image-like arrays and apply OpenCV’s resize function for frame interpolation. Each video was temporally resampled to 100 frames, producing a standardized sequence length across the dataset.
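A minimal sketch of this interpolation step is shown below, treating the (T, d) feature sequence as a grayscale image; the per-sequence min-max scaling to 0-255 is an assumption about how the normalization was applied.

```python
# Sketch of the temporal interpolation: the (T, d) feature sequence is scaled to
# 0-255 like a grayscale image and resampled to 100 rows (frames) with cv2.resize.
# Per-sequence min-max scaling is an assumption about how normalization was done.
import cv2
import numpy as np

def resample_sequence(seq, target_frames=100):
    """seq: (T, d) float array of per-frame features. Returns a (100, d) uint8 array."""
    seq = np.asarray(seq, dtype=np.float32)
    lo, hi = seq.min(), seq.max()
    img = (255.0 * (seq - lo) / (hi - lo + 1e-8)).astype(np.uint8)
    # cv2.resize takes (width, height): keep the feature dimension, change the frame count.
    return cv2.resize(img, (img.shape[1], target_frames), interpolation=cv2.INTER_LINEAR)
```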
To preserve temporal structure while reducing sequence length, we randomly sampled 20 frames from each 100-frame sequence. These frames were selected in order, with frame-to-frame intervals constrained to be less than or equal to 10, ensuring that temporal continuity was maintained. As a result, each input sample consisted of either a 20 × 40 or 20 × 2337 matrix, depending on the type of feature vector used.
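One simple way to realize this constrained sampling is rejection sampling over sorted index sets, as sketched below; the paper does not describe the exact sampling procedure, so this is only an illustration that produces one of the frame combinations discussed next.

```python
# Sketch of the constrained sampling: 20 of the 100 interpolated frames are
# drawn in temporal order with consecutive picks at most 10 frames apart.
# Simple rejection sampling is used here; the paper does not describe the
# exact procedure used to generate its frame combinations.
import random

def sample_frame_indices(num_frames=100, num_samples=20, max_gap=10):
    while True:
        idx = sorted(random.sample(range(num_frames), num_samples))
        if all(b - a <= max_gap for a, b in zip(idx, idx[1:])):
            return idx

# Example (assuming `features` is a (T, d) sequence from the previous sketch):
# model_input = resample_sequence(features)[sample_frame_indices()]   # (20, d)
```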
Using this method, we generated 54 unique frame combinations per video, maintaining the original motion structure while enhancing variability for training. These matrices were then used as input to train the ViT model. Ultimately, this approach enabled the construction of a large-scale training set for the 46 Japanese finger-spelled characters, resulting in 972 samples per class and a total of 44,712 training instances.
The proposed recognition model is based on the ViT architecture, which employs a multi-head self-attention mechanism and fully connected feedforward networks within its encoder modules. We adopted the same ViT configuration as proposed in our previous work [13], tailoring it for frame-level sequence modeling, as illustrated in Figure 6. Each input video was represented as a sequence of feature vectors extracted from individual frames. These vectors were first linearly projected and then supplemented with positional embeddings to retain temporal ordering along the sequence. The resulting embeddings were fed into a stack of four Transformer encoder layers. Each encoder consists of a multi-head self-attention module, a multi-layer perceptron (MLP), and layer normalization components. To mitigate overfitting, dropout layers were inserted throughout the network, and the GELU activation function was applied at all non-linearities.
The model was trained using the Adam optimizer with an initial learning rate of 0.001. A learning rate scheduler was employed to halve the learning rate every 5 epochs, and training was conducted for 45 epochs in total. Categorical cross-entropy was used as the loss function, and 25% of the training data was reserved for validation. To accommodate differences in input dimensionality, we used a batch size of 8 for the 2337-dimensional feature vectors, while a larger batch size of 25 was used for the 40-dimensional feature vectors, leveraging their lower memory footprint. All videos used for word-level recognition were recorded separately and excluded from training, serving as the test set.
Unlike conventional ViT models that divide input images into spatial patches, our approach directly processes frame-level feature matrices as temporal sequences. This design allows the model to more effectively capture the sequential structure inherent in sign language, contributing to improved recognition performance. Finally, the output from the Transformer encoder stack is passed through an MLP head, which produces class predictions corresponding to the target Japanese finger-spelled characters.
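The sketch below outlines a PyTorch model and training setup consistent with this description: linear projection, learned positional embeddings, four Transformer encoder layers with GELU and dropout, an MLP head, and Adam at 0.001 with halving every 5 epochs. The embedding width, number of attention heads, and mean pooling over frames are assumptions rather than the authors' exact configuration.

```python
# PyTorch sketch consistent with the description above: linear projection,
# learned positional embeddings, four Transformer encoder layers with GELU and
# dropout, and an MLP head, trained with Adam (lr = 0.001) halved every 5 epochs.
# Embedding width, head count, and mean pooling over frames are assumptions.
import torch
import torch.nn as nn

class FingerSpellingViT(nn.Module):
    def __init__(self, feat_dim=40, seq_len=20, embed_dim=128, num_heads=4,
                 num_layers=4, num_classes=46, dropout=0.1):
        super().__init__()
        self.proj = nn.Linear(feat_dim, embed_dim)
        self.pos = nn.Parameter(torch.zeros(1, seq_len, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           dim_feedforward=4 * embed_dim,
                                           dropout=dropout, activation="gelu",
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Sequential(nn.LayerNorm(embed_dim),
                                  nn.Linear(embed_dim, num_classes))

    def forward(self, x):                      # x: (batch, 20, feat_dim)
        z = self.encoder(self.proj(x) + self.pos)
        return self.head(z.mean(dim=1))        # mean-pool over frames (assumed)

model = FingerSpellingViT()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)
criterion = nn.CrossEntropyLoss()              # categorical cross-entropy
```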

4.4. Automatic Character-Level Segmentation of Finger-Spelling Videos

To enable character-level recognition, we introduce an automatic segmentation method that divides finger-spelling videos into discrete character units using change point detection (CPD). While Transformer-based models offer strong performance in video understanding, they typically require manually segmented and annotated training data, which is labor-intensive—particularly for finger-spelling videos that involve continuous and highly variable hand motion.
To overcome this limitation, we apply established time-series CPD algorithms to the angular feature data extracted from each video, thereby detecting the temporal boundaries between individual signs without manual intervention. We evaluated four change point detection methods introduced in the comparative study by Truong et al. [15]:
  • PELT (Pruned Exact Linear Time) [16]: Efficiently detects change points through recursive segmentation but may fall into local optima when many change points exist.
  • Binary Segmentation [17]: Recursively splits the signal at the most significant change point and repeats the search within the resulting segments. Simple and fast, but prone to false positives in complex data.
  • Bottom-Up Segmentation [18]: Starts from small segments and iteratively merges them based on similarity. Effective but computationally demanding.
  • Dynamic Programming [19]: Finds a globally optimal change point sequence by reusing subproblem solutions, similar to the Viterbi algorithm. Powerful but resource intensive.
Each method was applied independently to all 40 angular feature dimensions. Specifically, change points were detected separately for each dimension, resulting in individual lists of candidate change points per method. The following hyperparameters were consistently applied across all CPD methods to ensure fair comparison and practical segment quality:
  • Threshold = 27: Minimum score required to retain a change point, accounting for closely occurring changes.
  • Upper limit = 10: Maximum number of change points allowed in the initial detection phase.
  • Range = 10: Window size for aggregating adjacent change points into one.
These values were chosen to balance detection accuracy and computational efficiency.
The detected change point lists were then merged across all 40 feature dimensions to generate a unified list for each CPD method, enabling more comprehensive temporal segmentation. To reduce false positives caused by minor motion fluctuations, temporal filtering was applied by removing change points that occurred within 50 frames of one another. Subsequently, each list was refined to ensure complete coverage of the video sequence: a change point at frame 0 was added to the beginning, and the final frame number of the video was appended to the end. In addition, closely spaced change points were merged to avoid over-segmentation and to produce more balanced segments.
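As an illustration of this stage, the sketch below runs a change point detector per feature dimension and then merges and filters the candidates. It uses the ruptures library, which accompanies the review by Truong et al. [15]; the paper does not state which implementation was used, and the threshold and upper-limit settings from the text are only approximated here by the number-of-breakpoints argument and the minimum-separation filter.

```python
# Illustrative sketch of the CPD stage using the ruptures library, which
# accompanies the review by Truong et al. [15]; the paper does not state which
# implementation or cost model was used. The threshold/upper-limit settings in
# the text are only approximated here by n_bkps and the minimum-separation filter.
import numpy as np
import ruptures as rpt

def detect_boundaries(angles, max_bkps=10, min_separation=50):
    """angles: (T, 40) angular feature sequence. Returns merged boundary frames."""
    candidates = []
    for dim in range(angles.shape[1]):                    # one detector per dimension
        signal = angles[:, dim].reshape(-1, 1)
        algo = rpt.Binseg(model="l2").fit(signal)         # BottomUp/Dynp also take n_bkps;
        bkps = algo.predict(n_bkps=max_bkps)              # Pelt uses a penalty (pen=...) instead
        candidates.extend(bkps[:-1])                      # drop the trailing index T
    merged = []
    for cp in sorted(set(candidates)):                    # merge across dimensions and
        if not merged or cp - merged[-1] >= min_separation:   # suppress near-duplicates
            merged.append(cp)
    return [0] + merged + [len(angles)]                   # ensure full video coverage
```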
As a result, we obtained four segmented versions of each video, corresponding to the four CPD methods: PELT, Binary Segmentation, Bottom-Up, and Dynamic Programming. Based on these segmentations, each finger-spelling video was divided into character-level sequences. The ViT model was then applied to the segmented sequences and produced frame-wise character classification results, enabling robust and accurate word-level interpretation of continuous finger-spelling input.

4.5. Application of the Learned Model to Recognition of Segmented Regions

Character-level recognition was performed by applying the pre-trained ViT model to the segments defined by the change point lists obtained from each detection method. For each pair of consecutive change points, the corresponding frame interval was treated as a candidate for a single character. The angular feature data within each segment served as input to the trained model.
The recognition process was structured as follows. First, the ViT model was applied to the initial 20 frames of a segment to generate a preliminary prediction, which was stored in a character prediction list. The input window was then shifted by one frame—removing the earliest frame and adding the next—after which the model was reapplied. This sliding window operation continued until the end of the segment, producing multiple prediction outputs for each character instance. The final predicted character was determined by computing the statistical mode (i.e., the most frequently predicted label) from all outputs. This method enhances prediction stability while accommodating temporal variation within each segment.
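A compact sketch of this sliding-window voting procedure is given below; the model refers to the illustrative `FingerSpellingViT` defined earlier, and the tensor-conversion details are assumptions.

```python
# Sketch of the sliding-window voting over one detected segment: the trained
# model is applied to every 20-frame window and the most frequent label wins.
# `model` refers to the illustrative FingerSpellingViT above.
from collections import Counter
import torch

@torch.no_grad()
def recognize_segment(model, segment_feats, window=20):
    """segment_feats: (L, 40) angular features of one CPD segment, L >= window."""
    votes = []
    for start in range(segment_feats.shape[0] - window + 1):     # shift by one frame
        clip = torch.as_tensor(segment_feats[start:start + window],
                               dtype=torch.float32).unsqueeze(0)  # (1, 20, 40)
        votes.append(int(model(clip).argmax(dim=1)))
    return Counter(votes).most_common(1)[0][0]                    # statistical mode
```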
To improve word-level recognition accuracy, the model was further fine-tuned using a supplementary dataset containing 76 classes. This extended dataset included finger-spelled representations of voiced, semi-voiced, unvoiced, and long vowel characters from Dataset C. The ViT architecture maintained its four-layer encoder configuration. Incorporating this additional data is expected to enhance the model's ability to recognize the full range of Japanese finger-spelling patterns.

5. Results

5.1. Recognition Results

To evaluate the recognition performance of isolated finger-spelling, we compared the ViT model’s accuracy using two different types of input features. The first was a high-dimensional, 2337-length feature vector that integrates three modalities: hand joint angles extracted from MediaPipe Hands, upper-body posture features from MediaPipe Pose, and dense optical flow calculated between consecutive frames. The second was a compact, 40-dimensional feature vector composed of angular measurements between hand joints and finger inclinations, as described in Section 4.2.
Figure 7 presents the training results for both feature sets. Using the 2337-dimensional input from Datasets A and B, the ViT model achieved a recognition accuracy of 99.56% by epoch 40. Notably, the model surpassed 99% accuracy as early as epoch 9 and continued to improve gradually, reaching 99.50% by epoch 33. Although the model showed no signs of overfitting, each training epoch required approximately 45 min due to the high dimensionality and the computational complexity of processing multimodal data.
In contrast, the 40-dimensional angular feature input demonstrated remarkably efficient and stable learning behavior. The model reached 98% accuracy by epoch 8 and consistently maintained over 99% accuracy from epoch 9 onward, ultimately achieving a peak accuracy of 99.80% at epoch 40. More importantly, the training time per epoch was dramatically reduced to approximately 5 s. These results indicate that the simpler feature representation not only yields slightly better recognition performance but also enables rapid training and lower resource consumption.
Although the multimodal input incorporated richer motion and context information, the performance gains were marginal compared to the lightweight 40-dimensional approach. The orange curve in Figure 7, corresponding to the 2337-dimensional input, consistently remained slightly below the blue curve of the 40-dimensional input across epochs. These findings suggest that while high-dimensional features can enhance recognition, the improvement is not substantial enough to offset the significant increase in training cost. From the standpoint of both computational efficiency and practical deployment, the 40-dimensional feature set offers a more desirable and scalable solution for real-world Japanese finger-spelling recognition. Therefore, we adopted the 40-dimensional feature vector for finger-spelling recognition.
All experiments were conducted on a machine equipped with an NVIDIA RTX 3080 GPU, Intel Core i7-13700F CPU, and 32 GB of RAM.

5.2. Recognition Results for Finger-Spelled Word Videos

We evaluated the recognition accuracy of four different change point detection (CPD) methods: PELT, Binary Segmentation, Bottom-Up, and Dynamic Programming. To ensure evaluation diversity, each of the seven finger-spelled words was recorded twice, resulting in a total of 14 videos. These videos were recorded independently from Dataset A and were explicitly excluded from the training set to ensure unbiased testing. The seven words were selected to include Japanese place names that exhibit a variety of motion and shape-based challenges. Specifically, we chose words containing finger-spelled characters with visually similar hand shapes, as well as those that require transitional movements or rapid changes in wrist posture. This design allows us to evaluate the robustness of both segmentation and recognition under realistic and diverse signing conditions. Figure 8 illustrates a representative example of character-level segmentation and recognition results for the word “かまくら |ka-ma-ku-ra|”, showing how the segmentation boundaries produced by each CPD method differ in precision and stability.
Among the evaluated words, only “たまし |ta-ma-shi|” was entirely and correctly recognized as a complete word by all CPD methods. However, the segmentation accuracy of the CPD methods was generally high—particularly for “かまくら |ka-ma-ku-ra|”—where all methods except PELT produced nearly accurate segmentations, as shown in Figure 8. This demonstrates that CPD algorithms can be quite effective in detecting character boundaries, especially when the signs follow standardized patterns.
Recognition accuracy was calculated as the ratio of correctly recognized syllables to the total number of ground-truth syllables, as illustrated in Figure 9. Among the four CPD methods, PELT achieved the highest overall accuracy, recording 51.9% for standardized signs, 33.3% for natural signs, and 42.6% overall. The Bottom-Up method showed the second-best performance overall, achieving a comparable accuracy to PELT for standardized signs (51.9%).
However, for natural signs, it yielded an accuracy of 25.9%, which was identical to that of the other two methods (Binary and Dynamic), resulting in an overall accuracy of 38.9%. These results reveal the strengths and weaknesses of each CPD approach in handling different signing styles.
Achieving consistent segmentation and recognition proved to be more difficult in continuous sign language videos. These sequences often include transitional or coarticulated gestures between syllables, which blur the boundaries between individual characters and complicate precise segmentation. This issue was especially pronounced in the natural sign language data, where the signing style is less rigid and more fluid.
In particular, the male expert signer employed a natural signing style, characterized by faster and more abbreviated movements. This led to greater frame-level variability within and between syllables, further increasing the challenge of accurate segmentation and recognition. As a result, segmentation errors—such as detecting too many or too few change points—were frequently observed in natural sign language sequences. Among the seven evaluated words, only three—“かまくら |ka-ma-ku-ra|,” “のりもの |no-ri-mo-no|,” and “たまし |ta-ma-shi|”—were segmented with the correct number of characters. The remaining words exhibited mismatches in character boundaries, leading to reduced recognition performance. In particular, the word “かんこく |ka-n-ko-ku|” could not be accurately recognized in either standardized or natural signing videos, underscoring the challenges of modeling real-world variability.
Figure 10 presents a comparison of recognition accuracy for seven Japanese finger-spelled words using two training datasets—Dataset A + B (14 participants) and the full Dataset A + B + C (33 participants). By incorporating a substantial amount of expert-level signing data from Dataset C, the overall recognition accuracy showed notable improvements across most words. In particular, “たまし (ta-ma-shi)” achieved near-perfect accuracy when trained with the complete dataset, while words with more complex motion patterns such as “かまくら (ka-ma-ku-ra)” and “かんこく (ka-n-ko-ku)” also demonstrated clear gains. These results suggest that supplementing the dataset with sign language data from participants of diverse backgrounds enhances the model’s generalization capability, especially for handling more natural and varied signing styles in real-world settings.

6. Discussion

6.1. Impact of Feature Dimensionality on Recognition Accuracy

We compared the recognition accuracy of Japanese sign language using two types of features: 2337-dimensional features and 40-dimensional angular features. The recognition accuracy achieved with the 2337-dimensional features was 99.56%, which did not surpass the 99.80% accuracy obtained using the 40-dimensional angular features. The high-dimensional features appeared to include redundant information and noise—particularly, the Optical Flow-based motion data contained extensive but non-essential information for recognizing finger-spelled characters.
In contrast, the 40D angular feature set offered a concise and direct representation of finger and hand movements, contributing to higher recognition accuracy. Moreover, it was significantly more efficient in terms of computational cost, with drastically reduced processing time. These findings suggest that selecting relevant features is crucial for improving recognition performance, and that increasing feature dimensionality does not necessarily yield better results.
However, despite the high overall accuracy, certain character pairs with similar hand shapes continue to be confused when using only MediaPipe-derived angular features. For example, Figure 11a shows misclassifications between カ (ka) and ナ (na), and Figure 11b between ク (ku) and テ (te), where the hand orientation differs but the finger configurations are nearly identical. Likewise, Figure 11c illustrates confusion between い (i) and ち (chi), whose overall hand shapes are too similar to distinguish reliably based solely on joint angles. These examples highlight the need to explore more discriminative or multimodal feature representations that can capture subtle differences in finger-shape and orientation for these challenging cases.

6.2. Effect of Additional Training Data

Another issue observed in Japanese finger-spelling recognition was a drop in accuracy for specific characters such as “の |no|,” “り |ri|,” and “ん |n|.” In particular, “ん |n|” was often mis-segmented due to abrupt movement changes. Additionally, recognition performance declined for finger shapes that overlapped or appeared similar, indicating that the current model struggles to capture fine-grained differences in complex hand configurations. Performance also degraded in scenes involving natural, everyday signing, implying that the dataset lacked sufficient variability to represent real-world movements.
Retraining the ViT model with an expanded dataset—including characters with voiced and semi-voiced sounds—substantially improved recognition accuracy. For example, the word “たまし |ta-ma-shi|” was recognized with 100% accuracy across two different signing styles. The inclusion of subjects with extensive signing experience helped improve recognition performance in more natural signing scenarios.
Nevertheless, the dataset still has limitations in size and participant diversity. Future work should include data from signers with varied experiences. It is also necessary to improve the precision of motion segmentation through advanced change point detection methods. In addition, collecting a more comprehensive dataset encompassing the full range of Japanese sign language, and incorporating temporal continuity and contextual awareness in segmentation, remain important challenges.

6.3. Practical Benefits of Angular Features

Looking ahead, we aim to develop models capable of adapting to natural variations and real-world conditions, targeting robust recognition performance in practical environments. A key contribution of this study is the demonstration that angular features are effective in capturing subtle distinctions in Japanese sign language. Because angular data obtained through MediaPipe is less affected by background noise or subject differences, and is lightweight, it is well suited for real-world applications. Furthermore, the enhanced dataset, which features voiced and semi-voiced sounds and was built primarily from experienced signers, enabled broader character coverage.

6.4. Limitations of CPD-Based Segmentation

Because transitions between successive hand shapes in continuous finger-spelling are gradual and often ambiguous, defining precise, frame-level boundary ground-truths is exceptionally difficult. Our ub-MOJI dataset therefore relies on sparse, point-level annotations marking key change points rather than dense frame-wise labels, which complicates direct application of standard boundary-detection metrics such as precision and recall.
While our CPD module achieved ~43% word-level accuracy, these results highlight the practical limits of applying traditional CPD methods to JSL data. Recognizing these challenges, we are now developing a point-supervised Temporal Action Localization (TAL) framework that uses sparse, point-level annotations to identify segment boundaries more robustly. We believe this TAL-based approach will offer a more scalable and effective solution, which we plan to present in future work.

7. Conclusions

This study proposed a word-level recognition system for Japanese finger-spelling, leveraging character-level training and a hybrid ViT-CNN architecture. By combining MediaPipe-derived hand joint angles and upper-body posture with optical flow-based motion features, the system effectively captures both local and global motion patterns. Comparative analysis demonstrated that concise 40-dimensional angular features offered higher recognition accuracy and lower computational cost than high-dimensional inputs, underscoring the importance of feature selection.
To improve recognition performance for natural and complex signing styles, we augmented the training set with expert-generated data, including voiced and semi-voiced characters. Experimental results confirmed that fine-tuning with this dataset improved word-level accuracy, even for visually ambiguous or context-sensitive signs.
Moving forward, expanding the dataset to include a broader range of signers and signing styles, as well as improving segmentation precision using more advanced change point detection methods, will be essential. These efforts will enhance model robustness and generalizability, enabling practical deployment in real-world applications such as sign language interpretation systems and educational support tools. This research contributes a foundation for more accurate and adaptable Japanese finger-spelling recognition and encourages continued exploration in this domain.

Author Contributions

Conceptualization, D.S. and Y.K.; Methodology, T.K. and R.M.; Validation, T.K. and R.M.; Formal analysis, T.K. and R.M.; Data curation, T.K. and R.M.; Writing—original draft, T.K. and D.S.; Writing—review and editing, D.S. and Y.K.; Visualization, T.K. and Z.H.; Supervision, Y.K.; Project administration, Y.K.; Funding acquisition, Y.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by JSPS KAKENHI Grant Number JP25K15166 and partially supported by the Co-G.E.I. Challenge 2024 at Tokyo Polytechnic University.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki, and approved by the Ethics Committee of Tokyo Polytechnic University (Protocol code: Ethics 2022-01, Date of approval: 19 May 2022).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The original contributions presented in the study are included in the article, further inquiries can be directed to the corresponding authors.

Acknowledgments

We sincerely thank all participants who supported the creation of the Japanese Sign Language database, which was developed for the purpose of academic research.

Conflicts of Interest

Author Tamon Kondo was employed by Tokyo Electric Power Company. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Japan Hearing Instruments Manufacturers Association (JHIMA). JapanTrak 2022; JHIMA: Tokyo, Japan, 2022; p. 99. [Google Scholar]
  2. Ambar, R.; Fai, C.K.; Wahab, M.H.A.; Jamil, M.M.A.; Ma’radzi, A.A. Development of a Wearable Device for Sign Language Recognition. J. Phys. Conf. Ser. 2018, 1019, 012017. [Google Scholar] [CrossRef]
  3. Ma, L.; Huang, W. A Static Hand Gesture Recognition Method Based on Depth Information. In Proceedings of the 2016 International Conference on Intelligent Human–Machine Systems and Cybernetics (IHMSC), Hangzhou, China, 27–28 August 2016; pp. 27–28. [Google Scholar]
  4. Jiang, S.; Sun, B.; Wang, L.; Bai, Y.; Li, K.; Fu, Y. Sign Language Recognition via Skeleton-Aware Multi-Model Ensemble. arXiv 2021, arXiv:2110.06161. [Google Scholar]
  5. Tan, C.K.; Kian, M.L.; Roy, K.Y.C.; Chin, P.L.; Ali, A. HGR-ViT: Hand Gesture Recognition with Vision Transformer. Sensors 2023, 23, 5555. [Google Scholar] [CrossRef] [PubMed]
  6. Marcelo, M.S.-C.; Liu, Y.; Brown, D.; Lee, K.; Smith, G. Self-Supervised Video Transformers for Isolated Sign Language Recognition. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 4–8 January 2024; pp. 413–422. [Google Scholar]
  7. Aloysius, N.; Geetha, G.M.; Nedungadi, P. Continuous Sign Language Recognition with Adapted Conformer via Unsupervised Pretraining. arXiv 2024, arXiv:2405.12018. [Google Scholar] [CrossRef]
  8. Miah, A.S.M.; Hasan, M.A.M.; Nishimura, S.; Shin, J. Sign Language Recognition Using Graph and General Deep Neural Network Based on Large-Scale Dataset. IEEE Access 2024, 12, 123456–123467. [Google Scholar] [CrossRef]
  9. Miku, K.; Atsushi, T. Implementation and Evaluation of Sign Language Recognition by Using Leap Motion Controller. IPSJ Tohoku Branch SIG Tech. Rep. 2017, 2016, 1–8. [Google Scholar]
  10. Syosaku, T.; Hasegawa, K.; Masuda, Z. A Simple Method to Identify Similar Words with Respect to Motion in Sign Language Using Human Pose and Hand Estimations. Forum Inf. Technol. 2022, 21, 175–176. [Google Scholar]
  11. Cao, Z.; Simon, T.; Wei, S.E.; Sheikh, Y. Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1302–1310. [Google Scholar] [CrossRef]
  12. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations (ICLR), Online, 3–7 May 2021; pp. 1–13. Available online: https://arxiv.org/abs/2010.11929 (accessed on 15 July 2025).
  13. Kondo, T.; Narumi, S.; He, Z.; Shin, D.; Kang, Y. Performance Comparison of Japanese Sign Language Recognition with ViT and CNN Using Angular Features. Appl. Sci. 2024, 14, 3228. [Google Scholar] [CrossRef]
  14. Kondo, T.; Murai, R.; Tsuta, N.; Kang, Y. ub-MOJI: A Japanese Fingerspelling Video Dataset. arXiv 2025, arXiv:2505.03150. Available online: https://huggingface.co/datasets/kanglabs/ub-MOJI (accessed on 29 May 2025).
  15. Truong, C.; Oudre, L.; Vayatis, N. A Review of Change Point Detection Methods. Signal Process. 2020, 167, 107299. [Google Scholar] [CrossRef]
  16. Killick, R.; Fearnhead, P.; Eckley, I.A. Optimal Detection of Changepoints with a Linear Computational Cost. J. Am. Stat. Assoc. 2012, 107, 1590–1598. [Google Scholar] [CrossRef]
  17. Kay, S.M. Fundamentals of Statistical Signal Processing: Estimation Theory; Prentice Hall PTR: Upper Saddle River, NJ, USA, 1993; pp. 1–512. [Google Scholar]
  18. Keogh, E.; Chu, S.; Hart, D.; Pazzani, M. An Online Algorithm for Segmenting Time Series. In Proceedings of the IEEE International Conference on Data Mining (ICDM), San Jose, CA, USA, 18–21 November 2001; pp. 289–296. [Google Scholar] [CrossRef]
  19. Bellman, R.E. Dynamic Programming; Princeton University Press: Princeton, NJ, USA, 1957; pp. 1–359. [Google Scholar]
Figure 1. Sample frames from Dataset A, showing two professional signers performing five basic syllables (“a,” “i,” “u,” “e,” “o”) each.
Figure 2. Sample frames from Dataset B, showing a beginner signer performing ten basic syllables (“a,” “ka,” “sa,” “ta,” “na,” “ha,” “ma,” “ya,” “ra,” “wa”).
Figure 3. Sample frames from Dataset C, showing fingerspelled characters including basic, voiced, and semi-voiced sounds. The red arrows indicate the movement of the hand gestures.
Figure 4. Visualization of 40-dimensional hand features [11]: (a) Joint angles computed from 20 hand landmarks; (b) Finger inclination angles relative to the wrist.
Figure 5. Feature visualization: (a) upper-body posture angles. The numbered black circles (1–12) indicate keypoints defined by MediaPipe Pose for tracking upper-body movements. The red fan shapes illustrate the joint angles calculated between adjacent body segments. (b) optical flow during “り|ri|” movement.
Figure 6. ViT model used for training and validation with 20 frames and features.
Figure 7. Comparison of ViT recognition results. gray line: 40-dimensional input from the previous study [13] (only Dataset A); blue line: 40-dimensional input used in the current study (Dataset A + B); orange line: 2337-dimensional input combining hand, pose, and optical flow features in the current study (Dataset A + B).
Figure 8. Recognition results for four CPD (Change Point Detection) methods using the word “かまくら|ka-ma-ku-ra|” as an example.
Figure 9. Comparison of recognition accuracy across four CPD methods; (a) standardized vs. natural sign language styles; (b) accuracy by each word.
Figure 10. Comparison of recognition accuracy for each word using two training sets: Dataset A + B (14 participants) and the complete dataset A + B + C (33 participants).
Figure 11. Representative misclassification examples using 40-dimensional angular features: (a) カ (ka) vs. ナ (na), (b) ク (ku) vs. テ (te), (c) い (i) vs. ち (chi). Similar finger configurations lead to erroneous predictions.
Table 1. Overview of the ub-MOJI datasets.
| Dataset | Participants | Experience | Age Range | Handedness | Resolution | Content | Purpose |
|---|---|---|---|---|---|---|---|
| A | 2 (1M, 1F) | Instructor-level | 50s–60s (est.) | 1R/1L | 1920 × 1080, 60 fps | 46 syllables | Baseline model training |
| B | 12 (10M, 2F) | None | 21–23 | All right-handed | 1920 × 1080, 60 fps | 46 syllables | Variability testing |
| C (Session 1) | 4 (3F, 1M) | 2 moderate, 2 minimal | 52–64 | 3R/1L | 1920 × 1080, 30 fps | 76 signs incl. voiced, etc. | Fine-tuning with semi-experienced users |
| C (Session 2) | 13 (12F, 1M) | 1–21+ years | 46–74 | All right-handed | 1920 × 1080, 30 fps | Same as C (Session 1) | Expert evaluation |

Share and Cite

MDPI and ACS Style

Kondo, T.; Murai, R.; He, Z.; Shin, D.; Kang, Y. Recognition of Japanese Finger-Spelled Characters Based on Finger Angle Features and Their Continuous Motion Analysis. Electronics 2025, 14, 3052. https://doi.org/10.3390/electronics14153052

AMA Style

Kondo T, Murai R, He Z, Shin D, Kang Y. Recognition of Japanese Finger-Spelled Characters Based on Finger Angle Features and Their Continuous Motion Analysis. Electronics. 2025; 14(15):3052. https://doi.org/10.3390/electronics14153052

Chicago/Turabian Style

Kondo, Tamon, Ryota Murai, Zixun He, Duk Shin, and Yousun Kang. 2025. "Recognition of Japanese Finger-Spelled Characters Based on Finger Angle Features and Their Continuous Motion Analysis" Electronics 14, no. 15: 3052. https://doi.org/10.3390/electronics14153052

APA Style

Kondo, T., Murai, R., He, Z., Shin, D., & Kang, Y. (2025). Recognition of Japanese Finger-Spelled Characters Based on Finger Angle Features and Their Continuous Motion Analysis. Electronics, 14(15), 3052. https://doi.org/10.3390/electronics14153052
