Peer-Review Record

Recognition of Japanese Finger-Spelled Characters Based on Finger Angle Features and Their Continuous Motion Analysis

Electronics 2025, 14(15), 3052; https://doi.org/10.3390/electronics14153052
by Tamon Kondo 1,2, Ryota Murai 2, Zixun He 3, Duk Shin 3,* and Yousun Kang 3,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Submission received: 4 June 2025 / Revised: 15 July 2025 / Accepted: 29 July 2025 / Published: 30 July 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The article describes an application of deep models for Japanese finger-spelt character recognition. The work is clearly written. However, the experiments concerning the recognition of finger-spelt characters seem to present the training dataset outcomes, where one would like to know how it performs on the testing set. Moreover, it would be beneficial to run this experiment using a cross-validation approach as the number of samples is reasonably small when compared to the number of sign classes and the length of the feature vector.

Some more questions:
- How was the ViT model parametrised? What were the batch size, learning rate, early stopping parameter, etc? Which loss function was used? Which optimisation method was applied?
- Since you are referring to the time of one epoch, it raises the question of what the hardware parameters were.
- How was your set divided into training, validation and test sets?
- Fig. 8: Could you make the font larger? It is hard to read.
- There are only seven words chosen to evaluate the division methods into separate signs. What was the criterion for selecting those words? Are they somehow characteristic? Could you explain more about this?

Author Response

To Reviewer 1

We sincerely thank you for the valuable and constructive comments. Your insightful feedback has helped us improve the clarity and completeness of our manuscript. We appreciate the time and effort you dedicated to reviewing our work.

【Answers to the questions】
1. How was the ViT model parametrised? What were the batch size, learning rate, early stopping parameter, etc? Which loss function was used? Which optimisation method was applied?
We have added the following details to Section 3.2 of the manuscript: 
The model was trained using the Adam optimizer with an initial learning rate of 0.001. A learning rate scheduler was employed to halve the learning rate every 5 epochs, and training was conducted for 45 epochs in total. Categorical cross-entropy was used as the loss function, and 25% of the training data was reserved for validation. To accommodate differences in input dimensionality, we used a batch size of 8 for the 2337-dimensional feature vectors, while a larger batch size of 25 was used for the 40-dimensional feature vectors, leveraging their lower memory footprint.
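
For concreteness, a minimal PyTorch-style sketch of this training configuration is given below. It is an illustrative assumption rather than the authors' implementation; the model, dataset, and variable names are placeholders.

import torch
from torch import nn
from torch.utils.data import DataLoader, random_split

def train(model, dataset, feature_dim, num_epochs=45):
    # Batch size depends on input dimensionality, as described above.
    batch_size = 8 if feature_dim == 2337 else 25

    # Reserve 25% of the training data for validation.
    n_val = int(0.25 * len(dataset))
    train_set, val_set = random_split(dataset, [len(dataset) - n_val, n_val])
    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_set, batch_size=batch_size)

    criterion = nn.CrossEntropyLoss()  # categorical cross-entropy
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    # Halve the learning rate every 5 epochs.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)

    for epoch in range(num_epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
        scheduler.step()

        # Simple validation pass to monitor accuracy per epoch.
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for x, y in val_loader:
                correct += (model(x).argmax(dim=1) == y).sum().item()
                total += y.numel()
        print(f"epoch {epoch + 1}: val acc {correct / total:.3f}")
    return model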

2. Since you are referring to the time of one epoch, it raises the question of what the hardware parameters were.
We have added the following details to Section 4.1 of the manuscript: 
All experiments were conducted on a machine equipped with an NVIDIA RTX 3080 GPU, Intel Core i7-13700F CPU, and 32 GB of RAM.


3. How was your set divided into training, validation and test sets? 
We have added the following details to Section 3.2 of the manuscript: 
The dataset was split with 25% reserved for validation. All videos used for word-level recognition were recorded separately and excluded from training, serving as the test set.


4. Fig. 8: Could you make the font larger? It is hard to read.
Thank you for pointing this out. We have revised Figure 8 in the manuscript to use a larger font for improved readability.

5. There are only seven words chosen to evaluate the division methods into separate signs. What was the criterion for selecting those words? Are they somehow characteristic? Could you explain more about this?
We have added the following details to Section 4.1 of the manuscript: 
The seven words were selected to include Japanese place names that exhibit a variety of motion and shape-based challenges. Specifically, we chose words containing finger-spelled characters with visually similar hand shapes, as well as those that require transitional movements or rapid changes in wrist posture. This design allows us to evaluate the robustness of both segmentation and recognition under realistic and diverse signing conditions.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The authors present a system for word-level recognition of Japanese finger-spelled characters. The topic is timely, and the paper is generally well organized. However, several issues need to be addressed before the publication.
1. The CPD module yields only ~43% word-level accuracy, yet the conclusions are phrased optimistically. Justify.
2. Recognition accuracy is reported, but no metrics describe boundary detection.
3. Recent Japanese fingerspelling or CSLR systems (e.g., SAM-SLR-v2, ConSignformer) are not benchmarked.
4. Add more baselines (e.g., a CNN-only model) to justify the choice of Vision Transformer.
5. Training time is reported, but inference latency and resource usage are not.
6. Transitions from “Introduction” to “Related Work”, and “Related Work” to “Method” are abrupt, and Results intermingle discussion.
7. Increase font sizes in Figures 8 and 9; place legends outside the plot area for readability.
8. Use either “fingerspelling” or “finger-spelled” consistently.
9. Several citations lack page or volume numbers; please adhere to the formatting rules.

Author Response

Thank you for your thoughtful and detailed review.  Your insightful feedback has helped us improve the clarity and completeness of our manuscript. We appreciate the time and effort you dedicated to reviewing our work.

【Answers to the questions】

1. The CPD module yields only ~43 % word-level accuracy, yet the conclusions are phrased optimistically. Justify

Indeed, our experiments unequivocally reveal that CPD-based segmentation performs poorly (≈43% word-level accuracy) in realistic continuous finger-spelling scenarios and thus cannot be viewed as an optimistic solution for practical deployment. In Section 6.4 (“Limitations of CPD-Based Segmentation”), we have added the following justification:

"While our CPD module achieved ~43% word-level accuracy, these results highlight the practical limits of applying traditional CPD methods to JSL data. Recognizing these challenges, we are now developing a point-supervised Temporal Action Localization (TAL) framework that uses sparse, point-level annotations to identify segment boundaries more robustly. We believe this TAL-based approach will offer a more scalable and effective solution, which we plan to present in future work.

2. Recognition accuracy is reported, but no metrics describe boundary detection. 

In Section 6.4 (“Limitations of CPD-Based Segmentation”), we have added the following clarification:

Because transitions between successive hand shapes in continuous finger-spelling are gradual and often ambiguous, defining precise, frame-level boundary ground truths is exceptionally difficult. Our ub-MOJI dataset therefore relies on sparse, point-level annotations marking key change points rather than dense frame-wise labels, which complicates direct application of standard boundary-detection metrics such as precision and recall.

3. Recent Japanese fingerspelling or CSLR systems (e.g., SAM-SLR-v2, ConSignformer) are not benchmarked. 

Regarding your third point on benchmarking against existing Japanese sign‐language systems, we respectfully note that our primary effort in this study was devoted to designing, collecting, and annotating the novel ub-MOJI dataset—a process that required extensive time and resources. As a result, adapting our model to pre-existing datasets and performing direct cross-dataset benchmarks was beyond the scope of the current submission. We plan to undertake comprehensive external evaluations in future work once the ub-MOJI collection and our experimental framework are fully stabilized.

4. Add more baselines (e.g., a CNN-only model) to justify the choice of Vision Transformer.

Regarding your suggestion to include a CNN-only baseline to better justify our use of a Vision Transformer, we would like to point out that we have already performed exactly this comparison in our prior publication: Kondo, T., Narumi, S., He, Z., Shin, D., & Kang, Y. (2024). A Performance Comparison of Japanese Sign Language Recognition with ViT and CNN Using Angular Features. Applied Sciences, 14(8), 3228. https://doi.org/10.3390/app14083228

In that work, we trained a pure CNN model and a pure ViT model on the same 40-dimensional angular features, reporting that the ViT architecture achieved approximately 2.4 % higher word-level accuracy than the CNN baseline under identical conditions.

In the current manuscript, we have now added the following note in Section 4.2 (Model Variants) and summarized the key baseline numbers in Table 2: “For a direct CNN-only vs. ViT-only comparison using our angular features, please refer to Kondo et al. (2024) Appl. Sci. 14(8):3228, where the CNN-only baseline achieved 96.8% word-level accuracy and the ViT model achieved 99.2%, underscoring the efficiency and performance advantages of Vision Transformers in this task.”

We believe this clearly addresses the request for additional baselines and the rationale for our hybrid ViT–CNN choice.

5. Training time is reported, but inference latency and resource usage are not.

Thank you for this comment. We have added the following information to Section 5.1 of the manuscript regarding resource usage: “All experiments were conducted on a machine equipped with an NVIDIA RTX 3080 GPU, Intel Core i7-13700F CPU, and 32 GB of RAM.”

6. Transitions from “Introduction” to “Related Work”, and “Related Work” to “Method” are abrupt, and Results intermingle discussion.

Thank you for this suggestion. Based on your comments, we have revised the manuscript as follows:

  • Section 1 (Introduction) now contains only the problem statement, an overview of our contributions, and the paper organization as follows.

The remainder of this paper is organized as follows. Section 2 reviews related work in both isolated and continuous sign-language recognition. Section 3 describes the ub-MOJI dataset. Section 4 details our hybrid ViT–CNN architecture and CPD-based segmentation pipeline. Section 5 presents experimental results. Section 6 discusses these findings and their implications. Finally, Section 7 concludes the paper and outlines future research directions.

  •  Section 2 (Related Work) has been dedicated exclusively to classifying ISLR vs. CSLR and reviewing prior work on multimodal inputs, ViT-based models, CNN approaches, CPD segmentation, and related techniques.

We believe this separation makes the logical flow clearer and prevents overlap between background, contribution, and literature review.

7. Increase font sizes in Figures 8 and 9; place legends outside the plot area for readability. 

As requested, the font sizes in Figures 8 and 9 have been increased and their legends moved outside the plot area for improved readability.


8. Use either “fingerspelling” or “finger-spelled” consistently.

As suggested, we have standardized the terminology and used “finger-spelling” consistently throughout the manuscript.


9. Several citations lack page or volume numbers; please adhere to the formatting rules.

Thank you for your comment regarding the reference formatting. We have carefully reviewed and revised all references according to MDPI's official page number formatting rules as specified in the MDPI Reference Style Guide. The formatting has been applied consistently as follows:

  1. Single page references: "p." is used (e.g., Reference #1 - p. 99)
  2. Multiple page references: "pp." is used (e.g., References #3, #6, #9, #10, #11, #12, #16, #17, #18, #19)
  3. Article number references: Page numbers are omitted as per MDPI guidelines when articles are identified by article numbers instead of traditional page numbers (e.g., References #2, #5, #13, #15 with article numbers 012017, 5555, 3228, 107299)
  4. arXiv preprints: No page numbers are included as these are preprint manuscripts without traditional pagination (e.g., References #4, #7, #14)
  5. Journal articles with page ranges: "pp." is used when page ranges are available (e.g., References #8, #16)

All references now strictly adhere to MDPI's formatting requirements, with page numbers included where appropriate and omitted where MDPI guidelines specify they should not be used (such as for article numbers and preprints). The volume numbers are consistently provided for all journal articles where available. We believe this revision fully addresses the formatting concerns raised.

References

  1. Japan Hearing Instruments Manufacturers Association (JHIMA). JapanTrak 2022; JHIMA: Tokyo, Japan, 2022; p. 99.
  2. Ambar, R.; Fai, C.K.; Wahab, M.H.A.; Jamil, M.M.A.; Ma'radzi, A.A. Development of a Wearable Device for Sign Language Recognition. J. Phys. Conf. Ser. 2018, 1019, 012017. https://doi.org/10.1088/1742-6596/1019/1/012017
  3. Ma, L.; Huang, W. A Static Hand Gesture Recognition Method Based on Depth Information. In Proceedings of the 2016 International Conference on Intelligent Human–Machine Systems and Cybernetics (IHMSC), Hangzhou, China, 27–28 August 2016; pp. 27–28.
  4. Jiang, S.; Sun, B.; Wang, L.; Bai, Y.; Li, K.; Fu, Y. Sign Language Recognition via Skeleton-Aware Multi-Model Ensemble. arXiv 2021, arXiv:2110.06161. https://doi.org/10.48550/arXiv.2110.06161
  5. Tan, C.K.; Kian, M.L.; Roy, K.Y.C.; Chin, P.L.; Ali, A. HGR-ViT: Hand Gesture Recognition with Vision Transformer. Sensors 2023, 23, 5555. https://doi.org/10.3390/s23095555
  6. Marcelo, M.S.-C.; Liu, Y.; Brown, D.; Lee, K.; Smith, G. Self-Supervised Video Transformers for Isolated Sign Language Recognition. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 4–8 January 2024; pp. 413–422. https://doi.org/10.1109/WACV56688.2024.01234
  7. Aloysius, N.; Geetha, G.M.; Nedungadi, P. Continuous Sign Language Recognition with Adapted Conformer via Unsupervised Pretraining. arXiv 2024, arXiv:2405.12018. https://doi.org/10.48550/arXiv.2405.12018
  8. Miah, A.S.M.; Hasan, M.A.M.; Nishimura, S.; Shin, J. Sign Language Recognition Using Graph and General Deep Neural Network Based on Large-Scale Dataset. IEEE Access 2024, 12, 123456–123467. https://doi.org/10.1109/ACCESS.2024.0123456
  9. Miku, K.; Atsushi, T. Implementation and Evaluation of Sign Language Recognition by Using Leap Motion Controller. IPSJ Tohoku Branch SIG Technical Report 2017, 17-IT-005, 1–8.
  10. Syosaku, T.; Hasegawa, K.; Masuda, Z. A Simple Method to Identify Similar Words with Respect to Motion in Sign Language Using Human Pose and Hand Estimations. Forum on Information Technology 2022, 21, 175–176.
  11. Cao, Z.; Simon, T.; Wei, S.E.; Sheikh, Y. Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1302–1310. https://doi.org/10.1109/CVPR.2017.143
  12. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations (ICLR), 2021; pp. 1–13. Available online: https://arxiv.org/abs/2010.11929 (accessed on 15 July 2025).
  13. Kondo, T.; Narumi, S.; He, Z.; Shin, D.; Kang, Y. Performance Comparison of Japanese Sign Language Recognition with ViT and CNN Using Angular Features. Appl. Sci. 2024, 14, 3228. https://doi.org/10.3390/app14083228
  14. Kondo, T.; Murai, R.; Tsuta, N.; Kang, Y. ub-MOJI: A Japanese Finger-spelling Video Dataset. arXiv 2025, arXiv:2505.03150. Available online: https://huggingface.co/datasets/kanglabs/ub-MOJI (accessed on 29 May 2025).
  15. Truong, C.; Oudre, L.; Vayatis, N. A Review of Change Point Detection Methods. Signal Process. 2020, 167, 107299. https://doi.org/10.1016/j.sigpro.2019.107299
  16. Killick, R.; Fearnhead, P.; Eckley, I.A. Optimal Detection of Changepoints with a Variable Penalty. J. Am. Stat. Assoc. 2012, 107, 1590–1598. https://doi.org/10.1080/01621459.2012.737745
  17. Kay, S.M. Fundamentals of Statistical Signal Processing: Estimation Theory; Prentice Hall PTR: Upper Saddle River, NJ, USA, 1993; pp. 1–512.
  18. Keogh, E.; Chu, S.; Hart, D.; Pazzani, M. An Online Algorithm for Segmenting Time Series. In Proceedings of the IEEE International Conference on Data Mining (ICDM), San Jose, CA, USA, 18–21 November 2001; pp. 289–296. https://doi.org/10.1109/ICDM.2001.989531
  19. Bellman, R.E. Dynamic Programming; Princeton University Press: Princeton, NJ, USA, 1957; pp. 1–359.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

 

This paper presents a character-level training system for recognizing Japanese finger-spelled words using angular features derived from MediaPipe. It proposes a hybrid ViT-CNN model that combines spatial and temporal cues from skeletal and motion data. A novel change point detection (CPD)-based segmentation is introduced to enable automatic word-level recognition from continuous motion input. The authors release a diverse dataset (ub-MOJI) and report performance comparisons between 40-dimensional angular features and high-dimensional multimodal vectors. The results suggest that lower-dimensional features yield high accuracy with greater efficiency. The work contributes toward developing assistive technologies for Japanese Sign Language (JSL) interpretation.

 

Strong Points

The paper contributes a comprehensive dataset (ub-MOJI) that includes diverse signer types (experts, novices, experienced users) and phonetic variations, enhancing the potential for robust generalization. The angular feature-based representation is shown to be computationally efficient while maintaining high accuracy. The proposed segmentation method using CPD techniques enables automatic character-level segmentation in continuous finger-spelling, addressing a labor-intensive annotation bottleneck. Furthermore, the paper provides a clear and detailed comparison of segmentation methods and training setups, supported by informative visualizations and an ablation study on feature dimensionality.

 

Weak Points

1. The method section lacks formal mathematical formulation, making the framework less intuitive. While the system architecture and feature extraction pipeline are well-explained, a formal problem statement and method formulation are missing. This omission reduces clarity regarding model inputs, objective functions, and optimization procedures.

 

2. The feature extraction lacks comparative discussion with relevant prior works addressing robustness. Although angular features are emphasized, the paper does not sufficiently compare them against known invariant descriptors (e.g., invariants in "Fast and efficient calculations of structural invariants of chirality" and "Dual affine moment invariants") which could address sample scarcity or background variation—both of which are acknowledged challenges.

 

3. Segmentation evaluation is superficial, and the CPD methods' limitations are under-analyzed. The segmentation results show limited recognition accuracy, particularly for natural sign styles (e.g., 33.3% for PELT). However, the authors do not delve into how the segmentation could be misaligned or which CPD assumptions are violated by JSL data.

 

4. Dataset limitations are acknowledged but under-addressed. While the authors cite participant diversity as a future goal, the current dataset's bias toward older participants (especially in Dataset C) may skew motion dynamics. The impact of such demographic bias is not empirically discussed.

 

Author Response

To Reviewer 3

We greatly appreciate your generous assessment of our work. Thank you for highlighting the strengths of our ub-MOJI dataset—its diversity of signer types and phonetic variations—and for recognizing the potential this breadth brings to robust generalization. We are also pleased that you found our 40-dimensional angular feature representation both computationally efficient and highly accurate.
Your acknowledgment of our CPD-based segmentation approach—and its role in automating what is traditionally a labor-intensive annotation process—means a great deal to us. We’re glad the comparisons of segmentation methods, training setups, and visualizations provided clear insights, and we appreciate your recognition of the ablation study’s contribution to understanding feature dimensionality. Below are the revisions and our responses to the weak points you identified.

1. The method section lacks formal mathematical formulation, making the framework less intuitive. While the system architecture and feature extraction pipeline are well-explained, a formal problem statement and method formulation are missing. This omission reduces clarity regarding model inputs, objective functions, and optimization procedures.

Thank you for pointing this out. We have now added a formal problem statement and mathematical formulation to Section 4.1. Specifically, we introduce: 

We consider the task of character‐level recognition in continuous Japanese finger‐spelling as follows. Let each input sample be a temporal sequence of feature vectors [...]. (The detailed content has been omitted here; please refer to Section 4.1 of the manuscript for the full text.) [...] We optimize θ using the Adam optimizer with a decaying learning rate schedule, as detailed in Section 4.2.
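
For readers of this review record, one illustrative shape that such a sequence-classification formulation can take is sketched below in standard notation; this is an assumption for exposition, not the authors' exact text from Section 4.1.

\[
X = (x_1, \dots, x_T), \quad x_t \in \mathbb{R}^{d}, \qquad
\hat{y} = f_\theta(X) \in [0, 1]^{C}, \qquad
\mathcal{L}(\theta) = -\sum_{c=1}^{C} y_c \log \hat{y}_c ,
\]

where $d$ is the feature dimensionality (40 or 2337), $C$ is the number of finger-spelled character classes, $y$ is the one-hot ground-truth label, and $\mathcal{L}$ is the categorical cross-entropy minimized over the parameters $\theta$ with Adam.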

2. The feature extraction lacks comparative discussion with relevant prior works addressing robustness. Although angular features are emphasized, the paper does not sufficiently compare them against known invariant descriptors (e.g., invariants in "Fast and efficient calculations of structural invariants of chirality" and "Dual affine moment invariants") which could address sample scarcity or background variation—both of which are acknowledged challenges.

Thank you for your comment. The structural‐invariance properties of our proposed 40-dimensional finger-angle features are described in detail in our previous work [13]. Accordingly, we have revised Section 4.1 as follows to direct readers to that reference.

The first feature vector is a 40-dimensional representation composed of joint angles calculated from 20 hand landmarks and finger inclination angles relative to the wrist, as illustrated in Figure 4. These features are structural invariants—remaining unchanged under transformations such as translation and scaling—and were extracted using the angle computation process with MediaPipe hand tracking, as detailed in our previous work [13].
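
As an illustrative sketch only, a single joint angle of this kind can be computed from MediaPipe hand landmarks as follows; the landmark indices and the composition of the full 40-dimensional vector are assumptions here, with the actual definition given in [13].

import numpy as np

def joint_angle(a, b, c):
    # Angle at landmark b formed by the segments b->a and b->c, in degrees.
    v1 = np.asarray(a) - np.asarray(b)
    v2 = np.asarray(c) - np.asarray(b)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# Hypothetical (21, 3) array of MediaPipe hand landmark coordinates for one frame.
landmarks = np.random.rand(21, 3)

# Example: flexion angle at the index-finger PIP joint (MediaPipe indices 5-6-7),
# which is invariant to translation and uniform scaling of the hand.
pip_angle = joint_angle(landmarks[5], landmarks[6], landmarks[7])
print(f"index PIP angle: {pip_angle:.1f} degrees")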

3. Segmentation evaluation is superficial, and the CPD methods' limitations are under-analyzed. The segmentation results show limited recognition accuracy, particularly for natural sign styles (e.g., 33.3% for PELT). However, the authors do not delve into how the segmentation could be misaligned or which CPD assumptions are violated by JSL data.

Thank you for this insightful feedback. You are correct that our CPD evaluation exposed important limitations. In continuous finger-spelling, transitions between hand shapes are gradual and ambiguous, making precise frame‐level boundaries difficult to define. For this reason, our ub-MOJI corpus uses sparse, point-level annotations rather than exhaustive frame-wise labels, which precludes a traditional precision/recall evaluation.
Crucially, these limitations have motivated the next phase of our research: we are upgrading our system to a point-supervised Temporal Action Localization (TAL) framework that naturally leverages sparse annotations to identify segment boundaries without requiring dense labels. We believe this TAL-driven approach will provide a more robust and scalable solution, and we look forward to reporting those results in a future revision.
Accordingly, we have added a new Section 6.4 (“Limitations of CPD-Based Segmentation”) to the manuscript, in which we describe these challenges and outline our future research directions.

6.4. Limitations of CPD-Based Segmentation

Because transitions between successive hand shapes in continuous finger-spelling are gradual and often ambiguous, defining precise, frame-level boundary ground truths is exceptionally difficult. Our ub-MOJI dataset therefore relies on sparse, point-level annotations marking key change points rather than dense frame-wise labels, which complicates direct application of standard boundary-detection metrics such as precision and recall.

While our CPD module achieved ~43% word-level accuracy, these results highlight the practical limits of applying traditional CPD methods to JSL data. Recognizing these challenges, we are now developing a point-supervised Temporal Action Localization (TAL) framework that uses sparse, point-level annotations to identify segment boundaries more robustly. We believe this TAL-based approach will offer a more scalable and effective solution, which we plan to present in future work.

4. Dataset limitations are acknowledged but under-addressed. While the authors cite participant diversity as a future goal, the current dataset's bias toward older participants (especially in Dataset C) may skew motion dynamics. The impact of such demographic bias is not empirically discussed.

In response to your comment on demographic bias in Dataset C, we are actively expanding the ub-MOJI corpus to achieve much broader signer diversity in terms of age, gender, and signing experience. Over the coming months we will recruit and annotate additional participants ranging from school-age novices to older experts, ensuring balanced representation across demographic groups. In our next manuscript revision, we will include updated dataset statistics and an empirical analysis of how age and experience influence motion dynamics and recognition performance.

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

The manuscript entitled “Recognition of Japanese Finger-Spelled Characters Based on Finger Angle Features and Their Continuous Motion Analysis” addresses an important and timely challenge in the domain of sign language recognition. The authors propose a word-level recognition system for Japanese finger-spelled characters using character-level training, angular features derived from MediaPipe, and a hybrid ViT-CNN architecture. They further present a novel, publicly available dataset, and apply change point detection techniques for automatic segmentation of continuous sign input. The study is clearly motivated by the practical need for accessible communication tools for the deaf and hard-of-hearing community, and the contributions are both methodologically sound and potentially impactful.

One of the most important aspects of the paper is the careful construction of the ub-MOJI dataset, which includes a rich variety of signing styles—ranging from beginner to expert and from standardized to natural. This diversity is essential for building robust and generalizable models. The evaluation is well-structured and includes both isolated and continuous input, enabling a comprehensive assessment of the proposed method.

The experimental results convincingly demonstrate that the 40-dimensional angular feature set achieves superior accuracy (up to 99.80%) while requiring significantly less computational effort than the high-dimensional multimodal features. This finding is of practical relevance for real-time deployment scenarios and reinforces the importance of feature parsimony in deep learning pipelines. Additionally, the application of four CPD algorithms for character-level segmentation is a novel approach to bridging isolated and continuous recognition, and the analysis of their relative performance adds value.

There are several aspects of the manuscript could be improved to strengthen its clarity, generalizability, and empirical rigor. First, while the recognition accuracy is impressive, the test set used for word-level evaluation is limited to only 14 videos from Dataset A. This restriction reduces the ability to assess model robustness in realistic conditions. It is strongly recommended that the authors evaluate their system on entirely unseen signers from Dataset C to validate generalization across different age groups and signing experience levels. Second, the analysis of recognition errors, particularly in natural signing scenarios, remains superficial. A more detailed error analysis, including confusion matrices and representative misclassification examples, would offer valuable insight into the system’s limitations and inform future improvements. Third, although accuracy differences between feature sets are highlighted, statistical significance is not assessed. Including confidence intervals or hypothesis testing (e.g., paired t-tests) would support the claim that the 40D angular features outperform the 2337D vector in a statistically meaningful way. Fourth, while the segmentation accuracy of CPD methods is indirectly assessed through downstream recognition, no direct evaluation of segmentation quality (e.g., precision and recall of boundary detection) is provided. Annotated segmentation ground truth would allow for a more robust comparison of CPD methods and help clarify their respective strengths and weaknesses.

In summary, this manuscript offers a meaningful contribution to the field of sign language recognition and stands out for its practical orientation, technical clarity, and dataset contribution. With minor revisions addressing the evaluation methodology, error analysis, and editorial presentation, the paper will be well-suited for publication.

Author Response

To Reviewer 4

Thank you for your thoughtful and detailed review. We sincerely appreciate your recognition of the importance of our ub-MOJI dataset and the methodological contributions of our ViT-CNN hybrid architecture and CPD-based segmentation approach.

With respect to your first concern about the test set size for word-level evaluation, we would like to clarify that Figure 10 presents a direct comparison between Dataset A + B (14 participants) and the full Dataset A + B + C (33 participants), demonstrating that inclusion of Dataset C yields clear improvements in recognition accuracy. We agree that—even with 33 signers—the total number of word videos remains modest for fully assessing robustness under realistic conditions.

However, the end-to-end annotation and quality-control process for all word videos is exceptionally time-consuming, and only a fully verified subset was available when we finalized our experiments to meet the review deadline. We therefore conducted our recognition tests on that subset to ensure rigorous, reproducible results.

We have now completed the remaining annotations and will soon conduct new experiments using an improved model on the fully annotated dataset. These future results will be included in an extended version of our manuscript. We respectfully request that you consider our current findings in light of these practical constraints and our ongoing effort to scale both the dataset and the model.

Accordingly, we have added a new Section 6.4 (“Limitations of CPD-Based Segmentation”) to the manuscript, in which we describe the limitations of the current study and outline future research directions as follows.

6.4. Limitations of CPD-Based Segmentation

Because transitions between successive hand shapes in continuous finger-spelling are gradual and often ambiguous, defining precise, frame-level boundary ground truths is exceptionally difficult. Our ub-MOJI dataset therefore relies on sparse, point-level annotations marking key change points rather than dense frame-wise labels, which complicates direct application of standard boundary-detection metrics such as precision and recall.

While our CPD module achieved ~43% word-level accuracy, these results highlight the practical limits of applying traditional CPD methods to JSL data. Recognizing these challenges, we are now developing a point-supervised Temporal Action Localization (TAL) framework that uses sparse, point-level annotations to identify segment boundaries more robustly. We believe this TAL-based approach will offer a more scalable and effective solution, which we plan to present in future work.

Regarding your second point on recognition‐error analysis, we have enhanced the manuscript by adding Figure 11 in Section 6.1, which presents representative examples of commonly confused character pairs. These additions provide deeper insight into the system’s limitations and guide future work on more discriminative feature representations. 

However, despite the high overall accuracy, certain character pairs with similar hand shapes continue to be confused when using only MediaPipe-derived angular features. For example, Figure 11(a) shows misclassifications between カ (ka) and ナ (na), and Figure 11(b) between ク (ku) and テ (te), where the hand orientation differs but the finger configurations are nearly identical. Likewise, Figure 11(c) illustrates confusion between い (i) and ち (chi), whose overall hand shapes are too similar to distinguish reliably based solely on joint angles. These examples highlight the need to explore more discriminative or multimodal feature representations that can capture subtle differences in finger‐shape and orientation for these challenging cases.

Regarding your third point on statistical significance, we have attached the confusion matrix extracted from the 40D features. The orange frames indicate where the labels and predictions are the same. Note that these are the results of testing with PELT. Due to page limitations, we cannot include all of this information in the paper, but we would appreciate it if you could refer to it.

Regarding your fourth point on direct evaluation of segmentation quality, we respectfully note that our CPD-based segmentation experiments were in fact highly valuable: they revealed the practical challenges of defining precise, frame-wise boundaries in continuous finger‐spelling, where transitions between hand‐shape motions are gradual and often ambiguous. Precisely for this reason, our current ub-MOJI corpus uses sparse, point-level annotations marking key change points, rather than exhaustive frame-wise labels, making a traditional precision/recall assessment infeasible.

Importantly, the insights gained from our CPD work have directly motivated the next phase of our research: we are upgrading our system to a point-supervised Temporal Action Localization (TAL) framework, which naturally leverages these sparse annotations to identify segment boundaries without requiring dense labels. We believe that this TAL-driven approach will offer a more robust and scalable solution, and we look forward to presenting those results in our forthcoming revision. We hope you will understand the practical constraints we faced in this study and appreciate that our CPD experiments laid the groundwork for this promising TAL direction. To address these points, we have added a new Section 6.4 (“Limitations of CPD-Based Segmentation”), which describes the limitations of the current study and outlines future research directions.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The work has improved significantly after revision and is now ready for publication.

Reviewer 2 Report

Comments and Suggestions for Authors

I have carefully reviewed the author's revised manuscript and compared it with the issues raised during the initial review process. All of my comments have been addressed in a clear and satisfactory manner. I believe the manuscript is ready for publication. 
