Peer-Review Record

Transferring Natural Language Datasets Between Languages Using Large Language Models for Modern Decision Support and Sci-Tech Analytical Systems

Big Data Cogn. Comput. 2025, 9(5), 116; https://doi.org/10.3390/bdcc9050116
by Dmitrii Popov 1,2,3, Egor Terentev 1,2, Danil Serenko 1,2, Ilya Sochenkov 1,3,4 and Igor Buyanov 1,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 24 March 2025 / Revised: 18 April 2025 / Accepted: 23 April 2025 / Published: 28 April 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper explores how large language models (LLMs) can be used to transfer natural language datasets from one language to another to support modern decision support and scientific analysis systems. This is a practically meaningful work with potential for diverse extensions. Below are some questions and suggestions:

  1. The overall logic and grammar of the introduction are not smooth. The introduction of this work is added at the end of the first paragraph, which makes the transition between the first and second paragraphs less coherent. However, I can see that these two paragraphs are related, so the author should adjust this part or change the placement of the introduction to this work. Additionally, the second paragraph states that neural network training requires a large amount of labeled data but does not delve deeper into this point. This discussion is necessary as it helps introduce the purpose of studying language migration. There is also little description of neural networks—the author should explain why neural networks are the mainstream approach for solving this task before discussing their limitations, such as the need for extensive labeled data. Finally, the description of LLMs is too brief. How do they compare to traditional methods? What improvements do they bring? This should be one of the motivations of this work. The author is advised to optimize the overall logic and grammar.
  2. Add a subsection in Section 3. Materials and Methods to introduce large language models and the specific LLM used in this study.
  3. Consider adding discussions on other open-source LLMs, such as DeepSeek and NLLB-200.
  4. Conduct experiments on multiple datasets for a more comprehensive evaluation.
  5. Have you considered languages beyond English and Russian?

Author Response

Comments 1: The overall logic and grammar of the introduction are not smooth. The introduction of this work is added at the end of the first paragraph, which makes the transition between the first and second paragraphs less coherent. However, I can see that these two paragraphs are related, so the author should adjust this part or change the placement of the introduction to this work. Additionally, the second paragraph states that neural network training requires a large amount of labeled data but does not delve deeper into this point. This discussion is necessary as it helps introduce the purpose of studying language migration. There is also little description of neural networks—the author should explain why neural networks are the mainstream approach for solving this task before discussing their limitations, such as the need for extensive labeled data. Finally, the description of LLMs is too brief. How do they compare to traditional methods? What improvements do they bring? This should be one of the motivations of this work. The author is advised to optimize the overall logic and grammar.

Response 1: Thank you for your thorough feedback on improving the introduction’s coherence and depth. We have carefully restructured the section to address your concerns, as outlined below:

1. Improved Transition and Placement of Work Introduction:

We relocated the overview of our work from the end of the first paragraph to the beginning of the second paragraph and rephrased the concluding sentence of the first paragraph to ensure a smooth transition. This adjustment enhances the logical flow between the two paragraphs. These revisions can be found in Section 1 (Introduction), Pages 1–2.

2. Expanded Discussion on Neural Networks’ Data Requirements:

To emphasize the challenges of neural network training, we added the following clarification in the second paragraph (Page 2, line 45):

“The vast number of parameters in neural networks, especially deep learning models, necessitates extensive training data to prevent overfitting and to generalize well to new, unseen data.”

This directly ties to the motivation for studying language migration and data-efficient methods.

3. Rationale for Neural Networks as the Mainstream Approach:

We explicitly highlighted why neural networks are the dominant methodology for this task earlier in the introduction (Page 2, line 34):

“Their ability to model complex, non-linear relationships in data makes them highly effective across a wide range of applications, including terminology and definition extraction.”

4. Enhanced Description of LLMs and Their Advantages:

We expanded the discussion on LLMs to contrast them with traditional methods, underscoring their role as a key motivation for this work. The revised text (Page 2, lines 50–55) states:

“Unlike traditional NLP methods that rely heavily on manual feature engineering and rule-based systems, LLMs leverage the Transformer architecture to learn patterns and contextual relationships from vast amounts of text data. This capability enables them to generalize across tasks with minimal fine-tuning, addressing the limitations of data scarcity in specialized domains.”

These revisions optimize the introduction’s logic, grammar, and clarity while ensuring all critical points—neural networks’ strengths, limitations, and LLMs’ innovations—are thoroughly addressed. Thank you again for your constructive suggestions.

 

Comments 2: Add a subsection in Section 3. Materials and Methods to introduce large language models and the specific LLM used in this study.

Response 2: Thank you for this suggestion. We agree that a dedicated description of the models and their implementation is essential. Therefore, we have added Subsection 3.3 “Large Language Models” within Section 3 (Materials and Methods), where we provide a list of all models used in this study and specify the API employed to access them. This new subsection appears on page 5, lines 157–175 of the revised manuscript.

 

Comments 3: Consider adding discussions on other open-source LLMs, such as DeepSeek and NLLB-200.

Response 3: Thank you for this suggestion. We agree that including a broader range of models will strengthen our study. Therefore, we have expanded Subsection 3.3 “Large Language Models” to include DeepSeek as an additional open‑source model, and we have further evaluated newer closed‑source models (e.g., GPT‑4.1‑mini, Qwen). These additions and their respective performance summaries are presented in the revised manuscript on page 5, lines 157–167, and in all other tables in the article.

 

Comments 4: Conduct experiments on multiple datasets for a more comprehensive evaluation.

Response 4: Thank you for this suggestion. We agree that evaluating multiple datasets enhances the robustness of our study. Therefore, we have conducted additional experiments on the Wiki portion of the WCL corpus alongside the DEFT dataset, performing translation, annotation transfer, and model fine‑tuning on this new dataset. The overview is detailed in Subsection 3.2 “WCL Corpus” (page 4, lines 146–156), and the comparative performance results are presented in Table 5 (page 11).

 

Comments 5: Have you considered languages beyond English and Russian?

Response 5: Thank you for raising this question. We agree that expanding the scope to include other languages is an important direction for future research. In response, we have explicitly clarified the focus of this study by adding the following sentence on Page 3, line 82:

“We limit this work only to the English-Russian language pair, leaving other languages for future work.”

 

Reviewer 2 Report

Comments and Suggestions for Authors

Summary of the Paper:
This paper presents a method for cross-lingual dataset transfer using LLMs (ChatGPT-3.5-turbo and Llama-3.1-8B). The authors demonstrate an approach for translating and transferring annotations from the English DEFT corpus to Russian, helping address the shortage of annotated resources in under-resourced languages. The inclusion of additional information on packages, APIs, and diagrams (for the results, as suggested below) could further improve the paper's readability and expand key sections.

Strengths:

  1. Innovative Approach: The application of LLMs for cross-lingual annotation transfer effectively addresses an important gap in NLP for under-resourced languages.
  2. Reproducibility: The authors enhance transparency by providing open access to datasets and code, aligning with open science principles.
  3. Comprehensive Evaluation: The thorough evaluation using multiple metrics (BLEU, LaBSE) and model comparisons (BERT, RuBERT, RoBERTa) strengthens the methodological rigor (a brief sketch of these metrics follows this list).
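
For readers unfamiliar with these metrics, here is a minimal sketch of how corpus-level BLEU and LaBSE-based semantic similarity can be computed, assuming the sacrebleu and sentence-transformers packages; the sentences below are invented examples rather than data from the paper.

    from sacrebleu.metrics import BLEU
    from sentence_transformers import SentenceTransformer, util

    hypotheses = ["Нейрон - это нервная клетка."]        # candidate translations
    references = [["Нейрон является нервной клеткой."]]  # one reference stream

    # Corpus-level BLEU between hypotheses and references.
    print(BLEU().corpus_score(hypotheses, references).score)

    # LaBSE cosine similarity between source sentences and their translations.
    labse = SentenceTransformer("sentence-transformers/LaBSE")
    src = labse.encode(["A neuron is a nerve cell."], normalize_embeddings=True)
    tgt = labse.encode(hypotheses, normalize_embeddings=True)
    print(util.cos_sim(src, tgt).item())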

Suggestions for Improvement:

  1. Enhanced Data Presentation: While dataset statistics are provided (Table 1), incorporating additional visualizations (such as histograms from the existing GitHub Jupyter notebook) would improve clarity. These could be added to the appendix with minimal effort as they are already present in the notebooks on the author's repo.
  2. Definition of Terms: Key abbreviations (DEFT, CoNLL, BIO) should be defined upon first use to improve accessibility.
  3. Software Specifications: Including version numbers for Python libraries (e.g., gensim: 4.3.2, sentence-transformers: 2.3.1), APIs (OpenAI ChatGPT API gpt-3.5-turbo), and frameworks (Hugging Face datasets: 2.18.0) would aid reproducibility.
  4. API Documentation: A dedicated section detailing the APIs used (OpenAI API, Google Translate API v3, Llama-3.1-8B via Hugging Face TGI v1.4.0) and their configurations would help readers replicate the study. It would also help to report any computational costs involved.
  5. The authors might consider explicitly reporting metrics like Term Retention, Definition Alignment, and Morphological Match in a section, as these aspects are implicitly evaluated in the current analysis but are scattered. 
  6. Improved Results Visualization:
    • For Table 2, presenting results as percentages (e.g., Exact Match: 92% GPT-3.5 vs. 75% Llama) or using a stacked bar chart (color-coded by match type) would better showcase the findings.
    • For Table 4, adding a confusion matrix (heatmap) would provide clearer insights into model performance, for example regarding RoBERTa's underperformance, which might warrant further discussion.

Author Response

Comments 1: Enhanced Data Presentation: While dataset statistics are provided (Table 1), incorporating additional visualizations (such as histograms from the existing GitHub Jupyter notebook) would improve clarity. These could be added to the appendix with minimal effort as they are already present in the notebooks on the author's repo.

Response 1: Thank you for this suggestion. We agree that a graphical representation will enhance clarity. Therefore, we have replaced the original Table with a histogram depicting the distribution of entity types in the DEFT corpus (see Figure 1 on page 4).

 

Comments 2: Definition of Terms: Key abbreviations (DEFT, CoNLL, BIO) should be defined upon first use to improve accessibility.

Response 2: Thank you for this suggestion. We agree that defining abbreviations improves readability. Therefore, we have added full definitions for all key abbreviations (DEFT, CoNLL, BIO) at their first occurrence in the manuscript (see page 1, line 7; page 4, lines 134 and 142).
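
To make the scheme concrete, here is a generic illustration of BIO (Begin-Inside-Outside) tagging printed in CoNLL-style token-per-line form; the tokens and label names are invented for demonstration and are not taken from the DEFT corpus.

    # B- opens an entity span, I- continues it, O marks tokens outside any span.
    sentence = [
        ("A",      "O"),
        ("neuron", "B-Term"),
        ("is",     "O"),
        ("a",      "B-Definition"),
        ("nerve",  "I-Definition"),
        ("cell",   "I-Definition"),
        (".",      "O"),
    ]
    for token, tag in sentence:
        print(f"{token}\t{tag}")  # one token and its tag per line, CoNLL style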

 

Comments 3: Software Specifications: Including version numbers for Python libraries (e.g., gensim: 4.3.2, sentence-transformers: 2.3.1), APIs (OpenAI ChatGPT API gpt-3.5-turbo), and frameworks (Hugging Face datasets: 2.18.0) would aid reproducibility.

Response 3: Thank you for pointing this out. We agree with this comment. Therefore, we have added Appendix F “Software and Dependencies,” which lists all Python libraries, frameworks, and APIs along with their exact version numbers. This information appears in the revised manuscript on page 15, lines 455–471.
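
One low-effort way to capture such version numbers is sketched below with Python's standard importlib.metadata; the package names are the reviewer's examples, and the authoritative list remains Appendix F of the revised manuscript.

    import importlib.metadata as md

    # Record the installed version of each dependency in requirements style.
    for pkg in ("gensim", "sentence-transformers", "datasets"):
        try:
            print(f"{pkg}=={md.version(pkg)}")
        except md.PackageNotFoundError:
            print(f"{pkg} is not installed")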

 

Comments 4: API Documentation: A dedicated section detailing the APIs used (OpenAI API, Google Translate API v3, Llama-3.1-8B via Hugging Face TGI v1.4.0) and their configurations would help readers replicate the study. It would also help to report any computational costs involved.

Response 4: Thank you for this suggestion. We agree with this comment. Therefore, we have added Subsection 3.3 “Large Language Models” (page 5, lines 157–175), in which we detail each API used: we leveraged the bothub.chat service as a unified proxy for accessing the aforementioned models and the Google Translate API. Additionally, we have created Appendix E “LLM Usage: Time and Cost Statistics Across Tasks” (page 15, lines 454–462), where we report computational runtimes and monetary costs for each model call to facilitate full reproducibility.
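
For illustration, a minimal sketch of calling a chat model through an OpenAI-compatible proxy is shown below; the base_url is a hypothetical placeholder rather than the authors' actual endpoint, and the prompt is invented (see Subsection 3.3 and Appendix E for the real configuration).

    from openai import OpenAI

    client = OpenAI(
        api_key="YOUR_API_KEY",
        base_url="https://example-proxy/v1",  # hypothetical OpenAI-compatible endpoint
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{"role": "user",
                   "content": "Translate into Russian, preserving annotations: ..."}],
    )
    print(response.choices[0].message.content)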

 

Comments 5: The authors might consider explicitly reporting metrics like Term Retention, Definition Alignment, and Morphological Match in a section, as these aspects are implicitly evaluated in the current analysis but are scattered.

Response 5: Thank you for this valuable suggestion. We agree that defining and reporting these metrics explicitly would clarify our evaluation. However, we have not been able to locate any prior work or formal definitions for “Term Retention,” “Definition Alignment,” or “Morphological Match” under those exact names. Could the reviewer kindly provide references or formal definitions for these metrics? Once we have the appropriate citations or definitions, we will compute the requested measures and present them in a new dedicated subsection of the manuscript.

 

Comments 6: Improved Results Visualization:

For Table 2, presenting results as percentages (e.g., Exact Match: 92% GPT-3.5 vs. 75% Llama) or using a stacked bar chart (color-coded by match type) would better showcase the findings.

For Table 4, adding a confusion matrix (heatmap) would provide clearer insights into model performance, for example regarding RoBERTa's underperformance, which might warrant further discussion.

Response 6: Thank you for your constructive suggestions. We agree that visual improvements enhance the interpretability of the results. Accordingly, we have implemented the following changes:

Table 2 (now Table 1):

  • We converted all numerical results into percentages to improve readability and comparison across models.
  • We expanded the table to include additional LLMs, which necessitated restructuring. The updated Table 1 is located at the bottom of Page 8 (under line 267).

Tables 4 and 5:

  • Originally, Table 4 presented results only for the RuDEFT dataset. In response to your feedback, we added Table 5 to include results for the WCL-Wiki-Ru dataset (see Page 11).
  • To address the request for clearer insights into model performance, we incorporated confusion matrices (heatmaps) into Appendix D: Confusion Matrices (located on Page 13, line 453). These matrices provide a detailed visual breakdown of model predictions (a minimal plotting sketch follows this list).
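
For reference, a minimal sketch of producing such a confusion-matrix heatmap with scikit-learn and matplotlib is given below; the labels and predictions are invented placeholders rather than the paper's actual results.

    import matplotlib.pyplot as plt
    from sklearn.metrics import ConfusionMatrixDisplay

    y_true = ["Term", "Definition", "None", "Term", "Definition", "None"]
    y_pred = ["Term", "Definition", "Term", "Term", "None",       "None"]

    # Row-normalized confusion matrix rendered as a heatmap.
    ConfusionMatrixDisplay.from_predictions(
        y_true, y_pred, normalize="true", cmap="Blues"
    )
    plt.show()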

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The authors carefully revised the paper.
