Review Reports - An Empirical Study on Enhancing Large Language Models for Long-Term Conversations in Korean

Round 1

Reviewer 1 Report (Previous Reviewer 1)

Comments and Suggestions for Authors

This is a revised manuscript. In this study, the authors conducted an empirical analysis of LLMs for long-term conversations in Korean. The revised version is somewhat more polished than the original manuscript. Setting aside the consideration that the rapid advancement of LLMs might quickly diminish the paper's value, the manuscript features a well-structured narrative and employs its methodology appropriately. Therefore, I recommend accepting this manuscript for publication.

Author Response

Comments 1: This is a revised manuscript. In this study, the authors conducted an empirical analysis of LLMs for long-term conversations in Korean. The revised version is somewhat more polished than the original manuscript. Setting aside the consideration that the rapid advancement of LLMs might quickly diminish the paper's value, the manuscript features a well-structured narrative and employs its methodology appropriately. Therefore, I recommend accepting this manuscript for publication.

Response 1: We sincerely thank the reviewer for the positive evaluation and recommendation for acceptance. We greatly appreciate the recognition of our structured narrative and methodology. We are grateful for the constructive feedback throughout the review process.

Reviewer 2 Report (Previous Reviewer 2)

Comments and Suggestions for Authors

The correction of the PPL evaluation bug, the architectural realignment of the equations to accurately reflect SwiGLU mechanisms, and the crucial addition of double-blind human evaluation have fundamentally resolved my previous concerns. The methodology is now scientifically sound, and the paper provides a highly valuable contribution to the field of multilingual MSC.

I recommend acceptance with minor revisions, provided that the authors address the following minor typographical and consistency issues in the final version:

Typo in Equation 9: There is a parenthesis mismatch in the newly revised Equation 9. Specifically, in the term h_{i-1}(c)), there is an extra, unmatched right parenthesis immediately following (c). The formula currently has two opening parentheses but three closing parentheses.

Text Inconsistency: In Section 4 (around line 381 of the revised text), the summary sentence still states "we train these tasks using LoRA, DPO, MoE, and Neuron tuning". Since Sections 4.5 (Continual Pre-training) and 4.6 (Specific Layer Tuning) were newly added to the manuscript during the revision, please update this sentence to encompass all evaluated baselines to maintain structural consistency.

Author Response

We sincerely thank the reviewer for the careful reading and constructive feedback. We have addressed both issues as follows:

Comments 1: Typo in Equation 9: There is a parenthesis mismatch in the newly revised Equation 9. Specifically, in the term h_{i-1}(c)), there is an extra, unmatched right parenthesis immediately following (c). The formula currently has two opening parentheses but three closing parentheses.

Response 1: We have corrected the parenthesis mismatch in Equation 9. The extra closing parenthesis following h_{i-1}(c) has been removed.

Comments 2: Text Inconsistency: In Section 4 (around line 381 of the revised text), the summary sentence still states "we train these tasks using LoRA, DPO, MoE, and Neuron tuning". Since Sections 4.5 (Continual Pre-training) and 4.6 (Specific Layer Tuning) were newly added to the manuscript during the revision, please update this sentence to encompass all evaluated baselines to maintain structural consistency.

Response 2: We have updated the summary sentence in Section 4 (line 381) to include all evaluated methods. The revised sentence now reads: "It is important to note that we train these tasks using LoRA, DPO, MoE, CPT, Layer Tuning, and Neuron tuning, respectively. We then evaluate which method is most effective for enhancing Korean MSC capabilities." We appreciate the reviewer's attention to detail, which has helped improve the clarity and consistency of our manuscript.

Reviewer 3 Report (Previous Reviewer 3)

Comments and Suggestions for Authors

The authors have carefully revised their manuscript according to my comments and suggestions. However, the following issues remain:

1. The formatting of the article is confusing, which affects readability. There are also some grammatical problems.

2. The authors do not clearly explain the motivation behind this manuscript. The existing problems, and why they are crucial, should be explained in more detail. I suggest the authors further strengthen the relevant parts of the Introduction.

3. More recent research should be considered for related work, e.g., “SCAFNet: A Semantic Compensated Adaptive Fusion Network for Remote Sensing Images Change Detection”.

Author Response

We sincerely thank the reviewer for the valuable feedback. We have addressed each comment as follows:

Comment 1: The formatting of the article is confusing, which affects readability. There are also some grammatical problems.

Response 1: We appreciate the reviewer's concern regarding formatting and grammatical issues. We have carefully proofread the entire manuscript and corrected grammatical errors throughout. Prior to final publication, we will conduct a thorough and meticulous review of grammar, formatting, and overall readability to ensure the highest quality of presentation.

Comment 2: The authors do not clearly explain the motivation behind this manuscript. The existing problems, and why they are crucial, should be explained in more detail. I suggest the authors further strengthen the relevant parts of the Introduction.

Response 2: We thank the reviewer for this suggestion. We have strengthened the Introduction to more clearly articulate the motivation and existing problems. Specifically, we have revised the relevant paragraph as follows: "Despite the significant capability gap between English and Korean, efforts to address this disparity in long-term conversations remain largely unexplored. To investigate whether recent fine-tuning approaches can enhance Korean long-term conversation capabilities, we conduct a comprehensive set of experiments. Specifically, we apply low-rank adaptation (LoRA), direct preference optimization (DPO), mixture-of-experts (MoE), continual pre-training (CPT), and Layer Tuning techniques for instruction-tuning on Korean datasets for session summarization, memory update, and MSC. We evaluate the effectiveness of each method and analyze their impact on Korean long-term conversational performance. Our results reveal that while these fine-tuning approaches yield improvements on individual tasks, they suffer from catastrophic forgetting in continual learning settings where all three tasks are learned sequentially. To address this limitation, we draw inspiration from recent research on language-specific neurons and identify Korean-specific neurons in various LLMs. We demonstrate that selectively tuning these neurons not only enhances the models' long-term conversational capabilities in Korean but also exhibits robust performance in continual learning settings." This revision clearly explains: (1) the existing problem (capability gap and lack of exploration), (2) our investigative approach, (3) the limitation we discovered (catastrophic forgetting), and (4) our proposed solution (neuron tuning).

Comment 3: More recent research should be considered for related work, e.g., “SCAFNet: A Semantic Compensated Adaptive Fusion Network for Remote Sensing Images Change Detection”.

Response 3: We thank the reviewer for suggesting additional references. We have carefully examined the recommended paper "SCAFNet: A Semantic Compensated Adaptive Fusion Network for Remote Sensing Images Change Detection" (Zhang et al., IEEE GRSL 2026). After thorough review, we find that this paper addresses a fundamentally different research problem in a completely different domain:

SCAFNet focuses on computer vision tasks, specifically detecting pixel-level changes in remote sensing imagery (e.g., building construction, seasonal variations) using CNN-Transformer hybrid architectures.
Our work addresses natural language processing challenges in multi-session dialogue systems, focusing on memory management and language-specific neuron tuning for large language models.

While both papers use terms like "change" and "semantic," these refer to entirely different concepts: SCAFNet's "change detection" concerns visual pixel differences in satellite images, whereas our "memory update" concerns semantic changes in user information across dialogue sessions. We could not identify any methodological, conceptual, or application-level overlap that would warrant citation. We respectfully suggest that this recommendation may have been made in error, as the two works belong to entirely separate research communities (remote sensing vs. NLP/dialogue systems).

In the revised manuscript, we have expanded the Related Work section to include recent advances in memory-augmented dialogue systems (Section 2.2) and language-specific neurons (Section 2.3), which are directly relevant to our contributions.

We appreciate the reviewer's time and constructive feedback, which has helped improve the quality of our manuscript.

This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

1. The study utilizes the GPT-5 API for evaluation. However, the GPT-5 ecosystem already comprises several iterations, and significant performance discrepancies exist between the GPT-5 API and versions like GPT-5.2-Codex. Furthermore, as GPT continues to undergo rapid versioning and updates, the findings of this paper, while potentially valuable at present, face a high risk of rapid depreciation in relevance. I generally do not recommend the publication of research whose academic utility is likely to be so short-lived due to its heavy dependence on specific, fleeting model versions.
2. One of the core contributions presented is an improvement over the LAPE method. However, this enhancement primarily involves applying L1 normalization to activation scores and jointly considering activation scores and entropy. Such modifications are considered standard optimizations within the field of model interpretability and lack the substantial theoretical or structural breakthrough required to represent a significant advancement over existing literature.
3. The newly introduced KEEM dataset relies heavily on ChatGPT-4.0 for generating summaries and extracting emotions and their underlying causes. While the authors cite existing studies to justify the reliability of ChatGPT, this methodology risks inheriting the inherent biases and stylistic idiosyncrasies of the GPT-4 model. Moreover, the study lacks a large-scale human evaluation component, which is essential to guarantee the precision of Korean semantics and the integration of cultural nuances.
4. In the comparative experiments, the authors contrast Neuron Tuning with LoRA, DPO, and MoE. However, a significant discrepancy exists in the experimental setup: LoRA and MoE are configured to keep the original model weights frozen, whereas Neuron Tuning involves direct updates to the weight parameters. This design creates an inherent bias against parameter-efficient adapter methods, making the reported performance gains for Neuron Tuning less methodologically sound.

Author Response

Response to Reviewer 1

Comment 1.1: GPT-5 API Dependency and Risk of Rapid Depreciation

"The study utilizes the GPT-5 API for evaluation. However, the GPT-5 ecosystem already comprises several iterations... I generally do not recommend the publication of research whose academic utility is likely to be so short-lived due to its heavy dependence on specific, fleeting model versions."

Response:

We appreciate the reviewer's concern regarding the reproducibility and longevity of our evaluation methodology. To address this concern, we have made the following substantial revisions:

Human Evaluation Added: We have conducted a comprehensive human evaluation for both the memory update (Table 8) and response generation (Table 10) tasks. Five native Korean speakers evaluated the outputs, and we report inter-annotator agreement using Krippendorff's Alpha (α = 0.82–0.87) and Fleiss' Kappa (κ = 0.90–0.91), both indicating substantial to almost perfect agreement (Sections 5.3, 5.6, and 5.7).
Correlation Analysis: We explicitly report the correlation between automatic (GPT-5) and human evaluation scores. The strong agreement between automatic and human evaluations validates the reliability of our automatic metrics, ensuring that our findings remain meaningful regardless of future model versions.
Naturalness Metric Exclusion: We demonstrate that GPT-5 is unreliable for assessing Korean naturalness (ρ = 0.58, substantially lower than other metrics), and therefore exclude this metric from automatic evaluation. This critical analysis shows our awareness of LLM-based evaluation limitations (Section 5.7).
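
As an illustration of the correlation analysis described above, the following is a minimal sketch of a rank correlation between automatic and human scores. It assumes that ρ denotes Spearman's rank correlation, which the response does not state explicitly; the score lists and function names are hypothetical and are not taken from the manuscript.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical per-sample scores (not data from the study).
auto_scores = [3.0, 4.5, 2.0, 5.0, 3.5]
human_scores = [2.5, 4.0, 2.0, 4.5, 4.2]
rho = spearman_rho(auto_scores, human_scores)
```

A value of ρ near 1 means the automatic judge orders the samples almost the same way as the human raters; a markedly lower value for one metric is the kind of signal cited above as grounds for excluding it.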

We believe that with the addition of rigorous human evaluation and correlation analysis, our findings are grounded in human judgment and will remain relevant beyond specific model versions.

Comment 1.2: LAPE Improvement Lacks Substantial Theoretical Breakthrough

"This enhancement primarily involves applying L1 normalization to activation scores and jointly considering activation scores and entropy. Such modifications are considered standard optimizations..."

Response:

We respectfully clarify that our contribution extends beyond standard optimization techniques. Our key theoretical insights are:

Magnitude Blindness Problem in LAPE: We identify a fundamental limitation of LAPE—it converts activation scores to binary probabilities (>0 → 1, else → 0), thereby losing magnitude information. A neuron with activation score 0.01 is treated identically to one with score 5.0. We formally describe this limitation in Section 4.4 (lines 294–306).
Sparse Activation Trap: LAPE can yield low entropy for neurons that are weakly activated in the target language while remaining inactive in others. Such neurons have limited influence despite being classified as "language-specific."
Novel Identification Criteria: Our method requires neurons to satisfy two conditions jointly: (i) top 1% activation scores AND (ii) bottom 25% entropy. This dual criterion ensures identified neurons are both highly active for Korean AND language-specific (Equation 15).
Empirical Validation: Figure 4 (Section 6.2) demonstrates that neurons identified by LAPE lead to decreased performance when fine-tuned, whereas our method yields consistent improvements across all models. This empirical evidence validates our theoretical motivation.

We have expanded Section 4.4 to more clearly articulate these theoretical distinctions.
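
The dual criterion described above (top 1% activation scores and bottom 25% entropy) can be sketched roughly as follows. This is an illustrative reading of the response, not the authors' implementation: the per-language activation layout, the way L1 normalization feeds the entropy, and all function names are assumptions.

```python
import math


def language_entropy(activation_by_lang):
    """Entropy of the L1-normalized activation scores across languages."""
    total = sum(activation_by_lang)
    if total == 0:
        # Assumption: a never-active neuron is treated as maximally non-specific.
        return math.log(len(activation_by_lang))
    probs = [a / total for a in activation_by_lang]
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_neurons(scores, target_idx, top_act=0.01, low_ent=0.25):
    """scores[n] holds one non-negative activation score per language for neuron n.

    A neuron is kept only if it is in the top `top_act` fraction by target-language
    activation AND in the bottom `low_ent` fraction by cross-language entropy.
    """
    n = len(scores)
    target = [s[target_idx] for s in scores]
    entropy = [language_entropy(s) for s in scores]
    k_act = max(1, int(n * top_act))
    k_ent = max(1, int(n * low_ent))
    by_act = set(sorted(range(n), key=lambda i: target[i], reverse=True)[:k_act])
    by_ent = set(sorted(range(n), key=lambda i: entropy[i])[:k_ent])
    return sorted(by_act & by_ent)


# Hypothetical scores for 200 neurons over (English, Chinese, Korean).
scores = [[1.0, 0.2, 0.3] for _ in range(200)]
scores[0] = [0.1, 0.1, 9.0]   # strongly active and Korean-specific
scores[1] = [5.0, 5.0, 5.0]   # strongly active but language-agnostic
scores[2] = [0.0, 0.0, 0.01]  # Korean-only but barely active
selected = select_neurons(scores, target_idx=2)
```

In this toy setup only neuron 0 survives: neuron 1 fails the entropy condition, and neuron 2, which an entropy-only rule would rank first, fails the activation condition.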

Comment 1.3: KEEM Dataset Relies on GPT-4.0 and Lacks Human Evaluation

"The newly introduced KEEM dataset relies heavily on ChatGPT-4.0... this methodology risks inheriting the inherent biases... the study lacks a large-scale human evaluation component."

Response:

We have addressed this concern through the following revisions:

Clarification on GPT-4.0 Usage: We clarify in Section 3.1 (lines 171–175) that "While GPT-4.0 was used for dataset construction, it was the state-of-the-art model at the time. More importantly, the quality of the KEEM dataset is independent of the generation model, as it has been rigorously validated through comprehensive human evaluation in our previous work."
Human Validation: The KEEM dataset quality was validated through human evaluation in our prior publication, which underwent peer review at COLING 2025.
Additional Human Evaluation in This Study: We have added human evaluation for both memory update and response generation tasks (Tables 8 and 10), with 5 native Korean speakers, demonstrating that our methods achieve consistent improvements under human judgment.

Comment 1.4: Unfair Comparison Between Methods (LoRA/MoE Frozen vs. Neuron Tuning Updates)

"LoRA and MoE are configured to keep the original model weights frozen, whereas Neuron Tuning involves direct updates to the weight parameters. This design creates an inherent bias..."

Response:

We acknowledge this important methodological concern and have added two additional baselines that directly update model weights:

Continual Pre-training (CPT): Following [Kim et al., 2025], we perform continual pre-training on all model parameters using the KEEM dataset (Section 4.5).
Specific Layer Tuning: Inspired by knowledge editing techniques [Wang et al., 2024], we identify and fine-tune the layer with the highest concentration of Korean-specific neurons (Section 4.6).

The revised comparison now includes:

Method	Weight Update	Parameters Modified
LoRA	Frozen + Adapter	Low-rank matrices only
MoE	Frozen + Experts	Expert modules only
CPT	Full Update	All parameters
Layer Tuning	Partial Update	Selected layer(s)
Neuron Tuning	Partial Update	Selected neurons

As shown in Tables 6–10, Neuron Tuning consistently outperforms CPT and Layer Tuning despite modifying fewer parameters. This demonstrates that the performance gains stem from our targeted neuron selection rather than simply from updating original weights.
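
To make the "Partial Update / Selected neurons" row concrete, here is a minimal sketch of a gradient step that touches only chosen rows of a weight matrix. It assumes that a neuron corresponds to one row of the matrix and uses plain SGD; both are simplifications for illustration, and the function name is hypothetical.

```python
def masked_sgd_step(weights, grads, selected_rows, lr=0.1):
    """Apply one SGD step only to the rows (neurons) listed in selected_rows.

    All other rows are copied unchanged, i.e. they stay frozen.
    """
    selected = set(selected_rows)
    return [
        [w - lr * g for w, g in zip(w_row, g_row)] if i in selected else list(w_row)
        for i, (w_row, g_row) in enumerate(zip(weights, grads))
    ]


# Hypothetical 3-neuron layer; only neuron 1 is tuned.
weights = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
grads = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
updated = masked_sgd_step(weights, grads, selected_rows=[1], lr=0.5)
```

Unlike adapter methods, this modifies the original weights directly, but only for the selected subset.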

Reviewer 2 Report

Comments and Suggestions for Authors

Summary of the Paper:

This paper addresses the challenge of Multi-Session Conversation (MSC) in Korean, a language the authors identify as under-represented in current MSC research. The authors introduce a new dataset (KEEM) synthesized via ChatGPT-4, which explicitly distinguishes between "Persona Memory" and "Episode Memory." Furthermore, the paper proposes a "Neuron Tuning" method guided by a novel metric (combining activation scores and entropy) to identify and fine-tune language-specific neurons. Experiments are conducted to compare this method against LoRA, DPO, and MoE across session summarization, memory update, and response generation tasks.

General Assessment:

The paper proposes a methodologically motivated approach (Neuron Tuning) that combines interpretability with parameter-efficient fine-tuning. The distinction between persona and episode memory is a logical and valuable contribution to MSC system design. However, the paper suffers from critical technical flaws, particularly regarding mathematical impossibilities in reported metrics, incorrect assumptions about model architectures, and a reliance on purely synthetic evaluation without human validation. These issues significantly undermine the credibility of the empirical results.

Major Concerns:

Mathematically Impossible Results:

In Table 4, the paper reports Perplexity (PPL) values of 0.9 and 0.7 for the Qwen 3 model on English MSC tasks. Perplexity is defined as PPL=e^H(P), where H(P) is the entropy. Since entropy is non-negative, PPL cannot be less than 1.0. A PPL of 1.0 implies perfect prediction with zero uncertainty. Reporting a PPL of 0.7 is theoretically impossible.
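
The lower bound can be checked numerically. The sketch below computes perplexity as the exponential of the mean negative log-likelihood per token, which is the usual empirical estimate of the quantity above; the token probabilities are hypothetical and the function name is illustrative.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)


# Hypothetical probabilities a model assigned to the reference tokens.
log_probs = [math.log(p) for p in (0.9, 0.5, 0.7, 0.6)]
ppl = perplexity(log_probs)

# Every token probability is at most 1, so every log-probability is at most 0,
# the mean negative log-likelihood is at least 0, and perplexity is at least 1.
perfect = perplexity([0.0, 0.0, 0.0])  # all reference tokens predicted with probability 1
```

Any reported value below 1.0 therefore points to a computation or reporting error rather than an unusually good model.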

Incorrect Model Architecture Assumptions:

In Section 4.4 (Equations 9 & 10), the authors define the activation score for Attention Q/K/V projections and FFN down-projections as a = σ(Wx + b). The models used in this study (Llama 3.1, Qwen 2.5, Gemma 3) utilize architectures where linear projections (Query, Key, Value, and Output/Down projections) are linear transformations without non-linear activation functions (σ) immediately following them. If the authors applied a non-linear function (like Sigmoid or ReLU) to calculate these scores, the methodology deviates from the actual model mechanics. If the formula is a typo, it raises concerns about the rigor of the mathematical derivation for "Neuron Identification." The authors must clarify this discrepancy.

The "Synthetic Loop" and Lack of Human Verification:

The experimental pipeline relies entirely on a closed synthetic loop: Training data (KEEM) is generated by ChatGPT-4.0. Evaluation metrics (Engagement, Informativeness, Conflict) are scored by GPT-5. Missing Validation: There is zero human evaluation provided to validate the quality of the generated dataset or the correlation between GPT-5 scores and human judgments. Given that the paper claims to solve subtle linguistic nuances in Korean, relying solely on LLM-as-a-Judge without a human-grounded baseline is insufficient.

Flawed Logic in Cross-Lingual Difficulty Assessment:

In Section 5.3, the authors claim "Korean MSC is intrinsically more challenging than English" based on higher PPL scores. Perplexity is heavily dependent on vocabulary size and tokenization granularity. Korean tokenizers often have different character-to-token ratios than English ones. Comparing raw PPL values across different languages is scientifically unsound and does not prove intrinsic task difficulty.
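
A small numeric sketch shows how tokenization granularity alone moves per-token perplexity. It holds the total negative log-likelihood of a text fixed and changes only the number of units it is divided by; in practice the total would also change with the tokenizer, so this is a simplification, and all numbers are hypothetical.

```python
import math


def ppl_per_unit(total_nll, n_units):
    """Perplexity when the total NLL (in nats) is averaged over n_units."""
    return math.exp(total_nll / n_units)


total_nll = 60.0  # hypothetical total negative log-likelihood of one text

coarse = ppl_per_unit(total_nll, 20)     # tokenizer A splits the text into 20 tokens
fine = ppl_per_unit(total_nll, 40)       # tokenizer B splits the same text into 40 tokens
per_char = ppl_per_unit(total_nll, 120)  # normalizing by 120 characters instead
```

Here the coarser tokenization yields a per-token perplexity equal to the square of the finer one for the same text and the same total likelihood, which is why raw per-token values are not comparable across languages with different character-to-token ratios.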

Specific Questions for the Authors:

Re: How did you calculate a PPL of 0.7? Is this a typo? If not, please provide the mathematical formula used.
Re: Which specific activation function was used? Did you modify the model architecture to include this function during inference/training?
Re: What percentage of the KEEM dataset was manually checked for hallucination or incorrect "Persona/Episode" classification?

Comments on the Quality of English Language

The quality of English writing is generally good and follows the standard academic structure. The ideas are presented clearly, and the terminology is mostly consistent. The term "Low-resource language" (used in the Abstract/Intro) to describe Korean is debatable in the NLP community. It would be more precise to describe it as "a language with limited open-domain dialogue resources" rather than a general low-resource language.

Author Response

Response to Reviewer 2

Comment 2.1: Mathematically Impossible Perplexity Values (PPL < 1.0)

"In Table 4, the paper reports Perplexity (PPL) values of 0.9 and 0.7 for the Qwen 3 model... Perplexity is defined as PPL=e^H(P), where H(P) is the entropy. Since entropy is non-negative, PPL cannot be less than 1.0."

Response:

We sincerely thank the reviewer for identifying this critical error. Upon thorough investigation, we discovered a bug in our evaluation code where PPL values were incorrectly computed with a "-1" offset, introduced during an attempted normalization procedure. We have corrected this error and re-run all experiments.

The corrected Table 4 now shows:

All PPL values are ≥ 1.0 (ranging from 1.7 to 8.1)
Qwen 3 32B on English MSC Session 1-2: 1.7 (previously reported as 0.7)
The relative trends and conclusions remain unchanged

We apologize for this oversight and have verified all numerical results in the revised manuscript.

Comment 2.2: Incorrect Model Architecture Assumptions (Activation Function in Eq. 9 & 10)

"The authors define the activation score for Attention Q/K/V projections and FFN down-projections as a=σ(Wx+b). The models used in this study utilize architectures where linear projections are linear transformations without non-linear activation functions..."

Response:

We thank the reviewer for this careful observation. We have completely revised Section 4.4 to accurately reflect the actual model architectures:

For Attention Q/K/V (Equation 9, revised): We clarify that modern transformer architectures (Llama, Qwen, Gemma) apply no activation function after Q/K/V projections. The activation score is computed as the absolute value of the linear projection output:

For FFN (Equations 10-12, revised): We explicitly describe the SwiGLU architecture and clarify that we measure activations at the intermediate representation after the gated activation:

Implementation Details Added (Section 5.2, lines 413-419): We now explicitly state: "For neuron activation measurement, we utilized the TransformerLens library... Attention Q/K/V: We capture the output of linear projection layers ('q_proj', 'k_proj', 'v_proj'), which do not include activation functions; FFN (MLP): We capture the intermediate activation after the gated activation function, corresponding to the 'post' hook point in TransformerLens."
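
The revised equations themselves are not reproduced in this response. As a rough sketch of what the text above describes, and not the manuscript's Equations 9 to 12, the two activation scores could be written as follows. The symbols are assumptions: h is the input hidden state, W_Q is the query projection (key and value are analogous), and W_gate and W_up are the two input projections of a SwiGLU feed-forward block.

```latex
a^{(Q)} = \left| W_Q\, h \right|,
\qquad
a^{(\mathrm{FFN})} = \mathrm{SiLU}\!\left( W_{\mathrm{gate}}\, h \right) \odot \left( W_{\mathrm{up}}\, h \right)
```

The first expression applies no non-linearity to the projection output, matching the statement about Q/K/V. The second is the intermediate SwiGLU representation after gating and before the down-projection, which is the quantity the response associates with the 'post' hook point.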

Comment 2.3: "Synthetic Loop" and Lack of Human Verification

"The experimental pipeline relies entirely on a closed synthetic loop: Training data (KEEM) is generated by ChatGPT-4.0. Evaluation metrics are scored by GPT-5. Missing Validation: There is zero human evaluation..."

Response:

We have comprehensively addressed this concern by adding human evaluation throughout:

Human Evaluation for Memory Update (Table 8): Five native Korean speakers evaluated informativeness and conflict for the top 3 models. Inter-annotator agreement: Krippendorff's α = 0.82 (informativeness), Fleiss' κ = 0.91 (conflict).
Human Evaluation for Response Generation (Table 10): Same evaluators assessed engagement, naturalness, and AMU. Inter-annotator agreement: Krippendorff's α = 0.83 (engagement), α = 0.87 (naturalness), Fleiss' κ = 0.90 (AMU).
Evaluator Blinding (Section 5.3): "To ensure unbiased assessment, evaluators are not informed of the dataset construction procedures or the specific methodologies used in our approach and the baselines."
Bias Prevention (Section 5.6): "To prevent evaluator bias toward specific models or methods, we pool all generated memories from the three models and present them to evaluators in randomized order without revealing model identities."

The human evaluation results consistently align with automatic evaluation, validating our findings.
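
For reference, Fleiss' kappa for a fixed number of raters per item can be computed as in the sketch below. This is the standard textbook formula, not the authors' evaluation script, and the rating counts are hypothetical.

```python
def fleiss_kappa(counts):
    """counts[i][j] = number of raters who assigned item i to category j.

    Every item must be rated by the same number of raters.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Observed agreement for each item, then averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall category proportions.
    p_cats = [
        sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)
    ]
    p_e = sum(p * p for p in p_cats)
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical binary judgments (conflict / no conflict) from 5 raters on 4 items.
ratings = [[5, 0], [4, 1], [0, 5], [1, 4]]
kappa = fleiss_kappa(ratings)
```

Kappa is 1 when all raters agree on every item and approaches 0 when agreement is no better than chance; it is undefined when every rating falls in a single category.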

Comment 2.4: Flawed Logic in Cross-Lingual Difficulty Assessment

"Comparing raw PPL values across different languages is scientifically unsound and does not prove intrinsic task difficulty."

Response:

We acknowledge this concern and have addressed it in Section 5.4 (lines 442-457):

Controlled Comparison: We compare Korean MSC perplexity against Korean summarization tasks of similar input length (Table 5). Despite similar lengths, MSC shows significantly higher perplexity (e.g., 8.1 vs. 5.3 for EXAONE 4 32B), demonstrating that the difficulty stems from the MSC task itself, not Korean linguistic characteristics.
Within-Language Analysis: By comparing Korean MSC vs. Korean summarization (same language, similar length, different task), we isolate task difficulty from tokenization effects.
Explicit Acknowledgment: We explicitly state (lines 442-447): "In the context of LLM vocabularies, the tokenization of Korean may be less fine-grained than that of English... the higher perplexity observed in the Korean MSC dataset may stem from the linguistic characteristics of Korean rather than from the intrinsic difficulty of the MSC task itself. To address this, we also measure the perplexity of other Korean tasks with lengths similar to that of the Korean MSC task."

Reviewer 3 Report

Comments and Suggestions for Authors

This paper presents a comprehensive empirical study on enhancing Korean MSC capabilities of LLMs through dataset construction, memory modeling, and parameter-efficient fine-tuning. Here are some reviews of the paper:

1. The literature review is not comprehensive enough: there is limited coverage of the latest work on modeling multi-round dialogue memory, and insufficient discussion of low-resource language adaptation methods.
2. The KEEM dataset constructed in the paper is based on the KMSC dataset, but according to the statistical results (Table 2), its sample size is significantly smaller than the original KMSC. Small sample sizes may limit the model's generalization ability, and data scarcity may affect the stability validation of optimization methods.
3. During the optimization process, some key parameters were not explained in detail, such as the number of experts in MoE, the gate network structure, and the selection criteria for the temperature coefficient of DPO, so the optimality of the parameter selection cannot be verified.
4. More recent research should be considered for related work, e.g., “ORSI Salient Object Detection via Progressive Interaction and Saliency-Guided Enhancement”.
5. Is the performance improvement of this framework statistically significant? Is there more detailed data analysis to support this conclusion? It is suggested that the authors include statistical significance analysis, such as using t-tests or Wilcoxon rank-sum tests.

Author Response

Response to Reviewer 3

Comment 3.1: Literature Review Not Comprehensive Enough

"There is limited coverage of the latest work on modeling multi-round dialogue memory, and insufficient discussion on low-resource language adaptation methods."

Response:

We have significantly expanded the Related Work section:

New Section 2.2: Memory-Augmented Dialogue Systems (lines 114-130): We discuss recent work including:

Reflective Memory Management (RMM) for hierarchical memory summarization
SHARE dataset and EPISODE framework for relational shared memories
We explicitly contrast these approaches with our work

Expanded Section 2.1: MSC Dataset (lines 89-113): We added discussion of:

LOCCO dataset for long-term chronological conversations
Analysis of temporal decay in LLM memory representations

Expanded Section 2.3: Language-Specific Neurons (lines 147-154): We added:

[Kim et al., 2025] on Korean-specific neurons for mathematical reasoning
Discussion of shallow-layer neurons for internal translation

Comment 3.2: Small Sample Size of KEEM Dataset

"According to the statistical results (Table 2), its sample size is significantly smaller than the original KMSC. Small sample sizes may limit the model's generalization ability..."

Response:

We acknowledge that the KEEM dataset is smaller than KMSC. However, we emphasize:

Quality over Quantity: KEEM was constructed with higher annotation quality, including emotional states and their causal context, which KMSC lacks (Section 3.1, lines 165-169).
Human Validation: The dataset quality was validated through human evaluation in [Kang et al., 2025].
Consistent Improvements: Despite the smaller size, models fine-tuned on KEEM consistently outperform baselines across all tasks, suggesting sufficient data for effective adaptation.
Focus on Fine-tuning: Our approach targets parameter-efficient fine-tuning of pre-trained LLMs, which typically requires less data than training from scratch.

Comment 3.3: Key Parameters Not Explained in Detail

"Some key parameters were not explained in detail, such as the number of experts in MoE, the gate network structure, and the selection criteria for the temperature coefficient of DPO..."

Response:

We have added comprehensive implementation details in Section 5.2 (lines 399-419):

MoE Configuration:

"We conduct a hyperparameter search over the number of experts ∈ {3, 4, 6, 8, 12} and top-k routing ∈ {1, 2, 3, 4}. The best performance is achieved with 8 experts and top-2 routing."
"We employ a shared expert architecture: one expert is designated as a shared expert that remains always activated, while the top-2 experts are dynamically selected from the remaining 7 routed experts based on softmax gating scores."
"Each expert is implemented as a two-layer feed-forward network with a hidden dimension of the backbone model."

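The routing scheme quoted above (one always-active shared expert plus the top-2 of 7 routed experts chosen by softmax gating scores) can be sketched as follows. Scalar inputs and scalar-valued experts are used to keep the example short, and renormalizing the two selected gate weights so they sum to 1 is an assumption that the quoted text does not confirm.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(gate_logits, top_k=2):
    """Return (expert_index, weight) pairs for the top_k routed experts."""
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)  # assumption: renormalize over the selected experts
    return [(i, probs[i] / norm) for i in top]


def moe_output(x, shared_expert, routed_experts, gate_logits, top_k=2):
    """Shared expert output plus the gate-weighted outputs of the selected experts."""
    out = shared_expert(x)
    for idx, weight in route(gate_logits, top_k):
        out += weight * routed_experts[idx](x)
    return out


# Hypothetical toy experts: routed expert s multiplies its input by s.
shared = lambda x: x
routed = [lambda x, s=s: s * x for s in range(7)]
gate_logits = [0.0, 0.0, 0.0, 5.0, 0.0, 5.0, 0.0]
y = moe_output(2.0, shared, routed, gate_logits)
```

With these logits, experts 3 and 5 are selected with equal weight, so only three of the eight experts contribute to each output.
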
Neuron Activation Measurement:

Detailed description of TransformerLens and PyTorch hook implementations
Specific hook points for attention and FFN modules

Comment 3.4: Statistical Significance Analysis Required

"Is the performance improvement of this framework statistically significant? Is there more detailed data analysis to support this conclusion? It is suggested that the authors include statistical significance analysis, such as using t-tests or Wilcoxon rank-sum tests."

Response:

We have added statistical significance testing throughout:

Experimental Runs: All experiments are now run five times (increased from three) to ensure robust statistical analysis.
Statistical Tests: We apply paired t-tests for all pairwise comparisons between Neuron Tuning and other methods.
Updated Table Captions: All result tables (Tables 6, 7, 9) now include: "Statistical significance was verified using a paired t-test; all pairwise comparisons between Neuron Tuning and other methods yield p < 0.05."
Inter-annotator Agreement: For human evaluation, we report:

Krippendorff's Alpha with ordinal metric for Likert-scale ratings
Fleiss' Kappa for binary classifications
All agreement scores indicate substantial to almost perfect agreement (α/κ ≥ 0.82)
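
As an illustration of the paired t-test over five runs, the sketch below computes the t statistic from per-run score differences and compares it with the two-sided 5% critical value for 4 degrees of freedom. The scores are hypothetical and the function name is illustrative; this is not the authors' analysis code.

```python
import math


def paired_t_statistic(a, b):
    """t statistic for paired samples a[i], b[i], with len(a) - 1 degrees of freedom."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # unbiased sample variance
    return mean / math.sqrt(var / n)


# Two-sided 5% critical value of Student's t distribution with 4 degrees of freedom.
T_CRIT_DF4 = 2.776

# Hypothetical scores from five paired runs (same seeds for both methods).
neuron_tuning = [71.2, 70.8, 71.5, 70.9, 71.1]
baseline = [69.9, 70.1, 70.0, 69.7, 70.3]

t_stat = paired_t_statistic(neuron_tuning, baseline)
significant = abs(t_stat) > T_CRIT_DF4
```

Pairing by run removes run-to-run variance shared by both methods, which matters when only five runs are available. A pairwise claim of p < 0.05 corresponds to |t| exceeding the critical value above.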

Summary of Revisions

Reviewer Concern	Section Modified	Key Changes
PPL < 1.0 error	Table 4	Corrected all values (bug fix)
Activation function	Section 4.4, Eq. 9-12	Revised to match actual architectures
Human evaluation	Sections 5.3, 5.6, 5.7; Tables 8, 10	Added comprehensive human evaluation
Unfair baseline comparison	Sections 4.5, 4.6; Tables 6-10	Added CPT and Layer Tuning baselines
Statistical significance	Tables 6, 7, 9	Added p-value reporting
MoE hyperparameters	Section 5.2	Added detailed configuration
Related work	Sections 2.1, 2.2, 2.3	Expanded with recent literature
GPT-4.0 dataset concern	Section 3.1	Added clarification and justification
Cross-lingual PPL validity	Section 5.4, Table 5	Added within-language comparison

We believe these revisions comprehensively address all reviewer concerns and significantly strengthen the contribution of our work.