Evaluating the Impact of Adapter-Based Fine-Tuning on Structured Parsing Performance in Large Language Models
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The paper investigates whether parameter-efficient fine-tuning can improve the reliability of large language models in a structured parsing task, where natural-language shopping-cart commands are converted into a fixed JSON schema with action, product, and quantity fields. The study compares several prompting strategies with two adapter-based methods, LoRA and IA3, across three open-source LLMs. The topic is relevant for practical deployments where syntactically valid and semantically correct structured outputs are required, and the paper has some clear strengths, including a controlled experimental setup, direct comparison of prompt-based and adapter-based adaptation, and reporting of both accuracy-oriented and cost-related metrics. The results suggest that few-shot prompting already provides a strong baseline, while LoRA further improves structured-output accuracy and consistency. However, the current experimental design leaves several important questions open about generalizability, robustness, and the interpretation of computational cost. The following aspects of the methodology and interpretation limit the strength of the paper’s conclusions and require further clarification:
1) I am not fully convinced that the reported results can be generalized beyond the current shopping-cart parsing setup. The task uses a very simple JSON schema with only three fields: action, product, and quantity. Could the authors clarify whether the same conclusions are expected to hold for more complex structured-output tasks?
2) The dataset is fully synthetic, and this substantially limits the strength of the practical conclusions. It would be useful to know how the models behave on real user commands, including noisy phrasing, typos, ambiguous requests, unseen product names, and out-of-distribution inputs.
3) The current benchmark does not seem sufficiently challenging to support broad claims about PEFT versus prompt engineering. More demanding cases such as nested JSON objects, optional fields, multiple products in one command, conflicting instructions, multi-intent requests, or multilingual inputs would make the comparison more convincing.
4) There appears to be an inconsistency in the computational-cost analysis. Table 3 reports average inference time for LoRA configurations, whereas Section 4.4 states that LoRA inference-time measurements were not available. This should be clarified, since the cost-benefit comparison is one of the central points of the paper.
5) The IA3 results are almost identical to minimal instruction prompting. This is an important finding, but the paper does not analyze it in sufficient depth. Is this outcome due to the IA3 configuration, the selected learning rate, the target modules, limited trainable capacity, model architecture, or the simplicity of the task?
6) The comparison between LoRA and IA3 would be more convincing with some sensitivity analysis. The manuscript states that no hyperparameter search was performed, although PEFT methods can be sensitive to rank, scaling factor, dropout, learning rate, batch size, and training schedule.
7) Some conclusions seem broader than what the experiments can support. The study provides evidence for one controlled synthetic parsing task, but the manuscript sometimes presents the findings as a more general basis for deciding between PEFT and prompt engineering in structured generation.
The submitted manuscript appears to contain unresolved tracked changes and non-final formatting elements. There are also template placeholders, formatting artifacts, and at least one added passage that conflicts with the reported results. A clean final version is needed before the technical contribution can be properly assessed.
Author Response
Respected reviewer,
Thank you for your comments on our manuscript.
Please find the point-by-point responses in the attached document.
Kind regards,
The authors
Author Response File:
Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
Please see the attachment.
Comments for author File:
Comments.pdf
Author Response
Respected reviewer,
Thank you for your comments on our manuscript.
Please find the point-by-point responses in the attached document.
Kind regards,
The authors
Author Response File:
Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
This paper addresses a challenge in modern production environments: the reliable mapping of unstructured user intent into machine-readable JSON schemas. By comparing Prompt Engineering with Parameter-Efficient Fine-Tuning (PEFT) techniques such as LoRA and IA3, the authors provide an empirical framework for optimizing Large Language Models (LLMs) in structured parsing tasks. The paper is easy to follow. My suggestions for a major revision are as follows:
1) The study relies on a highly controlled, synthetic dataset of 12,000 instructions with a limited, fixed JSON schema (action, product, quantity). The authors acknowledge this limitation but should expand on how the framework might perform with more complex, nested JSON schemas or "noisy" real-world data. Adding a small-scale validation on a non-synthetic, human-generated dataset would significantly increase the paper's external validity.
2) The results in Table 3 show that IA3 significantly underperforms compared to LoRA and even few-shot prompting for certain models. For example, Llama 3.1 8B with IA3 achieved an Exact Match Accuracy of 0.796, whereas LoRA reached 0.981. Please provide a deeper analysis of why IA3, which is designed for extreme parameter efficiency, failed to match the performance of advanced prompting in this specific domain. Is the "signal modulation" mechanism insufficient for structural JSON constraints compared to the "rank-decomposition" used in LoRA?
3) The paper mentions quantifying the trade-off between computational cost and performance gains. While "Average time per test" is provided, the paper would benefit from a more explicit discussion of the training costs (GPU hours, memory) associated with each PEFT method versus the inference latency of long few-shot prompts. This is critical for the "optimization dilemma" mentioned in the introduction.
4) In Table 3, the "Average time per test" for Qwen3 4B is remarkably higher (e.g., 21.14s for Minimal Instruction) than for Llama 3.1 8B (1.68s). This massive discrepancy (more than 12x slower for a smaller model) needs an explanation. Is it due to specific architectural differences, a suboptimal tokenizer implementation for that model, or hardware-specific bottlenecks during testing?
5) The methodology states that the test set includes "novel paraphrasing patterns" to evaluate generalization. Please include a specific breakdown or qualitative analysis of the types of "novel" patterns the models failed on. This would help readers understand whether the adapters are simply "overfitting" to the synthetic template or truly learning the underlying semantic mapping.
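To make the metric discussed in point 2 concrete, the following is a minimal sketch of how exact-match accuracy over the three-field schema could be scored. It is not the authors' evaluation code: the function names, the strict "exactly these three keys" validity rule, and the example action values are assumptions made for illustration only.

```python
import json

# Schema fields named in the reports; the strict key check below is an assumption.
REQUIRED_FIELDS = ("action", "product", "quantity")


def parse_output(text):
    """Return the parsed dict if `text` is valid JSON with exactly the required keys, else None."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return None  # syntactically invalid output
    if not isinstance(obj, dict) or set(obj) != set(REQUIRED_FIELDS):
        return None  # valid JSON, but not the expected schema
    return obj


def exact_match_accuracy(predictions, references):
    """Fraction of raw model outputs whose parsed JSON equals the reference in every field."""
    if not references:
        return 0.0
    hits = sum(1 for p, r in zip(predictions, references) if parse_output(p) == r)
    return hits / len(references)


if __name__ == "__main__":
    # Hypothetical outputs and labels, not taken from the manuscript.
    preds = [
        '{"action": "add", "product": "milk", "quantity": 2}',
        "not json",
        '{"action": "remove", "product": "eggs"}',
    ]
    refs = [
        {"action": "add", "product": "milk", "quantity": 2},
        {"action": "add", "product": "tea", "quantity": 1},
        {"action": "remove", "product": "eggs", "quantity": 1},
    ]
    print(exact_match_accuracy(preds, refs))
```

Under this strict definition, an output that is malformed or missing a field counts as a miss, which is one reason a breakdown of failure types (point 5) would be informative alongside the aggregate score.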
Author Response
Respected reviewer,
Thank you for your comments on our manuscript.
Please find the point-by-point responses in the attached document.
Kind regards,
The authors
Author Response File:
Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The authors have addressed the main concerns raised in the previous review. The revised manuscript now more clearly positions the study as a controlled synthetic benchmark rather than as definitive evidence for all real-world structured-output tasks. The inconsistency regarding LoRA inference-time measurements has been corrected, the discussion of IA3 has been expanded, the computational-cost section is more coherent, and the added qualitative error analysis improves the interpretation of the results.
I consider the revision generally satisfactory. I do not think that additional major experiments are necessary at this stage. However, I recommend a few minor revisions before acceptance:
- The conclusions should be slightly toned down. Some statements still sound broader than the experimental setting supports, especially claims that LoRA-based fine-tuning is “highly recommended” for mission-critical production environments or that LoRA is “the most effective mechanism” for improving semantic mapping. These conclusions should be explicitly limited to the evaluated controlled benchmark.
- The description of dataset generation should briefly clarify how the quality of synthetic labels was controlled. Since the dataset was generated using teacher models, it would be useful to mention whether duplicate removal, automatic validation, filtering, or manual spot-checking was performed.
- The explanation of Qwen3 latency should be phrased more cautiously. The proposed causes, such as tokenizer differences, vocabulary size, or implementation bottlenecks, are plausible but not experimentally verified in the manuscript.
- The explanation of IA3 underperformance should also remain cautious. The current interpretation based on limited expressive capacity is reasonable, but without an ablation study it should be framed as a likely explanation under the current configuration rather than as a definitive conclusion.
The manuscript still requires minor proofreading and formatting cleanup. Some template elements remain visible in the PDF, and Table 3 is still difficult to read due to line breaks and formatting artifacts.
Author Response
Respected Reviewer,
Thank you for your review of our paper. Please find our responses in the attached document.
Kind regards,
The authors
Author Response File:
Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
The authors have addressed all my comments!
Author Response
Respected reviewer,
Thank you for reviewing our paper.
Kind regards,
The authors

