Review Reports - Fine-Tuning a Local LLM for Thermoelectric Generators with QLoRA: From Generalist to Specialist

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This study focuses on developing a domain-specific large language model (LLM) for thermoelectric generators (TEGs) that can be deployed on local hardware. Starting with the generalist JanV1-4B model, the researchers employed the QLoRA fine-tuning technique, modifying only 3.18% of the model’s total parameters to avoid excessive computational costs. The core of this methodology lies in a custom-designed dataset consisting of 202 questions and answers (QA), which balances deep TEG domain knowledge and instruction tuning to shape the model’s response behavior and mitigate catastrophic forgetting. The fine-tuning process, optimized with the Unsloth library and conducted on an NVIDIA GeForce RTX 2070 SUPER GPU, took only 263 seconds over three epochs, and the model was converted to the GGUF format for efficient local inference via the Ollama framework. This work validates QLoRA as an effective and accessible strategy for specializing LLMs in complex engineering domains, eliminating reliance on large-scale computing infrastructures and paving the way for practical, locally deployed AI tools in TEG engineering. However, the manuscript requires minor revisions to enhance clarity, consistency, and completeness. It is recommended to be accepted after revision.

When describing the dataset composition in Section 4, the manuscript mentions a “potential overlap” between the “Calculation” and “Skill-based” subsets but does not specify how this overlap was controlled during data curation (e.g., whether duplicate QA were removed, how ambiguous entries were classified). It is recommended to add a brief explanation of data cleaning strategies for overlapping categories to clarify the dataset’s integrity and avoid potential bias in model training.
The manuscript refers to “NGA (niche genetic algorithm)” in Section 5.1.4, but incorrectly labels it as “GAN” in Table 4, which may confuse readers with generative adversarial networks. It is suggested to unify the algorithm name to “NGA” in both the text and tables, supplement its full name at the first mention, and verify the consistency of corresponding data (e.g., runtime, final error) to ensure accuracy.
The specialized model is mainly validated in TEG equation formulation, parameter optimization analysis, and thermal management design. It is suggested to supplement a brief analysis of the model’s applicability in other TEG-related scenarios (e.g., material selection for high-temperature TEGs, optimization of module geometric parameters) to clarify the application boundaries of the theoretical framework.

Author Response

Answer to Reviewer 1

Thank you very much for your thorough review and constructive suggestions. Your feedback is very helpful for improving the clarity and impact of our paper. We have carefully addressed each of your points below.

When describing the dataset composition in Section 4, the manuscript mentions a “potential overlap” between the “Calculation” and “Skill-based” subsets but does not specify how this overlap was controlled during data curation (e.g., whether duplicate QA were removed, how ambiguous entries were classified). It is recommended to add a brief explanation of data cleaning strategies for overlapping categories to clarify the dataset’s integrity and avoid potential bias in model training.

Answer: We appreciate this important observation. We agree that clarifying the data curation strategy is essential to ensure dataset integrity and minimize potential bias.
Proposed action: We have included the following text in Section 4, Dataset definition, on line 399:

“As a strategy for cleaning, controlling overlap, and ensuring dataset integrity, the classification was based on the primary intent of the QA. 100% of the 202 QA pairs were manually reviewed to eliminate conceptual duplicates such as duplicate questions and answers. Ambiguous entries were also classified in this way, ensuring that these QAs aligned with the computational and skills-based subsets.”

The manuscript refers to “NGA (niche genetic algorithm)” in Section 5.1.4, but incorrectly labels it as “GAN” in Table 4, which may confuse readers with generative adversarial networks. It is suggested to unify the algorithm name to “NGA” in both the text and tables, supplement its full name at the first mention, and verify the consistency of corresponding data (e.g., runtime, final error) to ensure accuracy.

Answer: We appreciate this correction.
Proposed action:We have corrected the nomenclature in Tables 4 and 5, and thus unified the nomenclature throughout the article, writing NGA.

The specialized model is mainly validated in TEG equation formulation, parameter optimization analysis, and thermal management design. It is suggested to supplement a brief analysis of the model’s applicability in other TEG-related scenarios (e.g., material selection for high-temperature TEGs, optimization of module geometric parameters) to clarify the application boundaries of the theoretical framework.

Answer: We appreciate this suggestion. We agree that discussing the applicability of TEG in other scenarios will broaden our understanding of the model's scope and theoretical framework.
Proposed action: We have added a brief discussion at the end of Section 5.1.4. Level 4 and 5: Quantitative & Critical analysis, on line 580.

“It is worth highlighting that, although the validation focuses on the 4-DOF model for formulating equations, the pure domain category of the dataset provides fundamental knowledge, including the ZT merit factor and its influence on geometry. This has allowed the model developed in this work to generate answers regarding material selection at other temperatures and the geometric optimization of parameters.”

We wish to inform the reviewers that all scripts used in this work —including the QLoRA fine-tuning code, dataset preprocessing routines, synthetic data generation scripts, multi-level evaluation pipeline, and local inference tools— will be made available in the Zenodo repository associated with the article.

Although the reference is not yet public, the reviewers can access the complete material through the following private link:

https://zenodo.org/records/17563453?preview=1&token=eyJhbGciOiJIUzUxMiJ9.eyJpZCI6ImE2NzA0NDJmLTIzZjYtNGQ1NC1iMTYwLTEyYmNhMTQ5ZjRiNCIsImRhdGEiOnt9LCJyYW5kb20iOiI3NmY4YTI3MzMxYWI5N2U0NzZhYTVmNTYzNmY3MDdhYSJ9.Yl3mnBpN4vZtiloMXOVXUvWG3xTlKZIWK6kLac8iauCdDJthkG2aXFS4-C4JwGV_LbQGuHM0mpcROl16UsJPbA

Once the editorial process is completed, the final public version will be permanently accessible as: Monzón-Verona, J. M.; García-Alonso, S.; Santana-Martín, F. J. Software and Dataset for Fine-Tuning a Local LLM for Thermo-Electric Generators with QLoRA: From Generalist to Specialist, 2025. https://doi.org/10.5281/zenodo.17563453

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

Abstract

The description of the evaluation only states a questionnaire and an accuracy of 81 percent, but does not specify that two different benchmarks are used a 16 question cognitive test and a 42 question TEG benchmark, nor how accuracy is computed.
The size and structure of the training dataset are not mentioned, for example, the 202 question answer items and their split between domain and instruction tuning, which weakens the claim about quality-driven design.
The phrase about eliminating dependence on large-scale computing infrastructure is strong and qualitative; it would help to explicitly mention the actual hardware class used and any remaining limitations.
The scope of the specialization is not clear, for instance, whether both JanV1 and Qwen models were fine-tuned, yet later sections compare two fine-tuned models.
The contribution regarding experimental use of the LLM in TEG design is not visible; the reader only sees modeling and accuracy, while the later experimental section is an important part of the study.

Introduction

The research gap on domain specific LLMs for engineering is described in general terms but lacks a focused comparison with previous works on fine tuned assistants for scientific or physical modeling tasks.
There is no discussion of alternative approaches such as retrieval augmented generation for TEG support, so the specific advantage of full fine tuning with QLoRA is not clearly justified.
The four listed contributions mix high level claims with methodological steps; they would benefit from tighter alignment with later sections and from explicitly stating what is new compared with the cited TEG literature.
The choice of the four degree of freedom lumped TEG model as the knowledge backbone is motivated, but the novelty or modifications with respect to the reference work by Marjanović and coauthors are not clearly explained.
Several paragraphs repeat general explanations of PEFT and QLoRA, while missing a short state of the art on LLM use in thermoelectricity (if any) or in energy systems, which would better contrast your contribution.

Section 2. Mathematical model of the TEG

Assumptions behind the lumped model are not explicitly listed, for example one dimensional heat flow, constant material properties, neglect of Thomson effect or contact non linearities, which limits the reader’s understanding of model validity.
There are notation inconsistencies that look like errors, such as Qcm instead of Qm in equation 13, C1 instead of Cc1 in equation 15, and J43 being labeled as derivative with respect to Ta although the index suggests Tc1; these should be checked carefully.
The term complexity order four is used without definition, and the relation between this concept and the four state variables is not clearly articulated.

The reference numerical values of the parameters for the specific TEG module are only mentioned later in the results; a table with those values would fit naturally in this section and help connect the model and experiments.

Section 3. FT methodology

The hyperparameters of the LoRA adapter are under-specified; there is no information on rank, target modules, alpha value, or dropout, which are needed to reproduce the fine-tuning.
The optimization setup lacks details on batch size, gradient accumulation, and other parameters, so the training process cannot yet be exactly replicated.
Training and validation splits are not described; it is unclear whether any held out data were used to monitor overfitting beyond the single training loss curve.
The hardware description only mentions an RTX 2070 Super without VRAM or CPU and RAM specifications, and without clarifying whether training times and inference times in later tables were all measured on the same setup.

Section 4. Dataset definition

The safeguards against overlap between the 202 training question answer items and the evaluation benchmarks of 16 and 42 questions are not described; potential contamination would strongly affect the reported accuracies.
No concrete example of each dataset category is provided, such as one pure domain question or one skill-based prompt, which makes it hard to assess the claimed focus on structure and behavior.
The way external sources were transformed into question-answer pairs is only sketched; there is no detail on paraphrasing, length normalization, or the amount of text from each cited article used. If possible, provide an example.
The 20 question set dedicated to disambiguating thermoelectric power coefficient versus power factor is interesting.

Section 5.

The composition of the 16 question test is only summarized through topics; a brief description of how those questions were generated and whether they are all unseen with respect to the 202 item training set would strengthen the evaluation.
Only one explicit example of a generated equation is shown, which is not sufficient to support the rich qualitative claims about reasoning, self correction, and abstraction across all levels.
The text refers to levels one through four and sometimes to levels four and five together, which may confuse readers about the exact number of difficulty levels used in the taxonomy.
The evaluation process appears to rely on a single human judge; there is no mention of inter rater agreement or a protocol for resolving borderline cases, which limits the robustness of the reported 94 percent success rate.
For section 5.1.4 Level 4 and 5, quantitative and higher order analysis, the discussion of advanced cognitive abilities in these tasks would benefit from citing prior TEG parameter estimation studies, for example “Parameter estimation of a thermoelectric generator by using salps search algorithm” and “An effective parameter estimation on thermoelectric devices for power generation based on multiverse optimization algorithm”, to anchor the proposed levels in concrete, domain relevant research and show that such tasks reflect realistic quantitative reasoning demands in TEG modeling.
The specialized 42 question TEG benchmark is central to the conclusions, yet its construction, difficulty distribution, and example items are not described beyond a reference; a short subsection or appendix would be advisable.
Figure 6 must be improved.
The experimental setup in Figure 7 lacks quantitative details such as heat source power, flow rate, geometrical dimensions, and thermal contact properties, which are necessary to clearly understand the image.

Section 6. Conclusion

The text restates main results but does not discuss limitations, such as small dataset size, focus on a single TEG architecture, possible benchmark contamination, or reliance on synthetic question answer pairs rather than real user interactions.
The two key accuracy figures, 81 percent on the 42 question benchmark and 94 percent on the 16 question test, are not jointly summarized, which makes it harder for readers to understand the difference between the two evaluation setups.

Author Response

Answer to Reviewer 2

Abstract

1.- The description of the evaluation only states a questionnaire and an accuracy of 81 percent, but does not specify that two different benchmarks are used a 16 question cognitive test and a 42 question TEG benchmark, nor how accuracy is computed.

Answer: Thank you for your observation.