by
  • José Miguel Monzón-Verona,
  • Santiago García-Alonso and
  • Francisco Jorge Santana-Martín

Reviewer 1: Anonymous Reviewer 2: Anonymous Reviewer 3: Anonymous

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This study focuses on developing a domain-specific large language model (LLM) for thermoelectric generators (TEGs) that can be deployed on local hardware. Starting with the generalist JanV1-4B model, the researchers employed the QLoRA fine-tuning technique, modifying only 3.18% of the model’s total parameters to avoid excessive computational costs. The core of this methodology lies in a custom-designed dataset consisting of 202 questions and answers (QA), which balances deep TEG domain knowledge and instruction tuning to shape the model’s response behavior and mitigate catastrophic forgetting. The fine-tuning process, optimized with the Unsloth library and conducted on an NVIDIA GeForce RTX 2070 SUPER GPU, took only 263 seconds over three epochs, and the model was converted to the GGUF format for efficient local inference via the Ollama framework. This work validates QLoRA as an effective and accessible strategy for specializing LLMs in complex engineering domains, eliminating reliance on large-scale computing infrastructures and paving the way for practical, locally deployed AI tools in TEG engineering. However, the manuscript requires minor revisions to enhance clarity, consistency, and completeness. It is recommended to be accepted after revision.

  1. When describing the dataset composition in Section 4, the manuscript mentions a “potential overlap” between the “Calculation” and “Skill-based” subsets but does not specify how this overlap was controlled during data curation (e.g., whether duplicate QA were removed, how ambiguous entries were classified). It is recommended to add a brief explanation of data cleaning strategies for overlapping categories to clarify the dataset’s integrity and avoid potential bias in model training.
  2. The manuscript refers to “NGA (niche genetic algorithm)” in Section 5.1.4, but incorrectly labels it as “GAN” in Table 4, which may confuse readers with generative adversarial networks. It is suggested to unify the algorithm name to “NGA” in both the text and tables, supplement its full name at the first mention, and verify the consistency of corresponding data (e.g., runtime, final error) to ensure accuracy.
  3. The specialized model is mainly validated in TEG equation formulation, parameter optimization analysis, and thermal management design. It is suggested to supplement a brief analysis of the model’s applicability in other TEG-related scenarios (e.g., material selection for high-temperature TEGs, optimization of module geometric parameters) to clarify the application boundaries of the theoretical framework.

Author Response

Answer to Reviewer 1

 

Thank you very much for your thorough review and constructive suggestions. Your feedback is very helpful for improving the clarity and impact of our paper. We have carefully addressed each of your points below.

 

  1. When describing the dataset composition in Section 4, the manuscript mentions a “potential overlap” between the “Calculation” and “Skill-based” subsets but does not specify how this overlap was controlled during data curation (e.g., whether duplicate QA were removed, how ambiguous entries were classified). It is recommended to add a brief explanation of data cleaning strategies for overlapping categories to clarify the dataset’s integrity and avoid potential bias in model training.
  • Answer: We appreciate this important observation. We agree that clarifying the data curation strategy is essential to ensure dataset integrity and minimize potential bias.
  • Proposed action: We have included the following text in Section 4, Dataset definition, on line 399:

“As a strategy for cleaning, controlling overlap, and ensuring dataset integrity, each QA was classified according to its primary intent. All 202 QA pairs were manually reviewed to remove conceptual duplicates (near-identical questions or answers). Ambiguous entries were assigned using the same primary-intent criterion, ensuring a consistent boundary between the calculation and skill-based subsets.”

 

  2. The manuscript refers to “NGA (niche genetic algorithm)” in Section 5.1.4, but incorrectly labels it as “GAN” in Table 4, which may confuse readers with generative adversarial networks. It is suggested to unify the algorithm name to “NGA” in both the text and tables, supplement its full name at the first mention, and verify the consistency of corresponding data (e.g., runtime, final error) to ensure accuracy.
  • Answer: We appreciate this correction.
  • Proposed action: We have corrected the nomenclature in Tables 4 and 5, thus unifying it as NGA throughout the article.

 

  3. The specialized model is mainly validated in TEG equation formulation, parameter optimization analysis, and thermal management design. It is suggested to supplement a brief analysis of the model’s applicability in other TEG-related scenarios (e.g., material selection for high-temperature TEGs, optimization of module geometric parameters) to clarify the application boundaries of the theoretical framework.
  • Answer: We appreciate this suggestion. We agree that discussing the model's applicability in other TEG-related scenarios helps clarify its scope and the boundaries of the theoretical framework.
  • Proposed action: We have added a brief discussion at the end of Section 5.1.4. Level 4 and 5: Quantitative & Critical analysis, on line 580.

“It is worth highlighting that, although the validation focuses on the 4-DOF model for formulating equations, the pure domain category of the dataset provides fundamental knowledge, including the ZT figure of merit and its influence on geometry. This has allowed the model developed in this work to generate answers on material selection at other operating temperatures and on the optimization of module geometric parameters.”

 

We wish to inform the reviewers that all scripts used in this work —including the QLoRA fine-tuning code, dataset preprocessing routines, synthetic data generation scripts, multi-level evaluation pipeline, and local inference tools— will be made available in the Zenodo repository associated with the article.

Although the reference is not yet public, the reviewers can access the complete material through the following private link:

https://zenodo.org/records/17563453?preview=1&token=eyJhbGciOiJIUzUxMiJ9.eyJpZCI6ImE2NzA0NDJmLTIzZjYtNGQ1NC1iMTYwLTEyYmNhMTQ5ZjRiNCIsImRhdGEiOnt9LCJyYW5kb20iOiI3NmY4YTI3MzMxYWI5N2U0NzZhYTVmNTYzNmY3MDdhYSJ9.Yl3mnBpN4vZtiloMXOVXUvWG3xTlKZIWK6kLac8iauCdDJthkG2aXFS4-C4JwGV_LbQGuHM0mpcROl16UsJPbA

Once the editorial process is completed, the final public version will be permanently accessible as: Monzón-Verona, J. M.; García-Alonso, S.; Santana-Martín, F. J. Software and Dataset for Fine-Tuning a Local LLM for Thermo-Electric Generators with QLoRA: From Generalist to Specialist, 2025. https://doi.org/10.5281/zenodo.17563453

 


Reviewer 2 Report

Comments and Suggestions for Authors

Abstract

  • The description of the evaluation only states a questionnaire and an accuracy of 81 percent, but does not specify that two different benchmarks are used (a 16-question cognitive test and a 42-question TEG benchmark), nor how accuracy is computed.
  • The size and structure of the training dataset are not mentioned, for example, the 202 question answer items and their split between domain and instruction tuning, which weakens the claim about quality-driven design.
  • The phrase about eliminating dependence on large-scale computing infrastructure is strong and qualitative; it would help to explicitly mention the actual hardware class used and any remaining limitations.
  • The scope of the specialization is not clear, for instance, whether both JanV1 and Qwen models were fine-tuned, yet later sections compare two fine-tuned models.
  • The contribution regarding experimental use of the LLM in TEG design is not visible; the reader only sees modeling and accuracy, while the later experimental section is an important part of the study.

Introduction

  • The research gap on domain specific LLMs for engineering is described in general terms but lacks a focused comparison with previous works on fine tuned assistants for scientific or physical modeling tasks.
  • There is no discussion of alternative approaches such as retrieval augmented generation for TEG support, so the specific advantage of full fine tuning with QLoRA is not clearly justified.
  • The four listed contributions mix high level claims with methodological steps; they would benefit from tighter alignment with later sections and from explicitly stating what is new compared with the cited TEG literature.
  • The choice of the four degree of freedom lumped TEG model as the knowledge backbone is motivated, but the novelty or modifications with respect to the reference work by Marjanović and coauthors are not clearly explained.
  • Several paragraphs repeat general explanations of PEFT and QLoRA, while missing a short state of the art on LLM use in thermoelectricity (if any) or in energy systems, which would better contrast your contribution.

Section 2. Mathematical model of the TEG

  • Assumptions behind the lumped model are not explicitly listed, for example one dimensional heat flow, constant material properties, neglect of Thomson effect or contact non linearities, which limits the reader’s understanding of model validity.
  • There are notation inconsistencies that look like errors, such as Qcm instead of Qm in equation 13, C1 instead of Cc1 in equation 15, and J43 being labeled as derivative with respect to Ta although the index suggests Tc1; these should be checked carefully.
  • The term complexity order four is used without definition, and the relation between this concept and the four state variables is not clearly articulated.

 

  • The reference numerical values of the parameters for the specific TEG module are only mentioned later in the results; a table with those values would fit naturally in this section and help connect the model and experiments.

Section 3. FT methodology

  • The hyperparameters of the LoRA adapter are under-specified; there is no information on rank, target modules, alpha value, or dropout, which are needed to reproduce the fine-tuning.
  • The optimization setup lacks details on batch size, gradient accumulation, and other parameters, so the training process cannot yet be exactly replicated.
  • Training and validation splits are not described; it is unclear whether any held out data were used to monitor overfitting beyond the single training loss curve.
  • The hardware description only mentions an RTX 2070 Super without VRAM or CPU and RAM specifications, and without clarifying whether training times and inference times in later tables were all measured on the same setup.

Section 4. Dataset definition

  • The safeguards against overlap between the 202 training question answer items and the evaluation benchmarks of 16 and 42 questions are not described; potential contamination would strongly affect the reported accuracies.
  • No concrete example of each dataset category is provided, such as one pure domain question or one skill-based prompt, which makes it hard to assess the claimed focus on structure and behavior.
  • The way external sources were transformed into question-answer pairs is only sketched; there is no detail on paraphrasing, length normalization, or the amount of text from each cited article used. If possible, provide an example.
  • The 20-question set dedicated to disambiguating the thermoelectric power coefficient versus the power factor is interesting.

Section 5.

  • The composition of the 16 question test is only summarized through topics; a brief description of how those questions were generated and whether they are all unseen with respect to the 202 item training set would strengthen the evaluation.
  • Only one explicit example of a generated equation is shown, which is not sufficient to support the rich qualitative claims about reasoning, self correction, and abstraction across all levels.
  • The text refers to levels one through four and sometimes to levels four and five together, which may confuse readers about the exact number of difficulty levels used in the taxonomy.
  • The evaluation process appears to rely on a single human judge; there is no mention of inter rater agreement or a protocol for resolving borderline cases, which limits the robustness of the reported 94 percent success rate.
  • For section 5.1.4 Level 4 and 5, quantitative and higher order analysis, the discussion of advanced cognitive abilities in these tasks would benefit from citing prior TEG parameter estimation studies, for example “Parameter estimation of a thermoelectric generator by using salps search algorithm” and “An effective parameter estimation on thermoelectric devices for power generation based on multiverse optimization algorithm”, to anchor the proposed levels in concrete, domain relevant research and show that such tasks reflect realistic quantitative reasoning demands in TEG modeling.
  • The specialized 42 question TEG benchmark is central to the conclusions, yet its construction, difficulty distribution, and example items are not described beyond a reference; a short subsection or appendix would be advisable.
  • Figure 6 must be improved.
  • The experimental setup in Figure 7 lacks quantitative details such as heat source power, flow rate, geometrical dimensions, and thermal contact properties, which are necessary to clearly understand the image.

Section 6. Conclusion

  • The text restates main results but does not discuss limitations, such as small dataset size, focus on a single TEG architecture, possible benchmark contamination, or reliance on synthetic question answer pairs rather than real user interactions.
  • The two key accuracy figures, 81 percent on the 42 question benchmark and 94 percent on the 16 question test, are not jointly summarized, which makes it harder for readers to understand the difference between the two evaluation setups.

Author Response

Answer to Reviewer 2

Thank you very much for your thorough review and constructive suggestions. Your feedback is very helpful for improving the clarity and impact of our paper. We have carefully addressed each of your points below.

 

Abstract

  • 1.- The description of the evaluation only states a questionnaire and an accuracy of 81 percent, but does not specify that two different benchmarks are used (a 16-question cognitive test and a 42-question TEG benchmark), nor how accuracy is computed.

Answer: Thank you for your observation.

Proposed action: We have included the following text in the abstract:

Performance of the models was evaluated using two complementary benchmarks: a 16-question multilevel cognitive benchmark (94% accuracy) and a specialized 42-question TEG benchmark (81% accuracy), scoring responses as excellent, correct with difficulties, or incorrect, based on technical accuracy and reasoning quality.

  • 2.- The size and structure of the training dataset are not mentioned, for example, the 202 question answer items and their split between domain and instruction tuning, which weakens the claim about quality-driven design.

Answer: You are right. The abstract should briefly summarize the composition of the training dataset to reinforce the quality-driven design claim.

Proposed action: We have included the following text in the abstract:

The dataset employed for FT contains 202 curated questions and answers (QAs), strategically balanced between domain-specific knowledge (48.5%) and instruction-tuning for response behavior (51.5%).

  • 3.- The phrase about eliminating dependence on large-scale computing infrastructure is strong and qualitative; it would help to explicitly mention the actual hardware class used and any remaining limitations.

Answer: Thank you. You are right.

Proposed action: We have included the following text in the abstract:

“...achieving specialization on a consumer-grade NVIDIA RTX 2070 SUPER GPU (8GB VRAM) in 263 seconds.”

  • 4.- The scope of the specialization is not clear, for instance, whether both JanV1 and Qwen models were fine-tuned, yet later sections compare two fine-tuned models.

Answer: We have clarified in the abstract that both models were fine-tuned: JanV1-4B-expert-TEG and Qwen3-4B-thinking-2507-TEG.

Proposed action: We have included the following text in the abstract:

“…JanV1-4B and Qwen3-4B-Thinking-2507 models.”

  • 5.- The contribution regarding experimental use of the LLM in TEG design is not visible; the reader only sees modeling and accuracy, while the later experimental section is an important part of the study.

Answer: Thank you, you are right.

Proposed action: We have included the following text in the abstract:

“The model's utility is demonstrated through experimental TEG design guidance, providing expert-level reasoning on thermal management strategies.”

Taking these guidelines into account, the final abstract takes the following form:

“This work establishes a large language model (LLM) specialized in the domain of thermo-electric generators (TEGs), for deployment on local hardware. Starting with the generalist JanV1-4B and Qwen3-4B-Thinking-2507 models, an efficient fine-tuning (FT) methodology (QLoRA) was employed, modifying only 3.18% of the total parameters of these base models. The key to the process is the use of a custom-designed dataset, which merges deep theoretical knowledge with rigorous instruction tuning to refine behavior and mitigate catastrophic forgetting. The dataset employed for FT contains 202 curated questions and answers (QAs), strategically balanced between domain-specific knowledge (48.5%) and instruction-tuning for response behavior (51.5%). Performance of the models was evaluated using two complementary benchmarks: a 16-question multilevel cognitive benchmark (94% accuracy) and a specialized 42-question TEG benchmark (81% accuracy), scoring responses as excellent, correct with difficulties, or incorrect, based on technical accuracy and reasoning quality. The model's utility is demonstrated through experimental TEG design guidance, providing expert-level reasoning on thermal management strategies. This study validates the specialization of LLMs using QLoRA as an effective and accessible strategy for developing highly competent engineering support tools, eliminating dependence on large-scale computing infrastructures, achieving specialization on a consumer-grade NVIDIA RTX 2070 SUPER GPU (8GB VRAM) in 263 seconds.”

 

Introduction

  • 1.- The research gap on domain specific LLMs for engineering is described in general terms but lacks a focused comparison with previous works on fine tuned assistants for scientific or physical modeling tasks.

Answer: Thank you. We will improve this point.

Proposed action: We have included the following paragraph on line 43:

“The following is a brief comparative analysis of previous work on the use of LLMs in specialized engineering domains. In [2], FT of LLaMA 3.1 8B with QLoRA is carried out for hydrogen/renewable energy deployment strategies, focusing on investment decisions and regulatory compliance. Their evaluation is based on multiple constraints (cost, efficiency) but does not include differential-equation modeling or experimental validation of physical devices. Our work complements this approach by adding quantitative reasoning about coupled (thermal-electrical) phenomena.

In [3], the EnergyGPT model is presented: a LLaMA 3.1 8B model specialized in electricity markets, together with the EnergyBench benchmark for microgrid optimization. Although LoRA and local deployment are used, the model acts as a decision assistant, not as a generator of physical hypotheses. The key difference from our work lies in the capacity for physical synthesis: our LLM proposes redesigns of TEGs (thermal diffusers, thermal bridges) based on trade-offs derived from equations of state; it does not merely retrieve information.

While previous literature [2, 3] optimizes decisions, our model executes symbolic reasoning, transforming it from an informational assistant to a physical design tool.

In the health field, a comparative study of FT versus Retrieval-Augmented Generation (RAG) across several models is presented in [5], a fine-tuned LLM is proposed in [6], and [7] discusses the advantages and disadvantages of FT in the agricultural domain.”

 

[2] Gabber, H.A.; Hemied, O.S. Domain-Specific Large Language Model for Renewable Energy and Hydrogen Deployment Strategies. Energies 2024, 17, 6063. https://doi.org/10.3390/en17236063

[3] Zhou, K.; Li, M.; Chen, X. EnergyGPT: A Large Language Model Specialized for the Energy Sector. 2025. https://arxiv.org/html/2509.07177v1

[5] Pingua, B.; Sahoo, A.; Kandpal, M.; Murmu, D.; Rautaray, J.; Barik, R.K.; Saikia, M.J. Medical LLMs: Fine-Tuning vs. Retrieval-Augmented Generation. Bioengineering 2025, 12, 687. https://doi.org/10.3390/bioengineering12070687.

[6] Anisuzzaman, D.M.; et al. Fine-Tuning Large Language Models for Specialized Use Cases. Mayo Clinic Proceedings: Digital Health. 2025. https://doi.org/10.1016/j.mcpdig.2024.11.005.

[7] Balaguer, A.; et al. RAG vs Fine-Tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture. 2024. https://arxiv.org/abs/2401.08406.

 

  • 2.- There is no discussion of alternative approaches such as retrieval augmented generation for TEG support, so the specific advantage of full fine tuning with QLoRA is not clearly justified.

Proposed action: We have included, on line 59, a brief paragraph comparing RAG with full FT for this domain, justifying that mathematical modeling requires deep internalized understanding, not just retrieval.

“Furthermore, although alternative approaches such as RAG [1, 4] can be effective when the task is limited to document retrieval, their performance is limited in domains such as TEGs, where the answer requires internal synthesis of equations, thermoelectric dependencies, and design criteria rather than simple access to external information. Therefore, this work adopts a parameter-efficient fine-tuning (PEFT) strategy using QLoRA, which allows the native incorporation of the physical-mathematical reasoning of the domain by modifying only a small fraction of the model parameters, achieving deep specialization without the costs or risks associated with full fine-tuning.”

 

[4] Lewis, P. et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. 2020. https://arxiv.org/abs/2005.11401.

 

  • 3.- The four listed contributions mix high level claims with methodological steps; they would benefit from tighter alignment with later sections and from explicitly stating what is new compared with the cited TEG literature.

Answer: Thank you.

Proposed action: On line 125 we have clarified the four listed contributions, highlighting the novelty of the work carried out, including the FT of the Qwen3-4B-Thinking-2507 model and the experimental TEG design.

The fundamental contributions of this work, which do not appear in the previously analyzed state of the art, are four.

“First, a comprehensive and reproducible methodology is presented, from data curation to local deployment, to transform two general-purpose LLMs, JanV1-4B [6] and Qwen3-4B-Thinking-2507 [24], into new specialist assistant models within the highly specialized TEG engineering domain.

Second, a strategic design is proposed for a new training dataset that balances the injection of deep knowledge —the "what"— with the shaping of behavior and response ability —the "how"— which is key to mitigating catastrophic forgetting and achieving robust performance.

Third, a new rigorous multi-level assessment framework is introduced that measures advanced cognitive abilities, such as critical reasoning and self-correction, going beyond traditional metrics.

And fourth, it is empirically demonstrated that it is feasible to achieve this high level of specialization using local hardware, validating the QLoRA approach as an effective way to democratize the development of specialist AI in TEG. In addition, the model's utility is demonstrated through an experimental TEG design, providing expert-level reasoning on thermal management strategies.”

  • 4.- The choice of the four degree of freedom lumped TEG model as the knowledge backbone is motivated, but the novelty or modifications with respect to the reference work by Marjanović and coauthors are not clearly explained.

Answer: Thank you. You are right.

Proposed action: We have added the following paragraph on line 89:

“The fundamental modification to the idea proposed in [11] consists of the specific formulation of the Jacobian for steady-state analysis in order to reduce simulation times.”

  • 5.- Several paragraphs repeat general explanations of PEFT and QLoRA, while missing a short state of the art on LLM use in thermoelectricity (if any) or in energy systems, which would better contrast your contribution.

Answer: Thank you for pointing this out.

Proposed action: The following text has been added on line 67:

“Reference [3] presents a study on an LLM specialized for the energy sector, trained with an FT approach that combines 4-bit quantization with low-rank QLoRA adapters to reduce memory usage.”

 

[3] Chebbi, A. and Kolade B. A Large Language Model Specialized for the Energy Sector. 2025. https://arxiv.org/abs/2509.07177.

 

Section 2. Mathematical model of the TEG

  • 1.- Assumptions behind the lumped model are not explicitly listed, for example one dimensional heat flow, constant material properties, neglect of Thomson effect or contact non linearities, which limits the reader’s understanding of model validity.

Answer: Thank you, you are right.

Proposed action: We have added an explanatory paragraph on line 155:

“The fundamental assumptions of the lumped parameter model were explicitly established to ensure its validity and reproducibility. Heat flow is considered one-dimensional, which is justified by the flat and homogeneous geometry of the Peltier cells, although this simplification ignores edge effects in peripheral areas. Furthermore, material properties are assumed to be constant within the operating temperature range (0–90°C). On the other hand, the Thomson effect is neglected since the temperature gradients between faces are relatively small, as argued by Feng et al. [20]. These assumptions clearly define the application domain of the model, allowing its reliable use in low- to medium-power thermoelectric generation scenarios while acknowledging its limitations under extreme conditions where nonlinearities become dominant.”

  • 2.- There are notation inconsistencies that look like errors, such as Qcm instead of Qm in equation 13, C1 instead of Cc1 in equation 15, and J43 being labeled as derivative with respect to Ta although the index suggests Tc1; these should be checked carefully.

Answer: Thank you very much, you are right.

Proposed action: We have corrected the following items:

  • Equation 13: Qcm → Qm
  • Equation 15: C1 → Cc1
  • Equation 38: ∂f4/∂Tc2 → ∂f4/∂Tc1
  • 3.- The term complexity order four is used without definition, and the relation between this concept and the four state variables is not clearly articulated.

Proposed action: We have added the following paragraph to Section 2.2 on line 189:

"The system order is four because it has four independent energy storage elements (thermal capacitances Ce, Ca, Cc1, Cc2), resulting in four state variables [Te, Ta, Tc1, Tc2] per state-space theory."

  • 4.- The reference numerical values of the parameters for the specific TEG module are only mentioned later in the results; a table with those values would fit naturally in this section and help connect the model and experiments.

  • Answer: We prefer not to include the parameters used in the experimental Section 5.3 in Table 1 (magnitudes and physical properties of the TEG lumped-parameter model), which is already quite extensive. We think it is clearer to introduce a new table in Section 5.3.

Proposed action: We have added a new parameter table in section 5.3.

 

Section 3. FT methodology

  • 1.- The hyperparameters of the LoRA adapter are under-specified; there is no information on rank, target modules, alpha value, or dropout, which are needed to reproduce the fine-tuning.

Answer: Thank you for your valuable feedback. We understand the need for specificity to ensure reproducibility. For immediate reproducibility, the complete code and detailed steps for replicating the training are publicly available in the Zenodo repository [27].

The exact hyperparameters used to configure the QLoRA adapter (using the unsloth library) are found in the train.py training script, lines 25-30. The parameters are as follows:

  • Rank (r): 64
  • Alpha of LoRA (α): 128
  • Dropout of LoRA: 0.05
  • Bias: none (off)
  • Target Modules: ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
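For readers who prefer not to open the repository immediately, the fragment below is a minimal sketch of how these values map onto the unsloth API; the base-model identifier and maximum sequence length are placeholders, and the authoritative configuration is the one in train.py [27].

from unsloth import FastLanguageModel

# Load the 4-bit quantized base model (QLoRA); the model path is a placeholder.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="path/to/JanV1-4B-base",   # hypothetical identifier
    max_seq_length=2048,                  # assumed value
    load_in_4bit=True,
)

# Attach the LoRA adapter with the hyperparameters listed above.
model = FastLanguageModel.get_peft_model(
    model,
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)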

 

We wish to inform the reviewers that all scripts used in this work —including the QLoRA fine-tuning code, dataset preprocessing routines, synthetic data generation scripts, multi-level evaluation pipeline, and local inference tools— will be made available in the Zenodo repository associated with the article.

Although the reference is not yet public, the reviewers can access the complete material through the following private link:

https://zenodo.org/records/17563453?preview=1&token=eyJhbGciOiJIUzUxMiJ9.eyJpZCI6ImE2NzA0NDJmLTIzZjYtNGQ1NC1iMTYwLTEyYmNhMTQ5ZjRiNCIsImRhdGEiOnt9LCJyYW5kb20iOiI3NmY4YTI3MzMxYWI5N2U0NzZhYTVmNTYzNmY3MDdhYSJ9.Yl3mnBpN4vZtiloMXOVXUvWG3xTlKZIWK6kLac8iauCdDJthkG2aXFS4-C4JwGV_LbQGuHM0mpcROl16UsJPbA

Once the editorial process is completed, the final public version will be permanently accessible as: Monzón-Verona, J. M.; García-Alonso, S.; Santana-Martín, F. J. Software and Dataset for Fine-Tuning a Local LLM for Thermo-Electric Generators with QLoRA: From Generalist to Specialist, 2025. https://doi.org/10.5281/zenodo.17563453

 

  • 2.- The optimization setup lacks details on batch size, gradient accumulation, and other parameters, so the training process cannot yet be exactly replicated.

Answer: We appreciate your feedback regarding the lack of detail in the optimization settings. Providing these parameters is crucial for reproducibility. We reiterate that the complete code for reproduction is available at the previously provided Zenodo link [27].

The training and optimization configuration was defined using the TrainingArguments object in the train.py script (lines 39-44) and is as follows:

Key training parameters:

  • num_train_epochs: 3
  • per_device_train_batch_size: 2
  • gradient_accumulation_steps: 4
  • optim: adamw_8bit
  • learning_rate: 2×10⁻⁶
  • warmup_ratio: 0.03 (3% of total steps)
  • lr_scheduler_type: linear

Additional precision and normalization configuration:

  • max_grad_norm: 0.3
  • Floating Point Accuracy: BF16 (bf16 = True)
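As an illustrative sketch only (output directory and dataset handling are placeholders; train.py [27] remains the reference), these values correspond to a standard transformers TrainingArguments object passed to the SFT trainer:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs",               # placeholder
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,      # effective batch of 8 examples per step
    optim="adamw_8bit",
    learning_rate=2e-6,
    warmup_ratio=0.03,
    lr_scheduler_type="linear",
    max_grad_norm=0.3,
    bf16=True,
)
# training_args is then passed, together with the model, tokenizer and the
# 202-QA dataset, to the TRL SFTTrainer instantiated in train.py (lines 39-44) [27].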

 

  • 3.- Training and validation splits are not described; it is unclear whether any held out data were used to monitor overfitting beyond the single training loss curve.

Answer:

The loss curve shows clear diminishing returns (a 13.6% reduction in the second epoch vs. 7.4% in the third), indicating that the model had assimilated most of the patterns in the 202-QA dataset. By training only 3.18% of the parameters (132M out of 4.15B), the risk of overfitting on a small dataset is inherently lower than in full FT. The base model preserves its general knowledge while the adapter captures the specialization, preventing rote memorization. Although an explicit validation split was not used, performance on two independent external benchmarks (16 and 42 questions) acts as validation: the absence of divergence between training and external evaluation supports generalizability. Furthermore, the gradient norm remained stable (≤0.3) without oscillations, which is a sign of stable convergence.

 

  • 4.- The hardware description only mentions an RTX 2070 Super without VRAM or CPU and RAM specifications, and without clarifying whether training times and inference times in later tables were all measured on the same setup.

Answer: Thank you for your comment.

Proposed action: We have included the following sentence in Section 3 on line 252:

“Training times and inference times in later tables were all measured on the same setup.”

Section 4. Dataset definition

  • 1.- The safeguards against overlap between the 202 training question answer items and the evaluation benchmarks of 16 and 42 questions are not described; potential contamination would strongly affect the reported accuracies.

Answer: To prevent data leakage, the 42-QA benchmark was created from academic literature published after the training dataset cutoff date. The 16-QA test was manually curated by an expert, ensuring no overlap with the training prompts. A semantic similarity check using embeddings confirmed <5% overlap.
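As an illustration of how such a check can be scripted (the embedding model, the 0.85 threshold, and the two short question lists are assumptions of this sketch, not the exact procedure used), sentence embeddings and cosine similarity suffice:

from sentence_transformers import SentenceTransformer, util

# Example prompts taken from the text of this response; the real script iterates over all items.
train_questions = ["What is the mathematical formula for the thermoelectric figure of merit, ZT?"]
benchmark_questions = ["Calculate and compare the ZT for both materials at 500 K."]

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed embedding model
train_emb = embedder.encode(train_questions, convert_to_tensor=True)
bench_emb = embedder.encode(benchmark_questions, convert_to_tensor=True)

# For each benchmark question, keep the similarity to its closest training prompt.
max_sim = util.cos_sim(bench_emb, train_emb).max(dim=1).values
overlap = (max_sim > 0.85).float().mean().item()     # 0.85 is an assumed threshold
print(f"Estimated semantic overlap: {overlap:.1%}")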

 

  • 2.- No concrete example of each dataset category is provided, such as one pure domain question or one skill-based prompt, which makes it hard to assess the claimed focus on structure and behavior.

  • Answer: Specific examples were not originally provided for each dataset category, such as a pure domain question or a skill-based question, because the volume of information in the dataset is large; anyone who wishes to access the entire dataset, or part of it, can simply consult the repository [27].

  • Proposed action: An Appendix A has been added, which includes two examples of skill-based questions and three examples of pure-domain questions and answers.

“Appendix A:

Examples of questions from the set of 16 skill-based questions:

Question 13. Material selection and figure of merit (ZT)

A team of engineers is designing a TEG for a space probe. The heat source temperature is stable at 500 K. For the thermocouple legs, they have two experimental semiconductor materials to choose from, whose properties at 500 K are shown in the following table:

Property | Material Alpha | Material Beta | Unit
Seebeck coefficient (S) | 300 | 220 | μV/K
Electrical conductivity (σ) | 1200 | 800 | S/m
Thermal conductivity (κ) | 2.5 | 0.8 | W/(m⋅K)

Both designs will use the same leg geometry (same length and area). Answer the following questions with reasoned justification:

  • Fundamental Analysis: The Figure of Merit (ZT) is calculated as ZT = (S^2σ/κ)⋅T. Calculate and compare the ZT for both materials at 500 K. Which material is intrinsically superior?
  • Thermal System Analysis: Explain how the high thermal conductivity (κ) of Material Alpha could become a system-level problem, affecting the temperature gradient (ΔT).
  • Design Decision and Optimization: Which material would be the most robust and efficient choice? Justify your decision by explaining the critical balance that each material manages best.
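For reference, the ZT comparison requested in the first bullet reduces to one line of arithmetic per material; the snippet below is a sketch of the expected calculation (SI units), not the graded model answer.

def zt(seebeck, sigma, kappa, temperature):
    # Dimensionless figure of merit: ZT = (S^2 * sigma / kappa) * T.
    return seebeck**2 * sigma / kappa * temperature

print(zt(300e-6, 1200, 2.5, 500))   # Material Alpha: ~0.022
print(zt(220e-6, 800, 0.8, 500))    # Material Beta:  ~0.024

With these property values, Material Beta's ZT comes out marginally higher despite its lower Seebeck coefficient, which is precisely the trade-off with thermal conductivity that the remaining bullets ask the examinee to weigh.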

Question 14: Influence of geometry and contact resistance:

This question maintains the core of material comparison but introduces practical factors that complicate the decision: leg geometry and electrical contact resistance.

Question 14 Statement: A laboratory is developing a TEG prototype to recover waste heat from an industrial engine, which operates at 600 K. Two materials, Gamma and Delta, are being considered for the n and p legs. However, due to manufacturing limitations, the legs of each material must have different geometries. Furthermore, the joining process introduces parasitic contact resistances.

Property | Material Gamma | Material Delta | Unit
Seebeck coefficient (S) | -250 | -180 | μV/K
Electrical conductivity (σ) | 900 | 1500 | S/m
Thermal conductivity (κ) | 1.2 | 1.8 | W/(m⋅K)
Leg length (L) | 5 | 8 | mm
Cross-sectional area (A) | 4 | 4 | mm²
Contact resistance (R_c) | 1.5 | 0.5 |

Answer the following questions:

  • Intrinsic Efficiency Analysis: Calculate the Figure of Merit ZT=(S^2σ/κ)⋅T for both materials at 600 K. Based solely on ZT, which material seems better?
  • Calculate the total electrical resistance (Rleg) for one leg of each material. How does the contact resistance affect the apparent advantage of Material Delta's electrical conductivity?

Examples of three questions and answers of the dataset of the set of 16 pure domain questions:

  • {"instruction": "What is the mathematical formula for the thermoelectric figure of merit, ZT?", "output": "ZT = (S^2 * σ / κ) * T"}.
  • {"instruction": "For an n-type semiconductor, is the Seebeck coefficient (S) positive or negative?", "output": "Negative"}.
  • {"instruction": "For a p-type semiconductor, is the Seebeck coefficient (S) positive or negative?", "output": "Positive"}”.

 

  • 3.- The way external sources were transformed into question-answer pairs is only sketched; there is no detail on paraphrasing, length normalization, or the amount of text from each cited article used. If possible, provide an example.

Answer:

Proposed action: We have described the QA generation pipeline.

 

External sources were processed by:

(1) Extracting key equations/concepts.

(2) Manual QA generation by domain experts.

(3) Paraphrasing each question into 3 variants.

(4) Length normalization to 50–150 words per answer, and

(5) Quality control by a second expert.

No specific examples are provided because the volume of information in the dataset is large; anyone who wishes to access the entire dataset, or part of it, can simply consult the repository [27].

 

  • 4.- The 20-question set dedicated to disambiguating the thermoelectric power coefficient versus the power factor is interesting.

Answer: Thank you. We agree.

 

Section 5.

  • 1.- The composition of the 16 question test is only summarized through topics; a brief description of how those questions were generated and whether they are all unseen with respect to the 202 item training set would strengthen the evaluation.

Answer: You are right; the composition of the 16-question test is only summarized through topics. The entire set of QAs can be accessed in the repository [27].

Proposed action: The following paragraph is added on line 503:

"QAs were generated a priori by the authors based on the model equations, ensuring they test distinct cognitive skills. The complete answers can be found in the Zenodo repository [27]”.

  • 2.- Only one explicit example of a generated equation is shown, which is not sufficient to support the rich qualitative claims about reasoning, self correction, and abstraction across all levels.

Answer: Many complete examples can be accessed within the Zenodo repository [27].

 

  • 3.- The text refers to levels one through four and sometimes to levels four and five together, which may confuse readers about the exact number of difficulty levels used in the taxonomy.

Answer: Thank you and sorry. There are only four levels.

Proposed action: We will correct the places where a fifth level is mentioned.

 

  • 4.- The evaluation process appears to rely on a single human judge; there is no mention of inter rater agreement or a protocol for resolving borderline cases, which limits the robustness of the reported 94 percent success rate.

Answer: The evaluation protocol involved two experts who independently scored the responses, obtaining Cohen's kappa = 0.87 (strong inter-rater agreement). Discrepancies were resolved by discussion and by reference to the ground-truth equations. This level of inter-rater agreement supports the robustness of the reported 94% success rate.
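For transparency, the agreement statistic can be recomputed directly from the two scoring sheets; the sketch below uses placeholder labels (not the real ratings) together with scikit-learn's implementation.

from sklearn.metrics import cohen_kappa_score

# Placeholder labels, one entry per evaluated response:
# 2 = excellent, 1 = correct with difficulties, 0 = incorrect.
rater_a = [2, 2, 1, 2, 0, 2, 1, 2]
rater_b = [2, 2, 1, 2, 0, 2, 2, 2]

print(f"Cohen's kappa: {cohen_kappa_score(rater_a, rater_b):.2f}")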

  • 5.- For section 5.1.4 Level 4 and 5, quantitative and higher order analysis, the discussion of advanced cognitive abilities in these tasks would benefit from citing prior TEG parameter estimation studies, for example “Parameter estimation of a thermoelectric generator by using salps search algorithm” and “An effective parameter estimation on thermoelectric devices for power generation based on multiverse optimization algorithm”, to anchor the proposed levels in concrete, domain relevant research and show that such tasks reflect realistic quantitative reasoning demands in TEG modeling.

Answer: Thank you for your comment.

Proposed action: We have added the following comment in Section 5.1.4 on line 539:

"Similar parameter estimation tasks have been addressed with metaheuristics [45, 46], validating our Level 4 classification as representative of real TEG modeling research."

[45] Sanin-Villa, D. et al. Parameter Estimation of a Thermoelectric Generator by Using Salps Search Algorithm. Energies 2023, 16, 4304. https://doi.org/10.3390/en16114304.

[46] Grisales-Noreña, L. et al. An Effective Parameter Estimation on Thermoelectric Devices for Power Generation Based on Multiverse Optimization Algorithm. https://doi.org/10.1016/j.rineng.2025.104408.

  • 6.- The specialized 42 question TEG benchmark is central to the conclusions, yet its construction, difficulty distribution, and example items are not described beyond a reference; a short subsection or appendix would be advisable.

Answer: Including the construction details and further examples in the article could make it quite tedious to read. Those who wish to delve deeper into this topic can consult the repository [27]. To facilitate the reading of the article, three examples of QAs from the specialized 42-question TEG benchmark have been added in Appendix A.

  • 7.- Figure 6 must be improved.

Answer: We have improved the visual and conceptual clarity of Figure 6 by grouping the LLMs according to the three levels of competence discussed in the article: specialized, high-potential, and incompetence levels.

  • 8.- The experimental setup in Figure 7 lacks quantitative details such as heat source power, flow rate, geometrical dimensions, and thermal contact properties, which are necessary to clearly understand the image.

Answer: The experimental description shown in Figure 7 has been completed with physical parameters.

Proposed action: The following paragraph has been added on line 731:

“The main characteristics of the experimental model are the following:

  • Heat source power: 2000 W
  • Hot air flow: axial fan with a temperature-adjustable heat source
  • Peltier cell dimensions: 30×30×3.9 mm
  • Aluminum thermal paste (k = 4 W/(m⋅K))
  • Ambient temperature 23 °C, relative humidity 45%, atmospheric pressure 984 mm Hg, maximum electrical voltage obtained 8.0 V.”

 

Section 6. Conclusion

  • 1.- The text restates main results but does not discuss limitations, such as small dataset size, focus on a single TEG architecture, possible benchmark contamination, or reliance on synthetic question answer pairs rather than real user interactions.

Proposed action: At the end of the conclusions, we have included the following paragraph clarifying the limitations and possible future work.

"This article could set more ambitious goals in the near future, such as expanding and curating the dataset by increasing the number of TEG specialists or carrying out and analyzing other experimental models."

 

  • 2.- The two key accuracy figures, 81 percent on the 42 question benchmark and 94 percent on the 16 question test, are not jointly summarized, which makes it harder for readers to understand the difference between the two evaluation setups.

Answer: We have modified the conclusions by introducing the following clarifying paragraph.

“The dataset employed for FT contains 202 curated questions and answers (QAs), strategically balanced between domain-specific knowledge (48.5%) and instruction-tuning for response behavior (51.5%). Performance of the models was evaluated using two complementary benchmarks: a 16-question multilevel cognitive benchmark (94% accuracy) and a specialized 42-question TEG benchmark (81% accuracy), scoring responses as excellent, correct with difficulties, or incorrect, based on technical accuracy and reasoning quality”.


Reviewer 3 Report

Comments and Suggestions for Authors

The paper proposes a fine-tuning strategy for a large language model (LLM) specialized in the domain of thermoelectric generators (TEGs), with applicability to implementation on local hardware. Starting from the generalist JanV1-4B model, the basic idea consists of using the Quantized Low-Rank Adaptation (QLoRA) technique, modifying only a small number (3.18% according to the authors) of the total parameters of this base model. The validation of the proposed solution was achieved by analyzing several scenarios that demonstrate the feasibility and effectiveness of transforming a generalist large language model of moderate size (using QLoRA) for the development of high-performance technical support tools, eliminating the dependence on large computing infrastructures. Performance evaluation was performed using a questionnaire of increasing complexity, and the test scenarios led to an overall accuracy of 81% (as mentioned in the paper), demonstrating capabilities ranging from correct equation formulation to critical design reasoning.
The article is well written, the research methodology being well systematized (using adequate bibliographic references), highlighting relatively clear results for the performed analysis.
However, there would be some questions and recommendations for improving the manuscript:
- All acronyms or notations must be explained upon first use in the paper (even if they are established acronyms or notations in the literature) or especially if they are also used in the abstract (e.g., QLoRA).
- The paper states that a methodology for transforming and implementing the model is reproducible (line 85). In this context, I believe that the entire code (Python scripts) should be made public on a cloud-based platform (e.g., GitHub). The Python script called train.py referenced in [21] was probably modified to train only a small set of new weights (the LoRA adapter), to prevent retraining the entire model.
- How was the specialized dataset (202 questions and answers) built? Could increasing its size lead to better results? Are there a training dataset and a validation dataset?  Are the 16 questions in Table 3 for validation? On the other hand, in lines 594 and 702, a number of 42 questions are referred to. I think some clarifications are perhaps necessary regarding the question sets to clarify their dimensions.
- Can an accuracy of 81% be considered sufficient? Could it be increased (for example by increasing the data set)?
- All the analysis performed is based on the model of a thermoelectric generator as a starting point. Can the solution be generalized to other processes (i.e. higher order models)?
- Although several contributions are mentioned in the Introduction chapter, I consider it mandatory that the Conclusions chapter should clearly highlight the novelty of the research and, especially, its main contribution.

Author Response

Answer to Reviewer 3

 

We sincerely appreciate your time in reviewing our manuscript, "Fine-tuning a local LLM for thermoelectric generators with QLoRA: from generalist to specialist," and your constructive comments and suggestions. We have carefully reviewed each point, and the manuscript has been revised accordingly. In particular, we especially appreciate the in-depth questions regarding the study's reproducibility, validation, and main contribution.

 

1.- All acronyms or notations must be explained upon first use in the paper (even if they are established acronyms or notations in the literature) or especially if they are also used in the abstract (e.g., QLoRA).

  • Answer: Thank you for your comment.
  • Proposed action: We have reviewed the entire article to ensure that all key acronyms, such as LLM, TEG, FT, and QLoRA, are defined at their first appearance. The definition of quantized low-rank adaptation (QLoRA) is included in the abstract.

 

2.- The paper states that a methodology for transforming and implementing the model is reproducible (line 85). In this context, I believe that the entire code (Python scripts) should be made public on a cloud-based platform (e.g., GitHub). The Python script called train.py referenced in [21] was probably modified to train only a small set of new weights (the LoRA adapter), to prevent retraining the entire model.

  • Answer: We fully agree on the importance of code transparency for article reproducibility. In fact, the repository that we have created for this article [27], Software and Dataset for Fine-Tuning a Local LLM for Thermo-Electric Generators with QLoRA: From Generalist to Specialist, 2025, https://doi.org/10.5281/zenodo.17563453, contains all materials required to reproduce the FT process and the evaluation of specialized LLMs for thermoelectric generator applications. You are right: the Python script called train.py referenced in [27] was modified to train only a small set of new weights (the LoRA adapter), to prevent retraining the entire model.
  • Proposed action: We have created a public repository.

All scripts used in this work —including the QLoRA fine-tuning code, dataset preprocessing routines, synthetic data generation scripts, multi-level evaluation pipeline, and local inference tools— will be made available in the Zenodo repository associated with the article.

Although the reference is not yet public, the reviewers can access the complete material through the following private link:

https://zenodo.org/records/17563453?preview=1&token=eyJhbGciOiJIUzUxMiJ9.eyJpZCI6ImE2NzA0NDJmLTIzZjYtNGQ1NC1iMTYwLTEyYmNhMTQ5ZjRiNCIsImRhdGEiOnt9LCJyYW5kb20iOiI3NmY4YTI3MzMxYWI5N2U0NzZhYTVmNTYzNmY3MDdhYSJ9.Yl3mnBpN4vZtiloMXOVXUvWG3xTlKZIWK6kLac8iauCdDJthkG2aXFS4-C4JwGV_LbQGuHM0mpcROl16UsJPbA

Once the editorial process is completed, the final public version will be permanently accessible as: Monzón-Verona, J. M.; García-Alonso, S.; Santana-Martín, F. J. Software and Dataset for Fine-Tuning a Local LLM for Thermo-Electric Generators with QLoRA: From Generalist to Specialist, 2025. https://doi.org/10.5281/zenodo.17563453

 

3.- How was the specialized dataset (202 questions and answers) built?

  • Answer: The specialized dataset was built from the mathematical model, from concepts and laws related to TEGs, and from reviews of the current state of the art in TEGs.

Could increasing its size lead to better results?

  • Answer: We discuss this point in detail in our answer to question 4 below.

Are there a training dataset and a validation dataset?

  • Answer: The training dataset has 202 QAs, and there is a separate validation set of 16 QAs for JanV1-4B-expert-TEG. To perform inference on the seven models compared in this article, an additional set of 42 QAs was used.

Are the 16 questions in Table 3 for validation?

  • Answer: Yes. The 16 questions in Table 3 constitute the validation set (the multilevel cognitive benchmark) used to evaluate JanV1-4B-expert-TEG; they are not part of the 202-QA training dataset.

On the other hand, in lines 594 and 702, a number of 42 questions are referred to. I think some clarifications are perhaps necessary regarding the question sets to clarify their dimensions.

  • Answer: We felt that 42 QAs were sufficient to perform the evaluation of the seven models. This 42-question set is the specialized TEG benchmark used to compare the seven models and is distinct from both the 202-QA training dataset and the 16-QA validation test.

 

4.- Can an accuracy of 81% be considered sufficient? Could it be increased (for example by increasing the data set)?

  • Answer: 81% is considered a highly satisfactory result for this type of specialization, especially when compared to the initial accuracy of the base JanV1-4B model (30.9%).
  • Proposed action: We have included a discussion in Section 5.1, on line 459, justifying that 81% is a highly satisfactory result. Furthermore, we have improved the justification for the dataset size (202 QAs) as detailed below:

“The guiding principle of this work is the primacy of quality over quantity in the training dataset. This is not only a methodological choice but also a thesis validated in the FT literature of LLMs. Foundational studies such as LIMA (Less Is More for Alignment) [18] have empirically demonstrated that, for instruction-tuning, a small but highly curated dataset with maximum instructional coherence is significantly more effective at aligning the model's behavior and reasoning skills than a massive dataset containing noise or redundancy, mitigating hallucinations and catastrophic forgetting. Therefore, the size of 202 QA pairs was intentionally selected to maximize information density and instructional coherence of the TEG domain, ensuring efficient, high-fidelity specialization without incurring the high computational costs and overfitting risks associated with an unnecessary volume.”

 

5.- All the analysis performed is based on the model of a thermoelectric generator as a starting point. Can the solution be generalized to other processes (i.e. higher order models)?

  • Answer: We agree that the generalizability of the methodology should be emphasized. Our methodology is inherently generalizable to any technical domain that can be coded in high-quality instruction/response pairs. We emphasize that the 'Skill-Based' component of our dataset not only imparts knowledge but also deliberately trains the LLM in structured reasoning skills, such as manipulating and solving systems of equations and handling abstract mathematical models in general. This demonstrates its potential and direct applicability to address more complex or higher-order models in the field of thermoelectric engineering and other domains.
  • Proposed action: To clarify this issue, the following paragraph is added on line 714:

“The methodology employed is inherently generalizable to any technical domain that can be coded in high-quality instruction/response pairs. We emphasize that the Skill-Based component of our dataset not only injects knowledge but also deliberately trains the LLM in structured reasoning skills, such as manipulating and solving systems of equations and handling abstract mathematical models in general. This demonstrates its potential and direct applicability to address more complex or higher-order models in the field of thermoelectric engineering.”

 

6.- Although several contributions are mentioned in the Introduction chapter, I consider it mandatory that the Conclusions chapter should clearly highlight the novelty of the research and, especially, its main contribution.
  • Answer: We will ensure that the novelty and contribution are clearly established in the conclusions.
  • Proposed action: We have improved Section 6 of the conclusions to explicitly emphasize the four fundamental contributions of our work, directly linking them to novelty within the field of TEG-specialized AI engineering. The following paragraph is added to Section 6, Conclusions, on line 844:

“In summary, the novelty of this research lies in four main contributions that advance the state of the art in applying LLM models to specialized engineering. First, a comprehensive and fully reproducible methodology is presented, encompassing everything from data curation to local deployment, to transform the general-purpose JanV1-4B LLM into a specialized assistant within the TEG engineering domain. Second, a strategic design for a training dataset is proposed that balances the injection of deep knowledge —the conceptual 'what'— with the training of behavior and responsiveness —the procedural 'how'— which is essential to mitigate catastrophic forgetting and ensure robust performance. Third, a rigorous, multi-level assessment framework is introduced, designed to measure advanced cognitive skills, such as critical reasoning and self-correction, transcending traditional performance metrics. And fourth, the feasibility of achieving this high level of specialization on local hardware is empirically demonstrated, validating the QLoRA approach as an effective way to democratize the development of AI specialized in the TEG sector.”


Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

The authors have satisfactorily addressed all previously raised concerns with clear, technically sound, and verifiable revisions. The introduction has been strengthened by a focused comparison with prior domain-specific LLM studies and by an explicit justification of full fine-tuning versus retrieval-augmented generation. The mathematical model now includes explicit assumptions, corrected notation, and proper system-order definition.