Article
Peer-Review Record

A Novel Execution Time Prediction Scheme for Efficient Physical AI Resource Management

Electronics 2025, 14(24), 4903; https://doi.org/10.3390/electronics14244903
by Jin-Woo Kwon and Won-Tae Kim *
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 18 November 2025 / Revised: 10 December 2025 / Accepted: 11 December 2025 / Published: 13 December 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Dear authors, thank you for submitting your research work to our journal. This manuscript proposes a novel execution time prediction scheme for efficient physical AI resource management, which is a meaningful contribution. However, its quality still needs to be improved before publication.

  1. The abstract can be improved. Firstly, the core contribution of this manuscript is a novel execution time prediction scheme, but the current abstract does not effectively summarize its importance or clarify the major challenges in efficient physical AI resource management, resulting in insufficient research motivation. For example, the first three sentences of the current abstract state: ‘Physical AI enables reliable and timely operations of autonomous systems such as robots and smart manufacturing equipment under diverse and dynamic execution environments. The execution time of physical AI tasks is a complex relationship among hardware performances, computing resources and task characteristics. Since it is important to apprehend the complicated correlation, data-driven models including regression, deep learning and large-language models have been widely adopted.’ This fails to identify the specific challenges and the value that a novel execution time prediction scheme targets. Secondly, the technical details of the scheme, such as its system architecture and working methods, remain completely unknown to readers. The abstract mentions 'In this paper, we propose a Calibration-Assisted Resource-aware Execution time prediction scheme (CARE-D). It leverages a Deep Neural Network to effectively model the complex nonlinear relationship among hardware performances, dynamically allocated computing resources, and task characteristics.', but this does not convey the details of the technical solution. Thirdly, the technical effectiveness of the scheme in addressing the challenges cannot be judged, because no experiments or evaluation metrics are designed specifically for those challenges. For example, line 23 of the current abstract claims that 'Experiments show that CARE-D improves cross-environment prediction reliability, achieving an average 7.3% accuracy over zero-history performance.' How does this demonstrate that the challenges in efficient physical AI resource management, as described in the title, have been addressed or improved?
  2. The introduction can be improved. Firstly, the authors should reorganize their references around the challenges in efficient physical AI resource management, introduce the efforts of their peers, and then present the research gaps and motivations, so that readers can understand why it is necessary and important for this manuscript to propose a novel execution time prediction scheme. Secondly, the authors list four contributions but do not describe their technical details, such as the unique structure and working process of the novel execution time prediction scheme. Moreover, these four points are inconsistent with the contributions stated in the abstract.
  3. Section 2 can be improved. This section does not fully capture the efforts made by peers in addressing the challenges of efficient physical AI resource management. Readers cannot tell which methods peers have used, whether methods similar to the novel execution time prediction scheme already exist, or what the limitations and research gaps are.
  4. Section 3 can be improved. Firstly, this section contains the core technical solution of the manuscript, but unfortunately I could not find the technical details of the novel execution time prediction scheme, such as its unique network structure, working methods, and calculation equations, or the deeper reasoning needed to solve the challenges in efficient physical AI resource management. Secondly, the figures in this section need further improvement. For example, in Figure 1, what are the resources in efficient physical AI resource management? Where is the computing platform? How do computing platforms schedule resources? In Section 3.2.1 (Description of DNN), a diagram could be provided to illustrate the technical details and network structure of the novel execution time prediction scheme. Thirdly, the algorithm steps and calculation equations in this section need further improvement. For example, in Algorithm 1, how is the algorithm name related to the novel execution time prediction scheme? Steps 1–5 have no corresponding content or calculation equations in the main text. Additionally, there is no need to label Algorithm 1 below as Figure 2 or as ‘Few history based calibration algorithm’; these labels are neither relevant nor necessary and should be removed. Equations (1)–(5) do not adequately describe the unique computational process of Algorithm 1 or of the novel execution time prediction scheme.
  5. Section 4 can be improved. The experiments in this section are difficult to interpret: for example, have the challenges in efficient physical AI resource management been addressed, or has progress been made on them? Figures 3 to 10 all compare accuracy. How do these results relate to efficient physical AI resource management, and what progress do the results of the novel execution time prediction scheme demonstrate?
  6. Section 6 can be improved. The conclusion should summarize how the novel execution time prediction scheme has been effective or has made progress in addressing the challenges of efficient physical AI resource management. The limitations or shortcomings of the proposal should then be identified to guide the next steps of the work.
  7. The references can be improved. The relevant literature on addressing the challenges in efficient physical AI resource management has not been collected thoroughly.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The paper presents a new approach, CARE-D, for resource-aware execution time prediction that can enhance the adaptability of physical AI systems to changing environments. The authors identify a real constraint of the data-driven models currently used in AI: the need for data gathered from a specific environment. They support their idea of integrating deep neural networks with a small-sample calibration approach that uses relatively few execution records, which is practical and reflects the variability of execution environments. The experimental results give an initial indication of modest but meaningful improvements in accuracy, although they are not impressive at face value. Overall, the paper articulates a clear problem and suggests a suitable approach. However, additional, specific information is needed about the robustness of the model and how far it can be generalized.

However, the following elements need to be addressed in the manuscript:

  • An explicit novelty statement is missing from the Introduction. A clear and concise novelty statement would help readers understand the novel element of this research study.
  • The paper reports 123,930 execution records from 11 environments (Section 4.1.1), but later mentions “the collected dataset of 74,000 execution records” and “five hardware environments listed in Table 4.1” (Sections 4.3–4.3.1). Can the authors reconcile these inconsistencies in sample count and number of environments, and clarify exactly which subset was used for which experiments and figures?
  • Execution time is defined as including model loading, computation, and termination, and the dataset mixes discrete-event simulations and neural inference tasks. How are these stages consistently measured across such heterogeneous workloads and operating systems, and is there any per-environment or per-application normalization that might confound cross-environment comparisons?
  • Equation (1) defines a general mapping from the feature vector to execution time, while the calibration description (Equations (2)–(4)) effectively assumes a one-dimensional resource-ratio function and a multiplicative scaling factor. Can the authors precisely describe how the full DNN predictor over a 20-dimensional feature vector is reduced to the scalar efficiency function, and whether this reduction imposes any implicit constraints (e.g., separability between the CPU ratio and other features)?
  • The calibration method assumes that the maximum execution time at minimum resource (e.g., 10% CPU) provides a “stable reference” independent of hardware variation and that runtime scales proportionally via the calibration factor. Given known non-monotonic effects such as OS scheduling jitter, turbo boost, thermal throttling, and background load, what empirical evidence is provided that per-task execution time is monotonic and smoothly rescalable in CPU percentage on each environment?
  • The few-history calibration uses a very small number of samples from the target environment to derive the scaling factor. How sensitive is this procedure to noise in individual measurements, and have the authors quantified confidence intervals or the variance of the factor across repeated runs under the same configuration?
  • For larger calibration sets, the paper states that performance often degrades due to “overfitting to the limited calibration subset” and that accuracy declines when more than 10 samples are used. What is the exact estimator used for the scaling factor as the number of calibration samples grows (single-sample, average ratio, regression over ratios, etc.), and why would a simple averaging or robust regression of ratios lead to overfitting rather than reduced variance? (A minimal estimator sketch follows this list.)
  • The LOEO scheme uses one held-out environment as test and the remaining ones for training, while calibration samples in the target environment are drawn from “measured execution time from the target environment.” Are these calibration samples drawn from the same workloads and CPU settings as those in the test set, and if so, how is leakage between calibration and evaluation avoided (e.g., the same workload–resource combinations appearing both in calibration and test)? (See the LOEO split sketch after this list.)
  • The dataset includes 20 features across workload, hardware, and performance indexes. Several of these (e.g., defect rate, processing time, waiting time, work time, quantity produced) are highly application-specific. How is feature availability handled when moving to new applications or plants with different logging schemas, and does the model’s reported cross-environment generalization rely on the assumption that the same feature set exists in all deployments?
  • For categorical hardware and OS attributes such as CPU architecture, CPU model, and operating system, what encoding strategy is used (one-hot, learned embeddings, ordinal codes, etc.)? In the LOEO setting, does the test environment ever contain unseen categorical values, and if so, how are those handled by the DNN and baselines?
  • The performance index features for CPU, memory, and I/O are described as “normalized metrics,” presumably derived from separate benchmarks. At what point in the pipeline are these indices measured (before training, under idle conditions, under load), and is there a risk that those indices already encode environment-specific runtime scaling, effectively giving the DNN a privileged signal that simpler baselines do not exploit?
  • The training procedure log-transforms both inputs and target execution times. For features such as CPU percentage (1%–100%) and execution times ranging down to 0.15 seconds, what offset or clipping strategy is used to handle values near zero, and how sensitive are results to this transformation choice?
  • The reported execution time range is 0.15–2847.5 seconds with median 127.3 seconds, yet the zero-history MAE is 574.3 seconds and drops to 32.3 seconds after calibration. Such a large baseline MAE suggests very large errors on long-running tasks. Can the authors provide per-quantile or per-environment error breakdowns (e.g., median absolute error, error for top 10% longest runs) to show that improvements are not dominated by a small number of extreme outliers?
  • The “accuracy” metric is defined as the proportion of predictions within 10% relative error. Given the heavy-tailed distribution of execution times, this threshold corresponds to a sub-second tolerance for short runs and hundreds of seconds for very long runs. Have the authors checked whether the conclusions hold under alternative metrics (e.g., median relative error, log-scale RMSE, or different thresholds) that balance the influence of short and long tasks? (A metric-comparison sketch follows this list.)
  • The paper claims that CARE-D architectures show robust zero-history generalization and that deeper models overfit because “the applications used in this paper contain monotonic relationship between task and environment features and execution times.” Can the authors provide evidence that the true relationship is indeed approximately monotonic and low-complexity (e.g., partial dependence plots, shape-constrained fits), and explain why an overparameterized DNN with dropout and batch normalization still overfits under LOEO rather than being regularized enough?
  • The comparison with regression models and other deep learning models suggests that CARE-D benefits strongly from few-history calibration whereas other methods do not. Were all baselines given access to the same calibration mechanism (e.g., the same scaling-factor-based calibration or a method suited to each model), and if so, how exactly was calibration applied to the tree-based and linear models to ensure a fair comparison?
  • The MESR analysis concludes that a single calibration sample is almost always optimal and that performance declines when more than 10 samples are used. In a real deployment where the true optimal number of samples is unknown, what concrete strategy is proposed for selecting it online without access to ground truth on future workloads, and how would mis-specifying it affect QoS guarantees?
  • The environments in Table 3 cover only high-end Intel desktop/server CPUs (10th, 13th, 14th gen) and two operating systems (Windows 11 and Ubuntu 22.04). How confident can one be that the calibration and scaling behavior would hold for different processor families (e.g., ARM, low-power embedded CPUs, GPUs) or for virtualized cloud environments where CPU “percentage” and contention semantics differ significantly?
  • In Section 3.4, CARE-D is proposed as the basis for resource allocation by solving an optimization over resource quotas. Is this optimization performed by exhaustively searching discrete CPU quotas, or by some continuous approximation? Has the impact of prediction error on deadline violations been quantified in a closed-loop scheduling simulation rather than only at the prediction level? (A minimal quota-search sketch follows this list.)
  • The calibration assumes a stable mapping between CPU percentage and execution time within each environment. How is CPU percentage enforced (e.g., cgroups quotas, taskset/affinity, hypervisor-level limits), and are there measurements showing that actual CPU utilization and effective compute capacity scale proportionally with the configured ratio over the 1%–100% range?
  • The dataset contains “multiple repetitions per configuration” to capture variability, but it is not clear whether these repetitions are averaged, all included independently, or otherwise aggregated. How is intra-configuration variance handled in training and evaluation, and is there any heteroskedastic modeling (e.g., per-configuration variance estimates) or are all samples treated as equally informative?
  • The paper includes sophisticated deep learning baselines such as GNNs and Transformers but concludes that their inductive biases are “misaligned” with execution-time prediction. Given that these models can be sensitive to architecture and feature design, what steps were taken to ensure they were not simply under-tuned (e.g., hyperparameter search, feature ordering for Transformers, graph construction choices for GNNs), and are there ablation results showing that their poor performance is not just due to suboptimal configurations?
  • CARE-D uses dropout (0.1), batch normalization, Adam with learning rate 0.001, and early stopping. Were these hyperparameters shared across all DNN variants and LOEO folds, or tuned per architecture and per fold? Have the authors quantified variance across random seeds, and are the reported gains over baselines statistically significant with respect to this training randomness?
  • The dataset aggregates simulation and AI-inference workloads but does not clearly separate their performance characteristics in the reported metrics. Do the models perform similarly across these two workload categories, or is CARE-D primarily gaining on one type? A per-workload-type analysis could reveal whether the proposed method generalizes across fundamentally different compute patterns.
  • The calibration procedure is said to “preserve the nonlinear shape” of the DNN’s predicted runtime curve while rescaling its vertical axis. In environments where the true scaling with CPU percentage shifts non-uniformly (e.g., more benefit at mid-range quotas due to cache effects), how does a purely multiplicative adjustment cope with such shape changes, and is there any evidence from the experiments that such non-uniform shifts do not occur?
  • Finally, the paper claims that CARE-D is intended for “real-time resource scheduling and reliable operation” in next-generation physical AI systems. Given the observed baseline accuracy of about 30–37% within a 10% error band even after calibration, what guarantees can realistically be offered in latency-sensitive or safety-critical scenarios, and is there any mechanism (e.g., prediction intervals, worst-case bounds) to avoid under-provisioning when the predictor is wrong?
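
To make the estimator question above concrete, the following minimal Python sketch (hypothetical code, not taken from the manuscript) illustrates three plausible estimators of a multiplicative calibration factor from a handful of measured/predicted execution-time pairs; the function names and example numbers are assumptions for illustration only.

```python
import numpy as np

def estimate_scaling_factor(measured, predicted, method="mean"):
    """Estimate a multiplicative calibration factor from a few history samples.

    measured, predicted: execution times (seconds) for the same
    workload/resource configurations in the target environment.
    """
    ratios = np.asarray(measured, dtype=float) / np.asarray(predicted, dtype=float)
    if method == "single":         # only one calibration sample available
        return float(ratios[0])
    if method == "median":         # robust to a single noisy measurement
        return float(np.median(ratios))
    return float(np.mean(ratios))  # simple average of ratios

def calibrate(predicted_curve, factor):
    """Rescale the predicted runtime curve by the factor, preserving its
    shape across CPU ratios (purely multiplicative adjustment)."""
    return factor * np.asarray(predicted_curve, dtype=float)

# Example: three calibration runs in a new environment
measured = [210.0, 198.5, 205.2]
predicted = [180.0, 171.0, 176.4]
alpha = estimate_scaling_factor(measured, predicted, method="median")
print(round(alpha, 3))  # ~1.163: the new environment runs about 16% slower than predicted
```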
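
The leakage concern in the LOEO question can likewise be illustrated with a small sketch, assuming a tabular dataset with an environment column; `loeo_splits`, the column name, and the calibration-sample handling are illustrative assumptions, not the authors' actual protocol.

```python
import pandas as pd

def loeo_splits(df, env_col="environment", n_calib=1, seed=0):
    """Leave-one-environment-out splits with a disjoint calibration subset.

    For each held-out environment, a few records are set aside for calibration
    and removed from the test set, so the same (workload, CPU quota) rows are
    never used for both calibration and evaluation.
    """
    for env in df[env_col].unique():
        train = df[df[env_col] != env]
        target = df[df[env_col] == env]
        calib = target.sample(n=n_calib, random_state=seed)
        test = target.drop(calib.index)  # leakage-free evaluation set
        yield env, train, calib, test

# Usage (records_df is a hypothetical DataFrame of execution records):
# for env, train, calib, test in loeo_splits(records_df):
#     ...
```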
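
The alternative-metrics question can be made concrete with the sketch below, which computes the within-10% "accuracy" alongside median relative error and log-scale RMSE on a toy heavy-tailed example; the numbers are illustrative and not taken from the paper.

```python
import numpy as np

def within_tolerance_accuracy(y_true, y_pred, tol=0.10):
    """Fraction of predictions within a relative-error threshold."""
    rel_err = np.abs(y_pred - y_true) / y_true
    return float(np.mean(rel_err <= tol))

def median_relative_error(y_true, y_pred):
    return float(np.median(np.abs(y_pred - y_true) / y_true))

def log_rmse(y_true, y_pred):
    """RMSE in log space, which weights short and long runs more evenly."""
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))

y_true = np.array([0.5, 12.0, 130.0, 2400.0])   # heavy-tailed execution times (s)
y_pred = np.array([0.6, 11.0, 140.0, 2100.0])
print(within_tolerance_accuracy(y_true, y_pred))  # 0.5: two of four within 10%
print(median_relative_error(y_true, y_pred))
print(log_rmse(y_true, y_pred))
```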
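
Finally, the resource-allocation question presumes some search over CPU quotas; the sketch below shows one hypothetical exhaustive search over discrete quotas with a safety margin, not the authors' actual optimization, with `select_min_cpu_quota` and the toy predictor introduced purely for illustration.

```python
def select_min_cpu_quota(predict_time, features, deadline_s,
                         quotas=range(10, 101, 10), margin=1.2):
    """Pick the smallest discrete CPU quota whose margin-inflated predicted
    execution time still meets the deadline; fall back to 100% if none does.

    predict_time(features, quota) is assumed to be a calibrated predictor.
    """
    for quota in quotas:  # exhaustive search over discrete quotas
        if margin * predict_time(features, quota) <= deadline_s:
            return quota
    return 100            # conservative fallback when no quota meets the deadline

# Toy predictor: runtime inversely proportional to the allocated CPU share
toy_predict = lambda feats, q: feats["base_time_s"] * 100.0 / q
print(select_min_cpu_quota(toy_predict, {"base_time_s": 30.0}, deadline_s=60.0))  # 60
```

The safety margin is one simple way to guard against under-provisioning when the predictor errs; prediction intervals or worst-case bounds, as raised in the last comment above, would be a more principled alternative.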

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Dear authors,

Thank you for carefully revising the manuscript and responding to the review comments one by one. After these careful revisions, the quality of the manuscript has improved significantly compared with the original. However, the following modifications need to be completed before publication.

  1. The correctness and symbol definitions of all equations need to be thoroughly checked; for example, ρ_i in Equation (6) and k in Equation (7) are undefined. Is k in Equation (5) the same variable as k in Equation (7)?
  2. Algorithm 1 does not need to be labeled as Figure 2; the subsequent figure numbers should then each be reduced by 1.
  3. The text in the experimental result charts is too small to be seen clearly.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The authors have addressed all the comments.

Author Response

We appreciate your positive feedback. Thank you for your valuable suggestions throughout the review process, which were very helpful in improving our paper.
