1. Introduction
Artificial intelligence (AI) has moved from research laboratories into production across language, vision, and speech, supported by advances such as Transformer-based language models and high-performing vision architectures that achieve strong accuracy across diverse tasks. However, recent scaling evidence indicates that model-centric optimization alone yields diminishing returns: additional accuracy increasingly requires disproportionate increases in data and compute, and performance plateaus often reflect data limitations rather than architectural novelty [1,2,3,4].
Real-world deployment incidents—ranging from biased decision-support systems to uneven recognition accuracy across demographic groups—have been traced to training data that are noisy, imbalanced, or non-representative. These cases highlight how data quality conditions fairness, reliability, and downstream risk, and motivate a data-centric AI stance in which dependable performance is driven more by data accuracy, diversity, and governance than by increasingly complex models [5,6,7,8,9].
Despite this shift, the economics of building training data for AI systems remain underspecified. In large projects, data work spans collection, cleaning, labeling, validation, storage, governance, and security, yet budgeting practices often reduce these activities to coarse unit prices or linear rules of thumb. Labeling and quality assurance can consume a major share of effort and cost—particularly for unstructured modalities such as images, speech, and video—because annotation is labor-intensive, domain dependent, and tightly coupled with quality targets [10,11,12].
A conventional procurement-based cost model assumes linear additivity of process-level expenditures, expressed as

TPC = Σ_p (u_p · s_p),

where u_p denotes the unit price for process p and s_p represents the ex ante workload scale (e.g., records, gigabytes, or minutes) assigned to process p [1]. This formulation rests on two implicit assumptions: (i) cost separability, meaning that the cost of each process is independent of all others, and (ii) a constant marginal effort per unit of workload.
In practice, however, AI training-data construction departs substantially from both assumptions. Such pipelines typically involve iterative quality-assurance (QA) loops in which items that fail inspection are subjected to relabeling or recoding before undergoing re-inspection [13]. If a fraction r of items requires rework at each iteration, the expected effective workload expands as N·(1 + r + r² + ⋯) = N/(1 − r) for 0 ≤ r < 1, yielding a non-linear expected cost term u_p·N/(1 − r) in place of the linear term u_p·N.
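The geometric-series argument above can be made concrete in a few lines of code. The sketch below is illustrative only (the function names are ours, not from the study):

```python
def effective_workload(n_items, rework_rate):
    """Expected number of item-passes when a fraction r of items fails QA
    on every iteration: N * (1 + r + r^2 + ...) = N / (1 - r)."""
    if not 0.0 <= rework_rate < 1.0:
        raise ValueError("rework rate must satisfy 0 <= r < 1")
    return n_items / (1.0 - rework_rate)


def expected_process_cost(unit_price, n_items, rework_rate):
    """Non-linear expected cost term u_p * N / (1 - r), replacing linear u_p * N."""
    return unit_price * effective_workload(n_items, rework_rate)
```

For example, at r = 0.2 a 1000-item batch behaves like 1250 items, a 25% amplification over the linear estimate.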
Furthermore, the rework rate r is not a fixed constant; it is a function of multiple interacting factors, including task difficulty (d), target quality threshold (q), reviewer strictness, toolchain maturity (including automation and large-language-model assistance), and the degree of domain-expert involvement (e), i.e., r = f(d, q, reviewer, tools, e). As a consequence, the total project cost exhibits interaction effects among scale, unit price, and quality intensity, and undergoes non-linear amplification under stricter QA regimes. These structural properties motivate the adoption of flexible, non-linear learners—in particular, tree-based ensemble methods—which are capable of capturing higher-order interactions while remaining interpretable through post hoc feature attribution techniques.
Although automation aids such as active learning, semi/weak supervision, and large language model (LLM) assistance can reduce marginal annotation effort, these methods do not directly provide procurement-ready project-level cost estimates nor explain how operating choices (expert participation, target quality intensity, anonymization, cloud operations) shift total budgets [14,15,16,17]. This gap is especially consequential in public programs, where budgets are committed ex ante and mid-course adjustment is limited [9,18].
This study addresses the gap by developing and validating a machine-learning–based cost estimation model for AI training-data construction, grounded in a comprehensive multi-year public program. We compile and standardize 386 usable final reports from Korea’s national AI training-data initiative (2020–2022, nominal KRW), representing approximately KRW 1.2 trillion (approximately USD 1 billion) in public investment. From each report, we extract a harmonized schema of 35 cost-influencing variables observable across projects and construct an analysis matrix of 386 × 24 numerical predictors after preprocessing.
This paper contributes in three ways. First, it provides a harmonized, multi-dimensional cost-driver schema for large-scale public AI training-data projects. Second, it evaluates raw, PCA-enhanced, and factor-analytic (FA) representations under nested cross-validation, showing how structural factors can improve both predictive performance and interpretability. Third, it translates model explanations into actionable budgeting levers through SHAP-based attribution and policy-oriented sensitivity simulations. Accordingly, this study addresses the following research questions:
RQ1. Do FA-augmented models reduce prediction error relative to Baseline and PCA-enhanced models?
RQ2. Which factors/variables contribute most to total project cost (TPC), and what are their marginal effects and sensitivities?
RQ3. How do policy/operations scenarios (difficulty, process shares, expert participation, cloud use, anonymization) alter predicted TPC?
The remainder of this article is organized as follows. Section 2 reviews related work on AI training-data cost estimation. Section 3 presents the research design, including data construction, preprocessing, alternative representations, model training/validation, and interpretability analyses. Section 4 reports the empirical results, Section 5 discusses implications and deployment considerations, and Section 6 concludes with limitations and future research directions.
2. Literature Review of Cost Estimation Techniques
2.1. Concept and Measurement of AI Training-Data Cost
As AI systems have matured from proof-of-concept to production, the economics of training data—its acquisition, refinement, labeling, validation, governance, and security—has become a decisive determinant of project feasibility and performance [7,8,9]. Early budgeting practices treated data work as a sub-item of overall development cost and relied on coarse heuristics such as cost per record or per GB, headcount-based timescales, or linear rules of thumb borrowed from software engineering [10,12]. These approaches struggle to capture cross-modal differences, workflow decomposition (collection → cleaning → labeling → validation/QA), and the effects of workforce mix, cloud operations, and privacy compliance on total project cost (TPC).
Measurement frameworks subsequently evolved to incorporate process granularity, using per-task unit prices (labeling/cleaning/QA), process shares (e.g., QA ratio), scale metrics (records, GB, minutes), and quality outcomes (F1, mAP) as proxies for required effort [10,11,12]. Studies in unstructured modalities report that labeling and QA may consume 60–80% of the effort/cost due to annotation difficulty and domain dependence, motivating human-in-the-loop and workflow-optimized annotation systems [15,19]. Two limitations persist: multicollinearity among size/time/difficulty variables undermines stable inference in linear models, and linear additivity obscures nonlinearities and interactions that govern real budgets.
2.2. Toward Multi-Factor, Explainable Cost Modeling
Evidence from data-centric AI suggests that performance plateaus increasingly reflect data quality and coverage rather than architectural novelty; consequently, organizations have intensified investments in governance, validation, and privacy–cost components that are not well captured by average unit prices alone [7,8,9,20]. Recent cost-estimation studies therefore model dataset building as a multi-factor production process with inputs spanning data scale, unit prices by process, workforce composition and coordination costs, operating environment (cloud/tools), compliance (anonymization/access control), and quality targets [10,12].
Two methodological strands recur. First, principal component analysis (PCA) mitigates multicollinearity and can improve generalization, but components often lack semantic interpretability [16]. Second, factor analysis (FA) yields interpretable latent factors aligned with managerial and policy levers (e.g., unit-price system, workforce mix, quality intensity, scale, time/difficulty, process shares, operating environment, privacy). Combining PCA/FA features with tree-based ensembles (Random Forest, XGBoost, LightGBM) is effective because these models capture nonlinearities and interactions and remain robust to outliers and mixed scales [4,16,17].
Parallel streams on cost reduction—active learning, semi/weak supervision, and LLM-assisted labeling—demonstrate lower marginal annotation effort and faster convergence to quality targets [21,22,23,24], but they do not in themselves deliver procurement-ready project-level cost estimates. The literature therefore converges on the need for models that are predictively strong and policy interpretable, mapping driver-level changes into TPC impacts that can inform planning and contracting [16,20,25].
The release of GPT-3 accelerated the practical adoption of large language models in automation workflows, including data annotation, which may change present-day cost structures relative to our 2020–2022 cohort [17].
2.3. Cost Drivers, Quality Targets, and TPC
For the interoperability of dataset-quality metadata and consistent reporting across projects, we refer to the W3C Data Quality Vocabulary (DQV) [26].
Empirical work consistently finds that data scale (records/GB/minutes) and process unit prices (labeling, cleaning, QA) are primary contributors to TPC across modalities, with higher elasticities in image/speech/video due to annotation difficulty and review intensity [10,11,12]. Workforce configuration adds coordination and supervision overhead; cloud execution introduces storage/monitoring costs but may reduce time-to-delivery; and privacy measures impose persistent governance costs whose magnitude varies by domain sensitivity [12,20].
Explainability methods such as SHAP have been applied to boosted-tree cost models to rank driver importance and support scenario analysis, turning model outputs into actionable levers for budgeting and policy [25]. Robust validation (nested cross-validation, out-of-time tests) and uncertainty quantification (bootstrap, quantile prediction) are emphasized for public-sector decision cycles where budget revisions are rigid and stakes are high [16].
Research gaps remain: (i) the availability of large, standardized, cross-project corpora with harmonized variables spanning process economics, workforce, operations, privacy, and quality is limited; (ii) there is insufficient evidence on whether factor-augmented representations improve both accuracy and interpretability; and (iii) there are few policy-ready sensitivity mappings translating levers into TPC deltas [4,12,16,20,25].
2.4. Prior Cost-Estimation Approaches and Gaps
Cost estimation for data-intensive projects has traditionally relied on variants of bottom-up costing, analogical estimation, and parametric models. Bottom-up approaches decompose work into activity lists (e.g., collection, cleaning, labeling, inspection, and management) and multiply expected effort by wage or unit-price rates. These methods are transparent and procurement-friendly, but they are labor-intensive and sensitive to how granularly tasks are specified, often undercounting rework and coordination overhead.
Analogical estimation extrapolates from prior projects with similar scope. While practical in early planning, it tends to be unstable when the reference set is small or when reporting conventions vary across agencies and vendors. In public programs, analogical methods also inherit legacy pricing biases (e.g., inflated unit prices for historically under-specified QC), which can persist across cycles without explicit correction.
Parametric models map a small set of drivers to total cost via regression-like relationships. In software engineering, this family includes function point analysis, COCOMO-style models, and productivity-based cost functions. For AI data construction, however, the driver space is broader: scale (records/GB/minutes), modality and task difficulty, workforce structure, security constraints, and quality thresholds interact in non-linear ways, complicating the selection of a compact driver set.
Recent work proposes using production-function perspectives, where total cost is explained by scale and process intensity (e.g., labeling minutes per item) with adjustment factors for difficulty and governance requirements. These perspectives highlight the centrality of unit prices as policy levers, but they often rely on strong assumptions about constant returns or stable productivity that may not hold when toolchains shift (e.g., automation or LLM assistance).
Another stream adopts activity-based costing and time-and-motion logging to capture true effort, including rework cycles. This is conceptually appealing but requires granular operational telemetry that is rarely standardized across vendors and programs. Without standardized schemas and audit protocols, time logging can become inconsistent and difficult to compare across projects.
Across these approaches, two recurring gaps motivate machine-learning-based estimation in the public sector. First, heterogeneous reporting makes it difficult to harmonize inputs and compare projects; this motivates a common reporting schema that can support both costing and accountability. Second, purely deterministic estimates struggle to capture complex interactions among scale, unit prices, and optional governance features; flexible models can better represent such interactions while still supporting interpretation via feature attribution and scenario analysis.
Accordingly, a practical research goal is not merely minimizing predictive error, but producing a policy-ready model that is transparent, updateable, and compatible with procurement workflows—i.e., a model that can be audited and used to justify budget decisions rather than a black box.
2.5. Toward Data-Centric, Policy-Ready Estimation
A data-centric perspective reframes budgeting as an iterative process of improving data quality and reporting consistency, not only improving algorithms. For cost estimation, this implies that the reliability of forecasts critically depends on the stability of variable definitions and the availability of comparable historical records. A standardized schema (and consistent coding rules) is therefore a governance asset: it reduces ambiguity in procurement, supports benchmarking, and enables the periodic refreshing of estimation models as the program evolves.
Policy-ready estimation also requires interpretability. Decision makers need to understand which drivers dominate cost and how changes in a driver (e.g., labeling unit price or required inspection stringency) translate into budget impacts. Tree-based ensemble models paired with post hoc explanation (e.g., SHAP) offer a practical compromise: they can capture non-linearities in tabular data and still provide local and global attributions.
Fairness and accountability concerns matter, even for cost estimation. If certain modalities or domains systematically incur higher costs due to security constraints or workforce scarcity, programs may inadvertently underfund those domains or create procurement incentives that degrade quality. Transparent driver reporting helps detect such structural disparities and supports equitable resource allocation.
Finally, policy-ready models should be designed for maintenance. Rather than treating estimation as a one-time academic exercise, agencies can institutionalize periodic retraining with new cohorts, drift monitoring for unit prices, and calibration checks for error across project types. This aligns with established MLOps principles such as documentation, versioning, and evaluation under distribution shift.
In this study, these considerations motivate the focus on (i) harmonized drivers drawn from standardized program reporting, (ii) dimensionality reduction to stabilize inputs and reduce collinearity, and (iii) interpretable ML with sensitivity analysis for actionable governance levers.
These design choices aim to provide a forecasting tool that is both accurate and operationally usable in public AI data programs, supporting budgeting, procurement, and post hoc accountability.
3. Methodology (Research Design)
3.1. Research Procedure
We follow a four-step procedure to develop a machine-learning cost estimation model for AI training-data projects while reflecting operational characteristics and ensuring validity. First, we collected and harmonized 386 government project final reports (2020–2022), defined Total Project Cost (TPC) as the outcome, and specified cost-driver dimensions from prior work (scale, unit prices, workforce, quality, operations, and governance).
The end-to-end workflow is summarized in Figure 1.
Second, we constructed a consistent numeric feature space and created three input tracks: Baseline (selected numeric predictors), PCA-enhanced (Baseline plus principal components), and FA-enhanced (Baseline plus latent factors). Third, we validated structure and predictive stability through PCA/FA diagnostics and nested cross-validation across tracks. Fourth, we evaluated predictive validity using a hierarchy of models (linear baselines to tree ensembles) and conducted interpretability and policy-facing sensitivity analyses using SHAP and scenario simulations.
Table 1 summarizes the track configurations. Under the current evaluation protocol, performance differences between PCA-enhanced and FA-enhanced tracks are small, and model ranking is primarily driven by algorithm choice rather than representation choice.
3.2. Data and Variable Construction
We standardized administrative final reports from 386 government-funded AI training-data projects conducted between 2020 and 2022 and harmonized 35 common reporting items per project (Table 2). From these, we retained a final set of 24 numeric predictors suitable for learning, covering scale (records, GB, minutes), per-process unit prices (labeling/cleaning/QA), process shares, workforce mix, operational flags (cloud use, expert participation), privacy/governance (anonymization), and quality indicators (F1, mAP).
The dependent variable is TPC (million KRW), defined as end-to-end expenditure including collection, cleaning, labeling, validation/QA, quality management, expert consultation, labor, and infrastructure. To reduce the risk of target leakage, we do not use any predictors that are algebraically derived from TPC (e.g., Total_Cost_per_Record, Total_Cost_per_GB, Total_Cost_per_Minute). Such derived indicators, when computed, are reported only descriptively and excluded from supervised learning and representation learning (PCA/FA).
Inflation adjustment: All cost figures (TPC and unit prices) are recorded in nominal KRW, as reported in the administrative final reports for each project year (2020–2022). Because the observation window is short and the models are trained on standardized predictors, we use nominal values in the main analysis. As a robustness check, we also deflated TPC and unit-price variables to KRW in 2022 prices using annual consumer price index (CPI) deflators (Statistics Korea, KOSIS; CPI base year 2020 = 100). Predictive performance and key driver rankings changed only marginally, and the main conclusions were unchanged (see Section 4.4).
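For readers who wish to reproduce the robustness check, the deflation arithmetic is a simple CPI ratio. The index levels below are placeholders for illustration, not the KOSIS figures used in the study:

```python
# Hypothetical CPI index levels (base 2020 = 100); the actual analysis
# uses annual deflators published by Statistics Korea (KOSIS).
CPI = {2020: 100.0, 2021: 102.5, 2022: 107.7}


def to_2022_prices(nominal_krw, year, cpi=CPI):
    """Deflate a nominal figure to 2022 prices: nominal * (CPI_2022 / CPI_year)."""
    return nominal_krw * cpi[2022] / cpi[year]
```

Applying the same ratio to TPC and all unit-price variables preserves their relative structure within a project year, which is why driver rankings are largely insensitive to the adjustment.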
3.3. Preprocessing and Alternative Representations (PCA/FA)
The raw administrative data were originally stored in a single spreadsheet in a transposed format (variables as rows, projects as columns): the first row (“Dataset Name”) provided identifiers for 386 projects, and the remaining 35 rows corresponded to harmonized reporting items. To build an analysis-ready matrix, we first reshaped the sheet into the standard form (one row per project, one column per variable), then normalized data types by removing currency/unit strings (e.g., “8500 won”, “55 persons”), parsing numeric tokens from mixed notations (e.g., “0.8(80%)”, “approximately 66,600 won”), and applying explicit coding rules for binary/ordinal fields (e.g., Y/Yes → 1, N/No → 0). After this structuring step, we applied a five-step preprocessing pipeline to improve cross-project comparability: (1) field harmonization and unit normalization across reports; (2) missingness profiling to distinguish non-applicable from unreported values; (3) fold-wise imputation of missing numeric entries (median for numeric fields and mode for binary/ordinal fields) with additional missingness-indicator flags to preserve the distinction between “zero” and “unreported”; (4) removal of constant/degenerate columns and sanity checks for extreme values; and (5) z-score standardization of continuous predictors prior to PCA/FA and downstream modeling (binary indicators retained as 0/1).
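As an illustration of the type-normalization step, a minimal parser for the mixed notations quoted above might look as follows (a sketch, not the project's actual cleaning code):

```python
import re


def parse_numeric(token):
    """Extract the leading numeric value from mixed report notations,
    e.g. '8500 won', '55 persons', 'approximately 66,600 won', '0.8(80%)'.
    Returns None when no numeric token is present."""
    m = re.search(r"\d[\d,]*\.?\d*", token)
    if m is None:
        return None
    return float(m.group().replace(",", ""))


def parse_binary(token):
    """Apply the explicit coding rules for binary fields (Y/Yes -> 1, N/No -> 0).
    Unrecognized or blank entries stay None (i.e., missing, not 'No')."""
    t = token.strip().lower()
    if t in {"y", "yes", "1"}:
        return 1
    if t in {"n", "no", "0"}:
        return 0
    return None
```

Keeping unrecognized entries as `None` rather than 0 is what later allows the pipeline to distinguish "unreported" from a genuine zero.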
Missingness was limited (13 cells out of 9264 numeric entries; 0.140%), affecting seven projects (1.81%). A manual audit (Table A1, adapted from the table in Section 5.8 of the full technical report) confirmed that all missing entries reflect unreported values (i.e., those not itemized in the administrative final report), rather than true zeros. Accordingly, we treated blanks as missing and handled them via imputation strategies described in Section 3.3.1, and we avoided listwise deletion because it reduces the sample (386 → 379) and shifts the target distribution, potentially harming representativeness.
Using the resulting 24 standardized numeric predictors (Section 3.2), we derived two alternative representations. PCA retained the first 10 principal components (PC1–PC10), explaining 81.7% cumulative variance. FA extracted nine latent factors (eigenvalue ≥ 1), explaining 77.8% cumulative variance and enabling semantic labeling based on dominant loadings. We evaluated three input tracks: Baseline (24 predictors), PCA-enhanced (Baseline + PC1–PC10), and FA-enhanced (Baseline + Factor_1–Factor_9). This design enables a direct comparison between variance-preserving compression (PCA) and semantically interpretable structure (FA).
Table 3 reports the factor labels and representative high-loading variables used for interpretation.
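The track construction itself is mechanical: z-scored predictors are projected onto the leading components and the scores are appended to the Baseline matrix. A compact NumPy sketch (ours, with a plain eigendecomposition standing in for whatever PCA routine the study used):

```python
import numpy as np


def pca_components(X, n_components):
    """Project (already z-scored) predictors onto their leading principal
    components. Returns (scores, explained_variance_ratio)."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)            # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]  # keep the largest
    scores = Xc @ eigvecs[:, order]
    evr = eigvals[order] / eigvals.sum()
    return scores, evr


def pca_enhanced_track(X, n_components=10):
    """Baseline predictors augmented with PC scores (Baseline + PC1..PCk)."""
    scores, _ = pca_components(X, n_components)
    return np.hstack([X, scores])
```

With the study's 386 × 24 matrix and n_components = 10, the PCA-enhanced track would be 386 × 34; the FA-enhanced track is built analogously from factor scores.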
3.3.1. Missing-Value Handling
Missing values in the administrative final reports mainly arise from omissions during the preparation and submission of individual project final reports, where certain cost-driver and operational fields are left blank. In this study, a comprehensive missingness audit (Table A1, adapted from the table in Section 5.8 of the full technical report) verified that all observed missing entries in the selected predictors correspond to unreported values, rather than non-applicable activities or true zeros. Therefore, we do not encode missing numeric entries as zero. Instead, we retain missing values as missing and apply imputation and missingness-indicator features to preserve all 386 projects while maintaining the semantic distinction between ‘zero’ and ‘unknown’.
To enable PCA/FA, which requires a complete matrix, and to support stable learning across nested cross-validation, we adopt a two-part strategy. First, missing numeric predictors are imputed using the median computed on the training portion of each fold (to avoid information leakage), and the learned median is then applied to the corresponding validation/test portion. Second, for each predictor with any missingness, we add a binary missingness indicator (1 = missing/unreported; 0 = reported). This allows models—especially tree ensembles—to learn systematic patterns associated with non-reporting without conflating them with true zeros.
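A minimal sketch of this fold-wise impute-plus-flag step (illustrative; `None` stands for an unreported cell, and medians must be learned on the training fold only):

```python
from statistics import median


def fit_medians(train_rows):
    """Learn per-column medians on the training fold only (None = unreported),
    so no information leaks from validation/test folds."""
    cols = list(zip(*train_rows))
    return [median(v for v in col if v is not None) for col in cols]


def impute_with_flags(rows, medians):
    """Fill unreported cells with the training medians and append one
    missingness indicator per column (1 = missing/unreported, 0 = reported)."""
    out = []
    for row in rows:
        flags = [1.0 if v is None else 0.0 for v in row]
        filled = [m if v is None else v for v, m in zip(row, medians)]
        out.append(filled + flags)
    return out
```

Because the flags are ordinary 0/1 columns, tree ensembles can split on them directly and learn systematic non-reporting patterns without conflating them with true zeros.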
After fold-wise imputation, the analysis-ready matrix is complete for PCA/FA and downstream modeling (N = 386). Importantly, zeros in the cleaned dataset represent genuine zero-valued measurements (when they occur), whereas unreported fields are represented through the combination of imputed values and their missingness indicators. We report the variable-by-variable missingness distribution and pre/post preprocessing comparison in Table A1, and we summarize the preprocessing conventions by variable group in Table 4.
Missingness was rare (13 entries total) but non-negligible conceptually because treating unreported values as zeros would impose a strong and potentially biasing assumption. We therefore revised the preprocessing pipeline to align with the ‘unreported’ interpretation confirmed by the audit and to provide explicit sensitivity checks under alternative missing-data strategies (Section 4.5).
For binary operational indicators (e.g., cloud execution, expert participation, anonymization), unreported entries are treated as missing (NaN) and handled in the same way: median/mode imputation within folds plus a missingness indicator. This avoids interpreting ‘blank’ as ‘No’ unless the report explicitly states non-use. Tree-based learners can exploit both the imputed value and the missingness flag to capture interactions (e.g., scale × unit price × operations) without imposing linear assumptions.
As an additional robustness check, we compare (i) our primary imputation-with-indicators approach with (ii) imputation without missingness indicators and (iii) listwise deletion where feasible. We find that the main performance ranking and interpretability conclusions—including the dominant cost drivers and the overall sensitivity patterns—remain consistent across these missing-data treatments, indicating that our results are not artifacts of a particular handling strategy (Section 4.5).
3.3.2. Post-Preprocessing Descriptive Statistics
After preprocessing (including the missing-value handling described above), we produced descriptive statistics for the dependent variable (TPC) and all 24 numeric predictors to verify plausibility, detect extreme values, and characterize cross-project heterogeneity. Because project size and costs are typically heavy-tailed in public procurement, we report not only means and standard deviations, but also medians and interquartile ranges (IQRs) alongside minimum–maximum ranges.
Consistent with the program’s portfolio structure, scale-related variables (e.g., total records, total GB, total minutes, and average time per record) exhibit right-skewness, and TPC shows a similar heavy-tailed distribution driven by a small number of very large projects. Unit-price variables (labeling, cleaning, and QA/inspection per record) vary substantially across domains and difficulty levels, reflecting different annotation guidelines, review intensity, and contractor productivity assumptions.
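The robust summaries described above can be computed with the standard library alone; the sketch below uses the default exclusive method of `statistics.quantiles` (an illustration of our reporting convention, not the study's exact tooling):

```python
from statistics import quantiles


def robust_summary(values):
    """Median and IQR alongside min-max, suited to heavy-tailed cost variables
    where the mean is dominated by a few very large projects."""
    q1, q2, q3 = quantiles(values, n=4)  # default 'exclusive' method
    return {"min": min(values), "q1": q1, "median": q2,
            "q3": q3, "iqr": q3 - q1, "max": max(values)}
```

On a toy heavy-tailed sample such as [1, 2, 3, 4, 100], the median (3) is far more representative than the mean (22), which is why we report medians and IQRs throughout.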
Target Transformation Check: Because monetary costs are often right-skewed, we tested variance-stabilizing transformations for the target TPC. Specifically, within each outer training fold, we evaluated (i) log1p(TPC) and (ii) a Box–Cox transform (λ estimated on the training fold; TPC > 0) for linear baseline models, and then back-transformed predictions to the original KRW scale for metric computation (MAE/RMSE/MAPE/R²). Normality diagnostics (histograms and Q–Q plots) indicated that log-transformed TPC is approximately normal (i.e., raw TPC is closer to log-normal), and residual plots showed reduced heteroscedasticity for linear baselines. Performance gains for regularized linear models were modest and did not change the overall ranking: tree-based ensembles remained superior. We therefore report results on the original TPC scale for procurement interpretability, while documenting the transformed-target sensitivity in Section 4.4.
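The fold-wise transform-then-back-transform protocol can be illustrated with a one-predictor least-squares baseline (a deliberately simplified stand-in for the regularized models actually evaluated):

```python
import math


def fit_ols_1d(xs, ys):
    """Least-squares slope and intercept for a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


def predict_tpc(xs_train, tpc_train, x_new):
    """Fit on log1p(TPC) within the training fold, then back-transform the
    prediction to the original KRW scale with expm1 before computing metrics."""
    slope, intercept = fit_ols_1d(xs_train, [math.log1p(y) for y in tpc_train])
    return math.expm1(slope * x_new + intercept)
```

The key discipline is that both the transform parameters and the fit use only the training fold, and all error metrics are computed after back-transforming to KRW.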
Workforce variables (number of workers and crowd/full-time ratios) show meaningful dispersion across contractors, suggesting heterogeneous coordination costs. Binary operational indicators (cloud execution, expert participation, anonymization) are not uniformly present across projects, supporting our decision to treat them as optional but policy-relevant levers in subsequent sensitivity analyses. Quality indicators (F1 and mAP), where reported, are naturally bounded in [0, 1] and are used as secondary drivers when interpreting cost–quality trade-offs.
These descriptive diagnostics informed the range checks and outlier screening performed before model training, and they also motivate the use of robust validation and scenario-based reporting in Section 4.3, where we interpret predictions as budget risks rather than as single-point estimates.
3.3.3. Practice-Based Baseline A: Spreadsheet Heuristic (Unit Cost per Record × Number of Records)
To enhance practical relevance and provide a comprehensive evaluation, we additionally benchmark practice-based non-ML baselines (Baselines A–C) commonly used in budgeting and procurement workflows. Baseline A is a transparent spreadsheet-style estimator, implemented as “unit cost per record × number of records”. In practice, the unit cost may be a single average rate or a sum of process-specific per-record rates (e.g., labeling + cleaning + QA). For cross-validation, when a single average unit cost per record is used, it is estimated from the training fold only and then applied to the held-out fold to avoid information leakage.
Because the administrative reports define per-process unit prices for labeling, cleaning, and QA/inspection in KRW per record, we construct a per-record unit cost by summing these process rates and multiply by the total record count. This is equivalent to the common heuristic “average cost per record × number of records,” while retaining line-item interpretability.
As shown in Supplementary Table S2, the unit-price × workload mapping uses the total record count as the workload scale for labeling, cleaning, and QA/inspection to preserve unit consistency. Supplementary Table S3 provides additional evaluation metrics and supporting results for transparency and reproducibility.
Accordingly, Baseline A predicts total project cost as a transparent line-item sum: TPC_hat = α + (UP_label × N_records) + (UP_clean × N_records) + (UP_QA × N_records), where all unit prices are reported in KRW/record and N_records denotes total record count.
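Implemented directly, the line-item heuristic is a one-liner (α defaults to zero when no offset is fitted on the training fold):

```python
def baseline_a(up_label, up_clean, up_qa, n_records, alpha=0.0):
    """Spreadsheet heuristic:
    TPC_hat = alpha + (UP_label + UP_clean + UP_QA) * N_records,
    with unit prices in KRW/record and N_records the total record count."""
    return alpha + (up_label + up_clean + up_qa) * n_records
```

Its transparency is the point: each term maps to a procurement line item, which is exactly what the non-linear models must beat to justify their added complexity.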
Baseline B (Analogous estimation) predicts cost from similar historical projects. We use a k-nearest neighbors (kNN) approach (k = 10) on standardized scale and key unit-price drivers and take the median total cost of the neighbors as the estimate.
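A sketch of this analogous estimator (assuming driver vectors are already standardized; the pair structure and function name are our own convention):

```python
from statistics import median


def knn_cost_estimate(query, history, k=10):
    """Analogous estimation: median total cost of the k nearest historical
    projects in the standardized driver space (squared Euclidean distance).

    `history` is a list of (driver_vector, total_cost) pairs."""
    ranked = sorted(
        history,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(query, item[0])))
    return median(cost for _, cost in ranked[:k])
```

Using the median rather than the mean of the neighbors limits the influence of a single outsized reference project, which matters given the heavy-tailed TPC distribution.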
Baseline C (Parametric production function) fits a conventional log-linear model using scale drivers (e.g., records, GB, and minutes) within each training fold and exponentiates predictions to obtain costs on the original scale [27].
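A single-driver version of this log-linear fit illustrates the mechanics (the study uses several scale drivers; one is shown here for brevity):

```python
import math


def fit_loglinear(records, costs):
    """Baseline C, one driver shown: ln(TPC) = a + b * ln(records), fitted by
    least squares on the training fold; predictions are exponentiated back
    to the original cost scale."""
    xs = [math.log(r) for r in records]
    ys = [math.log(c) for c in costs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return lambda r: math.exp(a + b * math.log(r))
```

The exponent b is interpretable as a scale elasticity: b > 1 implies diseconomies of scale, b < 1 economies of scale.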
For a fair comparison with ML models, all non-ML baselines are evaluated using the same outer 5-fold splits and metrics (MAE, RMSE, MAPE, and R²). Any fitted parameters (e.g., the offset α in Baseline A and regression coefficients in Baseline C) are estimated on the training fold only and applied to the held-out fold.
3.4. Model Training, Validation, and Selection
We trained four representative algorithms spanning linear regression and flexible nonlinear learners: Ridge, Random Forest, XGBoost, and LightGBM. All models used identical preprocessing to ensure comparability across tracks [20,28,29].
Hyperparameters were tuned with nested 5 × 5 cross-validation. The inner loop optimized hyperparameters and early-stopping configurations, while the outer loop produced unbiased performance estimates. We report RMSE, MAE, MAPE, and R² and select the best configuration based on the overall error profile and interpretability requirements.
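The index bookkeeping behind nested cross-validation, where inner folds partition only the outer training portion, can be sketched as follows (simplified; the study's protocol additionally handles early stopping and metric aggregation):

```python
import random


def kfold_indices(n, k, seed=0):
    """Shuffle indices once, then deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def nested_cv_splits(n, k_outer=5, k_inner=5):
    """Yield (outer_train, outer_test, inner_folds). Inner folds partition only
    the outer training portion, so tuning never touches the outer test fold."""
    for outer_test in kfold_indices(n, k_outer):
        held = set(outer_test)
        outer_train = [i for i in range(n) if i not in held]
        inner_folds = []
        for j in range(k_inner):
            inner_val = outer_train[j::k_inner]
            val = set(inner_val)
            inner_folds.append(
                ([i for i in outer_train if i not in val], inner_val))
        yield outer_train, outer_test, inner_folds
```

All fold-dependent preprocessing (median imputation, standardization, target transforms) is refit inside `outer_train` for each split, which is what makes the outer estimates unbiased.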
3.5. Interpretability, Sensitivity, and Robustness Checks
To open the black box for decision makers, we computed SHAP values for the selected XGBoost configuration (PCA-enhanced track), providing global feature importance and local explanations.
We then conducted what-if sensitivity simulations by perturbing key drivers within realistic ranges (e.g., +10% unit-price shocks; toggling operational flags) and reporting average percentage changes in predicted TPC. Robustness checks included stability of rankings across cross-validation folds and diagnostic analyses for distribution shifts.
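The one-at-a-time perturbation logic can be sketched as follows, assuming a fitted model and an illustrative three-driver feature layout:

```python
# Sketch of the one-at-a-time what-if simulation: apply a +10% shock to a
# single driver and report the average percentage change in predictions.
# The model and three-driver feature layout are illustrative assumptions.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(1, 10, size=(300, 3))        # [scale, up_label, up_qa]
y = X[:, 0] * (X[:, 1] + 0.5 * X[:, 2])      # synthetic cost formation
model = GradientBoostingRegressor(random_state=0).fit(X, y)

def shock_pct(model, X, col, pct=0.10):
    """Average % change in predicted TPC under a +pct shock to column col."""
    X_shocked = X.copy()
    X_shocked[:, col] *= 1.0 + pct
    base, shocked = model.predict(X), model.predict(X_shocked)
    return float(np.mean((shocked - base) / base) * 100)

label_shock = shock_pct(model, X, col=1)     # +10% labeling unit-price shock
```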
4. Results
4.1. Dimensionality-Reduction Diagnostics (PCA and FA)
We assessed whether the standardized numeric feature space admits a lower-dimensional structure exploitable for prediction and interpretation. PCA on the 24 numeric predictors yielded 10 components with a cumulative explained variance of 81.7%. The first three components explained 17.0%, 13.7%, and 10.6% of the variance, respectively. Subsequent components captured additional structure associated with scale, difficulty, process shares, operational flags, and privacy/governance indicators.
Table 5 reports the eigenvalues and explained-variance profile for PC1–PC10.
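The diagnostic itself can be sketched as below; the synthetic matrix stands in for the 386 × 24 predictor table and will not reproduce the reported variance profile:

```python
# Sketch of the PCA diagnostic: standardize the numeric predictors, fit a
# full PCA, and count components needed for a cumulative explained-variance
# threshold. The synthetic 386 x 24 matrix below stands in for the real
# predictors and will not reproduce the reported variance profile.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(386, 5))                     # hidden structure
loadings = rng.normal(size=(5, 24))
X = latent @ loadings + rng.normal(scale=0.5, size=(386, 24))

pca = PCA().fit(StandardScaler().fit_transform(X))
cumvar = np.cumsum(pca.explained_variance_ratio_)
k_80 = int(np.searchsorted(cumvar, 0.80) + 1)          # components for >= 80%
```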
4.2. Model Performance Across Input Tracks
We benchmarked four algorithms (Ridge, Random Forest, XGBoost, and LightGBM) under three input tracks—Baseline, PCA-enhanced, and FA-enhanced—using nested 5 × 5 cross-validation. Across all tracks, the gradient-boosted ensembles outperformed the linear baseline, indicating nonlinear cost formation and interaction effects among scale, unit prices, and operational choices.
Table 6 summarizes pooled out-of-fold performance across tracks and candidate models. XGBoost achieved the best pooled out-of-fold performance in every track (Baseline: R2 = 0.863, RMSE = 1104.2, MAE = 731.4, MAPE = 0.374; PCA-enhanced: R2 = 0.868, RMSE = 1084.9, MAE = 746.9, MAPE = 0.358; FA-enhanced: R2 = 0.849, RMSE = 1160.6, MAE = 754.9, MAPE = 0.364), with LightGBM following closely; Ridge and Random Forest showed larger errors. These results indicate that XGBoost provides the most robust predictive accuracy under the current evaluation protocol.
Adding structure via PCA or FA produced performance close to the Baseline track (differences were small) under nested 5 × 5 cross-validation, with XGBoost consistently achieving the best pooled out-of-fold performance in every track. We emphasize FA primarily for its structured and interpretable latent representation rather than for consistent error reduction over PCA.
To further examine algorithm-level differences, we conducted paired statistical tests comparing PCA-enhanced XGBoost with Random Forest and LightGBM under identical outer 5-fold splits. The comparative error patterns across the Baseline, PCA-enhanced, and FA-enhanced tracks are illustrated in
Figure 2, while the statistical test results are reported in
Supplementary Table S6, including paired t-tests, Wilcoxon signed-rank tests, and fold-wise win-rates. XGBoost consistently outperformed alternative ensemble models across all folds.
For clarity,
Table 6 reports the aggregated pooled out-of-fold performance for all candidate models within each representation track (Baseline, PCA-enhanced, and FA-enhanced). The overall best-performing configuration was PCA-enhanced XGBoost (R2 = 0.868; MAE = 746.9; RMSE = 1084.9; MAPE = 0.358).
A statistical comparison was conducted between the PCA- and FA-enhanced representations. Because the performance gap between the two configurations is relatively small, we ran formal paired tests across the matched outer cross-validation folds. For each outer fold k, we computed the held-out error (RMSE and MAE) for the PCA-enhanced and FA-enhanced models under identical folds and hyperparameter tunings, then tested the paired differences d_k = Error_FA,k − Error_PCA,k using (i) a paired t-test and (ii) a Wilcoxon signed-rank test (two-sided, α = 0.05). The full test statistics and p-values are reported in Supplementary Table S4, showing whether the observed difference is statistically significant under the nested CV protocol.
Using matched outer-fold errors (K = 5), the FA-enhanced configuration yields lower errors than the PCA-enhanced configuration (differences defined as FA − PCA): RMSE mean difference = −32.424 (SD = 6.116); MAE mean difference = −37.040 (SD = 2.553). A paired t-test indicates that the improvement is statistically significant for both RMSE (t = −11.854; p = 0.000290) and MAE (t = −32.448; p = 5.38 × 10^−6). The non-parametric Wilcoxon signed-rank test yields p = 0.0625 for both metrics, which is not significant at α = 0.05 given the small number of folds. Full statistics are reported in Supplementary Table S4.
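The paired-test procedure can be sketched as follows; the five fold-level RMSE values are illustrative, not the study's values. The sketch also makes explicit why the Wilcoxon p-value is floored at 0.0625 with K = 5:

```python
# Sketch of the paired fold-wise comparison: differences d_k = FA - PCA
# tested with a paired t-test and a Wilcoxon signed-rank test. The five
# fold-level RMSE values below are illustrative, not the study's values.
import numpy as np
from scipy import stats

rmse_pca = np.array([1100.0, 1080.0, 1120.0, 1090.0, 1110.0])
rmse_fa = np.array([1066.0, 1049.0, 1084.0, 1061.0, 1073.0])

d = rmse_fa - rmse_pca                     # negative => FA has lower error
t_stat, t_p = stats.ttest_rel(rmse_fa, rmse_pca)
w_stat, w_p = stats.wilcoxon(rmse_fa, rmse_pca)

# With K = 5 folds and all differences of one sign, the exact two-sided
# Wilcoxon p-value is floored at 2 / 2**5 = 0.0625, so the non-parametric
# test cannot reach alpha = 0.05 regardless of effect size.
```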
To improve the interpretability of the cross-model comparison, we provide grouped bar charts visualizing MAE and RMSE across models and representation tracks in
Supplementary Figures S3 and S4.
In addition to the representation-track comparison in
Table 6, we evaluated several practice-based non-ML baselines (e.g., mean predictor, unit-cost heuristics, analogous estimation, and a parametric production function) under the same outer 5-fold protocol to contextualize the gains from the proposed framework.
Overall, the non-ML baselines provide reasonable first-order estimates, but they are consistently outperformed by the best ML configurations. The rule-based additive baseline (Baseline A) remains valuable for high-auditability budgeting scenarios, whereas the proposed explainable ML approach better captures nonlinear and interaction effects among scale and unit prices.
Two conclusions follow. First, cost formation exhibits nonlinearity and interactions (e.g., scale × difficulty × quality intensity), favoring tree ensembles. Second, the PCA- and FA-enhanced representations yield similar accuracy under this evaluation protocol while offering complementary interpretability benefits.
4.3. Explainability and Sensitivity
To interpret the best-performing configuration identified in
Table 6, we compute SHAP values for the XGBoost model (PCA-enhanced track) using SHAP TreeExplainer. The global SHAP summary (
Figure 3) provides a representative portfolio-level view across all 386 projects, revealing which predictors consistently drive increases or decreases in the predicted total project cost (TPC). In line with the cost-formation mechanism discussed earlier, scale-related variables and process-level unit prices dominate the global attribution, while operational and governance indicators (e.g., cloud execution, expert participation, and anonymization) contribute secondary but policy-relevant effects.
Local explanations support auditability for individual projects; however, to avoid anecdotal interpretation, the main text emphasizes global attribution patterns. For transparency,
Supplementary Figure S1 provides example local waterfall explanations illustrating how project-specific configurations combine to produce the final prediction relative to the model’s expected value.
Supplementary Figure S2 provides an error-focused visualization of the track × model comparison (OOF pooled), complementing Table 6 and highlighting the relative advantages of ensemble models over linear baselines. Throughout the SHAP analyses, positive SHAP values increase the predicted TPC, whereas negative values decrease it.
To quantify probabilistic budget risk beyond deterministic one-at-a-time perturbations, we conducted a Monte Carlo simulation that varies inputs according to the historical empirical distributions observed across the 386 projects while preserving joint dependencies among drivers.
Table 7 reports the project-level probabilistic budget risk metrics derived from the Monte Carlo simulation.
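The Monte Carlo step can be sketched as below, assuming a fitted model; resampling entire historical rows, rather than independent marginals, preserves the joint dependence among drivers. The drivers and 150%-of-median threshold are illustrative:

```python
# Sketch of the Monte Carlo budget-risk step: resample entire historical
# driver rows (preserving joint dependence), predict TPC for each draw,
# and summarize tail percentiles and an overrun probability. The model,
# drivers, and 150%-of-median threshold are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_hist = rng.uniform(1, 10, size=(386, 3))
y_hist = X_hist[:, 0] * X_hist[:, 1] + rng.normal(scale=1.0, size=386)
model = RandomForestRegressor(random_state=0).fit(X_hist, y_hist)

idx = rng.integers(0, len(X_hist), size=5000)   # row resampling keeps dependencies
sims = model.predict(X_hist[idx])

p50, p90 = np.percentile(sims, [50, 90])
overrun_prob = float(np.mean(sims > 1.5 * p50)) # P(cost > 150% of median)
```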
4.4. Robustness and Error Analysis
Transformed-Target Robustness: We re-estimated linear baseline models using log1p(TPC) and Box–Cox(TPC) within each outer training fold (transform parameters fitted on training data only) and evaluated back-transformed predictions on the original KRW scale. Both transforms reduced heteroscedasticity in residual diagnostics and slightly improved RMSE/MAE for Ridge; however, the improvements were not large enough to close the gap to the best-performing tree ensembles (notably XGBoost). Consequently, our main conclusions regarding dominant drivers and the comparative results across the Baseline, PCA, and FA tracks are robust to reasonable target transformations. The performance difference between the PCA-enhanced and FA-enhanced XGBoost models remained small in magnitude; paired-test results are reported in Section 4.2 and Supplementary Table S4.
For transparency,
Table 8 reports fold-averaged MAE and RMSE for linear baselines under raw TPC, log1p(TPC), and Box–Cox(TPC), evaluated on the original KRW scale after back-transformation.
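The fold-wise transform logic can be sketched as below for the log1p case (Box–Cox is analogous, with its λ fitted on the training fold only); the data are synthetic stand-ins for the right-skewed KRW costs:

```python
# Sketch of the transformed-target check for the log1p case: the transform
# is applied only to the training fold's target, and predictions are
# back-transformed before scoring on the original scale. Data are synthetic.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 4))
y = np.exp(2 + 3 * X[:, 0] + rng.normal(scale=0.2, size=300))  # right-skewed

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

log_model = Ridge().fit(X_tr, np.log1p(y_tr))   # transform on training fold only
pred = np.expm1(log_model.predict(X_te))        # back to the original scale
mae_log = mean_absolute_error(y_te, pred)

raw_model = Ridge().fit(X_tr, y_tr)             # untransformed baseline
mae_raw = mean_absolute_error(y_te, raw_model.predict(X_te))
```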
As an additional robustness test (separately from the target-transformation sensitivity in
Table 8), we performed an inflation normalization check, as follows.
Inflation Normalization Check: We repeated the full evaluation after deflating TPC and unit-price variables to KRW in 2022 prices using annual CPI deflators (Statistics Korea, KOSIS; CPI base year 2020 = 100).
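The deflation step is simple arithmetic; the CPI index values below are illustrative placeholders, not official KOSIS figures:

```python
# Sketch of the deflation step: convert nominal KRW amounts from each
# project year into 2022 prices with annual deflators. The CPI index
# values below are illustrative placeholders, not official KOSIS figures.
CPI = {2020: 100.0, 2021: 102.5, 2022: 107.7}

def to_2022_prices(nominal_krw, year):
    """Deflate a nominal amount: real = nominal * CPI[2022] / CPI[year]."""
    return nominal_krw * CPI[2022] / CPI[year]

real = to_2022_prices(1_000.0, year=2020)   # 1,000 nominal -> ~1,077 in 2022 prices
```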
Fold-averaged performance metrics and the relative ranking changed only marginally, and the SHAP-based importance ordering of the dominant drivers remained stable; XGBoost remained the top-performing model across tracks under the deflated-cost setting. Error profiles indicate that the largest absolute errors occur in a small number of extremely large or unusually configured projects (e.g., high scale combined with high QA shares and multiple governance constraints). This is expected in public portfolios where a few flagship datasets dominate expenditure and where operational heterogeneity is substantial [30].
Across outer folds, the ranking of top drivers (scale and unit prices) remains stable, supporting the interpretation that these are structural determinants rather than artifacts of a particular split. In addition, factor-augmented models show reduced variance in fold-wise performance relative to Baseline, consistent with the idea that latent structure smooths idiosyncratic correlations among raw variables.
In practice, we recommend pairing point predictions with uncertainty intervals and using local SHAP explanations to examine whether a project’s cost profile matches the nearest-neighbor regions of the training data. Such diagnostic use is important when the model is applied to new project types, novel tooling stacks, or new contracting regimes.
4.5. Additional Robustness Checks and Ablations
We performed additional robustness checks to assess whether the main findings depend on specific preprocessing or evaluation choices. First, we varied the missing-value strategy under the empirically verified assumption that all missing entries are unreported (
Table A1). We compared: (i) fold-wise median/mode imputation with missingness indicators (primary), (ii) fold-wise median/mode imputation without indicators, and (iii) listwise deletion. Across these strategies, the relative ranking of model families and the dominance of scale and unit-price drivers remained stable.
In the underlying cohort, missingness was sparse: 13 missing numeric entries across the 24 predictors (13/(24 × 386) = 0.140%), affecting seven projects (1.81%). This low missingness rate implies that alternative missing-data treatments influence only a small subset of projects rather than the global geometry of the input space.
Listwise deletion reduced the sample from 386 to 379 (−1.81%) and shifted the TPC distribution downward (mean −1.237%, median −4.479%), indicating a risk of under-representing the right tail. By contrast, the two imputation strategies preserved the full cohort and yielded nearly identical performance ordering across Baseline/PCA/FA tracks. These checks provide evidence that the results in
Table 6 are driven by model/representation choice rather than by an arbitrary missing-data rule.
Second, winsorizing extreme unit-price values and log-transforming heavy-tailed scale variables (records/GB/minutes) slightly improved calibration for small projects but did not change the overall ranking: XGBoost remained the top-performing model across tracks, and the main driver categories identified by SHAP remained stable. Under the current evaluation protocol, representation differences between PCA-enhanced and FA-enhanced inputs are small, and no consistent superiority of one representation over the other is observed; model ranking is primarily driven by algorithm choice rather than representation choice.
Third, we evaluated simplified feature sets to probe the marginal value of optional governance features. Models trained on scale-only predictors substantially underperformed, while adding process unit prices (labeling/cleaning/inspection) recovered most of the predictive power. Adding cloud, expert participation, and anonymization produced incremental gains, consistent with their secondary role observed in SHAP and sensitivity analyses.
Fourth, we examined temporal generalization using coarse cohort splits (early vs. late projects within the 2020–2022 window). Performance degraded slightly in the later cohort, consistent with evolving toolchains and pricing. This supports the recommendation to institutionalize periodic model updates and drift monitoring rather than relying on static estimates.
Because our portfolio ends in 2022, it predates the widespread operationalization of generative-AI and LLM-assisted annotation pipelines that can alter productivity and the composition of costs (e.g., shifting effort from manual labeling toward prompt design, validation, and tool integration). Thus, while the dominant driver categories identified here—scale and per-process unit prices—are expected to remain structurally relevant, the numeric magnitudes of unit prices and rework dynamics may differ for modern projects. We therefore recommend treating the model as a living estimator: update unit-price inputs using current price books, periodically refit the model with newer cohorts, and monitor drift in both error and SHAP-based driver rankings as tooling evolves.
Finally, we stress-tested interpretation stability. SHAP global rankings were highly correlated across cross-validation folds, and the sign of major effects remained consistent. Where local explanations diverged (typically in small projects with atypical unit-price patterns), these cases also exhibited higher prediction uncertainty, suggesting that agencies should treat outlier profiles as candidates for manual review rather than relying solely on automated estimates.
While point forecasts help rank proposals and benchmark bids, public budgeting typically requires a defensible range to absorb right-tail risk. We therefore recommend reporting a prediction interval (e.g., 80–90% coverage) alongside the point estimate, using lightweight add-ons that do not alter the core training protocol.
In practice, agencies can derive empirical uncertainty bands in three audit-friendly ways: (i) bootstrap refits, repeatedly resampling the training cohort and refitting the final model to obtain an empirical interval; (ii) gradient-boosted quantile regression (e.g., q10/q50/q90) to directly predict budget percentiles; or (iii) conformal prediction applied to cross-validated residuals to provide distribution-free empirical coverage. Bootstrap intervals are usually the most intuitive for oversight bodies, whereas quantile and conformal methods are efficient for routine scenario recomputation and monitoring under a potential distribution shift.
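Option (iii) can be sketched as a split-conformal add-on, in which absolute residuals on a held-out calibration set yield a distribution-free interval width; the model, ~90% coverage target, and synthetic data are illustrative:

```python
# Sketch of a split-conformal add-on: the absolute residuals on a held-out
# calibration set yield a distribution-free interval width. The model,
# ~90% coverage target, and synthetic data are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 5))
y = 100 * X[:, 0] + 50 * X[:, 1] + rng.normal(scale=5.0, size=500)

X_tr, X_cal, y_tr, y_cal = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)

# Conformal quantile of calibration residuals (finite-sample correction).
resid = np.abs(y_cal - model.predict(X_cal))
n = len(resid)
q = np.quantile(resid, min(1.0, np.ceil(0.9 * (n + 1)) / n))

point = model.predict(X_cal[:1])[0]          # a new proposal's point forecast
lo, hi = point - q, point + q                # ~90% empirical-coverage interval
```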
5. Discussion and Implications
5.1. Why Scale and Unit Prices Dominate
Across the portfolio, scale variables (records, GB, minutes) are consistently the strongest contributors to TPC. This aligns with a production-function intuition: scale increases the volume of work across nearly all process steps, including collection, cleaning, labeling, and validation. However, scale alone does not determine cost; it interacts with unit prices and workflow design. A large but low-difficulty dataset with streamlined QA can be cheaper than a smaller dataset with high difficulty and intensive review policies.
The prominence of unit prices (labeling, cleaning, inspection) indicates that procurement terms and operational productivity are first-order budget drivers. Unit prices partly reflect labor market conditions and vendor capability, but they also embed task design choices (label granularity, ontology complexity, tooling maturity, review policy). Investments in tooling (auto-labeling, model-assisted review, QA automation) can therefore reduce unit prices over time, shifting the cost structure even when scale remains fixed.
These drivers are actionable. Unlike intrinsic attributes such as modality or domain, unit prices and QA policies can be adjusted through tender design, contract clauses, workflow specification, and vendor performance incentives. A policy-ready cost model should represent these levers explicitly and provide interpretable elasticities for change control.
5.2. Implications for Procurement and Change Control
Overall budgets are most sensitive to unit prices for labeling, cleaning, and inspection; we translate these elasticities into concrete change-control guidance in this section.
Public AI training-data projects are commonly procured as fixed-price or ceiling-price contracts, with limited flexibility once execution begins. In this setting, change orders can be a major source of budget drift. The sensitivity results provide practical coefficients for governance: a +10% change in labeling unit price increases predicted TPC by roughly +7–9% on average; similar coefficients apply to cleaning (+5–6%) and inspection (+4–5%).
We recommend treating these elasticities as change-control multipliers. When a contractor requests a unit-price change, agencies can compute the implied TPC delta and compare it with (i) documented scope expansion, (ii) objective productivity evidence (rework rate, relabeling rate, defect density), and (iii) benchmark unit prices from comparable cohorts. This shifts negotiations from ad hoc bargaining to evidence-based governance.
A practical workflow is as follows: Step (1) submit a standardized change request describing the cause; Step (2) compute the predicted TPC before and after the change using the trained model; Step (3) inspect the local SHAP explanation to confirm that the change affects the expected drivers; and Step (4) approve, reject, or request remediation based on the modeled delta and supporting evidence.
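The comparison in Step (2) reduces to first-order arithmetic when expressed through the sensitivity elasticities; the elasticity value of 0.8 below is illustrative:

```python
# Sketch of the change-control multiplier: the implied TPC delta for a
# requested unit-price change is first-order arithmetic. The elasticity
# value of 0.8 is illustrative.
def implied_tpc_delta_pct(elasticity, unit_price_change_pct):
    """First-order impact: dTPC% ~= elasticity * d(unit price)%, scale held fixed."""
    return elasticity * unit_price_change_pct

# A +10% labeling unit-price request with elasticity ~0.8 implies a
# roughly +8% TPC increase, consistent with the +7-9% range reported.
delta = implied_tpc_delta_pct(elasticity=0.8, unit_price_change_pct=10.0)
```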
5.3. Operational and Governance Levers Beyond Unit Prices
Cloud execution and expert participation appear as secondary but meaningful drivers. Cloud use increases predicted TPC modestly on average (+2% in our sensitivity setting), reflecting storage, monitoring, and security costs. At the same time, the cloud can shorten schedules and improve reproducibility, so the higher cost may correspond to deliberate investment in delivery reliability and auditability.
Expert participation increases predicted TPC slightly (+1% on average) but is often necessary for specialized domains. Expert review can reduce downstream error costs, improve ontology coherence, and mitigate bias. The cost model should be used to budget realistically for expertise and to make tradeoffs explicit, not to eliminate expert involvement.
Anonymization and privacy controls are critical governance components. Their effects may appear smaller in aggregate because privacy-intensive projects are fewer, but costs can be concentrated in sensitive domains. For high-sensitivity datasets, agencies may benefit from additional predictors reflecting privacy classification, de-identification complexity, and security certification requirements.
5.4. Comparison to Traditional Cost Estimation Approaches
Traditional estimation for data projects relies on parametric models, expert judgment, and unit-cost heuristics. Such approaches struggle with heterogeneity across modalities, quality targets, and workflow designs. In our setting, machine-learning estimation offers three advantages: it learns nonlinear interactions (e.g., scale × difficulty × QA share), leverages portfolio-level information to benchmark unit prices and process shares, and supports scenario analysis through explainability methods.
Comparison to practice-based non-ML baselines. To contextualize the proposed approach, we also compared it against non-ML baselines commonly used in practice (unit-cost heuristics, analogous estimation, and a parametric production function). These approaches are easy to audit but tend to underfit heterogeneous AI data-construction projects because they cannot capture nonlinearities and interactions; in contrast, the explainable ML models retain transparency through factor abstraction and SHAP-based attribution while improving accuracy.
Machine-learning estimation should not replace expert review. Instead, it can serve as a decision-support layer: when the model prediction differs substantially from an initial proposal, the discrepancy can trigger a structured review, checking whether unit prices are out of distribution, difficulty claims match historical analogs, and quality targets imply unusually high QA intensity.
Factor analysis strengthens communication and stability by organizing correlated variables into interpretable latent factors that can be labeled and monitored across cohorts. In contrast, PCA components can be predictive but are harder to explain in policy and procurement terms.
5.5. Deployment Considerations for Public Budgeting
For deployment, agencies should define standardized input templates aligned with reporting rules, a governance process for model updates, and audit trails for predictions and change requests. Because unit prices and tooling evolve, an annual model refresh using the latest cohort is recommended, while maintaining a frozen baseline model for longitudinal comparisons.
Model outputs should include uncertainty bounds (e.g., an 80–90% prediction interval) to support risk-based budgeting, contingency planning, and the detection of out-of-distribution proposals.
In procurement practice, treat the median (q50) as the reference estimate, allocate contingency using the upper tail (e.g., q90–q50) for projects with risk markers (high difficulty, high QA shares, strong anonymization needs, or out-of-range unit prices), and escalate proposals with unusually wide intervals for manual review or staged contracting. Track forecast errors by modality and domain to periodically recalibrate this guidance.
Operational checklist:
Report a point estimate plus an 80–90% prediction interval for funding approval.
Use wide intervals as a trigger for deeper requirement review or staged contracting; monitor errors by modality/domain to update price books and QA policies.
Lightweight approaches to produce prediction intervals for budgeting are summarized in
Table 9.
5.6. Limitations and Future Work
This study has limitations. Administrative reports are heterogeneous in cost itemization and may under-report overheads. A missingness audit confirmed that observed blanks correspond to unreported/unknown values; we therefore treated missing entries as missing and used fold-wise imputation with missingness indicators, complemented by robustness checks. Nevertheless, if non-reporting is systematically correlated with project characteristics (e.g., bundling practices), the resulting estimates may still reflect reporting conventions as well as underlying effort. External validity in private-sector projects or other countries requires additional validation.
Temporal scope is another limitation. Our dataset spans 2020–2022 public projects, which largely preceded the rapid diffusion of LLM-assisted annotation pipelines. Accordingly, the observed unit prices and process shares primarily reflect human-centered workflows, and modern projects that substitute part of the labeling/QA with automated or LLM-assisted steps may exhibit different marginal cost structures (e.g., lower labeling unit prices but potentially higher tooling, engineering, and audit costs). We therefore treat the model as a cohort-specific estimator and recommend periodic recalibration using the latest cohorts that explicitly record automation adoption (auto-labeling share, model-assisted QA, and LLM usage). In procurement practice, the model can still support scenario testing by adjusting unit-price and QA-intensity inputs to reflect expected automation gains, but final budgets should be anchored to the most recent price book and monitored for drift.
Future research can extend the framework by adding modality-specific sub-models, incorporating schedule and risk variables (delivery time, defect rates, rework), and integrating automation adoption indicators (auto-labeling share, model-assisted QA) to capture productivity shifts. Probabilistic models that directly output prediction intervals are also promising for high-stakes public budgeting.
5.7. Reporting Template and Procurement Translation
A key barrier to transportable cost estimation is inconsistent itemization across contractors and domains. Based on the harmonized schema, we propose a minimal reporting template that preserves flexibility while making core economic primitives explicit. Specifically, future cohorts should be required to report: (i) scale in at least one modality-appropriate unit (records and either GB or minutes); (ii) process-level unit prices for labeling, cleaning, and QA/inspection as separable line items; (iii) process intensity parameters (e.g., sampling plan or inspection share) that operationalize acceptance criteria; and (iv) binary flags plus optional cost lines for governance options such as anonymization, expert review, and cloud execution. These items enable reviewers to distinguish “not applicable” from “not reported,” which is essential for both auditability and model validity.
For practitioners, the factor structure provides a compact interface for negotiation. Rather than debating dozens of raw indicators, agencies can translate proposals into a small set of levers—scale intensity, unit-price system, quality-control intensity, workforce mix, and governance options—and compare them against historical distributions. This supports consistent procurement decisions across domains while leaving room for justified exceptions (e.g., rare-domain expert review).
5.8. Threats to Validity and Replication Guidance
Public budgeting and procurement decisions require not only accurate point forecasts but also a defensible account of where errors can arise and how results can be reproduced. Accordingly, we summarize key threats to validity and provide replication guidance aligned with audit and governance needs.
Key threats include: (i) measurement and reporting noise in the source reports (mixed units/currency, embedded text, inconsistent coding); (ii) missingness and imputation choices, where unreported values may reflect bundling/reporting conventions rather than true absence; (iii) external validity limits under portfolio, tooling, or market shifts; and (iv) model risk for atypical proposals. Mitigation should therefore combine transparent preprocessing logs, missingness-indicator features, sensitivity checks across imputation strategies, interval reporting, and periodic recalibration.
Table 10 summarizes the main threats to validity and the corresponding replication guidance for audit-ready application.
6. Conclusions
Across three input tracks (Baseline, PCA-enhanced, and FA-enhanced) and four regression models evaluated under nested cross-validation, XGBoost achieves the best pooled out-of-fold performance in each track (Baseline: R2 = 0.863, RMSE = 1104.2, MAE = 731.4, MAPE = 0.374; PCA-enhanced: R2 = 0.868, RMSE = 1084.9, MAE = 746.9, MAPE = 0.358; FA-enhanced: R2 = 0.849, RMSE = 1160.6, MAE = 754.9, MAPE = 0.364). Monte Carlo-based risk estimates further support uncertainty-aware budgeting by reporting upper-tail cost percentiles and overrun probabilities.
Global and local explanations using SHAP confirm that project scale and process-level unit prices are the dominant cost drivers, whereas task difficulty, QA intensity, cloud use, and expert participation contribute secondary but policy-relevant effects. Sensitivity simulations translate these relationships into practical budgeting levers that can support contract line-item design, negotiation, and change control.
From a policy perspective, the results support factor-informed, explainable machine learning as a decision-support tool for budgeting and governance in national AI data programs. Key limitations relate to the program-specific cohort (2020–2022) and the constraints of standardized reporting; we discuss these limitations and avenues for extension (e.g., richer governance variables and uncertainty-aware updates) in
Section 6.2.
6.1. Practical Utilization Framework
To support budget planning and evidence-based program management, the proposed model can be embedded in a lightweight utilization framework aligned with public-program lifecycles:
Ex Ante Appraisal: Derive an initial cost band and key drivers and document assumptions for unit prices and quality tiers.
Procurement Calibration: Set contract line items and QC/cleaning/anonymization requirements using scenario elasticities rather than flat percent adders.
In-Flight Control: Refresh forecasts at milestones using updated scale and unit-price signals, monitor drift, and flag overruns when drivers move outside historical ranges.
Ex Post Benchmarking: Store realized drivers and outcomes to update priors, improve standard cost tables, and support iterative policy revisions.
6.2. Future Research Directions
First, the dataset spans 2020–2022 within a specific national program; replication in newer cohorts and other jurisdictions is needed to assess its transportability as technology, labor markets, and governance norms evolve.
Second, although 35 variables were harmonized, the analysis relies mainly on quantitative fields available in final reports. Richer qualitative drivers (e.g., governance maturity, contractor capability, coordination frictions) were not systematically encoded and could improve explanatory power if operationalized.
Third, we focused on regression models suited to tabular program data. Future research could compare against hybrid architectures (e.g., tabular transformers) and strengthen robustness with audited micro-logs of task time, rework, and defect rates, as well as uncertainty-aware prediction intervals.
6.3. Reproducibility and Transparency
To facilitate practical uptake, we recommend maintaining a lightweight reproducibility package that records variable definitions, preprocessing rules, and model settings used for each budgeting cycle. Even when datasets cannot be fully released due to confidentiality, agencies can publish aggregated metadata and evaluation protocols to enable scrutiny of cost drivers and forecasting reliability.
Recommended minimum transparency checklist:
Define the outcome (TPC) and units consistently (e.g., million KRW) and document any transforms.
Publish the 35-item schema and the derived 24 numeric predictors, including how categorical fields are encoded.
Record missing-value rules (e.g., structural zero vs. unknown) and standardization parameters.
Provide the validation protocol (nested CV splits, metrics) and report the final hyperparameters for the selected model in Supplementary Table S5.
Archive SHAP summaries and scenario elasticities used for procurement and change-control decisions.
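The checklist above can be operationalized as a machine-readable metadata record written once per budgeting cycle. The sketch below is illustrative only: the field names and values are hypothetical placeholders for the items the checklist asks agencies to document, not the study's actual schema.

```python
import json

# Minimal reproducibility record for one budgeting cycle.
# All field values here are illustrative placeholders.
record = {
    "outcome": {"name": "TPC", "unit": "million KRW",
                "transform": "documented per cycle"},
    "predictors": {"schema_items": 35, "numeric_predictors": 24,
                   "categorical_encoding": "documented per field"},
    "missing_values": {"structural_zero": "keep as 0",
                       "unknown": "rule documented per field"},
    "standardization": {"parameters": "recorded from training folds"},
    "validation": {"protocol": "nested CV",
                   "metrics": ["MAE", "RMSE", "R2"],
                   "final_hyperparameters": "see Supplementary Table S5"},
    "artifacts": ["SHAP summaries", "scenario elasticities"],
}

# Archive alongside the cycle's forecasts and rationale log.
with open("budget_cycle_metadata.json", "w") as f:
    json.dump(record, f, indent=2)
```

Publishing such a record, even when the underlying microdata stay confidential, gives external reviewers enough structure to scrutinize cost drivers and forecasting reliability.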
Table 11 presents the audit-ready implementation checklist aligned with public AI data-program lifecycles.
Table 11. Implementation checklist aligned with public AI data-program lifecycles.
| Stage | Key Inputs | Manager Actions | Outputs/Checks |
|---|---|---|---|
| Ex-ante appraisal | Initial scope (records/GB/min) + unit-price priors | Run forecast; document assumptions | Cost band + top drivers; rationale log |
| Procurement | BoQ + acceptance criteria | Calibrate price book; specify unit prices and QC rules | Contract items consistent with drivers |
| Execution | Updated scale + process logs | Refresh forecast at milestones; monitor drift | Overrun flags; change-control triggers |
| Acceptance | QC outcomes + defect/rework | Verify deliverables; reconcile deviations | Final cost vs. plan; quality report |
Table 12. Change-control guidance using sensitivity elasticities.
| Decision Point | Recommendation | Operational Indicator | Decision Rule |
|---|---|---|---|
| Unit-price revision request | Treat labeling/cleaning/inspection unit prices as controlled parameters; compute impact using elasticities before approval. | Proposed Δ (unit price); evidence of new price book/vendor quote; change in task specification. | Estimate impact using: ΔTPC ≈ elasticity × Δ (unit price) (holding scale constant); approve only with documented justification. |
| Scope growth claim | Separate true scope growth from rework and process inefficiency. | Δrecords/GB/min vs. baseline; change logs; rework rate and defect logs. | Reject unit-price change if the driver is scope; revise scale inputs instead; if rework-driven, apply corrective actions and QA controls. |
| Quality escalation | Justify higher QC tiers with measurable target gains (F1/mAP) and expected rework reduction. | Observed vs. target quality; inspection findings; defect density and rework rate. | Allow unit-price uplift only with explicit quality targets, acceptance criteria, and ex post verification. |
| Tooling change (automation/LLM assistance) | Update priors and refresh the model with the latest cohort to reflect productivity shifts. | Adoption of automation; cycle time change; before/after productivity metrics. | Re-baseline unit prices and elasticities after tooling adoption; document the new operating regime. |
| Cloud/expert add-on | Require ex ante rationale and ex post evaluation of benefits. | Cloud usage/expert participation flags; cost deltas; expected risk reduction or quality improvements. | Treat as add-on items with clear deliverables; audit realized benefits and adjust future priors. |