1. Introduction
Artificial intelligence (AI) has moved from research laboratories into production across language, vision, and speech, supported by advances such as Transformer-based language models and high-performing vision architectures that achieve strong accuracy across diverse tasks. However, recent scaling evidence indicates that model-centric optimization alone yields diminishing returns: additional accuracy increasingly requires disproportionate increases in data and compute, and performance plateaus often reflect data limitations rather than architectural novelty [1,2,3,4].
Real-world deployment incidents—ranging from biased decision-support systems to uneven recognition accuracy across demographic groups—have been traced to training data that are noisy, imbalanced, or non-representative. These cases highlight how data quality conditions fairness, reliability, and downstream risk, and motivate a data-centric AI stance in which dependable performance is driven more by data accuracy, diversity, and governance than by increasingly complex models [5,6,7,8,9].
Despite this shift, the economics of building training data for AI systems remain underspecified. In large projects, data work spans collection, cleaning, labeling, validation, storage, governance, and security, yet budgeting practices often reduce these activities to coarse unit prices or linear rules of thumb. Labeling and quality assurance can consume a major share of effort and cost—particularly for unstructured modalities such as images, speech, and video—because annotation is labor-intensive, domain dependent, and tightly coupled with quality targets [10,11,12].
A conventional procurement-based cost model assumes linear additivity of process-level expenditures, expressed as

TPC = Σ_p (u_p · s_p),

where u_p denotes the unit price for process p and s_p represents the ex ante workload scale (e.g., records, gigabytes, or minutes) assigned to process p [1]. This formulation rests on two implicit assumptions: (i) cost separability, meaning that the cost of each process is independent of all others, and (ii) a constant marginal effort per unit of workload.
In practice, however, AI training-data construction departs substantially from both assumptions. Such pipelines typically involve iterative quality-assurance (QA) loops in which items that fail inspection are subjected to relabeling or recoding before undergoing re-inspection [13]. If a fraction r of items requires rework at each iteration, the expected effective workload expands as N·(1 + r + r² + ⋯) = N/(1 − r) for 0 ≤ r < 1, yielding a non-linear expected cost term u_p·N/(1 − r) in place of the linear term u_p·N.
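The geometric-series argument above can be made concrete in a few lines of code. The sketch below is illustrative only (the function names are ours, not from the study):

```python
def effective_workload(n_items, rework_rate):
    """Expected number of item-passes when a fraction r of items fails QA
    on every iteration: N * (1 + r + r^2 + ...) = N / (1 - r)."""
    if not 0.0 <= rework_rate < 1.0:
        raise ValueError("rework rate must satisfy 0 <= r < 1")
    return n_items / (1.0 - rework_rate)


def expected_process_cost(unit_price, n_items, rework_rate):
    """Non-linear expected cost term u_p * N / (1 - r), replacing linear u_p * N."""
    return unit_price * effective_workload(n_items, rework_rate)
```

For example, at r = 0.2 a 1000-item batch behaves like 1250 items, a 25% amplification over the linear estimate.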
Furthermore, the rework rate r is not a fixed constant; it is a function of multiple interacting factors, including task difficulty (d), target quality threshold (q), reviewer strictness, toolchain maturity (including automation and large-language-model assistance), and the degree of domain-expert involvement (e), i.e., r = f(d, q, reviewer, tools, e). As a consequence, the total project cost exhibits interaction effects among scale, unit price, and quality intensity, and undergoes non-linear amplification under stricter QA regimes. These structural properties motivate the adoption of flexible, non-linear learners—in particular, tree-based ensemble methods—which are capable of capturing higher-order interactions while remaining interpretable through post hoc feature attribution techniques.
Although automation aids such as active learning, semi/weak supervision, and large language model (LLM) assistance can reduce marginal annotation effort, these methods do not directly provide procurement-ready project-level cost estimates nor explain how operating choices (expert participation, target quality intensity, anonymization, cloud operations) shift total budgets [14,15,16,17]. This gap is especially consequential in public programs, where budgets are committed ex ante and mid-course adjustment is limited [9,18].
This study addresses the gap by developing and validating a machine-learning–based cost estimation model for AI training-data construction, grounded in a comprehensive multi-year public program. We compile and standardize 386 usable final reports from Korea’s national AI training-data initiative (2020–2022, nominal KRW), representing approximately KRW 1.2 trillion (approximately USD 1 billion) in public investment. From each report, we extract a harmonized schema of 35 cost-influencing variables observable across projects and construct an analysis matrix of 386 × 24 numerical predictors after preprocessing.
This paper contributes in three ways. First, it provides a harmonized, multi-dimensional cost-driver schema for large-scale public AI training-data projects. Second, it evaluates raw, PCA-enhanced, and factor-analytic (FA) representations under nested cross-validation, showing how structural factors can improve both predictive performance and interpretability. Third, it translates model explanations into actionable budgeting levers through SHAP-based attribution and policy-oriented sensitivity simulations. Accordingly, this study addresses the following research questions:
RQ1. Do FA-augmented models reduce prediction error relative to Baseline and PCA-enhanced models?
RQ2. Which factors/variables contribute most to total project cost (TPC), and what are their marginal effects and sensitivities?
RQ3. How do policy/operations scenarios (difficulty, process shares, expert participation, cloud use, anonymization) alter predicted TPC?
The remainder of this article is organized as follows. Section 2 reviews related work on AI training-data cost estimation. Section 3 presents the research design, including data construction, preprocessing, alternative representations, model training/validation, and interpretability analyses. Section 4 reports the empirical results, Section 5 discusses implications and deployment considerations, and Section 6 concludes with limitations and future research directions.
2. Literature Review of Cost Estimation Techniques
2.1. Concept and Measurement of AI Training-Data Cost
As AI systems have matured from proof-of-concept to production, the economics of training data—its acquisition, refinement, labeling, validation, governance, and security—has become a decisive determinant of project feasibility and performance [7,8,9]. Early budgeting practices treated data work as a sub-item of overall development cost and relied on coarse heuristics such as cost per record or per GB, headcount-based timescales, or linear rules of thumb borrowed from software engineering [10,12]. These approaches struggle to capture cross-modal differences, workflow decomposition (collection → cleaning → labeling → validation/QA), and the effects of workforce mix, cloud operations, and privacy compliance on total project cost (TPC).
Measurement frameworks subsequently evolved to incorporate process granularity, using per-task unit prices (labeling/cleaning/QA), process shares (e.g., QA ratio), scale metrics (records, GB, minutes), and quality outcomes (F1, mAP) as proxies for required effort [10,11,12]. Studies in unstructured modalities report that labeling and QA may consume 60–80% of the effort/cost due to annotation difficulty and domain dependence, motivating human-in-the-loop and workflow-optimized annotation systems [15,19]. Two limitations persist: multicollinearity among size/time/difficulty variables undermines stable inference in linear models, and linear additivity obscures nonlinearities and interactions that govern real budgets.
2.2. Toward Multi-Factor, Explainable Cost Modeling
Evidence from data-centric AI suggests that performance plateaus increasingly reflect data quality and coverage rather than architectural novelty; consequently, organizations have intensified investments in governance, validation, and privacy–cost components that are not well captured by average unit prices alone [7,8,9,20]. Recent cost-estimation studies therefore model dataset building as a multi-factor production process with inputs spanning data scale, unit prices by process, workforce composition and coordination costs, operating environment (cloud/tools), compliance (anonymization/access control), and quality targets [10,12].
Two methodological strands recur. First, principal component analysis (PCA) mitigates multicollinearity and can improve generalization, but components often lack semantic interpretability [16]. Second, factor analysis (FA) yields interpretable latent factors aligned with managerial and policy levers (e.g., unit-price system, workforce mix, quality intensity, scale, time/difficulty, process shares, operating environment, privacy). Combining PCA/FA features with tree-based ensembles (Random Forest, XGBoost, LightGBM) is effective because these models capture nonlinearities and interactions and remain robust to outliers and mixed scales [4,16,17].
Parallel streams on cost reduction—active learning, semi/weak supervision, and LLM-assisted labeling—demonstrate lower marginal annotation effort and faster convergence to quality targets [21,22,23,24], but they do not in themselves deliver procurement-ready project-level cost estimates. The literature therefore converges on the need for models that are predictively strong and policy interpretable, mapping driver-level changes into TPC impacts that can inform planning and contracting [16,20,25].
The release of GPT-3 accelerated the practical adoption of large language models in automation workflows, including data annotation, which may change present-day cost structures relative to our 2020–2022 cohort [17].
2.3. Cost Drivers, Quality Targets, and TPC
For the interoperability of dataset-quality metadata and consistent reporting across projects, we refer to the W3C Data Quality Vocabulary (DQV) [26].
Empirical work consistently finds that data scale (records/GB/minutes) and process unit prices (labeling, cleaning, QA) are primary contributors to TPC across modalities, with higher elasticities in image/speech/video due to annotation difficulty and review intensity [10,11,12]. Workforce configuration adds coordination and supervision overhead; cloud execution introduces storage/monitoring costs but may reduce time-to-delivery; and privacy measures impose persistent governance costs whose magnitude varies by domain sensitivity [12,20].
Explainability methods such as SHAP have been applied to boosted-tree cost models to rank driver importance and support scenario analysis, turning model outputs into actionable levers for budgeting and policy [25]. Robust validation (nested cross-validation, out-of-time tests) and uncertainty quantification (bootstrap, quantile prediction) are emphasized for public-sector decision cycles where budget revisions are rigid and stakes are high [16].
Research gaps remain: (i) the availability of large, standardized, cross-project corpora with harmonized variables spanning process economics, workforce, operations, privacy, and quality is limited; (ii) there is insufficient evidence on whether factor-augmented representations improve both accuracy and interpretability; and (iii) there are few policy-ready sensitivity mappings translating levers into TPC deltas [4,12,16,20,25].
2.4. Prior Cost-Estimation Approaches and Gaps
Cost estimation for data-intensive projects has traditionally relied on variants of bottom-up costing, analogical estimation, and parametric models. Bottom-up approaches decompose work into activity lists (e.g., collection, cleaning, labeling, inspection, and management) and multiply expected effort by wage or unit-price rates. These methods are transparent and procurement-friendly, but they are labor-intensive and sensitive to how granularly tasks are specified, often undercounting rework and coordination overhead.
Analogical estimation extrapolates from prior projects with similar scope. While practical in early planning, it tends to be unstable when the reference set is small or when reporting conventions vary across agencies and vendors. In public programs, analogical methods also inherit legacy pricing biases (e.g., inflated unit prices for historically under-specified QC), which can persist across cycles without explicit correction.
Parametric models map a small set of drivers to total cost via regression-like relationships. In software engineering, this family includes function point analysis, COCOMO-style models, and productivity-based cost functions. For AI data construction, however, the driver space is broader: scale (records/GB/minutes), modality and task difficulty, workforce structure, security constraints, and quality thresholds interact in non-linear ways, complicating the selection of a compact driver set.
Recent work proposes using production-function perspectives, where total cost is explained by scale and process intensity (e.g., labeling minutes per item) with adjustment factors for difficulty and governance requirements. These perspectives highlight the centrality of unit prices as policy levers, but they often rely on strong assumptions about constant returns or stable productivity that may not hold when toolchains shift (e.g., automation or LLM assistance).
Another stream adopts activity-based costing and time-and-motion logging to capture true effort, including rework cycles. This is conceptually appealing but requires granular operational telemetry that is rarely standardized across vendors and programs. Without standardized schemas and audit protocols, time logging can become inconsistent and difficult to compare across projects.
Across these approaches, two recurring gaps motivate machine-learning-based estimation in the public sector. First, heterogeneous reporting makes it difficult to harmonize inputs and compare projects; this motivates a common reporting schema that can support both costing and accountability. Second, purely deterministic estimates struggle to capture complex interactions among scale, unit prices, and optional governance features; flexible models can better represent such interactions while still supporting interpretation via feature attribution and scenario analysis.
Accordingly, a practical research goal is not merely minimizing predictive error, but producing a policy-ready model that is transparent, updateable, and compatible with procurement workflows—i.e., a model that can be audited and used to justify budget decisions rather than a black box.
2.5. Toward Data-Centric, Policy-Ready Estimation
A data-centric perspective reframes budgeting as an iterative process of improving data quality and reporting consistency, not only improving algorithms. For cost estimation, this implies that the reliability of forecasts critically depends on the stability of variable definitions and the availability of comparable historical records. A standardized schema (and consistent coding rules) is therefore a governance asset: it reduces ambiguity in procurement, supports benchmarking, and enables the periodic refreshing of estimation models as the program evolves.
Policy-ready estimation also requires interpretability. Decision makers need to understand which drivers dominate cost and how changes in a driver (e.g., labeling unit price or required inspection stringency) translate into budget impacts. Tree-based ensemble models paired with post hoc explanation (e.g., SHAP) offer a practical compromise: they can capture non-linearities in tabular data and still provide local and global attributions.
Fairness and accountability concerns matter, even for cost estimation. If certain modalities or domains systematically incur higher costs due to security constraints or workforce scarcity, programs may inadvertently underfund those domains or create procurement incentives that degrade quality. Transparent driver reporting helps detect such structural disparities and supports equitable resource allocation.
Finally, policy-ready models should be designed for maintenance. Rather than treating estimation as a one-time academic exercise, agencies can institutionalize periodic retraining with new cohorts, drift monitoring for unit prices, and calibration checks for error across project types. This aligns with established MLOps principles such as documentation, versioning, and evaluation under distribution shift.
In this study, these considerations motivate the focus on (i) harmonized drivers drawn from standardized program reporting, (ii) dimensionality reduction to stabilize inputs and reduce collinearity, and (iii) interpretable ML with sensitivity analysis for actionable governance levers.
These design choices aim to provide a forecasting tool that is both accurate and operationally usable in public AI data programs, supporting budgeting, procurement, and post hoc accountability.
3. Methodology (Research Design)
3.1. Research Procedure
We follow a four-step procedure to develop a machine-learning cost estimation model for AI training-data projects while reflecting operational characteristics and ensuring validity. First, we collected and harmonized 386 government project final reports (2020–2022), defined Total Project Cost (TPC) as the outcome, and specified cost-driver dimensions from prior work (scale, unit prices, workforce, quality, operations, and governance).
The end-to-end workflow is summarized in Figure 1.
Second, we constructed a consistent numeric feature space and created three input tracks: Baseline (selected numeric predictors), PCA-enhanced (Baseline plus principal components), and FA-enhanced (Baseline plus latent factors). Third, we validated structure and predictive stability through PCA/FA diagnostics and nested cross-validation across tracks. Fourth, we evaluated predictive validity using a hierarchy of models (linear baselines to tree ensembles) and conducted interpretability and policy-facing sensitivity analyses using SHAP and scenario simulations.
Table 1 summarizes the track configurations. Under the current evaluation protocol, performance differences between PCA-enhanced and FA-enhanced tracks are small, and model ranking is primarily driven by algorithm choice rather than representation choice.
3.2. Data and Variable Construction
We standardized administrative final reports from 386 government-funded AI training-data projects conducted between 2020 and 2022 and harmonized 35 common reporting items per project (Table 2). From these, we retained a final set of 24 numeric predictors suitable for learning, covering scale (records, GB, minutes), per-process unit prices (labeling/cleaning/QA), process shares, workforce mix, operational flags (cloud use, expert participation), privacy/governance (anonymization), and quality indicators (F1, mAP).
The dependent variable is TPC (million KRW), defined as end-to-end expenditure including collection, cleaning, labeling, validation/QA, quality management, expert consultation, labor, and infrastructure. To reduce the risk of target leakage, we do not use any predictors that are algebraically derived from TPC (e.g., Total_Cost_per_Record, Total_Cost_per_GB, Total_Cost_per_Minute). Such derived indicators, when computed, are reported only descriptively and excluded from supervised learning and representation learning (PCA/FA).
Inflation adjustment: All cost figures (TPC and unit prices) are recorded in nominal KRW, as reported in the administrative final reports for each project year (2020–2022). Because the observation window is short and the models are trained on standardized predictors, we use nominal values in the main analysis. As a robustness check, we also deflated TPC and unit-price variables to KRW in 2022 prices using annual consumer price index (CPI) deflators (Statistics Korea, KOSIS; CPI base year 2020 = 100). Predictive performance and key driver rankings changed only marginally, and the main conclusions were unchanged (see Section 4.4).
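For readers who wish to reproduce the robustness check, the deflation arithmetic is a simple CPI ratio. The index levels below are placeholders for illustration, not the KOSIS figures used in the study:

```python
# Hypothetical CPI index levels (base 2020 = 100); the actual analysis
# uses annual deflators published by Statistics Korea (KOSIS).
CPI = {2020: 100.0, 2021: 102.5, 2022: 107.7}


def to_2022_prices(nominal_krw, year, cpi=CPI):
    """Deflate a nominal figure to 2022 prices: nominal * (CPI_2022 / CPI_year)."""
    return nominal_krw * cpi[2022] / cpi[year]
```

Applying the same ratio to TPC and all unit-price variables preserves their relative structure within a project year, which is why driver rankings are largely insensitive to the adjustment.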
3.3. Preprocessing and Alternative Representations (PCA/FA)
The raw administrative data were originally stored in a single spreadsheet in a transposed format (variables as rows, projects as columns): the first row (“Dataset Name”) provided identifiers for 386 projects, and the remaining 35 rows corresponded to harmonized reporting items. To build an analysis-ready matrix, we first reshaped the sheet into the standard form (one row per project, one column per variable), then normalized data types by removing currency/unit strings (e.g., “8500 won”, “55 persons”), parsing numeric tokens from mixed notations (e.g., “0.8(80%)”, “approximately 66,600 won”), and applying explicit coding rules for binary/ordinal fields (e.g., Y/Yes → 1, N/No → 0). After this structuring step, we applied a five-step preprocessing pipeline to improve cross-project comparability: (1) field harmonization and unit normalization across reports; (2) missingness profiling to distinguish non-applicable from unreported values; (3) fold-wise imputation of missing numeric entries (median for numeric fields and mode for binary/ordinal fields) with additional missingness-indicator flags to preserve the distinction between “zero” and “unreported”; (4) removal of constant/degenerate columns and sanity checks for extreme values; and (5) z-score standardization of continuous predictors prior to PCA/FA and downstream modeling (binary indicators retained as 0/1).
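As an illustration of the type-normalization step, a minimal parser for the mixed notations quoted above might look as follows (a sketch, not the project's actual cleaning code):

```python
import re


def parse_numeric(token):
    """Extract the leading numeric value from mixed report notations,
    e.g. '8500 won', '55 persons', 'approximately 66,600 won', '0.8(80%)'.
    Returns None when no numeric token is present."""
    m = re.search(r"\d[\d,]*\.?\d*", token)
    if m is None:
        return None
    return float(m.group().replace(",", ""))


def parse_binary(token):
    """Apply the explicit coding rules for binary fields (Y/Yes -> 1, N/No -> 0).
    Unrecognized or blank entries stay None (i.e., missing, not 'No')."""
    t = token.strip().lower()
    if t in {"y", "yes", "1"}:
        return 1
    if t in {"n", "no", "0"}:
        return 0
    return None
```

Keeping unrecognized entries as `None` rather than 0 is what later allows the pipeline to distinguish "unreported" from a genuine zero.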
Missingness was limited (13 cells out of 9264 numeric entries; 0.140%), affecting seven projects (1.81%). A manual audit (Table A1, adapted from the table in Section 5.8 of the full technical report) confirmed that all missing entries reflect unreported values (i.e., those not itemized in the administrative final report), rather than true zeros. Accordingly, we treated blanks as missing and handled them via imputation strategies described in Section 3.3.1, and we avoided listwise deletion because it reduces the sample (386 → 379) and shifts the target distribution, potentially harming representativeness.
Using the resulting 24 standardized numeric predictors (Section 3.2), we derived two alternative representations. PCA retained the first 10 principal components (PC1–PC10), explaining 81.7% cumulative variance. FA extracted nine latent factors (eigenvalue ≥ 1), explaining 77.8% cumulative variance and enabling semantic labeling based on dominant loadings. We evaluated three input tracks: Baseline (24 predictors), PCA-enhanced (Baseline + PC1–PC10), and FA-enhanced (Baseline + Factor_1–Factor_9). This design enables a direct comparison between variance-preserving compression (PCA) and semantically interpretable structure (FA).
Table 3 reports the factor labels and representative high-loading variables used for interpretation.
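The track construction itself is mechanical: z-scored predictors are projected onto the leading components and the scores are appended to the Baseline matrix. A compact NumPy sketch (ours, with a plain eigendecomposition standing in for whatever PCA routine the study used):

```python
import numpy as np


def pca_components(X, n_components):
    """Project (already z-scored) predictors onto their leading principal
    components. Returns (scores, explained_variance_ratio)."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)            # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]  # keep the largest
    scores = Xc @ eigvecs[:, order]
    evr = eigvals[order] / eigvals.sum()
    return scores, evr


def pca_enhanced_track(X, n_components=10):
    """Baseline predictors augmented with PC scores (Baseline + PC1..PCk)."""
    scores, _ = pca_components(X, n_components)
    return np.hstack([X, scores])
```

With the study's 386 × 24 matrix and n_components = 10, the PCA-enhanced track would be 386 × 34; the FA-enhanced track is built analogously from factor scores.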
3.3.1. Missing-Value Handling
Missing values in the administrative final reports mainly arise from omissions during the preparation and submission of individual project final reports, where certain cost-driver and operational fields are left blank. In this study, a comprehensive missingness audit (Table A1, adapted from the table in Section 5.8 of the full technical report) verified that all observed missing entries in the selected predictors correspond to unreported values, rather than non-applicable activities or true zeros. Therefore, we do not encode missing numeric entries as zero. Instead, we retain missing values as missing and apply imputation and missingness-indicator features to preserve all 386 projects while maintaining the semantic distinction between ‘zero’ and ‘unknown’.
To enable PCA/FA, which requires a complete matrix, and to support stable learning across nested cross-validation, we adopt a two-part strategy. First, missing numeric predictors are imputed using the median computed on the training portion of each fold (to avoid information leakage), and the learned median is then applied to the corresponding validation/test portion. Second, for each predictor with any missingness, we add a binary missingness indicator (1 = missing/unreported; 0 = reported). This allows models—especially tree ensembles—to learn systematic patterns associated with non-reporting without conflating them with true zeros.
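A minimal sketch of this fold-wise impute-plus-flag step (illustrative; `None` stands for an unreported cell, and medians must be learned on the training fold only):

```python
from statistics import median


def fit_medians(train_rows):
    """Learn per-column medians on the training fold only (None = unreported),
    so no information leaks from validation/test folds."""
    cols = list(zip(*train_rows))
    return [median(v for v in col if v is not None) for col in cols]


def impute_with_flags(rows, medians):
    """Fill unreported cells with the training medians and append one
    missingness indicator per column (1 = missing/unreported, 0 = reported)."""
    out = []
    for row in rows:
        flags = [1.0 if v is None else 0.0 for v in row]
        filled = [m if v is None else v for v, m in zip(row, medians)]
        out.append(filled + flags)
    return out
```

Because the flags are ordinary 0/1 columns, tree ensembles can split on them directly and learn systematic non-reporting patterns without conflating them with true zeros.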
After fold-wise imputation, the analysis-ready matrix is complete for PCA/FA and downstream modeling (N = 386). Importantly, zeros in the cleaned dataset represent genuine zero-valued measurements (when they occur), whereas unreported fields are represented through the combination of imputed values and their missingness indicators. We report the variable-by-variable missingness distribution and pre/post preprocessing comparison in Table A1, and we summarize the preprocessing conventions by variable group in Table 4.
Missingness was rare (13 entries total) but non-negligible conceptually because treating unreported values as zeros would impose a strong and potentially biasing assumption. We therefore revised the preprocessing pipeline to align with the ‘unreported’ interpretation confirmed by the audit and to provide explicit sensitivity checks under alternative missing-data strategies (Section 4.5).
For binary operational indicators (e.g., cloud execution, expert participation, anonymization), unreported entries are treated as missing (NaN) and handled in the same way: median/mode imputation within folds plus a missingness indicator. This avoids interpreting ‘blank’ as ‘No’ unless the report explicitly states non-use. Tree-based learners can exploit both the imputed value and the missingness flag to capture interactions (e.g., scale × unit price × operations) without imposing linear assumptions.
As an additional robustness check, we compare (i) our primary imputation-with-indicators approach with (ii) imputation without missingness indicators and (iii) listwise deletion where feasible. We find that the main performance ranking and interpretability conclusions—including the dominant cost drivers and the overall sensitivity patterns—remain consistent across these missing-data treatments, indicating that our results are not artifacts of a particular handling strategy (Section 4.5).
3.3.2. Post-Preprocessing Descriptive Statistics
After preprocessing (including the missing-value handling described above), we produced descriptive statistics for the dependent variable (TPC) and all 24 numeric predictors to verify plausibility, detect extreme values, and characterize cross-project heterogeneity. Because project size and costs are typically heavy-tailed in public procurement, we report not only means and standard deviations, but also medians and interquartile ranges (IQRs) alongside minimum–maximum ranges.
Consistent with the program’s portfolio structure, scale-related variables (e.g., total records, total GB, total minutes, and average time per record) exhibit right-skewness, and TPC shows a similar heavy-tailed distribution driven by a small number of very large projects. Unit-price variables (labeling, cleaning, and QA/inspection per record) vary substantially across domains and difficulty levels, reflecting different annotation guidelines, review intensity, and contractor productivity assumptions.
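The robust summaries described above can be computed with the standard library alone; the sketch below uses the default exclusive method of `statistics.quantiles` (an illustration of our reporting convention, not the study's exact tooling):

```python
from statistics import quantiles


def robust_summary(values):
    """Median and IQR alongside min-max, suited to heavy-tailed cost variables
    where the mean is dominated by a few very large projects."""
    q1, q2, q3 = quantiles(values, n=4)  # default 'exclusive' method
    return {"min": min(values), "q1": q1, "median": q2,
            "q3": q3, "iqr": q3 - q1, "max": max(values)}
```

On a toy heavy-tailed sample such as [1, 2, 3, 4, 100], the median (3) is far more representative than the mean (22), which is why we report medians and IQRs throughout.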
Target Transformation Check: Because monetary costs are often right-skewed, we tested variance-stabilizing transformations for the target TPC. Specifically, within each outer training fold, we evaluated (i) log1p(TPC) and (ii) a Box–Cox transform (λ estimated on the training fold; TPC > 0) for linear baseline models, and then back-transformed predictions to the original KRW scale for metric computation (MAE/RMSE/MAPE/R²). Normality diagnostics (histograms and Q–Q plots) indicated that log-transformed TPC is approximately normal (i.e., raw TPC is closer to log-normal), and residual plots showed reduced heteroscedasticity for linear baselines. Performance gains for regularized linear models were modest and did not change the overall ranking: tree-based ensembles remained superior. We therefore report results on the original TPC scale for procurement interpretability, while documenting the transformed-target sensitivity in Section 4.4.
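The fold-wise transform-then-back-transform protocol can be illustrated with a one-predictor least-squares baseline (a deliberately simplified stand-in for the regularized models actually evaluated):

```python
import math


def fit_ols_1d(xs, ys):
    """Least-squares slope and intercept for a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


def predict_tpc(xs_train, tpc_train, x_new):
    """Fit on log1p(TPC) within the training fold, then back-transform the
    prediction to the original KRW scale with expm1 before computing metrics."""
    slope, intercept = fit_ols_1d(xs_train, [math.log1p(y) for y in tpc_train])
    return math.expm1(slope * x_new + intercept)
```

The key discipline is that both the transform parameters and the fit use only the training fold, and all error metrics are computed after back-transforming to KRW.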
Workforce variables (number of workers and crowd/full-time ratios) show meaningful dispersion across contractors, suggesting heterogeneous coordination costs. Binary operational indicators (cloud execution, expert participation, anonymization) are not uniformly present across projects, supporting our decision to treat them as optional but policy-relevant levers in subsequent sensitivity analyses. Quality indicators (F1 and mAP), where reported, are naturally bounded in [0, 1] and are used as secondary drivers when interpreting cost–quality trade-offs.
These descriptive diagnostics informed the range checks and outlier screening performed before model training, and they also motivate the use of robust validation and scenario-based reporting in Section 4.3, where we interpret predictions as budget risks rather than as single-point estimates.
3.3.3. Practice-Based Baseline A: Spreadsheet Heuristic (Unit Cost per Record × Number of Records)
To enhance practical relevance and provide a comprehensive evaluation, we additionally benchmark practice-based non-ML baselines (Baselines A–C) commonly used in budgeting and procurement workflows. Baseline A is a transparent spreadsheet-style estimator, implemented as “unit cost per record × number of records”. In practice, the unit cost may be a single average rate or a sum of process-specific per-record rates (e.g., labeling + cleaning + QA). For cross-validation, when a single average unit cost per record is used, it is estimated from the training fold only and then applied to the held-out fold to avoid information leakage.
Because the administrative reports define per-process unit prices for labeling, cleaning, and QA/inspection in KRW per record, we construct a per-record unit cost by summing these process rates and multiply by the total record count. This is equivalent to the common heuristic “average cost per record × number of records,” while retaining line-item interpretability.
As shown in Supplementary Table S2, the unit-price × workload mapping uses the total record count as the workload scale for labeling, cleaning, and QA/inspection to preserve unit consistency. Supplementary Table S3 provides additional evaluation metrics and supporting results for transparency and reproducibility.
Accordingly, Baseline A predicts total project cost as a transparent line-item sum: TPC_hat = α + (UP_label × N_records) + (UP_clean × N_records) + (UP_QA × N_records), where all unit prices are reported in KRW/record and N_records denotes total record count.
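Implemented directly, the line-item heuristic is a one-liner (α defaults to zero when no offset is fitted on the training fold):

```python
def baseline_a(up_label, up_clean, up_qa, n_records, alpha=0.0):
    """Spreadsheet heuristic:
    TPC_hat = alpha + (UP_label + UP_clean + UP_QA) * N_records,
    with unit prices in KRW/record and N_records the total record count."""
    return alpha + (up_label + up_clean + up_qa) * n_records
```

Its transparency is the point: each term maps to a procurement line item, which is exactly what the non-linear models must beat to justify their added complexity.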
Baseline B (Analogous estimation) predicts cost from similar historical projects. We use a k-nearest neighbors (kNN) approach (k = 10) on standardized scale and key unit-price drivers and take the median total cost of the neighbors as the estimate.
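A sketch of this analogous estimator (assuming driver vectors are already standardized; the pair structure and function name are our own convention):

```python
from statistics import median


def knn_cost_estimate(query, history, k=10):
    """Analogous estimation: median total cost of the k nearest historical
    projects in the standardized driver space (squared Euclidean distance).

    `history` is a list of (driver_vector, total_cost) pairs."""
    ranked = sorted(
        history,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(query, item[0])))
    return median(cost for _, cost in ranked[:k])
```

Using the median rather than the mean of the neighbors limits the influence of a single outsized reference project, which matters given the heavy-tailed TPC distribution.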
Baseline C (Parametric production function) fits a conventional log-linear model using scale drivers (e.g., records, GB, and minutes) within each training fold and exponentiates predictions to obtain costs on the original scale [27].
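A single-driver version of this log-linear fit illustrates the mechanics (the study uses several scale drivers; one is shown here for brevity):

```python
import math


def fit_loglinear(records, costs):
    """Baseline C, one driver shown: ln(TPC) = a + b * ln(records), fitted by
    least squares on the training fold; predictions are exponentiated back
    to the original cost scale."""
    xs = [math.log(r) for r in records]
    ys = [math.log(c) for c in costs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return lambda r: math.exp(a + b * math.log(r))
```

The exponent b is interpretable as a scale elasticity: b > 1 implies diseconomies of scale, b < 1 economies of scale.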
For a fair comparison with ML models, all non-ML baselines are evaluated using the same outer 5-fold splits and metrics (MAE, RMSE, MAPE, and R²). Any fitted parameters (e.g., the offset α in Baseline A and regression coefficients in Baseline C) are estimated on the training fold only and applied to the held-out fold.
3.4. Model Training, Validation, and Selection
We trained four representative algorithms spanning linear regression and flexible nonlinear learners: Ridge, Random Forest, XGBoost, and LightGBM. All models used identical preprocessing to ensure comparability across tracks [20,28,29].
Hyperparameters were tuned with nested 5 × 5 cross-validation. The inner loop optimized hyperparameters and early-stopping configurations, while the outer loop produced unbiased performance estimates. We report RMSE, MAE, MAPE, and R² and select the best configuration based on the overall error profile and interpretability requirements.
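The index bookkeeping behind nested cross-validation, where inner folds partition only the outer training portion, can be sketched as follows (simplified; the study's protocol additionally handles early stopping and metric aggregation):

```python
import random


def kfold_indices(n, k, seed=0):
    """Shuffle indices once, then deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def nested_cv_splits(n, k_outer=5, k_inner=5):
    """Yield (outer_train, outer_test, inner_folds). Inner folds partition only
    the outer training portion, so tuning never touches the outer test fold."""
    for outer_test in kfold_indices(n, k_outer):
        held = set(outer_test)
        outer_train = [i for i in range(n) if i not in held]
        inner_folds = []
        for j in range(k_inner):
            inner_val = outer_train[j::k_inner]
            val = set(inner_val)
            inner_folds.append(
                ([i for i in outer_train if i not in val], inner_val))
        yield outer_train, outer_test, inner_folds
```

All fold-dependent preprocessing (median imputation, standardization, target transforms) is refit inside `outer_train` for each split, which is what makes the outer estimates unbiased.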
3.5. Interpretability, Sensitivity, and Robustness Checks
To open the black box for decision makers, we computed SHAP values for the selected XGBoost configuration (PCA-enhanced track), providing global feature importance and local explanations.
We then conducted what-if sensitivity simulations by perturbing key drivers within realistic ranges (e.g., +10% unit-price shocks; toggling operational flags) and reporting average percentage changes in predicted TPC. Robustness checks included stability of rankings across cross-validation folds and diagnostic analyses for distribution shifts.
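The one-at-a-time perturbation logic can be sketched as follows, assuming a fitted model and an illustrative three-driver feature layout:

```python
# Sketch of the one-at-a-time what-if simulation: apply a +10% shock to a
# single driver and report the average percentage change in predictions.
# The model and three-driver feature layout are illustrative assumptions.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(1, 10, size=(300, 3))        # [scale, up_label, up_qa]
y = X[:, 0] * (X[:, 1] + 0.5 * X[:, 2])      # synthetic cost formation
model = GradientBoostingRegressor(random_state=0).fit(X, y)

def shock_pct(model, X, col, pct=0.10):
    """Average % change in predicted TPC under a +pct shock to column col."""
    X_shocked = X.copy()
    X_shocked[:, col] *= 1.0 + pct
    base, shocked = model.predict(X), model.predict(X_shocked)
    return float(np.mean((shocked - base) / base) * 100)

label_shock = shock_pct(model, X, col=1)     # +10% labeling unit-price shock
```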
4. Results
4.1. Dimensionality-Reduction Diagnostics (PCA and FA)
We assessed whether the standardized numeric feature space admits a lower-dimensional structure exploitable for prediction and interpretation. PCA on the 24 numeric predictors yielded 10 components with a cumulative explained variance of 81.7%. The first three components explained 17.0%, 13.7%, and 10.6% of the variance, respectively. Subsequent components captured additional structure associated with scale, difficulty, process shares, operational flags, and privacy/governance indicators.
Table 5 reports the eigenvalues and explained-variance profile for PC1–PC10.
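The diagnostic itself can be sketched as below; the synthetic matrix stands in for the 386 × 24 predictor table and will not reproduce the reported variance profile:

```python
# Sketch of the PCA diagnostic: standardize the numeric predictors, fit a
# full PCA, and count components needed for a cumulative explained-variance
# threshold. The synthetic 386 x 24 matrix below stands in for the real
# predictors and will not reproduce the reported variance profile.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(386, 5))                     # hidden structure
loadings = rng.normal(size=(5, 24))
X = latent @ loadings + rng.normal(scale=0.5, size=(386, 24))

pca = PCA().fit(StandardScaler().fit_transform(X))
cumvar = np.cumsum(pca.explained_variance_ratio_)
k_80 = int(np.searchsorted(cumvar, 0.80) + 1)          # components for >= 80%
```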
4.2. Model Performance Across Input Tracks
We benchmarked four algorithms (Ridge, Random Forest, XGBoost, and LightGBM) under three input tracks—Baseline, PCA-enhanced, and FA-enhanced—using nested 5 × 5 cross-validation. Across all tracks, the gradient-boosted ensembles outperformed the linear baseline, indicating nonlinear cost formation and interaction effects among scale, unit prices, and operational choices.
Table 6 summarizes pooled out-of-fold performance across tracks and candidate models. XGBoost achieved the best pooled out-of-fold performance in every track (Baseline: R2 = 0.863, RMSE = 1104.2, MAE = 731.4, MAPE = 0.374; PCA-enhanced: R2 = 0.868, RMSE = 1084.9, MAE = 746.9, MAPE = 0.358; FA-enhanced: R2 = 0.849, RMSE = 1160.6, MAE = 754.9, MAPE = 0.364), with LightGBM following closely; Ridge and Random Forest showed larger errors. These results indicate that XGBoost provides the most robust predictive accuracy under the current evaluation protocol.
Adding structure via PCA or FA produced performance close to the Baseline track (differences were small) under nested 5 × 5 cross-validation, with XGBoost consistently achieving the best pooled out-of-fold performance in every track. We emphasize FA primarily for its structured and interpretable latent representation rather than for consistent error reduction over PCA.
To further examine algorithm-level differences, we conducted paired statistical tests comparing PCA-enhanced XGBoost with Random Forest and LightGBM under identical outer 5-fold splits. The comparative error patterns across the Baseline, PCA-enhanced, and FA-enhanced tracks are illustrated in
Figure 2, while the statistical test results are reported in
Supplementary Table S6, including paired t-tests, Wilcoxon signed-rank tests, and fold-wise win-rates. XGBoost consistently outperformed alternative ensemble models across all folds.
For clarity,
Table 6 reports the aggregated pooled out-of-fold performance for all candidate models within each representation track (Baseline, PCA-enhanced, and FA-enhanced). The overall best-performing configuration was PCA-enhanced XGBoost (R2 = 0.868; MAE = 746.9; RMSE = 1084.9; MAPE = 0.358).
A statistical comparison was conducted between the PCA- and FA-enhanced representations. Because the performance gap between the two configurations is relatively small, we ran formal paired tests across the matched outer cross-validation folds. For each outer fold k, we computed the held-out error (RMSE and MAE) for the PCA-enhanced and FA-enhanced models under identical folds and hyperparameter tunings, then tested the paired differences d_k = Error_FA,k − Error_PCA,k using (i) a paired t-test and (ii) a Wilcoxon signed-rank test (two-sided, α = 0.05). The full test statistics and p-values are reported in Supplementary Table S4, showing whether the observed difference is statistically significant under the nested CV protocol.
Using matched outer-fold errors (K = 5), the FA-enhanced configuration yields lower errors than the PCA-enhanced configuration (differences defined as FA − PCA): RMSE mean difference = −32.424 (SD = 6.116); MAE mean difference = −37.040 (SD = 2.553). A paired t-test indicates that the improvement is statistically significant for both RMSE (t = −11.854; p = 0.000290) and MAE (t = −32.448; p = 5.38 × 10^−6). The non-parametric Wilcoxon signed-rank test yields p = 0.0625 for both metrics, which is not significant at α = 0.05 given the small number of folds. Full statistics are reported in Supplementary Table S4.
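The paired-test procedure can be sketched as follows; the five fold-level RMSE values are illustrative, not the study's values. The sketch also makes explicit why the Wilcoxon p-value is floored at 0.0625 with K = 5:

```python
# Sketch of the paired fold-wise comparison: differences d_k = FA - PCA
# tested with a paired t-test and a Wilcoxon signed-rank test. The five
# fold-level RMSE values below are illustrative, not the study's values.
import numpy as np
from scipy import stats

rmse_pca = np.array([1100.0, 1080.0, 1120.0, 1090.0, 1110.0])
rmse_fa = np.array([1066.0, 1049.0, 1084.0, 1061.0, 1073.0])

d = rmse_fa - rmse_pca                     # negative => FA has lower error
t_stat, t_p = stats.ttest_rel(rmse_fa, rmse_pca)
w_stat, w_p = stats.wilcoxon(rmse_fa, rmse_pca)

# With K = 5 folds and all differences of one sign, the exact two-sided
# Wilcoxon p-value is floored at 2 / 2**5 = 0.0625, so the non-parametric
# test cannot reach alpha = 0.05 regardless of effect size.
```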
To improve the interpretability of the cross-model comparison, we provide grouped bar charts visualizing MAE and RMSE across models and representation tracks in
Supplementary Figures S3 and S4.
In addition to the representation-track comparison in
Table 6, we evaluated several practice-based non-ML baselines (e.g., mean predictor, unit-cost heuristics, analogous estimation, and a parametric production function) under the same outer 5-fold protocol to contextualize the gains from the proposed framework.
Overall, the non-ML baselines provide reasonable first-order estimates, but they are consistently outperformed by the best ML configurations. The rule-based additive baseline (Baseline A) remains valuable for high-auditability budgeting scenarios, whereas the proposed explainable ML approach better captures nonlinear and interaction effects among scale and unit prices.
Two conclusions follow. First, cost formation exhibits nonlinearity and interactions (e.g., scale × difficulty × quality intensity), favoring tree ensembles. Second, the PCA- and FA-enhanced representations yield similar accuracy under this evaluation protocol while offering complementary interpretability benefits.
4.3. Explainability and Sensitivity
To interpret the best-performing configuration identified in
Table 6, we compute SHAP values for the XGBoost model (PCA-enhanced track) using SHAP TreeExplainer. The global SHAP summary (
Figure 3) provides a representative portfolio-level view across all 386 projects, revealing which predictors consistently drive increases or decreases in the predicted total project cost (TPC). In line with the cost-formation mechanism discussed earlier, scale-related variables and process-level unit prices dominate the global attribution, while operational and governance indicators (e.g., cloud execution, expert participation, and anonymization) contribute secondary but policy-relevant effects.
Local explanations support auditability for individual projects; however, to avoid anecdotal interpretation, the main text emphasizes global attribution patterns. For transparency,
Supplementary Figure S1 provides example local waterfall explanations illustrating how project-specific configurations combine to produce the final prediction relative to the model’s expected value.
Supplementary Figure S2 provides an error-focused visualization of the track × model comparison (OOF pooled), complementing Table 6 and highlighting the relative advantages of ensemble models over linear baselines. Throughout the SHAP analyses, positive SHAP values increase the predicted TPC, whereas negative values decrease it.
To quantify probabilistic budget risk beyond deterministic one-at-a-time perturbations, we conducted a Monte Carlo simulation that varies inputs according to the historical empirical distributions observed across the 386 projects while preserving joint dependencies among drivers.
Table 7 reports the project-level probabilistic budget risk metrics derived from the Monte Carlo simulation.
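The Monte Carlo step can be sketched as below, assuming a fitted model; resampling entire historical rows, rather than independent marginals, preserves the joint dependence among drivers. The drivers and 150%-of-median threshold are illustrative:

```python
# Sketch of the Monte Carlo budget-risk step: resample entire historical
# driver rows (preserving joint dependence), predict TPC for each draw,
# and summarize tail percentiles and an overrun probability. The model,
# drivers, and 150%-of-median threshold are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_hist = rng.uniform(1, 10, size=(386, 3))
y_hist = X_hist[:, 0] * X_hist[:, 1] + rng.normal(scale=1.0, size=386)
model = RandomForestRegressor(random_state=0).fit(X_hist, y_hist)

idx = rng.integers(0, len(X_hist), size=5000)   # row resampling keeps dependencies
sims = model.predict(X_hist[idx])

p50, p90 = np.percentile(sims, [50, 90])
overrun_prob = float(np.mean(sims > 1.5 * p50)) # P(cost > 150% of median)
```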
4.4. Robustness and Error Analysis
Transformed-Target Robustness: We re-estimated linear baseline models using log1p(TPC) and Box–Cox(TPC) within each outer training fold (transform parameters fitted on training data only) and evaluated back-transformed predictions on the original KRW scale. Both transforms reduced heteroscedasticity in residual diagnostics and slightly improved RMSE/MAE for Ridge; however, the improvements were not large enough to close the gap to the best-performing tree ensembles (notably XGBoost). Consequently, our main conclusions regarding dominant drivers and the comparative results across the Baseline, PCA, and FA tracks are robust to reasonable target transformations. The performance difference between the PCA-enhanced and FA-enhanced XGBoost models remained small in magnitude; paired-test results are reported in Section 4.2 and Supplementary Table S4.
For transparency,
Table 8 reports fold-averaged MAE and RMSE for linear baselines under raw TPC, log1p(TPC), and Box–Cox(TPC), evaluated on the original KRW scale after back-transformation.
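The fold-wise transform logic can be sketched as below for the log1p case (Box–Cox is analogous, with its λ fitted on the training fold only); the data are synthetic stand-ins for the right-skewed KRW costs:

```python
# Sketch of the transformed-target check for the log1p case: the transform
# is applied only to the training fold's target, and predictions are
# back-transformed before scoring on the original scale. Data are synthetic.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 4))
y = np.exp(2 + 3 * X[:, 0] + rng.normal(scale=0.2, size=300))  # right-skewed

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

log_model = Ridge().fit(X_tr, np.log1p(y_tr))   # transform on training fold only
pred = np.expm1(log_model.predict(X_te))        # back to the original scale
mae_log = mean_absolute_error(y_te, pred)

raw_model = Ridge().fit(X_tr, y_tr)             # untransformed baseline
mae_raw = mean_absolute_error(y_te, raw_model.predict(X_te))
```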
As an additional robustness test (separately from the target-transformation sensitivity in
Table 8), we performed an inflation normalization check, as follows.
Inflation Normalization Check: We repeated the full evaluation after deflating TPC and unit-price variables to KRW in 2022 prices using annual CPI deflators (Statistics Korea, KOSIS; CPI base year 2020 = 100).
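The deflation step is simple arithmetic; the CPI index values below are illustrative placeholders, not official KOSIS figures:

```python
# Sketch of the deflation step: convert nominal KRW amounts from each
# project year into 2022 prices with annual deflators. The CPI index
# values below are illustrative placeholders, not official KOSIS figures.
CPI = {2020: 100.0, 2021: 102.5, 2022: 107.7}

def to_2022_prices(nominal_krw, year):
    """Deflate a nominal amount: real = nominal * CPI[2022] / CPI[year]."""
    return nominal_krw * CPI[2022] / CPI[year]

real = to_2022_prices(1_000.0, year=2020)   # 1,000 nominal -> ~1,077 in 2022 prices
```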
Fold-averaged performance metrics and the relative ranking changed only marginally, and the SHAP-based importance ordering of the dominant drivers remained stable; XGBoost remained the top-performing model across tracks under the deflated-cost setting. Error profiles indicate that the largest absolute errors occur in a small number of extremely large or unusually configured projects (e.g., high scale combined with high QA shares and multiple governance constraints). This is expected in public portfolios where a few flagship datasets dominate expenditure and where operational heterogeneity is substantial [30].
Across outer folds, the ranking of top drivers (scale and unit prices) remains stable, supporting the interpretation that these are structural determinants rather than artifacts of a particular split. In addition, factor-augmented models show reduced variance in fold-wise performance relative to Baseline, consistent with the idea that latent structure smooths idiosyncratic correlations among raw variables.
In practice, we recommend pairing point predictions with uncertainty intervals and using local SHAP explanations to examine whether a project’s cost profile matches the nearest-neighbor regions of the training data. Such diagnostic use is important when the model is applied to new project types, novel tooling stacks, or new contracting regimes.
4.5. Additional Robustness Checks and Ablations
We performed additional robustness checks to assess whether the main findings depend on specific preprocessing or evaluation choices. First, we varied the missing-value strategy under the empirically verified assumption that all missing entries are unreported (
Table A1). We compared: (i) fold-wise median/mode imputation with missingness indicators (primary), (ii) fold-wise median/mode imputation without indicators, and (iii) listwise deletion. Across these strategies, the relative ranking of model families and the dominance of scale and unit-price drivers remained stable.
In the underlying cohort, missingness was sparse: 13 missing numeric entries across the 24 predictors (13/(24 × 386) = 0.140%), affecting seven projects (1.81%). This low missingness rate implies that alternative missing-data treatments influence only a small subset of projects rather than the global geometry of the input space.
Listwise deletion reduced the sample from 386 to 379 (−1.81%) and shifted the TPC distribution downward (mean −1.237%, median −4.479%), indicating a risk of under-representing the right tail. By contrast, the two imputation strategies preserved the full cohort and yielded nearly identical performance ordering across Baseline/PCA/FA tracks. These checks provide evidence that the results in
Table 6 are driven by model/representation choice rather than by an arbitrary missing-data rule.
Second, winsorizing extreme unit-price values and log-transforming heavy-tailed scale variables (records/GB/minutes) slightly improved calibration for small projects but did not change the overall ranking: XGBoost remained the top-performing model across tracks, and the main driver categories identified by SHAP remained stable. Under the current evaluation protocol, representation differences between PCA-enhanced and FA-enhanced inputs are small, and no consistent superiority of one representation over the other is observed; model ranking is primarily driven by algorithm choice rather than representation choice.
Third, we evaluated simplified feature sets to probe the marginal value of optional governance features. Models trained on scale-only predictors substantially underperformed, while adding process unit prices (labeling/cleaning/inspection) recovered most of the predictive power. Adding cloud, expert participation, and anonymization produced incremental gains, consistent with their secondary role observed in SHAP and sensitivity analyses.
Fourth, we examined temporal generalization using coarse cohort splits (early vs. late projects within the 2020–2022 window). Performance degraded slightly in the later cohort, consistent with evolving toolchains and pricing. This supports the recommendation to institutionalize periodic model updates and drift monitoring rather than relying on static estimates.
Because our portfolio ends in 2022, it predates the widespread operationalization of generative-AI and LLM-assisted annotation pipelines that can alter productivity and the composition of costs (e.g., shifting effort from manual labeling toward prompt design, validation, and tool integration). Thus, while the dominant driver categories identified here—scale and per-process unit prices—are expected to remain structurally relevant, the numeric magnitudes of unit prices and rework dynamics may differ for modern projects. We therefore recommend treating the model as a living estimator: update unit-price inputs using current price books, periodically refit the model with newer cohorts, and monitor drift in both error and SHAP-based driver rankings as tooling evolves.
Finally, we stress-tested interpretation stability. SHAP global rankings were highly correlated across cross-validation folds, and the sign of major effects remained consistent. Where local explanations diverged (typically in small projects with atypical unit-price patterns), these cases also exhibited higher prediction uncertainty, suggesting that agencies should treat outlier profiles as candidates for manual review rather than relying solely on automated estimates.
While point forecasts help rank proposals and benchmark bids, public budgeting typically requires a defensible range to absorb right-tail risk. We therefore recommend reporting a prediction interval (e.g., 80–90% coverage) alongside the point estimate, using lightweight add-ons that do not alter the core training protocol.
In practice, agencies can derive empirical uncertainty bands in three audit-friendly ways: (i) bootstrap refits, repeatedly resampling the training cohort and refitting the final model to obtain an empirical interval; (ii) gradient-boosted quantile regression (e.g., q10/q50/q90) to directly predict budget percentiles; or (iii) conformal prediction applied to cross-validated residuals to provide distribution-free empirical coverage. Bootstrap intervals are usually the most intuitive for oversight bodies, whereas quantile and conformal methods are efficient for routine scenario recomputation and monitoring under a potential distribution shift.
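Option (iii) can be sketched as a split-conformal add-on, in which absolute residuals on a held-out calibration set yield a distribution-free interval width; the model, ~90% coverage target, and synthetic data are illustrative:

```python
# Sketch of a split-conformal add-on: the absolute residuals on a held-out
# calibration set yield a distribution-free interval width. The model,
# ~90% coverage target, and synthetic data are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 5))
y = 100 * X[:, 0] + 50 * X[:, 1] + rng.normal(scale=5.0, size=500)

X_tr, X_cal, y_tr, y_cal = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)

# Conformal quantile of calibration residuals (finite-sample correction).
resid = np.abs(y_cal - model.predict(X_cal))
n = len(resid)
q = np.quantile(resid, min(1.0, np.ceil(0.9 * (n + 1)) / n))

point = model.predict(X_cal[:1])[0]          # a new proposal's point forecast
lo, hi = point - q, point + q                # ~90% empirical-coverage interval
```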
5. Discussion and Implications
5.1. Why Scale and Unit Prices Dominate
Across the portfolio, scale variables (records, GB, minutes) are consistently the strongest contributors to TPC. This aligns with a production-function intuition: scale increases the volume of work across nearly all process steps, including collection, cleaning, labeling, and validation. However, scale alone does not determine cost; it interacts with unit prices and workflow design. A large but low-difficulty dataset with streamlined QA can be cheaper than a smaller dataset with high difficulty and intensive review policies.
The prominence of unit prices (labeling, cleaning, inspection) indicates that procurement terms and operational productivity are first-order budget drivers. Unit prices partly reflect labor market conditions and vendor capability, but they also embed task design choices (label granularity, ontology complexity, tooling maturity, review policy). Investments in tooling (auto-labeling, model-assisted review, QA automation) can therefore reduce unit prices over time, shifting the cost structure even when scale remains fixed.
These drivers are actionable. Unlike intrinsic attributes such as modality or domain, unit prices and QA policies can be adjusted through tender design, contract clauses, workflow specification, and vendor performance incentives. A policy-ready cost model should represent these levers explicitly and provide interpretable elasticities for change control.
5.2. Implications for Procurement and Change Control
Overall budgets are most sensitive to unit prices for labeling, cleaning, and inspection; we translate these elasticities into concrete change-control guidance in this section.
Public AI training-data projects are commonly procured as fixed-price or ceiling-price contracts, with limited flexibility once execution begins. In this setting, change orders can be a major source of budget drift. The sensitivity results provide practical coefficients for governance: a +10% change in labeling unit price increases predicted TPC by roughly +7–9% on average; similar coefficients apply to cleaning (+5–6%) and inspection (+4–5%).
We recommend treating these elasticities as change-control multipliers. When a contractor requests a unit-price change, agencies can compute the implied TPC delta and compare it with (i) documented scope expansion, (ii) objective productivity evidence (rework rate, relabeling rate, defect density), and (iii) benchmark unit prices from comparable cohorts. This shifts negotiations from ad hoc bargaining to evidence-based governance.
A practical workflow is as follows: Step (1) submit a standardized change request describing the cause; Step (2) compute the predicted TPC before and after the change using the trained model; Step (3) inspect the local SHAP explanation to confirm that the change affects the expected drivers; and Step (4) approve, reject, or request remediation based on the modeled delta and supporting evidence.
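The comparison in Step (2) reduces to first-order arithmetic when expressed through the sensitivity elasticities; the elasticity value of 0.8 below is illustrative:

```python
# Sketch of the change-control multiplier: the implied TPC delta for a
# requested unit-price change is first-order arithmetic. The elasticity
# value of 0.8 is illustrative.
def implied_tpc_delta_pct(elasticity, unit_price_change_pct):
    """First-order impact: dTPC% ~= elasticity * d(unit price)%, scale held fixed."""
    return elasticity * unit_price_change_pct

# A +10% labeling unit-price request with elasticity ~0.8 implies a
# roughly +8% TPC increase, consistent with the +7-9% range reported.
delta = implied_tpc_delta_pct(elasticity=0.8, unit_price_change_pct=10.0)
```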
5.3. Operational and Governance Levers Beyond Unit Prices
Cloud execution and expert participation appear as secondary but meaningful drivers. Cloud use increases predicted TPC modestly on average (+2% in our sensitivity setting), reflecting storage, monitoring, and security costs. At the same time, the cloud can shorten schedules and improve reproducibility, so the higher cost may correspond to deliberate investment in delivery reliability and auditability.
Expert participation increases predicted TPC slightly (+1% on average) but is often necessary for specialized domains. Expert review can reduce downstream error costs, improve ontology coherence, and mitigate bias. The cost model should be used to budget realistically for expertise and to make tradeoffs explicit, not to eliminate expert involvement.
Anonymization and privacy controls are critical governance components. Their effects may appear smaller in aggregate because privacy-intensive projects are fewer, but costs can be concentrated in sensitive domains. For high-sensitivity datasets, agencies may benefit from additional predictors reflecting privacy classification, de-identification complexity, and security certification requirements.
5.4. Comparison to Traditional Cost Estimation Approaches
Traditional estimation for data projects relies on parametric models, expert judgment, and unit-cost heuristics. Such approaches struggle with heterogeneity across modalities, quality targets, and workflow designs. In our setting, machine-learning estimation offers three advantages: it learns nonlinear interactions (e.g., scale × difficulty × QA share), leverages portfolio-level information to benchmark unit prices and process shares, and supports scenario analysis through explainability methods.
Comparison to practice-based non-ML baselines. To contextualize the proposed approach, we also compared it against non-ML baselines commonly used in practice (unit-cost heuristics, analogous estimation, and a parametric production function). These approaches are easy to audit but tend to underfit heterogeneous AI data-construction projects because they cannot capture nonlinearities and interactions; in contrast, the explainable ML models retain transparency through factor abstraction and SHAP-based attribution while improving accuracy.
Machine-learning estimation should not replace expert review. Instead, it can serve as a decision-support layer: when the model prediction differs substantially from an initial proposal, the discrepancy can trigger a structured review, checking whether unit prices are out of distribution, difficulty claims match historical analogs, and quality targets imply unusually high QA intensity.
Factor analysis strengthens communication and stability by organizing correlated variables into interpretable latent factors that can be labeled and monitored across cohorts. In contrast, PCA components can be predictive but are harder to explain in policy and procurement terms.
5.5. Deployment Considerations for Public Budgeting
For deployment, agencies should define standardized input templates aligned with reporting rules, a governance process for model updates, and audit trails for predictions and change requests. Because unit prices and tooling evolve, an annual model refresh using the latest cohort is recommended, while maintaining a frozen baseline model for longitudinal comparisons.
Model outputs should include uncertainty bounds (e.g., an 80–90% prediction interval) to support risk-based budgeting, contingency planning, and the detection of out-of-distribution proposals.
In procurement practice, treat the median (q50) as the reference estimate, allocate contingency using the upper tail (e.g., q90–q50) for projects with risk markers (high difficulty, high QA shares, strong anonymization needs, or out-of-range unit prices), and escalate proposals with unusually wide intervals for manual review or staged contracting. Track forecast errors by modality and domain to periodically recalibrate this guidance.
Operational checklist:
Report a point estimate plus an 80–90% prediction interval for funding approval.
Use wide intervals as a trigger for deeper requirement review or staged contracting; monitor errors by modality/domain to update price books and QA policies.
Lightweight approaches to produce prediction intervals for budgeting are summarized in
Table 9.
5.6. Limitations and Future Work
This study has limitations. Administrative reports are heterogeneous in cost itemization and may under-report overheads. A missingness audit confirmed that observed blanks correspond to unreported/unknown values; we therefore treated missing entries as missing and used fold-wise imputation with missingness indicators, complemented by robustness checks. Nevertheless, if non-reporting is systematically correlated with project characteristics (e.g., bundling practices), the resulting estimates may still reflect reporting conventions as well as underlying effort. External validity in private-sector projects or other countries requires additional validation.
Temporal scope is another limitation. Our dataset spans 2020–2022 public projects, which largely preceded the rapid diffusion of LLM-assisted annotation pipelines. Accordingly, the observed unit prices and process shares primarily reflect human-centered workflows, and modern projects that substitute part of the labeling/QA with automated or LLM-assisted steps may exhibit different marginal cost structures (e.g., lower labeling unit prices but potentially higher tooling, engineering, and audit costs). We therefore treat the model as a cohort-specific estimator and recommend periodic recalibration using the latest cohorts that explicitly record automation adoption (auto-labeling share, model-assisted QA, and LLM usage). In procurement practice, the model can still support scenario testing by adjusting unit-price and QA-intensity inputs to reflect expected automation gains, but final budgets should be anchored to the most recent price book and monitored for drift.
Future research can extend the framework by adding modality-specific sub-models, incorporating schedule and risk variables (delivery time, defect rates, rework), and integrating automation adoption indicators (auto-labeling share, model-assisted QA) to capture productivity shifts. Probabilistic models that directly output prediction intervals are also promising for high-stakes public budgeting.
5.7. Reporting Template and Procurement Translation
A key barrier to transportable cost estimation is inconsistent itemization across contractors and domains. Based on the harmonized schema, we propose a minimal reporting template that preserves flexibility while making core economic primitives explicit. Specifically, future cohorts should be required to report: (i) scale in at least one modality-appropriate unit (records and either GB or minutes); (ii) process-level unit prices for labeling, cleaning, and QA/inspection as separable line items; (iii) process intensity parameters (e.g., sampling plan or inspection share) that operationalize acceptance criteria; and (iv) binary flags plus optional cost lines for governance options such as anonymization, expert review, and cloud execution. These items enable reviewers to distinguish “not applicable” from “not reported,” which is essential for both auditability and model validity.
For practitioners, the factor structure provides a compact interface for negotiation. Rather than debating dozens of raw indicators, agencies can translate proposals into a small set of levers—scale intensity, unit-price system, quality-control intensity, workforce mix, and governance options—and compare them against historical distributions. This supports consistent procurement decisions across domains while leaving room for justified exceptions (e.g., rare-domain expert review).
5.8. Threats to Validity and Replication Guidance
Public budgeting and procurement decisions require not only accurate point forecasts but also a defensible account of where errors can arise and how results can be reproduced. Accordingly, we summarize key threats to validity and provide replication guidance aligned with audit and governance needs.
Key threats include: (i) measurement and reporting noise in the source reports (mixed units/currency, embedded text, inconsistent coding); (ii) missingness and imputation choices, where unreported values may reflect bundling/reporting conventions rather than true absence; (iii) external validity limits under portfolio, tooling, or market shifts; and (iv) model risk for atypical proposals. Mitigation should therefore combine transparent preprocessing logs, missingness-indicator features, sensitivity checks across imputation strategies, interval reporting, and periodic recalibration.
Table 10 summarizes the main threats to validity and the corresponding replication guidance for audit-ready application.
6. Conclusions
Across three input tracks (Baseline, PCA-enhanced, and FA-enhanced) and four regression models evaluated under nested cross-validation, XGBoost achieves the best pooled out-of-fold performance in each track (Baseline: R2 = 0.863, RMSE = 1104.2, MAE = 731.4, MAPE = 0.374; PCA-enhanced: R2 = 0.868, RMSE = 1084.9, MAE = 746.9, MAPE = 0.358; FA-enhanced: R2 = 0.849, RMSE = 1160.6, MAE = 754.9, MAPE = 0.364). Monte Carlo-based risk estimates further support uncertainty-aware budgeting by reporting upper-tail cost percentiles and overrun probabilities.
Global and local explanations using SHAP confirm that project scale and process-level unit prices are the dominant cost drivers, whereas task difficulty, QA intensity, cloud use, and expert participation contribute secondary but policy-relevant effects. Sensitivity simulations translate these relationships into practical budgeting levers that can support contract line-item design, negotiation, and change control.
From a policy perspective, the results support factor-informed, explainable machine learning as a decision-support tool for budgeting and governance in national AI data programs. Key limitations relate to the program-specific cohort (2020–2022) and the constraints of standardized reporting; we discuss these limitations and avenues for extension (e.g., richer governance variables and uncertainty-aware updates) in
Section 6.2.
6.1. Practical Utilization Framework
To support budget planning and evidence-based program management, the proposed model can be embedded in a lightweight utilization framework aligned with public-program lifecycles:
Ex Ante Appraisal: Derive an initial cost band and key drivers and document assumptions for unit prices and quality tiers.
Procurement Calibration: Set contract line items and QC/cleaning/anonymization requirements using scenario elasticities rather than flat percent adders.
In-Flight Control: Refresh forecasts at milestones using updated scale and unit-price signals, monitor drift, and flag overruns when drivers move outside historical ranges.
Ex Post Benchmarking: Store realized drivers and outcomes to update priors, improve standard cost tables, and support iterative policy revisions.
6.2. Future Research Directions
First, the dataset spans 2020–2022 within a specific national program; replication in newer cohorts and other jurisdictions is needed to assess its transportability as technology, labor markets, and governance norms evolve.
Second, although 35 variables were harmonized, the analysis relies mainly on quantitative fields available in final reports. Richer qualitative drivers (e.g., governance maturity, contractor capability, coordination frictions) were not systematically encoded and could improve explanatory power if operationalized.
Third, we focused on regression models suited to tabular program data. Future research could compare against hybrid architectures (e.g., tabular transformers) and strengthen robustness with audited micro-logs of task time, rework, and defect rates, as well as uncertainty-aware prediction intervals.
6.3. Reproducibility and Transparency
To facilitate practical uptake, we recommend maintaining a lightweight reproducibility package that records variable definitions, preprocessing rules, and model settings used for each budgeting cycle. Even when datasets cannot be fully released due to confidentiality, agencies can publish aggregated metadata and evaluation protocols to enable scrutiny of cost drivers and forecasting reliability.
Recommended minimum transparency checklist:
Define the outcome (TPC) and units consistently (e.g., million KRW) and document any transforms.
Publish the 35-item schema and the derived 24 numeric predictors, including how categorical fields are encoded.
Record missing-value rules (e.g., structural zero vs. unknown) and standardization parameters.
Provide the validation protocol (nested CV splits, metrics) and report the final hyperparameters for the selected model in Supplementary Table S5.
Archive SHAP summaries and scenario elasticities used for procurement and change-control decisions.
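The checklist above can be operationalized as a machine-readable metadata record written once per budgeting cycle. The sketch below is illustrative only: the field names and values are hypothetical placeholders for the items the checklist asks agencies to document, not the study's actual schema.

```python
import json

# Minimal reproducibility record for one budgeting cycle.
# All field values here are illustrative placeholders.
record = {
    "outcome": {"name": "TPC", "unit": "million KRW",
                "transform": "documented per cycle"},
    "predictors": {"schema_items": 35, "numeric_predictors": 24,
                   "categorical_encoding": "documented per field"},
    "missing_values": {"structural_zero": "keep as 0",
                       "unknown": "rule documented per field"},
    "standardization": {"parameters": "recorded from training folds"},
    "validation": {"protocol": "nested CV",
                   "metrics": ["MAE", "RMSE", "R2"],
                   "final_hyperparameters": "see Supplementary Table S5"},
    "artifacts": ["SHAP summaries", "scenario elasticities"],
}

# Archive alongside the cycle's forecasts and rationale log.
with open("budget_cycle_metadata.json", "w") as f:
    json.dump(record, f, indent=2)
```

Publishing such a record, even when the underlying microdata stay confidential, gives external reviewers enough structure to scrutinize cost drivers and forecasting reliability.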
Table 11 presents the audit-ready implementation checklist aligned with public AI data-program lifecycles.
Table 11. Implementation checklist aligned with public AI data-program lifecycles.
| Stage | Key Inputs | Manager Actions | Outputs/Checks |
|---|---|---|---|
| Ex-ante appraisal | Initial scope (records/GB/min) + unit-price priors | Run forecast; document assumptions | Cost band + top drivers; rationale log |
| Procurement | BoQ + acceptance criteria | Calibrate price book; specify unit prices and QC rules | Contract items consistent with drivers |
| Execution | Updated scale + process logs | Refresh forecast at milestones; monitor drift | Overrun flags; change-control triggers |
| Acceptance | QC outcomes + defect/rework | Verify deliverables; reconcile deviations | Final cost vs. plan; quality report |
Table 12. Change-control guidance using sensitivity elasticities.
| Decision Point | Recommendation | Operational Indicator | Decision Rule |
|---|---|---|---|
| Unit-price revision request | Treat labeling/cleaning/inspection unit prices as controlled parameters; compute impact using elasticities before approval. | Proposed Δ (unit price); evidence of new price book/vendor quote; change in task specification. | Estimate impact using: ΔTPC ≈ elasticity × Δ (unit price) (holding scale constant); approve only with documented justification. |
| Scope growth claim | Separate true scope growth from rework and process inefficiency. | Δrecords/GB/min vs. baseline; change logs; rework rate and defect logs. | Reject unit-price change if the driver is scope; revise scale inputs instead; if rework-driven, apply corrective actions and QA controls. |
| Quality escalation | Justify higher QC tiers with measurable target gains (F1/mAP) and expected rework reduction. | Observed vs. target quality; inspection findings; defect density and rework rate. | Allow unit-price uplift only with explicit quality targets, acceptance criteria, and ex post verification. |
| Tooling change (automation/LLM assistance) | Update priors and refresh the model with the latest cohort to reflect productivity shifts. | Adoption of automation; cycle time change; before/after productivity metrics. | Re-baseline unit prices and elasticities after tooling adoption; document the new operating regime. |
| Cloud/expert add-on | Require ex ante rationale and ex post evaluation of benefits. | Cloud usage/expert participation flags; cost deltas; expected risk reduction or quality improvements. | Treat as add-on items with clear deliverables; audit realized benefits and adjust future priors. |