Article

Satellite Image Price Prediction Based on Machine Learning

1 Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China
2 School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(12), 1960; https://doi.org/10.3390/rs17121960
Submission received: 11 April 2025 / Revised: 1 June 2025 / Accepted: 5 June 2025 / Published: 6 June 2025

Abstract

This study develops a comprehensive, data-driven framework for predicting satellite imagery prices using four state-of-the-art ensemble learning algorithms: XGBoost, LightGBM, AdaBoost, and CatBoost. Two distinct datasets—optical and Synthetic Aperture Radar (SAR) imagery—were assembled, each characterized by nine technical and economic features (e.g., imaging mode, spatial resolution, satellite manufacturing cost, and acquisition timeliness). Bayesian optimization is employed to systematically tune hyperparameters, thereby minimizing overfitting and maximizing generalization. Models are evaluated on held-out test sets (20% of the data) using Pearson’s correlation coefficient (R), mean bias error (MBE), root mean square error (RMSE), unbiased RMSE (ubRMSE), Nash–Sutcliffe Efficiency (NSE), and Kling–Gupta Efficiency (KGE). For optical imagery, the Bayesian-optimized XGBoost model achieves the best performance (R = 0.9870, RMSE = $3.44/km², NSE = 0.9651, KGE = 0.8950), followed closely by CatBoost (R = 0.9826, RMSE = $3.83/km²). For SAR imagery, CatBoost outperforms all others after optimization (R = 0.9278, RMSE = $9.94/km², NSE = 0.8575, KGE = 0.8443), reflecting its robustness to heavy-tailed price distributions. AdaBoost also demonstrates competitive accuracy, while LightGBM and XGBoost exhibit larger errors in high-value regimes. SHapley Additive exPlanations (SHAP) analysis reveals that imaging mode and spatial resolution are the primary drivers of price variance across both domains, followed by satellite manufacturing cost and acquisition recency. These insights demonstrate how ensemble models capture nonlinear, high-dimensional interactions that traditional rule-based pricing schemes overlook. Compared to static, experience-driven price brackets, our machine learning approach provides a scalable, transparent, and economically rational pricing engine—adaptable to rapidly changing market conditions and capable of supporting fine-grained, application-specific pricing strategies.

1. Introduction

Earth observation (EO) plays a crucial role in addressing some of the world’s most critical challenges, including resource and land management [1], disaster monitoring [2,3,4,5], health risk assessments, biodiversity conservation, ecosystem services [6,7], and air quality management [8,9,10]. Over recent decades, the rapid expansion of satellite technology has significantly increased the availability of satellite imagery, bringing immense benefits to society [11]. These benefits include saving lives, enhancing environmental quality, and improving regulatory compliance and operational efficiency [12,13]. As the number of satellites launched since the 1970s has surged, the volume of satellite imagery has grown accordingly, contributing to advancements across many sectors. At the same time, the globalization, commercialization, and industrialization of high-resolution Earth observation satellite markets have accelerated [14]. Commercial satellite imagery has become a key source of revenue for many businesses [15]. Accurately assessing the value of these satellite imagery products is a critical strategic decision for companies seeking to ensure profitability. Currently, most Earth observation satellite companies have yet to establish a systematic, data-driven pricing mechanism, relying instead primarily on the experience and subjective judgment of decision-makers. This traditional pricing approach not only fails to quantify complex market demand accurately but also produces large price fluctuations, making it difficult to fully reflect the true value of imaging products. As the market grows rapidly and customer demands become increasingly diverse, relying solely on experience-based pricing is no longer sufficient to cope with today’s highly competitive market environment.
In response to these limitations, researchers have begun to explore more systematic and data-driven approaches to pricing satellite imagery. In the recent literature, Laxminarayan and Macauley [16] stated that the contingent-valuation method, which elicits consumers’ willingness to pay (WTP), can be the most important method for assessing the benefits derived from Earth observation data. The method has been cited in the literature on diverse subjects [17], for example, urban planning [18], energy sources [19], and health care [20]. It was first applied to satellite imagery in 2015 to evaluate Landsat satellite images [21]. In a 2020 study, Jabbour et al. applied the method to high-resolution (HR) satellite images, valuing them by surveying consumers’ WTP [22]. Subsequently, Luo et al. applied the concept of neutrosophic fuzzy variables to the pricing problem of satellite imagery data products, constructing pricing models under two different market structures: Bertrand (parallel competition) and Stackelberg (leader–follower competition) [23]. Their results indicated that incorporating neutrosophic fuzzy theory effectively captured market uncertainty and improved the robustness of pricing strategies in competitive satellite imagery markets.
In recent years, machine learning has emerged as a powerful predictive tool across various domains due to its effectiveness in modeling complex nonlinear relationships among market features, demonstrating significant potential in price forecasting tasks. Machine learning algorithms such as Support Vector Machines (SVM), Random Forest (RF), and Extreme Gradient Boosting (XGBoost) have been widely applied in diverse pricing scenarios, including real estate valuation [24], gold price forecasting [25], energy market prediction [26,27], and financial stock price prediction [28,29,30]. Compared with traditional statistical models, machine learning approaches exhibit superior data adaptability and generalization capabilities, enabling effective handling of high-dimensional, heterogeneous, and nonlinear data characteristics. Consequently, they offer improved prediction accuracy while simultaneously reducing the risk of model overfitting. For example, XGBoost, renowned for its high predictive accuracy [31,32], leverages iterative error correction and regularization techniques to effectively capture intricate interactions within data, making it particularly suitable for dynamic market environments. Therefore, introducing machine learning methods into satellite imagery pricing has great potential to significantly enhance prediction accuracy, overcoming inherent limitations associated with single-model strategies or conventional experience-based pricing methods.
The primary objective of this study is to predict satellite imagery prices separately for optical and Synthetic Aperture Radar (SAR) products using four ensemble machine learning models: XGBoost, LightGBM, AdaBoost, and CatBoost. Two distinct datasets—one for optical imagery and one for SAR imagery—were assembled, each described by domain-specific technical and economic features such as imaging mode, spatial resolution, satellite manufacturing cost, acquisition timeliness, ground-range resolution, polarization type, and operating bands. Bayesian optimization was employed to systematically tune hyperparameters and enhance each model’s predictive performance. Models were evaluated on held-out test sets using a suite of statistical metrics (R, MBE, RMSE, ubRMSE, NSE, and KGE), while SHapley Additive exPlanations (SHAP) analysis quantified feature importance across both imagery types. By developing tailored, data-driven pricing models for optical and SAR imagery, this research delivers a robust framework that accommodates heavy-tailed price distributions and nuanced market dynamics, thereby overcoming the limitations of traditional rule-based pricing methods.

2. Materials and Methods

2.1. Datasets

2.1.1. Satellite Imagery Pricing Data

This study leverages two distinct datasets for model development and evaluation. The optical imagery dataset comprises 215 individual price observations, with values ranging from approximately $3/km² for low-resolution archival products to over $50/km² for sub-meter, tri-stereo offerings. Categorical variables are well represented: roughly 40% of samples are single-view, 35% are stereo-view, and 25% are tri-stereo. In terms of acquisition recency, 60% of images were captured more than 90 days ago, while 40% were acquired within the last 90 days. Continuous features exhibit considerable variation—for example, the median panchromatic resolution (PAN Res) is near 1 m, yet the distribution includes a substantial number of sub-meter measurements. This dataset covers a broad cross section of satellite platforms (e.g., WorldView, GeoEye, Sentinel-2), geographic regions (North America, Europe, Asia), and application contexts (agriculture, urban mapping, environmental monitoring).
The SAR imagery dataset contains 68 price entries, spanning roughly $10/km² for coarse-resolution archival scenes to over $200/km² for high-resolution, quad-polarized spotlight acquisitions. Polarization types are diverse: approximately 30% single-polar (HH or VV), 40% dual-polar (e.g., HH/HV), and 30% quad-polar configurations. Operating-band combinations are also varied, with about 50% of scenes in X-band, 30% in C-band, and 20% featuring mixed L-/C-band data. Continuous variables such as ground-range resolution (GRR) and incidence angle likewise display broad dispersion—GRR’s median value is around 3 m, although the dataset includes numerous sub-meter samples, and incidence angles range from approximately 15° to 55°. Sample diversity extends across multiple platforms (e.g., TerraSAR-X, Sentinel-1, RADARSAT-2), polarization modes, and acquisition geometries, covering applications from flood monitoring to infrastructure inspection. Both datasets are publicly available in the Satellite Image Price Data repository on GitHub (https://github.com/ShanLinn/Satellite-Image-Price-Data, accessed on 4 March 2025).

2.1.2. Data Preprocessing and Feature Extraction

The raw satellite imagery transaction data were thoroughly cleaned, with outliers removed, and underwent rigorous preprocessing to ensure the accuracy, reliability, and robustness of subsequent analysis and modeling. Given the inclusion of both optical and Synthetic Aperture Radar (SAR) imagery, feature extraction was performed separately for the two imagery types, with comprehensive attributes detailed in Table 1 and Table 2, respectively. For optical imagery, extracted features included Image Acquisition Completion Time, Year of Satellite Launch, Satellite Manufacturing Cost (SMC), Imaging Mode, Panchromatic and Multispectral Image Resolutions, Number of Spectral Bands, Minimum Order Area (MOA), and Area Price. Similarly, for SAR imagery, the critical features extracted comprised Year of Satellite Launch, SMC, Imaging Mode, Ground Range Resolution (GRR), Polarization Type, SAR Operating Bands, Incidence Angle, MOA, and Image Type.
Categorical features within the dataset required careful encoding to accurately reflect their intrinsic characteristics and ensure effective integration into machine learning models. One-hot encoding was employed for categorical attributes representing mutually exclusive categories (e.g., Imaging Mode for optical imagery, indicating single, stereo, or tri-stereo imagery, and Image Acquisition Completion Time, indicating recent versus older acquisitions). One-hot encoding creates binary indicator variables for each category, preventing the introduction of artificial ordinal relationships that could mislead the modeling process. For categorical attributes where multiple categories can apply simultaneously to a single data instance (e.g., Polarization Type and SAR Operating Bands in SAR imagery), multi-hot encoding was implemented. This method generates multiple binary indicator variables that represent all applicable categories without imposing restrictive hierarchical or ordinal assumptions.
The deliberate selection of these encoding techniques was driven by methodological considerations. Proper categorical encoding is crucial because it enables machine learning models to interpret categorical information accurately without artificially imposing relationships or hierarchies that do not exist in the data. One-hot encoding clearly distinguishes between mutually exclusive categories, avoiding misleading numerical interpretations, while multi-hot encoding captures complex categorical data where multiple attributes apply simultaneously, thereby preserving the dataset’s intrinsic relationships and reducing information loss. A minimal sketch of both encodings is shown below.
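The following sketch illustrates the two encoding schemes with pandas; the column names and category labels are illustrative assumptions, not the study’s actual schema.

```python
# A minimal sketch of the one-hot and multi-hot encodings described above;
# column and category names are illustrative, not the study's actual schema.
import pandas as pd

df = pd.DataFrame({
    "imaging_mode": ["single", "stereo", "tri-stereo"],   # mutually exclusive
    "polarization": ["HH", "HH/HV", "HH/HV/VH/VV"],       # several may apply
})

# One-hot encoding: one binary indicator per mutually exclusive category.
one_hot = pd.get_dummies(df["imaging_mode"], prefix="mode")

# Multi-hot encoding: split the compound string, then one binary column per
# polarization channel; a single row may set several indicators at once.
multi_hot = df["polarization"].str.get_dummies(sep="/")

encoded = pd.concat([one_hot, multi_hot], axis=1)
print(encoded)
```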
Additionally, continuous numerical features were normalized using Min–Max normalization (Equation (1)), scaling feature values to a uniform range [0, 1]. This step ensures consistent scaling across diverse numerical features, enhancing the stability and predictive performance of the machine learning models.
$$x' = \frac{x - \min(x)}{\max(x) - \min(x)} \qquad (1)$$
where $x$ is the original feature value, $x'$ is the normalized value, and $\min(x)$ and $\max(x)$ denote the feature’s minimum and maximum values, respectively.
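In practice, Equation (1) can be applied column-wise; the sketch below uses scikit-learn, fitting the scaler on training data only so that test statistics do not leak into the transform. The feature values are illustrative placeholders.

```python
# Min-max normalization (Equation (1)) applied column-wise with scikit-learn;
# the scaler is fit on the training split only to avoid information leakage.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[0.5, 100.0], [1.0, 250.0], [5.0, 25.0]])  # e.g., PAN Res, MOA
X_test = np.array([[0.3, 500.0]])

scaler = MinMaxScaler(feature_range=(0, 1)).fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # may fall outside [0, 1] if out of range
```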

2.2. Machine Learning Algorithms

In this study, several advanced ensemble machine learning models, including XGBoost, CatBoost, LightGBM, and AdaBoost, were employed to estimate satellite imagery prices. For optical imagery, features such as Year of Satellite Launch, Satellite Manufacturing Cost (SMC), Imaging Mode, Panchromatic and Multispectral Image Resolutions (PAN Res, MS Res), among others, were extracted. For Synthetic Aperture Radar (SAR) imagery, features including Year of Satellite Launch, SMC, Imaging Mode, Ground Range Resolution (GRR), Polarization Type, and SAR Operating Bands, among others, were selected. These respective feature sets were used to perform separate price estimations for optical and SAR imagery.

2.2.1. Extreme Gradient Boosting (XGBoost)

The XGBoost algorithm, introduced by Chen and Guestrin [33], is widely recognized as an important method for supervised learning due to its remarkable performance [33,34,35]. Built on the boosting concept [33], it combines several simple models—each only marginally better than random guessing—into an ensemble that delivers highly accurate predictions [36]. XGBoost employs a tree ensemble based on the classification and regression trees (CART) framework. While individual CARTs may have limited accuracy [33], ensemble methods address this shortfall by training multiple CARTs and aggregating their predictions, typically by summing the individual scores [33]. Assuming that the model has $K$ decision trees, the integrated model can be expressed as follows:
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}$$
where $\hat{y}_i$ denotes the estimated price, $f_k$ is the $k$-th tree, $K$ is the number of trees, $x_i$ is the input feature vector, and $\mathcal{F}$ represents the space of all possible CARTs [33]. The objective function of the XGBoost model combines a loss function with a regularization term [33]; these components control the model’s accuracy and complexity, respectively, and can be written as follows:
$$\mathrm{Obj} = \sum_{i=1}^{n} L(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$
$$L(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2$$
$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} \omega_j^2$$
where $y_i$ denotes the true value and $\hat{y}_i$ the predicted value. $\Omega$ is the regularization term, which penalizes model complexity and mitigates the risk of overfitting the training data [33]. $L(y_i, \hat{y}_i)$ is the loss function measuring the difference between $y_i$ and $\hat{y}_i$, and $T$ denotes the total number of leaves in the decision tree [33]. Moreover, $\gamma$ signifies the complexity cost attributed to each individual leaf, $\lambda$ is a trade-off parameter primarily used to adjust the magnitude of the penalty [33], and $\omega_j$ denotes the score associated with the $j$-th leaf.
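To make this concrete, the following sketch fits an XGBoost regressor with the $\gamma$ and $\lambda$ penalties exposed as constructor arguments; the synthetic data and hyperparameter values are illustrative placeholders, not the tuned settings reported in Table 3.

```python
# A hedged sketch of XGBoost regression on price-like data; all values here
# are placeholders, not the Bayesian-optimized settings from Table 3.
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X, y = rng.random((200, 9)), rng.random(200) * 50.0  # 9 features, price target

model = XGBRegressor(
    n_estimators=300,
    max_depth=4,
    learning_rate=0.05,
    reg_lambda=1.0,      # the lambda penalty on leaf weights in Omega(f)
    gamma=0.1,           # the gamma penalty per leaf in Omega(f)
    objective="reg:squarederror",
)
model.fit(X, y)
pred = model.predict(X[:5])
```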

2.2.2. Light Gradient Boosting Machine (LightGBM)

Based on decision trees, LightGBM is a distributed framework designed for high-performance gradient boosting [37]. It integrates two efficient techniques—exclusive feature bundling (EFB) and gradient-based one-side sampling (GOSS)—to optimize data sampling and classification tasks [37,38]. Moreover, to reduce the risks inherent in gradient boosting decision tree (GBDT) models while maintaining high accuracy at minimal computational cost, LightGBM uses a histogram algorithm to lower memory usage, coupled with a strategy to limit the depth of leaf-wise growth [39]. To further prevent overfitting caused by excessively deep trees, it enforces a maximum depth constraint, ensuring both efficiency and robustness [37,38]. The objective function $\mathrm{OF}_{\mathrm{LGBM}}$ is mathematically expressed as follows [37,38]:
$$\mathrm{OF}_{\mathrm{LGBM}} = \sum_{i=1}^{N} l(y_i, \hat{y}_i) + \sum_{j=1}^{t} \Omega(f_j)$$
where $y_i$ denotes the true value, $\hat{y}_i$ represents the predicted value, and $l(y_i, \hat{y}_i)$ measures the difference between the model’s prediction and the true value. $\Omega(\cdot)$ denotes the regularization term, which penalizes model complexity and helps prevent overfitting [37,38]. When adding a new tree $f_t$ at iteration $t$, the objective function becomes:
$$\mathrm{OF}_{\mathrm{LGBM}} = \sum_{i=1}^{N} l\left(y_i,\, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + \sum_{j=1}^{t-1} \Omega(f_j)$$
where $\hat{y}_i^{(t-1)}$ is the prediction from the previous $t-1$ trees. Applying a second-order Taylor expansion of the loss and incorporating regularization on the tree parameters, the objective is approximated as:
$$\mathrm{OF}_{\mathrm{LGBM}} \approx \sum_{i=1}^{N}\left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)\right] + \gamma T + \frac{1}{2}\lambda\sum_{j=1}^{T}\omega_j^2$$
where $g_i$ and $h_i$ are the first and second derivatives of the loss, $T$ is the number of leaf nodes, $\omega_j$ is the weight of leaf $j$, and $\gamma$ and $\lambda$ control the complexity of the trees. By iteratively updating this objective, LightGBM achieves both efficient training and strong predictive performance.
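An analogous sketch for LightGBM appears below; `num_leaves` and `max_depth` correspond to the leaf-wise growth limits discussed above, and all values are illustrative placeholders rather than the tuned settings in Table 3.

```python
# A hedged sketch of LightGBM regression; num_leaves and max_depth constrain
# the leaf-wise growth discussed above. Values are illustrative placeholders.
import numpy as np
from lightgbm import LGBMRegressor

rng = np.random.default_rng(1)
X, y = rng.random((200, 9)), rng.random(200) * 50.0

model = LGBMRegressor(
    n_estimators=300,
    num_leaves=31,       # cap on leaves grown leaf-wise
    max_depth=6,         # explicit depth limit to curb overfitting
    learning_rate=0.05,
    reg_lambda=1.0,
)
model.fit(X, y)
```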

2.2.3. Adaptive Boosting (AdaBoost)

AdaBoost, an ensemble learning method introduced by Freund and Schapire [40], reduces bias by combining multiple machine learning models into a single predictive framework. By adjusting the sample weights, the algorithm enhances the performance of weak learners [40]. Each iteration focuses on incorporating the most effective weak learner while discarding the less effective ones, thereby constructing a strong composite model [41]. In this process, samples with higher errors are given increased weight, ensuring that subsequent learners concentrate on correcting these mistakes [40]. The loss function is then defined based on the performance of the newly constructed learner $h_t(x_i)$ on the weighted dataset [40]:
$$\epsilon_t = \sum_{i=1}^{N} w_{t,i}\, I\!\left(h_t(x_i) \neq y_i\right)$$
where $\epsilon_t$ measures the weighted error rate of the weak learner $h_t$ at iteration $t$, and $w_{t,i}$ is the weight assigned to the $i$-th sample at iteration $t$. The indicator function $I(h_t(x_i) \neq y_i)$ equals 1 if the prediction $h_t(x_i)$ differs from the true label $y_i$ and 0 otherwise. Furthermore, the learner weights are calculated as follows [40]:
$$\alpha_t = \frac{1}{2}\log\frac{1 - \epsilon_t}{\epsilon_t}$$
where $\alpha_t$ is the weight, or contribution, of the weak learner $h_t$ in the final strong ensemble. The formula shows that if $\epsilon_t$ is low (the weak learner is highly accurate), $\alpha_t$ becomes larger, giving more influence to that learner [40]. Conversely, if $\epsilon_t$ is high, $\alpha_t$ becomes smaller (or even negative if $\epsilon_t > 0.5$), reducing the influence of that weak learner in the final ensemble [40]. Thus $\alpha_t$ ensures that better-performing weak learners have a stronger say in the ensemble’s predictions, while weaker learners are down-weighted [40]. AdaBoost updates each sample’s weight after a new weak learner is added. The weights are updated as follows [40]:
$$w_{t+1,i} = \frac{w_{t,i}\,\exp\!\left(-\alpha_t\, y_i\, h_t(x_i)\right)}{\sum_{j=1}^{N} w_{t,j}\,\exp\!\left(-\alpha_t\, y_j\, h_t(x_j)\right)}$$
The AdaBoost algorithm’s general structure can be expressed as follows [40]:
$$F_m(x) = F_{m-1}(x) + \arg\min_{h}\sum_{i=1}^{n} L\!\left(y_i,\, F_{m-1}(x_i) + h(x_i)\right)$$
where $F_m(x)$ represents the updated model at the $m$-th iteration. Starting from the previous model $F_{m-1}(x)$, it seeks an additional function $h$ that minimizes the sum of losses $L$ over all training samples [40]. Specifically, the $\arg\min$ term identifies the function $h$ that best reduces the current residual error [40]. Adding this function to the previous model yields a further refinement of the predictions.
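The sketch below shows AdaBoost with shallow trees as the weak learners $h_t$; note that for regression scikit-learn implements AdaBoost.R2, a continuous-loss analogue of the weighted-error scheme formalized above, and the data here are synthetic placeholders.

```python
# A hedged sketch of AdaBoost regression with shallow trees as weak learners;
# scikit-learn >= 1.2 uses the `estimator` keyword (formerly `base_estimator`).
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X, y = rng.random((200, 9)), rng.random(200) * 50.0

model = AdaBoostRegressor(
    estimator=DecisionTreeRegressor(max_depth=3),  # weak learner h_t
    n_estimators=100,
    learning_rate=0.5,
)
model.fit(X, y)
```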

2.2.4. Categorical Boosting (CatBoost)

Prokhorenkova et al. [42] introduced CatBoost, a gradient boosting decision tree framework. CatBoost’s primary advantage lies in its precise handling of categorical features during model training [42], which helps balance bias and variance, thereby enhancing both performance and generalization [43,44]. To prevent overfitting, it employs a modified version of traditional gradient boosting known as ordered boosting (OB) [43,44]. Under OB, the training samples are arranged according to a predefined sequence, and CatBoost subsequently builds multiple models from these samples [42]. Through an iterative, greedy procedure driven by the minimization of a specified loss function, the algorithm refines its estimates step by step [42]. Furthermore, CatBoost converts categorical features into numerical values using a robust mechanism termed the ordered target statistic, described as follows.
$$\hat{x}_{\sigma_p,k} = \frac{\sum_{j=1}^{p-1} I\!\left(x_{\sigma_j,k} = x_{\sigma_p,k}\right) y_{\sigma_j} + \beta P}{\sum_{j=1}^{p-1} I\!\left(x_{\sigma_j,k} = x_{\sigma_p,k}\right) + \beta}$$
where $\hat{x}_{\sigma_p,k}$ represents the encoded value of the $k$-th categorical feature for sample $\sigma_p$ [42]. The indicator function $I(x_{\sigma_j,k} = x_{\sigma_p,k})$ equals 1 if the feature value for sample $\sigma_j$ matches that of sample $\sigma_p$, and 0 otherwise [42]. The term $y_{\sigma_j}$ is the target value of sample $\sigma_j$, $\beta$ is a smoothing parameter, and $P$ is a prior value [42]. By summing over the preceding samples $j = 1, \ldots, p-1$, the formula computes a smoothed average of the target among samples sharing the same categorical value, incorporating both the prior and the indicator function to reduce noise and overfitting [42].
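The sketch below illustrates how CatBoost accepts raw categorical columns via `cat_features` and applies its ordered target encoding internally, so manual one-hot encoding of those columns is unnecessary; the column names and values are illustrative assumptions.

```python
# A hedged sketch of CatBoost regression with a raw categorical column;
# cat_features triggers CatBoost's internal ordered target encoding.
import pandas as pd
from catboost import CatBoostRegressor

df = pd.DataFrame({
    "imaging_mode": ["spotlight", "stripmap", "scanSAR", "spotlight"],
    "grr_m": [0.5, 3.0, 10.0, 1.0],     # ground-range resolution in meters
    "price": [180.0, 45.0, 12.0, 120.0],
})

model = CatBoostRegressor(iterations=200, depth=4, verbose=False)
model.fit(df[["imaging_mode", "grr_m"]], df["price"],
          cat_features=["imaging_mode"])
```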

2.2.5. Bayesian Optimization (BO)

Originally formalized by Jonas Mockus in the 1970s, Bayesian Optimization constructs a probabilistic surrogate model—typically a Gaussian Process—of the unknown objective function (for example, the validation error as a function of hyperparameters). It then iteratively selects hyperparameters to evaluate by maximizing an acquisition function that balances exploration of uncharted hyperparameter space and exploitation of promising regions. Bayesian Optimization is increasingly employed to automatically tune the hyperparameters of various boosting algorithms, such as XGBoost, CatBoost, and other gradient boosting algorithms. One common acquisition function is the Expected Improvement (EI), defined as:
$$\alpha(x) = \mathbb{E}\left[\max\left(f(x) - f^{*},\, 0\right)\right]$$
where $f(x)$ is the surrogate model’s prediction at hyperparameter configuration $x$, and $f^{*}$ is the best performance observed so far. This function quantifies the expected gain from evaluating the model at $x$. By selecting the hyperparameters that maximize $\alpha(x)$, Bayesian Optimization efficiently directs the search toward configurations likely to yield improved performance, thus automating and accelerating the tuning process for these advanced boosting algorithms.
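As a concrete illustration, the sketch below evaluates EI in closed form for a Gaussian predictive distribution, following the maximization convention above; `mu`, `sigma`, and `f_star` are hypothetical surrogate outputs, not values from the study.

```python
# Expected Improvement under a Gaussian surrogate (maximization convention):
# EI = (mu - f*) * Phi(z) + sigma * phi(z), with z = (mu - f*) / sigma.
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_star):
    """EI = E[max(f(x) - f*, 0)] for a Gaussian predictive distribution."""
    sigma = np.maximum(sigma, 1e-12)  # guard against zero predictive variance
    z = (mu - f_star) / sigma
    return (mu - f_star) * norm.cdf(z) + sigma * norm.pdf(z)

# A more uncertain candidate with a higher predicted mean scores higher:
print(expected_improvement(mu=0.92, sigma=0.05, f_star=0.90))
print(expected_improvement(mu=0.89, sigma=0.01, f_star=0.90))
```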

2.3. Model Training and Evaluation

In this study, separate machine learning models were trained and evaluated for optical and Synthetic Aperture Radar (SAR) imagery datasets. The optical imagery model utilized nine input variables, including Image Acquisition Completion Time, Year of Satellite Launch, Satellite Manufacturing Cost (SMC), Imaging Mode, Panchromatic Image Resolution (PAN Res), Number of Panchromatic Bands (N_PAN), Number of Multispectral Bands (N_MS), Multispectral Image Resolution (MS Res), and Minimum Order Area (MOA), as detailed in Table 1. The output for this model was the optical imagery price. Similarly, the SAR imagery model incorporated nine distinct input variables, specifically Year of Satellite Launch, Satellite Manufacturing Cost (SMC), Imaging Mode, Ground Range Resolution (GRR), Polarization Type, SAR Operating Bands, Incidence Angle, Minimum Order Area (MOA), and Image Type, as summarized in Table 2. The SAR imagery price served as the output variable for this model.
The models underwent rigorous training and evaluation on the satellite imagery price datasets to ensure robustness and reliability. The complete dataset was divided into training/validation (80%) and testing (20%) subsets. The training/validation subset was further subdivided into approximately 80% training and 20% validation sets. To enhance robustness and mitigate overfitting, a 5-fold cross-validation approach was adopted. The training dataset facilitated initial model training, while the validation dataset was utilized to fine-tune the hyperparameters of the machine learning algorithms. The testing dataset, which remained completely unseen during model development, provided an unbiased assessment of model performance under realistic, unobserved conditions. A minimal sketch of this protocol is shown below.
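```python
# A sketch of the 80/20 hold-out split and 5-fold cross-validation protocol
# described above; X and y are synthetic stand-ins for the encoded feature
# matrix and price vector, and the estimator is an untuned placeholder.
import numpy as np
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from xgboost import XGBRegressor

rng = np.random.default_rng(3)
X, y = rng.random((215, 9)), rng.random(215) * 50.0

# Hold out 20% as the final, completely unseen test set.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 5-fold cross-validation on the remaining 80% guides hyperparameter tuning.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(XGBRegressor(), X_trainval, y_trainval,
                         cv=cv, scoring="neg_root_mean_squared_error")
print(-scores.mean())  # mean validation RMSE
```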
Furthermore, Bayesian Optimization was employed during model training to automatically and efficiently determine optimal hyperparameters for each of the four machine learning methods. This optimization technique leveraged a surrogate probabilistic model of the objective function alongside an acquisition function designed to balance exploration and exploitation within the hyperparameter space. Table 3 and Table 4 comprehensively summarize the optimal hyperparameter values identified for each machine learning algorithm, corresponding to optical and SAR imagery models, respectively.
The Pearson correlation coefficient (R), mean bias error (MBE), RMSE, ubRMSE, Nash–Sutcliffe Efficiency (NSE), and Kling–Gupta Efficiency (KGE) were employed to evaluate the accuracy of the satellite imagery price predictions (Price_pred) in replicating the observed market prices (Price_obs). These metrics are calculated as follows:
$$R = \frac{\sum_{i=1}^{N}\left(\theta_{\mathrm{obs},i} - \bar{\theta}_{\mathrm{obs}}\right)\left(\theta_{\mathrm{pred},i} - \bar{\theta}_{\mathrm{pred}}\right)}{\sqrt{\sum_{i=1}^{N}\left(\theta_{\mathrm{obs},i} - \bar{\theta}_{\mathrm{obs}}\right)^2}\,\sqrt{\sum_{i=1}^{N}\left(\theta_{\mathrm{pred},i} - \bar{\theta}_{\mathrm{pred}}\right)^2}}$$
$$\mathrm{MBE} = \frac{1}{N}\sum_{i=1}^{N}\left(\theta_{\mathrm{pred},i} - \theta_{\mathrm{obs},i}\right)$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\theta_{\mathrm{pred},i} - \theta_{\mathrm{obs},i}\right)^2}$$
$$\mathrm{ubRMSE} = \sqrt{\mathrm{RMSE}^2 - \mathrm{MBE}^2}$$
$$\mathrm{NSE} = 1 - \frac{\sum_{i=1}^{N}\left(\theta_{\mathrm{pred},i} - \theta_{\mathrm{obs},i}\right)^2}{\sum_{i=1}^{N}\left(\theta_{\mathrm{obs},i} - \bar{\theta}_{\mathrm{obs}}\right)^2}$$
$$\mathrm{KGE} = 1 - \sqrt{(R-1)^2 + (\alpha-1)^2 + (\beta-1)^2}$$
In our study, the Pearson correlation coefficient (R) quantifies the strength of the linear relationship between the predicted satellite imagery prices ($\mathrm{Price}_{\mathrm{pred}}$) and the observed market prices ($\mathrm{Price}_{\mathrm{obs}}$) [45]. The mean bias error (MBE) indicates whether the model systematically overestimates (positive values) or underestimates (negative values) the observed prices [46]. The unbiased RMSE (ubRMSE) measures the residual error after removing any systematic bias from the predictions [47]. The Nash–Sutcliffe Efficiency (NSE) compares the squared prediction errors to the variance of the observed prices, yielding values ranging from $-\infty$ to 1, where 1 denotes perfect agreement and 0 indicates that using the mean observed price would yield equivalent performance [48,49].
Finally, the Kling–Gupta Efficiency (KGE) integrates three aspects of model performance—the correlation ($R$), the bias ratio $\beta = \mu_{\mathrm{pred}}/\mu_{\mathrm{obs}}$, and the variability ratio $\alpha = \sigma_{\mathrm{pred}}/\sigma_{\mathrm{obs}}$—with values closer to 1 signifying more accurate predictions [49].
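For reference, all six metrics can be computed directly from paired observation and prediction vectors; the sketch below is a straightforward NumPy implementation of the equations above.

```python
# A direct NumPy implementation of the six evaluation metrics defined above;
# obs and pred are arrays of observed and predicted prices.
import numpy as np

def price_metrics(obs, pred):
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    r = np.corrcoef(obs, pred)[0, 1]                      # Pearson R
    mbe = np.mean(pred - obs)                             # mean bias error
    rmse = np.sqrt(np.mean((pred - obs) ** 2))            # root mean square error
    ubrmse = np.sqrt(rmse ** 2 - mbe ** 2)                # bias-removed RMSE
    nse = 1 - np.sum((pred - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)
    alpha = pred.std() / obs.std()                        # variability ratio
    beta = pred.mean() / obs.mean()                       # bias ratio
    kge = 1 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)
    return dict(R=r, MBE=mbe, RMSE=rmse, ubRMSE=ubrmse, NSE=nse, KGE=kge)
```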

3. Results

3.1. Optical Imagery Pricing Prediction Results

3.1.1. Model Performance Evaluation

The results presented in Table 5 highlight the effectiveness of the ensemble learning algorithms in predicting optical imagery prices. Among these models, the Bayesian-optimized XGBoost (XGBoost2) consistently demonstrates superior performance across all key evaluation metrics. On the testing dataset, XGBoost2 achieves the highest correlation coefficient (R = 0.9870), the lowest root mean square error (RMSE = $3.4389), and the strongest efficiency measures (NSE = 0.9651, KGE = 0.8950). Compared to its default counterpart, the Bayesian-optimized model shows significantly improved predictive accuracy, with a 37.7% reduction in RMSE and a notably narrower gap between training and testing performance. Furthermore, optimization reduces model bias, improving the mean bias error (MBE) from −$0.9081 to −$0.8108, and supports near-ideal generalization, as evidenced by a training RMSE of only $1.3449. CatBoost also derives substantial benefits from Bayesian optimization. The optimized CatBoost2 model reduces the testing RMSE from $6.4326 to $3.8349—a 40% improvement—and substantially corrects the default model’s systematic negative bias (MBE reduced from −$1.0555 to −$0.3973). The model also demonstrates strong efficiency (NSE = 0.9566, KGE = 0.8881), closely approaching the performance of XGBoost2. Its high training R (0.9951) and minimal overfitting further reflect its ability to capture complex nonlinear pricing patterns in optical imagery data.
In contrast, AdaBoost shows moderate yet meaningful improvements post-optimization. While the default AdaBoost1 model is affected by substantial positive bias (MBE = $1.6241), optimization reduces this to $0.6273 and lowers the testing RMSE from $6.5145 to $4.7942. The optimized AdaBoost2 model achieves improved predictive reliability (NSE = 0.9322, KGE = 0.8634), although its overall performance still lags behind that of XGBoost and CatBoost. LightGBM performs comparatively poorly in this pricing task. Both the default and optimized versions exhibit pronounced negative biases (MBE ≈ −$2.35) and the highest testing RMSE values ($9.6777 and $9.5816, respectively). Even after Bayesian optimization, the performance gain is marginal (less than a 1% reduction in RMSE), indicating a possible structural limitation of LightGBM in modeling optical imagery pricing. The significant drop in NSE from training (0.9286) to testing (0.7290) underscores LightGBM’s limited generalization capability.
The comparative performance across models is further illustrated in the scatter plots shown in Figure 1, which visually corroborate the quantitative results presented in Table 5. The Bayesian-optimized XGBoost (XGBoost2) displays a highly concentrated clustering of predicted values along the 1:1 diagonal line, indicating strong agreement with actual prices and minimal prediction error across the entire range. This visual pattern aligns with its superior evaluation metrics (R = 0.9870, RMSE = $3.4389), confirming its exceptional predictive accuracy and generalization capability. In contrast, its default counterpart (XGBoost1) exhibits wider vertical dispersion, particularly in higher price intervals (>$30), with most points falling below the diagonal—evidencing the model’s tendency to underpredict due to its stronger negative bias (MBE = −$0.9081).
CatBoost exhibits a similar trend. While the default model (CatBoost1) suffers from systematic underestimation and increasing error magnitude at higher prices—resulting in a characteristic “fanning” pattern—the Bayesian-optimized version (CatBoost2) significantly mitigates these issues. The improved scatter plot shows tighter alignment with the diagonal, reduced bias (MBE = −$0.3973), and lower RMSE ($3.8349), reinforcing its robustness in capturing nonlinear pricing dynamics in optical imagery data.
The impact of optimization on AdaBoost is evident but comparatively moderate. AdaBoost1 initially exhibits substantial overestimation during training (MBE = +$1.6241), which is notably mitigated in the optimized AdaBoost2. The corresponding scatter plot for AdaBoost2 shows more balanced predictions clustered around the diagonal, though a noticeable spread remains, particularly in the mid-to-high price range. This behavior aligns with its intermediate RMSE ($4.7895) and underscores its relatively weaker capacity to capture complex pricing patterns when compared to XGBoost and CatBoost. In contrast, both the default and optimized LightGBM models generate highly scattered prediction points with substantial deviations from the diagonal, indicating weak correlation and large residuals across the full price spectrum. These patterns, coupled with persistently high RMSE values (>$9.49) and pronounced negative bias (MBE ≈ −$2.35), reveal LightGBM’s limited ability to generalize in this pricing context. The minimal gains achieved through Bayesian optimization further suggest that LightGBM faces intrinsic structural limitations that constrain its effectiveness for optical satellite image price prediction.
To assess model generalization under varying data volumes, we plotted learning curves for the XGBoost and CatBoost models, both with and without Bayesian optimization (Figure 2). These curves depict RMSE values on training and validation sets as the training size increases from 10% to 90%. For XGBoost, the Bayesian-optimized version (XGBoost2) demonstrates clear advantages over the default model (XGBoost1). XGBoost2’s validation RMSE steadily decreases as the training set expands, approaching its training RMSE with no sign of divergence, indicating strong generalization capability and effective resistance to overfitting. In contrast, XGBoost1 exhibits a persistently large validation error gap relative to its training curve, suggesting suboptimal learning dynamics and potential overfitting, especially at smaller sample sizes.
A similar trend is observed for CatBoost. The optimized CatBoost2 model exhibits rapidly declining RMSE for both training and validation sets, with the validation error converging toward the training curve. Notably, once the training set size exceeds 100 samples, CatBoost2 reaches stable and low RMSE values on both sets, reflecting consistent learning and minimal variance. The default CatBoost1, while also improving with increasing data, retains a larger and more persistent validation gap, again indicating inferior generalization.
Overall, the convergence of training and validation curves for both XGBoost2 and CatBoost2 confirms that the optimized models effectively balance fitting capacity and generalizability. Their stable learning behavior across increasing data volumes reinforces their robustness and suitability for real-world optical satellite imagery pricing applications. In summary, the comprehensive analysis of model performance metrics, scatter plots, and learning curves confirms that the Bayesian-optimized XGBoost and CatBoost models offer the most accurate and reliable predictions for optical imagery pricing. These models exhibit the lowest RMSE and highest R, NSE, and KGE scores, while also demonstrating minimal bias and strong generalization across varying training set sizes. The visual alignment of predicted versus actual prices and the convergence of training and validation errors further support their robustness. In contrast, AdaBoost shows moderate improvements after optimization but remains less effective, and LightGBM consistently underperforms across all evaluations, exhibiting high residuals and limited generalization.
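For readers wishing to reproduce this diagnostic, a minimal sketch using scikit-learn’s `learning_curve` follows; the untuned estimator and synthetic data are stand-ins for the study’s optimized models and datasets.

```python
# A sketch of the learning-curve diagnostic behind Figure 2: training sizes
# sweep 10%-90% of the available samples, mirroring the protocol above.
import numpy as np
from sklearn.model_selection import learning_curve
from xgboost import XGBRegressor

rng = np.random.default_rng(4)
X, y = rng.random((215, 9)), rng.random(215) * 50.0   # synthetic stand-ins

train_sizes, train_scores, val_scores = learning_curve(
    XGBRegressor(objective="reg:squarederror"), X, y,
    train_sizes=np.linspace(0.1, 0.9, 9),
    cv=5, scoring="neg_root_mean_squared_error",
)
train_rmse = -train_scores.mean(axis=1)  # converges toward val_rmse when the
val_rmse = -val_scores.mean(axis=1)      # model generalizes well
```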

3.1.2. SHAP Feature Importance Analysis

To interpret model predictions and identify the key drivers behind pricing decisions, SHAP summary and beeswarm plots were generated for each Bayesian-optimized model (Figure 3 and Figure 4).
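The plots in Figures 3 and 4 follow the standard TreeExplainer workflow; the sketch below reproduces its two plot types on synthetic data, with the optical feature names of Table 1 assumed for illustration.

```python
# A sketch of the SHAP workflow: TreeExplainer over a fitted tree ensemble,
# then a mean-|SHAP| bar chart and a beeswarm plot. The model and data are
# synthetic stand-ins for the tuned models and the real feature matrix.
import numpy as np
import shap
from xgboost import XGBRegressor

rng = np.random.default_rng(5)
X, y = rng.random((200, 9)), rng.random(200) * 50.0
feature_names = ["Acq_Time", "Launch_Year", "SMC", "Imaging_Mode", "PAN_Res",
                 "N_PAN", "N_MS", "MS_Res", "MOA"]  # optical features (Table 1)

model = XGBRegressor(n_estimators=100, max_depth=3).fit(X, y)

explainer = shap.TreeExplainer(model)       # exact, fast SHAP for tree models
shap_values = explainer.shap_values(X)

shap.summary_plot(shap_values, X, feature_names=feature_names, plot_type="bar")
shap.summary_plot(shap_values, X, feature_names=feature_names)  # beeswarm
```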
XGBoost2’s SHAP analysis clearly ranks Imaging Mode and PAN Resolution as the two most influential predictors. In the bar chart, both features exhibit mean |SHAP| values above 5, and the beeswarm plot shows that high-end imaging modes (e.g., stereo or tri-stereo, red points) push SHAP values strongly positive (often >+6 USD/km²), whereas simpler modes (blue points) suppress price predictions. Likewise, high spatial resolution (blue points at 0.3–0.5 m) drives large positive impacts (+5 to +7 USD/km²), whereas coarser resolutions (red points at 1–5 m) register negative or neutral SHAP values. These patterns reflect real-world economics: advanced imaging modes and sub-meter panchromatic detail command substantial price premiums for applications such as urban mapping, defense reconnaissance, and precision agriculture.
Satellite Manufacturing Cost (SMC) and Minimum Order Area (MOA) occupy the next tiers of importance, with mean |SHAP| values around 2–3. High SMC (red) corresponds to expensive, agile platforms—pushing predicted prices upward by up to +5 USD/km²—because customers pay more for rapid tasking and sophisticated onboard sensors. Conversely, low-SMC (blue) satellites contribute negative SHAP impacts (−2 to −3 USD/km²). For MOA, small purchase units (blue) yield modest positive SHAP values (+2 USD/km²), reflecting willingness to pay for granular acquisitions, whereas large minimum orders (red) introduce bulk discounts. Image Acquisition Completion Time appears with moderate importance: “recent” acquisitions (blue) confer slight positive SHAP values (around +1.5 USD/km²) to satisfy near-real-time demands, whereas “older” imagery (red) is marginally devalued.
Less influential features—Year of Satellite Launch, Multispectral Resolution (MS Res), Number of Panchromatic Bands (N_PAN), and Number of Multispectral Bands (N_MS)—carry mean |SHAP| values below 1. Older launch years (blue) reduce predicted prices by about 2 USD/km² due to depreciation, while newer satellites (red) add small positive effects. Higher MS Res (blue) modestly boosts SHAP (+1 USD/km²), but band counts contribute negligible value. Taken together, these insights reveal how different client segments and platform capabilities interact: commercial infrastructure operators emphasize PAN Resolution, while agriculture or environmental customers trade off resolution for cost. By capturing these nuanced SHAP patterns, XGBoost2 enables targeted, application-aware pricing strategies.
CatBoost2 demonstrates a similar feature hierarchy but with greater stability in SHAP dispersion. Imaging Mode and PAN Resolution remain dominant, while Acquisition Completion Time gains relative importance compared to other models. This highlights CatBoost2’s enhanced capacity to incorporate time-sensitive value differentiators—especially important in urgent imaging contracts. The SHAP gradient for Year of Satellite Launch reveals depreciation effects, with older satellites contributing negatively to predicted prices, consistent with market depreciation trends. High SMC and recent acquisition time also contribute positively for clients requiring rapid, high-quality deliveries, whereas large MOA and older satellites introduce downward adjustments. In practice, CatBoost2’s SHAP distributions suggest that high-end commercial providers (e.g., WorldView) experience stronger pricing sensitivity to PAN Resolution, whereas agile microsatellite constellations (e.g., Planet) exhibit higher SHAP responses to SMC and completion time, reflecting differing customer priorities from national defense to academic research.
AdaBoost2 places similarly strong emphasis on Imaging Mode but exhibits a broader SHAP distribution, indicating greater variability in how mode selections influence price. In AdaBoost’s sequential decision-tree framework, each iteration amplifies the impact of discrete splits—here, imaging-mode categories—resulting in extreme SHAP values (often >+7 USD/km²) for stereo and tri-stereo modes. This tendency mirrors the economic reality that certain customers, such as military reconnaissance units, are willing to pay substantial premiums for 3D imagery to support critical mission planning. However, AdaBoost2’s ability to differentiate continuous technical variables like PAN Resolution is comparatively weaker: the SHAP dot plot shows overlapping red and blue points around zero, indicating that AdaBoost’s weak learners struggle to translate subtle variations in spatial resolution into consistent price adjustments. Consequently, AdaBoost2 yields higher residual variance (reflected in its intermediate RMSE), making it better suited for mid-tier commercial tasks where imaging mode is the dominant price lever (e.g., stock-photo clients selecting between single-view and stereo imagery) but less reliable for scenarios demanding precise valuation of continuous technical attributes.
LightGBM2’s SHAP profiles reveal a contrasting feature hierarchy and a compressed impact range: PAN Resolution again emerges as the most important predictor, but its SHAP magnitudes seldom exceed +5 USD/km², and Imaging Mode follows only modestly behind, with values rarely surpassing +3 USD/km². This attenuation reflects LightGBM’s leaf-based tree structure, which—even after efficient optimization—tends to underrepresent nonlinear interactions when faced with the skewed price distributions characteristic of high-resolution products. In practical terms, while customers may accept incremental price increases for improved resolution—say, from 1 m to 0.5 m—LightGBM2 fails to fully capture the step changes that reflect regional authorities’ willingness to pay +10 USD/km² or more for sub-meter orthorectified imagery. Consequently, high-resolution (blue) points in the SHAP beeswarm remain clustered near zero or only slightly positive, rather than producing the distinct positive tail seen in XGBoost2 and CatBoost2. Similarly, Imaging Mode’s influence is diluted: although more advanced modes yield higher predicted prices, the SHAP dot plot shows that LightGBM’s response to mode variation is relatively flat, undermining its ability to model premium scenarios—such as time-critical disaster mapping—where mode selection can more than double the base price. Satellite Manufacturing Cost (SMC) shows only modest contributions, while Year of Satellite Launch and Minimum Order Area (MOA) exhibit noisy, overlapping SHAP clusters, suggesting that LightGBM2 does not consistently leverage these features to reflect market depreciation or packaging effects. In practice, LightGBM2 systematically undervalues truly high-end offerings from platforms like Maxar’s WorldView and fails to differentiate adequately between client types—treating defense-grade acquisitions similarly to routine environmental surveys. Consequently, although LightGBM2 maintains computational efficiency, its structural constraints limit its economic interpretability and reduce its suitability for pricing engines that must accommodate diverse platform capabilities and customer demands.

3.2. SAR Imagery Pricing Prediction Results

3.2.1. Model Performance Evaluation

Table 6 presents the performance metrics of the four ensemble models for SAR imagery price prediction, evaluated strictly on the testing set to ensure unbiased comparisons. Among the models, the Bayesian-optimized CatBoost4 demonstrates the best overall performance, achieving the highest correlation coefficient (R = 0.9278), the lowest RMSE (9.9384 USD), and strong Nash–Sutcliffe Efficiency (NSE = 0.8575) and Kling–Gupta Efficiency (KGE = 0.8443), indicating superior accuracy and robustness. AdaBoost4 exhibits significant improvement after optimization (RMSE = 9.9877 USD, R = 0.9258), nearly matching the predictive accuracy of CatBoost4; however, it shows slightly higher variance in NSE and KGE, suggesting less consistent generalization.
LightGBM4 (Bayesian-optimized) also performs well (R = 0.8795, RMSE = 13.4476 USD), outperforming its default counterpart and confirming the effectiveness of hyperparameter tuning. While LightGBM3 (default) achieves a moderate R = 0.8668, its larger errors (RMSE = 14.2391 USD) indicate underfitting in high-price regimes. In contrast, the XGBoost models underperform significantly, particularly XGBoost3, which exhibits the poorest test performance (RMSE = 19.6583 USD, R = 0.7981) and clear signs of overfitting, as evidenced by the disparity between training and testing RMSE values. Although Bayesian optimization improves XGBoost4’s training performance, its testing RMSE (18.5375 USD) remains suboptimal, reflecting limited robustness to skewed price distributions.
The scatter plots in Figure 5 and Figure 6 visualize prediction accuracy across models. Bayesian-optimized models, particularly CatBoost4 and LightGBM4, display tighter clustering around the line of perfect prediction, whereas default models exhibit greater deviations, especially in high-price regions. This underscores the advantage of models with advanced regularization and categorical feature handling in addressing distribution skewness inherent to SAR datasets. Notably, SAR imagery pricing often follows a heavy-tailed distribution due to rare, high-cost imaging configurations (e.g., wide-swath quad-polarized modes), which are underrepresented in training data and disproportionately contribute to prediction errors. Models like CatBoost, which natively support categorical feature interactions and are robust to sparse high-value samples, generalize more effectively in such scenarios. Conversely, models such as XGBoost and AdaBoost, reliant on fixed-depth splits or weak learners, are prone to overfitting on frequent low-priced samples and underfitting on rare high-cost observations.
From an application perspective, models demonstrating robustness to price distribution skewness—such as CatBoost—are better suited for high-end services requiring precision (e.g., national defense, strategic reconnaissance, and emergency SAR missions). Conversely, simpler models like AdaBoost may suffice for standardized, low-cost applications such as agricultural monitoring or environmental surveillance. This distinction emphasizes the necessity of aligning model selection with domain-specific requirements to enhance both predictive accuracy and economic insights.

3.2.2. SHAP Feature Importance Analysis

The SHAP analyses of our four Bayesian-optimized ensemble models (XGBoost4, LightGBM4, AdaBoost4, and CatBoost4) yield several consistent and model-specific insights. In all cases, ground-range resolution (GRR), imaging mode, and polarization type emerge as the most influential predictors, appearing among the top three features in both the bar and beeswarm plots (Figure 7 and Figure 8). Beyond this commonality, however, the models diverge in their sensitivity and stability profiles. LightGBM4 exhibits the greatest sensitivity to feature perturbations—its mean |SHAP| values span approximately 0.0–0.8 (versus 0.0–0.6 for XGBoost4), and its beeswarm displays the widest SHAP distribution (−1.0 to 2.0). AdaBoost4 likewise demonstrates a broad SHAP range (0.0–1.0) and extreme outlier influences (up to 2.5), suggesting pronounced responsiveness to unusual observations. In contrast, CatBoost4 produces a more tightly clustered SHAP distribution, indicative of greater prediction robustness. Together, these patterns illustrate a trade-off between discriminative sensitivity and noise-resilient stability: LightGBM4 and AdaBoost4 may be preferred when fine-grained feature effects or rare extreme values are important, whereas XGBoost4 and CatBoost4 offer more consistent, noise-resistant performance.
The dominance of imaging mode, GRR and polarization type in pricing predictions can be traced to their direct relationship with sensor complexity, operational expense and end-user requirements. Imaging mode dictates spatial resolution, swath width and spectral capability—high-resolution SAR bands, for example, incur greater sensor and processing costs and thus command premium prices in defense or urban-monitoring applications. Polarization type enhances target discrimination and material classification, which is critical in domains such as mineral exploration or disaster assessment, driving up willingness to pay among resource and insurance clients. GRR reflects the granularity of the imaged surface, with finer resolutions needed for tactical intelligence justifying higher fees, while coarser resolutions remain acceptable—and more cost-effective—for agricultural or environmental monitoring. By contrast, features such as satellite launch year and minimum order area (MOA) contribute minimally to price variation: launch year is largely redundant with GRR and imaging mode, and MOA’s influence is mediated by market packaging conventions. Likewise, incidence angle and SAR operating bands exhibit niche importance—relevant only to specialized scenarios that are underrepresented in general pricing datasets—while widespread adoption of standard bands (e.g., C-band) further attenuates their pricing impact.

4. Discussion

In this study, we have demonstrated that ensemble-based machine learning models—specifically XGBoost, LightGBM, AdaBoost, and CatBoost—can be effectively applied to the problem of satellite imagery price prediction when trained on technically and economically relevant features. By separating optical imagery and SAR imagery datasets and tailoring each model to the unique characteristics of these two data types, we have identified not only the optimal predictive algorithms but also the underlying feature relationships that drive pricing. Below, we discuss: (1) How the choice of data type (optical vs. SAR) interacts with specific business applications and which models are best suited in each case; (2) How our machine-learning–based pricing approach compares with traditional, experience-driven methods currently used by satellite imagery providers.

4.1. Application-Specific Data-Model Alignment

4.1.1. Optical Imagery

Optical imagery pricing relies heavily on features such as imaging mode (single-view vs. stereo/tri-stereo), panchromatic resolution (PAN Res), multispectral resolution (MS Res), satellite manufacturing cost (SMC), and acquisition recency. In our experiments, Bayesian-optimized XGBoost (XGBoost2) and CatBoost (CatBoost2) substantially outperformed the other approaches, achieving testing correlation coefficients (R) of 0.9870 and 0.9826, with RMSE values of $3.44 and $3.83 per km², respectively. These two models exhibited minimal bias (MBE ≈ −$0.81 and −$0.40) and strong efficiency metrics (NSE ≈ 0.965 and 0.957; KGE ≈ 0.895 and 0.888). Such performance can be attributed to two key factors: (1) both algorithms’ native ability to handle high-cardinality categorical variables (e.g., imaging mode, acquisition time), and (2) their robustness in modeling nonlinear interactions among resolution, satellite cost, and minimum order area.
From a business application perspective, the slightly superior performance of XGBoost2 and the comparably strong accuracy of CatBoost2 make both models excellent candidates for pricing optical products used in high-precision, mission-critical applications—such as defense-grade urban mapping or rapid-response environmental monitoring—where customers demand premium imaging modes and sub-meter resolution and are willing to pay accordingly. In contrast, AdaBoost2, which achieved a testing R of 0.9712 with RMSE ≈ $4.79/km², and LightGBM2, which yielded a testing R of 0.8669 with RMSE ≈ $9.58/km², are better suited for low-complexity, high-volume applications such as standardized agricultural or land-cover monitoring, where moderate resolution and fixed pricing structures dominate. In particular, AdaBoost2 reliably captures the dominant influence of imaging mode but exhibits greater residual variance across resolution levels, making it appropriate for mid-range commercial tasks where imaging-mode selection drives most of the value (e.g., commodity-scale crop assessments). Conversely, LightGBM2 consistently underestimates the marginal premium for high-resolution or specialized imaging modes, leading to distorted pricing for applications that require fine-grained, sub-meter detail.

4.1.2. SAR Imagery

SAR imagery pricing is governed by a distinct yet similarly rich feature space: ground-range resolution (GRR), polarization type, SAR operating bands, incidence angle, SAR image type (archive vs. tasking), as well as shared features such as SMC, imaging mode, and MOA. In our experiments, the Bayesian-optimized CatBoost4 achieved the best overall performance on SAR testing data (R = 0.9278; RMSE = $9.94/km²; NSE = 0.8575; KGE = 0.8443). AdaBoost4 followed closely (R = 0.9258; RMSE = $9.99/km²) but showed slightly higher variance in NSE and KGE. LightGBM4 yielded moderate performance (R = 0.8795; RMSE = $13.45/km²), while both the default and optimized XGBoost variants (XGBoost3/4) underperformed (best R ≈ 0.7984; RMSE ≈ $18.54/km²).
Practically, CatBoost4’s ability to handle multi-hot categorical SAR features (e.g., dual- vs. quad-polarization, multi-band support) and its ordered boosting structure enable it to generalize effectively to the heavy-tailed, skewed distribution of SAR pricing—particularly for large-format, national-scale defense or disaster-response contracts that require wide-swath, quad-polarized imagery with multi-pass acquisition. AdaBoost4 also performs well in such high-end SAR scenarios, albeit with slightly reduced stability in sparse, high-cost regions. Although LightGBM4 performs better than its default counterpart, it tends to underfit high-end price extremes, making it more suitable for standard, mid-tier SAR services (e.g., environmental monitoring, agriculture), where resolution and polarization complexity are moderate and pricing remains relatively uniform. XGBoost’s reliance on fixed tree splits and its limited capacity to model complex multi-category interactions make it less suitable for SAR pricing, particularly beyond commoditized use cases.
In summary, the interplay between data type (optical vs. SAR) and business application (e.g., high-precision defense mapping vs. routine environmental monitoring) determines the optimal model choice. For optical imagery, where spatial detail, multispectral richness, and acquisition recency are paramount, XGBoost2 and CatBoost2 provide the most reliable price predictions. For SAR data, where pricing depends on complex categorical combinations and highly skewed value distributions, CatBoost4 offers the best trade-off between accuracy and generalization, with AdaBoost4 as a strong alternative. While LightGBM may be suitable for high-volume, lower-tier applications in both imagery domains, it fails to adequately capture the full economic premium associated with high-end products.

4.2. Comparison with Traditional Pricing Methods

Commercial satellite imagery providers have traditionally relied on experience-driven, rule-based pricing—assigning rates based on a few key attributes (e.g., sensor resolution band, archival freshness, and minimum order area). While simple to administer, such schemes fail to capture nuanced feature interactions—such as polarization complexity combined with a specific ground-range resolution (GRR), or a temporal premium for sub-24-h tasking—and often default to broad “price brackets” (e.g., $20–$30/km² for <30 cm panchromatic imagery). As a result, static rule sets may lag behind market dynamics, overprice competitive segments, or undercharge specialized, time-sensitive scenarios.
Our machine learning-based approach overcomes these limitations in four key ways:
  • Data-Driven Adaptability: Ensemble methods such as XGBoost and CatBoost can ingest dozens of continuous and categorical variables simultaneously, capturing complex, nonlinear interactions—such as the combined effect of sub-meter panchromatic resolution with recent (<90 days) acquisition for urban-infrastructure clients. This yields dynamic price estimates that more accurately reflect supply and demand than fixed rule brackets.
  • Hyperparameter Optimization for Robustness: Bayesian optimization systematically tunes regularization strength and tree complexity to minimize overfitting, ensuring strong generalization to unseen data. In contrast, traditional rules lack explicit mechanisms to balance model complexity and bias, often resulting in inconsistent pricing at edge cases.
  • Granular Feature-Importance Insights: SHAP analysis quantifies each feature’s marginal economic contribution (e.g., a +$6/km2 premium for 30 cm resolution in stereo mode), revealing exactly which attributes drive willingness to pay; a workflow sketch follows this list. Conventional approaches typically treat resolution as a binary “high” vs. “low” toggle and overlook such pricing subtleties.
  • Scalability and Real-Time Updating: Once trained, ensemble models can instantly generate quotes for millions of parameter combinations (e.g., any combination of GRR, polarization type, and incidence angle). Manual rule systems cannot adapt without extensive human intervention—a growing limitation in an era of increasingly diverse optical and SAR offerings.
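The SHAP workflow referenced in the third bullet can be reproduced with the shap package's TreeExplainer. The sketch below trains a toy XGBoost model on two illustrative optical features (the column names and the built-in stereo premium are ours, chosen to echo the example above) and then produces the bar and beeswarm summaries of the kind shown in Figures 3–8:

```python
import numpy as np
import pandas as pd
import shap
from xgboost import XGBRegressor

# Toy stand-in data for two of the optical features in Table 1.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "pan_res_cm": rng.uniform(30, 500, 300),
    "mode_stereo": rng.integers(0, 2, 300),
})
y = 60 - 0.1 * X["pan_res_cm"] + 6 * X["mode_stereo"] + rng.normal(0, 1, 300)

model = XGBRegressor(n_estimators=100, max_depth=3).fit(X, y)

# TreeExplainer gives one additive price contribution per feature per sample.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Global importance bar plot and per-sample beeswarm.
shap.summary_plot(shap_values, X, plot_type="bar", show=False)
shap.summary_plot(shap_values, X, show=False)

# Marginal dollar contribution of stereo mode for the first sample:
print(f"stereo contribution: {shap_values[0, 1]:+.2f} $/km^2")
```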
Adopting a data-driven pricing engine does require access to reliable transaction histories and periodic model retraining. However, the benefits are substantial: transparency, quantifiability, and the ability to capture market heterogeneity, reduce price volatility, and align prices precisely with feature combinations. For example, while a traditional rule might assign $20/km2 to all products under 50 cm resolution, our optimized XGBoost model can add +$5/km2 for advanced stereo mode and subtract $1.50/km2 for acquisitions older than 90 days—pricing nuances unattainable by static methods.
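Such per-feature adjustments can be read directly off a trained model by re-predicting with one attribute changed at a time. The sketch below is a hypothetical illustration on toy data engineered to reproduce the +$5 and −$1.50 adjustments quoted above; the helper function and column names are ours, not part of the study's code:

```python
import numpy as np
import pandas as pd
from xgboost import XGBRegressor

def price_delta(model, base_row: pd.DataFrame, updates: dict) -> float:
    """Price change (USD/km^2) when the given feature columns are altered."""
    modified = base_row.copy()
    for col, val in updates.items():
        modified[col] = val
    return float(model.predict(modified)[0] - model.predict(base_row)[0])

# Toy model standing in for the optimized optical XGBoost regressor.
rng = np.random.default_rng(1)
X = pd.DataFrame({
    "mode_stereo": rng.integers(0, 2, 400),   # 1 = stereo imaging mode
    "recent_acq": rng.integers(0, 2, 400),    # 1 = acquired within 90 days
})
y = 20 + 5 * X["mode_stereo"] - 1.5 * (1 - X["recent_acq"]) + rng.normal(0, 0.2, 400)
model = XGBRegressor(n_estimators=200, max_depth=2).fit(X, y)

quote = pd.DataFrame({"mode_stereo": [0], "recent_acq": [1]})
print(price_delta(model, quote, {"mode_stereo": 1}))  # ~ +5.0 $/km^2
print(price_delta(model, quote, {"recent_acq": 0}))   # ~ -1.5 $/km^2
```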
In summary, Bayesian-optimized XGBoost (for optical imagery) and CatBoost (for SAR imagery) offer a robust, data-driven alternative to conventional rule-based pricing systems. By (1) modeling high-dimensional feature interactions, (2) automatically tuning model complexity to avoid overfitting, and (3) quantifying feature value via SHAP, these methods enable more accurate, transparent, and economically rational price recommendations. Providers who implement such machine-learning models can more effectively segment customers, reduce mispricing in critical contracts, and capture greater revenue from specialized imaging services.
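For completeness, point (2) can be reproduced with any Bayesian-style optimizer. The sketch below uses Optuna's default TPE sampler to tune an XGBoost regressor on synthetic data; this is one illustrative implementation, not the study's actual optimization code, and the search ranges merely bracket the tuned values reported in Tables 3 and 4:

```python
import numpy as np
import optuna
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

# Toy regression data standing in for the encoded optical feature matrix.
rng = np.random.default_rng(0)
X = rng.random((400, 9))
y = 30 * X[:, 0] + 8 * X[:, 1] ** 2 + rng.normal(0, 1, 400)

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 300),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "max_depth": trial.suggest_int("max_depth", 3, 9),
        "subsample": trial.suggest_float("subsample", 0.6, 1.0),
        "reg_alpha": trial.suggest_float("reg_alpha", 0.0, 10.0),
        "reg_lambda": trial.suggest_float("reg_lambda", 0.0, 20.0),
    }
    model = XGBRegressor(**params, random_state=42)
    # Minimize cross-validated RMSE on the training split.
    rmse = -cross_val_score(model, X, y, cv=5,
                            scoring="neg_root_mean_squared_error").mean()
    return rmse

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50, show_progress_bar=False)
print(study.best_params)
```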

5. Conclusions

This study presents a unified, data-driven framework for satellite imagery price prediction, applying four ensemble algorithms—XGBoost, LightGBM, AdaBoost, and CatBoost—to both optical and SAR datasets. Each dataset was described by nine normalized technical and economic features, with categorical variables encoded via one-hot and multi-hot schemes. Bayesian optimization of hyperparameters consistently improved model generalization and minimized bias.
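As a concrete illustration of these encoding schemes (our sketch, with illustrative column names), one-hot encoding suits single-valued categories such as imaging mode, while multi-hot encoding handles attributes that may take several values at once, such as supported SAR bands:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({
    "imaging_mode": ["single-view", "stereo-view", "tri-stereo"],  # one value per product
    "sar_bands": [["X"], ["X", "L"], ["C", "S", "X"]],             # possibly several values
})

# One-hot: exactly one indicator column is 1 per row.
onehot = pd.get_dummies(df["imaging_mode"], prefix="mode")

# Multi-hot: several indicator columns may be 1 per row.
mlb = MultiLabelBinarizer()
multihot = pd.DataFrame(mlb.fit_transform(df["sar_bands"]),
                        columns=[f"band_{b}" for b in mlb.classes_])

features = pd.concat([onehot, multihot], axis=1)
print(features)
```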
For optical imagery, Bayesian-optimized XGBoost emerged as the top performer (R = 0.9870, RMSE = $3.44/km2, NSE = 0.9651, KGE = 0.8950), closely followed by CatBoost (R = 0.9826, RMSE = $3.83/km2). Both models exhibited minimal overfitting across increasing data volumes. In the SAR domain, CatBoost attained the highest accuracy (R = 0.9278, RMSE = $9.94/km2, NSE = 0.8575, KGE = 0.8443), demonstrating robustness to heavy-tailed price distributions, while AdaBoost remained a competitive alternative. In all cases, LightGBM and non-optimized XGBoost showed larger residual errors, especially for high-value samples.
SHAP analyses revealed that imaging mode and spatial resolution were the most influential features in both domains. Advanced imaging modes (e.g., stereo, tri-stereo) and sub-meter resolutions commanded significant price premiums across commercial, defense, and emergency-response use cases. Satellite manufacturing cost and acquisition recency also contributed meaningfully but to a lesser extent. These insights provide a clear, quantitative understanding of how technical specifications and temporal factors drive customer willingness to pay.
Compared to traditional rule-based pricing—where broad “price brackets” often overlook complex feature interactions—our machine-learning approach offers a scalable, transparent engine that captures nonlinear relationships and supports real-time adjustments. By automatically tuning regularization parameters and providing SHAP-based interpretability, these models enable providers to align prices more precisely with observable feature combinations and market conditions.
Future work should broaden the modeling framework by incorporating market-demand factors that go beyond purely technical product specifications. For instance, adding macro-level indicators such as annual remote-sensing procurement totals—gleaned from corporate or industry reports—can help capture overall shifts in purchasing power and budgetary cycles. Similarly, real-time search-interest metrics (for example, Google Trends or Baidu Index scores for “satellite imagery” or “high-resolution imagery”) can provide leading signals of rising customer demand or emerging application areas. In addition, insights into actual usage patterns—such as order volumes, repeat-purchase rates, and customer churn—would offer valuable context about how frequently and at what price points imagery is being acquired. Finally, benchmarking against competitive pricing (for example, comparing rate cards across similar imagery platforms) can help adjust for market-level price differentials and promotional strategies. By integrating these demand-side data sources into the ensemble models, future pricing engines will be able to dynamically adjust to both supply- and demand-driven forces, leading to even more accurate, economically rational recommendations that reflect real-world market behavior.
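As one purely illustrative example of such a search-interest signal, the pytrends package can pull Google Trends series that could later be joined to order records as a demand-side feature; the keywords and smoothing window below are our assumptions:

```python
from pytrends.request import TrendReq

# Weekly worldwide search interest for candidate demand keywords.
pytrends = TrendReq(hl="en-US", tz=0)
pytrends.build_payload(kw_list=["satellite imagery", "high-resolution imagery"],
                       timeframe="today 5-y")
interest = pytrends.interest_over_time()  # DataFrame indexed by week

# A simple leading indicator: 12-week rolling mean of search interest,
# to be merged with transaction histories by date in a future pricing model.
demand_signal = interest["satellite imagery"].rolling(12).mean()
print(demand_signal.tail())
```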

Author Contributions

Conceptualization, Z.C. and L.Y.; methodology, Z.C. and L.Y.; software, L.Y.; validation, L.Y.; formal analysis, L.Y., Z.C. and G.L.; resources, G.L., Z.C. and L.Y.; data curation, L.Y.; writing—original draft preparation, L.Y.; writing—review and editing, L.Y. and Z.C.; visualization, L.Y.; supervision, G.L. and Z.C.; project administration, Z.C. and G.L.; funding acquisition, Z.C. and G.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China [Grant No. 42201505]; the Natural Science Foundation of Hainan Province of China [Grant No. 622QN352]; the National Key Research and Development Program of China [Grant No. 2021YFF070420304]; and the Computer Network and Information Special Project of the Chinese Academy of Sciences [Grant No. 2025000010]. The authors are very grateful to the anonymous reviewers and editors, who have greatly helped improve the quality of the paper.

Data Availability Statement

The datasets analyzed and generated during the current study are publicly available in the Satellite Image Price Data repository at: https://github.com/ShanLinn/Satellite-Image-Price-Data (accessed on 4 March 2025).

Acknowledgments

We thank the anonymous reviewers and all of the editors who participated in the revision process.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Scatter plots comparing actual and predicted optical-imagery prices obtained using the XGBoost, LightGBM, AdaBoost, and CatBoost models before and after Bayesian optimization.
Figure 2. Learning Curves for XGBoost1, XGBoost2, CatBoost1, and CatBoost2 Models.
Figure 3. SHAP summary (bar, left) and SHAP beeswarm (dot, right) plots for Bayesian-optimized XGBoost2 and LightGBM2 models.
Figure 4. SHAP summary (bar, left) and SHAP beeswarm (dot, right) plots for Bayesian-optimized AdaBoost2 and CatBoost2 models.
Figure 5. Scatter plots comparing actual and predicted SAR imagery prices obtained by XGBoost3,4 and LightGBM3,4 models before and after Bayesian optimization.
Figure 6. Scatter plots comparing actual and predicted SAR imagery prices obtained by AdaBoost3,4 and CatBoost3,4 models before and after Bayesian optimization.
Figure 7. SHAP summary (bar, left) and SHAP beeswarm (dot, right) plots for Bayesian-optimized XGBoost4 and LightGBM4 models.
Figure 8. SHAP summary (bar, left) and SHAP beeswarm (dot, right) plots for Bayesian-optimized AdaBoost4 and CatBoost4 models.
Table 1. Optical Imagery Feature Descriptions.

| No. | Input Variable Name | Variable Type | Encoding Method | Description |
|---|---|---|---|---|
| 1 | Image Acquisition Completion Time | Categorical | One-hot Encoding | Binary classification indicating whether the image acquisition was completed recently (≤90 days) or earlier (>90 days). |
| 2 | Year of Satellite Launch | Continuous | Not encoded | Launch year of the satellite. |
| 3 | Satellite Manufacturing Cost (SMC) | Continuous | Not encoded | Satellite manufacturing cost (billion USD). |
| 4 | Imaging Mode | Categorical | One-hot Encoding | Categorical variable indicating the satellite imaging mode: single-view, stereo-view, or tri-stereo. |
| 5 | Panchromatic Image Resolution (PAN Res) | Continuous | Not encoded | Panchromatic resolution in centimeters (cm). |
| 6 | Number of Panchromatic Bands (N_PAN) | Continuous | Not encoded | Count of panchromatic spectral bands. |
| 7 | Number of Multispectral Bands (N_MS) | Continuous | Not encoded | Count of multispectral spectral bands. |
| 8 | Multispectral Image Resolution (MS Res) | Continuous | Not encoded | Multispectral resolution in centimeters (cm). |
| 9 | Minimum Order Area (MOA) | Continuous | Not encoded | Minimum purchase unit (km2). |
| 10 | Price | Continuous | Not encoded | Satellite imagery price (USD/km2). |
Table 2. SAR Imagery Feature Descriptions.

| No. | Input Variable Name | Variable Type | Encoding Method | Description |
|---|---|---|---|---|
| 1 | Year of Satellite Launch | Continuous | Not encoded | Current year minus satellite launch year (years). |
| 2 | Satellite Manufacturing Cost (SMC) | Continuous | Not encoded | Satellite manufacturing cost (million USD). |
| 3 | Imaging Mode | Categorical | One-hot Encoding | Radar imaging mode: ScanSAR, Spotlight, or Stripmap. |
| 4 | Ground Range Resolution (GRR) | Continuous | Not encoded | Spatial resolution in the range direction (meters). |
| 5 | Polarization Type | Categorical | Multi-Hot Encoding | Polarization type used during acquisition: Single, Dual, or Quad polarization. |
| 6 | SAR Operating Bands | Categorical | Multi-Hot Encoding | SAR operating frequency bands, such as L, X, C, P, S, Ku, and Ka; multiple bands may be supported simultaneously. |
| 7 | Incidence Angle | Continuous | Not encoded | Incidence angle between radar beam and ground (degrees). |
| 8 | Minimum Order Area (MOA) | Continuous | Not encoded | Minimum number of scenes required for order (km2). |
| 9 | Image Type | Categorical | One-hot Encoding | Image type indicating whether the SAR data is from archive or a newly tasked acquisition. |
| 10 | Price | Continuous | Not encoded | SAR imagery price (USD/km2). |
Table 3. Default and Optimal Hyperparameters for Optical Imagery Model Training.

| Model | Hyperparameter | Optimal | Default |
|---|---|---|---|
| XGBoost | Learning rate | 0.15 | 0.05 |
| XGBoost | Maximum depth | 5 | 3 |
| XGBoost | Number of trees | 183 | 200 |
| XGBoost | Subsample for tree | 0.901 | 0.8 |
| XGBoost | Depth sample fraction | 0.866 | 0.8 |
| XGBoost | Regularization (alpha) | 5 | 0 |
| XGBoost | Regularization (lambda) | 10 | 0 |
| LightGBM | Number of boosting iterations | 185 | 200 |
| LightGBM | Learning rate | 0.1723 | 0.1 |
| LightGBM | Number of leaves | 46 | 31 |
| LightGBM | Maximum depth | 4 | −1 |
| LightGBM | Min data in leaf | 20 | 20 |
| LightGBM | Regularization (alpha) | 2.6094 | 2 |
| LightGBM | Regularization (lambda) | 12.8713 | 5 |
| AdaBoost | Base Estimator | 5 | 3 |
| AdaBoost | Number of Weak Learners | 95 | 50 |
| AdaBoost | Learning Rate | 0.9702 | 1.0 |
| AdaBoost | Loss Function | linear | linear |
| CatBoost | Number of trees | 1200 | 1000 |
| CatBoost | Learning rate | 0.0949 | 0.03 |
| CatBoost | Depth of tree | 4 | 4 |
| CatBoost | Subsample for iteration | 0.8 | 1.0 |
| CatBoost | Level feature proportion | 0.779 | 1.0 |
| CatBoost | Regularization | 67.493 | 0 |
Table 4. Default and Optimal Hyperparameters for SAR Imagery Model Training.

| Model | Hyperparameter | Optimal | Default |
|---|---|---|---|
| XGBoost | Learning rate | 0.0981 | 0.05 |
| XGBoost | Maximum depth | 6 | 4 |
| XGBoost | Number of trees | 185 | 100 |
| XGBoost | Subsample for tree | 0.8719 | 0.8 |
| XGBoost | Depth sample fraction | 0.846 | 0.8 |
| LightGBM | Number of boosting iterations | 100 | 100 |
| LightGBM | Learning rate | 0.270069 | 0.1 |
| LightGBM | Number of leaves | 54 | 128 |
| LightGBM | Maximum depth | 8 | 10 |
| LightGBM | Min data in leaf | 16 | 20 |
| LightGBM | Regularization (alpha) | 0.9 | 0 |
| LightGBM | Regularization (lambda) | 0.458 | 0 |
| AdaBoost | Base Estimator | 5 | 3 |
| AdaBoost | Number of Weak Learners | 143 | 50 |
| AdaBoost | Learning Rate | 0.7835 | 1.0 |
| AdaBoost | Loss Function | linear | linear |
| CatBoost | Number of trees | 200 | 200 |
| CatBoost | Learning rate | 0.26867 | 0.1 |
| CatBoost | Depth of tree | 9 | 6 |
| CatBoost | Subsample for iteration | 0.786 | 1.0 |
| CatBoost | Level feature proportion | 0.7158 | 1.0 |
| CatBoost | L2 regularization | 10 | 3 |
Table 5. Performance metrics of machine learning models for optical imagery pricing.

| Models | Dataset | R | MBE ($) | RMSE ($) | ubRMSE ($) | NSE | KGE |
|---|---|---|---|---|---|---|---|
| XGBoost1 (Default Parameters) | Training | 0.9820 | −0.1198 | 3.1336 | 3.1313 | 0.9613 | 0.9261 |
| | Testing | 0.9697 | −0.9081 | 5.5192 | 5.4440 | 0.9101 | 0.7964 |
| XGBoost2 (Bayesian Optimized) | Training | 0.9965 | −0.0121 | 1.3449 | 1.3449 | 0.9929 | 0.9855 |
| | Testing | 0.9870 | −0.8108 | 3.4389 | 3.3420 | 0.9651 | 0.8950 |
| LightGBM1 (Default Parameters) | Training | 0.9616 | 0.0017 | 4.4000 | 4.4000 | 0.9238 | 0.9211 |
| | Testing | 0.8647 | −2.3323 | 9.6777 | 9.3925 | 0.7236 | 0.7170 |
| LightGBM2 (Bayesian Optimized) | Training | 0.9640 | 0.0043 | 4.2577 | 4.2577 | 0.9286 | 0.9277 |
| | Testing | 0.8669 | −2.3659 | 9.5816 | 9.2850 | 0.7290 | 0.7289 |
| AdaBoost1 (Default Parameters) | Training | 0.9715 | 1.6241 | 4.5098 | 4.2072 | 0.9199 | 0.8287 |
| | Testing | 0.9522 | 0.0513 | 6.5145 | 6.5143 | 0.8747 | 0.7684 |
| AdaBoost2 (Bayesian Optimized) | Training | 0.9878 | 0.6273 | 2.7615 | 2.6893 | 0.9700 | 0.9146 |
| | Testing | 0.9712 | 0.2114 | 4.7942 | 4.7895 | 0.9322 | 0.8634 |
| CatBoost1 (Default Parameters) | Training | 0.9768 | −0.1336 | 3.8078 | 3.8054 | 0.9429 | 0.8689 |
| | Testing | 0.9608 | −1.0555 | 6.4326 | 6.3454 | 0.8779 | 0.7483 |
| CatBoost2 (Bayesian Optimized) | Training | 0.9951 | −0.0018 | 1.6176 | 1.6176 | 0.9897 | 0.9733 |
| | Testing | 0.9826 | −0.3973 | 3.8349 | 3.8143 | 0.9566 | 0.8881 |
Table 6. Performance Metrics of Machine Learning Models for SAR Imagery Pricing.

| Models | Dataset | R | MBE ($) | RMSE ($) | ubRMSE ($) | NSE | KGE |
|---|---|---|---|---|---|---|---|
| XGBoost3 (Default Parameters) | Training | 0.8128 | −2.9964 | 14.9660 | 14.6629 | 0.5022 | 0.3124 |
| | Testing | 0.7981 | −6.1475 | 19.6583 | 18.6724 | 0.4585 | 0.2801 |
| XGBoost4 (Bayesian Optimized) | Training | 0.8154 | −2.2651 | 14.0263 | 13.8422 | 0.5628 | 0.4157 |
| | Testing | 0.7984 | −4.9981 | 18.5375 | 17.8510 | 0.5185 | 0.3729 |
| LightGBM3 (Default Parameters) | Training | 0.9580 | −0.6651 | 7.5399 | 7.5105 | 0.8737 | 0.7350 |
| | Testing | 0.8668 | −2.3450 | 14.2391 | 14.0447 | 0.7159 | 0.6365 |
| LightGBM4 (Bayesian Optimized) | Training | 0.9780 | −0.1252 | 5.2519 | 5.2504 | 0.9387 | 0.8427 |
| | Testing | 0.8795 | −2.1293 | 13.4476 | 13.2780 | 0.7466 | 0.6770 |
| AdaBoost3 (Default Parameters) | Training | 0.8294 | 2.0758 | 10.1853 | 9.9715 | 0.6605 | 0.7446 |
| | Testing | 0.8561 | −1.8330 | 14.3591 | 14.2416 | 0.7026 | 0.6343 |
| AdaBoost4 (Bayesian Optimized) | Training | 0.9153 | −1.0339 | 7.1323 | 7.0569 | 0.8335 | 0.8339 |
| | Testing | 0.9258 | −0.4336 | 9.9877 | 9.9783 | 0.8561 | 0.8705 |
| CatBoost3 (Default Parameters) | Training | 0.9268 | −2.4994 | 7.4572 | 7.0259 | 0.8180 | 0.6835 |
| | Testing | 0.9283 | −0.9455 | 9.9797 | 9.9348 | 0.8564 | 0.8296 |
| CatBoost4 (Bayesian Optimized) | Training | 0.9282 | −2.5491 | 7.4694 | 7.0209 | 0.8174 | 0.6764 |
| | Testing | 0.9278 | −0.8663 | 9.9384 | 9.9005 | 0.8575 | 0.8443 |