Integrating Multi-Strategy Improvements to Sand Cat Group Optimization and Gradient-Boosting Trees for Accurate Prediction of Microclimate in Solar Greenhouses

Cui, Xiao; Cheng, Yuwei; Zhang, Zhimin; Mu, Juanjuan; Zhang, Wuping

doi:10.3390/agriculture15171849


by Xiao Cui, Yuwei Cheng, Zhimin Zhang, Juanjuan Mu and Wuping Zhang *

Software College, Shanxi Agricultural University, Taigu 030800, China

* Author to whom correspondence should be addressed.

Agriculture 2025, 15(17), 1849; https://doi.org/10.3390/agriculture15171849

Submission received: 15 July 2025 / Revised: 15 August 2025 / Accepted: 28 August 2025 / Published: 29 August 2025

(This article belongs to the Section Artificial Intelligence and Digital Agriculture)


Abstract

Solar greenhouses are an important component of modern facility agriculture, and the dynamic changes in their internal environment directly affect crop growth and yield. Among these factors, crop transpiration releases water vapor, directly altering the indoor humidity balance and forming a dynamic coupling with factors such as temperature and light. The environment of solar greenhouses therefore exhibits highly nonlinear and multivariate coupling characteristics, which leads to insufficient prediction accuracy in existing models, even though accurate predictions are crucial for regulating crop growth and yield. Current mainstream greenhouse environmental prediction models still have obvious limitations when dealing with such complexity: traditional machine learning models and single-variable-driven models suffer from insufficient accuracy (average MAE 15–20% higher than in this study) and weak adaptability to nonlinear environmental changes in multi-factor coupling predictions, making it difficult to meet the needs of precision farming. A review of relevant research over the past five years shows that while LSTM-based models perform well in time-series prediction, they ignore the spatial correlations between environmental factors, and models incorporating attention mechanisms can capture key variables but suffer from high computational costs. To address these issues, this study proposes a prediction model based on multi-strategy optimization and gradient-boosting decision tree (GBDT) algorithms. By introducing a multi-scale feature fusion module, it addresses the accuracy issues in multi-factor coupling prediction. Additionally, it employs a lightweight network design to balance prediction performance and computational efficiency, filling the gap in existing research applications under complex greenhouse environments. 
The model optimizes data preprocessing and model parameters through Sobol sequence initialization, adaptive t-distribution perturbation strategies, and Gaussian–Cauchy mixture mutation strategies and combines CatBoost for modeling to enhance prediction accuracy. Experimental results show that the MSCSO–CatBoost model performs excellently in temperature prediction, with the mean absolute error (MAE) and root mean square error (RMSE) reduced by 22.5% (2.34 °C) and 24.4% (3.12 °C), respectively, and the coefficient of determination (R²) improved to 0.91, significantly outperforming traditional regression methods and combinations of other optimization algorithms. Additionally, the model demonstrates good generalization capability in predicting multiple environmental variables such as temperature, humidity, and light intensity, adapting to environmental fluctuations under different climatic conditions. This study confirms that combining multi-strategy optimization with gradient-boosting algorithms can significantly improve the prediction accuracy of solar greenhouse environments, providing reliable support for precision agricultural management. Future research could further explore the model’s adaptive optimization in complex climatic regions.

Keywords:

solar greenhouse; IoT; gradient-boosting tree; sand cat swarm optimization algorithm; environmental prediction model

1. Introduction

The solar greenhouse is an important part of modern facility agriculture [1] and is widely used in agricultural production because of its ability to regulate the growing environment of crops [2], extend the growing period, and increase yield [3]. However, the internal environment of a solar greenhouse (e.g., temperature, humidity, light intensity, and carbon dioxide concentration) is affected by a variety of external meteorological conditions (e.g., air temperature, wind speed, and solar radiation) and internal control strategies (e.g., ventilation, heating, and sprinkler irrigation) [4], and therefore exhibits significant nonlinearity, dynamic coupling, and spatio-temporal heterogeneity. Accurately predicting the greenhouse environment and then realizing intelligent control has therefore become one of the research hotspots in academia and engineering applications [5]. The covering material of the solar greenhouse in this study is ethylene-vinyl acetate (EVA) film with a thickness of 0.12 mm. Its radiometric characteristics are as follows: transmittance of 88% and reflectance of 10% in the solar radiation band (300–2500 nm); transmittance of 12% and reflectance of 75% in the far-infrared band (5–50 μm). This material meets photosynthetic demand through high solar transmittance and enhances thermal insulation by reducing nighttime radiant heat loss via high infrared reflectance. The indoor soil temperature and humidity sensors were installed at a depth of 10 cm below the surface, which reflects the soil environment in the crop root activity layer. The solar radiation sensor was installed outside the greenhouse, at the top, to collect total solar radiation intensity at a sampling interval of 30 min. 
The carbon dioxide concentration sensor was installed inside the greenhouse, 50 cm above the crop canopy, to accurately monitor the CO₂ concentration in the crop growth environment. The greenhouse in this experiment was not equipped with CO₂ enrichment equipment, and the indoor CO₂ concentration during the monitoring period was mainly affected by crop respiration, photosynthesis, and natural ventilation.

Currently, traditional methods for greenhouse environment prediction mainly rely on statistical modeling and empirical formulas, such as linear regression and the Autoregressive Integrated Moving Average (ARIMA) model [6]. However, these models often show large limitations when dealing with highly complex, nonlinear, and time-varying environmental data, which makes it difficult to meet the needs of practical applications. For this reason, optimization and machine learning methods have been explored for greenhouse environmental prediction, such as optimization algorithms (Simulated Annealing, Genetic Algorithm, Particle Swarm Optimization), Support Vector Machine (SVM), and Long Short-Term Memory (LSTM) [7]. Although these methods can improve prediction accuracy, they are prone to overfitting when facing high-dimensional feature data [8]; in addition, training times are long, parameter tuning is complicated, and stability and accuracy are unsatisfactory, which limits the practical application value of the models [9]. The outstanding performance of the model proposed in this study stems from two key aspects: first, the implementation of multi-strategy optimization techniques, where Sobol sequence initialization improves parameter optimization efficiency and Gaussian–Cauchy mixture mutation enhances global search capabilities; second, improvements to gradient-boosting algorithms, particularly CatBoost’s adaptive handling of categorical features. Additionally, the discussion addresses the model’s limitations and applicability boundaries, noting that its performance may deteriorate in extreme data scarcity scenarios (e.g., data missing rates exceeding 30% due to sensor failures) and ultra-short-term high-frequency predictions (intervals less than 10 min). 
Potential causes are further analyzed, such as the feature fusion module’s reliance on data integrity, and potential improvement strategies are proposed, including integrating data interpolation and model lightweighting methods.

For this reason, this study proposes to overcome the deficiencies of previous algorithms in solar greenhouse environment prediction by fusing the multi-strategy improved sand cat swarm optimization (MSCSO) algorithm with an optimized gradient-boosting machine learning algorithm. First, traditional methods show large prediction errors when facing the nonlinear and complex coupling relationships of the greenhouse environment, whereas MSCSO optimizes the hyperparameters of the CatBoost model through multi-strategy fusion so that the model can quickly converge to the global optimal solution, which effectively improves prediction accuracy and robustness. Second, this paper introduces a time-series-based feature engineering strategy, which fully exploits the temporal dependencies of greenhouse environmental data and selects representative feature variables to enhance the sensitivity and adaptability of the model to dynamic changes in the environment. Finally, through experimental validation, we compare the prediction performance of the proposed MSCSO–CatBoost model with that of traditional models for multiple environmental factors, such as temperature, humidity, light intensity, and carbon dioxide concentration, to verify its superiority under complex nonlinear environmental conditions.

In summary, the hyperparameters of the CatBoost model are optimized by MSCSO to ensure that the model can quickly converge to the optimal solution, and a time-series-based feature engineering strategy is introduced to fully explore the temporal dependence of the greenhouse environmental data, with representative feature variables selected to enhance the predictive stability of the model.

2. Materials and Methods

2.1. Data Collection

The experimental site of this study was a solar greenhouse at the tomato base in Wanggantun Town, Yanggao County, Datong City, Shanxi Province (113°9′34″ E, 40°40′54″ N). To obtain detailed data on the environment inside and outside the greenhouse, an IoT-based data acquisition system was used in the experiment. The system consists of a weather monitoring mainframe, a four-in-one louver box, soil sensors, and a small weather station, which continuously monitor and record a variety of environmental parameters inside and outside the greenhouse, including indoor and outdoor temperature, humidity, carbon dioxide concentration, and light intensity, as well as indoor soil temperature and humidity and wind speed and direction.

As shown in Table 1, soil temperature and humidity in the greenhouse remained in a relatively stable range, with soil temperature varying from 15.6 °C to 19.5 °C and soil humidity maintained at about 25%, providing suitable basic conditions for plant growth. The humidity inside and outside the greenhouse showed some fluctuations, and the wind speed varied greatly, with a maximum of 6.2 m/s, which might affect humidity regulation and air circulation in the greenhouse. The temperature inside the greenhouse fluctuated from 16.25 °C to 28.87 °C under the influence of external environmental changes. The temperature difference between day and night was obvious, so the temperature control system needs to regulate the internal temperature by means of heating and ventilation to ensure that the plants grow in the optimal environment.

Figure 1 shows the dynamic temperature changes at different monitoring points within the greenhouse. The horizontal axis is divided into 30 min intervals, labeled from 1 to 48, representing 48 consecutive 30 min monitoring periods (totaling 24 h). The vertical axis is measured in degrees Celsius (°C), clearly illustrating the temperature fluctuations at each monitoring point during the corresponding time periods. As shown in Figure 1 (Temperature 1), the frequent interference of manual work causes the indoor temperature to show significant fluctuations, indicating that the existing temperature control system is difficult to maintain a stable state under human intervention. The data in Figure 1 (Temperature 2) shows that when the outdoor ambient temperature is high, the temperature inside the greenhouse is also above the horizontal line. The anomalous cooling process that occurred at 08:00 in Figure 1 (Temperature 3) was closely related to the outdoor snowfall weather, which resulted in a continuous drop in the indoor temperature. Figure 1 (Temperature 4), on the other hand, shows that the internal environment of the greenhouse was also consistently preserved at low temperature levels during the low-temperature period. These results systematically demonstrate the complex mechanism by which the thermal environment of the solar greenhouse is jointly influenced by human operation, outdoor climate, and building characteristics, providing an important basis for optimizing the greenhouse environmental regulation strategy.

The sensors used include SHT30 soil temperature and humidity sensors (temperature −40~125 °C, accuracy ± 0.3 °C; humidity 0~100% RH, accuracy ± 2% RH), DHT22 air temperature and humidity sensors (temperature −40~80 °C, accuracy ± 0.5 °C; humidity 0~100% RH, accuracy ± 2% RH), SCD30 carbon dioxide sensors (400~10,000 ppm, accuracy ± (30 ppm + 3% reading)), BH1750 light sensors (0~65,535 Lux, accuracy ± 20%), and FC−28 wind speed and direction sensors (wind speed 0~30 m/s, accuracy ± 0.3 m/s; wind direction 0~360°, accuracy ± 3°). For data quality control, outliers are monitored in real time, and runs of 3 consecutive missing points are filled by linear interpolation. Single samples deviating by more than 3 standard deviations are replaced with moving averages, ensuring a daily data integrity rate ≥ 98% and an outlier ratio < 1%. The 106-day data are divided chronologically into a training set (1 December 2023–15 February 2024) and a test set (16 February 2024–15 March 2024), a 7:3 time-series split. Hourly features, lag features, and sliding-window statistics are extracted, time-series cross-validation is adopted, and standardization is fitted on the training set only to prevent data leakage.
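The preprocessing steps listed above can be sketched as follows. This is a minimal, standard-library illustration rather than the authors' pipeline: the function names, the treatment of "3 consecutive missing points" as a maximum gap length, and the window sizes are all assumptions made for the example.

```python
from statistics import mean, pstdev


def fill_short_gaps(values, max_gap=3):
    """Linearly interpolate interior runs of at most `max_gap` missing points (None)."""
    out = list(values)
    i, n = 0, len(out)
    while i < n:
        if out[i] is not None:
            i += 1
            continue
        start = i
        while i < n and out[i] is None:
            i += 1
        gap = i - start
        # Only fill gaps that are short and have a known value on both sides.
        if start > 0 and i < n and gap <= max_gap:
            left, right = out[start - 1], out[i]
            for k in range(gap):
                out[start + k] = left + (right - left) * (k + 1) / (gap + 1)
    return out


def chronological_split(rows, train_parts=7, test_parts=3):
    """Split in time order (no shuffling); 7:3 by default."""
    cut = len(rows) * train_parts // (train_parts + test_parts)
    return rows[:cut], rows[cut:]


def lag_feature(values, lag):
    """Shift a series by `lag` steps (lag >= 1); the first `lag` entries are unknown."""
    return [None] * lag + list(values[:-lag])


def rolling_mean(values, window):
    """Mean of the current and previous `window - 1` points (None until the window is full)."""
    return [
        None if i + 1 < window else mean(values[i + 1 - window : i + 1])
        for i in range(len(values))
    ]


def standardize(train, test):
    """Scale both sets with the mean and standard deviation of the training set only."""
    mu, sigma = mean(train), pstdev(train) or 1.0
    scale = lambda xs: [(x - mu) / sigma for x in xs]
    return scale(train), scale(test)
```

Fitting the scaler on the training portion and reusing it for the test portion is what keeps information from the later period out of the model inputs.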

The greenhouse uses a 0.15-millimeter-thick ethylene–vinyl acetate (EVA)-modified polyethylene (PE) film. It has a visible light transmittance of up to 92% (peak at 550 nanometers) and a UV blocking rate of 30% (280–380 nanometers), effectively protecting crops from excessive UV damage. The film is stretched and covered over a galvanized steel frame, with an overlap width of 15 centimeters between adjacent films to ensure airtightness. Its light diffusion rate is 12%, which helps to evenly distribute light within the greenhouse.

To ensure that the collected data can comprehensively and accurately reflect the environmental changes inside the greenhouse, the sensors were uniformly deployed inside the greenhouse, covering different areas and distributed at multiple heights and levels to obtain environmental information at different spatial locations. This layout scheme ensures the representativeness and comprehensiveness of the data, thus providing data for model training.

Twenty-four temperature and humidity sensors (S1–S24) were deployed in the solar greenhouse (Figure 2). They were distributed in three planes along the east–west direction (14 m, 40 m, and 70 m) and arranged at three heights of 1.5 m, 2.5 m, and 3.0 m in each plane, with six points at each height, specifically: S1–S6 in the eastern plane, S7–S12 in the central plane, and S13–S18 in the western plane. The remaining 6 points are arranged in higher spatial positions so that environmental data at different heights and spatial positions can be collected uniformly and reflect the environmental changes in the greenhouse comprehensively.

From 1 December 2023 to 15 March 2024, 106 days of continuous environmental data collection were conducted, covering several key parameters. Data were collected every 30 min, ensuring that the high-resolution time series adequately reflected the environmental changes. These meticulously recorded data provide a basis for in-depth analysis of the environmental characteristics of solar greenhouses in Yanggao County, as well as data support and a decision-making basis for agricultural production, climate research, and other fields.

The correlation between indoor and outdoor environmental data is an important factor in solar greenhouse environmental prediction. Indoor data are influenced by the greenhouse structure, temperature and humidity control, and other factors, while outdoor data reflect natural climatic conditions. To achieve high prediction accuracy, this study analyzes the correlation and consistency of indoor and outdoor data so that the prediction model can effectively handle the dynamic changes in indoor and outdoor temperature, humidity, and other variables. By establishing an accurate mapping relationship between indoor and outdoor data, changes in outdoor climate can be effectively reflected in the prediction of the indoor environment.

In the prediction of solar greenhouse environment, the correspondence and consistency between indoor and outdoor data are the key factors to ensure the accuracy of the prediction model. Environmental parameters inside the greenhouse, such as temperature, humidity, and light intensity, are directly affected by outdoor climatic conditions. Factors such as outdoor temperature, humidity, radiation intensity, and wind speed are transmitted to the indoor environment through the intermediary effects of the greenhouse’s building structure, ventilation system, and light transmittance. Therefore, establishing a reasonable mapping relationship between indoor and outdoor data can enable the model to better reflect the impact of external climate change on the internal environment of the greenhouse, thus improving the accuracy of the prediction.
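One simple way to examine the indoor–outdoor relationship described above is a Pearson correlation, optionally with a time lag so that the delayed indoor response to outdoor changes can be checked. The sketch below is illustrative only: the function names and the use of a lag are assumptions, not a procedure reported in this study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def lagged_pearson(outdoor, indoor, lag):
    """Correlation between outdoor values and indoor values `lag` steps later."""
    if lag == 0:
        return pearson(outdoor, indoor)
    return pearson(outdoor[:-lag], indoor[lag:])
```

With 30 min sampling, `lag=2` would compare each outdoor reading with the indoor reading one hour later.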

2.2. Data Pre-Processing

2.2.1. Missing Value Processing

In the process of data cleaning, the median method is suitable for dealing with missing values and outliers [10]. The median method fills or corrects anomalies in the data by calculating the central value (median) of the data.

For missing values, the median method first identifies the missing parts of the data, then calculates the median for each variable and uses these medians to fill in the missing data points [11]. This method keeps the data robust by avoiding the bias introduced by extreme values or skewed distributions. With the n data points sorted in ascending order, Equation (1) gives the median when n is odd, and Equation (2) gives it when n is even.

When dealing with outliers, the median method is usually combined with the interquartile range (IQR). The normal range of fluctuation of the data is determined by calculating the first quartile (Q1) and the third quartile (Q3) of the data (IQR = Q3 − Q1). Values lying more than 1.5 times the IQR below Q1 or above Q3 are considered outliers and replaced with the median to eliminate their negative impact on the analysis [12].

M = x_{\frac{n + 1}{2}}

(1)

M = \frac{x_{\frac{n}{2}} + x_{\frac{n}{2} + 1}}{2}

(2)
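The median rule of Equations (1) and (2) and the IQR-based outlier replacement can be sketched as below. This is a minimal illustration under stated assumptions: the function names are hypothetical, missing values are represented as `None`, and the quartiles come from Python's `statistics.quantiles` with its default method, which may differ slightly from the quartile convention used in the study.

```python
from statistics import median, quantiles


def median_by_rank(values):
    """Median written out as in Equations (1) and (2), using sorted 1-based ranks."""
    xs = sorted(values)
    n = len(xs)
    if n % 2 == 1:
        return xs[(n + 1) // 2 - 1]              # Eq. (1): element of rank (n + 1) / 2
    return (xs[n // 2 - 1] + xs[n // 2]) / 2     # Eq. (2): mean of ranks n/2 and n/2 + 1


def median_fill(values):
    """Replace missing entries (None) with the median of the observed entries."""
    m = median(v for v in values if v is not None)
    return [m if v is None else v for v in values]


def iqr_replace(values, k=1.5):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with the median."""
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    m = median(values)
    return [v if low <= v <= high else m for v in values]
```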

2.2.2. Z-Score

The Z-score-based standard deviation method is a common outlier detection method that is widely used in data analysis. The method identifies and rejects outliers by standardizing the data to measure the relative distance between data points and the overall mean [13].

The Z-score is a statistic [14] that describes the relative position of a data point in the overall population. Specifically, a Z-score represents the distance between a data point and the overall mean, measured in standard deviations. The formula for the Z-score is given below:

z = \frac{x - μ}{σ}

(3)

where z is the Z-score, i.e., the standardized score; x is a specific data point in the dataset; μ is the mean of the dataset; and σ is the standard deviation of the dataset.
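Equation (3) translates directly into code. The sketch below computes Z-scores with the population standard deviation and rejects points whose absolute Z-score exceeds a threshold; the threshold of 3 and the function names are assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Z-score of every point, Equation (3): (x - mean) / standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)   # constant series: no point deviates
    return [(x - mu) / sigma for x in values]


def reject_outliers(values, threshold=3.0):
    """Keep only the points whose absolute Z-score does not exceed `threshold`."""
    return [x for x, z in zip(values, z_scores(values)) if abs(z) <= threshold]
```

Note that with very few samples no point can reach a Z-score of 3, so the rule is only meaningful on reasonably long series such as the 30 min records used here.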

2.3. Greenhouse Environmental Prediction Model

2.3.1. CatBoost

CatBoost is a Gradient-Boosting Decision Tree (GBDT) algorithm [15] specifically designed to handle categorical features and regression tasks. Its name is derived from the combination of the words “Category” and “Boosting”, reflecting its unique advantages in handling complex categorical datasets. Traditional gradient-boosting algorithms (e.g., XGBoost and LightGBM) usually rely on preprocessing methods such as one-hot encoding or target encoding when dealing with categorical features, which may lead to dimensionality explosion, information loss, and wasted computational resources when the number of features is large. To solve this problem, CatBoost introduces an innovative categorical encoding technique, which can directly process the original categorical features during training without additional encoding steps, thus greatly improving the model’s adaptability to high-dimensional categorical features [16].

In addition, CatBoost incorporates several optimization strategies in its algorithm design to enhance the stability and generalization ability of the model. First, CatBoost handles missing values in the data by automatically selecting split points for them through a gradient-based greedy learning technique [17]. Second, it prevents overfitting during training by introducing a regularization mechanism. Unlike traditional GBDT models, CatBoost adopts an Ordered Boosting strategy that avoids using the training data of the current tree for split-node selection when constructing each tree, thus eliminating the prediction shift caused by target leakage during training. In addition, CatBoost supports automatic feature scaling and efficient multi-threaded parallel training, which greatly improves computational efficiency on large-scale datasets [18]. Taken together, CatBoost provides a more robust and efficient solution for gradient boosting in classification and regression problems through its optimizations in categorical feature handling, missing value handling, and model stability.
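The idea behind CatBoost's handling of categorical features can be illustrated with a simplified "ordered target statistic": each row's category is encoded using only the target values of rows that come before it, so a row never sees its own target. The sketch below is a toy version of that idea, not the library's implementation; the function name, the smoothing weight `a`, and the example category labels are assumptions.

```python
def ordered_target_stats(categories, targets, prior, a=1.0):
    """Encode each category from the targets of earlier rows only.

    encoded_i = (sum of earlier targets with the same category + a * prior)
                / (number of earlier rows with the same category + a)
    """
    sums, counts, encoded = {}, {}, []
    for cat, y in zip(categories, targets):
        s, c = sums.get(cat, 0.0), counts.get(cat, 0)
        encoded.append((s + a * prior) / (c + a))
        # Update the running statistics only after the row has been encoded.
        sums[cat] = s + y
        counts[cat] = c + 1
    return encoded
```

For example, with a hypothetical vent-state feature `["open", "closed", "open", "open"]`, temperatures `[20.0, 26.0, 22.0, 24.0]`, and a prior of 23.0, the first occurrence of each state is encoded as the prior, and later occurrences are pulled toward the mean of the earlier rows with the same state.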

2.3.2. Sand Cat Swarm Optimization Algorithm

The sand cat swarm optimization (SCSO) algorithm [19] simulates the two main foraging behaviors of the sand cat: searching for prey and attacking prey [20]. Mimicking this natural behavior makes the algorithm more efficient in the global search process and helps it avoid falling into local optima. Compared with the traditional LSTM model, a BP network optimized by SCSO converges to a satisfactory solution faster during iteration, which reduces training time and improves learning efficiency. In a d-dimensional optimization problem, a sand cat is a 1 × d array representing a solution of the problem [21], defined as follows:

X_{i} = [x_{1}, x_{2}, x_{3}, \dots, x_{d}]

(4)

where X_i is the ith sand cat; each variable value x₁, x₂… x_d is a floating point number; and each x must lie between the upper and lower bounds.

The main steps of the sand cat swarm optimization algorithm are as follows:

Initialization: according to the size of the problem (N_pop × N_d, where pop = 1, 2, …, s), an initial matrix is created for the sand cat population:

X = \begin{bmatrix} X_{1} \\ \vdots \\ X_{i} \\ \vdots \\ X_{n} \end{bmatrix} = \begin{bmatrix} x_{11} & \dots & x_{1d} \\ \vdots & \ddots & \vdots \\ x_{i1} & \dots & x_{id} \\ \vdots & \ddots & \vdots \\ x_{n1} & \dots & x_{nd} \end{bmatrix}_{n \times d}

(5)

where X is the population matrix of the sand cats; X_i is the i-th sand cat; and x_ij is the j-th coordinate of the i-th sand cat. The fitness of each sand cat is obtained by evaluating the defined fitness function, and the fitness values of the sand cat population are as follows:

F = \begin{bmatrix} F_{1} \\ \vdots \\ F_{i} \\ \vdots \\ F_{n} \end{bmatrix} = f(X) = \begin{bmatrix} f(x_{11}, x_{12}, \dots, x_{1d}) \\ \vdots \\ f(x_{i1}, x_{i2}, \dots, x_{id}) \\ \vdots \\ f(x_{n1}, x_{n2}, \dots, x_{nd}) \end{bmatrix}_{n \times 1}

(6)

where F is the fitness function of the sand cat population.

Calculate the conversion coefficient R: The sand cat population starts searching or attacking the prey stage after completing the initialization, i.e., the exploration or exploitation stage, and the main parameter for controlling the conversion between the exploration and exploitation stages is the conversion coefficient R. The formula for the calculation of the conversion coefficient R is as follows:

R = 2 r_{G} \cdot \mathrm{rand}(0, 1) - r_{G}

(7)

where rand (0,1) denotes a random number between 0 and 1; r_G is the sensitivity coefficient of the sand cat group, which is used to guide the conversion coefficient R to realize the inter-stage transfer control; and its calculation formula is as follows:

r_{G} = S_{M} - \frac{S_{M} \cdot z}{Z}

(8)

where z is the current iteration number; Z is the maximum iteration number; and S_M is the hearing coefficient of the sand cat group, which takes the value of 2.

Searching for prey: When |R| > 1, the sand cat enters the exploration phase. At this time, the i-th sand cat of the group searches for other possible best prey locations by updating its position according to Equation (9), based on the current position P_c, the best candidate position P_bc, and the sensitivity range of this cat r_i:

P_{os}(z + 1) = r_{i} [P_{bc}(z) - \mathrm{rand}(0, 1) \cdot P_{c}(z)]

(9)

where r_i denotes the sensitivity coefficient of the i-th sand cat, which is calculated as follows:

r_{i} = r_{G} \cdot \mathrm{rand}(0, 1)

(10)

Attacking prey: When the conversion coefficient |R| ≤ 1, the sand cat enters the exploitation phase. At this time, the i-th sand cat of the group generates a random position P_md based on the best candidate position P_bc and the current position P_c, which is calculated as follows:

P_{md} = |P_{bc}(z) \cdot \mathrm{rand}(0, 1) - P_{c}(z)|

(11)

Assuming that the sensitivity range of each sand cat is a circle, a random angle θ is chosen for each sand cat using the roulette wheel selection algorithm, so that each sand cat in the group can move along a different circular direction in the search space. The sand cat updates its position according to Equation (12) to attack the prey:

P_{os}(z + 1) = P_{bc}(z) - r_{i} P_{md} \cdot \cos θ

(12)

The flow of the sand cat swarm optimization algorithm is shown in Figure 3. The algorithm improves the global search ability of the neural network through the balance between the exploration and exploitation phases, multi-dimensional search in space, and dynamic adjustment of the search strategy. By optimizing network parameters such as the weights and biases, activation function parameters, and learning rate, it reduces the training error, improves the network’s ability to fit complex problems, and makes training more stable and efficient. The SCSO algorithm can also accelerate the convergence of neural network training: its built-in parallel search mechanism makes full use of computational resources, saving training time and improving training efficiency. Furthermore, SCSO helps the network better extract the key features in the data, which improves the generalization ability of the model. Finally, the SCSO-optimized BP neural network shows stronger robustness when processing noisy data, which helps the network handle outliers and improves the stability of the model.
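The update rules in Equations (7)–(12) can be assembled into a compact loop. The sketch below is a simplified reading of those equations, not the authors' implementation: the global best position stands in for the best candidate position P_bc, the roulette-wheel angle selection is replaced by a uniform random angle in [0°, 360°], positions are clipped to the bounds, and all names and default values are assumptions.

```python
import math
import random


def scso(fitness, dim, lower, upper, n_cats=20, max_iter=100, s_m=2.0, seed=0):
    """Minimise `fitness` over [lower, upper]^dim with a basic SCSO loop."""
    rng = random.Random(seed)
    cats = [[rng.uniform(lower, upper) for _ in range(dim)] for _ in range(n_cats)]
    best = min(cats, key=fitness)[:]
    best_f = fitness(best)
    history = [best_f]
    for z in range(max_iter):
        r_g = s_m - s_m * z / max_iter                     # Eq. (8)
        for i in range(n_cats):
            big_r = 2 * r_g * rng.random() - r_g           # Eq. (7)
            r_i = r_g * rng.random()                       # Eq. (10)
            new = []
            for j in range(dim):
                if abs(big_r) > 1:                         # exploration, Eq. (9)
                    p = r_i * (best[j] - rng.random() * cats[i][j])
                else:                                      # exploitation, Eqs. (11)-(12)
                    theta = math.radians(rng.uniform(0.0, 360.0))
                    p_md = abs(best[j] * rng.random() - cats[i][j])
                    p = best[j] - r_i * p_md * math.cos(theta)
                new.append(min(max(p, lower), upper))      # keep inside the bounds
            cats[i] = new
            f = fitness(new)
            if f < best_f:                                 # remember the best position so far
                best, best_f = new[:], f
        history.append(best_f)
    return best, best_f, history
```

A call such as `scso(lambda x: sum(v * v for v in x), dim=3, lower=-10, upper=10)` minimises a simple sphere function. In a hyperparameter-tuning setting, `fitness` would instead return a validation error for the model trained with the hyperparameters encoded in the position vector.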

2.4. Multi-Strategy Improvement

2.4.1. Improvement Point I: Sobol Sequence Population Initialization Function

In the standard SCSO algorithm, the population initialization usually adopts random generation, i.e., the initial population is generated by randomly distributing the positions of individuals in the search space [22]. However, although this random initialization strategy is simple to implement, it often leads to the lack of global exploration ability of the algorithm in the early search phase due to the uneven distribution of the population in the space, thus affecting the overall search effect and convergence speed. Therefore, population initialization, as a key step in the optimization algorithm, has an important impact on the global search ability and convergence performance of the algorithm.

To address the shortcomings of the traditional random initialization strategy in terms of search space coverage and distribution uniformity, this paper introduces a population initialization method based on Sobol sequences, a kind of low-discrepancy (LD) sequence capable of generating a uniformly distributed set of sample points in a multidimensional space [23]. Compared with the traditional random initialization strategy, Sobol sequences have the following significant advantages. First, the point sets generated by Sobol sequences cover the search space more uniformly, which significantly improves the initial population’s ability to explore the global search space and increases the probability of finding the global optimal solution. Second, the low-discrepancy property of Sobol sequences avoids overlapping points and redundant sampling, which makes the distribution of sample points more dispersed and thus improves search efficiency. Finally, since the initial population is uniformly distributed in the search space, the algorithm can converge to the optimal solution faster, which significantly improves the overall convergence speed.

This study adopted 5-fold cross-validation for model training and testing. The dataset was randomly divided into 5 subsets, with 4 used as training sets and 1 as the validation set in turn. The average performance indicators were obtained after 5 repetitions to enhance model robustness and avoid overfitting.

In summary, the introduction of the Sobol sequence into the population initialization process helps to improve the global exploration ability and convergence speed of the SCSO algorithm, and lays a better initial condition for the subsequent search process. The application of this strategy effectively makes up for the shortcomings of the traditional stochastic initialization method, so that the algorithm can more stably converge to the global optimal solution in high-dimensional optimization problems.
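To make the low-discrepancy idea concrete, the sketch below generates the first points of an unscrambled two-dimensional Sobol sequence in pure Python (Gray-code construction) and maps them onto parameter bounds. It is limited to two dimensions for brevity; the function names and the example bounds (a learning rate and a tree depth) are hypothetical and not taken from the study.

```python
def sobol_2d(n_points, bits=30):
    """First `n_points` of the unscrambled 2-D Sobol sequence in [0, 1)^2."""
    # Direction numbers: dimension 1 uses m_k = 1; dimension 2 uses
    # m_k = 2*m_{k-1} XOR m_{k-1} with m_1 = 1 (primitive polynomial x + 1).
    m2 = [1]
    for _ in range(1, bits):
        m2.append((2 * m2[-1]) ^ m2[-1])
    v1 = [1 << (bits - k) for k in range(1, bits + 1)]
    v2 = [m2[k - 1] << (bits - k) for k in range(1, bits + 1)]
    scale = float(1 << bits)
    x1 = x2 = 0
    points = []
    for n in range(n_points):
        points.append((x1 / scale, x2 / scale))
        c = 0                       # position of the lowest zero bit of n
        while (n >> c) & 1:
            c += 1
        x1 ^= v1[c]
        x2 ^= v2[c]
    return points


def init_population(n_cats, bounds):
    """Map 2-D Sobol points onto the given ((low, high), (low, high)) bounds."""
    (lo1, hi1), (lo2, hi2) = bounds
    return [
        (lo1 + u * (hi1 - lo1), lo2 + v * (hi2 - lo2))
        for u, v in sobol_2d(n_cats)
    ]
```

For instance, `init_population(8, [(0.01, 0.3), (4, 10)])` spreads eight candidates evenly over the two ranges. In practice a library generator such as SciPy's `scipy.stats.qmc.Sobol` would normally be used, since it supports many dimensions and scrambling.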

2.4.2. Improvement Point II: Adaptive T-Distribution Perturbation Strategy

During the exploration and exploitation phases of the SCSO algorithm, the sand cat population may produce some inferior solutions that cannot be effectively utilized after completing the position update. This is because the algorithm has no built-in population mutation mechanism in its original design, so it easily becomes trapped in local optima in the later search process and lacks sufficient jumping ability and global exploration ability. To solve this problem, an adaptive t-distribution perturbation strategy is introduced [24]. The t-distribution is a probability distribution commonly used in statistics. It is heavy-tailed, i.e., it can better capture extreme values in the case of small sample sizes and thus increases the ability of solutions to explore the boundary region of the search space. Applying the t-distribution to the optimization algorithm can effectively improve the diversity of the population in the search space and prevent the algorithm from converging prematurely to a local optimum. Therefore, the introduction of adaptive t-distribution perturbation can significantly enhance the global search ability of the algorithm and provide theoretical and technical support for jumping out of local optima [25].

The probability density function of the t-distribution is symmetric about zero: it peaks at zero, and the density decreases as the value of the random variable moves away from zero [26]. This property allows the t-distribution to control the magnitude of the fluctuations of the solution during the perturbation process, and in particular to maintain large jumps in the early stage of the global search, thus increasing the probability of finding a better solution in a wide space. Specifically, the probability density function of the t-distribution is as follows:

f(x, \nu) = \frac{\Gamma\left(\frac{\nu + 1}{2}\right)}{\sqrt{\nu \pi}\, \Gamma\left(\frac{\nu}{2}\right)} \left(1 + \frac{x^{2}}{\nu}\right)^{-\frac{\nu + 1}{2}}

(13)

where x is a random variable; ν is the degree of freedom; Γ is the Gamma function. The t-distribution’s degree of freedom parameter ν affects the perturbation amplitude. When ν is large, the t-distribution tends to be close to the standard normal distribution, which is suitable for localized search, while when ν is small, the tail of the distribution is heavier, which helps to make a large range of jumps in the stage of the global search, thus increasing the diversity of the population.
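The effect of the degree of freedom ν can be illustrated with a short sketch. The density function follows Equation (13) directly. The perturbation rule `x + x * t(ν)` with ν tied to the iteration counter is an assumption borrowed from common t-distribution mutation schemes, since the text does not state the exact update; the function names are likewise hypothetical.

```python
import math
import random


def t_pdf(x, nu):
    """Density of the t-distribution with nu degrees of freedom (Equation (13))."""
    coeff = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return coeff * (1 + x * x / nu) ** (-(nu + 1) / 2)


def t_sample(nu, rng):
    """Draw one t-distributed value as Z / sqrt(V / nu), Z ~ N(0, 1), V ~ chi-square(nu)."""
    z = rng.gauss(0.0, 1.0)
    v = rng.gammavariate(nu / 2, 2.0)  # chi-square(nu) is Gamma(shape=nu/2, scale=2)
    return z / math.sqrt(v / nu)


def t_perturb(position, iteration, rng):
    """Assumed update rule: x' = x + x * t(nu), with nu set to the iteration count."""
    nu = max(1, iteration)
    return [x + x * t_sample(nu, rng) for x in position]


rng = random.Random(42)
early = t_perturb([1.0, -2.0, 0.5], iteration=1, rng=rng)    # heavy tails: large jumps are likely
late = t_perturb([1.0, -2.0, 0.5], iteration=200, rng=rng)   # close to a Gaussian perturbation

# Tail comparison at x = 4: the density for nu = 1 is far larger than for nu = 100.
print(t_pdf(4.0, 1), t_pdf(4.0, 100))
```

With ν = 1 the distribution coincides with the Cauchy distribution, and for large ν it approaches the standard normal, which is the transition from global to local search described above.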

2.4.3. Improvement Point III: Adaptive Gauss–Cauchy Mixed Variation Strategy

The adaptive Gaussian–Cauchy hybrid variation strategy addresses the tendency of swarm intelligence optimization algorithms to fall into local optima in the middle and later stages of the search. By combining two different perturbations, one drawn from the Gaussian distribution and one from the Cauchy distribution, the strategy introduces a more diverse and robust search mechanism into the evolution of the population, improving both the local search accuracy of the algorithm and its ability to jump out of local optima [27]. The Gaussian distribution is mainly used for local search: its smaller fluctuation range and better convergence help the algorithm fine-tune the solution precisely in later iterations. The heavy-tailed Cauchy distribution allows the solution to explore a larger region in the global search phase, thus enhancing the global search capability. This hybrid perturbation lets the algorithm balance global and local search flexibly, improving the overall optimization performance [28].

In practice, the adaptive Gaussian–Cauchy hybrid variation strategy dynamically adjusts the ratio of Gaussian to Cauchy perturbations through two control parameters, u1 and u2. The algorithm initially favors the global search of the Cauchy distribution to ensure a high diversity of the population, and it gradually shifts to the local search of the Gaussian distribution in the later iterations to improve the accuracy of the solution and the convergence speed. The specific formula for the mixed variation strategy is as follows:

x_{new} = \begin{cases} x_{current} + u_{1} \cdot N(0, \sigma^{2}), & \text{if } rand() < p \\ x_{current} + u_{2} \cdot C(0, \gamma), & \text{otherwise} \end{cases}

(14)

where x_{new} is the solution after mutation; x_{current} is the current solution; N(0, σ²) denotes the Gaussian distribution with mean 0 and variance σ²; C(0, γ) denotes the Cauchy distribution with location 0 and scale γ; rand() is a random number drawn uniformly from [0, 1]; and p is the probability of applying the Gaussian perturbation. By adaptively adjusting u1 and u2, the hybrid variation strategy achieves a dynamic balance between global and local search, effectively improving the robustness of the algorithm and its ability to jump out of local optima.
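A minimal sketch of Equation (14) is given below. The text does not specify how u1 and u2 are adapted, so the linear schedule in `mutation_weights` (u1 growing and u2 shrinking with the iteration count) is an assumption chosen only to reproduce the described shift from Cauchy-dominated to Gaussian-dominated search. Drawing a single rand() per solution, and the default values of p, σ, and γ, are also assumptions.

```python
import math
import random


def cauchy_sample(gamma, rng):
    """Draw from Cauchy(0, gamma) by inverting its cumulative distribution function."""
    return gamma * math.tan(math.pi * (rng.random() - 0.5))


def mutation_weights(t, t_max):
    """Assumed schedule: u1 grows and u2 shrinks linearly with iteration t."""
    u1 = t / t_max
    u2 = 1.0 - u1
    return u1, u2


def gauss_cauchy_mutate(x_current, t, t_max, p=0.5, sigma=1.0, gamma=1.0, rng=None):
    """Apply Equation (14) to one solution, given as a list of coordinates."""
    rng = rng or random.Random()
    u1, u2 = mutation_weights(t, t_max)
    if rng.random() < p:
        # Gaussian branch: small steps for local fine-tuning.
        return [x + u1 * rng.gauss(0.0, sigma) for x in x_current]
    # Cauchy branch: heavy-tailed steps for long-range jumps.
    return [x + u2 * cauchy_sample(gamma, rng) for x in x_current]


rng = random.Random(7)
candidate = [0.3, 120.0, 6.0]  # hypothetical solution vector, for illustration only
mutated_early = gauss_cauchy_mutate(candidate, t=5, t_max=100, rng=rng)
mutated_late = gauss_cauchy_mutate(candidate, t=95, t_max=100, rng=rng)
```

Under this schedule the Cauchy term carries most of the weight early in the run and the Gaussian term carries most of it near the end; other schedules would fit the description equally well.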

The adaptive t-distribution perturbation and the Gauss–Cauchy hybrid variation strategy complement each other in the SCSO algorithm and adjust the distribution of the solutions at different stages. The former ensures solution diversity and exploration ability in the global search, while the latter provides precise fine-tuning in the local search, resulting in greater stability and better convergence on complex optimization problems. With these two strategies, the SCSO algorithm balances global and local search more efficiently and provides better solutions for a variety of optimization problems.

2.5. Model Evaluation

To evaluate the prediction ability and accuracy of the model, the coefficient of determination (R²), the mean squared error (MSE), and the mean absolute error (MAE) are used. A larger R² and smaller MSE and MAE indicate more accurate predictions. The metrics are calculated as follows, where y_i is the measured value, ỹ_i is the predicted value, ȳ is the mean of the measured values, and N is the number of samples:

\mathrm{MAE} = \frac{1}{N} \sum_{i = 1}^{N} \left| y_{i} - \tilde{y}_{i} \right|

(15)

\mathrm{MSE} = \frac{1}{N} \sum_{i = 1}^{N} \left( y_{i} - \tilde{y}_{i} \right)^{2}

(16)

R^{2} = 1 - \frac{\sum_{i = 1}^{N} \left( y_{i} - \tilde{y}_{i} \right)^{2}}{\sum_{i = 1}^{N} \left( y_{i} - \bar{y} \right)^{2}}

(17)
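Equations (15)–(17) translate directly into code. The sketch below uses plain Python lists; the function names and the sample values are illustrative assumptions, not data from this study. An `rmse` helper is included because Table 2 reports RMSE, which is taken here to be the square root of the MSE.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error, Equation (15)."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error, Equation (16)."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error, the square root of Equation (16)."""
    return math.sqrt(mse(y_true, y_pred))


def r2(y_true, y_pred):
    """Coefficient of determination, Equation (17)."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up measured and predicted values, for illustration only.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]
print(mae(measured, predicted), mse(measured, predicted), r2(measured, predicted))
```

Note that R² is undefined when all measured values are identical, because the denominator of Equation (17) is then zero.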

3. Results and Discussion

3.1. Different Model Prediction Results and Comparative Analysis

The CatBoost hyperparameters were set according to the MSCSO optimization results to make the final prediction of the solar greenhouse environment. To verify the performance of the proposed model, the CNN, LSTM, JAYA–CatBoost, and AOA–CatBoost models were also used to predict the solar greenhouse environment. The results of each prediction model are shown in Figure 4.

Figure 4 (Carbon dioxide concentration) compares the performance of each model in the prediction of carbon dioxide concentration. The LSTM model has a large initial error: its prediction of 535.69 ppm at the 1st moment deviates from the true value of 624 ppm by 88.31 ppm, and although the error decreases over time, a bias remains. The CNN model has better overall stability, but its early error is significant; its prediction of 735.38 ppm at the 1st moment exceeds the true value of 624 ppm by about 111 ppm. The JAYA–CatBoost model performs stably, predicting 683.88 ppm at the 3rd moment against a true value of 692 ppm, an error of only 8.12 ppm. The AOA–CatBoost model has high prediction accuracy most of the time; its prediction of 426.63 ppm at the 17th moment is close to the true value of 403 ppm. The MSCSO–CatBoost model has the best overall performance, with a difference of only 5.88 ppm between the predicted 745.88 ppm and the true value of 740 ppm at the 4th moment, and its prediction of 579.89 ppm at the 30th moment is close to the true value of 586 ppm, which demonstrates excellent prediction accuracy and stability.

Figure 4 (Light intensity) shows significant differences in the performance of the models. The LSTM model has the worst performance, and its predictions generally deviate from the real values; for example, it predicts 310 lux at the 1st moment against a real value of 6591 lux. The CNN model is relatively stable, but a significant deviation remains; for example, it predicts 40,163 lux at the 15th moment against a real value of 49,556 lux. The JAYA–CatBoost model improves on this accuracy but still shows errors during complex changes; for example, it predicts 19,306 lux at the 21st moment against a real value of 12,266 lux. The AOA–CatBoost model performs better and predicts the light intensity accurately in smooth periods; for example, it predicts 4435 lux at the 6th moment against a real value of 4649 lux. The MSCSO–CatBoost model has the best overall performance, and its predictions are the closest to the real values; for example, its prediction of 48,440 lux at the 30th moment against a real value of 48,369 lux shows the best prediction accuracy and stability.

Figure 4 (Humidity) shows that the LSTM model performs worst among the humidity prediction models, with significant errors such as a prediction of 31% at the 1st moment against a true value of 62%. The CNN model is not stable enough; its prediction of 57% at the 21st moment deviates strongly from the true value of 27%. The JAYA–CatBoost model is stable overall, although its prediction of 42% at the 2nd moment still differs from the true value of 65%. The AOA–CatBoost model performs well in smooth periods, for example predicting 80% at the 5th moment against a true value of 72%. The MSCSO–CatBoost model has the best overall performance, and its predictions are the closest to the true values, such as 60% at the 18th moment against a true value of 59.2%, which shows the best prediction accuracy and stability.

Figure 4 (Temperature) shows significant differences in the performance of the temperature prediction models. The LSTM model has a large prediction error and generally overestimates in the stable temperature range; for example, 16.3 °C was predicted as 19.8 °C and 15.9 °C as 19.1 °C. The CNN model is fairly stable but responds slowly and adapts poorly to abrupt temperature changes; for example, 15.6 °C was predicted as 4.5 °C and 15.0 °C as 14.9 °C. The JAYA–CatBoost model performed well in the stable interval, with 13.9 °C predicted as 12.1 °C and 27.5 °C predicted as 23.8 °C, but fluctuated during complex changes, with 13.9 °C predicted as 22.1 °C. The AOA–CatBoost model had better predictions for the smooth interval, with 14.9 °C predicted as 19.7 °C and 15.5 °C predicted as 18.0 °C, but its accuracy decreased under drastic changes, where 15.6 °C was predicted as 15.8 °C and 23.6 °C as 23.3 °C. The MSCSO–CatBoost model was the best overall, with 14.2 °C predicted as 14.0 °C and 24.4 °C predicted as 20.7 °C, and it maintained the smallest bias and the best tracking ability under all types of temperature change.

Another advantage of the MSCSO–CatBoost model lies in its ability to adapt to complex factor coupling effects. In a solar greenhouse environment, factors such as light intensity, temperature, humidity, and carbon dioxide concentration typically do not change independently but interact through complex coupling relationships, influencing one another. Traditional models, due to structural and training strategy limitations, struggle to capture such multivariate dynamics and often exhibit lag or errors during prediction. The MSCSO–CatBoost model, however, introduces multi-strategy optimization to dynamically adjust globally, ensuring the smoothness and accuracy of prediction values while maintaining high consistency with actual values even within complex fluctuating ranges.

In summary, the MSCSO–CatBoost model has significant advantages in the prediction of multiple environmental factors in solar greenhouses. It can maintain a high degree of fit under the complex fluctuations of different factors and effectively deal with the coupling relationship between variables, which is better than the traditional LSTM, CNN, and other optimization algorithm models (JAYA–CatBoost and AOA–CatBoost). In the future, its prediction accuracy in more complex environments can be improved by further optimizing the model structure, increasing the types of environmental variables, and model fusion strategy so as to provide a more reliable prediction means for the environmental control of solar greenhouses.

3.2. Convergence Analysis of the MSCSO–CatBoost Model

Figure 5 shows the convergence speed curves of the MSCSO–CatBoost model in the prediction of four environmental factors: temperature, humidity, carbon dioxide concentration, and light intensity. These curves clearly reflect the optimization characteristics of the model under different environmental variables.

The convergence characteristics of the MSCSO–CatBoost model in the prediction of greenhouse environmental parameters are analyzed through Figure 5, which reveals the optimization characteristics of the algorithm for different environmental parameters. The convergence curve of temperature prediction in Figure 5a shows a rapid decrease in the initial error, indicating that the model can effectively capture the change rule of temperature; Figure 5b, for light intensity prediction, shows a progressive optimization trend, reflecting the algorithm’s continuous learning ability for periodic features. Figure 5c, for humidity prediction, reaches a stable state in a relatively short period of time, and the error is maintained at a low level, while Figure 5d, for carbon dioxide concentration prediction, shows late fluctuation characteristics, which may be related to the dynamic influence of the greenhouse ventilation system. The results show that all parameter prediction tasks converge within a limited number of iterations, verifying the high efficiency and robustness of the MSCSO–CatBoost algorithm in multi-parameter environmental prediction. The differences in the convergence characteristics of different parameters provide an important basis for model optimization, and it is suggested that subsequent studies can design differentiated optimization strategies for the parameter characteristics.

The MSCSO–CatBoost model performs well in the convergence speed curves of different environmental factors, and the fast convergence characteristics reflect the model’s ability to learn the data features effectively. The model provides accurate and reliable prediction results for the environmental control of solar greenhouses, which further supports the intelligent management and decision-making of greenhouses. Future research can explore the application of the model under more complex environmental factors to enhance the efficiency and sustainability of agricultural production.

3.3. Statistical Significance Testing

To verify the statistical significance of performance differences between models, the Friedman test and the Nemenyi post hoc test were conducted based on the MAE values of five models (LSTM, CNN, JAYA–CatBoost, AOA–CatBoost, and MSCSO–CatBoost) across four environmental factor prediction tasks. The Friedman test indicated significant differences in model performance (p < 0.01). Results of the Nemenyi test showed that MSCSO–CatBoost had statistically significant differences (p < 0.05) from LSTM and CNN, while differences from JAYA–CatBoost and AOA–CatBoost were not statistically significant (p > 0.05), but MSCSO–CatBoost maintained numerical advantages.
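As a rough cross-check, the two tests can be recomputed from the MAE values in Table 3 alone. The sketch below is not the authors' analysis script: it assumes that the tests were run on exactly these 4 × 5 MAE values, uses the chi-square approximation of the Friedman statistic (which is coarse for only four tasks), and takes 2.728 as the tabulated Nemenyi critical value for five models at α = 0.05.

```python
import math

MODELS = ["LSTM", "CNN", "JAYA-CatBoost", "AOA-CatBoost", "MSCSO-CatBoost"]
# MAE per task, copied from Table 3 (rows: temperature, humidity, CO2, light intensity).
MAE = [
    [2.23, 2.36, 1.83, 1.92, 1.50],
    [7.35, 6.98, 5.92, 5.57, 5.07],
    [18.32, 16.75, 13.24, 12.35, 11.57],
    [36.72, 32.45, 27.89, 25.67, 24.36],
]


def rank_row(row):
    """Rank values within one task (1 = lowest MAE); assumes there are no ties."""
    order = sorted(range(len(row)), key=lambda j: row[j])
    ranks = [0] * len(row)
    for rank, j in enumerate(order, start=1):
        ranks[j] = rank
    return ranks


n, k = len(MAE), len(MODELS)
ranks = [rank_row(row) for row in MAE]
avg_rank = [sum(r[j] for r in ranks) / n for j in range(k)]

# Friedman chi-square statistic with k - 1 degrees of freedom.
chi2 = 12 * n / (k * (k + 1)) * (sum(r * r for r in avg_rank) - k * (k + 1) ** 2 / 4)
# Closed-form chi-square survival function; this expression holds for 4 degrees of freedom only.
p_value = math.exp(-chi2 / 2) * (1 + chi2 / 2)

# Nemenyi critical difference between average ranks.
cd = 2.728 * math.sqrt(k * (k + 1) / (6 * n))
best = MODELS.index("MSCSO-CatBoost")
significant = {
    name: abs(avg_rank[j] - avg_rank[best]) > cd
    for j, name in enumerate(MODELS)
    if j != best
}
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}, CD = {cd:.2f}")
print(significant)
```

Under these assumptions the statistic is 14.8 with p ≈ 0.005, and the rank differences to LSTM and CNN exceed the critical difference while those to JAYA–CatBoost and AOA–CatBoost do not, which agrees with the pattern reported above.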

All baseline models underwent hyperparameter optimization using the same strategy as MSCSO–CatBoost (grid search with 5-fold cross-validation). Specific parameter ranges and optimal values are as follows: LSTM had ranges of hidden units (32–128), learning rate (0.001–0.01), and epochs (100–300), with optimal values of 64 units, 0.005 learning rate, and 200 epochs; CNN had ranges of kernel size (3 × 3–7 × 7), number of convolutional layers (2–4), and dropout rate (0.2–0.5), with optimal values of 5 × 5 kernel, 3 layers, and 0.3 dropout; JAYA–CatBoost had ranges of number of trees (50–500), learning rate (0.01–0.3), and depth (3–10), with optimal values of 300 trees, 0.1 learning rate, and depth 6; AOA–CatBoost had the same parameter ranges as JAYA–CatBoost, with optimal values of 250 trees, 0.08 learning rate, and depth 5. Hyperparameter optimization for all baselines was performed under the same computing environment (Intel i7–12700K, 32 GB RAM) to ensure fairness.

3.4. Ablation Experiment Analysis

To verify the individual contributions of the three improvement strategies (Sobol sequence initialization, adaptive t-distribution perturbation, and Gaussian–Cauchy mixed variation), comparative experiments were designed with five configurations. The baseline model is the original SCSO–CatBoost without any improvements. Variant 1 removes only Sobol sequence initialization while retaining the other two strategies; Variant 2 removes only adaptive t-distribution perturbation; Variant 3 removes only Gaussian–Cauchy mixed variation; and the full model is MSCSO–CatBoost with all three strategies adopted. Taking temperature prediction as an example, key metrics are compared in Table 2. Relative to the full model, removing Sobol sequence initialization increased MAE by 18.7%, from 1.50 °C to 1.78 °C, indicating its critical role in improving initial population uniformity and accelerating convergence. Removing adaptive t-distribution perturbation increased MAE by 24.0% to 1.86 °C, demonstrating its significance in enhancing global search capability and avoiding local optima. Removing Gaussian–Cauchy mixed variation increased MAE by 12.0% to 1.68 °C, validating its effect on local search accuracy. The full model achieved the best performance, indicating complementary interactions among the three components in balancing exploration and exploitation.
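The percentage increases quoted above follow from the MAE column of Table 2. The short check below assumes they are computed relative to the full model's MAE of 1.50 °C; the variable and function names are illustrative.

```python
# MAE values (°C) copied from Table 2.
FULL_MODEL_MAE = 1.50
ABLATION_MAE = {
    "Baseline": 2.15,
    "Variant 1 (w/o Sobol)": 1.78,
    "Variant 2 (w/o t-perturbation)": 1.86,
    "Variant 3 (w/o Gaussian-Cauchy)": 1.68,
}


def percent_increase(value, reference):
    """Relative increase of value over reference, in percent."""
    return (value - reference) / reference * 100.0


increase = {
    name: round(percent_increase(value, FULL_MODEL_MAE), 1)
    for name, value in ABLATION_MAE.items()
}
for name, pct in increase.items():
    print(f"{name}: +{pct}% MAE relative to the full model")
```

The same calculation shows that the unmodified baseline has an MAE about 43.3% above that of the full model.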

3.5. Comparison of Model Performance

In this study, the performance of five prediction models in environmental factor prediction was systematically compared. The results show that the MSCSO–CatBoost model significantly outperforms the other models in several evaluation indexes, demonstrating excellent performance and robustness.

Table 3 presents a comparative analysis of model performance in multi-environmental factor prediction tasks, revealing that the MSCSO–CatBoost model holds advantages over the LSTM, CNN, JAYA–CatBoost, and AOA–CatBoost models. In temperature prediction, compared to these four models, MSCSO–CatBoost reduces the mean absolute error (MAE) by 32.7%, 36.4%, 18.0%, and 21.9%, respectively, and the mean squared error (MSE) by 38.8%, 37.6%, 17.0%, and 3.1%, respectively, while maintaining a high coefficient of determination (R²) of 0.94. In humidity prediction, MAE is reduced by 31.0%, 27.3%, 14.4%, and 9.0% and MSE by 31.6%, 26.3%, 11.3%, and 4.7% compared to the other models, with an R² of 0.93. For CO₂ concentration prediction, MSCSO–CatBoost achieves a 36.8%, 30.9%, 12.6%, and 6.3% reduction in MAE and a 44.3%, 35.3%, 18.3%, and 9.5% decrease in MSE relative to the comparison models, with an R² of 0.91. In light intensity prediction, MAE is reduced by 33.7%, 24.9%, 12.7%, and 5.1%; MSE is reduced by 32.3%, 41.9%, 14.8%, and 5.1%; and R² reaches 0.93. Overall, the MSCSO–CatBoost model achieves the lowest MAE and MSE across all four environmental factor prediction tasks, supporting its applicability to complex multidimensional environmental data.
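The percentages in this paragraph can be read as relative reductions with respect to each comparison model. The definition is not stated explicitly in the text, so the formula below is an assumption; it does, however, reproduce the reported figure when applied to the temperature MAE values of LSTM (2.23) and MSCSO–CatBoost (1.50) in Table 3:

```latex
\Delta_{\mathrm{MAE}}
  = \frac{\mathrm{MAE}_{\mathrm{baseline}} - \mathrm{MAE}_{\mathrm{MSCSO}}}{\mathrm{MAE}_{\mathrm{baseline}}} \times 100\%
  = \frac{2.23 - 1.50}{2.23} \times 100\% \approx 32.7\%
```

The same expression with MSE in place of MAE gives (1.52 − 0.93)/1.52 ≈ 38.8% for the temperature MSE relative to LSTM.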

In summary, the MSCSO–CatBoost model significantly outperforms other models in environmental factor prediction. Its performance on key metrics such as MAE, MSE, and R² indicates that the model has higher accuracy and stability. Future research can be based on these findings to further optimize the MSCSO–CatBoost model and explore the introduction of more environmental variables to improve its effectiveness in practical applications. These research results provide important support for the intelligent management of solar greenhouses and lay the foundation for the future development of intelligent agriculture.

4. Discussion

This study combines the optimized Multi-Strategy Sand Cat Optimizer (MSCSO) algorithm with the CatBoost model to propose a precise prediction method for solar greenhouse environments. Temperature, as a core parameter influencing greenhouse crop growth, plays a crucial role in regulating production environments, enhancing crop yields, and improving quality.

Through systematic comparison, it is evident that the MSCSO–CatBoost model outperforms the other models across multiple evaluation metrics. In temperature prediction, relative to the comparison models, its mean absolute error (MAE) was reduced by 32.8%, 36.5%, 18.2%, and 21.9%, respectively; the mean squared error (MSE) was reduced by 53.0%, 36.4%, 33.7%, and 28.7%, respectively; and the coefficient of determination (R²) was improved by 12.3%, 5.3%, 1.3%, and 0.8%, respectively, enhancing the model’s ability to explain data variability. This result addresses the shortcomings of traditional machine learning models in multi-factor coupled prediction, such as insufficient accuracy and the limited adaptability of single optimization algorithms.

Compared with other related research findings [29], the model developed by Yang et al., which combines an improved Harris Hawks optimization algorithm with CatBoost (IHHO–CatBoost) [30], reduced the MAE in temperature prediction by 49.8% compared to LSTM, but it lacks comprehensiveness in the integrated prediction of multiple environmental factors. In contrast, the MSCSO–CatBoost model developed in this study not only performs well in temperature prediction but also demonstrates high accuracy and reliability in predicting multiple environmental factors such as humidity, carbon dioxide concentration, and light intensity. For example, in humidity prediction, MAE is reduced by 15–20% compared to some traditional machine learning-based models.

In studies related to radiation prediction, some hybrid models combining convolutional neural networks with CatBoost (CNN–CatBoost) [31] aim to improve the predictive performance of solar radiation. However, these models lack adaptability in multi-parameter predictions under complex greenhouse environments. The model in this study achieves an R² of over 0.85 in light intensity prediction under complex greenhouse environments through unique multi-strategy optimization, outperforming some similar models.

5. Conclusions

In this study, an accurate prediction method for the solar greenhouse environment is proposed by combining the multi-strategy improved sand cat swarm optimization (MSCSO) algorithm with the CatBoost model. It should be noted that the data in this study are derived from observations in a tomato greenhouse in Yanggao County, Shanxi Province, from December 2023 to March 2024, and the applicability of the model may be affected by regional climate, greenhouse type, and crop species. Future research will further explore the practical application of the model in facility agriculture, verify it across multiple regions, seasons, and crops, and optimize it with transfer learning, in order to promote the sustainable development and digital transformation of agriculture, enhance greenhouse production efficiency, improve crop quality, and provide strong support for the development of modern agriculture.

Author Contributions

Conceptualization, X.C. and Y.C.; methodology, Z.Z.; software, J.M.; validation, Y.C., Z.Z. and X.C.; formal analysis, J.M.; investigation, X.C.; resources, Y.C.; data curation, J.M.; writing—original draft preparation, Y.C.; writing—review and editing, J.M.; visualization, X.C.; supervision, X.C. and Z.Z.; project administration, W.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The article contains original data from this study, which should be available upon reasonable request by contacting the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Nie, Y.; Bu, H.; Chen, L.; Zhong, Y.; Wu, S. Intelligentization of Facility Vegetables under the Background of Agricultural Modernization: Challenges and Strategy Exploration. China Veg. 2024, 1, 1–7.
2. Pu, G.; Liu, S.; Liu, L.; Ren, L. Effects of Different Light Qualities on Growth and Physiological Characteristics of Tomato Seedlings. Acta Hortic. Sin. 2005, 32, 420–425.
3. Chen, J.; Yu, R.; Yang, M.; Che, W.; Ning, Y.; Zhan, Y. SN-YOLO: A Rotation Detection Method for Tomato Harvest in Greenhouses. Electronics 2025, 14, 3243.
4. Chen, L.; Wang, S.; Zhao, H.; Xu, W. Impact of Light Intensity and CO₂ Concentration on the Microclimate and Yield of Crops in a Daylight Greenhouse. Agriculture 2023, 13, 1352.
5. Zhang, H.; Li, X.; Wu, P.; Yan, J. Development of a Predictive Model for Environmental Parameters in Solar Greenhouses Using Artificial Neural Networks. Energies 2022, 15, 6390.
6. Seber, G.A.F.; Lee, A.J. Linear Regression Analysis, 2nd ed.; Wiley-Interscience: Hoboken, NJ, USA, 2003.
7. Zhao, Z.; Chen, W.; Wu, X.; Chen, P.; Liu, J.; Xu, J. LSTM Network: A Deep Learning Approach for Short-Term Traffic Forecast. IET Intell. Transp. Syst. 2017, 11, 68–75.
8. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780.
9. Chang, C.-C.; Lin, C.-J. LIBSVM: A Library for Support Vector Machines. ACM Trans. Intell. Syst. Technol. 2011, 2, 27.
10. Chandra, W.; Suprihatin, B.; Resti, Y. Median-KNN Regressor-SMOTE-Tomek Links for Handling Missing and Imbalanced Data in Air Quality Prediction. Symmetry 2023, 15, 887.
11. Eaton, M.; Harbick, K.; Shelford, T.; Mattson, N. Modeling Natural Light Availability in Skyscraper Farms. Agronomy 2021, 11, 1684.
12. Winoto, C.; Suprihatin, B.; Resti, Y. Strategies for Imputing Missing Values and Removing Outliers in the Dataset for Machine Learning-Based Construction Cost Prediction. Buildings 2024, 14, 933.
13. Yaro, A.S.; Maly, F.; Prazak, P. Outlier Detection in Time-Series Receive Signal Strength Observation Using Z-Score Method with Sn Scale Estimator for Indoor Localization. Appl. Sci. 2023, 13, 3900.
14. Bagdonavičius, V.; Petkevičius, L. Multiple Outlier Detection Tests for Parametric Models. Mathematics 2020, 8, 2156.
15. Wu, Y.; Bai, Y.; Zhang, W. ReMAHA–CatBoost: Addressing Imbalanced Data in Traffic Accident Prediction Tasks. Appl. Sci. 2023, 13, 13123.
16. Jiang, F.; Peng, J. A Semi-Supervised Tri-CatBoost Method for Driving Style Recognition. Symmetry 2020, 12, 336.
17. Zhang, W.; Wu, Y.; Bai, Y. Achieving Personalized Precision Education Using the CatBoost Model during the COVID-19 Lockdown Period in Pakistan. Sustainability 2023, 15, 2714.
18. Bai, I.; Ji, U. Outlier Detection and Smoothing Process for Water Level Data Measured by Ultrasonic Sensor in Stream Flows. Water 2019, 11, 951.
19. Wu, D.; Rao, H.; Wen, C.; Jia, H.; Abualigah, L. Modified Sand Cat Swarm Optimization Algorithm for Solving Constrained Engineering Optimization Problems. Mathematics 2022, 10, 4350.
20. Peng, H.; Zhang, X.; Li, Y.; Qi, J.; Kan, Z.; Meng, H. A Modified Sand Cat Swarm Optimization Algorithm Based on Multi-Strategy Fusion and Its Application in Engineering Problems. Mathematics 2024, 12, 2153.
21. Wu, D.; Seyyedabbasi, A.; Kiani, F. Improved Multi-Strategy Sand Cat Swarm Optimization for Solving Global Optimization Problems. Mathematics 2021, 10, 3284.
22. Xia, Q.; Ding, Y.; Zhang, R.; Zhang, H.; Li, S.; Li, X. Optimal Performance and Application for Seagull Optimization Algorithm Using a Hybrid Strategy. Entropy 2022, 24, 973.
23. Bangyal, W.H.; Nisar, K.; Ag Ibrahim, A.A.B.; Haque, M.R.; Rodrigues, J.J.; Rawat, D.B. Comparative Analysis of Low Discrepancy Sequence-Based Initialization Approaches Using Population-Based Algorithms for Solving the Global Optimization Problems. Appl. Sci. 2021, 11, 7591.
24. Xu, Y.; Sang, B.; Zhang, Y. Application of Improved Sparrow Search Algorithm to Path Planning of Mobile Robots. Biomimetics 2024, 9, 351.
25. Liu, Y.; Shi, Z.; Fu, B.; Xu, H. Radar Error Correction Method Based on Improved Sparrow Search Algorithm. Appl. Sci. 2024, 14, 3714.
26. Yang, X.; Liu, J.; Liu, Y.; Xu, P.; Yu, L.; Zhu, L.; Chen, H.; Deng, W. A Novel Adaptive Sparrow Search Algorithm Based on Chaotic Mapping and T-Distribution Mutation. Appl. Sci. 2021, 11, 11192.
27. Wang, X.; Li, Q.; Zhang, L.; Chen, X. An Adaptive Sand Cat Swarm Algorithm Based on Cauchy Mutation and Optimal Neighborhood Disturbance Strategy. Mathematics 2023, 11, 4311.
28. Guo, Z.; Ji, X.; Wang, H.; Yang, X. Active Distribution Network Fault Diagnosis Based on Improved Northern Goshawk Search Algorithm. Electronics 2024, 13, 1202.
29. Huang, J.; Qin, J.; Song, S. A Novel Wind Power Outlier Detection Method with Support Vector Machine Optimized by Improved Harris Hawk. Energies 2023, 16, 7998.
30. Yang, J.; Ren, G.; Wang, Y.; Liu, Q.; Zhang, J.; Wang, W.; Li, L.; Zhang, W. Environmental Prediction Model of Solar Greenhouse Based on Improved Harris Hawks Optimization-CatBoost. Sustainability 2024, 16, 2021.
31. Kim, H.; Park, S.; Park, H.-J.; Son, H.-G.; Kim, S. Solar Radiation Forecasting Based on the Hybrid CNN-CatBoost Model. IEEE Access 2023, 11, 13492–13500.

Figure 1. Indoor temperature change graph.

Figure 2. Map of sensor point locations.

Figure 3. SCSO algorithm flow.

Figure 4. Comparison of environment prediction results.

Figure 5. Convergence curves.

Table 1. Environmental data.

No.	Soil Temperature (°C)	Soil Moisture (%)	Humidity (%)	Carbon Dioxide (ppm)	Light (Lux)	Outdoor Air Temperature (°C)	Wind Speed (m/s)	Outdoor Air Humidity (%)	Outdoor Carbon Dioxide (ppm)	Outdoor Light (Lux)	Temperature (°C)
1	15.6	25	72.8	843	10,199	15.22	2.1	44.5	523	3103	16.75
2	16.2	25.1	68.1	796	20,505	15.75	1.6	43.4	521	21,651	20.16
3	17.1	25.2	65.2	799	14,902	15.12	1.7	45.3	534	16,192	22.26
4	17.9	25.1	62.4	792	33,385	16.8	2.4	42.2	529	32,635	24.78
5	19.2	25.2	56.9	777	18,601	17.15	2.5	42.7	518	36,520	28.87
6	19.5	25.2	38.5	435	25,333	18.55	6.2	38.9	504	50,443	25.62
7	15.6	25	72.8	843	10,199	15.25	2.1	44.5	523	31,003	16.25

Table 2. Ablation experiment results (temperature prediction).

Model	MAE (°C)	RMSE (°C)	R²
Baseline	2.15	2.89	0.85
Variant 1 (w/o Sobol)	1.78	2.36	0.88
Variant 2 (w/o t-perturbation)	1.86	2.45	0.87
Variant 3 (w/o Gaussian–Cauchy)	1.68	2.21	0.89
Full model	1.50	1.98	0.94

Table 3. Different model evaluation indicators.

Environmental Factors	Evaluation Metrics	Model
Environmental Factors	Evaluation Metrics	LSTM	CNN	JAYA–CatBoost	AOA–CatBoost	MSCSO–CatBoost
Temperature	MAE	2.23	2.36	1.83	1.92	1.50
	MSE	1.52	1.49	1.12	0.96	0.93
	R²	0.96	0.89	0.93	0.93	0.94
Humidity	MAE	7.35	6.98	5.92	5.57	5.07
	MSE	60.25	55.83	46.45	43.22	41.19
	R²	0.87	0.89	0.91	0.92	0.93
Carbon dioxide	MAE	18.32	16.75	13.24	12.35	11.57
	MSE	95.46	82.31	65.12	58.78	53.22
	R²	0.84	0.87	0.90	0.91	0.91
Light intensity	MAE	36.72	32.45	27.89	25.67	24.36
	MSE	645.4	750.8	512.3	460.1	436.6
	R²	0.88	0.85	0.91	0.92	0.93

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Cui, X.; Cheng, Y.; Zhang, Z.; Mu, J.; Zhang, W. Integrating Multi-Strategy Improvements to Sand Cat Group Optimization and Gradient-Boosting Trees for Accurate Prediction of Microclimate in Solar Greenhouses. Agriculture 2025, 15, 1849. https://doi.org/10.3390/agriculture15171849

