1. Introduction
Solar radiation is a fundamental source of energy for life on Earth, regulating climatic processes and sustaining essential biological functions. However, excessive exposure to ultraviolet (UV) radiation has long been recognized as a major risk factor for human health, contributing to skin cancer, premature aging, cataracts, and other dermatological and ocular conditions [1,2]. These risks are particularly acute in equatorial regions such as Ecuador, where geographical and atmospheric conditions intensify UV incidence [3]. Reports indicate that in both the inter-Andean region and coastal zones, the UV index frequently exceeds extreme thresholds, exposing millions of people to chronic radiation hazards.
Global concerns over UV exposure are further amplified by lifestyle changes, outdoor occupational activities, insufficient protective practices, and depletion of the ozone layer [4]. These factors highlight the urgency of preventive strategies that combine environmental monitoring, early warning systems, and public education [5]. While meteorological networks and satellite-based systems have enhanced forecasting capacity, they often require costly infrastructure and lack the adaptability needed to deliver real-time, user-centered alerts.
In this context, artificial intelligence (AI) and machine learning (ML) have emerged as powerful tools for processing large volumes of environmental data and producing accurate predictive models [6,7]. These techniques have shown strong performance in forecasting solar irradiance, photovoltaic energy production, and extreme weather events [8,9,10,11]. Hybrid approaches that combine algorithms such as Random Forest, Gradient Boosting, and CatBoost have demonstrated greater robustness, reducing overfitting and improving generalization capacity [12,13,14]. Accurate prediction of solar radiation strengthens the link between artificial intelligence and sustainability: it is essential for anticipating photovoltaic energy production [15,16], reducing the operation and maintenance costs of these energy sources, minimizing energy losses during climate variations, and mitigating risks to human health by anticipating episodes of extreme sun exposure [17,18,19,20,21].
Parallel to these advances, chatbots have been increasingly deployed as interfaces for environmental and health-related information [22]. By embedding AI models in real-time communication platforms, chatbots provide personalized interaction with users, enabling timely preventive alerts and informed decision-making. Previous applications have included air quality monitoring, climate awareness, and educational contexts, underscoring their accessibility and versatility [23,24,25].
Nevertheless, predicting solar radiation remains challenging due to the limited distribution of monitoring stations, heterogeneous data resolution, and the absence of key variables such as cloud cover [26,27]. These constraints reduce the accuracy of conventional models and emphasize the need for alternative approaches that leverage available meteorological data while optimizing predictive performance. In Ecuador, machine learning techniques have already been applied to predict global horizontal irradiance, improving model accuracy by increasing R² values from 0.607 to 0.876. Analyses have shown that between February and April, the UV index often reaches extreme levels of 14.3%, with fluxes of up to 1381.5 W/m², while in other months, values remain 1.6% below the March equinox [28,29,30,31,32].
Satellite-based early warning systems, such as GLE Alert++ and HENON, stand out for their precision in detecting solar radiation peaks and issuing alerts [33,34,35,36,37,38,39,40,41,42]. However, compared to these, chatbot-based systems offer greater adaptability and customization, allowing real-time alerts to be tailored to specific locations and user profiles [43,44,45,46,47]. Moreover, unlike satellite infrastructure such as ESPERTA, which requires high financial investment, AI-based solutions using meteorological API data offer an affordable and scalable alternative [48,49,50,51,52,53,54].
The present study addresses these challenges by implementing a hybrid AI model that integrates Random Forest, Gradient Boosting, CatBoost, and a 1D Convolutional Neural Network (CNN). The model was trained and validated with data from the ESMET-IAIRD meteorological station in Calceta, Ecuador, and embedded into a chatbot deployed on Telegram. This system delivers real-time, user-centered alerts on solar radiation exposure, contributing not only to the advancement of hybrid AI methods in environmental prediction but also to sustainable public health strategies and climate adaptation in high-UV regions.
Beyond their technological contributions, AI- and ML-based approaches to UV risk forecasting align directly with the broader goals of sustainability. By enabling more efficient environmental monitoring and fostering adaptive capacity, these systems support public health protection, inform resilient urban planning, and promote the responsible use of natural resources. In particular, the integration of predictive analytics into decision-making frameworks strengthens the capacity of communities to cope with climate variability and environmental stressors, advancing progress toward the United Nations Sustainable Development Goals (SDGs), especially SDG 3 (Good Health and Well-Being) and SDG 13 (Climate Action).
2. Materials and Methods
The development of the proposed system was carried out in several stages, each designed to ensure the accuracy and reliability of solar radiation predictions: (i) data preprocessing, (ii) implementation of prediction algorithms, and (iii) integration into a chatbot, as illustrated in Figure 1.
2.1. Data Acquisition and Preprocessing
The data used in this study were provided by the ESMET-IAIRD meteorological station, located at 0°49′45″ S, 80°11′8″ W, at an altitude of 2 m, in the town of Calceta (Ecuador). The historical dataset, collected over a 9-month period at 144 samples per day (one every 10 min), comprised approximately 38,880 records of 18 meteorological variables.
Before being used to train the model, the data were cleaned to remove invalid records, such as null values, zeros, nighttime radiation measurements, and records affected by power failures or maintenance. Outliers were then identified using the interquartile range (IQR) method: values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR were considered extreme and removed to improve the quality of the dataset.
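The IQR rule described above can be illustrated with a minimal sketch that uses only the Python standard library. The function names, the record layout, and the sample readings are hypothetical, and the quartile convention (`method="inclusive"`) is an assumption, since the study does not state which one was used.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_outliers(records, key, k=1.5):
    """Keep only the records whose `key` value lies inside the IQR fences."""
    lower, upper = iqr_bounds([r[key] for r in records], k)
    return [r for r in records if lower <= r[key] <= upper]


# Hypothetical daytime radiation readings in W/m^2; 5000 mimics a sensor glitch.
readings = [{"radiation": v} for v in (210, 340, 415, 480, 520, 610, 700, 5000)]
clean = drop_outliers(readings, "radiation")
```

In this toy sample only the 5000 W/m² reading falls outside the fences, so the seven plausible readings are retained.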
Initially, exploratory correlation plots were generated to visually inspect the relationships among variables and to identify anomalous values or data points without clear trends. These values were considered outliers and filtered out to ensure data quality. After this step, a Spearman correlation analysis was performed using all variables provided by the meteorological station to identify those most strongly associated with solar radiation. Variables with a correlation coefficient (r) greater than 0.3 or lower than −0.3 were retained as relevant for further modeling, following common practice for identifying moderate to strong relationships in environmental data, as shown in Figure 2.
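As an illustration of this selection step, the sketch below computes Spearman's coefficient as the Pearson correlation of average ranks and keeps the variables whose absolute coefficient exceeds 0.3. It is a standard-library approximation of the procedure, not the authors' code; the column names and values are invented for the example.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share this rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's coefficient: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def select_features(columns, target, threshold=0.3):
    """Return {name: r} for the columns with |r| > threshold against the target."""
    scores = {name: spearman(vals, columns[target])
              for name, vals in columns.items() if name != target}
    return {name: r for name, r in scores.items() if abs(r) > threshold}


# Hypothetical station columns (values invented for the example).
columns = {
    "radiation": [100, 300, 500, 700, 900, 650],
    "temperature": [22, 25, 28, 31, 33, 30],
    "humidity": [90, 80, 70, 60, 50, 65],
    "sensor_voltage": [1010, 1012, 1009, 1011, 1010.5, 1011.5],
}
selected = select_features(columns, target="radiation")
```

Here temperature and humidity are retained (coefficients of 1 and −1), while the weakly related `sensor_voltage` column is discarded.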
After the correlation analysis, 12 variables were retained for further modeling, as they showed moderate to strong associations with solar radiation. These include Temperature, Thermal Sensation, Dew Point, Heat Index, Interior Humidity, Maximum Wind Gust, Average Wind Speed, Average Wind Direction, Atmospheric Pressure, Rainfall, Rain Intensity, and UV Index. In addition, solar radiation data itself was considered in subsequent training and validation stages of the models (Figure 3).
2.2. Prediction Algorithms
In this stage, various predictive models were implemented with the aim of predicting solar radiation. The models considered include Linear Regression, Random Forest, CatBoost, a Fully Connected Neural Network (FCN), a Long Short-Term Memory (LSTM) network, and the proposed hybrid model. A one-dimensional Convolutional Neural Network (CNN) layer was tested independently to assess its capacity to extract local temporal features from meteorological variables, capturing short-term fluctuations in solar radiation data. Similarly, a Long Short-Term Memory (LSTM) network was included due to its ability to model sequential dependencies. The purpose of these tests was not to replace ensemble models but to verify whether feature extraction through convolutional layers or sequential learning through LSTM could provide complementary advantages in prediction accuracy. To evaluate their performance, the Mean Squared Error (MSE) and the Mean Absolute Error (MAE) were used as the primary evaluation metrics.
During the training process, cross-validation was applied to an initial set of 12 variables with the highest correlation to solar radiation. From this dataset, 75% of the data was allocated for model training, while the remaining 25% was used for validation.
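The 75/25 partition and the error metrics named above can be sketched as follows. Whether the original split was shuffled or chronological is not stated, so the seeded shuffle here is an assumption; the function names are illustrative.

```python
import random


def train_validation_split(rows, train_fraction=0.75, seed=42):
    """Shuffle the rows (assumption) and split them into train and validation parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean Squared Error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


train, validation = train_validation_split(range(100))
observed, predicted = [1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0]
scores = (mae(observed, predicted), mse(observed, predicted), r2(observed, predicted))
```

For time-ordered data such as 10-minute station records, a chronological split (training on the earlier part of the record) is a common alternative that avoids leakage between neighbouring samples.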
The three ensemble models (Random Forest, Gradient Boosting, and CatBoost) were chosen not only for their accuracy but also for their low computational cost at inference, which supports the real-time performance of the chatbot.
2.2.1. Linear Regression
In the Multiple Linear Regression (MLR) model, the predictor variables selected in the preprocessing stage are denoted as $x_i$, while the dependent variable, represented by $y_i$, corresponds to solar radiation. Equation (1) shows the relationship:

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_n x_{in} + \varepsilon_i \tag{1}$$

where $\beta_0$ is the intercept term, $\beta_1, \beta_2, \dots, \beta_n$ are the coefficients quantifying the effect of each predictor variable, and $\varepsilon_i$ represents the error term. The coefficients $\beta_j$ were estimated using the Ordinary Least Squares (OLS) method, which minimizes the sum of squared residuals, as shown in Equation (2):

$$\hat{\beta} = (X^{\top}X)^{-1}X^{\top}Y \tag{2}$$

Here, $X$ is the feature matrix, and $Y$ is the vector of observed solar radiation values. The predictor variables were selected according to their Spearman correlation with solar radiation in the preprocessing step, ensuring that only variables with moderate to strong relationships were included [55].
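To make the OLS estimate concrete, the following sketch solves the normal equations with Gauss-Jordan elimination in plain Python. It is only an illustration under stated assumptions: the two predictors (temperature and UV index) and the synthetic, noise-free target are invented, and a production implementation would normally rely on a numerical library.

```python
def ols_fit(X, y):
    """Solve (A^T A) beta = A^T y, where A is X with a leading column of ones."""
    A = [[1.0] + [float(v) for v in row] for row in X]
    p = len(A[0])
    # Augmented system [A^T A | A^T y].
    M = [[sum(a[i] * a[j] for a in A) for j in range(p)]
         + [sum(a[i] * t for a, t in zip(A, y))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(p):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [v - factor * w for v, w in zip(M[r], M[col])]
    return [M[i][p] / M[i][i] for i in range(p)]  # [beta_0, beta_1, ..., beta_n]


# Hypothetical predictors: (temperature in deg C, UV index).
X = [[20, 8], [25, 3], [30, 10], [35, 5], [28, 6]]
# Synthetic, noise-free target built from known coefficients 10, 4 and 30.
y = [10 + 4 * temp + 30 * uv for temp, uv in X]
beta = ols_fit(X, y)
```

Because the synthetic target contains no noise, the recovered coefficients match the ones used to generate it (up to floating-point error).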
2.2.2. Random Forest
The Random Forest model is an ensemble learning algorithm that combines multiple decision trees to improve prediction accuracy and reduce overfitting. Each tree is constructed from training data generated by bootstrapping, and the final prediction is obtained by aggregating the outputs of all trees.
The Random Forest was trained using the 12 selected predictor variables and the target variable, solar radiation. The dataset can be represented as Equation (3):

$$D = \{(X_i, Y_i)\}_{i=1}^{M} \tag{3}$$

where $X_i$ represents a set of characteristics (wind chill, rain, UV Index, etc.), $Y_i$ is the output variable (solar radiation), and $M$ is the number of observations. A bagging technique was used for training: each tree $T_n$ was trained on a bootstrap sample of the dataset, generated by sampling with replacement. The final prediction $P$ for a new observation $X_i$ was calculated by averaging the predictions of all trees with Equation (4):

$$P(X_i) = \frac{1}{N}\sum_{n=1}^{N} T_n(X_i) \tag{4}$$

where $T_n(X_i)$ is the prediction of the $n$-th tree for sample $X_i$, and $N$ is the total number of trees.
Since the target variable is continuous, the Random Forest was configured for regression tasks. Variable importance was evaluated based on the reduction in variance during tree construction. At each split, the feature that maximized the reduction in the variance of the target variable was selected as the splitting criterion. This approach, standard in regression-based Random Forest models, allows the identification of the most influential predictors of solar radiation. The variable that contributes the most to reducing variance across all trees is considered the most important, thus highlighting key environmental factors in the prediction of solar radiation.
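The bagging and averaging scheme can be illustrated with a deliberately simplified sketch. Each "tree" below is a one-split decision stump on a single feature, chosen to minimise the squared error (which is equivalent to maximising variance reduction); a real Random Forest grows deeper trees and samples features as well. The data and function names are hypothetical.

```python
import random


def fit_stump(xs, ys):
    """Pick the threshold minimising squared error and return a predictor function."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # drop the max so the right side is never empty
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # all x values identical: fall back to the mean
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, split, left_mean, right_mean = best
    return lambda x: left_mean if x <= split else right_mean


def fit_bagged_stumps(xs, ys, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample (sampling with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees


def predict(trees, x):
    """Average the predictions of all trees."""
    return sum(tree(x) for tree in trees) / len(trees)


# Hypothetical data: UV index versus solar radiation in W/m^2.
uv_index = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
radiation = [80, 150, 240, 310, 420, 500, 610, 690, 800, 880]
forest = fit_bagged_stumps(uv_index, radiation)
low_uv, high_uv = predict(forest, 2), predict(forest, 9)
```

Averaging over bootstrap-trained trees smooths the step-like output of any single stump, which is the variance-reduction effect that bagging relies on.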
2.2.3. CatBoost
The CatBoost model is based on an enhanced version of Gradient Boosting. This decision tree algorithm efficiently handles categorical data without requiring extensive preprocessing [56]. CatBoost achieves this by encoding categorical features internally with ordered target statistics, which reduces the need for manual transformations such as one-hot encoding, while its ordered boosting scheme reduces the risk of overfitting caused by target leakage. The objective function is defined as Equation (5):

$$\mathcal{L} = l\big(Y_{\mathrm{rad}}, F(X_{\mathrm{weather}})\big) \tag{5}$$

where $Y_{\mathrm{rad}}$ is the vector of observed solar radiation values, $X_{\mathrm{weather}}$ represents the meteorological predictor variables (e.g., temperature, UV index, thermal sensation), and $F(X_{\mathrm{weather}})$ is the model prediction. The loss function $l$ corresponds to the Mean Squared Error (MSE), which is standard for regression tasks.
The training process follows an additive scheme. Starting with an initial estimate $F_0$, calculated as the mean of the target values, each subsequent tree approximates the negative gradient of the loss function, progressively reducing residual errors. The additive model is expressed as Equation (6):

$$F(X_{\mathrm{weather}}) = F_0 + \sum_{i=1}^{N} h_i(X_{\mathrm{weather}}) \tag{6}$$

where each tree $h_i(X_{\mathrm{weather}})$ is fitted to correct the residuals from the previous iteration. The iterative construction of the CatBoost model can be generalized as Equation (7):

$$F_m(X_{\mathrm{weather}}) = F_{m-1}(X_{\mathrm{weather}}) + \gamma_m h_m(X_{\mathrm{weather}}) \tag{7}$$

where $\gamma_m$ is the learning rate at iteration $m$, regulating the contribution of each tree.
As the model learns from the data, the value of μ corresponds to the adjustment made in each iteration n to improve the prediction of solar radiation, as given by Equations (8)–(9).
For model optimization, hyperparameters such as tree depth, learning rate, and number of iterations were tuned using cross-validation. This process ensured that the model maintained predictive performance while reducing the risk of overfitting.
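The additive scheme (an initial mean estimate, followed by trees fitted to the residuals and scaled by a learning rate) can be sketched with plain gradient boosting on decision stumps. This is not CatBoost itself: ordered boosting and categorical handling are omitted, the learning rate is kept constant across iterations, and the data are hypothetical.

```python
def fit_stump(xs, targets):
    """Least-squares stump on one feature (assumes at least two distinct x values)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, split, left_mean, right_mean = best
    return lambda x: left_mean if x <= split else right_mean


def fit_boosting(xs, ys, n_rounds=50, learning_rate=0.1):
    """F_0 is the target mean; each round adds learning_rate * h_m fitted to residuals."""
    f0 = sum(ys) / len(ys)
    predictions = [f0] * len(ys)
    trees = []
    for _ in range(n_rounds):
        # Residuals are proportional to the negative gradient of the squared-error loss.
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        predictions = [p + learning_rate * tree(x) for p, x in zip(predictions, xs)]
    return f0, learning_rate, trees


def predict(model, x):
    """Evaluate F_0 plus the scaled sum of all fitted stumps."""
    f0, learning_rate, trees = model
    return f0 + learning_rate * sum(tree(x) for tree in trees)


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical data: UV index versus solar radiation in W/m^2.
uv_index = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
radiation = [80, 150, 240, 310, 420, 500, 610, 690, 800, 880]
model = fit_boosting(uv_index, radiation)
baseline_mse = mse(radiation, [model[0]] * len(radiation))
boosted_mse = mse(radiation, [predict(model, x) for x in uv_index])
```

Comparing `boosted_mse` with `baseline_mse` shows how each added tree lowers the training error relative to predicting the mean alone; the tree depth, learning rate, and number of iterations control how quickly this happens.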
2.2.4. Fully Connected Neural Network (FCN)
The Fully Connected Neural Network (FCN) was implemented to model nonlinear relationships between meteorological variables and solar radiation. The architecture consisted of an input layer with the 12 predictor variables, two hidden layers, and one output layer. Each hidden layer applied a weighted sum of the inputs, followed by a nonlinear activation. The Rectified Linear Unit (ReLU) activation function was used in hidden layers due to its efficiency and ability to mitigate vanishing gradients, while the output layer used a linear activation suitable for regression tasks.
The forward propagation through the network can be expressed as Equation (10):

$$a^{(l)} = f\big(W^{(l)} a^{(l-1)} + b^{(l)}\big) \tag{10}$$

where $a^{(l)}$ is the activation vector of layer $l$, $W^{(l)}$ and $b^{(l)}$ are the weight matrix and bias vector, and $f$ is the activation function.
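A minimal forward pass for such a network (two ReLU hidden layers and a linear output) might look like the sketch below. The weights are arbitrary illustrative numbers and the input has two features instead of the 12 predictors, purely to keep the example readable.

```python
def relu(vector):
    """Rectified Linear Unit applied element-wise."""
    return [max(0.0, v) for v in vector]


def dense(W, b, a):
    """Weighted sum W a + b for one layer (W is a list of rows)."""
    return [sum(w * x for w, x in zip(row, a)) + bias for row, bias in zip(W, b)]


def forward(x, layers):
    """Propagate x through the layers: ReLU on hidden layers, linear output."""
    a = list(x)
    for index, (W, b) in enumerate(layers):
        z = dense(W, b, a)
        is_output = index == len(layers) - 1
        a = z if is_output else relu(z)
    return a


# Hypothetical weights: 2 inputs -> 2 hidden -> 2 hidden -> 1 output.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0]),
    ([[1.0, 1.0], [-1.0, 2.0]], [0.0, 0.0]),
    ([[2.0, 1.0]], [0.5]),
]
output = forward([3.0, 1.0], layers)  # -> [6.5]
```

Tracing the example by hand: the first hidden layer yields [2.0, 1.0], the second yields [3.0, 0.0] after ReLU, and the linear output is 2·3 + 1·0 + 0.5 = 6.5.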
2.2.5. Long Short-Term Memory (LSTM)
The Long Short-Term Memory (LSTM) model was applied to capture temporal dependencies in solar radiation data. LSTMs are a specialized form of recurrent neural networks (RNNs) that address the vanishing gradient problem by incorporating memory cells and gating mechanisms. This allows the network to selectively retain or discard information, making it suitable for time series forecasting where past conditions influence present outcomes.
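An LSTM cell operates through three main gates: the forget gate ($f_t$), the input gate ($i_t$), and the output gate ($o_t$), which control the flow of information, as given in Equations (11)–(13):

$$f_t = \sigma\big(W_f\,[h_{t-1}, x_t] + b_f\big) \tag{11}$$

$$i_t = \sigma\big(W_i\,[h_{t-1}, x_t] + b_i\big) \tag{12}$$

$$o_t = \sigma\big(W_o\,[h_{t-1}, x_t] + b_o\big) \tag{13}$$

where $x_t$ is the input at time step $t$, $h_{t-1}$ is the previous hidden state, $\sigma$ is the sigmoid function, and $W$ and $b$ denote the weight matrix and bias vector of each gate.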
Although the dataset covered six months, its high temporal resolution (144 measurements per day) provided enough observations to capture short-term autocorrelation, full diurnal cycles, and multi-day variability, allowing the LSTM to effectively exploit sequential patterns in solar radiation.
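To show how the forget, input, and output gates interact within one time step, the sketch below implements a scalar LSTM cell (hidden size 1). The candidate value and the cell-state update follow the standard LSTM formulation, which is an assumption here because the text only lists the three gates; the parameter names are hypothetical.

```python
from math import exp, tanh


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def lstm_step(x_t, h_prev, c_prev, p):
    """One scalar LSTM step; p maps names such as 'wf', 'uf', 'bf' to floats."""
    f_t = sigmoid(p["wf"] * x_t + p["uf"] * h_prev + p["bf"])  # forget gate
    i_t = sigmoid(p["wi"] * x_t + p["ui"] * h_prev + p["bi"])  # input gate
    o_t = sigmoid(p["wo"] * x_t + p["uo"] * h_prev + p["bo"])  # output gate
    g_t = tanh(p["wc"] * x_t + p["uc"] * h_prev + p["bc"])     # candidate value
    c_t = f_t * c_prev + i_t * g_t   # keep part of the old memory, add new content
    h_t = o_t * tanh(c_t)            # expose a gated view of the memory
    return h_t, c_t


# With all parameters at zero, every gate equals 0.5 and the candidate is 0.
zero_params = {name: 0.0 for name in
               ("wf", "uf", "bf", "wi", "ui", "bi", "wo", "uo", "bo", "wc", "uc", "bc")}
h_t, c_t = lstm_step(x_t=0.8, h_prev=0.0, c_prev=1.0, p=zero_params)
```

In this degenerate case the cell keeps half of its previous memory (`c_t` = 0.5), which makes the role of the forget gate easy to verify by hand.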
2.2.6. Hybrid Model
The hybrid model integrates Random Forest (RF), Gradient Boosting (GB), and CatBoost (CB) to improve prediction accuracy and reduce error in solar radiation forecasting. Each component contributes complementary strengths: Random Forest handles feature interactions effectively, Gradient Boosting refines predictions by focusing on residuals, and CatBoost efficiently manages categorical data. By combining these models, the hybrid approach leverages their complementary features to enhance predictive performance.
The architecture of the hybrid model included a one-dimensional convolutional layer followed by two additional layers for feature extraction, as shown in Figure 4. This design allowed the model to capture both local and global patterns in the data, which is particularly important for time-series predictions such as solar radiation.
Model performance was evaluated using multiple metrics: Mean Absolute Error (MAE), Mean Squared Error (MSE), and the coefficient of determination (R²) for regression accuracy, as well as Precision, Recall, F1-Score, and Accuracy for classification of solar radiation into four categories (low, medium, high, and very high). This combination of metrics provided a comprehensive assessment of model performance.
To develop the hybrid model, different combinations of base learners were tested (e.g., RF + GB, RF + GB + CB). Hyperparameters such as the number of iterations (500, 1000, 1500), tree depth (4, 6, 8), and learning rate (0.01, 0.2, 0.3) were tuned through cross-validation. This process was implemented to minimize overfitting and optimize generalization capacity.
2.3. Implementation of the Chatbot
The chatbot was designed as a user interface for the solar radiation prediction system, allowing real-time access to forecasts via the Telegram platform. The implementation was carried out in Python 3.13, using the python-telegram-bot library for interaction, asyncio for asynchronous event handling, and the trained hybrid model for predictions.
The predictive engine of the chatbot was trained on a historical dataset, where categorical variables were transformed using Label Encoder. Feature selection was applied, and the hybrid model was trained using a Stacking Regressor with Ridge regression as the meta-learner. This configuration allowed the combination of multiple base learners to reduce overfitting and improve generalization.
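A stacking configuration with a Ridge meta-learner, as described above, could be sketched with scikit-learn as follows. This is an illustrative approximation rather than the authors' pipeline: the data are synthetic, the hyperparameters are arbitrary, and only Random Forest and Gradient Boosting are stacked to keep dependencies small (a CatBoost regressor from the `catboost` package could be appended to `estimators` in the same way if that package is installed).

```python
import numpy as np
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for three meteorological predictors and solar radiation.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(400, 3))
y = 800.0 * X[:, 0] + 150.0 * X[:, 1] ** 2 + rng.normal(0.0, 10.0, size=400)

# 75% training, 25% validation, mirroring the split used in the study.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)

hybrid = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
        ("gb", GradientBoostingRegressor(random_state=0)),
    ],
    final_estimator=Ridge(alpha=1.0),  # meta-learner combining base predictions
    cv=5,  # base-learner predictions for the meta-learner come from 5-fold CV
)
hybrid.fit(X_train, y_train)
r2_validation = hybrid.score(X_val, y_val)  # coefficient of determination
```

Training the meta-learner on cross-validated predictions of the base learners, rather than on their in-sample predictions, is what limits overfitting in this design.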
Once trained, the system retrieves real-time weather data from Weather API, constructs a Data Frame with the required predictor variables, and generates a radiation risk category (low, medium, high, very high). The chatbot then communicates results to users with textual messages and visual alerts. Interactive buttons were implemented to request predictions or retrieve recent station data. Deployment was performed on Heroku, using Long Polling to ensure continuous interaction between the server and end users. The chatbot required minimal technological resources, without the need for significant storage or cloud computing capacity.
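The final step, turning a predicted radiation value into a risk category and an alert text, can be sketched as below. The thresholds are hypothetical placeholders (the study does not list the category boundaries), and the message wording and function names are likewise illustrative.

```python
from bisect import bisect_right

# Hypothetical category boundaries in W/m^2; NOT taken from the study.
THRESHOLDS = (300.0, 600.0, 900.0)
LABELS = ("low", "medium", "high", "very high")


def risk_category(radiation):
    """Map a predicted radiation value to one of the four risk labels."""
    return LABELS[bisect_right(THRESHOLDS, radiation)]


def alert_message(radiation, place):
    """Build the text a chatbot could send for one prediction."""
    category = risk_category(radiation)
    return (f"Predicted solar radiation in {place}: "
            f"{radiation:.0f} W/m2 ({category} risk).")


message = alert_message(742.3, "Calceta")
```

Keeping the categorization in a small pure function makes it straightforward to unit-test the boundaries independently of the messaging platform.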
To validate usability and system stability, pilot tests were conducted with 20 academic users over a two-week period. The group was composed of participants aged 24–65, selected for their ability to interpret forecasts and provide structured feedback. The limited sample size allowed for close monitoring of interactions and early detection of system issues in a controlled environment. Evaluation metrics included response time, number of successful interactions, and user satisfaction, the latter assessed through a short survey administered at the end of the testing period.
Figure 5 illustrates the chatbot workflow, from the reception of weather data to the final delivery of radiation risk predictions through the user interface.
3. Results
The evaluation metrics highlight notable differences in performance across the models tested for solar radiation forecasting (Table 1).
In terms of MAE, the Hybrid model (RF + GB + CB) achieved the lowest error (13.93 W/m²), closely followed by CatBoost (14.21 W/m²) and Random Forest (14.41 W/m²). These results indicate that ensemble-based models were more effective in minimizing average prediction errors compared to linear (21.19 W/m²) and neural approaches (FCN: 34.39 W/m²; LSTM: 73.80 W/m²).
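For MSE, the Hybrid model again reported the smallest error among the ensemble and linear models (871.70), confirming its superior ability to reduce squared deviations. CatBoost (897.88) and Random Forest (934.79) followed closely, while Linear Regression showed the highest error (1466.61). The FCN reported a much lower MSE (288.76) despite its higher MAE (34.39 W/m²). Because the MSE of a set of predictions can never be smaller than the square of their MAE (here 34.39² ≈ 1182.7), these two values cannot have been computed on the same data and scale, so the FCN's MSE is not directly comparable with those of the other models.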
Regarding R², the Hybrid model explained the highest proportion of variance (0.98), outperforming Random Forest (0.97), CatBoost (0.97), and Linear Regression (0.95). The FCN achieved a moderate R² (0.91), while the LSTM performed poorly (0.69), indicating limited capacity to model temporal dependencies despite the high-frequency dataset.
Finally, in terms of prediction counts, the Hybrid model correctly classified 9759 out of 9912 cases, slightly ahead of CatBoost (9747) and Random Forest (9721). Linear Regression reached 9592 correct predictions, while FCN (9652) and LSTM (9552) underperformed relative to ensemble methods.
Overall, the Hybrid model consistently outperformed the individual models across all metrics, confirming its robustness, stability, and suitability for solar radiation forecasting. From a sustainability perspective, the ability of ensemble approaches to combine complementary strengths not only improves predictive reliability but also enhances their potential for real-world applications such as solar energy management, climate adaptation strategies, and public health risk communication.
Based on the results shown in Table 1, the FCN and LSTM models were excluded from graphical comparisons, as their performance was considerably lower and their inclusion would not contribute to the visual interpretation of prediction accuracy.
The performance of the Multiple Linear Regression model suggested overfitting, as validation error remained higher than training error and did not improve with larger training sizes.
Figure 6 shows the MAE learning curve against training size, where the gap between the training and validation lines illustrates the limited generalization capacity of this model.
Ensemble methods improved predictive accuracy. Random Forest reduced the MAE compared to linear regression, but the learning curve in Figure 7 shows that validation error plateaued above training error, reflecting moderate overfitting. This implies that the model effectively captures patterns in the training data but fails to generalize optimally to unseen data.
The CatBoost model showed improved stability compared to MLR and Random Forest. As the training size increased, the validation MAE decreased gradually and converged closer to the training error, as shown in Figure 8. However, a persistent gap remained between both curves, suggesting that while the model learned the training data effectively, its ability to generalize to unseen data was still constrained.
Predicted and observed values are compared for each model in the scatter plots of Figure 9. Each plot represents predicted versus observed solar radiation for one model: Linear Regression (top left), Random Forest (top right), CatBoost (bottom left), and the Hybrid model (bottom right). In all cases, the points show a positive correlation along the ideal diagonal. Linear Regression exhibits greater dispersion, particularly at higher values. Random Forest and CatBoost reduce this scatter, showing a closer alignment with the diagonal. The Hybrid model presents the tightest clustering, with most points concentrated around the line, indicating the best agreement between predictions and actual values.
Figure 10 displays the temporal evolution of observed versus predicted solar radiation for Linear Regression, Random Forest, CatBoost, and the Hybrid model (RF + GB + CB). Across all models, the predicted series generally followed the seasonal and daily variability of solar radiation. However, clear differences in alignment are evident. Linear Regression showed larger deviations from the observed values, particularly under peak radiation conditions, reflecting limited accuracy in capturing fluctuations. Random Forest and CatBoost presented improved correspondence, with predictions more closely tracking the observed series and fewer large deviations. The Hybrid model achieved the closest alignment with the real values, maintaining stable accuracy across the entire evaluation period, including both peaks and troughs. Overall, ensemble approaches, particularly the Hybrid model, demonstrated superior consistency and reduced error compared to individual methods.
To evaluate the ability of the models to categorize solar radiation into four levels (low, medium, high, very high), classification metrics were calculated (Figure 11). The results show that ensemble-based models consistently outperformed individual learners across all indicators. Precision and recall were higher for the Hybrid model (RF + GB + CB), reaching values close to 0.9, followed by Gradient Boosting and CatBoost. The F1-Score confirmed this trend, indicating that the Hybrid approach provided the best balance between precision and recall. Accuracy values were also highest for the Hybrid model, demonstrating superior robustness in assigning radiation categories correctly. These results complement the regression analysis, highlighting that the Hybrid model not only minimizes prediction errors but also provides reliable classification of solar radiation into practical risk levels.
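For reference, the classification metrics used in this comparison can be computed per category as in the following sketch. The labels and predictions are invented for illustration and do not reproduce the study's results.

```python
def per_class_scores(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)} using one-vs-rest counts."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = (precision, recall, f1)
    return scores


def accuracy(y_true, y_pred):
    """Fraction of samples assigned to the correct category."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


LABELS = ("low", "medium", "high", "very high")
# Hypothetical observed and predicted categories.
observed = ["low", "low", "medium", "high", "high", "very high"]
predicted = ["low", "medium", "medium", "high", "very high", "very high"]
scores = per_class_scores(observed, predicted, LABELS)
overall_accuracy = accuracy(observed, predicted)
```

In this toy case both errors fall between adjacent categories, the pattern expected near threshold boundaries when a continuous prediction is binned into risk levels.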
The integration of the hybrid prediction model into the chatbot achieved robust performance, with a deviation of approximately 15% when comparing Weather API data with filtered datasets, indicating that external data sources did not compromise prediction accuracy. Overall, the chatbot reached a 95% accuracy in solar radiation predictions.
Figure 12 illustrates the operational interface of the solar radiation prediction chatbot implemented in Telegram. The system provides users with clear information on daily radiation levels, categorized into three risk levels (low, medium, high), and delivers hourly forecasts in both textual and graphical formats. Users can select the desired prediction day and receive structured outputs that facilitate interpretation and decision-making. This demonstrates the capacity of the chatbot to integrate the hybrid prediction model into a user-friendly platform, effectively bridging predictive analytics with practical communication.
User feedback confirmed the system’s applicability and acceptance. Despite the limited scope of the test population, acceptance was unanimous: 40% of participants agreed, and 60% strongly agreed on their willingness to receive radiation notifications from the system. Furthermore, when asked about the reliability of the chatbot compared to satellite APIs, 65% agreed and 35% strongly agreed that the chatbot provided more accurate and useful information, with no dissenting responses recorded.
In terms of usability, 90% of participants expressed satisfaction with the clarity and usefulness of the alerts. Based on this feedback, the message structure was refined to enhance readability and personalize recommendations according to the user profile. The classification of radiation levels into three categories (low, medium, high) was identified as particularly helpful, facilitating interpretation for non-specialist users.
These results highlight the chatbot as an effective tool for real-time solar radiation prediction and communication, combining high predictive accuracy with strong user acceptance and accessibility.
The development of the chatbot is the central component of this research, since it integrates prediction algorithms with a conversational interface that is easily accessible to non-specialized users. The system was designed under a client-server architecture (Figure 13), where the backend implements the hybrid prediction model (RF, GB, and CB) and the frontend corresponds to the conversational chatbot. The chatbot receives weather parameters (temperature, relative humidity, wind speed, and atmospheric pressure) as input, which are entered directly by the user or automatically acquired from the weather station. This data is pre-processed and sent to the hybrid prediction engine, which generates an estimate of solar radiation in real time. Subsequently, the response is returned in textual and graphic formats within the conversational interface.
The interaction is structured on three levels: (i) data entry (manual or automatic), (ii) processing using the hybrid model, and (iii) feedback to the user in the form of numerical values, graphs, and explanatory messages. This design allows users without training in machine learning to interpret the results intuitively, favoring decision-making in renewable energy and smart agriculture applications.
Computational efficiency is a determining factor in the viability of the chatbot system. The total response time perceived by the user, from submitting the prediction request until the chatbot displays the result, was consistently in the range of 2 to 3 s, which is suitable for real-time applications. This performance results from how computational complexity is managed: the training of complex models (CatBoost and LSTM) was performed offline, while the inference phase is optimized for low latency. The average processing time of the predictive core (pure inference) remained below 5 ms when running on a standard central processing unit (CPU). The difference between the total response time (2–3 s) and the inference time (<5 ms) is attributed to network latency, natural language processing, and the chatbot's communication overhead, not to the model's computation. This confirms that the system is computationally modest and does not require GPU acceleration, which supports its scalability and enables efficient deployment on microservices or low-cost hardware.
4. Discussion
The results demonstrate that the hybrid model (RF + GB + CB) consistently outperformed the individual models in terms of MAE, MSE, and R², highlighting its robustness for solar radiation prediction. This aligns with findings from previous studies [57,58,59,60,61,62], confirming that combining multiple algorithms leverages complementary strengths to enhance predictive performance.
Recent research has also explored hybrid forecasting approaches in the context of integrated energy systems. For example, solar–hydropower optimization studies have applied advanced hybrid decomposed residual ensembling methods, combining statistical (ARIMA, STL) and deep learning models (Bi-LSTM) with optimization algorithms such as WOA, achieving remarkably low error rates (MAE = 1.31 W/m², RMSE = 1.85 W/m²) [53]. While these results demonstrate the potential of highly specialized hybrid models, they often require complex architectures [53]. In contrast, the present study adopts a simpler yet effective hybrid architecture (RF + GB + CB with CNN), which achieved robust accuracy (R² = 0.98; MAE = 13.77 W/m²) while maintaining generalizability and accessibility. This balance underscores the potential of hybrid ensemble approaches not only for technical accuracy but also for practical deployment through user-oriented platforms such as chatbots.
Previous studies have emphasized the critical role of solar forecasting in the renewable energy transition, particularly for grid stability and load balancing. For instance, a data-driven contextual forecasting (DCF) framework combined Support Vector Machines (SVM) for short-term horizons and Facebook Prophet (FBP) for long-term horizons, achieving average R² values of 85% across multiple U.S. cities [55]. While such approaches demonstrated strong adaptability depending on the prediction horizon, the present study contributes by integrating ensemble methods with convolutional layers to enhance prediction accuracy under high temporal resolution conditions. Unlike DCF, which partitions models by forecast horizon, the proposed hybrid model simultaneously captures nonlinear relationships and temporal patterns, achieving superior accuracy for short-term daily predictions. This distinction highlights the versatility of hybrid ML architectures for real-time applications, where both accuracy and accessibility (via chatbot implementation) are essential for sustainable energy management.
While linear regression provided efficiency and interpretability, its limitations in capturing nonlinear relationships restricted its accuracy. Similarly, although Random Forest and CatBoost achieved strong performance, signs of overfitting during testing highlighted the advantages of the hybrid approach. The inclusion of a 1D convolutional layer further improved the model’s ability to capture temporal patterns, reinforcing its suitability for time-series data such as solar radiation.
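To illustrate why a 1D convolution suits time-series input, the sketch below slides a small kernel over a sequence of readings. It is a plain-Python illustration of the operation only, not the layer used in the study: the kernels are fixed by hand here (in a trained network they are learned), and the readings are invented.

```python
def conv1d_valid(series, kernel):
    """Slide `kernel` over `series` without padding ("valid" mode).

    As in most deep learning frameworks, the kernel is not flipped,
    so strictly speaking this is a cross-correlation.
    """
    k = len(kernel)
    return [
        sum(series[i + j] * kernel[j] for j in range(k))
        for i in range(len(series) - k + 1)
    ]

# Invented radiation readings at consecutive time steps.
readings = [0.0, 100.0, 200.0, 300.0, 300.0, 100.0]

# An averaging kernel smooths short-term noise.
smooth = conv1d_valid(readings, [1 / 3, 1 / 3, 1 / 3])

# A difference kernel responds to rising or falling trends.
slope = conv1d_valid(readings, [-1.0, 0.0, 1.0])
print(slope)  # [200.0, 200.0, 100.0, -200.0]
```

Each output value depends only on a short window of neighboring time steps, which is how a convolutional layer can expose local temporal patterns (ramps, drops) as features for the downstream regressors.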
In contrast to ensemble methods, the Fully Connected Neural Network (FCN) and the Long Short-Term Memory network (LSTM) achieved lower performance. The FCN, although capable of modeling nonlinear relationships, requires large datasets to optimize its parameters and avoid overfitting. In this study, the limited dataset size restricted its ability to generalize, resulting in higher MAE and MSE values compared to boosting algorithms. The LSTM exhibited the weakest performance overall, with an R² of 0.69. This outcome can be attributed to two main factors: (i) the dataset covered only a six-month period, which limited the availability of long-term temporal patterns that recurrent networks typically exploit, and (ii) the temporal resolution of the dataset (144 measurements per day) introduced noise and short-term variability, which LSTMs tend to overfit when not balanced with longer sequences. As a result, LSTM could not capture stable temporal dependencies as effectively as ensemble methods.
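The resolution mentioned above (144 measurements per day) corresponds to one reading every 10 minutes. The following sketch shows how such a series is typically cut into fixed-length input windows for a recurrent model. The window length, the function name `make_windows`, and the synthetic three-day series are assumptions for illustration; the study's actual sequence preparation is not described here.

```python
SAMPLES_PER_DAY = 144                             # from the text
MINUTES_PER_SAMPLE = 24 * 60 // SAMPLES_PER_DAY   # 10-minute sampling

def make_windows(series, window, horizon=1):
    """Build (inputs, target) pairs for sequence models.

    `inputs` holds `window` consecutive values; `target` is the value
    `horizon` steps after the window ends.
    """
    pairs = []
    for start in range(len(series) - window - horizon + 1):
        end = start + window
        pairs.append((series[start:end], series[end + horizon - 1]))
    return pairs

# Three hypothetical days of synthetic data (not real measurements).
series = [float(i % SAMPLES_PER_DAY) for i in range(3 * SAMPLES_PER_DAY)]

# One-day input windows predicting the next 10-minute reading.
pairs = make_windows(series, window=SAMPLES_PER_DAY)
print(MINUTES_PER_SAMPLE, len(pairs))  # 10 288
```

With one-day windows, each training example already spans 144 noisy steps, while the number of distinct days (and therefore of distinct daily or seasonal patterns) grows only with the length of the recording period, which is consistent with the explanation given for the LSTM result.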
Nevertheless, occasional misalignments between predicted and actual values remain inevitable due to the inherent complexity of solar radiation forecasting. Sudden weather changes, sensor inaccuracies, and unpredictable environmental variability introduce uncertainty that no model can fully capture. The 15% error margin observed illustrates these challenges, particularly in the categorization of radiation levels near threshold boundaries. This margin can lead to misclassification between categories (e.g., high vs. very high), reducing reliability for decision-making in borderline cases.
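The borderline effect can be made concrete with a small sketch. It assumes that the 15% margin acts as a relative error on the predicted value, and it borrows the category boundaries of the WHO UV index scale purely as placeholder thresholds; neither assumption is confirmed by the study, and the names below are hypothetical.

```python
# Placeholder lower bounds per category (borrowed from the WHO UV index
# scale for illustration; the study's own thresholds may differ).
CATEGORIES = [
    (0.0, "low"),
    (3.0, "moderate"),
    (6.0, "high"),
    (8.0, "very high"),
    (11.0, "extreme"),
]

def categorize(value):
    """Return the label of the highest category whose lower bound is met."""
    label = CATEGORIES[0][1]
    for lower, name in CATEGORIES:
        if value >= lower:
            label = name
    return label

def is_borderline(predicted, rel_error=0.15):
    """True if the error band around `predicted` spans two categories."""
    low = categorize(predicted * (1 - rel_error))
    high = categorize(predicted * (1 + rel_error))
    return low != high

print(categorize(7.5), is_borderline(7.5))  # high True
print(categorize(9.5), is_borderline(9.5))  # very high False
```

A delivery layer such as a chatbot could use a check of this kind to flag predictions near a boundary, for example by reporting both adjacent categories instead of a single label.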
Similar challenges have been reported in tropical regions, where high variability in cloud cover complicates solar irradiance forecasting. A study conducted in Thailand achieved correlations above 0.8 for intra-day and short-term forecasts but still showed phase shifts in irradiance fluctuations during cloudy conditions [54]. These findings reinforce the notion that prediction errors are to some extent unavoidable in tropical climates, where convective processes introduce short-term variability that models struggle to capture. In line with these observations, the 15% error margin identified in the present study reflects not only model limitations but also the intrinsic unpredictability of tropical weather systems. This highlights the importance of integrating robust forecasting with user-oriented communication tools, ensuring that even imperfect predictions provide actionable value for decision-making in renewable energy management and public health protection.
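A phase shift of the kind described above can be detected by correlating the forecast with the observations at several time lags. The sketch below is one simple way to do this with the Pearson correlation; it is not the method of the cited study, and the two short series are invented so that the forecast trails the observations by exactly two steps.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 for a constant series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def best_lag(actual, forecast, max_lag):
    """Return (lag, correlation) with the highest correlation.

    A positive lag means the forecast trails the observations by
    `lag` time steps.
    """
    scores = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, f = actual[:len(actual) - lag], forecast[lag:]
        else:
            a, f = actual[-lag:], forecast[:len(forecast) + lag]
        scores[lag] = pearson(a, f)
    lag = max(scores, key=scores.get)
    return lag, scores[lag]

# Invented series: the forecast peak arrives two steps late.
actual = [0.0, 1.0, 4.0, 9.0, 4.0, 1.0, 0.0, 0.0, 0.0, 0.0]
forecast = [0.0, 0.0, 0.0, 1.0, 4.0, 9.0, 4.0, 1.0, 0.0, 0.0]

lag, corr = best_lag(actual, forecast, max_lag=3)
print(lag)  # 2
```

A forecast can therefore reproduce the shape of a fluctuation well and still score poorly on pointwise metrics such as MAE when the fluctuation is predicted at the wrong time.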
Finally, the implementation of the chatbot on Telegram illustrates how predictive models can be effectively translated into practical tools. By offering real-time accessibility through a widely used communication platform, the system bridges advanced machine learning with user-centered sustainability applications. This combination of predictive accuracy and accessible delivery demonstrates significant potential for supporting renewable energy management and informed decision-making in everyday contexts.
The chatbot, backed by a model trained using Random Forest, CatBoost, and Gradient Boosting, demonstrates the feasibility of integrating artificial intelligence into energy management. This predictive capability not only optimizes the use of solar radiation but also supports the transition to clean energy by facilitating sustainable energy planning and reducing dependence on fossil fuels.
The results obtained are based on a nine-month dataset recorded at a single meteorological station in Ecuador, which limits the generalization of the model. Although the hybrid approach outperformed the individual models in this context, it is reasonable to expect that its applicability will improve with larger datasets spanning several years and stations in different geographic regions. In particular, the robustness of the ensemble algorithms employed (RF, GB, CB) suggests that the hybrid architecture can adapt to diverse climatic and geographic conditions, although multi-site validation remains a line of future research.