1. Introduction
Conventional electrical energy generation approaches that were considered effective in earlier decades have become unsustainable due to their extensive carbon dioxide emissions, which are associated with chronic health impacts. In recent years, the European population has become increasingly aware of the causes of greenhouse gas emissions and the environmental damage they inflict; nevertheless, it remains crucial to treat global warming as a serious problem and to curb the rapid, devastating human activities that drive it [1,2]. According to Azadin et al., 1.4 billion people are deprived of electricity, which means that more electrical energy needs to be generated [1,3]. Moreover, reducing and minimizing the environmental impact of fossil fuel combustion has received considerable emphasis in research seeking innovative solutions [4,5]. In parallel, the growing global demand for renewable energy highlights the urgent need to develop environmentally friendly and sustainable electricity generation technologies [6]. Within this context, the dye-sensitized solar cell (DSSC) is a promising candidate, offering an attractive substitute for conventional silicon (Si)-based photovoltaic technologies. This study focuses on DSSCs; within the broader PV landscape, mono- and polycrystalline Si-based technology dominates, whereas DSSCs are among the most relevant technologies under diffuse irradiation and in Building Integrated Photovoltaics (BIPV) systems [7].
The performance advantages of the DSSC have led to a surge in research activity, reflected in the growing number of publications over the years [7,8]. A conventional DSSC consists of five main components: (i) a transparent conductive oxide (TCO)-coated glass substrate; (ii) an n-type semiconductor film such as titanium dioxide (TiO2) or zinc oxide (ZnO); (iii) a sensitizing dye (photosensitizer); (iv) an electrolyte solution, typically iodide/triiodide; and (v) a counter electrode [9,10]. When electromagnetic radiation strikes the solar cell, the internal energy of the dye molecules increases and an electron is excited from the ground state (the Highest Occupied Molecular Orbital, HOMO) to higher-energy orbitals (the Lowest Unoccupied Molecular Orbital, LUMO). The excited electrons transfer into the conduction band of the semiconductor layer and diffuse towards the external circuit. These processes leave the dye oxidized; it is regenerated by electron donation from the electrolyte solution via the platinum counter electrode [11]. A key characteristic distinguishes the operating mechanism of the DSSC from that of conventional monocrystalline silicon (mono-Si) cells: while light absorption and charge separation both occur in the p-n junction of mono-Si cells, these two processes are decoupled in the DSSC [12]. Ferber’s electrical model for DSSCs, which simulates the electrical behavior (charge transport in the interface region) of the cell, has been widely adopted and serves as a foundation for numerous studies. It revealed that electron movement within the semiconductor film is mainly driven by diffusion and influenced by factors such as particle size and porosity [13]. Additionally, TiO2 and ZnO, semiconductor materials that share a similar bandgap of 3.2 eV, are commonly employed [14]. Other compounds with similar properties, such as SnO2 [15] and Nb2O5 [16], have also been tested. Beyond that, to improve the electrical parameters of the solar cell, morphological strategies have been considered, including nanorods [17], nanotubes [18], nanoforests [19], and nanowires [20]. Furthermore, the reviewed literature highlights that hydrothermal synthesis, carried out under high pressure and temperature, is frequently chosen for forming the semiconductor layer. The other essential component is the light-harvesting dye, which is expected to (i) have a broad light absorption capacity and (ii) sufficient stability during operation [21]. The applied dyes can be categorized into two groups: metal-based complexes, such as ruthenium polypyridyl, and metal-free organic dyes. In this context, natural dyes, such as those extracted from fruits and leaves, fall into the metal-free organic category [22]. Metal-based sensitizers, especially ruthenium complexes such as the N3 or N719 dyes, are engineered with electron-donating and electron-withdrawing groups, while anchoring groups maintain strong attachment to the semiconductor layer [23]. Beyond that, many researchers have explored natural dyes because of their environmental friendliness and cost-effectiveness. In Hosseinnezhad et al.’s comparative investigation, a wide selection of natural dyes, including pomegranate, orange peel, and red onion, was examined; the pomegranate-based DSSC achieved the highest efficiency [24]. The third major component of the solar cell is the electrolyte solution, which is expected to exhibit high charge mobility, chemical stability, and non-corrosivity [25]. The iodide/triiodide redox pair, originally implemented in the first dye-sensitized solar cell, remains a widely popular choice [26].
The power conversion efficiency of the solar cell is defined as $\eta = \frac{V_{OC} \cdot J_{SC} \cdot FF}{P_{in}}$, where $V_{OC}$ is the open-circuit voltage, $J_{SC}$ is the short-circuit current density, $FF$ is the fill factor, and $P_{in}$ denotes the incoming light intensity. It can be observed from the formula that increasing the parameters in the numerator ($V_{OC}$, $J_{SC}$, $FF$) can provide higher efficiency under standardized measurement settings. In this context, a high $V_{OC}$ value can be achieved when the redox potential of the electrolyte is more positive, such as with a Co(II/III) electrolyte solution, resulting in a greater potential difference. On the other hand, $J_{SC}$ can be increased by broadening the spectral absorption of the panchromatic dye to enhance the photon harvesting capability. Furthermore, taking the last parameter of the efficiency equation into account, the FF can be improved by increasing the shunt resistance and lowering the series resistance [7].
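As a brief illustration, substituting representative (assumed, not measured) DSSC values into the formula gives:

```latex
% Assumed illustrative values: V_OC = 0.70 V, J_SC = 15 mA/cm^2,
% FF = 0.70, P_in = 100 mW/cm^2 (standard AM1.5G intensity)
\eta = \frac{V_{OC}\, J_{SC}\, FF}{P_{in}}
     = \frac{0.70 \times 15 \times 0.70}{100}
     = 0.0735 \approx 7.4\%
```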
While several studies have focused on improving the performance of the DSSC through material modifications, simulation techniques play an important role in understanding and predicting device behavior. Among the various modeling approaches, traditional physical modeling has been employed to explore processes such as diffusion and charge recombination under stationary, non-steady-state, one-dimensional, and multidimensional conditions. Within this framework, numerical approaches such as the finite difference method and the finite element method have been applied, with governing equations that include the transport and diffusion equations. Ni et al. explored the effects of electrode morphology, including thickness, porosity, and the behavior of the titanium dioxide/transparent conductive oxide interface [27]. The work of Mitroi et al. represents a further simplification of the physical model by considering only diffusive electron transport and modeling the recombination pathway as a linear process [28]. Other researchers investigated the applicability of DSSCs under real environmental conditions using meteorological datasets [29]. Miettunen et al. applied the Nernst–Planck equation to model the two-dimensional distribution of charges and transient phenomena; their findings revealed that even a minor gap on the photoelectrode side can modify the ion arrangement [30].
These physical models can provide detailed insights into the working principles of the solar cell; however, their mathematical complexity makes them computationally intensive, and they may fail to describe the full cell behavior. In this regard, machine learning algorithms have contributed significantly to optimizing power systems, transforming how solar energy systems are operated. These models are efficient alternatives, capable of handling large multidimensional datasets, extracting patterns between parameters, and providing predictions in a cost-effective form [31]. Therefore, applying these approaches to dye-sensitized solar cells can help to explore material design in an efficient, data-driven way. Sahu et al. investigated the efficiency of organic solar cells using gradient boosting (GB), Random Forest (RF), and Artificial Neural Network (ANN) algorithms on a dataset constructed for 280 small-molecule organic solar cells. It was found that, beyond the HOMO and LUMO levels, the orbitals located energetically close to them play a crucial role in dissociation and charge conduction [32]. Likewise, Xu et al. investigated the absorption maxima of organic dyes for DSSCs using an ANN-based Quantitative Structure–Property Relationship (QSPR) method, which utilizes numerical parameters extracted from molecular structures to predict the chemical properties of a compound. To analyze the entire set of 70 dyes, a genetic algorithm was applied, and based on their model, the maximum absorption of a new, not yet synthesized dye can be accurately predicted from its molecular structure [33]. Similarly, Venkatraman et al. explored metal-free dye characterization, aiming to predict the direction of absorption spectrum shifts upon dye adsorption onto the TiO2 surface using six different classification algorithms; among them, RF and SVM achieved the highest model accuracy, between 70% and 80% [34]. Beyond that, several other researchers have used machine learning for analyzing organic cells, addressing the urgent need to improve the efficiency of organic devices [35,36]. Among these methods, RF and ANN are frequently encountered.
Furthermore, Yosipof et al. investigated metal oxide-based (TiO2 and copper-based) cells using machine learning techniques such as KNN and genetic programming (GP). The thickness of the window layer, the thickness ratio between the window and absorber layers, the measured resistance of the absorber layer, the band gap, and the distance of the sample from the central position in the deposition system were considered as input parameters, while the short-circuit current density (JSC), VOC, and internal quantum efficiency (IQE) were the target variables. The analysis confirmed that the thickness parameter was the most important [37]. Alongside this, the study by Sutar et al. investigated the performance of hydrothermally synthesized ZnO-based DSSCs using statistical and ML methods. In their classification approach, which involved DT, RF, and ANN models, the key input parameters were the morphology, dye, seed layer, precursor, synthesis time, and temperature. The results indicated that morphology and synthesis temperature had the greatest impact on the performance of the DSSC [38]. Similarly, Onah et al. investigated the electrical parameters of Eu3+-doped Y2WO6/TiO2 photo-electrode DSSCs using Support Vector Machine (SVM) regression and RF regression; of the two techniques, RF demonstrated greater accuracy [39]. Additionally, Al-Sabana and Abdellatif focused on TiO2-based DSSCs, selecting the TiO2 layer thickness and its porosity as the key parameters for optimization, while the dye and nanoparticle fabrication method were identical across all samples. Their RF model reached an accuracy of 99.87%, and it was reported that an optimal efficiency of 4.17% can be achieved at 7 µm thickness and 63% porosity [40]. These research findings indicate that conventional statistical methods are often unable to capture complex patterns in a high-dimensional parameter space. Furthermore, the absence of decision-support mechanisms means that each parameter must be tested experimentally, leading to extended development time and costs.
To address the above-mentioned challenges, the aim of the current research is to develop a comprehensive database of TiO2- (and ZnO-) based DSSCs. Subsequently, machine learning-based predictive classification models will be implemented on the developed database to forecast the efficiency of the DSSC, thereby contributing to decision-making during the initial stage of production. As a preliminary step, a pilot study was conducted in which the VOC and short-circuit current density (JSC) parameters were investigated using classification models, without applying synthetic data generation. The input parameters were the type of semiconductor layer, the thickness of the semiconductor layer, the morphological structure, the precursor, the photosensitizing dye, and the synthesis temperature and time. The results indicate that VOC could be predicted with an accuracy of 71% using an SVM with a total of 71 support vectors and polynomial weights, while JSC achieved 80% accuracy using either a KNN model (k = 5) with rectangular weights and Euclidean distance or a linearly weighted SVM with 74 support vectors [41]. The novelty of the manuscript lies in building a DSSC-focused process and material database, followed by controlled synthetic data augmentation and explainability analysis to identify the key manufacturing parameters.
2. Materials and Methods
This section provides a comprehensive description of the methodology employed in the current study and outlines the application of machine learning methods as a crucial tool in the engineering and mathematical fields.
2.1. Methodology Approach
To fulfill the research objectives, an intensive literature review was conducted, and a database was developed on this basis. The database contains 213 publications sourced from Scopus, Web of Science, and supplementary literature such as the dataset described by Sutar et al. [38]. Whenever earlier studies provided supplementary information, the reported data were double-checked.
Figure 1 represents the flowchart of the research framework, from data collection through model evaluation to result presentation.
The database developed serves as a basis for the current research. Data extraction can be performed through several methods: (i) data mining; (ii) semi-automated data extraction; (iii) manual data extraction. Among the methods mentioned, in this study, the manual approach was applied, allowing accurate expert filtering, although this method is time-consuming compared to the others. Furthermore, the processed publications were published from 2003 to 2025.
Figure 2 illustrates the average efficiency of DSSCs as a function of publication year in the form of a bubble plot, where the size of each bubble is proportional to the number of publications in the field of dye-sensitized solar cells in that year. The horizontal axis shows the years, while the vertical axis represents the average efficiency in percent. The figure shows that TiO2-based DSSCs achieved higher efficiencies than ZnO-based DSSCs.
The input parameters were chosen based on material, structural, and fabrication-related factors that have been repeatedly identified in the literature as critical to device performance. These features include optical/electrical characteristics (e.g., dye type, concentration of the applied dye, absorption maxima, TCO/ITO resistance, electrolyte composition), morphological descriptors (e.g., type of thin film, thickness of thin film, and morphological structure), synthesis conditions (e.g., synthesis temperature, synthesis time, precursor type), and testing conditions (e.g., operating temperature, irradiance). This feature set provides a comprehensive basis for forecasting the performance of the DSSC more accurately. The target parameters were efficiency, short-circuit current density, open-circuit voltage, and fill factor. On the other hand, the database reveals that, except for some isolated examples, the iodide/triiodide redox couple was employed in most devices, confirming its ongoing role as the most widespread and stable electrolyte system in DSSCs. Additionally, in these electrolyte systems, acetonitrile, valeronitrile, 3-methoxypropionitrile, and propylene carbonate are the most commonly used solvents. Consequently, the type of electrolyte solution is not expected to influence the performance of the machine learning model; however, the concentrations of I− and I3− could serve as key variables. On the other hand, various techniques are available for synthesizing the semiconductor layer, including the sol–gel method, spin coating, spray coating, dip coating, and the hydrothermal method. Of these, the hydrothermal process is known to be preferred due to superior material properties [42]. According to the database, except for a few publications, the hydrothermal approach was the most frequently adopted. A similar observation can be made regarding the testing temperature: most experiments were conducted at room temperature, and it is known that efficiency tends to decrease as the temperature rises. Similar patterns can be seen for the testing irradiance.
Morphological structures were categorized into three groups based on similarities in dimensions, structural composition, or fabrication methods. The classification was guided by two key principles: (i) geometric dimensionality and (ii) the statistical robustness of the data. The first group (labeled 0) includes microspheres, nanospheres, nanoparticles, nanocrystals, nanobeads, and nanoclusters. The category labeled 1 contains microrods, nanorods, nanowires, nanobelts, nanotubes, nanoneedles, and nanobullets. The set of structures in group 2 consists of microarchitectures, nanosheets, nanoplates, nanodisks, nanoflakes, nanoforests, nanoflowers, nanomushrooms, nanograss, nanopyramids, nanostars, and nanocorals. In other words, group 0 is composed of symmetrical spheres and point-like particles, group 1 is marked by linear and elongated forms, and group 2 contains branched and complex configurations.
For analytical purposes, the dye molecules were also classified into categories. The first group, labeled 1, comprised solely the N719 dye, which is widely used because of its broad spectral coverage. The second group, labeled 2, includes further ruthenium polypyridyl-based complexes, namely N3, Z907, LEG-4, C-218, and N945; these complexes not only provide high stability but also exhibit a similar electron transfer mechanism, meaning that the excited electrons are injected into the conduction band of the semiconductor layer in a similar fashion, although the ligand structures differ. Group 3 contained organic indoline- and triarylamine-based dyes, including D102, D149, D131, and D205, which are metal-free and have a high molar extinction coefficient; the absence of metal elements makes them cost-effective and eco-friendly. Both indoline- and triarylamine-based dyes belong to the donor–π-bridge–acceptor (D–π–A) type of molecules, where the donor (D), the conjugated π-bridge, and the acceptor (A) together determine their functionality. The fourth group represented coumarin and NKX-based derivatives, whereas group 5 consisted of natural dye extracts, such as fruit and leaf extracts.
To ensure the statistical reliability of the precursor categories, groups with a limited sample size were merged on the basis of comparable chemical features. During the grouping process, six groups were identified: (1) the first group consisted of zinc nitrate, which forms a separate, standalone category because of its statistical robustness; (2) the second group included zinc acetate and zinc chloride, both categorized as zinc salts; (3) group 3 contained Ti-alkoxides with comparable condensation features, such as titanium diisopropoxide bis(acetylacetonate) and titanium butoxide; (4) although TTIP is also a Ti-alkoxide, its reaction kinetics differ from the others and it is strongly represented in the dataset, so it forms its own group; (5) group 5 covered halide and sulfate precursors in solution form, such as TiCl4 and Ti(SO4)2; (6) the last group included metallic titanium and commercial, non-solution Ti-sources.
Alongside the presentation of the data, the data-cleaning procedure should also be described. The aim of data cleaning is to enhance the quality of the dataset by detecting and eliminating errors. A critical aspect is the handling of missing values, since every attribute must be known for further processing. Several approaches exist for addressing missing values: (i) removing incomplete rows; (ii) replacing missing values with the mean value; and (iii) replacing missing values with the most probable value. In this study, the latter approach was applied. To enhance the robustness of machine learning model training, synthetic data generation was also employed; since the available dataset is limited and data collection is time-consuming, this approach offers notable advantages. During synthetic data generation, the statistical characteristics of the original dataset are preserved, while values are stochastically modified within a predefined range. For these purposes, the original values of the synthesis temperature, synthesis time, efficiency, short-circuit current density, open-circuit voltage, and fill factor columns were altered within a range.
Along with this, quantile-based data categorization methods are crucial in machine learning models. Following Sutar et al. [38], feature categorization was applied to parameters such as synthesis temperature, synthesis time, and efficiency. In the current work, quantile-based categorization is implemented for these parameters using the minimum, maximum, and quartile boundaries: values between the minimum and the first quartile (Q1) were categorized as low, values between Q1 and the second quartile (Q2) as medium, values between Q2 and the third quartile (Q3) as high, and values between Q3 and the maximum as very high.
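A minimal sketch of this quantile-based categorization in Python with pandas is given below; the column name and example values are illustrative assumptions, not entries from the actual database.

```python
import pandas as pd

# Assumed illustrative data; the real database columns are analogous.
df = pd.DataFrame({"efficiency": [1.2, 2.8, 3.5, 4.1, 5.0, 6.3, 7.4, 8.8]})

# Quantile-based binning: min-Q1 -> low, Q1-Q2 -> medium,
# Q2-Q3 -> high, Q3-max -> very high.
df["efficiency_cat"] = pd.qcut(
    df["efficiency"],
    q=[0, 0.25, 0.5, 0.75, 1.0],
    labels=["low", "medium", "high", "very high"],
)
print(df)
```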
Correspondingly, these parameters are summarized in Table 1, including the number of observations. It should be highlighted that Table 1 does not include the synthetically generated data; synthetic data generation allowed the number of entries in each group to be expanded. In some rows of the table, the ordinal number and variable columns are empty, as these parameters were included in the machine learning models as continuous variables.
2.2. Classification Learners
After the data processing and preparation, classification models were defined based on the literature review.
2.2.1. Decision Tree Classification
One of the most widely used and well-known ML classifiers is the Decision Tree (DT), which produces a tree-shaped flowchart that begins at the root and moves through branches to internal nodes; the nodes reached represent the decisions [43]. This hierarchical structure encodes a set of classification rules. Among Decision Trees, the binary DT, in which each internal node has two children, is the most common. The end nodes of the tree correspond to distinct decision outcomes and lead to conclusions, making the method favorable for decision-making processes [44]. The tree is built by an iterative greedy search, in which the data is split step by step to reduce the impurity of the subsets [45]. The process stops when no further attributes are available for splitting or when all data points are assigned to a class. Among the several DT algorithms, one of the most popular is CART (Classification and Regression Tree), in which the Gini index is used for splitting attributes. During the definition of the parameter space, the maximum depth was set to [3, 5, 10, None], the minimum samples per split to [2, 5, 10], and the minimum samples per leaf to [1, 2, 4]. For the splitting criterion, Gini and entropy were considered. In addition, “GridSearchCV” was applied to identify the combination of hyperparameters in the defined parameter space that achieves the best model performance.
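The following sketch shows how such a search could be set up with scikit-learn; the synthetic data is an assumed stand-in for the encoded DSSC feature matrix, and macro-averaged F1 is an assumed multi-class scoring choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded DSSC features and efficiency classes.
X, y = make_classification(n_samples=400, n_features=12, n_classes=4,
                           n_informative=6, random_state=0)

# Parameter space as described in the text.
param_grid = {
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "criterion": ["gini", "entropy"],
}

search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid,
                      cv=10, scoring="f1_macro")  # 10-fold CV, F1 scoring
search.fit(X, y)
print(search.best_params_, search.best_score_)
```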
2.2.2. Random Forest Classification
The Random Forest (RF) builds on the concept of the DT and relies on the outcomes of numerous independently generated trees. This strategy helps to avoid overfitting while improving accuracy and robustness. In other words, the RF model trains a set of separate, individual DTs, each of which produces its own prediction [46]. One advantage of the model is that it can handle non-linear relationships. The result is obtained by aggregating the individual predictions through majority voting [47]. Due to the complexity of the model, visualization is more difficult than for a single DT. For the RF model, the “GridSearchCV” approach was also applied: the number of trees (n_estimators) was set to [100, 200], the maximum depth of the trees was tested with 10 and unlimited ([10, None]), the minimum samples per split were [2, 5], and the minimum samples per leaf were [1, 2]. The same splitting criteria were used as for the DT classification. Furthermore, the Out-of-Bag (OOB) score was taken into consideration, as it evaluates the prediction accuracy on data points excluded from the bootstrap sampling process, thereby providing an internal validation of the RF.
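A corresponding sketch for the RF search, again on assumed synthetic data, showing how the OOB score of the refitted best estimator can be read out:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=400, n_features=12, n_classes=4,
                           n_informative=6, random_state=0)

param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [10, None],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
    "criterion": ["gini", "entropy"],
}

# oob_score=True evaluates each tree on the samples left out of its bootstrap.
rf = RandomForestClassifier(oob_score=True, random_state=0)
search = GridSearchCV(rf, param_grid, cv=10, scoring="f1_macro")
search.fit(X, y)
print(search.best_params_)
print("OOB score:", search.best_estimator_.oob_score_)
```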
2.2.3. K-Nearest Neighbors Classification
Based on the work of Li et al., the RF and K-Nearest Neighbors (KNN) algorithms are the most suitable for predicting I-V characteristics [31]. The KNN classifier is considered a “lazy” classifier because it does not construct an explicit model during training [48]. The algorithm consists of three major steps. (i) Distance measurement: the predefined distance between data points in the training and test sets is determined, where the data points are characterized by n-dimensional numerical vectors embedded in an n-dimensional feature space; the Euclidean (L2 norm) and Manhattan (L1 norm) distances are the most widely applied metrics [49]. (ii) Selecting the nearest neighbors: the k training samples with the shortest distance from the given test data point are identified, taking the applied metric into consideration. (iii) Classification: the final decision is determined by majority vote [39]. In the parameter grid, the number of neighbors was defined as [3, 5, 7, 9, 11]. Two weighting schemes were employed: uniform, which weights each neighbor equally, and distance, in which closer neighbors have a stronger impact. As distance metrics, the L1 norm, L2 norm, and Minkowski distance were tested.
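A hedged sketch of the KNN grid under the same synthetic-data assumption:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=12, n_classes=4,
                           n_informative=6, random_state=0)

param_grid = {
    "n_neighbors": [3, 5, 7, 9, 11],
    "weights": ["uniform", "distance"],                 # equal vs. distance-based
    "metric": ["manhattan", "euclidean", "minkowski"],  # L1, L2, generalized
}

search = GridSearchCV(KNeighborsClassifier(), param_grid,
                      cv=10, scoring="f1_macro")
search.fit(X, y)
print(search.best_params_)
```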
2.2.4. Support Vector Machine Classification
The next classifier used was the Support Vector Machine (SVM), whose aim is to find the hyperplane (decision boundary) with the maximum margin between the data points of distinct classes [50]. The algorithm can be applied effectively with limited data samples, and it can manage non-linear classification tasks by employing kernel functions that map the data into higher-dimensional feature spaces [51]. In the parameter search process, the regularization parameter (C) was set to the values [0.01, 0.1, 1, 5, 10]; at lower values the model produces a smoother decision surface, whereas higher values impose stricter classification rules. Three kernel functions were considered in the parameter space: linear, radial basis function (RBF), and polynomial, of which the RBF kernel is the most widely applied. The gamma parameter (γ) controls the radius of influence of the support vectors; for this parameter, the scale and auto options were tested, and the parameter optimization was performed through “GridSearchCV”.
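Under the same synthetic-data assumption, the SVM search could look like this:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=12, n_classes=4,
                           n_informative=6, random_state=0)

param_grid = {
    "C": [0.01, 0.1, 1, 5, 10],        # low C -> smoother decision surface
    "kernel": ["linear", "rbf", "poly"],
    "gamma": ["scale", "auto"],        # radius of influence of support vectors
}

search = GridSearchCV(SVC(), param_grid, cv=10, scoring="f1_macro")
search.fit(X, y)
print(search.best_params_)
```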
2.2.5. XGBoost Classification
XGBoost, also known as Extreme Gradient Boosting, is one of the most popular and effective ML algorithms based on gradient boosting. The model iteratively combines multiple weak learners, most commonly DTs, with the objective of minimizing a loss function. A regularization term is included in the algorithm to help prevent overfitting. A further advantage of the model is its high efficiency in identifying both linear and non-linear patterns [52]. In the parameter space definition, the number of estimators, which represents the number of Decision Trees, was set to [100, 200]. The maximum depth of the trees was regulated as [3, 5, 10], whereas the learning rate was tuned over [0.01, 0.1, 0.2] to achieve a trade-off between convergence speed and generalization. The subsample parameter, which defines the proportion of the dataset used in each iteration, was set to [0.8, 1], while the parameter defining the fraction of input variables each tree can utilize was tested at [0.8, 1].
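A sketch of this search with the xgboost Python package and synthetic stand-in data; the per-tree feature fraction described above corresponds to xgboost's `colsample_bytree` parameter.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

X, y = make_classification(n_samples=400, n_features=12, n_classes=4,
                           n_informative=6, random_state=0)

param_grid = {
    "n_estimators": [100, 200],          # number of boosted trees
    "max_depth": [3, 5, 10],
    "learning_rate": [0.01, 0.1, 0.2],
    "subsample": [0.8, 1],               # row fraction per iteration
    "colsample_bytree": [0.8, 1],        # feature fraction per tree
}

search = GridSearchCV(XGBClassifier(eval_metric="mlogloss"), param_grid,
                      cv=10, scoring="f1_macro")
search.fit(X, y)
print(search.best_params_)
```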
2.2.6. Artificial Neural Network Classification
The last applied ML model was the Artificial Neural Network (ANN), which is inspired by the function of the human brain and reflects the interplay between neurobiology and computer science. ANNs are built from artificial neurons that communicate with each other by exchanging signals across weighted connections [53]. The layers between the input and output layers are called hidden layers; in these layers, the output of each neuron becomes the input of the following layer's neurons, and the strength of the connections between layers is characterized by weights [54]. The output of the ANN is generated by activation functions, also known as transfer functions; linear, sigmoid, hyperbolic tangent, logistic, and radial functions are commonly used in practice [55]. For the ANN, hyperparameter optimization of the multilayer perceptron classifier was performed via grid search. The parameter grid included the architecture of the hidden layers and the type of activation function (linear, rectified linear, hyperbolic tangent, and logistic). As optimization algorithms, adaptive moment estimation (adam) and stochastic gradient descent (sgd) were tested, while for the regularization parameter, the values [0.0001, 0.01, 0.1, 1.0] were examined. For the learning rate strategy, constant, gradually decreasing, and adaptive schedules were taken into account.
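A sketch with scikit-learn's MLPClassifier under the same synthetic-data assumption; the hidden-layer architectures in the grid are assumptions (the results in Section 3 mention two hidden layers of 50 neurons).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=400, n_features=12, n_classes=4,
                           n_informative=6, random_state=0)

param_grid = {
    "hidden_layer_sizes": [(50,), (50, 50)],            # assumed architectures
    "activation": ["identity", "relu", "tanh", "logistic"],
    "solver": ["adam", "sgd"],
    "alpha": [0.0001, 0.01, 0.1, 1.0],                  # regularization strength
    "learning_rate": ["constant", "invscaling", "adaptive"],
}

search = GridSearchCV(MLPClassifier(max_iter=2000, random_state=0),
                      param_grid, cv=10, scoring="f1_macro")
search.fit(X, y)
print(search.best_params_)
```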
2.2.7. Evaluation Metrics to Compare the Models
All models were evaluated using 10-fold cross-validation, configured to enhance robustness and estimation accuracy. For the scoring criterion, the F1 metric was applied; since it takes both precision and recall into account, it is a favorable measure. In addition, labeled data were used for model training to enable verification of the model's performance.
For the comparison of results, the following metrics were taken into consideration. Table 2 lists the evaluation metrics, the equation of each metric, and their descriptions, where TP denotes True Positive, FP False Positive, FN False Negative, TN True Negative, and MCC the Matthews Correlation Coefficient.
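For reference, the standard definitions behind these metrics in the binary case (presumably the forms tabulated in Table 2) are:

```latex
\mathrm{Accuracy}  = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall}    = \frac{TP}{TP + FN}
\]
\[
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \qquad
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}
```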
In addition, the accuracy metric was considered in the final model selection. The threshold value of 63% was adopted from the study of Maddah [45], who also applied machine learning algorithms to investigate the dyes of DSSCs.
3. Results
After database preparation and preprocessing, machine learning models were employed to predict the performance of the DSSC. The models were trained on 80% of the dataset, and the remaining 20% was used for validation. This chapter focuses on the evaluation of the trained models.
In the case of the DT classification model, it was found that the model performed best with the entropy criterion and no restriction on tree depth, using two samples per split and one sample per leaf. The model achieved a cross-validation accuracy of 0.956 and a test set accuracy of 0.963. In addition, the Area Under Curve (AUC) was assessed during the evaluation; this metric indicates how well the model can distinguish between different classes, and the closer the value is to 1, the better the classifier. The model demonstrated excellent classification with an AUC value of 0.976. After parameter tuning of the RF classifier, the entropy splitting rule with unrestricted tree depth showed the highest performance. As with the DT, the minimum number of samples per leaf was 1, and two samples per split were applied. Within the applied parameter space of the RF, the best cross-validation accuracy was 0.957, the test set accuracy reached 0.963, and the calculated overall AUC value was 0.976. The OOB metric was also taken into account; it assesses how accurately each tree predicts the data points excluded from its bootstrap training set. The OOB value of the RF classifier is 0.953. Since the performances of the DT and RF are similar, this suggests that the DT classifier already learns the pattern adequately, and the RF classifier reproduces it.
In addition, the optimal prediction performance of the SVM classifier was obtained under the following configuration: the C value was set to 10, gamma was set to “scale” (corresponding to a value of 0.066), and the kernel function was the radial basis function. The performance of the SVM was lower (cross-validation accuracy of 0.816 and test set accuracy of 0.848) than that of the DT and RF models. For the next classifier, KNN, the Manhattan distance produced the best result, with the number of neighbors set to k = 7 and distance-based weighting applied, meaning that closer neighbors have a stronger impact on the prediction. Based on the cross-validation accuracy (0.957), test set accuracy (0.963), and average AUC value (0.976), the KNN model performs similarly to the DT and RF. Along with this, the XGBoost classification model reached its optimal predictive performance with the following parameter settings: (i) every input feature was included during tree construction; (ii) a moderately low learning rate of 0.1 was used to ensure stable training; (iii) the maximum depth of the trees was 10; and (iv) 100 Decision Trees were built in sequence. This model also belongs to the best-performing algorithms. For the ANN, the most accurate predictions were achieved with the following settings: (i) the activation function was the hyperbolic tangent; (ii) the regularization parameter was set to 0.001 to mitigate overfitting; (iii) the network architecture consisted of two hidden layers with 50 neurons each; (iv) the learning rate was constant; and (v) the widely used Adam optimizer was employed.
Moreover, Table 3 shows the evaluation of the six applied machine learning models, taking the cross-validation accuracy, test set accuracy, and average AUC metrics into account. Considering the preceding analysis and Table 3, all machine learning algorithms deliver accurate predictions (accuracy values higher than 0.8). In the case of the XGB classifier, the test accuracy (0.954) is slightly lower than the cross-validation accuracy (0.959), which might indicate slight overfitting, but the accuracy still confirms that the model is reliable and stable. In addition, Figure 3 visualizes a comparative heatmap of the precision, recall, F1 score, and Matthews Correlation Coefficient (MCC) metrics for the six applied machine learning algorithms; the greener a row appears, the better the corresponding model performs.
Based on Figure 3, it can be concluded that the DT, RF, and KNN models achieved very similar, high performance across the precision, recall, F1 score, and MCC metrics. The differences between the classes are minimal, which is also reflected in the deep green coloring of the heatmap. These findings are further supported by the confusion matrices, where the DT and RF models produced identical outcomes and KNN showed only minor deviations, while the ANN and SVM revealed larger differences. In a confusion matrix, the prediction accuracy of a model is higher when the off-diagonal entries approach zero. In this context, Table 4 presents the confusion matrices of the applied machine learning models, with the diagonal cells highlighted in green for clarity. Table 4 also reveals that, in the case of the Support Vector Machine classification model, notable errors occur outside the diagonal: the model often misclassifies the second and third classes, and confusion also appears in the fourth class. In contrast, XGBoost performs well compared with the SVM classifier, although its accuracy remains slightly below that of the DT, RF, and KNN models. A small increase in misclassification also occurred for the Artificial Neural Network classifier in specific classes. The assessment of precision, recall, F1 score, and MCC, alongside the analysis of the confusion matrices, confirms that there is no overfitting, since the strong class-level performance is corroborated by the confusion matrices.
The accuracy values of all models exceeded the 63% threshold; nevertheless, taking the previously described results into consideration, the most accurate classification models were the Decision Tree, Random Forest, and K-Nearest Neighbors classifiers.
Figure 4 illustrates the Decision Tree structure to a depth of three. Due to the large number of nodes and legibility constraints, only the main hierarchical levels are presented. These levels highlight the most influential variables of the decision process and show how the branching pattern was formed. The first level of the DT is the synthesis temperature, at which the Decision Tree splits into two branches at 98.5 °C. Further divisions were made by the synthesis time and the thickness of the thin film. In each box, below the category name, the “samples” entry indicates what percentage of the dataset falls into that node, the “value” field describes the categorical class distribution, and the “class” entry highlights the dominant class in that branch. At the second level, the synthesis time split occurs at 19 h: if the synthesis time is lower than or equal to 19 h, the dye determines the next split, whereas if it exceeds 19 h, the next split is determined by the TCO/ITO resistance. Additionally, when the synthesis temperature exceeds 98.5 °C, the decision process continues with the thickness of the thin film: if the thickness is lower than or equal to 10.15 µm, the dye becomes the next splitting feature, whereas above 10.15 µm, the precursor drives the split.
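A figure of this kind can be produced with scikit-learn's tree-plotting utility; the sketch below uses a synthetic stand-in rather than the study's fitted model, whose feature names and split thresholds would appear in the real plot.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Synthetic stand-in; in the study, `clf` is the tuned Decision Tree.
X, y = make_classification(n_samples=400, n_features=6, n_classes=4,
                           n_informative=4, random_state=0)
clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

plt.figure(figsize=(16, 8))
plot_tree(clf,
          max_depth=3,       # show only the first three hierarchy levels
          proportion=True,   # "samples" shown as fractions of the dataset
          filled=True)
plt.show()
```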
The next part focuses on the feature importance in the case of the Decision Tree Classification and the Random Forest Classification models. For the K-Nearest Neighbors Classification model, feature importance cannot be discussed because the model does not build decision rules; it computes the distance between data points.
Figure 5 shows the feature importance plots in the case of the Decision Tree Classification and Random Forest Classification models, where the ranking of the top four features is consistent. It must be noted that there are differences in the relative importances of each classification model, but the order of the first four features remains the same.
Based on the figure, it can be observed that the thickness of the thin film is the most influential in both the DT classifier (with 20.96%) and the RF classifier (with 17.65%). The next feature that has a great impact on the models is the synthesis temperature (in the case of the DT: 16.79%, and in the case of the RF: 12.72%). The third most impactful feature is the synthesis time, which has a contribution of 12.5% in DT and 10.52% in RF models. In addition, the precursor feature is the fourth in the ranking with a value of approximately 9.5% in each model. These values represent how much the predictive performance of the model would decrease without the given feature in percentage terms. What is more, the initial three levels of the DT highlight similar influencing feature sets. Furthermore, the type of electrolyte and temperature have less than 1% contribution to the performance of the models.
4. Discussion
After the comparison of the six applied machine learning models, it can be concluded that the DT, RF, and KNN algorithms are the most accurate, achieving an accuracy of 96% and AUC values above 95%. According to the literature, tree-based approaches outperform more complex algorithms when the working database is limited [57,58]. The literature also suggests that the number of trees in RF classifiers should be between 64 and 128 [59]; in this context, the number of trees in the Random Forest model was set to 100. Although the SVM model demonstrated lower values in all metrics and misclassification occurred in the second and third classes, its accuracy values still exceeded 0.8. The ANN and XGBoost classifiers provided strong outcomes, but slightly more misclassification was observed in certain groups.
SHapley Additive exPlanations (SHAP) is an additive attribution approach based on the Shapley value from game theory; it provides local accuracy (additivity) and consistency [39,60]. The sign of a SHAP value determines whether the feature increases or decreases the output: a positive value pushes the output up. The magnitude of the SHAP value indicates the strength of the feature's contribution to the prediction.
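A minimal sketch of such an analysis with the shap package, using a synthetic stand-in for the DSSC dataset and models:

```python
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in; in the study, the fitted DT/RF models and the real
# feature matrix would be used instead.
X, y = make_classification(n_samples=400, n_features=12, n_classes=4,
                           n_informative=6, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# TreeExplainer computes Shapley values efficiently for tree ensembles.
# Depending on the shap version, multiclass output is a per-class list
# or a 3-D array.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Bar plot of mean |SHAP| per feature, broken down by class (as in Figure 6).
shap.summary_plot(shap_values, X, plot_type="bar")
```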
Figure 6 shows the SHAP plots for the Decision Tree and Random Forest models, where the bar length represents the mean absolute SHAP value, i.e., how strongly a feature influences the model on average; the feature contributions are broken down by class color.
For the Decision Tree, the SHAP analysis showed that the synthesis temperature was the dominant feature, followed by the thickness of the thin film and the precursor. The effect of each class on these parameters was consistent, confirming their key role in modeling DSSC performance. The dye feature exhibits higher importance for Class 3 and Class 0. The fifth most important feature is the synthesis time, followed by the morphological structure. This implies that the process parameter (synthesis temperature), along with the thickness of the thin film and the precursor, is dominant, and the dye is also highly influential. The morphological structure, the electrolyte concentrations, and the dye's absorption maximum form a mid-importance tier with moderate influence. The type of electrolyte and the measurement settings (such as temperature and irradiance) rank low, indicating negligible influence.
In the case of the Random Forest, the most dominant feature is the thickness of the thin film, followed by the precursor and the synthesis temperature. In addition, the type of thin film exhibits greater explanatory power than in the DT model. The results indicate that the RF assigns greater impact to structural/material descriptors than the DT, while the mid- and low-importance feature trends remain unchanged. Comparing the two models, both consistently rank the synthesis temperature and the thickness of the thin film as the main drivers, followed by the precursor and the dye.
Furthermore, it is worth comparing the current study with results presented by other researchers. Examining the methodology and results of Sutar et al. [38], the present work takes a broader set of input parameters and classification algorithms into account, leading to a more detailed and complex evaluation. The feature importance investigation also provided a more in-depth perspective; for example, the thickness of the thin film plays an important role in the Decision Tree (20.96%) and Random Forest (17.65%) classification models, implying that the model performance would decrease by more than 17% without it. In addition, the absorption maximum, ranked as the sixth (DT) and fifth (RF) most important feature, contributes about 8% to the model performance. Moreover, Li et al. [31] described KNN, RF, and GB models as suitable for predicting current–voltage characteristics and the related electrical parameters such as efficiency. These findings are confirmed by the current study, in which DT, RF, KNN, and XGB performed remarkably well in predicting the efficiency of the dye-sensitized solar cell.
The proposed interpretable classification framework provides decision support even under limited-data circumstances, cutting the experimental burden and shortening development cycles. Furthermore, using SHAP, the most critical manufacturing factors have been ranked, which enables a more cost- and resource-efficient design-of-experiments (DoE).
5. Conclusions
A DSSC database has been constructed and evaluated with six classification algorithms (Decision Tree, Random Forest, Support Vector Machine, K-Nearest Neighbors, XGBoost, and Artificial Neural Network) using the Matthews Correlation Coefficient, F1 score, Area Under Curve, recall, precision, accuracy, and confusion matrices. Based on the evaluation, the most accurate classification models were the Decision Tree, Random Forest, and K-Nearest Neighbors. After the feature importance analysis, SHAP analysis was added to identify the key drivers. It was found that the synthesis temperature and the thin-film thickness are the dominant features, followed by the precursor and the dye, while the morphological structure, the electrolyte concentrations, and the dye's absorption maximum form a mid-importance tier. Consistent patterns were recognized between the DT and RF models in identifying the main drivers. The results suggest that, in optimizing the manufacturing process, targeted tuning of the synthesis temperature, the thickness of the thin film, the precursor, and the dye is likely to improve device performance; experimental effort should therefore concentrate on these factors. The results provide data-driven decision support for DSSC development, thereby promoting sustainable, green-energy solutions. This work contributes a DSSC-focused process and materials database and an explainable classification framework that supports early-stage choices, narrows the search space, and focuses the design of experiments (DoE), enabling cost- and time-efficient development. Finally, it is worthwhile to examine the results in relation to other common PV technologies, chiefly monocrystalline Si (mono-Si), the most widely deployed market player. Given the different operating mechanisms and device architectures of the two technologies, the present DSSC-specific database is not directly applicable to mono-Si, although the methods are broadly transferable. Mono-Si systems underpin high-efficiency, large-scale deployments, whereas DSSCs occupy niche yet growing roles in BIPV applications.