Energy Level Prediction of Organic Semiconductors for Photodetectors and Mining of a Photovoltaic Database to Search for New Building Units

Owing to the great versatility of organic semiconductors, selecting a suitable material for photodetectors is a challenging task. Integrating computer science and artificial intelligence with conventional methods of optimization and material synthesis can guide experimental researchers to design, predict, develop and discover high-performance materials for photodetectors. To find such materials, it is crucial to establish a relationship between photovoltaic properties and chemical structures before performing synthetic procedures in the laboratory. Moreover, fast prediction of energy levels is desirable for designing better organic semiconductor photodetectors. Herein, we first collected large sets of data containing the photovoltaic properties of organic semiconductor photodetectors reported in the literature. Molecular descriptors, which make it easy and fast to predict the required properties, were used to train machine learning models. Power conversion efficiency and energy levels were predicted. Multiple models were trained using experimental data. The light gradient boosting machine (LGBM) regression model and the Hist gradient boosting regression model were the best models and were further tuned to improve their predictive ability. The reliability of the approach was further verified by mining the photovoltaic database to search for new building units. The results revealed good consistency between experimental outcomes and model predictions, indicating that machine learning is a powerful approach for predicting the properties of photodetectors, which can facilitate their rapid development in various fields.


Introduction
The world is a place of discovery, and billions of devices containing multiple sensors have been commercialized. Photodetectors, or photosensors, primarily work as optical receivers that convert light into electrical signals. The photodetector has become a vital part of modern devices with a broad range of applications, including environmental monitoring, optical communication, health monitoring, image sensing, defense systems and industrial safety [1]. In the modern age, silicon (Si)-, germanium (Ge)-, and indium gallium arsenide (InGaAs)-based inorganic photodetectors (PDs) have been popular in the market due to their stable performance, high quantum efficiency, high sensitivity/detectivity, and response speed or responsivity. Despite having several advantages, were chosen for conducting further analyses. Moreover, detailed data about the energy levels (HOMO and LUMO) were visualized to show the trends (hidden patterns) in the data. The feature importance of the parameters was also evaluated for training the machine learning models. In addition, Pearson correlation and Shapiro ranking were applied to demonstrate the correlation between different parameters. A similarity analysis was performed to find similarities between a reference structure and the structures in the database. Furthermore, the photovoltaic database was mined to search for new building units.

Results and Discussions
The performance of varied materials depends on their chemistry [28][29][30]. Chemical data can help to understand their behavior [31,32]. The hidden pattern of data can provide much useful information [33].

Molecular Descriptors
Molecular descriptors are mathematical representations of molecules and are used to train machine learning models (Table 1) [34][35][36]. They are derived from the chemical structure of the molecules and can be generated with different algorithms. Their numerical values describe physical and chemical properties quantitatively [37,38]. Thousands of molecular descriptors were calculated and then shortlisted in several ways. Molecular descriptors are based on independent properties. They also allow similarity tests to be performed on molecular libraries with tools such as RDKit. Molecules with similar descriptor values can be expected to have similar physical and chemical properties.

Regression Analysis
The performance of machine learning depends strongly on the algorithm [39]. Regression analysis is a reliable way to identify which variables have a strong effect on the property of interest. It shows how the factors influence each other, which factors matter most, and which can be ignored. Regression analysis can be carried out with various machine learning algorithms. The data are split into two parts: a training set and a test set. Different training:test ratios, such as 70:30 and 60:40, were compared to find the best one. Afterwards, the correlation between the predicted and experimental PCE values was calculated, and the results were plotted as a graph.
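The splitting step can be illustrated with a minimal sketch that uses only the Python standard library. The function name `train_test_split`, the seed, and the example rows are assumptions made for illustration; the source does not state which tool or random seed was used for splitting.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle rows reproducibly and split them into (train, test) lists."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(rows)
    # A dedicated Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical (descriptor value, PCE) pairs, for illustration only.
data = [(float(i), 0.5 * i + 1.0) for i in range(10)]

for fraction in (0.7, 0.6):
    train, test = train_test_split(data, train_fraction=fraction)
    print(f"{fraction:.0%} training: {len(train)} train rows, {len(test)} test rows")
```

Fixing the seed makes the comparison between ratios repeatable, so differences in model quality come from the ratio rather than from a different shuffle.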

Pearson Ranking Correlation
In machine learning, the Pearson correlation is the most widely used correlation measure for numerical variables. It measures the degree of linear association between two variables and reflects how far the data points lie from the line of best fit. For this method to work effectively, the variables should be normally distributed. The direction and strength of the relationship are given by a coefficient between −1 and 1, where −1 indicates a perfect negative correlation, 1 a perfect positive correlation, and 0 no correlation.
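As a minimal sketch of how this coefficient is computed, the function below implements the textbook Pearson formula in plain Python. The descriptor and HOMO values are hypothetical numbers chosen for illustration and are not taken from the dataset.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of products of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical descriptor values and HOMO levels (eV), for illustration only.
descriptor = [1.2, 1.9, 2.4, 3.1, 3.8]
homo = [-5.60, -5.48, -5.41, -5.30, -5.22]
print(f"r = {pearson_r(descriptor, homo):.3f}")
```

A positive result corresponds to a red cell in the correlation plots discussed in this paper and a negative result to a blue cell.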

HOMO Prediction
A value between 0 and 1 indicates a positive correlation between the HOMO and a molecular descriptor, a value between 0 and −1 indicates a negative correlation, and a value of 0 indicates no correlation. Red indicates a positive correlation, while blue indicates a negative correlation. As shown in Figure 1, DECC, SdssC, NNRS, Nr05, TI2-L, and SM5-X are all red, so these molecular descriptors correlate positively with the HOMO on the x-axis. In contrast, SPMAD-AEAdm, RFD, RCI and Eta-D-AlphaB correlate negatively with the HOMO on the x-axis. In addition, TI2-L shows a strong positive correlation with the HOMO on the y-axis.

The calculated molecular descriptors are used as input for model training. They represent the chemistry of the donor molecules by encoding each chemical structure in numerical form. Figure 2 shows that Eta-D-AlphaB, RCI, RFD, and SPMAD-AEAdm are negatively correlated features in the Pearson ranking. In contrast, DECC, SdssC, NNRS, Nr05, TI2-L, and SM5-X are positively correlated features.

Shapiro Ranking
The Shapiro ranking is based on the Shapiro-Wilk test, a statistical test of normality published by Samuel Sanford Shapiro and Martin Wilk in 1965. The ranking uses the Shapiro-Wilk algorithm as implemented in the Yellowbrick Python package. It is a one-dimensional feature ranking: it considers a single feature at a time and assesses the normality of its distribution. The results are shown as a bar plot, with the highest-scoring features on one side and the features with average scores on the other. Figure 3 shows that RFD has the lowest score in this ranking.

Not all molecular descriptors contribute equally during model training [40,41]. It is therefore important to calculate the relative importance of every molecular descriptor used. A feature with high relative importance contributes most to the machine learning algorithm and is the most helpful for prediction. Figure 4 indicates that the molecular descriptor RCI has the lowest relative importance, and its contribution to the algorithms is extremely low. In contrast, Eta-D-AlphaB has a high relative importance and contributes more to the predictions than any other feature. The relative importance differs from feature to feature.
Different models were tested for their predictive capability (Table 2). The light gradient boosting machine (LGBM) and Hist gradient boosting models were used for further analysis. A residual plot helps to identify problems in a regression analysis. In the residual plot, the target variable is on the x-axis and the residuals are on the y-axis. A residual is the deviation of the predicted value from the actual value, so a data point far from the zero line marks a prediction that differs from the actual value. The residual plots for the LGBM and Hist gradient boosting regression models are shown in Figure 5 and Figure 6, respectively. The two models behave similarly. R² is the coefficient of determination for the test and training sets. The test and training residuals are small and lie near zero, which indicates accurate predictions. So, the results of both models are good.
Accurate machine learning predictions can reduce the dependence on expensive experimental techniques. Close agreement between predicted and experimental values shows that the model is accurate. Easy and fast prediction can speed up the design of new donor materials.
Scatter plots of the experimental and predicted HOMO values for the LGBM and Hist gradient boosting regression models are shown in Figure 7 and Figure 8, respectively. The residuals of each model are also plotted against the experimental or predicted value. Most values are in the low range, close to zero, which is a clear indication of accurate results. The training and test residuals are also close to zero. The results show that the Hist gradient boosting and LGBM regression models are the best models for the regression analysis.
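The two quantities used in this assessment, the residuals and the coefficient of determination, can be sketched in a few lines of plain Python. The sign convention (predicted minus actual) follows the definition of the residual given in the text, and the HOMO values below are hypothetical numbers for illustration only. Residuals close to zero correspond to an R² close to 1.

```python
def residuals(actual, predicted):
    """Deviation of each predicted value from the actual value."""
    return [p - a for a, p in zip(actual, predicted)]


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all actual values are equal")
    return 1.0 - ss_res / ss_tot


# Hypothetical experimental and predicted HOMO levels (eV), for illustration only.
experimental = [-5.40, -5.25, -5.60, -5.10]
predicted = [-5.35, -5.30, -5.55, -5.15]

print("residuals:", [round(r, 2) for r in residuals(experimental, predicted)])
print("R^2:", round(r_squared(experimental, predicted), 3))
```

Computing both values separately for the training and test sets shows whether a model that fits the training data well also generalizes to unseen molecules.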

LUMO Prediction
In the Pearson ranking, the correlation between the LUMO and the molecular descriptors is determined (Figure 9). Values from 0 to +1 indicate a positive correlation and are shown in red. Values from 0 to −1 indicate a negative correlation and are shown in blue. A value of zero indicates no correlation.
GATS1s is the molecular descriptor (Table 3) presented on the y-axis. It shows a positive correlation with the LUMO on the x-axis, while the other molecular descriptors on the y-axis show a negative correlation: their blue color indicates values between 0 and −1. In contrast, SPMAD-AEAdm, Eig04_EA(dm), EE_B(s), SM4_B(s), SM5_B(s), SM6_B(s), SHED-AL and Eig08_EA(dm) are the molecular descriptors on the x-axis. These show a positive correlation with the LUMO on the y-axis (red in color).
Table 3, entries 7 to 9:
7 | SM6_B(s) | 2D matrix-based descriptors | Spectral moment of order 6 from Burden matrix by I-State
8 | Eig08_EA(dm) | Edge adjacency indices | Eigenvalue n. 8 from edge adjacency mat. weighted by dipole moment
9 | SHED-AL | SHED Acceptor Lipophilic
The calculated molecular descriptors are used as input for model training and represent the chemistry of the donor molecules. Figure 10 shows that SPMAD-AEAdm, Eig04_EA(dm), EE_B(s), SM4_B(s), SM5_B(s), SM6_B(s), and Eig08_EA(dm) are negatively correlated features in the Pearson ranking, with values between 0 and −1. For the LUMO, only one molecular descriptor, GATS1s, is a positively correlated feature, with a value between 0 and +1.
The Shapiro ranking assesses the normality of the distribution of the molecular descriptors one descriptor at a time. In Figure 11, GATS1s shows the least normality according to the Shapiro ranking. In contrast, SPMAD-AEAdm and SM4_B(s) show the greatest normality.
Figure 11. The normality of all the features analyzed by the Shapiro test.
The relative importance of the features reflects how much each molecular descriptor contributes to the model. Not all molecular descriptors contribute equally during training, so it is important to calculate the relative importance of each one. A molecular descriptor with high relative importance contributes most to the predictions and is the most helpful for training the machine learning algorithms. Figure 12 shows that the molecular descriptor SM5_B(s) has the lowest relative importance, and its contribution to the algorithms is extremely low. In contrast, GATS1s, SPMAD-AEAdm, Eig04_EA(dm) and SHED-AL show high relative importance. The relative importance differs from feature to feature.
A variety of regression models were used for the predictions [42]; their R² values are given in Table 4. The LGBM and Hist gradient boosting models are considered the best-performing models for the prediction of the LUMO, and the molecular descriptors were analyzed with these two models. A residual plot is used to identify problems in a regression analysis. The target variable is on the x-axis and the residuals are on the y-axis, and the training and test data are shown together. When the training and test points behave alike and stay close to the zero line, accurate results are more likely; points far from the zero line mark predictions that differ from the actual values.
The residual plots for the LGBM and Hist gradient boosting regression models are shown in Figure 13 and Figure 14, respectively. The two models behave similarly. R² denotes the coefficient of determination for the test and training sets. For both models the residuals lie near zero, so the results of both models are considered good. Accurate machine learning predictions help to avoid expensive experimental techniques.
Scatter plots of the experimental and predicted LUMO values for the LGBM and Hist gradient boosting regression models are shown in Figure 15 and Figure 16, respectively. The residuals of the models are also plotted against the experimental or predicted value, which is mostly used to find problems with regression models. For data points above the zero line the residuals are positive, and for data points below it they are negative. The closer a data point is to zero, the more accurate the prediction. For both models most of the values lie in the low range, near zero, indicating accurate results.

Figure 16. Scatter plot between experimental and predicted LUMO using Hist gradient boosting.

Database Mining
The Clean Energy Project (CEP) database contains thousands of organic molecules that can be used for various photovoltaic applications [43,44]. A similarity analysis was performed to find suitable building units. O4TIC is a low-band-gap molecule. It contains a carbon-oxygen-bridged ladder-type unit whose strong electron-donating capability arises from the conjugation effect of the oxygen atoms. The band gap of the molecule decreases further when a more electron-rich group is attached in place of the central phenyl group, which increases the donating capability of the molecule [43,45]. The linear side chains increase the crystallinity, which in turn increases the mobility of the electrons. The top search hits for the O4TIC reference are given in Figure 17. The building blocks found are not overly similar to O4TIC; however, most are suitable for the design of polymers for organic solar cells. The top search hits for middle O4TIC are given in Figure 18. All the structures are unique and possible to synthesize.
The search hits for Y6 are given in Figure 19. Many groups can be used for polymer design, and after a little structural modification other groups can also be useful candidates. The search hits for middle Y8 are given in Figure 20. PM6 is used as a donor and Y6 as an acceptor, and combining these molecules increases the efficiency. PM6 is mostly used as a donor [46]. D18 is a copolymer with a narrow band gap [47]. Its backbone contains alternating electron-donating and electron-accepting fused-ring units. By combining Y6 as an acceptor with D18 as a donor, the efficiency can be increased. It is a medium band gap polymer.
The top search hits for Y5 are given in Figure 19. Many groups can be used for polymer design. After a minor structural modification, other groups can also be useful candidates. The top search hits for Y5 middle are given in Figure 20. The molecular core of Y5 is used as an electron-deficient group. The Y5 core structure is considered a high-performance non-fullerene acceptor (NFA). Because of its versatility, Y5 can be applied to both inverted and conventional OPV devices. OSCs based on NFAs can achieve longer device lifetimes with greater photochemical and thermal stability [46]. Combining the electron-deficient Y5 with five different donor polymers could lead to enhanced efficiency [47].
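The idea behind such a similarity search can be sketched as follows. This is a minimal illustration under stated assumptions: the source does not name the similarity metric, so the Tanimoto coefficient is used here as a common choice, and each molecule is represented by a hand-written set of structural features rather than a real fingerprint. The unit names and feature labels are hypothetical.

```python
def tanimoto(features_a, features_b):
    """Tanimoto (Jaccard) similarity between two feature sets, from 0 to 1."""
    a, b = set(features_a), set(features_b)
    union = a | b
    if not union:
        return 0.0  # two empty feature sets carry no information
    return len(a & b) / len(union)


def top_hits(reference, library, n=3):
    """Return the n library entries most similar to the reference."""
    scored = [(tanimoto(reference, feats), name) for name, feats in library.items()]
    # Highest similarity first; ties are broken alphabetically by name.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(name, score) for score, name in scored[:n]]


# Hypothetical reference and library entries, for illustration only.
reference = {"thiophene", "C-O bridge", "fused ring", "cyano"}
library = {
    "unit_A": {"thiophene", "fused ring", "cyano"},
    "unit_B": {"thiophene", "C-O bridge"},
    "unit_C": {"benzene", "fluorine"},
    "unit_D": {"thiophene", "C-O bridge", "fused ring", "cyano", "fluorine"},
}

for name, score in top_hits(reference, library, n=2):
    print(f"{name}: {score:.2f}")
```

In practice the feature sets would be replaced by fingerprints generated from the database structures with a cheminformatics toolkit, while the ranking step stays the same.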

Organic building units are highly varied because many positions are available for attaching or connecting heteroatoms [48][49][50][51]. This variety makes it possible to synthesize countless organic molecules with better characteristics than earlier ones. To design new polymers or organic semiconductors for organic photodetectors, electron-deficient and electron-rich groups can be used. Hundreds of building blocks can be selected based on the addition of terminal groups and the availability of positions for alkyl chains. Many organic semiconductor materials can be designed by connecting new building units. A suitable combination of electron-rich and electron-deficient units results in the formation of an electron hole, which leads to an increase in conjugation [52][53][54][55].

Dataset
The data for machine learning were collected from published research papers. The volume of data was sufficient to train reliable machine learning models. The data comprise energy levels and photovoltaic parameters. The performance of a machine learning model depends strongly on the quality and quantity of the data [56].

Molecular Descriptor Calculation
Several types of molecular descriptors were calculated using Dragon software [57]. About 4000 descriptors were generated. The best descriptors were shortlisted using univariate regression and were then used to train the machine learning models.
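The shortlisting step can be illustrated with a minimal sketch. It assumes that univariate regression is carried out with the `SelectKBest` and `f_regression` utilities of scikit-learn; the source does not state which implementation was used. The descriptor matrix, the target values, and the number of retained descriptors (`k=5`) are synthetic placeholders, not the data or settings of this work.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in for a descriptor table: 200 molecules x 50 descriptors.
rng = np.random.default_rng(42)
n_molecules, n_descriptors = 200, 50
X = rng.normal(size=(n_molecules, n_descriptors))

# Hypothetical target: only descriptors 3 and 7 carry signal.
y = 1.5 * X[:, 3] - 2.0 * X[:, 7] + rng.normal(scale=0.1, size=n_molecules)

# Score each descriptor independently against the target (univariate F-test)
# and keep the k highest-scoring ones.
selector = SelectKBest(score_func=f_regression, k=5)
X_selected = selector.fit_transform(X, y)
selected_idx = selector.get_support(indices=True)

print("Selected descriptor columns:", selected_idx)
print("Reduced matrix shape:", X_selected.shape)
```

Because each descriptor is scored on its own, this kind of filter is fast even for thousands of descriptors, but it does not account for redundancy between correlated descriptors.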

Training the Model
We imported the necessary Python packages, including scikit-learn, pandas, SciPy, NumPy, seaborn, and Matplotlib. These packages are required for data analysis and visualization. The calculated descriptors and target properties, stored in comma-separated value (CSV) files, were imported with pandas.

Similarity Analysis
A similarity analysis was performed using RDKit [58]. Similarity analysis is a straightforward way to find the structures in a database that resemble a reference structure. Pharmacophores, distances, fingerprints, etc., can be used for this purpose. In our work, Tanimoto similarity calculated on ECFP4 fingerprints was used.
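The search can be sketched as follows. This is a minimal illustration rather than the script used in this work: ECFP4 is approximated by the RDKit Morgan fingerprint with radius 2, the fingerprint length (2048 bits) is an assumed setting, and the reference molecule and the five-entry database are hypothetical placeholders for the reference structures and the CEP database.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

# ECFP4 corresponds to a Morgan fingerprint with radius 2 (diameter 4).
# The 2048-bit length is an assumption, not a value from the text.
generator = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)


def ecfp4(smiles):
    """Return the ECFP4-like bit vector for a SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    return generator.GetFingerprint(mol)


# Hypothetical reference structure and toy database of building units.
reference = "c1ccsc1"
database = {
    "thiophene": "s1cccc1",
    "furan": "c1ccoc1",
    "benzene": "c1ccccc1",
    "2,2'-bithiophene": "c1csc(c1)-c1cccs1",
    "benzothiadiazole": "c1ccc2nsnc2c1",
}

ref_fp = ecfp4(reference)
names = list(database)
db_fps = [ecfp4(database[name]) for name in names]

# Tanimoto similarity: shared on-bits divided by the union of on-bits.
scores = DataStructs.BulkTanimotoSimilarity(ref_fp, db_fps)

# Rank database entries from most to least similar to the reference.
hits = sorted(zip(names, scores), key=lambda hit: hit[1], reverse=True)
for name, score in hits:
    print(f"{name}: {score:.2f}")
```

The top-ranked entries play the role of the search hits; a similarity threshold or a fixed number of hits can be applied to the ranked list, depending on how many candidates are wanted.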

Conclusions
In summary, a large set of data on photovoltaic properties was collected from previously reported experimental studies and used to train machine learning models. Among the multiple trained models, the LGBM regression model and the Hist gradient boosting regression model demonstrated the best predictive capability. Moreover, HOMO and LUMO energy levels were successfully predicted. The results revealed good consistency between experimental outcomes and model predictions. In addition, Pearson correlation and Shapiro ranking were applied to demonstrate the correlation between different parameters. Furthermore, a similarity analysis was performed to find the similarities between a reference structure and the structures in the database. The reliability of our approach was also verified by mining the photovoltaic database to search for new building units. These results indicate that machine learning is a powerful approach to predict the properties of photodetectors, which can facilitate their rapid development in various fields. Fast screening of new building units at minimal computational cost could significantly reduce the cost of trial-and-error experimentation by narrowing down the search for potential candidates.