Generation of Proton- and Alpha-Induced Nuclear Cross-Section Data via Random Forest Algorithm: Production of Radionuclide 111 In

Abstract: We investigated the generation of proton- and alpha-induced nuclear cross-section data in the production of Indium-111 ( 111 In) for application in nuclear medicine. We focus on three reaction channels that produce 111 In: 109 Ag ( α , 2n), 111 Cd (p, n) and 112 Cd (p, 2n). A random forest algorithm was used to generate nuclear cross-section data, using experimental nuclear cross-sections from the Experimental Nuclear Reaction Data (EXFOR) database as input. Reasonably accurate regression curves of nuclear cross-section data could be produced with the evaluated nuclear data library ENDF/B-VII.0 set as the benchmark.


Introduction
Nuclear reaction cross-section data are very important to medical radiobiology, in both diagnostic imaging and targeted therapy for cancer treatment, because they can be used to optimize established and new nuclear reaction routes for the production of radionuclides [1]. One of the most common radionuclides in diagnostic nuclear medicine is Indium-111 ( 111 In), whose relatively short half-life and low-energy gamma photon emissions make it popular for radiolabeling target cells [2]. 111 In is usually produced by irradiating cadmium with protons from a particle accelerator. Nuclear cross-section data are also important for identifying production pathways that maximize the yield of the radionuclide and minimize impurities. For example, Ali et al. (2021) studied the production of copper-67 via a proton-induced nuclear reaction on a zinc-68 target, since the supply of copper-67 is limited and the reaction 68 Zn (p, 2p) 67 Cu can be achieved at incident proton energies at which an in-house medical cyclotron operates. Another example is 72 As, which can be used in theranostics, specifically in positron emission tomography (PET); it is produced by proton-induced nuclear reactions on Ge and Se targets. Besides proton-induced nuclear reactions, muon capture has also been used to produce various useful radionuclides such as technetium-99m, which is the decay product of molybdenum-99 [3]. Resonant photonuclear isotope transmutation, which uses photonuclear reactions, has also been reported for the production of technetium-99 [4].
Machine learning has become increasingly popular in recent years, especially for making accurate predictions by training on a large dataset. The availability of large, publicly curated databases in various fields of science makes data-driven approaches such as machine learning attractive for modeling in the physical sciences. In the field of nuclear physics, databases of evaluated and experimental nuclear cross-sections such as the Evaluated Nuclear Data File (ENDF/B-VII.0) database, the Recommended Nuclear Data for the Production of Selected Therapeutic Radionuclides database and the Experimental Nuclear Reaction Data (EXFOR) database are constantly updated with excitation functions of nuclear reactions of various important radionuclides in medical radiobiology [5][6][7]. Although it is a new approach in nuclear physics, machine learning has already been applied to improve the accuracy of nuclear cross-section data by using random forest algorithms [8]. A random forest algorithm has been used to benchmark the quality of the Evaluated Nuclear Data Files (ENDF) library by tracing discrepancies between simulated and experimental effective neutron multiplication factors, k_eff. Gaussian process regression has also been performed to generate a data-driven nuclear cross-section of a nickel target using the EXFOR library as the predictor [9] and to make predictions on the uncertainty of the nuclear cross-section data.
In this paper, a random forest algorithm is used to build models that generate the nuclear cross-section data of proton- and alpha-induced nuclear reactions that produce 111 In. The reaction channels of interest are 109 Ag (α, 2n) 111 In, 111 Cd (p, n) 111 In and 112 Cd (p, 2n) 111 In. The generated nuclear cross-section data were then compared with the evaluated nuclear data library TENDL-2019 and the Recommended Nuclear Data for the Production of Selected Therapeutic Radionuclides database [7,8].

Materials and Methods
The predictors of this study consisted of the proton and alpha incident energy in the range of 0-80 MeV and the experimental nuclear cross-section data obtained from the EXFOR library. The descriptors were the TENDL-2019 evaluated library and the Recommended Nuclear Data for the Production of Selected Therapeutic Radionuclides database, against which the regression was performed. The predictors and descriptors are listed in Table 1. The experimental nuclear cross-section data, together with the TENDL-2019 evaluated library and the Recommended Nuclear Data for the Production of Selected Therapeutic Radionuclides database, are plotted in Figure 1. TENDL-2019 is a nuclear data library based on TALYS code calculations, while the Recommended Nuclear Data for the Production of Selected Therapeutic Radionuclides database is based on the spline interpolation of experimental nuclear cross-section data. In Figure 1c, the results of Hermanne et al. (2014) are slightly higher than the other experimental cross-section results and the spline-fitted line, especially between 10 and 13 MeV, because they were not included in the spline fitting of the 112 Cd (p, 2n) 111 In cross-section in the Recommended Nuclear Data for the Production of Selected Therapeutic Radionuclides database [21]. However, we included the experimental cross-section data of Hermanne et al. (2014) to increase the amount of input data available for training [21]. The predictors and descriptors were normalized using min-max normalization. To reduce the risk of overfitting, we performed k-fold cross-validation, in which the data are randomly partitioned into k subsets of equal or almost equal size and each subset is used once for validation while the remaining subsets are used for training. Here, k was set to 5, which means we performed 5-fold cross-validation. Then, feature selection was performed, followed by Bayesian optimization to determine the optimized number of predictors and the combination of predictors and hyperparameters used.
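The preprocessing steps described above can be illustrated with a minimal sketch. It assumes nothing about the software actually used in the study: the function names, the random seed and the example energies are hypothetical and serve only to show min-max normalization and a 5-fold split.

```python
import random

def min_max_normalize(values):
    """Scale values linearly onto [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and split them into k folds of (almost) equal size.

    Returns a list of (train_indices, validation_indices) pairs; each fold
    serves as the validation set exactly once.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, val in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, val))
    return splits

# Hypothetical incident energies in MeV (illustrative values, not data from the study).
energies = [5.0, 10.0, 20.0, 40.0, 80.0]
scaled = min_max_normalize(energies)

# Twelve hypothetical samples split into 5 folds of size 3, 3, 2, 2 and 2.
splits = k_fold_indices(n_samples=12, k=5)
```

In a full workflow, a model would be fitted on each training split and scored on the matching validation split, and the five scores would be averaged.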
A random forest algorithm was used in our study. Random forest is a type of supervised machine learning algorithm where multiple decision trees are grown recursively using ensemble methods. Each of these decision trees acts as a regression function, and their outputs are averaged to give the output of the random forest algorithm. There are two different kinds of ensemble methods used in random forest algorithms, namely boosted trees and bagged trees [23,24]. In bagged trees, random predictors of the same size are used to build each decision tree, while in boosted trees, the weight of each weak-learner decision tree is adjusted. Random forest is a non-parametric algorithm that does not assume any prior parameters.
Figure 1. (caption truncated in the source) ... [2,11-22] between 0 and 80 MeV proton incident energy.

Results and Discussions
As a preliminary step, forward feature selection was performed to optimize the number of predictors. Since our machine learning models depend on the number and combination of predictors, estimators such as the correlation coefficient (R 2 ) and the root mean square error (RMSE) can serve as benchmarks to evaluate their performance. The correlation coefficient (R 2 ) measures the closeness of a prediction to the actual value, and is defined as

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

RMSE is defined as

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

where $n$ is the number of observations, $y_i$ is the actual value, $\hat{y}_i$ is the predicted value and $\bar{y}$ is the overall mean. Ideally, the models should give an R 2 value close to one and an RMSE close to zero. We first removed all of the single-point experimental datasets before proceeding with the forward feature selection. When the forward feature selection was applied, features were added into the feature space sequentially. By varying the number of features used as predictors and the combination of features, the accuracy of the model can be improved. The number of predictors and combination of predictors along with their respective RMSE and correlation coefficients are tabulated in Tables 2-4. Different combinations of predictors are considered for 109 Ag (α, 2n) 111 In, 111 Cd (p, n) 111 In and 112 Cd (p, 2n) 111 In, respectively, in Tables 2-4. In the forward selection step, we added features into an empty feature space, with the total number of features increasing from 1 to 8, from 1 to 5 and from 1 to 4 for 109 Ag (α, 2n) 111 In, 111 Cd (p, n) 111 In and 112 Cd (p, 2n) 111 In, respectively. The correlation coefficient values, R 2 , for all of our predictions range between 0.94 and 0.99, which indicates a good agreement between the prediction and the original data. In Tables 2-4, the optimized number of features (with high R 2 and low RMSE) is highlighted.
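The two estimators and the forward selection loop can be sketched as follows. This is a minimal illustration under stated assumptions: R 2 is taken in its standard form (one minus the ratio of residual to total sum of squares), the stopping rule (stop when no candidate lowers the score) is an assumption rather than the rule used in the study, and `toy_score` with its `gain` table is a hypothetical stand-in for the cross-validated RMSE of a fitted model.

```python
import math

def r_squared(actual, predicted):
    """R^2 = 1 - SS_res / SS_tot, with SS_tot taken about the mean of `actual`."""
    mean = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean) ** 2 for y in actual)
    return 1.0 - ss_res / ss_tot

def rmse(actual, predicted):
    """Root mean square error over n paired observations."""
    n = len(actual)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(actual, predicted)) / n)

def forward_select(candidates, score):
    """Greedy forward selection starting from an empty feature space.

    Each round adds the candidate that gives the lowest score; the loop
    stops when no remaining candidate improves on the current best score.
    """
    selected, best = [], float("inf")
    remaining = list(candidates)
    while remaining:
        trial = min(remaining, key=lambda c: score(selected + [c]))
        trial_score = score(selected + [trial])
        if trial_score >= best:
            break
        selected.append(trial)
        remaining.remove(trial)
        best = trial_score
    return selected, best

# Hypothetical score: datasets "A" and "B" lower the error, "C" raises it.
gain = {"A": 5.0, "B": 2.0, "C": -1.0}

def toy_score(features):
    return 10.0 - sum(gain[f] for f in features)

selected, best = forward_select(["A", "B", "C"], toy_score)
```

With the toy score, the loop picks "A", then "B", and stops because adding "C" would raise the score.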
Figure 2 shows the trends of the number of predictors versus the RMSE. Intuitively, one might expect that using more features as predictors would raise the correlation coefficient and lower the RMSE of our models, since using more datasets to build the models may enhance the prediction. However, these three cases demonstrate that a low RMSE value could be obtained with the use of between one and three features for the machine learning models. This means that only a small number of features are needed to build machine learning models that can generate nuclear cross-section data with good accuracy. Thus, accurate predictions can be achieved with a small dataset as the input. After the forward selection step, Bayesian optimization was also performed to optimize the hyperparameters of the random forest algorithm of the optimized model. The number of iterations was set to 100. Here, the hyperparameters of the random forest consisted of the number of leaves, number of learners and number of predictors, which were searched between 1 and 34, between 10 and 500 and between 1 and 4, respectively.
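To make the search space concrete, the sketch below samples the three hyperparameter ranges quoted above for 100 iterations. Note the substitution: the study used Bayesian optimization, whereas this sketch uses plain random search, which is simpler to write without external libraries. The key names and `placeholder_objective` are hypothetical; in practice the objective would be the cross-validated RMSE of a fitted random forest.

```python
import random

# Search ranges quoted in the text (inclusive integer bounds).
SEARCH_SPACE = {
    "n_leaves": (1, 34),
    "n_learners": (10, 500),
    "n_predictors": (1, 4),
}

def sample_config(rng):
    """Draw one configuration uniformly from the search space."""
    return {name: rng.randint(lo, hi) for name, (lo, hi) in SEARCH_SPACE.items()}

def random_search(objective, n_iter=100, seed=0):
    """Evaluate n_iter random configurations and keep the one with the lowest objective."""
    rng = random.Random(seed)
    best_cfg, best_val = None, float("inf")
    for _ in range(n_iter):
        cfg = sample_config(rng)
        val = objective(cfg)
        if val < best_val:
            best_cfg, best_val = cfg, val
    return best_cfg, best_val

def placeholder_objective(cfg):
    """Hypothetical stand-in for a cross-validated RMSE (not a real model fit)."""
    return (abs(cfg["n_leaves"] - 4)
            + abs(cfg["n_learners"] - 100) / 100
            + cfg["n_predictors"])

best_cfg, best_val = random_search(placeholder_objective)
```

A Bayesian optimizer differs in that it uses the scores of earlier configurations to choose the next one, so it typically needs fewer evaluations than random sampling to reach a comparable result.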
In Table 5, the optimized hyperparameters of our machine learning model are tabulated. A single regression tree suffers from high variance, meaning that the results can be very different if we randomly split the dataset into two sets and fit a regression tree to each of them. Thus, the ensemble method of bagged trees, which is a procedure to reduce the variance, was introduced into the random forest algorithm. The bagged tree method is a bootstrap aggregation method that reduces the variance and increases the prediction accuracy by repeatedly taking N random samples from the training datasets. This produces N regression trees from N bootstrapped training sets, and the predictions of all N regression trees are then averaged. Each individual regression tree has high variance and low bias, and averaging the predictions reduces the variance. In this study, we used the bagged tree method to generate a regression curve with the optimized hyperparameters in Table 5. In Figure 3, we plotted the cross-section generated using the RF algorithm with experimental cross-sections as the input. From Figure 3, we can observe that the regression curve generated by our optimized model shows a good agreement with the TENDL-2019 evaluated data library and the Recommended Nuclear Data for the Production of Selected Therapeutic Radionuclides database results for the 109 Ag (α, 2n) 111 In, 111 Cd (p, n) 111 In and 112 Cd (p, 2n) 111 In nuclear reactions, with R 2 values between 0.98 and 0.99. This indicates the potential use of a machine learning approach in generating nuclear cross-section data.
Figure 3. (caption truncated in the source) ... Recommended Nuclear Data for the Production of Selected Therapeutic Radionuclides database, and the red line is our prediction using RF.
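Bootstrap aggregation as described above can be sketched in a few lines. This is an illustrative toy, not the implementation used in the study: the one-dimensional regression tree, the `min_leaf` parameter, the seed and the step-shaped data are all assumptions chosen to keep the example self-contained.

```python
import random

def fit_tree(xs, ys, min_leaf=2):
    """Fit a 1-D regression tree by recursive least-squares splitting.

    A leaf is stored as a float (mean of its targets); an internal node is
    a tuple (threshold, left_subtree, right_subtree).
    """
    mean = sum(ys) / len(ys)
    if len(xs) < 2 * min_leaf:
        return mean
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    xs = [xs[i] for i in order]
    ys = [ys[i] for i in order]
    best = None
    for cut in range(min_leaf, len(xs) - min_leaf + 1):
        if xs[cut - 1] == xs[cut]:
            continue  # cannot place a threshold between identical x values
        left, right = ys[:cut], ys[cut:]
        mean_l, mean_r = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((y - mean_l) ** 2 for y in left)
               + sum((y - mean_r) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, cut)
    if best is None:
        return mean
    cut = best[1]
    threshold = (xs[cut - 1] + xs[cut]) / 2
    return (threshold,
            fit_tree(xs[:cut], ys[:cut], min_leaf),
            fit_tree(xs[cut:], ys[cut:], min_leaf))

def predict_tree(tree, x):
    """Walk from the root to a leaf and return the leaf mean."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree

def fit_bagged_trees(xs, ys, n_trees=50, min_leaf=2, seed=0):
    """Fit one tree per bootstrap sample (drawn with replacement, same size as the data)."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_tree([xs[i] for i in idx], [ys[i] for i in idx], min_leaf))
    return trees

def predict_bagged(trees, x):
    """Average the predictions of all trees."""
    return sum(predict_tree(t, x) for t in trees) / len(trees)

# Toy step-shaped data (illustrative only, not cross-section data).
xs = [float(i) for i in range(20)]
ys = [0.0 if x < 10 else 100.0 for x in xs]
trees = fit_bagged_trees(xs, ys, n_trees=25, seed=1)
low, high = predict_bagged(trees, 2.0), predict_bagged(trees, 17.0)
```

Each tree sees a slightly different resampled dataset, so the individual fits differ; the averaged prediction is smoother and less sensitive to any single training point.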
However, we can observe that the peak of the nuclear cross-section of 111 Cd (p, n) 111 In in Figure 3b is lower than that of the Recommended Nuclear Data for the Production of Selected Therapeutic Radionuclides database. We suspect that this is because of the inherent error of each experimental data point used in the training. Thus, we expect that this small deviation can be reduced by assigning a weight to each data point used in training based on its experimental error bar. Our study was limited to generating the nuclear cross-section data and not their errors, which are much more important in experimental design. Since we only studied well-known nuclear reactions, for future work, we plan to generate nuclear cross-section data of rare nuclear reactions with a machine learning approach using a combination of simulation and experimental datasets as the input.
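One common way to weight data points by their error bars is inverse-variance weighting, sketched below. The text does not specify a weighting scheme, so this choice is an assumption, and the cross-section values and uncertainties are hypothetical numbers used only to show the arithmetic.

```python
def inverse_variance_weights(errors):
    """Weight each point by 1 / sigma^2, normalized so the weights sum to 1."""
    raw = [1.0 / (e * e) for e in errors]
    total = sum(raw)
    return [w / total for w in raw]

def weighted_mean(values, weights):
    """Weighted average of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Two hypothetical measurements at the same energy (mb), illustrative only.
sigma_mb = [700.0, 760.0]
err_mb = [10.0, 30.0]

weights = inverse_variance_weights(err_mb)   # approximately [0.9, 0.1]
combined = weighted_mean(sigma_mb, weights)  # approximately 706 mb
```

The measurement with the smaller error bar dominates the combined value. In a tree-based regressor, the same weights could be applied when computing leaf means and split errors.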

Conclusions
In conclusion, we have studied the performance of a random forest (RF) machine learning algorithm in generating nuclear cross-section data for 109 Ag (α, 2n) 111 In, 111 Cd (p, n) 111 In and 112 Cd (p, 2n) 111 In nuclear reactions. We used the experimental dataset from the EXFOR database as the predictor and the datasets from the TENDL-2019 evaluated data library and the Recommended Nuclear Data for the Production of Selected Therapeutic Radionuclides database as the descriptors. A feature selection step was performed to determine the best number of predictors and combination of predictors that give the highest correlation coefficient and the lowest RMSE for our RF machine learning models. We found that not many experimental datasets are needed to generate nuclear cross-section data with good accuracy. We also found that the RF models are capable of generating nuclear cross-section data close to the predictor dataset with a correlation coefficient between 0.98 and 0.99 and RMSE between 25.418 and 28.782. This suggests that the machine learning approach can be applied in generating nuclear cross-section data.