Machine Learning Diffuse Optical Tomography Using Extreme Gradient Boosting and Genetic Programming

Diffuse optical tomography (DOT) is a non-invasive method for detecting breast cancer; however, it struggles to produce high-quality images due to the complexity of scattered light and the limitations of traditional image reconstruction algorithms. These algorithms can be affected by boundary conditions and have a low imaging accuracy, a shallow imaging depth, a long computation time, and a high signal-to-noise ratio. However, machine learning can potentially improve the performance of DOT by being better equipped to solve inverse problems, perform regression, classify medical images, and reconstruct biomedical images. In this study, we utilized a machine learning model called “XGBoost” to detect tumors in inhomogeneous breasts and applied a post-processing technique based on genetic programming to improve accuracy. The proposed algorithm was tested using simulated DOT measurements from complex inhomogeneous breasts and evaluated using the cosine similarity metrics and root mean square error loss. The results showed that the use of XGBoost and genetic programming in DOT could lead to more accurate and non-invasive detection of tumors in inhomogeneous breasts compared to traditional methods, with the reconstructed breasts having an average cosine similarity of more than 0.97 ± 0.07 and average root mean square error of around 0.1270 ± 0.0031 compared to the ground truth.


Introduction
Imaging with randomly scattered light is a significant challenge with a pressing need in non-invasive biomedical diagnosis [1][2][3][4][5][6]. One of the more significant applications is a non-invasive functional imaging technique called diffuse optical tomography (DOT), which uses near-infrared (NIR) light to map in 3D the optical characteristics of tissue by penetrating it deeply [7][8][9][10][11][12][13]. Soft tissue, including the breast and the brain, can be penetrated several centimeters by diffuse light in the NIR wavelength range. Measurements from reflected or transmitted light at the tissue surface are used to reconstruct DOT images using inverse problems [14][15][16]. Moreover, DOT has many advantages, including the nonionizing nature of the light spectra used for imaging tissues and the non-invasive nature of the technology. As a result, DOT has found critical applications, especially in breast tissue imaging [8,15,17] and tissue property estimation [10,[18][19][20][21]. However, many challenges persist. Although near-infrared (NIR) photons can travel several centimeters into the tissue to enable non-invasive biomedical imaging, these photons scatter widely and migrate in random directions before escaping or being absorbed by the medium, making imaging and detection tasks challenging [20,22].
Moreover, DOT reconstruction is an ill-defined and inadequately framed topic that calls for an inversion regularization to enhance convergence [15,23]. Inaccurate DOT reconstruction can also be caused by model inadequacies, including poor boundary conditions, results. Section 3 discusses the results of the proposed algorithm, and we finish with the conclusions derived from this study in Section 4.

Materials and Methods
Image reconstruction in DOT has traditionally been achieved using inverse problems, which take the form [37]: Findx ∈ X f rom data y = A(x) + δ, y ∈ Y (1) where the parameter space is defined by X, Y is the measurement space, A is the propagation model of photons convolved with the optical component response, and δ is the noise mechanism in the system. The scarcity of ballistic photons coupled with the loss of imaging data as a result of repeated scattering events, however, results in a non-linear, ill-posed inverse problem, needing an effective solution to accomplish this challenging task. In order to simplify the DOT system and precisely locate cancers embedded in the compressed breast tissue, approaches based on extreme gradient boosting and genetic programming are proposed in this section. This section includes a brief introduction to the proposed machine learning techniques, the proposed methodology, and the results obtained from this study.

Simulating Photon Migration in Digital Breast Phantoms to Generate Dataset
The compressed breast geometry is chosen for this study. This particular type of geometry is used as it considerably reduces movement artifacts, evenly spreads the different layers of tissue, and also helps considerably reduce the amount of tissue to be imaged [16,38]. The dataset used to train, validate, and test the algorithm is created using the finite element method (FEM) [39]. The evaluation of DOT models is usually conducted through computer simulations, as it allows for easy comparison to the work of other scientists, creating a larger dataset and providing a better-controlled environment. Additionally, simulations allow for the avoidance of costly and wasteful fabrication of clinical prototype systems that may possess inherent engineering problems. Furthermore, it is both complicated and expensive to create an inhomogeneous breast phantom for experimental purposes [14,38]. The compressed breast mesh itself is loaded from the "DigiBreast" [40] digital breast phantom modeled accurately using the optical properties of the skin, breast tissues, and the chest wall [17,41]. The digital breast phantom includes a realistic 3D glandularity map, which is more detailed than the representation of breast tissue in conventional numerical breast phantoms. These traditional phantoms use piecewise-constant regions to represent the fibroglandular and adipose tissue, which can result in a loss of information. In contrast, the digital breast phantom uses statistical or fuzzy segmentation methods to create spatially varying tissue volume fraction maps, which preserve more detailed information about the breast tissue.
The dimensions of the compressed breast are 220.8 mm, 102.9 mm, and 23.7 mm in the X, Y, and Z directions, respectively. A parallel-plate-based measurement scheme is employed to set a 48 × 54 source-detector arrangement on either side of the breast. The simulated compressed breast mesh and the source-detector arrangement are shown in Figure 1.
The simulations are then performed in the transmission geometry using the Toast++ [39] forward solver to simulate light intensity measurements at the detectors. As tumors often have a high blood concentration, in this study, they are modeled as oxyhemoglobin spheres with radii ranging from 2 mm to 15 mm randomly located inside a compressed breast mesh. The algorithm used 5000 simulated breast measurements taken in the transmission geometry as input, and the x, y, and z coordinates, the absorption coefficient matrix, and the tumor's radius were the labels. These simulated measurements were performed in the continuous wave regime at a wavelength of 833 nm, and a 2% gaussian noise was added to mimic errors that might occur if the measurements were taken from an experimental procedure. The simulations are then performed in the transmission geometry using the Toast++ [39] forward solver to simulate light intensity measurements at the detectors. As tumors often have a high blood concentration, in this study, they are modeled as oxyhemoglobin spheres with radii ranging from 2 mm to 15 mm randomly located inside a compressed breast mesh. The algorithm used 5000 simulated breast measurements taken in the transmission geometry as input, and the x, y, and z coordinates, the absorption coefficient matrix, and the tumor's radius were the labels. These simulated measurements were performed in the continuous wave regime at a wavelength of 833 nm, and a 2% gaussian noise was added to mimic errors that might occur if the measurements were taken from an experimental procedure.

Detecting Tumors in the Compressed Breast Using Extreme Gradient Boosting
Following the creation of the dataset, the collected data are used to solve the inverse problem and detect the location of tumors inside the compressed breast. A machine learning algorithm called extreme gradient boosting (XGBoost) is employed to solve the inverse problem. XGBoost is a distributed, scalable gradient-boosted decision tree (GBDT) machine learning framework with several applications that use regression [32] and classification [33]. Here, gradient boosting refers to "boosting" or strengthening one weak model by fusing it with several other weak models to create a more robust model. As an extension of boosting, gradient boosting formalizes the process of additively creating weak models as a gradient descent method over a fitness function. Gradient boosting algorithms create a model that predicts the label by evaluating a tree of logical feature questions and determining the minimum number of questions required to assess the probability of obtaining a correct decision. To reduce errors, gradient boosting establishes desired outcomes for the upcoming model. The gradient of the error concerning the prediction determines the targeted outcomes for each case, hence the name "gradient boosting." Created by Tianqi Chen and contributions from numerous developers, "Extreme Gradient Boosting (XGBoost)" is one of the most important and successful implementations of gradient boosting machines [31]. XGBoost is a scalable and highly accurate gradient-boosting algorithm that pushes the limits of computing power for boosted tree algorithms. It was created primarily to enhance machine learning models' performance and computational speed. Unlike GBDTs, where decision trees are generated sequentially, the trees in XGBoost are built in parallel. The XGBoost method takes a level-wise approach, scanning over multiple gradient values and using the partial sum generated from the gradient values to evaluate the performance of the splits for each possible split in the training dataset.
As described earlier, the dataset is created using 5000 different measurements at the

Detecting Tumors in the Compressed Breast Using Extreme Gradient Boosting
Following the creation of the dataset, the collected data are used to solve the inverse problem and detect the location of tumors inside the compressed breast. A machine learning algorithm called extreme gradient boosting (XGBoost) is employed to solve the inverse problem. XGBoost is a distributed, scalable gradient-boosted decision tree (GBDT) machine learning framework with several applications that use regression [32] and classification [33]. Here, gradient boosting refers to "boosting" or strengthening one weak model by fusing it with several other weak models to create a more robust model. As an extension of boosting, gradient boosting formalizes the process of additively creating weak models as a gradient descent method over a fitness function. Gradient boosting algorithms create a model that predicts the label by evaluating a tree of logical feature questions and determining the minimum number of questions required to assess the probability of obtaining a correct decision. To reduce errors, gradient boosting establishes desired outcomes for the upcoming model. The gradient of the error concerning the prediction determines the targeted outcomes for each case, hence the name "gradient boosting".
Created by Tianqi Chen and contributions from numerous developers, "Extreme Gradient Boosting (XGBoost)" is one of the most important and successful implementations of gradient boosting machines [31]. XGBoost is a scalable and highly accurate gradientboosting algorithm that pushes the limits of computing power for boosted tree algorithms. It was created primarily to enhance machine learning models' performance and computational speed. Unlike GBDTs, where decision trees are generated sequentially, the trees in XGBoost are built in parallel. The XGBoost method takes a level-wise approach, scanning over multiple gradient values and using the partial sum generated from the gradient values to evaluate the performance of the splits for each possible split in the training dataset.
As described earlier, the dataset is created using 5000 different measurements at the transmission boundaries of the compressed breast. The log of the measured signal at each detector with respect to different sources (log(d)) is used as the input to the XGBoost algorithm. The "x", "y", and "z" coordinates, the absorption coefficient, and the radius are the labels. The algorithm is employed to find an accurate model that establishes a precise relation between the simulated measurements and the tumor locations. The root mean squared error (RMSE) loss function is used to assess the algorithm's performance, and the accuracy of the predicted outcome is measured using the RMSE and the cosine similarity (CS) metrics [42]. The RMSE loss function is given by: Bioengineering 2023, 10, 382 5 of 13 and the cosine similarity is given by: where Y i and X i are the predictions and the labels, respectively, and N is the total number of datasets. These metrics are specifically used primarily because the R 2 metrics perform relatively poorly when the prediction window is narrow, as is the case with the radius of the spheres. Furthermore, the RMSE and the CS metrics are better suited to solving inverse problems where the prediction interval is narrow and continuous [8,43]. A computer with an i9 series 9900k processor and two NVIDIA GeForce RTX 2080Ti graphics processors were used to build and train the XGBoost algorithm. Each GPU contains 11 GB of VRAM, and the two GPUs are connected through an NVlink. The network is by Adam Optimizer with an initial learning rate of 0.0001. For the XGB predictions, 50% of the total dataset is used, out of which 60% is used for training and validation, and 40% of the data is used for testing. The XGBoost algorithm learns the features from the input and the labels and accurately predicts the relationship between the input (measurement data) and the labels (tumor location) by regression. Due to variations in the dimensions of the compressed breast and the nature of the coordinate system used to model the compressed breast, the model prediction is made separately for the three different coordinates, the absorption coefficient, and the radius. The best results obtained by the XGBoost algorithm are RMSE values of 0.1862 ± 0.0018 for the x coordinate, 0.1678 ± 0.0042 for the y coordinate, 0.1505 ± 0.0009 for the z coordinate, 0.1131 ± 0.0091 for the absorption coefficient, and 0.2157 ± 0.0103 for the radius. Some of the predictions of the XGBoost algorithm are shown below in Figure 2. From the results obtained by the XGBoost algorithm, it is undoubtedly well-posed to detect tumors to a certain degree of accuracy using the simulated measurements. The predictions by the XGBoost algorithm obtained a very high average cosine similarity value of 0.9564 ± 0.0076 compared to the ground truth. However, the predicted size and locations of the embedded tumors could still be improved, especially the predictions of the radius and the z coordinate, which is also the mean direction of propagation. Therefore, the following section applies a post-processing algorithm using genetic programming (GP) to improve the model predictions.

Enhancing Tumor Detection Capabilities Using Genetic Programming
Genetic programming (GP) is a nature-inspired hyper-heuristic search algorithm that optimizes a set (or a population) of solutions (or individuals), embodied as computer programs, which are typically represented as tree-like structures, using a predefined loss function (or a fitness function) for a given task [35,44]. In other words, GP starts with a high-level declaration of "what needs to be done" and constructs a computer program to solve the problem automatically. As described in Figure 3, GP begins with a primordial soup of a large pool of randomly generated computer programs (initial guess of the model prediction), and this population of programs evolves throughout generations. The evolutionary search employs the Darwinian concept of natural selection (survival of the fittest) and analogs of many naturally occurring events, such as crossover (sexual recombination), mutation, gene duplication, and gene deletion. Moreover, genetic algorithms, by means of fitness-based selections and applying genetic operators over time, can optimize solutions automatically during the course of simulated evolution. Thus, GPs' adaptive nature, free of human prejudices or biases, may often provide answers better than the most exemplary human efforts. Bioengineering 2023, 10, x FOR PEER REVIEW 6 of 14 From the results obtained by the XGBoost algorithm, it is undoubtedly well-posed to detect tumors to a certain degree of accuracy using the simulated measurements. The predictions by the XGBoost algorithm obtained a very high average cosine similarity value of 0.9564 ± 0.0076 compared to the ground truth. However, the predicted size and locations of the embedded tumors could still be improved, especially the predictions of the radius and the z coordinate, which is also the mean direction of propagation. Therefore, the following section applies a post-processing algorithm using genetic programming (GP) to improve the model predictions.

Enhancing Tumor Detection Capabilities Using Genetic Programming
Genetic programming (GP) is a nature-inspired hyper-heuristic search algorithm that optimizes a set (or a population) of solutions (or individuals), embodied as computer programs, which are typically represented as tree-like structures, using a predefined loss function (or a fitness function) for a given task [35,44]. In other words, GP starts with a high-level declaration of "what needs to be done" and constructs a computer program to solve the problem automatically. As described in Figure 3, GP begins with a primordial soup of a large pool of randomly generated computer programs (initial guess of the model prediction), and this population of programs evolves throughout generations. The evolutionary search employs the Darwinian concept of natural selection (survival of the fittest) Genetic or evolutionary algorithms have a wide range of applications, particularly in domains where an exact structure of the solution is unknown in advance or when finding an approximate solution is considered suitable [45]. Genetic programming is frequently used in conjunction with other types of machine learning because it is useful for performing symbolic regressions and feature classifications [46].
The execution of a GP algorithm is described below: 1.
Randomly generate an initial population of solutions called individuals. Each individual is generated as a random tree of limited depth, consisting of nodes taken from the terminal set and the function set. The terminal set contains constants and variables, and the function set consists of various operators, for example, mathematical operations, logical operators, etc.

2.
While the termination criterion is not fulfilled, the following sub-steps are repeated: a. Evaluate the individuals in the current population according to the fitness function, which outputs a numerical value representing the quality of the individual as a solution. b.
Select individuals from the population using a selection method, where the probability for selection is related to fitness values, for producing the next set of individuals.
c. Apply the following genetic operators to produce new individuals with predetermined probabilities: I. Reproduction: clone an individual selected by the sub-step "b" to the population. II.
Crossover: randomly recombine two selected individuals to produce two new offspring. III.
Mutation: randomly alter one selected individual to produce one new offspring.

3.
Output the best individual found during the run as the output.
Bioengineering 2023, 10, x FOR PEER REVIEW 8 of 14 are given in Table 1, and the entire workflow of the proposed method, including the XGBoost and GP, is shown in Figure 3.  The remaining 50% of the total data after the XGBoost is used for the GP algorithm. The data split for the GP algorithm's training, validation, and testing is similar to the data To implement the steps described above, we employed Koza-style [47] genetic programming using the DEAP Python environment [48]. We now describe the details of our implementation: Essentially, each GP individual represents an R → R function, which receives a single real-valued coordinate as input (namely, x, y, z, or radius) outputted by XGBoost in the previous stage), and outputs a single real-value, as an estimate for the actual coordinate (x, y, z, or radus). The function set for each individual (as can be seen in the pseudo-code above) is comprised of binary arithmetic operations. The terminal set is comprised of a single real-valued coordinate and integer ephemeral random constants (ERCs) [35,44,47] in the range [−5, 5]. For example, if the coordinate in question is x, it is possible to form an expression such as add (multiply (x, x), negate (multiply (x, 5.0))) representing the function: x 2 + (−5x). Typically, GP trees can represent expressions that are far more complex.
For assessing our individuals, fitness was calculated by randomly selecting 10% of the dataset (with the given single label) and applying the individual to each of the selected examples. The fitness score was determined by the RMSE measure between the actual values for the labels in the dataset and those predicted by the individual. We used tournament selection as the selection mechanism. Additional parameters of the GP algorithm are given in Table 1, and the entire workflow of the proposed method, including the XGBoost and GP, is shown in Figure 3. The remaining 50% of the total data after the XGBoost is used for the GP algorithm. The data split for the GP algorithm's training, validation, and testing is similar to the data split used for the XGBoost, and the algorithm is also run on the same computational hardware. After applying the algorithm for 30 runs with 100 generations each, the final results obtained after XGBoost and GP are RMSE values of 0.1808 ± 0.0014 for the x coordinate, 0.1539 ± 0.0057 for the y coordinate, 0.1340 ± 0.0032 for the z coordinate, 0.0975 ± 0.0065 for the absorption coefficient, and 0.2017 ± 0.0126 for the radius. The reconstructed images from the predictions are shown in Figure 4.
As seen in Figure 4, predictions significantly improve when the post-processing GP algorithm is employed. The most significant enhancement is observed in the predictions of the z coordinate of the embedded tumors. Considering that the z-axis is the mean direction of the photon propagation due to the forward scattering anisotropy of the media, this is a commendable result, as minimal errors in the prediction of the z coordinate will allow us to detect anomalies and tumors at greater depths. The results obtained using the XGBoost and the enhancement due to GP are summarized in Table 2. split used for the XGBoost, and the algorithm is also run on the same computational hard ware. After applying the algorithm for 30 runs with 100 generations each, the final results obtained after XGBoost and GP are RMSE values of 0.1808 ± 0.0014 for the x coordinate 0.1539 ± 0.0057 for the y coordinate, 0.1340 ± 0.0032 for the z coordinate, 0.0975 ± 0.0065 for the absorption coefficient, and 0.2017 ± 0.0126 for the radius. The reconstructed images from the predictions are shown in Figure 4. As seen in Figure 4, predictions significantly improve when the post-processing GP algorithm is employed. The most significant enhancement is observed in the predictions of the z coordinate of the embedded tumors. Considering that the z-axis is the mean di rection of the photon propagation due to the forward scattering anisotropy of the media this is a commendable result, as minimal errors in the prediction of the z coordinate wil

Discussion
As seen in Figure 4, the predicted outcomes are very close to the ground truth locations and the radii. When the CS metric is used to validate the model's performance, an average similarity of 0.9720 ± 0.0062 is observed when the predicted labels are compared to the ground truth. From the RMSE and CS values, it is apparent that the prediction is very robust. The algorithm is also computationally fast, as training consumes less than 60 s. It must also be noted that the algorithms do not have any prior information or knowledge regarding the inverse solution to DOT. Our study found that our predictions for the x and y coordinates were significantly more accurate than those for the z coordinate and radius. This discrepancy may be due to the fact that the range of the z coordinate and radius was limited compared to the range of the x and y coordinates. This limited range not only adversely affected the algorithm's performance but also negatively impacted the simulated measurements. The small changes within this limited range were not as clearly reflected in the measurement matrix, resulting in a less accurate regression model for predicting the tumor location and radius.
Moreover, our study's absorption coefficient reconstruction is still not very accurate. The XGBoost algorithm underestimates the absorption coefficient, while the GP regularly overestimates the absorption coefficient. This is likely due to the inhomogeneous distribution of tissue within the breast and the varying contrasts between the anomalies (such as tumors) and the surrounding tissue. Inhomogeneity in the tissue distribution can introduce variations in the optical properties of the breast, leading to challenges in accurately reconstructing the absorption coefficient. Similarly, the variability in contrast between the anomalies and the background can make it difficult to accurately distinguish between the two, leading to errors in the reconstruction. These factors likely contributed to the limitations in the accuracy of the absorption coefficient reconstruction in our study.
Despite these limitations, it is worth noting that the algorithm was still able to learn an accurate model for solving the inverse problem and identifying tumors based on a relatively small, simulated dataset. This demonstrates the robustness and effectiveness of the algorithm, which effectively processed and analyzed the data despite the challenges presented by the limited range of the z coordinate and radius. Overall, our results show that the combination of XGBoost and genetic programming (GP) can be a powerful tool for solving inverse problems in diffuse optical tomography and may have potential applications in medical imaging and other fields.
XG boost and genetic programming (GP) are highly effective algorithms for solving diffuse optical tomography (DOT) problems. XG boost, a decision tree-based algorithm, is particularly well-suited for handling highly non-linear data and can be used for classification and regression tasks. GP is versatile and can be used for a variety of purposes, including regression, classification, and optimization. These algorithms have been shown to outperform other approaches, such as conventional analytical inverse problems and convolutional neural networks (CNNs), in many complex applications due to their ability to accurately process and analyze the data. Therefore, XGBoost and GP are the preferred choice for solving DOT problems.
Moreover, the primary issue with employing neural networks to solve DOT inverse problems is the near impossibility of collecting massive amounts of experimental data to train the algorithm, leading to trade-offs between precision and accuracy in actual investigations. This study addresses this issue as shallow machine learning algorithms such as XGBoost do not require substantial amounts of data for accurate model prediction. The comparison of the effectiveness of our algorithm with that of the existing state-of-theart algorithms is shown in Table 3. A more detailed comparison to other imaging modalities and algorithms using different metrics can be found in excellent review articles [14,30,49].
As seen in Table 3, the proposed method is well-suited and outperforms other algorithms that solve a similar problem. Furthermore, transfer learning can close the gap between simulations and real tissue imaging. Additionally, to improve performance, the proposed methods can be trained using a more complex and realistic dataset with different geometries and source-detector configurations, paving the way for the realization of large-scale medical applications involving non-invasive and radiation-free diffuse optical techniques that can provide an accurate diagnosis.

Conclusions
In this study, an extreme gradient boosting algorithm was used in conjunction with a genetic programming algorithm to detect embedded tumors or anomalies in inhomogeneous compressed breast tissues. A dataset of 5000 compressed breasts with anomalies was simulated for the study, and simulated light measurements were used to determine the location and size of the tumors. After applying the proposed method, the obtained RMSE and cosine similarity values showed that the algorithm was highly accurate and robust when the predicted tumor locations and sizes were compared to the ground truth. Moreover, the issues regarding large datasets and expensive computational costs were addressed to a certain extent as only a few breast samples were used, and the tumor location and radius predictions consume a very minimal amount of time. The results from the proposed method to accurately determine the anomalies in breast tissues show that extreme gradient boosting and genetic programming algorithms can be employed to detect tumors in compressed breast tissues with excellent accuracy compared to previously existing methods.