Comparison of Machine Learning Methods in Stochastic Skin Optical Model Inversion

Featured Application: This research can potentially be applied to improving the accuracy of clinical skin cancer diagnostics.

Abstract: In this study, we compare six machine learning methods in the inversion of a stochastic model for light propagation in layered media, and use the inverse models to estimate four parameters of the skin from the simulated data: melanin concentration, hemoglobin volume fraction, and the thicknesses of the epidermis and dermis. The aim of this study is to determine the best methods for stochastic model inversion in order to improve current methods in skin-related cancer diagnostics and, building partially on these results, to develop a non-invasive way to measure the physical parameters of the skin. Of the compared methods, which are convolutional neural network (two variants), multi-layer perceptron, lasso, stochastic gradient descent, and linear support vector machine regressors, we find the convolutional neural network to be the most accurate in the inversion task.


Introduction
Skin-related diseases such as skin cancer are common [1], and non-invasive methods for diagnosing them are needed. According to Le et al. [2], melanoma incidence is increasing. The incidence between 2009 and 2016 was 491.1 cases per hundred thousand people, nearly 64% more than in the period 1999-2008. The difference at older ages is even starker, as the incidence nearly doubled from 1278.1 to 2424.9 in people older than 70 years [2]. The cost of melanoma to society follows the same trend, and early detection and accurate treatment lower these costs and improve patients' life expectancy. Hyperspectral imaging is one route to early detection and treatment guidance. Skin physical parameter retrieval with a hyperspectral camera or spectrometer and machine learning (ML) provides a non-invasive method of measuring the chromophore concentrations and other parameters of the skin [3].
One hindrance to developing ML models for clinical use is that the needed training datasets are large and difficult to produce. The ethical standards governing human testing make data hard to obtain, and the work requires a team of medical staff in addition to computer science specialists. One way to avoid some of these pitfalls is to use mathematical modeling to produce training data for ML algorithms. In this study, we use a stochastic model that is partially of our own design.
In our previous research [4,5], we have used convolutional neural networks (CNN) for stochastic model inversion. In [4], we used a CNN to invert the stochastic model for leaf optical properties (SLOP) [6], which was the inspiration for the stochastic model used in this study, and found the inversion successful. In [5], we used CNN inversion to compare the invertibility and usefulness in physical parameter retrieval of the stochastic model used in this study and a Kubelka-Munk model [7]. It has since occurred to us that it would be useful to verify the applicability of CNNs in model inversion by comparing them to other ML algorithms.
ML algorithms have been compared multiple times in different hyperspectral imaging scenarios. For example, one study compared ML algorithms in classifying the stage of Anthracnose disease in strawberry foliage [8]. They compared a spectral angle mapper, stepwise discriminant analysis, and a spectral index of their own, which they call the simple slope measure. These algorithms did not perform well in the task, with classification accuracy just breaking 80%. Another study found a least mean squares classifier to perform best in classifying a small batch of lamb muscles, compared to six other machine learning algorithms including support vector machine (SVM) approaches, simple neural networks, a nearest neighbor algorithm, and linear discriminant analysis [9]. The algorithms were also tested using principal component analysis (PCA) for dimensionality reduction of the training and testing data. They found no statistical differences in the classification results between using PCA and not.
Gewali et al. [10] have written a survey on machine learning methods in hyperspectral remote sensing tasks. The retrieval of (bio)physical or (bio)chemical parameters from the hyperspectral images was one of the tasks, and they found three articles where SVMs were used in retrieval, five articles that used latent linear models such as PCA, four that used ensemble learning, and five that used Gaussian processes. Surprisingly, they found no articles where deep learning was used in retrieval.
The distinction between the articles [8][9][10] and our work is that they do not apply mathematical modeling prior to machine learning. In contrast, the following articles by Vyas et al. and Liang et al. are similar to our work, only employing different mathematical and machine learning models [11][12][13].
Liang et al. [11] used simulated PROSAIL data to select optimal vegetation indices for leaf and canopy chlorophyll content prediction from an inverted PROSAIL model. In the inversion, they used an SVM algorithm and a random forest (RF) algorithm, of which they found RF to be better. The results were promising, as the coefficient of determination (r²) between measured and predicted data was 0.96. However, their use of indices makes the work rather incomparable to our research.
Vyas et al. [12,13] have done the work most similar to ours. In [12], the Kubelka-Munk (KM) model was inverted using a k nearest neighbors (k-NN) method with different distance metrics and a support vector regressor. The inversion parameters were melanosome volume fraction, collagen volume fraction, hemoglobin oxygenation percentage, and blood volume fraction. In the synthetic experiment, they found the k-NN with spectral angle distance to have the smallest mean absolute errors. In the in vivo experiment, they show that the predicted parameters produce modeled spectra strikingly similar to measured spectra. This is used as evidence of inversion success.
In their other study [13], the KM model was inverted using an SVM approach, with skin thickness as the inversion parameter. The linear correlation coefficient (r) between inverted KM predictions and ultrasound measurements of skin thickness was 0.999. In further experiments, they found the measured and modeled spectra nearly indistinguishable, although it is not disclosed how they chose the other parameters for the modeling. The difference between the aforementioned studies and our approach is that the inverted models differ from ours, as we are trying to find useful alternatives to the KM model.
The objective of our study is to find a good way to invert the stochastic model for skin optical properties. Our hypothesis is that the convolutional neural network (CNN) will outperform the others as it has been shown to perform well in similar tasks [14][15][16][17].
In Section 2, we provide information about our data, the methods we use, including the stochastic model and the different machine learning algorithms, and the metrics we use to evaluate the results. In Section 3, we show the experimental results; in Section 4, we compare our results to previous research, discuss the strengths and weaknesses of our work, and outline directions for future work. Section 5 concludes the work.

Stochastic Model
The training data for the machine learning methods was produced by a stochastic model for light propagation in the skin (SM). The SM was adapted from the stochastic model for leaf optical properties [6] by using skin specific parameters [18].
SM is a Markov chain based model for light propagation. The light propagation can be expressed as a network of states and transitions (Figure 1), in which there is a physically meaningful probability of transfer from one state to another. The probabilities are based on the chromophore concentrations and the absorption and scattering properties of the skin. SM is used in the same way and with the same parameters as in [5]. In this research, the first two main skin layers, epidermis and dermis, were considered; light passing through the dermis was considered absorbed. From the first state, illumination, the light either enters the skin (P = 0.98) or is reflected (P = 0.02). In the epidermis and dermis layers, the transition probabilities are based on Beer's law [6]. In the corresponding equations, λ is the wavelength in nanometers, a(λ) is the absorption coefficient [18], s(λ) is the reduced scattering coefficient [18], and L is the length of the light path, which is assumed to be the same as the thickness of the layer [18].
In Equations (4) and (5), a_i are the skin chromophore absorption coefficients and c_i their concentrations, s(500 nm) is the reduced scattering coefficient at 500 nm, f_Ray is the fraction of Rayleigh scattering, and b_Mie is the Mie scattering power. The chromophores used in the study are melanin, deoxygenated and oxygenated hemoglobin, and water. Their absorption coefficients are from [18], and the concentrations are expressed as fractions of the volume they occupy. The deoxygenated and oxygenated hemoglobin volume fractions are calculated from the blood oxygen fraction (S) and the hemoglobin volume fraction (c_HGb) as follows [18]:

c_HGb,oxy = c_HGb S,
c_HGb,deoxy = c_HGb (1 − S).

The input parameters used in SM are described in Table 1. Example spectra produced by the stochastic model are shown in Figure 2.

Table 1. Input parameters and their ranges for the stochastic model. The parameters that are used as prediction targets are marked with (*). The wavelengths are described in detail in Table 2 [5].
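The layer optical coefficients described above can be sketched in a few lines of Python. The function and variable names are our own illustrative choices; the formulas follow the definitions of Equations (4) and (5) (total absorption as a concentration-weighted sum; reduced scattering as Rayleigh and Mie terms relative to 500 nm) and the hemoglobin split given above:

```python
import numpy as np

def absorption_coefficient(chromophore_abs, concentrations):
    # Equation (4): a(lambda) = sum_i c_i * a_i(lambda), with
    # chromophore_abs shaped (n_chromophores, n_wavelengths).
    return np.asarray(concentrations) @ np.asarray(chromophore_abs)

def reduced_scattering(wavelengths_nm, s500, f_ray, b_mie):
    # Equation (5): Rayleigh and Mie contributions relative to 500 nm.
    w = np.asarray(wavelengths_nm) / 500.0
    return s500 * (f_ray * w ** -4 + (1.0 - f_ray) * w ** -b_mie)

def hemoglobin_fractions(c_hgb, s_oxy):
    # Oxygenated and deoxygenated volume fractions from the blood
    # oxygen fraction S and the total hemoglobin volume fraction c_HGb.
    return c_hgb * s_oxy, c_hgb * (1.0 - s_oxy)
```

At 500 nm the scattering expression reduces to s(500 nm) exactly, which is a convenient sanity check when implementing the model.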

For machine learning algorithm training, the hyperspectral data simulated with SM are normalized. The prediction target parameters (melanosome and hemoglobin volume fractions and epidermis and dermis thicknesses) are normalized to between 0 and 1. The spectral data are normalized to between 0 and 1 for the neural network algorithms (the convolutional and regular neural networks described in Sections 2.2.2 and 2.2.3) and standardized by the StandardScaler algorithm [19] for the scikit-learn algorithm implementations (the other algorithms, described in Sections 2.2.4-2.2.6).

Empirical Data
The empirical data were used for visual inspection of the trained machine learning regressors. The hyperspectral image used was captured with a Revenio Prototype 2016 hyperspectral imager with a spatial resolution of 1920 × 1200 pixels and a spectral resolution of 120 wavelengths (Table 2). A hyperspectral camera is a device that captures multiple monochrome photographs in rapid succession while controlling the wavelength of the light that reaches the imaging sensor. Each monochrome photograph represents a narrow wavelength interval that can be interpreted as a single wavelength, such as 400 nm. There are usually approximately one hundred of these monochrome images in one hyperspectral image. The hyperspectral image contains the same spatial data as a normal RGB picture but far more data in the spectral domain; that is, the interaction between light and the subject of the image can be seen with greater precision [20].
The machine learning methods for inversion were selected by the following criteria:
1. Applicability to the special characteristics of hyperspectral data, such as
• nonlinearity,
• sparsity, i.e., some wavelengths have greater significance.
2. Applicability to the dataset and sample size.
The multi-layer perceptron was chosen based on its applicability to nonlinear data [25,26], which is usually the case when light scatters in complex media. The convolutional neural network fulfills both criteria, as it has proven good in image [14,15] and signal regression [16,17] problems, and it scales well to big datasets. It also takes the structure of the data into account through convolution. The stochastic gradient descent regressor is also used in signal regression and, according to the scikit-learn documentation, version 0.23.2 [19], it is a good fit for the size of our data, which is in the range of hundreds of thousands of pixels. The lasso regressor was chosen based on the first criterion. It is also good for sparse problems [27], in which only some of the input parameters are important. This is the case with spectral data: the spectra follow the same line in many places, but the differences are found in the wavelength zone of the chromophore of interest. The linear support vector regressor was included based on its applicability to pattern recognition [28].
All machine learning algorithms were either implemented with the scikit-learn Python library [19], version 0.20.1, or written by the authors in the Python programming language using the TensorFlow library [29], version 2.0.0.

Training and Testing Process
All the models were trained and tested with the simulated data. The data set consisted of 50,000 simulated spectra with associated target variables, which were melanosome volume fraction, hemoglobin volume fraction, epidermis thickness, and dermis thickness. All algorithms were trained for all targets simultaneously, resulting in regressors with an output size of four.
The data set was divided into training samples (n = 48,000) and test samples (n = 2000). For each method, the best set of hyperparameter values was found with the grid search method [19] and a 5-fold cross-validation strategy on the training samples. The final model was then trained using all 48,000 training samples and the best combination of hyperparameter values. The performance of each model was then assessed with the test samples.
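The search-then-refit procedure above can be sketched with scikit-learn's GridSearchCV. The estimator, grid, and synthetic data here are illustrative stand-ins, not the paper's actual configuration:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import MultiTaskLasso
from sklearn.model_selection import GridSearchCV, train_test_split

# Small synthetic stand-in for the 50,000 simulated spectra with 4 targets.
X, y = make_regression(n_samples=500, n_features=120, n_targets=4,
                       noise=0.1, random_state=0)
# 4% held out, mirroring the 48,000 / 2000 split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.04, random_state=0)

search = GridSearchCV(
    MultiTaskLasso(max_iter=5000),
    param_grid={"alpha": [0.1, 1.0, 10.0]},  # illustrative grid
    cv=5,  # 5-fold cross-validation on the training samples
)
search.fit(X_train, y_train)          # refit=True retrains on all training data
final_model = search.best_estimator_
test_r2 = final_model.score(X_test, y_test)
```

GridSearchCV's default `refit=True` already performs the final training on the full training set with the best hyperparameter combination, so no separate refitting step is needed.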
In addition to the simulated training and test data, empirical data were used in visual interpretation of the trained models.

Convolutional Neural Network
CNN is widely used in image-related regression and classification tasks [14][15][16][17]. The state-of-the-art image classification algorithms utilize some form of CNN [30,31]. There is also substantial research on its use with hyperspectral images [10]. The CNN algorithms we used are described in Tables 3-6.
In the first experiment with grid search (Table 3), we varied the shape of the two convolutional layers, the size of the convolution kernel, and the size of the max-pooling [32] window. The resulting network (Table 4) had two one-dimensional convolutional layers with a convolution window of size 12 and 32 and 64 filters, respectively. After each convolutional layer, there was one-dimensional max-pooling with a pool size of two. Both convolutional layers used rectified linear unit (ReLU) [33] activation. After the last max-pooling layer, there were a flattening layer, three dense layers (ReLU activation), a dropout of 50%, and an output layer. The first experiment is referred to in the results and discussion as CNN. In the second experiment, we varied the number of convolutional layers, the convolution kernel size, and the size of the max-pooling window (Table 5). The resulting network (Table 6) consisted of a one-dimensional convolutional layer, max-pooling, two more one-dimensional convolutional layers, another max-pooling, a flattening layer, three dense layers, a dropout of 50% [34], and an output layer. All dense and convolutional layers except the output layer used ReLU activation. Every convolutional layer had a convolution window of size 15, and the max-pooling pool size was two. The second experiment is referred to in the results and discussion as CNNV2.
The optimizer in both experiments was the Adam optimizer [35]. The learning rate was 0.001, β1 = 0.9, β2 = 0.999, and ε = 10⁻⁷; these were not included as optimization parameters, as the focus was on the architecture of the network. The loss function was the mean square error (MSE).
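The first CNN architecture can be sketched in Keras as follows. The convolutional structure, pooling, dropout, and optimizer settings follow the description above; the dense-layer width of 100 is an assumption on our part, as the exact widths come from Table 4:

```python
import tensorflow as tf

N_WAVELENGTHS = 120  # spectral resolution of the simulated spectra

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(N_WAVELENGTHS, 1)),
    # Two 1D convolutional layers: window size 12; 32 and 64 filters.
    tf.keras.layers.Conv1D(32, kernel_size=12, activation="relu"),
    tf.keras.layers.MaxPooling1D(pool_size=2),
    tf.keras.layers.Conv1D(64, kernel_size=12, activation="relu"),
    tf.keras.layers.MaxPooling1D(pool_size=2),
    tf.keras.layers.Flatten(),
    # Three ReLU dense layers (width 100 assumed), dropout, output.
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(4),  # one output per target parameter
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001,
                                       beta_1=0.9, beta_2=0.999, epsilon=1e-7),
    loss="mse",
)
```

The linear output layer suits regression targets normalized to [0, 1]; the four outputs correspond to the melanosome and hemoglobin volume fractions and the epidermis and dermis thicknesses.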

Multi-Layer Perceptron
With the MLP, we varied the number and shape of the dense layers of the network (Table 7). The resulting network (Table 8) had a flattening input layer and three dense layers with one hundred nodes each. After the dense layers, there were a dropout of 50% and the output layer. All layers except the output layer were ReLU activated, and the network optimizer was the Adam optimizer with the same parameters as for the CNN. The loss function was MSE.
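The Table 8 description translates directly into Keras; since the spectra are already one-dimensional, the flattening input layer reduces to a plain input here:

```python
import tensorflow as tf

N_WAVELENGTHS = 120

mlp = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(N_WAVELENGTHS,)),
    # Three ReLU dense layers with one hundred nodes each (Table 8).
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(4),  # linear output for the four targets
])
mlp.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), loss="mse")
```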

Lasso
The least absolute shrinkage and selection operator (more commonly known by its acronym, Lasso) was chosen because it works well with data where only some of the features are important [24], as is the case with our data. Mathematically, Lasso is a linear model with an added regularization term, and its objective function to minimize is

min_ω (1 / (2n)) ||Xω − y||²₂ + α ||ω||₁,

where α is the penalty term, ω the fitted model, n the number of samples in the training data, X the input training data, and y the target variables. In practice, we used the MultiTaskLasso algorithm from scikit-learn [19], which applies Lasso regression to all target variables at the same time.
In the Lasso grid search, we varied α, the stopping tolerance of the algorithm, and the selection method (Table 9). In the resulting estimator, α = 1, the tolerance was 10⁻⁴, and the selection method was cyclic.
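With those hyperparameters, the multi-task Lasso fit reduces to a few lines. The data here are random placeholders; with noise-only targets, the L1 penalty drives most (often all) wavelength coefficients to exactly zero, which is the sparsity property motivating the choice of this regressor:

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 120))   # placeholder standardized spectra
y = rng.normal(size=(200, 4))     # placeholder target parameters

# Hyperparameters chosen by the grid search: alpha = 1, tol = 1e-4, cyclic.
lasso = MultiTaskLasso(alpha=1.0, tol=1e-4, selection="cyclic")
lasso.fit(X, y)

# Wavelengths whose coefficients survive the L1 penalty in any target.
active = np.flatnonzero(np.any(lasso.coef_ != 0.0, axis=0))
```

Inspecting `active` after fitting on real simulated spectra indicates which wavelength bands the linear model actually uses.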

Linear Support Vector Regressor
In binary classification tasks, a linear support vector machine fits a line or hyperplane that maximizes the margin, i.e., the minimum distance from each class to the separating line. In regression, the linear support vector regressor (LSVR) instead fits a line that has the most points within a predetermined distance from the line [23].
The implementation we used was LinearSVR from the scikit-learn package. It was transformed to multi-output regression with the MultiOutputRegressor algorithm in scikit-learn.
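A minimal sketch of this wrapping, with placeholder linear data (the hyperparameters shown are scikit-learn defaults, not the grid search winners):

```python
import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import LinearSVR

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 120))
y = X @ rng.normal(size=(120, 4))  # illustrative linear multi-output targets

# LinearSVR predicts a single value, so MultiOutputRegressor fits one
# independent copy of it per target variable.
lsvr = MultiOutputRegressor(LinearSVR(C=1.0, max_iter=10000))
lsvr.fit(X, y)
```

This wrapper strategy fits one regressor per target, in contrast to MultiTaskLasso, which fits all targets jointly with a shared sparsity pattern.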
With the grid search for LSVR (Table 10), we varied the loss method, the regularization term C, and whether the algorithm solves the primal or the dual optimization problem.

Stochastic Gradient Descent Regressor

Stochastic gradient descent (SGD) is a gradient-based minimization algorithm. It starts from a random initial guess of the model parameters and iteratively adjusts them toward a point where the gradient of the objective function is zero, processing one randomly chosen observation at a time. The norm used in the regularization is the L2 norm, as opposed to the L1 norm used in the Lasso regressor [22].
The implementation we used was SGDRegressor from scikit-learn. It was transformed to multi-output regression with the MultiOutputRegressor method in scikit-learn.
In the grid search, we varied the regularization term α, the loss function, the penalty function, and the learning rate method (Table 11). In the results and discussion, this method is referred to as the stochastic gradient descent regressor (SGDR).
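A sketch of the wrapped SGD regressor on placeholder data; the hyperparameter values shown are scikit-learn defaults, since the values selected in Table 11 are not reproduced here:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 120))
y = X @ rng.normal(size=(120, 4))  # illustrative linear multi-output targets

# One SGD regressor per target; L2 penalty, in contrast to Lasso's L1.
sgdr = MultiOutputRegressor(
    SGDRegressor(penalty="l2", alpha=1e-4, learning_rate="invscaling",
                 random_state=0))
sgdr.fit(X, y)
pred = sgdr.predict(X)
```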

Accuracy Evaluation Metrics
The following metrics were calculated from the regression results with the simulated testing data:
• correlation coefficient (r),
• root mean squared error (RMSE), also known as the standard estimate of error (SEE),
• mean squared error (MSE), and
• mean absolute error (MAE).
In addition, saliency maps [36] were calculated for the CNN and MLP regressors only.
The correlation coefficient r was calculated to see the connection between the predictions and the true target values. It is the covariance (cov) of the predictions (Y_pred.) and true targets (Y_true) normalized by the standard deviations (σ) of both groups:

r = cov(Y_pred., Y_true) / (σ_Y_pred. σ_Y_true).

RMSE, MSE, and MAE were calculated to capture the average errors in the predictions. They are calculated as follows:

MSE = (1/n) Σᵢ (Y_pred.,i − Y_true,i)²,
RMSE = √MSE,
MAE = (1/n) Σᵢ |Y_pred.,i − Y_true,i|.

The saliency map shows how much each location in the input contributes to the prediction; in other words, the map indicates the locations in the input space to which the output is most sensitive. The saliency map is described in detail in [36]. Saliency maps are useful in determining the most useful areas of the training data: for example, if there is a need to reduce the size of the training data, one can drop the features that show lower values in the saliency map. The saliency maps were calculated using the keras-vis package, version 0.4.1 [37].
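The four scalar metrics above can be computed per target variable with numpy; this helper follows the definitions given in this section:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Correlation coefficient, RMSE, MSE, and MAE for one target variable."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # r: covariance normalized by the two standard deviations.
    cov = np.mean((y_pred - y_pred.mean()) * (y_true - y_true.mean()))
    r = cov / (y_pred.std() * y_true.std())
    mse = np.mean((y_pred - y_true) ** 2)
    return {"r": float(r), "RMSE": float(np.sqrt(mse)), "MSE": float(mse),
            "MAE": float(np.mean(np.abs(y_pred - y_true)))}
```

Applying the helper column-by-column to a (n_samples, 4) prediction array yields the per-parameter values reported in Table 12.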

Results
The six inversions had variable success with the simulated data (Figures 3 and 4 and Table 12). The CNN and CNNV2 experiments (Table 12) were the most accurate. Based on Figure 4, the second experiment of the CNN regressor is the best of the neural network regressors, as the correlation appears strongest there. In the saliency maps of the CNN and MLP regressors (Figure 3), we see that the CNN experiments had similar areas of interest: the interval between 550 nm and 700 nm seems to be most important in predicting the target variables. In the MLP experiment, the same interval seems useful, but the map additionally highlights some areas at the ends of the spectrum.
The lasso and stochastic gradient descent regressors showed similar results. As Figure 4 shows, the correlation is virtually identical, and the points scatter to nearly the same places; only by looking at individual points can one observe some differences between the lasso and SGD regressors. Lasso and SGDR had correlation coefficients between 0.55 and 0.88 (means 0.71 and 0.71, respectively) and RMSEs between 0.14 and 0.24 (means 0.20 and 0.20, respectively).

The worst predictor was the linear support vector regressor. On average, it is the worst by all metrics, but, in the hemoglobin volume fraction and epidermis thickness predictions, it seems marginally better than Lasso and SGDR. The correlation coefficients were between 0.59 and 0.68 (mean 0.64), RMSEs between 0.22 and 0.24 (mean 0.23), MSEs between 0.05 and 0.06 (mean 0.05), and MAEs between 0.16 and 0.19 (mean 0.18).
Overall, the least accurate predictions were for the hemoglobin volume fraction, and the models were most accurate for dermis thickness prediction. The melanosome volume fraction was relatively easily learned by all regressors, while, in epidermis thickness, there is a clear divide between the group of CNN and MLP regressors and the other regressors. The neural network models provide accurate predictions that the other types of models seem unable to match.
Regarding the test with the measured hyperspectral data in Figure 5, the details of the lesions are most easily visible in the CNN experiments. In MLP, lasso, and LSVR experiments, the epidermis thickness is poorly visible due to noise. Looking at Figure 5, the fourth best model seems to be the SGDR, as the details of the lesions are somewhat visible.

Discussion
The purpose of this study was to investigate how different machine learning approaches perform in the inversion task of predicting stochastic model parameters from spectral input data. The present results suggest that the most accurate models can be obtained with the CNN method. It is likely that the performance of all the trained models could be improved by further parameter tuning or by increasing the amount of data, but it is unlikely that one model would have an improvement rate vastly different from the others. The results were expected, as the CNN has been shown to be superior in signal processing tasks, which are essentially analogous to the one-dimensional spectral data used in this study.
The measured correlation coefficients show that the CNN is as close to perfection as can reasonably be expected without overfitting. However, we must remember that the results were achieved with simulated data, and the testing with measured biophysical parameters is yet to be carried out. The error metrics show similarly good signs, as the MAE was less than 10% of the range (all true target parameters were normalized to between zero and one) and the MSE less than 2%. If the errors are similar in further testing with measured data, the model would be ready for in vivo clinical testing.
Based on the results with the measured hyperspectral data, it seems that the stochastic skin model used represents the optical properties of the skin quite well. The CNN predictors in Figure 5 show the same general pattern as Figure 6, and at least the melanosome volume fraction is predicted to be higher in the area of the lesion, which is clearly correct based on Figure 6. These experiments do not provide information about the absolute correctness of the model, as we did not have measurements of the skin biophysical parameters to compare with Figure 5. Compared to the previous research by Vyas et al. [12,13], our correlation results are slightly inferior: they found the correlation coefficient between measured and estimated thickness to be nearly perfect, while our best correlation is 0.96. However, our goal was different: we set out to find the best inversion algorithm instead of the best inversion. In future studies, we hope to build a more robust system of parameter retrieval with a perfected CNN model.
In future research, the stochastic model and CNN combination should be thoroughly tested with measured hyperspectral data and measured biophysical parameters. The goals of the testing could include finding the best CNN estimator by optimizing the absolute and relative accuracy of the predictions with respect to the model. Our ultimate goal is to obtain a comprehensive model for predicting skin optical properties and its inverse function for predicting skin biophysical parameters.

Conclusions
We presented experiments showing that a convolutional neural network is a good option for skin optical model inversion and skin physical parameter retrieval. The results indicate that the most meaningful parameters of the stochastic model used were predicted accurately by the best inverted model. We also tested the inverted models with measured hyperspectral data, which showed promising results to be verified in further research. It seems that the inverted model may work well with measured data, at least when one looks at proportional differences between skin areas instead of absolute values. Our research also suggests that the capacity of the traditional non-neural machine learning models is insufficient for accurate modeling of the hyperspectral data.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: