Deep Learning in Hyperspectral Image Reconstruction from Single RGB images—A Case Study on Tomato Quality Parameters

Abstract: Hyperspectral imaging has many applications. However, high device costs and low hyperspectral image resolution are major obstacles limiting its wider application in agriculture and other fields. Hyperspectral image reconstruction from a single RGB image fully addresses these two problems. Using tomato as a case study, the robust HSCNN-R model, trained with a mean relative absolute error loss function and evaluated by the Mean Relative Absolute Error metric, was selected through permutation tests from models with different combinations of loss functions and evaluation metrics. Hyperspectral images were subsequently reconstructed from single tomato RGB images taken by a smartphone camera. The reconstructed images were used to predict tomato quality properties such as the ratio of soluble solid content to total titratable acidity and the normalized anthocyanin index. Both predicted parameters showed very good agreement with the corresponding "ground truth" values and high significance in an F test. This study showed the suitability of hyperspectral image reconstruction from single RGB images for fruit quality control purposes, underpinning the potential of the technology, which recovers hyperspectral properties in high resolution, for real-world, real-time monitoring applications in agriculture and beyond.


Introduction
Hyperspectral imaging (HSI) combines spectroscopy and optical imaging and provides information about the chemical properties of a material and its spatial distribution [1]. HSI is a form of non-invasive imaging that applies visible and near-infrared radiation (wavelengths 400 nm to 2500 nm) to chemicals or biological substances to measure differential reflection [2]. Due to the vast amount of information that can be obtained from hyperspectral images compared to images in the RGB (red, green, blue) color model, HSI has been widely applied in research and industry. Applications include rapid, environmentally friendly, and non-invasive analysis in remote sensing [3,4], biodiversity monitoring [5], health care [6], wood characterization [7], and the food industry [8,9]. However, despite the potential benefits, the wide application of HSI is restrained by the considerable costs of high-quality imaging devices compared to conventional RGB sensors. Moreover, most of these HSI devices are scanning-based, using either push broom or filter scanning approaches, which makes them less portable and time consuming to operate and seriously limits the broader application of HSI technology [10]. In addition, snapshot hyperspectral cameras, which are able to take images quickly, often feature a rather low spatial resolution [11]. High-resolution hyperspectral information is appealing as it provides not only spectral signatures of chemical elements but also spatial details [12].
Deep learning approaches are increasingly applied in many areas of research and industry [13][14][15], and have recently allowed the development of hyperspectral image reconstruction approaches [16]. The reconstruction of hyperspectral information from RGB images is envisioned to provide a promising way of overcoming current limitations of both scanner- and snapshot camera-based hyperspectral imaging devices, providing images with both high spatial and spectral resolution while being affordable, user friendly, and highly portable [17]. In particular, smartphone camera sensors can easily capture images in high spatial resolution, e.g., twelve million pixels per image, providing a sound basis for reconstructing high-resolution hyperspectral images. While reconstruction approaches were initially very rigid and complex [18], which limited their usability for practical application, recent progress, in particular the application of deep learning approaches, has enabled easier, faster, and more accurate hyperspectral image reconstruction pipelines [16,19]. Several contrasting approaches based on deep learning have been proposed recently [20][21][22].
While hyperspectral recovery from a single RGB image has improved greatly with the development of deep learning, it is still limited for several reasons. For example, hyperspectral images used during method development were previously restricted to the visual spectral range (VIS, 400-700 nm) with 31 wavebands and a spectral resolution of 10 nm [18]. Compared to the near-infrared range (NIR; 800 to 2500 nm), images in the visual range miss information important for many applications [23]. In addition, considerable uncertainty exists about the criteria for model performance evaluation. Currently, three major evaluation metrics are widely used in performance assessment: Mean Relative Absolute Error (MRAEEM, i.e., MRAE used as an evaluation metric), Root Mean Square Error (RMSE), and Spectral Angle Mapper (SAM) [17,19,21,24,25]. However, there is no general agreement over which criterion most robustly indicates a better model.
A key application of HSI is food quality evaluation [26]. Tomato is one of the most important fruits for daily consumption, and the fast and non-destructive evaluation of its quality is of great interest in both research and industry, which makes it a suitable object for a case study [27,28]. The taste of different tomato varieties and qualities is mainly affected by sugar content, acidity, and the ratio between them [29]. Previous studies used diverse instruments such as a Raman spectrometer [30], near-infrared spectrophotometers [31,32], and a multichannel hyperspectral imaging instrument [33] for quantifying those parameters. The normalized anthocyanin index (NAI) has been shown to be very effective in predicting lycopene content [34,35]. Lycopene, a secondary plant compound of the carotenoid class, may reduce the risk of developing several cancer types and coronary heart diseases [36]. Making use of readily available RGB cameras, e.g., smartphone cameras, in combination with hyperspectral image reconstruction techniques would greatly facilitate the assessment of tomato quality parameters. In particular, it would promote the selection and sorting process of tomato fruits in industry [37] and might even support consumers in the choice of tomato qualities.
In this study, we demonstrate the use of a permutation test to select an appropriate state-of-the-art deep learning model for hyperspectral image reconstruction from a single RGB image. Subsequently, we show that the reconstructed images can be used to predict tomato quality properties through random forests (RF) regression at high accuracies, resulting in an efficient pipeline from automatic segmentation to quality assessment. Finally, the application potential of reconstructed hyperspectral images is discussed.

Plant Material, Growth Conditions, and Tomato Sampling
Ungrafted tomato plants (Solanum lycopersicum L., variety "Dometica" (Rijk Zwaan)) were used in this experiment. Seeds were sown on 29 July 2018 in a climate-controlled chamber; 39 days after sowing, seedlings were transplanted to a Venlo-type greenhouse in southwestern Norway (58°42'49.2"N 5°31'51.0"E) and grown on rockwool slabs with drip irrigation according to common practice [38]. The plants were irrigated with a complete nutrient solution based on standardized recommendations: 17.81 mM NO3, 0.71 mM NH4, 1.74 mM P, 9.2 mM K, 4.73 mM Ca, 2.72 mM Mg, 2.74 mM S, 15 µM Fe, 10 µM Mn, 5 µM Zn, 30 µM B, 0.75 µM Cu, and 0.5 µM Mo. The electrical conductivity of the nutrient solution was maintained at around 3.2 mS cm⁻¹ and the pH was 5.8. Average daily temperature, relative humidity, CO2 concentration, and natural solar radiation during the growing period were 22.4 ± 2.8 °C, 74 ± 7.8%, 670 ± 192 ppm, and 33 ± 77 W m⁻², respectively. High-pressure sodium lamps (Philips GP Plus, Gavita Nordic AS, Norway) with an intensity of 300 W m⁻² (1.5 m above the top of the canopy) were used in addition for ≤18 h per day (i.e., when solar radiation was < 250 W m⁻²). Side shoots were pruned regularly, and the number of tomatoes in each truss was pruned to seven. Tomato fruits for the study were collected in the morning, 210 days after sowing. Three undamaged tomatoes of similar size were selected from each of 12 color grades [39]. Color grades range from 1 to 12, where 1 is uniform green (i.e., mature green) and 12 is uniform dark red (i.e., red overripe).

Image Acquisition
Hyperspectral images of tomatoes were taken immediately with a portable hyperspectral camera, Specim IQ (Spectral Imaging Ltd., Finland; [40]), with a spatial resolution of 512×512 pixels, a spectral resolution of 7 nm, and 204 spectral bands from 397 to 1003 nm. Calibration was conducted following the user manual. When taking images, the camera was mounted on a tripod 100 cm above the table on which 12 tomatoes were placed. Two Arrilite 750 Plus halogen lamps (ARRI, Germany) were placed symmetrically beside the camera for illumination. A white panel (90% reflectance) was placed adjacent to the tomatoes as a reference target for reflectance transformation [40].
A set of RGB images was rendered directly from the hyperspectral images under CIE Standard Illuminant D65 with the CIE 1931 2° Standard Observer and gamma correction (γ = 1.4) (Figure 1). Cropped images of 21×31 pixels, excluding overexposed areas, from both the rendered RGB image and the corresponding hyperspectral image area of each tomato were subsequently used to build the model (see below). The built-in main camera of the smartphone Samsung Galaxy S9+, Android 9.0 (Samsung Corp., South Korea), was used to take RGB images immediately after hyperspectral imaging. The main 12 Mp camera consists of a 1/2.55" sensor and an f/1.5 to f/2.4 variable aperture lens. Images were taken with a resolution of 3024×4032 pixels, and the distance between the smartphone camera and the tomatoes was set so that the fixed-size table fitted inside the viewfinder without using the optical zoom. The flashlight was turned off and autofocus mode was activated when shooting the RGB images. Images were saved in JPEG format.

Tomato Quality Parameters ("ground truth")
After the imaging campaign, each tomato was immediately and separately homogenized with a handheld blender. The fresh, uniform samples were used for estimation of soluble solid content (SSC, expressed as °Brix) and total titratable acidity (TTA, expressed as % of citric acid equivalents (CAE) per FW; [41,42]). SSC was measured with a digital refractometer PR-101α (ATAGO, Japan). TTA was determined using an automatic titrator 794 Basic Titrino (Metrohm, Switzerland) by titrating with 0.1 M NaOH to pH 8.2. The ratio of SSC to TTA (STR) of each tomato was calculated.
Lycopene content was estimated with the normalized anthocyanin index (NAI) [35], using the reflectance at 570 nm and 780 nm as determined with the hyperspectral camera (see above). The NAI of the i th pixel was calculated through Equation 1.

$$\mathrm{NAI}_i = \frac{R_{780,i} - R_{570,i}}{R_{780,i} + R_{570,i}} \qquad (1)$$

where R780 and R570 are the reflectance of the i th pixel of the image at wavelengths 780 nm and 570 nm, respectively, and NAI is the NAI value of the i th pixel. The median of all NAI values of each segmented tomato, excluding the overexposed areas [43] on tomato RGB images, was treated as the overall NAI of an individual tomato. These quality parameters, either determined by laboratory measurements (SSC, TTA, STR) or calculated from reflectance measurements using the hyperspectral camera (NAI), were used as "ground truth" values in this study and subsequently compared with the predicted parameters (see below).
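As a minimal sketch of the per-tomato NAI described above, the following assumes the usual normalized-difference form of the index, (R780 − R570)/(R780 + R570). The function names and the toy reflectance values are hypothetical and only serve as illustration.

```python
from statistics import median


def nai(r780: float, r570: float) -> float:
    """NAI of one pixel from its reflectance at 780 nm and 570 nm
    (assumed normalized-difference form)."""
    return (r780 - r570) / (r780 + r570)


def tomato_nai(pixels, overexposed):
    """Median NAI over the pixels of one segmented tomato.

    pixels      -- list of (r780, r570) reflectance pairs
    overexposed -- list of booleans, True where a pixel is overexposed
    """
    values = [nai(r780, r570)
              for (r780, r570), bad in zip(pixels, overexposed)
              if not bad]
    return median(values)


# Toy example with made-up reflectance values (not measured data).
pixels = [(0.60, 0.20), (0.50, 0.30), (0.90, 0.90), (0.40, 0.10)]
overexposed = [False, False, True, False]
print(round(tomato_nai(pixels, overexposed), 3))
```

Using the median rather than the mean keeps a few remaining specular or shadowed pixels from dominating the per-tomato value.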

Model selection, training and validation
A state-of-the-art deep learning model, i.e., a residual neural network model named HSCNN-R [17], was selected because it shows very good performance for hyperspectral image reconstruction. In HSCNN-R, a modern residual block [44] replaces the plain CNN architecture of HSCNN [16] to improve model performance. Six residual blocks with 64 filters each were chosen for the HSCNN-R model; compared with the 16 residual blocks proposed originally, this improved time efficiency without harming performance during the validation procedure (data not shown). A permutation test was applied during model selection, for which the total of 36 samples was randomly divided five times into a training set (24 samples) and a testing set (12 samples). The batch size, learning rate, learning rate decay, and optimizer weight initialization in different layers in HSCNN-R were set according to Shi et al. [17].
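The repeated random division into 24 training and 12 testing samples can be sketched as follows. This is only an illustration: the function name, the fixed seed, and the use of integer sample indices are assumptions, not details given in the study.

```python
import random


def repeated_splits(n_samples=36, n_train=24, n_repeats=5, seed=0):
    """Return n_repeats random (train, test) index splits of n_samples items."""
    rng = random.Random(seed)  # fixed seed only for reproducibility of the sketch
    splits = []
    for _ in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        splits.append((sorted(idx[:n_train]), sorted(idx[n_train:])))
    return splits


splits = repeated_splits()
for train, test in splits:
    print(len(train), len(test))
```

Each split would then be used to train one model per loss function, so that the resulting minimum validation errors can be compared across splits.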
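Two different loss functions were used to compare their effectiveness in hyperspectral image reconstruction: mean square error (MSE, Equation 2) [24,25], one of the most widely used loss functions, and mean relative absolute error (MRAELF, i.e., MRAE used as a loss function, Equation 3), which was suggested to reduce the bias from different illuminance levels [17].

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(I_R^{(i)} - I_G^{(i)}\right)^2 \qquad (2)$$

$$\mathrm{MRAE}_{\mathrm{LF}} = \frac{1}{n}\sum_{i=1}^{n}\frac{\left|I_R^{(i)} - I_G^{(i)}\right|}{I_G^{(i)}} \qquad (3)$$

where $I_G^{(i)}$ and $I_R^{(i)}$ represent the i th pixel of the ground truth and reconstructed hyperspectral images, respectively, and n is the total number of pixels.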
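The difference between the two loss functions can be illustrated with a short sketch. It assumes the standard definitions (squared differences averaged over pixels for MSE, absolute differences divided by the ground truth value for MRAE). The small `eps` guard against division by zero is an addition for this sketch and not part of the described method.

```python
def mse_loss(truth, recon):
    """Mean square error between ground truth and reconstructed values."""
    n = len(truth)
    return sum((r - g) ** 2 for g, r in zip(truth, recon)) / n


def mrae_loss(truth, recon, eps=1e-8):
    """Mean relative absolute error; eps (an assumption) avoids division by zero."""
    n = len(truth)
    return sum(abs(r - g) / (g + eps) for g, r in zip(truth, recon)) / n


# Made-up reflectance values: one dark and one bright pixel,
# both reconstructed with the same absolute error of 0.05.
truth = [0.10, 0.80]
recon = [0.15, 0.85]
print(mse_loss(truth, recon), mrae_loss(truth, recon))
```

In this toy case both pixels contribute equally to MSE, whereas MRAE weights the error on the dark pixel eight times more heavily, which is the sense in which a relative loss reduces the bias from different illuminance levels.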
Three evaluation metrics, MRAEEM (Equation 4), RMSE (Equation 5), and SAM (Equation 6), were used to select models with their corresponding minimum values during validation; smaller values indicate less error in the reconstructed hyperspectral images. RMSE evaluates the difference between ground truth and reconstructed hyperspectral images, and SAM quantifies the similarity of the original and reconstructed reflectance across the spectra by measuring the average angle between them [45].

$$\mathrm{MRAE}_{\mathrm{EM}} = \frac{1}{n}\sum_{i=1}^{n}\frac{\left|I_R^{(i)} - I_G^{(i)}\right|}{I_G^{(i)}} \qquad (4)$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(I_R^{(i)} - I_G^{(i)}\right)^2} \qquad (5)$$

$$\mathrm{SAM} = \frac{1}{n}\sum_{i=1}^{n}\arccos\left(\frac{\left(I_G^{(i)}\right)^{T} I_R^{(i)}}{\left\lVert I_G^{(i)}\right\rVert_2 \left\lVert I_R^{(i)}\right\rVert_2}\right) \qquad (6)$$

where $I_G^{(i)}$ and $I_R^{(i)}$ are the i th pixel in the ground truth and reconstructed hyperspectral images, respectively; T means transpose, and n is the total number of pixels of each image.
For model training, the batch size was set to 8 and the optimizer AdaMax [46] was used with settings of β1 = 0.9, β2 = 0.999, and eps = 10⁻⁸. The weights were initialized through HeNormal initialization [44] in each convolutional layer. The initial learning rate was set at 0.005 and decreased by 10% every 100 epochs. The model performance was evaluated through the three evaluation metrics MRAEEM, RMSE, and SAM (see above). All models were trained until no further decrease in validation loss occurred.
During validation, the whole cropped RGB image (21×31×3) from the validation set for HSCNN-R was used as input to reconstruct the hyperspectral image with 204 spectral bands (21×31×204); MRAEEM, RMSE, and SAM values between reconstructed and ground truth hyperspectral images were calculated accordingly. Based on the 30 models selected (two loss functions × three evaluation metrics × five random samplings), values of the evaluation metrics were compared between and within the two loss functions by permutation test using the EnvStats package [47] in R [48]. The model showing consistent performance was selected and used to reconstruct hyperspectral images from single tomato RGB images.
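The RMSE and SAM metrics and the learning rate schedule described above can be sketched in plain Python. This is an illustration under stated assumptions: each pixel is represented as a list of band values, RMSE is averaged over all pixel and band values, and the function names are hypothetical.

```python
import math


def rmse(truth, recon):
    """Root mean square error over all pixel and band values."""
    sq = [(r - g) ** 2
          for pg, pr in zip(truth, recon)
          for g, r in zip(pg, pr)]
    return math.sqrt(sum(sq) / len(sq))


def sam(truth, recon):
    """Mean spectral angle (radians) between ground truth and reconstructed spectra."""
    angles = []
    for g, r in zip(truth, recon):
        dot = sum(a * b for a, b in zip(g, r))
        norm = math.sqrt(sum(a * a for a in g)) * math.sqrt(sum(b * b for b in r))
        cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding errors
        angles.append(math.acos(cos))
    return sum(angles) / len(angles)


def learning_rate(epoch, initial=0.005, decay=0.10, step=100):
    """Step decay: the rate drops by 10% after every 100 epochs."""
    return initial * (1.0 - decay) ** (epoch // step)


# Two made-up pixels with three bands each.
truth = [[0.2, 0.4, 0.6], [0.1, 0.1, 0.3]]
recon = [[0.4, 0.8, 1.2], [0.1, 0.2, 0.3]]
print(rmse(truth, recon), sam(truth, recon), learning_rate(250))
```

The first toy pixel is reconstructed at twice the true brightness: it contributes a large RMSE but a spectral angle of zero, which shows why the two metrics can favor different models.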
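The study ran its permutation tests with the EnvStats package in R. The following Python sketch only illustrates the general idea of a two-sample permutation test on the difference in means; it is not the EnvStats implementation, and the function name and the example values are made up.

```python
import random


def permutation_test(a, b, n_perm=10000, seed=0):
    """Two-sided Monte Carlo permutation test on the difference in means.

    Returns the estimated p-value for the null hypothesis that both
    groups come from the same distribution.
    """
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(pa) / len(pa) - sum(pb) / len(pb))
        if diff >= observed - 1e-12:  # tolerance for floating point ties
            count += 1
    # Add-one correction keeps the estimate strictly positive.
    return (count + 1) / (n_perm + 1)


# Made-up minimum validation errors from five random splits per loss function.
errors_loss_a = [0.10, 0.11, 0.09, 0.10, 0.12]
errors_loss_b = [0.20, 0.22, 0.19, 0.21, 0.23]
print(permutation_test(errors_loss_a, errors_loss_b))
```

With only five values per group, an exact test would enumerate just 252 possible regroupings, so the smallest attainable two-sided p-value is about 0.008; this is worth keeping in mind when interpreting significance at this sample size.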

Image segmentation and quality parameter prediction
For tomato quality parameter prediction, the overall spectral information of each tomato was considered. The RGB images from the Samsung Galaxy S9+ and masks outlined manually with Labelme [49] were used for training. RetinaNet [50] was trained for 5 epochs with 10,000 iterations per epoch to detect and segment individual tomatoes; the learning rate was set at 2e-8 and kept constant during training.
The reconstructed hyperspectral reflectance of each tomato was extracted based on the tomato mask segmented from the smartphone RGB images, excluding the overexposed region, and then subjected to asymmetric least squares baseline correction of the logarithmic linearized reflectance log(1/R) [51]; the median values of spectral reflectance of the different wavebands were extracted for each tomato for quality parameter prediction. The recursive feature elimination method from the Caret package, with an RF model and repeated 10-fold cross validation, was applied to select important wavebands, which were then used to build the prediction model; the optimal model was selected based on prediction accuracy by tuning the parameter mtry in the RF models. The parameter mtry sets the number of input variables randomly chosen at each node of the RF models.
The predicted value of each sample was based on the model trained on the rest of the samples (i.e., leave-one-out prediction); R² and P values in the F test were calculated based on the predictions and corresponding ground truth values (see above) of all tomato samples.
The free-of-charge cloud service Google Colaboratory (Colab) with a Python 3 runtime served as the major platform for model training and validation. Colab is equipped with a 2.3 GHz Intel Xeon processor with two cores, 12.6 GB RAM, and an NVIDIA Tesla K80 GPU with 12 GB RAM.
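Part of the spectral extraction step can be sketched as follows: pixels outside the mask or flagged as overexposed are dropped, reflectance is converted to log(1/R), and the per-band median is taken. The asymmetric least squares baseline correction is deliberately omitted here, and the base-10 logarithm, the function name, and the toy values are assumptions.

```python
import math
from statistics import median


def median_log_spectrum(spectra, keep):
    """Per-band median of log10(1/R) over the retained pixels of one tomato.

    spectra -- list of per-pixel reflectance lists (one value per waveband)
    keep    -- list of booleans, True for pixels inside the mask and not overexposed
    """
    kept = [s for s, k in zip(spectra, keep) if k]
    n_bands = len(kept[0])
    return [median(math.log10(1.0 / s[b]) for s in kept)
            for b in range(n_bands)]


# Three made-up pixels with two bands; the last pixel is excluded.
spectra = [[0.1, 1.0], [0.01, 1.0], [0.5, 0.5]]
keep = [True, True, False]
print(median_log_spectrum(spectra, keep))
```

The resulting vector has one value per waveband and would serve as the feature vector of a tomato in the subsequent waveband selection and regression.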
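The evaluation scheme described above, in which each sample is predicted by a model trained on all remaining samples, can be sketched as follows. A simple one-variable linear regression stands in for the random forest model purely to keep the example self-contained. Computing R² as the squared correlation between truth and prediction, and the F statistic as R²/(1 − R²)·(n − 2), is an assumption about how the F test was set up.

```python
import math


def fit_line(x, y):
    """Ordinary least squares fit y = slope * x + intercept."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def leave_one_out_predictions(x, y):
    """Predict each sample with a model fitted on all other samples."""
    preds = []
    for i in range(len(x)):
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        preds.append(slope * x[i] + intercept)
    return preds


def r2_and_f(truth, pred):
    """Squared correlation of truth vs. prediction and the F statistic
    of the corresponding simple regression (1 and n - 2 degrees of freedom)."""
    n = len(truth)
    mt, mp = sum(truth) / n, sum(pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(truth, pred))
    var_t = sum((t - mt) ** 2 for t in truth)
    var_p = sum((p - mp) ** 2 for p in pred)
    r2 = cov * cov / (var_t * var_p)
    f = math.inf if r2 >= 1.0 else r2 / (1.0 - r2) * (n - 2)
    return r2, f


# Made-up feature and quality values for six samples.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]
preds = leave_one_out_predictions(x, y)
print(r2_and_f(y, preds))
```

Turning the F statistic into a P value additionally requires the F distribution, which is available in statistical packages but not in the Python standard library.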
Detailed implementation of the whole analysis pipeline is available upon request.

Model Selection and Performance
The training and validation histories of the three evaluation metrics MRAEEM, RMSE, and SAM are displayed in Figure 2 (see also Table 1). Both the number of epochs and the time needed to reach the minimum values of the three criteria during validation were much lower in models with the MRAELF loss function, i.e., fewer than 500 epochs and 90 seconds, than in models with the MSE loss function. The minimum MRAE values (MRAEEM_min) of models with the MRAELF loss function were significantly lower than those of models with the MSE loss function (Figure 3a). The minimum values of RMSE and SAM did not differ significantly between the two loss functions (Figure 3b,c). In the permutation test there was no significant difference in MRAEEM_min, RMSEmin, and SAMmin values between the three evaluation metrics within each loss function (Figure 3d-f,g-i). Thus, a model with the MRAELF loss function evaluated by MRAEEM was used to reconstruct the hyperspectral images from single RGB images for non-destructive tomato quality parameter quantification. The spectral reflectance reconstructed from RGB images, either taken directly by a smartphone RGB sensor or rendered from hyperspectral images, was generally very similar to the corresponding spectral reflectance determined with the hyperspectral camera (Figure 4a), as were the first derivatives (Figure 4b). The largest deviations, with greater reconstructed reflectance in RGB images taken by the smartphone, occurred in the spectral range of approximately 380-740 nm. The reconstructed spectral reflectance of the central pixels of the validation set of tomatoes of color grades 1-12 is given in Supplementary Figure S1.

[Figure caption fragment] ... i.e., orange). The black dots at the center of the RGB images (inset in a), which originate either from the hyperspectral camera (blue frame, triangles) or from the RGB smartphone camera (black frame, circles), denote the analyzed region and the respective reflectance along the spectra. Red squares denote the spectral reflectance from the original hyperspectral images (i.e., the corresponding "ground truth").

Tomato quality parameter prediction
The tomato quality properties soluble solid content (SSC), total titratable acidity (TTA), and their ratio (STR) were predicted in good agreement with the corresponding laboratory measurements (Figure 5a-c), with R² values of 0.51, 0.61, and 0.78, respectively. Lycopene content, as indicated by NAI, was predicted with a high accuracy of R² = 0.92 (Figure 5d). All four quality parameters were predicted significantly better than random guessing, as denoted by the corresponding P values of 4.02e-06, 8.8e-08, 1.37e-11, and 2.54e-18 in the F test. The correlations between SSC, TTA, STR, and the NAI were found to be close to zero (Figure S2).

Discussion
HSI reconstruction has become popular and has opened a new field for low-cost methods of acquiring hyperspectral information in high spatial and spectral resolution. Even though there has been some research developing methods for reconstructing hyperspectral images [17,18,25], real-world applications of these methods are still lacking [10]. This study has demonstrated the potential of using the HSCNN-R model for hyperspectral reconstruction in the visible to near-infrared range to predict key quality parameters of tomato. Because the minimum values of the three evaluation metrics MRAEEM, RMSE, and SAM did not occur in the same model, each metric selects a different model, as has been found previously [17,25,52]. A lower value in any single one of these three evaluation metrics thus cannot by itself indicate better model performance. As only the minimum errors of models with the MRAELF loss function and the MRAEEM evaluation metric were found to be significantly lower than those of models with the MSE loss function, it can be concluded that these models were consistently superior in spectral reflectance reconstruction compared with models with other combinations of loss function and evaluation metric. This agrees with Shi et al. [17], who found that the MRAELF loss function was more robust to outliers and treated wavebands with different illumination levels across the whole spectrum more evenly than the MSE loss function. As these loss functions can also be purpose specific, the MRAELF loss function should be chosen if all wavebands are equally prioritized for better exploration of the whole spectrum, while MSE should be preferred if the highly illuminated spectral reflectance is of greater interest.
Models trained with the MRAELF loss function converged in fewer epochs and less time, and reached lower errors, than models trained with the MSE loss function. This is beneficial for hyperspectral image reconstruction because, in practice, model training is expected to be integrated into real-time prediction, e.g., sorting tomatoes by lycopene content on a conveyor belt. The increase of validation error after reaching the minimum value, for MRAEEM, RMSE, and SAM alike, was mainly due to overfitting on a small training dataset [53].
To further confirm the robustness of the selected model in reconstructing hyperspectral images, the spectral reflectance reconstructed from RGB images, either rendered from hyperspectral images or captured directly by the smartphone camera, was compared with the spectral reflectance measured directly with the hyperspectral camera ("ground truth"). The similarity of both the reflectance and the corresponding first derivative demonstrated that the selected approach resulted in a reliable reconstruction of the spectral pattern. As the RGB images used for training were rendered from hyperspectral images by a standard CIE matching function, while smartphone RGB sensors have different spectral sensitivity functions likely deviating from CIE [22], an increase of errors during the reconstruction of hyperspectral images from smartphone RGB sensors might be expected. However, although the RGB images from the smartphone were completely new to the trained model, the reconstruction results demonstrated the soundness of the model in recovering the spectral reflectance even from regular RGB images taken by a standard smartphone model.
The very high R² value in NAI prediction showed that the reconstructed hyperspectral image was suitable for predicting tomato lycopene content non-destructively based on RGB images of intact tomatoes. NAI is an indicator for lycopene, which is closely reflected by the color of the tomato [54]. Tomato color changes from green to red during development due to the degradation of chlorophyll and the accumulation of lycopene [55]. The high prediction accuracy for NAI via reconstructed hyperspectral images is probably also related to the large range of color change in the corresponding tomato samples. Both TTA and SSC were less precisely predicted; however, their ratio STR was predicted with a high R² value and very high significance in the F test, which agrees with earlier findings [35]. The higher precision of reconstructed hyperspectral images in predicting STR values is fortunate, as STR is also more informative than either TTA or SSC values alone: tomato flavor is determined mainly by the ratio of sugar and acidity rather than by the two separate properties [29]. Overall, the high accuracy in tomato quality prediction highlights the robustness and potential of hyperspectral image reconstruction.
The good to very good performance, both in reconstructing hyperspectral images from smartphone RGB images unseen by the model and in predicting tomato quality parameters at moderate to high accuracies, makes the HSCNN-R model an important tool for future imaging applications. Hyperspectral reconstruction from a single RGB image makes HSI application mobile and low cost, and allows for easy implementation through either a cloud service or an app. With the selected model and trained weights, we can now generate hyperspectral images of tomatoes, at least of the same variety, with consumer-level cameras and explore other hyperspectral properties of interest, as the example of tomato quality predicted from smartphone images illustrates in this study (Figure 5). Specifically, this provides huge benefits for tomato research and industry, and potentially also for other fruit crops, such as cucumber and apple. Even though it is possible to predict the STR of each pixel of a tomato image directly without reconstructing the whole spectrum from 400 to 1000 nm, the fully reconstructed spectral reflectance provides opportunities to explore other hyperspectral properties through different machine learning algorithms, offering much higher flexibility. Important bands can even be selected to reduce workload while improving prediction accuracy, as is common in HSI analysis [56,57].

Conclusions
This study first demonstrates the use of hyperspectral image reconstruction from a single RGB image for a real-world application, using tomato fruits as an example. The capability of HSCNN-R for spectral reconstruction beyond the visual range towards the near-infrared was demonstrated. The hyperspectral images reconstructed from RGB images of tomato allowed important tomato quality parameters to be estimated with high accuracy.
Hyperspectral image reconstruction could be a promising approach for a range of other fields, thereby further developing its full potential. With HSCNN-R, we can potentially reconstruct hyperspectral images in both higher spatial and spectral resolution at much lower costs. However, the reconstruction model built on tomatoes can probably not be transferred easily to other categories, e.g., determination of chemical properties of other fruits, soils, or rocks, or to different light conditions, which can be an obstacle for extending the application range. Thus, libraries containing hyperspectral images of different categories of interest (fruits of various varieties and at different harvest stages and growing conditions (incl. stress), soil types, wood, skin, etc.) should be built for training models that fit each category specifically or for developing a general model that fits more categories.
Future advancement in this field should particularly focus on: 1) exploration of more robust models for hyperspectral image reconstruction in various illumination conditions, and 2) extending the application field using current state-of-the-art models and building libraries for a wider range of objects as exemplified above. It can thereby be expected that higher resolution hyperspectral images will become more accessible for a range of real-world applications in the future.