Multispectral Food Classification and Caloric Estimation Using Convolutional Neural Networks

Continuous monitoring and recording of the type and caloric content of ingested foods with a minimum of user intervention is very useful in preventing metabolic diseases and obesity. In this paper, automatic recognition of food type and caloric content was achieved via the use of multispectral images. A method of fusing the RGB image with images captured in the ultraviolet, visible, and near-infrared regions at center wavelengths of 385, 405, 430, 470, 490, 510, 560, 590, 625, 645, 660, 810, 850, 870, 890, 910, 950, 970, and 1020 nm was adopted to improve the accuracy. A convolutional neural network (CNN) was adopted to classify food items and estimate the caloric amounts. The CNN was trained using 10,909 images acquired from 101 food types. The objective functions, including classification accuracy and mean absolute percentage error (MAPE), were investigated according to the number of wavelengths. The optimal combinations of wavelengths (including/excluding the RGB image) were determined by using a piecewise selection method. Validation tests were carried out on 3636 images of the food types that were used in training the CNN. As a result of the experiments, the accuracy of food classification was increased from 88.9 to 97.1% and the MAPE was decreased from 41.97 to 18.97% even when only one kind of NIR image was added to the RGB image. The highest accuracy for food type classification was 99.81% when using 19 images, and the lowest MAPE for caloric content was 10.56% when using 14 images. These results demonstrate that the use of images captured at various wavelengths in the UV and NIR bands is very helpful for improving the accuracy of food classification and caloric estimation.


Introduction
Precise and continuous monitoring of the types and amounts of foods consumed is very helpful for the maintenance of good health. For health professionals, being aware of the nutritional content of ingested food plays an important role in the proper treatment of patients with weight-related diseases and gastrointestinal diseases, as well as those at high risk for metabolic diseases such as obesity [1]. For people with no health problems, monitoring the types, amounts, and nutritional content of food consumed is useful for maintaining this status. Monitoring of the types and amounts of foods eaten is often achieved via manual record-keeping methods that include food-frequency questionnaires [2], self-report diaries [3], and multimedia diaries [4]. Several user-friendly diet-related apps have become available on smartphones, in which image recognition schemes are in part adopted to classify the types of food. In such an approach, however, the accuracy is affected by user inattention and erroneous record-keeping, which often decreases usefulness.
Several automatic food recognizers (AFRs) are available to continuously recognize the types and amounts of consumed food with minimal user intervention. Wearable sensing and digital signal-processing technologies are key factors in the implementation of AFRs, which are divided into several categories according to the adopted sensing method. In acoustic-based methods, classifying the types of food is achieved via chewing and swallowing sounds. The underlying principle is that chewing sounds vary depending on the physical characteristics of the food, including shape, hardness, and moisture content. In-ear microphones [5-7] and throat microphones [8,9] are typically used to acquire the sounds of food intake. Using a throat microphone, recognition experiments were carried out on seven types of food [9]. A recognition rate of 81.5∼90.1% was achieved where a hidden Markov model (HMM) was adopted to classify the types of food [9]. Päfiler et al. performed recognition experiments on seven types of food using an in-ear microphone and reported a recognition rate of 66∼79%. The performance achieved using acoustic signals has been limited, however, because it is difficult to discriminate various foods by using only acoustic cues.
A variety of sensors are used in non-acoustic methods. These include an imaging sensor [10], a visible-light spectrometer, a conductive sensor [11], a surface electromyogram (sEMG) sensor attached to the frame of eyeglasses [12], and a near-infrared (NIR) spectrometer [13]. These methods have the advantage of distinguishing and sub-dividing different types of food while analyzing the principal constituents. Sensors that are inconvenient to wear, however, can be disadvantageous, and separate sampling of the food is required [11,13]. Ultrasonic Doppler shifts are also employed to classify the types of food [14]. The underlying principle of this method involves movements of the jaw during the chewing of food, as well as vibrations of the jaw caused by the crushing of food, both of which reflect the characteristics of food types. The accuracy of the ultrasonic Doppler method was 90.13% for six types of food [14].
Since types of food are easily distinguished according to their shape, texture, and color, visual cues have been used for the classification of food types and estimation of the food amount [10,15-23]. In a vision-based approach, the classification of food types can be formulated as a pattern recognition problem in which segmentation, feature selection, and classification are sequentially carried out on food images. Due to recent advances in machine learning technology, artificial neural networks have been employed to classify food categories and to predict the caloric content of food [18,22,23]. Convolutional neural networks (CNNs) have been used to classify 15 food categories with an average classification accuracy of 82.5% and a correlation between the true and estimated calories of 0.81 [18]. When RGB images under visible lighting sources were used in previous vision-based approaches, recognition accuracy was degraded for visually similar foods. Moreover, RGB imaging cannot capture food-specific responses to UV and NIR light, which limits its utility in food analysis.
Multispectral analysis has been widely adopted in food analysis [24-39]. The underlying principle is that individual ingredients in food have different absorption spectra. For example, infrared (IR) light is strongly absorbed by water compared with ultraviolet (UV) and visible (VIS) light. Therefore, differences in the absorption spectra between the VIS and IR bands are useful in estimating the amount of water contained in food. In previous studies, multispectral analysis of food was employed to quantify specific contents, such as oil and vinegar [24], water [27], sugar [28-36], and soluble protein [37]. The multispectral analysis involved the use of a spectrometer and a wide-band light source (such as a halogen lamp). Using these methods, an optimal set of wavelengths was chosen from the absorption spectra in the interest of maximizing the prediction accuracy for the ingredients of interest. A correlation of 0.8912 was obtained using four wavelengths out of a 280-bin absorption spectrum when predicting the sugar content of apples [28]. When hyperspectral imaging was adopted to predict the sugar content in wine, a maximum correlation of 0.93 was obtained using partial least squares regression [32]. The use of a spectrometer and a wide-band light source confers the ability to select the optimal wavelengths in sharp detail. Problems associated with size, weight, and power consumption, however, could potentially cause difficulties in implementing wearable monitoring devices.
The multispectral approach has also been implemented by using a number of narrow-bandwidth light sources, such as light-emitting diodes (LEDs), and a digital camera [24,26,27,31,35,39]. Compared with halogen lamps, the improvement gained by using an LED light source was verified for multispectral food analysis [31]. The experimental results showed that an LED light source returned a slightly higher correlation than halogen lamps (0.78 vs. 0.77). The number of employed wavelengths ranged from 5 [39] to 28 for the UV, VIS, and NIR bands [31]. Raju et al. [24] used multispectral images from 10 LEDs with different wavelengths to detect dressing oil and vinegar on salad leaves, and reported an accuracy of 84.2% using five LEDs. Previous studies focused mainly on predicting the specific nutritional content of specific foods (e.g., water in beef [27], sugar in apples [28], sugar in sugarcane [29], sugar in peaches [30], soluble solids in pomegranates [33], sugar in potatoes [34], sugar in black tea [35], and soluble protein in oilseed rape leaves [37]). The caloric content of food is determined by the amounts of each ingredient, and it would therefore be reasonable to expect accurate caloric predictions from multispectral analysis techniques.
In the present study, multi-wavelength imaging techniques were applied to classifying food items and estimating caloric content. Compared with conventional RGB image-based methods, the usefulness of NIR/UV images was experimentally verified for food classification and prediction of caloric content. The optimal number and combination of wavelengths were determined to maximize the estimation accuracy, where a piecewise selection method was adopted.
This paper is organized as follows: In Section 2, the process of data preparation and the properties of the employed food items are presented. The preliminary verification results and the overall procedure for food analysis are explained in Section 3. The experimental results and discussion are presented in Section 4. Concluding remarks are provided in Section 5.

Data Acquisition
The list of the food items used in this study is presented in Table 1. The food items were selected to represent the various physical properties (liquid/soft/hard) of everyday foods and to reflect the naturally occurring balance between healthy and unhealthy foods. The caloric amount was obtained from existing nutrition fact tables for each food and from food composition data released by the Korea Ministry of Food and Drug Safety (KFDA) [40]. It is noteworthy that a number of foods were nutritionally different but difficult to distinguish visually (e.g., cider and water, coffee and coffee with sugar, tofu and milk pudding, milk soda and milk, etc.). Such pairs of food items were good choices for verifying the feasibility of UV and NIR images in the recognition of the types and calories of food. In the case of liquid foods, images were acquired by putting the same amount of food in the same container (cup) so that the shape of the cup and the amount of food could not be used as clues for food recognition. In a similar manner, in the case of non-liquid foods, plates of the same size and shape were used.
A custom-made image acquisition system was employed to obtain the multispectral images. A schematic of the image acquisition system appears in Figure 1, and a photograph is shown in Figure 2. The light source was positioned 25 cm above the food tray. Four digital cameras faced the center of the food tray and were connected to a desktop PC via a universal serial bus (USB). The acquired image size was set at 640 × 480 pixels (H × V), and each pixel had 16-bit resolution. Each camera was equipped with a motorized IR-cut filter to block visible light when the IR images were acquired. The light source consisted of a total of 20 types of LEDs emitting light at different wavelengths (385, 405, 430, 470, 490, 510, 560, 590, 625, 645, 660, 810, 850, 870, 890, 910, 950, 970, and 1020 nm, and white). The white LED was used to obtain the RGB images, which were split into three (R-G-B) channels. The light source for each wavelength was composed of 30 LEDs, with the exception of the white light source (10 LEDs). The LEDs of each wavelength were arranged in a circular shape at a specified position on the printed circuit board (PCB), as shown in Figure 3.
Before the image of a specific wavelength was acquired, the center of the corresponding LED area was moved to the center of the food tray. Since the intensities of the LEDs differed according to wavelength, the driving current of the LEDs for each wavelength was adjusted to minimize the differences in light intensity across wavelengths. The LED panel moved back-and-forth and left-to-right using linear stages powered by stepping motors. Data augmentation was achieved not by image transformations but by capturing images from as many views as possible. Accordingly, the four cameras and a rotating table were employed. The angular resolution of the rotating table could be adjusted from 0.5∼90°. The movement of the LED panel, the rotation of the table, and the on/off switching of each of the LEDs were all controlled by a micro-controller (Atmega128A). The set of control commands was predefined both in the control module of the acquisition system and in the host desktop PC. Hence, the task of acquisition was achieved entirely by constructing a sequence of the individual commands. The acquisition code was written in Python (version 3.6.11). Communication between the desktop PC and the control module was achieved using Bluetooth technology. Image acquisition was carried out in a dark chamber (470 mm × 620 mm × 340 mm, W × D × H) where external light was blocked. The angular resolution of the rotating table was set at 10° (a total of 36 views per camera). The total acquisition time for each food was 2738 s, which corresponded to an acquisition time of 3.8 s per frame. The images of bread, castella, and a chocolate bar captured under white light from specific angles are shown in Figure 4.

Preliminary Feasibility Tests for UV/NIR Images
The main objective of this study was to improve the accuracy of food classification and calorie estimation by using UV/NIR images. Prior to construction of the classification/estimation rules, the use of UV/NIR images was experimentally verified as adequate for this purpose. The white LED used for capturing RGB images was experimentally measured as emitting light with wavelengths ranging from 430 to 640 nm. Hence, images acquired under a light source with a wavelength shorter than 430 nm were considered UV images, and images acquired under a light source with a wavelength longer than 640 nm were considered NIR images. When two food items could not be well distinguished visually under a visible light source, it was necessary to confirm whether the corresponding images acquired under UV or NIR light sources could be relatively well distinguished. Food images are visually distinguished according to shape, texture, and the distribution of brightness values (histogram). The shapes and textures of foods, however, are uniquely determined independent of the light source, whereas the distribution of brightness values is affected not only by the food type but also by the wavelength of the light source. Accordingly, the brightness distribution was used to measure the difference between two food images as affected by the wavelength of the light source. A histogram was obtained from the image in which the non-food portion was masked. In the present study, the Bhattacharyya distance was employed to measure the differences between two food images. The Bhattacharyya distance for two images I_n and I_m acquired under a light source of wavelength λ_k is given by

D_B(I_n, I_m | λ_k) = 1 − Σ_{y∈Y} √( p(y | I_n, λ_k) · p(y | I_m, λ_k) ),

where Y is the set of possible brightness values, and p(y | I, λ) is the probability density function of the brightness value y in image I at a wavelength of λ. The Bhattacharyya distance represents a complete match at a value of 0 (minimum) and a complete mismatch at a value of 1 (maximum). Figure 4 presents an example
of two different food images acquired under light sources of different wavelengths. Two food items, coke and sugar-free grape juice, looked similar under a visible light source, as shown in the first row of Figure 4, whereas the differences between these two food images were apparent in the IR images at 810 nm. For this image pair, the Bhattacharyya distances for λ = 640 nm and λ = 810 nm were 0.45 and 0.99, respectively. This example shows that the Bhattacharyya distance can be a good indicator of visual differences between two different food items. The usefulness of UV/NIR images in terms of food classification was verified by examining the proportion of food pairs that were visually similar under visible light but visually distinct under UV or NIR light. To this end, a cumulative distribution function was defined; it can be heuristically computed as follows:

F_C(T_V, T_NV) = (1/N_p) Σ_{(n,m)} 1[ max_{λ∈Λ_V} D_B(I_n, I_m | λ) < T_V ] · 1[ max_{λ∈Λ_NV} D_B(I_n, I_m | λ) > T_NV ],

where N_p is the number of image pairs, 1[·] is the indicator function, and Λ_V and Λ_NV are the sets of visible and non-visible wavelengths, respectively. T_V and T_NV are the thresholds of the Bhattacharyya distances for the visible band and non-visible band, respectively, and F_C(T_V, T_NV) represents the ratio of image pairs that satisfy the condition whereby the Bhattacharyya distance from the VIS images is less than T_V while the Bhattacharyya distance from the UV or IR images is greater than T_NV. In the case of T_V < T_NV, F_C(T_V, T_NV) represents the frequency of a relatively small Bhattacharyya distance (a low degree of discrimination) in the visible light band but a high Bhattacharyya distance (a high degree of discrimination) in the non-visible light band. The cumulative distributions for the various thresholds are plotted in Figure 5. Note that the visual difference between two images was not significant at a Bhattacharyya distance of 0.45, as shown in Figure 6. Hence, the curves with T_V ≤ 0.4 correspond to the cumulative distribution obtained from food pairs that are not well distinguished under a visible light source. In the case of T_V = 0.4, the ratio of
image pairs for which the Bhattacharyya distance for UV/NIR images exceeded 0.5 (corresponding to well-distinguished images) was about 70%. Similar results were obtained for other T_V values (e.g., for T_V = 0.2, 0.3, and 0.4, F_C(T_V, T_NV = T_V) = 89, 77, and 64%, respectively). This means that a significant number of food pairs that were not visually well distinguished under visible light sources were better distinguished under non-visible light sources. Such results indicate that the performance of food classification can be improved by using UV/NIR images that are complementary to VIS images.
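As a concrete illustration, the histogram-based Bhattacharyya distance described above can be sketched as follows. This is a minimal version assuming 8-bit grayscale images; the masking of the non-food region is omitted:

```python
import numpy as np

def bhattacharyya_distance(img_a, img_b, bins=256):
    # Normalized brightness histograms, i.e., p(y | I, lambda)
    p, _ = np.histogram(img_a, bins=bins, range=(0, bins), density=True)
    q, _ = np.histogram(img_b, bins=bins, range=(0, bins), density=True)
    # 1 minus the Bhattacharyya coefficient: 0 for identical
    # histograms, 1 for completely disjoint histograms
    return 1.0 - float(np.sum(np.sqrt(p * q)))

same = bhattacharyya_distance(np.full((8, 8), 10), np.full((8, 8), 10))
diff = bhattacharyya_distance(np.full((8, 8), 10), np.full((8, 8), 200))
print(same, diff)  # -> 0.0 1.0
```

Distances computed this way fall in [0, 1], matching the match/mismatch extremes stated above.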
In terms of caloric estimation, the usefulness of a specific wavelength image was determined by examining whether differences in the caloric amounts of two food items were significantly correlated with differences between the two corresponding images. The underlying assumption is that if the difference between the two food images is large, their caloric difference will also be large, and vice versa. The caloric count was computed via the measured weight and the nutrition facts for each food. The Bhattacharyya distance was again adopted to measure the differences between the two food images. The caloric difference between two food items n and m is given by the following absolute relative difference:

D_C(n, m) = |c_n − c_m| / max(c_n, c_m),

where c_n and c_m are the calories of food items n and m, respectively. The Pearson correlation for the k-th wavelength images is given by

ρ_k = cov[ D_C, D_B(I | λ_k) ] / ( σ_{D_C} σ_{D_B(I|λ_k)} ),

where cov[x, y] denotes the covariance of x and y, and σ_x is the standard deviation of x. D_B(I | λ_k) denotes the Bhattacharyya distance computed from the images acquired under a light source of wavelength λ_k.
The correlations across the wavelengths of each light source are presented in Figure 7. The maximum correlation was obtained at λ = 870 nm. A second maximum also appeared in the NIR band (λ = 850 nm). The results indicate that the differences between the NIR images are moderately correlated with differences in caloric content; the images acquired under the NIR light sources are therefore beneficial in terms of caloric estimation. The average correlations of the visible and non-visible bands were 0.636 and 0.633, respectively. A significance test also showed that there was no remarkable difference between the correlation values of the visible band and those of the non-visible band (p = 0.7). From such results, it can be reasonably assumed that VIS and UV/NIR images are equally useful in terms of caloric estimation.
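The correlation analysis above can be sketched in a few lines. The per-pair values below are illustrative stand-ins, not measured data:

```python
import numpy as np

# Hypothetical per-pair caloric differences D_C(n, m) and image
# distances D_B(I | lambda_k) for one wavelength (toy values)
calorie_diff = np.array([0.05, 0.20, 0.45, 0.60, 0.90])
image_dist = np.array([0.10, 0.25, 0.40, 0.70, 0.85])

# Pearson correlation: cov[x, y] / (sigma_x * sigma_y)
rho_k = float(np.corrcoef(calorie_diff, image_dist)[0, 1])
print(round(rho_k, 3))
```

A ρ_k near 1 would indicate that image distances at wavelength λ_k track caloric differences closely, which is the criterion used above to judge a wavelength's usefulness.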

Preprocessing
Although a highly stable current source was adopted to drive the LEDs, there was some shot-to-shot variation in the intensity of the light. This caused unwanted changes in the acquired images and resulted in degradation of the estimation accuracy. A simple way to compensate for variations in the intensity of the light sources is to adjust the intensity of the incident light so that the average intensity approximates the reference intensity. A typical scale factor is given by

s_ω^(i) = μ_ω^(ref) / μ_ω^(i),

where ω and i are the indices of the wavelength and shot, respectively, and μ_ω^(i) and μ_ω^(ref) represent the average and reference intensities, respectively. The reference intensity can be obtained by averaging a large number of light source images taken at different times. Such an intensity normalization method is very simple and easy to implement.
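A minimal sketch of this normalization, assuming the scale factor is applied multiplicatively to every pixel:

```python
import numpy as np

def normalize_intensity(image, mu_ref):
    # Scale factor s = mu_ref / mu_shot, so the average intensity
    # of the corrected image matches the reference intensity
    mu_shot = image.mean()
    return image * (mu_ref / mu_shot)

shot = np.array([[10.0, 20.0], [30.0, 40.0]])  # average intensity 25
corrected = normalize_intensity(shot, mu_ref=50.0)
print(corrected.mean())  # -> 50.0
```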

Food Analysis Using a Convolutional Neural Network
There are many machine learning schemes, such as random forests (RF), support vector regression (SVR), partial least squares regression (PLSR), and artificial neural networks (ANNs), that can be applied to recognition of the types, amounts, and nutritional content of food. Among them, ANN-based approaches have the advantage that nonlinear relationships between the input (multispectral images) and the output (target values) can be taken into consideration in constructing the estimation rules. This leads to higher performance in terms of classification and regression. Accordingly, a supervised learning approach that employs convolutional neural networks (CNNs) was adopted in this study. The architecture of the CNN adopted in this study is shown in Figure 8, and was heuristically determined using a validation dataset (10% of the entire learning dataset). Note that the CNN architecture shown in Figure 8 was used to classify the types of food; the final output was replaced by a single output in the case of caloric estimation. The resultant architecture of the CNN was simpler than others previously proposed for image recognition tasks (e.g., ResNet-50, Inception-v3, Xception). This was due mainly to the smaller number of targets compared with those of other CNNs (101 vs. 1000).
We tested the performance in terms of food classification accuracy and estimation error for the target values according to different CNN input (image) sizes. The results showed that a size of 64 × 64 yielded the highest performance for both classification and estimation. Accordingly, all images from the cameras were reduced to 64 × 64 by interpolation. Before reducing the size, no cropping was carried out on the acquired images, and the entire image size (640 × 480) was used.
The performance according to the hyperparameters of the CNN was also investigated using a validation dataset. This was performed separately for each task (food classification and caloric estimation). The resultant CNN was composed of two convolution/max-pooling layers and a fully connected multi-layer perceptron (MLP) with a recognition output, as shown in Figure 8. The kernel sizes of the first and second convolution layers were 11 × 11 and 7 × 7, respectively, while the window sizes of both max-pooling layers were 4 × 4. There were three fully connected layers in the employed CNN, corresponding to the input from the final convolution layer and to the hidden and output layers. The numbers of nodes for these layers, also determined using the validation datasets, were 112, 128, and 18, respectively. Although the hyperparameters of each CNN were separately tuned for each task, the architecture of the CNN shown in Figure 8 yielded satisfactory performance for both image classification and caloric estimation.
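The feature-map sizes implied by these hyperparameters can be checked with a short sketch. Stride-1 convolutions without padding and non-overlapping 4 × 4 pooling are assumptions here; the paper does not state them explicitly:

```python
def conv_out(size, kernel):
    # Stride-1 convolution without padding ("valid")
    return size - kernel + 1

def pool_out(size, window):
    # Non-overlapping max pooling
    return size // window

size = 64                  # 64 x 64 input image
size = conv_out(size, 11)  # first convolution, 11 x 11 kernel -> 54
size = pool_out(size, 4)   # first max pooling, 4 x 4 window   -> 13
size = conv_out(size, 7)   # second convolution, 7 x 7 kernel  -> 7
size = pool_out(size, 4)   # second max pooling, 4 x 4 window  -> 1
print(size)  # -> 1
```

Under these assumptions the final feature map is 1 × 1, which would make the 112-node fully connected input consistent with a final convolution layer of 112 channels.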
A rectified linear unit (ReLU) was adopted as the activation function for all hidden layers. A soft-max function and a linear combination function were employed for the output layers of the classification and regression CNNs, respectively. Accordingly, the loss functions were given by the cross-entropy and the mean absolute percentage error (MAPE) in the cases of food classification and caloric estimation, respectively. A total of 1000 epochs resulted in a trained CNN with sufficient performance in terms of food recognition accuracy. It is noteworthy that the accuracy of classification/estimation was strongly affected by the mini-batch size. The experimental results showed that a mini-batch size of 32 gave the best performance in all cases. Since the MAPE is given by dividing the absolute error value by its ground truth, the loss value cannot be calculated when the given target value is zero. As shown in Table 1, there were some cases in which the ground-truth caloric count was zero. Note that a value of zero in the nutrition facts does not necessarily mean that the amount of the nutritional content is zero; it actually means that the amount is less than a predefined minimum. In the present study, zero caloric values were replaced by the minimum, which was 5 kcal according to [40].
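The MAPE loss with the zero-calorie replacement described above can be sketched as follows; the 5 kcal floor comes from [40], and the rest is a straightforward implementation:

```python
import numpy as np

def mape(y_true, y_pred, min_kcal=5.0):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Zero ground-truth calories are replaced by the 5 kcal minimum
    # so the relative error is always well-defined
    y_true = np.where(y_true == 0.0, min_kcal, y_true)
    return 100.0 * float(np.mean(np.abs(y_true - y_pred) / y_true))

print(mape([100.0, 0.0], [90.0, 5.0]))  # -> 5.0
```

In the example, the first item has a 10% relative error and the second (0 kcal, floored to 5 kcal) is predicted exactly, giving a MAPE of 5%.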

Selection of the Wavelengths
Although evaluation of all possible wavelength combinations could be performed via offline processing, it was desirable to avoid the enormous computational cost of a brute-force grid search. In the present study, a piecewise selection method was adopted to select the set of optimal wavelengths. Let Ω = {ω_0, ω_1, . . ., ω_{N−1}} be the set of the employed wavelengths, where N is the total number of wavelengths. The set of selected wavelengths was gradually constructed by adding or removing a wavelength to or from the previously constructed set. The overall procedure is as follows:

Step-1. Forward selection: Let Ω_F^(i) be the wavelength set at the i-th forward step. All combinations Ω^(i−1) ∪ {ω}, ω ∈ Ω̄^(i−1) = Ω − Ω^(i−1), are evaluated to find the optimal wavelength ω*_F that minimizes the given loss function, and the forward set is then constructed as

Ω_F^(i) = Ω^(i−1) ∪ {ω*_F}, ω*_F = argmin_{ω ∈ Ω − Ω^(i−1)} L(Ω^(i−1) ∪ {ω}).

Step-2. Backward elimination: The element (wavelength) whose removal minimizes the loss function is removed from Ω_F^(i). The wavelength set at the i-th backward step is then given by

Ω_B^(i) = Ω_F^(i) − {ω*_B}, ω*_B = argmin_{ω ∈ Ω_F^(i)} L(Ω_F^(i) − {ω}),

where L(S) is the loss for a wavelength set S, given by the final loss of the learned CNN.
Step-3. Final forward selection: The final set at the i-th step is built by a forward selection step wherein the optimal wavelength ω* is chosen from the set Ω − Ω_B^(i) so as to minimize the loss function:

Ω^(i) = Ω_B^(i) ∪ {ω*}, ω* = argmin_{ω ∈ Ω − Ω_B^(i)} L(Ω_B^(i) ∪ {ω}).
Steps 1∼3 were iteratively performed until Ω^(i) = Ω. The final set of optimal wavelengths is given by

Ω* = argmin_{Ω^(i)} L(Ω^(i)).

A large number of wavelengths in Ω* would involve an increased number of LEDs and shots, which would result in a large device, high power consumption, and long acquisition times. Hence, it is preferable that the number of wavelengths also be considered in building the estimation rules. The three-channel (RGB) images could be obtained from one image acquired under the white light, which can be regarded as a representative image within the visible light region. In the present study, food analysis was performed on combinations of RGB, UV, and NIR images to reduce the number of shots, and the results were compared with those obtained by combining images acquired across the full set of wavelengths.
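The piecewise selection procedure can be sketched as a greedy search. In this sketch, the CNN training loss L(S) is replaced by a toy loss that simply counts how far a candidate set is from a hypothetical optimal subset, so that the search behavior can be illustrated without training a network:

```python
def piecewise_select(wavelengths, loss):
    """Forward selection / backward elimination / final forward selection."""
    selected = []
    candidates = []
    while set(selected) != set(wavelengths):
        # Step-1: forward selection -- add the wavelength minimizing the loss
        remaining = [w for w in wavelengths if w not in selected]
        w_f = min(remaining, key=lambda w: loss(selected + [w]))
        forward = selected + [w_f]
        # Step-2: backward elimination -- drop the wavelength whose
        # removal minimizes the loss
        w_b = min(forward, key=lambda w: loss([x for x in forward if x != w]))
        backward = [x for x in forward if x != w_b]
        # Step-3: final forward selection from the full complement
        remaining = [w for w in wavelengths if w not in backward]
        w_star = min(remaining, key=lambda w: loss(backward + [w]))
        selected = backward + [w_star]
        candidates.append(list(selected))
    # Final set: the intermediate set with the lowest loss
    return min(candidates, key=loss)

# Toy loss: symmetric difference from a hypothetical best subset
best = {405, 890, 970}
toy_loss = lambda s: len(set(s) ^ best)
print(sorted(piecewise_select([385, 405, 890, 970, 430], toy_loss)))
# -> [405, 890, 970]
```

Each iteration grows the selected set by a net of one wavelength, so the loop terminates after N iterations, and the best intermediate set encountered is returned as Ω*.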

Accuracy for Food Item Classification
The food classification accuracy according to the number of input images is presented in Figure 9. The number of input images is equal to the number of images actually taken by the camera at different wavelengths. Note that although the RGB image was separated into three individual images that were input to the CNN, they were counted as one image because they were captured in one shot. The results indicated that the accuracy increased rapidly while the number of images was less than five. Even when one image was used, the classification accuracy was higher than that of the previous CNN-based food recognition method [18]. Such a high level of accuracy was due mainly to the use of a large training dataset that included images acquired from various directions. A maximum accuracy of 99.81% was obtained when images acquired from 11 different wavelength light sources were used. The accuracy increased until the number of images reached five, after which no significant increase was observed. The maximum increase in classification accuracy (7.67%) was obtained when the number of input images was increased from one to two. This was confirmed by the fact that the correlation coefficient between the number of images and the classification accuracies was 0.919 for up to five images; the correlation coefficient decreased to 0.587 when the number of images ranged from 6 to 19. Such results indicate that it is possible to obtain a sufficient recognition rate by using five images at different wavelengths. In fact, the recognition rate obtained from the five images was 99.12%, which was not significantly lower than the maximum recognition rate of 99.81%.
Thus far, the results were obtained by excluding RGB images. The food classification accuracy when single-wavelength images were added to the RGB image is also presented in Figure 9. The increase in the recognition rate was more rapid than when the RGB image was not used. A maximum recognition rate of 99.83% was obtained when almost all images (19 out of 20) were used. The largest increase in classification accuracy was achieved when only one single-wavelength image was added to the RGB image (e.g., the accuracy increased from 88.86 to 97.08% when an image at a wavelength of 890 nm was added to the RGB image). The accuracy gradually increased until the number of added images exceeded three, and remained almost constant after the number exceeded five. When food classification was performed with only one type of image (as with previous image-based food classification methods), correct recognition rates of 88.86% and 87.9% were achieved by using the RGB image and one image from a single wavelength, respectively. This indicates that an RGB image is a slightly better choice for food classification when only one type of image is used. As noted in the previous section, a high level of accuracy is paramount when using a small number of images. From this point of view, it is meaningful to examine how the correct recognition rate changes for each food item when recognition is performed by adding only one type of NIR or UV image to an RGB image. The experimental results on two input images (including the RGB image) showed that the highest classification accuracy was achieved when an image at a wavelength of 890 nm was added to the RGB image. Accordingly, the change in the recognition rate for each food item was examined when adding only one NIR image at a wavelength of 890 nm to the RGB image. Of the 101 food items, 62 items showed an improved recognition rate when this one type of NIR image was added, and 35 items showed the same recognition rate. A decrease
in the recognition rate was observed for only four food items, but the level of decrease was generally small (<5%). The food items that showed a significant improvement in recognition rate when one type of NIR image was added are listed in Table 2. It is noteworthy that, for most of these food items, there exist other food items that are not easily distinguished under visible light. For example, the food pairs coffee and coffee with sugar, as well as caffelatte and caffelatte with sugar, were almost visually identical. These food items were often recognized as other foods that appeared almost identical under visible light. For the food item grape soda, a recognition rate of 0% was obtained when only an RGB image captured under visible light was used. In this case, all grape soda images were recognized as sugar-free grape soda, which is visually identical to grape soda. When an image acquired under an 890 nm wavelength light source was added, a recognition rate of 44% was obtained. Consequently, it is apparent that NIR/UV images improve the accuracy of image-based food classification.

Accuracy for Caloric Estimation
The results for caloric estimation are presented in Figure 10, in which MAPEs are plotted against the number of images used. Without an RGB image, the minimum MAPE was 10.49% when a total of 10 different wavelength images were employed. Interestingly, the MAPE decreased until the number of input images reached 10 (R = −0.9317) but increased after the number of input images exceeded 11 (R = 0.6530). This result was likely due to the limitations of the piecewise algorithm adopted in the selection of the wavelengths and to overfitting caused by an excessive increase in the number of input images. When three wavelength images were employed, the MAPE was 18.97%, which was significantly lower than when using the RGB image alone (41.97%). This result was also remarkably better than those of previous CNN-based caloric estimation schemes (27.93% [19]), even though the number of food items adopted here was relatively large (101 vs. 15). The wavelengths selected in the case of three input images were 385, 560, and 970 nm, i.e., one image from each of the UV, VIS, and NIR bands. This implies that not only the number of input images but also the selected wavelengths play an important role in the accuracy of caloric estimation.
When single-wavelength images were added to an RGB image, a minimum MAPE of 10.56% was achieved. A total of 14 input images were needed to obtain this minimum, i.e., more images than when the RGB image was excluded (10 images). Overall, the MAPEs form a decreasing curve as the number of input images increases (R = −0.5935). However, the correlation values before and after the minimum point (14 images) were −0.6847 and −0.4919, respectively, indicating that there was no significant relationship between the number of input images and MAPE once the number of input images exceeded 14. The Pearson correlation between the ground-truth and estimated caloric counts was also notably larger than that of previous CNN-based estimation schemes [18,19] (0.975 vs. 0.806) when a total of 14 UV/VIS/NIR images was employed in the caloric estimation. This was due mainly to the use of UV/NIR images in this study: even for visually similar foods with different caloric counts, their UV/NIR images were often clearly distinguishable. Adding more images obtained from light sources with various wavelengths in the UV and NIR bands progressively improved the classification of images according to caloric count.
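The two accuracy measures used above, MAPE and the Pearson correlation between ground-truth and estimated calories, can be computed as in the following minimal sketch. The caloric values are toy data for illustration, not figures from the paper.

```python
import math

def mape(truth, pred):
    """Mean absolute percentage error (%), as used for caloric accuracy."""
    return 100.0 * sum(abs(t - p) / t for t, p in zip(truth, pred)) / len(truth)

def pearson(x, y):
    """Pearson correlation between ground-truth and estimated calories."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

truth = [100.0, 200.0, 300.0]   # illustrative ground-truth calories (kcal)
pred = [110.0, 180.0, 330.0]    # illustrative CNN estimates (kcal)
err = mape(truth, pred)         # 10.0 (% error)
r = pearson(truth, pred)        # close to 1: estimates track the truth
```

Note that MAPE is undefined for zero-calorie items, which is consistent with the paper reporting results over the 98 (of 101) food items with valid caloric values.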
Figure 11 compares the cumulative distribution function of MAPE when using only RGB images with that obtained when images acquired from single-wavelength light sources are also used. The figure confirms the superiority of adding such images to the RGB image. For example, about 50% of all test food images had a MAPE of less than 30% when only RGB images were used. When one image was added to the RGB image, 50% of all test food images showed a MAPE of less than 12%, and this value was further reduced to 5% when five images were added to the RGB image.
It is also important to examine the accuracy of the caloric estimation for each food item when only one type of NIR/UV image is added to the RGB image. An image acquired at a wavelength of 970 nm yielded the highest MAPE reduction and was therefore chosen as the addition to the RGB image. Compared with the use of only RGB images, the numbers of food items with decreased, increased, and unchanged MAPE values were 77, 20, and 1, respectively (out of 98 food items with valid caloric values). This indicates that the accuracy of caloric estimation for most food items (>78%) can be improved by including only one type of NIR image. This is also confirmed by the fact that most of the MAPE reduction (74.8%) was achieved when only one NIR image (at 970 nm) was added to the RGB image, as shown in Figure 11. Table 3 lists the food items with large differences in MAPE between using only the RGB image and using the RGB + 970 nm images. The estimated caloric values of the food items presented in Table 3 deviated from the ground truth by more than 50% when only the RGB image was used. The results show that the MAPE values for these food items were reduced by more than half when calories were estimated after adding one type of NIR image to the RGB image.
In conclusion, the use of UV/NIR images in addition to RGB images increases the accuracy of caloric estimation. Even with a small number of UV/NIR images, caloric estimation performance was significantly improved compared with conventional RGB-based estimation.

Analysis of the Selected Wavelengths
The wavelengths selected for food classification appear in Table 4 for each number of images, and the selection rate for each wavelength is shown in Figure 12. Results are presented for two cases: inclusion and exclusion of the RGB image. When the RGB image was excluded, wavelengths in the NIR band were selected at a rate of 46.84%, whereas 41.05% of the selected wavelengths belonged to the visible band. The UV band was selected at a relatively low rate (12.11%), partially because fewer UV light sources were employed than VIS or NIR light sources, making a UV-band source less likely to be selected. The chance of selecting an NIR-band wavelength increased to 60% when a relatively small number of wavelengths was allowed (≤5). Considering that the number of NIR-band wavelengths was slightly smaller than the number of VIS wavelengths (8 vs. 9), the higher selection rate of NIR wavelengths indicates that NIR images are more useful for food classification than VIS images. This seems to be due in part to the nature of NIR images, from which the distribution of water content in food can be approximately obtained.
When the RGB image was included, the ratio of NIR-band wavelengths selected increased further, to 50.53%. This is because the RGB image already contains three visible-light channels, so NIR-band images were more likely to be selected. Whether the RGB image was included or excluded, the most frequently selected wavelengths were 385, 870, 890, and 970 nm, again indicating that wavelengths in the invisible bands were selected more often for food classification. The wavelength selection results for caloric estimation are shown in Figure 13 and Table 5. When the RGB image was excluded, wavelengths in the NIR region were selected with a frequency of 31.58%, whereas wavelengths in the visible region were selected with a frequency of 51.05%. Even when the RGB image was included, wavelengths in the visible region were selected more frequently than those in the NIR band. This result differs somewhat from that for food classification. Wavelengths of 385, 430, 560, and 970 nm were frequently selected when the RGB image was excluded, indicating that wavelengths in the visible band were also selected often. On the other hand, wavelengths of 405, 430, 950, and 1020 nm were selected with relatively high frequency when the RGB image was included.

Conclusions
RGB images are most commonly used in image-based food analysis. The proposed approach assumes that the type of food and its caloric content can be partially determined from the morphological characteristics and wavelength distributions of food images. A multispectral analysis technique using NIR and UV light along with VIS light has previously been employed in the analysis of various foods, with results approximating those of conventional chemical analysis techniques. The present study was motivated by such a multi-spectral approach, and the procedures for food classification and caloric estimation adopted a multi-spectral analysis technique. Automated equipment capable of acquiring images at up to 20 wavelengths was devised, and approximately 15,000 images were acquired per wavelength from 101 types of food. The types of food and their caloric content were estimated using a CNN, chosen so that the features relevant to estimating the target variables could be derived automatically from the images, making a separate feature extraction step unnecessary.
The experimental results showed that the accuracy of both food item classification and caloric estimation was notably superior to previous methods. This was due mainly to the use of various light sources in the UV/VIS/NIR bands, unlike conventional methods that use only RGB images. It would be interesting to connect multi-spectral imaging techniques with the quantification of ingredients such as proteins, fats, carbohydrates, sugars, and sodium; future work will focus on this issue.

Figure 1 .
Figure 1. Schematic of the image acquisition system.

Figure 2 .
Figure 2. Photograph of the image acquisition system.

Figure 3 .
Figure 3. Photograph of LED in the panel.
where (I_m, I_n) is the pair of two different food items m and n, and |S| is the cardinality of the set S. In this study, the maximum D_B value in each band (visible and non-visible bands) was chosen as the representative D_B for the corresponding band, and, hence, D
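The Bhattacharyya distance D_B between two food images can be computed from their normalized intensity histograms, as in the following sketch. The histogram values are illustrative; how histograms are binned per wavelength image is an assumption here, not a detail given in this excerpt.

```python
import math

def bhattacharyya_distance(p, q):
    """D_B = -ln(BC), where BC is the Bhattacharyya coefficient
    between two normalized histograms p and q (each sums to 1)."""
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return -math.log(bc)

# Identical histograms give D_B = 0; dissimilar ones give larger values,
# i.e., the two food images are easier to tell apart in that band.
same = bhattacharyya_distance([0.5, 0.5, 0.0], [0.5, 0.5, 0.0])  # 0.0
diff = bhattacharyya_distance([0.5, 0.5, 0.0], [0.1, 0.4, 0.5])  # > same
```

A larger representative D_B in the non-visible band than in the visible band would support the claim that NIR/UV images separate visually similar food pairs.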

Figure 5 .
Figure 5. The cumulative distributions of the distances obtained from non-visible images and those from visible images.

Figure 6 .
Figure 6. (a) Images acquired under a 640 nm light source. (b) Images acquired under an 810 nm light source. The food items are coke (top) and sugar-free grape juice (bottom).

Figure 7 .
Figure 7. Correlations for each of the wavelength images between the caloric differences and the Bhattacharyya distances.

Figure 8 .
Figure 8. Architecture of the proposed CNN for the classification of food type, where N_w is the number of input images and N_t is the number of targets (101 for food classification and 1 for caloric estimation).
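A minimal sketch of a CNN along the lines of Figure 8 is given below: N_w single-channel wavelength images are stacked as input channels, and the head produces N_t outputs. The layer counts and sizes are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class MultispectralCNN(nn.Module):
    """Toy CNN taking N_w wavelength images as input channels and
    producing N_t outputs (N_t = 101 for food classification,
    N_t = 1 for caloric regression). Layer sizes are illustrative."""
    def __init__(self, n_w: int, n_t: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(n_w, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pooling -> fixed-size feature
        )
        self.head = nn.Linear(32, n_t)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

# A batch of 2 samples, each with 4 wavelength images of 64x64 pixels.
model = MultispectralCNN(n_w=4, n_t=101)
out = model(torch.zeros(2, 4, 64, 64))  # logits of shape (2, 101)
```

Changing `n_t` to 1 turns the same backbone into the caloric regressor, which matches the paper's use of one architecture for both targets.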

Figure 9 .
Figure 9. The classification accuracies of food items according to the number of light sources when the RGB image is included or excluded.

Figure 10 .
Figure 10. The mean absolute percentage errors (MAPEs) of caloric content (kcal) according to the number of wavelengths, when the RGB image was either included or excluded.

Figure 11 .
Figure 11. The cumulative distribution of MAPE according to the input of the CNN (RGB image only, RGB + 1 image, and RGB + 5 images).

Figure 12 .
Figure 12. The selection ratio of each wavelength for food classification when the RGB image is included or excluded.

Figure 13 .
Figure 13. The selection ratio of each wavelength for calorie estimation when the RGB image is included or excluded.

Table 1 .
Dataset properties per food item.

Table 2 .
Comparisons of recognition rates (%) for food items that revealed a large difference between the use of only an RGB image and using an RGB image with one additional (NIR) image.

Table 3 .
Comparison of MAPEs in caloric content (%) for food items showing large differences between use of the RGB image only and use of the RGB image plus one (NIR) image.

Table 4 .
Selected wavelengths for food classification. Top: without RGB image. Bottom: with RGB image.

Table 5 .
Selected wavelengths for caloric estimation. Top: without RGB image. Bottom: with RGB image.