Indigenous Food Recognition Model Based on Various Convolutional Neural Network Architectures for Gastronomic Tourism Business Analytics

In gastronomic tourism, food is viewed as the central tourist attraction. Specifically, indigenous food is known to represent the expression of local culture and identity. To promote gastronomic tourism, it is critical to have a model for the food business analytics system. This research undertakes an empirical evaluation of recent transfer-learning models for deep-learning feature extraction in a food recognition model. The VIREO-Food172 Dataset and a newly established Sabah Food Dataset are used to evaluate the food recognition model. Afterwards, the model is implemented in a web application system as an attempt to automate food recognition. In this model, a fully connected layer with 11 and 10 Softmax neurons, respectively, is used as the classifier for the food categories in the two datasets. Six pre-trained Convolutional Neural Network (CNN) models are evaluated as feature extractors to extract essential features from food images. From the evaluation, the research found that the EfficientNet-based feature extractor with a CNN classifier achieved the highest classification accuracy: 94.01% on the Sabah Food Dataset and 86.57% on the VIREO-Food172 Dataset. EFFNet as a feature representation outperformed Xception in terms of overall performance. However, Xception can be considered, despite a slight accuracy drawback, if computational speed and memory usage are more important than raw performance.


Introduction
Food and beverage expenditures are estimated to account for roughly a quarter of total tourism spending worldwide. As food and tourism are inextricably linked, gastronomic tourism, in which the local cuisine serves as the primary attraction for travelers, has gained popularity in recent years [1]. Local foods can contribute to the development of a local brand, which encourages tourism growth in several countries [2]. Sabah, one of Malaysia's states, is a well-known tourist destination for its magnificent scenery, which contributes significantly to its economy. Sabah's diverse indigenous groups and subgroups are notable for their unique traditions, cultures, practices, and traditional local foods. According to [3], acceptance of local food brands among tourists and Sabah residents is highly likely to be critical to preserving the culinary heritage and providing visitors with a sense of uniqueness and special, memorable experiences. Besides the preservation and appreciation, local

• An empirical analysis was conducted to investigate the effect of deep-learning techniques on food recognition performance, using transfer-learning approaches as feature extractors on the Sabah Food Dataset and the VIREO-Food172 Dataset.
• A Sabah Food Dataset was created, containing 11 categories of popular Sabah foods. It was used to train the machine-learning model for the classification of Sabah foods.
• A preliminary prototype of a web-based application for the food recognition model is presented.
The following sections outline the structure of this paper. Section 2 discusses related work on deep learning in food recognition, and Section 3 discusses the theoretical background of transfer learning using pre-trained deep-learning architectures. Subsequently, Section 4 explains the details of the experimental procedure. Then, in Section 5, the results of the experiments and the deployment of the food recognition model are discussed. Finally, Section 6 presents the overall conclusion and future work.

Related Works
Machine learning is used as a data-processing technique to solve a wide range of problems in a variety of fields, including smart homes [11], human identification in healthcare [12], face recognition [13][14][15], water quality research [16], and many more. In traditional machine learning, tedious and exhaustive feature engineering is common practice in order to produce highly discriminative features. However, due to advancements in computational and storage capability, deeper feature representations based on deep learning have become common practice for better classification and regression performance. A deep Artificial Neural Network (ANN) composed of various layers with multilevel feature learning defines the general concept of deep learning. Specifically, a set of components comprising pooling, convolutional, and fully connected layers, dubbed the Convolutional Neural Network (CNN), has gained popularity as a pattern recognition technique, including in studies involving food recognition. This is because its recognition capability is exceptional, even with simple CNN configurations. For instance, Lu [17] demonstrated four layers of hidden neurons to classify ten categories of a small-scale food image dataset. The RGB components of each image were reshaped into two-dimensional form as input data. First, a convolutional layer with a 7 by 7 kernel and a stride of one was used to extract 32 feature maps. Second, a 5 by 5 convolutional layer was used to extract 64 feature maps. Lastly, a total of 128 feature maps were generated from a 3 by 3 convolutional layer. The best accuracy reported on the test set was 74%. However, overfitting is suspected as a result of the limited size of the training data, which limits the accuracy on the testing dataset at higher epochs.
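For intuition, the parameter cost of such a three-layer convolutional stack can be computed directly from the kernel sizes and feature-map counts described above. This is a back-of-the-envelope sketch; the RGB (3-channel) input depth and the presence of bias terms are assumptions, since [17] does not report them:

```python
def conv_params(kernel, in_ch, out_ch, bias=True):
    """Number of trainable parameters in a 2-D convolutional layer."""
    return kernel * kernel * in_ch * out_ch + (out_ch if bias else 0)

# Lu's three convolutional layers: 7x7 -> 32 maps, 5x5 -> 64 maps, 3x3 -> 128 maps
layers = [(7, 3, 32), (5, 32, 64), (3, 64, 128)]
for k, i, o in layers:
    print(f"{k}x{k} conv, {i}->{o} maps: {conv_params(k, i, o)} parameters")
print("total:", sum(conv_params(k, i, o) for k, i, o in layers))
```

Even this small stack carries on the order of 10^5 parameters, which illustrates why a limited training set can overfit.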
A study conducted by [18] implemented CNN to recognize 11 categories of self-collected Malaysian foods. The VGG19-CNN architecture was modified by adding more layers, comprising 21 convolutional layers and three fully connected layers, compared to the 16 convolutional layers in VGG19. However, the performance results were not reported. Islam et al. [19] evaluated their proposed CNN configuration and the Inception V3 model on the Food-11 dataset for their food recognition module. The images were reshaped into 224 by 224 by 3 dimensions, and ZCA whitening was applied to eliminate unnecessary noise within the images. The accuracy reported for the proposed CNN configuration and the pre-trained Inception V3 model was 74.7% and 92.86%, respectively.
The hyper-parameter configurations in conventional CNNs are complicated and time-consuming. Jeny et al. [20] proposed another method for managing the massive number of layers by implementing a FoNet-based Deep Residual Neural Network and testing it on six categories of Bangladeshi foods. The model comprises 47 layers, including pooling layers, activation functions, flattening layers, dropout, and normalization. The reported accuracy of 98.16% on their testing dataset outperformed the Inception V3 and MobileNet models, which reported accuracies of 95.8% and 94.5%, respectively.
In summary, previous research has demonstrated that CNN and transfer learning-based techniques are effective at food image recognition. However, there is a lack of analysis and evaluation of recent CNN architecture models, particularly in terms of feature extraction. Furthermore, CNNs have hyperparameters that must be tuned to the newly created dataset. Table 1 summarizes the related works on CNN models.

A Transfer Learning Approach Using Pre-Trained Deep Learning Architecture
This section discusses the theoretical background of the approaches that have been considered for feature extraction: ResNet50, VGG16, MobileNet, Xception, Inception, and EfficientNet. Additionally, the RGB components of an image are used as a color-based feature representation.

ResNet50
The ResNet50 approach was introduced at the 2015 ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [21]. This model is a residual learning framework that can alleviate the vanishing gradient problem of Deep Convolutional Neural Networks when training deeper networks. The ResNet50 model was pre-trained on over a million high-resolution images from the ImageNet database. Zahisham et al. [22] proposed a ResNet50-based Deep Convolutional Neural Network (DCNN) for the food recognition task. The ResNet50 model architecture is imitated, pre-trained weights are imported, and the classification layers are trained on three different food datasets (UECFOOD100, ETHZ-FOOD101, and UECFOOD256). The Rank-1 accuracy achieved by the proposed DCNN-ResNet50 model was 39.75%, 41.08%, and 35.32% on the UECFOOD100, ETHZ-FOOD101, and UECFOOD256 datasets, respectively. This proposed model outperformed the accuracy of CNN-3CV (25%, 24.3%, and 22%), CNN + Support Vector Machine (SVM) (33.1%, 31.9%, and 30%), and CNN-5CV (20%, 17.9%, and 15.5%).

VGG16
The VGG-16 approach was introduced by Simonyan and Zisserman [23] at ILSVRC 2014 and was developed by the University of Oxford's Visual Geometry Group. This model is widely used in image classification tasks, as it can outperform AlexNet-based models. VGG-16 is trained on the ImageNet dataset, with over fifteen million high-resolution images across 22,000 image classes. A comparison of CNN tolerance to intraclass variety in food recognition was conducted by [24]. The feature extraction process was carried out using a variety of pre-trained CNN models, including ResNet, VGG16, VGG19, MobileNet, and InceptionV3, and the Food101 dataset was used to evaluate their performance. It was reported that InceptionV3 obtained the highest Top-1 accuracy of 87.16%, followed by VGG16 with a Top-1 accuracy of 84.48%, when using 70% of the data as the training set and 30% as the testing set.

MobileNet
Howard et al. [25] proposed MobileNet, a low-latency, low-computation model for on-device and embedded applications. Its architecture is based on depthwise separable convolutions, which significantly reduce computation and model size while maintaining classification performance similar to that of large-scale models, such as Inception. The ImageNet database was used in their experiment, and it was reported that MobileNet achieved an accuracy of 70.6%, which is comparable to GoogLeNet (69.8%) and VGG-16 (71.5%), while requiring only a fraction of the computational resources of GoogLeNet and VGG-16. Additionally, on the Stanford Dogs dataset, the MobileNet model achieved an accuracy of 83.3% for fine-grained recognition, nearly identical to the 84% accuracy of a large-scale Inception model, at a fraction of the computation and with a roughly twentyfold reduction in parameter count. Following that, the paper in [7] implemented FD-MobileNet-TF-YOLO as an embedded food recognizer. FD-MobileNet was used as a food categorizer, while TF-YOLO was used as an ingredient locator and classifier. The FD-MobileNet approach achieved higher downsampling efficiency by performing 32x downsampling within 12 layers on a 224 by 224 image, resulting in reduced computational complexity and cost. The TF-YOLO approach identified smaller objects in images, using the YOLOv3-tiny procedure based on the K-means technique. The recognition accuracy of FD-MobileNet was 94.67%, which is higher than MobileNet's recognition accuracy of 92.83%.
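The parameter savings behind the depthwise separable factorization can be illustrated with a short calculation. This is a sketch of the standard factorization; the layer sizes are illustrative and not taken from [25]:

```python
def standard_conv_params(k, in_ch, out_ch):
    # One dense k x k convolution mixing all input and output channels.
    return k * k * in_ch * out_ch

def depthwise_separable_params(k, in_ch, out_ch):
    # A k x k depthwise convolution applied per input channel,
    # followed by a 1 x 1 pointwise convolution across channels.
    return k * k * in_ch + in_ch * out_ch

k, cin, cout = 3, 128, 256  # illustrative layer configuration
dense = standard_conv_params(k, cin, cout)
separable = depthwise_separable_params(k, cin, cout)
print(f"standard: {dense}, separable: {separable}, saving: {dense / separable:.1f}x")
```

For a 3 by 3 kernel, the factorization cuts the parameter count by roughly a factor of eight to nine, which is the source of MobileNet's efficiency.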

Xception
Chollet [26] introduced the Xception model, a modified depthwise separable convolution model based on the Inception model. The Xception model outperforms the Inception model because it reduces the number of model parameters and makes more efficient use of them, allowing richer representations to be learned with fewer parameters. On the ImageNet dataset, the Xception model achieved the highest Rank-1 accuracy of 79%, followed by Inception V3 at 78.2%, ResNet-152 at 77%, and VGG-16 at 71.5%. Additionally, the Xception model outperforms Inception V3 in terms of mean average precision (mAP) when evaluated on the FastEval14k dataset, which contains 14,000 images classified into 6000 classes. In another report, Yao et al. [27] conducted a study on the classification of peach disease using the traditional Xception model and a proposed improved Xception model based on ensembles of L2-norm and mean regularization terms. An experiment was conducted using a peach disease image dataset comprising seven different disease categories and seven commonly used deep-learning models. It was reported that the validation accuracies for Xception and the improved Xception were 92.23% and 93.85%, respectively.

Inception
Inception is a deep neural network architecture for computer vision, introduced by [28] at ILSVRC 2014. The Inception architecture uses a sparse convolutional network structure with one-by-one convolutions to reduce dimensionality. GoogLeNet is a deep-learning model that uses the Inception architecture; it comprises nine Inception modules, employing 22 layers in total and five pooling layers. Singla et al. [29] demonstrated the feasibility of the Inception network, GoogLeNet, for food category recognition. They reported that the food identification module achieved an accuracy of 83.6% when tested on the Food-11 dataset.

EfficientNet
Tan and Le [30] proposed the EfficientNet (EFFNet) model, which utilizes a simple yet effective compound coefficient to scale up CNNs structurally. In comparison to conventional neural network approaches, EFFNet uses a fixed set of scaling coefficients to scale the depth, width, and resolution dimensions uniformly. The EFFNet baseline network was built with the AutoML Mobile Neural Architecture Search (MNAS) framework to optimize accuracy and efficiency, while the remaining architecture was built with mobile inverted bottleneck convolution (MBConv) blocks. The performance of EFFNet on ImageNet was compared to that of conventional CNNs, and the findings show that EFFNet models outperform conventional CNN models in both accuracy and efficiency. For instance, the EfficientNet-B0 model achieved Rank-1 and Rank-5 accuracies of 77.1% and 93.3%, higher than ResNet-50's Rank-1 (76%) and Rank-5 (93.0%) accuracies. Liu et al. [31] implemented a transfer learning-based EFFNet model to recognize and classify maize leaf disease images. For their experiments, a leaf dataset containing 9279 images classified into eight disease categories was divided into a 7:3 training-to-testing ratio. The reported recognition accuracy of their proposed model (98.52%) outperformed VGG-16 (93.9%), Inception V3 (96.35%), and ResNet-50 (96.76%).

Food Dataset Preparation
The Sabah Food Dataset is a newly created food dataset that was used in this study. The images in the Sabah Food Dataset were gathered via Google image search and include a range of image resolutions and compression formats. A total of 1926 food images were collected for the Sabah Food Dataset, which covers 11 famous food categories. The details for each food category of the Sabah Food Dataset are presented in Table 2. The purpose of this dataset is to train a machine-learning classifier for developing a Sabah food recognition model.
The VIREO-Food172 Dataset [32] samples, as shown in Figure 2, are popular Chinese dishes retrieved from Google and Baidu image searches. Based on the recipes, the images were labeled with category names as well as over 300 ingredients. The dataset comprises 172 food categories from eight major groups: (i) soup, (ii) vegetables, (iii) bean products, (iv) egg, (v) meat, (vi) fish, (vii) seafood, and (viii) staple. However, only ten categories (categories 1 to 10) were used in this experiment. The details for each food category of the VIREO-Food172 Dataset are presented in Table 3; among these categories are crispy sweet and sour pork slices (991 images), steamed pork with rice powder (803), pork with salted vegetable (997), shredded pork with pepper (708), Yu-Shiang shredded pork (1010), eggs, black fungus, and sautéed sliced pork (830), braised spare ribs in brown sauce (712), and fried sweet and sour tenderloin (954). For performance evaluation, a total of 9015 food images were selected from these ten categories. The test is more challenging due to the low interclass differences among the ten categories, most of which are pork-based; this serves to further validate the system's capability for accurate classification.
As for data training and testing preparation, 80% and 20% of each dataset (Sabah Food Dataset and VIREO-Food172 Dataset) are used as the training and testing sets, respectively. For the Sabah Food Dataset, the training and testing samples are selected randomly, using the Python random sampling function, and the training and testing images do not overlap. The datasets are available upon request from the corresponding author for reproducibility purposes. For the VIREO-Food172 Dataset, the 80% (training) and 20% (testing) split was provided by the original source of the database.
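The 80/20 random split described above can be reproduced with Python's standard random module. This is a minimal sketch; the file-naming scheme and the fixed seed are assumptions for illustration, not details taken from the paper:

```python
import random

def split_dataset(image_paths, train_ratio=0.8, seed=42):
    """Randomly split image paths into disjoint training and testing sets."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    paths = list(image_paths)
    train = rng.sample(paths, int(len(paths) * train_ratio))
    held_out = set(train)
    test = [p for p in paths if p not in held_out]
    return train, test

# Example with hypothetical file names for one food category
images = [f"sabah_food/category_01/img_{i:04d}.jpg" for i in range(100)]
train_set, test_set = split_dataset(images)
print(len(train_set), len(test_set))  # 80 20
```

Because `random.sample` draws without replacement and the test set is the complement, the two sets are guaranteed not to share any image.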

Feature Representations and Classifiers
In order to conduct a more thorough evaluation, the efficiency of the feature representations based on the transfer-learning approaches described in Section 3 is compared. The six pre-trained CNN models selected as feature extractors are (i) ResNet50, (ii) VGG16, (iii) MobileNet, (iv) Xception, (v) Inception, and (vi) EfficientNet. The proposed approach in this paper is labeled as "Feature Representation + Classifier". For instance, an approach labeled ResNet50 + SVM (OVO) implies the use of ResNet50 as the feature representation and SVM (OVO) as the classifier. Table 4 shows the CNN feature extractors' configuration details. The following are the definitions of the parameters shown in Table 4:
1. The Model denotes the convolutional base of an existing pre-trained CNN model used as a feature extractor.
2. The No.of.param denotes the total number of model parameters from the input layer to the final convolutional layer.
3. The Input Shape (x, y, z) denotes input image data with a three-dimensional shape, where x is the image height, y is the image width, and z is the image depth.
4. The Output Shape (x, y, z) denotes the shape of the data produced by the last convolutional layer, with x, y, and z defined as above.
5. The Vector size denotes the output shape flattened into a one-dimensional linear vector.
The images are resized to fit the fixed input shape of each pre-trained CNN model. Pre-trained CNN models contain numerous hyperparameters, and as shown in the second column of Table 4, EFFNet and VGG16 generate the most and the fewest parameters, respectively. The Output Shape (Conv2D) of the final CNN layer, which serves as the feature representation, is manually reshaped into a one-dimensional vector (the Vector Size, Conv1D) before being fed into a machine-learning classifier. The Conv2D features capture the spatial information necessary for detecting edges and colors, and are fed into the sequential model for classification.
The summary of the CNN architecture used in the data training phase is shown in Table 5. The following are the definitions of the parameters shown in Table 5:
1. The Layer denotes the layer name.
2. The Type denotes the type of layer.
3. The Output denotes the feature maps generated by the layer.
4. The No.of.param denotes the number of parameters of a layer.
The max_pooling2d_1 and max_pooling2d_2 denote max-pooling layers 1 and 2, and dense_1, dense_2, and dense_3 denote dense layers 1, 2, and 3. The CNN classifier shown in Table 5 is a network comprising three groups of layers: two convolutional-pooling layers and one fully connected layer. The input is based on one of two alternatives: (i) the output shape of the features generated by the pre-trained CNN model, referred from Table 4, or (ii) the color features of a two-dimensional, reshaped 64 by 64 image, where the color features are composed of an image's RGB components. The first convolutional-pooling layer uses 3 by 3 kernels to extract 32 feature maps, followed by a max-pooling layer over 2 by 2 regions. The fully connected layer has 512 rectified linear unit neurons, followed by 11 or 10 Softmax neurons corresponding to the 11 Sabah Food Dataset categories and the 10 VIREO-Food172 Dataset categories, respectively. In this paper, the Keras deep-learning packages are used to train the CNN model [2,33].
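Shape-wise, the fully connected head described above (512 ReLU neurons followed by a Softmax layer) behaves as in this schematic numpy forward pass. The random weights and the 1024-dimensional input are illustrative only; this is not the trained Keras model:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden, n_classes = 1024, 512, 11  # Sabah Food Dataset setting

W1 = rng.standard_normal((n_features, n_hidden)) * 0.01
b1 = np.zeros(n_hidden)
W2 = rng.standard_normal((n_hidden, n_classes)) * 0.01
b2 = np.zeros(n_classes)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # numerically stable softmax
    return e / e.sum(axis=-1, keepdims=True)

x = rng.standard_normal((1, n_features))   # one flattened feature vector
h = np.maximum(0.0, x @ W1 + b1)           # 512 ReLU neurons
probs = softmax(h @ W2 + b2)               # 11 Softmax outputs summing to ~1
print(probs.shape)
```

Swapping `n_classes` to 10 gives the VIREO-Food172 configuration; the rest of the head is unchanged.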
On the other hand, the Conv1D features are represented as a vector and fed to ten machine-learning classifiers, including (i) non-linear SVM (OVO) and (ii) non-linear SVM (OVA), among others. For the LinearSVC-based classifiers, the random_state parameter controls the pseudo-random number generation for shuffling the data in dual coordinate descent (only when dual = True; when dual = False, the underlying implementation of LinearSVC is deterministic and random_state has no effect on the results), and max_iter = 1000 sets the maximum number of iterations to be run. For the CNN classifier, the activation parameter of the Conv2D class accepts a string specifying the name of the activation function to apply after performing the convolution.
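As an illustration of the OVO/OVA distinction on extracted feature vectors, scikit-learn's SVC and LinearSVC can be configured as follows. This is a sketch on synthetic stand-in data; apart from max_iter = 1000 and random_state, the hyperparameters are assumptions, not the paper's settings:

```python
import numpy as np
from sklearn.svm import SVC, LinearSVC

rng = np.random.default_rng(0)
# Synthetic stand-in for flattened CNN features: 3 classes, 60 samples, 20 dims
X = np.concatenate([rng.normal(loc=c * 3.0, size=(20, 20)) for c in range(3)])
y = np.repeat([0, 1, 2], 20)

# Non-linear SVM with one-vs-one decision function
svm_ovo = SVC(kernel="rbf", decision_function_shape="ovo").fit(X, y)
# Linear SVM; multiclass handling is one-vs-all (one-vs-rest)
lsvm_ova = LinearSVC(max_iter=1000, random_state=0).fit(X, y)

print(svm_ovo.score(X, y), lsvm_ova.score(X, y))
```

In practice the `X` matrix would be the flattened Conv1D feature vectors from Table 4 rather than synthetic data.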

Performance Metrics
The accuracy metric is used as the performance metric to measure the model's overall performance on the testing set. Suppose that CM is a confusion matrix of n by n dimensions, where n is the total number of different food categories; the rows of CM indicate the actual categories, while the columns of CM indicate the predicted categories. Finally, let C_{i,j} denote the CM cell's value at row i and column j, where i, j = 1, 2, . . . , n. The accuracy metric is defined as in (1):

Accuracy = (Σ_{i=1}^{n} C_{i,i}) / (Σ_{i=1}^{n} Σ_{j=1}^{n} C_{i,j}) (1)
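Equation (1) amounts to dividing the diagonal of the confusion matrix by its grand total. A short sketch with an illustrative 3-category matrix (the numbers are made up for demonstration):

```python
import numpy as np

def accuracy_from_cm(cm):
    """Accuracy per Equation (1): trace of CM over the sum of all cells."""
    cm = np.asarray(cm)
    return np.trace(cm) / cm.sum()

# Illustrative 3-category confusion matrix (rows: actual, columns: predicted)
cm = [[50, 3, 2],
      [4, 45, 1],
      [2, 5, 48]]
print(accuracy_from_cm(cm))  # 143/160 = 0.89375
```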

Results and Discussions
This section is divided into three main sections. Section 5.1 describes the experimental results of the trained model on the Sabah Food Dataset and the VIREO-Food172 Dataset, followed by Section 5.2, which compares feature dimensions when using CNN as the classifier. Finally, Section 5.3 demonstrates the deployment of the food recognition model through a prototype web application.

Experiments Results
Figures 3 and 4 show the classification accuracy of six CNN-based features derived from the transfer-learning process and one color feature over seven different traditional machine-learning classifiers and one CNN-based classifier, tested on the Sabah Food Dataset and the VIREO-Food172 Dataset, respectively. As seen in Figures 3 and 4, this paper evaluates a total of 56 combinations of machine-learning approaches.
From Table 16, it can be seen that the EFFNet + CNN approach gives the best performance, yielding 0.9401 accuracy on the Sabah Food Dataset. This is followed by Xception + SVM (OVO) (0.8632) and Xception + CNN (0.8620). Additionally, as shown in Table 16, performance decreases significantly from EFFNet + CNN to Xception + SVM (OVO) (the accuracy drops by 0.0769) before gradually decreasing from Xception + SVM (OVO) through the rest of the top 10 highest performing approaches (with differences ranging from 0.0012 to 0.0377). The results suggest that EFFNet + CNN may only work well on a specific training and testing split of the Sabah Food Dataset rather than representing the overall best approach. Nevertheless, EFFNet + CNN is the best performing approach on the Sabah Food Dataset.
On the other hand, for the VIREO-Food172 Dataset, it is observed that EFFNet + SVM (OVO) provides the best performance (0.8657), as shown in Table 17. However, compared to the top ten performing machine-learning approaches on the Sabah Food Dataset, the differences between each machine-learning approach on the VIREO-Food172 Dataset are more stable (with differences ranging from 0.0007 to 0.0269). In contrast to the best performing approach on the Sabah Food Dataset (Table 16), there is no significant drop in accuracy from the highest to the second-highest accuracy on the VIREO-Food172 Dataset. Additionally, both the Sabah Food Dataset and the VIREO-Food172 Dataset demonstrate that EFFNet provides the best performance when used as a feature representation.
As previously stated, there are seven different feature representations. Therefore, Tables 18 and 19 present seven machine-learning approaches for the Sabah Food Dataset and the VIREO-Food172 Dataset, with the best one selected from each group of feature representations and ranked from best to worst accuracy. In Tables 18 and 19, the bold-formatted machine-learning approaches and accuracies denote the best machine-learning approaches in each table. Tables 18 and 19 are similar in that EFFNet is the best feature representation, followed by Xception, Inception V3, and VGG16. Further examination of Table 18 reveals that the accuracy falls precipitously between Color + CNN (0.7422) and ResNet50 + LSVM (OVA) (0.5236), yielding a 0.2186 difference. On the other hand, Table 19 reveals a gradual decline in accuracy within the first four machine-learning approaches before a significant decrease from VGG16 + LSVM (OVO) (0.7725) to MobileNet + LSVM (OVO) (0.6332), yielding a 0.1393 difference. This drop in accuracy is significant because it indicates which machine-learning approaches should be considered for future work or subsequent experiments if accuracy is the most important factor in food recognition model development. Comparing the similarities between Tables 18 and 19, it is seen that EFFNet, Xception, Inception V3, and VGG16 provide more stable performance, with the EFFNet feature representation being the best. As a result, an ensemble-based approach based on these four feature representation methods can be considered for future work.
Additionally, Tables 20 and 21 present ten machine-learning approaches for the Sabah Food Dataset and the VIREO-Food172 Dataset, where the best one was selected from each classifier group and ranked from best to worst accuracy. In Tables 20 and 21, the bold-formatted machine-learning approaches and accuracies represent the best machine-learning approaches in each table. Subjecting Tables 20 and 21 to a similar analysis, it can be seen that the EFFNet-based feature representation appears most frequently, with Table 20 showing four occurrences of EFFNet.
In a subsequent analysis of the Sabah Food Dataset and the VIREO-Food172 Dataset, the overall performance of each feature representation is compared in Table 22. The value in the second row and second column of Table 22 (EFFNet) is produced by averaging the accuracies of all machine-learning approaches that use EFFNet as the feature representation on the Sabah Food Dataset. This calculation is repeated for all feature representations and both datasets to fill in the second and third columns of Table 22. The fourth column of Table 22 is filled with the Overall Score defined in (2), which is calculated by averaging the Mean Accuracy on the Sabah Food Dataset and the Mean Accuracy on the VIREO-Food172 Dataset from the second and third columns of Table 22. Equation (2) is applied to all of the feature representations listed in Table 22 to complete the fourth column.
Overall Score = (MASFD + MAVFD) / 2, (2)

where MASFD = Mean Accuracy on the Sabah Food Dataset, and MAVFD = Mean Accuracy on the VIREO-Food172 Dataset.
The Overall Score in (2) indicates the performance of a feature representation on both proposed datasets and is used to facilitate the comparison of all feature representations. The bold-formatted Feature Representation and Overall Score in Table 22 represent the best feature representation. From Table 22, it can be seen that EFFNet has the best overall performance, followed by Xception, Inception V3, and VGG16, before the Overall Score drops significantly for MobileNet, ResNet50, and Color. Therefore, a combination of the EFFNet, Xception, Inception V3, and VGG16 approaches can be considered as components of an ensemble-based approach. Table 23 shows the overall performance of each classifier. The Overall Score in the fourth column of Table 23 is likewise calculated with (2), by averaging the Mean Accuracy on the Sabah Food Dataset and the Mean Accuracy on the VIREO-Food172 Dataset from the second and third columns of Table 23. Similar to the analysis conducted for Table 22, this Overall Score is used to facilitate the comparison of all classifiers, and the bold-formatted Classifier and Overall Score in Table 23 represent the best classifier. From Table 23, it can be seen that the LSVM (OVO) classifier gives the best overall performance (0.6704), followed by LSVM (OVA) (0.6465), SVM (OVO) (0.6219), and CNN (0.5993). After the CNN classifier, there is a significant drop in Overall Score from CNN to kNN, yielding a 0.0914 difference. As a result, if one is considering a classifier, LSVM (OVO) and LSVM (OVA) are the best options. Additionally, for future work, LSVM (OVO), LSVM (OVA), and SVM (OVO) can be considered as components of an ensemble-based approach.
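The Overall Score of Equation (2) is a plain average of the two per-dataset mean accuracies. A minimal sketch; the example operand values below are illustrative, since Tables 22 and 23 report only the averaged scores:

```python
def overall_score(mean_acc_sabah, mean_acc_vireo):
    """Overall Score per Equation (2): average of the two mean accuracies."""
    return (mean_acc_sabah + mean_acc_vireo) / 2

# Illustrative per-dataset mean accuracies (not taken from Table 23)
print(round(overall_score(0.68, 0.66), 4))  # 0.67
```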
Finally, Table 24 compares the accuracy of our work to that of the other methods in Table 1 and to the food recognition accuracy reported in [32]. A direct comparison between our model and these models is not possible, due to the differences in the training and testing conditions. Nonetheless, our best performance of 94.01% is comparable to that of [20], which reports 98.16% accuracy. Additionally, our EFFNet + CNN model outperformed the CNN and Inception V3 + CNN models in terms of overall accuracy.

A Comparison of Feature Dimension Using CNN as the Classifier
In this work, pre-processing and feature extraction are performed using a transfer-learning strategy based on a pre-trained CNN. As a pre-trained CNN is built up from several layers, there is an option to use all of the layers or to select only a few layers in order to extract the relevant features. The relevance of the features is determined by the type and volume of the dataset used to train the CNN; for instance, the pre-trained CNN models are trained on the ImageNet dataset, one of the benchmark datasets in Computer Vision. However, the types and volumes of data used to train pre-trained CNN models vary, and the effectiveness of the transfer-learning strategy depends on the degree to which the trained models are related to the applied problem domain. Hence, the experiments conducted in this work reveal, through classification performance, the compatibility between each pre-trained CNN feature extractor and the food recognition domain, especially the local food dataset.
The selection of layers in the pre-trained CNN model determines not only the relevance of the features but also their feature dimension. The size of the generated features determines the efficiency of running the algorithm: a large number of features entails additional computational effort but likely results in more discriminative features. As shown in Figure 7, the size of the generated features varies according to the final layer or the layer selection in the CNN architecture. It can be seen that EFFNet generates the largest feature dimension (62,720), followed by Inception V3 (49,152).

Table 25 compares the feature dimension and the Overall Score of each feature representation. The bold formatted Feature Representation and Overall Score in Table 25 represent the best Feature Representation. While EFFNet as a feature representation outperforms Xception in terms of overall performance, Table 25 shows that if computational speed is more important than performance, Xception can be considered as the feature representation at the cost of some accuracy. The results in Table 25 also indicate that the data used in EFFNet training potentially contain the most relevant and consistent data for extracting meaningful features from food images when compared to the other CNN models.
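The link between a feature map's shape and its flattened feature dimension is simple arithmetic, and the sketch below reproduces the EFFNet numbers (a 16 x 16 x 245 map flattens to the 62,720 values reported in Figure 7); the helper name is ours, and the shapes of the other extractors are not restated here.

```python
from math import prod

# A convolutional feature map of shape (length, width, depth) flattens to
# length * width * depth values, which is the feature dimension that the
# downstream classifier receives.
def flat_dim(shape: tuple) -> int:
    return prod(shape)

# EFFNet's Conv2D feature shape from Figure 8: (16, 16, 245).
print(flat_dim((16, 16, 245)))  # -> 62720, matching Figure 7
```

The same arithmetic explains the speed/accuracy trade-off in Table 25: Xception's 2048-dimensional output is roughly 30 times smaller than EFFNet's, so any classifier trained on it processes far fewer values per image.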
Additionally, Figure 8 presents the length and width of the features of each pre-trained CNN model for training with a CNN classifier. Each bar in Figure 8 has a label that represents the length and width values; in this case, the length and width are equal. As seen in Figure 8, the Conv2D features generated by EFFNet are minimal (16, 16, 245) compared to the Conv1D features. Despite the high depth of the feature dimension (245), the experiment revealed no noticeable effect on time efficiency during the training phase. Based on this finding, the model trained with EFFNet features is the best model, as it achieves the highest overall accuracy and generates highly distinctive yet compact features. In this context, the depth (z) of the feature representation determines the efficacy of the classification performance, as more insight into spatial information can be generated. Furthermore, the depth of the features (z) is likely to have less effect on the overall classification efficiency than the values along the x and y axes of the features. As depicted in Figure 8, the MobileNet- and Inception V3-based feature representations produce the highest values of x and y but cost more in terms of execution time than the ResNet50-, VGG16-, Xception-, and EFFNet-based feature representations. However, in addition to the compatibility of the pre-trained CNN models with the newly developed classification model, the shape of the feature representations is another factor that must be taken into account in the experiment settings.

Food Recognition Model Deployment
As described previously, a web application system is deployed with the best recognition model (EFFNet-LSVM). The trained model is prepared as a NumPy data structure file using the Joblib library, while the back-end algorithm for food recognition is integrated with HTML using the Flask framework. Figure 9 shows the main homepage of the preliminary prototype web application. Two modules are developed: a food recognition module and a customer feedback module, as shown in Figures 10-12.

As shown in Figure 10, the user must upload a JPG image of the food and click the Recognize food button to invoke the back-end of the food recognition algorithm. The food's name will then appear beneath the image. Finally, another feature included in this system is the ability to collect user feedback on foods via a form, as shown in Figure 11. The administrator can then view all of the customer feedback, as shown in Figure 12.
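A minimal sketch of this deployment pattern is given below, with hypothetical file and route names. The real system renders HTML templates and loads the trained model once at start-up via joblib.load; a stub predictor stands in for the feature extraction and classification step here.

```python
import io
from flask import Flask, request, jsonify

app = Flask(__name__)

# In the deployed system the trained model is loaded once at start-up, e.g.:
#   model = joblib.load("effnet_lsvm.joblib")   # hypothetical file name
# The stub below stands in for CNN feature extraction + classification.
def predict_food(image_bytes: bytes) -> str:
    return "Hinava"  # hypothetical Sabah food label

@app.route("/recognize", methods=["POST"])
def recognize():
    # The front-end form uploads a JPG image under the field name "food_image".
    image_bytes = request.files["food_image"].read()
    return jsonify({"food_name": predict_food(image_bytes)})
```

Posting a JPG file to the /recognize endpoint returns a JSON body containing the predicted food name, which the front-end can display beneath the uploaded image, mirroring the flow shown in Figure 10.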
To summarize, the prototype web application is designed to accomplish three purposes: (i) to provide a food recognition feature for users who are unfamiliar with the food's name, (ii) to enable users to share their food-related experiences via a feedback feature, and (iii) to enable the administrator of this web application system to collect image and feedback data for use in food sentiment analyses and food business analytics. Furthermore, the user's new images can be added to the current food dataset to update the training database, which in turn updates the training model.

Conclusions
This paper compared the performance of 70 combinations of food recognition approaches, which consist of six different pre-trained CNN-based models used as feature extractors, one feature representation based on the RGB component of an image, and ten commonly used machine-learning classifiers. Additionally, two datasets were used for performance evaluation: (i) the Sabah Food Dataset and (ii) the VIREO-Food172 Dataset.

From the comparison, on the Sabah Food Dataset, it was found that the EFFNet + CNN (94.01% accuracy) approach gives the best performance, followed by Xception + SVM (OVO) (86.32% accuracy). However, the significant drop in accuracy from 94.01% to 86.32% suggests that EFFNet + CNN may be an outlier that only works well on a specific training and testing split of the Sabah Food Dataset, rather than representing the overall best approach. On the VIREO-Food172 Dataset, it was found that EFFNet + SVM (OVO) (86.57% accuracy) provides the best performance, followed by EFFNet + LSVM (OVO) (85.60% accuracy). In comparison to the Sabah Food Dataset, the difference between the best and second-best performing approaches on the VIREO-Food172 Dataset is insignificant (0.97% difference). It should be noted that the best performing feature representation for both the Sabah Food Dataset and the VIREO-Food172 Dataset is the EFFNet-based feature representation. This is supported by the paper's discussion of the Overall Score of feature representations, which demonstrates that EFFNet has the highest Overall Score. A similar comparison was made for the classifiers, and it was found that the LSVM (OVO) classifier gives the best overall performance for food recognition, followed by LSVM (OVA).
In terms of computational complexity and memory space usage, while EFFNet (with a feature dimension of 62,720) as a feature representation outperformed Xception in overall performance, Xception (with a feature dimension of 2048) can be considered when computational speed and memory space usage are more important than performance, at the expense of a small reduction in accuracy. As an implication of this work, this paper also presented a food recognition model for indigenous foods in Sabah, Malaysia, utilizing a pre-trained CNN model as the feature representation together with a classifier. The classification accuracy (94.01%) achieved by EFFNet + CNN in the performance evaluation on the Sabah Food Dataset is very promising for real-time use. As a result, a prototype web-based application for the Sabah food business analytics system was developed and implemented using the EFFNet + CNN approach for fully automated food recognition on real-time food images.

Future Work
For future work, this research should conduct more experiments to obtain a more rigorous analysis of the CNN hyper-parameters and the CNN layers to achieve more solid and concrete findings. The types and number of implemented CNN layers and the feature shape can be further analyzed. Additionally, the feature selection algorithm can be studied further to reduce the dimensionality of the features, as this has a significant effect on the computational time. Furthermore, the criteria for selecting the training database for a food recognition system can be explored further. It was found in [34] that using the database's mean class in the training database can potentially improve the system's performance. Finally, to further improve the accuracy, a study on an ensemble-based approach, using a combination of EFFNet, Xception, Inception V3, VGG16, LSVM, and CNN, can be considered. Another interesting area to consider is food sentiment analysis. The user feedback data can be incorporated into a food sentiment analysis module, with the aim that it will assist business owners in remaining informed about the market acceptance of their food products. The customer feedback data can be analyzed further to improve the quality and innovation of indigenous foods, allowing them to be more commercialized and ultimately contribute to Sabah's gastronomic tourism industry. Finally, another area that can be investigated is the food business prediction module, which allows for the analysis of food market trends and provides additional data to industry practitioners in order to strategize their food business direction.
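The ensemble direction suggested above could be prototyped, for example, with scikit-learn's soft-voting ensemble. In the sketch below the fused CNN features and labels are random stand-ins, and the member estimators merely mirror the LSVM/SVM classifiers from the comparison rather than the exact future design.

```python
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 32))    # stand-in for concatenated EFFNet/Xception/... features
y = rng.integers(0, 3, size=60)  # stand-in for food category labels

# Soft voting averages the per-class probabilities of each member classifier.
ensemble = VotingClassifier(
    estimators=[
        ("lsvm", SVC(kernel="linear", probability=True)),  # SVC is one-vs-one by default
        ("rbf_svm", SVC(kernel="rbf", probability=True)),
        ("logreg", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",
)
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))  # five predicted food-category indices
```

Averaging probabilities lets a strong representation such as EFFNet dominate where it is confident while weaker members fill in on ambiguous food images, which is the motivation for the ensemble study proposed here.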

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.

Conflicts of Interest:
The authors declare that they have no conflict of interest.