Article

An Image Recognition Method for the Foods of Northern Shaanxi Based on an Improved ResNet Network

School of Mathematics and Statistics, Yulin University, Yulin 719000, China
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(16), 2572; https://doi.org/10.3390/math13162572
Submission received: 2 July 2025 / Revised: 3 August 2025 / Accepted: 4 August 2025 / Published: 12 August 2025

Abstract

With the development of artificial intelligence technology, food image recognition has become an important research direction in the field of computer vision. The region of Northern Shaanxi is famous for its rich food culture. This paper proposes a food image recognition method based on an improved ResNet network to raise the recognition rate for the characteristic foods of Northern Shaanxi. Firstly, the principles and structure of basic convolutional neural networks (CNNs) are introduced, with a focus on the application and optimization of CNNs in food image recognition. This mainly includes AC blocks that fuse asymmetric convolutions, attention modules that improve food image recognition performance, and a residual structure design that enhances learning effectiveness. Secondly, the FoodResNet18 model is constructed with a specially designed enhancement block and a deep-shallow shared attention residual module to strengthen the model's feature extraction ability and perception of visual information. To improve the generalization ability of the model, this paper comprehensively preprocesses the self-built Northern Shaanxi Food-300 dataset, covering the data sources, processing methods, and data augmentation strategies used for training. Model training and comparative analysis show that the food image recognition method based on the improved ResNet outperforms traditional CNN models in multiple experiments. An ablation experiment analyzes the specific contribution of each designed module to the final recognition performance and verifies the advantages of the deep-shallow shared attention residual module in feature extraction and preservation.

1. Introduction

As a pivotal cradle of Chinese culinary heritage, Northern Shaanxi’s distinctive geographical environment and time-honored artisanal traditions have cultivated unique regional specialties, including buckwheat noodles, lamb noodles, yellow steamed buns, and potato pancakes. The rapid advancement of digital technologies has brought to the forefront the critical challenge of preserving these intangible cultural heritage items through intelligent recognition systems based on computer vision [1]. Food image recognition, emerging as an interdisciplinary domain bridging computer vision and food science, demonstrates significant potential across multiple applications—from intelligent catering solutions to cultural heritage digitization and regional economic revitalization [2].
Traditional food recognition methods mainly rely on the combination of manual feature extraction (such as HOG and SIFT) and machine learning classifiers. Bossard et al. [3] used random forests to mine local visual features of dishes, which achieved reasonable results on small-scale datasets but suffered from limited feature expression ability and insufficient generalization. With the development of deep learning, methods based on convolutional neural networks (CNNs) have gradually become mainstream. Yanai et al. [4] achieved an accuracy of 78.77% on the UEC-FOOD dataset using a DCNN pre-trained on ImageNet. The CNN model constructed by Islam et al. [5] achieved a recognition rate of 74.7% on the Food-11 dataset, but its performance decreased significantly when faced with Chinese food, which exhibits high inter-class similarity and large intra-class differences. Recent studies have attempted to enhance recognition performance through model improvements. Yunus et al. [6] compared several architectures and found that Inception V4 achieved 80% accuracy on the Food-101 dataset. The DenseFood model proposed by Metwalli et al. [7] improved accuracy to 81.23% on the VIREO Food-172 dataset by combining the Softmax loss with a center loss. The ChinaFood CNN designed by Liao et al. [8] uses a maximum inter-class loss function and improves recognition accuracy by 3.6% compared with ResNet. Sheng et al. [9] proposed an Efficient Hybrid Food Recognition Net (EHFR-Net). Li et al. [10] compared manual food logging with artificial-intelligence-enabled food image recognition in apps for nutrition care. Bianco et al. [11] tailored the nutritional composition of Italian foods to the US Nutrition5k dataset for food image recognition. Finally, Nfor et al. [12] proposed an explainable CNN and vision-transformer-based approach for real-time food recognition.
As a key research direction within computer vision, food recognition technology has become indispensable in many fields, including dietary nutrition monitoring, intelligent service delivery in restaurants, and nutrition analysis [13]. The research of Jagadesh et al. [14] highlights the potential of hybrid transformer architectures and advanced preprocessing techniques for food recognition systems, offering greater accuracy and efficiency for practical dietary monitoring and personalized nutrition recommendation. Xiong [15] proposed a food image recognition method based on ResNet, with extensive experiments demonstrating its effectiveness and providing new insights for automatic food recognition. Image-based automatic food recognition is an emerging topic whose purpose is to extract the features of a given food image and then predict its category. Liu [16] constructed a food recognition model based on EfficientNet and ResNet; extensive experiments on the Food-101 dataset show that the model can accurately identify various foods. Xiao et al. [17] proposed a deep convolution module for obtaining locally enhanced feature representations and combined it, through deep residuals, with the global feature representation obtained by a Swin Transformer to produce deeper enhanced feature representations; they also proposed an end-to-end fine-grained food classifier that extracts effective feature information from the enhanced representations more accurately and achieves precise recognition. Kim [18] proposed a deep-learning-based packaging recognition system that can easily and accurately determine food safety information from a single image captured by a smartphone camera; the detection algorithm was trained on 100 product images with an optimized YOLOv7 and reached an accuracy above 95%. Bu et al. [19] proposed a food image recognition method based on transfer learning and ensemble learning: general image features are first extracted by convolutional neural network models (VGG19, ResNet50, MobileNet V2, AlexNet) pre-trained on ImageNet; the four pre-trained models are then transferred to the food image dataset for fine-tuning; finally, different base-learner combination strategies are used to build an ensemble model that classifies the feature information. However, existing methods still have significant limitations in dealing with the unique morphological diversity of Northern Shaanxi food, such as irregular shapes and variable plating, as well as complex background interference.
This study proposes the following innovations to address the key technical challenges of food image recognition in Northern Shaanxi.
(1) Construct the FoodResNet18 network architecture and enhance multi-scale feature extraction by introducing asymmetric convolution (AC) blocks.
(2) Design a deep-shallow collaborative attention mechanism module to achieve the dynamic fusion of local details and global features.
(3) Adopt an adaptive step-size decay strategy to optimize the training process.
Experiments show that the proposed method achieves a Top-1 accuracy of 85.26% on the self-built Northern Shaanxi Food-300 dataset, 9.88 percentage points higher than the baseline ResNet-18.

2. Food Image Recognition Based on CNN

2.1. Basic Principles of CNN

The convolutional neural network (CNN), as one of the core algorithms in deep learning, has demonstrated excellent performance in fields such as image recognition and object detection. Its core idea is to reduce the number of network parameters and enhance feature expression ability through local perception, weight sharing, and hierarchical feature extraction. Figure 1 (created with Microsoft Visio Professional 2021) shows the typical CNN structure, which includes five parts: the input layer, convolutional layers, pooling layers, fully connected layers, and the output layer. These levels work together to map the original image to the classification result.

2.1.1. Hierarchical Structure and Function

The input layer receives raw image data (such as an RGB three-channel matrix) and performs normalization preprocessing. In the convolutional layer, convolution kernels (filters) slide over the input data to extract local features. The output can be expressed as follows:
$F(x) = \sum_{i=1}^{N} f(\mathrm{conv}(w_i x + b)),$
where $x$ is the input feature map, $N$ is the number of convolution kernels, $w_i$ and $b$ are the convolution kernels and bias values, and $f$ is the activation function. The convolution operation significantly reduces the number of parameters and preserves spatial information through local connections and weight sharing.
The activation function introduces nonlinear factors to enhance the model's expressive power. To avoid the vanishing-gradient problem, this study adopts the Rectified Linear Unit (ReLU), which is expressed as follows:
$f(x) = \max(0, x).$
The ReLU accelerates network convergence and alleviates overfitting by suppressing negative activation. The pooling layer reduces the dimensionality of feature maps through downsampling, enhancing the robustness of the model to geometric transformations such as translation and rotation. This study used Max Pooling to select the maximum value of a local region as the output, while preserving significant features and improving generalization ability.
The fully connected layer flattens the pooled multidimensional features into one-dimensional vectors and implements high-order feature combination and classification decisions through a multi-layer perceptron. The output layer uses the Softmax function to generate category probability distributions and complete the final classification.
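To make the layer roles above concrete, here is a minimal PyTorch sketch of the five-part pipeline (an illustrative toy network, not the paper's model; the 224 × 224 input size is an assumption):

```python
import torch
import torch.nn as nn

# Minimal illustrative CNN: conv -> ReLU -> max pool -> flatten -> FC -> Softmax
class TinyCNN(nn.Module):
    def __init__(self, num_classes: int = 15):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # local perception, shared weights
            nn.ReLU(inplace=True),                        # f(x) = max(0, x)
            nn.MaxPool2d(2),                              # downsampling, translation robustness
        )
        self.classifier = nn.Linear(16 * 112 * 112, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)                  # assumes 224x224 RGB input
        x = torch.flatten(x, 1)               # flatten to a 1-D vector per sample
        return self.classifier(x)             # logits

logits = TinyCNN()(torch.randn(1, 3, 224, 224))
probs = torch.softmax(logits, dim=1)          # category probability distribution
```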

2.1.2. Loss Function and Training

A CNN optimizes its parameters through the backpropagation algorithm, with the goal of minimizing the difference between predicted results and true labels. This study uses the cross-entropy loss function to measure the classification error:
$L = -\left[ y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right],$
where $y$ is the true label and $\hat{y}$ is the predicted probability. By iteratively updating the convolution kernels and bias values through gradient descent, the loss function converges to its minimum, achieving high-precision classification.
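A minimal training step matching this description might look as follows (reusing the TinyCNN sketch above; the learning rate and batch shapes are illustrative assumptions):

```python
import torch
import torch.nn as nn

model = TinyCNN()                                   # illustrative network from above
criterion = nn.CrossEntropyLoss()                   # cross-entropy classification loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

images = torch.randn(8, 3, 224, 224)                # dummy mini-batch
labels = torch.randint(0, 15, (8,))                 # dummy true labels

optimizer.zero_grad()
loss = criterion(model(images), labels)             # compare predictions with true labels
loss.backward()                                     # backpropagate gradients
optimizer.step()                                    # gradient-descent parameter update
```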

2.1.3. Core Advantages

(1) Local perception: convolution kernels only attend to local regions, reducing computational complexity.
(2) Weight sharing: the same convolution kernel traverses the entire input, reducing the number of parameters.
(3) Hierarchical feature extraction: shallow layers capture low-level features such as edges and textures, while deep layers extract semantic information.
(4) Translation invariance: pooling operations make the model insensitive to changes in target position.
Through the above mechanisms, CNNs achieve efficient feature learning and classification in image recognition tasks, laying a theoretical foundation for the improved model designed in this paper.

2.2. Optimization Design of CNN

In response to the demand for food image recognition, this study optimized the CNN model by integrating asymmetric convolution, introducing an attention mechanism, and optimizing the residual structure. The specific plan is outlined below.

2.2.1. AC Block Fused with Asymmetric Convolution

Reference [20] indicates that the weights at the skeleton positions (central region) of a convolution kernel are more critical for feature extraction; asymmetric convolution enhances this skeleton information and improves local feature expression ability.
The AC block that fuses asymmetric convolution consists of three parallel branches, each using a convolution kernel of a different shape:
Branch 1: a $d \times d$ square convolution kernel (standard convolution);
Branch 2: a $1 \times d$ asymmetric convolution kernel (horizontal feature extraction);
Branch 3: a $d \times 1$ asymmetric convolution kernel (vertical feature extraction).
Each branch is followed by a batch normalization operation to normalize the feature distribution. The three outputs are added together and, by the additivity of convolution, combined into the output of an equivalent kernel (Formula (4)), enriching the feature space:
$A = C \ast (K_1 \oplus K_2 \oplus K_3) = C \ast K.$
Here, $K_1$ is the square-branch convolution kernel of size $d \times d$, used to extract global/isotropic features; $K_2$ is the horizontal asymmetric-branch convolution kernel of size $1 \times d$, used to extract horizontal skeleton features; $K_3$ is the vertical asymmetric-branch convolution kernel of size $d \times 1$, used to extract vertical skeleton features; $K$ is the fused equivalent kernel, the sum of the three kernels ($\oplus$ denotes element-wise kernel addition with aligned centers); $C$ is the input feature map; and $A$ is the output feature map.
By using asymmetric convolution to enhance the skeleton weights, redundant parameters are reduced and the model's sensitivity to local food textures (such as food shapes and edges) is increased.
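A hedged PyTorch sketch of such an AC block follows (the class name `ACBlock` and its parameter choices are illustrative, not the authors' released code):

```python
import torch
import torch.nn as nn

# Sketch of an AC block (after Ding et al. [20]): three parallel branches with
# d x d, 1 x d, and d x 1 kernels, each followed by BN; outputs are summed.
class ACBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, d: int = 3):
        super().__init__()
        self.square = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, (d, d), padding=(d // 2, d // 2), bias=False),
            nn.BatchNorm2d(out_ch))
        self.horizontal = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, (1, d), padding=(0, d // 2), bias=False),
            nn.BatchNorm2d(out_ch))
        self.vertical = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, (d, 1), padding=(d // 2, 0), bias=False),
            nn.BatchNorm2d(out_ch))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Element-wise sum; by convolution additivity this equals one convolution
        # with the fused, skeleton-enhanced kernel K.
        return self.square(x) + self.horizontal(x) + self.vertical(x)

y = ACBlock(64, 64)(torch.randn(1, 64, 56, 56))   # shape preserved: (1, 64, 56, 56)
```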

2.2.2. Global Feature Calibration Based on Attention Mechanism

A channel attention module is introduced to adaptively adjust feature channel weights, highlight key features, and suppress noise [21].
Figure 2 (created with Microsoft Visio Professional 2021) describes the structure of the attention module, where $M$ is the original feature map, $U$ is the output feature map, and ⊗ denotes weight application. In the compress step, global average pooling squeezes the spatial dimensions of the input feature map $M$ to obtain a global information vector $R$. The excitation step passes $R$ through two fully connected layers: a ReLU activation learns the nonlinear relationships between channels, and a Sigmoid activation generates the channel weight vector, quantifying the importance of each channel, exciting valuable features, and suppressing unimportant ones. Weight application multiplies the weight vector channel-wise with the original feature map $M$ to output the recalibrated feature map $U$.
In food images, the attention module can enhance category-related features (such as specific ingredient colors or textures), improving the model’s utilization efficiency of global information.
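A minimal sketch of this channel attention (squeeze-and-excitation style [21]; the reduction ratio here is an illustrative assumption):

```python
import torch
import torch.nn as nn

# Channel attention as in Figure 2: GAP compresses M to a vector R, two FC layers
# with ReLU/Sigmoid produce channel weights s, and U = s * M recalibrates M.
class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),                 # learns nonlinear channel relations
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid())                          # channel weights in (0, 1)

    def forward(self, m: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = m.shape
        r = m.mean(dim=(2, 3))                     # compress: global average pooling
        s = self.excite(r).view(b, c, 1, 1)        # per-channel importance weights
        return m * s                               # weight application: U = s ⊗ M

u = ChannelAttention(64)(torch.randn(1, 64, 56, 56))
```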

2.2.3. Residual Structure Optimization and Lightweight Design

The residual structure is the core of ResNet; it alleviates the degradation problem of deep networks and balances linear and nonlinear mappings (see Figure 3, created with Microsoft Visio Professional 2021) [22]. The formula is as follows:
$H(x) = F(x) + x,$
where $F(x)$ is the residual function and $x$ is carried by the skip connection.
To avoid vanishing gradients, promote cross-layer information transmission, and enhance feature reuse, we chose the lightweight ResNet-18 network, reducing the use of computing resources. We embedded AC blocks and attention modules in ResNet-18 to construct the FoodResNet18 model. This reduces model complexity while maintaining performance, matching the scale and real-time requirements of food image data.
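As a minimal sketch of the residual mapping $H(x) = F(x) + x$ (a generic basic block, not the exact FoodResNet18 wiring, which is detailed in Section 3):

```python
import torch
import torch.nn as nn

# Minimal residual block: H(x) = F(x) + x; when channel counts differ, a 1x1
# convolution adjusts the identity branch so the two terms can be added.
class ResidualBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.f = nn.Sequential(                       # residual function F(x)
            nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch))
        self.skip = (nn.Identity() if in_ch == out_ch
                     else nn.Conv2d(in_ch, out_ch, 1, bias=False))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.f(x) + self.skip(x))    # H(x) = F(x) + x

h = ResidualBlock(64, 128)(torch.randn(1, 64, 56, 56))
```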

3. FoodResNet18 Model Structure

The food images of Northern Shaanxi have typical characteristics of Chinese food: small inter-class differences (different dishes may look similar) and large intra-class differences (the same dish may vary markedly in appearance under different cooking methods). In response to these characteristics, this article proposes a Northern Shaanxi (Shaanbei) food image recognition model called FoodResNet18, which enhances detailed features. The network structure of FoodResNet18 is shown in Figure 4 (created with Microsoft Visio Professional 2021).
The FoodResNet18 model improves recognition accuracy through the following innovative designs:
  • An attention module shared by deep and shallow layers, enhancing global feature extraction.
  • An enhanced block structure, strengthening local detail feature extraction.
  • Asymmetric convolution and skip connections reduce degradation phenomena.
The FoodResNet18 model is composed of shared attention modules and enhancement blocks spanning the shallow and deep layers. Based on the characteristics of Northern Shaanxi food images, and to strengthen image feature extraction, the shared attention module improves feature extraction by learning the weights of different channels in the food images and is embedded from the shallow to the deep layers to enhance global features. The enhancement block improves the extraction of local details in the food images, integrating asymmetric convolution and skip connections to reduce degradation while enhancing local features.

3.1. Enhancement Block

The enhancement block is the core component of the FoodResNet18 model; it strengthens feature extraction through multi-branch convolution and a skeleton enhancement strategy. Its structure is shown in Figure 5 (created with Microsoft Visio Professional 2021).
The input feature map (of size $H \times W \times C_0$) is processed by the enhancement block to output a feature map of size $H_i \times W_i \times C_i$, where $C_0$ and $C_i$ denote the numbers of input and output channels. The specific implementation is as follows.
(1) Multi-branch convolution group
Each enhancement block contains three parallel branches, each using a convolution kernel of a different size: a $3 \times 3$ convolution captures local neighborhood features; a $1 \times 3$ convolution enhances horizontal texture; and a $3 \times 1$ convolution enhances vertical details. Each branch is followed by batch normalization (BN), whose parameters control the depth of feature fusion. The formula is as follows:
$A_1 = \dfrac{\gamma_1 (I \ast K_{3\times3} - \mu_1)}{\sigma_1} + \beta_1, \quad A_2 = \dfrac{\gamma_2 (I \ast K_{1\times3} - \mu_2)}{\sigma_2} + \beta_2, \quad A_3 = \dfrac{\gamma_3 (I \ast K_{3\times1} - \mu_3)}{\sigma_3} + \beta_3,$
where $\mu_i$ and $\sigma_i$ ($i = 1, 2, 3$) are the means and standard deviations of BN, and $\gamma_i$ and $\beta_i$ ($i = 1, 2, 3$) are the learnable scaling and offset parameters.
(2) Feature fusion
Using the linear additivity of convolution, the three outputs are merged into a single-layer feature:
$A = A_1 + A_2 + A_3 = I \ast \left( \dfrac{\gamma_1}{\sigma_1} K_{3\times3} \oplus \dfrac{\gamma_2}{\sigma_2} K_{1\times3} \oplus \dfrac{\gamma_3}{\sigma_3} K_{3\times1} \right) + b.$
The offset term $b$ integrates the normalization parameters of each branch (see Formula (8)), ultimately forming a skeleton-enhanced convolution kernel $K$ (see Formula (9)):
$b = -\dfrac{\mu_1 \gamma_1}{\sigma_1} - \dfrac{\mu_2 \gamma_2}{\sigma_2} - \dfrac{\mu_3 \gamma_3}{\sigma_3} + \beta_1 + \beta_2 + \beta_3,$
$K = \dfrac{\gamma_1}{\sigma_1} K_{3\times3} \oplus \dfrac{\gamma_2}{\sigma_2} K_{1\times3} \oplus \dfrac{\gamma_3}{\sigma_3} K_{3\times1}.$
(3) Skip connection
The input and output are connected by a residual structure to alleviate the vanishing-gradient problem. If the numbers of input and output channels differ, a $1 \times 1$ convolution adjusts the dimensions so that the feature maps can be added directly:
$\mathrm{Output} = \mathrm{ReLU}(A + \mathrm{Downsample}(I)).$
The asymmetric convolution combination ($1 \times 3$, $3 \times 1$) works together with the standard $3 \times 3$ convolution to enhance the perception of local details in different directions, which is particularly suitable for the complex textures of Northern Shaanxi foods (such as the wrinkles of pancakes and the stripes of Liangpi). The scaling factors $\gamma_i / \sigma_i$ ($i = 1, 2, 3$) of the BN layers dynamically adjust the weights of the convolution kernels, enabling the model to adaptively learn the importance of the different branches, which is equivalent to optimizing the skeleton structure of the convolution kernel. The multi-branch design runs in parallel during the training phase and is equivalent to a single enhanced kernel $K$ during inference, adding no extra computational cost (see Figure 6, created with Microsoft Visio Professional 2021).
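The equivalence between the trained multi-branch block and a single fused kernel can be made concrete. The following hedged sketch (reusing the illustrative ACBlock above, in evaluation mode) folds each branch's BN parameters into its kernel, adds the $1\times3$ and $3\times1$ kernels into the skeleton of the $3\times3$ kernel, and collects the bias term as in the fusion equations above:

```python
import torch
import torch.nn.functional as F

# Fold the three BN branches of the illustrative ACBlock into one 3x3 kernel K
# and bias b, so the block costs a single convolution at inference time.
@torch.no_grad()
def fuse_ac_block(block: ACBlock):
    fused_k = torch.zeros_like(block.square[0].weight)        # (out, in, 3, 3)
    fused_b = torch.zeros(fused_k.size(0))
    # (branch, row slice, col slice): 1x3 / 3x1 kernels add into the skeleton
    specs = [(block.square, slice(0, 3), slice(0, 3)),
             (block.horizontal, slice(1, 2), slice(0, 3)),
             (block.vertical, slice(0, 3), slice(1, 2))]
    for seq, rows, cols in specs:
        conv, bn = seq[0], seq[1]
        scale = bn.weight / (bn.running_var + bn.eps).sqrt()  # gamma_i / sigma_i
        fused_k[:, :, rows, cols] += conv.weight * scale.view(-1, 1, 1, 1)
        fused_b += bn.bias - bn.running_mean * scale          # beta_i - mu_i*gamma_i/sigma_i
    return fused_k, fused_b

block = ACBlock(64, 64).eval()
k, b = fuse_ac_block(block)
x = torch.randn(1, 64, 56, 56)
assert torch.allclose(block(x), F.conv2d(x, k, b, padding=1), atol=1e-4)
```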

3.2. Deep Shallow Shared Attention Residual Module

This module achieved the collaborative optimization of deep and shallow features through the deep fusion of a hierarchical attention mechanism and residual structure. Its core design includes the following:
(1) Hierarchical attention adaptation
Shallow attention uses class-independent global feature calibration to enhance the expression of basic features such as texture and color, while deep attention focuses on class-specific semantic feature selection to enhance fine-grained discriminative ability.
(2) Parameter-sharing mechanism
The attention-generation network shared by the deep and shallow layers grades feature importance by dynamically adjusting weights. For a given input feature map $M \in \mathbb{R}^{H \times W \times C}$, the attention weights are generated as follows:
$s = \sigma\left( W_2 \, \delta\left( W_1 \, \mathrm{GAP}(M) \right) \right), \quad W_1 \in \mathbb{R}^{C/r \times C}, \ W_2 \in \mathbb{R}^{C \times C/r}, \quad U_c = s_c \cdot m_c,$
where GAP denotes global average pooling, $\delta$ is the ReLU function, $\sigma$ is the Sigmoid function, $r = 16$ is the shallow compression ratio (reducing computational complexity), and $r = 4$ is the deep compression ratio (preserving semantic information).
Through four alternating connections (attention module → enhancement block → attention module → enhancement block), the gradient flow is continuously corrected during feature transfer, implementing a degradation-suppression mechanism. By embedding attention modules in both the shallow and deep layers of the network, FoodResNet18 stimulates valuable features, improves learning effectiveness, and reduces CNN degradation. A minimal sketch of this alternation is given below.
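This sketch reuses the illustrative ACBlock and ChannelAttention above; the channel counts are assumptions, not the paper's exact configuration:

```python
import torch.nn as nn

# Enhancement block of Section 3.1 (multi-branch AC convolution + skip connection),
# interleaved with shared attention at shallow (r=16) and deep (r=4) ratios.
class EnhancementBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.branches = ACBlock(channels, channels)   # 3x3 + 1x3 + 3x1 branches
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.branches(x) + x)        # Output = ReLU(A + I)

stack = nn.Sequential(
    ChannelAttention(64, reduction=16),   # shallow: class-independent calibration
    EnhancementBlock(64),
    ChannelAttention(64, reduction=4),    # deep: class-specific semantic selection
    EnhancementBlock(64),
)
```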

4. Preprocessing of Food Image Data in Northern Shaanxi

4.1. The Food Image Dataset of Northern Shaanxi

The Northern Shaanxi Food-300 dataset is a self-built image dataset of foods from Northern Shaanxi, as shown in Figure 7. It includes food-category and ingredient labels, with 300 food images across 15 categories; meat dishes are the most common, and all images are manually labeled according to their ingredients.

4.2. Food Image Preprocessing

Because the original images come from different sources, their formats differ, with an average file size of about 45 KB. First, each original image is resized to $256 \times 256$ to unify the input size and avoid the impact of food stretching and deformation on feature extraction. Multiple data augmentation methods are then applied, such as random masking, cropping, rotation, and adjustments to brightness, contrast, saturation, and color balance. Figure 8 shows an example of image preprocessing. In addition, simple cross-validation is used to divide the dataset into an 85% training set and a 15% test set, with the preprocessed images of each category randomly split between the two. Finally, 150 additional food images covering the 15 categories (different from those in the Northern Shaanxi Food-300 dataset) are collected and preprocessed to train the model parameters.
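A sketch of such a preprocessing and split pipeline (parameter values and the dataset path are illustrative assumptions, not the paper's exact settings):

```python
import torchvision.transforms as T
from torchvision.datasets import ImageFolder
from torch.utils.data import random_split

# Resize to 256x256, then augment: crop, flip, rotation, color jitter, and
# random erasing (masking). RandomErasing must follow ToTensor.
train_transform = T.Compose([
    T.Resize((256, 256)),
    T.RandomCrop(224),
    T.RandomHorizontalFlip(),
    T.RandomRotation(15),
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    T.ToTensor(),
    T.RandomErasing(p=0.25),
])

dataset = ImageFolder("shaanbei_food/", transform=train_transform)  # hypothetical path
n_train = int(0.85 * len(dataset))                                  # 85/15 split
train_set, test_set = random_split(dataset, [n_train, len(dataset) - n_train])
```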

5. Analysis of Food Image Data

5.1. Model Training

This article uses PyTorch 1.8.0 as the deep learning framework, running on an 11th Gen Intel® Core™ i9-11900K @ 3.50 GHz processor (Intel Corporation, Santa Clara, CA, USA) with 64.0 GB of memory and an NVIDIA GeForce RTX 3060 Ti graphics card (NVIDIA Corporation, Santa Clara, CA, USA) with 40,878 MB of graphics memory.

5.1.1. Optimizer and Parameter Settings

The model parameters are optimized with the stochastic gradient descent (SGD) algorithm, with the initial learning rate set to 0.2 and a momentum value of 0.9. Momentum stabilizes the parameter update direction when the gradient direction changes, while still allowing parameters to be updated through accumulated momentum when the partial derivative is zero. To alleviate overfitting, L2 regularization was added, with the weight decay coefficient determined to be 1 × 10−4 through multiple experiments.

5.1.2. Learning Rate Adjustment Strategy

In the initial stage of training, a fixed learning rate of 0.2 is used to accelerate model convergence. Subsequently, the learning rate is adjusted by a dynamic step-size decay strategy: every five training epochs, it is updated according to the following formula:
$\eta_{t+1} = \gamma \times \eta_t,$
where $\eta_t$ denotes the current learning rate and $\gamma$ denotes the decay factor, experimentally chosen as 0.8. This strategy gradually reduces the learning rate as training progresses, helping the model converge to a better solution.
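A brief sketch of this configuration (assuming `model` is the network being trained; PyTorch's StepLR reproduces the update formula above when stepped once per epoch):

```python
import torch

# SGD with momentum and L2 weight decay (Section 5.1.1), plus step decay:
# multiply the learning rate by gamma = 0.8 every 5 epochs (Section 5.1.2).
optimizer = torch.optim.SGD(model.parameters(), lr=0.2,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.8)

for epoch in range(120):
    # ... one pass over the training mini-batches (batch size 32) ...
    scheduler.step()   # eta <- 0.8 * eta at the end of every 5th epoch
```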

5.1.3. Training Parameters and Data Configuration

Dataset: the total sample size was approximately 13,000, evenly distributed across all categories, with no long-tail distribution issues.
Batch setting: mini-batch training was used, with 32 samples randomly selected as input each time.
Training epochs: a total of 120 epochs were trained, more than the 90 epochs commonly used for ImageNet training; because the data are evenly distributed, extending the training epochs fully taps the model's potential and ensures convergence to an optimal state.

5.1.4. Training Process Analysis

As shown in Figure 9 (generated with PyCharm Community Edition 2023.2), the model trained as follows:
Accuracy: in the first 40 epochs, both training and test accuracy rose rapidly. Training accuracy then gradually approached 100%, while test accuracy remained stable above 85%, indicating that the model did not exhibit severe overfitting.
Loss: the cross-entropy loss dropped rapidly in the first 40 epochs; thereafter the rate of descent slowed, and by epoch 120 the loss value approached 0, indicating that the model reached a stable convergence state.

5.1.5. Experimental Conclusion

Through dynamic learning rate adjustment, regularization constraints, and sufficient training epochs, FoodResNet18 demonstrated good convergence and generalization ability on a balanced dataset, with a final test accuracy of over 85%, verifying the effectiveness of the model design.

5.2. Comparative Analysis

We compared the recognition accuracy and model footprint of the commonly used Arch-D [23], DenseFood [7], ResNet-101, and ResNet-18 models with the optimized FoodResNet18 method proposed in this paper.
From Table 1, it can be seen that on the Northern Shaanxi Food-300 dataset, our model outperforms existing CNN-based Chinese food image recognition networks in recognition accuracy. The Top-1 accuracy of FoodResNet18 is 85.26%, significantly better than that of the traditional models: 9.88 percentage points higher than ResNet-18 (75.38%) and 10.05 points higher than ResNet-101 (75.21%), with a Top-5 accuracy of 96.11%, which exceeds the strong Arch-D baseline (95.89%). FoodResNet18 (71.2 MB) is 65% larger than ResNet-18 (43 MB), but its markedly higher accuracy (+9.88 points) balances performance against resource consumption. Compared with ResNet-101 (136 MB), FoodResNet18 requires only 52% of the storage space yet achieves higher precision, making it better suited to edge-device deployment. The Top-1 error rate of ResNet-18 is 24.62%, while that of FoodResNet18 is 14.74%, an absolute reduction of 9.88 percentage points. Through the enhancement blocks (local feature extraction) and the deep-shallow attention (global feature fusion), the problem of large intra-class differences and small inter-class differences in Chinese food images (such as similar dish arrangements but different sauce textures) is addressed. Deep networks such as ResNet-101 are prone to accuracy degradation on Chinese food tasks due to the loss of details; FoodResNet18 mitigates this through its feature reuse mechanism.
The core advantages of FoodResNet18 are reflected in the following aspects.
(1) Lightweight and high-performance: FoodResNet18 achieves 85.26% Top-1 accuracy with a 71.2 MB model volume, exceeding the 82.07% of Arch-D while significantly outperforming the traditional ResNet series.
(2) Edge-device adaptation: compared with ResNet-101 (136 MB), the model size is reduced by 48%, making it better suited to scenarios with limited computing resources, such as catering robots and mobile apps.
(3) Domain-specific optimization: the modular design strengthens fine-grained feature extraction to address the unique challenges of Chinese food image recognition.

5.3. Ablation Experiment

To verify the effectiveness of the enhancement block (E) and the deep-shallow attention module (A) in the FoodResNet18 model, the following comparative experiments were designed: the baseline was the original ResNet-18, and the improved models were ResNet-18+E (adding enhancement blocks for local feature extraction), ResNet-18+A (adding attention modules for global feature fusion), and FoodResNet18, which integrates both. The experiments were based on the Northern Shaanxi Food-300 dataset, comparing each model's classification accuracy (Top-1/Top-5) and parameter count. The results are shown in Table 2.
From Table 2, it can be seen that both the enhancement block and the attention module effectively improve the recognition performance of the network. When the enhancement block (E) is used alone, Top-1 accuracy improves by 8.70 percentage points (75.38%→84.08%) while the parameter count increases by 65% (11.26 M→18.60 M), indicating that the enhancement block significantly strengthens the extraction of local detail features (such as dish texture and food shape). When the attention module (A) is used alone, Top-1 accuracy improves by 7.84 points (75.38%→83.22%) while the parameter count increases by 31% (11.26 M→14.79 M); the attention module improves the use of global contextual information (such as layout style and background association) through the fusion of deep and shallow features. Compared with ResNet-18+E, FoodResNet18 (the E+A combination) increases the parameter count by only 3.9% (18.60 M→19.34 M) yet further improves Top-1 accuracy by 1.18 points (84.08%→85.26%). This indicates that the enhancement block and attention module are complementary: the enhancement blocks focus on fine-grained local features, reducing the impact of intra-class differences, while the attention modules optimize global feature associations to alleviate inter-class similarity. FoodResNet18 achieves 85.26% Top-1 accuracy with 19.34 M parameters, and its overall performance is better than either single-module variant. Compared with the baseline ResNet-18, accuracy improves by 9.88 points while the parameter count increases by only 72%, demonstrating the efficiency of the lightweight design.

6. Conclusions

In this paper, a lightweight model, FoodResNet18, based on an improved ResNet architecture, was proposed to solve key problems in Chinese food image recognition, such as the loss of detailed features, high inter-class similarity, and significant intra-class differences. The experimental results on the self-built Northern Shaanxi Food-300 dataset show that FoodResNet18 achieves a significant performance improvement: its Top-1 classification accuracy reaches 85.26%, which is 9.88 percentage points higher than the baseline ResNet-18 (75.38%) and surpasses the deeper ResNet-101 (75.21%). In terms of efficiency, FoodResNet18 exhibits excellent lightweight characteristics: its model volume (71.2 MB) is only 52.35% of that of ResNet-101 (136 MB), reducing storage consumption by 47.65%. Notably, through collaborative optimization between the modules, the model gains an additional 1.18 percentage points of Top-1 accuracy (84.08%→85.26%) with an increase of only 3.9% in parameters (18.60 M→19.34 M).
Compared with existing advanced methods, FoodResNet18 shows advantages on many indicators. Compared with the Arch-D method, its Top-1 and Top-5 accuracies are 3.19 and 0.22 percentage points higher, respectively; compared with the DenseFood method, they are 4.03 and 0.64 points higher, respectively. These experimental results fully verify the effectiveness of the proposed modular design in balancing model accuracy and computational efficiency, providing a high-performance and practical solution for the image recognition of characteristic Northern Shaanxi foods. Future research directions include transfer learning of cross-cultural dietary characteristics, robust optimization in dynamic scenes, and model compression for edge devices. These efforts will further advance food computing and provide technical support for the digital preservation of food culture.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/math13162572/s1. The database includes 15 categories, each category folder is named after the food name, and each category contains 20 pictures.

Author Contributions

Conceptualization, methodology, writing—original draft preparation, writing—review and editing, funding acquisition, Y.M.; writing—original draft preparation, writing—review and editing, funding acquisition, J.L.; methodology, funding acquisition, A.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Yulin Science and Technology Bureau Project under Grant 2024-CYY-119 and Grant 2024-CYY-120, and the Shaanxi Fundamental Science Research Project for Mathematics and Physics under Grant 23JSQ056.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhao, R. History of Chinese Food Culture; China Light Industry Press: Beijing, China, 2014.
  2. Min, W.; Jiang, S.; Liu, L.; Rui, Y.; Jain, R. A survey on food computing. ACM Comput. Surv. 2020, 52, 1–36.
  3. Bossard, L.; Guillaumin, M.; Van Gool, L. Food-101 - Mining discriminative components with random forests. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2014; pp. 446–461.
  4. Yanai, K.; Kawano, Y. Food image recognition using deep convolutional network with pre-training and fine-tuning. In Proceedings of the 2015 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Turin, Italy, 29 June–3 July 2015; pp. 1–6.
  5. Islam, M.; Siddique, M.; Rahman, S.; Jabid, T. Food image classification with convolutional neural network. In Proceedings of the 2018 International Conference on Intelligent Informatics and Biomedical Sciences (ICIIBMS), Bangkok, Thailand, 21–24 October 2018; pp. 257–262.
  6. Yunus, R.; Arif, O.; Afzal, H.; Amjad, M.F.; Abbas, H.; Bokhari, H.N.; Haider, S.T.; Zafar, N.; Nawaz, R. A framework to estimate the nutritional value of food in real time using deep learning techniques. IEEE Access 2019, 7, 2643–2652.
  7. Metwalli, A.; Shen, W.; Wu, C. Food image recognition based on densely connected convolutional neural networks. In Proceedings of the 2020 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), Fukuoka, Japan, 19–21 February 2020; pp. 27–32.
  8. Liao, E.H.; Li, H.; Wang, H.; Pang, X.W. Food image recognition based on convolutional neural networks. J. South China Norm. Univ. 2019, 51, 113–119.
  9. Sheng, G.; Min, W.; Zhu, X.; Xu, L.; Sun, Q.; Yang, Y.; Wang, L.; Jiang, S. A Lightweight Hybrid Model with Location-Preserving ViT for Efficient Food Recognition. Nutrients 2024, 16, 200.
  10. Li, X.; Yin, A.; Choi, H.Y.; Chan, V.; Allman-Farinelli, M.; Chen, J. Evaluating the Quality and Comparative Validity of Manual Food Logging and Artificial Intelligence-Enabled Food Image Recognition in Apps for Nutrition Care. Nutrients 2024, 16, 2573.
  11. Bianco, R.; Marinoni, M.; Coluccia, S.; Carioni, G.; Fiori, F.; Gnagnarella, P.; Edefonti, V.; Parpinel, M. Tailoring the Nutritional Composition of Italian Foods to the US Nutrition5k Dataset for Food Image Recognition: Challenges and a Comparative Analysis. Nutrients 2024, 16, 3339.
  12. Nfor, K.A.; Theodore Armand, T.P.; Ismaylovna, K.P.; Joo, M.-I.; Kim, H.-C. An Explainable CNN and Vision Transformer-Based Approach for Real-Time Food Recognition. Nutrients 2025, 17, 362.
  13. Li, Y.; Liu, M. Food image recognition method based on iterative clustering and confidence screening mechanism. J. Electron. Imaging 2025, 34, 043025.
  14. Jagadesh, B.N.; Mantena, S.V.; Sathe, A.P.; Prabhakara Rao, T.; Lella, K.K.; Pabboju, S.S.; Vatambeti, R. Enhancing food recognition accuracy using hybrid transformer models and image preprocessing techniques. Sci. Rep. 2025, 15, 5591.
  15. Xiong, Y. Food image recognition based on ResNet. Appl. Comput. Eng. 2023, 8, 605–611.
  16. Liu, Y.Z. Automatic food recognition based on EfficientNet and ResNet. J. Phys. Conf. Ser. 2023, 2646, 012037.
  17. Xiao, Z.; Diao, G.; Deng, Z. Fine-grained food image recognition based on Swin Transformer. J. Food Eng. 2024, 380, 112134.
  18. Kim, Y.D. Consumer Usability Test of Mobile Food Safety Inquiry Platform Based on Image Recognition. Sustainability 2024, 16, 9538.
  19. Bu, L.; Hu, C.; Zhang, X. Recognition of food images based on transfer learning and ensemble learning. PLoS ONE 2024, 19, e0296789.
  20. Ding, X.; Guo, Y.; Ding, G.; Han, J. ACNet: Strengthening the kernel skeletons for powerful CNN via asymmetric convolution blocks. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1911–1920.
  21. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
  22. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  23. Chen, J.; Ngo, C. Deep-based ingredient recognition for cooking recipe retrieval. In Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016; pp. 32–41.
Figure 1. CNN hierarchical structure diagram.
Figure 2. Attention module structure diagram.
Figure 3. Residual structure diagram.
Figure 4. Network structure of FoodResNet18.
Figure 5. Internal structure of the enhancement block.
Figure 6. Convolution kernel with skeleton enhancement.
Figure 7. Example of food images in Northern Shaanxi Food-300 (Supplementary Materials).
Figure 8. Example of image preprocessing.
Figure 9. Training results of the FoodResNet18 model.
Table 1. Comparison of the accuracy of food classification network models based on the Northern Shaanxi Food-300 dataset (Supplementary Materials).

Method         Top-1/%   Top-5/%   Size/MB
Arch-D         82.07     95.89     \
DenseFood [7]  81.23     95.47     \
ResNet-101     75.21     91.22     136
ResNet-18      75.38     91.88     43
FoodResNet18   85.26     96.11     71.2
Table 2. Comparison of ablation experiment results.

Method         Top-1/%   Top-5/%   Parameters/M
ResNet-18      75.38     91.88     11.26
ResNet-18+E    84.08     95.28     18.60
ResNet-18+A    83.22     95.15     14.79
FoodResNet18   85.26     96.11     19.34
