1. Introduction
Beef quality grading is a crucial process that directly affects pricing and significantly influences consumer choices [1]. Higher-grade beef commands a substantially higher price than lower-grade products [2]. In South Korea, beef carcasses are graded on five criteria: intramuscular fat (IMF), meat color, fat color, texture, and maturity, and are categorized into five grades: 1++, 1+, 1, 2, and 3 [2]. The current standard grading method relies on visual inspection by trained experts [3].
However, such manual assessments are inherently subjective and often lack consistency among evaluators, which limits efficiency and reliability in large-scale industrial environments [4]. To overcome these challenges, image-based analysis techniques and deep learning approaches have been actively explored to automate the grading process [5]. Early automated systems relied on complex preprocessing steps such as illumination correction, background removal, and region-of-interest (ROI) extraction to achieve acceptable accuracy, but these multi-step pipelines are inefficient for practical deployment.
To reduce the preprocessing burden, several studies have proposed segmentation-based methods that automatically extract ROIs and improve classification accuracy [3]. However, this two-stage structure, which separates segmentation from classification, increases model complexity and is less efficient from an end-to-end learning perspective.
Recently, one-stage classification models that predict grades directly from entire images, without segmentation, have gained attention [6]. However, conventional CNN-based one-stage models tend to focus on local features, which limits their performance when global visual patterns, such as the marbling distribution of IMF, are critical. Intramuscular fat is known to be highly influential and closely correlated with the other grading criteria [7,8,9].
To address these limitations, this study proposes a one-stage beef carcass grading model based on EfficientViT, a lightweight hybrid architecture that combines Vision Transformers and CNNs. EfficientViT is designed to simultaneously capture local patterns via convolutional layers and global contextual features via self-attention, while maintaining a compact and fast-inference design suitable for real-time deployment [10].
The main contributions of this study are as follows:
We propose a high-accuracy one-stage grading model based on EfficientViT that eliminates the need for a complex two-stage pipeline;
We demonstrate the importance of leveraging global features through performance comparisons with CNN-based models (VGG16, ResNeXt50, DenseNet121);
We enhance model explainability through visual analysis using Grad-CAM [11] and attention maps [12];
We validate the model’s suitability for real-time industrial applications with low inference latency and compact parameter size.
This paper details the architecture of the proposed EfficientViT-based one-stage model and evaluates its superiority through quantitative and qualitative comparisons with existing methods. In particular, we conduct an in-depth analysis of the model’s performance in predicting the highest grade (1++) and its robustness to variations in loin area proportion within images.
To summarize, this study aims to develop a lightweight and accurate one-stage beef carcass grading model using EfficientViT, a vision transformer architecture capable of capturing both local and global features. We compare its performance with conventional CNN-based classifiers and two-stage segmentation–classification pipelines. The rest of this paper is organized as follows:
Section 2 reviews related works; Section 3 describes the dataset, models, and experimental settings; Section 4 presents experimental results and analysis; Section 5 discusses the findings and limitations; and Section 6 concludes the study.
2. Related Works
Various image-based and deep learning-based approaches have recently been proposed to automate beef carcass grading. Early studies utilized hyperspectral imaging (HSI) to predict marbling and intramuscular fat (IMF). Velásquez et al. [13] employed decision tree classifiers on HSI data to predict marbling scores, while Naganathan et al. [14] utilized three-dimensional principal component analysis (PCA) and local binary pattern (LBP) features to estimate tenderness. However, HSI-based approaches are limited in practical use due to high equipment costs and computational complexity.
As a result, RGB image-based classification has gained popularity. Pinto et al. [15] combined LBP features and Random Forest classifiers to predict marbling grades from loin cross-section images, and Pranata et al. [16] quantitatively analyzed marbling areas using thresholding and morphological operations. Stewart et al. [17] demonstrated that LBP features and Partial Least Squares regression could predict MSA marbling scores and IMF percentages. Stewart et al. [18] also proposed a vision-based system capable of estimating Eye Muscle Area, IMF%, MSA, and AUS-MEAT marbling scores with high accuracy (R² ranging from 0.70 to 0.83), showing the potential of image-based methods in industrial meat quality assessment.
Several segmentation-based approaches have also been proposed. Talacha et al. [19] applied pixel-wise classification using AlexNet to segment the longissimus muscle region, while Gonçalves et al. [20] evaluated multiple CNN architectures (SegNet, U-Net, DeepLab, etc.) for carcass area extraction. Lee et al. [3] proposed MSENet, a multi-task network that performs both segmentation and marbling score regression simultaneously. Wakholi et al. [6] used DeepLabV3+ and a custom CNN to detect anatomical keypoints and then applied multivariate regression to estimate longissimus muscle parameters (LMP).
However, these approaches are based on two-stage pipelines that separate segmentation and classification, increasing training and inference complexity. Unlike previous studies that adopt a two-stage pipeline to enhance regional interpretability or enable multi-task learning (e.g., ROI segmentation followed by quality estimation), our one-stage approach aims to improve deployment efficiency by eliminating intermediate processing. As shown in our experiments (Section 4.1), the proposed model achieves higher classification accuracy and significantly faster inference speed. This highlights that, while two-stage methods offer modular benefits, they may not be optimal for real-time grading systems that prioritize speed and simplicity over anatomical localization.
To address this issue, recent studies have focused on one-stage end-to-end models that omit segmentation. Prakash et al. [21] performed part classification using CNNs on RGB images taken from boning lines. Pannier et al. [22] used conveyor-mounted camera systems to predict IMF% with an R² of 0.87. Negretti et al. [23] proposed a smartphone-based VIA application to classify SEUROP grades quantitatively.
The EfficientViT model used in this study is a lightweight vision transformer architecture that achieves both high accuracy and fast inference. Cai et al. [24] reported up to 13× fewer computations and more than 6× faster inference than SegFormer by leveraging Multi-Scale Linear Attention. Liu et al. [10] further improved performance with Cascaded Group Attention and a Sandwich Layout, achieving 2.6× faster inference while maintaining accuracy comparable to EfficientNet.
3. Materials and Methods
3.1. Materials
3.1.1. Dataset
The dataset provided by AI Hub [25] consists of 77,899 RGB images labeled with five beef quality grades: 1++, 1+, 1, 2, and 3. The dataset includes corresponding segmentation masks and was used without modification in this study. The original images have a resolution of 1080 × 1920 pixels and were resized to 512 × 512 pixels. The dataset was randomly split into training, validation, and test sets at a ratio of 7:1:1, resulting in 60,571 training images, 8664 validation images, and 8664 test images.
Table 1 shows the number of images per grade in each subset.
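For reproducibility, the sketch below illustrates one way to realize the 7:1:1 random split and the 512 × 512 resizing described above (Python/PIL). The directory layout, file extension, and random seed are illustrative assumptions; the simple n // 9 arithmetic approximates, rather than exactly reproduces, the reported subset sizes.

```python
# Minimal sketch of the dataset preparation: a seeded 7:1:1 random split and
# resizing from 1080 x 1920 to 512 x 512. Paths, extension, and seed are
# hypothetical; only the split ratio and target resolution come from the text.
import random
from pathlib import Path

from PIL import Image

def split_dataset(image_dir: str, seed: int = 42):
    paths = sorted(Path(image_dir).glob("*.jpg"))   # assumed file extension
    random.Random(seed).shuffle(paths)
    n_val = n_test = len(paths) // 9                # roughly 1/9 each for validation and test
    test = paths[:n_test]
    val = paths[n_test:n_test + n_val]
    train = paths[n_test + n_val:]                  # remaining ~7/9 for training
    return train, val, test

def resize_to_512(src: Path, dst: Path) -> None:
    Image.open(src).convert("RGB").resize((512, 512), Image.BILINEAR).save(dst)
```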
In addition to the images, segmentation masks are provided that isolate the lean meat region by removing non-meat areas. The masks were released together with the images by AI Hub; according to the dataset description, they were constructed using a combination of automatic and manual labeling techniques based on OpenCV, focusing on removing the outer fat surrounding the beef cross-sectional area. These pre-labeled masks were used directly as ground truth for training the segmentation networks in the two-stage pipeline. To better illustrate the dataset composition, Figure 1 shows representative examples of RGB images and corresponding segmentation masks for each of the five quality grades (1++, 1+, 1, 2, and 3); the masks isolate the lean meat area used for training the two-stage models.
The two-stage classification models were trained using the images with segmentation masks applied, whereas the one-stage models were trained directly on the original unmasked images.
To evaluate performance under different proportions of lean meat, we used the segmentation masks to calculate the proportion of each image occupied by the meat cross-section. Based on this ratio, the dataset was divided into two groups: images with over 20% loin area (405 samples) and images with under 10% loin area (2346 samples).
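A minimal sketch of this grouping, assuming the masks are single-channel images whose non-zero pixels mark the lean meat, is shown below; the 20% and 10% thresholds follow the text.

```python
# Compute the fraction of the frame covered by the lean-meat mask and bucket
# images into the "over 20%" / "under 10%" loin-area subsets.
import numpy as np
from PIL import Image

def loin_area_ratio(mask_path: str) -> float:
    mask = np.array(Image.open(mask_path).convert("L")) > 0   # foreground = lean meat (assumed)
    return float(mask.mean())                                  # fraction of the image covered

def loin_area_bucket(ratio: float) -> str | None:
    if ratio > 0.20:
        return "over_20"
    if ratio < 0.10:
        return "under_10"
    return None   # images between 10% and 20% are not used in this analysis
```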
Figure 2 shows representative examples from the dataset, including beef cross-sectional images, labeled with different quality grades. The images exhibit varying degrees of intramuscular fat distribution, which is the key visual cue for determining beef quality.
3.1.2. Implementation Details
All experiments were conducted on a system equipped with an AMD Ryzen 7 7800X3D processor (8 cores, 16 threads), 32 GiB of RAM, and an NVIDIA GeForce RTX 4070 Ti SUPER GPU, running Ubuntu 22.04.5 LTS (64-bit). All hyperparameters listed in Table 2 were optimized individually for each model through validation-based tuning. For the classification models (EfficientViT, VGG-16, ResNeXt-50, DenseNet-121), we applied early stopping with a patience of 20 epochs to ensure convergence while avoiding overfitting. For the segmentation models used in the two-stage pipeline, more aggressive early stopping with a patience of 10 epochs was employed because their IoU-based validation metrics saturated faster. These strategies ensured stable and efficient training across all architectures.
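The early-stopping logic can be summarized by the minimal helper below; the monitored metrics (validation accuracy for classifiers, validation IoU for segmentation models) are assumptions consistent with the description above.

```python
# Minimal early-stopping helper: patience 20 for classification models,
# patience 10 for segmentation models in the two-stage pipeline.
class EarlyStopping:
    def __init__(self, patience: int = 20, mode: str = "max"):
        self.patience, self.mode = patience, mode
        self.best, self.counter = None, 0

    def step(self, value: float) -> bool:
        """Update with the latest validation metric; returns True when training should stop."""
        improved = (
            self.best is None
            or (self.mode == "max" and value > self.best)
            or (self.mode == "min" and value < self.best)
        )
        if improved:
            self.best, self.counter = value, 0
        else:
            self.counter += 1
        return self.counter >= self.patience

# classifier_stopper = EarlyStopping(patience=20)    # EfficientViT, VGG-16, ResNeXt-50, DenseNet-121
# segmentation_stopper = EarlyStopping(patience=10)  # U-Net, DeepLabV3+, etc.
```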
3.2. Methods
3.2.1. Two-Stage Model Configuration
In the two-stage approach, segmentation and classification networks were trained independently. In the first stage, six segmentation models were employed: U-Net, Attention U-Net, SegNet, DeepLabV3+, EfficientPS, and MobileNetV2.
U-Net operates on an encoder–decoder architecture with skip connections, which effectively preserve high-resolution features [26]. Attention U-Net builds upon U-Net by incorporating attention mechanisms to emphasize relevant regions [27]. SegNet is a lightweight encoder–decoder model that utilizes pooling indices during the decoding process to improve memory efficiency [28]. DeepLabV3+ enhances boundary recognition performance through Atrous Spatial Pyramid Pooling (ASPP), which captures multi-scale features in parallel [29]. EfficientPS is designed for panoptic segmentation and is optimized to handle complex foreground and background information simultaneously [30]. Lastly, MobileNetV2 is a lightweight model based on depthwise separable convolutions, well suited for mobile and embedded systems [31].
In the experiments, each segmentation model was used to extract the region of interest (ROI) from beef cross-sectional images. The models produced binary masks to isolate the ROI, and these masks were saved as separate files to be used as input for the second stage.
As illustrated in Figure 3, the two-stage model first segments the longissimus muscle region before classification, whereas the one-stage approach directly predicts the beef grade from the original image. This structural difference significantly affects pipeline complexity and inference speed.
In the classification stage, three CNN-based architectures (ResNeXt-50, DenseNet-121, and VGG-16) were used to predict beef quality grades from the segmented images. ResNeXt-50 consists of 16 bottleneck blocks and leverages grouped convolutions to improve parameter efficiency while maintaining strong feature extraction [32]. DenseNet-121 comprises four dense blocks and three transition layers, using densely connected layers that reuse features across the network to enhance learning efficiency [33]; its feature maps are downsampled to a resolution of 16 × 16. VGG-16 is a classic, straightforward CNN composed of 13 convolutional layers and 3 fully connected layers; the input image, initially at 512 × 512 resolution, is reduced to 16 × 16 through five stages of max pooling [34].
All segmentation and classification models were trained independently under identical data splitting and preprocessing conditions. Input image size was unified to 512 × 512 pixels for both stages.
For inference, the total runtime of the two-stage model was measured by summing the inference times of the segmentation and classification networks. Image saving/loading times were excluded to ensure a fair comparison.
A total of 18 combinations were evaluated by pairing the six segmentation models with the three classification models.
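The sketch below outlines how a single two-stage prediction and its summed latency can be measured (PyTorch); the single-logit segmentation output and the 0.5 mask threshold are assumptions, and disk I/O is excluded as described above.

```python
# One possible two-stage inference path: segmentation mask -> masked image ->
# grade classification, with the reported latency being the sum of both stages.
import time
import torch

@torch.no_grad()
def two_stage_predict(image: torch.Tensor, seg_model, cls_model):
    """image: a (1, 3, 512, 512) tensor already on the target device."""
    if image.is_cuda:
        torch.cuda.synchronize()
    start = time.perf_counter()
    mask = (seg_model(image).sigmoid() > 0.5).float()   # stage 1: binary ROI mask (single-logit output assumed)
    masked = image * mask                                # keep only the lean-meat region
    logits = cls_model(masked)                           # stage 2: grade classification on the masked image
    if image.is_cuda:
        torch.cuda.synchronize()
    latency_ms = (time.perf_counter() - start) * 1000    # summed segmentation + classification time, no disk I/O
    return logits.argmax(dim=1), latency_ms
```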
3.2.2. Proposed One-Stage EfficientViT Model
Unlike the two-stage approach, the one-stage method feeds raw images directly into a classification model to predict beef quality grades, without any segmentation preprocessing.
In this study, we implemented four classification models in the one-stage structure: EfficientViT, ResNeXt-50, DenseNet-121, and VGG-16.
EfficientViT is a lightweight architecture that captures global contextual information while maintaining low computational complexity. It combines convolutional layers for local feature extraction with a multi-scale linear attention mechanism for global feature representation (Figure 4). As a result, EfficientViT achieves high classification accuracy with fast inference speed.
ResNeXt-50, DenseNet-121, and VGG-16 were also implemented in the same 1-stage fashion, and their performance was evaluated under the same dataset and training conditions as EfficientViT.
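As a point of reference, the one-stage classifiers could be instantiated as shown below with five output classes (one per grade); the paper does not specify the implementation, so the timm model names and the choice of EfficientViT variant are assumptions.

```python
# Hypothetical instantiation of the four one-stage classifiers with a 5-class head.
import timm

NUM_GRADES = 5  # 1++, 1+, 1, 2, 3

efficientvit = timm.create_model("efficientvit_b1", pretrained=True, num_classes=NUM_GRADES)
resnext50 = timm.create_model("resnext50_32x4d", pretrained=True, num_classes=NUM_GRADES)
densenet121 = timm.create_model("densenet121", pretrained=True, num_classes=NUM_GRADES)
vgg16 = timm.create_model("vgg16", pretrained=True, num_classes=NUM_GRADES)
```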
3.2.3. Explainability Techniques
To interpret the model’s decision-making process, Grad-CAM (Gradient-weighted Class Activation Mapping) was applied to CNN-based models such as VGG-16, while attention map visualization was used for EfficientViT. These techniques help identify which regions of the input images the models focused on when making predictions.
Grad-CAM calculates class-specific gradients with respect to the last convolutional feature maps and uses them as weights to highlight important areas. The class-discriminative localization map is formally expressed in Equation (1):
$$L^{c}_{\mathrm{Grad\text{-}CAM}} = \mathrm{ReLU}\!\left(\sum_{k} \alpha_{k}^{c}\, A^{k}\right), \tag{1}$$
where $A^{k}$ is the $k$-th activation map from the last convolutional layer and $\alpha_{k}^{c}$ is the average gradient for class $c$ over that feature map, as given in Equation (2):
$$\alpha_{k}^{c} = \frac{1}{Z}\sum_{i}\sum_{j}\frac{\partial y^{c}}{\partial A_{ij}^{k}}. \tag{2}$$
Here, $y^{c}$ denotes the score for class $c$, and $Z$ is the normalization factor over the spatial dimensions $i$ and $j$.
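A minimal hook-based sketch of Equations (1) and (2) is given below; the choice of target layer and the min-max normalization of the heat map are implementation assumptions.

```python
# Minimal Grad-CAM: average the gradients over space (Eq. 2) and use them to
# weight the activation maps of the last convolutional layer (Eq. 1).
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx):
    """image: (1, 3, H, W); returns a heat map in [0, 1] at the input resolution."""
    activations, gradients = {}, {}
    fwd = target_layer.register_forward_hook(lambda m, i, o: activations.update(a=o))
    bwd = target_layer.register_full_backward_hook(lambda m, gi, go: gradients.update(g=go[0]))
    try:
        score = model(image)[0, class_idx]                            # y^c
        model.zero_grad()
        score.backward()
        weights = gradients["g"].mean(dim=(2, 3), keepdim=True)       # alpha_k^c (Eq. 2)
        cam = F.relu((weights * activations["a"]).sum(dim=1))         # Eq. 1
        cam = F.interpolate(cam.unsqueeze(1), size=image.shape[-2:], mode="bilinear")
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)      # scale to [0, 1]
        return cam.squeeze()
    finally:
        fwd.remove()
        bwd.remove()
```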
EfficientViT adopts a lightweight linear attention structure instead of traditional multi-head attention. Attention maps are computed from the inner product between queries (Q) and keys (K), as defined in Equation (3):
$$A_{ij} = \frac{q_{j}^{\top} k_{i}}{\sum_{m} q_{j}^{\top} k_{m} + \epsilon}, \tag{3}$$
where $q_{j}$ and $k_{i}$ are the query and key vectors at positions $j$ and $i$, respectively, and $\epsilon$ is a small constant for numerical stability. During inference, the resulting attention scores were normalized to the range [0, 1] for visualization purposes.
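The sketch below illustrates this computation and the [0, 1] scaling used for visualization, assuming the normalized inner-product form of Equation (3); where the scores are actually read out inside EfficientViT depends on the implementation and is not specified here.

```python
# Attention scores from query/key vectors (per Eq. 3) and min-max scaling for display.
import numpy as np

def linear_attention_scores(q: np.ndarray, k: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """q, k: (N, d) arrays of per-position query/key vectors.
    Returns an (N, N) map whose entry [j, i] relates query position j to key position i."""
    scores = q @ k.T
    return scores / (scores.sum(axis=1, keepdims=True) + eps)

def to_unit_range(attn: np.ndarray) -> np.ndarray:
    """Scale attention scores to [0, 1] before rendering with a Jet color map."""
    return (attn - attn.min()) / (attn.max() - attn.min() + 1e-8)
```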
4. Results
To validate the performance of the proposed EfficientViT-based one-stage beef carcass grading model, we conducted three major experiments. First, we compared the overall performance of conventional two-stage models with that of one-stage models to assess the efficiency and accuracy of omitting the segmentation step. Second, we quantitatively compared EfficientViT with conventional CNN-based classifiers (VGG-16, ResNeXt-50, DenseNet-121) within a one-stage framework to analyze the structural advantages of EfficientViT. Lastly, we investigated how the ratio of muscle cross-section in images affected classification accuracy, evaluating whether EfficientViT’s ability to capture global features remains robust under varying input conditions.
All experiments were evaluated using multiple metrics, including accuracy, F1 score, precision, recall, inference speed, and parameter size. This allowed for a comprehensive assessment of both performance and practical deployability.
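For clarity, the classification metrics and parameter size can be computed as in the sketch below; macro averaging over the five grades and 32-bit parameters are assumptions.

```python
# Evaluation helpers: accuracy, macro F1/precision/recall, and parameter size in MB.
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def classification_metrics(y_true, y_pred) -> dict:
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred, average="macro"),
        "precision": precision_score(y_true, y_pred, average="macro"),
        "recall": recall_score(y_true, y_pred, average="macro"),
    }

def parameter_size_mb(model) -> float:
    """Total parameter size in MB, assuming 32-bit floating-point weights."""
    return sum(p.numel() for p in model.parameters()) * 4 / (1024 ** 2)
```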
4.1. One-Stage vs. Two-Stage
In this experiment, we compared the performance of the proposed one-stage EfficientViT model with conventional two-stage approaches. The two-stage architecture involves sequential execution of a segmentation network followed by a classification network, where the classification is performed on the segmentation output image. We experimented with 18 combinations formed by pairing 6 segmentation models (U-Net, Attention U-Net, SegNet, DeepLabV3+, EfficientPS, MobileNetV2) with 3 classifiers (ResNeXt-50, DenseNet-121, VGG-16).
In contrast, the one-stage approach omits segmentation entirely and performs classification using the original input images. This setting includes EfficientViT and three CNN models (ResNeXt-50, DenseNet-121, VGG-16), all trained and tested under the same conditions.
All models were trained and evaluated using the same dataset, preprocessing pipeline, input resolution (512 × 512), and hardware environment. Inference speed was measured as the average per-image latency (ms) using a batch size of 1. The results are summarized in Table 3.
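The latency protocol can be sketched as follows; the number of warm-up iterations is an assumption, while the batch size of 1 and per-image averaging follow the description above.

```python
# Average per-image inference latency (ms) at batch size 1, with GPU warm-up.
import time
import torch

@torch.no_grad()
def mean_latency_ms(model, images, warmup: int = 10) -> float:
    model.eval()
    for img in images[:warmup]:                 # warm-up passes (not timed)
        model(img.unsqueeze(0))
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for img in images:
        model(img.unsqueeze(0))
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) * 1000 / len(images)
```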
Among the two-stage models, the MobileNetV2 + ResNeXt-50 combination achieved the best performance, with an accuracy of 98.42% and an F1 score of 0.9860. However, its inference time and model size were considerably larger than those of the one-stage ResNeXt-50 model, which achieved comparable accuracy (98.30%) at only 3.26 ms and 91.96 MB. Likewise, the one-stage DenseNet-121 (accuracy 98.12%, F1 score 0.9828, speed 5.86 ms, size 28.87 MB) and VGG-16 (accuracy 97.99%, F1 score 0.9825) models also outperformed their two-stage counterparts across all metrics.
The proposed EfficientViT model achieved the highest overall performance among all models, with 98.46% accuracy and a 0.9867 F1 score. It also demonstrated fast inference (3.92 ms) and compact model size (36.41 MB), making it a highly efficient choice for real-time applications.
4.2. EfficientViT vs. CNN
In this experiment, we compared the performance of EfficientViT with conventional CNN-based models (VGG-16, ResNeXt-50, DenseNet-121) under the one-stage classification framework. Evaluation metrics included accuracy, F1 score, precision, and recall. The training convergence behavior of the one-stage models is visualized in Appendix C Figure A7, confirming consistent optimization across the training, validation, and test sets. The results are summarized in Table 4.
As shown in Table 4, EfficientViT achieved the highest performance across all evaluation metrics, with an accuracy of 98.46% and an F1 score of 0.9867. It also recorded the best precision (0.9874) and recall (0.9859), outperforming all CNN-based baselines. While ResNeXt-50 achieved comparable performance (accuracy 98.30%, F1 score 0.9847), DenseNet-121 and VGG-16 showed slightly lower performance. Overall, EfficientViT demonstrated superior accuracy and robustness compared with traditional convolutional models.
Figure 5 illustrates the trade-off between inference speed and accuracy for all one-stage models. As shown in the scatter plot, EfficientViT achieves the highest accuracy among the evaluated models while maintaining low inference latency, demonstrating a favorable balance between performance and efficiency.
4.3. 1++ Grade Classification: EfficientViT vs. CNN
This experiment focused on evaluating model performance specifically for the 1++ grade, which represents the highest-quality beef carcasses. We compared EfficientViT against the CNN-based classifiers under the one-stage framework using the same evaluation metrics. The results are summarized in Table 5.
EfficientViT outperformed all CNN models in 1++ grade classification, achieving the highest accuracy (99.24%), F1 score (0.9866), precision (0.9874), and recall (0.9858). Although the CNN models also demonstrated high accuracy (above 99%), EfficientViT achieved a more balanced trade-off between precision and recall, highlighting its strength in capturing the nuanced visual cues necessary for distinguishing high-grade marbling patterns.
Grad-CAM and Attention Map Visualization
To investigate the differences in decision-making mechanisms across models, we visualized the prediction basis of the CNN models (ResNeXt-50, DenseNet-121, VGG-16) using Grad-CAM and compared them with the attention maps produced by EfficientViT. Figure 6 illustrates which image regions were most emphasized by each model. All visualizations were normalized to the [0, 1] range and rendered using a Jet color map.
As shown in Figure 6, EfficientViT tended to attend to spatially distributed regions across the image, capturing global patterns more comprehensively. In contrast, the CNN-based models typically focused on limited local areas. This indicates that EfficientViT is better suited for assessing high-grade beef, such as 1++, where the holistic marbling distribution plays a critical role in quality determination.
4.4. Performance Comparison Based on Loin Area Ratio: EfficientViT vs. CNN
This experiment was designed to investigate whether the model performance varies depending on the relative proportion of the loin cross-sectional area within an image. Given the architectural advantage of EfficientViT in capturing global contextual information, we hypothesized that it may exhibit superior performance when the visible loin area is limited.
To test this, we divided the test images into two subsets based on the proportion of loin area relative to the total image size: (1) images where the loin area exceeds 20% of the image ("over 20%") and (2) images where the loin area is below 10% ("under 10%"). We then evaluated the 1++ grade prediction performance of EfficientViT and the three CNN models (VGG-16, ResNeXt-50, DenseNet-121) on each subset. The confusion matrices for both subsets are provided in Appendix B Figures A5 and A6 for visual comparison.
4.4.1. Over 20% Loin Area
As shown in Table 6, EfficientViT achieved the best overall performance on the "over 20%" subset, with an accuracy of 99.75%, an F1 score of 0.9959, and perfect precision (1.0000). Although DenseNet-121 achieved the highest recall (1.0000), EfficientViT outperformed it in both precision and F1 score, indicating a more balanced and reliable prediction capability.
4.4.2. Under 10% Loin Area
As shown in Table 7, ResNeXt-50 achieved the highest accuracy (98.38%) and F1 score (0.9686), with slightly higher precision than the other models. However, EfficientViT achieved the highest recall (0.9703), suggesting a stronger ability to detect 1++ grade cases even when the visible loin area is limited. The accuracy difference between EfficientViT and ResNeXt-50 was marginal (only 0.09%).
5. Discussion
This study investigated the structural advantages of the proposed EfficientViT-based one-stage model for beef carcass grading, comparing its performance against conventional CNNs and two-stage approaches through extensive experiments.
EfficientViT combines convolutional layers with a multi-scale linear attention mechanism, forming a hybrid architecture capable of effectively capturing both local texture features and global visual patterns. Notably, it achieved the highest F1 score and precision in predicting the highest-quality grade (1++), indicating that the attention mechanism is particularly beneficial for identifying distributed marbling patterns, which are critical for this class.
In experiments stratified by loin cross-section ratio, EfficientViT consistently maintained high recall, not only in the over 20% group with abundant meat area but also in the under 10% group, where visual information was limited. This suggests that the model’s attention structure enables it to integrate sparse yet meaningful features from across the entire image, demonstrating robust performance, regardless of ROI size.
Moreover, visualization results using Grad-CAM and attention maps highlighted a clear difference in spatial focus: CNN-based models tended to concentrate on specific local regions, while EfficientViT displayed a more evenly distributed attention pattern across the image.
To further quantify the spatial extent of model attention, we measured the proportion of each image that received high activation (defined as scores ≥ 0.5 in the Grad-CAM or attention map). Table 8 presents descriptive statistics for this high-activation area across all one-stage models.
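A minimal sketch of this statistic is given below, assuming each Grad-CAM or attention map has already been normalized to [0, 1].

```python
# Fraction of pixels with activation >= 0.5, summarized by mean and IQR (as in Table 8).
import numpy as np

def high_activation_fraction(score_map: np.ndarray, threshold: float = 0.5) -> float:
    return float((score_map >= threshold).mean())

def coverage_statistics(score_maps: list[np.ndarray]) -> dict:
    fractions = np.array([high_activation_fraction(m) for m in score_maps]) * 100  # percent
    q1, q3 = np.percentile(fractions, [25, 75])
    return {"mean_percent": float(fractions.mean()), "iqr_percent": (float(q1), float(q3))}
```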
EfficientViT exhibited the largest activated region, with a mean value of 49.78% and an interquartile range (IQR) spanning 39.42% to 59.91%. In contrast, the CNN-based models showed significantly narrower activation ranges: DenseNet-121 averaged 13.79%, ResNeXt-50 10.14%, and VGG-16 only 3.35%. This result suggests that EfficientViT tends to consider broader global contexts when making predictions, which aligns with its multi-scale attention architecture.
Interestingly, the ranking of average activation area (EfficientViT > DenseNet-121 > ResNeXt-50 > VGG-16) matches the models' classification performance on the subset of samples with over 20% loin area (see Section 4.4.1). This supports the hypothesis that broader attention coverage is positively associated with robustness to spatial variation in beef cross-section images. These findings reveal fundamental differences in feature learning strategies and emphasize the importance of considering model architecture when explainability and spatial reasoning are required.
While the two-stage models benefited from explicit ROI segmentation, reducing irrelevant information and improving classification accuracy, they were computationally intensive and exhibited longer inference times. In contrast, EfficientViT achieved comparable or better performance without the need for segmentation, and its fast inference speed further supports its suitability for real-time deployment in production environments. Furthermore, the elimination of segmentation dependency offers significant operational advantages. In real-world settings such as slaughterhouses, obtaining accurate segmentation masks in real time can be challenging due to inconsistent lighting, background noise, and varying carcass positions. By directly operating on raw RGB images, the proposed one-stage model avoids the need for costly and error-prone preprocessing, reducing system complexity and potential failure points. This design choice enhances the model’s robustness and deployability, particularly in scenarios where high-throughput and low-latency processing are essential.
Although the performance gap between EfficientViT and other CNN models may appear numerically marginal (e.g., differences in F1 score of less than 0.02), the consistency of its superiority across different scenarios—including varying loin cross-section ratios—demonstrates its robustness and practical reliability. In industrial contexts, where premium grades such as 1++ are sold at the highest market prices, even a small improvement in classification accuracy can lead to meaningful economic benefits and enhance operational decisions. Therefore, these “slightly better performances” translate into substantial value when deployed in high-stakes production environments.
Nonetheless, this study is based on a curated dataset from AI Hub, and further validation is needed to assess generalizability under more variable industrial conditions such as differing lighting, background clutter, or carcass positioning. Additionally, while the attention mechanism offers improved global reasoning, its internal decision-making process still poses challenges in terms of interpretability. Future work should explore explainable AI techniques tailored to transformer-based models.
Further directions include comparisons with other vision transformers, incorporating data augmentation strategies, and extending the framework to other domains such as different meat species or medical imaging. The balance of efficiency and expressiveness achieved by EfficientViT makes it a strong candidate for edge computing and mobile applications in practical scenarios.
6. Conclusions
In this study, we proposed an efficient and lightweight one-stage beef carcass grading model based on EfficientViT. The proposed model achieves high classification accuracy directly from raw RGB images, without the need for a separate segmentation step. Comparative experiments with two-stage models and CNN-based one-stage models demonstrated that the EfficientViT model outperformed others across multiple metrics, including classification accuracy, inference speed, and parameter size.
The EfficientViT architecture combines convolutional layers, which are effective at capturing local features, with a multi-scale linear attention mechanism capable of modeling global context. This hybrid design proved particularly effective for predicting the highest beef quality grade (1++), where spatially distributed patterns such as marbling are critical. Notably, EfficientViT maintained strong predictive performance even when the visible loin region was minimal, demonstrating robustness across variable imaging conditions.
These results suggest that EfficientViT is not only accurate but also highly practical for real-world applications, particularly in resource-constrained environments or industrial settings where real-time processing is required. Furthermore, visualization results based on Grad-CAM and attention maps confirmed that EfficientViT effectively utilizes global contextual cues, making it well-suited for complex visual grading tasks.
Future work may explore architectural variations of EfficientViT, investigate domain generalization capabilities, and evaluate optimization strategies for deployment on edge devices. Additionally, research into improved explainability—such as advanced visualization techniques and integration with explainable AI (XAI) frameworks—will be valuable for enhancing model transparency and trust in practical applications.