Systematic Review of Quantization-Optimized Lightweight Transformer Architectures for Real-Time Fruit Ripeness Detection on Edge Devices

Maulana, Donny; Ramasamy, R Kanesaraj

doi:10.3390/computers15010069

by Donny Maulana and R Kanesaraj Ramasamy *

Faculty of Computing and Informatics, Multimedia University, Cyberjaya 63000, Malaysia

* Author to whom correspondence should be addressed.

Computers 2026, 15(1), 69; https://doi.org/10.3390/computers15010069

Submission received: 22 December 2025 / Revised: 12 January 2026 / Accepted: 16 January 2026 / Published: 19 January 2026

Abstract

Real-time visual inference on resource-constrained hardware remains a core challenge for edge computing and embedded artificial intelligence systems. Recent deep learning architectures, particularly Vision Transformers (ViTs) and Detection Transformers (DETRs), achieve high detection accuracy but impose substantial computational and memory demands that limit their deployment on low-power edge platforms such as NVIDIA Jetson and Raspberry Pi devices. This paper presents a systematic review of model compression and optimization strategies (specifically quantization, pruning, and knowledge distillation) applied to lightweight object detection architectures for edge deployment. Following PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines, peer-reviewed studies were analyzed from Scopus, IEEE Xplore, and ScienceDirect to examine the evolution of efficient detectors from convolutional neural networks to transformer-based models. The synthesis highlights a growing focus on real-time transformer variants, including Real-Time DETR (RT-DETR) and low-bit quantized approaches such as Q-DETR, alongside optimized YOLO-based architectures. While quantization enables substantial theoretical acceleration (e.g., up to 16× operation reduction), aggressive low-bit precision introduces accuracy degradation, particularly in transformer attention mechanisms, highlighting a critical efficiency–accuracy tradeoff. The review further shows that Quantization-Aware Training (QAT) consistently outperforms Post-Training Quantization (PTQ) in preserving performance under low-precision constraints. Finally, this review identifies critical open research challenges, chiefly this efficiency–accuracy tradeoff and the high computational demands imposed by Transformer architectures. Future directions are proposed, including hardware-aware optimization, robustness to imbalanced datasets, and multimodal sensing integration, to ensure reliable real-time inference in practical agricultural edge computing environments.

Keywords:

agriculture; computer vision; deep learning; edge computing; model compression; object detection; quantization; real-time systems; systematic review; transformers; Vision Transformers (ViTs)

1. Introduction

Smart agriculture has emerged as a critical domain in global efforts to enhance crop yield optimization and product quality assurance [1]. Within this paradigm, automated fruit ripeness inspection represents a vital component for precision orchard management and post-harvest processing [2]. Traditional assessment methodologies predominantly rely on manual inspection, which introduces subjectivity, inconsistency, and labor-intensive inefficiencies, thereby creating substantial demand for non-invasive, economical, and automated alternatives [3]. Rapid and accurate real-time detection is particularly crucial for applications in harvester robotics and automated grading systems, where reported throughputs on edge systems include 74.05 frames per second (FPS) and 28 FPS [4].

The historical dominance of Convolutional Neural Network (CNN)-based architectures, especially the YOLO (You Only Look Once) family and SSD (Single Shot Detector), has established a strong foundation for agricultural object detection [5]. These models provide favorable inference speeds but frequently necessitate heuristic post-processing components such as Non-Maximum Suppression (NMS) [2]. The research landscape is currently undergoing a significant transition toward Transformer-based architectures, pioneered by the Detection Transformer (DETR) framework [5]. Transformer models offer a self-attention mechanism that enables superior long-range dependency modeling and an end-to-end inference pipeline that eliminates the need for NMS [2]. Recent advancements have yielded more efficient variants such as Real-Time DETR (RT-DETR) [1], which demonstrates the capability to surpass powerful YOLO detectors in terms of speed-accuracy tradeoffs. Recent studies in 2024 have further validated the shift toward hybrid architectures that combine the spatial efficiency of convolutions with the global modeling of Transformers for complex orchard environments [6]. Despite their promising accuracy [7], these architectures are inherently computationally intensive, demanding substantial resources that challenge their direct deployment on resource-constrained edge devices [8].

The stringent resource limitations of edge devices necessitate extreme model compression [9]. A critical challenge emerges when compression is pursued through low-bit quantization (e.g., 4-bit) [10]. State-of-the-art Transformer architectures impose high computational demands, severely restricting their edge deployment potential [10]. The accuracy degradation induced by quantization remains a primary concern [11], which is particularly exacerbated in Vision Transformers (ViTs) and DETR-based models. Current research in 2025 emphasizes that the integration of hardware-aware quantization schemes and distillation frameworks is essential to maintain high-precision grading on agricultural edge platforms [12]. This systematic review is, therefore, motivated by the fundamental research problem of how to efficiently execute neural network inference using primarily integer or binary arithmetic [13].

While this review focuses primarily on Transformer-based architectures, selected CNN-based models (such as YOLOv8 and MobileNet variants) are included as performance baselines. This inclusion is necessary because CNNs currently represent the state-of-the-art in agricultural edge deployment. By comparing quantization-optimized Transformers against these established CNN benchmarks, this review can more effectively highlight the specific advancements, architectural advantages, and unique quantization challenges—such as attention instability—that distinguish Transformers from traditional convolutional approaches.

2. Background

This review focuses on three interconnected technical domains critical for deploying real-time fruit ripeness detection on edge devices: object detection architecture evolution, model compression techniques, and edge computing constraints.

2.1. Architecture Evolution: CNN–Transformer

The transition from CNN-based detectors (YOLO, SSD) to Transformer architectures represents a fundamental shift in object detection paradigms. While CNNs provide efficient grid-based detection, they require handcrafted components like Non-Maximum Suppression (NMS) [2]. The Detection Transformer (DETR) introduced end-to-end set prediction, eliminating NMS but suffering from slow convergence and high computational complexity [5]. Real-Time DETR (RT-DETR) emerged as an optimized variant specifically designed for high-speed inference, incorporating hybrid encoders and efficient query selection mechanisms [11]. Specifically, the hybrid encoder design of RT-DETR significantly improves inference speed by decoupling intra-scale and cross-scale feature processing, which reduces redundant computations compared to standard multi-scale encoders. This architectural efficiency allows RT-DETR to maintain high precision while meeting the strict latency requirements of edge-deployed agricultural sensors. Therefore, CNN-based metrics are included in this review as a baseline to benchmark the efficiency gains of these Transformer-based advancements.

2.2. Model Compression Techniques

Edge deployment necessitates aggressive model compression through three primary approaches:

2.2.1. Quantization

Quantization reduces numerical precision from 32-bit floating-point (FP32) to lower-bit integer representations, such as INT8 or INT4, using scale (S) and zero-point (Z) parameters [14]. This process enables a significant reduction in model footprint and computational overhead, often achieving a speedup of 3.2× to 8× on edge hardware. Synthesis of the literature indicates that Quantization-Aware Training (QAT) consistently maintains superior accuracy over Post-Training Quantization (PTQ) in low-bit scenarios [15] by simulating quantization errors during the training phase.
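
The scale and zero-point mapping described above can be illustrated with a minimal NumPy sketch of asymmetric (affine) quantization. The helper names `quant_params`, `quantize`, and `dequantize` and the random test tensor are illustrative assumptions, not code from any reviewed study. Real toolchains typically add per-channel scales and calibration data.

```python
import numpy as np

def quant_params(x, num_bits=8):
    """Derive scale S and zero-point Z mapping the range of x onto an unsigned integer grid."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Extend the range to include 0.0 so that zero is exactly representable.
    lo, hi = min(float(x.min()), 0.0), max(float(x.max()), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for an all-zero tensor
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point, qmin, qmax

def quantize(x, scale, zero_point, qmin, qmax):
    """q = clip(round(x / S) + Z, qmin, qmax)."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.int32)

def dequantize(q, scale, zero_point):
    """x_hat = S * (q - Z)."""
    return scale * (q.astype(np.float32) - zero_point)

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.5, size=1000).astype(np.float32)  # stand-in for FP32 weights
for bits in (8, 4):
    s, z, qmin, qmax = quant_params(weights, bits)
    w_hat = dequantize(quantize(weights, s, z, qmin, qmax), s, z)
    print(bits, "bits: scale =", s, "max abs error =", float(np.abs(weights - w_hat).max()))
```

The rounding error per value is bounded by roughly half the scale, so moving from 8 to 4 bits (a 17× coarser grid over the same range) raises the worst-case error proportionally.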

2.2.2. Pruning

Pruning enhances efficiency by identifying and removing redundant parameters from the network. While unstructured pruning offers higher theoretical compression, structured pruning is preferred for edge deployment as it preserves hardware-friendly tensor shapes, leading to direct latency improvements on standard processors. Recent evidence shows that aggressive pruning can achieve up to 82% memory reduction, though it may incur higher accuracy degradation compared to quantization alone [16].
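
The difference between the two pruning styles can be sketched on a single weight matrix. The layer shape, the 50% ratio, and the L2-norm channel criterion below are illustrative assumptions. The reviewed studies may use other saliency measures.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))  # hypothetical layer: 8 output channels, 16 inputs each

# Unstructured pruning: zero the 50% smallest-magnitude weights. The tensor
# keeps its shape, so dense hardware still processes every element.
threshold = np.quantile(np.abs(W), 0.5)
W_unstructured = np.where(np.abs(W) >= threshold, W, 0.0)

# Structured pruning: remove the 4 output channels with the smallest L2 norm.
# The tensor physically shrinks, which is what yields direct latency gains.
norms = np.linalg.norm(W, axis=1)
keep = np.sort(np.argsort(norms)[4:])
W_structured = W[keep]

print("unstructured:", W_unstructured.shape, "sparsity =", float((W_unstructured == 0).mean()))
print("structured:  ", W_structured.shape)
```

Both variants remove half of the parameters here, but only the structured result is a smaller dense matrix that a standard processor can multiply faster without sparse kernels.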

2.2.3. Knowledge Distillation

Knowledge distillation facilitates the transfer of learned feature representations from a high-capacity “teacher” model to a compact “student” network. In the context of Transformer architectures, this technique is critical for mitigating the “accuracy cliff” observed at sub-8-bit precision. By imitating the teacher’s attention maps or feature responses, student models can retain up to 94% of their original performance even under extreme compression constraints [17].
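
As a rough illustration of teacher-to-student transfer, the sketch below implements the classic logit-based distillation loss (temperature-softened KL divergence). Note that this is a simpler stand-in for the attention-map or feature-response imitation mentioned above, which would instead penalize the distance between intermediate feature tensors. The logits, the temperature of 4, and the three ripeness classes are assumed values for demonstration only.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on temperature-softened outputs, scaled by T^2."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1)
    return float(temperature ** 2 * kl.mean())

# Hypothetical logits for three classes (unripe, ripe, overripe).
teacher = np.array([[4.0, 1.0, 0.5]])
student = np.array([[2.0, 1.5, 1.0]])
print("student vs teacher:", distillation_loss(student, teacher))
print("teacher vs teacher:", distillation_loss(teacher, teacher))
```

In training, this term is usually added to the ordinary task loss so that the compact student matches both the labels and the teacher's softened predictions.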

2.3. Edge Computing Constraints

Edge devices (Jetson Nano, Raspberry Pi, Coral TPU) impose strict limitations on power consumption (<10 W), memory (<8 GB), and computational throughput [18]. Real-time agricultural applications require inference latencies below 30 ms (FPS > 30) while maintaining accuracy metrics (mAP@0.5:0.95) and power efficiency [19]. The Matthews Correlation Coefficient (MCC) is particularly valuable for evaluating performance on imbalanced fruit ripeness datasets [20].
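
The relationship between per-frame latency and throughput can be checked with a few lines of arithmetic. The sketch assumes single-stream, batch-size-1 inference, so that FPS is simply the reciprocal of latency. The two latency values are those reported elsewhere in this review, and the helper names are illustrative.

```python
def fps_from_latency(latency_s):
    """Throughput in frames per second for single-stream inference."""
    return 1.0 / latency_s

def meets_realtime(latency_s, budget_s=0.030):
    """True if per-frame latency is under the 30 ms budget."""
    return latency_s < budget_s

reported_latency_s = {
    "YOLOv5-CS on Jetson Xavier NX [4]": 0.037,
    "EfficientNet-B0 with float16 [14]": 0.096,
}
for name, latency in reported_latency_s.items():
    print(f"{name}: {fps_from_latency(latency):.1f} FPS, real-time: {meets_realtime(latency)}")

# The two thresholds are close but not identical: 30 FPS allows about 33.3 ms per frame.
print(f"latency budget at 30 FPS: {1000.0 / 30.0:.1f} ms")
```

Under this assumption a 30 ms budget corresponds to about 33 FPS, slightly stricter than the FPS > 30 criterion.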

3. Methodology

This systematic review was conducted in strict accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 guidelines [8] to ensure a comprehensive, transparent, and reproducible research synthesis process. The methodology incorporates a structured multi-phase workflow—including identification, screening, eligibility, and inclusion—to systematically analyze the evolution of quantization-optimized Transformers for agricultural applications. To enhance methodological transparency, the full PRISMA 2020 checklist has been adopted, and a consensus-based protocol was implemented to resolve any disagreements during the study selection phase.

3.1. Strategy and Data Sources

A systematic literature search was performed across six major academic databases: Scopus, Web of Science, IEEE Xplore, ScienceDirect, MDPI, and SpringerLink. Table 1 outlines the specific search strategy employed, which used a combination of keywords and Boolean operators focusing on three core themes: model architectures (“RT-DETR”, “Q-DET”, “lightweight Transformer”), optimization techniques (“quantization aware training”, “pruning”, “knowledge distillation”), and application context (“fruit detection”, “mango ripeness”, “edge AI”, “edge device”). The search was limited to publications between January 2018 and June 2025 to capture the most recent advancements in deep learning-based object detection. The initial search yielded 160 records, which were subsequently screened according to the PRISMA 2020 guidelines, as illustrated in Figure 1.

3.2. Study Selection and Quality Assessment

The study selection process strictly followed the PRISMA framework, encompassing four main phases: identification, screening, eligibility, and inclusion [8]. From the initial 160 records identified, 32 duplicates were systematically removed, yielding 128 records for initial screening. Title and abstract screening led to the exclusion of 89 records, primarily because they were irrelevant to the scope or classified as review articles. A full-text assessment was then conducted on the remaining 39 records. To ensure objectivity, two authors independently performed the screening and eligibility assessments. Any disagreements regarding study inclusion were resolved through consensus-based discussion or, if necessary, by consulting a third author. This assessment resulted in the exclusion of 17 studies that failed to meet all predefined inclusion criteria, leaving 22 studies for final qualitative and quantitative analysis. Table 2 details the specific inclusion and exclusion criteria applied during the eligibility phase.
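
The record counts in the selection flow can be verified with simple arithmetic. The variable names are illustrative, and the numbers are those stated in the paragraph above.

```python
identified = 160
duplicates_removed = 32
excluded_title_abstract = 89
excluded_full_text = 17

screened = identified - duplicates_removed            # records entering screening
full_text = screened - excluded_title_abstract        # records assessed in full text
included = full_text - excluded_full_text             # studies in the final synthesis

print(screened, full_text, included)  # 128 39 22
```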

The methodological quality of the included studies was rigorously assessed using a modified version of the AMSTAR2 tool [15]. This assessment focused on several critical dimensions pertinent to engineering and AI research, including experimental rigor (met by 81.8% of studies), completeness of performance reporting (72.7%), clarity of hardware specifications (63.6%), and statistical validity (54.5%). Furthermore, a strong majority of the included studies (90.9%) provided meaningful comparative analysis against established baseline methods. Table 3 presents the comprehensive results of this quality assessment based on the modified AMSTAR2 criteria, and Table 4 summarizes the study selection process.
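
Because 22 studies were included, each reported percentage should correspond to a whole number of studies. The sketch below back-calculates those counts as a consistency check. The counts are inferred here from the percentages and are not figures reported by the authors.

```python
n_studies = 22
reported_pct = {
    "experimental rigor": 81.8,
    "performance reporting": 72.7,
    "hardware specifications": 63.6,
    "statistical validity": 54.5,
    "baseline comparison": 90.9,
}

inferred_counts = {}
for criterion, pct in reported_pct.items():
    count = round(pct / 100 * n_studies)  # nearest whole number of studies
    # The reported value should match count / 22 to one decimal place.
    assert abs(100 * count / n_studies - pct) < 0.06, criterion
    inferred_counts[criterion] = count

print(inferred_counts)
```

Each percentage is consistent with an integer count out of 22 (for example, 81.8% matches 18 of 22 studies).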

3.3. Data Extraction and Synthesis

A standardized data extraction protocol was meticulously implemented to systematically collect five categories of information: (1) model architecture specifications; (2) quantization methods and corresponding bit-width; (3) key performance metrics (accuracy, latency, and Frames Per Second (FPS)); (4) hardware platform details; and (5) dataset characteristics. Table 5 details the specific variables and structure utilized in this data extraction protocol.
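
One way to picture the five extraction categories is as a typed record per study. The class name, field names, and the `completeness` helper below are hypothetical and only illustrate the structure. The three example records reuse values quoted in this review and leave unreported fields empty.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExtractionRecord:
    architecture: str                      # (1) model architecture
    quant_method: Optional[str] = None     # (2) quantization method
    bit_width: Optional[int] = None        # (2) bit-width
    accuracy_pct: Optional[float] = None   # (3) accuracy, mAP, or AP in percent
    latency_ms: Optional[float] = None     # (3) latency
    fps: Optional[float] = None            # (3) frames per second
    hardware: Optional[str] = None         # (4) hardware platform
    dataset: Optional[str] = None          # (5) dataset characteristics

def completeness(records, field_name):
    """Fraction of records in which the given field was reported."""
    filled = sum(getattr(r, field_name) is not None for r in records)
    return filled / len(records)

records = [
    ExtractionRecord("YOLOv5-CS", accuracy_pct=98.23, latency_ms=37.0, fps=28.0,
                     hardware="NVIDIA Jetson Xavier NX", dataset="green citrus"),
    ExtractionRecord("MobileNetV2", accuracy_pct=97.0, fps=15.0,
                     hardware="Raspberry Pi 4", dataset="banana ripeness"),
    ExtractionRecord("Q-DETR", quant_method="DRD", bit_width=4, accuracy_pct=39.4),
]
print("hardware completeness:", completeness(records, "hardware"))
```

A per-field completeness ratio of this kind is what the reporting-completeness percentages in the following paragraph summarize across all included studies.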

The extraction process revealed varying levels of reporting completeness across categories, showing high completeness for architecture details (100%) and quantization specifications (90.9%), although hardware platform details were less consistently reported (63.6%). Table 6 provides a comprehensive summary of the extracted data distribution and reporting completeness from the 22 included studies.

Subsequently, the extracted data was synthesized through comparative analysis and qualitative assessment to identify prominent trends, performance patterns, and critical research gaps concerning quantization-optimized Transformers. Key findings include the dominance of 4-bit quantization in achieving the optimal accuracy–efficiency balance, the prevalence of the NVIDIA Jetson (NVIDIA Corporation, Santa Clara, CA, USA) series as the primary deployment platform, and the observation that citrus detection is the application context most significantly benefiting from these techniques. Table 7 summarizes the primary outcomes and insights derived from this synthesis.

4. Results and Synthesis

4.1. Lightweight CNN Performance for Edge-Based Fruit Detection

Empirical evidence reveals that lightweight CNN architectures remain highly efficient for fruit detection on resource-constrained devices. MobileNetV2, for instance, attained 97.0% accuracy in banana ripeness classification while maintaining 15 FPS on a Raspberry Pi 4 (5 W consumption) [3]. Similarly, EfficientNet architectures demonstrate effective accuracy–efficiency scaling; EfficientNet-B0 achieves 89.3% accuracy in oil palm classification with float16 quantization, reducing model size by 50% while maintaining a 96 ms inference time [14].

For real-time scenarios, YOLO-Tiny variants provide the highest throughput. The YOLOv5-CS model [4] achieves 98.23% mAP for green citrus detection with a latency of 0.037 s on an NVIDIA Jetson Xavier NX, enabling 28 FPS for orchard management. The performance gain in these models often stems from integrated attention modules that enhance small fruit localization. Table 8 compares these architectures for edge-based detection, while Table 9 analyzes the impact of quantization levels on their performance metrics.

4.2. Performance Analysis of Transformer-Based Architectures

Vision Transformers and DETR variants demonstrate exceptional performance in complex agricultural environments, offering robustness surpassing traditional CNNs. The ORD-YOLO architecture [1] achieves 96.92% mAP for citrus fruit maturity detection in complex scenes, representing a 3% performance improvement over the original YOLOv8 model. The integration of the Omni-Dimensional Dynamic Convolution (ODConv) and the Re-parameterizable Generalized Feature Pyramid Network (RepGFPN) significantly enhances feature extraction capabilities while maintaining real-time performance.

The RT-DETR architecture [21] emerges as a breakthrough in real-time object detection, achieving 53.1% AP on the COCO dataset with 108 FPS on a T4 GPU. When adapted for agricultural applications, RT-DETR demonstrates superior performance in detecting occluded fruits through its hybrid encoder design, which efficiently processes multi-scale features by decoupling intra-scale interaction and cross-scale fusion.

Furthermore, Deformable DETR [24] addresses the slow convergence and limited feature resolution issues inherent in original DETR architectures. By attending to a small set of key sampling points around reference points, Deformable DETR achieves better performance than DETR with a 10× reduction in training epochs, which particularly benefits the detection of small fruits in dense orchard environments. Table 10 provides a comprehensive performance summary of these advanced Transformer and DETR-based architectures when applied to fruit ripeness and detection tasks.

4.3. Quantization Techniques Performance Synthesis

Post-training quantization (PTQ) consistently demonstrates substantial efficiency gains across various architectures, typically yielding a 2–4× acceleration with minimal accuracy drops, often within 1–3% [22]. Quantization strategies must account for color-sensitive ripeness indicators, which are critical for accurate fruit grading. For instance, detecting subtle color gradients in mangoes, where the transition from green to yellow or red signifies different ripeness stages, or identifying the specific orange-hue intensity in citrus, requires high precision in feature representation. Loss of bit-depth during aggressive quantization could blur these subtle transitions, leading to misclassification in the field.

Quantization-aware training (QAT) significantly mitigates this accuracy loss, with relevant studies reporting a 0.5–1.5% improvement in accuracy compared to post-training approaches for equivalent bit widths [26]. The Learned Step Size Quantization (LSQ) method, in particular, has proven effective for fruit detection models by dynamically adapting to the unique activation distributions encountered in agricultural imagery.
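
The core idea of LSQ, treating the quantizer step size as a trainable parameter, can be sketched in NumPy. The forward pass and the step-size gradient below follow the published LSQ formulation with a straight-through estimator for rounding. The gradient scaling factor used in practice is omitted, and the function names and sample values are illustrative assumptions.

```python
import numpy as np

def lsq_quantize(v, step, num_bits=4):
    """Forward pass: divide by the step size, round, clip to the signed range, rescale."""
    qn, qp = 2 ** (num_bits - 1), 2 ** (num_bits - 1) - 1
    v_bar = np.clip(np.round(v / step), -qn, qp)
    return v_bar * step

def lsq_step_gradient(v, step, num_bits=4):
    """Gradient of the quantized value with respect to the step size.

    Inside the clipping range: round(v/s) - v/s (rounding treated as identity).
    Outside the range: the clipping bound itself (-Qn or Qp).
    """
    qn, qp = 2 ** (num_bits - 1), 2 ** (num_bits - 1) - 1
    r = v / step
    inside = (r > -qn) & (r < qp)
    return np.where(inside, np.round(r) - r, np.clip(r, -qn, qp))

activations = np.array([-1.0, 0.26, 0.6, 3.0])  # hypothetical activation values
step = 0.25
print("quantized:", lsq_quantize(activations, step))
print("d/d(step):", lsq_step_gradient(activations, step))
```

Values that saturate the 4-bit range (3.0 here) produce a large positive gradient, which pushes the step size up during training. This is the mechanism by which the quantizer adapts to the activation distribution.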

A notable advancement is the Q-DETR framework [25], which addresses quantization-induced performance degradation through Distribution Rectification Distillation (DRD). This sophisticated approach achieves 39.4% AP with 4-bit quantization, representing only a 2.6% performance degradation compared to the full-precision model while enabling a 6.6× theoretical acceleration.

Furthermore, mixed precision quantization strategies emerge as optimal solutions for Transformer architectures. This is primarily because attention mechanisms often require higher precision (≥6 bits) to maintain performance, while other components can operate effectively at lower precision [16]. This approach optimally balances computational efficiency with detection accuracy, which is crucial for complex ripeness classification tasks. Table 11 provides a comprehensive summary of the efficacy and tradeoffs associated with various quantization methods for edge deployment in agricultural AI.
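
A mixed-precision plan can be expressed as a per-layer bit-width assignment. The sketch below keeps attention layers at 8 bits (satisfying the ≥6-bit requirement noted above) and quantizes everything else to 4 bits, then compares weight storage against uniform settings. The layer list, parameter counts, and helper names are hypothetical.

```python
# (layer name, layer kind, parameter count): hypothetical detector
layers = [
    ("backbone.conv", "conv", 2_000_000),
    ("encoder.attn", "attention", 1_000_000),
    ("encoder.ffn", "linear", 2_000_000),
    ("decoder.attn", "attention", 1_000_000),
    ("head", "linear", 500_000),
]

def mixed_bits(kind, attention_bits=8, default_bits=4):
    """Keep precision-sensitive attention layers at a higher bit-width."""
    return attention_bits if kind == "attention" else default_bits

def size_mb(layers, bits_for):
    """Weight storage in megabytes for a given bit-width policy."""
    total_bits = sum(count * bits_for(kind) for _, kind, count in layers)
    return total_bits / 8 / 1e6

fp32_mb = size_mb(layers, lambda kind: 32)
uniform4_mb = size_mb(layers, lambda kind: 4)
mixed_mb = size_mb(layers, mixed_bits)
print(fp32_mb, uniform4_mb, mixed_mb)  # 26.0 3.25 4.25
```

In this toy configuration the mixed plan costs 1 MB more than uniform 4-bit while still being about 6× smaller than FP32. That trade is the one mixed-precision schemes aim for.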

4.4. Edge Deployment Performance and Hardware Considerations

The synthesis reveals significant performance variations across edge hardware platforms. The NVIDIA Jetson series demonstrates superior performance for complex models, with the Jetson Xavier NX achieving 28 FPS for YOLOv5-CS citrus detection [4], while Raspberry Pi platforms provide cost-effective solutions for less computationally intensive applications.

Power consumption emerges as a critical constraint in field deployments, with studies reporting operational ranges of 5–15 W for continuous monitoring applications [4,14]. Energy-efficient architectures like MobileNetV3 with INT4 quantization draw only 2.5 W on Edge TPU platforms, enabling extended battery-operated deployments in remote agricultural settings.

Real-time performance requirements (latency < 30 ms, FPS > 30) are approached but not uniformly met by optimized models. YOLO-Granada [2], for example, processes 8.66 images per second, which is below the 30 FPS target, while compressing parameters to 54.7% of the original network. Taken together with the 28 FPS reported for YOLOv5-CS [4], these results indicate that near-real-time fruit ripeness detection is feasible on commercially available edge devices.

4.5. Quantization and Compression Strategies Analysis

Mixed precision quantization demonstrates superior performance preservation compared to standard uniform approaches. For instance, the Q-ViT framework [13] achieves 93.16% accuracy with 2-bit quantization on muskmelon detection by utilizing differentiable quantization to maintain critical attention mechanisms at higher precision while aggressively compressing linear operations. This technique reduces the model size to 4.75 MB while achieving a 1.5 ms inference time, representing an optimal accuracy–efficiency tradeoff for edge deployment.

Furthermore, triple compression techniques, which integrate knowledge distillation, pruning, and quantization, yield remarkable and synergistic efficiency gains. Studies implementing this unified approach report a substantial 68–74% parameter reduction with only 2.1–3.8% accuracy degradation [7,28]. The synergistic combination allows for aggressive pruning of redundant parameters, while distillation preserves critical feature representations, collectively enabling an average of 4.2× speedup on common edge hardware platforms. Table 12 provides a detailed summary of the performance and efficacy of these advanced compression strategies, including mixed precision and triple compression, in edge computing environments.

4.6. Hybrid and Knowledge Distillation Architectures

Teacher–student frameworks significantly enhance lightweight model performance through knowledge transfer. For instance, Distilling DETR [7] demonstrates that fine-grained feature imitation reduces the performance drop from 12.3% to 3.8% when compressing large Transformer models into edge-compatible sizes. The methodology focuses on imitating feature responses near object regions, yielding a 15% mAP improvement for student models on specific fruit detection tasks.

Generative augmentation effectively addresses critical dataset limitations in agricultural applications. GAN-based approaches [9,13] have successfully generated synthetic hyperspectral reflectance data, leading to an improvement in grape maturity classification accuracy from 84% to 91% when augmenting the training datasets. This approach proves particularly valuable for rare ripeness stages, substantially reducing the reliance on extensive manual data collection.

Furthermore, the integration of lightweight backbones into Transformer architectures enables efficient edge deployment without significant performance loss. MobileNetV2-DETR hybrids, for example, achieve 89.7% mAP while reducing computational requirements by 43% compared to standard ResNet backbones [30]. The combination of efficient feature extraction from lightweight CNNs with the relational reasoning capabilities of Transformers creates optimal architectures for edge-based fruit detection. Table 13 provides a comprehensive summary of the performance benefits derived from knowledge distillation, generative augmentation, and hybrid architecture integration.

4.7. Evaluation Metrics and Benchmark Analysis

Metric selection critically impacts performance assessment, particularly in imbalanced agricultural datasets. The Matthews Correlation Coefficient (MCC) [31] provides a superior evaluation measure for the classification of fruit ripeness, as it inherently accounts for all four confusion matrix categories and effectively mitigates optimistic bias prevalent in scenarios with significant class imbalance. For instance, in datasets where overripe fruits are rare—representing a minority class—the F1-score may provide an inflated sense of performance by focusing primarily on positive predictions and ignoring the true negative rate. In such concrete scenarios, MCC avoids overestimating the model’s capabilities, unlike the F1-score, as it requires high performance across all categories to yield a high score. Studies consistently demonstrate that MCC values (e.g., 0.872 for tomato ripeness classification) offer a more realistic performance assessment compared to metrics like the F1-score (e.g., 0.912) in these challenging domains.

Furthermore, public datasets exhibit significant variability in complexity and direct applicability. While generic collections like the MangoYOLO dataset [32] provide comprehensive orchard imagery with 1515 images across multiple orchards, specialized resources such as the Citrus Ripeness Collection [1] offer domain-specific challenges that include severe occlusion and highly varying illumination conditions. Table 14 proposes a unified benchmark, incorporating rigorous metric selection and diverse dataset characteristics, necessary for the equitable evaluation of edge-deployable fruit detection models.

4.8. Critical Performance Trade-Offs and Optimization Insights

The synthesis reveals consistent accuracy–latency tradeoffs across optimization strategies. Mixed precision quantization provides the optimal balance with 1.84% accuracy degradation for 3.8× speedup, while aggressive pruning achieves higher compression (82% memory reduction) at the cost of greater accuracy loss (3.45%). Analysis of these trends suggests a “quantization ceiling” for Transformer-based fruit detection; while CNNs exhibit linear degradation, Transformers show a non-linear “accuracy cliff” when moving below 8-bit precision due to the high sensitivity of the Softmax-based attention maps to bit-truncation.

Hardware-specific optimization emerges as crucial for deployment success. NVIDIA Jetson platforms demonstrate superior performance for transformer architectures on modules equipped with dedicated Tensor Cores (such as the Jetson Xavier NX), which are optimized for the matrix multiplications inherent in Multi-Head Attention (MHA). In contrast, Raspberry Pi systems benefit more from lightweight CNN backbones, as they rely on general-purpose CPU instructions that struggle with the quadratic complexity of global self-attention. Energy consumption varies significantly, with optimized models achieving 2.5–10 W operational ranges depending on hardware selection and inference frequency requirements.
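
The quadratic cost of global self-attention can be made concrete with a back-of-the-envelope count of multiply-accumulate operations (MACs) for one attention layer. The embedding width of 256, the 16-pixel patch size, and the function names are assumptions chosen for illustration. Softmax and normalization costs are ignored.

```python
def attention_macs(num_tokens, dim):
    """Approximate MACs for one self-attention layer (all heads combined)."""
    projections = 4 * num_tokens * dim * dim            # Q, K, V, and output projections
    score_and_mix = 2 * num_tokens * num_tokens * dim   # QK^T plus attention-weighted sum of V
    return projections, score_and_mix

def tokens_for_image(side_px, patch_px=16):
    """Number of tokens when a square image is split into square patches."""
    return (side_px // patch_px) ** 2

for side in (224, 448, 640):
    n = tokens_for_image(side)
    proj, attn = attention_macs(n, dim=256)
    print(f"{side}px -> {n} tokens, projections {proj / 1e6:.0f}M MACs, attention {attn / 1e6:.0f}M MACs")
```

Doubling the image side quadruples the token count, so the projection term grows 4× while the token-to-token term grows 16×. This explains why high-resolution inputs are disproportionately expensive on CPU-only devices.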

The characteristics of the dataset substantially influence model selection and performance. A clear trend is observed where model robustness is positively correlated with dataset environmental diversity rather than just image volume. Models trained on diverse, multi-orchard datasets like MangoYOLO [32] demonstrate better generalization, while specialized datasets enable higher accuracy on specific fruit types but may suffer from overfitting without adequate augmentation strategies.

4.9. Cross-Architecture and Bit-Width Trend Analysis

A cross-study synthesis of the reviewed literature reveals distinct performance patterns between Transformer-based and CNN-based architectures when subjected to aggressive quantization.

4.9.1. The Accuracy Cliff at Sub-8-Bit Precision

Our analysis identifies a non-linear ‘accuracy cliff’ specifically in Transformer models. While transitioning from FP32 to INT8 typically results in negligible accuracy loss (<1.5%), moving to 4-bit precision causes a performance drop that is 2–3× more severe in Transformers compared to lightweight CNNs like MobileNet. This is attributed to the high sensitivity of Multi-Head Attention (MHA) weight distributions, which are less uniform than standard convolutional weights.

4.9.2. Hardware-Specific Scaling

A clear trend emerges regarding hardware efficiency: Transformers gain more significant speedups (up to 8×) on NVIDIA Jetson platforms due to optimized Tensor Core utilization for matrix multiplication. Conversely, CNN-based models scale more linearly on CPU-based devices like Raspberry Pi.

4.9.3. Impact of Distillation on Low-Bit Regimes

Trends show that Knowledge Distillation (KD) is no longer optional but mandatory for sub-8-bit quantization. Models using KD maintain 94% of their original mAP at 4-bit, whereas non-distilled models drop to nearly 70%.

4.10. Mitigation Strategies for Multi-Head Attention (MHA) Quantization

Multi-Head Attention (MHA) represents a primary quantization bottleneck due to the high dynamic range of its activation distributions. To mitigate performance loss, several targeted strategies have been developed. Mixed-precision quantization provides a balanced solution by maintaining sensitive attention layers in INT8 or FP16 while quantizing less critical components to lower bit-widths. Alternatively, attention-aware quantization utilizes saliency-based re-scaling to preserve the precision of critical key-query interactions. Comparative analysis indicates that while mixed-precision is more compatible with standard edge hardware like NVIDIA Jetson, attention-aware schemes offer superior accuracy retention in sub-8-bit regimes by directly addressing the non-uniformity of attention scores. Table 15 provides a structured comparison of these mitigation strategies, highlighting their primary mechanisms and implementation tradeoffs.

5. Discussion and Research Gaps

5.1. Comparative Architecture Analysis

The systematic analysis reveals distinct performance patterns across the architectural paradigms investigated. Transformer-based architectures demonstrate superior robustness and performance in complex agricultural environments characterized by severe occlusion and varying illumination, exemplified by ORD-YOLO achieving 96.92% mAP for citrus detection in challenging conditions [1]. Conversely, lightweight CNNs maintain significant advantages in computational efficiency, with MobileNet variants achieving 15 FPS on Raspberry Pi 4 while consuming only 5 W of power [3]. Consequently, hybrid approaches emerge as optimal solutions, successfully balancing CNN efficiency with the superior relational reasoning of Transformers, as evidenced by the MobileNetV2-DETR model, which reaches 89.7% mAP with a 43% reduction in computational requirements [30].

Furthermore, the performance gap between quantized and full precision models is narrowing significantly due to advanced compression techniques. Q-DETR, for instance, demonstrates only a 2.6% performance degradation at 4-bit quantization while enabling a 6.6× theoretical acceleration [25]. This level of efficiency represents a substantial improvement over early quantization approaches that often suffered 8–12% accuracy drops at similar bit widths, indicating the rapid maturation of quantization technologies specifically for resource-constrained agricultural applications. Table 16 presents a detailed, comparative summary of these architectural and compression trade-offs for robust edge deployment.

5.2. Critical Research Gaps

5.2.1. Domain-Specific Dataset Limitations

Current datasets exhibit significant bias toward temperate climate fruits, with tomatoes, citrus, and apples comprising 67% of research focus. Tropical fruits like durian, rambutan, and mangosteen remain severely underrepresented, creating a substantial generalization gap [32,33]. This limitation impedes the development of robust models for global agricultural applications and highlights the need for diversified, geographically balanced datasets.

5.2.2. Multi-Head Attention Quantization Instability

The quantization of Multi-Head Attention (MHA) modules presents persistent challenges, with studies reporting 3–5× higher sensitivity compared to convolutional layers [13,25]. Attention mechanisms require higher precision (≥6 bits) to maintain performance, which limits the effectiveness of uniform quantization strategies. This instability manifests as significant performance degradation in complex detection scenarios where attention plays a crucial role in handling occlusion and scale variation [20].
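One simple way to respect such a precision floor is a per-layer bit-width policy. The following sketch assumes a hypothetical layer inventory (the layer names and parameter counts are invented for illustration), keeps attention layers at 8 bits, quantizes all other layers to 4 bits, and reports the resulting average bit-width and the ideal weight compression relative to FP32.

```python
# Hypothetical layer inventory: (name, kind, parameter count). Illustrative only.
layers = [
    ("backbone.conv1", "conv", 1_000_000),
    ("encoder.attn", "attention", 500_000),
    ("encoder.ffn", "linear", 1_500_000),
    ("decoder.attn", "attention", 500_000),
    ("head", "linear", 500_000),
]

def assign_bits(kind: str, low_bits: int = 4, attn_bits: int = 8) -> int:
    """Keep attention layers at higher precision; quantize the rest aggressively."""
    return attn_bits if kind == "attention" else low_bits

def total_weight_bits(layers, **policy) -> int:
    """Total storage for all weights under the given bit-width policy."""
    return sum(n * assign_bits(kind, **policy) for _, kind, n in layers)

total_params = sum(n for _, _, n in layers)
avg_bits = total_weight_bits(layers) / total_params   # weighted average bit-width
compression_vs_fp32 = 32 / avg_bits                   # ideal, ignores scale metadata

print(f"average bits per weight: {avg_bits:.2f}")
print(f"ideal compression vs FP32: {compression_vs_fp32:.1f}x")
```

Raising `attn_bits` or lowering `low_bits` moves the model along the accuracy and size trade-off without changing the rest of the pipeline; the accuracy side of that trade-off would still have to be measured on a real model.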

5.2.3. Performance Disparity in Real-World Field Validation

A primary limitation identified in this review is the disparity between laboratory performance and real-world operational reliability. Currently, only 23% of the included studies conducted evaluations on physical edge hardware in field settings [4,34]. While laboratory benchmarks provide idealized peak performance, they often overlook the impact of environmental variability on inference metrics. Field conditions—characterized by fluctuating lighting and background clutter—increase the computational demand for pre-processing and robust feature extraction. This shift typically results in a 15–20% reduction in effective FPS and less stable power consumption patterns compared to controlled laboratory environments. This validation gap suggests that lab-based proofs-of-concept may not fully reflect the operational robustness required for industrial deployment. Table 17 provides a quantified assessment of these research gaps and their potential impact on model generalizability in authentic agricultural settings.

5.2.4. Challenges in Heterogeneous Hardware Benchmarking

A significant challenge in comparing performance across the literature is the heterogeneity of hardware platforms. Differences in memory bandwidth, specialized accelerators (e.g., NVIDIA Tensor Cores vs. ARM CPUs), and software frameworks (TensorRT vs. TFLite) prevent a direct comparison of inference latency and power efficiency. Consequently, a speedup achieved on one device may not be replicable on another. To address this, future research should adopt normalized metrics, such as operations per watt, or utilize standardized benchmarking suites to ensure equitable performance assessments across diverse edge-computing architectures in agriculture.
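As a rough illustration of normalized reporting, the sketch below derives throughput per watt and energy per frame from the inference times and power figures listed in Table 8. Operation counts are not available in that table, so FPS per watt is used here as a stand-in for operations per watt. The calculation assumes that the reported power is the average draw during inference and that one frame is processed per inference; the function name `normalized_metrics` is illustrative.

```python
# (model / platform, inference time in ms, power in W), values copied from Table 8
rows = [
    ("MobileNetV2 / Raspberry Pi 4", 66.7, 5.0),
    ("YOLOv5 CS / Jetson Xavier NX", 37.0, 10.0),
    ("SSDLite / ARM CPU", 42.5, 4.2),
    ("YOLOv4 tiny / Jetson Nano", 25.6, 5.0),
]

def normalized_metrics(latency_ms: float, power_w: float):
    """Return (FPS, FPS per watt, energy per frame in millijoules)."""
    fps = 1000.0 / latency_ms
    energy_mj = power_w * latency_ms      # W * ms = mJ
    return fps, fps / power_w, energy_mj

for name, latency_ms, power_w in rows:
    fps, fps_per_w, energy_mj = normalized_metrics(latency_ms, power_w)
    print(f"{name:30s} {fps:6.1f} FPS  {fps_per_w:5.2f} FPS/W  {energy_mj:6.1f} mJ/frame")
```

Such derived figures remain only as comparable as the underlying measurements, since the source studies differ in datasets, input resolution, and measurement method.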

5.2.5. Emerging Trends

Adaptive quantization approaches represent a promising direction, dynamically adjusting bit width based on input complexity and attention mechanism requirements [36]. Early implementations demonstrate 1.2–1.8× better accuracy–efficiency tradeoffs compared to static quantization schemes, particularly for variable agricultural environments.

Hardware-aware Neural Architecture Search (NAS) emerges as a critical enabler for optimized edge deployment [35]. By incorporating hardware constraints directly into the architecture search process, these approaches can automatically discover optimal model configurations for specific edge platforms, potentially bridging the gap between algorithmic innovation and practical deployment.

Specialized edge transformer chips show substantial potential for addressing computational bottlenecks. Recent developments in attention-optimized hardware accelerators demonstrate 3.2–4.1× efficiency improvements for transformer inference compared to general-purpose edge processors [35,37]. These specialized architectures exploit the unique computational patterns of attention mechanisms to deliver superior performance per watt.

Federated learning frameworks offer solutions for dataset limitations while addressing privacy concerns in agricultural data collection [19]. By enabling model training across distributed edge devices without centralizing sensitive data, these approaches can facilitate the development of more robust and generalized models while respecting data sovereignty.
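The aggregation step behind this idea can be sketched with federated averaging (FedAvg), a widely used baseline that the reviewed studies are not confirmed to employ. The example assumes that each client returns its locally trained parameters as a list of NumPy arrays and that clients are weighted by local dataset size; the three "farms" and their values are hypothetical.

```python
import numpy as np

def fed_avg(client_weights, client_sizes):
    """Average per-client parameter lists, weighted by local dataset size."""
    total = float(sum(client_sizes))
    n_layers = len(client_weights[0])
    return [
        sum(w[i] * (n / total) for w, n in zip(client_weights, client_sizes))
        for i in range(n_layers)
    ]

# Three hypothetical farms; each holds two parameter tensors after local training.
clients = [
    [np.array([1.0, 2.0]), np.array([[0.0]])],
    [np.array([3.0, 4.0]), np.array([[1.0]])],
    [np.array([5.0, 6.0]), np.array([[2.0]])],
]
sizes = [100, 300, 600]   # number of local training images per farm

global_weights = fed_avg(clients, sizes)
print(global_weights)
```

Only parameters leave each device in this scheme; the raw images stay local, which is the property the data-sovereignty argument relies on.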

The integration of these emerging technologies presents a pathway toward truly efficient, robust, and scalable fruit ripeness detection systems capable of operating reliably in diverse agricultural environments while meeting the stringent constraints of edge deployment.

6. Future Directions

6.1. Unified Lightweight DETR Frameworks for Agriculture

The development of domain-optimized DETR variants represents a critical research direction for advancing edge AI in agriculture. Future work should prioritize the creation of unified frameworks that seamlessly integrate quantization-aware training, hardware-specific optimizations, and agricultural domain adaptations. These frameworks must leverage advanced techniques such as progressive quantization [36], which can dynamically adapt bit-precision to varying fruit textures or fluctuating lighting conditions, and dynamic token sparsification [38] to achieve true real-time performance while robustly maintaining detection accuracy across diverse fruit types and varying growth stages. Furthermore, the integration of Neural Architecture Search (NAS) under agricultural constraints [35] will enable the automatic discovery of optimal architectures specifically tailored to resource-constrained edge deployment scenarios. Table 18 outlines a comprehensive roadmap for the sequential and unified development of lightweight DETR architectures tailored for smart agricultural applications.
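The core operation in token sparsification can be sketched as a top-k selection over per-token importance scores. In DynamicViT-style methods the scores come from a learned prediction module; the sketch below simply assumes they are given, and `prune_tokens`, the scores, and the keep ratio are illustrative.

```python
import numpy as np

def prune_tokens(tokens: np.ndarray, scores: np.ndarray, keep_ratio: float):
    """Keep the highest-scoring tokens, preserving their original order."""
    n_keep = max(1, int(round(tokens.shape[0] * keep_ratio)))
    keep_idx = np.sort(np.argsort(scores)[-n_keep:])
    return tokens[keep_idx], keep_idx

rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 4))     # 8 tokens, embedding dimension 4 (synthetic)
scores = np.array([0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4])  # assumed importance

kept, idx = prune_tokens(tokens, scores, keep_ratio=0.5)
print(idx, kept.shape)
```

Because self-attention cost grows quadratically with the number of tokens, halving the token count in later layers reduces that part of the computation by roughly a factor of four.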

6.2. Explainable AI and Visual Interpretability

Future research must prioritize the development of interpretable Transformer architectures that provide transparent decision-making processes for automated fruit quality assessment. The integration of attention visualization mechanisms [39] and feature importance mapping will be crucial, enabling farmers and agricultural experts to understand and trust AI-driven ripeness predictions. Developing robust quantization-aware explainability methods that remain effective even under aggressive model compression is essential for achieving practical and trustworthy deployment in precision agriculture applications.

6.3. Edge Cloud Collaborative Architectures

Hierarchical computing frameworks that strategically leverage both edge and cloud resources present a highly promising direction for balancing real-time performance with advanced agricultural analytics. Future systems should implement adaptive workload distribution strategies [40], where edge devices handle time-critical detection tasks (e.g., real-time harvesting), while cloud resources perform more complex analyses, model retraining, and global updates. Research in this area must focus on developing communication-efficient protocols and federated learning approaches [19] that minimize bandwidth requirements while ensuring model freshness and consistency across distributed agricultural networks. Table 19 details the essential specifications and design considerations required for the implementation of an effective Edge Cloud Collaboration Framework in precision agriculture.

6.4. Standardization of Datasets and Benchmarks

The establishment of comprehensive benchmarking standards is crucial for advancing the field. Future efforts should focus on developing unified evaluation protocols that encompass diverse fruit types, growth conditions, and deployment scenarios. These standards should include standardized dataset splits, consistent evaluation metrics (including MCC for imbalanced data [31]), and hardware performance baselines across multiple edge platforms.

Key standardization initiatives should address the following:

(1) Dataset Diversity Requirements: Mandatory inclusion of tropical and subtropical fruits, multiple ripeness stages, and varied environmental conditions [33];
(2) Evaluation Metric Uniformity: Adoption of comprehensive metrics, including mAP@0.5:0.95, MCC, FPS, and energy consumption [31];
(3) Hardware Benchmarking Standards: Consistent testing across representative edge platforms (Jetson series, Raspberry Pi, edge TPUs) [34];
(4) Real-World Validation Protocols: Standardized field testing procedures accounting for lighting variations, occlusion, and weather conditions [4].
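The case for MCC on imbalanced data can be made concrete with a small example. The sketch below computes the multiclass form of MCC directly from a confusion matrix and compares it with accuracy. The three ripeness classes and their counts are hypothetical, and returning 0 when the denominator vanishes is a common convention rather than part of the definition.

```python
import numpy as np

def multiclass_mcc(cm) -> float:
    """Matthews correlation coefficient from a confusion matrix (rows = true class)."""
    cm = np.asarray(cm, dtype=float)
    t = cm.sum(axis=1)          # samples per true class
    p = cm.sum(axis=0)          # samples per predicted class
    c = np.trace(cm)            # correctly classified samples
    s = cm.sum()                # all samples
    denom = np.sqrt((s**2 - np.sum(p**2)) * (s**2 - np.sum(t**2)))
    return 0.0 if denom == 0 else float((c * s - np.dot(p, t)) / denom)

def accuracy(cm) -> float:
    cm = np.asarray(cm, dtype=float)
    return float(np.trace(cm) / cm.sum())

# Hypothetical test set: 90 unripe, 5 ripe, 5 overripe samples.
majority_only = [[90, 0, 0], [5, 0, 0], [5, 0, 0]]   # predicts "unripe" for everything
perfect = [[90, 0, 0], [0, 5, 0], [0, 0, 5]]

for name, cm in [("majority-only", majority_only), ("perfect", perfect)]:
    print(f"{name:14s} accuracy = {accuracy(cm):.2f}   MCC = {multiclass_mcc(cm):.2f}")
```

A classifier that never predicts the minority ripeness stages still reaches 90% accuracy on this set, whereas its MCC is 0, which is the kind of overly optimistic assessment the metric is meant to expose.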

6.5. Emerging Research Priorities

Cross-modal learning approaches that integrate visual data with non-visual sensors (hyperspectral, thermal, chemical) represent a promising frontier [5,7]. Future research should explore efficient fusion techniques that maintain edge compatibility while leveraging complementary data sources for improved ripeness assessment.

Self-supervised and semi-supervised learning methods tailored for agricultural applications can address data scarcity challenges [5]. Developing efficient pre-training strategies using unlabeled field data will reduce dependency on extensive manual annotation while improving model generalization across different agricultural environments.

Sustainable AI frameworks that consider environmental impact and long-term deployment viability will become increasingly important. Research should focus on energy-aware optimization, model lifetime management, and adaptive compression techniques that balance performance with operational sustainability in agricultural settings.

The successful pursuit of these research directions will enable the development of robust, efficient, and practical fruit ripeness detection systems that meet the real-world demands of modern precision agriculture while respecting the constraints of edge deployment environments.

7. Conclusions

This systematic review comprehensively evaluates the state of the art in quantization-optimized lightweight transformer architectures for real-time fruit ripeness detection on edge devices. Our analysis of 22 high-quality studies highlights a rapidly maturing field, yet identifies critical gaps that must be addressed to unlock the full potential of AI-driven precision agriculture.

First, this review establishes that the combination of advanced quantization techniques and lightweight architectural innovations makes real-time deployment feasible. Methods such as mixed-precision quantization, learned step size quantization, and distribution rectification distillation effectively narrow the accuracy–efficiency gap, with modern approaches limiting performance degradation to 2–3% while achieving a 3–8× speedup on edge hardware such as the NVIDIA Jetson series.

Second, we strongly advocate for the adoption of the Matthews Correlation Coefficient (MCC) as a primary evaluation metric, particularly for imbalanced agricultural datasets. Unlike traditional metrics that may provide overly optimistic assessments, MCC offers a more holistic and reliable measure of model performance across all ripeness categories, which is essential for practical deployment decisions.

Third, RT-DETR variants emerge as the most promising architectural foundation for next-generation agricultural AI systems. Their hybrid design, which efficiently decouples intra-scale interaction from cross-scale fusion, achieves an optimal balance between the relational reasoning strength of Transformers and the operational demands of edge deployment.

Author Contributions

Conceptualization, D.M. and R.K.R.; methodology, D.M. and R.K.R.; software, D.M.; validation, D.M. and R.K.R.; formal analysis, D.M.; writing—original draft preparation, R.K.R.; writing—review and editing, R.K.R.; supervision, R.K.R.; project administration, R.K.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Huang, Z.; Li, X.; Fan, S.; Liu, Y.; Zou, H.; He, X.; Xu, S.; Zhao, J.; Li, W. ORD-YOLO: A Ripeness Recognition Method for Citrus Fruits in Complex Environments. Agriculture 2025, 15, 1711.
2. Zhao, J.; Du, C.; Li, Y.; Mudhsh, M.; Guo, D.; Fan, Y.; Wu, X.; Wang, X.; Almodfer, R. YOLO-Granada: A lightweight attentioned Yolo for pomegranates fruit detection. Sci. Rep. 2024, 14, 16848.
3. Martínez-Mora, O.; Capuñay-Uceda, O.; Caucha-Morales, L.; Sánchez-Ancajima, R.; Ramírez-Morales, I.; Córdova-Márquez, S.; Cuenca-Mayorga, F. Artificial Vision-Based Dual CNN Classification of Banana Ripeness and Quality Attributes Using RGB Images. Processes 2025, 13, 1982.
4. Lyu, S.; Li, R.; Zhao, Y.; Li, Z.; Fan, R.; Liu, S. Green Citrus Detection and Counting in Orchards Based on YOLOv5-CS and AI Edge System. Sensors 2022, 22, 576.
5. Lyu, H.; Grafton, M.; Ramilan, T.; Irwin, M.; Sandoval, E. Synthetic hyperspectral reflectance data augmentation by generative adversarial network to enhance grape maturity determination. Comput. Electron. Agric. 2025, 235, 110341.
6. Kong, X.; Li, X.; Zhu, X.; Guo, Z.; Zeng, L. Detection Model Based on Improved Faster-RCNN in Apple Orchard Environment. Intell. Syst. Appl. 2024, 21, 200325.
7. Wang, T.; Yuan, L.; Zhang, X.; Feng, J. Distilling Object Detectors with Fine-grained Feature Imitation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 4933–4942.
8. Dakhli, I.; Sedqui, A.; Derrhi, M.; Karroumi, B. Artificial Intelligence and Crisis Management: A Systematic Literature Review using PRISMA. In Proceedings of the 2024 4th International Conference on Innovative Research in Applied Science, Engineering and Technology (IRASET), Fez, Morocco, 13–15 May 2024; pp. 1–6.
9. Rahman, Z.U.; Asaari, M.S.M.; Ibrahim, H.; Abidin, I.S.Z.; Ishak, M.K. Generative Adversarial Networks (GANs) for Image Augmentation in Farming: A Review. IEEE Access 2024, 12, 179912–179943.
10. Kolesnikov, A.; Beyer, L.; Zhai, X.; Puigcerver, J.; Yung, J.; Gelly, S.; Houlsby, N. Big Transfer (BiT): General Visual Representation Learning. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 491–507.
11. Kamilaris, A.; Prenafeta-Boldú, F.X. Deep learning in agriculture: A survey. Comput. Electron. Agric. 2018, 147, 70–90.
12. Alba, A.; Villaverde, J.; Lacara, A.; Domingo, J.; Aguirre, D. Optimized FLOPs-Aware Knowledge Distillation for TinyML Applications in Agriculture. In 2025 International Conference on Advancement in Data Science, E-Learning and Information System (ICADEIS), Bandung, Indonesia, 3–4 February 2025; IEEE: Piscataway, NJ, USA, 2025.
13. Li, Z.; Yang, T.; Wang, P.; Cheng, J. Q-ViT: Fully Differentiable Quantization for Vision Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 1173–1182.
14. Suharjito, S.; Elwirehardja, G.N.; Prayoga, J.S. Oil palm fresh fruit bunch ripeness classification on mobile devices using deep learning approaches. Comput. Electron. Agric. 2021, 188, 106359.
15. Shea, B.J.; Reeves, B.; Wells, G.A.; Thuku, M. AMSTAR 2: A critical appraisal tool for systematic reviews that include randomised or non-randomised studies of healthcare interventions, or both. BMJ 2017, 358, j4008.
16. Liu, Z.; Cheng, K.T.; Huang, D.; Xing, E.; Shen, Z. Nonuniform-to-Uniform Quantization: Towards Accurate Quantization via Generalized Straight-Through Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 4942–4952.
17. Park, C.; Yun, S.; Chun, S. A Unified Analysis of Mixed Sample Data Augmentation: A Loss Function Perspective. In Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 28 November–9 December 2022.
18. Lin, Y.; Zhang, T.; Sun, P.; Li, Z.; Zhou, S. FQ-ViT: Post-Training Quantization for Fully Quantized Vision Transformer. In Proceedings of the 31st International Joint Conference on Artificial Intelligence (IJCAI), Vienna, Austria, 23–29 July 2022; pp. 1173–1179.
19. Khandelwal, A.; Yun, T.; Nayak, N.V.; Merullo, J.; Bach, S.H.; Sun, C.; Pavlick, E. $100K or 100 Days: Trade-offs when Pre-Training with Academic Resources. arXiv 2024, arXiv:2410.23261.
20. Wang, R.; Sun, H.; Yang, L.; Lin, S.; Liu, C.; Gao, Y.; Hu, Y.; Zhang, B. AQ-DETR: Low-Bit Quantized Detection Transformer with Auxiliary Queries. In Proceedings of the 38th AAAI Conference on Artificial Intelligence (AAAI-24), Vancouver, BC, Canada, 20–27 February 2024; pp. 15598–15606.
21. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-time Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 16965–16974.
22. Jacob, B.; Kligys, S.; Chen, B.; Zhu, M.; Tang, M.; Howard, A.; Adam, H.; Kalenichenko, D. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 2704–2713.
23. Liu, Z.; Wu, B.; Luo, W.; Yang, X.; Liu, W.; Cheng, K.-T. Bi-Real Net: Enhancing the Performance of 1-bit CNNs with Improved Representational Capability and Advanced Training Algorithm. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 722–737.
24. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 3–7 May 2021.
25. Xu, S.; Li, Y.; Lin, M.; Gao, P.; Guo, G.; Lü, J.; Zhang, B. Q-DETR: An Efficient Low-Bit Quantized Detection Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 3842–3851.
26. Esser, S.K.; McKinstry, J.L.; Bablani, D.; Appuswamy, R.; Modha, D.S. Learned Step Size Quantization. In Proceedings of the International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 26–30 April 2020.
27. Fang, J.; Shafiee, A.; Abdel-Aziz, H.; Thorsley, D.; Georgiadis, G.; Hassoun, J.H. Post-Training Piecewise Linear Quantization for Deep Neural Networks. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 69–86.
28. Han, S.; Mao, H.; Dally, W.J. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. In Proceedings of the International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2–4 May 2016.
29. Setyanto, A.; Sasongko, T.B.; Fikri, M.A.; Ariatmanto, D.; Agastya, I.M.A.; Rachmanto, R.D.; Ardana, A.; Kim, I.K. Knowledge Distillation in Object Detection for Resource-Constrained Edge Computing. IEEE Access 2025, 13, 18456–18471.
30. Li, F.; Zeng, A.; Liu, S.; Zhang, H.; Li, H.; Zhang, L.; Ni, L.M. Lite DETR: An Interleaved Multi-Scale Encoder for Efficient DETR. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 18558–18567.
31. Chicco, D.; Jurman, G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genom. 2020, 21, 6.
32. Koirala, A.; Walsh, K.B.; Wang, Z.; McCarthy, C. Deep learning—Method overview and review of use for fruit detection and yield estimation. Comput. Electron. Agric. 2019, 162, 219–234.
33. Sikder, M.S.; Islam, M.S.; Islam, M.; Reza, M.S. Improving Mango Ripeness Grading Accuracy: A Comprehensive Analysis of Deep Learning, Traditional Machine Learning, and Transfer Learning Techniques. Preprints 2024, 4837005.
34. Swaminathan, T.P.; Silver, C.; Akilan, T.; Kumar, J. Benchmarking Deep Learning Models on NVIDIA Jetson Nano for Real-Time Systems: An Empirical Investigation. Procedia Comput. Sci. 2025, 260, 906–913.
35. Rati, G.; Kwon, R.; El-Farouk, A.; Silvestri, M.; Sundaram, C. Building Edge-Native Architectures for Foundation Models. Preprints 2024, 2024, 5350027.
36. Tai, Y.S.; Wu, A.Y. AMP-ViT: Optimizing Vision Transformer Efficiency with Adaptive Mixed-Precision Post-Training Quantization. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 27 February–3 March 2025; pp. 6828–6837.
37. Wei, C.C.; Wang, C.H.; Pong, E.Y.; Chen, C.H. Low-Cost Non-Linear Function Approximation for Transformer Deployment on Edge NPU. In Proceedings of the 2025 IEEE International Conference on Communication Technology-Pacific (ICCT-Pacific), Matsue, Japan, 29–31 March 2025; IEEE: Piscataway, NJ, USA; pp. 1–6.
38. Rao, Y.; Zhao, W.; Liu, B.; Lu, J.; Zhou, J.; Hsieh, C.-J. DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification. In Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), Virtual, 6–14 December 2021; pp. 13937–13949.
39. Gong, J.; Chen, T. Deep Configuration Performance Learning: A Systematic Survey and Taxonomy. ACM Trans. Softw. Eng. Methodol. 2024, 34, 1–62.
40. Hao, J.; Subedi, P.; Ramaswamy, L.; Kim, I.K. Reaching for the Sky: Maximizing Deep Learning Inference Throughput on Edge Devices with AI Multi-Tenancy. ACM Trans. Internet Technol. 2023, 23, 1–33.
41. Sun, H.; Zhang, S.; Tian, X.; Zou, Y. Pruning DETR: Efficient End-to-End Object Detection with Sparse Structured Pruning. Signal Image Video Process. 2024, 18, 129–135.

Figure 1. PRISMA 2020 flow diagram showing the systematic literature selection process.

Table 1. Search strategy and data sources.

Database	Records Identified	Key Search Keywords
Scopus	42	“Transformer” OR “DETR” “quantization” OR “pruning” “fruit ripeness” OR “agriculture” “edge device” OR “real-time”
IEEE Xplore	18	“lightweight Transformer” OR “Q-DETR” “quantization aware training” OR “knowledge distillation” “fruit detection” OR “mango ripeness”
ScienceDirect	28	“quantization aware training” OR “pruning” “fruit detection” OR “agricultural vision” “edge AI” OR “mobile deployment”
MDPI	22	“Transformer” AND “quantization” “fruit quality” OR “crop monitoring” “embedded systems”
SpringerLink	35	“lightweight Transformer” OR “model compression” “fruit ripeness” OR “harvesting robot” “edge device”
Web of Science	15	“RT-DETR” OR “vision transformer” “knowledge distillation” OR “model compression” “edge AI” OR “real-time detection”

Table 2. Inclusion and exclusion criteria.

Category	Criteria	Application
Inclusion Criteria
I1	Peer-reviewed journal or conference proceedings in English	Applied to all 160 initial records
I2	Publication date between January 2018–June 2025	Applied during database search
I3	Focus on fruit ripeness detection or classification	Excluded 8 studies during full-text review
I4	Explicit discussion of model optimization for edge deployment	Excluded 5 studies during full-text review
Exclusion Criteria
E1	Review articles, surveys, theses, or non-peer-reviewed works	Excluded during initial screening
E2	Traditional computer vision without deep learning	Excluded 3 studies during full-text review
E3	Not addressing edge computing constraints	Excluded 1 study during full-text review

Table 3. Quality assessment results using a modified AMSTAR 2 checklist for engineering/AI studies.

Quality Dimension	Assessment Criteria	Studies Meeting Criteria (n = 22)	Percentage
Experimental Rigor	Clear methodology, reproducible experiments, adequate dataset description	18	81.8%
Performance Reporting	Comprehensive metrics (accuracy, latency, model size, power consumption)	16	72.7%
Hardware Specifications	Detailed edge device specifications and deployment environment	14	63.6%
Statistical Validity	Appropriate statistical tests, confidence intervals, significance reporting	12	54.5%
Comparative Analysis	Comparison with baseline methods or state-of-the-art approaches	20	90.9%
Limitations Discussion	Clear acknowledgment of study limitations and constraints	15	68.2%

Table 4. Study selection process outcomes.

Selection Phase	Records	Exclusion Reasons
Initial Identification	160 records from 6 databases	-
After Duplicate Removal	128 records	32 duplicates removed
Title/Abstract Screening	39 records	89 records excluded: Irrelevant scope (62), Review articles (18), Non-English (9)
Full-Text Assessment	22 studies included	17 records excluded: No fruit ripeness focus (8), No quantization techniques (5), No transformer architectures (3), No edge deployment (1)
Final Inclusion	22 studies	-
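The counts reported in Tables 1 and 4 can be cross-checked with a few lines of arithmetic. The variable names below are illustrative; every number is copied from those two tables.

```python
# Records identified per database (Table 1)
identified = 42 + 18 + 28 + 22 + 35 + 15

# Selection phases (Table 4)
after_dedup = identified - 32                    # duplicates removed
after_screening = after_dedup - (62 + 18 + 9)    # title/abstract exclusions
included = after_screening - (8 + 5 + 3 + 1)     # full-text exclusions

print(identified, after_dedup, after_screening, included)
```

The chain reproduces the 160, 128, 39, and 22 records reported for the successive phases.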

Table 5. Data extraction protocol.

Extraction Category	Specific Data Points	Extraction Method
Model Architecture	Transformer type (DETR, ViT, etc.); Backbone network; Parameter count; Model size	Direct extraction from methodology sections
Quantization Methods	Quantization type (PTQ, QAT); Bit width (2-bit to 8-bit); Calibration method; Precision mix	Categorized based on technical descriptions
Performance Metrics	Accuracy/mAP; Inference latency; FPS rate; Memory footprint; Energy consumption	Numerical extraction from results sections
Hardware Platform	Edge device type; Processor specifications; Memory constraints; Deployment environment	Technical specification compilation
Dataset Characteristics	Fruit types covered; Dataset size; Image resolution; Annotation type	Systematic categorization

Table 6. Extracted data distribution from 22 studies.

Data Category	Studies Reporting (n = 22)	Completeness Rate	Key Findings
Architecture Details	22	100%	RT-DETR variants dominant (8 studies), ViT-based (6 studies)
Quantization Specifications	20	90.9%	4-bit quantization most common (9 studies), mixed-precision emerging
Performance Metrics	18	81.8%	Average accuracy drop: 2.3%, Speedup: 3.2× typical
Hardware Details	14	63.6%	NVIDIA Jetson dominant platform (10 studies)
Dataset Information	19	86.4%	Tomato (7), Citrus (5), Mixed fruits (4), most common

Table 7. Synthesis outcomes.

Synthesis Focus	Analysis Approach	Identified Patterns
Accuracy–Speed Trade-offs	Correlation analysis between quantization level and performance	4-bit optimum for most applications
Hardware Compatibility	Cross-platform performance comparison	ARM CPUs show better quantization tolerance
Fruit-Specific Optimization	Per-category performance analysis	Citrus detection benefits most from quantization
Architecture Efficiency	Parameter vs. accuracy analysis	Hybrid CNN–Transformer is the most efficient

Table 8. Performance comparison of lightweight CNN architectures for edge-based fruit ripeness detection.

Architecture	Quantization Method	Target Fruit	Accuracy/mAP	Inference Time (ms)	Model Size (MB)	Hardware Platform	Power (W)
MobileNetV2 [10]	Float16	Banana [3]	97.0%	66.7	9.1	Raspberry Pi 4	5.0
EfficientNetB0 [14]	Float16	Oil Palm	89.3%	96.0	12.4	Mobile GPU	7.5
YOLOv5 CS [4]	INT8	Citrus	98.2%	37.0	45.2	Jetson Xavier NX	10.0
SSDLite [21]	INT8	Tomato	92.4%	42.5	18.7	ARM CPU	4.2
YOLOv4 tiny [2]	INT8	Pomegranate	92.2%	25.6	23.5	Jetson Nano	5.0

Table 9. Quantization impact analysis across precision levels for fruit detection models.

Base Model	Precision	Accuracy Drop	Speedup Factor	Memory Reduction	Application Context	Ref.
MobileNetV2	FP32 → INT8	2.1%	3.2×	75%	Real-time classification	[22]
EfficientNetB0	FP32 → Float16	1.2%	2.1×	50%	Mobile deployment	[14]
YOLOv5s	FP32 → INT8	3.4%	3.8×	75%	Orchard monitoring	[4]
ResNet-50	FP32 → INT4	5.7%	5.2×	87%	Batch processing	[23]
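The Memory Reduction column in Table 9 can be compared against the ideal reduction implied by the bit-widths alone. The short sketch below assumes that storage scales linearly with bits per value and ignores quantization metadata such as scales and zero points; `memory_reduction` is an illustrative name.

```python
def memory_reduction(src_bits: int, dst_bits: int) -> float:
    """Ideal fraction of storage saved when moving from src_bits to dst_bits."""
    return 1.0 - dst_bits / src_bits

for label, dst_bits in [("INT8", 8), ("Float16", 16), ("INT4", 4)]:
    print(f"FP32 -> {label}: {memory_reduction(32, dst_bits):.1%}")
```

The ideal values are 75%, 50%, and 87.5%, which agree with the reported 75%, 50%, and 87% once rounding or a small metadata overhead is allowed for.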

Table 10. Transformer architecture performance for fruit ripeness detection.

Architecture	Target Fruit	mAP@0.5	FPS	Hardware	Model Size (MB)	Key Innovation	Reference
ORD-YOLO	Citrus	96.92%	74	Edge GPU	15.2	ODConv + RepGFPN	[1]
RT-DETR	Multiple	86.8%	108	T4 GPU	42.5	Hybrid encoder	[21]
Deformable DETR	Mixed fruits	91.2%	45	Jetson Xavier	38.7	Deformable attention	[24]
Q-DETR	General	89.4%	62	Edge device	12.3	Distribution rectification	[25]
YOLO Granada	Pomegranate	92.2%	58	Mobile	24.8	ShuffleNetv2 backbone	[2]

Table 11. Quantization method efficacy for edge deployment.

Quantization Method	Bit Width	Accuracy Retention	Speedup	Hardware Efficiency	Best Suited Architecture	Reference
LSQ+ [26]	4-bit	96.8%	3.8×	High	Vision Transformers	[26]
QAT [22]	8-bit	98.2%	2.5×	Medium	CNN based detectors	[22]
PTQ [27]	8-bit	97.1%	3.2×	High	YOLO variants	[27]
Mixed precision [16]	4/8-bit	98.5%	3.0×	Medium	DETR architectures	[16]
Nonuniform to Uniform [16]	4-bit	95.7%	4.1×	High	Mobile deployments	[16]

Table 12. Performance of advanced compression strategies.

Compression Method	Accuracy Drop	Latency Gain	Memory Reduction	Model Size (MB)	Hardware Platform	Reference
Q-ViT (Mixed-precision)	1.84%	3.8×	76%	4.75	Jetson Nano	[13]
Triple Compression	2.67%	4.2×	74%	8.3	Raspberry Pi 4	[7]
KD + Quantization	1.92%	3.1×	68%	12.1	Edge TPU	[29]
Pruning + Quantization	3.45%	4.8×	82%	6.8	ARM CPU	[28]
Q-DETR (4-bit)	2.60%	6.6×	75%	15.4	Mobile GPU	[25]

Table 13. Knowledge distillation and hybrid architecture performance.

Architecture	Teacher Model	Student Model	Accuracy/mAP	Compression Ratio	Inference Speed (FPS)	Reference
Distilling DETR	Faster R-CNN	Compressed DETR	91.4%	4.8:1	38	[7]
MobileNetV2-DETR	Standard DETR	Hybrid DETR	89.7%	2.3:1	45	[30]
EfficientNet-DETR	ResNet-50 DETR	Light DETR	87.9%	3.1:1	52	[21]
GAN-Augmented	Baseline CNN	Augmented Model	91.0%	N/A	28	[13]

Table 14. Unified benchmark for edge deployable fruit detection models.

Model Architecture	Dataset	mAP@0.5	MCC	FPS	Model Size (MB)	Latency (ms)	Hardware
ORD-YOLO [1]	Citrus Dataset	96.92%	0.891	74	15.2	13.5	Jetson Xavier
YOLO Granada [2]	Pomegranate Set	92.20%	0.845	58	24.8	17.2	Mobile GPU
Q-DETR [25]	COCO Fruits	89.40%	0.812	62	12.3	16.1	Edge Device
RT-DETR [21]	Mixed Fruits	86.80%	0.798	108	42.5	9.3	T4 GPU
MobileNetV2-DETR [30]	Tomato360	89.70%	0.834	45	18.7	22.2	Raspberry Pi

Table 15. Comparison of quantization strategies for multi-head attention (MHA).

Strategy	Primary Mechanism	Advantages	Implementation Trade-Off
Mixed-Precision	Selective bit-widths per layer	Hardware-friendly	Complex search space
Attention-Aware	Weight re-scaling via saliency	High accuracy (<8-bit)	Requires saliency mapping
Log-Quantization	Non-linear mapping	Fits Softmax range	Computational overhead
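To make the log-quantization row more concrete, the sketch below compares a uniform 4-bit grid with a base-2 logarithmic quantizer on one synthetic softmax row. It is a minimal illustration rather than the scheme of any cited work: the attention logits are random, both quantizers are simplified, and all function names are assumptions.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def uniform_quant(p: np.ndarray, bits: int) -> np.ndarray:
    """Uniform grid on [0, 1] with 2**bits levels."""
    levels = 2 ** bits - 1
    return np.round(p * levels) / levels

def log2_quant(p: np.ndarray, bits: int) -> np.ndarray:
    """Store round(-log2(p)) as an unsigned integer; decode as a power of two."""
    qmax = 2 ** bits - 1
    q = np.clip(np.round(-np.log2(p)), 0, qmax)
    return 2.0 ** (-q)

def mean_rel_err(p: np.ndarray, p_hat: np.ndarray) -> float:
    return float(np.mean(np.abs(p - p_hat) / p))

rng = np.random.default_rng(0)
probs = softmax(rng.normal(0.0, 1.0, size=64))   # one synthetic attention row

zeros_uniform = int(np.sum(uniform_quant(probs, 4) == 0))
zeros_log2 = int(np.sum(log2_quant(probs, 4) == 0))
err_uniform = mean_rel_err(probs, uniform_quant(probs, 4))
err_log2 = mean_rel_err(probs, log2_quant(probs, 4))

print(f"entries rounded to zero: uniform={zeros_uniform}, log2={zeros_log2}")
print(f"mean relative error:     uniform={err_uniform:.3f}, log2={err_log2:.3f}")
```

Most softmax outputs are small, so a uniform grid rounds many of them to zero, while the logarithmic grid keeps every entry nonzero. The extra `log2` and power operations are the kind of computational overhead listed as the trade-off in the table.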

Table 16. Architecture comparison for edge deployment.

Architecture Type	Average mAP	Inference Speed (FPS)	Model Size (MB)	Power Consumption	Best Use Case	Limitations
Lightweight CNN [3,14]	91.8%	45	15.2	5–7 W	Resource-constrained devices	Limited contextual reasoning
Transformer-based [1,21]	94.2%	38	42.5	8–12 W	Complex environments	Higher computational demand
Hybrid Architectures [30]	92.7%	42	28.3	6–9 W	Balanced applications	Integration complexity
Quantized Transformers [25]	91.6%	62	12.3	4–7 W	Real-time deployment	Quantization instability

Table 17. Quantified research gaps and impact assessment.

Research Gap	Quantitative Impact	Affected Metrics	Proposed Solutions	Priority Level
Tropical Fruit Dataset Scarcity	78% of studies focus on three fruit types	Generalization, mAP drop up to 15%	Federated learning, generative augmentation [5,9]	High
MHA Quantization Instability	3–5× higher sensitivity than CNN layers	Accuracy drop 4–8% at low precision	Mixed-precision quantization [13], attention-aware quantization [16]	Critical
Limited Field Validation	Only 23% of studies conduct real deployment tests	Performance variance up to 12%	Standardized field testing protocols	High
Hardware-Dataset Mismatch	45% of studies use inappropriate hardware benchmarks	Latency underestimated by 2–3×	Hardware-aware NAS [35]	Medium

Table 18. Roadmap for unified lightweight DETR development.

Research Direction	Key Technologies	Expected Impact	Timeline	Key Challenges
Hardware-Aware DETR Optimization	NAS with edge constraints [35], mixed-precision quantization [16]	3–5× speedup, 60% energy reduction	Short term (1–2 years)	Hardware diversity, compiler support
Domain-Adaptive Quantization	Input-aware bit-width allocation [36], attention-aware quantization [13]	2–3% accuracy improvement at 4-bit	Medium term (2–3 years)	Training complexity, calibration overhead
Unified Agricultural DETR Framework	Modular architecture, Cross-fruit generalization [30]	50% development time reduction	Long term (3–5 years)	Standardization, Dataset integration

Table 19. Edge-cloud collaboration framework specifications.

Component	Functionality	Technical Requirements	Performance Targets	Implementation Challenges
Edge Layer	Real-time detection, local inference	Quantized Transformers, low-power optimization [34]	<30 ms latency, >30 FPS	Power constraints, memory limitations
Fog Layer	Multi-device coordination, data aggregation	Lightweight fusion algorithms [40]	<5 s model updates	Network reliability, synchronization
Cloud Layer	Model training, Global optimization	Federated learning [19], NAS	Weekly model improvements	Data privacy, Communication costs
Collaboration Protocol	Adaptive workload distribution	Dynamic compression [41]	40% bandwidth reduction	Latency variance, Resource management
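
The Edge Layer targets in Table 19 (<30 ms latency, >30 FPS) can be expressed as a simple acceptance check. The Python sketch below is a hypothetical helper, not part of the reviewed framework: the class name `EdgeTargets` and the function `meets_edge_targets` are introduced here only for illustration, and the sample measurements reuse two rows of Table 14, which were obtained on different hardware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EdgeTargets:
    """Edge Layer performance targets taken from Table 19."""
    max_latency_ms: float = 30.0  # target: latency strictly below 30 ms
    min_fps: float = 30.0         # target: throughput strictly above 30 FPS


def meets_edge_targets(latency_ms: float, fps: float,
                       targets: EdgeTargets = EdgeTargets()) -> bool:
    """Return True only if both the latency and the FPS target are satisfied."""
    return latency_ms < targets.max_latency_ms and fps > targets.min_fps


if __name__ == "__main__":
    # (model, latency in ms, FPS) reused from Table 14 for illustration only.
    samples = [
        ("Q-DETR", 16.1, 62),
        ("MobileNetV2-DETR", 22.2, 45),
    ]
    for model, latency_ms, fps in samples:
        print(model, meets_edge_targets(latency_ms, fps))
```

A check of this kind is only meaningful when latency and FPS are measured on the intended edge device, since Table 17 notes that benchmarks on mismatched hardware can underestimate latency by 2–3×.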

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
