1. Introduction
Smart agriculture has emerged as a critical domain in global efforts to enhance crop yield optimization and product quality assurance [
1]. Within this paradigm, automated fruit ripeness inspection represents a vital component for precision orchard management and post-harvest processing [
2]. Traditional assessment methodologies predominantly rely on manual inspection, which introduces subjectivity, inconsistency, and labor-intensive inefficiencies, thereby creating substantial demand for non-invasive, economical, and automated alternatives [
3]. Rapid and accurate real-time detection, with reported throughputs such as 74.05 frames per second (FPS) or 28 FPS on edge systems, is particularly crucial for applications in harvester robotics and automated grading systems [4].
The historical dominance of Convolutional Neural Network (CNN)-based architectures, especially the YOLO (You Only Look Once) family and SSD (Single Shot Detector), has established a strong foundation for agricultural object detection [
5]. These models provide favorable inference speeds but frequently necessitate heuristic post-processing components such as Non-Maximum Suppression (NMS) [
2]. The research landscape is currently undergoing a significant transition toward Transformer-based architectures, pioneered by the Detection Transformer (DETR) framework [
5]. Transformer models offer a self-attention mechanism that enables superior long-range dependency modeling and an end-to-end inference pipeline that eliminates the need for NMS [
2]. Recent advancements have yielded more efficient variants such as Real-Time DETR (RT-DETR) [
1], which demonstrates the capability to surpass powerful YOLO detectors in terms of speed-accuracy tradeoffs. Recent studies in 2024 have further validated the shift toward hybrid architectures that combine the spatial efficiency of convolutions with the global modeling of Transformers for complex orchard environments [
6]. Despite their promising accuracy [
7], these architectures are inherently computationally intensive, demanding substantial resources that challenge their direct deployment on resource-constrained edge devices [
8].
The stringent resource limitations of edge devices necessitate extreme model compression [
9]. A critical challenge emerges when compression is pursued through low-bit quantization (e.g., 4-bit) [
10]. State-of-the-art Transformer architectures impose high computational demands, severely restricting their edge deployment potential [
10]. The accuracy degradation induced by quantization remains a primary concern [
11], which is particularly exacerbated in Vision Transformers (ViTs) and DETR-based models. Current research in 2025 emphasizes that the integration of hardware-aware quantization schemes and distillation frameworks is essential to maintain high-precision grading on agricultural edge platforms [
12]. This systematic review is, therefore, motivated by the fundamental research problem of how to efficiently execute neural network inference using primarily integer or binary arithmetic [
13].
While this review focuses primarily on Transformer-based architectures, selected CNN-based models (such as YOLOv8 and MobileNet variants) are included as performance baselines. This inclusion is necessary because CNNs currently represent the state-of-the-art in agricultural edge deployment. By comparing quantization-optimized Transformers against these established CNN benchmarks, this review can more effectively highlight the specific advancements, architectural advantages, and unique quantization challenges—such as attention instability—that distinguish Transformers from traditional convolutional approaches.
2. Background
This review focuses on three interconnected technical domains critical for deploying real-time fruit ripeness detection on edge devices: object detection architecture evolution, model compression techniques, and edge computing constraints.
2.1. Architecture Evolution: CNN–Transformer
The transition from CNN-based detectors (YOLO, SSD) to Transformer architectures represents a fundamental shift in object detection paradigms. While CNNs provide efficient grid-based detection, they require handcrafted components like Non-Maximum Suppression (NMS) [
2]. The Detection Transformer (DETR) introduced end-to-end set prediction, eliminating NMS but suffering from slow convergence and high computational complexity [
5]. Real-Time DETR (RT-DETR) emerged as an optimized variant specifically designed for high-speed inference, incorporating hybrid encoders and efficient query selection mechanisms [
11]. Specifically, the hybrid encoder design of RT-DETR significantly improves inference speed by decoupling intra-scale and cross-scale feature processing, which reduces redundant computations compared to standard multi-scale encoders. This architectural efficiency allows RT-DETR to maintain high precision while meeting the strict latency requirements of edge-deployed agricultural sensors. Therefore, CNN-based metrics are included in this review as a baseline to benchmark the efficiency gains of these Transformer-based advancements.
2.2. Model Compression Techniques
Edge deployment necessitates aggressive model compression through three primary approaches:
2.2.1. Quantization
Quantization reduces numerical precision from 32-bit floating-point (FP32) to lower-bit integer representations, such as INT8 or INT4, using scale (S) and zero-point (Z) parameters [
14]. This process enables a significant reduction in model footprint and computational overhead, often achieving a speedup of 3.2× to 8× on edge hardware. Synthesis of the literature indicates that Quantization-Aware Training (QAT) consistently maintains superior accuracy over Post-Training Quantization (PTQ) in low-bit scenarios [
15] by simulating quantization errors during the training phase.
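The role of the scale S and zero-point Z can be illustrated with a minimal sketch of asymmetric (affine) quantization. The function names and the sample weights below are hypothetical and chosen only for illustration; the sketch uses an unsigned integer range and is not taken from any of the reviewed studies.

```python
def quant_params(x_min, x_max, num_bits=8):
    """Return scale S and zero-point Z for unsigned affine quantization."""
    q_min, q_max = 0, 2 ** num_bits - 1
    # Widen the range so that 0.0 is exactly representable.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (q_max - q_min) or 1.0
    zero_point = int(round(q_min - x_min / scale))
    return scale, max(q_min, min(q_max, zero_point))


def quantize(x, scale, zero_point, num_bits=8):
    """Map a real value to an integer code, clipping to the valid range."""
    q = int(round(x / scale)) + zero_point
    return max(0, min(2 ** num_bits - 1, q))


def dequantize(q, scale, zero_point):
    """Map an integer code back to an approximate real value."""
    return scale * (q - zero_point)


weights = [-0.62, -0.10, 0.0, 0.25, 0.91]  # illustrative values only
for bits in (8, 4):
    s, z = quant_params(min(weights), max(weights), bits)
    restored = [dequantize(quantize(w, s, z, bits), s, z) for w in weights]
    max_err = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"INT{bits}: S={s:.5f}, Z={z}, max abs error={max_err:.5f}")
```

Running the loop shows the step size S growing as the bit-width shrinks, which is the source of the larger rounding error at INT4.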
2.2.2. Pruning
Pruning enhances efficiency by identifying and removing redundant parameters from the network. While unstructured pruning offers higher theoretical compression, structured pruning is preferred for edge deployment as it preserves hardware-friendly tensor shapes, leading to direct latency improvements on standard processors. Recent evidence shows that aggressive pruning can achieve up to 82% memory reduction, though it may incur higher accuracy degradation compared to quantization alone [
16].
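As a rough illustration of why structured pruning keeps tensor shapes hardware-friendly, the sketch below removes whole filters ranked by their L1 norm. The L1 criterion, the function name, and the toy layer are assumptions made for this example; the reviewed studies may use other importance scores.

```python
def prune_filters(filters, keep_ratio):
    """Keep the filters with the largest L1 norm (structured pruning).

    `filters` is a list of filters, each a flat list of weights.
    Whole filters are dropped, so the remaining tensor stays dense.
    """
    n_keep = max(1, round(len(filters) * keep_ratio))
    norms = [sum(abs(w) for w in f) for f in filters]
    ranked = sorted(range(len(filters)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:n_keep])  # preserve the original filter order
    return kept, [filters[i] for i in kept]


# Toy layer with four 3-weight filters (illustrative values only).
layer = [
    [0.9, -0.8, 0.7],
    [0.01, 0.02, -0.01],
    [0.5, 0.4, -0.6],
    [-0.03, 0.0, 0.02],
]
kept_idx, pruned = prune_filters(layer, keep_ratio=0.5)
print("kept filters:", kept_idx)
```

Because entire filters disappear, the pruned layer is simply a smaller dense layer, which is what allows latency gains on standard processors without sparse kernels.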
2.2.3. Knowledge Distillation
Knowledge distillation facilitates the transfer of learned feature representations from a high-capacity “teacher” model to a compact “student” network. In the context of Transformer architectures, this technique is critical for mitigating the “accuracy cliff” observed at sub-8-bit precision. By imitating the teacher’s attention maps or feature responses, student models can retain up to 94% of their original performance even under extreme compression constraints [
17].
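The two imitation targets mentioned above can be sketched as loss terms. The first function is the classic temperature-softened logit distillation loss, used here as a stand-in; the second is a plain mean squared error on feature responses. Function names, the temperature value, and the inputs are illustrative assumptions, not the exact losses of the cited works.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on temperature-softened class probabilities."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_t, p_s) if t > 0)
    return kl * temperature ** 2  # conventional T^2 scaling


def feature_imitation_loss(student_feat, teacher_feat):
    """Mean squared error between student and teacher feature responses."""
    diffs = [(s - t) ** 2 for s, t in zip(student_feat, teacher_feat)]
    return sum(diffs) / len(teacher_feat)


teacher = [4.0, 1.0, 0.5]   # e.g. logits for unripe / ripe / overripe
student = [2.5, 1.5, 0.2]
print("KD loss:", kd_loss(student, teacher))
print("feature loss:", feature_imitation_loss([1.0, 2.0], [1.0, 4.0]))
```

In practice these terms would be added to the detection loss with a weighting factor, which is a tuning choice outside the scope of this sketch.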
2.3. Edge Computing Constraints
Edge devices (Jetson Nano, Raspberry Pi, Coral TPU) impose strict limitations on power consumption (<10 W), memory (<8 GB), and computational throughput [
18]. Real-time agricultural applications require inference latencies below 30 ms (FPS > 30) while maintaining accuracy metrics (mAP@0.5:0.95) and power efficiency [
19]. The Matthews Correlation Coefficient (MCC) is particularly valuable for evaluating performance on imbalanced fruit ripeness datasets [
20].
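The relationship between latency and throughput, together with the device limits stated above, can be checked with a small helper. This is a minimal sketch: the function names are hypothetical, and the thresholds simply restate the limits given in this subsection (30 ms, 10 W, 8 GB).

```python
def fps_from_latency(latency_ms):
    """Throughput for strictly sequential single-image inference."""
    return 1000.0 / latency_ms


def meets_edge_budget(latency_ms, power_w, memory_gb,
                      max_latency_ms=30.0, max_power_w=10.0, max_memory_gb=8.0):
    """True if a deployment stays under all three limits."""
    return (latency_ms < max_latency_ms
            and power_w < max_power_w
            and memory_gb < max_memory_gb)


print(round(fps_from_latency(30.0), 1))   # a 30 ms latency is about 33 FPS
print(meets_edge_budget(latency_ms=25.0, power_w=5.0, memory_gb=4.0))
```

Note that a 30 ms latency corresponds to roughly 33 FPS, so the latency bound is slightly stricter than the FPS > 30 bound when frames are processed one at a time.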
3. Methodology
This systematic review was conducted in strict accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 guidelines [
8] to ensure a comprehensive, transparent, and reproducible research synthesis process. The methodology incorporates a structured multi-phase workflow—including identification, screening, eligibility, and inclusion—to systematically analyze the evolution of quantization-optimized Transformers for agricultural applications. To enhance methodological transparency, the full PRISMA 2020 checklist has been adopted, and a consensus-based protocol was implemented to resolve any disagreements during the study selection phase.
3.1. Strategy and Data Sources
A systematic literature search was performed across six major academic databases: Scopus, Web of Science, IEEE Xplore, ScienceDirect, MDPI, and SpringerLink.
Table 1 outlines the specific search strategy employed, which used a combination of keywords and Boolean operators focusing on three core themes: model architectures (“RT-DETR”, “Q-DET”, “lightweight Transformer”), optimization techniques (“quantization aware training”, “pruning”, “knowledge distillation”), and application context (“fruit detection”, “mango ripeness”, “edge AI”, “edge device”). The search was limited to publications between January 2018 and June 2025 to capture the most recent advancements in deep learning-based object detection. The initial search yielded 160 records, which were subsequently screened according to the PRISMA 2020 guidelines, as illustrated in Figure 1.
3.2. Study Selection and Quality Assessment
The study selection process strictly followed the PRISMA framework, encompassing four main phases: identification, screening, eligibility, and inclusion [
8]. From the initial 160 records identified, 32 duplicates were systematically removed, yielding 128 records for initial screening. Title and abstract screening led to the exclusion of 89 records, primarily because they were irrelevant to the scope or classified as review articles. A full-text assessment was then conducted on the remaining 39 records. To ensure objectivity, two authors independently performed the screening and eligibility assessments. Any disagreements regarding study inclusion were resolved through consensus-based discussion or, if necessary, by consulting a third author. This assessment resulted in the exclusion of 17 studies that failed to meet all predefined inclusion criteria, leaving 22 studies for final qualitative and quantitative analysis.
Table 2 details the specific inclusion and exclusion criteria applied during the eligibility phase.
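The record counts reported for the selection phases can be verified with simple arithmetic. The variable names below are illustrative; the numbers are those stated in this subsection.

```python
identified = 160
duplicates = 32
excluded_title_abstract = 89
excluded_full_text = 17

screened = identified - duplicates                # records after deduplication
full_text = screened - excluded_title_abstract    # records assessed in full text
included = full_text - excluded_full_text         # studies in the final synthesis

print(screened, full_text, included)
```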
The methodological quality of the included studies was rigorously assessed using a modified version of the AMSTAR2 tool [
15]. This assessment focused on several critical dimensions pertinent to engineering and AI research, including experimental rigor (met by 81.8% of studies), completeness of performance reporting (72.7%), clarity of hardware specifications (63.6%), and statistical validity (54.5%). Furthermore, a strong majority of the included studies (90.9%) provided meaningful comparative analysis against established baseline methods.
Table 3 presents the comprehensive results of this quality assessment based on the modified AMSTAR2 criteria, and Table 4 summarizes the study selection process.
3.3. Data Extraction and Synthesis
A standardized data extraction protocol was meticulously implemented to systematically collect five categories of information: (1) model architecture specifications; (2) quantization methods and corresponding bit-width; (3) key performance metrics (accuracy, latency, and Frames Per Second (FPS)); (4) hardware platform details; and (5) dataset characteristics.
Table 5 details the specific variables and structure utilized in this data extraction protocol.
The extraction process revealed varying levels of reporting completeness across categories, showing high completeness for architecture details (100%) and quantization specifications (90.9%), although hardware platform details were less consistently reported (63.6%).
Table 6 provides a comprehensive summary of the extracted data distribution and reporting completeness from the 22 included studies.
Subsequently, the extracted data were synthesized through comparative analysis and qualitative assessment to identify prominent trends, performance patterns, and critical research gaps concerning quantization-optimized Transformers. Key findings include the dominance of 4-bit quantization in achieving the optimal accuracy–efficiency balance, the prevalence of the NVIDIA Jetson (NVIDIA Corporation, Santa Clara, CA, USA) series as the primary deployment platform, and the observation that citrus detection is the application context most significantly benefiting from these techniques.
Table 7 summarizes the primary outcomes and insights derived from this synthesis.
4. Results and Synthesis
4.1. Lightweight CNN Performance for Edge-Based Fruit Detection
Empirical evidence reveals that lightweight CNN architectures remain highly efficient for fruit detection on resource-constrained devices. MobileNetV2, for instance, attained 97.0% accuracy in banana ripeness classification while maintaining 15 FPS on a Raspberry Pi 4 (5 W consumption) [
3]. Similarly, EfficientNet architectures demonstrate effective accuracy–efficiency scaling; EfficientNet-B0 achieves 89.3% accuracy in oil palm classification with float16 quantization, reducing model size by 50% while maintaining a 96 ms inference time [
14].
For real-time scenarios, YOLO-Tiny variants provide the highest throughput. The YOLOv5-CS model [4] achieves 98.23% mAP for green citrus detection with a latency of 0.037 s on an NVIDIA Jetson Xavier NX, enabling 28 FPS for orchard management. The performance gain in these models often stems from integrated attention modules that enhance small fruit localization.
Table 8 compares these architectures for edge-based detection, while
Table 9 analyzes the impact of quantization levels on their performance metrics.
4.2. Performance Analysis of Transformer-Based Architectures
Vision Transformers and DETR variants demonstrate exceptional performance in complex agricultural environments, offering robustness surpassing traditional CNNs. The ORD-YOLO architecture [1] achieves 96.92% mAP for citrus fruit maturity detection in complex scenes, representing a 3% performance improvement over the original YOLOv8 model. The integration of the Omni-Dimensional Dynamic Convolution (ODConv) and the Re-parameterizable Generalized Feature Pyramid Network (RepGFPN) significantly enhances feature extraction capabilities while maintaining real-time performance.
The RT-DETR architecture [
21] emerges as a breakthrough in real-time object detection, achieving 53.1% AP on the COCO dataset with 108 FPS on a T4 GPU. When adapted for agricultural applications, RT-DETR demonstrates superior performance in detecting occluded fruits through its hybrid encoder design, which efficiently processes multi-scale features by decoupling intra-scale interaction and cross-scale fusion.
Furthermore, Deformable DETR [24] addresses the slow convergence and limited feature resolution issues inherent in original DETR architectures. By attending to a small set of key sampling points around reference points, Deformable DETR achieves better performance than DETR with a 10× reduction in training epochs, which particularly benefits the detection of small fruits in dense orchard environments.
Table 10 provides a comprehensive performance summary of these advanced Transformer and DETR-based architectures when applied to fruit ripeness and detection tasks.
4.3. Quantization Techniques Performance Synthesis
Post-training quantization (PTQ) consistently demonstrates substantial efficiency gains across various architectures, typically yielding a 2–4× acceleration with minimal accuracy drops, often within 1–3% [
22]. Quantization strategies must account for color-sensitive ripeness indicators, which are critical for accurate fruit grading. For instance, detecting subtle color gradients in mangoes, where the transition from green to yellow or red signifies different ripeness stages, or identifying the specific orange-hue intensity in citrus, requires high precision in feature representation. Loss of bit-depth during aggressive quantization could blur these subtle transitions, leading to misclassification in the field.
Quantization-aware training (QAT) significantly mitigates this accuracy loss, with relevant studies reporting a 0.5–1.5% improvement in accuracy compared to post-training approaches for equivalent bit widths [
26]. The Learned Step Size Quantization (LSQ) method, in particular, has proven effective for fruit detection models by dynamically adapting to the unique activation distributions encountered in agricultural imagery.
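The core of Learned Step Size Quantization can be sketched as a quantizer whose step size receives a gradient through the straight-through estimator. The sketch below follows the commonly published LSQ formulation for a single scalar; the helper names and sample values are hypothetical, the gradient scaling factor used in full implementations is omitted, and Python's `round` applies round-half-to-even at exact ties.

```python
def _bounds(num_bits, signed=True):
    """Return (Q_N, Q_P), the magnitudes of the lowest and highest codes."""
    if signed:
        return 2 ** (num_bits - 1), 2 ** (num_bits - 1) - 1
    return 0, 2 ** num_bits - 1


def lsq_quantize(v, step, num_bits=4, signed=True):
    """Forward pass: scale by the step size, clip, round, and rescale."""
    q_n, q_p = _bounds(num_bits, signed)
    clipped = max(-q_n, min(q_p, v / step))
    return round(clipped) * step


def lsq_step_grad(v, step, num_bits=4, signed=True):
    """d(v_hat)/d(step), treating round() as identity in the backward pass."""
    q_n, q_p = _bounds(num_bits, signed)
    ratio = v / step
    if ratio <= -q_n:
        return float(-q_n)
    if ratio >= q_p:
        return float(q_p)
    return round(ratio) - ratio


# Gradient of the mean squared reconstruction error with respect to the step.
values = [0.05, 0.26, -0.41, 0.9, 1.7]  # illustrative activations
step = 0.1
grad = sum(2 * (lsq_quantize(v, step) - v) * lsq_step_grad(v, step)
           for v in values) / len(values)
print("d(loss)/d(step) =", grad)
```

With these values the gradient is negative, so gradient descent would enlarge the step: two of the activations fall outside the 4-bit range and are clipped, and a larger step reduces that clipping error.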
A notable advancement is the Q-DETR framework [
25], which addresses quantization-induced performance degradation through Distribution Rectification Distillation (DRD). This sophisticated approach achieves 39.4% AP with 4-bit quantization, representing only a 2.6% performance degradation compared to the full-precision model while enabling a 6.6× theoretical acceleration.
Furthermore, mixed precision quantization strategies emerge as optimal solutions for Transformer architectures. This is primarily because attention mechanisms often require higher precision (≥6 bits) to maintain performance, while other components can operate effectively at lower precision [
16]. This approach optimally balances computational efficiency with detection accuracy, which is crucial for complex ripeness classification tasks.
Table 11 provides a comprehensive summary of the efficacy and tradeoffs associated with various quantization methods for edge deployment in agricultural AI.
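The mixed-precision idea described above, with attention kept at a higher bit-width than other components, can be sketched as a per-layer bit assignment followed by a size estimate. The layer names, parameter counts, and the 8-bit/4-bit split are hypothetical values chosen for illustration.

```python
def assign_bits(layers, attention_bits=8, default_bits=4):
    """Give attention layers a higher bit-width than all other layers."""
    return {name: attention_bits if info["kind"] == "attention" else default_bits
            for name, info in layers.items()}


def model_size_mb(layers, bit_widths):
    """Weight storage in MB (10^6 bytes) for the given per-layer bit-widths."""
    total_bits = sum(info["params"] * bit_widths[name]
                     for name, info in layers.items())
    return total_bits / 8 / 1e6


# Hypothetical detector with three blocks (parameter counts are made up).
layers = {
    "backbone_conv": {"kind": "conv", "params": 2_000_000},
    "encoder_attn": {"kind": "attention", "params": 1_000_000},
    "encoder_ffn": {"kind": "linear", "params": 3_000_000},
}
fp32 = {name: 32 for name in layers}
mixed = assign_bits(layers)

print("FP32 size (MB): ", model_size_mb(layers, fp32))
print("mixed size (MB):", model_size_mb(layers, mixed))
```

For this toy configuration the weights shrink from 24 MB to 3.5 MB while the attention block stays at 8 bits; the actual accuracy impact would have to be measured on the target model.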
4.4. Edge Deployment Performance and Hardware Considerations
The synthesis reveals significant performance variations across edge hardware platforms. The NVIDIA Jetson series demonstrates superior performance for complex models, with the Jetson Xavier NX achieving 28 FPS for YOLOv5-CS citrus detection [4], while Raspberry Pi platforms provide cost-effective solutions for less computationally intensive applications.
Power consumption emerges as a critical constraint in field deployments, with studies reporting operational ranges of 5–15 W for continuous monitoring applications [4,14]. Energy-efficient architectures like MobileNetV3 with INT4 quantization operate at 2.5 W on Edge TPU platforms, enabling extended battery-operated deployments in remote agricultural settings.
Real-time performance requirements (latency < 30 ms, FPS > 30) are not met uniformly across the reviewed models: YOLO Granada [2], for example, processes 8.66 images per second while compressing parameters to 54.7% of the original network. Together with the 28 FPS reported for YOLOv5-CS, these results indicate that real-time fruit ripeness detection on commercially available edge devices is feasible, although the achievable throughput depends strongly on the model and platform.
4.5. Quantization and Compression Strategies Analysis
Mixed precision quantization demonstrates superior performance preservation compared to standard uniform approaches. For instance, the Q-ViT framework [
13] achieves 93.16% accuracy with 2-bit quantization on muskmelon detection by utilizing differentiable quantization to maintain critical attention mechanisms at higher precision while aggressively compressing linear operations. This technique reduces the model size to 4.75 MB while achieving a 1.5 ms inference time, representing an optimal accuracy–efficiency tradeoff for edge deployment.
Furthermore, triple compression techniques, which integrate knowledge distillation, pruning, and quantization, yield remarkable and synergistic efficiency gains. Studies implementing this unified approach report a substantial 68–74% parameter reduction with only 2.1–3.8% accuracy degradation [
7,
28]. The synergistic combination allows for aggressive pruning of redundant parameters, while distillation preserves critical feature representations, collectively enabling an average of 4.2× speedup on common edge hardware platforms.
Table 12 provides a detailed summary of the performance and efficacy of these advanced compression strategies, including mixed precision and triple compression, in edge computing environments.
4.6. Hybrid and Knowledge Distillation Architectures
Teacher–student frameworks significantly enhance lightweight model performance through knowledge transfer. For instance, Distilling DETR [7] demonstrates that fine-grained feature imitation reduces the performance drop from 12.3% to 3.8% when compressing large Transformer models into edge-compatible sizes. The methodology focuses on imitating feature responses near object regions, yielding a 15% mAP improvement for student models on specific fruit detection tasks.
Generative augmentation effectively addresses critical dataset limitations in agricultural applications. GAN-based approaches [
9,
13] have successfully generated synthetic hyperspectral reflectance data, leading to an improvement in grape maturity classification accuracy from 84% to 91% when augmenting the training datasets. This approach proves particularly valuable for rare ripeness stages, substantially reducing the reliance on extensive manual data collection.
Furthermore, the integration of lightweight backbones into Transformer architectures enables efficient edge deployment without significant performance loss. MobileNetV2-DETR hybrids, for example, achieve 89.7% mAP while reducing computational requirements by 43% compared to standard ResNet backbones [30]. The combination of efficient feature extraction from lightweight CNNs with the relational reasoning capabilities of Transformers creates optimal architectures for edge-based fruit detection.
Table 13 provides a comprehensive summary of the performance benefits derived from knowledge distillation, generative augmentation, and hybrid architecture integration.
4.7. Evaluation Metrics and Benchmark Analysis
Metric selection critically impacts performance assessment, particularly in imbalanced agricultural datasets. The Matthews Correlation Coefficient (MCC) [
31] provides a superior evaluation measure for the classification of fruit ripeness, as it inherently accounts for all four confusion matrix categories and effectively mitigates optimistic bias prevalent in scenarios with significant class imbalance. For instance, in datasets where overripe fruits are rare—representing a minority class—the F1-score may provide an inflated sense of performance by focusing primarily on positive predictions and ignoring the true negative rate. In such concrete scenarios, MCC avoids overestimating the model’s capabilities, unlike the F1-score, as it requires high performance across all categories to yield a high score. Studies consistently demonstrate that MCC values (e.g., 0.872 for tomato ripeness classification) offer a more realistic performance assessment compared to metrics like the F1-score (e.g., 0.912) in these challenging domains.
Furthermore, public datasets exhibit significant variability in complexity and direct applicability. While generic collections like the MangoYOLO dataset [
32] provide comprehensive orchard imagery with 1515 images across multiple orchards, specialized resources such as the Citrus Ripeness Collection [
1] offer domain-specific challenges that include severe occlusion and highly varying illumination conditions.
Table 14 proposes a unified benchmark, incorporating rigorous metric selection and diverse dataset characteristics, necessary for the equitable evaluation of edge deployable fruit detection models.
4.8. Critical Performance Trade-Offs and Optimization Insights
The synthesis reveals consistent accuracy–latency tradeoffs across optimization strategies. Mixed precision quantization provides the optimal balance with 1.84% accuracy degradation for 3.8× speedup, while aggressive pruning achieves higher compression (82% memory reduction) at the cost of greater accuracy loss (3.45%). Analysis of these trends suggests a “quantization ceiling” for Transformer-based fruit detection; while CNNs exhibit linear degradation, Transformers show a non-linear “accuracy cliff” when moving below 8-bit precision due to the high sensitivity of the Softmax-based attention maps to bit-truncation.
Hardware-specific optimization emerges as crucial for deployment success. NVIDIA Jetson platforms demonstrate superior performance for transformer architectures, due to their dedicated Tensor Cores, which are optimized for the matrix multiplications inherent in Multi-Head Attention (MHA). In contrast, Raspberry Pi systems benefit more from lightweight CNN backbones, as they rely on general-purpose CPU instructions that struggle with the quadratic complexity of global self-attention. Energy consumption varies significantly, with optimized models achieving 2.5–10 W operational ranges depending on hardware selection and inference frequency requirements.
The characteristics of the dataset substantially influence model selection and performance. A clear trend is observed where model robustness is positively correlated with dataset environmental diversity rather than just image volume. Models trained on diverse, multi-orchard datasets like MangoYOLO [
32] demonstrate better generalization, while specialized datasets enable higher accuracy on specific fruit types but may suffer from overfitting without adequate augmentation strategies.
4.9. Cross-Architecture and Bit-Width Trend Analysis
A cross-study synthesis of the reviewed literature reveals distinct performance patterns between Transformer-based and CNN-based architectures when subjected to aggressive quantization.
4.9.1. The Accuracy Cliff at Sub-8-Bit Precision
Our analysis identifies a non-linear ‘accuracy cliff’ specifically in Transformer models. While transitioning from FP32 to INT8 typically results in negligible accuracy loss (<1.5%), moving to 4-bit precision causes a performance drop that is 2–3× more severe in Transformers compared to lightweight CNNs like MobileNet. This is attributed to the high sensitivity of Multi-Head Attention (MHA) weight distributions, which are less uniform than standard convolutional weights.
4.9.2. Hardware-Specific Scaling
A clear trend emerges regarding hardware efficiency: Transformers gain more significant speedups (up to 8×) on NVIDIA Jetson platforms due to optimized Tensor Core utilization for matrix multiplication. Conversely, CNN-based models scale more linearly on CPU-based devices like Raspberry Pi.
4.9.3. Impact of Distillation on Low-Bit Regimes
The reviewed studies indicate that Knowledge Distillation (KD) is effectively required for sub-8-bit quantization. Models using KD maintain 94% of their original mAP at 4-bit, whereas non-distilled models drop to nearly 70%.
4.10. Mitigation Strategies for Multi-Head Attention (MHA) Quantization
Multi-Head Attention (MHA) represents a primary quantization bottleneck due to the high dynamic range of its activation distributions. To mitigate performance loss, several targeted strategies have been developed. Mixed-precision quantization provides a balanced solution by maintaining sensitive attention layers in INT8 or FP16 while quantizing less critical components to lower bit-widths. Alternatively, attention-aware quantization utilizes saliency-based re-scaling to preserve the precision of critical key-query interactions. Comparative analysis indicates that while mixed-precision is more compatible with standard edge hardware like NVIDIA Jetson, attention-aware schemes offer superior accuracy retention in sub-8-bit regimes by directly addressing the non-uniformity of attention scores.
Table 15 provides a structured comparison of these mitigation strategies, highlighting their primary mechanisms and implementation tradeoffs.
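The sensitivity of attention to low bit-widths can be illustrated by uniformly quantizing a single row of softmax probabilities. This is a toy demonstration with made-up scores and a plain uniform quantizer on [0, 1]; it is not the attention-aware scheme discussed above.

```python
import math


def softmax(scores):
    """Numerically stable softmax over one row of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def uniform_quantize_unit(p, num_bits):
    """Uniformly quantize a value in [0, 1] to 2^num_bits - 1 steps."""
    levels = 2 ** num_bits - 1
    return round(p * levels) / levels


scores = [8.0, 2.0, 1.0, 0.0]  # one dominant key and three weak ones
probs = softmax(scores)
survivors = {}
for bits in (8, 4):
    quantized = [uniform_quantize_unit(p, bits) for p in probs]
    survivors[bits] = sum(1 for q in quantized if q > 0)
    print(f"{bits}-bit: {survivors[bits]} non-zero attention weights")
```

At 4 bits only the dominant weight survives and the weaker keys collapse to zero, whereas 8 bits still preserves part of the tail, which is one intuition for keeping attention layers at higher precision.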
5. Discussion and Research Gaps
5.1. Comparative Architecture Analysis
The systematic analysis reveals distinct performance patterns across the architectural paradigms investigated. Transformer-based architectures demonstrate superior robustness and performance in complex agricultural environments characterized by severe occlusion and varying illumination, exemplified by ORD-YOLO achieving 96.92% mAP for citrus detection in challenging conditions [1]. Conversely, lightweight CNNs maintain significant advantages in computational efficiency, with MobileNet variants achieving 15 FPS on Raspberry Pi 4 while consuming only 5 W of power [3]. Consequently, hybrid approaches emerge as optimal solutions, successfully balancing CNN efficiency with the superior relational reasoning of Transformers, as evidenced by the MobileNetV2-DETR model, which reaches 89.7% mAP with a 43% reduction in computational requirements [30].
Furthermore, the performance gap between quantized and full precision models is narrowing significantly due to advanced compression techniques. Q-DETR, for instance, demonstrates only a 2.6% performance degradation at 4-bit quantization while enabling a 6.6× theoretical acceleration [
25]. This level of efficiency represents a substantial improvement over early quantization approaches that often suffered 8–12% accuracy drops at similar bit widths, indicating the rapid maturation of quantization technologies specifically for resource-constrained agricultural applications.
Table 16 presents a detailed, comparative summary of these architectural and compression trade-offs for robust edge deployment.
5.2. Critical Research Gaps
5.2.1. Domain-Specific Dataset Limitations
Current datasets exhibit significant bias toward temperate climate fruits, with tomatoes, citrus, and apples comprising 67% of research focus. Tropical fruits like durian, rambutan, and mangosteen remain severely underrepresented, creating a substantial generalization gap [
32,
33]. This limitation impedes the development of robust models for global agricultural applications and highlights the need for diversified, geographically balanced datasets.
5.2.2. Multi-Head Attention Quantization Instability
The quantization of Multi-Head Attention (MHA) modules presents persistent challenges, with studies reporting 3–5× higher sensitivity compared to convolutional layers [
13,
25]. Attention mechanisms require higher precision (≥6 bit) to maintain performance, limiting the effectiveness of uniform quantization strategies. This instability manifests as significant performance degradation in complex detection scenarios where attention plays a crucial role in handling occlusion and scale variation [
20].
5.2.3. Performance Disparity in Real-World Field Validation
A primary limitation identified in this review is the disparity between laboratory performance and real-world operational reliability. Currently, only 23% of the included studies conducted evaluations on physical edge hardware in field settings [
4,
34]. While laboratory benchmarks provide idealized peak performance, they often overlook the impact of environmental variability on inference metrics. Field conditions—characterized by fluctuating lighting and background clutter—increase the computational demand for pre-processing and robust feature extraction. This shift typically results in a 15–20% reduction in effective FPS and less stable power consumption patterns compared to controlled laboratory environments. This validation gap suggests that lab-based proofs-of-concept may not fully reflect the operational robustness required for industrial deployment.
Table 17 provides a quantified assessment of these research gaps and their potential impact on model generalizability in authentic agricultural settings.
5.2.4. Challenges in Heterogeneous Hardware Benchmarking
A significant challenge in comparing performance across the literature is the heterogeneity of hardware platforms. Differences in memory bandwidth, specialized accelerators (e.g., NVIDIA Tensor Cores vs. ARM CPUs), and software frameworks (TensorRT vs. TFLite) prevent a direct comparison of inference latency and power efficiency. Consequently, a speedup achieved on one device may not be replicable on another. To address this, future research should adopt normalized metrics, such as operations per watt, or utilize standardized benchmarking suites to ensure equitable performance assessments across diverse edge-computing architectures in agriculture.
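A normalized metric of the kind suggested above can be sketched as throughput per watt. The sketch uses FPS per watt as a simple stand-in for operations per watt; the function name is hypothetical, and the Jetson power figure is an assumed value, since the 28 FPS result is reported in this review without a wattage.

```python
def fps_per_watt(fps, power_w):
    """Frames per second delivered per watt of power draw."""
    if power_w <= 0:
        raise ValueError("power_w must be positive")
    return fps / power_w


reports = [
    # Figures reported in this review for MobileNetV2 on a Raspberry Pi 4.
    {"model": "MobileNetV2", "device": "Raspberry Pi 4", "fps": 15.0, "power_w": 5.0},
    # Assumed power draw (hypothetical): only the FPS figure is reported.
    {"model": "YOLOv5-CS", "device": "Jetson Xavier NX", "fps": 28.0, "power_w": 10.0},
]
ranked = sorted(reports,
                key=lambda r: fps_per_watt(r["fps"], r["power_w"]),
                reverse=True)
for r in ranked:
    print(r["model"], r["device"], fps_per_watt(r["fps"], r["power_w"]))
```

Such a ranking is only meaningful when accuracy, input resolution, and batch size are also reported, since FPS per watt alone does not capture task difficulty.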
5.2.5. Emerging Trends
Adaptive quantization approaches represent a promising direction, dynamically adjusting bit width based on input complexity and attention mechanism requirements [36]. Early implementations demonstrate 1.2–1.8× better accuracy–efficiency tradeoffs compared to static quantization schemes, particularly for variable agricultural environments.
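The idea of input-dependent bit width can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the method of [36]: the complexity proxy (standard deviation of the input), the threshold of 0.5, the choice between 4 and 8 bits, and all function names are hypothetical.

```python
import numpy as np


def quantize_uniform(x: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric uniform fake-quantization of x to the given bit width."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = float(np.max(np.abs(x)))
    if max_abs == 0.0:
        return x.copy()
    scale = max_abs / qmax
    # Round to the nearest integer level, then map back to real values.
    return np.clip(np.round(x / scale), -qmax, qmax) * scale


def select_bits(x: np.ndarray, threshold: float = 0.5) -> int:
    """Assumed complexity proxy: high-variance inputs get more bits."""
    return 8 if float(np.std(x)) > threshold else 4


def adaptive_quantize(x: np.ndarray, threshold: float = 0.5):
    """Pick a bit width per input, then quantize with it."""
    bits = select_bits(x, threshold)
    return quantize_uniform(x, bits), bits


rng = np.random.default_rng(0)
smooth = 0.1 * rng.standard_normal(1000)  # stand-in for a low-complexity input
busy = 2.0 * rng.standard_normal(1000)    # stand-in for a high-complexity input

for label, x in (("smooth", smooth), ("busy", busy)):
    q, bits = adaptive_quantize(x)
    print(label, bits, float(np.mean((q - x) ** 2)))
```

In a real detector the proxy would more plausibly be learned or derived from attention statistics; the fixed threshold here only shows where such a decision would sit in the pipeline.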
Hardware-aware Neural Architecture Search (NAS) emerges as a critical enabler for optimized edge deployment [35]. By incorporating hardware constraints directly into the architecture search process, these approaches can automatically discover optimal model configurations for specific edge platforms, potentially bridging the gap between algorithmic innovation and practical deployment.
Specialized edge transformer chips show substantial potential for addressing computational bottlenecks. Recent developments in attention-optimized hardware accelerators demonstrate 3.2–4.1× efficiency improvements for transformer inference compared to general-purpose edge processors [35,37]. These specialized architectures exploit the unique computational patterns of attention mechanisms to deliver superior performance per watt.
Federated learning frameworks offer solutions for dataset limitations while addressing privacy concerns in agricultural data collection [19]. By enabling model training across distributed edge devices without centralizing sensitive data, these approaches can facilitate the development of more robust and generalized models while respecting data sovereignty.
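A minimal sketch of the aggregation step may help make this concrete. The review does not specify an aggregation rule, so the example below assumes sample-count-weighted averaging in the style of the widely used FedAvg scheme; the function name, the two "orchard" clients, and their weights are hypothetical.

```python
import numpy as np


def federated_average(client_weights, client_sizes):
    """Average per-layer weights, weighting each client by its sample count.

    client_weights: list (one entry per client) of lists of np.ndarray layers.
    client_sizes:   number of local training samples held by each client.
    """
    total = float(sum(client_sizes))
    num_layers = len(client_weights[0])
    averaged = []
    for layer in range(num_layers):
        acc = np.zeros_like(client_weights[0][layer], dtype=np.float64)
        for weights, size in zip(client_weights, client_sizes):
            acc += (size / total) * weights[layer]
        averaged.append(acc)
    return averaged


# Two hypothetical orchards, each holding a one-layer "model".
orchard_a = [np.array([1.0, 1.0])]  # trained on 100 local images
orchard_b = [np.array([3.0, 3.0])]  # trained on 300 local images
global_weights = federated_average([orchard_a, orchard_b], [100, 300])
print(global_weights[0])  # → [2.5 2.5]
```

Only the weight arrays leave each device in this scheme; the images themselves stay local, which is the privacy property the paragraph above refers to.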
The integration of these emerging technologies presents a pathway toward truly efficient, robust, and scalable fruit ripeness detection systems capable of operating reliably in diverse agricultural environments while meeting the stringent constraints of edge deployment.
6. Future Directions
6.1. Unified Lightweight DETR Frameworks for Agriculture
The development of domain-optimized DETR variants represents a critical research direction for advancing edge AI in agriculture. Future work should prioritize the creation of unified frameworks that seamlessly integrate quantization-aware training, hardware-specific optimizations, and agricultural domain adaptations. These frameworks must leverage advanced techniques such as progressive quantization [36], which can dynamically adapt bit-precision to varying fruit textures or fluctuating lighting conditions, and dynamic token sparsification [38] to achieve true real-time performance while robustly maintaining detection accuracy across diverse fruit types and varying growth stages. Furthermore, the integration of Neural Architecture Search (NAS) under agricultural constraints [35] will enable the automatic discovery of optimal architectures specifically tailored to resource-constrained edge deployment scenarios.
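To show what token sparsification does mechanically, the sketch below keeps only the highest-scoring tokens before they reach later Transformer layers. It is a simplified assumption-laden example: DynamicViT [38] learns its importance scores with a prediction module, whereas here the scores are simply passed in, and the function name and `keep_ratio` parameter are hypothetical.

```python
from typing import Tuple

import numpy as np


def sparsify_tokens(tokens: np.ndarray, scores: np.ndarray,
                    keep_ratio: float) -> Tuple[np.ndarray, np.ndarray]:
    """Keep the top-scoring fraction of tokens, preserving their original order.

    tokens: array of shape (num_tokens, dim).
    scores: array of shape (num_tokens,), higher means more important.
    """
    num_tokens = tokens.shape[0]
    k = max(1, int(np.ceil(keep_ratio * num_tokens)))
    # argsort is ascending, so the last k indices are the top-k tokens.
    keep = np.sort(np.argsort(scores)[-k:])
    return tokens[keep], keep


tokens = np.arange(12, dtype=np.float64).reshape(6, 2)  # 6 tokens, dim 2
scores = np.array([0.1, 0.9, 0.3, 0.8, 0.2, 0.7])       # assumed importance
kept, indices = sparsify_tokens(tokens, scores, keep_ratio=0.5)
print(indices)     # → [1 3 5]
print(kept.shape)  # → (3, 2)
```

Because self-attention cost grows quadratically with the number of tokens, halving the token count in this way reduces the attention cost of subsequent layers to roughly a quarter.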
Table 18 outlines a comprehensive roadmap for the sequential and unified development of lightweight DETR architectures tailored for smart agricultural applications.
6.2. Explainable AI and Visual Interpretability
Future research must prioritize the development of interpretable Transformer architectures that provide transparent decision-making processes for automated fruit quality assessment. The integration of attention visualization mechanisms [39] and feature importance mapping will be crucial, enabling farmers and agricultural experts to understand and trust AI-driven ripeness predictions. Developing robust quantization-aware explainability methods that remain effective even under aggressive model compression is essential for achieving practical and trustworthy deployment in precision agriculture applications.
6.3. Edge Cloud Collaborative Architectures
Hierarchical computing frameworks that strategically leverage both edge and cloud resources present a highly promising direction for balancing real-time performance with advanced agricultural analytics. Future systems should implement adaptive workload distribution strategies [40], where edge devices handle time-critical detection tasks (e.g., real-time harvesting), while cloud resources perform more complex analyses, model retraining, and global updates. Research in this area must focus on developing communication-efficient protocols and federated learning approaches [19] that minimize bandwidth requirements while ensuring model freshness and consistency across distributed agricultural networks.
Table 19 details the essential specifications and design considerations required for the implementation of an effective Edge Cloud Collaboration Framework in precision agriculture.
6.4. Standardization of Datasets and Benchmarks
The establishment of comprehensive benchmarking standards is crucial for advancing the field. Future efforts should focus on developing unified evaluation protocols that encompass diverse fruit types, growth conditions, and deployment scenarios. These standards should include standardized dataset splits, consistent evaluation metrics (including MCC for imbalanced data [31]), and hardware performance baselines across multiple edge platforms.
Key standardization initiatives should address the following:
- (1) Dataset Diversity Requirements: Mandatory inclusion of tropical and subtropical fruits, multiple ripeness stages, and varied environmental conditions [33];
- (2) Evaluation Metric Uniformity: Adoption of comprehensive metrics, including mAP@0.5:0.95, MCC, FPS, and energy consumption [31];
- (3) Hardware Benchmarking Standards: Consistent testing across representative edge platforms (Jetson series, Raspberry Pi, edge TPUs) [34];
- (4) Real-World Validation Protocols: Standardized field testing procedures accounting for lighting variations, occlusion, and weather conditions [4].
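To illustrate why MCC appears among the recommended metrics for imbalanced data, the following sketch compares accuracy and MCC on a deliberately skewed example. It covers only the binary case (ripe vs. unripe); multi-class ripeness grading would need the generalized form. The class counts are invented for illustration.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Binary Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is set to 0 when any marginal total is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Hypothetical test set: 95 ripe (positive) and 5 unripe (negative) fruits.
# A degenerate classifier labels every fruit as ripe.
tp, tn, fp, fn = 95, 0, 5, 0
print(accuracy(tp, tn, fp, fn))  # → 0.95
print(mcc(tp, tn, fp, fn))       # → 0.0
```

The classifier never identifies an unripe fruit, yet accuracy reports 95%. MCC reports zero because it takes all four cells of the confusion matrix into account.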
6.5. Emerging Research Priorities
Cross-modal learning approaches that integrate visual data with non-visual sensors (hyperspectral, thermal, chemical) represent a promising frontier [5,7]. Future research should explore efficient fusion techniques that maintain edge compatibility while leveraging complementary data sources for improved ripeness assessment.
Self-supervised and semi-supervised learning methods tailored for agricultural applications can address data scarcity challenges [5]. Developing efficient pre-training strategies using unlabeled field data will reduce dependency on extensive manual annotation while improving model generalization across different agricultural environments.
Sustainable AI frameworks that consider environmental impact and long-term deployment viability will become increasingly important. Research should focus on energy-aware optimization, model lifetime management, and adaptive compression techniques that balance performance with operational sustainability in agricultural settings.
The successful pursuit of these research directions will enable the development of robust, efficient, and practical fruit ripeness detection systems that meet the real-world demands of modern precision agriculture while respecting the constraints of edge deployment environments.
7. Conclusions
This systematic review comprehensively evaluates the state-of-the-art in quantization-optimized lightweight transformer architectures for real-time fruit ripeness detection on edge devices. Our analysis of 22 high-quality studies highlights a rapidly maturing field, yet identifies critical gaps that must be addressed to unlock the full potential of AI-driven precision agriculture.
First, this review establishes that the synergistic combination of advanced quantization techniques and lightweight architectural innovations enables feasible real-time deployment. Methods such as mixed-precision quantization, learned step size quantization, and distribution rectification distillation effectively bridge the accuracy–efficiency gap, with modern approaches limiting performance degradation to 2–3% while achieving 3–8× speedup on edge hardware such as the NVIDIA Jetson series.
Second, we strongly advocate for the adoption of the Matthews Correlation Coefficient (MCC) as a primary evaluation metric, particularly for imbalanced agricultural datasets. Unlike traditional metrics that may provide overly optimistic assessments, MCC offers a more holistic and reliable measure of model performance across all ripeness categories, which is essential for practical deployment decisions.
Third, RT-DETR variants emerge as the most promising architectural foundation for next-generation agricultural AI systems. Their hybrid design, which efficiently decouples intra-scale interaction from cross-scale fusion, achieves an optimal balance between the relational reasoning strength of Transformers and the operational demands of edge deployment.
Author Contributions
Conceptualization, D.M. and R.K.R.; methodology, D.M. and R.K.R.; software, D.M.; validation, D.M. and R.K.R.; formal analysis, D.M.; writing—original draft preparation, R.K.R.; writing—review and editing, R.K.R.; supervision, R.K.R.; project administration, R.K.R. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Data Availability Statement
The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Huang, Z.; Li, X.; Fan, S.; Liu, Y.; Zou, H.; He, X.; Xu, S.; Zhao, J.; Li, W. ORD-YOLO: A Ripeness Recognition Method for Citrus Fruits in Complex Environments. Agriculture 2025, 15, 1711.
- Zhao, J.; Du, C.; Li, Y.; Mudhsh, M.; Guo, D.; Fan, Y.; Wu, X.; Wang, X.; Almodfer, R. YOLO-Granada: A lightweight attentioned Yolo for pomegranates fruit detection. Sci. Rep. 2024, 14, 16848.
- Martínez-Mora, O.; Capuñay-Uceda, O.; Caucha-Morales, L.; Sánchez-Ancajima, R.; Ramírez-Morales, I.; Córdova-Márquez, S.; Cuenca-Mayorga, F. Artificial Vision-Based Dual CNN Classification of Banana Ripeness and Quality Attributes Using RGB Images. Processes 2025, 13, 1982.
- Lyu, S.; Li, R.; Zhao, Y.; Li, Z.; Fan, R.; Liu, S. Green Citrus Detection and Counting in Orchards Based on YOLOv5-CS and AI Edge System. Sensors 2022, 22, 576.
- Lyu, H.; Grafton, M.; Ramilan, T.; Irwin, M.; Sandoval, E. Synthetic hyperspectral reflectance data augmentation by generative adversarial network to enhance grape maturity determination. Comput. Electron. Agric. 2025, 235, 110341.
- Kong, X.; Li, X.; Zhu, X.; Guo, Z.; Zeng, L. Detection Model Based on Improved Faster-RCNN in Apple Orchard Environment. Intell. Syst. Appl. 2024, 21, 200325.
- Wang, T.; Yuan, L.; Zhang, X.; Feng, J. Distilling Object Detectors with Fine-grained Feature Imitation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 4933–4942.
- Dakhli, I.; Sedqui, A.; Derrhi, M.; Karroumi, B. Artificial Intelligence and Crisis Management: A Systematic Literature Review using PRISMA. In Proceedings of the 2024 4th International Conference on Innovative Research in Applied Science, Engineering and Technology (IRASET), Fez, Morocco, 13–15 May 2024; pp. 1–6.
- Rahman, Z.U.; Asaari, M.S.M.; Ibrahim, H.; Abidin, I.S.Z.; Ishak, M.K. Generative Adversarial Networks (GANs) for Image Augmentation in Farming: A Review. IEEE Access 2024, 12, 179912–179943.
- Kolesnikov, A.; Beyer, L.; Zhai, X.; Puigcerver, J.; Yung, J.; Gelly, S.; Houlsby, N. Big Transfer (BiT): General Visual Representation Learning. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 491–507.
- Kamilaris, A.; Prenafeta-Boldú, F.X. Deep learning in agriculture: A survey. Comput. Electron. Agric. 2018, 147, 70–90.
- Alba, A.; Villaverde, J.; Lacara, A.; Domingo, J.; Aguirre, D. Optimized FLOPs-Aware Knowledge Distillation for TinyML Applications in Agriculture. In 2025 International Conference on Advancement in Data Science, E-Learning and Information System (ICADEIS), Bandung, Indonesia, 3–4 February 2025; IEEE: Piscataway, NJ, USA, 2025.
- Li, Z.; Yang, T.; Wang, P.; Cheng, J. Q-ViT: Fully Differentiable Quantization for Vision Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 1173–1182.
- Suharjito, S.; Elwirehardja, G.N.; Prayoga, J.S. Oil palm fresh fruit bunch ripeness classification on mobile devices using deep learning approaches. Comput. Electron. Agric. 2021, 188, 106359.
- Shea, B.J.; Reeves, B.; Wells, G.A.; Thuku, M. AMSTAR 2: A critical appraisal tool for systematic reviews that include randomised or non-randomised studies of healthcare interventions, or both. BMJ 2017, 358, j4008.
- Liu, Z.; Cheng, K.T.; Huang, D.; Xing, E.; Shen, Z. Nonuniform-to-Uniform Quantization: Towards Accurate Quantization via Generalized Straight-Through Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 4942–4952.
- Park, C.; Yun, S.; Chun, S. A Unified Analysis of Mixed Sample Data Augmentation: A Loss Function Perspective. In Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 28 November–9 December 2022.
- Lin, Y.; Zhang, T.; Sun, P.; Li, Z.; Zhou, S. FQ-ViT: Post-Training Quantization for Fully Quantized Vision Transformer. In Proceedings of the 31st International Joint Conference on Artificial Intelligence (IJCAI), Vienna, Austria, 23–29 July 2022; pp. 1173–1179.
- Khandelwal, A.; Yun, T.; Nayak, N.V.; Merullo, J.; Bach, S.H.; Sun, C.; Pavlick, E. $100K or 100 Days: Trade-offs when Pre-Training with Academic Resources. arXiv 2024, arXiv:2410.23261.
- Wang, R.; Sun, H.; Yang, L.; Lin, S.; Liu, C.; Gao, Y.; Hu, Y.; Zhang, B. AQ-DETR: Low-Bit Quantized Detection Transformer with Auxiliary Queries. In Proceedings of the 38th AAAI Conference on Artificial Intelligence (AAAI-24), Vancouver, BC, Canada, 20–27 February 2024; pp. 15598–15606.
- Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-time Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 16965–16974.
- Jacob, B.; Kligys, S.; Chen, B.; Zhu, M.; Tang, M.; Howard, A.; Adam, H.; Kalenichenko, D. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 2704–2713.
- Liu, Z.; Wu, B.; Luo, W.; Yang, X.; Liu, W.; Cheng, K.-T. Bi-Real Net: Enhancing the Performance of 1-bit CNNs with Improved Representational Capability and Advanced Training Algorithm. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 722–737.
- Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 3–7 May 2021.
- Xu, S.; Li, Y.; Lin, M.; Gao, P.; Guo, G.; Lü, J.; Zhang, B. Q-DETR: An Efficient Low-Bit Quantized Detection Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 3842–3851.
- Esser, S.K.; McKinstry, J.L.; Bablani, D.; Appuswamy, R.; Modha, D.S. Learned Step Size Quantization. In Proceedings of the International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 26–30 April 2020.
- Fang, J.; Shafiee, A.; Abdel-Aziz, H.; Thorsley, D.; Georgiadis, G.; Hassoun, J.H. Post-Training Piecewise Linear Quantization for Deep Neural Networks. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 69–86.
- Han, S.; Mao, H.; Dally, W.J. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. In Proceedings of the International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2–4 May 2016.
- Setyanto, A.; Sasongko, T.B.; Fikri, M.A.; Ariatmanto, D.; Agastya, I.M.A.; Rachmanto, R.D.; Ardana, A.; Kim, I.K. Knowledge Distillation in Object Detection for Resource-Constrained Edge Computing. IEEE Access 2025, 13, 18456–18471.
- Li, F.; Zeng, A.; Liu, S.; Zhang, H.; Li, H.; Zhang, L.; Ni, L.M. Lite DETR: An Interleaved Multi-Scale Encoder for Efficient DETR. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 18558–18567.
- Chicco, D.; Jurman, G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genom. 2020, 21, 6.
- Koirala, A.; Walsh, K.B.; Wang, Z.; McCarthy, C. Deep learning—Method overview and review of use for fruit detection and yield estimation. Comput. Electron. Agric. 2019, 162, 219–234.
- Sikder, M.S.; Islam, M.S.; Islam, M.; Reza, M.S. Improving Mango Ripeness Grading Accuracy: A Comprehensive Analysis of Deep Learning, Traditional Machine Learning, and Transfer Learning Techniques. Preprints 2024, 4837005.
- Swaminathan, T.P.; Silver, C.; Akilan, T.; Kumar, J. Benchmarking Deep Learning Models on NVIDIA Jetson Nano for Real-Time Systems: An Empirical Investigation. Procedia Comput. Sci. 2025, 260, 906–913.
- Rati, G.; Kwon, R.; El-Farouk, A.; Silvestri, M.; Sundaram, C. Building Edge-Native Architectures for Foundation Models. Preprints 2024, 2024, 5350027.
- Tai, Y.S.; Wu, A.Y. AMP-ViT: Optimizing Vision Transformer Efficiency with Adaptive Mixed-Precision Post-Training Quantization. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 27 February–3 March 2025; pp. 6828–6837.
- Wei, C.C.; Wang, C.H.; Pong, E.Y.; Chen, C.H. Low-Cost Non-Linear Function Approximation for Transformer Deployment on Edge NPU. In Proceedings of the 2025 IEEE International Conference on Communication Technology-Pacific (ICCT-Pacific), Matsue, Japan, 29–31 March 2025; IEEE: Piscataway, NJ, USA; pp. 1–6.
- Rao, Y.; Zhao, W.; Liu, B.; Lu, J.; Zhou, J.; Hsieh, C.-J. DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification. In Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), Virtual, 6–14 December 2021; pp. 13937–13949.
- Gong, J.; Chen, T. Deep Configuration Performance Learning: A Systematic Survey and Taxonomy. ACM Trans. Softw. Eng. Methodol. 2024, 34, 1–62.
- Hao, J.; Subedi, P.; Ramaswamy, L.; Kim, I.K. Reaching for the Sky: Maximizing Deep Learning Inference Throughput on Edge Devices with AI Multi-Tenancy. ACM Trans. Internet Technol. 2023, 23, 1–33.
- Sun, H.; Zhang, S.; Tian, X.; Zou, Y. Pruning DETR: Efficient End-to-End Object Detection with Sparse Structured Pruning. Signal Image Video Process. 2024, 18, 129–135.
Figure 1. PRISMA 2020 flow diagram showing the systematic literature selection process.
Table 1. Search strategy and data source.
| Database | Records Identified | Key Search Keywords |
|---|---|---|
| Scopus | 42 | “Transformer” OR “DETR”; “quantization” OR “pruning”; “fruit ripeness” OR “agriculture”; “edge device” OR “real-time” |
| IEEE Xplore | 18 | “lightweight Transformer” OR “Q-DETR”; “quantization aware training” OR “knowledge distillation”; “fruit detection” OR “mango ripeness” |
| ScienceDirect | 28 | “quantization aware training” OR “pruning”; “fruit detection” OR “agricultural vision”; “edge AI” OR “mobile deployment” |
| MDPI | 22 | |
| SpringerLink | 35 | |
| Web of Science | 15 | “RT-DETR” OR “vision transformer”; “knowledge distillation” OR “model compression”; “edge AI” OR “real-time detection” |
Table 2. Inclusion and exclusion criteria.
| Category | Criteria | Application |
|---|---|---|
| Inclusion Criteria |
| I1 | Peer-reviewed journal or conference proceedings in English | Applied to all 160 initial records |
| I2 | Publication date between January 2018 and June 2025 | Applied during database search |
| I3 | Focus on fruit ripeness detection or classification | Excluded 8 studies during full-text review |
| I4 | Explicit discussion of model optimization for edge deployment | Excluded 5 studies during full-text review |
| Exclusion Criteria |
| E1 | Review articles, surveys, theses, or non-peer-reviewed works | Excluded during initial screening |
| E2 | Traditional computer vision without deep learning | Excluded 3 studies during full-text review |
| E3 | Not addressing edge computing constraints | Excluded 1 study during full-text review |
Table 3. Quality assessment results based on a modified AMSTAR 2 for engineering/AI studies.
| Quality Dimension | Assessment Criteria | Studies Meeting Criteria (n = 22) | Percentage |
|---|---|---|---|
| Experimental Rigor | Clear methodology, reproducible experiments, adequate dataset description | 18 | 81.8% |
| Performance Reporting | Comprehensive metrics (accuracy, latency, model size, power consumption) | 16 | 72.7% |
| Hardware Specifications | Detailed edge device specifications and deployment environment | 14 | 63.6% |
| Statistical Validity | Appropriate statistical tests, confidence intervals, significance reporting | 12 | 54.5% |
| Comparative Analysis | Comparison with baseline methods or state-of-the-art approaches | 20 | 90.9% |
| Limitations Discussion | Clear acknowledgment of study limitations and constraints | 15 | 68.2% |
Table 4. Study selection process outcomes.
| Selection Phase | Records | Exclusion Reasons |
|---|---|---|
| Initial Identification | 160 records from 6 databases | - |
| After Duplicate Removal | 128 records | 32 duplicates removed |
| Title/Abstract Screening | 39 records | 89 records excluded: Irrelevant scope (62), Review articles (18), Non-English (9) |
| Full-Text Assessment | 22 studies included | 17 records excluded: No fruit ripeness focus (8), No quantization techniques (5), No transformer architectures (3), No edge deployment (1) |
| Final Inclusion | 22 studies | - |
Table 5. Data extraction protocol.
| Extraction Category | Specific Data Points | Extraction Method |
|---|---|---|
| Model Architecture | | Direct extraction from methodology sections |
| Quantization Methods | | Categorized based on technical descriptions |
| Performance Metrics | Accuracy/mAP; inference latency; FPS rate; memory footprint; energy consumption | Numerical extraction from results sections |
| Hardware Platform | Edge device type; processor specifications; memory constraints; deployment environment | Technical specification compilation |
| Dataset Characteristics | Fruit types covered; dataset size; image resolution; annotation type | Systematic categorization |
Table 6. Extracted data distribution from 22 studies.
| Data Category | Studies Reporting (n = 22) | Completeness Rate | Key Findings |
|---|---|---|---|
| Architecture Details | 22 | 100% | RT-DETR variants dominant (8 studies), ViT based (6 studies) |
| Quantization Specifications | 20 | 90.9% | 4-bit quantization most common (9 studies), mixed precision emerging |
| Performance Metrics | 18 | 81.8% | Average accuracy drop: 2.3%, Speedup: 3.2× typical |
| Hardware Details | 14 | 63.6% | NVIDIA Jetson dominant platform (10 studies) |
| Dataset Information | 19 | 86.4% | Most common: Tomato (7), Citrus (5), Mixed fruits (4) |
Table 7. Synthesis outcomes.
| Synthesis Focus | Analysis Approach | Identified Patterns |
|---|---|---|
| Accuracy Speed Trade-offs | Correlation analysis between quantization level and performance | 4-bit optimum for most applications |
| Hardware Compatibility | Cross-platform performance comparison | ARM CPUs show better quantization tolerance |
| Fruit-specific Optimization | Per-category performance analysis | Citrus detection benefits most from quantization |
| Architecture Efficiency | Parameter vs. accuracy analysis | Hybrid CNN-Transformer is the most efficient |
Table 8. Performance comparison of lightweight CNN architectures for edge-based fruit ripeness detection.
| Architecture | Quantization Method | Target Fruit | Accuracy/mAP | Inference Time (ms) | Model Size (MB) | Hardware Platform | Power (W) |
|---|---|---|---|---|---|---|---|
| MobileNetV2 [10] | Float16 | Banana [3] | 97.0% | 66.7 | 9.1 | Raspberry Pi 4 | 5.0 |
| EfficientNetB0 [14] | Float16 | Oil Palm | 89.3% | 96.0 | 12.4 | Mobile GPU | 7.5 |
| YOLOv5 CS [4] | INT8 | Citrus | 98.2% | 37.0 | 45.2 | Jetson Xavier NX | 10.0 |
| SSDLite [21] | INT8 | Tomato | 92.4% | 42.5 | 18.7 | ARM CPU | 4.2 |
| YOLOv4 tiny [2] | INT8 | Pomegranate | 92.2% | 25.6 | 23.5 | Jetson Nano | 5.0 |
Table 9. Quantization impact analysis across precision levels for fruit detection models.
| Base Model | Precision | Accuracy Drop | Speedup Factor | Memory Reduction | Application Context | Ref. |
|---|---|---|---|---|---|---|
| MobileNetV2 | FP32 → INT8 | 2.1% | 3.2× | 75% | Real-time classification | [22] |
| EfficientNetB0 | FP32 → Float16 | 1.2% | 2.1× | 50% | Mobile deployment | [14] |
| YOLOv5s | FP32 → INT8 | 3.4% | 3.8× | 75% | Orchard monitoring | [4] |
| ResNet-50 | FP32 → INT4 | 5.7% | 5.2× | 87% | Batch processing | [23] |
Table 10. Transformer architecture performance for fruit ripeness detection.
| Architecture | Target Fruit | mAP@0.5 | FPS | Hardware | Model Size (MB) | Key Innovation | Reference |
|---|---|---|---|---|---|---|---|
| ORD-YOLO | Citrus | 96.92% | 74 | Edge GPU | 15.2 | ODConv + RepGFPN | [1] |
| RT-DETR | Multiple | 86.8% | 108 | T4 GPU | 42.5 | Hybrid encoder | [21] |
| Deformable DETR | Mixed fruits | 91.2% | 45 | Jetson Xavier | 38.7 | Deformable attention | [24] |
| Q-DETR | General | 89.4% | 62 | Edge device | 12.3 | Distribution rectification | [25] |
| YOLO Granada | Pomegranate | 92.2% | 58 | Mobile | 24.8 | ShuffleNetv2 backbone | [2] |
Table 11. Quantization method efficacy for edge deployment.
| Quantization Method | Bit Width | Accuracy Retention | Speedup | Hardware Efficiency | Best Suited Architecture | Reference |
|---|---|---|---|---|---|---|
| LSQ+ [26] | 4-bit | 96.8% | 3.8× | High | Vision Transformers | [26] |
| QAT [22] | 8-bit | 98.2% | 2.5× | Medium | CNN based detectors | [22] |
| PTQ [27] | 8-bit | 97.1% | 3.2× | High | YOLO variants | [27] |
| Mixed precision [16] | 4/8-bit | 98.5% | 3.0× | Medium | DETR architectures | [16] |
| Nonuniform to Uniform [16] | 4-bit | 95.7% | 4.1× | High | Mobile deployments | [16] |
Table 12. Performance of advanced compression strategies.
| Compression Method | Accuracy Drop | Latency Gain | Memory Reduction | Model Size (MB) | Hardware Platform | Reference |
|---|---|---|---|---|---|---|
| Q-ViT (Mixed-precision) | 1.84% | 3.8× | 76% | 4.75 | Jetson Nano | [13] |
| Triple Compression | 2.67% | 4.2× | 74% | 8.3 | Raspberry Pi 4 | [7] |
| KD + Quantization | 1.92% | 3.1× | 68% | 12.1 | Edge TPU | [29] |
| Pruning + Quantization | 3.45% | 4.8× | 82% | 6.8 | ARM CPU | [28] |
| Q-DETR (4-bit) | 2.60% | 6.6× | 75% | 15.4 | Mobile GPU | [25] |
Table 13. Knowledge distillation and hybrid architecture performance.
| Architecture | Teacher Model | Student Model | Accuracy/mAP | Compression Ratio | Inference Speed (FPS) | Reference |
|---|---|---|---|---|---|---|
| Distilling DETR | Faster R-CNN | Compressed DETR | 91.4% | 4.8:1 | 38 | [7] |
| MobileNetV2-DETR | Standard DETR | Hybrid DETR | 89.7% | 2.3:1 | 45 | [30] |
| EfficientNet-DETR | ResNet-50 DETR | Light DETR | 87.9% | 3.1:1 | 52 | [21] |
| GAN-Augmented | Baseline CNN | Augmented Model | 91.0% | N/A | 28 | [13] |
Table 14. Unified benchmark for edge deployable fruit detection models.
| Model Architecture | Dataset | mAP@0.5 | MCC | FPS | Model Size (MB) | Latency (ms) | Hardware |
|---|---|---|---|---|---|---|---|
| ORD-YOLO [1] | Citrus Dataset | 96.92% | 0.891 | 74 | 15.2 | 13.5 | Jetson Xavier |
| YOLO Granada [2] | Pomegranate Set | 92.20% | 0.845 | 58 | 24.8 | 17.2 | Mobile GPU |
| Q-DETR [25] | COCO Fruits | 89.40% | 0.812 | 62 | 12.3 | 16.1 | Edge Device |
| RT-DETR [21] | Mixed Fruits | 86.80% | 0.798 | 108 | 42.5 | 9.3 | T4 GPU |
| MobileNetV2-DETR [30] | Tomato360 | 89.70% | 0.834 | 45 | 18.7 | 22.2 | Raspberry Pi |
Table 15. Comparison of quantization strategies for multi-head attention (MHA).
| Strategy | Primary Mechanism | Advantages | Implementation Trade-Off |
|---|---|---|---|
| Mixed-Precision | Selective bit-widths per layer | Hardware-friendly | Complex search space |
| Attention-Aware | Weight re-scaling via saliency | High accuracy (<8-bit) | Requires saliency mapping |
| Log-Quantization | Non-linear mapping | Fits Softmax range | Computational overhead |
Table 16. Architecture comparison for edge deployment.
| Architecture Type | Average mAP | Inference Speed (FPS) | Model Size (MB) | Power Consumption | Best Use Case | Limitations |
|---|---|---|---|---|---|---|
| Lightweight CNN [3,14] | 91.8% | 45 | 15.2 | 5–7 W | Resource-constrained devices | Limited contextual reasoning |
| Transformer-based [1,21] | 94.2% | 38 | 42.5 | 8–12 W | Complex environments | Higher computational demand |
| Hybrid Architectures [30] | 92.7% | 42 | 28.3 | 6–9 W | Balanced applications | Integration complexity |
| Quantized Transformers [25] | 91.6% | 62 | 12.3 | 4–7 W | Real-time deployment | Quantization instability |
Table 17. Quantified research gaps and impact assessment.
| Research Gap | Quantitative Impact | Affected Metrics | Proposed Solutions | Priority Level |
|---|---|---|---|---|
| Tropical Fruit Dataset Scarcity | 78% studies focus on 3 fruit types | Generalization, mAP drop up to 15% | Federated learning, generative augmentation [5,9] | High |
| MHA Quantization Instability | 3–5× higher sensitivity than CNN layers | Accuracy drop 4–8% at low precision | Mixed precision [13], attention-aware quantization [16] | Critical |
| Limited Field Validation | Only 23% conduct real deployment tests | Performance variance up to 12% | Standardized field testing protocols | High |
| Hardware Dataset Mismatch | 45% use inappropriate hardware benchmarks | Latency underestimation 2–3× | Hardware-aware NAS [35] | Medium |
Table 18.
Roadmap for unified lightweight DETR development.
| Research Direction | Key Technologies | Expected Impact | Timeline | Key Challenges |
|---|---|---|---|---|
| Hardware-Aware DETR Optimization | NAS with edge constraints [35], mixed-precision quantization [16] | 3–5× speedup, 60% energy reduction | Short term (1–2 years) | Hardware diversity, compiler support |
| Domain-Adaptive Quantization | Input-aware bit-width allocation [36], attention-aware quantization [13] | 2–3% accuracy improvement at 4-bit | Medium term (2–3 years) | Training complexity, calibration overhead |
| Unified Agricultural DETR Framework | Modular architecture, Cross-fruit generalization [30] | 50% development time reduction | Long term (3–5 years) | Standardization, Dataset integration |
Table 19.
Edge-cloud collaboration framework specifications.
| Component | Functionality | Technical Requirements | Performance Targets | Implementation Challenges |
|---|---|---|---|---|
| Edge Layer | Real-time detection, local inference | Quantized Transformers, low-power optimization [34] | <30 ms latency, >30 FPS | Power constraints, memory limitations |
| Fog Layer | Multi-device coordination, data aggregation | Lightweight fusion algorithms [40] | <5 s model updates | Network reliability, synchronization |
| Cloud Layer | Model training, Global optimization | Federated learning [19], NAS | Weekly model improvements | Data privacy, Communication costs |
| Collaboration Protocol | Adaptive workload distribution | Dynamic compression [41] | 40% bandwidth reduction | Latency variance, Resource management |
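The edge-layer targets in Table 19 (<30 ms latency, >30 FPS) can be checked against measured per-frame latencies with a minimal sketch. It assumes frames are processed sequentially on a single stream, so throughput is approximated as 1000 divided by the mean latency in milliseconds. The function name `check_edge_targets` and the choice to compare the worst-case frame against the latency target are illustrative assumptions.

```python
def check_edge_targets(latencies_ms, max_latency_ms=30.0, min_fps=30.0):
    """Summarize per-frame latencies and compare them with the edge targets.

    Assumes sequential, single-stream inference: FPS = 1000 / mean latency.
    The worst-case frame is compared against the latency target (assumption).
    """
    if not latencies_ms:
        raise ValueError("at least one latency measurement is required")
    mean_ms = sum(latencies_ms) / len(latencies_ms)
    worst_ms = max(latencies_ms)
    fps = 1000.0 / mean_ms
    return {
        "mean_ms": mean_ms,
        "worst_ms": worst_ms,
        "fps": fps,
        "meets_targets": worst_ms < max_latency_ms and fps > min_fps,
    }


if __name__ == "__main__":
    # Hypothetical measurements, for illustration only.
    print(check_edge_targets([20.0, 25.0, 15.0]))
    print(check_edge_targets([20.0, 35.0, 20.0]))
```

Under the sequential assumption, a 30 ms per-frame latency corresponds to roughly 33.3 FPS, so the latency target is the stricter of the two. The FPS target becomes binding only when frames are batched or pipelined.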