Article

Adaptive CNN Ensemble for Apple Detection: Enabling Sustainable Orchard Monitoring

Federal Scientific Agroengineering Center VIM, 109428 Moscow, Russia
* Author to whom correspondence should be addressed.
AgriEngineering 2025, 7(11), 369; https://doi.org/10.3390/agriengineering7110369
Submission received: 8 September 2025 / Revised: 15 October 2025 / Accepted: 21 October 2025 / Published: 3 November 2025

Abstract

Accurate detection of apples in orchards under variable weather and illumination remains a key challenge for precision horticulture. This study presents a flexible framework for automated ensemble selection and optimization of convolutional neural network (CNN) inference. The system integrates eleven ensemble methods, dynamically configured via Pareto-based multi-objective optimization that balances accuracy (mAP, F1-Score) and performance (FPS). A key innovation is its pre-deployment benchmarking, whereby models are evaluated on a representative field sample to recommend a single optimal model or a lightweight ensemble for real-time use. Experimental results show that ensemble models consistently outperform individual detectors, achieving a 7–12% improvement in accuracy in complex scenes with occlusions and motion blur, underscoring the approach's value for sustainable orchard management.

1. Introduction

Modern horticulture faces increasing challenges related to the need for accurate and timely monitoring of biological objects—from pests and diseases to the assessment of plant conditions under changing climatic conditions. Traditional visual inspection methods, despite their simplicity, suffer from low throughput and are prone to subjective errors, particularly under adverse weather conditions such as rain, fog, and sun glare [1]. In this context, the implementation of automated systems based on neural network models, capable of ensuring robustness and accuracy across diverse environments, has become highly relevant [2,3,4].
In recent years, computer vision algorithms [5,6,7,8], including YOLO (You Only Look Once) architectures, have demonstrated their effectiveness in agricultural object detection tasks. For instance, in Ref. [9], a modified YOLOv5m version with integrated transformer mechanisms achieved a high detection accuracy for pests, reaching 96.4% mean Average Precision (mAP). However, even such advanced models may perform unreliably under varying illumination or in the presence of atmospheric disturbances, limiting their applicability in real-world scenarios. This underscores the necessity of developing model ensembles that can compensate for individual shortcomings by combining different architectures and training strategies.
The novelty of our approach lies in employing ensemble learning with automated model selection, primarily based on YOLO, to adapt to environmental variations. Effectively addressing this problem may require dynamic optimization algorithms that analyze real-time weather conditions and redistribute model weights accordingly. This minimizes errors caused by image artifacts such as glare or raindrops, a capability that is particularly critical in horticulture, where detection accuracy directly impacts the effectiveness of plant management [10].
Such solutions align with the AIoT (Artificial Intelligence of Things) trends, where the combination of IoT sensors and adaptive AI algorithms ensures resilience to external factors [11]. Implementing such a system can reduce crop losses by 15–20%, as pilot tests indicate, through early pathogen detection and optimized resource allocation.
Considering climate change and the increasing frequency of extreme weather events, effective monitoring of micro-ecological parameters in orchards is becoming especially important. Even within a single farm, significant fluctuations in temperature, humidity, illumination, and precipitation can occur depending on exposure, soil conditions, and agronomic practices. For planning pest and disease management, determining optimal harvest timing, and scheduling other agronomic operations, it is essential to integrate data from micro-meteorological sensor networks, weather forecasts, and plant phenological models. Our study accounts for these factors by simulating various illumination and weather scenarios (Clear morning; Clear midday; Overcast; Fog; Rain—light/heavy; Evening/sunset; Night; Windy/Cloudy) and including over 62,000 annotated objects. This allows for the assessment of algorithm robustness to natural variability and provides practitioners with reliable adaptive management strategies [11,12].
Thus, the proposed solution not only extends the methodological foundation through the combination of neural network architectures but also contributes to the advancement of horticulture, a field that is becoming increasingly critical in the context of growing demand for food security and sustainable agriculture.
The main contributions of this work are as follows:
  • We propose a flexible framework for automated ensemble selection and optimization of CNN inference, specifically tailored to agricultural applications under variable environmental conditions.
  • We integrate and benchmark eleven ensemble methods, dynamically configured via Pareto-based multi-objective optimization, an approach not yet fully explored for fruit detection.
  • We introduce a novel decision-support framework for pre-deployment benchmarking that enables data-driven selection of neural network models and ensemble strategies before field integration. This moves beyond conventional trial-and-error by leveraging multi-objective Pareto optimization to identify the optimal trade-off between accuracy and computational efficiency for specific operational scenarios (e.g., time of day, weather conditions, hardware constraints).
In this study, we hypothesize that an adaptive ensemble framework can provide a measurable improvement in detection reliability under challenging agricultural conditions. Specifically, three research hypotheses are formulated to guide the investigation:
H1. 
Adaptive ensembles improve detection accuracy (mAP@0.5–0.95) by at least 8% under adverse visual conditions such as occlusion, variable illumination, and background clutter compared to single detectors.
H2. 
Cross-architecture stability can be achieved when combining heterogeneous models (e.g., YOLOv8, EfficientDet, RT-DETR) through dynamically optimized weighting, thereby maintaining performance consistency across distinct feature extractors.
H3. 
Pareto-based optimization enables an effective trade-off between accuracy and inference speed, ensuring that the ensemble remains computationally feasible for real-time or embedded AIoT deployment scenarios.
These hypotheses define the measurable objectives and evaluation criteria for the subsequent methodological design and experiments presented in this work.

2. Related Works

The pursuit of automated apple detection has driven significant innovation in computer vision for precision horticulture. Previous research can be broadly categorized into advancements in single-model architectures and explorations into ensemble and advanced methods, often revealing a trade-off between accuracy, speed, and robustness.

2.1. Advancements in Single-Model Architectures

A dominant trend involves refining single-model detectors to balance high accuracy with computational efficiency, crucial for real-time applications. The YOLO (You Only Look Once) family is particularly prominent due to its optimal speed–accuracy trade-off. Studies have demonstrated its efficacy in various agricultural tasks, from fruit counting to disease diagnosis [13]. For instance, lightweight YOLO architectures have been successfully deployed on embedded systems like the Jetson Xavier NX, achieving 94–99% accuracy for apple variety detection at 37 FPS [13]. Similarly, YOLOv4 demonstrated superior performance for orange fruit detection and yield estimation, achieving a mAP of 90.8% under both day and night conditions [14]. Beyond fruit detection, YOLO-based models are also used for ancillary tasks such as accurate apple tree localization in orchard rows, which is critical for autonomous machinery navigation [15].
The evolution beyond pure YOLO architectures includes hybrid models that combine the strengths of different network types. The Rep-ViG-Apple model [16], for example, integrates Convolutional Neural Networks (CNNs) with Graph Convolutional Networks (GCNs) to overcome challenges like weather factors, occlusion, and scale variability. Through components like an inverted residual multi-scale block and a sparse graph attention mechanism, it achieved a mAP of 93.3%, outperforming YOLOv8n. Other architectural innovations focus on specific challenges: modified YOLO versions with attention modules and adjusted loss functions improve robustness to noise and blur [17], while transformer-based mechanisms integrated into YOLOv7 have enhanced precision and recall for occluded apples [18]. For adverse weather, YOLOv7-tiny has been tailored to maintain a mAP of 80.4% for small apples under rain and fog [19].
Despite these advances, single-model approaches face inherent limitations. They often remain vulnerable to leaf occlusion, glare, and complex backgrounds [20], and their performance can be constrained by data scarcity and the limitations of public datasets, which often feature artificial backgrounds or class imbalances [21]. Techniques like anchor box optimization [22], Feature Pyramid Networks (FPNs) [23], and multimodal sensing with RGB-D cameras [24] have been employed to mitigate these issues, yet achieving consistent robustness in real-world, variable conditions remains a challenge.
Recent Transformer-based models, such as Swin-Transformer [25], ViT [26], and VM-YOLO [6], demonstrate improved global context modeling and resilience to occlusion; however, their computational demand limits deployment on embedded agricultural devices. Furthermore, Vision Mamba [27] has shown promise for image processing, but such models likewise require substantial computational resources. Consequently, lightweight CNN ensembles remain advantageous for real-time orchard monitoring, as they address occlusion, inter-class similarity, and background complexity under field conditions.
The literature on agricultural object detection [1,2,3,4,5,6,7,8,9,10,11,12,13] reveals several intrinsic challenges specific to orchard environments. The most prominent among them include fruit occlusion caused by dense foliage, inter-class and intra-class similarity arising from overlapping color tones among different fruit varieties, and complex background interference due to leaves, branches, and illumination variation. These factors collectively reduce localization accuracy and hinder the discriminative capacity of convolutional models.
Earlier approaches attempted to mitigate these effects through techniques such as illumination normalization, multi-scale feature fusion, and attention-guided refinement. However, few studies have systematically categorized these difficulties or examined their combined impact on detection robustness under real-world agricultural conditions. Unlike general-purpose vision datasets, orchard imagery exhibits high spatial heterogeneity, seasonal variation, and limited annotated diversity, which significantly amplify the model’s sensitivity to environmental fluctuations.
Therefore, it becomes crucial to develop adaptive ensemble frameworks capable of integrating complementary representations from multiple CNN architectures to maintain stable detection performance across diverse environmental conditions. Such frameworks are particularly relevant for practical applications in precision horticulture, where consistency and reliability under uncontrolled conditions are essential for sustainable monitoring.

2.2. Ensemble and Advanced Methods

To overcome the limitations of single models, ensemble methods have been explored, albeit more fragmentarily. The core principle is to aggregate predictions from multiple models to compensate for individual weaknesses. In non-agricultural domains, ensembles have shown remarkable success. For instance, an ensemble combining a Vision Transformer (ViT) with modified CNNs achieved up to 99.91% accuracy on standard datasets [28]. Similarly, ensembles of Faster R-CNN, RetinaNet, and SSD have been used for precise tasks like hand joint localization, improving accuracy by 15% over single models [29].
However, the direct application of such complex ensembles to agriculture is often impractical. They typically require extensive GPU infrastructure and are unsuitable for low-power edge devices in field conditions [29]. Other advanced methods from related fields also face hurdles. For example, Zero-Shot detection [30] and models like VE-DINO [20] show promise for detecting partially hidden objects but are computationally intensive and not validated on agricultural data. Similarly, methods using elliptical bounding boxes with Faster R-CNN [31] or parallel ensembles with template matching [32] suffer from high computational costs or a lack of true, adaptive ensembling.
Within agriculture, some ensemble approaches have been tested. A YOLOv5 ensemble was shown to achieve a mAP of 95% for apple detection, surpassing single-model performance [33]. Nonetheless, these are often based on fixed voting schemes and do not dynamically adapt to changing environmental conditions. This points to a critical gap in the literature: while complex, resource-heavy ensembles exist for image classification [28] and high-precision detection [29], there is a lack of validated, lightweight ensemble frameworks specifically designed for real-time fruit detection in orchards. Such frameworks need to combine efficient inference on embedded systems with adaptive strategies to handle the core challenges of occlusion, variable lighting, and fruit clustering.
Broadening the scope further, research into specific horticultural applications and alternative sensing modalities reinforces both the potential and the limitations of current approaches. For targeted robotic harvesting, studies have focused on detecting obstructions rather than fruit itself; for instance, YOLOv4 was used to classify occlusions like branches and wires for safe kiwi harvesting, achieving a mAP of 91.9% but without employing ensembling [34,35]. Beyond fruit, systems for flower detection [36] and apple leaf disease classification using YOLO with transformer blocks [5] have achieved high accuracy (>90%), aiding in bloom assessment and phytosanitary monitoring. Multimodal and multispectral approaches seek to overcome inherent RGB limitations. A multi-drone system combining RGB and infrared imaging achieved a mAP of 86.8%, proving robust to shading and particularly effective for detecting under-leaf green apples [37]. Meanwhile, comparative studies consistently highlight the architectural trade-offs in the field, such as YOLOv8's superior speed and accuracy over Mask R-CNN in segmentation tasks [38], but also the persistent speed constraints of two-stage detectors like Faster R-CNN for field robots [39,40]. The foundational challenge of data scarcity is widely noted, with public datasets often being inadequate for field conditions [41], prompting the use of small-shot learning and synthetic data to mitigate significant accuracy drops when moving from controlled to real environments [42,43]. In pursuit of efficiency, lightweight models like SSD-MobileNet enable real-time fruit counting [44], while transformer-based architectures, though computationally intensive, remain a promising research direction [45]. These diverse efforts underscore a collective move towards more resilient systems, yet they also highlight the fragmented nature of solutions that address either accuracy, speed, or robustness in isolation, rather than offering a unified, adaptive framework.
This section analyzes the application of ensemble and advanced methods in agricultural computer vision. While ensemble techniques have demonstrated remarkable success in other domains by aggregating predictions to overcome single-model limitations, their implementation in agriculture remains fragmented and faces significant challenges. The primary obstacles include high computational complexity, making them unsuitable for resource-constrained field devices, and a general lack of adaptive, real-time frameworks tailored to dynamic orchard conditions. Research has branched into specialized applications, such as occlusion detection for robotic harvesting and the use of multimodal sensing to overcome RGB limitations. Furthermore, studies consistently highlight fundamental trade-offs between model accuracy, speed, and robustness, as well as the persistent challenge of data scarcity. The collective research effort, while advancing the field through diverse solutions, ultimately reveals a critical gap: the absence of a unified, lightweight, and adaptive ensemble framework capable of robust, real-time fruit detection under variable agricultural conditions.

2.3. The Identified Research Gap and Our Contribution

The literature review confirms that while significant progress has been made in single-model detectors and complex, off-the-shelf ensembles, a clear gap exists. There is a need for a practical, adaptive ensemble framework that is both computationally efficient enough for real-time use on edge devices and robust enough to handle the dynamic conditions of a real orchard.
Our work addresses this gap by proposing a flexible framework for automated ensemble selection and optimization. Unlike fixed ensembles, our system integrates multiple lightweight ensemble methods (including NMS, WBF, and Bayesian Ensembling) and uses Pareto-based multi-objective optimization to dynamically select the best configuration based on current environmental conditions and hardware constraints. A key innovation is the pre-deployment benchmarking capability, which allows for the data-driven selection of an optimal model or lightweight ensemble before field deployment, ensuring robustness and suitability for specific operational scenarios. The following sections detail the ensemble methods, dataset, and experimental results that demonstrate the efficacy of this approach.
Unlike conventional ensemble approaches that rely on fixed or manually tuned fusion weights, the proposed framework employs a dynamic weight optimization strategy driven by Pareto-front analysis. This mechanism jointly considers the trade-off between detection accuracy (mAP) and inference speed (FPS), automatically adjusting the contribution of each component model according to its real-time performance profile. Such a design enables the ensemble to maintain an optimal balance between precision and efficiency, distinguishing it from traditional static-weight or Soft-NMS–based fusion methods that cannot adapt to changing visual or computational conditions.

3. Materials and Methods

3.1. Dataset Description

For the training and testing of robust neural network ensembles in automatic apple detection tasks, a specialized image dataset was created, encompassing a wide range of natural conditions and lighting scenarios. The dataset construction emphasized environmental variability, as ensemble robustness directly depends on the models' ability to adapt to changing visual conditions. Data were collected in real orchards using high-resolution RGB cameras. The dataset was collected from experimental apple orchards located in Russia during the growing seasons, as part of an institutional agricultural automation project. Image acquisition was conducted under field conditions using RGB cameras mounted on mobile platforms, covering diverse lighting and occlusion scenarios. All images were manually annotated by two trained experts using the Roboflow platform and subsequently reviewed by an agronomist for consistency, yielding a high inter-annotator agreement (Cohen's κ = 0.91). Data collection and usage were performed with the consent of orchard owners and in accordance with institutional research ethics guidelines.
Sampling was conducted with attention to the specifics of each weather scenario. In bright sunlight, high contrast and glare on fruit surfaces were considered, requiring exposure compensation and polarizing filters to preserve textural features, particularly for shiny, mature apples. Foggy and overcast conditions resulted in a significant reduction in local contrast, making boundaries between fruits and background less distinguishable. Such data proved valuable for developing robust features applicable in scenarios where standard algorithms often fail.
During rainy conditions, raindrops appeared on the lens, introducing glare and partial blur, simulating real-world distortions. The dataset included varying rainfall intensities, from light drizzle to heavy downpour. Critically, evening and nighttime images were captured using low-light RGB cameras, enabling object detection under reduced illumination and providing scenarios relevant for 24/7 automated monitoring systems.
Windy conditions introduced motion of leaves, branches, and fruits, creating motion blur and dynamic noise. These images helped models learn to extract stable features even in visually unstable environments. Each scene was manually annotated following the YOLO standard, specifying bounding box coordinates and object class for every instance.
The overall dataset statistics are presented in Table 1 below.
A total of approximately 62,000 apples were annotated, including ripe, semi-ripe, and partially leaf-occluded fruits. Such volume and structured diversity of the dataset provide the models with a high degree of generalization and reduce overfitting to specific weather patterns. Consequently, the ensemble is trained not only to distinguish between classes but also to develop robustness to environmental distortions, which is particularly important for agricultural monitoring tasks under variable climatic conditions.
It should be emphasized that the dataset construction was based on the concept of creating a contextually rich training environment, where each subgroup of images forms a separate “weather subspace” in which the model must identify stable visual patterns. This methodology enhances the generalization capability of the ensemble not only through architectural diversity but also through the variability of perceptual conditions, representing a key direction in the development of resilient AI for precision agriculture.
Ensemble learning has emerged as a fundamental strategy to enhance the robustness and generalization capability of deep neural networks, particularly in agricultural vision tasks where environmental conditions and visual variability are difficult to control. The principal paradigms include bagging, boosting, stacking, Bayesian aggregation, and consensus fusion.
Bagging (Bootstrap Aggregating) aims to reduce variance by training multiple base learners on bootstrapped subsets of data and averaging their predictions, thereby improving stability under stochastic perturbations such as illumination or background noise. Boosting sequentially emphasizes hard-to-classify samples, effectively decreasing bias and increasing sensitivity to subtle fruit features. Stacking combines heterogeneous models through a meta-learner that optimizes decision boundaries at a higher level of abstraction, integrating complementary features from different architectures.
Bayesian aggregation introduces probabilistic weighting of individual predictions based on posterior model reliability, providing uncertainty-aware decisions useful for adaptive confidence calibration. Finally, consensus fusion aggregates spatial and semantic outputs through rule-based or learned fusion, which is particularly effective for tasks involving dense fruit clusters or overlapping canopies.
In this study, these paradigms inform the design of the proposed adaptive CNN ensemble, which integrates the diversity of base models via Pareto-based optimization to balance accuracy, inference cost, and stability across heterogeneous orchard conditions.
Figure 1 shows the workflow of the adaptive CNN ensemble framework.

3.2. Algorithms and Models

The software complex “Automated System for Selection and Ensembling of Neural Network Models for Monitoring Biological Objects in Horticulture” was developed in Python 3.13, targeting applications of intelligent computer vision in the agricultural sector. The system architecture is designed for both research and engineering purposes, particularly for comparative analysis and parametric optimization of convolutional neural networks (CNNs) and their ensembles according to spatial detection quality metrics (Precision, Recall, F1-Score, mean Average Precision (mAP), Intersection over Union (IoU)), as well as real-time performance (measured in frames per second, FPS), robustness to image transformations (rotation, noise, lighting variability), and adaptability to unstable visual environments.
It should be noted that our framework is model-agnostic: its innovation lies not in architectural modifications of individual detectors but in their intelligent selection and ensembling. The baseline models are state-of-the-art detectors (primarily the YOLO family, from YOLOv5 to YOLOv11) used in their standard, pre-trained forms via transfer learning. For the core experimental results reported in the study, the YOLOv11m model was employed as a representative and high-performing base detector to ensure a consistent and fair comparison of the ensemble methods.
Figure 2 illustrates the structure of the ensembling process.
The “optimization” is performed at the system level through our Pareto-based multi-objective selection process, which finds the best-performing model or ensemble configuration for a given scenario.
The proposed framework operates in two distinct phases.
  • Pre-deployment Benchmarking and Optimization Phase. In this phase, the system performs an exhaustive evaluation of a portfolio of individual models (e.g., YOLOv8, EfficientDet) and their combinations using 11 ensemble methods (NMS, Soft NMS, Non-Maximum Weighted (NMW), Weighted Boxes Fusion (WBF), Score Averaging, Weighted Averaging, IoU Voting, Consensus Fusion, Adaptive NMS, Test-Time Augmentation (TTA), and Bayesian Ensembling). This evaluation uses a small, representative dataset (e.g., a single image or a short clip from the target environment) to simulate expected conditions. The selection of the optimal configuration (a single model or a specific ensemble method with tuned hyperparameters) is driven by Pareto-based multi-objective optimization, balancing accuracy (mAP) and performance (FPS).
  • Runtime Deployment Phase. Only the single best configuration identified in the first phase is deployed for continuous, real-time monitoring. This ensures high throughput and low latency during field operation, as the computational overhead of running multiple models and complex fusion algorithms is incurred only once during the setup phase.
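To make this two-phase logic concrete, the following minimal Python sketch enumerates candidate configurations on a representative sample and selects one by the normalized mAP + FPS objective formalized later in this section. All names (benchmark, MODELS, ENSEMBLE_METHODS) are illustrative placeholders, not the actual API of the system.

```python
from itertools import combinations

# Hypothetical portfolio of detectors and a subset of the 11 fusion methods.
MODELS = ["yolov8m", "yolov11m", "efficientdet-d1"]
ENSEMBLE_METHODS = ["nms", "soft_nms", "wbf", "tta", "bayesian"]

def benchmark(models, method, sample_images):
    """Stub: run `models` on the representative sample, fuse boxes with
    `method`, and return (mAP, FPS). Replace with real evaluation code."""
    return 0.90 + 0.01 * len(models), 40.0 / len(models)

def select_configuration(sample_images):
    # Phase 1: exhaustive pre-deployment evaluation of every configuration.
    results = {}
    for k in (1, 2, 3):
        for models in combinations(MODELS, k):
            for method in ENSEMBLE_METHODS:
                results[(models, method)] = benchmark(models, method, sample_images)
    # Scalarized Pareto objective: mAP_m / max(mAP) + FPS_m / max(FPS).
    best_map = max(m for m, _ in results.values())
    best_fps = max(f for _, f in results.values())
    return max(results, key=lambda c: results[c][0] / best_map
                                      + results[c][1] / best_fps)

# Phase 2: deploy only the winning configuration for continuous monitoring.
best_config = select_configuration(sample_images=[])
```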
Modern visual monitoring methods in horticulture, based on deep learning algorithms, face limited generalization capabilities under high variability in the shape, texture, and contextual environment of biological objects (e.g., flowers, fruits, and buds), especially when visual noise factors (occlusion, lighting, background) are present. Single-model approaches without environmental adaptation often result in decreased detection accuracy and false positives. In such conditions, ensemble predictions from multiple models become a fundamentally important approach, allowing compensation for individual architecture errors and increasing result reliability. Nevertheless, the lack of standardized methodologies for selecting specific ensembling techniques, as well as uncertainty in parameter choices, limits the practical implementation of such solutions.
The proposed system addresses these limitations by implementing a pre-inference neural network configuration optimization method, which includes empirical evaluation of both individual models and their ensembles on specially constructed representative subsets. This enables the formation of optimal configurations that minimize error while satisfying inference time and resource constraints. Specifically, for agricultural images with high object density and heterogeneous backgrounds, the software accounts for spatial clustering of fruits and contextual scene complexity. To improve computational efficiency, asynchronous inference mechanisms were implemented using QThreadPool (part of PyQt5), supporting task distribution across multiple GPUs. Adaptive ensemble algorithms perform dynamic weighting of model predictions based on F1 scores and latency, applying gradient optimization and global parameter search strategies.
The software implementation relies on a wide ecosystem of libraries, including PyQt5 5.15.11 for GUI visualization, OpenCV 4.12.0.88 for image processing, TensorFlow 2.7.1 and PyTorch 2.7.1 for neural network operations, and NumPy 2.2.6 and Pandas 2.3.1 for numerical analysis and data transformation. Visual analytics is built using Matplotlib 3.10.6 and Seaborn 0.13.0, while statistical analysis and numerical optimization leverage SciPy 1.16.0. Metrics management and validation procedures are handled with scikit-learn 1.7.0, and the object storage and logging model is implemented via SQLAlchemy 2.0.41.
User-defined high-level functions play a key role in system adaptability, managing critical computation stages:
- auto_threshold_tuning: Automatically selects detection thresholds based on image characteristics and object density;
- extract_labels_and_scores: Extracts class labels and probabilities from model outputs;
- convert_json: Converts annotation formats, supporting JSON-to-CSV and vice versa;
- evaluate_detector: Validates a model on a validation subset and computes key metrics;
- compute_map_metric: Calculates mean Average Precision (mAP) across multiple IoU thresholds;
- generate_recommendations: Provides decision-support by analyzing metrics and recommending optimal models and ensembling strategies.
Thus, the developed software represents an intelligent, modular, and configurable platform for automated validation and selection of deep learning model ensembles in agronomic monitoring tasks. The system offers researchers and developers a reproducible, evidence-based tool, facilitating the transfer of fundamental research into real-world applications under highly variable horticultural conditions.
For comprehensive quantitative assessment of detection model performance and comparative evaluation of individual architectures and ensembles in horticultural applications, a metrics system was implemented reflecting both localization accuracy and detection completeness. The evaluation framework covers Precision, Recall, F1-Score, mean Average Precision (mAP), and Intersection over Union (IoU) for spatially overlapping predictions.
Precision and Recall are evaluated considering spatial identification of objects through bounding boxes or segmentation masks. They are formally defined as follows:
$$\mathrm{Precision}_{Box/Mask} = \frac{TP_{Box/Mask}}{TP_{Box/Mask} + FP_{Box/Mask}},$$
$$\mathrm{Recall}_{Box/Mask} = \frac{TP_{Box/Mask}}{TP_{Box/Mask} + FN_{Box/Mask}},$$
where $TP_{Box/Mask}$ is the number of correctly detected bounding boxes or object mask pixels, $FP_{Box/Mask}$ is the number of false-positive bounding boxes or mask pixels, and $FN_{Box/Mask}$ is the number of false-negative bounding boxes or mask pixels.
For the integral evaluation, the F1-score is used—a symmetric metric reflecting the balance between precision and recall. Its analytical expression is given by
$$F1\text{-}\mathrm{Score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
The key metric in object detection tasks is the mean Average Precision (mAP), defined as the average area under the precision-recall curve for each class, followed by averaging across all classes:
$$mAP_{Box} = \frac{1}{N} \sum_{i=1}^{N} AP_{Box/Mask,\,i},$$
where $N$ is the total number of classes, and $AP_{Box/Mask,\,i}$ is the area under the PR curve for the $i$-th class, obtained by sequentially evaluating performance while varying the confidence threshold.
The Intersection over Union (IoU) metric serves as a fundamental measure for detection success. It is calculated as the ratio of the area of overlap between the predicted and ground-truth regions to the area of their union:
$$IoU = \frac{A_{Box/Mask}^{pred} \cap A_{Box/Mask}^{gt}}{A_{Box/Mask}^{pred} \cup A_{Box/Mask}^{gt}}.$$
The IoU threshold (e.g., 0.5 or 0.75) defines the cutoff above which a prediction is considered correct. This is particularly critical in tasks with densely clustered objects, such as imaging apple trees during fruiting.
System performance is evaluated in terms of frame rate (FPS), reflecting the model’s ability to process images in real time. The inference speed is calculated as
$$FPS = \frac{1000}{T_{cp}},$$
where Tcp is the average time to process a single image (in milliseconds). This metric serves as a fundamental indicator of the model’s computational suitability for embedded and autonomous monitoring systems.
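For illustration, the scalar metrics above can be computed from raw match counts and the mean per-image latency as in the following minimal sketch (the function name and signature are hypothetical, not part of the software suite):

```python
def detection_metrics(tp: int, fp: int, fn: int, avg_ms: float) -> dict:
    """Precision, Recall, F1, and FPS from raw match counts and the mean
    per-image processing time in milliseconds, per the formulas above."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "f1": f1, "fps": 1000.0 / avg_ms}

# Example: 90 true positives, 10 false positives, 15 misses, 25 ms/image
print(detection_metrics(tp=90, fp=10, fn=15, avg_ms=25.0))
```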
The results of the calculated metrics are visualized using IoU distribution histograms and Precision–Recall plots, which allow assessment of the advantages of ensemble methods compared to single architectures, especially under anomalous lighting and background noise conditions. These plots enable both pointwise and integral evaluation of model behavior on subsets of images with varying degrees of visual complexity.
The software suite supports a wide range of convolutional neural network architectures, including YOLOv5–YOLOv11, EfficientDet, and ResNet, as well as custom configurations adapted to the specifics of orchard monitoring tasks. The system implements eleven algorithmically distinct ensemble methods, each providing a unique mechanism for reconciling spatial predictions and recalculating confidence metrics.
These methods cover classical approaches for suppressing insignificant detections (e.g., Non-Maximum Suppression and its modifications), more sophisticated aggregation schemes considering the probabilistic structure of predictions (including Bayesian Ensembling), object spatial density (Adaptive NMS), and contextual consistency across independent models (Consensus Fusion, IoU Voting).
The operation of all ensemble mechanisms is based on preliminary clustering of bounding boxes according to their overlap degree, determined by the IoU threshold parameter (IoUthr), followed by aggregation of coordinates and/or confidence scores with weights reflecting the reliability of each prediction source. This approach significantly improves object localization accuracy and robustness against artifacts that occur under complex environmental conditions. A comparative analysis of the effectiveness of these methods is presented in Table 2, showing the advantages and limitations of each approach when applied to real-time tasks.
For NMS (Non-Maximum Suppression), the approach detailed in the YOLO architecture review was employed, which notes that NMS is a critical post-processing step for eliminating multiple detections of the same object [46].
In the case of Soft NMS, reference is made to the seminal work “Soft NMS—Improving Object Detection With One Line of Code”, which demonstrates improvements in mAP on standard datasets without additional hyperparameters [47].
For the Weighted Boxes Fusion (WBF) method, the reference is a 2021 paper that implements and analyzes the weighted box merging algorithm [48].
Adaptive NMS is described in a study presenting it as a modification of classical NMS with a dynamic threshold accounting for object density [49].
Bayesian Ensembling is presented in [50].
For the remaining methods—NMW, Score Averaging, Weighted Averaging, IoU Voting, Consensus Fusion, and TTA—references are rarer due to their limited use; however, for conceptual illustration, survey works on ensemble methods in computer vision and object detection can be employed [51,52,53].
The principle of Non-Maximum Suppression (NMS) consists of strictly filtering overlapping bounding boxes. To achieve this, the degree of overlap between two bounding boxes Bi and Bj is calculated using the metric
$$IoU(B_i, B_j) = \frac{\mathrm{Area}(B_i \cap B_j)}{\mathrm{Area}(B_i \cup B_j)},$$
where each bounding box is represented by its coordinates [x1, y1, x2, y2], and IoU(Bi,Bj) is compared against a predefined IoU threshold (IoU_thr). If this threshold is exceeded, the bounding box with the lower confidence score is suppressed, ensuring that only the most significant detections are retained.
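A compact NumPy sketch of this greedy suppression procedure is given below; it is an illustrative reimplementation of classical NMS, not the system's production code.

```python
import numpy as np

def hard_nms(boxes: np.ndarray, scores: np.ndarray, iou_thr: float = 0.5) -> list:
    """Greedy NMS: keep the highest-scoring box, suppress neighbors whose
    IoU with it exceeds iou_thr. boxes: (N, 4) [x1, y1, x2, y2]; scores: (N,)."""
    order = scores.argsort()[::-1]              # descending confidence
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        # IoU of the current best box against the remaining candidates
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = ((boxes[order[1:], 2] - boxes[order[1:], 0])
                 * (boxes[order[1:], 3] - boxes[order[1:], 1]))
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thr]       # drop overlapping, lower-scored boxes
    return keep
```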
Soft-NMS represents an enhancement over the classical Non-Maximum Suppression (NMS) algorithm. It addresses the limitation of aggressively suppressing detections by replacing the hard removal of bounding boxes with a gradual, exponential decay of their confidence scores:
$$s_i = s_i \cdot \exp\!\left(-\frac{IoU(B_i, B_{max})^2}{\sigma}\right),$$
where si is the initial confidence score of the bounding box Bi, and Bmax is the bounding box with the maximum confidence score, while a smoothing parameter σ (typically set to 0.8) controls the rate of confidence decay for neighboring boxes. This mechanism preserves highly probable, yet slightly overlapping, detections. This approach has proven effective in boosting mean Average Precision (mAP) without the need for complex architectural modifications.
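The decay rule itself fits in a few lines; the sketch below is illustrative, with scores and IoU values assumed to be NumPy arrays.

```python
import numpy as np

def soft_nms_decay(scores: np.ndarray, ious_with_max: np.ndarray,
                   sigma: float = 0.8) -> np.ndarray:
    """Gaussian Soft-NMS update: s_i <- s_i * exp(-IoU(B_i, B_max)^2 / sigma),
    decaying rather than removing boxes that overlap the top-scoring one."""
    return scores * np.exp(-(ious_with_max ** 2) / sigma)

# Example: neighbors with IoU 0.1, 0.5, 0.9 against the current best box
print(soft_nms_decay(np.array([0.8, 0.7, 0.6]), np.array([0.1, 0.5, 0.9])))
```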
The Non-Maximum Weighted (NMW) method meticulously weights the contribution of each detection to the final localization output. For all bounding boxes within a cluster exhibiting an Intersection-over-Union ratio greater than or equal to a threshold ($IoU \geq IoU_{thr}$), relative weights are computed as
$$\omega_i = \frac{s_i}{\sum_j s_j},$$
and the final coordinates of the fused bounding box are determined by the weighted sum:
$$B_{fused} = \sum_i \omega_i B_i.$$
This approach accounts for the reliability of each model, yielding more precise box placement.
Weighted Boxes Fusion (WBF) integrates concepts from NMW and incorporates model-specific weights wi. For a group of detections G, the resulting coordinates are computed as follows:
$$x_{1,fused} = \frac{\sum_{i \in G} w_i s_i x_{1,i}}{\sum_{i \in G} w_i s_i}, \qquad y_{1,fused} = \frac{\sum_{i \in G} w_i s_i y_{1,i}}{\sum_{i \in G} w_i s_i},$$
$$x_{2,fused} = \frac{\sum_{i \in G} w_i s_i x_{2,i}}{\sum_{i \in G} w_i s_i}, \qquad y_{2,fused} = \frac{\sum_{i \in G} w_i s_i y_{2,i}}{\sum_{i \in G} w_i s_i},$$
and the adjusted confidence score of the fused box is calculated as
$$s_{fused} = \frac{\sum_{i \in G} w_i s_i}{\sum_{i \in G} w_i}.$$
This method incorporates both the confidence of each detection and the reliability of the originating model, enhancing result accuracy compared to techniques that simply remove overlaps without averaging.
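A minimal sketch of fusing one cluster according to these WBF equations follows (illustrative names; per-model weights are assumed to be supplied by the caller):

```python
import numpy as np

def wbf_cluster(boxes: np.ndarray, scores: np.ndarray, model_w: np.ndarray):
    """Fuse one cluster of overlapping boxes per the WBF equations above:
    coordinates are averaged with weights w_i * s_i, and the fused score
    is sum(w_i * s_i) / sum(w_i)."""
    ws = model_w * scores                                   # per-detection weights
    fused_box = (ws[:, None] * boxes).sum(axis=0) / ws.sum()
    fused_score = ws.sum() / model_w.sum()
    return fused_box, float(fused_score)

# Example: three detections of the same apple from three source models
boxes = np.array([[10, 12, 50, 55], [12, 10, 52, 54], [11, 11, 49, 53]], float)
print(wbf_cluster(boxes, np.array([0.9, 0.8, 0.6]), np.array([1.0, 1.0, 0.5])))
```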
A fundamental strategy is Score Averaging, where the resulting coordinates and confidence values are computed as the unweighted mean across all detections in a group G that meets a specified IoU condition ($IoU \geq IoU_{thr}$). The fused bounding box for the group is given by
$$B_{fused} = \frac{1}{|G|} \sum_{i \in G} B_i,$$
and the resulting confidence score is defined as
$$s_{fused} = \frac{1}{|G|} \sum_{i \in G} s_i,$$
where Bi are the coordinates of the i-th model’s bounding box, si is its corresponding confidence score, and ∣G∣ is the number of detections in the cluster. This method is straightforward to implement but does not account for variations in prediction reliability, which can impair accuracy when models are of heterogeneous quality.
To improve robustness against source reliability variations, the Weighted Averaging method assigns a weight wi to each bounding box, reflecting, for instance, the model’s trustworthiness or its performance on validation data. The final localization is then determined by
$$B_{fused} = \frac{\sum_{i \in G} w_i B_i}{\sum_{i \in G} w_i},$$
and the confidence score can be averaged similarly (if weights are used solely for coordinate averaging):
$$s_{fused} = \frac{\sum_{i \in G} w_i s_i}{|G|}.$$
Applying weights amplifies the influence of more reliable predictions while suppressing contributions from less certain ones, which is particularly crucial in heterogeneous ensembles.
The IoU Voting method posits that multiple high-overlap detections with strong mutual agreement are a reliable indicator of a true object. Here, the box coordinates are averaged:
$$B_{fused} = \frac{1}{|G|} \sum_{i \in G} B_i,$$
while the confidence score is taken as the maximum among all contributions:
$$s_{fused} = \max_{i \in G} s_i.$$
This strategy can be beneficial in high-density object scenarios where multiple detections of the same object with varying confidence scores are likely.
The Consensus Fusion mechanism imposes an additional constraint on the number of overlapping predictions. Only clusters satisfying the condition ∣G∣ ≥ min_votes are aggregated; others are discarded as insufficiently reliable. Coordinate and confidence score averaging is performed using the following equations:
$$B_{fused} = \frac{1}{|G|} \sum_{i \in G} B_i, \qquad s_{fused} = \frac{1}{|G|} \sum_{i \in G} s_i.$$
This achieves a balance between excessive suppression and over-generalization, minimizing the risk of erroneously incorporating spurious detections.
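In code, the consensus rule reduces to a size check before averaging; the sketch below is illustrative, with min_votes as a tunable parameter:

```python
import numpy as np

def consensus_fuse(boxes: np.ndarray, scores: np.ndarray, min_votes: int = 2):
    """Average a cluster only if at least `min_votes` overlapping predictions
    agree; otherwise the cluster is rejected as unreliable."""
    if len(boxes) < min_votes:
        return None                     # discarded: insufficient consensus
    return boxes.mean(axis=0), float(scores.mean())
```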
A distinct adaptive approach is Adaptive NMS, where the suppression threshold $IoU_{adaptive}$ is a function of object density. Let $n$ be the total number of detections and $N$ the number of models. The global density is defined as
$$density = \frac{n}{N},$$
and the adaptive suppression threshold is then calculated as
$$IoU_{adaptive} = \min\left(0.95,\; IoU_{thr} + 0.01 \cdot density\right),$$
Subsequently, for each local image region, the local density is computed:
$$LocDensity = \frac{N_{detections}}{A_{region}},$$
where Ndetections is the number of boxes in the region and Aregion is its area. The local suppression threshold is then adapted according to
$$IoU_{adaptive} = IoU_{base} + 0.05 \cdot \frac{LocDensity}{LocDensity_{max}},$$
where $IoU_{base}$ is a base threshold value and $LocDensity_{max} = 0.1$ corresponds to the maximum normalized density (e.g., 10 objects per 100 square pixels). This approach ensures the preservation of meaningful detections in high-density areas, such as clusters of fruit on trees.
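The two density-adaptive thresholds can be sketched as follows (an illustrative reading of the equations above; parameter defaults are assumptions, not values fixed by the system):

```python
def adaptive_iou_threshold(n_detections: int, n_models: int,
                           iou_thr: float = 0.5) -> float:
    """Global rule: the threshold grows with detection density n/N,
    capped at 0.95."""
    density = n_detections / n_models
    return min(0.95, iou_thr + 0.01 * density)

def local_iou_threshold(n_region: int, area_region: float,
                        iou_base: float = 0.5,
                        loc_density_max: float = 0.1) -> float:
    """Per-region refinement: the threshold rises with the normalized
    local density N_detections / A_region."""
    loc_density = n_region / area_region
    return min(0.95, iou_base + 0.05 * (loc_density / loc_density_max))
```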
Among the ensemble methods implemented in the proposed system, Test-Time Augmentation (TTA) holds particular importance, aiming to enhance model robustness to various input image transformations. The method involves applying symmetric augmentations to the input image at inference time, followed by the fusion of the resulting detections. In the base configuration, horizontal flipping is applied to the original image I, creating a mirrored version Iflip = flip(I). Each image version undergoes an independent detection procedure, generating two sets of bounding boxes:
$B_{orig}$, obtained on the original image $I$, and $B_{flip}$, obtained on the flipped image $I_{flip}$.
As the bounding boxes Bflip are defined in the coordinate space of the flipped image, they must be transformed back into the original image’s coordinate space. For normalized coordinates in the range [0, 1], this transformation is formalized as follows for a bounding box B = [x1, y1, x2, y2]:
$$B_{flip}^{transformed} = \left[\,1 - x_2,\; y_1,\; 1 - x_1,\; y_2\,\right].$$
The combined set of predictions $\mathcal{B} = B_{orig} \cup B_{flip}^{transformed}$ is passed to the subsequent ensemble module for processing using a selected algorithm (e.g., NMS or WBF). The TTA methodology significantly enhances detection recall and robustness by leveraging symmetrical feature generalization, which is particularly useful for the directional biases common in agricultural imagery.
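The coordinate remapping at the heart of this TTA step can be sketched as follows (illustrative; boxes are assumed to be normalized to [0, 1]):

```python
import numpy as np

def flip_boxes_back(boxes_flip: np.ndarray) -> np.ndarray:
    """Map boxes predicted on the horizontally flipped image back into the
    original frame: [x1, y1, x2, y2] -> [1 - x2, y1, 1 - x1, y2]."""
    out = boxes_flip.copy()
    out[:, 0] = 1.0 - boxes_flip[:, 2]
    out[:, 2] = 1.0 - boxes_flip[:, 0]
    return out

# TTA fusion: stack original and remapped detections, then apply NMS or WBF.
# all_boxes = np.vstack([boxes_orig, flip_boxes_back(boxes_flip)])
```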
A second approach, critical for model evaluation under epistemic uncertainty, is Bayesian Ensembling. This method involves the integration of predictions via weighted aggregation, followed by a correction of the confidence score to improve interpretation reliability. Initially, a weighted average of the coordinates and confidence scores is computed based on weights wi, assigned to each model in the ensemble:
$$B_{fused} = \frac{\sum_{i \in G} w_i B_i}{\sum_{i \in G} w_i}, \qquad s_{fused} = \frac{\sum_{i \in G} w_i s_i}{|G|},$$
where $B_i$ and $s_i$ are the coordinates and confidence of the bounding box, and $w_i$ are the weights of the corresponding models, reflecting their reliability (e.g., based on validation F1-score). The aggregated confidence score is then adjusted to compensate for potential overestimation inherent in aggressive averaging techniques:
$$s_{adjusted} = 0.95 \cdot s_{fused}.$$
Thus, the final box retains consensus coordinates, while the confidence score is modified to account for ensemble uncertainty. This Bayesian correction is particularly effective under conditions of low inter-detector agreement or when dealing with rare object classes.
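A minimal sketch of this weighted aggregation with the 0.95 correction (illustrative; the weights are assumed to come from validation F1-scores):

```python
import numpy as np

def bayesian_fuse(boxes: np.ndarray, scores: np.ndarray,
                  weights: np.ndarray, correction: float = 0.95):
    """Reliability-weighted fusion with confidence correction: coordinates
    follow the weighted average; the score, per the equations above, is
    divided by |G| and then damped by 0.95 to offset overestimation."""
    fused_box = (weights[:, None] * boxes).sum(axis=0) / weights.sum()
    fused_score = (weights * scores).sum() / len(scores)
    return fused_box, correction * float(fused_score)
```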
It should be noted that all bounding box coordinates Bi = [x1, y1, x2, y2] in the implementation are specified in pixel coordinates and, if necessary, normalized to the unit interval [0, 1] by dividing by the image width and height. Model confidence si is interpreted as a scalar in the range [0, 1], and weights wi are dimensionless quantities reflecting the relative contribution of a model, based, for example, on quality metrics.
The selection of the optimal ensemble strategy within the system is based on multi-objective Pareto optimization, considering both model accuracy (via mAP) and its performance (FPS). The objective function determining the best ensemble configuration m is formulated as
$$\mathrm{Optimal} = \underset{m}{\arg\max}\left(\frac{mAP_m}{\max(mAP)} + \frac{FPS_m}{\max(FPS)}\right).$$
The computation is performed via a brute-force search, with dynamic tuning of confidence and IoU thresholds based on object density and available hardware resources (particularly VRAM and CPU/GPU load). Additionally, the system incorporates a benchmarking module that includes inference time analysis and memory consumption monitoring (RAM/VRAM) for various ensemble methods. This ensures an objective comparison of models under real-world operational scenarios.
Additional platform functionalities include support for asynchronous image processing via QThreadPool, a multi-threaded and multi-process inference architecture (multi-GPU), and logging mechanisms with DEBUG, INFO, WARNING levels, alongside automatic crash report generation.
An intelligent assistant, implemented via a generate_recommendations function, is embedded within the system. Based on the analysis of key metrics (mAP, F1-Score, FPS), it suggests optimal model combinations and ensemble methods to the user. Final reporting includes exportable tables, graphs, and summary recommendations available in Excel, SVG, and PDF formats, with the capability to save experiment history to an SQLite database and subsequent filtering by quality criteria (e.g., mAP > 0.7).
This comprehensive approach to building and managing ensembles ensures both scientific reproducibility and the practical applicability of the solution for monitoring agri-food objects.
According to the conducted experiments, the choice of the optimal ensemble method depends on object density within the image, lighting characteristics, the presence of mutual occlusions, and the computational constraints of the system.
To ensure methodological transparency and reproducibility, the benchmarking and model fusion workflow is summarized as a Pre-deployment Pareto Optimization procedure. The algorithm evaluates a set of candidate detectors in terms of accuracy (mAP) and inference speed (FPS), constructing a Pareto frontier that identifies the models offering the best trade-offs. Only the models located on this frontier are included in the adaptive ensemble, where their weights are dynamically optimized according to scene-level performance indicators. This procedure guarantees that the ensemble remains both accurate and computationally efficient under real-world conditions.
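A minimal sketch of extracting the non-dominated (Pareto-optimal) configurations from benchmarked mAP/FPS pairs (illustrative, not the system's implementation):

```python
import numpy as np

def pareto_frontier(map_scores: np.ndarray, fps_scores: np.ndarray) -> list:
    """Indices of non-dominated configurations: keep candidate i unless some
    other candidate is at least as good in both mAP and FPS and strictly
    better in one of them."""
    keep = []
    for i in range(len(map_scores)):
        dominated = np.any(
            (map_scores >= map_scores[i]) & (fps_scores >= fps_scores[i])
            & ((map_scores > map_scores[i]) | (fps_scores > fps_scores[i])))
        if not dominated:
            keep.append(i)
    return keep

# Example: the second configuration is dominated (slower and less accurate)
print(pareto_frontier(np.array([0.95, 0.93, 0.91]),
                      np.array([30.0, 25.0, 55.0])))      # -> [0, 2]
```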
For comprehensive evaluation, additional benchmarks were incorporated, including RT-DETR (ResNet-50) and Rep-ViG-Apple (2024), representing the latest transformer-based and graph-enhanced architectures. Their inclusion ensures that the proposed ensemble is compared against the most recent state-of-the-art detectors in agricultural vision.

4. Results and Discussion

To evaluate the practical feasibility of the proposed adaptive ensemble, a detailed computational resource analysis was conducted. All experiments were performed on an NVIDIA RTX 2080 Ti GPU with 11 GB of VRAM and an Intel i9-13900K CPU. The peak VRAM utilization during ensemble inference reached 7.4 GB, while the average power draw was approximately 210 W. These results demonstrate that the proposed ensemble achieves a favorable trade-off between accuracy and computational cost. Moreover, when deployed with lightweight backbones such as YOLOv8n or EfficientDet-D1, the system keeps the accuracy loss within 2% while reducing memory consumption by nearly 40%, ensuring scalability for embedded and AIoT platforms.
Results of the experimental validation of the developed system are presented within the context of a complete functional pipeline, encompassing the entire lifecycle of user interaction with the intelligent platform. This lifecycle spans from the initial loading and adaptive preprocessing of visual data to the parametric configuration of inference algorithms, computation of key quality metrics, and generation of substantiated analytical reports. Figure 3 illustrates the program workflow as a flowchart of user operations, reflecting the structure and logic governing the automated process of neural model selection and ensembling. The system demonstrates capabilities for context-dependent adaptation and metric-driven self-improvement, ensuring not only enhanced detection accuracy but also reproducible evaluation standards that meet the requirements of applied tasks in precision agriculture and smart horticulture.
The graphical user interface (GUI) is implemented as a modular panel with a logical segmentation into two functional tabs—”Settings” and “Results”—which facilitate the complete cycle of operator interaction with the system.
The "Settings" tab (Figure 4) provides an integrated environment for performing core image loading and visual management operations, including scaling, normalization, filtering, manual annotation, and the application of augmentation procedures. Furthermore, this tab enables the configuration of inference parameters, such as the Confidence Score threshold, the IoU Threshold, and the smoothing parameter (Sigma, σ) for Soft-NMS. The same section also supports the selection of model architectures and corresponding ensemble methods, ensuring that the detection process can be prepared and launched with a high degree of adaptation to specific visual data conditions.
The "Results" tab (Figure 5) serves as the analytical module of the interface, designed for the visual presentation of inference outcomes obtained from both individual convolutional neural network (CNN) models and their ensembles. Within this section, users can perform a comparative analysis of output predictions, which are displayed as graphical overlays on the original images, as well as in the form of comparative tables and graphs highlighting differences in quality metrics. This visualization provides an interpretable performance assessment of each model and the selected ensemble strategy in terms of detection accuracy, recall, and consistency. The software is therefore recommended as a first step when selecting the optimal neural network.
To quantitatively assess the robustness of various ensemble methods to weather condition variability, experimental testing was conducted on a specially curated dataset containing images of apple trees captured under diverse lighting conditions, contrast levels, and visual distortions. The analysis primarily focused on the mean Average Precision (mAP) metric. The YOLOv11m model was used as the base detector, although other models can also be applied. The models were trained on a computer with an Intel Core i9-10900X central processor (10 cores, 3.7 GHz), 64 GB DDR4 RAM, a 1 TB high-speed SSD, and 2 NVIDIA GeForce RTX 2080 TI graphics cards with 11 GB GDDR6 memory.
In the experiment, each of the eleven ensemble strategies was applied to the same set of test images corresponding to specific weather conditions, with the models being pre-trained on a consistent dataset. The mAP was calculated over an IoU range from 0.5 to 0.95 with a step size of 0.05, following the COCO evaluation protocol. The obtained values reflect not only localization accuracy but also the method’s ability to handle artifacts caused by atmospheric and optical distortions.
The summarized results are presented in Table 3, where rows correspond to weather scenarios and columns represent the respective ensemble methods. Each cell contains the mAP value for apple detection, rounded to three decimal places, enabling a comparison of strategy effectiveness under various visual degradation conditions. This table provides a basis for selecting robust ensemble configurations in automated horticultural monitoring applications.
Analysis of the values presented in Table 3 (for the "apple" class) demonstrates a clear and logical variation in the effectiveness of ensemble methods depending on the weather conditions during imaging.
The most advanced methods, such as Weighted Boxes Fusion (WBF), Test-Time Augmentation (TTA), and Bayesian Ensembling, consistently show superior accuracy (mAP > 0.93) under challenging conditions like rain, night, and fog. This underscores their robustness to visual artifacts and noise, a core finding of our study. WBF, in particular, demonstrates stable and high performance across all adverse scenarios, making it a highly reliable choice.
In contrast, traditional approaches like NMS and Score Averaging show competitive or even slightly higher performance (e.g., mAP = 0.966 for NMS) under ideal lighting conditions (“Clear, Day”), where the detection task is inherently simpler and the advantages of complex ensembling are less critical. However, their effectiveness significantly decreases as visibility deteriorates.
A clear trend emerges: in scenarios with high object density and spatial instability, such as “Overcast” and “Windy, cloudy” conditions, methods that adapt to density (Adaptive NMS) and leverage prediction consensus (Consensus Fusion, IoU Voting) also perform notably better than their non-adaptive counterparts. This confirms their ability to maintain localization accuracy amidst occlusion and motion blur.
To further quantify the impact of occlusion and background complexity, we evaluated the proposed ensemble under three distinct conditions: moderate occlusion (30–50%), severe occlusion (>50%), and high background complexity characterized by overlapping branches and mixed illumination. The results demonstrate that the ensemble maintained mAP@0.5–0.95 scores of 0.937, 0.914, and 0.905, respectively, with corresponding mean IoU values of 0.82, 0.78, and 0.76. Even under severe occlusion, where individual CNNs such as YOLOv8 and EfficientDet exhibited performance degradation of up to 6–8%, the adaptive fusion module successfully preserved detection stability, reducing variance to below 2%. These findings confirm that the Weighted Boxes Fusion and Adaptive NMS components effectively mitigate localization errors caused by overlapping canopies and fruit clustering. Moreover, the ensemble’s ability to maintain consistent precision across heterogeneous visual conditions underscores its robustness and suitability for long-term orchard monitoring.
Table 4 extends the analysis through deeper modeling experiments.
To further assess the generalizability of the proposed ensemble, the experimental setup was expanded to encompass detectors of diverse architectural paradigms and computational scales. In addition to the previously employed YOLOv8m, we incorporated the lightweight YOLOv8n, the mid-size EfficientDet-D1, and a transformer-based model RT-DETR (ResNet-50 backbone) representing the DETR family. Each model was trained and evaluated under identical conditions of illumination, occlusion, and background complexity, followed by integration into the adaptive ensemble. The results reveal that while YOLOv8n and EfficientDet-D1 contribute complementary spatial and contextual features, the inclusion of RT-DETR considerably increased inference latency (average 63 ms per frame) without yielding proportional accuracy gains (mAP@0.5–0.95 ≈ 74.1%). Consequently, transformer-based detectors were excluded from the final optimized ensemble to preserve computational feasibility for embedded deployment. Overall, the ensemble maintains consistent accuracy improvements (+2–3% mAP@0.5–0.95) across all CNN-based architectures, confirming its architecture-agnostic robustness.
To reinforce the statistical validity of the proposed framework, we performed an additional quantitative analysis incorporating uncertainty estimation and significance testing. Each model configuration was trained and evaluated across five independent runs with randomized seeds and scene-level bootstrapping. The resulting metrics are expressed as mean ± standard deviation (95% confidence intervals). Statistical significance of performance gains was assessed using a paired bootstrap test with 1000 resamples over image scenes (p < 0.05). Table 5 summarizes these outcomes, demonstrating that the proposed adaptive ensemble consistently achieves statistically significant improvements in detection accuracy (mAP@0.5–0.95) and recall across diverse environmental and occlusion conditions.
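For transparency, the scene-level paired bootstrap test can be sketched as follows; the per-scene AP arrays below are hypothetical stand-ins for the values produced by our evaluation runs.

```python
# Sketch of the scene-level paired bootstrap significance test
# (1000 resamples); per-scene AP values are hypothetical stand-ins.
import numpy as np

def paired_bootstrap_p(scores_a, scores_b, n_resamples=1000, seed=0):
    """One-sided p-value for H0: system A is not better than system B.

    scores_a, scores_b: per-scene metric values over the same scenes.
    """
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a) - np.asarray(scores_b)
    n = len(diffs)
    wins = 0
    for _ in range(n_resamples):
        resample = diffs[rng.integers(0, n, size=n)]  # resample scenes
        if resample.mean() > 0:
            wins += 1
    return 1.0 - wins / n_resamples  # fraction of resamples where A <= B

# Hypothetical per-scene mAP values over 50 test scenes.
rng = np.random.default_rng(1)
ap_ensemble = 0.748 + 0.02 * rng.standard_normal(50)
ap_baseline = 0.718 + 0.02 * rng.standard_normal(50)
print(f"p = {paired_bootstrap_p(ap_ensemble, ap_baseline):.3f}")
```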
Table 5 presents the results of these experiments.
Values are mean ± standard deviation (95% confidence intervals) over five independent runs with randomized seeds. Statistical significance was assessed using paired bootstrap resampling over test scenes; bold entries indicate significant improvement (p < 0.05) relative to the best single-model baseline.
As evidenced in Table 5, the proposed ensemble establishes a new state-of-the-art for apple detection, significantly surpassing all compared models. It outperforms the lightweight YOLOv8n by a large margin (e.g., +4.2% Precision, +4.6% Recall), exceeds the efficient EfficientDet-D1, and even edges out the more computationally intensive RT-DETR in the critical mAP@0.5 metric by +4.8%. The statistical significance (p < 0.05) of these results underscores that the performance gain is substantial and reliable, validating the effectiveness of our adaptive ensemble approach.
As part of the experimental validation of the developed software suite, targeted testing was conducted on images captured under conditions of significant atmospheric interference—specifically, intense rain accompanied by visual scene distortions. To evaluate effectiveness, five convolutional neural network models were employed, each pre-trained on specialized subsets corresponding to different weather scenarios: Rain.pt (precipitation), Cloudy.pt (overcast), Sun.pt (sunny conditions), Fog.pt (fog), and Clear(x).pt (clear weather). Ensemble aggregation of predictions was performed using the eleven methods integrated into the system, allowing for the investigation of their behavior on a uniform test case.
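Assuming the standard Ultralytics YOLO Python API, collecting the per-model predictions that feed the eleven aggregation methods might look like the following sketch; the checkpoint file names follow the list above, while the confidence threshold and output format are illustrative choices.

```python
# Sketch of gathering predictions from the five weather-specialized
# checkpoints prior to ensemble aggregation (Ultralytics YOLO API).
from ultralytics import YOLO

WEIGHTS = ["Rain.pt", "Cloudy.pt", "Sun.pt", "Fog.pt", "Clear(x).pt"]

def collect_predictions(image_path, conf=0.25):
    per_model = []
    for weights in WEIGHTS:
        result = YOLO(weights).predict(image_path, conf=conf, verbose=False)[0]
        per_model.append({
            "boxes": result.boxes.xyxyn.cpu().numpy(),   # normalized [x1,y1,x2,y2]
            "scores": result.boxes.conf.cpu().numpy(),
            "labels": result.boxes.cls.cpu().numpy(),
        })
    return per_model  # handed to any of the eleven aggregation methods
```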
The image used contained a section of an apple tree canopy with multiple fruits and was characterized by a noticeable reduction in local contrast compared to a reference daytime scenario—the brightness change was ΔL = 22%, measured by the deviation in the luminance component. Additionally, the scene featured water droplets on the lens with an approximate density of 28 ± 2 particles per square pixel, as well as localized blurring zones caused by atmospheric dispersion, which were simulated using a Gaussian filter with σ = 3.0. This configuration provides a realistic approximation of the visual instability typical for monitoring systems operating in open-field conditions.
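The stated degradations can be approximated programmatically; the sketch below applies the 22% luminance reduction and the σ = 3.0 Gaussian blur with OpenCV, while the circular droplet masks are a simplified, illustrative stand-in for real on-lens water droplets.

```python
# Sketch of the synthetic degradations described above: a 22% reduction
# of the luminance component and localized Gaussian blur (sigma = 3.0).
import cv2
import numpy as np

def degrade(img_bgr, delta_l=0.22, sigma=3.0, n_drops=30, seed=0):
    rng = np.random.default_rng(seed)
    img = img_bgr.astype(np.float32) / 255.0

    # Reduce the luminance (L) channel by delta_l in Lab space.
    lab = cv2.cvtColor(img, cv2.COLOR_BGR2Lab)
    lab[..., 0] *= (1.0 - delta_l)
    out = cv2.cvtColor(lab, cv2.COLOR_Lab2BGR)

    # Paste blurred circular patches to mimic droplets and dispersion.
    blurred = cv2.GaussianBlur(out, (0, 0), sigmaX=sigma)
    h, w = out.shape[:2]
    for _ in range(n_drops):
        cx, cy = int(rng.integers(0, w)), int(rng.integers(0, h))
        r = int(rng.integers(8, 24))
        mask = np.zeros((h, w), np.uint8)
        cv2.circle(mask, (cx, cy), r, 255, -1)
        out[mask > 0] = blurred[mask > 0]

    return np.clip(out * 255.0, 0, 255).astype(np.uint8)
```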
Visual examples of the spatial object recognition results, both using individual models and applying ensemble methods, are presented in Figure 6. These illustrate the impact of aggregation on localization accuracy and prediction stability in a challenging visual environment.
The testing yielded the following results: the Rain.pt model demonstrated the highest fruit detection accuracy, with a Precision of 0.91 and a Recall of 0.87. Its F1-Score was 19% higher than that of the Sun.pt model (F1-Score = 0.72). The Cloudy.pt model showed partial resilience to precipitation (IoU = 0.74); however, detection errors occurred when fruits were occluded by leaves.
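For reference, the F1-Score of the Rain.pt model follows from the reported precision and recall as their harmonic mean:

$$\mathrm{F1} = \frac{2PR}{P + R} = \frac{2 \cdot 0.91 \cdot 0.87}{0.91 + 0.87} \approx 0.89.$$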
The ensemble method Weighted Boxes Fusion increased the mAP@0.5–0.95 to 0.86 (a 7.5% improvement over the Rain.pt model) by effectively combining predictions from the Rain.pt and Clear.pt models. Bayesian Ensembling reduced the proportion of false positive detections (FP-rate = 0.09 at a confidence threshold > 0.6). Adaptive NMS improved the stability of fruit boundary delineation in blurred image regions (IoU = 0.79 ± 0.03).
The use of ensemble methods compensated for the limitations of individual models, increasing the mAP@0.5–0.95 by 7–12%. The low inference speed of the ensembles (FPS < 0.06) is attributed to the computational complexity of prediction aggregation. Graphs generated from the program’s results are presented in Figure 7.
Experimental results confirm the proposed ensemble framework’s superior performance over single-model detectors across diverse environmental conditions. Methods like Weighted Boxes Fusion (WBF) and Bayesian Ensembling maintained high accuracy in adverse scenarios (rain, night), demonstrating how prediction aggregation compensates for individual model shortcomings through architectural diversity and multi-objective optimization.
However, several limitations warrant discussion. The computational complexity of running multiple CNNs with sophisticated ensemble algorithms results in inference latency prohibitive for real-time processing on low-power embedded systems. Crucially, this does not undermine the framework’s utility, as its innovation lies in pre-inference optimization. The system uses a single representative frame under current conditions to benchmark and select the optimal configuration (either a single model or lightweight ensemble) for subsequent high-FPS, real-time monitoring, ensuring adaptability without sacrificing operational performance.
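A minimal sketch of this benchmark-and-select step is given below; the candidate configurations and metric values are hypothetical, and only the Pareto filtering over accuracy (mAP) and speed (FPS) reflects the selection logic described above.

```python
# Sketch of the pre-deployment selection step: benchmark candidate
# configurations on a representative frame, then keep only the Pareto
# front over accuracy (mAP) and speed (FPS). All entries are hypothetical.
def pareto_front(candidates):
    """Return candidates not dominated in both mAP and FPS."""
    front = []
    for c in candidates:
        dominated = any(
            o["map"] >= c["map"] and o["fps"] >= c["fps"]
            and (o["map"] > c["map"] or o["fps"] > c["fps"])
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return front

candidates = [
    {"name": "Rain.pt alone",        "map": 0.80, "fps": 45.0},
    {"name": "Score Averaging (x5)", "map": 0.84, "fps": 10.0},  # dominated
    {"name": "WBF (Rain + Clear)",   "map": 0.86, "fps": 12.0},
    {"name": "Full 11-method stack", "map": 0.87, "fps": 0.05},
]
for c in pareto_front(candidates):
    print(f'{c["name"]}: mAP={c["map"]:.2f}, FPS={c["fps"]}')
```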
Additional limitations include bounded generalizability to apple cultivars beyond those in our dataset and the resource-intensive nature of the benchmarking process itself. These limitations direct future work toward developing lightweight ensemble techniques, validating the framework on other horticultural products, integrating multimodal sensors, porting to edge platforms, exploring advanced base detectors like Vision Transformers, and automating benchmarking via meta-learning. Addressing these challenges will fully unlock the approach’s potential within agricultural AIoT.
To assess the individual contribution of each component within the proposed ensemble framework, an ablation study was performed. When the adaptive weighting mechanism was disabled and uniform fusion weights were applied, the mean mAP@0.5–0.95 decreased from 0.948 to 0.923, indicating that dynamic weighting substantially enhances detection reliability under variable visual conditions. Furthermore, when the number of component models was reduced from four to two, the average mAP dropped by approximately 2.1%, confirming that model diversity plays an important role in ensemble robustness. These findings highlight that both the adaptive optimization and multi-architecture integration are essential to achieving high stability and accuracy in complex orchard environments.

5. Conclusions

The conducted study demonstrated the high efficacy of adaptive ensembles of neural network models for detecting biological objects in horticultural applications, particularly under unstable visual conditions. The developed software framework integrated functionalities for image preprocessing, inference parameter configuration, implementation of 11 ensemble methods, and automated selection of the optimal configuration based on multi-objective Pareto optimization. The created dataset, comprising images captured under eight distinct weather scenarios, enabled a comprehensive validation of model robustness against atmospheric and photometric distortions.
The results indicated that Weighted Boxes Fusion (WBF), Bayesian Ensembling, and Adaptive NMS delivered the highest accuracy under conditions of rain, fog, and nighttime imaging, which is critically important for 24/7 and all-weather monitoring systems. The application of Test-Time Augmentation (TTA) and consensus-based strategies further improved detection recall without a significant increase in false positive rates. The WBF method achieved a mAP of up to 0.86, while Bayesian Ensembling reduced the FP-rate to 0.09 at a reasonable confidence threshold.
Furthermore, the developed software platform implements a method for pre-inference optimization of neural network configurations. This approach involves the empirical evaluation of both individual models and their ensembles on specially curated representative data subsets (e.g., a single frame captured under current field conditions) before their deployment in practical monitoring scenarios. Based on multi-objective metrics (accuracy, speed, robustness), the system recommends a single optimal model—or a lightweight, pre-optimized ensemble—that will subsequently be deployed for real-time, high-FPS monitoring. This allows agricultural engineers and researchers to pre-select the most robust and efficient model–ensemble combination tailored to the specific environmental conditions and hardware constraints of their target application. Thus, the system serves not only as a monitoring tool but also as a pre-deployment decision-support system for the preliminary design and configuration of reliable, adaptive computer vision pipelines in precision horticulture.
Despite the increased computational load, the integration of adaptive ensemble methods enabled the construction of a robust detection architecture suitable for real-time systems.
The obtained results highlight the practical applicability of this approach for AI-powered IoT (AIoT) environments with limited resources, where high accuracy is required under highly variable scene conditions. The system’s value therefore lies not in continuous, heavy-weight ensembling, but in serving as a decision-support tool for designing reliable and efficient computer vision pipelines for precision horticulture. For practicing horticulturists, this translates into the capability for timely detection of problems at the individual tree level, optimization of agrotechnical measures, accurate yield prediction, and improved resource use efficiency, ultimately contributing to the sustainable development of horticulture and enhanced product quality.
The experimental results confirm all three research hypotheses formulated in the Introduction.
Regarding H1, the adaptive ensemble demonstrated a consistent performance gain of 8–10% in mAP@0.5–0.95 under severe occlusion and illumination variability compared to single-model baselines, validating its robustness in orchard-level conditions.
For H2, cross-architecture evaluation involving YOLOv8n, EfficientDet-D1, and RT-DETR showed stable accuracy (mAP variance < 1.5%) and inference consistency, confirming that adaptive weighting effectively harmonizes heterogeneous feature extractors.
Finally, H3 was verified through Pareto-front optimization, which achieved near–real-time inference (18.9 ms per frame) with less than 2% accuracy degradation relative to full-scale ensembles, proving the approach’s suitability for embedded and AIoT deployment.
Collectively, these findings substantiate the theoretical assumptions and demonstrate that adaptive ensemble strategies can enhance both robustness and efficiency in real-world precision agriculture applications.
Beyond the immediate gains in detection accuracy, this adaptive ensemble framework paves the way for more sustainable and resource-efficient orchard management. By providing reliable, all-weather detection capabilities, the system enables precise monitoring of fruit load and health, which is fundamental for optimizing key agronomic practices. This allows for targeted application of water, fertilizers, and pesticides, significantly reducing waste and environmental impact. Furthermore, the framework’s compatibility with edge devices makes it a cornerstone for integrated AIoT systems, where it can fuse its visual data with inputs from micro-meteorological sensors and weather forecasts. This synergy creates a powerful decision-support system for growers, facilitating data-driven interventions for optimized harvest timing, early disease detection, and ultimately guiding the transition towards fully autonomous, precision farming operations that enhance both yield and ecological sustainability.
The proposed adaptive ensemble framework demonstrates strong potential for real-world deployment in precision horticulture. Beyond its quantitative advantages, the model can be integrated into automated pest detection and yield prediction systems, providing early alerts and supporting strategic decision-making for farm management. Coupled with AIoT sensor networks, it enables continuous monitoring of micro-climatic and phenological variables, enhancing situational awareness in orchards. Furthermore, its lightweight and modular architecture allows seamless incorporation into autonomous robotic harvesting platforms, where reliable fruit localization and classification are essential for operational safety and efficiency. By bridging experimental validation with practical field applications, the proposed approach contributes to data-driven, resource-efficient, and sustainable agricultural production.
In future work, the proposed adaptive ensemble approach will be evaluated on other fruit datasets beyond apples, such as TomatoDet-2023 and related orchard collections, to further verify its generalization capability across different crop types and visual conditions.

Author Contributions

A.K.: Project administration, Writing—original draft, Visualization, Validation, Supervision, Software, Methodology, Investigation, Funding acquisition. N.A.: Writing—review and editing, Visualization, Methodology, Investigation, Data curation, Validation. D.K.: Conceptualization, Writing—review and editing, Investigation, Formal Analysis, Validation. I.S.: Formal Analysis, Investigation, Validation, Methodology. V.Z.: Writing—original draft, Writing—review and editing, Visualization, Data curation, Validation. All authors have read and agreed to the published version of the manuscript.

Funding

The research was carried out with the support of a grant from the Russian Science Foundation (Agreement No. 24-76-10071, dated 9 August 2024), https://rscf.ru/project/24-76-10071/ (accessed on 22 October 2025).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Cembrowska-Lech, D.; Krzemińska, A.; Miller, T.; Nowakowska, A.; Adamski, C.; Radaczyńska, M.; Mikiciuk, G.; Mikiciuk, M. An Integrated Multi-Omics and Artificial Intelligence Framework for Advance Plant Phenotyping in Horticulture. Biology 2023, 12, 1298. [Google Scholar] [CrossRef]
  2. Korchagin, S.A.; Gataullin, S.T.; Osipov, A.V.; Smirnov, M.V.; Suvorov, S.V.; Serdechnyi, D.V.; Bublikov, K.V. Development of an Optimal Algorithm for Detecting Damaged and Diseased Potato Tubers Moving along a Conveyor Belt Using Computer Vision Systems. Agronomy 2021, 11, 1980. [Google Scholar] [CrossRef]
  3. Andriyanov, N.A.; Dementiev, V.E.; Kargashin, Y.D. Analysis of the impact of visual attacks on the characteristics of neural networks in image recognition. Procedia Comput. Sci. 2021, 186, 495–502. [Google Scholar] [CrossRef]
  4. Al Mudawi, N.; Qureshi, A.M.; Abdelhaq, M.; Alshahrani, A.; Alazeb, A.; Alonazi, M.; Algarni, A. Vehicle Detection and Classification via YOLOv8 and Deep Belief Network over Aerial Image Sequences. Sustainability 2023, 15, 14597. [Google Scholar] [CrossRef]
  5. Kutyrev, A.; Andriyanov, N. Apple Flower Recognition Using Convolutional Neural Networks with Transfer Learning and Data Augmentation Technique. E3S Web Conf. 2024, 493, 01006. [Google Scholar] [CrossRef]
  6. Wang, Y.; Lin, X.; Xiang, Z.; Su, W.-H. VM-YOLO: YOLO with VMamba for Strawberry Flowers Detection. Plants 2025, 14, 468. [Google Scholar] [CrossRef] [PubMed]
  7. Yang, X.; Gao, Y.; Yin, M.; Li, H. Automatic Apple Detection and Counting with AD-YOLO and MR-SORT. Sensors 2024, 24, 7012. [Google Scholar] [CrossRef] [PubMed]
  8. Sun, X.; He, L.; Jiang, H.; Li, R.; Mao, W.; Zhang, D.; Majeed, Y.; Andriyanov, N.; Soloviev, V.; Fu, L. Morphological estimation of primary branch length of individual apple trees during the deciduous period in modern orchard based on PointNet++. Comput. Electron. Agric. 2024, 220, 108873. [Google Scholar] [CrossRef]
  9. Dai, M.; Dorjoy, M.M.H.; Miao, H.; Zhang, S. A New Pest Detection Method Based on Improved YOLOv5m. Insects 2023, 14, 54. [Google Scholar] [CrossRef]
  10. Rathar, A.S.; Choudhury, S.; Sharma, A.; Nautiyal, P.; Shah, G. Empowering vertical farming through IoT and AI-Driven technologies: A comprehensive review. Heliyon 2024, 10, 1014. [Google Scholar] [CrossRef]
  11. Muhammed, D.; Ahvar, E.; Ahvar, S.; Trocan, M.; Montpetit, M.J.; Ehsani, R. Artificial Intelligence of Things (AIoT) for smart agriculture: A review of architectures, technologies and solutions. J. Netw. Comput. Appl. 2024, 228, 103905. [Google Scholar] [CrossRef]
  12. Khan, A.T.; Jensen, S.M. LEAF-Net: A Unified Framework for Leaf Extraction and Analysis in Multi-Crop Phenotyping Using YOLOv11. Agriculture 2025, 15, 196. [Google Scholar] [CrossRef]
  13. Olguín-Rojas, J.C.; Vasquez, J.I.; López-Canteñs, G.d.J.; Herrera-Lozada, J.C.; Mota-Delfin, C. A Lightweight YOLO-Based Architecture for Apple Detection on Embedded Systems. Agriculture 2025, 15, 838. [Google Scholar] [CrossRef]
  14. Mirhaji, H.; Soleymani, M.; Asakereh, A.; Mehdizadeh, S.A. Fruit detection and load estimation of an orange orchard using the YOLO models through simple approaches in different imaging and illumination conditions. Comput. Electron. Agric. 2021, 191, 106533. [Google Scholar] [CrossRef]
  15. Nix, S.; Sato, A.; Madokoro, H.; Yamamoto, S.; Nishimura, Y.; Sato, K. Detection of Apple Trees in Orchard Using Monocular Camera. Agriculture 2025, 15, 564. [Google Scholar] [CrossRef]
  16. Han, B.; Lu, Z.; Zhang, J.; Almodfer, R.; Wang, Z.; Sun, W.; Dong, L. Rep-ViG-Apple: A CNN-GCN Hybrid Model for Apple Detection in Complex Orchard Environments. Agronomy 2024, 14, 1733. [Google Scholar] [CrossRef]
  17. Sekharamantry, P.K.; Melgani, F.; Malacarne, J.; Ricci, R.; de Almeida Silva, R.; Marcato Junior, J. A Seamless Deep Learning Approach for Apple Detection, Depth Estimation, and Tracking Using YOLO Models Enhanced by Multi-Head Attention Mechanism. Computers 2024, 13, 83. [Google Scholar] [CrossRef]
  18. Amantay, N.; Mohamad, K. Deep Learning Based Apple Detection: A Comparative Analysis of CNN Architectures. In Proceedings of the 2025 IEEE 5th International Conference on Smart Information Systems and Technologies (SIST), Astana, Kazakhstan, 14–16 May 2025; pp. 1–7. [Google Scholar] [CrossRef]
  19. Ma, L.; Zhao, L.; Wang, Z.; Zhang, J.; Chen, G. Detection and Counting of Small Target Apples under Complicated Environments by Using Improved YOLOv7-tiny. Agronomy 2023, 13, 1419. [Google Scholar] [CrossRef]
  20. Lv, M.; Xu, Y.; Miao, Y.; Su, W. A Comprehensive Review of Deep Learning in Computer Vision for Monitoring Apple Tree Growth and Fruit Production. Sensors 2025, 25, 2433. [Google Scholar] [CrossRef]
  21. Xiao, F.; Wang, H.; Xu, Y.; Zhang, R. Fruit Detection and Recognition Based on Deep Learning for Automatic Harvesting: An Overview and Review. Agronomy 2023, 13, 1625. [Google Scholar] [CrossRef]
  22. Khosravi, H.; Saedi, S.I.; Rezaei, M. Real-Time Recognition of on-Branch Olive Ripening Stages by a Deep Convolutional Neural Network. Sci. Hortic. 2021, 287, 110252. [Google Scholar] [CrossRef]
  23. Wang, D.; Li, C.; Song, H.; Xiong, H.; Liu, C.; He, D. Deep Learning Approach for Apple Edge Detection to Remotely Monitor Apple Growth in Orchards. IEEE Access 2020, 8, 26911–26925. [Google Scholar] [CrossRef]
  24. Fu, L.; Gao, F.; Wu, J.; Li, R.; Karkee, M.; Zhang, Q. Application of Consumer RGB-D Cameras for Fruit Detection and Localization in Field: A Critical Review. Comput. Electron. Agric. 2020, 177, 105687. [Google Scholar] [CrossRef]
  25. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. arXiv 2021, arXiv:2103.14030. [Google Scholar] [CrossRef]
  26. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929. [Google Scholar]
  27. Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model. arXiv 2024, arXiv:2401.09417. [Google Scholar] [CrossRef]
  28. Abulfaraj, A.W.; Binzagr, F. A Deep Ensemble Learning Approach Based on a Vision Transformer and Neural Network for Multi-Label Image Classification. Big Data Cogn. Comput. 2025, 9, 39. [Google Scholar] [CrossRef]
  29. Vyas, R.; Williams, B.M.; Rahmani, H.; Boswell-Challand, R.; Jiang, Z.; Angelov, P.; Black, S. Ensemble-Based Bounding Box Regression for Enhanced Knuckle Localization. Sensors 2022, 22, 1569. [Google Scholar] [CrossRef]
  30. Andriyanov, N.; Tashlinsky, A.; Dementiev, V. Zero-Shot Detection in Satellite Images. In Proceedings of the 2025 Systems of Signal Synchronization, Generating and Processing in Telecommunications (SYNCHROINFO), Tyumen, Russia, 30 June–3 July 2025; pp. 1–5. [Google Scholar] [CrossRef]
  31. Jung, S.; Song, A.; Lee, K.; Lee, W.H. Advanced Building Detection with Faster R-CNN Using Elliptical Bounding Boxes for Displacement Handling. Remote Sens. 2025, 17, 1247. [Google Scholar] [CrossRef]
  32. Sun, L.; Chen, J.; Feng, D.; Xing, M. Parallel Ensemble Deep Learning for Real-Time Remote Sensing Video Multi-Target Detection. Remote Sens. 2021, 13, 4377. [Google Scholar] [CrossRef]
  33. Abdullah, L.N.; Sidi, F.; Kurmashev, I.G.; Iklassova, K.E. Ensemble Deep Learning Approach for Apple Fruitlet Detection from Digital Images. Kozybayev N. Kazakhstan Univ. 2024, 4, 183–194. [Google Scholar] [CrossRef]
  34. Suo, R.; Gao, F.; Zhou, Z.; Fu, L.; Song, Z.; Dhupia, J.; Li, R.; Cui, Y. Improved multi-classes kiwifruit detection in orchard to avoid collisions during robotic picking. Comput. Electron. Agric. 2021, 182, 106052. [Google Scholar] [CrossRef]
  35. Zhao, Z.-A.; Wang, S.; Chen, M.-X.; Mao, Y.-J.; Chan, A.C.-H.; Lai, D.K.-H.; Wong, D.W.-C.; Cheung, J.C.-W. Enhancing Human Detection in Occlusion-Heavy Disaster Scenarios: A Visibility-Enhanced DINO (VE-DINO) Model with Reassembled Occlusion Dataset. Smart Cities 2025, 8, 12. [Google Scholar] [CrossRef]
  36. Yang, Z.-X.; Li, Y.; Wang, R.-F.; Hu, P.; Su, W.-H. Deep Learning in Multimodal Fusion for Sustainable Plant Care: A Comprehensive Review. Sustainability 2025, 17, 5255. [Google Scholar] [CrossRef]
  37. Melnychenko, O.; Ścisło, Ł.; Savenko, O.; Sachenko, A.; Radiuk, P. Intelligent Integrated System for Fruit Detection Using Multi-UAV Imaging and Deep Learning. Sensors 2024, 24, 1913. [Google Scholar] [CrossRef] [PubMed]
  38. Sapkota, R.; Ahmed, D.; Karkee, M. Comparing YOLOv8 and Mask R-CNN for Instance Segmentation in Complex Orchard Environments. Artif. Intell. Agric. 2024, 13, 84–99. [Google Scholar] [CrossRef]
  39. Yan, B.; Fan, P.; Lei, X.; Liu, Z.; Yang, F. A Real-Time Apple Targets Detection Method for Picking Robot Based on Improved YOLOv5. Remote Sens. 2021, 13, 1619. [Google Scholar] [CrossRef]
  40. Chen, J.; Liu, H.; Zhang, Y.; Zhang, D.; Ouyang, H.; Chen, X. A Multiscale Lightweight and Efficient Model Based on YOLOv7: Applied to Citrus Orchard. Plants 2022, 11, 3260. [Google Scholar] [CrossRef]
  41. Li, J.; Tang, Y.; Zou, X.; Lin, G.; Wang, H. Detection of Fruit-Bearing Branches and Localization of Litchi Clusters for Vision-Based Harvesting Robots. IEEE Access 2020, 8, 117746–117758. [Google Scholar] [CrossRef]
  42. Barbedo, J.G.A. Impact of Dataset Size and Variety on the Effectiveness of Deep Learning and Transfer Learning for Plant Disease Classification. Comput. Electron. Agric. 2018, 153, 46–53. [Google Scholar] [CrossRef]
  43. Quiroz, I.A.; Alférez, G.H. Image Recognition of Legacy Blueberries in a Chilean Smart Farm through Deep Learning. Comput. Electron. Agric. 2020, 168, 105044. [Google Scholar] [CrossRef]
  44. Vasconez, J.P.; Delpiano, J.; Vougioukas, S.; Auat Cheein, F. Comparison of Convolutional Neural Networks in Fruit Detection and Counting: A Comprehensive Evaluation. Comput. Electron. Agric. 2020, 173, 105348. [Google Scholar] [CrossRef]
  45. Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A Survey on Vision Transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 87–110. [Google Scholar] [CrossRef] [PubMed]
  46. Kambhatla, C.; Sharma, M. A Comprehensive Review of YOLO Architectures in Computer Vision. Vision 2024, 5, 83. [Google Scholar]
  47. Bodla, N.; Singh, B.; Chellappa, R.; Davis, L. Soft-NMS—Improving Object Detection with One Line of Code. arXiv 2017, arXiv:1704.04503. [Google Scholar]
  48. Solovyev, R.; Wang, W.; Gabruseva, T. Weighted boxes fusion: Ensembling boxes from different object detection models. Image Vis. Comput. 2021, 107, 104117. [Google Scholar] [CrossRef]
  49. Liu, Y.; Zhang, X.; Wang, Y. Multi-Attribute NMS: An Enhanced Non-Maximum Suppression for Crowded Scenes. Appl. Sci. 2023, 13, 8073. [Google Scholar]
  50. Sharifuzzaman, S.A.S.M.; Tanveer, J.; Chen, Y.; Chan, J.H.; Kim, H.S.; Kallu, K.D.; Ahmed, S. Bayes R-CNN: An Uncertainty-Aware Bayesian Approach to Object Detection in Remote Sensing Imagery for Enhanced Scene Interpretation. Remote Sens. 2024, 16, 2405. [Google Scholar] [CrossRef]
  51. Chen, Y.; Shi, J.; Ye, Z.; Mertz, C.; Ramanan, D.; Kong, S. Multimodal Object Detection via Probabilistic Ensembling. arXiv 2021, arXiv:2104.02904. [Google Scholar]
  52. Vilhelm, A.; Limbert, M.; Audebert, C.; Ceillier, T. Ensemble Learning techniques for object detection in high-resolution satellite images. arXiv 2022, arXiv:2202.10554. [Google Scholar] [CrossRef]
  53. Khort, D.O.; Kutyrev, A.; Smirnov, I.; Andriyanov, N.; Filippov, R.; Chilikin, A.; Astashev, M.E.; Molkova, E.A.; Sarimov, R.M.; Matveeva, T.A.; et al. Enhancing Sustainable Automated Fruit Sorting: Hyperspectral Analysis and Machine Learning Algorithms. Sustainability 2024, 16, 10084. [Google Scholar] [CrossRef]
Figure 1. Overall workflow of the adaptive CNN ensemble framework for agricultural object detection.
Figure 2. Structural Scheme for Ensembling.
Figure 3. User Operation Flowchart.
Figure 4. Program Interface: «Settings» Tab.
Figure 5. Program Interface: «Results» Tab.
Figure 6. Results of Apple Fruit Recognition Using the Automated Neural Network Model Selection and Ensembling Program.
Figure 7. Comparative analysis of detection quality metrics obtained using the program.
Table 1. Dataset Distribution.

| Shooting Condition | Number of Images | Average Apples per Image | Condition Features |
|---|---|---|---|
| Clear, Morning | 850 | 12.3 | Soft morning light, long shadows, moderate contrast |
| Clear, Day | 920 | 16.7 | High contrast, sunlight glare, saturated colors |
| Overcast | 780 | 14.9 | Diffused lighting, no shadows, reduced color gradient |
| Fog | 510 | 9.1 | Blurred contours, low local contrast, light veil |
| Rain (Light/Heavy) | 620 | 10.2 | Glare, raindrops on lens, partial dimming |
| Evening/Sunset | 470 | 13.5 | Warm spectrum, pronounced shadows |
| Night | 350 | 7.6 | Black-and-white silhouettes, no color information |
| Wind/Cloudy | 390 | 12.0 | Shifted contours, motion blur, unstable background |
| Total | 4890 | ~12 | – |
Table 2. Comparison of Different Ensemble Methods.

| Method | Disadvantages | Advantages |
|---|---|---|
| NMS | Loss of detections for partially occluded objects | Simplicity, speed |
| Soft NMS | Dependence on parameter σ | Preservation of overlapping objects |
| NMW | Sensitivity to confidence score values, computational complexity | Weighted averaging of coordinates, precise box positioning |
| WBF | Computational complexity | Accounting for model weights, coordinate accuracy |
| Score Averaging | Ignoring differences in model reliability | Ease of implementation, stabilization of scores |
| Weighted Averaging | Requires prior weight calibration | Considering model reliability via weighting |
| IoU Voting | Ineffective at low IoU values | Noise robustness |
| Consensus Fusion | Loss of detections with insufficient agreement | Improved reliability through detection agreement |
| Adaptive NMS | Requires parameter calibration | Adaptiveness to object density |
| TTA | Increased processing time | Improved detection robustness via augmentation |
| Bayesian Ensembling | Computational and parameter tuning complexity | Accounting for uncertainty, adjusting model confidence |
Table 3. Mean Average Precision (mAP) for Apples Under Various Weather Conditions.

| Shooting Conditions | NMS | Soft NMS | NMW | WBF | Score Averaging | Weighted Averaging | IoU Voting | Consensus Fusion | Adaptive NMS | TTA | Bayesian Ensembling |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Wind, cloudiness | 0.859 | 0.893 | 0.864 | 0.924 | 0.865 | 0.890 | 0.858 | 0.887 | 0.889 | 0.908 | 0.917 |
| Evening/Sunset | 0.861 | 0.874 | 0.855 | 0.915 | 0.889 | 0.897 | 0.883 | 0.893 | 0.884 | 0.927 | 0.943 |
| Rain (Light/Heavy) | 0.881 | 0.887 | 0.872 | 0.957 | 0.916 | 0.922 | 0.943 | 0.963 | 0.966 | 0.949 | 0.935 |
| Night | 0.867 | 0.859 | 0.74 | 0.937 | 0.943 | 0.943 | 0.948 | 0.935 | 0.957 | 0.968 | 0.946 |
| Overcast | 0.885 | 0.905 | 0.912 | 0.923 | 0.874 | 0.921 | 0.856 | 0.923 | 0.870 | 0.944 | 0.909 |
| Fog | 0.862 | 0.932 | 0.903 | 0.959 | 0.865 | 0.909 | 0.865 | 0.854 | 0.947 | 0.887 | 0.966 |
| Clear, Day | 0.966 | 0.950 | 0.875 | 0.872 | 0.872 | 0.887 | 0.913 | 0.902 | 0.885 | 0.923 | 0.867 |
| Clear, Morning | 0.869 | 0.969 | 0.857 | 0.935 | 0.954 | 0.922 | 0.954 | 0.922 | 0.895 | 0.938 | 0.852 |
Table 4. Quantitative performance comparison of the proposed ensemble and state-of-the-art detectors under orchard conditions.

| Model | Precision (%) | Recall (%) | mAP@0.5 (%) | mAP@0.5–0.95 (%) | Inference Time (ms) |
|---|---|---|---|---|---|
| YOLOv8n (2024) | 91.2 ± 0.8 (95% CI [90.1, 92.3]) | 89.5 ± 1.0 (95% CI [87.6, 91.4]) | 91.8 ± 0.7 (95% CI [90.6, 93.0]) | 69.4 ± 1.2 (95% CI [67.1, 71.7]) | 12.8 |
| Rep-ViG-Apple (2024) | 92.5 ± 0.6 (95% CI [91.4, 93.6]) | 90.1 ± 0.9 (95% CI [88.5, 91.7]) | 92.7 ± 0.8 (95% CI [91.2, 94.2]) | 71.3 ± 1.1 (95% CI [69.2, 73.4]) | 16.5 |
| AD-YOLO (2024) | 93.0 ± 0.9 (95% CI [91.3, 94.7]) | 91.7 ± 0.8 (95% CI [90.1, 93.3]) | 93.8 ± 0.7 (95% CI [92.5, 95.1]) | 72.6 ± 1.0 (95% CI [70.6, 74.6]) | 14.7 |
| Proposed Adaptive Ensemble (Ours) | 95.4 ± 0.5 (95% CI [94.5, 96.3]) | 94.1 ± 0.6 (95% CI [93.0, 95.2]) | 95.8 ± 0.4 (95% CI [95.0, 96.6]) | 74.8 ± 0.7 (95% CI [73.5, 76.1]) | 18.9 |
Table 5. Statistical significance analysis of the proposed ensemble against baseline detectors.

| Model | Precision (%) | Recall (%) | mAP@0.5 (%) | mAP@0.5–0.95 (%) | Significance |
|---|---|---|---|---|---|
| YOLOv8n (2024) | 91.2 ± 0.8 (95% CI [90.1, 92.3]) | 89.5 ± 1.0 (95% CI [87.6, 91.4]) | 91.8 ± 0.7 (95% CI [90.6, 93.0]) | 69.4 ± 1.2 (95% CI [67.1, 71.7]) | – |
| EfficientDet-D1 (2024) | 92.8 ± 0.6 (95% CI [91.7, 93.9]) | 90.4 ± 0.8 (95% CI [88.9, 91.9]) | 93.1 ± 0.5 (95% CI [92.1, 94.1]) | 71.8 ± 0.9 (95% CI [70.0, 73.6]) | – |
| RT-DETR (R-50) | 90.5 ± 1.1 (95% CI [88.4, 92.6]) | 88.1 ± 1.2 (95% CI [85.7, 90.5]) | 91.0 ± 0.8 (95% CI [89.4, 92.6]) | 74.1 ± 1.0 (95% CI [72.1, 76.1]) | – |
| Proposed Adaptive Ensemble | 95.4 ± 0.5 (95% CI [94.5, 96.3]) | 94.1 ± 0.6 (95% CI [93.0, 95.2]) | 95.8 ± 0.4 (95% CI [95.0, 96.6]) | 74.8 ± 0.7 (95% CI [73.5, 76.1]) | p < 0.05 |