Article

Tomato Maturity Classification and Fruit Counting Based on RGB and Multispectral Images †

1 Department of Computer Science and Information Engineering, National Taipei University of Technology, Taipei 10608, Taiwan
2 Department of Electrical Engineering, National Chung Cheng University, Chiayi 621301, Taiwan
3 Department of Computer Science and Information Engineering, National United University, Miaoli 360302, Taiwan
* Author to whom correspondence should be addressed.
This paper is an extended version of our paper published in the 2024 IEEE International Conference on Industrial Technology (ICIT 2024), Bristol, UK, 25–27 March 2024, under the title “Maturity and yield estimation of tomatoes using RGB and multispectral images”.
Appl. Sci. 2026, 16(7), 3227; https://doi.org/10.3390/app16073227
Submission received: 25 February 2026 / Revised: 25 March 2026 / Accepted: 25 March 2026 / Published: 26 March 2026
(This article belongs to the Special Issue Applications of Image Processing Technology in Agriculture)

Abstract

Accurate monitoring of tomato maturity and fruit number is essential for improving crop management and supporting accurate yield estimation in greenhouse environments. However, variations in lighting conditions, occlusions, and overlapping fruits often make reliable maturity classification and fruit counting challenging. This paper presents an integrated approach for tomato maturity classification and fruit number estimation using RGB and multispectral images. The proposed approach consists of tomato detection, tomato tracking and counting, and maturity classification of tomatoes. The detection model identifies tomatoes in each frame, the tracking module associates individual tomatoes across image sequences to avoid duplicate counting, and the classification models determine maturity levels. Experiments are conducted on three tomato datasets collected in greenhouse environments. The results show that the proposed method achieves a maximum maturity classification accuracy of 81%. In addition, the proposed approach facilitates consistent fruit counting across image sequences, supporting practical applications in greenhouse monitoring. These findings demonstrate the potential of integrating RGB and multispectral information for automated tomato maturity classification and fruit counting in precision agriculture.

1. Introduction

Tomato is one of the most extensively cultivated vegetable crops in the world and plays an important role in global food supply and agricultural economies. According to recent agricultural statistics, global tomato production exceeds one hundred million tons annually, making it one of the most economically significant horticultural crops. Tomatoes are widely consumed for their nutritional value and are an essential component of many diets worldwide. Beyond their economic importance, efficient tomato production is closely linked to food security and sustainable agricultural development.
Accurate monitoring of tomato maturity and fruit number is a critical task in greenhouse production, because this information directly informs yield estimation, harvest scheduling, and crop management decisions. However, traditional manual monitoring is labor-intensive, time-consuming, and prone to human error, especially in large-scale production environments. Therefore, developing automated and reliable image-based methods for classifying tomato maturity and estimating fruit numbers has become an important research direction in precision agriculture.
Several existing approaches to crop maturity classification rely on the spectral response of fruits across different wavelength bands. Zhang et al. [1] analyzed spectral variations in strawberries at multiple developmental stages to determine maturity, while infrared reflectance from the fruit surface was used to evaluate the maturity of blueberries [2]. In addition to spectral characteristics, visual changes associated with fruit development provide important cues for quality assessment and can be exploited by image classification methods. Ashtiani et al. [3] presented an approach for classifying mulberry fruit ripening stages using convolutional neural networks (CNNs), applying transfer learning to fine-tune the CNNs in order to reduce training cost and improve classification accuracy. They categorized blackberries and raspberries into four maturity stages: unripe, partially ripe, ripe, and overripe. Yang et al. [4] applied image processing techniques to assess crop growth status and predict the harvest time of strawberries in a greenhouse, using object detection and machine learning classification to determine the growth stages of strawberries.
In addition to evaluating fruit quality and maturity, vision-based approaches have been applied to fruit detection and localization for both fruit number estimation and automated harvesting. Liu et al. [5] introduced a multi-ellipse boundary representation in the Cr–Cb color space to distinguish citrus fruits and tree trunks from complex backgrounds. Tanigaki et al. [6] developed an autonomous robotic system for cherry harvesting. Their system integrated thermal imaging to discriminate ripe cherries from the surrounding foliage based on infrared reflectance, which yielded more accurate localization of mature fruits and improved the identification of fruits suitable for harvesting. Niedbała et al. [7] integrated a diverse set of maturity-related indicators, including over 30 multispectral vegetation indices, meteorological data collected between the 121st and 181st days of the year, and bee foraging activity density. DeLong et al. [8] monitored fruit chlorophyll content over time to determine the optimal time to begin harvesting. The adoption of such intelligent agricultural technologies allowed growers to optimize orchard management and apply timely pest and disease control measures; consequently, both fruit quality and the number of harvested fruits were significantly improved. Li et al. [9] used the YOLOv8 object detection framework as the primary detector for identifying crops in agricultural environments, owing to its real-time performance and strong feature extraction capability. However, this approach focused mainly on object detection and did not incorporate additional analysis tasks such as crop growth stage estimation or yield prediction. Wang et al. [10] proposed an enhanced object detection framework for identifying tomatoes in greenhouse scenes. They adopted YOLOv8 as the baseline detector and introduced several modifications to improve detection accuracy under agricultural conditions. However, this method focused primarily on fruit detection and did not address subsequent tasks such as fruit tracking and maturity classification.
Several methods rely on single-image analysis without tracking individual fruits over time. This limitation often results in duplicated counting and unstable maturity predictions. Moreover, numerous approaches based solely on spectral indices or visual features may lack the capability to distinguish visually similar maturity stages. To overcome these challenges, this paper proposes an integrated approach that combines tomato detection, tomato tracking and counting, and maturity classification of tomatoes. By associating individual tomatoes across image sequences and jointly analyzing RGB and multispectral features, the proposed approach facilitates consistent maturity classification and accurate counting of tomatoes at different maturity stages. This integrated approach provides a practical solution for supporting data-driven harvest planning in real agricultural environments [11].
The proposed method does not introduce a new algorithm. The novelty of this paper lies in the integration of multiple established components within a unified framework. The main contributions of this paper are summarized as follows. First, we propose an integrated approach that combines tomato detection, multi-object tracking and counting, and maturity classification from image sequences. This integration allows consistent association of individual tomatoes across frames and reduces duplicate counting. Second, we investigate the effectiveness of combining RGB features with multispectral vegetation indices for tomato maturity classification and analyze the relative contributions of different spectral indicators. Third, we provide an experimental evaluation on multiple tomato datasets and analyze the performance of several classification models under different feature configurations, including RGB and multispectral inputs.

2. Related Works

2.1. Fruit Detection

Fruit detection is an important task in precision agriculture and has been widely studied using computer vision and deep learning techniques. Afonso et al. [12] employed the Mask R-CNN algorithm to simultaneously detect and segment tomatoes in images. To enhance depth perception, a RealSense camera was used to measure the distance between the tomatoes and the imaging system, thereby facilitating effective foreground-background separation. For the detection backbone, ResNet50, ResNet101, and ResNeXt101 were integrated with Mask R-CNN and comparatively evaluated. The incorporation of depth-assisted foreground segmentation using the RealSense camera significantly improved tomato detection accuracy and robustness in complex scenes. Vasconez et al. [13] investigated fruit detection for counting applications using Faster R-CNN with Inception V2 and SSD with MobileNet. Their approach considered multiple fruit types, including avocados, lemons, and apples, which exhibit substantial variability in color, size, and shape. To address overfitting and improve model generalization, two data augmentation strategies were adopted: horizontal image flipping and the use of images with varying pixel resolutions. These augmentations effectively enhanced detection performance across diverse visual conditions.

2.2. Fruit Tracking

Fruit tracking aims to monitor individual fruits across image sequences to avoid repeated counting and to analyze growth dynamics. Liu et al. [14] proposed an aggregation-based approach for tracking small objects. Initially, discriminative features were extracted from regions surrounding each object, including local appearance features, gradient information, and edge cues. The local descriptors consisted of histogram-based representations, texture measures, and shape features, which were combined to form a comprehensive feature vector. Aggregation techniques, such as hash-based functions, were then applied to compress these representations into compact signatures. The resulting signatures were stored in a memory bank to facilitate efficient and reliable tracking over time. Rincon et al. [15] introduced a unified representation technique for highly occluded agricultural environments using multi-view perception and 3D multi-object tracking. Multiple cameras captured tomato plants from different viewpoints, and 3D tracking algorithms associated fruit instances across sequential frames. By exploiting geometric constraints and computer vision techniques, their system reconstructed detailed 3D models of the fruits. This reconstruction facilitated accurate estimation of fruit size, volume, and shape, thereby supporting more precise counting and phenotypic analysis.
Zhao and Tao [16] proposed a tracking approach based on color correlograms as the primary feature descriptor. A simplified color correlogram was used to represent target appearance while preserving essential spatial information. Gradient descent optimization and the mean-shift algorithm served as the core localization mechanisms and were extended to operate in three-dimensional space. Through iterative mean-shift refinement, the method estimated the most probable target position and orientation with improved stability. Zhao et al. [17] focused on enhancing the conventional two-stage detection-and-tracking architecture by developing a correlation filter-based tracker built on compressed deep convolutional neural network features. By integrating CNN representations with a correlation filter tracker and embedding target-specific semantic information into the compressed features, their method reduced computational overhead while maintaining high tracking performance. The approach was validated on standard benchmarks, including the MOT [18] and KITTI [19] datasets.

2.3. Maturity Estimation

Fruit maturity classification focuses on determining the developmental stage of fruits based on visual or spectral features. Castro et al. [20] applied an image-based recognition technique to classify gooseberry maturity into seven distinct levels. In addition to RGB images, HSV and L*a*b* color spaces were incorporated, and principal component analysis (PCA) was performed to extract discriminative color features. Several classifiers, including support vector machines (SVMs), k-nearest neighbors (KNNs), artificial neural networks (ANNs), and decision trees, were evaluated for maturity prediction. Prior to training, the gooseberry images were segmented to remove background noise, and the resulting color components from the three color spaces were used as feature inputs. This multi-color-space strategy enhanced the separability of maturity stages. El-Bendary et al. [21] investigated tomato maturity evaluation using machine learning techniques. After segmenting the tomato regions from the captured images, the ratio of green to non-green surface areas was computed to categorize maturity into five levels. PCA was conducted on the HSV hue component to reduce feature dimensionality. For classification, SVM and linear discriminant analysis (LDA) were employed, with the SVM models implemented using both one-against-one and one-against-all strategies. Their approach demonstrated that color distribution features are effective indicators of tomato ripeness.
Martins et al. [22] proposed a vegetation index-based approach for monitoring coffee maturity using aerial multispectral images. A multispectral camera was used to capture coffee bean datasets under controlled illumination. Reflectance values at different wavelengths were analyzed to compute vegetation indices [23,24]. These indices, derived from spectral combinations, were used to evaluate the sensitivity of coffee beans across developmental stages. The resulting spectral signatures allowed estimation of the proportion of mature beans and facilitated the construction of maturity datasets. Nandi et al. [25] developed a machine vision system for predicting mango maturity to support automated sorting of harvested fruit. Each mango variety was categorized into four shelf-life-based maturity levels (M1–M4). For PCA-based feature extraction, each fruit was divided into apex, equator, and stalk regions. Mean RGB histogram values, inter-channel differences, and vertical color gradients were computed for each region. An SVM classifier combined with recursive feature elimination (SVM-RFE) was used to rank the 27 extracted features and select the most informative subset. Classification performance was assessed using confusion matrix analysis, demonstrating effective maturity discrimination.
Tan et al. [26] introduced an approach to determine blueberry maturity. An initial HOG-based SVM classifier showed limited performance, prompting the integration of HSV color features. The hue channel was extracted to capture maturity-related color variations, while HOG descriptors represented texture information. These complementary features were fused into a unified representation. KNN and template matching with weighted Euclidean distance (TMWE) classifiers were subsequently applied, resulting in improved classification accuracy. Waseem et al. [27] proposed an automated tomato maturity estimation approach based on an optimized residual neural network enhanced with pruning and quantization techniques. A residual network backbone first classified tomatoes into discrete maturity stages using image inputs. Structured pruning eliminated redundant parameters, and weight quantization reduced numerical precision, thereby decreasing model size, memory consumption, and inference latency. This optimization enabled efficient deployment in agricultural applications. However, the model’s performance remained sensitive to illumination variations and dataset diversity, indicating potential limitations in real-world environments.
Although numerous studies have explored image-based techniques for plant maturity classification and fruit counting, several limitations remain. Many existing approaches rely primarily on RGB images, which are sensitive to variations in illumination, shadows, and background complexity in greenhouse environments. In addition, some methods focus on general object detection frameworks without incorporating vegetation-specific spectral information, which may limit their effectiveness in distinguishing subtle maturity differences. Furthermore, several previous approaches evaluated classification models using limited feature sets or single algorithms, which may not fully capture the spectral characteristics associated with plant physiological changes during ripening. These limitations highlight the need for more robust feature representations and a comparative analysis of multiple classification methods. Therefore, the main objective of this paper is to investigate the effectiveness of combining RGB features with vegetation indices for tomato maturity classification. In addition, different machine learning models are evaluated to determine the most suitable approach for improving classification accuracy and reliability in greenhouse tomato monitoring.

3. Proposed Approach

The flowchart of the proposed approach is shown in Figure 1. It comprises three major stages: tomato detection, tomato tracking and counting, and maturity classification of tomatoes. First, sequences of RGB images are processed by an object detection algorithm to identify tomatoes. Next, a tracking algorithm assigns a distinct identity (ID) to each detected tomato across consecutive frames, and these IDs are used for tomato counting. Finally, using the image frames associated with each identified tomato, the maturity of each tomato is classified into three categories: mature, almost mature, and immature. Combining the tracked IDs with the maturity classes enables consistent maturity classification and accurate counting of tomatoes at each maturity stage.

3.1. Tomato Detection

Tomatoes in the images are detected using a detection network that integrates the YOLOv8 detector with the omni-scale network (OSNet) architecture [28]. YOLOv8 is chosen because it represents a recent generation of one-stage object detection models that provide a strong balance between detection accuracy, computational efficiency, and real-time performance. These characteristics are important for agricultural monitoring systems, where large numbers of image sequences must be processed efficiently. OSNet is incorporated as a feature extraction module within the detection framework to enhance multi-scale feature representation. The detection pipeline follows the YOLOv8 architecture for object localization, while OSNet strengthens feature extraction through its omni-scale residual blocks. This design enables the model to capture multi-scale visual patterns of tomatoes more effectively while maintaining computational efficiency.
OSNet is a convolutional neural network originally developed for person re-identification. It employs depth-wise separable convolutions: a standard 3 × 3 convolution is factorized into a 1 × 1 pointwise convolution followed by a 3 × 3 depth-wise convolution, which reduces the number of parameters and the computational cost. Within its omni-scale residual block, multiple bottleneck streams built from different numbers of these lightweight 3 × 3 layers, and hence covering different receptive-field scales, are constructed in parallel to capture multi-scale feature representations. The outputs of these streams are then fused by a unified aggregation gate, which dynamically weights and combines the information extracted from the individual streams.
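To make the efficiency argument concrete, the following sketch counts the weights of a standard 3 × 3 convolution and of the factorized form described above (1 × 1 pointwise convolution followed by a 3 × 3 depth-wise convolution). Bias terms and batch normalization are ignored, and the channel widths are arbitrary illustrative values, not OSNet's actual configuration.

```python
def standard_conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights of a dense k x k convolution from c_in to c_out channels (no bias)."""
    return k * k * c_in * c_out


def lite_conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights of a 1 x 1 pointwise conv (c_in -> c_out) followed by a
    k x k depth-wise conv that filters each of the c_out channels separately."""
    pointwise = c_in * c_out
    depthwise = k * k * c_out
    return pointwise + depthwise


if __name__ == "__main__":
    c_in, c_out = 64, 64  # illustrative channel widths only
    std = standard_conv_params(c_in, c_out)
    lite = lite_conv_params(c_in, c_out)
    print(std, lite, round(std / lite, 2))
```

For 64 input and 64 output channels this gives 36,864 versus 4,672 weights, roughly an eightfold reduction, which is the source of the computational savings.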

3.2. Tomato Tracking and Counting

We employ StrongSORT [29] for tracking individual tomatoes because it achieves strong multi-object tracking accuracy (MOTA) and identity consistency. In particular, StrongSORT maintains stable identity assignment across frames by combining motion information with appearance-based association and trajectory linking mechanisms.
The tracking algorithm assigns a distinct ID to each detected tomato across consecutive frames, and these unique IDs are subsequently used to count the total number of tomatoes. StrongSORT extends the DeepSORT framework [30] with AFLink, an appearance-independent trajectory association module that merges fragmented tracklets into longer, continuous trajectories. AFLink uses the spatiotemporal features of each pair of tracklets to estimate whether they belong to the same object, with association confidence scores produced by a multilayer perceptron (MLP).
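Because each tracked tomato keeps a single ID, counting reduces to counting distinct IDs in the tracker output. The sketch below assumes the tracker output has been flattened into (frame, track_id, maturity_label) tuples; the per-frame maturity labels and the majority vote that assigns one class per ID are illustrative assumptions, not necessarily how the original system resolves disagreements between frames. The function name `count_tomatoes` is hypothetical.

```python
from collections import Counter, defaultdict


def count_tomatoes(detections):
    """Count unique tomatoes and tomatoes per maturity class.

    detections: iterable of (frame_index, track_id, maturity_label) tuples.
    Each track ID is counted once; its maturity class is the label
    predicted most often across the frames in which it appears.
    """
    labels_per_id = defaultdict(list)
    for _frame, track_id, label in detections:
        labels_per_id[track_id].append(label)

    per_class = Counter()
    for labels in labels_per_id.values():
        majority_label, _count = Counter(labels).most_common(1)[0]
        per_class[majority_label] += 1

    return len(labels_per_id), dict(per_class)


# Toy tracker output: three tomatoes seen over three frames.
tracks = [
    (0, 1, "immature"), (1, 1, "immature"), (2, 1, "almost mature"),
    (0, 2, "mature"), (1, 2, "mature"),
    (2, 3, "almost mature"),
]
total, by_class = count_tomatoes(tracks)
print(total, by_class)
```

Tomato 1 is seen three times but counted once, which is exactly the duplicate-counting problem that identity tracking is meant to remove.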
To suppress interference from occluding branches, the tomato segmentation technique of [31] is adopted. The acquired tomato images are first converted from the RGB color space to the YUV representation, and binary segmentation is then performed on the V channel using Otsu's thresholding method [32]. This thresholding step serves mainly as preprocessing to reduce background noise and remove specular highlights before feature extraction. It aims to improve the reliability of the spectral and color features computed from the tomato regions, particularly in scenes where occluding leaves or illumination variations introduce noise. By isolating the tomato surface region, the segmentation makes the subsequent computation of RGB features and multispectral indices less sensitive to surrounding foliage and background artifacts.
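A minimal NumPy sketch of the V-channel Otsu step follows. It assumes 8-bit RGB input and the analog BT.601 YUV coefficients for V; the exact color conversion used in [31] may differ, and keeping the high-V (red-dominant) side of the split as the tomato region is an assumption made for illustration. The function names are hypothetical.

```python
import numpy as np


def rgb_to_v(rgb: np.ndarray) -> np.ndarray:
    """V (red-difference chroma) channel of YUV, analog BT.601 coefficients."""
    rgb = rgb.astype(np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return 0.615 * r - 0.51499 * g - 0.10001 * b


def otsu_threshold(values: np.ndarray, bins: int = 256) -> float:
    """Return the threshold that maximizes Otsu's between-class variance."""
    hist, edges = np.histogram(values, bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2.0
    p = hist / hist.sum()                  # bin probabilities
    w0 = np.cumsum(p)                      # weight of the lower class
    mu = np.cumsum(p * centers)            # cumulative first moment
    mu_total = mu[-1]
    w1 = 1.0 - w0                          # weight of the upper class
    valid = (w0 > 0) & (w1 > 1e-12)
    between = np.zeros_like(w0)
    between[valid] = (mu_total * w0[valid] - mu[valid]) ** 2 / (w0[valid] * w1[valid])
    k = int(np.argmax(between))
    return float(edges[k + 1])             # upper edge of the last lower-class bin


def segment_tomato(rgb: np.ndarray) -> np.ndarray:
    """Boolean mask of pixels on the high-V side of the Otsu split."""
    v = rgb_to_v(rgb)
    return v >= otsu_threshold(v)


# Toy image: left half red (fruit-like), right half green (foliage-like).
img = np.zeros((4, 4, 3), dtype=np.uint8)
img[:, :2] = (200, 30, 30)
img[:, 2:] = (40, 160, 40)
mask = segment_tomato(img)
print(mask.astype(int))
```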
Given the dense distribution of tomatoes and frequent occlusions from foliage, stems, or adjacent fruits, rectangular detection regions are further approximated as elliptical shapes. Although this approach may encompass partially occluded or neighboring tomatoes, it effectively reduces segmentation-induced pixel inaccuracies and improves the robustness of subsequent feature computation.
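The elliptical approximation of a rectangular detection can be expressed as a boolean mask of the ellipse inscribed in the bounding box. The sketch below assumes the ellipse is centered in the box with semi-axes equal to half the box height and width; the function name is hypothetical, and how the original system combines this mask with the Otsu segmentation is not specified here.

```python
import numpy as np


def ellipse_mask(height: int, width: int) -> np.ndarray:
    """Boolean mask of the ellipse inscribed in a height x width bounding box."""
    yy, xx = np.mgrid[0:height, 0:width]
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0   # center in pixel coordinates
    ry, rx = height / 2.0, width / 2.0               # semi-axes
    return ((yy - cy) / ry) ** 2 + ((xx - cx) / rx) ** 2 <= 1.0


if __name__ == "__main__":
    m = ellipse_mask(10, 10)
    print(m.astype(int))
    # The ellipse covers about pi/4 (~78.5%) of its bounding box.
    print(round(float(ellipse_mask(200, 100).mean()), 3))
```

Applying such a mask to the pixels inside a detection box discards the box corners, where background and neighboring fruit are most likely to appear.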

3.3. Maturity Classification

The maturity of tomatoes is classified into three categories: mature, almost mature, and immature. During different developmental phases, crops exhibit distinct spectral reflectance characteristics across multiple wavelength bands [20]. Multispectral imaging systems capture crop spectral responses across multiple wavelength bands. This information permits the extraction of essential growth-related attributes, including leaf coverage, chlorophyll concentration, and vegetation indices. By examining the temporal variation in these indicators, crop maturity levels can be determined. These features are generally derived from either RGB or multispectral images.
In the proposed approach, three spectral indices [21,33], namely, normalized difference vegetation index (NDVI), green normalized difference vegetation index (GNDVI), and green-red ratio index (GRRI), are adopted to assess tomato maturity. NDVI is a metric for evaluating vegetation presence and growth conditions across land surfaces. It is derived from the relationship between near-infrared and visible spectral reflectance captured by remote sensing platforms [12]. NDVI values fall within the interval of −1 to +1, where values below zero typically correspond to non-vegetative surfaces such as water bodies, exposed soil, or rocky terrain, whereas positive values signify areas covered by vegetation. Generally, larger NDVI values are associated with denser and healthier plant growth. The NDVI formulation is expressed as
$$\mathrm{NDVI} = \frac{\mathrm{NIR} - \mathrm{RED}}{\mathrm{NIR} + \mathrm{RED}},$$
where NIR and RED are near-infrared and red channels, respectively. GNDVI is an alternative spectral indicator used to evaluate plant vitality and canopy density through the analysis of green-band reflectance. GNDVI produces values between −1 and +1, where larger values correspond to more robust and healthy vegetation conditions. The GNDVI is given by
$$\mathrm{GNDVI} = \frac{\mathrm{NIR} - \mathrm{GRE}}{\mathrm{NIR} + \mathrm{GRE}},$$
where GRE is the green channel. GRRI was used in a study investigating the applicability of airborne multispectral data for evaluating ripeness levels in coffee plantations [13]. Because multispectral sensors capture the distinct spectral responses of surface targets across multiple wavelength bands, that study collected a series of aerial multispectral images across different developmental stages of coffee crops and applied image processing and analysis procedures to the acquired data. The resulting GRRI metric serves as an indicator of crop ripeness across the plantation. The GRRI is defined by
$$\mathrm{GRRI} = \frac{\mathrm{RED}}{\mathrm{GRE}}.$$
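The three indices can be sketched in NumPy as follows, evaluated per pixel on co-registered band images and then averaged over a region mask such as the segmented tomato pixels. The small `eps` term, which guards against division by zero on dark pixels, and the per-pixel-then-average order are assumptions; averaging the bands first and then applying the formulas gives slightly different values. The function name is hypothetical.

```python
import numpy as np


def spectral_indices(nir, red, green, mask=None, eps=1e-9):
    """Mean NDVI, GNDVI and GRRI over the pixels selected by a boolean mask."""
    nir, red, green = (np.asarray(a, dtype=np.float64) for a in (nir, red, green))
    ndvi = (nir - red) / (nir + red + eps)       # (NIR - RED) / (NIR + RED)
    gndvi = (nir - green) / (nir + green + eps)  # (NIR - GRE) / (NIR + GRE)
    grri = red / (green + eps)                   # RED / GRE
    if mask is None:
        mask = np.ones(nir.shape, dtype=bool)
    return {
        "NDVI": float(ndvi[mask].mean()),
        "GNDVI": float(gndvi[mask].mean()),
        "GRRI": float(grri[mask].mean()),
    }


if __name__ == "__main__":
    nir = np.full((2, 2), 0.6)
    red = np.full((2, 2), 0.2)
    green = np.full((2, 2), 0.3)
    print(spectral_indices(nir, red, green))
```

With reflectances NIR = 0.6, RED = 0.2, and GRE = 0.3, the indices are NDVI = 0.4/0.8 = 0.5, GNDVI = 0.3/0.9 ≈ 0.333, and GRRI = 0.2/0.3 ≈ 0.667.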
In the proposed approach, multiple vegetation indices, including NDVI, GNDVI, and GRRI, are first transformed using PCA. PCA is chosen because it can reduce feature redundancy and capture the most informative components. Then, the proposed approach adopts a feature-level fusion strategy. RGB color features (R, G, and B) and multispectral vegetation indices (NDVI, GNDVI, and GRRI) are computed independently from the segmented tomato regions. These features are combined into a unified feature vector through feature concatenation, which is subsequently used as the input to the classification models.
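The PCA transform and feature-level concatenation can be sketched with NumPy's SVD. The number of retained components is not stated in the text, so `n_components` below is a hypothetical parameter. The sketch separates fitting from transforming so that, in a train/test setting, the mean and principal axes can be estimated on training samples only and then reused for test samples.

```python
import numpy as np


def pca_fit(x: np.ndarray, n_components: int):
    """Estimate the mean and the first n_components principal axes of x."""
    mean = x.mean(axis=0)
    # Rows of vt are principal axes, ordered by decreasing singular value.
    _u, _s, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:n_components]


def pca_transform(x: np.ndarray, mean: np.ndarray, axes: np.ndarray) -> np.ndarray:
    """Project centered samples onto the retained principal axes."""
    return (x - mean) @ axes.T


def fuse_features(rgb_means, indices, mean, axes):
    """Concatenate RGB features with PCA-transformed vegetation indices."""
    return np.hstack([rgb_means, pca_transform(indices, mean, axes)])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rgb = rng.random((50, 3))        # per-tomato mean R, G, B
    idx = rng.random((50, 3))        # per-tomato NDVI, GNDVI, GRRI
    mean, axes = pca_fit(idx, n_components=2)
    fused = fuse_features(rgb, idx, mean, axes)
    print(fused.shape)
```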
Tomato maturity classification is performed using three machine learning models, namely SVMs, KNNs, and ANNs. The performance of these classifiers may vary depending on parameter configurations, affecting aspects such as classification accuracy, robustness, and generalization capability. Moreover, the way in which the dataset is divided into training and testing subsets can further influence the results. After determining the optimal model parameters, the trained classifiers are applied to the test dataset to produce the final classification results.

4. Dataset

4.1. Data Collection

For experimental data acquisition, a Parrot Sequoia+ (Parrot S.A., Paris, France) imaging system was employed. This platform is widely adopted in precision agriculture and integrates both a multispectral imaging unit and an ambient light sensor. It comprises a 16-MP RGB camera (OmniVision Technologies, Inc., Santa Clara, United States) with a spatial resolution of 4608 × 3456 pixels for standard color imagery, and four single-band multispectral cameras (green, red, red-edge, and near-infrared), each with a resolution of 1280 × 960 pixels. The integrated sunshine sensor monitors incident solar radiation, enabling radiometric calibration of the multispectral data. This calibration compensates for fluctuations in illumination, ensuring consistent and reliable image outputs under varying sunlight intensities.
The datasets were collected during three acquisition periods in two greenhouses, namely Farm Nineteen and Lin Family Farm, at a shooting distance of about 40 cm from the tomato trellis. One greenhouse was imaged twice, in two different seasons, and the other was imaged once, giving three datasets in total, collected sequentially. The 19th_tomato dataset contained images captured between September and October, the lin_tomato dataset contained images captured between November and January, and the final dataset, 19th2_tomato, contained images captured between March and April. These datasets therefore represent different observation periods of the tomato cultivation cycle.
In addition to the acquisition periods, the datasets differed in their sampling frequency. For the first two datasets (19th_tomato and lin_tomato), images were captured once every seven days. In contrast, the third dataset (19th2_tomato) was collected at a higher temporal frequency, with images captured twice every seven days. For each tomato greenhouse, three rows of tomato plants were selected for image acquisition. Among these rows, two were used to construct the training dataset, while the remaining row was reserved for testing. This spatial separation ensured that the training and testing data originated from different plant rows, reducing the risk of information leakage and providing a more reliable evaluation of the proposed method. The 19th_tomato dataset contained three training image sequences and two testing image sequences. The lin_tomato dataset contained five training image sequences and five testing image sequences. Similarly, the 19th2_tomato dataset contained five training image sequences and five testing image sequences. Examples of RGB and multispectral images are shown in Figure 2.
Although the RGB and multispectral images were captured simultaneously, slight spatial discrepancies exist between the individual spectral images, because each band is recorded by a separate camera. Therefore, we calibrated and aligned the multispectral images. Figure 3 shows an example RGB image and the corresponding calibrated and aligned multispectral images used to compute the three spectral indices.
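The article does not state which registration method was used. As one common option, the sketch below estimates an integer translation between two bands with phase correlation, assuming the misalignment is a pure shift; real band offsets may also include small rotation, scale, and lens-distortion components, which this sketch does not model. The function name `estimate_shift` is hypothetical.

```python
import numpy as np


def estimate_shift(ref: np.ndarray, mov: np.ndarray) -> tuple:
    """Integer (dy, dx) such that np.roll(mov, (dy, dx), axis=(0, 1)) ~ ref.

    Phase correlation: for two images that differ by a circular shift,
    the normalized cross-power spectrum is a pure phase ramp whose
    inverse FFT peaks at the corresponding offset.
    """
    cross = np.fft.fft2(ref) * np.conj(np.fft.fft2(mov))
    cross /= np.abs(cross) + 1e-12            # keep phase information only
    corr = np.fft.ifft2(cross).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    h, w = ref.shape
    # Peaks in the upper half of the index range are negative shifts.
    if dy > h // 2:
        dy -= h
    if dx > w // 2:
        dx -= w
    return int(dy), int(dx)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    green = rng.random((64, 64))
    nir = np.roll(green, (3, -5), axis=(0, 1))  # simulated misaligned band
    shift = estimate_shift(green, nir)
    aligned = np.roll(nir, shift, axis=(0, 1))
    print(shift, np.allclose(aligned, green))
```

In practice, sub-pixel refinement and cropping of wrapped borders would be needed, because real bands are not circularly shifted copies of each other.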

4.2. Maturity Labeling

The acquired tomato image sequences were converted into the IRIS dataset structure [34]. Within this format, frame-by-frame information for each tomato was stored, including unique IDs, RGB images, multispectral images, and corresponding maturity labels. The multispectral features additionally comprised NDVI, GNDVI, and GRRI values. Apart from the tomato ID and maturity labels, all remaining attributes were computed by averaging pixel intensities over the unmasked regions inside each tomato’s bounding box. Each dataset contained the complete image sequences, the converted IRIS-formatted data, test sequences organized according to the MOT dataset specification, and the trained model parameters obtained from SVM, KNN, and ANN classifiers.
The maturity categories were determined based on the expected time remaining before harvest and the observable color characteristics of the tomatoes. During data collection, tomato plants were monitored weekly, and the maturity label for each tomato was assigned by examining image sequences captured at the same location across consecutive weeks. When a tomato turned red and was harvested in a subsequent week, earlier observations of that tomato were retrospectively labeled according to the number of weeks remaining before harvest. Tomatoes that were ready for harvest were labeled as mature, those expected to reach harvest within approximately one week were labeled as almost mature, and those requiring two or more additional weeks before harvest were labeled as immature. In addition to the temporal criterion, visual characteristics, particularly skin color transition from green to red, were used as practical indicators to support the labeling process. These criteria reflected typical physiological ripening stages of tomatoes observed during cultivation.
Specifically, when a tomato appears green and remains unharvested in week n, turns red in week n + 1, and is harvested in week n + 2, it is assigned a label of −2 for week n (i.e., n − (n + 2)) and −1 for week n + 1 (i.e., (n + 1) − (n + 2)). Tomatoes that are three or more weeks from harvest exhibit negligible variation in appearance, so all such samples are uniformly assigned a label of −3. In summary, labels −1, −2, and −3 represent tomatoes that are harvest-ready, one week away from harvest, and two or more weeks away from harvest, respectively.
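The retrospective labeling rule reduces to a difference of week indices, clipped at −3. The minimal sketch below makes two assumptions. It uses integer week indices, which matches weekly sampling; the twice-weekly schedule in Section 5.2 would need a finer time unit. It also assumes a tomato is not labeled in its harvest week. The function names are hypothetical.

```python
def maturity_label(obs_week, harvest_week):
    """Weeks-to-harvest label: obs_week - harvest_week, clipped at -3.

    Every tomato three or more weeks from harvest shares the label -3.
    """
    if obs_week >= harvest_week:
        # Assumption: tomatoes are not labeled in (or after) their harvest week.
        raise ValueError("observation must precede the harvest week")
    return max(obs_week - harvest_week, -3)


def label_history(obs_weeks, harvest_week):
    """Retrospectively label every earlier observation of one tomato."""
    return {week: maturity_label(week, harvest_week) for week in obs_weeks}
```

For a tomato harvested in week 7 and observed in weeks 3 to 6, `label_history([3, 4, 5, 6], 7)` yields −3, −3, −2, and −1.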

5. Results

All experiments were conducted on a workstation with GPU acceleration. The proposed approach was implemented in Python 3.10.0 using the PyTorch 2.0.1 deep learning framework, with CUDA 11.8 providing parallel processing for training and inference. The YOLOv8 detector was trained using the Ultralytics framework with the following settings: an initial learning rate of 0.01, the stochastic gradient descent (SGD) optimizer with a momentum of 0.937 and a weight decay of 0.001, 200 training epochs, and a batch size of 8.
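For reproducibility, the reported detector settings can be expressed as Ultralytics training arguments. This is a sketch rather than the authors' script. The checkpoint `yolov8n.pt` and the dataset file `tomato.yaml` are hypothetical placeholders, since the YOLOv8 model size and data configuration are not stated. Ultralytics is imported inside the function, so the settings can be inspected without the package installed.

```python
# Hyperparameters reported in Section 5, as keyword arguments to Ultralytics' train().
TRAIN_ARGS = {
    "epochs": 200,
    "batch": 8,
    "optimizer": "SGD",
    "lr0": 0.01,           # initial learning rate
    "momentum": 0.937,
    "weight_decay": 0.001,
}


def train_detector(data_yaml="tomato.yaml", weights="yolov8n.pt"):
    """Train a single-class tomato detector (file names and model size are hypothetical)."""
    from ultralytics import YOLO  # requires `pip install ultralytics`

    model = YOLO(weights)
    return model.train(data=data_yaml, **TRAIN_ARGS)
```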

5.1. Maturity Classification Results

We used SVM, KNN, and ANN classifiers to perform tomato maturity classification. For the SVM classifier, the radial basis function (RBF) kernel mapped the original input features into a higher-dimensional feature space, allowing the classifier to construct nonlinear decision boundaries that better separated complex data distributions. The penalty parameter C controlled the trade-off between maximizing the margin and minimizing classification errors; C was searched over the range 10 to 40 with a step size of 2 to identify a suitable regularization strength. The gamma parameter of the RBF kernel was set using the default scaling strategy based on the number of features. For the KNN classifier, the neighborhood size K was varied from 2 to 30 with a step size of 2 to analyze the influence of local neighborhood size on classification performance. This range was selected to balance sensitivity to local patterns against robustness to noise in the feature space. For the ANN classifier, we used a feedforward multilayer perceptron consisting of an input layer corresponding to the selected feature set, one hidden layer, and an output layer for three-class maturity classification. The hidden layer used nonlinear activation functions, and the network was trained using a standard optimization algorithm with a fixed number of training epochs and batch size. The combination of 300 epochs and a batch size of 40 was selected to ensure model convergence while maintaining stable training. In addition, all classifiers were evaluated using 10-fold cross-validation to ensure consistent performance estimation across different data partitions. The parameter configurations and kernel selections for the three learning algorithms are summarized in Table 1. The dimensionality of the feature space equals the number of indicators used in the classification process; for instance, when NDVI and GNDVI were selected as input features, the feature space was two-dimensional.
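The K search with 10-fold cross-validation can be illustrated with a dependency-free sketch. It is a simplified stand-in, not the authors' implementation. Euclidean distance, seeded random fold assignment, and nearest-first tie-breaking are assumptions, and all function names are hypothetical.

```python
import math
import random
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Majority label among the k training samples closest to x (Euclidean).

    Neighbors are counted nearest-first and Counter keeps first-seen order, so a
    tie (possible for even k) goes to the tied label whose closest member is nearest.
    """
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def kfold_indices(n, folds=10, seed=0):
    """Split range(n) into `folds` disjoint, shuffled index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::folds] for f in range(folds)]


def cv_accuracy(X, y, k, folds=10):
    """Fraction of samples classified correctly when each fold is held out once."""
    correct = 0
    for test_idx in kfold_indices(len(X), folds):
        held_out = set(test_idx)
        train = [i for i in range(len(X)) if i not in held_out]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        correct += sum(knn_predict(train_X, train_y, X[i], k) == y[i] for i in test_idx)
    return correct / len(X)


def select_k(X, y, k_values=range(2, 31, 2)):
    """Return the K in 2..30 (step 2, as in Table 1) with the best CV accuracy."""
    return max(k_values, key=lambda k: cv_accuracy(X, y, k))
```

With `X` as a list of feature tuples (for example, (NDVI, GNDVI)) and `y` as the labels −1, −2, and −3, `select_k(X, y)` returns the chosen K. For the SVM, scikit-learn's `GridSearchCV(SVC(kernel="rbf", gamma="scale"), {"C": list(range(10, 41, 2))}, cv=10)` performs the equivalent search over C.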
Using the proposed maturity labeling strategy in conjunction with the selected classification models, tomato maturity levels were determined for the test image sequences. Classification performance was evaluated by computing accuracy as the ratio of correctly predicted tomato bounding boxes, after applying a voting scheme across frames, to the total number of bounding boxes in all frames. At this stage, tracking information was excluded from the evaluation, and all bounding boxes in both the training and test datasets were derived from ground-truth labels. The datasets were divided into training and testing subsets based on predefined sequence splits. Splitting at the sequence level maintained temporal separation between the subsets and helped prevent information leakage: each subset contained independent image sequences, so the training and testing data did not share any frames. The classification models were trained on the training subset and evaluated on the testing subset. The number of annotated tomatoes in the training and testing splits of the three datasets is summarized in Table 2.
Within image sequences, a single tomato may be captured in multiple frames, often under varying conditions such as partial occlusion by leaves or slight shifts in detection location. These factors can cause inconsistent classification results across frames. To address this issue, a voting strategy inspired by ensemble learning was adopted: predictions from all frames containing a tomato were aggregated, and the class receiving the most votes was selected as that tomato's final maturity label. The classification results are summarized in Table 3, where accuracy was calculated as the proportion of correctly classified bounding boxes (after voting) relative to the total number of bounding boxes across all frames. The proposed method achieved a maximum maturity classification accuracy of 81% (KNN on the 19th2_tomato dataset) and demonstrated consistent fruit counting performance across the three datasets.
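The frame-level voting and the box-level accuracy used for Table 3 can be sketched as follows. The sketch assumes each bounding box carries its tomato's ground-truth ID. Breaking ties by the first-seen label is an assumption, because the paper does not state how ties are resolved. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def vote_labels(detections):
    """Majority maturity label per tomato.

    detections: list of (tomato_id, predicted_label) pairs, one per bounding box.
    Ties go to the label that appeared first in the sequence (assumption).
    """
    per_tomato = defaultdict(list)
    for tomato_id, label in detections:
        per_tomato[tomato_id].append(label)
    return {tid: Counter(labels).most_common(1)[0][0] for tid, labels in per_tomato.items()}


def box_level_accuracy(detections, true_labels):
    """Share of bounding boxes whose tomato's voted label matches the ground truth."""
    voted = vote_labels(detections)
    correct = sum(voted[tid] == true_labels[tid] for tid, _ in detections)
    return correct / len(detections)
```

For detections `[(1, -1), (1, -1), (1, -2), (2, -3), (2, -2), (2, -3), (2, -3)]` with true labels `{1: -1, 2: -2}`, tomato 1 is correct in all three of its boxes and tomato 2 is wrong in all four, giving an accuracy of 3/7. After voting, every box of a tomato shares the same outcome.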
The results indicated that RGB-based methods outperformed multispectral approaches on the lin_tomato and 19th2_tomato datasets, but not on the 19th_tomato dataset. The relatively low classification accuracy obtained for the 19th_tomato dataset with the combined RGB and vegetation index features may be related to the acquisition period of the dataset. The images in this dataset were captured earlier in the growing season, when many tomatoes were still in the early stages of development. At this stage, the visual and spectral differences between maturity categories were less pronounced, as the characteristic color transition from green to red had not yet fully developed. As a result, both RGB features and vegetation indices may exhibit reduced discriminative capability, leading to overlapping feature distributions among the maturity classes. Consequently, combining RGB and spectral indices did not yield a clear improvement in classification accuracy for this dataset, unlike the datasets captured at later growth stages.
To further evaluate the influence of different feature configurations, the average classification accuracy across all datasets and classifiers in Table 3 was calculated for each feature set. The combined RGB and vegetation index features achieved the highest average accuracy (66.8%). In comparison, using only vegetation indices (NDVI, GNDVI, and GRRI) resulted in an average accuracy of 64.9%, while using only RGB features achieved 63.8%. Integrating RGB information with spectral vegetation indices therefore provided a more informative feature representation for tomato maturity classification. To investigate why RGB features nonetheless outperformed the vegetation indices on two of the three datasets, the relative influence of individual features was examined, as depicted in Figure 4. The analysis revealed that the green channel consistently contributed the most to accurate maturity discrimination, whereas the impact of the multispectral indices and of the red and blue channels varied across datasets. These findings suggest that the strong performance of RGB features can largely be attributed to the discriminative capability of the green channel in maturity assessment.
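The averages quoted above follow directly from Table 3, which gives nine accuracies per feature set (three datasets × three classifiers). The short check below reproduces them.

```python
# Accuracies (%) from Table 3, listed in the table's row order:
# 19th_tomato (SVM, KNN, ANN), lin_tomato (SVM, KNN, ANN), 19th2_tomato (SVM, KNN, ANN).
TABLE3 = {
    "R, G, B": [63, 64, 62, 64, 58, 38, 72, 81, 72],
    "NDVI, GNDVI, GRRI": [70, 66, 73, 55, 58, 75, 70, 59, 58],
    "R, G, B, NDVI, GNDVI, GRRI": [63, 64, 73, 65, 58, 46, 78, 81, 73],
}

averages = {features: round(sum(acc) / len(acc), 1) for features, acc in TABLE3.items()}
for features, avg in averages.items():
    print(f"{features}: {avg}%")
# R, G, B: 63.8%
# NDVI, GNDVI, GRRI: 64.9%
# R, G, B, NDVI, GNDVI, GRRI: 66.8%
```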
To further validate the preceding observations, classification accuracy was evaluated by comparing the use of complete RGB features against configurations that included only the red and blue channels. As shown in Table 4, the results demonstrated that, aside from the 19th_tomato dataset, incorporating all RGB components led to markedly improved performance relative to excluding the green channel. In addition, omitting the green component resulted in accuracy levels that fell below those achieved using multispectral data. For the 19th_tomato dataset, however, the contribution of green light to classification was less dominant than in the other two cases. As illustrated in Figure 4, although the green channel remained the most influential feature, its margin over the second most significant indicator was considerably narrower. This reduced disparity may explain why multispectral approaches produced comparatively better results for the 19th_tomato dataset than RGB-based methods.

5.2. Change in Maturity Annotation Interval

To provide a more comprehensive evaluation of the classification performance, precision, recall, and F1-score were also calculated for the three tomato datasets, as shown in Table 5. Precision, recall, and F1-score are macro-averaged across the three maturity categories. For the 19th_tomato dataset, the overall accuracy was approximately 66.7%, with noticeable confusion between the immature and almost mature categories, indicating that the visual characteristics of these maturity stages are relatively similar in this dataset. For the lin_tomato dataset, classification performance improved substantially, with an accuracy of approximately 79.8% and higher precision and recall across most categories, suggesting that the maturity stages in this dataset exhibited clearer visual and spectral distinctions. For the 19th2_tomato dataset, the model achieved an accuracy of approximately 72.2%; although the mature category was classified with high precision, some confusion remained between the immature and almost mature categories. These results demonstrate that classification performance varies across datasets owing to differences in image acquisition conditions, tomato growth stages, and visual variability in the collected data.
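Macro averaging computes each metric per class and then takes the unweighted mean over the three maturity classes. The sketch below derives these values from a confusion matrix. Averaging the per-class F1 scores, rather than taking the F1 of the macro precision and recall, is an assumption. It is, however, consistent with Table 5, where 19th2_tomato has an F1-score (0.74) below both its precision and recall (0.76). The function name and the example matrix are hypothetical.

```python
def macro_metrics(cm):
    """Macro precision, recall, F1, and overall accuracy from a square confusion matrix.

    cm[i][j] counts samples of true class i predicted as class j, with the classes
    in a fixed order (e.g. -3, -2, -1).
    """
    n = len(cm)
    precisions, recalls, f1s = [], [], []
    for c in range(n):
        tp = cm[c][c]
        predicted = sum(cm[r][c] for r in range(n))  # column sum
        actual = sum(cm[c])                          # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    accuracy = sum(cm[i][i] for i in range(n)) / sum(map(sum, cm))
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n, accuracy
```

For a made-up matrix `[[8, 2, 0], [4, 6, 0], [0, 1, 9]]`, which mimics confusion between the first two classes, this gives a macro precision of about 0.778, a macro recall of about 0.767, a macro F1 of about 0.769, and an accuracy of about 0.767.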
Moreover, the classification behavior of different maturity categories was further analyzed by inspecting the confusion matrices presented in Figure 5. The results indicated relatively high discriminative accuracy between the −1 (mature) and −2 (almost mature) categories, whereas substantial misclassification occurred between the −2 (almost mature) and −3 (immature) categories. This confusion can be attributed to the close visual resemblance in skin coloration between almost mature and immature tomatoes, which limits the effectiveness of color-based features for reliable separation.
Initially, image acquisition was conducted on a weekly basis due to the large scale of the tomato cultivation area. Based on observations from the first two datasets, tomato coloration remained largely unchanged until approximately one and a half weeks prior to full maturity. In contrast, a rapid and pronounced color transition occurred during the final one and a half weeks before maturity. Consequently, the sampling frequency for the third dataset was increased to twice per week rather than maintaining the original weekly schedule.

5.3. Fruit Counting Results

The tomato counting procedure consisted of two main stages: detection and tracking. In the detection stage, YOLOv8 was trained on RGB images, where tomato was defined as the only target class. The trained YOLOv8 model was then integrated into the tracking framework. For multi-object tracking, a pre-trained StrongSORT model was employed. YOLOv8 was responsible for generating bounding box detections, while StrongSORT assigned unique IDs to each detected tomato across frames. Finally, the allocated IDs were used to derive the number of tomatoes.
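Each track ID corresponds to one tomato, so the count reduces to the number of distinct IDs in the tracker output. The minimal sketch below assumes the tracker writes MOT-style text rows (`frame, id, bb_left, bb_top, bb_width, bb_height, ...`), the format used for the test sequences in Section 4.2. The `min_frames` filter, which discards very short and likely spurious tracks, is a hypothetical addition that the paper does not describe.

```python
from collections import Counter


def count_tomatoes(mot_lines, min_frames=1):
    """Number of distinct track IDs in MOT-format rows.

    Each row is 'frame,id,bb_left,bb_top,bb_width,bb_height,...'; one row per ID per frame.
    min_frames (hypothetical) drops IDs seen in fewer frames than this.
    """
    frames_per_id = Counter()
    for line in mot_lines:
        line = line.strip()
        if not line:
            continue
        track_id = int(float(line.split(",")[1]))
        frames_per_id[track_id] += 1
    return sum(1 for n in frames_per_id.values() if n >= min_frames)
```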
For maturity-based fruit number estimation, tomato tracking was integrated with the maturity classification to estimate both the fruit number and distribution of tomatoes across different maturity stages. The resulting counts for each maturity category in all test image sequences were summarized in Table 6. Figure 6 presents visual examples of the combined detection and classification results, where tomatoes classified as mature, almost mature, and immature were denoted by red, orange, and yellow bounding boxes, respectively.
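The per-category counts, total difference, and RMSE in Table 6 can be reproduced as shown below. The RMSE formula, computed over the three maturity categories of a sequence, is inferred from the table rather than stated in the text. It matches the reported values: for 19th_tomato(6th), the per-category errors 6, −5, and −2 give √(65/3) ≈ 4.65. The function names are hypothetical.

```python
import math
from collections import Counter

CATEGORIES = (-3, -2, -1)  # immature, almost mature, mature


def count_by_maturity(voted_labels):
    """Tomatoes per maturity category from {track_id: voted label}."""
    counts = Counter(voted_labels.values())
    return [counts[c] for c in CATEGORIES]


def count_errors(truth, pred):
    """Total-count difference (prediction - ground truth) and per-category RMSE."""
    diff = sum(pred) - sum(truth)
    rmse = math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))
    return diff, rmse
```

Applied to the 19th_tomato(6th) row, `count_errors([0, 16, 19], [6, 11, 17])` returns a difference of −1 and an RMSE of about 4.65, matching Table 6.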

6. Discussion

The comparative analysis between RGB-based and multispectral features indicated that RGB information, particularly the green channel, played a dominant role in tomato maturity classification. In two of the three datasets, RGB features outperformed multispectral vegetation indices. Hence, visible color changes associated with tomato maturity provided highly discriminative cues for maturity classification. Although multispectral indices such as NDVI, GNDVI, and GRRI were designed to capture physiological and chlorophyll-related variations, their advantage may be reduced when strong color transitions are already present in the visible spectrum. In practical agricultural environments, low-cost RGB imaging systems may be adequate for many maturity assessment tasks, thereby potentially lowering deployment costs.
To perform tomato detection, the YOLOv8 detector [9,10,35,36] was adopted in the proposed approach. YOLO-based detectors are widely used in object detection owing to their real-time processing capability, high computational efficiency, and competitive detection accuracy, which makes them suitable for agricultural applications involving large-scale image sequences. Moreover, YOLO-based detectors have proven effective for fruit detection under complex environmental conditions, particularly with varying illumination and occlusion. Compared with two-stage detectors, YOLOv8, as a one-stage detection framework, provides a favorable balance between accuracy and inference speed, which is beneficial for integration with tracking algorithms in sequential image analysis.
The introduction of a voting mechanism across multiple frames substantially stabilized maturity predictions. Because individual tomatoes appeared in multiple frames with varying occlusions and viewpoints, frame-level predictions could fluctuate. Aggregating predictions through majority voting reduced the impact of transient errors and produced more consistent tomato-level classifications. This strategy demonstrated the value of integrating tracking information with classification: by associating observations over time, the proposed approach exploited temporal redundancy to enhance robustness.
The maturity-based fruit number estimation experiments revealed that tracking accuracy and detection reliability directly influence counting performance. In sequences with heavy occlusion or dense foliage, the predicted tomato count deviated more substantially from ground truth. These errors propagated to fruit number estimation, especially in the lin_tomato dataset, where large discrepancies were observed. Although StrongSORT provided strong multi-object tracking performance, agricultural environments presented unique challenges, including irregular fruit motion, partial occlusion by leaves, and similar appearance among neighboring tomatoes. Enhancing detection robustness through improved segmentation or incorporating depth information could reduce counting errors.
The change in sampling frequency from weekly to twice per week also offered valuable insights. Higher temporal resolution captured rapid color transitions occurring shortly before harvest. Increased sampling frequency improved the system’s ability to detect maturity progression and reduced uncertainty in classification. From an operational standpoint, a trade-off existed between data acquisition cost and prediction accuracy. Optimizing sampling schedules based on crop growth dynamics could improve efficiency in real-world deployment.
The imaging system included an integrated sunshine sensor that recorded incident solar radiation and enabled radiometric calibration of the captured multispectral images. This calibration compensated for variations in illumination intensity during image acquisition. It thereby improved the consistency of the spectral measurements used to compute vegetation indices. In addition, the proposed approach incorporated several mechanisms that helped mitigate illumination-related variability. The preprocessing step based on Otsu thresholding in the YUV color space helped reduce the influence of background regions and specular highlights, while the use of multispectral indices provided features that were generally more robust to illumination changes than raw RGB values. Furthermore, the classification results were aggregated using a multi-frame voting strategy, which helped stabilize predictions across frames captured under slightly different lighting conditions.
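Otsu's method selects the threshold that maximizes the between-class variance of the intensity histogram. The sketch below implements it in pure Python for one 8-bit channel. This section does not restate which YUV channel the preprocessing thresholds, so the sketch is channel-agnostic. The function name and the toy inputs are hypothetical.

```python
def otsu_threshold(values):
    """Otsu threshold for 8-bit intensities (ints in 0..255).

    Returns t such that pixels <= t form one class and pixels > t the other.
    Maximizing w0 * w1 * (m0 - m1)**2 is equivalent to maximizing the
    between-class variance (they differ only by the constant 1 / total**2).
    """
    hist = [0] * 256
    for v in values:
        hist[v] += 1
    total = len(values)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_score = 0, -1.0
    w0, sum0 = 0, 0
    for t in range(256):
        w0 += hist[t]            # pixels at or below t
        sum0 += t * hist[t]
        w1 = total - w0          # pixels above t
        if w0 == 0:
            continue
        if w1 == 0:
            break
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        score = w0 * w1 * (m0 - m1) ** 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t
```

A foreground mask is then `[v > t for v in channel]`, or the complement, depending on which side of the threshold the fruit falls in the chosen channel.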
The elliptical approximation was introduced to better represent the natural shape of tomatoes and to reduce the influence of background pixels within the rectangular bounding box region. In dense agricultural scenes, bounding boxes often included portions of leaves, stems, or neighboring fruits due to occlusion and irregular object boundaries. By approximating the tomato region with an elliptical shape inside the detected bounding box, the feature computation can focus more closely on the actual fruit surface. This reduced the noise from surrounding pixels.
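The elliptical approximation can be realized as a mask of the ellipse inscribed in the bounding box. The minimal sketch below assumes an axis-aligned ellipse whose axes equal the box width and height, with each pixel tested at its center. The function name is hypothetical. The resulting mask selects the pixels over which color and spectral features would be averaged.

```python
def ellipse_mask(width, height):
    """Row-major boolean mask of the ellipse inscribed in a width x height box.

    A pixel (x, y) is kept when its center satisfies
    ((x + 0.5 - a) / a)**2 + ((y + 0.5 - b) / b)**2 <= 1, with a = width / 2, b = height / 2.
    """
    a, b = width / 2, height / 2
    return [
        [((x + 0.5 - a) / a) ** 2 + ((y + 0.5 - b) / b) ** 2 <= 1.0 for x in range(width)]
        for y in range(height)
    ]
```

The kept fraction approaches π/4 ≈ 0.785 of the box area. This means roughly one fifth of each bounding box, mostly the corners where leaves and background tend to appear, is excluded from feature computation.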
In comparison with previous methods, the proposed approach achieved comparable detection and classification performance while integrating multiple analysis tasks. Unlike many previous methods that focused on fruit detection or maturity classification independently, the proposed system combined tomato detection, multi-object tracking, and maturity classification using both RGB and multispectral images. Moreover, in contrast to previous tomato maturity classification approaches that relied solely on RGB image features, the proposed approach incorporated multispectral information through vegetation indices, allowing the model to capture both the visual appearance and the spectral characteristics associated with tomato development.
The limitations of the proposed approach were as follows: First, the dataset was collected from a limited number of greenhouse environments, and variations in environmental conditions such as lighting intensity, shadows, and occlusions may affect the robustness of the detection and classification results. Second, the proposed method estimated the number of tomatoes rather than the actual yield in terms of mass or volume per unit area, which may limit its direct applicability for precise yield prediction in agricultural management. Finally, the maturity classification relied mainly on visual and spectral features derived from RGB and multispectral images. In cases where tomatoes were heavily occluded by leaves or other fruits, the detection and classification accuracy may decrease.

7. Conclusions

We have presented an integrated approach for tomato maturity classification and fruit number estimation based on both RGB and multispectral images. The proposed approach is composed of three main stages: tomato detection, tomato tracking and counting, and maturity classification. The combined RGB and vegetation index features achieved the highest average accuracy (66.8%), providing a more informative feature representation for tomato maturity classification. Moreover, the proposed method achieved a maximum maturity classification accuracy of 81% and demonstrated consistent fruit counting performance across the three datasets. The maturity-based fruit number estimation can provide practical guidance for optimizing harvesting strategies and labor allocation.
Future research directions include expanding the datasets to include more diverse environmental conditions and different cultivation stages to improve model generalization. In addition, integrating more advanced deep learning architectures and temporal modeling techniques may further enhance detection robustness and counting accuracy. Another promising direction is to combine fruit counting with weight estimation models to provide a more comprehensive indicator of potential tomato production in greenhouse environments.

Author Contributions

Methodology, C.-A.P. and H.-Y.L.; Supervision, H.-Y.L. and C.-C.C.; Writing—original draft, C.-A.P. and H.-Y.L.; Writing—review & editing, H.-Y.L. and C.-C.C. All authors have read and agreed to the published version of the manuscript.

Funding

The authors would like to thank the National Science and Technology Council of Taiwan for financially supporting this research under contract no. NSTC 114-2221-E-239-024-.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available upon request from the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, C.; Guo, C.; Liu, F.; Kong, W.; He, Y.; Lou, B. Hyperspectral imaging analysis for ripeness evaluation of strawberry with support vector machine. J. Food Eng. 2016, 179, 11–18.
  2. Zhao, C.; Li, X. A multispectral image based object detection approach in natural scene. In Proceedings of the 2021 International Conference on Intelligent Computing and Signal Processing (ICSP), Xi’an, China, 9–11 April 2021; IEEE: New York, NY, USA, 2021; pp. 566–569.
  3. Ashtiani, S.H.M.; Javanmardi, S.; Jahanbanifard, M.; Martynenko, A.; Verbeek, F.J. Detection of mulberry maturity stages using deep learning models. IEEE Access 2021, 9, 100380–100394.
  4. Yang, M.H.; Nam, W.H.; Kim, T.; Lee, K.; Kim, Y. Machine learning application for predicting the strawberry harvesting time. Korean J. Agric. Sci. 2019, 46, 381–393.
  5. Liu, T.H.; Ehsani, R.; Toudeshki, A.; Zou, X.J.; Wang, H.J. Detection of citrus fruit and tree trunks in natural environments using a multi-elliptical boundary model. Comput. Ind. 2018, 99, 9–16.
  6. Tanigaki, K.; Fujiura, T.; Akase, A.; Imagawa, J. Cherry-harvesting robot. Comput. Electron. Agric. 2008, 63, 65–72.
  7. Niedbała, G.; Kurek, J.; Świderski, B.; Wojciechowski, T.; Antoniuk, I.; Bobran, K. Prediction of blueberry (Vaccinium corymbosum L.) yield based on artificial intelligence methods. Agriculture 2022, 12, 2089.
  8. DeLong, J.; Prange, R.; Harrison, P.; Nichols, D.; Wright, H. Determination of optimal harvest boundaries for honeycrisp™ fruit using a new chlorophyll meter. Can. J. Plant Sci. 2014, 94, 361–369.
  9. Li, J.; Wang, P.; Zhao, Y. Application of YOLOv8 for crop detection in precision agriculture. Sensors 2023, 23, 8095.
  10. Wang, T.; Zhang, L.; Sun, Q. Improved YOLOv8-based tomato detection method in greenhouse environments. Agriculture 2024, 14, 287.
  11. Chen, I.T.; Lin, H.Y. Detection, counting and maturity assessment of cherry tomatoes using multi-spectral images and machine learning techniques. In Proceedings of the 15th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2020), Valletta, Malta, 27–29 February 2020; SciTePress, Science and Technology Publications: Setúbal, Portugal; pp. 759–766.
  12. Afonso, M.; Fonteijn, H.; Fiorentin, F.S.; Lensink, D.; Mooij, M.; Faber, N.; Polder, G.; Wehrens, R. Tomato fruit detection and counting in greenhouses using deep learning. Front. Plant Sci. 2020, 11, 571299.
  13. Vasconez, J.; Delpiano, J.; Vougioukas, S.; Auat Cheein, F. Comparison of convolutional neural networks in fruit detection and counting: A comprehensive evaluation. Comput. Electron. Agric. 2020, 173, 105348.
  14. Liu, C.; Ding, W.; Yang, J.; Murino, V.; Zhang, B.; Han, J.; Guo, G. Aggregation signature for small object tracking. IEEE Trans. Image Process. 2019, 29, 1738–1747.
  15. Rincon, D.R.; van Henten, E.J.; Kootstra, G. Development and evaluation of automated localization and reconstruction of all fruits on tomato plants in a greenhouse based on multi-view perception and 3D multi-object tracking. arXiv 2022, arXiv:2211.02760.
  16. Zhao, Q.; Tao, H. Object tracking using color correlogram. In Proceedings of the 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, Beijing, China, 15–16 October 2005; IEEE: New York, NY, USA, 2005; pp. 263–270.
  17. Zhao, D.; Fu, H.; Xiao, L.; Wu, T.; Dai, B. Multi-object tracking with correlation filter for autonomous vehicle. Sensors 2018, 18, 2004.
  18. Ristani, E.; Solera, F.; Zou, R.; Cucchiara, R.; Tomasi, C. Performance measures and a data set for multi-target, multi-camera tracking. In Proceedings of the 14th European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 17–35.
  19. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; IEEE: New York, NY, USA, 2012; pp. 3354–3361.
  20. Castro, W.; Oblitas, J.; De-La-Torre, M.; Cotrina, C.; Bazán, K.; Avila-George, H. Classification of cape gooseberry fruit according to its level of ripeness using machine learning techniques and different color spaces. IEEE Access 2019, 7, 27389–27400.
  21. El-Bendary, N.; El-Hariri, E.; Hassanien, A.E.; Badr, A. Using machine learning techniques for evaluating tomato ripeness. Expert Syst. Appl. 2015, 42, 1892–1905.
  22. Martins, R.N.; de Carvalho Pinto, F.A.; de Queiroz, D.M.; Valente, D.S.M.; Rosas, J.T.F. A novel vegetation index for coffee ripeness monitoring using aerial imagery. Remote Sens. 2021, 13, 263.
  23. Rouse, J.W.; Haas, R.H.; Schell, J.A.; Deering, D.W. Monitoring vegetation systems in the great plains with ERTS. In Proceedings of the Third Earth Resources Technology Satellite-1 Symposium, Greenbelt, MD, USA, 10–14 December 1973; NASA: Washington, DC, USA, 1974; pp. 309–317.
  24. Fitzgerald, G.; Rodriguez, D.; Christensen, L.; Belford, R.; Sadras, V.; Clarke, T. Spectral and thermal sensing for nitrogen and water status in rainfed and irrigated wheat environments. Precis. Agric. 2006, 7, 233–248.
  25. Nandi, C.S.; Tudu, B.; Koley, C. A machine vision-based maturity prediction system for sorting of harvested mangoes. IEEE Trans. Instrum. Meas. 2014, 63, 1722–1730.
  26. Tan, K.; Lee, W.S.; Gan, H.; Wang, S. Recognising blueberry fruit of different maturity using histogram oriented gradients and colour features in outdoor scenes. Biosyst. Eng. 2018, 176, 59–72.
  27. Waseem, M.; Huang, C.H.; Sajjad, M.M.; Naqvi, L.H.; Majeed, Y.; Rehman, T.U.; Nadeem, T. Automated tomato maturity estimation using an optimized residual model with pruning and quantization techniques. arXiv 2025, arXiv:2503.10940.
  28. Zhou, K.; Yang, Y.; Cavallaro, A.; Xiang, T. Omni-scale feature learning for person re-identification. arXiv 2019, arXiv:1905.00953.
  29. Du, Y.; Zhao, Z.; Song, Y.; Zhao, Y.; Su, F.; Gong, T.; Meng, H. Strongsort: Make deepsort great again. arXiv 2023, arXiv:2202.13514.
  30. Wojke, N.; Bewley, A.; Paulus, D. Simple online and realtime tracking with a deep association metric. arXiv 2017, arXiv:1703.07402.
  31. Sari, Y.A.; Adinugroho, S. Tomato ripeness clustering using 6-means algorithm based on v-channel otsu segmentation. In Proceedings of the 2017 International Symposium on Computational and Business Intelligence (ISCBI), Dubai, United Arab Emirates, 11–14 August 2017; IEEE: New York, NY, USA, 2017; pp. 32–36.
  32. Yang, P.; Song, W.; Zhao, X.; Zheng, R.; Qingge, L. An improved Otsu threshold segmentation algorithm. Int. J. Comput. Sci. Eng. 2020, 22, 146–153.
  33. Pai, C.A.; Lin, H.Y. Maturity and yield estimation of tomatoes using RGB and multispectral images. In Proceedings of the 2024 IEEE International Conference on Industrial Technology (ICIT 2024), Bristol, UK, 25–27 March 2024.
  34. Omelina, L.; Goga, J.; Pavlovicova, J.; Oravec, M.; Jansen, B. A survey of iris datasets. Image Vis. Comput. 2021, 108, 104109.
  35. Fu, Y.; Li, W.; Li, G.; Dong, Y.; Wang, S.; Zhang, Q.; Li, Y.; Dai, Z. Multi-stage tomato fruit recognition method based on improved YOLOv8. Front. Plant Sci. 2024, 15, 1447263.
  36. Liu, Y.; Han, X.; Zhang, H.; Liu, S.; Ma, W.; Yan, Y.; Sun, L.; Jing, L.; Wang, Y.; Wang, J. YOLOv8-MSP-PD: A lightweight YOLOv8-based detection method for Jinxiu Malus fruit in field conditions. Agronomy 2025, 15, 1581.
Figure 1. Flowchart of the proposed approach.
Figure 2. Examples of RGB and multispectral images. (a) RGB image, (b) near-infrared image, (c) red image, (d) red-edge image, (e) green image.
Figure 3. Examples of RGB and the corresponding calibrated and aligned multispectral images. (a) RGB image, (b) calibrated and aligned near-infrared image, (c) calibrated and aligned red image, (d) calibrated and aligned green image.
Figure 4. Contribution of each indicator to the datasets. (a) 19th_tomato, (b) lin_tomato, (c) 19th2_tomato.
Figure 5. Confusion matrix for three datasets using all classification indices: (a) 19th_tomato, (b) lin_tomato, (c) 19th2_tomato.
Figure 6. Final detection and classification results.
Table 1. Parameter settings of each classification algorithm.

| Algorithm | SVM | KNN | ANN |
|---|---|---|---|
| Kernel function | RBF | N/A | N/A |
| Parameter | C = 10∼40 | K = 2∼30 | epoch = 300, batch size = 40 |
| Step | 2 | 2 | N/A |
| Cross-validation | 10-fold | 10-fold | 10-fold |
| Classification index | R, G, B, NDVI, GNDVI, GRRI | R, G, B, NDVI, GNDVI, GRRI | R, G, B, NDVI, GNDVI, GRRI |
Table 2. Number of annotated tomatoes in each dataset.

| Dataset | Training Set | Testing Set |
|---|---|---|
| 19th_tomato | 4247 | 1775 |
| lin_tomato | 4554 | 2423 |
| 19th2_tomato | 8720 | 4338 |
Table 3. Classification results using different algorithms.

| Dataset | Algorithm | R, G, B | GNDVI, NDVI, GRRI | R, G, B, NDVI, GNDVI, GRRI |
|---|---|---|---|---|
| 19th_tomato | SVM | 63% | 70% | 63% |
| 19th_tomato | KNN | 64% | 66% | 64% |
| 19th_tomato | ANN | 62% | 73% | 73% |
| lin_tomato | SVM | 64% | 55% | 65% |
| lin_tomato | KNN | 58% | 58% | 58% |
| lin_tomato | ANN | 38% | 75% | 46% |
| 19th2_tomato | SVM | 72% | 70% | 78% |
| 19th2_tomato | KNN | 81% | 59% | 81% |
| 19th2_tomato | ANN | 72% | 58% | 73% |
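The three classification-index groups compared in Table 3 (and the R + B subset in Table 4) can be assembled from the RGB image and the calibrated near-infrared, red, and green bands shown in Figure 3. The sketch below uses the standard definitions NDVI = (NIR − Red)/(NIR + Red) and GNDVI = (NIR − Green)/(NIR + Green); GRRI is assumed here to be the green-to-red ratio Green/Red, which should be checked against the paper's own definition. It also assumes each input is a single value per tomato (for example, a mean over the detected fruit region); the function names and the small `eps` guard are hypothetical.

```python
def spectral_indices(nir, red, green, eps=1e-9):
    """Return (NDVI, GNDVI, GRRI) from calibrated band values.

    GRRI is assumed to be Green / Red; eps guards against division by zero.
    """
    ndvi = (nir - red) / (nir + red + eps)
    gndvi = (nir - green) / (nir + green + eps)
    grri = green / (red + eps)
    return ndvi, gndvi, grri


def feature_vector(r, g, b, nir, red, green, feature_set="all"):
    """Build one of the feature sets compared in Tables 3 and 4.

    r, g, b come from the RGB image; nir, red, green come from the
    calibrated and aligned multispectral bands.
    """
    ndvi, gndvi, grri = spectral_indices(nir, red, green)
    sets = {
        "rgb": (r, g, b),                            # R, G, B
        "rb": (r, b),                                # R + B
        "indices": (ndvi, gndvi, grri),              # N + G + G
        "all": (r, g, b, ndvi, gndvi, grri),         # R, G, B, NDVI, GNDVI, GRRI
    }
    return sets[feature_set]


# Hypothetical per-tomato values (normalized to [0, 1]).
features = feature_vector(r=0.8, g=0.3, b=0.2, nir=0.6, red=0.2, green=0.3)
print([round(v, 3) for v in features])
```

Keeping the feature sets in one lookup makes it straightforward to rerun the same classifier on each column of Tables 3 and 4.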
Table 4. Comparison of classification results using the RGB channels (R + G + B), the red and blue channels only (R + B), and the three spectral indices only (N + G + G = NDVI + GNDVI + GRRI).
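
| Dataset | Algorithm | R + G + B | R + B | N + G + G |
|---|---|---|---|---|
| 19th_tomato | SVM | 63% | 61% | 70% |
| 19th_tomato | KNN | 64% | 62% | 66% |
| 19th_tomato | ANN | 62% | 54% | 73% |
| lin_tomato | SVM | 64% | 44% | 55% |
| lin_tomato | KNN | 58% | 44% | 58% |
| lin_tomato | ANN | 38% | 50% | 75% |
| 19th2_tomato | SVM | 72% | 66% | 70% |
| 19th2_tomato | KNN | 81% | 64% | 59% |
| 19th2_tomato | ANN | 72% | 66% | 58% |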
Table 5. Classification performance metrics for the three tomato datasets.

| Dataset | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|
| 19th_tomato | 0.54 | 0.54 | 0.53 | 66.7% |
| lin_tomato | 0.80 | 0.83 | 0.80 | 79.8% |
| 19th2_tomato | 0.76 | 0.76 | 0.74 | 72.2% |
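Metrics like those in Table 5 can be derived from a confusion matrix. The sketch below assumes macro averaging (per-class precision, recall, and F1 averaged with equal weight per class); the paper does not state its averaging scheme here, so this is an assumption. Averaging per-class F1 scores is one way a reported F1 can fall below both the averaged precision and recall, as in the 19th2_tomato row. The function name and the example matrix are hypothetical.

```python
def macro_metrics(cm):
    """Macro-averaged precision, recall, F1, plus overall accuracy.

    cm[i][j] counts samples of true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    precisions, recalls, f1s = [], [], []
    for c in range(n):
        tp = cm[c][c]
        predicted = sum(cm[r][c] for r in range(n))  # column sum
        actual = sum(cm[c])                          # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    accuracy = sum(cm[i][i] for i in range(n)) / total
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n, accuracy


# Hypothetical 3-class confusion matrix (not from the paper).
cm = [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
precision, recall, f1, accuracy = macro_metrics(cm)
print(round(precision, 3), round(recall, 3), round(f1, 3), round(accuracy, 3))
```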
Table 6. Results of maturity-based fruit number estimation.
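
| Dataset | Ground Truth −3 | Ground Truth −2 | Ground Truth −1 | Ground Truth Total | Prediction −3 | Prediction −2 | Prediction −1 | Prediction Total | Total Number Difference | RMSE |
|---|---|---|---|---|---|---|---|---|---|---|
| *Once a week* | | | | | | | | | | |
| 19th_tomato(6th) | 0 | 16 | 19 | 35 | 6 | 11 | 17 | 34 | −1 | 4.65 |
| 19th_tomato(7th) | 0 | 3 | 11 | 14 | 1 | 4 | 7 | 12 | −2 | 2.45 |
| lin_tomato(4th) | 16 | 11 | 2 | 29 | 1 | 6 | 0 | 7 | −22 | 9.20 |
| lin_tomato(5th) | 4 | 6 | 9 | 19 | 2 | 4 | 4 | 10 | −9 | 3.32 |
| lin_tomato(6th) | 4 | 4 | 12 | 20 | 1 | 4 | 5 | 10 | −10 | 4.40 |
| lin_tomato(7th) | 10 | 0 | 9 | 18 | 1 | 1 | 2 | 4 | −14 | 6.61 |
| lin_tomato(8th) | 2 | 4 | 1 | 7 | 3 | 2 | 1 | 6 | −1 | 1.29 |
| 19th2_tomato(5th) | 19 | 17 | 1 | 37 | 19 | 7 | 1 | 27 | −10 | 5.77 |
| 19th2_tomato(7th) | 4 | 14 | 14 | 32 | 5 | 10 | 13 | 28 | −4 | 2.45 |
| 19th2_tomato(9th) | 0 | 1 | 10 | 11 | 1 | 1 | 10 | 12 | 1 | 0.58 |
| *Twice a week* | | | | | | | | | | |
| 19th2_tomato(5th) | 32 | 4 | 1 | 37 | 22 | 3 | 2 | 27 | −10 | 5.83 |
| 19th2_tomato(6th) | 22 | 8 | 6 | 36 | 8 | 14 | 12 | 34 | −2 | 9.45 |
| 19th2_tomato(7th) | 14 | 10 | 8 | 32 | 9 | 7 | 12 | 28 | −4 | 4.08 |
| 19th2_tomato(8th) | 4 | 10 | 29 | 43 | 2 | 7 | 26 | 35 | −8 | 2.71 |
| 19th2_tomato(9th) | 0 | 1 | 10 | 11 | 0 | 2 | 10 | 12 | 1 | 0.58 |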
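The error columns of Table 6 can be reproduced from the per-level counts. The reported values are consistent with the total number difference being the predicted total minus the ground-truth total, and with the RMSE being the root-mean-square of the per-level count differences over the three maturity levels (−3, −2, −1). The sketch below implements that reading; the function name `count_errors` is hypothetical, and the example inputs are two rows copied from Table 6.

```python
import math


def count_errors(ground_truth, prediction):
    """Return (total number difference, RMSE) for per-maturity-level fruit counts.

    Both arguments list counts in the same level order, e.g. [-3, -2, -1].
    """
    diffs = [p - g for g, p in zip(ground_truth, prediction)]
    total_difference = sum(prediction) - sum(ground_truth)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return total_difference, rmse


# 19th_tomato(6th), once a week: ground truth [0, 16, 19], prediction [6, 11, 17].
diff, rmse = count_errors([0, 16, 19], [6, 11, 17])
print(diff, round(rmse, 2))  # -1 4.65

# 19th2_tomato(8th), twice a week: ground truth [4, 10, 29], prediction [2, 7, 26].
print(*(round(v, 2) for v in count_errors([4, 10, 29], [2, 7, 26])))  # -8 2.71
```

Because the RMSE is computed per maturity level, a row can have a small total difference but a large RMSE when errors in different levels cancel out, as in 19th2_tomato(6th) under the twice-a-week schedule.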
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Lin, H.-Y.; Pai, C.-A.; Chang, C.-C. Tomato Maturity Classification and Fruit Counting Based on RGB and Multispectral Images. Appl. Sci. 2026, 16, 3227. https://doi.org/10.3390/app16073227

AMA Style

Lin H-Y, Pai C-A, Chang C-C. Tomato Maturity Classification and Fruit Counting Based on RGB and Multispectral Images. Applied Sciences. 2026; 16(7):3227. https://doi.org/10.3390/app16073227

Chicago/Turabian Style

Lin, Huei-Yung, Chu-An Pai, and Chin-Chen Chang. 2026. "Tomato Maturity Classification and Fruit Counting Based on RGB and Multispectral Images" Applied Sciences 16, no. 7: 3227. https://doi.org/10.3390/app16073227

APA Style

Lin, H.-Y., Pai, C.-A., & Chang, C.-C. (2026). Tomato Maturity Classification and Fruit Counting Based on RGB and Multispectral Images. Applied Sciences, 16(7), 3227. https://doi.org/10.3390/app16073227
